US20200152288A1 - System and method for predicting effect of genomic variations on pre-mRNA splicing
- Publication number
- US20200152288A1 (application US16/504,184)
- Authority
- US
- United States
- Prior art keywords
- branchpoint
- splice acceptor
- natural
- acceptor site
- candidate variant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Definitions
- The disclosure herein generally relates to mRNA splicing and, more particularly, to predicting the effect of genomic variations on pre-mRNA splicing.
- RNA splicing is the process of cutting introns out of pre-mRNA and stitching the exons together to form the final mRNA sequence that codes for proteins.
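As a rough illustration of the cut-and-join operation described above, the following minimal Python sketch removes an intron from a toy pre-mRNA. The sequence, the exon coordinates, and the `splice` helper are hypothetical and are not part of the disclosure.

```python
from typing import List, Tuple


def splice(pre_mrna: str, exons: List[Tuple[int, int]]) -> str:
    """Join exon segments of a pre-mRNA, dropping the introns between them.

    `exons` holds 0-based, half-open (start, end) intervals.
    """
    ordered = sorted(exons)
    # Reject overlapping exons, which would duplicate sequence in the output.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("exon intervals overlap")
    return "".join(pre_mrna[start:end] for start, end in ordered)


# Toy pre-mRNA: exon 1, an intron that starts with GU and ends with AG, exon 2.
pre_mrna = "AUGGCC" + "GUAAGUCUAACUUUCCAG" + "GGAUAA"
mature = splice(pre_mrna, [(0, 6), (24, 30)])
print(mature)
```

The intron boundaries in the toy sequence carry the usual GU and AG dinucleotides, which is where the splice donor and splice acceptor sites discussed later sit.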
- Branchpoint (BP) selection and splice site (SS) selection are key steps in RNA splicing, yet many popular splicing analysis tools do not model this mechanism. If there is a mutation in proximity to an intron's primary branchpoint, that branchpoint may become unusable.
- A processor-implemented method for predicting the effect of genomic variations on pre-mRNA splicing includes receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. The method further includes classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region, based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. The method further includes evaluating the effect of the at least one candidate variant on pre-mRNA splicing, based on the region assigned by the classification of the at least one candidate variant.
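The classification step can be sketched as a coordinate comparison. The text above does not state the window sizes, so every constant below is an assumption: the acceptor window mirrors the 23-nt input (20 intronic plus 3 exonic bases) of the MaxEntScan 3′ splice site model, and the branch window of 21 to 44 nt upstream is only an illustrative choice. The `Exon` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical window sizes (not taken from the disclosure).
ACCEPTOR_INTRON_NT = 20   # intronic bases covered by a 23-mer acceptor window
ACCEPTOR_EXON_NT = 3      # exonic bases covered by the same window
BRANCH_MIN_UPSTREAM = 21  # assumed nearest branch-site offset from the 3'SS
BRANCH_MAX_UPSTREAM = 44  # assumed farthest branch-site offset from the 3'SS


@dataclass(frozen=True)
class Exon:
    start: int   # 1-based genomic coordinate of the lowest exonic base
    end: int     # 1-based genomic coordinate of the highest exonic base
    strand: str  # "+" or "-"


def offset_from_acceptor(pos: int, exon: Exon) -> int:
    """Signed distance from the first exonic base after the 3' splice site.

    Negative values are intronic (upstream in transcript orientation);
    zero and positive values are exonic.
    """
    if exon.strand == "+":
        return pos - exon.start
    return exon.end - pos


def classify_variant(pos: int, exon: Exon) -> Optional[str]:
    """Assign a variant position to a region, or None if it lies outside both."""
    off = offset_from_acceptor(pos, exon)
    if -ACCEPTOR_INTRON_NT <= off < ACCEPTOR_EXON_NT:
        return "splice_acceptor_site_region"
    if -BRANCH_MAX_UPSTREAM <= off <= -BRANCH_MIN_UPSTREAM:
        return "branch_site_region"
    return None


exon = Exon(start=1000, end=1100, strand="+")
for pos in (998, 970, 900):
    print(pos, classify_variant(pos, exon))
```

Keeping the two windows disjoint means each variant falls into at most one region, which matches the "one of" wording of the classification step.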
- Evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score; determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and evaluating, in response to determining that the new splice acceptor site region is being created, the strength of an identified natural branchpoint in the classified region using a PWM evaluator. The method further includes predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
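To make the idea of a PWM evaluator concrete, here is a minimal sketch of a log-odds position weight matrix built from aligned motifs and used to compare a reference branch-site sequence with a variant one. The training motifs, pseudocount, uniform background, and function names are all hypothetical; the disclosure's actual matrix is not given in this text, and the MaxEnt scoring step is not reproduced here.

```python
import math
from typing import Dict, List

BASES = "ACGT"

# Made-up heptamers standing in for aligned branch-site motifs.
TOY_MOTIFS = ["TACTAAC", "CACTGAC", "TTCTAAC", "TGCTAAT", "CTCTAAC"]


def build_pwm(aligned_motifs: List[str], pseudocount: float = 1.0,
              background: float = 0.25) -> List[Dict[str, float]]:
    """Per-position log2-odds weights estimated from equal-length motifs."""
    length = len(aligned_motifs[0])
    if any(len(m) != length for m in aligned_motifs):
        raise ValueError("motifs must share one length")
    pwm = []
    for i in range(length):
        column = [m[i] for m in aligned_motifs]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def pwm_score(pwm: List[Dict[str, float]], seq: str) -> float:
    """Sum the per-position weights of `seq`; higher means a closer match."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal PWM length")
    return sum(column[base] for column, base in zip(pwm, seq))


pwm = build_pwm(TOY_MOTIFS)
reference = pwm_score(pwm, "TACTAAC")
variant = pwm_score(pwm, "TACTAGC")  # branchpoint A replaced by G
print(f"score change caused by the variant: {variant - reference:.3f}")
```

A negative score change indicates that the variant sequence matches the matrix less well than the reference, which is one simple way to read "strength" of a branchpoint in this kind of evaluator.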
- A system for predicting the effect of genomic variations on pre-mRNA splicing includes a memory storing instructions and one or more hardware processors coupled to the memory, wherein the one or more hardware processors are configured by the instructions to receive genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. The processors are further configured to classify the at least one candidate variant into one of a splice acceptor site region and a branch site region, based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant.
- In the system, evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score; determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and evaluating, in response to determining that the new splice acceptor site region is being created, the strength of an identified natural branchpoint in the classified region using a PWM evaluator. The processors are further configured to predict pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on pre-mRNA splicing.
- One or more non-transitory machine-readable information storage mediums comprise one or more instructions which, when executed by one or more hardware processors, cause receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. The instructions further cause classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region, based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. The instructions further cause evaluating the effect of the at least one candidate variant on pre-mRNA splicing, based on the region assigned by the classification of the at least one candidate variant.
- For the storage mediums, evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score; determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and evaluating, in response to determining that the new splice acceptor site region is being created, the strength of an identified natural branchpoint in the classified region using a PWM evaluator. The instructions further cause predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
- Any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter.
- Any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a computer-readable medium and so executed by a computing device or processor, whether or not such computing device or processor is explicitly shown.
- FIG. 1 illustrates a network environment implementing a system 102 for predicting the effect of genomic variations on pre-mRNA splicing, in accordance with an embodiment of the present disclosure.
- FIG. 2 is a flow diagram illustrating a method for predicting the effect of genomic variations on pre-mRNA splicing, according to an embodiment of the present disclosure.
- FIGS. 3A, 3B and 3C illustrate an analysis pipeline for predicting the effect of genomic variations on pre-mRNA splicing, in accordance with an embodiment of the present disclosure.
- FIG. 4 illustrates a block diagram of a system for predicting the effect of genomic variations on pre-mRNA splicing, in accordance with some embodiments of the present disclosure.
- Splicing forms a crucial part of the pre-mRNA maturation process, as accurate excision of introns and joining of exons are essential to eukaryotic gene expression.
- Parts of the pre-mRNA are removed by the spliceosome within the nucleus before the mature mRNA is transported to the cytoplasm for translation.
- Pre-mRNA can be spliced in different ways, leading to alternative transcripts, i.e., expression of different proteins from the same gene. More than 70% of protein-coding human genes are alternatively spliced, and alternative splicing has been proposed to be the major cause of the evolution of phenotypic complexity in mammals.
- Exon skipping is the most common outcome of splicing mutations, followed by activation of cryptic 5′ and 3′ splice sites (5′SS and 3′SS). Exon skipping is due to disruption of the natural splice acceptor site, or to abolishment of the natural branchpoint with no alternative branchpoint available to facilitate splicing. Efficient splicing requires at least three major signals within introns: the 5′ splice site, the 3′ splice site and the branchpoint sequence. Auxiliary sequences in introns and exons, known as splicing enhancers and silencers, act in conjunction to decide whether splicing is constitutive or alternative. The 5′ end of the intron is known as the splice donor site, and the 3′ end of the intron is referred to as the splice acceptor site.
- Divergence from the prototype sequences is associated with alternative transcript generation. Occurrence of such consensus sequences within the introns is quite common in higher eukaryotes, framing pseudoexons; this indicates the presence of splice boundaries that are nevertheless insufficient for regulating correct splicing.
- The 3′ end is characterized by the presence of the splice acceptor site, the branchpoint sequence upstream of it, and the polypyrimidine tract immediately following the branchpoint sequence.
- Branchpoints are defined on the basis of four major criteria: they are proximal to the 3′ splice end of the intron; the branchpoint sequence is followed by a polypyrimidine tract; the ‘AG’ dinucleotide is depleted between the branchpoint sequence and the 3′ splice site; and the branchpoint is mostly an adenine. The selection and accurate prediction of branchpoint variants and splice site variants from candidate variants in existing databases of known human gene transcripts is therefore of prime importance, and challenging.
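The four criteria above can be turned into a simple sequence filter. The sketch below is an illustration only: the distance window (18 to 44 nt), the pyrimidine-fraction threshold (0.6), and the function name are assumptions, and the toy intron end is invented. It is not the disclosure's branchpoint detection procedure.

```python
from typing import List


def candidate_branchpoints(intron_3prime: str,
                           min_dist: int = 18,
                           max_dist: int = 44,
                           min_pyrimidine_fraction: float = 0.6) -> List[int]:
    """Return distances (nt upstream of the exon) of adenines passing all filters.

    `intron_3prime` is the 3' end of an intron in DNA letters and must end
    with the acceptor AG dinucleotide. Distance 1 is the last intronic base.
    """
    seq = intron_3prime.upper()
    if not seq.endswith("AG"):
        raise ValueError("sequence must end with the acceptor AG")
    n = len(seq)
    hits = []
    # Criterion 1: stay close to the 3' splice site.
    for dist in range(min_dist, min(max_dist, n) + 1):
        idx = n - dist
        # Criterion 4: the branchpoint nucleotide is an adenine.
        if seq[idx] != "A":
            continue
        between = seq[idx + 1:n - 2]  # bases between candidate and acceptor AG
        # Criterion 3: no AG dinucleotide before the acceptor AG.
        if "AG" in between:
            continue
        # Criterion 2: downstream stretch is pyrimidine-rich.
        pyrimidines = sum(base in "CT" for base in between)
        if between and pyrimidines / len(between) >= min_pyrimidine_fraction:
            hits.append(dist)
    return hits


toy_intron_end = "GGTTCTAAC" + "TTTTCCTTTTCTCTTTCC" + "AG"
print(candidate_branchpoints(toy_intron_end))
```

Several adjacent adenines can pass such a filter, so a scoring step (for example a PWM) would still be needed to rank the candidates.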
- Various embodiments of the present disclosure provided method and system for predicting the effect of genomic variations on pre-mRNA splicing based on MaxEnt tool and a Position Weight Matrix (PWM) evaluator with high accuracy utilized on resource constrained environment.
- the disclosed system includes a variant pipeline which works in real-time in a resource constrained environment or near real-time on CPU.
- the disclosed system and method provides a solution in predicting effect of genomic variations on pre-mRNA splicing. A detailed description of the above described system and method for predicting the effect of genomic variations on pre-mRNA splicing is shown with respect to illustrations represented with reference to FIGS. 1 through 4 .
- Referring to FIGS. 1 through 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and method for predicting the effect of genomic variations on pre-mRNA splicing.
- the system 102 may receive inputs, for example, inputs via multiple devices and/or machines 104 - 1 , 104 - 2 . . . 104 -N, collectively referred to as devices 104 hereinafter.
- the devices 104 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, VR camera embodying devices, storage devices equipped to receive and store inputs and outputs.
- the devices 104 may include devices capable of capturing and storing data.
- the devices 104 are communicatively coupled to the system 102 through a network 106 , and may be capable of transmitting the data to the system 102 .
- the network 106 may be a wireless network, a wired network or a combination thereof.
- the network 106 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like.
- the network 106 may either be a dedicated network or a shared network.
- the shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another.
- the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
- the devices 104 may send input to the system 102 via the network 106 .
- the system 102 is caused to predict effect of genomic variations on pre-mRNA splicing.
- the system 102 may be embodied in a computing device 110 .
- Examples of the computing device 110 may include, but are not limited to, a desktop personal computer (PC), a notebook, a laptop, a portable computer, a smart phone, a tablet, and the like.
- the system 102 may also be associated with a data repository 112 to store inputs, dataset and output/resultant. Additionally or alternatively, the data repository 112 may be configured to store data and/or information generated during predicting effect of genomic variations on pre-mRNA splicing.
- the repository 112 may be configured outside and communicably coupled to the computing device 110 embodying the system 102 . Alternatively, the data repository 112 may be configured within the system 102 .
- the disclosed system 102 enables predicting effect of genomic variations on pre-mRNA splicing, thereby resulting in high accuracy of predicting pathogenicity and determining branchpoint variants and their pathogenicity based on the availability of alternative branchpoint which could rescue normal splicing.
- An example representation of pipeline of the method for predicting effect of genomic variations on pre-mRNA splicing is shown and described further with reference to FIG. 3A-3C .
- the method 200 may be described in the general context of computer executable instructions.
- computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types.
- the method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network.
- the order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200 , or an alternative method.
- the method 200 can be implemented in any suitable hardware, software, firmware, or combination thereof.
- the method 200 depicted in the flow chart may be executed by a system, for example, the system 102 of FIG. 1 .
- the system 102 may be embodied in an exemplary computer system, for example computer system 102 .
- the method 200 of FIG. 2 will be explained in more detail below with reference to FIGS. 3A-3C .
- the method 200 is initiated at 202, where genomic position information, a reference allele and an alternate allele corresponding to a specified version of the human genome are received.
- at least one candidate transcript that fully contains the variant (herein referred to as the at least one candidate variant) is obtained from existing databases of known human gene transcripts.
- Each transcript is represented as a set of one or more non-overlapping intervals, where each interval is represented by four features that include the chromosome on which the transcript is present, the starting genomic coordinate of the interval, the ending genomic coordinate of the interval, and the strand on which the transcript is present (forward or reverse).
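- The interval representation described above can be sketched as a small data structure. This is only an illustrative sketch: the class names `Interval` and `Transcript`, the method `fully_contains`, the 1-based inclusive coordinate convention, and the choice to test containment against the overall transcript span (so that intronic variants count as contained) are all assumptions, not details given in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    chrom: str   # chromosome on which the transcript is present
    start: int   # starting genomic coordinate (assumed 1-based, inclusive)
    end: int     # ending genomic coordinate (assumed inclusive)
    strand: str  # "+" (forward) or "-" (reverse)


@dataclass
class Transcript:
    name: str
    intervals: List[Interval]  # non-overlapping intervals of the transcript

    def span(self) -> Tuple[int, int]:
        """Smallest and largest coordinate covered by any interval."""
        return (min(i.start for i in self.intervals),
                max(i.end for i in self.intervals))

    def fully_contains(self, chrom: str, var_start: int, var_end: int) -> bool:
        """True if the variant lies entirely inside the transcript span.

        Assumption: containment is checked against the whole span, so a
        variant located between two intervals still counts as contained.
        """
        lo, hi = self.span()
        return (chrom == self.intervals[0].chrom
                and lo <= var_start and var_end <= hi)
```

A transcript with intervals 100-200 and 300-400 on `chr1` would, under these assumptions, contain a single-nucleotide variant at position 250 but not a variant spanning 395-405.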
- the at least one candidate variant is classified as occurring in one of a splice acceptor site region and a branch site region, based on the coordinate information of the gene transcripts and the genomic position information of the at least one candidate variant. The variant is classified as occurring in the splice acceptor site region when its genomic coordinate lies between 15 nucleotides upstream and 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts, and as occurring in the branch site region when its genomic coordinate lies between 50 nucleotides and 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts.
- the terms "nucleotide" and "nt" are used interchangeably.
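- The region classification can be illustrated with a short sketch. The function names, the signed-offset convention (negative values are upstream of the junction in transcript orientation, positive values downstream), and the decision to assign the shared boundary at 15 nucleotides upstream to the splice acceptor site region are assumptions made for illustration; the disclosure only gives the two ranges.

```python
from typing import Optional


def signed_offset(pos: int, junction: int, strand: str) -> int:
    """Offset of a variant relative to the acceptor junction.

    Negative = upstream (intronic side), positive = downstream (exonic side),
    measured in the orientation of the transcript.
    """
    if strand == "+":
        return pos - junction
    if strand == "-":
        return junction - pos
    raise ValueError("strand must be '+' or '-'")


def classify_region(offset: int) -> Optional[str]:
    """Map a signed offset to one of the two regions, or None if outside both."""
    if -15 <= offset <= 3:       # 15 nt upstream to 3 nt downstream
        return "splice_acceptor_site"
    if -50 <= offset < -15:      # 50 nt to 15 nt upstream
        return "branch_site"
    return None
```

For example, a variant 30 nt upstream of the junction falls in the branch site region on either strand, while a variant 10 nt into the exon falls in neither region.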
- effect of the at least one candidate variant on pre-mRNA splicing is evaluated based on a classified region from the classification of the at least one candidate variant.
- the evaluation is performed by identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score, and then determining whether a new splice acceptor site region is being created due to the weakened natural splice acceptor site. Thereafter, in response to determining that the new splice acceptor site region is being created, the strength of an identified natural branchpoint in the classified region is evaluated using a Position Weight Matrix (PWM) evaluator.
- the MaxEnt is a known splice site strength determination tool for calculating strength or weakening of the splice acceptor site, wherein the MaxEnt tool assigns a MaxEnt score based on the effect of the at least one candidate variant on affected natural splice acceptor site region.
- the available MaxEnt Scan tool is used to calculate the splice acceptor site scores for both the canonical splice sites, i.e. the naturally occurring splice sites (natural splice acceptor site region), and the cryptic splice sites, i.e. splice sites activated by a mutation.
- the PWM evaluator is generated using experimentally determined human branch sites.
- the PWM is generated using a set of 59,359 experimentally determined human branch sites (10-mers), identified based on exoribonuclease digestion and RNA-seq.
- a set of branch point sites is utilized by selecting only the sequences with ‘A’ at the branchpoint as the training set for the Position weight matrix (PWM).
- ‘A’ is chosen as the branchpoint since sequences with ‘C’, ‘T’ or ‘G’ as the branchpoint have very low median scores, while the known ‘A’ has the highest value. This suggests that the PWM generated in accordance with the present embodiments is selective towards ‘A’ as a branchpoint, so it is appropriate to restrict PWM scoring to ‘A’. The PWM was therefore built using the known ‘A’ as the branchpoint.
- a PWM of dimension (m*n) is created by aligning the 59,359 experimentally determined human branch sites (10-mers) with ‘A’ as the branchpoint. In the present embodiment, a matrix of (10*4) is created. The alignment is used to calculate the frequency of each nucleotide at each position of the 10-mers, and the frequencies are thereafter converted to log odds scores.
- 175,031 unique introns from 18,171 canonical transcripts in the Gencode database v19 are identified and extracted, with the filtering criterion that each intron is surrounded by coding exons on both sides.
- the frequency of each nucleotide (A, T, C, G) across all the introns is used to normalize the raw frequencies of the bases in the training set of branch points. As described above, the normalized frequencies are converted to log odds scores to generate the final PWM.
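- The PWM construction described above (per-position frequencies, normalization by intronic background frequencies, conversion to log odds) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the pseudocount, the use of base-2 logarithms, and scoring a 10-mer as the sum of its per-position log odds are assumptions following common PWM practice.

```python
import math
from collections import Counter
from typing import Dict, List

BASES = "ACGT"


def build_pwm(sites: List[str], background: Dict[str, float],
              pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build a log-odds PWM from aligned, equal-length branch site sequences.

    `background` holds the frequency of each nucleotide across all introns.
    The pseudocount (an assumption) avoids log(0) for unseen bases.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        row = {}
        for base in BASES:
            freq = (counts[base] + pseudocount) / total
            row[base] = math.log2(freq / background[base])  # log odds
        pwm.append(row)
    return pwm


def score(pwm: List[Dict[str, float]], kmer: str) -> float:
    """Score a k-mer as the sum of its per-position log-odds values."""
    return sum(row[base] for row, base in zip(pwm, kmer))
```

With 10-mer training sites, `build_pwm` yields ten rows of four values, matching the (10*4) matrix described above.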
- the first quartile of the distribution is calculated and is used as a threshold for classifying a site to be a high confidence branch site. In an example embodiment, the determined threshold is 1.46.
- a 40 mer intronic sequence, 10 to 50 bases upstream from the 3′ end of each intron is extracted from the human genome and scanned for 10 mer sequences scoring above the branchpoint threshold.
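- The thresholding and scanning steps can be sketched as below. The function names and the `score_fn` callback are hypothetical; the sketch follows the "first quartile" wording for the threshold and takes the 40-mer as the bases lying 10 to 50 positions upstream of the 3′ end of the intron. It assumes the intron sequence is given in transcript orientation and that every 10-mer must lie entirely inside the 40-mer window.

```python
from statistics import quantiles
from typing import Callable, List, Tuple


def first_quartile_threshold(training_scores: List[float]) -> float:
    """First quartile of the training score distribution."""
    return quantiles(training_scores, n=4)[0]


def scan_window(intron_seq: str, score_fn: Callable[[str], float],
                threshold: float, near: int = 10, far: int = 50,
                k: int = 10) -> List[Tuple[int, str, float]]:
    """Return (window_index, k-mer, score) for k-mers scoring above threshold.

    The window is the stretch between `far` and `near` bases upstream of the
    3' end of the intron (a 40-mer for the default values).
    """
    window = intron_seq[-far:-near]
    hits = []
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        s = score_fn(kmer)
        if s > threshold:
            hits.append((i, kmer, s))
    return hits
```

In practice `score_fn` would be the PWM scoring function; any callable that maps a 10-mer to a number can be substituted for experimentation.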
- pathogenicity of the at least one candidate variant is predicted based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing. The evaluation and the prediction of pathogenicity of the at least one candidate variant are described in further detail with reference to FIGS. 3A-3C.
- FIGS. 3A-3C illustrate the analysis pipeline for the method of predicting pathogenicity based on the effect on pre-mRNA splicing.
- the analysis pipeline is designed to categorize a variant as pathogenic or non-pathogenic.
- the analysis approach in accordance with the present embodiments follows a step by step pipeline represented by FIGS. 3A-3C .
- variants in close proximity to the canonical splice acceptor region, that is, up to 15 nucleotides upstream of it, are screened for creation of a new cryptic acceptor site or a new branch site. If a branch site is created, a scan for a suitable downstream splice acceptor site is initiated.
- a suitable upstream branch site is scanned for using the PWM evaluator. If the variant disrupts the canonical splice acceptor and the canonical branch site is unaffected, screening for a suitable alternative downstream splice acceptor is performed. If a new canonical splice acceptor is predicted downstream of the canonical splice acceptor site, screening for an experimentally proven branchpoint is performed using the PWM tool. The detailed step-by-step process of the pipeline is described in FIGS. 3A-3C.
- a variant 302, for example at least one candidate variant, is received, together with genomic position information, a reference allele and an alternate allele corresponding to a specified version of the human genome.
- at least one candidate transcript that fully contains the variant (herein referred to as the at least one candidate variant) is obtained from existing databases of known human gene transcripts.
- Each transcript is represented as a set of one or more non-overlapping intervals, where each interval is represented by four features that include the chromosome on which the transcript is present, the starting genomic coordinate of the interval, the ending genomic coordinate of the interval, and the strand on which the transcript is present (forward or reverse).
- the at least one candidate variant is classified as occurring in splice affecting region based on genomic coordinate.
- a region with genomic coordinates between 15 nucleotides upstream and 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts is classified as the splice acceptor site.
- weakening of the natural splice acceptor site is determined and, in other words, it is determined that the classified at least one candidate variant is affecting the natural splice acceptor site (natural 3′SS).
- the classified at least one candidate variant is checked for creation of a new ‘AG’, that is, a new 3′SS, and the resulting weakening of the natural 3′SS is determined using the MaxEnt score.
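- The check for a newly created ‘AG’ dinucleotide can be sketched for the simplest case. The function name `creates_new_ag` is hypothetical, and the sketch assumes a single-nucleotide substitution on a sequence given in transcript orientation with a 0-based index; it does not compute a MaxEnt score, which requires the MaxEnt models themselves.

```python
def creates_new_ag(ref_seq: str, index: int, alt_base: str) -> bool:
    """True if substituting `alt_base` at `index` creates an 'AG' absent in the reference.

    Only the two dinucleotides overlapping the substituted position can change,
    so only those are compared between reference and alternate sequence.
    """
    alt_seq = ref_seq[:index] + alt_base + ref_seq[index + 1:]

    def ag_starts(seq: str) -> set:
        return {i for i in (index - 1, index)
                if i >= 0 and seq[i:i + 2] == "AG"}

    return bool(ag_starts(alt_seq) - ag_starts(ref_seq))
```

A variant that turns `CCACCC` into `CCAGCC` is flagged, whereas one that removes an existing ‘AG’ is not.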
- the at least one candidate variant is checked to determine whether the natural branchpoint suffices; otherwise the flow branches out to block C.
- the presence or absence of a natural branchpoint in the sequence range of 15 to 50 nucleotides of the new splice acceptor site region that is active during pre-mRNA splicing is determined.
- thereafter, the strength of the natural branchpoint is evaluated using the PWM evaluator and the at least one candidate variant is identified as pathogenic (312) based on the evaluated strength of the natural branchpoint; or an alternative branchpoint is screened for using the PWM evaluator and the at least one candidate variant is predicted as pathogenic based on the evaluated strength of the alternative branchpoint (314).
- status of the natural splice acceptor site region is determined. The status herein includes disrupted natural splice acceptor site region or non-disrupted natural splice acceptor site region.
- the at least one candidate variant is predicted as pathogenic or non-pathogenic ( 318 ) based on the determined status.
- the at least one candidate variant is classified as occurring in branch site region based on genomic coordinate.
- a region with genomic coordinates between 50 nucleotides and 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts is classified as the branch site.
- weakening of the natural splice acceptor site is determined and, in other words, it is determined that the classified at least one candidate variant is affecting the natural 3′SS.
- the classified at least one candidate variant is checked for creation of a new ‘AG’, that is, a new 3′SS, and the weakening of the natural 3′SS in response to the creation of the new 3′SS is determined using the MaxEnt score.
- the effect of the at least one candidate variant on the branch site for the new splice acceptor site being created is evaluated by determining presence of an alternative branchpoint in sequence range 50 nucleotides to 15 nucleotides upstream of the new splice acceptor site.
- the at least one candidate variant is categorized as pathogenic if no alternative branchpoint is determined; at 338, the at least one candidate variant is predicted as non-pathogenic if an alternative branchpoint is found.
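- The branch of the pipeline in which a new splice acceptor site is created can be sketched as a search for an alternative branchpoint followed by a two-way decision. The function names, the `score_fn` callback, the 0-based index convention, and the requirement that a 10-mer lie entirely within the 50 to 15 nucleotide window are assumptions for illustration only.

```python
from typing import Callable


def has_alternative_branchpoint(seq: str, acceptor_index: int,
                                score_fn: Callable[[str], float],
                                threshold: float, k: int = 10) -> bool:
    """Look for a k-mer scoring above threshold 50 to 15 nt upstream of the acceptor."""
    lo = max(0, acceptor_index - 50)
    hi = acceptor_index - 15
    for i in range(lo, hi - k + 1):
        if score_fn(seq[i:i + k]) > threshold:
            return True
    return False


def predict_with_new_acceptor(seq: str, new_acceptor_index: int,
                              score_fn: Callable[[str], float],
                              threshold: float) -> str:
    """Non-pathogenic if an alternative branchpoint is found, else pathogenic."""
    found = has_alternative_branchpoint(seq, new_acceptor_index,
                                        score_fn, threshold)
    return "non-pathogenic" if found else "pathogenic"
```

The same search helper could be reused for the other branches of the pipeline that screen a 50 to 15 nucleotide window upstream of a natural or alternative acceptor site.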
- the effect of the at least one candidate variant on the branch site for no new splice acceptor site being created is evaluated by screening for natural branchpoint in sequence range having 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site and determining level of strength of the branch site using the PWM evaluator at 332 .
- the level of strength is determined due to the at least one candidate variant affecting the screened natural branchpoint.
- the at least one candidate variant is predicted as pathogenic.
- the at least one candidate variant is predicted as pathogenic or non-pathogenic ( 338 ) based on an alternative branchpoint screened in sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site region.
- effect of the at least one candidate variant on the splice acceptor site region for no new splice acceptor site being created is evaluated by sequentially performing the steps at 340 , 342 and 344 .
- effect of the at least one candidate variant on the natural branchpoint is determined and level of strength of natural branch site using the PWM evaluator is identified based on the determined effect.
- an alternative splice acceptor site region is screened for in the sequence range spanning 50 nucleotides upstream and 50 nucleotides downstream of the at least one candidate variant, and the strength of the alternative splice acceptor site region is compared with that of the weakened natural splice acceptor site region.
- the at least one candidate variant is predicted as a non-pathogenic variant ( 348 ) or the at least one variant candidate is predicted as a pathogenic variant ( 350 ) or a non-pathogenic variant ( 364 ) based on a screened alternative branchpoint ( 360 ) in the sequence 50 nucleotide to 15 nucleotide upstream to the natural splice acceptor site region.
- the at least one candidate variant is predicted as non-pathogenic (348), or the presence of a natural branchpoint in the sequence range of 15 to 50 nucleotides of the splice acceptor site region that is active during mRNA splicing is further determined (352), and thereafter the strength of the natural branchpoint is compared with the predefined threshold. Based on the comparison, the at least one candidate variant is predicted as pathogenic (350).
- the at least one candidate variant is predicted as pathogenic (354) or non-pathogenic (356) based on an alternative branchpoint screened in the sequence range of 50 nucleotides to 15 nucleotides upstream of the alternative splice acceptor site (358). Further, based on the comparison of strength of the new branchpoint and the natural branchpoint, the at least one candidate variant is predicted as non-pathogenic (364). If not, the presence of a natural branchpoint in the range of 15 to 50 nucleotides upstream of the splice acceptor site region that is active during mRNA splicing is determined, and thereafter the strength of the natural branchpoint is compared with the predefined threshold (354). Based on the determined presence of the natural branchpoint and the comparison of its strength with the predefined threshold, the at least one candidate variant is predicted as pathogenic (362) or non-pathogenic (364).
- the focus of the present system and method is to identify a BP in an arbitrary given sequence and to evaluate the identified BP's role in the functional consequence of splicing of the intron. A further focus of the present embodiments is to predict the impact of the evaluated BP on pathogenicity using a combination of PWM and MaxEnt scores.
- There are many tools that can predict a branchpoint, but their main drawback is that they require far more input data, such as polypyrimidine tract information, the actual splice acceptor site and the distance to the splice acceptor site region, which prevents such tools from predicting a branchpoint in an arbitrary given sequence.
- the present system and method clearly distinguishes between the BP and SS and evaluates a variant based on the combined output from an individual component.
- In one example embodiment of the system and method for predicting the effect of genomic variations on pre-mRNA splicing, a recent set of 59,359 experimentally determined human branch sites (10-mers), identified based on exoribonuclease digestion and RNA-seq, is considered.
- the dataset offers a comprehensive dataset for training a high accuracy putative BPS prediction model (10).
- the present example utilizes this set of branch point sites, selecting only the sequences with ‘A’ at the branchpoint as the training set for the Position Weight Matrix (PWM) evaluator. This is because the goal is to create and evaluate a tool that can be used as part of a routine variant annotation scheme to provide high confidence annotations for further clinical interpretation.
- Several parameters were chosen to increase the accuracy of the analysis approach: the distance of the BPS (branch point sequence) from the 3′ splice end of the intron (-15 to -50 nucleotides, i.e. upstream), a requirement that the BPS is part of the intronic region in all transcripts, and a threshold set on the basis of the top 25% of scores in the PWM from the training set.
- Comparisons to outcomes of other existing prediction tools like HSF (Human Splicing Finder), SVM (Support Vector Machine), BP finder, outputs of machine learning prediction tools, along with experimentally proven BPS mutations have been performed to demonstrate the accuracy of our proposed model.
- a variant C>G in intron 9 was detected upon Clinvar based variant screening of Ornithine Carbamoyltransferase coding gene (OTC) as disrupting canonical splice acceptor site.
- Alternative splice acceptor site (MaxEnt: 8.30) was identified 25 bases downstream (in the exonic region) of the canonical splice acceptor junction.
- the canonical branch site (score: 2.80), i.e. 29 bases upstream of the identified cryptic splice acceptor, was deemed suitable.
- a T>C transition was found in intron 14 of Mannosidase Alpha Class 2B Member 1 gene (MAN2B1) disrupting the canonical splice acceptor site.
- a cryptic branch site is activated and also activation of a cryptic splice acceptor (MaxEnt: 4.78) 31 nt downstream to the canonical 3′ splice site occurs resulting in deletion of the first 31 nt of the exon 15, leading to a frame shift mutation causing pre-mature termination of the protein as a consequence of introduction of a stop codon (Table 1).
- an A>G mutation was found in intron 5.
- the variant is at the canonical splice acceptor site, it has been previously categorized as a splice site mutation, although the role of the variant and the specific effects on the splicing aberrations have not been defined.
- the canonical splice acceptor site of intron 5 was disrupted as a consequence of the variation (MaxEnt: 4.01 > -3.94). Due to the disruption of the natural splice acceptor site, a cryptic splice acceptor site (MaxEnt: 5.01) 28 nucleotides downstream of the canonical splice acceptor site was activated.
- a potential branch site i.e. 35 bases upstream to the cryptic splice acceptor site was found.
- the original splice acceptor site gets disrupted and a cryptic splice acceptor, along with a cryptic branch point gets activated downstream to the canonical splice site and canonical branch site (Table 2).
- the resulting protein is 392 a.a. long and loses 9 a.a., i.e. an entire β-strand in the core region, as a result of the SNP.
- the deleted protein region forms a part of the active site and the homodimer interface of the protein and is essential for pyridoxal 5′ phosphate binding. Therefore the deletion caused due to the SNP is highly deleterious as it causes protein dysfunctioning.
- a hypothesis can be drawn based on the occurrence of an alternative splice acceptor with a suitable branch site, leading to aberrant splicing. The pre-termination of the transcript due to the splicing disruption might be a cause to primary hyperoxaluria.
- a deleterious variant G>A disrupting the canonical splice acceptor site was found upon screening of the intron 49 of MYO15A gene.
- a cryptic branch site (score: 1.92) was activated at the canonical splice acceptor junction.
- a cryptic splice acceptor site suitable for the cryptic branch site was activated 27 nt downstream of the canonical splice acceptor (exonic region; MaxEnt: 7.13), with the potential to cause partial exon 50 skipping; alternatively, complete exon 50 skipping might occur as a result of using the stronger splice acceptor site of intron 50 (MaxEnt: 8.93) for splicing.
- the splicing aberration due to disruption of the canonical splice acceptor and the splicing consequences might be the cause behind non-syndromic genetic deafness.
- the resulting splicing aberrations do not lead to disruption of the frame of the protein but alter the protein region essential for peptide ligand binding with proline rich ligands like SH3 protein.
- SH3 domains in the protein are essential for intramolecular interactions leading to proper regulation of the enzymes and also in mediating multiprotein complex assemblies. Therefore, even though the frame of the protein is unaffected, essential active regions of the protein are altered leading to a truncated or non-functional protein.
- the analysis approach was successful in unveiling a hypothesis behind the effect of the intronic variant on splicing of intron 49 in MYO15A gene and the resulting pathogenicity.
- a splice acceptor variant (G>C) was identified upon screening of intron 8 of Growth Hormone Receptor.
- the variant, being at the splice acceptor site (AG>AC), disrupted the canonical splice acceptor (MaxEnt: 5.55 > -2.52), resulting in idiopathic short stature.
- Two different variant transcripts for GHR have been reported, one with complete skipping of exon 9 and the other with partial deletion of exon 9.
- the transcript with partial deletion of exon 9 was formed due to activation of a cryptic splice site downstream (24 nt) of the canonical splice acceptor.
- the occurrence of the splice variants has been reported but the cause behind their formation was not elucidated.
- the splice strength of the cryptic splice acceptor site i.e. in the exonic region
- the variant of interest disrupts the canonical splice acceptor site, leading to aberrant splicing, resulting in a non-functional protein due to premature termination of the protein.
- the variant has been associated with disruption of the canonical splice acceptor and exon 9 skipping indicating that the downstream cryptic splice acceptor was being unused for splicing.
- the GHR-(1-279) splice variant, i.e. the variant formed due to the activation of the cryptic splice acceptor site, is as highly expressed as the canonical transcript; therefore, upon disruption of the canonical splice acceptor, it is likely that the downstream cryptic splice acceptor would get activated instead of selecting the disrupted canonical splice acceptor site of the intron 10 leading to exon 9 skipping (Table 2).
- the protein product of GHR as a result of the variant loses 8 a.a from the part of the protein that forms part of the growth hormone binding protein (GHBP) after the cleavage from the GHR.
- NTRK1 (Neurotrophic Receptor Tyrosine Kinase 1)
- a putative branch site sequence 31 bases upstream to the splice acceptor site, was screened with a deleterious variant T>A.
- the branch site score was drastically reduced after the mutation, 5.70>3.17 (Table 3) and a cryptic splice acceptor site was activated.
- the resulting spliced product after mutation comprised an insertion of a 137 bp intronic segment, attributed to the usage of the upstream cryptic splice acceptor site. The T>A branch site mutation has therefore been proven to be a major cause of congenital insensitivity to pain with anhidrosis (CIPA), and the analysis approach was successful in determining the same.
- the PWM based approach identified a putative branch site containing a deleterious variant T>A in intron 11 of TH. It has been proven that the deleterious variant leads to alternative splicing, via skipping of exon 12, resulting in absence of 32 amino acids in the final protein product, making it non-functional or usage of cryptic branch site resulting in aberrant splicing or via partial intron retention (36 nucleotides in the mRNA) resulting in incorporation of 12 additional amino acids, rendering the protein non-functional.
- the branch site scores for the predicted branch site reduced significantly as a result of the variant (Table 3).
- disruption of branchpoint causing splicing aberration resulting in exon skipping were validated.
- a deleterious point mutation A>G was discovered in the branch site sequence ‘TCCCTGACAG’, i.e. 26 bases upstream of the splice acceptor site of intron 3.
- This intronic mutation A>G has been experimentally proven to result in skipping of exon 4 leading to McArdle disease (17). Based on amplified PCR products from the natural and the mutated samples, retention of exon 4 was concluded and the variant was classified to be a splice acceptor site mutation but the role of the branch site was not addressed.
- the theory of exon 4 skipping is hypothesized to be due to the disruption of the canonical branchpoint (4.43 to null), which is 26 bases upstream to the canonical splice acceptor (Table 4).
- the variant can be hypothesized to be a branch site mutation.
- the analysis approach was capable of determining and classifying an experimentally validated splice mutation as a branchpoint mutation.
- a deleterious variant in the putative branch site ‘TTTGTGATTC’, with the highest score of 3.40, was identified 23 bases upstream of the splice acceptor site in the sole intron of the Translocase Of Inner Mitochondrial Membrane 8 (TIMM8A) gene. TIMM8A/DDP1 gene dysfunction leads to Mohr-Tranebjaerg syndrome, or deafness/dystonia syndrome, and there has been evidence of various missense and nonsense mutations in the coding regions of the exons of TIMM8A. There has been a recent finding of an intronic variant A>C causing X-linked dystonia deafness.
- the intronic variant in TIMM8A has been proven to cause protein dysfunction possibly due to splicing aberrations.
- the cause behind the splicing aberrations has not been discussed in terms of the branchpoint disruption.
- from the branchpoint scores obtained from the prediction tool, it was evident that the splicing aberration was due to branchpoint disruption (Table 3).
- the analysis was able to classify a proven intronic variant as a branchpoint mutation on the basis of the change in branch site scores (3.40>null).
- the PWM based analysis approach is designed to screen for variants that are putative branch sites with ‘A’ as the branchpoint in any given sequence and determine the effect of a mutation in a branch site to the splicing of the intron.
- the PWM of the present embodiments is able to identify putative branch sites in proximity to the intronic end.
- the potential of the PWM is cross-checked with experimentally known branch sites identified by other tools and the outcome matched accurately.
- the case studies discussed in detail revealed successful identification of known branchpoint mutations and also led to reinterpretation of certain cases, indicating the cause behind speculated effects of splicing leading to a pathological condition.
- the basis for the examples discussed above is the PWM matrix generated in accordance with the present embodiments.
- the PWM is created using a dataset of branch site 10 mer sequences containing adenosine as the branchpoint.
- the PWM was able to identify putative branch sites in proximity to the intronic end.
- the potential of the PWM was cross-checked with experimentally known branch sites identified by other tools and the outcome matched accurately.
- the analysis approach of the present method is focused on screening variants in branch sites with “A” as the branchpoint and studying the impact of the variant on splicing and the resulting pathogenicity.
- upon variant screening, the input dataset shows a particular variant in the COL4A5 gene that was speculated to be a splice site variant; however, the branch site scores obtained from the PWM before and after the mutation indicate that it is a branchpoint mutation disrupting the branch site.
- the screening of putative branch site variants in the human genome, through the Clinvar.vcf successfully identified 20 cases with deleterious variants (pathogenic/likely pathogenic) as branch site mutations (TABLE 5) and 20 deleterious variants as splice site mutations (TABLE 6).
- An extra filter, namely a significant change in the branch site score or splice acceptor site score before and after the mutation, was applied in order to pick branchpoints and splice sites drastically affected by the variation.
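- The score-change filter can be sketched as below. The disclosure does not state what counts as a "significant change", so the cutoff `min_drop` is a hypothetical parameter, as are the function name and the convention of representing an abolished site ("null" score) as `None`. The three gene entries reuse the before/after branch site scores quoted in the examples above; the fourth entry is invented purely to show a rejected case.

```python
from typing import List, Optional, Tuple

# (label, score before mutation, score after mutation or None if site is lost)
Record = Tuple[str, float, Optional[float]]


def drastically_affected(records: List[Record],
                         min_drop: float = 2.0) -> List[str]:
    """Keep labels whose score dropped by at least `min_drop` or became null."""
    kept = []
    for label, before, after in records:
        if after is None or before - after >= min_drop:
            kept.append(label)
    return kept


cases = [
    ("NTRK1", 5.70, 3.17),
    ("PYGM", 4.43, None),
    ("TIMM8A", 3.40, None),
    ("hypothetical_small_change", 3.00, 2.90),  # invented example, not from the source
]
```

Calling `drastically_affected(cases)` under these assumptions keeps the three gene entries and drops the invented small-change entry.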
- variant screening within 15 nt upstream of the intron/exon junction confirmed two experimentally proven cases (Ornithine Carbamoyltransferase (OTC) and Mannosidase Alpha Class 2B Member 1 (MAN2B1)), with the variant disrupting the canonical splice acceptor site and leading to activation of a cryptic splice acceptor site and cryptic branch site.
- the three known cases of branch site mutations and the two known cases of splice site mutations confirmed the potency of the analysis model in identifying potential branch sites in the introns (NTRK1, DYSF, TH; OTC, MAN2B1), while the two discovery cases of branch site mutations and the two of splice site mutations (PYGM, TIMM8A; AGXT, MYO15A) confirmed the potency of the analysis model in categorizing intronic variants as branchpoint or splice site variants based on the activation of a cryptic branchpoint or cryptic splice site.
- the analysis approach was also tested for the negative set i.e.
- the analysis approach is successful in determining branchpoint variants and determining their pathogenicity based on the availability of an alternative branchpoint which could rescue normal splicing.
- the present system and method proved successful in identifying variants that caused disruption of a branchpoint and led to creation of a new splice acceptor (Component of Oligomeric Golgi Complex 6 (COG6), Glucosidase Alpha, Acid (GAA)) at that site. It was also successful in identifying a putative splice acceptor site downstream to the canonical site upon creation of a new branchpoint at the canonical splice acceptor site as a result of the variation. In total, 40 variants with a potency to be a branch site or splice site mutation were identified and their role in causing splicing aberration was predicted with the aid of the designed tool.
- the PWM based approach is designed to screen for variants that are putative branch sites with ‘A’ as the branchpoint in any given sequence and determine the effect of a mutation in a branch site to the splicing of the intron.
- the embodiments of the present system and method are capable of identifying branchpoint variants and, along with other established tools that determine various aspects of splice sites, were successful in offering a more detailed biological explanation of the consequences of mutations. Also, the discovery cases identified using the present embodiments hold strong potential in unveiling the cause behind known pathogenic conditions and provide a basis for therapeutic developments. Prediction of putative branchpoint or splice site variants in an intron can lay the foundation for the identification of possible genotype-based therapies using exon-skipping techniques (TABLE 7).
- Predicted alternative BP Predicted branchpoint with a higher potential by present prediction tool
- FIG. 4 is a block diagram of an exemplary computer system 401 for implementing embodiments consistent with the present disclosure.
- the computer system 401 may be implemented standalone or in combination of components of the system 102 ( FIG. 1 ). Variations of computer system 401 may be used for implementing the devices included in this disclosure.
- Computer system 401 may comprise a central processing unit (“CPU” or “hardware processor”) 402 .
- the hardware processor 402 may comprise at least one data processor for executing program components for executing user- or system-generated requests.
- the processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
- the processor may include a microprocessor, such as AMD AthlonTM, DuronTM or OpteronTM, ARM's application, embedded or secure processors, IBM PowerPCTM, Intel's Core, ItaniumTM, XeonTM, CeleronTM or other line of processors, etc.
- the processor 402 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.
- the processor 402 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 403.
- the I/O interface 403 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
- the computer system 401 may communicate with one or more I/O devices.
- the input device 404 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc.
- Output device 405 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc.
- a transceiver 406 may be disposed in connection with the processor 402 . The transceiver may facilitate various types of wireless transmission or reception.
- the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.
- the processor 402 may be disposed in communication with a communication network 408 via a network interface 407 .
- the network interface 407 may communicate with the communication network 408 .
- the network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc.
- the communication network 408 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc.
- the computer system 401 may communicate with devices 409 and 410 .
- These devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like.
- the computer system 401 may itself embody one or more of these devices.
- the processor 402 may be disposed in communication with one or more memory devices (e.g., RAM 713 , ROM 714 , etc.) via a storage interface 412 .
- the storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc.
- the memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.
- the memory devices may store a collection of program or database components, including, without limitation, an operating system 416 , user interface application 417 , user/application data 418 (e.g., any data variables or data records discussed in this disclosure), etc.
- the operating system 416 may facilitate resource management and operation of the computer system 401 .
- Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like.
- User interface 417 may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities.
- user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 401 , such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc.
- Graphical user interfaces may be employed, including, without limitation, Apple Macintosh operating systems' Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.
- computer system 401 may store user/application data 418 , such as the data, variables, records, etc. as described in this disclosure.
- databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase.
- databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.).
- Such databases may be consolidated or distributed, sometimes among the various computer systems discussed above in this disclosure. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
- the server, messaging and instructions transmitted or received may emanate from hardware, including operating system, and program code (i.e., application code) residing in a cloud implementation.
- one or more of the systems and methods provided herein may be suitable for cloud-based implementation.
- some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.
- the hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof.
- the device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g.
- the means can include both hardware means and software means.
- the method embodiments described herein could be implemented in hardware and software.
- the device may also include software means.
- the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
- the embodiments herein can comprise hardware and software elements.
- the embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc.
- the functions performed by various modules described herein may be implemented in other modules or combinations of other modules.
- a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- a computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored.
- a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein.
- the term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
Abstract
Description
- This U.S. patent application claims priority under 35 U.S.C. § 119 to India Application No. 201821025433, filed on Jul. 7, 2018. The entire contents of the aforementioned application are incorporated herein by reference.
- The disclosure herein generally relates to mRNA splicing, and, more particularly, predicting effect of genomic variations on pre-mRNA splicing.
- RNA splicing is a process of cutting introns out of pre-mRNA and stitching together exons to form a final nucleotide sequence that is the mRNA sequence that codes for proteins. In this regard branchpoint (BP) selection and splice site (SS) selection are key steps in RNA splicing, yet many popular splicing analysis tools do not model this mechanism. If there is a mutation in proximity to an intron's primary branch point, that branchpoint may become unusable.
- Existing methods for branchpoint prediction use wet lab techniques and in-silico methods. The wet lab techniques are time consuming and labour intensive, while existing computational models involving Support Vector Machine algorithms or other machine learning tools are based on numerous assumptions which hamper accurate prediction. Various computational methods have been implemented to facilitate accurate branchpoint prediction, and the predicted branchpoints have been tested in vivo or in vitro, but most of the models are built on hypothetical assumptions which do not lead to accurate prediction of branchpoints. In general, the search for disease-causing mutations has been mostly restricted to the coding exons, intron-exon junctions and promoter region of the gene of interest.
- Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method for predicting effect of genomic variations on pre-mRNA splicing is provided. The method includes receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. The method further includes classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. The method further includes evaluating the effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant. Herein, evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score; determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and evaluating, in response to determining that the new splice acceptor site region is being created, strength of an identified natural branch point in the classified region using a PWM evaluator. The method further includes predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
- In another embodiment, a system for predicting effect of genomic variations on pre-mRNA splicing is provided. The system includes a memory storing instructions and one or more hardware processors coupled to the memory, wherein the one or more hardware processors are configured by the instructions to receive genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. The one or more hardware processors are further configured to classify the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. The one or more hardware processors are further configured to evaluate the effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant, wherein evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score; determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and evaluating, in response to determining that the new splice acceptor site region is being created, strength of an identified natural branch point in the classified region using a PWM evaluator. The one or more hardware processors are further configured to predict pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on pre-mRNA splicing.
- In yet another embodiment, one or more non-transitory machine readable information storage mediums are provided. Said one or more non-transitory machine readable information storage mediums comprise one or more instructions which, when executed by one or more hardware processors, cause receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. The instructions further cause classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. The instructions further cause evaluating the effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant. Herein, evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score; determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and evaluating, in response to determining that the new splice acceptor site region is being created, strength of an identified natural branch point in the classified region using a PWM evaluator. The instructions further cause predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
- It should be appreciated by those skilled in the art that any block diagram herein represents a conceptual view of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a computer readable medium and so executed by a computing device or processor, whether or not such computing device or processor is explicitly shown.
- The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
- FIG. 1 illustrates a network environment implementing a system 102 for predicting effect of genomic variations on pre-mRNA splicing, in accordance with an embodiment of the present disclosure.
- FIG. 2 is a flow diagram illustrating a method for predicting effect of genomic variations on pre-mRNA splicing, according to an embodiment of the present disclosure.
- FIGS. 3A, 3B and 3C illustrate an analysis pipeline for predicting effect of genomic variations on pre-mRNA splicing, in accordance with an embodiment of the present disclosure.
- FIG. 4 illustrates a block diagram of a system for predicting effect of genomic variations on pre-mRNA splicing, in accordance with some embodiments of the present disclosure.
- Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the claims (when included in the specification).
- One study investigating disease-causing BPS mutations found that mutations at adenosine branchpoints caused more severe splicing defects than mutations at branchpoints with other bases. A mutation in the branchpoint impairs lariat formation and may lead to aberrant splicing of the intron, leading to gene dysfunction. The lariat is a lasso-shaped structure formed during the removal of introns in mRNA processing. Mutations at branch sites have been shown to lead to aberrant splicing, which in turn can lead to disease phenotypes. The rapidly growing use of next generation sequencing (NGS) in the clinic for diagnosis and screening of disorders may benefit from approaches that can reliably identify mutations in branch sites that may explain diseases. Development of such tools has been hampered by the absence of a large enough “gold dataset” of known high-confidence branch sites.
- Splicing forms a crucial part of the pre-mRNA maturation process, as accurate excision of introns and joining of exons are essential to eukaryotic gene expression. During splicing, parts of the pre-mRNA are removed by the spliceosome within the nucleus before the mature mRNA is transported to the cytoplasm for translation. Depending upon tissue localization and the developmental stage, pre-mRNA is spliced differently, leading to alternative transcripts, i.e., expression of different proteins from the same gene. More than 70% of protein coding human genes are alternatively spliced, and alternative splicing has been proposed to be the major cause of the evolution of phenotypic complexity in mammals.
- Exon skipping is the most common outcome of splicing mutations, followed by activation of cryptic 5′ and 3′ splice sites (5′SS and 3′SS). Exon skipping is due to disruption of the natural splice acceptor site or abolishment of the natural branchpoint with no alternative branchpoint available to facilitate splicing. Efficient splicing requires at least three major signals within introns: the 5′ splice site, the 3′ splice site and the branchpoint sequence. Auxiliary sequences in introns and exons, known as splicing enhancers and silencers, act in conjunction to decide whether splicing is constitutive or alternative. The 5′ end of the intron is known as the splice donor site and the 3′ end of the intron is referred to as the splice acceptor site.
- Divergence from the prototype sequences is associated with alternative transcript generation. Occurrence of such consensus sequences within introns is quite common in higher eukaryotes, framing pseudoexons; this indicates that the presence of splice boundaries alone is insufficient for regulating correct splicing. The 3′ end is characterized by the presence of the splice acceptor site, the branchpoint sequence upstream, and the polypyrimidine tract immediately following the branchpoint sequence. Branchpoints are defined on the basis of four major criteria: they are proximal to the 3′ splice end of the intron; the branchpoint sequence is followed by a polypyrimidine tract; there is a depletion of the ‘AG’ dinucleotide between the branchpoint sequence and the 3′ splice site; and the branchpoint is mostly an adenine. The selection and accurate prediction of branchpoint variants and splice site variants from candidate variants of existing databases of known human gene transcripts is therefore of prime importance and challenging.
- Various embodiments of the present disclosure provide a method and system for predicting the effect of genomic variations on pre-mRNA splicing based on the MaxEnt tool and a Position Weight Matrix (PWM) evaluator, with high accuracy, for use in a resource constrained environment. The disclosed system includes a variant pipeline which works in real-time in a resource constrained environment or near real-time on a CPU. The disclosed system and method provide a solution for predicting effect of genomic variations on pre-mRNA splicing. A detailed description of the above described system and method for predicting the effect of genomic variations on pre-mRNA splicing is shown with respect to illustrations represented with reference to FIGS. 1 through 4.
- Referring now to the drawings, and more particularly to FIGS. 1 through 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and method for predicting effect of genomic variations on pre-mRNA splicing.
- Herein, the system 102 may receive inputs, for example, inputs via multiple devices and/or machines 104-1, 104-2 . . . 104-N, collectively referred to as devices 104 hereinafter. Examples of the devices 104 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, VR camera embodying devices, and storage devices equipped to receive and store inputs and outputs. In an embodiment, the devices 104 may include devices capable of capturing and storing data. The devices 104 are communicatively coupled to the system 102 through a network 106, and may be capable of transmitting the data to the system 102.
- In one implementation, the network 106 may be a wireless network, a wired network or a combination thereof. The network 106 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like. The network 106 may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further, the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
- The devices 104 may send input to the system 102 via the network 106. The system 102 is caused to predict effect of genomic variations on pre-mRNA splicing. In an embodiment, the system 102 may be embodied in a computing device 110. Examples of the computing device 110 may include, but are not limited to, a desktop personal computer (PC), a notebook, a laptop, a portable computer, a smart phone, a tablet, and the like. The system 102 may also be associated with a data repository 112 to store inputs, datasets and outputs/results. Additionally or alternatively, the data repository 112 may be configured to store data and/or information generated during predicting effect of genomic variations on pre-mRNA splicing. The repository 112 may be configured outside and communicably coupled to the computing device 110 embodying the system 102. Alternatively, the data repository 112 may be configured within the system 102.
- In an embodiment, the disclosed system 102 enables predicting effect of genomic variations on pre-mRNA splicing, thereby resulting in high accuracy of predicting pathogenicity and determining branchpoint variants and their pathogenicity based on the availability of an alternative branchpoint which could rescue normal splicing. An example representation of the pipeline of the method for predicting effect of genomic variations on pre-mRNA splicing is shown and described further with reference to FIGS. 3A-3C.
- Referring now to FIG. 2, a flow-diagram of a method 200 for predicting effect of genomic variations on pre-mRNA splicing is described, according to some embodiments of the present disclosure. The method 200 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. The order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method. Furthermore, the method 200 can be implemented in any suitable hardware, software, firmware, or combination thereof. In an embodiment, the method 200 depicted in the flow chart may be executed by a system, for example, the system 102 of FIG. 1. In an example embodiment, the system 102 may be embodied in an exemplary computer system, for example computer system 102. The method 200 of FIG. 2 will be explained in more detail below with reference to FIGS. 3A-3C.
- Referring to FIG. 2, in the illustrated embodiment, the method 200 is initiated at 202, where genomic position information, a reference allele and an alternate allele corresponding to a specified version of the human genome are received. The at least one candidate transcript that fully contains the variant is obtained from existing databases of known human gene transcripts (herein, referred to as at least one variant). Each transcript is represented as a set of one or more non-overlapping intervals, where each interval is represented by four features: the chromosome on which the transcript is present, the starting genomic coordinate of the interval, the ending genomic coordinate of the interval, and the strand on which the transcript is present (forward or reverse).
- At 204, the at least one candidate variant is classified as occurring in one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. Specifically, the at least one candidate variant is classified into the splice acceptor site region when it occurs at a genomic coordinate between 15 nucleotides upstream and 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts, and into the branch site region when it occurs at a genomic coordinate between 50 nucleotides and 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts. Herein, “nucleotide” and “nt” are used interchangeably.
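- The region classification at 204 can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the names ExonInterval, offset_from_acceptor and classify_variant are hypothetical, coordinates are assumed to be 1-based and inclusive, the interval passed in is assumed to be an exon preceded by an intron (not the first exon of the transcript), and a variant exactly 15 nt upstream, which both stated ranges include, is assigned here to the splice acceptor site region as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExonInterval:
    """One transcript interval with the four features named in the text."""
    chrom: str
    start: int   # assumed 1-based, inclusive
    end: int     # assumed 1-based, inclusive
    strand: str  # "+" (forward) or "-" (reverse)


def offset_from_acceptor(pos: int, exon: ExonInterval) -> int:
    """Signed distance of pos from the intron-exon acceptor junction.

    Negative values are intronic (upstream), positive values are exonic
    (downstream); -1 is the last intronic base and +1 the first exonic base.
    """
    raw = pos - exon.start if exon.strand == "+" else exon.end - pos
    return raw + 1 if raw >= 0 else raw


def classify_variant(chrom: str, pos: int, exon: ExonInterval) -> Optional[str]:
    """Return the region label for a variant, or None if it is in neither."""
    if chrom != exon.chrom:
        return None
    off = offset_from_acceptor(pos, exon)
    if -15 <= off <= 3:      # 15 nt upstream to 3 nt downstream
        return "splice_acceptor_site_region"
    if -50 <= off < -15:     # 50 nt to 15 nt upstream (15 itself handled above)
        return "branch_site_region"
    return None


exon = ExonInterval("chr1", 1000, 1200, "+")
print(classify_variant("chr1", 998, exon))  # 2 nt upstream of the junction
print(classify_variant("chr1", 970, exon))  # 30 nt upstream of the junction
```

For a reverse-strand transcript the acceptor junction sits at the interval's ending coordinate, which is why the offset is computed from exon.end in that case.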
- At 206, the effect of the at least one candidate variant on pre-mRNA splicing is evaluated based on the classified region obtained from the classification of the at least one candidate variant. The evaluation is performed by identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score, and then determining whether a new splice acceptor site region is being created due to the weakened natural splice acceptor site. Thereafter, in response to determining that the new splice acceptor site region is being created, the strength of an identified natural branchpoint in the classified region is evaluated using a Position Weight Matrix (PWM) evaluator. MaxEnt is a known splice site strength determination tool for calculating the strength or weakening of the splice acceptor site, wherein the MaxEnt tool assigns a MaxEnt score based on the effect of the at least one candidate variant on the affected natural splice acceptor site region. In an example embodiment, the available MaxEntScan tool is used to calculate the splice acceptor site scores for both the canonical splice sites, which are the naturally occurring splice sites (the natural splice acceptor site region), and the cryptic splice sites, which are splice sites activated by a mutation.
- The PWM evaluator is generated using experimentally determined human branch sites. In an example embodiment, the PWM is generated using 59,359 experimentally determined human branch sites (10-mers), identified based on exoribonuclease digestion and RNA-seq. In said example embodiment, only the sequences with ‘A’ at the branchpoint are selected from this set of branch sites as the training set for the Position Weight Matrix (PWM). ‘A’ is chosen as the branchpoint because sites with ‘C’, ‘T’ or ‘G’ as the branchpoint have very low median scores, while sites with the known ‘A’ have the highest scores, suggesting that the PWM generated in accordance with the present embodiments is selective towards ‘A’ as a branchpoint and that it is appropriate to restrict PWM scoring to ‘A’. Therefore, the PWM was built using the known ‘A’ as the branchpoint. A PWM matrix of (m*n) is created by aligning the 59,359 experimentally determined human branch sites (10-mers) with ‘A’ as the branchpoint. In the present embodiment, a matrix of (10*4) is created. The alignment is then used to calculate the frequency of each nucleotide at each position of the 10-mers, and thereafter the frequencies of each nucleotide are converted to log odds scores.
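- The conversion from aligned branch sites to log odds scores can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the present embodiments: the function names `build_pwm` and `score` are hypothetical, the logarithm base (2), the pseudocount (added so that unseen bases do not produce log of zero) and the default uniform background are choices made here for the example.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a PWM (one dict of log2-odds scores per position) from aligned sites.

    background: per-base frequencies used as the denominator of the odds;
    assumed uniform (0.25 each) when not supplied.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)   # column-wise base counts
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds scores of seq (same length as the PWM)."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the PWM length")
    return sum(column[base] for column, base in zip(pwm, seq))
```

With 10-mer training sites the resulting structure corresponds to a 10*4 matrix, one row per position and one column per nucleotide.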
- In said example embodiments, 175,031 unique introns from 18,171 canonical transcripts in the Gencode database v19 are identified and extracted with the filtering criterion of being flanked by coding exons on both sides. The frequency of each nucleotide (A, T, C, G) across all the introns is used to normalize the raw frequencies of the bases in the training set of branch points. As described above, the normalized frequencies are converted to log odds scores to generate the final PWM. Based on the branch site scores obtained for the known branch sites with ‘A’ as the branchpoint, the first quartile of the score distribution is calculated and used as a threshold for classifying a site as a high confidence branch site. In an example embodiment, the determined threshold is 1.46. Further, a 40-mer intronic sequence, 10 to 50 bases upstream from the 3′ end of each intron, is extracted from the human genome and scanned for 10-mer sequences scoring above the branchpoint threshold.
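- The threshold selection and the window scan described above can be sketched as follows. The function names and the `score_fn` callback (any function that scores a 10-mer, for example a PWM scorer) are hypothetical. The quartile method is not stated in the text, so the default "exclusive" method of Python's `statistics.quantiles` is assumed here, and the intron is assumed to be supplied 5′ to 3′ on the transcript strand.

```python
from statistics import quantiles


def first_quartile(scores):
    """First quartile of the training-site scores (assumed 'exclusive' method)."""
    return quantiles(scores, n=4)[0]


def scan_branch_window(intron_seq, score_fn, threshold, kmer=10, near=10, far=50):
    """Scan the window 10 to 50 bases upstream of the intron 3' end.

    Returns (start_index_in_intron, kmer_sequence, score) for every k-mer
    whose score is strictly above the threshold.
    """
    window_start = max(0, len(intron_seq) - far)
    window = intron_seq[window_start:len(intron_seq) - near]   # up to a 40-mer
    hits = []
    for i in range(len(window) - kmer + 1):
        site = window[i:i + kmer]
        s = score_fn(site)
        if s > threshold:
            hits.append((window_start + i, site, s))
    return hits
```

In use, `threshold` would be the value returned by `first_quartile` over the scores of the known branch sites (1.46 in the example embodiment), and `score_fn` would wrap the PWM.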
- At 208, the pathogenicity of the at least one candidate variant is predicted based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing. The evaluation and the prediction of pathogenicity of the at least one candidate variant are described in further detail with reference to
FIGS. 3A-3C. - Referring now to
FIGS. 3A-3C, the analysis pipeline for the method of predicting pathogenicity on the pre-mRNA splicing is illustrated. Herein, the analysis pipeline is designed to categorize a variant as pathogenic or non-pathogenic. The analysis approach, in accordance with the present embodiments, follows a step by step pipeline represented by FIGS. 3A-3C. In an embodiment, variants in close proximity to the canonical splice acceptor region, that is, up to 15 nucleotides upstream of it, are screened for creation of a new cryptic acceptor site or creation of a new branch site. If a branch site is created, then a scan for a suitable downstream splice acceptor site is initiated. If the variant creates a splice acceptor, then a suitable upstream branch site is scanned for using the PWM evaluator. If the variant disrupts the canonical splice acceptor and the canonical branch site is unaffected, then screening for a suitable alternative downstream splice acceptor is performed. If a new splice acceptor is predicted downstream of the canonical splice acceptor site, then screening for an experimentally proven branchpoint is performed using the PWM tool. The detailed step by step process of the pipeline is described in FIGS. 3A-3C. - Referring now to
FIG. 3A, a variant 302, for example, the at least one candidate variant, is received, together with genomic position information, a reference allele and an alternate allele corresponding to a specified version of the human genome. At least one candidate transcript that fully contains the variant (herein referred to as the at least one candidate variant) is obtained from existing databases of known human gene transcripts. Each transcript is represented as a set of one or more non-overlapping intervals, where each interval is represented by four features: the chromosome on which the transcript is present, the starting genomic coordinate of the interval, the ending genomic coordinate of the interval, and the strand on which the transcript is present (forward or reverse). The at least one candidate variant is classified as occurring in a splice affecting region based on its genomic coordinate. At 304, the region with genomic coordinates between 15 nucleotides upstream and 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts is classified as the splice acceptor site. At 306, weakening of the natural splice acceptor site is determined; in other words, it is determined that the classified at least one candidate variant is affecting the natural splice acceptor site (natural 3′SS). At 308, the classified at least one candidate variant is checked for creation of a new ‘AG’, that is, a new 3′SS, with the resulting weakening of the natural 3′SS determined using the MaxEnt score. In response to the determined weakening of the splice acceptor site, that is, the weakening of the natural 3′SS, it is checked whether the natural branchpoint suffices; otherwise the flow branches out to connector C. In other words, at 310, the presence or absence of a natural branchpoint in the sequence range of 15 nucleotides to 50 nucleotides upstream of the new splice acceptor site region that is active during the pre-mRNA splicing is determined. 
Thereafter, either the strength of the natural branchpoint is evaluated using the PWM evaluator and the at least one candidate variant is identified as pathogenic (312) based on the evaluated strength of the natural branchpoint, or an alternative branchpoint is screened for using the PWM evaluator and the at least one candidate variant is predicted as pathogenic based on the evaluated strength of the alternative branchpoint (314). At 317, the status of the natural splice acceptor site region is determined. The status herein is either a disrupted natural splice acceptor site region or a non-disrupted natural splice acceptor site region. At 316, the at least one candidate variant is predicted as pathogenic or non-pathogenic (318) based on the determined status. - Referring now to
FIG. 3B, at connector B, the at least one candidate variant is classified as occurring in the branch site region based on its genomic coordinate. At 320, the region with genomic coordinates between 50 nucleotides and 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts is classified as the branch site. At 322, weakening of the natural splice acceptor site is determined; in other words, it is determined that the classified at least one candidate variant is affecting the natural 3′SS. At 324, the classified at least one candidate variant is checked for creation of a new ‘AG’, that is, a new 3′SS, and the weakening of the natural 3′SS in response to the creation of the new 3′SS is determined using the MaxEnt score. In response to the determined weakening, the splice acceptor site is screened for either a natural branchpoint or an alternative branchpoint. At 326, where a new splice acceptor site is being created, the effect of the at least one candidate variant on the branch site is evaluated by determining the presence of an alternative branchpoint in the sequence range of 50 nucleotides to 15 nucleotides upstream of the new splice acceptor site. At 328, the at least one candidate variant is categorized as pathogenic if no alternative branchpoint is determined; at 338, the at least one candidate variant is predicted as non-pathogenic if an alternative branchpoint is found. - At 330, where no new splice acceptor site is being created, the effect of the at least one candidate variant on the branch site is evaluated by screening for a natural branchpoint in the sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site and, at 332, determining the level of strength of the branch site using the PWM evaluator. Herein, the level of strength is determined because the at least one candidate variant affects the screened natural branchpoint. At 334, based on the determined level of strength of the branch site, the at least one candidate variant is predicted as pathogenic. 
At 336, the at least one candidate variant is predicted as pathogenic or non-pathogenic (338) based on an alternative branchpoint screened in the sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site region.
- Referring now to
FIG. 3C, at connector C, where no new splice acceptor site is being created, the effect of the at least one candidate variant on the splice acceptor site region is evaluated by sequentially performing the steps at 340, 342 and 344. At 340, the effect of the at least one candidate variant on the natural branchpoint is determined, and the level of strength of the natural branch site is identified using the PWM evaluator based on the determined effect. At 342, the sequence range from 50 nucleotides upstream to 50 nucleotides downstream of the at least one candidate variant is screened for an alternative splice acceptor site region, and the strength of the alternative splice acceptor site region is compared with that of the weakened natural splice acceptor site region. At 344, the presence of a newly created branchpoint is determined, and the strength of the new branchpoint is compared with that of the natural branchpoint. Further, based on 340, at 346, the at least one candidate variant is predicted as a non-pathogenic variant (348), or the at least one candidate variant is predicted as a pathogenic variant (350) or a non-pathogenic variant (364) based on a screened alternative branchpoint (360) in the sequence 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site region. - Further, based on 342, the at least one candidate variant is predicted as non-pathogenic (348), or the presence of a natural branchpoint in the sequence range of 15 nucleotides to 50 nucleotides from the splice acceptor site region that is active during the mRNA splicing is determined (352) and thereafter the strength of the natural branchpoint is compared with the predefined threshold. Based on the comparison, the at least one candidate variant is predicted as pathogenic (350). 
Further, based on 344, the at least one candidate variant is predicted as pathogenic (354) or non-pathogenic (356) based on an alternative branchpoint screened in the sequence range of 50 nucleotides to 15 nucleotides upstream of the alternative splice acceptor site (358). Further, based on the comparison of the strength of the new branchpoint and the natural branchpoint, the at least one candidate variant is predicted as non-pathogenic (364). If not, the presence of a natural branchpoint in the range of 15 nucleotides to 50 nucleotides upstream of the splice acceptor site region that is active during the mRNA splicing is determined, and thereafter the strength of the natural branchpoint is compared with the predefined threshold (354). Based on the determined presence of the natural branchpoint and the comparison of the strength of the natural branchpoint with the predefined threshold, the at least one candidate variant is predicted as pathogenic (362) or non-pathogenic (364).
- In accordance with the present embodiments, the focus of the present system and method is to identify a branchpoint (BP) in any given sequence and to evaluate the role of the identified BP in the functional consequence of splicing of the intron. A further focus of the present embodiments is to predict the impact of the evaluated BP on pathogenicity using a combination of the PWM and the MaxEnt score. Many tools can predict a branchpoint, but their main drawback is that they require far more input data, such as the polypyrimidine tract information, the actual splice acceptor site and the distance to the splice acceptor site region, which prevents such tools from predicting a branchpoint in an arbitrary given sequence. The present system and method clearly distinguishes between the BP and the splice site (SS) and evaluates a variant based on the combined output of the individual components.
- Validation and Results
- The results of the methods for predicting the effect of genomic variations on pre-mRNA splicing have been validated using the following examples. It will be understood that the examples discussed herein are only for the purpose of explanation and not to limit the scope of the present subject matter. Further, the test results are shown for a specific example of predicting the effect of genomic variations on pre-mRNA splicing and should in no way be construed as the only method that can be formed through the described method.
- In one example embodiment, the system and method for predicting the effect of genomic variations on pre-mRNA splicing are evaluated as follows. In the present embodiment, a recent set of 59,359 experimentally determined human branch sites (10-mers), identified based on exoribonuclease digestion and RNA-seq, is considered. The dataset offers a comprehensive basis for training a high accuracy putative BPS prediction model (10). The present example utilizes this set of branch point sites, selecting only the sequences with ‘A’ at the branchpoint as the training set for the Position Weight Matrix (PWM) evaluator. This is because our goal is to create and evaluate a tool that can be used as part of a routine variant annotation scheme to provide high confidence annotations for further clinical interpretation. Parameters such as the distance of the BPS (branch point sequence) from the 3′ splice end of the intron (−15 to −50 nucleotides upstream), the requirement that the BPS be part of the intronic region in all transcripts, and a threshold set on the basis of the top 25% of scores in the PWM from the training set were chosen to increase the accuracy of the analysis approach. Comparisons with the outcomes of other existing prediction tools such as HSF (Human Splicing Finder), SVM (Support Vector Machine), BP finder and other machine learning prediction tools, along with experimentally proven BPS mutations, have been performed to demonstrate the accuracy of our proposed model.
- The PWM-based analysis method described in accordance with the present embodiments was successful in identifying the role in pathogenicity of 3 Clinvar-annotated deleterious mutation cases (Table 1) in known branchpoints listed in the high confidence branchpoint dataset, as described below. The present analysis was successful in confirming the experimentally known cases of variants causing splicing aberrations due to activation of cryptic splice sites and branchpoints. The experiments were conducted for various known variants.
- In an embodiment, a variant C>G in intron 9 was detected upon Clinvar-based variant screening of the Ornithine Carbamoyltransferase coding gene (OTC) as disrupting the canonical splice acceptor site. An alternative splice acceptor site (MaxEnt: 8.30) was identified 25 bases downstream (in the exonic region) of the canonical splice acceptor junction. The canonical branch site (score: 2.80), i.e. 29 bases upstream of the identified cryptic splice acceptor, was deemed suitable. The inactivation of the canonical splice acceptor and the activation of the cryptic acceptor site have been experimentally verified with the aid of PCR, and the resulting aberration in splicing has been proven to cause an aberrant 50 amino acid C-terminal sequence in the protein, resulting in hyperammonemic crisis. The values corresponding to OTC are shown in Table 1.
- In another embodiment, a T>C transition was found in intron 14 of the Mannosidase Alpha Class 2B Member 1 gene (MAN2B1), disrupting the canonical splice acceptor site. Upon the loss of the canonical splice acceptor, a cryptic branch site is activated, and a cryptic splice acceptor (MaxEnt: 4.78) 31 nt downstream of the canonical 3′ splice site is also activated, resulting in deletion of the first 31 nt of exon 15 and leading to a frameshift mutation that causes premature termination of the protein through introduction of a stop codon (Table 1). With the aid of RT-PCR, the disruption of the canonical 3′ splice acceptor site and the activation of the cryptic splice site leading to partial exon deletion have been confirmed. Overall, the analysis approach displayed the potential to unveil one of the causes behind deficiency of alpha-mannosidase. -
TABLE 1
Gene      Variant position   Sequence                        Score
OTC       38280273           TTTCTTTGTTGTGTCAT[C > G]AGGCT   7.73 > −1.02
MAN2B1    12763276           GTGGACCCTTTTCTGCCC[A > G]GCAC   4.4 > −3.56
- Experiments also revealed some discovery cases. Herein, the reason behind the splicing aberrations due to known pathogenic candidate variants was unveiled, and such cases were categorized as discovery cases.
- In an example embodiment, upon screening of the AGXT gene for variants, an A>G mutation was found in intron 5. As the variant is at the canonical splice acceptor site, it has previously been categorized as a splice site mutation, although the role of the variant and its specific effects on the splicing aberrations have not been defined. The canonical splice acceptor site of intron 5 was disrupted as a consequence of the variation (MaxEnt: 4.01>−3.94). Due to the disruption of the natural splice acceptor site, a cryptic splice acceptor site (MaxEnt: 5.01) 28 nucleotides downstream of the canonical splice acceptor site was activated. Further, upon screening for suitable branch sites for the cryptic splice acceptor, a potential branch site 35 bases upstream of the cryptic splice acceptor site was found. Overall, on the basis of the proposed model it can be observed that, due to the mutation, the original splice acceptor site is disrupted and a cryptic splice acceptor, along with a cryptic branch point, is activated downstream of the canonical splice site and canonical branch site (Table 2). The resulting protein is 392 a.a. long and loses 9 a.a., i.e. an entire β-strand, in the core region as a result of the SNP. The deleted protein region forms a part of the active site and the homodimer interface of the protein and is essential for pyridoxal 5′ phosphate binding. Therefore, the deletion caused by the SNP is highly deleterious, as it causes protein dysfunction. A hypothesis can be drawn based on the occurrence of an alternative splice acceptor with a suitable branch site, leading to aberrant splicing. The premature termination of the transcript due to the splicing disruption might be a cause of primary hyperoxaluria.
- In another embodiment, a deleterious variant G>A disrupting the canonical splice acceptor site was found upon screening of intron 49 of the MYO15A gene. As a result of the variant, a cryptic branch site (score: 1.92) was activated at the canonical splice acceptor junction. A cryptic splice acceptor site suitable for the cryptic branch site was activated 27 nt downstream (exonic region; MaxEnt: 7.13) of the canonical splice acceptor, with the potential to cause partial exon 50 skipping; alternatively, complete exon 50 skipping might occur as a result of using the stronger splice acceptor site of intron 50 (MaxEnt: 8.93) for splicing. The splicing aberration due to disruption of the canonical splice acceptor and the splicing consequences might be the cause behind non-syndromic genetic deafness. The resulting splicing aberrations do not disrupt the reading frame of the protein but alter the protein region essential for peptide ligand binding with proline-rich ligands, such as the SH3 domain. SH3 domains in the protein are essential for intramolecular interactions leading to proper regulation of the enzymes and also for mediating multiprotein complex assemblies. Therefore, even though the frame of the protein is unaffected, essential active regions of the protein are altered, leading to a truncated or non-functional protein. Overall, the analysis approach was successful in unveiling a hypothesis behind the effect of the intronic variant on splicing of intron 49 in the MYO15A gene and the resulting pathogenicity. - In yet another example embodiment, a reinterpreted case, a splice acceptor variant (G>C) was identified upon screening of intron 8 of the Growth Hormone Receptor (GHR) gene. The variant, being at the splice acceptor site (AG>AC), disrupted the canonical splice acceptor (MaxEnt: 5.55>−2.52), resulting in idiopathic short stature. Two different variant transcripts for GHR have been reported, one with complete skipping of exon 9 and the other with partial deletion of exon 9. The transcript with partial deletion of exon 9 was formed due to activation of a cryptic splice site downstream (24 nt) of the canonical splice acceptor. The occurrence of the splice variants has been reported, but the cause behind their formation was not elucidated. The splice strength of the cryptic splice acceptor site (i.e. 
in the exonic region) is greater than that of the canonical splice acceptor site, and the variant of interest disrupts the canonical splice acceptor site, leading to aberrant splicing and resulting in a non-functional protein due to premature termination of the protein. The variant has been associated with disruption of the canonical splice acceptor and exon 9 skipping, indicating that the downstream cryptic splice acceptor was not being used for splicing. However, based on the hypothesis drawn using the analysis model and the experimental evidence, GHR-(1-279), the splice variant formed due to the activation of the cryptic splice acceptor site, is as highly expressed as the canonical transcript; therefore, upon disruption of the canonical splice acceptor, it is likely that the downstream cryptic splice acceptor would be activated instead of the disrupted canonical splice acceptor site of the intron 10 being selected, leading to exon 9 skipping (Table 2). As a result of the variant, the protein product of GHR loses 8 a.a. from the part of the protein that forms part of the growth hormone binding protein (GHBP) after cleavage from the GHR. Therefore, deletion of such an essential region from the protein would lead to dysfunction of the protein and might be the cause behind the deleteriousness of the variant. Overall, the analysis approach was successful in reinterpreting the role of the deleterious variant (G>C) in GHR intron 8 splicing and in the pathogenicity causing growth hormone insensitivity.
-
TABLE 2
Gene      Variant position   Sequence                        Score
AGXT      241813393          AGCAAACCACCCATCTAC[A > C]GGCA   4.01 > −3.94
MYO15A    18060469           GACCCGAGCCTGGCCCATA[G > A]GCT   3.14 > −5.61
GHR       42718153           AAATTTTATATGTTTTCAA[G > C]GAT   5.55 > −2.52
- In an embodiment, discoveries arising from predicted branch site variants were studied. First, experimentally known cases were considered: the PWM-based approach, along with the well-established splice site strength determination tool (MaxEnt), was tested on experimentally determined cases of branchpoint variants causing pathogenicity (NTRK1, DYSF, TH). The output of the analysis approach exactly reflected the experimental findings.
- In an embodiment, based on the output of the predicted branchpoint variants, in the case of the NTRK1 (neurotrophic tyrosine kinase receptor family) gene, a putative branch site sequence 31 bases upstream of the splice acceptor site was found to contain a deleterious variant T>A. The branch site score was drastically reduced by the mutation, 5.70>3.17 (Table 3), and a cryptic splice acceptor site was activated. The resulting spliced product after mutation included an inserted intronic segment (137 bp) attributed to the usage of the upstream cryptic splice acceptor site. The T>A branch site mutation has therefore been proven to be a major cause of congenital insensitivity to pain with anhidrosis (CIPA), and the analysis approach was successful in determining the same.
- In yet another example embodiment, a deleterious mutation (A>G) in intron 31 of the DYSF gene was identified upon screening. The change in branch site scores revealed that the variant disrupts the branch site (Table 3). The deleterious mutation A>G has been experimentally verified to disrupt the branchpoint, leading to failure of lariat formation and skipping of exon 32 of the dysferlin gene, resulting in recessively inherited limb-girdle muscular dystrophy type 2B (LGMD2B) and muscular dystrophies with distal presentations.
- In yet another example embodiment, the PWM-based approach identified a putative branch site containing a deleterious variant T>A in intron 11 of TH. It has been proven that the deleterious variant leads to alternative splicing by one of three routes: skipping of exon 12, resulting in the absence of 32 amino acids in the final protein product and making it non-functional; usage of a cryptic branch site, resulting in aberrant splicing; or partial intron retention (36 nucleotides in the mRNA), resulting in incorporation of 12 additional amino acids and rendering the protein non-functional. The branch site score for the predicted branch site was reduced significantly as a result of the variant (Table 3). It has been proven that a branch site mutation (T>A) in the gene of the enzyme tyrosine hydroxylase (TH), two bases upstream of the branchpoint of intron 11, leads to an aberrant protein product causing severe extrapyramidal movement disorder. The alternative splicing leading to intron retention was also verified using the present method.
-
TABLE 3
Gene      BP position   Sequence            Score
NTRK1     156843392     GCCC[T > A]GACCT    5.701 > 3.174
DYSF      71817308      CCACTC[A > G]CTC    5.568 > Disrupted
TH        2180717       GGGC[T > A]GATGC    4.206 > 1.679
- In an embodiment, cases of branchpoint disruption causing a splicing aberration that results in exon skipping were validated.
- In yet another example embodiment, from the predicted deleterious branchpoint variants in the PYGM gene, a deleterious point mutation A>G was discovered in the branch site sequence ‘TCCCTGACAG’, i.e. 26 bases upstream of the splice acceptor site of intron 3. This intronic mutation A>G has been experimentally proven to result in skipping of exon 4, leading to McArdle disease (17). Based on amplified PCR products from the natural and the mutated samples, retention of exon 4 was concluded and the variant was classified as a splice acceptor site mutation, but the role of the branch site was not addressed. Based on the proposed analysis approach and the scores obtained for the branch site strengths, exon 4 skipping is hypothesized to be due to the disruption of the canonical branchpoint (4.43 to null), which is 26 bases upstream of the canonical splice acceptor (Table 4). As the variant lies 26 bases upstream of the canonical splice acceptor and is therefore not likely to affect the splice site strength, the variant can be hypothesized to be a branch site mutation. Overall, the analysis approach was capable of determining and classifying an experimentally validated splice mutation as a branchpoint mutation. - In yet another example embodiment, a deleterious variant in the putative branch site ‘TTTGTGATTC’, with the highest score of 3.40, was identified 23 bases upstream of the splice acceptor site in the sole intron of the Translocase Of Inner Mitochondrial Membrane 8 (TIMM8A) gene. TIMM8A/DDP1 gene dysfunction leads to Mohr-Tranebjaerg syndrome, or deafness/dystonia syndrome, and there has been evidence of various missense and nonsense mutations in the coding regions of the exons of TIMM8A. There has been a recent finding of an intronic variant A>C causing X-linked dystonia deafness. The intronic variant in TIMM8A has been proven to cause protein dysfunction, possibly due to splicing aberrations. The cause behind the splicing aberrations has not been discussed in terms of branchpoint disruption. On the basis of the branchpoint scores obtained from the prediction tool, it was evident that the splicing aberration was due to branchpoint disruption (Table 4). 
Overall, the analysis was able to classify a proven intronic variant as a branchpoint mutation on the basis of the change in branch site scores (3.40>null).
-
TABLE 4
Gene      BP position   Sequence            Score
PYGM      64525847      TCCCTG[A > G]CAG    4.430 > Disrupted
TIMM8A    100601671     TTTGTG[A > C]TTC    3.401 > Disrupted
- In accordance with the present embodiments, the PWM-based analysis approach is designed to screen for variants in putative branch sites with ‘A’ as the branchpoint in any given sequence and to determine the effect of a mutation in a branch site on the splicing of the intron. As observed in the aforementioned case studies, the PWM of the present embodiments is able to identify putative branch sites in proximity to the intronic end. Also, the potential of the PWM was cross-checked against experimentally known branch sites identified by other tools, and the outcome matched accurately. The case studies discussed in detail revealed successful identification of known branchpoint mutations and also led to reinterpretation of certain cases, indicating the cause behind speculated effects on splicing leading to a pathological condition.
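- The before/after comparison reported in the Score columns of Tables 3 and 4 can be sketched as follows. This is a hedged illustration rather than the scoring code of the present embodiments: `branch_site_effect` and `score_fn` are hypothetical names, and the branchpoint is assumed to sit at the seventh base of the 10-mer, which is inferred here from the bracketed positions in Tables 3 and 4 and from the statement that the TH variant lies two bases upstream of the branchpoint.

```python
BRANCHPOINT_INDEX = 6  # assumption: 0-based index of the branchpoint 'A' in the 10-mer


def branch_site_effect(ref_site, alt_site, score_fn):
    """Return (reference_score, mutated_score) for a branch site 10-mer.

    mutated_score is None when the variant removes the branchpoint 'A',
    which corresponds to the 'Disrupted' entries in the tables; otherwise
    the mutated site is re-scored with the same scoring function.
    """
    if len(ref_site) != len(alt_site):
        raise ValueError("reference and mutated sites must have equal length")
    ref_score = score_fn(ref_site)
    if alt_site[BRANCHPOINT_INDEX] != "A":
        return ref_score, None          # branchpoint lost: site disrupted
    return ref_score, score_fn(alt_site)
```

A caller would compare the returned scores with the branchpoint threshold (1.46 in the example embodiment) to decide whether the mutated site still qualifies as a high confidence branch site.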
- The basis for the examples discussed above is the PWM matrix generated in accordance with the present embodiments. The PWM is created using a dataset of branch site 10-mer sequences containing adenosine as the branchpoint. The PWM was able to identify putative branch sites in proximity to the intronic end. The potential of the PWM was cross-checked against experimentally known branch sites identified by other tools, and the outcome matched accurately. The analysis approach of the present method is focused on screening variants in branch sites with ‘A’ as the branchpoint and studying the impact of the variant on splicing and the resulting pathogenicity. The examples, as observed, were successful in identifying known branchpoint mutations and also led to reinterpretation of certain cases, indicating the cause behind speculated effects on splicing leading to a pathological condition. Variant screening of the input dataset shows a particular branchpoint variant in the COL4A5 gene which was speculated to be a splice site variant, but the scores obtained from the PWM for the branch site before and after the mutation indicate it to be a branchpoint mutation disrupting the branch site. The screening of putative branch site variants in the human genome through the Clinvar.vcf successfully identified 20 cases with deleterious variants (pathogenic/likely pathogenic) as branch site mutations (TABLE 5) and 20 deleterious variants as splice site mutations (TABLE 6). An extra filter, namely a significant change in the branch site score or splice acceptor site score before and after the mutation, was applied in order to pick branchpoints and splice sites drastically affected by the variation.
-
TABLE 5
Variant | BP Position | BP distance | Variant distance | Intron, strand | Sequence | Score | Mutated Sequence | Score | BP/BS mutation? | GERP Score
---|---|---|---|---|---|---|---|---|---|---
MTHFR; Chr1 | 11850989 | 34 | 18 | 11, − | GTGTGCATGT | 1.89 | GTGTGCATGT | 1.89 | No/No | 0.05
INS; Chr11 | 2181256 | 28 | 30 | 2, − | TTCCGGAACC | 2.294 | TTCCAGAACC | 1.859 | No/Yes | −0.67
ERCC6; Chr10 | 50681652 | 19 | 26 | 13, − | ACTCCTATCC | 2.35 | ACTCCTATCC | 2.35 | No/No | 1.70
CFTR; Chr7 | 117251602 | 32 | 25 | 19, + | TATGTTATTT | 2.49 | TATGTTATTT | 2.49 | No/No | −3.00
MCCC2; Chr5 | 70898299 | 16 | 19 | 4, + | CTCTCCA… | 2.75 | CTCTCCAGTG | 1.93 | No/Yes | 1.96
XPC; Chr3 | 14209904 | 24 | 24 | 3, − | TTACTGATTT | 4.51 | TTACTGGTTT | Disrupted | Yes/Yes | −0.63
TRNT1; Chr3 | 3188087 | 27 | 26 | 5, + | GAGGT… | 1.67 | GAGGTAACAC | 2.25 | No/Yes | −5.73
COL3A1; Chr2 | 189872204 | 26 | 43 | 34, + | GACTTCAATT | 3.55 | GACTTCAATT | 3.55 | No/No | 2.46
DYSF; Chr2 | 71817308 | 33 | 33 | 31, + | CCACTCACTC | 5.57 | CCACTCGCTC | Disrupted | Yes/Yes | 4.9
NTRK1; Chr1 | 156843394 | 31 | 33 | 7, + | GCCCTGACCT | 5.70 | GCCCAGACCT | 3.17 | No/Yes | −0.25
COL4A5; ChrX | 107863456 | 32 | 32 | 30, + | TGCTTCAGTA | 3.437 | TGCTTCGGTA | Disrupted | Yes/Yes | 2.49
COL4A5; ChrX | 107845097 | 17 | 17 | 26, + | TCAATA… | 2.218 | TCAATAGCTG | Disrupted | Yes/Yes | 3.15
TIMM8A; ChrX | 100601671 | 23 | 23 | 1, − | TTTGTGATTC | 3.401 | TTTGTGCTTC | Disrupted | Yes/Yes | 2.86
GAA; Chr17 | 78082265 | 22 | 21 | 7, + | TCCCTCA… | 4.176 | TCCCTCAGGA | 3.7 | No/Yes | −1.67
BRCA1; Chr17 | 41197857 | 38 | 40 | 23, − | AGAATGAATT | 1.628 | AGAAAGAATT | −0.899 | No/Yes | 1.41
COG6; Chr13 | 40273614 | 24 | 24 | 12, + | TTTGCAACCT | 1.673 | TTTGCAGCCT | Disrupted | Yes/Yes | 1.09
PYGM; Chr11 | 64525847 | 26 | 26 | 3, − | TCCCTGACAG | 4.43 | TCCCTGGCAG | Disrupted | Yes/Yes | 1.73
MYBPC3; Chr11 | 47364835 | 22 | 19 | 13, − | CACTT… | 3.404 | CACTTCAACA | 2.961 | No/Yes | 3.95
TH; Chr11 | 2187015 | 22 | 24 | 11, − | GGGCTGATGC | 4.206 | GGGCAGATGC | 1.679 | No/Yes | −1.97
VMA21; ChrX | 150572076 | 26 | 26 | 1, + | GTTCTGATTT | 4.83 | GTTCTGCTTT | Disrupted | Yes/Yes | 1.95
… indicates data missing or illegible when filed
- Out of the 20 potential branchpoint mutation cases, three cases of known (i.e., experimentally verified) branchpoint mutations and two discovery cases of mutations causing splicing aberrations in putative branch sites were successfully identified.
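- The "BP/BS mutation?" column of TABLE 5 can be related to the two distance columns with a short calculation. The sketch below is an assumption rather than a method stated in the embodiments: it supposes that both distances are counted in nucleotides upstream of the same reference point, and that the branchpoint ‘A’ occupies the seventh base (0-based index 6) of the 10-mer, as in the TABLE 4 sequences. The function names are hypothetical.

```python
BP_INDEX = 6      # 0-based index of the branchpoint 'A' within the 10-mer
SITE_LENGTH = 10  # branch site length used by the PWM


def variant_index_in_site(bp_distance, variant_distance):
    """Return the 0-based index of the variant within the 10-mer, or None if outside."""
    # A variant farther upstream than the branchpoint has a larger distance
    # and therefore a smaller index within the site.
    index = BP_INDEX + (bp_distance - variant_distance)
    return index if 0 <= index < SITE_LENGTH else None


def classify(bp_distance, variant_distance):
    """Return 'BP/BS' flags, e.g. 'Yes/Yes' when the variant hits the branchpoint itself."""
    index = variant_index_in_site(bp_distance, variant_distance)
    bp_flag = "Yes" if index == BP_INDEX else "No"
    bs_flag = "Yes" if index is not None else "No"
    return f"{bp_flag}/{bs_flag}"
```

Under these assumptions, the distances reported for PYGM (26 and 26) give "Yes/Yes", those for TH (22 and 24) give "No/Yes", and those for MTHFR (34 and 18) give "No/No", in agreement with the corresponding table entries.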
-
TABLE 6 Predicted Natural splice Mutated Splice New Splice Predicted canonical BP Variant acceptor acceptor acceptor; Pos; branch site; GERP Variant Position; Score distance from Intron, strand sequence; MaxEnt Sequence; MaxEnt MaxEnt Score Pos; Score Score HIBCH; 191159383; 3.30 9 3, − CTTCTGTTACAT CTTCTGTTACA TATACCATCTTC Predicted 3.68 Chr2: 191159365 TTGAATAGAAG; GTTGAATAGAA TGTTACAGTTG; Canonical BP 191159365; 9.11 used RSPH3; RFX6; GHR; ACAD9; AGXT, AGXT, Chr6: 15940 Chr6: 117198 Chr5: 42718 Chr3: 12860 Chr2: 24181743 Chr2: 2418133 159407483; 117198938; 42718120; 128603459; 241817408; 241813365; 3.27 3.82 2.02 1.93 6.06 6.15 2 11 1 2 1 2 2, − 1, + 8, + 1, + 9, + 5,+30 GTATTTTC TCCCTTCAA AAATTTTAT AAAATATT GAGCCAGGC AGCAAACCA TATCACTG CTGGCAAT ATGTTTTC TACTATTT CCCTCCTGC CCCATCTAC GTATTTTC TCCCTTCAG AAATTTTAT AAAATATT GAGCCAGGC AGCAAACCA TATCACTG CTGGCAAT ATGTTTTC TACTATTT CCCTCCTGC CCCATCTAC CGGACGG TTTCTTTAT AATGCTGA AGAAGTT GGCGCTCCG TCCTGTACT CCTGATTC CATCCCTTC TTCTGCCC TTCCCATT GCTTCCCAC CGGGCTCC TCTAGAGC AGCTG; CCAGTTC; TCCAGAA AGTCA; CAGAAG; TCTATC A C Predicted AATTTT A T ATATTT A C GCACTG A GC CCACCC A TC TG; canonical BP AT, TA; C; 241817420; T; 241813387; 159407461; used 42718141; 128603485; 4.374 3.35 5.34 0.822 5.72 5.26 3.96 4.1 BRCA2; BRCA2; CRYAB; DYNC2H1; PTEN; Chr13: 32920963G Chr13: 32920962A Chr11: 111779693 Chr11: 1031872 Chr10: 89653781G > C 32920931; 2.87 32920931; 2.87 111779706; 5.50 103187249; 3.85 89653767; 3.89 1 2 2 1 1 12, + 12, + 3, − 80, + 1, + ATAAAATAATTG ATAAAATAATTG TCCTCATTCTTT AAAAAATTGTT TCCTTAACTAAAGTACTC TTTCCTAG GCA; TTTCCT AGGCA; TGGGTT AGGAT; TTTTGACAG G AG ATA; −2.59 ATAAAATAATTG ATAAAATAATTG TCCTCATTCTTT AAAAAATTGTT TCCTTAACTAAAGTACTC TTTCCTAA GCA; - TTTCCT GGGCA; TGGGTT GGGAT; TTTTGACAA G AC ATA; −10.66 ATATTTTCTCCCC TAACATGGATAT GAACATGGTTTC TTATGAATTTT TGCTATGGGATTTCCTG ATTGCAGCAC; TCTCTTAGATT; ATCTCCAGGGA; CTTTATCAGA CAGAAA; 89653820; 8.11 32928997; 10.37 32920924; 4.43 111779669; 7.95 TC; 103187307; or 
Predicted ACAGTA A CAT; TTCCTC A TTC; TTTTTG A CAA; GTACTC A GAT; 89653780; canonical BP used 32920907; 2.11 111779706; 5.5 103187270; 3.14 5.23 5.03 5.03 5.72 5.78 5.19 MAN2B1; SMCHD1; NF1; MYO15A; FAH; Chr19: 1276327 Chr18: 2705691G > Chr17: 29548860A > Chr17: 18060469G > A Chr15: 804644 12763298; 2.90 2705659; 3.06 29548830; 2.70 18060451; 5.24 80464470; 2.64 2 1 8 1 6 14, − 13, + 14, + 49, + 8, + GTGGACCCTT TTTTAAAAACTA CTCTTTTTTAAAAA GACCCGAGCCTGGCCC TGAACTCTC TTCTGCCC AG AATATTAG GTC; ATTCAGGCT; 4.83 ATAG GCT; 3.14 CCCCATGTA GTGGACCCTT TTTTAAAAACTA CTCTTTTTTAAAGA GACCCGAGCCTGGCCC TGAACTCTC TTCTGCCC GG AATATTAA GTC; − ATTCAGGCT; −1.93 ATAA GCT; −5.61 CCCCAGGTA AACGTTTGAT CTTCCCCTCTTT TGTCTTTCTCTTTT GCTGGCTGCGTGGTTC TCTAATGAA CCTGACACAG TATGGAAG CAT; TTAAAGAAT; GCAGGAA; 18060497; CTCTCCCCC GGC; 2705729; 4.49 29548860; 8.40 7.13 AGGTA; CGGCAC A TCC; ATATTA A GTC; Predicted canonical CCCATA A GCT; Predicted 12763271; 2705691; 2.34 BP used 18060469; 1.92 canonical BP 2.89 or used 5.6 5.87 −1.98 5.04 −7.07 OTC; OTC; TMPRSS3; ChrX:38280275G > A ChrX: 38280273C > G Chr21: 43808641 38280243; 3.37 38280243; 3.37 43808664; 2.34 1 3 6 9, + 9, + 4, − TTTCTTTGTTGTG TTTCTTTGTTGTGT TCTTTCTGCACA TCATCAG GCT; CATC AGGCT; 7.73 TCGGCCAGTCC TTTCTTTGTTGTG TTTCTTTGTTGTGT TCTTTCTGCACA TCATCAA GCT; - CATG AGGCT; −3.22 TCAGCCAGTCC; CATGGTGTCCCTG CATGGTGTCCCTG CCTTTCTTTCTG CTGACAGATT; CTGACAGATT; CACATCAGCCA; 38280300; 8.30 38280300; 8.30 43808640; 3.74 GTCATC A AGC; TGTGTC A TGA; Predicted 38280274; 2.51 38280271; 2.80 canonical BP used 5.33 1.54 1.66
- In addition, the variant screening within 15 nt upstream of the intron/exon junction confirmed two experimentally proven cases (Ornithine Carbamoyltransferase (OTC) and Mannosidase Alpha Class 2B Member 1 (MAN2B1)), in which the variant disrupts the canonical splice acceptor site, leading to activation of a cryptic splice acceptor site and a cryptic branch site.
The three known cases of branch site mutations and the two known cases of splice site mutations (NTRK1, DYSF, TH; OTC, MAN2B1) confirmed the potency of the analysis model in identifying potential branch sites in the introns, while the two discovery cases each of branch site mutations and splice site mutations (PYGM, TIMM8A; AGXT, MYO15A) confirmed the potency of the analysis approach in categorizing intronic variants as branchpoint or splice site variants based on the activation of a cryptic branchpoint or cryptic splice site. The analysis approach was also tested on a negative set, i.e., branchpoint variants that disrupt the branchpoint but cause no pathogenicity. This test showed that although the branchpoint predicted by the PWM tool was disrupted, alternative branchpoints compensated for the disruption by enabling normal splicing of the intron. Therefore, the analysis approach is successful in identifying branchpoint variants and in determining their pathogenicity based on the availability of an alternative branchpoint that could rescue normal splicing.
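- The search for compensating alternative branchpoints described above can be sketched as a sliding-window scan of the intron's 3′ region. This is a hedged illustration, not the tool of the embodiments: the distance window (`min_dist`, `max_dist`), the rescue threshold `min_score`, the distance convention (1 = last intronic base), and the function names are all assumptions, and the scoring function is passed in as `score_fn` so that any PWM scorer can be plugged in.

```python
BP_INDEX = 6      # 0-based index of the branchpoint 'A' within the 10-mer
SITE_LENGTH = 10


def candidate_branch_sites(intron_3p, score_fn, min_dist=15, max_dist=50):
    """List (distance, site, score) for every 10-mer with 'A' at the branchpoint index.

    intron_3p is the intron sequence ending at the last intronic base.
    distance is the number of bases from the branchpoint 'A' to the intron end.
    Results are sorted by score, best first.
    """
    hits = []
    n = len(intron_3p)
    for start in range(n - SITE_LENGTH + 1):
        site = intron_3p[start:start + SITE_LENGTH]
        if site[BP_INDEX] != "A":
            continue
        distance = n - (start + BP_INDEX)
        if min_dist <= distance <= max_dist:
            hits.append((distance, site, score_fn(site)))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


def has_rescue_branchpoint(mutated_intron_3p, score_fn, min_score):
    """True if any remaining candidate reaches min_score after the mutation."""
    return any(
        score >= min_score
        for _, _, score in candidate_branch_sites(mutated_intron_3p, score_fn)
    )
```

A variant that destroys the top-ranked site but leaves another candidate above the chosen threshold would, under this sketch, be treated as likely rescued, which mirrors the reasoning applied to the negative set.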
- As observed in the present examples, the present system and method proved successful in identifying variants that disrupted a branchpoint and led to the creation of a new splice acceptor at that site (Component of Oligomeric Golgi Complex 6 (COG6), Glucosidase Alpha, Acid (GAA)). It was also successful in identifying a putative splice acceptor site downstream of the canonical site upon creation of a new branchpoint at the canonical splice acceptor site as a result of the variation. In total, 40 variants with the potential to be branch site or splice site mutations were identified, and their role in causing splicing aberrations was predicted with the aid of the designed tool. It was observed that a few of the mutations did not affect the reading frame of the protein but were highly deleterious; for such cases, attributes like protein structure and function were checked. For AGXT, Acyl-CoA Dehydrogenase Family Member 9 (ACAD9), GHR and MYO15A, although the single nucleotide polymorphism (SNP) did not change the reading frame of the protein, it caused deletion of part of the active site of the protein, impairing or abolishing its function and leading to a disease condition. It was also noted that in certain cases, such as phosphatase and tensin homologue (PTEN), where exon skipping or partial exon deletion was predicted, the protein is either truncated or rendered non-functional by deletion of its active site. Overall, SNPs that affect the translational frame of the protein lead to pathogenicity most likely due to a truncated protein product, whereas SNPs that do not affect the translational frame lead to pathogenicity due to alteration of core regions of the protein. 
The dataset obtained from screening putative branchpoint mutations was compared against the Human Splicing Finder dataset of identified putative branchpoints and against the predicted results for identified branchpoint variants, which confirmed that the PWM-based analysis model is reliable for branchpoint prediction and for investigating splicing aberrations resulting from a branch site mutation or splice site mutation.
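- The distinction drawn above between variants that change the translational frame and those that do not reduces to a divisibility check on the number of coding nucleotides lost through exon skipping or partial exon deletion. The helper below is a minimal sketch with a hypothetical name; it does not assess protein structure or active-site loss, which the text notes must be checked separately for in-frame cases.

```python
def frame_effect(nt_removed):
    """Classify the loss of nt_removed coding nucleotides from the mature mRNA."""
    if nt_removed < 0:
        raise ValueError("nt_removed must be non-negative")
    if nt_removed == 0:
        return "unchanged"
    # Codons are three nucleotides long, so only multiples of 3 keep the frame.
    return "in-frame deletion" if nt_removed % 3 == 0 else "frameshift"
```

For example, with illustrative lengths that are not taken from the disclosure, skipping a 90 nt exon would be reported as an in-frame deletion, while losing 100 nt would be reported as a frameshift.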
- Therefore, the PWM-based approach is designed to screen for variants in putative branch sites with ‘A’ as the branchpoint in any given sequence and to determine the effect of a branch site mutation on the splicing of the intron.
- The embodiments of the present system and method are capable of identifying branchpoint variants and, along with other established tools that determine various aspects of splice sites, were successful in offering a more detailed biological explanation of the consequences of mutations. Also, the discovery cases identified using the present embodiments hold strong potential for unveiling the cause behind known pathogenic conditions and provide a basis for therapeutic developments. Prediction of putative branchpoint or splice site variants in an intron can lay the foundation for identifying possible genotype-based therapies using exon-skipping techniques (TABLE 7).
-
TABLE 7
Chromosome | Gene | Intron | Identified BP | BP Position | Predicted Position | Predicted BP Score | Predicted Alternative BP | Alternative BP Position | Alternative BP Score
---|---|---|---|---|---|---|---|---|---
2 | DYSF†,* | 31 | CCACTCACTC | 71817308 | −33 | 5.568 | - | - | -
3 | XPC†,*,‡ | 3 | TTACTGATTT | 14209904 | −24 | 4.51 | - | - | -
5 | FBN2† | 30 | CTCTACATTC | 127680226 | −24 | 2.052 | TATATCAACC | −36 | 2.637
9 | COL5A1†,‡ | 32 | AGAGTGACTG | 137686901 | −27 | 3.246 | TGACTGACCA | −23 | 4.677
11 | TH†,‡ | 11 | GGGCTGATGC | 2187015 | −22 | 4.206 | - | - | -
13 | RB1† | 23 | TTACTAATTG | 49047470 | −26 | 3.608 | TATTTCATCT | −15 | 4.383
16 | LCAT†,*,‡ | 4 | GCCCTGACCC | 67976510 | −20 | 5.743 | - | - | -
16 | PMM2† | 2 | ATTCTAAGTG | 8898599 | −25 | 3.096 | - | - | -
16 | PMM2† | 7 | GCCTTCATCT | 8941558 | −23 | 4.917 | - | - | -
16 | TSC2†,‡ | 39 | GGCGTGACCA | 2138031 | −18 | 3.761 | - | - | -
17 | GH1† | 3 | CAGCACAGCC | 61995310 | −26 | 2.026 | - | - | -
17 | ITGB4† | 31 | TGGCTCACTC | 73748510 | −17 | 5.786 | - | - | -
18 | NPC1†,‡ | 6 | CCACTAATGC | 21137182 | −28 | 3.201 | TTCTTCACTT | −15 | 5.201
19 | LDLR†,‡ | 9 | GCGCTGATGC | 11224186 | −25 | 4.116 | - | - | -
X | F9‡ | 2 | CCGTTAATTT | 138619496 | −25 | 2.85 | - | - | -
X | L1CAM‡ | 19 | TATCCAAGTC | 153131293 | −19 | 1.301 | CAAGTCACTG | −15 | 3.642
 | | | | | | | GGCTCTATCC | −24 | 2.071
- †: Branchpoints predicted by Human Splicing Finder (HSF)
- *: Branchpoints confirmed by Mercer et al.
- ‡: Branchpoint variants predicted by Kralovicova, J. et al.
- -: Same branchpoint predicted by other tools and by the present tool of interest
- Identified BP: Branchpoints predicted/confirmed by other tools
- Predicted alternative BP: Predicted branchpoint with a higher potential according to the present prediction tool
-
FIG. 4 is a block diagram of an exemplary computer system 401 for implementing embodiments consistent with the present disclosure. The computer system 401 may be implemented standalone or in combination with components of the system 102 (FIG. 1). Variations of computer system 401 may be used for implementing the devices included in this disclosure. Computer system 401 may comprise a central processing unit (“CPU” or “hardware processor”) 402. The hardware processor 402 may comprise at least one data processor for executing program components for executing user- or system-generated requests. The processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. The processor may include a microprocessor, such as AMD Athlon™, Duron™ or Opteron™, ARM's application, embedded or secure processors, IBM PowerPC™, Intel's Core, Itanium™, Xeon™, Celeron™ or other line of processors, etc. The processor 402 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.
- Processor 402 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 403. The I/O interface 403 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
- Using the I/O interface 403, the computer system 401 may communicate with one or more I/O devices. For example, the input device 404 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc. Output device 405 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc. In some embodiments, a transceiver 406 may be disposed in connection with the processor 402. The transceiver may facilitate various types of wireless transmission or reception. For example, the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.
- In some embodiments, the processor 402 may be disposed in communication with a communication network 408 via a network interface 407. The network interface 407 may communicate with the communication network 408. The network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 408 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc. Using the network interface 407 and the communication network 408, the computer system 401 may communicate with devices.
- In some embodiments, the processor 402 may be disposed in communication with one or more memory devices (e.g., RAM 413, ROM 414, etc.) via a storage interface 412. The storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.
- The memory devices may store a collection of program or database components, including, without limitation, an operating system 416, user interface application 417, user/application data 418 (e.g., any data variables or data records discussed in this disclosure), etc. The operating system 416 may facilitate resource management and operation of the computer system 401. Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like. User interface 417 may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 401, such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems' Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.
- In some embodiments, computer system 401 may store user/application data 418, such as the data, variables, records, etc. as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among the various computer systems discussed above in this disclosure. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
- Additionally, in some embodiments, the server, messaging and instructions transmitted or received may emanate from hardware, including operating system, and program code (i.e., application code) residing in a cloud implementation. Further, it should be noted that one or more of the systems and methods provided herein may be suitable for cloud-based implementation. For example, in some embodiments, some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.
- The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
- It is to be understood that the scope of the protection is extended to such a program and, in addition, to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method when the program runs on a server, a mobile device, or any suitable programmable device. The hardware device can be any kind of device that can be programmed, including, e.g., any kind of computer such as a server or a personal computer, or any combination thereof. The device may also include means which could be, e.g., hardware means such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
- The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims (when included in the specification), the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
- Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
- It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.
Claims (15)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN201821025433 | 2018-07-07 | ||
IN201821025433 | 2018-07-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
US20200152288A1 true US20200152288A1 (en) | 2020-05-14 |
Family
ID=67184885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/504,184 Pending US20200152288A1 (en) | 2018-07-07 | 2019-07-05 | System and method for predicting effect of genomic variations on pre-mrna splicing |
Country Status (4)
Country | Link |
---|---|
US (1) | US20200152288A1 (en) |
EP (1) | EP3745406A1 (en) |
JP (1) | JP7453754B2 (en) |
CN (1) | CN110689928A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113215248A (en) * | 2021-06-25 | 2021-08-06 | 中国人民解放军空军军医大学 | MyO15A gene mutation detection kit related to sensorineural deafness |
WO2022059886A1 (en) * | 2020-09-21 | 2022-03-24 | 주식회사 쓰리빌리언 | System for predicting pathogenicity of genetic mutation by using machine learning |
WO2022203705A1 (en) * | 2021-03-26 | 2022-09-29 | Genome International Corporation | A precision medicine portal for human diseases |
CN115579060A (en) * | 2022-12-08 | 2023-01-06 | 国家超级计算天津中心 | Gene locus detection method, device, equipment and medium |
WO2023183422A1 (en) * | 2022-03-24 | 2023-09-28 | Genome International Corporation | Identifying genome features in health and disease |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6931860B2 (en) * | 2019-02-08 | 2021-09-08 | 株式会社Zenick | Pre-mRNA analysis method, information processing device, computer program |
CN113035272B (en) * | 2021-03-08 | 2023-09-05 | 深圳市新合生物医疗科技有限公司 | Method and device for obtaining immunotherapeutic new antigen based on intein cell variation |
CN113241123B (en) * | 2021-04-19 | 2024-02-02 | 西安电子科技大学 | Method and system for fusing multiple characteristic recognition enhancers and intensity thereof |
CN113838522A (en) * | 2021-09-14 | 2021-12-24 | 浙江赛微思生物科技有限公司 | Evaluation processing method for influence of gene mutation sites on splicing possibility |
CN114613431A (en) * | 2021-11-22 | 2022-06-10 | 赛业(广州)生物科技有限公司 | Prediction method, system and platform for influencing mRNA splicing based on base mutation |
CN115691662B (en) * | 2022-11-08 | 2023-06-23 | 温州谱希医学检验实验室有限公司 | Method and system for sequencing myopia/high myopia-related SNP risks based on allosteric probability |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130096838A1 (en) * | 2011-06-10 | 2013-04-18 | William Fairbrother | Gene Splicing Defects |
WO2013017982A1 (en) * | 2011-08-01 | 2013-02-07 | Basf Plant Science Company Gmbh | Method for identification and isolation of terminator sequences causing enhanced transcription |
US20140199698A1 (en) * | 2013-01-14 | 2014-07-17 | Peter Keith Rogan | METHODS OF PREDICTING AND DETERMINING MUTATED mRNA SPLICE ISOFORMS |
US10266828B2 (en) * | 2013-12-16 | 2019-04-23 | Syddansk Universitet | RAS exon 2 skipping for cancer treatment |
LU93116B1 (en) * | 2016-06-22 | 2018-01-24 | Univ Luxembourg | Means and methods for treating parkinson's disease |
2019
Filing Date | Country | Application Number | Publication Number | Status
---|---|---|---|---
2019-07-05 | EP | EP19184695.5A | EP3745406A1 | Pending
2019-07-05 | US | US16/504,184 | US20200152288A1 | Pending
2019-07-08 | JP | JP2019126722A | JP7453754B2 | Active
2019-07-08 | CN | CN201910612239.7A | CN110689928A | Pending
Non-Patent Citations (5)
Title |
---|
Desmet, F. O., Hamroun, D., Lalande, M., Collod-Béroud, G., Claustres, M., & Béroud, C. (2009). Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic acids research, 37(9), e67, p.1-14. (Year: 2009) * |
François-Olivier Desmet, Dalil Hamroun, Gwenaëlle Collod-Béroud, Mireille Claustres, Christophe Béroud. Bioinformatics identification of splice site signals and prediction of mutation effects. RM Mohan. Research Advances In Nucleic Acids Research, Global Research Network Publishers, pp.1-14. 2010. (Year: 2010) * |
Furdon, P. J., & Kole, R. (1986). Inhibition of splicing but not cleavage at the 5'splice site by truncating human beta-globin pre-mRNA. Proceedings of the National Academy of Sciences, 83(4), 927-931. (Year: 1986) * |
Hubbard, T. J., Aken, B. L., Beal, K., Ballester, B., Cáccamo, M., Chen, Y., ... & Birney, E. (2007). Ensembl 2007. Nucleic acids research, 35(suppl_1), D610-D617. (Year: 2007) * |
Sheth, N., Roca, X., Hastings, M. L., Roeder, T., Krainer, A. R., & Sachidanandam, R. (2006). Comprehensive splice-site analysis using comparative genomics. Nucleic acids research, 34(14), 3955–3967. (Year: 2006) * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022059886A1 (en) * | 2020-09-21 | 2022-03-24 | 주식회사 쓰리빌리언 | System for predicting pathogenicity of genetic mutation by using machine learning |
WO2022203705A1 (en) * | 2021-03-26 | 2022-09-29 | Genome International Corporation | A precision medicine portal for human diseases |
WO2022203704A1 (en) * | 2021-03-26 | 2022-09-29 | Genome International Corporation | A unified portal for regulatory and splicing elements for genome analysis |
CN113215248A (en) * | 2021-06-25 | 2021-08-06 | 中国人民解放军空军军医大学 | MyO15A gene mutation detection kit related to sensorineural deafness |
WO2023183422A1 (en) * | 2022-03-24 | 2023-09-28 | Genome International Corporation | Identifying genome features in health and disease |
CN115579060A (en) * | 2022-12-08 | 2023-01-06 | 国家超级计算天津中心 | Gene locus detection method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
EP3745406A1 (en) | 2020-12-02 |
JP2020038621A (en) | 2020-03-12 |
JP7453754B2 (en) | 2024-03-21 |
CN110689928A (en) | 2020-01-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | STPP | Information on status: patent application and granting procedure in general | Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER
 | STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
 | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED