US20200152288A1 - System and method for predicting effect of genomic variations on pre-mrna splicing - Google Patents

System and method for predicting effect of genomic variations on pre-mRNA splicing

Info

Publication number
US20200152288A1
Authority
US
United States
Prior art keywords
branchpoint
splice acceptor
natural
acceptor site
candidate variant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US16/504,184
Inventor
Rajgopal Srinivasan
Akriti JAIN
Poulami CHAUDHURI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tata Consultancy Services Ltd
Original Assignee
Tata Consultancy Services Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Ltd
Publication of US20200152288A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Definitions

  • the disclosure herein generally relates to mRNA splicing, and, more particularly, predicting effect of genomic variations on pre-mRNA splicing.
  • RNA splicing is a process of cutting introns out of pre-mRNA and stitching together exons to form a final nucleotide sequence that is the mRNA sequence that codes for proteins.
  • branchpoint (BP) selection and splice site (SS) selection are key steps in RNA splicing, yet many popular splicing analysis tools do not model this mechanism. If there is a mutation in proximity to an intron's primary branch point, that branchpoint may become unusable.
  • a processor implemented method for predicting effect of genomic variations on pre-mRNA splicing includes receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. Further includes classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of at least one candidate variant. Further includes evaluating effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant.
  • evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using MaxEnt score, determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site and evaluating, in response to determining that the new splice acceptor site region being created, strength of an identified natural branch point in the classified region using a PWM evaluator. Further includes predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
  • a system for predicting effect of genomic variations on pre-mRNA splicing includes a memory storing instructions and one or more hardware processors coupled to the memory, wherein the one or more hardware processors are configured by the instructions to: receive genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. Further, to classify the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of at least one candidate variant.
  • the evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score, determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site and evaluating, in response to determining that the new splice acceptor site region being created, strength of an identified natural branch point in the classified region using a PWM evaluator. Further to predict pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on pre-mRNA splicing.
  • one or more non-transitory machine readable information storage mediums comprises one or more instructions which when executed by one or more hardware processors causes receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. Further includes classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of at least one candidate variant. Further includes evaluating effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant.
  • evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using MaxEnt score, determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site and evaluating, in response to determining that the new splice acceptor site region being created, strength of an identified natural branch point in the classified region using a PWM evaluator. Further includes predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
  • any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter.
  • any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computing device or processor, whether or not such computing device or processor is explicitly shown.
  • FIG. 1 illustrates a network environment implementing a system 102 for predicting effect of genomic variations on pre-mRNA splicing, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is a flow diagram illustrating a method for predicting effect of genomic variations on pre-mRNA splicing, according to an embodiment of the present disclosure.
  • FIGS. 3A, 3B and 3C illustrate an analysis pipeline for predicting effect of genomic variations on pre-mRNA splicing, in accordance with an embodiment of the present disclosure.
  • FIG. 4 illustrates a block diagram of a system for predicting effect of genomic variations on pre-mRNA splicing, in accordance with some embodiments of the present disclosure.
  • Splicing forms a crucial part of pre-mRNA maturation process as accurate excision of introns and joining of exons are essential to eukaryotic gene expression.
  • parts of the pre-mRNA are removed by the spliceosome within the nucleus before the mature mRNA is transported to the cytoplasm for translation.
  • pre-mRNA is differently spliced, leading to alternative transcripts, i.e., expression of different proteins from the same gene. More than 70% of protein-coding human genes are alternatively spliced, and alternative splicing has been proposed to be the major cause of the evolution of phenotypic complexity in mammals.
  • Exon skipping is the most common outcome of splicing mutations, followed by activation of cryptic 5′ and 3′ splice sites (5′SS and 3′SS). Exon skipping is due to disruption of natural splice acceptor site or abolishment of the natural branchpoint with no alternative branchpoint available to facilitate splicing. Efficient splicing requires at least three major signals within introns, the 5′ splice site, 3′ splice site and the branchpoint sequence. Auxiliary sequences in introns and exons known as splicing enhancers and silencers act in conjunction to decide splicing to be constitutive or alternative. The 5′ end of the intron is known as splice donor site and 3′ end of the intron is referred as splice acceptor site.
  • the divergence from the prototype sequences are associated with alternative transcript generation. Occurrence of such consensus sequences within the introns is quite common in the case of higher eukaryotes framing pseudoexons, indicating the presence of the splice boundaries but insufficient for regulating correct splicing.
  • the 3′ end is characterized by presence of the splice acceptor site, branchpoint sequence upstream and the polypyrimidine tract immediately following the branchpoint sequence.
  • Branchpoints are defined on the basis of four major criteria: they are proximal to the 3′ splice end of the intron, the branchpoint sequence is followed by a polypyrimidine tract, there is a depletion of the ‘AG’ dinucleotide between the branchpoint sequence and the 3′ splice site, and the branchpoint is mostly an adenine. The selection and accurate prediction of branchpoint variants and splice site variants from candidate variants of existing databases of known human gene transcripts is therefore of prime importance and challenging.
  • Various embodiments of the present disclosure provide a method and system for predicting, with high accuracy, the effect of genomic variations on pre-mRNA splicing based on the MaxEnt tool and a Position Weight Matrix (PWM) evaluator, usable in a resource constrained environment.
  • the disclosed system includes a variant pipeline which works in real-time in a resource constrained environment or near real-time on CPU.
  • the disclosed system and method provide a solution for predicting effect of genomic variations on pre-mRNA splicing. A detailed description of the above described system and method for predicting the effect of genomic variations on pre-mRNA splicing is shown with respect to illustrations represented with reference to FIGS. 1 through 4.
  • Referring now to FIGS. 1 through 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments, and these embodiments are described in the context of the following exemplary system and method for predicting effect of genomic variations on pre-mRNA splicing.
  • the system 102 may receive inputs, for example, via multiple devices and/or machines 104-1, 104-2 ... 104-N, collectively referred to as devices 104 hereinafter.
  • the devices 104 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, VR camera embodying devices, storage devices equipped to receive and store inputs and outputs.
  • the devices 104 may include devices capable of capturing and storing data.
  • the devices 104 are communicatively coupled to the system 102 through a network 106, and may be capable of transmitting the data to the system 102.
  • the network 106 may be a wireless network, a wired network or a combination thereof.
  • the network 106 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like.
  • the network 106 may either be a dedicated network or a shared network.
  • the shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another.
  • the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
  • the devices 104 may send input to the system 102 via the network 106.
  • the system 102 is caused to predict effect of genomic variations on pre-mRNA splicing.
  • the system 102 may be embodied in a computing device 110.
  • Examples of the computing device 110 may include, but are not limited to, a desktop personal computer (PC), a notebook, a laptop, a portable computer, a smart phone, a tablet, and the like.
  • the system 102 may also be associated with a data repository 112 to store inputs, dataset and output/resultant. Additionally or alternatively, the data repository 112 may be configured to store data and/or information generated during predicting effect of genomic variations on pre-mRNA splicing.
  • the repository 112 may be configured outside and communicably coupled to the computing device 110 embodying the system 102. Alternatively, the data repository 112 may be configured within the system 102.
  • the disclosed system 102 enables predicting effect of genomic variations on pre-mRNA splicing, thereby resulting in high accuracy of predicting pathogenicity and determining branchpoint variants and their pathogenicity based on the availability of alternative branchpoint which could rescue normal splicing.
  • An example representation of the pipeline of the method for predicting effect of genomic variations on pre-mRNA splicing is shown and described further with reference to FIGS. 3A-3C.
  • the method 200 may be described in the general context of computer executable instructions.
  • computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types.
  • the method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network.
  • the order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method.
  • the method 200 can be implemented in any suitable hardware, software, firmware, or combination thereof.
  • the method 200 depicted in the flow chart may be executed by a system, for example, the system 102 of FIG. 1.
  • the system 102 may be embodied in an exemplary computer system, for example computer system 102.
  • the method 200 of FIG. 2 will be explained in more detail below with reference to FIGS. 3A-3C.
  • the method 200 is initiated at 202, where genomic position information, reference allele and alternate allele corresponding to a specified version of the human genome are received.
  • the at least one candidate transcript that fully contains the variant is obtained from existing databases of known human gene transcripts (herein, referred to as at least one variant).
  • Each transcript is represented as a set of one or more non-overlapping intervals, where each interval is represented by four features that include the chromosome on which the transcript is present, the starting genomic coordinate of the interval, the ending genomic coordinate of the interval, and the strand on which the transcript is present (forward or reverse).
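The interval representation described above can be sketched as a small data structure. This is a minimal illustration only: the class name `Interval`, the helper names, the 1-based inclusive coordinates, and the reading of each interval as an exon are assumptions made here, not details stated in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    """One transcript interval with the four features named in the text."""
    chrom: str    # chromosome on which the transcript is present
    start: int    # starting genomic coordinate (assumed 1-based, inclusive)
    end: int      # ending genomic coordinate (assumed inclusive)
    strand: str   # "+" for forward, "-" for reverse


def transcript_span_contains(intervals: List[Interval], chrom: str, pos: int) -> bool:
    """Return True if pos lies within the overall span of the transcript.

    Assumption: "fully contains the variant" is read as span containment,
    so that intronic positions between intervals also count.
    """
    same = [iv for iv in intervals if iv.chrom == chrom]
    if not same:
        return False
    return min(iv.start for iv in same) <= pos <= max(iv.end for iv in same)


def introns(intervals: List[Interval]) -> List[Tuple[int, int]]:
    """Return the gaps between consecutive intervals as (start, end) pairs."""
    ordered = sorted(intervals, key=lambda iv: iv.start)
    return [(a.end + 1, b.start - 1)
            for a, b in zip(ordered, ordered[1:])
            if b.start - a.end > 1]


# Hypothetical two-interval transcript on the forward strand.
tx = [Interval("chr1", 100, 200, "+"), Interval("chr1", 301, 400, "+")]
print(transcript_span_contains(tx, "chr1", 250))
print(introns(tx))
```

Keeping the strand on every interval mirrors the four-feature description, although a single strand per transcript would be sufficient in practice.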
  • the at least one candidate variant is classified as occurring in one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. Further, the at least one candidate variant is classified as the splice acceptor site region occurring in genomic coordinate between 15 nucleotides upstream to 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts and as the branch site region occurring in genomic coordinate between 50 nucleotides to 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts.
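The coordinate-based classification above can be sketched as follows. The function name, the convention that `acceptor_pos` is the first exonic base after the intron-exon junction, and the choice to assign the shared boundary at 15 nucleotides upstream to the splice acceptor region are all assumptions made for illustration; the disclosure only gives the two ranges.

```python
from typing import Optional


def classify_variant(variant_pos: int, acceptor_pos: int, strand: str) -> Optional[str]:
    """Classify a variant relative to a natural splice acceptor junction.

    acceptor_pos is assumed to be the genomic coordinate of the first
    exonic base downstream of the junction. Offsets are measured in
    transcript orientation: negative is upstream (intronic), zero or
    positive is downstream (exonic).
    """
    if strand == "+":
        offset = variant_pos - acceptor_pos
    else:
        # On the reverse strand, upstream means a larger genomic coordinate.
        offset = acceptor_pos - variant_pos

    # 15 nt upstream (offsets -15..-1) to 3 nt downstream (offsets 0..2).
    if -15 <= offset <= 2:
        return "splice_acceptor_site_region"
    # 50 nt to 15 nt upstream; the boundary at -15 is handled above (assumption).
    if -50 <= offset < -15:
        return "branch_site_region"
    return None


print(classify_variant(990, 1000, "+"))   # 10 nt upstream on the forward strand
print(classify_variant(970, 1000, "+"))   # 30 nt upstream on the forward strand
print(classify_variant(1010, 1000, "-"))  # 10 nt upstream on the reverse strand
```

Measuring the offset in transcript orientation keeps the two range checks identical for both strands.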
  • the terms ‘nucleotide’ and ‘nt’ are used interchangeably.
  • effect of the at least one candidate variant on pre-mRNA splicing is evaluated based on a classified region from the classification of the at least one candidate variant.
  • the evaluation is performed by identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score and then determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site. Thereafter, in response to determining that the new splice acceptor site region is being created, strength of an identified natural branch point in the classified region is evaluated using a Position Weight Matrix (PWM) evaluator.
  • the MaxEnt is a known splice site strength determination tool for calculating strength or weakening of the splice acceptor site, wherein the MaxEnt tool assigns a MaxEnt score based on the effect of the at least one candidate variant on affected natural splice acceptor site region.
  • the available MaxEntScan tool is used to calculate the splice acceptor site scores for both the canonical splice sites, which are the naturally occurring splice sites (natural splice acceptor site region), and cryptic splice sites, which are splice sites activated by a mutation.
  • the PWM evaluator is generated using experimentally determined human branch sites.
  • the PWM is generated using a set of 59,359 experimentally determined human branch sites (10 mers), identified based on exoribonuclease digestion and RNA-seq.
  • a set of branch point sites is utilized by selecting only the sequences with ‘A’ at the branchpoint as the training set for the Position weight matrix (PWM).
  • ‘A’ is chosen as the branchpoint since sites with ‘C’, ‘T’ or ‘G’ as the branchpoint have very low median scores, while the known ‘A’ has the highest value, suggesting that the PWM generated, in accordance with present embodiments, is selective towards ‘A’ as a branchpoint and that it is ideal to restrict the PWM scoring to ‘A’. Therefore the PWM was built using the known ‘A’ as the branchpoint.
  • a PWM matrix of (m*n) is created by aligning the experimentally determined 59,359 human branch sites (10 mers) with ‘A’ as the branchpoint. In the present embodiment a matrix of (10*4) is created. The alignment is then used to calculate the frequency of each nucleotide at each position of the 10 mers, and thereafter the frequencies of each nucleotide are converted to log odds scores.
  • 175,031 unique introns from 18,171 canonical transcripts from Gencode database v19 are identified and extracted with the filtering criterion of being surrounded by coding exons on both sides.
  • the frequency of each nucleotide (A, T, C, G) across all the introns is used to normalize the raw frequencies of the bases in the training set of branch points. As described above, the normalized frequencies are converted to log odds scores to generate the final PWM.
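The PWM construction described above (per-position frequencies, normalization by intron-wide base frequencies, conversion to log odds) can be sketched as below. The base-2 logarithm, the pseudocount, the function names, and the tiny toy sequences are assumptions introduced for illustration; the disclosure does not state these details and the toy data are not the 59,359 experimental sites.

```python
import math
from collections import Counter

BASES = "ACGT"


def background_frequencies(introns):
    """Frequency of each nucleotide across all intron sequences."""
    counts = Counter()
    for seq in introns:
        counts.update(b for b in seq if b in BASES)
    total = sum(counts[b] for b in BASES)
    return {b: counts[b] / total for b in BASES}


def build_pwm(sites, background, pseudocount=1.0):
    """Build a log-odds PWM from aligned, equal-length branch site k-mers.

    The pseudocount avoids log(0) for unseen bases (an assumption made here).
    """
    length = len(sites[0])
    denom = len(sites) + pseudocount * len(BASES)
    pwm = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / denom) / background[b])
            for b in BASES
        })
    return pwm


def score(pwm, kmer):
    """Sum the log-odds contribution of each base in the k-mer."""
    return sum(column[base] for column, base in zip(pwm, kmer))


# Toy, made-up 10-mers with 'A' at index 5 standing in for the training set.
toy_sites = ["TACTAACATT", "TTCTGACTTT", "CTCTAACTCT", "TGCTCACTTC"]
toy_introns = ["GTAAGTTTCTAACTTTTCTTTCAG", "GTGAGCCTTGACTCTCCTTTGCAG"]

bg = background_frequencies(toy_introns)
pwm = build_pwm(toy_sites, bg)
print(len(pwm), "positions x", len(BASES), "bases")
print(score(pwm, "TTCTAACTTT") > score(pwm, "GGGGGGGGGG"))
```

With four bases and ten positions, `pwm` corresponds to the (10*4) matrix mentioned in the text.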
  • the first quartile of the distribution is calculated and is used as a threshold for classifying a site to be a high confidence branch site. In an example embodiment, the determined threshold is 1.46.
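A first-quartile threshold of the kind described above can be computed as in the following sketch. The function names and the toy score list are hypothetical, and the choice of Python's default "exclusive" quantile method is an assumption; the disclosure does not say how the quartile was interpolated.

```python
import statistics


def first_quartile_threshold(training_scores):
    """Return the first quartile of the PWM scores of the training sites."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    return statistics.quantiles(training_scores, n=4)[0]


def is_high_confidence(site_score, threshold):
    """A site is treated as high confidence when it scores above the threshold."""
    return site_score > threshold


# Toy scores only; the example embodiment reports a threshold of 1.46.
toy_scores = [1, 2, 3, 4, 5, 6, 7]
threshold = first_quartile_threshold(toy_scores)
print(threshold, is_high_confidence(3.2, threshold))
```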
  • a 40 mer intronic sequence, 10 to 50 bases upstream from the 3′ end of each intron is extracted from the human genome and scanned for 10 mer sequences scoring above the branchpoint threshold.
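The window extraction and scan described above can be sketched as follows. The function names are hypothetical, and `toy_score` is only a stand-in for a trained PWM scorer so that the example runs on its own; the default threshold of 1.46 is the value from the example embodiment.

```python
def upstream_window(intron, near=10, far=50):
    """Return the bases 10 to 50 nt upstream of the 3' end of an intron.

    The intron is given 5'->3'; for introns of at least 50 nt this is a 40-mer.
    """
    return intron[-far:-near]


def scan_branch_sites(window, score_fn, threshold=1.46, k=10):
    """Slide a k-mer window and keep every k-mer scoring above the threshold."""
    hits = []
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        s = score_fn(kmer)
        if s > threshold:
            hits.append((i, kmer, s))
    return hits


def toy_score(kmer):
    """Stand-in scorer (assumption): rewards a CTAAC core at a fixed offset."""
    return 2.0 if kmer[3:8] == "CTAAC" else 0.0


# Made-up intron with one branch-site-like motif.
intron = "GT" + "T" * 30 + "TTTCTAACTT" + "T" * 28 + "AG"
window = upstream_window(intron)
print(len(window))
print(scan_branch_sites(window, toy_score))
```

Passing the scorer as a function keeps the scan independent of how the PWM was built.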
  • pathogenicity of the at least one candidate variant is predicted based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing. Further evaluation and predicting pathogenicity of the at least one candidate variant is further described in detail in reference to FIGS. 3A-3C .
  • FIGS. 3A-3C illustrate the analysis pipeline for the method of predicting pathogenicity on the pre-mRNA splicing.
  • the analysis pipeline is designed to categorize a variant as pathogenic or non-pathogenic.
  • the analysis approach in accordance with the present embodiments follows a step by step pipeline represented by FIGS. 3A-3C .
  • variants that were in close proximity, that is up to 15 nucleotides upstream of the canonical splice acceptor region, are screened for creation of a new cryptic acceptor site or creation of a new branch site. If a branch site is created, then a suitable downstream splice acceptor site scan is initiated.
  • a suitable upstream branch site is scanned for using the PWM evaluator. If the variant disrupted the canonical splice acceptor and the canonical branch site is unaffected, then the screening for a suitable alternative downstream splice acceptor is performed. If a new canonical splice acceptor was predicted downstream of the canonical splice acceptor site, then a screening for an experimentally proven branchpoint is performed using the PWM tool. The detailed step by step process of the pipeline is described in FIGS. 3A-3C.
  • a variant 302, for example, at least one candidate variant, is received along with genomic position information, reference allele and alternate allele corresponding to a specified version of the human genome.
  • the at least one candidate transcript that fully contains the variant is obtained from existing databases of known human gene transcripts (herein, referred to as at least one variant).
  • Each transcript is represented as a set of one or more non-overlapping intervals, where each interval is represented by four features that include the chromosome on which the transcript is present, the starting genomic coordinate of the interval, the ending genomic coordinate of the interval, and the strand on which the transcript is present (forward or reverse).
  • the at least one candidate variant is classified as occurring in splice affecting region based on genomic coordinate.
  • region occurring in genomic coordinate between 15 nucleotides upstream to 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts is classified as splice acceptor site.
  • weakening of the natural splice acceptor site is determined and, in other words, it is determined that the classified at least one candidate variant is affecting the natural splice acceptor site (natural 3′SS).
  • the classified at least one candidate variant is checked for creating a new ‘AG’ that is a new 3′SS thereby weakening of the natural 3′SS as determined using MaxEnt score.
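The check for a newly created ‘AG’ dinucleotide can be sketched for a single-nucleotide substitution as below. The function name and the restriction to substitutions are assumptions; the MaxEnt scoring of the natural and new 3′SS is performed by the external MaxEnt tool and is not reproduced here.

```python
def creates_new_ag(ref_seq, idx, alt):
    """Return True if substituting alt at index idx creates an 'AG' absent in ref_seq.

    ref_seq is the reference sequence in transcript orientation and idx is
    the 0-based index of the substituted base (single-nucleotide variants only).
    """
    alt_seq = ref_seq[:idx] + alt + ref_seq[idx + 1:]
    # Only the two dinucleotides overlapping the substituted base can change.
    for start in (idx - 1, idx):
        if start < 0 or start + 2 > len(ref_seq):
            continue
        if alt_seq[start:start + 2] == "AG" and ref_seq[start:start + 2] != "AG":
            return True
    return False


print(creates_new_ag("TTACTT", 3, "G"))  # AC -> AG
print(creates_new_ag("TTAGTT", 1, "C"))  # existing AG is unchanged
```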
  • the at least one candidate variant is checked if natural branchpoint suffices or branches out to block C.
  • determining presence or absence of natural branchpoint in sequence range of 15 nucleotide to 50 nucleotide of the new splice acceptor site region being active during the pre-mRNA splicing.
  • Thereafter, strength of the natural branchpoint is evaluated using the PWM evaluator, and the at least one candidate variant is identified as pathogenic ( 312 ) based on the evaluated strength of the natural branchpoint; or an alternative branchpoint is screened for using the PWM evaluator and the at least one candidate variant is predicted as pathogenic based on the evaluated strength of the alternative branchpoint ( 314 ).
  • status of the natural splice acceptor site region is determined. The status herein includes disrupted natural splice acceptor site region or non-disrupted natural splice acceptor site region.
  • the at least one candidate variant is predicted as pathogenic or non-pathogenic ( 318 ) based on the determined status.
  • the at least one candidate variant is classified as occurring in branch site region based on genomic coordinate.
  • region with genomic coordinate between 50 nucleotides to 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts is classified as branch site.
  • weakening of the natural splice acceptor site is determined and, in other words, it is determined that the classified at least one candidate variant is affecting the natural 3′SS.
  • the classified at least one candidate variant is checked for creating a new ‘AG’ that is a new 3′SS thereby weakening of the natural 3′SS in response to the creation of the new 3′SS is determined using MaxEnt score.
  • the effect of the at least one candidate variant on the branch site for the new splice acceptor site being created is evaluated by determining presence of an alternative branchpoint in sequence range 50 nucleotides to 15 nucleotides upstream of the new splice acceptor site.
  • the at least one candidate variant is categorized as pathogenic if no alternative branchpoint is determined. At 338, the at least one candidate variant is predicted as non-pathogenic if an alternative branchpoint is found.
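The decision for a branch site region variant that creates a new splice acceptor site can be sketched as below. This is an illustrative reading only: the function names, the string labels, and the stand-in scorer are assumptions, and the window is taken as the 50 to 15 nucleotides upstream of the new acceptor as stated in the text.

```python
def has_alternative_branchpoint(seq, acceptor_idx, score_fn, threshold=1.46, k=10):
    """Scan the 50 to 15 nt upstream of acceptor_idx for a k-mer above threshold.

    seq is in transcript orientation; acceptor_idx is the 0-based index of
    the first base downstream of the (new) acceptor junction.
    """
    window = seq[max(0, acceptor_idx - 50):max(0, acceptor_idx - 15)]
    return any(score_fn(window[i:i + k]) > threshold
               for i in range(len(window) - k + 1))


def predict_branch_region_variant(seq, new_acceptor_idx, score_fn, threshold=1.46):
    """Pathogenic if no alternative branchpoint supports the new acceptor."""
    if has_alternative_branchpoint(seq, new_acceptor_idx, score_fn, threshold):
        return "non-pathogenic"
    return "pathogenic"


def toy_score(kmer):
    """Stand-in for the PWM evaluator (assumption)."""
    return 2.0 if kmer[3:8] == "CTAAC" else 0.0


without_bp = "T" * 60
with_bp = "T" * 20 + "TTTCTAACTT" + "T" * 30
print(predict_branch_region_variant(without_bp, 60, toy_score))
print(predict_branch_region_variant(with_bp, 60, toy_score))
```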
  • the effect of the at least one candidate variant on the branch site for no new splice acceptor site being created is evaluated by screening for natural branchpoint in sequence range having 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site and determining level of strength of the branch site using the PWM evaluator at 332 .
  • the level of strength is determined due to the at least one candidate variant affecting the screened natural branchpoint.
  • the at least one candidate variant is predicted as pathogenic.
  • the at least one candidate variant is predicted as pathogenic or non-pathogenic ( 338 ) based on an alternative branchpoint screened in sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site region.
  • effect of the at least one candidate variant on the splice acceptor site region for no new splice acceptor site being created is evaluated by sequentially performing the steps at 340 , 342 and 344 .
  • effect of the at least one candidate variant on the natural branchpoint is determined and level of strength of natural branch site using the PWM evaluator is identified based on the determined effect.
  • an alternative splice acceptor site region is screened for in a sequence range spanning 50 nucleotides upstream and 50 nucleotides downstream of the at least one candidate variant, and strength of the alternative splice acceptor site region is compared with that of the weakened natural splice acceptor site region.
  • the at least one candidate variant is predicted as a non-pathogenic variant ( 348 ) or the at least one variant candidate is predicted as a pathogenic variant ( 350 ) or a non-pathogenic variant ( 364 ) based on a screened alternative branchpoint ( 360 ) in the sequence 50 nucleotide to 15 nucleotide upstream to the natural splice acceptor site region.
  • the at least one candidate variant is predicted as non-pathogenic ( 348 ), or presence of a natural branchpoint in the sequence range of 15 nucleotides to 50 nucleotides upstream of the splice acceptor site region active during the mRNA splicing is determined ( 352 ), and thereafter strength of the natural branchpoint is compared with the predefined threshold. Based on the comparison, the at least one candidate variant is predicted as pathogenic ( 350 ).
  • the at least one candidate variant is predicted as pathogenic ( 354 ) or non-pathogenic ( 356 ) based on an alternative branchpoint screened in the sequence range of 50 nucleotides to 15 nucleotides upstream of the alternative splice acceptor site ( 358 ). Further, based on the comparison of strength of the new branchpoint and the natural branchpoint, the at least one candidate variant is predicted as non-pathogenic ( 364 ). If not, presence of a natural branchpoint in the range of 15 nucleotides to 50 nucleotides upstream of the splice acceptor site region active during the mRNA splicing is determined, and thereafter strength of the natural branchpoint is compared with the predefined threshold ( 354 ). Based on the determined presence of the natural branchpoint and the comparison of strength of the natural branchpoint with the predefined threshold, the at least one candidate variant is predicted as pathogenic ( 362 ) or non-pathogenic ( 364 ).
  • the focus of the present system and method is to identify a BP given a random sequence and evaluate the identified BP's role in the functional consequence of splicing of the intron. Further, the focus of the present embodiments is to predict the impact of the evaluated BP on pathogenicity using a combination of PWM and MaxEnt score.
  • There are many tools which can predict a branchpoint, but their main drawback is that they require far more input data while predicting a BP, such as the polypyrimidine tract information, the actual splice acceptor site and the distance to the splice acceptor site region, which restricts such tools from predicting a branchpoint given a random sequence.
  • the present system and method clearly distinguishes between the BP and SS and evaluates a variant based on the combined output from an individual component.
  • In one example embodiment of the system and method for predicting effect of genomic variations on pre-mRNA splicing, a recent set of 59,359 experimentally determined human branch sites (10 mers), identified based on exoribonuclease digestion and RNA-seq, is considered.
  • the dataset offers a comprehensive dataset for training a high accuracy putative BPS prediction model (10).
  • the present example utilizes this set of branch point sites, selecting only the sequences with ‘A’ at the branchpoint as the training set for the Position weight matrix (PWM) evaluator. This is because our goal is to create and evaluate a tool that can be used as part of a routine variant annotation scheme to provide high confidence annotations for further clinical interpretation.
  • Parameters such as the distance of BPS from the 3′ splice end (-15 to -50 nucleotides upstream) of the intron, making sure the BPS (branch point sequence) is part of the intronic region in all transcripts and setting a threshold on the basis of the top 25% scores in the PWM from the training set were chosen to increase the accuracy of the analysis approach.
  • Comparisons to outcomes of other existing prediction tools like HSF (Human Splicing Finder), SVM (Support Vector Machine), BP finder, outputs of machine learning prediction tools, along with experimentally proven BPS mutations have been performed to demonstrate the accuracy of our proposed model.
  • A C>G variant in intron 9 of the Ornithine Carbamoyltransferase coding gene (OTC) was detected upon ClinVar-based variant screening as disrupting the canonical splice acceptor site.
  • Alternative splice acceptor site (MaxEnt: 8.30) was identified 25 bases downstream (in the exonic region) of the canonical splice acceptor junction.
  • the canonical branchsite score: 2.80
  • i.e. 29 bases upstream to the identified cryptic splice acceptor was deemed suitable.
  • A T>C transition was found in intron 14 of the Mannosidase Alpha Class 2B Member 1 gene (MAN2B1), disrupting the canonical splice acceptor site.
  • A cryptic branch site is activated, along with a cryptic splice acceptor (MaxEnt: 4.78) 31 nt downstream of the canonical 3′ splice site. This results in deletion of the first 31 nt of exon 15, a frameshift that introduces a stop codon and causes premature termination of the protein (Table 1).
  • An A>G mutation was found in intron 5.
  • Because the variant is at the canonical splice acceptor site, it has previously been categorized as a splice site mutation, although the role of the variant and its specific effects on splicing have not been defined.
  • The canonical splice acceptor site of intron 5 was disrupted as a consequence of the variation (MaxEnt: 4.01 > -3.94). Due to the disruption of the natural splice acceptor site, a cryptic splice acceptor site (MaxEnt: 5.01) 28 nucleotides downstream of the canonical splice acceptor site was activated.
  • A potential branch site, i.e. 35 bases upstream to the cryptic splice acceptor site, was found.
  • The original splice acceptor site gets disrupted, and a cryptic splice acceptor, along with a cryptic branch point, gets activated downstream to the canonical splice site and canonical branch site (Table 2).
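  • The search for a cryptic splice acceptor downstream of a disrupted canonical site, as in the cases above, can be sketched as a scan for AG dinucleotides. This is only an illustration: the function name, the 50 nt search range, and the coordinate convention are assumptions, and the 23-mer window (20 nt upstream plus 3 nt downstream of the candidate junction) is the input shape commonly used for 3′ MaxEnt scoring, which would be applied by a separate scorer not shown here.

```python
def candidate_cryptic_acceptors(seq, junction, max_offset=50):
    """Return (offset, window) pairs for AG dinucleotides downstream of a junction.

    seq: pre-mRNA sequence in transcript orientation.
    junction: 0-based index of the first exonic base at the canonical acceptor.
    offset: nucleotides between the canonical and the candidate junction.
    window: 23-mer around the candidate junction, ready for an external
            3' splice site scorer (the scorer itself is out of scope here).
    """
    seq = seq.upper()
    candidates = []
    for offset in range(1, max_offset + 1):
        j = junction + offset  # index of the first base after the candidate AG
        if j + 3 > len(seq):
            break  # not enough downstream sequence for a full window
        if j >= 20 and seq[j - 2:j] == "AG":
            candidates.append((offset, seq[j - 20:j + 3]))
    return candidates
```

  • Each returned window would then be scored, and a candidate retained only if a suitable branch site is also found upstream of it.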
  • The resulting protein is 392 a.a. long and loses 9 a.a., i.e. an entire β-strand in the core region, as a result of the SNP.
  • The deleted protein region forms a part of the active site and the homodimer interface of the protein and is essential for pyridoxal 5′ phosphate binding. Therefore the deletion caused by the SNP is highly deleterious, as it causes protein dysfunction.
  • A hypothesis can be drawn based on the occurrence of an alternative splice acceptor with a suitable branch site, leading to aberrant splicing. The pre-termination of the transcript due to the splicing disruption might be a cause of primary hyperoxaluria.
  • A deleterious variant G>A disrupting the canonical splice acceptor site was found upon screening of intron 49 of the MYO15A gene.
  • A cryptic branch site (score: 1.92) was activated at the canonical splice acceptor junction.
  • A cryptic splice acceptor site suitable for the cryptic branch site was activated 27 nt downstream of the canonical splice acceptor (exonic region; MaxEnt: 7.13), with the potential to cause partial skipping of exon 50. Alternatively, complete skipping of exon 50 might occur as a result of using the stronger splice acceptor site of intron 50 (MaxEnt: 8.93) for splicing.
  • the splicing aberration due to disruption of the canonical splice acceptor and the splicing consequences might be the cause behind non-syndromic genetic deafness.
  • the resulting splicing aberrations do not lead to disruption of the frame of the protein but alter the protein region essential for peptide ligand binding with proline rich ligands like SH3 protein.
  • SH3 domains in the protein are essential for intramolecular interactions leading to proper regulation of the enzymes and also in mediating multiprotein complex assemblies. Therefore, even though the frame of the protein is unaffected, essential active regions of the protein are altered leading to a truncated or non-functional protein.
  • the analysis approach was successful in unveiling a hypothesis behind the effect of the intronic variant on splicing of intron 49 in MYO15A gene and the resulting pathogenicity.
  • A splice acceptor variant (G>C) was identified upon screening of intron 8 of the Growth Hormone Receptor (GHR) gene.
  • The variant, being at the splice acceptor site (AG>AC), disrupted the canonical splice acceptor (MaxEnt: 5.55 > -2.52), resulting in idiopathic short stature.
  • Two different variant transcripts for GHR have been reported, one with complete skipping of exon 9 and the other with partial deletion of exon 9.
  • the transcript with partial deletion of exon 9 was formed due to activation of a cryptic splice site downstream (24 nt) of the canonical splice acceptor.
  • the occurrence of the splice variants has been reported but the cause behind their formation was not elucidated.
  • the splice strength of the cryptic splice acceptor site i.e. in the exonic region
  • the variant of interest disrupts the canonical splice acceptor site, leading to aberrant splicing, resulting in a non-functional protein due to premature termination of the protein.
  • The variant has been associated with disruption of the canonical splice acceptor and exon 9 skipping, indicating that the downstream cryptic splice acceptor was not being used for splicing.
  • The GHR-(1-279) splice variant, i.e. the one formed due to the activation of the cryptic splice acceptor site, is as highly expressed as the canonical transcript. Therefore, upon disruption of the canonical splice acceptor, it is likely that the downstream cryptic splice acceptor would get activated instead of selecting the disrupted canonical splice acceptor site of the intron 10 leading to exon 9 skipping (Table 2).
  • the protein product of GHR as a result of the variant loses 8 a.a from the part of the protein that forms part of the growth hormone binding protein (GHBP) after the cleavage from the GHR.
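  • The frame consequences reported in the cases above follow from simple arithmetic on the number of exonic nucleotides lost: 31 nt (MAN2B1) is not a multiple of three and shifts the reading frame, whereas 24 nt (GHR) removes 8 whole codons. A minimal sketch, in which the function name and return format are illustrative assumptions and codon-boundary effects are ignored:

```python
def exon_deletion_effect(deleted_nt):
    """Classify loss of the first `deleted_nt` exonic nucleotides by frame."""
    if deleted_nt <= 0:
        raise ValueError("deleted_nt must be positive")
    if deleted_nt % 3 == 0:
        # Whole codons are removed, so the downstream frame is preserved.
        return {"frame": "in-frame", "amino_acids_lost": deleted_nt // 3}
    # A remainder shifts every downstream codon.
    return {"frame": "frameshift", "amino_acids_lost": None}
```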
  • NTRK1 Neurotrophic Receptor Tyrosine Kinase 1
  • a putative branch site sequence 31 bases upstream to the splice acceptor site, was screened with a deleterious variant T>A.
  • the branch site score was drastically reduced after the mutation, 5.70>3.17 (Table 3) and a cryptic splice acceptor site was activated.
  • The resulting spliced product after mutation comprised an insertion of an intronic (137 bp) segment, attributed to the usage of the upstream cryptic splice acceptor site. Therefore the T>A branch site mutation has been proven to be a major cause of congenital insensitivity to pain with anhidrosis (CIPA), and the analysis approach was successful in determining the same.
  • The PWM based approach identified a putative branch site containing a deleterious variant T>A in intron 11 of TH. It has been proven that the deleterious variant leads to alternative splicing in one of three ways: via skipping of exon 12, resulting in absence of 32 amino acids in the final protein product and making it non-functional; via usage of a cryptic branch site, resulting in aberrant splicing; or via partial intron retention (36 nucleotides in the mRNA), resulting in incorporation of 12 additional amino acids and rendering the protein non-functional.
  • The branch site scores for the predicted branch site reduced significantly as a result of the variant (Table 3).
  • Disruption of the branchpoint, causing a splicing aberration that results in exon skipping, was validated.
  • A deleterious point mutation A>G was discovered in the branch site sequence ‘TCCCTGACAG’, i.e. 26 bases upstream to the splice acceptor site of intron 3.
  • This intronic mutation A>G has been experimentally proven to result in skipping of exon 4 leading to McArdle disease (17). Based on amplified PCR products from the natural and the mutated samples, retention of exon 4 was concluded and the variant was classified to be a splice acceptor site mutation but the role of the branch site was not addressed.
  • Exon 4 skipping is hypothesized to be due to the disruption of the canonical branchpoint (4.43 to null), which is 26 bases upstream to the canonical splice acceptor (Table 4).
  • the variant can be hypothesized to be a branch site mutation.
  • the analysis approach was capable of determining and classifying an experimentally validated splice mutation as a branchpoint mutation.
  • A deleterious variant in the putative branch site ‘TTTGTGATTC’, which has the highest score (3.40), was identified 23 bases upstream to the splice acceptor site in the sole intron of the Translocase Of Inner Mitochondrial Membrane 8 (TIMM8A) gene. TIMM8A/DDP1 gene dysfunction leads to Mohr-Tranebjaerg syndrome or deafness/dystonia syndrome, and there has been evidence of various missense and nonsense mutations in the coding regions of the exons of TIMM8A. There has been a recent finding of an intronic variant A>C causing X-linked dystonia deafness.
  • the intronic variant in TIMM8A has been proven to cause protein dysfunction possibly due to splicing aberrations.
  • the cause behind the splicing aberrations has not been discussed in terms of the branchpoint disruption.
  • From the branchpoint scores obtained from the prediction tool, it was evident that the splicing aberration was due to branchpoint disruption (Table 3).
  • the analysis was able to classify a proven intronic variant as a branchpoint mutation on the basis of the change in branch site scores (3.40>null).
  • the PWM based analysis approach is designed to screen for variants that are putative branch sites with ‘A’ as the branchpoint in any given sequence and determine the effect of a mutation in a branch site to the splicing of the intron.
  • the PWM of the present embodiments is able to identify putative branch sites in proximity to the intronic end.
  • the potential of the PWM is cross-checked with experimentally known branch sites identified by other tools and the outcome matched accurately.
  • The case studies discussed in detail revealed successful identification of known branchpoint mutations and also led to reinterpretation of certain cases, indicating the cause behind speculated effects of splicing leading to a pathological condition.
  • The basis for the examples discussed above is the PWM generated in accordance with the present embodiments.
  • the PWM is created using a dataset of branch site 10 mer sequences containing adenosine as the branchpoint.
  • the PWM was able to identify putative branch sites in proximity to the intronic end.
  • the potential of the PWM was cross-checked with experimentally known branch sites identified by other tools and the outcome matched accurately.
  • the analysis approach of the present method is focused on screening variants in branch sites with “A” as the branchpoint and studying the impact of the variant on splicing and the resulting pathogenicity.
  • Upon variant screening, the input dataset showed a particular branchpoint variant in the COL4A5 gene which was speculated to be a splice site variant. However, the scores obtained from the PWM for the branch site before and after the mutation indicated it to be a branchpoint mutation disrupting the branch site.
  • The screening of putative branch site variants in the human genome through the Clinvar.vcf successfully identified 20 cases with deleterious variants (pathogenic/likely pathogenic) as branch site mutations (TABLE 5) and 20 deleterious variants as splice site mutations (TABLE 6).
  • An extra filter, namely a significant change in the branch site score/splice acceptor site score before and after the mutation, was applied in order to pick the branchpoints/splice sites drastically affected by the variation.
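  • The score-change filter can be sketched as a comparison of the score before and after the mutation. The disclosure does not state what counts as a significant change, so the `min_drop` cut-off below is a hypothetical placeholder, as are the class and function names; a score of `None` stands for the "null" result reported when no site is found after the mutation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreChange:
    label: str
    before: float
    after: Optional[float]  # None means no site was scored after the mutation


def is_drastic(change, min_drop=2.0):
    """Return True when the site is lost or its score falls by at least min_drop.

    min_drop is a placeholder value, not a threshold from the disclosure.
    """
    if change.after is None:
        return True
    return (change.before - change.after) >= min_drop


# Score pairs quoted in the text above.
reported = [
    ScoreChange("branch site, 5.70>3.17", 5.70, 3.17),
    ScoreChange("branch site, 3.40>null", 3.40, None),
    ScoreChange("splice acceptor, 5.55>-2.52", 5.55, -2.52),
]
flagged = [c.label for c in reported if is_drastic(c)]
```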
  • Variant screening within 15 nt upstream to the intron/exon junction confirmed two experimentally proven cases (Ornithine Carbamoyltransferase (OTC) and Mannosidase Alpha Class 2B Member 1 (MAN2B1)), with the variant disrupting the canonical splice acceptor site and leading to activation of a cryptic splice acceptor site and cryptic branch site.
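  • Assigning a variant to the splice acceptor site region or the branch site region from genomic coordinates can be sketched as below. This is a hedged illustration: the function name, the 1-based inclusive coordinate convention, the returned labels, and the handling of the boundary at 15 nt (acceptor region up to 14 nt, branch site region from 15 to 50 nt upstream) are assumptions, chosen to agree with the windows mentioned in the text.

```python
def classify_variant(pos, intron_start, intron_end, strand,
                     acceptor_max=14, branch_min=15, branch_max=50):
    """Classify a variant by its distance upstream of the intron's 3' end.

    Coordinates are 1-based and inclusive, with intron_start <= intron_end
    on the reference genome. Distance 1 is the last intronic base. On the
    minus strand the 3' end of the intron is at intron_start.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not (intron_start <= pos <= intron_end):
        return "outside_intron"
    if strand == "+":
        distance = intron_end - pos + 1
    else:
        distance = pos - intron_start + 1
    if distance <= acceptor_max:
        return "splice_acceptor_region"
    if branch_min <= distance <= branch_max:
        return "branch_site_region"
    return "other_intronic"
```

  • Applying the check per transcript would allow the additional requirement that a branch site be intronic in all transcripts to be enforced by the caller.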
  • The three known cases of branch site mutations and the two known cases of splice site mutations (NTRK1, DYSF, TH; OTC, MAN2B1) confirmed the potency of the analysis model in identifying potential branch sites in the introns, while the two discovery cases each of branch site mutations and splice site mutations (PYGM, TIMM8A; AGXT, MYO15A) confirmed the potency of the analysis approach in categorizing intronic variants as branchpoint or splice site variants based on the activation of a cryptic branchpoint or cryptic splice site.
  • the analysis approach was also tested for the negative set i.e.
  • the analysis approach is successful in determining branchpoint variants and determining their pathogenicity based on the availability of alternative branchpoint which could rescue normal splicing.
  • the present system and method proved successful in identifying variants that caused disruption of a branchpoint and led to creation of a new splice acceptor (Component of Oligomeric Golgi Complex 6 (COG6), Glucosidase Alpha, Acid (GAA)) at that site. It was also successful in identifying a putative splice acceptor site downstream to the canonical site upon creation of a new branchpoint at the canonical splice acceptor site as a result of the variation. In total, 40 variants with a potency to be a branch site or splice site mutation were identified and their role in causing splicing aberration was predicted with the aid of the designed tool.
  • the PWM based approach is designed to screen for variants that are putative branch sites with ‘A’ as the branchpoint in any given sequence and determine the effect of a mutation in a branch site to the splicing of the intron.
  • The embodiments of the present system and method are capable of identifying branchpoint variants and, along with other established tools that determine various aspects of splice sites, were successful in offering a more detailed biological explanation of the consequence of mutations. Also, the discovery cases identified using the present embodiments hold strong potential in unveiling the cause behind known pathogenic conditions and provide a basis for therapeutic developments. Prediction of putative branchpoint or splice site variants in an intron can lay the foundation for the identification of possible genotype-based therapies using exon-skipping techniques (TABLE 7).
  • Predicted alternative BP Predicted branchpoint with a higher potential by present prediction tool
  • FIG. 4 is a block diagram of an exemplary computer system 401 for implementing embodiments consistent with the present disclosure.
  • the computer system 401 may be implemented standalone or in combination of components of the system 102 ( FIG. 1 ). Variations of computer system 401 may be used for implementing the devices included in this disclosure.
  • Computer system 401 may comprise a central processing unit (“CPU” or “hardware processor”) 402 .
  • the hardware processor 402 may comprise at least one data processor for executing program components for executing user- or system-generated requests.
  • the processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
  • the processor may include a microprocessor, such as AMD Athlon™, Duron™ or Opteron™, ARM's application, embedded or secure processors, IBM PowerPC™, Intel's Core, Itanium™, Xeon™, Celeron™ or other line of processors, etc.
  • the processor 402 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.
  • Processor 402 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 403 .
  • the I/O interface 403 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11 a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
  • the computer system 401 may communicate with one or more I/O devices.
  • the input device 404 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc.
  • Output device 405 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc.
  • a transceiver 406 may be disposed in connection with the processor 402 . The transceiver may facilitate various types of wireless transmission or reception.
  • the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.
  • the processor 402 may be disposed in communication with a communication network 408 via a network interface 407 .
  • the network interface 407 may communicate with the communication network 408 .
  • the network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc.
  • the communication network 408 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc.
  • the computer system 401 may communicate with devices 409 and 410 .
  • These devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like.
  • the computer system 401 may itself embody one or more of these devices.
  • the processor 402 may be disposed in communication with one or more memory devices (e.g., RAM 713 , ROM 714 , etc.) via a storage interface 412 .
  • the storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc.
  • the memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.
  • the memory devices may store a collection of program or database components, including, without limitation, an operating system 416 , user interface application 417 , user/application data 418 (e.g., any data variables or data records discussed in this disclosure), etc.
  • the operating system 416 may facilitate resource management and operation of the computer system 401 .
  • Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like.
  • User interface 417 may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities.
  • user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 401 , such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc.
  • Graphical user interfaces may be employed, including, without limitation, Apple Macintosh operating systems' Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.
  • computer system 401 may store user/application data 418 , such as the data, variables, records, etc. as described in this disclosure.
  • databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase.
  • databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.).
  • Such databases may be consolidated or distributed, sometimes among the various computer systems discussed above in this disclosure. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
  • the server, messaging and instructions transmitted or received may emanate from hardware, including operating system, and program code (i.e., application code) residing in a cloud implementation.
  • one or more of the systems and methods provided herein may be suitable for cloud-based implementation.
  • some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.
  • the hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof.
  • the device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g.
  • the means can include both hardware means and software means.
  • the method embodiments described herein could be implemented in hardware and software.
  • the device may also include software means.
  • the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
  • the embodiments herein can comprise hardware and software elements.
  • the embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc.
  • the functions performed by various modules described herein may be implemented in other modules or combinations of other modules.
  • a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • a computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored.
  • a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein.
  • the term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

This disclosure relates generally to a method and system for predicting the effect of genomic variations on pre-mRNA splicing. The method includes receiving genomic position information of at least one candidate variant, gene transcripts and genomic coordinates information of the gene transcripts; classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant; evaluating the effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant; and predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.

Description

    PRIORITY CLAIM
  • This U.S. patent application claims priority under 35 U.S.C. § 119 to India Application No. 201821025433, filed on Jul. 7, 2018. The entire contents of the aforementioned application are incorporated herein by reference.
  • TECHNICAL FIELD
  • The disclosure herein generally relates to mRNA splicing, and, more particularly, predicting effect of genomic variations on pre-mRNA splicing.
  • BACKGROUND
  • RNA splicing is a process of cutting introns out of pre-mRNA and stitching together exons to form a final nucleotide sequence that is the mRNA sequence that codes for proteins. In this regard branchpoint (BP) selection and splice site (SS) selection are key steps in RNA splicing, yet many popular splicing analysis tools do not model this mechanism. If there is a mutation in proximity to an intron's primary branch point, that branchpoint may become unusable.
  • Existing methods for branchpoint prediction use wet lab techniques and in-silico methods. The wet lab techniques are time consuming and labour intensive, while existing computational models involving Support Vector Machine algorithm or machine learning tools are based on numerous assumptions which hamper accurate prediction. Various computational methods have been implemented to facilitate accurate branchpoint prediction and the predicted branchpoints have been tested in vivo/vitro but most of the models are built on hypothetical assumptions which do not lead to accurate prediction of branchpoints. In general the search for disease-causing mutations has been mostly restricted to coding exons, intron-exon junction and promoter region of the gene of interest.
  • SUMMARY
  • Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a processor implemented method for predicting effect of genomic variations on pre-mRNA splicing is provided. The method includes receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. Further includes classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of at least one candidate variant. Further includes evaluating effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant. Herein, evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using MaxEnt score, determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site and evaluating, in response to determining that the new splice acceptor site region being created, strength of an identified natural branch point in the classified region using a PWM evaluator. Further includes predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
  • In another embodiment, a system for predicting effect of genomic variations on pre-mRNA splicing is provided. The system includes a memory storing instructions and one or more hardware processors coupled to the memory, wherein the one or more hardware processors are configured by the instructions to: receive genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. Further, to classify the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of at least one candidate variant. Further to evaluate effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant, wherein the evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score, determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site and evaluating, in response to determining that the new splice acceptor site region being created, strength of an identified natural branch point in the classified region using a PWM evaluator. Further to predict pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on pre-mRNA splicing.
  • In yet another embodiment, one or more non-transitory machine readable information storage mediums are provided. Said one or more non-transitory machine readable information storage mediums comprise one or more instructions which, when executed by one or more hardware processors, cause receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts. The instructions further cause classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. The instructions further cause evaluating the effect of the at least one candidate variant on pre-mRNA splicing, based on the classified region obtained from the classification of the at least one candidate variant. Herein, evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises: identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score; determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and evaluating, in response to determining that the new splice acceptor site region is being created, the strength of an identified natural branch point in the classified region using a PWM evaluator. The instructions further cause predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
  • It should be appreciated by those skilled in the art that any block diagram herein represents a conceptual view of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in a computer readable medium and so executed by a computing device or processor, whether or not such computing device or processor is explicitly shown.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
  • FIG. 1 illustrates a network environment implementing a system 102 for predicting effect of genomic variations on pre-mRNA splicing, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is a flow diagram illustrating a method for predicting effect of genomic variations on pre-mRNA splicing, according to an embodiment of the present disclosure.
  • FIGS. 3A, 3B and 3C illustrate an analysis pipeline for predicting effect of genomic variations on pre-mRNA splicing, in accordance with an embodiment of the present disclosure.
  • FIG. 4 illustrates a block diagram of a system for predicting effect of genomic variations on pre-mRNA splicing, in accordance with some embodiments of the present disclosure.
  • DETAILED DESCRIPTION
  • Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the claims (when included in the specification).
  • One study investigating disease-causing branchpoint sequence (BPS) mutations reports that mutations at adenosine branchpoints caused more severe splicing defects than mutations at branchpoints with other bases. A mutation in the branchpoint impairs the lariat formation and may lead to aberrant splicing of the intron, leading to gene dysfunction. The lariat is a lasso-shaped structure formed during the removal of introns in mRNA processing. Mutations at branch sites have been shown to lead to aberrant splicing, which in turn can lead to disease phenotypes. The rapid growth in the use of next generation sequencing (NGS) in the clinic for diagnosis and screening of disorders may benefit from approaches that can reliably identify mutations in branch sites that may be explanatory of diseases. Development of such tools has been hampered by the absence of a large enough “gold dataset” of known high-confidence branch sites.
  • Splicing forms a crucial part of the pre-mRNA maturation process, as accurate excision of introns and joining of exons are essential to eukaryotic gene expression. During splicing, parts of the pre-mRNA are removed by the spliceosome within the nucleus before the mature mRNA is transported to the cytoplasm for translation. Depending upon tissue localization and the developmental stage, pre-mRNA is spliced differently, leading to alternative transcripts, i.e., expression of different proteins from the same gene. More than 70% of protein coding human genes are alternatively spliced, and alternative splicing has been proposed to be the major cause of the evolution of phenotypic complexity in mammals.
  • Exon skipping is the most common outcome of splicing mutations, followed by activation of cryptic 5′ and 3′ splice sites (5′SS and 3′SS). Exon skipping is due to disruption of the natural splice acceptor site or abolishment of the natural branchpoint with no alternative branchpoint available to facilitate splicing. Efficient splicing requires at least three major signals within introns: the 5′ splice site, the 3′ splice site and the branchpoint sequence. Auxiliary sequences in introns and exons, known as splicing enhancers and silencers, act in conjunction to decide whether splicing is constitutive or alternative. The 5′ end of the intron is known as the splice donor site and the 3′ end of the intron is referred to as the splice acceptor site.
  • Divergence from the prototype sequences is associated with alternative transcript generation. Occurrence of such consensus sequences within the introns is quite common in higher eukaryotes, framing pseudoexons, indicating that the presence of splice boundaries alone is insufficient for regulating correct splicing. The 3′ end is characterized by the presence of the splice acceptor site, the branchpoint sequence upstream of it, and the polypyrimidine tract immediately following the branchpoint sequence. Branchpoints are defined on the basis of four major criteria: they are proximal to the 3′ splice end of the intron, the branchpoint sequence is followed by a polypyrimidine tract, the ‘AG’ dinucleotide is depleted between the branchpoint sequence and the 3′ splice site, and the branchpoint is mostly an adenine. The selection and accurate prediction of branchpoint variants and splice site variants from candidate variants of existing databases of known human gene transcripts is therefore of prime importance and challenging.
  • Various embodiments of the present disclosure provide a method and system for predicting the effect of genomic variations on pre-mRNA splicing with high accuracy, based on the MaxEnt tool and a Position Weight Matrix (PWM) evaluator, suitable for use in a resource constrained environment. The disclosed system includes a variant pipeline which works in real-time in a resource constrained environment or in near real-time on a CPU. The disclosed system and method provide a solution for predicting the effect of genomic variations on pre-mRNA splicing. A detailed description of the above described system and method for predicting the effect of genomic variations on pre-mRNA splicing is provided with reference to FIGS. 1 through 4.
  • Referring now to the drawings, and more particularly to FIGS. 1 through 4, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and method for predicting effect of genomic variations on pre-mRNA splicing.
  • Herein, the system 102 may receive inputs, for example, inputs via multiple devices and/or machines 104-1, 104-2 . . . 104-N, collectively referred to as devices 104 hereinafter. Examples of the devices 104 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld device, VR camera embodying devices, storage devices equipped to receive and store inputs and outputs. In an embodiment, the devices 104 may include devices capable of capturing and storing data. The devices 104 are communicatively coupled to the system 102 through a network 106, and may be capable of transmitting the data to the system 102.
  • In one implementation, the network 106 may be a wireless network, a wired network or a combination thereof. The network 106 can be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and the like. The network 106 may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
  • The devices 104 may send input to the system 102 via the network 106. The system 102 is caused to predict effect of genomic variations on pre-mRNA splicing. In an embodiment, the system 102 may be embodied in a computing device 110. Examples of the computing device 110 may include, but are not limited to, a desktop personal computer (PC), a notebook, a laptop, a portable computer, a smart phone, a tablet, and the like. The system 102 may also be associated with a data repository 112 to store inputs, dataset and output/resultant. Additionally or alternatively, the data repository 112 may be configured to store data and/or information generated during predicting effect of genomic variations on pre-mRNA splicing. The repository 112 may be configured outside and communicably coupled to the computing device 110 embodying the system 102. Alternatively, the data repository 112 may be configured within the system 102.
  • In an embodiment, the disclosed system 102 enables predicting effect of genomic variations on pre-mRNA splicing, thereby resulting in high accuracy of predicting pathogenicity and of determining branchpoint variants and their pathogenicity based on the availability of an alternative branchpoint which could rescue normal splicing. An example representation of the pipeline of the method for predicting effect of genomic variations on pre-mRNA splicing is shown and described further with reference to FIGS. 3A-3C.
  • Referring now to FIG. 2, a flow-diagram of a method 200 for predicting effect of genomic variations on pre-mRNA splicing is described, according to some embodiments of present disclosure. The method 200 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communication network. The order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method. Furthermore, the method 200 can be implemented in any suitable hardware, software, firmware, or combination thereof. In an embodiment, the method 200 depicted in the flow chart may be executed by a system, for example, the system 102 of FIG. 1. In an example embodiment, the system 102 may be embodied in an exemplary computer system, for example computer system 102. The method 200 of FIG. 2 will be explained in more detail below with reference to FIGS. 3A-3C.
  • Referring to FIG. 2, in the illustrated embodiment, the method 200 is initiated at 202, where genomic position information, a reference allele and an alternate allele corresponding to a specified version of the human genome are received. The at least one candidate transcript that fully contains the variant is obtained from existing databases of known human gene transcripts (herein, referred to as at least one variant). Each transcript is represented as a set of one or more non-overlapping intervals, where each interval is represented by four features that include the chromosome on which the transcript is present, the starting genomic coordinate of the interval, the ending genomic coordinate of the interval, and the strand on which the transcript is present (forward or reverse).
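  • The transcript representation described above may be sketched as follows. This is an illustrative sketch only: the class names `Interval` and `Transcript`, the 1-based inclusive coordinate convention, and the choice to test containment against the overall genomic span of the transcript (so that intronic variants are included) are assumptions made for the example and are not specified in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    """One interval of a transcript, described by the four features in the text."""
    chrom: str   # chromosome on which the transcript is present
    start: int   # starting genomic coordinate (assumed 1-based, inclusive)
    end: int     # ending genomic coordinate (assumed inclusive)
    strand: str  # "+" for forward, "-" for reverse


@dataclass(frozen=True)
class Transcript:
    """A transcript as a set of non-overlapping intervals (hypothetical container)."""
    name: str
    intervals: List[Interval]

    def span(self) -> Tuple[int, int]:
        """Smallest and largest genomic coordinate covered by any interval."""
        return (min(i.start for i in self.intervals),
                max(i.end for i in self.intervals))

    def contains(self, chrom: str, pos: int, ref_len: int = 1) -> bool:
        """True if the variant lies fully inside the transcript span (assumption:
        introns between intervals count as part of the transcript)."""
        if not self.intervals or self.intervals[0].chrom != chrom:
            return False
        lo, hi = self.span()
        return lo <= pos and pos + ref_len - 1 <= hi
```

A transcript with intervals 100-200 and 300-400 on the same chromosome would, under these assumptions, contain a variant at position 250 (intronic) but not one at position 401.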
  • At 204, the at least one candidate variant is classified as occurring in one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of the at least one candidate variant. The at least one candidate variant is classified as occurring in the splice acceptor site region when its genomic coordinate lies between 15 nucleotides upstream and 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts, and as occurring in the branch site region when its genomic coordinate lies between 50 nucleotides and 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts. Herein, ‘nucleotide’ and ‘nt’ are used interchangeably.
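  • The region classification at 204 may be sketched as follows. The function names, the signed-offset convention (negative values count intronic nucleotides upstream of the junction, positive values count exonic nucleotides downstream) and the treatment of the shared boundary at exactly 15 nucleotides upstream as belonging to the splice acceptor site region are assumptions made for illustration; the disclosure does not state how the boundary is resolved.

```python
from typing import Optional


def signed_offset(pos: int, exon_first_base: int, strand: str) -> int:
    """Offset of a genomic position relative to the intron-exon acceptor junction.

    exon_first_base is the genomic coordinate of the first exonic base in
    transcript orientation (the exon start on "+", the exon end on "-").
    Returns -1, -2, ... for intronic bases upstream and 1, 2, ... for exonic
    bases downstream; there is no offset 0 in this assumed convention.
    """
    d = pos - exon_first_base if strand == "+" else exon_first_base - pos
    return d if d < 0 else d + 1


def classify_region(offset: int) -> Optional[str]:
    """Classify a variant by its offset from the natural acceptor junction."""
    if -15 <= offset <= -1 or 1 <= offset <= 3:
        return "splice_acceptor_site"   # 15 nt upstream to 3 nt downstream
    if -50 <= offset < -15:
        return "branch_site"            # 50 nt to 15 nt upstream
    return None                         # outside both regions
```

For example, a variant 20 nucleotides upstream of the junction (offset -20) would be assigned to the branch site region under these assumptions, and a variant at the second exonic base (offset 2) to the splice acceptor site region.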
  • At 206, the effect of the at least one candidate variant on pre-mRNA splicing is evaluated based on the classified region obtained from the classification of the at least one candidate variant. The evaluation is performed by identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using a MaxEnt score, and then determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site. Thereafter, in response to determining that the new splice acceptor site region is being created, the strength of an identified natural branch point in the classified region is evaluated using a Position Weight Matrix (PWM) evaluator. MaxEnt is a known splice site strength determination tool for calculating the strength or weakening of the splice acceptor site, wherein the MaxEnt tool assigns a MaxEnt score based on the effect of the at least one candidate variant on the affected natural splice acceptor site region. In an example embodiment, the available MaxEnt Scan tool is used to calculate the splice acceptor site scores for both the canonical splice sites, which are the naturally occurring splice sites (the natural splice acceptor site region), and the cryptic splice sites, which are splice sites activated by a mutation.
  • The PWM evaluator is generated using experimentally determined human branch sites. In an example embodiment, the PWM is generated using 59,359 experimentally determined human branch sites (10-mers), identified based on exoribonuclease digestion and RNA-seq. In said example embodiment, a set of branch point sites is utilized by selecting only the sequences with ‘A’ at the branchpoint as the training set for the position weight matrix (PWM). In said example embodiment, ‘A’ is chosen as the branchpoint since sites with ‘C’, ‘T’ or ‘G’ as the branchpoint have very low median scores, while the known ‘A’ has the highest value, suggesting that the PWM generated in accordance with the present embodiments is selective towards ‘A’ as a branchpoint and that it is appropriate to restrict the PWM scoring to ‘A’. Therefore the PWM was built using the known ‘A’ as the branchpoint. A PWM of dimensions (m*n) is created by aligning the 59,359 experimentally determined human branch sites (10-mers) with ‘A’ as the branchpoint. In the present embodiment a matrix of (10*4) is created. The alignment is then used to calculate the frequency of each nucleotide at each position of the 10-mers, and thereafter the frequencies of each nucleotide are converted to log odds scores.
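  • A minimal sketch of how aligned branch site sequences may be turned into a log odds PWM is given below. The function names, the base-2 logarithm, the pseudocount (added only to avoid taking the logarithm of zero) and the `background` argument holding per-nucleotide background frequencies are assumptions made for illustration; the disclosure states only that per-position frequencies are computed and converted to log odds scores.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

BASES = "ACGT"


def build_pwm(sites: Sequence[str],
              background: Dict[str, float],
              pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build a (length x 4) log odds matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)          # nucleotide counts at position i
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm


def score(pwm: List[Dict[str, float]], seq: str) -> float:
    """Sum the log odds of the observed nucleotide at each position."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the PWM length")
    return sum(col[b] for col, b in zip(pwm, seq))
```

With 10-mer training sites the resulting matrix has ten rows of four values each, matching the (10*4) matrix of the present embodiment; a sequence resembling the training sites scores higher than one that does not.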
  • In said example embodiments, 175,031 unique introns from 18,171 canonical transcripts from Gencode database v19 are identified and extracted with the filtering criterion of being surrounded by coding exons on both sides. The frequency of each nucleotide (A, T, C, G) across all the introns is used to normalize the raw frequencies of the bases in the training set of branch points. As described above, the normalized frequencies are converted to log odds scores to generate the final PWM. Based on the branch site scores obtained for the known branch sites with ‘A’ as the branchpoint, the first quartile of the distribution is calculated and is used as a threshold for classifying a site as a high confidence branch site. In an example embodiment, the determined threshold is 1.46. Further, a 40-mer intronic sequence, 10 to 50 bases upstream from the 3′ end of each intron, is extracted from the human genome and scanned for 10-mer sequences scoring above the branchpoint threshold.
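  • The thresholding and scanning steps described above may be sketched as follows. The function names and the pluggable `score_fn` callable (standing in for the PWM evaluator) are hypothetical, and the use of the default quantile method of the Python `statistics` module is an assumption; the disclosure does not state how the first quartile is interpolated.

```python
from statistics import quantiles
from typing import Callable, List, Sequence, Tuple


def first_quartile(scores: Sequence[float]) -> float:
    """First quartile of the training-set branch site scores (threshold candidate)."""
    return quantiles(scores, n=4)[0]


def scan_branch_window(intron: str,
                       score_fn: Callable[[str], float],
                       threshold: float,
                       k: int = 10) -> List[Tuple[int, str, float]]:
    """Scan the 40-mer lying 10 to 50 bases upstream of the intron 3' end.

    intron is the intron sequence in transcript orientation, ending at its
    3' end. Returns (offset within the 40-mer, k-mer, score) for every k-mer
    scoring above the threshold.
    """
    if len(intron) < 50:
        raise ValueError("intron must be at least 50 bases long")
    window = intron[-50:-10]                      # 40 bases
    hits = []
    for i in range(len(window) - k + 1):          # 31 windows when k == 10
        kmer = window[i:i + k]
        s = score_fn(kmer)
        if s > threshold:
            hits.append((i, kmer, s))
    return hits
```

In the example embodiment the threshold passed to the scan would be 1.46, and `score_fn` would score each 10-mer against the final PWM.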
  • At 208, pathogenicity of the at least one candidate variant is predicted based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing. The evaluation and the prediction of pathogenicity of the at least one candidate variant are described in further detail with reference to FIGS. 3A-3C.
  • FIGS. 3A-3C illustrate the analysis pipeline for the method of predicting pathogenicity on the pre-mRNA splicing. Herein the analysis pipeline is designed to categorize a variant as pathogenic or non-pathogenic. The analysis approach, in accordance with the present embodiments, follows a step by step pipeline represented by FIGS. 3A-3C. In an embodiment, variants that are in close proximity to the canonical splice acceptor region, that is, up to 15 nucleotides upstream of it, are screened for creation of a new cryptic acceptor site or creation of a new branch site. If a branch site is created, then a scan for a suitable downstream splice acceptor site is initiated. If the variant creates a splice acceptor, then a suitable upstream branch site is scanned for using the PWM evaluator. If the variant disrupts the canonical splice acceptor and the canonical branch site is unaffected, then screening for a suitable alternative downstream splice acceptor is performed. If a new splice acceptor is predicted downstream of the canonical splice acceptor site, then screening for an experimentally proven branchpoint is performed using the PWM tool. The detailed step by step process of the pipeline is described with reference to FIGS. 3A-3C.
  • Referring now to FIG. 3A, a variant 302, for example the at least one candidate variant, is received along with genomic position information, a reference allele and an alternate allele corresponding to a specified version of the human genome. The at least one candidate transcript that fully contains the variant is obtained from existing databases of known human gene transcripts (herein, referred to as at least one variant). Each transcript is represented as a set of one or more non-overlapping intervals, where each interval is represented by four features that include the chromosome on which the transcript is present, the starting genomic coordinate of the interval, the ending genomic coordinate of the interval, and the strand on which the transcript is present (forward or reverse). The at least one candidate variant is classified as occurring in a splice affecting region based on its genomic coordinate. At 304, a region with genomic coordinates between 15 nucleotides upstream and 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts is classified as the splice acceptor site. At 306, weakening of the natural splice acceptor site is determined; in other words, it is determined that the classified at least one candidate variant is affecting the natural splice acceptor site (natural 3′SS). At 308, the classified at least one candidate variant is checked for creation of a new ‘AG’, that is, a new 3′SS, with the resulting weakening of the natural 3′SS determined using the MaxEnt score. In response to the determined weakening of the natural 3′SS, the at least one candidate variant is checked to determine whether the natural branchpoint suffices, or the flow branches out to block C. In other words, at 310, the presence or absence of a natural branchpoint in the sequence range of 15 nucleotides to 50 nucleotides of the new splice acceptor site region that is active during the pre-mRNA splicing is determined. Thereafter, the strength of the natural branchpoint is evaluated using the PWM evaluator and the at least one candidate variant is identified as pathogenic (312) based on the evaluated strength of the natural branchpoint; or an alternative branchpoint is screened for using the PWM evaluator and the at least one candidate variant is predicted as pathogenic based on the evaluated strength of the alternative branchpoint (314). At 317, the status of the natural splice acceptor site region is determined. The status herein includes a disrupted natural splice acceptor site region or a non-disrupted natural splice acceptor site region. At 316, the at least one candidate variant is predicted as pathogenic or non-pathogenic (318) based on the determined status.
  • Referring now to FIG. 3B, at connector B, the at least one candidate variant is classified as occurring in the branch site region based on its genomic coordinate. At 320, a region with genomic coordinates between 50 nucleotides and 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts is classified as the branch site. At 322, weakening of the natural splice acceptor site is determined; in other words, it is determined that the classified at least one candidate variant is affecting the natural 3′SS. At 324, the classified at least one candidate variant is checked for creation of a new ‘AG’, that is, a new 3′SS, and the weakening of the natural 3′SS in response to the creation of the new 3′SS is determined using the MaxEnt score. In response to the determined weakening of the splice acceptor site, the sequence is screened for either a natural branchpoint or an alternative branchpoint. At 326, the effect of the at least one candidate variant on the branch site, when a new splice acceptor site is created, is evaluated by determining the presence of an alternative branchpoint in the sequence range of 50 nucleotides to 15 nucleotides upstream of the new splice acceptor site. At 328, the at least one candidate variant is categorized as pathogenic if no alternative branchpoint is determined; at 338, the at least one candidate variant is predicted as non-pathogenic if an alternative branchpoint is found.
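  • The decision at 326 through 338, for the case in which a new splice acceptor site is created, may be sketched as follows. The function names, the 0-based index of the new acceptor within the sequence, the exact window boundaries and the `score_fn` callable standing in for the PWM evaluator are assumptions made for illustration; the default threshold of 1.46 is the value given for the example embodiment.

```python
from typing import Callable


def has_branchpoint(seq: str,
                    acceptor_idx: int,
                    score_fn: Callable[[str], float],
                    threshold: float = 1.46,
                    k: int = 10) -> bool:
    """True if any k-mer in the window 50 to 15 nt upstream of the acceptor
    scores above the threshold. acceptor_idx is the 0-based index of the
    acceptor within seq (an assumed convention)."""
    start = max(0, acceptor_idx - 50)
    end = max(0, acceptor_idx - 15)
    window = seq[start:end]
    return any(score_fn(window[i:i + k]) > threshold
               for i in range(len(window) - k + 1))


def predict_new_acceptor_case(seq: str,
                              new_acceptor_idx: int,
                              score_fn: Callable[[str], float],
                              threshold: float = 1.46) -> str:
    """Non-pathogenic if an alternative branchpoint is found for the new
    acceptor, pathogenic otherwise."""
    if has_branchpoint(seq, new_acceptor_idx, score_fn, threshold):
        return "non-pathogenic"
    return "pathogenic"
```

The sketch covers only the branch of FIG. 3B in which a new 3′SS is created; the case in which no new splice acceptor site is created follows the separate steps described for 330 through 338.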
  • At 330, the effect of the at least one candidate variant on the branch site, when no new splice acceptor site is created, is evaluated by screening for the natural branchpoint in the sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site and, at 332, determining the level of strength of the branch site using the PWM evaluator. Herein, the level of strength is determined in view of the at least one candidate variant affecting the screened natural branchpoint. At 334, based on the determined level of strength of the branch site, the at least one candidate variant is predicted as pathogenic. At 336, the at least one candidate variant is predicted as pathogenic or non-pathogenic (338) based on an alternative branchpoint screened in the sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site region.
  • Referring now to FIG. 3C, at connector C, the effect of the at least one candidate variant on the splice acceptor site region, when no new splice acceptor site is created, is evaluated by sequentially performing the steps at 340, 342 and 344. At 340, the effect of the at least one candidate variant on the natural branchpoint is determined and the level of strength of the natural branch site is identified using the PWM evaluator based on the determined effect. At 342, an alternative splice acceptor site region is screened for in the sequence range of 50 nucleotides upstream to 50 nucleotides downstream of the at least one candidate variant, and the strength of the alternative splice acceptor site region is compared with that of the weakened natural splice acceptor site region. At 344, the presence of a newly created branchpoint is determined and the strength of the new branchpoint is compared with that of the natural branchpoint. Further, based on 340, at 346, the at least one candidate variant is predicted as a non-pathogenic variant (348), or the at least one candidate variant is predicted as a pathogenic variant (350) or a non-pathogenic variant (364) based on a screened alternative branchpoint (360) in the sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site region.
  • Further, based on 342, the at least one candidate variant is predicted as non-pathogenic (348), or the presence of a natural branchpoint in the sequence range of 15 nucleotides to 50 nucleotides of the splice acceptor site region that is active during the mRNA splicing is determined (352) and thereafter the strength of the natural branchpoint is compared with the predefined threshold. Based on the comparison, the at least one candidate variant is predicted as pathogenic (350). Further, based on 344, the at least one candidate variant is predicted as pathogenic (354) or non-pathogenic (356) based on an alternative branchpoint screened in the sequence range of 50 nucleotides to 15 nucleotides upstream of the alternative splice acceptor site (358). Further, based on the comparison of the strength of the new branchpoint and the natural branchpoint, the at least one candidate variant is predicted as non-pathogenic (364). If not, the presence of a natural branchpoint in the range of 15 nucleotides to 50 nucleotides upstream of the splice acceptor site region that is active during the mRNA splicing is determined and thereafter the strength of the natural branchpoint is compared with the predefined threshold (354). Based on the determined presence of the natural branchpoint and the comparison of the strength of the natural branchpoint with the predefined threshold, the at least one candidate variant is predicted as pathogenic (362) or non-pathogenic (364).
  • In accordance with the present embodiments, the focus of the present system and method is to identify a branchpoint (BP) in any given sequence and to evaluate the role of the identified BP in the functional consequence of splicing of the intron. A further focus of the present embodiments is to predict the impact of the evaluated BP on pathogenicity using a combination of the PWM and the MaxEnt score. There are many tools which can predict a branchpoint, but their main drawback is that they require far more input data while predicting a BP, such as the polypyrimidine tract information, the actual splice acceptor site and the distance to the splice acceptor site region, which prevents such tools from predicting a branchpoint in an arbitrary given sequence. The present system and method clearly distinguish between the BP and the splice site (SS) and evaluate a variant based on the combined output of the individual components.
  • Validation and Results
  • The results of the method for predicting effect of genomic variations on pre-mRNA splicing have been validated using the following examples. It will be understood that the examples discussed herein are only for the purpose of explanation and not to limit the scope of the present subject matter. Further, the test results are shown for a specific example of predicting effect of genomic variations on pre-mRNA splicing and should in no way be construed as the only method that can be formed through the described method.
  • In one example embodiment, the system and method for predicting effect of genomic variations on pre-mRNA splicing were evaluated as follows. In the present embodiment, a recent set of 59,359 experimentally determined human branch sites (10-mers), identified based on exoribonuclease digestion and RNA-seq, is considered. The dataset offers a comprehensive basis for training a high accuracy putative BPS prediction model (10). The present example utilizes this set of branch point sites, selecting only the sequences with ‘A’ at the branchpoint as the training set for the position weight matrix (PWM) evaluator. This is because the goal is to create and evaluate a tool that can be used as part of a routine variant annotation scheme to provide high confidence annotations for further clinical interpretation. Parameters such as the distance of the BPS (branch point sequence) from the 3′ splice end of the intron (15 to 50 nucleotides upstream), ensuring that the BPS is part of the intronic region in all transcripts, and setting a threshold on the basis of the top 25% scores in the PWM from the training set were chosen to increase the accuracy of the analysis approach. Comparisons to the outcomes of other existing prediction tools such as HSF (Human Splicing Finder), SVM (Support Vector Machine), BP finder and the outputs of machine learning prediction tools, along with experimentally proven BPS mutations, have been performed to demonstrate the accuracy of the proposed model.
  • The analysis method described in accordance with the present embodiments, based on the PWM, was successful in identifying the pathogenic role of 3 Clinvar annotated deleterious mutation cases (Table 1) in known branchpoints listed in the high-confidence branchpoint dataset, as described below. The present analysis was successful in confirming the experimentally known cases of variants causing splicing aberrations due to activation of cryptic splice sites and branchpoints. The experiments were conducted for various known variants.
  • Example 1—OTC
  • In an embodiment, a C>G variant in intron 9 was detected upon Clinvar based variant screening of the Ornithine Carbamoyltransferase coding gene (OTC) as disrupting the canonical splice acceptor site. An alternative splice acceptor site (MaxEnt: 8.30) was identified 25 bases downstream (in the exonic region) of the canonical splice acceptor junction. The canonical branch site (score: 2.80), located 29 bases upstream of the identified cryptic splice acceptor, was deemed suitable. The inactivation of the canonical splice acceptor and activation of the cryptic acceptor site have been experimentally verified with the aid of PCR, and the resulting aberration in splicing has been proven to cause an aberrant 50 amino acid C-terminal sequence in the protein, resulting in hyperammonemic crisis. The values corresponding to OTC are shown in Table 1.
  • Example 2—MAN2B1
  • In another embodiment, a T>C transition was found in intron 14 of the Mannosidase Alpha Class 2B Member 1 gene (MAN2B1), disrupting the canonical splice acceptor site. Upon the loss of the canonical splice acceptor, a cryptic branch site is activated together with a cryptic splice acceptor (MaxEnt: 4.78) 31 nt downstream of the canonical 3′ splice site, resulting in deletion of the first 31 nt of exon 15 and leading to a frame shift mutation that causes premature termination of the protein as a consequence of the introduction of a stop codon (Table 1). With the aid of RT-PCR, the disruption of the canonical 3′ splice acceptor site and the activation of the cryptic splice site leading to partial exon deletion have been confirmed. Overall, the analysis approach displayed the potential to unveil one of the causes behind deficiency of alpha-mannosidase.
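  • The reading frame consequence stated above follows from simple arithmetic, sketched below. The function name is hypothetical. A deletion whose length is not a multiple of three shifts the reading frame, so the loss of the first 31 nt of exon 15 (31 = 3 × 10 + 1) is expected to cause a frame shift.

```python
def is_frameshift(deleted_nt: int) -> bool:
    """True if removing this many coding nucleotides shifts the reading frame."""
    return deleted_nt % 3 != 0


# 31 nt deleted from exon 15 of MAN2B1, as described in Example 2
assert is_frameshift(31)
```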
  • TABLE 1
    Gene     Variant position   Sequence                         Score
    OTC      38280273           TTTCTTTGTTGTGTCAT[C > G]AGGCT    7.73 > −1.02
    MAN2B1   12763276           GTGGACCCTTTTCTGCCC[A > G]GCAC    4.4 > −3.56
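  • The Sequence column of the table above marks each substitution inline as [ref > alt]. As a reading aid, the following minimal sketch expands that notation into the reference and mutated sequences. The function name expand_variant is hypothetical and is not part of the disclosed system.

```python
import re

# Table notation: left flank, [REF > ALT], right flank (spaces are optional).
VARIANT_PATTERN = re.compile(r"([ACGT]*)\[([ACGT])>([ACGT])\]([ACGT]*)")

def expand_variant(notation):
    """Return (reference sequence, mutated sequence, 0-based index of the change)."""
    match = VARIANT_PATTERN.fullmatch(notation.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized variant notation: {notation!r}")
    left, ref, alt, right = match.groups()
    return left + ref + right, left + alt + right, len(left)

# OTC row of Table 1.
reference, mutated, index = expand_variant("TTTCTTTGTTGTGTCAT[C > G]AGGCT")
print(reference)  # TTTCTTTGTTGTGTCATCAGGCT
print(mutated)    # TTTCTTTGTTGTGTCATGAGGCT
print(index)      # 17
```

  • The two expanded sequences are the inputs that a splice site strength tool such as MaxEnt would score to obtain the before and after values in the Score column.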
  • Experiments also revealed discovery cases, in which the analysis unveiled the reason behind splicing aberrations caused by known pathogenic candidate variants.
  • Example 3—Alanine-Glyoxylate and Serine-Pyruvate Aminotransferase (AGXT)
  • In an example embodiment, screening of the AGXT gene for variants found an A>G mutation in intron 5. Because the variant lies at the canonical splice acceptor site, it has previously been categorized as a splice site mutation, although its role and its specific effects on splicing have not been defined. The variation disrupted the canonical splice acceptor site of intron 5 (MaxEnt: 4.01>−3.94). As a result, a cryptic splice acceptor site (MaxEnt: 5.01) 28 nucleotides downstream of the canonical splice acceptor site was activated. Further, screening for suitable branch sites for the cryptic splice acceptor found a potential branch site 35 bases upstream of the cryptic splice acceptor site. Overall, the proposed model indicates that the mutation disrupts the original splice acceptor site and activates a cryptic splice acceptor, along with a cryptic branch point, downstream of the canonical splice site and canonical branch site (Table 2). The resulting protein is 392 a.a. long and loses 9 a.a., i.e. an entire β-strand in the core region, as a result of the SNP. The deleted region forms part of the active site and the homodimer interface of the protein and is essential for pyridoxal 5′ phosphate binding. The deletion caused by the SNP is therefore highly deleterious, as it causes protein dysfunction. A hypothesis can be drawn based on the occurrence of an alternative splice acceptor with a suitable branch site, leading to aberrant splicing. The pre-termination of the transcript due to the splicing disruption might be a cause of primary hyperoxaluria.
  • Example 4—Myosin XVA (MYO15A)
  • In another embodiment, screening of intron 49 of the MYO15A gene found a deleterious G>A variant disrupting the canonical splice acceptor site. As a result of the variant, a cryptic branch site (score: 1.92) was activated at the canonical splice acceptor junction. A cryptic splice acceptor site suitable for the cryptic branch site was activated 27 nt downstream of the canonical splice acceptor (exonic region; MaxEnt: 7.13), with the potential to cause partial skipping of exon 50. Alternatively, complete skipping of exon 50 might occur if the stronger splice acceptor site of intron 50 (MaxEnt: 8.93) is used for splicing. The splicing aberration due to disruption of the canonical splice acceptor, and its splicing consequences, might be the cause of non-syndromic genetic deafness. The resulting splicing aberrations do not disrupt the reading frame of the protein but alter the protein region essential for peptide ligand binding with proline-rich ligands, like the SH3 protein. SH3 domains in the protein are essential for intramolecular interactions leading to proper regulation of the enzymes and also for mediating multiprotein complex assemblies. Therefore, even though the frame of the protein is unaffected, essential active regions of the protein are altered, leading to a truncated or non-functional protein. Overall, the analysis approach was successful in unveiling a hypothesis for the effect of the intronic variant on splicing of intron 49 in the MYO15A gene and the resulting pathogenicity.
  • Example 5—Growth Hormone Receptor (GHR)
  • In yet another example embodiment, a reinterpreted case, a splice acceptor variant (G>C) was identified upon screening of intron 8 of the Growth Hormone Receptor gene. The variant, located at the splice acceptor site (AG>AC), disrupted the canonical splice acceptor (MaxEnt: 5.55>−2.52), resulting in idiopathic short stature. Two different variant transcripts for GHR have been reported, one with complete skipping of exon 9 and the other with partial deletion of exon 9. The transcript with partial deletion of exon 9 was formed due to activation of a cryptic splice site 24 nt downstream of the canonical splice acceptor. The occurrence of the splice variants has been reported, but the cause of their formation was not elucidated. The splice strength of the cryptic splice acceptor site (in the exonic region) is greater than that of the canonical splice acceptor site, and the variant of interest disrupts the canonical splice acceptor site, leading to aberrant splicing and a non-functional protein due to premature termination. The variant has been associated with disruption of the canonical splice acceptor and exon 9 skipping, indicating that the downstream cryptic splice acceptor was not used for splicing. However, based on the hypothesis drawn using the analysis model and the experimental evidence, GHR-(1-279), the splice variant formed due to the activation of the cryptic splice acceptor site, is as highly expressed as the canonical transcript. Therefore, upon disruption of the canonical splice acceptor, it is likely that the downstream cryptic splice acceptor would be activated instead of selecting the disrupted canonical splice acceptor site of the intron 10 leading to exon 9 skipping (Table 2). As a result of the variant, the GHR protein product loses 8 a.a. from the region that forms part of the growth hormone binding protein (GHBP) after cleavage from the GHR. Deletion of such an essential region would therefore lead to dysfunction of the protein and might be the cause of the deleteriousness of the variant. Overall, the analysis approach was successful in reinterpreting the role of the deleterious variant (G>C) in GHR intron 8 splicing and in the pathogenicity causing growth hormone insensitivity.
  • TABLE 2
    Gene     Variant position   Sequence                         Score
    AGXT     241813393          AGCAAACCACCCATCTAC[A > C]GGCA    4.01 > −3.94
    MYO15A   18060469           GACCCGAGCCTGGCCCATA[G > A]GCT    3.14 > −5.61
    GHR      42718153           AAATTTTATATGTTTTCAA[G > C]GAT    5.55 > −2.52
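  • The examples above distinguish partial exon deletions that shift the reading frame (31 nt in MAN2B1) from those that leave it intact (24 nt in GHR, 27 nt in MYO15A). A minimal sketch of that arithmetic follows. The function name deletion_consequence is hypothetical, and the sketch ignores codon phase at the exon boundary, so the reported amino acid count is only the number of whole codons removed.

```python
def deletion_consequence(deleted_nt):
    """Classify an exonic deletion by its effect on the reading frame."""
    if deleted_nt <= 0:
        raise ValueError("deleted_nt must be a positive number of nucleotides")
    codons, remainder = divmod(deleted_nt, 3)
    if remainder == 0:
        # A multiple of three removes whole codons and keeps the frame.
        return f"in-frame deletion of {codons} amino acids"
    return "frameshift"

print(deletion_consequence(31))  # frameshift
print(deletion_consequence(24))  # in-frame deletion of 8 amino acids
print(deletion_consequence(27))  # in-frame deletion of 9 amino acids
```

  • An in-frame result does not imply a benign variant: as the GHR and MYO15A cases show, the deleted residues can belong to a functionally essential region.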
  • In an embodiment, discoveries arising from predicted branch site variants were studied. For the experimentally known cases, the PWM-based approach, along with a well-established splice site strength determination tool (MaxEnt), was tested on experimentally determined cases of branchpoint variants causing pathogenicity (NTRK1, DYSF, TH). The output of the analysis approach exactly reflected the experimental findings.
  • Example 6—Neurotrophic Receptor Tyrosine Kinase 1 (NTRK1)
  • In an embodiment, based on the output of the predicted branchpoint variants for the NTRK1 (neurotrophic tyrosine kinase receptor family) gene, a putative branch site sequence 31 bases upstream of the splice acceptor site was found to contain a deleterious T>A variant. The branch site score was drastically reduced after the mutation (5.70>3.17; Table 3), and a cryptic splice acceptor site was activated. The spliced product after mutation contained an inserted intronic segment (137 bp), attributed to usage of the upstream cryptic splice acceptor site. The T>A branch site mutation has thus been proven to be a major cause of congenital insensitivity to pain with anhidrosis (CIPA), and the analysis approach was successful in determining the same.
  • Example 7—Dysferlin (DYSF)
  • In yet another example embodiment, screening identified a deleterious mutation (A>G) in intron 31 of the DYSF gene. The change in branch site scores revealed that the variant disrupts the branch site (Table 3). The deleterious A>G mutation has been experimentally verified to disrupt the branchpoint, leading to failure of lariat formation and skipping of exon 32 of the dysferlin gene, resulting in recessively inherited limb-girdle muscular dystrophy type 2B (LGMD2B) and muscular dystrophies with distal presentations.
  • Example 8—Tyrosine Hydroxylase (TH)
  • In yet another example embodiment, the PWM-based approach identified a putative branch site containing a deleterious T>A variant in intron 11 of TH. It has been proven that the deleterious variant leads to alternative splicing in one of three ways: skipping of exon 12, resulting in the absence of 32 amino acids in the final protein product and making it non-functional; usage of a cryptic branch site, resulting in aberrant splicing; or partial intron retention (36 nucleotides in the mRNA), resulting in incorporation of 12 additional amino acids and rendering the protein non-functional. The branch site score for the predicted branch site was reduced significantly as a result of the variant (Table 3). It has been proven that a branch site mutation (T>A) in the gene of the enzyme tyrosine hydroxylase (TH), two bases upstream of the branchpoint of intron 11, leads to an aberrant protein product causing severe extrapyramidal movement disorder. The alternative splicing leading to intron retention was also verified using the present method.
  • TABLE 3
    Gene     BP position   Sequence            Score
    NTRK1    156843392     GCCC[T > A]GACCT    5.701 > 3.174
    DYSF     71817308      CCACTC[A > G]CTC    5.568 > Disrupted
    TH       2180717       GGGC[T > A]GATGC    4.206 > 1.679
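  • The Score column above records each branch site score before and after the mutation, with "Disrupted" where no score is reported for the mutated site. The sketch below shows one way to read such a cell and compute the score drop. The function names are hypothetical, and treating "Disrupted" as a missing value (None) is an assumption made here for illustration.

```python
def parse_score_change(cell):
    """Split a 'before > after' score cell; 'Disrupted' maps to None."""
    before_text, after_text = (part.strip() for part in cell.split(">"))
    # The tables print negative values with a Unicode minus sign (U+2212).
    before = float(before_text.replace("\u2212", "-"))
    if after_text.lower() == "disrupted":
        return before, None
    return before, float(after_text.replace("\u2212", "-"))

def score_drop(cell):
    """Return before minus after, or None when the site is reported as disrupted."""
    before, after = parse_score_change(cell)
    return None if after is None else round(before - after, 3)

print(score_drop("5.701 > 3.174"))      # NTRK1 -> 2.527
print(score_drop("4.206 > 1.679"))      # TH    -> 2.527
print(score_drop("5.568 > Disrupted"))  # DYSF  -> None
```

  • NTRK1 and TH show the same drop of 2.527. Both variants are a T>A substitution at the fifth position of the 10-mer, which is what a position-wise additive score such as a PWM would produce.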
  • In an embodiment, cases in which disruption of the branchpoint causes a splicing aberration resulting in exon skipping were validated.
  • Example 9—Glycogen Phosphorylase, Muscle Associated (PYGM)
  • In yet another example embodiment, from the predicted deleterious branchpoint variants in the PYGM gene, a deleterious point mutation A>G was discovered in the branch site sequence TCCCTGACAG, located 26 bases upstream of the splice acceptor site of intron 3. This intronic A>G mutation has been experimentally proven to result in skipping of exon 4, leading to McArdle disease (17). Based on amplified PCR products from the natural and the mutated samples, retention of exon 4 was concluded and the variant was classified as a splice acceptor site mutation, but the role of the branch site was not addressed. Based on the proposed analysis approach and the scores obtained for the branch site strengths, exon 4 skipping is hypothesized to be due to the disruption of the canonical branchpoint (4.43 to null), which is 26 bases upstream of the canonical splice acceptor (Table 4). Because the variant lies 26 bases upstream of the canonical splice acceptor, it is unlikely to affect the splice site strength and can therefore be hypothesized to be a branch site mutation. Overall, the analysis approach was capable of determining and classifying an experimentally validated splice mutation as a branchpoint mutation.
  • Example 10—Translocase of Inner Mitochondrial Membrane 8A (TIMM8A)
  • In yet another example embodiment, a deleterious variant in the putative branch site TTTGTGATTC, which had the highest score (3.40), was identified 23 bases upstream of the splice acceptor site in the sole intron of the Translocase Of Inner Mitochondrial Membrane 8 (TIMM8A) gene. TIMM8A/DDP1 gene dysfunction leads to Mohr-Tranebjaerg syndrome, or deafness/dystonia syndrome, and there has been evidence of various missense and nonsense mutations in the coding regions of the exons of TIMM8A. There has been a recent finding of an intronic variant A>C causing X-linked dystonia deafness. The intronic variant in TIMM8A has been proven to cause protein dysfunction, possibly due to splicing aberrations. The cause of the splicing aberrations has not been discussed in terms of branchpoint disruption. On the basis of the branchpoint scores obtained from the prediction tool, it was evident that the splicing aberration was due to branchpoint disruption (Table 4). Overall, the analysis was able to classify a proven intronic variant as a branchpoint mutation on the basis of the change in branch site scores (3.40>null).
  • TABLE 4
    Gene     BP position   Sequence            Score
    PYGM     64525847      TCCCTG[A > G]CAG    4.430 > Disrupted
    TIMM8A   100601671     TTTGTG[A > C]TTC    3.401 > Disrupted
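  • In the 10-mers tabulated above (PYGM, TIMM8A) and in Table 3 (NTRK1, DYSF, TH), the score is reported as "Disrupted" exactly when the substituted base is the adenosine at the seventh position, and as a reduced number otherwise. The sketch below encodes that observation. The fixed branchpoint index and the function name are assumptions drawn from these five rows, not a rule stated by the disclosure.

```python
import re

BRANCHPOINT_INDEX = 6  # assumption: the branchpoint A is the 7th base of the 10-mer

def classify_branch_site_variant(notation):
    """Return 'Disrupted' if the variant replaces the branchpoint A, else 'rescore'."""
    match = re.fullmatch(r"([ACGT]*)\[([ACGT])>([ACGT])\]([ACGT]*)",
                         notation.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized variant notation: {notation!r}")
    left, ref, _alt, right = match.groups()
    site = left + ref + right
    if len(site) != 10 or site[BRANCHPOINT_INDEX] != "A":
        raise ValueError("expected a 10-mer with A at the branchpoint position")
    # The variant index equals the length of the left flank.
    if len(left) == BRANCHPOINT_INDEX:
        return "Disrupted"
    return "rescore"

for row in ("GCCC[T > A]GACCT", "CCACTC[A > G]CTC", "GGGC[T > A]GATGC",
            "TCCCTG[A > G]CAG", "TTTGTG[A > C]TTC"):
    print(row, classify_branch_site_variant(row))
```

  • A "rescore" result means the mutated 10-mer still carries an adenosine at the branchpoint position and can be scored again with the PWM, as was done for NTRK1 and TH.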
  • In accordance with the present embodiments, the PWM-based analysis approach is designed to screen for variants in putative branch sites with ‘A’ as the branchpoint in any given sequence and to determine the effect of a mutation in a branch site on the splicing of the intron. As observed in the aforementioned case studies, the PWM of the present embodiments is able to identify putative branch sites in proximity to the intronic end. Also, the potential of the PWM was cross-checked against experimentally known branch sites identified by other tools, and the outcomes matched accurately. The case studies discussed in detail revealed successful identification of known branchpoint mutations and also led to reinterpretation of certain cases, indicating the cause behind speculated effects on splicing leading to a pathological condition.
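  • The screening step described above, finding candidate branch sites with ‘A’ as the branchpoint near the intronic end, can be sketched as follows. This is an illustration only: the window size, the position of the branchpoint within the 10-mer, the distance convention, and the function name are all assumptions rather than parameters taken from the disclosure.

```python
def candidate_branch_sites(intron, window=50, site_length=10, branchpoint_index=6):
    """List 10-mers near the 3' end of an intron with 'A' at the branchpoint slot.

    Returns (site, distance) pairs, where distance counts bases from the
    branchpoint A to the end of the intron (the last intronic base is 1).
    """
    intron = intron.upper()
    first_start = max(0, len(intron) - window)
    candidates = []
    for start in range(first_start, len(intron) - site_length + 1):
        site = intron[start:start + site_length]
        if site[branchpoint_index] == "A":
            distance = len(intron) - (start + branchpoint_index)
            candidates.append((site, distance))
    return candidates

# Toy intron: the PYGM 10-mer from Table 4 embedded in a made-up flank.
toy_intron = "CCCC" + "TCCCTGACAG" + "TTTTTTTTTCAG"
print(candidate_branch_sites(toy_intron))
# [('TCCCTGACAG', 16), ('CCTGACAGTT', 14)]
```

  • Each candidate would then be scored with the PWM, and the scores of the reference and mutated sequences compared.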
  • The basis for the examples discussed above is the PWM generated in accordance with the present embodiments. The PWM is created using a dataset of branch site 10-mer sequences containing adenosine as the branchpoint. The PWM was able to identify putative branch sites in proximity to the intronic end. The potential of the PWM was cross-checked against experimentally known branch sites identified by other tools, and the outcomes matched accurately. The analysis approach of the present method is focused on screening variants in branch sites with “A” as the branchpoint and studying the impact of the variant on splicing and the resulting pathogenicity. The examples, as observed, were successful in identifying known branchpoint mutations and also led to reinterpretation of certain cases, indicating the cause behind speculated effects on splicing leading to a pathological condition. Upon variant screening, the input dataset showed a particular branchpoint variant in the COL4A5 gene that had been speculated to be a splice site variant; the branch site scores obtained from the PWM before and after the mutation indicated it to be a branchpoint mutation disrupting the branch site. The screening of putative branch site variants in the human genome through the Clinvar.vcf successfully identified 20 cases with deleterious variants (pathogenic/likely pathogenic) as branch site mutations (TABLE 5) and 20 deleterious variants as splice site mutations (TABLE 6). An extra filter, namely a significant change in the branch site score or splice acceptor site score before and after the mutation, was applied in order to pick the branchpoints and splice sites drastically affected by the variation.
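  • To make the PWM construction described above concrete, the following is a minimal sketch of building a position weight matrix from aligned 10-mers and scoring a site. It assumes log2-odds against a uniform background with a pseudocount of 1; this section does not state the normalization actually used, so the scores produced here will not reproduce the tabulated values. The training set is a toy set of five 10-mers taken from Tables 3 and 4, and the function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per position) from equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all training sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for position in range(length):
        counts = Counter(site[position] for site in sites)
        pwm.append({base: math.log2(((counts[base] + pseudocount) / total) / background)
                    for base in BASES})
    return pwm

def score_site(pwm, site):
    """Sum the per-position log-odds for a candidate site."""
    if len(site) != len(pwm):
        raise ValueError("site length does not match the PWM")
    return sum(column[base] for column, base in zip(pwm, site))

# Toy training set (assumption): reference 10-mers from Tables 3 and 4.
training = ["GCCCTGACCT", "CCACTCACTC", "GGGCTGATGC", "TCCCTGACAG", "TTTGTGATTC"]
pwm = build_pwm(training)
reference_score = score_site(pwm, "GCCCTGACCT")  # NTRK1 reference site
mutated_score = score_site(pwm, "GCCCAGACCT")    # NTRK1 with the T>A variant
print(reference_score > mutated_score)  # True
```

  • The extra filter mentioned above would then keep only variants for which the difference between the reference and mutated scores is large.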
  • TABLE 5
    BP Variant
    BP  distance  distance  Intron, Mutated BP/BS GERP
    Variant Position from 
    Figure US20200152288A1-20200514-P00899
    from 
    Figure US20200152288A1-20200514-P00899
    strand Sequence Score Sequence Score mutation? Score
    MTHFR; 11850989 34 18 11, − GTGTGCA 1.89 GTGTGCA 1.89 No/No 0.05
    Chr1: 1185 TGT
    ERCC6; CFTR; MCCC2; XPC; COL3A1;
    INS; Chr10: Chr7: Chr5: Chr3: TRNT1; Chr2: DYSF; NTRK1;
    Chr11: 218 506 117 7089 1420 Chr3: 31 189 Chr2: 718173 Chr1: 1568
    2181256 50681652 117251602 70898299 14209904 3188087 189872204 71817308 156843394
    28 19 32 16 24 27 26 33 31
    30 26 25 19 24 26 43 33 33
    2, − 13/− 19/+ 4/+ 3, − 5, + 34, + 31, + 7, +30
    TTCCGG ACTCCTA TATGTTA CTCTCCA TTACTG A GAGGT GACTTC CCACTC A C GCCCTG
    2.294 2.35 2.49 2.75 4.51 1.67 3.55 5.57 5.70
    TTCCAG ACTCCTA TATGTTA CTCTCCA TTACTG G GAGGT GACTTC CCACTC G C GCCCAG
    AACC TCC TTT GTG TTT AA CAC AATT TC ACCT
    1.859 2.35 2.49 1.93 Disrupted 2.25 3.55 Disrupted 3.17
    No/Yes No/No No/No No/Yes Yes/Yes No/Yes No/No Yes/Yes No/Yes
    −0.67 1.70 −3.00 1.96 −0.63 −5.73 2.46 4.9 −0.25
    COL4A5; COL4A5; TIMM8A; GAA; BRCA1; COG6; PYGM;
    ChrX: ChrX: Chr17: Chr17: Chr13: Chr11: MYBPC TH;
    10786 ChrX: 107 1006 780 4119 402 645 3; Chr11: 218
    107863456 107845097 100601671 78082265 41197857 40273614 64525847 47364835 2187015
    32 17 23 22 38 24 26 22 22
    32 17 23 21 40 24 26 19 24
    30, + 26, + 1, − 7, + 23, − 12, + 3, − 13, − 11, −
    TGCTTC A TCAATA TTTGTG A TCCCTCA AGAATGA TTTGCA A TCCCTG A CACTT GGGCTG
    3.437 2.218 3.401 4.176 1.628 1.673 4.43 3.404 4.206
    TGCTTC G TCAATA TTTGTG C TCCCTCA AGAAAGA TTTGCA G TCCCTG G CACTT GGGCAG
    GTA G CTG TTC GGA ATT CCT CAG CAACA ATGC
    Disrupted Disrupted Disrupted 3.7 −0.899 Disrupted Disrupted 2.961 1.679
    Yes/Yes Yes/Yes Yes/Yes No/Yes No/Yes Yes/Yes Yes/Yes No/Yes No/Yes
    2.49 3.15 2.86 −1.67 1.41 1.09 1.73 3.95 −1.97
    VMA21; 150572076 26 26 1/+ GTTCTG A 4.83 GTTCTG C Disrupted Yes/Yes 1.95
    ChrX: 1505 TTT
    Figure US20200152288A1-20200514-P00899
    indicates data missing or illegible when filed
  • Out of the 20 potential branchpoint mutation cases, three cases of known i.e. experimentally verified branchpoint mutations and two discovery cases of mutations causing splicing aberrations in putative branchsites were successfully identified.
  • TABLE 6
    Predicted Natural splice Mutated Splice New Splice Predicted
    canonical BP Variant  acceptor acceptor acceptor; Pos; branch site; GERP 
    Variant Position; Score distance from Intron, strand sequence; MaxEnt Sequence; MaxEnt MaxEnt Score Pos; Score Score
    HIBCH; 191159383; 3.30 9 3, − CTTCTGTTACAT CTTCTGTTACA TATACCATCTTC Predicted 3.68
    Chr2: 191159365 TTGAATAGAAG; GTTGAATAGAA TGTTACAGTTG; Canonical BP
    191159365; 9.11 used
    RSPH3; RFX6; GHR; ACAD9; AGXT, AGXT,
    Chr6: 15940 Chr6: 117198 Chr5: 42718 Chr3: 12860 Chr2: 24181743 Chr2: 2418133
    159407483; 117198938; 42718120; 128603459; 241817408; 241813365;
    3.27 3.82 2.02 1.93 6.06 6.15
    2 11 1 2 1 2
    2, − 1, + 8, + 1, + 9, + 5,+30
    GTATTTTC TCCCTTCAA AAATTTTAT AAAATATT GAGCCAGGC AGCAAACCA
    TATCACTG CTGGCAAT ATGTTTTC TACTATTT CCCTCCTGC CCCATCTAC
    GTATTTTC TCCCTTCAG AAATTTTAT AAAATATT GAGCCAGGC AGCAAACCA
    TATCACTG CTGGCAAT ATGTTTTC TACTATTT CCCTCCTGC CCCATCTAC
    CGGACGG TTTCTTTAT AATGCTGA AGAAGTT GGCGCTCCG TCCTGTACT
    CCTGATTC CATCCCTTC TTCTGCCC TTCCCATT GCTTCCCAC CGGGCTCC
    TCTAGAGC AGCTG; CCAGTTC; TCCAGAA AGTCA; CAGAAG;
    TCTATC A C Predicted AATTTT A T ATATTT A C GCACTG A GC CCACCC A TC
    TG; canonical BP AT, TA; C; 241817420; T; 241813387;
    159407461; used 42718141; 128603485; 4.374 3.35
    5.34 0.822 5.72 5.26 3.96 4.1
    BRCA2; BRCA2; CRYAB; DYNC2H1; PTEN;
    Chr13: 32920963G Chr13: 32920962A Chr11: 111779693 Chr11: 1031872 Chr10: 89653781G > C
    32920931; 2.87 32920931; 2.87 111779706; 5.50 103187249; 3.85 89653767; 3.89
    1 2 2 1 1
    12, + 12, + 3, − 80, + 1, +
    ATAAAATAATTG ATAAAATAATTG TCCTCATTCTTT AAAAAATTGTT TCCTTAACTAAAGTACTC
    TTTCCTAG GCA; TTTCCT AGGCA; TGGGTT AGGAT; TTTTGACAG G AG ATA; −2.59
    ATAAAATAATTG ATAAAATAATTG TCCTCATTCTTT AAAAAATTGTT TCCTTAACTAAAGTACTC
    TTTCCTAA GCA; - TTTCCT GGGCA; TGGGTT GGGAT; TTTTGACAA G AC ATA; −10.66
    ATATTTTCTCCCC TAACATGGATAT GAACATGGTTTC TTATGAATTTT TGCTATGGGATTTCCTG
    ATTGCAGCAC; TCTCTTAGATT; ATCTCCAGGGA; CTTTATCAGA CAGAAA; 89653820; 8.11
    32928997; 10.37 32920924; 4.43 111779669; 7.95 TC; 103187307; or
    Predicted ACAGTA A CAT; TTCCTC A TTC; TTTTTG A CAA; GTACTC A GAT; 89653780;
    canonical BP used 32920907; 2.11 111779706; 5.5 103187270; 3.14 5.23
    5.03 5.03 5.72 5.78 5.19
    MAN2B1; SMCHD1; NF1; MYO15A; FAH;
    Chr19: 1276327 Chr18: 2705691G >  Chr17: 29548860A >  Chr17: 18060469G > A Chr15: 804644
    12763298; 2.90 2705659; 3.06 29548830; 2.70 18060451; 5.24 80464470; 2.64
    2 1 8 1 6
    14, − 13, + 14, + 49, + 8, +
    GTGGACCCTT TTTTAAAAACTA CTCTTTTTTAAAAA GACCCGAGCCTGGCCC TGAACTCTC
    TTCTGCCC AG AATATTAG GTC; ATTCAGGCT; 4.83 ATAG GCT; 3.14 CCCCATGTA
    GTGGACCCTT TTTTAAAAACTA CTCTTTTTTAAAGA GACCCGAGCCTGGCCC TGAACTCTC
    TTCTGCCC GG AATATTAA GTC; − ATTCAGGCT; −1.93 ATAA GCT; −5.61 CCCCAGGTA
    AACGTTTGAT CTTCCCCTCTTT TGTCTTTCTCTTTT GCTGGCTGCGTGGTTC TCTAATGAA
    CCTGACACAG TATGGAAG CAT; TTAAAGAAT; GCAGGAA; 18060497; CTCTCCCCC
    GGC; 2705729; 4.49 29548860; 8.40 7.13 AGGTA;
    CGGCAC A TCC; ATATTA A GTC; Predicted canonical CCCATA A GCT; Predicted
    12763271; 2705691; 2.34 BP used 18060469; 1.92 canonical BP
    2.89 or used
    5.6 5.87 −1.98 5.04 −7.07
    OTC; OTC; TMPRSS3;
    ChrX:38280275G > A ChrX: 38280273C > G Chr21: 43808641
    38280243; 3.37 38280243; 3.37 43808664; 2.34
    1 3 6
    9, + 9, + 4, −
    TTTCTTTGTTGTG TTTCTTTGTTGTGT TCTTTCTGCACA
    TCATCAG GCT; CATC AGGCT; 7.73 TCGGCCAGTCC
    TTTCTTTGTTGTG TTTCTTTGTTGTGT TCTTTCTGCACA
    TCATCAA GCT; - CATG AGGCT; −3.22 TCAGCCAGTCC;
    CATGGTGTCCCTG CATGGTGTCCCTG CCTTTCTTTCTG
    CTGACAGATT; CTGACAGATT; CACATCAGCCA;
    38280300; 8.30 38280300; 8.30 43808640; 3.74
    GTCATC A AGC; TGTGTC A TGA; Predicted
    38280274; 2.51 38280271; 2.80 canonical BP
    used
    5.33 1.54 1.66
  • In addition, variant screening within 15 nt upstream of the intron/exon junction confirmed two experimentally proven cases (Ornithine Carbamoyltransferase (OTC) and Mannosidase Alpha Class 2B Member 1 (MAN2B1)), in which the variant disrupts the canonical splice acceptor site, leading to activation of a cryptic splice acceptor site and a cryptic branch site. The three known cases of branch site mutations and the two known cases of splice site mutations (NTRK1, DYSF, TH; OTC, MAN2B1) confirmed the potency of the analysis model in identifying potential branch sites in the introns, while the two discovery cases each of branch site mutations and splice site mutations (PYGM, TIMM8A; AGXT, MYO15A) confirmed the potency of the analysis approach in categorizing intronic variants as branchpoint or splice site variants based on the activation of a cryptic branchpoint or cryptic splice site. The analysis approach was also tested on the negative set, i.e. branchpoint variants that disrupt the branchpoint but cause no pathogenicity. This showed that although the predicted branchpoint identified by the PWM tool was disrupted, alternative branchpoints compensated for the disruption by enabling normal splicing of the intron. The analysis approach is therefore successful in determining branchpoint variants and in assessing their pathogenicity based on the availability of an alternative branchpoint that could rescue normal splicing.
  • As observed in the present examples, the present system and method proved successful in identifying variants that caused disruption of a branchpoint and led to creation of a new splice acceptor at that site (Component of Oligomeric Golgi Complex 6 (COG6), Glucosidase Alpha, Acid (GAA)). It was also successful in identifying a putative splice acceptor site downstream of the canonical site upon creation of a new branchpoint at the canonical splice acceptor site as a result of the variation. In total, 40 variants with the potential to be a branch site or splice site mutation were identified, and their role in causing splicing aberration was predicted with the aid of the designed tool. It was observed that a few of the mutations did not affect the frame of the protein but were highly deleterious; for such cases, attributes like protein structure and function were checked. It was observed that for AGXT, Acyl-CoA Dehydrogenase Family Member 9 (ACAD9), GHR and MYO15A, although the single nucleotide polymorphisms (SNPs) did not cause frame changes in the protein, they caused deletion of part of the active site of the protein, affecting or abolishing its function and leading to a disease condition. It was also noted that for certain cases like phosphatase and tensin homologue (PTEN), where exon skipping or partial exon deletion was predicted, the protein is either truncated or rendered non-functional by deletion of its active site. Overall, SNPs that affect the translational frame of the protein lead to pathogenicity most likely due to a truncated protein product, and SNPs that do not affect the translational frame lead to pathogenicity due to alteration of core regions of the protein. The dataset obtained from screening putative branchpoint mutations was compared against the Human splicing factor dataset of identified putative branchpoints and against the predicted results for identified branchpoint variants, which confirmed that the PWM-based analysis model is reliable for branchpoint prediction and for investigating splicing aberrations resulting from a branch site mutation or splice site mutation.
  • Therefore, the PWM-based approach is designed to screen for variants in putative branch sites with ‘A’ as the branchpoint in any given sequence and to determine the effect of a mutation in a branch site on the splicing of the intron.
  • The embodiments of the present system and method are capable of identifying branchpoint variants and, along with other established tools that determine various aspects of splice sites, were successful in offering a more detailed biological explanation of the consequences of mutations. Also, the discovery cases identified using the present embodiments hold strong potential for unveiling the cause behind known pathogenic conditions and provide a basis for therapeutic developments. Prediction of putative branchpoint or splice site variants in an intron can lay the foundation for the identification of possible genotype-based therapies using exon-skipping techniques (TABLE 7).
  • TABLE 7
    Chromosome  Gene        Intron  Identified BP  BP Position  Position  Score  Predicted Alternative BP  Predicted BP Position  Predicted BP score
    2           DYSF†,*     31      CCACTCACTC     71817308     −33       5.568  - - -
    3           XPC†,*,‡    3       TTACTGATTT     14209904     −24       4.51   - - -
    5           FBN2        30      CTCTACATTC     127680226    −24       2.052  TATATCAACC                −36                    2.637
    9           COL5A1†,‡   32      AGAGTGACTG     137686901    −27       3.246  TGACTGACCA                −23                    4.677
    11          TH†,‡       11      GGGCTGATGC     2187015      −22       4.206  - - -
    13          RB1         23      TTACTAATTG     49047470     −26       3.608  TATTTCATCT                −15                    4.383
    16          LCAT†,*,‡   4       GCCCTGACCC     67976510     −20       5.743  - - -
    16          PMM2        2       ATTCTAAGTG     8898599      −25       3.096  - - -
    16          PMM2        7       GCCTTCATCT     8941558      −23       4.917  - - -
    16          TSC2†,‡     39      GGCGTGACCA     2138031      −18       3.761  - - -
    17          GH1         3       CAGCACAGCC     61995310     −26       2.026  - - -
    17          ITGB4       31      TGGCTCACTC     73748510     −17       5.786  - - -
    18          NPC1†,‡     6       CCACTAATGC     21137182     −28       3.201  TTCTTCACTT                −15                    5.201
    19          LDLR†,‡     9       GCGCTGATGC     11224186     −25       4.116  - - -
    X           F9          2       CCGTTAATTT     138619496    −25       2.85   - - -
    X           L1CAM       19      TATCCAAGTC     153131293    −19       1.301  CAAGTCACTG                −15                    3.642
                                                                                 GGCTCTATCC                −24                    2.071
  • †: Branchpoints predicted by Human Splicing Finder (HSF)
  • *: Branchpoints confirmed by Mercer et al.
  • ‡: Branchpoint variants predicted by Kralovicova, J. et al.
  • - - -: Same branchpoint predicted by other tools and present tool of interest
  • Identified BP: Branchpoints predicted/confirmed by other tools
  • Predicted alternative BP: Predicted branchpoint with a higher potential by present prediction tool
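  • The "Predicted alternative BP" column of Table 7 lists branchpoints that the present tool scores higher than the branchpoint identified by other tools. A minimal sketch of that comparison is given below, using the L1CAM row as input. The function name and the tuple layout are hypothetical.

```python
def stronger_alternatives(identified_score, candidates):
    """Return candidates scoring above the identified branchpoint, strongest first.

    Each candidate is a (sequence, position relative to the acceptor, score) tuple.
    """
    stronger = [candidate for candidate in candidates if candidate[2] > identified_score]
    return sorted(stronger, key=lambda candidate: candidate[2], reverse=True)

# L1CAM intron 19 (Table 7): identified branchpoint TATCCAAGTC scores 1.301.
l1cam_candidates = [("GGCTCTATCC", -24, 2.071), ("CAAGTCACTG", -15, 3.642)]
for sequence, position, score in stronger_alternatives(1.301, l1cam_candidates):
    print(sequence, position, score)
```

  • An empty result corresponds to the rows marked "- - -", where the present tool and the other tools agree on the same branchpoint.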
  • FIG. 4 is a block diagram of an exemplary computer system 401 for implementing embodiments consistent with the present disclosure. The computer system 401 may be implemented standalone or in combination with components of the system 102 (FIG. 1). Variations of computer system 401 may be used for implementing the devices included in this disclosure. Computer system 401 may comprise a central processing unit (“CPU” or “hardware processor”) 402. The hardware processor 402 may comprise at least one data processor for executing program components for executing user- or system-generated requests. The processor may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc. The processor may include a microprocessor, such as AMD Athlon™, Duron™ or Opteron™, ARM's application, embedded or secure processors, IBM PowerPC™, Intel's Core, Itanium™, Xeon™, Celeron™ or other line of processors, etc. The processor 402 may be implemented using mainframe, distributed processor, multi-core, parallel, grid, or other architectures. Some embodiments may utilize embedded technologies like application specific integrated circuits (ASICs), digital signal processors (DSPs), Field Programmable Gate Arrays (FPGAs), etc.
  • Processor 402 may be disposed in communication with one or more input/output (I/O) devices via I/O interface 403. The I/O interface 403 may employ communication protocols/methods such as, without limitation, audio, analog, digital, monoaural, RCA, stereo, IEEE-1394, serial bus, universal serial bus (USB), infrared, PS/2, BNC, coaxial, component, composite, digital visual interface (DVI), high-definition multimedia interface (HDMI), RF antennas, S-Video, VGA, IEEE 802.11 a/b/g/n/x, Bluetooth, cellular (e.g., code-division multiple access (CDMA), high-speed packet access (HSPA+), global system for mobile communications (GSM), long-term evolution (LTE), WiMax, or the like), etc.
  • Using the I/O interface 403, the computer system 401 may communicate with one or more I/O devices. For example, the input device 404 may be an antenna, keyboard, mouse, joystick, (infrared) remote control, camera, card reader, fax machine, dongle, biometric reader, microphone, touch screen, touchpad, trackball, sensor (e.g., accelerometer, light sensor, GPS, gyroscope, proximity sensor, or the like), stylus, scanner, storage device, transceiver, video device/source, visors, etc.
  • Output device 405 may be a printer, fax machine, video display (e.g., cathode ray tube (CRT), liquid crystal display (LCD), light-emitting diode (LED), plasma, or the like), audio speaker, etc. In some embodiments, a transceiver 406 may be disposed in connection with the processor 402. The transceiver may facilitate various types of wireless transmission or reception. For example, the transceiver may include an antenna operatively connected to a transceiver chip (e.g., Texas Instruments WiLink WL1283, Broadcom BCM4750IUB8, Infineon Technologies X-Gold 618-PMB9800, or the like), providing IEEE 802.11a/b/g/n, Bluetooth, FM, global positioning system (GPS), 2G/3G HSDPA/HSUPA communications, etc.
  • In some embodiments, the processor 402 may be disposed in communication with a communication network 408 via a network interface 407. The network interface 407 may communicate with the communication network 408. The network interface may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), transmission control protocol/internet protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 408 may include, without limitation, a direct interconnection, local area network (LAN), wide area network (WAN), wireless network (e.g., using Wireless Application Protocol), the Internet, etc. Using the network interface 407 and the communication network 408, the computer system 401 may communicate with devices 409 and 410. These devices may include, without limitation, personal computer(s), server(s), fax machines, printers, scanners, various mobile devices such as cellular telephones, smartphones (e.g., Apple iPhone, Blackberry, Android-based phones, etc.), tablet computers, eBook readers (Amazon Kindle, Nook, etc.), laptop computers, notebooks, gaming consoles (Microsoft Xbox, Nintendo DS, Sony PlayStation, etc.), or the like. In some embodiments, the computer system 401 may itself embody one or more of these devices.
  • In some embodiments, the processor 402 may be disposed in communication with one or more memory devices (e.g., RAM 713, ROM 714, etc.) via a storage interface 412. The storage interface may connect to memory devices including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as serial advanced technology attachment (SATA), integrated drive electronics (IDE), IEEE-1394, universal serial bus (USB), fiber channel, small computer systems interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, redundant array of independent discs (RAID), solid-state memory devices, solid-state drives, etc. Variations of memory devices may be used for implementing, for example, any databases utilized in this disclosure.
  • The memory devices may store a collection of program or database components, including, without limitation, an operating system 416, user interface application 417, user/application data 418 (e.g., any data variables or data records discussed in this disclosure), etc. The operating system 416 may facilitate resource management and operation of the computer system 401. Examples of operating systems include, without limitation, Apple Macintosh OS X, Unix, Unix-like system distributions (e.g., Berkeley Software Distribution (BSD), FreeBSD, NetBSD, OpenBSD, etc.), Linux distributions (e.g., Red Hat, Ubuntu, Kubuntu, etc.), IBM OS/2, Microsoft Windows (XP, Vista/7/8, etc.), Apple iOS, Google Android, Blackberry OS, or the like. User interface 417 may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 401, such as cursors, icons, check boxes, menus, scrollers, windows, widgets, etc. Graphical user interfaces (GUIs) may be employed, including, without limitation, Apple Macintosh operating systems' Aqua, IBM OS/2, Microsoft Windows (e.g., Aero, Metro, etc.), Unix X-Windows, web interface libraries (e.g., ActiveX, Java, Javascript, AJAX, HTML, Adobe Flash, etc.), or the like.
  • In some embodiments, computer system 401 may store user/application data 418, such as the data, variables, records, etc. as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle or Sybase. Alternatively, such databases may be implemented using standardized data structures, such as an array, hash, linked list, structured text file (e.g., XML), table, or as object-oriented databases (e.g., using ObjectStore, Poet, Zope, etc.). Such databases may be consolidated or distributed, sometimes among the various computer systems discussed above in this disclosure. It is to be understood that the structure and operation of any computer or database component may be combined, consolidated, or distributed in any working combination.
  • Additionally, in some embodiments, the server, messaging and instructions transmitted or received may emanate from hardware, including operating system, and program code (i.e., application code) residing in a cloud implementation. Further, it should be noted that one or more of the systems and methods provided herein may be suitable for cloud-based implementation. For example, in some embodiments, some or all of the data used in the disclosed methods may be sourced from or stored on any cloud computing platform.
  • The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
  • It is to be understood that the scope of the protection is extended to such a program and, in addition, to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed, including, e.g., any kind of computer such as a server or a personal computer, or any combination thereof. The device may also include means which could be, e.g., hardware means such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software modules located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
  • The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims (when included in the specification), the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
  • Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
  • It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.

Claims (15)

What is claimed is:
1. A processor-implemented method comprising:
receiving genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts; classifying the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information;
evaluating effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant, wherein the evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises:
identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using MaxEnt score;
determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and
evaluating strength of an identified natural branchpoint in the classified region using Position Weight Matrix (PWM) evaluator in response to determining that the new splice acceptor site region is being created; and
predicting pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on the pre-mRNA splicing.
2. The processor implemented method of claim 1, wherein the at least one candidate variant is classified
as occurring in the splice acceptor site region having genomic coordinate between 15 nucleotides upstream to 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts, and
as occurring in the branch site region having genomic coordinate between 50 nucleotides to 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts.
3. The processor implemented method of claim 1, wherein the MaxEnt score is splice site strength determination tool for calculating strength or weakening of the splice acceptor site, and wherein the MaxEnt score is assigned based on the effect of the at least one candidate variant on affected natural splice acceptor site region.
4. The processor implemented method of claim 1, wherein the PWM evaluator is generated using experimentally determined human branch sites, wherein generating the PWM evaluator comprises:
filtering the determined human branch sites for 10mers having CA as a branchpoint;
aligning the filtered branch sites to calculate frequency of each of nucleotide at each position of the 10mers in the filtered branch sites;
normalizing the calculated frequency using a background frequency for each of the nucleotide at each position of the 10mers; and
constructing a (m*n) matrix using the normalized frequency to obtain the PWM, and wherein constructing the (m*n) matrix comprises converting each of the normalized frequencies to log odds values and constructing the (m*n) matrix into the PWM evaluator using the log odds values.
5. The processor implemented method of claim 4, wherein the generated PWM evaluator evaluates strength of branchpoint based on a threshold score, and wherein the threshold score is determined using a plurality of branch site scores obtained for branch sites with ‘A’ as the branchpoint.
6. The processor implemented method of claim 1, wherein the step of evaluating effect of the at least one candidate variant on the splice acceptor site region for a new splice acceptor site being created comprises:
determining presence or absence of natural branchpoint in sequence range of 15 nucleotide to 50 nucleotide of the new splice acceptor site region being active during the pre-mRNA splicing, and
based on the determined presence or absence of the natural branchpoint,
evaluating strength of the natural branchpoint using the PWM evaluator and identifying the at least one candidate variant as pathogenic based on the evaluated strength of the natural branchpoint; or
screening for an alternative branchpoint using the PWM evaluator and predicting the at least one candidate as a pathogenic based on the evaluated strength of the alternative branchpoint,
wherein the method further comprises:
during the absence of the alternative branchpoint,
determining status of the natural splice acceptor site region, wherein the status comprising disruptive natural splice acceptor site region or non-disruptive natural splice acceptor site region; and
predicting the at least one candidate variant as pathogenic or non-pathogenic based on the determined status.
7. The processor implemented method of claim 1, wherein the step of evaluating effect of the at least one candidate variant on the splice acceptor site region for no new splice acceptor site being created comprises:
determining effect of the at least one candidate variant on the natural branchpoint, and identifying level of strength of natural branch site using the PWM evaluator based on the determined effect;
screening for an alternative splice acceptor site region in sequence range having 50 nucleotide upstream and 50 nucleotide downstream of the at least one candidate variant and performing a comparison of strength of the alternative splice acceptor site region and weakened natural splice acceptor site region; and
determining presence of a new branchpoint being created and performing comparison of strength of the new branchpoint and the natural branchpoint,
wherein the method further comprises:
based on the identified level of the strength of natural branch site, identifying the at least one candidate variant as a non-pathogenic; or identifying the at least one variant candidate as a pathogenic or non-pathogenic based on a screened alternative branchpoint in the sequence 50 nucleotide to 15 nucleotide upstream to the natural splice acceptor site region,
wherein the method further comprises:
based on the comparison of strength of alternative splice acceptor site region and weakened natural splice acceptor site:
predicting the at least one candidate variant as non-pathogenic; or
determining presence of natural branchpoint in sequence range of 15 nucleotide to 50 nucleotide to the splice acceptor site region being active during the mRNA splicing and comparing strength of the natural branchpoint with the predefined threshold, wherein based on the determined presence and the comparison,
predicting the at least one candidate variant as pathogenic or non-pathogenic based on an alternative branchpoint screened in the sequence range of 50 nucleotides to 15 nucleotides upstream of the alternative splice acceptor site.
8. The processor implemented method of claim 7, further comprising based on the comparison of strength of the new branchpoint and the natural branchpoint,
predicting the at least one candidate variant as non-pathogenic; or
determining presence of natural branchpoint in the range of 15 nucleotide to 50 nucleotide upstream to the splice acceptor site region being active during the pre-mRNA splicing and comparing strength of the natural branchpoint with the predefined threshold.
9. The processor implemented method of claim 7, based on the determined presence of natural branchpoint and comparison of strength of the natural branchpoint with the predefined threshold, predicting the at least one candidate variant as pathogenic or non-pathogenic.
10. The processor implemented method of claim 1, wherein the step of evaluating effect of the at least one candidate variant on the branch site for the new splice acceptor site being created comprises:
determining presence of an alternative branchpoint in sequence range having 50 nucleotides to 15 nucleotides upstream of the new splice acceptor site; and
predicting the at least one variant to be pathogenic or non-pathogenic based on the presence of the alternative branchpoint.
11. The processor implemented method of claim 1, wherein the step of evaluating effect of the at least one candidate variant on the branch site for no new splice acceptor site being created comprises:
screening for natural branchpoint in sequence range having 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site; and
determining level of strength of the branch site using the PWM evaluator, wherein determining the level of strength is due to the at least one candidate variant affecting the screened natural branchpoint,
wherein the method further comprises:
based on the determined level of strength of the branch site,
predicting the at least one candidate variant as pathogenic; or
predicting the at least one candidate variant as pathogenic or non-pathogenic based on an alternative branchpoint screened in sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site region.
12. A system comprising:
a memory storing instructions;
one or more hardware processors coupled to the memory, wherein the one or more hardware processors are configured by the instructions to:
receive genomic position information of at least one candidate variant of gene transcripts and coordinates information of the gene transcripts;
classify the at least one candidate variant into one of a splice acceptor site region and a branch site region based on the coordinates information of the gene transcripts and the genomic position information of at least one candidate variant;
evaluate effect of the at least one candidate variant on pre-mRNA splicing, based on a classified region from the classification of the at least one candidate variant, wherein the evaluating the effect of the at least one candidate variant on the pre-mRNA splicing comprises:
identifying weakening of a natural splice acceptor site in the classified region, due to the at least one candidate variant, using MaxEnt score;
determining that a new splice acceptor site region is being created due to the weakened natural splice acceptor site; and
evaluating strength of an identified natural branchpoint in the classified region using PWM evaluator in response to determining that the new splice acceptor site region is being created; and
predict pathogenicity of the at least one candidate variant based on the evaluated effect of the at least one candidate variant on pre-mRNA splicing.
13. The system of claim 12, wherein the at least one candidate variant is classified
as occurring in the splice acceptor site region having genomic coordinate between 15 nucleotides upstream to 3 nucleotides downstream of a natural intron-exon splice acceptor junction of the gene transcripts, and
as occurring in the branch site region having genomic coordinate between 50 nucleotides to 15 nucleotides upstream of a natural splice acceptor junction of the gene transcripts,
wherein evaluating the effect of the at least one candidate variant on the splice acceptor site region for a new splice acceptor site being created comprises:
determining presence or absence of natural branchpoint in sequence range of 15 nucleotide to 50 nucleotide of the new splice acceptor site region being active during the pre-mRNA splicing, and based on the determined presence or absence of the natural branchpoint,
evaluating strength of the natural branchpoint using the PWM evaluator and identifying the at least one candidate variant as pathogenic based on the evaluated strength of the natural branchpoint; or
screening for an alternative branchpoint using the PWM evaluator and predicting the at least one candidate as a pathogenic based on the evaluated strength of the alternative branchpoint,
wherein the one or more hardware processors are further configured by the instructions during the absence of the alternative branchpoint to:
determine status of the natural splice acceptor site region, wherein the status comprising disruptive natural splice acceptor site region or non-disruptive natural splice acceptor site region; and
predict the at least one candidate variant as pathogenic or non-pathogenic based on the determined status.
14. The system of claim 12, wherein evaluating the effect of the at least one candidate variant on the splice acceptor site region for no new splice acceptor site being created comprises:
determining effect of the at least one candidate variant on the natural branchpoint, and identifying level of strength of natural branch site using the PWM evaluator based on the determined effect;
screening for an alternative splice acceptor site region in sequence range having 50 nucleotide upstream and 50 nucleotide downstream of the at least one candidate variant and performing a comparison of strength of the alternative splice acceptor site region and weakened natural splice acceptor site region; and
determining presence of a new branchpoint being created and performing comparison of strength of the new branchpoint and the natural branchpoint,
wherein the one or more hardware processors are further configured by the instructions based on the identified level of the strength of natural branch site to:
identify the at least one candidate variant as a non-pathogenic; or
identify the at least one variant candidate as a pathogenic or non-pathogenic based on a screened alternative branchpoint in the sequence 50 nucleotide to 15 nucleotide upstream to the natural splice acceptor site region,
wherein the one or more hardware processors are further configured by the instructions based on the comparison of strength of alternative splice acceptor site region and weakened natural splice acceptor site to:
predict the at least one candidate variant as non-pathogenic; or
determine presence of natural branchpoint in sequence range of 15 nucleotide to 50 nucleotide to the splice acceptor site region being active during the mRNA splicing and comparing strength of the natural branchpoint with the predefined threshold,
wherein the one or more hardware processors are further configured by the instructions based on the comparison of strength of the new branchpoint and the natural branchpoint to:
predict the at least one candidate variant as non-pathogenic; or
determine presence of natural branchpoint in the range of 15 nucleotide to 50 nucleotide upstream to the splice acceptor site region being active during the mRNA splicing and comparing strength of the natural branchpoint with the predefined threshold,
further comprising based on the determined presence and the comparison, predicting the at least one candidate variant as pathogenic or non-pathogenic based on an alternative branchpoint screened in the sequence range of 50 nucleotides to 15 nucleotides upstream of the alternative splice acceptor site,
wherein the one or more hardware processors are further configured based on the determined presence of natural branchpoint and comparison of strength of the natural branchpoint with the predefined threshold to:
predict the at least one candidate variant as pathogenic or non-pathogenic.
15. The system of claim 12, wherein evaluating the effect of the at least one candidate variant on the branch site for the new splice acceptor site being created comprises:
determining presence of an alternative branchpoint in sequence range having 50 nucleotides to 15 nucleotides upstream of the new splice acceptor site; and
predicting the at least one variant to be pathogenic or non-pathogenic based on the presence of the alternative branchpoint,
wherein the one or more hardware processors are further configured by the instructions to:
evaluate the effect of the at least one candidate variant on the branch site for no new splice acceptor site being created comprises:
screen for natural branchpoint in sequence range having 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site; and
determine level of strength of the branch site using the PWM evaluator, wherein determining the level of strength is due to the at least one candidate variant affecting the screened natural branchpoint,
wherein the one or more hardware processors are further configured based on the determined level of strength of the branch site to:
predict the at least one candidate variant as pathogenic; or
predict the at least one candidate variant as pathogenic or non-pathogenic based on an alternative branchpoint screened in sequence range of 50 nucleotides to 15 nucleotides upstream of the natural splice acceptor site region.
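The region classification recited in claim 2 and the PWM construction recited in claim 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names (classify_region, build_pwm, score_10mer), the signed-offset convention relative to the acceptor junction, the assignment of the shared 15-nucleotide boundary to the splice acceptor site region, the index of the branchpoint 'A' within the 10mer (bp_index), the uniform background frequencies, the pseudocount, and the base-2 logarithm are all assumptions made for illustration; the claims do not specify them.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"


def classify_region(variant_pos, junction_pos, strand="+"):
    """Classify a variant by its signed distance from the acceptor junction.

    Assumption: negative offsets are upstream (intronic) of the junction and
    positive offsets are downstream (exonic), after correcting for strand.
    """
    sign = 1 if strand == "+" else -1
    offset = sign * (variant_pos - junction_pos)
    # 15 nt upstream to 3 nt downstream of the natural acceptor junction.
    if -15 <= offset <= 3:
        return "splice_acceptor_site_region"
    # 50 nt to 15 nt upstream; the shared boundary at 15 is given to the
    # acceptor region above (an assumption, the claims leave it open).
    if -50 <= offset < -15:
        return "branch_site_region"
    return None


def build_pwm(branch_sites, bp_index=7, background=None, pseudocount=0.5):
    """Build a log-odds PWM (10 columns x 4 nucleotides) from 10mers.

    bp_index is the hypothetical 0-based position of the branchpoint 'A';
    only 10mers with 'C' immediately before that 'A' are kept.
    """
    background = background or {n: 0.25 for n in NUCLEOTIDES}
    kept = []
    for site in branch_sites:
        s = site.upper()
        if (len(s) == 10 and set(s) <= set(NUCLEOTIDES)
                and s[bp_index - 1:bp_index + 1] == "CA"):
            kept.append(s)
    if not kept:
        raise ValueError("no 10mers with CA at the assumed branchpoint position")

    total = len(kept) + pseudocount * len(NUCLEOTIDES)
    pwm = []
    for pos in range(10):
        counts = Counter(s[pos] for s in kept)
        column = {}
        for n in NUCLEOTIDES:
            # The pseudocount avoids log(0) for unseen nucleotides (assumption).
            freq = (counts[n] + pseudocount) / total
            # Normalize by the background frequency, then take the log odds.
            column[n] = math.log2(freq / background[n])
        pwm.append(column)
    return pwm


def score_10mer(pwm, seq):
    """Sum the log-odds values of a 10mer across the PWM columns."""
    return sum(column[base] for column, base in zip(pwm, seq.upper()))
```

A candidate 10mer scored with score_10mer could then be compared against a threshold score of the kind recited in claim 5; how that threshold is derived from the branch site scores is not shown here.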
US16/504,184 2018-07-07 2019-07-05 System and method for predicting effect of genomic variations on pre-mrna splicing Pending US20200152288A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN201821025433 2018-07-07
IN201821025433 2018-07-07

Publications (1)

Publication Number Publication Date
US20200152288A1 (en) 2020-05-14

Family

ID=67184885

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/504,184 Pending US20200152288A1 (en) 2018-07-07 2019-07-05 System and method for predicting effect of genomic variations on pre-mrna splicing

Country Status (4)

Country Link
US (1) US20200152288A1 (en)
EP (1) EP3745406A1 (en)
JP (1) JP7453754B2 (en)
CN (1) CN110689928A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113215248A (en) * 2021-06-25 2021-08-06 中国人民解放军空军军医大学 MyO15A gene mutation detection kit related to sensorineural deafness
WO2022059886A1 (en) * 2020-09-21 2022-03-24 주식회사 쓰리빌리언 System for predicting pathogenicity of genetic mutation by using machine learning
WO2022203705A1 (en) * 2021-03-26 2022-09-29 Genome International Corporation A precision medicine portal for human diseases
CN115579060A (en) * 2022-12-08 2023-01-06 国家超级计算天津中心 Gene locus detection method, device, equipment and medium
WO2023183422A1 (en) * 2022-03-24 2023-09-28 Genome International Corporation Identifying genome features in health and disease

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6931860B2 (en) * 2019-02-08 2021-09-08 株式会社Zenick Pre-mRNA analysis method, information processing device, computer program
CN113035272B (en) * 2021-03-08 2023-09-05 深圳市新合生物医疗科技有限公司 Method and device for obtaining immunotherapeutic new antigen based on intein cell variation
CN113241123B (en) * 2021-04-19 2024-02-02 西安电子科技大学 Method and system for fusing multiple characteristic recognition enhancers and intensity thereof
CN113838522A (en) * 2021-09-14 2021-12-24 浙江赛微思生物科技有限公司 Evaluation processing method for influence of gene mutation sites on splicing possibility
CN114613431A (en) * 2021-11-22 2022-06-10 赛业(广州)生物科技有限公司 Prediction method, system and platform for influencing mRNA splicing based on base mutation
CN115691662B (en) * 2022-11-08 2023-06-23 温州谱希医学检验实验室有限公司 Method and system for sequencing myopia/high myopia-related SNP risks based on allosteric probability

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130096838A1 (en) * 2011-06-10 2013-04-18 William Fairbrother Gene Splicing Defects
WO2013017982A1 (en) * 2011-08-01 2013-02-07 Basf Plant Science Company Gmbh Method for identification and isolation of terminator sequences causing enhanced transcription
US20140199698A1 (en) * 2013-01-14 2014-07-17 Peter Keith Rogan METHODS OF PREDICTING AND DETERMINING MUTATED mRNA SPLICE ISOFORMS
US10266828B2 (en) * 2013-12-16 2019-04-23 Syddansk Universitet RAS exon 2 skipping for cancer treatment
LU93116B1 (en) * 2016-06-22 2018-01-24 Univ Luxembourg Means and methods for treating parkinson's disease

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Desmet, F. O., Hamroun, D., Lalande, M., Collod-Béroud, G., Claustres, M., & Béroud, C. (2009). Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic acids research, 37(9), e67, p.1-14. (Year: 2009) *
François-Olivier Desmet, Dalil Hamroun, Gwenaëlle Collod-Béroud, Mireille Claustres, Christophe Béroud. Bioinformatics identification of splice site signals and prediction of mutation effects. RM Mohan. Research Advances In Nucleic Acids Research, Global Research Network Publishers, pp.1-14. 2010. (Year: 2010) *
Furdon, P. J., & Kole, R. (1986). Inhibition of splicing but not cleavage at the 5'splice site by truncating human beta-globin pre-mRNA. Proceedings of the National Academy of Sciences, 83(4), 927-931. (Year: 1986) *
Hubbard, T. J., Aken, B. L., Beal, K., Ballester, B., Cáccamo, M., Chen, Y., ... & Birney, E. (2007). Ensembl 2007. Nucleic acids research, 35(suppl_1), D610-D617. (Year: 2007) *
Sheth, N., Roca, X., Hastings, M. L., Roeder, T., Krainer, A. R., & Sachidanandam, R. (2006). Comprehensive splice-site analysis using comparative genomics. Nucleic acids research, 34(14), 3955–3967. (Year: 2006) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022059886A1 (en) * 2020-09-21 2022-03-24 주식회사 쓰리빌리언 System for predicting pathogenicity of genetic mutation by using machine learning
WO2022203705A1 (en) * 2021-03-26 2022-09-29 Genome International Corporation A precision medicine portal for human diseases
WO2022203704A1 (en) * 2021-03-26 2022-09-29 Genome International Corporation A unified portal for regulatory and splicing elements for genome analysis
CN113215248A (en) * 2021-06-25 2021-08-06 中国人民解放军空军军医大学 MyO15A gene mutation detection kit related to sensorineural deafness
WO2023183422A1 (en) * 2022-03-24 2023-09-28 Genome International Corporation Identifying genome features in health and disease
CN115579060A (en) * 2022-12-08 2023-01-06 国家超级计算天津中心 Gene locus detection method, device, equipment and medium

Also Published As

Publication number Publication date
EP3745406A1 (en) 2020-12-02
JP2020038621A (en) 2020-03-12
JP7453754B2 (en) 2024-03-21
CN110689928A (en) 2020-01-14

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: ADVISORY ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED