WO2020150666A1

WO2020150666A1 - Methods for accurate and sensitive unveiling of rna splicing events and applications thereof

Info

Publication number: WO2020150666A1
Application number: PCT/US2020/014191
Authority: WO
Inventors: Yi Xing; Zijun ZHANG
Original assignee: The Regents Of The University Of California
Priority date: 2019-01-17
Filing date: 2020-01-17
Publication date: 2020-07-23

Abstract

Unveiling of splicing events from RNA-seq data and applications thereof are described. Generally, systems comprising computational models and statistical analyses are performed to unveil splicing events, which can be used to develop research tools, biomarkers, diagnostics, and medicaments.

Description

METHODS FOR ACCURATE AND SENSITIVE UNVEILING OF RNA SPLICING EVENTS AND APPLICATIONS THEREOF

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application Ser. No. 62/793,751, entitled “Methods for Accurate and Sensitive Unveiling of RNA Splicing Events and Applications Thereof,” filed January 17, 2019, which is incorporated herein by reference in its entirety.

REFERENCE TO A SEQUENCE LISTING SUBMITTED ELECTRONICALLY VIA EFS-WEB

[0001.1] The instant application contains a Sequence Listing which has been filed electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on March 24, 2020, is named 05659 Sequence Listing.txt and is 1,159 bytes in size.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

[0002] This invention was made with Government support under contracts GM088342 and GM117624 awarded by the National Institutes of Health. The Government has certain rights in the invention.

FIELD OF THE INVENTION

[0003] The invention is generally directed to processes related to RNA transcripts, and more specifically to methods and systems for unveiling RNA alternative splicing events, including novel and/or rare transcript isoforms, and applications thereof.

BACKGROUND

[0004] Ribonucleic acid (RNA) is a polymeric molecule existing in biological cells in several forms and having various functions. Much like deoxyribonucleic acid (DNA), RNA is a chain of nucleotides that form a sequence. Each nucleotide is composed of a nitrogenous base, a 5-carbon sugar (e.g., ribose), and at least one phosphate group. The four canonical bases of RNA are adenine (A), cytosine (C), guanine (G), and uracil (U); however, several other noncanonical bases are often incorporated into the polymer, such as, for example, inosine (I) or methyl-7-guanosine (m7G).

[0005] RNA transcripts are derived from genomic DNA and provide various functions within a cell. In one function, RNA is the mediator of gene expression: genes that are stored within DNA are transcribed into messenger RNA (mRNA) which is then translated into protein to perform various tasks. A single gene can give rise to several RNA transcripts by a mechanism called alternative splicing. Alternative splicing can selectively choose which sequences of the gene are to be combined and expressed. In many instances, several alternative transcripts of a single gene are expressed simultaneously.

[0006] Alternative splicing is a major mechanism for generating a diverse gene expression profile in living cells. One method to decipher the expression profile is to sequence the RNA. The transcriptome (i.e., RNA transcript expression profile) can be used to infer various biological knowledge, which can lead to inferences and discoveries in biology and medicine.

SUMMARY OF THE INVENTION

[0007] A number of embodiments are directed to unveiling alternative splicing events from RNA sequencing data using Bayesian hypothesis testing. In several embodiments, the results of a computational model are used as an informative prior in the Bayesian hypothesis testing to yield more accurate and sensitive results.

[0008] Processes, in accordance with various embodiments, utilize RNA sequencing data to build and train computational models that can help delineate RNA splicing events between two biological conditions. In a number of embodiments, a deep neural network computational model is built. Numerous embodiments use the built and trained computational models to identify and score differential splicing events between the two biological conditions.

[0009] In several embodiments, unveiled splicing events are used to further develop biomarkers, diagnostics, and medical treatments. A number of these embodiments develop assays that detect unveiled splicing events, including (but not limited to) hybridization and nucleic acid proliferation assays. Various embodiments are also directed to the development of immunological tools and assays (e.g., antibody assays) that can detect a peptide and/or protein that is expressed from the unveiled alternative splicing events.

[0010] In an embodiment to synthesize a nucleic acid molecule, a pair of RNA sequencing data sets are applied in a computational model trained to identify differential splicing events between the pair of RNA sequencing data sets. The computational model is trained utilizing cis exon sequence features and RNA binding protein expression levels. Differential splicing events between the pair of RNA sequencing data sets are scored and predicted. A posterior ratio of differential splicing events between the pair of RNA sequencing data sets is determined via Bayesian hypothesis testing using the scored splicing events as a prior ratio and observed RNA sequencing read counts as the likelihood ratio. The RNA sequencing read counts are derived from the RNA sequencing data sets. At least one differential splicing event is unveiled between the pair of RNA sequencing data sets. A nucleic acid molecule comprising a sequence that spans across an exon-exon junction of the at least one differential splicing event is then synthesized. The exon-exon junction is unveiled to be differentially expressed between the pair of RNA sequencing data sets.

[0011] In another embodiment, the nucleic acid molecule is a DNA molecule.

[0012] In yet another embodiment, the nucleic acid molecule is synthesized via an in vitro synthesizer or a bacterial recombination system.

[0013] In a further embodiment, the synthetic nucleic acid molecule is inserted into a plasmid vector.

[0014] In still yet another embodiment, the synthetic nucleic acid molecule is operably linked to a non-native promoter.

[0015] In yet a further embodiment, the synthetic nucleic acid is utilized within an expression system to produce a peptide.

[0016] In an even further embodiment, the peptide is utilized to generate an antigen-binding molecule.

[0017] In yet an even further embodiment, the antigen-binding molecule is a polyclonal antibody or a monoclonal antibody.

[0018] In still yet an even further embodiment, a fluorescent probe, a small molecule drug, or an enzyme is covalently linked to the synthetic nucleic acid.

[0019] In still yet an even further embodiment, the synthetic nucleic acid molecule is used to detect the exon-exon junction in a biological sample.

[0020] In still yet an even further embodiment, a biological assay to detect at least one differential splicing event is performed.

[0021] In still yet an even further embodiment, each set of RNA sequencing data of the pair is derived from a different biological condition.

[0022] In still yet an even further embodiment, at least one biological condition is associated with a particular phenotype.

[0023] In still yet an even further embodiment, the particular phenotype is a disease state.

[0024] In still yet an even further embodiment, the pair of RNA sequencing data sets correspond to two different biological tissues, two different biological sources, two separate extractions on a temporal scale, or a test sample with a control sample.

[0025] In still yet an even further embodiment, the computational model includes taking differential splicing formulated as:

where Y_ik is the label for event i in the comparison k; Y_ik = 1 represents when the event is differentially spliced; E_i is a vector of a number of cis sequence features for exon i; and G_k is a vector of a number of normalized gene expression levels of a number of RNA-binding proteins in the two RNA sequencing data sets.

[0026] In still yet an even further embodiment, the computational model is trained utilizing a number of pairwise biological conditions. The computational model is trained utilizing RNA-binding protein depleted cell lines and control cell lines.

[0027] In still yet an even further embodiment, the posterior ratio is calculated via:

$$\frac{P(H_1 \mid I_{ij}, S_{ij})}{P(H_0 \mid I_{ij}, S_{ij})} = \frac{P(I_{ij}, S_{ij} \mid H_1)}{P(I_{ij}, S_{ij} \mid H_0)} \times \frac{P(H_1)}{P(H_0)}$$

wherein I_ij and S_ij are the exon inclusion read count and the exon skipping read count, respectively, for exon i in sample group j ∈ {1, 2}; P(H₁) is the prior probability of exon i being differentially spliced, determined by exon-specific cis features and sample-specific trans RNA-binding protein expression levels in the two RNA sequencing data sets, which is independent of the observed RNA-seq read counts on exon i; P(H₀) = 1 - P(H₁) is the prior probability of exon i being common; and P(I_ij, S_ij | H₁) and P(I_ij, S_ij | H₀) represent the likelihoods under the model of differential splicing or common splicing, respectively.
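As a minimal sketch of how the posterior ratio combines its two ingredients, the following Python functions multiply a likelihood ratio by a prior ratio in log space. The function names and the choice to pass log-likelihoods as inputs are assumptions made for illustration; how the likelihoods themselves are computed from the read counts is not shown here.

```python
import math


def posterior_ratio(prior_h1: float, log_lik_h1: float, log_lik_h0: float) -> float:
    """Posterior ratio P(H1 | data) / P(H0 | data) for one exon.

    prior_h1   -- P(H1), e.g. a model score strictly between 0 and 1
    log_lik_h1 -- log-likelihood of the read counts under differential splicing
    log_lik_h0 -- log-likelihood of the read counts under common splicing
    """
    if not 0.0 < prior_h1 < 1.0:
        raise ValueError("prior_h1 must lie strictly between 0 and 1")
    # Prior ratio P(H1) / P(H0), with P(H0) = 1 - P(H1), in log space.
    log_prior_ratio = math.log(prior_h1) - math.log1p(-prior_h1)
    # Likelihood ratio in log space.
    log_lik_ratio = log_lik_h1 - log_lik_h0
    return math.exp(log_lik_ratio + log_prior_ratio)


def posterior_probability(ratio: float) -> float:
    """Convert a posterior ratio into P(H1 | data)."""
    return ratio / (1.0 + ratio)
```

For example, a likelihood ratio of 3 yields a posterior ratio of 3 under a flat prior of 0.5, and a posterior ratio of 27 under an informative prior of 0.9, which shows how the prior can strengthen a call when read evidence alone is modest.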

[0028] In still yet an even further embodiment, the unveiled splicing event occurs in a lowly expressed gene.

[0029] In still yet an even further embodiment, at least one differential splicing event is considered bona fide when the difference exceeds a user defined threshold with a high probability.

[0030] In still yet an even further embodiment, the high probability is determined by:

[0031] In still yet an even further embodiment, the threshold c is 1%, 2%, 5%, or 10%.
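The idea of a difference exceeding a threshold c with high probability can be illustrated with a simplified Monte Carlo sketch. This is not the formula of the embodiment, which is not reproduced in this text. It assumes, for illustration only, that the inclusion level of each sample group (called psi in the code, a name introduced here) has an independent Beta posterior under a flat prior, with no isoform-length correction and no replicate modelling.

```python
import random


def prob_difference_exceeds(inc1, skip1, inc2, skip2, c=0.05, draws=20000, seed=0):
    """Monte Carlo estimate of P(|psi1 - psi2| > c | read counts).

    Assumption: each group's inclusion level follows an independent
    Beta(inclusion + 1, skipping + 1) posterior (flat prior).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        psi1 = rng.betavariate(inc1 + 1, skip1 + 1)
        psi2 = rng.betavariate(inc2 + 1, skip2 + 1)
        if abs(psi1 - psi2) > c:
            hits += 1
    return hits / draws
```

An event would then be called when the returned probability is high, for instance above a cutoff chosen by the user; that cutoff is likewise an assumption of this sketch.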

[0032] In an embodiment to unveil at least one differential splicing event, a pair of RNA sequencing data sets are applied in a computational model trained to identify differential splicing events between the pair of RNA sequencing data. The computational model is trained utilizing cis exon sequence features and RNA binding protein expression levels. Differential splicing events between the pair of RNA sequencing data sets are scored and predicted. A posterior ratio of differential splicing events between the pair of RNA sequencing data sets is determined via Bayesian hypothesis testing using the scored splicing events as a prior ratio and observed RNA sequencing read counts as the likelihood ratio. The RNA sequencing read counts are derived from the RNA sequencing data sets. And at least one differential splicing event is unveiled between the pair of RNA sequencing data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0033] The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.

[0034] Figure 1 illustrates a pipeline for unveiling splicing events that can be used for developing research tools, diagnostics and/or medicaments in accordance with various embodiments of the invention.

[0035] Figure 2 illustrates a process to identify exon and sample features from RNA sequence data in accordance with an embodiment of the invention.

[0036] Figure 3 illustrates a process to build computational models to identify and score splicing events in accordance with an embodiment of the invention.

[0037] Figure 4 illustrates a process to utilize a built computational model to identify and score splicing events in accordance with an embodiment of the invention.

[0038] Figure 5 illustrates a process to unveil splicing events using Bayesian hypothesis testing and an informative prior in accordance with an embodiment of the invention.

[0039] Figure 6 illustrates a diagram of computing systems configured to unveil splicing events in accordance with various embodiments of the invention.

[0040] Figure 7 illustrates a process for developing polyclonal antibodies having affinity for a peptide and/or protein expressed from an unveiled alternative splicing event in accordance with an embodiment of the invention.

[0041] Figure 8 illustrates features utilized in a deep neural network of differential alternative splicing, including cis sequence features and trans RBP features in accordance with an embodiment of the invention.

[0042] Figure 9 illustrates a deep neural network design in accordance with an embodiment of the invention.

[0043] Figure 10 provides a diagram of training and leave-out data utilized to train and test a deep neural network, utilized in accordance with an embodiment of the invention.

[0044] Figure 11 provides a performance data comparison between three alternative splicing event statistical models, generated in accordance with an embodiment of the invention.

[0045] Figure 12 provides performance data results of an alternative splicing event statistical model comparing RNA-seq replicated data and pooled data, generated in accordance with an embodiment of the invention.

[0046] Figure 13 provides performance data results of a deep neural network model as training progresses, generated in accordance with an embodiment of the invention.

[0047] Figure 14 provides a comparison of a deep neural network model against other baseline methods, generated in accordance with an embodiment of the invention.

[0048] Figure 15 illustrates the data input for a Bayesian hypothesis testing model, in which an informative prior is utilized in accordance with an embodiment of the invention.

[0049] Figure 16 provides a heat-map data chart describing the relationship between the posterior ratio, the informative prior, and observed RNA-seq read counts, generated in accordance with an embodiment of the invention.

[0050] Figure 17 provides cluster analysis of top 3,000 genes with the highest coefficient of variation (CoV) in gene expression in the ENCODE HepG2 and K562 cell lines, utilized in accordance with an embodiment of the invention.

[0051] Figure 18 provides performance data comparison between using an informative prior (DARTS-info) and using an uninformative prior (DARTS-flat) in Bayesian hypothesis testing, generated in accordance with an embodiment of the invention.

[0052] Figure 19 provides data results when utilizing various training sets in a deep neural network, generated in accordance with an embodiment of the invention.

[0053] Figure 20 provides performance data results of Bayesian hypothesis testing and deep neural network models trained on different sets of data when analyzing epithelial to mesenchymal transition during development, generated in accordance with an embodiment of the invention.

[0054] Figure 21 provides Meta-exon motif analysis of the known ESRP motif for RNA-seq differential events and Bayesian hypothesis testing utilizing a deep neural network model rescued events in the Day 6 vs. Day 0 comparison, generated in accordance with an embodiment of the invention.

[0055] Figure 22 provides an example of Bayesian hypothesis testing utilizing a deep neural network model prediction for the PLEKHA1 gene, generated in accordance with an embodiment of the invention.

[0056] Figure 23 provides differential RNA binding protein expression signatures in various cancer cell lines, generated in accordance with an embodiment of the invention.

[0057] Figure 24 provides data demonstrating a correlation of scores generated by a deep neural network model, generated in accordance with an embodiment of the invention.

[0058] Figure 25 provides validation data of Bayesian hypothesis testing utilizing a deep neural network model, generated in accordance with an embodiment of the invention.

[0059] Figure 26A provides cumulative density function of gene expression levels (TPM values) for Bayesian hypothesis testing utilizing a deep neural network model predicted differential events and RNA-seq differential events, generated in accordance with an embodiment of the invention.

[0060] Figure 26B provides performance data of Bayesian hypothesis testing comparing use of a prior (DARTS BHT(info)) with baseline methods that use RNA-seq data alone to call differential splicing (DARTS BHT(flat), rMATS, and SUPPA2), generated in accordance with an embodiment of the invention.

[0061] Figure 27 provides validation data of Bayesian hypothesis testing utilizing three different sets of cis sequence features in a deep neural network model, generated in accordance with an embodiment of the invention.

[0062] Figure 28 provides a performance data comparison between DARTS DNN and logistic regression using ENCODE data, generated in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

[0063] Turning now to the drawings and data, methods and processes to unveil splicing events and applications thereof are described, in accordance with various embodiments of the invention. In several embodiments, Bayesian hypothesis testing is performed to unveil differential and/or common splicing events between two biological conditions from RNA sequencing (RNA-seq) data. Many embodiments utilize an informative prior derived from a computational model to augment the Bayesian hypothesis testing to yield more sensitive and accurate results. In some instances, novel splicing events can be unveiled, which can be used in a number of downstream applications.

[0064] Alternative splicing is a major cellular mechanism for generating expression complexity, especially in regulatory and functional aspects (e.g., two splice variants of the same gene can have different regulatory and functional properties). In addition, alternative splicing contributes to the diversity of phenotypes in eukaryotic cells of an organism, where each cell has the same DNA genotype.

[0065] To understand RNA transcription and alternative splicing of a particular biological sample, RNA can be sequenced (RNA-seq) to reveal a transcriptome-wide profile of expression, including expression of alternatively spliced genes. The general workflow of RNA-seq alternative splicing analysis involves counting RNA sequence reads mapped to exons and splice junctions; estimating relative abundances of splice isoforms; and detecting differential alternative splicing events between biological conditions using appropriate statistical models. An inherent limitation of this approach is that it relies solely on empirical evidence in RNA-seq derived data, and obtaining enough sequencing depth to identify lowly or moderately expressed genes and/or isoforms can be costly. Even at a high sequencing depth, the detection of alternative isoforms can be biased against lowly or moderately expressed genes and/or isoforms because these RNA species are buried within the data.
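The step of estimating relative abundances of splice isoforms from read counts can be sketched as follows. This is a minimal illustration using a length-normalized ratio of inclusion to total reads, a common way to estimate exon inclusion; the function name, the parameter names, and the use of effective isoform lengths are assumptions not stated in this description.

```python
def inclusion_level(inc_reads, skip_reads, inc_len=1.0, skip_len=1.0):
    """Length-normalized exon inclusion estimate between 0 and 1.

    inc_len and skip_len are the effective lengths of the inclusion and
    skipping isoforms (hypothetical parameters; 1.0 disables normalization).
    Returns None when no reads support either isoform.
    """
    inc = inc_reads / inc_len
    skip = skip_reads / skip_len
    total = inc + skip
    if total == 0:
        return None
    return inc / total
```

With 30 inclusion reads and 10 skipping reads the estimate is 0.75. With 3 inclusion reads and 1 skipping read the estimate is also 0.75, but it rests on only four reads, which illustrates why read counts alone are weak evidence for lowly expressed genes.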

[0066] The availability of large-scale RNA-seq data in public repositories has enabled quantitative measurement of alternative splice isoforms across diverse biological states. For example, the Roadmap Epigenomics consortium has generated deep RNA-seq data across over 100 human tissues and cell types, while the ENCODE consortium has systematically performed RNA-seq of two human cell lines upon knockdown of over 250 RNA binding proteins (RBPs) (Roadmap Epigenomics Consortium, Nature 518, 317-330 (2015); ENCODE Project Consortium, Nature 489, 57-74 (2012); E. L. Van Nostrand, et al., bioRxiv 179648, (2017); the disclosures of which are herein incorporated by reference in their entirety). Utilizing large-scale RNA-seq resources, a deep learning model of differential alternative splicing events can be constructed in accordance with many embodiments of the invention. In a number of embodiments, a deep learning model generates a predictive prior to augment alternative splicing analysis of a particular RNA-seq dataset.

Unveiling of RNA Transcripts Overview

[0067] An embodiment of a pipeline to develop research tools, diagnostics, and medicaments is illustrated in Figure 1. This embodiment is directed to unveiling splicing events and applies the knowledge garnered to applications related to these transcript species. For example, this pipeline can be used to develop research tools, diagnostics and medicaments related to newly discovered RNA transcripts that exist in a disease state (e.g., cancer).

[0068] Process 100 can begin with obtaining (101) RNA sequences for at least two biological conditions from biological sources or sequence databases. In many instances, RNA sequences are obtained de novo by extracting the molecules from a biological source and sequencing them. Alternatively, biomolecule sequences can be obtained from a publicly or privately available database. Many databases exist that store datasets of sequences from which a user can extract the data to perform experiments upon.

[0069] In many embodiments, de novo biological sources to extract RNA for sequencing include animal cell samples and/or biopsies for which a user desires to unveil RNA transcript sequences. In particular, RNA molecules to be acquired can be derived from biopsies of human patients associated with a phenotype (e.g., disease state). Alternatively, the biomolecules can be derived from common research sources, such as in vitro tissue culture cell lines or research mouse models. In any case, biomolecules are extracted, processed and sequenced according to methods commonly understood in the field.

[0070] Once sequences are obtained, process 100 can unveil (103) splicing events that exist within the sequence datasets via Bayesian hypothesis testing. RNA transcript sequences, including sequences of rare and/or novel transcripts, are found within the sequence data, and statistical modeling with analysis can be performed to determine whether these transcripts are differential or common splicing events between a pair of biological conditions. The identified transcripts can be reported and ascertained for further development.

[0071] The knowledge of splicing events, including events that result in rare and/or novel transcripts, can then be exploited to develop (105) research tools, diagnostics, and medicaments. Nucleic acids and proteins based on the unveiled splicing events can be synthesized and utilized in downstream applications. Many of these applications include further research on the unveiled transcript sequences. In addition, synthetic nucleic acids and proteins can serve to develop research and clinical tools to recognize biomarkers of a splicing event in various applications. For example, a synthetic nucleic acid and/or PCR kit can be used to develop a hybridization/PCR assay to identify the existence of unveiled RNA molecules in a biological sample or medical biopsy. Likewise, a synthetic peptide derived from unveiled RNA sequences can be utilized to develop an immunological assay to identify the existence of a novel peptide in a biological sample or medical biopsy. Synthetic molecules can also be exploited to develop drugs and medications that key in on naturally occurring splicing events in a patient.

[0072] Splicing events can be associated with many phenotypes and disease states. As such, process 100 can be employed to develop research tools, diagnostics and medicaments for many different types of diseases.

Identification of RNA Transcript Features

[0073] A process for identifying RNA transcript features, including exon-level cis features and sample-level trans RNA binding protein expression levels, in accordance with an embodiment of the invention is shown in Figure 2. Process 200 begins with obtaining (201) RNA sequence data for at least one biological condition. In several embodiments, this sequence data is transcriptome data derived from sequencing RNA in a next-generation sequencing platform, such as those manufactured by Illumina, Inc. (San Diego, CA). For the splicing event unveiling applications described within, RNA sequencing provides a facile method to obtain sequence data, as RNA is typically abundant in the biological source, can be easily sequenced by known methods, is readily available in numerous public and private databases, and has intronic sequences already removed, and as many exon reference databases exist for post-sequencing data analysis.

[0074] The source of RNA sequence data can be derived de novo (i.e., from biological tissue), from a public or private database, or generated in silico. Several methods are well known to derive RNA sequence data from biological tissue. Generally, RNA molecules are extracted from tissue, prepped to be sequenced, and then run on a sequencer. For example, RNA can be extracted from a human tissue source such as a biopsy, then prepped into a sequence library using standard techniques, and sequenced on a next-generation sequencing platform, such as those manufactured by Illumina, Inc. (San Diego, CA). Likewise, RNA sequence data can be derived from an available database. For example, transcriptome data can be obtained from the National Center for Biotechnology Information (NCBI), Reference Sequence Database (RefSeq), Encyclopedia of DNA Elements (ENCODE) or Roadmap Epigenomics Project databases. In addition, sequences to be analyzed can be generated in silico, by any appropriate computational methods. It should be noted that the sequence data could be in any appropriate sequence read format, including (but not limited to) single or paired-end reads.

[0075] After obtaining biomolecule sequence data, RNA sequence data is processed (203) to generate exon data for each biological condition. Many methodologies are known to process sequence data, and any appropriate method can be used. For example, the sequence data can be trimmed with the publicly available TrimGalore (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) or cutAdapt (https://cutadapt.readthedocs.io/en/stable/) methods, which remove adapter sequences and trim poor-quality bases. Mapping can be performed with any appropriate annotated genome, such as, for example, UCSC’s hg19 (http://support.illumina.com/sequencing/sequencing_software/igenome.html), and any appropriate alignment tool, such as, for example, Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml), TopHat (https://ccb.jhu.edu/software/tophat/index.shtml), and STAR (https://github.com/alexdobin/STAR). Processing of the data will be dependent on the users’ goal, and thus adaptable to the results desired. Although only a few methods of trimming, processing, and mapping sequence data are disclosed, it should be understood many more methods exist and are covered by various embodiments of the invention.

[0076] Process 200 also identifies (205) exon-level cis sequence features and sample-level trans RNA binding protein expression levels. Cis sequence features can be extrapolated from an exon sequence (and nearby sequences, especially near splice sites). A number of cis features can be used to predict exon splicing patterns, including evolutionary conservation, splice site strength, regulatory motif composition, and RNA secondary structure. An exemplary table of cis features is provided in Table 1, but it should be understood that Table 1 only provides a small example of cis features and that many more cis features could be used within various embodiments. Cis sequence features are assigned to particular exons within a sample. The expression of RNA binding proteins (RBPs) is also identified, which can also help predict splicing patterns. A number of RBPs are expressed from animal genomes. Trans RBP expression levels are assigned to a sample as a whole. An exemplary table of RBP genes is provided in Table 2, but it should be understood that Table 2 only provides a small example of trans features and that many more trans features could be used within various embodiments.
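The trimming and mapping steps of process 200 can be sketched as command lines for two of the tools named above, cutAdapt and STAR. The following Python helpers only build and print the commands. The file names, the index directory, the adapter sequence, the quality cutoff, and the thread count are all placeholders chosen for illustration, and the appropriate settings depend on the library preparation and the user's goal.

```python
import shlex


def trim_command(fastq_in, fastq_out, adapter="AGATCGGAAGAGC", min_quality=20):
    """Build a cutadapt call that removes a 3' adapter and low-quality ends.

    The default adapter is a placeholder; use the one matching the library kit.
    """
    return ["cutadapt", "-a", adapter, "-q", str(min_quality),
            "-o", fastq_out, fastq_in]


def align_command(fastq_in, genome_dir, out_prefix, threads=4):
    """Build a STAR call that writes a coordinate-sorted BAM file."""
    return ["STAR", "--runThreadN", str(threads),
            "--genomeDir", genome_dir,
            "--readFilesIn", fastq_in,
            "--outFileNamePrefix", out_prefix,
            "--outSAMtype", "BAM", "SortedByCoordinate"]


if __name__ == "__main__":
    # Print the commands instead of running them; each list can be passed to
    # subprocess.run(cmd, check=True) once the tools are installed.
    commands = (
        trim_command("sample.fastq", "sample.trimmed.fastq"),
        align_command("sample.trimmed.fastq", "hg19_star_index", "sample_"),
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Building the commands as argument lists rather than shell strings avoids quoting problems with file names, and keeps the trimming and alignment settings in one place for each biological condition.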

[0077] Process 200 can also output (207) a report containing exon-specific cis features and sample-level trans RNA binding protein expression levels as determined from the obtained RNA sequence data. As is discussed further below, these features and expression levels can be used for building a model and/or using a built model to unveil RNA splicing events.

[0078] While specific examples of processes for identifying RNA transcript features are described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications. Furthermore, any of a variety of processes for identifying RNA transcript features appropriate to the requirements of a given application can be utilized in accordance with various embodiments of the invention.

Building Models to Score and Predict Splicing Events

[0079] A process for building and training a computational model to score and predict splicing events in accordance with an embodiment of the invention is shown in Figure 3. Process 300 generates (301) exon-specific features using exon-level cis sequence features for a particular species' genome. In several embodiments, a human genome sequence is used to derive cis sequence features. It should be noted that a partial or full genome can be used and that the particular species used will depend on the species to be examined. An exemplary list of cis sequence features is provided in Table 1. Cis sequence features are assigned to particular exons within a particular species' genome.

[0080] Process 300 also generates (303) sample-specific features using sample-level trans RNA binding protein RNA expression by analyzing RNA sequence data between a pair of biological conditions. Sample-specific features can be generated for at least one pair of biological conditions. The exact pair of biological conditions will depend on the application. It is noted, however, that a pair of biological conditions should be expected to yield at least some differences in RNA transcript constituents, especially RNA binding protein expression levels and splicing events. In some embodiments, a pair of biological conditions will be a particular phenotype (e.g., disease state) and a control for that phenotype (i.e., not exemplifying the phenotype). In many embodiments, a pair of biological conditions will be two different tissues, two different sources, two separate extractions on a temporal scale, and/or a test with a control. It should be understood that a number of pairwise conditions can be used and fall within various embodiments of the invention, as would be understood by those skilled in the art.

[0081] Process 300 can also identify (305) differential and common splicing events between a pair of biological conditions (i.e., training labels) from a set of RNA sequence data using a statistical inference model. Differentially spliced exons, in accordance with numerous embodiments, are exons whose splicing levels are significantly different between the pair. In various embodiments, common exons are exons whose splicing levels are similar between the pair. Accordingly, differentially spliced exons will result in different exon-exon junctions between the pair. Multiple embodiments specifically identify high-confidence differential and common splicing events, as defined by the statistical inference model.

[0082] In several embodiments, training labels for differentially spliced vs common exons in each pairwise comparison can be generated using Bayesian hypothesis testing using exon data from RNA sequence data as a statistical inference model. In some embodiments, likelihood ratio testing is used as a statistical inference model. Examples of Bayesian hypothesis testing include (but are not limited to) Deep-learning Augmented RNA-seq analysis of Transcript Splicing (DARTS) as described in the exemplary embodiments below, and Mixture of Isoforms (MISO; https://github.com/yarden/MISO). An example of likelihood ratio testing is replicate Multivariate Analysis of Transcript Splicing (rMATS; http://rnaseq-mats.sourceforge.net/).
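Turning the output of a statistical inference model into high-confidence training labels can be sketched as a simple thresholding step. The sketch assumes the inference model reports, for each event, a posterior probability of differential splicing; the cutoffs of 0.9 and 0.1 and the function names are illustrative assumptions, not values taken from this description.

```python
def training_label(posterior_h1, high=0.9, low=0.1):
    """Return 1 (differential), 0 (common), or None (ambiguous, left out)."""
    if not 0.0 <= posterior_h1 <= 1.0:
        raise ValueError("posterior_h1 must be a probability")
    if posterior_h1 > high:
        return 1
    if posterior_h1 < low:
        return 0
    return None


def label_events(posteriors):
    """Map {event_id: posterior} to {event_id: label}, dropping ambiguous events."""
    labels = {}
    for event_id, posterior in posteriors.items():
        label = training_label(posterior)
        if label is not None:
            labels[event_id] = label
    return labels
```

Events whose posterior falls between the two cutoffs receive no label, so only confidently differential or confidently common events contribute to training in a given pairwise comparison.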

[0083] Process 300 can also repeat (307) steps 303 and 305 of generating sample-specific features and identifying differential splicing events utilizing sequence data of additional pair(s) of biological conditions. In some embodiments, the RNA sequence data is derived from a large RNA sequence database, such as the ENCODE or Roadmap Epigenomics Project databases. In some embodiments, the large-scale RNA sequence data contains a number of pairwise biological conditions. For example, a large collection of RBP-depleted human cell lines (e.g., from the ENCODE database) can be used for generating cis sequence features and trans RBP RNA expression, utilizing pairwise comparisons between RBP-depleted cell lines and control (i.e., no depletion of RBP). In a number of embodiments, steps 303 and 305 are iteratively repeated over many pairwise comparisons to obtain more training labels under diverse comparisons. Iterative repeating of generating sample-specific features and identifying differential and/or common splicing events improves the predictive power in several embodiments.

[0084] Using exon-level cis sequence features from a species' genome and sample-level trans RNA binding protein RNA expression from a large-scale RNA sequence data source, process 300 builds and trains (309) a computational model to score and predict differential and common splicing events between a pair of biological conditions. A number of computational models can be used in accordance with various embodiments of the invention. In many embodiments, a neural network and (more specifically) a deep neural network is used, which can have a number of hidden layers each having a number of parameters. Typically, a neural network will have two to ten hidden layers with several million parameters, but it should be understood any appropriate number of layers and parameters can be used. In some embodiments, the task of differential splicing can be formulated as:

where Y_ik is the label for event i in comparison k; Y_ik = 1 indicates that the event is differentially spliced; E_i is a vector of cis sequence features for exon i; and G_k is a vector of normalized gene expression levels of a number of RBPs in the two conditions.
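As an illustration of the definitions above, the following minimal sketch shows one way the model inputs and labels could be assembled: each training example concatenates the cis feature vector E_i for an exon with the RBP expression vector G_k for a comparison (here taken as the expression in condition 1 followed by the expression in condition 2), paired with the label Y_ik. The function name, the dictionary layout, and the concatenation order are assumptions made for this sketch and are not taken from the disclosure.

```python
from typing import Dict, List, Sequence, Tuple


def build_examples(
    cis: Dict[str, Sequence[float]],
    rbp_expr: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    labels: Dict[Tuple[str, str], int],
) -> Tuple[List[List[float]], List[int]]:
    """Assemble (input vector, label) pairs for a differential splicing model.

    cis:      exon id -> cis sequence features (E_i)
    rbp_expr: comparison id -> (RBP expression in condition 1, in condition 2),
              which together form G_k
    labels:   (exon id, comparison id) -> 1 if differentially spliced, 0 if common
    """
    inputs: List[List[float]] = []
    targets: List[int] = []
    # Sort keys so the output order is deterministic.
    for (exon_id, comparison_id), label in sorted(labels.items()):
        cond1, cond2 = rbp_expr[comparison_id]
        # Hypothetical layout: [E_i | G_k(condition 1) | G_k(condition 2)]
        inputs.append(list(cis[exon_id]) + list(cond1) + list(cond2))
        targets.append(label)
    return inputs, targets


# Toy values for illustration only (not real features or expression levels).
toy_cis = {"exon_a": [0.1, 0.2], "exon_b": [0.3, 0.4]}
toy_rbp = {"kd_vs_ctrl": ([1.0], [2.0])}
toy_labels = {("exon_a", "kd_vs_ctrl"): 1, ("exon_b", "kd_vs_ctrl"): 0}
X, y = build_examples(toy_cis, toy_rbp, toy_labels)
```

Because G_k is shared by every exon in comparison k, the same exon can appear in many examples with different RBP expression contexts, which is what allows a model of this kind to relate changes in RBP levels to changes in splicing.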

[0085] In several embodiments, built and trained computational models are evaluated for their prediction ability. Often, “leave-out” data (i.e., data that is not used to train the model) is used to assess the model. In some embodiments, better or the best performing models are kept for further downstream analyses.

[0086] Process 300 can also output (311) a report containing computational model parameters and prediction assessment. As is discussed further below, the built computational model can be further used to score and predict splicing events of other pairs of biological conditions. In addition, scored and predicted differential splicing events can be used to unveil splicing events, including novel splicing events, via Bayesian hypothesis testing.

[0087] While specific examples of processes for building and training computational models to score and predict differential splicing events are described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications. Furthermore, any of a variety of processes for building and training computational models for scoring and predicting splicing events appropriate to the requirements of a given application can be utilized in accordance with various embodiments of the invention.

Applying Models to Score Splicing Events

[0088] A process for identifying and scoring splicing events using a trained model in accordance with an embodiment of the invention is shown in Figure 4. Process 400 obtains (401) RNA-seq data sets for at least a pair of biological conditions. As discussed previously, RNA-seq data can be obtained in a number of ways, including de novo from a biological sample, extracted from a publicly or privately available database, or in silico. Obtained RNA-seq data can be processed in a number of ways, as discussed herein, such that exon-level cis features and sample-level trans RNA binding protein RNA expression levels can be analyzed.

[0089] Process 400 can also obtain (403) a trained computational model that scores and predicts both differential and common splicing events between a pair of RNA-seq data sets. A number of embodiments utilize a deep neural network model. In some embodiments, a computational model is trained as described in reference to Figure 3. Any appropriate trained computational model, however, can be utilized in accordance with the requirements of the various embodiments of the invention.

[0090] The obtained RNA-seq data is entered (405) into the trained computational model. Then the model can be used to score and predict (407) differential and common splicing events between RNA-seq data sets.

[0091] Process 400 also outputs (409) a report containing identified and scored splicing events using a previously trained model, specifying which are differential and which are common between a pair of biological conditions. As is discussed further below, identified and scored differential splicing events can be used to unveil splicing events via Bayesian hypothesis testing.

[0092] While specific examples of processes for identifying and scoring splicing events using a previously trained model are described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications. Furthermore, any of a variety of processes for identifying and scoring splicing events appropriate to the requirements of a given application can be utilized in accordance with various embodiments of the invention.

Unveiling of Splicing Events

[0093] A process for unveiling differential splicing events between a pair of biological conditions in accordance with an embodiment of the invention is shown in Figure 5. Process 500 obtains (501) scored differential splicing events for at least a pair of RNA-seq data sets. In several embodiments, differential splicing events are scored by a computational model, such as a deep neural network. It is noted that various neural network structures can be utilized and the manner in which deep neural networks can be configured for use in processes and systems in accordance with various embodiments of the invention is discussed further below. In some embodiments, differential splicing events are scored by the method provided in Figure 4. In many embodiments, a pair of RNA-seq data sets correspond to a pairwise comparison of biological conditions.

[0094] Process 500 also obtains (503) observed exon read counts of splicing events for the pair of RNA-seq data sets. In many embodiments, exon read counts of splicing events are derived from RNA-seq data. In several embodiments, the RNA-seq data used in the computational model to score differential splicing events is the same RNA-seq data used to derive observed exon read counts.

[0095] A posterior ratio of differential splicing events between the pair of RNA-seq data sets is determined (505) via Bayesian hypothesis testing. In a number of embodiments, the computational model’s prediction scores of differential splicing events are entered into the Bayesian hypothesis testing as the prior ratio. In some embodiments, the observed RNA-seq read counts are entered into the Bayesian hypothesis testing as the likelihood ratio. Various embodiments utilize the following equation when calculating a posterior ratio of differential splicing:

$$\frac{P(H_1 \mid I_{ij}, S_{ij})}{P(H_0 \mid I_{ij}, S_{ij})} = \frac{P(I_{ij}, S_{ij} \mid H_1)}{P(I_{ij}, S_{ij} \mid H_0)} \times \frac{P(H_1)}{P(H_0)}$$

wherein:

I_ij, S_ij are the exon inclusion read count and the exon skipping read count for exon i in sample group j ∈ {1,2}, respectively; and P(H₁) is the prior probability of exon i being differentially spliced, determined by exon-specific cis features and sample-specific trans RBP expression levels in the two RNA data sets, which is independent of the observed RNA-seq read counts. P(H₀) = 1 - P(H₁) is the prior probability of exon i being common. P(I_ij,S_ij|H₁) and P(I_ij,S_ij|H₀) represent the likelihoods under the model of differential splicing or common splicing, respectively.
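The combination of a prior ratio and a likelihood ratio described above can be sketched as follows. This is a minimal illustration of the odds form of Bayes' rule only; the function names are hypothetical, and the likelihood ratio is passed in as a number rather than computed from read counts, since the likelihood model itself is not specified here.

```python
def posterior_odds(prior_h1: float, likelihood_ratio: float) -> float:
    """Posterior odds of H1 (differential) versus H0 (common).

    prior_h1:         P(H1), e.g. a model's prediction score, strictly in (0, 1)
    likelihood_ratio: P(I, S | H1) / P(I, S | H0) from the observed read counts
    """
    if not 0.0 < prior_h1 < 1.0:
        raise ValueError("prior_h1 must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    prior_odds = prior_h1 / (1.0 - prior_h1)  # P(H1) / P(H0), with P(H0) = 1 - P(H1)
    return likelihood_ratio * prior_odds


def odds_to_probability(odds: float) -> float:
    """Convert odds of H1 into the posterior probability P(H1 | data)."""
    return odds / (1.0 + odds)


# Same (made-up) likelihood ratio, combined with a flat and an informative prior.
flat = odds_to_probability(posterior_odds(0.5, 3.0))
informative = odds_to_probability(posterior_odds(0.9, 3.0))
```

With a flat prior of 0.5 the prior odds equal 1, so the posterior is driven by the read counts alone; an informative prior shifts the posterior for the same observed evidence, which is the mechanism by which a well-calibrated prior can compensate for limited read coverage.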

[0096] When an informative computational model is used to provide the prior ratio, the detection of differential splicing events greatly improves. Accordingly, an informative prior ratio can decrease the necessary coverage of RNA-seq data, reducing sequencing costs and associated data costs. This can be particularly useful when deeper sequencing is not feasible, for example because of cost, limited sample amount, or limitations in sequencing preparation.

[0097] Utilization of an informative prior can also unveil differential splicing events in lowly expressed genes, which were undiscoverable in prior standard methods (i.e., methods that did not incorporate an informative prior from computational models). The ability to unveil novel splicing events can result in discovery of new biomarkers for various conditions. This can dramatically improve research and medicine, especially in the fields of diagnostics and therapeutics.

[0098] The posterior ratio can be used to unveil (507) differential splicing events between the pair of biological conditions. In a number of embodiments, exon inclusion level differences are considered bona fide when the difference exceeds a user-defined threshold with a high probability:

In some embodiments, the threshold is defined as one of: 1%, 2%, 5%, or 10%. The appropriate threshold depends on the application and goal of the user.
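To make the idea of a thresholded inclusion-level difference concrete, the following sketch estimates the probability that the inclusion levels of an exon in two sample groups differ by more than a threshold. It is a simplified stand-in and not the statistical model of the disclosure: it assumes independent Beta posteriors for the inclusion level in each group under a uniform prior, ignores effective-length normalization of inclusion and skipping reads, and uses Monte Carlo sampling. The function name and all parameter names are hypothetical.

```python
import random


def prob_inclusion_difference(
    incl_1: int,
    skip_1: int,
    incl_2: int,
    skip_2: int,
    threshold: float = 0.05,
    draws: int = 20000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of P(|psi_1 - psi_2| > threshold | read counts).

    Simplifying assumption: psi_j ~ Beta(incl_j + 1, skip_j + 1), i.e. a
    uniform prior on each group's inclusion level and no length correction.
    """
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    hits = 0
    for _ in range(draws):
        psi_1 = rng.betavariate(incl_1 + 1, skip_1 + 1)
        psi_2 = rng.betavariate(incl_2 + 1, skip_2 + 1)
        if abs(psi_1 - psi_2) > threshold:
            hits += 1
    return hits / draws


# Toy counts: a clearly different pair and a deeply covered, near-identical pair.
p_different = prob_inclusion_difference(90, 10, 10, 90, threshold=0.05, draws=2000)
p_same = prob_inclusion_difference(5000, 5000, 5000, 5000, threshold=0.05, draws=2000)
```

A larger threshold (for example 10% instead of 1%) makes the criterion more conservative, so fewer events are called for the same read counts.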

[0099] Process 500 also outputs (509) a report containing unveiled splicing events. In several embodiments, the posterior ratio is also reported. As is discussed further below, various splicing events, including those that are novel, can be used to develop diagnostics and therapeutics.

[0100] While specific examples of processes for unveiling differential splicing events between a pair of biological conditions using Bayesian hypothesis testing are described above, one of ordinary skill in the art can appreciate that various steps of the process can be performed in different orders and that certain steps may be optional according to some embodiments of the invention. As such, it should be clear that the various steps of the process could be used as appropriate to the requirements of specific applications. Furthermore, any of a variety of processes for unveiling differential splicing events between a pair of biological conditions appropriate to the requirements of a given application can be utilized in accordance with various embodiments of the invention.

Systems of Splicing Event Unveiling

[0101] Turning now to Figure 6, computer systems (601) may be implemented on a set of one or more computing devices in accordance with some embodiments of the invention. The computer systems (601) may incorporate a personal computer, a laptop computer, and/or any other computing device with sufficient processing power for the processes described herein. The computer systems (601) include a processor (603), which may refer to one or more devices within the set of computing devices that can be configured to perform computations via machine readable instructions stored within a memory (607) of the computer systems (601). The processor may include one or more microprocessors (CPUs), one or more graphics processing units (GPUs), and/or one or more digital signal processors (DSPs). According to other embodiments of the invention, the computer system may be implemented on multiple computers.

[0102] In a number of embodiments of the invention, the memory (607) may contain an RNA splicing event scoring application (609) and a Bayesian hypothesis testing application (611) that perform all or a portion of various methods according to different embodiments of the invention described throughout the present application. As an example, processor (603) may perform a method similar to any of the processes described herein, during which memory (607) may be used to store various intermediate processing data such as exon-level features (609a), sample-level RBP expression (609b), computational model to score splicing events (609c), differential splicing event scores (611a), observed exon read counts (611b), and posterior ratio of splicing events (611c).

[0103] In some embodiments of the invention, the computer systems (601) may include an input/output interface (605) that can be utilized to communicate with a variety of devices, including but not limited to other computing systems, a projector, and/or other display devices. As can be readily appreciated, a variety of software architectures can be utilized to implement a computer system as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.

[0104] Although computer systems and processes for unveiling RNA splicing events and performing actions based thereon are described above with respect to Figure 6, any of a variety of devices and processes for handling data associated with RNA splicing events as appropriate to the requirements of a specific application can be utilized in accordance with many embodiments of the invention.

Applications of Chimeric Biomolecules

[0105] Various embodiments are directed to development of research tools, diagnostics, and medicaments related to the splicing events unveiled. As such, novel unveiled RNA transcript sequences can be transformed into various synthetic biomolecules.

[0106] In many embodiments, methods described herein can unveil rare and/or novel splicing events resulting in novel RNA transcript isoforms. The sequences of identified splicing events can be used as a template to synthesize nucleic acid molecules (e.g., DNA, RNA) having the same sequence or complementary sequence (or similar sequence or complementary sequence) across the identified splice site (i.e., exon-exon junction). A similar sequence can incorporate a few to several alterations in nucleotide bases as compared to the unveiled isoform. In some embodiments, a synthetic nucleic acid polymer has 10 or fewer altered nucleotide bases, 9 or fewer altered nucleotide bases, 8 or fewer altered nucleotide bases, 7 or fewer altered nucleotide bases, 6 or fewer altered nucleotide bases, 5 or fewer altered nucleotide bases, 4 or fewer altered nucleotide bases, 3 or fewer altered nucleotide bases, 2 or fewer altered nucleotide bases, 1 or fewer altered nucleotide bases, or no altered nucleotide bases. In numerous embodiments, the synthetic nucleic acid is composed of DNA or RNA bases. It should be understood, however, that other synthetic bases suitable to substitute for canonical nucleic bases can be used. These bases include, but are not limited to, locked nucleic acids, dideoxy nucleic acids, or modified bases (e.g., methylated bases).
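As a simple illustration of designing a sequence across an exon-exon junction, the sketch below takes the end of an upstream exon and the start of a downstream exon, builds the junction-spanning sequence, derives its reverse complement, and counts altered bases between two equal-length sequences. The function names, the flank length, and the toy DNA sequences are hypothetical and chosen only for this example; practical probe design would consider additional factors (e.g., melting temperature and off-target matches) that are not modeled here.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def junction_sequence(upstream_exon: str, downstream_exon: str, flank: int) -> str:
    """Return `flank` bases from each side of the exon-exon junction."""
    if flank <= 0 or flank > len(upstream_exon) or flank > len(downstream_exon):
        raise ValueError("flank must be positive and no longer than either exon")
    return upstream_exon[-flank:] + downstream_exon[:flank]


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def altered_bases(reference: str, candidate: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(reference, candidate) if a != b)


# Toy exon sequences for illustration only.
same_strand = junction_sequence("AAACCG", "TTGGCA", flank=3)
complementary = reverse_complement(same_strand)
n_altered = altered_bases(same_strand, "CCGATG")
```

A candidate "similar" sequence could then be accepted or rejected by comparing `n_altered` against whichever limit on altered bases applies in a given embodiment.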

[0107] Multiple embodiments are directed to synthetic nucleic acids synthesized in in vitro synthesizers (e.g., phosphoramidite synthesizer), a bacterial recombination system, or other suitable methods. Furthermore, synthesized nucleic acids can be purified and lyophilized, or kept stored in a biological system (e.g., bacteria, yeast). For use in a biological system, synthetic nucleic acid molecules can be inserted into a plasmid vector, or similar. A plasmid vector can also be an expression vector, wherein a suitable promoter and/or a suitable 3’-polyA signal are each operably linked to the transcript sequence. In many embodiments, the transcript sequence is operably linked with a non-native promoter and/or 3’-polyA signal. Typical promoters include (but are not limited to) cytomegalovirus (CMV), simian vacuolating virus 40 (SV40), elongation factor-1 alpha (EF-1 alpha), U6, and T7. Typical 3’-polyA signals include (but are not limited to) simian vacuolating virus 40 (SV40), bovine growth hormone (bGH) and T7.

[0108] Purified synthetic nucleic acid polymers can also be chemically modified for various purposes. For example, fluorescent probes, small molecule drugs, or enzymes can be covalently linked to nucleic acid polymers. Accordingly, numerous embodiments are directed to synthetic nucleic acid polymers having a covalent attachment.

[0109] Embodiments are also directed to expression vectors and expression systems that produce novel RNA transcripts, and further produce peptides or proteins. These expression systems can incorporate an expression vector, as described above, to express transcripts and proteins in a suitable expression system. Typical expression systems include bacterial (e.g., E. coli), insect (e.g., SF9), yeast (e.g., S. cerevisiae), animal (e.g., CHO), or human (e.g., HEK 293) cell lines. RNA and/or protein molecules can be purified from these systems using standard biotechnology production procedures.

[0110] In addition, many embodiments are also directed to the use of biomarkers to detect a transcript isoform and/or diagnose a biological sample (e.g., clinical biopsy). Multiple methods are well known and can be used to detect an RNA molecule, such as, for example, hybridization, polymerase-based amplification (e.g., PCR), or next-generation sequencing. In any of these exemplary assays, transcript isoforms can be extracted from a sample or biopsy and then detected downstream.

[0111] Many embodiments are directed to genetically modified cells that incorporate unveiled transcript isoforms. In particular, embodiments are directed to a drug screening platform in which genetically modified cells can be used to identify molecules (e.g., small molecules, biomolecules) that can mitigate, inhibit or even revert a phenotype associated with a particular transcript isoform. Accordingly, a particular isoform is expressed within the genetically modified cells, on which diagnostics and/or medicaments can be developed and/or tested.

Antigen Binding Molecule Development & Purification

[0112] In accordance with a number of embodiments, antigen-binding molecules (e.g., antibodies) can be developed with high specificity, preference and affinity for RNA transcript-derived molecules that are antigenic (e.g., proteins, peptides, nucleic acids). In many embodiments, high affinity antigen binding molecules are developed using transcript-derived molecules, such as those described herein. Many embodiments are also directed to the use of transcript-derived molecules to select and/or purify antigen-binding molecules, as determined by their specificity, preference, and/or affinity.

[0113] Antigen binding molecules can be any antibodies, fragments of antibodies, variants, and derivatives thereof capable of specifically binding an antigen. These include, but are not limited to, polyclonal, monoclonal, multispecific, human, humanized, primatized, or chimeric antibodies, single chain antibodies, epitope-binding fragments, e.g., Fab, Fab' and F(ab')2, Fd, Fvs, single-chain Fvs (scFv), single-chain antibodies, disulfide-linked Fvs (sdFv), fragments comprising either a VL or VH domain, and fragments produced by a Fab expression library.

[0114] By "specifically binds," it is generally meant that an antigen binding molecule (e.g., an antibody) binds to an epitope via its antigen binding domain and that the binding entails some complementarity between the antigen binding domain and the epitope. According to this definition, an antibody is said to "specifically bind" to an epitope when it binds to that epitope, via its antigen-binding domain more readily than it would bind to a random, unrelated epitope.

[0115] By "preferentially binds," it is meant that an antigen-binding molecule (e.g., antibody) specifically binds to an epitope more readily than it would bind to a related, similar, homologous, or analogous epitope. Thus, an antibody that “preferentially binds” to a given epitope would more likely bind to that epitope than to a related epitope, even though such an antibody may cross-react with the related epitope.

[0116] As used herein, the term "affinity" refers to a measure of the strength of the binding of an individual epitope by the antigen-binding molecule.

[0117] Antibodies are composed of a light chain and heavy chain. In general, the light and heavy chains are covalently bonded to each other, and the "tail" portions of the two heavy chains are typically bonded to each other by covalent disulfide linkages. Non-covalent linkages between two heavy chains can be used instead of disulfide linkages, which is typical when the antibodies are generated in culture.

[0118] Both the light and heavy chains are divided into regions of structural and functional homology, such as the constant and variable domains. The variable domains of both the light (VL) and heavy (VH) chain portions determine antigen recognition and specificity. Conversely, the constant domains of the light chain (CL) and the heavy chain (CH1, CH2 or CH3) confer important biological properties such as secretion, transplacental mobility, Fc receptor binding, complement binding, etc.

[0119] The antigen recognition of the VL and VH domains is determined by the complementarity determining regions (CDRs). In naturally occurring antibodies, there are six CDRs, which are short, non-contiguous sequences that are specifically positioned to form the antigen-binding domain. An antigen-binding domain formed by the positioned CDRs defines a surface complementary to the epitope on the immunoreactive antigen. This complementary surface promotes the non-covalent binding of the antibody to its cognate epitope. The amino acids comprising the CDRs and the framework regions, respectively, can be readily identified for any given heavy or light chain variable region by one of ordinary skill in the art, since they have been precisely defined.

Antibody Acquisition by Immunization

[0120] An embodiment for the production of antibodies with high affinity for a chimeric biomolecule is depicted in Fig. 7. Process 700 begins with an acquisition (701) of RNA transcript-derived molecules. As described herein, transcript-derived molecules can be produced by a number of methods known in the art. For example, transcript-derived nucleic acids can be produced in in vitro synthesizers (e.g., phosphoramidite synthesizer), a bacterial recombination system, or other suitable methods. Likewise, transcript-derived proteins or peptides can be recombinantly produced in an appropriate expression system, such as bacterial (e.g., E. coli), insect (e.g., SF9), yeast (e.g., S. cerevisiae), animal (e.g., CHO), or human (e.g., HEK 293) cell lines.

[0121] Process 700 administers (703) immunocompetent animals with an immunogenic cocktail having transcript derived molecules. Any immunocompetent animal can be used, such as, for example, human, rabbit, goat, mouse, rat, chicken, or guinea pig. Immunocompetent animals can be administered with a stimulating amount of transcript-derived molecules, with or without conjugate and with or without adjuvant. A stimulating amount of transcript-derived molecules is the amount required to stimulate an immune response that results in production of a collectable amount of antibodies that have affinity for the transcript-derived molecules. The stimulating amount may also depend on the use of conjugate and/or adjuvant. Any appropriate conjugate can be used, such as, for example, hemocyanin. Likewise, any appropriate adjuvant can be used, such as, for example, complete Freund's adjuvant.

[0122] Administration (705) of transcript-derived molecules into an immunocompetent animal can be optionally repeated. Often, repeat administrations can improve antibody production. The appropriate number of administrations depends on the application; typically, however, one, two, three, or four administrations are performed.

[0123] Sera, plasma, blood, and/or other hematopoietic tissue having high affinity antibodies and/or antibody generating cells are harvested (707) from immunocompetent animals at an appropriate amount of time after the final administration of transcript-derived molecules. The appropriate amount of time is the time required for the immunocompetent animal to have an immune response and generate antibodies, which is dependent in part on the immunocompetent animal used. For example, an appropriate time to harvest antibodies from rabbits is typically around 30 days after last immunization. Harvesting of sera, plasma and/or blood can be performed by any of the many methods known in the art.

[0124] Once sera, plasma, blood and/or hematopoietic tissue having high affinity antibodies and/or antibody generating cells are harvested, antibodies may be used in their natural buffer or further purified by a number of methods in accordance with a number of embodiments. Likewise, antibody-generating cells may be cultured and/or stored in accordance with several embodiments.

Monoclonal Antibody Development

[0125] Monoclonal antibodies can be prepared using a wide variety of techniques known in the art including the use of hybridoma, recombinant, and phage display technologies, or a combination thereof. For example, monoclonal antibodies can be produced using hybridoma techniques including those known in the art and taught, for example, in Harlow et al., Antibodies: A Laboratory Manual, Cold Spring Harbor Laboratory Press, 2nd ed. 1988, the disclosure of which is incorporated herein by reference. The term "monoclonal antibody" refers to an antibody that is derived from a single clone, including any eukaryotic, prokaryotic, or phage clone, and not the method by which it is produced. Thus, the term "monoclonal antibody" is not limited to antibodies produced through hybridoma technology. In certain embodiments, antibodies of the present invention are derived from human B cells that have been immortalized via transformation.

[0126] In the well-known hybridoma process (Kohler et al., Nature 1975 256, 495; the disclosure of which is incorporated herein by reference) the relatively short-lived, or mortal, lymphocytes from a mammal (e.g., B cells derived from a human subject as described herein) are fused with an immortal tumor cell line (e.g., a myeloma cell line), thus, producing hybrid cells or "hybridomas" which are both immortal and capable of producing the genetically coded antibody of the B cell. The resulting hybrids are segregated into single genetic strains by selection, dilution, and regrowth with each individual strain comprising specific genes for the formation of a single antibody. Each single strain produces antibodies, which are homogeneous against a desired antigen.

[0127] Hybridoma cells thus prepared are seeded and grown in a suitable culture medium that preferably contains one or more substances that inhibit the growth or survival of the unfused, parental myeloma cells. Those skilled in the art will appreciate that reagents, cell lines and media for the formation, selection and growth of hybridomas are commercially available from a number of sources and standardized protocols are well established. Generally, culture medium in which the hybridoma cells are growing is assayed for production of monoclonal antibodies against the desired antigen (e.g., transcript-derived molecules). The binding specificity of the monoclonal antibodies produced by hybridoma cells is determined by in vitro assays such as immunoprecipitation, radioimmunoassay (RIA), or enzyme-linked immunosorbent assay (ELISA). After hybridoma cells are identified that produce antibodies of the desired specificity, affinity and/or activity, the clones may be subcloned by limiting dilution procedures and grown by standard methods (Goding, Monoclonal Antibodies: Principles and Practice, Academic Press, 1986 pp 59-103). It will further be appreciated that the monoclonal antibodies secreted by the subclones may be separated from culture medium, ascites fluid or serum by conventional purification procedures such as, for example, protein-A, hydroxylapatite chromatography, gel electrophoresis, dialysis or affinity chromatography.

Selection and Purification of Antigen Binding Molecules

[0128] Embodiments are also directed to the selection of antigen binding molecules having high specificity, preference, and affinity for RNA transcript-derived molecules. In various embodiments, transcript-derived molecules, such as those described herein, are used to identify and select such antigen binding molecules. Binding molecules can be used for a variety of downstream applications, including, but not limited to, diagnosis and treatment of tumors and cancers.

[0129] Embodiments can begin with acquisition of antigen binding molecules. Antigen binding molecules can be obtained in a variety of methods, including those described herein. Once antigen-binding molecules are obtained, they can be screened for their specificity, preference, and affinity for chimeric biomolecules.

[0130] Many assays are known in the art to screen the specificity of antibodies for a particular antigen. These assays include, but are not limited to, Western blot, immunoprecipitation, RIA, and ELISA. Accordingly, chimeric biomolecules can be used as antigens to screen for specificity.

[0131] A number of assays are also known to determine antibody preference for a particular antigen over a similar antigen. Particularly, in this application, antigen-binding molecules that preferentially bind transcript-derived molecules of a particular isoform with low cross-reactivity with molecules of another isoform are desired. Accordingly, specificity assays, such as those described above, can be used to directly compare antigen binding molecules’ ability to bind transcript-derived molecules and near off-targets. In addition to comparison assays, direct competition assays can be performed, wherein antigen binding molecules are in contact with transcript-derived molecules and near off-targets and their preference for each antigen is determined.

[0132] Binding affinities of antigen binding molecules can be measured by a number of assays. In many of these assays, the dissociation constant (Kd) can be measured directly. Alternatively, affinity can be determined qualitatively by a number of assays, including, but not limited to, Western blot, immunoprecipitation, RIA, and ELISA.

EXEMPLARY EMBODIMENTS

[0133] Bioinformatic and biological data support the methods and systems of unveiling RNA splicing events and applications thereof. In the ensuing sections, exemplary computational methods and exemplary applications related to unveiling of RNA splicing events are provided. Exemplary computational methods and exemplary applications related to unveiling of RNA splicing events can also be found within the journal article of Z. Zhang, et al., Nat. Methods. 16, 307-310 (2019), the disclosure of which is incorporated herein by reference in its entirety.

Unveiling of RNA Transcripts via DARTS

[0134] In an exemplary embodiment, DARTS (Deep-learning Augmented RNA-seq analysis of Transcript Splicing), a Bayesian hypothesis testing (DARTS BHT) framework, was developed for statistical inference of differential alternative splicing. DARTS integrates deep learning-based prediction as a prior, and empirical evidence from a specific RNA-seq dataset as likelihood of differential alternative splicing. A core component of DARTS is a deep neural network (DARTS DNN) that uses predictive features to generate a probability of differential alternative splicing between two biological conditions. Unlike existing methods that use only cis sequence features to predict exon splicing patterns in specific samples (See, e.g., E. Park, et al. American journal of human genetics 102, 11-26 (2018); and H.Y. Xiong, et al., Science 347, 1254806 (2015)), the DARTS DNN incorporates both cis sequence features and the mRNA levels of trans RBPs in two biological conditions (Fig. 8). This design allows the DARTS DNN to consider how altered expression of RBPs, including well-annotated splicing factors, affects alternative splicing in response to perturbations or stimuli. In one embodiment, a list of 2,926 cis sequence features was compiled, which included evolutionary conservation, splice site strength, regulatory motif composition, and RNA secondary structure. A list of 1,498 annotated RBPs was also compiled (S. Gerstberger, M. Hafner, and T. Tuschl Nature reviews. Genetics 15, 829-845 (2014), the disclosure of which is herein incorporated by reference), whose mRNA levels were treated as RBP features. Examples of cis sequence features and RBPs used in the DARTS DNN are provided in Tables 1 and 2. The DARTS DNN was designed to have 4 hidden layers and 7,923,402 parameters. This DARTS DNN can be trained using high-confidence differentially spliced and common exons in a large compendium of pairwise RNA-seq comparisons between distinct biological states. A schematic diagram of the DARTS DNN is shown in Fig. 9. As can readily be appreciated, the specific number of hidden layers and/or neural network architecture is largely dependent upon the requirements of a given application in accordance with various embodiments of the invention.

[0135] The DARTS DNN can be trained on RNA-seq data from a large collection of RBP-depleted human cell lines generated by the ENCODE consortium (Fig. 10) (See, E. L. Van Nostrand, et al., bioRxiv, 179648 (2017)). Specifically, the RNA-seq data can be derived from the ENCODE experiments wherein 196 RBPs are depleted by at least one shRNA in both the K562 and HepG2 cell lines, corresponding to a total of 408 shRNA knockdown vs. control pairwise comparisons. The remaining ENCODE data, corresponding to 58 RBPs that are depleted in only one cell line, can be excluded from training and used as leave-out data to independently evaluate the DARTS DNN (see below). Note that throughout this example, independent leave-out datasets that had never been utilized during training were used to avoid issues of overfitting in benchmarking the performance of the DARTS DNN. To generate training labels for differentially spliced vs. common exons in each pairwise comparison the DARTS BHT framework can be applied with a flat prior (DARTS-flat, i.e. with only RNA-seq data used for the inference) to calculate the probability of an exon being differentially spliced or common between two conditions.

[0136] The performance of DARTS-flat was benchmarked using simulation datasets, and compared favorably to state-of-the-art differential splicing inference methods (Figs. 11 and 12). From the high-confidence differentially spliced vs. common exons called by DARTS-flat on the training RNA-seq data, 90% of the labeled events were used for training and 5-fold cross validation of the DARTS DNN, and the remaining 10% of events for testing the trained DARTS DNN (as described in detail below). The performance of the DARTS DNN increased as the training progressed, reaching a maximum Area Under the Receiver Operating Characteristic curve (AUROC) of 0.97 during cross-validation and 0.86 during testing (Fig. 13). Thus, the DARTS DNN can predict differential splicing upon RBP depletion in the two ENCODE cell lines.

[0137] To test the general applicability of the DARTS DNN, the trained DARTS DNN was used to predict differentially spliced vs. common exons from the leave-out data, which included 58 RBPs that were knocked-down in only one of the two cell lines (Fig. 10). These leave-out data were derived from cell lines depleted for a different set of RBPs, therefore the performance of the DARTS DNN on the leave-out data would indicate model generalizability. The trained DARTS DNN model showed a high accuracy (average AUROC=0.87) on the leave-out data. The leave-out data was also used to compare the DARTS DNN to three baseline methods: the identical DNN structure trained on individual leave-out datasets, logistic regression with L2 penalty, and random forest. The baseline methods were trained using 5-fold cross-validation in each leave-out dataset and the average AUROC for each method was plotted (Fig. 14). Another alternative baseline method was also implemented by predicting sample-specific exon inclusion levels and then taking the absolute difference of the predicted exon inclusion levels (PSI values) between the two conditions as the metric for differential splicing. The DARTS DNN trained on the large-scale ENCODE data outperformed baseline methods by a large margin in 57/58 experiments, with the sole exception being AQR knockdown in K562. The best performance of AUROC=0.95 by the DARTS DNN was achieved for RPL23A knockdown in HepG2. The DARTS DNN model trained on individual leave-out datasets was the worst performer, illustrating the importance of training the DARTS DNN on large-scale data comprising diverse perturbation experiments. Collectively, these results indicate that the DARTS DNN can predict differential splicing upon RBP depletion in the two ENCODE cell lines.

[0138] Having demonstrated the performance of the DARTS DNN model, the ability of the DARTS BHT framework to infer differential alternative splicing from a specific RNA- seq dataset was evaluated, incorporating the DARTS DNN prediction score as the informative prior and observed RNA-seq read counts as the likelihood (DARTS-info). Specifically, the posterior ratio of differential splicing consists of two components: the prior ratio, generated by the DARTS DNN model based on cis sequence features and trans RBP expression levels; and the likelihood ratio, determined by modeling the biological variation and estimation uncertainty of splice isoform ratio based on observed RNA-seq read counts (Fig. 15). Simulation studies with varying strengths of informative prior and observed RNA-seq read counts were performed. These studies demonstrated that the informative prior improves the inference when the observed data is limited, for instance due to low gene expression levels or limited RNA-seq depth, but does not overwhelm the evidence in the observed RNA-seq read counts (Fig. 16). Specifically, for true differential splicing events in the simulation, a considerable number of true positives in the low RNA- seq coverage regions can be rescued via a strong informative prior, whereas the effect of the prior was diminished when the observed RNA-seq read counts were large (Fig. 16).

[0139] To investigate the utility of DARTS-info, DARTS-info and DARTS-flat were used to infer cell-type-specific differential splicing events between two ENCODE cell lines (HepG2 and K562). ENCODE generated paired-end RNA-seq data on 24 and 28 biological replicates of HepG2 and K562 respectively, with on average 66 million read pairs per replicate. Cluster analysis confirmed that these 24 and 28 biological replicates clustered into two distinct groups that matched their cell type labels (Fig. 17). To obtain high-confidence differential and common splicing events between the two cell types, replicates of HepG2 or K562 were aggregated and DARTS-flat was applied on this ultra-deep RNA-seq dataset. Next, DARTS-info and DARTS-flat were applied to all possible (24x28) pairwise comparisons between individual replicates of HepG2 and K562, and the Area Under Precision Recall Curve (AUPR) was computed for the two methods. DARTS-info outperformed DARTS-flat in all pairwise comparisons, and the gain in inference accuracy had a significant negative correlation with the RNA-seq depth of individual replicates (Spearman’s rho=-0.69, p-value<2.2e-16), with the largest gain coming from pairwise comparisons involving low-coverage RNA-seq samples (Fig. 18). It should be noted that by using DARTS-flat to obtain the lists of high-confidence differential and common exons between the two cell types from the ultra-deep RNA-seq data, the comparison at the low sequencing depth was inherently biased towards DARTS-flat and against DARTS-info. Therefore, the consistently superior performance of DARTS-info demonstrates the advantage of incorporating the DNN prediction as prior information when analyzing low-coverage RNA-seq data.

DARTS Bayesian Hypothesis Testing (BHT) Framework

[0140] DARTS BHT, a Bayesian statistical framework, was developed to determine the statistical significance of differential splicing events or common splicing events between RNA-seq data of two biological conditions. The DARTS BHT framework was designed to integrate deep learning based prediction as prior and empirical evidence in a specific RNA-seq dataset as likelihood. The framework begins by modeling the simplest case, i.e. testing the difference in exon inclusion levels (percent-spliced-in (PSI) values) between two conditions without replicates (one sample per condition):

$$
\begin{aligned}
I_{ij} \mid \psi_{ij} &\sim \mathrm{Binomial}\big(n = I_{ij} + S_{ij},\; p = f_i(\psi_{ij})\big) \\
\psi_{i1} &= \mu_i \\
\psi_{i2} &= \mu_i + \delta_i \\
\mu_i &\sim \mathrm{Unif}(0, 1) \\
\delta_i &\sim N(0, \tau^2)
\end{aligned}
$$

where I_ij, S_ij and ψ_ij are the exon inclusion read count, the exon skipping read count, and the exon inclusion level for exon i in sample group j ∈ {1,2}, respectively; f_i is the length normalization function for exon i that accounts for the effective lengths of the exon inclusion and skipping isoforms; μ_i is the baseline inclusion level for exon i; and δ_i is the expected difference of the exon inclusion levels between the two conditions. The goal of the differential splicing analysis is to test whether the difference in exon inclusion levels between the two conditions δ_i exceeds a user-defined threshold c (e.g. 5%) with a high probability, i.e. whether the posterior probability

$$
P\big(|\delta_i| > c \mid I_{ij}, S_{ij}\big)
$$

is close to 1. In Bayesian statistics, this test can be approached by assuming a “spike-and-slab” prior for the parameter of interest δ. The spike-and-slab prior is a two-component mixture prior distribution, with the “spike” component depicting the probability of the model parameter δ being constrained around zero, and the “slab” component depicting the unconstrained distribution of the model parameter δ.

[0141] In the DARTS BHT statistical framework, a spike prior H₀ with a small variance parameter τ = τ₀ was imposed such that the probability of δ concentrates around 0, to account for random biological or technical fluctuations in PSI values between two biological conditions for common splicing events. A slab prior H₁ with a much larger variance parameter τ = τ₁ was imposed to model the difference in PSI values between two conditions for differential splicing events. Parameters were set as follows: τ₀ = 0.03, corresponding to 90% density constrained within ±0.05, and τ₁ = 0.3; the final inference is robust to the choice of τ values. The posterior probability of a splicing event being generated by the two models can be written as:

$$
\begin{aligned}
P(H_1 \mid I_{ij}, S_{ij}) &= \frac{P(I_{ij}, S_{ij} \mid H_1)\, P(H_1)}{Z} \\
P(H_0 \mid I_{ij}, S_{ij}) &= \frac{P(I_{ij}, S_{ij} \mid H_0)\, P(H_0)}{Z}
\end{aligned}
$$

where P(H₁) is the prior probability of exon i being differentially spliced, determined by exon-specific cis features and sample-specific trans RBP expression levels in the two biological conditions, which is independent of the observed RNA-seq read counts. P(H₀) = 1 - P(H₁) is the prior probability of exon i being common. P(I_ij, S_ij | H₁) and P(I_ij, S_ij | H₀) represent the likelihoods under the model of differential splicing or common splicing, respectively. Z is a normalizing constant.
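As a quick numerical check of the spike and slab settings quoted above (0.03 and 0.3), the following minimal sketch computes how much prior mass falls within the 5% threshold. It assumes, as an illustration only, that each component is a zero-mean Gaussian and that the quoted value is its standard deviation; the helper name `mass_within` is hypothetical.

```python
from statistics import NormalDist


def mass_within(tau: float, c: float) -> float:
    """Probability that a zero-mean Gaussian with standard deviation tau
    (an assumed form of the prior) falls inside the interval [-c, c]."""
    dist = NormalDist(mu=0.0, sigma=tau)
    return dist.cdf(c) - dist.cdf(-c)


# Spike (tau0 = 0.03) and slab (tau1 = 0.3) evaluated at the threshold c = 0.05.
spike_mass = mass_within(0.03, 0.05)
slab_mass = mass_within(0.3, 0.05)
print(f"spike mass within +/-0.05: {spike_mass:.3f}")
print(f"slab mass within +/-0.05:  {slab_mass:.3f}")
```

Under these assumptions the spike places roughly 90% of its mass inside the threshold, while the slab places most of its mass outside it.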

[0142] The comparison is between only two models, and thus the above equation can be rewritten as a factorization of the ratios between prior and likelihood:

$$
\frac{P(H_1 \mid I_{ij}, S_{ij})}{P(H_0 \mid I_{ij}, S_{ij})} = \frac{P(I_{ij}, S_{ij} \mid H_1)}{P(I_{ij}, S_{ij} \mid H_0)} \times \frac{P(H_1)}{P(H_0)}
$$

Note that when the prior distribution is flat, i.e. P(H₀) = P(H₁) = 0.5, the above equation is equivalent to a likelihood ratio test, which is referred to as DARTS-flat. When P(H₀) and P(H₁) incorporate an informative prior based on exon- and sample-specific predictive features, this is referred to as DARTS-info.

[0143] Using the preceding factorization equation, the marginal posterior probability can be derived for the parameter of interest δ_i as a mixture of the posterior conditioned on the two models:

$$
P(\delta_i \mid I_{ij}, S_{ij}) = P(H_1 \mid I_{ij}, S_{ij})\, P(\delta_i \mid I_{ij}, S_{ij}, H_1) + P(H_0 \mid I_{ij}, S_{ij})\, P(\delta_i \mid I_{ij}, S_{ij}, H_0)
$$

Hence, the final inference is performed on the probability P(|δ_i| > c | I_ij, S_ij). In this example, the user-defined threshold c was set such that c = 0.05 (i.e. a 5% change in exon inclusion level). Events with P(|δ_i| > c | I_ij, S_ij) > 0.9 were called as significant differential splicing events and events with P(|δ_i| > c | I_ij, S_ij) < 0.1 were called as significant common splicing events. Events with 0.1 ≤ P(|δ_i| > c | I_ij, S_ij) ≤ 0.9 were deemed inconclusive.

Throughout this specification, the subscripts are omitted, and P(|δ| > c | I, S) and P(|Δψ| > c) are used interchangeably.
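The factorization of the posterior ratio into a likelihood ratio and a prior ratio can be illustrated with a minimal sketch. The function names and the example numbers are hypothetical; `call_event` takes the posterior probability of a splicing change exceeding the threshold c, which is a different quantity from the model posterior P(H₁ | data) returned by `posterior_h1`, and the 0.9 and 0.1 cut-offs follow the labeling thresholds stated in this example.

```python
def posterior_h1(prior_h1: float, likelihood_ratio: float) -> float:
    """Posterior probability of H1 from the prior P(H1) and the likelihood
    ratio P(data | H1) / P(data | H0), using posterior odds = LR x prior odds."""
    if not 0.0 < prior_h1 < 1.0:
        raise ValueError("prior_h1 must lie strictly between 0 and 1")
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


def call_event(prob_change: float, upper: float = 0.9, lower: float = 0.1) -> str:
    """Classify an event from its posterior probability of |delta| > c."""
    if prob_change > upper:
        return "differential"
    if prob_change < lower:
        return "common"
    return "inconclusive"


# With a flat prior (0.5) the posterior is driven by the likelihood ratio alone;
# an informative prior shifts the same evidence up or down.
flat = posterior_h1(0.5, 3.0)
informative = posterior_h1(0.9, 3.0)
print(flat, informative, call_event(0.95), call_event(0.5))
```

With identical read-count evidence, the informative prior moves the posterior further than the flat prior does, which is the mechanism by which weakly supported events can be rescued.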

DARTS Deep Neural Network (DNN) Model for Predicting Differential Alternative Splicing

[0144] In this example, an important component of the DARTS BHT framework is a deep neural network (DNN) model that generates a probability of differential splicing between two biological conditions using exon- and sample-specific predictive features. The DARTS DNN was designed to predict differential splicing of a given exon based on exon-specific cis sequence features and sample-specific trans RBP expression levels in two biological conditions.

[0145] As noted above, a useful feature of the DARTS BHT framework is its capability to determine the statistical significance of both differential splicing events and common splicing events. Specifically, for a splicing event i in the comparison k between RNA-seq datasets from two distinct biological conditions (j ∈ {1,2}), let Y_ik = 1 if this event is differentially spliced (i.e. H₁ is true) and Y_ik = 0 if this event is common (i.e. H₀ is true). The task of predicting differential splicing can be formulated as:

$$
Y_{ik} = F(E_i, G_k)
$$

where Y_ik is the label for event i in the comparison k; E_i is a vector of 2,926 cis sequence features for exon i, including evolutionary conservation, splice site strength, regulatory motif composition, and RNA secondary structure; and G_k is a vector of 2,996 (= 1,498 × 2) normalized gene expression levels of 1,498 RBPs in the two conditions of comparison k. The prediction of P(Y_ik = 1) based on the features from any specific RNA-seq dataset can then be incorporated as an informative prior for P(H₁) in the DARTS BHT framework.

[0146] A deep learning model (DARTS DNN) was implemented to learn the unknown function F that maps the predictive features to splicing profiles (differential vs. common). The DARTS DNN was designed with 4 hidden layers and 7,923,402 parameters. The configuration of the DNN was: an input layer with 5,922 (= 2,926 + 1,498 × 2) variables; 4 fully-connected hidden layers with 1200, 500, 300, and 200 variables and the ReLU activation function; and an output layer with 2 variables and the Softmax activation function. The DARTS DNN was implemented using Keras (https://github.com/keras-team/keras) with the Theano backend.
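The stated parameter count can be checked against the layer sizes with a short calculation. This sketch assumes that each fully-connected layer carries a bias term and that each hidden unit contributes two trainable batch-normalization parameters (a scale and a shift); the text does not state how the total was counted, so this is one accounting that happens to reproduce it.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


# Input, four hidden layers, and output sizes as given in the text.
layer_sizes = [5922, 1200, 500, 300, 200, 2]

dense_total = sum(
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)

# Assumption: two trainable batch-normalization parameters per hidden unit.
hidden_sizes = layer_sizes[1:-1]
batchnorm_total = sum(2 * n for n in hidden_sizes)

total = dense_total + batchnorm_total
print(dense_total, batchnorm_total, total)
```

Almost all of the parameters sit in the first layer, which connects the 5,922 input features to the 1,200 units of the first hidden layer.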

[0147] To mitigate potential overfitting of the DARTS DNN, a drop-out probability was added for connections between hidden layers. Specifically, the variables in the four hidden layers were randomly turned off during the training process with probability 0.6, 0.5, 0.3, and 0.1, respectively. Batch-normalization layers were also added for all hidden layers to help the model converge and generalize. Finally, the RMSprop optimizer was used to adaptively adjust for the magnitudes of the components of the gradient in this deep architecture, and 1000 labeled alternative splicing events were chosen as one mini-batch to obtain a more stable gradient. In each mini-batch, the composition of positive and negative labels was balanced by adding more positive events such that the ratio of positive to negative events was 1:3. Since there were significantly more negative (common) events than positive (differential) events, such a balanced composition provides a gradient for learning the positive events in different mini-batches.
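The 1:3 mini-batch composition can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the DARTS DNN: the function name is hypothetical, positives are drawn with replacement (so that a small positive set can still fill its quota), and negatives are drawn without replacement.

```python
import random


def balanced_minibatch(positives, negatives, batch_size=1000,
                       pos_fraction=0.25, rng=None):
    """Draw one mini-batch with a fixed positive:negative ratio (1:3 by default)."""
    if rng is None:
        rng = random.Random()
    n_pos = round(batch_size * pos_fraction)
    n_neg = batch_size - n_pos
    # Assumption: oversample positives with replacement.
    pos = rng.choices(positives, k=n_pos)
    # Negatives are plentiful, so sample without replacement.
    neg = rng.sample(negatives, k=n_neg)
    batch = pos + neg
    rng.shuffle(batch)
    return batch


# Hypothetical labeled events: 40 differential and 5,000 common.
positives = [("differential", i) for i in range(40)]
negatives = [("common", i) for i in range(5000)]
batch = balanced_minibatch(positives, negatives, rng=random.Random(0))
n_positive = sum(1 for label, _ in batch if label == "differential")
print(len(batch), n_positive)
```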

[0148] To monitor the training loss and validation loss, the loss was computed every 10 mini-batches and the current model parameters were saved if the validation loss was lower than that of the previous best model. The DARTS DNN was trained on a Tesla K40m GPU.

Processing of ENCODE RNA-seq data and training of the DARTS DNN Model

[0149] A comprehensive RNA-seq dataset of a large collection of RBP-depletion experiments in two human cell lines (K562 and HepG2) from the ENCODE consortium was used to train the DARTS DNN. The ENCODE investigators have performed systematic shRNA knockdown of over 250 RBPs in the HepG2 and K562 cell lines. All available (as of May 2017) RNA-seq alignments (ENCODE processing pipeline on the human genome version hg19) for shRNA knockdown and control samples from the ENCODE data portal (https://www.encodeproject.org/) were downloaded.

[0150] RNA-seq alignments (bam files) were processed using rMATS (v4.0.1) (See S. Shen, et al., PNAS 111, E5593-5601 (2014), the disclosure of which is herein incorporated by reference). Given RNA-seq alignment files, rMATS constructs splicing graphs, detects annotated and novel alternative splicing events, and counts the number of RNA-seq reads for each exon and splice junction. Given the modest depth of the ENCODE RNA-seq data (32 million read pairs per replicate on average), the read counts from the two replicates were pooled together for downstream analyses.

[0151] The raw RNA-seq reads were processed with Kallisto (v0.43.0) to quantify gene expression levels using Gencode (v19) protein coding transcripts as the index (For details on Kallisto and Gencode, see N. L. Bray, et al., Nature biotechnology 34, 525-527 (2016); and J. Harrow, et al., Genome biology 7 Suppl 1, S4.1-9 (2006); the disclosures of which are herein incorporated by reference). For each of the two biological conditions in a given comparison (i.e., shRNA knockdown vs. control), the Kallisto-derived gene-level transcripts per million (TPM) values of 1,498 known RBPs were extracted. The TPM value of each RBP was normalized by dividing by its maximum observed TPM value across all comparisons, and then used as an RBP expression feature by the DARTS DNN.
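The per-RBP normalization described above can be sketched as a division by the maximum TPM observed for that RBP over all samples. The data layout (a mapping from sample identifier to a per-RBP TPM dictionary), the function name, and the example values are assumptions made for illustration; an RBP with zero expression everywhere is mapped to 0.

```python
def normalize_rbp_tpm(tpm_by_sample):
    """Scale each RBP's TPM by its maximum TPM across all samples.

    tpm_by_sample: {sample_id: {rbp_name: tpm_value}}
    Returns the same structure with values in [0, 1].
    """
    max_tpm = {}
    for profile in tpm_by_sample.values():
        for rbp, tpm in profile.items():
            max_tpm[rbp] = max(max_tpm.get(rbp, 0.0), tpm)

    normalized = {}
    for sample, profile in tpm_by_sample.items():
        normalized[sample] = {
            rbp: (tpm / max_tpm[rbp] if max_tpm[rbp] > 0 else 0.0)
            for rbp, tpm in profile.items()
        }
    return normalized


# Hypothetical TPM values for two RBPs in a knockdown and a control sample.
tpm = {
    "knockdown": {"ESRP1": 5.0, "RBM47": 40.0},
    "control": {"ESRP1": 50.0, "RBM47": 20.0},
}
features = normalize_rbp_tpm(tpm)
print(features)
```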

[0152] To generate training labels for the DARTS DNN, DARTS-flat was applied to the ENCODE RNA-seq data. Events with posterior probability P(|Δψ| > 0.05) > 0.9 were called positive (Y=1). Events with posterior probability P(|Δψ| > 0.05) < 0.1 were called negative (Y=0). These significant differential splicing events and significant common splicing events were defined as labeled events and used to train the DARTS DNN.

[0153] The vast majority of the RBPs (n=196) in the ENCODE data were knocked down by at least one shRNA in both HepG2 and K562 cell lines, corresponding to a total of 408 comparisons between knockdown and control. Ten percent of the labeled positive events and of the labeled negative events were set aside in each comparison as the testing data for estimating the generalization error of the trained DNN model. The remaining 90% of the labeled events were further split into 5-fold cross-validation subsets for the purposes of training, monitoring overfitting, and early-stopping. ENCODE RBP knockdown experiments performed in only one cell line (either HepG2 or K562, n=58) were also collected as leave-out datasets. All labeled events in these leave-out datasets were only utilized for evaluating the trained DARTS DNN and were never used during training.

[0154] Four RBPs were randomly drawn without replacement for a training batch, and iterated through all 196 RBPs as an epoch. The performance of the DARTS DNN was measured by Area Under the Receiver Operating Characteristics curve (AUROC). The model with the best performance during training and cross-validation was selected, and subsequently benchmarked using the testing data and leave-out data.
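The epoch structure described above, in which four RBPs are drawn without replacement per training batch until all 196 RBPs have been visited, can be sketched as a shuffle followed by chunking. The function name and the placeholder RBP identifiers are hypothetical.

```python
import random


def rbp_batches(rbps, batch_size=4, rng=None):
    """Split a shuffled copy of the RBP list into consecutive batches,
    so that every RBP is visited exactly once per epoch."""
    if rng is None:
        rng = random.Random()
    order = list(rbps)
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Placeholder identifiers standing in for the 196 training RBPs.
rbps = [f"RBP_{i:03d}" for i in range(196)]
epoch = rbp_batches(rbps, rng=random.Random(1))
print(len(epoch), epoch[0])
```

With 196 RBPs and a batch size of 4, one epoch consists of 49 batches.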

DARTS Informative Prior

[0155] In a typical RNA-seq study, the number of common splicing events can be orders of magnitude larger than the number of differential splicing events, and machine-learning algorithms may be biased toward the majority class. To mitigate this potential bias, an unsupervised rank-transformation was used to rescale DARTS DNN scores to derive the informative prior for the DARTS BHT framework. Specifically, a two-component Gaussian mixture model was fit to all the DARTS DNN scores to derive the mean and variance of the two mixed Gaussian components as well as the posterior probability λ of each DARTS DNN score belonging to a specific component. Setting the new means and variances of the two Gaussian components to μ₀ and μ₁, and δ₀ and δ₁, respectively, each DARTS DNN score was rank-transformed to the new Gaussian components and then averaged by the weight parameter λ. Finally, to maintain a valid prior probability, the transformed DARTS DNN scores were rescaled to [α, 1 - α], where α ∈ [0, 0.5] sets the desired prior strength for the DARTS BHT framework and a smaller α value corresponds to a stronger informative prior. Using this rescaling scheme, the ranks of the DARTS DNN scores are preserved while the potential bias for negative over positive events is reduced. In one example, μ₀ = 0.05, μ₁ = 0.95, δ₀ = δ₁ = 0.1, and α = 0.05.

[0156] This method is referred to herein as DARTS-info, and it was compared to the baseline DARTS-flat method that uses the same BHT framework but only considers the RNA-seq data without incorporating the DARTS DNN prediction as the informative prior.
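The final rescaling step of the informative prior, which maps transformed scores into a bounded interval controlled by the prior-strength parameter, can be sketched as below. Only that last step is shown; the Gaussian mixture fit and rank transformation are omitted. A linear min-max mapping is assumed here because it preserves ranks, but the text does not specify the exact mapping, and the function name is hypothetical.

```python
def rescale_to_prior(scores, alpha=0.05):
    """Linearly map scores onto [alpha, 1 - alpha], preserving their order.

    A smaller alpha lets priors approach 0 or 1 more closely, i.e. a
    stronger informative prior; alpha = 0.5 collapses every prior to 0.5.
    """
    if not 0.0 <= alpha <= 0.5:
        raise ValueError("alpha must lie in [0, 0.5]")
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # No ranking information: fall back to a flat prior.
        return [0.5 for _ in scores]
    width = 1.0 - 2.0 * alpha
    return [alpha + width * (s - lowest) / (highest - lowest) for s in scores]


# Hypothetical transformed scores.
priors = rescale_to_prior([0.0, 0.2, 0.5, 1.0], alpha=0.05)
print(priors)
```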

DARTS DNN Extension to Additional Tissues and Cell Types

[0157] Next, it was determined whether the DARTS DNN model can be extended to additional tissues and cell types, and whether and how the choice of training datasets influences the performance of the DARTS DNN on other datasets. Deep RNA-seq data from diverse cell types and tissues generated by the Roadmap Epigenomics consortium was used for this analysis. The Roadmap data was processed following the same protocol as for the ENCODE data (See Roadmap Epigenomics Consortium, et al., Nature 518, 317-330 (2015), the disclosure of which is herein incorporated by reference). All Roadmap data with 101 bp x 2 or 100bp x 2 paired-end RNA-seq were used, and reads from the 101 bp x 2 datasets were truncated to 100bp for rMATS. In total, this represented 23 distinct tissues or cell types. All possible pairwise comparisons (n=253) between these 23 RNA-seq samples were performed. All pairwise comparisons involving the thymus tissue in Roadmap (22 comparisons) were held-out to use later for model testing. Three DARTS DNN models were used: trained on ENCODE data only, Roadmap data only, and both. Each model performance was evaluated with the held-out data from ENCODE or Roadmap (Fig. 19).
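The comparison counts quoted above follow directly from enumerating all unordered pairs of the 23 samples. The sketch below uses placeholder sample names (only "thymus" is taken from the text) to show the 253 total comparisons and the 22 thymus comparisons that were held out.

```python
from itertools import combinations

# Placeholder names for 22 tissues or cell types, plus thymus.
samples = [f"sample_{i:02d}" for i in range(22)] + ["thymus"]

all_pairs = list(combinations(samples, 2))
held_out = [pair for pair in all_pairs if "thymus" in pair]
training = [pair for pair in all_pairs if "thymus" not in pair]

print(len(all_pairs), len(held_out), len(training))
```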

[0158] The DARTS DNN trained on ENCODE data exhibited high predictive power for differential splicing in held-out ENCODE data and modest predictive power for differential splicing in held-out Roadmap data (Fig. 19, left panel). Conversely, DARTS DNN trained on Roadmap data had high predictive power for held-out Roadmap data and modest predictive power for held-out ENCODE data (Fig. 19, right panel). Finally, the best model performance was achieved with DARTS DNN trained on the combination of ENCODE and Roadmap data (Fig. 19).

DARTS DNN Applied to Unveil Epithelial to Mesenchymal Transition Splicing Events

[0159] As a further proof-of-principle study, DARTS DNN trained on ENCODE and Roadmap datasets was applied to unveil alternative splicing during the epithelial-mesenchymal transition (EMT), a key cellular process in embryonic development and cancer metastasis. The trained DARTS model was applied to study EMT-associated alternative splicing events in two distinct human cell culture systems: H358 lung cancer cell line induced to undergo EMT through a 7-day time course via inducible expression of the mesenchymal EMT driver ZEB1. DARTS-flat was used to compare RNA-seq data from Day 1 to Day 7 against Day 0. Motif analysis was performed by calculating the average percentage of nucleotides covered by any of the top 12 ESRP SELEX-seq hexamer motifs in a 45bp sliding window (For more on ESRP SELEX-seq, see, K. A. Dittmar, Molecular and cellular biology 32, 1468-1482 (2012), the disclosure of which is herein incorporated by reference). Background sequences for the motif analysis were significant common events identified by DARTS-flat.
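The sliding-window motif coverage used in the motif analysis can be sketched for a single sequence as follows. The function name is hypothetical, the hexamer in the example is a placeholder rather than one of the ESRP SELEX-seq motifs, and averaging across events (as done for the reported profiles) is left out.

```python
def motif_coverage_profile(sequence, motifs, window=45):
    """Percentage of nucleotides covered by any motif occurrence,
    reported for every window-sized stretch of the sequence."""
    seq = sequence.upper()
    covered = [False] * len(seq)
    for motif in motifs:
        motif = motif.upper()
        start = seq.find(motif)
        while start != -1:
            for pos in range(start, start + len(motif)):
                covered[pos] = True
            # Step by one so overlapping occurrences are also found.
            start = seq.find(motif, start + 1)

    if len(seq) < window:
        return []
    count = sum(covered[:window])
    profile = [100.0 * count / window]
    for end in range(window, len(seq)):
        # Slide the window by one nucleotide.
        count += covered[end] - covered[end - window]
        profile.append(100.0 * count / window)
    return profile


# Placeholder sequence with one occurrence of a placeholder hexamer.
example_seq = "A" * 20 + "TGGTGG" + "A" * 40
profile = motif_coverage_profile(example_seq, ["TGGTGG"])
print(len(profile), round(profile[0], 2), round(profile[-1], 2))
```

Marking covered positions once before sliding the window means that overlapping motif hits are not counted twice.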

[0160] The ability of the DARTS DNN model to predict high-confidence differential vs. common splicing events during the EMT time course was examined. As EMT progressed, the number of differential splicing events called by DARTS-flat increased, along with the prediction accuracy of the DARTS DNN model (Fig. 20). The DARTS DNN trained on ENCODE+Roadmap data displayed the best model performance, followed closely by the DARTS DNN trained on Roadmap data, whereas the DARTS DNN trained on ENCODE data performed least well (Fig. 20). These findings were not unexpected, given that the Roadmap RNA-seq data cover diverse tissues and cell types including various epithelial and mesenchymal cell types, whereas the ENCODE data are restricted to HepG2 and K562 cell lines. The best prediction accuracy of AUC=0.82 was achieved by the DARTS DNN trained on ENCODE+Roadmap for the Day 6 versus Day 0 comparison.

[0161] To further investigate the validity of the DARTS predictions, a set of 449 “DARTS DNN rescued” events was compiled from the Day 6 vs. Day 0 comparison. These are splicing events that displayed a high DARTS DNN score of differential splicing (FPR<5%) and a non-trivial splicing change (over 10% difference in exon inclusion level), but did not pass the significance threshold by DARTS-flat using observed RNA-seq read counts alone. A subset of “DARTS DNN rescued” events was unveiled with significantly reduced exon inclusion during EMT, and the intronic regions downstream of these exons were enriched for the previously defined consensus motif of the splicing factors ESRP1 and ESRP2 (Fig. 21). A similar pattern of ESRP motif enrichment was observed for differential splicing events called by DARTS-flat using RNA-seq data alone (Fig. 21). By contrast, 123 significant events were also found by DARTS-flat but fell below the significance threshold (posterior probability < 0.9) after incorporating the informative prior. These events were not enriched for the ESRP motif. ESRPs are epithelial-specific splicing factors whose downregulation is a major driver of alternative splicing during EMT. The pattern of ESRP motif enrichment in the subset of “DARTS DNN rescued” events with reduced exon inclusion during EMT is consistent with previous findings that ESRP binding downstream of alternative exons enhances exon inclusion. The motif data provide transcriptome-wide regulatory evidence for the validity of the DARTS DNN prediction. As a specific example, DARTS DNN predicted the EMT-associated alternative splicing change in the PLEKHA1 gene (Fig. 22). The DARTS DNN score for this exon is 0.94 in Day 5 versus Day 0, increasing the posterior probability of differential splicing to 0.73 over 0.42 when using RNA-seq data alone. The differential splicing pattern of this exon was apparent throughout the time course and was validated by RT-PCR.

[0162] To extend the DARTS analysis of EMT-associated alternative splicing in the H358 lung cancer cell line, other epithelial and mesenchymal cell lines were compared. Paired-end RNA-seq was performed on the PC3E and GS689 prostate cancer cell lines, which were previously shown to have contrasting epithelial and mesenchymal characteristics, respectively. A deep RNA-seq dataset with on average 125 million read pairs per replicate for three biological replicates per cell type was generated.

[0163] The analysis revealed that several EMT relevant splicing factors (ESRP1, ESRP2, RBM47) were differentially expressed in both the GS689-PC3E “EMT system” and during EMT of H358; a few other RBPs were differentially expressed in only one of the two comparisons (Fig. 23). The DARTS DNN scores of these two EMT systems were highly correlated (Spearman’s rho=0.87, p-value<2.2e-16; Fig. 24). This correlation was higher than the correlation of the GS689-PC3E scores with the DARTS DNN scores for any ENCODE RBP-depletion experiment (median=0.69, interquartile range IQR=[0.67, 0.73]). These data suggest that there is a core differential alternative splicing signature between epithelial and mesenchymal cells, and that the DARTS DNN model can capture this signature.

[0164] Finally, to affirm the ability of DARTS to unveil bona fide differential splicing events from moderately or lowly expressed genes, RASL-seq was performed on the same PC3E and GS689 RNA samples to generate high-confidence measurements of exon inclusion levels (PSI values) for these two cell types. RASL-seq is a sequencing method for targeted amplification and quantitative profiling of alternative splicing events (See, H. Li, J. Qiu, and X. D. Fu, Current protocols in molecular biology Chapter 4, Unit 4 13 1 1 - 19 (2012), the disclosure of which is herein incorporated by reference). RASL-seq reads were aligned to the pool of target splice junctions in the RASL-seq library using Blat (See W. J. Kent Genome research 12, 656-664 (2002), the disclosure of which is herein incorporated by reference).

[0165] The absolute difference of PSI values between the two cell types (denoted as RASL-ΔPSI) was computed for each alternative splicing event (n=1,058) passing the RASL-seq read coverage filter. RASL-PSI values were calculated as I / (I + S), where I is the number of exon inclusion splice junction reads and S is the number of exon skipping splice junction reads. Alternative splicing events with total RASL-seq read counts larger than 5 in every replicate were used for downstream analyses. Gene expression levels of RBPs in the two datasets were quantified using Kallisto v0.43.0.
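The RASL-seq quantities described above can be sketched in a few lines. The function names and read counts are hypothetical, and averaging PSI across replicates before taking the absolute difference is an assumption; the text does not state how replicates were combined.

```python
def rasl_psi(inclusion_reads, skipping_reads):
    """Exon inclusion level I / (I + S); None when there are no reads."""
    total = inclusion_reads + skipping_reads
    if total == 0:
        return None
    return inclusion_reads / total


def passes_coverage_filter(replicate_counts, min_total=5):
    """True when every replicate has more than min_total reads in total."""
    return all(i + s > min_total for i, s in replicate_counts)


def rasl_delta_psi(group_a, group_b):
    """Absolute difference of mean PSI between two groups of (I, S) counts.
    Assumes both groups already pass the coverage filter."""
    def mean_psi(counts):
        values = [rasl_psi(i, s) for i, s in counts]
        return sum(values) / len(values)
    return abs(mean_psi(group_a) - mean_psi(group_b))


# Hypothetical (inclusion, skipping) counts for three replicates per cell type.
epithelial = [(80, 20), (90, 10), (85, 15)]
mesenchymal = [(20, 80), (10, 90), (15, 85)]
if passes_coverage_filter(epithelial + mesenchymal):
    delta = rasl_delta_psi(epithelial, mesenchymal)
    print(round(delta, 2))
```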

[0166] From the RNA-seq data and DARTS DNN prediction, four groups of alternative splicing events were compiled and their distributions of RASL-ΔPSI values were compared (Fig. 25). For this analysis, events were restricted to those having a RASL-ΔPSI value <0.3. As expected, alternative splicing events called as differential or common using RNA-seq data alone (by DARTS-flat) displayed the highest or lowest RASL-ΔPSI values, respectively. For the remaining alternative splicing events called as inconclusive by DARTS-flat using RNA-seq data alone, two additional groups were compiled: DARTS DNN-predicted differential events, with high DARTS DNN scores (FPR<5%); and DARTS DNN-predicted common events, with low DARTS DNN scores (FPR>80%). The RASL-ΔPSI values of the DARTS DNN-predicted differential events were significantly larger than those of the DARTS DNN-predicted common events (p-value=0.035, one-sided Wilcoxon test), with the DARTS DNN-predicted differential events similar to the RNA-seq differential events and the DARTS DNN-predicted common events similar to the RNA-seq common events (Fig. 25). These events were in genes with significantly lower gene expression levels (p-value=0.001, Wilcoxon test) and had significantly lower RNA-seq coverage (p-value=2.1e-7, Wilcoxon test) compared to differential events called by DARTS-flat (Fig. 26A), and were similarly called as insignificant by rMATS when using RNA-seq data alone. Collectively, among the events analyzed by RASL-seq, DARTS DNN predicted 52 additional differential splicing events, beyond the 77 events called using RNA-seq data alone (Fig. 25). Importantly, alternative splicing events in moderately or lowly expressed genes with high DARTS DNN prediction scores had a RASL-seq validation rate comparable to that of the differential splicing events called from RNA-seq data alone in highly expressed genes. Additionally, on this set of RNA-seq inconclusive events with high or low DARTS DNN scores, the RASL-seq data was used to define the ground truth, with RASL-|ΔPSI|>5% as differential and RASL-|ΔPSI|<1% as unchanged events, respectively. The performance of DARTS BHT(info) was then benchmarked against DARTS BHT(flat), DARTS DNN, and two existing methods, rMATS and SUPPA2. rMATS and SUPPA2 were chosen because they represent two distinct strategies (alignment-based vs. alignment-free) for quantifying alternative splicing using RNA-seq data. As shown in Fig. 26B, DARTS BHT(info) consistently outperformed baseline methods that use RNA-seq data alone to call differential splicing: AUC of 0.76 for DARTS BHT(info), versus 0.68, 0.63, and 0.61 for DARTS BHT(flat), rMATS, and SUPPA2, respectively. A consistent gain by DARTS BHT(info) over baseline methods was also observed at different FPR thresholds for DARTS DNN-predicted differential events, with the maximum gain observed for the most confidently predicted events of FPR=1% (Fig. 26B). Together, these data suggest that DARTS can reliably predict and unveil differential splicing events from moderately or lowly expressed genes and expand the findings beyond a conventional RNA-seq splicing analysis, even on a deep RNA-seq dataset.

Detection of 5’ and 3’ Alternative Splicing Patterns

[0167] Having demonstrated the performance of the DARTS DNN on predicting differential exon skipping events, the DNN model was extended to different classes of alternative splicing patterns. Specifically, 2,973, 2,971, and 1,748 cis sequence features were compiled and three DNN models were trained for predicting differential alternative 5’ splice sites (A5SS), alternative 3’ splice sites (A3SS), and retained intron (RI) events, respectively. The training behavior of these DNN models was similar to that of the DNN model trained for exon skipping events (Fig. 27). Trained on ENCODE and Roadmap data, these DNN models also exhibited high prediction accuracy and generalizability in the independent leave-out datasets and outperformed baseline methods by a large margin (Fig. 27). These data extend the utility of the DARTS DNN beyond exon skipping to diverse types of alternative splicing patterns.

Implementation of alternative machine learning strategies and comparison to DARTS DNN

[0168] To benchmark the performance of our trained DARTS DNN model against alternative machine learning strategies, we implemented two baseline methods, Logistic regression with L2 penalty and Random Forest. Because these baseline methods were unable to scale up to big data (see below), they were trained and benchmarked on individual ENCODE leave-out datasets by cross-validation. The identical events with their corresponding labels and features were fed into the baseline classifiers through 5-fold cross-validation, and we recorded the performance, measured by AUROC, in each of the validation sets.

[0169] We implemented the two methods using scikit-learn in Python. For the logistic regression, we need to tune one parameter, i.e. the penalty strength or, equivalently, its inverse C. This parameter controls the complexity of the classifier and hence the severity of overfitting. We chose C=0.1 for our implementation of logistic regression because in practice such a penalty achieves reasonably good generalization over different datasets. Although logistic regression is easy to interpret and a good baseline method for most classification tasks, it cannot effectively detect high-order interaction terms, diminishing its predictive power for such complex tasks.
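By way of illustration only, the logistic regression baseline described above may be sketched with scikit-learn as follows. The feature matrix and labels are synthetic placeholders generated with `make_classification`, standing in for one leave-out dataset; the sample size, feature count, fold shuffling, and random seeds are assumptions and are not taken from the description.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the events, features, and labels of one dataset.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=8, random_state=0
)

# scikit-learn's LogisticRegression applies an L2 penalty by default;
# C is the inverse of the penalty strength, so smaller C means a stronger penalty.
model = LogisticRegression(C=0.1, max_iter=1000)

# 5-fold cross-validation, recording AUROC on each validation fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
lr_auroc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(lr_auroc)         # one AUROC value per validation fold
print(lr_auroc.mean())  # average across the five folds
```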

[0170] Another, more powerful and robust machine learning strategy implemented as a baseline method was Random Forest. Random Forest is an ensemble learning method in which each base classifier is a decision tree that over-fits a set of bootstrapped training samples with a subset of features. The Random Forest classifier has several desirable properties, including being robust to feature scaling and irrelevant features, and being capable of dividing the feature space more flexibly than more conventional partitioning-based classification methods. We tuned the hyper-parameter of Random Forest, i.e. the number of trees in the forest. Typically, the more trees in a random forest, the better the predictive power of the ensemble classifier. We noticed that for our datasets, 500 trees achieved the best testing accuracy, while increasing the number of trees further yielded little additional gain.
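By way of illustration only, tuning the number of trees may be sketched with scikit-learn as follows. The data are synthetic placeholders, and the candidate tree counts other than 500, the sample size, and the random seeds are assumptions chosen for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the events, features, and labels of one dataset.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=8, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean 5-fold AUROC for several forest sizes; n_estimators is the number of trees.
rf_auroc = {}
for n_trees in (10, 100, 500):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X, y, cv=cv, scoring="roc_auc")
    rf_auroc[n_trees] = scores.mean()

for n_trees, mean_auroc in rf_auroc.items():
    print(n_trees, round(mean_auroc, 3))
```

Comparing the mean AUROC across tree counts in this way is one simple means of checking where additional trees stop helping on a given dataset.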

[0171] As shown in Fig. 14, Random Forest almost always outperformed Logistic regression given the same training datasets. We also observed a positive correlation between the performance of Random Forest and Logistic regression, indicating that the training data plays an important role in learning efficiency, even though the two learning algorithms are based on dramatically different underlying structures. Nevertheless, DARTS DNN showed superior performance compared to the baseline methods, even though these knock-down datasets were never used to train DARTS DNN. Furthermore, the performance of DARTS DNN does not show strong correlations with that of the baseline methods, indicating that it generalizes beyond the individual datasets to a more generic regulatory code.

[0172] In the original publication of the DARTS work, we compared DARTS DNN to baseline methods trained only on individual leave-out datasets using cross-validation. The comparison may be inherently biased against DARTS DNN because the leave-out datasets were never used to train DARTS DNN but were used, through cross-validation, to train the baseline methods. In the meantime, there have been questions about how a simpler model (as compared to DARTS DNN) may perform when trained on the entire data, and about the contribution of non-additive interactions between the features representing cis-elements and trans-acting factors. To evaluate the non-linear interactions in the task of differential splicing prediction, a logistic regression model was fit using the entire ENCODE RBP knockdown data, with the identical training-validation-testing split as in DARTS DNN. Note that the logistic regression model can be viewed as a special form of neural network with no hidden layers; the model is therefore so simple that it is unable to capture feature-feature interactions. The testing performance of DARTS DNN and logistic regression is shown in Fig. 28. The simple logistic regression model’s learning power plateaued in fewer than 500 training batches, and its predictive power was significantly worse than that of the DARTS DNN model, which has complex non-linear interactions and also benefits from a significant amount of training.
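By way of illustration only, the inability of a logistic regression model to capture feature-feature interactions can be seen on a toy problem in which the label depends solely on the product of two features. The data, the split, and the helper `with_product` are hypothetical and unrelated to the DARTS features; the sketch merely shows that the same model succeeds once the interaction is supplied as an explicit feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
# The label depends only on the interaction of the two features (sign of x1 * x2).
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

# Additive model: logistic regression on the raw features only.
additive = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc_additive = roc_auc_score(y_test, additive.predict_proba(X_test)[:, 1])


def with_product(features):
    """Append the pairwise product x1 * x2 as an explicit interaction feature."""
    return np.column_stack([features, features[:, 0] * features[:, 1]])


# Same model class, but with the interaction term provided by hand.
interaction = LogisticRegression(max_iter=1000).fit(with_product(X_train), y_train)
auc_interaction = roc_auc_score(
    y_test, interaction.predict_proba(with_product(X_test))[:, 1]
)

print(auc_additive, auc_interaction)
```

A network with hidden layers can learn such product-like terms from the raw features, whereas the additive model needs them to be engineered in advance.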

RASL-seq Library Preparation and Sequencing

[0173] RASL-seq was performed as described, with some modifications (see Y. Ying, et al., Cell 170, 312-323 e310 (2017), the disclosure of which is herein incorporated by reference). Total RNA from PC3E and GS689 cell lines was extracted with Trizol (Thermo Fisher Scientific). RASL-seq oligonucleotides were annealed to 1 µg of total RNA, followed by selection by oligo-dT beads. Paired probes templated by polyA+ RNA were ligated and then eluted. 5 µl of the eluted ligated oligos were used for 8 cycles of PCR amplification using the following primers:

F1: 5’-CCGAGATCTACACTCTTTCCCTACACGACGGCGACCACCGAGAT-3’ (SEQ. ID No. 1)

R1: 5’-GTGACTGGAGTTCAGACGTGTGCGCTGATGCTACGACCACAGG-3’ (SEQ. ID No. 2).

[0174] One third of the resulting PCR products were used in a second round of PCR amplification (9 cycles) using the following primers:

F2: 5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACG-3’ (SEQ. ID No. 3)

R2: 5’-CAAGCAGAAGACGGCATACGAGAT[index]GTGACTGGAGTTCAGACGTGTGC-3’ (SEQ. ID No. 4).

[0175] Indexes used in this study were Illumina indexes D701-D706. The indexed PCR products were pooled and sequenced on a MiSeq with a custom sequencing primer:

5’-ACACTCTTTCCCTACACGACGGCGACCACCGAGAT-3’ (SEQ. ID No. 5)

and a custom index sequencing primer:

5’-TAGCATCAGCGCACACGTCTGAACTCCAGTCAC-3’ (SEQ. ID No. 6).
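By way of illustration only, the relationship between the custom primers and the first-round PCR primers can be checked with a short script. The sketch tests whether the custom sequencing primer (SEQ. ID No. 5) coincides with the 3’ end of F1 (SEQ. ID No. 1) and whether the custom index sequencing primer (SEQ. ID No. 6) is the reverse complement of the 5’ portion of R1 (SEQ. ID No. 2). The helper `reverse_complement` and the variable names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Sequences as listed in the description, written 5' to 3'.
F1 = "CCGAGATCTACACTCTTTCCCTACACGACGGCGACCACCGAGAT"          # SEQ. ID No. 1
R1 = "GTGACTGGAGTTCAGACGTGTGCGCTGATGCTACGACCACAGG"           # SEQ. ID No. 2
SEQ_PRIMER = "ACACTCTTTCCCTACACGACGGCGACCACCGAGAT"            # SEQ. ID No. 5
INDEX_PRIMER = "TAGCATCAGCGCACACGTCTGAACTCCAGTCAC"            # SEQ. ID No. 6

# Does the custom sequencing primer match the 3' end of F1?
seq_primer_in_f1 = F1.endswith(SEQ_PRIMER)

# Is the custom index sequencing primer the reverse complement of the 5' end of R1?
index_primer_matches_r1 = R1.startswith(reverse_complement(INDEX_PRIMER))

print(seq_primer_in_f1, index_primer_matches_r1)
```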

DOCTRINE OF EQUIVALENTS

[0176] While the above description contains many specific embodiments of the invention, these should not be construed as limitations on the scope of the invention, but rather as an example of one embodiment thereof. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.

Claims

WHAT IS CLAIMED IS:

1. A method to synthesize a nucleic acid molecule, comprising:

applying a pair of RNA sequencing data sets in a computational model trained to identify differential splicing events between the pair of RNA sequencing data,

wherein the computational model is trained utilizing cis exon sequence features and RNA binding protein expression levels;

scoring and predicting differential splicing events between the pair of RNA sequencing data sets;

determining a posterior ratio of differential splicing events between the pair of RNA sequencing data sets via Bayesian hypothesis testing using the scored splicing events as a prior ratio and observed RNA sequencing read counts as the likelihood ratio, wherein the RNA sequencing read counts are derived from the RNA sequencing data sets; and unveiling at least one differential splicing event between the pair of RNA sequencing data; and

synthesizing a nucleic acid molecule comprising a sequence that spans across an exon-exon junction of the at least one differential splicing event, wherein the exon-exon junction is unveiled to be differentially expressed in between the pair RNA sequencing data.

2. The method of claim 1, wherein the nucleic acid molecule is a DNA molecule.

3. The method as in claim 2, wherein the nucleic acid molecule is synthesized via an in vitro synthesizer or a bacterial recombination system.

4. The method as in claim 2 or 3, wherein the synthetic nucleic acid molecule is inserted into a plasmid vector.

5. The method as in claim 2, 3, or 4, wherein the synthetic nucleic acid molecule is operably linked to a non-native promoter.

6. The method as in one of claims 2 to 5, wherein the synthetic nucleic acid is utilized within an expression system to produce a peptide.

7. The method as in claim 6, wherein the peptide is utilized to generate an antigen-binding molecule.

8. The method as in claim 7, wherein the antigen binding molecule is a polyclonal antibody or a monoclonal antibody.

9. The method as in claim 2 or 3, wherein a fluorescent probe, a small molecule drug, or an enzyme is covalently linked to the synthetic nucleic acid.

10. The method as in claim 2, 3, or 9, wherein the synthetic nucleic acid molecule is used to detect the exon-exon junction in a biological sample.

11. The method as in claim 1 further comprising performing a biological assay to detect at least one differential splicing event.

12. The method as in claim 11, wherein the assay is nucleic acid hybridization, nucleic acid proliferation, RNA sequencing, or an antibody detection assay.

13. The method as in any of the preceding claims, wherein each set of RNA sequencing data of the pair is derived from a different biological condition.

14. The method as in claim 13, wherein at least one biological condition is associated with a particular phenotype.

15. The method as in claim 14, wherein the particular phenotype is a disease state.

16. The method as in claim 13, wherein the pair of RNA sequencing data sets correspond to two different biological tissues, two different biological sources, two separate extractions on a temporal scale, or a test sample with a control sample.

17. The method as in any of the preceding claims, wherein the computational model is a neural network.

18. The method as in any of the preceding claims, wherein the computational model includes taking differential splicing formulated as:

P(Y_ik = 1) = F(Y_ik; E_i, G_k)

where Y_ik is the label for event i in the comparison k; Y_ik = 1 represents when the event is differentially spliced; E_i is a vector of a number of cis sequence features for exon i; and G_k is a vector of a number of normalized gene expression levels of a number of RNA-binding proteins in the two RNA sequencing data sets.

19. The method as in any of the preceding claims, wherein the computational model is trained utilizing a number of pairwise biological conditions.

20. The method as in claim 19, wherein the computational model is trained utilizing RNA-binding protein depleted cell lines and control cell lines.

21. The method as in any of the preceding claims, wherein the posterior ratio is calculated via:

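P(H₁ | I_ij, S_ij) / P(H₀ | I_ij, S_ij) = [P(I_ij, S_ij | H₁) / P(I_ij, S_ij | H₀)] × [P(H₁) / P(H₀)]

wherein: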

I_ij, S_ij are the exon inclusion read count and the exon skipping read count for exon i in sample group j ∈ {1,2}, respectively; and

P(H₁) is the prior probability of exon i being differentially spliced, determined by exon-specific cis features and sample-specific trans RNA-binding protein expression levels in the two RNA sequencing data sets, which is independent of the observed RNA-seq read counts on exon i. P(H₀) = 1 - P(H₁) is the prior probability of exon i being common. P(I_ij,S_ij|H₁) and P(I_ij,S_ij|H₀) represent the likelihoods under the model of differential splicing or common splicing respectively.

22. The method as in any of the preceding claims, wherein the unveiled splicing event occurs in a lowly expressed gene.

23. The method as in any of the preceding claims, wherein at least one differential splicing event is considered bona fide when the difference exceeds a user defined threshold with a high probability.

24. The method as in claim 23, wherein the high probability is determined by:

25. The method as in claim 23, wherein the threshold c is 1%, 2%, 5%, or 10%.

26. A method to unveil at least one differential splicing event between two RNA sequencing data sets, comprising:

determining a posterior ratio of differential splicing events between the pair of RNA sequencing data sets via Bayesian hypothesis testing using the scored splicing events as a prior ratio and observed RNA sequencing read counts as the likelihood ratio, wherein the RNA sequencing read counts are derived from the RNA sequencing data sets; and unveiling at least one differential splicing event between the pair of RNA sequencing data.