EP3834202A1 - Systems and methods for determining effects of therapies and genetic variation on polyadenylation site selection - Google Patents
Systems and methods for determining effects of therapies and genetic variation on polyadenylation site selection
Info
- Publication number
- EP3834202A1 (application EP19847830.7A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- polyadenylation
- candidate
- sequence
- preferences
- sites
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/11—DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
- C12N15/111—General methods applicable to biologically active non-coding nucleic acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/11—DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
- C12N15/113—Non-coding nucleic acids modulating the expression of genes, e.g. antisense oligonucleotides; Antisense DNA or RNA; Triplex- forming oligonucleotides; Catalytic nucleic acids, e.g. ribozymes; Nucleic acids used in co-suppression or gene silencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N2310/00—Structure or type of the nucleic acid
- C12N2310/10—Type of nucleic acid
- C12N2310/11—Antisense
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N2320/00—Applications; Uses
- C12N2320/10—Applications; Uses in screening processes
- C12N2320/11—Applications; Uses in screening processes for the determination of target sites, i.e. of active nucleic acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N2830/00—Vector systems having a special element relevant for transcription
- C12N2830/50—Vector systems having a special element relevant for transcription regulating RNA stability, not being an intron, e.g. poly A signal
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/106—Pharmacogenomics, i.e. genetic variability in individual responses to drugs and drug metabolism
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
Definitions
- Polyadenylation may be a mechanism responsible for regulating messenger ribonucleic acid (mRNA) function, stability, localization, and translation efficiency. As much as 70% of human genes may be subject to alternative polyadenylation (APA), and various mechanisms may influence its regulation.
- PAS polyadenylation site
- mRNA messenger ribonucleic acid
- APA alternative polyadenylation
- transcript or mRNA isoforms that may vary either in their coding sequences or in their 3’ untranslated region (3’-UTR) can be produced. Differential cleavage of transcripts can influence how they are regulated.
- the present disclosure provides a method for determining an effect of an antisense oligonucleotide on a plurality of candidate polyadenylation sites, the method comprising: (a) providing a plurality of genomic sequences, wherein the plurality of genomic sequences comprises (1) a reference sequence and (2) a variant sequence obtained by computer processing the reference sequence based on the antisense oligonucleotide, wherein the antisense oligonucleotide is complementary to at least a portion of the reference sequence; (b) for each of the plurality of genomic sequences: identifying a plurality of candidate polyadenylation sites in the genomic sequence; extracting a polyadenylation feature vector for each of the plurality of candidate polyadenylation sites, wherein each of the polyadenylation feature vectors comprises one or more features determined at least based on one or more nucleotides in the genomic sequence; and applying a trained algorithm to the plurality of polyadenylation feature vectors to calculate a
- calculating the set of preferences for each of the plurality of genomic sequences comprises, for each of the plurality of candidate polyadenylation sites, computer processing by a first computation module the plurality of polyadenylation feature vectors of the genomic sequence to calculate an intermediate representation $r_i$ for an $i$-th candidate polyadenylation site, the intermediate representation comprising at least one numerical value; and computer processing by a second computation module the set of intermediate representations $r_1, r_2, \ldots, r_n$ for the plurality of candidate polyadenylation sites to calculate the set of preferences $p_1, p_2, \ldots, p_n$ corresponding to the plurality of candidate polyadenylation sites.
- the reference sequence is (i) derived from a human genome, (ii) obtained by sequencing deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) of a bodily sample obtained from a subject, or (iii) a genetic aberration thereof.
- the genetic aberration comprises a single nucleotide variant (SNV) or an insertion or deletion (indel).
- At least one of the plurality of polyadenylation feature vectors comprises a feature determined at least based on one or more nucleotides in the genomic sequence, wherein the at least one of the one or more nucleotides is located within about 100 nucleotides of the location in the genomic sequence of the candidate polyadenylation site.
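One way to read the "within about 100 nucleotides" condition is as a fixed window around each candidate site. The following is a minimal sketch under that reading; the function name `flanking_window`, the 0-based coordinate convention, and the symmetric window are illustrative assumptions, not details given in the disclosure.

```python
def flanking_window(sequence, site, flank=100):
    """Return the nucleotides within `flank` positions of a candidate site.

    `site` is a 0-based index into `sequence` (an assumed convention).
    The window is clipped at the sequence boundaries.
    """
    start = max(0, site - flank)
    end = min(len(sequence), site + flank + 1)
    return sequence[start:end]


# Toy 400-nt sequence; a real input would be a genomic sequence.
toy_sequence = "ACGT" * 100
central = flanking_window(toy_sequence, site=200)
near_start = flanking_window(toy_sequence, site=10)
```

Features for a site would then be computed from the returned window rather than from the whole sequence.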
- each of the plurality of polyadenylation feature vectors comprises one or more of: (a) a subsequence of the genomic sequence encoded using a 1-of-4 binary vector for a nucleotide selected from adenine (A), thymine (T), cytosine (C), and guanine (G); (b) a subsequence of the genomic sequence encoded using a 1-of-4 binary vector for a nucleotide selected from adenine (A), uracil (U), cytosine (C), and guanine (G); (c) a set of binary components; (d) a set of categorical components; (e) a set of integer components; and (f) a set of real-valued components.
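The 1-of-4 binary encoding in items (a) and (b) can be sketched as follows. The channel order (A, C, G, T or A, C, G, U) and the function name `one_hot` are assumptions made for illustration; the disclosure does not fix an ordering.

```python
# Assumed channel order; the source does not specify one.
DNA_ALPHABET = "ACGT"
RNA_ALPHABET = "ACGU"


def one_hot(subsequence, alphabet=DNA_ALPHABET):
    """Encode each nucleotide as a 1-of-4 binary vector."""
    encoded = []
    for base in subsequence.upper():
        if base not in alphabet:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        encoded.append([1 if base == letter else 0 for letter in alphabet])
    return encoded


# The same hexamer written as DNA and as RNA.
dna = one_hot("AATAAA")
rna = one_hot("AAUAAA", alphabet=RNA_ALPHABET)
```

With these orderings the DNA and RNA encodings of the same subsequence coincide, since T and U occupy the same channel.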
- At least one of the set of binary components comprises a value indicative of the presence of a cleavage factor sequence in the candidate polyadenylation site, or a value indicative of the absence of a cleavage factor sequence in the candidate polyadenylation site. In some embodiments, at least one of the set of binary components comprises a value indicative of the presence of a cleavage factor sequence adjacent to the candidate polyadenylation site or a value indicative of the absence of a cleavage factor sequence adjacent to the candidate polyadenylation site.
- At least one of the set of real-valued components comprises a log distance, in number of nucleotides in the genomic sequence, from (1) the candidate polyadenylation site to (2) a nearest different candidate polyadenylation site among the plurality of candidate polyadenylation sites.
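The log-distance feature can be illustrated with a short sketch. It assumes that candidate sites are given as distinct integer positions and that the natural logarithm is used; the disclosure does not state the log base, and the function name is hypothetical.

```python
import math


def log_distance_to_nearest(site_positions):
    """For each site, the log of the distance (in nucleotides) to the
    nearest other candidate site. Positions are assumed to be distinct."""
    if len(site_positions) < 2:
        raise ValueError("need at least two candidate sites")
    features = []
    for i, pos in enumerate(site_positions):
        nearest = min(
            abs(pos - other)
            for j, other in enumerate(site_positions)
            if j != i
        )
        features.append(math.log(nearest))  # natural log is an assumption
    return features


# Hypothetical site coordinates within one genomic sequence.
log_distances = log_distance_to_nearest([100, 350, 400])
```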
- the at least one of the plurality of polyadenylation feature vectors comprises a feature selected from the group listed in
- the method further comprises identifying, for at least one of the plurality of genomic sequences, a maximally preferred candidate polyadenylation site among the plurality of candidate polyadenylation sites, wherein the maximally preferred candidate polyadenylation site has a largest numerical value $r_{\max}$ among the set of intermediate representations $r_1, r_2, \ldots, r_n$.
- the method further comprises, for at least one of the plurality of genomic sequences, identifying a maximally preferred candidate
- the maximally preferred candidate polyadenylation site has a largest numerical value $p_{\max}$ among the set of preferences $p_1, p_2, \ldots, p_n$.
- calculating the set of preferences comprises providing a set of numerical parameters, and calculating a multiplication product comprising at least one feature from at least one of the plurality of polyadenylation feature vectors and at least one numerical parameter of the set of numerical parameters.
- the method further comprises applying a machine learning algorithm to the plurality of polyadenylation feature vectors to calculate the set of preferences, the machine learning algorithm comprising adjusting at least one numerical parameter of the set of numerical parameters to decrease a loss function.
- adjusting the at least one numerical parameter of the set of numerical parameters comprises performing a gradient-based learning procedure.
- the gradient-based learning procedure comprises stochastic gradient descent.
- the gradient-based learning procedure comprises stochastic gradient descent with momentum and dropout.
- the loss function comprises a cross entropy function.
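The parameter-adjustment steps above (a product of features and numerical parameters, a cross entropy loss, and stochastic gradient descent with momentum) can be sketched on a toy problem. This is an illustration only: a linear scorer stands in for the trained algorithm, dropout is omitted, and the feature values, target preferences, learning rate, and momentum are all made-up assumptions.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def preferences(weights, features):
    """Score each site as a product of features and parameters, then normalize."""
    scores = [sum(w * x for w, x in zip(weights, f)) for f in features]
    return softmax(scores)


def cross_entropy(predicted, target):
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def sgd_momentum_step(weights, velocity, features, target, lr=0.1, momentum=0.9):
    """One update. For softmax with cross entropy, the gradient with respect
    to each weight is the sum over sites of (p_i - t_i) * x_ik."""
    p = preferences(weights, features)
    grad = [
        sum((p[i] - target[i]) * features[i][k] for i in range(len(features)))
        for k in range(len(weights))
    ]
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity


# Toy data: three candidate sites, two features each (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [0.7, 0.2, 0.1]  # hypothetical measured site usage
weights, velocity = [0.0, 0.0], [0.0, 0.0]

loss_before = cross_entropy(preferences(weights, features), target)
for _ in range(200):
    weights, velocity = sgd_momentum_step(weights, velocity, features, target)
loss_after = cross_entropy(preferences(weights, features), target)
```

In a full implementation each step would use a randomly drawn mini-batch of genes; here a single example is reused so the sketch stays short.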
- a sum of the set of preferences $p_1, p_2, \ldots, p_n$ equals 1.
- each preference $p_i$ among the set of preferences $p_1, p_2, \ldots, p_n$ indicates a probability of selection of an $i$-th candidate polyadenylation site among the plurality of candidate polyadenylation sites.
- the first computation module comprises a convolutional neural network, which convolutional neural network is configured to process the plurality of polyadenylation feature vectors to generate the set of intermediate representations $r_1, r_2, \ldots, r_n$ for the plurality of candidate polyadenylation sites.
- the intermediate representation for the $i$-th candidate polyadenylation site comprises a numerical value $r_i$, and wherein the second computation module is configured to apply a softmax function to the set of intermediate representations $r_1, r_2, \ldots, r_n$ for the plurality of candidate
- the intermediate representation for the $i$-th candidate polyadenylation site comprises a numerical value $r_i$.
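The softmax step that turns intermediate representations into preferences can be sketched as below. The input values are made up; in the described method they would come from the first computation module.

```python
import math


def softmax(representations):
    """Map intermediate representations r_1..r_n to preferences p_1..p_n."""
    shift = max(representations)  # subtract the max for numerical stability
    exps = [math.exp(r - shift) for r in representations]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical intermediate representations for three candidate sites.
r = [2.0, 0.5, -1.0]
p = softmax(r)

# Index of the maximally preferred candidate site.
best_by_p = max(range(len(p)), key=p.__getitem__)
best_by_r = max(range(len(r)), key=r.__getitem__)
```

Because softmax is monotonic, the site with the largest intermediate representation is also the site with the largest preference, and the preferences sum to 1 so each can be read as a selection probability.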
- relu is a rectified linear function.
- a one-to-one correspondence exists between one or more of the plurality of candidate polyadenylation sites of the reference sequence and one or more of the plurality of candidate polyadenylation sites of the variant sequence
- processing the plurality of sets of preferences comprises comparing each of at least one preference in the set of preferences of the reference sequence to the corresponding preference in the set of preferences of the variant sequence which is in one-to-one correspondence.
- (c) further comprises calculating a set of changes in preference $\Delta p_1, \Delta p_2, \ldots, \Delta p_n$ corresponding to the plurality of candidate polyadenylation sites of the reference sequence and the plurality of candidate polyadenylation sites of the variant sequence to determine the effect of the antisense oligonucleotide.
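The comparison of corresponding preferences can be sketched as a per-site difference. The sign convention (variant minus reference), the tolerance, and the example values are assumptions made for illustration.

```python
def preference_changes(p_reference, p_variant):
    """Per-site change in preference, variant minus reference.

    The two lists are assumed to be in one-to-one correspondence."""
    if len(p_reference) != len(p_variant):
        raise ValueError("sites must be in one-to-one correspondence")
    return [pv - pr for pr, pv in zip(p_reference, p_variant)]


def describe(delta, tol=1e-9):
    if delta > tol:
        return "increased"
    if delta < -tol:
        return "decreased"
    return "unchanged"


# Hypothetical preferences for three corresponding candidate sites.
p_ref = [0.6, 0.3, 0.1]
p_var = [0.2, 0.6, 0.2]
deltas = preference_changes(p_ref, p_var)
effects = [describe(d) for d in deltas]
```

When both preference sets sum to 1, the changes sum to zero, so a decreased preference at one site is balanced by increases elsewhere.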
- the variant sequence obtained by computer processing the reference sequence based on the antisense oligonucleotide is obtained by replacing one or more nucleotides of the at least the portion of the reference sequence with an N base, a uniform weighting of the 4 bases, or randomly selected bases.
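The construction of the variant sequence can be sketched by masking the region the antisense oligonucleotide would bind. This toy version assumes a DNA alphabet, an exact match to the reverse complement of the oligonucleotide, and a single binding site; the helper names and the 6-nt example oligonucleotide are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# "N" is encoded as a uniform weighting of the 4 bases (channel order A, C, G, T).
ENCODING = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0.25, 0.25, 0.25, 0.25],
}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def mask_aso_target(reference, aso):
    """Replace the portion of the reference bound by the ASO with N bases."""
    target = reverse_complement(aso)
    start = reference.find(target)
    if start < 0:
        raise ValueError("ASO is not complementary to any portion of the reference")
    end = start + len(target)
    return reference[:start] + "N" * len(target) + reference[end:], (start, end)


def encode(seq):
    return [ENCODING[base] for base in seq]


# Toy reference containing an AATAAA hexamer; the toy ASO targets it.
reference = "GGCCAATAAAGTTC"
aso = reverse_complement("AATAAA")
variant, span = mask_aso_target(reference, aso)
variant_encoded = encode(variant)
```

The reference and the masked variant would then each go through site identification, feature extraction, and preference calculation so that the two sets of preferences can be compared.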
- the at least the portion of the reference sequence is within about 100 nucleotides of at least one of the plurality of candidate polyadenylation sites.
- the method further comprises applying the trained algorithm to a plurality of polyadenylation feature vectors indicative of a relative positioning of the plurality of candidate polyadenylation sites to calculate the set of preferences.
- the method further comprises administering a therapeutically effective amount of the antisense oligonucleotide to the subject based at least in part on the determined effect of the antisense oligonucleotide.
- the determined effect of the antisense oligonucleotide comprises a decreased preference for one or more of the plurality of candidate polyadenylation sites.
- the determined effect of the antisense oligonucleotide comprises an increased preference for one or more of the plurality of candidate polyadenylation sites.
- the administered therapeutically effective amount of the antisense oligonucleotide modulates polyadenylation of at least one of the plurality of candidate polyadenylation sites in the subject.
- the antisense oligonucleotide has a length of about 10 to about 50 nucleotides.
- the present disclosure provides a computer system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application for determining an effect of an antisense oligonucleotide on a plurality of candidate polyadenylation sites, the application comprising: a sequence module programmed to provide a plurality of genomic sequences, wherein the plurality of genomic sequences comprises (1) a reference sequence and (2) a variant sequence obtained by computer processing the reference sequence based on the antisense oligonucleotide, wherein the antisense oligonucleotide is complementary to at least a portion of the reference sequence; an identification module programmed to, for each of the plurality of genomic sequences, identify a plurality of candidate polyadenylation sites in the genomic sequence; a feature extraction module programmed to, for each of the plurality of genomic sequences, extract a polyadenylation feature vector for each
- $p_n$ corresponding to the plurality of candidate polyadenylation sites; and a processing module programmed to process the plurality of sets of preferences for each of the plurality of genomic sequences with each other to determine the effect of the antisense oligonucleotide.
- the present disclosure provides a non-transitory computer-readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for determining an effect of an antisense oligonucleotide on a plurality of candidate polyadenylation sites, the method comprising: (a) providing a plurality of genomic sequences, wherein the plurality of genomic sequences comprises (1) a reference sequence and (2) a variant sequence obtained by computer processing the reference sequence based on the antisense oligonucleotide, wherein the antisense oligonucleotide is complementary to at least a portion of the reference sequence; and (b) for each of the plurality of genomic sequences: identifying a plurality of candidate polyadenylation sites in the genomic sequence; extracting a polyadenylation feature vector for each of the plurality of candidate polyadenylation sites, wherein each of the polyadenylation feature vectors comprises one or more features determined at least based on one or
- the present disclosure provides a system for determining an effect of an antisense oligonucleotide on a plurality of candidate polyadenylation sites, the system comprising: a database comprising a plurality of genomic sequences generated from
- deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) molecules wherein the plurality of genomic sequences comprises (1) a reference sequence and (2) a variant sequence obtained by computer processing the reference sequence based on the antisense oligonucleotide, wherein the antisense oligonucleotide is complementary to at least a portion of the reference sequence; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (a) for each of the plurality of genomic sequences, identify a plurality of candidate polyadenylation sites in the genomic sequence; (b) for each of the plurality of genomic sequences, extract a polyadenylation feature vector for each of the plurality of candidate polyadenylation sites, wherein each of the polyadenylation feature vectors comprises one or more features determined at least based on one or more nucleotides in the genomic sequence; (c) for each of the plurality of genomic sequences, apply a trained algorithm
- the present disclosure provides a method for identifying tissue-specific polyadenylation features, the method comprising: (a) providing a set of genomic sequences; (b) for each of the set of genomic sequences: identifying a plurality of candidate polyadenylation sites in the genomic sequence; extracting a polyadenylation feature vector for each of the plurality of candidate polyadenylation sites, wherein each of the polyadenylation feature vectors comprises one or more features determined at least based on one or more nucleotides in the genomic sequence; and applying a trained algorithm to the plurality of polyadenylation feature vectors to calculate a set of preferences $p_1, p_2, \ldots, p_n$ for the plurality of candidate polyadenylation sites; and (c) computer processing the set of preferences for each of the set of genomic sequences to identify the tissue-specific polyadenylation features.
- calculating the set of preferences for each of the set of genomic sequences comprises, for each of the plurality of candidate polyadenylation sites, computer processing by a first computation module the plurality of polyadenylation feature vectors of the genomic sequence to calculate an intermediate representation $r_i$ for an $i$-th candidate polyadenylation site, the intermediate representation comprising at least one numerical value; and computer processing by a second computation module the set of intermediate
- the reference sequence is (i) derived from a human genome, (ii) obtained by sequencing deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) of a bodily sample obtained from a subject, or (iii) a genetic aberration thereof.
- the genetic aberration comprises a single nucleotide variant (SNV) or an insertion or deletion (indel).
- At least one of the plurality of polyadenylation feature vectors comprises a feature determined at least based on one or more nucleotides in the genomic sequence, wherein the at least one of the one or more nucleotides is located within about 100 nucleotides of the location in the genomic sequence of the candidate polyadenylation site.
- each of the plurality of polyadenylation feature vectors comprises one or more of: (a) a subsequence of the genomic sequence encoded using a 1-of-4 binary vector for a nucleotide selected from adenine (A), thymine (T), cytosine (C), and guanine (G); (b) a subsequence of the genomic sequence encoded using a 1-of-4 binary vector for a nucleotide selected from adenine (A), uracil (U), cytosine (C), and guanine (G); (c) a set of binary components; (d) a set of categorical components; (e) a set of integer components; and (f) a set of real-valued components.
- At least one of the set of binary components comprises a value indicative of the presence of a cleavage factor sequence in the candidate polyadenylation site, or a value indicative of the absence of a cleavage factor sequence in the candidate polyadenylation site. In some embodiments, at least one of the set of binary components comprises a value indicative of the presence of a cleavage factor sequence adjacent to the candidate polyadenylation site or a value indicative of the absence of a cleavage factor sequence adjacent to the candidate polyadenylation site.
- At least one of the set of real-valued components comprises a log distance, in number of nucleotides in the genomic sequence, from (1) the candidate polyadenylation site to (2) a nearest different candidate polyadenylation site among the plurality of candidate polyadenylation sites.
- the at least one of the plurality of polyadenylation feature vectors comprises a feature selected from the group listed in
- the method further comprises identifying, for at least one of the plurality of genomic sequences, a maximally preferred candidate polyadenylation site among the plurality of candidate polyadenylation sites, wherein the maximally preferred candidate polyadenylation site has a largest numerical value $r_{\max}$ among the set of intermediate representations $r_1, r_2, \ldots, r_n$.
- the method further comprises, for at least one of the plurality of genomic sequences, identifying a maximally preferred candidate
- calculating the set of preferences comprises providing a set of numerical parameters, and calculating a multiplication product comprising at least one feature from at least one of the plurality of polyadenylation feature vectors and at least one numerical parameter of the set of numerical parameters.
- the method further comprises applying a machine learning algorithm to the plurality of polyadenylation feature vectors to calculate the set of preferences, the machine learning algorithm comprising adjusting at least one numerical parameter of the set of numerical parameters to decrease a loss function.
- adjusting the at least one numerical parameter of the set of numerical parameters comprises performing a gradient-based learning procedure.
- the gradient-based learning procedure comprises stochastic gradient descent.
- the gradient-based learning procedure comprises stochastic gradient descent with momentum and dropout.
- the loss function comprises a cross entropy function.
- a sum of the set of preferences $p_1, p_2, \ldots, p_n$ equals 1.
- each preference $p_i$ among the set of preferences $p_1, p_2, \ldots, p_n$ indicates a probability of selection of an $i$-th candidate polyadenylation site among the plurality of candidate polyadenylation sites.
- the first computation module comprises a convolutional neural network, which convolutional neural network is configured to process the plurality of polyadenylation feature vectors to generate the set of intermediate representations $r_1, r_2, \ldots, r_n$ for the plurality of candidate polyadenylation sites.
- the intermediate representation for the $i$-th candidate polyadenylation site comprises a numerical value $r_i$, and wherein the second computation module is configured to apply a softmax function to the set of intermediate representations $r_1, r_2, \ldots, r_n$ for the plurality of candidate polyadenylation sites to calculate the set of preferences $p_1, p_2, \ldots, p_n$ for the plurality of candidate polyadenylation sites.
- the second computation module is configured to calculate each preference $p_i$ of the set of
- relu is a rectified linear function.
- the method further comprises applying the trained algorithm to a plurality of polyadenylation feature vectors indicative of a relative positioning of the plurality of candidate polyadenylation sites to calculate the set of preferences.
- (c) further comprises, for each of the set of genomic sequences, for each of the plurality of candidate polyadenylation sites, computing a gradient of the set of preferences generated by the convolutional neural network with respect to the features of the polyadenylation feature vector of the candidate polyadenylation site, thereby generating a plurality of feature saliency values of the features of the polyadenylation feature vector of the candidate polyadenylation site.
- the method further comprises, for each of the set of genomic sequences, for each of the plurality of candidate polyadenylation sites, sorting the features of the polyadenylation feature vector of the candidate polyadenylation site based at least in part on the feature saliency values of the features of the polyadenylation feature vector.
- the method further comprises, for each of the set of genomic sequences, for each of the plurality of candidate polyadenylation sites, classifying the features of the polyadenylation feature vector of the candidate polyadenylation site as increasing or decreasing a strength of a polyadenylation site, based at least in part on the feature saliency values of the features of the polyadenylation feature vector of the candidate polyadenylation site.
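The saliency, sorting, and classification steps above can be sketched with a numerical gradient. Here a toy linear scorer with softmax stands in for the convolutional neural network, and central finite differences stand in for backpropagation; the weights, feature values, and function names are assumptions for illustration only.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(features, weights=(1.5, -0.5, 0.0)):
    """Stand-in for the trained network: linear scores followed by softmax."""
    return softmax([sum(w * x for w, x in zip(weights, f)) for f in features])


def saliency(model, features, site, eps=1e-5):
    """Gradient of the preference for `site` with respect to that site's
    features, estimated by central finite differences."""
    values = []
    for k in range(len(features[site])):
        plus = [list(f) for f in features]
        minus = [list(f) for f in features]
        plus[site][k] += eps
        minus[site][k] -= eps
        values.append((model(plus)[site] - model(minus)[site]) / (2 * eps))
    return values


# Two candidate sites with three features each (hypothetical values).
features = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
s = saliency(toy_model, features, site=0)

# Sort features by saliency magnitude, then classify by sign.
ranked = sorted(range(len(s)), key=lambda k: abs(s[k]), reverse=True)
directions = [
    "increasing" if v > 0 else "decreasing" if v < 0 else "neutral" for v in s
]
```

A positive saliency value means that raising the feature raises the predicted preference for the site, which is one way to read "increasing a strength of a polyadenylation site".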
- (c) further comprises, for each of the set of genomic sequences, for each of the plurality of candidate polyadenylation sites, identifying a feature of the polyadenylation feature vector of the candidate polyadenylation site as a tissue-specific polyadenylation feature based at least in part on whether the feature has a feature saliency value that meets a predetermined criterion.
- the plurality of candidate polyadenylation sites comprises a plurality of tissue-specific polyadenylation sites and a plurality of constitutive polyadenylation sites.
- the predetermined criterion is a feature saliency value having a statistically greater effect on tissue-specific polyadenylation sites compared to constitutive polyadenylation sites, that meets a predetermined threshold.
- the predetermined threshold is that the feature saliency value has a statistically greater effect on the plurality of tissue-specific polyadenylation sites compared to the plurality of constitutive polyadenylation sites, with a P-value of at most a P-value threshold equal to 0.05 divided by a number of features of the polyadenylation feature vector of the candidate polyadenylation site.
- the predetermined threshold is that the feature saliency value has a statistically greater effect on the plurality of tissue-specific polyadenylation sites compared to the plurality of constitutive polyadenylation sites, with a P-value of at most a P-value threshold equal to 0.03 divided by a number of features of the polyadenylation feature vector of the candidate polyadenylation site.
- the predetermined threshold is that the feature saliency value has a statistically greater effect on the plurality of tissue-specific polyadenylation sites compared to the plurality of constitutive polyadenylation sites, with a P-value of at most a P-value threshold equal to 0.01 divided by a number of features of the polyadenylation feature vector of the candidate polyadenylation site.
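The thresholds above divide a significance level (0.05, 0.03, or 0.01) by the number of features, which corresponds to a Bonferroni-style correction. A minimal sketch follows; the function names and the example P-values are hypothetical and only illustrate the arithmetic.

```python
def bonferroni_threshold(alpha, num_features):
    """Per-feature P-value cutoff: alpha divided by the number of features."""
    if num_features <= 0:
        raise ValueError("num_features must be positive")
    return alpha / num_features


def significant_features(p_values, alpha=0.05):
    """Return indices of features whose P-value is at most alpha / len(p_values)."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p <= cutoff]


# Hypothetical P-values for four features (illustrative only).
example_p_values = [0.001, 0.02, 0.0124, 0.3]
selected = significant_features(example_p_values, alpha=0.05)
```

With four features and a level of 0.05, the cutoff is 0.0125, so only the first and third features in this example meet the criterion.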
- the method further comprises, for each of the set of genomic sequences, for each of the plurality of candidate polyadenylation sites, classifying the features of the polyadenylation feature vector of the candidate polyadenylation site as increasing or decreasing a likelihood of a polyadenylation site to be tissue-specific, based at least in part on the feature saliency values of the features of the polyadenylation feature vector of the candidate
- the method further comprises processing the plurality of feature saliency values to generate a feature saliency map.
- the method further comprises identifying one or more genomic regions for polyadenylation-targeted therapeutic purposes based at least in part on the feature saliency map.
- the one or more genomic regions for polyadenylation-targeted therapeutic purposes comprise one or more of: an oligonucleotide targeted Type 1 Poly(A) signal, a location of a Type 4 Poly(A) signal, and an oligonucleotide Type 2 Poly(A) signal.
- the present disclosure provides a computer system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application for identifying tissue-specific polyadenylation features, the application comprising: a sequence module programmed to provide a set of genomic sequences; an identification module programmed to, for each of the set of genomic sequences, identify a plurality of candidate polyadenylation sites in the genomic sequence; a feature extraction module programmed to, for each of the set of genomic sequences, extract a polyadenylation feature vector for each of the plurality of candidate polyadenylation sites, wherein each of the polyadenylation feature vectors comprises one or more features determined at least based on one or more nucleotides in the genomic sequence; a preference computation module programmed to, for each of the plurality of genomic sequences, apply a trained algorithm to the plurality of polyadenylation feature vectors to calculate a set
- the present disclosure provides a non-transitory computer-readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for identifying tissue-specific polyadenylation features, the method comprising: (a) providing a set of genomic sequences; (b) for each of the set of genomic sequences: identifying a plurality of candidate polyadenylation sites in the genomic sequence; extracting a polyadenylation feature vector for each of the plurality of candidate polyadenylation sites, wherein each of the polyadenylation feature vectors comprises one or more features determined at least based on one or more nucleotides in the genomic sequence; and applying a trained algorithm to the plurality of polyadenylation feature vectors to calculate a set of preferences p_1, p_2, ..., p_n for the plurality of candidate polyadenylation sites; and (c) processing the set of preferences for each of the set of genomic sequences to identify the tissue-specific poly
- the present disclosure provides a system for identifying tissue-specific polyadenylation features, the system comprising: a database comprising a set of genomic sequences generated from deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) molecules; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (a) for each of the set of genomic sequences, identify a plurality of candidate polyadenylation sites in the genomic sequence; (b) for each of the set of genomic sequences, extract a polyadenylation feature vector for each of the plurality of candidate polyadenylation sites, wherein each of the polyadenylation feature vectors comprises one or more features determined at least based on one or more nucleotides in the genomic sequence; (c) for each of the set of genomic sequences, apply a trained algorithm to the plurality of polyadenylation feature vectors to calculate a set of preferences p_1,
- the present disclosure provides a method for determining an effect of an antisense oligonucleotide on a plurality of candidate polyadenylation sites, comprising processing a sequence of the antisense oligonucleotide to obtain a change in preference corresponding to each of the plurality of candidate polyadenylation sites, to identify at least one of the plurality of candidate polyadenylation sites as having a change in preference that meets a threshold.
- processing the sequence of the antisense oligonucleotide comprises: (i) providing (1) a reference sequence and (2) a variant sequence obtained by computer processing the reference sequence based on the antisense oligonucleotide, wherein the antisense oligonucleotide is complementary to at least a portion of the reference sequence; (ii) using a trained algorithm to calculate (1) a first set of preferences for a plurality of candidate polyadenylation sites of the reference sequence and (2) a second set of preferences for a plurality of candidate polyadenylation sites of the variant sequence; and (iii) computer processing the first set of preferences with the second set of preferences to obtain the plurality of changes in preference.
- (iii) further comprises calculating a set of changes in preference Δp_1, Δp_2, ..., Δp_n corresponding to the plurality of candidate polyadenylation sites of the reference sequence and the plurality of candidate polyadenylation sites of the variant sequence.
- the variant sequence obtained by computer processing the reference sequence based on the antisense oligonucleotide is obtained by replacing one or more nucleotides of the at least the portion of the reference sequence with an N base, a uniform weighting of the 4 bases, or randomly selected bases.
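The antisense oligonucleotide steps above can be sketched as follows: locate the region of the reference that the oligonucleotide is complementary to, replace it with a uniform weighting of the 4 bases, and compare preferences before and after. This is a minimal sketch under stated assumptions: a DNA alphabet, exact complementarity, and hypothetical preference values standing in for the output of the trained algorithm. None of the names come from the source.

```python
import numpy as np

BASES = "ACGT"  # DNA alphabet; an assumption for this sketch
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases get 0.25 each."""
    encoded = np.full((len(seq), 4), 0.25)
    for i, base in enumerate(seq):
        if base in BASES:
            encoded[i] = 0.0
            encoded[i, BASES.index(base)] = 1.0
    return encoded


def mask_aso_target(reference, aso):
    """Build the variant encoding by masking the region the oligonucleotide binds."""
    target = reverse_complement(aso)
    start = reference.find(target)
    if start < 0:
        raise ValueError("ASO is not complementary to any part of the reference")
    variant = one_hot(reference)
    variant[start:start + len(target)] = 0.25  # uniform weighting of the 4 bases
    return variant, start


def preference_changes(ref_preferences, var_preferences, threshold):
    """Return per-site changes and indices whose absolute change meets the threshold."""
    delta = np.asarray(var_preferences, float) - np.asarray(ref_preferences, float)
    flagged = np.flatnonzero(np.abs(delta) >= threshold)
    return delta, flagged


reference = "GGCATTAAAGCTTGCA"  # hypothetical reference fragment
aso = "GCTTTAAT"                # hypothetical oligonucleotide, complementary to ATTAAAGC
variant, start = mask_aso_target(reference, aso)

# Hypothetical preferences standing in for the trained algorithm's output.
delta, flagged = preference_changes([0.7, 0.3], [0.4, 0.6], threshold=0.2)
```

In practice the two preference sets would come from applying the same trained algorithm to the reference encoding and to the masked variant encoding.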
- the reference sequence is (i) derived from a human genome, (ii) obtained by sequencing deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) of a bodily sample obtained from a subject, or (iii) a genetic aberration thereof.
- the method further comprises administering a therapeutically effective amount of the antisense oligonucleotide to the subject based at least in part on the identified at least one of the plurality of candidate polyadenylation sites.
- the trained algorithm comprises a machine learning algorithm.
- the antisense oligonucleotide has a length of about 10 to about 50 nucleotides.
- the method further comprises determining a tissue-specific effect of the antisense oligonucleotide based at least in part on whether a plurality of polyadenylation feature vectors of the plurality of candidate polyadenylation sites comprises tissue-specific polyadenylation features.
- the method further comprises determining the tissue-specific effect of the antisense oligonucleotide with a P-value of at most about 0.05.
- the method further comprises determining the tissue-specific effect of the antisense oligonucleotide with a P-value of at most about 0.03.
- the method further comprises determining the tissue-specific effect of the antisense oligonucleotide with a P-value of at most about 0.01. In some embodiments, the method further comprises determining the tissue-specific effect of the antisense oligonucleotide based at least in part on whether the plurality of polyadenylation feature vectors of the plurality of candidate polyadenylation sites comprises one or more tissue-specific polyadenylation features selected from the group listed in Table 5.
- FIG. 1 illustrates a schematic of the components of the neural network that represent the polyadenylation model (left) and a comparison of two architectures for the sequence model, a convolutional neural network that operates directly on sequences and a fully-connected neural network that takes in a feature vector processed by a feature extraction pipeline (right).
- FIG. 2 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
- FIGs. 3A and 3B illustrate classification performance of ClinVar variants near polyadenylation sites.
- FIG. 4 illustrates a mutation map of the genomic region chr11:5,246,678-5,246,777.
- FIG. 5 illustrates an example of predicting the effect of an antisense oligonucleotide experiment.
- FIG. 6 illustrates a saliency map from the Conv-Net of a section of oligo-targeted mRNA.
- FIG. 7 illustrates regions around a polyadenylation site where features are extracted.
- FIG. 8 illustrates an example application of scanning the Conv-Net model across a section of the human genome to identify potential polyadenylation sites.
- FIG. 9 illustrates positive and negative regions for PAS discovery evaluation.
- FIG. 10 illustrates example filters learned by a convolutional neural network.
- polyadenylation site generally refers to a site in a genome that may be involved in a polyadenylation procedure (e.g., cleavage of a precursor messenger ribonucleic acid (RNA), followed by the addition of a poly(A) tail to form a mature messenger RNA (mRNA)).
- pre-mRNA may have a corresponding set of one or more polyadenylation sites, each of which is capable of being subjected to cleavage and polyadenylation.
- sample generally refers to a biological sample.
- a sample may be a fluid or tissue sample.
- the sample nucleic acid molecules may be deoxyribonucleic acid (DNA) molecules, RNA molecules, or both.
- the sample may be a tissue sample.
- the sample may be plasma, serum or blood (e.g., whole blood sample).
- the sample may be a cell-free sample (e.g., cell-free DNA, cfDNA).
- a bodily sample may be derived from any organ, tissue or biological fluid.
- a bodily sample can comprise, for example, a bodily fluid or a solid tissue sample.
- An example of a solid tissue sample is a tumor sample, e.g., from a solid tumor biopsy.
- Bodily fluids may include, for example, blood, serum, plasma, tumor cells, saliva, urine, lymphatic fluid, prostatic fluid, seminal fluid, milk, sputum, stool, tears, and derivatives of these.
- the term “sequencing read,” as used herein, generally refers to a sequence generated by a nucleic acid sequencer.
- the sequence may be in digital form, such as a digital sequence stored in computer memory.
- the nucleic acid sequencer may be a massively parallel array sequencer (e.g., Illumina, Ion Torrent, Pacific Biosciences of California, etc.) or single molecule sequencer (e.g., Oxford Nanopore).
- the nucleic acid sequencer may be a high throughput sequencer.
- Polyadenylation is a mechanism that may occur within human cells.
- a poly(A) tail may be added to a messenger RNA (mRNA).
- the detection of polyadenylation sites depends on patterns within an mRNA sequence and corresponding patterns within a DNA sequence. Genetic variation in a nucleic acid (e.g., mutations in a DNA sequence or an mRNA sequence) can disrupt these patterns. If one or more nucleotides are mutated in a nucleic acid (e.g., an mRNA strand or a DNA strand), the effect of this genetic variation may cause polyadenylation to occur at a different polyadenylation site.
- This effect may result in functional consequences, such as one or more phenotype changes leading to a disease or acting as contributing factors in a disease.
- Polyadenylation may be a mechanism responsible for regulating mRNA function, stability, localization, and translation efficiency. As much as 70% of human genes may be subject to alternative polyadenylation (APA), and widespread mechanisms may influence its regulation.
- PAS polyadenylation site
- different transcript isoforms that vary either in their coding sequences or in their 3’ untranslated region (3’-UTR) can be produced.
- Differential cleavage of transcripts can influence how they are regulated. For example, longer variants can harbor additional destabilization elements that alter a transcript’s stability, and shortened variants can escape regulation by microRNAs, which has been observed in various cancers.
- APA can be tissue-dependent, so a single gene can generate different transcripts, for instance, based on the tissue in which it is expressed.
- One mechanism of APA regulation may occur at the level of the sequences of the transcript.
- the presence or absence of certain regulatory elements can influence which PAS is selected.
- PAS selection may also be influenced by a site’s position relative to other sites.
- a computational model that can accurately predict how polyadenylation is affected by genomic features as well as cellular context may be highly desirable to understand this widespread phenomenon.
- several inherited diseases have been linked to errors in 3’-end processing. Such a model may enable the exploration of the effects of genetic variations on polyadenylation and their implications for disease.
- the present disclosure provides systems and methods for determining effects of genetic variants on selection of polyadenylation sites during polyadenylation processes.
- the present disclosure provides a polyadenylation code, a computational model that can predict alternative polyadenylation patterns from transcript sequences.
- Many existing approaches for classifying whether a stretch of sequence contains a PAS, or for characterizing whether a PAS is tissue-specific, may be aimed at improving gene annotations and understanding which features are involved in APA regulation, and may not address the question of predicting how APA sites are variably selected.
- the ability to predict PAS strength may enable this model to generalize to multiple prediction tasks, even though it may not be explicitly trained for them.
- the model can be applied to a gene with multiple PAS to determine the relative transcript isoforms that may be produced in a tissue-specific manner.
- the model can predict the consequence of nucleotide substitutions on PAS strength, which can be used to prioritize genetic variants that affect polyadenylation. It can be used to assess the effects of anti-sense oligonucleotides to alter transcript abundance. It can also scan the 3’-UTR of the human genome to find potential PAS.
- the present disclosure provides examples of these applications and methods to analyze how different features affect the predictions of the model.
- a score can be calculated that describes or corresponds to the strength of a PAS, or the efficiency with which it is recognized by the 3’-end processing machinery. Such a task may be straightforward if this target variable is directly measurable.
- current sequencing protocols may provide only a measurement of the relative transcript abundance from APA.
- Some approaches to quantify the strength of a PAS may, for example, use normalized read counts, but quantification can be affected by factors such as sequencing biases, transcript length, and RNA decay.
- Some approaches may classify PAS strength based on whether a canonical polyadenylation signal or other reported sequence elements are present near the PAS.
- the present disclosure provides systems and methods to predict a quantitative description of the strength of a PAS by modeling it as a hidden variable, and to infer it from data.
- the position of a PAS relative to neighboring sites can affect its selection. Some biological processes and tissues may favor PAS at the distal end, whereas cells under disease states may tend to utilize PAS that are more proximal.
- the model may include a variable that accounts for the distance between neighboring sites during training. Even though the position of a PAS is modeled, a desirable characteristic of the predictor may be that during inference, positional information may be optional. This can be useful in regions of the genome where there are insufficient annotation sources to ascertain the distance to a nearby PAS.
- this design may also enable the model to be applied to any DNA sequence associated with a site, optionally with the bases within it modified, so that the predicted effect on polyadenylation regulation can be observed.
- the model can be applied to each PAS separately to compare their relative strengths.
- their positions can be factored into the model’s prediction, if annotation sources are available, in order to obtain a better estimate.
- a polyadenylation code model may be constructed and analyzed.
- the polyadenylation code may refer to a model that can infer tissue-specific PAS strength scores from sequence, and optionally account for the influence of position if it is provided.
- the model may take as input a sequence of length 200 bases centered on a PAS. Two or more models which operate on the sequence differently may be
- a first model may be built on hand-crafted features.
- Features may be extracted or derived from genomic sequences (e.g., higher level engineered features, based on composition or counts of multiple bases). Alternatively, features may simply comprise at least a portion of the sequence itself (e.g., lower level raw sequences, such as one-hot encoding of individual bases).
- the genomic sequence may be processed by a feature extraction pipeline, which divides the sequence into 4 regions relative to the PAS (as described, for example, in Example 8, and as described, for example, by Hu et al, Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation, RNA. 2005; which is hereby incorporated by reference in its entirety).
- Some features may be limited to specific regions, namely the polyadenylation signals in the 5’-5’ and 5’-3’ regions, and hexamers defined by Hu et al. Other features may be computed in all regions, including counts of RNA-binding protein (RBP) motifs that may be involved in polyadenylation, counts of all possible n-mers of length 1 to 4, and nucleosome positioning features (as described, for example, by van der Heijden et al., Sequence-based prediction of single nucleosome positioning and genome-wide nucleosome occupancy, Proc. Natl. Acad. Sci. U. S. A., 2012; which is hereby incorporated by reference in its entirety).
- the feature vector may be mapped to a fully-connected neural network. Such a model may be referred to as a Feature-Net.
- a second model may directly learn from the genomic sequence, using a convolutional neural network (Conv-Net).
- the Conv-Net may comprise tunable motif filters which are free to adapt to the input sequence to optimize the predictive performance of the model. It may also contain pooling operations that enable the model to focus on select locations in the input sequence whose composition may maximally activate the motif filters.
- the log distance between sites may also be an input feature for both models.
- the proximal (5’) site may have a position feature of 0, whereas the distal (3’) site may have a position feature that is equal to the logarithm of the distance between the distal site and the proximal site.
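The position feature described above can be written as a small helper. This is a minimal sketch: the text does not state the base of the logarithm, so the natural logarithm is an assumption, as are the function and argument names.

```python
import math


def position_features(proximal_coord, distal_coord):
    """Return (proximal feature, distal feature) for a pair of competing sites.

    The proximal site gets 0; the distal site gets the log of the distance
    between the two sites (natural log assumed here).
    """
    distance = abs(distal_coord - proximal_coord)
    if distance == 0:
        raise ValueError("the two sites must be at different coordinates")
    return 0.0, math.log(distance)


proximal_feature, distal_feature = position_features(1000, 2000)
```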
- FIG. 1 shows a schematic of both the first model and the second model.
- the sequences are transformed by the Feature-Net and Conv-Net into a hidden representation.
- the Feature-Net may perform feature extraction on the sequence to generate a feature vector, which may then be mapped to a fully-connected neural network.
- the Conv-Net may apply filters to convolve the sequence into a filter map, which may then be rectified, pooled, and flattened.
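The Conv-Net transformation above (convolve, rectify, pool, flatten) can be sketched in plain NumPy. This is an illustrative forward pass only, with random hypothetical filters, non-overlapping max pooling, and no training. The shapes and the pool size are assumptions, not values from the source.

```python
import numpy as np


def conv_hidden_representation(x, filters, pool_size):
    """Convolve, rectify, max-pool, and flatten a one-hot encoded sequence.

    x:       array of shape (length, 4)
    filters: array of shape (num_filters, width, 4)
    """
    num_filters, width, _ = filters.shape
    num_positions = x.shape[0] - width + 1

    # Filter map: response of every filter at every valid position.
    filter_map = np.empty((num_positions, num_filters))
    for pos in range(num_positions):
        window = x[pos:pos + width]
        filter_map[pos] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))

    rectified = np.maximum(filter_map, 0.0)  # rectification (ReLU)

    # Non-overlapping max pooling along the sequence axis.
    num_pools = num_positions // pool_size
    trimmed = rectified[:num_pools * pool_size]
    pooled = trimmed.reshape(num_pools, pool_size, num_filters).max(axis=1)

    return pooled.reshape(-1)  # flatten into the hidden representation


rng = np.random.default_rng(0)
sequence = np.eye(4)[rng.integers(0, 4, size=20)]  # random one-hot sequence
motif_filters = rng.normal(size=(3, 5, 4))         # hypothetical filters
hidden = conv_hidden_representation(sequence, motif_filters, pool_size=4)
```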
- the hidden representation may be processed by separate fully-connected hidden layers of a PAS strength predictor to make tissue-specific predictions.
- the architecture therefore factors predictions into two components: a score that describes or corresponds to the tissue-specific PAS strength, followed by predictions that represent the relative abundance of transcripts from RNA-Seq experiments between two competing PAS.
- the parameters of the fully-connected layers model the cell state of tissues, which describes or corresponds to the steady-state environment of the cell, such as the protein concentrations in the cytosol, that can affect transcriptional modifications. These cell state parameters may not be explicitly defined in terms of what they consist of or how they factor in the predictions, but rather may be simply modeled as hidden variables and be learned from data.
- Seven distinct tissue types may be available in the dataset used to train the models. Since there may be two sets of sequencing reads for the naive B-cells obtained from different donors (as described, for example, by Lianoglou et al., Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression, Genes Dev., 2013; which is hereby incorporated by reference in its entirety), they can be treated as separate tissues, and so the model described herein has eight polyadenylation strength prediction outputs. A choice may be made not to rely on evolutionary conservation, in order to force the models to learn patterns from the genome itself (as described, for example, by Leung et al.
- a training example may comprise two PAS from the same gene and may require the model to predict their relative strengths, which can be interpreted as the probability that each site may be selected for cleavage and polyadenylation.
- the relative strength may be measured by the read counts from RNA-Seq that have been mapped to each site.
- a softmax function may be used to squash the real-valued predictions (e.g., tissue-specific strength predictions) from the PAS strength predictor into a normalized score that can be interpreted as the probability that one PAS is chosen over the other (e.g., relative strength predictions).
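The softmax step above can be sketched as follows for a pair of competing sites. The logit values are hypothetical; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import numpy as np


def relative_strength(logits):
    """Map real-valued strength scores to probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical strength scores for two competing PAS in one tissue.
probs = relative_strength([2.0, 0.0])
```

For two sites this reduces to a logistic function of the difference between the two strength scores.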
- the predictions are penalized against training targets of the relative abundances of transcripts for these PAS, which is measured from the sequencing experiment.
- Results described herein may be based on the predictions from the PAS strength predictor (e.g., the logits) instead of the relative strength predictions that follow the softmax.
- the predictive model may be applied to multiple tasks, even though it may be trained only to the task of modeling competing site selection. Predictions for these other tasks may be evaluated without any additional task-specific training or data augmentation to demonstrate the general applicability of this model.
- 3’-UTR annotations such as those from UCSC (as described, for example, by Kent et al, The human genome browser at UCSC, Genome Res., 2002; which is hereby incorporated by reference in its entirety), GENCODE (as described, for example, by Harrow et al, GENCODE: the reference human genome annotation for the ENCODE project, Genome Res., 2012; which is hereby incorporated by reference in its entirety), RefSeq (as described, for example, by Pruitt et al, NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Res., 2005; which is hereby incorporated by reference in its entirety), and Ensembl (as described, for example, by Yates et al, Ensembl 2016, Nucleic Acids Res., 2016; which is hereby incorporated by reference in its entirety), may be combined,
- polyadenylation annotations used may include PolyA DB 2 (as described, for example, by Lee et al, PolyA DB 2: mRNA polyadenylation sites in vertebrate genes, Nucleic Acids Res., 2007; which is hereby incorporated by reference in its entirety), GENCODE, and APADB (as described, for example, by Müller et al, APADB: a database for alternative polyadenylation and microRNA regulation events, Database (Oxford), 2014; which is hereby incorporated by reference in its entirety).
- PAS from different sources may largely overlap, but some sites can be unique to one study due to the differences in cell lines or tissue types as well as sequencing protocol. Due to the inexact nature of 3’-end processing, PAS that are within 50 bases of each other may be clustered, and the resulting peak may be marked as the location of the PAS.
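The clustering step above can be sketched as a single pass over sorted sites. This is one possible reading: sites are chained into a cluster while each is within 50 bases of the previous one, and the peak is taken to be the position with the highest read count in the cluster. Both choices, along with the names and example data, are assumptions.

```python
def cluster_sites(sites, max_gap=50):
    """Cluster nearby sites and return one peak position per cluster.

    sites: iterable of (position, read_count) tuples.
    """
    clusters = []
    for position, count in sorted(sites):
        # Extend the current cluster if this site is close to the last one in it.
        if clusters and position - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append((position, count))
        else:
            clusters.append([(position, count)])
    # Mark the most supported position in each cluster as the PAS location.
    return [max(cluster, key=lambda site: site[1])[0] for cluster in clusters]


# Hypothetical sites: three close together and one far away.
peaks = cluster_sites([(100, 5), (130, 20), (170, 3), (400, 7)])
```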
- the final PAS atlas may contain about 19,320 3’-UTR regions with two or more PAS from genes in the hg19 assembly for a total of 92,218 sites.
- the model may be trained from the relative abundance of transcripts from a 3’-end sequencing experiment of seven distinct human tissues, including the brain, breast, embryonic stem (ES) cells, ovary, skeletal muscle, testis, and two samples of naive B cells (as described, for example, by Lianoglou et al, Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression, Genes Dev., 2013; which is hereby incorporated by reference in its entirety). Other cell lines may also be available in the dataset, but may not be used.
- the version of aligned reads which have been processed through the studies’ computational pipeline may be used, which includes removal of internally primed and antisense reads, as well as application of minimum expression requirements to reduce sequencing noise. These reads may be assigned to the PAS atlas, resulting in read counts associated with each PAS.
- a Beta model derived from Bayesian inference (as described, for example, by Xiong et al, Probabilistic estimation of short sequence expression using RNA-Seq data and the positional bootstrap, 2016; which is hereby incorporated by reference in its entirety) may be adopted, treating the percent read counts of one site relative to another site as the parameter of a Bernoulli distribution.
- the mean of this distribution can be used as the target to train the model, that is, the PAS usage of site 1 relative to site 2 is (1 + N_site1) / (2 + N_site1 + N_site2), where N_site1 and N_site2 are the read counts assigned to site 1 and site 2, respectively.
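The training target above can be computed directly from the read counts at the two sites. A minimal sketch, assuming the expression is the posterior mean of a Beta distribution under a uniform prior (which yields the "1 +" and "2 +" terms). The function name and counts are illustrative.

```python
def relative_usage(n_site1, n_site2):
    """Mean usage of site 1 relative to site 2 given read counts at each site."""
    if n_site1 < 0 or n_site2 < 0:
        raise ValueError("read counts must be non-negative")
    return (1 + n_site1) / (2 + n_site1 + n_site2)


# Hypothetical read counts for a pair of sites in one tissue.
target = relative_usage(30, 10)
no_data = relative_usage(0, 0)
```

With no reads at either site the estimate falls back to 0.5, and it approaches the raw read fraction as the counts grow.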
- different combinations of pairs of sites may be generated as training targets and quantified as above.
- An assumption may be that the relative strength of neighboring PAS can be described by the relative read counts at those sites, even if there are other sites present in the same gene. This assumption may simplify the architecture of the computational model and quantification of relative strength between sites.
- the model may be constructed and trained in Python using the TensorFlow library (as described, for example, by Abadi et al, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2015; and by Rampasek et al, TensorFlow: Biology’s Gateway to Deep Learning?, Cell Syst., 2016; each of which is hereby incorporated by reference in its entirety).
- Hidden units of the neural network may comprise rectified linear activation units.
- the feature vectors may be normalized with mean zero and standard deviation of one.
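The normalization above can be sketched as a per-feature z-score. Computing the statistics on the training data and adding a small constant to avoid division by zero are assumptions of this sketch, not details from the source.

```python
import numpy as np


def standardize(train_features, eps=1e-8):
    """Scale each feature column to mean zero and standard deviation one."""
    train_features = np.asarray(train_features, dtype=float)
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    scaled = (train_features - mean) / (std + eps)
    return scaled, mean, std


# Hypothetical feature matrix: 3 examples, 2 features.
scaled, mean, std = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

The returned `mean` and `std` can be reused to scale held-out examples with the same statistics.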
- the input may use a one-hot encoding representation for each of the 4 nucleotides.
- the dimension of the input may be 4 x n.
- Padding may be inserted at both ends of the input so that the motif filters can be applied to each position of the sequence from beginning to end.
- the additional padding on each side of the sequence may have dimension 4 x (m - 1), where m is the length of the motif filter and the padding may be filled with the value 0.25, equivalent to an N nucleotide in IUPAC notation.
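The input encoding and padding above can be sketched as follows. It assumes that m denotes the motif filter length, that the input is a DNA string over A, C, G, T, and that any other character is treated like N. The function name is hypothetical.

```python
import numpy as np


def encode_with_padding(seq, filter_width):
    """One-hot encode a sequence as a 4 x (n + 2(m - 1)) array with 0.25 padding."""
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    pad = filter_width - 1
    # Start from 0.25 everywhere, so padding and unknown bases behave like N.
    encoded = np.full((4, len(seq) + 2 * pad), 0.25)
    for i, base in enumerate(seq.upper()):
        column = pad + i
        if base in lookup:
            encoded[:, column] = 0.0
            encoded[lookup[base], column] = 1.0
    return encoded


encoded = encode_with_padding("ACGT", filter_width=3)
```

With this padding, a filter of width m can be placed so that it overlaps every position of the sequence, including the first and last bases.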
- This approach may be similar to that described by Alipanahi et al. (Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning, Nat. Biotechnol, 2015; which is hereby incorporated by reference in its entirety).
- Each training example may consist of a pair of PAS from a gene, where the input is the two sites’ genomic sequences, and the target is their relative read counts computed as described elsewhere herein.
- different combinations of pairs of sites may be generated as examples. Only examples with more than 10 reads may be kept. This may result in a dataset of 64,572 examples, which is split for training and testing.
- the parameters of the neural network may be initialized (as described, for example, by Glorot et al. , Understanding the difficulty of training deep feedforward neural networks,
- the parameters of the neural network may be trained using a stochastic gradient descent method with momentum and dropout (as described, for example, by Hinton et al. , Improving neural networks by preventing co-adaptation of feature detectors, arXiv Prepr.
- Predictions from each softmax output may be penalized by the cross-entropy function, and its sum across all tissue types may be backpropagated to update the parameters of the neural network.
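The loss above can be sketched as a cross-entropy between the softmax output and the relative-abundance target for each tissue, summed over tissues. This is a forward computation only, with no backpropagation. The array shapes and example values are assumptions.

```python
import numpy as np


def summed_cross_entropy(logits, targets):
    """Sum over tissues of the cross-entropy between softmax(logits) and targets.

    logits:  array of shape (num_tissues, 2), strength scores for two sites
    targets: array of shape (num_tissues, 2), relative usage, rows sum to 1
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Log-softmax computed in a numerically stable way.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_tissue = -(targets * log_probs).sum(axis=1)
    return per_tissue.sum()


# Hypothetical example with eight tissue outputs and uninformative predictions.
example_logits = np.zeros((8, 2))
example_targets = np.full((8, 2), 0.5)
loss = summed_cross_entropy(example_logits, example_targets)
```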
- Training and testing of the model may be performed in a similar fashion as described, for example, by Leung et al. (Deep learning of the tissue-regulated splicing code, Bioinformatics, 2014; which is hereby incorporated by reference in its entirety). Briefly, data may be split into approximately five equal folds at random for cross validation (e.g., a 5-fold cross-validation). Each fold may contain a unique set of genes that are not found in any of the other folds.
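The fold construction above, in which each fold contains genes not found in any other fold, can be sketched by assigning whole genes to folds rather than individual examples. The round-robin assignment after shuffling, the seed, and the gene labels are assumptions of this sketch.

```python
import random
from collections import defaultdict


def gene_grouped_folds(example_genes, num_folds=5, seed=0):
    """Split example indices into folds so that no gene spans two folds.

    example_genes: list giving the gene identifier of each training example.
    """
    genes = sorted(set(example_genes))
    random.Random(seed).shuffle(genes)
    fold_of_gene = {gene: i % num_folds for i, gene in enumerate(genes)}

    folds = defaultdict(list)
    for index, gene in enumerate(example_genes):
        folds[fold_of_gene[gene]].append(index)
    return [folds[k] for k in range(num_folds)]


# Hypothetical dataset: 40 examples drawn from 12 genes.
example_genes = ["gene%d" % (i % 12) for i in range(40)]
folds = gene_grouped_folds(example_genes)
```

Grouping by gene keeps pairs of sites from the same gene out of both the training and test portions of any one split.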
- the validation set may be used for selection of hyperparameters. Examples of the selected hyperparameters for the models can be found in Example 13.
- a graphics processing unit (GPU) may be used to accelerate training, and hyperparameters may be selected by randomly sampling the hyperparameter space.
- a prediction model may calculate feature vectors x_1, ..., x_n for n candidate polyadenylation sites, and may use these to calculate a set of preferences p_1, ..., p_n for the candidate polyadenylation sites.
- the prediction model may comprise a first computation module and a second computation module, as described elsewhere herein.
- a dataset of polyadenylation sites and the usage of the candidate polyadenylation sites may be used to adjust the parameters θ of the prediction model.
- Polyadenylation sequence data and polyadenylation site usage data may be obtained.
- polyadenylation sequence data may be obtained or derived from a reference genome, by sequencing deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) of one or more bodily samples obtained from one or more subjects, or by performing modifications (e.g., incorporating one or more genetic aberrations) of such data.
- Such sequencing may be performed using next-generation sequencing (e.g., massively parallel sequencing or single molecule sequencing).
- a genetic aberration may be, for example, a single nucleotide variant (SNV) or an insertion or deletion (indel).
- Polyadenylation usage data may be obtained using genome annotations, complementary DNA (cDNA) and expressed sequence tag (EST) libraries, or by sequencing polyadenylated RNA of one or more bodily samples obtained from one or more subjects. For each of one or more genomic sequences, a set of candidate polyadenylation sites with corresponding measured preferences may be produced.
- Each training case may correspond to a set of candidate polyadenylation sites that are identified in the sequence data.
- Each training case may comprise feature vectors $x_1, \ldots, x_n$ for the $n$ candidate polyadenylation sites, obtained using the genomic sequence; and measured preferences $p'_1, \ldots, p'_n$ for the candidate polyadenylation sites, obtained using the polyadenylation site usage data.
- the feature vector for the $i$th candidate polyadenylation site, $x_i$, may encode the RNA sequence of length $m$ centered on the polyadenylation site.
- the nucleotides adenine (A), cytosine (C), guanine (G), and uracil (U) may be encoded as (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1), respectively, and the encodings of the $m$ nucleotides may be appended to form a binary sequence of length $4m$. For example, for an RNA sequence
- the prediction model may calculate the preferences $p_1, \ldots, p_n$ by first calculating a set of corresponding intermediate representations $r_1, \ldots, r_n$, each of which may comprise a numerical value.
- the intermediate representation for the $i$th candidate polyadenylation site may be calculated using the following linear summation:
- each intermediate representation may comprise a sum of 4m terms.
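- as a minimal sketch, the 1-of-4 encoding and the linear summation described above may be illustrated as follows. The sketch assumes the A, C, G, U ordering given above and assumes the summation takes the form $r_i = \sum_{k=1}^{4m} \theta_k x_{i,k}$; the function names `one_hot_rna` and `linear_representation` and the parameter values are hypothetical.

```python
# Hypothetical helper names; the A, C, G, U ordering follows the encoding described in the text.
NUCLEOTIDE_CODES = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "U": (0, 0, 0, 1),
}


def one_hot_rna(sequence):
    """Append the 1-of-4 codes of the m nucleotides into one binary list of length 4m."""
    encoded = []
    for nucleotide in sequence.upper():
        encoded.extend(NUCLEOTIDE_CODES[nucleotide])
    return encoded


def linear_representation(features, theta):
    """Assumed form of the linear summation: the sum of theta_k * x_{i,k} over 4m terms."""
    if len(features) != len(theta):
        raise ValueError("features and theta must have the same length")
    return sum(t * x for t, x in zip(theta, features))


x_i = one_hot_rna("AUG")      # m = 3, so the feature vector has 4m = 12 entries
theta = [0.5] * len(x_i)      # illustrative parameter values only
r_i = linear_representation(x_i, theta)
```

- in this sketch, exactly one entry per nucleotide is nonzero, so only $m$ of the $4m$ terms contribute to each intermediate representation.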
- the feature vectors may encode other features, such as the presence of certain patterns; a numerical representation of RNA secondary structure; and a numerical encoding of nucleosome positioning.
- the intermediate representation for each polyadenylation site may comprise a single numerical value or a vector of numerical values, and may be calculated using a linear summation as shown above, a multilayer neural network comprised of multiple layers of computations with nonlinearities, a recurrent neural network, or one of many other types of machine learning systems.
- the intermediate representations for the polyadenylation sites may be combined using different computational approaches, such as those described elsewhere herein, to calculate the preferences.
- a set of initial training parameters $\theta$ may be generated, e.g., by using preset values, by using a random number generator, or by setting them using additional data.
- a goal of training may be to adjust the set of training parameters $\theta$ so that the calculated preferences $p$ and the measured preferences $p'$ are close for every training case.
- Denoting the index of the training case by $j$, the polyadenylation feature vectors, the preferences corresponding to the candidate polyadenylation sites calculated by the prediction model, and the measured preferences corresponding to the candidate polyadenylation sites may be denoted respectively by: $x^j$, $p^j$, $p'^j$.
- These feature vectors and calculated preferences may be initialized, e.g., by setting all initial values to 0 or 1.
- a loss function $L(p^j, p'^j, \theta)$ may be evaluated for the calculated preferences and the measured preferences, for the current set of training parameters $\theta$. This loss function may depend on the parameters because the calculation of the preferences depends on the parameters, as described above.
- Suitable loss functions include a negative cross entropy loss function, given by:
- a gradient-based learning procedure may be used to iteratively update the set of training parameters $\theta$ so as to decrease the total loss, as given by: $L(p^1, p'^1, \theta) + L(p^2, p'^2, \theta) + \cdots + L(p^T, p'^T, \theta)$, wherein $T$ is the number of training cases and a prime marks the measured preferences. This may be iterated until a stopping criterion is satisfied. Examples of stopping criteria are that a pre-determined number of iterations have been performed, that a decrease in the total loss from one iteration to the next is below a pre-determined threshold, or that the total loss evaluated on a held-out validation set (e.g., a subset of the training data set) increases instead of decreases.
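- as a minimal sketch, a per-case loss and the total loss over $T$ training cases may be computed as follows. The sketch assumes the standard cross-entropy form, i.e., the negative sum over sites of the measured preference times the logarithm of the calculated preference; the function names and the toy values are hypothetical.

```python
import math


def cross_entropy(calculated, measured, eps=1e-12):
    """Assumed per-case loss: -sum_i measured_i * log(calculated_i)."""
    # eps guards against log(0) when a calculated preference is exactly zero.
    return -sum(m * math.log(max(c, eps)) for c, m in zip(calculated, measured))


def total_loss(cases):
    """Sum of the per-case losses; each case is a (calculated, measured) pair."""
    return sum(cross_entropy(calculated, measured) for calculated, measured in cases)


cases = [
    ([0.7, 0.3], [1.0, 0.0]),   # training case 1: two candidate sites
    ([0.5, 0.5], [0.5, 0.5]),   # training case 2: two candidate sites
]
loss = total_loss(cases)
```

- the total loss in this sketch decreases as the calculated preferences move toward the measured preferences, which is the behavior a stopping criterion based on the change in total loss would monitor.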
- the loss function can be minimized.
- a parameter update may be obtained by differentiating the selected loss function (to obtain a differential) and numerically evaluating the differential.
- the minimization of the loss function may result in more accurate predictions as training progresses iteratively.
- This gradient-based learning procedure may be combined with a variety of standard techniques, such as batch gradient descent, minibatch learning, stochastic gradient descent, learning with dropout, momentum-based learning methods, and others.
- a final prediction model may be generated comprising a final configuration of the set of training parameters $\theta$, which may then be used to calculate the polyadenylation preferences for any set of polyadenylation site feature vectors.
- a random set of training examples may be selected, the loss function may be evaluated based on this selected random set of training examples, gradients with respect to the model parameters may be computed, and the model parameters may be updated. This process may then be repeated with a different random set of training examples.
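- as a minimal sketch, the repeated sample, evaluate, differentiate, and update cycle described above may be written as follows. The sketch assumes a linear intermediate representation, an exponential (softmax) normalization, and a cross-entropy loss, for which the gradient with respect to each intermediate representation is the calculated preference minus the measured preference. Momentum and dropout are omitted, and all names, data, and hyperparameter values are hypothetical.

```python
import math
import random


def softmax(values):
    """Exponential normalization, shifted by the maximum for numerical stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict(theta, site_features):
    """Linear intermediate representation per site, then softmax over the sites."""
    scores = [sum(t * x for t, x in zip(theta, feats)) for feats in site_features]
    return softmax(scores)


def sgd_step(theta, batch, learning_rate):
    """One parameter update from a batch of (site_features, measured_preferences) cases."""
    grad = [0.0] * len(theta)
    for site_features, measured in batch:
        calculated = predict(theta, site_features)
        # For softmax with cross-entropy, dL/dr_i = calculated_i - measured_i.
        for feats, p, target in zip(site_features, calculated, measured):
            for k, x in enumerate(feats):
                grad[k] += (p - target) * x
    return [t - learning_rate * g / len(batch) for t, g in zip(theta, grad)]


def train(cases, n_features, steps=200, batch_size=2, learning_rate=0.5, seed=0):
    """Repeat: draw a random batch, compute gradients, update the parameters."""
    rng = random.Random(seed)
    theta = [0.0] * n_features
    for _ in range(steps):
        batch = rng.sample(cases, min(batch_size, len(cases)))
        theta = sgd_step(theta, batch, learning_rate)
    return theta


# Toy data: two features per site; the site with the first feature set is the one used.
toy_cases = [
    ([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]),
    ([[0.0, 1.0], [1.0, 0.0]], [0.0, 1.0]),
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], [1.0, 0.0, 0.0]),
]
theta = train(toy_cases, n_features=2)
```

- the toy cases have two or three candidate sites each, which illustrates that the same parameters may be applied to any number $n$ of candidate polyadenylation sites.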
- a plurality of models can be trained such that each model generates a plurality of parameters and a prediction, and a plurality of predictions can be combined into a single prediction (e.g., by averaging).
- Although the same model may be applicable to examples with any number $n$ of candidate polyadenylation sites, it may be advantageous to either select only training examples with the same number of candidate polyadenylation sites in one batch, or to select them such that the numbers of candidate polyadenylation sites in the same batch are not too dissimilar.
- If a single batch of training examples contains cases with different numbers of candidate polyadenylation sites (e.g., a “ragged batch”), one or more decoy inputs may need to be added to the cases with fewer candidate polyadenylation sites, thereby making all cases equal (e.g., having equal numbers of candidate polyadenylation sites) for computational reasons (e.g., a “balanced batch”), and the preference outputs corresponding to the decoy inputs may need to be masked out.
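- as a minimal sketch, padding a ragged batch with decoy inputs and masking the corresponding preference outputs may be done as follows. The use of all-zero decoy feature vectors and of a 0/1 mask is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def pad_batch(batch, n_features):
    """Pad every case with all-zero decoy sites up to the largest site count in the batch.

    Returns (padded_features, masks); mask entries are 1 for real sites and 0 for decoys.
    """
    max_sites = max(len(case) for case in batch)
    padded, masks = [], []
    for case in batch:
        n_decoys = max_sites - len(case)
        decoys = [[0.0] * n_features for _ in range(n_decoys)]
        padded.append([list(site) for site in case] + decoys)
        masks.append([1] * len(case) + [0] * n_decoys)
    return padded, masks


def masked_softmax(scores, mask):
    """Softmax over real sites only; decoy positions receive a preference of exactly 0."""
    peak = max(s for s, keep in zip(scores, mask) if keep)
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


ragged = [
    [[1.0, 0.0], [0.0, 1.0]],               # two candidate sites
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # three candidate sites
]
padded, masks = pad_batch(ragged, n_features=2)
preferences = masked_softmax([2.0, 1.0, 0.0], masks[0])
```

- because the decoy positions are excluded from the normalization, the preferences of the real candidate sites still sum to one.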
- the calculations made by the prediction model may be efficiently implemented on a graphics processing unit (GPU) for efficient training and for application at test time.
- a plurality of candidate polyadenylation sites may be identified in a genomic sequence (e.g., a human genome).
- the polyadenylation sites may comprise a contiguous segment of mRNA or DNA.
- the polyadenylation site may correspond to a possible start of a polyadenylation event in the human genome.
- the human genome may be obtained by sequencing mRNA or DNA of a bodily sample obtained from a subject.
- the systems and methods described herein may comprise using trained algorithms to predict the utilization of a set of candidate polyadenylation sites.
- One or more polyadenylation site feature vectors may be calculated for each candidate polyadenylation site of the plurality of candidate polyadenylation sites.
- the polyadenylation site feature vectors may be calculated by performing calculations on (e.g., processing) mRNA sequence data (or alternatively, DNA sequence data corresponding to the mRNA sequence).
- Feature vectors $x_i$ for the $i$th candidate polyadenylation site may be obtained.
- Each feature vector may comprise a vector of one or more features determined at least based on one or more nucleotide positions in the human genome. These features may be determined using other systems.
- a feature may be determined at least based on one or more nucleotides in the genomic sequence. In some embodiments, at least one of the one or more nucleotides is located within about 50, 40, 30, 25, 20, 15, 10, or 5 nucleotides of the location in the genomic sequence of a polyadenylation site.
- a feature may comprise a raw sequence at a nucleotide position that may be encoded using a 1-of-4 binary vector for each nucleotide in a set of possible nucleotides for the sequence type (e.g., mRNA or DNA).
- for an mRNA sequence, a set of possible nucleotides may comprise adenine, “A”; uracil, “U”; cytosine, “C”; or guanine, “G.”
- for a DNA sequence, a set of possible nucleotides may comprise adenine, “A”; thymine, “T”; cytosine, “C”; or guanine, “G.”
- a 1-of-4 binary vector $[0, 1, 0, 0]^T$ in an mRNA sequence may denote that a nucleotide located at a particular nucleotide position in the mRNA sequence is uracil, “U.”
- a 1-of-4 binary vector $[0, 1, 0, 0]^T$ in a DNA sequence may denote that a nucleotide located at a particular nucleotide position in the DNA sequence is thymine, “T.”
- a feature may comprise a binary component (value).
- a feature may comprise a binary value indicating the presence (e.g., value of 1) or absence (e.g., value of 0) of a certain sequence (e.g., a motif in a polyadenylation site).
- a feature may comprise categorical, integer, or real-valued components.
- a feature may comprise an integer component such as a distance, in number of nucleotides, of a candidate polyadenylation site from a given genomic position.
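- as a minimal sketch, a binary motif-presence component and an integer distance component may be combined into one feature list as follows. The motif AAUAAA is used purely as an illustrative choice and is not specified above; the function names, sequence, and positions are likewise hypothetical.

```python
def motif_present(sequence, motif):
    """Binary feature: 1 if the motif occurs anywhere in the sequence, else 0."""
    return 1 if motif in sequence else 0


def distance_feature(site_position, reference_position):
    """Integer feature: distance in nucleotides between a candidate site and a given position."""
    return abs(site_position - reference_position)


def build_features(sequence, site_position, reference_position, motif="AAUAAA"):
    """Combine a binary component and an integer component (illustrative only)."""
    return [
        motif_present(sequence, motif),
        distance_feature(site_position, reference_position),
    ]


features = build_features("GGCAAUAAAGCUUCA", site_position=120, reference_position=100)
```

- such components may be appended to the 1-of-4 sequence encoding so that a single feature vector carries binary, integer, and real-valued entries.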
- a first computation module may be used to process a polyadenylation site feature vector to calculate a set of intermediate representations ($r_1, r_2, \ldots, r_n$) corresponding to the plurality ($n$) of candidate polyadenylation sites. For each candidate polyadenylation site, a series of one or more structure computations may be performed on the feature vectors to determine an intermediate representation $r_i$ comprising one or more numerical values.
- Each of the values in the set of intermediate representations may indicate a preference of a candidate polyadenylation site relative to the other candidate polyadenylation sites of the plurality, with higher preference values indicating a higher likelihood of being selected as an actual polyadenylation site in a polyadenylation process, and lower preference values indicating a lower likelihood of being selected as an actual polyadenylation site in a polyadenylation process.
- if each of the intermediate representations comprises a single numerical value and the first candidate polyadenylation site has the largest intermediate representation among the set of intermediate representations corresponding to the plurality of candidate polyadenylation sites, then the first candidate polyadenylation site is the most likely to be selected (e.g., maximally preferred) as an actual polyadenylation site in a polyadenylation process.
- the intermediate representations $r_1, r_2, \ldots, r_n$ may be processed by a second computation module to produce a set of preferences, $p_1, p_2, \ldots, p_n$.
- a second computation module may be used to calculate a set of preferences $p_i$ ($p_1, p_2, \ldots, p_n$) for a selection of the $i$th candidate polyadenylation site among the plurality of candidate polyadenylation sites. This calculation may be denoted by $p_1, p_2, \ldots, p_n \leftarrow h(r_1, r_2, \ldots, r_n)$.
- h is a pre-determined function on a set of one or more intermediate representation values.
- the second computation module may be operable to normalize the ith preference for a candidate polyadenylation site by using an exponential function for h, by assigning:
- the second computation module may be operable to normalize the $i$th preference for a candidate polyadenylation site by using a rectified linear function, relu(), for $h$.
- the second computation module may be operable to normalize the ith preference for a candidate polyadenylation site by using another type of function for h.
- This function may be a monotonic function to preserve order of preferences between a set of intermediate representation values and a set of preference values.
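- as a minimal sketch, two possible choices of the function $h$ may be written as follows. The exponential case is assumed to be the usual softmax, $p_i = \exp(r_i) / \sum_k \exp(r_k)$, and the rectified linear case is assumed to divide each rectified value by the sum of rectified values; both forms and all names are assumptions made for illustration.

```python
import math


def exponential_preferences(representations):
    """Assumed softmax form: p_i = exp(r_i) / sum_k exp(r_k)."""
    peak = max(representations)   # shift by the maximum for numerical stability
    exps = [math.exp(r - peak) for r in representations]
    total = sum(exps)
    return [e / total for e in exps]


def relu_preferences(representations):
    """Assumed form: p_i = relu(r_i) / sum_k relu(r_k), with relu(r) = max(0, r)."""
    rectified = [max(0.0, r) for r in representations]
    total = sum(rectified)
    if total == 0.0:
        raise ValueError("at least one intermediate representation must be positive")
    return [v / total for v in rectified]


r = [2.0, 1.0, -1.0]
p_exp = exponential_preferences(r)
p_relu = relu_preferences(r)
# Index of the maximally preferred candidate site under the exponential normalization.
most_preferred = max(range(len(r)), key=lambda i: p_exp[i])
```

- both functions in this sketch are non-decreasing in each intermediate representation, so the site with the largest intermediate representation also receives the largest preference; the rectified linear form maps all negative representations to a preference of zero and therefore preserves order only weakly.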
- Each preference $p_i$ among the set of preferences may indicate a probability of selection of the $i$th candidate polyadenylation site among the plurality of candidate polyadenylation sites in a polyadenylation process.
- a maximally preferred candidate polyadenylation site may be identified among the plurality of candidate polyadenylation sites by selecting the candidate polyadenylation site with a largest value of preference $p_{\max}$ among the set of preferences ($p_1, p_2, \ldots, p_n$).
- a genomic sequence may be constructed by hand or by a computer by combining sequences from different sources, including polyadenylated sequences.
- a polyadenylated mRNA molecule may be reverse transcribed into a complementary DNA (cDNA) molecule, and the resulting cDNA molecule may be sequenced to obtain a polyadenylated sequence.
- This polyadenylated sequence may be mapped to a genome (e.g., a human genome) by hand or by a computer.
- the variant may be specified with respect to a reference sequence, which may be derived from, e.g., the genome, DNA sequencing, sequencing mRNA, or another approach.
- the variant may be specified by a sequential combination of one or more substitutions, insertions, and deletions with respect to the reference sequence.
- a substitution may be specified by a location in the reference sequence and the nucleotide (e.g., A, T, C, or G) that is substituted for the nucleotide at that location.
- An insertion may be specified by a location in the reference sequence and a nucleotide that is inserted right after the nucleotide at that location.
- a deletion may be specified by a location in the reference sequence at which a nucleotide has been removed from the sequence.
- the reference sequence is from the human genome. In some embodiments, the reference sequence is specified by a set of genomic coordinates. In some embodiments, the genetic variant is specified by a series of substitutions, insertions, and deletions in the genome, as indicated using the set of genomic coordinates.
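- as a minimal sketch, a variant specified as a combination of substitutions, insertions, and deletions may be applied to a reference sequence as follows. The tuple layout `(kind, position, nucleotide)`, the 0-based positions, and the restriction to at most one edit per position are assumptions of this sketch, and the function name is hypothetical.

```python
def apply_variant(reference, edits):
    """Apply substitutions, insertions, and deletions to a reference sequence.

    Each edit is (kind, position, nucleotide) with a 0-based position in the
    reference; nucleotide is None for deletions. An insertion is placed right
    after the nucleotide at the given position.
    """
    sequence = list(reference)
    # Work from the highest position down so that earlier coordinates stay valid.
    for kind, position, nucleotide in sorted(edits, key=lambda e: e[1], reverse=True):
        if not 0 <= position < len(reference):
            raise IndexError("position outside the reference sequence")
        if kind == "substitution":
            sequence[position] = nucleotide
        elif kind == "insertion":
            sequence.insert(position + 1, nucleotide)
        elif kind == "deletion":
            del sequence[position]
        else:
            raise ValueError("unknown edit kind: " + kind)
    return "".join(sequence)


reference = "AATAAAGC"
variant = apply_variant(reference, [
    ("substitution", 2, "G"),   # T at position 2 becomes G
    ("insertion", 5, "T"),      # T inserted right after position 5
    ("deletion", 7, None),      # C at position 7 removed
])
```

- processing the edits from the highest position downward keeps every edit expressed in reference coordinates, which matches a specification given with respect to the reference sequence.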
- the system may maintain a database of sequences along with canonical polyadenylation sites.
- Canonical polyadenylation sites generally refer to polyadenylation sites that have been previously reported or identified using, e.g., genome annotations, cDNA and EST data, RNA-Seq data, or another approach.
- the sequences may be represented as strings (e.g., a sequence) of letters (e.g., representing nucleotides), as substrings from a reference genome (e.g., a human genome), as pointers or genomic coordinates in a reference genome (e.g., a human genome), or another approach.
- the human genome may be used to represent the sequences.
- One or more genetic variants may be identified in a database of reference sequences (e.g., a human genome). Each of the one or more genetic variants may comprise one or more aberrant nucleotide positions in the human genome.
- a genetic variant may be selected from the group consisting of: a substitution at one or more nucleotide positions relative to a reference sequence (e.g., a single nucleotide variant (SNV) or a single nucleotide polymorphism (SNP)), an insertion at one or more nucleotide positions relative to a reference sequence, and a deletion at one or more nucleotide positions relative to a reference sequence.
- a reference sequence may comprise a portion or entirety of a human genome.
- a reference sequence may comprise a portion or entirety of a human reference genome (e.g., GRCh38).
- Genetic variants may be identified using one or more databases of reported variants. Genetic variants may be reported to occur in a cohort of individuals with common characteristics, such as healthy subjects, subjects with a disease state or disorder state, subjects previously diagnosed with a disease state or disorder state, or subjects previously treated for a disease state or disorder state.
- the genetic variant may be mapped to canonical polyadenylation sites from a set of annotated polyadenylation sites. This mapping may be used to identify canonical polyadenylation sites that may be affected by the genetic variant and may include polyadenylation sites wherein the adjacent nucleotides within a window of size $W$ (e.g., in units of nucleotide locations or bases) are altered by the genetic variant, or wherein the genetic variant alters nucleotides within a window of size $W$ centered on other polyadenylation sites.
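- as a minimal sketch, the mapping of a genetic variant to potentially affected polyadenylation sites may be implemented as a window test, as follows. The sketch assumes that sites and variant positions are integer coordinates on the same strand and that the window of size $W$ is centered on each site; the function name and coordinates are hypothetical.

```python
def affected_sites(site_positions, variant_positions, window):
    """Return the sites whose window of size W, centered on the site, contains a variant position."""
    half = window // 2
    hits = []
    for site in site_positions:
        if any(abs(site - v) <= half for v in variant_positions):
            hits.append(site)
    return hits


canonical_sites = [1000, 1500, 2300]
variant_positions = [1490, 2400]
hits = affected_sites(canonical_sites, variant_positions, window=50)
```

- in this example only the site at coordinate 1500 lies within 25 nucleotides of a variant position, so it is the only site returned.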
- Canonical polyadenylation sites may be identified by other approaches.
- Each canonical polyadenylation site may comprise a contiguous segment of mRNA or DNA, or a location within a contiguous segment of mRNA or DNA.
- a plurality of affected candidate polyadenylation sites may be identified.
- a candidate polyadenylation site may comprise a contiguous segment of mRNA or DNA.
- a set of candidate polyadenylation sites may comprise reported alternative polyadenylation sites.
- the plurality of candidate polyadenylation sites may include canonical polyadenylation sites that may be observed in polyadenylation, as determined by examining annotations or cDNA/EST data or RNA-Seq data.
- the plurality of candidate polyadenylation sites may include additional putative polyadenylation sites that the genetic variant may introduce.
- putative polyadenylation sites may be identified with a higher false positive rate than downstream applications require, because the machine learning system described elsewhere herein is capable of determining whether or not such identified putative polyadenylation sites are bona fide polyadenylation sites, thereby achieving a significantly lower false positive rate.
- all nucleotide positions within some window of a reported PAS may be identified as putative polyadenylation sites and included in the plurality of candidate polyadenylation sites.
- feature vectors may be calculated using the reference sequence, as described elsewhere herein.
- the reference sequence may be processed to obtain feature vectors.
- the prediction model may be used to determine a set of preferences for the plurality of candidate polyadenylation sites, $p_1, p_2, \ldots, p_n$, as described elsewhere herein.
- the genetic variant sequence (the reference sequence modified by the genetic variant) may be used to calculate modified feature vectors for the plurality of polyadenylation sites, as described elsewhere herein.
- the modified feature vectors for the $i$th candidate polyadenylation site may be denoted by $\tilde{x}_i$.
- the prediction model may be used to determine a set of modified preferences for the plurality of candidate polyadenylation sites, $\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_n$, as described elsewhere herein.
- the preferences for the plurality of candidate polyadenylation sites may be compared to the modified preferences for the plurality of candidate polyadenylation sites to determine a quantified measure of an effect of the genetic variant. Examples of possible methods of calculating this quantified measure are described elsewhere herein.
- the calculation $r_i \leftarrow f(x_i)$ of the intermediate representation within the first computation module is performed using a neural network, a deep neural network, a convolutional neural network, a recurrent neural network, a long short-term memory recurrent neural network, or another type of machine learning model.
- a convolutional or recurrent neural network may process the feature vectors separately, and the resulting hidden representation may be subsequently fed into another neural network.
- the feature vectors may be concatenated to form one feature vector, which may be processed by a convolutional or a recurrent neural network, or some other type of neural network.
- the feature vectors may be assembled in various ways for processing within the first computation module.
- modified feature vectors may be calculated.
- the modified feature vectors may comprise the one or more genetic variants for each of the plurality of candidate polyadenylation sites.
- the modified feature vectors may be calculated using a modified sequence of the genetic variant (e.g., substitution, insertion, or deletion applied to the reference sequence, which may be derived from the human genome).
- the modified feature vectors for the $i$th candidate polyadenylation site may be denoted by $\tilde{x}_i$.
- a tilde symbol (“~”) may be used to denote a feature vector, an un-normalized preference, or a normalized preference that has been modified by a genetic variant.
- a first computation module may be used to process a modified polyadenylation site feature vector to calculate a set of modified intermediate representations $\tilde{r}_i$ ($\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_n$) for the plurality of candidate polyadenylation sites.
- the modified intermediate representations $\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_n$ may be compared to the unmodified intermediate representations $r_1, r_2, \ldots, r_n$ to determine the effect of the genetic variant.
- a second computation module may be used to calculate a set of modified preferences $\tilde{p}_i$ ($\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_n$) for the plurality of candidate polyadenylation sites. This calculation may be denoted by $\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_n \leftarrow h(\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_n)$, where $h$ is a pre-determined function on one or more modified intermediate representations.
- the intermediate representations may each comprise a single numerical value and the second computation module may be operable to normalize the ith preference for a candidate polyadenylation site by using an exponential function for h, by assigning:
- the second computation module may be operable to normalize the ith preference for a candidate polyadenylation site by using a rectified linear function, relu(), for $h$.
- the second computation module may be operable to normalize the ith preference for a candidate polyadenylation site by using another type of function for h.
- This function may be a monotonic function to preserve order of preferences between a set of intermediate representations and a set of preference values.
- Each modified preference $\tilde{p}_i$ among the set of modified preferences ($\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_n$) may indicate a probability of selection of the $i$th candidate polyadenylation site among the plurality of candidate polyadenylation sites in a polyadenylation process.
- a maximally preferred candidate polyadenylation site may be identified among the plurality of candidate polyadenylation sites by selecting the candidate polyadenylation site with a largest value of modified preference $\tilde{p}_{\max}$ among the set of modified preferences ($\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_n$).
- the effect of the genetic variant may be quantified by comparing the preferences of the plurality of candidate polyadenylation sites to the modified preferences. Based at least on this comparison, a quantitative measure may be generated and/or outputted. For example, if the maximally preferred candidate polyadenylation sites in the unmodified and modified cases, $p_{\max}$ and $\tilde{p}_{\max}$, are different, a binary flag may be set to indicate a change.
- a maximally preferred candidate polyadenylation site may be identified among the plurality of candidate polyadenylation sites by selecting the candidate polyadenylation site with a largest value of modified intermediate representation $\tilde{r}_{\max}$ among the set of modified intermediate representations ($\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_n$).
- a set of changes in preference $\Delta p_i$ ($\Delta p_1, \Delta p_2, \ldots, \Delta p_n$) may be calculated for the plurality of candidate polyadenylation sites.
- the changes in preference may be computed using various methods.
- the set of changes in preference may comprise a change in preference for a canonical polyadenylation site $\Delta p_c$, $c \in \{1, \ldots, n\}$, which is of particular interest and importance, since any deviation from the canonical polyadenylation site pattern may be indicative of pathogenicity.
- the canonical polyadenylation site may be determined by examining genome annotations, examining cDNA libraries, or by other approaches.
- a total probability mass change $\Delta P$ may be calculated between the set of preferences $p_i$ and the set of modified preferences $\tilde{p}_i$.
- the total probability mass change may be given by:
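- as a minimal sketch, the per-site changes in preference, the binary change flag, and a total probability mass change may be computed as follows. The total probability mass change is assumed here to be half the summed absolute change in preference (the total variation distance), which is only one possible definition; all names and values are hypothetical.

```python
def preference_changes(reference_prefs, modified_prefs):
    """Per-site change in preference: modified preference minus reference preference."""
    return [m - r for r, m in zip(reference_prefs, modified_prefs)]


def top_site_changed(reference_prefs, modified_prefs):
    """Binary flag: True if the maximally preferred site differs between the two cases."""
    top_ref = max(range(len(reference_prefs)), key=lambda i: reference_prefs[i])
    top_mod = max(range(len(modified_prefs)), key=lambda i: modified_prefs[i])
    return top_ref != top_mod


def total_mass_change(reference_prefs, modified_prefs):
    """Assumed definition: half the summed absolute change in preference."""
    deltas = preference_changes(reference_prefs, modified_prefs)
    return 0.5 * sum(abs(d) for d in deltas)


p_ref = [0.7, 0.2, 0.1]   # preferences calculated from the reference sequence
p_mod = [0.3, 0.6, 0.1]   # preferences calculated from the variant sequence
deltas = preference_changes(p_ref, p_mod)
flag = top_site_changed(p_ref, p_mod)
mass = total_mass_change(p_ref, p_mod)
```

- with this assumed definition, the value ranges from 0 (identical preferences) to 1 (all preference shifted to sites that previously had none), and the change for a canonical site $c$ is simply the entry of `deltas` at index $c$.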
- a potentially cryptic polyadenylation site may be identified as a putative polyadenylation site (e.g., different from the canonical
- the preferences described above may be fed into another module that uses them to determine whether a specific disease is likely.
- An effect of the one or more genetic variants on the set of candidate polyadenylation sites may be determined by comparing the set of intermediate representations $r_i$ ($r_1, r_2, \ldots, r_n$) and the set of modified intermediate representations $\tilde{r}_i$ ($\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_n$).
- Changes in one or more phenotypes in a subject may be identified by sequencing ribonucleic acid (RNA) molecules or deoxyribonucleic acid (DNA) molecules from a bodily sample obtained from the subject to produce a plurality of sequence reads and identifying one or more genetic variants in the plurality of sequence reads.
- a set of polyadenylation sites associated with the one or more genetic variants may be identified.
- a set of modified preferences of the set of polyadenylation sites may then be determined, and a set of normalized preferences may also be determined using the reference sequence. These two sets of preferences may be compared to identify or detect changes in one or more phenotypes in the subject.
- the changes in one or more phenotypes in the subject may be identified or detected at a probability of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or greater.
- the effect of the genetic variant may be determined as described elsewhere herein.
- This effect of the genetic variant may be used to identify changes in one or more phenotypes in the subject at a probability of at least about 50%, e.g., by performing correlation studies of cohorts of subjects with reported genetic variants (e.g., DNA mutations) by comparing the changes in preferences to reported changes in one or more phenotypes (e.g., diseases or disorders).
- the probability may indicate a likelihood that a subject with the genetic variant is exhibiting, may exhibit in the future, or is expected to exhibit the change in one or more phenotypes.
- the probability may be at least about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%.
- a machine learning algorithm may be used to identify the set of polyadenylation sites.
- the set of polyadenylation sites may comprise one or more polyadenylation sites reported to be associated with one or more polyadenylated mRNA sequences.
- RNA molecules may be subjected to reverse transcription (e.g., RT) and/or reverse transcription polymerase chain reaction (e.g., RT-PCR) to generate complementary DNA (cDNA) molecules.
- the cDNA may then be sequenced to produce the plurality of sequence reads.
- the RNA molecules may be messenger RNA (mRNA).
- a library of probes may be generated to enrich for a set of polyadenylation sites in a nucleic acid sample of a subject.
- the set of polyadenylation sites may be generated using a preference computation module, as described elsewhere herein, and may correspond to genetic variants in the nucleic acid sample.
- the set of polyadenylation sites may identify changes in one or more phenotypes in the subject at a probability of at least about 90%. The probability may be at least about 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%.
- the set of polyadenylation sites may comprise one or more polyadenylation sites reported to be associated with one or more polyadenylation events.
- FIG. 2 shows a computer system 201 that is programmed or otherwise configured to determine effects of a genetic variant on a set of polyadenylation sites.
- the computer system 201 can regulate various aspects of the present disclosure, such as, for example, determining a set of preferences of a plurality of candidate polyadenylation sites.
- the computer system 201 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
- the electronic device can be a mobile electronic device.
- the computer system 201 includes a central processing unit 205 (CPU, also “processor” and “computer processor” herein), which can be a single-core or multi-core processor, or a plurality of processors for parallel processing.
- the computer system 201 also includes memory or memory location 210 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 215 (e.g., hard disk), communication interface 220 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 225, such as cache, other memory, data storage and/or electronic display adapters.
- the memory 210, storage unit 215, interface 220, and peripheral devices 225 are in communication with the CPU 205 through a communication bus (solid lines), such as a motherboard.
- the storage unit 215 can be a data storage unit (or data repository) for storing data.
- the computer system 201 can be operatively coupled to a computer network 230 (“network”) with the aid of the communication interface 220.
- the network 230 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 230 in some cases is a telecommunication and/or data network.
- the network 230 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- the network 230, in some cases with the aid of the computer system 201, can implement a peer-to-peer network, which may enable devices coupled to the computer system 201 to behave as a client or a server.
- the CPU 205 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 210.
- the instructions can be directed to the CPU 205, which can subsequently program or otherwise configure the CPU 205 to implement methods of the present disclosure. Examples of operations performed by the CPU 205 can include fetch, decode, execute, and writeback.
- the CPU 205 can be part of a circuit, such as an integrated circuit. One or more other components of the system 201 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
- the storage unit 215 can store files, such as drivers, libraries, and saved programs.
- the storage unit 215 can store user data, e.g., user preferences and user programs.
- the computer system 201 in some cases can include one or more additional data storage units that are external to the computer system 201, such as located on a remote server that is in communication with the computer system 201 through an intranet or the Internet.
- the computer system 201 can communicate with one or more remote computer systems through the network 230.
- the computer system 201 can communicate with a remote computer system of a user.
- Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
- the user can access the computer system 201 via the network 230.
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 201, such as, for example, on the memory 210 or electronic storage unit 215.
- the machine executable or machine readable code can be provided in the form of software.
- the code can be executed by the processor 205.
- the code can be retrieved from the storage unit 215 and stored on the memory 210 for ready access by the processor 205.
- the electronic storage unit 215 can be precluded, and machine-executable instructions are stored on memory 210.
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical, and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- a machine readable medium such as computer-executable code
- a tangible storage medium such as computer-executable code
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
- Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 201 can include or be in communication with an electronic display 235 that comprises a user interface (UI) 240 for providing, for example, an approach for user selection of a monotonic function.
- UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
- An algorithm can be implemented by way of software upon execution by the central processing unit 205.
- the algorithm can, for example, determine a set of preferences of a plurality of candidate polyadenylation sites.
- A composition of the present disclosure (e.g., an antisense oligonucleotide) may be administered to a subject, such as for therapeutic purposes.
- the composition may be included in a formulation with a therapeutically effective amount of the composition.
- the formulation may include one or more excipients (e.g., a flavorant, colorant, buffer, preservation agent, etc.).
- kits comprising a composition of the present disclosure and instructions usable by a user to administer the composition to a subject.
- the user may be the subject or a healthcare provider of the subject.
- the instructions may direct the user to administer the composition (e.g., in a formulation) to the subject at a given dosing regimen.
- the model is applied to all PAS in the 3’-UTR of each gene.
- a score for each site is computed from the logits (the output of the PAS strength predictor shown in FIG. 1), where a larger value suggests that the site is more likely to be selected.
- the target is defined by the PAS in each gene which has the most measured reads in the 3’-Seq data.
- the metric reported here is the prediction accuracy, or the percentage of genes in which the model has correctly predicted the PAS that has the most reads. This is shown in Table 2 for genes with two to six sites, averaged across all tissues.
- the number of genes used in this evaluation is 2270, 2043, 1745, 1364, and 1163, respectively, where a gene is included only if at least one of its sites has more than 10 reads.
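The accuracy metric described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code behind Table 2: the function name `top_site_accuracy`, the dictionary layout, and the toy scores and read counts are all hypothetical.

```python
from typing import Dict, List


def top_site_accuracy(predicted: Dict[str, List[float]],
                      reads: Dict[str, List[int]],
                      min_reads: int = 10) -> float:
    """Fraction of genes whose highest-scoring PAS is also the PAS with the most reads."""
    correct = 0
    total = 0
    for gene, scores in predicted.items():
        counts = reads[gene]
        # A gene is included only if at least one site has more than `min_reads` reads.
        if max(counts) <= min_reads:
            continue
        total += 1
        predicted_best = max(range(len(scores)), key=scores.__getitem__)
        observed_best = max(range(len(counts)), key=counts.__getitem__)
        correct += predicted_best == observed_best
    return correct / total if total else float("nan")


# Toy data (hypothetical, for illustration only).
toy_scores = {"geneA": [0.2, 1.3], "geneB": [0.9, -0.4, 0.1], "geneC": [0.5, 0.7]}
toy_reads = {"geneA": [12, 80], "geneB": [5, 40, 3], "geneC": [4, 9]}
accuracy = top_site_accuracy(toy_scores, toy_reads)
```

In this toy case `geneC` is filtered out by the read threshold, so the accuracy is computed over the two remaining genes.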
- FIGs. 3A and 3B illustrate classification performance of ClinVar variants near polyadenylation sites, including ROC curves comparing the variant classification performance of the Conv-Net and the Feature-Net (FIG. 3A), wherein the shaded region shows the one standard deviation zone computed by bootstrapping, and ROC curves comparing performance of a model disclosed herein against other predictors (FIG. 3B). AUC values are shown in the figure legend.
- An advantage of the model disclosed herein is that the PAS strength predictor can be used to characterize individual sites based only on the input sequence. This model is evaluated for suitability and performance of use for pathogenicity predictions.
- the basic approach comprises applying the model to the 200-nucleotide sequence associated with a PAS from the reference genome to first generate a prediction of its strength, and then performing another prediction when one or more nucleotides in the sequence are altered. A difference is then computed between the reference prediction and the variant prediction. Since there are eight predictions, one for each tissue, the largest difference is taken as the score to assess
- a similar approach can be applied to splicing variants (as described, for example, by Xiong et al., The human splicing code reveals new insights into the genetic determinants of disease, Science, 2014; which is hereby incorporated by reference in its entirety).
- a postulate may be that if a variant causes a large change to the strength of a PAS, this can change the relative abundance of differentially 3’-UTR terminated transcripts that deviate from normal, potentially indicating disease associations.
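The reference-versus-variant scoring described above can be sketched as follows. This is only an illustration: `toy_predict` is a hypothetical stand-in for the PAS strength predictor (it simply checks for an AATAAA signal), and taking the largest difference by absolute value is an assumption about what "largest difference" means.

```python
from typing import Callable, Sequence


def variant_score(predict: Callable[[str], Sequence[float]],
                  ref_seq: str, pos: int, alt: str) -> float:
    """Largest per-tissue change in predicted strength caused by a substitution."""
    var_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    ref_pred = predict(ref_seq)
    var_pred = predict(var_seq)
    diffs = [v - r for r, v in zip(ref_pred, var_pred)]
    # Assumption: "largest" is taken by magnitude, keeping the sign of the change.
    return max(diffs, key=abs)


def toy_predict(seq: str) -> Sequence[float]:
    """Hypothetical stand-in predictor returning one value per tissue (eight tissues)."""
    signal = 1.0 if "AATAAA" in seq else 0.0
    return [signal * (0.5 + 0.1 * t) for t in range(8)]


# A 200-nucleotide toy sequence with one canonical signal (0-based indices 97 to 102).
ref_seq = "C" * 97 + "AATAAA" + "C" * 97
score = variant_score(toy_predict, ref_seq, pos=99, alt="C")
```

Here the substitution destroys the only signal in the toy sequence, so the score is negative, indicating a predicted loss of strength.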
- FIG. 3A shows the ROC curve for this classification task.
- the model can distinguish pathogenic variants from benign ones with an AUC of 0.98 ± 0.02 and 0.97 ± 0.02, for the Conv-Net and Feature-Net respectively, both with a p-value of less than 1 × 10^-8.
- Although AUCs are essentially identical for both models, there is a clear advantage in the performance characteristic of the Conv-Net: it outperforms in the low false positive rate region where variant classification matters.
- an input of zero is used for the position feature of the strength model, since each variant is not analyzed with respect to neighboring sites.
- Genomic Evolutionary Rate Profiling (GERP) (as described, for example, by Cooper et al., Distribution and intensity of constraint in mammalian genomic sequence, Genome Res.
- phastCons (as described, for example, by Siepel et al., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes, Genome Res., 2005; which is hereby incorporated by reference in its entirety)
- phyloP (as described, for example, by Pollard et al., Detection of nonneutral substitution rates on mammalian phylogenies, Genome Res.
- the pathogenicity score from the model described herein compares favorably, even though it is not explicitly trained for this task.
- Although the model performs well for this ClinVar dataset, in general, a large difference in PAS strength does not necessarily imply pathogenicity, which is a phenotype that can be many steps downstream of 3’-end processing.
- the model described herein can also be used to search for potential variants that may affect the regulation of polyadenylation.
- the model is applied and a mutation map is generated (as described, for example, by Alipanahi et al., Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning, Nat.
- FIG. 4 illustrates a mutation map of the genomic region chr11:
- Each square represents a change in the model’s score if the original base is substituted.
- the substituted base is represented in each row in the order ‘ACGT’.
- Different shades or colors can be used to denote mutations that may increase or decrease the likelihood (e.g., preference) of the PAS for cleavage and polyadenylation.
- the polyadenylation signal is identified as an important region relative to other bases in the sequence.
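A mutation map of the kind described above can be sketched as an exhaustive single-base substitution scan. This is a minimal sketch, assuming a scalar scoring function; `toy_score` is a hypothetical stand-in for the trained model and only counts canonical signals.

```python
from typing import Callable, List

BASES = "ACGT"  # row order of the mutation map


def mutation_map(score: Callable[[str], float], seq: str) -> List[List[float]]:
    """Return a 4 x len(seq) matrix of score changes for every single-base substitution."""
    reference = score(seq)
    rows = []
    for base in BASES:
        row = []
        for i in range(len(seq)):
            mutated = seq[:i] + base + seq[i + 1:]
            # Substituting the original base leaves the sequence unchanged, giving 0.0.
            row.append(score(mutated) - reference)
        rows.append(row)
    return rows


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for the model's score: number of AATAAA signals."""
    return float(seq.count("AATAAA"))


toy_seq = "GGC" + "AATAAA" + "CGG"
mmap = mutation_map(toy_score, toy_seq)
```

Each column of `mmap` corresponds to one position of the input, and each row to one substituted base, matching the layout of one square per substitution.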
- a model is trained by centering the input sequence around a PAS at the cleavage site. If a PAS is off-center of the 200-nucleotide input sequence, or when no PAS is present, then the predicted PAS strength of the sequence may be small, due to the lack of sequence elements necessary for cleavage and polyadenylation. Alternatively, if the output of the PAS strength predictor is large, it may suggest that a PAS is present and is positioned near the center of the input sequence. The model may be evaluated for suitability of translation across the genome to find potential PAS. The model disclosed herein may not be explicitly trained for this purpose.
- a section of the human genome is selected and the Conv-Net strength model is applied to the section in a base-by-base manner (as described, for example, in Example 10).
- the average strength prediction from all eight tissues, without application of any filtering or thresholding, is shown.
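The base-by-base scan described above can be sketched as a sliding 200-nucleotide window whose per-tissue outputs are averaged. This is an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the Conv-Net strength model, and its rule (a signal roughly 20 nucleotides upstream of the window center) is invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def scan_strength(predict: Callable[[str], Sequence[float]],
                  genome: str, window: int = 200) -> List[Tuple[int, float]]:
    """Slide a window one base at a time and average the per-tissue outputs."""
    half = window // 2
    track = []
    for center in range(half, len(genome) - half + 1):
        outputs = predict(genome[center - half:center + half])
        track.append((center, sum(outputs) / len(outputs)))
    return track


def toy_predict(window_seq: str) -> Sequence[float]:
    """Hypothetical stand-in: strong only when AATAAA sits just upstream of the center."""
    strength = 1.0 if window_seq[78:84] == "AATAAA" else -1.0
    return [strength] * 8  # one output per tissue


toy_genome = "C" * 300 + "AATAAA" + "C" * 294
track = scan_strength(toy_predict, toy_genome)
peak_position, peak_value = max(track, key=lambda item: item[1])
```

No filtering or thresholding is applied to `track`, mirroring the description; a peak marks a window whose center is a candidate PAS.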
- a region of the genome with multiple PAS is chosen, where there are differences between annotation sources.
- Region B is less well-defined, is weaker, and approximately aligns with the predicted positions from another PAS predictor (as described, for example, by Cheng et al., Prediction of mRNA polyadenylation sites by support vector machine, Bioinformatics, 2006; which is hereby incorporated by reference in its entirety), as well as the muscle track from PolyA-Seq (in light gray). Finally, a small peak is observed in Region C, predicted to be a very weak PAS, which is present in PolyA-Seq. Note that the model is trained only from 3’-Seq reads and has no knowledge of RNA-Seq information from other datasets or other annotation sources.
- Each sequence is fed as input into the strength predictor, and the outputs from all tissues are averaged into a single value which is used for classification.
- the positional information of the sequence is not used (i.e., it has a position feature of zero).
- the AUC to classify sequences with PAS from negative sequences for the LR, Feature-Net, and the Conv-Net are measured as 0.887 ± 0.003, 0.895 ± 0.004, and 0.907 ± 0.004, respectively.
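The classification step described above (averaging tissue outputs into one value, then measuring AUC) can be sketched as follows. This is a minimal sketch using the pairwise-ranking definition of AUC; the tissue outputs are hypothetical toy values, and only two tissues are shown for brevity.

```python
from typing import List, Sequence


def average_outputs(tissue_outputs: Sequence[float]) -> float:
    """Collapse the per-tissue strength predictions into a single value."""
    return sum(tissue_outputs) / len(tissue_outputs)


def auc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy per-tissue outputs (hypothetical values).
positives = [average_outputs(o) for o in ([0.9, 0.7], [0.2, 0.4])]
negatives = [average_outputs(o) for o in ([0.1, 0.3], [0.5, 0.5])]
toy_auc = auc(positives, negatives)
```

The quadratic pairwise loop is chosen for clarity; a rank-based computation would be preferable for large evaluation sets.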
- 19% contain one of the two canonical polyadenylation signals (AAUAAA and AUUAAA), and 74% contain at least one of the reported polyadenylation signals (as described, for example, in Example 8), indicating the model can distinguish real PAS from background. It does not simply look for the presence of polyadenylation signals to detect PAS in the genome.
- Anti-sense oligonucleotides therapies may include targeting RNAs via
- RNA function by blocking the access of cellular machinery to the RNA.
- Application of this approach is demonstrated in the 3’-UTR, where oligonucleotides targeting polyadenylation signals and sites modulated the abundance of an mRNA (as described, for example, by Vickers et al., Fully modified 2’ MOE oligonucleotides redirect polyadenylation, Nucleic Acids Res., 2001; which is hereby incorporated by reference in its entirety). Based on this, the utility of the model disclosed herein is shown to provide an in silico evaluation of oligonucleotides targeting regions near the PAS.
- Three distinct forms of the transcript (Type 1, Type 2, and Type 3) are described in the study.
- a schematic of the E-selectin mRNA and the position of the polyadenylation signal, along with the targeted region of the oligonucleotides used is shown in FIG. 5 (left). All three forms harbor the canonical polyadenylation signal AAUAAA.
- a non-canonical polyadenylation signal AGUAAA is also present between the Type 1 and Type 2 cleavage site, which is selected when the corresponding signals from Type 1 and Type 2 are blocked.
- it is referred to as the Type 4 form of the transcript.
- Type 3 is by far the dominant form of the transcript, followed by Type 1 and Type 2 (no differentiation is reported between them).
- Type 4 is the least common.
- the predicted strengths for the corresponding PAS for Type 1 to 4 transcripts are respectively: -0.242, -0.420, 0.020, -0.765. These values do not account for the position of the PAS. If the relative positions of the 4 PAS are provided to the model, then the strengths become: -0.242, -0.170, 0.606, -0.584 (where Type 1 is assumed to be in position zero). These predictions match the observed abundances of this mRNA from the study.
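As a worked check of the numbers above, the reported strengths can be ranked with and without the position feature. The values are taken directly from the text; the helper `rank` is introduced here only for illustration.

```python
types = ["Type 1", "Type 2", "Type 3", "Type 4"]
without_position = [-0.242, -0.420, 0.020, -0.765]
with_position = [-0.242, -0.170, 0.606, -0.584]


def rank(strengths):
    """Order transcript types from strongest to weakest predicted PAS."""
    return [name for _, name in sorted(zip(strengths, types), reverse=True)]


ranking_without = rank(without_position)
ranking_with = rank(with_position)
# Change in predicted strength once relative position is supplied to the model.
shift = [round(w - wo, 3) for w, wo in zip(with_position, without_position)]
```

In both rankings Type 3 comes first and Type 4 last, which is the part of the ordering that the study resolves. The shift is zero for Type 1 because it is assumed to be at position zero.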
- the Vickers et al. study performs a non-quantitative RT-PCR to assess the abundance of isoforms by administering different combinations of oligonucleotides targeting select regions of the transcript. To simulate this, the same regions of the input sequence complementary to the oligonucleotides are blocked by replacing the nucleotides with an N base, and the resulting strengths of each PAS are predicted. The results are shown in FIG. 5, where predicted PAS strength is shown and arranged in an image to match the gel from the Vickers et al. study (right), simulating the effects of blocked nucleotides due to oligonucleotide treatment.
- the oligonucleotides applied are shown on top of each column. Each column is scaled such that the sum of the intensities of each column is constant, but otherwise, no additional processing is performed.
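The simulated oligonucleotide blocking and column scaling described above can be sketched as follows. This is a sketch under stated assumptions: the sequence and coordinates are hypothetical, and exponentiating strengths to obtain non-negative intensities is an assumption, since the text only states that each column is scaled to a constant sum.

```python
import math
from typing import List


def block_with_n(seq: str, start: int, end: int) -> str:
    """Replace seq[start:end] with N bases to mimic an oligonucleotide masking that region."""
    return seq[:start] + "N" * (end - start) + seq[end:]


def scale_column(strengths: List[float], total: float = 1.0) -> List[float]:
    """Map predicted strengths to non-negative intensities that sum to `total`.

    Exponentiation is an assumption made for this sketch, not a detail from the source.
    """
    intensities = [math.exp(s) for s in strengths]
    norm = sum(intensities)
    return [total * value / norm for value in intensities]


# Hypothetical sequence and oligonucleotide coordinates, for illustration only.
sequence = "GCGC" + "AATAAA" + "GCGCGCGCGC"
masked = block_with_n(sequence, 4, 10)
# Position-aware strengths for Type 1 to Type 4 as reported in the text (no oligonucleotide).
column = scale_column([-0.242, -0.170, 0.606, -0.584])
```

In a full simulation, each oligonucleotide combination would yield one masked sequence per PAS, one model prediction per masked sequence, and one scaled column per combination.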
- the Vickers et al. study does not provide values from RT-PCR that permit quantitative comparison with the output of our model, but qualitatively, patterns of polyadenylation are generally captured. Note that the Vickers et al. study mentions that Type 1 and 2 transcripts are shorter and therefore more efficiently amplified by PCR, and thus appear brighter than expected compared to Type 3. This experimental bias does not affect the simulated RT-PCR results shown in FIG. 5.
- Example 5 Effect of genomic features on the model’s predictions
- Table 3 shows each model’s classification performance. Even though the polyadenylation signals are generally considered to be a main signature of PAS, they only partially account for the predictive performance for PAS selection compared to the full feature set.
- n-mer features are major contributors to the Feature-Net’s performance, which is sufficiently rich to capture many motif patterns.
- Each feature group may have a different number of features (as described, for example, in Example 8), and therefore individual features in the larger feature groups may contribute only weakly, but as a whole affect predictions considerably.
- Position alone may have poor predictive capability, even though it has been suggested to be a key feature in determining whether a PAS is used for tissue-specific regulation.
- an investigation is conducted on the uniqueness of each feature group, by training models with all features minus each feature group from Table 3. Removing the polyadenylation signals from the feature set reduces the performance from 0.866 ± 0.004 to 0.840 ± 0.008. All other groups, when removed, do not significantly reduce the performance of the model compared to the full feature set. This suggests that many features are redundant, and if removed, can be compensated by features in another group.
- the gradient of the output of the neural network with respect to the input feature vector of the neural network is computed.
- This is referred to as the feature saliency of a prediction of the neural network, and the gradients of features with large magnitudes can be interpreted as those that need to change the least to affect the prediction the most (as described, for example, by Simonyan et al, Deep inside convolutional networks: visualising image classification models and saliency maps, Proc. of the Int. Conf. on Learn. Representations, 2014; which is hereby incorporated by reference in its entirety).
- the feature saliency values of each of the sites in the test set are computed, and the features that on average have the largest magnitude are selected.
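The gradient-based feature saliency described above can be sketched with a tiny network whose input gradient is computed analytically. This is an illustration only: the one-hidden-layer tanh architecture, the weights, and the inputs are hypothetical and are not the Feature-Net.

```python
import math
from typing import List, Tuple


def forward_and_gradient(x: List[float], W1: List[List[float]],
                         b1: List[float], w2: List[float]) -> Tuple[float, List[float]]:
    """Output y = w2 . tanh(W1 x + b1) and its gradient with respect to the input x."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    y = sum(w * h for w, h in zip(w2, hidden))
    # Chain rule: dy/dx_j = sum_k w2[k] * (1 - hidden[k]^2) * W1[k][j]
    grad = [sum(w2[k] * (1.0 - hidden[k] ** 2) * W1[k][j] for k in range(len(W1)))
            for j in range(len(x))]
    return y, grad


def mean_saliency(inputs: List[List[float]], W1, b1, w2) -> List[float]:
    """Average gradient magnitude of each feature over all sites."""
    totals = [0.0] * len(inputs[0])
    for x in inputs:
        _, grad = forward_and_gradient(x, W1, b1, w2)
        for j, g in enumerate(grad):
            totals[j] += abs(g)
    return [t / len(inputs) for t in totals]


# Hypothetical weights and feature vectors, for illustration only.
W1 = [[2.0, 0.0, -0.5], [0.0, 0.1, 1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, -1.0]
inputs = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0]]
saliency = mean_saliency(inputs, W1, b1, w2)
top_features = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
```

The sign of each gradient, which this sketch discards when averaging magnitudes, is what indicates whether a feature tends to increase or decrease the predicted strength.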
- Table 4 shows the top 15 features computed using this method and the direction in which the feature affects the strength of a PAS, where an up arrow indicates that the effect is positive.
- Table 4 Top 15 features of the Feature-Net, and the direction in which each feature can increase (↑) or decrease (↓) the strength of a polyadenylation site.
- the top three features are consistent for all tissue types. Other features vary slightly between tissues and are grouped together unordered. As expected, the two most common canonical polyadenylation signals are the top features which increase the strength of a PAS. The log distance between PAS is also deemed to be important. Some features in this list are consistent with mechanisms of core elements previously reported to be involved in cleavage and polyadenylation, including the upstream UGUA motif which the cleavage factor Im complex binds to, and a GU-rich sequence near the polyadenylation site. The genomic context upstream of the PAS appears to be more important, as most of the top features are in either the 5’-5’ or 5’-3’ region. Three of the features may reduce the strength of a PAS.
- RNA polymerase II interacts with CA-rich RNA sequences, and has been reported to play a role in inhibiting polyadenylation (as described, for example, by Kaneko and Manley et al, The Mammalian RNA Polymerase II C-Terminal Domain Interacts with RNA to Suppress
- Example 6 Determining tissue-specific polyadenylation features
- the set of tissue-specific and constitutive PAS described, for example, by Weng et al. are selected, and the Feature-Net is applied to this set of PAS to generate predictions.
- the gradient-based method described in Example 5 is used to examine the top 200 most confident predictions for tissue-specific PAS, where the model predicts that at least one of the tissue outputs is considerably different than the rest, and for constitutive PAS, where the model predicts that all tissue outputs do not differ significantly.
- the magnitude of the gradients is then analyzed to see which features have a statistically greater effect on tissue-specific PAS compared to constitutive PAS.
- Table 5 Features associated with prediction of tissue-specific polyadenylation sites, and whether the presence of the feature makes a polyadenylation site more (↑) or less (↓) tissue-specific.
- All but one of the entries in the table describe features that are in the 5’-5’ and 3’-3’ region, that is, most of them are located away from the cleavage site (as described, for example, in Example 8).
- Various G/U-rich features top the list, where their position upstream suggests the PAS is more likely to be constitutive, but if downstream, the PAS is more likely to be tissue specific. Polyadenylation signals are absent from the list.
- sequence signatures may not be fully predictive since tissue-specific proteins can act by modulating core polyadenylation proteins instead of directly binding to the transcript (as described, for example, by MacDonald et al., Tissue-specific mechanisms of alternative polyadenylation: testis, brain, and beyond, Wiley Interdiscip. Rev. RNA, 2010; which is hereby incorporated by reference in its entirety).
- an APA model can be used to assess the effects of genetically defined therapies, such as oligonucleotide therapies as described in Example 4. By combining this example with Example 4, the resulting system and method can be used to identify oligonucleotide therapies that act in a tissue-specific manner.
- the system and method can be used to identify other genetically defined therapies, such as gene editing and gene therapies.
- Example 7 A convolutional neural network model of polyadenylation to predict the effect of genomic variations
- Conv-Net may learn a better model absent any insights or hypotheses about mechanism. This is surprising at first, but perhaps not so if viewed in the context of other applications of machine learning like computer vision, where hand-crafted features have been largely superseded by models which learn directly from image pixels.
- the Conv-Net has additional advantages that may not be available in feature-based models. For instance, it is completely free to discover novel sequence elements that may be relevant for polyadenylation regulation from data.
- An example set of filters from the Conv-Net model is shown in Example 12. It also has the potential to be more
- Feature extraction from sequences can be the most computationally intensive aspect of a model during inference. This is not required for models that directly operate on sequences. There are additional operations that are required in the Conv-Net, but these computations can be significantly sped up by graphics processing units, which can be important for application of the model to entire genomes.
- Since the Conv-Net operates directly on the genomic sequence, it also enables one to perform analysis at the single-base resolution more naturally. By analyzing the flow of gradients, the Conv-Net can determine how each base in the input sequence changes the output of the model. If a model requires feature extraction, such as the Feature-Net, the output must be analyzed relative to each feature. Furthermore, in the Feature-Net, many features are derived in discrete sections of the genome (four in this case, as described, for example, in Example 8) to reduce the dimensionality of the input.
- the Conv-Net, on the other hand, is more efficient at sharing model parameters, thereby enabling the motif filters to be applied at much finer spatial steps across a genomic sequence (a stride of 1 is used), while still keeping overfitting manageable during training.
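The stride-1 application of a motif filter mentioned above can be sketched as a one-dimensional convolution over a one-hot encoded sequence. This is a minimal sketch: the filter here is hand-built from the AATAAA signal for illustration, whereas the Conv-Net's filters are learned, and encoding N as all zeros is an assumption.

```python
from typing import List

BASES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode each base as a 4-vector; any other symbol (e.g., N) becomes all zeros."""
    return [[1.0 if symbol == base else 0.0 for base in BASES] for symbol in seq]


def conv1d(encoded: List[List[float]], filt: List[List[float]], stride: int = 1) -> List[float]:
    """Slide the filter along the sequence and return one activation per position."""
    width = len(filt)
    return [
        sum(filt[k][c] * encoded[i + k][c] for k in range(width) for c in range(4))
        for i in range(0, len(encoded) - width + 1, stride)
    ]


# Hand-built filter that responds maximally to AATAAA (learned filters would differ).
motif_filter = one_hot("AATAAA")
activations = conv1d(one_hot("CGCAATAAAGC"), motif_filter)
best_position = activations.index(max(activations))
```

Because the same filter weights are reused at every position, a stride of 1 adds no parameters, which is the parameter-sharing property the text refers to.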
- analysis regarding the magnitude and direction of the effect of each base on the model’s output can be performed. This has the potential to offer a prescription to the design of oligonucleotides for anti-sense therapies.
- FIG. 6 shows the saliency map of a region of the oligo-targeted mRNA examined in Example 4, which spans the first three polyadenylation signals.
- an APA model can be used to assess the effects of genetically defined therapies, such as oligonucleotide therapies as described in Example 4. By combining this example with Example 4, the resulting convolutional neural network-based system and method can be used to identify oligonucleotide therapies that act in a tissue-specific manner.
- the convolutional neural network-based system and method can be used to identify other genetically defined therapies, such as gene editing and gene therapies.
- FIG. 7 illustrates regions around a polyadenylation site where features are extracted.
- a given sequence comprising 200 nucleotides may include four different regions (a 5’-5’ region which is 60 nucleotides in length, a 5’-3’ region which is 40 nucleotides in length, a 3’-5’ region which is 40 nucleotides in length, and a 3’-3’ region which is 60 nucleotides in length) such that a Poly(A) site is found in between the 5’-3’ and the 3’-5’ regions.
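The four-region layout described above can be sketched as a simple split of the 200-nucleotide input. The region names and lengths come from the text; the function name `split_regions` is introduced here for illustration.

```python
from typing import Dict

# Region names and lengths, in 5' to 3' order; the Poly(A) site lies between the middle two.
REGIONS = (("5'-5'", 60), ("5'-3'", 40), ("3'-5'", 40), ("3'-3'", 60))


def split_regions(seq: str) -> Dict[str, str]:
    """Split a 200-nucleotide sequence into the four feature-extraction regions."""
    expected = sum(length for _, length in REGIONS)
    if len(seq) != expected:
        raise ValueError(f"expected {expected} nucleotides, got {len(seq)}")
    regions = {}
    start = 0
    for name, length in REGIONS:
        regions[name] = seq[start:start + length]
        start += length
    return regions


toy_sequence = "A" * 60 + "C" * 40 + "G" * 40 + "T" * 60
toy_regions = split_regions(toy_sequence)
```

Region-specific features (for example, n-mer counts) would then be computed on each of the four substrings separately.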
- Table 6 illustrates examples of feature groups and corresponding regions and number of features. A “*” indicates redundant features that are present in multiple feature groups, which are removed.
- Cis-Elements (as described, for example, in Table 1 by Hu et al., Bioinformatic identification of candidate cis-regulatory elements involved in human mRNA polyadenylation, RNA; which is hereby incorporated by reference in its entirety).
- RNA Binding Motifs in IUPAC notation: CPEB1: UUUUAU, hnRNP-H1: GGGAGG, hnRNP-H2: GGAGGG, MBNL_v1: GCUUGC, MBNL_v2: YGCY, MBNL_v3: YGCUKY, PABPN1: ARAAGA, PTBP1: UUUUCU, NOVA: UCAY, PCBP1: CCWWHCC, PCBP2: CCYYCCH, ESRP2: UGGGRAD, hnRNP-F/H_v1: GGGA, hnRNP-F/H_v2:
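Motifs written in IUPAC notation, such as those listed above, can be matched by expanding each ambiguity code into a character class. This is a minimal sketch, assuming an RNA alphabet and overlapping matches; the function name `count_motif` and the test sequences are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes, expressed over the RNA alphabet.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "K": "[GU]", "M": "[AC]",
    "S": "[CG]", "W": "[AU]", "B": "[CGU]", "D": "[AGU]",
    "H": "[ACU]", "V": "[ACG]", "N": "[ACGU]",
}


def count_motif(motif: str, rna: str) -> int:
    """Count occurrences of an IUPAC motif in an RNA sequence, including overlaps."""
    # A lookahead matches at every start position without consuming characters.
    pattern = "(?=" + "".join(IUPAC[code] for code in motif) + ")"
    return len(re.findall(pattern, rna))


mbnl_count = count_motif("YGCY", "UGCUGCC")   # MBNL_v2 motif from the list above
nova_count = count_motif("UCAY", "UCAUUCAC")  # NOVA motif from the list above
```

Counts like these, computed per region, are one plausible way to turn a motif list into numeric features; whether the source uses counts or presence indicators is not stated here.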
- an APA model can be used to assess the effects of genetically defined therapies, such as oligonucleotide therapies as described in Example 4. By combining this example with Example 4, the resulting system and method can be used to identify the features, such as RNA-protein binding motifs and nucleosome occupancies, that contribute to the effectiveness of oligonucleotide therapies.
- the system and method can be applied to other genetically defined therapies, such as gene editing and gene therapies.
- chromosome:position:reference:variant based on the hg19 assembly.
- chr1:11082794:T:C, chr8:22058957:T:C, chr11:2181023:T:C, chr11:5246715:T:C, chr11:5246716:T:A, chr11:5246716:T:C, chr11:5246717:T:C, chr11:5246718:A:G, chr11:5246718:A:T, chr11:46761055:G:A, chr16:223691:A:G, chr22:51063477:T:C
- chr1:156109644:G:A, chr1:197053394:G:A, chr2:71004492:T:C, chr2:166847735:T:A, chr2:166847735:T:C, chr2:179326003:A:C, chr2:207656535:T:C, chr3:178952181:T:C, chr4:141471538:C:T, chr4:187131799:T:C, chr5:112180071:A:G, chr5:118877695:A:G, chr6:7586120:T:A, chr6:116953612:A:G, chr6:158532382:T:C, chr10:27035405:
- Example 10 Sample predicted polyadenylation track
- FIG. 8 illustrates an example application of scanning the Conv-Net model across a section of the human genome to identify potential polyadenylation sites.
- a snapshot is shown from the UCSC genome browser, showing tracks from top to bottom:
- GENCODE gene annotations, GENCODE Poly(A) track, predicted and reported PAS from PolyA_DB (as described, for example, by Cheng et al. and by Zhang et al., PolyA_DB: a database for mammalian mRNA polyadenylation, Nucleic Acids Res., 2005; which is hereby incorporated by reference in its entirety), 3’-Seq (as described, for example, by Lianoglou et al., Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression, Genes Dev., 2013; which is hereby incorporated by reference in its entirety), and PolyA-Seq (forward and reverse strands) (as described, for example, by Derti et al., A quantitative atlas of polyadenylation in five mammals, Genome Res., 2012; which is hereby incorporated by reference in its entirety). At the bottom of FIG. 8, predictions from the model are shown.
- FIG. 9 illustrates positive and negative regions for PAS discovery evaluation.
- Two regions immediately adjacent to each polyadenylation site (PAS) are defined as negatives for classification. This ensures that the negatives have similar nucleotide composition compared to the positive sequences. Regions that are not between existing PAS are excluded to avoid including terminal exonic regions. If the spacing between adjacent PAS cannot fit four negative regions, they are also excluded from the negative set.
- Example 12 Example filters learned by the convolutional neural network
- FIG. 10 illustrates example filters learned by a convolutional neural network.
- an example set of the 80 filters that are learned by the Conv-Net are shown (numbered from 0 to 79). All filters are mean-subtracted and plotted with the same scale (i.e., the max and min for each filter plot is the same). Different shades or colors can be used to denote positive and negative values. Various filters are blank, suggesting the number of filters in the Conv-Net model can be reduced. A filter that detects the two most common
- Table 7 illustrates examples of hyperparameters for three different models: a logistic regression (LR), the Feature-Net, and the Conv-Net.
- the following hyperparameters are determined by random sampling and selecting the set that provides the best validation performance. The range each hyperparameter is sampled from is indicated. The number of training epochs is fixed to 50.
- Table 7 Examples of hyperparameters for logistic regression (LR), Feature-Net, and Conv-Net.
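The random-sampling procedure for hyperparameters described above can be sketched as follows. This is only an illustration: the hyperparameter names, the sampling ranges, and the `toy_validation_score` function are hypothetical placeholders, not the entries of Table 7.

```python
import math
import random
from typing import Callable, Dict, Tuple

EPOCHS = 50  # fixed, as stated in the text

# Hypothetical search space; each sampler draws one value from its range.
SPACE: Dict[str, Callable[[random.Random], float]] = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -2),  # log-uniform
    "dropout": lambda rng: rng.uniform(0.0, 0.5),
    "hidden_units": lambda rng: rng.choice([32, 64, 128]),
}


def random_search(evaluate: Callable[[dict], float], n_trials: int,
                  seed: int = 0) -> Tuple[float, dict]:
    """Sample configurations at random and keep the one with the best validation score."""
    rng = random.Random(seed)
    best_score, best_params = -math.inf, {}
    for _ in range(n_trials):
        params = {name: sampler(rng) for name, sampler in SPACE.items()}
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_validation_score(params: dict) -> float:
    """Hypothetical stand-in for training a model for EPOCHS epochs and validating it."""
    return -(math.log10(params["learning_rate"]) + 3) ** 2 - (params["dropout"] - 0.2) ** 2


best_score, best_params = random_search(toy_validation_score, n_trials=25)
```

In practice `evaluate` would train the LR, Feature-Net, or Conv-Net with the sampled settings and return its validation performance.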
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862716262P | 2018-08-08 | 2018-08-08 | |
PCT/CA2019/051086 WO2020028989A1 (en) | 2018-08-08 | 2019-08-08 | Systems and methods for determining effects of therapies and genetic variation on polyadenylation site selection |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3834202A1 (en) | 2021-06-16 |
EP3834202A4 (en) | 2022-05-11 |
Family
ID=69413933
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19847830.7A Withdrawn EP3834202A4 (en) | 2018-08-08 | 2019-08-08 | Systems and methods for determining effects of therapies and genetic variation on polyadenylation site selection |
Country Status (4)
Country | Link |
---|---|
US (2) | US11322225B2 (en) |
EP (1) | EP3834202A4 (en) |
CA (1) | CA3107649A1 (en) |
WO (1) | WO2020028989A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020028989A1 (en) * | 2018-08-08 | 2020-02-13 | Deep Genomics Incorporated | Systems and methods for determining effects of therapies and genetic variation on polyadenylation site selection |
US11727284B2 (en) * | 2019-12-12 | 2023-08-15 | Business Objects Software Ltd | Interpretation of machine learning results using feature analysis |
WO2021231550A1 (en) * | 2020-05-15 | 2021-11-18 | Monsanto Technology Llc | Systems and methods for detecting genome edits |
CN111951889B (en) * | 2020-08-18 | 2023-12-22 | 安徽农业大学 | Recognition prediction method and system for M5C locus in RNA sequence |
CN112365924B (en) * | 2020-11-09 | 2023-03-21 | 陕西师范大学 | Bidirectional trinucleotide position specificity preference and point combined mutual information DNA/RNA sequence coding method |
CN113724782B (en) * | 2021-08-19 | 2024-04-02 | 西安交通大学 | Disease prognosis marker screening method based on variable polyadenylation site |
WO2023225221A1 (en) * | 2022-05-18 | 2023-11-23 | The Johns Hopkins University | Machine learning system for predicting gene cleavage sites background |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030224514A1 (en) * | 2002-05-31 | 2003-12-04 | Isis Pharmaceuticals Inc. | Antisense modulation of PPAR-delta expression |
US20020049173A1 (en) * | 1999-03-26 | 2002-04-25 | Bennett C. Frank | Alteration of cellular behavior by antisense modulation of mRNA processing |
US20110239315A1 (en) * | 2009-01-12 | 2011-09-29 | Ulla Bonas | Modular dna-binding domains and methods of use |
US9388403B2 (en) * | 2011-05-31 | 2016-07-12 | Mitsubishi Rayon Co., Ltd. | Nitrile hydratase |
WO2017218925A1 (en) * | 2016-06-16 | 2017-12-21 | Rutgers, The State University Of New Jersey | Modified 3' region extraction and deep sequencing of polyadenylation sites and poly(a) tail length analysis |
US20170298382A1 (en) * | 2012-03-13 | 2017-10-19 | Pioneer Hi-Bred International, Inc. | Genetic reduction of male fertility in plants |
ES2881080T3 (en) * | 2013-03-15 | 2021-11-26 | Labrador Diagnostics Llc | Nucleic acid amplification |
CA2932472A1 (en) * | 2013-12-12 | 2015-06-18 | Massachusetts Institute Of Technology | Compositions and methods of use of crispr-cas systems in nucleotide repeat disorders |
US10185803B2 (en) * | 2015-06-15 | 2019-01-22 | Deep Genomics Incorporated | Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network |
CA2999192A1 (en) * | 2015-09-21 | 2017-03-30 | Association Institut De Myologie | Antisense oligonucleotides hybridizing with a key element of the polyadenylation region of a dux4 pre-mrna and uses thereof |
WO2020028989A1 (en) | 2018-08-08 | 2020-02-13 | Deep Genomics Incorporated | Systems and methods for determining effects of therapies and genetic variation on polyadenylation site selection |
- 2019
  - 2019-08-08 WO PCT/CA2019/051086 patent/WO2020028989A1/en unknown
  - 2019-08-08 CA CA3107649A patent/CA3107649A1/en active Pending
  - 2019-08-08 EP EP19847830.7A patent/EP3834202A4/en not_active Withdrawn
- 2021
  - 2021-01-29 US US17/162,224 patent/US11322225B2/en active Active
- 2022
  - 2022-03-28 US US17/706,227 patent/US20220336049A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
EP3834202A4 (en) | 2022-05-11 |
WO2020028989A1 (en) | 2020-02-13 |
CA3107649A1 (en) | 2020-02-13 |
US20220336049A1 (en) | 2022-10-20 |
US20210241852A1 (en) | 2021-08-05 |
US11322225B2 (en) | 2022-05-03 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US11322225B2 (en) | Systems and methods for determining effects of therapies and genetic variation on polyadenylation site selection | |
Wen et al. | DeepMirTar: a deep-learning approach for predicting human miRNA targets | |
Bandyopadhyay et al. | MBSTAR: multiple instance learning for predicting specific functional binding sites in microRNA targets | |
Kristensen et al. | Principles and methods of integrative genomic analyses in cancer | |
Xiao et al. | Prediction of lncRNA-protein interactions using HeteSim scores based on heterogeneous networks | |
He et al. | Improved regulatory element prediction based on tissue-specific local epigenomic signatures | |
Savareh et al. | A machine learning approach identified a diagnostic model for pancreatic cancer through using circulating microRNA signatures | |
Chen et al. | Random forests for genomic data analysis | |
US20180107927A1 (en) | Architectures for training neural networks using biological sequences, conservation, and molecular phenotypes | |
Leung et al. | Inference of the human polyadenylation code | |
Zhang et al. | A review on recent computational methods for predicting noncoding RNAs | |
US10347359B2 (en) | Method and system for network modeling to enlarge the search space of candidate genes for diseases | |
US20200082910A1 (en) | Systems and Methods for Determining Effects of Genetic Variation of Splice Site Selection | |
Deng et al. | Accurate prediction of protein-lncRNA interactions by diffusion and HeteSim features across heterogeneous network | |
Saeed et al. | Biological sequence analysis | |
Hwang et al. | Big data and deep learning for RNA biology | |
Goldenberg et al. | Unsupervised detection of genes of influence in lung cancer using biological networks | |
Yang et al. | MSPL: Multimodal self-paced learning for multi-omics feature selection and data integration | |
Zhang et al. | Fast and Efficient Design of Deep Neural Networks for Predicting N7-Methylguanosine Sites Using autoBioSeqpy | |
Cai et al. | Utilizing RNA-seq data for cancer network inference | |
Zhang et al. | Inference of Networks from Large Datasets | |
Uthayopas et al. | PRIMITI: a computational approach for accurate prediction of miRNA-target mRNA interaction | |
Wei | Survival-Related Clustering of Cancer Patients by Integrating Clinical and Biological Datasets | |
Zhou et al. | Analysis of paired miRNA-mRNA microarray expression data using a stepwise multiple linear regression model | |
Saha | Computational methods to study gene regulation in humans using DNA and RNA sequencing data | |
Legal Events
Code | Title | Description |
---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
17P | Request for examination filed | Effective date: 20210204 |
AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
DAV | Request for validation of the european patent (deleted) | |
DAX | Request for extension of the european patent (deleted) | |
REG | Reference to a national code | Ref country code: DE Ref legal event code: R079 Free format text: PREVIOUS MAIN CLASS: G16B0040000000 Ipc: G16B0040200000 |
A4 | Supplementary search report drawn up and despatched | Effective date: 20220411 |
RIC1 | Information provided on ipc code assigned before grant | Ipc: C12N 15/11 20060101ALI20220405BHEP Ipc: G16B 30/00 20190101ALI20220405BHEP Ipc: G16B 20/00 20190101ALI20220405BHEP Ipc: C12Q 1/6883 20180101ALI20220405BHEP Ipc: G16B 40/20 20190101AFI20220405BHEP |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn | Effective date: 20221110 |