WO2014202573A1 - Method for modulating gene expression - Google Patents

Method for modulating gene expression

Info

Publication number
WO2014202573A1
WO2014202573A1 PCT/EP2014/062659 EP2014062659W WO2014202573A1 WO 2014202573 A1 WO2014202573 A1 WO 2014202573A1 EP 2014062659 W EP2014062659 W EP 2014062659W WO 2014202573 A1 WO2014202573 A1 WO 2014202573A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna
sequence
rna
ratio
protein
Prior art date
Application number
PCT/EP2014/062659
Other languages
French (fr)
Inventor
Mihail SAROV
Stoyno STOYNOV
Original Assignee
MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V.
Publication of WO2014202573A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/63: Introduction of foreign genetic material using vectors; Vectors; Use of hosts therefor; Regulation of expression
    • C12N15/67: General methods for enhancing the expression

Definitions

  • the invention relates to a method for the design and/or modification of synthetic or naturally occurring nucleic acid sequences.
  • the method is suitable for obtaining nucleic acid sequences with a defined level or enhanced predictability of protein expression by adjustment of the thermodynamic stability of the relevant nucleic acid sequence by sequence adjustment.
  • the invention relates to a method for modulating gene expression, more specifically for modulating protein expression levels of any given nucleic acid sequence, by adjusting the thermodynamic stability of the corresponding RNA/DNA and DNA/DNA duplexes via sequence change of the gene encoding the protein to be expressed.
  • RNA splicing is coupled with transcription and catalyzed by a spliceosome nucleoprotein complex that acts to remove introns and re-join exonic sequences in order to create functional RNAs [1].
  • This process involves recognition of the 5'-splice sites by U1 snRNPs (small nuclear ribonucleoproteins) [2,3].
  • the 3'-intron-exon boundary recognition requires SF1 to interact with the branch point sequence (BP) [4,5], U2AF65 [6] with the polypyrimidine tract (PPT), and U2AF35 with the 3'-splice site [6,7].
  • BP branch point sequence
  • PPT Polypyrimidine tract
  • an intron has been shown to modulate expression levels of a coding region, although this phenomenon is dependent on the particular promoters and introns involved.
  • the introduction of the first intron from EF-1 alpha directly downstream of the MCMV promoter led to an enhanced expression level of a luciferase reporter gene [61].
  • RNA polymerase II can travel along the DNA template for thousands and even hundreds of thousands of nucleotides. In the process, it encounters the physical forces of DNA/DNA and RNA/DNA pairing, which can vary significantly depending on the local sequence composition. It has been shown [11-13] that the 5'- and 3'-UTRs, introns and exons have characteristic guanine/cytosine (GC) content, which could affect RNA transcription and processing. Nucleotide composition could influence protein recruitment [12], RNA secondary structure [13], transcription rate [14,15], DNA melting [16] or RNA/DNA and DNA/DNA duplex stability [17].
  • GC guanine/cytosine
  • the physical and structural properties of nucleic acid sequences represent, either alone or in combination with the coding information, poorly understood and potentially complicated factors that make the accurate design of nucleic acid sequences, in particular for determining the expression level of particular genes, unreliable, amongst other disadvantages.
  • the thermodynamic stability of nucleotide duplexes has two components. The first component is the hydrogen bonding between complementary bases, which depends strongly on the nucleotide composition. The second component is the stacking interaction between the bases, which depends mainly on the neighbouring dinucleotide distribution. Therefore, the nucleotide distribution and its thermodynamic properties are interrelated. The de novo design of genes and re-design of exogenous genes with predictable levels of expression and/or splicing is therefore difficult.
  • the thermodynamic properties of such duplexes represent a significant unknown factor that confounds the design of de novo protein-coding sequences for optimized expression or splicing characteristics.
  • the free energy (ΔG) necessary to unwind polynucleotide duplexes of defined length can be calculated from the measured values of entropy (ΔS) and enthalpy (ΔH) for nearest-neighbour DNA/DNA [18-23] or RNA/DNA [24,25] interactions.
  • ΔS Entropy
  • ΔH Enthalpy
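As a reading aid, the nearest-neighbour calculation mentioned here can be written out as a sum over adjacent base-pair steps. This is the standard Gibbs relation applied per step; the temperature T, the step index i and the omission of any initiation or end-correction terms are assumptions made for illustration and are not specified in the text.

```latex
\Delta G_{\text{duplex}} \;=\; \sum_{i=1}^{N-1} \left( \Delta H_{i,i+1} \;-\; T\,\Delta S_{i,i+1} \right)
```

Here N is the length of the sequence region in nucleotides, and each term refers to the nearest-neighbour step formed by positions i and i+1.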
  • Previous work in this area has shown that exons possess more stable RNA/DNA duplexes than introns in Saccharomyces cerevisiae.
  • the ΔS and ΔH parameters used for DNA/DNA duplexes [22] lead to significant overestimation of their ΔG in comparison with the ΔG of RNA/DNA duplexes.
  • direct comparison between the ΔG value for DNA/DNA duplex stability and the ΔG value for RNA/DNA duplex stability for any given specified sequence region has not previously been possible.
  • the invention relates in a preferred embodiment to a method for de novo design of synthetic genes and/or modification of naturally occurring genes in order to obtain nucleic acid sequences with a defined level or enhanced predictability of mRNA expression, in addition to defined expression levels of splicing variants by modulation of thermodynamic stability of RNA/DNA and DNA/DNA duplexes.
  • the genes designed and modified according to the present invention can be used in order to create and improve gene therapy constructs and vaccines and to enhance in vitro, in vivo or recombinant production of proteins, RNA and DNA.
  • the invention is based on the demonstration that the ratio between the stability of mRNA/DNA duplexes and DNA/DNA duplexes plays a role in transcription levels and splice events. Particular values of this ratio near 3'-splice sites are characteristic features that can contribute to intron-exon differentiation. Remarkably, throughout all transcripts, the most unstable mRNA/DNA duplexes, compared to the corresponding DNA/DNA duplexes, are situated upstream of the 3'-splice sites and include the polypyrimidine tracts. This characteristic instability is less pronounced in weak alternative splice sites and disease-associated cryptic 3'-splice sites.
  • the present invention therefore enables modification of the thermodynamic pattern of a DNA sequence in order to modulate in vivo transcription and splicing events.
  • One example is to prevent the re-annealing of mRNA to the DNA template behind the RNA polymerase to ensure access of the splicing machinery to the polypyrimidine tract and the branch point.
  • the present invention therefore represents the technical exploitation of the thermodynamic properties of nucleic acid sequences in order to modulate transcription and/or RNA splicing, by using appropriate thermodynamic parameters that allow comparison and subsequent determination of DNA/DNA duplex stability compared to mRNA/DNA duplex stability.
  • an object of the invention is to provide a method for modulating protein expression level by modifying the gene encoding said protein, comprising a) provision of an initial DNA sequence (initial DNA) that comprises one or more
  • Modulation of expression may relate to either increased or reduced expression of a protein of interest, or of a particular splice variant of interest.
  • the invention therefore also relates to the use of the ΔG ratio (ΔG of DNA/DNA duplex stability to ΔG of RNA/DNA duplex stability) of one or more regions of any given gene sequence in determining a functional characteristic of said sequence, for example the likelihood, frequency or rate of splicing, or the level of transcription, of any given sequence region.
  • ΔG ratio: ΔG of DNA/DNA duplex stability to ΔG of RNA/DNA duplex stability
  • ΔG bias: ΔG DNA/DNA : ΔG RNA/DNA
  • RNA/DNA biased regions with higher stability of the RNA/DNA duplex than the corresponding DNA/DNA duplex.
  • the difference between the ΔG value for DNA/DNA duplex stability and the ΔG value for RNA/DNA duplex stability can be calculated (for example ΔG DNA/DNA - ΔG RNA/DNA) to give the ΔG bias. As used in the experimental examples, in particular Figures 2d and 3a, the ΔG bias has been calculated as ΔG DNA/DNA - ΔG RNA/DNA.
  • sequence regions with a ΔG bias above 0 indicate “DNA/DNA biased” regions. Sequence regions with a ΔG bias below 0 indicate “RNA/DNA biased” regions.
  • Such values relate essentially to the same or an analogous comparison between ΔG DNA/DNA and ΔG RNA/DNA values, but may be expressed in a different form.
  • ΔG ratio or ΔG bias may still in general be used for alternative numerical expressions of the comparison of ΔG values described herein.
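The two comparison measures defined above can be sketched in a few lines. This is only an illustration of the definitions given in the text; the function names and the handling of an exact zero bias ("unbiased") are assumptions, and the free energy values are taken to be positive unwinding energies in consistent units.

```python
def delta_g_ratio(dg_dna_dna: float, dg_rna_dna: float) -> float:
    """Ratio of DNA/DNA to RNA/DNA duplex free energy for the same region."""
    return dg_dna_dna / dg_rna_dna


def delta_g_bias(dg_dna_dna: float, dg_rna_dna: float) -> float:
    """Difference form of the same comparison: DNA/DNA minus RNA/DNA."""
    return dg_dna_dna - dg_rna_dna


def classify_bias(bias: float) -> str:
    """Label a region by the sign of its bias, as described in the text.

    The 'unbiased' label for a bias of exactly zero is an assumption.
    """
    if bias > 0:
        return "DNA/DNA biased"
    if bias < 0:
        return "RNA/DNA biased"
    return "unbiased"
```

For positive free energies, a ratio above 1 and a bias above 0 describe the same situation, which is why the text treats the two measures as interchangeable expressions of one comparison.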
  • the free energy (ΔG) necessary to unwind polynucleotide duplexes of defined length can be calculated from the measured values of entropy (ΔS) and enthalpy (ΔH) for the 10 possible nearest-neighbour DNA/DNA [18-23] interactions, and the 16 possible RNA/DNA [24,25] interactions.
  • the ΔG values can therefore be calculated by methods known in the art.
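A minimal sketch of such a calculation is given below, assuming a lookup table that maps each dinucleotide step to an enthalpy and entropy pair. The function name, the table layout, the default temperature of 37 °C and the units are assumptions; the numbers in the usage note are placeholders, not the measured parameters from the cited references, and initiation or end corrections are deliberately left out.

```python
from typing import Dict, Tuple

# step -> (dH in kcal/mol, dS in cal/(mol*K)); the values must be supplied by
# the caller from a published nearest-neighbour parameter set.
NNParams = Dict[str, Tuple[float, float]]


def duplex_delta_g(seq: str, params: NNParams, temp_c: float = 37.0) -> float:
    """Sum dG = dH - T*dS over every nearest-neighbour step of seq.

    With unwinding (melting) parameters the result is positive and a larger
    value means a more stable duplex, matching the convention in the text.
    """
    temp_k = temp_c + 273.15
    total = 0.0
    for i in range(len(seq) - 1):
        dh, ds = params[seq[i:i + 2]]
        total += dh - temp_k * ds / 1000.0  # convert cal to kcal
    return total
```

As a usage example, `duplex_delta_g("ACGT", table)` with a placeholder table that assigns `(8.0, 20.0)` to every step sums three identical step contributions. The same function can be called with a DNA/DNA table and an RNA/DNA table to obtain the two values that are compared in the ΔG ratio.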
  • the desired or pre-determined ΔG ratio of the product DNA is to be assessed and/or determined in relation to the initial DNA sequence. Changes in sequence can be made to the initial DNA sequence in order to adjust the ΔG ratio of any given sequence region to a value as desired. The ΔG ratio can be increased or decreased according to the desired outcome with respect to protein expression. The ΔG values for each of the DNA/DNA and RNA/DNA duplex stabilities may also be adjusted as such, without altering the ΔG ratio itself.
  • the product DNA sequence can in a preferred embodiment then be assessed for protein expression, which can be determined by various in vitro or in vivo quantitative tests, such as detection of the expressed protein with an affinity reagent, such as an antibody, or by other analytical protein methods such as SDS-PAGE followed by Coomassie staining.
  • Comparative tests between initial and product DNA sequences based on analysing protein expression, using comparable expression systems, are preferred.
  • the initial and product sequences could subsequently be introduced separately into the same expression vector, such as a plasmid or viral vector, suitable for expression in any given host organism or cell culture system.
  • the protein expression levels can then be assessed as described herein or by methods known in the art.
  • the novel and surprising aspect of the invention is the recognition that analysis of the thermodynamic properties of the sequence, specifically direct comparison of the DNA/DNA duplex stability to RNA/DNA duplex stability, in any given sequence region reveals functional characteristics of the sequence with regard to expression of the encoded protein.
  • the solution to the technical problem stated above is therefore the utilisation of a comparison between DNA/DNA duplex stability and RNA/DNA duplex stability.
  • the solution to the technical problem is the provision of the ΔG ratio (ΔG of DNA/DNA duplex stability to ΔG of RNA/DNA duplex stability) for use in determining functional characteristics of nucleic acid sequences. Subsequent modulation of the ΔG ratio represents an active method step that utilises the surprising relationship between DNA/DNA duplex stability and RNA/DNA duplex stability in order to achieve a more reliable and in some cases pre-determined expression level.
  • the balance between the strengths of the DNA/DNA duplex formation and the RNA/DNA duplex formation is important in determining the functional output/outcome of the transcription and/or splicing molecular machinery.
  • the invention therefore also relates to a method for modifying a gene sequence in order to modulate protein expression by modifying the gene sequence according to the thermodynamic properties of the sequence of interest.
  • the method for modifying a DNA sequence in order to modulate protein expression comprises a) provision of an initial DNA sequence (initial DNA) that encodes an amino acid sequence of interest, b) determination of the ΔG ratio for one or more regions of said initial DNA, and c) modification of said initial DNA sequence to provide a product DNA sequence (product DNA) with a desired ΔG ratio, wherein the protein expression level is dependent on said ΔG ratio.
  • the method of the present invention for modifying a DNA sequence is carried out for a sequence that comprises one or more introns, and is characterised in that an increase of the ΔG ratio in a specified sequence region of the product DNA in comparison to the initial DNA provides increased expression of the protein encoded by said product DNA, wherein said specified sequence region is any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron.
  • This particular embodiment of the invention relates to the modification of a sequence in order to enhance expression of a particular splice variant of any given sequence comprising one or more introns.
  • a specified region upstream of the 3'-splice site within introns is designed to possess a more thermodynamically stable DNA/DNA duplex than RNA/DNA duplex.
  • the branch point consensus sequence will be preferably present.
  • the level of presence of the relevant exon (downstream of said 3' splice site of the intron) in the spliced (final, to be translated) transcript will depend on DNA/DNA duplex stability in comparison with RNA/DNA duplex stability in the particular region upstream from the intron's 3'-splice site.
  • the greater the DNA/DNA duplex stability in comparison with RNA/DNA duplex stability in this region (any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron), the more frequently (i.e. with higher efficiency) the relevant exon will be present in the mature transcript. This effect therefore leads to increased expression of the protein encoded by the modified product DNA. If the thermodynamic bias towards DNA/DNA duplex stability (higher ΔG ratio) is not present, the splice event will occur with less efficiency and reduced frequency, thereby leading to lower expression of the desired DNA sequence due to reduced transcript number for the desired splice variant.
  • the method of the present invention, relating to enhanced splice efficiency, is characterised in that the ΔG ratio of the specified sequence region (preferably a region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron) in the product DNA is above 1, preferably greater than 1.05, 1.10, 1.15, 1.2, or greater than 1.3.
  • the "ΔG bias" (as an alternative measure to the "ΔG ratio") of the specified sequence region (preferably a region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron) in the product DNA is greater than 0, preferably greater than 0.5, 1.0, 1.5 or more preferably greater than 2.
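The region selection and the threshold test described above can be sketched as follows. The function names, the 0-based index convention for the splice site and the default thresholds (ratio above 1, bias above 0) are assumptions chosen to mirror the least restrictive values in the text; the free energies are again taken to be positive unwinding energies.

```python
def upstream_of_3prime_site(seq: str, site: int, length: int = 20) -> str:
    """Return up to `length` nucleotides immediately upstream of `site`.

    `site` is assumed to be the 0-based index of the first exon nucleotide
    after the intron, so the returned slice ends at the 3' splice site.
    """
    return seq[max(0, site - length):site]


def meets_splice_target(dg_dna_dna: float, dg_rna_dna: float,
                        min_ratio: float = 1.0,
                        min_bias: float = 0.0) -> bool:
    """Check that a region is DNA/DNA biased by both ratio and difference."""
    ratio = dg_dna_dna / dg_rna_dna
    bias = dg_dna_dna - dg_rna_dna
    return ratio > min_ratio and bias > min_bias
```

Stricter targets from the preferred ranges can be tested by passing, for example, `min_ratio=1.3` or `min_bias=2.0`.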
  • the present invention therefore relates to a method as described herein, wherein modification of the initial DNA is carried out by insertion of one or more introns (such as synthetic introns) and corresponding splice sites.
  • the invention therefore also relates to separation of a coding sequence with one or more exons by inserting one or more preferably synthetic intronic sequences that can allow expression of a desired number of alternative splicing variants of the gene.
  • the synthetic intronic sequences are designed to contain: 5'-splice site consensus sequence, 3'-splice site consensus sequence and a sequence with a low level of DNA/DNA and RNA/DNA duplex stability between them (in the introns).
  • a specified region upstream of the 3'-splice site is designed to possess a more thermodynamically stable DNA/DNA duplex than RNA/DNA duplex. Immediately upstream of this region, the branch point consensus sequence will be present.
  • the level of presence of the relevant exon (downstream of said 3' splice site of the intron) in the spliced (final, to be translated) transcript will depend on DNA/DNA duplex stability in comparison with RNA/DNA duplex stability in the particular region upstream from the intron's 3'-splice site.
  • the greater the DNA/DNA duplex stability in comparison with RNA/DNA duplex stability (larger ΔG ratio values) in this region (preferably any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron), the more frequently (i.e. with higher efficiency) the relevant exon will be present in the mature transcript.
  • the insertion of synthetic introns therefore provides a reliable method for modifying a coding DNA sequence in order to provide one or more splice variants with fine-tuned expression levels.
  • the method as described herein is characterised in that reduction of the ΔG ratio for any given coding region and/or 5' untranslated region (5'-UTR) of the product DNA in comparison to the initial DNA provides increased expression of the protein encoded by said product DNA.
  • This particular embodiment relates to regulation or modulation of the expression level of a coding DNA sequence independent of splice events. This method therefore applies preferably to coding sequences that do not comprise introns to be spliced out before translation.
  • the comparison of DNA/DNA duplex stability with RNA/DNA duplex stability has revealed that coding sequences that show lower ΔG ratio values are transcribed more highly than sequences with a higher ΔG ratio.
  • the thermodynamic properties encompassed by this feature have been neither suggested nor disclosed in the prior art.
  • the exploitation of these thermodynamic properties allows an assessment of transcription efficiency in light of the ΔG ratio and subsequent modification of the sequence to be expressed according to the desired expression level.
  • the ΔG ratio of the specified coding and/or 5'-UTR sequence region in the product DNA is around 1 or preferably below 1.
  • the "ΔG bias" (as an alternative measure to the "ΔG ratio") of the specified sequence region in the product DNA is around or preferably below 0, preferably less than -0.5, -1.0, -1.5 or more preferably less than -2.
  • the ΔG ratio for any given coding region and/or 5' untranslated region (5'-UTR) of the product DNA may be modified by sequence modification, so that the ΔG ratio is 1, or a value close to 1, that is defined by essentially similar ΔG values of DNA/DNA and RNA/DNA duplex stability, whereby the ΔG values as such for both DNA/DNA and RNA/DNA duplex stability are increased compared to those of the initial DNA sequence.
  • This embodiment is beneficial for coding regions (including for example exon sequences) in order to provide increased expression of the protein encoded by the product DNA. Also in the embodiment where the ΔG values as such for both DNA/DNA and RNA/DNA duplex stability are increased compared to those of the initial DNA sequence, it is preferred that the ΔG ratio is 1, around 1 or below 1.
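The condition described for coding regions (both free energies raised while the ratio stays at or below about 1) can be expressed as a simple check. The function name, the tuple layout and the `ratio_ceiling` parameter are assumptions made for illustration; how close to 1 counts as "around 1" is not quantified in the text, so the ceiling is left adjustable.

```python
from typing import Tuple

# (dG of DNA/DNA duplex, dG of RNA/DNA duplex) for the same region
DuplexPair = Tuple[float, float]


def coding_region_improved(initial: DuplexPair, product: DuplexPair,
                           ratio_ceiling: float = 1.0) -> bool:
    """True if both duplex free energies rose and the product ratio is
    at or below `ratio_ceiling`."""
    dna_i, hybrid_i = initial
    dna_p, hybrid_p = product
    both_increased = dna_p > dna_i and hybrid_p > hybrid_i
    return both_increased and (dna_p / hybrid_p) <= ratio_ceiling
```

A caller who wants to accept ratios slightly above 1 can pass a looser ceiling such as `ratio_ceiling=1.02`; that tolerance is a design choice, not a value taken from the text.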
  • the correlation between more stable duplex structures and higher expression levels is also surprising in that a skilled person may have assumed that lower stability would provide less resistance to a progressing Pol II.
  • the present invention therefore provides a method based on a concept not suggested previously.
  • the method of the present invention is characterised in that modification of the initial DNA is carried out according to or using the degeneracy of the genetic code without changing the amino acid sequence encoded by said initial DNA. It is well known that various nucleic acid triplets may code for the same amino acid. Through modification of the DNA sequence by adjusting the sequence according to the degeneracy of the genetic code, the same amino acid sequence may be maintained, whilst the nucleotides are selected with regard to their thermodynamic properties, in particular the ΔG ratio of any given sequence. This embodiment is particularly relevant for coding sequences that do not exhibit introns. In addition to utilising the degeneracy of the genetic code, standard codon optimisation procedures may additionally be considered and taken into account when designing a sequence for expression in a particular host. In one embodiment the nucleic acid sequence coding for the desired protein may be reverse transcribed, either in vitro or in silico, and the nucleic acid sequence subsequently analysed and modified/designed according to the methods described herein.
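The use of codon degeneracy can be illustrated with a brute-force sketch that enumerates every synonymous version of a short coding region and keeps the one with the lowest score. The standard genetic code table is built from its usual compact form; the function names and the `score` callback are assumptions, and in practice the callback would return a thermodynamic quantity such as the ΔG ratio of the region. Exhaustive enumeration grows exponentially, so this only suits short regions whose length is a multiple of three.

```python
from itertools import product
from typing import Callable, Iterator

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order mapped to one-letter amino acids.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate complete codons of dna into a one-letter peptide string."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def synonymous_variants(dna: str) -> Iterator[str]:
    """Yield every DNA sequence that encodes the same peptide as dna."""
    choices = []
    for i in range(0, len(dna), 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        choices.append([c for c, a in CODON_TABLE.items() if a == amino_acid])
    for combo in product(*choices):
        yield "".join(combo)


def recode(dna: str, score: Callable[[str], float]) -> str:
    """Return the synonymous variant with the lowest score."""
    return min(synonymous_variants(dna), key=score)
```

For example, `recode(region, score=my_ratio)` with a hypothetical `my_ratio` function would return the synonymous sequence with the smallest ratio, while negating the score selects the largest.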
  • the method of the present invention may also be characterised in that one or more of the steps a), b) and/or c) of the method described above are carried out by one or more computer programmes, executed on a computing device.
  • the invention therefore relates to computational methods that essentially use simulations and/or computer representations of the nucleic acid sequences described herein.
  • the method can be carried out by empirical experimentation, for example by synthesis of particular sequences, empirical analysis of their ΔG values by experimental approaches known in the art, and finally by re-synthesizing a modified sequence based on nucleotides that have been adjusted or replaced in order to achieve the desired thermodynamic properties, more preferably the desired ΔG ratio. Re-analysis of the sequence in order to measure the changed ΔG is also possible. Determination of the melting point of any given sequence is one approach that represents an empirical method of determining or estimating the ΔG of a particular DNA molecule.
  • the method may also relate to a computer programme product, such as a software product.
  • the ΔG values of any given sequence are preferably determined in silico through calculation of thermodynamic properties of individual nucleotides and/or longer sequences of multiple nucleotides.
  • the computer programme product of the present invention also encompasses the features as described for the method provided herein. Further details on preferred computer-based approaches are provided in the examples and relevant references as described herein. If the method is carried out in a computer programme, for example by way of simulation, the modified sequence may subsequently be synthesized in a laboratory by methods known to a skilled person and utilised in whichever in vitro or in vivo application is desired.
  • the invention also relates to a method as described herein, wherein the ΔG value for DNA/DNA duplex stability and/or the ΔG value for RNA/DNA duplex stability for any given specified sequence region are determined using a sliding-window calculation of entropy (ΔS) and enthalpy (ΔH) of nearest-neighbour interactions.
  • ΔS entropy
  • ΔH enthalpy
  • the invention relates to a method as described herein, wherein ΔG values for DNA/DNA duplex stability are calculated on 5 to 15, preferably 10, nearest-neighbour interactions, and ΔG values for RNA/DNA duplex stability are calculated on 10 to 20, preferably 16, nearest-neighbour interactions.
  • the term "nearest neighbour interaction" is known in the art and is described in further detail in references 18-25.
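The preferred counts of 10 and 16 interactions correspond to the number of distinct dinucleotide steps in each duplex type. In a DNA/DNA duplex a step and its reverse complement describe the same stacked pair, which folds the 16 possible dinucleotides into 10 classes; in an RNA/DNA hybrid the two strands are chemically different, so all 16 stay distinct. The short sketch below counts both; the variable names are illustrative only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(step: str) -> str:
    """Return the reverse complement of a DNA dinucleotide step."""
    return step.translate(COMPLEMENT)[::-1]


# All 16 ordered dinucleotide steps (relevant for RNA/DNA hybrids).
all_steps = ["".join(p) for p in product("ACGT", repeat=2)]

# For DNA/DNA duplexes, a step and its reverse complement are equivalent,
# so each pair is represented by its alphabetically smaller member.
dna_dna_classes = {min(s, reverse_complement(s)) for s in all_steps}

print(len(all_steps), len(dna_dna_classes))  # prints: 16 10
```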
  • the invention relates to a method as described herein, wherein the sliding window approach utilises a 1 to 20 bp, preferably 1 bp, step size and a 1 to 20 bp, preferably 2 to 9 bp, window size. It was an unexpected finding that window and step sizes provided herein led to beneficial results. There had previously been no indication that the thermodynamic properties of the sequence could be interrogated with sufficient resolution at these window sizes.
  • sliding window approach is well known in the art with respect to genomic analysis and experimentation.
  • a particular window size (defined in nucleotides and/or base pair (bp)) is defined and the window is moved in a particular step size (defined in nucleotides and/or base pair), in order to analyse any given stretch of contiguous nucleotides present in a sequence.
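A sliding window of this kind can be sketched as a small generator. The function name and the default values (a 9 bp window moved in 1 bp steps, taken from the preferred ranges mentioned in the text) are illustrative choices rather than a prescribed implementation.

```python
from typing import Iterator, Tuple


def sliding_windows(seq: str, window: int = 9,
                    step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for each full window along seq.

    Windows that would run past the end of the sequence are not emitted.
    """
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]
```

Each emitted window can then be passed to a free energy calculation, which gives a positional profile of duplex stability along the sequence.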
  • the invention relates to a method for predicting the location of splice sites in one or more unannotated genomes and/or genomic regions by provision of a DNA sequence of interest and determination of the ΔG ratio for one or more regions of said DNA sequence, wherein a ΔG ratio above 1 for any given specified sequence region indicates a 3' splice site of an intron.
  • software may be developed that is utilised for scanning genomic sequences for potential splice sites.
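A scanning tool of this kind might, as a first approximation, slide a window along the sequence and report windows whose ΔG ratio exceeds 1. The sketch below assumes two caller-supplied nearest-neighbour tables (one for DNA/DNA, one for RNA/DNA steps written in DNA letters), a fixed temperature of 310.15 K and positive unwinding energies; all names are hypothetical, and a real predictor would need published parameters and further signals such as splice site consensus sequences.

```python
from typing import Dict, List, Tuple

# step -> (dH in kcal/mol, dS in cal/(mol*K)), supplied by the caller
NNParams = Dict[str, Tuple[float, float]]


def window_delta_g(window: str, params: NNParams,
                   temp_k: float = 310.15) -> float:
    """Sum dH - T*dS over the nearest-neighbour steps of one window."""
    total = 0.0
    for i in range(len(window) - 1):
        dh, ds = params[window[i:i + 2]]
        total += dh - temp_k * ds / 1000.0
    return total


def candidate_regions(seq: str, dna_params: NNParams, hybrid_params: NNParams,
                      window: int = 9, step: int = 1,
                      threshold: float = 1.0) -> List[Tuple[int, float]]:
    """Return (start, ratio) for windows whose dG ratio exceeds threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        sub = seq[start:start + window]
        ratio = window_delta_g(sub, dna_params) / window_delta_g(sub, hybrid_params)
        if ratio > threshold:
            hits.append((start, ratio))
    return hits
```

Reported windows are only candidates: according to the text, the DNA/DNA biased region lies upstream of the 3' splice site, so the site itself would be expected shortly downstream of a hit.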
  • the invention further relates to a method for manufacturing a nucleic acid molecule that corresponds to product DNA that has been modified by the method as described herein, comprising carrying out the method of any one of the preceding claims and subsequently synthesizing, cloning and/or isolating said nucleic acid molecule.
  • the invention also relates therefore to a nucleic acid molecule manufactured according to the method of manufacture as described herein.
  • the nucleic acid may be integrated in an expression vector, such as a plasmid or viral vector.
  • the invention therefore relates to a pair of first and second nucleic acid molecules, wherein said first nucleic acid molecule is an initial DNA sequence (initial DNA) that comprises one or more sequences that encode an amino acid sequence of a protein to be expressed, and said second nucleic acid molecule is a product DNA sequence (product DNA) with a desired ΔG ratio that has been manufactured according to the method of the preceding claims, wherein said pair of sequences exhibit differences in ΔG ratio in one or more sequence regions.
  • the invention therefore relates to a pair of first and second nucleic acid molecules as described herein, wherein said sequences differ with respect to their nucleic acid sequence and ΔG ratio in one or more sequence regions, without any difference in amino acid sequence of the encoded protein to be expressed between said first and second sequences.
  • the invention therefore relates to a pair of first and second nucleic acid molecules as described herein, wherein the gene encoding the protein to be expressed comprises one or more introns, characterised in that an increased ΔG ratio in a specified sequence region of the product DNA in comparison to the initial DNA is present and provides increased expression of the protein encoded by said product DNA, wherein said specified sequence region is any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron.
  • the invention therefore relates to a pair of first and second nucleic acid molecules as described herein, wherein a reduced ΔG ratio for any given coding region and/or 5' untranslated region (5'-UTR) of the product DNA in comparison to the initial DNA is present and provides increased expression of the protein encoded by said product DNA.
  • pairs of first and second nucleic acid molecules may be present in a kit, together in a laboratory, for example if said pair have undergone comparative testing for expression level as described herein, or present as saved data files in a computer.
  • the presence of a pair of electronic representations of nucleic acid sequences as described herein is encompassed by the present invention.
  • since the method as described herein may be implemented in computer software, the in silico representation of any given pair of sequences as described, whereby one sequence has been modified by the method of the present invention, also falls within the scope of the present invention.
  • the method of the present invention as described herein is of particular value to modification of sequences for expression as gene therapy constructs, vaccines and/or to advance the manufacture of commercially interesting proteins, or the corresponding RNA and/or DNA.
  • the method can be applied in the following fields (but is not restricted to):
  • the method of the invention as described herein comprises the design, simulation of manufacture or manufacture of a therapeutic expression vector comprising a product DNA that has been modified by the method according to any one of the preceding claims encoding a therapeutic protein.
  • design or simulation of manufacture may also relate to a computer-stored embodiment of the invention, for example a computer simulated version of carrying out the method as described herein, or of a pair of sequences as described herein stored on a computing device.
  • Decreasing the level of expression of viral genes by modulating DNA/DNA and/or RNA/DNA duplex stability of the coding sequence can decrease the fitness of the virus (or other pathogens), thereby attenuating the virus.
  • Modulating DNA/DNA and/or RNA/DNA duplex stability of the virus genes may involve introducing hundreds of synonymous mutations into a pathogen, which subsequently reduces the health risk by minimising the chance of the virus becoming virulent via recombination with an already existing virus.
  • the present method can be used to create attenuated vaccines with improved genetic stabilities by de novo synthesis of virus genomes with altered DNA/DNA and RNA/DNA duplex stabilities of some of its coding regions and/or gene regulatory sequences.
  • the method of the invention as described herein comprises the design, simulation of manufacture or manufacture of an attenuated human pathogen, preferably a virus, for use as a vaccine, comprising a product DNA that has been modified by the method according to any one of the preceding claims.
  • the method of the invention as described herein comprises the design, simulation of manufacture or manufacture of an expression vector for recombinant protein expression, comprising a product DNA that has been modified by the method according to any one of the preceding claims encoding a protein to be expressed.
  • the terms nucleic acid, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are known in the art and are sufficiently clear for a skilled person. RNA and DNA may represent different embodiments of the same feature of the invention, especially with regard to the coding function of the nucleic acid.
  • the nucleic acids of the present invention also relate to sequences comprising synthetic or chemically modified nucleotides.
  • a gene is commonly understood as a molecular unit of heredity of a living organism.
  • a gene is defined as a region of genomic sequence, corresponding potentially to a unit of inheritance, which may be associated with regulatory regions, coding or non-coding transcribed regions, and/or other functional sequence regions.
  • a gene relates to a nucleic acid sequence which has a primary function of encoding a protein.
  • the gene may include introns, exons and/or regulatory sequences required for expression of the protein.
  • modifying a DNA sequence refers to sequence amendment via insertion, deletion, substitution, inversion or other sequence amendments as commonly known. Computational, cloning, mutation, recombination, PCR or synthetic methods may be used in modifying a sequence. The production of a mutation of the encoded protein during modification may or may not be excluded from the scope of the present invention, although the modification preferably results in no change in amino acid sequence.
  • the same sequence or “the same sequence region”, for example in the context of a region to be analysed for both ΔG of DNA/DNA and ΔG of RNA/DNA, is not limited to exactly the same sequence length but rather refers to the same region, whereby sequence overlap is sufficient for the "same region" to be analysed.
  • the sequence region analysed for ΔG of DNA/DNA may be longer or shorter than the sequence region used to analyse ΔG of RNA/DNA in the "same" region, depending for example on the window size used for the ΔG calculation, although an overlap is required.
  • Artificial gene synthesis (or de novo synthesis) is a preferred application of the present invention and relates to methods used in synthetic biology to create artificial genes.
  • PCR polymerase chain reaction
  • Gene synthesis approaches may be based on a combination of organic chemistry and molecular biological techniques and entire genes may be synthesized "de novo", without the need for precursor template DNA. The method has been used to generate functional bacterial chromosomes containing approximately one million base pairs.
  • Oligonucleotide synthesis may be applied during synthesis, whereby oligonucleotides are chemically synthesized using building blocks called nucleoside phosphoramidites. These can be normal or modified nucleosides which have protecting groups to prevent their amines, hydroxyl groups and phosphate groups from interacting incorrectly. HPLC can be used to isolate products with the proper sequence. Meanwhile a large number of oligos can be synthesized in parallel on gene chips. For optimal performance in subsequent gene synthesis procedures they should be prepared individually and in larger scales. Annealing based connection of oligonucleotides may also be used.
  • a set of individually designed oligonucleotides is made on automated solid-phase synthesizers, purified and then connected by specific annealing and standard ligation or polymerase reactions.
  • the synthesis step relies on a set of thermostable DNA ligase and polymerase enzymes.
  • modulate the level of expression of a protein refers to changes in levels of protein produced via translation of a transcribed and optionally spliced RNA molecule corresponding to the coding DNA.
  • the levels of protein after modulation can be determined by a person skilled in the art, for example by using affinity reagents such as an antibody directed against the particular protein of interest or readily available protein staining techniques. Other analytical methods could also be used to quantify protein expression levels, such as quantitative mass spectrometry or microscopic methods in combination with labelling of said protein of interest.
  • RNA or DNA preferably DNA
  • protein expression is well known to a skilled person and represents protein production via translation of the corresponding optionally spliced RNA encoding the protein of interest.
  • Gene expression or “protein expression” may be used interchangeably.
  • a skilled person understands that a gene may be first "expressed” by transcription, the RNA optionally spliced or otherwise processed, leading ultimately to translation of the transcribed coding RNA.
  • the protein expression level is considered to depend on and/or relate to the amount of corresponding optionally spliced RNA produced via transcription and/or splicing. Transcription levels may therefore also be used as an indication of gene or protein expression.
  • DD DNA/DNA
  • RD sense RNA/DNA
  • DR antisense RNA/DNA
  • Figure 2 A pattern of thermodynamic stability for C. elegans and H. sapiens transcripts. a. Intensity plot of ΔG of DNA/DNA and RNA/DNA duplexes of the 50 bp sequences
  • Figure 3 Nucleotide distribution and thermodynamic patterns of 3'-splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens transcripts.
  • a. Mean values of the ΔG bias surrounding the 3'-splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens transcripts.
  • b. Intensity plot of the ΔG bias of the sequences surrounding 3'-splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens transcripts.
  • the results for the 5'- splice sites are aligned with respect to the downstream 5'-alternative splice sites, indicated with a black line.
  • the upstream 5'-alternative splice sites are indicated with a white line.
  • the results for 3'-splice sites are aligned with respect to the upstream 3'-alternative splice sites, indicated with a black line.
  • the downstream 3'-alternative splice sites are indicated with a white line.
  • Figure 5 Thermodynamic stability patterns of 3'- splice sites of the constitutively and alternatively spliced exons in H. sapiens. a. A scheme of the alternative splicing events in panel B-D. The positions of the calculated sites are depicted by arrowheads.
  • c. The mean value of the ΔG bias across the sequences surrounding the 3'-splice sites of 3'-alternatively spliced exons and constitutive exons. 10th percentile is 10% of 3'-alternative splice sites with the lowest splicing level. 90th percentile is 10% of 3'-alternative splice sites with the highest splicing level. d. Mean values of the ΔG bias across the sequences surrounding the 3'-splice sites of cassette exons and constitutive exons.
  • Figure 7 Thermodynamic stability patterns near the 3'-splice sites and the Branch point of both real and pseudo exons in H. sapiens.
  • a. Mean values of the ΔG bias across the 50 bp sequences, surrounding the Branch point of 35 exons of H. sapiens.
  • the coordinates at the horizontal axis indicate the positions with respect to the Branch point.
  • Figure 8 Pre-Spliceosome assembly onto the 3'-splice site of AdML RNA exon 2 with or without annealing of antisense DNA oligonucleotides in vitro.
  • a-c Schematic depiction of the experimental setup for the results shown in panel D. Black - AdML RNA transcript intron 1/exon 2 junction; lower case - intronic region, upper case - exonic region; blue - branch point; underlined - polypyrimidine tract; red - 3'- splice site; green - DNA antisense strand for either the splice site plus the exon, the branch point plus the polypyrimidine tract (BP-PPT), or the DNA/DNA biased region.
  • BP-PPT polypyrimidine tract
  • Pre-spliceosome A complex formation when antisense DNA oligonucleotide is annealed to exon 2 of AdML RNA.
  • n.F. Normalized fluorescence intensity
  • Figure 10 Thermodynamic patterns of the 25 bp sequences, surrounding 1000 5'-and 3'-splice sites of C. elegans transcripts.
  • a. The mean value of ΔG of RNA/DNA duplexes across the 50 bp sequences, surrounding the 5'-and 3'-splice sites, fully situated in coding, 3'-UTR and 5'-UTR of C. elegans transcripts.
  • b. The mean value of ΔG bias across the 50 bp sequences, surrounding the 5'-and 3'-splice sites, fully situated in coding, 3'-UTR and 5'-UTR of C. elegans transcripts.
  • Figure 13 A comparison of the thermodynamic pattern across the trans-splice sites and cis-3'- splice sites of C. elegans transcripts.
  • Figure 14 A comparison of the thermodynamic pattern across the trans-spliced start sites and non-spliced start sites of C. elegans transcripts.
  • a. The mean value of ΔG of RNA/DNA duplexes across the 50 bp sequences, surrounding the 5'-and 3'-splice sites, fully situated in coding, 3'-UTR and 5'-UTR of H. sapiens transcripts.
  • b. The mean value of ΔG bias across the 50 bp sequences, surrounding the 5'-and 3'-splice sites, fully situated in coding, 3'-UTR and 5'-UTR of H. sapiens transcripts.
  • Figure 17 Thermodynamic pattern of DNA/DNA and RNA/DNA duplex stability surrounding the splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens,
  • Figure 19 A nucleotide composition and pattern of thermodynamic stability of 3'- splice sites of constitutively and alternatively spliced exons of H. sapiens.
  • Figure 20 Nucleotide composition and pattern of thermodynamic stability across the 3'- splice sites of constitutively and 3'-alternatively spliced exons of H. sapiens.
  • Figure 21 Nucleotide composition and pattern of thermodynamic stability across the 3'- splice sites of constitutive and cassette exons of H. sapiens.
  • Figure 22 Nucleotide composition and pattern of thermodynamic stability of cryptic 3'-splice sites of H. sapiens transcripts.
  • a. The mean value of ΔG bias and nucleotide composition across the 50 bp sequences, surrounding authentic 3'-splice sites and corresponding cryptic 3'-splice sites, located upstream from authentic 3'-splice sites.
  • b. The mean value of ΔG bias and nucleotide composition across the 50 bp sequences, surrounding authentic 3'-splice sites and the corresponding cryptic 3'-splice sites, located downstream from authentic 3'-splice sites.
  • thermodynamic stability of RNA/DNA versus DNA/DNA duplexes we performed high-resolution measurement of the melting temperature (Tm) and compared it to the calculated mean value of ΔG for the nearest-neighbor interactions of RNA/DNA 25 and DNA/DNA 23 duplexes for three variants of the human IL2 gene 15 , namely wIL2 (27% GC content), natural IL2 (39% GC content) and eIL2 (60% GC content).
  • Tm melting temperature
  • thermodynamic parameters for DNA/DNA duplexes and RNA/DNA duplexes 25 .
  • Our results suggest that calculation of AG using these parameters allows for accurate comparison of the difference in thermodynamic stability between RNA/DNA and DNA/DNA duplexes.
  • RNA/DNA and DNA/DNA duplex stability we performed high-resolution melting analysis of another variant of the IL2 gene 15 called eIL2-IL2.
  • Half of the eIL2-IL2 sequence is identical to the corresponding sequence of eIL2 (60% GC content) and the other half originates from IL2 (39% GC content).
  • both RNA/DNA and DNA/DNA duplexes dissociate in two steps as a result of the large difference between the thermodynamic stability of the eIL2 part and the IL2 part of the eIL2-IL2 gene (Fig. 9b).
  • thermodynamically stable RNA/DNA duplexes in exonic sequences in comparison with the corresponding intronic sequences (Fig. 1c).
  • the correlation was even more compelling, with only 43 out of 10960 (0.4%) having higher RNA/DNA duplex stability in their intronic sequences.
  • transcripts classified as "predicted" higher intronic compared to exonic RNA/DNA duplex stability was over 20 times more frequent (401 out of 4167, or 9.6%) than in the "confirmed" group, which may indicate false positive transcript predictions. This suggests that the thermodynamic parameters can be useful in refining de novo gene structure prediction in C. elegans.
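The frequencies quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that the 43 out of 10960 figure refers to the "confirmed" transcript group and the 401 out of 4167 figure to the "predicted" group, and the variable names are illustrative.

```python
# Counts quoted in the text; the assignment of 43/10960 to the
# "confirmed" group is an assumption made for this check.
confirmed_hits, confirmed_total = 43, 10960
predicted_hits, predicted_total = 401, 4167

confirmed_fraction = confirmed_hits / confirmed_total
predicted_fraction = predicted_hits / predicted_total

# How many times more frequent the pattern is among "predicted" transcripts.
fold_difference = predicted_fraction / confirmed_fraction

print(f"confirmed: {confirmed_fraction:.2%}")
print(f"predicted: {predicted_fraction:.2%}")
print(f"fold difference: {fold_difference:.1f}")
```

Under that assumption the ratio of the two fractions is consistent with the "over 20 times" statement.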
  • thermodynamic profile of the 50 bp upstream and the 50 bp downstream of the transcript start sites, the 5'- and 3'- splice sites and the ends of the 3'- UTRs of all C. elegans transcripts (Fig. 2a, b). This approach allows alignment and color-coded representation of the thermodynamic profile of the individual sequences with respect to the regions responsible for RNA processing ( Figure 2).
  • RNA polymerase maintains a 9 bp RNA/DNA duplex in the transcription bubble during elongation
  • window size 9 bp for ΔG calculations.
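A sliding-window nearest-neighbor ΔG calculation of the kind referred to above might be sketched as follows. The per-dinucleotide energies here are toy placeholders generated from GC content, not the published DNA/DNA or RNA/DNA parameters cited in the text; the function names, the use of the coding-strand DNA sequence for both duplex types, and the sign convention of the bias (RNA/DNA minus DNA/DNA) are all assumptions made for illustration.

```python
from typing import Dict, List


def toy_params(gc_weight: float) -> Dict[str, float]:
    """Placeholder dinucleotide-step energies (NOT published values).

    Each step gets a more negative value the more G/C bases it contains.
    """
    bases = "ACGT"
    return {
        a + b: -(1.0 + gc_weight * sum(x in "GC" for x in a + b))
        for a in bases
        for b in bases
    }


def window_dg(seq: str, params: Dict[str, float], window: int = 9) -> List[float]:
    """Sum nearest-neighbor step energies in every window of `window` bases.

    Raises KeyError for characters outside A, C, G, T.
    """
    seq = seq.upper()
    if len(seq) < window:
        return []
    steps = [params[seq[i:i + 2]] for i in range(len(seq) - 1)]
    # A window of `window` bases contains `window - 1` dinucleotide steps.
    return [sum(steps[i:i + window - 1]) for i in range(len(seq) - window + 1)]


def dg_bias(seq: str, rd: Dict[str, float], dd: Dict[str, float],
            window: int = 9) -> List[float]:
    """Per-window difference, RNA/DNA minus DNA/DNA (assumed convention)."""
    return [r - d for r, d in zip(window_dg(seq, rd, window),
                                  window_dg(seq, dd, window))]


if __name__ == "__main__":
    dd_params = toy_params(0.5)  # stand-in for a DNA/DNA table
    rd_params = toy_params(0.7)  # stand-in for an RNA/DNA table
    example = "ATGGCGCTTTTTTTCAGGCA"  # hypothetical sequence
    print(window_dg(example, dd_params))
    print(dg_bias(example, rd_params, dd_params))
```

To use real data, the two toy tables would be replaced by the nearest-neighbor parameter sets cited in the text, keyed by dinucleotide step in the same way.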
  • introns are less stable than exons (Fig. 2a, b and Fig. 10)
  • the least stable RNA/DNA duplex is located upstream of the 3'-splice site, and includes the polyU tract 27 characteristic of the 3'-consensus in C. elegans (Fig. 3). This region is directly followed by a more stable region, which includes the 3'-splice site.
  • At the 5'-splice site a more stable region at the exon-intron boundary is followed by an intronic region with a significantly lower RNA/DNA duplex stability (Fig. 2a, b).
  • the pattern is similar, but not as pronounced for DNA/DNA duplex stability of the same sequences (Fig. 2a, b).
  • the DNA/DNA bias is most significant at both intron ends (Figure 2d, e and Fig. 11a).
  • next to these regions are situated the regions with the strongest RNA/DNA bias, which include the 5'- and 3'-splice sites.
  • We considered the possibility that the observed pattern is a consequence of the fact that most exons contain protein-coding sequences that have specific sequence constraints.
  • thermodynamic stability distribution is related to RNA splicing, a process which occurs co-transcriptionally and can be directly affected by the thermodynamic stability of the DNA/DNA and RNA/DNA duplexes.
  • thermodynamic stability profile at the start sites of the transcripts (Fig. 2a, b).
  • the 5'-end of about 70% of the mRNAs is produced in a process called "trans splicing", whereby the transcript gets spliced to a short "splice leader" RNA in a process similar to normal intron removal 28 (cis-splicing).
  • the start sites of the trans-spliced transcripts have the characteristic thermodynamic profile of a 3'-splice site (Fig. 2a, b and Fig. 13), and differ from the conventional transcript start sites (Fig. 14).
  • the identical thermodynamic properties of cis- and trans- 3'-splice sites suggest that they could contribute to the mechanism of RNA splicing.
  • thermodynamic stability patterns we calculated the free energy of 50 bp upstream and the 50 bp downstream of the same sites in the human genome (Fig. 2b, c).
  • human transcripts both RNA/DNA and DNA/DNA duplexes are more stable compared to C. elegans (Fig. 2a, b, c).
  • the second difference observed is that human transcripts are RNA/DNA biased, except the regions upstream of the 3'-splicing sites and the 3'-UTRs.
  • the 5'-UTR of human transcripts are extremely stable. We analyzed all first and second exon pairs that fall into 5'-UTRs.
  • the thermodynamically stable region includes the first exon of the 5'-UTR and propagates to the intronic sequence, but does not reach the 3'-part of the intron and the second exon, even though it is also part of the 5'-UTR (Fig. 15 and 16). It is known that the first exon and intron of genes possess specific features compared to the rest of the exon-intron pairs, such as shorter exon length 29 , longer intron length 30 and characteristic DNA methylation 31 and histone modification profiles 32 . However, the relationship between these characteristics and the higher stability of the first exon and intron and the biological significance of these correlations are still unclear.
  • thermodynamic profiles for H. sapiens and C. elegans share three important features.
  • the region of lowest RNA/DNA duplex stability within the transcript is situated in intronic sequences upstream of 3'-splice sites in both species (Fig. 2a, b, c).
  • this region possesses the strongest DNA/DNA bias (Fig. 2a, b, c and Fig. 11b).
  • the 3'- and 5'-splice sites have the strongest RNA/DNA bias. Similar to C. elegans, in H. sapiens these patterns are observed in both coding and non-coding genes (results not shown) and in coding and untranslated regions of the coding genes (Fig.
  • RNA/DNA bias throughout the entire pool of measured transcripts was detected at the 3'- and 5'-splice sites of all species. These patterns are statistically significant with p-value < 2e-15 (Materials and Methods).
  • The Polypyrimidine tract, a degenerative, pyrimidine-rich sequence required for intron-exon recognition, is situated in the region of strongest DNA/DNA bias. It is remarkable that regions with significant differences in nucleotide composition among species would possess such a uniform DNA/DNA bias (Fig. 3a, b, c and Fig. 18).
  • RNA/DNA bias for the 5'-alternative splice sites is also detectable, although less pronounced than the profile near 3'-splice sites (Fig. 4d).
  • Some alternative splicing events are subject of intensive cell type and cell cycle specific regulation, which allows differential expression of the splice variants.
  • a substantial fraction of the alternative splicing events result in low abundance alternative transcripts without detectable biological function.
  • Such minor transcript variants are believed to be a result of splicing of inefficient splice sites. If the thermodynamic stability profile near the splice sites plays an important role in the splicing reaction, the constitutive splicing sites will possess more pronounced biases in the thermodynamic profiles than alternative splice sites of low splicing efficiency.
  • thermodynamic stability profile near the 3'-splice sites of human constitutive exons, retained introns, cassette (skipped) exons and 3'-alternative splice sites (Fig. 5a, b and Fig. 19).
  • the cassette exons that are present in one but skipped in other transcripts of the same gene possess similar thermodynamic stability profiles as the constitutive exons (Fig. 5b).
  • the retained introns are believed to be a result of recognition failure of the weak splice sites that flank the introns 34 . Those weak splice sites could be a result of the low DNA/DNA bias of the region.
  • To assess if there is a link between the thermodynamic stability and the alternative splice site usage we compared the 10% of 3'-alternative splice sites with either the lowest (10th percentile) or the highest splicing levels (90th percentile) 35 . Our results reveal that the 3'-alternative splice sites with the lower splicing levels possess lower DNA/DNA bias upstream of the 3'-splice sites than the average 3'-alternative splice sites (Fig. 5c and Fig. 20).
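The grouping described above, taking the 10% of sites with the lowest and the 10% with the highest splicing levels, could be sketched like this. The site identifiers, the splicing-level values and the function name are hypothetical, and the rank-based cut is one reasonable reading of the percentile split rather than the procedure actually used.

```python
from typing import Dict, List, Tuple


def extreme_groups(levels: Dict[str, float],
                   percent: int = 10) -> Tuple[List[str], List[str]]:
    """Return (lowest, highest) groups, each holding `percent`% of sites by rank."""
    ranked = sorted(levels, key=levels.get)  # ascending splicing level
    k = max(1, len(ranked) * percent // 100)
    return ranked[:k], ranked[-k:]


if __name__ == "__main__":
    # Hypothetical splicing levels for 20 made-up 3'-alternative splice sites.
    demo = {f"site{i}": i / 20 for i in range(20)}
    lowest, highest = extreme_groups(demo)
    print("lowest 10%:", lowest)
    print("highest 10%:", highest)
```

The mean ΔG bias profile of each group could then be compared with that of all 3'-alternative splice sites.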
  • the 3'-alternative splice sites with higher splicing level possess higher DNA/DNA bias than the average 3'-alternative splice sites and are similar to the constitutive splice sites. Even the cassette exons with the lower splicing levels possess significantly diminished
  • thermodynamic profile near the 3'-splice sites of 5'-alternative splice exons was not studied.
  • thermodynamic profile near the 3'-splice sites of higher and lower splicing level of 5'-alternative splice exons suggests that the DNA/DNA bias upstream of the 3'-splice sites could be involved in 3'-splice site recognition.
  • Metazoan transcripts contain large numbers of "cryptic" splice sites that are inactivated or used at low levels due to suppression from nearby and advantageous authentic splice sites 36 .
  • thermodynamic properties of the two alternative sites can influence the splice site selection under normal and disease conditions we compared the thermodynamic profile of disease related cryptic 3'-splice sites 36 , with their corresponding authentic splice sites.
  • the cryptic splice sites can occur both upstream (in the intron) and downstream (in the exon) of the authentic 3'-splice site.
  • Pol II together with the Spliceosome factors will first encounter the cryptic 3'-splice sites and then the authentic splice sites.
  • the DNA/DNA bias is less pronounced at the region situated upstream of cryptic 3'- splice sites than at the same regions of the authentic splice sites.
  • RNA/DNA bias is not as strong at cryptic 3'-splice sites as at authentic splice sites.
  • thermodynamic profile of the 3'-splice sites are less pronounced in the cryptic 3'- splice sites, which could allow them to be bypassed by the Pol II without leading to a splicing event when the authentic splice site is functional.
  • NTRK1 Congenital insensitivity to pain with anhidrosis -41 IVS4-1G>C
  • the BP sequence is a degenerative signal situated several nucleotides upstream of the PPT 39 .
  • the free energy near 35 identified human BPs 40 Our results show that on average the BP is located upstream of the maximum DNA/DNA bias point but is still within the DNA/DNA bias region (Fig. 7a, b).
  • Pseudoexons are regions in the human genome flanked by sequences that resemble authentic splicing regulatory signals but are not spliced into mature mRNAs 8-10 .
  • Previous work 9 has provided a reliable dataset of such sequences, derived using a consensus score based on a position-specific weight matrix, obtained by aligning a large number of real splice sites 10 .
  • RNA/DNA annealing at DNA/DNA biased regions impedes splicing
  • topoisomerase which removes negative supercoiling generated behind RNA polymerases during transcription, also suppresses R-loop formation 45,46 .
  • the accumulation of negative supercoiling in Topo I-deficient cells is supposed to weaken DNA/DNA duplexes and to facilitate both re-annealing of mRNA to the DNA template strand and R-Loop formation.
  • DNA/DNA bias the detected lower stability of RNA/DNA versus DNA/DNA duplexes
  • upstream of the 3'-splicing site prevents the re-annealing of mRNA after transcription to allow spliceosome assembling.
  • RNA is incorporated into a nonspecific heterogeneous nuclear ribonucleoprotein H complex that does not require ATP and functional splicing regulatory sequences.
  • leads to re-arrangement of RNA into a pre-spliceosomal complex A as a result of binding of U2AF65, U2AF35 and U2 snRNP to the PPT, 3'-splice sites and BP, respectively 47 . Consequently, a B complex is formed by association of U4/U5/U6 tri-snRNP followed by formation of the active Spliceosome complex (Complex C) as a result of new rearrangements and the incorporation of the U2/5/6 snRNPs 50 .
  • Complex C active Spliceosome complex
  • annealing of antisense DNA oligonucleotides to RNA at the 3'-splice site and the start of the exon does not influence complex A and H formation (Fig. 8).
  • the inhibition of complex A formation by annealing of DNA to the DNA/DNA biased region further suggests that a strong DNA/DNA bias in this region is required to prevent the annealing of DNA to RNA in order to ensure spliceosome assembly.
  • Such a mechanism is also supported by the research of Krainer and collaborators 53 , who studied the splicing of SMN2 exon 7. They showed that the annealing of chimeric antisense oligonucleotides across the PPT sequence of the RNA of intron 6 leads to inhibition of splicing of exon 7 both in vivo and in vitro.
  • H. sapiens melanogaster and D. rerio, H. sapiens (Fig. 3).
  • H. sapiens transcripts we further looked into the nucleotide usage near 3'-splice sites for constitutive and alternative splice sites (Fig. 5 and Fig. 19, 20, 21), cryptic and authentic splice sites (Fig. 22) and real exons and pseudoexons (Fig. 7). The comparison of these results with the
  • thermodynamic stability of nucleotide duplexes has two components.
  • the first component is the forces of the hydrogen interaction between complementary bases and this component strongly depends on the nucleotide composition.
  • the second component is the stacking interaction between the bases, which depends mainly on the neighboring di-nucleotide distribution. Therefore, the nucleotide distribution and its thermodynamic properties are interrelated.
  • the polypyrimidine/polyuridine tract is inevitably DNA/DNA biased and at the same time a strong DNA/DNA bias is impossible without an abundance of pyrimidine. While the sequence properties of the polypyrimidine region are known, the fact that they would lead to a specific DNA/DNA bias has been previously overlooked.
  • the potential of the DNA/DNA bias to increase the accessibility of the polypyrimidine tract by preventing message/template annealing can enable the recruitment of U2AF65 to its preferred substrate and the two mechanisms may act in parallel to ensure proper assembly of the splicing machinery.
  • the present invention demonstrates that the least stable RNA/DNA duplexes as compared to the respective DNA/DNA duplexes are situated upstream from the 3'-splice site where the polypyrimidine tract is situated. This characteristic instability is less pronounced in weak alternative splice sites and disease-associated cryptic 3'-splice sites.
  • the essential splicing factor U2AF65 is specifically recruited to these regions of the pre-mRNA despite their relatively poor sequence conservation 54 .
  • the results of the experimental examples suggest that the higher instability of RNA/DNA in comparison to the respective DNA/DNA duplexes in this region can prevent the re-annealing of mRNA to DNA.
  • the resulting mRNA/DNA melting can contribute as a mechanism to allow binding of U2AF65 and SF1 to mRNA in order to initialize the primary steps of Spliceosome assembly.
  • Decreasing the level of expression of virus genes by decreasing the DNA/DNA and RNA/DNA duplex stability of the coding sequence can decrease the fitness of the virus (or other pathogens), thereby attenuating the virus.
  • Decreasing the DNA/DNA and RNA/DNA duplex stability of the virus genes will preferably involve introducing multiple, more preferably hundreds of, synonymous mutations into a pathogen in order to minimize the chances of regaining virulence via recombination with other viral sequences present in the host genome.
  • the methods described herein can be used to create attenuated vaccines with improved genetic stabilities by de novo synthesis of virus genomes with altered DNA/DNA and RNA/DNA duplex stability of some of its genes and regulatory sequences thereof.
  • An example of the de novo design of such a sequence may be carried out as follows: a) Measurement of thermodynamic stability of RNA/DNA and DNA/DNA duplexes of any one or more genomic regions of interest (such as genes) of the virulent virus, b) Design of one or more, preferably several, modified genomes of the virus by diminishing thermodynamic stability of RNA/DNA duplex of one or more genomic regions of interest (such as its genes), c) De novo synthesis of the modified virus genome(s) according to common de novo synthesis techniques, for example by chemical oligonucleotide synthesis, and/or annealing based connection of oligonucleotides, d) Measurement of the level of the virulence of the viruses with modified genomes, for example via standard techniques used by immunologists for assessing virulence, such as tissue or cell culture virulence assays, e) Detection or measurement of a correlation between thermodynamic stability of RNA/DNA duplex of the viruses' genomes and
  • An example of enhanced gene expression may be carried out as follows: a) Measurement of the thermodynamic stability of RNA/DNA and DNA/DNA duplexes of the gene of the desired protein for production, b) Design of one or more, preferably several, modified genes by increasing the thermodynamic stability of RNA/DNA duplex of its coding sequences, and/or by insertion of intronic sequences with lower ΔG ratio between RNA/DNA and DNA/DNA stability downstream of 3'-splice sites, c) De novo synthesis of the modified genes, for example using the techniques mentioned herein, d) Measurement of the level of expression of the modified genes, for example using the
  • e) Detection or measurement of a correlation between thermodynamic stability of RNA/DNA duplex of the coding sequences of the genes and the level of expression, and/or f) Detection or measurement of a correlation between the ratio of thermodynamic stability of RNA/DNA and DNA/DNA of the inserted introns and the level of expression of the genes, and g) Selection of desired modified sequence for further application based on protein or mRNA expression.
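The design step b) above, modifying a coding sequence to change duplex stability without changing the encoded protein, could be sketched as an enumeration of synonymous variants ranked by a stability score. This is an illustration under stated assumptions: the codon table is deliberately partial (five amino acids only), the scoring function uses toy GC-based step energies rather than published nearest-neighbor parameters, and all names and the example sequence are hypothetical.

```python
from itertools import product
from typing import Dict, List

# Partial standard genetic code, enough for the short demo sequence only.
SYNONYMS: Dict[str, List[str]] = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}


def toy_step_dg(step: str) -> float:
    """Placeholder dinucleotide-step energy (NOT a published parameter)."""
    return -(1.0 + 0.5 * sum(base in "GC" for base in step))


def total_dg(seq: str) -> float:
    """Sum of toy step energies over the whole sequence."""
    return sum(toy_step_dg(seq[i:i + 2]) for i in range(len(seq) - 1))


def translate(seq: str) -> str:
    """Translate an in-frame sequence using the partial table above."""
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def synonymous_variants(seq: str) -> List[str]:
    """All sequences encoding the same amino acids (exponential in length)."""
    protein = translate(seq)
    return ["".join(codons) for codons in product(*(SYNONYMS[aa] for aa in protein))]


def most_stable_variant(seq: str) -> str:
    """Variant with the most negative toy score, i.e. highest assumed stability."""
    return min(synonymous_variants(seq), key=total_dg)


if __name__ == "__main__":
    original = "ATGAAAGCT"  # hypothetical sequence encoding M-K-A
    best = most_stable_variant(original)
    print(original, total_dg(original))
    print(best, total_dg(best))
```

Full enumeration is only feasible for very short sequences; for a real gene a heuristic search would be needed, and the candidates would still have to be synthesized and tested for expression as described in steps c) to g).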
  • An example of producing an enhanced gene therapy vector may be carried out as follows: a) Measurement of thermodynamic stability of RNA/DNA and DNA/DNA duplexes of one or more regions of the gene of the protein of interest, b) Design of one or more, preferably several, modified genes by increasing thermodynamic stability of RNA/DNA duplex of its coding sequences and/or by insertion of intronic sequences with a lower ratio between RNA/DNA and DNA/DNA stability downstream of 3'- splice sites, c) De novo synthesis of the modified genes, d) Measurement of the level of expression of the modified genes in an appropriate model system, either in vitro or in vivo, and e) Detection or measurement of a correlation between thermodynamic stability of RNA/DNA duplex of the coding sequences of the genes and the level of expression, and/or f) Detection or measurement of a correlation between ratio of thermodynamic stability of RNA/DNA and DNA /DNA of the inserted introns and the level of expression of the genes, and
  • Cryptic 3'-splice sites and their corresponding authentic 3'-splice sites were obtained from DBASS3 36 .
  • Branch point sequences were obtained from 40 . Only branch points confirmed by a minimum of 3 lariat RT-PCR clones were considered.
  • Pseudo exon and real exon datasets were obtained from 10 .
  • Datasets for cassette exons, 3'-alternative splice sites, 5'- alternative splice sites and constitutive exons were obtained from HEXEvent database 35 .
  • the retained intron dataset was obtained from the UCSC genome Table browser 59 .
  • Double-stranded cDNA of the wIL2, IL2 and eIL2 genes was amplified by PCR from plasmids pcDNA3-wIL2, pcDNA3-IL2 and pcDNA3-eIL2, respectively 15 .
  • IL2 and eIL2 was generated as follows: mRNA was produced by in vitro transcription by T7 RNA polymerase of the respective cDNA. Single-stranded DNA (ssDNA) was produced by digestion with the Lambda exonuclease (NEB) of a double-stranded PCR product with a 5'-phosphate attached to the strand that was to be removed. Finally, mRNA and template ssDNA of the respective gene were annealed after initial denaturation and decreasing the temperature to
  • the probe was loaded on a mini gel, composed of 0.5% agarose, 4% acrylamide, 0.05% bis-acrylamide, 50 mM Tris and 50 mM glycine. The gel was run for 2 h in 50 mM Tris and 50 mM glycine buffer. The gel was dried and exposed with a Kodak BioMax MR film or PhosphorImager screen. Statistics
  • RNA binding protein RNPS1 alleviates ASF/SF2 depletion-induced genomic instability. RNA 13 (12), 2108-2115 (2007).
  • Multi-domain conformational selection underlies pre-mRNA splicing regulation by U2AF. Nature 475 (7356), 408-411 (2011).

Landscapes

  • Genetics & Genomics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Organic Chemistry (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Zoology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Plant Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Pharmaceuticals Containing Other Organic And Inorganic Compounds (AREA)
  • Medicines That Contain Protein Lipid Enzymes And Other Medicines (AREA)

Abstract

The invention relates to a method for the design and/or modification of synthetic or naturally occurring nucleic acid sequences. The method is suitable for obtaining nucleic acid sequences with a defined level or enhanced predictability of protein expression by adjustment of the thermodynamic stability of the relevant nucleic acid sequence by sequence adjustment. The invention relates to a method for modulating gene expression, more specifically for modulating protein expression levels of any given nucleic acid sequence, by adjusting the thermodynamic stability of the corresponding RNA/DNA and DNA/DNA duplexes via sequence change of the gene encoding the protein to be expressed.

Description

METHOD FOR MODULATING GENE EXPRESSION
DESCRIPTION
The invention relates to a method for the design and/or modification of synthetic or naturally occurring nucleic acid sequences. The method is suitable for obtaining nucleic acid sequences with a defined level or enhanced predictability of protein expression by adjustment of the thermodynamic stability of the relevant nucleic acid sequence by sequence adjustment. The invention relates to a method for modulating gene expression, more specifically for modulating protein expression levels of any given nucleic acid sequence, by adjusting the thermodynamic stability of the corresponding RNA/DNA and DNA/DNA duplexes via sequence change of the gene encoding the protein to be expressed.
BACKGROUND OF THE INVENTION
The de novo design of genes and re-design of exogenous genes with predictable levels of RNA expression and splicing is required for successful gene therapy and protein production. The levels of transcription and mRNA processing are subject to precise species- and cell-specific regulation. Until now, the expression of exogenous genes has commonly been performed by cloning of natural coding sequences, whereby the modulation of gene expression is achieved by codon optimization that takes into account the level of codon usage in the host. One example of optimization is found in WO 2006/126070 A2, in which codon-optimized activated protein C is disclosed. Further developments and insights into regulatory events for both overall transcription levels and splicing events are required for improved design of genes in the biotechnology industry.
RNA splicing is coupled with transcription and catalyzed by the spliceosome, a nucleoprotein complex that acts to remove introns and re-join exonic sequences in order to create functional RNAs [1]. This process involves recognition of the 5'-splice sites by U1 snRNPs (small nuclear ribonucleoproteins) [2,3]. Recognition of the 3' intron-exon boundary requires SF1 to interact with the branch point sequence (BP) [4,5], U2AF65 [6] with the polypyrimidine tract (PPT) and U2AF35 with the 3'-splice site [6,7]. U2AF recruits the U2 snRNP, which replaces SF1 to bind the BP and catalyzes exon re-joining [7]. Although all these steps have been intensively studied, the molecular mechanisms underlying recognition of such degenerate sequences as the BP and PPT are still unclear. For example, thousands of sequences in the human genome called pseudo exons are flanked by regions that resemble the consensus splice sites but are never spliced into mature mRNAs [8-10]. This raises the question whether exonic and intronic sequences are endowed with other attributes that can contribute to intron-exon recognition, or to expression levels, as such. The introduction of an intron has been shown to modulate expression levels of a coding region, although this phenomenon is dependent on the particular promoters and introns involved. For example, the introduction of the first intron from EF-1 alpha directly downstream of the MCMV promoter led to an enhanced expression level of a luciferase reporter gene [61].
This lack of clarity regarding potential functions of exonic and intronic sequences poses a significant hurdle during the design of gene sequences for application in biotechnological endeavours that require reliable regulation of splicing events.
As it reads the information encoded throughout the genome, RNA polymerase II can travel along the DNA template for thousands and even hundreds of thousands of nucleotides. In the process, it encounters the physical forces of DNA/DNA and RNA/DNA pairing that can vary significantly depending on the local sequence composition. It has been shown [11-13] that the 5'- and 3'-UTRs, introns and exons have characteristic guanine/cytosine (GC) content, which could affect RNA transcription and processing. Nucleotide composition could influence protein recruitment [12], RNA secondary structure [13], transcription rate [14,15], DNA melting [16] or RNA/DNA and DNA/DNA duplex stability [17]. The physical and structural properties of nucleic acid sequences represent - either alone or in combination with the coding information - poorly understood and potentially complicated factors that make the accurate design of nucleic acid sequences, in particular for determining the expression level of particular genes, unreliable, amongst other disadvantages. The thermodynamic stability of nucleotide duplexes has two components. The first component is the hydrogen bonding between complementary bases, which depends strongly on the nucleotide composition. The second component is the stacking interaction between the bases, which depends mainly on the neighbouring dinucleotide distribution. Therefore, the nucleotide distribution and its thermodynamic properties are interrelated. The de novo design of genes and re-design of exogenous genes with predictable levels of expression and/or splicing is therefore difficult.
Despite empirical methods that have been used for the development of methods for the prediction of the stability of DNA/DNA or RNA/DNA hybrid duplexes [62], the biological function of the thermodynamic properties of such duplexes represents a significant unknown factor that confounds the design of de novo protein-coding sequences for optimized expression or splicing characteristics.
The free energy (ΔG) necessary to unwind polynucleotide duplexes with defined length can be calculated from the measured values of entropy (ΔS) and enthalpy (ΔH) for nearest-neighbour DNA/DNA [18-23] or RNA/DNA [24,25] interactions. Previous work in this area has shown that exons possess more stable RNA/DNA duplexes than introns in Saccharomyces cerevisiae. However, this earlier body of work did not allow for direct comparisons between the stability of RNA/DNA and DNA/DNA duplexes because the ΔS and ΔH parameters used for DNA/DNA duplexes [22] lead to significant overestimation of their ΔG in comparison with the ΔG of RNA/DNA duplexes. In light of this drawback, direct comparison between the ΔG value for DNA/DNA duplex stability and the ΔG value for RNA/DNA duplex stability for any given specified sequence region has until now not been possible. The calculation of a "ΔG ratio" between the ΔG value for DNA/DNA duplex stability and the ΔG value for RNA/DNA duplex stability has been neither disclosed nor suggested in the prior art.
SUMMARY OF THE INVENTION
The invention relates in a preferred embodiment to a method for de novo design of synthetic genes and/or modification of naturally occurring genes in order to obtain nucleic acid sequences with a defined level or enhanced predictability of mRNA expression, in addition to defined expression levels of splicing variants by modulation of thermodynamic stability of RNA/DNA and DNA/DNA duplexes. The genes designed and modified according to the present invention can be used in order to create and improve gene therapy constructs and vaccines and to enhance in vitro, in vivo or recombinant production of proteins, RNA and DNA.
The invention is based on the demonstration that the ratio between the stability of mRNA/DNA duplexes and DNA/DNA duplexes plays a role in transcription levels and splice events. Particular values of this ratio near 3'-splice sites are characteristic features that can contribute to intron-exon differentiation. Remarkably, throughout all transcripts, the most unstable mRNA/DNA duplexes, compared to the corresponding DNA/DNA duplexes, are situated upstream of the 3'-splice sites and include the polypyrimidine tracts. This characteristic instability is less pronounced in weak alternative splice sites and disease-associated cryptic 3'-splice sites. The present invention therefore enables modification of the thermodynamic pattern of a DNA sequence in order to modulate in vivo transcription and splicing events. One example is to prevent the re-annealing of mRNA to the DNA template behind the RNA polymerase to ensure access of the splicing machinery to the polypyrimidine tract and the branch point.
The present invention therefore represents the technical exploitation of the thermodynamic properties of nucleic acid sequences in order to modulate transcription and/or RNA splicing, by using appropriate thermodynamic parameters that allow comparison and subsequent determination of DNA/DNA duplex stability compared to mRNA/DNA duplex stability.
In light of the prior art the technical problem underlying the invention was the provision of means for designing and/or modifying nucleic acid sequences that exhibit an improved predictability in their level of expression. This problem is solved by the features of the independent claims. Preferred embodiments of the present invention are provided by the dependent claims.
Therefore, an object of the invention is to provide a method for modulating protein expression level by modifying the gene encoding said protein, comprising
a) provision of an initial DNA sequence (initial DNA) that comprises one or more sequences that encode an amino acid sequence of the protein to be expressed,
b) determination of the ΔG ratio (ΔG of DNA/DNA duplex stability to ΔG of RNA/DNA duplex stability) for one or more regions of said initial DNA sequence, and
c) modification of said initial DNA sequence to provide a product DNA sequence (product DNA) with a desired (pre-determined) or improved ΔG ratio,
wherein the protein expression level is dependent on (correlates with) said ΔG ratio.
Specific embodiments and examples of protein expression being dependent on changes in ΔG ratio are provided herein. The particular modification of sequence, and thereby modification of the ΔG ratio itself, for any given sequence region, is not intended as a limiting feature of the invention. The technical solution provided herein relates to the development of the ΔG ratio itself and its use in determining functional characteristics of a sequence with respect to transcription and/or splicing. The ΔG ratio may be increased or reduced in various regions of a gene (comprising for example coding regions and optionally introns), depending on the desired change in protein expression. Modulation of expression may relate to either increased or reduced expression of a protein of interest, or of a particular splice variant of interest.
The invention therefore also relates to the use of the ΔG ratio (ΔG of DNA/DNA duplex stability to ΔG of RNA/DNA duplex stability) of one or more regions of any given gene sequence in determining a functional characteristic of said sequence, for example the likelihood, frequency or rate of splicing, or the level of transcription, of any given sequence region.
The ΔG ratio of the method of the present invention is preferably determined by measurement and/or calculation of the ΔG value for DNA/DNA duplex stability and the ΔG value for RNA/DNA duplex stability for any given sequence region, and comparison and/or calculation of the ratio (or difference) between the ΔG value for DNA/DNA duplex stability of a specified sequence region and the ΔG value for RNA/DNA duplex stability of the same sequence region (ΔG ratio = ΔG DNA/DNA : ΔG RNA/DNA), wherein a sequence region with a ΔG ratio above 1 exhibits higher DNA/DNA duplex stability than RNA/DNA duplex stability and a sequence region with a ΔG ratio below 1 exhibits higher RNA/DNA duplex stability than DNA/DNA duplex stability.
In the context of the present invention, the comparison between the ΔG of DNA/DNA and the ΔG of RNA/DNA duplexes may be referred to as "ΔG bias" or "ΔG ratio". Regions with higher stability of the DNA/DNA duplex in comparison to the corresponding RNA/DNA duplexes may be referred to as "DNA/DNA biased" regions. Such regions exhibit a ΔG ratio of over 1 (ΔG ratio = ΔG DNA/DNA : ΔG RNA/DNA). Regions with higher stability of the RNA/DNA duplex than the corresponding DNA/DNA duplex will be referred to as "RNA/DNA biased" and exhibit a ΔG ratio of below 1.
Alternatively, the difference between the ΔG value for DNA/DNA duplex stability and the ΔG value for RNA/DNA duplex stability can be calculated (for example ΔG DNA/DNA - ΔG RNA/DNA) and expressed as ΔG bias. As used in the experimental examples, in particular Figures 2d and 3a, the ΔG bias has been calculated by ΔG DNA/DNA - ΔG RNA/DNA.
Therefore, sequence regions with ΔG bias of above 0 indicate "DNA/DNA biased" regions. Sequence regions with ΔG bias of below 0 indicate "RNA/DNA biased" regions.
Such values relate essentially to the same or an analogous comparison between ΔG DNA/DNA and ΔG RNA/DNA values, but may be expressed in a different form. The term ΔG ratio or ΔG bias may still in general be used for alternative numerical expressions of the comparison of ΔG values described herein.
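By way of illustration only, the two numerical expressions defined above (ΔG ratio and ΔG bias) can be sketched in Python as follows. The function names are hypothetical and not part of the disclosure. The sketch assumes that both ΔG values are given as positive unwinding energies in the same units, so that a ratio above 1 and a bias above 0 describe the same "DNA/DNA biased" regions.

```python
def dg_ratio(dg_dna_dna: float, dg_rna_dna: float) -> float:
    """ΔG ratio = ΔG DNA/DNA : ΔG RNA/DNA (both assumed positive)."""
    if dg_dna_dna <= 0 or dg_rna_dna <= 0:
        raise ValueError("expected positive unwinding free energies")
    return dg_dna_dna / dg_rna_dna


def dg_bias(dg_dna_dna: float, dg_rna_dna: float) -> float:
    """ΔG bias = ΔG DNA/DNA - ΔG RNA/DNA."""
    return dg_dna_dna - dg_rna_dna


def classify_region(dg_dna_dna: float, dg_rna_dna: float) -> str:
    """Label a region according to which duplex is the more stable one."""
    ratio = dg_ratio(dg_dna_dna, dg_rna_dna)
    if ratio > 1:
        return "DNA/DNA biased"
    if ratio < 1:
        return "RNA/DNA biased"
    return "unbiased"
```

With positive inputs the two measures always agree in direction, which is why either may be used as the comparison value.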
The free energy (ΔG) necessary to unwind polynucleotide duplexes with defined length can be calculated from the measured values of entropy (ΔS) and enthalpy (ΔH) for the 10 possible nearest-neighbour DNA/DNA [18-23] interactions, and the 16 possible RNA/DNA [24,25] interactions. The ΔG values can therefore be calculated by methods known in the art.
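A minimal sketch of such a nearest-neighbour calculation is given below, assuming the standard relation ΔG = ΔH - T·ΔS summed over all dinucleotide steps of the region. The function name, the parameter-table layout and the default temperature are assumptions made for illustration. Initiation and end-correction terms are omitted, and no parameter values are supplied here; they would have to be taken from the cited parameter sets.

```python
from typing import Dict, Tuple

# step (5'->3' dinucleotide on the DNA strand) -> (ΔH in kcal/mol, ΔS in cal/(mol*K))
Params = Dict[str, Tuple[float, float]]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def unwinding_dg(seq: str, params: Params, temp_k: float = 310.15) -> float:
    """Free energy needed to unwind a duplex, as a sum over nearest-neighbour steps.

    A DNA/DNA table may list only the 10 unique steps; a missing step is then
    looked up via its reverse complement. An RNA/DNA table lists all 16 steps.
    """
    seq = seq.upper()
    formation = 0.0
    for i in range(len(seq) - 1):
        step = seq[i:i + 2]
        if step not in params:
            step = reverse_complement(step)
        dh, ds = params[step]
        formation += dh - temp_k * ds / 1000.0  # convert cal to kcal
    return -formation  # unwinding is the reverse of duplex formation
```

Calling the function once with a DNA/DNA table and once with an RNA/DNA table for the same region yields the two ΔG values that enter the comparison.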
The desired or pre-determined ΔG ratio of the product DNA is to be assessed and/or determined in relation to the initial DNA sequence. Changes in sequence can be made to the initial DNA sequence in order to adjust the ΔG ratio of any given sequence region to a value as desired. The ΔG ratio can be increased or decreased according to the desired outcome with respect to protein expression. The ΔG values for each of the DNA/DNA and RNA/DNA duplex stabilities may also be adjusted as such, without altering the ΔG ratio itself. The product DNA sequence can in a preferred embodiment then be assessed for protein expression, which can be determined by various in vitro or in vivo quantitative tests, such as detection of the expressed protein with an affinity reagent, such as an antibody, or by other analytical protein methods such as SDS-PAGE followed by Coomassie staining. Comparative tests between initial and product DNA sequences based on analysing protein expression, using comparable expression systems, are preferred. For example, the initial and product sequences could subsequently be introduced separately into the same expression vector, such as a plasmid or viral vector, suitable for expression in any given host organism or cell culture system. The protein expression levels can then be assessed as described herein or by methods known in the art.
The novel and surprising aspect of the invention is the recognition that analysis of the thermodynamic properties of the sequence, specifically direct comparison of the DNA/DNA duplex stability with the RNA/DNA duplex stability, in any given sequence region reveals functional characteristics of the sequence with regard to expression of the encoded protein. The solution to the technical problem stated above is therefore the utilisation of a comparison of the DNA/DNA duplex stability with the RNA/DNA duplex stability. In another specific embodiment, the solution to the technical problem is the provision of the ΔG ratio (ΔG of DNA/DNA duplex stability to ΔG of RNA/DNA duplex stability) for use in determining functional characteristics of nucleic acid sequences. Subsequent modulation of the ΔG ratio represents an active method step that utilises the surprising relationship between DNA/DNA duplex stability and RNA/DNA duplex stability in order to achieve a more reliable and in some cases pre-determined expression level.
Although mention has been previously made in the prior art that GC content or thermodynamic stability of a nucleic acid sequence influences its transcription level, to date there has been no indication that a crucial factor in determining both transcription and splicing relates to the direct comparison between DNA/DNA duplex stability and RNA/DNA duplex stability. This insight provides the basis for the methods of the present invention, and is related to the particular mechanism of the transcription and/or splicing process, as described herein. Due to the progressing polymerase or splicing machinery having to separate the DNA duplex during execution of their respective functions, the balance between the strengths of the DNA/DNA duplex formation and the RNA/DNA duplex formation is important in determining the functional output/outcome of the transcription and/or splicing molecular machinery.
The invention therefore also relates to a method for modifying a gene sequence in order to modulate protein expression by modifying the gene sequence according to the thermodynamic properties of the sequence of interest. The method for modifying a DNA sequence in order to modulate protein expression comprises a) provision of an initial DNA sequence (initial DNA) that encodes an amino acid sequence of interest, b) determination of the ΔG ratio for one or more regions of said initial DNA, and c) modification of said initial DNA sequence to provide a product DNA sequence (product DNA) with a desired ΔG ratio, wherein the protein expression level is dependent on said ΔG ratio.
In a preferred embodiment the method of the present invention for modifying a DNA sequence is carried out for a sequence that comprises one or more introns, and is characterised in that an increase of the ΔG ratio in a specified sequence region of the product DNA in comparison to the initial DNA provides increased expression of the protein encoded by said product DNA, wherein said specified sequence region is any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron. This particular embodiment of the invention relates to the modification of a sequence in order to enhance expression of a particular splice variant of any given sequence comprising one or more introns. A specified region upstream of the 3'-splice site within introns is designed to possess a more thermodynamically stable DNA/DNA duplex than RNA/DNA duplex. Immediately upstream of this region, the branch point consensus sequence will preferably be present. The level of presence of the relevant exon (downstream of said 3' splice site of the intron) in the spliced (final, to be translated) transcript will depend on DNA/DNA duplex stability in comparison with RNA/DNA duplex stability in the particular region upstream from the intron's 3'-splice site. The greater the DNA/DNA duplex stability in comparison with RNA/DNA duplex stability in this region (any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron), the more frequently (i.e. with higher efficiency) the relevant exon will be present in the mature transcript. This effect therefore leads to increased expression of the protein encoded by the modified product DNA. If the thermodynamic bias towards DNA/DNA duplex stability (higher ΔG ratio) is not present, the splice event will occur with less efficiency and reduced frequency, thereby leading to lower expression of the desired DNA sequence due to reduced transcript number for the desired splice variant.
In a preferred embodiment the method of the present invention - relating to enhanced splice efficiency - is characterised in that the ΔG ratio of the specified sequence region (preferably a region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron) in the product DNA is above 1, preferably greater than 1.05, 1.10, 1.15, 1.2, or greater than 1.3. In another embodiment the "ΔG bias" (as an alternative measure to the "ΔG ratio") of the specified sequence region (preferably a region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron) in the product DNA is greater than 0, preferably greater than 0.5, 1.0, 1.5 or more preferably greater than 2. The present invention therefore relates to a method as described herein, wherein modification of the initial DNA is carried out by insertion of one or more introns (such as synthetic introns) and corresponding splice sites.
The invention therefore also relates to separation of a coding sequence with one or more exons by inserting one or more preferably synthetic intronic sequences that can allow expression of a desired number of alternative splicing variants of the gene. The synthetic intronic sequences are designed to contain: a 5'-splice site consensus sequence, a 3'-splice site consensus sequence and a sequence with a low level of DNA/DNA and RNA/DNA duplex stability between them (in the introns). In addition, a specified region upstream of the 3'-splice site is designed to possess a more thermodynamically stable DNA/DNA duplex than RNA/DNA duplex. Immediately upstream of this region, the branch point consensus sequence will be present. The level of presence of the relevant exon (downstream of said 3' splice site of the intron) in the spliced (final, to be translated) transcript will depend on DNA/DNA duplex stability in comparison with RNA/DNA duplex stability in the particular region upstream from the intron's 3'-splice site. The greater the DNA/DNA duplex stability in comparison with RNA/DNA duplex stability (larger ΔG ratio values) in this region (preferably any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splice site of an intron), the more frequently (i.e. with higher efficiency) the relevant exon will be present in the mature transcript.
The insertion of synthetic introns therefore provides a reliable method for modifying a coding DNA sequence in order to provide one or more splice variants with fine-tuned expression levels.
It was at the time of the invention entirely unknown that by modifying the relative stability of DNA/DNA and RNA/DNA duplexes of a sequence region upstream of the 3' splice site of an intron, splicing efficiency of the corresponding intron could be controlled. The introduction of an intron therefore represents an unexpected and surprising method for producing a gene that can be tightly regulated with respect to its expression level.
In a further aspect of the invention, the method as described herein is characterised in that reduction of the ΔG ratio for any given coding region and/or 5' untranslated region (5'-UTR) of the product DNA in comparison to the initial DNA provides increased expression of the protein encoded by said product DNA. This particular embodiment relates to regulation or modulation of the expression level of a coding DNA sequence independent of splice events. This method therefore applies preferably to coding sequences that do not comprise introns to be spliced out before translation. The comparison of DNA/DNA duplex stability with RNA/DNA duplex stability has revealed that coding sequences that show lower ΔG ratio values are transcribed more highly than sequences with a higher ΔG ratio. The thermodynamic properties encompassed by this feature have been neither suggested nor disclosed in the prior art. The exploitation of these thermodynamic properties allows an assessment of transcription efficiency in light of the ΔG ratio and subsequent modification of the sequence to be expressed according to the desired expression level. In a preferred embodiment of the method described herein the ΔG ratio of the specified coding and/or 5'-UTR sequence region in the product DNA is around 1 or preferably below 1. In another embodiment the "ΔG bias" (as an alternative measure to the "ΔG ratio") of the specified sequence region in the product DNA is around or preferably below 0, preferably less than -0.5, -1.0, -1.5 or more preferably less than -2. Alternatively, the ΔG ratio for any given coding region and/or 5' untranslated region (5'-UTR) of the product DNA may be modified by sequence modification, so that the ΔG ratio is 1, or a value close to 1, that is defined by essentially similar ΔG values of DNA/DNA and RNA/DNA duplex stability, whereby the ΔG values as such for both DNA/DNA and RNA/DNA duplex stability are increased compared to those of the initial DNA sequence.
This embodiment is beneficial for coding regions (including for example for exon sequences) in order to provide increased expression of the protein encoded by the product DNA. Also in the embodiment where the ΔG values as such for both DNA/DNA and RNA/DNA duplex stability are increased compared to those of the initial DNA sequence, it is preferred that the ΔG ratio is 1, around 1 or below 1.
The correlation between more stable duplex structures and higher expression levels is also surprising in that a skilled person may have assumed that lower stability would provide less resistance to a progressing Pol II. The present invention therefore provides a method based on a concept not suggested previously.
In one embodiment the method of the present invention is characterised in that modification of the initial DNA is carried out according to or using the degeneracy of the genetic code without changing the amino acid sequence encoded by said initial DNA. It is well known that various nucleic acid triplets may code for the same amino acid. Through modification of the DNA sequence by adjusting the sequence according to the degeneracy of the nucleic acid code, the same amino acid sequence may be maintained, whilst the nucleic acids are selected with regard to their thermodynamic properties, in particular the ΔG ratio of any given sequence. This embodiment is particularly relevant for coding sequences that do not exhibit introns. In addition to utilising the degeneracy of the genetic code, standard codon optimisation procedures may additionally be considered and taken into account when designing a sequence for expression in a particular host. In one embodiment the nucleic acid sequence coding for the desired protein may be reverse transcribed, either in vitro or in silico, and the nucleic acid sequence subsequently analysed and modified/designed according to the methods described herein.
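The use of codon degeneracy described above can be illustrated by the following Python sketch, which enumerates all synonymous versions of a short coding region and keeps the one whose score lies closest to a target value. The function `recode` and its `score` argument are hypothetical; `score` stands for any callable that returns, for example, the ΔG ratio of a sequence. The exhaustive search is an assumption chosen for clarity and is only practical for short regions, since the number of variants grows exponentially with length.

```python
from itertools import product
from typing import Callable

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}
AA_TO_CODONS = {}
for _codon, _aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(_aa, []).append(_codon)


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence (length divisible by 3)."""
    dna = dna.upper()
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def recode(dna: str, score: Callable[[str], float], target: float) -> str:
    """Return the synonymous variant whose score is closest to the target."""
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    choices = [AA_TO_CODONS[CODON_TO_AA[dna[i:i + 3]]] for i in range(0, len(dna), 3)]
    variants = ("".join(codons) for codons in product(*choices))
    return min(variants, key=lambda s: abs(score(s) - target))
```

Because every candidate is assembled from codons for the same amino acids, the encoded amino acid sequence is unchanged by construction.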
The method of the present invention may also be characterised in that one or more of the steps a), b) and/or c) of the method described above are carried out by one or more computer programmes, executed on a computing device. The invention therefore relates to computational methods that essentially use simulations and/or computer representations of the nucleic acid sequences described herein.
The method can be carried out by empirical experimentation, for example by synthesis of particular sequences, empirical analysis of their ΔG values by experimental approaches known in the art, and finally by re-synthesizing a modified sequence based on nucleotides that have been adjusted or replaced in order to achieve the desired thermodynamic properties, more preferably the desired ΔG ratio. Re-analysis of the sequence in order to measure the changed ΔG is also possible. Determination of the melting point of any given sequence is one approach that represents an empirical method of determining or estimating the ΔG of a particular DNA molecule.
The method may also relate to a computer programme product, such as a software product. The ΔG values of any given sequence are preferably determined in silico through calculation of thermodynamic properties of individual nucleotides and/or longer sequences of multiple nucleotides. The computer programme product of the present invention also encompasses the features as described for the method provided herein. Further details on preferred computer-based approaches are provided in the examples and relevant references as described herein. If the method is carried out in a computer programme, for example by way of simulation, the modified sequence may subsequently be synthesized by methods known to a skilled person in a laboratory and utilised in whichever in vitro or in vivo application is desired.
The invention also relates to a method as described herein, wherein the ΔG values for DNA/DNA duplex stability and/or the ΔG value for RNA/DNA duplex stability for any given specified sequence region are determined using a sliding-window calculation of entropy (ΔS) and enthalpy (ΔH) of nearest-neighbour interactions. In another embodiment the invention relates to a method as described herein, wherein ΔG values for DNA/DNA duplex stability are calculated on 5 to 15, preferably 10, nearest-neighbour interactions, and ΔG values for RNA/DNA duplex stability are calculated on 10 to 20, preferably 16, nearest-neighbour interactions. The term "nearest-neighbour interaction" is known in the art and is described in further detail in references 18-25. In a preferred embodiment the invention relates to a method as described herein, wherein the sliding window approach utilises a 1 to 20 bp, preferably 1 bp, step size and a 1 to 20 bp, preferably 2 to 9 bp, window size. It was an unexpected finding that the window and step sizes provided herein led to beneficial results. There had previously been no indication that the thermodynamic properties of the sequence could be interrogated with sufficient resolution at these window sizes.
Similarly, the term "sliding window approach" is well known in the art with respect to genomic analysis and experimentation. A particular window size (defined in nucleotides and/or base pair (bp)) is defined and the window is moved in a particular step size (defined in nucleotides and/or base pair), in order to analyse any given stretch of contiguous nucleotides present in a sequence.
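For illustration, a sliding window of this kind can be sketched in Python as a generator that yields each window together with its start position. The function name and the default values (9 bp window, 1 bp step, taken from the preferred ranges stated above) are assumptions of this sketch.

```python
from typing import Iterator, Tuple


def sliding_windows(seq: str, window: int = 9, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for each full-length window along seq."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be at least 1")
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]
```

Each yielded subsequence can then be passed to whichever ΔG calculation is used, so that the thermodynamic values are obtained per window along the sequence.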
In light of the methods described herein, the invention relates to a method for predicting the location of splice sites in one or more unannotated genomes and/or genomic regions by provision of a DNA sequence of interest, determination of the ΔG ratio for one or more regions of said DNA sequence, wherein a ΔG ratio of any given specified sequence region above 1 indicates a 3' splicing site of an intron. According to the present invention, software may be developed that is utilised for scanning genomic sequences for potential splice sites.
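One possible form of such scanning software is sketched below in Python. It is not the implementation used in the examples. The function `scan_dna_biased` is hypothetical, and the two callables `dg_dna` and `dg_rna` are assumed to return positive unwinding energies for a window, for instance from a nearest-neighbour calculation. Windows whose ΔG ratio exceeds the threshold are reported as candidate regions only; any further criteria for calling a 3' splice site are outside the scope of the sketch.

```python
from typing import Callable, List, Tuple


def scan_dna_biased(
    seq: str,
    dg_dna: Callable[[str], float],
    dg_rna: Callable[[str], float],
    window: int = 9,
    step: int = 1,
    threshold: float = 1.0,
) -> List[Tuple[int, float]]:
    """Return (start, ΔG ratio) for every window whose ratio exceeds threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        sub = seq[start:start + window]
        rna = dg_rna(sub)
        if rna <= 0:
            continue  # ratio is not meaningful for non-positive values
        ratio = dg_dna(sub) / rna
        if ratio > threshold:
            hits.append((start, ratio))
    return hits
```

Raising the threshold above 1 (for example to 1.05 or 1.2) narrows the output to the more strongly DNA/DNA biased windows.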
The invention further relates to a method for manufacturing a nucleic acid molecule that corresponds to product DNA that has been modified by the method as described herein, comprising carrying out the method of any one of the preceding claims and subsequently synthesizing, cloning and/or isolating said nucleic acid molecule.
The invention also relates therefore to a nucleic acid molecule manufactured according to the method of manufacture as described herein. The nucleic acid may be integrated in an expression vector, such as a plasmid or viral vector.
In one embodiment the invention therefore relates to a pair of first and second nucleic acid molecules, wherein said first nucleic acid molecule is an initial DNA sequence (initial DNA) that comprises one or more sequences that encode an amino acid sequence of a protein to be expressed, and said second nucleic acid molecule is a product DNA sequence (product DNA) with a desired ΔG ratio that has been manufactured according to the method of the preceding claims, wherein said pair of sequences exhibit differences in ΔG ratio in one or more sequence regions.
In one embodiment the invention therefore relates to a pair of first and second nucleic acid molecules as described herein, wherein said sequences differ with respect to their nucleic acid sequence and ΔG ratio in one or more sequence regions, without any difference in amino acid sequence of the encoded protein to be expressed between said first and second sequences.
In one embodiment the invention therefore relates to a pair of first and second nucleic acid molecules as described herein, wherein the gene encoding the protein to be expressed comprises one or more introns, characterised in that an increased ΔG ratio in a specified sequence region of the product DNA in comparison to the initial DNA is present and provides increased expression of the protein encoded by said product DNA, wherein said specified sequence region is any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splicing site of an intron. In one embodiment the invention therefore relates to a pair of first and second nucleic acid molecules as described herein, wherein a reduced ΔG ratio for any given coding region and/or 5' untranslated region (5'-UTR) of the product DNA in comparison to the initial DNA is present and provides increased expression of the protein encoded by said product DNA.
The pairs of first and second nucleic acid molecules may be present in a kit, together in a laboratory, for example if said pair have undergone comparative testing for expression level as described herein, or present as saved data files in a computer. In one embodiment the presence of a pair of electronic representations of nucleic acid sequences as described herein is encompassed by the present invention. Considering the method as described herein may be implemented in computer software, the in silico representation of any given pair of sequences as described, whereby one sequence has been modified by the method of the present invention, also falls within the scope of the present invention.
The method of the present invention as described herein is of particular value to modification of sequences for expression as gene therapy constructs, vaccines and/or to advance the manufacture of commercially interesting proteins, or the corresponding RNA and/or DNA. The method can be applied in the following fields (but is not restricted to):
- De novo design and modification of genes for gene therapy, in order to obtain the desired level of expression of the gene as a whole, and/or of its splicing variants.
Successful gene therapy requires defined levels of exogenous expression of a therapeutic gene in human cells because either over expression or under expression of the gene can lead to artificial effects or the absence of the desired function of the expressed protein that will compromise or diminish the effect of gene therapy. The present invention also enables the possibility to express several alternative splicing variants in a controlled manner, thereby also increasing the potential application in successful gene therapy. In one embodiment the method of the invention as described herein comprises the design, simulation of manufacture or manufacture of a therapeutic expression vector comprising a product DNA that has been modified by the method according to any one of the preceding claims encoding a therapeutic protein. According to the present invention the term design or simulation of manufacture may also relate to a computer-stored embodiment of the invention, for example a computer simulated version of carrying out the method as described herein, or of a pair of sequences as described herein stored on a computing device.
- De novo design and modification of existing attenuated vaccines with improved genetic stabilities.
Decreasing the level of expression of viral genes by modulating DNA/DNA and/or RNA/DNA duplex stability of the coding sequence can decrease the fitness of the virus (or other pathogens), thereby attenuating the virus. Modulating DNA/DNA and/or RNA/DNA duplex stability of the virus genes may involve hundreds of synonymous mutations into a pathogen, which subsequently reduce the health risk by minimising the chance of the virus becoming virulent via recombination with an already existing virus. The present method can be used to create attenuated vaccines with improved genetic stabilities by de novo synthesis of virus genomes with altered DNA/DNA and RNA/DNA duplex stabilities of some of its coding regions and/or gene regulatory sequences. In one embodiment the method of the invention as described herein comprises the design, simulation of manufacture or manufacture of an attenuated human pathogen, preferably a virus, for use as a vaccine, comprising a product DNA that has been modified by the method according to any one of the preceding claims.
- Advanced production of proteins, RNA and DNA.
Progressively growing numbers of proteins and other bio-macromolecules are produced for the purposes of the biotechnology and pharmaceutical industries. Increasing the level of gene expression by insertion of intronic sequences or by modulating entire coding regions for a desired ΔG ratio using the approach described herein will significantly improve production of, for example, recombinant proteins in eukaryotic cells. In one embodiment the method of the invention as described herein comprises the design, simulation of manufacture or manufacture of an expression vector for recombinant protein expression, comprising a product DNA that has been modified by the method according to any one of the preceding claims encoding a protein to be expressed.
The terms nucleic acid, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are known in the art and are sufficiently clear for a skilled person. RNA and DNA may represent different embodiments of the same feature of the invention, especially with regard to coding function of the nucleic acid. The nucleic acids of the present invention also relate to sequences comprising synthetic or chemically modified nucleotides.
A gene is commonly understood as a molecular unit of heredity of a living organism. For the present invention a gene is defined as a region of genomic sequence, corresponding potentially to a unit of inheritance, which may be associated with regulatory regions, coding or non-coding transcribed regions, and/or other functional sequence regions. For the present invention, in a preferred embodiment, a gene relates to a nucleic acid sequence which has a primary function of encoding a protein. The gene may include introns, exons and/or regulatory sequences required for expression of the protein.
The term "modifying a DNA sequence" refers to sequence amendment via insertion, deletion, substitution, inversion or other sequence amendments as commonly known. Computational, cloning, mutation, recombination, PCR or synthetic methods may be used in modifying a sequence. The production of a mutation of the encoded protein during modification may or may not be excluded from the scope of the present invention, although the modification preferably results in no change in amino acid sequence.
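The preferred outcome described above, a modified nucleotide sequence with an unchanged amino acid sequence, can be checked computationally. The following is a minimal sketch, assuming the standard genetic code and in-frame coding sequences without introns; the function names are hypothetical and are not part of the described method.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame coding DNA sequence; '*' marks a stop codon."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_synonymous(initial_dna, product_dna):
    """True if both sequences encode the same amino acid sequence."""
    return translate(initial_dna) == translate(product_dna)


def count_nucleotide_changes(initial_dna, product_dna):
    """Number of positions that differ between two equal-length sequences."""
    if len(initial_dna) != len(product_dna):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(initial_dna.upper(), product_dna.upper()))
```

For example, `is_synonymous("ATGGCTCTG", "ATGGCCTTA")` holds because both sequences encode Met-Ala-Leu, although they differ at three nucleotide positions.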
The term "the same sequence" or "the same sequence region", for example in the context of a region to be analysed for both ΔG of DNA/DNA and ΔG of RNA/DNA, is not limited to exactly the same sequence length, but rather refers to the same region, whereby sequence overlap is sufficient for the "same region" to be analysed. For example, the sequence region analysed for ΔG of DNA/DNA may be longer or shorter than the sequence region used to analyse ΔG of RNA/DNA in the "same" region, depending for example on the window size used for the ΔG calculation, although an overlap is required.
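The overlap requirement can be expressed as a simple interval test. This is an illustrative sketch only, assuming each analysed region is described by hypothetical 0-based, half-open coordinates (start, end) on the same sequence.

```python
def regions_overlap(region_a, region_b):
    """True if two (start, end) half-open intervals share at least one position."""
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    return start < end
```

Under this convention, a 50 nt region analysed for DNA/DNA stability and a 9 nt window analysed for RNA/DNA stability count as the "same region" whenever the test returns true, even though their lengths differ.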
Artificial gene synthesis (or de novo synthesis) is a preferred application of the present invention and relates to methods used in synthetic biology used to create artificial genes. Currently based on solid-phase DNA synthesis, artificial synthesis differs from molecular cloning and polymerase chain reaction (PCR) in that the user does not have to begin with pre-existing DNA sequences. Therefore, it is possible to make a completely synthetic double-stranded DNA molecule with no apparent limits on either nucleotide sequence or size. Gene synthesis approaches may be based on a combination of organic chemistry and molecular biological techniques and entire genes may be synthesized "de novo", without the need for precursor template DNA. The method has been used to generate functional bacterial chromosomes containing approximately one million base pairs. Gene synthesis has become an important tool in many fields of recombinant DNA technology including heterologous gene expression, vaccine development, gene therapy and molecular engineering. The synthesis of nucleic acid sequences is often more economical than classical cloning and mutagenesis procedures.
Oligonucleotide synthesis may be applied during synthesis, whereby oligonucleotides are chemically synthesized using building blocks called nucleoside phosphoramidites. These can be normal or modified nucleosides which have protecting groups to prevent their amines, hydroxyl groups and phosphate groups from interacting incorrectly. HPLC can be used to isolate products with the proper sequence. Meanwhile a large number of oligos can be synthesized in parallel on gene chips. For optimal performance in subsequent gene synthesis procedures they should be prepared individually and in larger scales. Annealing based connection of oligonucleotides may also be used. For example, a set of individually designed oligonucleotides is made on automated solid-phase synthesizers, purified and then connected by specific annealing and standard ligation or polymerase reactions. To improve specificity of oligonucleotide annealing, the synthesis step relies on a set of thermostable DNA ligase and polymerase enzymes.
The term "modulate the level of expression of a protein" refers to changes in levels of protein produced via translation of a transcribed and optionally spliced RNA molecule corresponding to the coding DNA. The levels of protein after modulation can be determined by a person skilled in the art, for example by using affinity reagents such as an antibody directed against the particular protein of interest or readily available protein staining techniques. Other analytical methods could also be used to quantify protein expression levels, such as quantitative mass spectrometry or microscopic methods in combination with labelling of said protein of interest.
The term "expression" of a nucleic acid, preferably RNA or DNA, or "protein expression", is well known to a skilled person and represents protein production via translation of the corresponding optionally spliced RNA encoding the protein of interest. "Gene expression" or "protein expression" may be used interchangeably. A skilled person understands that a gene may be first "expressed" by transcription, the RNA optionally spliced or otherwise processed, leading ultimately to translation of the transcribed coding RNA. In one embodiment the protein expression level is considered to depend on and/or relate to the amount of corresponding optionally spliced RNA produced via transcription and/or splicing. Transcription levels may therefore also be used as an indication of gene or protein expression.
FIGURES
The figures provided herein represent examples of particular embodiments of the invention and are not intended to limit the scope of the invention. The figures are to be considered as providing a further description of possible and potentially preferred embodiments that enhance the technical support of one or more non-limiting embodiments.
Figure 1. Thermodynamic properties of the transcripts.
a. Correlation between ΔG and Tm of DNA/DNA and RNA/DNA duplexes of wIL2, IL2 and eIL2 gene variants. All experiments were performed in triplicate and the error bars represent the standard error of the mean. The window size for ΔG calculations is indicated with a subscript number after ΔG.
b. Mean values of ΔG of DNA/DNA (DD), sense RNA/DNA (RD) and antisense RNA/DNA (DR) duplexes of annotated 5'-UTRs, exons, introns and 3'-UTRs of all C. elegans transcripts. The error bars represent standard deviation.
c. Fraction of the transcripts with more stable RNA/DNA duplexes in introns compared to exons, for all transcript types or broken down by validation status - confirmed, partially confirmed and predicted transcripts.
Figure 2. A pattern of thermodynamic stability for C. elegans and H. sapiens transcripts.
a. Intensity plot of ΔG of DNA/DNA and RNA/DNA duplexes of the 50 bp sequences surrounding: all annotated start sites, 5'- and 3'-splice sites of 40 000 exon-intron-exon units and all annotated 3'-UTR ends of C. elegans transcripts.
b. Mean values of ΔG of DNA/DNA and RNA/DNA duplexes of the 50 bp sequences surrounding: all annotated start sites, 5'- and 3'-splice sites of 40 000 exon-intron-exon units, and all annotated 3'-UTR ends of C. elegans and H. sapiens transcripts.
c. Intensity plot of ΔG of DNA/DNA and RNA/DNA duplexes of the 50 bp sequences surrounding: all annotated start sites, 5'- and 3'-splice sites of all exon-intron-exon units of chromosome 1, and all annotated 3'-UTR ends of H. sapiens transcripts.
d. Mean values of the ΔG bias of the 50 bp sequences surrounding: annotated start sites, 5'- and 3'-splice sites, and the ends of the 3'-UTR of all C. elegans and H. sapiens transcripts.
e. Intensity plot of the ΔG bias of the 50 bp sequences surrounding: the annotated start sites of C. elegans and H. sapiens transcripts, the annotated ends of the 3'-UTR of C. elegans and H. sapiens transcripts, the 5'- and 3'-splice sites of 40 000 exon-intron-exon units of C. elegans, and the 5'- and 3'-splice sites of exon-intron-exon units of chromosome 1 of H. sapiens transcripts.
Figure 3. Nucleotide distribution and thermodynamic patterns of 3'-splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens transcripts.
a. Mean values of the ΔG bias surrounding the 3'-splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens transcripts.
b. Intensity plot of the ΔG bias of the sequences surrounding 3'-splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens transcripts.
c. Nucleotide distribution of the sequences surrounding the 3'-splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens transcripts. The distance from the 3'-splice site intron/exon junctions is specified under each position for the intronic (I) and exonic (E) regions.
Figure 4. Thermodynamic stability patterns of the 5'- and 3'-alternative splice sites in H. sapiens.
a. A scheme of the positioning of the 5'- and 3'-alternative splice sites in the intensity plots in panels b-d.
b. A plot of ΔG of DNA/DNA duplexes of the 50 bp sequences surrounding all annotated alternative splice sites with less than 50 bp distance between them. The results for the 5'-splice sites are aligned with respect to the downstream 5'-alternative splice sites, indicated with a black line. The upstream 5'-alternative splice sites are indicated with a white line. The results for 3'-splice sites are aligned with respect to the upstream 3'-alternative splice sites, indicated with a black line. The downstream 3'-alternative splice sites are indicated with a white line.
c. A plot of ΔG of RNA/DNA duplexes of the 50 bp sequences surrounding all annotated alternative splice sites with less than 50 bp distance between them. The results are aligned as indicated in b.
d. A plot of ΔG bias of the 50 bp sequences surrounding the annotated alternative splice sites with less than 50 bp distance between them. The results are aligned as indicated in b.
Figure 5. Thermodynamic stability patterns of 3'-splice sites of the constitutively and alternatively spliced exons in H. sapiens.
a. A scheme of the alternative splicing events in panels b-d. The positions of the calculated sites are depicted by arrowheads.
b. Mean values of the ΔG bias across the sequences surrounding the 3'-splice sites of constitutive exons, cassette exons, 3'-alternatively spliced exons and retained introns. The distance from the 3'-splice site intron/exon junctions is specified under each position for the intronic (I) and exonic (E) regions.
c. The mean value of the ΔG bias across the sequences surrounding the 3'-splice sites of 3'-alternatively spliced exons and constitutive exons. The 10th percentile denotes the 10% of 3'-alternative splice sites with the lowest splicing level. The 90th percentile denotes the 10% of 3'-alternative splice sites with the highest splicing level.
d. Mean values of the ΔG bias across the sequences surrounding the 3'-splice sites of cassette exons and constitutive exons.
e. Mean values of the ΔG bias across the sequences surrounding 3'-splice sites of 5'-alternatively spliced exons and constitutive exons.
Figure 6. Thermodynamic stability patterns of the cryptic 3'-splice sites in H. sapiens.
a. Mean values of the ΔG bias across the sequences surrounding the authentic 3'-splice sites and the cryptic 3'-splice sites situated upstream of them. The distance from the 3'-splice site intron/exon junctions is specified under each position for the intronic (I) and exonic (E) regions.
b. Mean values of the ΔG bias across the sequences surrounding authentic 3'-splice sites and corresponding cryptic 3'-splice sites located downstream from authentic 3'-splice sites.
Figure 7. Thermodynamic stability patterns near the 3'-splice sites and the branch point of both real and pseudo exons in H. sapiens.
a. Mean values of the ΔG bias across the 50 bp sequences surrounding the branch point of 35 exons of H. sapiens. The coordinates on the horizontal axis indicate the positions with respect to the branch point.
b. Mean values of the ΔG bias across the 50 bp sequences surrounding the 3'-splice sites of 35 exons of H. sapiens. The average position of the branch point is indicated.
c. Mean values of the ΔG bias across the 50 bp sequences surrounding the 3'-splice sites of both real and pseudo exons of H. sapiens. The average position of the branch point is indicated.
Figure 8. Pre-spliceosome assembly onto the 3'-splice site of AdML RNA exon 2 with or without annealing of antisense DNA oligonucleotides in vitro.
a-c. Schematic depiction of the experimental setup for the results shown in panel d. Black - AdML RNA transcript intron 1/exon 2 junction; lower case - intronic region; upper case - exonic region; blue - branch point; underlined - polypyrimidine tract; red - 3'-splice site; green - DNA antisense strand for either the splice site plus the exon, the branch point plus the polypyrimidine tract (BP-PPT), or the DNA/DNA biased region.
a. The mean value of the ΔG bias surrounding the 3'-splice site of exon 2 of AdML promoter transcripts.
b. Pre-spliceosome A complex formation when an antisense DNA oligonucleotide is annealed to exon 2 of AdML RNA.
c. Inhibition of pre-spliceosome A complex formation when an antisense DNA oligonucleotide is annealed to either the DNA/DNA biased region plus the branch point sequence or the DNA/DNA biased region alone.
d. Spliceosome complex formation assay using HeLa nuclear extract. Pre-spliceosome complex A formation is not inhibited when an antisense DNA oligonucleotide is annealed to both the 3'-AG splice site and the exonic sequence of AdML RNA (lane 5). Pre-spliceosome complex A formation is inhibited when an antisense DNA oligonucleotide is annealed to either the polypyrimidine region plus the branch point sequence (lane 6) or the DNA/DNA biased region alone (lane 7).
Figure 9. High-resolution melting temperature (Tm) profiles of DNA/DNA and RNA/DNA duplexes for the three variants of the IL2 gene.
a. Normalized fluorescence intensity (n.F.) of SYTO 9 dye intercalated into RNA/DNA and DNA/DNA duplexes of the wIL2, IL2 and eIL2 genes.
b. n.F. of SYTO 9 dye intercalated into RNA/DNA and DNA/DNA duplexes of IL2-eIL2.
Figure 10. Thermodynamic patterns of the 25 bp sequences surrounding 1000 5'- and 3'-splice sites of C. elegans transcripts.
a. A plot of ΔG of RNA/DNA duplexes of the 25 bp sequences surrounding 1000 5'-splice sites of C. elegans transcripts.
b. A plot of ΔG of RNA/DNA duplexes of the 25 bp sequences surrounding 1000 3'-splice sites of C. elegans transcripts.
Figure 11. Standard deviation of the mean value of ΔG bias of the 5'- and 3'-splice sites of C. elegans and H. sapiens transcripts.
a. Standard deviation of the mean value of ΔG bias of the 5'- and 3'-splice sites of C. elegans transcripts. The coordinates on the horizontal axis indicate the positions relative to the intron/exon junction in the intronic (I) and exonic (E) sequences respectively.
b. Standard deviation of the mean value of ΔG bias of the 5'- and 3'-splice sites of H. sapiens transcripts.
Figure 12. Thermodynamic pattern across the 5'- and 3'-splice sites in 3'-UTR and 5'-UTR parts of C. elegans transcripts.
a. The mean value of ΔG of RNA/DNA duplexes across the 50 bp sequences surrounding the 5'- and 3'-splice sites fully situated in coding, 3'-UTR and 5'-UTR parts of C. elegans transcripts.
b. The mean value of ΔG bias across the 50 bp sequences surrounding the 5'- and 3'-splice sites fully situated in coding, 3'-UTR and 5'-UTR parts of C. elegans transcripts.
Figure 13. A comparison of the thermodynamic pattern across the trans-splice sites and cis-3'-splice sites of C. elegans transcripts.
a. A comparison of the mean value of ΔG of DNA/DNA and RNA/DNA duplexes across the trans-splice sites and cis-3'-splice sites of C. elegans transcripts.
b. A comparison of the mean value of ΔG bias across the trans-splice sites and cis-3'-splice sites of C. elegans transcripts.
Figure 14. A comparison of the thermodynamic pattern across the trans-spliced start sites and non-spliced start sites of C. elegans transcripts.
a. A comparison of the mean value of ΔG of DNA/DNA and RNA/DNA duplexes across the trans-spliced start sites and non-spliced start sites. Horizontal axis coordinates indicate positions relative to the start sites in the 5'-intergenic regions (5R) and 5'-UTR (5U) respectively.
b. A comparison of the mean value of ΔG bias across the trans-spliced start sites and non-spliced start sites.
Figure 15. Thermodynamic pattern across the 5'- and 3'-splice sites fully situated in coding, 3'-UTR and 5'-UTR parts of H. sapiens transcripts.
a. The mean value of ΔG of RNA/DNA duplexes across the 50 bp sequences surrounding the 5'- and 3'-splice sites fully situated in coding, 3'-UTR and 5'-UTR parts of H. sapiens transcripts.
b. The mean value of ΔG bias across the 50 bp sequences surrounding the 5'- and 3'-splice sites fully situated in coding, 3'-UTR and 5'-UTR parts of H. sapiens transcripts.
Figure 16. Thermodynamic pattern across the 5'- and 3'-splice sites of the 1st, 2nd, 3rd, 4th and 5th introns of H. sapiens transcripts.
a. A plot of ΔG of RNA/DNA duplexes of the 50 bp sequences surrounding the 5'- and 3'-splice sites of the 1st, 2nd, 3rd, 4th and 5th introns of H. sapiens transcripts.
b. A plot of ΔG bias of the 50 bp sequences surrounding the 5'- and 3'-splice sites of the 1st, 2nd, 3rd, 4th and 5th introns of H. sapiens transcripts.
Figure 17. Thermodynamic pattern of DNA/DNA and RNA/DNA duplex stability surrounding the splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens.
a. A plot of ΔG of DNA/DNA and RNA/DNA duplexes of the 50 bp sequences surrounding 5'- and 3'-splice sites.
b. The mean value of ΔG of DNA/DNA and RNA/DNA duplexes surrounding the 5'- and 3'-splice sites.
Figure 18. ΔG bias surrounding the splice sites of A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens.
a. A plot of ΔG bias of the 50 bp sequences surrounding the 5'- and 3'-splice sites of the transcripts.
b. The mean values of ΔG bias of sequences surrounding the 5'- and 3'-splice sites of the transcripts.
Figure 19. Nucleotide composition and pattern of thermodynamic stability of 3'-splice sites of constitutively and alternatively spliced exons of H. sapiens.
a. Scheme of the alternative splice events.
b. The mean value of the ΔG bias of sequences surrounding the 3'-splice sites of constitutive exons, cassette exons, 3'-alternatively spliced exons and retained introns.
c. Nucleotide composition of sequences surrounding the 3'-splice sites of constitutive exons, cassette exons, 3'-alternatively spliced exons and retained introns.
Figure 20. Nucleotide composition and pattern of thermodynamic stability across the 3'-splice sites of constitutively and 3'-alternatively spliced exons of H. sapiens.
a. Scheme of the 3'-alternative splice events.
b. The mean value of the ΔG bias of sequences surrounding the 3'-splice sites of constitutive exons and 3'-alternatively spliced exons.
c. Nucleotide composition of sequences surrounding the 3'-splice sites of constitutive exons and 3'-alternatively spliced exons.
Figure 21. Nucleotide composition and pattern of thermodynamic stability across the 3'-splice sites of constitutive and cassette exons of H. sapiens.
a. Scheme of the cassette exons.
b. The mean value of the ΔG bias of sequences surrounding the 3'-splice sites of constitutive exons and cassette exons.
c. Nucleotide composition of sequences surrounding the 3'-splice sites of constitutive exons and cassette exons.
Figure 22. Nucleotide composition and pattern of thermodynamic stability of cryptic 3'-splice sites of H. sapiens transcripts.
a. The mean value of ΔG bias and nucleotide composition across the 50 bp sequences surrounding authentic 3'-splice sites and corresponding cryptic 3'-splice sites located upstream from authentic 3'-splice sites.
b. The mean value of ΔG bias and nucleotide composition across the 50 bp sequences surrounding authentic 3'-splice sites and the corresponding cryptic 3'-splice sites located downstream from authentic 3'-splice sites.
EXAMPLES
The examples provided herein represent practical support for particular embodiments of the invention and are not intended to limit the scope of the invention. The examples are to be considered as providing a further description of possible and potentially preferred embodiments that demonstrate the relevant technical working of one or more non-limiting embodiments.
Thermodynamic stability of RNA/DNA versus DNA/DNA duplexes
To evaluate whether the measured nearest-neighbour parameters permit accurate comparison between the thermodynamic stability of RNA/DNA and DNA/DNA duplexes, we performed high-resolution measurement of the melting temperature (Tm) and compared it to the calculated mean value of ΔG for the nearest-neighbor interactions of RNA/DNA25 and DNA/DNA23 duplexes for three variants of the human IL2 gene15, namely wIL2 (27% GC content), natural IL2 (39% GC content) and eIL2 (60% GC content). We found a strong correlation (correlation coefficient 0.945, P=0.005) between the measured Tm and the calculated ΔG values (Fig. 1a and Fig. 9a) using unified thermodynamic parameters for DNA/DNA duplexes and RNA/DNA duplexes25. Our results suggest that calculation of ΔG using these parameters allows for accurate comparison of the difference in thermodynamic stability between RNA/DNA and DNA/DNA duplexes. In addition, we confirm the prediction of large variations between DNA/DNA23 and RNA/DNA duplex stability25. The DNA/DNA duplex of wIL2 is more stable (Tm = 75.81 °C and ΔG = 1.17 kcal/mol) in comparison with the respective RNA/DNA duplex (Tm = 73.68 °C and ΔG = 1.12 kcal/mol). In contrast, the RNA/DNA duplex of eIL2 is more stable (Tm = 91.88 °C and ΔG = 1.64 kcal/mol) than the respective DNA/DNA duplex (Tm = 90.31 °C and ΔG = 1.51 kcal/mol).
To further evaluate the variation between RNA/DNA and DNA/DNA duplex stability we performed high-resolution melting analysis of another variant of the IL2 gene15, called eIL2-IL2. Half of the eIL2-IL2 sequence is identical to the corresponding sequence of eIL2 (60% GC content) and the other half originates from IL2 (39% GC content). Remarkably, both RNA/DNA and DNA/DNA duplexes dissociate in two steps as a result of the large difference between the thermodynamic stability of the eIL2 part and the IL2 part of the eIL2-IL2 gene (Fig. 9b). Moreover, the DNA/DNA duplex of the IL2 part is more stable (Tm = 80.7 °C) in comparison with the same RNA/DNA duplex (Tm = 77 °C). In contrast, the DNA/DNA duplex of the eIL2 part is less stable (Tm = 87.5 °C) in comparison with the corresponding RNA/DNA duplex (Tm = 88.9 °C). The observed change in the melting behaviour of RNA/DNA and DNA/DNA duplexes in the context of the same molecule further supports the finding that nucleotide composition can change the ratio of the thermodynamic stability of RNA/DNA to DNA/DNA duplexes.
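The comparison between measured Tm and calculated free energy can be illustrated with a plain Pearson correlation. The sketch below is an assumption about how such a coefficient could be computed, not the authors' analysis. It uses only the four Tm and free-energy pairs quoted in this passage (the wIL2 and eIL2 duplexes), so it is not expected to reproduce the reported coefficient of 0.945, which was obtained over more duplexes than are listed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Values as quoted in the text, in the order:
# wIL2 DNA/DNA, wIL2 RNA/DNA, eIL2 RNA/DNA, eIL2 DNA/DNA.
tm_celsius = [75.81, 73.68, 91.88, 90.31]
delta_g = [1.17, 1.12, 1.64, 1.51]

r = pearson(tm_celsius, delta_g)
```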
Thermodynamic properties of C. elegans genome
To explore the possible role of the intrinsic thermodynamic properties of the DNA/DNA and RNA/DNA duplexes we calculated their corresponding ΔG values throughout the entire genome of C. elegans using the measured nearest-neighbor parameters23,25 and a sliding-window approach26 with 1 bp step and window size of 2 bp. The results revealed a striking correlation between the mean value of ΔG for exons, introns, 5'-UTRs, 3'-UTRs, and transcripts (Fig. 1b). On average the exonic sequences are more stable than 3'-UTRs and intronic sequences (Fig. 1b). Remarkably, 97.6% of the transcripts possess more thermodynamically stable RNA/DNA duplexes in exonic sequences in comparison with the corresponding intronic sequences (Fig. 1c). For the transcripts classified as "confirmed" the correlation was even more compelling, with only 43 out of 10960 (0.4%) having higher RNA/DNA duplex stability in their intronic sequences. However, for transcripts classified as "predicted", higher intronic compared to exonic RNA/DNA duplex stability was over 20 times more frequent (401 out of 4167, or 9.6%) than in the "confirmed" group, which may indicate false positive transcript predictions. This suggests that the thermodynamic parameters can be useful in refining de novo gene structure prediction in C. elegans. Particularly interesting was the observed strong DNA/DNA bias in intronic sequences contrasting with the absence of bias in exonic sequences (Fig. 1b). In order to map precisely the regions in transcripts that contribute to differential stability, we calculated the thermodynamic profile of the 50 bp upstream and the 50 bp downstream of the transcript start sites, the 5'- and 3'-splice sites and the ends of the 3'-UTRs of all C. elegans transcripts (Fig. 2a, b). This approach allows alignment and color-coded representation of the thermodynamic profile of the individual sequences with respect to the regions responsible for RNA processing (Figure 2).
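The sliding-window calculation described above can be sketched as follows. This is a minimal illustration, assuming that the value assigned to each window is the mean of the nearest-neighbour free energies of its dinucleotide steps. The function name is hypothetical, and the parameter table `TOY_PARAMS` contains placeholder numbers chosen only to make the example runnable; they are not the published nearest-neighbour parameters, which would have to be supplied from the cited sources.

```python
def window_delta_g(seq, nn_params, window=2, step=1):
    """Mean nearest-neighbour free energy for each window along a sequence.

    nn_params maps each dinucleotide (e.g. "AG") to a free-energy value.
    A window of length w contains w - 1 dinucleotide steps.
    """
    if window < 2:
        raise ValueError("window must span at least one dinucleotide step")
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        steps = [nn_params[chunk[i:i + 2]] for i in range(window - 1)]
        profile.append(sum(steps) / len(steps))
    return profile


# Placeholder table (NOT published values): each G or C in a dinucleotide
# lowers the toy free energy by 0.5 from a baseline of -1.0.
TOY_PARAMS = {
    a + b: -1.0 - 0.5 * ((a in "GC") + (b in "GC"))
    for a in "ACGT"
    for b in "ACGT"
}

profile = window_delta_g("AAGC", TOY_PARAMS, window=2, step=1)
```

With a window of 2 bp each value corresponds to a single dinucleotide step; a larger window averages over several adjacent steps.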
As the RNA polymerase maintains a 9 bp RNA/DNA duplex in the transcription bubble during elongation, we used a window size of 9 bp for ΔG calculations. In addition to the observation that introns are less stable than exons (Fig. 2a, b and Fig. 10), we found that the least stable RNA/DNA duplex is located upstream of the 3'-splice site, and includes the polyU tract27 characteristic of the 3'-consensus in C. elegans (Fig. 3). This region is directly followed by a more stable region, which includes the 3'-splice site. At the 5'-splice site a more stable region at the exon-intron boundary is followed by an intronic region with a significantly lower RNA/DNA duplex stability (Fig. 2a, b). The pattern is similar, but not as pronounced, for DNA/DNA duplex stability of the same sequences (Fig. 2a, b). As a result, the DNA/DNA bias is most significant at both intron ends (Fig. 2d, e and Fig. 11a). Remarkably, next to these regions are situated the regions with the strongest RNA/DNA bias, which include the 5'- and 3'-splice sites. We considered the possibility that the observed pattern is a consequence of the fact that most exons contain protein-coding sequences that have specific sequence constraints. However, the same pattern was observed for the exon-intron-exon units in both protein-coding regions and in untranslated regions of the coding genes (Fig. 12). Furthermore, we found no significant difference between protein-coding and non-coding genes. Taken together, the observations above suggest that the specific pattern of thermodynamic stability distribution is related to RNA splicing, a process which occurs co-transcriptionally and can be directly affected by the thermodynamic stability of the DNA/DNA and RNA/DNA duplexes.
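The alignment of flanking sequences around annotated sites and the averaging of their profiles can be sketched as below. Several points are assumptions made for illustration: the bias is taken here as the ratio of the windowed RNA/DNA value to the windowed DNA/DNA value, both parameter tables are keyed by DNA dinucleotides of the coding strand, and all function names are hypothetical. Sites too close to a sequence end to provide full flanks are skipped.

```python
def windowed_mean(seq, params, window=9):
    """Mean dinucleotide value over each window of `window` nucleotides."""
    steps = [params[seq[i:i + 2]] for i in range(len(seq) - 1)]
    n = window - 1  # dinucleotide steps per window
    return [sum(steps[i:i + n]) / n for i in range(len(steps) - n + 1)]


def flank(seq, site, size=50):
    """Return `size` nt upstream plus `size` nt downstream of `site`, or None."""
    if site - size < 0 or site + size > len(seq):
        return None
    return seq[site - size:site + size]


def mean_ratio_profile(seqs_and_sites, rd_params, dd_params, size=50, window=9):
    """Position-wise mean of the RNA/DNA to DNA/DNA ratio across aligned sites."""
    profiles = []
    for seq, site in seqs_and_sites:
        region = flank(seq.upper(), site, size)
        if region is None:
            continue  # not enough flanking sequence around this site
        rd = windowed_mean(region, rd_params, window)
        dd = windowed_mean(region, dd_params, window)
        profiles.append([r / d for r, d in zip(rd, dd)])
    if not profiles:
        return []
    return [sum(column) / len(column) for column in zip(*profiles)]
```

Because every flank has the same length, the per-site profiles line up position by position, which is what allows the column-wise mean (and, in the same way, a colour-coded intensity plot) to be taken with respect to the site.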
This is further supported by the thermodynamic stability profile at the start sites of the transcripts (Fig. 2a, b). In C. elegans the 5'-end of about 70% of the mRNAs is produced in a process called "trans splicing", whereby the transcript gets spliced to a short "splice leader" RNA in a process similar to normal intron removal28 (cis-splicing). We found that the start sites of the trans-spliced transcripts have the characteristic thermodynamic profile of a 3'-splice site (Fig. 2a, b and Fig. 13), and differ from the conventional transcript start sites (Fig. 14). The identical thermodynamic properties of cis- and trans- 3'-splice sites suggest that they could contribute to the mechanism of RNA splicing.
Thermodynamic stability pattern of eukaryotic transcripts
To assess whether the observed thermodynamic stability patterns exist in mammals, we calculated the free energy of the 50 bp upstream and the 50 bp downstream of the same sites in the human genome (Fig. 2b, c). In human transcripts, both RNA/DNA and DNA/DNA duplexes are more stable compared to C. elegans (Fig. 2a, b, c). The second difference observed is that human transcripts are RNA/DNA biased, except for the regions upstream of the 3'-splice sites and the 3'-UTRs. Moreover, we found that the 5'-UTRs of human transcripts are extremely stable. We analyzed all first and second exon pairs that fall into 5'-UTRs. The thermodynamically stable region includes the first exon of the 5'-UTR and propagates to the intronic sequence, but does not reach the 3'-part of the intron and the second exon, even though it is also part of the 5'-UTR (Fig. 15 and 16). It is known that the first exon and intron of genes possess specific features compared to the rest of the exon-intron pairs, such as shorter exon length29, longer intron length30 and characteristic DNA methylation31 and histone modification profiles32. However, the relationship between these characteristics and the higher stability of the first exon and intron, and the biological significance of these correlations, are still unclear.
Despite these differences, the thermodynamic profiles for H. sapiens and C. elegans share three important features. First, the region of lowest RNA/DNA duplex stability within the transcript is situated in intronic sequences upstream of 3'-splice sites in both species (Fig. 2a, b, c). Second, this region possesses the strongest DNA/DNA bias (Fig. 2a, b, c and Fig. 11b). Third, the 3'- and 5'-splice sites have the strongest RNA/DNA bias. Similar to C. elegans, in H. sapiens these patterns are observed in both coding and non-coding genes (results not shown) and in coding and untranslated regions of the coding genes (Fig. 15), supporting the idea that the observed profiles are linked to a transcription-coupled process. To check whether the observed thermodynamic stability patterns are evolutionarily conserved, we calculated the free energy of the 50 bp upstream and the 50 bp downstream of the 3'- and 5'-splice sites of the plant A. thaliana, the insect D. melanogaster and the fish D. rerio (Fig. 17 and 18). The same trend as in C. elegans and H. sapiens was detected in those species (Fig. 3 and 18), whereby the region of strongest DNA/DNA bias was situated within intronic sequences upstream of the 3'-splice sites. With the exception of H. sapiens, the full lengths of intronic sequences of all studied organisms were DNA/DNA biased. In contrast, the strongest RNA/DNA bias throughout the entire pool of measured transcripts was detected at the 3'- and 5'-splice sites of all species. These patterns are statistically significant, with a p-value < 2e-15 (Materials and Methods). We found that the polypyrimidine tract, a degenerate, pyrimidine-rich sequence required for intron-exon recognition, is situated in the region of strongest DNA/DNA bias. It is remarkable that regions with significant differences in nucleotide composition among species would possess such a uniform DNA/DNA bias (Fig. 3a, b, c and Fig. 18).
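The profiles discussed above can be illustrated with a minimal Python sketch. It is not the Perl software used for the experimental examples; it only assumes the sliding-window scheme given in the Materials and Methods (window of 9 bp, step of 1 bp) and the relation ΔG = ΔH − TΔS. The function names, the convention of keying each table by the coding-strand dinucleotide, the sign convention for the bias (RNA/DNA ΔG minus DNA/DNA ΔG) and the two small parameter tables are hypothetical; the numbers in the tables are placeholders, not published nearest-neighbor values.

```python
from typing import Dict, List

T_KELVIN = 310.15  # 37 °C, the temperature stated in the Materials and Methods


def nn_free_energy(dh_kcal: float, ds_cal: float, t: float = T_KELVIN) -> float:
    """dG = dH - T*dS, with dH in kcal/mol and dS in cal/(mol*K)."""
    return dh_kcal - t * ds_cal / 1000.0


def window_profile(seq: str, nn_dg: Dict[str, float],
                   window: int = 9, step: int = 1) -> List[float]:
    """Sum the nearest-neighbor dG over every window of `window` bases."""
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        # A window of 9 bases contains 8 stacked dinucleotide steps.
        profile.append(sum(nn_dg[w[i:i + 2]] for i in range(window - 1)))
    return profile


def bias_profile(seq: str, dna_dna: Dict[str, float], rna_dna: Dict[str, float],
                 window: int = 9, step: int = 1) -> List[float]:
    """Assumed convention: positive values mean the RNA/DNA duplex is less
    stable than the DNA/DNA duplex in that window (DNA/DNA bias)."""
    dd = window_profile(seq, dna_dna, window, step)
    rd = window_profile(seq, rna_dna, window, step)
    return [r - d for r, d in zip(rd, dd)]


# Placeholder tables for demonstration only (NOT published parameters).
TOY_DNA_DNA = {a + b: -1.0 for a in "ACGT" for b in "ACGT"}
TOY_RNA_DNA = {k: (-0.5 if set(k) <= set("CT") else -1.2) for k in TOY_DNA_DNA}

if __name__ == "__main__":
    # A pyrimidine-rich stretch followed by an AG dinucleotide.
    print(bias_profile("GGAGGATTTTTTTTTCAG", TOY_DNA_DNA, TOY_RNA_DNA))
```

With real nearest-neighbor tables substituted for the placeholders, the same functions would give one value per window position across the 50 bp upstream and downstream of a site.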
Thermodynamic pattern near alternative splice sites
Most primary transcripts in metazoans are subject to alternative splicing, whereby the same sequence can be an intron in one transcript and an exon in another. We calculated the thermodynamic stability of all human intron/exon units with known alternative 3'-splice sites up to 50 bp away from each other and sorted them by increasing distance between the two sites (Fig. 4a, b, c and d). By this approach the specific pattern of strong DNA/DNA bias upstream of 3'-splice sites and RNA/DNA bias at the 3'-splice site can be clearly seen for both alternative splice sites when the distance between them is over 8 nt (Fig. 4d). The characteristic RNA/DNA bias for the 5'-alternative splice sites is also detectable, although less pronounced than the profile near 3'-splice sites (Fig. 4d). Some alternative splicing events are subject to intensive cell type and cell cycle specific regulation, which allows differential expression of the splice variants. However, a substantial fraction of the alternative splicing events result in low abundance alternative transcripts without detectable biological function. Such minor transcript variants are believed to result from splicing at inefficient splice sites. If the thermodynamic stability profile near the splice sites plays an important role in the splicing reaction, the constitutive splice sites will possess more pronounced biases in the thermodynamic profiles than alternative splice sites of low splicing efficiency. We compared the thermodynamic stability profile near the 3'-splice sites of human constitutive exons, retained introns, cassette (skipped) exons and 3'-alternative splice sites (Fig. 5a, b and Fig. 19). The cassette exons that are present in one but skipped in other transcripts of the same gene possess thermodynamic stability profiles similar to those of the constitutive exons (Fig. 5b). However, 3'-alternative splice exons possess lower DNA/DNA bias upstream of the 3'-splice sites compared to the constitutive exons.
The difference is even stronger for the retained introns, in which an intron is not removed from the mature transcript (Fig. 5a, b and Fig. 19). The retained introns are believed to be a result of recognition failure of the weak splice sites that flank the introns34. Those weak splice sites could be a result of the low DNA/DNA bias of the region. To assess whether there is a link between the thermodynamic stability and the alternative splice site usage, we compared the 10% of 3'-alternative splice sites with either the lowest (10th percentile) or the highest splicing levels (90th percentile)35. Our results reveal that the 3'-alternative splice sites with the lower splicing levels possess lower DNA/DNA bias upstream of the 3'-splice sites than the average 3'-alternative splice sites (Fig. 5c and Fig. 20). In contrast, the 3'-alternative splice sites with higher splicing levels possess higher DNA/DNA bias than the average 3'-alternative splice sites and are similar to the constitutive splice sites. Even the cassette exons with the lower splicing levels possess significantly lower DNA/DNA bias than the average cassette exons and constitutive exons (Fig. 5d and Fig. 21).
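The selection of the lowest and highest 10% of sites by splicing level can be sketched as follows. This is only an illustration: the representation of a site as an (identifier, splicing level) pair and the function name are assumptions, and the example values are made up rather than taken from any database.

```python
from typing import List, Tuple

Site = Tuple[str, float]  # (hypothetical site identifier, splicing level)


def split_by_usage(sites: List[Site], fraction: float = 0.10):
    """Return the `fraction` of sites with the lowest splicing levels and the
    `fraction` with the highest splicing levels, each as a list."""
    ranked = sorted(sites, key=lambda s: s[1])
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


if __name__ == "__main__":
    # Made-up splicing levels between 0 and 1 for 20 hypothetical sites.
    demo = [(f"site{i}", i / 20) for i in range(20)]
    low, high = split_by_usage(demo)
    print("lowest:", low)
    print("highest:", high)
```

The thermodynamic profiles of the two returned groups could then be averaged separately and compared with the profile of all sites.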
We also studied the thermodynamic profile near the 3'-splice sites of 5'-alternative splice exons as a negative control (Fig. 5e). There was no difference in the thermodynamic profile near the 3'-splice sites between 5'-alternative splice exons of higher and lower splicing levels. These results suggest that the DNA/DNA bias upstream of the 3'-splice sites could be involved in 3'-splice site recognition.
Stability pattern near cryptic and authentic 3'-splice sites
Metazoan transcripts contain large numbers of "cryptic" splice sites that are inactivated or used at low levels due to suppression from nearby and advantageous authentic splice sites36. Mutation of the authentic splice sites activates the cryptic splice sites, leading to aberrant alternative splicing and frequently to genetic disease37,38. To understand how the thermodynamic properties of the two alternative sites can influence the splice site selection under normal and disease conditions, we compared the thermodynamic profile of disease-related cryptic 3'-splice sites36 with their corresponding authentic splice sites. The cryptic splice sites can occur both upstream (in the intron) and downstream (in the exon) of the authentic 3'-splice site. We first analyzed 21 cryptic 3'-splice sites situated more than 20 bp downstream from the authentic splice sites36 (Fig. 6a, Fig. 22a and Table S1). In this configuration Pol II will transcribe first the authentic 3'-splice sites and then the cryptic 3'-splice sites. In this case, we do not see the characteristic region with the strongest DNA/DNA bias in front of the cryptic 3'-splice sites, which is clearly present at the authentic 3'-splice sites. Such a pattern suggests that the absence of a DNA/DNA biased sequence upstream of the splice site does not allow independent splicing events at cryptic 3'-splice sites. Mutation in the AG consensus of the authentic 3'-splice sites could still allow recognition of their upstream unstable regions by the spliceosome and finding of the AG consensus of the cryptic 3'-splice sites.
Table S1. Cryptic 3'-splice sites situated downstream (in the exon) of the authentic 3'-splice site 1.
[Table S1 is provided as an image (imgf000028_0001) in the original publication.]
We next analyzed 15 cryptic sites situated more than 20 bp upstream of the authentic 3'-splice site36 (Fig. 6b, Fig. 22b and Table S2). In this configuration Pol II together with the spliceosome factors will first encounter the cryptic 3'-splice sites and then the authentic splice sites. On average, the DNA/DNA bias is less pronounced at the region situated upstream of cryptic 3'-splice sites than at the same regions of the authentic splice sites. Furthermore, RNA/DNA bias is not as strong at cryptic 3'-splice sites as at authentic splice sites. Our results show that the characteristic thermodynamic profile of the 3'-splice sites is less pronounced in the cryptic 3'-splice sites, which could allow them to be bypassed by Pol II without leading to a splicing event when the authentic splice site is functional.
Table S2. Cryptic 3'-splice sites situated upstream (in the intron) of the authentic 3'-splice site 1

| Gene | Disease | Distance | Mutation |
| --- | --- | --- | --- |
| GCK | Maturity-onset diabetes of the young | -27 | IVS5-1G>A |
| GLB1 | Gangliosidosis | -28 | IVS14-2A>G |
| FACL4 | X-linked mental retardation | -28 | IVS10-2A>G |
| ATP7B | Wilson disease | -39 | IVS11-2A>G |
| NTRK1 (TRKA) | Congenital insensitivity to pain with anhidrosis | -41 | IVS4-1G>C |
| TP53 | Li-Fraumeni syndrome | -44 | IVS9-1G>C |
| DMD | Muscular dystrophy | -45 | IVS38-1G>A |
| APOE | APOE deficiency | -52 | IVS3-2A>G |
| PAH | Phenylketonuria | -81 | (IVS2-2A>G) |
| FANCA | Fanconi anemia | -90 | IVS15-1G>T |
| STK11 | Peutz-Jeghers syndrome | -113 | IVS1-2A>G |
| STK11 | Peutz-Jeghers syndrome | -124 | IVS1-2A>G |
| PKP2 | Arrhythmogenic right ventricular cardiomyopathy | -160 | IVS12-1G>C |
| HBB | Beta-thalassemia | -271 | IVS2-2A>G |
| TPMT | Thiopurine methyltransferase deficiency | -330 | IVS9-1G>A |
Thermodynamic profile of real exons and pseudoexons
The branch point (BP) sequence is a degenerate signal situated several nucleotides upstream of the polypyrimidine tract (PPT) 39. In order to map BPs in the context of the region of the strongest DNA/DNA bias, we calculated the free energy near 35 identified human BPs 40. Our results show that on average the BP is located upstream of the maximum DNA/DNA bias point but is still within the DNA/DNA bias region (Fig. 7a, b).
Pseudoexons are regions in the human genome flanked by sequences that resemble authentic splicing regulatory signals but are not spliced into mature mRNAs 8-10. Previous work 9 has provided a reliable dataset of such sequences, derived using a consensus score based on a position-specific weight matrix obtained by aligning a large number of real splice sites 10. We compared the thermodynamic stability near 3'-splice sites of the real noncoding exons and pseudoexons of the same dataset (Fig. 7c). While the maximum of the DNA/DNA bias was higher for the pseudoexons, this is likely due to the very stringent selection criteria for this dataset, which would exclude over 25% of the known real exons. However, the region of high DNA/DNA bias was shorter for the 3'-splice sites upstream of the pseudoexonic sequences, which potentially puts it at a distance from the branch point. To investigate this further we compared the position of the potential BP in real exons and pseudoexons. As previously shown9, predicted BP consensus sequences are less frequently found in front of 3'-splice sites of pseudoexons. Predicted BP consensus sequences were only found for 56% of the pseudoexons, compared to 69% for the real exons. We found that, when present, the predicted BP consensus sequences are on average located further away from the 3'-splice sites of pseudoexons than from those of real exons (Fig. 7c). This, in combination with the short region with DNA/DNA bias, positions the predicted BP consensus sequences outside the DNA/DNA biased region of the 3'-splice sites of pseudoexons (Fig. 7c). In contrast, the average predicted BP consensus sequence is situated inside the DNA/DNA biased region of the 3'-splice sites of real exons (Fig. 7c), as was the case with experimentally confirmed BPs (Fig. 7a, b).
Our results suggest that the positioning of BP consensus sequences outside the narrow DNA/DNA biased regions of the 3'-splice sites can contribute to the absence of splicing in pseudoexons in addition to underrepresentation of the regulatory sequences 9.
All of the discussed results show that the elongating RNA polymerase, on its way through a DNA template, produces an RNA transcript with significant local differences in the potential for DNA/DNA and RNA/DNA duplex formation and unwinding, which are most pronounced near the 3'-splice sites. These patterns are conserved in eukaryotes and are not dependent on the coding status of the exon. This characteristic instability is less pronounced in weak alternative splice sites and disease-associated cryptic 3'-splice sites, thus suggesting a role of the thermodynamic pattern in mRNA splicing.
RNA/DNA annealing at DNA/DNA biased regions impedes splicing
It has recently been shown that depletion of two splicing factors, ASF/SF2 and RNPS1, which co-transcriptionally bind to the nascent mRNA, leads to formation of mRNA/DNA duplexes as a result of re-annealing of mRNA to the DNA template strand in the wake of transcription41,42. The structure created by re-annealing of mRNA to the DNA template is called an R-loop43. The suppression of R-loop formation by splicing factors suggests that the re-annealing of mRNA to the DNA template can interfere with proper RNA splicing41,42,44. Furthermore, Topo I topoisomerase, which removes negative supercoiling generated behind RNA polymerases during transcription, also suppresses R-loop formation45,46. The accumulation of negative supercoiling in Topo I-deficient cells is supposed to weaken DNA/DNA duplexes and to facilitate both re-annealing of mRNA to the DNA template strand and R-loop formation. In the context of these findings, we propose that the detected lower stability of RNA/DNA versus DNA/DNA duplexes (DNA/DNA bias) upstream of the 3'-splice site prevents the re-annealing of mRNA after transcription to allow spliceosome assembly.
To test this hypothesis we studied whether the re-annealing of the RNA transcript to DNA in the DNA/DNA biased region situated upstream of 3'-splice sites can inhibit spliceosome assembly. In vitro experiments with crude nuclear extracts have demonstrated a stepwise assembly of the spliceosome onto RNA 47-49. First, RNA is incorporated into a nonspecific heterogeneous nuclear ribonucleoprotein H complex that does not require ATP and functional splicing regulatory sequences. The addition of ATP leads to re-arrangement of RNA into a pre-spliceosomal complex A as a result of binding of U2AF65, U2AF35 and U2 snRNP to the PPT, 3'-splice sites and BP, respectively 47. Subsequently, a B complex is formed by association of the U4/U5/U6 tri-snRNP, followed by formation of the active spliceosome complex (complex C) as a result of new rearrangements and the incorporation of the U2/5/6 snRNPs50. We studied whether the annealing of DNA oligonucleotides to RNA near the 3'-splice site of exon 2 of the AdML transcript influences pre-spliceosome complex formation in crude nuclear extracts 51,52. Our results show that the annealing of antisense DNA oligonucleotides covering the DNA/DNA biased region and the BP region leads to inhibition of complex A formation (Fig. 8) and partially impedes the formation of complex H. The same result was observed with an oligonucleotide covering the DNA/DNA biased region alone. In contrast, the annealing of antisense DNA oligonucleotides to RNA at the 3'-splice site and the start of the exon does not influence complex A and H formation (Fig. 8). The inhibition of complex A formation by annealing of DNA to the DNA/DNA biased region further suggests that a strong DNA/DNA bias in this region is required to prevent the annealing of DNA to RNA in order to ensure spliceosome assembly. Such a mechanism is also supported by the research of Krainer and collaborators53, who studied the splicing of SMN2 exon 7.
They showed that the annealing of chimeric antisense oligonucleotides across the PPT sequence of the RNA of intron 6 leads to inhibition of splicing of exon 7 both in vivo and in vitro.
This experiment shows directly that if DNA oligonucleotides are bound downstream from 3'-splice sites of RNA duplexes in vitro, the splicing is abolished. This experiment provides further support that if higher stability of RNA/DNA duplexes in comparison with DNA/DNA duplexes is present downstream from 3'-splice sites, formation of RNA/DNA duplexes at these sites will be favored and will prevent splicing. In the opposite scenario, where lower stability of RNA/DNA duplexes in comparison with DNA/DNA duplexes is present downstream from 3'-splice sites, free RNA duplexes at these sites will be favored, which allows loading of the spliceosome and subsequently the splicing reaction.
Discussion of the experimental examples
The results of the experimental examples demonstrate that the DNA/DNA bias of the 3'-splice sites influences the splicing process. However, the region of most pronounced DNA/DNA bias coincides with the polypyrimidine tract, which is known to be required for proper splicing 2,6,54,55. This raises the question whether the polypyrimidine tract is required only to ensure the DNA/DNA bias necessary to prevent the re-annealing of mRNA to DNA, or whether its nucleotide composition is sufficient to guarantee specific U2AF binding and spliceosome recruitment. To clarify this issue we calculated the nucleotide usage at each position near all of the 3'-splice sites in A. thaliana, C. elegans, D. melanogaster, D. rerio and H. sapiens (Fig. 3). For the H. sapiens transcripts we further looked into the nucleotide usage near 3'-splice sites for constitutive and alternative splice sites (Fig. 5 and Fig. 19, 20, 21), cryptic and authentic splice sites (Fig. 22) and real exons and pseudoexons (Fig. 7). The comparison of these results with the corresponding thermodynamic profiles does not provide a direct answer because the nucleotide composition and its thermodynamic properties are interrelated. The thermodynamic stability of nucleotide duplexes has two components. The first component is the hydrogen bonding between complementary bases, which strongly depends on the nucleotide composition. The second component is the stacking interaction between the bases, which depends mainly on the neighboring di-nucleotide distribution. Therefore, the nucleotide distribution and its thermodynamic properties are interrelated.
The polypyrimidine/polyuridine tract is inevitably DNA/DNA biased, and at the same time a strong DNA/DNA bias is impossible without an abundance of pyrimidines. While the sequence properties of the polypyrimidine region are known, the fact that they would lead to a specific DNA/DNA bias has been previously overlooked. The potential of the DNA/DNA bias to increase the accessibility of the polypyrimidine tract by preventing message/template annealing can enable the recruitment of U2AF65 to its preferred substrate, and the two mechanisms may act in parallel to ensure proper assembly of the splicing machinery.
The present invention demonstrates that the least stable RNA/DNA duplexes as compared to the respective DNA/DNA duplexes are situated upstream from the 3'-splice site where the polypyrimidine tract is situated. This characteristic instability is less pronounced in weak alternative splice sites and disease-associated cryptic 3'-splice sites. The essential splicing factor U2AF65 is specifically recruited to these regions of the pre-mRNA despite their relatively poor sequence conservation54. The results of the experimental examples suggest that the higher instability of RNA/DNA in comparison to the respective DNA/DNA duplexes in this region can prevent the re-annealing of mRNA to DNA. The resulting mRNA/DNA melting can contribute as a mechanism to allow binding of U2AF65 and SF1 to mRNA in order to initialize the primary steps of Spliceosome assembly.
Practical implications of the method as described herein
1. De novo design and modification of existing attenuated vaccines with improved genetic stabilities:
Decreasing the level of expression of virus genes by decreasing the DNA/DNA and RNA/DNA duplex stability of the coding sequence can decrease the fitness of the virus (or other pathogens), thereby attenuating the virus. Decreasing the DNA/DNA and RNA/DNA duplex stability of the virus genes will preferably involve introducing multiple, more preferably hundreds of, synonymous mutations into a pathogen in order to minimize the chances of regaining virulence via recombination with other viral sequences present in the host genome. The methods described herein can be used to create attenuated vaccines with improved genetic stabilities by de novo synthesis of virus genomes with altered DNA/DNA and RNA/DNA duplex stability of some of their genes and regulatory sequences thereof.
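As an illustration of how synonymous mutations can shift duplex stability without changing the encoded protein, the following Python sketch recodes a coding sequence codon by codon. It is a simplified, hypothetical example and not the design procedure of the invention: the greedy strategy, the function names and the placeholder table TOY_NN_DG (which merely makes G/C-rich dinucleotides more stable) are assumptions, and a real design would use published nearest-neighbor parameters and would also have to respect regulatory sequences and other constraints.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Placeholder dinucleotide free energies (NOT published values):
# every G or C in a dinucleotide adds 0.5 units of stability.
TOY_NN_DG = {a + b: -(1.0 + 0.5 * sum(x in "GC" for x in a + b))
             for a in BASES for b in BASES}


def translate(cds: str) -> str:
    """Translate complete codons; trailing bases are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, usable, 3))


def recode(cds: str, nn_dg: dict, destabilize: bool = True) -> str:
    """Greedy synonymous recoding.

    For every codon, pick the synonymous codon whose dinucleotide steps
    (including the junction with the previously chosen base) have the
    highest summed dG (least stable) or, if destabilize is False, the lowest.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    choose = max if destabilize else min
    out = ""
    for i in range(0, usable, 3):
        aa = CODON_TO_AA[cds[i:i + 3]]
        synonyms = [c for c, a in CODON_TO_AA.items() if a == aa]

        def score(codon, prev=out[-1:]):
            ctx = prev + codon
            return sum(nn_dg[ctx[j:j + 2]] for j in range(len(ctx) - 1))

        out += choose(synonyms, key=score)
    return out + cds[usable:]  # keep any incomplete trailing codon unchanged


if __name__ == "__main__":
    original = "ATGGCCGGCTAA"
    recoded = recode(original, TOY_NN_DG)
    print(original, "->", recoded, translate(original) == translate(recoded))
```

Calling recode with destabilize=False gives the opposite direction, i.e. candidates with increased duplex stability under the same placeholder table.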
An example of the de novo design of such a sequence may be carried out as follows:
a) Measurement of thermodynamic stability of RNA/DNA and DNA/DNA duplexes of any one or more genomic regions of interest (such as genes) of the virulent virus,
b) Design of one or more, preferably several, modified genomes of the virus by diminishing thermodynamic stability of RNA/DNA duplex of one or more genomic regions of interest (such as its genes),
c) De novo synthesis of the modified virus genome(s) according to common de novo synthesis techniques, for example by chemical oligonucleotide synthesis, and/or annealing based connection of oligonucleotides,
d) Measurement of the level of the virulence of the viruses with modified genomes, for example via standard techniques used by immunologists for assessing virulence, such as tissue or cell culture virulence assays,
e) Detection or measurement of a correlation between thermodynamic stability of RNA/DNA duplex of the viruses' genomes and decreased fitness (virulence) of the viruses with modified genomes, and
f) Selection of a desired modified sequence for further application based on viral virulence.
2. Advanced production of proteins, RNA and DNA:
Progressively growing numbers of proteins and other bio-macromolecules are produced for the purposes of biotechnology and the pharmaceutical industry. Increasing the level of (recombinant) gene expression by the insertion of intronic sequences, or by increasing the DNA/DNA and RNA/DNA duplex stability of its coding sequences using the approach described herein, will significantly improve production of recombinant proteins in eukaryotic cells.
An example of enhanced gene expression may be carried out as follows:
a) Measurement of the thermodynamic stability of RNA/DNA and DNA/DNA duplexes of the gene of the desired protein for production,
b) Design of one or more, preferably several, modified genes by increasing the thermodynamic stability of RNA/DNA duplex of its coding sequences, and/or by insertion of intronic sequences with lower ΔG ratio between RNA/DNA and DNA/DNA stability downstream of 3'-splice sites,
c) De novo synthesis of the modified genes, for example using the techniques mentioned herein,
d) Measurement of the level of expression of the modified genes, for example using the techniques mentioned herein, such as quantitative measurements with an antibody directed against the protein of interest, or by fluorescent microscopy using a fluorescently labelled protein of interest, and
e) Detection or measurement of a correlation between thermodynamic stability of RNA/DNA duplex of the coding sequences of the genes and the level of expression, and/or
f) Detection or measurement of a correlation between the ratio of thermodynamic stability of RNA/DNA and DNA/DNA of the inserted introns and the level of expression of the genes, and
g) Selection of desired modified sequence for further application based on protein or mRNA expression.
3. Gene therapy applications:
Successful gene therapy requires a defined level of exogenous expression of a therapeutic gene in human cells, because either over- or under-expression of the gene can lead to artificial effects or the absence of the required effect of the expressed protein, which may either compromise or diminish the effect of the therapy. The possibility to express several alternative splicing variants in a controlled manner also increases the possibilities for successful gene therapy. Therefore, modulation of gene expression by modification of the thermodynamic stability of DNA/DNA and RNA/DNA duplexes provides a new perspective for fine tuning gene therapy constructs.
An example of producing an enhanced gene therapy vector may be carried out as follows:
a) Measurement of thermodynamic stability of RNA/DNA and DNA/DNA duplexes of one or more regions of the gene of the protein of interest,
b) Design of one or more, preferably several, modified genes by increasing thermodynamic stability of RNA/DNA duplex of its coding sequences and/or by insertion of intronic sequences with a lower ratio between RNA/DNA and DNA/DNA stability downstream of 3'-splice sites,
c) De novo synthesis of the modified genes,
d) Measurement of the level of expression of the modified genes in an appropriate model system, either in vitro or in vivo, and
e) Detection or measurement of a correlation between thermodynamic stability of RNA/DNA duplex of the coding sequences of the genes and the level of expression, and/or
f) Detection or measurement of a correlation between ratio of thermodynamic stability of RNA/DNA and DNA/DNA of the inserted introns and the level of expression of the genes, and
g) Selection of desired modified sequence for further application based on expression level of the therapeutic gene of interest.
Methods and materials of the experimental examples
Genomes and annotations
Annotations and sequences were obtained from the Ensembl genome browser56,57 as follows: release 65 of H. sapiens genes (GRCh37.p5), release 68 of D. melanogaster genes (BDGP5), release 68 of Danio rerio genes (Zv9), and release 68 of A. thaliana genes (TAIR 10). The full length sequences of C. elegans transcripts were obtained from Wormbase (WB190). 50 bp sequences flanking the transcript start sites, end sites and splice sites of C. elegans were obtained from Ensembl (WB220). The list of trans-spliced C. elegans genes was taken from Allen et al. 58. Cryptic 3'-splice sites and their corresponding authentic 3'-splice sites were obtained from DBASS3 36. Branch point sequences were obtained from 40. Only branch points confirmed by a minimum of 3 lariat RT-PCR clones were considered. Pseudoexon and real exon datasets were obtained from 10. Datasets for cassette exons, 3'-alternative splice sites, 5'-alternative splice sites and constitutive exons were obtained from the HEXEvent database 35. The retained intron dataset was obtained from the UCSC Genome Table Browser 59.
Calculation of thermodynamic stability
ΔG of the nearest-neighbor interactions was calculated by Perl-based software using Kowalski's sliding-window approach10. Published values of ΔH and ΔS (at 37°C and 1 M salt concentration) for each nearest-neighbor interaction for DNA/DNA duplexes8 and RNA/DNA duplexes9 were used. Calculations were carried out with a step size of 1 bp and a window size of 9 bp, except where specified otherwise. Color-coded representation of the thermodynamic profiles was performed by Partek Genomics Suite software.
Mapping of potential branch points in H. sapiens
We mapped the human branch point consensus sequence yUnAy40 within the intronic sequence situated from the 50th to the 10th nucleotide upstream of the 3'-splice site.
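The mapping step can be sketched in Python as a regular-expression search for yUnAy (y = pyrimidine, n = any nucleotide) within the stated window. The sketch is illustrative only: the function name, the requirement that the whole motif lies inside the window, and the choice to report the position of the A of the motif are assumptions, and the demonstration sequence is invented.

```python
import re

# yUnAy with U written as T; the lookahead lets overlapping motifs be found.
BP_PATTERN = re.compile(r"(?=([CT]T[ACGT]A[CT]))")


def map_branch_points(intron: str, far: int = 50, near: int = 10):
    """Return, for every yUnAy match lying entirely between the `far`-th and
    the `near`-th nucleotide upstream of the 3'-splice site, the distance of
    the motif's A from the splice site (1 = last nucleotide of the intron)."""
    seq = intron.upper().replace("U", "T")
    n = len(seq)
    start = max(0, n - far)
    end = max(start, n - near + 1)  # exclusive bound, so position `near` is included
    hits = []
    for match in BP_PATTERN.finditer(seq[start:end]):
        a_index = start + match.start(1) + 3  # 0-based index of the A in yUnAy
        hits.append(n - a_index)
    return hits


if __name__ == "__main__":
    # Invented intron end: a CUGAC motif, a pyrimidine stretch and the AG dinucleotide.
    demo = "G" * 30 + "CUGAC" + "U" * 20 + "AG"
    print(map_branch_points(demo))
```

Introns for which the function returns an empty list would count as lacking a predicted BP consensus sequence in the searched window.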
High Resolution Melting analysis
Double-stranded cDNA of the wIL2, IL2 and eIL2 genes was amplified by PCR from plasmids pcDNA3-wIL2, pcDNA3-IL2 and pcDNA3-eIL2, respectively15. The RNA/DNA duplex of wIL2, IL2 and eIL2 was generated as follows: mRNA was produced by in vitro transcription of the respective cDNA by T7 RNA polymerase. Single-stranded DNA (ssDNA) was produced by digestion with Lambda exonuclease (NEB) of a double-stranded PCR product with a 5'-phosphate attached to the strand that was to be removed. Finally, mRNA and template ssDNA of the respective gene were annealed after initial denaturation by decreasing the temperature to 30°C in steps of 1°C. High resolution melting analysis was performed using a Rotor-Gene 6000 instrument and Syto9 intercalating dye in 50 mM sodium phosphate buffer, pH 7.8, following the manufacturer's instructions, raising the temperature from 70 to 95°C in steps of 0.15°C.
Spliceosome complex formation in vitro assay
Pre-spliceosome A complex assembly was carried out as described previously 55,60. The 32P-UTP radiolabeled RNA substrate (gggaagcuugcugcacgucuagggcgcaguaguccaggguuuccuugaugaugucauacuuauccugucccuuuuuuuuccacagCUCGCGGUUGAGGACAAACUCUUCGCGGUCUUUCCAGUGGGGAUCC), which includes 85 bp from the 3'-half of intron 1 and 46 bp of exon 2 of the AdML transcript, was transcribed in vitro by T7 polymerase. 20 fmol of the radiolabeled RNA substrate and 400 fmol of the corresponding antisense DNA oligonucleotides were annealed after initial denaturation by decreasing the temperature from 70 down to 36°C in steps of 2°C. 3 μl of ATP-depleted HeLa cell nuclear extract (IPRACELL) was then added, supplemented with 13.3 mM HEPES (pH 8), 0.13 mM EDTA, 3 mM MgCl2, 24.9 mM KCl, 3.33% PVA, 13.3% glycerol, 0.03% NP-40, 0.66 mM DTT, and with or without 2 mM ATP and 22 mM creatine phosphate, in a final volume of 9 μl. The mixture was incubated for 5 min at 30°C. 1 μl of heparin (10 mg/ml) was added and incubated for 10 min at room temperature. The probe was loaded on a mini gel composed of 0.5% agarose, 4% acrylamide, 0.05% bis-acrylamide, 50 mM Tris and 50 mM glycine. The gel was run for 2 h in 50 mM Tris and 50 mM glycine buffer. The gel was dried and exposed to a Kodak BioMax MR film or PhosphorImager screen.
Statistics
Spearman's rank correlation nonparametric test with 2-tailed significance was used to assess the relationship between ΔG and melting temperature (Tm) of DNA/DNA or mRNA/DNA duplexes of the wIL2, IL2 and eIL2 genes. The Wilcoxon nonparametric rank sum test was used to statistically evaluate the difference between two related samples: ΔG of DNA/DNA duplexes of intronic and exonic sequences of confirmed C. elegans transcripts (n=10961); ΔG of RNA/DNA duplexes of intronic and exonic sequences of confirmed C. elegans transcripts (n=10961); and ΔG of RNA/DNA and DNA/DNA duplexes of intronic sequences of confirmed C. elegans transcripts (n=10961).
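For orientation, the Spearman rank correlation coefficient named above can be computed as the Pearson correlation of the ranks, as in the following self-contained Python sketch. The function names and the example numbers are hypothetical; the numbers are not measured ΔG or Tm values, and the sketch does not compute a significance level.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    # Invented values: more negative dG paired with higher melting temperature.
    delta_g = [-1100.0, -1250.0, -1400.0]
    melting_temp = [80.1, 84.3, 88.9]
    print(spearman(delta_g, melting_temp))
```

In this invented example the more stable duplex always melts at the higher temperature, so the ranks are perfectly inverted.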
The Wilcoxon nonparametric rank sum test was also used to statistically evaluate the thermodynamic profiles across splice sites. The analysis was performed as follows: We calculated the mean value of the difference between ΔG of DNA/DNA and mRNA/DNA duplexes for the least stable region upstream from every 3'-splice site (situated between the 20th and the 2nd nucleotide upstream from the 3'-splice site) and compared it with the mean value of the corresponding adjacent intronic sequence (situated between the 50th and the 21st nucleotide upstream from the 3'-splice site). We evaluated the differences between the two related samples (H. sapiens n=456101, C. elegans n=115224).
We calculated the mean value of the differences between ΔG of DNA/DNA and mRNA/DNA duplexes for the most stable region of every 3'-splice site (situated between the 4th nucleotide of the intron and the 7th nucleotide of the exon from the 3'-splice site) and compared it with the mean value of the corresponding adjacent exonic sequence (situated between the 8th and the 50th nucleotide of the exons downstream from the 3'-splice site). We evaluated the difference between the two related samples (H. sapiens n=456101, C. elegans n=115224). We calculated the mean value of the difference between ΔG of DNA/DNA and mRNA/DNA duplexes for the most stable region of every 5'-splice site (situated between the 8th nucleotide of the intron and the 8th nucleotide of the exon from the 5'-splice site) and compared it with the mean value of the corresponding adjacent exonic sequence (situated between the 9th and the 50th nucleotide of the exons downstream from the 5'-splice site). We evaluated the difference between the two related samples (H. sapiens n=456101, C. elegans n=115224).
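The region comparison described for the sequence upstream of the 3'-splice site can be sketched as follows. The sketch assumes, hypothetically, that each site is represented by a list holding one ΔG difference per intronic nucleotide, ending at the last nucleotide of the intron; the function names are invented. It only prepares the paired region means; the resulting pairs could then be passed to a paired nonparametric test from a statistics library.

```python
from typing import List, Sequence, Tuple


def region_mean(profile: Sequence[float], first_upstream: int, last_upstream: int) -> float:
    """Mean of the values from the `first_upstream`-th to the `last_upstream`-th
    nucleotide upstream of the 3'-splice site (inclusive).

    profile[-1] is assumed to be the last nucleotide of the intron (position 1).
    """
    n = len(profile)
    values = profile[n - first_upstream:n - last_upstream + 1]
    return sum(values) / len(values)


def paired_region_means(profiles: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """For every site: (mean over positions 20..2, mean over positions 50..21)."""
    return [(region_mean(p, 20, 2), region_mean(p, 50, 21)) for p in profiles]


def count_larger_near_site(pairs: Sequence[Tuple[float, float]]) -> int:
    """Number of sites whose splice-site-proximal mean exceeds the adjacent intronic mean."""
    return sum(1 for near, far in pairs if near > far)


if __name__ == "__main__":
    # Invented profile in which the value equals the upstream position (50 ... 1).
    demo_profile = list(range(50, 0, -1))
    pairs = paired_region_means([demo_profile])
    print(pairs, count_larger_near_site(pairs))
```

The same helper can be reused for other windows by changing the two position arguments.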
The differences between all pairs of related samples are statistically significant, with a p-value of less than 2e-15.
References
Burge, C.B., Tuschl, T., & Sharp, P.A., Splicing of Precursors to mRNAs by the Spliceosomes. (Cold Spring Harbor Laboratory Press, New York, 1999).
Kramer, A., The structure and function of proteins involved in mammalian pre-mRNA splicing. Annu Rev Biochem 65, 367-409 (1996).
Roca, X., Krainer, A.R., & Eperon, I.C., Pick one, but be quick: 5' splice sites and the problems of too many choices. Genes Dev 27 (2), 129-144 (2013).
4
Berglund, J.A., Abovich, N., & Rosbash, M., A cooperative interaction between U2AF65 and mBBP/SF1 facilitates branchpoint region recognition. Genes Dev 12 (6), 858-867 (1998).
Peled-Zehavi, H., Berglund, J.A., Rosbash, M., & Frankel, A.D., Recognition of RNA branch point sequences by the KH domain of splicing factor 1 (mammalian branch point binding protein) in a splicing factor complex. Mol Cell Biol 21 (15), 5232-5241 (2001).
Zorio, D.A. & Blumenthal, T., Both subunits of U2AF recognize the 3' splice site in Caenorhabditis elegans. Nature 402 (6763), 835-838 (1999).
Wu, S., Romfo, C.M., Nilsen, T.W., & Green, M.R., Functional recognition of the 3' splice site AG by the splicing factor U2AF35. Nature 402 (6763), 832-835 (1999).
Sun, H. & Chasin, L.A., Multiple splicing defects in an intronic false exon. Mol Cell Biol 20 (17), 6414-6425 (2000).
Zhang, X.H. & Chasin, L.A., Computational definition of sequence motifs governing constitutive exon splicing. Genes Dev 18 (11), 1241-1250 (2004).
Zhang, X.H., Leslie, C.S., & Chasin, L.A., Computational searches for splicing signals. Methods 37 (4), 292-305 (2005).
Zhang, L., Kasif, S., Cantor, C.R., & Broude, N.E., GC/AT-content spikes as genomic punctuation marks. Proc Natl Acad Sci U S A 101 (48), 16855-16860 (2004).
Amit, M. et al., Differential GC Content between Exons and Introns Establishes Distinct Strategies of Splice-Site Recognition. Cell Reports 1 (5), 543-556 (2012).
Zhang, J., Kuo, C.C., & Chen, L., GC content around splice sites affects splicing through pre-mRNA secondary structures. BMC Genomics 12, 90 (2011).
Zamft, B., Bintu, L., Ishibashi, T., & Bustamante, C., Nascent RNA structure modulates the transcriptional dynamics of RNA polymerases. Proc Natl Acad Sci U S A 109 (23), 8948-8953 (2012).
Kudla, G., Lipinski, L., Caffin, F., Helwak, A., & Zylicz, M., High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biol 4 (6), e180 (2006).
Carlon, E., Malki, M.L., & Blossey, R., Exons, introns, and DNA thermodynamics. Phys Rev Lett 94 (17), 178101 (2005).
Kraeva, R.I. et al., Stability of mRNA/DNA and DNA/DNA Duplexes Affects mRNA Transcription. PLoS ONE 2, e290 (2007).
Delcourt, S.G. & Blake, R.D., Stacking energies in DNA. J Biol Chem 266 (23), 15160-15169 (1991).
Doktycz, M.J., Goldstein, R.F., Paner, T.M., Gallo, F.J., & Benight, A.S., Studies of DNA dumbbells. I. Melting curves of 17 DNA dumbbells with different duplex stem sequences linked by T4 endloops: evaluation of the nearest-neighbor stacking interactions in DNA. Biopolymers 32 (7), 849-864 (1992).
Sugimoto, N., Nakano, S., Yoneyama, M., & Honda, K., Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes. Nucleic Acids Res 24 (22), 4501-4505 (1996).
SantaLucia, J., Jr., Allawi, H.T., & Seneviratne, P.A., Improved nearest-neighbor parameters for predicting DNA duplex stability. Biochemistry 35 (11), 3555-3562 (1996).
Breslauer, K.J., Frank, R., Blocker, H., & Marky, L.A., Predicting DNA duplex stability from the base sequence. Proc Natl Acad Sci U S A 83 (11), 3746-3750 (1986).
SantaLucia, J., Jr., A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci U S A 95 (4), 1460-1465 (1998).
Freier, S.M. et al., Improved free-energy parameters for predictions of RNA duplex stability. Proc Natl Acad Sci U S A 83 (24), 9373-9377 (1986).
Sugimoto, N. et al., Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes. Biochemistry 34 (35), 11211-11216 (1995).
Huang, Y. & Kowalski, D., WEB-THERMODYN: Sequence analysis software for profiling DNA helical stability. Nucleic Acids Res 31 (13), 3819-3821 (2003).
Morton, J.J. & Blumenthal, T., RNA processing in C. elegans. Methods Cell Biol 106, 187-217 (2011).
Blumenthal, T., Trans-splicing and operons. WormBook, 1-9 (2005).
Zhang, M.Q., Statistical features of human exons and their flanking regions. Hum Mol Genet 7 (5), 919-932 (1998).
Bradnam, K.R. & Korf, I., Longer first introns are a general property of eukaryotic gene structure. PLoS One 3 (8), e3093 (2008).
Brenet, F. et al., DNA methylation of the first exon is tightly linked to transcriptional silencing. PLoS One 6 (1), e14524 (2011).
Bieberstein, N.I., Carrillo Oesterreich, F., Straube, K., & Neugebauer, K.M., First exon length controls active chromatin signatures and transcription. Cell Rep 2 (1), 62-68 (2012).
Matlin, A.J., Clark, F., & Smith, C.W., Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol 6 (5), 386-398 (2005).
Sakabe, N.J. & de Souza, S.J., Sequence features responsible for intron retention in human. BMC Genomics 8, 59 (2007).
Busch, A. & Hertel, K.J., HEXEvent: a database of Human EXon splicing Events. Nucleic Acids Res 41 (Database issue), D118-124 (2013).
Vorechovsky, I., Aberrant 3' splice sites in human disease genes: mutation pattern, nucleotide structure and comparison of computational tools that predict their utilization. Nucleic Acids Res 34 (16), 4630-4641 (2006).
Krawczak, M. et al., Single base-pair substitutions in exon-intron junctions of human genes: nature, distribution, and consequences for mRNA splicing. Hum Mutat 28 (2), 150-158 (2007).
Teraoka, S.N. et al., Splicing defects in the ataxia-telangiectasia gene, ATM: underlying mutations and consequences. Am J Hum Genet 64 (6), 1617-1631 (1999).
Query, C.C., Moore, M.J., & Sharp, P.A., Branch nucleophile selection in pre-mRNA splicing: evidence for the bulged duplex model. Genes Dev 8 (5), 587-597 (1994).
Gao, K., Masuda, A., Matsuura, T., & Ohno, K., Human branch point consensus sequence is yUnAy. Nucleic Acids Res 36 (7), 2257-2267 (2008).
Li, X. & Manley, J.L., Inactivation of the SR protein splicing factor ASF/SF2 results in genomic instability. Cell 122 (3), 365-378 (2005).
Li, X., Niu, T., & Manley, J.L., The RNA binding protein RNPS1 alleviates ASF/SF2 depletion-induced genomic instability. RNA 13 (12), 2108-2115 (2007).
Huertas, P. & Aguilera, A., Cotranscriptionally formed DNA:RNA hybrids mediate transcription elongation impairment and transcription-associated recombination. Mol Cell 12 (3), 711-721 (2003).
Moore, M.J. & Proudfoot, N.J., Pre-mRNA processing reaches back to transcription and ahead to translation. Cell 136 (4), 688-700 (2009).
Tuduri, S. et al., Topoisomerase I suppresses genomic instability by preventing interference between replication and transcription. Nat Cell Biol 11 (11), 1315-1324 (2009).
Masse, E., Phoenix, P., & Drolet, M., DNA topoisomerases regulate R-loop formation during transcription of the rrnB operon in Escherichia coli. J Biol Chem 272 (19), 12816-12823 (1997).
Konarska, M.M. & Sharp, P.A., Electrophoretic separation of complexes involved in the splicing of precursors to mRNAs. Cell 46 (6), 845-855 (1986).
Konarska, M.M., Analysis of splicing complexes and small nuclear ribonucleoprotein particles by native gel electrophoresis. Methods Enzymol 180, 442-453 (1989).
Matlin, A.J. & Moore, M.J., Spliceosome assembly and composition. Adv Exp Med Biol 623, 14-35 (2007).
Jurica, M.S. & Moore, M.J., Pre-mRNA splicing: awash in a sea of proteins. Mol Cell 12 (1), 5-14 (2003).
Mayeda, A. & Krainer, A.R., Preparation of HeLa cell nuclear and cytosolic S100 extracts for in vitro splicing. Methods Mol Biol 118, 309-314 (1999).
Mayeda, A. & Krainer, A.R., Mammalian in vitro splicing assays. Methods Mol Biol 118, 315-321 (1999).
Hua, Y., Vickers, T.A., Okunola, H.L., Bennett, C.F., & Krainer, A.R., Antisense masking of an hnRNP A1/A2 intronic splicing silencer corrects SMN2 splicing in transgenic mice. Am J Hum Genet 82 (4), 834-848 (2008).
Sickmier, E.A. et al., Structural basis for polypyrimidine tract recognition by the essential pre-mRNA splicing factor U2AF65. Mol Cell 23 (1), 49-59 (2006).
Mackereth, C.D. et al., Multi-domain conformational selection underlies pre-mRNA splicing regulation by U2AF. Nature 475 (7356), 408-411 (2011).
Kersey, P.J. et al., Ensembl Genomes: an integrative resource for genome-scale data from non-vertebrate species. Nucleic Acids Res 40 (Database issue), D91-97 (2012).
Kinsella, R.J. et al., Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database (Oxford) 2011, bar030 (2011).
Allen, M.A., Hillier, L.W., Waterston, R.H., & Blumenthal, T., A global analysis of C. elegans trans-splicing. Genome Res 21 (2), 255-264 (2011).
Karolchik, D. et al., The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32 (Database issue), D493-496 (2004).
Guth, S., Tange, T.O., Kellenberger, E., & Valcarcel, J., Dual function for U2AF(35) in AG-dependent pre-mRNA splicing. Mol Cell Biol 21 (22), 7673-7681 (2001).
Kim Seon-Young et al. "The human elongation factor 1 alpha (EF-1 alpha) first intron highly enhances expression of foreign genes ..." J Biotechnology 93; 183-187 (2002).
Sugimoto N. et al. "Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes" Biochemistry 34: 11211-11216 (1995).

Claims

1. Method for modulating protein expression level by modifying the gene encoding said protein, comprising a) provision of an initial DNA sequence (initial DNA) that comprises one or more sequences that encode an amino acid sequence of the protein to be expressed, b) determination of the ΔG ratio (ΔG of DNA/DNA duplex stability to ΔG of RNA/DNA duplex stability) for one or more regions of said initial DNA sequence, and c) modification of said initial DNA sequence to provide a product DNA sequence (product DNA) with a desired ΔG ratio, wherein the protein expression level is dependent on said ΔG ratio.
2. Method according to the preceding claim, wherein the ΔG ratio is determined by measurement and/or calculation of the ΔG value for DNA/DNA duplex stability and the ΔG value for RNA/DNA duplex stability for any given specified sequence region, and calculation of the ratio between the ΔG value for DNA/DNA duplex stability of a specified sequence region to the ΔG value for RNA/DNA duplex stability for the same sequence region, wherein a sequence region with a ΔG ratio above 1 exhibits higher DNA/DNA duplex stability than RNA/DNA duplex stability and a sequence region with a ΔG ratio below 1 exhibits higher RNA/DNA duplex stability than DNA/DNA duplex stability.
3. Method according to any one of the preceding claims, wherein the gene encoding the protein comprises one or more introns, characterised in that an increase of the ΔG ratio in a specified sequence region of the product DNA in comparison to the initial DNA provides increased expression of the protein encoded by said product DNA, wherein said specified sequence region is any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splicing site of an intron.
4. Method according to the preceding claim, wherein the ΔG ratio of the specified sequence region in the product DNA is above 1, preferably above 1.2.
5. Method according to any one of the preceding claims, wherein modification of the initial DNA is carried out by insertion of one or more introns and corresponding splice sites.
6. Method according to any one of the preceding claims, wherein reduction of the ΔG ratio for any given coding region and/or 5' untranslated region (5'-UTR) of the product DNA in comparison to the initial DNA provides increased expression of the protein encoded by said product DNA.
7. Method according to the preceding claim, wherein the ΔG ratio of the specified coding and/or 5'-UTR sequence region in the product DNA is below 1.
8. Method according to any one of the preceding claims, wherein modification of the initial DNA is carried out according to the degeneracy of the genetic code without changing the amino acid sequence encoded by said initial DNA.
9. Method according to any one of the preceding claims, wherein one or more of the steps a), b) or c) are carried out by one or more computer programmes, executed on a computing device.
10. Method according to any one of the preceding claims, wherein the ΔG values for DNA/DNA duplex stability and/or the ΔG value for RNA/DNA duplex stability for any given specified sequence region are determined using a sliding-window calculation of entropy (ΔS) and enthalpy (ΔH) of nearest neighbour interactions.
11. Method according to any one of the preceding claims, wherein ΔG values for DNA/DNA duplex stability are calculated on 10 nearest-neighbour interactions, and ΔG values for RNA/DNA duplex stability are calculated on 10 to 20, preferably 16, nearest-neighbour interactions.
12. Method according to any one of the preceding claims, wherein the sliding window approach utilises a 1 to 20 bp, preferably 1 bp, step size and a 1 to 20 bp, preferably 2 to 9 bp, window size.
13. Method for manufacturing a nucleic acid molecule that corresponds to product DNA that has been modified by the method according to any one of the preceding claims, comprising carrying out the method of any one of the preceding claims and subsequently synthesizing, cloning and/or isolating said nucleic acid molecule.
14. Method according to any one of the preceding claims comprising the design, simulation of manufacture or manufacture of an attenuated human pathogen, preferably a virus, for use as a vaccine, comprising a product DNA that has been modified by the method according to any one of the preceding claims.
15. Method according to any one of the preceding claims comprising the design, simulation of manufacture or manufacture of an expression vector for recombinant protein expression, comprising a product DNA that has been modified by the method according to any one of the preceding claims encoding a protein to be expressed.
16. Method according to any one of the preceding claims comprising the simulation of manufacture or manufacture of a therapeutic expression vector comprising a product DNA that has been modified by the method according to any one of the preceding claims encoding a therapeutic protein.
17. Method for predicting the location of splice sites in one or more unannotated genomes and/or genomic regions by provision of a DNA sequence of interest, determination of the ΔG ratio for one or more regions of said DNA sequence, wherein a ΔG ratio of any given specified sequence region above 1 indicates a 3' splicing site of an intron.
18. A pair of first and second nucleic acid molecules, wherein said first nucleic acid molecule is an initial DNA sequence (initial DNA) that comprises one or more sequences that encode an amino acid sequence of a protein to be expressed, and said second nucleic acid molecule is a product DNA sequence (product DNA) with a desired ΔG ratio that has been manufactured according to the method of the preceding claims, wherein said pair of sequences exhibit differences in ΔG ratio in one or more sequence regions.
19. A pair of first and second nucleic acid molecules according to the preceding claim, wherein said sequences differ with respect to their nucleic sequence and ΔG ratio in one or more sequence regions, without any difference in amino acid sequence of the encoded protein to be expressed.
20. A pair of first and second nucleic acid molecules according to any one of the preceding claims, wherein the gene encoding the protein to be expressed comprises one or more introns, characterised in that an increased ΔG ratio in a specified sequence region of the product DNA in comparison to the initial DNA is present and provides increased expression of the protein encoded by said product DNA, wherein said specified sequence region is any given sequence region within 50 nt, preferably 20 nt, upstream of the 3' splicing site of an intron.
21. A pair of first and second nucleic acid molecules according to any one of the preceding claims, wherein a reduced ΔG ratio for any given coding region and/or 5' untranslated region (5'-UTR) of the product DNA in comparison to the initial DNA is present and provides increased expression of the protein encoded by said product DNA.
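
The sliding-window determination of the ΔG ratio referred to in the claims can be illustrated with a minimal sketch. It is an assumption-laden example, not the claimed implementation. The nearest-neighbour tables produced by `toy_table` contain HYPOTHETICAL placeholder values and are not published parameters; in practice they would be replaced by measured ΔH and ΔS values for DNA/DNA and RNA/DNA duplexes. For brevity the same window size is applied to both duplex types and both tables are keyed by the DNA coding-strand dinucleotide, which is a simplification. All function names are illustrative.

```python
from itertools import product


def toy_table(h_at, h_gc, s_at, s_gc):
    """Build a HYPOTHETICAL table: dinucleotide -> (dH kcal/mol, dS cal/(mol K))."""
    table = {}
    for a, b in product("ACGT", repeat=2):
        n_gc = (a in "GC") + (b in "GC")  # 0, 1 or 2 G/C bases in the step
        dh = h_at + (h_gc - h_at) * n_gc / 2
        ds = s_at + (s_gc - s_at) * n_gc / 2
        table[a + b] = (dh, ds)
    return table


def window_dg(seq, start, size, table, temp_k=310.15):
    """Sum dH and dS over `size` nearest-neighbour steps, then dG = dH - T*dS."""
    dh = ds = 0.0
    for i in range(start, start + size):
        h, s = table[seq[i:i + 2]]
        dh += h
        ds += s
    return dh - temp_k * ds / 1000.0  # convert dS from cal to kcal


def dg_ratio_profile(seq, dna_table, hybrid_table, size=9, step=1):
    """Return dG(DNA/DNA) / dG(RNA/DNA) for each window along `seq`."""
    seq = seq.upper()
    n_steps = len(seq) - 1  # number of nearest-neighbour interactions
    ratios = []
    for start in range(0, n_steps - size + 1, step):
        dg_dna = window_dg(seq, start, size, dna_table)
        dg_hybrid = window_dg(seq, start, size, hybrid_table)
        ratios.append(dg_dna / dg_hybrid)
    return ratios
```

With real parameter tables, windows whose ratio is above 1 would correspond to regions in which the DNA/DNA duplex is the more stable of the two, and windows below 1 to regions in which the RNA/DNA duplex is more stable.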
PCT/EP2014/062659 2013-06-17 2014-06-17 Method for modulating gene expression WO2014202573A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP13172309 2013-06-17
EP13172309.0 2013-06-17

Publications (1)

Publication Number Publication Date
WO2014202573A1 (en) 2014-12-24

Family

ID=48628336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/062659 WO2014202573A1 (en) 2013-06-17 2014-06-17 Method for modulating gene expression

Country Status (1)

Country Link
WO (1) WO2014202573A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006126070A2 (en) * 2005-05-24 2006-11-30 Avestha Gengraine Technologies Pvt Ltd A process comprising codon optimization for the production of recombinant activated human protein c for the treatment of sepsis

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHATTERJEE SUBHRANGSU ET AL: "The 5-Me of thyminyl (T) interaction with the neighboring nucleobases dictate the relative stability of isosequential DNA-RNA hybrid duplexes.", 7 November 2005, ORGANIC & BIOMOLECULAR CHEMISTRY 7 NOV 2005, VOL. 3, NR. 21, PAGE(S) 3911 - 3915, ISSN: 1477-0520, XP002727861 *
KIM SEON-YOUNG ET AL: "The human elongation factor 1 alpha (EF-1alpha) first intron highly enhances expression of foreign genes from the murine cytomegalovirus promoter", JOURNAL OF BIOTECHNOLOGY, vol. 93, no. 2, 14 February 2002 (2002-02-14), ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, pages 183 - 187, XP002384369, ISSN: 0168-1656 *
KRAEVA RAYNA I ET AL: "Stability of mRNA/DNA and DNA/DNA Duplexes Affects mRNA Transcription", PLOS ONE, vol. 2, no. 3, March 2007 (2007-03-01), XP002717382, ISSN: 1932-6203 *
NEDELCHEVA-VELEVA MARINA N ET AL: "The thermodynamic patterns of eukaryotic genes suggest a mechanism for intron-exon recognition.", NATURE COMMUNICATIONS, vol. 4, 2101, 1 July 2013 (2013-07-01), NATURE PUBLISHING GROUP, LONDON, UK, pages 1 - 12, XP002717381, ISSN: 2041-1723 *
SUGIMOTO N ET AL: "Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes", BIOCHEMISTRY, AMERICAN CHEMICAL SOCIETY, US, vol. 34, no. 35, 1 January 1995 (1995-01-01), pages 11211 - 11216, XP002250298, ISSN: 0006-2960, DOI: 10.1021/BI00035A029 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14731606; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14731606; Country of ref document: EP; Kind code of ref document: A1)