WO2016086988A1 - Optimisation of coding sequence for functional protein expression - Google Patents

Optimisation of coding sequence for functional protein expression


Publication number
WO2016086988A1
Authority
WO
WIPO (PCT)
Prior art keywords
codon
cell
host cell
expression
polynucleotide
Prior art date
Application number
PCT/EP2014/076436
Other languages
French (fr)
Inventor
Lotte Bregje Westerhof
Jacob Bakker
Ruud Hendrikus Petrus Wilbers
Arjen Schots
Geert Smant
Aska Goverse
Johannes Helder
Marten Gerko STERKEN
Laurens Bastian SNOEK
Jan Edward Kammenga
Original Assignee
Wageningen Universiteit
Priority date
Filing date
Publication date
Application filed by Wageningen Universiteit
Priority to PCT/EP2014/076436
Publication of WO2016086988A1


Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/63 Introduction of foreign genetic material using vectors; Vectors; Use of hosts therefor; Regulation of expression
    • C12N15/67 General methods for enhancing the expression
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1089 Design, preparation, screening or analysis of libraries using computer algorithms

Definitions

  • the present invention relates to an approach aimed at the modification of codons in individual polynucleotide sequences encoding a heterologous protein of interest, without altering the amino acid sequence of the polypeptide to enhance the amount of functional expression in a host organism of interest. Recognising that maximum translation efficiency and therefore protein production is influenced by codon usage of a coding sequence, in its broadest aspect, this approach exploits redundancy in the genetic code by providing a universal set of codons which may be used at certain positions in the polynucleotide sequence in order to achieve improved heterologous protein production in a range of host cells.
  • the present invention also relates to the optimization of the translation efficiency of messenger RNAs on the basis of their secondary structure characteristics, and the provided set of criteria may be used to increase protein expression in particular hosts.
  • codons used most frequently in highly expressed genes have been shown to correspond to genomic G+C content and often match the most abundant tRNAs in many species. It is assumed that codons that match more abundant tRNAs would be translated faster as tRNA availability for translation occurs via diffusion and the chance of encountering a more abundant tRNA is greater than when encountering a rarer tRNA. An increase in translation rate allows ribosomes to finish translation and reinitiate translation sooner.
  • the probability that a ribosome initially loads a non-matching tRNA is smaller when a codon matches a more abundant tRNA resulting in an energetic advantage as three-quarters of the energy to incorporate an amino acid is lost if a non-matching tRNA has to be rejected after proofreading.
  • the use of optimal codons in highly-expressed genes was hypothesized to provide a fitness gain by improved translational efficiency.
  • the codon use of a gene of interest is often adapted to reflect the expression host's codon use in highly expressed genes in order to enhance heterologous protein production.
  • the results obtained with this strategy are variable.
  • a comparison between the overall codon use and the codon use in highly expressed genes of several plant species revealed that optimal codons are not always the codons of which the use is increased most with expression.
  • although the codon composition of highly expressed genes differs between monocots and dicots, the same codons often rise in frequency with increasing expression levels (expression codons) and are in many cases C-ending. These conserved expression codons were used to optimise the codon composition of three genes, which enhanced protein yield significantly upon stable and transient expression in plants.
  • the present invention provides a quick, practical, universal method of increasing functional heterologous protein expression with wide application for the expression of heterologous genes in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells.
  • this method removes any need for consideration of the host cell or specific cellular context involved.
  • the present invention also provides specific sets of codon replacements which further improve functional protein expression in particular hosts, specifically prokaryotes, fungi, animals, nematodes, protists and plants.
  • the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
  • the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
  • the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
  • Amino acid: Threonine; codons to be replaced: ACT, ACA or ACG; replacement codon: ACC
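The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration, not the claimed method: the function name `replace_codons` and the dictionary `REPLACEMENTS` are hypothetical, and the table contains only the threonine row that survives in this text (read here as ACT, ACA or ACG being replaced by ACC). The remaining rows of the replacement tables are not reproduced in this excerpt and would have to be supplied.

```python
# Hypothetical, partial replacement table: only the threonine row is available
# in the text above (ACT, ACA or ACG -> ACC). Other rows must be added by the user.
REPLACEMENTS = {"ACT": "ACC", "ACA": "ACC", "ACG": "ACC"}


def replace_codons(cds, table=REPLACEMENTS):
    """Swap every listed codon of an in-frame coding sequence for its replacement."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Codons absent from the table are left unchanged, so the protein is preserved
    # as long as every table entry maps a codon to a synonymous codon.
    return "".join(table.get(codon, codon) for codon in codons)


print(replace_codons("ATGACTACAACGTAA"))  # ATGACCACCACCTAA
```

Because the lookup is per codon and in frame, the amino acid sequence is unchanged provided each table entry is synonymous.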
  • the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
  • the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
  • the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
  • heterologous protein expression may be achieved by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table, particularly where the host cell is a prokaryotic cell, a fungal cell or a nematode cell:
  • heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
  • AGC and/or:
  • the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
  • the host cell being selected from a prokaryotic cell, a fungal cell, a plant cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
  • the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
  • modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
  • heterologous protein expression is further improved by supplementing the codon changes detailed in the table above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
  • the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
  • modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
  • heterologous protein expression is further improved by supplementing the codon changes detailed in the table above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
  • the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
  • the host cell is an Arabidopsis thaliana cell.
  • RNAs are folded structures and translation of a given mRNA into a polypeptide requires unfolding.
  • the necessary helicase activity is typically provided by the ribosome itself. This unfolding requires energy and in essence, a linear mRNA (i.e. an RNA polymer without secondary structure) would be optimal for the maximization of protein production.
  • a certain degree of folding makes mRNA less susceptible to degradation and increases its diffusibility.
  • the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the relevant table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the relevant table(s); the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence and wherein the method further comprises; analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence; and incorporating in said polynucleotide sequence a pattern of optimal and non-optimal codons at a site associated
  • the method may comprise merely making the universal codon changes, and/or making modifications according to the replacement codon tables which are specific for particular host cells.
  • analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to; examining and taking account of the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size.
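The structural measures listed above can be illustrated on a predicted structure in dot-bracket notation. The sketch below rests on assumptions that this excerpt does not confirm: a "stem" is taken to be a maximal run of paired positions, a "loop" a maximal run of unpaired positions, a "stem-loop transition" a boundary between two such runs, sizes are counted in nucleotides, and the population standard deviation is used. The function name `stem_loop_stats` is hypothetical.

```python
from itertools import groupby
from statistics import mean, pstdev


def stem_loop_stats(dot_bracket):
    """Summarise runs of paired ('(' or ')') and unpaired ('.') positions."""
    if not dot_bracket:
        raise ValueError("empty structure")
    # Split the structure into alternating runs; is_loop is True for '.' runs.
    runs = [(is_loop, len(list(group)))
            for is_loop, group in groupby(dot_bracket, key=lambda ch: ch == ".")]
    stems = [size for is_loop, size in runs if not is_loop]
    loops = [size for is_loop, size in runs if is_loop]
    transitions = len(runs) - 1  # assumed definition: boundaries between runs
    return {
        "transitions_per_kb": 1000.0 * transitions / len(dot_bracket),
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "sd_stem": pstdev(stems) if stems else 0.0,
        "sd_loop": pstdev(loops) if loops else 0.0,
        "max_stem": max(stems, default=0),
        "max_loop": max(loops, default=0),
    }


stats = stem_loop_stats("((((...))))..((...))")
print(stats["transitions_per_kb"], stats["mean_stem"], stats["max_loop"])
```

A lower standard deviation under these definitions corresponds to more uniform stem or loop sizes, which is the sense in which it can serve as a proxy for an even distribution.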
  • uneven stem loop distributions will be discarded and the polynucleotide sequence codon composition will be altered (i.e. non-optimally) based on the observation of mRNA secondary structure to improve translational efficiency and therefore functional protein expression.
  • a novel aspect of the invention is the selection of mRNA structures with the most even distribution of stems and loops that leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Consequently, in a further aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
  • the first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (>100,000 mRNA variants).
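Generating a pool of synonymous variants can be sketched as follows. This is an illustration under assumptions: it uses the standard genetic code, enumerates variants lazily because the number of combinations grows exponentially with sequence length, and the names `CODON_TO_AA`, `SYNONYMS` and `synonymous_variants` are hypothetical. How the pool was actually built or bounded is not stated in this excerpt.

```python
from itertools import islice, product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for codon, amino_acid in CODON_TO_AA.items():
    SYNONYMS.setdefault(amino_acid, []).append(codon)


def synonymous_variants(cds):
    """Lazily yield every coding sequence that encodes the same protein as cds."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    choices = [SYNONYMS[CODON_TO_AA[codon]] for codon in codons]
    for combination in product(*choices):
        yield "".join(combination)


# Met-Thr-Trp: only threonine has synonymous codons, so four variants result.
print(sorted(synonymous_variants("ATGACTTGG")))
# A bounded sample of a larger pool can be taken without building it in full.
first_five = list(islice(synonymous_variants("ATGGCTGCTGCTTAA"), 5))
```

For realistic gene lengths the full product is far too large to enumerate, so some form of sampling or position-wise restriction would be needed in practice.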
  • all mRNA species in the pool are then folded in silico.
  • the term "in silico" is widely used in the art and will be understood by the average skilled person as meaning performed on a computer or via computer simulation.
  • the RNA structure is predicted in silico using standard techniques and usually under the temperature and salt concentrations relevant for the preferred host. Appropriate software packages or applications incorporating suitable algorithms may be selected for performing the folded mRNA structure prediction. Suitable packages include, but are not limited to; an RNA structure prediction program such as Vienna RNAfold 2.0 (Lorenz et al..
  • the mRNA structure prediction will be carried out using such a prediction program using the standard settings and the folding parameters, for example, those established by Andronescu et al. (Andronescu et al., 2007 Bioinformatics, 23 (13), i19-i28) and preferably, adjusting the folding-temperature to that of the intracellular temperature of the host of interest. More preferably, the temperature and salt concentration parameters will be adjusted to match those of the preferred host. Finally the mRNAs from the library of synonymous variants that have the most even distribution of stems and loops are selected.
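One way to run the folding step is to call the ViennaRNA `RNAfold` command-line program from Python. The sketch below is an assumption about tooling, not the procedure used in the source: it relies on the `-T` (temperature in degrees Celsius), `-P` (energy parameter file) and `--noPS` (no PostScript output) options of `RNAfold`, and the helper names are hypothetical. The path to a parameter file, for example one derived from Andronescu et al., must be supplied by the user, and salt adjustment is not handled here because its availability depends on the installed version.

```python
import shutil
import subprocess


def rnafold_command(temperature_c, param_file=None):
    """Build the RNAfold argument list for a given folding temperature."""
    command = ["RNAfold", "--noPS", "-T", str(temperature_c)]
    if param_file is not None:
        command += ["-P", param_file]  # user-supplied energy parameter file
    return command


def parse_rnafold_output(text):
    """Extract (dot_bracket, energy_kcal_per_mol) from RNAfold's two-line output."""
    lines = text.strip().splitlines()
    structure, energy = lines[1].split(" ", 1)
    return structure, float(energy.strip("() "))


def fold(sequence, temperature_c, param_file=None):
    """Fold one sequence; requires the RNAfold executable on PATH."""
    if shutil.which("RNAfold") is None:
        raise RuntimeError("RNAfold executable not found on PATH")
    result = subprocess.run(
        rnafold_command(temperature_c, param_file),
        input=sequence + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_rnafold_output(result.stdout)


print(rnafold_command(30.0))
```

The temperature passed in would be the intracellular temperature of the chosen host, as the paragraph above recommends.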
  • the mRNAs having the most even distribution of stems and loops may be identified by the structural characteristics outlined below. In particular the standard deviation is used as a measure for an even distribution of the sizes of the stems and loops which is preferred. Typically, the more similar the stem sizes of an mRNA the higher the translation efficiency. Additionally, the more similar the loop sizes of an mRNA the higher the translation efficiency. Where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Parameters which may be influential include, for example, the folding energy of the 5' terminus and the selection of codons that are frequently used and match the most abundant tRNAs.
  • codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred.
  • Methods for determining the folding energy of mRNA may be based on, but are not limited to those described by Tuller et al. (Tuller et al., 2009, PNAS 107:3645-3650) and Kudla et al. (Kudla et al. 2009, Science, 324:255-258).
  • the mRNA molecule from -23 to +39 should have an average folding energy of at least -6 kcal/mol for E. coli and of at least -4 kcal/mol for S. cerevisiae, as determined by the use of sliding windows of 40 nt with 1 nt steps. Codon choice of the first 13 nt providing a low energy will depend on the 5' UTR provided by the expression cassette (Kudla et al. 2009, Science, 324: 255-258; Tuller et al., 2009, PNAS 107: 3645-3650). Alternatively, instead of adapting the first 13 nt, the 5' UTR may be adapted to provide a low folding energy.
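The sliding-window average over the region around the start codon can be sketched as below. Several points are assumptions: positions -23 to -1 are taken from the end of the 5' UTR and positions +1 to +39 from the start of the coding sequence (62 nt in total, giving 23 windows of 40 nt), the coding sequence in the example is made up, and the folding-energy function is passed in by the caller because no particular energy calculation is fixed here. The function names are hypothetical.

```python
def window_sequences(utr5, cds, upstream=23, downstream=39, size=40, step=1):
    """Return the sliding windows covering positions -upstream to +downstream."""
    if len(utr5) < upstream or len(cds) < downstream:
        raise ValueError("5' UTR or coding sequence is too short for the region")
    region = utr5[-upstream:] + cds[:downstream]
    return [region[i:i + size] for i in range(0, len(region) - size + 1, step)]


def mean_window_energy(windows, fold_energy):
    """Average a caller-supplied folding energy (kcal/mol) over all windows."""
    return sum(fold_energy(window) for window in windows) / len(windows)


# 5' UTR given in the text as SEQ ID NO: 1; the coding sequence is a made-up example.
UTR5 = "GTTTTTATTTTTAATTTTCTTTCAAATACTTCCACC"
example_cds = "ATG" + "GCC" * 12

windows = window_sequences(UTR5, example_cds)
print(len(windows), len(windows[0]))  # 23 40

# Placeholder energy function for illustration only; a real folding tool goes here.
average = mean_window_energy(windows, lambda window: -1.0)
```

The resulting average would then be compared with the host-specific value quoted above (-6 kcal/mol for E. coli, -4 kcal/mol for S. cerevisiae).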
  • the 5' UTR used in the present examples is very U-rich (GTTTTTATTTTTAATTTTCTTTCAAATACTTCCACC [SEQ ID NO: 1]), which in most cases provided a relatively high (close to 0) folding energy when using primarily C-ending codons.
  • analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to; examining and taking account of; the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size.
  • the polynucleotide sequence codon composition will be altered (i.e. non-optimally) to avoid uneven stem loop distributions to improve translational efficiency and therefore functional protein expression.
  • Such alterations may include incorporating one or more codons listed as second preference or third preference replacement codons in place of the first preference codon where the secondary structure criteria are not fulfilled by inclusion of the first preference codon.
  • such alterations may include retention of the wild-type (WT) or native codon where inclusion of an optimal codon negatively impacts the secondary structure with respect to the particular criteria for each host cell.
  • the polynucleotide will have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp).
  • the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
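Selecting variants against a numeric range such as the one above can be sketched as a simple filter. The function name `within_limits`, the dictionary layout and the candidate values are hypothetical and invented for illustration; only the general range (at least 110 and fewer than 250 transitions per kbp) is taken from the text. Host-specific ranges for other statistics could be added to the same dictionary.

```python
def within_limits(stats, limits):
    """Return True when every named statistic satisfies lower <= value < upper."""
    return all(lower <= stats[name] < upper
               for name, (lower, upper) in limits.items())


# General range from the text: at least 110 and fewer than 250 transitions per kbp.
GENERAL_LIMITS = {"transitions_per_kb": (110, 250)}

# Made-up statistics for three hypothetical variants.
candidates = {
    "variant_a": {"transitions_per_kb": 182.0},
    "variant_b": {"transitions_per_kb": 96.5},
    "variant_c": {"transitions_per_kb": 250.0},
}

selected = [name for name, stats in candidates.items()
            if within_limits(stats, GENERAL_LIMITS)]
print(selected)  # ['variant_a']
```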
  • the polynucleotide will have a maximum stem size of less than 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 14 bp to 15 bp. More preferably, the polynucleotide will have a maximum loop size of less than 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp. Additionally, in embodiments wherein the host cell is a prokaryotic cell, preferably a bacterial cell and more preferably an E.
  • the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
  • the selected polynucleotide will preferably have a mean stem size between 5.45 bp and 2.50 bp, optionally in the range 5.45 to 4.00 bp, 5.40 bp to 2.60 bp, 5.30 bp to 2.70 bp, 5.20 bp to 2.80 bp, 5.10 bp to 2.90 bp, 5.00 bp to 3.00 bp, 4.90 to 3.10 bp, 4.80 to 3.20 bp, 4.70 to 3.30 bp, 4.60 to 3.40 bp, 4.50 to 3.50 bp, 4.40 to 3.60 bp, 4.30 to 3.70 bp, 4.20 to 3.80 bp or 4.10 to 3.90 bp.
  • the method further comprises selecting a polynucleotide having a mean loop size between 3.16 bp and 2.00 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp.
  • the method further comprises selecting a polynucleotide having a loop size standard deviation of between 2.95 and 2 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp, 2.60 bp to 2.40 bp or 2.50 bp to 2.40 bp.
  • the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.50, preferably between 3.50 and 2.00 bp, optionally in the range 3.40 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp or 2.80 bp to 2.70 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 16 bp, optionally in the range 10 bp to 16 bp, 11 bp to 15 bp or 12 bp to 14 bp.
  • the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.
  • the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
  • the selected polynucleotide will have a mean stem size in the range 5.20 to 2.50 bp, optionally in the range 5.20 bp to 4.00 bp, 5.20 to 2.60 bp, 5.10 bp to 2.70 bp, 5.00 bp to 2.80 bp, 4.90 bp to 2.90 bp, 4.80 bp to 3.00 bp, 4.70 to 3.10 bp, 4.60 to 3.20 bp, 4.50 to 3.30 bp, 4.40 to 3.40 bp, 4.30 to 3.50 bp, 4.20 to 3.60 bp, 4.10 to 3.70 bp or 4.00 to 3.80 bp.
  • the method further comprises selecting a polynucleotide having a mean loop size between 3.32 bp and 3.00 bp, optionally in the range 3.30 bp to 3.00 bp, 3.25 bp to 3.05 bp, 3.20 bp to 3.10 bp or 3.15 bp to 3.10 bp.
  • the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.20 and 2 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp.
  • the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.40, preferably between 3.40 and 2.00 bp, optionally in the range 3.30 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp, 2.80 bp to 2.40 bp or 2.60 bp to 2.50 bp.
  • the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp or 13 bp to 15 bp.
  • the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
  • the selected polynucleotide will preferably have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp).
  • the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
  • the selected polynucleotide will preferably have a mean stem size between 5.27 bp and 2.50 bp, optionally in the range 5.27 bp to 4.00 bp, 5.20 to 2.40 bp, 5.10 bp to 2.50 bp, 5.00 to 2.60 bp, 4.90 bp to 2.70 bp, 4.80 bp to 2.80 bp, 4.70 bp to 2.90 bp, 4.60 bp to 3.00 bp, 4.50 to 3.10 bp, 4.40 to 3.20 bp, 4.30 to 3.30 bp, 4.20 to 3.40 bp, 4.10 to 3.50 bp, 4.00 to 3.60 bp or 3.90 to 3.70 bp.
  • the method further comprises selecting a polynucleotide having a mean loop size between 3.77 bp and 3.00 bp, optionally in the range 3.75 bp to 3.00 bp, 3.70 bp to 3.10 bp, 3.60 bp to 3.20 bp or 3.50 bp to 3.30 bp.
  • the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.65 and 2.00 bp, optionally in the range 3.60 bp to 2.10 bp, 3.50 bp to 2.20 bp, 3.40 bp to 2.30 bp, 3.30 bp to 2.40 bp, 3.30 bp to 2.50 bp, 3.20 bp to 2.60 bp, 3.10 bp to 2.70 bp or 3.00 bp to 2.80 bp.
  • the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.25, preferably between 3.25 and 2.00 bp, optionally in the range 3.20 bp to 2.10 bp, 3.10 bp to 2.20 bp, 3.00 bp to 2.30 bp, 2.90 bp to 2.40 bp, 2.80 bp to 2.50 bp or 2.70 bp to 2.60 bp.
  • the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp.
  • the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
  • the selected polynucleotide will preferably have at least 114 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 114 to 200/kbp, 115 to 249/kbp, 116 to 248/kbp, 117 to 247/kbp, 118 to 246/kbp, 119 to 245/kbp, 120 to 244/kbp, 121 to 243/kbp, 122 to 242/kbp, 123 to 241/kbp, 124 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to
  • the selected polynucleotide will preferably have a mean stem size between 5.35 and 2.50 bp, optionally in the range 5.35 bp to 4.00 bp, 5.30 to 2.40 bp, 5.20 bp to 2.50 bp, 5.10 to 2.60 bp, 5.00 bp to 2.70 bp, 4.90 bp to 2.80 bp, 4.80 bp to 2.90 bp, 4.70 bp to 3.00 bp, 4.60 to 3.10 bp, 4.50 to 3.20 bp, 4.40 to 3.30 bp, 4.30 to 3.40 bp, 4.20 to 3.50 bp, 4.10 to 3.60 bp, 4.00 to 3.70 bp or 3.90 to 3.80 bp.
  • the method further comprises selecting a polynucleotide having a mean loop size between 3.47 bp and 3.00 bp, optionally in the range 3.45 bp to 3.00 bp, 3.40 bp to 3.10 bp or 3.30 bp to 3.20 bp.
  • the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.37 and 2.00 bp, optionally in the range 3.35 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp, or 2.80 bp to 2.70 bp.
  • the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.27, preferably between 3.27 and 2.00 bp, optionally in the range 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp.
  • the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp.
  • the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.
  • the selected polynucleotide will preferably have at least 120 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 120 to 200/kbp, 121 to 249/kbp, 122 to 248/kbp, 123 to 247/kbp, 124 to 246/kbp, 125 to 245/kbp, 130 to 240/kbp, 135 to 235/kbp, 140 to 230/kbp, 145 to 225/kbp, 150 to 220/kbp, 155 to 215/kbp, 160 to 210/kbp, 165 to 205/kbp, 170 to 200/kbp, 175 to 195/kbp or 180 to 190/kbp.
  • the selected polynucleotide will preferably have a mean stem size between 4.35 and 2.50 bp, optionally in the range 4.35 to 4.00 bp, 4.30 to 2.40 bp, 4.20 bp to 2.50 bp, 4.10 to 2.60 bp, 4.00 bp to 2.70 bp, 3.90 bp to 2.80 bp, 3.80 bp to 2.90 bp, 3.70 bp to 3.00 bp, 3.60 to 3.10 bp, 3.50 to 3.20 bp or 3.40 to 3.30 bp.
  • the method further comprises selecting a polynucleotide having a mean loop size between 5.18 bp and 4.00 bp, optionally in the range 5.15 bp to 4.00 bp, 5.10 bp to 4.10 bp, 5.00 bp to 4.20 bp, 4.90 bp to 4.30 bp, 4.80 bp to 4.40 bp or 4.70 bp to 4.50 bp.
  • the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.00 and 2.00 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp or 2.60 bp to 2.40 bp.
  • the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.28, preferably between 3.28 and 2.00 bp, optionally in the range 3.27 bp to 2.00 bp, 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp.
  • the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10bp to 18bp, 1 1 bp to 17bp, 12bp to 16bp or 13bp to 15bp.
  • the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10bp to 19bp, 1 1 bp to 18bp, 12bp to 17bp, 13bp to 16bp or 12 bp to 15 bp.
  • the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of: providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide, wherein the method further comprises selecting a polynucleotide from a library of synonymous variants wherein the codon usage of the selected polynucleotide most closely matches the most abundant tRNAs in a particular host cell. It will be appreciated that this final step may be undertaken.
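As a rough illustration of the first step of this method, the sketch below builds a small library in which each member differs from the starting coding sequence at exactly one codon position while encoding the same protein. It assumes the standard genetic code and a DNA alphabet; the function names are hypothetical, and the source does not state how its libraries are actually generated.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: standard code applies).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a DNA coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def single_codon_variants(cds):
    """Return all sequences differing from `cds` by one synonymous codon."""
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of three")
    variants = []
    for i in range(0, len(cds), 3):
        original = cds[i:i + 3]
        for codon, aa in CODON_TABLE.items():
            # Keep only codons encoding the same amino acid as the original.
            if aa == CODON_TABLE[original] and codon != original:
                variants.append(cds[:i] + codon + cds[i + 3:])
    return variants


library = single_codon_variants("ATGGCAAAA")
print(len(library), library[:2])
```

Each library member could then be folded in silico and screened against the structural criteria before synthesis.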
  • polynucleotides encoding heterologous proteins of interest may be isolated nucleic acid molecules and may be a DNA molecule, a cDNA molecule, an RNA molecule or synthetically produced DNA or RNA or a chimeric nucleic acid molecule.
  • Where the polynucleotide is an RNA, it will be understood that normally uracil (U) is to be used in place of thymine (T).
  • polynucleotide refers to a deoxyribonucleotide or ribonucleotide polymer in single- or double-stranded form, or sense or anti-sense, and encompasses analogues of naturally occurring nucleotides that hybridize to nucleic acids in a manner similar to naturally occurring nucleotides.
  • polynucleotides may be derived from any organism, including the host organism, or may be synthesised de novo.
  • a polynucleotide coding sequence may be provided for the protein of interest (POI) having the wild-type (WT) sequence or alternatively having a 'pre-optimised' sequence; that is to say the sequence incorporates at one or more positions for which synonymous codons are available a codon which is associated with the most abundant tRNA for that particular amino acid.
  • WT wild-type
  • codons corresponding to the most abundant tRNA for particular amino acids are used at each position for which synonymous codons are available.
  • the starting polynucleotide sequence is the WT sequence encoding the POI.
  • the POI may be a native protein of a host cell in which expression of the native protein has been silenced, for example, the polynucleotide sequence encoding that protein has been disrupted, deleted or mutated. In these circumstances, the POI will be considered as a heterologous protein in the context of the mutated host cell.
  • a polynucleotide having a coding sequence may comprise synthesis of a polynucleotide comprising the coding sequence. This may be for example by modification of a pre-existing sequence, e.g. by site-directed mutagenesis or possibly by de novo synthesis.
  • polynucleotide sequences encoding the protein of interest may be prepared by any suitable method known to those of ordinary skill in the art, including but not limited to, for example, direct chemical synthesis or cloning.
  • the starting polynucleotide is a WT sequence or a pre-optimised sequence where the codons match the most abundant tRNAs for a particular host cell
  • the starting polynucleotide sequence may be reviewed and modified by incorporating the relevant replacement codons in silico.
  • the modified polynucleotide may subsequently be synthesised, for example by direct chemical synthesis, for introduction into a desired host cell.
  • the starting polynucleotide sequence may be provided and subsequently modified ex vivo or alternatively in vivo for example by site directed mutagenesis or gene editing techniques.
  • all of the polynucleotide sequence is modified according to the relevant table; that is to say 100% of the length of the coding sequence of the polynucleotide encoding the protein of interest (POI).
  • POI protein of interest
  • each occurrence of a particular 'non-optimal' codon in the starting polynucleotide sequence for which a synonymous codon exists will be replaced with the corresponding replacement codon indicated in the relevant table.
  • this involves modifying every occurrence of that codon within the polynucleotide sequence.
  • each codon will be modified using the synonymous replacement codon appearing first in the table.
  • appropriate replacement codons may be applied to substantially all of the nucleotides in a polynucleotide sequence.
  • At least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.5% or 100% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table.
  • more than 90% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table.
  • More than 95% of the polynucleotide sequence is modified.
  • 100% of the polynucleotide sequence is modified, that is, each occurrence of a particular codon is replaced with the corresponding replacement codon indicated in the relevant table.
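The in silico replacement step described above can be sketched as follows. The replacement table used here is a hypothetical single-entry placeholder, not one of the patent's tables, and counting the modified fraction per codon is an assumption; the text does not say how the percentage is measured.

```python
def apply_replacement_table(cds, table):
    """Replace every occurrence of each listed codon with its replacement.

    `table` maps a 'non-optimal' codon to its synonymous replacement codon.
    Returns the modified sequence and the fraction of codons that changed.
    """
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    replaced = [table.get(codon, codon) for codon in codons]
    changed = sum(old != new for old, new in zip(codons, replaced))
    return "".join(replaced), changed / len(codons)


# Hypothetical table for illustration only: GCA and GCC both encode alanine.
example_table = {"GCA": "GCC"}
print(apply_replacement_table("ATGGCAGCATAA", example_table))
```

Because the table is applied to every codon, this corresponds to the embodiment in which each occurrence of a listed codon is replaced.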
  • the sequence will preferably be provided in an expression construct, e.g. an expression vector.
  • the polynucleotide may be provided in an expression vector.
  • Suitable expression vectors will vary according to the recipient host cell and suitably may incorporate regulatory elements which allow expression in the host cell of interest and preferably which facilitate high-levels of expression. Such regulatory sequences may be capable of influencing transcription or translation of a gene or gene product, for example in terms of initiation, accuracy, rate, stability, downstream processing and mobility.
  • Such elements may include, for example, strong and/or constitutive promoters, 5' and 3' UTRs, transcriptional and/or translational enhancers, transcription factor or protein binding sequences, start sites and termination sequences, ribosome binding sites, recombination sites, polyadenylation sequences, sense or antisense sequences, sequences ensuring correct initiation of transcription and optionally poly-A signals ensuring termination of transcription and transcript stabilisation in the host cell.
  • the regulatory sequences may be plant-, animal-, bacteria-, fungal- or virus derived, and preferably may be derived from the same organism as the host cell.
  • appropriate regulatory elements may vary according to the host cell of interest. For example, regulatory elements which facilitate high-level expression in prokaryotic host cells such as in E. coli may include the pLac, T7, P(Bla), P(Cat), P(Kat), trp or tac promoters.
  • Regulatory elements which facilitate high-level expression in eukaryotic host cells might include the AOX1 or GAL1 promoter in yeast or the CMV- or SV40-promoters, CMV-enhancer, SV40-enhancer, Herpes simplex virus VP16 transcriptional activator or inclusion of a globin intron in animal cells.
  • constitutive high-level expression may be obtained using, for example, the Zea mays ubiquitin 1 promoter or 35S and 19S promoters of cauliflower mosaic virus.
  • Suitable regulatory elements may be constitutive, whereby they direct expression under most environmental conditions or developmental stages, developmental stage specific or inducible.
  • the promoter is inducible, to direct expression in response to environmental, chemical or developmental cues, such as temperature, light, chemicals, drought, and other stimuli.
  • promoters may be chosen which permit expression of the protein of interest at particular developmental stages or in response to extra- or intra-cellular conditions, signals or externally applied stimuli.
  • a range of promoters exist for use in E. coli which give high-level expression at particular stages of growth (e.g. osmY stationary phase promoter) or in response to particular stimuli (e.g. HtpG Heat Shock Promoter).
  • Suitable expression vectors may comprise additional sequences encoding selectable markers which allow for the selection of said vector in a suitable host cell and/or under particular conditions. Suitable expression vectors may also comprise additional sequences which enable visualisation or quantification of the expressed protein (e.g. 3' GFP or Luciferase fusion tags) in the host cell of interest. Preferred expression vectors are those which also enable the expressed protein to be easily separated from other cellular proteins for downstream applications.
  • the expression vector may incorporate a fusion tag domain, which when fused to the coding sequence of the protein of interest allows the expressed protein to be bound to a matrix, column or beads (e.g. glutathione-S-transferase (GST)).
  • GST glutathione-S-transferase
  • the expression vector comprising the heterologous polynucleotide sequence may optionally comprise polynucleotide sequences coding for one or more transit peptides, capable of localising the expressed protein to a particular cellular compartment in the host cell.
  • such domains may cause secretion of expressed protein, for example into the extracellular medium to enable the protein to be easily recovered from the cell culture medium.
  • suitable transit peptides may cause the protein to localise to, for example, the cell wall, nucleus or chloroplasts.
  • the methods of the present invention will be useful in the production of a large number of different proteins in the agricultural, chemical, industrial and pharmaceutical fields, particularly for example antibodies, vaccines, hormones and other protein therapeutics.
  • levels of heterologous protein are increased relative to the respective native (i.e. unoptimised) protein by modification of the codon usage of the polynucleotide sequence which encodes the protein of interest.
  • the levels of heterologous protein may increase in the range 5% to 500% relative to native (unoptimised) protein; optionally in the range 10% to 250%, 20% to 200%, 25% to 100%, 30% to 75% or 35% to 65%.
  • proteins of interest may preferably be recovered from the cell culture medium as secreted proteins, although they may also be recovered from host cell lysates.
  • the utility of the present invention resides in the universal applicability of the optimal replacement codons to any polynucleotide having a coding sequence and having one or more of the codons listed in the relevant table for expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells or animal cells.
  • Methods of the invention can be applied to any type of host cell which is genetically accessible and which can be cultured. In other words, the approach may be applied to those cells which are able to serve as a host for production of the protein of interest (POI). It may therefore be applied to commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells commonly employed for recombinant heterologous protein expression.
  • host cells will be selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell.
  • the host cell may be an Escherichia coli cell.
  • the host cell may be a Saccharomyces cerevisiae cell.
  • the host cell may be a Caenorhabditis elegans cell.
  • the host cell may be a Mus musculus cell.
  • the host cell may be a bacterial cell or alternatively the host cell may be an archaeal cell.
  • Host cells may be gram-negative bacterial cells.
  • Host cells may be gram-positive bacterial cells.
  • host cells may include but are not limited to; an Aliivibrio fischeri cell, a Bacillus subtilis cell, a Caulobacter crescentus cell, an Escherichia coli cell, a Mycoplasma genitalium cell, a Synechocystis cell, a Pseudomonas fluorescens cell.
  • the host cell is a bacterial cell.
  • the host cell is an Escherichia coli (E. coli) cell.
  • E. coli Escherichia coli
  • the host cell is a prokaryotic cell
  • the highest functional protein expression will be achieved by modification of each codon in the polynucleotide sequence for which a synonymous codon exists according to the relevant tables above.
  • preference may be given to the first replacement codon appearing in the relevant table.
  • preference may be given to the second replacement codon appearing in the relevant table.
  • host cells may include but are not limited to; a Chlamydomonas reinhardtii cell, a Dictyostelium discoideum cell, a Tetrahymena thermophila cell, an Emiliania huxleyi cell or a Thalassiosira pseudonana cell.
  • the host cell is a Chlamydomonas cell.
  • the host cell is a Chlamydomonas reinhardtii cell.
  • the host cell may include but is not limited to fungal cells and yeast cells.
  • the host cell may be a Saccharomyces cerevisiae cell, an Ashbya gossypii cell, an Aspergillus fumigatus cell, an Aspergillus nidulans cell, a Candida albicans cell, a Coprinus cinereus cell, a Cunninghamella elegans cell, a Cryptococcus neoformans cell, a Fusarium oxysporum cell, a Magnaporthe oryzae cell, a Neurospora crassa cell, a Schizophyllum commune cell, a Schizosaccharomyces pombe cell, an Ustilago maydis cell or a Zymoseptoria tritici cell.
  • the host cell is a Saccharomyces cerevisiae cell or a Schizosaccharo
  • the host cell is a plant cell
  • any cell type of any plant species including both monocots and dicots, may be used as a host system for expression of a heterologous protein.
  • Preferred plant cells for use in the present invention are genetically tractable, and are commonly derived from either crop species, species which typically exhibit high growth rates, are easily harvested or species which have established genetic resources associated with them.
  • the host cell is an Arabidopsis cell, preferably an Arabidopsis thaliana cell.
  • the host cell may be a Nicotiana cell, preferably a Nicotiana tabacum cell.
  • said plant may suitably be selected from the following: maize (Zea mays), canola (Brassica napus, Brassica rapa ssp.), sugar beet (Beta vulgaris), oat (Avena sp.), barley (Hordeum vulgare), flax (Linum usitatissimum), alfalfa (Medicago sativa), rice (Oryza sativa), rye (Secale cereale), sorghum (Sorghum bicolor, Sorghum vulgare), switchgrass (Panicum virgatum), prairie cordgrass (Spartina sp.), purple false brome (Brachypodium distachyon), sunflower (Helianthus annuus), wheat (Triticum aestivum), soybean (Glycine max), potato (Solanum tuberosum), cotton (Gossypium hirsutum), sweet potato (Ipomoea batatas), cass
  • Expression constructs comprising the modified polynucleotide sequence may be located in plasmids (expression vectors) which are used to transform the host cell.
  • transformation may include heat shock, electroporation, particle bombardment, chemical induction, microinjection and viral transformation.
  • the expression levels of the protein of interest in host cells of interest may be determined.
  • the method chosen allows for quantitative assessment of the level of functional expression.
  • functional expression may be directly determined, e.g. as with GFP, luciferase or by enzymatic action of the protein of interest (POI) to generate a detectable optical signal, such as fluorescence or luminescence or a colour change caused by the protein.
  • the POI will be detectable by a high-throughput screening method, for example, relying on the detection of an optical signal.
  • using an optical signal which is directly proportionate to the quantity of the expression product from the polynucleotide is a convenient method of measuring expression and is amenable to high throughput processing.
  • Suitable tags may include but are not limited to a fluorescence reporter molecule translationally fused to the C-terminal end of the POI, e.g. Green Fluorescent Protein (GFP), Yellow Fluorescent Protein (YFP), Red Fluorescent Protein (RFP) or Cyan Fluorescent Protein (CFP).
  • the expression vector may incorporate a polynucleotide reporter encoding a luminescent protein, such as a luciferase (e.g. firefly luciferase).
  • the reporter gene may encode a chromogenic enzyme which can be used to generate an optical signal, such as beta-galactosidase (LacZ) or beta-glucuronidase (Gus).
  • Tags used for detection of expression may also be antigen peptide tags.
  • a tag may be provided for affinity purification, e.g. a polyhistidine tag.
  • any tag employed for detection of expression will be cleavable from the POI. It is envisaged that other types of label may also be used to mark the protein including, for example, organic dye molecules or radiolabels.
  • the measurement of expression comprises the detection of an optical signal, for example a fluorescent signal, a luminescent signal or colour signal.
  • the optical signal is provided by a GFP reporter fused to the protein of interest.
  • the replacement codon selected from synonymous codons listed as alternatives in the relevant table(s) for a given host is the codon associated with the highest or optimal observed functional expression of the POI, or where more than one codon provides substantially equal such expression, one such codon corresponding with that level of expression. Where there is more than one replacement codon indicated for a given non-optimal codon based on the expression data, this corresponds to the first replacement codon appearing in the relevant table. Therefore where there is choice of codons indicated for a selected position based on the expression data, preference may be given to the first replacement codon appearing in the relevant table. Alternatively, preference may be given to the second replacement codon appearing in the relevant table.
  • the codon in the starting sequence may be retained, i.e. the wild type codon in embodiments where the starting sequence is the wild-type sequence. This will minimise the number of codon changes to convert the starting sequence in a polynucleotide to the selected synonymous coding sequence for improved functional protein expression.
  • Figure 1 shows the influence of codon optimisation on protein yield, mRNA stability and translatability.
  • Panel A is a graphical representation of the nucleotide content of the third codon position in the constructs for Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP) expression. GFP was also expressed without SP.
  • Panel B is a graphical representation of protein yield in transformed Arabidopsis thaliana seedlings. For each plant analysed the protein yield in ng per mg total soluble protein (TSP) is plotted against the relative mRNA transcript concentration as compared to the A.
  • Figure 2 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked nucleotide use.
  • Figure 3 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon use.
  • Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and codon use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles respectively.
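The Spearman correlation used for these heat maps is the Pearson correlation of the ranks of the two variables. A minimal pure-Python sketch is given below; it assigns average ranks to ties, which is the usual convention but is an assumption here, and it is not necessarily the software the authors used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)  # undefined (division by zero) if either input is constant


# Toy values only: expression rank of four genes versus use of one codon.
print(spearman([1, 2, 3, 4], [0.10, 0.15, 0.30, 0.20]))
```

One such coefficient per codon and per species would fill one cell of the heat map.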
  • Figure 4 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked amino acid use.
  • Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged.
  • correlations (Spearman) between expression and amino acid use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
  • Figure 5 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon bias.
  • Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized and averaged.
  • genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1% steps, the extremes (5% highest versus 5% lowest) were reached.
  • the synonymous codon use frequencies in both high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- versus the low-expressed gene pool.
  • the difference in codon use frequency was correlated to the expression defining percentage (Spearman). The relation between the species based on this correlation is visualized in this heat map.
  • Figure 6 shows a graphical representation of mRNA structural features plotted against ranked expression with moving average (black line).
  • the mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined.
  • Figure 7 shows a heat map where the mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined and correlated with expression (Spearman) (Table 2).
  • the heat map demonstrates that highly-expressed genes across all kingdoms prefer a stable, but 'airy' mRNA structure. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
  • Figure 8 is a heat map showing correlations (Spearman) between mRNA structure characteristics and protein:mRNA ratios per species (Table 3), demonstrating that highly translated transcripts across kingdoms share a similar 'airy' structure.
  • the mRNA structures of all genes of Escherichia coli (Eubacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with protein:mRNA ratios. Rank-normalized mRNA levels were divided by protein abundance (retrieved from PaxDB). Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
  • Figure 9 shows mRNA structure predictions of the constructs used for heterologous protein expression. Sequences of the native and optimised variants of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional signal peptide (SP) and GFP without SP flanked by the 5' and 3'-UTRs as expected from our expression cassette were used to predict the mRNA secondary structure.
  • Figure 10 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked nucleotide use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and nucleotide content (overall and for each codon position) for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
  • Correlation (Spearman)
  • Saccharomyces cerevisiae (Fungi)
  • Caenorhabditis elegans Animalia
  • Arabidopsis thaliana Plantae
  • Mus musculus Animalia
  • Figure 12 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked amino acid use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and amino acid use for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
  • Figure 13 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences.
  • Figure 14 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
  • Figure 15 shows a sequence alignment of native (nat) and optimized (opt) mIL-10 sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
  • Figure 16 shows a sequence alignment of native (nat) and optimized (opt) OVA sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
  • Example 1 - Codon optimisation improves mRNA stability and translatability
  • the genes of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) were chosen because of their variation in codon use (Figure 1a). To eliminate differences caused by translation initiation all genes were preceded by the signal peptide of Arabidopsis thaliana chitinase. GFP was also expressed without this signal peptide, as it is normally not secreted.
  • Protein:mRNA ratios were calculated. Because translatability may be lower with a higher mRNA concentration due to the limited number of free ribosomes, the protein:mRNA ratios were calculated for samples within the same mRNA concentration range, as indicated. The fold change when comparing the optimised to the native variant was calculated for the relative mRNA concentration, protein yield and protein:mRNA ratio. For each average the number of included seedlings is indicated (n). Significance of fold changes was calculated with Welch's t-test: * P < 0.05, ** P < 0.01, *** P < 0.001.
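The fold change and the Welch statistic mentioned above can be sketched as follows. The sketch computes only the t statistic and the Welch-Satterthwaite degrees of freedom; converting these to a P value requires the t distribution (for example from a statistics package) and is left out. The function names and the toy numbers are illustrative assumptions, not data from the source.

```python
from statistics import mean, variance


def fold_change(optimised, native):
    """Ratio of the mean of the optimised samples to the mean of the native samples."""
    return mean(optimised) / mean(native)


def welch_t(a, b):
    """Welch's unequal-variance t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy protein:mRNA ratios, invented purely to exercise the functions.
native = [1.0, 2.0, 3.0]
optimised = [2.0, 4.0, 6.0]
print(fold_change(optimised, native), welch_t(optimised, native))
```

Unlike Student's t-test, this form does not assume equal variances or equal numbers of seedlings in the two groups.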
  • thermodynamic stability of the predicted secondary mRNA structures was calculated.
  • the minimum free folding energy had decreased, indicative of a more stable mRNA, from -0.25 to -0.35 and -0.31 to -0.33 kcal/mol/nt for GFP and OVA, respectively.
  • for IL-10, the minimum free folding energy increased from -0.31 to -0.28 kcal/mol/nt, indicating a less stable mRNA.
  • an overall increase in physical stability could not explain the increased mRNA transcript levels of IL-10.
  • dsRNA stretches could be processed to small interfering RNAs and, like binding of microRNAs, can trigger gene silencing.
  • gene silencing can also be due to gene methylation, but this always results in the complete absence of transcripts and therefore transformants without detectable expression were not considered.
  • co-expression of the silencing inhibitor p19 gave comparable results.
  • Ribosomes can shield nuclease target sites, however, in large-scale in vivo studies mRNA half-life could not be linked to the number of nuclease target sites or ribosomal density.
  • if translation initiation is equal, as is expected in our experiments, an increase in translatability should result in a lower density of ribosomes.
  • there would have been fewer ribosomes on the optimised variants compared to their native counterparts, and the optimised variants would be less protected against nucleases.
  • although translation per se may not influence mRNA half-life, errors in translation have been proven to lead to mRNA degradation by mRNA surveillance mechanisms.
  • mRNA surveillance mechanisms include: I) nonsense-mediated decay by the recognition of a premature stop codon, II) non-stop decay by the lack of a stop codon and III) no-go decay by stalled ribosomes.
  • Occurrence of a premature stop codon or the lack of a stop codon can be caused by a mutation or a ribosomal slip causing a frame-shift.
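To make the effect of a frame-shift concrete, the sketch below reports in-frame premature stop codons and whether the last complete codon is a stop, for a chosen reading frame. It assumes a DNA alphabet and the three standard stop codons; the function name and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_codon_report(cds, frame=0):
    """List premature in-frame stops and check for a terminal stop codon.

    `frame` is the offset (0, 1 or 2) at which reading starts; trailing
    nucleotides that do not fill a complete codon are ignored.
    """
    codons = [cds[i:i + 3] for i in range(frame, len(cds) - 2, 3)]
    premature = [index for index, codon in enumerate(codons[:-1])
                 if codon in STOP_CODONS]
    return {
        "premature_stops": premature,
        "has_terminal_stop": bool(codons) and codons[-1] in STOP_CODONS,
    }


sequence = "ATGTAAGGGTGA"  # toy sequence, not from the source
print(stop_codon_report(sequence))           # read in the intended frame
print(stop_codon_report(sequence, frame=1))  # read after a +1 slip
```

In this toy case the intended frame contains a premature stop (the situation recognised by nonsense-mediated decay), whereas the shifted frame has no stop codon at all (the situation recognised by non-stop decay).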
  • Frame-shifts can be caused by a 'slippery' sequence that may be found in proximity of a strong mRNA structure.
  • a ribosome may also stall at a strong stem-loop structure without slipping and trigger degradation.
  • the native and optimised variants differ in the presence of 'slippery' sequences and/or strong mRNA structures.
  • differences in level of translation-linked mRNA decay may explain the difference in mRNA transcript levels in our experiment.
  • ribosomes have intrinsic helicase activity and recently it was shown that strong mRNA structures such as pseudoknots and hairpins can stall translation only temporarily. It is therefore thought that the mRNA structure provides a mechanical basis for cellular regulation of translation rate.
  • increased mRNA translatability of the optimised genes may be explained by an increased translation rate caused by differences in the mRNA structure.
  • Example 2 - General codon bias extends to other kingdoms of life
  • The existence of codon biases in different species has implications for the efficient expression of heterologous proteins in a range of host cells.
  • To determine whether the general codon bias in plants transcends kingdoms of life, expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were interrogated.
  • Per species >250 microarrays originating from several studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues were used (Table 1A-F).
  • the relative synonymous codon use was calculated. Subsequently, a comparison was made between high- and low-expressed genes, as a correlation between codon use and expression may only be found in genes expressed above a certain threshold. Genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1% steps, the pools with 5% highest and 5% lowest expressed genes were reached. With each step the codon use frequencies in both high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- versus the low-expressed gene pool. Finally, the difference in codon use frequency was correlated (Spearman) to the expression defining percentage.
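The pooling procedure described above can be sketched as follows for a single codon. The data layout (a list of dictionaries with hypothetical keys `expression` and `codons`), the function names and the definition of frequency as the share of a codon within its synonymous family are assumptions for illustration. The final correlation of the differences with the percentage is not shown.

```python
def pool_frequency(genes, codon, family):
    """Share of `codon` among all codons of its synonymous `family` in a gene pool."""
    target = sum(gene["codons"].get(codon, 0) for gene in genes)
    total = sum(gene["codons"].get(c, 0) for gene in genes for c in family)
    return target / total if total else 0.0


def codon_use_differences(genes, codon, family, start=50, stop=5):
    """Difference in codon use (high pool minus low pool) for each percentage step."""
    ranked = sorted(genes, key=lambda gene: gene["expression"], reverse=True)
    results = []
    for pct in range(start, stop - 1, -1):  # 50%, 49%, ..., 5%
        n = max(1, len(ranked) * pct // 100)
        high, low = ranked[:n], ranked[-n:]
        diff = pool_frequency(high, codon, family) - pool_frequency(low, codon, family)
        results.append((pct, diff))
    return results


# Toy data: the ten most highly expressed genes use GCC, the others use GCA.
genes = [{"expression": i, "codons": {"GCC": 3} if i >= 10 else {"GCA": 3}}
         for i in range(20)]
steps = codon_use_differences(genes, "GCC", ["GCT", "GCC", "GCA", "GCG"])
print(steps[0], steps[-1])
```

A codon whose difference grows as the pools narrow towards the 5% extremes would show the kind of expression-linked bias the example describes.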
  • M. musculus seems to have an overall lower codon bias and in ~50% of the cases selects for other codons compared to the overall selection of the other species.
  • 13 codons are positively correlated with expression for all species. These 13 codons encode 11 different amino acids and a termination of translation (twice a codon for Thr/T). Comparable to the general codon bias found in plants, 8 of these 13 codons are C-ending. Furthermore, 18 codons are consistently negatively correlated with expression in these four species.
  • codons most are A-ending (8), while none of them are C-ending. Strikingly, 5 universal codons were found which were positively correlated with expression for all species, indicating that these codons are conserved in the coding sequences of highly-expressed genes across all kingdoms of life and could therefore find useful application in methods of optimising functional protein expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. In addition several codons were found which were positively correlated with further increases in expression in E. coli, S. cerevisiae and C. elegans. Furthermore in addition to the universal set of codons, several codons were found to be positively correlated with increases in expression in E. coli, S. cerevisiae, C. elegans and Mus musculus. Separately, several codons were found to be positively correlated with increased expression in A. thaliana.
  • Example 3 Highly expressed genes prefer a stable, but 'airy' mRNA structure
  • the relationship between expression and mRNA structure characteristics was evaluated.
  • the mRNA structures of all genes were predicted, and gene length, minimal free folding energy, number of bound nucleotides, mean stem and loop size (stretches of bound and unbound nucleotides, respectively) and the number of stem/loop transitions were determined and plotted against expression (Figure 6; Table 7).
  • a heat map displaying the relation between the species based on the correlation (Spearman) between these structure characteristics and expression was generated (Figure 7; Table 7). This heat map demonstrates that the number of bound nucleotides and the number of stem/loop transitions were consistently positively correlated and mean loop size consistently negatively correlated with expression across all species.
  • Table 7 mRNA characteristics of highly expressed genes per species.
  • Table 8 Calculated mRNA structure characteristics of the constructs used for heterologous protein expression. Analysis of the mRNA secondary structure predictions given in Figure 9. Folding energy, bound nucleotides and number of transitions are corrected for gene length. Stem and loop sizes are mean values.
  • the number of stem-loop transitions is positively correlated with protein:mRNA ratio and mean loop size is negatively correlated across all species.
  • the folding energy is negatively correlated (more stable mRNA) for S. cerevisiae, C. elegans and A. thaliana, but not for E. coli and M. musculus.
  • gene length is consistently negatively correlated with protein:mRNA ratio. This is in line with the fact that the packing density of ribosomes was shown to decrease with mRNA transcript length.
  • a negative correlation with mean stem size is found for all species and the fraction of bound nucleotides is not correlated, except for S. cerevisiae.
  • small stem size must be important for an increased translation rate. This again highlights the tradeoff between mRNA stability and translatability.
  • GFP green-fluorescent protein
  • OVA Gallus gallus ovalbumin
  • IL-10 Mus musculus interleukin-10
  • Optimisation was performed by recoding the protein sequences using the C-ending codons for all amino acids (TCC in the case of Ser), except Arg and Gly, for which the T-ending codons were used, and Gln, Glu and Lys, for which the G-ending codons were used.
  • CTC C-ending codons for all amino acids
  • Arg and Gly for which the T-ending codons were used
  • Gln, Glu and Lys
  • Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10 g/l peptone 140, 5 g/l yeast extract, 10 g/l NaCl, pH 7.0) containing 50 µg/ml kanamycin. Bacterial cultures were centrifuged for 15 min at 2800 g and resuspended in MMA (20 g/l sucrose, 5 g/l MS-salts, 1.95 g/l MES, pH 5.6) containing 200 µM acetosyringone and 0.03% Silwet L-77 until an OD of 0.5 was reached.
  • Arabidopsis thaliana plants were submerged in the bacterial suspension for 1 min and kept in a moist environment for 2 days. Plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) until seeds could be collected. Seeds were sterilized by 4-hour exposure to chlorine gas and plated on basic agar plates (8 g/l Bacto Agar, 0.101 g/l KNO3) containing 30 ng/ml hygromycin and 100 µg/ml cefotaxim. Plates were kept in the dark at 4°C for 2 days, then placed in artificial light for 7 hours at 24°C, again kept in the dark at RT for 5 days and finally placed in a climate chamber with 12 hour light regime at 24°C for 2 days.
  • Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10 g/l peptone 140, 5 g/l yeast extract, 10 g/l NaCl, pH 7.0) containing 50 µg/ml kanamycin and 20 µg/ml rifampicin.
  • OD was measured again after 16 hours and the bacterial cultures were centrifuged for 15 min at 2800 g.
  • the bacteria were resuspended in MMA infiltration medium (20 g/l sucrose, 5 g/l MS-salts, 1.95 g/l MES, pH 5.6) containing 200 µM acetosyringone until an OD of 1 was reached. All constructs were co-expressed with the tomato bushy stunt virus silencing inhibitor p19 by mixing Agrobacterium cultures 1:1. After 1-2 hours incubation at room temperature, the two youngest fully expanded leaves of 5-6 weeks old Nicotiana benthamiana plants were infiltrated completely.
  • Infiltration was performed by injecting the Agrobacterium suspension into a Nicotiana benthamiana leaf at the abaxial side using a 1 ml syringe. Infiltrated plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) and infiltrated leaves were harvested at selected time points.
  • the oligonucleotides used for amplification of both native and optimised IL-10, OVA and GFP and TIP-41 were 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 2] / 5'-GGAAGTGGGTGCAGTT-3' [SEQ ID NO: 3]; 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 4] / 5'-GGGCAGTAGAAGATGTTC-3' [SEQ ID NO: 5]; 5'-GACGGTAACTACAAGACC-3' [SEQ ID NO: 6] / 5'-TTGTCGGCCATGATGTA-3' [SEQ ID NO: 7]; and 5'-GCTCATCGGTACGCTCTTTT-3' [SEQ ID NO: 8] / 5'-TCCATCAGTCAGAGGCTTCC-3' [SEQ ID NO: 9], respectively.
  • Relative transcript levels of the genes versus TIP-41 were determined by the Pfaffl method (Pfaffl,
  • Crude extract was clarified by centrifugation at 16,000 x g for 5 min at 4°C and the supernatant was directly used in an ELISA and BCA protein assay.
  • Mouse IL-10 expression levels were determined using the Mouse IL-10 ELISA Ready-SET-Go!
  • a rabbit anti-ovalbumin or a chicken anti-GFP, both from Rockland Immunochemicals Inc., was used to coat ELISA plates o/n at 4°C in a moist environment. After this and each following step the plate was washed 5 times with 30 sec intervals in PBST (1x PBS, 0.05% Tween-20) using an automatic plate washer (BioRad model 1575). The plate was blocked with assay diluent (eBioscience) for 1 h at room temperature. Samples and standard lines were loaded in serial dilutions and incubated for 1 h at room temperature.
  • Standard lines were made from purified chicken ovalbumin (Sigma) or recombinant GFP (Roche).
  • a rabbit anti-ovalbumin:HRP antibody or a rabbit anti-GFP:HRP antibody both from Rockland Immunochemicals Inc.
  • a 3,3',5,5'-Tetramethylbenzidine (TMB) substrate (eBioscience) was added and the colouring reaction was stopped using stop solution (0.18M sulphuric acid) after 1-15 min.
  • Read outs were performed using the model 680 microplate reader (BioRad) to measure the OD at 450 nm with correction filter of 690 nm.
  • TSP total soluble protein
  • BSA bovine serum albumin
  • Gene expression datasets of 5 species were downloaded from Gene Expression Omnibus (GEO).
  • GEO Gene Expression Omnibus
  • Gene-expression sets were selected based on platform (Affymetrix), release date (not earlier than 2008), publication linked to the GEO set and number of samples in the study. In total 2067 gene-expression profiles were collected, representing 8 or 9 different studies per organism. An overview can be found in Table 1A-F.
  • Example 11 Protein abundance datasets. Protein abundance datasets were retrieved from PaxDb (Wang et al., 2012, Mol Cell Proteomics, 11: 492-500), where the integrated datasets of Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus were downloaded.
  • Gene expression was normalized based on rank. Per species one array platform was used and per species probes were ranked according to their intensities. The average rank per probe was used as a measure of overall gene expression to distinguish genes with overall low and high expression levels for each species.
  • the coding sequences (CDS) of all genes of 5 species were downloaded from sequence/genome repositories.
  • CDS coding sequences
  • For Arabidopsis thaliana the CDS of the 20101108 release were obtained from TAIR (Lamesch et al., 2012, Nucleic Acids Research 40: D1202-1210).
  • the open reading frames (without UTR, introns, etc.) of the 20110203 release were obtained from the Saccharomyces genome database (Cherry et al., 2012, Nucleic Acids Research 40: D700-705).
  • the CDS of WS241 were obtained from WormBase (Yook et al., 2012, Nucleic Acids Research 40: D735-741).
  • the CDS of the 20130508 release (GRCm38.p1) were obtained from the NCBI CCDS database (Farrell et al., 2014, Nucleic Acids Research 42: D865-872).
  • the mRNAs of all species were folded using Vienna RNA fold (Lorenz et al., 2011, Algorithms for Molecular Biology 6: 26) at 20 °C, using the parameters of Andronescu et al. (Andronescu et al., 2007, Bioinformatics 23: i19-28).
  • the M. musculus mRNA was also folded at 37 °C and the S. cerevisiae mRNA also at 30 °C, but all the reported comparisons are based on 20 °C.
  • Example 12 Gene expression and mRNA folding statistics
  • the correlations between gene expression and the various mRNA-based statistics were calculated by Spearman correlation (in R 3.0.2 x64). For some of the factors a correction was applied for gene length; these were: number of bound nucleotides, number of unbound nucleotides, energy of the structure, number of stems, number of loops, triplet usage, nucleotide usage, and amino acid usage.
  • a novel aspect of our finding is that the selection of mRNA structures with the most even distribution of stems and loops leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Below is an example procedure used to select the most optimal mRNA structure for improved functional expression in a host cell of interest.
  • the first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (> 100,000 mRNA variants).
  • the second step is in silico folding of all mRNA species in the pool under the temperature and salt concentrations relevant for the preferred host.
  • the third step is the selection of mRNAs from the pool that meet the following criteria:
  • average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)
  • average stem size is below 5.20 bp (or between 5.20 and 2.5 bp)
  • average loop size is below 3.32 bp (or between 3.32 and 3 bp)
  • the standard deviation of the loop size is below 3.20 (or between 3.20 and 2 bp) (measure for even distribution)
  • the standard deviation of the stem size is below 3.40 (or between 3.40 and 2 bp) (measure for even distribution)
  • maximum loop size is below 18 bp (discard uneven stem loop distributions)
  • maximum stem size is below 19 bp (discard uneven stem loop distributions)
  • C. elegans
  • average stem size is below 5.35 bp (or between 5.35 and 2.5 bp)
  • the standard deviation of the stem size is below 3.27 (or between 3.27 and 2 bp)
  • maximum stem size is below 18 bp
  • E. coli
  • average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)
  • average stem size is below 5.45 bp (or between 5.45 and 2.5 bp)
  • the standard deviation of the stem size is below 3.50 (or between 3.50 and 2 bp)
  • maximum stem size is below 18 bp
  • M. musculus
  • average number of stem-loop transitions is above 120 per 1,000 bp (or between 120 and 250 per 1,000 bp)
  • average stem size is below 4.35 bp (or between 4.35 and 2.5 bp)
  • average loop size is below 5.18 bp (or between 5.18 and 4 bp)
  • the standard deviation of the stem size is below 3.28 (or between 3.28 and 2 bp)
  • average number of stem-loop transitions is above 110 per 1,000 bp (or between 110 and 250 per 1,000 bp)
  • average stem size is below 5.27 bp (or between 5.27 and 2.5 bp)
  • the standard deviation of the loop size is below 3.65 (or between 3.65 and 2 bp)
  • the standard deviation of the stem size is below 3.25 (or between 3.25 and 2 bp)
  • step 3 where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred.
  • Table 1 C Description of the gathered S. cerevisiae expression data.
  • Table 6A Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Escherichia coli. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated. AA Triplet All Top 5% Bottom 5% Top/Bottom
  • Table 6C Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Caenorhabditis elegans. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low- expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
  • Table 6D Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Arabidopsis thaliana. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low- expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
  • Table 9 Analysis of the mRNA secondary structure characteristics (stem architecture) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
  • Table 11 Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
  • Table 14 Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
  • Table 15 Differences in the mRNA secondary structure characteristics (stem architecture) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
  • Table 17 Differences in the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
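The codon-use comparison described in Example 2 (relative synonymous codon use in high- versus low-expressed gene pools, narrowed from 50% to 5% in 1% steps and then correlated by Spearman) can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation (the text states that R was used). The function names are hypothetical, codon use is pooled over genes and expressed as the fraction of each codon within its synonymous family, and the x-axis of the correlation is taken as 100 minus the pool percentage so that a positive coefficient indicates a codon that becomes more enriched as the pools become more extreme; all three choices are assumptions.

```python
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_fractions(cds_list):
    """Fraction of each codon within its synonymous family, pooled over genes."""
    counts = Counter()
    for cds in cds_list:
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3].upper()
            if codon in CODON_TABLE:
                counts[codon] += 1
    family = Counter()
    for codon, n in counts.items():
        family[CODON_TABLE[codon]] += n
    return {c: counts[c] / family[aa] for c, aa in CODON_TABLE.items() if family[aa]}


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return float("nan") if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def expression_trend(genes, codon, percents=range(50, 4, -1)):
    """genes: list of (cds, expression). Correlates pool stringency with the
    top-minus-bottom difference in the codon's synonymous fraction."""
    ordered = sorted(genes, key=lambda g: g[1])
    stringency, diffs = [], []
    for pct in percents:
        n = max(1, len(ordered) * pct // 100)
        low = codon_fractions([g[0] for g in ordered[:n]])
        high = codon_fractions([g[0] for g in ordered[-n:]])
        if codon in low and codon in high:
            stringency.append(100 - pct)
            diffs.append(high[codon] - low[codon])
    return spearman(stringency, diffs)
```

With this sign convention, a codon whose coefficient is close to +1 would correspond to what the text calls a codon positively correlated with expression.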

Abstract

The present invention relates to an approach aimed at the modification of codons in individual polynucleotide sequences encoding a heterologous protein of interest, without altering the amino acid sequence of the polypeptide to enhance the amount of functional expression in a host organism of interest. In its broadest aspect, this approach exploits redundancy in the genetic code by providing a universal set of codons which may be used at certain positions in the polynucleotide sequence in order to achieve improved heterologous protein production in a range of host cells. The present invention also relates to specific codons which may be used to increase protein expression in particular hosts. The present invention also relates to the optimization of the translation efficiency of messenger RNAs on the basis of their secondary structure characteristics, and the provided set of criteria may be used to increase protein expression in particular hosts.

Description

OPTIMISATION OF CODING SEQUENCE FOR FUNCTIONAL PROTEIN EXPRESSION
FIELD OF THE INVENTION
The present invention relates to an approach aimed at the modification of codons in individual polynucleotide sequences encoding a heterologous protein of interest, without altering the amino acid sequence of the polypeptide, to enhance the amount of functional expression in a host organism of interest. Recognising that maximum translation efficiency and therefore protein production is influenced by codon usage of a coding sequence, in its broadest aspect, this approach exploits redundancy in the genetic code by providing a universal set of codons which may be used at certain positions in the polynucleotide sequence in order to achieve improved heterologous protein production in a range of host cells. The present invention also relates to the optimization of the translation efficiency of messenger RNAs on the basis of their secondary structure characteristics, and the provided set of criteria may be used to increase protein expression in particular hosts.
BACKGROUND TO THE INVENTION
Most amino acids are encoded by multiple synonymous codons and the frequency wherein synonymous codons are used is not equal within a given species. Also, within species a bias in codon use in highly expressed genes can be observed, linking codon use to gene expression. The codons used most frequently in highly expressed genes (optimal codons) have been shown to correspond to genomic G+C content and often match the most abundant tRNAs in many species. It is assumed that codons that match more abundant tRNAs would be translated faster as tRNA availability for translation occurs via diffusion and the chance of encountering a more abundant tRNA is greater than when encountering a rarer tRNA. An increase in translation rate allows ribosomes to finish translation and reinitiate translation sooner. Also, the probability that a ribosome initially loads a non-matching tRNA is smaller when a codon matches a more abundant tRNA resulting in an energetic advantage as three-quarters of the energy to incorporate an amino acid is lost if a non-matching tRNA has to be rejected after proofreading. Thus, the use of optimal codons in highly-expressed genes was hypothesized to provide a fitness gain by improved translational efficiency.
In recognition of the idea that increased translation efficiency may enhance protein yield, codon optimisation of genes for heterologous expression by recruiting optimal codons of the production host has been a common strategy. However, such strategies have met with varying success. For example, a study of the heterologous expression of 154 variants of GFP differing only in synonymous codon use in E. coli demonstrated that the use of optimal codons was positively correlated with bacterial growth, but not protein yield (Kudla et al., 2009, Science, 324: 255-258).
However, many of the studies focusing on codon optimisation have not addressed a potentially confounding variable, translational initiation. In the aforementioned study, about half of the variation in GFP protein levels was explained by folding energy of the first third of the mRNA suggesting that whilst the use of optimal codons may have increased the rate of translation, protein yield remained unchanged because the initiation of translation was rate-limiting. Ribosomal density studies indicate that ribosomes are most abundant at the 5' portion of mRNAs and the overall packing density of nearly all mRNAs is below maximum, suggesting that this may be a general feature.
Wang and Roossinck (Wang and Roossinck, 2006, Plant Mol Biol, 61:699-710) determined which codons were most highly-associated with transcripts which accumulate to high levels, by comparing overall codon use to the codon use in highly-transcribed genes in 11 plant species. In doing so the authors demonstrated that codon usage bias is correlated positively with gene transcript levels. As such the authors identified 18 codons which are associated with highly-expressed transcripts across 11 plant species. Interestingly, the authors found that use of their "optimal" codons appears to be well conserved between eudicots and monocots, but less well conserved between the higher plants and Chlamydomonas reinhardtii. However, the authors did not express polynucleotides incorporating such "optimal" codons in host cells and consequently, the effect on heterologous protein expression of altering the codon complement of their encoding polynucleotides in this way remains to be determined.
Alternatives to plant hosts are frequently required for protein production for a variety of reasons. Wang and Roossinck (Wang and Roossinck, 2006, Plant Mol Biol, 61:699-710) assessed the codons which are associated with the most abundant transcripts across 12 plant species. However, this result provides no information on codons which are relevant for optimising heterologous protein expression in other, non-plant host organisms.
SUMMARY OF THE INVENTION
The codon use of a gene of interest is often adapted to reflect the expression host's codon use in highly expressed genes in order to enhance heterologous protein production. However, the results obtained with this strategy are variable. A comparison between the overall codon use and the codon use in highly expressed genes of several plant species revealed that optimal codons are not always the codons of which the use is increased most with expression. Although the codon composition of highly expressed genes differs between monocots and dicots, the same codons often rise in frequency with increasing expression levels (expression codons) and are in many cases C-ending. These conserved expression codons were used to optimise the codon composition of three genes, which enhanced protein yield significantly upon stable and transient expression in plants.
With the above in mind an alternative method of codon optimisation has been devised that led to a significant increase in both mRNA stability and mRNA translatability (i.e higher mRNA levels and more proteins per mRNA molecule). Unexpectedly, experimental data shown here indicates that this expression-linked codon bias found in plants also extends to other kingdoms of life. On the basis of these experimental data, the present invention provides a series of synonymous codons which are believed to have wide application for the expression of heterologous genes in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells and which have been surprisingly found to correspond with increased functional protein expression therein. Instead of the lengthy and complicated process of trial and error which characterises existing methods of codon optimisation centered on increasing gene expression in specific cellular or environmental contexts, the present invention provides a quick, practical, universal method of increasing functional heterologous protein expression with wide application for the expression of heterologous genes in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Advantageously, this method removes any need for consideration of the host cell or specific cellular context involved. In addition to a series of universally applicable replacement codons for use in commonly used host cells, the present invention also provides specific sets of codon replacements which further improve functional protein expression in particular hosts, specifically prokaryotes, fungi, animals, nematodes, protists and plants.
Accordingly, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000007_0001
the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
As noted above, Wang and Roossinck (2006) did not actually perform any expression studies to determine the effect of codon optimisation on functional protein expression. In a further aspect the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
Figure imgf000007_0002
AGA or AGG
Asparagine       AAT                              AAC
Aspartic acid    GAT                              GAC
Cysteine         TGT                              TGC
Glutamic acid    GAA                              GAG
Glutamine        CAA                              CAG
Glycine          GGC, GGA or GGG                  GGT
Histidine        CAT                              CAC
Isoleucine       ATT or ATA                       ATC
Leucine          CTT, CTA, CTG, TTA or TTG        CTC
Lysine           AAA                              AAG
Phenylalanine    TTT                              TTC
Proline          CCT, CCA or CCG                  CCC
Serine           TCT, TCA, TCG, AGT or AGC        TCC
Threonine        ACT, ACA or ACG                  ACC
Tyrosine         TAT                              TAC
Valine           GTT, GTA or GTG                  GTC
Stop codons      TAG or TGA                       TAA
inserting the polynucleotide sequence into an expression vector;
introducing said expression vector into a host cell; and
culturing the host cell to produce the heterologous protein; optionally wherein the corresponding codons are changed according to the following table;
Figure imgf000008_0001
and/or:
Amino Acid DNA Codon Replacement Codon
Figure imgf000009_0001
; and/or:
Figure imgf000009_0002
; and/or:
Figure imgf000009_0003
; and/or:
; and/or:
Figure imgf000009_0004
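The plant-cell recoding step can be illustrated with a short Python sketch that translates a coding sequence and rewrites every codon to the replacement codon listed in the plant table above. It is a hedged illustration only: the function and variable names are hypothetical, and the targets for alanine (GCC) and arginine (CGT) are inferred from the rule stated in the examples (C-ending codons in general, T-ending codons for Arg and Gly, G-ending codons for Gln, Glu and Lys), because those two rows are not legible in this text. Methionine and tryptophan have a single codon each and are left unchanged.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Replacement codon per amino acid (one-letter code; '*' is stop).
# Ala and Arg are inferred from the stated rule, not read from the table.
PLANT_PREFERRED = {
    "A": "GCC", "R": "CGT", "N": "AAC", "D": "GAC", "C": "TGC",
    "E": "GAG", "Q": "CAG", "G": "GGT", "H": "CAC", "I": "ATC",
    "L": "CTC", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "TCC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTC",
    "*": "TAA",
}


def translate(cds):
    """Translate a DNA coding sequence codon by codon ('*' marks a stop)."""
    cds = cds.upper()
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def recode(cds, preferred=PLANT_PREFERRED):
    """Rewrite every codon to the preferred synonymous codon.

    The encoded amino acid sequence is unchanged by construction, because
    each replacement codon encodes the same amino acid as the original.
    """
    return "".join(preferred[aa] for aa in translate(cds))
```

For example, recode("ATGAAATTTGGATAG") returns "ATGAAGTTCGGTTAA": Lys AAA becomes AAG, Phe TTT becomes TTC, Gly GGA becomes GGT and the stop codon TAG becomes TAA, while the translated protein stays the same. A different host would need its own mapping passed as the preferred argument.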
On the basis of these expression studies using such codon optimisation according to the invention, it was surprisingly discovered that a number of mRNA structural characteristics were found to be positively correlated with expression levels across kingdoms. In particular, the selection of mRNA structures with the most even distribution of stems and loops is positively correlated with higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Consequently, in an alternative embodiment, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
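The selection step of this embodiment can be sketched in Python as follows. The sketch assumes that each mRNA variant has already been folded by an external tool and that the predicted structure is available as a dot-bracket string (brackets for bound nucleotides, dots for unbound ones); the folding itself is not shown. Following the definition given in the examples, a stem is taken as an uninterrupted stretch of bound nucleotides, a loop as an uninterrupted stretch of unbound nucleotides, and a stem-loop transition as any boundary between the two. The function names and this exact way of counting are assumptions made for illustration.

```python
from itertools import groupby
from statistics import mean


def structure_stats(dot_bracket):
    """Stem/loop statistics from a dot-bracket secondary structure string."""
    if not dot_bracket:
        raise ValueError("empty structure")
    # Collapse the string into alternating runs of bound / unbound positions.
    runs = [
        (bound, sum(1 for _ in group))
        for bound, group in groupby(dot_bracket, key=lambda ch: ch != ".")
    ]
    stems = [n for bound, n in runs if bound]
    loops = [n for bound, n in runs if not bound]
    transitions = len(runs) - 1  # boundaries between adjacent runs
    return {
        "length": len(dot_bracket),
        "transitions_per_kb": 1000 * transitions / len(dot_bracket),
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "max_stem": max(stems, default=0),
        "max_loop": max(loops, default=0),
    }


def passes_general_criterion(stats, low=110, high=250):
    """At least `low` and fewer than `high` stem-loop transitions per kb."""
    return low <= stats["transitions_per_kb"] < high


def select_variants(structures):
    """Keep the (name, structure) pairs whose structure meets the criterion."""
    return [
        name
        for name, structure in structures
        if passes_general_criterion(structure_stats(structure))
    ]
```

Host-specific thresholds on mean and maximum stem or loop size, such as those listed in the example procedure, could be applied to the same statistics dictionary in an additional filter.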
DETAILED DESCRIPTION
Accordingly, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000011_0001
the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
In another aspect of the invention, further improvements in heterologous protein expression may be achieved by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table, particularly where the host cell is a prokaryotic cell, a fungal cell or a nematode cell:
Figure imgf000011_0002
In aspects of the invention where the host cell is a prokaryotic cell, for example, an E. coli cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
; and/or:
Figure imgf000012_0001
; and/or:
Figure imgf000012_0002
; and/or:
; and/or:
Figure imgf000012_0003
; and/or:
Amino Acid       DNA Codon                        Replacement Codon
Leucine          TTA, TTG, CTT, CTC or CTA        CTG
and/or:
Amino Acid       DNA Codon                        Replacement Codon
Glycine          GGA or GGG                       GGT or GGC
and/or:
Figure imgf000013_0001
and/or:
Figure imgf000013_0002
and/or:
and/or:
Figure imgf000013_0003
and/or:
Figure imgf000013_0004
In aspects of the invention where the host cell is a fungal cell, for example an S. cerevisiae cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
; and/or:
Figure imgf000014_0001
; and/or:
Figure imgf000014_0002
; and/or:
; and/or:
Figure imgf000014_0003
; and/or:
Amino Acid       DNA Codon                        Replacement Codon
Proline          CCT, CCC or CCG                  CCA
and/or:
Figure imgf000015_0002
and/or:
Figure imgf000015_0003
and/or:
Figure imgf000015_0004
and/or:
Figure imgf000015_0005
and/or:
Figure imgf000015_0006
and/or:
Figure imgf000015_0001
Glutamic acid GAG GAA
In aspects of the invention where the host cell is a nematode cell, for example, a C. elegans cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
and/or:
Figure imgf000016_0001
and/or:
Figure imgf000016_0002
: and/or:
and/or:
Figure imgf000016_0003
and/or:
Amino Acid       DNA Codon                        Replacement Codon
Valine           GTA or GTG                       GTC or GTT
and/or:
Amino Acid       DNA Codon                        Replacement Codon
Glutamic acid    GAA                              GAG
and/or:
Figure imgf000017_0001
and/or:
Figure imgf000017_0002
and/or:
and/or:
and/or:
Figure imgf000017_0003
Glutamine CAA CAG
In aspects of the invention where the host cell is a Mus musculus cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
Amino Acid DNA Codon Replacement Codon
Serine TCT, TCA, AGT, TCG or AGC TCC
and/or:
Amino Acid DNA Codon Replacement Codon
Arginine AGA or AGG CGG, CGT, CGC or CGA
and/or:
Amino Acid DNA Codon Replacement Codon
Alanine GCC or GCA GCG or GCT
; and/or:
Figure imgf000018_0001
; and/or:
Figure imgf000018_0002
and/or: and/or:
Figure imgf000019_0001
and/or:
Figure imgf000019_0002
and/or:
Figure imgf000019_0003
and/or:
Figure imgf000019_0004
and/or:
Figure imgf000019_0005
; and/or:
Figure imgf000020_0001
In another aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000020_0002
the host cell being selected from a prokaryotic cell, a fungal cell, a plant cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
In another aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000021_0001
wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
In aspects of the invention where the host cell is a plant cell, preferably an Arabidopsis thaliana cell, heterologous protein expression is further improved by supplementing the codon changes detailed in the table above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
Figure imgf000021_0002
; and/or:
; and/or:
Figure imgf000022_0001
; and/or:
Figure imgf000022_0002
; and/or:
Figure imgf000022_0003
; and/or:
; and/or:
Figure imgf000022_0004
; and/or:
Amino Acid DNA Codon Replacement Codon
Glutamic acid GAA GAG
and/or:
Amino Acid DNA Codon Replacement Codon
Phenylalanine TTT TTC
and/or:
Figure imgf000023_0001
and/or:
and/or:
Figure imgf000023_0002
and/or:
Figure imgf000023_0003
In another aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000024_0001
wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
In aspects of the invention where the host cell is a plant cell, preferably an Arabidopsis thaliana cell, heterologous protein expression is further improved by supplementing the codon changes detailed in the table above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
Figure imgf000024_0002
; and/or:
Figure imgf000024_0003
and/or:
and/or:
Figure imgf000025_0001
and/or:
Figure imgf000025_0002
and/or:
Figure imgf000025_0003
and/or:
and/or:
Figure imgf000025_0004
and/or:
Amino Acid DNA Codon Replacement Codon
Valine GTC, GTA or GTG GTT
and/or:
Amino Acid DNA Codon Replacement Codon
Isoleucine ATA ATC or ATT
and/or:
Figure imgf000026_0001
and/or:
Figure imgf000026_0002
and/or:
and/or:
Figure imgf000026_0003
In another aspect the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
Figure imgf000027_0001
inserting the polynucleotide sequence into an expression vector;
introducing said expression vector into a host cell; and
culturing the host cell to produce the heterologous protein; optionally wherein the corresponding codons are changed according to the following table;
; and/or:
Figure imgf000028_0001
; and/or:
Figure imgf000028_0002
; and/or:
; and/or:
Figure imgf000028_0003
; and/or:
Figure imgf000028_0004
Preferably in this aspect of the invention the host cell is an Arabidopsis thaliana cell.
In addition to establishing precise codon changes which result in improved functional protein expression, a novel aspect of the present invention was uncovered by studying the correlation between expression level and mRNA characteristics including gene length, minimal folding energy, number of bound nucleotides, mean stem and loop sizes (stretches of bound and unbound nucleotides, respectively) and number of stem-loop transitions which revealed a general trend across kingdoms. Messenger RNAs are folded structures and translation of a given mRNA into a polypeptide requires unfolding. The necessary helicase activity is typically provided by the ribosome itself. This unfolding requires energy and in essence, a linear mRNA (i.e. an RNA polymer without secondary structure) would be optimal for the maximization of protein production. However, a certain degree of folding makes mRNA less susceptible to degradation and increases its diffusibility.
The number of bound nucleotides and the number of stem-loop transitions were found to be positively correlated with expression levels, while loop size was negatively correlated with expression. Combining the gene expression data with available protein abundance data demonstrated that protein:mRNA ratio (proxy for translation efficiency) is positively correlated with the number of stem-loop transitions and negatively correlated with stem and loop size. This general pattern across kingdoms reveals a selection pressure created by gene expression on both mRNA stability and translatability. An increase in the number of nucleotide bonds favours stability, while a more even distribution of these bonds enhances translatability. Altogether, our data indicate that a successful codon optimisation strategy should focus on computational models that calculate the ideal mRNA structure whereby both stability and translatability are enhanced. Here we describe a procedure to select mRNAs with optimal folding characteristics out of a pool consisting of all possible mRNAs encoding a given protein. Remarkably, these are not the most compact mRNAs, nor the ones with the lowest unfolding energy. Here we describe a selection procedure based on a set of criteria for the optimisation of recombinant protein production in a given host that relates to the number and distribution of mRNA stem-loop transitions for any given mRNA. 
On the basis of these experimental data, in another aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the relevant table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the relevant table(s); the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence and wherein the method further comprises; analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence; and incorporating in said polynucleotide sequence a pattern of optimal and non-optimal codons at a site associated with provision of a structural motif; wherein said pattern enables increased expression efficiency of said protein in said host cell compared with the synonymous coding sequence containing solely optimal codons, wherein optimal codons are those codons pre-calculated to provide the highest functional expression of heterologous protein in the host cell or the sole possible codon. As such the method may comprise merely making the universal codon changes, and/or making modifications according to the replacement codon tables which are specific for particular host cells. 
In preferred embodiments of the invention, analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to; examining and taking account of the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size. In preferred embodiments uneven stem loop distributions will be discarded and the polynucleotide sequence codon composition will be altered (i.e. non-optimally) based on the observation of mRNA secondary structure to improve translational efficiency and therefore functional protein expression.
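The structural measures listed above can be computed directly from a predicted structure. The following is a minimal sketch, assuming the structure is available in dot-bracket notation (the usual output of RNA folding programs) and assuming, in line with the definition given earlier, that a "stem" is an uninterrupted run of bound nucleotides and a "loop" an uninterrupted run of unbound nucleotides. The function name `structure_metrics` and the use of the population standard deviation are illustrative choices, not part of the described method.

```python
import re
from statistics import mean, pstdev


def structure_metrics(dot_bracket: str) -> dict:
    """Summarise a dot-bracket secondary structure string."""
    if not dot_bracket:
        raise ValueError("empty structure")
    # Mark each position as bound ('b') or unbound ('u').
    state = "".join("b" if ch in "()" else "u" for ch in dot_bracket)
    # Assumption: stems/loops are runs of bound/unbound nucleotides.
    stems = [len(m.group()) for m in re.finditer(r"b+", state)]
    loops = [len(m.group()) for m in re.finditer(r"u+", state)]
    # A stem-loop transition is any switch between bound and unbound.
    transitions = sum(1 for a, b in zip(state, state[1:]) if a != b)

    def summary(runs):
        # Returns (mean, standard deviation, maximum) of the run lengths.
        if not runs:
            return 0.0, 0.0, 0
        return mean(runs), pstdev(runs), max(runs)

    mean_stem, sd_stem, max_stem = summary(stems)
    mean_loop, sd_loop, max_loop = summary(loops)
    return {
        "transitions": transitions,
        "transitions_per_kb": 1000.0 * transitions / len(dot_bracket),
        "mean_stem": mean_stem,
        "sd_stem": sd_stem,
        "max_stem": max_stem,
        "mean_loop": mean_loop,
        "sd_loop": sd_loop,
        "max_loop": max_loop,
    }
```

For example, `structure_metrics("((((....))))....")` reports three transitions over 16 nucleotides, which normalises to 187.5 transitions per kilobase.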
A novel aspect of the invention is the selection of mRNA structures with the most even distribution of stems and loops that leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Consequently, in a further aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
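The selection step of this method can be sketched as a simple filter over a library whose structures have already been predicted. This is an illustrative sketch only: the mapping of variant names to dot-bracket strings, the function names and the counting of a transition as any switch between a paired and an unpaired position are assumptions made for the example.

```python
def transitions_per_kb(dot_bracket: str) -> float:
    """Stem-loop transitions normalised to 1000 nucleotides."""
    if not dot_bracket:
        raise ValueError("empty structure")
    paired = [ch in "()" for ch in dot_bracket]
    # Count every switch between a paired and an unpaired position.
    changes = sum(1 for a, b in zip(paired, paired[1:]) if a != b)
    return 1000.0 * changes / len(dot_bracket)


def select_by_transitions(structures: dict, low: float = 110.0,
                          high: float = 250.0) -> list:
    """Keep variants with at least `low` and fewer than `high` transitions/kb."""
    return [name for name, structure in structures.items()
            if low <= transitions_per_kb(structure) < high]


# Hypothetical library: variant name -> predicted dot-bracket structure.
library = {
    "variant_1": "((((....))))....",
    "variant_2": "(" * 10 + "." * 20 + ")" * 10,
    "variant_3": "(.)." * 4,
}
selected = select_by_transitions(library)
```

Only `variant_1` falls inside the 110 to 250 window here; `variant_2` has too few transitions and `variant_3` too many.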
Normally, the first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (>100,000 mRNA variants). Typically, all mRNA species in the pool are then folded in silico. The term "in silico" is widely used in the art and will be understood by the average skilled person as meaning performed on a computer or via computer simulation. The RNA structure is predicted in silico using standard techniques and usually under the temperature and salt concentrations relevant for the preferred host. Appropriate software packages or applications incorporating suitable algorithms may be selected for performing the folded mRNA structure prediction. Suitable packages include, but are not limited to; an RNA structure prediction program such as Vienna RNAfold 2.0 (Lorenz et al., 2011, ViennaRNA Package 2.0, Algorithms for Molecular Biology, 6:26). Preferably, the mRNA structure prediction will be carried out using such a prediction program using the standard settings and the folding parameters, for example, those established by Andronescu et al. (Andronescu et al., 2007, Bioinformatics, 23 (13), i19-i28) and preferably, adjusting the folding-temperature to that of the intracellular temperature of the host of interest. More preferably, the temperature and salt concentration parameters will be adjusted to match those of the preferred host. Finally the mRNAs from the library of synonymous variants that have the most even distribution of stems and loops are selected. The mRNAs having the most even distribution of stems and loops may be identified by the structural characteristics outlined below. In particular the standard deviation is used as a measure for an even distribution of the sizes of the stems and loops which is preferred. Typically, the more similar the stem sizes of an mRNA the higher the translation efficiency.
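The pool generation and the selection of the most evenly structured variant can be sketched as follows. This is a minimal illustration under stated assumptions: the codon table is a deliberately partial excerpt of the standard genetic code, `fold` is a hypothetical callable supplied by the user that returns a dot-bracket string (for instance a wrapper around an RNA folding program), and combining the stem and loop standard deviations into a single sum is an illustrative scoring choice rather than the described procedure.

```python
from itertools import product
from statistics import pstdev

# Partial excerpt of the standard genetic code (illustration only).
SYNONYMS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "F": ["TTT", "TTC"],
    "E": ["GAA", "GAG"],
    "Q": ["CAA", "CAG"],
    "K": ["AAA", "AAG"],
}


def synonymous_variants(protein: str):
    """Yield every coding sequence that encodes `protein`."""
    choices = [SYNONYMS[aa] for aa in protein]
    for combo in product(*choices):
        yield "".join(combo)


def run_lengths(structure: str, symbols: str) -> list:
    """Lengths of uninterrupted runs of the given symbols."""
    runs, current = [], 0
    for ch in structure:
        if ch in symbols:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def evenness(structure: str) -> float:
    """Lower is more even. Assumption: sum of stem and loop size SDs."""
    stems = run_lengths(structure, "()")
    loops = run_lengths(structure, ".")
    return (pstdev(stems) if stems else 0.0) + (pstdev(loops) if loops else 0.0)


def most_even(protein: str, fold) -> str:
    """Pick the variant whose predicted structure is most even.

    `fold` is a hypothetical callable: sequence -> dot-bracket string.
    """
    return min(synonymous_variants(protein), key=lambda s: evenness(fold(s)))
```

Exhaustive enumeration grows multiplicatively with protein length, so for real proteins the pool would in practice need to be sampled or restricted; the sketch enumerates fully only because the example peptide is short.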
Additionally, the more similar the loop sizes of an mRNA the higher the translation efficiency. Where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Parameters which may be influential include, for example, the folding energy of the 5' terminus and the selection of codons that are frequently used and match the most abundant tRNAs. Codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred. Methods for determining the folding energy of mRNA may be based on, but are not limited to those described by Tuller et al. (Tuller et al., 2009, PNAS 107:3645-3650) and Kudla et al. (Kudla et al. 2009, Science, 324:255-258). For example, the mRNA molecule from -23 to +39 should have an average folding energy of at least -6 kcal/mol for E. coli and of at least -4 kcal/mol for S. cerevisiae as determined by the use of sliding windows of 40 nt with 1 nt steps. Codon choice of the first 13 nt providing a low energy will depend on the 5' UTR provided by the expression cassette (Kudla et al. 2009, Science, 324: 255-258; Tuller et al., 2009, PNAS 107: 3645-3650). Alternatively, instead of adapting the first 13 nt, the 5' UTR may be adapted to provide a low folding energy. For example, the 5' UTR used in the present examples is very U-rich (GTTTTTATTTTTAATTTTCTTTCAAATACTTCCACC [SEQ ID NO: 1]), which in most cases provided a relatively high (close to 0) folding energy when using primarily C-ending codons. When using the chitinase SP, this was always the case.
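The sliding-window criterion for the 5' region can be sketched as below. The energy calculation itself is left to a hypothetical callable `energy_fn` (sequence to folding energy in kcal/mol), which would in practice wrap a folding program; the function names and the reading of "at least" as "greater than or equal to" are assumptions of this example.

```python
def windows(seq: str, size: int = 40, step: int = 1) -> list:
    """All windows of `size` nucleotides taken every `step` nucleotides."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def mean_window_energy(seq: str, energy_fn, size: int = 40, step: int = 1) -> float:
    """Average folding energy over sliding windows.

    `energy_fn` is a hypothetical callable: sequence -> energy (kcal/mol).
    """
    ws = windows(seq, size, step)
    if not ws:
        raise ValueError("sequence shorter than window size")
    return sum(energy_fn(w) for w in ws) / len(ws)


def passes_threshold(seq: str, energy_fn, threshold: float) -> bool:
    """True if the average window energy is at least `threshold`,
    e.g. -6.0 for E. coli or -4.0 for S. cerevisiae."""
    return mean_window_energy(seq, energy_fn) >= threshold
```

A region running from -23 to +39 spans 62 nucleotides (there is no position 0), which gives 23 windows of 40 nt at 1 nt steps.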
In preferred embodiments of the invention, analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to; examining and taking account of; the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size. In preferred embodiments, the polynucleotide sequence codon composition will be altered (i.e. non-optimally) to avoid uneven stem loop distributions to improve translational efficiency and therefore functional protein expression. Such alterations may include incorporating one or more codons listed as second preference or third preference replacement codons in place of the first preference codon where the secondary structure criteria are not fulfilled by inclusion of the first preference codon. Alternatively, for a given position, such alterations may include retention of the wild-type (WT) or native codon where inclusion of an optimal codon negatively impacts the secondary structure with respect to the particular criteria for each host cell. Preferably, the polynucleotide will have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp). More preferably, the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
Preferably, the polynucleotide will have a maximum stem size of less than 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 14 bp to 15 bp. More preferably, the polynucleotide will have a maximum loop size of less than 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp.
Additionally, in embodiments wherein the host cell is a prokaryotic cell, preferably a bacterial cell and more preferably an E. coli cell, the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.45 bp and 2.50 bp, optionally in the range 5.45 bp to 4.00 bp, 5.40 bp to 2.60 bp, 5.30 bp to 2.70 bp, 5.20 bp to 2.80 bp, 5.10 bp to 2.90 bp, 5.00 bp to 3.00 bp, 4.90 bp to 3.10 bp, 4.80 bp to 3.20 bp, 4.70 bp to 3.30 bp, 4.60 bp to 3.40 bp, 4.50 bp to 3.50 bp, 4.40 bp to 3.60 bp, 4.30 bp to 3.70 bp, 4.20 bp to 3.80 bp or 4.10 bp to 3.90 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.16 bp and 2.00 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 2.95 bp and 2.00 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp, 2.60 bp to 2.40 bp or 2.50 bp to 2.40 bp.
Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.50, preferably between 3.50 and 2.00 bp, optionally in the range 3.40 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp or 2.80 bp to 2.70 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 16 bp, optionally in the range 10 bp to 16 bp, 11 bp to 15 bp or 12 bp to 14 bp. In the most preferred embodiment where the host cell is a plant cell, the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.
Alternatively, in embodiments wherein the host cell is a eukaryotic cell, preferably a plant cell and more preferably an Arabidopsis thaliana cell, the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably the selected polynucleotide will have a mean stem size in the range 5.20 bp to 2.50 bp, optionally in the range 5.20 bp to 4.00 bp, 5.20 bp to 2.60 bp, 5.10 bp to 2.70 bp, 5.00 bp to 2.80 bp, 4.90 bp to 2.90 bp, 4.80 bp to 3.00 bp, 4.70 bp to 3.10 bp, 4.60 bp to 3.20 bp, 4.50 bp to 3.30 bp, 4.40 bp to 3.40 bp, 4.30 bp to 3.50 bp, 4.20 bp to 3.60 bp, 4.10 bp to 3.70 bp or 4.00 bp to 3.80 bp. Preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.32 bp and 3.00 bp, optionally in the range 3.30 bp to 3.00 bp, 3.25 bp to 3.05 bp, 3.20 bp to 3.10 bp or 3.15 bp to 3.10 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.20 bp and 2.00 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp. Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.40, preferably between 3.40 and 2.00 bp, optionally in the range 3.30 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp, 2.80 bp to 2.40 bp or 2.60 bp to 2.50 bp.
Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp or 13 bp to 15 bp. In the most preferred embodiment where the host cell is a plant cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
Alternatively, in embodiments wherein the host cell is a fungal cell, preferably a Saccharomyces cell, optionally a Saccharomyces cerevisiae cell, the selected polynucleotide will preferably have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp). Preferably, the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.27 bp and 2.50 bp, optionally in the range 5.27 bp to 4.00 bp, 5.20 bp to 2.40 bp, 5.10 bp to 2.50 bp, 5.00 bp to 2.60 bp, 4.90 bp to 2.70 bp, 4.80 bp to 2.80 bp, 4.70 bp to 2.90 bp, 4.60 bp to 3.00 bp, 4.50 bp to 3.10 bp, 4.40 bp to 3.20 bp, 4.30 bp to 3.30 bp, 4.20 bp to 3.40 bp, 4.10 bp to 3.50 bp, 4.00 bp to 3.60 bp or 3.90 bp to 3.70 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.77 bp and 3.00 bp, optionally in the range 3.75 bp to 3.00 bp, 3.70 bp to 3.10 bp, 3.60 bp to 3.20 bp or 3.50 bp to 3.30 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.65 bp and 2.00 bp, optionally in the range 3.60 bp to 2.10 bp, 3.50 bp to 2.20 bp, 3.40 bp to 2.30 bp, 3.30 bp to 2.40 bp, 3.30 bp to 2.50 bp, 3.20 bp to 2.60 bp, 3.10 bp to 2.70 bp or 3.00 bp to 2.80 bp.
Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.25, preferably between 3.25 and 2.00 bp, optionally in the range 3.20 bp to 2.10 bp, 3.10 bp to 2.20 bp, 3.00 bp to 2.30 bp, 2.90 bp to 2.40 bp, 2.80 bp to 2.50 bp or 2.70 bp to 2.60 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp. In the most preferred embodiment where the host cell is a fungal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
Alternatively, in embodiments wherein the host cell is an animal cell, preferably a nematode cell, optionally a Caenorhabditis elegans cell, the selected polynucleotide will preferably have at least 114 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 114 to 200/kbp, 115 to 249/kbp, 116 to 248/kbp, 117 to 247/kbp, 118 to 246/kbp, 119 to 245/kbp, 120 to 244/kbp, 121 to 243/kbp, 122 to 242/kbp, 123 to 241/kbp, 124 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.35 bp and 2.50 bp, optionally in the range 5.35 bp to 4.00 bp, 5.30 bp to 2.40 bp, 5.20 bp to 2.50 bp, 5.10 bp to 2.60 bp, 5.00 bp to 2.70 bp, 4.90 bp to 2.80 bp, 4.80 bp to 2.90 bp, 4.70 bp to 3.00 bp, 4.60 bp to 3.10 bp, 4.50 bp to 3.20 bp, 4.40 bp to 3.30 bp, 4.30 bp to 3.40 bp, 4.20 bp to 3.50 bp, 4.10 bp to 3.60 bp, 4.00 bp to 3.70 bp or 3.90 bp to 3.80 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.47 bp and 3.00 bp, optionally in the range 3.45 bp to 3.00 bp, 3.40 bp to 3.10 bp or 3.30 bp to 3.20 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.37 bp and 2.00 bp, optionally in the range 3.35 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp, or 2.80 bp to 2.70 bp. Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.27, preferably between 3.27 and 2.00 bp, optionally in the range 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp.
Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp. In the most preferred embodiment where the host cell is a fungal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.
Alternatively, in embodiments wherein the host cell is an animal cell, preferably a mammalian cell, optionally a Mus musculus cell, the selected polynucleotide will preferably have at least 120 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 120 to 200/kbp, 121 to 249/kbp, 122 to 248/kbp, 123 to 247/kbp, 124 to 246/kbp, 125 to 245/kbp, 130 to 240/kbp, 135 to 235/kbp, 140 to 230/kbp, 145 to 225/kbp, 150 to 220/kbp, 155 to 215/kbp, 160 to 210/kbp, 165 to 205/kbp, 170 to 200/kbp, 175 to 195/kbp or 180 to 190/kbp. More preferably, the selected polynucleotide will have a mean stem size between 4.35 bp and 2.50 bp, optionally in the range 4.35 bp to 4.00 bp, 4.30 bp to 2.40 bp, 4.20 bp to 2.50 bp, 4.10 bp to 2.60 bp, 4.00 bp to 2.70 bp, 3.90 bp to 2.80 bp, 3.80 bp to 2.90 bp, 3.70 bp to 3.00 bp, 3.60 bp to 3.10 bp, 3.50 bp to 3.20 bp or 3.40 bp to 3.30 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 5.18 bp and 4.00 bp, optionally in the range 5.15 bp to 4.00 bp, 5.10 bp to 4.10 bp, 5.00 bp to 4.20 bp, 4.90 bp to 4.30 bp, 4.80 bp to 4.40 bp or 4.70 bp to 4.50 bp. More preferably still, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.00 bp and 2.00 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp or 2.60 bp to 2.40 bp.
Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.28, preferably between 3.28 and 2.00 bp, optionally in the range 3.27 bp to 2.00 bp, 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp or 13 bp to 15 bp. In the most preferred embodiment where the host cell is an animal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
In a final aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide, wherein the method further comprises selecting a polynucleotide from a library of synonymous variants wherein the codon usage of the selected polynucleotide most closely matches the most abundant tRNAs in a particular host cell. It will be appreciated that this final step may be undertaken.
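The broadest host-specific structural ranges given in the preceding paragraphs can be gathered into a lookup table and checked programmatically. The sketch below is illustrative only: the dictionary layout, key names and the function `meets_criteria` are hypothetical, the "between" ranges are treated as inclusive, and the "below" limits as strict, which is one possible reading of the text.

```python
# Broadest ranges stated for each host (values taken from the text above).
HOST_CRITERIA = {
    "E. coli": {"transitions": (116, 250), "mean_stem": (2.50, 5.45),
                "mean_loop": (2.00, 3.16), "sd_loop": (2.00, 2.95),
                "sd_stem_below": 3.50, "max_loop_below": 16, "max_stem_below": 18},
    "A. thaliana": {"transitions": (116, 250), "mean_stem": (2.50, 5.20),
                    "mean_loop": (3.00, 3.32), "sd_loop": (2.00, 3.20),
                    "sd_stem_below": 3.40, "max_loop_below": 18, "max_stem_below": 19},
    "S. cerevisiae": {"transitions": (110, 250), "mean_stem": (2.50, 5.27),
                      "mean_loop": (3.00, 3.77), "sd_loop": (2.00, 3.65),
                      "sd_stem_below": 3.25, "max_loop_below": 20, "max_stem_below": 19},
    "C. elegans": {"transitions": (114, 250), "mean_stem": (2.50, 5.35),
                   "mean_loop": (3.00, 3.47), "sd_loop": (2.00, 3.37),
                   "sd_stem_below": 3.27, "max_loop_below": 20, "max_stem_below": 18},
    "M. musculus": {"transitions": (120, 250), "mean_stem": (2.50, 4.35),
                    "mean_loop": (4.00, 5.18), "sd_loop": (2.00, 3.00),
                    "sd_stem_below": 3.28, "max_loop_below": 18, "max_stem_below": 19},
}


def meets_criteria(metrics: dict, host: str) -> bool:
    """Check a dict of structure metrics against one host's ranges."""
    c = HOST_CRITERIA[host]
    low, high = c["transitions"]
    return (
        low <= metrics["transitions_per_kb"] < high  # at least low, fewer than high
        and c["mean_stem"][0] <= metrics["mean_stem"] <= c["mean_stem"][1]
        and c["mean_loop"][0] <= metrics["mean_loop"] <= c["mean_loop"][1]
        and c["sd_loop"][0] <= metrics["sd_loop"] <= c["sd_loop"][1]
        and metrics["sd_stem"] < c["sd_stem_below"]
        and metrics["max_loop"] < c["max_loop_below"]
        and metrics["max_stem"] < c["max_stem_below"]
    )


# Hypothetical metrics for one candidate sequence.
candidate = {"transitions_per_kb": 180, "mean_stem": 4.0, "mean_loop": 2.5,
             "sd_loop": 2.5, "sd_stem": 3.0, "max_loop": 12, "max_stem": 14}
```

With these example values the candidate satisfies the E. coli ranges but not the M. musculus ranges, because its mean loop size of 2.5 lies below the 4.00 to 5.18 window.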
Polynucleotides
In the context of methods of the invention, polynucleotides encoding heterologous proteins of interest (POI) may be isolated nucleic acid molecules and may be a DNA molecule, a cDNA molecule, an RNA molecule or synthetically produced DNA or RNA or a chimeric nucleic acid molecule. In embodiments where the polynucleotide is an RNA, it will be understood that normally uracil (U) is to be used in place of thymine (T). Throughout, the term "polynucleotide" as used herein refers to a deoxyribonucleotide or ribonucleotide polymer in single- or double-stranded form, or sense or anti-sense, and encompasses analogues of naturally occurring nucleotides that hybridize to nucleic acids in a manner similar to naturally occurring nucleotides. Such polynucleotides may be derived from any organism, including the host organism, or may be synthesised de novo.
Prior to modification in accordance with the methods of the invention, a polynucleotide coding sequence may be provided for the protein of interest (POI) having the wild-type (WT) sequence or alternatively having a 'pre-optimised' sequence; that is to say the sequence incorporates at one or more positions for which synonymous codons are available a codon which is associated with the most abundant tRNA for that particular amino acid. In certain embodiments, it may be that codons corresponding to the most abundant tRNA for particular amino acids are used at each position for which synonymous codons are available. Preferably, however, the starting polynucleotide sequence is the WT sequence encoding the POI. In the context of methods of the invention, it will be appreciated that the POI may be a native protein of a host cell in which expression of the native protein has been silenced, for example, the polynucleotide sequence encoding that protein has been disrupted, deleted or mutated. In these circumstances, the POI will be considered as a heterologous protein in the context of the mutated host cell.
The provision of a polynucleotide having a coding sequence may comprise synthesis of a polynucleotide comprising the coding sequence. This may be for example by modification of a pre-existing sequence, e.g. by site-directed mutagenesis or possibly by de novo synthesis.
Polynucleotide Sequence Modification
In all embodiments of the invention, polynucleotide sequences encoding the protein of interest may be prepared by any suitable method known to those of ordinary skill in the art, including but not limited to, for example, direct chemical synthesis or cloning. Whether the starting polynucleotide is a WT sequence or a pre-optimised sequence where the codons match the most abundant tRNAs for a particular host cell, the starting polynucleotide sequence may be reviewed and modified by incorporating the relevant replacement codons in silico. The modified polynucleotide may subsequently be synthesised, for example by direct chemical synthesis, for introduction into a desired host cell. Alternatively, the starting polynucleotide sequence may be provided and subsequently modified ex vivo or alternatively in vivo for example by site directed mutagenesis or gene editing techniques.
In some embodiments of the invention, all of the polynucleotide sequence is modified according to the relevant table; that is to say 100% of the length of the coding sequence of the polynucleotide encoding the protein of interest (POI). In such embodiments, each occurrence of a particular 'non-optimal' codon in the starting polynucleotide sequence for which a synonymous codon exists will be replaced with the corresponding replacement codon indicated in the relevant table. For a particular codon, this involves modifying every occurrence of that codon within the polynucleotide sequence. Preferably, where two or more codons are indicated as replacement codons, each codon will be modified using the synonymous replacement codon appearing first in the table.
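By way of illustration only, the full-length replacement described above may be sketched in Python as follows. The table `EXAMPLE_TABLE` and the helper names are hypothetical and are not taken from the tables of this document; the sketch assumes an in-frame coding sequence and measures the extent of modification as the share of codon positions changed, which is only one possible reading of the percentages used herein.

```python
from typing import Dict, List

# Hypothetical replacement table (NOT taken from the tables of this document):
# each non-optimal codon maps to its synonymous replacement codons, in the
# order in which they would appear in the relevant table.
EXAMPLE_TABLE: Dict[str, List[str]] = {
    "GCA": ["GCC", "GCT"],  # Ala
    "AAA": ["AAG"],         # Lys
    "GGA": ["GGT", "GGC"],  # Gly
}


def split_codons(cds: str) -> List[str]:
    """Split an in-frame coding sequence into codons."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be non-empty and a multiple of 3 long")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def recode(cds: str, table: Dict[str, List[str]]) -> str:
    """Replace every occurrence of each listed codon with the replacement
    codon appearing first in the table; unlisted codons are left unchanged."""
    return "".join(table[c][0] if c in table else c for c in split_codons(cds))


def fraction_of_codons_changed(original: str, modified: str) -> float:
    """Share of codon positions that differ between two sequences."""
    before, after = split_codons(original), split_codons(modified)
    if len(before) != len(after):
        raise ValueError("sequences must contain the same number of codons")
    return sum(a != b for a, b in zip(before, after)) / len(before)
```

For the toy input `ATGGCAAAAGGATAA`, `recode` returns `ATGGCCAAGGGTTAA`, in which three of the five codons have been replaced while the encoded amino acids are unchanged.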
Alternatively, in certain situations it may be desirable to limit application of the method to specific regions of the polynucleotide sequence or to omit certain regions from application of the method, for instance to avoid disruption of secondary structural motifs or regulatory elements in the polynucleotide sequence. According to preferred embodiments of the invention, appropriate replacement codons may be applied to substantially all of the nucleotides in a polynucleotide sequence. Preferably, at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.5% or 100% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table. In preferred embodiments, more than 90% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table. More preferably still, more than 95% of the polynucleotide sequence is modified. Ideally, 100% of the polynucleotide sequence is modified, that is, each occurrence of a particular codon is replaced with the corresponding replacement codon indicated in the relevant table.
Expression Vectors
After modification of the codon composition of the polynucleotide sequence encoding the protein of interest, subsequent expression of the polynucleotide sequence in the chosen host cell may be carried out. In order that expression can be carried out in the host cell of choice, the sequence will preferably be provided in an expression construct, e.g. an expression vector. In some embodiments, the polynucleotide may be provided in an expression vector. Suitable expression vectors will vary according to the recipient host cell and suitably may incorporate regulatory elements which allow expression in the host cell of interest and preferably which facilitate high-levels of expression. Such regulatory sequences may be capable of influencing transcription or translation of a gene or gene product, for example in terms of initiation, accuracy, rate, stability, downstream processing and mobility.
Such elements may include, for example, strong and/or constitutive promoters, 5' and 3' UTRs, transcriptional and/or translational enhancers, transcription factor or protein binding sequences, start sites and termination sequences, ribosome binding sites, recombination sites, polyadenylation sequences, sense or antisense sequences, sequences ensuring correct initiation of transcription and optionally poly-A signals ensuring termination of transcription and transcript stabilisation in the host cell. The regulatory sequences may be plant-, animal-, bacteria-, fungal- or virus-derived, and preferably may be derived from the same organism as the host cell. Clearly, appropriate regulatory elements may vary according to the host cell of interest. For example, regulatory elements which facilitate high-level expression in prokaryotic host cells such as in E. coli may include the pLac, T7, P(Bla), P(Cat), P(Kat), trp or tac promoters. Regulatory elements which facilitate high-level expression in eukaryotic host cells might include the AOX1 or GAL1 promoter in yeast or the CMV- or SV40-promoters, CMV-enhancer, SV40-enhancer, Herpes simplex virus VP16 transcriptional activator or inclusion of a globin intron in animal cells. In plants, constitutive high-level expression may be obtained using, for example, the Zea mays ubiquitin 1 promoter or 35S and 19S promoters of cauliflower mosaic virus.
Suitable regulatory elements may be constitutive, whereby they direct expression under most environmental conditions or developmental stages, developmental stage specific or inducible. Preferably, the promoter is inducible, to direct expression in response to environmental, chemical or developmental cues, such as temperature, light, chemicals, drought, and other stimuli. Suitably, promoters may be chosen which permit expression of the protein of interest at particular developmental stages or in response to extra- or intra-cellular conditions, signals or externally applied stimuli. For example, a range of promoters exist for use in E. coli which give high-level expression at particular stages of growth (e.g. osmY stationary phase promoter) or in response to particular stimuli (e.g. HtpG Heat Shock Promoter).
Suitable expression vectors may comprise additional sequences encoding selectable markers which allow for the selection of said vector in a suitable host cell and/or under particular conditions. Suitable expression vectors may also comprise additional sequences which enable visualisation or quantification of the expressed protein (e.g. 3' GFP or Luciferase fusion tags) in the host cell of interest. Preferred expression vectors are those which also enable the expressed protein to be easily separated from other cellular proteins for downstream applications. For example, the expression vector may incorporate a fusion tag domain, which when fused to the coding sequence of the protein of interest allows the expressed protein to be bound to a matrix, column or beads (e.g. glutathione-S-transferase (GST)).
Furthermore, the expression vector comprising the heterologous polynucleotide sequence may optionally comprise polynucleotide sequences coding for one or more transit peptides, capable of localising the expressed protein to a particular cellular compartment in the host cell. Advantageously, such domains may cause secretion of expressed protein, for example into the extracellular medium to enable the protein to be easily recovered from the cell culture medium. In plant hosts, suitable transit peptides may cause the protein to localise to, for example, the cell wall, nucleus or chloroplasts. The methods of the present invention will be useful in the production of a large number of different proteins in the agricultural, chemical, industrial and pharmaceutical fields, particularly for example antibodies, vaccines, hormones and other protein therapeutics. Advantageously, according to all aspects of the present invention, levels of heterologous protein are increased relative to the respective native (i.e. unoptimised) protein by modification of the codon usage of the polynucleotide sequence which encodes the protein of interest. Preferably, the levels of heterologous protein may increase in the range 5% to 500% relative to native (unoptimised) protein; optionally in the range 10% to 250%, 20% to 200%, 25% to 100%, 30% to 75% or 35% to 65%.
Once expressed, proteins of interest may preferably be recovered from the cell culture medium as secreted proteins, although they may also be recovered from host cell lysates.
Host cells
The utility of the present invention resides in the universal applicability of the optimal replacement codons to any polynucleotide having a coding sequence and having one or more of the codons listed in the relevant table for expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells or animal cells. Methods of the invention can be applied to any type of host cell which is genetically accessible and which can be cultured. In other words, the approach may be applied to those cells which are able to serve as a host for production of the protein of interest (POI). It may therefore be applied to commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells commonly employed for recombinant heterologous protein expression. Preferably, host cells will be selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell. Typically, the host cell may be an Escherichia coli cell. Typically, the host cell may be a Saccharomyces cerevisiae cell. Typically, the host cell may be a Caenorhabditis elegans cell. Typically, the host cell may be a Mus musculus cell.
In embodiments of the invention where the host cell is a prokaryotic cell, the host cell may be a bacterial cell or alternatively the host cell may be an archaeal cell. Host cells may be gram-negative bacterial cells. Host cells may be gram-positive bacterial cells. Typically, host cells may include but are not limited to: an Aliivibrio fischeri cell, a Bacillus subtilis cell, a Caulobacter crescentus cell, an Escherichia coli cell, a Mycoplasma genitalium cell, a Synechocystis cell, a Pseudomonas fluorescens cell. In preferred embodiments the host cell is a bacterial cell. Preferably the host cell is an Escherichia coli (E. coli) cell. In particularly preferred embodiments where the host cell is a prokaryotic cell, it is envisaged that the highest functional protein expression will be achieved by modification of each codon in the polynucleotide sequence for which a synonymous codon exists according to the relevant tables above. Preferably, where there is choice of codons indicated for a selected position based on the expression data, preference may be given to the first replacement codon appearing in the relevant table. Alternatively, preference may be given to the second replacement codon appearing in the relevant table. Alternatively, in situations where the second or third preference codon is already present in the starting sequence, it may be decided to retain the codon in the starting sequence, i.e. the wild type codon in embodiments where the starting sequence is the wild-type sequence. This will minimise the number of codon changes to convert the starting sequence in a polynucleotide to the selected synonymous coding sequence for improved functional protein expression.
In embodiments of the invention where the host cell is a protist cell, host cells may include but are not limited to: a Chlamydomonas reinhardtii cell, a Dictyostelium discoideum cell, a Tetrahymena thermophila cell, an Emiliania huxleyi cell or a Thalassiosira pseudonana cell. In preferred embodiments the host cell is a Chlamydomonas cell. Preferably, the host cell is a Chlamydomonas reinhardtii cell.
In embodiments of the invention where the host cell is a fungal cell, the host cell may include but is not limited to fungal cells and yeast cells. In particular, the host cell may be a Saccharomyces cerevisiae cell, an Ashbya gossypii cell, an Aspergillus fumigatus cell, an Aspergillus nidulans cell, a Candida albicans cell, a Coprinus cinereus cell, a Cunninghamella elegans cell, a Cryptococcus neoformans cell, a Fusarium oxysporum cell, a Magnaporthe oryzae cell, a Neurospora crassa cell, a Schizophyllum commune cell, a Schizosaccharomyces pombe cell, an Ustilago maydis cell or a Zymoseptoria tritici cell. Preferably the host cell is a Saccharomyces cerevisiae cell or a Schizosaccharomyces pombe cell. More preferably the host cell is a Saccharomyces cerevisiae cell.
According to aspects of the present invention where the host cell is a plant cell, any cell type of any plant species, including both monocots and dicots, may be used as a host system for expression of a heterologous protein. Preferred plant cells for use in the present invention are genetically tractable, and are commonly derived from either crop species, species which typically exhibit high growth rates, are easily harvested or species which have established genetic resources associated with them. Commonly, in some preferred embodiments of the invention, the host cell is an Arabidopsis cell, preferably an Arabidopsis thaliana cell. In other preferred embodiments of the invention the host cell may be a Nicotiana cell, preferably a Nicotiana tabacum cell. Alternatively, depending on the application chosen said plant may suitably be selected from the following: maize (Zea mays), canola (Brassica napus, Brassica rapa ssp.), sugar beet (Beta vulgaris), oat (Avena sp.), barley (Hordeum vulgare), flax (Linum usitatissimum), alfalfa (Medicago sativa), rice (Oryza sativa), rye (Secale cereale), sorghum (Sorghum bicolor, Sorghum vulgare), switchgrass (Panicum virgatum), prairie cordgrass (Spartina sp.), purple false brome (Brachypodium distachyon), sunflower (Helianthus annuus), wheat (Triticum aestivum), soybean (Glycine max), potato (Solanum tuberosum), cotton (Gossypium hirsutum), sweet potato (Ipomoea batatas), cassava (Manihot esculenta), foxtail (Setaria sp.), Miscanthus sp., peanuts (Arachis hypogaea), coffee (Coffea spp.), coconut (Cocos nucifera), pineapple (Ananas comosus), citrus tree (Citrus spp.), cocoa (Theobroma cacao), tea (Camellia sinensis), banana (Musa spp.), avocado (Persea americana), fig (Ficus carica), guava (Psidium guajava), mango (Mangifera indica), olive (Olea europaea), papaya (Carica papaya), cashew (Anacardium occidentale), macadamia (Macadamia integrifolia), almond (Prunus amygdalus), Chlorella, Volvox, Guillardia theta, Bigelowiella natans or Physcomitrella patens.
Transformation of the host cell with a heterologous gene sequence
Expression constructs comprising the modified polynucleotide sequence may be located in plasmids (expression vectors) which are used to transform the host cell. Specific, but non-limiting methods of transformation may include heat shock, electroporation, particle bombardment, chemical induction, microinjection and viral transformation.
Heterologous protein expression analysis
Subsequently, in preferred embodiments of the present invention the expression levels of the protein of interest in host cells of interest may be determined. Preferably the method chosen allows for quantitative assessment of the level of functional expression. In some instances, functional expression may be directly determined, e.g. as with GFP, luciferase or by enzymatic action of the protein of interest (POI) to generate a detectable optical signal, such as fluorescence or luminescence or a colour change caused by the protein. However, in some circumstances it may be chosen to determine physical expression, for instance by antibody probing, and rely on a separate test to verify that physical expression is accompanied by the required function. In preferred embodiments of the invention, the POI will be detectable by a high-throughput screening method, for example, relying on the detection of an optical signal. Preferably, an optical signal is used which is directly proportionate to the quantity of the expression product from the polynucleotide; this is a convenient method of measuring expression and is amenable to high-throughput processing. For this purpose, it may be necessary for the POI to incorporate a tag, or be labelled with a removable tag, which permits detection and preferably quantification of expression. Suitable tags may include but are not limited to a fluorescence reporter molecule translationally fused to the C-terminal end of the POI, e.g. GFP, Yellow Fluorescent Protein (YFP), Red Fluorescent Protein (RFP) or Cyan Fluorescent Protein (CFP). It may be an enzyme which can be used to generate an optical signal. Alternatively, the expression vector may incorporate a polynucleotide reporter encoding a luminescent protein, such as a luciferase (e.g. firefly luciferase). Alternatively, the reporter gene may encode a chromogenic enzyme which can be used to generate an optical signal, such as beta-galactosidase (LacZ) or beta-glucuronidase (Gus). Tags used for detection of expression may also be antigen peptide tags. A tag may be provided for affinity purification, e.g. a polyhistidine tag. Where the POI is ultimately to be used as a therapeutic agent, any tag employed for detection of expression will be cleavable from the POI. It is envisaged that other types of label may also be used to mark the protein including, for example, organic dye molecules or radiolabels.
Accordingly, in a preferred embodiment of the invention, the measurement of expression comprises the detection of an optical signal, for example a fluorescent signal, a luminescent signal or colour signal. In a particularly preferred embodiment the optical signal is provided by a GFP reporter fused to the protein of interest.
The replacement codon selected from synonymous codons listed as alternatives in the relevant table(s) for a given host is the codon associated with the highest or optimal observed functional expression of the POI, or where more than one codon provides substantially equal such expression, one such codon corresponding with that level of expression. Where there is more than one replacement codon indicated for a given non-optimal codon based on the expression data, this corresponds to the first replacement codon appearing in the relevant table. Therefore where there is choice of codons indicated for a selected position based on the expression data, preference may be given to the first replacement codon appearing in the relevant table. Alternatively, preference may be given to the second replacement codon appearing in the relevant table. Routinely, in situations where the second or third preference codon is already present in the starting sequence, for convenience the codon in the starting sequence may be retained, i.e. the wild type codon in embodiments where the starting sequence is the wild-type sequence. This will minimise the number of codon changes to convert the starting sequence in a polynucleotide to the selected synonymous coding sequence for improved functional protein expression.
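The retention rule described above may be illustrated, purely as a sketch, in Python. The preference lists in `PREFERRED` and the partial codon map `CODON_TO_AA` are hypothetical placeholders and do not reproduce the tables of this document; the sketch assumes that a codon already appearing anywhere in the ranked list for its amino acid is kept, and that any other codon is replaced with the first-listed replacement codon.

```python
from typing import Dict, List, Tuple

# Hypothetical ranked replacement codons per amino acid (first = preferred).
PREFERRED: Dict[str, List[str]] = {
    "Ala": ["GCC", "GCT"],
    "Gly": ["GGT", "GGC"],
}

# Partial codon-to-amino-acid map covering only the families used above.
CODON_TO_AA: Dict[str, str] = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
}


def select_codon(codon: str,
                 preferred: Dict[str, List[str]],
                 codon_to_aa: Dict[str, str]) -> str:
    """Keep a codon that is already a listed preference; otherwise use the
    first-listed replacement. Codons without a table entry are unchanged."""
    ranked = preferred.get(codon_to_aa.get(codon, ""), [])
    if not ranked or codon in ranked:
        return codon
    return ranked[0]


def recode_with_retention(cds: str,
                          preferred: Dict[str, List[str]],
                          codon_to_aa: Dict[str, str]) -> Tuple[str, int]:
    """Return the recoded sequence and the number of codons changed."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    chosen = [select_codon(c, preferred, codon_to_aa) for c in codons]
    changes = sum(a != b for a, b in zip(codons, chosen))
    return "".join(chosen), changes
```

With the toy input `ATGGCTGCAGGCGGATAA`, the second-preference codons GCT and GGC are retained and only GCA and GGA are replaced, so two codon changes are made rather than four.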
EXEMPLIFICATION
The invention will now be illustrated below with reference to the following examples and figures, in which:
Figure 1 shows the influence of codon optimisation on protein yield, mRNA stability and translatability. Panel A is a graphical representation of the nucleotide content of the third codon position in the constructs for Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP) expression. GFP was also expressed without SP. Panel B is a graphical representation of protein yield in transformed Arabidopsis thaliana seedlings. For each plant analysed the protein yield in ng per mg total soluble protein (TSP) is plotted against the relative mRNA transcript concentration as compared to the A. thaliana household gene TIP-41. Panel C depicts protein yield in g per mg TSP at 2 to 5 days post infiltration (DPI), in transient expression in Nicotiana benthamiana leaves (native and optimised in black and grey bars, respectively). * indicates co-expression with the silencing inhibitor p19 of tomato bushy stunt virus. n=3, error bars indicate standard error.
Figure 2 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked nucleotide use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and nucleotide use (overall and for each codon position) were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
Figure 3 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and codon use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles respectively.
Figure 4 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked amino acid use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and amino acid use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
Figure 5 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon bias. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized and averaged. Subsequently, genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1% steps, the extremes (5% highest versus 5% lowest) were reached. With each step the synonymous codon use frequencies in both the high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- versus the low-expressed gene pool. Finally, the difference in codon use frequency was correlated to the expression defining percentage (Spearman). The relation between the species based on this correlation is visualized in this heat map.
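The stepwise grouping described for Figure 5 may be sketched, for a single codon within one synonymous family, as follows. This is an illustrative Python outline under stated assumptions and is not the analysis code used to generate the figure: the input format (one expression value and one codon-count dictionary per gene), the function names and the rounding of pool sizes are all assumptions, and the Spearman correlation is computed here as the Pearson correlation of average ranks.

```python
from typing import Dict, List, Sequence, Tuple

Gene = Tuple[float, Dict[str, int]]  # (averaged expression, codon counts)


def ranks(values: Sequence[float]) -> List[float]:
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / (sxx * syy) ** 0.5


def synonymous_fraction(pool: List[Dict[str, int]], codon: str,
                        family: Sequence[str]) -> float:
    """Use of `codon` as a fraction of all codons of its synonymous family."""
    total = sum(counts.get(c, 0) for counts in pool for c in family)
    hits = sum(counts.get(codon, 0) for counts in pool)
    return hits / total if total else 0.0


def bias_trend(genes: List[Gene], codon: str, family: Sequence[str],
               start: int = 50, stop: int = 5) -> float:
    """Correlate the high-minus-low difference in codon use with the
    percentage defining the pools, stepping from `start`% to `stop`%."""
    ranked = sorted(genes, key=lambda g: g[0], reverse=True)
    percents, diffs = [], []
    for pct in range(start, stop - 1, -1):
        k = max(1, len(ranked) * pct // 100)
        high = [counts for _, counts in ranked[:k]]
        low = [counts for _, counts in ranked[-k:]]
        diffs.append(synonymous_fraction(high, codon, family)
                     - synonymous_fraction(low, codon, family))
        percents.append(float(pct))
    return spearman(percents, diffs)
```

Under these assumptions, a negative value indicates that the codon becomes increasingly over-represented in the high-expressed pool as the pools are narrowed towards the extremes, and a positive value indicates the opposite.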
Figure 6 shows a graphical representation of mRNA structural features plotted against ranked expression with moving average (black line). The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined. These mRNA characteristics were then plotted against expression.
Figure 7 shows a heat map where the mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined and correlated with expression (Spearman) (Table 2). The heat map demonstrates that highly-expressed genes across all kingdoms prefer a stable, but 'airy' mRNA structure. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
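The structural features listed for Figures 6 and 7 may be derived from a predicted secondary structure along the following lines. The sketch below is illustrative only and assumes that the prediction is available in dot-bracket notation, with '(' and ')' marking bound nucleotides and '.' marking unbound nucleotides, together with an optional minimal free folding energy; the function name and the returned keys are hypothetical, and the structure prediction step itself is not shown.

```python
from itertools import groupby
from typing import Dict, Optional


def structure_features(dot_bracket: str,
                       mfe_kcal_per_mol: Optional[float] = None
                       ) -> Dict[str, Optional[float]]:
    """Summarise a dot-bracket structure using the features named in the text.

    A stem is taken to be an uninterrupted stretch of bound nucleotides and
    a loop an uninterrupted stretch of unbound nucleotides.
    """
    n = len(dot_bracket)
    if n == 0:
        raise ValueError("structure must not be empty")
    bound = [ch in "()" for ch in dot_bracket]
    # Collapse the sequence into runs of bound / unbound positions.
    runs = [(is_bound, len(list(group))) for is_bound, group in groupby(bound)]
    stems = [size for is_bound, size in runs if is_bound]
    loops = [size for is_bound, size in runs if not is_bound]
    return {
        "length": float(n),
        "fraction_bound": sum(bound) / n,
        "mean_stem_size": sum(stems) / len(stems) if stems else 0.0,
        "mean_loop_size": sum(loops) / len(loops) if loops else 0.0,
        # Each boundary between consecutive runs is one stem/loop transition.
        "transitions_per_nt": (len(runs) - 1) / n,
        "mfe_per_nt": (mfe_kcal_per_mol / n
                       if mfe_kcal_per_mol is not None else None),
    }
```

For the toy structure `((..((...))..))....`, 8 of the 19 nucleotides are bound, the mean stem size is 2, the mean loop size is 2.75 and there are 7 stem/loop transitions.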
Figure 8 is a heat map showing correlations (Spearman) between mRNA structure characteristics and protein:mRNA ratios per species (Table 3), demonstrating that highly translated transcripts across kingdoms share a similar 'airy' structure. The mRNA structures of all genes of Escherichia coli (Eubacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with protein:mRNA ratios. Rank-normalized mRNA levels were divided by protein abundance (retrieved from PaxDB). Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
Figure 9 shows mRNA structure predictions of the constructs used for heterologous protein expression. Sequences of the native and optimised variants of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional signal peptide (SP) and GFP without SP flanked by the 5' and 3'-UTRs as expected from our expression cassette were used to predict the mRNA secondary structure.
Figure 10 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked nucleotide use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and nucleotide content (overall and for each codon position) for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.
Figure 11 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked codon use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and codon use for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.
Figure 12 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked amino acid use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and amino acid use for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.
Figure 13 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences.
Figure 14 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
Figure 15 shows a sequence alignment of native (nat) and optimized (opt) mIL-10 sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
Figure 16 shows a sequence alignment of native (nat) and optimized (opt) OVA sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
Example 1 - Codon optimisation improves mRNA stability and translatability
Wang and Roossinck (2006) previously compared overall codon use to the codon use in highly expressed genes in 11 plant species. Although the codons used most frequently in highly expressed genes (optimal codons) differed between monocots and dicots, the use of the same codons often increases with expression (expression codons). However, the authors did not express the optimised genes in plants. In the experiments shown here, one codon per amino acid that was most often identified as an expression codon across these 11 plant species was selected. Strikingly, most of these codons were C-ending, except for the amino acids Arg (CGT) and Gly (GGT). The codons of the amino acids Gln, Glu and Lys, that can only be encoded by A or G-ending codons, were G-ending. To investigate the effect of these codons on heterologous protein production in plants, the gene sequence of three genes was recoded with these codons. The genes of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) were chosen because of their variation in codon use (Figure 1a). To eliminate differences caused by translation initiation all genes were preceded by the signal peptide of Arabidopsis thaliana chitinase. GFP was also expressed without this signal peptide, as it is normally not secreted. The native and optimised variants of these four constructs were used to transform Arabidopsis thaliana using the floral dip method and their expression in seedlings was evaluated by determining mRNA transcript and protein levels (Figure 1b; Table 4). An increased protein yield found upon optimisation could be partly explained by an increase in mRNA transcript levels, i.e. increased mRNA stability (Table 4). Comparing protein:mRNA ratios of transformants within a similar mRNA expression range showed that codon optimisation resulted in more protein per mRNA transcript. Thus, codon optimisation also resulted in increased mRNA translatability.
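The nucleotide content of the third codon position, as compared between constructs in Figure 1a, may be computed as in the following Python sketch. The function name is hypothetical and the input sequence shown in the usage note is a toy example, not one of the GFP, OVA or IL-10 constructs.

```python
from collections import Counter
from typing import Dict


def third_position_content(cds: str) -> Dict[str, float]:
    """Fraction of A, C, G and T at the third position of each codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA input
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be non-empty and a multiple of 3 long")
    thirds = cds[2::3]  # every third nucleotide, starting at codon position 3
    counts = Counter(thirds)
    return {nt: counts.get(nt, 0) / len(thirds) for nt in "ACGT"}
```

For the toy sequence `ATGGCCAAGGGTTAA` the third positions are G, C, G, T and A, giving fractions of 0.2 for A, C and T and 0.4 for G.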
Upon transient transformation, transcript levels are always much higher. An increase in mRNA stability and translatability may then no longer improve protein yield. Therefore, protein yield upon transient expression of the three genes in Nicotiana benthamiana was also determined, with and without co-expression of the gene silencing inhibitor p19 of tomato bushy stunt virus (Figure 1c; Table 5). Upon transient expression, codon optimisation also led to higher protein yield on all days for all genes, except for OVA unless p19 was co-expressed. In most cases co-expression of p19 had a favourable effect on protein yield independent of optimisation. This is not surprising, as mRNA transcript levels are always high in transient expression, which increases the risk of gene silencing. Thus, the mRNA of the optimised variant of OVA must have been more sensitive to gene silencing compared to the native variant.
Relative
Relative mRNA Protein:
mRNA Fold Protein Fold cone. n mRNA Fold n= cone. change yield change range = ratio change
GFP N 32 0.88 17.03 0.8-2.7 4 22.8±2.70
75*** 1 7.1
0 23 9.25 1276 0.9-2.5 4 161 ±58.5
SP- 1
GFP N 26 1 .63 33.28 1 .4-4.9 1
5.8* -| 2** 18.0±5.16
3.5* 1
0 24 9.53 399.5 1 .2-4.8 2 63.9±14.5
SP- 1 356.2±142
OVA N 26 2.37
2 ^*** 717.3 2.0-5.3 2 .5
5.5*** 2 g**
2 1014±121 .
0 30 5.62 3937 2.2-5.5 3 7
SP-
IL-10 N 17 1 .37 3.30 1 .7-4.2 8
2 -j *** 1.26±0.43
5.5***
1
O 25 4.23 17.9 1.7-4.1 6 6.68±1.02
Table 4. Codon optimisation of GFP, interleukin-10 and ovalbumin genes boosts expression in Arabidopsis thaliana. Average relative mRNA transcript concentration as compared to the A. thaliana housekeeping gene TIP-41 and protein yield in g per mg total soluble protein (TSP) determined in A. thaliana seedlings upon stable transformation of native (N) and optimised (O) sequences of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP). GFP was also expressed without signal peptide. Protein:mRNA ratios were calculated. Because translatability may be lower at a higher mRNA concentration due to the limited number of free ribosomes, the protein:mRNA ratios were calculated for samples within the same mRNA concentration range, as indicated. The fold change when comparing the optimised to the native variant was calculated for the relative mRNA concentration, protein yield and protein:mRNA ratio. For each average the number of included seedlings is indicated (n). Significance of fold changes was calculated with a Welch's t-test: * P<0.05, ** P<0.01, *** P<0.001.
dpi 2-5 dpi 5 + p19
Protein yield Fold change Protein yield Fold change
GFP N 5- O 23 I 34 «
SP-GFP N 1
3.2** 2.1
O 3.2 9.2
SP-OVA N 30
17 0.7 2.0*
O 12 61
SP-IL-10 N 8 4
1 .4
O 21 24
Table 5. Codon optimisation boosts protein yield in transient expression in Nicotiana benthamiana. Average protein yield in g per mg total soluble protein determined in N. benthamiana leaves upon transient transformation of native (N) and optimised (O) sequences of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP) (GFP was also expressed without SP) at 2 to 5 days post infiltration (dpi) (n=12) or at 5 dpi, whereby tested genes were co-expressed with the viral silencing inhibitor p19 of tomato bushy stunt virus (n=3). Significance of fold change in protein yield was calculated with a Welch's t-test: * P<0.05, ** P<0.01, *** P<0.001.
Evaluating the average yield from dpi 2-5 or with co-expression of p19 on dpi 5 revealed a lower yield increase upon codon optimisation compared to stable expression in A. thaliana. This is not surprising as at least some of the gain in mRNA stability due to the codon optimisation is compensated by the increased transcription in transient expression. Whether this gain in protein yield is predominantly the result of an increase in mRNA translatability or a combination of a gain in mRNA stability and translatability remains to be determined.
To explain the differences found in mRNA stability, the thermodynamic stability of the predicted mRNA secondary structures was first calculated. Upon codon optimisation the minimum free folding energy decreased, indicative of a more stable mRNA, from -0.25 to -0.35 and from -0.31 to -0.33 kcal/mol/nt for GFP and OVA, respectively. However, for IL-10, the minimum free folding energy increased from -0.31 to -0.28 kcal/mol/nt, indicating a less stable mRNA. Thus, an overall increase in physical stability could not explain the increased mRNA transcript levels of IL-10. However, it is still possible that unstable regions of IL-10 were removed upon codon optimisation, while the overall stability decreased.
In vivo, mRNA half-life is predominantly controlled by factors other than physical stability, namely the occurrence of a splicing event, AU-rich destabilizing elements in the UTRs, and the presence of sequences that are targets for microRNA. In our experiments, all genes were expressed using the same expression-controlling components and thus contained the same UTRs and did not contain introns. However, the sequences of the ORFs varied greatly between the native and optimised variants (78, 76 and 83% homology for GFP, OVA and IL-10, respectively). Therefore, there could be a difference in the presence of microRNA targets and also a difference in the occurrence of stretches of double-stranded (ds)RNA between the native and optimised variants. The dsRNA stretches could be processed to small interfering RNAs and, like binding of microRNAs, can trigger gene silencing. In stable expression, gene silencing can also be due to gene methylation, but this always results in the complete absence of transcripts, and therefore transformants without detectable expression were not considered. In our transient expression experiment, co-expression of the silencing inhibitor p19 gave comparable results. Taken together, differences in mRNA decay based on the above-mentioned sequence features are unlikely to explain the differences in mRNA stability in our experiments. Translation has also been linked to mRNA decay. Ribosomes can shield nuclease target sites; however, in large-scale in vivo studies mRNA half-life could not be linked to the number of nuclease target sites or ribosomal density. When translation initiation is equal, as is expected in our experiments, an increase in translatability should result in a lower density of ribosomes. Thus, there would have been fewer ribosomes on the optimised variants compared to their native counterparts, and the optimised variants would be less protected against nucleases.
While translation per se may not influence mRNA half-life, errors in translation have been proven to lead to mRNA degradation by mRNA surveillance mechanisms. Three mRNA surveillance mechanisms have been identified: I) nonsense mediated decay by the recognition of a premature stop codon, II) non-stop decay by the lack of a stop codon and III) no-go decay by stalled ribosomes. Occurrence of a premature stop codon or the lack of a stop codon can be caused by a mutation or a ribosomal slip causing a frame-shift. Frame-shifts can be caused by a 'slippery' sequence that may be found in proximity of a strong mRNA structure. A ribosome may also stall at a strong stem-loop structure without slipping and trigger degradation. It is possible that the native and optimised variants differ in the presence of 'slippery' sequences and/or strong mRNA structures. Thus, differences in level of translation-linked mRNA decay may explain the difference in mRNA transcript levels in our experiment. In addition, ribosomes have intrinsic helicase activity and recently it was shown that strong mRNA structures such as pseudoknots and hairpins can stall translation only temporarily. It is therefore thought that the mRNA structure provides a mechanical basis for cellular regulation of translation rate. Thus, increased mRNA translatability of the optimised genes may be explained by an increased translation rate caused by differences in the mRNA structure.
Example 2 - General codon bias extends to other kingdoms of life
The existence of codon biases in different species has implications for the efficient expression of heterologous proteins in a range of host cells. To investigate whether the general codon bias in plants transcends kingdoms of life, expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were interrogated. Per species, >250 microarrays originating from several studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues were used (Table 1A-F). First, the expression was ranked and the average rank was used as a measure of overall expression. Subsequently, the correlation between expression and nucleotide content was analysed per species. The relation between the species based on this correlation was visualized in a heat map (Figure 2).
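The average-rank measure of overall expression described above can be illustrated with a minimal sketch. It assumes that each array is available as a mapping from probe name to intensity and that all arrays share the same probes; the function names `rank_values` and `average_rank` are hypothetical and are not taken from the original analysis, which was performed in R.

```python
def rank_values(values):
    """Return 1-based ranks (lowest value = rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def average_rank(arrays):
    """arrays: list of dicts (probe -> intensity). Returns probe -> mean rank."""
    probes = sorted(arrays[0])
    totals = {p: 0.0 for p in probes}
    for arr in arrays:
        # Rank probes within one array, then accumulate per probe.
        for p, r in zip(probes, rank_values([arr[p] for p in probes])):
            totals[p] += r
    return {p: totals[p] / len(arrays) for p in probes}


if __name__ == "__main__":
    arrays = [{"a": 1.0, "b": 5.0, "c": 3.0}, {"a": 2.0, "b": 9.0, "c": 1.0}]
    print(average_rank(arrays))
```

Ranking within each array before averaging makes arrays with different intensity scales comparable, which is presumably why a rank-based measure was chosen for data pooled from many studies.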
Surprisingly, a strong positive correlation between expression and overall G content, in particular G on the first codon position and a negative correlation between expression and A and T on the first codon position was found across all kingdoms. Next, the correlation between expression and codon use was evaluated (Figure 3). Across all kingdoms the use of CGT (Arg/R), AAG (Lys/K), GGT (Gly/G), GTT (Val/V) and GCT (Ala/A) is positively correlated with expression. However, the fact that the nucleotide contents of the first and second codon position are correlated with expression indicates that there is a correlation between amino acid usage and expression. Highly expressed genes are relatively rich in the amino acids encoded by G-starting triplets: Ala, Gly, and Val (Figure 4).
First, to uncouple the amino acid bias from the codon use bias, the relative synonymous codon use was calculated. Subsequently, a comparison was made between high- and low-expressed genes, as a correlation between codon use and expression may only be found in genes expressed above a certain threshold. Genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1 % steps, the pools with 5% highest and 5% lowest expressed genes were reached. With each step the codon use frequencies in both high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- versus the low-expressed gene pool. Finally, the difference in codon use frequency was correlated (Spearman) to the expression defining percentage. The relation between the species based on this correlation was visualized in a heat map (Figure 5; Table 6A-E show codon use frequencies of all, the bottom 5% low- and top 5% high-expressed genes and fold codon use change (top/bottom) per species).
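The threshold sweep described above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: synonymous codon use is taken as the fraction of a codon within its synonymous family, the pools are formed from a list of coding sequences sorted from lowest to highest expression, and the "expression defining percentage" is represented by a stringency value (100 minus the pool percentage) so that a positive correlation means increasing enrichment in ever more highly expressed pools. All function names are hypothetical.

```python
from collections import Counter


def codon_fraction(cds_list, codon, family):
    """Fraction of `codon` among all codons of its synonymous `family` in the pooled CDSs."""
    counts = Counter()
    for cds in cds_list:
        for i in range(0, len(cds) - len(cds) % 3, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts[c] for c in family)
    return counts[codon] / total if total else 0.0


def ranks(values):
    """1-based ranks with average ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def threshold_sweep(cds_by_expression, codon, family, start=50, stop=5):
    """Correlate (high pool - low pool) codon fraction with pool stringency, in 1% steps."""
    n = len(cds_by_expression)
    stringency, diffs = [], []
    for pct in range(start, stop - 1, -1):
        k = max(1, n * pct // 100)  # number of genes in each pool
        low, high = cds_by_expression[:k], cds_by_expression[-k:]
        diffs.append(codon_fraction(high, codon, family) - codon_fraction(low, codon, family))
        stringency.append(100 - pct)
    return spearman(stringency, diffs)


if __name__ == "__main__":
    # Toy data: the higher the expression rank, the more AAG and the less AAA.
    genes = ["AAG" * i + "AAA" * (99 - i) for i in range(100)]
    print(threshold_sweep(genes, "AAG", ("AAA", "AAG")))
```

In this toy set the AAG fraction rises steadily with expression rank, so the sweep returns a correlation close to +1 for AAG and close to -1 for AAA.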
Strikingly, when clustering the correlations between the 5 species, E. coli, S. cerevisiae, C. elegans and A. thaliana group together well. M. musculus seems to have an overall lower codon bias and in ~50% of the cases selects for other codons compared to the overall selection of the other species. Excluding M. musculus, 13 codons are positively correlated with expression for all species. These 13 codons encode 11 different amino acids and a termination of translation (twice a codon for Thr/T). Comparable to the general codon bias found in plants, 8 of these 13 codons are C-ending. Furthermore, 18 codons are consistently negatively correlated with expression in these four species. Of these codons most are A-ending (8), while none of them are C-ending. Strikingly, 5 universal codons were found which were positively correlated with expression for all species, indicating that these codons are conserved in the coding sequences of highly expressed genes across all kingdoms of life and could therefore find useful application in methods of optimising functional protein expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. In addition, several codons were found which were positively correlated with further increases in expression in E. coli, S. cerevisiae and C. elegans. Furthermore, in addition to the universal set of codons, several codons were found to be positively correlated with increases in expression in E. coli, S. cerevisiae, C. elegans and Mus musculus. Separately, several codons were found to be positively correlated with increased expression in A. thaliana.
Taken together the data suggest that a conserved selection pressure influences expression across all kingdoms of life. Heterologous protein expression experiments suggested a role for the mRNA structure in translation rate. As the translational machinery does not vary greatly across kingdoms, the mRNA structure is a likely candidate to be the driving force behind this selection pressure.
Example 3 - Highly expressed genes prefer a stable, but 'airy' mRNA structure
To evaluate whether the mRNA structure could be the driver of selection that gives rise to the observed general codon bias, the relationship between expression and mRNA structure characteristics was evaluated. To this end, the mRNA structures of all genes were predicted, and gene length, minimal free folding energy, number of bound nucleotides, mean stem and loop size (stretches of bound and unbound nucleotides, respectively) and the number of stem/loop transitions were determined and plotted against expression (Figure 6; Table 7). A heat map displaying the relation between the species based on the correlation (Spearman) between these structure characteristics and expression was also generated (Figure 7; Table 7). This heat map demonstrates that the number of bound nucleotides and the number of stem/loop transitions were consistently positively correlated, and mean loop size consistently negatively correlated, with expression across all species.
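The structure characteristics listed above can be derived from a predicted secondary structure in dot-bracket notation, as produced by common RNA folding tools. The sketch below is a minimal illustration with assumptions that are not stated in the text: a stem is taken to be any uninterrupted run of paired positions, a loop any uninterrupted run of unpaired positions, and every boundary between a stem and a loop is counted as one transition. The exact counting convention behind the reported values is not given, and `structure_stats` is a hypothetical name.

```python
from itertools import groupby
from statistics import mean


def structure_stats(dot_bracket):
    """Summarise stems (paired runs) and loops (unpaired runs) of a dot-bracket string."""
    # Collapse the structure into alternating runs: (is_bound, run_length).
    runs = [(bound, len(list(group)))
            for bound, group in groupby(dot_bracket, key=lambda c: c != ".")]
    stems = [n for bound, n in runs if bound]
    loops = [n for bound, n in runs if not bound]
    length = len(dot_bracket)
    return {
        "bound_fraction": sum(stems) / length,
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "max_stem": max(stems, default=0),
        "max_loop": max(loops, default=0),
        # Assumption: each stem/loop boundary counts as one transition.
        "transitions_per_nt": (len(runs) - 1) / length,
    }


if __name__ == "__main__":
    print(structure_stats("..(((...)))."))
```

Dividing the bound count and the transition count by sequence length mirrors the length correction mentioned for these characteristics, so that genes of different sizes can be compared.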
The positive correlation with the number of bound nucleotides indicates a general adaptation towards a more stable mRNA molecule. Also, a low folding energy (more stable) is correlated with high expression in S. cerevisiae, C. elegans and A. thaliana, but not in E. coli and M. musculus. Still, in E. coli there seems to be a relation between expression and folding energy, as is demonstrated by the trend line that indicates an optimum (Figure 6). An optimum in mRNA stability may indicate a trade-off between stability and translatability in this species. A trade-off in stability and translatability may also explain why there is a correlation between mRNA folding energy and expression in S. cerevisiae, C. elegans and A. thaliana. These species have an overall lower G+C content resulting in on average weaker mRNAs (Table 7) and have therefore more to gain in terms of stability before translatability is affected.
The number of stem-loop transitions and mean loop size are also correlated with expression (positive and negative, respectively) in all species, which suggests that there is a general adaptation towards dividing nucleotide bonds equally over the mRNA molecule. In other words, highly expressed genes prefer a stable, but 'airy' mRNA molecule. This again indicates a trade-off between mRNA stability and translatability. It is striking that while folding energy in S. cerevisiae, C. elegans and A. thaliana is on average much higher (less stable mRNA) (6-10%) compared to E. coli and M. musculus, the fraction of bound nucleotides, mean stem and loop size and number of transitions do not differ that much (Table 7). This means that while the mRNA folding energy may differ between species with different G+C content, the overall mRNA structure characteristics are more similar across species.
Taken together our data indicate that there is a general selection towards an optimal folding energy across kingdoms of life whereby number and type of nucleotide bonds (e.g. A-U and G-U bonds are weaker than G-C bonds) are balanced with short loops to facilitate efficient translation. This is in line with the observation that translation rate is greatly influenced by G+C content and strong mRNA structures.
Table 7. mRNA characteristics of highly expressed genes per species.
Averages of mRNA characteristics of the top 5% high-expressed genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
A link between mRNA structure and expression may explain the increase in mRNA stability and translatability in the heterologous protein expression experiments disclosed herein. Therefore the mRNA structures of the native and optimised variants of the expressed genes were predicted and evaluated (Figure 9; Table 8). Optimised variants of GFP and OVA had a lower (more negative) folding energy, indicative of a more stable mRNA. All optimised variants had an increased number of stem-loop transitions (except SP-GFP), which is in line with a more 'airy' mRNA molecule. Thus, although changes in the mRNA structure upon optimisation differ from gene to gene, an improved mRNA structure could be the basis of increased protein yield in our experiments.
Construct | Variant | Energy (kcal/mol/nt) | Bound nt's (fraction) | Mean stem size | Mean loop size | Transitions
GFP | N | -0.21 | 0.56 | 5.74 | 4.48 | 0.097
GFP | O | -0.33 | 0.57 | 5.15 | 3.85 | 0.111
SP-GFP | N | -0.22 | 0.57 | 5.21 | 3.89 | 0.109
SP-GFP | O | -0.32 | 0.54 | 5.22 | 4.31 | 0.104
SP-OVA | N | -0.29 | 0.61 | 5.28 | 3.34 | 0.116
SP-OVA | O | -0.31 | 0.55 | 4.38 | 3.56 | 0.126
SP-IL-10 | N | -0.29 | 0.60 | 5.02 | 3.29 | 0.120
SP-IL-10 | O | -0.27 | 0.54 | 4.08 | 3.47 | 0.131
Table 8. Calculated mRNA structure characteristics of the constructs used for heterologous protein expression. Analysis of the mRNA secondary structure predictions given in Figure 9. Folding energy, bound nucleotides and number of transitions are corrected for gene length. Stem and loop sizes are mean values.
Example 4 - A more 'airy' mRNA increases translation rate
On a cellular level, translation efficiency was demonstrated to be the most important factor in controlling protein abundance, whereas protein turnover plays only a minor role. Therefore, the protein:mRNA ratio is a good proxy for translation rate. To evaluate whether the mRNA structure characteristics found to be linked to expression are also linked to translation rate, the expression data were combined with large-scale protein abundance data retrieved from PaxDB. To evaluate to what extent the expression data predict protein abundance, the correlation (Spearman) between the expression data and the protein abundance was calculated: E. coli 0.59, S. cerevisiae 0.67, C. elegans 0.59, A. thaliana 0.62 and M. musculus 0.36. When the relationship between the protein:mRNA ratio and the previously mentioned mRNA structure characteristics was evaluated, a picture similar to that obtained with the expression data emerged (Figure 8; Table 3; the heat maps in Figures 10-12 demonstrate the relation between species based on correlations of protein:mRNA ratio and nucleotide content, codon use and amino acid use).
Characteristic | E. coli | S. cerevisiae | C. elegans | A. thaliana | M. musculus
Gene length | -0.146 | -0.116 | -0.180 | -0.139 | -0.288
Energy (kcal/mol/nt) | 0.043 | -0.237 | -0.212 | -0.138 | 0.087
Bound (fraction) | -0.009 | 0.148 | -0.006 | 0.062 | -0.058
Mean stem size | -0.193 | -0.011 | -0.216 | -0.058 | -0.121
Mean loop size | -0.121 | -0.182 | -0.139 | -0.105 | -0.015
Transitions/nt | 0.199 | 0.140 | 0.213 | 0.104 | 0.081
Table 3. Correlations (Spearman) between mRNA structure characteristics and mRNA:protein ratios per species. The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted, and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop size (stretches of bound and unbound nucleotides, respectively) and number of stem/loop transitions were determined and correlated (Spearman) with mRNA:protein ratios. Rank-normalized mRNA levels were divided by protein abundance (retrieved from PaxDB).
As with expression, the number of stem-loop transitions is positively correlated with protein:mRNA ratio and mean loop size is negatively correlated across all species. Also, the folding energy is negatively correlated (more stable mRNA) for S. cerevisiae, C. elegans and A. thaliana, but not for E. coli and M. musculus. However, in contrast to the expression data, gene length is consistently negatively correlated with protein:mRNA ratio. This is in line with the fact that the packing density of ribosomes was shown to decrease with mRNA transcript length. Also, a negative correlation with mean stem size is found for all species and the fraction of bound nucleotides is not correlated, except for S. cerevisiae. Thus, small stem size must be important for an increased translation rate. This again highlights the tradeoff between mRNA stability and translatability.
Example 5 - Construct design
The native and optimised sequences coding for Aequorea victoria green fluorescent protein (GFP) (L29345.1; nt 7-807), Gallus gallus ovalbumin (OVA) (NM_205152.2; nt 4-1161) and Mus musculus interleukin-10 (IL-10) (NM_010548.2; nt 63-537), together with the optimised sequence for the Arabidopsis thaliana basic chitinase signal peptide (cSP) (BAA82810.1; nt 15-33), were synthesised by GeneArt (Thermo Fisher Scientific, Breda, the Netherlands). Optimisation was performed by recoding the protein sequences using the C-ending codons for all amino acids (TCC in the case of Ser), except Arg and Gly, for which the T-ending codons were used, and Gln, Glu and Lys, for which the G-ending codons were used. Synonymous mutations were sometimes introduced into either native or optimised sequences to remove undesired restriction sites and the cryptic splice sites in native GFP (Reichel et al., 1996, PNAS, 93:5888-5893). Gene fragments were flanked with sequences including the restriction sites NcoI (5') and EagI-BspHI (3') for cSP, EagI (3') and KpnI (5') for IL-10 and OVA, and NcoI (3') and KpnI (5') for GFP to allow fragment assembly and subsequent in-frame cloning into the plant expression vector pHYG (Westerhof et al., 2012, PLoS One, 7: e46460). Fragment assembly was accomplished by the in-frame ligation of cSP with IL-10 and OVA using the EagI site, and of cSP with GFP using the BspHI (cSP) and NcoI (GFP) sites. ORFs were confirmed by sequencing at the expression vector stage. All vectors were transformed into Agrobacterium tumefaciens strain GV3101 for stable transformation of Arabidopsis thaliana or MOG101 for agroinfiltration of Nicotiana benthamiana.
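The recoding rule described above (C-ending codons, TCC for Ser, T-ending codons for Arg and Gly, G-ending codons for Gln, Glu and Lys) can be written as a simple lookup. The sketch below is illustrative only: the single-codon amino acids Met and Trp are filled in from the standard genetic code, no stop codon is appended unless one is supplied because the text does not name the stop codon used, and the removal of restriction or splice sites by additional synonymous changes is not modelled. The names `EXPRESSION_CODON` and `recode` are hypothetical.

```python
# One codon per amino acid, following the rule stated in the text.
EXPRESSION_CODON = {
    "A": "GCC", "C": "TGC", "D": "GAC", "F": "TTC", "H": "CAC",
    "I": "ATC", "L": "CTC", "N": "AAC", "P": "CCC", "S": "TCC",
    "T": "ACC", "V": "GTC", "Y": "TAC",      # C-ending codons
    "R": "CGT", "G": "GGT",                  # T-ending exceptions
    "Q": "CAG", "E": "GAG", "K": "AAG",      # G-ending (only A/G-ending codons exist)
    "M": "ATG", "W": "TGG",                  # single-codon amino acids
}


def recode(protein, stop_codon=None):
    """Return a coding sequence for `protein` (one-letter code) using the lookup above.

    `stop_codon` is optional because the choice of stop codon is an assumption
    left to the caller.
    """
    try:
        cds = "".join(EXPRESSION_CODON[aa] for aa in protein.upper())
    except KeyError as err:
        raise ValueError(f"unsupported residue: {err.args[0]}") from None
    return cds + (stop_codon or "")


if __name__ == "__main__":
    print(recode("MKRGS"))
```

Because the mapping is one-to-one, the recoded sequence depends only on the protein sequence and not on the native codons.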
Example 6 - Stable transformation of Arabidopsis thaliana
Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10 g/l peptone 140, 5 g/l yeast extract, 10 g/l NaCl, pH 7.0) containing 50 μg/ml kanamycin. Bacterial cultures were centrifuged for 15 min at 2800 g and resuspended in MMA (20 g/l sucrose, 5 g/l MS salts, 1.95 g/l MES, pH 5.6) containing 200 μM acetosyringone and 0.03% Silwet L-77 until an OD of 0.5 was reached. Arabidopsis thaliana plants were submerged in the bacterial suspension for 1 min and kept in a moist environment for 2 days. Plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) until seeds could be collected. Seeds were sterilized by 4-hour exposure to chlorine gas and plated on basic agar plates (8 g/l Bacto Agar, 0.101 g/l KNO3) containing 30 ng/ml hygromycin and 100 μg/ml cefotaxime. Plates were kept in the dark at 4°C for 2 days, then placed in artificial light for 7 hours at 24°C, again kept in the dark at RT for 5 days and finally placed in a climate chamber with a 12-hour light regime at 24°C for 2 days. At this stage 10 to 40 seedlings per transformant plant were selected and placed in individual pots with Knop agar (1x Knop, 1% sucrose, 8 g/l Plant Agar, pH 6.4) containing 30 μg/ml hygromycin and 100 μg/ml cefotaxime. Seedlings that showed good growth and root formation after 10 days were transferred to fresh pots and allowed to grow for 2 more weeks. Thereafter plants were harvested and snap-frozen. Plant material was homogenized using a TissueLyser II (Qiagen) and stored at -80°C until further use.
Example 7 - Transient transformation of Nicotiana benthamiana
Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10 g/l peptone 140, 5 g/l yeast extract, 10 g/l NaCl, pH 7.0) containing 50 μg/ml kanamycin and 20 μg/ml rifampicin. The optical density (OD) of the o/n cultures was measured at 600 nm and used to inoculate 50 ml of LB medium containing 200 μM acetosyringone and 50 μg/ml kanamycin with x μl of culture, using the following formula: x = 80000/(1028 × OD). OD was measured again after 16 hours and the bacterial cultures were centrifuged for 15 min at 2800 g. The bacteria were resuspended in MMA infiltration medium (20 g/l sucrose, 5 g/l MS salts, 1.95 g/l MES, pH 5.6) containing 200 μM acetosyringone until an OD of 1 was reached. All constructs were co-expressed with the tomato bushy stunt virus silencing inhibitor p19 by mixing Agrobacterium cultures 1:1. After 1-2 hours of incubation at room temperature, the two youngest fully expanded leaves of 5-6 week old Nicotiana benthamiana plants were infiltrated completely. Infiltration was performed by injecting the Agrobacterium suspension into a Nicotiana benthamiana leaf at the abaxial side using a 1 ml syringe. Infiltrated plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) and infiltrated leaves were harvested at selected time points.
Example 8 - Determination of heterologous gene expression
Total RNA was isolated from homogenized plant material using the RNeasy Plant Mini Kit (Qiagen) according to the supplier's protocol. A Turbo DNase I (Ambion) treatment was included to remove any residual DNA. cDNA was synthesised using the SuperScript® III First-Strand Synthesis System (Invitrogen) according to the supplier's protocol using an oligo(dT) primer. Samples were analysed by quantitative PCR in triplicate using ABsolute SYBR Green Fluorescein mix (Thermo Scientific). Arabidopsis thaliana TIP-41 (AY074349.1) was used as a reference gene. The oligonucleotides used for amplification of both native and optimised IL-10, OVA and GFP, and of TIP-41, were 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 2] / 5'-GGAAGTGGGTGCAGTT-3' [SEQ ID NO: 3]; 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 4] / 5'-GGGCAGTAGAAGATGTTC-3' [SEQ ID NO: 5]; 5'-GACGGTAACTACAAGACC-3' [SEQ ID NO: 6] / 5'-TTGTCGGCCATGATGTA-3' [SEQ ID NO: 7]; and 5'-GCTCATCGGTACGCTCTTTT-3' [SEQ ID NO: 8] / 5'-TCCATCAGTCAGAGGCTTCC-3' [SEQ ID NO: 9], respectively. Relative transcript levels of the genes versus TIP-41 were determined by the Pfaffl method (Pfaffl, 2001, Nucleic Acids Research, 29: e45).
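For reference, the Pfaffl method expresses the level of a target transcript relative to a reference gene while allowing each primer pair its own amplification efficiency. The sketch below implements the general formula from Pfaffl (2001); how calibrator samples and efficiencies were chosen in the present experiments is not stated, so the argument names and the example values are purely illustrative.

```python
def pfaffl_ratio(e_target, ct_target_control, ct_target_sample,
                 e_ref, ct_ref_control, ct_ref_sample):
    """Relative expression ratio according to Pfaffl (2001).

    e_target, e_ref: amplification efficiencies (2.0 means perfect doubling per cycle).
    ct_*: threshold cycles of target and reference gene in control and sample.
    """
    delta_target = ct_target_control - ct_target_sample
    delta_ref = ct_ref_control - ct_ref_sample
    return (e_target ** delta_target) / (e_ref ** delta_ref)


if __name__ == "__main__":
    # Illustrative values: the target crosses the threshold 3 cycles earlier in
    # the sample, while the reference gene is unchanged.
    print(pfaffl_ratio(2.0, 25.0, 22.0, 2.0, 20.0, 20.0))
```

With equal efficiencies of 2.0 the formula reduces to the familiar 2 to the power of minus delta-delta-Ct calculation.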
Example 9 - Determination of heterologous protein expression
Homogenized plant material was ground in ice-cold extraction buffer (50 mM phosphate-buffered saline (PBS) pH 7.4, 100 mM NaCl, 10 mM ethylenediaminetetraacetic acid (EDTA), 0.1% v/v Tween-20, 2% w/v immobilized polyvinylpolypyrrolidone (PVPP)) using 2 ml/g fresh weight. Crude extract was clarified by centrifugation at 16,000 x g for 5 min at 4°C and the supernatant was used directly in an ELISA and a BCA protein assay. Mouse IL-10 expression levels were determined using the Mouse IL-10 ELISA Ready-SET-Go! kit (eBioscience) according to the supplier's protocol. For the quantification of OVA and GFP, a rabbit anti-ovalbumin or a chicken anti-GFP antibody (both from Rockland Immunochemicals Inc.) was used to coat ELISA plates o/n at 4°C in a moist environment. After this and each following step the plate was washed 5 times with 30 sec intervals in PBST (1x PBS, 0.05% Tween-20) using an automatic plate washer (BioRad model 1575). The plate was blocked with assay diluent (eBioscience) for 1 h at room temperature. Samples and standard lines were loaded in serial dilutions and incubated for 1 h at room temperature. Standard lines were made from purified chicken ovalbumin (Sigma) or recombinant GFP (Roche). For detection of OVA and GFP, a rabbit anti-ovalbumin:HRP antibody or a rabbit anti-GFP:HRP antibody (both from Rockland Immunochemicals Inc.) was used, respectively. A 3,3',5,5'-tetramethylbenzidine (TMB) substrate (eBioscience) was added and the colouring reaction was stopped using stop solution (0.18 M sulphuric acid) after 1-15 min. Read-outs were performed using the model 680 microplate reader (BioRad) to measure the OD at 450 nm with a correction filter of 690 nm. For sample comparison, total soluble protein (TSP) concentration was determined using the BCA Protein Assay Kit (Pierce) according to the supplier's protocol using bovine serum albumin (BSA) as a standard.
Example 10 - Gene expression datasets
Gene expression datasets of 5 species (Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus) were downloaded from Gene Expression Omnibus (GEO). Gene-expression sets were selected based on platform (Affymetrix), release date (not earlier than 2008), publication linked to the GEO set and number of samples in the study. In total 2067 gene-expression profiles were collected, representing 8 or 9 different studies per organism. An overview can be found in Table 1A-F.
Example 11 - Protein abundance datasets
Protein abundance datasets were retrieved from PaxDb (Wang et al., 2012, Mol Cell Proteomics, 11: 492-500), where the integrated datasets of Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus were downloaded.
Example 12 - Gene expression normalization
Gene expression was normalized based on rank. Per species one array platform was used and per species probes were ranked according to their intensities. The average rank per probe was used as a measure of overall gene expression to distinguish genes with overall low and high expression levels for each species.
Example 13 - mRNA Sequences
The coding sequences (CDS) of all genes of 5 species were downloaded from sequence/genome repositories. For Escherichia coli, the CDS of strains CFT073, EDL933, MG1655 and Sakai were obtained from NCBI, accessions NC_004431.1, NC_002655.1, NC_U00096.3 and NC_002695.1, respectively. For Arabidopsis thaliana, the CDS of the 20101108 release were obtained from TAIR (Lamesch et al., 2012, Nucleic Acids Research 40: D1202-1210). For Saccharomyces cerevisiae, the open reading frames (without UTR, introns, etc.) of the 20110203 release were obtained from the Saccharomyces genome database (Cherry et al., 2012, Nucleic Acids Research 40: D700-705). For Caenorhabditis elegans, the CDS of WS241 were obtained from WormBase (Yook et al., 2012, Nucleic Acids Research 40: D735-741). For Mus musculus, the CDS of the 20130508 release (GRCm38.p1) were obtained from the NCBI CCDS database (Farrell et al., 2014, Nucleic Acids Research 42: D865-872).
Example 10 - mRNA folding
The mRNAs of all species were folded using Vienna RNAfold (Lorenz et al., 2011, Algorithms for Molecular Biology 6: 26) at 20 °C, using the parameters of Andronescu et al. (Andronescu et al., 2007, Bioinformatics 23: i19-28). The M. musculus mRNAs were also folded at 37 °C and the S. cerevisiae mRNAs also at 30 °C, but all the reported comparisons are based on 20 °C.
Example 11 - mRNA sequence and structure statistics
Several statistics were taken from the mRNA sequence: gene length, codon usage, and nucleotide usage. Also from the predicted mRNA structure several statistics were taken: number of bound nucleotides, number of free nucleotides, average stem size, average loop size, variation in stem size, variation in loop size, and energy of the structure.
Example 12 - Gene expression and mRNA folding statistics
The correlations between gene expression and the various mRNA-based statistics were calculated as Spearman correlations (in R 3.0.2 x64). For some of the factors a correction for gene length was applied: number of bound nucleotides, number of unbound nucleotides, energy of the structure, number of stems, number of loops, triplet usage, nucleotide usage, and amino acid usage.
For expression codon analysis, the frequencies of use of synonymous codons were calculated. This was done over a receding window, from 50% highest versus 50% lowest until 5% highest versus 5% lowest, in increments of 1%.
Example 13 - Sequences used for transformation
A novel aspect of our finding is that the selection of mRNA structures with the most even distribution of stems and loops leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Below is an example procedure used to select the optimal mRNA structure for improved functional expression in a host cell of interest.
The first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (> 100,000 mRNA variants).
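Generating the variant pool amounts to taking the Cartesian product of the synonymous codons at each position. The sketch below is a minimal illustration using the standard genetic code; the names `SYNONYMS`, `synonymous_variants` and `variant_count` are hypothetical. Because the number of combinations grows exponentially with protein length, the generator yields variants lazily so that a caller can sample or cap the pool instead of enumerating everything.

```python
from itertools import islice, product

# Standard genetic code, sense codons grouped by amino acid (one-letter code).
SYNONYMS = {
    "A": ["GCT", "GCC", "GCA", "GCG"], "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N": ["AAT", "AAC"], "D": ["GAT", "GAC"], "C": ["TGT", "TGC"],
    "Q": ["CAA", "CAG"], "E": ["GAA", "GAG"], "G": ["GGT", "GGC", "GGA", "GGG"],
    "H": ["CAT", "CAC"], "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"], "K": ["AAA", "AAG"],
    "M": ["ATG"], "F": ["TTT", "TTC"], "P": ["CCT", "CCC", "CCA", "CCG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"], "T": ["ACT", "ACC", "ACA", "ACG"],
    "W": ["TGG"], "Y": ["TAT", "TAC"], "V": ["GTT", "GTC", "GTA", "GTG"],
}


def synonymous_variants(protein):
    """Lazily yield every coding sequence that encodes `protein`."""
    for combo in product(*(SYNONYMS[aa] for aa in protein)):
        yield "".join(combo)


def variant_count(protein):
    """Number of distinct coding sequences for `protein`."""
    count = 1
    for aa in protein:
        count *= len(SYNONYMS[aa])
    return count


if __name__ == "__main__":
    peptide = "MKLSR"
    print(variant_count(peptide))
    # Take only the first few variants rather than materialising the whole pool.
    print(list(islice(synonymous_variants(peptide), 3)))
```

Each variant produced this way would then be passed to the folding step, so in practice the size of the pool is bounded by the available folding capacity.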
The second step is in silico folding of all mRNA species in the pool under the temperature and salt concentrations relevant for the preferred host. The third step is the selection of mRNAs from the pool that meet the following criteria:
(that is, the selection of mRNAs that have the most even distribution of stems and loops, which can be identified by the criteria described below.)
For A. thaliana
1. average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)
2. average stem size is below 5.20 bp (or between 5.20 and 2.5 bp)
3. average loop size is below 3.32 bp (or between 3.32 and 3 bp)
4. the standard deviation of the loop size is below 3.20 (or between 3.20 and 2 bp) (measure for even distribution)
5. the standard deviation of the stem size is below 3.40 (or between 3.40 and 2 bp) (measure for even distribution)
6. maximum loop size is below 18 bp (discard uneven stem loop distributions)
7. maximum stem size is below 19 bp (discard uneven stem loop distributions)
For C. elegans
1. average number of stem-loop transitions is above 114 per 1,000 bp (or between 114 and 250 per 1,000 bp)
2. average stem size is below 5.35 bp (or between 5.35 and 2.5 bp)
3. average loop size is below 3.47 bp (or between 3.47 and 3 bp)
4. the standard deviation of the loop size is below 3.37 (or between 3.37 and 2 bp)
5. the standard deviation of the stem size is below 3.27 (or between 3.27 and 2 bp)
6. maximum loop size is below 20 bp
7. maximum stem size is below 18 bp
For E. coli
1. average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)
2. average stem size is below 5.45 bp (or between 5.45 and 2.5 bp)
3. average loop size is below 3.16 bp (or between 3.16 and 2 bp)
4. the standard deviation of the loop size is below 2.95 (or between 2.95 and 2 bp)
5. the standard deviation of the stem size is below 3.50 (or between 3.50 and 2 bp)
6. maximum loop size is below 16 bp
7. maximum stem size is below 18 bp
For M. musculus
1. average number of stem-loop transitions is above 120 per 1,000 bp (or between 120 and 250 per 1,000 bp)
2. average stem size is below 4.35 bp (or between 4.35 and 2.5 bp)
3. average loop size is below 5.18 bp (or between 5.18 and 4 bp)
4. the standard deviation of the loop size is below 3.00 (or between 3.00 and 2 bp)
5. the standard deviation of the stem size is below 3.28 (or between 3.28 and 2 bp)
6. maximum loop size is below 18 bp
7. maximum stem size is below 19 bp
For S. cerevisiae
1. average number of stem-loop transitions is above 110 per 1,000 bp (or between 110 and 250 per 1,000 bp)
2. average stem size is below 5.27 bp (or between 5.27 and 2.5 bp)
3. average loop size is below 3.77 bp (or between 3.77 and 3 bp)
4. the standard deviation of the loop size is below 3.65 (or between 3.65 and 2 bp)
5. the standard deviation of the stem size is below 3.25 (or between 3.25 and 2 bp)
6. maximum loop size is below 20 bp
7. maximum stem size is below 19 bp
After step 3, where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred.
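The third step can be sketched as a simple threshold filter. The thresholds below are copied from the per-host lists above; everything else (the `CRITERIA` layout, the function name `meets_criteria`, the statistic key names and the example values) is hypothetical. Only the simple "above/below" form of each criterion is applied; the optional lower bounds given in parentheses in the lists are not enforced.

```python
# host: (min transitions per 1,000 bp, max mean stem, max mean loop,
#        max loop SD, max stem SD, max loop size, max stem size)
CRITERIA = {
    "A. thaliana":   (116, 5.20, 3.32, 3.20, 3.40, 18, 19),
    "C. elegans":    (114, 5.35, 3.47, 3.37, 3.27, 20, 18),
    "E. coli":       (116, 5.45, 3.16, 2.95, 3.50, 16, 18),
    "M. musculus":   (120, 4.35, 5.18, 3.00, 3.28, 18, 19),
    "S. cerevisiae": (110, 5.27, 3.77, 3.65, 3.25, 20, 19),
}


def meets_criteria(stats, host):
    """Return True if the structure statistics pass all seven host criteria."""
    (min_transitions, max_stem_mean, max_loop_mean,
     max_loop_sd, max_stem_sd, max_loop, max_stem) = CRITERIA[host]
    return (stats["transitions_per_1000"] > min_transitions
            and stats["stem_mean"] < max_stem_mean
            and stats["loop_mean"] < max_loop_mean
            and stats["loop_sd"] < max_loop_sd
            and stats["stem_sd"] < max_stem_sd
            and stats["loop_max"] < max_loop
            and stats["stem_max"] < max_stem)


# Made-up statistics for one candidate mRNA variant.
candidate = {
    "transitions_per_1000": 123.4,
    "stem_mean": 5.1,
    "loop_mean": 3.0,
    "loop_sd": 2.5,
    "stem_sd": 3.1,
    "loop_max": 12,
    "stem_max": 15,
}
selected_for_e_coli = meets_criteria(candidate, "E. coli")
```

With these illustrative values the candidate passes the E. coli criteria but fails the M. musculus criteria, because its mean stem size exceeds 4.35 bp.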
Example 14 - Sequences used for transformation
All ORFs
GFP-720bp
>GFPnat [SEQ ID NO: 10]
atggccAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGG CGATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAA AACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCcTGGCCAACACTTGTC ACTACTCTCACCTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGA CTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTgCAGGAAAGAACTATATTTTTCAAAGATG ACGGtAACTACAAGACcCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATC GAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAACTCGAATACAA CTATAACTCACATAATGTATACATCATGGCcGACAAACAGAAGAATGGAATCAAAGTTAACT TCAAAATTAGACACAACATTGAGGATGGAAGCGTTCAATTAGCAGACCATTATCAACAAAAT ACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGC CCTT CCAAAGATCCCAACGAAAAGAGAGATCACATGGTCCTTCTTGAGTTTGTAACAGCTG CTGGGATTACACTCGGCATGGATGAACTATACAAATAA
>GFPopt [SEQ ID NO: 11]
atggccTCCAAGGGTGAGGAGCTCTTCACCGGTGTCGTCCCCATCCTCGTCGAGCTCGACGG TGACGTCAACGGTCACAAGTTCTCCGTCTCCGGTGAGGGTGAGGGTGACGCCACCTACGGTA AGCTCACCCTCAAGTTCATCTGCACCACCGGTAAGCTCCCCGTCCCCTGGCCCACCCTCGTC ACCACCCTCACCTACGGTGTCCAGTGCTTCTCCCGTTACCCCGACCACATGAAGCAGCACGA CTTCTTCAAGTCCGCCATGCCCGAGGGTTACGTCCAGGAGCGTACCATCTTCTTCAAGGACG ACGGTAACTACAAGACCCGTGCCGAGGTCAAGTTCGAGGGTGACACCCTCGTCAACCGTATC GAGCTCAAGGGTATCGACTTCAAGGAGGACGGTAACATCCTCGGTCACAAGCTCGAGTACAA CTACAACTCCCACAACGTCTACATCATGGCCGACAAGCAGAAGAACGGTATCAAGGTCAACT TCAAGATCCGTCACAACATCGAGGACGGTTCCGTCCAGCTCGCCGACCACTACCAGCAGAAC ACCCCCATCGGTGACGGTCCCGTCCTCCTCCCCGACAACCACTACCTCTCCACCCAGTCCGC CCTCTCCAAGGACCCCAACGAGAAGCGTGACCACATGGTCCTCCTCGAGTTCGTCACCGCCG CCGGTATCACCCTCGGTATGGACGAGCTCTACAAGTAA
SP-AvGFP-786bp
>chitSPoptGFPnat [SEQ ID NO: 12]
atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGtcatggccAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAG ATGGCGATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATAC GGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCcTGGCCAACACT TGTCACTACTCTCACCTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGC ATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTgCAGGAAAGAACTATATTTTTCAAA GATGACGGtAACTACAAGACcCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAG AATCGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAACTCGAAT ACAACTATAACTCACATAATGTATACATCATGGCcGACAAACAGAAGAATGGAATCAAAGTT AACTTCAAAATTAGACACAACATTGAGGATGGAAGCGTTCAATTAGCAGACCATTATCAACA AAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAAT CTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGGTCCTTCTTGAGTTTGTAACA GCTGCTGGGATTACACTCGGCATGGATGAACTATACAAATAA
>chitSPoptGFPopt [SEQ ID NO: 13]
atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGtcatggccTCCAAGGGTGAGGAGCTCTTCACCGGTGTCGTCCCCATCCTCGTCGAGCTCG ACGGTGACGTCAACGGTCACAAGTTCTCCGTCTCCGGTGAGGGTGAGGGTGACGCCACCTAC GGTAAGCTCACCCTCAAGTTCATCTGCACCACCGGTAAGCTCCCCGTCCCCTGGCCCACCCT CGTCACCACCCTCACCTACGGTGTCCAGTGCTTCTCCCGTTACCCCGACCACATGAAGCAGC ACGACTTCTTCAAGTCCGCCATGCCCGAGGGTTACGTCCAGGAGCGTACCATCTTCTTCAAG GACGACGGTAACTACAAGACCCGTGCCGAGGTCAAGTTCGAGGGTGACACCCTCGTCAACCG TATCGAGCTCAAGGGTATCGACTTCAAGGAGGACGGTAACATCCTCGGTCACAAGCTCGAGT ACAACTACAACTCCCACAACGTCTACATCATGGCCGACAAGCAGAAGAACGGTATCAAGGTC AACTTCAAGATCCGTCACAACATCGAGGACGGTTCCGTCCAGCTCGCCGACCACTACCAGCA GAACACCCCCATCGGTGACGGTCCCGTCCTCCTCCCCGACAACCACTACCTCTCCACCCAGT CCGCCCTCTCCAAGGACCCCAACGAGAAGCGTGACCACATGGTCCTCCTCGAGTTCGTCACC GCCGCCGGTATCACCCTCGGTATGGACGAGCTCTACAAGTAA
mIL-10-540bp
>chitSPopt-IL-10nat [SEQ ID NO: 14]
atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGcccagtacagccgggaagacaatAACtgcacccacttcccagtcggccagagccacatgc tcctagagctgcggactgccttcagccaggtgaagactttctttcaaacaaaggaccagctg gacaacatactgctaaccgactccttaatgcaggactttaagggttacttgggttgccaagc cttatcggaaatgatccagttttacctggtagaagtgatgccccaggcagagaagcatggcc cagaaatcaaggagcatttgaattccctgggtgagaagctgaagaccctcaggatgcggctg aggcgctgtcatcgatttctcccctgtgaaaataagagcaaggcagtggagcaggtgaagag tgattttaataagctccaagaccaaggtgtctacaaggccatgaatgaatttgacatcttca tcaactgcatagaagcatacatgatgatcaaaatgaaaagctaa
>chitSPopt-mIL-10opt [SEQ ID NO: 15]
atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGcccagtactcccgtgaggacaacaactgcacccacttccccgtcggtcagtcccacatgc tcctcgagctccgtaccgccttctcccaggtcaagaccttcttccagaccaaggaccagctc gacaacatcctcctcaccgactccctcatgcaggacttcaagggttacctcggttgccaggc cctctccgagatgatccagttctacctcgtcgaggtcatgccccaggccgagaagcacggtc ccgagatcaaggagcacctcaactccctcggtgagaagctcaagaccctccgtatgcgtctc cgtcgttgccaccgtttcctcccctgcgagaacaagtccaaggccgtcgagcaggtcaagtc cgacttcaacaagctccaggaccagggtgtctacaaggccatgaacgagttcgacatcttca tcaactgcatcgaggcctacatgatgatcaagatgaagtcctga
OVA-1221bp
>chitSPoptOVAnat (only with pIVT) [SEQ ID NO: 16]
atg (gcc) AAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCg GCCGGCTCCATCGGCGCAGCAAGCATGGAATTTTGTTTTGATGTATTCAAGGAGCTCAAAGT CCACCATGCCAATGAGAACATCTTCTACTGCCCCATTGCCATCATGTCAGCTCTAGCCATGG TATACCTGGGTGCAAAAGACAGCACCAGGACACAGATAAATAAGGTTGTTCGCTTTGATAAA CTTCCAGGATTCGGAGACAGTATTGAAGCTCAGTGTGGCACATCTGTAAACGTTCACTCTTC ACTTAGAGACATCCTCAACCAAATCACCAAACCAAATGATGTTTATTCGTTCAGCCTTGCCA GTAGACTTTATGCTGAAGAGAGATACCCAATCCTGCCAGAATACTTGCAGTGTGTGAAGGAA CTGTATAGAGGAGGCTTGGAACCTATCAACTTTCAAACAGCTGCAGATCAAGCCAGAGAGCT CATCAATTCCTGGGTAGAAAGTCAGACAAATGGAATTATCAGAAATGTCCTTCAGCCAAGCT CCGTGGATTCTCAAACTGCAATGGTTCTGGTTAATGCCATTGTCTTCAAAGGACTGTGGGAG AAAACATTTAAGGATGAAGACACACAAGCAATGCCTTTCAGAGTGACTGAGCAAGAAAGCAA ACCTGTGCAGATGATGTACCAGATTGGTTTATTTAGAGTGGCATCAATGGCTTCTGAGAAAA TGAAGATCCTGGAGCTTCCATTTGCCAGTGGGACAATGAGCATGTTGGTGCTGTTGCCTGAT GAAGTCTCAGGCCTTGAGCAGCTTGAGAGTATAATCAACTTTGAAAAACTGACTGAATGGAC CAGTTCTAATGTTATGGAAGAGAGGAAGATCAAAGTGTACTTACCTCGCATGAAGATGGAGG AAAAATACAACCTCACATCTGTCTTAATGGCTATGGGCATTACTGACGTGTTTAGCTCTTCA GCCAATCTGTCTGGCATCTCCTCAGCAGAGAGCCTGAAGATtTCTCAAGCTGTCCATGCAGC ACATGCAGAAATCAATGAAGCAGGCAGAGAGGTGGTAGGGTCAGCAGAGGCTGGAGTGGATG CTGCAAGCGTCTCTGAAGAATTTAGGGCTGACCATCCATTCCTCTTCTGTATCAAGCACATC GCAACCAACGCCGTTCTCTTCTTTGGCAGATGTGTTTCCCCTTAA
>chitSPoptOVAopt [SEQ ID NO: 17]
atg (gcc) AAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCg GCCGGTTCCATCGGTGCCGCCAGCATGGAGTTCTGCTTCGACGTCTTCAAGGAGCTCAAGGT CCACCACGCCAACGAGAACATCTTCTACTGCCCCATCGCCATCATGTCCGCCCTCGCTATGG TCTACCTCGGTGCCAAGGACTCCACCCGTACCCAGATCAACAAGGTCGTCCGTTTCGACAAG CTCCCCGGTTTCGGTGACTCCATCGAGGCCCAGTGCGGTACTTCCGTCAACGTCCACTCCTC CCTCCGTGACATCCTCAACCAGATCACCAAGCCCAACGACGTCTACTCCTTCTCCCTCGCCT CCCGTCTCTACGCCGAGGAGCGTTACCCCATCCTCCCCGAGTACCTCCAGTGCGTCAAGGAG CTCTACCGTGGTGGTCTCGAGCCCATCAACTTCCAGACCGCCGCCGACCAGGCCCGTGAGCT CATCAACTCCTGGGTCGAGTCCCAGACCAACGGTATCATCCGTAACGTCCTCCAGCCCTCCT CCGTCGACTCCCAGACCGCTATGGTCCTCGTCAACGCCATCGTCTTCAAGGGTCTCTGGGAG AAGaCCTTCAAGGACGAGGACACCCAGGCCATGCCCTTCCGTGTCACCGAGCAGGAGTCCAA GCCCGTCCAGATGATGTACCAGATCGGTCTCTTCCGTGTCGCCAGCATGGCCTCCGAGAAGA TGAAGATCCTCGAGCTCCCCTTCGCCTCCGGTACTATGTCCATGCTCGTCCTCCTCCCCGAC GAGGTCTCCGGTCTCGAGCAGCTCGAGTCCATCATCAACTTCGAGAAGCTCACCGAGTGGAC CTCCTCCAACGTCATGGAGGAGCGTAAGATCAAGGTCTACCTCCCCCGTATGAAGATGGAGG AGAAGTACAACCTCACCTCCGTCCTCATGGCTATGGGTATCACCGACGTCTTCTCCTCCTCC GCCAACCTCTCCGGTATCTCCTCCGCCGAGTCCCTCAAGATCTCCCAGGCCGTCCACGCCGC CCACGCCGAGATCAACGAGGCCGGTCGTGAGGTCGTCGGTTCCGCCGAGGCCGGTGTCGACG CCGCCTCCGTCTCCGAGGAGTTCCGTGCCGACCACCCCTTCCTCTTCTGCATCAAGCACATC GCCACCAACGCCGTCCTCTTCTTCGGTCGTTGCGTCTCCCCCTAA
E. coli S. cerevisiae C. elegans A. thaliana M. musculus
Strains/ecotypes 1 13 14 8 9
Samples 168 316 391 415 111
Controls 105 211 109 101 565
Papers 8 9 9 9 9
Treatments 20 14 29 73 21
Tissues 1 1 3 1 1 28
Additional remarks:
> Different strains/mutants and tissues receiving the same experimental treatment are counted as a single treatment; all measurements in a time series are counted as a single treatment
> M. musculus data sets Thorrez et al., 2009 and Xue et al., 2013 do not include the control spot on the slide in their datasets
> E. coli expression values from the Dong and Schellhorn 2009 dataset were rounded off to a single decimal and from the Ito et al., 2009 dataset to two decimals
Table 1A. Overview of the gathered expression data per species.
Figure imgf000076_0001
Figure imgf000077_0001
Figure imgf000078_0001
Figure imgf000079_0001
Figure imgf000080_0001
Figure imgf000081_0001
Table 1C. Description of the gathered S. cerevisiae expression data.
Figure imgf000082_0001
Figure imgf000083_0001
Figure imgf000084_0001
Table 1D. Description of the gathered C. elegans expression data.
Figure imgf000085_0001
Figure imgf000086_0001
Figure imgf000087_0001
Table 1E. Description of the gathered A. thaliana expression data.
Figure imgf000088_0001
Figure imgf000089_0001
Figure imgf000090_0001
Table 1F. Description of the gathered M. musculus expression data.
E. coli S. cerevisiae C. elegans A. thaliana M. musculus
Gene length -0.146 -0.041 0.093 0.030 -0.016
Energy (kcal.mol/nt) -0.006 -0.319 -0.316 -0.229 0.006
Bound nt (fraction) 0.038 0.236 0.061 0.172 0.015
Mean stem size -0.111 0.054 -0.182 0.053 -0.055
Mean loop size -0.115 -0.241 -0.179 -0.155 -0.046
Transitions /nt 0.140 0.144 0.227 0.071 0.069
Table 2. Correlation between mRNA structure characteristics and gene expression per species. The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with expression.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.648 0.871 0.643 1.355
TAG 0.067 0.021 0.064 0.328
TGA 0.285 0.107 0.293 0.365
A GCT 0.160 0.332 0.154 2.156
GCC 0.266 0.139 0.275 0.505
GCA 0.209 0.258 0.216 1.194
GCG 0.365 0.271 0.356 0.761
C TGT 0.435 0.385 0.466 0.826
TGC 0.565 0.615 0.534 1.152
D GAT 0.617 0.429 0.649 0.661
GAC 0.383 0.571 0.351 1.627
E GAA 0.693 0.760 0.681 1.116
GAG 0.307 0.240 0.319 0.752
F TTT 0.562 0.290 0.615 0.472
TTC 0.438 0.710 0.385 1.844
G GGT 0.343 0.527 0.327 1.612
GGC 0.413 0.406 0.395 1.028
GGA 0.098 0.031 0.116 0.267
GGG 0.146 0.036 0.162 0.222
H CAT 0.557 0.295 0.591 0.499
CAC 0.443 0.705 0.409 1.724
I ATT 0.503 0.302 0.530 0.570
ATC 0.434 0.688 0.380 1.811
ATA 0.063 0.010 0.090 0.111
K AAA 0.770 0.768 0.793 0.968
AAG 0.230 0.232 0.207 1.121
L TTA 0.124 0.042 0.155 0.271
TTG 0.124 0.059 0.130 0.454
CTT 0.100 0.065 0.112 0.580
CTC 0.103 0.068 0.106 0.642
CTA 0.035 0.008 0.041 0.195
CTG 0.515 0.758 0.457 1.659
M ATG 1.000 1.000 1.000 1.000
N AAT 0.432 0.182 0.486 0.374
AAC 0.568 0.818 0.514 1.591
P CCT 0.152 0.140 0.165 0.848
CCC 0.115 0.025 0.137 0.182
CCA 0.185 0.134 0.199 0.673
CCG 0.547 0.702 0.498 1.410
Q CAA 0.337 0.213 0.369 0.577
CAG 0.663 0.787 0.631 1.247
R CGT 0.396 0.636 0.363 1.752
CGC 0.410 0.332 0.410 0.810
CGA 0.058 0.010 0.071 0.141
CGG 0.089 0.011 0.094 0.117
AGA 0.030 0.007 0.044 0.159
AGG 0.016 0.004 0.019 0.211
S TCT 0.150 0.323 0.132 2.447
TCC 0.155 0.256 0.136 1.882
TCA 0.117 0.058 0.123 0.472
TCG 0.155 0.057 0.171 0.333
AGT 0.143 0.060 0.158 0.380
AGC 0.280 0.247 0.280 0.882
T ACT 0.168 0.328 0.167 1.964
ACC 0.449 0.508 0.409 1.242
ACA 0.120 0.048 0.154 0.312
ACG 0.263 0.116 0.270 0.430
V GTT 0.257 0.436 0.258 1.690
GTC 0.214 0.113 0.219 0.516
GTA 0.152 0.225 0.153 1.471
GTG 0.377 0.226 0.370 0.611
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.555 0.331 0.582 0.569
TAC 0.445 0.669 0.418 1.600
Table 6A. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Escherichia coli. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.480 0.731 0.403 1.814
TAG 0.225 0.117 0.290 0.403
TGA 0.295 0.152 0.307 0.495
A GCT 0.367 0.593 0.339 1.749
GCC 0.223 0.280 0.215 1.302
GCA 0.296 0.105 0.319 0.329
GCG 0.113 0.023 0.127 0.181
C TGT 0.627 0.829 0.594 1.396
TGC 0.373 0.171 0.406 0.421
D GAT 0.656 0.526 0.642 0.819
GAC 0.344 0.474 0.358 1.324
E GAA 0.701 0.854 0.699 1.222
GAG 0.299 0.146 0.301 0.485
F TTT 0.593 0.353 0.616 0.573
TTC 0.407 0.647 0.384 1.685
G GGT 0.455 0.823 0.387 2.127
GGC 0.197 0.093 0.197 0.472
GGA 0.224 0.051 0.279 0.183
GGG 0.124 0.033 0.138 0.239
H CAT 0.643 0.440 0.617 0.713
CAC 0.357 0.560 0.383 1.462
I ATT 0.463 0.522 0.469 1.113
ATC 0.258 0.430 0.236 1.822
ATA 0.280 0.048 0.295 0.163
K AAA 0.581 0.299 0.639 0.468
AAG 0.419 0.701 0.361 1.942
L TTA 0.279 0.216 0.244 0.885
TTG 0.283 0.567 0.251 2.259
CTT 0.127 0.057 0.163 0.350
CTC 0.057 0.014 0.086 0.163
CTA 0.142 0.103 0.143 0.720
CTG 0.112 0.043 0.113 0.381
M ATG 1.000 1.000 1.000 1.000
N AAT 0.598 0.303 0.594 0.510
AAC 0.402 0.697 0.406 1.717
P CCT 0.310 0.227 0.305 0.744
CCC 0.160 0.053 0.164 0.323
CCA 0.407 0.701 0.401 1.748
CCG 0.123 0.018 0.129 0.140
Q CAA 0.686 0.893 0.663 1.347
CAG 0.314 0.107 0.337 0.318
R CGT 0.140 0.201 0.131 1.534
CGC 0.058 0.017 0.078 0.218
CGA 0.068 0.001 0.088 0.011
CGG 0.040 0.002 0.064 0.031
AGA 0.478 0.724 0.420 1.724
AGG 0.217 0.055 0.218 0.252
S TCT 0.261 0.452 0.246 1.837
TCC 0.157 0.289 0.147 1.966
TCA 0.211 0.108 0.218 0.495
TCG 0.097 0.036 0.096 0.375
AGT 0.163 0.063 0.172 0.366
AGC 0.111 0.051 0.121 0.421
T ACT 0.343 0.482 0.333 1.447
ACC 0.210 0.352 0.213 1.653
ACA 0.307 0.133 0.325 0.409
ACG 0.140 0.034 0.129 0.264
V GTT 0.389 0.511 0.368 1.389
GTC 0.201 0.347 0.210 1.652
GTA 0.216 0.060 0.226 0.265
GTG 0.195 0.082 0.196 0.418
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.568 0.302 0.558 0.541
TAC 0.432 0.698 0.442 1.579
Table 6B. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Saccharomyces cerevisiae. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.496 0.694 0.439 1.581
TAG 0.179 0.141 0.162 0.870
TGA 0.325 0.165 0.399 0.414
A GCT 0.354 0.423 0.325 1.302
GCC 0.199 0.302 0.157 1.924
GCA 0.314 0.198 0.385 0.514
GCG 0.133 0.077 0.134 0.575
C TGT 0.555 0.447 0.588 0.760
TGC 0.445 0.553 0.412 1.342
D GAT 0.679 0.631 0.693 0.911
GAC 0.321 0.369 0.307 1.202
E GAA 0.621 0.534 0.671 0.796
GAG 0.379 0.466 0.329 1.416
F TTT 0.481 0.261 0.605 0.431
TTC 0.519 0.739 0.395 1.871
G GGT 0.204 0.168 0.214 0.785
GGC 0.124 0.086 0.134 0.642
GGA 0.592 0.711 0.544 1.307
GGG 0.080 0.035 0.109 0.321
H CAT 0.611 0.513 0.649 0.790
CAC 0.389 0.487 0.351 1.387
I ATT 0.534 0.470 0.538 0.874
ATC 0.314 0.478 0.226 2.115
ATA 0.152 0.052 0.236 0.220
K AAA 0.588 0.381 0.665 0.573
AAG 0.412 0.619 0.335 1.848
L TTA 0.110 0.049 0.169 0.290
TTG 0.234 0.212 0.258 0.822
CTT 0.249 0.306 0.214 1.430
CTC 0.174 0.280 0.116 2.414
CTA 0.091 0.042 0.112 0.375
CTG 0.142 0.112 0.133 0.842
M ATG 1.000 1.000 1.000 1.000
N AAT 0.625 0.484 0.655 0.739
AAC 0.375 0.516 0.345 1.496
P CCT 0.178 0.126 0.220 0.573
CCC 0.088 0.054 0.100 0.540
CCA 0.532 0.691 0.494 1.399
CCG 0.202 0.130 0.186 0.699
Q CAA 0.651 0.650 0.679 0.957
CAG 0.349 0.350 0.321 1.090
R CGT 0.217 0.350 0.150 2.333
CGC 0.096 0.175 0.067 2.612
CGA 0.236 0.146 0.231 0.632
CGG 0.091 0.046 0.098 0.469
AGA 0.288 0.250 0.357 0.700
AGG 0.071 0.032 0.097 0.330
S TCT 0.206 0.235 0.214 1.098
TCC 0.130 0.177 0.112 1.580
TCA 0.257 0.205 0.273 0.751
TCG 0.156 0.169 0.125 1.352
AGT 0.149 0.104 0.173 0.601
AGC 0.102 0.109 0.103 1.058
T ACT 0.324 0.346 0.329 1.052
ACC 0.175 0.297 0.144 2.062
ACA 0.345 0.249 0.383 0.650
ACG 0.156 0.108 0.143 0.755
V GTT 0.388 0.413 0.407 1.015
GTC 0.220 0.320 0.168 1.905
GTA 0.158 0.097 0.191 0.508
GTG 0.234 0.170 0.234 0.726
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.559 0.414 0.631 0.656
TAC 0.441 0.586 0.369 1.588
Table 6C. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Caenorhabditis elegans. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.345 0.371 0.263 1.411
TAG 0.204 0.194 0.194 1.000
TGA 0.451 0.435 0.543 0.801
A GCT 0.432 0.498 0.383 1.300
GCC 0.161 0.171 0.174 0.983
GCA 0.263 0.221 0.278 0.795
GCG 0.144 0.110 0.164 0.671
C TGT 0.593 0.561 0.591 0.949
TGC 0.407 0.439 0.409 1.073
D GAT 0.674 0.644 0.662 0.973
GAC 0.326 0.356 0.338 1.053
E GAA 0.511 0.442 0.523 0.845
GAG 0.489 0.558 0.477 1.170
F TTT 0.502 0.427 0.515 0.829
TTC 0.498 0.573 0.485 1.181
G GGT 0.334 0.398 0.316 1.259
GGC 0.141 0.119 0.152 0.783
GGA 0.371 0.367 0.387 0.948
GGG 0.154 0.115 0.145 0.793
H CAT 0.606 0.526 0.612 0.859
CAC 0.394 0.474 0.388 1.222
I ATT 0.400 0.429 0.375 1.144
ATC 0.363 0.432 0.373 1.158
ATA 0.236 0.139 0.252 0.552
K AAA 0.490 0.385 0.517 0.745
AAG 0.510 0.615 0.483 1.273
L TTA 0.135 0.082 0.148 0.554
TTG 0.220 0.233 0.229 1.017
CTT 0.257 0.290 0.248 1.169
CTC 0.181 0.207 0.172 1.203
CTA 0.105 0.080 0.121 0.661
CTG 0.102 0.108 0.082 1.317
M ATG 1.000 1.000 1.000 1.000
N AAT 0.502 0.430 0.489 0.879
AAC 0.498 0.570 0.511 1.115
P CCT 0.381 0.407 0.353 1.153
CCC 0.106 0.112 0.109 1.028
CCA 0.327 0.336 0.351 0.957
CCG 0.186 0.146 0.186 0.785
Q CAA 0.564 0.465 0.648 0.718
CAG 0.436 0.535 0.352 1.520
R CGT 0.168 0.241 0.161 1.497
CGC 0.070 0.077 0.068 1.132
CGA 0.118 0.087 0.120 0.725
CGG 0.092 0.059 0.086 0.686
AGA 0.352 0.301 0.363 0.829
AGG 0.199 0.234 0.202 1.158
S TCT 0.280 0.303 0.253 1.198
TCC 0.129 0.147 0.127 1.157
TCA 0.204 0.178 0.212 0.840
TCG 0.108 0.100 0.114 0.877
AGT 0.151 0.139 0.158 0.880
AGC 0.127 0.134 0.135 0.993
T ACT 0.334 0.374 0.300 1.247
ACC 0.207 0.260 0.213 1.221
ACA 0.302 0.253 0.313 0.808
ACG 0.157 0.114 0.175 0.651
V GTT 0.400 0.432 0.372 1.161
GTC 0.193 0.219 0.199 1.101
GTA 0.145 0.095 0.157 0.605
GTG 0.262 0.253 0.271 0.934
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.504 0.418 0.508 0.823
TAC 0.496 0.582 0.492 1.183
Table 6D. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Arabidopsis thaliana. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.258 0.351 0.323 1.087
TAG 0.235 0.222 0.253 0.877
TGA 0.507 0.427 0.424 1.007
A GCT 0.289 0.320 0.316 1.013
GCC 0.377 0.331 0.340 0.974
GCA 0.232 0.246 0.266 0.925
GCG 0.101 0.103 0.078 1.321
C TGT 0.476 0.516 0.507 1.018
TGC 0.524 0.484 0.493 0.982
D GAT 0.450 0.521 0.500 1.042
GAC 0.550 0.479 0.500 0.958
E GAA 0.412 0.466 0.495 0.941
GAG 0.588 0.534 0.505 1.057
F TTT 0.445 0.507 0.499 1.016
TTC 0.555 0.493 0.501 0.984
G GGT 0.175 0.208 0.197 1.056
GGC 0.332 0.319 0.287 1.111
GGA 0.257 0.272 0.313 0.869
GGG 0.236 0.201 0.204 0.985
H CAT 0.410 0.468 0.472 0.992
CAC 0.590 0.532 0.528 1.008
I ATT 0.343 0.404 0.362 1.116
ATC 0.495 0.448 0.419 1.069
ATA 0.162 0.148 0.219 0.676
K AAA 0.398 0.407 0.471 0.864
AAG 0.602 0.593 0.529 1.121
L TTA 0.068 0.089 0.095 0.937
TTG 0.132 0.152 0.152 1.000
CTT 0.132 0.154 0.154 1.000
CTC 0.194 0.169 0.176 0.960
CTA 0.079 0.079 0.092 0.859
CTG 0.396 0.357 0.331 1.079
M ATG 1.000 1.000 1.000 1.000
N AAT 0.436 0.481 0.501 0.960
AAC 0.564 0.519 0.499 1.040
P CCT 0.306 0.335 0.316 1.060
CCC 0.298 0.250 0.275 0.909
CCA 0.288 0.310 0.323 0.960
CCG 0.108 0.105 0.086 1.221
Q CAA 0.253 0.258 0.350 0.737
CAG 0.747 0.742 0.650 1.142
R CGT 0.084 0.105 0.080 1.312
CGC 0.170 0.153 0.122 1.254
CGA 0.123 0.145 0.104 1.394
CGG 0.194 0.179 0.128 1.398
AGA 0.213 0.232 0.318 0.730
AGG 0.216 0.186 0.249 0.747
S TCT 0.193 0.222 0.220 1.009
TCC 0.211 0.195 0.188 1.037
TCA 0.143 0.149 0.170 0.876
TCG 0.054 0.057 0.039 1.462
AGT 0.156 0.171 0.174 0.983
AGC 0.243 0.206 0.209 0.986
T ACT 0.249 0.273 0.275 0.993
ACC 0.345 0.313 0.312 1.003
ACA 0.295 0.314 0.328 0.957
ACG 0.111 0.099 0.085 1.165
V GTT 0.174 0.225 0.217 1.037
GTC 0.245 0.215 0.241 0.892
GTA 0.119 0.138 0.146 0.945
GTG 0.461 0.423 0.395 1.071
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.423 0.481 0.498 0.966
TAC 0.577 0.519 0.502 1.034
Table 6E. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Mus musculus (Animalia). Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
Top 5% Top 5% Top 5% Top 5%
Trait Organism Stem_size_mean Stem_size_sd Stem_size_max Stem_size_min
Protein abundance A. thaliana 5.197742798 3.333316648 18.60493827 1.082304527
Gene expression A. thaliana 5.264773876 3.354989119 18.67107195 1.118942731
Protein abundance C. elegans 4.949884209 3.035095428 16.98275862 1.161637931
Gene expression C. elegans 4.950296788 3.048596544 17.30588235 1.129411765
Protein abundance E. coli 5.127421075 3.127080268 17.00909091 1.227272727
Gene expression E. coli 5.157297589 3.162030121 17.54285714 1.214285714
Protein abundance M. musculus 5.063991554 3.236283472 18.29166667 1.078125
Gene expression M. musculus 5.081367307 3.237828152 18.43329098 1.095298602
Protein abundance S. cerevisiae 5.254440541 3.230034739 18.21167883 1.237226277
Gene expression S. cerevisiae 5.262132835 3.23936481 18.01766784 1.247349823
Table 9. Analysis of the mRNA secondary structure characteristics (stem architecture) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Figure imgf000101_0001
Table 10. Analysis of the mRNA secondary structure characteristics (loop architecture) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Top 5% Top 5% Top 5%
Trait Organism Bound_nt/1000 nt Energy_(kcal/mol)/1000 nt Transitions/1000 nt
Protein abundance A. thaliana 619.3179412 -292.7800618 119.7782583
Gene expression A. thaliana 624.0580406 -290.3511673 119.169408
Protein abundance C. elegans 598.4571065 -272.5292233 121.7225865
Gene expression C. elegans 596.5470187 -273.9225057 121.3996132
Protein abundance E. coli 627.3154158 -319.8163586 123.3964781
Gene expression E. coli 631.9373347 -327.7643057 123.4152453
Protein abundance M. musculus 616.1866207 -327.7746785 122.4372787
Gene expression M. musculus 612.9621408 -313.9661558 121.3436794
Protein abundance S. cerevisiae 606.4041095 -255.5194926 116.2875481
Gene expression S. cerevisiae 605.1063803 -255.9553594 115.8779268
Table 11. Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Figure imgf000102_0001
Gene expression S. cerevisiae 5.274054838 3.147229903 16.84751773 1.365248227
Protein abundance S. cerevisiae 5.34944781 3.244190265 19.52380952 1.102564103
Table 12. Analysis of the mRNA secondary structure characteristics (stem architecture) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Figure imgf000103_0001
Table 13. Analysis of the mRNA secondary structure characteristics (loop architecture) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Figure imgf000103_0002
Table 14. Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Figure imgf000104_0001
Table 15. Differences in the mRNA secondary structure characteristics (stem architecture) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Delta (top-bottom) Delta (top-bottom) Delta (top-bottom)
Trait Organism Loop_size_mean Loop_size_sd Loop_size_max
Gene expression A. thaliana -0.267692548 -0.347578028 -1.583865892
Protein abundance A. thaliana -0.175253334 -0.312156894 -4.174836883
Gene expression C. elegans -0.419326762 -0.52092485 -3.334682506
Protein abundance C. elegans -0.154072143 -0.309808645 -4.418251447
Gene expression E. coli -0.186479295 -0.31462739 -3.024198823
Protein abundance E. coli -0.19510469 -0.35983994 -4.111271298
Gene expression M. musculus -0.224252393 -0.288729208 -2.917011238
Protein abundance M. musculus 0.08059553 0.055306019 -2.037498481
Gene expression S. cerevisiae -0.778634452 -1.077665405 -3.962468292
Protein abundance S. cerevisiae -0.364963788 -0.580120518 -5.694456309
Table 16. Differences in the mRNA secondary structure characteristics (loop architecture) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Figure imgf000105_0001
Table 17. Differences in the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

Claims

1. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
a. providing a library of polynucleotides each of which vary at a minimum of a single codon position;
b. analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and
c. selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and
d. synthesising said polynucleotide.
2. A method as claimed in claim 1, wherein the method further comprises selecting a polynucleotide having a maximum stem size of less than 19 bp.
3. A method as claimed in claim 2, wherein the method further comprises selecting a polynucleotide having a maximum loop size of less than 20 bp.
4. A method as claimed in claim 3, wherein the host cell is a prokaryotic cell.
5. A method as claimed in claim 4, wherein the host cell is a bacterial cell.
6. A method as claimed in claim 5, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.45 bp and 2.50 bp.
7. A method as claimed in claim 5 or claim 6, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.16 bp and 2.00 bp.
8. A method as claimed in any of claims 5 to 7, wherein the host cell is an Escherichia coli cell.
9. A method as claimed in claim 3, wherein the host cell is a eukaryotic cell.
10. A method as claimed in claim 9, wherein the host cell is a plant cell.
11. A method as claimed in claim 10, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.20 bp and 2.50 bp.
12. A method as claimed in claim 10 or claim 11, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.27 bp and 3.00 bp.
13. A method as claimed in any of claims 10 to 12, wherein the host cell is an Arabidopsis cell, optionally an Arabidopsis thaliana cell.
14. A method as claimed in claim 9, wherein the host cell is a fungal cell.
15. A method as claimed in claim 14, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.27 bp and 2.50 bp.
16. A method as claimed in claim 14 or claim 15, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.77 and 3.00 bp.
17. A method as claimed in any of claims 14 to 16, wherein the host cell is a Saccharomyces cell, optionally a Saccharomyces cerevisiae cell.
18. A method as claimed in claim 9, wherein the host cell is an animal cell.
19. A method as claimed in claim 18, wherein the host cell is a nematode cell.
20. A method as claimed in claim 19, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.35 bp and 2.50 bp.
21. A method as claimed in claim 19 or claim 20, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.47 bp and 3.00 bp.
22. A method as claimed in any of claims 19 to 21, wherein the host cell is a Caenorhabditis elegans cell.
23. A method as claimed in claim 18, wherein the host cell is a mammalian cell.
24. A method as claimed in claim 23, wherein the method further comprises selecting a polynucleotide having a mean stem size between 4.35 bp and 2.50 bp.
25. A method as claimed in claim 23 or claim 24, wherein the method further comprises selecting a polynucleotide having a mean loop size between 5.18 bp and 4.00 bp.
26. A method as claimed in any of claims 23 to 25, wherein the host cell is a Mus musculus cell.
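By way of illustration only, the mean stem size and mean loop size recited in claims 6, 7, 11, 12, 15, 16, 20, 21, 24 and 25 could be computed from a predicted mRNA secondary structure written in dot-bracket notation, for example as sketched below in Python. The claims do not define how stems and loops are counted; this sketch assumes that a stem is a run of directly stacked base pairs and that a loop is the unpaired stretch closed by a hairpin. The function names, the dictionary keys and the combined check of both ranges are hypothetical and are not taken from the claims.

```python
from statistics import mean

# (low, high) bounds copied from the claims; the keys are illustrative labels.
CLAIMED_RANGES = {
    "claims 6-7": {"stem": (2.50, 5.45), "loop": (2.00, 3.16)},
    "plant, claims 11-12": {"stem": (2.50, 5.20), "loop": (3.00, 3.27)},
    "fungal, claims 15-16": {"stem": (2.50, 5.27), "loop": (3.00, 3.77)},
    "nematode, claims 20-21": {"stem": (2.50, 5.35), "loop": (3.00, 3.47)},
    "mammalian, claims 24-25": {"stem": (2.50, 4.35), "loop": (4.00, 5.18)},
}


def pair_table(structure):
    """Map every paired position of a dot-bracket string to its partner."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs


def stem_and_loop_sizes(structure):
    """Return (stem sizes in base pairs, hairpin loop sizes in nucleotides).

    Assumption: a stem is a maximal run of stacked pairs (i, j), (i+1, j-1), ...
    and a loop is the unpaired stretch enclosed by the innermost pair of a hairpin.
    """
    pairs = pair_table(structure)
    stems, loops = [], []
    for i in sorted(pairs):
        j = pairs[i]
        if i > j:
            continue  # visit each pair once, from its 5' side
        # A new stem starts where the pair is not stacked on (i-1, j+1).
        if pairs.get(i - 1) != j + 1:
            size = 0
            while i + size < j - size and pairs.get(i + size) == j - size:
                size += 1
            stems.append(size)
        # A hairpin loop is closed by a pair that encloses no other pair.
        if all(k not in pairs for k in range(i + 1, j)):
            loops.append(j - i - 1)
    return stems, loops


def within_claimed_range(structure, key):
    """Check mean stem size and mean loop size against one claimed range."""
    stems, loops = stem_and_loop_sizes(structure)
    if not stems or not loops:
        return False
    lo_s, hi_s = CLAIMED_RANGES[key]["stem"]
    lo_l, hi_l = CLAIMED_RANGES[key]["loop"]
    return lo_s <= mean(stems) <= hi_s and lo_l <= mean(loops) <= hi_l


example = "((((...))))..((....))"  # made-up structure, not from the application
print(stem_and_loop_sizes(example))
print(within_claimed_range(example, "fungal, claims 15-16"))
```

The structure string would normally come from a folding program; the example string here is invented purely to exercise the functions.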
27. A method as claimed in any of claims 4 to 26, wherein the method further comprises selecting a polynucleotide from a library of synonymous variants wherein the codon usage of the selected polynucleotide most closely matches the most abundant tRNAs in a particular host cell.
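As a non-limiting sketch of the selection step of claim 27, a library of synonymous variants could be ranked by the fraction of codons that belong to a set of codons read by the most abundant tRNAs of the host. The claim does not specify a scoring function; the fraction used here, the function names and the example codon set are assumptions made for illustration.

```python
def split_codons(seq):
    """Split a coding sequence into codons; the length must be a multiple of 3."""
    if len(seq) == 0 or len(seq) % 3:
        raise ValueError("coding sequence length must be a non-zero multiple of 3")
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def match_score(seq, preferred_codons):
    """Fraction of codons in seq that are in the preferred set."""
    codons = split_codons(seq)
    return sum(codon in preferred_codons for codon in codons) / len(codons)


def select_variant(variants, preferred_codons):
    """Return the synonymous variant with the highest match score."""
    return max(variants, key=lambda seq: match_score(seq, preferred_codons))


# Hypothetical data: a preferred-codon set and three variants encoding Lys-His-Ile.
PREFERRED = {"AAG", "CAC", "ATC"}
LIBRARY = ["AAACATATT", "AAGCATATC", "AAGCACATC"]
print(select_variant(LIBRARY, PREFERRED))
```

In practice the preferred set would be derived from measured tRNA abundances for the particular host cell; no such data are reproduced here.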
28. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000108_0001
the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
29. A method as claimed in claim 28, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000109_0001
30. A method as claimed in claim 28 or claim 29, wherein the host cell is a prokaryotic cell.
31. A method as claimed in claim 30, wherein the host cell is a bacterial cell.
32. A method as claimed in claim 31, wherein the host cell is an Escherichia coli cell.
33. A method as claimed in any of claims 30 to 32, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
; and/or:
; and/or:
Figure imgf000109_0002
; and/or:
; and/or:
Figure imgf000110_0001
; and/or:
Figure imgf000110_0002
; and/or:
Figure imgf000110_0003
; and/or:
; and/or:
Figure imgf000110_0004
; and/or:
Figure imgf000110_0005
; and/or:
Figure imgf000111_0001
34. A method as claimed in claim 33, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
35. A method as claimed in claim 28 or claim 29, wherein the host cell is a fungal cell.
36. A method as claimed in claim 35, wherein the host cell is a Saccharomyces cerevisiae cell.
37. A method as claimed in claim 35 or claim 36, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
; and/or:
Figure imgf000111_0002
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Isoleucine | ATA | ATC or ATT
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Alanine | GCA or GCG | GCT or GCC
; and/or:
Figure imgf000112_0001
; and/or:
Figure imgf000112_0002
; and/or:
Figure imgf000112_0003
; and/or:
; and/or:
Figure imgf000112_0004
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Glutamine | CAG | CAA
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Glutamic acid | GAG | GAA
38. A method as claimed in claim 37, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
39. A method as claimed in claim 28 or claim 29, wherein the host cell is a nematode cell.
40. A method as claimed in claim 39, wherein the host cell is a Caenorhabditis elegans cell.
41. A method as claimed in claim 39 or claim 40, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
Figure imgf000113_0001
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Isoleucine | ATA or ATT | ATC
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Threonine | ACT, ACA or ACG | ACC
; and/or:
Figure imgf000114_0001
; and/or:
Figure imgf000114_0002
; and/or:
Figure imgf000114_0003
; and/or:
; and/or:
Figure imgf000114_0004
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Cysteine | TGT | TGC
; and/or:
; and/or:
Figure imgf000115_0001
42. A method as claimed in claim 41, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
43. A method as claimed in claim 28 or claim 29, wherein the host cell is a Mus musculus cell.
44. A method as claimed in claim 43, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
Figure imgf000115_0002
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Alanine | GCC or GCA | GCG or GCT
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Proline | CCT, CCC or CCA | CCG
; and/or:
Figure imgf000116_0001
; and/or:
Figure imgf000116_0002
; and/or:
; and/or:
Figure imgf000116_0003
; and/or:
Figure imgf000116_0004
; and/or:
; and/or:
; and/or:
Figure imgf000117_0001
45. A method as claimed in claim 44, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
46. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Amino Acid | DNA Codon | Replacement Codon
Histidine | CAT | CAC
Lysine | AAA | AAG
Asparagine | AAT | AAC
Tyrosine | TAT | TAC
Stop Codon | TAG or TGA | TAA
Alanine | GCC, GCA or GCG | GCT
Glycine | GGC, GGA or GGG | GGT
Isoleucine | ATT or ATA | ATC
Arginine | CGC, CGA, CGG, AGA or AGG | CGT
Serine | TCT, TCA, TCG, AGT or AGC | TCC
Threonine | ACT, ACA or ACG | ACC
Valine | GTC, GTA or GTG | GTT
the host cell being selected from a prokaryotic cell, a fungal cell, a plant cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
47. A method as claimed in claim 46, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
48. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Amino Acid | DNA Codon | Replacement Codon
Histidine | CAT | CAC
Lysine | AAA | AAG
Asparagine | AAT | AAC
Tyrosine | TAT | TAC
Stop Codon | TAG or TGA | TAA
Leucine | CTT, CTC, CTA, TTA or TTG | CTG
wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
49. A method as claimed in claim 48, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000119_0001
; and/or:
Figure imgf000119_0002
; and/or:
Figure imgf000119_0003
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Valine | GTC, GTA or GTG | GTT
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Proline | CCC, CCA or CCG | CCT
; and/or:
Figure imgf000120_0001
; and/or:
Figure imgf000120_0002
; and/or:
Figure imgf000120_0003
; and/or:
; and/or:
Figure imgf000120_0004
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Isoleucine | ATT or ATA | ATC
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Glutamine | CAA | CAG
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Arginine | CGC, CGA, CGG, AGA or AGG | CGT
50. A method as claimed in claim 49, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
51. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Figure imgf000121_0001
wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
52. A method as claimed in claim 51, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
Figure imgf000122_0001
; and/or:
Figure imgf000122_0002
; and/or:
Figure imgf000122_0003
; and/or:
Figure imgf000122_0004
; and/or:
; and/or:
Figure imgf000122_0005
; and/or:
; and/or:
Figure imgf000123_0001
; and/or:
Figure imgf000123_0002
; and/or:
Figure imgf000123_0003
; and/or:
Figure imgf000123_0004
; and/or:
; and/or:
Figure imgf000123_0005
53. A method as claimed in any preceding claim, wherein the starting polynucleotide sequence is the wild-type coding sequence.
54. A method as claimed in any preceding claim, wherein the polynucleotide sequence is present or inserted into an expression vector.
55. A method as claimed in claim 54, wherein the expression vector is further introduced into a host cell.
56. A method as claimed in claim 55, wherein the host cell is cultured to produce the heterologous protein.
57. A method of expressing a heterologous protein in a plant cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
Amino Acid | DNA Codon | Replacement Codon
Alanine | GCT, GCA or GCG | GCC
Arginine | CGC, CGA, CGG, AGA or AGG | CGT
Asparagine | AAT | AAC
Aspartic acid | GAT | GAC
Cysteine | TGT | TGC
Glutamic acid | GAA | GAG
Glutamine | CAA | CAG
Glycine | GGC, GGA or GGG | GGT
Histidine | CAT | CAC
Isoleucine | ATT or ATA | ATC
Leucine | CTT, CTA, CTG, TTA or TTG | CTC
Lysine | AAA | AAG
Phenylalanine | TTT | TTC
Proline | CCT, CCA or CCG | CCC
Serine | TCT, TCA, TCG, AGT or AGC | TCC
Threonine | ACT, ACA or ACG | ACC
Tyrosine | TAT | TAC
Valine | GTT, GTA or GTG | GTC
Stop codons | TAG or TGA | TAA
c. inserting the polynucleotide sequence into an expression vector;
d. introducing said expression vector into a host cell; and
e. culturing the host cell to produce the heterologous protein;
optionally wherein the corresponding codons are changed according to the following table;
Figure imgf000125_0001
; and/or:
Amino Acid | DNA Codon | Replacement Codon
Leucine | CTT, CTA, CTC, TTA or TTG | CTG
; and/or:
Figure imgf000126_0001
; and/or:
Figure imgf000126_0002
; and/or:
Figure imgf000126_0003
; and/or:
Figure imgf000126_0004
58. A method as claimed in claim 57, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
59. A method as claimed in any of claims 46 to 58, wherein the host cell is an Arabidopsis cell.
60. A method as claimed in any preceding claim further comprising; analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence; and
incorporating in said polynucleotide sequence a pattern of optimal and non-optimal codons at a site associated with provision of a structural motif;
wherein said pattern enables increased expression efficiency of said protein in said host cell compared with the synonymous coding sequence containing solely optimal codons, wherein optimal codons are those codons pre-calculated to provide the highest functional expression of heterologous protein in the host cell or the sole possible codon.
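One possible reading of the incorporation step of claim 60 is sketched below for illustration: starting from a coding sequence, a short pattern of optimal ("O") and non-optimal ("N") codons is written over a chosen run of codon positions without changing the encoded amino acids. The claim does not say how the site or the non-optimal codon is chosen; here the site is supplied by the caller, and the non-optimal codon is simply the alphabetically first synonymous codon other than the optimal one. These choices, and the names apply_pattern and GENETIC_CODE, are assumptions.

```python
# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def apply_pattern(seq, start, pattern, optimal):
    """Write a pattern of optimal/non-optimal codons over seq.

    seq      in-frame coding sequence
    start    index of the first codon to modify (0-based)
    pattern  string of 'O' (optimal) and 'N' (non-optimal) marks
    optimal  dict mapping one-letter amino acid to its optimal codon
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    for offset, mark in enumerate(pattern):
        index = start + offset
        amino_acid = GENETIC_CODE[codons[index]]
        best = optimal.get(amino_acid, codons[index])
        if mark == "O":
            codons[index] = best
        else:
            # Arbitrary illustrative choice of a synonymous, non-optimal codon.
            others = sorted(c for c, aa in GENETIC_CODE.items()
                            if aa == amino_acid and c != best)
            if others:  # amino acids with a sole possible codon stay as they are
                codons[index] = others[0]
    return "".join(codons)


# Invented example: three lysine codons, optimal codon assumed to be AAG.
print(apply_pattern("AAGAAGAAG", 0, "ONO", {"K": "AAG"}))
```

The position and the pattern would in practice follow from the secondary structure analysis recited in the claim; that analysis is not modelled here.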
PCT/EP2014/076436 2014-12-03 2014-12-03 Optimisation of coding sequence for functional protein expression WO2016086988A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2014/076436 WO2016086988A1 (en) 2014-12-03 2014-12-03 Optimisation of coding sequence for functional protein expression

Publications (1)

Publication Number | Publication Date
WO2016086988A1 (en) | 2016-06-09

Family

ID=52007021

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/076436 WO2016086988A1 (en) 2014-12-03 2014-12-03 Optimisation of coding sequence for functional protein expression

Country Status (1)

Country Link
WO (1) WO2016086988A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1989000604A1 (en) * 1987-07-13 1989-01-26 Interferon Sciences, Inc. Method for improving translation efficiency
WO2001055342A2 (en) * 2000-01-31 2001-08-02 Biocatalytics, Inc. Synthetic genes for enhanced expression
WO2001068835A2 (en) * 2000-03-13 2001-09-20 Aptagen Method for modifying a nucleic acid
WO2002098443A2 (en) * 2001-06-05 2002-12-12 Curevac Gmbh Stabilised mrna with an increased g/c content and optimised codon for use in gene therapy
WO2002099105A2 (en) * 2001-06-05 2002-12-12 Cellectis Methods for modifying the cpg content of polynucleotides
WO2006097945A2 (en) * 2005-03-17 2006-09-21 Zenotech Laboratories Limited A method for achieving high-level expression of recombinant human interleukin-2 upon destabilization of the rna secondary structure
WO2006107954A2 (en) * 2005-04-05 2006-10-12 Pioneer Hi-Bred International, Inc. Methods and compositions for designing nucleic acid molecules for polypeptide expression in plants using plant virus codon-bias
WO2007142954A2 (en) * 2006-05-30 2007-12-13 Dow Global Technologies Inc. Codon optimization method
WO2009049350A1 (en) * 2007-10-15 2009-04-23 The University Of Queensland Expression system for modulating an immune response
WO2011111034A1 (en) * 2010-03-08 2011-09-15 Yeda Research And Development Co. Ltd. Recombinant protein production in heterologous systems

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ANDRONESCU MIRELA ET AL: "Efficient parameter estimation for RNA secondary structure prediction.", BIOINFORMATICS (OXFORD, ENGLAND) 1 JUL 2007, vol. 23, no. 13, 1 July 2007 (2007-07-01), pages i19 - i28, XP002738330, ISSN: 1367-4811 *
JIA M ET AL: "The relationship among gene expression, folding free energy and codon usage bias in Escherichia coli", FEBS LETTERS, ELSEVIER, AMSTERDAM, NL, vol. 579, no. 24, 10 October 2005 (2005-10-10), pages 5333 - 5337, XP027697304, ISSN: 0014-5793, [retrieved on 20051010] *
LIANGJIANG WANG ET AL: "Comparative analysis of expressed sequences reveals a conserved pattern of optimal codon usage in plants", PLANT MOLECULAR BIOLOGY, KLUWER ACADEMIC PUBLISHERS, DORDRECHT, NL, vol. 61, no. 4-5, 1 July 2006 (2006-07-01), pages 699 - 710, XP019405470, ISSN: 1573-5028, DOI: 10.1007/S11103-006-0041-8 *
LORENZ RONNY ET AL: "ViennaRNA Package 2.0.", ALGORITHMS FOR MOLECULAR BIOLOGY : AMB 2011, vol. 6, 26, 2011, pages 1 - 14, XP002738329, ISSN: 1748-7188 *
MURRAY E E ET AL: "CODON USAGE IN PLANT GENES", NUCLEIC ACIDS RESEARCH, OXFORD UNIVERSITY PRESS, GB, vol. 17, no. 2, 25 January 1989 (1989-01-25), pages 477 - 498, XP000008653, ISSN: 0305-1048 *
NAKAMURA M ET AL: "Translation efficiencies of synonymous codons are not always correlated with codon usage in tobacco chloroplasts", THE PLANT JOURNAL, BLACKWELL SCIENTIFIC PUBLICATIONS, OXFORD, GB, vol. 49, no. 1, 28 November 2006 (2006-11-28), pages 128 - 134, XP008133694, ISSN: 0960-7412 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018013720A1 (en) * 2016-07-12 2018-01-18 Washington University Incorporation of internal polya-encoded poly-lysine sequence tags and their variations for the tunable control of protein synthesis in bacterial and eukaryotic cells
US11603533B2 (en) 2016-07-12 2023-03-14 Washington University Incorporation of internal polya-encoded poly-lysine sequence tags and their variations for the tunable control of protein synthesis in bacterial and eukaryotic cells
CN113851190A (en) * 2021-11-01 2021-12-28 四川大学华西医院 Heterogeneous mRNA sequence optimization method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14806629

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14806629

Country of ref document: EP

Kind code of ref document: A1