WO2016086988A1

WO2016086988A1 - Optimisation of coding sequence for functional protein expression

Info

Publication number: WO2016086988A1
Application number: PCT/EP2014/076436
Authority: WO
Inventors: Lotte Bregje Westerhof; Jacob Bakker; Ruud Hendrikus Petrus Wilbers; Arjen Schots; Geert Smant; Aska Goverse; Johannes Helder; Marten Gerko STERKEN; Laurens Bastian SNOEK; Jan Edward Kammenga
Original assignee: Wageningen Universiteit
Priority date: 2014-12-03
Filing date: 2014-12-03
Publication date: 2016-06-09

Abstract

The present invention relates to an approach aimed at the modification of codons in individual polynucleotide sequences encoding a heterologous protein of interest, without altering the amino acid sequence of the polypeptide to enhance the amount of functional expression in a host organism of interest. In its broadest aspect, this approach exploits redundancy in the genetic code by providing a universal set of codons which may be used at certain positions in the polynucleotide sequence in order to achieve improved heterologous protein production in a range of host cells. The present invention also relates to specific codons which may be used to increase protein expression in particular hosts. The present invention also relates to the optimization of the translation efficiency of messenger RNAs on the basis of their secondary structure characteristics, and the provided set of criteria may be used to increase protein expression in particular hosts.

Description

OPTIMISATION OF CODING SEQUENCE FOR FUNCTIONAL PROTEIN EXPRESSION

FIELD OF THE INVENTION

The present invention relates to an approach aimed at the modification of codons in individual polynucleotide sequences encoding a heterologous protein of interest, without altering the amino acid sequence of the polypeptide, to enhance the amount of functional expression in a host organism of interest. Recognising that maximum translation efficiency and therefore protein production is influenced by codon usage of a coding sequence, in its broadest aspect, this approach exploits redundancy in the genetic code by providing a universal set of codons which may be used at certain positions in the polynucleotide sequence in order to achieve improved heterologous protein production in a range of host cells. The present invention also relates to the optimization of the translation efficiency of messenger RNAs on the basis of their secondary structure characteristics, and the provided set of criteria may be used to increase protein expression in particular hosts.

BACKGROUND TO THE INVENTION

Most amino acids are encoded by multiple synonymous codons and the frequency wherein synonymous codons are used is not equal within a given species. Also, within species a bias in codon use in highly expressed genes can be observed, linking codon use to gene expression. The codons used most frequently in highly expressed genes (optimal codons) have been shown to correspond to genomic G+C content and often match the most abundant tRNAs in many species. It is assumed that codons that match more abundant tRNAs would be translated faster as tRNA availability for translation occurs via diffusion and the chance of encountering a more abundant tRNA is greater than when encountering a rarer tRNA. An increase in translation rate allows ribosomes to finish translation and reinitiate translation sooner. Also, the probability that a ribosome initially loads a non-matching tRNA is smaller when a codon matches a more abundant tRNA resulting in an energetic advantage as three-quarters of the energy to incorporate an amino acid is lost if a non-matching tRNA has to be rejected after proofreading. Thus, the use of optimal codons in highly-expressed genes was hypothesized to provide a fitness gain by improved translational efficiency.

In recognition of the idea that increased translation efficiency may enhance protein yield, codon optimisation of genes for heterologous expression by recruiting optimal codons of the production host has been a common strategy. However, such strategies have met with varying success. For example, a study of the heterologous expression of 154 variants of GFP differing only in synonymous codon use in E. coli demonstrated that the use of optimal codons was positively correlated with bacterial growth, but not protein yield (Kudla et al. 2009, Science, 324: 255-258).

However, many of the studies focusing on codon optimisation have not addressed a potentially confounding variable, translational initiation. In the aforementioned study, about half of the variation in GFP protein levels was explained by folding energy of the first third of the mRNA suggesting that whilst the use of optimal codons may have increased the rate of translation, protein yield remained unchanged because the initiation of translation was rate-limiting. Ribosomal density studies indicate that ribosomes are most abundant at the 5' portion of mRNAs and the overall packing density of nearly all mRNAs is below maximum, suggesting that this may be a general feature.

Wang and Roossinck (Wang and Roossinck, 2006, Plant Mol Biol, 61:699-710) determined which codons were most highly-associated with transcripts which accumulate to high levels, by comparing overall codon use to the codon use in highly-transcribed genes in 11 plant species. In doing so the authors demonstrated that codon usage bias is correlated positively with gene transcript levels. As such the authors identified 18 codons which are associated with highly-expressed transcripts across 11 plant species. Interestingly, the authors found that use of their "optimal" codons appears to be well conserved between eudicots and monocots, but less well conserved between the higher plants and Chlamydomonas reinhardtii. However, the authors did not express polynucleotides incorporating such "optimal" codons in host cells and consequently, the effect on heterologous protein expression of altering the codon complement of their encoding polynucleotides in this way remains to be determined.

Alternatives to plant hosts are frequently required for protein production for a variety of reasons. Wang and Roossinck (Wang and Roossinck, 2006, Plant Mol Biol, 61:699-710) assessed the codons which are associated with the most abundant transcripts across 12 plant species. However, this result provides no information on codons which are relevant for optimising heterologous protein expression in other, non-plant host organisms.

SUMMARY OF THE INVENTION

The codon use of a gene of interest is often adapted to reflect the expression host's codon use in highly expressed genes in order to enhance heterologous protein production. However, the results obtained with this strategy are variable. A comparison between the overall codon use and the codon use in highly expressed genes of several plant species revealed that optimal codons are not always the codons of which the use is increased most with expression. Although the codon composition of highly expressed genes differs between monocots and dicots, the same codons often rise in frequency with increasing expression levels (expression codons) and are in many cases C-ending. These conserved expression codons were used to optimise the codon composition of three genes, which enhanced protein yield significantly upon stable and transient expression in plants.
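The comparison described above, between overall codon use and codon use in highly expressed genes, can be illustrated with a minimal Python sketch. The function names, the toy input sequences and the choice of "largest rise in relative frequency" as the selection rule are illustrative assumptions; they are not the procedure used to derive the tables in this document.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}


def relative_codon_use(cds_list):
    """Fraction of each codon among the synonymous codons of its amino acid.

    Assumes upper-case DNA coding sequences containing only A, C, G and T.
    """
    counts = Counter()
    for cds in cds_list:
        for i in range(0, len(cds) - len(cds) % 3, 3):
            counts[cds[i:i + 3]] += 1
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[GENETIC_CODE[codon]] += n
    return {codon: n / totals[GENETIC_CODE[codon]] for codon, n in counts.items()}


def expression_codons(all_genes, highly_expressed):
    """Per amino acid, the codon whose relative use rises most with expression."""
    overall = relative_codon_use(all_genes)
    high = relative_codon_use(highly_expressed)
    best = {}
    for codon, aa in GENETIC_CODE.items():
        delta = high.get(codon, 0.0) - overall.get(codon, 0.0)
        if aa not in best or delta > best[aa][1]:
            best[aa] = (codon, delta)
    return best


# Toy example: AAG (lysine) is enriched in the "highly expressed" set.
result = expression_codons(["AAAAAAAAG", "AAAAAGAAG"], ["AAGAAGAAG"])
print(result["K"])
```

A codon identified this way need not be the most frequent codon in highly expressed genes; it is the one whose share increases most, which is the distinction the paragraph above draws between optimal codons and expression codons.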

With the above in mind, an alternative method of codon optimisation has been devised that led to a significant increase in both mRNA stability and mRNA translatability (i.e. higher mRNA levels and more proteins per mRNA molecule). Unexpectedly, experimental data shown here indicate that this expression-linked codon bias found in plants also extends to other kingdoms of life. On the basis of these experimental data, the present invention provides a series of synonymous codons which are believed to have wide application for the expression of heterologous genes in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells and which have been surprisingly found to correspond with increased functional protein expression therein. Instead of the lengthy and complicated process of trial and error which characterises existing methods of codon optimisation centered on increasing gene expression in specific cellular or environmental contexts, the present invention provides a quick, practical, universal method of increasing functional heterologous protein expression with wide application for the expression of heterologous genes in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Advantageously, this method removes any need for consideration of the host cell or specific cellular context involved. In addition to a series of universally applicable replacement codons for use in commonly used host cells, the present invention also provides specific sets of codon replacements which further improve functional protein expression in particular hosts, specifically prokaryotes, fungi, animals, nematodes, protists and plants.

Accordingly, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.

As noted above, Wang and Roossinck (2006) did not actually perform any expression studies to determine the effect of codon optimisation on functional protein expression. In a further aspect the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;

Amino Acid       DNA Codon                      Replacement Codon
[...]            [...] AGA or AGG               [...]
Asparagine       AAT                            AAC
Aspartic acid    GAT                            GAC
Cysteine         TGT                            TGC
Glutamic acid    GAA                            GAG
Glutamine        CAA                            CAG
Glycine          GGC, GGA or GGG                GGT
Histidine        CAT                            CAC
Isoleucine       ATT or ATA                     ATC
Leucine          CTT, CTA, CTG, TTA or TTG      CTC
Lysine           AAA                            AAG
Phenylalanine    TTT                            TTC
Proline          CCT, CCA or CCG                CCC
Serine           TCT, TCA, TCG, AGT or AGC      TCC
Threonine        ACT, ACA or ACG                ACC
Tyrosine         TAT                            TAC
Valine           GTT, GTA or GTG                GTC
Stop codons      TAG or TGA                     TAA

inserting the polynucleotide sequence into an expression vector;

introducing said expression vector into a host cell; and

culturing the host cell to produce the heterologous protein; optionally wherein the corresponding codons are changed according to the following table;

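As an illustration of the codon-modification step in the plant-cell method above, the following Python sketch applies the replacement codons to a coding sequence and checks that the encoded amino acid sequence is unchanged. The dictionary holds only the rows that are legible in the table above; codons with no entry (including the arginine codons, whose row is incomplete in this text) are left as they are. The function names `recode` and `translate` and the example sequence are hypothetical.

```python
# Replacement codons transcribed from the legible rows of the plant table.
# Assumption: codons without an entry here are simply left unchanged.
PLANT_REPLACEMENTS = {
    "AAT": "AAC",                                              # Asparagine
    "GAT": "GAC",                                              # Aspartic acid
    "TGT": "TGC",                                              # Cysteine
    "GAA": "GAG",                                              # Glutamic acid
    "CAA": "CAG",                                              # Glutamine
    "GGC": "GGT", "GGA": "GGT", "GGG": "GGT",                  # Glycine
    "CAT": "CAC",                                              # Histidine
    "ATT": "ATC", "ATA": "ATC",                                # Isoleucine
    "CTT": "CTC", "CTA": "CTC", "CTG": "CTC",
    "TTA": "CTC", "TTG": "CTC",                                # Leucine
    "AAA": "AAG",                                              # Lysine
    "TTT": "TTC",                                              # Phenylalanine
    "CCT": "CCC", "CCA": "CCC", "CCG": "CCC",                  # Proline
    "TCT": "TCC", "TCA": "TCC", "TCG": "TCC",
    "AGT": "TCC", "AGC": "TCC",                                # Serine
    "ACT": "ACC", "ACA": "ACC", "ACG": "ACC",                  # Threonine
    "TAT": "TAC",                                              # Tyrosine
    "GTT": "GTC", "GTA": "GTC", "GTG": "GTC",                  # Valine
    "TAG": "TAA", "TGA": "TAA",                                # Stop codons
}

# Standard genetic code, used only to verify that the protein is unchanged.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}


def split_codons(cds):
    """Split an in-frame coding sequence into upper-case codons."""
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    """Translate a DNA coding sequence with the standard genetic code."""
    return "".join(GENETIC_CODE[c] for c in split_codons(cds))


def recode(cds, table=PLANT_REPLACEMENTS):
    """Replace every codon that has an entry in `table`; keep the rest."""
    return "".join(table.get(c, c) for c in split_codons(cds))


native = "ATGAAATTTGGATCTTGA"           # hypothetical example: M K F G S stop
modified = recode(native)
assert translate(modified) == translate(native)
print(modified)
```

Keeping the table as plain data means a host-specific supplementary table could be merged into it with an ordinary dictionary update before recoding.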

On the basis of these expression studies using such codon optimisation according to the invention, it was surprisingly discovered that a number of mRNA structural characteristics were found to be positively correlated with expression levels across kingdoms. In particular, the selection of mRNA structures with the most even distribution of stems and loops is positively correlated with higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Consequently, in an alternative embodiment, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.

DETAILED DESCRIPTION

In another aspect of the invention, further improvements in heterologous protein expression may be achieved by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table, particularly where the host cell is a prokaryotic cell, a fungal cell or a nematode cell:

In aspects of the invention where the host cell is a prokaryotic cell, for example, an E. coli cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):

Amino Acid    DNA Codon                    Replacement Codon
Leucine       TTA, TTG, CTT, CTC or CTA    CTG

and/or:

Amino Acid    DNA Codon     Replacement Codon
Glycine       GGA or GGG    GGT or GGC

In aspects of the invention where the host cell is a fungal cell, for example an S. cerevisiae cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):

Amino Acid    DNA Codon          Replacement Codon
Proline       CCT, CCC or CCG    CCA

and/or:

Amino Acid       DNA Codon    Replacement Codon
Glutamic acid    GAG          GAA

In aspects of the invention where the host cell is a nematode cell, for example, a C. elegans cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):

Amino Acid    DNA Codon     Replacement Codon
Valine        GTA or GTG    GTC or GTT

and/or:

Amino Acid       DNA Codon    Replacement Codon
Glutamic acid    GAA          GAG

and/or:

Amino Acid    DNA Codon    Replacement Codon
Glutamine     CAA          CAG

In aspects of the invention where the host cell is a Mus musculus cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):

Amino Acid    DNA Codon                    Replacement Codon
Serine        TCT, TCA, AGT, TCG or AGC    TCC

and/or:

Amino Acid    DNA Codon     Replacement Codon
Arginine      AGA or AGG    CGG, CGT, CGC or CGA

and/or:

Amino Acid    DNA Codon     Replacement Codon
Alanine       GCC or GCA    GCG or GCT

In another aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;

providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

the host cell being selected from a prokaryotic cell, a fungal cell, a plant cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.

In another aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;

providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.

In aspects of the invention where the host cell is a plant cell, preferably an Arabidopsis thaliana cell, heterologous protein expression is further improved by supplementing the codon changes detailed in the table above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):

Amino Acid       DNA Codon    Replacement Codon
Glutamic acid    GAA          GAG

and/or:

Amino Acid       DNA Codon    Replacement Codon
Phenylalanine    TTT          TTC

In another aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;

Amino Acid    DNA Codon          Replacement Codon
Valine        GTC, GTA or GTG    GTT

and/or:

Amino Acid    DNA Codon    Replacement Codon
Isoleucine    ATA          ATC or ATT

In another aspect the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;

inserting the polynucleotide sequence into an expression vector;

introducing said expression vector into a host cell; and


Preferably in this aspect of the invention the host cell is an Arabidopsis thaliana cell.

In addition to establishing precise codon changes which result in improved functional protein expression, a novel aspect of the present invention was uncovered by studying the correlation between expression level and mRNA characteristics including gene length, minimal folding energy, number of bound nucleotides, mean stem and loop sizes (stretches of bound and unbound nucleotides, respectively) and number of stem-loop transitions which revealed a general trend across kingdoms. Messenger RNAs are folded structures and translation of a given mRNA into a polypeptide requires unfolding. The necessary helicase activity is typically provided by the ribosome itself. This unfolding requires energy and in essence, a linear mRNA (i.e. an RNA polymer without secondary structure) would be optimal for the maximization of protein production. However, a certain degree of folding makes mRNA less susceptible to degradation and increases its diffusibility.

The number of bound nucleotides and the number of stem-loop transitions were found to be positively correlated with expression levels, while loop size was negatively correlated with expression. Combining the gene expression data with available protein abundance data demonstrated that protein:mRNA ratio (proxy for translation efficiency) is positively correlated with the number of stem-loop transitions and negatively correlated with stem and loop size. This general pattern across kingdoms reveals a selection pressure created by gene expression on both mRNA stability and translatability. An increase in the number of nucleotide bonds favours stability, while a more even distribution of these bonds enhances translatability. Altogether, our data indicate that a successful codon optimisation strategy should focus on computational models that calculate the ideal mRNA structure whereby both stability and translatability are enhanced. Here we describe a procedure to select mRNAs with optimal folding characteristics out of a pool consisting of all possible mRNAs encoding a given protein. Remarkably, these are not the most compact mRNAs, nor the ones with the lowest unfolding energy. Here we describe a selection procedure based on a set of criteria for the optimisation of recombinant protein production in a given host that relates to the number and distribution of mRNA stem-loop transitions for any given mRNA. 
On the basis of these experimental data, in another aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the relevant table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the relevant table(s); the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence and wherein the method further comprises; analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence; and incorporating in said polynucleotide sequence a pattern of optimal and non-optimal codons at a site associated with provision of a structural motif; wherein said pattern enables increased expression efficiency of said protein in said host cell compared with the synonymous coding sequence containing solely optimal codons, wherein optimal codons are those codons pre-calculated to provide the highest functional expression of heterologous protein in the host cell or the sole possible codon. As such the method may comprise merely making the universal codon changes, and/or making modifications according to the replacement codon tables which are specific for particular host cells. 
In preferred embodiments of the invention, analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to; examining and taking account of the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size. In preferred embodiments uneven stem loop distributions will be discarded and the polynucleotide sequence codon composition will be altered (i.e. non-optimally) based on the observation of mRNA secondary structure to improve translational efficiency and therefore functional protein expression.

A novel aspect of the invention is the selection of mRNA structures with the most even distribution of stems and loops that leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Consequently, in a further aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
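The first step of this method, providing a library of polynucleotides that differ at one or more codon positions, can be sketched in Python as an enumeration of synonymous coding sequences for a given protein. This is a minimal illustration only: the function names are hypothetical, the standard genetic code is assumed, and for proteins of realistic length the full pool is far too large to enumerate, so a sampling or restriction strategy (not shown) would be needed.

```python
from itertools import islice, product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}

# Map each amino acid (and "*" for stop) to its synonymous codons.
SYNONYMS = {}
for codon, aa in GENETIC_CODE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def variant_count(protein):
    """Number of distinct coding sequences that encode `protein`."""
    n = 1
    for aa in protein:
        n *= len(SYNONYMS[aa])
    return n


def synonymous_variants(protein):
    """Lazily yield every DNA sequence encoding `protein` (one-letter code)."""
    choices = [SYNONYMS[aa] for aa in protein]
    for combo in product(*choices):
        yield "".join(combo)


peptide = "MLS*"                        # hypothetical four-codon example
print(variant_count(peptide))           # 1 * 6 * 6 * 3 combinations
for seq in islice(synonymous_variants(peptide), 3):
    print(seq)
```

Because the generator is lazy, each variant can be passed to a structure-prediction step one at a time without holding the whole library in memory.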

Normally, the first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (> 100,000 mRNA variants). Typically, all mRNA species in the pool are then folded in silico. The term "in silico" is widely used in the art and will be understood by the average skilled person as meaning performed on a computer or via computer simulation. The RNA structure is predicted in silico using standard techniques and usually under the temperature and salt concentrations relevant for the preferred host. Appropriate software packages or applications incorporating suitable algorithms may be selected for performing the folded mRNA structure prediction. Suitable packages include, but are not limited to, an RNA structure prediction program such as Vienna RNAfold 2.0 (Lorenz et al., 2011, ViennaRNA Package 2.0, Algorithms for Molecular Biology, 6(1):26). Preferably, the mRNA structure prediction will be carried out using such a prediction program using the standard settings and the folding parameters, for example, those established by Andronescu et al. (Andronescu et al., 2007, Bioinformatics, 23 (13), i19-i28) and preferably, adjusting the folding temperature to that of the intracellular temperature of the host of interest. More preferably, the temperature and salt concentration parameters will be adjusted to match those of the preferred host. Finally, the mRNAs from the library of synonymous variants that have the most even distribution of stems and loops are selected. The mRNAs having the most even distribution of stems and loops may be identified by the structural characteristics outlined below. In particular, the standard deviation of the stem sizes and of the loop sizes is used as a measure of how evenly they are distributed, an even distribution being preferred. Typically, the more similar the stem sizes of an mRNA the higher the translation efficiency. 
Additionally, the more similar the loop sizes of an mRNA the higher the translation efficiency. Where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Parameters which may be influential include, for example, the folding energy of the 5' terminus and the selection of codons that are frequently used and match the most abundant tRNAs. Codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred. Methods for determining the folding energy of mRNA may be based on, but are not limited to, those described by Tuller et al. (Tuller et al., 2009, PNAS 107:3645-3650) and Kudla et al. (Kudla et al. 2009, Science, 324:255-258). For example, the mRNA molecule from position -23 to +39 should have an average folding energy of at least -6 kcal/mol for E. coli and of at least -4 kcal/mol for S. cerevisiae as determined by the use of sliding windows of 40 nt with 1 nt steps. Codon choice of the first 13 nt providing a low energy will depend on the 5' UTR provided by the expression cassette (Kudla et al. 2009, Science, 324: 255-258; Tuller et al., 2009, PNAS 107: 3645-3650). Alternatively, instead of adapting the first 13 nt, the 5' UTR may be adapted to provide a low folding energy. For example, the 5' UTR used in the present examples is very U-rich (GTTTTTATTTTTAATTTTCTTTCAAATACTTCCACC [SEQ ID NO: 1]), which in most cases provided a relatively high (close to 0) folding energy when using primarily C-ending codons. When using the chitinase SP, this was always the case.
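The sliding-window criterion described above can be sketched as follows. The sketch assumes that the region from -23 to +39 contains no position 0 and is therefore 62 nt long, and it takes the folding-energy calculation as a caller-supplied function `fold_energy` (a hypothetical wrapper around a folding program such as the RNAfold program mentioned above); no real energy model is implemented here.

```python
from statistics import mean


def sliding_windows(seq, size=40, step=1):
    """All windows of `size` nt taken every `step` nt along `seq`."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def mean_window_energy(region, fold_energy, size=40, step=1):
    """Average folding energy (kcal/mol) over the sliding windows of `region`.

    `fold_energy` is a hypothetical callable mapping a sequence to its
    predicted minimum folding energy in kcal/mol.
    """
    windows = sliding_windows(region, size, step)
    if not windows:
        raise ValueError("region is shorter than the window size")
    return mean(fold_energy(w) for w in windows)


def meets_threshold(average_energy, threshold):
    """True if the average energy is at least `threshold`, e.g. -6 or -4."""
    return average_energy >= threshold


# Toy energy model for demonstration only: -0.1 kcal/mol per G or C.
def toy_energy(seq):
    return -0.1 * sum(seq.count(base) for base in "GC")


region = "GTTTTTATTTTTAATTTTCTTTCAAATACTTCCACC" + "ATG" + "GCT" * 7 + "GC"
average = mean_window_energy(region, toy_energy)
print(len(region), len(sliding_windows(region)), round(average, 2))
print(meets_threshold(average, -6.0))
```

With a 62 nt region, 40 nt windows and 1 nt steps there are 23 windows, so each candidate 5' end requires 23 folding calculations.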

In preferred embodiments of the invention, analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to, examining and taking account of the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size. In preferred embodiments, the polynucleotide sequence codon composition will be altered (i.e. non-optimally) to avoid uneven stem loop distributions to improve translational efficiency and therefore functional protein expression. Such alterations may include incorporating one or more codons listed as second preference or third preference replacement codons in place of the first preference codon where the secondary structure criteria are not fulfilled by inclusion of the first preference codon. Alternatively, for a given position, such alterations may include retention of the wild-type (WT) or native codon where inclusion of an optimal codon negatively impacts the secondary structure with respect to the particular criteria for each host cell. Preferably, the polynucleotide will have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp). More preferably, the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
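The structural measures listed above can be computed from a predicted secondary structure in dot-bracket notation, as in the following Python sketch. Several points are assumptions made for illustration: a stem is taken to be a maximal run of paired positions, a loop a maximal run of unpaired positions, a stem-loop transition a boundary between two such runs, and the standard deviation is the population standard deviation. The text does not fix these definitions at this level of detail, and the function names are hypothetical.

```python
import re
from statistics import mean, pstdev


def summarise(sizes):
    """Mean, standard deviation and maximum of a list of run lengths."""
    if not sizes:
        return {"mean": 0.0, "sd": 0.0, "max": 0}
    return {"mean": mean(sizes), "sd": pstdev(sizes), "max": max(sizes)}


def structure_stats(dot_bracket):
    """Stem/loop statistics for a structure such as '((((....))))'."""
    if not dot_bracket or set(dot_bracket) - set("()."):
        raise ValueError("expected a non-empty dot-bracket string")
    # Runs of paired positions (stems) and of unpaired positions (loops).
    stems = [len(m.group()) for m in re.finditer(r"[()]+", dot_bracket)]
    loops = [len(m.group()) for m in re.finditer(r"\.+", dot_bracket)]
    # Runs alternate between paired and unpaired, so boundaries = runs - 1.
    transitions = len(stems) + len(loops) - 1
    return {
        "transitions_per_kb": 1000 * transitions / len(dot_bracket),
        "stem": summarise(stems),
        "loop": summarise(loops),
    }


def in_general_range(stats):
    """At least 110 and fewer than 250 stem-loop transitions per kb."""
    return 110 <= stats["transitions_per_kb"] < 250


stats = structure_stats("((((....))))")
print(stats)
print(in_general_range(stats))
```

A low standard deviation of the stem or loop sizes indicates the even distribution that the selection favours, so candidates can be ranked on those two values after the transition filter has been applied.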

Preferably, the polynucleotide will have a maximum stem size of less than 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 14 bp to 15 bp. More preferably, the polynucleotide will have a maximum loop size of less than 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp. Additionally, in embodiments where the host cell is a prokaryotic cell, preferably a bacterial cell and more preferably an E. coli cell, the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.45 bp and 2.50 bp, optionally in the range 5.45 bp to 4.00 bp, 5.40 bp to 2.60 bp, 5.30 bp to 2.70 bp, 5.20 bp to 2.80 bp, 5.10 bp to 2.90 bp, 5.00 bp to 3.00 bp, 4.90 bp to 3.10 bp, 4.80 bp to 3.20 bp, 4.70 bp to 3.30 bp, 4.60 bp to 3.40 bp, 4.50 bp to 3.50 bp, 4.40 bp to 3.60 bp, 4.30 bp to 3.70 bp, 4.20 bp to 3.80 bp or 4.10 bp to 3.90 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.16 bp and 2.00 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 2.95 and 2 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp, 2.60 bp to 2.40 bp or 2.50 bp to 2.40 bp. 
Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.50 bp, preferably between 3.50 bp and 2.00 bp, optionally in the range 3.40 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp or 2.80 bp to 2.70 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 16 bp, optionally in the range 10 bp to 16 bp, 11 bp to 15 bp or 12 bp to 14 bp. In the most preferred embodiment where the host cell is a plant cell, the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.

Alternatively, in embodiments wherein the host cell is a eukaryotic cell, preferably a plant cell and more preferably an Arabidopsis thaliana cell, the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably the selected polynucleotide will have a mean stem size in the range 5.20 to 2.50 bp, optionally in the range 5.20 bp to 4.00 bp, 5.20 to 2.60 bp, 5.10 bp to 2.70 bp, 5.00 bp to 2.80 bp, 4.90 bp to 2.90 bp, 4.80 bp to 3.00 bp, 4.70 to 3.10 bp, 4.60 to 3.20 bp, 4.50 to 3.30 bp, 4.40 to 3.40 bp, 4.30 to 3.50 bp, 4.20 to 3.60 bp, 4.10 to 3.70 bp or 4.00 to 3.80 bp. Preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.32 bp and 3.00 bp, optionally in the range 3.30 bp to 3.00 bp, 3.25 bp to 3.05 bp, 3.20 bp to 3.10 bp or 3.15 bp to 3.10 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.20 and 2 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp. Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.40, preferably between 3.40 and 2.00 bp, optionally in the range 3.30 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp, 2.80 bp to 2.40 bp or 2.60 bp to 2.50 bp.
Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp or 13 bp to 15 bp. In the most preferred embodiment where the host cell is a plant cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.

Alternatively, in embodiments wherein the host cell is a fungal cell, preferably a Saccharomyces cell, optionally a Saccharomyces cerevisiae cell, the selected polynucleotide will preferably have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp). Preferably, the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.27 bp and 2.50 bp, optionally in the range 5.27 bp to 4.00 bp, 5.20 to 2.40 bp, 5.10 bp to 2.50 bp, 5.00 to 2.60 bp, 4.90 bp to 2.70 bp, 4.80 bp to 2.80 bp, 4.70 bp to 2.90 bp, 4.60 bp to 3.00 bp, 4.50 to 3.10 bp, 4.40 to 3.20 bp, 4.30 to 3.30 bp, 4.20 to 3.40 bp, 4.10 to 3.50 bp, 4.00 to 3.60 bp or 3.90 to 3.70 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.77 bp and 3.00 bp, optionally in the range 3.75 bp to 3.00 bp, 3.70 bp to 3.10 bp, 3.60 bp to 3.20 bp or 3.50 bp to 3.30 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.65 and 2.00 bp, optionally in the range 3.60 bp to 2.10 bp, 3.50 bp to 2.20 bp, 3.40 bp to 2.30 bp, 3.30 bp to 2.40 bp, 3.30 bp to 2.50 bp, 3.20 bp to 2.60 bp, 3.10 bp to 2.70 bp or 3.00 bp to 2.80 bp.
Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.25, preferably between 3.25 and 2.00 bp, optionally in the range 3.20 bp to 2.10 bp, 3.10 bp to 2.20 bp, 3.00 bp to 2.30 bp, 2.90 bp to 2.40 bp, 2.80 bp to 2.50 bp or 2.70 bp to 2.60 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp. In the most preferred embodiment where the host cell is a fungal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.

Alternatively, in embodiments wherein the host cell is an animal cell, preferably a nematode cell, optionally a Caenorhabditis elegans cell, the selected polynucleotide will preferably have at least 114 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 114 to 200/kbp, 115 to 249/kbp, 116 to 248/kbp, 117 to 247/kbp, 118 to 246/kbp, 119 to 245/kbp, 120 to 244/kbp, 121 to 243/kbp, 122 to 242/kbp, 123 to 241/kbp, 124 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.35 and 2.50 bp, optionally in the range 5.35 bp to 4.00 bp, 5.30 to 2.40 bp, 5.20 bp to 2.50 bp, 5.10 to 2.60 bp, 5.00 bp to 2.70 bp, 4.90 bp to 2.80 bp, 4.80 bp to 2.90 bp, 4.70 bp to 3.00 bp, 4.60 to 3.10 bp, 4.50 to 3.20 bp, 4.40 to 3.30 bp, 4.30 to 3.40 bp, 4.20 to 3.50 bp, 4.10 to 3.60 bp, 4.00 to 3.70 bp or 3.90 to 3.80 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.47 bp and 3.00 bp, optionally in the range 3.45 bp to 3.00 bp, 3.40 bp to 3.10 bp or 3.30 bp to 3.20 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.37 and 2.00 bp, optionally in the range 3.35 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp, or 2.80 bp to 2.70 bp. Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.27, preferably between 3.27 and 2.00 bp, optionally in the range 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp.
Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp. In the most preferred embodiment where the host cell is a fungal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.

Alternatively, in embodiments wherein the host cell is an animal cell, preferably a mammalian cell, optionally a Mus musculus cell, the selected polynucleotide will preferably have at least 120 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 120 to 200/kbp, 121 to 249/kbp, 122 to 248/kbp, 123 to 247/kbp, 124 to 246/kbp, 125 to 245/kbp, 130 to 240/kbp, 135 to 235/kbp, 140 to 230/kbp, 145 to 225/kbp, 150 to 220/kbp, 155 to 215/kbp, 160 to 210/kbp, 165 to 205/kbp, 170 to 200/kbp, 175 to 195/kbp or 180 to 190/kbp. More preferably, the selected polynucleotide will have a mean stem size between 4.35 and 2.50 bp, optionally in the range 4.35 to 4.00 bp, 4.30 to 2.40 bp, 4.20 bp to 2.50 bp, 4.10 to 2.60 bp, 4.00 bp to 2.70 bp, 3.90 bp to 2.80 bp, 3.80 bp to 2.90 bp, 3.70 bp to 3.00 bp, 3.60 to 3.10 bp, 3.50 to 3.20 bp or 3.40 to 3.30 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 5.18 bp and 4.00 bp, optionally in the range 5.15 bp to 4.00 bp, 5.10 bp to 4.10 bp, 5.00 bp to 4.20 bp, 4.90 bp to 4.30 bp, 4.80 bp to 4.40 bp or 4.70 bp to 4.50 bp. More preferably still, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.00 and 2.00 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp or 2.60 bp to 2.40 bp.
Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.28, preferably between 3.28 and 2.00 bp, optionally in the range 3.27 bp to 2.00 bp, 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp or 13 bp to 15 bp. In the most preferred embodiment where the host cell is an animal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.

In a final aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of: providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide, wherein the method further comprises selecting a polynucleotide from a library of synonymous variants wherein the codon usage of the selected polynucleotide most closely matches the most abundant tRNAs in a particular host cell. It will be appreciated that this final step may be undertaken.
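By way of illustration only, the secondary structure criteria recited above might be evaluated on a predicted structure along the following lines. This is a minimal sketch under stated assumptions, not the method as practised: it assumes the structure is supplied as a dot-bracket string by some external folding tool, that a stem is a maximal run of "(" characters, that a loop is a maximal run of "." characters, that a stem loop transition is any boundary between a paired and an unpaired position, and that the population standard deviation is intended. The function names `structure_metrics` and `within_transition_window` are hypothetical.

```python
from itertools import groupby
from statistics import mean, pstdev


def structure_metrics(dot_bracket: str) -> dict:
    """Summarise a dot-bracket structure string (assumed input format)."""
    if not dot_bracket:
        raise ValueError("empty structure string")
    # Collapse the string into runs of identical characters.
    runs = [(char, len(list(group))) for char, group in groupby(dot_bracket)]
    # Assumption: each helix is counted once, from its 5' strand of '('.
    stems = [n for char, n in runs if char == "("]
    loops = [n for char, n in runs if char == "."]
    # Assumption: a transition is any paired/unpaired boundary.
    transitions = sum(
        (a == ".") != (b == ".") for a, b in zip(dot_bracket, dot_bracket[1:])
    )
    return {
        "transitions": transitions,
        "transitions_per_kb": 1000.0 * transitions / len(dot_bracket),
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "sd_stem": pstdev(stems) if stems else 0.0,
        "sd_loop": pstdev(loops) if loops else 0.0,
        "max_stem": max(stems, default=0),
        "max_loop": max(loops, default=0),
    }


def within_transition_window(metrics: dict, low: float = 110.0,
                             high: float = 250.0) -> bool:
    """'At least low and fewer than high' transitions per kilobase."""
    return low <= metrics["transitions_per_kb"] < high


example = structure_metrics("..(((...)))..((....))..")
print(example["transitions"], example["max_stem"], example["max_loop"])
```

The default window of 110 to 250 is the broadest range recited above; host-specific lower bounds (for example 116 for E. coli) can be passed as `low`. Applying such a filter to each member of a library of synonymous variants would leave the candidates that satisfy the transition criterion.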

Polynucleotides

In the context of methods of the invention, polynucleotides encoding heterologous proteins of interest (POI) may be isolated nucleic acid molecules and may be a DNA molecule, a cDNA molecule, an RNA molecule or synthetically produced DNA or RNA or a chimeric nucleic acid molecule. In embodiments where the polynucleotide is an RNA, it will be understood that normally uracil (U) is to be used in place of thymine (T). Throughout, the term "polynucleotide" as used herein refers to a deoxyribonucleotide or ribonucleotide polymer in single- or double-stranded form, or sense or anti-sense, and encompasses analogues of naturally occurring nucleotides that hybridize to nucleic acids in a manner similar to naturally occurring nucleotides. Such polynucleotides may be derived from any organism, including the host organism, or may be synthesised de novo.

Prior to modification in accordance with the methods of the invention, a polynucleotide coding sequence may be provided for the protein of interest (POI) having the wild-type (WT) sequence or alternatively having a 'pre-optimised' sequence; that is to say the sequence incorporates at one or more positions for which synonymous codons are available a codon which is associated with the most abundant tRNA for that particular amino acid. In certain embodiments, it may be that codons corresponding to the most abundant tRNA for particular amino acids are used at each position for which synonymous codons are available. Preferably, however, the starting polynucleotide sequence is the WT sequence encoding the POI. In the context of methods of the invention, it will be appreciated that the POI may be a native protein of a host cell in which expression of the native protein has been silenced, for example, the polynucleotide sequence encoding that protein has been disrupted, deleted or mutated. In these circumstances, the POI will be considered as a heterologous protein in the context of the mutated host cell.

The provision of a polynucleotide having a coding sequence may comprise synthesis of a polynucleotide comprising the coding sequence. This may be for example by modification of a pre-existing sequence, e.g. by site-directed mutagenesis or possibly by de novo synthesis.

Polynucleotide Sequence Modification

In all embodiments of the invention, polynucleotide sequences encoding the protein of interest may be prepared by any suitable method known to those of ordinary skill in the art, including but not limited to, for example, direct chemical synthesis or cloning. Whether the starting polynucleotide is a WT sequence or a pre-optimised sequence where the codons match the most abundant tRNAs for a particular host cell, the starting polynucleotide sequence may be reviewed and modified by incorporating the relevant replacement codons in silico. The modified polynucleotide may subsequently be synthesised, for example by direct chemical synthesis, for introduction into a desired host cell. Alternatively, the starting polynucleotide sequence may be provided and subsequently modified ex vivo or alternatively in vivo for example by site directed mutagenesis or gene editing techniques.

In some embodiments of the invention, all of the polynucleotide sequence is modified according to the relevant table; that is to say 100% of the length of the coding sequence of the polynucleotide encoding the protein of interest (POI). In such embodiments, each occurrence of a particular 'non-optimal' codon in the starting polynucleotide sequence for which a synonymous codon exists will be replaced with the corresponding replacement codon indicated in the relevant table. For a particular codon, this involves modifying every occurrence of that codon within the polynucleotide sequence. Preferably, where two or more codons are indicated as replacement codons, each codon will be modified using the synonymous replacement codon appearing first in the table.
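The full-length replacement described above can be sketched as follows. This is an illustrative example only: the `REPLACEMENTS` mapping is a hypothetical placeholder and does not reproduce any table of the invention, the function names are assumptions, and the standard genetic code is assumed for the check that the encoded amino acid sequence is unchanged.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical table: non-optimal codon -> replacement codons in table order.
REPLACEMENTS = {"CTA": ["CTG", "CTC"], "AGA": ["CGT"]}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon."""
    return "".join(GENETIC_CODE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def recode(cds: str, replacements: dict) -> str:
    """Replace every listed codon with its first-listed synonymous codon."""
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        candidates = replacements.get(codon)
        new = candidates[0] if candidates else codon
        # Guard: a replacement must never alter the encoded amino acid.
        if GENETIC_CODE[new] != GENETIC_CODE[codon]:
            raise ValueError(f"{codon} -> {new} is not synonymous")
        out.append(new)
    return "".join(out)


original = "ATGCTAAGACTATAA"
modified = recode(original, REPLACEMENTS)
print(modified, translate(original) == translate(modified))
```

Every occurrence of a listed codon is modified, and where two replacement codons are listed the first is used, as in the preferred embodiment described above.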

Alternatively, in certain situations it may be desirable to limit application of the method to specific regions of the polynucleotide sequence or to omit certain regions from application of the method, for instance to avoid disruption of secondary structural motifs or regulatory elements in the polynucleotide sequence. According to preferred embodiments of the invention, appropriate replacement codons may be applied to substantially all of the nucleotides in a polynucleotide sequence. Preferably, at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.5% or 100% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table. In preferred embodiments, more than 90% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table. More preferably still, more than 95% of the polynucleotide sequence is modified. Ideally, 100% of the polynucleotide sequence is modified, that is, each occurrence of a particular codon is replaced with the corresponding replacement codon indicated in the relevant table.

Expression Vectors

After modification of the codon composition of the polynucleotide sequence encoding the protein of interest, subsequent expression of the polynucleotide sequence in the chosen host cell may be carried out. In order that expression can be carried out in the host cell of choice, the sequence will preferably be provided in an expression construct, e.g. an expression vector. In some embodiments, the polynucleotide may be provided in an expression vector. Suitable expression vectors will vary according to the recipient host cell and suitably may incorporate regulatory elements which allow expression in the host cell of interest and preferably which facilitate high levels of expression. Such regulatory sequences may be capable of influencing transcription or translation of a gene or gene product, for example in terms of initiation, accuracy, rate, stability, downstream processing and mobility.

Such elements may include, for example, strong and/or constitutive promoters, 5' and 3' UTRs, transcriptional and/or translational enhancers, transcription factor or protein binding sequences, start sites and termination sequences, ribosome binding sites, recombination sites, polyadenylation sequences, sense or antisense sequences, sequences ensuring correct initiation of transcription and optionally poly-A signals ensuring termination of transcription and transcript stabilisation in the host cell. The regulatory sequences may be plant-, animal-, bacteria-, fungal- or virus-derived, and preferably may be derived from the same organism as the host cell. Clearly, appropriate regulatory elements may vary according to the host cell of interest. For example, regulatory elements which facilitate high-level expression in prokaryotic host cells such as in E. coli may include the pLac, T7, P(Bla), P(Cat), P(Kat), trp or tac promoters. Regulatory elements which facilitate high-level expression in eukaryotic host cells might include the AOX1 or GAL1 promoter in yeast or the CMV- or SV40-promoters, CMV-enhancer, SV40-enhancer, Herpes simplex virus VP16 transcriptional activator or inclusion of a globin intron in animal cells. In plants, constitutive high-level expression may be obtained using, for example, the Zea mays ubiquitin 1 promoter or 35S and 19S promoters of cauliflower mosaic virus.

Suitable regulatory elements may be constitutive, whereby they direct expression under most environmental conditions or developmental stages, developmental stage specific or inducible. Preferably, the promoter is inducible, to direct expression in response to environmental, chemical or developmental cues, such as temperature, light, chemicals, drought, and other stimuli. Suitably, promoters may be chosen which permit expression of the protein of interest at particular developmental stages or in response to extra- or intra-cellular conditions, signals or externally applied stimuli. For example, a range of promoters exist for use in E. coli which give high-level expression at particular stages of growth (e.g. osmY stationary phase promoter) or in response to particular stimuli (e.g. HtpG Heat Shock Promoter).

Suitable expression vectors may comprise additional sequences encoding selectable markers which allow for the selection of said vector in a suitable host cell and/or under particular conditions. Suitable expression vectors may also comprise additional sequences which enable visualisation or quantification of the expressed protein (e.g. 3' GFP or Luciferase fusion tags) in the host cell of interest. Preferred expression vectors are those which also enable the expressed protein to be easily separated from other cellular proteins for downstream applications. For example, the expression vector may incorporate a fusion tag domain, which when fused to the coding sequence of the protein of interest allows the expressed protein to be bound to a matrix, column or beads (e.g. glutathione-S-transferase (GST)).

Furthermore, the expression vector comprising the heterologous polynucleotide sequence may optionally comprise polynucleotide sequences coding for one or more transit peptides capable of localising the expressed protein to a particular cellular compartment in the host cell. Advantageously, such domains may cause secretion of expressed protein, for example into the extracellular medium to enable the protein to be easily recovered from the cell culture medium. In plant hosts suitable transit peptides may cause the protein to localise to, for example, the cell wall, nucleus or chloroplasts.

The methods of the present invention will be useful in the production of a large number of different proteins in the agricultural, chemical, industrial and pharmaceutical fields, particularly for example antibodies, vaccines, hormones and other protein therapeutics. Advantageously, according to all aspects of the present invention, levels of heterologous protein are increased relative to the respective native (i.e. unoptimised) protein by modification of the codon usage of the polynucleotide sequence which encodes the protein of interest. Preferably, the levels of heterologous protein may increase in the range 5% to 500% relative to native (unoptimised) protein; optionally in the range 10% to 250%, 20% to 200%, 25% to 100%, 30% to 75% or 35% to 65%.

Once expressed, proteins of interest may preferably be recovered from the cell culture medium as secreted proteins, although they may also be recovered from host cell lysates.

Host cells

The utility of the present invention resides in the universal applicability of the optimal replacement codons to any polynucleotide having a coding sequence and having one or more of the codons listed in the relevant table for expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells or animal cells. Methods of the invention can be applied to any type of host cell which is genetically accessible and which can be cultured. In other words, the approach may be applied to those cells which are able to serve as a host for production of the protein of interest (POI). It may therefore be applied to commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells commonly employed for recombinant heterologous protein expression. Preferably, host cells will be selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell. Typically, the host cell may be an Escherichia coli cell. Typically, the host cell may be a Saccharomyces cerevisiae cell. Typically, the host cell may be a Caenorhabditis elegans cell. Typically, the host cell may be a Mus musculus cell.

In embodiments of the invention where the host cell is a prokaryotic cell, the host cell may be a bacterial cell or alternatively the host cell may be an archaeal cell. Host cells may be gram-negative bacterial cells. Host cells may be gram-positive bacterial cells. Typically, host cells may include but are not limited to: an Aliivibrio fischeri cell, a Bacillus subtilis cell, a Caulobacter crescentus cell, an Escherichia coli cell, a Mycoplasma genitalium cell, a Synechocystis cell or a Pseudomonas fluorescens cell. In preferred embodiments the host cell is a bacterial cell. Preferably the host cell is an Escherichia coli (E. coli) cell. In particularly preferred embodiments where the host cell is a prokaryotic cell, it is envisaged that the highest functional protein expression will be achieved by modification of each codon in the polynucleotide sequence for which a synonymous codon exists according to the relevant tables above. Preferably, where there is a choice of codons indicated for a selected position based on the expression data, preference may be given to the first replacement codon appearing in the relevant table. Alternatively, preference may be given to the second replacement codon appearing in the relevant table. Alternatively, in situations where the second or third preference codon is already present in the starting sequence, it may be decided to retain the codon in the starting sequence, i.e. the wild type codon in embodiments where the starting sequence is the wild-type sequence. This will minimise the number of codon changes to convert the starting sequence in a polynucleotide to the selected synonymous coding sequence for improved functional protein expression.

In embodiments of the invention where the host cell is a protist cell, host cells may include but are not limited to: a Chlamydomonas reinhardtii cell, a Dictyostelium discoideum cell, a Tetrahymena thermophila cell, an Emiliania huxleyi cell or a Thalassiosira pseudonana cell. In preferred embodiments the host cell is a Chlamydomonas cell. Preferably, the host cell is a Chlamydomonas reinhardtii cell.

In embodiments of the invention where the host cell is a fungal cell, the host cell may include but is not limited to: fungal cells and yeast cells. In particular, the host cell may be a Saccharomyces cerevisiae cell, an Ashbya gossypii cell, an Aspergillus fumigatus cell, an Aspergillus nidulans cell, a Candida albicans cell, a Coprinus cinereus cell, a Cunninghamella elegans cell, a Cryptococcus neoformans cell, a Fusarium oxysporum cell, a Magnaporthe oryzae cell, a Neurospora crassa cell, a Schizophyllum commune cell, a Schizosaccharomyces pombe cell, an Ustilago maydis cell or a Zymoseptoria tritici cell. Preferably the host cell is a Saccharomyces cerevisiae cell or a Schizosaccharomyces pombe cell. More preferably the host cell is a Saccharomyces cerevisiae cell.

According to aspects of the present invention where the host cell is a plant cell, any cell type of any plant species, including both monocots and dicots, may be used as a host system for expression of a heterologous protein. Preferred plant cells for use in the present invention are genetically tractable, and are commonly derived from crop species, species which typically exhibit high growth rates or are easily harvested, or species which have established genetic resources associated with them. Commonly, in some preferred embodiments of the invention, the host cell is an Arabidopsis cell, preferably an Arabidopsis thaliana cell. In other preferred embodiments of the invention the host cell may be a Nicotiana cell, preferably a Nicotiana tabacum cell. Alternatively, depending on the application chosen said plant may suitably be selected from the following: maize (Zea mays), canola (Brassica napus, Brassica rapa ssp.), sugar beet (Beta vulgaris), oat (Avena sp.), barley (Hordeum vulgare), flax (Linum usitatissimum), alfalfa (Medicago sativa), rice (Oryza sativa), rye (Secale cereale), sorghum (Sorghum bicolor, Sorghum vulgare), switchgrass (Panicum virgatum), prairie cordgrass (Spartina sp.), purple false brome (Brachypodium distachyon), sunflower (Helianthus annuus), wheat (Triticum aestivum), soybean (Glycine max), potato (Solanum tuberosum), cotton (Gossypium hirsutum), sweet potato (Ipomoea batatas), cassava (Manihot esculenta), foxtail (Setaria sp.), Miscanthus sp., peanuts (Arachis hypogaea), coffee (Coffea spp.), coconut (Cocos nucifera), pineapple (Ananas comosus), citrus tree (Citrus spp.), cocoa (Theobroma cacao), tea (Camellia sinensis), banana (Musa spp.), avocado (Persea americana), fig (Ficus carica), guava (Psidium guajava), mango (Mangifera indica), olive (Olea europaea), papaya (Carica papaya), cashew (Anacardium occidentale), macadamia (Macadamia integrifolia), almond (Prunus amygdalus), Chlorella, Volvox, Guillardia theta, Bigelowiella natans or Physcomitrella patens.

Transformation of the host cell with a heterologous gene sequence

Expression constructs comprising the modified polynucleotide sequence may be located in plasmids (expression vectors) which are used to transform the host cell. Specific, but non-limiting methods of transformation may include heat shock, electroporation, particle bombardment, chemical induction, microinjection and viral transformation.

Heterologous protein expression analysis

Subsequently, in preferred embodiments of the present invention the expression levels of the protein of interest in host cells of interest may be determined. Preferably the method chosen allows for quantitative assessment of the level of functional expression. In some instances, functional expression may be directly determined, e.g. as with GFP, luciferase or by enzymatic action of the protein of interest (POI) to generate a detectable optical signal, such as fluorescence or luminescence or a colour change caused by the protein. However, in some circumstances it may be chosen to determine physical expression, for instance by antibody probing, and rely on a separate test to verify that physical expression is accompanied by the required function. In preferred embodiments of the invention, the POI will be detectable by a high-throughput screening method, for example, relying on the detection of an optical signal. Preferably, using an optical signal which is directly proportionate to the quantity of the expression product from the polynucleotide is a convenient method of measuring expression and is amenable to high throughput processing. For this purpose, it may be necessary for the POI to incorporate a tag, or be labelled with a removable tag, which permits detection and preferably quantification of expression. Suitable tags may include but are not limited to: a fluorescence reporter molecule translationally-fused to the C-terminal end of the POI, e.g. GFP, Yellow Fluorescent Protein (YFP), Red Fluorescent Protein (RFP) or Cyan Fluorescent Protein (CFP). It may be an enzyme which can be used to generate an optical signal. Alternatively, the expression vector may incorporate a polynucleotide reporter encoding a luminescent protein, such as a luciferase (e.g. firefly luciferase). Alternatively, the reporter gene may encode a chromogenic enzyme which can be used to generate an optical signal, such as beta-galactosidase (LacZ) or beta-glucuronidase (Gus). Tags used for detection of expression may also be antigen peptide tags. A tag may be provided for affinity purification, e.g. a polyhistidine tag. Where the POI is ultimately to be used as a therapeutic agent, any tag employed for detection of expression will be cleavable from the POI. It is envisaged that other types of label may also be used to mark the protein including, for example, organic dye molecules or radiolabels.

Accordingly, in a preferred embodiment of the invention, the measurement of expression comprises the detection of an optical signal, for example a fluorescent signal, a luminescent signal or colour signal. In a particularly preferred embodiment the optical signal is provided by a GFP reporter fused to the protein of interest.

The replacement codon selected from synonymous codons listed as alternatives in the relevant table(s) for a given host is the codon associated with the highest or optimal observed functional expression of the POI, or where more than one codon provides substantially equal such expression, one such codon corresponding with that level of expression. Where there is more than one replacement codon indicated for a given non-optimal codon based on the expression data, this corresponds to the first replacement codon appearing in the relevant table. Therefore, where there is a choice of codons indicated for a selected position based on the expression data, preference may be given to the first replacement codon appearing in the relevant table. Alternatively, preference may be given to the second replacement codon appearing in the relevant table. Routinely, in situations where the second or third preference codon is already present in the starting sequence, for convenience the codon in the starting sequence may be retained, i.e. the wild type codon in embodiments where the starting sequence is the wild-type sequence. This will minimise the number of codon changes to convert the starting sequence in a polynucleotide to the selected synonymous coding sequence for improved functional protein expression.
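The retention rule described above, under which a starting codon that is itself a listed preference is kept in order to minimise the number of changes, may be sketched as follows. The preference lists in `RANKED` are hypothetical placeholders rather than values from the relevant table(s), only two amino acids are shown for brevity, and the function name `choose_codon` is an assumption.

```python
# Hypothetical preference-ordered replacement codons per amino acid.
RANKED = {
    "Leu": ["CTG", "CTC", "TTG"],
    "Arg": ["CGT", "CGC"],
}

# Synonymous codon families for the two amino acids (standard genetic code).
SYNONYMS = {
    "Leu": {"CTT", "CTC", "CTA", "CTG", "TTA", "TTG"},
    "Arg": {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"},
}


def choose_codon(start_codon: str, retain_listed: bool = True) -> str:
    """Return the codon to use at a position, given the starting codon."""
    for amino_acid, family in SYNONYMS.items():
        if start_codon in family:
            ranked = RANKED[amino_acid]
            # Keep the starting codon if it is already a listed preference.
            if retain_listed and start_codon in ranked:
                return start_codon
            # Otherwise fall back to the first-listed replacement codon.
            return ranked[0]
    # Codons outside the illustrated families are left untouched.
    return start_codon


start = ["ATG", "CTC", "CTA", "CGC", "AGA"]
kept = [choose_codon(c) for c in start]
forced = [choose_codon(c, retain_listed=False) for c in start]
changes_kept = sum(a != b for a, b in zip(start, kept))
changes_forced = sum(a != b for a, b in zip(start, forced))
print(kept, changes_kept)
print(forced, changes_forced)
```

In this toy input, retaining listed codons requires two changes whereas always imposing the first-listed codon requires four, which illustrates the reduction in codon changes referred to above.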

EXEMPLIFICATION

The invention will now be illustrated below with reference to the following examples and figures, in which:

Figure 1 shows the influence of codon optimisation on protein yield, mRNA stability and translatability. Panel A is a graphical representation of the nucleotide content of the third codon position in the constructs for Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP) expression. GFP was also expressed without SP. Panel B is a graphical representation of protein yield in transformed Arabidopsis thaliana seedlings. For each plant analysed the protein yield in ng per mg total soluble protein (TSP) is plotted against the relative mRNA transcript concentration as compared to the A. thaliana household gene TIP-41. Panel C depicts protein yield in g per mg TSP at 2 to 5 days post infiltration (DPI), in transient expression in Nicotiana benthamiana leaves (native and optimised in black and grey bars, respectively); * indicates co-expression with the silencing inhibitor p19 of tomato bushy stunt virus. n=3, error bars indicate standard error.

Figure 2 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked nucleotide use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and nucleotide use (overall and for each codon position) were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.

Figure 3 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and codon use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.

Figure 4 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked amino acid use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and amino acid use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.

Figure 5 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon bias. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1% steps, the extremes (5% highest versus 5% lowest) were reached. With each step the synonymous codon use frequencies in both the high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- and the low-expressed gene pool. Finally, the difference in codon use frequency was correlated to the expression defining percentage (Spearman). The relation between the species based on this correlation is visualized in this heat map.

Figure 6 shows a graphical representation of mRNA structural features plotted against ranked expression with moving average (black line). The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined. These mRNA characteristics were then plotted against expression.

Figure 7 shows a heat map where the mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined and correlated with expression (Spearman) (Table 2). The heat map demonstrates that highly-expressed genes across all kingdoms prefer a stable, but 'airy' mRNA structure. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.

Figure 8 is a heat map showing correlations (Spearman) between mRNA structure characteristics and protein:mRNA ratios per species (Table 3), demonstrating that highly translated transcripts across kingdoms share a similar 'airy' structure. The mRNA structures of all genes of Escherichia coli (Eubacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with protein:mRNA ratios. Rank-normalized mRNA levels were divided by protein abundance (retrieved from PaxDB). Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.

Figure 9 shows mRNA structure predictions of the constructs used for heterologous protein expression. Sequences of the native and optimised variants of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional signal peptide (SP) and GFP without SP flanked by the 5' and 3'-UTRs as expected from our expression cassette were used to predict the mRNA secondary structure.

Figure 10 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked nucleotide use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and nucleotide content (overall and for each codon position) for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.

Figure 11 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked codon use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and codon use for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.

Figure 12 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked amino acid use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and amino acid use for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.

Figure 13 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences.

Figure 14 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.

Figure 15 shows a sequence alignment of native (nat) and optimized (opt) mIL-10 sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.

Figure 16 shows a sequence alignment of native (nat) and optimized (opt) OVA sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.

Example 1 - Codon optimisation improves mRNA stability and translatability

Wang and Roossinck (2006) previously compared overall codon use to the codon use in highly expressed genes in 11 plant species. Although the codons used most frequently in highly expressed genes (optimal codons) differed between monocots and dicots, the use of the same codons often increases with expression (expression codons). However, the authors did not express the optimised genes in plants. In the experiments shown here, one codon per amino acid that was most often identified as an expression codon across these 11 plant species was selected. Strikingly, most of these codons were C-ending, except for the amino acids Arg (CGT) and Gly (GGT). The codons of the amino acids Gln, Glu and Lys, that can only be encoded by A or G-ending codons, were G-ending. To investigate the effect of these codons on heterologous protein production in plants, the gene sequence of three genes was recoded with these codons. The genes of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) were chosen because of their variation in codon use (Figure 1a). To eliminate differences caused by translation initiation all genes were preceded by the signal peptide of Arabidopsis thaliana chitinase. GFP was also expressed without this signal peptide, as it is normally not secreted. The native and optimised variants of these four constructs were used to transform Arabidopsis thaliana using the floral dip method and their expression in seedlings was evaluated by determining mRNA transcript and protein levels (Figure 1b; Table 4). An increased protein yield found upon optimisation could be partly explained by an increase in mRNA transcript levels, i.e. increased mRNA stability (Table 4). Comparing protein:mRNA ratios of transformants within a similar mRNA expression range showed that codon optimisation resulted in more protein per mRNA transcript. Thus, codon optimisation also resulted in increased mRNA translatability.

Upon transient transformation transcript levels are always much higher. An increase in mRNA stability and translatability may then no longer improve protein yield. Therefore, protein yield upon transient expression of the three genes in Nicotiana benthamiana was also determined, with and without co-expression of the gene silencing inhibitor p19 of tomato bushy stunt virus (Figure 1c; Table 5). Also upon transient expression, codon optimisation led to higher protein yield on all days for all genes, except for OVA unless p19 was co-expressed. In most cases co-expression of p19 had a favourable effect on protein yield independent of optimisation. This is not surprising, as mRNA transcript levels are always high in transient expression, which increases the risk of gene silencing. Thus, the mRNA of the optimised variant of OVA must have been more sensitive to gene silencing compared to the native variant.

Construct | Variant | n | Relative mRNA conc. | Fold change | Protein yield | Fold change | Relative mRNA conc. range | n | Protein:mRNA ratio | Fold change
GFP | N | 32 | 0.88 | | 17.03 | | 0.8-2.7 | 4 | 22.8±2.70 |
GFP | O | 23 | 9.25 | [illegible] | 1276 | 75*** | 0.9-2.5 | 4 | 161±58.5 | 7.1
SP-GFP | N | 26 | 1.63 | | 33.28 | | 1.4-4.9 | 11 | 18.0±5.16 |
SP-GFP | O | 24 | 9.53 | 5.8* | 399.5 | 12** | 1.2-4.8 | 12 | 63.9±14.5 | 3.5*
SP-OVA | N | 26 | 2.37 | | 717.3 | | 2.0-5.3 | 12 | 356.2±142.5 |
SP-OVA | O | 30 | 5.62 | 2.4*** | 3937 | 5.5*** | 2.2-5.5 | 23 | 1014±121.7 | 2.8**
SP-IL-10 | N | 17 | 1.37 | | 3.30 | | 1.7-4.2 | 8 | 1.26±0.43 |
SP-IL-10 | O | 25 | 4.23 | [illegible] | 17.9 | [illegible] | 1.7-4.1 | 16 | 6.68±1.02 | [illegible]

Table 4. Codon optimisation of GFP, interleukin-10 and ovalbumin genes boosts expression in Arabidopsis thaliana. Average relative mRNA transcript concentration as compared to the A. thaliana household gene TIP-41 and protein yield in g per mg total soluble protein (TSP) determined in A. thaliana seedlings upon stable transformation of native (N) and optimised (O) sequences of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP). GFP was also expressed without signal peptide. Protein:mRNA ratios were calculated. Because translatability may be lower with a higher mRNA concentration due to the limited number of free ribosomes, the protein:mRNA ratios were calculated for samples within the same mRNA concentration range, as indicated. The fold change when comparing the optimised to the native variant was calculated for the relative mRNA concentration, protein yield and protein:mRNA ratio and is listed in the row of the optimised variant. For each average the number of included seedlings is indicated (n). Significance of fold changes was calculated with Welch's t-test: * P<0.05, ** P<0.01, *** P<0.001.

dpi 2-5 dpi 5 + p19

Protein yield Fold change Protein yield Fold change

GFP N 5- O 23 I 34 «

SP-GFP N 1

3.2^** 2.1

O 3.2 9.2

SP-OVA N 30

¹⁷ 0.7 2.0^*

O 12 61

SP-IL-10 N 8 4

1 .4

O 21 24

Table 5. Codon optimisation boosts protein yield in transient expression in Nicotiana benthamiana. Average protein yield in g per mg total soluble protein determined in N. benthamiana leaves upon transient transformation of native (N) and optimised (O) sequences of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP) (GFP was also expressed without SP) at 2 to 5 days post infiltration (dpi) (n=12) or at 5 dpi, whereby tested genes were co-expressed with the viral silencing inhibitor p19 of tomato bushy stunt virus (n=3). Significance of the fold change in protein yield was calculated with Welch's t-test: ^* P<0.05, ^** P<0.01, ^*** P<0.001.
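The fold changes and the Welch's t-test named in the table legends can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the sample values in the usage note are invented, and only the t statistic and the Welch-Satterthwaite degrees of freedom are computed (a P value additionally needs the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`).

```python
from statistics import mean, variance


def fold_change(optimised, native):
    """Ratio of the mean of the optimised samples to the mean of the native samples."""
    return mean(optimised) / mean(native)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two samples may have unequal variances
    and unequal sizes.
    """
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For example, with the invented yields `[2, 4, 6]` (optimised) and `[1, 2, 3]` (native), `fold_change` returns 2.0 and `welch_t` returns a t statistic of about 1.55 with about 2.94 degrees of freedom.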

Evaluating the average yield from dpi 2-5 or with co-expression of p19 on dpi 5 revealed a lower yield increase upon codon optimisation compared to stable expression in A. thaliana. This is not surprising as at least some of the gain in mRNA stability due to the codon optimisation is compensated by the increased transcription in transient expression. Whether this gain in protein yield is predominantly the result of an increase in mRNA translatability or a combination of a gain in mRNA stability and translatability remains to be determined.

To explain the differences found in mRNA stability, first the thermodynamic stability of the predicted secondary mRNA structures was calculated. Upon codon optimisation the minimum free folding energy had decreased, indicative of a more stable mRNA, from -0.25 to -0.35 and -0.31 to -0.33 kcal/mol/nt for GFP and OVA, respectively. However, for IL-10, the minimum free folding energy increased from -0.31 to -0.28 kcal/mol/nt, indicating a less stable mRNA. Thus, an overall increase in physical stability could not explain the increased mRNA transcript levels of IL-10. However, it is still possible that unstable regions of IL-10 were removed upon codon optimisation, while the overall stability decreased.

In vivo mRNA half-life is predominantly controlled by other factors than physical stability, namely; the occurrence of a splicing event, through AU-rich destabilizing elements in the UTRs, and the presence of sequences that are targets for microRNA. In our experiments, all genes were expressed using the same expression controlling components, thus contained the same UTRs and did not contain introns. However, the sequences of the ORFs varied greatly between the native and optimised variants (78, 76 and 83% homology for GFP, OVA and IL-10, respectively). Therefore, there could be a difference in the presence of microRNA targets and also a difference in the occurrence of stretches of double stranded (ds)RNA between the native and optimised variants. The dsRNA stretches could be processed to small interfering RNAs and, like binding of microRNAs, can trigger gene silencing. In stable expression, gene silencing can also be due to gene methylation, but this always results in the complete absence of transcripts and therefore transformants without detectable expression were not considered. In our transient expression experiment co-expression of the silencing inhibitor p19 gave comparable results. Taken together, differences in mRNA decay based on above mentioned sequence features are unlikely to explain the differences in mRNA stability in our experiments. Translation has also been linked to mRNA decay. Ribosomes can shield nuclease target sites, however, in large-scale in vivo studies mRNA half-life could not be linked to the number of nuclease target sites or ribosomal density. When translation initiation is equal, as is expected in our experiments, an increase in translatability should result in a lower density of ribosomes. Thus, there would have been fewer ribosomes on the optimised variants compared to their native counterparts, and the optimised variants would be less protected against nucleases. 
While translation per se may not influence mRNA half-life, errors in translation have been proven to lead to mRNA degradation by mRNA surveillance mechanisms. Three mRNA surveillance mechanisms have been identified: I) nonsense mediated decay by the recognition of a premature stop codon, II) non-stop decay by the lack of a stop codon and III) no-go decay by stalled ribosomes. Occurrence of a premature stop codon or the lack of a stop codon can be caused by a mutation or a ribosomal slip causing a frame-shift. Frame-shifts can be caused by a 'slippery' sequence that may be found in proximity of a strong mRNA structure. A ribosome may also stall at a strong stem-loop structure without slipping and trigger degradation. It is possible that the native and optimised variants differ in the presence of 'slippery' sequences and/or strong mRNA structures. Thus, differences in level of translation-linked mRNA decay may explain the difference in mRNA transcript levels in our experiment. In addition, ribosomes have intrinsic helicase activity and recently it was shown that strong mRNA structures such as pseudoknots and hairpins can stall translation only temporarily. It is therefore thought that the mRNA structure provides a mechanical basis for cellular regulation of translation rate. Thus, increased mRNA translatability of the optimised genes may be explained by an increased translation rate caused by differences in the mRNA structure.

Example 2 - General codon bias extends to other kingdoms of life

The existence of codon biases in different species has implications for the efficient expression of heterologous proteins in a range of host cells. To investigate if the general codon bias in plants transcends kingdoms of life, expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) was interrogated. Per species >250 microarrays originating from several studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues were used (Table 1A-F). First, the expression was ranked and the average rank was used as a measure of overall expression. Subsequently, the correlation between expression and nucleotide content was analysed per species. The relation between the species based on this correlation was visualized in a heat map (Figure 2).
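The ranking and nucleotide-content steps described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, ranks are scaled to the interval (0, 1] with ties sharing their average rank (an assumption, since the text does not specify tie handling), and the correlation step itself is not shown.

```python
def rank_normalise(values):
    """Return ranks scaled to (0, 1]; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg / len(values)
        i = j + 1
    return ranks


def average_rank(arrays):
    """Average the per-array ranks; every array lists the genes in the same order."""
    per_array = [rank_normalise(a) for a in arrays]
    n_genes = len(arrays[0])
    return [sum(r[g] for r in per_array) / len(per_array) for g in range(n_genes)]


def nucleotide_fraction(cds, nucleotide, position=None):
    """Fraction of `nucleotide` in a coding sequence.

    position=None counts all positions; position 1, 2 or 3 restricts the
    count to that codon position.
    """
    seq = cds.upper()
    if position is not None:
        seq = seq[position - 1::3]
    return seq.count(nucleotide) / len(seq)
```

For the toy sequence `ATGGCTGGT`, `nucleotide_fraction(..., "G")` gives 4/9 overall and 2/3 at the first codon position. The per-gene values would then be correlated (Spearman) with the averaged ranks.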

Surprisingly, a strong positive correlation between expression and overall G content, in particular G on the first codon position and a negative correlation between expression and A and T on the first codon position was found across all kingdoms. Next, the correlation between expression and codon use was evaluated (Figure 3). Across all kingdoms the use of CGT (Arg/R), AAG (Lys/K), GGT (Gly/G), GTT (Val/V) and GCT (Ala/A) is positively correlated with expression. However, the fact that the nucleotide contents of the first and second codon position are correlated with expression indicates that there is a correlation between amino acid usage and expression. Highly expressed genes are relatively rich in the amino acids encoded by G-starting triplets: Ala, Gly, and Val (Figure 4).

First, to uncouple the amino acid bias from the codon use bias, the relative synonymous codon use was calculated. Subsequently, a comparison was made between high- and low-expressed genes, as a correlation between codon use and expression may only be found in genes expressed above a certain threshold. Genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1% steps, the pools with 5% highest and 5% lowest expressed genes were reached. With each step the codon use frequencies in both high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- and the low-expressed gene pool. Finally, the difference in codon use frequency was correlated (Spearman) to the expression defining percentage. The relation between the species based on this correlation was visualized in a heat map (Figure 5; Table 6A-E show codon use frequencies of all, the bottom 5% low- and top 5% high-expressed genes and fold codon use change (top/bottom) per species).
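The stepwise pooling described above can be sketched for a single codon as follows. This is a hedged sketch, not the analysis code used for Figure 5: the function names and the data layout (one codon-count dictionary per gene, sorted from lowest to highest expression) are assumptions, and the sign convention is a choice made here. Because the difference is correlated with the pool percentage itself, a codon that becomes more enriched towards the extremes yields a negative value; flipping the sign is purely presentational.

```python
def spearman(x, y):
    """Spearman rank correlation, with average ranks for ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2 + 1
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def synonymous_fraction(genes, codon, family):
    """Share of `codon` among all codons of its synonymous family in a gene pool."""
    total = sum(g.get(c, 0) for g in genes for c in family)
    return sum(g.get(codon, 0) for g in genes) / total if total else 0.0


def bias_vs_cutoff(genes_by_expression, codon, family, start=50, stop=5):
    """Correlate the high-minus-low difference in codon use with the pool percentage."""
    n = len(genes_by_expression)
    cutoffs, diffs = [], []
    for pct in range(start, stop - 1, -1):  # 50%, 49%, ..., 5%
        k = max(1, n * pct // 100)
        low, high = genes_by_expression[:k], genes_by_expression[-k:]
        diffs.append(synonymous_fraction(high, codon, family)
                     - synonymous_fraction(low, codon, family))
        cutoffs.append(pct)
    return spearman(cutoffs, diffs)
```

Using the synonymous frequency within one amino acid family (for example AAG versus AAA for Lys) is what separates the codon bias from the amino acid bias in this sketch.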

Strikingly, when clustering the correlations between the 5 species, E. coli, S. cerevisiae, C. elegans and A. thaliana group together well. M. musculus seems to have an overall lower codon bias and in ~50% of the cases selects for other codons compared to the overall selection of the other species. Excluding M. musculus, 13 codons are positively correlated with expression for all species. These 13 codons encode 11 different amino acids and a termination of translation (twice a codon for Thr/T). Comparable to the general codon bias found in plants, 8 of these 13 codons are C-ending. Furthermore, 18 codons are consistently negatively correlated with expression in these four species. Of these codons most are A-ending (8), while none of them are C-ending. Strikingly, 5 universal codons were found which were positively correlated with expression for all species, indicating that these codons are conserved in the coding sequences of highly-expressed genes across all kingdoms of life and could therefore find useful application in methods of optimising functional protein expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. In addition, several codons were found which were positively correlated with further increases in expression in E. coli, S. cerevisiae and C. elegans. Furthermore, in addition to the universal set of codons, several codons were found to be positively correlated with increases in expression in E. coli, S. cerevisiae, C. elegans and Mus musculus. Separately, several codons were found to be positively correlated with increased expression in A. thaliana.

Taken together the data suggest that a conserved selection pressure influences expression across all kingdoms of life. Heterologous protein expression experiments suggested a role for the mRNA structure in translation rate. As the translational machinery does not vary greatly across kingdoms, the mRNA structure is a likely candidate to be the driving force behind this selection pressure.

Example 3 - Highly expressed genes prefer a stable, but 'airy' mRNA structure

To evaluate if the mRNA structure could be the driver of selection that gives rise to the observed general codon bias, the relationship between expression and mRNA structure characteristics was evaluated. To this end, the mRNA structures of all genes were predicted, and gene length, minimal free folding energy, number of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and plotted against expression (Figure 6; Table 7). A heat map displaying the relation between the species based on the correlation (Spearman) between these structure characteristics and expression was also generated (Figure 7; Table 7). This heat map demonstrates that the number of bound nucleotides and the number of stem/loop transitions were consistently positively correlated, and mean loop size consistently negatively correlated, with expression across all species.
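The structure characteristics listed above can be derived from a predicted secondary structure, as in the following sketch. It assumes the prediction is available in dot-bracket notation (brackets for bound nucleotides, dots for unbound ones), which the text does not state; the function name and the returned keys are hypothetical. Stems and loops are taken, as in the text, to be uninterrupted stretches of bound and unbound nucleotides.

```python
from itertools import groupby


def structure_features(dot_bracket):
    """Summarise a secondary structure given in dot-bracket notation."""
    bound = [c in "()" for c in dot_bracket]
    # Collapse the sequence into runs of consecutive bound / unbound nucleotides.
    runs = [(is_bound, len(list(group))) for is_bound, group in groupby(bound)]
    stems = [length for is_bound, length in runs if is_bound]
    loops = [length for is_bound, length in runs if not is_bound]
    n = len(dot_bracket)
    return {
        "bound_fraction": sum(stems) / n,
        "mean_stem_size": sum(stems) / len(stems) if stems else 0.0,
        "mean_loop_size": sum(loops) / len(loops) if loops else 0.0,
        # Each boundary between adjacent runs is one stem/loop transition.
        "transitions_per_nt": (len(runs) - 1) / n,
    }
```

For the toy structure `((((...))))..` (13 nucleotides) this gives a bound fraction of 8/13, a mean stem size of 4, a mean loop size of 2.5 and 3/13 transitions per nucleotide. Dividing a predicted folding energy by the sequence length would give the per-nucleotide energy in the same way.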

The positive correlation with the number of bound nucleotides indicates a general adaptation towards a more stable mRNA molecule. Also, a low folding energy (more stable) is correlated with high expression in S. cerevisiae, C. elegans and A. thaliana, but not in E. coli and M. musculus. Still, in E. coli there seems to be a relation between expression and folding energy, as is demonstrated by the trend line that indicates an optimum (Figure 6). An optimum in mRNA stability may indicate a trade-off between stability and translatability in this species. A trade-off in stability and translatability may also explain why there is a correlation between mRNA folding energy and expression in S. cerevisiae, C. elegans and A. thaliana. These species have an overall lower G+C content resulting in on average weaker mRNAs (Table 7) and have therefore more to gain in terms of stability before translatability is affected.

The number of stem-loop transitions and mean loop size are also correlated with expression (positive and negative, respectively) in all species, which suggests that there is a general adaptation towards dividing nucleotide bonds equally over the mRNA molecule. In other words, highly expressed genes prefer a stable, but 'airy' mRNA molecule. This again indicates a trade-off between mRNA stability and translatability. It is striking that while folding energy in S. cerevisiae, C. elegans and A. thaliana is on average much higher (less stable mRNA) (6-10%) compared to E. coli and M. musculus, the fraction of bound nucleotides, mean stem and loop size and number of transitions do not differ that much (Table 7). This means that while the mRNA folding energy may differ between species with different G+C content, the overall mRNA structure characteristics are more similar across species.

Taken together our data indicate that there is a general selection towards an optimal folding energy across kingdoms of life whereby number and type of nucleotide bonds (e.g. A-U and G-U bonds are weaker than G-C bonds) are balanced with short loops to facilitate efficient translation. This is in line with the observation that translation rate is greatly influenced by G+C content and strong mRNA structures.

Table 7. mRNA characteristics of highly expressed genes per species.

Averages of mRNA characteristics of the top 5% high-expressed genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

A link between mRNA structure and expression may explain the increase in mRNA stability and translatability in the heterologous protein expression experiments disclosed herein. Therefore the mRNA structures of the native and optimised variants of the expressed genes were predicted and evaluated (Figure 9; Table 8). Optimised variants of GFP and OVA had a lower (more negative) folding energy, indicative of a more stable mRNA. All optimised variants had an increased number of stem-loop transitions (except SP-GFP), which is in line with a more 'airy' mRNA molecule. Thus, although changes in the mRNA structure upon optimisation differ from gene to gene, an improved mRNA structure could be the basis of increased protein yield in our experiments.

Construct | Variant | Energy (kcal/mol/nt) | Bound nt's (fraction) | Mean stem size | Mean loop size | Transitions
GFP | N | -0.21 | 0.56 | 5.74 | 4.48 | 0.097
GFP | O | -0.33 | 0.57 | 5.15 | 3.85 | 0.111
SP-GFP | N | -0.22 | 0.57 | 5.21 | 3.89 | 0.109
SP-GFP | O | -0.32 | 0.54 | 5.22 | 4.31 | 0.104
SP-OVA | N | -0.29 | 0.61 | 5.28 | 3.34 | 0.116
SP-OVA | O | -0.31 | 0.55 | 4.38 | 3.56 | 0.126
SP-IL-10 | N | -0.29 | 0.60 | 5.02 | 3.29 | 0.120
SP-IL-10 | O | -0.27 | 0.54 | 4.08 | 3.47 | 0.131

Table 8. Calculated mRNA structure characteristics of the constructs used for heterologous protein expression. Analysis of the mRNA secondary structure predictions given in Figure 9. Folding energy, bound nucleotides and number of transitions are corrected for gene length. Stem and loop sizes are mean values.

Example 4 - A more 'airy' mRNA increases translation rate

On a cellular level translation efficiency was demonstrated to be the most important factor in controlling protein abundance whereas protein turnover plays only a minor role. Therefore, protein:mRNA ratio is a good proxy of translation rate. To evaluate if the mRNA structure characteristics found to be linked to expression are also linked to translation rate, the expression data was combined with large-scale protein abundance data retrieved from PaxDB. To evaluate to what extent the expression data predicts protein abundance, the correlation (Spearman) between the expression data and the protein abundance was calculated: E. coli 0.59, S. cerevisiae 0.67, C. elegans 0.59, A. thaliana 0.62 and M. musculus 0.36. When the relationship between the protein:mRNA ratio and the previously mentioned mRNA structure characteristics was evaluated, a similar picture as when using the expression data was obtained (Figure 8; Table 3; the heat maps in Figures 10-12 demonstrate the relation between species based on correlations of protein:mRNA ratio and nucleotide content, codon use and amino acid use).

mRNA characteristic | E. coli | S. cerevisiae | C. elegans | A. thaliana | M. musculus
Gene length | -0.146 | -0.116 | -0.180 | -0.139 | -0.288
Energy (kcal/mol/nt) | 0.043 | -0.237 | -0.212 | -0.138 | 0.087
Bound (fraction) | -0.009 | 0.148 | -0.006 | 0.062 | -0.058
Mean stem size | -0.193 | -0.011 | -0.216 | -0.058 | -0.121
Mean loop size | -0.121 | -0.182 | -0.139 | -0.105 | -0.015
Transitions/nt | 0.199 | 0.140 | 0.213 | 0.104 | 0.081

Table 3. Correlations (Spearman) between mRNA structure characteristics and mRNA:protein ratios per species. The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with mRNA:protein ratios. Rank-normalized mRNA levels were divided by protein abundance (retrieved from PaxDB).

As with expression, the number of stem-loop transitions is positively correlated with protein:mRNA ratio and mean loop size is negatively correlated across all species. Also, the folding energy is negatively correlated (more stable mRNA) for S. cerevisiae, C. elegans and A. thaliana, but not for E. coli and M. musculus. However, in contrast to the expression data, gene length is consistently negatively correlated with protein:mRNA ratio. This is in line with the fact that the packing density of ribosomes was shown to decrease with mRNA transcript length. Also, a negative correlation with mean stem size is found for all species and the fraction of bound nucleotides is not correlated, except for S. cerevisiae. Thus, small stem size must be important for an increased translation rate. This again highlights the tradeoff between mRNA stability and translatability.

Example 5 - Construct design

The native and optimised sequences coding for Aequorea victoria green-fluorescent protein (GFP) (L29345.1; nt 7-807), Gallus gallus ovalbumin (OVA) (NM_205152.2; nt 4-1161) and Mus musculus interleukin-10 (IL-10) (NM_010548.2; nt 63-537) together with the optimised sequence for the Arabidopsis thaliana basic chitinase signal peptide (cSP) (BAA82810.1; nt 15-33) were synthetically made by GeneArt (Thermo Fisher Scientific, Breda, the Netherlands). Optimisation was performed by recoding the protein sequences using the C-ending codons for all amino acids (TCC in the case of Ser), except Arg and Gly, for which the T-ending codons were used, and Gln, Glu and Lys, for which the G-ending codons were used. Synonymous mutations to either native or optimised sequences were sometimes introduced to remove undesired restriction sites and the cryptic splice sites in native GFP (Reichel et al., 1996, PNAS, 93:5888-5893). Gene fragments were flanked with sequences including the restriction sites NcoI (5') and EagI-BspHI (3') for cSP, EagI (3') and KpnI (5') for IL-10 and OVA and NcoI (3') and KpnI (5') for GFP to allow fragment assembly and subsequent in-frame cloning into the plant expression vector pHYG (Westerhof et al., 2012, PLoS One, 7: e46460). Fragment assembly was accomplished by the in-frame ligation of cSP with IL-10 and OVA using the EagI site and cSP with GFP using the BspHI (cSP) and NcoI (GFP) sites. ORFs were confirmed by sequencing in expression vector stage. All vectors were transformed to Agrobacterium tumefaciens strain GV3101 for stable transformation of Arabidopsis thaliana or MOG101 for agroinfiltration in Nicotiana benthamiana.
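The recoding rule described above can be sketched as a simple lookup. The table below follows the stated rule (C-ending codons, TCC for Ser, the T-ending codons CGT and GGT for Arg and Gly, and the G-ending codons for Gln, Glu and Lys); Met and Trp keep their single codons. The names `EXPRESSION_CODONS` and `recode` are hypothetical, and stop codons as well as the removal of restriction or splice sites are deliberately left out of this sketch.

```python
# One codon per amino acid, following the recoding rule in the text.
EXPRESSION_CODONS = {
    "A": "GCC", "R": "CGT", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGT", "H": "CAC", "I": "ATC",
    "L": "CTC", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "TCC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTC",
}


def recode(protein):
    """Back-translate a one-letter protein sequence with the codon table above.

    Raises KeyError for symbols outside the 20 standard amino acids
    (including a stop symbol, which this sketch does not handle).
    """
    return "".join(EXPRESSION_CODONS[aa] for aa in protein.upper())
```

For example, `recode("MRGSK")` returns `ATGCGTGGTTCCAAG`. Because each amino acid maps to exactly one codon, the amino acid sequence is unchanged by construction.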

Example 6 - Stable transformation of Arabidopsis thaliana

Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10 g/l peptone 140, 5 g/l yeast extract, 10 g/l NaCl, pH 7.0) containing 50 μg/ml kanamycin. Bacterial cultures were centrifuged for 15 min at 2800 g and resuspended in MMA (20 g/l sucrose, 5 g/l MS-salts, 1.95 g/l MES, pH 5.6) containing 200 μM acetosyringone and 0.03% Silwet L-77 until an OD of 0.5 was reached. Arabidopsis thaliana plants were submerged in the bacterial suspension for 1 min and kept in a moist environment for 2 days. Plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) until seeds could be collected. Seeds were sterilized by 4-hour exposure to chlorine gas and plated on basic agar plates (8 g/l Bacto Agar, 0.101 g/l KNO3) containing 30 ng/ml hygromycin and 100 μg/ml cefotaxime. Plates were kept in the dark at 4°C for 2 days, then placed in artificial light for 7 hours at 24°C, again kept in the dark at RT for 5 days and finally placed in a climate chamber with a 12-hour light regime at 24°C for 2 days. At this stage 10 to 40 seedlings per transformant plant were selected and placed in individual pots with Knop agar (1x Knop, 1% sucrose, 8 g/l Plant Agar, pH 6.4) containing 30 μg/ml hygromycin and 100 μg/ml cefotaxime. Seedlings that showed good growth and root formation after 10 days were transferred to fresh pots and allowed to grow for 2 more weeks. Thereafter plants were harvested and snap-frozen. Plant material was homogenized using a TissueLyser II (Qiagen) and stored at -80°C until further use.

Example 7 - Transient transformation of Nicotiana benthamiana

Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10 g/l peptone 140, 5 g/l yeast extract, 10 g/l NaCl, pH 7.0) containing 50 μg/ml kanamycin and 20 μg/ml rifampicin. The optical density (OD) of the o/n cultures was measured at 600 nm and used to inoculate 50 ml of LB medium containing 200 μM acetosyringone and 50 μg/ml kanamycin with x μl of culture using the following formula: x = 80000/(1028OD). OD was measured again after 16 hours and the bacterial cultures were centrifuged for 15 min at 2800 g. The bacteria were resuspended in MMA infiltration medium (20 g/l sucrose, 5 g/l MS-salts, 1.95 g/l MES, pH 5.6) containing 200 μM acetosyringone until an OD of 1 was reached. All constructs were co-expressed with the tomato bushy stunt virus silencing inhibitor p19 by mixing Agrobacterium cultures 1:1. After 1-2 hours of incubation at room temperature, the two youngest fully expanded leaves of 5-6-week-old Nicotiana benthamiana plants were infiltrated completely. Infiltration was performed by injecting the Agrobacterium suspension into a Nicotiana benthamiana leaf at the abaxial side using a 1 ml syringe. Infiltrated plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) and infiltrated leaves were harvested at selected time points.

Example 8 - Determination of heterologous gene expression

Total RNA was isolated from homogenized plant material using the RNeasy Plant Mini Kit (Qiagen) according to the supplier's protocol. A Turbo DNaseI (Ambion) treatment was included to remove any residual DNA. cDNA was synthesised using the SuperScript® III First-Strand Synthesis System (Invitrogen) according to the supplier's protocol using an oligo(dT) primer. Samples were analysed by quantitative PCR in triplicate using ABsolute SYBR Green Fluorescein mix (Thermo Scientific). Arabidopsis thaliana TIP-41 (AY074349.1) was used as a reference gene. The oligonucleotides used for amplification of both native and optimised IL-10, OVA and GFP, and of TIP-41, were 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 2] / 5'-GGAAGTGGGTGCAGTT-3' [SEQ ID NO: 3]; 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 4] / 5'-GGGCAGTAGAAGATGTTC-3' [SEQ ID NO: 5]; 5'-GACGGTAACTACAAGACC-3' [SEQ ID NO: 6] / 5'-TTGTCGGCCATGATGTA-3' [SEQ ID NO: 7]; and 5'-GCTCATCGGTACGCTCTTTT-3' [SEQ ID NO: 8] / 5'-TCCATCAGTCAGAGGCTTCC-3' [SEQ ID NO: 9], respectively. Relative transcript levels of the genes versus TIP-41 were determined by the Pfaffl method (Pfaffl, 2001, Nucleic Acids Research, 29: e45).
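For readers unfamiliar with the Pfaffl method, the relative expression ratio it defines can be written as a short Python function. This is a minimal sketch of the published formula only; the function name, the amplification efficiencies and the Ct values in the usage example are hypothetical and are not measurements from this work.

```python
def pfaffl_ratio(e_target: float, ct_target_control: float, ct_target_sample: float,
                 e_ref: float, ct_ref_control: float, ct_ref_sample: float) -> float:
    """Relative expression ratio of a target gene versus a reference gene.

    ratio = E_target ** (Ct_control - Ct_sample)_target
            / E_ref ** (Ct_control - Ct_sample)_ref

    E is the amplification efficiency per cycle (2.0 means perfect doubling).
    """
    delta_target = ct_target_control - ct_target_sample
    delta_ref = ct_ref_control - ct_ref_sample
    return (e_target ** delta_target) / (e_ref ** delta_ref)


if __name__ == "__main__":
    # Hypothetical numbers: the target crosses threshold 3 cycles earlier in
    # the sample than in the control, while the reference gene is unchanged.
    print(pfaffl_ratio(2.0, 25.0, 22.0, 2.0, 20.0, 20.0))  # 8.0
```

In the setting described above the reference gene would be TIP-41, and the efficiencies would normally be estimated from a dilution series for each primer pair.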

Example 9 - Determination of heterologous protein expression

Homogenized plant material was ground in ice-cold extraction buffer (50 mM phosphate-buffered saline (PBS) pH 7.4, 100 mM NaCl, 10 mM ethylenediaminetetraacetic acid (EDTA), 0.1% v/v Tween-20, 2% w/v immobilized polyvinylpolypyrrolidone (PVPP)) using 2 ml/g fresh weight. Crude extract was clarified by centrifugation at 16,000 x g for 5 min at 4°C and the supernatant was used directly in an ELISA and a BCA protein assay. Mouse IL-10 expression levels were determined using the Mouse IL-10 ELISA Ready-SET-Go! kit (eBioscience) according to the supplier's protocol. For the quantification of OVA and GFP, a rabbit anti-ovalbumin or a chicken anti-GFP antibody (both from Rockland Immunochemicals Inc.) was used to coat ELISA plates o/n at 4°C in a moist environment. After this and each following step the plate was washed 5 times with 30 sec intervals in PBST (1x PBS, 0.05% Tween-20) using an automatic plate washer (BioRad model 1575). The plate was blocked with assay diluent (eBioscience) for 1 h at room temperature. Samples and standard lines were loaded in serial dilutions and incubated for 1 h at room temperature. Standard lines were made from purified chicken ovalbumin (Sigma) or recombinant GFP (Roche). For detection of OVA and GFP, a rabbit anti-ovalbumin:HRP antibody or a rabbit anti-GFP:HRP antibody (both from Rockland Immunochemicals Inc.) was used, respectively. A 3,3',5,5'-tetramethylbenzidine (TMB) substrate (eBioscience) was added and the colouring reaction was stopped using stop solution (0.18 M sulphuric acid) after 1-15 min. Read-outs were performed using the model 680 microplate reader (BioRad) to measure the OD at 450 nm with a correction filter of 690 nm. For sample comparison, total soluble protein (TSP) concentration was determined using the BCA Protein Assay Kit (Pierce) according to the supplier's protocol using bovine serum albumin (BSA) as a standard.

Example 10 - Gene expression datasets

Gene expression datasets of 5 species (Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus) were downloaded from Gene Expression Omnibus (GEO). Gene-expression sets were selected based on platform (Affymetrix), release date (not earlier than 2008), the presence of a publication linked to the GEO set, and the number of samples in the study. In total 2067 gene-expression profiles were collected, representing 8 or 9 different studies per organism. An overview can be found in Table 1A-F.

Example 11 - Protein abundance datasets

Protein abundance datasets were retrieved from PaxDb (Wang et al., 2012, Mol Cell Proteomics, 11: 492-500), where the integrated datasets of Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus were downloaded.

Example 12 - Gene expression normalization

Gene expression was normalized based on rank. Per species one array platform was used and per species probes were ranked according to their intensities. The average rank per probe was used as a measure of overall gene expression to distinguish genes with overall low and high expression levels for each species.

Example 13 - mRNA Sequences

The coding sequences (CDS) of all genes of 5 species were downloaded from sequence/genome repositories. For Escherichia coli, the CDS of strains CFT073, EDL933, MG1655 and Sakai were obtained from NCBI, accessions NC_004431.1, NC_002655.1, NC_U00096.3 and NC_002695.1, respectively. For Arabidopsis thaliana, the CDS of the 20101108 release were obtained from TAIR (Lamesch et al., 2012, Nucleic Acids Research 40: D1202-1210). For Saccharomyces cerevisiae, the open reading frames (without UTR, introns, etc.) of the 20110203 release were obtained from the Saccharomyces Genome Database (Cherry et al., 2012, Nucleic Acids Research 40: D700-705). For Caenorhabditis elegans, the CDS of WS241 were obtained from WormBase (Yook et al., 2012, Nucleic Acids Research 40: D735-741). For Mus musculus, the CDS of the 20130508 release (GRCm38.p1) were obtained from the NCBI CCDS database (Farrell et al., 2014, Nucleic Acids Research 42: D865-872).

Example 10 - mRNA folding

The mRNAs of all species were folded using Vienna RNAfold (Lorenz et al., 2011, Algorithms for Molecular Biology 6: 26) at 20°C, using the parameters of Andronescu et al. (Andronescu et al., 2007, Bioinformatics 23: i19-28). The M. musculus mRNA was also folded at 37°C and the S. cerevisiae mRNA also at 30°C, but all the reported comparisons are based on 20°C.

Example 11 - mRNA sequence and structure statistics

Several statistics were taken from the mRNA sequence: gene length, codon usage, and nucleotide usage. Also from the predicted mRNA structure several statistics were taken: number of bound nucleotides, number of free nucleotides, average stem size, average loop size, variation in stem size, variation in loop size, and energy of the structure.
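The structure statistics listed above can be derived from a predicted structure in dot-bracket notation, as produced by RNA folding programs. The following Python sketch is illustrative only and rests on stated assumptions: a stem is taken to be an uninterrupted stretch of bound nucleotides and a loop an uninterrupted stretch of unbound nucleotides (the definition given in the caption of Table 2), "variation" is computed as a population standard deviation, and the function name and dictionary keys are hypothetical.

```python
import re
from statistics import mean, pstdev


def structure_stats(dot_bracket: str) -> dict:
    """Summarise a dot-bracket structure as runs of bound/unbound nucleotides."""
    # Assumption: stem = run of '(' or ')' characters, loop = run of '.'.
    stems = [len(run) for run in re.findall(r"[()]+", dot_bracket)]
    loops = [len(run) for run in re.findall(r"\.+", dot_bracket)]
    length = len(dot_bracket)
    # Runs alternate, so the number of stem/loop transitions is runs - 1.
    transitions = max(len(stems) + len(loops) - 1, 0)
    return {
        "length": length,
        "bound": sum(stems),
        "free": sum(loops),
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "sd_stem": pstdev(stems) if stems else 0.0,  # population SD (assumed)
        "sd_loop": pstdev(loops) if loops else 0.0,
        "max_stem": max(stems, default=0),
        "max_loop": max(loops, default=0),
        "transitions": transitions,
        "transitions_per_1000": 1000 * transitions / length if length else 0.0,
    }


if __name__ == "__main__":
    stats = structure_stats("..((((...))))...(((....))).")
    for key, value in stats.items():
        print(key, value)
```

The folding energy is not computed here; it would come directly from the folding program together with the structure string.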

Example 12 - Gene expression and mRNA folding statistics

The Spearman correlations between gene expression and the various mRNA-based statistics were calculated in R 3.0.2 x64. For some of the factors a correction for gene length was applied. These were: number of bound nucleotides, number of unbound nucleotides, energy of the structure, number of stems, number of loops, triplet usage, nucleotide usage, and amino acid usage.
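The correlations reported here were computed in R. Purely as an illustration of what a Spearman correlation measures, the sketch below re-implements it in Python as a Pearson correlation on average ranks. The function names are hypothetical, and the gene-length correction is not reproduced because the text does not specify its exact form.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical expression ranks versus a per-gene structure statistic.
    print(spearman([1, 2, 3], [1, 3, 2]))  # 0.5
```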

For expression codon analysis, the frequencies of use of synonymous codons were calculated. This was done over a receding window, from 50% highest versus 50% lowest until 5% highest versus 5% lowest, in increments of 1%.
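The windowed comparison of synonymous codon use can be sketched as follows. This is a simplified illustration under stated assumptions: codon counts are pooled over all genes in a subset rather than averaged per gene, the standard genetic code is used, and the names `relative_codon_use` and `top_bottom_ratio` as well as the toy input are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_codon_use(cds_list):
    """Fraction of each codon among the synonymous codons of its amino acid."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TO_AA:
                counts[codon] += 1
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TO_AA[codon]] += n
    return {codon: counts[codon] / totals[aa]
            for codon, aa in CODON_TO_AA.items() if totals[aa]}


def top_bottom_ratio(genes, fraction):
    """genes: list of (expression, cds); compare top vs bottom fraction."""
    ranked = sorted(genes, key=lambda g: g[0])
    k = max(1, int(len(ranked) * fraction))
    bottom = relative_codon_use([cds for _, cds in ranked[:k]])
    top = relative_codon_use([cds for _, cds in ranked[-k:]])
    return {c: top[c] / bottom[c] for c in top if bottom.get(c)}


if __name__ == "__main__":
    toy_genes = [(1.0, "GATGATGAC"), (9.0, "GACGACGAT")]  # hypothetical data
    # Receding window: 50% vs 50% down to 5% vs 5% in steps of 1%.
    for percent in range(50, 4, -1):
        ratios = top_bottom_ratio(toy_genes, percent / 100)
    print(top_bottom_ratio(toy_genes, 0.5))
```

The Top/Bottom columns of Tables 6A-E correspond to the 5% window of such a comparison.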

Example 13 - Sequences used for transformation

A novel aspect of our finding is that selecting mRNA structures with the most even distribution of stems and loops leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Below is an example procedure used to select the optimal mRNA structure for improved functional expression in a host cell of interest.

The first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (> 100,000 mRNA variants).
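Generating the pool of synonymous variants in this first step can be sketched with a lazy generator. The sketch assumes the standard genetic code, and the names `SYNONYMOUS`, `count_variants` and `synonymous_variants` are hypothetical. Because the number of combinations multiplies with every residue, the sketch also provides a counter and yields variants one at a time instead of building the whole pool in memory.

```python
from itertools import product

# Standard genetic code in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it.
SYNONYMOUS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    SYNONYMOUS.setdefault(aa, []).append("".join(codon))


def count_variants(protein: str) -> int:
    """Number of distinct coding sequences that encode the protein."""
    total = 1
    for aa in protein.upper():
        total *= len(SYNONYMOUS[aa])
    return total


def synonymous_variants(protein: str):
    """Lazily yield every coding sequence that encodes the protein."""
    choices = [SYNONYMOUS[aa] for aa in protein.upper()]
    for combo in product(*choices):
        yield "".join(combo)


if __name__ == "__main__":
    peptide = "MKSL"  # hypothetical short peptide
    print(count_variants(peptide))
    for variant in synonymous_variants(peptide):
        pass  # each variant would be passed on to the folding step
```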

The second step is in silico folding of all mRNA species in the pool under the temperature and salt concentrations relevant for the preferred host. The third step is the selection of mRNAs from the pool that meet the following criteria:

(actually the selection of mRNAs that have the most even distribution of stems and loops, which can be selected by the criteria described below.)

For A. thaliana

1. average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)

2. average stem size is below 5.20 bp (or between 5.20 and 2.5 bp)

3. average loop size is below 3.32 bp (or between 3.32 and 3 bp)

4. the standard deviation of the loop size is below 3.20 (or between 3.20 and 2 bp) (measure for even distribution)

5. the standard deviation of the stem size is below 3.40 (or between 3.40 and 2 bp) (measure for even distribution)

6. maximum loop size is below 18 bp (discard uneven stem loop distributions)

7. maximum stem size is below 19 bp (discard uneven stem loop distributions)

For C. elegans

1. average number of stem-loop transitions is above 114 per 1,000 bp (or between 114 and 250 per 1,000 bp)

2. average stem size is below 5.35 bp (or between 5.35 and 2.5 bp)

3. average loop size is below 3.47 bp (or between 3.47 and 3 bp)

4. the standard deviation of the loop size is below 3.37 (or between 3.37 and 2 bp)

5. the standard deviation of the stem size is below 3.27 (or between 3.27 and 2 bp)

6. maximum loop size is below 20 bp

7. maximum stem size is below 18 bp

For E. coli

1. average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)

2. average stem size is below 5.45 bp (or between 5.45 and 2.5 bp)

3. average loop size is below 3.16 bp (or between 3.16 and 2 bp)

4. the standard deviation of the loop size is below 2.95 (or between 2.95 and 2 bp)

5. the standard deviation of the stem size is below 3.50 (or between 3.50 and 2 bp)

6. maximum loop size is below 16 bp

7. maximum stem size is below 18 bp

For M. musculus

1. average number of stem-loop transitions is above 120 per 1,000 bp (or between 120 and 250 per 1,000 bp)

2. average stem size is below 4.35 bp (or between 4.35 and 2.5 bp)

3. average loop size is below 5.18 bp (or between 5.18 and 4 bp)

4. the standard deviation of the loop size is below 3.00 (or between 3.00 and 2 bp)

5. the standard deviation of the stem size is below 3.28 (or between 3.28 and 2 bp)

6. maximum loop size is below 18 bp

7. maximum stem size is below 19 bp

For S. cerevisiae

1. average number of stem-loop transitions is above 110 per 1,000 bp (or between 110 and 250 per 1,000 bp)

2. average stem size is below 5.27 bp (or between 5.27 and 2.5 bp)

3. average loop size is below 3.77 bp (or between 3.77 and 3 bp)

4. the standard deviation of the loop size is below 3.65 (or between 3.65 and 2 bp)

5. the standard deviation of the stem size is below 3.25 (or between 3.25 and 2 bp)

6. maximum loop size is below 20 bp

7. maximum stem size is below 19 bp
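The per-species criteria listed above lend themselves to a simple table-driven filter. The sketch below encodes only the primary thresholds (the "above" and "below" values) and ignores the alternative "or between" ranges; the dictionary keys, the function name `meets_criteria` and the candidate values in the usage example are hypothetical.

```python
# Primary thresholds per species, transcribed from the criteria lists above.
# min_transitions is per 1,000 nucleotides; all other values are upper bounds.
CRITERIA = {
    "A. thaliana": dict(min_transitions=116, mean_stem=5.20, mean_loop=3.32,
                        sd_loop=3.20, sd_stem=3.40, max_loop=18, max_stem=19),
    "C. elegans": dict(min_transitions=114, mean_stem=5.35, mean_loop=3.47,
                       sd_loop=3.37, sd_stem=3.27, max_loop=20, max_stem=18),
    "E. coli": dict(min_transitions=116, mean_stem=5.45, mean_loop=3.16,
                    sd_loop=2.95, sd_stem=3.50, max_loop=16, max_stem=18),
    "M. musculus": dict(min_transitions=120, mean_stem=4.35, mean_loop=5.18,
                        sd_loop=3.00, sd_stem=3.28, max_loop=18, max_stem=19),
    "S. cerevisiae": dict(min_transitions=110, mean_stem=5.27, mean_loop=3.77,
                          sd_loop=3.65, sd_stem=3.25, max_loop=20, max_stem=19),
}

UPPER_BOUND_KEYS = ("mean_stem", "mean_loop", "sd_loop", "sd_stem",
                    "max_loop", "max_stem")


def meets_criteria(stats: dict, species: str) -> bool:
    """True if a folded mRNA variant passes all seven criteria for a species."""
    limits = CRITERIA[species]
    if stats["transitions_per_1000"] <= limits["min_transitions"]:
        return False
    return all(stats[key] < limits[key] for key in UPPER_BOUND_KEYS)


if __name__ == "__main__":
    # Hypothetical statistics for one folded variant.
    candidate = dict(transitions_per_1000=130, mean_stem=4.8, mean_loop=3.1,
                     sd_loop=2.5, sd_stem=2.9, max_loop=12, max_stem=14)
    for species in CRITERIA:
        print(species, meets_criteria(candidate, species))
```

Variants that pass this filter would then go on to the final selection step, which relies on previously published data rather than on these thresholds.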

After step 3, where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred.

Example 14 - Sequences used for transformation

All ORFs

GFP-720bp

>GFPnat [SEQ ID NO: 10]

atggccAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGG CGATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAA AACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCcTGGCCAACACTTGTC ACTACTCTCACCTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGA CTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTgCAGGAAAGAACTATATTTTTCAAAGATG ACGGtAACTACAAGACcCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATC GAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAACTCGAATACAA CTATAACTCACATAATGTATACATCATGGCcGACAAACAGAAGAATGGAATCAAAGTTAACT TCAAAATTAGACACAACATTGAGGATGGAAGCGTTCAATTAGCAGACCATTATCAACAAAAT ACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGC CCTT CCAAAGATCCCAACGAAAAGAGAGATCACATGGTCCTTCTTGAGTTTGTAACAGCTG CTGGGATTACACTCGGCATGGATGAACTATACAAATAA

>GFPopt [SEQ ID NO: 11]

atggccTCCAAGGGTGAGGAGCTCTTCACCGGTGTCGTCCCCATCCTCGTCGAGCTCGACGG TGACGTCAACGGTCACAAGTTCTCCGTCTCCGGTGAGGGTGAGGGTGACGCCACCTACGGTA AGCTCACCCTCAAGTTCATCTGCACCACCGGTAAGCTCCCCGTCCCCTGGCCCACCCTCGTC ACCACCCTCACCTACGGTGTCCAGTGCTTCTCCCGTTACCCCGACCACATGAAGCAGCACGA CTTCTTCAAGTCCGCCATGCCCGAGGGTTACGTCCAGGAGCGTACCATCTTCTTCAAGGACG ACGGTAACTACAAGACCCGTGCCGAGGTCAAGTTCGAGGGTGACACCCTCGTCAACCGTATC GAGCTCAAGGGTATCGACTTCAAGGAGGACGGTAACATCCTCGGTCACAAGCTCGAGTACAA CTACAACTCCCACAACGTCTACATCATGGCCGACAAGCAGAAGAACGGTATCAAGGTCAACT TCAAGATCCGTCACAACATCGAGGACGGTTCCGTCCAGCTCGCCGACCACTACCAGCAGAAC ACCCCCATCGGTGACGGTCCCGTCCTCCTCCCCGACAACCACTACCTCTCCACCCAGTCCGC CCTCTCCAAGGACCCCAACGAGAAGCGTGACCACATGGTCCTCCTCGAGTTCGTCACCGCCG CCGGTATCACCCTCGGTATGGACGAGCTCTACAAGTAA

SP-AvGFP-786bp

>chitSPoptGFPnat [SEQ ID NO: 12]

atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGtcatggccAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAG ATGGCGATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATAC GGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCcTGGCCAACACT TGTCACTACTCTCACCTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGC ATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTgCAGGAAAGAACTATATTTTTCAAA GATGACGGtAACTACAAGACcCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAG AATCGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAACTCGAAT ACAACTATAACTCACATAATGTATACATCATGGCcGACAAACAGAAGAATGGAATCAAAGTT AACTTCAAAATTAGACACAACATTGAGGATGGAAGCGTTCAATTAGCAGACCATTATCAACA AAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAAT CTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGGTCCTTCTTGAGTTTGTAACA GCTGCTGGGATTACACTCGGCATGGATGAACTATACAAATAA

>chitSPoptGFPopt [SEQ ID NO: 13]

atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGtcatggccTCCAAGGGTGAGGAGCTCTTCACCGGTGTCGTCCCCATCCTCGTCGAGCTCG ACGGTGACGTCAACGGTCACAAGTTCTCCGTCTCCGGTGAGGGTGAGGGTGACGCCACCTAC GGTAAGCTCACCCTCAAGTTCATCTGCACCACCGGTAAGCTCCCCGTCCCCTGGCCCACCCT CGTCACCACCCTCACCTACGGTGTCCAGTGCTTCTCCCGTTACCCCGACCACATGAAGCAGC ACGACTTCTTCAAGTCCGCCATGCCCGAGGGTTACGTCCAGGAGCGTACCATCTTCTTCAAG GACGACGGTAACTACAAGACCCGTGCCGAGGTCAAGTTCGAGGGTGACACCCTCGTCAACCG TATCGAGCTCAAGGGTATCGACTTCAAGGAGGACGGTAACATCCTCGGTCACAAGCTCGAGT ACAACTACAACTCCCACAACGTCTACATCATGGCCGACAAGCAGAAGAACGGTATCAAGGTC AACTTCAAGATCCGTCACAACATCGAGGACGGTTCCGTCCAGCTCGCCGACCACTACCAGCA GAACACCCCCATCGGTGACGGTCCCGTCCTCCTCCCCGACAACCACTACCTCTCCACCCAGT CCGCCCTCTCCAAGGACCCCAACGAGAAGCGTGACCACATGGTCCTCCTCGAGTTCGTCACC GCCGCCGGTATCACCCTCGGTATGGACGAGCTCTACAAGTAA

mIL-10-540bp

>chitSPopt-IL-10nat [SEQ ID NO: 14]

atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGcccagtacagccgggaagacaatAACtgcacccacttcccagtcggccagagccacatgc tcctagagctgcggactgccttcagccaggtgaagactttctttcaaacaaaggaccagctg gacaacatactgctaaccgactccttaatgcaggactttaagggttacttgggttgccaagc cttatcggaaatgatccagttttacctggtagaagtgatgccccaggcagagaagcatggcc cagaaatcaaggagcatttgaattccctgggtgagaagctgaagaccctcaggatgcggctg aggcgctgtcatcgatttctcccctgtgaaaataagagcaaggcagtggagcaggtgaagag tgattttaataagctccaagaccaaggtgtctacaaggccatgaatgaatttgacatcttca tcaactgcatagaagcatacatgatgatcaaaatgaaaagctaa

>chitSPopt-mIL-10opt [SEQ ID NO: 15]

atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGcccagtactcccgtgaggacaacaactgcacccacttccccgtcggtcagtcccacatgc tcctcgagctccgtaccgccttctcccaggtcaagaccttcttccagaccaaggaccagctc gacaacatcctcctcaccgactccctcatgcaggacttcaagggttacctcggttgccaggc cctctccgagatgatccagttctacctcgtcgaggtcatgccccaggccgagaagcacggtc ccgagatcaaggagcacctcaactccctcggtgagaagctcaagaccctccgtatgcgtctc cgtcgttgccaccgtttcctcccctgcgagaacaagtccaaggccgtcgagcaggtcaagtc cgacttcaacaagctccaggaccagggtgtctacaaggccatgaacgagttcgacatcttca tcaactgcatcgaggcctacatgatgatcaagatgaagtcctga

OVA-1221bp

>chitSPoptOVAnat (only with pIVT) [SEQ ID NO: 16]

atg (gcc) AAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCg GCCGGCTCCATCGGCGCAGCAAGCATGGAATTTTGTTTTGATGTATTCAAGGAGCTCAAAGT CCACCATGCCAATGAGAACATCTTCTACTGCCCCATTGCCATCATGTCAGCTCTAGCCATGG TATACCTGGGTGCAAAAGACAGCACCAGGACACAGATAAATAAGGTTGTTCGCTTTGATAAA CTTCCAGGATTCGGAGACAGTATTGAAGCTCAGTGTGGCACATCTGTAAACGTTCACTCTTC ACTTAGAGACATCCTCAACCAAATCACCAAACCAAATGATGTTTATTCGTTCAGCCTTGCCA GTAGACTTTATGCTGAAGAGAGATACCCAATCCTGCCAGAATACTTGCAGTGTGTGAAGGAA CTGTATAGAGGAGGCTTGGAACCTATCAACTTTCAAACAGCTGCAGATCAAGCCAGAGAGCT CATCAATTCCTGGGTAGAAAGTCAGACAAATGGAATTATCAGAAATGTCCTTCAGCCAAGCT CCGTGGATTCTCAAACTGCAATGGTTCTGGTTAATGCCATTGTCTTCAAAGGACTGTGGGAG AAAACATTTAAGGATGAAGACACACAAGCAATGCCTTTCAGAGTGACTGAGCAAGAAAGCAA ACCTGTGCAGATGATGTACCAGATTGGTTTATTTAGAGTGGCATCAATGGCTTCTGAGAAAA TGAAGATCCTGGAGCTTCCATTTGCCAGTGGGACAATGAGCATGTTGGTGCTGTTGCCTGAT GAAGTCTCAGGCCTTGAGCAGCTTGAGAGTATAATCAACTTTGAAAAACTGACTGAATGGAC CAGTTCTAATGTTATGGAAGAGAGGAAGATCAAAGTGTACTTACCTCGCATGAAGATGGAGG AAAAATACAACCTCACATCTGTCTTAATGGCTATGGGCATTACTGACGTGTTTAGCTCTTCA GCCAATCTGTCTGGCATCTCCTCAGCAGAGAGCCTGAAGATtTCTCAAGCTGTCCATGCAGC ACATGCAGAAATCAATGAAGCAGGCAGAGAGGTGGTAGGGTCAGCAGAGGCTGGAGTGGATG CTGCAAGCGTCTCTGAAGAATTTAGGGCTGACCATCCATTCCTCTTCTGTATCAAGCACATC GCAACCAACGCCGTTCTCTTCTTTGGCAGATGTGTTTCCCCTTAA

>chitSPoptOVAopt [SEQ ID NO: 17]

atg (gcc) AAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCg GCCGGTTCCATCGGTGCCGCCAGCATGGAGTTCTGCTTCGACGTCTTCAAGGAGCTCAAGGT CCACCACGCCAACGAGAACATCTTCTACTGCCCCATCGCCATCATGTCCGCCCTCGCTATGG TCTACCTCGGTGCCAAGGACTCCACCCGTACCCAGATCAACAAGGTCGTCCGTTTCGACAAG CTCCCCGGTTTCGGTGACTCCATCGAGGCCCAGTGCGGTACTTCCGTCAACGTCCACTCCTC CCTCCGTGACATCCTCAACCAGATCACCAAGCCCAACGACGTCTACTCCTTCTCCCTCGCCT CCCGTCTCTACGCCGAGGAGCGTTACCCCATCCTCCCCGAGTACCTCCAGTGCGTCAAGGAG CTCTACCGTGGTGGTCTCGAGCCCATCAACTTCCAGACCGCCGCCGACCAGGCCCGTGAGCT CATCAACTCCTGGGTCGAGTCCCAGACCAACGGTATCATCCGTAACGTCCTCCAGCCCTCCT CCGTCGACTCCCAGACCGCTATGGTCCTCGTCAACGCCATCGTCTTCAAGGGTCTCTGGGAG AAGaCCTTCAAGGACGAGGACACCCAGGCCATGCCCTTCCGTGTCACCGAGCAGGAGTCCAA GCCCGTCCAGATGATGTACCAGATCGGTCTCTTCCGTGTCGCCAGCATGGCCTCCGAGAAGA TGAAGATCCTCGAGCTCCCCTTCGCCTCCGGTACTATGTCCATGCTCGTCCTCCTCCCCGAC GAGGTCTCCGGTCTCGAGCAGCTCGAGTCCATCATCAACTTCGAGAAGCTCACCGAGTGGAC CTCCTCCAACGTCATGGAGGAGCGTAAGATCAAGGTCTACCTCCCCCGTATGAAGATGGAGG AGAAGTACAACCTCACCTCCGTCCTCATGGCTATGGGTATCACCGACGTCTTCTCCTCCTCC GCCAACCTCTCCGGTATCTCCTCCGCCGAGTCCCTCAAGATCTCCCAGGCCGTCCACGCCGC CCACGCCGAGATCAACGAGGCCGGTCGTGAGGTCGTCGGTTCCGCCGAGGCCGGTGTCGACG CCGCCTCCGTCTCCGAGGAGTTCCGTGCCGACCACCCCTTCCTCTTCTGCATCAAGCACATC GCCACCAACGCCGTCCTCTTCTTCGGTCGTTGCGTCTCCCCCTAA

E. coli S. cerevisiae C. elegans A. thaliana M. musculus

Strains/ecotypes 1 13 14 8 9

Samples 168 316 391 415 111

Controls 105 21 1 109 101 565

Papers 8 9 9 9 9

Treatments 20 14 29 73 21

Tissues 1 1 3 1 1 28

Additional remarks:

> Different strains/mutants and tissues receiving the same experimental treatment are counted as a single treatment; all measurements in a time series are counted as a single treatment.

> The M. musculus data sets of Thorrez et al., 2009 and Xue et al., 2013 do not include the control spot on the slide in their datasets.

> E. coli expression values from the Dong and Schellhorn, 2009 dataset are rounded off to a single decimal and those from the Ito et al., 2009 dataset to two decimals.

Table 1A. Overview of the gathered expression data per species.

Table 1C. Description of the gathered S. cerevisiae expression data.

Table 1D. Description of the gathered C. elegans expression data.

Table 1E. Description of the gathered A. thaliana expression data.

Table 1F. Description of the gathered M. musculus expression data.

E. coli S. cerevisiae C. elegans A. thaliana M. musculus

Gene length -0.146 -0.041 0.093 0.030 -0.016

Energy (kcal.mol/nt) -0.006 -0.319 -0.316 -0.229 0.006

Bound nt (fraction) 0.038 0.236 0.061 0.172 0.015

Mean stem size -0.111 0.054 -0.182 0.053 -0.055

Mean loop size -0.115 -0.241 -0.179 -0.155 -0.046

Transitions /nt 0.140 0.144 0.227 0.071 0.069

Table 2. Correlation between mRNA structure characteristics and gene expression per species. The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with expression.

AA Triplet All Top 5% Bottom 5% Top/Bottom

* TAA 0.648 0.871 0.643 1.355

TAG 0.067 0.021 0.064 0.328

TGA 0.285 0.107 0.293 0.365

A GCT 0.160 0.332 0.154 2.156

GCC 0.266 0.139 0.275 0.505

GCA 0.209 0.258 0.216 1.194

GCG 0.365 0.271 0.356 0.761

C TGT 0.435 0.385 0.466 0.826

TGC 0.565 0.615 0.534 1.152

D GAT 0.617 0.429 0.649 0.661

GAC 0.383 0.571 0.351 1.627

E GAA 0.693 0.760 0.681 1.116

GAG 0.307 0.240 0.319 0.752

F TTT 0.562 0.290 0.615 0.472

TTC 0.438 0.710 0.385 1.844

G GGT 0.343 0.527 0.327 1.612

GGC 0.413 0.406 0.395 1.028

GGA 0.098 0.031 0.116 0.267

GGG 0.146 0.036 0.162 0.222

H CAT 0.557 0.295 0.591 0.499

CAC 0.443 0.705 0.409 1.724

I ATT 0.503 0.302 0.530 0.570

ATC 0.434 0.688 0.380 1.811

ATA 0.063 0.010 0.090 0.111

K AAA 0.770 0.768 0.793 0.968

AAG 0.230 0.232 0.207 1.121

L TTA 0.124 0.042 0.155 0.271

TTG 0.124 0.059 0.130 0.454

CTT 0.100 0.065 0.112 0.580

CTC 0.103 0.068 0.106 0.642

CTA 0.035 0.008 0.041 0.195

CTG 0.515 0.758 0.457 1.659

M ATG 1.000 1.000 1.000 1.000

N AAT 0.432 0.182 0.486 0.374

AAC 0.568 0.818 0.514 1.591

P CCT 0.152 0.140 0.165 0.848

CCC 0.115 0.025 0.137 0.182

CCA 0.185 0.134 0.199 0.673

CCG 0.547 0.702 0.498 1.410

Q CAA 0.337 0.213 0.369 0.577

CAG 0.663 0.787 0.631 1.247

R CGT 0.396 0.636 0.363 1.752

CGC 0.410 0.332 0.410 0.810

CGA 0.058 0.010 0.071 0.141

CGG 0.089 0.011 0.094 0.117

AGA 0.030 0.007 0.044 0.159

AGG 0.016 0.004 0.019 0.211

S TCT 0.150 0.323 0.132 2.447

TCC 0.155 0.256 0.136 1.882

TCA 0.117 0.058 0.123 0.472

TCG 0.155 0.057 0.171 0.333

AGT 0.143 0.060 0.158 0.380

AGC 0.280 0.247 0.280 0.882

T ACT 0.168 0.328 0.167 1.964

ACC 0.449 0.508 0.409 1.242

ACA 0.120 0.048 0.154 0.312

ACG 0.263 0.116 0.270 0.430

V GTT 0.257 0.436 0.258 1.690

GTC 0.214 0.113 0.219 0.516

GTA 0.152 0.225 0.153 1.471

GTG 0.377 0.226 0.370 0.611

W TGG 1.000 1.000 1.000 1.000

Y TAT 0.555 0.331 0.582 0.569

TAC 0.445 0.669 0.418 1.600

Table 6A. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Escherichia coli. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.

AA Triplet All Top 5% Bottom 5% Top/Bottom

* TAA 0.480 0.731 0.403 1.814

TAG 0.225 0.117 0.290 0.403

TGA 0.295 0.152 0.307 0.495

A GCT 0.367 0.593 0.339 1.749

GCC 0.223 0.280 0.215 1.302

GCA 0.296 0.105 0.319 0.329

GCG 0.113 0.023 0.127 0.181

C TGT 0.627 0.829 0.594 1.396

TGC 0.373 0.171 0.406 0.421

D GAT 0.656 0.526 0.642 0.819

GAC 0.344 0.474 0.358 1.324

E GAA 0.701 0.854 0.699 1.222

GAG 0.299 0.146 0.301 0.485

F TTT 0.593 0.353 0.616 0.573

TTC 0.407 0.647 0.384 1.685

G GGT 0.455 0.823 0.387 2.127

GGC 0.197 0.093 0.197 0.472

GGA 0.224 0.051 0.279 0.183

GGG 0.124 0.033 0.138 0.239

H CAT 0.643 0.440 0.617 0.713

CAC 0.357 0.560 0.383 1.462

I ATT 0.463 0.522 0.469 1.113

ATC 0.258 0.430 0.236 1.822

ATA 0.280 0.048 0.295 0.163

K AAA 0.581 0.299 0.639 0.468

AAG 0.419 0.701 0.361 1.942

L TTA 0.279 0.216 0.244 0.885

TTG 0.283 0.567 0.251 2.259

CTT 0.127 0.057 0.163 0.350

CTC 0.057 0.014 0.086 0.163

CTA 0.142 0.103 0.143 0.720

CTG 0.112 0.043 0.113 0.381

M ATG 1.000 1.000 1.000 1.000

N AAT 0.598 0.303 0.594 0.510

AAC 0.402 0.697 0.406 1.717

P CCT 0.310 0.227 0.305 0.744

CCC 0.160 0.053 0.164 0.323

CCA 0.407 0.701 0.401 1.748

CCG 0.123 0.018 0.129 0.140

Q CAA 0.686 0.893 0.663 1.347

CAG 0.314 0.107 0.337 0.318

R CGT 0.140 0.201 0.131 1.534

CGC 0.058 0.017 0.078 0.218

CGA 0.068 0.001 0.088 0.011

CGG 0.040 0.002 0.064 0.031

AGA 0.478 0.724 0.420 1.724

AGG 0.217 0.055 0.218 0.252

S TCT 0.261 0.452 0.246 1.837

TCC 0.157 0.289 0.147 1.966

TCA 0.211 0.108 0.218 0.495

TCG 0.097 0.036 0.096 0.375

AGT 0.163 0.063 0.172 0.366

AGC 0.111 0.051 0.121 0.421

T ACT 0.343 0.482 0.333 1.447

ACC 0.210 0.352 0.213 1.653

ACA 0.307 0.133 0.325 0.409

ACG 0.140 0.034 0.129 0.264

V GTT 0.389 0.511 0.368 1.389

GTC 0.201 0.347 0.210 1.652

GTA 0.216 0.060 0.226 0.265

GTG 0.195 0.082 0.196 0.418

W TGG 1.000 1.000 1.000 1.000

Y TAT 0.568 0.302 0.558 0.541

TAC 0.432 0.698 0.442 1.579

Table 6B. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Saccharomyces cerevisiae. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.

AA Triplet All Top 5% Bottom 5% Top/Bottom

* TAA 0.496 0.694 0.439 1.581

TAG 0.179 0.141 0.162 0.870

TGA 0.325 0.165 0.399 0.414

A GCT 0.354 0.423 0.325 1.302

GCC 0.199 0.302 0.157 1.924

GCA 0.314 0.198 0.385 0.514

GCG 0.133 0.077 0.134 0.575

C TGT 0.555 0.447 0.588 0.760

TGC 0.445 0.553 0.412 1.342

D GAT 0.679 0.631 0.693 0.911

GAC 0.321 0.369 0.307 1.202

E GAA 0.621 0.534 0.671 0.796

GAG 0.379 0.466 0.329 1.416

F TTT 0.481 0.261 0.605 0.431

TTC 0.519 0.739 0.395 1.871

G GGT 0.204 0.168 0.214 0.785

GGC 0.124 0.086 0.134 0.642

GGA 0.592 0.711 0.544 1.307

GGG 0.080 0.035 0.109 0.321

H CAT 0.611 0.513 0.649 0.790

CAC 0.389 0.487 0.351 1.387

I ATT 0.534 0.470 0.538 0.874

ATC 0.314 0.478 0.226 2.115

ATA 0.152 0.052 0.236 0.220

K AAA 0.588 0.381 0.665 0.573

AAG 0.412 0.619 0.335 1.848

L TTA 0.110 0.049 0.169 0.290

TTG 0.234 0.212 0.258 0.822

CTT 0.249 0.306 0.214 1.430

CTC 0.174 0.280 0.116 2.414

CTA 0.091 0.042 0.112 0.375

CTG 0.142 0.112 0.133 0.842

M ATG 1.000 1.000 1.000 1.000

N AAT 0.625 0.484 0.655 0.739

AAC 0.375 0.516 0.345 1.496

P CCT 0.178 0.126 0.220 0.573

CCC 0.088 0.054 0.100 0.540

CCA 0.532 0.691 0.494 1.399

CCG 0.202 0.130 0.186 0.699

Q CAA 0.651 0.650 0.679 0.957

CAG 0.349 0.350 0.321 1.090

R CGT 0.217 0.350 0.150 2.333

CGC 0.096 0.175 0.067 2.612

CGA 0.236 0.146 0.231 0.632

CGG 0.091 0.046 0.098 0.469

AGA 0.288 0.250 0.357 0.700

AGG 0.071 0.032 0.097 0.330

S TCT 0.206 0.235 0.214 1.098

TCC 0.130 0.177 0.112 1.580

TCA 0.257 0.205 0.273 0.751

TCG 0.156 0.169 0.125 1.352

AGT 0.149 0.104 0.173 0.601

AGC 0.102 0.109 0.103 1.058

T ACT 0.324 0.346 0.329 1.052

ACC 0.175 0.297 0.144 2.062

ACA 0.345 0.249 0.383 0.650

ACG 0.156 0.108 0.143 0.755

V GTT 0.388 0.413 0.407 1.015

GTC 0.220 0.320 0.168 1.905

GTA 0.158 0.097 0.191 0.508

GTG 0.234 0.170 0.234 0.726

W TGG 1.000 1.000 1.000 1.000

Y TAT 0.559 0.414 0.631 0.656

TAC 0.441 0.586 0.369 1.588

Table 6C. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Caenorhabditis elegans. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.

AA Triplet All Top 5% Bottom 5% Top/Bottom

* TAA 0.345 0.371 0.263 1.411

TAG 0.204 0.194 0.194 1.000

TGA 0.451 0.435 0.543 0.801

A GCT 0.432 0.498 0.383 1.300

GCC 0.161 0.171 0.174 0.983

GCA 0.263 0.221 0.278 0.795

GCG 0.144 0.110 0.164 0.671

C TGT 0.593 0.561 0.591 0.949

TGC 0.407 0.439 0.409 1.073

D GAT 0.674 0.644 0.662 0.973

GAC 0.326 0.356 0.338 1.053

E GAA 0.511 0.442 0.523 0.845

GAG 0.489 0.558 0.477 1.170

F TTT 0.502 0.427 0.515 0.829

TTC 0.498 0.573 0.485 1.181

G GGT 0.334 0.398 0.316 1.259

GGC 0.141 0.119 0.152 0.783

GGA 0.371 0.367 0.387 0.948

GGG 0.154 0.115 0.145 0.793

H CAT 0.606 0.526 0.612 0.859

CAC 0.394 0.474 0.388 1.222

I ATT 0.400 0.429 0.375 1.144

ATC 0.363 0.432 0.373 1.158

ATA 0.236 0.139 0.252 0.552

K AAA 0.490 0.385 0.517 0.745

AAG 0.510 0.615 0.483 1.273

L TTA 0.135 0.082 0.148 0.554

TTG 0.220 0.233 0.229 1.017

CTT 0.257 0.290 0.248 1.169

CTC 0.181 0.207 0.172 1.203

CTA 0.105 0.080 0.121 0.661

CTG 0.102 0.108 0.082 1.317

M ATG 1.000 1.000 1.000 1.000

N AAT 0.502 0.430 0.489 0.879

AAC 0.498 0.570 0.511 1.115

P CCT 0.381 0.407 0.353 1.153

CCC 0.106 0.112 0.109 1.028

CCA 0.327 0.336 0.351 0.957

CCG 0.186 0.146 0.186 0.785

Q CAA 0.564 0.465 0.648 0.718

CAG 0.436 0.535 0.352 1.520

R CGT 0.168 0.241 0.161 1.497

CGC 0.070 0.077 0.068 1.132

CGA 0.118 0.087 0.120 0.725

CGG 0.092 0.059 0.086 0.686

AGA 0.352 0.301 0.363 0.829

AGG 0.199 0.234 0.202 1.158

S TCT 0.280 0.303 0.253 1.198

TCC 0.129 0.147 0.127 1.157

TCA 0.204 0.178 0.212 0.840

TCG 0.108 0.100 0.114 0.877

AGT 0.151 0.139 0.158 0.880

AGC 0.127 0.134 0.135 0.993

T ACT 0.334 0.374 0.300 1.247

ACC 0.207 0.260 0.213 1.221

ACA 0.302 0.253 0.313 0.808

ACG 0.157 0.114 0.175 0.651

V GTT 0.400 0.432 0.372 1.161

GTC 0.193 0.219 0.199 1.101

GTA 0.145 0.095 0.157 0.605

GTG 0.262 0.253 0.271 0.934

W TGG 1.000 1.000 1.000 1.000

Y TAT 0.504 0.418 0.508 0.823

TAC 0.496 0.582 0.492 1.183

Table 6D. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Arabidopsis thaliana. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.

AA Triplet All Top 5% Bottom 5% Top/Bottom

* TAA 0.258 0.351 0.323 1.087

TAG 0.235 0.222 0.253 0.877

TGA 0.507 0.427 0.424 1.007

A GCT 0.289 0.320 0.316 1.013

GCC 0.377 0.331 0.340 0.974

GCA 0.232 0.246 0.266 0.925

GCG 0.101 0.103 0.078 1.321

C TGT 0.476 0.516 0.507 1.018

TGC 0.524 0.484 0.493 0.982

D GAT 0.450 0.521 0.500 1.042

GAC 0.550 0.479 0.500 0.958

E GAA 0.412 0.466 0.495 0.941

GAG 0.588 0.534 0.505 1.057

F TTT 0.445 0.507 0.499 1.016

TTC 0.555 0.493 0.501 0.984

G GGT 0.175 0.208 0.197 1.056

GGC 0.332 0.319 0.287 1.111

GGA 0.257 0.272 0.313 0.869

GGG 0.236 0.201 0.204 0.985

H CAT 0.410 0.468 0.472 0.992

CAC 0.590 0.532 0.528 1.008

I ATT 0.343 0.404 0.362 1.116

ATC 0.495 0.448 0.419 1.069

ATA 0.162 0.148 0.219 0.676

K AAA 0.398 0.407 0.471 0.864

AAG 0.602 0.593 0.529 1.121

L TTA 0.068 0.089 0.095 0.937

TTG 0.132 0.152 0.152 1.000

CTT 0.132 0.154 0.154 1.000

CTC 0.194 0.169 0.176 0.960

CTA 0.079 0.079 0.092 0.859

CTG 0.396 0.357 0.331 1.079

M ATG 1.000 1.000 1.000 1.000

N AAT 0.436 0.481 0.501 0.960

AAC 0.564 0.519 0.499 1.040

P CCT 0.306 0.335 0.316 1.060

CCC 0.298 0.250 0.275 0.909

CCA 0.288 0.310 0.323 0.960

CCG 0.108 0.105 0.086 1.221

Q CAA 0.253 0.258 0.350 0.737

CAG 0.747 0.742 0.650 1.142

R CGT 0.084 0.105 0.080 1.312

CGC 0.170 0.153 0.122 1.254

CGA 0.123 0.145 0.104 1.394

CGG 0.194 0.179 0.128 1.398

AGA 0.213 0.232 0.318 0.730

AGG 0.216 0.186 0.249 0.747

S TCT 0.193 0.222 0.220 1.009

TCC 0.211 0.195 0.188 1.037

TCA 0.143 0.149 0.170 0.876

TCG 0.054 0.057 0.039 1.462

AGT 0.156 0.171 0.174 0.983

AGC 0.243 0.206 0.209 0.986

T ACT 0.249 0.273 0.275 0.993

ACC 0.345 0.313 0.312 1.003

ACA 0.295 0.314 0.328 0.957

ACG 0.111 0.099 0.085 1.165

V GTT 0.174 0.225 0.217 1.037

GTC 0.245 0.215 0.241 0.892

GTA 0.119 0.138 0.146 0.945

GTG 0.461 0.423 0.395 1.071

W TGG 1.000 1.000 1.000 1.000

Y TAT 0.423 0.481 0.498 0.966

TAC 0.577 0.519 0.502 1.034

Table 6E. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Mus musculus (Animalia). Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low- expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.

Top 5% Top 5% Top 5% Top 5%

Trait Organism Stem_size_mean Stem_size_sd Stem_size_max Stem_size_min

Protein abundance A. thaliana 5.197742798 3.333316648 18.60493827 1.082304527

Gene expression A. thaliana 5.264773876 3.354989119 18.67107195 1.118942731

Protein abundance C. elegans 4.949884209 3.035095428 16.98275862 1.161637931

Gene expression C. elegans 4.950296788 3.048596544 17.30588235 1.129411765

Protein abundance E. coli 5.127421075 3.127080268 17.00909091 1.227272727

Gene expression E. coli 5.157297589 3.162030121 17.54285714 1.214285714

Protein abundance M. musculus 5.063991554 3.236283472 18.29166667 1.078125

Gene expression M. musculus 5.081367307 3.237828152 18.43329098 1.095298602

Protein abundance S. cerevisiae 5.254440541 3.230034739 18.21167883 1.237226277

Gene expression S. cerevisiae 5.262132835 3.23936481 18.01766784 1.247349823

Table 9. Analysis of the mRNA secondary structure characteristics (stem architecture) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
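Stem-size and loop-size statistics of the kind reported in these tables (mean, sd, max, min) can be sketched from a predicted structure in dot-bracket notation. The sketch below rests on stated assumptions: the structure string is supplied by an external folding tool, a "stem" is approximated as a maximal run of '(' characters (the 5' arm of a helix), a "loop" as a maximal run of '.', and the population standard deviation is used. How stems and loops were delimited in the analysis underlying the tables is not specified here, and the names `run_lengths` and `size_summary` are hypothetical.

```python
from itertools import groupby
from statistics import mean, pstdev


def run_lengths(structure: str, symbol: str) -> list[int]:
    """Lengths of maximal runs of `symbol` in a dot-bracket string."""
    return [len(list(group)) for ch, group in groupby(structure) if ch == symbol]


def size_summary(lengths: list[int]) -> dict[str, float]:
    """Mean, population sd, max and min of a list of run lengths."""
    if not lengths:
        raise ValueError("structure contains no runs of the requested type")
    return {
        "mean": mean(lengths),
        "sd": pstdev(lengths),
        "max": max(lengths),
        "min": min(lengths),
    }


# Toy structure: one 4-bp hairpin followed by one 2-bp hairpin.
structure = "((((....))))..((...))"
stem_stats = size_summary(run_lengths(structure, "("))  # runs of 4 and 2
loop_stats = size_summary(run_lengths(structure, "."))  # runs of 4, 2 and 3
```

Counting only the '(' arm avoids counting each helix twice, but a run of '(' interrupted by a bulge on the 3' arm is still treated as a single stem, so this is a simplification.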

Table 10. Analysis of the mRNA secondary structure characteristics (loop architecture) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

Top 5% Top 5% Top 5%

Trait Organism Bound_nt/1000 nt Energy_(kcal/mol)/1000 nt Transitions/1000 nt

Protein abundance A. thaliana 619.3179412 -292.7800618 119.7782583

Gene expression A. thaliana 624.0580406 -290.3511673 119.169408

Protein abundance C. elegans 598.4571065 -272.5292233 121.7225865

Gene expression C. elegans 596.5470187 -273.9225057 121.3996132

Protein abundance E. coli 627.3154158 -319.8163586 123.3964781

Gene expression E. coli 631.9373347 -327.7643057 123.4152453

Protein abundance M. musculus 616.1866207 -327.7746785 122.4372787

Gene expression M. musculus 612.9621408 -313.9661558 121.3436794

Protein abundance S. cerevisiae 606.4041095 -255.5194926 116.2875481

Gene expression S. cerevisiae 605.1063803 -255.9553594 115.8779268

Table 11. Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
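The three length-normalised quantities in this table can be illustrated as follows. This is a sketch under assumptions rather than the analysis actually performed: the dot-bracket structure and its folding energy are taken as inputs from an external folding tool, a "stem-loop transition" is assumed to mean a paired position immediately followed by an unpaired one, and the function names `per_kilobase_metrics` and `within_transition_window` are hypothetical. The default window of at least 110 and fewer than 250 transitions per kilobase is the range recited in claim 1.

```python
def per_kilobase_metrics(structure: str, energy_kcal_per_mol: float) -> dict[str, float]:
    """Bound nucleotides, folding energy and stem-loop transitions per 1000 nt."""
    n = len(structure)
    if n == 0:
        raise ValueError("empty structure")
    paired = [ch in "()" for ch in structure]
    # Assumed definition: a transition is a paired nucleotide directly
    # followed by an unpaired one (stem -> loop).
    transitions = sum(1 for here, nxt in zip(paired, paired[1:]) if here and not nxt)
    scale = 1000 / n
    return {
        "bound_nt_per_kb": sum(paired) * scale,
        "energy_per_kb": energy_kcal_per_mol * scale,
        "transitions_per_kb": transitions * scale,
    }


def within_transition_window(transitions_per_kb: float,
                             low: float = 110.0,
                             high: float = 250.0) -> bool:
    """True if the transition density is at least `low` and below `high`."""
    return low <= transitions_per_kb < high


# Toy 40-nt structure built from three hairpins and two linkers.
toy = "((((....))))" + "...." + "(((...)))" + "..." + "((((....))))"
metrics = per_kilobase_metrics(toy, energy_kcal_per_mol=-12.0)
```

In a selection step, `within_transition_window(metrics["transitions_per_kb"])` could serve as a filter over a library of synonymous variants, each folded under the conditions relevant to the host.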

Gene expression S. cerevisiae 5.274054838 3.147229903 16.84751773 1.365248227

Protein abundance S. cerevisiae 5.34944781 3.244190265 19.52380952 1.102564103

Table 12. Analysis of the mRNA secondary structure characteristics (stem architecture) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

Table 13. Analysis of the mRNA secondary structure characteristics (loop architecture) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

Table 14. Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

Table 15. Differences in the mRNA secondary structure characteristics (stem architecture) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

Delta (top-bottom) Delta (top-bottom) Delta (top-bottom)

Trait Organism Loop_size_mean Loop_size_sd Loop_size_max

Gene expression A. thaliana -0.267692548 -0.347578028 -1.583865892

Protein abundance A. thaliana -0.175253334 -0.312156894 -4.174836883

Gene expression C. elegans -0.419326762 -0.52092485 -3.334682506

Protein abundance C. elegans -0.154072143 -0.309808645 -4.418251447

Gene expression E. coli -0.186479295 -0.31462739 -3.024198823

Protein abundance E. coli -0.19510469 -0.35983994 -4.111271298

Gene expression M. musculus -0.224252393 -0.288729208 -2.917011238

Protein abundance M. musculus 0.08059553 0.055306019 -2.037498481

Gene expression S. cerevisiae -0.778634452 -1.077665405 -3.962468292

Protein abundance S. cerevisiae -0.364963788 -0.580120518 -5.694456309

Table 16. Differences in the mRNA secondary structure characteristics (loop architecture) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

Table 17. Differences in the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).

Claims

1. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of:

a. providing a library of polynucleotides each of which vary at a minimum of a single codon position;

b. analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host;

c. selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and

d. synthesising said polynucleotide.

2. A method as claimed in claim 1, wherein the method further comprises selecting a polynucleotide having a maximum stem size of less than 19 bp.

3. A method as claimed in claim 2, wherein the method further comprises selecting a polynucleotide having a maximum loop size of less than 20 bp.

4. A method as claimed in claim 3, wherein the host cell is a prokaryotic cell.

5. A method as claimed in claim 4, wherein the host cell is a bacterial cell.

6. A method as claimed in claim 5, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.45 bp and 2.50 bp.

7. A method as claimed in claim 5 or claim 6, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.16 bp and 2.00 bp.

8. A method as claimed in any of claims 5 to 7, wherein the host cell is an Escherichia coli cell.

9. A method as claimed in claim 3, wherein the host cell is a eukaryotic cell.

10. A method as claimed in claim 9, wherein the host cell is a plant cell.

11. A method as claimed in claim 10, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.20 bp and 2.50 bp.

12. A method as claimed in claim 10 or claim 11, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.27 bp and 3.00 bp.

13. A method as claimed in any of claims 10 to 12, wherein the host cell is an Arabidopsis cell, optionally an Arabidopsis thaliana cell.

14. A method as claimed in claim 9, wherein the host cell is a fungal cell.

15. A method as claimed in claim 14, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.27 bp and 2.50 bp.

16. A method as claimed in claim 14 or claim 15, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.77 and 3.00 bp.

17. A method as claimed in any of claims 14 to 16, wherein the host cell is a Saccharomyces cell, optionally a Saccharomyces cerevisiae cell.

18. A method as claimed in claim 9, wherein the host cell is an animal cell.

19. A method as claimed in claim 18, wherein the host cell is a nematode cell.

20. A method as claimed in claim 19, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.35 bp and 2.50 bp.

21. A method as claimed in claim 19 or claim 20, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.47 bp and 3.00 bp.

22. A method as claimed in any of claims 19 to 21, wherein the host cell is a Caenorhabditis elegans cell.

23. A method as claimed in claim 18, wherein the host cell is a mammalian cell.

24. A method as claimed in claim 23, wherein the method further comprises selecting a polynucleotide having a mean stem size between 4.35 bp and 2.50 bp.

25. A method as claimed in claim 23 or claim 24, wherein the method further comprises selecting a polynucleotide having a mean loop size between 5.18 bp and 4.00 bp.

26. A method as claimed in any of claims 23 to 25, wherein the host cell is a Mus musculus cell.

27. A method as claimed in any of claims 4 to 26, wherein the method further comprises selecting a polynucleotide from a library of synonymous variants wherein the codon usage of the selected polynucleotide most closely matches the most abundant tRNAs in a particular host cell.

28. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of:

a. providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and

b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

29. A method as claimed in claim 28, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

30. A method as claimed in claim 28 or claim 29, wherein the host cell is a prokaryotic cell.

31. A method as claimed in claim 30, wherein the host cell is a bacterial cell.

32. A method as claimed in claim 31, wherein the host cell is an Escherichia coli cell.

33. A method as claimed in any of claims 30 to 32, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

34. A method as claimed in claim 33, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.

35. A method as claimed in claim 28 or claim 29, wherein the host cell is a fungal cell.

36. A method as claimed in claim 35, wherein the host cell is a Saccharomyces cerevisiae cell.

37. A method as claimed in claim 35 or claim 36, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

; and/or:

; and/or:

Amino Acid DNA Codon Replacement Codon

Isoleucine ATA ATC or ATT

; and/or:

Amino Acid DNA Codon Replacement Codon

Alanine GCA or GCG GCT or GCC

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

Amino Acid DNA Codon Replacement Codon

Glutamine CAG CAA

; and/or:

Amino Acid DNA Codon Replacement Codon

Glutamic acid GAG GAA

38. A method as claimed in claim 37, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.

39. A method as claimed in claim 28 or claim 29, wherein the host cell is a nematode cell.

40. A method as claimed in claim 39, wherein the host cell is a Caenorhabditis elegans cell.

41. A method as claimed in claim 39 or claim 40, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

; and/or:

; and/or:

Amino Acid DNA Codon Replacement Codon

Isoleucine ATA or ATT ATC

; and/or:

Amino Acid DNA Codon Replacement Codon

Threonine ACT, ACA or ACG ACC

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

Amino Acid DNA Codon Replacement Codon

Cysteine TGT TGC

; and/or:

42. A method as claimed in claim 41, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.

43. A method as claimed in claim 28 or claim 29, wherein the host cell is a Mus musculus cell.

44. A method as claimed in claim 43, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

; and/or:

; and/or:

Amino Acid DNA Codon Replacement Codon

Alanine GCC or GCA GCG or GCT

; and/or:

Amino Acid DNA Codon Replacement Codon

Proline CCT, CCC or CCA CCG

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

45. A method as claimed in claim 44, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.

46. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of:

Amino Acid DNA Codon Replacement Codon

Histidine CAT CAC

Lysine AAA AAG

Asparagine AAT AAC

Tyrosine TAT TAC

Stop Codon TAG or TGA TAA

Alanine GCC, GCA or GCG GCT

Glycine GGC, GGA or GGG GGT

Isoleucine ATT or ATA ATC

Arginine CGC, CGA, CGG, AGA or AGG CGT

Serine TCT, TCA, TCG, AGT or AGC TCC

Threonine ACT, ACA or ACG ACC

Valine GTC, GTA or GTG GTT

the host cell being selected from a prokaryotic cell, a fungal cell, a plant cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.

47. A method as claimed in claim 46, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.

48. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of:

a. providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and

b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

Amino Acid DNA Codon Replacement Codon

Histidine CAT CAC

Lysine AAA AAG

Asparagine AAT AAC

Tyrosine TAT TAC

Stop Codon TAG or TGA TAA

Leucine CTT, CTC, CTA, TTA or TTG CTG

wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.

49. A method as claimed in claim 48, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

; and/or:

; and/or:

; and/or:

Amino Acid DNA Codon Replacement Codon

Valine GTC, GTA or GTG GTT

; and/or:

Amino Acid DNA Codon Replacement Codon

Proline CCC, CCA or CCG CCT

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

Amino Acid DNA Codon Replacement Codon

Isoleucine ATT or ATA ATC

; and/or:

Amino Acid DNA Codon Replacement Codon

Glutamine CAA CAG

; and/or:

Amino Acid DNA Codon Replacement Codon

Arginine CGC, CGA, CGG, AGA or AGG CGT

50. A method as claimed in claim 49, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.

51. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of:

52. A method as claimed in claim 51, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

; and/or:

53. A method as claimed in any preceding claim, wherein the starting polynucleotide sequence is the wild-type coding sequence.

54. A method as claimed in any preceding claim, wherein the polynucleotide sequence is present or inserted into an expression vector.

55. A method as claimed in claim 54, wherein the expression vector is further introduced into a host cell.

56. A method as claimed in claim 55, wherein the host cell is cultured to produce the heterologous protein.

57. A method of expressing a heterologous protein in a plant cell comprising the steps of:

b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:

Amino Acid DNA Codon Replacement Codon

Alanine GCT, GCA or GCG GCC

Arginine CGC, CGA, CGG, AGA or AGG CGT

Asparagine AAT AAC

Aspartic acid GAT GAC

Cysteine TGT TGC

Glutamic acid GAA GAG

Glutamine CAA CAG

Glycine GGC, GGA or GGG GGT

Histidine CAT CAC

Isoleucine ATT or ATA ATC

Leucine CTT, CTA, CTG, TTA or TTG CTC

Lysine AAA AAG

Phenylalanine TTT TTC

Proline CCT, CCA or CCG CCC

Serine TCT, TCA, TCG, AGT or AGC TCC

Threonine ACT, ACA or ACG ACC

Tyrosine TAT TAC

Valine GTT, GTA or GTG GTC

Stop codons TAG or TGA TAA

c. inserting the polynucleotide sequence into an expression vector;

d. introducing said expression vector into a host cell; and

e. culturing the host cell to produce the heterologous protein;

optionally wherein the corresponding codons are changed according to the following table:

; and/or:

Amino Acid DNA Codon Replacement Codon

Leucine CTT, CTA, CTC, TTA or TTG CTG

; and/or:

; and/or:

; and/or:

; and/or:

58. A method as claimed in claim 57, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.

59. A method as claimed in any of claims 46 to 58, wherein the host cell is an Arabidopsis cell.

60. A method as claimed in any preceding claim further comprising: analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence; and

incorporating in said polynucleotide sequence a pattern of optimal and non-optimal codons at a site associated with provision of a structural motif;

wherein said pattern enables increased expression efficiency of said protein in said host cell compared with the synonymous coding sequence containing solely optimal codons, wherein optimal codons are those codons pre-calculated to provide the highest functional expression of heterologous protein in the host cell or the sole possible codon.
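By way of illustration only, the synonymous codon replacement recited in the claims can be sketched in Python using the replacement table of claim 46 as an example. The names `CLAIM_46_REPLACEMENTS`, `recode` and `translate` are hypothetical, the mapping is transcribed from the table in claim 46 (arginine to CGT, serine to TCC), and the check that the encoded amino acid sequence is unchanged is an added safeguard rather than a recited step.

```python
# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Codon -> replacement codon, transcribed from the table of claim 46.
CLAIM_46_REPLACEMENTS = {
    "CAT": "CAC",                                            # Histidine
    "AAA": "AAG",                                            # Lysine
    "AAT": "AAC",                                            # Asparagine
    "TAT": "TAC",                                            # Tyrosine
    "TAG": "TAA", "TGA": "TAA",                              # Stop codon
    "GCC": "GCT", "GCA": "GCT", "GCG": "GCT",                # Alanine
    "GGC": "GGT", "GGA": "GGT", "GGG": "GGT",                # Glycine
    "ATT": "ATC", "ATA": "ATC",                              # Isoleucine
    "CGC": "CGT", "CGA": "CGT", "CGG": "CGT",
    "AGA": "CGT", "AGG": "CGT",                              # Arginine
    "TCT": "TCC", "TCA": "TCC", "TCG": "TCC",
    "AGT": "TCC", "AGC": "TCC",                              # Serine
    "ACT": "ACC", "ACA": "ACC", "ACG": "ACC",                # Threonine
    "GTC": "GTT", "GTA": "GTT", "GTG": "GTT",                # Valine
}


def translate(cds: str) -> str:
    """Translate an in-frame DNA coding sequence to one-letter amino acids."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def recode(cds: str, replacements: dict[str, str]) -> str:
    """Replace every listed codon with its synonymous replacement codon."""
    seq = cds.upper()
    if len(seq) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    recoded = "".join(replacements.get(codon, codon) for codon in codons)
    # Safeguard: the encoded protein must be identical before and after.
    if translate(recoded) != translate(seq):
        raise ValueError("replacement table changed the amino acid sequence")
    return recoded


optimised = recode("ATGCATAAAGCAATTAGATGA", CLAIM_46_REPLACEMENTS)
```

Here the input encodes Met-His-Lys-Ala-Ile-Arg-stop, and `optimised` is `ATGCACAAGGCTATCCGTTAA`, which encodes the same sequence. Any of the host-specific tables could be substituted for the dictionary in the same way.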