WO2016086988A1 - Optimisation of coding sequence for functional protein expression
- Publication number
- WO2016086988A1 (PCT/EP2014/076436)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- codon
- cell
- host cell
- expression
- polynucleotide
- Prior art date
Links
- 108090000623 proteins and genes Proteins 0.000 title claims abstract description 309
- 230000014509 gene expression Effects 0.000 title claims abstract description 230
- 102000004169 proteins and genes Human genes 0.000 title claims abstract description 200
- 108091026890 Coding region Proteins 0.000 title claims description 28
- 108020004705 Codon Proteins 0.000 claims abstract description 314
- 108091033319 polynucleotide Proteins 0.000 claims abstract description 186
- 102000040430 polynucleotide Human genes 0.000 claims abstract description 186
- 239000002157 polynucleotide Substances 0.000 claims abstract description 186
- 108020004999 messenger RNA Proteins 0.000 claims abstract description 183
- 210000004027 cell Anatomy 0.000 claims description 281
- 235000018102 proteins Nutrition 0.000 claims description 163
- 238000000034 method Methods 0.000 claims description 150
- 241000196324 Embryophyta Species 0.000 claims description 57
- 241000588724 Escherichia coli Species 0.000 claims description 54
- 150000001413 amino acids Chemical class 0.000 claims description 49
- 235000001014 amino acid Nutrition 0.000 claims description 48
- 229940024606 amino acid Drugs 0.000 claims description 48
- 241000699660 Mus musculus Species 0.000 claims description 44
- 230000007704 transition Effects 0.000 claims description 42
- 241000219195 Arabidopsis thaliana Species 0.000 claims description 37
- 240000004808 Saccharomyces cerevisiae Species 0.000 claims description 35
- 235000014680 Saccharomyces cerevisiae Nutrition 0.000 claims description 35
- 241000244203 Caenorhabditis elegans Species 0.000 claims description 24
- 230000002538 fungal effect Effects 0.000 claims description 24
- 230000001965 increasing effect Effects 0.000 claims description 23
- 210000001236 prokaryotic cell Anatomy 0.000 claims description 22
- 210000004102 animal cell Anatomy 0.000 claims description 20
- 239000013604 expression vector Substances 0.000 claims description 19
- 108020004414 DNA Proteins 0.000 claims description 17
- 108020004566 Transfer RNA Proteins 0.000 claims description 16
- 239000000203 mixture Substances 0.000 claims description 16
- 108700010070 Codon Usage Proteins 0.000 claims description 13
- 241000219194 Arabidopsis Species 0.000 claims description 11
- DHMQDGOQFOQNFH-UHFFFAOYSA-N Glycine Chemical compound NCC(O)=O DHMQDGOQFOQNFH-UHFFFAOYSA-N 0.000 claims description 11
- 230000001580 bacterial effect Effects 0.000 claims description 11
- 238000012258 culturing Methods 0.000 claims description 11
- 238000000126 in silico method Methods 0.000 claims description 9
- WHUUTDBJXJRKMK-UHFFFAOYSA-N Glutamic acid Natural products OC(=O)C(N)CCC(O)=O WHUUTDBJXJRKMK-UHFFFAOYSA-N 0.000 claims description 8
- 150000003839 salts Chemical class 0.000 claims description 7
- AGPKZVBTJJNPAG-WHFBIAKZSA-N L-isoleucine Chemical compound CC[C@H](C)[C@H](N)C(O)=O AGPKZVBTJJNPAG-WHFBIAKZSA-N 0.000 claims description 6
- KDXKERNSBIXSRK-UHFFFAOYSA-N Lysine Natural products NCCCCC(N)C(O)=O KDXKERNSBIXSRK-UHFFFAOYSA-N 0.000 claims description 6
- 241000244206 Nematoda Species 0.000 claims description 6
- 235000013922 glutamic acid Nutrition 0.000 claims description 6
- 239000004220 glutamic acid Substances 0.000 claims description 6
- AGPKZVBTJJNPAG-UHFFFAOYSA-N isoleucine Natural products CCC(C)C(N)C(O)=O AGPKZVBTJJNPAG-UHFFFAOYSA-N 0.000 claims description 6
- 229960000310 isoleucine Drugs 0.000 claims description 6
- ONIBWKKTOPOVIA-BYPYZUCNSA-N L-Proline Chemical compound OC(=O)[C@@H]1CCCN1 ONIBWKKTOPOVIA-BYPYZUCNSA-N 0.000 claims description 5
- QNAYBMKLOCPYGJ-REOHCLBHSA-N L-alanine Chemical compound C[C@H](N)C(O)=O QNAYBMKLOCPYGJ-REOHCLBHSA-N 0.000 claims description 5
- ROHFNLRQFUQHCH-YFKPBYRVSA-N L-leucine Chemical compound CC(C)C[C@H](N)C(O)=O ROHFNLRQFUQHCH-YFKPBYRVSA-N 0.000 claims description 5
- ROHFNLRQFUQHCH-UHFFFAOYSA-N Leucine Natural products CC(C)CC(N)C(O)=O ROHFNLRQFUQHCH-UHFFFAOYSA-N 0.000 claims description 5
- ONIBWKKTOPOVIA-UHFFFAOYSA-N Proline Natural products OC(=O)C1CCCN1 ONIBWKKTOPOVIA-UHFFFAOYSA-N 0.000 claims description 5
- 235000004279 alanine Nutrition 0.000 claims description 5
- ZDXPYRJPNDTMRX-UHFFFAOYSA-N glutamine Natural products OC(=O)C(N)CCC(N)=O ZDXPYRJPNDTMRX-UHFFFAOYSA-N 0.000 claims description 5
- 239000004475 Arginine Substances 0.000 claims description 4
- DCXYFEDJOCDNAF-UHFFFAOYSA-N Asparagine Natural products OC(=O)C(N)CC(N)=O DCXYFEDJOCDNAF-UHFFFAOYSA-N 0.000 claims description 4
- 241000588722 Escherichia Species 0.000 claims description 4
- 239000004471 Glycine Substances 0.000 claims description 4
- DCXYFEDJOCDNAF-REOHCLBHSA-N L-asparagine Chemical compound OC(=O)[C@@H](N)CC(N)=O DCXYFEDJOCDNAF-REOHCLBHSA-N 0.000 claims description 4
- OUYCCCASQSFEME-QMMMGPOBSA-N L-tyrosine Chemical compound OC(=O)[C@@H](N)CC1=CC=C(O)C=C1 OUYCCCASQSFEME-QMMMGPOBSA-N 0.000 claims description 4
- 239000004472 Lysine Substances 0.000 claims description 4
- MTCFGRXMJLQNBG-UHFFFAOYSA-N Serine Natural products OCC(N)C(O)=O MTCFGRXMJLQNBG-UHFFFAOYSA-N 0.000 claims description 4
- AYFVYJQAPQTCCC-UHFFFAOYSA-N Threonine Natural products CC(O)C(N)C(O)=O AYFVYJQAPQTCCC-UHFFFAOYSA-N 0.000 claims description 4
- 239000004473 Threonine Substances 0.000 claims description 4
- ODKSFYDXXFIFQN-UHFFFAOYSA-N arginine Natural products OC(=O)C(N)CCCNC(N)=N ODKSFYDXXFIFQN-UHFFFAOYSA-N 0.000 claims description 4
- 235000009582 asparagine Nutrition 0.000 claims description 4
- 229960001230 asparagine Drugs 0.000 claims description 4
- HNDVDQJCIGZPNO-UHFFFAOYSA-N histidine Natural products OC(=O)C(N)CC1=CN=CN1 HNDVDQJCIGZPNO-UHFFFAOYSA-N 0.000 claims description 4
- OUYCCCASQSFEME-UHFFFAOYSA-N tyrosine Natural products OC(=O)C(N)CC1=CC=C(O)C=C1 OUYCCCASQSFEME-UHFFFAOYSA-N 0.000 claims description 4
- COLNVLDHVKWLRT-QMMMGPOBSA-N L-phenylalanine Chemical compound OC(=O)[C@@H](N)CC1=CC=CC=C1 COLNVLDHVKWLRT-QMMMGPOBSA-N 0.000 claims description 3
- 241000235070 Saccharomyces Species 0.000 claims description 3
- KZSNJWFQEVHDMF-UHFFFAOYSA-N Valine Natural products CC(C)C(N)C(O)=O KZSNJWFQEVHDMF-UHFFFAOYSA-N 0.000 claims description 3
- 235000018417 cysteine Nutrition 0.000 claims description 3
- XUJNEKJLAYXESH-UHFFFAOYSA-N cysteine Natural products SCC(N)C(O)=O XUJNEKJLAYXESH-UHFFFAOYSA-N 0.000 claims description 3
- COLNVLDHVKWLRT-UHFFFAOYSA-N phenylalanine Natural products OC(=O)C(N)CC1=CC=CC=C1 COLNVLDHVKWLRT-UHFFFAOYSA-N 0.000 claims description 3
- CKLJMWTZIZZHCS-REOHCLBHSA-N L-aspartic acid Chemical compound OC(=O)[C@@H](N)CC(O)=O CKLJMWTZIZZHCS-REOHCLBHSA-N 0.000 claims description 2
- 235000003704 aspartic acid Nutrition 0.000 claims description 2
- OQFSQFPPLPISGP-UHFFFAOYSA-N beta-carboxyaspartic acid Natural products OC(=O)C(N)C(C(O)=O)C(O)=O OQFSQFPPLPISGP-UHFFFAOYSA-N 0.000 claims description 2
- 210000003527 eukaryotic cell Anatomy 0.000 claims description 2
- 210000004962 mammalian cell Anatomy 0.000 claims description 2
- KZSNJWFQEVHDMF-BYPYZUCNSA-N L-valine Chemical compound CC(C)[C@H](N)C(O)=O KZSNJWFQEVHDMF-BYPYZUCNSA-N 0.000 claims 2
- 239000004474 valine Substances 0.000 claims 2
- 230000014616 translation Effects 0.000 abstract description 45
- 238000013519 translation Methods 0.000 abstract description 37
- 230000001976 improved effect Effects 0.000 abstract description 14
- 238000012986 modification Methods 0.000 abstract description 9
- 230000004048 modification Effects 0.000 abstract description 9
- 108090000765 processed proteins & peptides Proteins 0.000 abstract description 6
- 238000013459 approach Methods 0.000 abstract description 5
- 102000004196 processed proteins & peptides Human genes 0.000 abstract description 5
- 230000002068 genetic effect Effects 0.000 abstract description 3
- 229920001184 polypeptide Polymers 0.000 abstract description 3
- 238000005457 optimization Methods 0.000 abstract description 2
- 125000003275 alpha amino acid group Chemical group 0.000 abstract 1
- 241000894007 species Species 0.000 description 63
- 239000002773 nucleotide Substances 0.000 description 50
- 241001465754 Metazoa Species 0.000 description 48
- 125000003729 nucleotide group Chemical group 0.000 description 47
- 239000005090 green fluorescent protein Substances 0.000 description 39
- 108010043121 Green Fluorescent Proteins Proteins 0.000 description 35
- 102000004144 Green Fluorescent Proteins Human genes 0.000 description 35
- 230000002596 correlated effect Effects 0.000 description 31
- 241000894006 Bacteria Species 0.000 description 25
- 241000233866 Fungi Species 0.000 description 24
- 108010058846 Ovalbumin Proteins 0.000 description 20
- 229940092253 ovalbumin Drugs 0.000 description 20
- 108010076504 Protein Sorting Signals Proteins 0.000 description 19
- 238000009826 distribution Methods 0.000 description 17
- 230000008859 change Effects 0.000 description 14
- 102000003814 Interleukin-10 Human genes 0.000 description 13
- 108090000174 Interleukin-10 Proteins 0.000 description 13
- 229940076144 interleukin-10 Drugs 0.000 description 13
- 230000000875 corresponding effect Effects 0.000 description 12
- 210000003705 ribosome Anatomy 0.000 description 12
- 210000001519 tissue Anatomy 0.000 description 10
- 230000009466 transformation Effects 0.000 description 10
- 102000012286 Chitinases Human genes 0.000 description 9
- 108010022172 Chitinases Proteins 0.000 description 9
- 102100024458 Cyclin-dependent kinase inhibitor 2A Human genes 0.000 description 9
- 238000002474 experimental method Methods 0.000 description 9
- 230000030279 gene silencing Effects 0.000 description 9
- 108091032973 (ribonucleotides)n+m Proteins 0.000 description 8
- 101001033265 Mus musculus Interleukin-10 Proteins 0.000 description 8
- 241000207746 Nicotiana benthamiana Species 0.000 description 8
- 238000004458 analytical method Methods 0.000 description 8
- 238000002493 microarray Methods 0.000 description 8
- 102000039446 nucleic acids Human genes 0.000 description 8
- 108020004707 nucleic acids Proteins 0.000 description 8
- 150000007523 nucleic acids Chemical class 0.000 description 8
- 230000003287 optical effect Effects 0.000 description 8
- 230000001105 regulatory effect Effects 0.000 description 8
- 230000010474 transient expression Effects 0.000 description 8
- 101000609762 Gallus gallus Ovalbumin Proteins 0.000 description 7
- 101000997963 Aequorea victoria Green fluorescent protein Proteins 0.000 description 6
- 241000699666 Mus <mouse, genus> Species 0.000 description 6
- OJOBTAOGJIWAGB-UHFFFAOYSA-N acetosyringone Chemical compound COC1=CC(C(C)=O)=CC(OC)=C1O OJOBTAOGJIWAGB-UHFFFAOYSA-N 0.000 description 6
- 238000001514 detection method Methods 0.000 description 6
- 230000001502 supplementing effect Effects 0.000 description 6
- 108020003589 5' Untranslated Regions Proteins 0.000 description 5
- 241000244202 Caenorhabditis Species 0.000 description 5
- 230000001413 cellular effect Effects 0.000 description 5
- 230000004186 co-expression Effects 0.000 description 5
- 230000000694 effects Effects 0.000 description 5
- 238000012226 gene silencing method Methods 0.000 description 5
- 239000003112 inhibitor Substances 0.000 description 5
- 239000002609 medium Substances 0.000 description 5
- 238000011160 research Methods 0.000 description 5
- 238000003786 synthesis reaction Methods 0.000 description 5
- 238000011282 treatment Methods 0.000 description 5
- 235000007558 Avena sp Nutrition 0.000 description 4
- 244000299507 Gossypium hirsutum Species 0.000 description 4
- 240000005979 Hordeum vulgare Species 0.000 description 4
- 235000007340 Hordeum vulgare Nutrition 0.000 description 4
- 241000209510 Liliopsida Species 0.000 description 4
- 240000003183 Manihot esculenta Species 0.000 description 4
- 108700026244 Open Reading Frames Proteins 0.000 description 4
- 241000710145 Tomato bushy stunt virus Species 0.000 description 4
- 230000015572 biosynthetic process Effects 0.000 description 4
- 241001233957 eudicotyledons Species 0.000 description 4
- 230000008595 infiltration Effects 0.000 description 4
- 238000001764 infiltration Methods 0.000 description 4
- 239000000126 substance Substances 0.000 description 4
- 238000013518 transcription Methods 0.000 description 4
- 230000035897 transcription Effects 0.000 description 4
- 229920001817 Agar Polymers 0.000 description 3
- 241000589155 Agrobacterium tumefaciens Species 0.000 description 3
- 241000195597 Chlamydomonas reinhardtii Species 0.000 description 3
- 102000004190 Enzymes Human genes 0.000 description 3
- 108090000790 Enzymes Proteins 0.000 description 3
- 108060001084 Luciferase Proteins 0.000 description 3
- 239000005089 Luciferase Substances 0.000 description 3
- 108700011259 MicroRNAs Proteins 0.000 description 3
- 101710163270 Nuclease Proteins 0.000 description 3
- 241000283973 Oryctolagus cuniculus Species 0.000 description 3
- 240000006394 Sorghum bicolor Species 0.000 description 3
- 229930006000 Sucrose Natural products 0.000 description 3
- CZMRCDWAGMRECN-UGDNZRGBSA-N Sucrose Chemical compound O[C@H]1[C@H](O)[C@@H](CO)O[C@@]1(CO)O[C@@H]1[C@H](O)[C@@H](O)[C@H](O)[C@@H](CO)O1 CZMRCDWAGMRECN-UGDNZRGBSA-N 0.000 description 3
- 239000008272 agar Substances 0.000 description 3
- 230000015556 catabolic process Effects 0.000 description 3
- 238000006731 degradation reaction Methods 0.000 description 3
- LOKCTEFSRHRXRJ-UHFFFAOYSA-I dipotassium trisodium dihydrogen phosphate hydrogen phosphate dichloride Chemical compound P(=O)(O)(O)[O-].[K+].P(=O)(O)([O-])[O-].[Na+].[Na+].[Cl-].[K+].[Cl-].[Na+] LOKCTEFSRHRXRJ-UHFFFAOYSA-I 0.000 description 3
- 239000003623 enhancer Substances 0.000 description 3
- 230000007613 environmental effect Effects 0.000 description 3
- 239000012634 fragment Substances 0.000 description 3
- 238000001727 in vivo Methods 0.000 description 3
- 230000000977 initiatory effect Effects 0.000 description 3
- 229930027917 kanamycin Natural products 0.000 description 3
- SBUJHOSQTJFQJX-NOAMYHISSA-N kanamycin Chemical compound O[C@@H]1[C@@H](O)[C@H](O)[C@@H](CN)O[C@@H]1O[C@H]1[C@H](O)[C@@H](O[C@@H]2[C@@H]([C@@H](N)[C@H](O)[C@@H](CO)O2)O)[C@H](N)C[C@@H]1N SBUJHOSQTJFQJX-NOAMYHISSA-N 0.000 description 3
- 229960000318 kanamycin Drugs 0.000 description 3
- 229930182823 kanamycin A Natural products 0.000 description 3
- 238000004519 manufacturing process Methods 0.000 description 3
- 239000000463 material Substances 0.000 description 3
- 239000002953 phosphate buffered saline Substances 0.000 description 3
- 238000011002 quantification Methods 0.000 description 3
- 230000004044 response Effects 0.000 description 3
- 239000000523 sample Substances 0.000 description 3
- 238000002864 sequence alignment Methods 0.000 description 3
- 230000010473 stable expression Effects 0.000 description 3
- 239000005720 sucrose Substances 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 230000001052 transient effect Effects 0.000 description 3
- 230000014621 translational initiation Effects 0.000 description 3
- 101150066838 12 gene Proteins 0.000 description 2
- 108020005345 3' Untranslated Regions Proteins 0.000 description 2
- 241000589158 Agrobacterium Species 0.000 description 2
- 244000144725 Amygdalus communis Species 0.000 description 2
- 235000011437 Amygdalus communis Nutrition 0.000 description 2
- 244000226021 Anacardium occidentale Species 0.000 description 2
- 244000099147 Ananas comosus Species 0.000 description 2
- 244000105624 Arachis hypogaea Species 0.000 description 2
- 244000075850 Avena orientalis Species 0.000 description 2
- 241000490497 Avena sp. Species 0.000 description 2
- 235000021533 Beta vulgaris Nutrition 0.000 description 2
- 241000335053 Beta vulgaris Species 0.000 description 2
- 241000219310 Beta vulgaris subsp. vulgaris Species 0.000 description 2
- 108091003079 Bovine Serum Albumin Proteins 0.000 description 2
- 241000743776 Brachypodium distachyon Species 0.000 description 2
- 235000009467 Carica papaya Nutrition 0.000 description 2
- 240000006432 Carica papaya Species 0.000 description 2
- 241000207199 Citrus Species 0.000 description 2
- 235000013162 Cocos nucifera Nutrition 0.000 description 2
- 244000060011 Cocos nucifera Species 0.000 description 2
- 229920000742 Cotton Polymers 0.000 description 2
- KCXVZYZYPLLWCC-UHFFFAOYSA-N EDTA Chemical compound OC(=O)CN(CC(O)=O)CCN(CC(O)=O)CC(O)=O KCXVZYZYPLLWCC-UHFFFAOYSA-N 0.000 description 2
- 238000002965 ELISA Methods 0.000 description 2
- YQYJSBFKSSDGFO-UHFFFAOYSA-N Epihygromycin Natural products OC1C(O)C(C(=O)C)OC1OC(C(=C1)O)=CC=C1C=C(C)C(=O)NC1C(O)C(O)C2OCOC2C1O YQYJSBFKSSDGFO-UHFFFAOYSA-N 0.000 description 2
- 108010070675 Glutathione transferase Proteins 0.000 description 2
- 235000010469 Glycine max Nutrition 0.000 description 2
- 244000068988 Glycine max Species 0.000 description 2
- 235000009432 Gossypium hirsutum Nutrition 0.000 description 2
- 244000020551 Helianthus annuus Species 0.000 description 2
- 102100029100 Hematopoietic prostaglandin D synthase Human genes 0.000 description 2
- 108091092195 Intron Proteins 0.000 description 2
- 244000017020 Ipomoea batatas Species 0.000 description 2
- 235000002678 Ipomoea batatas Nutrition 0.000 description 2
- 125000003412 L-alanyl group Chemical group [H]N([H])[C@@](C([H])([H])[H])(C(=O)[*])[H] 0.000 description 2
- 235000004431 Linum usitatissimum Nutrition 0.000 description 2
- 240000006240 Linum usitatissimum Species 0.000 description 2
- 241000208467 Macadamia Species 0.000 description 2
- 235000004456 Manihot esculenta Nutrition 0.000 description 2
- 235000016735 Manihot esculenta subsp esculenta Nutrition 0.000 description 2
- 240000004658 Medicago sativa Species 0.000 description 2
- 108060004795 Methyltransferase Proteins 0.000 description 2
- 108020004485 Nonsense Codon Proteins 0.000 description 2
- 240000007817 Olea europaea Species 0.000 description 2
- 240000007594 Oryza sativa Species 0.000 description 2
- 235000007164 Oryza sativa Nutrition 0.000 description 2
- 241001520808 Panicum virgatum Species 0.000 description 2
- 244000025272 Persea americana Species 0.000 description 2
- 235000008673 Persea americana Nutrition 0.000 description 2
- 229920001213 Polysorbate 20 Polymers 0.000 description 2
- 241000235347 Schizosaccharomyces pombe Species 0.000 description 2
- 241000209056 Secale Species 0.000 description 2
- 235000002595 Solanum tuberosum Nutrition 0.000 description 2
- 244000061456 Solanum tuberosum Species 0.000 description 2
- 235000011684 Sorghum saccharatum Nutrition 0.000 description 2
- 235000021536 Sugar beet Nutrition 0.000 description 2
- 244000299461 Theobroma cacao Species 0.000 description 2
- 235000009470 Theobroma cacao Nutrition 0.000 description 2
- 108091023045 Untranslated Region Proteins 0.000 description 2
- ISAKRJDGNUQOIC-UHFFFAOYSA-N Uracil Chemical compound O=C1C=CNC(=O)N1 ISAKRJDGNUQOIC-UHFFFAOYSA-N 0.000 description 2
- 240000008042 Zea mays Species 0.000 description 2
- 230000006978 adaptation Effects 0.000 description 2
- 230000004075 alteration Effects 0.000 description 2
- 230000000692 anti-sense effect Effects 0.000 description 2
- 229940098773 bovine serum albumin Drugs 0.000 description 2
- 229940041514 candida albicans extract Drugs 0.000 description 2
- GPRBEKHLDVQUJE-VINNURBNSA-N cefotaxime Chemical compound N([C@@H]1C(N2C(=C(COC(C)=O)CS[C@@H]21)C(O)=O)=O)C(=O)/C(=N/OC)C1=CSC(N)=N1 GPRBEKHLDVQUJE-VINNURBNSA-N 0.000 description 2
- 239000006143 cell culture medium Substances 0.000 description 2
- 235000020971 citrus fruits Nutrition 0.000 description 2
- 238000010367 cloning Methods 0.000 description 2
- 239000002299 complementary DNA Substances 0.000 description 2
- 238000005094 computer simulation Methods 0.000 description 2
- 230000001276 controlling effect Effects 0.000 description 2
- 238000012937 correction Methods 0.000 description 2
- 108010082025 cyan fluorescent protein Proteins 0.000 description 2
- 230000003247 decreasing effect Effects 0.000 description 2
- 239000003814 drug Substances 0.000 description 2
- 230000037433 frameshift Effects 0.000 description 2
- 230000004927 fusion Effects 0.000 description 2
- 230000000984 immunochemical effect Effects 0.000 description 2
- 238000010348 incorporation Methods 0.000 description 2
- 230000001939 inductive effect Effects 0.000 description 2
- 230000003834 intracellular effect Effects 0.000 description 2
- 238000005259 measurement Methods 0.000 description 2
- 230000007246 mechanism Effects 0.000 description 2
- 239000002679 microRNA Substances 0.000 description 2
- 230000035772 mutation Effects 0.000 description 2
- 238000012856 packing Methods 0.000 description 2
- 239000000256 polyoxyethylene sorbitan monolaurate Substances 0.000 description 2
- 235000010486 polyoxyethylene sorbitan monolaurate Nutrition 0.000 description 2
- 239000001253 polyvinylpolypyrrolidone Substances 0.000 description 2
- 235000013809 polyvinylpolypyrrolidone Nutrition 0.000 description 2
- 229920000523 polyvinylpolypyrrolidone Polymers 0.000 description 2
- FGIUAXJPYTZDNR-UHFFFAOYSA-N potassium nitrate Chemical compound [K+].[O-][N+]([O-])=O FGIUAXJPYTZDNR-UHFFFAOYSA-N 0.000 description 2
- 239000000047 product Substances 0.000 description 2
- 108010054624 red fluorescent protein Proteins 0.000 description 2
- 230000035939 shock Effects 0.000 description 2
- 238000002741 site-directed mutagenesis Methods 0.000 description 2
- 239000000725 suspension Substances 0.000 description 2
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 description 2
- 239000013598 vector Substances 0.000 description 2
- 239000012138 yeast extract Substances 0.000 description 2
- 108091005957 yellow fluorescent proteins Proteins 0.000 description 2
- 102000040650 (ribonucleotides)n+m Human genes 0.000 description 1
- 101150072531 10 gene Proteins 0.000 description 1
- UAIUNKRWKOVEES-UHFFFAOYSA-N 3,3',5,5'-tetramethylbenzidine Chemical compound CC1=C(N)C(C)=CC(C=2C=C(C)C(N)=C(C)C=2)=C1 UAIUNKRWKOVEES-UHFFFAOYSA-N 0.000 description 1
- HBEMYXWYRXKRQI-UHFFFAOYSA-N 3-(8-methoxyoctoxy)propyl-methyl-bis(trimethylsilyloxy)silane Chemical compound COCCCCCCCCOCCC[Si](C)(O[Si](C)(C)C)O[Si](C)(C)C HBEMYXWYRXKRQI-UHFFFAOYSA-N 0.000 description 1
- FWMNVWWHGCHHJJ-SKKKGAJSSA-N 4-amino-1-[(2r)-6-amino-2-[[(2r)-2-[[(2r)-2-[[(2r)-2-amino-3-phenylpropanoyl]amino]-3-phenylpropanoyl]amino]-4-methylpentanoyl]amino]hexanoyl]piperidine-4-carboxylic acid Chemical compound C([C@H](C(=O)N[C@H](CC(C)C)C(=O)N[C@H](CCCCN)C(=O)N1CCC(N)(CC1)C(O)=O)NC(=O)[C@H](N)CC=1C=CC=CC=1)C1=CC=CC=C1 FWMNVWWHGCHHJJ-SKKKGAJSSA-N 0.000 description 1
- 102100036826 Aldehyde oxidase Human genes 0.000 description 1
- 241000607620 Aliivibrio fischeri Species 0.000 description 1
- 240000001592 Amaranthus caudatus Species 0.000 description 1
- 235000009328 Amaranthus caudatus Nutrition 0.000 description 1
- 235000001274 Anacardium occidentale Nutrition 0.000 description 1
- 235000007119 Ananas comosus Nutrition 0.000 description 1
- 235000010777 Arachis hypogaea Nutrition 0.000 description 1
- 241001225321 Aspergillus fumigatus Species 0.000 description 1
- 241000351920 Aspergillus nidulans Species 0.000 description 1
- 238000009020 BCA Protein Assay Kit Methods 0.000 description 1
- 238000000035 BCA protein assay Methods 0.000 description 1
- 244000063299 Bacillus subtilis Species 0.000 description 1
- 235000014469 Bacillus subtilis Nutrition 0.000 description 1
- 102100026189 Beta-galactosidase Human genes 0.000 description 1
- 241000305336 Bigelowiella natans Species 0.000 description 1
- 235000014698 Brassica juncea var multisecta Nutrition 0.000 description 1
- 240000002791 Brassica napus Species 0.000 description 1
- 235000011293 Brassica napus Nutrition 0.000 description 1
- 235000006008 Brassica napus var napus Nutrition 0.000 description 1
- 240000000385 Brassica napus var. napus Species 0.000 description 1
- 240000008100 Brassica rapa Species 0.000 description 1
- 235000011292 Brassica rapa Nutrition 0.000 description 1
- 235000006618 Brassica rapa subsp oleifera Nutrition 0.000 description 1
- 235000004977 Brassica sinapistrum Nutrition 0.000 description 1
- 235000004936 Bromus mango Nutrition 0.000 description 1
- 240000001548 Camellia japonica Species 0.000 description 1
- 241000222122 Candida albicans Species 0.000 description 1
- 241000701489 Cauliflower mosaic virus Species 0.000 description 1
- 241000010804 Caulobacter vibrioides Species 0.000 description 1
- 241000195585 Chlamydomonas Species 0.000 description 1
- 241000195649 Chlorella <Chlorellales> Species 0.000 description 1
- KZBUYRJDOAKODT-UHFFFAOYSA-N Chlorine Chemical compound ClCl KZBUYRJDOAKODT-UHFFFAOYSA-N 0.000 description 1
- 244000251987 Coprinus macrorhizus Species 0.000 description 1
- 235000001673 Coprinus macrorhizus Nutrition 0.000 description 1
- 201000007336 Cryptococcosis Diseases 0.000 description 1
- 241000221204 Cryptococcus neoformans Species 0.000 description 1
- 241000235556 Cunninghamella elegans Species 0.000 description 1
- 102000053602 DNA Human genes 0.000 description 1
- 241000168726 Dictyostelium discoideum Species 0.000 description 1
- 238000012286 ELISA Assay Methods 0.000 description 1
- 241000200105 Emiliania huxleyi Species 0.000 description 1
- 241001465328 Eremothecium gossypii Species 0.000 description 1
- 241000218218 Ficus <angiosperm> Species 0.000 description 1
- 108090000331 Firefly luciferases Proteins 0.000 description 1
- 241000223221 Fusarium oxysporum Species 0.000 description 1
- 230000005526 G1 to G0 transition Effects 0.000 description 1
- 101150094690 GAL1 gene Proteins 0.000 description 1
- 102100028501 Galanin peptides Human genes 0.000 description 1
- 241000287828 Gallus gallus Species 0.000 description 1
- 102000053187 Glucuronidase Human genes 0.000 description 1
- 108010060309 Glucuronidase Proteins 0.000 description 1
- 241000543540 Guillardia theta Species 0.000 description 1
- 235000003222 Helianthus annuus Nutrition 0.000 description 1
- 101000928314 Homo sapiens Aldehyde oxidase Proteins 0.000 description 1
- 101100121078 Homo sapiens GAL gene Proteins 0.000 description 1
- 125000002059 L-arginyl group Chemical group O=C([*])[C@](N([H])[H])([H])C([H])([H])C([H])([H])C([H])([H])N([H])C(=N[H])N([H])[H] 0.000 description 1
- 125000001176 L-lysyl group Chemical group [H]N([H])[C@]([H])(C(=O)[*])C([H])([H])C([H])([H])C([H])([H])C(N([H])[H])([H])[H] 0.000 description 1
- 125000000769 L-threonyl group Chemical group [H]N([H])[C@]([H])(C(=O)[*])[C@](O[H])(C([H])([H])[H])[H] 0.000 description 1
- 125000003580 L-valyl group Chemical group [H]N([H])[C@]([H])(C(=O)[*])C(C([H])([H])[H])(C([H])([H])[H])[H] 0.000 description 1
- 102000006830 Luminescent Proteins Human genes 0.000 description 1
- 108010047357 Luminescent Proteins Proteins 0.000 description 1
- 241001330975 Magnaporthe oryzae Species 0.000 description 1
- 235000014826 Mangifera indica Nutrition 0.000 description 1
- 240000007228 Mangifera indica Species 0.000 description 1
- 235000010624 Medicago sativa Nutrition 0.000 description 1
- 235000017587 Medicago sativa ssp. sativa Nutrition 0.000 description 1
- 240000003433 Miscanthus floridulus Species 0.000 description 1
- 241000234295 Musa Species 0.000 description 1
- 240000005561 Musa balbisiana Species 0.000 description 1
- 235000018290 Musa x paradisiaca Nutrition 0.000 description 1
- 241000204051 Mycoplasma genitalium Species 0.000 description 1
- 241000221961 Neurospora crassa Species 0.000 description 1
- 241000208125 Nicotiana Species 0.000 description 1
- 235000002637 Nicotiana tabacum Nutrition 0.000 description 1
- 244000061176 Nicotiana tabacum Species 0.000 description 1
- 235000002725 Olea europaea Nutrition 0.000 description 1
- 108091034117 Oligonucleotide Proteins 0.000 description 1
- 241000195887 Physcomitrella patens Species 0.000 description 1
- 241000589540 Pseudomonas fluorescens Species 0.000 description 1
- 241000508269 Psidium Species 0.000 description 1
- 240000001679 Psidium guajava Species 0.000 description 1
- 235000013929 Psidium pyriferum Nutrition 0.000 description 1
- 108020005067 RNA Splice Sites Proteins 0.000 description 1
- 102000007056 Recombinant Fusion Proteins Human genes 0.000 description 1
- 108010008281 Recombinant Fusion Proteins Proteins 0.000 description 1
- 108700008625 Reporter Genes Proteins 0.000 description 1
- 108091028664 Ribonucleotide Proteins 0.000 description 1
- 241000222481 Schizophyllum commune Species 0.000 description 1
- 235000005775 Setaria Nutrition 0.000 description 1
- 241000232088 Setaria <nematode> Species 0.000 description 1
- 241000700584 Simplexvirus Species 0.000 description 1
- 108020004459 Small interfering RNA Proteins 0.000 description 1
- 235000007230 Sorghum bicolor Nutrition 0.000 description 1
- 241000746413 Spartina Species 0.000 description 1
- 235000009184 Spondias indica Nutrition 0.000 description 1
- 241000923571 Sporobolus michauxianus Species 0.000 description 1
- QAOWNCQODCNURD-UHFFFAOYSA-N Sulfuric acid Chemical compound OS(O)(=O)=O QAOWNCQODCNURD-UHFFFAOYSA-N 0.000 description 1
- 241000192584 Synechocystis Species 0.000 description 1
- 241000248384 Tetrahymena thermophila Species 0.000 description 1
- 241001491687 Thalassiosira pseudonana Species 0.000 description 1
- 241001122767 Theaceae Species 0.000 description 1
- 108091023040 Transcription factor Proteins 0.000 description 1
- 102000040945 Transcription factor Human genes 0.000 description 1
- 235000021307 Triticum Nutrition 0.000 description 1
- 244000098338 Triticum aestivum Species 0.000 description 1
- YZCKVEUIGOORGS-NJFSPNSNSA-N Tritium Chemical compound [3H] YZCKVEUIGOORGS-NJFSPNSNSA-N 0.000 description 1
- 235000015919 Ustilago maydis Nutrition 0.000 description 1
- 244000301083 Ustilago maydis Species 0.000 description 1
- 241000700605 Viruses Species 0.000 description 1
- 241000195615 Volvox Species 0.000 description 1
- 235000007244 Zea mays Nutrition 0.000 description 1
- 101500015412 Zea mays Ubiquitin Proteins 0.000 description 1
- 235000016383 Zea mays subsp huehuetenangensis Nutrition 0.000 description 1
- 235000002017 Zea mays subsp mays Nutrition 0.000 description 1
- 241001360088 Zymoseptoria tritici Species 0.000 description 1
- JLCPHMBAVCMARE-UHFFFAOYSA-N [3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[3-[[3-[[3-[[3-[[3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-[[5-(2-amino-6-oxo-1H-purin-9-yl)-3-hydroxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxyoxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(5-methyl-2,4-dioxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(6-aminopurin-9-yl)oxolan-2-yl]methoxy-hydroxyphosphoryl]oxy-5-(4-amino-2-oxopyrimidin-1-yl)oxolan-2-yl]methyl [5-(6-aminopurin-9-yl)-2-(hydroxymethyl)oxolan-3-yl] hydrogen phosphate Polymers 
Cc1cn(C2CC(OP(O)(=O)OCC3OC(CC3OP(O)(=O)OCC3OC(CC3O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c3nc(N)[nH]c4=O)C(COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3COP(O)(=O)OC3CC(OC3CO)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3ccc(N)nc3=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cc(C)c(=O)[nH]c3=O)n3cc(C)c(=O)[nH]c3=O)n3ccc(N)nc3=O)n3cc(C)c(=O)[nH]c3=O)n3cnc4c3nc(N)[nH]c4=O)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)n3cnc4c(N)ncnc34)O2)c(=O)[nH]c1=O JLCPHMBAVCMARE-UHFFFAOYSA-N 0.000 description 1
- 238000001261 affinity purification Methods 0.000 description 1
- 235000020224 almond Nutrition 0.000 description 1
- 230000003321 amplification Effects 0.000 description 1
- 239000000427 antigen Substances 0.000 description 1
- 108091007433 antigens Proteins 0.000 description 1
- 102000036639 antigens Human genes 0.000 description 1
- 229940091771 aspergillus fumigatus Drugs 0.000 description 1
- 238000003556 assay Methods 0.000 description 1
- 239000011324 bead Substances 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 108010005774 beta-Galactosidase Proteins 0.000 description 1
- 210000004899 c-terminal region Anatomy 0.000 description 1
- 229940095731 candida albicans Drugs 0.000 description 1
- 235000020226 cashew nut Nutrition 0.000 description 1
- 239000013592 cell lysate Substances 0.000 description 1
- 210000003855 cell nucleus Anatomy 0.000 description 1
- 230000009134 cell regulation Effects 0.000 description 1
- 210000002421 cell wall Anatomy 0.000 description 1
- 238000005119 centrifugation Methods 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 210000003763 chloroplast Anatomy 0.000 description 1
- 238000004040 coloring Methods 0.000 description 1
- 235000018597 common camellia Nutrition 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 239000000287 crude extract Substances 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 239000005547 deoxyribonucleotide Substances 0.000 description 1
- 125000002637 deoxyribonucleotide group Chemical group 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000000368 destabilizing effect Effects 0.000 description 1
- UQLDLKMNUJERMK-UHFFFAOYSA-L di(octadecanoyloxy)lead Chemical compound [Pb+2].CCCCCCCCCCCCCCCCCC([O-])=O.CCCCCCCCCCCCCCCCCC([O-])=O UQLDLKMNUJERMK-UHFFFAOYSA-L 0.000 description 1
- 238000009792 diffusion process Methods 0.000 description 1
- 239000003085 diluting agent Substances 0.000 description 1
- 238000011143 downstream manufacturing Methods 0.000 description 1
- 238000004520 electroporation Methods 0.000 description 1
- 230000002255 enzymatic effect Effects 0.000 description 1
- 238000010195 expression analysis Methods 0.000 description 1
- 230000002349 favourable effect Effects 0.000 description 1
- 235000004426 flaxseed Nutrition 0.000 description 1
- GNBHRKFJIUUOQI-UHFFFAOYSA-N fluorescein Chemical compound O1C(=O)C2=CC=CC=C2C21C1=CC=C(O)C=C1OC1=CC(O)=CC=C21 GNBHRKFJIUUOQI-UHFFFAOYSA-N 0.000 description 1
- 238000010362 genome editing Methods 0.000 description 1
- 108060003196 globin Proteins 0.000 description 1
- 102000018146 globin Human genes 0.000 description 1
- 125000003630 glycyl group Chemical group [H]N([H])C([H])([H])C(*)=O 0.000 description 1
- 238000013537 high throughput screening Methods 0.000 description 1
- 239000005556 hormone Substances 0.000 description 1
- 229940088597 hormone Drugs 0.000 description 1
- 239000012135 ice-cold extraction buffer Substances 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000011534 incubation Methods 0.000 description 1
- 230000006698 induction Effects 0.000 description 1
- 238000004020 luminiscence type Methods 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 235000009973 maize Nutrition 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 230000001404 mediated effect Effects 0.000 description 1
- 230000011987 methylation Effects 0.000 description 1
- 238000007069 methylation reaction Methods 0.000 description 1
- 238000000520 microinjection Methods 0.000 description 1
- 238000002156 mixing Methods 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000003199 nucleic acid amplification method Methods 0.000 description 1
- 101150008884 osmY gene Proteins 0.000 description 1
- 239000002245 particle Substances 0.000 description 1
- 235000020232 peanut Nutrition 0.000 description 1
- 239000013612 plasmid Substances 0.000 description 1
- 230000008488 polyadenylation Effects 0.000 description 1
- 229920002704 polyhistidine Polymers 0.000 description 1
- 229920000642 polymer Polymers 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000001915 proofreading effect Effects 0.000 description 1
- 230000004844 protein turnover Effects 0.000 description 1
- 238000003753 real-time PCR Methods 0.000 description 1
- 230000006798 recombination Effects 0.000 description 1
- 238000005215 recombination Methods 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 239000002336 ribonucleotide Substances 0.000 description 1
- 125000002652 ribonucleotide group Chemical group 0.000 description 1
- 235000009566 rice Nutrition 0.000 description 1
- JQXXHWHPUNPDRT-WLSIYKJHSA-N rifampicin Chemical compound O([C@](C1=O)(C)O/C=C/[C@@H]([C@H]([C@@H](OC(C)=O)[C@H](C)[C@H](O)[C@H](C)[C@@H](O)[C@@H](C)\C=C\C=C(C)/C(=O)NC=2C(O)=C3C([O-])=C4C)C)OC)C4=C1C3=C(O)C=2\C=N\N1CC[NH+](C)CC1 JQXXHWHPUNPDRT-WLSIYKJHSA-N 0.000 description 1
- 229960001225 rifampicin Drugs 0.000 description 1
- 229920002477 rna polymer Polymers 0.000 description 1
- 230000028327 secretion Effects 0.000 description 1
- 238000010187 selection method Methods 0.000 description 1
- 238000012163 sequencing technique Methods 0.000 description 1
- 238000013207 serial dilution Methods 0.000 description 1
- 230000006641 stabilisation Effects 0.000 description 1
- 238000010561 standard procedure Methods 0.000 description 1
- 239000012089 stop solution Substances 0.000 description 1
- 239000000758 substrate Substances 0.000 description 1
- 235000011149 sulphuric acid Nutrition 0.000 description 1
- 239000001117 sulphuric acid Substances 0.000 description 1
- 239000006228 supernatant Substances 0.000 description 1
- 229940124597 therapeutic agent Drugs 0.000 description 1
- 229940113082 thymine Drugs 0.000 description 1
- 108091006106 transcriptional activators Proteins 0.000 description 1
- 230000002103 transcriptional effect Effects 0.000 description 1
- 230000014723 transformation of host cell by virus Effects 0.000 description 1
- 229910052722 tritium Inorganic materials 0.000 description 1
- 229940035893 uracil Drugs 0.000 description 1
- 229960005486 vaccine Drugs 0.000 description 1
- 230000003612 virological effect Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
- 210000005253 yeast cell Anatomy 0.000 description 1
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/63—Introduction of foreign genetic material using vectors; Vectors; Use of hosts therefor; Regulation of expression
- C12N15/67—General methods for enhancing the expression
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1089—Design, preparation, screening or analysis of libraries using computer algorithms
Definitions
- the present invention relates to an approach aimed at the modification of codons in individual polynucleotide sequences encoding a heterologous protein of interest, without altering the amino acid sequence of the polypeptide to enhance the amount of functional expression in a host organism of interest. Recognising that maximum translation efficiency and therefore protein production is influenced by codon usage of a coding sequence, in its broadest aspect, this approach exploits redundancy in the genetic code by providing a universal set of codons which may be used at certain positions in the polynucleotide sequence in order to achieve improved heterologous protein production in a range of host cells.
- the present invention also relates to the optimization of the translation efficiency of messenger RNAs on the basis of their secondary structure characteristics, and the provided set of criteria may be used to increase protein expression in particular hosts.
- codons used most frequently in highly expressed genes have been shown to correspond to genomic G+C content and often match the most abundant tRNAs in many species. It is assumed that codons that match more abundant tRNAs would be translated faster as tRNA availability for translation occurs via diffusion and the chance of encountering a more abundant tRNA is greater than when encountering a rarer tRNA. An increase in translation rate allows ribosomes to finish translation and reinitiate translation sooner.
- the probability that a ribosome initially loads a non-matching tRNA is smaller when a codon matches a more abundant tRNA resulting in an energetic advantage as three-quarters of the energy to incorporate an amino acid is lost if a non-matching tRNA has to be rejected after proofreading.
- the use of optimal codons in highly-expressed genes was hypothesized to provide a fitness gain by improved translational efficiency.
- the codon use of a gene of interest is often adapted to reflect the expression host's codon use in highly expressed genes in order to enhance heterologous protein production.
- the results obtained with this strategy are variable.
- a comparison between the overall codon use and the codon use in highly expressed genes of several plant species revealed that optimal codons are not always the codons of which the use is increased most with expression.
- although the codon composition of highly expressed genes differs between monocots and dicots, the same codons often rise in frequency with increasing expression levels (expression codons) and are in many cases C-ending. These conserved expression codons were used to optimise the codon composition of three genes, which enhanced protein yield significantly upon stable and transient expression in plants.
- the present invention provides a quick, practical, universal method of increasing functional heterologous protein expression with wide application for the expression of heterologous genes in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells.
- this method removes any need for consideration of the host cell or specific cellular context involved.
- the present invention also provides specific sets of codon replacements which further improve functional protein expression in particular hosts, specifically prokaryotes, fungi, animals, nematodes, protists and plants.
- the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
- the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
- the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
- Threonine ACT Threonine ACT, ACA or ACG ACC
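The codon replacement step can be sketched as follows. This is a minimal illustration, not the patent's full method: the replacement tables are not reproduced in this text, so the `REPLACEMENTS` dictionary below is a hypothetical stand-in that holds only the one row that appears legible here (threonine codons ACT, ACA or ACG replaced by ACC). The function name `replace_codons` is likewise an assumption.

```python
# Hypothetical replacement table: only the threonine row is legible in this
# text; the remaining rows of the patent's tables are not reproduced here.
REPLACEMENTS = {"ACT": "ACC", "ACA": "ACC", "ACG": "ACC"}


def replace_codons(cds, table=REPLACEMENTS):
    """Return cds with every in-frame codon found in `table` replaced.

    Only synonymous replacements should be listed in `table`, so that the
    encoded amino acid sequence is left unchanged.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    # Split into in-frame codons so that out-of-frame matches are ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(table.get(codon, codon) for codon in codons)
```

Working codon by codon, rather than with a plain string substitution, keeps an "ACT" that straddles two codons from being altered.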
- the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
- heterologous protein expression may be achieved by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table, particularly where the host cell is a prokaryotic cell, a fungal cell or a nematode cell:
- heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
- AGC and/or:
- the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
- the host cell being selected from a prokaryotic cell, a fungal cell, a plant cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
- the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
- modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
- heterologous protein expression is further improved by supplementing the codon changes detailed in the table above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
- the host cell is an Arabidopsis thaliana cell.
- RNAs are folded structures and translation of a given mRNA into a polypeptide requires unfolding.
- the necessary helicase activity is typically provided by the ribosome itself. This unfolding requires energy and in essence, a linear mRNA (i.e. an RNA polymer without secondary structure) would be optimal for the maximization of protein production.
- a certain degree of folding makes mRNA less susceptible to degradation and increases its diffusibility.
- the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the relevant table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the relevant table(s); the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence and wherein the method further comprises; analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence; and incorporating in said polynucleotide sequence a pattern of optimal and non-optimal codons at a site associated
- the method may comprise merely making the universal codon changes, and/or making modifications according to the replacement codon tables which are specific for particular host cells.
- analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to; examining and taking account of the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size.
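The structural measures listed above could be computed from a predicted structure along the following lines. This is a sketch under stated assumptions: the input is a dot-bracket string of the kind folding programs emit, a "stem" is taken to be a maximal run of paired positions, a "loop" a maximal run of unpaired positions, and a "transition" any boundary between the two. The text does not define these terms at this point, so the definitions, the function name `stem_loop_metrics` and the use of the population standard deviation are all illustrative choices.

```python
from itertools import groupby
from statistics import mean, pstdev


def stem_loop_metrics(structure):
    """Summarise a dot-bracket structure such as '((((...))))..'.

    Assumed definitions (not taken from the source text):
      stem       = maximal run of paired positions, '(' or ')'
      loop       = maximal run of unpaired positions, '.'
      transition = boundary between a stem run and a loop run
    Two stems that touch without an unpaired base between them are
    counted as a single run under this simplification.
    """
    if not structure:
        raise ValueError("empty structure")
    runs = [(paired, len(list(group)))
            for paired, group in groupby(structure, key=lambda ch: ch in "()")]
    stems = [size for paired, size in runs if paired]
    loops = [size for paired, size in runs if not paired]
    return {
        "transitions_per_kb": 1000.0 * (len(runs) - 1) / len(structure),
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "sd_stem": pstdev(stems) if stems else 0.0,
        "sd_loop": pstdev(loops) if loops else 0.0,
        "max_stem": max(stems, default=0),
        "max_loop": max(loops, default=0),
    }
```

A lower `sd_stem` or `sd_loop` then corresponds to a more even distribution of stem or loop sizes in the sense used above.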
- uneven stem loop distributions will be discarded and the polynucleotide sequence codon composition will be altered (i.e. non-optimally) based on the observation of mRNA secondary structure to improve translational efficiency and therefore functional protein expression.
- a novel aspect of the invention is the selection of mRNA structures with the most even distribution of stems and loops that leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Consequently, in a further aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
- the first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (>100,000 mRNA variants).
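Generating such a pool can be sketched as below, assuming the standard genetic code and a DNA coding sequence as input. The function name `synonymous_variants` and the `limit` parameter are hypothetical, and the three standard stop codons are treated as interchangeable, which is an assumption of this sketch. Because the number of combinations grows multiplicatively with sequence length, the variants are produced lazily rather than stored.

```python
from itertools import islice, product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Group codons by the amino acid (or stop signal '*') they encode.
SYNONYMS = {}
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def synonymous_variants(cds, limit=None):
    """Lazily yield coding sequences encoding the same protein as cds.

    `limit` caps the number of variants produced; None means no cap.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    choices = [SYNONYMS[CODON_TO_AA[codon]] for codon in codons]
    variants = ("".join(combo) for combo in product(*choices))
    return islice(variants, limit)
```

Each variant would then be folded in silico and scored. For realistic gene lengths the full pool is far too large to enumerate, so some cap or sampling strategy is needed in practice.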
- all mRNA species in the pool are then folded in silico.
- the term "in silico" is widely used in the art and will be understood by the average skilled person as meaning performed on a computer or via computer simulation.
- the RNA structure is predicted in silico using standard techniques and usually under the temperature and salt concentrations relevant for the preferred host. Appropriate software packages or applications incorporating suitable algorithms may be selected for performing the folded mRNA structure prediction. Suitable packages include, but are not limited to; an RNA structure prediction program such as Vienna RNAfold 2.0 (Lorenz et al..
- the mRNA structure prediction will be carried out using such a prediction program using the standard settings and the folding parameters, for example, those established by Andronescu et al. (Andronescu et al., 2007 Bioinformatics, 23 (13), i19-i28) and preferably, adjusting the folding-temperature to that of the intracellular temperature of the host of interest. More preferably, the temperature and salt concentration parameters will be adjusted to match those of the preferred host. Finally the mRNAs from the library of synonymous variants that have the most even distribution of stems and loops are selected.
- the mRNAs having the most even distribution of stems and loops may be identified by the structural characteristics outlined below. In particular the standard deviation is used as a measure for an even distribution of the sizes of the stems and loops which is preferred. Typically, the more similar the stem sizes of an mRNA the higher the translation efficiency. Additionally, the more similar the loop sizes of an mRNA the higher the translation efficiency. Where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Parameters which may be influential include, for example, the folding energy of the 5' terminus and the selection of codons that are frequently used and match the most abundant tRNAs.
- codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred.
- Methods for determining the folding energy of mRNA may be based on, but are not limited to those described by Tuller et al. (Tuller et al., 2009, PNAS 107:3645-3650) and Kudla et al. (Kudla et al. 2009, Science, 324:255-258).
- the mRNA molecule from -23 till +39 should have an average folding energy of at least -6 kcal/mol for E. coli and of at least -4 kcal/mol for S. cerevisiae as determined by the use of sliding windows of 40 nt with 1 nt steps. Codon choice of the first 13 nts providing a low energy will depend on the 5' UTR provided by the expression cassette (Kudla et al. 2009, Science, 324: 255-258; Tuller et al., 2009, PNAS 107: 3645-3650). Alternatively, instead of adapting the first 13 nts, the 5'UTR may be adapted to provide a low folding energy.
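The sliding-window average described above can be sketched as follows. The folding-energy calculation itself is left to a caller-supplied function, here called `fold_energy` (a hypothetical name), which would typically wrap an RNA folding program and return the minimum free energy of a window in kcal/mol. Only the windowing and averaging are shown.

```python
def mean_window_energy(rna, fold_energy, window=40, step=1):
    """Average fold_energy over windows of `window` nt taken every `step` nt.

    `fold_energy` is a caller-supplied callable (an assumption of this
    sketch) mapping a sequence window to its folding energy in kcal/mol.
    """
    if len(rna) < window:
        raise ValueError("sequence is shorter than one window")
    starts = range(0, len(rna) - window + 1, step)
    energies = [fold_energy(rna[start:start + window]) for start in starts]
    return sum(energies) / len(energies)
```

The returned mean can then be compared with the host-specific thresholds quoted above, for example -6 kcal/mol for E. coli or -4 kcal/mol for S. cerevisiae.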
- the 5'UTR used in the present examples is very U-rich (GTTTTTATTTTTAATTTTCTTTCAAATACTTCCACC [SEQ ID NO: 1]), which in most cases provided a relatively high (close to 0) folding energy when using primarily C-ending codons.
- analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to; examining and taking account of; the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size.
- the polynucleotide sequence codon composition will be altered (i.e. non-optimally) to avoid uneven stem loop distributions to improve translational efficiency and therefore functional protein expression.
- Such alterations may include incorporating one or more codons listed as second preference or third preference replacement codons in place of the first preference codon where the secondary structure criteria are not fulfilled by inclusion of the first preference codon.
- such alterations may include retention of the wild-type (WT) or native codon where inclusion of an optimal codon negatively impacts the secondary structure with respect to the particular criteria for each host cell.
- the polynucleotide will have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp).
- the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
- the polynucleotide will have a maximum stem size of less than 19 bp, optionally in the range 10bp to 19bp, 11bp to 18bp, 12bp to 17bp, 13bp to 16bp or 14bp to 15bp. More preferably, the polynucleotide will have a maximum loop size of less than 20 bp, optionally in the range 10bp to 20bp, 11bp to 19bp, 12bp to 18bp, 13bp to 17bp or 14bp to 16bp. Additionally, in embodiments wherein the host cell is a prokaryotic cell, preferably a bacterial cell and more preferably an E. coli cell, the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
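A selection filter based on the numerical criteria quoted above might look like the following. This is an illustrative sketch only: the dictionary keys (`transitions_per_kb`, `max_stem`, `max_loop`) and the function name `passes_criteria` are hypothetical. The defaults are the general bounds stated in the text (at least 110 and fewer than 250 transitions per kbp, maximum stem below 19 bp, maximum loop below 20 bp), and host-specific embodiments would pass their own bounds.

```python
def passes_criteria(metrics, min_transitions=110, max_transitions=250,
                    max_stem=19, max_loop=20):
    """Return True if precomputed structure metrics meet the quoted bounds.

    `metrics` is a dict with the (hypothetical) keys 'transitions_per_kb',
    'max_stem' and 'max_loop'. The lower transition bound is inclusive;
    all upper bounds are exclusive, following the wording 'at least' and
    'fewer than' / 'less than'.
    """
    return (min_transitions <= metrics["transitions_per_kb"] < max_transitions
            and metrics["max_stem"] < max_stem
            and metrics["max_loop"] < max_loop)
```

For instance, a stricter lower bound such as `min_transitions=116` could be supplied where the text quotes one for a particular host.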
- the selected polynucleotide will preferably have a mean stem size between 5.45 bp and 2.50 bp, optionally in the range 5.45 to 4.00 bp, 5.40 bp to 2.60 bp, 5.30 bp to 2.70 bp, 5.20 bp to 2.80 bp, 5.10 bp to 2.90 bp, 5.00 bp to 3.00 bp, 4.90 to 3.10 bp, 4.80 to 3.20 bp, 4.70 to 3.30 bp, 4.60 to 3.40 bp, 4.50 to 3.50 bp, 4.40 to 3.60 bp, 4.30 to 3.70 bp, 4.20 to 3.80 bp or 4.10 to 3.90 bp.
- the method further comprises selecting a polynucleotide having a mean loop size between 3.16 bp and 2.00 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp.
- the method further comprises selecting a polynucleotide having a loop size standard deviation of between 2.95 and 2 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp, 2.60 bp to 2.40 bp or 2.50 bp to 2.40 bp.
- the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.50, preferably between 3.50 and 2.00 bp, optionally in the range 3.40 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp or 2.80 bp to 2.70 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 16 bp, optionally in the range 10bp to 16bp, 11bp to 15bp or 12bp to 14bp.
- the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10bp to 18bp, 11bp to 17bp, 12bp to 16bp, 13bp to 15bp or 12 bp to 14 bp.
- the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
- the selected polynucleotide will have a mean stem size in the range 5.20 to 2.50 bp, optionally in the range 5.20 bp to 4.00 bp, 5.20 to 2.60 bp, 5.10 bp to 2.70 bp, 5.00 bp to 2.80 bp, 4.90 bp to 2.90 bp, 4.80 bp to 3.00 bp, 4.70 to 3.10 bp, 4.60 to 3.20 bp, 4.50 to 3.30 bp, 4.40 to 3.40 bp, 4.30 to 3.50 bp, 4.20 to 3.60 bp, 4.10 to 3.70 bp or 4.00 to 3.80 bp.
- the method further comprises selecting a polynucleotide having a mean loop size between 3.32 bp and 3.00 bp, optionally in the range 3.30 bp to 3.00 bp, 3.25 bp to 3.05 bp, 3.20 bp to 3.10 bp or 3.15 bp to 3.10 bp.
- the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.20 and 2 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp.
- the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.40, preferably between 3.40 and 2.00 bp, optionally in the range 3.30 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp, 2.80 bp to 2.40 bp or 2.60 bp to 2.50 bp.
- the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10bp to 18bp, 11bp to 17bp, 12bp to 16bp or 13bp to 15bp.
- the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10bp to 19bp, 11bp to 18bp, 12bp to 17bp, 13bp to 16bp or 12 bp to 15 bp.
- the selected polynucleotide will preferably have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp).
- the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
- the selected polynucleotide will preferably have a mean stem size between 5.27 bp and 2.50 bp, optionally in the range 5.27 bp to 4.00 bp, 5.20 to 2.40 bp, 5.10 bp to 2.50 bp, 5.00 to 2.60 bp, 4.90 bp to 2.70 bp, 4.80 bp to 2.80 bp, 4.70 bp to 2.90 bp, 4.60 bp to 3.00 bp, 4.50 to 3.10 bp, 4.40 to 3.20 bp, 4.30 to 3.30 bp, 4.20 to 3.40 bp, 4.10 to 3.50 bp, 4.00 to 3.60 bp or 3.90 to 3.70 bp.
- the method further comprises selecting a polynucleotide having a mean loop size between 3.77 bp and 3.00 bp, optionally in the range 3.75 bp to 3.00 bp, 3.70 bp to 3.10 bp, 3.60 bp to 3.20 bp or 3.50 bp to 3.30 bp.
- the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.65 and 2.00 bp, optionally in the range 3.60 bp to 2.10 bp, 3.50 bp to 2.20 bp, 3.40 bp to 2.30 bp, 3.30 bp to 2.40 bp, 3.30 bp to 2.50 bp, 3.20 bp to 2.60 bp, 3.10 bp to 2.70 bp or 3.00 bp to 2.80 bp.
- the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.25, preferably between 3.25 and 2.00 bp, optionally in the range 3.20 bp to 2.10 bp, 3.10 bp to 2.20 bp, 3.00 bp to 2.30 bp, 2.90 bp to 2.40 bp, 2.80 bp to 2.50 bp or 2.70 bp to 2.60 bp.
- the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp.
- the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
- the selected polynucleotide will preferably have at least 114 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 114 to 200/kbp, 115 to 249/kbp, 116 to 248/kbp, 117 to 247/kbp, 118 to 246/kbp, 119 to 245/kbp, 120 to 244/kbp, 121 to 243/kbp, 122 to 242/kbp, 123 to 241/kbp, 124 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to
- the selected polynucleotide will preferably have a mean stem size between 5.35 and 2.50 bp, optionally in the range 5.35 bp to 4.00 bp, 5.30 to 2.40 bp, 5.20 bp to 2.50 bp, 5.10 to 2.60 bp, 5.00 bp to 2.70 bp, 4.90 bp to 2.80 bp, 4.80 bp to 2.90 bp, 4.70 bp to 3.00 bp, 4.60 to 3.10 bp, 4.50 to 3.20 bp, 4.40 to 3.30 bp, 4.30 to 3.40 bp, 4.20 to 3.50 bp, 4.10 to 3.60 bp, 4.00 to 3.70 bp or 3.90 to 3.80 bp.
- the method further comprises selecting a polynucleotide having a mean loop size between 3.47 bp and 3.00 bp, optionally in the range 3.45 bp to 3.00 bp, 3.40 bp to 3.10 bp or 3.30 bp to 3.20 bp.
- the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.37 and 2.00 bp, optionally in the range 3.35 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp, or 2.80 bp to 2.70 bp.
- the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.27, preferably between 3.27 and 2.00 bp, optionally in the range 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp.
- the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp.
- the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.
- the selected polynucleotide will preferably have at least 120 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 120 to 200/kbp, 121 to 249/kbp, 122 to 248/kbp, 123 to 247/kbp, 124 to 246/kbp, 125 to 245/kbp, 130 to 240/kbp, 135 to 235/kbp, 140 to 230/kbp, 145 to 225/kbp, 150 to 220/kbp, 155 to 215/kbp, 160 to 210/kbp, 165 to 205/kbp, 170 to 200/kbp, 175 to 195/kbp or 180 to 190/kbp.
- the selected polynucleotide will preferably have a mean stem size between 4.35 and 2.50 bp, optionally in the range 4.35 to 4.00 bp, 4.30 to 2.40 bp, 4.20 bp to 2.50 bp, 4.10 to 2.60 bp, 4.00 bp to 2.70 bp, 3.90 bp to 2.80 bp, 3.80 bp to 2.90 bp, 3.70 bp to 3.00 bp, 3.60 to 3.10 bp, 3.50 to 3.20 bp or 3.40 to 3.30 bp.
- the method further comprises selecting a polynucleotide having a mean loop size between 5.18 bp and 4.00 bp, optionally in the range 5.15 bp to 4.00 bp, 5.10 bp to 4.10 bp, 5.00 bp to 4.20 bp, 4.90 bp to 4.30 bp, 4.80 bp to 4.40 bp or 4.70 bp to 4.50 bp.
- the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.00 and 2.00 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp or 2.60 bp to 2.40 bp.
- the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.28, preferably between 3.28 and 2.00 bp, optionally in the range 3.27 bp to 2.00 bp, 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp.
- the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp or 13 bp to 15 bp.
- the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
- the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of: providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide, wherein the method further comprises selecting a polynucleotide from a library of synonymous variants wherein the codon usage of the selected polynucleotide most closely matches the most abundant tRNAs in a particular host cell. It will be appreciated that this final step may be undertaken.
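The structural selection criteria listed above can be illustrated with a minimal sketch. It assumes the predicted secondary structure is already available as a dot-bracket string from a folding tool of the reader's choice, treats every run of paired characters as a stem and every run of unpaired characters as a loop (following the "stretches of bound and unbound nucleotides" wording used for the figures), and uses the population standard deviation. The function names and the toy structures are hypothetical; only the 110 to 250 transitions/kbp window is taken from the text.

```python
from itertools import groupby
from statistics import mean, pstdev


def structure_stats(dot_bracket):
    """Summarise stems and loops in a dot-bracket structure string."""
    if not dot_bracket:
        raise ValueError("empty structure")
    # A nucleotide counts as bound if it is paired: '(' or ')'.
    bound = [ch in "()" for ch in dot_bracket]
    # Collapse the sequence into runs of (is_bound, run_length).
    runs = [(flag, len(list(group))) for flag, group in groupby(bound)]
    stems = [n for flag, n in runs if flag]
    loops = [n for flag, n in runs if not flag]
    # Each boundary between a bound run and an unbound run is one transition.
    transitions = len(runs) - 1
    return {
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "sd_stem": pstdev(stems) if stems else 0.0,  # assumption: population SD
        "sd_loop": pstdev(loops) if loops else 0.0,
        "max_stem": max(stems, default=0),
        "max_loop": max(loops, default=0),
        "transitions_per_kb": 1000.0 * transitions / len(dot_bracket),
    }


def in_transition_window(stats, low=110, high=250):
    """True if transitions/kbp is at least `low` and fewer than `high`."""
    return low <= stats["transitions_per_kb"] < high


# Hypothetical toy structures, far shorter than a real mRNA.
dense = structure_stats("((((...))))..((...))..")
sparse = structure_stats("(((((((....)))))))......")
```

In a library screen, each synonymous variant would be folded, summarised in this way, and kept only if its statistics fall inside the ranges chosen for the host in question.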
- polynucleotides encoding heterologous proteins of interest may be isolated nucleic acid molecules and may be a DNA molecule, a cDNA molecule, an RNA molecule or synthetically produced DNA or RNA or a chimeric nucleic acid molecule.
- where the polynucleotide is an RNA, it will be understood that normally uracil (U) is to be used in place of thymine (T).
- polynucleotide refers to a deoxyribonucleotide or ribonucleotide polymer in single- or double-stranded form, or sense or anti-sense, and encompasses analogues of naturally occurring nucleotides that hybridize to nucleic acids in a manner similar to naturally occurring nucleotides.
- polynucleotides may be derived from any organism, including the host organism, or may be synthesised de novo.
- a polynucleotide coding sequence may be provided for the protein of interest (POI) having the wild-type (WT) sequence or alternatively having a 'pre-optimised' sequence; that is to say the sequence incorporates at one or more positions for which synonymous codons are available a codon which is associated with the most abundant tRNA for that particular amino acid.
- codons corresponding to the most abundant tRNA for particular amino acids are used at each position for which synonymous codons are available.
- the starting polynucleotide sequence is the WT sequence encoding the POI.
- the POI may be a native protein of a host cell in which expression of the native protein has been silenced, for example, the polynucleotide sequence encoding that protein has been disrupted, deleted or mutated. In these circumstances, the POI will be considered as a heterologous protein in the context of the mutated host cell.
- a polynucleotide having a coding sequence may comprise synthesis of a polynucleotide comprising the coding sequence. This may be for example by modification of a pre-existing sequence, e.g. by site-directed mutagenesis or possibly by de novo synthesis.
- polynucleotide sequences encoding the protein of interest may be prepared by any suitable method known to those of ordinary skill in the art, including but not limited to, for example, direct chemical synthesis or cloning.
- the starting polynucleotide is a WT sequence or a pre-optimised sequence where the codons match the most abundant tRNAs for a particular host cell
- the starting polynucleotide sequence may be reviewed and modified by incorporating the relevant replacement codons in silico.
- the modified polynucleotide may subsequently be synthesised, for example by direct chemical synthesis, for introduction into a desired host cell.
- the starting polynucleotide sequence may be provided and subsequently modified ex vivo or alternatively in vivo for example by site directed mutagenesis or gene editing techniques.
- all of the polynucleotide sequence is modified according to the relevant table; that is to say 100% of the length of the coding sequence of the polynucleotide encoding the protein of interest (POI).
- each occurrence of a particular 'non-optimal' codon in the starting polynucleotide sequence for which a synonymous codon exists will be replaced with the corresponding replacement codon indicated in the relevant table.
- this involves modifying every occurrence of that codon within the polynucleotide sequence.
- each codon will be modified using the synonymous replacement codon appearing first in the table.
- appropriate replacement codons may be applied to substantially all of the nucleotides in a polynucleotide sequence.
- At least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.5% or 100% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table.
- more than 90% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table.
- More than 95% of the polynucleotide sequence is modified.
- 100% of the polynucleotide sequence is modified, that is, each occurrence of a particular codon is replaced with the corresponding replacement codon indicated in the relevant table.
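The table-driven replacement described above can be sketched as follows. The replacement table here is a hypothetical placeholder with two standard-code synonymous entries (Leu and Arg); it is not one of the host-specific tables referred to in the text. The function name and the `rank` parameter are likewise assumptions, with `rank=0` selecting the first replacement codon listed and `rank=1` the second.

```python
def replace_codons(cds, table, rank=0):
    """Swap each codon listed in `table` for its replacement of the given rank."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    result = []
    for codon in codons:
        options = table.get(codon)
        if options:
            # rank 0 = first replacement listed, rank 1 = second, and so on;
            # fall back to the last listed option if that rank is not available.
            result.append(options[min(rank, len(options) - 1)])
        else:
            # Codons without a table entry are kept unchanged.
            result.append(codon)
    changed = sum(old != new for old, new in zip(codons, result))
    return "".join(result), changed


# Hypothetical table: non-optimal codon -> synonymous replacements, in order.
HYPOTHETICAL_TABLE = {"CTA": ["CTG", "CTC"], "AGA": ["CGT"]}

first_choice = replace_codons("ATGCTAAGACTATAA", HYPOTHETICAL_TABLE)
second_choice = replace_codons("ATGCTAAGACTATAA", HYPOTHETICAL_TABLE, rank=1)
```

Because every occurrence of a listed codon is replaced, this corresponds to the 100% case; a partial modification would apply the same lookup to only a subset of positions.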
- the sequence will preferably be provided in an expression construct, e.g. an expression vector.
- the polynucleotide may be provided in an expression vector.
- Suitable expression vectors will vary according to the recipient host cell and suitably may incorporate regulatory elements which allow expression in the host cell of interest and preferably which facilitate high-levels of expression. Such regulatory sequences may be capable of influencing transcription or translation of a gene or gene product, for example in terms of initiation, accuracy, rate, stability, downstream processing and mobility.
- Such elements may include, for example, strong and/or constitutive promoters, 5' and 3' UTRs, transcriptional and/or translational enhancers, transcription factor or protein binding sequences, start sites and termination sequences, ribosome binding sites, recombination sites, polyadenylation sequences, sense or antisense sequences, sequences ensuring correct initiation of transcription and optionally poly-A signals ensuring termination of transcription and transcript stabilisation in the host cell.
- the regulatory sequences may be plant-, animal-, bacteria-, fungal- or virus derived, and preferably may be derived from the same organism as the host cell.
- appropriate regulatory elements may vary according to the host cell of interest. For example, regulatory elements which facilitate high-level expression in prokaryotic host cells such as in E. coli may include the pLac, T7, P(Bla), P(Cat), P(Kat), trp or tac promoters.
- Regulatory elements which facilitate high-level expression in eukaryotic host cells might include the AOX1 or GAL1 promoter in yeast or the CMV- or SV40-promoters, CMV-enhancer, SV40-enhancer, Herpes simplex virus VP16 transcriptional activator or inclusion of a globin intron in animal cells.
- constitutive high-level expression may be obtained using, for example, the Zea mays ubiquitin 1 promoter or 35S and 19S promoters of cauliflower mosaic virus.
- Suitable regulatory elements may be constitutive, whereby they direct expression under most environmental conditions or developmental stages, developmental stage specific or inducible.
- the promoter is inducible, to direct expression in response to environmental, chemical or developmental cues, such as temperature, light, chemicals, drought, and other stimuli.
- promoters may be chosen which permit expression of the protein of interest at particular developmental stages or in response to extra- or intra-cellular conditions, signals or externally applied stimuli.
- a range of promoters exist for use in E. coli which give high-level expression at particular stages of growth (e.g. osmY stationary phase promoter) or in response to particular stimuli (e.g. HtpG Heat Shock Promoter).
- Suitable expression vectors may comprise additional sequences encoding selectable markers which allow for the selection of said vector in a suitable host cell and/or under particular conditions. Suitable expression vectors may also comprise additional sequences which enable visualisation or quantification of the expressed protein (e.g. 3' GFP or Luciferase fusion tags) in the host cell of interest. Preferred expression vectors are those which also enable the expressed protein to be easily separated from other cellular proteins for downstream applications.
- the expression vector may incorporate a fusion tag domain, which when fused to the coding sequence of the protein of interest allows the expressed protein to be bound to a matrix, column or beads (e.g. glutathione-S-transferase (GST)).
- the expression vector comprising the heterologous polynucleotide sequence may optionally comprise polynucleotide sequences coding for one or more transit peptides, capable of localising the expressed protein to a particular cellular compartment in the host cell.
- such domains may cause secretion of expressed protein, for example into the extracellular medium to enable the protein to be easily recovered from the cell culture medium.
- suitable transit peptides may cause the protein to localise to, for example, the cell wall, nucleus or chloroplasts.
- the methods of the present invention will be useful in the production of a large number of different proteins in the agricultural, chemical, industrial and pharmaceutical fields, particularly for example antibodies, vaccines, hormones and other protein therapeutics.
- levels of heterologous protein are increased relative to the respective native (i.e. unoptimised) protein by modification of the codon usage of the polynucleotide sequence which encodes the protein of interest.
- the levels of heterologous protein may increase in the range 5% to 500% relative to native (unoptimised) protein; optionally in the range 10% to 250%, 20% to 200%, 25% to 100%, 30% to 75% or 35% to 65%.
- proteins of interest may preferably be recovered from the cell culture medium as secreted proteins, although they may also be recovered from host cell lysates.
- the utility of the present invention resides in the universal applicability of the optimal replacement codons to any polynucleotide having a coding sequence and having one or more of the codons listed in the relevant table for expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells or animal cells.
- Methods of the invention can be applied to any type of host cell which is genetically accessible and which can be cultured. In other words, the approach may be applied to those cells which are able to serve as a host for production of the protein of interest (POI). It may therefore be applied to commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells commonly employed for recombinant heterologous protein expression.
- host cells will be selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell.
- the host cell may be an Escherichia coli cell.
- the host cell may be a Saccharomyces cerevisiae cell.
- the host cell may be a Caenorhabditis elegans cell.
- the host cell may be a Mus musculus cell.
- the host cell may be a bacterial cell or alternatively the host cell may be an archaeal cell.
- Host cells may be gram-negative bacterial cells.
- Host cells may be gram-positive bacterial cells.
- host cells may include but are not limited to; an Aliivibrio fischeri cell, a Bacillus subtilis cell, a Caulobacter crescentus cell, an Escherichia coli cell, a Mycoplasma genitalium cell, a Synechocystis cell, a Pseudomonas fluorescens cell.
- the host cell is a bacterial cell.
- the host cell is an Escherichia coli (E. coli) cell.
- the host cell is a prokaryotic cell
- the highest functional protein expression will be achieved by modification of each codon in the polynucleotide sequence for which a synonymous codon exists according to the relevant tables above.
- preference may be given to the first replacement codon appearing in the relevant table.
- preference may be given to the second replacement codon appearing in the relevant table.
- host cells may include but are not limited to; a Chlamydomonas reinhardtii cell, a Dictyostelium discoideum cell, a Tetrahymena thermophila cell, an Emiliania huxleyi cell or a Thalassiosira pseudonana cell.
- the host cell is a Chlamydomonas cell.
- the host cell is a Chlamydomonas reinhardtii cell.
- the host cell may include but is not limited to fungal cells and yeast cells.
- the host cell may be a Saccharomyces cerevisiae cell, an Ashbya gossypii cell, an Aspergillus fumigatus cell, an Aspergillus nidulans cell, a Candida albicans cell, a Coprinus cinereus cell, a Cunninghamella elegans cell, a Cryptococcus neoformans cell, a Fusarium oxysporum cell, a Magnaporthe oryzae cell, a Neurospora crassa cell, a Schizophyllum commune cell, a Schizosaccharomyces pombe cell, an Ustilago maydis cell or a Zymoseptoria tritici cell.
- the host cell is a Saccharomyces cerevisiae cell or a Schizosaccharo
- the host cell is a plant cell
- any cell type of any plant species including both monocots and dicots, may be used as a host system for expression of a heterologous protein.
- Preferred plant cells for use in the present invention are genetically tractable, and are commonly derived from either crop species, species which typically exhibit high growth rates, are easily harvested or species which have established genetic resources associated with them.
- the host cell is an Arabidopsis cell, preferably an Arabidopsis thaliana cell.
- the host cell may be a Nicotiana cell, preferably a Nicotiana tabacum cell.
- said plant may suitably be selected from the following: maize (Zea mays), canola (Brassica napus, Brassica rapa ssp.), sugar beet (Beta vulgaris), oat (Avena sp.), barley (Hordeum vulgare), flax (Linum usitatissimum), alfalfa (Medicago sativa), rice (Oryza sativa), rye (Secale cereale), sorghum (Sorghum bicolor, Sorghum vulgare), switchgrass (Panicum virgatum), prairie cordgrass (Spartina sp.), purple false brome (Brachypodium distachyon), sunflower (Helianthus annuus), wheat (Triticum aestivum), soybean (Glycine max), potato (Solanum tuberosum), cotton (Gossypium hirsutum), sweet potato (Ipomoea batatas), cass
- Expression constructs comprising the modified polynucleotide sequence may be located in plasmids (expression vectors) which are used to transform the host cell.
- transformation may include heat shock, electroporation, particle bombardment, chemical induction, microinjection and viral transformation.
- the expression levels of the protein of interest in host cells of interest may be determined.
- the method chosen allows for quantitative assessment of the level of functional expression.
- functional expression may be directly determined, e.g. as with GFP, luciferase or by enzymatic action of the protein of interest (POI) to generate a detectable optical signal, such as fluorescence or luminescence or a colour change caused by the protein.
- the POI will be detectable by a high-throughput screening method, for example, relying on the detection of an optical signal.
- using an optical signal which is directly proportionate to the quantity of the expression product from the polynucleotide is a convenient method of measuring expression and is amenable to high throughput processing.
- Suitable tags may include but are not limited to a fluorescence reporter molecule translationally fused to the C-terminal end of the POI, e.g. Green Fluorescent Protein (GFP), Yellow Fluorescent Protein (YFP), Red Fluorescent Protein (RFP) or Cyan Fluorescent Protein (CFP).
- the expression vector may incorporate a polynucleotide reporter encoding a luminescent protein, such as a luciferase (e.g. firefly luciferase).
- the reporter gene may encode a chromogenic enzyme which can be used to generate an optical signal, such as beta-galactosidase (LacZ) or beta-glucuronidase (Gus).
- Tags used for detection of expression may also be antigen peptide tags.
- a tag may be provided for affinity purification, e.g. a polyhistidine tag.
- any tag employed for detection of expression will be cleavable from the POI. It is envisaged that other types of label may also be used to mark the protein including, for example, organic dye molecules or radiolabels.
- the measurement of expression comprises the detection of an optical signal, for example a fluorescent signal, a luminescent signal or colour signal.
- the optical signal is provided by a GFP reporter fused to the protein of interest.
- the replacement codon selected from synonymous codons listed as alternatives in the relevant table(s) for a given host is the codon associated with the highest or optimal observed functional expression of the POI, or where more than one codon provides substantially equal such expression, one such codon corresponding with that level of expression. Where there is more than one replacement codon indicated for a given non-optimal codon based on the expression data, this corresponds to the first replacement codon appearing in the relevant table. Therefore where there is choice of codons indicated for a selected position based on the expression data, preference may be given to the first replacement codon appearing in the relevant table. Alternatively, preference may be given to the second replacement codon appearing in the relevant table.
- the codon in the starting sequence may be retained, i.e. the wild type codon in embodiments where the starting sequence is the wild-type sequence. This will minimise the number of codon changes to convert the starting sequence in a polynucleotide to the selected synonymous coding sequence for improved functional protein expression.
- Figure 1 shows the influence of codon optimisation on protein yield, mRNA stability and translatability.
- Panel A is a graphical representation of the nucleotide content of the third codon position in the constructs for Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP) expression. GFP was also expressed without SP.
- Panel B is a graphical representation of protein yield in transformed Arabidopsis thaliana seedlings. For each plant analysed the protein yield in ng per mg total soluble protein (TSP) is plotted against the relative mRNA transcript concentration as compared to the A.
- Figure 2 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked nucleotide use.
- Figure 3 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon use.
- Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and codon use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles respectively.
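The rank-normalise, average and correlate procedure described for these heat maps can be sketched in plain Python. This is an illustrative outline under stated assumptions: ties receive their average rank, ranks are scaled by the number of genes, and the expression values and codon frequencies are made-up toy numbers. The function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_normalize(values):
    """Scale ranks to the interval (0, 1] so studies become comparable."""
    return [r / len(values) for r in average_ranks(values)]


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant


# Toy data: expression of four genes in two studies, and one codon's frequency.
study_a = [10, 200, 35, 80]
study_b = [1.2, 9.0, 2.0, 1.9]
averaged = [mean(pair) for pair in zip(rank_normalize(study_a), rank_normalize(study_b))]
codon_frequency = [0.10, 0.40, 0.20, 0.30]
rho = spearman(averaged, codon_frequency)
```

Repeating the last step for every codon and every species would give the matrix of correlations from which such a heat map is drawn.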
- Figure 4 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked amino acid use.
- Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged.
- correlations (Spearman) between expression and amino acid use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
- Figure 5 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon bias.
- Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) was rank-normalized and averaged.
- genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, in 1% steps, the extremes (5% highest versus 5% lowest) were reached.
- the synonymous codon use frequencies in both the high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- versus the low-expressed gene pool.
- the difference in codon use frequency was correlated to the expression defining percentage (Spearman). The relation between the species based on this correlation is visualized in this heat map.
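The pooling scheme described for Figure 5 can be sketched as below. The sketch assumes the genes are supplied as coding sequences already sorted from highest to lowest expression, and measures a codon's use as its share among the codons of its synonymous family within a pool. The function names and the two-codon toy family are hypothetical; a real leucine family, for example, has six codons.

```python
from collections import Counter


def family_share(genes, codon, family):
    """Share of `codon` among all occurrences of its synonymous family in a pool."""
    counts = Counter()
    for cds in genes:
        counts.update(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in family)
    return counts[codon] / total if total else 0.0


def frequency_differences(ranked_genes, codon, family, start=50, stop=5):
    """Difference in codon share (high pool minus low pool) for each percentage."""
    results = []
    for pct in range(start, stop - 1, -1):  # 50%, 49%, ..., 5%
        # Pool size for this step; at least one gene per pool.
        k = max(1, len(ranked_genes) * pct // 100)
        high, low = ranked_genes[:k], ranked_genes[-k:]
        diff = family_share(high, codon, family) - family_share(low, codon, family)
        results.append((pct, diff))
    return results


# Toy gene list, highest expression first (hypothetical sequences).
ranked = ["CTGCTG", "CTGCTA", "CTACTG", "CTACTA"]
steps = frequency_differences(ranked, "CTG", {"CTG", "CTA"})
```

The resulting differences would then be correlated with the percentages (Spearman) to give one value per codon and species; that final step is omitted here.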
- Figure 6 shows a graphical representation of mRNA structural features plotted against ranked expression with moving average (black line).
- the mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined.
- Figure 7 shows a heat map where the mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined and correlated with expression (Spearman) (Table 2).
- the heat map demonstrates that highly-expressed genes across all kingdoms prefer a stable, but 'airy' mRNA structure. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
- Figure 8 is a heat map showing correlations (Spearman) between mRNA structure characteristics and protein:mRNA ratios per species (Table 3), demonstrating that highly translated transcripts across kingdoms share a similar 'airy' structure.
- the mRNA structures of all genes of Escherichia coli (Eubacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with protein:mRNA ratios. Rank-normalized mRNA levels were divided by protein abundance (retrieved from PaxDB). Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
- Figure 9 shows mRNA structure predictions of the constructs used for heterologous protein expression. Sequences of the native and optimised variants of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional signal peptide (SP) and GFP without SP flanked by the 5' and 3'-UTRs as expected from our expression cassette were used to predict the mRNA secondary structure.
- Figure 10 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked nucleotide use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and nucleotide content (overall and for each codon position) for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
- Figure 12 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked amino acid use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and amino acid use for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
- Figure 13 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences.
- Figure 14 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
- Figure 15 shows a sequence alignment of native (nat) and optimized (opt) mIL-10 sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
- Figure 16 shows a sequence alignment of native (nat) and optimized (opt) OVA sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
- Example 1 - Codon optimisation improves mRNA stability and translatability
- the genes of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) were chosen because of their variation in codon use (Figure 1a). To eliminate differences caused by translation initiation all genes were preceded by the signal peptide of Arabidopsis thaliana chitinase. GFP was also expressed without this signal peptide, as it is normally not secreted.
- Protein:mRNA ratios were calculated. Because translatability may be lower at higher mRNA concentrations due to the limited number of free ribosomes, the protein:mRNA ratios were calculated for samples within the same mRNA concentration range, as indicated. The fold change of the optimised relative to the native variant was calculated for the relative mRNA concentration, protein yield and protein:mRNA ratio. For each average the number of included seedlings is indicated (n). Significance of fold changes was calculated with a Welch's t-test: * P < 0.05, ** P < 0.01, *** P < 0.001.
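The ratio, fold-change and Welch's t-test calculation described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function names and the per-seedling values are hypothetical, and only the t statistic and the Welch-Satterthwaite degrees of freedom are computed (a P value would additionally require a t distribution, for example from a statistics package).

```python
from math import sqrt
from statistics import mean, variance

def fold_change(optimised, native):
    """Fold change of the optimised mean over the native mean."""
    return mean(optimised) / mean(native)

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-seedling measurements (arbitrary units), not data from the source.
native_protein = [1.0, 1.2, 0.8, 1.1]
native_mrna = [2.0, 2.2, 1.9, 2.1]
opt_protein = [3.1, 2.8, 3.4, 3.0]
opt_mrna = [2.1, 2.0, 2.2, 2.0]

# Protein:mRNA ratio per seedling, then fold change and test statistic.
native_ratio = [p / m for p, m in zip(native_protein, native_mrna)]
opt_ratio = [p / m for p, m in zip(opt_protein, opt_mrna)]
fc = fold_change(opt_ratio, native_ratio)
t, df = welch_t(opt_ratio, native_ratio)
```

Welch's variant is used because it does not assume equal variances in the native and optimised groups.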
- thermodynamic stability of the predicted secondary mRNA structures was calculated.
- the minimum free folding energy had decreased, indicative of a more stable mRNA, from -0.25 to -0.35 and -0.31 to -0.33 kcal/mol/nt for GFP and OVA, respectively.
- the minimum free folding energy increased from -0.31 to -0.28 kcal/mol/nt, indicating a less stable mRNA.
- an overall increase in physical stability could not explain the increased mRNA transcript levels of IL-10.
- dsRNA stretches could be processed to small interfering RNAs and, like binding of microRNAs, can trigger gene silencing.
- gene silencing can also be due to gene methylation, but this always results in the complete absence of transcripts and therefore transformants without detectable expression were not considered.
- co-expression of the silencing inhibitor p19 gave comparable results.
- Ribosomes can shield nuclease target sites, however, in large-scale in vivo studies mRNA half-life could not be linked to the number of nuclease target sites or ribosomal density.
- If translation initiation is equal, as is expected in our experiments, an increase in translatability should result in a lower density of ribosomes.
- There would then have been fewer ribosomes on the optimised variants compared to their native counterparts, and the optimised variants would be less protected against nucleases.
- Although translation per se may not influence mRNA half-life, errors in translation have been proven to lead to mRNA degradation by mRNA surveillance mechanisms.
- mRNA surveillance mechanisms include: I) nonsense-mediated decay by the recognition of a premature stop codon, II) non-stop decay by the lack of a stop codon and III) no-go decay by stalled ribosomes.
- Occurrence of a premature stop codon or the lack of a stop codon can be caused by a mutation or a ribosomal slip causing a frame-shift.
- Frame-shifts can be caused by a 'slippery' sequence that may be found in proximity of a strong mRNA structure.
- a ribosome may also stall at a strong stem-loop structure without slipping and trigger degradation.
- the native and optimised variants differ in the presence of 'slippery' sequences and/or strong mRNA structures.
- differences in level of translation-linked mRNA decay may explain the difference in mRNA transcript levels in our experiment.
- ribosomes have intrinsic helicase activity and recently it was shown that strong mRNA structures such as pseudoknots and hairpins can stall translation only temporarily. It is therefore thought that the mRNA structure provides a mechanical basis for cellular regulation of translation rate.
- increased mRNA translatability of the optimised genes may be explained by an increased translation rate caused by differences in the mRNA structure.
- Example 2 - General codon bias extends to other kingdoms of life
- The existence of codon biases in different species has implications for the efficient expression of heterologous proteins in a range of host cells.
- To assess whether the general codon bias in plants transcends kingdoms of life, expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were interrogated.
- Per species >250 microarrays originating from several studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues were used (Table 1A-F).
- the relative synonymous codon use was calculated. Subsequently, a comparison was made between high- and low-expressed genes, as a correlation between codon use and expression may only be found in genes expressed above a certain threshold. Genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1% steps, the pools with 5% highest and 5% lowest expressed genes were reached. With each step the codon use frequencies in both high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- versus the low-expressed gene pool. Finally, the difference in codon use frequency was correlated (Spearman) to the expression defining percentage.
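The codon-use comparison described above can be illustrated with a short sketch. It makes several assumptions that are not stated in the source: "relative synonymous codon use" is taken as the fraction of each codon within its synonymous family, gene pools are formed by concatenating coding sequences, and the function names and example sequences are hypothetical. Input is assumed to be uppercase DNA in frame.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_codon_use(cds):
    """Fraction of each codon among the codons encoding the same amino acid."""
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}

def pool_difference(genes, percent, codon):
    """Use of `codon` in the top `percent`% of genes minus its use in the bottom `percent`%.

    `genes` is a list of (expression, cds) pairs.
    """
    ranked = sorted(genes, key=lambda g: g[0])
    k = max(1, len(ranked) * percent // 100)
    low = "".join(cds for _, cds in ranked[:k])
    high = "".join(cds for _, cds in ranked[-k:])
    return (synonymous_codon_use(high).get(codon, 0.0)
            - synonymous_codon_use(low).get(codon, 0.0))

# Hypothetical toy input: two genes with different leucine codons.
example_genes = [(1.0, "CTGCTG"), (10.0, "CTCCTC")]
difference = pool_difference(example_genes, 50, "CTC")
```

Calling `pool_difference` for each percentage from 50 down to 5 would give the series of differences that the text then correlates with the expression-defining percentage.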
- M. musculus seems to have an overall lower codon bias and in ~50% of the cases selects for other codons compared to the overall selection of the other species.
- 13 codons are positively correlated with expression for all species. These 13 codons encode 11 different amino acids and a termination of translation (twice a codon for Thr/T). Comparable to the general codon bias found in plants, 8 of these 13 codons are C-ending. Furthermore, 18 codons are consistently negatively correlated with expression in these four species.
- Of these codons, most are A-ending (8), while none of them are C-ending. Strikingly, 5 universal codons were found which were positively correlated with expression for all species, indicating that these codons are conserved in the coding sequences of highly-expressed genes across all kingdoms of life and could therefore find useful application in methods of optimising functional protein expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. In addition, several codons were found which were positively correlated with further increases in expression in E. coli, S. cerevisiae and C. elegans. Furthermore, in addition to the universal set of codons, several codons were found to be positively correlated with increases in expression in E. coli, S. cerevisiae, C. elegans and Mus musculus. Separately, several codons were found to be positively correlated with increased expression in A. thaliana.
- Example 3 Highly expressed genes prefer a stable, but 'airy' mRNA structure
- the relationship between expression and mRNA structure characteristics was evaluated.
- the mRNA structures of all genes were predicted, and gene length, minimal free folding energy, number of bound nucleotides, mean stem and loop size (stretches of bound and unbound nucleotides, respectively) and the number of stem/loop transitions were determined and plotted against expression (Figure 6; Table 7).
- a heat map displaying the relation between the species based on the correlation (Spearman) between these structure characteristics and expression was generated (Figure 7; Table 7). This heat map demonstrates that the number of bound nucleotides and the number of stem/loop transitions was consistently positively correlated and mean loop size consistently negatively correlated with expression across all species.
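As an illustration of the structure characteristics named above, the sketch below derives them from a predicted secondary structure in dot-bracket notation. It assumes that a stem is any uninterrupted run of paired positions ("(" or ")"), a loop is any uninterrupted run of unpaired positions ("."), and a transition is each switch between the two; the source does not give exact definitions, and the function name is hypothetical. A non-empty structure is assumed.

```python
from itertools import groupby
from statistics import mean

def structure_stats(dot_bracket):
    """Summarise stems (bound runs) and loops (unbound runs) of a dot-bracket string."""
    runs = [(bound, len(list(group)))
            for bound, group in groupby(dot_bracket, key=lambda ch: ch in "()")]
    stems = [n for bound, n in runs if bound]
    loops = [n for bound, n in runs if not bound]
    length = len(dot_bracket)
    return {
        "bound_fraction": sum(stems) / length,        # corrected for gene length
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "transitions_per_kb": 1000 * (len(runs) - 1) / length,
    }

# Hypothetical hairpin: 2 unpaired, 4 paired, 3 unpaired, 4 paired, 2 unpaired.
stats = structure_stats("..((((...))))..")
```

Normalising the transition count per 1,000 nucleotides keeps genes of different length comparable.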
- Table 7 mRNA characteristics of highly expressed genes per species.
- Table 8 Calculated mRNA structure characteristics of the constructs used for heterologous protein expression. Analysis of the mRNA secondary structure predictions given in Figure 9. Folding energy, bound nucleotides and number of transitions are corrected for gene length. Stem and loop sizes are mean values.
- the number of stem-loop transitions is positively correlated with protein:mRNA ratio and mean loop size is negatively correlated across all species.
- the folding energy is negatively correlated (more stable mRNA) for S. cerevisiae, C. elegans and A. thaliana, but not for E. coli and M. musculus.
- gene length is consistently negatively correlated with protein:mRNA ratio. This is in line with the fact that the packing density of ribosomes was shown to decrease with mRNA transcript length.
- a negative correlation with mean stem size is found for all species and the fraction of bound nucleotides is not correlated, except for S. cerevisiae.
- small stem size must be important for an increased translation rate. This again highlights the tradeoff between mRNA stability and translatability.
- GFP green-fluorescent protein
- OVA Gallus gallus ovalbumin
- IL-10 Mus musculus interleukin-10
- Optimisation was performed by recoding the protein sequences using the C-ending codons for all amino acids (TCC in the case of Ser), except Arg and Gly, for which the T-ending codons were used, and Gln, Glu and Lys, for which the G-ending codons were used.
- CTC C-ending codons for all amino acids
- Arg and Gly for which the T-ending codons were used
- Gln, Glu and Lys
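The recoding rule stated above can be written as a simple lookup. This is a minimal sketch, not the authors' tool: the dictionary and function names are hypothetical, Met and Trp keep their single codons, and the choice of TAA for the stop codon is an assumption taken from the replacement table given elsewhere in this document rather than from the rule itself.

```python
# One preferred codon per amino acid (one-letter code).
PREFERRED = {
    # C-ending codons (TCC for Ser, as stated in the text)
    "A": "GCC", "C": "TGC", "D": "GAC", "F": "TTC", "H": "CAC", "I": "ATC",
    "L": "CTC", "N": "AAC", "P": "CCC", "S": "TCC", "T": "ACC", "V": "GTC",
    "Y": "TAC",
    # T-ending codons for Arg and Gly
    "R": "CGT", "G": "GGT",
    # G-ending codons for Gln, Glu and Lys
    "Q": "CAG", "E": "GAG", "K": "AAG",
    # Single-codon amino acids
    "M": "ATG", "W": "TGG",
    # Stop codon: assumption, see lead-in
    "*": "TAA",
}

def recode(protein):
    """Back-translate a protein sequence using one preferred codon per residue."""
    return "".join(PREFERRED[aa] for aa in protein.upper())

# Hypothetical four-residue example ending in a stop.
example = recode("MKR*")
```

Because every residue maps to exactly one codon, the output is deterministic and the encoded amino acid sequence is unchanged.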
- Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10 g/l peptone 140, 5 g/l yeast extract, 10 g/l NaCl, pH 7.0) containing 50 µg/ml kanamycin. Bacterial cultures were centrifuged for 15 min at 2800 g and resuspended in MMA (20 g/l sucrose, 5 g/l MS-salts, 1.95 g/l MES, pH 5.6) containing 200 µM acetosyringone and 0.03% Silwet L-77 till an OD of 0.5 was reached.
- Arabidopsis thaliana plants were submerged in the bacterial suspension for 1 min and kept in a moist environment for 2 days. Plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) until seeds could be collected. Seeds were sterilized by 4-hour exposure to chlorine gas and plated on basic agar plates (8 g/l Bacto Agar, 0.101 g/l KNO3) containing 30 ng/ml hygromycin and 100 µg/ml cefotaxime. Plates were kept in the dark at 4°C for 2 days, then placed in artificial light for 7 hours at 24°C, again kept in the dark at RT for 5 days and finally placed in a climate chamber with 12 hour light regime at 24°C for 2 days.
- Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10 g/l peptone 140, 5 g/l yeast extract, 10 g/l NaCl, pH 7.0) containing 50 µg/ml kanamycin and 20 µg/ml rifampicin.
- OD was measured again after 16 hours and the bacterial cultures were centrifuged for 15 min at 2800 g.
- the bacteria were resuspended in MMA infiltration medium (20 g/l sucrose, 5 g/l MS-salts, 1.95 g/l MES, pH 5.6) containing 200 µM acetosyringone till an OD of 1 was reached. All constructs were co-expressed with the tomato bushy stunt virus silencing inhibitor p19 by mixing Agrobacterium cultures 1:1. After 1-2 hours incubation at room temperature, the two youngest fully expanded leaves of 5-6 weeks old Nicotiana benthamiana plants were infiltrated completely.
- Infiltration was performed by injecting the Agrobacterium suspension into a Nicotiana benthamiana leaf at the abaxial side using a 1 ml syringe. Infiltrated plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) and infiltrated leaves were harvested at selected time points.
- the oligonucleotides used for amplification of both native and optimised IL-10, OVA and GFP and TIP-41 were 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 2] / 5'-GGAAGTGGGTGCAGTT-3' [SEQ ID NO: 3]; 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 4] / 5'-GGGCAGTAGAAGATGTTC-3' [SEQ ID NO: 5]; 5'-GACGGTAACTACAAGACC-3' [SEQ ID NO: 6] / 5'-TTGTCGGCCATGATGTA-3' [SEQ ID NO: 7]; and 5'-GCTCATCGGTACGCTCTTTT-3' [SEQ ID NO: 8] / 5'-TCCATCAGTCAGAGGCTTCC-3' [SEQ ID NO: 9], respectively.
- Relative transcript levels of the genes versus TIP-41 were determined by the Pfaffl method (Pfaffl,
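For orientation, the Pfaffl relative quantification model expresses the target-to-reference ratio as the target amplification efficiency raised to the target Ct difference, divided by the reference efficiency raised to the reference Ct difference, with each difference taken as control minus sample. The sketch below is a generic illustration with hypothetical parameter names and values; it does not reproduce the efficiencies or Ct values used in the source, and here the reference gene would be TIP-41.

```python
def pfaffl_ratio(e_target, ct_target_control, ct_target_sample,
                 e_ref, ct_ref_control, ct_ref_sample):
    """Relative expression ratio of a target gene versus a reference gene.

    e_target and e_ref are amplification efficiencies (2.0 means perfect doubling
    per cycle); the Ct differences are taken as control minus sample.
    """
    delta_target = ct_target_control - ct_target_sample
    delta_ref = ct_ref_control - ct_ref_sample
    return e_target ** delta_target / e_ref ** delta_ref

# Hypothetical Ct values: target gains 3 cycles, reference gains 1 cycle.
ratio = pfaffl_ratio(2.0, 25, 22, 2.0, 20, 19)
```

With both efficiencies at 2.0 the expression reduces to the familiar 2 to the power of delta-delta-Ct.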
- Crude extract was clarified by centrifugation at 16,000 x g for 5 min at 4°C and supernatant was directly used in an ELISA and BCA protein assay.
- Mouse IL-10 expression levels were determined using the Mouse IL-10 ELISA Ready-SET-Go!
- a rabbit anti-ovalbumin or a chicken anti-GFP, both from Rockland Immunochemicals Inc., was used to coat ELISA plates o/n at 4°C in a moist environment. After this and each following step the plate was washed 5 times with 30 sec intervals in PBST (1x PBS, 0.05% Tween-20) using an automatic plate washer (BioRad model 1575). The plate was blocked with assay diluent (eBioscience) for 1 h at room temperature. Samples and standard lines were loaded in serial dilutions and incubated for 1 h at room temperature.
- Standard lines were made from purified chicken ovalbumin (Sigma) or recombinant GFP (Roche).
- a rabbit anti- ovalbumin:HRP antibody or a rabbit anti-GFP:HRP antibody both from Rockland Immunochemicals Inc.
- a 3,3',5,5'-tetramethylbenzidine (TMB) substrate (eBioscience) was added and the colouring reaction was stopped using stop solution (0.18 M sulphuric acid) after 1-15 min.
- Read outs were performed using the model 680 microplate reader (BioRad) to measure the OD at 450 nm with correction filter of 690 nm.
- TSP total soluble protein
- BSA bovine serum albumin
- Gene expression datasets of 5 species were downloaded from Gene Expression Omnibus (GEO).
- GEO Gene Expression Omnibus
- Gene-expression sets were selected based on platform (Affymetrix), release date (not earlier than 2008), publication linked to the GEO set and number of samples in the study. In total 2067 gene-expression profiles were collected, representing 8 or 9 different studies per organism. An overview can be found in Table 1A-F.
- Example 11 - Protein abundance datasets
- Protein abundance datasets were retrieved from PaxDb (Wang et al., 2012, Mol Cell Proteomics, 11: 492-500), where the integrated datasets of Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus were downloaded.
- Gene expression was normalized based on rank. Per species one array platform was used and per species probes were ranked according to their intensities. The average rank per probe was used as a measure of overall gene expression to distinguish genes with overall low and high expression levels for each species.
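The rank-based normalisation described above can be sketched as follows. This is an assumption-laden illustration: each array is represented as a hypothetical dictionary from probe to intensity, rank 1 is given to the lowest intensity, and ties are broken by sort order because the source does not say how ties were handled.

```python
def rank_probes(intensities):
    """Rank probes within one array; 1 = lowest intensity."""
    ordered = sorted(intensities, key=intensities.get)
    return {probe: position for position, probe in enumerate(ordered, start=1)}

def average_rank(arrays):
    """Mean rank of each probe across all arrays of one platform."""
    per_array = [rank_probes(a) for a in arrays]
    return {probe: sum(r[probe] for r in per_array) / len(per_array)
            for probe in per_array[0]}

# Hypothetical intensities for three probes on two arrays.
arrays = [
    {"probe_a": 5.0, "probe_b": 9.0, "probe_c": 1.0},
    {"probe_a": 7.0, "probe_b": 3.0, "probe_c": 1.0},
]
overall = average_rank(arrays)
```

Working on ranks rather than raw intensities makes arrays from different studies comparable without assuming a common intensity scale.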
- the coding sequences (CDS) of all genes of 5 species were downloaded from sequence/genome repositories.
- CDS coding sequences
- For Arabidopsis thaliana the CDS of the 20101108 release were obtained from TAIR (Lamesch et al., 2012, Nucleic Acids Research 40: D1202-1210).
- the open reading frames (without UTR, introns, etc.) of the 20110203 release were obtained from the Saccharomyces genome database (Cherry et al., 2012, Nucleic Acids Research 40: D700-705).
- the CDS of WS241 were obtained from WormBase (Yook et al., 2012, Nucleic Acids Research 40: D735-741).
- the CDS of the 20130508 release (GRCm38.p1) were obtained from the NCBI CCDS database (Farrell et al., 2014, Nucleic Acids Research 42: D865-872).
- the mRNAs of all species were folded using Vienna RNAfold (Lorenz et al., 2011, Algorithms for Molecular Biology 6: 26) at 20°C, using the parameters of Andronescu et al. (Andronescu et al., 2007, Bioinformatics 23: i19-28).
- the M. musculus mRNA was also folded at 37°C and the S. cerevisiae mRNA also at 30°C, but all the reported comparisons are based on 20°C.
- Example 12 Gene expression and mRNA folding statistics
- the correlations between gene expression and the various mRNA-based statistics were calculated by Spearman correlation (in R 3.0.2 x64). For some of the factors a correction was applied for gene length; these were: number of bound nucleotides, number of unbound nucleotides, energy of the structure, number of stems, number of loops, triplet usage, nucleotide usage, and amino acid usage.
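The source performed these correlations in R. Purely as an illustration of the calculation, the sketch below computes a Spearman correlation (Pearson correlation of average ranks) after dividing a statistic by gene length. The division is an assumed form of the gene-length correction, and the function names and values are hypothetical.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation coefficient of two equal-length sequences."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical genes: expression, bound nucleotides and gene length.
expression = [1.0, 2.0, 3.0, 4.0]
bound = [30, 80, 150, 260]
length = [100, 200, 300, 400]
bound_per_nt = [b / n for b, n in zip(bound, length)]  # assumed length correction
rho = spearman(expression, bound_per_nt)
```

Without the correction, longer genes would trivially show more bound nucleotides regardless of how they fold.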
- a novel aspect of our finding is the selection of mRNA structures with the most even distribution of stems and loops leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Below is an example procedure used to select the most optimal mRNA structure for improved functional expression in a host cell of interest.
- the first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (> 100,000 mRNA variants).
- the second step is in silico folding of all mRNA species in the pool under the temperature and salt concentrations relevant for the preferred host.
- the third step is the selection of mRNAs from the pool that meet the following criteria:
- average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)
- average stem size is below 5.20 bp (or between 5.20 and 2.5 bp)
- average loop size is below 3.32 bp (or between 3.32 and 3 bp)
- the standard deviation of the loop size is below 3.20 (or between 3.20 and 2 bp) (measure for even distribution)
- the standard deviation of the stem size is below 3.40 (or between 3.40 and 2 bp) (measure for even distribution)
- maximum loop size is below 18 bp (discard uneven stem loop distributions)
- maximum stem size is below 19 bp (discard uneven stem loop distributions)
- C. elegans
- average stem size is below 5.35 bp (or between 5.35 and 2.5 bp)
- the standard deviation of the stem size is below 3.27 (or between 3.27 and 2 bp)
- maximum stem size is below 18 bp
- E. coli
- average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)
- average stem size is below 5.45 bp (or between 5.45 and 2.5 bp)
- the standard deviation of the stem size is below 3.50 (or between 3.50 and 2 bp)
- maximum stem size is below 18 bp
- M. musculus
- average number of stem-loop transitions is above 120 per 1,000 bp (or between 120 and 250 per 1,000 bp)
- average stem size is below 4.35 bp (or between 4.35 and 2.5 bp)
- average loop size is below 5.18 bp (or between 5.18 and 4 bp)
- the standard deviation of the stem size is below 3.28 (or between 3.28 and 2 bp)
- average number of stem-loop transitions is above 110 per 1,000 bp (or between 110 and 250 per 1,000 bp)
- average stem size is below 5.27 bp (or between 5.27 and 2.5 bp)
- the standard deviation of the loop size is below 3.65 (or between 3.65 and 2 bp)
- the standard deviation of the stem size is below 3.25 (or between 3.25 and 2 bp)
- step 3 where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred.
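The three-step procedure above (generate synonymous variants, fold them in silico, select by structure criteria) can be sketched as a small pipeline. Several parts are assumptions: `fold` is a hypothetical user-supplied function that returns a dot-bracket structure for a sequence (in practice a wrapper around a folding program), stems are taken as uninterrupted runs of paired positions, the population standard deviation is used, and only the primary thresholds listed for E. coli are applied. The number of variants grows multiplicatively with protein length, so the sketch enumerates lazily and stops at `limit`.

```python
from itertools import groupby, islice, product
from statistics import mean, pstdev

# Standard genetic code in TCAG order, grouped by amino acid.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
SYNONYMS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    SYNONYMS.setdefault(aa, []).append("".join(codon))

# Thresholds copied from the E. coli criteria listed above.
E_COLI = {"min_transitions_per_kb": 116, "max_mean_stem": 5.45,
          "max_sd_stem": 3.50, "max_stem": 18}

def variants(protein):
    """Step 1: lazily yield every synonymous coding sequence for `protein`."""
    for combo in product(*(SYNONYMS[aa] for aa in protein)):
        yield "".join(combo)

def passes(dot_bracket, criteria=E_COLI):
    """Step 3: check a folded structure against the selection thresholds."""
    runs = [(bound, len(list(group)))
            for bound, group in groupby(dot_bracket, key=lambda ch: ch in "()")]
    stems = [n for bound, n in runs if bound]
    if not stems:
        return False
    transitions_per_kb = 1000 * (len(runs) - 1) / len(dot_bracket)
    return (transitions_per_kb > criteria["min_transitions_per_kb"]
            and mean(stems) < criteria["max_mean_stem"]
            and pstdev(stems) < criteria["max_sd_stem"]
            and max(stems) < criteria["max_stem"])

def select(protein, fold, criteria=E_COLI, limit=100_000):
    """Steps 1 to 3: keep variants whose folded structure (step 2) passes."""
    return [seq for seq in islice(variants(protein), limit)
            if passes(fold(seq), criteria)]
```

Passing the folding routine in as a parameter keeps the sketch independent of any particular folding software and of the temperature and salt settings chosen for the host.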
- Table 1 C Description of the gathered S. cerevisiae expression data.
- Table 6A Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Escherichia coli. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated. AA Triplet All Top 5% Bottom 5% Top/Bottom
- Table 6C Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Caenorhabditis elegans. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low- expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
- Table 6D Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Arabidopsis thaliana. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low- expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
- Table 9 Analysis of the mRNA secondary structure characteristics (stem architecture) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
- Table 11 Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
- Table 14 Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
- Table 15 Differences in the mRNA secondary structure characteristics (stem architecture) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
- Table 17 Differences in the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Abstract
The present invention relates to an approach aimed at the modification of codons in individual polynucleotide sequences encoding a heterologous protein of interest, without altering the amino acid sequence of the polypeptide to enhance the amount of functional expression in a host organism of interest. In its broadest aspect, this approach exploits redundancy in the genetic code by providing a universal set of codons which may be used at certain positions in the polynucleotide sequence in order to achieve improved heterologous protein production in a range of host cells. The present invention also relates to specific codons which may be used to increase protein expression in particular hosts. The present invention also relates to the optimization of the translation efficiency of messenger RNAs on the basis of their secondary structure characteristics, and the provided set of criteria may be used to increase protein expression in particular hosts.
Description
OPTIMISATION OF CODING SEQUENCE FOR FUNCTIONAL PROTEIN EXPRESSION
FIELD OF THE INVENTION
The present invention relates to an approach aimed at the modification of codons in individual polynucleotide sequences encoding a heterologous protein of interest, without altering the amino acid sequence of the polypeptide to enhance the amount of functional expression in a host organism of interest. Recognising that maximum translation efficiency and therefore protein production is influenced by codon usage of a coding sequence, in its broadest aspect, this approach exploits redundancy in the genetic code by providing a universal set of codons which may be used at certain positions in the polynucleotide sequence in order to achieve improved heterologous protein production in a range of host cells. The present invention also relates to the optimization of the translation efficiency of messenger RNAs on the basis of their secondary structure characteristics, and the provided set of criteria may be used to increase protein expression in particular hosts.
BACKGROUND TO THE INVENTION
Most amino acids are encoded by multiple synonymous codons and the frequency wherein synonymous codons are used is not equal within a given species. Also, within species a bias in codon use in highly expressed genes can be observed, linking codon use to gene expression. The codons used most frequently in highly expressed genes (optimal codons) have been shown to correspond to genomic G+C content and often match the most abundant tRNAs in many species. It is assumed that codons that match more abundant tRNAs would be translated faster as tRNA availability for translation occurs via diffusion and the chance of encountering a more abundant tRNA is greater than when encountering a rarer tRNA. An increase in translation rate allows ribosomes to finish translation and reinitiate translation sooner. Also, the probability that a ribosome initially loads a non-matching tRNA is smaller when a codon matches a more abundant tRNA resulting in an energetic advantage as three-quarters of the energy to incorporate an amino acid is lost if a non-matching tRNA has to be rejected after proofreading. Thus, the use of optimal codons in highly-expressed genes was hypothesized to provide a fitness gain by improved translational efficiency.
In recognition of the idea that increased translation efficiency may enhance protein yield, codon optimisation of genes for heterologous expression by recruiting optimal codons of the production host has been a common strategy. However, such strategies have met with varying success. For example, a study of the heterologous expression of 154 variants of GFP differing only in synonymous codon use in E. coli demonstrated that the use of optimal codons was positively correlated with bacterial growth, but not protein yield (Kudla et al., 2009, Science, 324: 255-258).
However, many of the studies focusing on codon optimisation have not addressed a potentially confounding variable, translational initiation. In the aforementioned study, about half of the variation in GFP protein levels was explained by folding energy of the first third of the mRNA, suggesting that whilst the use of optimal codons may have increased the rate of translation, protein yield remained unchanged because the initiation of translation was rate-limiting. Ribosomal density studies indicate that ribosomes are most abundant at the 5' portion of mRNAs and the overall packing density of nearly all mRNAs is below maximum, suggesting that this may be a general feature.
Wang and Roossinck (Wang and Roossinck, 2006, Plant Mol Biol, 61:699-710) determined which codons were most highly-associated with transcripts which accumulate to high levels, by comparing overall codon use to the codon use in highly-transcribed genes in 11 plant species. In doing so the authors demonstrated that codon usage bias is correlated positively with gene transcript levels. As such the authors identified 18 codons which are associated with highly-expressed transcripts across 11 plant species. Interestingly, the authors found that use of their "optimal" codons appears to be well conserved between eudicots and monocots, but less well conserved between the higher plants and Chlamydomonas reinhardtii. However, the authors did not express polynucleotides incorporating such "optimal" codons in host cells and consequently, the effect on heterologous protein expression of altering the codon complement of their encoding polynucleotides in this way remains to be determined.
Alternatives to plant hosts are frequently required for protein production for a variety of reasons. Wang and Roossinck (Wang and Roossinck, 2006, Plant Mol Biol, 61:699-710) assessed the codons which are associated with the most abundant transcripts across 12 plant species. However, this result provides no information on codons which are relevant for optimising heterologous protein expression in other, non-plant host organisms.
SUMMARY OF THE INVENTION
The codon use of a gene of interest is often adapted to reflect the expression host's codon use in highly expressed genes in order to enhance heterologous protein production. However, the results obtained with this strategy are variable. A comparison between the overall codon use and the codon use in highly expressed genes of several plant species revealed that optimal codons are not always the codons of which the use is increased most with expression. Although the codon composition of highly expressed genes differs between monocots and dicots, the same codons often rise in frequency with increasing expression levels (expression codons) and are in many cases C-ending. These conserved expression codons were used to optimise the codon composition of three genes, which enhanced protein yield significantly upon stable and transient expression in plants.
With the above in mind an alternative method of codon optimisation has been devised that led to a significant increase in both mRNA stability and mRNA translatability (i.e. higher mRNA levels and more proteins per mRNA molecule). Unexpectedly, experimental data shown here indicates that this expression-linked codon bias found in plants also extends to other kingdoms of life. On the basis of these experimental data, the present invention provides a series of synonymous codons which are believed to have wide application for the expression of heterologous genes in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells and which have been surprisingly found to correspond with increased functional protein expression therein. Instead of the lengthy and complicated process of trial and error which characterises existing methods of codon optimisation centered on increasing gene expression in specific cellular or environmental contexts, the present invention provides a quick, practical, universal method of increasing functional heterologous protein expression with wide application for the expression of heterologous genes in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Advantageously, this method removes any need for consideration of the host cell or specific cellular context involved. In addition to a series of universally applicable replacement codons for use in commonly used host cells, the present invention also provides specific sets of codon replacements which further improve functional protein expression in particular hosts, specifically prokaryotes, fungi, animals, nematodes, protists and plants.
Accordingly, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
As noted above, Wang and Roossinck (2006) did not actually perform any expression studies to determine the effect of codon optimisation on functional protein expression. In a further aspect the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
Amino acid       Codon(s) in starting sequence     Replacement codon
Asparagine       AAT                               AAC
Aspartic acid    GAT                               GAC
Cysteine         TGT                               TGC
Glutamic acid    GAA                               GAG
Glutamine        CAA                               CAG
Glycine          GGC, GGA or GGG                   GGT
Histidine        CAT                               CAC
Isoleucine       ATT or ATA                        ATC
Leucine          CTT, CTA, CTG, TTA or TTG         CTC
Lysine           AAA                               AAG
Phenylalanine    TTT                               TTC
Proline          CCT, CCA or CCG                   CCC
Serine           TCT, TCA, TCG, AGT or AGC         TCC
Threonine        ACT, ACA or ACG                   ACC
Tyrosine         TAT                               TAC
Valine           GTT, GTA or GTG                   GTC
Stop codons      TAG or TGA                        TAA
inserting the polynucleotide sequence into an expression vector;
introducing said expression vector into a host cell; and
culturing the host cell to produce the heterologous protein; optionally wherein the corresponding codons are changed according to the following table;
; and/or:
; and/or:
; and/or:
On the basis of these expression studies using such codon optimisation according to the invention, it was surprisingly discovered that a number of mRNA structural characteristics were found to be positively correlated with expression levels across kingdoms. In particular, the selection of mRNA structures with the most even distribution of stems and loops is positively correlated with higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Consequently, in an alternative embodiment, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
DETAILED DESCRIPTION
Accordingly, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
In another aspect of the invention, further improvements in heterologous protein expression may be achieved by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table, particularly where the host cell is a prokaryotic cell, a fungal cell or a nematode cell:
In aspects of the invention where the host cell is a prokaryotic cell, for example, an E. coli cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
; and/or:
; and/or:
; and/or:
; and/or:
Amino Acid       DNA Codon                       Replacement Codon
Leucine          TTA, TTG, CTT, CTC or CTA       CTG
and/or:
Amino Acid       DNA Codon                       Replacement Codon
Glycine          GGA or GGG                      GGT or GGC
and/or:
and/or:
In aspects of the invention where the host cell is a fungal cell, for example an S. cerevisiae cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
; and/or:
; and/or:
; and/or:
; and/or:
Amino Acid DNA Codon Replacement Codon
Proline CCT, CCC or CCG CCA
In aspects of the invention where the host cell is a nematode cell, for example, a C. elegans cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
and/or:
; and/or:
and/or:
Amino Acid       DNA Codon                       Replacement Codon
Valine           GTA or GTG                      GTC or GTT
and/or:
Amino Acid       DNA Codon                       Replacement Codon
Glutamic acid    GAA                             GAG
and/or:
and/or:
In aspects of the invention where the host cell is a Mus musculus cell, heterologous protein expression is further improved by supplementing the universal codon changes detailed above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
Amino Acid DNA Codon Replacement Codon
Serine TCT, TCA, AGT, TCG or TCC
AGC and/or:
Amino Acid       DNA Codon                       Replacement Codon
Arginine         AGA or AGG                      CGG, CGT, CGC or CGA
and/or:
Amino Acid       DNA Codon                       Replacement Codon
Alanine          GCC or GCA                      GCG or GCT
; and/or:
; and/or:
In another aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
the host cell being selected from a prokaryotic cell, a fungal cell, a plant cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
In another aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
In aspects of the invention where the host cell is a plant cell, preferably an Arabidopsis thaliana cell, heterologous protein expression is further improved by supplementing the codon changes detailed in the table above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
; and/or:
; and/or:
; and/or:
Glutamic acid GAA GAG
and/or:
Amino Acid DNA Codon Replacement Codon
Phenylalanine TTT TTC
and/or:
and/or:
In another aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
In aspects of the invention where the host cell is a plant cell, preferably an Arabidopsis thaliana cell, heterologous protein expression is further improved by supplementing the codon changes detailed in the table above by modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table(s):
; and/or:
and/or:
and/or:
Amino Acid       DNA Codon                       Replacement Codon
Valine           GTC, GTA or GTG                 GTT
and/or:
Amino Acid       DNA Codon                       Replacement Codon
Isoleucine       ATA                             ATC or ATT
and/or:
and/or:
In another aspect the present invention provides a method of expressing a heterologous protein in a plant cell comprising the steps of;
providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
inserting the polynucleotide sequence into an expression vector;
introducing said expression vector into a host cell; and
culturing the host cell to produce the heterologous protein;
optionally wherein the corresponding codons are changed according to the following table;
; and/or:
; and/or:
; and/or:
; and/or:
In addition to establishing precise codon changes which result in improved functional protein expression, a further novel aspect of the present invention was uncovered by studying the correlation between expression level and mRNA characteristics, including gene length, minimal folding energy, number of bound nucleotides, mean stem and loop sizes (stretches of bound and unbound nucleotides, respectively) and number of stem-loop transitions. This analysis revealed a general trend across kingdoms. Messenger RNAs are folded structures, and translation of a given mRNA into a polypeptide requires unfolding. The necessary helicase activity is typically provided by the ribosome itself. Because this unfolding requires energy, a linear mRNA (i.e. an RNA polymer without secondary structure) would in principle be optimal for maximising protein production. However, a certain degree of folding makes mRNA less susceptible to degradation and increases its diffusibility.
The number of bound nucleotides and the number of stem-loop transitions were found to be positively correlated with expression levels, while loop size was negatively correlated with expression. Combining the gene expression data with available protein abundance data demonstrated that protein:mRNA ratio (proxy for translation efficiency) is positively correlated with the number of stem-loop transitions and negatively correlated with stem and loop size. This general pattern across kingdoms reveals a selection pressure created by gene expression on both mRNA stability and translatability. An increase in the number of nucleotide bonds favours stability, while a more even distribution of these bonds enhances translatability. Altogether, our data indicate that a successful codon optimisation strategy should focus on computational models that calculate the ideal mRNA structure whereby both stability and translatability are enhanced. Here we describe a procedure to select mRNAs with optimal folding characteristics out of a pool consisting of all possible mRNAs encoding a given protein. Remarkably, these are not the most compact mRNAs, nor the ones with the lowest unfolding energy. Here we describe a selection procedure based on a set of criteria for the optimisation of recombinant protein production in a given host that relates to the number and distribution of mRNA stem-loop transitions for any given mRNA.
On the basis of these experimental data, in another aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the relevant table; and modifying substantially all or all of the polynucleotide sequence using replacement codons according to the relevant table(s); the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence and wherein the method further comprises; analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence; and incorporating in said polynucleotide sequence a pattern of optimal and non-optimal codons at a site associated with provision of a structural motif; wherein said pattern enables increased expression efficiency of said protein in said host cell compared with the synonymous coding sequence containing solely optimal codons, wherein optimal codons are those codons pre-calculated to provide the highest functional expression of heterologous protein in the host cell or the sole possible codon. As such the method may comprise merely making the universal codon changes, and/or making modifications according to the replacement codon tables which are specific for particular host cells. 
In preferred embodiments of the invention, analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to; examining and taking account of the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size. In preferred embodiments uneven stem loop distributions will be discarded and the polynucleotide sequence codon composition will be altered (i.e. non-optimally) based on the observation of mRNA secondary structure to improve translational efficiency and therefore functional protein expression.
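The structural measures listed above can be computed from a predicted secondary structure. The following Python sketch assumes the structure is supplied in dot-bracket notation (as produced by common RNA folding programs), treats a stem as an uninterrupted run of bound nucleotides and a loop as an uninterrupted run of unbound nucleotides, and counts a stem-loop transition at every boundary between two such runs. The function name, the dictionary keys and the use of the population standard deviation are assumptions made for illustration; the specification does not prescribe them.

```python
from itertools import groupby
from statistics import mean, pstdev


def structure_metrics(dot_bracket: str) -> dict:
    """Summarise runs of bound and unbound nucleotides in a dot-bracket string."""
    if not dot_bracket:
        raise ValueError("empty structure")

    # Consecutive characters of the same kind form one run;
    # '(' and ')' are bound nucleotides, '.' is unbound.
    runs = [
        (is_bound, sum(1 for _ in group))
        for is_bound, group in groupby(dot_bracket, key=lambda ch: ch in "()")
    ]
    stems = [size for is_bound, size in runs if is_bound]
    loops = [size for is_bound, size in runs if not is_bound]

    def summary(sizes):
        # Mean, population standard deviation and maximum of the run sizes.
        if not sizes:
            return 0.0, 0.0, 0
        return mean(sizes), pstdev(sizes), max(sizes)

    mean_stem, sd_stem, max_stem = summary(stems)
    mean_loop, sd_loop, max_loop = summary(loops)
    transitions = len(runs) - 1  # boundaries between adjacent runs

    return {
        "transitions": transitions,
        "transitions_per_kb": 1000.0 * transitions / len(dot_bracket),
        "mean_stem": mean_stem, "sd_stem": sd_stem, "max_stem": max_stem,
        "mean_loop": mean_loop, "sd_loop": sd_loop, "max_loop": max_loop,
    }


# Hypothetical 13-nt structure: stem(4) loop(3) stem(4) loop(2)
metrics = structure_metrics("((((...))))..")
print(metrics["transitions"], metrics["mean_stem"], metrics["mean_loop"])
```

Under these definitions, a small standard deviation of the stem sizes or loop sizes corresponds to the even distribution of stems and loops referred to in the text.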
A novel aspect of the invention is the selection of mRNA structures with the most even distribution of stems and loops that leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Consequently, in a further aspect, the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide.
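By way of illustration only, the library and selection steps of this aspect may be sketched in Python as follows. The sketch builds every synonymous coding sequence for a short peptide using the standard genetic code and then keeps those variants whose stem-loop transition density lies in the stated range. The names `synonymous_variants`, `translate` and `select_in_range` are hypothetical, and the `transitions_per_kb` argument is a caller-supplied function (for example, one that wraps an RNA folding program); no particular folding software is implied.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): AMINO[i] for i, codon in enumerate(product(BASES, repeat=3))
}
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def translate(cds: str) -> str:
    """Translate a coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def synonymous_variants(protein: str):
    """Yield every coding sequence that encodes the given protein."""
    choices = [AA_TO_CODONS[aa] for aa in protein]
    for combination in product(*choices):
        yield "".join(combination)


def select_in_range(variants, transitions_per_kb, low=110.0, high=250.0):
    """Keep variants with at least `low` and fewer than `high` transitions per kb.

    `transitions_per_kb` is a caller-supplied callable (an assumption of this
    sketch) that folds a sequence and returns its transition density.
    """
    return [v for v in variants if low <= transitions_per_kb(v) < high]


# Hypothetical three-residue example: Met-Lys-Phe has 1 x 2 x 2 = 4 variants.
library = list(synonymous_variants("MKF"))
print(len(library), library)
```

The number of variants is the product of the number of synonymous codons at each position, so the pool grows very quickly with protein length.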
Normally, the first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (>100,000 mRNA variants). Typically, all mRNA species in the pool are then folded in silico. The term "in silico" is widely used in the art and will be understood by the average skilled person as meaning performed on a computer or via computer simulation. The RNA structure is predicted in silico using standard techniques and usually under the temperature and salt concentrations relevant for the preferred host. Appropriate software packages or applications incorporating suitable algorithms may be selected for performing the folded mRNA structure prediction. Suitable packages include, but are not limited to, an RNA structure prediction program such as Vienna RNAfold 2.0 (Lorenz et al., 2011, ViennaRNA Package 2.0, Algorithms for Molecular Biology, 6:26). Preferably, the mRNA structure prediction will be carried out using such a prediction program with the standard settings and the folding parameters established by, for example, Andronescu et al. (Andronescu et al., 2007, Bioinformatics, 23(13), i19-i28), preferably adjusting the folding temperature to the intracellular temperature of the host of interest. More preferably, the temperature and salt concentration parameters will be adjusted to match those of the preferred host. Finally, the mRNAs from the library of synonymous variants that have the most even distribution of stems and loops are selected. The mRNAs having the most even distribution of stems and loops may be identified by the structural characteristics outlined below. In particular, the standard deviation of the stem sizes and of the loop sizes is used as a measure of how evenly these sizes are distributed, an even distribution being preferred. Typically, the more similar the stem sizes of an mRNA, the higher the translation efficiency. Additionally, the more similar the loop sizes of an mRNA, the higher the translation efficiency.
Where there were several appropriate codons according to the foregoing criteria, previously published data were consulted to make a final selection. Parameters which may be influential include, for example, the folding energy of the 5' terminus and the selection of codons that are frequently used and match the most abundant tRNAs. Codons giving the lowest folding energy of the 5' terminus, and codons that are frequently used and match the most abundant tRNAs, were preferred. Methods for determining the folding energy of mRNA may be based on, but are not limited to, those described by Tuller et al. (Tuller et al., 2009, PNAS 107:3645-3650) and Kudla et al. (Kudla et al., 2009, Science, 324:255-258). For example, the mRNA molecule from position -23 to +39 should have an average folding energy of at least -6 kcal/mol for E. coli and of at least -4 kcal/mol for S. cerevisiae, as determined by the use of sliding windows of 40 nt with 1 nt steps. The choice of codons within the first 13 nt that provides a low folding energy will depend on the 5' UTR provided by the expression cassette (Kudla et al., 2009, Science, 324:255-258; Tuller et al., 2009, PNAS 107:3645-3650). Alternatively, instead of adapting the first 13 nt, the 5' UTR may be adapted to provide a low folding energy. For example, the 5' UTR used in the present examples is very U-rich (GTTTTTATTTTTAATTTTCTTTCAAATACTTCCACC [SEQ ID NO: 1]), which in most cases provided a relatively high (close to 0) folding energy when using primarily C-ending codons. When using the chitinase SP, this was always the case.
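The sliding-window averaging referred to above (windows of 40 nt moved in 1 nt steps) may be sketched as follows. This is an illustration only: `energy_fn` is a hypothetical caller-supplied function that returns the folding energy of a window in kcal/mol (for example, a wrapper around an RNA folding program), and the exact number of nucleotides spanned by the -23 to +39 region depends on a numbering convention that the text does not spell out.

```python
from statistics import mean


def windows(seq: str, size: int = 40, step: int = 1):
    """Yield successive windows of `size` nucleotides, moved by `step`."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def mean_window_energy(seq: str, energy_fn, size: int = 40, step: int = 1) -> float:
    """Average the folding energy over all sliding windows of the region.

    `energy_fn` is an assumption of this sketch: any callable that maps a
    window sequence to a folding energy in kcal/mol.
    """
    energies = [energy_fn(window) for window in windows(seq, size, step)]
    if not energies:
        raise ValueError("region is shorter than the window size")
    return mean(energies)


def meets_threshold(seq: str, energy_fn, threshold: float) -> bool:
    """True if the average window energy is at least `threshold` (e.g. -6.0)."""
    return mean_window_energy(seq, energy_fn) >= threshold
```

A region of 62 nt, for instance, yields 23 windows of 40 nt; the thresholds of -6 kcal/mol for E. coli and -4 kcal/mol for S. cerevisiae stated above would be passed as `threshold`.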
In preferred embodiments of the invention, analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence typically will include, but is not limited to, examining and taking account of the mean number of stem-loop transitions, mean stem size, mean loop size, standard deviation of the stem size or the loop size (which acts as a proxy measure for even distribution of stem-loops), maximum loop size and/or maximum stem size. In preferred embodiments, the polynucleotide sequence codon composition will be altered (i.e. non-optimally) to avoid uneven stem loop distributions to improve translational efficiency and therefore functional protein expression. Such alterations may include incorporating one or more codons listed as second preference or third preference replacement codons in place of the first preference codon where the secondary structure criteria are not fulfilled by inclusion of the first preference codon. Alternatively, for a given position, such alterations may include retention of the wild-type (WT) or native codon where inclusion of an optimal codon negatively impacts the secondary structure with respect to the particular criteria for each host cell. Preferably, the polynucleotide will have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp). More preferably, the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp.
Preferably, the polynucleotide will have a maximum stem size of less than 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 14 bp to 15 bp. More preferably, the polynucleotide will have a maximum loop size of less than 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp.
Additionally, in embodiments wherein the host cell is a prokaryotic cell, preferably a bacterial cell and more preferably an E. coli cell, the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.45 bp and 2.50 bp, optionally in the range 5.45 to 4.00 bp, 5.40 bp to 2.60 bp, 5.30 bp to 2.70 bp, 5.20 bp to 2.80 bp, 5.10 bp to 2.90 bp, 5.00 bp to 3.00 bp, 4.90 to 3.10 bp, 4.80 to 3.20 bp, 4.70 to 3.30 bp, 4.60 to 3.40 bp, 4.50 to 3.50 bp, 4.40 to 3.60 bp, 4.30 to 3.70 bp, 4.20 to 3.80 bp or 4.10 to 3.90 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.16 bp and 2.00 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 2.95 and 2 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp, 2.60 bp to 2.40 bp or 2.50 bp to 2.40 bp. Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.50, preferably between 3.50 and 2.00 bp, optionally in the range 3.40 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp or 2.80 bp to 2.70 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 16 bp, optionally in the range 10 bp to 16 bp, 11 bp to 15 bp or 12 bp to 14 bp. In the most preferred embodiment where the host cell is a plant cell, the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.
Alternatively, in embodiments wherein the host cell is a eukaryotic cell, preferably a plant cell and more preferably an Arabidopsis thaliana cell, the selected polynucleotide will preferably have at least 116 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 116 to 200/kbp, 117 to 249/kbp, 118 to 248/kbp, 119 to 247/kbp, 120 to 245/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size in the range 5.20 to 2.50 bp, optionally in the range 5.20 bp to 4.00 bp, 5.20 to 2.60 bp, 5.10 bp to 2.70 bp, 5.00 bp to 2.80 bp, 4.90 bp to 2.90 bp, 4.80 bp to 3.00 bp, 4.70 to 3.10 bp, 4.60 to 3.20 bp, 4.50 to 3.30 bp, 4.40 to 3.40 bp, 4.30 to 3.50 bp, 4.20 to 3.60 bp, 4.10 to 3.70 bp or 4.00 to 3.80 bp. Preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.32 bp and 3.00 bp, optionally in the range 3.30 bp to 3.00 bp, 3.25 bp to 3.05 bp, 3.20 bp to 3.10 bp or 3.15 bp to 3.10 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.20 and 2 bp, optionally in the range 3.10 bp to 2.10 bp, 3.00 bp to 2.20 bp, 2.90 bp to 2.30 bp, 2.80 bp to 2.40 bp, 2.70 bp to 2.50 bp or 2.60 bp to 2.40 bp. Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.40, preferably between 3.40 and 2.00 bp, optionally in the range 3.30 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp, 2.80 bp to 2.40 bp or 2.60 bp to 2.50 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp or 13 bp to 15 bp. In the most preferred embodiment where the host cell is a plant cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
Alternatively, in embodiments wherein the host cell is a fungal cell, preferably a Saccharomyces cell, optionally a Saccharomyces cerevisiae cell, the selected polynucleotide will preferably have at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp). Preferably, the polynucleotide will have stem loop transitions in the range 110 to 250/kbp, optionally in the range 110 to 200/kbp, 111 to 249/kbp, 112 to 248/kbp, 113 to 247/kbp, 114 to 246/kbp, 115 to 245/kbp, 116 to 244/kbp, 117 to 243/kbp, 118 to 242/kbp, 119 to 241/kbp, 120 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.27 bp and 2.50 bp, optionally in the range 5.27 bp to 4.00 bp, 5.20 to 2.40 bp, 5.10 bp to 2.50 bp, 5.00 to 2.60 bp, 4.90 bp to 2.70 bp, 4.80 bp to 2.80 bp, 4.70 bp to 2.90 bp, 4.60 bp to 3.00 bp, 4.50 to 3.10 bp, 4.40 to 3.20 bp, 4.30 to 3.30 bp, 4.20 to 3.40 bp, 4.10 to 3.50 bp, 4.00 to 3.60 bp or 3.90 to 3.70 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.77 bp and 3.00 bp, optionally in the range 3.75 bp to 3.00 bp, 3.70 bp to 3.10 bp, 3.60 bp to 3.20 bp or 3.50 bp to 3.30 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.65 and 2.00 bp, optionally in the range 3.60 bp to 2.10 bp, 3.50 bp to 2.20 bp, 3.40 bp to 2.30 bp, 3.30 bp to 2.40 bp, 3.30 bp to 2.50 bp, 3.20 bp to 2.60 bp, 3.10 bp to 2.70 bp or 3.00 bp to 2.80 bp.
Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.25, preferably between 3.25 and 2.00 bp, optionally in the range 3.20 bp to 2.10 bp, 3.10 bp to 2.20 bp, 3.00 bp to 2.30 bp, 2.90 bp to 2.40 bp, 2.80 bp to 2.50 bp or 2.70 bp to 2.60 bp. Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp. In the most preferred embodiment where the host cell is a fungal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
Alternatively, in embodiments wherein the host cell is an animal cell, preferably a nematode cell, optionally a Caenorhabditis elegans cell, the selected polynucleotide will preferably have at least 114 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 114 to 200/kbp, 115 to 249/kbp, 116 to 248/kbp, 117 to 247/kbp, 118 to 246/kbp, 119 to 245/kbp, 120 to 244/kbp, 121 to 243/kbp, 122 to 242/kbp, 123 to 241/kbp, 124 to 240/kbp, 125 to 235/kbp, 130 to 230/kbp, 135 to 225/kbp, 140 to 220/kbp, 145 to 215/kbp, 150 to 210/kbp, 155 to 205/kbp, 160 to 200/kbp, 165 to 195/kbp, 170 to 190/kbp or 175 to 185/kbp. More preferably, the selected polynucleotide will have a mean stem size between 5.35 and 2.50 bp, optionally in the range 5.35 bp to 4.00 bp, 5.30 to 2.40 bp, 5.20 bp to 2.50 bp, 5.10 to 2.60 bp, 5.00 bp to 2.70 bp, 4.90 bp to 2.80 bp, 4.80 bp to 2.90 bp, 4.70 bp to 3.00 bp, 4.60 to 3.10 bp, 4.50 to 3.20 bp, 4.40 to 3.30 bp, 4.30 to 3.40 bp, 4.20 to 3.50 bp, 4.10 to 3.60 bp, 4.00 to 3.70 bp or 3.90 to 3.80 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 3.47 bp and 3.00 bp, optionally in the range 3.45 bp to 3.00 bp, 3.40 bp to 3.10 bp or 3.30 bp to 3.20 bp. More preferably, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.37 and 2.00 bp, optionally in the range 3.35 bp to 2.10 bp, 3.30 bp to 2.20 bp, 3.20 bp to 2.30 bp, 3.10 bp to 2.40 bp, 3.00 bp to 2.50 bp, 2.90 bp to 2.60 bp, or 2.80 bp to 2.70 bp. Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.27, preferably between 3.27 and 2.00 bp, optionally in the range 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp.
Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 20 bp, optionally in the range 10 bp to 20 bp, 11 bp to 19 bp, 12 bp to 18 bp, 13 bp to 17 bp or 14 bp to 16 bp. In the most preferred embodiment where the host cell is a fungal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp, 13 bp to 15 bp or 12 bp to 14 bp.
Alternatively, in embodiments wherein the host cell is an animal cell, preferably a mammalian cell, optionally a Mus musculus cell, the selected polynucleotide will preferably have at least 120 and fewer than 250 stem loop transitions per kilobase pair (kbp), optionally in the range 120 to 200/kbp, 121 to 249/kbp, 122 to 248/kbp, 123 to 247/kbp, 124 to 246/kbp, 125 to 245/kbp, 130 to 240/kbp, 135 to 235/kbp, 140 to 230/kbp, 145 to 225/kbp, 150 to 220/kbp, 155 to 215/kbp, 160 to 210/kbp, 165 to 205/kbp, 170 to 200/kbp, 175 to 195/kbp or 180 to 190/kbp. More preferably, the selected polynucleotide will have a mean stem size between 4.35 and 2.50 bp, optionally in the range 4.35 to 4.00 bp, 4.30 to 2.40 bp, 4.20 bp to 2.50 bp, 4.10 to 2.60 bp, 4.00 bp to 2.70 bp, 3.90 bp to 2.80 bp, 3.80 bp to 2.90 bp, 3.70 bp to 3.00 bp, 3.60 to 3.10 bp, 3.50 to 3.20 bp or 3.40 to 3.30 bp. More preferably, the method further comprises selecting a polynucleotide having a mean loop size between 5.18 bp and 4.00 bp, optionally in the range 5.15 bp to 4.00 bp, 5.10 bp to 4.10 bp, 5.00 bp to 4.20 bp, 4.90 bp to 4.30 bp, 4.80 bp to 4.40 bp or 4.70 bp to 4.50 bp. More preferably still, the method further comprises selecting a polynucleotide having a loop size standard deviation of between 3.00 and 2.00 bp, optionally in the range 2.90 bp to 2.10 bp, 2.80 bp to 2.20 bp, 2.70 bp to 2.30 bp or 2.60 bp to 2.40 bp. Still more preferably, the method further comprises selecting a polynucleotide having a stem size standard deviation below 3.28, preferably between 3.28 and 2.00 bp, optionally in the range 3.27 bp to 2.00 bp, 3.25 bp to 2.10 bp, 3.20 bp to 2.20 bp, 3.10 bp to 2.30 bp, 3.00 bp to 2.40 bp, 2.90 bp to 2.50 bp or 2.80 bp to 2.60 bp.
Even more preferably the method further comprises selecting a polynucleotide having a maximum loop size below 18 bp, optionally in the range 10 bp to 18 bp, 11 bp to 17 bp, 12 bp to 16 bp or 13 bp to 15 bp. In the most preferred embodiment where the host cell is an animal cell, the method further comprises selecting a polynucleotide having a maximum stem size below 19 bp, optionally in the range 10 bp to 19 bp, 11 bp to 18 bp, 12 bp to 17 bp, 13 bp to 16 bp or 12 bp to 15 bp.
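For convenience, the broadest host-specific limits stated in the preceding paragraphs can be collected into a lookup table and checked programmatically, as in the following Python sketch. Several points are assumptions of the sketch rather than statements of the specification: the dictionary keys and function name are hypothetical; "at least X and fewer than 250" is treated as inclusive of X and exclusive of 250; "between" ranges are treated as inclusive; and each maximum stem size is taken from the host paragraph in which it appears, even where the sentence itself names a different host.

```python
# Broadest limits transcribed from the host-specific paragraphs above.
# (low, high) tuples are ranges; single numbers are strict upper bounds.
HOST_CRITERIA = {
    "E. coli": dict(
        transitions=(116, 250), mean_stem=(2.50, 5.45), mean_loop=(2.00, 3.16),
        sd_loop=(2.00, 2.95), sd_stem=3.50, max_loop=16, max_stem=18),
    "A. thaliana": dict(
        transitions=(116, 250), mean_stem=(2.50, 5.20), mean_loop=(3.00, 3.32),
        sd_loop=(2.00, 3.20), sd_stem=3.40, max_loop=18, max_stem=19),
    "S. cerevisiae": dict(
        transitions=(110, 250), mean_stem=(2.50, 5.27), mean_loop=(3.00, 3.77),
        sd_loop=(2.00, 3.65), sd_stem=3.25, max_loop=20, max_stem=19),
    "C. elegans": dict(
        transitions=(114, 250), mean_stem=(2.50, 5.35), mean_loop=(3.00, 3.47),
        sd_loop=(2.00, 3.37), sd_stem=3.27, max_loop=20, max_stem=18),
    "M. musculus": dict(
        transitions=(120, 250), mean_stem=(2.50, 4.35), mean_loop=(4.00, 5.18),
        sd_loop=(2.00, 3.00), sd_stem=3.28, max_loop=18, max_stem=19),
}


def failed_criteria(metrics: dict, host: str) -> list:
    """Return the names of the criteria that `metrics` does not satisfy."""
    limits = HOST_CRITERIA[host]
    failed = []

    # Stem-loop transitions per kbp: at least `low`, fewer than `high`.
    low, high = limits["transitions"]
    if not (low <= metrics["transitions"] < high):
        failed.append("transitions")

    # Ranges treated as inclusive at both ends (assumption).
    for name in ("mean_stem", "mean_loop", "sd_loop"):
        low, high = limits[name]
        if not (low <= metrics[name] <= high):
            failed.append(name)

    # Strict upper bounds ("below").
    for name in ("sd_stem", "max_loop", "max_stem"):
        if not (metrics[name] < limits[name]):
            failed.append(name)

    return failed


# Hypothetical measurements for one candidate mRNA.
candidate = dict(transitions=180, mean_stem=4.0, mean_loop=3.1,
                 sd_loop=2.5, sd_stem=3.0, max_loop=12, max_stem=14)
for host in HOST_CRITERIA:
    print(host, failed_criteria(candidate, host))
```

An empty list indicates that the candidate satisfies all of the broadest limits for that host; the narrower optional ranges recited above are not encoded in this sketch.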
In a final aspect the present invention provides a method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of; providing a library of polynucleotides each of which vary at a minimum of a single codon position; analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and synthesising said polynucleotide, wherein the method further comprises selecting a polynucleotide from a library of synonymous variants wherein the codon usage of the selected polynucleotide most closely matches the most abundant tRNAs in a particular host cell. It will be appreciated that this final step may be undertaken.
Polynucleotides
In the context of methods of the invention, polynucleotides encoding heterologous proteins of interest (POI) may be isolated nucleic acid molecules and may be a DNA molecule, a cDNA molecule, an RNA molecule or synthetically produced DNA or RNA or a chimeric nucleic acid molecule. In embodiments where the polynucleotide is an RNA, it will be understood that normally uracil (U) is to be used in place of thymine (T). Throughout, the term "polynucleotide" as used herein refers to a deoxyribonucleotide or ribonucleotide polymer in single- or double-stranded form, or sense or anti-sense, and encompasses analogues of naturally occurring nucleotides that hybridize to nucleic acids in a manner similar to naturally occurring nucleotides. Such polynucleotides may be derived from any organism, including the host organism, or may be synthesised de novo.
Prior to modification in accordance with the methods of the invention, a polynucleotide coding sequence may be provided for the protein of interest (POI) having the wild-type (WT) sequence or alternatively having a 'pre-optimised' sequence; that is to say the sequence incorporates at one or more positions for which synonymous codons are available a codon which is associated with the most abundant tRNA for that particular amino acid. In certain embodiments, it may be that codons corresponding to the most abundant tRNA for particular amino acids are used at each position for which synonymous codons are available. Preferably,
however, the starting polynucleotide sequence is the WT sequence encoding the POI. In the context of methods of the invention, it will be appreciated that the POI may be a native protein of a host cell in which expression of the native protein has been silenced, for example, the polynucleotide sequence encoding that protein has been disrupted, deleted or mutated. In these circumstances, the POI will be considered as a heterologous protein in the context of the mutated host cell.
The provision of a polynucleotide having a coding sequence may comprise synthesis of a polynucleotide comprising the coding sequence. This may be for example by modification of a pre-existing sequence, e.g. by site-directed mutagenesis or possibly by de novo synthesis.
Polynucleotide Sequence Modification
In all embodiments of the invention, polynucleotide sequences encoding the protein of interest may be prepared by any suitable method known to those of ordinary skill in the art, including but not limited to, for example, direct chemical synthesis or cloning. Whether the starting polynucleotide is a WT sequence or a pre-optimised sequence where the codons match the most abundant tRNAs for a particular host cell, the starting polynucleotide sequence may be reviewed and modified by incorporating the relevant replacement codons in silico. The modified polynucleotide may subsequently be synthesised, for example by direct chemical synthesis, for introduction into a desired host cell. Alternatively, the starting polynucleotide sequence may be provided and subsequently modified ex vivo or alternatively in vivo for example by site directed mutagenesis or gene editing techniques.
In some embodiments of the invention, all of the polynucleotide sequence is modified according to the relevant table; that is to say 100% of the length of the coding sequence of the polynucleotide encoding the protein of interest (POI). In such embodiments, each occurrence of a particular 'non-optimal' codon in the starting polynucleotide sequence for which a synonymous codon exists will be replaced with the corresponding replacement codon indicated in the relevant table. For a particular codon, this involves modifying every occurrence of that codon within the polynucleotide sequence. Preferably, where two or more codons are indicated as
replacement codons, each codon will be modified using the synonymous replacement codon appearing first in the table.
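The replacement of every occurrence of a non-optimal codon by the first-listed replacement codon can be illustrated with a short sketch. The table below is hypothetical and serves only to show the mechanics; it is not one of the tables of the invention, and the function names are chosen for this example.

```python
def split_codons(cds):
    """Split a coding sequence into consecutive codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


# Hypothetical table: non-optimal codon -> synonymous replacement codons,
# listed in order of preference (first entry = first replacement codon).
EXAMPLE_TABLE = {
    "AGA": ["CGT"],         # Arg
    "GGA": ["GGT", "GGC"],  # Gly
    "AAA": ["AAG"],         # Lys
}


def apply_replacements(cds, table):
    """Replace every occurrence of each listed codon with its first-listed replacement.

    Returns the recoded sequence and the number of codons that were changed.
    """
    codons = split_codons(cds.upper())
    recoded = [table[c][0] if c in table else c for c in codons]
    changed = sum(1 for old, new in zip(codons, recoded) if old != new)
    return "".join(recoded), changed
```

Because only synonymous codons are exchanged, the encoded amino acid sequence is unchanged; the returned count gives the number of codon positions that were modified.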
Alternatively, in certain situations it may be desirable to limit application of the method to specific regions of the polynucleotide sequence or to omit certain regions from application of the method, for instance to avoid disruption of secondary structural motifs or regulatory elements in the polynucleotide sequence. According to preferred embodiments of the invention, appropriate replacement codons may be applied to substantially all of the nucleotides in a polynucleotide sequence. Preferably, at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.5% or 100% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table. In preferred embodiments, more than 90% of the polynucleotide sequence is modified by incorporation of replacement codons according to the relevant table. More preferably still, more than 95% of the polynucleotide sequence is modified. Ideally, 100% of the polynucleotide sequence is modified, that is, each occurrence of a particular codon is replaced with the corresponding replacement codon indicated in the relevant table.
Expression Vectors
After modification of the codon composition of the polynucleotide sequence encoding the protein of interest, subsequent expression of the polynucleotide sequence in the chosen host cell may be carried out. In order that expression can be carried out in the host cell of choice, the sequence will preferably be provided in an expression construct, e.g. an expression vector. In some embodiments, the polynucleotide may be provided in an expression vector. Suitable expression vectors will vary according to the recipient host cell and suitably may incorporate regulatory elements which allow expression in the host cell of interest and preferably which facilitate high-levels of expression. Such regulatory sequences may be capable of influencing transcription or translation of a gene or gene product, for example in terms of initiation, accuracy, rate, stability, downstream processing and mobility.
Such elements may include, for example, strong and/or constitutive promoters, 5' and 3' UTRs, transcriptional and/or translational enhancers, transcription factor or
protein binding sequences, start sites and termination sequences, ribosome binding sites, recombination sites, polyadenylation sequences, sense or antisense sequences, sequences ensuring correct initiation of transcription and optionally poly-A signals ensuring termination of transcription and transcript stabilisation in the host cell. The regulatory sequences may be plant-, animal-, bacteria-, fungal- or virus-derived, and preferably may be derived from the same organism as the host cell. Clearly, appropriate regulatory elements may vary according to the host cell of interest. For example, regulatory elements which facilitate high-level expression in prokaryotic host cells such as in E. coli may include the pLac, T7, P(Bla), P(Cat), P(Kat), trp or tac promoters. Regulatory elements which facilitate high-level expression in eukaryotic host cells might include the AOX1 or GAL1 promoter in yeast or the CMV- or SV40-promoters, CMV-enhancer, SV40-enhancer, Herpes simplex virus VP16 transcriptional activator or inclusion of a globin intron in animal cells. In plants, constitutive high-level expression may be obtained using, for example, the Zea mays ubiquitin 1 promoter or 35S and 19S promoters of cauliflower mosaic virus.
Suitable regulatory elements may be constitutive, whereby they direct expression under most environmental conditions or developmental stages, developmental stage specific or inducible. Preferably, the promoter is inducible, to direct expression in response to environmental, chemical or developmental cues, such as temperature, light, chemicals, drought, and other stimuli. Suitably, promoters may be chosen which permit expression of the protein of interest at particular developmental stages or in response to extra- or intra-cellular conditions, signals or externally applied stimuli. For example, a range of promoters exist for use in E. coli which give high-level expression at particular stages of growth (e.g. osmY stationary phase promoter) or in response to particular stimuli (e.g. HtpG Heat Shock Promoter).
Suitable expression vectors may comprise additional sequences encoding selectable markers which allow for the selection of said vector in a suitable host cell and/or under particular conditions. Suitable expression vectors may also comprise additional sequences which enable visualisation or quantification of the expressed protein (e.g. 3' GFP or Luciferase fusion tags) in the host cell of interest. Preferred expression vectors are those which also enable the expressed protein to be easily
separated from other cellular proteins for downstream applications. For example, the expression vector may incorporate a fusion tag domain, which when fused to the coding sequence of the protein of interest allows the expressed protein to be bound to a matrix, column or beads (e.g. glutathione-S-transferase (GST)).
Furthermore, the expression vector comprising the heterologous polynucleotide sequence may optionally comprise polynucleotide sequences coding for one or more transit peptides, capable of localising the expressed protein to a particular cellular compartment in the host cell. Advantageously, such domains may cause secretion of expressed protein, for example into the extracellular medium to enable the protein to be easily recovered from the cell culture medium. In plant hosts suitable transit peptides may cause the protein to localise to, for example, the cell wall, nucleus or chloroplasts. The methods of the present invention will be useful in the production of a large number of different proteins in the agricultural, chemical, industrial and pharmaceutical fields, particularly for example antibodies, vaccines, hormones and other protein therapeutics. Advantageously, according to all aspects of the present invention, levels of heterologous protein are increased relative to the respective native (i.e. unoptimised) protein by modification of the codon usage of the polynucleotide sequence which encodes the protein of interest. Preferably, the levels of heterologous protein may increase in the range 5% to 500% relative to native (unoptimised) protein; optionally in the range 10% to 250%, 20% to 200%, 25% to 100%, 30% to 75% or 35% to 65%.
Once expressed, proteins of interest may preferably be recovered from the cell culture medium as secreted proteins, although they may also be recovered from host cell lysates.
Host cells
The utility of the present invention resides in the universal applicability of the optimal replacement codons to any polynucleotide having a coding sequence and having one or more of the codons listed in the relevant table for expression in commonly
used host cells, for example prokaryotic cells, fungal cells, plant cells or animal cells. Methods of the invention can be applied to any type of host cell which is genetically accessible and which can be cultured. In other words, the approach may be applied to those cells which are able to serve as a host for production of the protein of interest (POI). It may therefore be applied to commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells commonly employed for recombinant heterologous protein expression. Preferably, host cells will be selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell. Typically, the host cell may be an Escherichia coli cell. Typically, the host cell may be a Saccharomyces cerevisiae cell. Typically, the host cell may be a Caenorhabditis elegans cell. Typically, the host cell may be a Mus musculus cell.
In embodiments of the invention where the host cell is a prokaryotic cell, the host cell may be a bacterial cell or alternatively the host cell may be an archaeal cell. Host cells may be gram-negative bacterial cells. Host cells may be gram-positive bacterial cells. Typically, host cells may include but are not limited to; an Aliivibrio fischeri cell, a Bacillus subtilis cell, a Caulobacter crescentus cell, an Escherichia coli cell, a Mycoplasma genitalium cell, a Synechocystis cell, a Pseudomonas fluorescens cell. In preferred embodiments the host cell is a bacterial cell. Preferably the host cell is an Escherichia coli (E. coli) cell. In particularly preferred embodiments where the host cell is a prokaryotic cell, it is envisaged that the highest functional protein expression will be achieved by modification of each codon in the polynucleotide sequence for which a synonymous codon exists according to the relevant tables above. Preferably, where there is choice of codons indicated for a selected position based on the expression data, preference may be given to the first replacement codon appearing in the relevant table. Alternatively, preference may be given to the second replacement codon appearing in the relevant table. Alternatively, in situations where the second or third preference codon is already present in the starting sequence, it may be decided to retain the codon in the starting sequence, i.e. the wild type codon in embodiments where the starting sequence is the wild-type sequence. This will minimise the number of codon changes to convert the starting sequence in a polynucleotide to the selected synonymous coding sequence for improved functional protein expression.
In embodiments of the invention where the host cell is a protist cell, host cells may include but are not limited to; a Chlamydomonas reinhardtii cell, a Dictyostelium discoideum cell, a Tetrahymena thermophila cell, an Emiliania huxleyi cell or a Thalassiosira pseudonana cell. In preferred embodiments the host cell is a Chlamydomonas cell. Preferably, the host cell is a Chlamydomonas reinhardtii cell.
In embodiments of the invention where the host cell is a fungal cell, the host cell may include but is not limited to; fungal cells and yeast cells. In particular, the host cell may be a Saccharomyces cerevisiae cell, an Ashbya gossypii cell, an Aspergillus fumigatus cell, an Aspergillus nidulans cell, a Candida albicans cell, a Coprinus cinereus cell, a Cunninghamella elegans cell, a Cryptococcus neoformans cell, a Fusarium oxysporum cell, a Magnaporthe oryzae cell, a Neurospora crassa cell, a Schizophyllum commune cell, a Schizosaccharomyces pombe cell, an Ustilago maydis cell or a Zymoseptoria tritici cell. Preferably the host cell is a Saccharomyces cerevisiae cell or a Schizosaccharomyces pombe cell. More preferably the host cell is a Saccharomyces cerevisiae cell.
According to aspects of the present invention where the host cell is a plant cell, any cell type of any plant species, including both monocots and dicots, may be used as a host system for expression of a heterologous protein. Preferred plant cells for use in the present invention are genetically tractable, and are commonly derived from crop species, species which typically exhibit high growth rates or are easily harvested, or species which have established genetic resources associated with them. Commonly, in some preferred embodiments of the invention, the host cell is an Arabidopsis cell, preferably an Arabidopsis thaliana cell. In other preferred embodiments of the invention the host cell may be a Nicotiana cell, preferably a Nicotiana tabacum cell. Alternatively, depending on the application chosen said plant may suitably be selected from the following: maize (Zea mays), canola (Brassica napus, Brassica rapa ssp.), sugar beet (Beta vulgaris), oat (Avena sp.), barley (Hordeum vulgare), flax (Linum usitatissimum), alfalfa (Medicago sativa), rice (Oryza sativa), rye (Secale cereale), sorghum (Sorghum bicolor, Sorghum vulgare), switchgrass (Panicum virgatum), prairie cordgrass (Spartina sp.), purple false brome (Brachypodium distachyon), sunflower (Helianthus annuus), wheat (Triticum aestivum), soybean (Glycine max), potato (Solanum tuberosum), cotton (Gossypium hirsutum), sweet potato (Ipomoea batatas), cassava (Manihot esculenta), foxtail (Setaria sp.), Miscanthus sp., peanuts (Arachis hypogaea), coffee (Coffea spp.), coconut (Cocos nucifera), pineapple (Ananas comosus), citrus tree (Citrus spp.), cocoa (Theobroma cacao), tea (Camellia sinensis), banana (Musa spp.), avocado (Persea americana), fig (Ficus carica), guava (Psidium guajava), mango (Mangifera indica), olive (Olea europaea), papaya (Carica papaya), cashew (Anacardium occidentale), macadamia (Macadamia integrifolia), almond (Prunus amygdalus), Chlorella, Volvox, Guillardia theta, Bigelowiella natans or Physcomitrella patens.
Transformation of the host cell with a heterologous gene sequence
Expression constructs comprising the modified polynucleotide sequence may be located in plasmids (expression vectors) which are used to transform the host cell. Specific, but non-limiting methods of transformation may include heat shock, electroporation, particle bombardment, chemical induction, microinjection and viral transformation.
Heterologous protein expression analysis
Subsequently, in preferred embodiments of the present invention the expression levels of the protein of interest in host cells of interest may be determined. Preferably the method chosen allows for quantitative assessment of the level of functional expression. In some instances, functional expression may be directly determined, e.g. as with GFP, luciferase or by enzymatic action of the protein of interest (POI) to generate a detectable optical signal, such as fluorescence or luminescence or a colour change caused by the protein. However, in some circumstances it may be chosen to determine physical expression, for instance by antibody probing, and rely on a separate test to verify that physical expression is accompanied by the required function. In preferred embodiments of the invention, the POI will be detectable by a high-throughput screening method, for example, relying on the detection of an optical signal. Preferably, using an optical signal which is directly proportionate to the quantity of the expression product from the polynucleotide is a convenient method of measuring expression and is amenable to high throughput processing. For this
purpose, it may be necessary for the POI to incorporate a tag, or be labelled with a removable tag, which permits detection and preferably quantification of expression. Suitable tags may include but are not limited to; a fluorescence reporter molecule translationally-fused to the C-terminal end of the POI, e.g. GFP, Yellow Fluorescent Protein (YFP), Red Fluorescent Protein (RFP) or Cyan Fluorescent Protein (CFP). It may be an enzyme which can be used to generate an optical signal. Alternatively, the expression vector may incorporate a polynucleotide reporter encoding a luminescent protein, such as a luciferase (e.g. firefly luciferase). Alternatively, the reporter gene may encode a chromogenic enzyme which can be used to generate an optical signal, such as beta-galactosidase (LacZ) or beta-glucuronidase (Gus). Tags used for detection of expression may also be antigen peptide tags. A tag may be provided for affinity purification, e.g. a polyhistidine tag. Where the POI is ultimately to be used as a therapeutic agent, any tag employed for detection of expression will be cleavable from the POI. It is envisaged that other types of label may also be used to mark the protein including, for example, organic dye molecules or radiolabels.
Accordingly, in a preferred embodiment of the invention, the measurement of expression comprises the detection of an optical signal, for example a fluorescent signal, a luminescent signal or colour signal. In a particularly preferred embodiment the optical signal is provided by a GFP reporter fused to the protein of interest.
The replacement codon selected from synonymous codons listed as alternatives in the relevant table(s) for a given host is the codon associated with the highest or optimal observed functional expression of the POI, or where more than one codon provides substantially equal such expression, one such codon corresponding with that level of expression. Where there is more than one replacement codon indicated for a given non-optimal codon based on the expression data, this corresponds to the first replacement codon appearing in the relevant table. Therefore where there is choice of codons indicated for a selected position based on the expression data, preference may be given to the first replacement codon appearing in the relevant table. Alternatively, preference may be given to the second replacement codon appearing in the relevant table. Routinely, in situations where the second or third preference codon is already present in the starting sequence, for convenience the
codon in the starting sequence may be retained, i.e. the wild type codon in embodiments where the starting sequence is the wild-type sequence. This will minimise the number of codon changes to convert the starting sequence in a polynucleotide to the selected synonymous coding sequence for improved functional protein expression.
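The rule of retaining a starting codon that is already one of the listed preferences, and otherwise substituting the first preference, may be illustrated as follows. This is a sketch under stated assumptions: the ranked preferences and the partial codon-to-amino-acid map are hypothetical examples covering only two amino acids, and are not the tables of the invention.

```python
# Hypothetical ranked preferences per amino acid (first entry = first preference).
RANKED = {
    "Gly": ["GGT", "GGC"],
    "Arg": ["CGT"],
}

# Partial codon-to-amino-acid map, covering only the amino acids used in this sketch.
AMINO_ACID = {
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "CGT": "Arg", "CGC": "Arg", "CGA": "Arg", "CGG": "Arg",
    "AGA": "Arg", "AGG": "Arg",
}


def recode_minimal(cds):
    """Recode a coding sequence while keeping any codon that is already a listed preference.

    Returns the recoded sequence and the number of codon changes made.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    recoded, changes = [], 0
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        ranked = RANKED.get(AMINO_ACID.get(codon))
        if ranked is None or codon in ranked:
            # No preference listed for this codon, or it is already an accepted codon.
            recoded.append(codon)
        else:
            recoded.append(ranked[0])
            changes += 1
    return "".join(recoded), changes
```

In this sketch a GGC codon (a second preference for Gly in the hypothetical table) is left untouched, whereas GGA is replaced by GGT, so the number of changes relative to the starting sequence is kept to a minimum.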
EXEMPLIFICATION
The invention will now be illustrated below with reference to the following examples and figures, in which:
Figure 1 shows the influence of codon optimisation on protein yield, mRNA stability and translatability. Panel A is a graphical representation of the nucleotide content of the third codon position in the constructs for Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP) expression. GFP was also expressed without SP. Panel B is a graphical representation of protein yield in transformed Arabidopsis thaliana seedlings. For each plant analysed the protein yield in ng per mg total soluble protein (TSP) is plotted against the relative mRNA transcript concentration as compared to the A. thaliana household gene TIP-41. Panel C depicts protein yield in g per mg TSP at 2 to 5 days post infiltration (DPI), in transient expression in Nicotiana benthamiana leaves (native and optimised in black and grey bars, respectively). * indicates co-expression with the silencing inhibitor p19 of tomato bushy stunt virus. n=3, error bars indicate standard error.
Figure 2 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked nucleotide use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and nucleotide use (overall and for each codon position) were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
Figure 3 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and codon use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles respectively.
Figure 4 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked amino acid use. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, correlations (Spearman) between expression and amino acid use were calculated per species and used to generate this heat map. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
Figure 5 shows a heat map displaying the relation between species of several kingdoms of life based on expression-linked codon bias. Expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) (>250 microarrays per species) originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized and averaged. Subsequently, genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1% steps, the extremes (5% highest versus 5% lowest) were reached. With each step the synonymous codon use frequencies in both the high- and low-expressed gene pools were calculated together with the difference in codon use
frequency between the high- versus the low-expressed gene pool. Finally, the difference in codon use frequency was correlated to the expression defining percentage (Spearman). The relation between the species based on this correlation is visualized in this heat map.
Figure 6 shows a graphical representation of mRNA structural features plotted against ranked expression with moving average (black line). The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined. These mRNA characteristics were then plotted against expression.
Figure 7 shows a heat map where the mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy (kcal/mol/nucleotide), fraction of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions per nucleotide were determined and correlated with expression (Spearman) (Table 2). The heat map demonstrates that highly-expressed genes across all kingdoms prefer a stable, but 'airy' mRNA structure. Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
Figure 8 is a heat map showing correlations (Spearman) between mRNA structure characteristics and protein:mRNA ratios per species (Table 3), demonstrating that highly translated transcripts across kingdoms share a similar 'airy' structure. The mRNA structures of all genes of Escherichia coli (Eubacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with protein:mRNA ratios. Rank-normalized mRNA levels were divided by protein abundance (retrieved from PaxDB). Consistent positive and negative correlations across species are indicated with stars and triangles, respectively.
Figure 9 shows mRNA structure predictions of the constructs used for heterologous protein expression. Sequences of the native and optimised variants of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional signal peptide (SP) and GFP without SP flanked by the 5' and 3'-UTRs as expected from our expression cassette were used to predict the mRNA secondary structure.
Figure 10 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked nucleotide use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and nucleotide content (overall and for each codon position) for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species, expression data from >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.
Figure 11 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked codon use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and codon use for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species, expression data from >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.
Figure 12 shows a heat map displaying the relation between species of several kingdoms of life based on translation rate-linked amino acid use. Correlation (Spearman) between mRNA:protein ratios (proxy for translation rate) and amino acid use for the species Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia). For each species, expression data from >250 microarrays originating from multiple studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues (Table 1A-F) were rank-normalized, averaged and divided by protein abundance (retrieved from PaxDB) before correlations (Spearman) between protein:mRNA ratios and nucleotide use were calculated.
Figure 13 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences.
Figure 14 shows a sequence alignment of native (nat) and optimized (opt) GFP sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
Figure 15 shows a sequence alignment of native (nat) and optimized (opt) mIL-10 sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
Figure 16 shows a sequence alignment of native (nat) and optimized (opt) OVA sequences, both preceded by an optimised signal peptide of Arabidopsis thaliana chitinase.
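The rank-based (Spearman) correlations referred to in the descriptions of the heat map figures above can be illustrated with a short sketch. It is written in plain Python for illustration only; the function names are chosen for this example and the text does not state which software was used for the analyses.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

In such a sketch, `x` would hold a per-gene expression value and `y` a per-gene sequence feature, for example the frequency of a given codon; both inputs must have non-zero variance.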
Example 1 - Codon optimisation improves mRNA stability and translatability
Wang and Roossinck (2006) previously compared overall codon use to the codon use in highly expressed genes in 11 plant species. Although the codons used most frequently in highly expressed genes (optimal codons) differed between monocots and dicots, the use of the same codons often increases with expression (expression codons). However, the authors did not express the optimised genes in plants. In the experiments shown here, one codon per amino acid that was most often identified as an expression codon across these 11 plant species was selected. Strikingly, most of these codons were C-ending, except for the amino acids Arg (CGT) and Gly (GGT). The codons of the amino acids Gln, Glu and Lys, which can only be encoded by A- or G-ending codons, were G-ending. To investigate the effect of these codons on heterologous protein production in plants, the sequences of three genes were recoded with these codons. The genes of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) were chosen because of their variation in codon use (Figure 1a). To eliminate differences caused by translation initiation all genes were preceded by the signal peptide of Arabidopsis thaliana chitinase. GFP was also expressed without this signal peptide, as it is normally not secreted. The native and optimised variants of these four constructs were used to transform Arabidopsis thaliana using the floral dip method and their expression in seedlings was evaluated by determining mRNA transcript and protein levels (Figure 1b; Table 4). An increased protein yield found upon optimisation could be partly explained by an increase in mRNA transcript levels, i.e. increased mRNA stability (Table 4). Comparing protein:mRNA ratios of transformants within a similar mRNA expression range showed that codon optimisation resulted in more protein per mRNA transcript. Thus, codon optimisation also resulted in increased mRNA translatability.
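The variation in codon use between constructs, expressed as the nucleotide content of the third codon position, could be computed along the following lines. This is an illustrative sketch only; the function name is chosen for this example and the short input sequence used in the usage note is not one of the constructs.

```python
from collections import Counter


def third_position_content(cds):
    """Fraction of A, C, G and T at the third position of each codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA input
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a non-zero multiple of three")
    thirds = cds[2::3]  # every third nucleotide, starting at codon position 3
    counts = Counter(thirds)
    return {nt: counts.get(nt, 0) / len(thirds) for nt in "ACGT"}
```

For the toy sequence ATGGCCAAGTAA the third positions are G, C, G and A, so the function reports one quarter A, one quarter C, one half G and no T. Applied to a native and a recoded variant, a shift towards C- and G-ending codons would be expected for a recoding of the kind described above.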
Upon transient transformation, transcript levels are always much higher. An increase in mRNA stability and translatability may then no longer improve protein yield. Therefore, protein yield upon transient expression of the three genes in Nicotiana benthamiana was also determined, with and without co-expression of the gene silencing inhibitor p19 of tomato bushy stunt virus (Figure 1c; Table 5). Upon transient expression, too, codon optimisation led to a higher protein yield on all days for all genes, except for OVA, where a higher yield was only obtained when p19 was co-expressed. In most cases co-expression of p19 had a favourable effect on protein yield independent of optimisation. This is not surprising, as mRNA transcript levels are always high in transient expression, which increases the risk of gene silencing. Thus, the mRNA of the optimised variant of OVA must have been more sensitive to gene silencing compared to the native variant.
Construct  Var.  n=  Relative    Fold     Protein  Fold     mRNA conc.  n=  Protein:mRNA  Fold
                     mRNA conc.  change   yield    change   range           ratio         change
GFP        N     32  0.88                 17.03             0.8-2.7     4   22.8±2.70
GFP        O     23  9.25        ?        1276     75***    0.9-2.5     4   161±58.5      7.1
SP-GFP     N     26  1.63                 33.28             1.4-4.9     1   18.0±5.16
SP-GFP     O     24  9.53        5.8*     399.5    12**     1.2-4.8     2   63.9±14.5     3.5*
SP-OVA     N     26  2.37                 717.3             2.0-5.3     2   356.2±142.5
SP-OVA     O     30  5.62        2.4***   3937     5.5***   2.2-5.5     3   1014±121.7    2.8**
SP-IL-10   N     17  1.37                 3.30              1.7-4.2     8   1.26±0.43
SP-IL-10   O     25  4.23        ?***     17.9     5.5***   1.7-4.1     6   6.68±1.02     ?
Table 4. Codon optimisation of GFP, interleukin-10 and ovalbumin genes boosts expression in Arabidopsis thaliana. Average relative mRNA transcript concentration as compared to the A. thaliana housekeeping gene TIP-41 and protein yield in g per mg total soluble protein (TSP) determined in A. thaliana seedlings upon stable transformation of native (N) and optimised (O) sequences of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP). GFP was also expressed without signal peptide. Protein:mRNA ratios were calculated. Because translatability may be lower with a higher mRNA concentration due to the limited number of free ribosomes, the protein:mRNA ratios were calculated for samples within the same mRNA concentration range, as indicated. The fold change when comparing the optimised to the native variant was calculated for the relative mRNA concentration, protein yield and protein:mRNA ratio. For each average the number of included seedlings is indicated (n). Significance of fold changes was calculated with Welch's t-test: * P<0.05, ** P<0.01, *** P<0.001.
dpi 2-5 dpi 5 + p19
Protein yield Fold change Protein yield Fold change
GFP N 5- O 23 I 34 «
SP-GFP N 1
3.2** 2.1
O 3.2 9.2
SP-OVA N 30
17 0.7 2.0*
O 12 61
SP-IL-10 N 8 4
1 .4
O 21 24
Table 5. Codon optimisation boosts protein yield in transient expression in Nicotiana benthamiana. Average protein yield in g per mg total soluble protein determined in N. benthamiana leaves upon transient transformation of native (N) and optimised (O) sequences of Aequorea victoria green fluorescent protein (GFP), Gallus gallus ovalbumin (OVA) and Mus musculus interleukin-10 (IL-10) with additional chitinase signal peptide (SP) (GFP was also expressed without SP) at 2 to 5 days post infiltration (dpi) (n=12) or at 5 dpi, whereby tested genes were co-expressed with the viral silencing inhibitor p19 of tomato bushy stunt virus (n=3). Significance of fold change in protein yield was calculated with Welch's t-test: * P<0.05, ** P<0.01, *** P<0.001.
Evaluating the average yield from dpi 2-5 or with co-expression of p19 on dpi 5 revealed a lower yield increase upon codon optimisation compared to stable
expression in A. thaliana. This is not surprising as at least some of the gain in mRNA stability due to the codon optimisation is compensated by the increased transcription in transient expression. Whether this gain in protein yield is predominantly the result of an increase in mRNA translatability or a combination of a gain in mRNA stability and translatability remains to be determined.
To explain the differences found in mRNA stability, first the thermodynamic stability of the predicted secondary mRNA structures was calculated. Upon codon optimisation the minimum free folding energy decreased, indicative of a more stable mRNA, from -0.25 to -0.35 and -0.31 to -0.33 kcal/mol/nt for GFP and OVA, respectively. However, for IL-10, the minimum free folding energy increased from -0.31 to -0.28 kcal/mol/nt, indicating a less stable mRNA. Thus, an overall increase in physical stability could not explain the increased mRNA transcript levels of IL-10. However, it is still possible that unstable regions of IL-10 were removed upon codon optimisation, while the overall stability decreased.
In vivo, mRNA half-life is predominantly controlled by factors other than physical stability, namely the occurrence of a splicing event, AU-rich destabilizing elements in the UTRs, and the presence of sequences that are targets for microRNA. In our experiments, all genes were expressed using the same expression controlling components, and thus contained the same UTRs and did not contain introns. However, the sequences of the ORFs varied greatly between the native and optimised variants (78, 76 and 83% homology for GFP, OVA and IL-10, respectively). Therefore, there could be a difference in the presence of microRNA targets and also a difference in the occurrence of stretches of double stranded (ds)RNA between the native and optimised variants. The dsRNA stretches could be processed to small interfering RNAs and, like binding of microRNAs, can trigger gene silencing. In stable expression, gene silencing can also be due to gene methylation, but this always results in the complete absence of transcripts and therefore transformants without detectable expression were not considered. In our transient expression experiment co-expression of the silencing inhibitor p19 gave comparable results. Taken together, differences in mRNA decay based on the above-mentioned sequence features are unlikely to explain the differences in mRNA stability in our experiments.
Translation has also been linked to mRNA decay. Ribosomes can shield nuclease target sites, however, in large-scale in vivo studies mRNA half-life could not be linked to the number of nuclease target sites or ribosomal density. When translation initiation is equal, as is expected in our experiments, an increase in translatability should result in a lower density of ribosomes. Thus, there would have been fewer ribosomes on the optimised variants compared to their native counterparts, and the optimised variants would be less protected against nucleases. While translation per se may not influence mRNA half-life, errors in translation have been proven to lead to mRNA degradation by mRNA surveillance mechanisms. Three mRNA surveillance mechanisms have been identified: I) nonsense mediated decay by the recognition of a premature stop codon, II) non-stop decay by the lack of a stop codon and III) no-go decay by stalled ribosomes. Occurrence of a premature stop codon or the lack of a stop codon can be caused by a mutation or a ribosomal slip causing a frame-shift. Frame-shifts can be caused by a 'slippery' sequence that may be found in proximity of a strong mRNA structure. A ribosome may also stall at a strong stem-loop structure without slipping and trigger degradation. It is possible that the native and optimised variants differ in the presence of 'slippery' sequences and/or strong mRNA structures. Thus, differences in level of translation-linked mRNA decay may explain the difference in mRNA transcript levels in our experiment. In addition, ribosomes have intrinsic helicase activity and recently it was shown that strong mRNA structures such as pseudoknots and hairpins can stall translation only temporarily. It is therefore thought that the mRNA structure provides a mechanical basis for cellular regulation of translation rate. 
Thus, increased mRNA translatability of the optimised genes may be explained by an increased translation rate caused by differences in the mRNA structure.
Example 2 - General codon bias extends to other kingdoms of life
The existence of codon biases in different species has implications for the efficient expression of heterologous proteins in a range of host cells. To investigate whether the general codon bias in plants transcends kingdoms of life, expression data of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were interrogated. Per species, >250 microarrays originating from several studies covering a wide range of strains/ecotypes, culturing conditions, developmental stages and tissues were used (Table 1A-F). First, the expression was ranked and the average rank was used as a measure of overall expression. Subsequently, the correlation between expression and nucleotide content was analysed per species. The relation between the species based on this correlation was visualized in a heat map (Figure 2).
Surprisingly, a strong positive correlation between expression and overall G content, in particular G on the first codon position and a negative correlation between expression and A and T on the first codon position was found across all kingdoms. Next, the correlation between expression and codon use was evaluated (Figure 3). Across all kingdoms the use of CGT (Arg/R), AAG (Lys/K), GGT (Gly/G), GTT (Val/V) and GCT (Ala/A) is positively correlated with expression. However, the fact that the nucleotide contents of the first and second codon position are correlated with expression indicates that there is a correlation between amino acid usage and expression. Highly expressed genes are relatively rich in the amino acids encoded by G-starting triplets: Ala, Gly, and Val (Figure 4).
First, to uncouple the amino acid bias from the codon use bias, the relative synonymous codon use was calculated. Subsequently, a comparison was made between high- and low-expressed genes, as a correlation between codon use and expression may only be found in genes expressed above a certain threshold. Genes were grouped based on expression from the centre (50% highest versus 50% lowest) until, with 1% steps, the pools with 5% highest and 5% lowest expressed genes were reached. With each step the codon use frequencies in both high- and low-expressed gene pools were calculated together with the difference in codon use frequency between the high- versus the low-expressed gene pool. Finally, the difference in codon use frequency was correlated (Spearman) to the expression defining percentage. The relation between the species based on this correlation was visualized in a heat map (Figure 5; Table 6A-E show codon use frequencies of all, the bottom 5% low- and top 5% high-expressed genes and fold codon use change (top/bottom) per species).
Strikingly, when clustering the correlations between the 5 species, E. coli, S. cerevisiae, C. elegans and A. thaliana group together well. M. musculus seems to have an overall lower codon bias and in ~50% of the cases selects for other codons compared to the overall selection of the other species. Excluding M. musculus, 13 codons are positively correlated with expression for all species. These 13 codons encode 11 different amino acids and a termination of translation (twice a codon for Thr/T). Comparable to the general codon bias found in plants, 8 of these 13 codons are C-ending. Furthermore, 18 codons are consistently negatively correlated with expression in these four species. Of these codons most are A-ending (8), while none of them are C-ending. Strikingly, 5 universal codons were found which were positively correlated with expression for all species, indicating that these codons are conserved in the coding sequences of highly-expressed genes across all kingdoms of life and could therefore find useful application in methods of optimising functional protein expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. In addition, several codons were found which were positively correlated with further increases in expression in E. coli, S. cerevisiae and C. elegans. Furthermore, in addition to the universal set of codons, several codons were found to be positively correlated with increases in expression in E. coli, S. cerevisiae, C. elegans and Mus musculus. Separately, several codons were found to be positively correlated with increased expression in A. thaliana.
Taken together the data suggest that a conserved selection pressure influences expression across all kingdoms of life. Heterologous protein expression experiments suggested a role for the mRNA structure in translation rate. As the translational machinery does not vary greatly across kingdoms, the mRNA structure is a likely candidate to be the driving force behind this selection pressure.
Example 3 - Highly expressed genes prefer a stable, but 'airy' mRNA structure
To evaluate if the mRNA structure could be the driver of selection that gives rise to the observed general codon bias, the relationship between expression and mRNA structure characteristics was evaluated. To this end, the mRNA structures of all genes were predicted, and gene length, minimal free folding energy, number of bound nucleotides, mean stem and loop size (stretches of bound and unbound nucleotides, respectively) and the number of stem/loop transitions were determined and plotted against expression (Figure 6; Table 7). A heat map displaying the relation between the species based on the correlation (Spearman) between these structure characteristics and expression was also generated (Figure 7; Table 7). This heat map demonstrates that the number of bound nucleotides and the number of stem/loop transitions were consistently positively correlated, and mean loop size consistently negatively correlated, with expression across all species.
The positive correlation with the number of bound nucleotides indicates a general adaptation towards a more stable mRNA molecule. Also, a low folding energy (more stable) is correlated with high expression in S. cerevisiae, C. elegans and A. thaliana, but not in E. coli and M. musculus. Still, in E. coli there seems to be a relation between expression and folding energy, as is demonstrated by the trend line that indicates an optimum (Figure 6). An optimum in mRNA stability may indicate a trade-off between stability and translatability in this species. A trade-off in stability and translatability may also explain why there is a correlation between mRNA folding energy and expression in S. cerevisiae, C. elegans and A. thaliana. These species have an overall lower G+C content resulting in on average weaker mRNAs (Table 7) and have therefore more to gain in terms of stability before translatability is affected.
The number of stem-loop transitions and mean loop size are also correlated with expression (positive and negative, respectively) in all species, which suggests that there is a general adaptation towards dividing nucleotide bonds equally over the mRNA molecule. In other words, highly expressed genes prefer a stable, but 'airy' mRNA molecule. This again indicates a trade-off between mRNA stability and translatability. It is striking that while folding energy in S. cerevisiae, C. elegans and A. thaliana is on average much higher (less stable mRNA) (6-10%) compared to E. coli and M. musculus, the fraction of bound nucleotides, mean stem and loop size and number of transitions do not differ that much (Table 7). This means that while
the mRNA folding energy may differ between species with different G+C content, the overall mRNA structure characteristics are more similar across species.
Taken together our data indicate that there is a general selection towards an optimal folding energy across kingdoms of life whereby number and type of nucleotide bonds (e.g. A-U and G-U bonds are weaker than G-C bonds) are balanced with short loops to facilitate efficient translation. This is in line with the observation that translation rate is greatly influenced by G+C content and strong mRNA structures.
Table 7. mRNA characteristics of highly expressed genes per species.
Averages of mRNA characteristics of the top 5% high-expressed genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
A link between mRNA structure and expression may explain the increase in mRNA stability and translatability in the heterologous protein expression experiments disclosed herein. Therefore the mRNA structures of the native and optimised variants of the expressed genes were predicted and evaluated (Figure 9; Table 8). Optimised variants of GFP and OVA had a lower (more negative) folding energy, indicative of a more stable mRNA. All optimised variants except SP-GFP had an increased number of stem-loop transitions, which is in line with a more 'airy' mRNA molecule. Thus, although changes in the mRNA structure upon optimisation differ from gene to gene, an improved mRNA structure could be the basis of increased protein yield in our experiments.
               Energy          Bound nt's   Mean stem   Mean loop   Transitions
               (kcal/mol/nt)   (fraction)   size        size
GFP       N    -0.21           0.56         5.74        4.48        0.097
          O    -0.33           0.57         5.15        3.85        0.111
SP-GFP    N    -0.22           0.57         5.21        3.89        0.109
          O    -0.32           0.54         5.22        4.31        0.104
SP-OVA    N    -0.29           0.61         5.28        3.34        0.116
          O    -0.31           0.55         4.38        3.56        0.126
SP-IL-10  N    -0.29           0.60         5.02        3.29        0.120
          O    -0.27           0.54         4.08        3.47        0.131
Table 8. Calculated mRNA structure characteristics of the constructs used for heterologous protein expression. Analysis of the mRNA secondary structure predictions given in Figure 9. Folding energy, bound nucleotides and number of transitions are corrected for gene length. Stem and loop sizes are mean values.
Example 4 - A more 'airy' mRNA increases translation rate
On a cellular level, translation efficiency was demonstrated to be the most important factor in controlling protein abundance, whereas protein turnover plays only a minor role. Therefore, protein:mRNA ratio is a good proxy of translation rate. To evaluate if the mRNA structure characteristics found to be linked to expression are also linked to translation rate, the expression data were combined with large-scale protein abundance data retrieved from PaxDb. To evaluate to what extent the expression data predict protein abundance, the correlation (Spearman) between the expression data and the protein abundance was calculated: E. coli 0.59, S. cerevisiae 0.67, C. elegans 0.59, A. thaliana 0.62 and M. musculus 0.36. When the relationship between the protein:mRNA ratio and the previously mentioned mRNA structure characteristics was evaluated, a picture similar to that obtained with the expression data emerged (Figure 8; Table 3; the heat maps in Figures 10-12 demonstrate the relation between species based on correlations of protein:mRNA ratio and nucleotide content, codon use and amino acid use).
                       E. coli   S. cerevisiae   C. elegans   A. thaliana   M. musculus
Gene length            -0.146    -0.116          -0.180       -0.139        -0.288
Energy (kcal/mol/nt)    0.043    -0.237          -0.212       -0.138         0.087
Bound (fraction)       -0.009     0.148          -0.006        0.062        -0.058
Mean stem size         -0.193    -0.011          -0.216       -0.058        -0.121
Mean loop size         -0.121    -0.182          -0.139       -0.105        -0.015
Transitions /nt         0.199     0.140           0.213        0.104         0.081
Table 3. Correlations (Spearman) between mRNA structure characteristics and mRNA:protein ratios per species. The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with mRNA:protein ratios. Rank-normalized mRNA levels were divided by protein abundance (retrieved from PaxDb).
As with expression, the number of stem-loop transitions is positively correlated with protein:mRNA ratio and mean loop size is negatively correlated across all species. Also, the folding energy is negatively correlated (more stable mRNA) for S. cerevisiae, C. elegans and A. thaliana, but not for E. coli and M. musculus. However, in contrast to the expression data, gene length is consistently negatively correlated with protein:mRNA ratio. This is in line with the fact that the packing density of ribosomes was shown to decrease with mRNA transcript length. Also, a negative correlation with mean stem size is found for all species and the fraction of bound nucleotides is not correlated, except for S. cerevisiae. Thus, small stem size must be important for an increased translation rate. This again highlights the tradeoff between mRNA stability and translatability.
Example 5 - Construct design
The native and optimised sequences coding for Aequorea victoria green fluorescent protein (GFP) (L29345.1; nt 7-807), Gallus gallus ovalbumin (OVA) (NM_205152.2; nt 4-1161) and Mus musculus interleukin-10 (IL-10) (NM_010548.2; nt 63-537), together with the optimised sequence for the Arabidopsis thaliana basic chitinase signal peptide (cSP) (BAA82810.1; nt 15-33), were synthetically made by GeneArt (Thermo Fisher Scientific, Breda, the Netherlands). Optimisation was performed by recoding the protein sequences using the C-ending codons for all amino acids (TCC in the case of Ser), except Arg and Gly, for which the T-ending codons were used, and Gln, Glu and Lys, for which the G-ending codons were used. Synonymous mutations to either native or optimised sequences were sometimes introduced to remove undesired restriction sites and the cryptic splice sites in native GFP (Reichel et al., 1996, PNAS, 93:5888-5893). Gene fragments were flanked with sequences including the restriction sites NcoI (5') and EagI-BspHI (3') for cSP, EagI (3') and KpnI (5') for IL-10 and OVA and NcoI (3') and KpnI (5') for GFP to allow fragment assembly and subsequent in-frame cloning into the plant expression vector pHYG (Westerhof et al., 2012, PLoS One, 7: e46460). Fragment assembly was accomplished by the in-frame ligation of cSP with IL-10 and OVA using the EagI site and cSP with GFP using the BspHI (cSP) and NcoI (GFP) sites. ORFs were confirmed by sequencing at the expression vector stage. All vectors were transformed to Agrobacterium tumefaciens strain GV3101 for stable transformation of Arabidopsis thaliana or MOG101 for agroinfiltration in Nicotiana benthamiana.
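The recoding rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `recode` and the table name `EXPRESSION_CODONS` are hypothetical, and the stop codon is an assumption because the text does not state which one was used. The sketch does not remove restriction sites or cryptic splice sites.

```python
# One codon per amino acid, following the rule in the text: C-ending codons
# (TCC for Ser), T-ending for Arg and Gly, G-ending for Gln, Glu and Lys.
# Met and Trp each have only one codon in the standard genetic code.
EXPRESSION_CODONS = {
    "A": "GCC", "R": "CGT", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGT", "H": "CAC", "I": "ATC",
    "L": "CTC", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "TCC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTC",
}


def recode(protein, stop_codon="TAA"):
    """Back-translate a one-letter protein sequence with the table above.

    `stop_codon` is an assumption; it is only used if the input contains '*'.
    """
    codons = []
    for residue in protein.upper():
        if residue == "*":
            codons.append(stop_codon)
        elif residue in EXPRESSION_CODONS:
            codons.append(EXPRESSION_CODONS[residue])
        else:
            raise ValueError(f"unsupported residue: {residue!r}")
    return "".join(codons)


if __name__ == "__main__":
    print(recode("MRGSQEK"))
```

Because every amino acid maps to exactly one codon, the output is fully determined by the protein sequence; any manual synonymous changes would be applied afterwards.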
Example 6 - Stable transformation of Arabidopsis thaliana
Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10g/l peptone 140, 5g/l yeast extract, 10g/l NaCl with pH7.0) containing 50 μg/ml kanamycin. Bacterial cultures were centrifuged for 15 min at 2800 g and resuspended in MMA (20g/l sucrose, 5g/l MS-salts, 1.95g/l MES, pH5.6) containing 200 μM acetosyringone and 0.03% silwet-L77 until an OD of 0.5 was reached. Arabidopsis thaliana plants were submerged in the bacterial suspension for 1 min and kept in a moist environment for 2 days. Plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) until seeds could be collected. Seeds were sterilized by 4-hour exposure to chlorine gas and plated on basic agar plates (8g/l Bacto Agar, 0.101 g/l KNO3) containing 30 ng/ml hygromycin and 100 μg/ml cefotaxime. Plates were kept in the dark at 4°C for 2 days, then placed in artificial light for 7 hours at 24°C, again kept in the dark at RT for 5 days and finally placed in a climate chamber with 12 hour light regime at 24°C for 2 days. At this stage 10 to 40 seedlings per transformant plant were selected and placed in individual pots with Knop agar (1x Knop, 1% sucrose, 8g/l Plant Agar pH6.4) containing 30 μg/ml hygromycin and 100 μg/ml cefotaxime. Seedlings that showed good growth and root formation after 10 days were transferred to fresh pots and allowed to grow for 2 more weeks. Thereafter plants were harvested and snap-frozen. Plant material was homogenized using a TissueLyser II (Qiagen) and stored at -80°C until further use.
Example 7 - Transient transformation of Nicotiana benthamiana
Agrobacterium tumefaciens clones were cultured overnight (o/n) at 28°C in LB medium (10g/l peptone 140, 5g/l yeast extract, 10g/l NaCl with pH7.0) containing 50 μg/ml kanamycin and 20 μg/ml rifampicin. The optical density (OD) of the o/n cultures was measured at 600 nm and used to inoculate 50 ml of LB medium containing 200 μM acetosyringone and 50 μg/ml kanamycin with x μl of culture using the following formula: x = 80000/(1028OD). OD was measured again after 16 hours and the bacterial cultures were centrifuged for 15 min at 2800 g. The bacteria were resuspended in MMA infiltration medium (20g/l sucrose, 5g/l MS-salts, 1.95g/l MES, pH5.6) containing 200 μM acetosyringone until an OD of 1 was reached. All constructs were co-expressed with the tomato bushy stunt virus silencing inhibitor p19 by mixing Agrobacterium cultures 1:1. After 1-2 hours incubation at room temperature, the two youngest fully expanded leaves of 5-6 weeks old Nicotiana benthamiana plants were infiltrated completely. Infiltration was performed by injecting the Agrobacterium suspension into a Nicotiana benthamiana leaf at the abaxial side using a 1 ml syringe. Infiltrated plants were maintained in a controlled greenhouse compartment (UNIFARM, Wageningen) and infiltrated leaves were harvested at selected time points.
Example 8 - Determination of heterologous gene expression
Total RNA was isolated from homogenized plant material using the RNeasy Plant Mini Kit (Qiagen) according to the supplier's protocol. A Turbo DNase I (Ambion) treatment was included to remove any residual DNA. cDNA was synthesised using the SuperScript® III First-Strand Synthesis System (Invitrogen) according to the supplier's protocol using an oligo(dT) primer. Samples were analysed by quantitative PCR in triplicate using ABsolute SYBR Green Fluorescein mix (Thermo Scientific). Arabidopsis thaliana TIP-41 (AY074349.1) was used as a reference gene. The oligonucleotides used for amplification of both native and optimised IL-10, OVA and GFP and TIP-41 were 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 2] / 5'-GGAAGTGGGTGCAGTT-3' [SEQ ID NO: 3]; 5'-AACCTCTTCCTCTTCCTC-3' [SEQ ID NO: 4] / 5'-GGGCAGTAGAAGATGTTC-3' [SEQ ID NO: 5]; 5'-GACGGTAACTACAAGACC-3' [SEQ ID NO: 6] / 5'-TTGTCGGCCATGATGTA-3' [SEQ ID NO: 7]; and 5'-GCTCATCGGTACGCTCTTTT-3' [SEQ ID NO: 8] / 5'-TCCATCAGTCAGAGGCTTCC-3' [SEQ ID NO: 9], respectively. Relative transcript levels of the genes versus TIP-41 were determined by the Pfaffl method (Pfaffl, 2001, Nucleic Acids Research, 29: e45).
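For orientation, the efficiency-corrected ratio of the Pfaffl model can be written as a short Python function. This is a minimal sketch: the function and argument names are hypothetical, and the use of a calibrator (control) sample is an assumption, since the text does not state which sample served as calibrator or which amplification efficiencies were measured.

```python
def pfaffl_ratio(e_target, ct_target_control, ct_target_sample,
                 e_ref, ct_ref_control, ct_ref_sample):
    """Efficiency-corrected relative expression (Pfaffl, 2001).

    ratio = E_target ** (Ct_control - Ct_sample of the target gene)
            / E_ref ** (Ct_control - Ct_sample of the reference gene)

    E is the amplification efficiency per cycle (2.0 means perfect doubling).
    """
    delta_target = ct_target_control - ct_target_sample
    delta_ref = ct_ref_control - ct_ref_sample
    return (e_target ** delta_target) / (e_ref ** delta_ref)


if __name__ == "__main__":
    # Illustrative Ct values only; they are not taken from the experiments.
    print(pfaffl_ratio(2.0, 25.0, 22.0, 2.0, 20.0, 19.0))
```

With perfect efficiencies the expression reduces to the familiar 2^(ΔΔCt) form; the correction matters when the target and reference (here TIP-41) amplicons amplify with different efficiencies.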
Example 9 - Determination of heterologous protein expression
Homogenized plant material was ground in ice-cold extraction buffer (50mM phosphate-buffered saline (PBS) pH=7.4, 100 mM NaCl, 10 mM ethylenediaminetetraacetic acid (EDTA), 0.1% v/v Tween-20, 2% w/v immobilized polyvinylpolypyrrolidone (PVPP)) using 2 ml/g fresh weight. Crude extract was clarified by centrifugation at 16,000 x g for 5 min at 4°C and supernatant was directly used in an ELISA and BCA protein assay. Mouse IL-10 expression levels were determined using the Mouse IL-10 ELISA Ready-SET-Go! kit (eBioscience) according to the supplier's protocol. For the quantification of OVA and GFP, a rabbit anti-ovalbumin or a chicken anti-GFP (both from Rockland Immunochemicals Inc.) was used to coat ELISA plates o/n at 4°C in a moist environment. After this and each following step the plate was washed 5 times with 30 sec intervals in PBST (1x PBS, 0.05% Tween-20) using an automatic plate washer (BioRad model 1575). The plate was blocked with assay diluent (eBioscience) for 1 h at room temperature. Samples and standard lines were loaded in serial dilutions and incubated for 1 h at room temperature. Standard lines were made from purified chicken ovalbumin (Sigma) or recombinant GFP (Roche). For detection of OVA and GFP a rabbit anti-ovalbumin:HRP antibody or a rabbit anti-GFP:HRP antibody (both from Rockland Immunochemicals Inc.) was used, respectively. A 3,3',5,5'-Tetramethylbenzidine (TMB) substrate (eBioscience) was added and the colouring reaction was stopped using stop solution (0.18M sulphuric acid) after 1-15 min. Read outs were performed using the model 680 microplate reader (BioRad) to measure the OD at 450 nm with correction filter of 690 nm. For sample comparison total soluble protein (TSP) concentration was determined using the BCA Protein Assay Kit (Pierce) according to the supplier's protocol using bovine serum albumin (BSA) as a standard.
Example 10 - Gene expression datasets
Gene expression datasets of 5 species (Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus) were downloaded from Gene Expression Omnibus (GEO). Gene-expression sets were selected based on platform (Affymetrix), release date (not earlier than 2008), publication linked to the GEO set and number of samples in the study. In total 2067 gene-expression profiles were collected, representing 8 or 9 different studies per organism. An overview can be found in Table 1A-F.
Example 11 - Protein abundance datasets
Protein abundance datasets were retrieved from PaxDb (Wang et al., 2012, Mol Cell Proteomics, 11: 492-500), where the integrated datasets of Escherichia coli, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Mus musculus were downloaded.
Example 12 - Gene expression normalization
Gene expression was normalized based on rank. Per species one array platform was used and per species probes were ranked according to their intensities. The average rank per probe was used as a measure of overall gene expression to distinguish genes with overall low and high expression levels for each species.
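The rank-based normalisation can be illustrated with a short Python sketch. The helper names `rank_values` and `average_rank` are hypothetical, and giving tied intensities their average rank is an assumption; the text only states that probes were ranked per array and that the average rank per probe was used.

```python
def rank_values(values):
    """Return 1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def average_rank(arrays):
    """arrays: list of arrays, each a list of intensities in the same probe order.

    Returns one value per probe: its rank averaged over all arrays.
    """
    per_array = [rank_values(array) for array in arrays]
    n_arrays = len(arrays)
    n_probes = len(arrays[0])
    return [sum(r[p] for r in per_array) / n_arrays for p in range(n_probes)]


if __name__ == "__main__":
    print(average_rank([[5.0, 1.0, 3.0], [2.0, 1.0, 9.0]]))
```

Working on ranks rather than raw intensities makes arrays from different studies comparable without assuming a common intensity scale.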
Example 13 - mRNA Sequences
The coding sequences (CDS) of all genes of 5 species were downloaded from sequence/genome repositories. For Escherichia coli, the CDS of strains CFT073, EDL933, MG1655 and Sakai were obtained from NCBI, accessions NC_004431.1, NC_002655.1, NC_U00096.3 and NC_002695.1 respectively. For Arabidopsis thaliana, the CDS of the 20101108 release were obtained from TAIR (Lamesch et al., 2012, Nucleic Acids Research 40: D1202-1210). For Saccharomyces cerevisiae, the open reading frames (without UTR, introns, etc.) of the 20110203 release were obtained from the Saccharomyces genome database (Cherry et al., 2012, Nucleic Acids Research 40: D700-705). For Caenorhabditis elegans, the CDS of WS241 were obtained from WormBase (Yook et al., 2012, Nucleic Acids Research 40: D735-741). For Mus musculus, the CDS of the 20130508 release (GRCm38.p1) were obtained from the NCBI CCDS database (Farrell et al., 2014, Nucleic Acids Research 42: D865-872).
Example 10 - mRNA folding
The mRNAs of all species were folded using Vienna RNA fold (Lorenz et al., 2011, Algorithms for Molecular Biology 6: 26) at 20°C, using the parameters of Andronescu et al. (Andronescu et al., 2007, Bioinformatics 23: i19-28). The M. musculus mRNA was also folded at 37°C and the S. cerevisiae mRNA also at 30°C, but all the reported comparisons are based on 20°C.
Example 11 - mRNA sequence and structure statistics
Several statistics were taken from the mRNA sequence: gene length, codon usage, and nucleotide usage. Also from the predicted mRNA structure several statistics were taken: number of bound nucleotides, number of free nucleotides, average stem size, average loop size, variation in stem size, variation in loop size, and energy of the structure.
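As an illustration of how such structure statistics could be derived, the sketch below parses a predicted structure in dot-bracket notation (the format produced by RNAfold). Several choices are assumptions rather than the documented method: a stem is taken as any uninterrupted run of paired positions and a loop as any uninterrupted run of unpaired positions, a transition is counted each time a stem run is followed by a loop run, and the population standard deviation is used as the measure of variation. The function name `structure_stats` is hypothetical.

```python
from itertools import groupby
from statistics import mean, pstdev


def structure_stats(dot_bracket):
    """Summarise a dot-bracket structure string such as '((((...))))..'."""
    if not dot_bracket:
        raise ValueError("empty structure")
    # Runs of bound ('(' or ')') and unbound ('.') positions, 5' to 3'.
    runs = [(bound, sum(1 for _ in group))
            for bound, group in groupby(dot_bracket, key=lambda ch: ch != ".")]
    stems = [size for bound, size in runs if bound]
    loops = [size for bound, size in runs if not bound]
    # Assumption: one transition per stem run that is followed by a loop run.
    transitions = sum(1 for (a, _), (b, _) in zip(runs, runs[1:]) if a and not b)
    length = len(dot_bracket)
    return {
        "bound_fraction": sum(stems) / length,
        "mean_stem": mean(stems) if stems else 0.0,
        "mean_loop": mean(loops) if loops else 0.0,
        "sd_stem": pstdev(stems) if stems else 0.0,
        "sd_loop": pstdev(loops) if loops else 0.0,
        "transitions_per_nt": transitions / length,
    }


if __name__ == "__main__":
    print(structure_stats("((((...)))).."))
```

Dividing the counts by sequence length corresponds to the gene-length correction mentioned for the bound nucleotides and the number of transitions.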
Example 12 - Gene expression and mRNA folding statistics
The Spearman correlations between gene expression and the various mRNA-based statistics were calculated in R 3.0.2 (x64). For some of the factors a correction for gene length was applied; these were: number of bound nucleotides, number of unbound nucleotides, energy of the structure, number of stems, number of loops, triplet usage, nucleotide usage, and amino acid usage.
For the expression codon analysis, the frequencies of use of synonymous codons were calculated. This was done over a receding window, from the 50% highest versus the 50% lowest expressed genes down to the 5% highest versus the 5% lowest, in increments of 1%.
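The receding-window comparison can be sketched as follows. This is an illustrative reading of the description above, not the original analysis code: the function names are hypothetical, the standard genetic code is assumed, and codon counts are pooled over the genes in each subset (the tables report averages, so the original may have averaged per-gene frequencies instead).

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_frequencies(sequences):
    """Fraction of each codon among the codons encoding the same amino acid."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


def receding_window(genes, start=50, stop=5, step=1):
    """Yield (percent, top_frequencies, bottom_frequencies).

    genes: list of (expression_value, coding_sequence) pairs.
    """
    ranked = sorted(genes, key=lambda g: g[0], reverse=True)
    for pct in range(start, stop - 1, -step):
        k = max(1, len(ranked) * pct // 100)
        top = [cds for _, cds in ranked[:k]]
        bottom = [cds for _, cds in ranked[-k:]]
        yield pct, synonymous_frequencies(top), synonymous_frequencies(bottom)


if __name__ == "__main__":
    toy = [(10.0, "ATGGCTGCTTAA"), (1.0, "ATGGCCGCTTAA")]
    for pct, top, bottom in receding_window(toy, start=50, stop=48):
        print(pct, top["GCT"] / bottom["GCT"])
```

The Top/Bottom column of Tables 6A to 6E is the ratio of the two frequencies at the 5% window; for example, for GCT in E. coli, 0.332 / 0.154 is approximately 2.156.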
Example 13 - Sequences used for transformation
A novel aspect of our finding is that the selection of mRNA structures with the most even distribution of stems and loops leads to higher levels of expression in commonly used host cells, for example prokaryotic cells, fungal cells, plant cells and animal cells. Below is an example procedure used to select the optimal mRNA structure for improved functional expression in a host cell of interest.
The first step in selecting the 'ideal' mRNA structure is the generation of a pool of mRNA variants by making all possible combinations of synonymous codons (> 100,000 mRNA variants).
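The first step can be sketched as a lazy enumeration of synonymous variants. This is a hedged illustration only: the function names `synonymous_variants` and `pool_size` are hypothetical, the standard genetic code is assumed, and the source does not describe how the pool was actually generated or bounded.

```python
from itertools import islice, product
from math import prod

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Map each amino acid (or stop, '*') to its list of synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def split_codons(cds):
    """Split a coding sequence into codons, checking the reading frame."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def pool_size(cds):
    """Number of distinct synonymous variants of the coding sequence."""
    return prod(len(SYNONYMS[CODON_TABLE[c]]) for c in split_codons(cds))


def synonymous_variants(cds):
    """Lazily yield every synonymous variant of the coding sequence."""
    choices = [SYNONYMS[CODON_TABLE[c]] for c in split_codons(cds)]
    for combo in product(*choices):
        yield "".join(combo)


if __name__ == "__main__":
    cds = "ATGCTGTAA"  # Met-Leu-stop
    print(pool_size(cds))
    # The generator is lazy, so a pool can be capped without enumerating it all.
    print(list(islice(synonymous_variants(cds), 3)))
```

Because the generator is lazy, `islice` can cap the pool at a chosen size; the number of combinations grows multiplicatively with every codon that has synonyms.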
The second step is in silico folding of all mRNA species in the pool under the temperature and salt concentrations relevant for the preferred host. The third step is the selection of mRNAs from the pool that meet the following criteria:
(that is, the selection of mRNAs that have the most even distribution of stems and loops, which can be selected by the criteria described below.)
For A. thaliana
1. average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)
2. average stem size is below 5.20 bp (or between 5.20 and 2.5 bp)
3. average loop size is below 3.32 bp (or between 3.32 and 3 bp)
4. the standard deviation of the loop size is below 3.20 (or between 3.20 and 2 bp) (measure for even distribution)
5. the standard deviation of the stem size is below 3.40 (or between 3.40 and 2 bp) (measure for even distribution)
6. maximum loop size is below 18 bp (discard uneven stem loop distributions)
7. maximum stem size is below 19 bp (discard uneven stem loop distributions)
For C. elegans
1. average number of stem-loop transitions is above 114 per 1,000 bp (or between 114 and 250 per 1,000 bp)
2. average stem size is below 5.35 bp (or between 5.35 and 2.5 bp)
3. average loop size is below 3.47 bp (or between 3.47 and 3 bp)
4. the standard deviation of the loop size is below 3.37 (or between 3.37 and 2 bp)
5. the standard deviation of the stem size is below 3.27 (or between 3.27 and 2 bp)
6. maximum loop size is below 20 bp
7. maximum stem size is below 18 bp
For E. coli
1. average number of stem-loop transitions is above 116 per 1,000 bp (or between 116 and 250 per 1,000 bp)
2. average stem size is below 5.45 bp (or between 5.45 and 2.5 bp)
3. average loop size is below 3.16 bp (or between 3.16 and 2 bp)
4. the standard deviation of the loop size is below 2.95 (or between 2.95 and 2 bp)
5. the standard deviation of the stem size is below 3.50 (or between 3.50 and 2 bp)
6. maximum loop size is below 16 bp
7. maximum stem size is below 18 bp
For M. musculus
1. average number of stem-loop transitions is above 120 per 1,000 bp (or between 120 and 250 per 1,000 bp)
2. average stem size is below 4.35 bp (or between 4.35 and 2.5 bp)
3. average loop size is below 5.18 bp (or between 5.18 and 4 bp)
4. the standard deviation of the loop size is below 3.00 (or between 3.00 and 2 bp)
5. the standard deviation of the stem size is below 3.28 (or between 3.28 and 2 bp)
6. maximum loop size is below 18 bp
7. maximum stem size is below 19 bp
For S. cerevisiae
1. average number of stem-loop transitions is above 110 per 1,000 bp (or between 110 and 250 per 1,000 bp)
2. average stem size is below 5.27 bp (or between 5.27 and 2.5 bp)
3. average loop size is below 3.77 bp (or between 3.77 and 3 bp)
4. the standard deviation of the loop size is below 3.65 (or between 3.65 and 2 bp)
5. the standard deviation of the stem size is below 3.25 (or between 3.25 and 2 bp)
6. maximum loop size is below 20 bp
7. maximum stem size is below 19 bp
After step 3, where there were several appropriate codons according to the foregoing criteria, previously published data was consulted to make a final selection. Codons giving the lowest folding energy of the 5' terminus and codons that are frequently used and match the most abundant tRNAs were preferred.
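The per-species structural criteria of the third step can be expressed as a simple threshold filter. The thresholds below are the "above"/"below" values listed in the criteria; the dictionary layout, key names and the function `meets_criteria` are hypothetical, and the optional lower bounds given in parentheses in the criteria are not applied in this sketch.

```python
# Thresholds transcribed from the criteria listed for each species.
CRITERIA = {
    "A. thaliana": {"transitions": 116, "stem_mean": 5.20, "loop_mean": 3.32,
                    "loop_sd": 3.20, "stem_sd": 3.40, "loop_max": 18, "stem_max": 19},
    "C. elegans": {"transitions": 114, "stem_mean": 5.35, "loop_mean": 3.47,
                   "loop_sd": 3.37, "stem_sd": 3.27, "loop_max": 20, "stem_max": 18},
    "E. coli": {"transitions": 116, "stem_mean": 5.45, "loop_mean": 3.16,
                "loop_sd": 2.95, "stem_sd": 3.50, "loop_max": 16, "stem_max": 18},
    "M. musculus": {"transitions": 120, "stem_mean": 4.35, "loop_mean": 5.18,
                    "loop_sd": 3.00, "stem_sd": 3.28, "loop_max": 18, "stem_max": 19},
    "S. cerevisiae": {"transitions": 110, "stem_mean": 5.27, "loop_mean": 3.77,
                      "loop_sd": 3.65, "stem_sd": 3.25, "loop_max": 20, "stem_max": 19},
}


def meets_criteria(stats, species):
    """Return True if a folded variant passes all seven criteria.

    stats: dict with the same keys as the CRITERIA entries, where
    'transitions' is the number of stem-loop transitions per 1,000 bp.
    """
    limits = CRITERIA[species]
    # Criterion 1 is a lower bound; criteria 2 to 7 are upper bounds.
    if not stats["transitions"] > limits["transitions"]:
        return False
    upper_bounded = ("stem_mean", "loop_mean", "loop_sd", "stem_sd",
                     "loop_max", "stem_max")
    return all(stats[key] < limits[key] for key in upper_bounded)


if __name__ == "__main__":
    # Invented example values, not data from the tables.
    variant = {"transitions": 123.4, "stem_mean": 5.16, "loop_mean": 3.00,
               "loop_sd": 2.50, "stem_sd": 3.16, "loop_max": 15, "stem_max": 17}
    print(meets_criteria(variant, "E. coli"))
    print(meets_criteria(variant, "M. musculus"))
```

Variants passing the filter would then go on to the final selection described above (low folding energy at the 5' terminus and frequently used codons matching abundant tRNAs).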
Example 14 - Sequences used for transformation
All ORFs
GFP-720bp
>GFPnat [SEQ ID NO: 10]
atggccAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGG CGATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAA
AACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCcTGGCCAACACTTGTC ACTACTCTCACCTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGA CTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTgCAGGAAAGAACTATATTTTTCAAAGATG ACGGtAACTACAAGACcCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATC GAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAACTCGAATACAA CTATAACTCACATAATGTATACATCATGGCcGACAAACAGAAGAATGGAATCAAAGTTAACT TCAAAATTAGACACAACATTGAGGATGGAAGCGTTCAATTAGCAGACCATTATCAACAAAAT ACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGC CCTT CCAAAGATCCCAACGAAAAGAGAGATCACATGGTCCTTCTTGAGTTTGTAACAGCTG CTGGGATTACACTCGGCATGGATGAACTATACAAATAA
>GFPopt [SEQ ID NO: 11]
atggccTCCAAGGGTGAGGAGCTCTTCACCGGTGTCGTCCCCATCCTCGTCGAGCTCGACGG TGACGTCAACGGTCACAAGTTCTCCGTCTCCGGTGAGGGTGAGGGTGACGCCACCTACGGTA AGCTCACCCTCAAGTTCATCTGCACCACCGGTAAGCTCCCCGTCCCCTGGCCCACCCTCGTC ACCACCCTCACCTACGGTGTCCAGTGCTTCTCCCGTTACCCCGACCACATGAAGCAGCACGA CTTCTTCAAGTCCGCCATGCCCGAGGGTTACGTCCAGGAGCGTACCATCTTCTTCAAGGACG ACGGTAACTACAAGACCCGTGCCGAGGTCAAGTTCGAGGGTGACACCCTCGTCAACCGTATC GAGCTCAAGGGTATCGACTTCAAGGAGGACGGTAACATCCTCGGTCACAAGCTCGAGTACAA CTACAACTCCCACAACGTCTACATCATGGCCGACAAGCAGAAGAACGGTATCAAGGTCAACT TCAAGATCCGTCACAACATCGAGGACGGTTCCGTCCAGCTCGCCGACCACTACCAGCAGAAC ACCCCCATCGGTGACGGTCCCGTCCTCCTCCCCGACAACCACTACCTCTCCACCCAGTCCGC CCTCTCCAAGGACCCCAACGAGAAGCGTGACCACATGGTCCTCCTCGAGTTCGTCACCGCCG CCGGTATCACCCTCGGTATGGACGAGCTCTACAAGTAA
SP-AvGFP-786bp
>chitSPoptGFPnat [SEQ ID NO: 12]
atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGtcatggccAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAG ATGGCGATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATAC GGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCcTGGCCAACACT TGTCACTACTCTCACCTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGC ATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTgCAGGAAAGAACTATATTTTTCAAA GATGACGGtAACTACAAGACcCGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAG AATCGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAACTCGAAT ACAACTATAACTCACATAATGTATACATCATGGCcGACAAACAGAAGAATGGAATCAAAGTT AACTTCAAAATTAGACACAACATTGAGGATGGAAGCGTTCAATTAGCAGACCATTATCAACA AAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAAT CTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGGTCCTTCTTGAGTTTGTAACA GCTGCTGGGATTACACTCGGCATGGATGAACTATACAAATAA
>chitSPoptGFPopt [SEQ ID NO: 13]
atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGtcatggccTCCAAGGGTGAGGAGCTCTTCACCGGTGTCGTCCCCATCCTCGTCGAGCTCG ACGGTGACGTCAACGGTCACAAGTTCTCCGTCTCCGGTGAGGGTGAGGGTGACGCCACCTAC GGTAAGCTCACCCTCAAGTTCATCTGCACCACCGGTAAGCTCCCCGTCCCCTGGCCCACCCT CGTCACCACCCTCACCTACGGTGTCCAGTGCTTCTCCCGTTACCCCGACCACATGAAGCAGC ACGACTTCTTCAAGTCCGCCATGCCCGAGGGTTACGTCCAGGAGCGTACCATCTTCTTCAAG GACGACGGTAACTACAAGACCCGTGCCGAGGTCAAGTTCGAGGGTGACACCCTCGTCAACCG TATCGAGCTCAAGGGTATCGACTTCAAGGAGGACGGTAACATCCTCGGTCACAAGCTCGAGT
ACAACTACAACTCCCACAACGTCTACATCATGGCCGACAAGCAGAAGAACGGTATCAAGGTC AACTTCAAGATCCGTCACAACATCGAGGACGGTTCCGTCCAGCTCGCCGACCACTACCAGCA GAACACCCCCATCGGTGACGGTCCCGTCCTCCTCCCCGACAACCACTACCTCTCCACCCAGT CCGCCCTCTCCAAGGACCCCAACGAGAAGCGTGACCACATGGTCCTCCTCGAGTTCGTCACC GCCGCCGGTATCACCCTCGGTATGGACGAGCTCTACAAGTAA
mIL-10-540bp
>chitSPopt-IL-10nat [SEQ ID NO: 14]
atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGcccagtacagccgggaagacaatAACtgcacccacttcccagtcggccagagccacatgc tcctagagctgcggactgccttcagccaggtgaagactttctttcaaacaaaggaccagctg gacaacatactgctaaccgactccttaatgcaggactttaagggttacttgggttgccaagc cttatcggaaatgatccagttttacctggtagaagtgatgccccaggcagagaagcatggcc cagaaatcaaggagcatttgaattccctgggtgagaagctgaagaccctcaggatgcggctg aggcgctgtcatcgatttctcccctgtgaaaataagagcaaggcagtggagcaggtgaagag tgattttaataagctccaagaccaaggtgtctacaaggccatgaatgaatttgacatcttca tcaactgcatagaagcatacatgatgatcaaaatgaaaagctaa
>chitSPopt-mIL-10opt [SEQ ID NO: 15]
atggccAAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCgGC CGcccagtactcccgtgaggacaacaactgcacccacttccccgtcggtcagtcccacatgc tcctcgagctccgtaccgccttctcccaggtcaagaccttcttccagaccaaggaccagctc gacaacatcctcctcaccgactccctcatgcaggacttcaagggttacctcggttgccaggc cctctccgagatgatccagttctacctcgtcgaggtcatgccccaggccgagaagcacggtc ccgagatcaaggagcacctcaactccctcggtgagaagctcaagaccctccgtatgcgtctc cgtcgttgccaccgtttcctcccctgcgagaacaagtccaaggccgtcgagcaggtcaagtc cgacttcaacaagctccaggaccagggtgtctacaaggccatgaacgagttcgacatcttca tcaactgcatcgaggcctacatgatgatcaagatgaagtcctga
OVA-1221bp
>chitSPoptOVAnat (only with pIVT) [SEQ ID NO: 16]
atg (gcc) AAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCg GCCGGCTCCATCGGCGCAGCAAGCATGGAATTTTGTTTTGATGTATTCAAGGAGCTCAAAGT CCACCATGCCAATGAGAACATCTTCTACTGCCCCATTGCCATCATGTCAGCTCTAGCCATGG TATACCTGGGTGCAAAAGACAGCACCAGGACACAGATAAATAAGGTTGTTCGCTTTGATAAA CTTCCAGGATTCGGAGACAGTATTGAAGCTCAGTGTGGCACATCTGTAAACGTTCACTCTTC ACTTAGAGACATCCTCAACCAAATCACCAAACCAAATGATGTTTATTCGTTCAGCCTTGCCA GTAGACTTTATGCTGAAGAGAGATACCCAATCCTGCCAGAATACTTGCAGTGTGTGAAGGAA CTGTATAGAGGAGGCTTGGAACCTATCAACTTTCAAACAGCTGCAGATCAAGCCAGAGAGCT CATCAATTCCTGGGTAGAAAGTCAGACAAATGGAATTATCAGAAATGTCCTTCAGCCAAGCT CCGTGGATTCTCAAACTGCAATGGTTCTGGTTAATGCCATTGTCTTCAAAGGACTGTGGGAG AAAACATTTAAGGATGAAGACACACAAGCAATGCCTTTCAGAGTGACTGAGCAAGAAAGCAA ACCTGTGCAGATGATGTACCAGATTGGTTTATTTAGAGTGGCATCAATGGCTTCTGAGAAAA TGAAGATCCTGGAGCTTCCATTTGCCAGTGGGACAATGAGCATGTTGGTGCTGTTGCCTGAT GAAGTCTCAGGCCTTGAGCAGCTTGAGAGTATAATCAACTTTGAAAAACTGACTGAATGGAC CAGTTCTAATGTTATGGAAGAGAGGAAGATCAAAGTGTACTTACCTCGCATGAAGATGGAGG AAAAATACAACCTCACATCTGTCTTAATGGCTATGGGCATTACTGACGTGTTTAGCTCTTCA GCCAATCTGTCTGGCATCTCCTCAGCAGAGAGCCTGAAGATtTCTCAAGCTGTCCATGCAGC ACATGCAGAAATCAATGAAGCAGGCAGAGAGGTGGTAGGGTCAGCAGAGGCTGGAGTGGATG CTGCAAGCGTCTCTGAAGAATTTAGGGCTGACCATCCATTCCTCTTCTGTATCAAGCACATC GCAACCAACGCCGTTCTCTTCTTTGGCAGATGTGTTTCCCCTTAA
>chitSPoptOVAopt [SEQ ID NO: 17]
atg (gcc) AAGACCAACCTCttcCTCttcCTCATCttcTCCCTCCTCCTCTCCCTCTCCTCg GCCGGTTCCATCGGTGCCGCCAGCATGGAGTTCTGCTTCGACGTCTTCAAGGAGCTCAAGGT CCACCACGCCAACGAGAACATCTTCTACTGCCCCATCGCCATCATGTCCGCCCTCGCTATGG TCTACCTCGGTGCCAAGGACTCCACCCGTACCCAGATCAACAAGGTCGTCCGTTTCGACAAG CTCCCCGGTTTCGGTGACTCCATCGAGGCCCAGTGCGGTACTTCCGTCAACGTCCACTCCTC CCTCCGTGACATCCTCAACCAGATCACCAAGCCCAACGACGTCTACTCCTTCTCCCTCGCCT CCCGTCTCTACGCCGAGGAGCGTTACCCCATCCTCCCCGAGTACCTCCAGTGCGTCAAGGAG CTCTACCGTGGTGGTCTCGAGCCCATCAACTTCCAGACCGCCGCCGACCAGGCCCGTGAGCT CATCAACTCCTGGGTCGAGTCCCAGACCAACGGTATCATCCGTAACGTCCTCCAGCCCTCCT CCGTCGACTCCCAGACCGCTATGGTCCTCGTCAACGCCATCGTCTTCAAGGGTCTCTGGGAG AAGaCCTTCAAGGACGAGGACACCCAGGCCATGCCCTTCCGTGTCACCGAGCAGGAGTCCAA GCCCGTCCAGATGATGTACCAGATCGGTCTCTTCCGTGTCGCCAGCATGGCCTCCGAGAAGA TGAAGATCCTCGAGCTCCCCTTCGCCTCCGGTACTATGTCCATGCTCGTCCTCCTCCCCGAC GAGGTCTCCGGTCTCGAGCAGCTCGAGTCCATCATCAACTTCGAGAAGCTCACCGAGTGGAC CTCCTCCAACGTCATGGAGGAGCGTAAGATCAAGGTCTACCTCCCCCGTATGAAGATGGAGG AGAAGTACAACCTCACCTCCGTCCTCATGGCTATGGGTATCACCGACGTCTTCTCCTCCTCC GCCAACCTCTCCGGTATCTCCTCCGCCGAGTCCCTCAAGATCTCCCAGGCCGTCCACGCCGC CCACGCCGAGATCAACGAGGCCGGTCGTGAGGTCGTCGGTTCCGCCGAGGCCGGTGTCGACG CCGCCTCCGTCTCCGAGGAGTTCCGTGCCGACCACCCCTTCCTCTTCTGCATCAAGCACATC GCCACCAACGCCGTCCTCTTCTTCGGTCGTTGCGTCTCCCCCTAA
E. coli S. cerevisiae C. elegans A. thaliana M. musculus
Strains/ecotypes 1 13 14 8 9
Samples 168 316 391 415 111
Controls 105 21 1 109 101 565
Papers 8 9 9 9 9
Treatments 20 14 29 73 21
Tissues 1 1 3 1 1 28
Additional remarks:
> Different strains/mutants and tissues receiving the same experimental treatment are counted as a single treatment; all measurements in a time series are counted as a single treatment.
> The M. musculus data sets of Thorrez et al., 2009 and Xue et al., 2013 do not include the control spot on the slide in their datasets.
> E. coli expression values from the Dong and Schellhorn, 2009 dataset are rounded off to a single decimal and those from the Ito et al., 2009 dataset to two decimals.
Table 1A. Overview of the gathered expression data per species.
Table 1C. Description of the gathered S. cerevisiae expression data.
Table 1D. Description of the gathered C. elegans expression data.
Table 1E. Description of the gathered A. thaliana expression data.
Table 1F. Description of the gathered M. musculus expression data.
E. coli S. cerevisiae C. elegans A. thaliana M. musculus
Gene length -0.146 -0.041 0.093 0.030 -0.016
Energy (kcal.mol/nt) -0.006 -0.319 -0.316 -0.229 0.006
Bound nt (fraction) 0.038 0.236 0.061 0.172 0.015
Mean stem size -0.111 0.054 -0.182 0.053 -0.055
Mean loop size -0.115 -0.241 -0.179 -0.155 -0.046
Transitions /nt 0.140 0.144 0.227 0.071 0.069
Table 2. Correlation between mRNA structure characteristics and gene expression per species. The mRNA structures of all genes of Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia) were predicted and gene length, minimal free folding energy, percentage of bound nucleotides, mean stem and loop (stretches of bound and unbound nucleotides, respectively) size and number of stem/loop transitions were determined and correlated (Spearman) with expression.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.648 0.871 0.643 1.355
TAG 0.067 0.021 0.064 0.328
TGA 0.285 0.107 0.293 0.365
A GCT 0.160 0.332 0.154 2.156
GCC 0.266 0.139 0.275 0.505
GCA 0.209 0.258 0.216 1.194
GCG 0.365 0.271 0.356 0.761
C TGT 0.435 0.385 0.466 0.826
TGC 0.565 0.615 0.534 1.152
D GAT 0.617 0.429 0.649 0.661
GAC 0.383 0.571 0.351 1.627
E GAA 0.693 0.760 0.681 1.116
GAG 0.307 0.240 0.319 0.752
F TTT 0.562 0.290 0.615 0.472
TTC 0.438 0.710 0.385 1.844
G GGT 0.343 0.527 0.327 1.612
GGC 0.413 0.406 0.395 1.028
GGA 0.098 0.031 0.116 0.267
GGG 0.146 0.036 0.162 0.222
H CAT 0.557 0.295 0.591 0.499
CAC 0.443 0.705 0.409 1.724
I ATT 0.503 0.302 0.530 0.570
ATC 0.434 0.688 0.380 1.811
ATA 0.063 0.010 0.090 0.111
K AAA 0.770 0.768 0.793 0.968
AAG 0.230 0.232 0.207 1.121
L TTA 0.124 0.042 0.155 0.271
TTG 0.124 0.059 0.130 0.454
CTT 0.100 0.065 0.112 0.580
CTC 0.103 0.068 0.106 0.642
CTA 0.035 0.008 0.041 0.195
CTG 0.515 0.758 0.457 1.659
M ATG 1.000 1.000 1.000 1.000
N AAT 0.432 0.182 0.486 0.374
AAC 0.568 0.818 0.514 1.591
P CCT 0.152 0.140 0.165 0.848
CCC 0.115 0.025 0.137 0.182
CCA 0.185 0.134 0.199 0.673
CCG 0.547 0.702 0.498 1.410
Q CAA 0.337 0.213 0.369 0.577
CAG 0.663 0.787 0.631 1.247
R CGT 0.396 0.636 0.363 1.752
CGC 0.410 0.332 0.410 0.810
CGA 0.058 0.010 0.071 0.141
CGG 0.089 0.011 0.094 0.117
AGA 0.030 0.007 0.044 0.159
AGG 0.016 0.004 0.019 0.211
S TCT 0.150 0.323 0.132 2.447
TCC 0.155 0.256 0.136 1.882
TCA 0.117 0.058 0.123 0.472
TCG 0.155 0.057 0.171 0.333
AGT 0.143 0.060 0.158 0.380
AGC 0.280 0.247 0.280 0.882
T ACT 0.168 0.328 0.167 1.964
ACC 0.449 0.508 0.409 1.242
ACA 0.120 0.048 0.154 0.312
ACG 0.263 0.116 0.270 0.430
V GTT 0.257 0.436 0.258 1.690
GTC 0.214 0.113 0.219 0.516
GTA 0.152 0.225 0.153 1.471
GTG 0.377 0.226 0.370 0.611
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.555 0.331 0.582 0.569
TAC 0.445 0.669 0.418 1.600
Table 6A. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Escherichia coli. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.480 0.731 0.403 1.814
TAG 0.225 0.117 0.290 0.403
TGA 0.295 0.152 0.307 0.495
A GCT 0.367 0.593 0.339 1.749
GCC 0.223 0.280 0.215 1.302
GCA 0.296 0.105 0.319 0.329
GCG 0.113 0.023 0.127 0.181
C TGT 0.627 0.829 0.594 1.396
TGC 0.373 0.171 0.406 0.421
D GAT 0.656 0.526 0.642 0.819
GAC 0.344 0.474 0.358 1.324
E GAA 0.701 0.854 0.699 1.222
GAG 0.299 0.146 0.301 0.485
F TTT 0.593 0.353 0.616 0.573
TTC 0.407 0.647 0.384 1.685
G GGT 0.455 0.823 0.387 2.127
GGC 0.197 0.093 0.197 0.472
GGA 0.224 0.051 0.279 0.183
GGG 0.124 0.033 0.138 0.239
H CAT 0.643 0.440 0.617 0.713
CAC 0.357 0.560 0.383 1.462
I ATT 0.463 0.522 0.469 1.113
ATC 0.258 0.430 0.236 1.822
ATA 0.280 0.048 0.295 0.163
K AAA 0.581 0.299 0.639 0.468
AAG 0.419 0.701 0.361 1.942
L TTA 0.279 0.216 0.244 0.885
TTG 0.283 0.567 0.251 2.259
CTT 0.127 0.057 0.163 0.350
CTC 0.057 0.014 0.086 0.163
CTA 0.142 0.103 0.143 0.720
CTG 0.112 0.043 0.113 0.381
M ATG 1.000 1.000 1.000 1.000
N AAT 0.598 0.303 0.594 0.510
AAC 0.402 0.697 0.406 1.717
P CCT 0.310 0.227 0.305 0.744
CCC 0.160 0.053 0.164 0.323
CCA 0.407 0.701 0.401 1.748
CCG 0.123 0.018 0.129 0.140
Q CAA 0.686 0.893 0.663 1.347
CAG 0.314 0.107 0.337 0.318
R CGT 0.140 0.201 0.131 1.534
CGC 0.058 0.017 0.078 0.218
CGA 0.068 0.001 0.088 0.011
CGG 0.040 0.002 0.064 0.031
AGA 0.478 0.724 0.420 1.724
AGG 0.217 0.055 0.218 0.252
S TCT 0.261 0.452 0.246 1.837
TCC 0.157 0.289 0.147 1.966
TCA 0.211 0.108 0.218 0.495
TCG 0.097 0.036 0.096 0.375
AGT 0.163 0.063 0.172 0.366
AGC 0.111 0.051 0.121 0.421
T ACT 0.343 0.482 0.333 1.447
ACC 0.210 0.352 0.213 1.653
ACA 0.307 0.133 0.325 0.409
ACG 0.140 0.034 0.129 0.264
V GTT 0.389 0.511 0.368 1.389
GTC 0.201 0.347 0.210 1.652
GTA 0.216 0.060 0.226 0.265
GTG 0.195 0.082 0.196 0.418
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.568 0.302 0.558 0.541
TAC 0.432 0.698 0.442 1.579
Table 6B. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Saccharomyces cerevisiae. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.496 0.694 0.439 1.581
TAG 0.179 0.141 0.162 0.870
TGA 0.325 0.165 0.399 0.414
A GCT 0.354 0.423 0.325 1.302
GCC 0.199 0.302 0.157 1.924
GCA 0.314 0.198 0.385 0.514
GCG 0.133 0.077 0.134 0.575
C TGT 0.555 0.447 0.588 0.760
TGC 0.445 0.553 0.412 1.342
D GAT 0.679 0.631 0.693 0.911
GAC 0.321 0.369 0.307 1.202
E GAA 0.621 0.534 0.671 0.796
GAG 0.379 0.466 0.329 1.416
F TTT 0.481 0.261 0.605 0.431
TTC 0.519 0.739 0.395 1.871
G GGT 0.204 0.168 0.214 0.785
GGC 0.124 0.086 0.134 0.642
GGA 0.592 0.711 0.544 1.307
GGG 0.080 0.035 0.109 0.321
H CAT 0.611 0.513 0.649 0.790
CAC 0.389 0.487 0.351 1.387
I ATT 0.534 0.470 0.538 0.874
ATC 0.314 0.478 0.226 2.115
ATA 0.152 0.052 0.236 0.220
K AAA 0.588 0.381 0.665 0.573
AAG 0.412 0.619 0.335 1.848
L TTA 0.110 0.049 0.169 0.290
TTG 0.234 0.212 0.258 0.822
CTT 0.249 0.306 0.214 1.430
CTC 0.174 0.280 0.116 2.414
CTA 0.091 0.042 0.112 0.375
CTG 0.142 0.112 0.133 0.842
M ATG 1.000 1.000 1.000 1.000
N AAT 0.625 0.484 0.655 0.739
AAC 0.375 0.516 0.345 1.496
P CCT 0.178 0.126 0.220 0.573
CCC 0.088 0.054 0.100 0.540
CCA 0.532 0.691 0.494 1.399
CCG 0.202 0.130 0.186 0.699
Q CAA 0.651 0.650 0.679 0.957
CAG 0.349 0.350 0.321 1.090
R CGT 0.217 0.350 0.150 2.333
CGC 0.096 0.175 0.067 2.612
CGA 0.236 0.146 0.231 0.632
CGG 0.091 0.046 0.098 0.469
AGA 0.288 0.250 0.357 0.700
AGG 0.071 0.032 0.097 0.330
S TCT 0.206 0.235 0.214 1.098
TCC 0.130 0.177 0.112 1.580
TCA 0.257 0.205 0.273 0.751
TCG 0.156 0.169 0.125 1.352
AGT 0.149 0.104 0.173 0.601
AGC 0.102 0.109 0.103 1.058
T ACT 0.324 0.346 0.329 1.052
ACC 0.175 0.297 0.144 2.062
ACA 0.345 0.249 0.383 0.650
ACG 0.156 0.108 0.143 0.755
V GTT 0.388 0.413 0.407 1.015
GTC 0.220 0.320 0.168 1.905
GTA 0.158 0.097 0.191 0.508
GTG 0.234 0.170 0.234 0.726
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.559 0.414 0.631 0.656
TAC 0.441 0.586 0.369 1.588
Table 6C. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Caenorhabditis elegans. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.345 0.371 0.263 1.411
TAG 0.204 0.194 0.194 1.000
TGA 0.451 0.435 0.543 0.801
A GCT 0.432 0.498 0.383 1.300
GCC 0.161 0.171 0.174 0.983
GCA 0.263 0.221 0.278 0.795
GCG 0.144 0.110 0.164 0.671
C TGT 0.593 0.561 0.591 0.949
TGC 0.407 0.439 0.409 1.073
D GAT 0.674 0.644 0.662 0.973
GAC 0.326 0.356 0.338 1.053
E GAA 0.511 0.442 0.523 0.845
GAG 0.489 0.558 0.477 1.170
F TTT 0.502 0.427 0.515 0.829
TTC 0.498 0.573 0.485 1.181
G GGT 0.334 0.398 0.316 1.259
GGC 0.141 0.119 0.152 0.783
GGA 0.371 0.367 0.387 0.948
GGG 0.154 0.115 0.145 0.793
H CAT 0.606 0.526 0.612 0.859
CAC 0.394 0.474 0.388 1.222
I ATT 0.400 0.429 0.375 1.144
ATC 0.363 0.432 0.373 1.158
ATA 0.236 0.139 0.252 0.552
K AAA 0.490 0.385 0.517 0.745
AAG 0.510 0.615 0.483 1.273
L TTA 0.135 0.082 0.148 0.554
TTG 0.220 0.233 0.229 1.017
CTT 0.257 0.290 0.248 1.169
CTC 0.181 0.207 0.172 1.203
CTA 0.105 0.080 0.121 0.661
CTG 0.102 0.108 0.082 1.317
M ATG 1.000 1.000 1.000 1.000
N AAT 0.502 0.430 0.489 0.879
AAC 0.498 0.570 0.511 1.115
P CCT 0.381 0.407 0.353 1.153
CCC 0.106 0.112 0.109 1.028
CCA 0.327 0.336 0.351 0.957
CCG 0.186 0.146 0.186 0.785
Q CAA 0.564 0.465 0.648 0.718
CAG 0.436 0.535 0.352 1.520
R CGT 0.168 0.241 0.161 1.497
CGC 0.070 0.077 0.068 1.132
CGA 0.118 0.087 0.120 0.725
CGG 0.092 0.059 0.086 0.686
AGA 0.352 0.301 0.363 0.829
AGG 0.199 0.234 0.202 1.158
S TCT 0.280 0.303 0.253 1.198
TCC 0.129 0.147 0.127 1.157
TCA 0.204 0.178 0.212 0.840
TCG 0.108 0.100 0.114 0.877
AGT 0.151 0.139 0.158 0.880
AGC 0.127 0.134 0.135 0.993
T ACT 0.334 0.374 0.300 1.247
ACC 0.207 0.260 0.213 1.221
ACA 0.302 0.253 0.313 0.808
ACG 0.157 0.114 0.175 0.651
V GTT 0.400 0.432 0.372 1.161
GTC 0.193 0.219 0.199 1.101
GTA 0.145 0.095 0.157 0.605
GTG 0.262 0.253 0.271 0.934
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.504 0.418 0.508 0.823
TAC 0.496 0.582 0.492 1.183
Table 6D. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Arabidopsis thaliana. Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
AA Triplet All Top 5% Bottom 5% Top/Bottom
* TAA 0.258 0.351 0.323 1.087
TAG 0.235 0.222 0.253 0.877
TGA 0.507 0.427 0.424 1.007
A GCT 0.289 0.320 0.316 1.013
GCC 0.377 0.331 0.340 0.974
GCA 0.232 0.246 0.266 0.925
GCG 0.101 0.103 0.078 1.321
C TGT 0.476 0.516 0.507 1.018
TGC 0.524 0.484 0.493 0.982
D GAT 0.450 0.521 0.500 1.042
GAC 0.550 0.479 0.500 0.958
E GAA 0.412 0.466 0.495 0.941
GAG 0.588 0.534 0.505 1.057
F TTT 0.445 0.507 0.499 1.016
TTC 0.555 0.493 0.501 0.984
G GGT 0.175 0.208 0.197 1.056
GGC 0.332 0.319 0.287 1.111
GGA 0.257 0.272 0.313 0.869
GGG 0.236 0.201 0.204 0.985
H CAT 0.410 0.468 0.472 0.992
CAC 0.590 0.532 0.528 1.008
I ATT 0.343 0.404 0.362 1.116
ATC 0.495 0.448 0.419 1.069
ATA 0.162 0.148 0.219 0.676
K AAA 0.398 0.407 0.471 0.864
AAG 0.602 0.593 0.529 1.121
L TTA 0.068 0.089 0.095 0.937
TTG 0.132 0.152 0.152 1.000
CTT 0.132 0.154 0.154 1.000
CTC 0.194 0.169 0.176 0.960
CTA 0.079 0.079 0.092 0.859
CTG 0.396 0.357 0.331 1.079
M ATG 1.000 1.000 1.000 1.000
N AAT 0.436 0.481 0.501 0.960
AAC 0.564 0.519 0.499 1.040
P CCT 0.306 0.335 0.316 1.060
CCC 0.298 0.250 0.275 0.909
CCA 0.288 0.310 0.323 0.960
CCG 0.108 0.105 0.086 1.221
Q CAA 0.253 0.258 0.350 0.737
CAG 0.747 0.742 0.650 1.142
R CGT 0.084 0.105 0.080 1.312
CGC 0.170 0.153 0.122 1.254
CGA 0.123 0.145 0.104 1.394
CGG 0.194 0.179 0.128 1.398
AGA 0.213 0.232 0.318 0.730
AGG 0.216 0.186 0.249 0.747
S TCT 0.193 0.222 0.220 1.009
TCC 0.211 0.195 0.188 1.037
TCA 0.143 0.149 0.170 0.876
TCG 0.054 0.057 0.039 1.462
AGT 0.156 0.171 0.174 0.983
AGC 0.243 0.206 0.209 0.986
T ACT 0.249 0.273 0.275 0.993
ACC 0.345 0.313 0.312 1.003
ACA 0.295 0.314 0.328 0.957
ACG 0.111 0.099 0.085 1.165
V GTT 0.174 0.225 0.217 1.037
GTC 0.245 0.215 0.241 0.892
GTA 0.119 0.138 0.146 0.945
GTG 0.461 0.423 0.395 1.071
W TGG 1.000 1.000 1.000 1.000
Y TAT 0.423 0.481 0.498 0.966
TAC 0.577 0.519 0.502 1.034
Table 6E. Relative synonymous codon use frequency averages of all genes and gene subsets based on expression for Mus musculus (Animalia). Gene subsets were defined by expression in terms of percentage; top 5% high-, bottom 5% low-expressed. The fold change in codon use comparing high to low expressed genes (Top/Bottom) was also calculated.
Top 5% Top 5% Top 5% Top 5%
Trait Organism Stem_size_mean Stem_size_sd Stem_size_max Stem_size_min
Protein abundance A. thaliana 5.197742798 3.333316648 18.60493827 1.082304527
Gene expression A. thaliana 5.264773876 3.354989119 18.67107195 1.118942731
Protein abundance C. elegans 4.949884209 3.035095428 16.98275862 1.161637931
Gene expression C. elegans 4.950296788 3.048596544 17.30588235 1.129411765
Protein abundance E. coli 5.127421075 3.127080268 17.00909091 1.227272727
Gene expression E. coli 5.157297589 3.162030121 17.54285714 1.214285714
Protein abundance M. musculus 5.063991554 3.236283472 18.29166667 1.078125
Gene expression M. musculus 5.081367307 3.237828152 18.43329098 1.095298602
Protein abundance S. cerevisiae 5.254440541 3.230034739 18.21167883 1.237226277
Gene expression S. cerevisiae 5.262132835 3.23936481 18.01766784 1.247349823
Table 9. Analysis of the mRNA secondary structure characteristics (stem architecture) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Table 10. Analysis of the mRNA secondary structure characteristics (loop architecture) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Top 5% Top 5% Top 5%
Trait Organism Bound_nt/1000 nt Energy_(kcal/mol)/1000 nt Transitions/1000 nt
Protein abundance A. thaliana 619.3179412 -292.7800618 119.7782583
Gene expression A. thaliana 624.0580406 -290.3511673 119.169408
Protein abundance C. elegans 598.4571065 -272.5292233 121.7225865
Gene expression C. elegans 596.5470187 -273.9225057 121.3996132
Protein abundance E. coli 627.3154158 -319.8163586 123.3964781
Gene expression E. coli 631.9373347 -327.7643057 123.4152453
Protein abundance M. musculus 616.1866207 -327.7746785 122.4372787
Gene expression M. musculus 612.9621408 -313.9661558 121.3436794
Protein abundance S. cerevisiae 606.4041095 -255.5194926 116.2875481
Gene expression S. cerevisiae 605.1063803 -255.9553594 115.8779268
Table 11. Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Gene expression S. cerevisiae 5.274054838 3.147229903 16.84751773 1.365248227
Protein abundance S. cerevisiae 5.34944781 3.244190265 19.52380952 1.102564103
Table 12. Analysis of the mRNA secondary structure characteristics (stem architecture) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Table 13. Analysis of the mRNA secondary structure characteristics (loop architecture) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Table 14. Analysis of the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Table 15. Differences in the mRNA secondary structure characteristics (stem architecture) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Delta (top-bottom) Delta (top-bottom) Delta (top-bottom)
Trait Organism Loop_size_mean Loop_size_sd Loop_size_max
Gene expression A. thaliana -0.267692548 -0.347578028 -1.583865892
Protein abundance A. thaliana -0.175253334 -0.312156894 -4.174836883
Gene expression C. elegans -0.419326762 -0.52092485 -3.334682506
Protein abundance C. elegans -0.154072143 -0.309808645 -4.418251447
Gene expression E. coli -0.186479295 -0.31462739 -3.024198823
Protein abundance E. coli -0.19510469 -0.35983994 -4.111271298
Gene expression M. musculus -0.224252393 -0.288729208 -2.917011238
Protein abundance M. musculus 0.08059553 0.055306019 -2.037498481
Gene expression S. cerevisiae -0.778634452 -1.077665405 -3.962468292
Protein abundance S. cerevisiae -0.364963788 -0.580120518 -5.694456309
Table 16. Differences in the mRNA secondary structure characteristics (loop architecture) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Table 17. Differences in the mRNA secondary structure characteristics (bound nucleotides, energy, stem-loop transitions) of the top and bottom 5% expressed genes in Escherichia coli (Bacteria), Saccharomyces cerevisiae (Fungi), Caenorhabditis elegans (Animalia), Arabidopsis thaliana (Plantae) and Mus musculus (Animalia).
Claims
1. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of:
a. providing a library of polynucleotides each of which vary at a minimum of a single codon position;
b. analyzing the secondary structure of each mRNA corresponding to a polynucleotide sequence of the library in silico under the temperature and salt concentrations relevant for the preferred host; and
c. selecting a polynucleotide having at least 110 and fewer than 250 stem loop transitions per kilobase pair (kbp); and
d. synthesising said polynucleotide.
2. A method as claimed in claim 1, wherein the method further comprises selecting a polynucleotide having a maximum stem size of less than 19 bp.
3. A method as claimed in claim 2, wherein the method further comprises selecting a polynucleotide having a maximum loop size of less than 20 bp.
4. A method as claimed in claim 3, wherein the host cell is a prokaryotic cell.
5. A method as claimed in claim 4, wherein the host cell is a bacterial cell.
6. A method as claimed in claim 5, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.45 bp and 2.50 bp.
7. A method as claimed in claim 5 or claim 6, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.16 bp and 2.00 bp.
8. A method as claimed in any of claims 5 to 7, wherein the host cell is an Escherichia coli cell.
9. A method as claimed in claim 3, wherein the host cell is a eukaryotic cell.
10. A method as claimed in claim 9, wherein the host cell is a plant cell.
11. A method as claimed in claim 10, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.20 bp and 2.50 bp.
12. A method as claimed in claim 10 or claim 11, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.27 bp and 3.00 bp.
13. A method as claimed in any of claims 10 to 12, wherein the host cell is an Arabidopsis cell, optionally an Arabidopsis thaliana cell.
14. A method as claimed in claim 9, wherein the host cell is a fungal cell.
15. A method as claimed in claim 14, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.27 bp and 2.50 bp.
16. A method as claimed in claim 14 or claim 15, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.77 bp and 3.00 bp.
17. A method as claimed in any of claims 14 to 16, wherein the host cell is a Saccharomyces cell, optionally a Saccharomyces cerevisiae cell.
18. A method as claimed in claim 9, wherein the host cell is an animal cell.
19. A method as claimed in claim 18, wherein the host cell is a nematode cell.
20. A method as claimed in claim 19, wherein the method further comprises selecting a polynucleotide having a mean stem size between 5.35 bp and 2.50 bp.
21. A method as claimed in claim 19 or claim 20, wherein the method further comprises selecting a polynucleotide having a mean loop size between 3.47 bp and 3.00 bp.
22. A method as claimed in any of claims 19 to 21, wherein the host cell is a Caenorhabditis elegans cell.
23. A method as claimed in claim 18, wherein the host cell is a mammalian cell.
24. A method as claimed in claim 23, wherein the method further comprises selecting a polynucleotide having a mean stem size between 4.35 bp and 2.50 bp.
25. A method as claimed in claim 23 or claim 24, wherein the method further comprises selecting a polynucleotide having a mean loop size between 5.18 bp and 4.00 bp.
26. A method as claimed in any of claims 23 to 25, wherein the host cell is a Mus musculus cell.
27. A method as claimed in any of claims 4 to 26, wherein the method further comprises selecting a polynucleotide from a library of synonymous variants wherein the codon usage of the selected polynucleotide most closely matches the most abundant tRNAs in a particular host cell.
28. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
the host cell being selected from a prokaryotic cell, a fungal cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
29. A method as claimed in claim 28, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
30. A method as claimed in claim 28 or claim 29, wherein the host cell is a prokaryotic cell.
31. A method as claimed in claim 30, wherein the host cell is a bacterial cell.
32. A method as claimed in claim 31, wherein the host cell is an Escherichia coli cell.
33. A method as claimed in any of claims 30 to 32, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
; and/or:
; and/or:
; and/or:
; and/or:
34. A method as claimed in claim 33, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
35. A method as claimed in claim 28 or claim 29, wherein the host cell is a fungal cell.
36. A method as claimed in claim 35, wherein the host cell is a Saccharomyces cerevisiae cell.
37. A method as claimed in claim 35 or claim 36, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
; and/or:
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Isoleucine | ATA | ATC or ATT |
; and/or:
Amino Acid DNA Codon Replacement Codon
; and/or:
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Glutamine | CAG | CAA |
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Glutamic acid | GAG | GAA |
38. A method as claimed in claim 37, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
39. A method as claimed in claim 28 or claim 29, wherein the host cell is a nematode cell.
40. A method as claimed in claim 39, wherein the host cell is a Caenorhabditis elegans cell.
41. A method as claimed in claim 39 or claim 40, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Isoleucine | ATA or ATT | ATC |
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Threonine | ACT, ACA or ACG | ACC |
; and/or:
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Cysteine | TGT | TGC |
; and/or:
; and/or:
42. A method as claimed in claim 41, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
43. A method as claimed in claim 28 or claim 29, wherein the host cell is a Mus musculus cell.
44. A method as claimed in claim 43, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Alanine | GCC or GCA | GCG or GCT |
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Proline | CCT, CCC or CCA | CCG |
; and/or:
; and/or:
; and/or:
45. A method as claimed in claim 44, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
46. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a host cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Histidine | CAT | CAC |
Lysine | AAA | AAG |
Asparagine | AAT | AAC |
Tyrosine | TAT | TAC |
Stop Codon | TAG or TGA | TAA |
Alanine | GCC, GCA or GCG | GCT |
Glycine | GGC, GGA or GGG | GGT |
Isoleucine | ATT or ATA | ATC |
Arginine | CGC, CGA, CGG, AGA or AGG | CGT |
Serine | TCT, TCA, TCG, AGT or AGC | TCC |
Threonine | ACT, ACA or ACG | ACC |
Valine | GTC, GTA or GTG | GTT |
the host cell being selected from a prokaryotic cell, a fungal cell, a plant cell, a protist cell or an animal cell; and wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
47. A method as claimed in claim 46, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
48. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Histidine | CAT | CAC |
Lysine | AAA | AAG |
Asparagine | AAT | AAC |
Tyrosine | TAT | TAC |
Stop Codon | TAG or TGA | TAA |
Leucine | CTT, CTC, CTA, TTA or TTG | CTG |
wherein modifying the codon composition of the starting polynucleotide sequence results in an increase in functional expression of the heterologous protein in the host cell compared with that of the native sequence.
49. A method as claimed in claim 48, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
; and/or:
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Valine | GTC, GTA or GTG | GTT |
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Proline | CCC, CCA or CCG | CCT |
; and/or:
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Isoleucine | ATT or ATA | ATC |
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Glutamine | CAA | CAG |
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Arginine | CGC, CGA, CGG, AGA or AGG | CGT |
50. A method as claimed in claim 49, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
51. A method of providing a DNA comprising a coding sequence for functional expression of a heterologous protein in a plant cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
52. A method as claimed in claim 51, wherein the method further comprises modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table:
; and/or:
; and/or:
; and/or:
53. A method as claimed in any preceding claim, wherein the starting polynucleotide sequence is the wild-type coding sequence.
54. A method as claimed in any preceding claim, wherein the polynucleotide sequence is present or inserted into an expression vector.
55. A method as claimed in claim 54, wherein the expression vector is further introduced into a host cell.
56. A method as claimed in claim 55, wherein the host cell is cultured to produce the heterologous protein.
57. A method of expressing a heterologous protein in a plant cell comprising the steps of;
a. providing a polynucleotide sequence which encodes a protein of interest; and has one or more of the codons in the following table; and
b. modifying substantially all or all of the polynucleotide sequence using replacement codons according to the following table;
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Alanine | GCT, GCA or GCG | GCC |
Arginine | CGC, CGA, CGG, AGA or AGG | CGT |
Asparagine | AAT | AAC |
Aspartic acid | GAT | GAC |
Cysteine | TGT | TGC |
Glutamic acid | GAA | GAG |
Glutamine | CAA | CAG |
Glycine | GGC, GGA or GGG | GGT |
Histidine | CAT | CAC |
Isoleucine | ATT or ATA | ATC |
Leucine | CTT, CTA, CTG, TTA or TTG | CTC |
Lysine | AAA | AAG |
Phenylalanine | TTT | TTC |
Proline | CCT, CCA or CCG | CCC |
Serine | TCT, TCA, TCG, AGT or AGC | TCC |
Threonine | ACT, ACA or ACG | ACC |
Tyrosine | TAT | TAC |
Valine | GTT, GTA or GTG | GTC |
Stop codons | TAG or TGA | TAA |
c. inserting the polynucleotide sequence into an expression vector;
d. introducing said expression vector into a host cell; and
e. culturing the host cell to produce the heterologous protein;
optionally wherein the corresponding codons are changed according to the following table;
; and/or:
Amino Acid | DNA Codon | Replacement Codon |
---|---|---|
Leucine | CTT, CTA, CTC, TTA or TTG | CTG |
; and/or:
; and/or:
; and/or:
58. A method as claimed in claim 57, wherein the method comprises modifying each codon in the polynucleotide sequence for which a synonymous codon exists.
59. A method as claimed in any of claims 46 to 58, wherein the host cell is an Arabidopsis cell.
60. A method as claimed in any preceding claim further comprising;
analysing the secondary structure of mRNA corresponding to the resulting polynucleotide sequence; and
incorporating in said polynucleotide sequence a pattern of optimal and non-optimal codons at a site associated with provision of a structural motif;
wherein said pattern enables increased expression efficiency of said protein in said host cell compared with the synonymous coding sequence containing solely optimal codons, wherein optimal codons are those codons pre-calculated to provide the highest functional expression of heterologous protein in the host cell or the sole possible codon.
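The codon replacement step recited in claim 46 can be sketched in a few lines of Python. This is a minimal illustration and not the applicant's implementation: the names `CLAIM_46_TABLE` and `recode` are hypothetical, the input is assumed to be an in-frame DNA coding sequence, and the Arginine and Serine rows are read as "CGC, CGA, CGG, AGA or AGG to CGT" and "TCT, TCA, TCG, AGT or AGC to TCC", which is an interpretation of the flattened table.

```python
# Illustrative transcription of the claim 46 replacement table.
# Assumption: Arginine codons map to CGT and Serine codons map to TCC.
CLAIM_46_TABLE = {
    "CAT": "CAC",                                            # Histidine
    "AAA": "AAG",                                            # Lysine
    "AAT": "AAC",                                            # Asparagine
    "TAT": "TAC",                                            # Tyrosine
    "TAG": "TAA", "TGA": "TAA",                              # Stop codon
    "GCC": "GCT", "GCA": "GCT", "GCG": "GCT",                # Alanine
    "GGC": "GGT", "GGA": "GGT", "GGG": "GGT",                # Glycine
    "ATT": "ATC", "ATA": "ATC",                              # Isoleucine
    "CGC": "CGT", "CGA": "CGT", "CGG": "CGT",
    "AGA": "CGT", "AGG": "CGT",                              # Arginine
    "TCT": "TCC", "TCA": "TCC", "TCG": "TCC",
    "AGT": "TCC", "AGC": "TCC",                              # Serine
    "ACT": "ACC", "ACA": "ACC", "ACG": "ACC",                # Threonine
    "GTC": "GTT", "GTA": "GTT", "GTG": "GTT",                # Valine
}


def recode(cds, table=CLAIM_46_TABLE):
    """Replace every codon listed in `table`; leave all other codons unchanged."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(table.get(codon, codon) for codon in codons)


if __name__ == "__main__":
    # ATG CAT AAA GCC AGA TGA -> ATG CAC AAG GCT CGT TAA
    print(recode("ATGCATAAAGCCAGATGA"))
```

Because every replacement is synonymous, the encoded protein is unchanged; codons absent from the table (for example ATG) pass through untouched. The tables of the other claims could be substituted through the `table` parameter.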
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2014/076436 WO2016086988A1 (en) | 2014-12-03 | 2014-12-03 | Optimisation of coding sequence for functional protein expression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2014/076436 WO2016086988A1 (en) | 2014-12-03 | 2014-12-03 | Optimisation of coding sequence for functional protein expression |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016086988A1 true WO2016086988A1 (en) | 2016-06-09 |
Family
ID=52007021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2014/076436 WO2016086988A1 (en) | 2014-12-03 | 2014-12-03 | Optimisation of coding sequence for functional protein expression |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2016086988A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018013720A1 (en) * | 2016-07-12 | 2018-01-18 | Washington University | Incorporation of internal polya-encoded poly-lysine sequence tags and their variations for the tunable control of protein synthesis in bacterial and eukaryotic cells |
CN113851190A (en) * | 2021-11-01 | 2021-12-28 | 四川大学华西医院 | Heterogeneous mRNA sequence optimization method |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1989000604A1 (en) * | 1987-07-13 | 1989-01-26 | Interferon Sciences, Inc. | Method for improving translation efficiency |
WO2001055342A2 (en) * | 2000-01-31 | 2001-08-02 | Biocatalytics, Inc. | Synthetic genes for enhanced expression |
WO2001068835A2 (en) * | 2000-03-13 | 2001-09-20 | Aptagen | Method for modifying a nucleic acid |
WO2002098443A2 (en) * | 2001-06-05 | 2002-12-12 | Curevac Gmbh | Stabilised mrna with an increased g/c content and optimised codon for use in gene therapy |
WO2002099105A2 (en) * | 2001-06-05 | 2002-12-12 | Cellectis | Methods for modifying the cpg content of polynucleotides |
WO2006097945A2 (en) * | 2005-03-17 | 2006-09-21 | Zenotech Laboratories Limited | A method for achieving high-level expression of recombinant human interleukin-2 upon destabilization of the rna secondary structure |
WO2006107954A2 (en) * | 2005-04-05 | 2006-10-12 | Pioneer Hi-Bred International, Inc. | Methods and compositions for designing nucleic acid molecules for polypeptide expression in plants using plant virus codon-bias |
WO2007142954A2 (en) * | 2006-05-30 | 2007-12-13 | Dow Global Technologies Inc. | Codon optimization method |
WO2009049350A1 (en) * | 2007-10-15 | 2009-04-23 | The University Of Queensland | Expression system for modulating an immune response |
WO2011111034A1 (en) * | 2010-03-08 | 2011-09-15 | Yeda Research And Development Co. Ltd. | Recombinant protein production in heterologous systems |
- 2014-12-03 WO PCT/EP2014/076436 patent/WO2016086988A1/en active Application Filing
Non-Patent Citations (6)
Title |
---|
ANDRONESCU MIRELA ET AL: "Efficient parameter estimation for RNA secondary structure prediction.", BIOINFORMATICS (OXFORD, ENGLAND) 1 JUL 2007, vol. 23, no. 13, 1 July 2007 (2007-07-01), pages i19 - i28, XP002738330, ISSN: 1367-4811 * |
JIA M ET AL: "The relationship among gene expression, folding free energy and codon usage bias in Escherichia coli", FEBS LETTERS, ELSEVIER, AMSTERDAM, NL, vol. 579, no. 24, 10 October 2005 (2005-10-10), pages 5333 - 5337, XP027697304, ISSN: 0014-5793, [retrieved on 20051010] * |
LIANGJIANG WANG ET AL: "Comparative analysis of expressed sequences reveals a conserved pattern of optimal codon usage in plants", PLANT MOLECULAR BIOLOGY, KLUWER ACADEMIC PUBLISHERS, DORDRECHT, NL, vol. 61, no. 4-5, 1 July 2006 (2006-07-01), pages 699 - 710, XP019405470, ISSN: 1573-5028, DOI: 10.1007/S11103-006-0041-8 * |
LORENZ RONNY ET AL: "ViennaRNA Package 2.0.", ALGORITHMS FOR MOLECULAR BIOLOGY : AMB 2011, vol. 6, 26, 2011, pages 1 - 14, XP002738329, ISSN: 1748-7188 * |
MURRAY E E ET AL: "CODON USAGE IN PLANT GENES", NUCLEIC ACIDS RESEARCH, OXFORD UNIVERSITY PRESS, GB, vol. 17, no. 2, 25 January 1989 (1989-01-25), pages 477 - 498, XP000008653, ISSN: 0305-1048 * |
NAKAMURA M ET AL: "Translation efficiencies of synonymous codons are not always correlated with codon usage in tobacco chloroplasts", THE PLANT JOURNAL, BLACKWELL SCIENTIFIC PUBLICATIONS, OXFORD, GB, vol. 49, no. 1, 28 November 2006 (2006-11-28), pages 128 - 134, XP008133694, ISSN: 0960-7412 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018013720A1 (en) * | 2016-07-12 | 2018-01-18 | Washington University | Incorporation of internal polya-encoded poly-lysine sequence tags and their variations for the tunable control of protein synthesis in bacterial and eukaryotic cells |
US11603533B2 (en) | 2016-07-12 | 2023-03-14 | Washington University | Incorporation of internal polya-encoded poly-lysine sequence tags and their variations for the tunable control of protein synthesis in bacterial and eukaryotic cells |
CN113851190A (en) * | 2021-11-01 | 2021-12-28 | 四川大学华西医院 | Heterogeneous mRNA sequence optimization method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14806629; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 14806629; Country of ref document: EP; Kind code of ref document: A1 |