WO2008153676A2 - Nucleotide sequences encoding fermentation and pentose phosphate pathway enzymes with refined translational kinetics, and methods of making same

Nucleotide sequences encoding fermentation and pentose phosphate pathway enzymes with refined translational kinetics, and methods of making same

Info

Publication number
WO2008153676A2
Authority
WO
WIPO (PCT)
Prior art keywords
nucleotides
replaced
codon
amino acids
seq
Application number
PCT/US2008/006378
Other languages
English (en)
Other versions
WO2008153676A3 (fr
Inventor
Kirsty A. Salmon
David A. Roth
G. Wesley Hatfield
Yimeng Dou
Original Assignee
The Regents Of The University Of California
Application filed by The Regents Of The University Of California
Publication of WO2008153676A2
Publication of WO2008153676A3

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N9/00 Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
    • C12N9/10 Transferases (2.)
    • C12N9/1022 Transferases (2.) transferring aldehyde or ketonic groups (2.2)
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/11 DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
    • C12N15/52 Genes encoding for enzymes or proenzymes
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N9/00 Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
    • C12N9/0004 Oxidoreductases (1.)
    • C12N9/0006 Oxidoreductases (1.) acting on CH-OH groups as donors (1.1)
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N9/00 Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
    • C12N9/90 Isomerases (5.)
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12P FERMENTATION OR ENZYME-USING PROCESSES TO SYNTHESISE A DESIRED CHEMICAL COMPOUND OR COMPOSITION OR TO SEPARATE OPTICAL ISOMERS FROM A RACEMIC MIXTURE
    • C12P7/00 Preparation of oxygen-containing organic compounds
    • C12P7/02 Preparation of oxygen-containing organic compounds containing a hydroxy group
    • C12P7/04 Preparation of oxygen-containing organic compounds containing a hydroxy group acyclic
    • C12P7/06 Ethanol, i.e. non-beverage
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02E REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E50/00 Technologies for the production of fuel of non-fossil origin
    • Y02E50/10 Biofuels, e.g. bio-diesel

Definitions

  • the present invention relates to refining the translational kinetics of an mRNA into polypeptide, and polypeptide-encoding nucleotide sequences which have refined translational properties.
  • Saccharomyces yeasts have proven to be safe, effective and user-friendly microorganisms for large-scale production of industrial ethanol from glucose-based feedstocks. Recently, efforts have been made to use cellulosic biomass as feedstock for producing ethanol.
  • the major fermentable sugars from hydrolysis of these feedstocks (such as rice and wheat straw, sugarcane bagasse, corn stover, corn fibre, softwood, hardwood and grasses) are D-glucose, L-arabinose and D-xylose.
  • the Saccharomyces yeasts are not able to use arabinose or xylose for growth or production of ethanol.
  • yeast and other microorganisms that can co-ferment glucose, arabinose and xylose simultaneously to ethanol through expression of the enzymes involved in the arabinose and xylose fermentation pathways.
  • Such pathways have been identified in yeast, filamentous fungi and other eukaryotes.
  • Related pathways utilizing distinct enzymes have been identified in bacteria.
  • Pentose phosphate pathway (PPP) and/or fermentation enzymes do not express well in host organisms such as Escherichia coli or Saccharomyces cerevisiae. As a result, large-scale production is limited. Therefore, there is a continued need for improved expression of these enzymes.
  • Some translational pauses result from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, expressing a gene that is foreign to the expression organism can lead to inefficient translation and poor expression. Even when the gene is translated efficiently enough to yield recoverable quantities of the translation product, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pause structures coded for by specific di-codon nucleotide sequences in the open reading frame (ORF) can improve protein expression.
  • ORF open reading frame
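The codon pairs discussed throughout this record are quoted with six-nucleotide ranges such as "268 - 273", and neighbouring ranges overlap by one codon (for example 679 - 684 and 682 - 687). The following minimal Python sketch shows one way to enumerate in-frame codon pairs using that coordinate convention. The function name `codon_pairs` and the toy ORF are hypothetical and are not taken from the patent.

```python
def codon_pairs(orf: str):
    """Yield (start, end, pair) for every in-frame codon pair.

    Coordinates are 1-based and inclusive, so each pair spans six
    nucleotides and consecutive pairs overlap by one codon.
    """
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    for i in range(0, len(orf) - 5, 3):
        yield i + 1, i + 6, orf[i:i + 6]


# Hypothetical toy ORF (not a sequence from the patent): ATG TTG AAA GGT ATT TAA
toy_orf = "ATGTTGAAAGGTATTTAA"
pairs = list(codon_pairs(toy_orf))
for start, end, pair in pairs:
    print(f"{pair} (nucleotides {start} - {end})")
```

A table of per-host pause predictions could then be looked up for each yielded pair; how those predictions are derived is not specified in this excerpt.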
  • PPP and/or fermentation enzyme-encoding nucleotide sequences with refined translational kinetics and methods of designing and synthesizing the same.
  • a PPP and/or fermentation enzyme-encoding nucleotide sequence wherein the encoded sequence has amino acid sequence identity with an original PPP and/or fermentation enzyme polypeptide, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing original codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the resultant PPP and/or fermentation enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length.
  • Expression of the resultant PPP and/or fermentation enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression.
  • expression of the resultant PPP and/or fermentation enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression products in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated enzyme.
  • PPP and/or fermentation enzyme-encoding nucleotide sequences wherein the encoded sequence has amino acid sequence identity with an original PPP and/or fermentation enzyme-encoding nucleotide sequence and is adapted for expression in a heterologous host organism, wherein at least 1, 2, or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
  • the host organism is not human, E. coli or S. cerevisiae.
  • transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 268 - 273); TTGAAA (nucleotides 526 - 531); TTGAAA (nucleotides 802 - 807); AAGAAA (nucleotides 964 - 969); GGTATT (nucleotides 490 - 495); GGTATT (nucleotides 679 - 684); GGTATT (nucleotides 1261 - 1266); GGTATT (nucleotides 1297 - 1302); GGTATT (nucleotides 1915 - 1920); ACTTTA (nucleotides 1474 - 1479); TTGAAC (nucleotides 1345 - 1350); ACTTTG (nucleotides 1744 - 1749); GCTACT (nucleotides 1702 - 1707); GATATT (nucleocleot
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TTGAAA (nucleotides 268 - 273) replaced with TTAAAA; TTGAAA (nucleotides 526 - 531) replaced with TTAAAA; TTGAAA (nucleotides 802 - 807) replaced with TTAAAA; AAGAAA (nucleotides 964 - 969) replaced with AAGAAG; GGTATT (nucleotides 490 - 495) replaced with GGAATT; GGTATT (nucleotides 679 - 684) replaced with GGAATT; GGTATT (nucleotides 1261 - 1266) replaced with GGAATT; GGTATT (nucleotides 1297 - 1302)
  • transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAAGAG (nucleotides 1993 - 1998); TTGCTG (nucleotides 1987 - 1992); ATTGCC (nucleotides 544 - 549); ATTGCC (nucleotides 682 - 687); GAAGTG (nucleotides 1714 - 1719).
  • At least 3 of the following codon pair replacements have been made: AAAGAG (nucleotides 1993 - 1998) replaced with AAAGAA; TTGCTG (nucleotides 1987 - 1992) replaced with CTGTTG; ATTGCC (nucleotides 544 - 549) replaced with ATCGCG; ATTGCC (nucleotides 682 - 687) replaced with ATCGCG; GAAGTG (nucleotides 1714 - 1719) replaced with GAAGTT.
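The replacements listed above are described as encoding identical amino acids. The sketch below checks that property for the listed pairs by translating each pair with the standard genetic code. The helper names (`CODON_TABLE`, `translate`, `REPLACEMENTS`) are hypothetical; the codon pairs themselves are copied from the list above.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq: str) -> str:
    """Translate an in-frame DNA string into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


# (original pair, replacement pair) as quoted in the text above.
REPLACEMENTS = [
    ("AAAGAG", "AAAGAA"),
    ("TTGCTG", "CTGTTG"),
    ("ATTGCC", "ATCGCG"),
    ("GAAGTG", "GAAGTT"),
]

synonymous = {
    old: translate(old) == translate(new) for old, new in REPLACEMENTS
}
for old, new in REPLACEMENTS:
    print(old, "->", new, translate(old), translate(new), synonymous[old])
```

A check of this kind only confirms that the amino acid sequence is preserved; it says nothing about whether the replacement pair actually pauses less in a given host.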
  • the nucleotide sequence is optimized for expression in E. coli.
  • transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 964 - 969); TCCAAG (nucleotides 70 - 75); TCCAAG (nucleotides 712 - 717); TCCAAG (nucleotides 1567 - 1572); ATCAAG (nucleotides 1762 - 1767); GATATT (nucleotides 1687 - 1692); TTGAAA (nucleotides 268 - 273); TTGAAA (nucleotides 526 - 531); TTGAAA (nucleotides 802 - 807); TTCAAC (nucleotides 844 - 849); GGTATT (nucleotides 490 - 495); GGTATT (nucleotides 679 - 684); GGTATT (nucleotides 1261 - 1266); GGTATT (nucleotides 1261
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 964 - 969) replaced with AAAAAG; TCCAAG (nucleotides 70 - 75) replaced with AGCAAA; TCCAAG (nucleotides 712 - 717) replaced with TCTAAA; TCCAAG (nucleotides 1567 - 1572) replaced with TCTAAA; ATCAAG (nucleotides 1762 - 1767) replaced with ATCAAG; GATATT (nucleotides 1687 - 1692) replaced with GACATC; TTGAAA (nucleotides 268 - 273) replaced with CTGAAA; TTGAAA (nucleotides 526 - 53
  • transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 268 - 273); GGTATC (nucleotides 361 - 366); TTGAAA (nucleotides 526 - 531); GCCAAG (nucleotides 685 - 690); CTTCGA (nucleotides 766 - 771); TTGAAA (nucleotides 802 - 807); AAGAAA (nucleotides 964 - 969); TTCCCA (nucleotides 970 - 975); GGCCAA (nucleotides 1009
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • TTGAAA nucleotides 268 - 273
  • GGTATC nucleotides 361 - 366
  • GGAATT TTGAAA
  • GCCAAG nucleotides 685 - 690
  • CTTCGA nucleotides 766 - 771
  • TTGAAA nucleotides 802 - 807
  • nucleotide sequence is optimized for expression in K. lactis.
  • transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 361 - 366); CGGGCA (nucleotides 427 - 432); ACCTAT (nucleotides 454 - 459); GGTATT (nucleotides 490 - 495); GAAGCC (nucleotides 625 - 630); GCCGGT (nucleotides 676 - 681); GGTATT (nucleotides 679 - 684); GCCAAG (nucleotides 685 - 690); GCTATT (nucleotides 691 - 696); TTATCC (nucleotides 709 - 714); GAAGCC (nucleotides 922 - 927); GCCAAG (nucleotides 1054 - 1059); GAAGCC (nucleotides 1
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGTATC (nucleotides 361 - 366) replaced with GGCATT; CGGGCA (nucleotides 427 - 432) replaced with CGTGCC; ACCTAT (nucleotides 454 - 459) replaced with ACGTAT; GGTATT (nucleotides 490 - 495) replaced with GGGATT; GAAGCC (nucleotides 625 - 630) replaced with GAAGCG; GCCGGT (nucleotides 676 - 681) replaced with GCAGGA; GGTATT (nucleotides 679 - 684) replaced with GGAATT; GCCAAG (nucleotides 361 - 366) replaced with GGCATT
  • transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
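The threshold definition above (a translational kinetics value greater than 5, 3, 2.5 or 2 times the standard deviation for the host organism) can be sketched as a simple filter. This is only an illustration under stated assumptions: the function name is hypothetical, the standard deviation is taken over whatever table of values is supplied, and the numeric values in `toy_values` are invented for the example and are not data from the patent.

```python
from statistics import pstdev


def highly_overrepresented(values: dict, k: float = 3.0) -> set:
    """Return the codon pairs whose value exceeds k standard deviations.

    `values` maps codon pair -> translational kinetics value for one host.
    The population standard deviation of the supplied values is used here;
    that choice is an assumption of this sketch.
    """
    sd = pstdev(list(values.values()))
    return {pair for pair, value in values.items() if value > k * sd}


# Invented illustrative values; NOT measurements or predictions from the patent.
toy_values = {
    "TTGAAA": 10.0, "GGTATT": 1.0, "AAGAAA": -1.0, "ACTTTA": 1.0,
    "TTGAAC": -1.0, "ACTTTG": 0.0, "GCTACT": 0.0, "GATATT": 0.0,
    "TCCAAG": 0.0, "ATCAAG": 0.0,
}

for k in (5.0, 3.0, 2.5, 2.0):
    print(k, sorted(highly_overrepresented(toy_values, k)))
```

In practice the table would cover all codon pairs scored for the host organism, so the standard deviation would be a property of the host rather than of a ten-entry sample.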
  • the host organism is not human, E. coli or S. cerevisiae.
  • transketolase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, xylulokinase and transketolase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the transketolase retains at least 75% of the enzymatic activity of wild-type TKL1 (SEQ ID NO: 2) under normal physiological conditions.
  • a transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 6-339 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 6-339 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 6-339 when expressed in the native organism.
  • no replacement codon encoding amino acids 6-339 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TTGAAA when expressed in the native organism.
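The z score limits above compare a replacement codon pair in the heterologous host against wild type pairs in the native organism, either pair by pair (no more than 150%) or against the mean or median of the five highest wild type z scores in the region. A minimal sketch of both comparisons follows. The function names and the numbers are hypothetical, and the sketch assumes positive z scores, since a percentage of a negative reference value would be ambiguous.

```python
from statistics import mean, median


def within_pair_limit(z_replacement_host: float,
                      z_wildtype_native: float,
                      limit: float = 1.5) -> bool:
    """True if the replacement z score is no more than `limit` times
    the wild type z score (1.5 corresponds to 150%)."""
    return z_replacement_host <= limit * z_wildtype_native


def within_region_cap(z_replacements_host, z_wildtypes_native,
                      cap: float = 1.5, use_median: bool = False) -> bool:
    """True if no replacement z score exceeds `cap` times the mean (or
    median) of the five highest wild type z scores in the region."""
    top_five = sorted(z_wildtypes_native, reverse=True)[:5]
    reference = median(top_five) if use_median else mean(top_five)
    return all(z <= cap * reference for z in z_replacements_host)


# Invented illustrative z scores; NOT values from the patent.
wild_type_z = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
print(within_pair_limit(4.0, 3.0))                 # 4.0 <= 4.5
print(within_region_cap([1.0, 5.5], wild_type_z))  # reference is 4.0, cap 6.0
print(within_region_cap([6.5], wild_type_z))
```

The alternative percentages recited in the text (400%, 300%, 200%, 150%, 100%) correspond to `cap` values of 4.0, 3.0, 2.0, 1.5 and 1.0.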
  • a transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 353-533 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 353-533 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 353-533 when expressed in the native organism.
  • no replacement codon encoding amino acids 353-533 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GGTATT when expressed in the native organism.
  • a transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 545-656 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 545-656 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 545-656 when expressed in the native organism.
  • no replacement codon encoding amino acids 545-656 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GGTATT when expressed in the native organism.
  • a transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-6 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-6 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-6 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-6 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair ACTCAA when expressed in the native organism.
  • a transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 339-353 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 339-353 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 339-353 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 339-353 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GCCAAG when expressed in the native organism.
  • a transketolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-678 of wild-type transketolase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 533-545 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 533-545 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 533-545 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 533-545 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GCTCTT when expressed in the native organism.
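The transketolase embodiments above treat amino acid regions differently: in regions 6-339, 353-533 and 545-656 the replacement pairs are predicted to be less likely to pause, while in regions 1-6, 339-353 and 533-545 they are predicted to be equally or more likely to pause. The sketch below maps a codon pair's starting nucleotide to those regions, assuming amino acid n is encoded by nucleotides 3n-2 to 3n. The region boundaries are copied from the text; the labels and function names are hypothetical.

```python
# (first amino acid, last amino acid, label); boundaries as quoted in the text.
REGIONS = [
    (1, 6, "equally_or_more_likely"),
    (6, 339, "less_likely"),
    (339, 353, "equally_or_more_likely"),
    (353, 533, "less_likely"),
    (533, 545, "equally_or_more_likely"),
    (545, 656, "less_likely"),
]


def aa_to_nt(first_aa: int, last_aa: int) -> tuple:
    """Convert an amino acid range to a 1-based inclusive nucleotide range."""
    return 3 * first_aa - 2, 3 * last_aa


def classify(pair_start_nt: int) -> list:
    """Return the labels of all regions that contain both codons of the
    codon pair starting at the given 1-based nucleotide position."""
    if (pair_start_nt - 1) % 3 != 0:
        raise ValueError("codon pair must start in frame")
    first_codon = (pair_start_nt - 1) // 3 + 1
    second_codon = first_codon + 1
    return [label for lo, hi, label in REGIONS
            if lo <= first_codon and second_codon <= hi]


print(aa_to_nt(339, 353))   # nucleotide span of the 339-353 region
print(classify(268))        # TTGAAA at nucleotides 268 - 273
print(classify(1054))       # GCCAAG at nucleotides 1054 - 1059
```

Because the quoted regions share their boundary residues, a pair that straddles a boundary may match no region or more than one; the sketch returns a list so that such cases stay visible rather than being resolved silently.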
  • a ribulose 5-phosphate epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGGTTT (nucleotides 466 - 471); ATCAAA (nucleotides 367 - 372); ATCAAA (nucleotides 385 - 390); GTGGAA (nucleotides 457 - 462); GTGGAA (nucleotides 508 - 513); ACTTTG (nucleotides 514 - 519); TTGAAT (nucleotides 538 - 543); GGCCAA (nucleotides 145 - 150); GGCCAA (nucleotides 475 - 480); TTCCCC (nucleotides 529 - 534); GCCAAG (nucleotides 523 - 528).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGGTTT (nucleotides 466 - 471) replaced with GGTTTT; ATCAAA (nucleotides 367 - 372) replaced with ATTAAA; ATCAAA (nucleotides 385 - 390) replaced with ATTAAG; GTGGAA (nucleotides 457 - 462) replaced with GTTGAA; GTGGAA (nucleotides 508 - 513) replaced with GTTGAA; ACTTTG (nucleotides 514 - 519) replaced with ACTCTA; TTGAAT (nucleotides 538 - 543) replaced with TTAAAC; GGCCAA (nucleotides 145 - 150) replaced with
  • a ribulose 5-phosphate epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26, wherein at least 1 codon pair of SEQ ID NO: 25 has been replaced with a different codon pair encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 1 codon pair to be replaced comprises: TCTGAC (nucleotides 37 - 42); GGCGCA (nucleotides 85 - 90); CCTGGC (nucleotides 187 - 192); GGCGAT (nucleotides 190 - 195); TTTGCT (nucleotides 277 - 282); GCTGAC (nucleotides 292 - 297); TTGATT (nucleotides 349 - 354); GTCGAT (nucleotides 550 - 555); GCTGAC (nucleotides 640 - 645).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TCTGAC (nucleotides 37 - 42 ) replaced with TCTGAC; GGCGCA (nucleotides 85 - 90 ) replaced with GGTGCT; CCTGGC (nucleotides 187 - 192 ) replaced with CCAGGT; GGCGAT (nucleotides 190 - 195 ) replaced with GGTGAT; TTTGCT (nucleotides 277 - 282 ) replaced with TTTGCT; GCTGAC (nucleotides 292 - 297 ) replaced with GCTGAT; TTGATT (nucleotides 349 - 354 ) replaced with TTGATT; GTCGAT (n
  • the nucleotide sequence is optimized for expression in E. coli.
  • a ribulose 5-phosphate epimerase - encoding Nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GACGAT (nucleotides 271 - 276); GGCCAT (nucleotides 118 - 123); ATCAAC (nucleotides 76 - 81); GGGTTT (nucleotides 466 - 471); ATCAAA (nucleotides 367 - 372); ATCAAA (nucleotides 385 - 390); GCCAAA (nucleotides 589 - 594); GGCCAA (nucleotides 145 - 150); GGCCAA (nucleotides 475 - 480).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GACGAT (nucleotides 271 - 276) replaced with GATGAT; GGCCAT (nucleotides 118 - 123) replaced with GGTCAT; ATCAAC (nucleotides 76 - 81) replaced with ATTAAT; GGGTTT (nucleotides 466 - 471) replaced with GGTTTT; ATCAAA (nucleotides 367 - 372) replaced with ATTAAA; ATCAAA (nucleotides 385 - 390) replaced with ATTAAG; GCCAAA (nucleotides 589 - 594) replaced with GCTAAA; GGCCAA (nucleotides 145 - 150) replaced with
  • a ribulose 5-phosphate epimerase - encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGCCAT (nucleotides 118 - 123 ); GGCCAA (nucleotides 145 - 150 ); AAGAAG (nucleotides 211 - 216 ); AAATGG (nucleotides 262 - 267 ); TTTGCT (nucleotides 277 - 282 ); ATCAAA (nucleotides 367 - 372 ); ATCAAA (nucleotides 385 - 390 ); GGGTTT (nucleotides 466 - 471 ); GGCCAA (nucleotides 475 - 480 ); GCCAAG (nucleotides 523 - 528 ); TTCCCC (nucleotides 529 - 534 ); AATATC (nucleotides 541 - 546 ); GGTACC (nucleotides 619
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGCCAT (nucleotides 118 - 123 ) replaced with GGTCAT; GGCCAA (nucleotides 145
  • GGTCAA replaced with GGTCAA; AAGAAG (nucleotides 211 - 216 ) replaced with AAAAAA; AAATGG (nucleotides 262 - 267 ) replaced with AAGTGG; TTTGCT (nucleotides 277 - 282 ) replaced with TTCGCA; ATCAAA (nucleotides 367 - 372 ) replaced with ATTAAA; ATCAAA (nucleotides 385 - 390 ) replaced with ATTAAA; GGGTTT (nucleotides 466 - 471 ) replaced with GGTTTC; GGCCAA (nucleotides 475
  • the nucleotide sequence is optimized for expression in K. lactis.
  • a ribulose 5-phosphate epimerase - encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: ACTTTG (nucleotides 514 - 519 ); GCCAAG (nucleotides 523 - 528 ); AATATC (nucleotides 541 - 546 ); GTCGAT (nucleotides 550 - 555 ); GCCGGT (nucleotides 595 - 600 ).
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: ACTTTG (nucleotides 514 - 519 ) replaced with ACCTTG; GCCAAG (nucleotides 523 - 528 ) replaced with GCCAAA; AATATC (nucleotides 541 - 546 ) replaced with AATATT; GTCGAT (nucleotides 550 - 555 ) replaced with GTTGAT; GCCGGT (nucleotides 595 - 600 ) replaced with GCTGGA.
  • the nucleotide sequence is optimized for expression in Z. mobilis.
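The position-specific replacements recited above can be sketched in code. This is a minimal illustration only, not the applicants' method: the function name `apply_codon_pair_replacements` and the toy sequence are hypothetical, and each position is read as the 1-based coordinate of the first nucleotide of a codon pair, as the recited ranges suggest.

```python
def apply_codon_pair_replacements(seq, replacements):
    """Apply (start, wild_type, replacement) codon-pair edits to a coding sequence.

    `start` is the 1-based position of the first nucleotide of the hexamer.
    Each edit is validated against the sequence before it is applied.
    """
    bases = list(seq.upper())
    for start, wild_type, new in replacements:
        if len(wild_type) != 6 or len(new) != 6:
            raise ValueError("codon pairs must be 6 nucleotides long")
        if (start - 1) % 3 != 0:
            raise ValueError(f"position {start} is not on a codon boundary")
        found = "".join(bases[start - 1:start + 5])
        if found != wild_type:
            raise ValueError(f"expected {wild_type} at {start}, found {found}")
        bases[start - 1:start + 5] = new
    return "".join(bases)


if __name__ == "__main__":
    # Toy sequence (hypothetical): ATG followed by two of the recited wild-type pairs.
    # The coordinates below belong to this toy sequence, not to SEQ ID NO: 25.
    toy = "ATG" + "ACTTTG" + "GCCAAG"
    edits = [(4, "ACTTTG", "ACCTTG"), (10, "GCCAAG", "GCCAAA")]
    print(apply_codon_pair_replacements(toy, edits))
```

Checking the wild-type hexamer before substitution guards against off-by-one coordinate errors, which are easy to introduce when positions are copied from a claim listing.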
  • a ribulose 5-phosphate epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
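The threshold test for a "highly-overrepresented" codon pair can be sketched as follows. This is a hedged illustration under stated assumptions: the translational kinetics values are supplied by the caller as a dictionary keyed by hexamer, the values are taken to be centered so that the recited comparison (value greater than k times the standard deviation) can be applied literally, and the function name and demo data are hypothetical.

```python
from statistics import pstdev


def highly_overrepresented(values, k=3.0):
    """Return codon pairs whose value exceeds k times the standard deviation.

    `values` maps each codon pair (hexamer) to its translational kinetics value
    for one host organism. k corresponds to the recited multipliers 5, 3, 2.5 or 2.
    """
    sd = pstdev(list(values.values()))
    return sorted(pair for pair, v in values.items() if v > k * sd)


if __name__ == "__main__":
    # Hypothetical values: twenty unremarkable pairs and one outlier.
    demo = {f"pair{i}": (1.0 if i % 2 else -1.0) for i in range(20)}
    demo["GGGTTT"] = 6.0
    for k in (5, 3, 2.5, 2):
        print(k, highly_overrepresented(demo, k))
```

Lowering k widens the set of codon pairs treated as candidates for replacement, which matches the descending series of multipliers recited above.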
  • a ribulose 5-phosphate epimerase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, xylulokinase, and ribulose 5-phosphate epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the ribulose 5-phosphate epimerase retains at least 75% of the enzymatic activity of wild-type RPE (SEQ ID NO: 26) under normal physiological conditions.
  • a ribulose 5-phosphate epimerase - encoding Nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 5-214 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 5-214 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 5-214 when expressed in the native organism.
  • no replacement codon encoding amino acids 5-214 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GGGTTT when expressed in the native organism.
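The percentage comparisons of z scores recited above can be illustrated with a short sketch. The helper names are hypothetical, the z scores are assumed to be supplied by the caller, and the reference scores are assumed to be positive, since a percentage of a negative z score would need a separate convention that the text does not state.

```python
def within_percent(replacement_z, reference_z, percent):
    """True if replacement_z is no more than `percent` percent of reference_z."""
    return replacement_z <= reference_z * percent / 100.0


def mean_of_top(z_scores, n=5):
    """Mean of the n highest z scores (all of them if fewer than n are given)."""
    top = sorted(z_scores, reverse=True)[:n]
    return sum(top) / len(top)


def all_replacements_within(replacement_zs, wild_type_zs, percent, n=5):
    """True if no replacement z score exceeds `percent` percent of the mean
    of the n highest wild-type z scores for the same region."""
    limit = mean_of_top(wild_type_zs, n) * percent / 100.0
    return all(z <= limit for z in replacement_zs)


if __name__ == "__main__":
    # Hypothetical z scores for illustration only.
    wild_type = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(within_percent(3.0, 4.0, 150))
    print(all_replacements_within([3.9, 1.0], wild_type, 100))
```

The median variant recited in the same passage could be obtained by swapping the mean for `statistics.median` over the same top-n slice.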
  • a ribulose 5-phosphate epimerase - encoding Nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-238 of wild-type ribulose 5-phosphate epimerase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-5 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-5 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GTCAAA when expressed in the native organism.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAG (nucleotides 487 - 492); GGTATT (nucleotides 598 - 603); ATCAAA (nucleotides 271 - 276); TTGAAC (nucleotides 280 - 285); TCTCCA (nucleotides 946 - 951); GATATT (nucleotides 67 - 72); ATCAAG (nucleotides 952 - 957); GCCAAG (nucleotides 571 - 576); GCCAAG (nucleotides 823 - 828).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TTGAAG (nucleotides 487 - 492) replaced with CTAAAA; GGTATT (nucleotides 598 - 603) replaced with GGAATA; ATCAAA (nucleotides 271 - 276) replaced with ATTAAA; TTGAAC (nucleotides 280 - 285) replaced with TTAAAT; TCTCCA (nucleotides 946 - 951) replaced with TCACCC; GATATT (nucleotides 67 - 72) replaced with GATATA; ATCAAG (nucleotides 952 - 957) replaced with ATTAAA; GCCAAG (nucleotides 571 - 576) replaced with GC
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GACTGG (nucleotides 160 - 165 ); GGCTGG (nucleotides 244 - 249 ); GCCTGT (nucleotides 298 - 303 ); CACTGG (nucleotides 514 - 519 ); GCCAGA (nucleotides 928 - 933 ).
  • At least 3 of the following codon pair replacements have been made: GACTGG (nucleotides 160 - 165 ) replaced with GATTGG; GGCTGG (nucleotides 244 - 249 ) replaced with GGTTGG; GCCTGT (nucleotides 298 - 303 ) replaced with GCTTGT; CACTGG (nucleotides 514 - 519 ) replaced with CATTGG; GCCAGA (nucleotides 928 - 933 ) replaced with GCTAGA.
  • the nucleotide sequence is optimized for expression in E. coli.
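Whether a recited replacement encodes identical amino acids can be checked against the standard genetic code. The sketch below is illustrative only: the helper names are hypothetical, the codon table is the standard one in TCAG order, and the check covers only the "identical amino acids" case, not conservative substitutions. The five pairs are those recited above for alcohol dehydrogenase 1 at nucleotides 160, 244, 298, 514 and 928.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_pair(hexamer):
    """Translate a 6-nucleotide codon pair into its two amino acids."""
    if len(hexamer) != 6:
        raise ValueError("a codon pair must be 6 nucleotides long")
    hexamer = hexamer.upper()
    return CODON_TABLE[hexamer[:3]] + CODON_TABLE[hexamer[3:]]


def is_silent(wild_type, replacement):
    """True if both codon pairs encode the same two amino acids."""
    return translate_pair(wild_type) == translate_pair(replacement)


# Wild-type and replacement codon pairs as recited in the text above.
ADH1_ECOLI_REPLACEMENTS = [
    ("GACTGG", "GATTGG"),
    ("GGCTGG", "GGTTGG"),
    ("GCCTGT", "GCTTGT"),
    ("CACTGG", "CATTGG"),
    ("GCCAGA", "GCTAGA"),
]

if __name__ == "__main__":
    for wt, new in ADH1_ECOLI_REPLACEMENTS:
        print(wt, "->", new, translate_pair(wt), is_silent(wt, new))
```

All five recited pairs change only the third base of the first codon, so each one passes this check.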
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: ATCAAG (nucleotides 952 - 957); GATATT (nucleotides 67 - 72); TTCAAC (nucleotides 844 - 849); ATCAAC (nucleotides 106 - 111); ATCAAC (nucleotides 730 - 735); GGTATT (nucleotides 598 - 603); ATCAAA (nucleotides 271 - 276); GTCAAG (nucleotides 856 - 861); GTCAAG (nucleotides 940 - 945); GGTATC (nucleotides 268 - 273); GGTATC (nucleotides 466 - 471); TCTTTG (nucleotides 553 - 558); TTGAAC (nucleotides 280 - 285).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: ATCAAG (nucleotides 952 - 957) replaced with ATTAAA; GATATT (nucleotides 67 - 72) replaced with GATATA; TTCAAC (nucleotides 844 - 849) replaced with TTTAAT; ATCAAC (nucleotides 106 - 111) replaced with ATTAAT; ATCAAC (nucleotides 730 - 735) replaced with ATTAAT; GGTATT (nucleotides 598 - 603) replaced with GGAATA; ATCAAA (nucleotides 271 - 276) replaced with ATTAAA; GTCAAG (nucleotides 856 - 861) replaced with GTTAAA
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 268 - 273 ); ATCAAA (nucleotides 271 - 276 ); AAATGG (nucleotides 274 - 279 ); TTGAAC (nucleotides 280
  • TTCCAA nucleotides 376 - 381
  • GGTACC nucleotides 427 - 432
  • GGTATC nucleotides 466 - 471
  • TTGAAG nucleotides 487 - 492
  • GCCAAG nucleotides 571 - 576
  • GGTACC nucleotides 790 - 795
  • GCCAAG nucleotides 823 - 828 ).
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGTATC (nucleotides 268 - 273 ) replaced with GGTATA; ATCAAA (nucleotides 271
  • the nucleotide sequence is optimized for expression in K. lactis.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GCCGGT (nucleotides 208 - 213 ); GCCGGT (nucleotides 265 - 270 ); GGTATC (nucleotides 268 - 273 ); GGTATC (nucleotides 466
  • GCTTTG (nucleotides 484 - 489 ); GCCGGT (nucleotides 508 - 513 ); TCCGGT (nucleotides 529 - 534 ); GCCAAG (nucleotides 571 - 576 ); GGTATT (nucleotides 598 - 603 ); GAAGCC (nucleotides 748 - 753 ); GCTATT (nucleotides 754
  • GCCGGT nucleotides 208 - 213
  • GCCGGT nucleotides 265
  • the nucleotide sequence is optimized for expression in Z. mobilis.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, xylulokinase, and alcohol dehydrogenase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the alcohol dehydrogenase 1 retains at least 75% of the enzymatic activity of wild-type ADH1 (SEQ ID NO: 50) under normal physiological conditions.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 30-140 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 30-140 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 30-140 when expressed in the native organism.
  • no replacement codon encoding amino acids 30-140 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair ATCAAA when expressed in the native organism.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 169-311 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 169-311 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 169-311 when expressed in the native organism.
  • no replacement codon encoding amino acids 169-311 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GGTATT when expressed in the native organism.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-30 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-30 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-30 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GATATT when expressed in the native organism.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-348 of wild-type alcohol dehydrogenase 1 as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 140-169 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 140-169 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 140-169 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair TTGAAG when expressed in the native organism.
  • an alcohol dehydrogenase 3- encoding Nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 583 - 588); CTTTTC (nucleotides 709 - 714); AAGAAA (nucleotides 715 - 720); GGTATT (nucleotides 679 - 684); ATCAAA (nucleotides 352 - 357); ATCAAA (nucleotides 1021 - 1026); ATCAAA (nucleotides 1033 - 1038); AAACTA (nucleotides 259 - 264); AAACTA (nucleotides 304 - 309); GATATC (nucleotides 148 - 153); ATCAAG (nucleotides 952 - 957).
  • TTGAAA nucleotides 583 - 588 replaced with TTAAAA
  • CTTTTC nucleotides 709 - 714 replaced with TTGTTT AAGAAA
  • GGTATT nucleotides 679 - 684 replaced with GGAATT
  • ATCAAA nucleotides 352 - 357 replaced with ATTAAA
  • ATCAAA nucleotides 1021 - 1026 replaced with ATTAAA
  • ATCAAA nucleotides 1033 - 1038
  • an alcohol dehydrogenase 3- encoding Nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGCTGG (nucleotides 325 - 330 ); AAAGAG (nucleotides 571 - 576 ); GACTGG (nucleotides 595 - 600 ); GCGATG (nucleotides 658 - 663 ); GTGGTG (nucleotides 934 - 939 ); GGATTA (nucleotides 1045 - 1050 ); GTCGAT (nucleotides 1111 - 1116 ).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGCTGG (nucleotides 325 - 330 ) replaced with GGTTGG; AAAGAG (nucleotides 571 - 576 ) replaced with AAGGAA; GACTGG (nucleotides 595 - 600 ) replaced with GATTGG; GCGATG (nucleotides 658 - 663 ) replaced with GCTATG; GTGGTG (nucleotides 934 - 939 ) replaced with GTTGTT; GGATTA (nucleotides 1045 - 1050 ) replaced with GGTTTG; GTCGAT (nucleotides 1111 - 1116 ) replaced with GTTGAT.
  • an alcohol dehydrogenase 3- encoding Nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 715 - 720); GATATC (nucleotides 148 - 153); ATCAAG (nucleotides 952 - 957); TTCAAG (nucleotides 712 - 717); TTGAAA (nucleotides 583 - 588); ATCAAC (nucleotides 187 - 192); GGTATT (nucleotides 679 - 684); ATCAAA (nucleotides 352 - 357); ATCAAA (nucleotides 1021
  • the nucleotide sequence is optimized for expression in P. pastoris.
  • an alcohol dehydrogenase 3-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 148 - 153 ); GGTATC (nucleotides 349 - 354 ); ATCAAA (nucleotides 352 - 357 ); AAATGG (nucleotides 355 - 360 ); TTCCAA (nucleotides 457 - 462 ); GGTACC (nucleotides 508 - 513 ); TTGAAA (nucleotides 583 - 588 ); CTTTTC (nucleotides 709 - 714 ); AAGAAA (nucleotides 715 - 720 ); GGTACC (nucleotides 871 - 876 ); AATATC (nucleotides 949
  • ATCAAA nucleotides 1021 - 1026
  • ATCAAA nucleotides 1033 - 1038
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • GATATC nucleotides 148 - 153
  • GGTATC nucleotides 349 - 354
  • ATCAAA nucleotides 352 - 357
  • AAATGG nucleotides 355 - 360
  • TTCCAA nucleotides 457 - 462
  • GGTACC nucleotides 508 - 513
  • TTGAAA nucleotides 583 - 588
  • an alcohol dehydrogenase 3-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 349 - 354 ); GCTATT (nucleotides 481 - 486 ); GGTATT (nucleotides 679 - 684 ); GAAGCC (nucleotides 829
  • GCTATT nucleotides 835 - 840
  • ATCAAT nucleotides 946 - 951
  • AATATC nucleotides 949 - 954
  • GAAGCC nucleotides 991 - 996
  • GTCGAT nucleotides 1111 - 1116.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • GGTATC nucleotides 349 - 354
  • GCTATT nucleotides 481 - 486
  • GGTATT nucleotides 679 - 684
  • GAAGCC nucleotides 829 - 834
  • GAGGCT nucleotides 835 - 840
  • ATCAAT nucleotides 946 - 951
  • AATATC nucleotides 949 - 954
  • GAAGCC nucleotides 991 - 996
  • GTCGAT nucleotides 1111 - 1116
  • an alcohol dehydrogenase 3-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
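The threshold rule above can be sketched as a simple filter. This is only an illustration: the translational kinetics values below are made-up placeholders, and the use of the population standard deviation is an assumption, since the text does not say which estimator is intended.

```python
from statistics import pstdev


def highly_overrepresented(tk_values: dict[str, float], k: float = 3.0) -> set[str]:
    """Codon pairs whose translational kinetics value exceeds k standard deviations.

    tk_values maps a 6-nt codon pair to its translational kinetics value for
    one host organism. k corresponds to the 5, 3, 2.5 or 2 multiplier in the text.
    """
    sd = pstdev(list(tk_values.values()))  # assumption: population standard deviation
    return {pair for pair, value in tk_values.items() if value > k * sd}


# Placeholder values for illustration only; not measured data.
example = {"GGTATC": -1.0, "GCTATT": 1.0, "GAAGCC": -1.0, "TTGAAA": 1.0, "ATCAAA": 10.0}
print(highly_overrepresented(example, k=2.0))
```

Lowering `k` from 5 toward 2 widens the set of codon pairs that qualify for replacement.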
  • the host organism is not human, E. coli or S. cerevisiae.
  • an alcohol dehydrogenase 3-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, xylulokinase, and alcohol dehydrogenase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the alcohol dehydrogenase 3 retains at least 75% of the enzymatic activity of wild-type ADH3 (SEQ ID NO: 74) under normal physiological conditions.
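The 75% amino acid sequence identity criterion can be illustrated with a small helper. This sketch assumes the two protein sequences have already been aligned by some external tool and that identity is counted over alignment columns. Other definitions of percent identity exist, and the text here does not fix one.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences.

    Inputs are equal-length, pre-aligned strings in which '-' marks a gap.
    Columns that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no alignment columns to compare")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Hypothetical four-residue fragments, for illustration only.
print(percent_identity("MKTA", "MKSA") >= 75.0)
```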
  • an alcohol dehydrogenase 3-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 57-168 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 57-168 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 57-168 when expressed in the native organism.
  • no replacement codon encoding amino acids 57-168 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair ATCAAA when expressed in the native organism.
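The z score limits in this group of embodiments compare a replacement codon pair's z score in the heterologous host against a wild-type reference measured in the native organism. A hedged sketch of those comparisons follows. The helper names are invented for illustration, the numbers are placeholders, and the arithmetic assumes a positive reference z score.

```python
from statistics import mean, median


def top5_reference(wild_type_z: list[float], use_median: bool = False) -> float:
    """Mean (or median) of the five highest wild-type z scores in a region."""
    top = sorted(wild_type_z, reverse=True)[:5]
    return median(top) if use_median else mean(top)


def within_percent(replacement_z: float, reference_z: float, max_percent: float) -> bool:
    """True if replacement_z is no more than max_percent of reference_z."""
    return replacement_z <= reference_z * max_percent / 100.0


def all_replacements_within(
    replacement_zs: list[float],
    wild_type_z: list[float],
    max_percent: float,
    use_median: bool = False,
) -> bool:
    """Check every replacement against the top-five reference for the region."""
    reference = top5_reference(wild_type_z, use_median)
    return all(within_percent(z, reference, max_percent) for z in replacement_zs)


# Placeholder z scores for illustration only.
wild_type = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(all_replacements_within([2.0, 4.9], wild_type, max_percent=100))
```

The same `within_percent` check, called with a single wild-type pair's z score as the reference, covers the "no more than 150% of the z score for the wild type codon pair" limitation.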
  • an alcohol dehydrogenase 3-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 196-339 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 196-339 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 196-339 when expressed in the native organism.
  • no replacement codon encoding amino acids 196-339 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair CTTTTC when expressed in the native organism.
  • an alcohol dehydrogenase 3-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-57 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-57 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-57 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-57 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GATATC when expressed in the native organism.
  • an alcohol dehydrogenase 3-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-375 of wild-type alcohol dehydrogenase 3 as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 168-196 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 168-196 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 196-339 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 168-196 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair TTGAAA when expressed in the native organism.
  • transaldolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 19 - 24); TTGAAA (nucleotides 55 - 60); GCCAAG (nucleotides 109 - 114); GCCAAG (nucleotides 163 - 168); GCCAAG (nucleotides 181 - 186); GTGGAA (nucleotides 202 - 207); GGTATT (nucleotides 451 - 456); ACTTTG (nucleotides 556 - 561); AAGAAA (nucleotides 655
  • ATCAAA nucleotides 733 - 738
  • TCTCCA nucleotides 769 - 774
  • TTGAAT nucleotides 901 - 906
  • GATATT nucleotides 961 - 966
  • AAGAAA nucleotides 991 - 996 .
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 19 - 24) replaced with AAGAAG; TTGAAA (nucleotides 55 - 60) replaced with TTAAAG; GCCAAG (nucleotides 109 - 114) replaced with GCTAAA; GCCAAG (nucleotides 163 - 168) replaced with GCTAAA; GCCAAG (nucleotides 181 - 186) replaced with GCTAAA; GTGGAA (nucleotides 202 - 207) replaced with GTTGAA; GGTATT (nucleotides 451 - 456) replaced with GGAATA; ACTTTG (nucleotides 556 - 561) replaced with ACATTG; AAGAAA (nucleotides 655
  • nucleotide sequence is optimized for expression in S. cerevisiae.
  • transaldolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: ATTGCC (nucleotides 106 - 111); GACAGA (nucleotides 259 - 264); GCCTGT (nucleotides 535 - 540); TTGATT (nucleotides 559 - 564); GACTGG (nucleotides 589 - 594); TCCAGC (nucleotides 601 - 606); TTGATT (nucleotides 982 - 987).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: ATTGCC (nucleotides 106 - 111) replaced with ATCGCG; GACAGA (nucleotides 259 - 264) replaced with GACCGT; GCCTGT (nucleotides 535 - 540) replaced with GCGTGC; TTGATT (nucleotides 559 - 564) replaced with CTCATC; GACTGG (nucleotides 589 - 594) replaced with GATTGG; TCCAGC (nucleotides 601
  • nucleotide sequence is optimized for expression in E. coli.
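The replacement lists above pair each wild-type codon pair with a substitute that is stated to encode identical amino acids. A minimal sketch of how such a silent replacement can be checked against the standard genetic code follows; the function names are illustrative and are not part of the disclosure.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_pair(pair: str) -> str:
    """Translate a 6-nt codon pair into its two-letter amino acid string."""
    if len(pair) != 6:
        raise ValueError("a codon pair is exactly six nucleotides")
    return CODON_TABLE[pair[:3]] + CODON_TABLE[pair[3:]]


def is_silent(wild_type: str, replacement: str) -> bool:
    """True if both codon pairs encode the same two amino acids."""
    return translate_pair(wild_type) == translate_pair(replacement)


# Replacement pairs quoted from the E. coli list above.
for wt, repl in [("ATTGCC", "ATCGCG"), ("GACAGA", "GACCGT"), ("TTGATT", "CTCATC")]:
    print(wt, "->", repl, translate_pair(wt), is_silent(wt, repl))
```

A conservative amino acid substitution would fail `is_silent` by design; checking it would require a separate, explicitly chosen similarity grouping of amino acids.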
  • a transaldolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 19 - 24 ); TTGAAA (nucleotides 55
  • AAGTTT nucleotides 112 - 117
  • TTTGAC nucleotides 343 - 348
  • TCCAAG nucleotides 409 - 414
  • GGTATT nucleotides 451 - 456
  • GCCAAA nucleotides 463 - 468
  • GGTATC nucleotides 487 - 492
  • GTCAAG nucleotides 652
  • AAGAAA nucleotides 655 - 660
  • GACGAA nucleotides 727 - 732
  • ATCAAA nucleotides 733 - 738
  • CCAAGA nucleotides 814 - 819
  • GACGAA nucleotides 877 - 882
  • GGTATC nucleotides 940 - 945
  • GATATT nucleotides 961
  • nucleotides 991 - 996. In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 19 - 24) replaced with AAAAAG; TTGAAA (nucleotides 55 - 60) replaced with TTGAAG; AAGTTT (nucleotides 112 - 117) replaced with AAATTT; TTTGAC (nucleotides 343 - 348) replaced with TTTGAT; TCCAAG (nucleotides 409 - 414) replaced with TCTAAA; GGTATT (nucleotides 451 - 456) replaced with GGAATC; GCCAAA (nucleotides 463
  • nucleotide sequence is optimized for expression in P. pastoris.
  • transaldolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 19 - 24 ); TTGAAA (nucleotides 55
  • GCCAAG nucleotides 109 - 114
  • GCCAAG nucleotides 163 - 168
  • GCCAAG nucleotides 181 - 186
  • AAGAAG nucleotides 214 - 219
  • GGTATC nucleotides 487 - 492
  • GGTAAA nucleotides 610 - 615
  • AAGAAA nucleotides 655 - 660
  • AAGAAG nucleotides 676 - 681
  • ATCAAA nucleotides 733 - 738
  • TTCCCA nucleotides 811 - 816
  • AAGAAG nucleotides 841 - 846
  • GGTATC nucleotides 940 - 945
  • AAGAAA nucleotides 991 - 996
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 19 - 24) replaced with AAAAAA; TTGAAA (nucleotides 55 - 60) replaced with CTTAAG; GCCAAG (nucleotides 109 - 114) replaced with GCTAAA; GCCAAG (nucleotides 163
  • transaldolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GCCAAG (nucleotides 109 - 114 ); GCCAAG (nucleotides 163 - 168 ); GCCAAG (nucleotides 181 - 186 ); GGTATT (nucleotides 451 - 456 ); GGTATC (nucleotides 487 - 492 ); ACTTTG (nucleotides 556 - 561 ); GAAGCC (nucleotides 628 - 633 ); GAAGCC (nucleotides 847 - 852 ); GGTATC (nucleotides 940
  • GCCAAG nucleotides 109 - 114
  • GCCAAG nucleotides 163
  • GCTAAG (nucleotides 181 - 186 ) replaced with GCCAAA
  • GGTATT (nucleotides 451 - 456 ) replaced with GGGATC
  • GGTATC (nucleotides 487 - 492 ) replaced with GGCATT
  • ACTTTG (nucleotides 556 - 561 ) replaced with ACCTTG
  • GAAGCC (nucleotides 628 - 633 ) replaced with GAAGCT
  • GAAGCC (nucleotides 847 - 852 ) replaced with GAGGCT
  • GGTATC (nucleotides 940
  • nucleotide sequence is optimized for expression in Z. mobilis.
  • transaldolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • transaldolase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (long-tailed monkey); M. mulatta (monkey); E.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, xylulokinase and transaldolase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the transaldolase retains at least 75% of the enzymatic activity of wild-type TAL1 (SEQ ID NO: 98) under normal physiological conditions.
  • a transaldolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 25-329 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 25-329 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 25-329 when expressed in the native organism.
  • no replacement codon encoding amino acids 25-329 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair AAGAAA when expressed in the native organism.
  • a transaldolase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-335 of wild-type transaldolase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-25 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-25 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-25 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-25 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair AAGAAA when expressed in the native organism.
  • polynucleotides comprising any of the nucleotide sequences provided herein. Also provided herein are isolated polynucleotides comprising the nucleotide sequence of SEQ ID NOs: , 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101 , 103, 105 or 107.
  • such a polynucleotide is a DNA polynucleotide
  • such a polynucleotide can be an RNA polynucleotide comprising the RNA-equivalent of said DNA sequence.
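The RNA-equivalent of a DNA coding sequence is obtained by replacing thymine with uracil. A minimal sketch, with a hypothetical helper name, is:

```python
def rna_equivalent(dna: str) -> str:
    """Return the RNA equivalent of a DNA sequence (T replaced by U)."""
    dna = dna.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters in DNA sequence: {sorted(unexpected)}")
    return dna.replace("T", "U")


print(rna_equivalent("GATATC"))
```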
  • cells comprising such a polynucleotide. In some such cells, the cell expresses the polypeptide encoded by the polynucleotide.
  • methods of introducing a polynucleotide into a host cell comprising providing a host cell; and contacting said host cell with any of the polynucleotides provided herein under conditions that permit the polynucleotide to be introduced into the host cell.
  • Also provided are methods of expressing a polypeptide comprising providing a cell comprising any of the polynucleotides provided herein; and placing the cell under conditions that permit the cell to express the polypeptide encoded by the DNA sequence, whereby said encoded polypeptide is expressed by said cell. Also provided are methods of hydrolyzing a carbohydrate comprising providing a carbohydrate comprising at least one covalent bond; providing a polypeptide encoded by any of the polynucleotides provided herein; and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one covalent bond of said carbohydrate; whereby at least one glycosidic bond of said carbohydrate is hydrolyzed.
  • the carbohydrate is arabinose.
  • integrable polynucleotides for modifying an endogenous nucleotide sequence in a cell comprising: a removable selectable marker cassette comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site, wherein said removable selectable marker cassette is flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence.
  • integrable polynucleotides further comprise a heterologous nucleic acid flanked by said 5' nucleic acid sequence with homology to an endogenous sequence and said 3' nucleic acid sequence with homology to an endogenous sequence.
  • the heterologous nucleic acid comprises a sequence encoding a polypeptide.
  • the heterologous nucleic acid comprises a regulatory sequence.
  • the sequence encoding a polypeptide is operatively linked to said regulatory sequence.
  • the regulatory sequence comprises a promoter sequence and a terminator sequence.
  • the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein.
  • the heterologous nucleic acid encodes a polypeptide that catalyzes a reaction in a sugar degradation pathway.
  • the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105 or 107.
  • the selectable marker can be selected for or can be selected against.
  • the selectable marker can be selected for and can be selected against.
  • the selectable marker is selected from the group consisting of URA3, TRP1, CAN1, KlURA3, CYH2, LYS2 and MET15.
  • the nucleic acid sequence with homology to an endogenous sequence comprises a genomic repetitive element.
  • the nucleic acid sequence with homology to an endogenous sequence comprises Ty1 DNA or Ty3 DNA.
  • the site-specific recombinase recognition site comprises a loxP sequence.
  • the site-specific recombinase recognition site comprises a frt sequence.
  • the integrable polynucleotide comprises a PCR product.
  • cells comprising any of the integrable polynucleotides provided herein. Some such cells comprise a gene encoding a site-specific recombinase. In some such cells, the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. Some such cells are S. cerevisiae cells.
  • Also provided herein are methods of modifying an endogenous sequence in a cell comprising: providing a cell with at least one of the integrable polynucleotides provided; and selecting for a cell comprising said at least one integrable polynucleotide integrated therein to the genome of the cell. Some such methods further comprise excising at least one selectable marker from said at least one cell comprising said at least one integrable polynucleotide integrated therein; and selecting for a cell in which said at least one selectable marker has been excised. In some such methods, the excising said selectable marker comprises providing said cell with a site-specific recombinase.
  • the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. In some such methods, the site-specific recombinase is expressed from an endogenous gene or from a heterologous nucleic acid.
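The integrate-then-excise workflow described above can be sketched at the sequence level. This is a toy string model under stated assumptions: the homology arms, marker and payload are placeholder strings; the payload is placed immediately 5' of the marker cassette purely for illustration; and `LOXP` is the commonly cited 34-bp loxP sequence, which should be checked against a reference before any real use. Recombination between two directly repeated sites is modeled as deleting the intervening segment and leaving a single site behind.

```python
# Commonly cited 34-bp loxP site (13-bp arm, 8-bp spacer, 13-bp arm).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def build_integrable(h5: str, marker: str, h3: str, payload: str = "", site: str = LOXP) -> str:
    """Assemble 5' homology, optional payload, site-marker-site cassette, 3' homology."""
    return h5 + payload + site + marker + site + h3


def excise_marker(seq: str, site: str = LOXP) -> str:
    """Model recombinase excision between the first two directly repeated sites."""
    first = seq.find(site)
    if first == -1:
        return seq
    second = seq.find(site, first + len(site))
    if second == -1:
        return seq
    # Keep one copy of the site; drop the marker and the second site.
    return seq[:first] + site + seq[second + len(site):]


# Placeholder sequences for illustration only.
construct = build_integrable("AAAA", "GGGGCCCC", "TTTT", payload="ACGT")
print(excise_marker(construct))
```

After excision the payload remains next to a single recombinase recognition site, which matches the cells described as carrying a heterologous nucleic acid juxtaposed to such a site.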
  • the providing a cell with at least one integrable polynucleotide comprises providing a cell with a plurality of integrable polynucleotides, wherein said plurality of integrable polynucleotides comprises at least a first integrable polynucleotide comprising a first selectable marker and a second integrable polynucleotide comprising a second selectable marker.
  • the plurality comprises 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides.
  • cells comprising an endogenous sequence modified by any of such methods provided herein.
  • the modified endogenous sequence comprises an insertion, a deletion or a mutation.
  • cells comprising a removable selectable marker cassette integrated into said cell comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site; and a heterologous nucleic acid integrated into said cell, wherein said removable selectable marker is juxtaposed to said heterologous nucleic acid.
  • cells comprising: a heterologous nucleic acid integrated into said cell, and a site-specific recombinase recognition site integrated into said cell, wherein said site-specific recombinase recognition site is juxtaposed to said heterologous nucleic acid.
  • the site-specific recombinase recognition site comprises a loxP or frt sequence.
  • the cell is an S. cerevisiae cell.
  • the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein.
  • the heterologous nucleic acid encodes a polypeptide that catalyzes a reaction in a sugar degradation pathway.
  • the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105 or 107.
  • Figure 1 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in S. cerevisiae of nucleic acid sequences encoding the transketolase enzyme of S. cerevisiae (TLK1), plotted as a function of codon pair position.
  • Figures 2-4 depict effects of Translational Engineering™ on protein expression levels.
  • Each of Figures 2-4 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding TLK1, plotted as a function of codon pair position.
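The plots described in these figures show one z score per codon pair position. A minimal sketch of how such a per-position profile could be assembled is given below. The z score table is a made-up placeholder; in practice it would come from host-specific codon pair statistics, which are not reproduced in this passage.

```python
def codon_pair_profile(seq: str, z_table: dict[str, float], default: float = 0.0):
    """Return (position, codon_pair, z_score) for each overlapping codon pair.

    Position 1 is the pair formed by codons 1 and 2, position 2 by codons
    2 and 3, and so on. Pairs missing from z_table get the default score.
    """
    usable = len(seq) - len(seq) % 3
    codons = [seq[i : i + 3] for i in range(0, usable, 3)]
    profile = []
    for pos in range(1, len(codons)):
        pair = codons[pos - 1] + codons[pos]
        profile.append((pos, pair, z_table.get(pair, default)))
    return profile


def predicted_pauses(profile, threshold: float):
    """Positions whose z score exceeds the chosen threshold."""
    return [pos for pos, _pair, z in profile if z > threshold]


# Placeholder sequence and z scores, for illustration only.
toy_z = {"GATATC": 4.2, "ATCAAA": 3.1}
profile = codon_pair_profile("ATGGATATCAAA", toy_z)
print(profile)
print(predicted_pauses(profile, threshold=4.0))
```

Plotting the third element of each tuple against the first gives a display of the kind the figure descriptions refer to, before and after codon pair replacement.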
  • Figure 2 A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the TLKI l protein.
  • Figure 2B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the TLKI l which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 3 A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the TLKl 1 protein.
  • Figure 3 B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the TLKl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 4A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the TLKl protein.
  • Figure 4B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the TLKl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 5A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the TLKl protein.
  • Figure 5B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the TLKl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 6A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the TLKl protein.
  • Figure 6B depicts a graphical display of the Z mobilis expression of a nucleic acid sequence encoding the TLKl which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z mobilis.
  • Figure 7 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in S. cerevisiae of nucleic acid sequences encoding the ribulose 5-phosphate epimerase enzyme of S. cerevisiae (RPE), plotted as a function of codon pair position.
  • RPE ribulose 5-phosphate epimerase enzyme
  • Figures 8-12 depict effects of Translation Engineering™ on protein expression levels.
  • Each of Figures 8-12 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding RPE, plotted as a function of codon pair position.
  • Figure 8A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the RPE protein.
  • Figure 8B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the RPE which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 9A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the RPE protein.
  • Figure 9B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the RPE which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 10A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the RPE protein.
  • Figure 10B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the RPE which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 11A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the RPE protein.
  • Figure 11B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the RPE which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 12A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the RPE protein.
  • Figure 12B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the RPE which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 13 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in S. cerevisiae of nucleic acid sequences encoding the alcohol dehydrogenase 1 enzyme of S. cerevisiae (ADH1), plotted as a function of codon pair position.
  • Figures 14-18 depict effects of Translation Engineering™ on protein expression levels.
  • Each of Figures 14-18 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding ADH1, plotted as a function of codon pair position.
  • Figure 14A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the ADH1 protein.
  • Figure 14B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the ADH1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 15A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the ADH1 protein.
  • Figure 15B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the ADH1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 16A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the ADH1 protein.
  • Figure 16B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the ADH1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 17A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the ADH1 protein.
  • Figure 17B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the ADH1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 18A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the ADH1 protein.
  • Figure 18B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the ADH1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 19 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in S. cerevisiae of nucleic acid sequences encoding the alcohol dehydrogenase 3 enzyme of S. cerevisiae (ADH3), plotted as a function of codon pair position.
  • Figures 20-24 depict effects of Translation Engineering™ on protein expression levels.
  • Each of Figures 20-24 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding ADH3, plotted as a function of codon pair position.
  • Figure 20A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the ADH3 protein.
  • Figure 20B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the ADH3 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 21A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the ADH3 protein.
  • Figure 21B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the ADH3 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 22A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the ADH3 protein.
  • Figure 22B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the ADH3 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 23A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the ADH3 protein.
  • Figure 23B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the ADH3 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 24A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the ADH3 protein.
  • Figure 24B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the ADH3 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 25 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in S. cerevisiae of nucleic acid sequences encoding the transaldolase enzyme of S. cerevisiae (TAL1), plotted as a function of codon pair position.
  • Figures 26-30 depict effects of Translation Engineering™ on protein expression levels.
  • Each of Figures 26-30 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding TAL1, plotted as a function of codon pair position.
  • Figure 26A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the TAL1 protein.
  • Figure 26B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the TAL1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 27A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the TAL1 protein.
  • Figure 27B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the TAL1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 28A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the TAL1 protein.
  • Figure 28B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the TAL1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 29A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the TAL1 protein.
  • Figure 29B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the TAL1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 30A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the TAL1 protein.
  • Figure 30B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the TAL1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 31A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XynA protein.
  • Figure 31B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Biomass is among the earth's most attractive alternative fuel sources and most sustainable energy resources, and it is regenerated through the bioconversion of carbon dioxide.
  • Ethanol produced from biomass is today the most widely used biofuel when blended with gasoline.
  • the use of biofuels can significantly reduce the accumulation of greenhouse gas.
  • Ethanol is just one example of the uses of biomass harvesting using industrial enzymes. The technologies associated with biomass harvesting are similarly applicable in the production of other biofuels, fine chemicals as well as other diverse applications.
  • Lignocellulosic biomass is composed predominantly of cellulose, hemicellulose, and lignin and is naturally resistant to chemical and biologic conversion.
  • An economical biomass-to-ethanol process critically depends on the rapid and efficient conversion of all of the sugars present in both its cellulose and hemicellulose fractions. While many microorganisms can ferment the glucose component in cellulose to ethanol, efficient conversion of the pentose sugars in the hemicellulose fraction, particularly xylose and arabinose, has been hindered by the lack of a suitable biocatalyst.
  • Xylose is the predominant pentose sugar derived from hemicellulose, but arabinose can constitute a significant amount of the pentose sugars derived from various agricultural residues and other herbaceous crops, such as switchgrass.
  • Xylose metabolism. Xylose is metabolized in the pentose phosphate pathway (PPP) where it enters through D-xylulose and is converted by transketolase (TLK), generating D-fructose-6-phosphate and D-glyceraldehyde-3-phosphate (GAP), which can be converted in a redox-neutral way to equimolar amounts of CO2 and ethanol.
  • PPP pentose phosphate pathway
  • TLK transketolase
  • GAP D-glyceraldehyde-3-phosphate
  • D-xylose is reduced to xylitol by a xylose reductase (XR; e.g., Xyr, XYL1, Xyl1p) and then xylitol is oxidized to D-xylulose by a xylitol dehydrogenase (XDH; e.g., XYL2, Xyl2p).
  • XR xylose reductase
  • XDH xylitol dehydrogenase
  • XK D-xylulokinase
  • the rate of the two-step reduction/oxidation reactions to generate D-xylulose, and hence feed the PPP and eventually generate ethanol, is governed by the cofactor requirements of the first two reactions which affect cellular demands for oxygen.
  • XDH from Pichia stipitis is strictly NAD+-dependent.
  • L-arabinose metabolism. In yeast, filamentous fungi and other eukaryotes, the L-arabinose pathway consists of five enzymes: aldose reductase (ARD), L-arabinitol 4-dehydrogenase (LAD), L-xylulose reductase (LXR), xylitol dehydrogenase (XDH), and xylulokinase (XKI), converting L-arabinose to L-arabitol, L-xylulose, xylitol, D-xylulose, and D-xylulose-5-P, respectively.
  • ARD aldose reductase
  • LAD L-arabinitol 4-dehydrogenase
  • LXR L-xylulose reductase
  • XDH xylitol dehydrogenase
  • XKI xylulokinase
  • the bacterial pathway for L-arabinose utilization does not use redox reactions like the yeast/fungal system, but consists of L-arabinose isomerase (AraA), L-ribulokinase (AraB), and L-ribulose-5-P 4-epimerase (AraD) converting L-arabinose to L-ribulose, L-ribulose-5-P, and D-xylulose-5-P, respectively (Lee et al. (1986) Gene 47:231-244).
  • the expression of the E. coli pathway in S. cerevisiae did not result in either growth on L-arabinose or production of ethanol from L-arabinose (Sedlak et al. (2001) 28:16-24). It was suggested that the main problem was the low activity of E. coli L-arabinose isomerase in yeast.
  • the final step in the glucose fermentation pathway is the reduction of acetaldehyde to ethanol.
  • This enzymatic conversion is performed by alcohol dehydrogenase enzymes.
  • ADH1 is the cytoplasmic isoform of alcohol dehydrogenase and the major enzyme required for the conversion of acetaldehyde to ethanol.
  • ADH2 expression is repressed by growth on glucose and is mainly involved in ethanol consumption, converting ethanol into acetaldehyde.
  • ADH3 is a mitochondrial isozyme of alcohol dehydrogenase, and is involved in ethanol production and in the shuttling of mitochondrial NADH to the cytosol under anaerobic conditions.
  • ADH4 is a formaldehyde dehydrogenase and has no effect on ethanol production.
  • ADH5 has been sequenced, but its enzymatic function is not clear. In order to produce industrial levels of ethanol, it is desirable to express ADH enzymes in host organisms.
  • Some translational pauses result from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, expression of a gene in an organism to which it is foreign can lead to inefficient translation. Even when the gene is translated efficiently enough to yield recoverable quantities of the translation product, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pauses can improve protein expression.
  • a translational pause can serve to slow translation of the nascent amino acid chain.
  • the pause(s) can serve to facilitate proper polypeptide folding, post-translational modification, re-organization/folding at protein domain boundaries, or other steps toward arriving at the native, active wild type protein.
  • one or more pauses that are predicted to be present in native translation of PPP and/or fermentation enzymes is/are preserved in a modified PPP and/or fermentation enzyme-encoding polynucleotide provided in accordance with the teachings herein.
  • a codon pair in the modified PPP and/or fermentation enzyme-encoding polynucleotide can be selected to have a predicted translational kinetics value that is at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, or 99% that of the native codon pair whose predicted pause is to be preserved; further, the codon pair in the modified PPP and/or fermentation enzyme-encoding polynucleotide can be selected to be located within 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 codons of the native codon pair whose predicted pause is to be preserved.
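The pause-preservation criterion above (a value threshold plus a distance window) can be sketched as a simple check. This is a minimal illustration, not the method of the disclosure: the function name, the list-of-values representation, and the assumption that a larger positive value means a stronger predicted pause are all hypothetical.

```python
def preserves_pause(native_pos, native_value, modified_values,
                    min_fraction=0.9, max_distance=5):
    """Return True if the modified sequence keeps a comparable pause nearby.

    native_pos      -- index of the native codon pair whose pause is preserved
    native_value    -- its predicted translational kinetics value (assumed > 0)
    modified_values -- per-position values for the modified sequence (hypothetical input)
    min_fraction    -- e.g. 0.9 for "at least 90% of the native value"
    max_distance    -- e.g. 5 for "within 5 codons"
    """
    lo = max(0, native_pos - max_distance)
    hi = min(len(modified_values), native_pos + max_distance + 1)
    # Any codon pair inside the window that reaches the threshold counts.
    return any(v >= min_fraction * native_value for v in modified_values[lo:hi])
```

For example, with a native pause of value 5.0 at position 4, a modified profile containing 4.6 at position 2 satisfies a 90% threshold within 2 codons but not within 1 codon.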
  • Translation Engineering™ refers to a process used to modify the translational kinetics of a polypeptide-encoding nucleic sequence.
  • Translation Engineering™ can be applied to modify the translational kinetics of a polypeptide-encoding nucleic sequence when expressed in its native organism. In some embodiments, this process alters the polypeptide-encoding nucleic sequence to optimize codon usage and codon pair usage in the organism in which the polypeptide-encoding nucleic sequence is expressed.
  • sequence modifications can be made to place or prevent restriction sites in the sequence, eliminate strong RNA secondary structures and avoid inadvertent Shine-Dalgarno sequences.
  • Translation Engineering™ involves modifying the translational kinetics of a polypeptide-encoding nucleic sequence by removing, preserving, and/or inserting translational pauses into the polypeptide-encoding nucleic sequence.
  • PPP and/or fermentation enzyme-encoding nucleotide sequences with refined translational kinetics and methods of making same.
  • a PPP and/or fermentation enzyme-encoding DNA sequence wherein the encoded sequence has amino acid sequence identity with wild-type PPP and/or fermentation enzyme, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing input-sequence codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the resultant PPP and/or fermentation enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length.
  • Expression of the resultant PPP and/or fermentation enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression.
  • expression of the resultant PPP and/or fermentation enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated PPP and/or fermentation enzyme.
  • expression of the resultant PPP and/or fermentation enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where one or more predicted pauses are preserved from the native expression profile or are added to preserve expression of active and/or soluble PPP and/or fermentation enzyme.
  • the PPP and/or fermentation enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels; higher enzymatic activity; greater protein stability and resistance to degradation; and increased solubility.
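Replacing a codon pair with a different pair that encodes the same two amino acids can be sketched as follows. This is an illustrative sketch only: the `score` dictionary of per-pair pause values and the `threshold` are hypothetical inputs, unscored pairs default to 0.0, and the sketch treats one pair in isolation, whereas changing a codon in a real sequence also changes the neighboring overlapping pairs.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ('*' marks stop codons).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)

def synonymous_pairs(pair):
    """All 6-nt codon pairs encoding the same two amino acids as `pair`."""
    first, second = CODON_TABLE[pair[:3]], CODON_TABLE[pair[3:]]
    return [a + b for a in SYNONYMS[first] for b in SYNONYMS[second]]

def replace_pause_pair(pair, score, threshold):
    """Keep `pair` unless its score exceeds `threshold`; otherwise return
    the lowest-scoring synonymous pair (hypothetical selection rule)."""
    if score.get(pair, 0.0) <= threshold:
        return pair
    return min(synonymous_pairs(pair), key=lambda p: score.get(p, 0.0))
```

Because every candidate is drawn from `synonymous_pairs`, the encoded amino acids are identical by construction; conservative substitutions, which the text also permits, are outside this sketch.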
  • PPP and/or fermentation enzyme refers to the enzymes encoded by the nucleotide sequences provided herein, and includes transketolase (TLK1), ribulose 5-phosphate epimerase (RPE), alcohol dehydrogenase-1 (ADH1), alcohol dehydrogenase-3 (ADH3) and transaldolase (TAL1) enzymes.
  • TLK1 transketolase
  • RPE ribulose 5-phosphate epimerase
  • ADH1 alcohol dehydrogenase-1
  • ADH3 alcohol dehydrogenase-3
  • TAL1 transaldolase
  • nucleic acid sequences encoding the transketolase enzyme of S. cerevisiae are provided.
  • the nucleotide sequences provided herein include the native sequence from S. cerevisiae shown in the sequence listing (SEQ ID NO: 1) which encodes the TLK1 amino acid sequence (SEQ ID NO: 2).
  • nucleic acid sequences encoding the ribulose 5-phosphate epimerase enzyme of S. cerevisiae are provided.
  • the nucleotide sequences provided herein include the native sequence from S. cerevisiae shown in the sequence listing (SEQ ID NO: 25) which encodes the RPE amino acid sequence (SEQ ID NO: 26).
  • nucleic acid sequences encoding the alcohol dehydrogenase 1 enzyme of S. cerevisiae are provided.
  • the nucleotide sequences provided herein include the native sequence from S. cerevisiae shown in the sequence listing (SEQ ID NO: 49) which encodes the ADH1 amino acid sequence (SEQ ID NO: 50).
  • nucleic acid sequences encoding the alcohol dehydrogenase 3 enzyme of S. cerevisiae are provided.
  • the nucleotide sequences provided herein include the native sequence from S. cerevisiae shown in the sequence listing (SEQ ID NO: 73) which encodes the ADH3 amino acid sequence (SEQ ID NO: 74).
  • nucleic acid sequences encoding the transaldolase enzyme of S. cerevisiae are provided.
  • the nucleotide sequences provided herein include the native sequence from S. cerevisiae shown in the sequence listing (SEQ ID NO: 97) which encodes the TAL1 amino acid sequence (SEQ ID NO: 98).
  • nucleic acid sequences encoding PPP and/or fermentation enzymes with refined translational kinetics for expression in S. cerevisiae (SEQ ID NOS: 3, 27, 51, 75 and 99), E. coli (SEQ ID NOS: 9, 33, 57, 81 and 101), P. pastoris (SEQ ID NOS: 15, 39, 63, 87, 103), K. lactis (SEQ ID NOS: 21, 45, 69, 93 and 105) and Z. mobilis (SEQ ID NOS: 23, 47, 71, 95 and 107).
  • nucleotide sequences may be added 3' or 5' of any nucleic acid, for example, to facilitate hybridization of PCR primers, to add cloning restriction sites or other sites that facilitate cloning and/or expression. Accordingly, provided in the sequence listing are nucleic acid sequences with additional 5' and 3' cloning and/or PCR sequences, and which encode PPP and/or fermentation enzymes with refined translational kinetics for expression in S. cerevisiae (SEQ ID NOS: 5, 7, 29, 31, 53, 55, 77, 79), E. coli (SEQ ID NOS: 11, 13, 35, 37, 59, 61, 83, 85) and P. pastoris (SEQ ID NOS: 17, 19, 41, 43, 65, 67, 89, 91).
  • PPP and/or fermentation enzyme nucleic acid sequences with refined translational kinetics SEQ ID NOS: 3, 5, 7, 9, 11, 13, 15, 17,
  • PPP and/or fermentation enzyme-encoding DNA sequences wherein the encoded sequence has amino acid sequence identity with an original PPP and/or fermentation enzyme polypeptide and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
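One way to picture "over-represented" codon pairs is to compare how often each pair is observed in a set of coding sequences with how often it would be expected if adjacent codons were independent. The sketch below uses a simple independence model and a Poisson-style normalization; this statistic is an assumption made for illustration and is not necessarily the translational kinetics value or z score used in the disclosure.

```python
from collections import Counter
from math import sqrt

def codon_pair_z_scores(coding_sequences):
    """Return {codon_pair: z} for every pair observed in the input.

    Assumed model: expected count = total_pairs * f(codon1) * f(codon2),
    z = (observed - expected) / sqrt(expected). Sequences are assumed to
    be in frame, with lengths that are multiples of three.
    """
    codon_counts = Counter()
    pair_counts = Counter()
    for seq in coding_sequences:
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        codon_counts.update(codons)
        pair_counts.update(a + b for a, b in zip(codons, codons[1:]))

    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for pair, observed in pair_counts.items():
        f1 = codon_counts[pair[:3]] / total_codons
        f2 = codon_counts[pair[3:]] / total_codons
        expected = total_pairs * f1 * f2
        scores[pair] = (observed - expected) / sqrt(expected)
    return scores
```

Pairs with large positive scores under such a model would be candidates for the "highly-overrepresented" category; the cutoff is organism-specific and is not given here.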
  • the host organism is not human, E. coli or S. cerevisiae.
  • transketolase polynucleotide encodes a polypeptide having transketolase activity
  • Transketolase and like terms refer to the enzymatic conversion of D-xylulose-5-phosphate to glyceraldehyde-3-phosphate.
  • Transketolase catalyzes the transfer of a 2-carbon fragment from a 5-carbon keto sugar (D-xylulose-5-P) to a 5-carbon aldo sugar (D-ribose-5-P) to form a 7-carbon keto sugar (sedoheptulose-7-P) and a 3-carbon aldo sugar (glyceraldehyde-3-P).
  • a method for measuring transketolase activity is exemplified by a known method in which an enzymatic reaction is carried out and absorbance at 340 nm is monitored by spectrophotometry, as described in U.S. Patent No. 5,631,150, hereby incorporated by reference in its entirety.
  • a ribulose 5-phosphate epimerase polynucleotide encodes a polypeptide having ribulose 5-phosphate epimerase activity.
  • Ribulose 5-phosphate epimerase and like terms refer to the enzymatic interconversion of ribulose-5-phosphate and xylulose-5-phosphate.
  • Ribulose 5-phosphate epimerase interconverts the stereoisomers ribulose-5-phosphate and xylulose-5-phosphate.
  • an alcohol dehydrogenase 1 polynucleotide encodes a polypeptide having alcohol dehydrogenase 1 activity.
  • Alcohol dehydrogenase 1 and like terms refer to the enzymatic reduction of acetaldehyde to ethanol.
  • a method for measuring alcohol dehydrogenase 1 activity is exemplified by a known method in which an ADH enzymatic reaction is carried out and NADH absorbance at 340 nm is monitored by spectrophotometry, as described in Park et al. ((2006) J. Ind. Microbiol. Biotechnol. 33:1032-1036), hereby incorporated by reference in its entirety.
  • an alcohol dehydrogenase 3 polynucleotide encodes a polypeptide having alcohol dehydrogenase 3 activity.
  • Alcohol dehydrogenase and like terms refer to the enzymatic reduction of acetaldehyde to ethanol.
  • a method for measuring alcohol dehydrogenase 3 activity is exemplified by a known method in which an ADH enzymatic reaction is carried out and NADH absorbance at 340 nm is monitored by spectrophotometry, as described in Park et al. ((2006) J. Ind. Microbiol. Biotechnol. 33:1032-1036), hereby incorporated by reference in its entirety.
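As a rough illustration of how a 340 nm absorbance trace is turned into an activity figure, the change in absorbance per minute can be converted to a rate of NADH consumption or formation with the Beer-Lambert law. The sketch below assumes the commonly used NADH molar extinction coefficient of 6220 M^-1 cm^-1 at 340 nm and a 1 cm cuvette; the function name and parameters are hypothetical, and the cited assay protocols may use different conditions or unit definitions.

```python
NADH_EXTINCTION_M_CM = 6220.0  # M^-1 cm^-1 at 340 nm (commonly used value)

def nadh_rate_umol_per_min(delta_a340_per_min, reaction_volume_ml,
                           path_length_cm=1.0):
    """Convert an absorbance slope at 340 nm to micromoles of NADH per minute.

    Beer-Lambert: A = epsilon * c * l, so dc/dt = (dA/dt) / (epsilon * l).
    The sign of the slope is ignored; only the magnitude is reported.
    """
    molar_per_min = abs(delta_a340_per_min) / (NADH_EXTINCTION_M_CM * path_length_cm)
    litres = reaction_volume_ml * 1e-3
    return molar_per_min * litres * 1e6  # mol/min -> micromol/min
```

With this definition, a slope of 0.622 absorbance units per minute in a 1 ml reaction corresponds to about 0.1 micromol of NADH per minute.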
  • transaldolase polynucleotide encodes a polypeptide having transaldolase activity.
  • Transaldolase and like terms refer to the enzymatic removal of a three-carbon fragment from sedoheptulose-7-phosphate and the subsequent condensation of the fragment with glyceraldehyde-3-phosphate, forming fructose-6-phosphate and erythrose-4-phosphate.
  • a method for measuring transaldolase activity is exemplified by a known method in which an enzymatic reaction is carried out and absorbance at 340 nm is monitored by spectrophotometry, as described in U.S. Patent No.
  • polynucleotides provided herein encode polypeptides that have hydrolysis activity.
  • a PPP and/or fermentation enzyme-encoding polynucleotide comprising any of the DNA sequences provided herein can be transcribed and the resulting RNA translated to produce a polypeptide with PPP and/or fermentation enzyme activity.
  • nucleotide sequence is used to refer to any polynucleotide sequence.
  • DNA sequence is used herein to refer to the nucleotide sequences presented herein.
  • RNA equivalent nucleotide sequences are also described by DNA sequences presented herein.
  • an equivalent RNA sequence can be substituted for a DNA sequence by a T to U substitution (i.e., replacing thymine in the DNA sequence with uracil in the RNA sequence).
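The T to U substitution is a direct character replacement, as this minimal sketch shows; the function name is illustrative.

```python
def dna_to_rna(dna):
    """Return the RNA equivalent of a DNA coding-strand sequence (T -> U)."""
    return dna.upper().replace("T", "U")
```

For instance, `dna_to_rna("ATGGCT")` gives `"AUGGCU"`.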
  • the PPP and/or fermentation enzyme-encoding DNA sequence is adapted for expression in a heterologous host organism.
  • a DNA sequence that has been adapted for expression is a DNA sequence that has been inserted into an expression vector or otherwise modified to contain regulatory elements necessary for expression of the DNA in the host cell, positioned in such a manner as to permit expression of the DNA in the host cell.
  • regulatory elements required for expression include promoter sequences, transcription initiation sequences and, optionally, enhancer sequences.
  • a DNA sequence may be inserted into a plasmid vector adapted for expression in a bacterial cell, such as E. coli, or a eukaryotic cell, such as S. cerevisiae or other yeast, or any other host organism.
  • a heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • translational kinetics of an mRNA into polypeptide can be changed in order to achieve any of a variety of expression profiles. For example, translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all translational pauses predicted to occur within an autonomous folding unit of a nascent protein. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all over-represented codon pairs.
  • a pause or translation slowing codon pair can queue ribosomes back to the beginning of the coding sequence, thereby inhibiting further ribosome attachment to the message which can result in down-regulation of protein expression levels as the rate of translation initiation readily saturates and the slowest translation step time becomes rate limiting. It is also proposed herein that the presence of a pause or translational slowing codon pair can stall or detach a ribosome. It is also proposed herein that the presence of a pause or translational slowing codon pair can expose naked mRNA, which is then subject to message degradation.
  • Organism-specific codon usage and codon pair usage, and the presence of organism-specific pause sites result in gene translation that is highly adapted to the original host organism.
  • ribosomal pausing sites that may be functional in a human cell will typically be scrambled, random, or not appropriate or not recognized in the proper context in a bacterium or other non-native host.
  • a heterologous cDNA or synthetic polynucleotide has a random but high probability of inadvertently encoding a pause site somewhere, often leading to protein expression and/or activity failure.
  • Methods for refining translational kinetics of an mRNA into polypeptide can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2008/0046192, published on February 21, 2008, which is incorporated by reference herein in its entirety.
  • a polypeptide-encoding nucleotide can be designed to be predicted to be translated rapidly along its entire length.
  • some polypeptide-encoding nucleotides provided herein are those that have been engineered to remove all predicted pauses. Expression of such a polypeptide-encoding nucleotide can result in improved protein expression levels and improved levels of active and/or natively folded polypeptide expression.
  • a test of translation pausing or slowing as a result of codon pair usage can be performed by comparing a series of genes that have random pauses with modified genes where codon pairs predicted to cause translational pauses are replaced. Unmodified genes moved from their source organism and expressed in a heterologous host can have an altered set of codon pairs predicted to cause a translational pause or ribosomal slowing (e.g., an altered set of over-represented codon pairs), resulting in altered configuration and location of presumed pause sites.
  • translational kinetics of an mRNA into PPP and/or fermentation enzyme-encoding polypeptide can be changed in order to remove some or all translational pauses or replace other codon pairs that cause translational slowing, message instability and degradation, and poor protein translation, expression, and functional properties. While not intending to be limited to the following, it is believed that, for at least some proteins, reduction or elimination of translational pauses can serve to increase the expression level and/or quality and characteristics of the protein. Accordingly, by removing some or all translational pauses or replacing other codon pairs that cause translational slowing, the expression levels and/or quality of an expressed protein can be increased.
  • the PPP and/or fermentation enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels, higher enzymatic activity, greater protein stability, resistance to degradation, and increased solubility compared to the original native gene when expressed in a heterologous host.
  • PPP and/or fermentation enzyme-encoding nucleotide sequences that have been modified to have one or more translational pauses or slowing sites removed by modifying one or more codon pairs to a corresponding codon pair that is less likely to cause a translational pause or slowing. While in some embodiments it is preferred to replace all codon pairs predicted to cause a translational pause or slowing, in other embodiments, it is sufficient to replace a subset of codon pairs predicted to cause a translational pause or slowing. For example, expression levels can be increased by replacing at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more codon pairs predicted to cause a translational pause or slowing.
  • At least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% of codon pairs predicted to cause a translational pause or slowing are replaced by, for example, substituting different codon pairs that encode the same amino acids.
  • translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses predicted to occur within an autonomous folding unit of a protein.
  • an autonomous folding unit of a protein refers to an element of the overall protein structure that is self-stabilizing and often folds independently of the rest of the protein chain. Such autonomous folding units typically correspond to a protein domain.
  • expression of a gene in a heterologous host organism can result in translational pauses located in regions that inhibit protein expression and/or protein folding.
  • preserving or inserting a translational pause in a region predicted to separate autonomous folding units of a protein can result in improved folding and/or solubility of expressed proteins.
  • methods of changing translational kinetics of an mRNA into polypeptide by preserving, relative to native, or inserting one or more translational pauses in one or more regions predicted to separate autonomous folding units of a protein, thereby improving the folding and/or solubility of the expressed protein.
  • one step can include identifying predicted autonomous folding units of a protein.
  • Methods for identifying predicted autonomous folding units of a protein or protein domains are known in the art, and include alignment of amino acid sequences with protein sequences having known structures, and threading amino acid sequences against template protein domain databases.
  • Such methods can employ any of a variety of software algorithms in searching any of a variety of databases known in the art for predicting the location of protein domains.
  • the results of such methods will typically include an identification of the amino acids predicted to be present in a particular domain, and also can include an identification of the domain itself, and an identification of the secondary structural element, if any, in which each amino acid sequence of a domain is located.
  • In some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to remove a translational pause not present in the expression profile of the polypeptide in the native host organism. For example, there may be no codon pairs that are not predicted to cause a translational pause or slowing and that encode a corresponding pair of amino acids. In such instances, several options are available: the codon pair that is least likely to cause a translational pause or slowing can be selected; an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made.
  • One option in a computational method is to request human input in order to resolve the issue.
  • the computational method may, for example, involve the use of a computer that is programmed to request human input.
  • the computer may be programmed to make a selection, or combination of selections, such that multiple genes, or Ordered Gene Sets or small permutation libraries are designed and synthetically produced for use in expression analysis.
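The first fallback option above (selecting the codon pair least likely to cause a pause when no synonymous pair falls below the threshold) can be illustrated with a minimal sketch. The reduced codon table, the translational kinetics values in `EXAMPLE_SCORES`, and the function name are hypothetical and chosen only for illustration; they are not values or names taken from this disclosure.

```python
from itertools import product

# Reduced codon table covering only the amino acids used in this example.
CODONS = {
    "K": ["AAA", "AAG"],  # lysine
    "E": ["GAA", "GAG"],  # glutamate
}

# Hypothetical translational kinetics values (higher = more likely to pause).
EXAMPLE_SCORES = {
    "AAAGAA": 3.1,
    "AAAGAG": 2.4,
    "AAGGAA": 2.9,
    "AAGGAG": 3.5,
}

def least_pausing_pair(aa_pair, scores, threshold):
    """Return (codon_pair, is_below_threshold) for an amino acid pair.

    All synonymous codon pairs are enumerated and the one with the lowest
    score is returned. The flag reports whether that pair is actually below
    the pause threshold, so a caller can fall back to another option
    (amino acid change, human input, or no change) when it is not.
    """
    first, second = aa_pair
    candidates = [a + b for a, b in product(CODONS[first], CODONS[second])]
    best = min(candidates, key=lambda pair: scores[pair])
    return best, scores[best] < threshold
```

Returning the flag alongside the pair keeps the decision about which remaining option to apply outside the function, consistent with the text's allowance for requesting human input.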
  • When an amino acid insertion, deletion or mutation is made in order to change translational kinetics, it is preferable to select a change that is predicted not to substantially influence the final three-dimensional structure of the protein and/or the activity of the protein.
  • Such an amino acid insertion, deletion or mutation can include, for example, a conservative amino acid substitution such as the conservative substitutions shown in Table 1.
  • the substitutions shown are based on amino acid physical-chemical properties, and as such, are independent of organism.
  • the conservative amino acid substitution is a substitution listed under the heading of exemplary substitutions.
  • codon pairs predicted to cause a translational pause or slowing are treated equally
  • one or more different threshold levels can be established for differential treatment of codon pairs, where codon pairs above a highest threshold are the codon pairs most likely to cause a translational pause or slowing, and succeedingly lower codon pair threshold-based groups correspond to succeedingly lower likelihoods of the respective codon pairs causing a translational pause or slowing.
  • different numbers or percentages of codon pairs can be replaced for each of these different threshold-based groups. For example, 95% or more codon pairs above a highest threshold level can be replaced, while 90% or less of all codon pairs between that level and an intermediate threshold level are replaced.
  • codon pairs likely to cause a translational pause or slowing can be segregated into two or more different threshold-based groups, three or more different threshold-based groups, four or more different threshold-based groups, five or more different threshold-based groups, six or more different threshold-based groups, or more. Discussion of specific thresholds is provided elsewhere herein; however, typically the higher the threshold, the higher the likelihood of a translational pause or slowing caused by a codon pair with a translational kinetics value greater than the threshold. In embodiments in which codon pairs likely to cause a translational pause or slowing are segregated into two or more different threshold-based groups, different numbers or percentages of codon pairs can be replaced for each codon pair group.
  • codon pairs above a highest threshold are replaced, while the same or a lower percentage of codon pairs are replaced from codon pair groups corresponding to one or more lower thresholds.
  • all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair is located within an autonomous folding unit.
  • all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair can be replaced without requiring a change in the encoded polypeptide sequence.
  • all codon pairs above a highest threshold are replaced, while a codon pair above a first higher intermediate threshold is replaced only if the codon pair can be replaced without changing the encoded polypeptide sequence or with only a conservative change to the encoded polypeptide sequence, while a codon pair above a second lower intermediate threshold is replaced only if the codon pair can be replaced without requiring any change in the encoded polypeptide sequence.
  • an evaluation method can be used that determines the degree to which a codon pair should be replaced according to the translational kinetics value of the codon pair, where the degree to which the codon pair should be replaced can be counterbalanced by any of a variety of user-determined factors such as, for example, presence of the codon pair within or between autonomous folding units, and degree of change to the encoded polypeptide sequence.
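The three-tier replacement rule described above (highest threshold, a first higher intermediate threshold, and a second lower intermediate threshold) can be sketched as a small decision function. The default threshold values of 3.0, 2.0 and 1.5 and all names are assumptions made for illustration; the text leaves the actual thresholds to the practitioner.

```python
def should_replace(score, has_synonymous, has_conservative,
                   t_high=3.0, t_mid=2.0, t_low=1.5):
    """Decide whether a codon pair is replaced under a tiered policy.

    score            -- translational kinetics value of the codon pair
    has_synonymous   -- a replacement exists that leaves the polypeptide unchanged
    has_conservative -- a replacement exists with only a conservative change
    t_high, t_mid, t_low -- hypothetical thresholds, highest to lowest
    """
    if score > t_high:
        # Above the highest threshold: always replaced.
        return True
    if score > t_mid:
        # First (higher) intermediate tier: synonymous or conservative change allowed.
        return has_synonymous or has_conservative
    if score > t_low:
        # Second (lower) intermediate tier: only a change that keeps the sequence identical.
        return has_synonymous
    # Below every threshold: not treated as a predicted pause.
    return False
```

Further user-determined factors mentioned in the text, such as whether the pair lies within or between autonomous folding units, could be added as extra arguments in the same style.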
  • a translational kinetics value of a codon pair is a representation of the degree to which it is expected that a codon pair is associated with a translational pause. Methods of determining the translational kinetics value of a codon pair are discussed elsewhere herein. Such translational kinetics values can be normalized to facilitate comparison of translational kinetics values between species. In some embodiments, the translational value can be the degree of over-representation of a codon pair. An over-represented codon pair is a codon pair which is present in a protein-encoding sequence in higher abundance than would be expected if all codon pairs were statistically randomly abundant.
  • a codon pair predicted to cause a translational pause or slowing is a codon pair whose likelihood of causing a translational pause or slowing is at least one standard deviation above the mean translational kinetics value, where a particular translational kinetics value above the mean translational kinetics value in this context refers to a translational kinetics value indicative of a greater likelihood of causing translational pausing or slowing, relative to a mean translational kinetics value, and is not strictly limited to a particular mathematical relationship (e.g., greater than the mean) since the depiction of propensity to cause a translational pause by a translational kinetics value can be selected to be negative or positive, based on the selected implementation by one skilled in the art.
  • over-represented codon pairs may be graphically displayed as a positive function in a SpeedPlot™, as depicted in Figure 1, where a positive deflection or peak above a selected threshold describes a translational pause or slowing at the exact nucleotide location as defined by the abscissa.
  • a threshold for the translational kinetics value of codon pairs that are predicted to cause a translational pause or slowing can be set in accordance with the method and level of stringency desired by one skilled in the art.
  • a threshold value can be set to 5, or 3, or 2, or 1.5 standard deviations or more above the mean.
  • Typical threshold values can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean.
  • a plurality of thresholds can be applied in the herein-provided methods in segregating codon pairs into a plurality of groups. Each threshold of such a plurality can be a different value selected from 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean.
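As a rough illustration of applying a plurality of thresholds expressed in standard deviations above the mean, the sketch below standardizes a set of per-codon-pair values and assigns each pair to the highest threshold it meets. The input values, the use of the population standard deviation, and the function names are assumptions for illustration only; the disclosure does not prescribe this particular normalization.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw per-codon-pair values into standard deviations from the mean."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        raise ValueError("all values are identical; z scores are undefined")
    return {pair: (v - mu) / sigma for pair, v in values.items()}

def group_by_threshold(z, thresholds):
    """Assign each codon pair to the highest threshold it reaches.

    Returns a dict keyed by threshold (in standard deviations above the mean),
    plus the key None for pairs below every threshold.
    """
    ordered = sorted(thresholds, reverse=True)
    groups = {t: [] for t in ordered}
    groups[None] = []
    for pair, score in z.items():
        for t in ordered:
            if score >= t:
                groups[t].append(pair)
                break
        else:
            groups[None].append(pair)
    return groups
```

Each resulting group can then be given its own replacement percentage or rule, as described for the threshold-based groups above.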
  • translational kinetics of an mRNA into polypeptide can be changed to add or retain one or more translational pauses predicted to occur before, after or within an autonomous folding unit of a protein, or between autonomous folding units. While not intending to be limited to the following, it is proposed that translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure in the domain prior to further downstream translation and reorganization or reconfiguration of the growing polypeptide or domain. By modifying the translational kinetics of complex multi-domain proteins it may be possible to experimentally alter the time each domain has available to organize.
  • Folding of a heterologously-expressed gene having two or more independent domains can be altered by the presence of pause sites between the domains. Refolding studies indicate that the time it takes for a protein to settle into its final configuration may take longer than the translation of the protein. Pausing may allow each domain to partially organize and commit to a particular, independent fold. Other co-translational events, such as those associated with co-factors, protein subunits, protein complexes, membranes, chaperones, secretion, or proteolysis complexes, also can depend on the kinetics of the emerging nascent polypeptide. Pauses can be introduced by engineering one codon pair predicted to cause a translational pause or slowing, or two or more such codon pairs into the sequence to facilitate these co-translational interactions.
  • typically a translational pause is preserved, which refers to maintaining the same codon pair for a polypeptide-encoding nucleotide sequence that is expressed in the native host organism, or, when the polypeptide-encoding nucleotide sequence is heterologously expressed, changing the codon pair as appropriate to have a translational kinetics value comparable to or closest to the translational kinetics value of the native codon pair in the native host organism.
  • proximal codon pairs can be selected to be replaced in order to introduce a translational pause or slowing.
  • one of the 1, 2, 3, 4 or 5 most proximal codon pairs upstream (5' of the desired pause site) or one of the 1, 2, 3, 4 or 5 most proximal codon pairs downstream (3' of the desired pause site) can be chosen for replacement to introduce the translational pause or slowing.
  • the selected codon pair for replacement to introduce the translational pause or slowing is the codon pair closest to the originally desired codon pair location of the translational pause or slowing, provided the desired translational pause or slowing can be attained (e.g., 1 codon pair upstream or downstream is typically selected instead of 2 codon pairs upstream or downstream, provided the desired translational pause or slowing can be attained).
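The nearest-first selection of a proximal codon pair can be sketched as a simple search outward from the desired pause site. The function names and the `can_pause` callback are hypothetical, and checking the upstream neighbor before the downstream neighbor at equal distance is an arbitrary tie-break assumed here, not something the text specifies.

```python
def proximal_candidates(site, n_pairs, max_offset=5):
    """Yield codon-pair indices near `site`, nearest first.

    site       -- index of the originally desired pause location
    n_pairs    -- total number of codon pairs in the sequence
    max_offset -- how many pairs upstream/downstream to consider (up to 5 in the text)
    """
    for distance in range(1, max_offset + 1):
        # Upstream (5') neighbor first, then downstream (3'); the order is an assumption.
        for pos in (site - distance, site + distance):
            if 0 <= pos < n_pairs:
                yield pos

def pick_pause_position(site, n_pairs, can_pause, max_offset=5):
    """Return the closest position where the desired pause can be attained, or None.

    can_pause -- caller-supplied predicate reporting whether a suitable
                 replacement codon pair exists at a given position
    """
    for pos in proximal_candidates(site, n_pairs, max_offset):
        if can_pause(pos):
            return pos
    return None
```

A `None` result would correspond to the case where no proximal pair works and another route, such as a conservative amino acid substitution, has to be considered.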
  • a translational pause or slowing can be introduced by selecting a replacement codon pair encoding a conservative amino acid substitution, such as the conservative substitutions shown in Table 1.
  • replacement of a proximal codon pair to introduce a translational pause or slowing is preferred over replacement of a codon pair resulting in a change in the encoded amino acid sequence.
  • graphical displays of translational kinetics values of one or more proteins can be used to provide information to assist in the selection of a translational pause or slowing to preserve or insert in a redesigned polypeptide-encoding nucleotide sequence.
  • graphical displays of translational kinetics values can permit, for example, alignment of homologous proteins from different species and an identification, based on this alignment, of predicted translational pause or slowing sites that are conserved in the aligned proteins.
  • Such predicted translational pause or slowing sites can be preserved or inserted in a redesigned polypeptide-encoding nucleotide sequence.
  • regions between autonomous folding units in one or more proteins within a particular species can be graphically examined for the presence or absence of predicted pause sites.
  • Such graphical display methods can result in an identification of a region between autonomous folding units in which a translational pause or slowing is desirably preserved in a redesigned polypeptide-encoding sequence.
  • Methods for identifying and selecting conserved translational pauses can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007.
  • the codon pair translation kinetics values can be compared with a database of related gene sequences and conserved pause sites can be identified.
  • a synthetic gene can be designed wherein at least one conserved pause site is maintained to provide a synthetic gene with modified translation kinetics.
  • codon pairs are associated with translational pauses, and can thereby influence translational kinetics of an mRNA into polypeptide.
  • the methods of changing translational kinetics provided herein will typically be performed by modifying or designing one or more nucleotide sequences encoding a polypeptide to be expressed.
  • methods of modifying a gene or designing a synthetic nucleotide sequence encoding the polypeptide encoded by the gene are collectively referred to herein as redesigning a polypeptide-encoding gene sequence or redesigning a polypeptide-encoding nucleotide sequence.
  • Also included in the various embodiments provided herein are redesigned gene sequences encoding polypeptides that are not identical to the original gene.
  • a PPP and/or fermentation enzyme-encoding DNA sequence wherein the encoded sequence has at least a 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type PPP and/or fermentation enzyme polypeptide sequence as set forth in SEQ ID NO: 2, 26, 50, 74 or 98.
  • At least 1, 2 or 3 codon pairs of a polynucleotide sequence encoding the PPP and/or fermentation enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the DNA sequence is optimized for expression in S. cerevisiae, E. coli, P. pastoris, K. lactis or Z. mobilis.
  • a PPP and/or fermentation enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode a functional domain of the PPP and/or fermentation enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for functional domains are known in the art.
  • the replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. That is, the embodiments in which one or more codon pairs encoding amino acids of the cellulose binding domain have been replaced include embodiments in which the nucleotide sequence encoding the cellulose binding domain is changed to increase the predicted translational kinetics of translation of the cellulose binding domain.
  • incomplete translation, improper folding, or other protein expression shortcomings can result from the presence of one or more translational pauses in a heterologously-expressed polypeptide.
  • removal of one or more of these pauses can increase the speed of translation of the cellulose binding domain, and thereby increase the quantity of protein produced and/or increase the amount of stable, properly folded, active, and/or soluble protein produced.
  • the replacement codons (i.e., the codons added as replacements for the wild type codons) are typically predicted to be less likely to cause a translational pause.
  • the replacement codon can have a translational kinetics value in the heterologous host organism that is 95%, 90%, 85%, 80%, 75%, 70%, or less, than the translational kinetics value of the wild type codon pair when expressed in the heterologous host organism.
  • the replacement codon is selected to have a translational kinetics value similar to the translational kinetics value of the wild type codon pair in the native organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism can be no more than 250%, 200%, 150%, 125% or 100% of the z score for the wild type codon pair when expressed in the native organism.
  • a PPP and/or fermentation enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between domains of the PPP and/or fermentation enzyme, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the domains are known in the art and are described in detail below.
  • a transketolase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the thiamine diphosphate binding domain of the transketolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for thiamine diphosphate binding domains are known in the art.
  • the thiamine diphosphate binding domain includes at least amino acids 7-338 or 6-339.
  • a transketolase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the pyrimidine binding domain of the transketolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for pyrimidine binding domains are known in the art.
  • the pyrimidine binding domain includes at least amino acids 354-532 or 353-533.
  • a transketolase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the transketolase C-terminal domain of the transketolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for transketolase C-terminal domains are known in the art.
  • the C-terminal domain includes at least amino acids 546-655 or 545-656.
  • a transketolase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the thiamine diphosphate binding domain of the transketolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the thiamine diphosphate binding domain are described hereinabove.
  • a transketolase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the thiamine diphosphate binding domain and the pyrimidine binding domain of the transketolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the thiamine diphosphate binding and the pyrimidine binding domains are described hereinabove.
  • a transketolase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the pyrimidine binding domain and the transketolase C-terminal domain of the transketolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the pyrimidine binding and the transketolase C-terminal domains are described hereinabove.
  • the conserved amino acid sequence pattern and domain boundaries for ribulose-phosphate 3-epimerase domains are known in the art.
  • the ribulose-phosphate 3-epimerase domain includes at least amino acids 5-214 or 3-231.
  • the conserved amino acid sequence pattern and domain boundaries for the ribulose-phosphate 3-epimerase domain are described hereinabove.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the alcohol dehydrogenase GroES-like domain of the alcohol dehydrogenase 1, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase GroES-like domains are known in the art.
  • for the alcohol dehydrogenase 1 of SEQ ID NO: 50, the alcohol dehydrogenase GroES-like domain includes at least amino acids 31-139 or 30-140.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the alcohol dehydrogenase zinc binding domain of the alcohol dehydrogenase 1, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase zinc binding domains are known in the art.
  • the alcohol dehydrogenase zinc binding domain includes at least amino acids 170-310 or 169-311.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the alcohol dehydrogenase GroES-like domain of the alcohol dehydrogenase 1, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase GroES-like domain are described hereinabove.
  • an alcohol dehydrogenase 1-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the alcohol dehydrogenase GroES-like domain and the alcohol dehydrogenase zinc binding domain of the alcohol dehydrogenase 1, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase GroES-like and the alcohol dehydrogenase zinc binding domains are described hereinabove.
  • an alcohol dehydrogenase 3-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the alcohol dehydrogenase GroES-like domain of the alcohol dehydrogenase 3, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase GroES-like domains are known in the art.
  • the alcohol dehydrogenase GroES-like domain includes at least amino acids 58-167 or 57-168.
  • an alcohol dehydrogenase 3-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the alcohol dehydrogenase zinc binding domain of the alcohol dehydrogenase 3, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase zinc binding domains are known in the art.
  • the alcohol dehydrogenase zinc binding domain includes at least amino acids 197-338 or 196-339.
  • an alcohol dehydrogenase 3-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the alcohol dehydrogenase GroES-like domain of the alcohol dehydrogenase 3, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase GroES-like domain are described hereinabove.
  • an alcohol dehydrogenase 3-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the alcohol dehydrogenase GroES-like domain and the alcohol dehydrogenase zinc binding domain of the alcohol dehydrogenase 3, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase GroES-like and the alcohol dehydrogenase zinc binding domains are described hereinabove.
  • a transaldolase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the transaldolase domain of the transaldolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for transaldolase domains are known in the art.
  • the transaldolase domain includes at least amino acids 25-329.
  • transaldolase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the transaldolase domain of the transaldolase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the transaldolase domain are described hereinabove.
  • polypeptide-encoding nucleotide sequence provided herein to modify the translational kinetics of the polypeptide-encoding nucleotide sequence, where the polypeptide-encoding nucleotide sequence is altered such that one or more codon pairs have a decreased likelihood of causing a translational pause or slowing relative to the unaltered polypeptide-encoding nucleotide sequence.
  • one or more nucleotides of a polypeptide-encoding nucleotide sequence can be changed such that a codon pair containing the changed nucleotides has a translational kinetics value indicative of a decreased likelihood of causing a translational pause or slowing relative to the unchanged polypeptide-encoding nucleotide sequence.
  • Although the redesigned polypeptide-encoding nucleotide sequence need not possess a high degree of identity to the polypeptide-encoding nucleotide sequence of the original gene, in some embodiments the redesigned polypeptide-encoding nucleotide sequence will have at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% nucleotide identity with the polypeptide-encoding nucleotide sequence of the original gene.
  • an original gene refers to a gene for which codon pair refinement is to be performed; such original genes can be, for example, wild type genes, native genes, naturally occurring mutant genes, other mutant genes such as site-directed mutant genes or engineered or completely synthetic genes.
  • the polynucleotide sequence will be completely synthetic, and will bear much lower identity with the original gene, e.g., no more than 90%, 80%, 70%, 60%, 50%, 40%, or lower.
  • polypeptide-encoding nucleotide sequences can be redesigned to be convenient to work with and specifically tailored to a particular host and vector system of choice.
  • the resulting sequence can be designed to: (1) reduce or eliminate translational problems caused by inappropriate ribosome pausing, such as those caused by over-represented codon pairs or other codon pairs with translational values predictive of a translational pause; (2) have codon usage refined to avoid over-reliance on rare codons; (3) reduce in number or remove particular restriction sites, splice sites, internal Shine-Dalgarno sequences, or other sites that may cause problems in cloning or in interactions with the host organism; or (4) have controlled RNA secondary structure to avoid detrimental translational termination effects, translation initiation effects, or RNA processing, which can arise from, for example, RNA self-hybridization.
  • this sequence also can be designed to avoid oligonucleotides that mis-hybridize, resulting in genes that can be assembled from refined oligonucleotides that by thermodynamic necessity only pair up in the desired manner, using methods known in the art, as exemplified in U.S. Patent Publication No. 2005/0106590, which is hereby incorporated by reference in its entirety.
  • polypeptide-encoding nucleotide sequence it is not possible to modify the polypeptide-encoding nucleotide sequence to suitably modify the translational kinetics of the mRNA into polypeptide without modifying the amino acid sequence of the encoded polypeptide.
  • an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made.
  • the change is preferably predicted to not substantially influence the final three-dimensional structure of the protein and/or the activity of the protein.
  • Such non-identical polypeptides can vary by containing one or more insertions, deletions and/or mutations.
  • polypeptide sequence can vary according to the purpose of the change; typically, such a change results in a polypeptide that is at least 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical to the wild type polypeptide sequence.
  • the sequence of the polynucleotide can be generated, optionally in conjunction with optimization of a plurality of parameters where one such parameter can be codon pair usage, where the resultant polynucleotide can be prepared by assembly of a plurality of oligonucleotides sufficiently small to be synthesized by known oligonucleotide synthetic methods.
  • Methods known in the art for optimizing multiple parameters in synthetic nucleotide sequences can be applied to optimizing the parameters recited in the present claims. Such methods may advantageously include those exemplified in U.S. Patent App. Publication No. 2005/0106590, U.S. Patent App. Publication No. 2007/0009928, and R. H.
  • an exemplary method for generating a sequence can also include dividing the desired sequence into a plurality of partially overlapping segments; optimizing the melting temperatures of the overlapping regions of each segment to disfavor hybridization to the overlapping segments which are non-adjacent in the desired sequence; allowing the overlapping regions of single stranded segments which are adjacent to one another in the desired sequence to hybridize to one another under conditions which disfavor hybridization of non-adjacent segments; and filling in, ligating, or repairing the gaps between the overlapping regions, thereby forming a double-stranded DNA with the desired sequence.
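The first step of this procedure, dividing a sequence into partially overlapping segments and inspecting the melting temperature of each overlap, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the fixed segment length and overlap, and the use of the Wallace rule (2 °C per A/T, 4 °C per G/C, a rough estimate for short oligonucleotides) are choices made here and are not taken from the source, which optimizes the overlaps rather than merely reporting them.

```python
def split_overlapping(seq, seg_len=40, overlap=20):
    """Cut seq into segments of seg_len that share `overlap` bases with the next one.

    Hypothetical helper: fixed-length tiling is an assumption, not the source's method.
    """
    step = seg_len - overlap
    segments, start = [], 0
    while True:
        segments.append(seq[start:start + seg_len])
        if start + seg_len >= len(seq):
            break
        start += step
    return segments


def wallace_tm(oligo):
    """Rough melting temperature in deg C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def overlap_tms(segments, overlap):
    """Estimated melting temperature of each overlap between adjacent segments."""
    tms = []
    for a, b in zip(segments, segments[1:]):
        # The shared region is the tail of `a`, truncated if `b` is a short final segment.
        shared = a[len(a) - overlap:][:len(b)]
        tms.append(wallace_tm(shared))
    return tms
```

A design loop would adjust segment boundaries until adjacent overlaps have similar melting temperatures and non-adjacent overlaps are unlikely to hybridize; that optimization is not shown here.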
  • This process can be performed manually or can be automated, e.g., in a general purpose digital computer.
  • the search of possible codon assignments is mapped into an anytime branch and bound computerized algorithm developed for biological applications.
  • a synthetic nucleotide sequence for the polynucleotides provided herein, where the synthetic nucleotide sequence also is typically designed to have desirable translational kinetics properties, such as the removal of some or all codon pairs predicted to result in a translational pause or slowing.
  • Such design methods include determining a set of partially overlapping segments with optimized melting temperatures, and determining the translational kinetics of the synthetic sequence, where if it is desired to change the translational kinetics of the synthetic gene, the sequences of the overlapping segments are modified and refined in order to approximate the desired translational kinetics while still possessing acceptable hybridization properties. In some embodiments, this process is performed iteratively.
  • a criterion is established for selecting codon pairs having high translational kinetics values to be replaced with codon pairs having lower translational kinetics values, unless a codon pair of this group is the site of a planned pause.
  • the top 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% of codon pairs ranked by translational kinetics values can be replaced by codon pairs having lower translational kinetics values, such as translational kinetics value below a user defined level that can be, for example, a translational kinetics value equal to or below the translational kinetics values of codon pairs not in the top selected percentage, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced).
  • all codon pairs above a user-selected translational kinetics value such as more than 5, 4.5, 4, 3.5, 3, 2.5 or 2 standard deviations above the mean translational kinetics value can be replaced by codon pairs having lower translational kinetics values, such as translational kinetics value below a user defined level that can be, for example, a translational kinetics value that is 4, 3.5, 3, 2.5, 2, 1.5 or 1 standard deviations less than the mean translational kinetics value, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced).
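The threshold criterion described above can be sketched as a simple scan. The sketch assumes that translational kinetics values are already available as z scores (standard deviations from the mean) in a dictionary keyed by the six-nucleotide codon pair; the names `codon_pairs`, `flag_pairs`, `pair_z` and `planned_pauses` are hypothetical and do not appear in the source.

```python
def codon_pairs(seq):
    """Return the in-frame, overlapping codon pairs of a coding sequence."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [a + b for a, b in zip(codons, codons[1:])]


def flag_pairs(seq, pair_z, threshold=3.0, planned_pauses=frozenset()):
    """List (index, pair) for codon pairs whose z score exceeds the threshold.

    pair_z: assumed mapping from codon pair to translational kinetics value,
            expressed in standard deviations above the mean (missing pairs count as 0).
    planned_pauses: pair indices deliberately left untouched.
    """
    flagged = []
    for index, pair in enumerate(codon_pairs(seq)):
        if index in planned_pauses:
            continue  # a planned pause is not necessarily replaced
        if pair_z.get(pair, 0.0) > threshold:
            flagged.append((index, pair))
    return flagged
```

The top-percentage criterion could be handled the same way by first converting the chosen percentage into a cutoff value over the whole table.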
  • polynucleotide sequence design methods provided herein can be employed where a plurality of properties of the polynucleotide sequences can be refined in addition to codon pair usage properties, where such properties can include, but are not limited to, melting temperature gap between oligonucleotides of synthetic gene, average codon usage, average codon pair chi-squared (e.g., z score), worst codon usage, worst codon pair (e.g., z score), maximum usage in adjacent codons, Shine-Dalgarno sequence (for E. coli expression), occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, user-prohibited sequences (e.g., other restriction sites), codon usage of a specific codon above user-specified limit, and out-of-frame stop codons (framecatchers).
  • additional properties that can be considered in a process of designing a polynucleotide sequence include, but are not limited to, occurrences of RNA splice sites, occurrences of polyA sites, and occurrence of ribosome binding sequence.
  • a process of designing a polynucleotide sequence can include constraints including, but not limited to, minimum melting temperature gap between oligonucleotides of synthetic gene, minimum average codon usage, maximum average codon pair chi-squared (z score), minimum absolute codon usage, maximum absolute codon pair (z score), minimum maximum usage in adjacent codons, no Shine-Dalgarno sequence (for E.
  • additional constraints can include, but are not limited to, minimum occurrences of RNA splice sites, minimum occurrences of polyA sites, and occurrence of ribosome binding sequence.
  • a process of designing a polynucleotide sequence can include preferences including, but not limited to, prefer high average codon usage, prefer low average codon pair chi-squared, prefer larger melting temperature gap, prefer more out of frame stop codons (framecatchers), and optionally prefer evenly distributed codon usage.
  • Any of a variety of nucleotide sequence refinement/optimization methods known in the art can be used to refine the polynucleotide sequence according to the codon pair usage properties, and according to any of the additional properties specifically described above, or other properties that are refined in nucleotide sequence redesign methods known in the art.
  • a branch and bound method is employed to refine the polynucleotide sequence according to codon pair usage properties and at least one additional property, such as codon usage.
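To make the idea concrete, a minimal depth-first branch and bound over synonymous codon choices is sketched below. It is a plain textbook variant, not the specific algorithm referred to in the source. The penalty tables `codon_cost` and `pair_cost`, the `codons` lookup and the function name are hypothetical, and all penalties are assumed to be non-negative so that the lower bound is valid.

```python
def branch_and_bound(protein, codons, codon_cost, pair_cost):
    """Find the codon assignment with the lowest total penalty.

    protein:    amino acid string
    codons:     amino acid -> list of synonymous codons (assumed input)
    codon_cost: codon -> non-negative penalty (e.g. high for rare codons)
    pair_cost:  codon pair -> non-negative penalty (missing pairs cost 0)
    """
    # Lower bound for the remaining positions: cheapest codon at each position,
    # ignoring pair penalties (valid because pair penalties are never negative).
    cheapest = [min(codon_cost[c] for c in codons[aa]) for aa in protein]
    suffix = [0.0] * (len(protein) + 1)
    for i in range(len(protein) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + cheapest[i]

    best = [float("inf"), None]  # incumbent cost and assignment

    def visit(i, chosen, cost):
        if cost + suffix[i] >= best[0]:
            return  # bound: this branch cannot beat the incumbent
        if i == len(protein):
            best[0], best[1] = cost, list(chosen)
            return
        # Branch: try cheaper codons first so good incumbents are found early.
        for codon in sorted(codons[protein[i]], key=lambda c: codon_cost[c]):
            step = codon_cost[codon]
            if chosen:
                step += pair_cost.get(chosen[-1] + codon, 0.0)
            chosen.append(codon)
            visit(i + 1, chosen, cost + step)
            chosen.pop()

    visit(0, [], 0.0)
    return best[0], best[1]
```

Because the incumbent is updated as soon as any complete assignment is found, the search can be stopped early and still return a usable, if not optimal, sequence.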
  • the methods provided herein can further include analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that stop codons are added to at least one said frame shift.
  • the generating step further includes analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that one or more stop codons in one, two or three reading frames are added downstream of polypeptide-encoding region of the nucleotide sequence.
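The frame-shift analysis can be illustrated by counting stop codons in the two shifted reading frames of a candidate sequence. This is a sketch only: the function name is hypothetical, and the standard stop codons TAA, TAG and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def out_of_frame_stops(seq):
    """Count stop codons in the +1 and +2 reading frames of a coding sequence."""
    counts = {}
    for shift in (1, 2):
        frame = seq[shift:]
        # Step through complete codons only; a trailing partial codon is ignored.
        codons = (frame[i:i + 3] for i in range(0, len(frame) - 2, 3))
        counts[shift] = sum(codon in STOP_CODONS for codon in codons)
    return counts
```

For example, `ATGCTGAAA` and `ATGCTCAAA` both encode Met-Leu-Lys, but only the first contains an out-of-frame stop (TGA in the +1 frame), so a design that prefers framecatchers would favor it, other properties being equal.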
  • methods for redesigning a polypeptide-encoding gene for expression in a host organism, by providing a data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide.
  • Also provided herein are methods for redesigning a polypeptide-encoding gene for expression in a host organism by providing a first data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a second data set representative of at least one additional desired property of the synthetic gene, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, both (i) codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the first data set, and (ii) nucleotides that provide a desired property, with reference to the second data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide.
  • a branch and bound method is employed to refine the polypeptide-encoding nucleotide sequence according to codon pair usage properties of the first data set and according to the properties of the second data set.
  • the second data set contains codon preferences representative of codon usage by the host organism, including the most common codons used by the host organism for a given amino acid.
  • a PPP and/or fermentation enzyme-encoding DNA sequence wherein the encoded sequence has at least a 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type PPP and/or fermentation enzyme polypeptide sequence as set forth in the sequence listing.
  • the polynucleotide provided herein is adapted for expression in a heterologous host organism.
  • a heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • At least 1, 2 or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
  • a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than a designated threshold, wherein a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value.
  • a PPP and/or fermentation enzyme-encoding DNA sequence having at least a 75% sequence identity with an original PPP and/or fermentation enzyme polypeptide sequence as set forth in the sequence listing and adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organisms are selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E.
  • the methods provided herein can include analyzing the candidate polynucleotide sequence to confirm that no codon pairs are predicted to cause a translational pause in the host organism by more than a designated threshold. As described elsewhere herein, the likelihood that a particular codon pair will cause translational pausing or slowing in an organism (or the relative predicted magnitude thereof) can be represented by a translational kinetics value.
  • the translational kinetics value can be expressed in any of a variety of manners in accordance with the guidance provided herein.
  • a translational kinetics value can be expressed in terms of the mean translational kinetics value and the corresponding standard deviation for all codon pairs in an organism.
  • the translational kinetics value for a particular codon pair can be expressed in terms of the number of standard deviations that separate the translational kinetics value of the codon pair from the mean translational kinetics value.
  • a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value.
  • the methods provided herein in addition to generating a candidate nucleotide sequence according to codon pair usage properties, also include generating a candidate nucleotide sequence according to codon usage.
  • As is known in the art, different organisms can have different preferences for the three-nucleotide codon sequence encoding a particular amino acid. As a result, translation can often be improved by using the most common three-nucleotide codon sequence encoding a particular amino acid.
  • some methods provided herein also include generating a candidate nucleotide sequence such that codon utilization is non-randomly biased in favor of codons most commonly used by the host organism.
  • the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize the predicted translational kinetics. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, the conflict is resolved by selecting the nucleotide sequence predicted to be translated more rapidly, for example, due to fewer predicted translational pauses.
  • the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize codon pair usage preferences. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, codon pair usage will be accorded more weight in order to resolve the conflict between the more than one possible nucleotide sequences.
  • the methods provided herein can include identifying at least one instance of a conflict between selecting common codons and avoiding codon pairs predicted to cause a translational pause; in such instances, the conflict is resolved in favor of avoiding codon pairs predicted to cause a translational pause.
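The conflict rule described above, where pause avoidance wins over codon frequency, can be sketched as a per-position choice. The names `choose_codon`, `usage` and `pair_z`, and the default threshold of 3 standard deviations, are illustrative assumptions rather than values fixed by the source.

```python
def choose_codon(prev_codon, candidates, usage, pair_z, pause_threshold=3.0):
    """Pick a codon for the next amino acid given the codon already placed before it.

    candidates: synonymous codons for the amino acid
    usage:      codon -> relative usage in the host (assumed input)
    pair_z:     codon pair -> z score; missing pairs are treated as 0 (assumed input)
    """
    ranked = sorted(candidates, key=lambda c: usage[c], reverse=True)
    if prev_codon is None:
        return ranked[0]  # first codon: no pair to consider
    for codon in ranked:
        # Take the most common codon that does not create a predicted pause.
        if pair_z.get(prev_codon + codon, 0.0) < pause_threshold:
            return codon
    # Every option is predicted to pause: fall back to the lowest-scoring pair.
    return min(ranked, key=lambda c: pair_z.get(prev_codon + c, 0.0))
```

A greedy left-to-right pass like this is easy to follow but can miss better global assignments, because each choice also fixes the first half of the following codon pair.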
  • Some embodiments provided herein include generating a candidate polynucleotide sequence encoding the polypeptide sequence, the candidate polynucleotide sequence having a non-random codon pair usage, such that the codon pairs encoding any particular pair of amino acids have the lowest translational kinetics values.
  • the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the encoded amino acid sequence is not altered.
  • the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the three dimensional structure of the encoded polypeptide is not substantially altered.
  • the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that no more than conservative amino acid changes are made to the encoded polypeptide.
  • the methods provided herein can further include a step of refining or altering the candidate polynucleotide sequence in accordance with a second nucleotide sequence property to be refined.
  • the methods further include generating or refining a candidate polynucleotide sequence encoding a polypeptide sequence such that the candidate polynucleotide sequence has a non-random codon usage, where the most common codons used by the host organism are over-represented in the candidate polynucleotide sequence.
  • the methods can include refining or altering the candidate polynucleotide sequence in accordance with any of a variety of additional properties provided herein, including but not limited to, melting temperature gap between oligonucleotides of synthetic gene, Shine-Dalgarno sequence, occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, or any other user-prohibited sequences. Further, any of a variety of combinations of these properties can be additionally included in the nucleotide sequence refinement methods provided herein.
  • the method provided herein can further include an evaluation step in which after the candidate polynucleotide sequence is altered, the sequence is compared with at least a portion of a data set of a property against which the sequence was refined.
  • the candidate nucleotide sequence can be compared to each property considered in the refinement, and, if the values for all properties are deemed to be acceptable or desired, no further sequence alteration is required. If the values for fewer than all properties are deemed to be acceptable or desired, the candidate nucleotide sequence can be subjected to further sequence alteration and evaluation.
  • sequence alteration steps of methods provided herein can be performed iteratively. That is, one or more steps of altering the nucleotide sequence can be performed, and the candidate nucleotide sequence can be evaluated to determine whether or not further sequence alteration is necessary and/or desirable. These steps can be repeated until values for all properties are deemed to be acceptable or desired, or until no further improvement can be achieved.
Determination of translational kinetics values for codon pairs
  • the methods and sequences provided herein include determination and use of translational kinetics values for codon pairs. As provided herein, such a translational kinetics value can be calculated and/or empirically measured, and the final translational kinetics value used in graphical displays and methods of predicting translational kinetics can be a refined value resultant from two or more types of codon pair translational kinetics information.
  • codon pair translational kinetics information that can be used in refining or replacing a translational kinetics value for a codon pair include, for example, values of observed versus expected codon pair frequencies in a particular organism, normalized values of observed versus expected codon pair frequencies in a particular organism, the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species, the degree to which observed versus expected codon pair frequency values are conserved at predicted pause sites such as boundaries between autonomous folding units in related proteins across two or more species, the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, and empirical measurement of translational kinetics for a codon pair.
  • the values of observed versus expected codon pair frequencies in a host organism can be determined by any of a variety of methods known in the art for statistically evaluating observed occurrences relative to expected occurrences. Regardless of the statistical method used, this typically involves obtaining codon sequence data for the organism, for example, on a gene-by-gene basis. In some embodiments, the analysis is focused only on the coding regions of the genome. Because the analysis is a statistical one, a large database is preferred. Initially, the total number of codons is determined and the number of times each of the 61 non-terminating codons appears is determined.
  • the expected frequency of each of the 3721 ($61^2$) possible non-terminating codon pairs is calculated, typically by multiplying together the frequencies with which each of the component codons appears.
  • This frequency analysis can be carried out on a global basis, analyzing all of the sequences in the database together; however, it is typically done on a local basis, analyzing each sequence individually. This will tend to minimize the statistical effect of an unusually high proportion of rare codons in a sequence.
  • the expected number of occurrences of each codon pair is calculated by, for example, multiplying the expected frequency by the number of pairs in the sequence. This information can then be added to a global table, and each next succeeding sequence can be analyzed in like manner.
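The per-sequence ("local") calculation and its accumulation into a global table can be sketched as follows. The sketch assumes each input is an in-frame coding region with its terminal stop codon already removed; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def split_codons(seq):
    """Split a coding sequence into complete codons, dropping any partial tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def pair_counts(seq):
    """Observed and expected codon pair counts for one sequence."""
    codons = split_codons(seq)
    n_pairs = len(codons) - 1
    # Codon frequencies are taken from this sequence only (local basis).
    freq = {c: k / len(codons) for c, k in Counter(codons).items()}
    observed = Counter(a + b for a, b in zip(codons, codons[1:]))
    # Expected pair frequency is the product of the component codon frequencies;
    # multiplying by the number of pairs gives the expected number of occurrences.
    expected = {a + b: freq[a] * freq[b] * n_pairs for a in freq for b in freq}
    return observed, expected


def global_table(sequences):
    """Sum the per-sequence observed and expected counts over all sequences."""
    total_obs, total_exp = Counter(), defaultdict(float)
    for seq in sequences:
        observed, expected = pair_counts(seq)
        total_obs.update(observed)
        for pair, value in expected.items():
            total_exp[pair] += value
    return total_obs, dict(total_exp)
```

Because the expected counts are built from each sequence's own codon frequencies, a gene that is unusually rich in rare codons does not inflate the apparent over-representation of pairs made from those codons.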
  • the values of observed versus expected codon pair frequencies are chi-squared values, such as chi-squared 2 (chisq2) values or chi-squared 3 (chisq3) values.
  • Methods for calculating chi-squared values can be performed according to any method known in the art, as exemplified in U.S. Patent No. 5,082,767, which is incorporated by reference herein in its entirety.
  • the result of chi-squared calculations is a list of 3,721 non-terminating codon pairs, each with an expected and observed value, together with a value for chi-squared (chisq1): $\mathrm{chisq1} = (\mathrm{observed} - \mathrm{expected})^2 / \mathrm{expected}$
  • a new value chi-squared 2 (chisq2) can be calculated as follows. For each group of codon pairs encoding the same amino acid pair (i.e., 400 groups), the sums of the expected and observed values are tallied; any non-randomness in amino acid pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal. The new chi-squared, chisq2, is evaluated using these new expected values.
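The chisq1 formula and the group-wise rescaling that yields chisq2 can be sketched as follows. The function names and the `codon_to_aa` lookup are assumptions made for illustration; `observed` and `expected` are taken to be dictionaries keyed by six-nucleotide codon pairs.

```python
from collections import defaultdict


def chisq1(observed, expected):
    """(observed - expected)^2 / expected for every pair with a positive expectation."""
    return {pair: (observed.get(pair, 0) - e) ** 2 / e
            for pair, e in expected.items() if e > 0}


def chisq2(observed, expected, codon_to_aa):
    """Rescale expected counts within each amino acid pair group, then recompute.

    codon_to_aa: codon -> one-letter amino acid (assumed input).
    Returns (chisq2 values, rescaled expected counts).
    """
    def group(pair):
        return codon_to_aa[pair[:3]] + codon_to_aa[pair[3:]]

    sum_obs, sum_exp = defaultdict(float), defaultdict(float)
    for pair, e in expected.items():
        sum_obs[group(pair)] += observed.get(pair, 0)
        sum_exp[group(pair)] += e

    rescaled = {}
    for pair, e in expected.items():
        g = group(pair)
        if sum_exp[g] > 0:
            # Factor [sum observed / sum expected] makes the group totals equal.
            rescaled[pair] = e * sum_obs[g] / sum_exp[g]
    return chisq1(observed, rescaled), rescaled
```

After rescaling, any remaining deviation within a group reflects the choice among synonymous codon pairs rather than how often the two amino acids occur next to each other.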
  • a new value chi-squared 3 (chisq3) can be calculated. Correction is made only for those dinucleotides formed between adjacent codon pairs; any bias of dinucleotides within codons (codon triplet positions I-II and II-III) will directly affect codon usage and is, therefore, automatically taken into account in the underlying calculations.
  • the sums of the expected and observed values are tallied; any non-randomness in dinucleotide pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal.
  • the new chi-squared, chisq3, is evaluated using these new expected values.
  • Dinucleotide bias represents a smaller effect in yeast, and only a very minor one in E. coli.
  • Although the predominant dinucleotide bias in human is the well-known CpG deficit, other dinucleotides are also very highly biased. For example, there is a deficit of TA, as well as an excess of TG, CA and CT. Overall, the deficit of CpG contributes only 35% of the total dinucleotide bias in the human database, and 17% in yeast.
  • the values of observed versus expected codon pair frequencies in a host organism herein can be normalized. Normalization permits different sets of values of observed versus expected codon pair frequencies to be compared by placing these values on the same numerical scale. For example, normalized codon pair frequency values can be compared between different organisms, or can be compared for different codon pair frequency value calculations within a particular organism (e.g., different calculations based on input sequence information or based on different calculations such as chisq1 or chisq2 or chisq3). Typically, normalization results in codon pair frequency values that are described in terms of their mean and standard deviation from the mean.
  • An exemplary method for normalizing codon pair frequency values is the calculation of z scores.
  • the z score for an item indicates how far and in what direction that item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
  • the mathematics of the z score transformation are such that if every item in a distribution is converted to its z score, the transformed scores will have a mean of zero and a standard deviation of one.
  • the z score transformation can be especially useful when seeking to compare the relative standings of items from distributions with different means and/or different standard deviations. Z scores are especially informative when the distribution to which they refer is normal. In a normal distribution, the distance between the mean and a given z score cuts off a fixed proportion of the total area under the curve.
  • An exemplary method for determining z scores for codon pair chi-squared values is as follows: First, a list of all 3721 possible non-terminating codon pairs is generated. Second, for the $i$-th codon pair, the $i$-th chi-squared value is calculated, where the $i$-th chi-squared value is denoted $c_i$. The chi-squared value, $c_i$, is given the sign of (observed - expected), so that over-represented codon pairs are assigned a positive $c_i$ and under-represented codon pairs are assigned a negative $c_i$.
  • Third, the mean chi-squared value is calculated, where the mean is denoted $m$.
  • Fourth, the standard deviation of the chi-squared values is calculated, where the standard deviation is denoted $s$.
  • Fifth, a z score is calculated for each codon pair by subtracting the mean and then dividing by the standard deviation, wherein the $i$-th z score is denoted $z_i$, that is, $z_i = (c_i - m)/s$.
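The steps above can be sketched with the Python standard library. The source does not say whether the population or the sample standard deviation is intended; the population form (`pstdev`) is assumed here, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def signed_chi_squared(observed, expected):
    """Chi-squared value per codon pair, carrying the sign of (observed - expected)."""
    signed = {}
    for pair, e in expected.items():
        o = observed.get(pair, 0)
        chi = (o - e) ** 2 / e
        signed[pair] = chi if o >= e else -chi
    return signed


def z_scores(signed):
    """Convert signed chi-squared values c_i into z_i = (c_i - m) / s."""
    values = list(signed.values())
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    return {pair: (c - m) / s for pair, c in signed.items()}
```

Over-represented pairs end up with positive z scores and under-represented pairs with negative ones, and the full set of z scores has mean zero and unit standard deviation, which is what allows tables from different organisms to be compared on one scale.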
  • provided herein are methods of refining the predictive capability of a translational kinetics value of a codon pair in a host organism by providing an initial translational kinetics value based on the value of observed codon pair frequency versus expected codon pair frequency for a codon pair in a host organism, providing additional translational kinetics data for the codon pair in the host organism, and modifying the initial translational kinetics value according to the additional codon pair translational kinetics data to generate a refined translational kinetics value for the codon pair in the host organism.
  • the translational kinetics data that can be used to refine translational kinetics values and methods of modifying translational kinetics values according to such additional translational kinetics data to generate a refined translational kinetics value for a codon pair in a host organism are provided below.
  • translational kinetics data that can be used to refine translational kinetics values are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair.
  • Recurrence-based refinement of translational kinetics values is based on the investigation of multiple polypeptide-encoding nucleotide sequences to determine whether or not there are multiple occurrences of either codon pairs or predicted translational kinetics values in those sequences.
  • Recurrence-based refinement of translational kinetics can be performed using any of a variety of known sequence comparison methods consistent with the examples provided herein. For purposes of exemplification, and not for limitation, the following example of recurrence-based refinement of translational kinetics is provided.
  • the predicted translational kinetics value for a codon pair can be refined according to the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species.
  • related proteins are proteins having homologous amino acid sequences and/or similar three dimensional structures.
  • Related proteins having homologous amino acid sequences will typically have at least about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95% sequence identity.
  • Related proteins having similar three dimensional structures will typically share similar secondary structure topology and similar relative positioning of secondary structural elements; exemplary related proteins having three dimensional structures are members of the same SCOP-classified Family (see, e.g., Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540).
  • the observed versus expected codon pair frequency values for any given codon pair can vary from species to species. However, as provided herein, evolutionarily related proteins in different species will typically conserve some or all translational pause or slowing sites. Based on this, an observed conservation of one or more predicted translational pause or slowing sites in evolutionarily related proteins of different species can confirm or increase the likelihood that a translational pause or slowing site is a functional translational kinetics signal.
  • the codon pair located at the position on a protein that is confirmed as, or considered to have an increased likelihood of, containing an actual translational pause or slowing can itself be confirmed as being, or considered to have an increased likelihood of being, a functional translational kinetics signal.
  • a codon pair located at a position on a protein that is confirmed as not containing, or considered to have a decreased likelihood of containing, an actual translational pause or slowing, can itself be confirmed as not acting, or considered to have a decreased likelihood of acting, as a functional translational kinetics signal.
  • the predicted translational kinetics value for a codon pair can be refined according to the presence of the codon pair at a location predicted by methods other than codon pair frequency methods to contain a translational pause or slowing site.
  • a predicted location is a boundary location between autonomous folding units of a protein.
  • translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a secondary structural element of a protein and/or a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure by the nascent protein prior to further downstream translation, and thereby allowing each domain to partially organize and commit to a particular, independent fold.
  • codon pairs can be associated with translational pauses between autonomous folding units of a protein, where autonomous folding units can be secondary structural elements such as an alpha helix, or can be tertiary structural elements such as a protein domain.
  • the presence of a codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likelihood that the codon pair acts to pause or slow translation.
  • predicted translational kinetics data (e.g., data based on values of observed codon pair frequency versus expected codon pair frequency) can be modified according to the presence of the codon pair at a boundary location between autonomous folding units of a protein, which can increase the likelihood that the codon pair acts to pause or slow translation.
  • an over-represented codon pair that is present at a boundary location between autonomous folding units of a protein can be confirmed as acting as a translational pause or slowing codon pair.
  • a single observation of the codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likely translational pause or slowing properties of a codon pair.
  • typically a plurality of observations will be used to more accurately estimate the translational pause or slowing properties of a codon pair.
  • methods of using, for example, predicted boundary locations can be combined with methods that are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair in methods of refining a predicted translational kinetics value for a codon pair.
  • a protein present in two or more species can have conserved boundary locations between autonomous folding units of the protein, and recurrent presence of an over-represented codon pair at the boundary locations can confirm the likelihood of an actual translational pause at that boundary location, leading to confirmation, or increased likelihood, that the corresponding codon pair for the respective species acts as a translational pause or slowing codon pair.
  • two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of an over-represented codon pair at the boundary locations can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
  • Such recurrence-based methods also can be used to confirm or indicate increased likelihood that a non-over-represented codon pair (e.g., an under-represented codon pair or a represented-as-expected codon pair) acts as a translational pause or slowing codon pair.
  • two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of a non-over-represented codon pair at the boundary locations, particularly if no over-represented codon pair is present, can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
  • Such recurrence-based methods also can be used to confirm or indicate the likelihood that a codon pair, such as an over-represented codon pair, does not act as a translational pause or slowing codon pair.
  • two or more proteins of the same species can have boundary locations between autonomous folding units, and consistent absence of a non-over-represented codon pair at the boundary locations can confirm or indicate increased likelihood that the codon pair does not act as a translational pause or slowing codon pair.
  • the predicted translational kinetics value for a codon pair can be refined according to empirical measurement of translational kinetics for a codon pair.
  • the influence of a codon pair on translational kinetics can be experimentally measured, and these experimental measurements can be used to refine or replace the predicted translational kinetics values for a codon pair.
  • Several methods of experimentally measuring the translational kinetics of a codon pair are known in the art, and can be used herein, as exemplified in Irwin et al., J. Biol. Chem., (1995) 270:22801.
  • One such exemplary assay is based on the observation that a ribosome pausing at a site near the beginning of an mRNA coding sequence can inhibit translation initiation by physically interfering with the attachment of a new ribosome to the message, and, thus, the codon pair to be assayed can be placed at the beginning of a polypeptide-encoding nucleotide sequence and the effect of the codon pair on translational initiation can be measured as an indication of the ability of the codon pair to cause a translational pause.
  • Another such exemplary assay is based on the fact that the transit time of a ribosome through the leader polypeptide coding region of the leader RNA of the trp operon sets the basal level of transcription through the trp attenuator, and, thus, the codon pair to be assayed can be placed into a trpLep leader polypeptide codon region, and the level of expression can be inversely indicative of the translational pause properties of the codon pair, due to a faster translation causing formation of a stem-loop attenuator in the leader RNA, which results in transcriptional attenuation.
  • the methods provided herein for calculation of translational kinetics values can be applied to the native organism of the polypeptide of SEQ ID NOS: 2, 26, 50, 74 or 98, and also can be applied to a selected organism in which the polypeptide of SEQ ID NO: 2, 26, 50, 74 or 98, or a modification thereof, is to be heterologously expressed.
  • the nucleotide sequence information of an organism can be used to calculate chi-squared values in accordance with the methods provided herein, and the translational kinetics values can be based on these chi-squared values as well as on additional translational kinetics information provided herein, including, but not limited to, codon pairs conserved in domain boundaries and empirically measured translational kinetics for a codon pair.
  • Exemplary organisms for which translational kinetics values can be calculated and used to prepare a nucleotide sequence encoding a PPP and/or fermentation enzyme protein provided herein include Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (long-tailed monkey); M.
  • the translational kinetics data described herein can be combined in such a manner as to provide a refined translational kinetics value for a codon pair in a host organism.
  • Methods of combining predictive data to arrive at a refined predictive value are known in the art and can be used herein.
  • for example, Bayes' law states that P(H|D) = P(D|H) P(H) / P(D), where P(H|D), read "the probability of H given D," is the degree of belief in hypothesis H after the data D has been observed.
  • P(D) is constant for all H.
  • P(H) is identified with the degree of belief in hypothesis H before the data was observed.
  • P(D|H), read "the probability of D given H," is identified with how well hypothesis H predicts the observed data D.
  • an hypothesis H is that a given sequence feature, e.g., a given codon pair, has utility for translational kinetics engineering, e.g., creates a translational pause site.
  • when the data D consists of data items D1, D2, D3 and D4 that are treated as conditionally independent given H, P(D|H) = P(D1 & D2 & D3 & D4|H) = P(D1|H) P(D2|H) P(D3|H) P(D4|H).
  • an experimental measurement D1 that has been confirmed by replicate testing would have a very low probability of error, and therefore it would dominate the estimate if available.
  • P(Di is correct) and P(Di is not correct) can be estimated a priori by the correlation of Di with previous experimental measurements.
  • the terms P(Di|H) are obtained by observing whether or not hypothesis H is consistent with observed data item Di. More complex and powerful Bayesian approaches are also well known in the art. The fully general approach rewrites P(D
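The Bayesian combination of evidence described above can be illustrated with a short Python sketch. It assumes only two competing hypotheses (H and not-H) and conditional independence of the data items given each hypothesis. The function names and every probability in the usage note are hypothetical, and the reliability weighting in `reliability_weighted` is one possible way of using P(Di is correct), not a method stated in the text.

```python
def reliability_weighted(p_if_correct, p_if_incorrect, p_correct):
    """One possible use of P(Di is correct): mix the likelihood of data item Di
    under a correct and an incorrect measurement (law of total probability).
    Assumes the correctness of Di does not depend on the hypothesis."""
    return p_if_correct * p_correct + p_if_incorrect * (1.0 - p_correct)


def posterior(prior_h, likelihoods_h, likelihoods_not_h):
    """Return P(H|D) for data items assumed conditionally independent.

    likelihoods_h[i] is P(Di|H); likelihoods_not_h[i] is P(Di|not H).
    P(D) is obtained by summing over the two hypotheses H and not-H.
    """
    if len(likelihoods_h) != len(likelihoods_not_h):
        raise ValueError("one likelihood per data item is needed for each hypothesis")
    joint_h = prior_h
    joint_not_h = 1.0 - prior_h
    for p_h, p_not_h in zip(likelihoods_h, likelihoods_not_h):
        joint_h *= p_h
        joint_not_h *= p_not_h
    evidence = joint_h + joint_not_h  # P(D)
    if evidence == 0:
        raise ValueError("the data have zero probability under both hypotheses")
    return joint_h / evidence
```

For instance, starting from a prior of 0.5 that a codon pair creates a pause site, two data items that are each four times more likely under H than under not-H (0.8 versus 0.2) raise the posterior to about 0.94.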
  • the translational kinetics values for a codon pair can be refined by consideration of, for example, chi-squared value of observed versus expected codon pair frequency and the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, for example, at protein structure domain boundaries.
  • An over-represented codon pair which is present with above-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting higher predicted translational pause properties of the codon pair.
  • an over-represented codon pair which is present with below-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting lower predicted translational pause properties of the codon pair.
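One way to make "above-random" and "below-random" frequency at boundary locations concrete is to compare how often a codon pair occurs at boundary positions with how often it occurs overall. The following Python sketch does this; the function names, the input layout (each gene given as an in-frame coding sequence plus a set of 0-based codon pair indices treated as boundary locations), and the use of a simple frequency ratio are all assumptions made for illustration.

```python
def codon_pairs(seq):
    """Split an in-frame coding sequence into overlapping codon pairs (hexamers)."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [codons[i] + codons[i + 1] for i in range(len(codons) - 1)]


def boundary_enrichment(pair, genes):
    """Ratio of the pair's frequency at boundary positions to its overall frequency.

    genes is a list of (sequence, boundary_indices) tuples, where boundary_indices
    holds 0-based codon pair positions assumed to lie between folding units.
    A ratio above 1 suggests above-random presence at boundaries; below 1 suggests
    below-random presence. Returns None when the ratio is undefined.
    """
    hits_at_boundary = boundary_total = hits_overall = total = 0
    for seq, boundaries in genes:
        for index, current in enumerate(codon_pairs(seq)):
            total += 1
            is_hit = current == pair
            hits_overall += is_hit
            if index in boundaries:
                boundary_total += 1
                hits_at_boundary += is_hit
    if boundary_total == 0 or hits_overall == 0:
        return None
    return (hits_at_boundary / boundary_total) / (hits_overall / total)
```

In practice such a ratio would be computed over many genes, and a statistical test would be needed before treating a departure from 1 as meaningful.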
  • the translational kinetics values for a codon pair can be refined by consideration of, for example, experimentally measured translation step times in one species and the degree to which codon pairs that correspond to measured pause sites in the first species are conserved across homologous proteins in other species, for example, in a multiple sequence alignment.
  • when an over-represented codon pair in another species is aligned with above-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting higher predicted translational pause properties of that codon pair in the other species.
  • when an over-represented codon pair in another species is aligned with below-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting lower predicted translational pause properties of that codon pair in the other species.
  • translational kinetics values for codon pairs can be determined.
  • the translational kinetic values can be organized according to the likelihood of causing a translational pause or slowing based on any method known in the art.
  • the translational kinetic values for two or more codon pairs, up to all codon pairs, in an organism are determined, and the mean translational kinetics value and associated standard deviation are calculated. Based on this, the translational kinetics value for a particular codon pair can be described in terms of the multiple of standard deviations the translational kinetics value for the particular codon pair differs from the mean translational kinetics value. Accordingly, reference herein to mean translational kinetics values and standard deviations, whether or not applied to a particular expression of translational kinetics value, can be applied to any of a variety of expressions of translational kinetics values provided herein.
  • Such a graphical display provides a visual display of the predicted translational influence, including translational pause or slowing for numerous or all codon pairs of a polypeptide-encoding nucleotide sequence.
  • This visual display can be used in methods of modifying polypeptide-encoding nucleotide sequences in order to thereby modify the predicted translational kinetics of the mRNA into polypeptide in methods such as those provided herein.
  • the graphical displays can be used to identify one or more codon pairs to be modified in a polypeptide-encoding nucleotide sequence.
  • the graphical displays can be used in analyzing a polypeptide-encoding nucleotide sequence prior to modifying the polypeptide-encoding nucleotide sequence, or can be used in analyzing a modified polypeptide-encoding nucleotide sequence to determine, for example, whether or not further modifications are desired.
  • Methods for creating and using graphical displays can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007, which are incorporated by reference herein in their entireties.
  • graphical displays as described therein can be created to illustrate the translational kinetics of an original or redesigned polypeptide-encoding nucleotide sequence in the native or a heterologous organism, or to illustrate differences and/or similarities of translation kinetics of a polypeptide-encoding nucleotide sequence in which one or more codon pairs have been modified.
  • numerous normalized graphical displays can be created to illustrate differences and/or similarities of translation kinetics of a polypeptide-encoding nucleotide sequence when expressed in two or more different organisms.
  • the graphical displays can be created using translational kinetics values based on any of the methods for determining translational kinetics values provided herein or otherwise known in the art. For example, the displays can be based on chi-squared as a function of codon pair position, chi-squared 2 as a function of codon position, or chi-squared 3 as a function of codon pair position, translational kinetics values thereof, empirical measurement of translational pause of codon pairs in a host organism, estimated translational pause capability based on observed presence and/or recurrence of a codon pair at a predicted pause site, and variations and combinations thereof as provided herein.
  • the exact format of the graphical displays can take any of a variety of forms, and the specific form is typically selected for ease of analysis and comparison between plots.
  • the abscissa typically lists the position along the nucleotide sequence or polypeptide sequence, and can be represented by nucleotide position, codon position, codon pair position, amino acid position, or amino acid pair position.
  • the ordinate typically lists the translational kinetics value of the codon pair, such as, but not limited to, a translational kinetics value of codon pair frequency, including, but not limited to the z score of chisql, the z score of chisq2, the z score of chisq3, the empirically measured value, and the refined translational kinetics value.
  • the sequence position can be plotted along the ordinate and the translational kinetics value can be plotted along the abscissa.
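The layout described above, with position on one axis and translational kinetics value on the other, can be sketched without any plotting library. The Python example below builds the (position, value) series for a sequence and renders it as a rough text plot with position running down the page. The function names, the lookup table of z scores, the choice of 0.0 for codon pairs missing from the table, and the bar scaling are all hypothetical.

```python
def codon_pairs(seq):
    """Split an in-frame coding sequence into overlapping codon pairs (hexamers)."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [codons[i] + codons[i + 1] for i in range(len(codons) - 1)]


def profile(seq, z_by_pair, default=0.0):
    """Return [(codon pair position, value)] with 1-based positions.

    z_by_pair is a hypothetical table of translational kinetics values (for
    example z scores) for the chosen host; missing pairs get `default`,
    which is an assumption made only to keep the sketch short.
    """
    return [(i + 1, z_by_pair.get(pair, default))
            for i, pair in enumerate(codon_pairs(seq))]


def text_plot(points, scale=2.0):
    """Render one line per position; bar length grows with positive values."""
    lines = []
    for position, value in points:
        bar = "#" * max(0, round(value * scale))
        lines.append(f"{position:4d} {value:6.2f} {bar}".rstrip())
    return "\n".join(lines)
```

Long bars then stand out visually as candidate pause or slowing positions, which is the kind of inspection the graphical displays are meant to support.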
  • a set of graphical displays including at least a first graphical display and a second graphical display, are prepared. These sets of displays can be compared in order to determine the difference in predicted translational efficiency or translational kinetics of the two plots.
  • the plots can differ according to any of a variety of criteria. For example, each plot can represent a different polypeptide-encoding nucleotide sequence, each plot can represent a different host organism, each plot can represent differently determined translational kinetics values, or any combination thereof.
  • any number of different graphical displays can be compared in accordance with the methods provided herein, for example, 2, 3, 4, 5, 6, 7, 8 or more different graphical displays can be compared.
  • two plots will represent different polypeptide-encoding nucleotide sequences, the same sequence in different host organisms, or different sequences in different host organisms.
  • Comparison of different graphical displays can be used to analyze the predicted change in translational kinetics as a result of the difference represented by the graphical displays. For example, comparison of the same polypeptide-encoding nucleotide sequence in different host organisms can be used to analyze any predicted translational pauses that can be removed. Accordingly, provided herein are methods of analyzing translational kinetics of an mRNA into polypeptide in a host organism by comparing two graphical displays to understand or predict the differences in translational kinetics of the mRNA into polypeptide, where the differences in the graphical displays can be as a result of, for example, a difference in the polypeptide-encoding nucleotide sequence or a difference in the host organism.
  • a graphical display of the translational kinetics values of codon pairs for the original polypeptide-encoding nucleotide sequence in the heterologous host can be compared to a graphical display of the translational kinetics values of codon pairs for a modified polypeptide-encoding nucleotide sequence in the heterologous host, and it can be determined whether or not the modification to the polypeptide-encoding nucleotide sequence resulted in improved translational kinetics.
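The comparison of an original and a modified sequence in the same host can also be made numerically. The sketch below assumes that both sequences encode the same polypeptide, so their codon pair positions line up one to one, and that a position counts as a predicted pause when its value meets a cutoff. The function names and the cutoff of 3.0 are hypothetical; the text does not fix a threshold or define what counts as improved.

```python
def pause_positions(values, threshold=3.0):
    """Return the 1-based positions whose value meets or exceeds the threshold."""
    return {i for i, value in enumerate(values, start=1) if value >= threshold}


def compare_profiles(original, modified, threshold=3.0):
    """Summarize how predicted pause positions differ between two profiles.

    Both inputs are per-position translational kinetics values for sequences
    of equal length (same polypeptide, different codon choices).
    """
    if len(original) != len(modified):
        raise ValueError("profiles must cover the same codon pair positions")
    before = pause_positions(original, threshold)
    after = pause_positions(modified, threshold)
    return {
        "removed": sorted(before - after),
        "introduced": sorted(after - before),
        "retained": sorted(before & after),
    }
```

A modification that empties the "retained" list without adding entries to "introduced" has removed every predicted pause at the chosen cutoff; whether that is the desired outcome depends on which pauses the design intends to keep.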
  • the nucleic acid sequences provided herein can be present in a polynucleotide (e.g., DNA or RNA molecule).
  • the polynucleotides can be inserted into a replicable vector for cloning (e.g., amplification of the DNA) or for expression.
  • Various vectors are publicly available and are known in the art.
  • the vector can, for example, be in the form of a plasmid, cosmid, viral particle, or phage.
  • the appropriate nucleic acid sequence can be inserted into the vector by any of a variety of procedures known in the art.
  • Vector components can generally include, but are not limited to, one or more of a signal sequence, an origin of replication, one or more marker genes, an enhancer element, a promoter, and a transcription termination sequence. Construction of suitable vectors containing one or more of these components employs standard ligation techniques which are known to the skilled artisan.
  • the encoded polypeptide can be produced recombinantly not only directly, but also as a fusion polypeptide with a heterologous polypeptide, which can be, e.g., a signal sequence or other polypeptide having a specific cleavage site at the N-terminus of the mature protein or polypeptide.
  • the signal sequence can be a component of the vector, or it can be a part of the polynucleotide that is inserted into the vector.
  • the signal sequence can be a prokaryotic signal sequence selected, for example, from the group of the alkaline phosphatase, penicillinase, lpp, or heat-stable enterotoxin II leaders.
  • the signal sequence can be, e.g., the yeast invertase leader, alpha factor leader (including Saccharomyces and Kluyveromyces α-factor leaders, the latter described in U.S. Patent No. 5,010,182), or acid phosphatase leader, the C. albicans glucoamylase leader (EP 362,179 published 4 April 1990), or the signal described in WO 90/13646 published 15 November 1990.
  • mammalian signal sequences can be used to direct secretion of the protein, such as signal sequences from secreted polypeptides of the same or related species, as well as viral secretory leaders.
  • Both expression and cloning vectors contain a polynucleotide that permits the vector to replicate in one or more selected host cells. Such sequences are well known for a variety of bacteria, yeast, and viruses.
  • the origin of replication from the plasmid pBR322 is suitable for most Gram-negative bacteria, the 2μ plasmid origin is suitable for yeast, and various viral origins (SV40, polyoma, adenovirus, VSV or BPV) are useful for cloning vectors in mammalian cells.
  • Expression and cloning vectors will typically contain a selection gene, also termed a selectable marker.
  • Typical selection genes encode proteins that (a) confer resistance to antibiotics or other toxins, e.g., ampicillin, neomycin, methotrexate, or tetracycline, (b) complement auxotrophic deficiencies, or (c) supply critical nutrients not available from complex media, e.g., the gene encoding D-alanine racemase for Bacilli.
  • Suitable selectable markers for mammalian cells are those that enable the identification of cells competent to take up the polynucleotide-containing vector, such as DHFR or thymidine kinase.
  • An appropriate host cell when wild-type DHFR is employed is the CHO cell line deficient in DHFR activity, prepared and propagated as described by Urlaub et al., Proc. Natl. Acad. Sci. USA, 77:4216 (1980).
  • a suitable selection gene for use in yeast is the trp1 gene present in the yeast plasmid YRp7 [Stinchcomb et al., Nature, 282:39 (1979); Kingsman et al., Gene, 7:141 (1979); Tschemper et al., Gene, 10:157 (1980)].
  • the trp1 gene provides a selection marker for a mutant strain of yeast lacking the ability to grow in tryptophan, for example, ATCC No. 44076 or PEP4-1 [Jones, Genetics, 85:12 (1977)].
  • Expression and cloning vectors usually contain a promoter operably linked to the polynucleotide provided herein to direct mRNA synthesis. Promoters recognized by a variety of potential host cells are well known. Promoters suitable for use with prokaryotic hosts include the β-lactamase and lactose promoter systems [Chang et al., Nature, 275:615 (1978); Goeddel et al., Nature, 281:544 (1979)], alkaline phosphatase, a tryptophan (trp) promoter system [Goeddel, Nucleic Acids Res., 8:4057 (1980); EP 36,776], and hybrid promoters such as the tac promoter [deBoer et al., Proc. Natl. Acad. Sci. USA, 80:21-25 (1983)]. Promoters for use in bacterial systems also will contain a Shine-Dalgarno (S.D.) sequence operably linked to the polynucleotide.
  • Suitable promoting sequences for use with yeast hosts include the promoters for 3-phosphoglycerate kinase [Hitzeman et al., J. Biol. Chem., 255:2073 (1980)] or other glycolytic enzymes [Hess et al., J. Adv.
  • Other yeast promoters, which are inducible promoters having the additional advantage of transcription controlled by growth conditions, are the promoter regions for alcohol dehydrogenase 2, isocytochrome C, acid phosphatase, degradative enzymes associated with nitrogen metabolism, metallothionein, glyceraldehyde-3-phosphate dehydrogenase, and enzymes responsible for maltose and galactose utilization. Suitable vectors and promoters for use in yeast expression are further described in EP 73,657.
  • Transcription from vectors in mammalian host cells is controlled, for example, by promoters obtained from the genomes of viruses such as polyoma virus, fowlpox virus (UK 2,211,504 published 5 July 1989), adenovirus (such as Adenovirus 2), bovine papilloma virus, avian sarcoma virus, cytomegalovirus, a retrovirus, hepatitis-B virus and Simian Virus 40 (SV40), from heterologous mammalian promoters, e.g., the actin promoter or an immunoglobulin promoter, and from heat-shock promoters, provided such promoters are compatible with the host cell systems.
  • Enhancers are cis-acting elements of DNA, usually about from 10 to 300 bp, that act on a promoter to increase its transcription.
  • Many enhancer sequences are now known from mammalian genes (globin, elastase, albumin, α-fetoprotein, and insulin).
  • Typically, however, an enhancer from a eukaryotic cell virus is used. Examples include the SV40 enhancer on the late side of the replication origin (bp 100-270), the cytomegalovirus early promoter enhancer, the polyoma enhancer on the late side of the replication origin, and adenovirus enhancers.
  • the enhancer can be spliced into the vector at a position 5' or 3' to the polynucleotide provided herein, but is preferably located at a site 5' from the promoter.
  • Expression vectors used in eukaryotic host cells will also contain sequences necessary for the termination of transcription and for stabilizing the mRNA. Such sequences are commonly available from the 5' and, occasionally 3', untranslated regions of eukaryotic or viral DNAs or cDNAs. These regions contain nucleotide segments transcribed as polyadenylated fragments in the untranslated portion of the mRNA transcribed from the polynucleotide provided herein.
  • Host cells are transfected or transformed with expression or cloning vectors described herein for polypeptide production and cultured in conventional nutrient media modified as appropriate for inducing promoters, selecting transformants, or amplifying the genes encoding the desired sequences.
  • the culture conditions such as media, temperature, pH and the like, can be selected by the skilled artisan without undue experimentation. In general, principles, protocols, and practical techniques for maximizing the productivity of cell cultures can be found in Mammalian Cell Biotechnology: a Practical Approach, M. Butler, ed. (IRL Press, 1991) and Sambrook et al., supra.
  • Methods of eukaryotic cell transfection and prokaryotic cell transformation are known to the ordinarily skilled artisan, for example, CaCl2, CaPO4, liposome-mediated and electroporation. Depending on the host cell used, transformation is performed using standard techniques appropriate to such cells.
  • the calcium treatment employing calcium chloride, as described in Sambrook et al., supra, or electroporation is generally used for prokaryotes.
  • Infection with Agrobacterium tumefaciens is used for transformation of certain plant cells, as described by Shaw et al., Gene, 23:315 (1983) and WO 89/05859 published 29 June 1989.
  • Suitable host cells for cloning or expressing the DNA in the vectors herein include prokaryote, yeast, or higher eukaryote cells.
  • Suitable prokaryotes include but are not limited to eubacteria, such as Gram-negative or Gram-positive organisms, for example, Enterobacteriaceae such as E. coli.
  • Various E. coli strains are publicly available, such as E. coli K12 strain MM294 (ATCC 31,446); E. coli X1776 (ATCC 31,537); E. coli strain W3110 (ATCC 27,325) and K5 772 (ATCC 53,635).
  • suitable prokaryotic host cells include Enterobacteriaceae such as Escherichia, e.g., E. coli, Enterobacter, Erwinia, Klebsiella, Proteus, Salmonella, e.g., Salmonella typhimurium, Serratia, e.g., Serratia marcescens, and Shigella, as well as Bacilli such as B. subtilis and B. licheniformis (e.g., B. licheniformis 41P disclosed in DD 266,710 published 12 April 1989), Pseudomonas such as P. aeruginosa, and Streptomyces. These examples are illustrative rather than limiting.
  • Strain W3110 is one particularly preferred host or parent host because it is a common host strain for recombinant DNA product fermentations. Preferably, the host cell secretes minimal amounts of proteolytic enzymes.
  • strain W3110 can be modified to effect a genetic mutation in the genes encoding proteins endogenous to the host, with examples of such hosts including E. coli W3110 strain 1A2, which has the complete genotype tonA; E. coli W3110 strain 9E4, which has the complete genotype tonA ptr3; E. coli W3110 strain 27C7 (ATCC 55,244), which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT kanr; E. coli W3110 strain 37D6, which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT rbs7 ilvG kanr; E. coli W3110 strain 40B4, which is strain 37D6 with a non-kanamycin resistant degP deletion mutation; and an E. coli strain having mutant periplasmic protease disclosed in U.S. Patent No. 4,946,783 issued 7 August 1990.
  • Alternatively, in vitro methods of cloning, e.g., PCR or other nucleic acid polymerase reactions, are suitable.
  • eukaryotic microbes such as filamentous fungi or yeast are suitable cloning or expression hosts for polynucleotide-containing vectors.
  • Saccharomyces cerevisiae is a commonly used lower eukaryotic host microorganism.
  • Others include Schizosaccharomyces pombe (Beach and Nurse, Nature, 290:140 [1981]; EP 139,383 published 2 May 1985); Kluyveromyces hosts (U.S. Patent No. 4,943,529; Fleer et al., Bio/Technology, 9:968-975 (1991)) such as, e.g., K. lactis (MW98-8C, CBS683, CBS4574; Louvencourt et al., J. Bacteriol., 154(2):737-742 [1983]), K. fragilis (ATCC 12,424), K. bulgaricus (ATCC 16,045), K. wickeramii (ATCC 24,178), K. waltii (ATCC 56,500), K. drosophilarum (ATCC 36,906; Van den Berg et al., Bio/Technology, 8:135 (1990)), K. thermotolerans, and K. marxianus; Yarrowia (EP 402,226); Pichia pastoris (EP 183,070; Sreekrishna et al., J.
  • Candida; Trichoderma reesia (EP 244,234); Neurospora crassa (Case et al., Proc. Natl. Acad. Sci. USA, 76:5259-5263 [1979]); Schwanniomyces such as Schwanniomyces occidentalis (EP 394,538 published 31 October 1990); and filamentous fungi such as, e.g., Neurospora, Penicillium, Tolypocladium (WO 91/00357 published 10 January 1991), and Aspergillus hosts such as A. nidulans (Ballance et al., Biochem. Biophys. Res.
  • Methylotrophic yeasts are suitable herein and include, but are not limited to, yeast capable of growth on methanol selected from the genera consisting of Hansenula, Candida, Kloeckera, Pichia, Saccharomyces, Torulopsis, and Rhodotorula. A list of specific species that are exemplary of this class of yeasts can be found in C. Anthony, The Biochemistry of Methylotrophs, 269 (1982).
  • Suitable host cells for the expression of glycosylated polypeptides are derived from multicellular organisms.
  • invertebrate cells include insect cells such as Drosophila S2 and Spodoptera Sf9, as well as plant cells.
  • useful mammalian host cell lines include Chinese hamster ovary (CHO) and COS cells. More specific examples include monkey kidney CV1 line transformed by SV40 (COS-7, ATCC CRL 1651); human embryonic kidney line (293 or 293 cells subcloned for growth in suspension culture, Graham et al., J. Gen Virol., 36:59 (1977)); Chinese hamster ovary cells/-DHFR (CHO, Urlaub and Chasin, Proc. Natl. Acad. Sci.
  • mouse Sertoli cells TM4, Mather, Biol. Reprod., 23:243-251 (1980)
  • human lung cells WI-38, ATCC CCL 75
  • human liver cells Hep G2, HB 8065
  • mouse mammary tumor MMT 060562, ATCC CCL51. The selection of the appropriate host cell is deemed to be within the skill in the art.
  • Gene amplification and/or expression can be measured in a sample directly, for example, by conventional Southern blotting, Northern blotting to quantitate the transcription of mRNA [Thomas, Proc. Natl. Acad. Sci. USA, 77:5201-5205 (1980)], dot blotting (DNA analysis), or in situ hybridization, using an appropriately labeled probe, based on the sequences provided herein.
  • antibodies can be employed that can recognize specific duplexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes.
  • the antibodies in turn can be labeled and the assay can be carried out where the duplex is bound to a surface, so that upon the formation of duplex on the surface, the presence of antibody bound to the duplex can be detected.
  • Gene expression can be measured by immunological methods, such as immunohistochemical staining of cells or tissue sections and assay of cell culture or body fluids, to quantitate directly the expression of gene product.
  • Antibodies useful for immunohistochemical staining and/or assay of sample fluids can be either monoclonal or polyclonal, and can be prepared in any mammal. Conveniently, the antibodies can be prepared against any polypeptide provided herein or against a synthetic peptide based on the sequences provided herein or against exogenous sequence fused to the polypeptide or fragment thereof and encoding a specific antibody epitope.
  • Polypeptides can be recovered from culture medium or from host cell lysates. If membrane-bound, it can be released from the membrane using a suitable detergent solution (e.g. Triton-X 100) or by enzymatic cleavage. Cells employed in expression of polypeptides can be disrupted by various physical or chemical means, such as freeze-thaw cycling, sonication, mechanical disruption, or cell lysing agents, as is known in the art.
  • the following procedures are exemplary of suitable purification procedures: fractionation on an ion-exchange column; ethanol precipitation; reverse phase HPLC; chromatography on silica or on an anion-exchange resin such as DEAE; chromatofocusing; SDS-PAGE; ammonium sulfate precipitation; gel filtration using, for example, Sephadex G-75; protein A Sepharose columns to remove contaminants such as IgG; and metal chelating columns to bind epitope-tagged forms of the polypeptide.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes a DNA sequence of the embodiments provided herein operably linked to an expression control sequence.
  • an expression vector is a DNA or RNA vector that is capable of transforming a host cell and of effecting expression of a specified nucleic acid molecule.
  • the expression vector is also capable of replicating within the host cell.
  • Expression vectors can be either prokaryotic or eukaryotic, and are typically viruses or plasmids.
  • operably linked refers to functional linkage between a nucleic acid expression control sequence (such as a promoter, or array of transcription factor binding sites) and a second nucleic acid sequence, wherein the expression control sequence directs transcription of the nucleic acid corresponding to the second sequence.
  • An operably linked expression vector can also include secretion signals and other modifying sequences, and can encode chaperones and proteins for a variety of organisms and systems.
  • Methods of expressing polypeptides from polypeptide-encoding nucleotide sequences are known in the art, as exemplified, for example, by the techniques described in Maniatis et al., 1989, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, N. Y. and Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N. Y.
  • the methods include inserting a polypeptide-encoding nucleotide sequence designed by the methods provided herein into a cell, and expressing the polypeptide-encoding nucleotide sequence under conditions suitable for gene expression. Additionally provided expression methods include cell-free expression systems as known in the art, where such methods include providing a polypeptide-encoding nucleotide sequence designed by the methods provided herein and contacting the polypeptide-encoding nucleotide sequence with a cell-free expression system under conditions suitable for protein translation.
  • the expression levels of one or more enzymes in a metabolic pathway are individually manipulated. Differential metabolic expression levels can be manipulated using methods known in the art. For example, by selecting a specific promoter with a desired transcriptional level, one can vary the expression level of the gene that is operably linked to the promoter. Similarly, one may select an expression vector that produces the desired levels of expression.
  • Endogenous sequences include genomic sequences of a cell. Such genomic sequences can include sequences previously modified by the constructs, methods and systems provided herein. Modifications of endogenous sequences can include insertions, deletions and mutations. In some embodiments, a modification can include the insertion of a heterologous sequence. Heterologous sequences include exogenous nucleic acid sequences and can include sequences with homology to endogenous sequences.
  • integrable polynucleotides for modifying endogenous nucleotide sequences in a cell are provided.
  • Such integrable polynucleotides can contain sequences with homology to endogenous sequences and a removable selectable marker cassette.
  • the removable selectable marker cassette can include a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence.
  • integrable polynucleotides can also contain heterologous sequences.
  • the heterologous sequences and removable selectable marker cassette can be flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence.
  • integrable polynucleotides can include episomal nucleic acids, such as plasmids and YACs.
  • integrable polynucleotides can include autonomous replication sequences such as ColE1, Ori, oriT, 2 μm, CEN/ARS.
  • integrable polynucleotides can include linearized episomal nucleic acids, for example, plasmids cut with a restriction enzyme.
  • integrable polynucleotides can include PCR products.
  • a removable selectable marker cassette can contain a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence.
  • Removable selectable marker cassettes can be used to select for integration of an integrable polynucleotide into the genome of a cell. Subsequent to integration of the integrable polynucleotide, the removable selectable marker cassette can be excised, if desired, from the genome of the cell. Because the number of known selectable markers is limited, one advantage of excising a selectable marker from the genome of a cell is that the selectable marker can be used repeatedly.
  • the same selectable marker can be used in a second integrable polynucleotide to modify the genome of a cell previously modified by the first integrable polynucleotide.
  • the selectable marker can allow selection for a cell in which the selectable marker has integrated into the cell's genome.
  • Selectable markers can be antibiotic resistance genes against compounds, for example, kanamycin, ampicillin, tetracycline, chloramphenicol, spectinomycin, gentamycin, zeomycin, or streptomycin. More selectable markers can be genes capable of complementing strains of yeast having well characterized metabolic deficiencies, for example, tryptophan or histidine deficient mutants.
  • a selectable marker can be used to select against cells that retain the selectable marker. In such embodiments, cells which do not express the selectable marker will be selected for.
  • a selectable marker can be selected for and against.
  • Examples of such selectable markers include, but are not limited to, URA3 (Boeke, J. D., LaCroute, F., and Fink, G. R. (1984).
  • a counterselection for the tryptophan pathway in yeast: 5-fluoroanthranilic acid resistance.
  • Yeast 16, 553-560), CAN1 (Whelan, W. L., Gocke, E., and Manney, T. R. (1979).
  • the CAN1 locus of Saccharomyces cerevisiae: fine-structure analysis and forward mutation rates. Genetics 35-51), KlURA3, CYH2, LYS2 and MET15 (Singh, A. and Sherman, F. (1975). Genetic and physiological characterization of met15 mutants of Saccharomyces cerevisiae: a selective system for forward and reverse mutations. Genetics 75-97).
  • Such examples can typically be used in conjunction with specific strains of Saccharomyces cerevisiae which are non-functional for specific genes.
  • a first selection of the selectable marker can be made to select for incorporation of the selectable marker and a second selection of the selectable marker can be made to select against maintaining the selectable marker.
  • Such embodiments can find particular application when the same selectable marker is utilized iteratively, namely, two or more times, for the separate incorporation of two or more heterologous polynucleotides into the host organism.
  • the selectable marker can be flanked by site-specific recombinase recognition sequences.
  • site-specific recombinase recognition sequences allow a site-specific recombinase to excise the selectable marker from an integrable polynucleotide integrated into the genome of a cell.
  • sequence-specific recombinase target sites include, but are not limited to, loxP sites, frt sites, att sites and dif sites.
  • the site-specific recombinase recognition sequences can be loxP sites recognized by the CRE recombinase.
  • the CRE recombinase can be a CRE recombinase optimized for expression in a particular organism, for example, S. cerevisiae, using methods known in the art.
  • the site-specific recombinase recognition sequence can be frt sites recognized by the FLP recombinase.
  • flanking loxP sites or flanking frt sites should be in the same orientation, that is, the sites should be in tandem orientation.
  • CRE recombinase or FLP recombinase expressed in a cell can excise the sequence between loxP sites or frt sites, respectively.
  • the site-specific recombinase can be expressed from a plasmid. In other embodiments, the site-specific recombinase can be expressed from an inducible endogenous gene.
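  • By way of illustration only, excision between tandem recognition sites can be sketched in Python as follows. The sketch treats the genome as a plain string and removes the segment lying between two sites in the same orientation, leaving a single site behind. The 34-nucleotide loxP sequence is the commonly cited one and is an assumption here, not taken from this disclosure; the function name `excise_between_sites` and the flanking and marker sequences are hypothetical.

```python
# Commonly cited 34-nt loxP site (assumption; not taken from this disclosure).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def excise_between_sites(genome: str, site: str = LOXP) -> str:
    """Return the genome after excision between the first two tandem sites.

    Models the outcome of recombination between two sites in the same
    orientation: the intervening sequence and one copy of the site are lost,
    and a single site remains. If fewer than two sites are found, the genome
    is returned unchanged.
    """
    first = genome.find(site)
    if first == -1:
        return genome
    second = genome.find(site, first + len(site))
    if second == -1:
        return genome
    # Keep everything up to and including the first site, then resume
    # immediately after the second site.
    return genome[: first + len(site)] + genome[second + len(site):]


# Toy example: a marker (hypothetical sequence) flanked by tandem loxP sites.
upstream, marker, downstream = "GGGCCC", "ATGAAATAG", "TTTAAA"
integrated = upstream + LOXP + marker + LOXP + downstream
excised = excise_between_sites(integrated)
```

  • In this toy model a single recognition site remains at the former location of the cassette. Sites in opposite orientation are not modeled.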
  • integration of an integrable polynucleotide into the genome of a cell can be mediated by a variety of processes.
  • Such processes can include, but are not limited to, random integration, homologous recombination, or site-specific recombination.
  • integrable polynucleotides can contain sequences with homology to endogenous sequences. Such sequences with homology to endogenous sequences can direct integration of integrable polynucleotides to certain locations in a cell's genome, specifically, the location of the endogenous sequence.
  • One advantage of directing integration of integrable polynucleotides to particular locations of the genome is that the integrable polynucleotides can be directed to locations of the genome that, for example, can contain enhancer elements, locus control regions, or can be more permissive for expression of a heterologous sequence contained within an integrable polynucleotide.
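  • The targeting role of flanking homologous sequences can be sketched, under simplifying assumptions, as a string operation: the region of the genome lying between a 5' homology arm and a 3' homology arm is replaced by the sequence carried between those arms on the integrable polynucleotide. The function name `integrate_by_homology`, the requirement for exact and unique matches, and all sequences below are hypothetical simplifications; actual recombination tolerates imperfect homology and is not deterministic.

```python
def integrate_by_homology(genome: str, arm5: str, payload: str, arm3: str) -> str:
    """Replace the genomic region between two homology arms with a payload.

    Simplified model: each arm must match the genome exactly and exactly once,
    with the 5' arm located upstream of the 3' arm.
    """
    if genome.count(arm5) != 1 or genome.count(arm3) != 1:
        raise ValueError("each homology arm must occur exactly once in the genome")
    start = genome.index(arm5) + len(arm5)  # first base after the 5' arm
    end = genome.index(arm3)                # first base of the 3' arm
    if end < start:
        raise ValueError("the 5' arm must lie upstream of the 3' arm")
    return genome[:start] + payload + genome[end:]


# Toy genome: AAAA + 5' arm + endogenous region + 3' arm + TTTTT
genome = "AAAACCGGTTACGTACGTGGCCAATTTTT"
arm5, arm3 = "CCGGTT", "GGCCAA"
modified = integrate_by_homology(genome, arm5, "ATGCATTAA", arm3)
```

  • Here the endogenous region `ACGTACGT` is replaced by the payload while both arms are retained, which is the sense in which the homologous sequences direct integration to the location of the endogenous sequence.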
  • sequences with homology to endogenous sequences can be more than about 5 nucleotides, more than about 10 nucleotides, more than about 15 nucleotides, more than about 20 nucleotides, more than about 25 nucleotides, more than about 30 nucleotides, more than about 35 nucleotides, more than about 40 nucleotides, more than about 45 nucleotides, more than about 50 nucleotides, more than about 100 nucleotides, more than 500 nucleotides, more than about 1 kilobase, more than about 2 kilobases, more than about 3 kilobases, more than about 4 kilobases, or more than about 5 kilobases in length.
  • Sequences with homology to endogenous sequences can be 100% identical or can have at least 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, or 70% identity to the endogenous sequence.
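  • As a minimal sketch of how such an identity figure can be computed, the following Python function counts matching positions between two sequences that are assumed to be already aligned and of equal length. Gaps and alignment itself are not handled; the function name `percent_identity` and both example sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


endogenous = "ATGGCTGAAAAACTGACCGA"  # 20 nt, hypothetical
homology = "ATGGCTGAGAAACTGACCGA"    # differs at one position (position 9)
identity = percent_identity(endogenous, homology)
```

  • With one mismatch in 20 positions, the two example sequences share 95% identity.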
  • sequences with homology to endogenous sequences can contain sequences with homology to genomic repetitive elements, such as long interspersed repeats (LINEs), short interspersed repeats (SINEs), or retrotransposon DNA, such as long terminal repeats (LTR).
  • genomic repetitive elements can be Ty1 or Ty3 elements.
  • integrable polynucleotides containing sequences with homology to genomic repetitive elements may integrate at more than one site in the genome of a cell.
  • sequences with homology to endogenous sequences can contain δ sequences. δ sequences are a component of the LTR of the Ty1 retrotransposon and are distributed throughout the S. cerevisiae genome.
  • Vectors containing δ sequences for integration into S. cerevisiae are known in the art, as exemplified in Lee F. W. and Da Silva N. A., Sequential delta-integration for the regulated insertion of cloned genes in Saccharomyces cerevisiae. Biotechnol Prog. (1997) 13(4): 368-373.
  • the 5' nucleic acid sequence with homology to an endogenous sequence and the 3' nucleic acid sequence with homology to an endogenous sequence can contain δ sequences.
  • Vectors containing heterologous sequences flanked by δ sequences are known in the art to have an increased stability for expression of heterologous sequences contained therein (Lee F. W.
  • an integrable polynucleotide can contain heterologous sequences.
  • Such heterologous sequences can include sequences encoding polypeptides.
  • the heterologous sequences can encode genes important in sugar metabolism, cellulose metabolism, arabinose metabolism, and xylose metabolism.
  • heterologous sequences can contain regulatory elements operatively linked to a sequence encoding a polypeptide.
  • regulatory elements can include, for example, promoters, enhancers, and terminator sequences. Promoters may be constitutive or inducible. Suitable promoters for use in prokaryotic hosts include, but are not limited to, the trp, lac and phage promoters, tRNA promoters and glycolytic enzyme promoters.
  • Useful yeast promoters include, but are not limited to, the promoter regions for metallothionein, 3-phosphoglycerate kinase or other glycolytic enzymes such as enolase or glyceraldehyde-3-phosphate dehydrogenase and the enzymes responsible for maltose and galactose utilization.
  • Appropriate mammalian promoters include, but are not limited to, the early and late promoters from SV40 and promoters derived from murine Moloney leukemia virus (MLV), mouse mammary tumor virus (MMTV), avian sarcoma viruses, adenovirus II, bovine papilloma virus and polyomas.
  • a heterologous sequence can contain the PGK1 promoter, the TEF1 promoter, the CYC1 terminator, and combinations thereof.
  • heterologous sequences encode and express the gene of interest in a cell in which the heterologous sequence has integrated.
  • a cell can contain any of the integrable polynucleotides described herein.
  • a cell can be a prokaryotic cell or a eukaryotic cell.
  • prokaryotic cells include Escherichia coli, and Clostridium species.
  • eukaryotic cells include, but are not limited to, fungi and yeast cells, such as, Saccharomyces cerevisiae, Pichia pastoris, Zymomonas mobilis, Kluyveromyces lactis, Kluyveromyces marxianus, Trichoderma species, and Aspergillus species; mammalian cells, such as Chinese hamster cells; avian cells; and insect cells.
  • the cell can contain an integrable polynucleotide integrated into the genome of a cell.
  • a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which the removable selectable marker is juxtaposed to said heterologous nucleic acid.
  • a removable selectable marker can be juxtaposed to a heterologous nucleic acid where the removable selectable marker and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the removable selectable marker and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilobases, less than about 1 kilobases, less than about 2 kilobases, less
  • a cell can contain an integrable polynucleotide integrated into the genome of the cell where the removable selectable cassette has been excised from the integrated polynucleotide.
  • a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which a site-specific recombinase recognition site is juxtaposed to the heterologous nucleic acid.
  • a site-specific recombinase recognition site can be juxtaposed to a heterologous nucleic acid where the site-specific recombinase recognition site and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the site-specific recombinase recognition site and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilob
  • a cell can contain a plurality of integrable polynucleotides.
  • a cell can contain a plurality of different integrable polynucleotides containing different selectable markers.
  • a cell contains no more than about 1, no more than about 2, no more than about 3, no more than about 4, no more than about 5, no more than about 6, no more than about 7, no more than about 8, no more than about 9, or no more than about 10 different selectable markers.
  • the number of selectable markers a cell can contain can include the number of different selectable markers compatible with the methods and compositions described herein.
  • a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell.
  • a cell can contain 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 45 or more, or 50 or more different integrable polynucleotides that have integrated into the genome of the cell.
  • a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell where some integrable polynucleotides contain selectable markers, and some integrable polynucleotides have no selectable marker. In even more embodiments, a cell can contain a plurality of different integrable polynucleotides where some or all of the selectable markers have been excised.
  • methods to modify an endogenous sequence in a cell can include providing a cell with any integrable polynucleotide described herein, and selecting for at least one cell containing the integrable polynucleotide integrated into the genome of the cell.
  • a plurality of different integrable polynucleotides can be provided to a cell.
  • the plurality of different integrable polynucleotides can include 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides.
  • the plurality of integrable polynucleotides can include integrable polynucleotides with different selectable markers.
  • One advantage of providing a cell with a plurality of polynucleotides with different selectable markers includes the ability to make more than one modification to endogenous sequences in a cell simultaneously.
  • accordingly, some methods to modify an endogenous sequence in a cell include providing the cell with a plurality of different integrable polynucleotides simultaneously.
  • the plurality of integrable polynucleotides can include integrable polynucleotides with different heterologous sequences.
  • the plurality of integrable polynucleotides can include integrable polynucleotides with different flanking sequences with homology to endogenous sequences.
  • At least one selectable marker can be used iteratively.
  • a cell can be produced from a first round of modification(s) using the methods described herein.
  • a cell can be provided with a first integrable polynucleotide containing a selectable marker, a cell can be selected for containing the integrable polynucleotide integrated into the cell's genome, the selection cassette can be excised from a cell containing an integrated integrable polynucleotide, and a cell can be selected for having the selection cassette excised.
  • a cell containing the modifications of the first round can undergo at least a second round of modifications using a second integrable polynucleotide containing the same selectable marker as the first integrable polynucleotide.
  • in this way, a selectable marker can be reused, that is, used iteratively.
  • a cell can be provided with a plurality of integrable polynucleotides containing a set of different selectable markers in a first round of modifications.
  • a cell containing the modifications of the first round of modifications can be provided with a plurality of integrable polynucleotides containing the same set of different selectable markers as the first round of modifications.
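  • The iterative use of a single selectable marker over successive rounds can be sketched with a toy bookkeeping model in Python. The model tracks only which sequences have integrated and which markers remain in the genome; the class `Cell`, the functions `integrate_and_select` and `excise_and_counterselect`, and the sequence names `hetA` and `hetB` are hypothetical and stand in for the laboratory steps of selection, excision and counterselection.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Toy model of a cell: integrated sequences and markers still present."""
    integrated: list = field(default_factory=list)
    markers: set = field(default_factory=set)


def integrate_and_select(cell: Cell, sequence: str, marker: str) -> None:
    """Integrate a sequence and select FOR the marker it carries."""
    if marker in cell.markers:
        # A marker already in the genome cannot report a new integration.
        raise ValueError(f"{marker} is already present; excise it first")
    cell.integrated.append(sequence)
    cell.markers.add(marker)


def excise_and_counterselect(cell: Cell, marker: str) -> None:
    """Excise the marker cassette and select AGAINST cells that retain it."""
    cell.markers.discard(marker)


cell = Cell()
for seq in ["hetA", "hetB"]:  # hypothetical heterologous sequences
    integrate_and_select(cell, seq, "URA3")
    excise_and_counterselect(cell, "URA3")
```

  • After two rounds the toy cell carries both heterologous sequences and no marker, so the same marker remains available for a further round. Attempting a second integration with a marker that has not been excised raises an error in this model, reflecting that the marker could no longer distinguish new integrants.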
  • the integrable polynucleotide can be provided to a cell as a linearized plasmid.
  • the integrable polynucleotide can be provided to a cell as a PCR product.
  • Methods of PCR are well known in the art.
  • the template for the PCR can comprise a sequence for an integrable polynucleotide, for example, a vector containing the integrable polynucleotide sequence.
  • the initial template for PCR may not contain the entire sequence for an integrable polynucleotide.
  • One advantage of using PCR to generate the integrable polynucleotide includes the ability to incorporate additional sequences to the ends of the initial PCR template.
  • PCR primers with tails can be designed and used to amplify the initial PCR template and incorporate the additional sequences in the tails into the amplified product.
  • Such additional tail sequences can be 2 nucleotides, 3 nucleotides, 4 nucleotides, 5 nucleotides, 6 nucleotides, 7 nucleotides, 8 nucleotides, 9 nucleotides, 10 nucleotides, 11 nucleotides, 12 nucleotides, 13 nucleotides, 14 nucleotides, 15 nucleotides, 16 nucleotides, 17 nucleotides, 18 nucleotides, 19 nucleotides, 20 nucleotides, 21 nucleotides, 22 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27 nucleotides, 28 nucleotides, 29 nucleotides, 30 nucleotides, 31 nucleotides, 32 nucleotides, 33 nucleotides, 34 nucleotides, 35 nucleotides, 36 nucleotides, 37 nucleotides, 38
  • primers for the PCR can be designed to add sequences with homology to endogenous sequences to the initial PCR template.
  • an integrable polynucleotide with flanking sequences with homology to endogenous sequences can be generated.
  • additional tail sequences can include Ty1 sequences.
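  • The use of tailed primers to add flanking sequences to an initial PCR template can be sketched in Python as follows. The sketch builds a forward primer and a reverse primer whose 5' tails carry the additional sequences, and gives the top strand of the product expected once the tails are incorporated. The function names, the 18-nucleotide annealing length, the template and both tail sequences are hypothetical; melting temperature, secondary structure and other primer design criteria are not considered.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase or lowercase DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def tailed_primers(template: str, tail5: str, tail3: str, anneal: int = 18):
    """Design primers whose 5' tails append tail5 and tail3 to the template.

    Returns (forward, reverse), both written 5'->3'.
    """
    if len(template) < anneal:
        raise ValueError("template shorter than annealing length")
    forward = tail5 + template[:anneal]
    # The reverse primer anneals to the 3' end of the top strand, so it is the
    # reverse complement of (template end + 3' tail).
    reverse = reverse_complement(template[-anneal:] + tail3)
    return forward, reverse


def expected_product(template: str, tail5: str, tail3: str) -> str:
    """Top strand of the amplicon once both tails are incorporated."""
    return tail5 + template + tail3


template = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCTAA"  # hypothetical
tail5, tail3 = "CCGGTTAACC", "GGAATTCCGG"                 # hypothetical tails
fwd, rev = tailed_primers(template, tail5, tail3)
product = expected_product(template, tail5, tail3)
```

  • When the tails are sequences with homology to endogenous sequences, the resulting product is an integrable polynucleotide flanked by those homologous sequences.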
  • methods to modify an endogenous sequence in a cell can also include excising the selectable marker from the integrable polynucleotide integrated into the genome of the cell.
  • One advantage of excising a selectable marker integrated into the genome of a cell is that the selectable marker can be re-used to select for another modification in a subsequent round of modifications.
  • a selectable marker can be excised from an integrated site by site-specific recombination using a site-specific recombinase expressed in the cell.
  • Site-specific recombinases can include CRE recombinase to excise sequences between tandem loxP sites, and FLP recombinase to excise sequences between tandem frt sites.
  • the site-specific recombinase can be expressed from a plasmid transformed into the cell.
  • the site-specific recombinase can be expressed from an inducible endogenous gene. It is contemplated that in instances where more than one type of different selectable markers have integrated into the cell's genome, all the different selectable markers can be excised simultaneously by the expression of at least one type of site-specific recombinase.
  • the selectable markers of an integrable polynucleotide containing the URA3 marker flanked by loxP sites, and an integrable polynucleotide containing the TRP1 marker flanked by loxP sites can both be excised from sites where the integrable polynucleotides have integrated into the cell by expression in the cell of CRE recombinase.
  • a cell can be provided with a plurality of integrable polynucleotides which contain different recombinase recognition sequences.
  • the plurality of integrable polynucleotides can include some integrable polynucleotides that contain one type of recombinase recognition sequences, such as loxP sites, and some integrable polynucleotides can contain another type of recombinase recognition sequences, such as frt sites.
  • a cell in which a selectable marker has been excised can be identified by selecting against cells that retain the marker. Methods for such negative selection are well known in the art.
  • An exemplary eukaryotic system for xylose metabolism is a cassette of enzymes that can include xylose reductase (XR), xylitol dehydrogenase (XDH), xylulokinase (XKI) and transketolase (TLK).
  • An exemplary bacterial system for xylose metabolism is a cassette of enzymes that can include xylose isomerase (XylA), xylulokinase (XKI), transketolase (TLK) and transaldolase (TAL).
  • one or more, or all of the enzymes are heterologous to the one or more host organisms.
  • the translational kinetics of each of the nucleotide sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, 3, 4, 5 or 6 or more codon pairs present in the original sequence for each enzyme.
  • a silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change.
  • the at least 1, 2, 3, 4, 5 or 6 or more substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism.
  • a codon pair in the modified polynucleotide can be selected to preserve or insert a predicted pause.
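  • A minimal Python sketch of silent permutation of codon pairs follows. It scans a coding sequence for adjacent codon pairs listed in a set of pause-predicted pairs and replaces each such pair with synonymous codons, so that the encoded amino acids do not change, while checking that the replacement does not create a listed pair with its neighbors. The set `PAUSE_PAIRS` is purely hypothetical; how pause-predicted pairs are identified for a given host organism is not shown, and conservative amino acid substitution is not modeled. The standard genetic code is assumed.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_of(seq: str) -> list:
    """Split a coding sequence (length a multiple of 3) into codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq: str) -> str:
    return "".join(CODON_TABLE[c] for c in codons_of(seq))


def synonyms(codon: str) -> list:
    """All codons encoding the same amino acid, including the codon itself."""
    return [c for c, aa in CODON_TABLE.items() if aa == CODON_TABLE[codon]]


def replace_pause_pairs(seq: str, pause_pairs: set) -> str:
    """Silently replace codon pairs found in pause_pairs where possible."""
    codons = codons_of(seq)
    for i in range(len(codons) - 1):
        if (codons[i], codons[i + 1]) not in pause_pairs:
            continue
        for a, b in product(synonyms(codons[i]), synonyms(codons[i + 1])):
            # The new pair and the pairs it forms with its neighbors
            # must all be absent from the pause set.
            new_pairs = [(a, b)]
            if i > 0:
                new_pairs.append((codons[i - 1], a))
            if i + 2 < len(codons):
                new_pairs.append((b, codons[i + 2]))
            if not any(p in pause_pairs for p in new_pairs):
                codons[i], codons[i + 1] = a, b
                break
    return "".join(codons)


PAUSE_PAIRS = {("GCT", "GAA")}  # hypothetical pause-predicted pair
original = "ATGGCTGAAAAATAA"    # encodes M-A-E-K-stop
modified = replace_pause_pairs(original, PAUSE_PAIRS)
```

  • In this example the pair GCT-GAA is replaced by the synonymous pair GCT-GAG, and the translated sequence is unchanged. A pair for which no acceptable synonymous replacement exists is left as it is. Preserving or inserting a predicted pause would use the same machinery with the membership test reversed.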
  • systems for arabinose metabolism comprising one or more host organisms that collectively include nucleotide sequences operably encoding at least two enzymes that metabolize arabinose and its downstream metabolites, including transketolase, ribulose 5-phosphate epimerase, and transaldolase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of the at least three codon pairs.
  • such a system can include one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: alcohol dehydrogenase-1 and alcohol dehydrogenase-3.
  • Also provided herein are systems for arabinose metabolism comprising one or more host organisms that collectively include nucleotide sequences operably encoding at least two enzymes from bacterial or eukaryotic pathways.
  • An exemplary eukaryotic system for arabinose metabolism is a cassette of enzymes that can include aldose reductase (ARD), L-arabinitol 4-dehydrogenase (LAD), L-xylulose reductase (LXR), xylitol dehydrogenase (XDH), xylulokinase (XKI), and transaldolase (TLK).
  • An exemplary bacterial system for arabinose metabolism is a cassette of enzymes that can include L-arabinose isomerase (AraA), L-ribulokinase (AraB), L-ribulose-5-P 4-epimerase (AraD) and transaldolase (TLK).
  • one or more, or all of the enzymes are heterologous to the one or more host organisms.
  • the translational kinetics of each of the nucleotide sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, 3, 4, 5 or 6 or more codon pairs present in the original sequence for each enzyme.
  • a silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change.
  • the at least 1, 2, 3, 4, 5 or 6 or more substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism.
  • a codon pair in the modified polynucleotide can be selected to preserve or insert a predicted pause.
  • the stoichiometry of enzymes in a pathway can affect the overall efficiency of biomass conversion. Accordingly, provided herein are systems of two or more enzymes wherein one of the two or more enzymes in the pathway has a translational pause. Also provided herein are two or more enzymes wherein two of the enzymes in the pathway have a translational pause.
  • xylose reductase can have a pause
  • xylitol dehydrogenase XDH
  • xylulokinase XKI
  • combinations thereof can have pauses.
  • xylose isomerase XylA
  • Xylulokinase XKI
  • both enzymes can have a pause.
  • aldose reductase can have a pause
  • L-arabinitol 4-dehydrogenase LAD
  • L-xylulose reductase LXR
  • XDH xylitol dehydrogenase
  • XKI xylulokinase
  • L-arabinose isomerase (AraA) can have a pause
  • L-ribulokinase (AraB) can have a pause
  • L-ribulose-5-P 4-epimerase (AraD) can have a pause, or combinations thereof can have pauses.
  • AraA and AraB do not have pauses, while AraD contains a pause; it is contemplated that such an arrangement would result in AraA and AraB having high levels of activity, with AraD retaining low levels of activity.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme in the system has at least 50%, 60%, 70% or 80%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99%, amino acid sequence identity to the original sequence of the enzyme.
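  • The identity thresholds above can be illustrated with a short calculation. The sketch below assumes that the original and modified amino acid sequences are already aligned and of equal length (no gaps); a real comparison would normally include an alignment step, which is not shown. The function name `percent_identity` is illustrative only.

```python
def percent_identity(original: str, variant: str) -> float:
    """Percentage of positions with identical residues in two pre-aligned,
    equal-length amino acid sequences (simplifying assumption: no gaps)."""
    if not original or len(original) != len(variant):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(original, variant))
    return 100.0 * matches / len(original)
```

  • Under this definition, a sequence modified only by silent permutations scores 100%, while one conservative substitution in a 10-residue peptide scores 90%.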
  • one or more of the enzymes in the system retains at least 75% of the enzymatic activity of the enzyme encoded by the original sequence under conditions suitable for metabolism of xylose. Methods for measuring the activity of the enzymes in the system are known in the art.
  • Also provided are methods of hydrolyzing a sugar comprising providing a sugar comprising at least one covalent bond, providing a polypeptide encoded by any of the polynucleotides provided herein, and contacting said sugar with said polypeptide under conditions that permit said polypeptide to break or form at least one covalent bond of said sugar, whereby at least one covalent bond of said sugar is hydrolyzed.
  • the sugar is xylose or arabinose.
  • Such methods can be performed using the cells and systems provided herein. Such methods can be performed in order to provide sugar metabolites which can be used by a cell or processed extracellularly according to any one of a variety of known methods in the art.
  • a polynucleotide containing an improved-expression nucleotide sequence calculated in accordance with the teachings herein can be prepared by known methods, such as, for example, assembly of overlapping oligonucleotides which can be solid phase synthesized, as is described in U.S. Patent Number 7,262,031, and U.S. Patent Publication Numbers 2005/0106590 and 2007/0009928.
  • the prepared polynucleotide can then be amplified by PCR methodologies or by insertion into a vector, transformation into cells, and subsequent harvesting of the vector from the cells. Examples of such methods for amplification of a polynucleotide are provided in Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y.
  • the polynucleotide itself or amplicon thereof can be inserted into an expression vector configured to produce the polypeptide encoded by the inserted polynucleotide.
  • the expression vector is then inserted into cells, and according to the expression vector used, the cells are treated under conditions suitable for polypeptide expression.
  • the expressed polypeptide can be analyzed and manipulated as desired.
  • the expressed polypeptide can be analyzed by Western blot analysis using a known antibody to the expressed polypeptide or using an anti-polypeptide antibody generated by known methods.
  • the expressed polypeptide also can be subjected to one or more purification steps to increase the purity of the expressed polypeptide.
  • Various analytical and purification methods, as well as antibody-generation methods, are known in the art, as exemplified in Ausubel, supra.
  • This example describes optimization of a nucleotide sequence encoding TLK1 for expression in yeast.
  • the chi-squared value "chisq1" was generated from the expected and observed values determined.
  • chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
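  • The statistics described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes that the expected count of a codon pair under random usage is the total number of pairs multiplied by the two individual codon frequencies, it computes only the first chi-squared value (chisq1), and it does not reproduce the corrections for amino acid pair and dinucleotide bias that yield chisq2 and chisq3. Only codon pairs observed at least once are scored. The function names are illustrative.

```python
from collections import Counter
from statistics import mean, pstdev


def split_codons(cds: str) -> list:
    """Split an in-frame coding sequence into codons, ignoring a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def codon_pair_chisq1(coding_sequences) -> dict:
    """Per-pair chi-squared term (observed - expected)^2 / expected.

    Assumption: 'expected' treats adjacent codons as independent draws
    from the overall codon frequencies of the input sequences.
    """
    codon_counts, pair_counts = Counter(), Counter()
    for cds in coding_sequences:
        codons = split_codons(cds)
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))
    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    chisq = {}
    for (first, second), observed in pair_counts.items():
        expected = (total_pairs
                    * codon_counts[first] / total_codons
                    * codon_counts[second] / total_codons)
        chisq[(first, second)] = (observed - expected) ** 2 / expected
    return chisq


def z_scores(values: dict) -> dict:
    """Express each value as a number of standard deviations from the mean of all values."""
    data = list(values.values())
    mu, sigma = mean(data), pstdev(data)
    if sigma == 0:
        return {key: 0.0 for key in values}
    return {key: (value - mu) / sigma for key, value in values.items()}
```

  • In the text, the z scores are taken over chisq3 rather than chisq1; the same `z_scores` normalization would apply to whichever chi-squared values are supplied.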
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to optimize codon usage for S. cerevisiae.
  • the nucleotide sequence encoding TLK1 (SEQ ID NO: 1) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the TLK1 protein (SEQ ID NO: 2) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • the graphical display is provided in Figure 1.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the TLK1 protein (SEQ ID NO: 2) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • the graphical display is provided in Figure 2A.
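  • The data behind such a graphical display can be assembled as sketched below. The sketch assumes a lookup table `z` mapping codon pairs to precomputed z scores for the host organism (such a table is hypothetical here) and returns one (position, z score) point per adjacent codon pair; pairs missing from the table are given a default value. Plotting itself is left to any charting tool.

```python
def codon_pair_profile(cds: str, z: dict, default: float = 0.0) -> list:
    """Return (codon pair position, z score) points for an in-frame coding sequence.

    Position 1 is the pair formed by the first and second codons.
    `z` maps (codon, codon) tuples to z scores; its contents are an assumption.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    pairs = zip(codons, codons[1:])
    return [(pos, z.get(pair, default)) for pos, pair in enumerate(pairs, start=1)]
```

  • A sequence of n codons yields n - 1 points, which is why the horizontal axis of such a display is codon pair position rather than codon position.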
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 3) was found to encode a protein (SEQ ID NO: 4) with 100% amino acid sequence identity to wild-type TLK1 (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 3) encoding the TLK1 protein (SEQ ID NO: 4) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 2B.
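  • One way to remove codon pairs with z scores above a threshold while keeping 100% amino acid identity is a greedy synonymous substitution, sketched below. This is an illustrative approach under stated assumptions, not the algorithm used to produce SEQ ID NO: 3: it assumes the standard genetic code and a hypothetical table `z` of codon pair z scores, treats pairs absent from the table as having a z score of 0, and leaves a pair unchanged when no synonymous alternative brings it and its two neighbouring pairs to or below the threshold.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def remove_high_z_pairs(cds: str, z: dict, threshold: float = 3.0) -> str:
    """Greedily replace codon pairs scoring above `threshold` with synonymous pairs."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]

    def score(first, second):
        return z.get((first, second), 0.0)

    for i in range(len(codons) - 1):
        if score(codons[i], codons[i + 1]) <= threshold:
            continue
        best = None
        # Try every synonymous choice for both codons of the offending pair.
        for a, b in product(SYNONYMS[CODON_TO_AA[codons[i]]],
                            SYNONYMS[CODON_TO_AA[codons[i + 1]]]):
            local = [score(a, b)]
            if i > 0:
                local.append(score(codons[i - 1], a))   # pair with upstream codon
            if i + 2 < len(codons):
                local.append(score(b, codons[i + 2]))   # pair with downstream codon
            worst = max(local)
            if best is None or worst < best[0]:
                best = (worst, a, b)
        if best[0] <= threshold:
            codons[i], codons[i + 1] = best[1], best[2]
    return "".join(codons)
```

  • Because only synonymous codons are tried, the encoded protein is unchanged by construction. A greedy pass of this kind can miss solutions that require changing codons further away, so a production method would likely need a broader search.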
  • This example describes optimization of a nucleotide sequence encoding TLK1 for expression in bacteria.
  • Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to optimize codon usage for E. coli.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the TLK1 protein (SEQ ID NO: 2) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3A.
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 9) was found to encode a protein (SEQ ID NO: 10) with 100% amino acid sequence identity to wild-type TLK1 (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 9) encoding the TLK1 protein (SEQ ID NO: 10) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3B.
  • This example describes optimization of a nucleotide sequence encoding TLK1 for expression in P. pastoris.
  • Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to optimize codon usage for P. pastoris.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the TLK1 protein (SEQ ID NO: 2) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4A.
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 15) was found to encode a protein (SEQ ID NO: 16) with 100% amino acid sequence identity to wild-type TLK1 (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 15) encoding the TLK1 protein (SEQ ID NO: 16) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4B.
  • This example describes optimization of a nucleotide sequence encoding TLK1 for expression in K. lactis.
  • Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to optimize codon usage for K. lactis.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the TLK1 protein (SEQ ID NO: 2) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5A.
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 21) was found to encode a protein (SEQ ID NO: 22) with 100% amino acid sequence identity to wild-type TLK1 (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 21) encoding the TLK1 protein (SEQ ID NO: 22) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5B.
  • This example describes optimization of a nucleotide sequence encoding TLK1 for expression in Z. mobilis.
  • Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to optimize codon usage for Z. mobilis.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the TLK1 protein (SEQ ID NO: 2) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6A.
  • the nucleotide sequence for the gene encoding the TLK1 protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 23) was found to encode a protein (SEQ ID NO: 24) with 100% amino acid sequence identity to wild-type TLK1 (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 23) encoding the TLK1 protein (SEQ ID NO: 24) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6B.
  • Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 2 and of native TLK1 protein is examined by Western blot analysis.
  • Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
  • An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
  • Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-TLK1 antibody diluted 1:20,000. Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
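  • The culture volumes implied by the protocol above follow from the dilution relation C1·V1 = C2·V2. The sketch below works them out for the 5 ml culture. The stock concentrations (100 mg/ml ampicillin and 20% L-arabinose) are hypothetical values chosen for illustration and are not given in the original text; the L-arabinose percentages are assumed to be weight per volume, and the small volume added by each stock is ignored.

```python
def stock_volume_ul(final_conc: float, stock_conc: float, final_volume_ml: float) -> float:
    """Volume of stock (in µl) needed to reach final_conc in final_volume_ml.

    Uses C1*V1 = C2*V2; final_conc and stock_conc must share the same units.
    """
    return final_conc / stock_conc * final_volume_ml * 1000.0


culture_ml = 5.0

# 1:100 inoculum of overnight culture into 5 ml of medium.
inoculum_ul = culture_ml * 1000.0 / 100.0

# Ampicillin at 100 µg/ml from a hypothetical 100 mg/ml (100,000 µg/ml) stock.
ampicillin_ul = stock_volume_ul(100.0, 100_000.0, culture_ml)

# L-arabinose at 0.002% or 0.02% from a hypothetical 20% stock.
arabinose_low_ul = stock_volume_ul(0.002, 20.0, culture_ml)
arabinose_high_ul = stock_volume_ul(0.02, 20.0, culture_ml)
```

  • With these assumed stocks, the inoculum is 50 µl, the ampicillin addition is about 5 µl, and the two inducer additions are about 0.5 µl and 5 µl; in practice a more dilute L-arabinose stock would make the lower dose easier to pipette.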
  • This example describes optimization of a nucleotide sequence encoding RPE for expression in yeast.
  • the chi-squared value "chisq1" was generated from the expected and observed values determined.
  • chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to optimize codon usage for S. cerevisiae.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the RPE protein (SEQ ID NO: 26) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • the graphical display is provided in Figure 7.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the RPE protein (SEQ ID NO: 26) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • the graphical display is provided in Figure 8A.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 27) was found to encode a protein (SEQ ID NO: 28) with 100% amino acid sequence identity to wild-type RPE (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 27) encoding the RPE protein (SEQ ID NO: 28) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 8B.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to optimize codon usage for E. coli.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the RPE protein (SEQ ID NO: 26) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 9A.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 33) was found to encode a protein (SEQ ID NO: 34) with 100% amino acid sequence identity to wild-type RPE (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 33) encoding the RPE protein (SEQ ID NO: 34) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 9B.
  • This example describes optimization of a nucleotide sequence encoding RPE for expression in P. pastoris.
  • Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to optimize codon usage for P. pastoris.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the RPE protein (SEQ ID NO: 26) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 10A.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 39) was found to encode a protein (SEQ ID NO: 40) with 100% amino acid sequence identity to wild-type RPE (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 39) encoding the RPE protein (SEQ ID NO: 40) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 10B.
  • This example describes optimization of a nucleotide sequence encoding RPE for expression in K. lactis.
  • Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to optimize codon usage for K. lactis.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the RPE protein (SEQ ID NO: 26) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 11A.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 45) was found to encode a protein (SEQ ID NO: 46) with 100% amino acid sequence identity to wild-type RPE (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 45) encoding the RPE protein (SEQ ID NO: 46) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 11B.
  • This example describes optimization of a nucleotide sequence encoding RPE for expression in Z. mobilis.
  • Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to optimize codon usage for Z. mobilis.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the RPE protein (SEQ ID NO: 26) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 12A.
  • the nucleotide sequence for the gene encoding the RPE protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 47) was found to encode a protein (SEQ ID NO: 48) with 100% amino acid sequence identity to wild-type RPE (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 47) encoding the RPE protein (SEQ ID NO: 48) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 12B.
  • Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 8 and of native RPE protein is examined by Western blot analysis.
  • Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
  • An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
  • Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-RPE antibody diluted 1:20,000. Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
  • This example describes optimization of a nucleotide sequence encoding ADH1 for expression in yeast.
  • the chi-squared value "chisq1" was generated from the expected and observed values determined.
  • chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to optimize codon usage for S. cerevisiae.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the ADH1 protein (SEQ ID NO: 50) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • the graphical display is provided in Figure 13.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the ADH1 protein (SEQ ID NO: 50) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • the graphical display is provided in Figure 14A.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 51) was found to encode a protein (SEQ ID NO: 52) with 100% amino acid sequence identity to wild-type ADH1 (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 51) encoding the ADH1 protein (SEQ ID NO: 52) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 14B.
  • This example describes optimization of a nucleotide sequence encoding ADH1 for expression in bacteria.
  • Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to optimize codon usage for E. coli.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the ADH1 protein (SEQ ID NO: 50) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 15A.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 57) was found to encode a protein (SEQ ID NO: 58) with 100% amino acid sequence identity to wild-type ADH1 (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 57) encoding the ADH1 protein (SEQ ID NO: 58) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 15B.
  • This example describes optimization of a nucleotide sequence encoding ADH1 for expression in P. pastoris.
  • Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to optimize codon usage for P. pastoris.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the ADH1 protein (SEQ ID NO: 50) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 16A.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 63) was found to encode a protein (SEQ ID NO: 64) with 100% amino acid sequence identity to wild-type ADH1 (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 63) encoding the ADH1 protein (SEQ ID NO: 64) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 16B.
  • This example describes optimization of a nucleotide sequence encoding ADH1 for expression in K. lactis.
  • Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to optimize codon usage for K. lactis.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the ADH1 protein (SEQ ID NO: 50) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 17A.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 69) was found to encode a protein (SEQ ID NO: 70) with 100% amino acid sequence identity to wild-type ADH1 (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 69) encoding the ADH1 protein (SEQ ID NO: 70) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 17B.
  • This example describes optimization of a nucleotide sequence encoding ADH1 for expression in Z. mobilis.
  • Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to optimize codon usage for Z. mobilis.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the ADH1 protein (SEQ ID NO: 50) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 18A.
  • the nucleotide sequence for the gene encoding the ADH1 protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 71) was found to encode a protein (SEQ ID NO: 72) with 100% amino acid sequence identity to wild-type ADH1 (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 71) encoding the ADH1 protein (SEQ ID NO: 72) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 18B.
  • Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 14 and of the native ADH1 protein is examined by Western blot analysis.
  • Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
  • An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
  • Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-ADH1 antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
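The volumes implied by the culture conditions above follow from simple dilution arithmetic (C1·V1 = C2·V2). The sketch below works them through for the 5 ml culture; the stock concentrations (100 mg/ml ampicillin, 20% L-arabinose) are assumptions chosen for illustration and are not stated in the text.

```python
def inoculum_volume_ml(culture_volume_ml, dilution_factor):
    """Volume of overnight culture for a 1:dilution_factor inoculation.

    Treats 1:100 as one part in 100 parts of final culture, ignoring the
    small volume change caused by the inoculum itself.
    """
    return culture_volume_ml / dilution_factor


def stock_volume_ul(final_conc, stock_conc, culture_volume_ml):
    """Microlitres of stock needed; final_conc and stock_conc share one unit."""
    return final_conc / stock_conc * culture_volume_ml * 1000.0


CULTURE_ML = 5.0
AMP_STOCK_UG_PER_ML = 100_000.0   # assumed stock: 100 mg/ml ampicillin
ARABINOSE_STOCK_PERCENT = 20.0    # assumed stock: 20% L-arabinose

overnight_ml = inoculum_volume_ml(CULTURE_ML, 100)
amp_ul = stock_volume_ul(100.0, AMP_STOCK_UG_PER_ML, CULTURE_ML)
arabinose_ul = {
    percent: stock_volume_ul(percent, ARABINOSE_STOCK_PERCENT, CULTURE_ML)
    for percent in (0.002, 0.02)
}

print(f"overnight culture: {overnight_ml * 1000:.0f} ul")
print(f"ampicillin stock:  {amp_ul:.1f} ul")
for percent, volume in arabinose_ul.items():
    print(f"arabinose to {percent}%: {volume:.1f} ul")
```

With these assumed stocks, the lower induction level needs a sub-microlitre addition, so a more dilute working stock would be easier to pipette in practice.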
  • This example describes optimization of a nucleotide sequence encoding ADH3 for expression in yeast.
  • The chi-squared value "chisq1" was generated from the expected and observed values so determined.
  • The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • Z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
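The first and last steps above (a chi-squared value from observed and expected codon pair counts, then normalization to z scores) can be sketched as follows. This is an illustrative reading, not the procedure of Example 1: the expected count assumes independent codon usage, only pairs that are actually observed are scored, the chi-squared term is unsigned, and the chisq2 and chisq3 corrections are omitted. The function names and toy sequences are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def codon_pair_counts(cds_list):
    """Count codons and adjacent codon pairs over a set of coding sequences."""
    codon_counts, pair_counts = Counter(), Counter()
    for cds in cds_list:
        cs = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
        codon_counts.update(cs)
        pair_counts.update(zip(cs, cs[1:]))
    return codon_counts, pair_counts


def chisq1(codon_counts, pair_counts):
    """(observed - expected)^2 / expected, with expected from random pairing."""
    n_codons = sum(codon_counts.values())
    n_pairs = sum(pair_counts.values())
    result = {}
    for (c1, c2), observed in pair_counts.items():
        expected = (codon_counts[c1] / n_codons) * (codon_counts[c2] / n_codons) * n_pairs
        result[(c1, c2)] = (observed - expected) ** 2 / expected
    return result


def z_scores(values):
    """Express each value as standard deviations from the mean of all values."""
    mu = mean(values.values())
    sd = pstdev(values.values())
    return {key: ((v - mu) / sd if sd else 0.0) for key, v in values.items()}


# Hypothetical toy "genome" of two short coding sequences.
toy_cds = ["ATGGCTGCTGCTTAA", "ATGGCTAAAGCTTAA"]
codon_counts, pair_counts = codon_pair_counts(toy_cds)
chi = chisq1(codon_counts, pair_counts)
z = z_scores(chi)
```

In the toy data the GCT-GCT pair occurs exactly as often as random pairing predicts, so its chi-squared value is zero and it receives the lowest z score.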
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to optimize codon usage for S. cerevisiae.
  • A graphical display for the native gene (SEQ ID NO: 73) encoding the ADH3 protein (SEQ ID NO: 74) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • The graphical display is provided in Figure 19.
  • A graphical display for the native gene (SEQ ID NO: 73) encoding the ADH3 protein (SEQ ID NO: 74) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • The graphical display is provided in Figure 20A.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 75) was found to encode a protein (SEQ ID NO: 76) with 100% amino acid sequence identity to wild-type ADH3 (SEQ ID NO: 74).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 75) encoding the ADH3 protein (SEQ ID NO: 76) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 20B.
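The graphical displays described in this example plot one z score per codon pair against the position of that pair in the gene. A minimal sketch of how such a profile could be assembled is shown below; it produces the (position, z score) series and a crude text rendering rather than a real figure. The lookup table, the default of 0 for missing pairs, and the toy sequence are assumptions for illustration.

```python
def z_profile(cds, z_table, default=0.0):
    """Return [(pair position, z score)] for each adjacent codon pair, 1-based."""
    cs = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return [
        (position, z_table.get((first, second), default))
        for position, (first, second) in enumerate(zip(cs, cs[1:]), start=1)
    ]


def ascii_plot(profile, threshold=3.0):
    """Render the profile as text, marking pairs above the threshold."""
    lines = []
    for position, z in profile:
        bar = "#" * max(0, round(z))
        flag = " <-- above threshold" if z > threshold else ""
        lines.append(f"{position:4d} {z:6.2f} {bar}{flag}")
    return "\n".join(lines)


# Hypothetical z scores for a four-codon toy sequence.
toy_z = {("ATG", "GCT"): -1.0, ("GCT", "AAA"): 4.2}
profile = z_profile("ATGGCTAAAGGT", toy_z)
print(ascii_plot(profile))
```

Comparing the profile of a native gene with that of its modified counterpart shows directly whether every peak above the threshold has been removed.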
  • This example describes optimization of a nucleotide sequence encoding ADH3 for expression in bacteria.
  • Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to optimize codon usage for E. coli.
  • A graphical display for the native gene (SEQ ID NO: 73) encoding the ADH3 protein (SEQ ID NO: 74) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 21A.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 81) was found to encode a protein (SEQ ID NO: 82) with 100% amino acid sequence identity to wild-type ADH3 (SEQ ID NO: 74).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 81) encoding the ADH3 protein (SEQ ID NO: 82) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 21B.
  • This example describes optimization of a nucleotide sequence encoding ADH3 for expression in P. pastoris.
  • Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to optimize codon usage for P. pastoris.
  • A graphical display for the native gene (SEQ ID NO: 73) encoding the ADH3 protein (SEQ ID NO: 74) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 22A.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 87) was found to encode a protein (SEQ ID NO: 88) with 100% amino acid sequence identity to wild-type ADH3 (SEQ ID NO: 74).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 87) encoding the ADH3 protein (SEQ ID NO: 88) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 22B.
  • This example describes optimization of a nucleotide sequence encoding ADH3 for expression in K. lactis.
  • Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to optimize codon usage for K. lactis.
  • A graphical display for the native gene (SEQ ID NO: 73) encoding the ADH3 protein (SEQ ID NO: 74) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 23A.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 93) was found to encode a protein (SEQ ID NO: 94) with 100% amino acid sequence identity to wild-type ADH3 (SEQ ID NO: 74).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 93) encoding the ADH3 protein (SEQ ID NO: 94) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 23B.
  • This example describes optimization of a nucleotide sequence encoding ADH3 for expression in Z. mobilis.
  • Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to optimize codon usage for Z. mobilis.
  • A graphical display for the native gene (SEQ ID NO: 73) encoding the ADH3 protein (SEQ ID NO: 74) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 24A.
  • The nucleotide sequence for the gene encoding the ADH3 protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 95) was found to encode a protein (SEQ ID NO: 96) with 100% amino acid sequence identity to wild-type ADH3 (SEQ ID NO: 74).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 95) encoding the ADH3 protein (SEQ ID NO: 96) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 24B.
  • Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 20 and of the native ADH3 protein is examined by Western blot analysis.
  • Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
  • An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
  • Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-ADH3 antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
  • This example describes optimization of a nucleotide sequence encoding TAL1 for expression in yeast.
  • The chi-squared value "chisq1" was generated from the expected and observed values so determined.
  • The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • Z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
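The recalculation that removes amino acid pair bias is not spelled out in this passage. One plausible reading, assumed here and not confirmed by the text, is to compute each expected count within its amino acid pair: the expected number of a codon pair is the observed count of the encoded amino acid pair multiplied by the usage fraction of each codon among its synonyms. The sketch below illustrates that reading only; the function name, the counts, and the two-amino-acid codon map are hypothetical.

```python
from collections import Counter


def chisq2_like(codon_counts, pair_counts, codon_to_aa):
    """Chi-squared terms with expected counts conditioned on amino acid pairs."""
    # Total occurrences of each amino acid, summed over its synonymous codons.
    aa_counts = Counter()
    for codon, n in codon_counts.items():
        aa_counts[codon_to_aa[codon]] += n

    # Observed occurrences of each amino acid pair.
    aa_pair_counts = Counter()
    for (c1, c2), n in pair_counts.items():
        aa_pair_counts[(codon_to_aa[c1], codon_to_aa[c2])] += n

    result = {}
    for (c1, c2), observed in pair_counts.items():
        a1, a2 = codon_to_aa[c1], codon_to_aa[c2]
        expected = (
            codon_counts[c1] / aa_counts[a1]
            * codon_counts[c2] / aa_counts[a2]
            * aa_pair_counts[(a1, a2)]
        )
        result[(c1, c2)] = (observed - expected) ** 2 / expected
    return result


# Hypothetical counts for two amino acids with two codons each.
codon_to_aa = {"GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K"}
codon_counts = Counter({"GCT": 6, "GCC": 4, "AAA": 5, "AAG": 5})
pair_counts = Counter({
    ("GCT", "AAA"): 4, ("GCT", "AAG"): 2,
    ("GCC", "AAA"): 1, ("GCC", "AAG"): 3,
})
chi2 = chisq2_like(codon_counts, pair_counts, codon_to_aa)
```

Under this reading the expected counts for all codon pairs encoding a given amino acid pair sum to the observed count of that amino acid pair, so a frequent or rare amino acid pair no longer inflates the statistic by itself.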
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to optimize codon usage for S. cerevisiae.
  • The nucleotide sequence encoding TAL1 (SEQ ID NO: 97) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
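Deriving a coding sequence from a genomic record by removing untranslated sequence amounts to joining the coding intervals in order. The following is a minimal sketch with a made-up genomic string and made-up 1-based inclusive coordinates; none of these values come from the GenBank record cited above.

```python
def splice_cds(genomic, exons):
    """Join 1-based inclusive (start, end) coding intervals into one CDS."""
    cds = "".join(genomic[start - 1:end] for start, end in exons)
    if len(cds) % 3:
        raise ValueError("spliced coding sequence is not a whole number of codons")
    return cds.upper()


# Hypothetical record: 5' UTR, exon, intron, exon, 3' flank.
utr5, exon1, intron, exon2, flank3 = "cccc", "ATGGCT", "gtaagtttag", "AAAGGTTAA", "tttt"
genomic = utr5 + exon1 + intron + exon2 + flank3

# Coordinates computed from the part lengths so they stay consistent.
exon1_start = len(utr5) + 1
exon1_end = len(utr5) + len(exon1)
exon2_start = exon1_end + len(intron) + 1
exon2_end = exon2_start + len(exon2) - 1

cds = splice_cds(genomic, [(exon1_start, exon1_end), (exon2_start, exon2_end)])
```

The length check catches the most common coordinate mistake, an off-by-one interval boundary, before the sequence is passed on to codon pair analysis.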
  • A graphical display for the native gene (SEQ ID NO: 97) encoding the TAL1 protein (SEQ ID NO: 98) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • The graphical display is provided in Figure 25.
  • A graphical display for the native gene (SEQ ID NO: 97) encoding the TAL1 protein (SEQ ID NO: 98) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • The graphical display is provided in Figure 26A.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 99) was found to encode a protein (SEQ ID NO: 100) with 100% amino acid sequence identity to wild-type TAL1 (SEQ ID NO: 98).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 99) encoding the TAL1 protein (SEQ ID NO: 100) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 26B.
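The text states that codon pairs with z scores greater than 3 were removed while the protein sequence was preserved, but it does not describe the search used. The sketch below shows one simple way this could be done: a single left-to-right pass that swaps in a synonymous codon whenever a pair exceeds the threshold, accepting a swap only if both pairs touching the changed codon end up at or below the threshold. This greedy strategy is an assumption for illustration, it is not guaranteed to clear every pair, and the z scores are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in TCAG order (an assumption of this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS[_aa].append(_codon)


def translate(cds):
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def local_ok(cs, j, z_table, threshold):
    """True if both codon pairs touching codon j are at or below the threshold."""
    left = j == 0 or z_table.get((cs[j - 1], cs[j]), 0.0) <= threshold
    right = j == len(cs) - 1 or z_table.get((cs[j], cs[j + 1]), 0.0) <= threshold
    return left and right


def remove_high_z_pairs(cds, z_table, threshold=3.0):
    """Greedy synonymous recoding of codon pairs whose z score exceeds threshold."""
    cs = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    for i in range(len(cs) - 1):
        if z_table.get((cs[i], cs[i + 1]), 0.0) <= threshold:
            continue
        # Try the second codon of the pair first, then the first codon.
        for j in (i + 1, i):
            original = cs[j]
            for alternative in SYNONYMS[CODON_TABLE[original]]:
                cs[j] = alternative
                if local_ok(cs, j, z_table, threshold):
                    break
            else:
                cs[j] = original  # no acceptable synonym at this position
                continue
            break
    return "".join(cs)


# Hypothetical z scores: both lysine codons are flagged after GCT.
toy_z = {("GCT", "AAA"): 4.2, ("GCT", "AAG"): 3.5}
native = "ATGGCTAAAGGT"
recoded = remove_high_z_pairs(native, toy_z)
```

In the toy case neither lysine codon resolves the flagged pair, so the alanine codon is changed instead, and the encoded protein is unchanged.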
  • This example describes optimization of a nucleotide sequence encoding TAL1 for expression in bacteria.
  • Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to optimize codon usage for E. coli.
  • A graphical display for the native gene (SEQ ID NO: 97) encoding the TAL1 protein (SEQ ID NO: 98) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position.
  • The graphical display is provided in Figure 27A.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 101) was found to encode a protein (SEQ ID NO: 102) with 100% amino acid sequence identity to wild-type TAL1 (SEQ ID NO: 98).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 101) encoding the TAL1 protein (SEQ ID NO: 102) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 27B.
  • This example describes optimization of a nucleotide sequence encoding TAL1 for expression in P. pastoris.
  • Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to optimize codon usage for P. pastoris.
  • A graphical display for the native gene (SEQ ID NO: 97) encoding the TAL1 protein (SEQ ID NO: 98) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 28A.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 103) was found to encode a protein (SEQ ID NO: 104) with 100% amino acid sequence identity to wild-type TAL1 (SEQ ID NO: 98).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 103) encoding the TAL1 protein (SEQ ID NO: 104) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 28B.
  • This example describes optimization of a nucleotide sequence encoding TAL1 for expression in K. lactis.
  • Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to optimize codon usage for K. lactis.
  • A graphical display for the native gene (SEQ ID NO: 97) encoding the TAL1 protein (SEQ ID NO: 98) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 29A.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 105) was found to encode a protein (SEQ ID NO: 106) with 100% amino acid sequence identity to wild-type TAL1 (SEQ ID NO: 98).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 105) encoding the TAL1 protein (SEQ ID NO: 106) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 29B.
  • This example describes optimization of a nucleotide sequence encoding TAL1 for expression in Z. mobilis.
  • Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to optimize codon usage for Z. mobilis.
  • A graphical display for the native gene (SEQ ID NO: 97) encoding the TAL1 protein (SEQ ID NO: 98) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 30A.
  • The nucleotide sequence for the gene encoding the TAL1 protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
  • The resulting nucleotide sequence (SEQ ID NO: 107) was found to encode a protein (SEQ ID NO: 108) with 100% amino acid sequence identity to wild-type TAL1 (SEQ ID NO: 98).
  • A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 107) encoding the TAL1 protein (SEQ ID NO: 108) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 30B.
  • Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 26 and of the native TAL1 protein is examined by Western blot analysis.
  • Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
  • An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
  • Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-TAL1 antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
  • Nucleic acid constructs can be prepared, for example, as shown in Figure 31A (upper panel).
  • Figure 31A shows nucleic acid constructs for expressing heterologous genes in S. cerevisiae.
  • A yeast copy-number control element (CEN/ARS or 2 µm) was introduced into the EcoRI site of the polylinker of the bacterial vector pUC18.
  • The PGK1 promoter (PGK1p) and CYC1 terminator (CYC1t) sequences were introduced into a unique site (SspI, B), separated by a restriction site (D, SpeI/XhoI) which can be used for cloning of the heterologous gene of interest (GENE X) by ligation or recombination rescue cloning.
  • The desired nutritional MARKER (URA3, TRP1, CAN1, or MET15)
  • Figure 31B shows a scheme for the integration of heterologous gene expression cassettes.
  • Stable expression of combinations of genes is achieved through sequential or simultaneous integration of heterologous genes into yeast chromosomes via recombinational replacement of Ty1 elements (ending in delta repeats, open boxes) inserted at positions which allow substantial gene expression.
  • Primers containing outside ends with similarity to target genomic sequences (black boxes) and inside ends which overlap the PGK1p and loxP sequences are used in a PCR reaction to amplify a fragment containing GENE X and the selectable MARKER.
  • The PCR fragment is integrated via a double crossover in terminal regions of homology with the genome, and integrants are selected.
  • Cells are transformed with a plasmid expressing the Cre recombinase, and cells in which the MARKER is lost by Cre-mediated recombination between the flanking loxP sites are selected by growth on medium containing the appropriate reverse selection agent.
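The marker-recycling step can be pictured as an operation on an ordered list of cassette elements: recombination between two directly repeated loxP sites removes everything between them and leaves a single loxP site behind. The sketch below models only that bookkeeping. The element order shown (promoter, gene, terminator, then the loxP-flanked MARKER) is an assumption, since Figure 31B is not reproduced here, and the element names are placeholders.

```python
def cre_excise(elements, site="loxP"):
    """Remove the segment between two directly repeated sites, keeping one site."""
    positions = [i for i, element in enumerate(elements) if element == site]
    if len(positions) != 2:
        raise ValueError("expected exactly two recombination sites")
    first, second = positions
    # Keep everything up to and including the first site, then resume
    # after the second site; the MARKER between them is lost.
    return elements[:first + 1] + elements[second + 1:]


# Hypothetical integrated cassette flanked by genomic homology regions.
integrated = [
    "genome_left", "PGK1p", "GENE_X", "CYC1t",
    "loxP", "MARKER", "loxP", "genome_right",
]
after_cre = cre_excise(integrated)
```

Because the marker is removed after each round, the same selectable MARKER could in principle be reused for the next integration, which is consistent with the sequential integration described above.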
  • The PGK1 promoter region was amplified from genomic S. cerevisiae DNA using primers PGK1-FOR (5'-AATATTaggcattgcaagaattactcgtgagtaagg-3') and PGK1-REV (5'-ACTAGTatatttgttgtaaaaagtagataattacttcc-3'), which places an SspI site at the 5' end of the construct and a SpeI site at the 3' end of the construct.
  • The CYC1 terminator was amplified from plasmid pNB2258 using primers CYC1-FOR (5'-ACTAGTgatatctgcgcaCTCGAGtcatgtaattagttatgtcacgc-3') and CYC1-REV (5'-AATATTggccgcaaattaaagccttcgagcgtcccaaaaccttctc-3').
  • This amplified product therefore has SpeI-12N-XhoI restriction sites at the 5' end and an SspI site at the 3' end.
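The restriction sites carried by the primers above can be confirmed by a direct string search. The sketch below uses the well-known recognition sequences for SspI (AATATT), SpeI (ACTAGT), XhoI (CTCGAG) and EcoRI (GAATTC), and checks that the CYC1-FOR primer has exactly 12 nucleotides between its SpeI and XhoI sites. The primer strings are transcribed from the text, with spacing removed and one garbled character in CYC1-REV read as "c", which is an assumption.

```python
SITES = {"SspI": "AATATT", "SpeI": "ACTAGT", "XhoI": "CTCGAG", "EcoRI": "GAATTC"}

PRIMERS = {
    "PGK1-FOR": "AATATTaggcattgcaagaattactcgtgagtaagg",
    "PGK1-REV": "ACTAGTatatttgttgtaaaaagtagataattacttcc",
    "CYC1-FOR": "ACTAGTgatatctgcgcaCTCGAGtcatgtaattagttatgtcacgc",
    # One character near the 3' end is garbled in the source; read here as "c".
    "CYC1-REV": "AATATTggccgcaaattaaagccttcgagcgtcccaaaaccttctc",
}


def find_sites(primer):
    """Map each enzyme whose site occurs in the primer to its first 0-based offset."""
    seq = primer.upper()
    return {name: seq.find(site) for name, site in SITES.items() if site in seq}


def spacer_length(primer, first, second):
    """Number of nucleotides between the end of one site and the start of the next."""
    seq = primer.upper()
    start = seq.index(SITES[first]) + len(SITES[first])
    return seq.index(SITES[second], start) - start


for name, primer in PRIMERS.items():
    print(name, find_sites(primer))

cyc1_spacer = spacer_length(PRIMERS["CYC1-FOR"], "SpeI", "XhoI")
```

A check like this is also a quick way to catch transcription errors in primer sequences, since a single wrong base in a recognition site makes the site disappear from the search result.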
  • The ligated fragment composed of the PGK1-CYCterm with flanking SspI sites was then ligated into the SspI site of pUC18, creating vectors pXP13 (forward direction) and pXP14 (reverse direction).
  • The TEF1 promoter region was amplified from genomic S. cerevisiae DNA using primers TEF-1-FOR-SspI (5'-AATATTaccgcgaatccttacatcac-3') and TEF1-REV (5'-ccACTAGTtttgtaattaaaacttagattagattgctatgc-3'), which places an SspI site at the 5' end of the construct and a SpeI site at the 3' end of the construct. This was digested with SspI and SpeI.
  • The CYC1 terminator fragment described in section 1, with SpeI-12N-XhoI restriction sites at the 5' end and an SspI site at the 3' end, was ligated with the TEF1 promoter fragment.
  • The ligated fragment composed of the TEF1-CYC1term with flanking SspI sites was then ligated into the SspI site of pUC18, creating vectors pXP17 (forward direction) and pXP18 (reverse direction).
  • The 2 µm origin was amplified from plasmid pRS425 using primers 2um-FOR (5'-GAATTCaacgaagcatctgtgcttcattttgtagaa-3') and 2um-REV (5'-GAATTCgtatgatccaatatcaaaggaaatgatagc-3'). These primers place EcoRI sites at each side of the 2 µm origin cassette. Following sequence verification, this cassette was ligated into the pXP13 and pXP17 vectors described above, creating vectors pXP200 and pXP400, respectively.
  • The CEN/ARS origin was amplified from plasmid pRS315 using primers CEN/ARS-FOR (5'-GAATTCatcacgtgctataaaaataattataattt-3') and CEN/ARS-REV (5'-
  • The CAN1 marker was amplified from plasmid pRS319a using primers CAN1-FOR (5'-
  • The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400, creating plasmids pXP200CAN, pXP200CAN-REV, pXP400CAN and pXP400CAN-REV.
  • The MET15 marker was amplified from plasmid pRS401 using primers MET-FOR (5'-
  • This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300, creating plasmids pXP100MET, pXP100MET-REV, pXP300MET and pXP300MET-REV.
  • The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400, creating plasmids pXP200MET, pXP200MET-REV, pXP400MET and pXP400MET-REV.
  • The TRP1 marker was amplified from plasmid pRS314 using primers TRP-FOR (5'-
  • This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300, creating plasmids pXP100TRP, pXP100TRP-REV, pXP300TRP and pXP300TRP-REV.
  • The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400, creating plasmids pXP200TRP, pXP200TRP-REV, pXP400TRP and pXP400TRP-REV.
  • the URA3 marker was amplified from plasmid pRSl l ⁇ using primers URA-FOR (5'-
  • This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300, creating plasmids pXP100URA, pXP100URA-REV, pXP300URA and pXP300URA-REV.
  • The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400, creating plasmids pXP200URA, pXP200URA-REV, pXP400URA and pXP400URA-REV.
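The marker plasmids named above follow a regular pattern: a backbone (pXP100, pXP200, pXP300 or pXP400), a marker tag (CAN, MET, TRP or URA), and an optional -REV suffix for the second insert orientation. The short sketch below enumerates the full set implied by that pattern. It assumes the CAN1 marker was also placed in pXP100 and pXP300, which the truncated text suggests but does not state, and that -REV denotes the reverse orientation of the marker insert.

```python
from itertools import product

BACKBONES = ("pXP100", "pXP200", "pXP300", "pXP400")
MARKER_TAGS = ("CAN", "MET", "TRP", "URA")
ORIENTATIONS = ("", "-REV")  # assumed: forward insert has no suffix


def marker_plasmids():
    """Enumerate plasmid names as backbone + marker tag + orientation suffix."""
    return [
        f"{backbone}{tag}{orientation}"
        for backbone, tag, orientation in product(BACKBONES, MARKER_TAGS, ORIENTATIONS)
    ]


names = marker_plasmids()
print(len(names), "plasmids, for example:", names[:4])
```

Generating the names programmatically makes it easy to check a strain or plasmid inventory for missing combinations.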

Abstract

This invention relates to polynucleotide sequences and synthetic genes encoding fermentation and pentose phosphate pathway enzymes to be expressed in a host organism with refined and/or improved translational kinetics, and to methods of making them. The nucleotide encoding the fermentation and/or pentose phosphate pathway enzymes is designed to be translated rapidly along its entire length. Expression of the resulting nucleotide is expected to provide improved protein expression levels in cases where excessive or inappropriate translational pauses reduce protein expression. In addition, expression of the nucleotide encoding the fermentation and/or pentose phosphate pathway enzymes is expected to provide improved levels of expression of functional, natively folded and/or active polypeptides in cases where excessive or inappropriate translational pauses cause expression of an inactive, insoluble, aggregated, or otherwise dysfunctional or poorly active fermentation and/or pentose phosphate pathway enzyme.
PCT/US2008/006378 2007-05-21 2008-05-14 Nucleotide sequences encoding fermentation and pentose phosphate pathway enzymes with refined translational kinetics, and corresponding methods of making them WO2008153676A2 (fr)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US93919707P 2007-05-21 2007-05-21
US93920007P 2007-05-21 2007-05-21
US60/939,197 2007-05-21
US60/939,200 2007-05-21
US94064207P 2007-05-29 2007-05-29
US60/940,642 2007-05-29
US94078507P 2007-05-30 2007-05-30
US60/940,785 2007-05-30

Publications (2)

Publication Number Publication Date
WO2008153676A2 true WO2008153676A2 (fr) 2008-12-18
WO2008153676A3 WO2008153676A3 (fr) 2009-03-05

Family

ID=39682485

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/006378 WO2008153676A2 (fr) 2007-05-21 2008-05-14 Nucleotide sequences encoding fermentation and pentose phosphate pathway enzymes with refined translational kinetics, and corresponding methods of making them

Country Status (1)

Country Link
WO (1) WO2008153676A2 (fr)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0723017A2 (fr) * 1995-01-23 1996-07-24 Basf Aktiengesellschaft Transketolase
WO2000032789A1 (fr) * 1998-12-02 2000-06-08 Centrum Voor Plantenveredelings- En Reproduktieonderzoek (Cpro-Dlo) Fruit flavour related genes and use thereof
WO2002020796A2 (fr) * 2000-09-01 2002-03-14 E.I. Dupont De Nemours And Company Genes and enzymes of the methanotrophic carbon metabolic pathway
WO2004042043A2 (fr) * 2002-11-05 2004-05-21 Affinium Pharmaceuticals, Inc. Crystal structures of bacterial ribulose-phosphate 3-epimerases
WO2005003362A2 (fr) * 2003-03-10 2005-01-13 Athenix Corporation Methods for conferring herbicide resistance
US20050181464A1 (en) * 2002-04-04 2005-08-18 Affinium Pharmaceuticals, Inc. Novel purified polypeptides from bacteria
WO2006083410A2 (fr) * 2004-12-22 2006-08-10 Michigan Biotechnology Institute Recombinant microorganisms for increased production of organic acids
WO2008041840A1 (fr) * 2006-10-02 2008-04-10 Dsm Ip Assets B.V. Metabolic engineering of arabinose-fermenting yeast cells

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BECKER JESSICA ET AL: "A MODIFIED SACCHAROMYCES CEREVISIAE STRAIN THAT CONSUMES L-ARABINOSE AND PRODUCES ETHANOL" APPLIED AND ENVIRONMENTAL MICROBIOLOGY, WASHINGTON,DC, vol. 69, no. 7, 1 July 2003 (2003-07-01), pages 4144-4150, XP009080581 ISSN: 0099-2240 *
HAHN-HÄGERDAL BÄRBEL ET AL: "Metabolic engineering for pentose utilization in Saccharomyces cerevisiae." ADVANCES IN BIOCHEMICAL ENGINEERING/BIOTECHNOLOGY 2007, vol. 108, 12 May 2007 (2007-05-12), pages 147-177, XP008095962 ISSN: 0724-6145 *
KARHUMAA KAISA ET AL: "Investigation of limiting metabolic steps in the utilization of xylose by recombinant Saccharomyces cerevisiae using metabolic engineering." YEAST (CHICHESTER, ENGLAND) 15 APR 2005, vol. 22, no. 5, 15 April 2005 (2005-04-15), pages 359-368, XP002494168 ISSN: 0749-503X *
SUNDSTROEM M ET AL: "YEAST TKL1 GENE ENCODES A TRANSKETOLASE THAT IS REQUIRED FOR EFFICIENT GLYCOLYSIS AND BIOSYNTHESIS OF AROMATIC AMINO ACIDS" JOURNAL OF BIOLOGICAL CHEMISTRY, AMERICAN SOCIETY OF BIOLOCHEMICAL BIOLOGISTS, BIRMINGHAM,; US, vol. 268, no. 32, 1 January 1993 (1993-01-01), pages 24346-24352, XP000919119 ISSN: 0021-9258 *
VAN MARIS ANTONIUS J A ET AL: "Alcoholic fermentation of carbon sources in biomass hydrolysates by Saccharomyces cerevisiae: current status." ANTONIE VAN LEEUWENHOEK NOV 2006, vol. 90, no. 4, November 2006 (2006-11), pages 391-418, XP002494169 ISSN: 0003-6072 *
WIEDEMANN BEATE ET AL: "Codon-optimized bacterial genes improve L-Arabinose fermentation in recombinant Saccharomyces cerevisiae." APPLIED AND ENVIRONMENTAL MICROBIOLOGY APR 2008, vol. 74, no. 7, April 2008 (2008-04), pages 2043-2050, XP002494170 ISSN: 1098-5336 *

Also Published As

Publication number Publication date
WO2008153676A3 (fr) 2009-03-05

Similar Documents

Publication Publication Date Title
DK2345662T3 (en) Genetically modified yeast species and fermentation processes using genetically modified yeast
US11624057B2 (en) Glycerol free ethanol production
EP3033413B2 (fr) Methods for improving product yield and production in a microorganism through glycerol recycling
EP2922950B1 (fr) Electron-consuming ethanol production pathway to replace glycerol formation in S. cerevisiae
CA2832279C (fr) Methods for improving product yield and production in a microorganism through the addition of alternate electron acceptors
US9546385B2 (en) Genetically modified clostridium thermocellum engineered to ferment xylose
WO2011022651A1 (fr) Production of propanols, alcohols, and polyols in consolidated bioprocessing organisms
JPWO2008093847A1 (ja) DNA encoding xylitol dehydrogenase
US20040142456A1 (en) Xylose-fermenting recombinant yeast strains
US20080085341A1 (en) Methods and microorganisms for forming fermentation products and fixing carbon dioxide
CA2850480A1 (fr) Engineering of microorganisms to increase ethanol production by metabolic redirection
US11034949B2 (en) Chimeric polypeptides having xylose isomerase activity
WO2008144012A2 (fr) Nucleotide sequences encoding xylose- and arabinose-metabolizing enzymes with refined translational kinetics and methods of making same
EP2812430B1 (fr) Pentose-fermenting microorganism
WO2009005564A2 (fr) Nucleotide sequences encoding cellulose- and hemicellulose-degrading enzymes with refined translational kinetics and methods of making same
WO2008153676A2 (fr) Nucleotide sequences encoding fermentation and pentose phosphate pathway enzymes with refined translational kinetics and methods of making same
CN115976005A (zh) Xylose isomerase obtained by an ancestral sequence construction method and use thereof
US20160340702A1 (en) Heat-stable, fe-dependent alcohol dehydrogenase for aldehyde detoxification
KR20210048394A (ko) Microorganism with improved carbon monoxide utilization and use thereof
US20210017526A1 (en) Xylose metabolizing yeast
JP2020115827A (ja) Yeast with suppressed xylitol accumulation

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 08767803
Country of ref document: EP
Kind code of ref document: A2

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: PCT application non-entry in European phase
Ref document number: 08767803
Country of ref document: EP
Kind code of ref document: A2