WO2008144012A2 - Xylose- and arabinose-metabolizing enzyme-encoding nucleotide sequences with refined translational kinetics and methods of making same

Publication number
WO2008144012A2
Authority
WO
WIPO (PCT)
Prior art keywords
nucleotides
replaced
amino acids
nucleotide sequence
seq
Prior art date
Application number
PCT/US2008/006353
Other languages
French (fr)
Other versions
WO2008144012A3 (en)
Inventor
Kirsty A. Salmon
David A. Roth
Wesley G. Hatfield
Yimeng Dou
Original Assignee
The Regents Of The University Of California
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Regents Of The University Of California
Publication of WO2008144012A2
Publication of WO2008144012A3

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12P FERMENTATION OR ENZYME-USING PROCESSES TO SYNTHESISE A DESIRED CHEMICAL COMPOUND OR COMPOSITION OR TO SEPARATE OPTICAL ISOMERS FROM A RACEMIC MIXTURE
    • C12P7/00 Preparation of oxygen-containing organic compounds
    • C12P7/02 Preparation of oxygen-containing organic compounds containing a hydroxy group
    • C12P7/04 Preparation of oxygen-containing organic compounds containing a hydroxy group acyclic
    • C12P7/06 Ethanol, i.e. non-beverage
    • C12P7/08 Ethanol, i.e. non-beverage produced as by-product or from waste or cellulosic material substrate
    • C12P7/10 Ethanol, i.e. non-beverage produced as by-product or from waste or cellulosic material substrate substrate containing cellulosic material
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N9/00 Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
    • C12N9/0004 Oxidoreductases (1.)
    • C12N9/0006 Oxidoreductases (1.) acting on CH-OH groups as donors (1.1)
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02E REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E50/00 Technologies for the production of fuel of non-fossil origin
    • Y02E50/10 Biofuels, e.g. bio-diesel

Definitions

  • the present invention relates to refining the translational kinetics of an mRNA into polypeptide, and polypeptide-encoding nucleotide sequences which have refined translational properties.
  • Saccharomyces yeasts have proven to be safe, effective and user-friendly microorganisms for large-scale production of industrial ethanol from glucose-based feedstocks. Recently, efforts have been made to use cellulosic biomass as feedstock for producing ethanol.
  • the major fermentable sugars from hydrolysis of these feedstocks, such as rice and wheat straw, sugarcane bagasse, corn stover, corn fibre, softwood, hardwood and grasses, are D-glucose, L-arabinose and D-xylose.
  • the Saccharomyces yeasts are not able to use arabinose or xylose for growth or production of ethanol.
  • yeast and other microorganisms that can co-ferment glucose, arabinose and xylose simultaneously to ethanol through expression of the enzymes involved in the arabinose and xylose fermentation pathways.
  • Such pathways have been identified in yeast, filamentous fungi and other eukaryotes.
  • Related pathways utilizing distinct enzymes have been identified in bacteria.
  • Some translational pauses are resultant from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, exogenous expression of genes foreign to the expression organism can lead to inefficient translation and poor expression. Even when the gene is translated in a sufficiently efficient manner that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pause structures coded for by specific di-codon nucleotide sequences in the open reading frame (ORF) can improve protein expression.
  • ORF open reading frame
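The idea of locating specific di-codon sequences within an ORF can be illustrated with a minimal Python sketch. It assumes the ORF is supplied as a plain DNA string whose length is a multiple of three, and it reports positions as 1-based nucleotide ranges, the same convention used for the codon pair listings in this document. The function names and the toy sequence are hypothetical and are not taken from the application.

```python
from typing import Iterator, List, Tuple


def codon_pairs(orf: str) -> Iterator[Tuple[int, str]]:
    """Yield (start, pair) for every in-frame codon pair of the ORF.

    `start` is the 1-based position of the first nucleotide of the pair.
    Adjacent pairs overlap by one codon.
    """
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    # Step one codon at a time; stop while six nucleotides remain.
    for i in range(0, len(orf) - 5, 3):
        yield i + 1, orf[i:i + 6]


def find_pair(orf: str, pair: str) -> List[Tuple[int, int]]:
    """Return 1-based (first, last) nucleotide ranges where `pair` occurs in frame."""
    target = pair.upper()
    return [(start, start + 5) for start, p in codon_pairs(orf) if p == target]


# Toy ORF (hypothetical, not a sequence from the application):
# ATG TTG AAC GGT ATT TTG AAC TAA
toy_orf = "ATGTTGAACGGTATTTTGAACTAA"
hits = find_pair(toy_orf, "TTGAAC")  # in-frame occurrences at 4-9 and 16-21
```

Only in-frame occurrences are reported, because a six-nucleotide match that straddles codon boundaries does not correspond to a codon pair read by the ribosome.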
  • sugar catabolic enzyme-encoding nucleotide sequences with refined translational kinetics and methods of designing and synthesizing the same.
  • a sugar catabolic enzyme-encoding nucleotide sequence wherein the encoded sequence has amino acid sequence identity with an original sugar catabolic enzyme polypeptide, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing original codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the resultant sugar catabolic enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length.
  • Expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression.
  • expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression products in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated enzyme.
  • sugar catabolic enzyme-encoding nucleotide sequences wherein the encoded sequence has amino acid sequence identity with an original sugar catabolic enzyme-encoding nucleotide sequence and is adapted for expression in a heterologous host organism, wherein at least 1, 2, or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
  • the host organism is not human, E. coli or S. cerevisiae.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGTATT (nucleotides 619-624); TTGAAC (nucleotides 16-21); TTGAAC (nucleotides 274-279); TTGAAC (nucleotides 670-675); TTGAAC (nucleotides 688-693); CTTTCT (nucleotides 286-291); GCCATT (nucleotides 181-186); TCTCCA (nucleotides 697-702); TCTCCA (nucleotides 751-756); ATCAAG (nucleotides 103-108); ATCAAG (nucleotides 541-546); ATCAAG (nucleotides 721-726); GCCAAG (nucleotides 889-894).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGTATT (nucleotides 619-624) replaced with GGAATT; TTGAAC (nucleotides 16-21) replaced with TTAAAT; TTGAAC (nucleotides 274-279) replaced with CTAAAT; TTGAAC (nucleotides 670-675) replaced with TTAAAT; TTGAAC (nucleotides 688-693) replaced with TTAAAT; CTTTCT (nucleotides 286-291) replaced with CTATCT; GCCATT (nucleotides 181-186) replaced with GCTATT; TCTCCA (nucleotides 697-702) replaced with TCACCA; TCTCCA
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GAAGAT (nucleotides 136 - 141 ); CTTTCT (nucleotides 286 - 291 ); GAAGAT (nucleotides 415 - 420 ); ATTGCC (nucleotides 793 - 798 ); ATTGCC (nucleotides 886 - 891 ); GACTGG (nucleotides 928 - 933 ).
  • At least 3 of the following codon pair replacements have been made: GAAGAT (nucleotides 136 - 141 ) replaced with GAAGAT; CTTTCT (nucleotides 286 - 291 ) replaced with CTATCT; GAAGAT (nucleotides 415 - 420 ) replaced with GAAGAT; ATTGCC (nucleotides 793 - 798 ) replaced with ATCGCT; ATTGCC (nucleotides 886 - 891 ) replaced with ATAGCT; GACTGG (nucleotides 928 - 933 ) replaced with GATTGG.
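A replacement of the kind listed above can be sketched in Python as follows. This is an illustrative sketch only: it assumes the standard genetic code, treats positions as 1-based nucleotide coordinates, and accepts only silent (synonymous) replacements, whereas the text also allows conservative amino acid substitutions. The names `translate` and `replace_pair` and the toy ORF are hypothetical.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate an in-frame DNA string into one-letter amino acids."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def replace_pair(orf: str, start: int, original: str, replacement: str) -> str:
    """Replace the codon pair at 1-based nucleotide `start` with a synonymous pair."""
    i = start - 1
    if i % 3 != 0:
        raise ValueError("start is not on a codon boundary")
    if len(original) != 6 or len(replacement) != 6:
        raise ValueError("codon pairs must be six nucleotides long")
    if orf[i:i + 6] != original:
        raise ValueError("original codon pair not found at the stated position")
    if translate(original) != translate(replacement):
        raise ValueError("replacement changes the encoded amino acids")
    return orf[:i] + replacement + orf[i + 6:]


# Toy ORF (hypothetical): ATG CTT TCT ATT GCC TAA
toy_orf = "ATGCTTTCTATTGCCTAA"
step1 = replace_pair(toy_orf, 4, "CTTTCT", "CTATCT")   # Leu-Ser stays Leu-Ser
step2 = replace_pair(step1, 10, "ATTGCC", "ATCGCT")    # Ile-Ala stays Ile-Ala
```

Checking the original pair at the stated coordinates before substituting guards against off-by-one errors when positions are copied from a listing.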
  • the nucleotide sequence is optimized for expression in E. coli.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TCCAAG (nucleotides 226-231); ATCAAG (nucleotides 103-108); ATCAAG (nucleotides 541-546); ATCAAG (nucleotides 721-726); TTCAAG (nucleotides 343-348); TTCAAC (nucleotides 913-918); ATCAAC (nucleotides 901-906); GGTATT (nucleotides 619-624); GTCAAG (nucleotides 172-177); GTCAAG (nucleotides 199-204); GTCAAG (nucleotides 460-465); GACGAA (nucleotides 187-192); GACGAA (nucleotides 865-870); GGTATC (nucleotides 193-198); CCAAGA (nucleotides 589-594);
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TCCAAG (nucleotides 226-231) replaced with TCTAAA; ATCAAG (nucleotides 103-108) replaced with ATTAAA; ATCAAG (nucleotides 541-546) replaced with ATTAAA; ATCAAG (nucleotides 721-726) replaced with ATTAAG; TTCAAG (nucleotides 343-348) replaced with TTTAAA; TTCAAC (nucleotides 913-918) replaced with TTTAAT; ATCAAC (nucleotides 901-906) replaced with ATTAAT; GGTATT (nucleotides 619-624) replaced with GGAATT; GTCAAG (nucleotides 226-231) replaced with TCTAAA; ATCA
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAC (nucleotides 16 - 21 ); AAGAAG (nucleotides 175 - 180 ); GCCATT (nucleotides 181 - 186 ); GGTATC (nucleotides 193 - 198 ); TTGAAC (nucleotides 274 - 279 ); CTTTCT (nucleotides 286 - 291 ); TTCCCA (nucleotides 331 - 336 ); TTCCCA (nucleotides 499 - 504 ); TTGAAC (nucleotides 670 - 675 ); TTGAAC (nucleotides 688 - 693 ); GCCAAG (nucleotides 889 - 894 ).
  • TTGAAC nucleotides 16 - 21
  • AAGAAG nucleotides 175 - 180
  • AAAAAG nucleotides 175 - 180
  • GCCATT nucleotides 181 - 186
  • GCTATT nucleotides 181 - 186
  • GGTATC nucleotides 193 - 198
  • TTGAAC nucleotides 274 - 279
  • CTTTCT nucleotides 286 - 291
  • TTCCCA nucleotides 331 - 336
  • TTTCCA nucleotides 331 - 336
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GCCGGT (nucleotides 166 - 171 ); GGTATC (nucleotides 193 - 198 ); GCCTTG (nucleotides 271 - 276 ); GCCGGT (nucleotides 466 - 471 ); GCTTTG (nucleotides 508 - 513 ); GGTATT (nucleotides 619 - 624 ); GCTTTG (nucleotides 685 - 690 ); AACAGC (nucleotides 850 - 855 ); GCCAAG (nucleotides 889 - 894 ). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • GCCGGT nucleotides 166 - 171
  • GGTATC nucleotides 193 - 198
  • GGCATT nucleotides 193 - 198
  • GCCTTG nucleotides 271 - 276
  • GCCGGT nucleotides 466 - 471
  • GCTTTG nucleotides 508 - 513
  • GGTATT nucleotides 619 - 624
  • GCTTTG nucleotides 685 - 690
  • AACAGC nucleotides 850 - 855
  • AATTCT
  • GCCAAG nucleotides 889 - 894
  • the nucleotide sequence is optimized for expression in Z. mobilis.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • a xylose reductase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
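The threshold definition of a highly-overrepresented codon pair can be sketched in Python as below. The sketch assumes that a translational kinetics value is already available for each codon pair of the host organism, and it uses the population standard deviation; whether a population or sample standard deviation is intended is not stated here, so that choice is an assumption. The function name and all numeric values are hypothetical.

```python
import statistics
from typing import Dict, Set


def highly_overrepresented(values: Dict[str, float], k: float = 3.0) -> Set[str]:
    """Return codon pairs whose value exceeds k standard deviations.

    `values` maps each codon pair to its translational kinetics value
    for one host organism; `k` is the multiplier (for example 5, 3, 2.5 or 2).
    """
    sd = statistics.pstdev(list(values.values()))
    return {pair for pair, value in values.items() if value > k * sd}


# Hypothetical values for illustration only, not data from the application.
toy_values = {
    "GGTATT": -1.0,
    "GCCATT": 0.0,
    "TCTCCA": 0.0,
    "CTTTCT": 1.0,
    "TTGAAC": 10.0,
}
flagged_k2 = highly_overrepresented(toy_values, k=2.0)
flagged_k3 = highly_overrepresented(toy_values, k=3.0)
```

In this toy data the standard deviation is about 4.05, so only the pair with value 10.0 exceeds the k = 2 threshold and none exceeds the k = 3 threshold, which shows how the choice of multiplier changes the set of pairs selected for replacement.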
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
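The 75% amino acid sequence identity criterion can be illustrated with a short Python sketch. It assumes the two protein sequences are already aligned and of equal length, with no gaps; a real comparison would first require a sequence alignment, which this sketch does not perform. The function name and the toy peptides are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions with identical residues in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def meets_identity_threshold(a: str, b: str, threshold: float = 75.0) -> bool:
    """True when the identity is at least `threshold` percent."""
    return percent_identity(a, b) >= threshold


# Toy peptides (hypothetical): one substitution in five residues gives 80%.
identity = percent_identity("MLSIA", "MLTIA")
```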
  • the xylose reductase retains at least 75% of the enzymatic activity of wild-type Xyr (SEQ ID NO: 2) under normal physiological conditions.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 5-301 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 5-301 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 5-301 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TCCAAG when expressed in the native organism.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-5 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-5 when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair CCTTCT when expressed in the native organism.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
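The z score comparisons described above reduce to a ratio test, sketched here in Python. The sketch assumes that z scores have already been computed for the replacement pair in the heterologous host and for the wild type pair in the native organism, and that the wild type z score is positive; how a percentage limit should be read for zero or negative scores is not addressed here. The function names and numbers are hypothetical.

```python
from typing import Iterable, Tuple


def within_ratio(replacement_z: float, wild_type_z: float, max_ratio: float = 1.5) -> bool:
    """True when the replacement z score is no more than max_ratio times the wild type z score.

    max_ratio=1.5 corresponds to the "no more than 150%" limit.
    """
    if wild_type_z <= 0:
        raise ValueError("this sketch assumes a positive wild type z score")
    return replacement_z <= max_ratio * wild_type_z


def any_within(pairs: Iterable[Tuple[float, float]], max_ratio: float = 1.5) -> bool:
    """True when at least one (replacement_z, wild_type_z) pair passes the ratio test."""
    return any(within_ratio(r, w, max_ratio) for r, w in pairs)


# Hypothetical z scores for illustration only.
ok = within_ratio(4.0, 3.0)        # 4.0 is below 1.5 * 3.0 = 4.5
too_high = within_ratio(5.0, 3.0)  # 5.0 is above 4.5
```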
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 382 - 387); TTGAAG (nucleotides 694 - 699); ATCAAA (nucleotides 190 - 195); TTGAAC (nucleotides 34 - 39); TTGAAC (nucleotides 313 - 318); GCCATT (nucleotides 901 - 906); GCTACT (nucleotides 10 - 15); ATCAAG (nucleotides 121 - 126); ATCAAG (nucleotides 202 - 207); ATCAAG (nucleotides 559 - 564).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 382 - 387) replaced with AAAAAG; TTGAAG (nucleotides 694 - 699) replaced with TTAAAA; ATCAAA (nucleotides 190 - 195) replaced with ATTAAA; TTGAAC (nucleotides 34 - 39) replaced with TTAAAT; TTGAAC (nucleotides 313 - 318) replaced with TTAAAT; GCCATT (nucleotides 901 - 906) replaced with GCTATA; GCTACT (nucleotides 10 - 15) replaced with GCTACC; ATCAAG (nucleotides 121 - 126) replaced with ATTAAA
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GAAGAG (nucleotides 226 - 231 ); ATTGCC (nucleotides 748 - 753); ATTGCC (nucleotides 904 - 909).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GAAGAG (nucleotides 226 - 231 ) replaced with GAAGAA; ATTGCC (nucleotides 748 - 753) replaced with ATTGCG; ATTGCC (nucleotides 904 - 909) replaced with ATCGCG.
  • GAAGAG nucleotides 226 - 231
  • ACCTGG nucleotides 454 - 459
  • TTGCAG nucleotides 574 - 579
  • ATTGCC nucleotides 748 - 753
  • TTGCAG nucleotides 895 - 900
  • ATTGCC nucleotides 904 - 909
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: GAAGAG (nucleotides 226 - 231 ) replaced with GAAGAA; ACCTGG (nucleotides 454 - 459 ) replaced with ACTTGG; TTGCAG (nucleotides 574 - 579 ) replaced with CTCCAG; ATTGCC (nucleotides 748 - 753 ) replaced with ATTGCG; TTGCAG (nucleotides 895 - 900 ) replaced with CTCCAG; ATTGCC (nucleotides 904 - 909 ) replaced with ATCGCG.
  • the nucleotide sequence is optimized for expression in E. coli.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 382 - 387); TCCAAG (nucleotides 244 - 249); ATCAAG (nucleotides 121 - 126); ATCAAG (nucleotides 202 - 207); ATCAAG (nucleotides 559 - 564); TTCAAC (nucleotides 931 - 936); ATCAAA (nucleotides 190 - 195); GTCAAG (nucleotides 217 - 222); GTCAAG (nucleotides 739 - 744); GGTATC (nucleotides 187 - 192); GGTATC (nucleotides 505 - 510); CCAAGA (nucleotides 823 - 828); TTGAAC (nucleotides 34 - 39); TTGAAC (nucleotides 34
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 382 - 387) replaced with AAGAAG; TCCAAG (nucleotides 244
  • ATCAAG (nucleotides 121 - 126) replaced with ATTAAA; ATCAAG (nucleotides 202 - 207) replaced with ATCAAA; ATCAAG (nucleotides 559 - 564) replaced with ATCAAA; TTCAAC (nucleotides 931 - 936) replaced with TTCAAC; ATCAAA (nucleotides 190 - 195) replaced with ATCAAA; GTCAAG (nucleotides 217 - 222) replaced with GTTAAA; GTCAAG (nucleotides 739 - 744) replaced with GTTAAA; GGTATC (nucleotides 187 - 192) replaced with GGTATC; GGTATC (nucleotides 505 - 510) replaced with GGTATC; CCAAGA (nucleotides 823 - 828) replaced with CCGCGC; TGAAC (nucleotides 34 -
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAC (nucleotides 34 - 39 ); GGTATC (nucleotides 187
  • ATCAAA nucleotides 190 - 195
  • AAGAAG nucleotides 271 - 276
  • TTGAAC nucleotides 313 - 318
  • TTCCCA TTCCCA
  • GGTATC nucleotides 505 - 510
  • TTGAAG nucleotides 694 - 699
  • GCCATT nucleotides 901 - 906
  • TTGAAC nucleotides 34 - 39
  • GGTATC nucleotides 187 - 192
  • GGAATT ATCAAA
  • ATCAAA nucleotides 190 - 195
  • AAAAAA AAAAAA
  • TTGAAC nucleotides 313 - 318
  • TTCCCA nucleotides 349 - 354
  • AAGAAA nucleotides 382 - 387
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 187 - 192 ); GAAGGC (nucleotides 208 - 213 ); GCTTTG (nucleotides 289 - 294 ); GCTTTG (nucleotides 463 - 468 ); GGTATC (nucleotides 505 - 510 ); GCCTTG (nucleotides 571 - 576 ); GCCTTG (nucleotides 703 - 708 ).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGTATC (nucleotides 187 - 192 ) replaced with GGGATT; GAAGGC (nucleotides 208 - 213 ) replaced with GAAGGG; GCTTTG (nucleotides 289 - 294 ) replaced with GCCCTT; GCTTTG (nucleotides 463 - 468 ) replaced with GCCCTT; GGTATC (nucleotides 505 - 510 ) replaced with GGCATT; GCCTTG (nucleotides 571 - 576 ) replaced with GCCTTA; GCCTTG (nucleotides 703 - 708 ) replaced with GCATTG.
  • the nucleotide sequence is optimized for expression in Z. mobilis.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • a xylose reductase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the xylose reductase retains at least 75% of the enzymatic activity of wild-type XyIl (SEQ ID NO: 26) under normal physiological conditions.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 11-306 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 1 1 -306 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest ⁇ scores of the wild type codon pairs encoding amino acids 1 1 -306 when expressed in the native organism.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 25 and which encode amino acids 1-11 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-11 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-11 when expressed in the native organism.
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 106 - 111); TTGAAG (nucleotides 637 - 642); CTTTTG (nucleotides 565 - 570); GGTATT (nucleotides 277 - 282); TTGAAC (nucleotides 25 - 30); ACTTTG (nucleotides 880 - 885); GCCATT (nucleotides 790 - 795); GCTACT (nucleotides 349 - 354); GCTACT (nucleotides 664 - 669); ATCAAG (nucleotides 709 - 714); ATCAAG (nucleotides 772 - 777); GCCAAG (nucleotides 583 - 588); GCCAAG (nucleotides 646 - 651).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 106 - 111) replaced with AAAAAG; TTGAAG (nucleotides 637 - 642) replaced with TTAAAA; CTTTTG (nucleotides 565 - 570) replaced with TTGTTG; GGTATT (nucleotides 277 - 282) replaced with GGAATA; TTGAAC (nucleotides 25 - 30) replaced with TTAAAT; ACTTTG (nucleotides 880 - 885) replaced with ACATTG; GCCATT (nucleotides 790 - 795) replaced with GCTATT; GCTACT (nucleotides 349 - 354) replaced
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CCTTCC (nucleotides 13 - 18); AAGAAA (nucleotides 106 - 111); GTCAGC (nucleotides 448 - 453); CTCGGT (nucleotides 460 - 465); GTTGCC (nucleotides 535 - 540); TTTGGT (nucleotides 544 - 549); GCTGAA (nucleotides 760 - 765); ATTGCC (nucleotides 793 - 798); GTCAGC (nucleotides 841 - 846).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: CCTTCC (nucleotides 13 - 18) replaced with CCATCT; AAGAAA (nucleotides 106 - 111) replaced with AAAAAG; GTCAGC (nucleotides 448 - 453) replaced with GTTTCA; CTCGGT (nucleotides 460 - 465) replaced with TTGGGT; GTTGCC (nucleotides 535 - 540) replaced with GTTGCT; TTTGGT (nucleotides 544 - 549) replaced with TTCGGT; GCTGAA (nucleotides 760 - 765) replaced with GCTGAG; ATTGCC (nucleotides 13 - 18) replaced with CCATCT;
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 106 - 111); TCCAAG (nucleotides 361 - 366); TCCAAG (nucleotides 502 - 507); TCCAAG (nucleotides 682 - 687); ATCAAG (nucleotides 709 - 714); ATCAAG (nucleotides 772 - 777); TTCAAG (nucleotides 406 - 411); TTCAAG (nucleotides 1012 - 1017); CTTTTG (nucleotides 565
  • TTCAAC nucleotides 676 - 681
  • TTCAAC nucleotides 907 - 912
  • GGTATT nucleotides 277 - 282
  • GTCAAG nucleotides 103 - 108
  • GTCAAG nucleotides 430 - 435
  • GTCAAG nucleotides 1063 - 1068
  • GACGAA nucleotides 298 - 303
  • GGTATC nucleotides 115 - 120
  • TTGAAC nucleotides 25 - 30
  • TTTGAC nucleotides 937 - 942.
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 106 - 111) replaced with AAAAAG; TCCAAG (nucleotides 361
  • TCCAAG nucleotides 502 - 507 replaced with TCTAAA
  • TCCAAG nucleotides 682 - 687 replaced with TCTAAA
  • ATCAAG nucleotides 709 - 714 replaced with ATTAAA
  • ATCAAG nucleotides 772 - 777 replaced with ATTAAA
  • TTCAAG nucleotides 406 - 411 replaced with TTTAAA
  • TTCAAG nucleotides 1012 - 1017 replaced with TTTAAA
  • CTTTTG nucleotides 565
  • TTCAAC (nucleotides 676 - 681) replaced with TTTAAT; TTCAAC (nucleotides 907 - 912) replaced with TTTAAT; GGTATT (nucleotides 277 - 282) replaced with GGAATA; GTCAAG (nucleotides 103 - 108) replaced with GTTAAA; GTCAAG (nucleotides 430 - 435) replaced with GTTAAA; GTCAAG (nucleotides 1063 - 1068) replaced with GTTAAA; GACGAA (nucleotides 298 - 303) replaced with GATGAA; GGTATC (nucleotides 115 - 120) replaced with GGAATT; TTGAAC (nucleotides 25 - 30) replaced with TTAAAT; TTTGAC (nucleotides 937 - 942) replaced with TTCGAT.
  • the nucleotide sequence (nucleotides 676
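Whether a replacement codon pair encodes identical amino acids can be checked mechanically. The sketch below assumes the standard genetic code; an organism that uses a variant nuclear code would need a different table. The helper names are hypothetical, and the example pairs are the complete original/replacement pairs recited in the list above.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_pair(pair):
    """Translate a six-nucleotide codon pair into its two amino acids."""
    if len(pair) != 6:
        raise ValueError("a codon pair must be exactly six nucleotides long")
    return CODON_TABLE[pair[:3]], CODON_TABLE[pair[3:]]


def is_silent(original, replacement):
    """True if both codon pairs encode the same two amino acids."""
    return translate_pair(original) == translate_pair(replacement)


# Original -> replacement pairs recited in the claim above.
PAIRS = [
    ("TTCAAC", "TTTAAT"), ("GGTATT", "GGAATA"), ("GTCAAG", "GTTAAA"),
    ("GACGAA", "GATGAA"), ("GGTATC", "GGAATT"), ("TTGAAC", "TTAAAT"),
    ("TTTGAC", "TTCGAT"),
]

for original, replacement in PAIRS:
    print(original, replacement, translate_pair(original), is_silent(original, replacement))
```

A conservative (non-identical) substitution would fail `is_silent`; testing for conservative substitutions would need a substitution scheme that this sketch does not attempt to define.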
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAC (nucleotides 25 - 30); AAGAAA (nucleotides 106 - 111); GGTATC (nucleotides 115 - 120); GGTACC (nucleotides 388 - 393); CTTTTG (nucleotides 565 - 570); GCCAAG (nucleotides 583 - 588); TTGAAG (nucleotides 637 - 642); GCCAAG (nucleotides 646 - 651); GCCATT (nucleotides 790 - 795); TTCCCA (nucleotides 847 - 852).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TTGAAC (nucleotides 25 - 30) replaced with TTAAAT; AAGAAA (nucleotides 106 - 111) replaced with AAAAAG; GGTATC (nucleotides 115
  • nucleotide sequence is optimized for expression in K. lactis.
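The nucleotide ranges recited in this claim group can be mapped to codon positions with simple arithmetic. The sketch below assumes that nucleotide 1 is the first base of the first codon and that numbering is 1-based and inclusive; that assumption and the function name are illustrative and are not stated in the claims.

```python
def codon_pair_position(start, end):
    """Map a 1-based, inclusive six-nucleotide range to its two codon numbers.

    Assumes nucleotide 1 is the first base of codon 1 (hypothetical convention).
    """
    if end - start != 5:
        raise ValueError("a codon pair spans exactly six nucleotides")
    if (start - 1) % 3 != 0:
        raise ValueError("range does not start on a codon boundary")
    first = (start - 1) // 3 + 1
    return first, first + 1


# Ranges recited for the xylitol dehydrogenase codon pairs listed above.
RANGES = [(25, 30), (106, 111), (115, 120), (388, 393), (565, 570),
          (583, 588), (637, 642), (646, 651), (790, 795), (847, 852)]

for start, end in RANGES:
    print(start, end, codon_pair_position(start, end))
```

Every recited range starts on a codon boundary under this convention, which is consistent with the ranges denoting adjacent in-frame codons.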
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GATGCC (nucleotides 61 - 66); GGTATC (nucleotides 115
  • GCCGGT nucleotides 205 - 210
  • GGTATT nucleotides 277 - 282
  • GAAGGC nucleotides 367 - 372
  • GCCAAG nucleotides 583 - 588
  • GCCAAG nucleotides 646 - 651
  • ACTTTG nucleotides 880 - 885
  • GCTATT nucleotides 1021
  • At least 3 of the following codon pair replacements have been made: GATGCC (nucleotides 61 - 66) replaced with GATGCT; GGTATC (nucleotides 115 - 120) replaced with GGCATT; GCCGGT (nucleotides 205 - 210) replaced with GCTGGA; GGTATT (nucleotides 277 - 282) replaced with GGCATT; GAAGGC (nucleotides 367 - 372) replaced with GAAGGT; GCCAAG (nucleotides 583 - 588) replaced with GCTAAA; GCCAAG (nucleotides 646 - 651) replaced with GCCAAA; ACTTTG (nucleotides 880 - 885) replaced with ACCTTG; GCTATT (nucleotides 1021 - 1026) replaced with GCGATT; GAAGCC (nucleotides 1021 - 10
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
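The "highly-overrepresented" test recited above (a translational kinetics value greater than 5, 3, 2.5 or 2 times the standard deviation of the host's values) can be sketched as follows. The values below are hypothetical placeholders, not data from the specification, and the choice of the population standard deviation is an assumption.

```python
from statistics import pstdev


def highly_overrepresented(values, k=2.0):
    """Return the codon pairs whose value exceeds k times the standard deviation.

    `values` maps codon pair -> translational kinetics value for the host.
    """
    sd = pstdev(list(values.values()))
    return {pair for pair, v in values.items() if v > k * sd}


# Hypothetical host values, for illustration only.
HOST_VALUES = {
    "AAGAAA": 9.0, "TTGAAG": 1.0, "CTTTTG": -1.0, "GGTATT": 1.0,
    "TTGAAC": -1.0, "ACTTTG": 0.0, "GCCATT": 0.0, "GCTACT": 0.0,
    "ATCAAG": 0.0,
}

print(highly_overrepresented(HOST_VALUES, k=2.0))
print(highly_overrepresented(HOST_VALUES, k=5.0))
```

In practice the standard deviation would be taken over all codon pairs scored for the host organism rather than over a handful of entries.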
  • a xylitol dehydrogenase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the xylitol dehydrogenase retains at least 75% of the enzymatic activity of wild-type Xdh (SEQ ID NO: 50) under normal physiological conditions.
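The 75% amino acid sequence identity threshold recited above can be illustrated for two sequences that have already been aligned. This is a simplified sketch: it assumes a pre-computed pairwise alignment with "-" as the gap character, and it divides matches by the number of alignment columns, which is only one of several identity conventions in use. The claims do not specify which convention applies, and the sequences shown are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the columns of a pre-computed pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("alignment has no residues")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Hypothetical aligned fragments, for illustration only.
wild_type = "MKTAYIAK"
variant = "MKTAYLAK"

identity = percent_identity(wild_type, variant)
print(identity, identity >= 75.0)
```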
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 28-146 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 28-146 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 28-146 when expressed in the native organism.
  • no replacement codon encoding amino acids 28-146 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair AAGAAA when expressed in the native organism.
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 175-314 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 175-314 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 175-314 when expressed in the native organism.
  • no replacement codon encoding amino acids 175-314 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TCCAAG when expressed in the native organism.
  • a xylitol dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 146-175 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 146-175 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 146-175 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair TCCAAG when expressed in the native organism.
  • a D-xylulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 1858 - 1863); TTGAAG (nucleotides 67 - 72); TTGAAG (nucleotides 793 - 798); GAAAGT (nucleotides 1849 - 1854); GGTATT (nucleotides 283 - 288); GGTATT (nucleotides 1213 - 1218); GGGTTC (nucleotides 43 - 48); TTGAAC (nucleotides 1276 - 1281); ACTTTG (nucleotides 1366 - 1371); GCCATT (nucleotides 190 - 195); GATATC (nucleotides 490 - 495); GATATC (nucleotides 679 - 684); TCTCAA (nucleotides 1021 - 1026); TTCCCC (nucleotides
  • ATCAAG nucleotides 1261 - 1266
  • ATCAAG nucleotides 1606 - 1611
  • GCCAAG nucleotides 1717 - 1722
  • GCCAAG nucleotides 1840 - 1845.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • TTGAAA nucleotides 1858 - 1863 replaced with TTAAAA
  • TTGAAG nucleotides 67 - 72 replaced with TTAAAA
  • TTGAAG nucleotides 793 - 798 replaced with TTAAAA
  • GAAAGT nucleotides 1849
  • a D-xylulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GAAGAG (nucleotides 451 - 456); GAAGAG (nucleotides 703 - 708); TTCCTC (nucleotides 37 - 42); GCCAGT (nucleotides 613 - 618); GCCAGT (nucleotides 1693 - 1698); AAAGAG (nucleotides 442 - 447); GCCAGA (nucleotides 1099 - 1104); GCCAGA (nucleotides 1552 - 1557); AGCCAG (nucleotides 379 - 384); ATTGCC (nucleotides 847 - 852); GCCTGT (nucleotides 1666 - 1671).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GAAGAG (nucleotides 451 - 456) replaced with GAAGAA; GAAGAG (nucleotides 703 - 708) replaced with GAAGAA; TTCCTC (nucleotides 37 - 42) replaced with TTCCTG; GCCAGT (nucleotides 613 - 618) replaced with GCGTCT; GCCAGT (nucleotides 1693 - 1698) replaced with GCTAGC; AAAGAG (nucleotides 442 - 447) replaced with AAAGAA; GCCAGA (nucleotides 1099 - 1104) replaced with GCTCGT; GCCAGA (nucleotides 1552 - 1557
  • a D-xylulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TCGTTG (nucleotides 934 - 939); GATATC (nucleotides 490 - 495); GATATC (nucleotides 679 - 684); ATCAAG (nucleotides 1261 - 1266); ATCAAG (nucleotides 1606 - 1611); AAGTTT (nucleotides 1498 - 1503); TTCAAG (nucleotides 403 - 408); TTCAAG (nucleotides 556 - 561); TTGAAA (nucleotides 1858 - 1863); TTCAAC (nucleotides 268 - 273); TTCAAC (nucleotides 697 - 702); TTCAAC (nucleotides 877 - 882); TTCAAC (nucleotides 1198 - 1203); ATGTTG (n
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TCGTTG (nucleotides 934 - 939) replaced with TCTCTG; GATATC (nucleotides 490 - 495) replaced with GACATC; GATATC (nucleotides 679 - 684) replaced with GACATC; ATCAAG (nucleotides 1261 - 1266) replaced with ATCAAA; ATCAAG (nucleotides 1606 - 1611) replaced with ATCAAA; AAGTTT (nucleotides 1498 - 1503) replaced with AAGTTC; TTCAAG (nucleotides 403 - 408) replaced with TTCAAA; TTCAAG (nucleotides 556 -
  • a D-xylulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGGTTC (nucleotides 43 - 48 ); TTGAAG (nucleotides 67 - 72 ); GCCATT (nucleotides 190 - 195 ); AAGAAG (nucleotides 250 - 255 ); TTCCCC (nucleotides 262 - 267 ); TCGTTA (nucleotides 370 - 375 ); GGTAAA (nucleotides 439 - 444 ); GATATC (nucleotides 490 - 495 ); GATATC (nucleotides 679 - 684 ); GGTATC (nucleotides 781 - 786 ); TTGAAG (nucleotides 793 - 798 ); TTTGTC (nucleotides 859 - 864 ); TCGTTG (nucleotides 934
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGGTTC (nucleotides 43 - 48 ) replaced with GGTTTC; TTGAAG (nucleotides 67 - 72 ) replaced with TTAAAG; GCCATT (nucleotides 190 - 195 ) replaced with GCTATT; AAGAAG (nucleotides 250 - 255 ) replaced with AAAAAG; TTCCCC (nucleotides 262 - 267 ) replaced with TTTCCG; TCGTTA (nucleotides 370 - 375 ) replaced with TCTTTA; GGTAAA (nucleotides 439 - 444 ) replaced with GGAAAA; GATATC (nucleotides 43 - 48 ) replaced with GGTTTC;
  • a D-xylulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TCGACT (nucleotides 55 - 60 ); AACAGC (nucleotides 136
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TCGACT (nucleotides 55 - 60 ) replaced with TCTACC; AACAGC (nucleotides 136 - 141 ) replaced with AATTCT; GATGCC (nucleotides 220 - 225 ) replaced with GACGCG; GGTATT (nucleotides 283 - 288 ) replaced with GGCATT; TCCGGT (nucleotides 289 - 294 ) replaced with AGCGGT; GATGCC (nucleotides 478 - 483 ) replaced with GATGCT; GCCTTG (nucleotides 481 - 486 ) replaced with GCTTTA; GAAGCC (nucleotides 649
  • a D-xylulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • a D-xylulokinase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the D-xylulokinase retains at least 75% of the enzymatic activity of wild-type XKI (SEQ ID NO: 74) under normal physiological conditions.
  • a D-xylulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 12-312 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 12-312 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 12-312 when expressed in the native organism.
  • no replacement codon encoding amino acids 12-312 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GATATC when expressed in the native organism.
  • a D-xylulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-12 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-12 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-12 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-12 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GATGCT when expressed in the native organism.
  • an L-arabinitol 4-dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CGCTAC (nucleotides 454 - 459); GCCAAG (nucleotides 562 - 567); CTCGGT (nucleotides 574 - 579); GATATC (nucleotides 946 - 951); CGCTAC (nucleotides 964 - 969); GCCATT (nucleotides 1102 - 1107).
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced
  • CGCTAC nucleotides 454 - 459
  • GCCAAG nucleotides 562 - 567
  • CTCGGT nucleotides 574 - 579
  • GATATC nucleotides 946 - 951
  • GATATA nucleotides 964 - 969
  • GCCATT nucleotides 1102 - 1107
  • GCTATT nucleotides 1102 - 1107
  • an L-arabinitol 4-dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTGGCG (nucleotides 688 - 693); GCCAGC (nucleotides 856 - 861); ATCCTC (nucleotides 262 - 267); GCCAGT (nucleotides 928 - 933); CTCGGC (nucleotides 265 - 270); GTCAGC (nucleotides 775 - 780); TTCCCG (nucleotides 1045 - 1050); CTCGGT (nucleotides 574 - 579); TTCTGG (nucleotides 214 - 219); GCGCTG (nucleotides 517 - 522); ATCGCC (nucleotides 292 - 297).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: CTGGCG (nucleotides 688 - 693) replaced with CTCGCG; GCCAGC (nucleotides 856 - 861) replaced with GCGTCT; ATCCTC (nucleotides 262 - 267) replaced with ATCCTG; GCCAGT (nucleotides 928 - 933) replaced with GCGTCT; CTCGGC (nucleotides 265 - 270) replaced with CTGGGT; GTCAGC (nucleotides 775 - 780) replaced with GTTAGC; TTCCCG (nucleotides 1045 - 1050) replaced with TTCCCA; CTCGGT (nucleotides 574 -
  • L-arabinitol 4-dehydrogenase- encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 946 - 951); AAGTTT (nucleotides 862 - 867); GTCAAG (nucleotides 55 - 60); GTCAAG (nucleotides 1063 - 1068); GCCAAA (nucleotides 763 - 768); GGTATC (nucleotides 190 - 195); AAGAAT (nucleotides 898 - 903); TCCAAA (nucleotides 1024 - 1029).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GATATC (nucleotides 946 - 951 ) replaced with GACATC; AAGTTT (nucleotides 862 - 867) replaced with AAATTC; GTCAAG (nucleotides 55 - 60) replaced with GTTAAA; GTCAAG (nucleotides 1063 - 1068) replaced with GTTAAG; GCCAAA (nucleotides 763 - 768) replaced with GCGAAA; GGTATC (nucleotides 190 - 195) replaced with GGTATT; AAGAAT (nucleotides 898 - 903) replaced with AAAAAC; TCCAAA (nucleotides 1024 - 1029)
  • L-arabinitol 4-dehydrogenase- encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 190 - 195); CTGCGA (nucleotides 448 - 453); GCCAAG (nucleotides 562 - 567); GATATC (nucleotides 946 - 951); GCCATT (nucleotides 1102 - 1107).
  • GGTATC nucleotides 190 - 195
  • CTGCGA nucleotides 448 - 453
  • GCCAAG nucleotides 562 - 567
  • GATATC nucleotides 946 - 951
  • GCCATT nucleotides 1102 - 1107
  • GGTATC nucleotides 190 - 195
  • CTGCGA nucleotides 448 - 453
  • TTGAGG TTGAGG
  • GCCAAG nucleotides 562 - 567
  • GATATC nucleotides 946 - 951
  • GCCATT nucleotides 1102 - 1107
  • L-arabinitol 4-dehydrogenase- encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GTCGAT (nucleotides 16 - 21); GGGGCA (nucleotides 40 - 45); GATGCC (nucleotides 127 - 132); GGTATC (nucleotides 190 - 195); GCCAAG (nucleotides 562 - 567); GCCGGT (nucleotides 643 - 648); AGCCGT (nucleotides 682 - 687); TCGGCT (nucleotides 748 - 753); GTCGAT (nucleotides 943 - 948); GATGCC (nucleotides 1057 - 1062).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GTCGAT (nucleotides 16 - 21 ) replaced with GTTGAT; GGGGCA (nucleotides 40 - 45 ) replaced with GGCGCT; GATGCC (nucleotides 127 - 132 ) replaced with GACGCC; GGTATC (nucleotides 190 - 195 ) replaced with GGTATA; GCCAAG (nucleotides 562 - 567 ) replaced with GCTAAG; GCCGGT (nucleotides 643 - 648 ) replaced with GCTGGG; AGCCGT (nucleotides 682 - 687 ) replaced with TCTCGT; TCGGCT (nucleocleo
  • L-arabinitol 4-dehydrogenase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S.cerevisiae.
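The "highly-overrepresented" test described above (a translational kinetics value greater than some multiple of the standard deviation of the host's values) can be sketched as follows. This is only an illustration: the function name, the dictionary layout, and every numeric value are hypothetical and are not taken from the application, and the population standard deviation is an assumed choice.

```python
from statistics import pstdev


def highly_overrepresented(values, k=2.0):
    """Return the codon pairs whose value exceeds k standard deviations.

    `values` maps each codon pair (a 6-letter string) to its translational
    kinetics value for one host organism. The standard deviation is taken
    over all values supplied, which is an assumption of this sketch.
    """
    sd = pstdev(values.values())
    return {pair for pair, value in values.items() if value > k * sd}


# Hypothetical values, invented purely for illustration.
example_values = {
    "GCCAAG": 9.0,
    "GATATC": 1.0,
    "CTCGGT": -1.0,
    "GGTATC": 0.5,
    "AAGAAT": -0.5,
}

flagged_at_2 = highly_overrepresented(example_values, k=2.0)
flagged_at_2_5 = highly_overrepresented(example_values, k=2.5)
```

The multiplier `k` corresponds to the alternative thresholds recited above (5, 3, 2.5, or 2); a larger `k` flags fewer codon pairs.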
  • L-arabinitol 4-dehydrogenase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing arabinose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: arabinose dehydrogenase.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the L-arabinitol 4-dehydrogenase retains at least 75% of the enzymatic activity of wild-type LAD1 (SEQ ID NO: 98) under normal physiological conditions.
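The 75% amino acid sequence identity threshold recited above could be checked along the following lines. This is a rough sketch that assumes the two protein sequences have already been aligned to equal length with "-" as the gap character; the function names are hypothetical, and the choice of denominator (all aligned columns, including gapped ones) is one convention among several, not one stated in the application.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns that are gaps in both sequences.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a, aligned_b, threshold=75.0):
    """True when the identity is at least `threshold` percent."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For example, `percent_identity("MKTA", "MKSA")` compares four columns, three of which match.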
  • a L-arabinitol 4-dehydrogenase- encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 53-164 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 53-164 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 53-164 when expressed in the native organism.
  • no replacement codon encoding amino acids 53-164 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair AAGATT when expressed in the native organism.
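The z score limits above compare a replacement codon pair's z score in the heterologous host against a percentage of a wild-type reference, either a single wild-type pair or the mean or median of the five highest wild-type z scores in the region. A minimal sketch of that comparison follows; the function names are hypothetical, and the sketch assumes the reference z score is positive, since a percentage of a negative reference would invert the comparison.

```python
from statistics import mean, median


def within_percent(replacement_z, reference_z, percent):
    """True when replacement_z is no more than `percent` % of reference_z."""
    return replacement_z <= reference_z * percent / 100.0


def top_five_reference(wild_type_z_scores, use_median=False):
    """Mean (or median) of the five highest wild-type z scores in a region."""
    top = sorted(wild_type_z_scores, reverse=True)[:5]
    return median(top) if use_median else mean(top)
```

With invented numbers, a replacement pair scoring 3.0 against a wild-type reference of 2.0 sits exactly at the 150% limit and passes, while 3.1 does not.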
  • a L-arabinitol 4-dehydrogenase- encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 192-366 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 192-366 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 192-366 when expressed in the native organism.
  • no replacement codon encoding amino acids 192-366 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GAGATT when expressed in the native organism.
  • a L-arabinitol 4-dehydrogenase- encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-53 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-53 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-53 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1 -53 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GTCAAG when expressed in the native organism.
  • a L-arabinitol 4-dehydrogenase- encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 164-192 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 164-192 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 164-192 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 164-192 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GCGCTG when expressed in the native organism.
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GGTATT (nucleotides 619 - 624); TTGAAC (nucleotides 16 - 21); TTGAAC (nucleotides 274 - 279); TTGAAC (nucleotides 670 - 675); TTGAAC (nucleotides 688 - 693); CTTTCT (nucleotides 286 - 291); GCCATT (nucleotides 181 - 186); TCTCCA (nucleotides 697 - 702); TCTCCA (nucleotides 751 - 756); ATCAAG (nucleotides 103 - 108); ATCAAG (nucleotides 541 - 546); ATCAAG (nucleotides 721 - 726); GCCAAG (nucleotides 889 - 894).
  • GGTATT nucleotides 619
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GGTATT (nucleotides 619 - 624) replaced with GGAATT; TTGAAC (nucleotides 16 - 21) replaced with TTAAAT; TTGAAC (nucleotides 274 - 279) replaced with CTAAAT; TTGAAC (nucleotides 670 - 675) replaced with TTAAAT; TTGAAC (nucleotides 688 - 693) replaced with TTAAAT; CTTTCT (nucleotides 286 - 291) replaced with CTATCT; GCCATT (nucleotides 181 - 186) replaced with GCTATT; TCTCCA (nucleotides 697 - 702) replaced
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GCCTGT (nucleotides 58 - 63 ); CTTGAT (nucleotides 124 - 129 ); GCCTGT (nucleotides 226 - 231 ); GAAGAT (nucleotides 346 - 351 ); CTTTCT (nucleotides 748 - 753 ); GCCAGC (nucleotides 781 - 786 ).
  • GCCTGT nucleotides 58 - 63
  • CTTGAT nucleotides 124 - 129
  • GCCTGT nucleotides 226 - 231
  • GAAGAT nucleotides 346 - 351
  • CTTTCT nucleotides 748 - 753
  • GCCAGC nucleotides 781 - 786 .
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding
  • GCCTGT nucleotides 58 - 63
  • CTTGAT nucleotides 124 - 129
  • GCCTGT nucleotides 226 - 231
  • GAAGAT nucleotides 346 - 351
  • CTTTCT nucleotides 748 - 753
  • GCCAGC nucleotides 781 - 786 replaced with GCATCA.
  • the nucleotide sequence is optimized for expression in E.coli.
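Each codon pair above is identified by a six-nucleotide range in the coding sequence. The mapping from such a range to the two codon (amino acid) positions it covers can be sketched as below, assuming 1-based nucleotide numbering that starts at the first base of the first codon; the function name is hypothetical.

```python
def codon_pair_positions(start, end):
    """Map a 1-based nucleotide range to the 1-based positions of its two codons."""
    if end - start != 5:
        raise ValueError("a codon pair spans exactly six nucleotides")
    if (start - 1) % 3 != 0:
        raise ValueError("range does not begin on a codon boundary")
    first_codon = (start - 1) // 3 + 1
    return first_codon, first_codon + 1
```

Under that numbering assumption, nucleotides 58 - 63 cover codons 20 and 21, and nucleotides 781 - 786 cover codons 261 and 262.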
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAC (nucleotides 16 - 21); ATCAAG (nucleotides 103 - 108); GTCAAG (nucleotides 172 - 177); GACGAA (nucleotides 187 - 192); GGTATC (nucleotides 193 - 198); GTCAAG (nucleotides 199 - 204); TCCAAG (nucleotides 226 - 231); TTGAAC (nucleotides 274 - 279); TTCAAG (nucleotides 343 - 348); GTCAAG (nucleotides 460 - 465); ATCAAG (nucleotides 541 - 546); CCAAGA (nucleotides 589 - 594); GGTATT (nucleotides 619 - 624); TTGAAC
  • TTGAAC nucleotides 16 - 21
  • ATCAAG nucleotides 103 - 108
  • GTCAAG nucleotides 172 - 177
  • GACGAA nucleotides 187 - 192
  • GGTATC nucleotides 193 - 198
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 127 - 132 ); TTGAAG (nucleotides 190 - 195 ); TTGAAA (nucleotides 196 - 201 ); GTGTTT (nucleotides 262 - 267 ); TTTGCT (nucleotides 265 - 270 ); TTCCCA (nucleotides 337 - 342 ); GCCAAG (nucleotides 358 - 363 ); TTTGCT (nucleotides 421 - 426 ); ATCAAA (nucleotides 436 - 441 ); GGTATC (nucleotides 445 - 450 ); GCCATT (nucleotides 490 - 495 ); GGTATC (nucleotides 688 - 693 ); CTTTCT (nucleotides)
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GATATC (nucleotides 127 - 132) replaced with GACATT; TTGAAG (nucleotides 190 - 195) replaced with TTAAAG; TTGAAA (nucleotides 196 - 201) replaced with TTAAAG; GTGTTT (nucleotides 262 - 267) replaced with GTTTTC; TTTGCT (nucleotides 265 - 270) replaced with TTCGCT; TTCCCA (nucleotides 337 - 342) replaced with TTCCCT; GCCAAG (nucleotides 358 - 363) replaced with GCTAAA; TTTG
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: ACTTTT (nucleotides 19 - 24); GCTTTG (nucleotides 118 - 123); CTTGAT (nucleotides 124 - 129); GCCAAG (nucleotides 358 - 363); GCCTTT (nucleotides 418 - 423); GGTATC (nucleotides 445 - 450); ACTTTG (nucleotides 562 - 567); ATCAAT (nucleotides 649 - 654); GGTATC (nucleotides 688 - 693).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: ACTTTT (nucleotides 19 - 24) replaced with ACCTTT; GCTTTG (nucleotides 118 - 123) replaced with GCTCTT; CTTGAT (nucleotides 124 - 129) replaced with TTGGAC; GCCAAG (nucleotides 358 - 363) replaced with GCTAAG; GCCTTT (nucleotides 418 - 423) replaced with GCTTTC; GGTATC (nucleotides 445 - 450) replaced with GGGATT; ACTTTG (nucleotides 562 - 567) replaced with ACCTTG; ATCAAT (
  • L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S.cerevisiae.
  • L-xylulose reductase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1 -272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing arabinose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: arabinose dehydrogenase, L-arabinitol 4-dehydrogenase, and L-xylulose reductase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the L-xylulose reductase retains at least 75% of the enzymatic activity of wild-type LXR (SEQ ID NO: 122) under normal physiological conditions.
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 8- 267 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 8-267 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 8-267 when expressed in the native organism.
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-8 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAG (nucleotides 49 - 54); TTTGCC (nucleotides 583 - 588).
  • GATATT nucleotides 766 - 771.
  • AGCGAT nucleotides 364 - 369
  • GCCAAG nucleotides 529 - 534
  • GCCAAG nucleotides 700 - 705.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • TTGAAG nucleotides 49 - 54 replaced with TTAAAA
  • TTTGCC nucleotides 583 - 588 replaced with TTTGCT
  • GATATT nucleotides 766 - 771
  • AGCGAT nucleotides 364 - 369 replaced with TCAGAT
  • GCCAAG nucleotides 529 - 534 replaced with GCAAAA; GCCAAG (nucleotides 700 - 705) replaced with GCTAAA.
  • the nucleotide sequence is optimized for expression in S.cerevisiae.
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GATCTC (nucleotides 37 - 42); ATTGCC (nucleotides 313 - 318); GCCGGA (nucleotides 322 - 327); GCCAGC (nucleotides 361 - 366); CTGGCG (nucleotides 550 - 555); TTTGCC (nucleotides 583 - 588); GTCAGC (nucleotides 733 - 738).
  • GATCTC nucleotides 37 - 42
  • ATTGCC nucleotides 313 - 318
  • GCCGGA nucleotides 322 - 327
  • GCCAGC nucleotides 361 - 366
  • CTGGCG nucleotides 550 - 555
  • TTTGCC nucleotides 583 - 588
  • GTCAGC nucleotides 733 - 738
  • At least 3 of the following codon pair replacements have been made: GATCTC (nucleotides 37 - 42) replaced with GATTTG; ATTGCC (nucleotides 313 - 318) replaced with ATTGCT; GCCGGA (nucleotides 322 - 327) replaced with GCTGGA; GCCAGC (nucleotides 361 - 366) replaced with GCTTCA; CTGGCG (nucleotides 550 - 555) replaced with TTGGCT; TTTGCC (nucleotides 583 - 588) replaced with TTTGCT; GTCAGC (nucleotides 733 - 738) replaced with GTTTCA.
  • the nucleotide sequence is optimized for expression in E.coli.
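The requirement that a replacement codon pair encode identical amino acids can be checked mechanically. The sketch below translates both six-nucleotide pairs and compares the results for the replacements listed above. It assumes the standard genetic code, which the claims do not state explicitly, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acid codes."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_synonymous(wild_type_pair, replacement_pair):
    """True when both codon pairs encode the same two amino acids."""
    return translate(wild_type_pair) == translate(replacement_pair)


# Wild-type and replacement codon pairs as listed in the claim above.
replacements = [
    ("GATCTC", "GATTTG"), ("ATTGCC", "ATTGCT"), ("GCCGGA", "GCTGGA"),
    ("GCCAGC", "GCTTCA"), ("CTGGCG", "TTGGCT"), ("TTTGCC", "TTTGCT"),
    ("GTCAGC", "GTTTCA"),
]
all_synonymous = all(is_synonymous(wt, new) for wt, new in replacements)
```

A check for conservative substitutions would additionally need a substitution grouping, which this sketch does not attempt.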
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GTCAAG (nucleotides 220 - 225 ); TTCAAG (nucleotides 436 - 441 ); AAGAAG (nucleotides 439 - 444 ); GGCCAC (nucleotides 448 - 453 ); GGCCAC (nucleotides 484 - 489 ); TTTGCC (nucleotides 583 - 588 ); GATATT (nucleotides 766 - 771 ). In some such nucleotide sequences, at least 3, or 4, or 5.
  • codon pair replacements have been made: GTCAAG (nucleotides 220 - 225 ) replaced with GTTAAA; TTCAAG (nucleotides 436 - 441 ) replaced with TTTAAA; AAGAAG (nucleotides 439 - 444 ) replaced with AAAAAG; GGCCAC (nucleotides 448 - 453 ) replaced with GGACAT; GGCCAC (nucleotides 484 - 489 ) replaced with GGACAC; TTTGCC (nucleotides 583 - 588 ) replaced with TTCGCT; GATATT (nucleotides 766 - 771 ) replaced with GATATA; GCCAAG (nucleotides 700 - 705 ) replaced with GCTA
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAG (nucleotides 49 - 54 ); AAGAAG (nucleotides 439 .
  • GCCAAG nucleotides 529 - 534
  • TTTGCC nucleotides 583 - 588
  • GCCAAG nucleotides 700 - 705
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • TTGAAG nucleotides 49 - 54
  • AAGAAG nucleotides 439 - 444
  • AAAAAG AAAAAG
  • GCCAAG nucleotides 529 - 534
  • TTTGCC nucleotides 583 - 588
  • TTCGCT GCCAAG
  • a L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTTGAT (nucleotides 34 - 39 ); GATGCC (nucleotides 304
  • GCCTTT nucleotides 307 - 312
  • GCCGGA nucleotides 322 - 327
  • GCCAAG nucleotides 529 - 534
  • GCCGGT nucleotides 535 - 540
  • AACAGC nucleotides 595 - 600
  • GATGCC nucleotides 697 - 702
  • GCCAAG nucleotides 700
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: CTTGAT (nucleotides 34 - 39 ) replaced with TTGGAT; GATGCC (nucleotides 304 - 309 ) replaced with GATGCT; GCCTTT (nucleotides 307 - 312 ) replaced with GCTTTC; GCCGGA (nucleotides 322 - 327 ) replaced with GCTGGA; GCCAAG (nucleotides 529 - 534 ) replaced with GCTAAG; GCCGGT (nucleotides 535 - 540 ) replaced with GCCGGG; AACAGC (nucleotides 595 - 600 ) replaced with AATTCT; GA
  • L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
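The threshold test for a "highly-overrepresented" codon pair described above can be sketched as follows. This is a minimal illustration, assuming the per-pair translational kinetics values for the host are already available as a dictionary; the function name `highly_overrepresented` and all numeric values are hypothetical and are not taken from the specification.

```python
from statistics import pstdev


def highly_overrepresented(tk_values, k=2.0):
    """Return the codon pairs whose value exceeds k standard deviations.

    tk_values maps a six-nucleotide codon pair to its translational
    kinetics value in the host organism (hypothetical inputs here).
    k is the multiplier named in the text: 5, 3, 2.5 or 2.
    """
    # Standard deviation over all supplied values (population form is an assumption).
    sd = pstdev(list(tk_values.values()))
    return {pair for pair, value in tk_values.items() if value > k * sd}


# Hypothetical values for illustration only; not data from the patent.
example = {
    "GCCAAG": 6.1,
    "GCTAAG": 0.4,
    "TTGAAG": -0.2,
    "AAGAAG": 1.1,
    "AAAAAG": -1.3,
}
flagged = highly_overrepresented(example, k=2.0)
print(flagged)
```

With these made-up numbers only one pair clears the 2x threshold, and none clears the 2.5x threshold, which shows how the choice of multiplier narrows the set of pairs selected for replacement.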
  • L-xylulose reductase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing arabinose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: arabinose dehydrogenase, L-arabinitol 4-dehydrogenase, and L-xylulose reductase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the L-xylulose reductase retains at least 75% of the enzymatic activity of wild-type LXR (SEQ ID NO: 146) under normal physiological conditions.
  • L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 10-261 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 10-261 when expressed in the native organism.
  • no replacement codon encoding amino acids 10-261 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair AAGACG when expressed in the native organism.
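The z score limits in the preceding embodiments amount to simple ratio checks, sketched below. This is an illustrative reading only: it assumes positive z scores (a percentage of a negative z score is ambiguous), and the function names and every numeric value are hypothetical.

```python
from statistics import mean


def within_ratio(replacement_z, wild_type_z, max_ratio=1.5):
    """True if the replacement z score is no more than max_ratio times
    the wild-type z score (1.5 corresponds to the 150% limit)."""
    return replacement_z <= max_ratio * wild_type_z


def exceeds_top5_cap(replacement_zs, wild_type_zs, cap=1.0):
    """Return the replacement z scores above cap times the mean of the
    five highest wild-type z scores (cap=1.0 is the 100% limit)."""
    top5 = sorted(wild_type_zs, reverse=True)[:5]
    limit = cap * mean(top5)
    return [z for z in replacement_zs if z > limit]


# Hypothetical z scores for illustration only.
wild_type_zs = [4.0, 3.5, 3.0, 2.5, 2.0, 0.5, -1.0]
replacement_zs = [1.2, 3.4, -0.6]

print(within_ratio(2.0, 4.0))
print(exceeds_top5_cap(replacement_zs, wild_type_zs, cap=1.0))
```

The text also allows the median in place of the mean; swapping `mean` for `statistics.median` gives that variant.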
  • L-xylulose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-10 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-10 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-10 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GCCAAC when expressed in the native organism.
  • a xylose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GAGTTT (nucleotides 262 - 267); TTTGCC (nucleotides 130 - 135); GTGGAA (nucleotides 943 - 948); GCCATT (nucleotides 856 - 861 ); CAGTTT (nucleotides 766 - 771 ); CAAAGT (nucleotides 1033 - 1038); GGCCAA (nucleotides 1201 - 1206); TTTTTC (nucleotides 265 - 270).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GAGTTT (nucleotides 262 - 267) replaced with GAGTTC; TTTGCC (nucleotides 130 - 135) replaced with TTTGCT; GTGGAA (nucleotides 943 - 948) replaced with GTTGAA; GCCATT (nucleotides 856 - 861) replaced with GCTATA; CAGTTT (nucleotides 766 - 771) replaced with CAATTT; CAAAGT (nucleotides 1033 - 1038) replaced with CAATCT; GGCCAA (nucleotides 1201 - 1206) replaced with GGTCAA; TTTTTC (nucleotides 265 - 270) replaced with
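The nucleotide ranges quoted above are 1-based, inclusive, six-nucleotide windows aligned to the reading frame. The sketch below maps such a range to codon numbers and applies a single replacement to a sequence. It is an illustration under stated assumptions: the helper names are hypothetical, and the toy sequence is invented and is not SEQ ID NO: 169.

```python
def codon_pair_indices(start_nt, end_nt):
    """Map a 1-based inclusive nucleotide range to 1-based codon numbers."""
    if end_nt - start_nt + 1 != 6:
        raise ValueError("a codon pair spans exactly six nucleotides")
    if (start_nt - 1) % 3 != 0:
        raise ValueError("range does not start on a codon boundary")
    first = (start_nt - 1) // 3 + 1
    return first, first + 1


def replace_pair(seq, start_nt, end_nt, wild_type, replacement):
    """Swap one codon pair, refusing if the expected wild-type hexamer
    is not found at the stated position."""
    codon_pair_indices(start_nt, end_nt)  # validates length and frame
    if len(replacement) != 6:
        raise ValueError("replacement must be six nucleotides")
    if seq[start_nt - 1:end_nt] != wild_type:
        raise ValueError("wild-type codon pair not found at this position")
    return seq[:start_nt - 1] + replacement + seq[end_nt:]


# Ranges from the list above: the two pairs share codon 89.
print(codon_pair_indices(262, 267))
print(codon_pair_indices(265, 270))

# Invented toy sequence (ATG GAG TTT TTC TAA), for illustration only.
toy = "ATGGAGTTTTTCTAA"
edited = replace_pair(toy, 4, 9, "GAGTTT", "GAGTTC")
print(edited)
```

Because GAGTTT (nucleotides 262 - 267) and TTTTTC (nucleotides 265 - 270) overlap by one codon, replacing one changes the hexamer seen by the other; this sketch handles one replacement at a time and does not try to reconcile overlapping pairs.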
  • a xylose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTGGCG (nucleotides 226 - 231); CTGGCG (nucleotides 1093 - 1098); CTGGTG (nucleotides 94 - 99); CTGGTG (nucleotides 958 - 963); GAAGAG (nucleotides 115 - 120); GAAGAG (nucleotides 391 - 396); GAAGAG (nucleotides 946 - 951); CTGGCA (nucleotides 376 - 381); CTGGCA (nucleotides 820 - 825); CTGGCA (nucleotides 1213 - 1218); TTTGCC (nucleotides 130 - 135); ACGCTG (nucleotides 586 - 591); ACGCTG (nucleotides 817 - 822); AAAGAG (nucleotides 337
  • GCGGCA nucleotides 496 - 501
  • GTGATG nucleotides 961 - 966
  • GCGCTG nucleotides 955 - 960
  • GCGCTG nucleotides 1096 - 1101.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: CTGGCG (nucleotides 226 - 231) replaced with TTGGCT; CTGGCG (nucleotides 1093 - 1098) replaced with TTGGCA; CTGGTG (nucleotides 94 - 99) replaced with TTGGTT; CTGGTG (nucleotides 958 - 963) replaced with TTGGTT; GAAGAG (nucleotides 115 - 120) replaced with GAGGAA; GAAGAG (nucleotides 391 - 396) replaced with GAAGAA; GAAGAG (nucleotides 946 - 951) replaced with GAAGAA; CTGGCA (nucleotides 376 - 381) replaced with TTAGCT; CTGGCA (nucleotides 820 - 825) replaced with TTGGCT; CTGGCA (nucleotides 1213 - 1218) replaced with TTGG
  • the nucleotide sequence is optimized for expression in E. coli.
  • a xylose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GAGTTT (nucleotides 262 - 267); TTTGCC (nucleotides 130 - 135); AAACTG (nucleotides 790 - 795); GCCAAA (nucleotides 1018 - 1023); GCCAAA (nucleotides 1225 - 1230); CTGAAA (nucleotides 760 - 765); CTGAAA (nucleotides 1099 - 1104); CTGAAA (nucleotides 1195 - 1200); GACGAA (nucleotides 88 - 93); AAACAG (nucleotides 763 - 768); GGCCAA (nucleotides 1201 - 1206); CTGGTA (nucleotides 1294 - 1299); TCGTTA (nucleotides 331 - 336); TTTGAC (n
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GAGTTT (nucleotides 262 - 267) replaced with GAGTTC; TTTGCC (nucleotides 130 - 135) replaced with TTTGCT; AAACTG (nucleotides 790 - 795) replaced with AAATTA; GCCAAA (nucleotides 1018 - 1023) replaced with GCTAAA; GCCAAA (nucleotides 1225 - 1230) replaced with GCTAAA; CTGAAA (nucleotides 760 - 765) replaced with CTAAAA; CTGAAA (nucleotides 1099 - 1104) replaced with TTAAAA; CTGAAA (nucleotides 1195 - 1
  • a xylose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTTGCC (nucleotides 130 - 135 ); GAGTTT (nucleotides 262 - 267 ); TCGTTA (nucleotides 331 - 336 ); CAGTTT (nucleotides 766 - 771 ); TTCCAT (nucleotides 835 - 840 ); GCCATT (nucleotides 856 - 861 ); GGCCAA (nucleotides 1201 - 1206 ).
  • TTTGCC (nucleotides 130 - 135) replaced with TTCGCT; GAGTTT (nucleotides 262 - 267) replaced with GAATTT; TCGTTA (nucleotides 331 - 336) replaced with AGTTTA; CAGTTT (nucleotides 766 - 771) replaced with CAATTC; TTCCAT (nucleotides 835 - 840) replaced with TTCCAC; GCCATT (nucleotides 856 - 861) replaced with GCTATT; GGCCAA (nucleotides 1201 - 1206) replaced with GGTCAA.
  • the nucleotides 130 - 135 replaced with TTCGCT; GAGTTT (nucleotides 262 - 267) replaced with GAATTT; TCGTTA (nucleotides 331 - 336) replaced with AGTTTA; CAGTTT (nucleotides 766
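Whether a replacement codon pair encodes identical amino acids can be checked mechanically. The sketch below does this for the seven replacements listed above, assuming the standard genetic code (NCBI translation table 1); the helper names are hypothetical, and the check does not assess conservative amino acid substitutions.

```python
from itertools import product

# Standard genetic code in TCAG order (an assumption; some organisms differ).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_pair(pair):
    """Translate a six-nucleotide codon pair into two amino acid letters."""
    return CODON_TABLE[pair[:3]] + CODON_TABLE[pair[3:]]


def is_silent(wild_type, replacement):
    """True if both codon pairs encode the same two amino acids."""
    return translate_pair(wild_type) == translate_pair(replacement)


# (nucleotide range, wild-type pair, replacement pair) from the list above.
REPLACEMENTS = [
    ((130, 135), "TTTGCC", "TTCGCT"),
    ((262, 267), "GAGTTT", "GAATTT"),
    ((331, 336), "TCGTTA", "AGTTTA"),
    ((766, 771), "CAGTTT", "CAATTC"),
    ((835, 840), "TTCCAT", "TTCCAC"),
    ((856, 861), "GCCATT", "GCTATT"),
    ((1201, 1206), "GGCCAA", "GGTCAA"),
]

for span, wt, new in REPLACEMENTS:
    print(span, wt, "->", new, translate_pair(wt), is_silent(wt, new))
```

Under the standard code, every replacement in this list leaves the encoded dipeptide unchanged.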
  • a xylose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GCCTAT (nucleotides 7 - 12); CTCGAT (nucleotides 22 - 27); GAAGGC (nucleotides 40 - 45); ATCAAT (nucleotides 346 - 351); AAGCTG (nucleotides 406 - 411); CTGTTA (nucleotides 589 - 594); GATGCC (nucleotides 736 - 741); GATGCC (nucleotides 1015 - 1020).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GCCTAT (nucleotides 7 - 12) replaced with GCTTAT; CTCGAT (nucleotides 22 - 27) replaced with TTGGAT; GAAGGC (nucleotides 40 - 45) replaced with GAAGGT; ATCAAT (nucleotides 346 - 351) replaced with ATTAAT; AAGCTG (nucleotides 406 - 411) replaced with AAATTG; CTGTTA (nucleotides 589 - 594) replaced with TTGTTG; GATGCC (nucleotides 736 - 741) replaced with GACGCC; GATGCC (nucleotides 1015 -
  • a xylose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • a xylose isomerase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose isomerase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the xylose isomerase retains at least 75% of the enzymatic activity of wild-type XylA (SEQ ID NO: 170) under normal physiological conditions.
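The 75% amino acid sequence identity threshold that recurs in these embodiments can be illustrated with a simple calculation over two already-aligned protein sequences. This is only a sketch: the text here does not say how sequences are aligned or which denominator is used, so the convention below (matches divided by aligned columns, with "-" as the gap symbol) is an assumption, and the short peptide strings are invented.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns; '-' marks a gap.

    Assumed convention: identical residues divided by the number of
    alignment columns that are not gaps in both sequences.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Invented toy peptides, for illustration only.
identity = percent_identity("MKTAYIAK", "MKTGYIAR")
print(identity, identity >= 75.0)
```

Other conventions (for example, dividing by the length of the shorter or the reference sequence) give different percentages for the same alignment, so the convention matters near the threshold.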
  • a xylose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 76-286 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 76-286 when expressed in the native organism.
  • no replacement codon encoding amino acids 76-286 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GAAGAG when expressed in the native organism.
  • a xylose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-76 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-76 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-76 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-76 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair CTGGTG when expressed in the native organism.
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 148 - 153); ATCAAC (nucleotides 268 - 273); ATCAAG (nucleotides 598 - 603); CTCGGT (nucleotides 1111 - 1116); GGTATT (nucleotides 1114 - 1119); GGATTT (nucleotides 1489 - 1494).
  • TTGAAA nucleotides 148 - 153
  • ATCAAC nucleotides 268 - 273
  • ATCAAG nucleotides 598 - 603
  • CTCGGT nucleotides 1111 - 1116
  • GGTATT nucleotides 1114 - 1119
  • GGATTT nucleotides 1489 - 1494
  • TTGAAA nucleotides 148 - 153
  • ATCAAC nucleotides 268 - 273
  • ATTAAT ATCAAG
  • ATCAAG nucleotides 598 - 603
  • CTCGGT nucleotides 1111 - 1116
  • GGTATT nucleotides 1114 - 1119
  • GGAATT nucleotides 1489 - 1494
  • the nucleotide sequence is optimized for expression in S. cerevisiae.
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTCGAC (nucleotides 142 - 147); ATCCTC (nucleotides 226 - 231); ATCCTC (nucleotides 640 - 645); GACTGG (nucleotides 1081 - 1086); GTGGTG (nucleotides 1180 - 1185); GTGGTG (nucleotides 1096 - 1101); TTGCTG (nucleotides 1093 - 1098); CTCGGC (nucleotides 1327 - 1332); CTCGGC (nucleotides 922 - 927); CTGGAA (nucleotides 229 - 234); CTGGAA (nucleotides 649 - 654); CTGGAA (nucleotides 298 - 303); AGCCAG (nucleotides 1039 - 1044); ATTGCC (nucleotides 1039
  • GCGCTG (nucleotides 1192 - 1197); GCGCTG (nucleotides 1111 - 1116); GCGCTG (nucleotides 958 - 963); GCGCTG (nucleotides 109 - 114); CTCGAC (nucleotides 328 - 333); ATCCTC (nucleotides 682 - 687); ATCCTC (nucleotides 1279 - 1284); GACTGG (nucleotides 1366 - 1371); GTGGTG (nucleotides 1462 - 1467).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: CTCGAC (nucleotides 142 - 147) replaced with TTAGTT; ATCCTC (nucleotides 226 - 231) replaced with TTAGTT; ATCCTC (nucleotides 640 - 645) replaced with TTGGTT; GACTGG (nucleotides 1081 - 1086) replaced with GAAGAA; GTGGTG (nucleotides 1180 - 1185) replaced with GCTTCT; GTGGTG (nucleotides 1096 - 1101) replaced with TTGGAT; TTGCTG (nucleotides 1093 - 1098) replaced with ATTTTG; CTCGGC (nucleotides 1327 -
  • GCGCTG (nucleotides 496 - 501) replaced with CAAGCA; GCGCTG (nucleotides 1192 - 1197) replaced with GATTTG; GCGCTG (nucleotides 1111 - 1116) replaced with TTGGGA; GCGCTG (nucleotides 958 - 963) replaced with GTAATG; GCGCTG (nucleotides 109 - 114) replaced with GCTTTA; CTCGAC (nucleotides 328 - 333) replaced with GCTTTG; ATCCTC (nucleotides 682 - 687) replaced with GCTTTG; ATCCTC (nucleotides 1279 - 1284) replaced with GCATTG; GACTGG (nucleotides 1366 - 1371) replaced with GCTTTA; GTGGTG (nucleotides 1462 - 1467) replaced with GCTTTG
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GACGAT (nucleotides 208 - 213); GACGAT (nucleotides 1129 - 1134); ATCAAG (nucleotides 598 - 603); AAACTG (nucleotides 127 - 132); AAACTG (nucleotides 139 - 144); AAACTG (nucleotides 1261 - 1266); TTGAAA (nucleotides 148 - 153); CTTCCA (nucleotides 862 - 867); TTCAAC (nucleotides 319 - 324); ATCAAC (nucleotides 268 - 273); GGTATT (nucleotides 1114 - 1119); GCCAAA (nucleotides 256 - 261); CTGAAA (nucleotides 526 - 531); CTGAAA (n
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GACGAT (nucleotides 208 - 213) replaced with GATGAT; GACGAT (nucleotides 1129 - 1134) replaced with GATGAT; ATCAAG (nucleotides 598 - 603) replaced with ATAAAA; AAACTG (nucleotides 127 - 132) replaced with AAATTG; AAACTG (nucleotides 139 - 144) replaced with AAATTA; AAACTG (nucleotides 1261 - 1266) replaced with AAATTG; TTGAAA (nucleotides 148 - 153) replaced with TTAAAA; CTTCCA (nucleotides 862 - 867) replaced with TT
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTTGTC (nucleotides 31 - 36); GTCATT (nucleotides 34 - 39); TTGAAA (nucleotides 148 - 153); GACGAT (nucleotides 208 - 213); CAGCAG (nucleotides 892 - 897); GAGAAA (nucleotides 1018 - 1023); GAGAAA (nucleotides 1084 - 1089); GACGTT (nucleotides 1099 - 1104); GGTATT (nucleotides 1114 - 1119); GACGAT (nucleotides 1129 - 1134); GTGAAA (nucleotides 1237 - 1242); GCGTTT (nucleotides 1450 - 1455).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TTTGTC (nucleotides 31 - 36 ) replaced with TTCGTT; GTCATT (nucleotides 34 - 39 ) replaced with GTTATT; TTGAAA (nucleotides 148 - 153 ) replaced with TTAAAG; GACGAT (nucleotides 208 - 213 ) replaced with GATGAT; CAGCAG (nucleotides 892 - 897 ) replaced with CAACAA; GAGAAA (nucleotides 1018 - 1023 ) replaced with GAAAAA; GAGAAA (nucleotides 1084 - 1089 ) replaced with GAAAAA; GACGTT (nucleotides 1099 ).
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GCTATT (nucleotides 184 - 189); GACAGT (nucleotides 340 - 345); GCGGTT (nucleotides 499 - 504); GCGGTT (nucleotides 628 - 633); GTCGAT (nucleotides 688 - 693); CAGCTT (nucleotides 859 - 864); GAAGGC (nucleotides 916 - 921); ACCTAT (nucleotides 1006 - 1011); GGTATT (nucleotides 1114 - 1119); AAAGAC (nucleotides 1456 - 1461).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GCTATT (nucleotides 184 - 189) replaced with GCCATT; GACAGT (nucleotides 340 - 345) replaced with GACTCC; GCGGTT (nucleotides 499 - 504) replaced with GCCGTT; GCGGTT (nucleotides 628 - 633) replaced with GCCGTC; GTCGAT (nucleotides 688 - 693) replaced with GTTGAT; CAGCTT (nucleotides 859 - 864) replaced with CAGTTG; GAAGGC (nucleotides 916 - 921) replaced with GAGGGT; ACCTAT (nucleo
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • L-arabinose isomerase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis and Schizosaccharomyces pombe.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing arabinose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the L-arabinose isomerase retains at least 75% of the enzymatic activity of wild-type AraA (SEQ ID NO: 194) under normal physiological conditions.
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 8-472 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 8-472 when expressed in the native organism.
  • no replacement codon encoding amino acids 8-472 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair CTGGTG when expressed in the native organism.
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-8 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-8 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1 -5 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GAAGTG when expressed in the native organism.
  • a L-ribulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 562 - 567); GGTATT (nucleotides 445 - 450); GGTATT (nucleotides 943 - 948); GAGTTT (nucleotides 319 - 324); GGATTT (nucleotides 979 - 984); TTTGCC (nucleotides 322 - 327); GATATC (nucleotides 1018 - 1023); CTTTAT (nucleotides 1603 - 1608); GATATT (nucleotides 586 - 591); GATATT (nucleotides 736 - 741); GGCCAA (nucleotides 1000 - 1005).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 562 - 567) replaced with TTGAGT; GGTATT (nucleotides 445 - 450) replaced with GGAATT; GGTATT (nucleotides 943 - 948) replaced with GGAATT; GAGTTT (nucleotides 319 - 324) replaced with GAATTT; GGATTT (nucleotides 979 - 984) replaced with GGATTT; TTTGCC (nucleotides 322 - 327) replaced with TTTGCA; GATATC (nucleotides 1018 - 1023) replaced with GACATT; CTTTAT (nucleotides 1603 - 1608) replaced with
  • a L-ribulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTGGCG (nucleotides 304 - 309); GAAGAG (nucleotides 73 - 78); GAAGAG (nucleotides 385 - 390); GCCAGC (nucleotides 64 - 69); GCCAGC (nucleotides 1105 - 1110); CTTTCC (nucleotides 562 - 567); CTCGAC (nucleotides 1183 - 1188); TTTGCC (nucleotides 322 - 327); GGGCAA (nucleotides 118 - 123); ATCCTC (nucleotides 685 - 690); GACTGG (nucleotides 544 - 549); GACTGG (nucleotides 1186 - 1191); GCCAGT (nucleotides 658 - 663); GCGCTG (nucleotides 1129 - 1134); GCGCTG (nucleotides 1369 - 1374); ATCGCC (nucleotides 79 - 84); ATCGCC (nucleotides 1348 - 1353).
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: CTGGCG (nucleotides 304 - 309) replaced with CTGGCT; GAAGAG (nucleotides 73 - 78) replaced with GAAGAA; GAAGAG (nucleotides 385 - 390) replaced with GAAGAA; GCCAGC (nucleotides 64 - 69) replaced with GCGTCT; GCCAGC (nucleotides 1105 - 1110) replaced with GCGTCT; CTTTCC (nucleotides 562 - 567) replaced with CTGTCT; CTCGAC (nucleotides 1183 - 1188) replaced with CTGGAT; TTTGCC (nucleotides 322 - 327) replaced with TTTGCG; GGGCAA (nucleotides 118 - 123) replaced with GGTCAG; ATCCTC (nucleotides 685 - 690) replaced
  • a L-ribulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GAGTTT (nucleotides 319 - 324); GATATC (nucleotides 1018 - 1023); GATATT (nucleotides 586 - 591); GATATT (nucleotides 736 - 741); TTTGCC (nucleotides 322 - 327); CTTCCA (nucleotides 1651 - 1656); ATCAAC (nucleotides 1099 - 1104); GGTATT (nucleotides 445 - 450); GGTATT (nucleotides 943
  • GCCAAA nucleotides 1147 - 1152
  • CTGAAA nucleotides 193 - 198
  • CTGAAA nucleotides 1087 - 1092
  • CTGAAA nucleotides 1228 - 1233
  • AAACAG nucleotides 913 - 918
  • GGCCAA nucleotides 1000 - 1005
  • CTGGTA nucleotides 865
  • CTTTCC nucleotides 562 - 567
  • TTTGAC nucleotides 817 - 822.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: GAGTTT (nucleotides 319 - 324) replaced with GAATTT; GATATC (nucleotides 1018 - 1023) replaced with GACATC; GATATT (nucleotides 586 - 591) replaced with GACATC; GATATT (nucleotides 736 - 741) replaced with GACATC; TTTGCC (nucleotides 322 - 327) replaced with TTTGCG; CTTCCA (nucleotides 1651 - 1656) replaced with CTCCCG; ATCAAC (nucleotides 1099 - 1104) replaced with ATCAAC; GGTATT (nucleotides 445 - 450) replaced with GGTATC; GGTATT (nucleotides 943 - 948) replaced with GGTATC; GCCAAA (nucleotides 1147 - 1152) replaced
  • a L-ribulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GAGTTT (nucleotides 319 - 324 ); TTTGCC (nucleotides 322 - 327 ); CTTTCC (nucleotides 562 - 567 ); GGTACC (nucleotides 568 - 573 ); GGCCAA (nucleotides 1000 - 1005 ); GATATC (nucleotides 1018 - 1023 ); TTTGCT (nucleotides 1486 - 1491 ). In some such nucleotide sequences, at least 3, or 4.
  • nucleotide sequence is optimized for expression in K. lactis.
  • L-ribulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTCGAT (nucleotides 19 - 24 ); GCTTTG (nucleotides 46 - 51 ); GATGCC (nucleotides 130 - 135 ); GACAGT (nucleotides 256 - 261 ); GCACCG (nucleotides 277 - 282 ); GATGCC (nucleotides 286 - 291 ); AAAGAC (nucleotides 358 - 363 ); GCGGTT (nucleotides 370 - 375 ); CGCTAT (nucleotides 433 - 438 ); GGTATT (nucleotides 445 - 450 ); GACAGC (nucleotides 499 - 504 ); TCCGGT (nucleotides 565 - 570 ); CGGGCA (nucleotides 931 - 936 ); CTCGGT
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: CTCGAT (nucleotides 19 - 24 ) replaced with TTGGAT; GCTTTG (nucleotides 46 - 51 ) replaced with GCCCTT; GATGCC (nucleotides 130 - 135 ) replaced with GATGCT; GACAGT (nucleotides 256 - 261 ) replaced with GATTCT; GCACCG (nucleotides 277 - 282 ) replaced with GCCCCG; GATGCC (nucleotides 286 - 291 ) replaced with GACGCC; AAAGAC (nucleotides 358 - 363 ) replaced with AAAGAT; GCGGTT (nucleotides 370
  • the nucleotide sequence is optimized for expression in Z. mobilis.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
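The definition of a highly-overrepresented codon pair given above is a threshold on the host's translational kinetics values. The sketch below illustrates that threshold under stated assumptions: the function name and the values are hypothetical, and the population standard deviation over all supplied pairs is used because the text does not say how the standard deviation is computed.

```python
from statistics import pstdev


def highly_overrepresented(tk_values, multiple=2.0):
    """Return the codon pairs whose value exceeds multiple * standard deviation.

    tk_values maps a six-nucleotide codon pair to its translational
    kinetics value in the host organism.
    """
    cutoff = multiple * pstdev(list(tk_values.values()))
    return {pair for pair, value in tk_values.items() if value > cutoff}


# Hypothetical translational kinetics values, for illustration only.
host_values = {"CTGGCG": 10.0, "GAAGAA": 0.0, "GGTATC": 0.0, "AAACAA": 0.0}

print(highly_overrepresented(host_values, multiple=2.0))
print(highly_overrepresented(host_values, multiple=3.0))
```

Passing `multiple` as 5, 3, 2.5 or 2 corresponds to the alternative thresholds recited in the embodiment.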
  • L-ribulokinase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing arabinose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase (AraA), L-ribulokinase (AraB), and L-ribulose-5-P 4-epimerase (AraD); wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the L-ribulokinase retains at least 75% of the enzymatic activity of wild-type AraB (SEQ ID NO: 218) under normal physiological conditions.
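The 75% amino acid sequence identity requirement recited above can be illustrated with a simple identity count. This is only a sketch: it assumes the two protein sequences have already been aligned to equal length with "-" marking gaps, and it divides matches by the alignment length, which is one of several conventions and is not specified in this section. The function name and the short peptide strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical aligned fragments, for illustration only.
wild_type = "MKTAYLLE"
variant = "MKTA-LLQ"

identity = percent_identity(wild_type, variant)
print(identity, identity >= 75.0)
```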
  • a L-ribulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 59-549 of SEQ ID NO: 218 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 59-549 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 59-549 when expressed in the native organism.
  • no replacement codon encoding amino acids 59-549 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair CTGGCG when expressed in the native organism.
  • a L-ribulokinase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-59 of SEQ ID NO: 218 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-59 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-59 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-59 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GAAGAG when expressed in the native organism.
  • a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the 3 codon pairs to be replaced are selected from the following: AACGTC (nucleotides 82 - 87 ); ATCAAA (nucleotides 121 - 126 ); GGCCAG (nucleotides 322 - 327 ); GCAGAA (nucleotides 403 - 408 ); ATCAAC (nucleotides 409 - 414 ); AACGTC (nucleotides 439 - 444 ); GGTATC (nucleotides 469 - 474 ); CCGCAG (nucleotides 613 - 618 ).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: AACGTC (nucleotides 82 - 87) replaced with AATGTT; ATCAAA (nucleotides 121 - 126) replaced with ATTAAA; GGCCAG (nucleotides 322 - 327) replaced with GGTCAA; GCAGAA (nucleotides 403 - 408) replaced with GCTGAA; ATCAAC (nucleotides 409 - 414) replaced with ATTAAT; AACGTC (nucleotides 439 - 444) replaced with AATGTA; GGTATC (nucleotides 469 - 474) replaced with GGAATT; CCGCAG
  • a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTGGCG (nucleotides 40 - 45); GAAGAG (nucleotides 571 - 576); ACGCTG (nucleotides 637 - 642); GTCAGC (nucleotides 85 - 90); CTGGAA (nucleotides 568 - 573); ACGCCA (nucleotides 229 - 234); TTCCCG (nucleotides 259 - 264); GAAGTG (nucleotides 193 - 198); CAGGCG (nucleotides 316 - 321 ); GATCTC (nucleotides 10 - 15); GCGCTG (nucleotides 43 - 48).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: CTGGCG (nucleotides 40 - 45) replaced with TTGGCG; GAAGAG (nucleotides 571 - 576) replaced with GAAGAA; ACGCTG (nucleotides 637 - 642) replaced with ACATTG; GTCAGC (nucleotides 85 - 90) replaced with GTTTCA; CTGGAA (nucleotides 568 - 573) replaced with TTGGAA; ACGCCA (nucleotides 229 - 234) replaced with ACTCCA; TTCCCG (nucleotides 259 - 264) replaced with TTTCCA; GAAGTG (nucleotides 193 - 198) replaced with GAAGTT
  • L-ribulose-5-P 4-epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GACGAT (nucleotides 160 - 165); ATCAAC (nucleotides 409 - 414); ATCAAA (nucleotides 121 - 126); GGTATC (nucleotides 469 - 474); AAACAG (nucleotides 463 - 468).
  • At least 3 of the following codon pair replacements have been made: GACGAT (nucleotides 160 - 165) replaced with GATGAT; ATCAAC (nucleotides 409 - 414) replaced with ATTAAT; ATCAAA (nucleotides 121 - 126) replaced with ATTAAA; GGTATC (nucleotides 469 - 474) replaced with GGAATT; AAACAG (nucleotides 463 - 468) replaced with AAACAA.
  • the nucleotide sequence is optimized for expression in P. pastoris.
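The replacements listed above for P. pastoris are specified by a 1-based nucleotide range, a wild-type codon pair and a replacement pair. The sketch below shows one way such an edit could be applied and checked for silence using the standard genetic code. It is illustrative only: the helper names are hypothetical, the 18-nucleotide test sequence is a toy (SEQ ID NO: 241 is not reproduced here), so the coordinates used in the demonstration are not those of the embodiment, and the check accepts only synonymous changes although the embodiment also permits conservative substitutions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def replace_codon_pair(seq, start, wild_type, replacement):
    """Replace the six nucleotides beginning at 1-based position `start`."""
    index = start - 1
    if index % 3 != 0:
        raise ValueError("codon pair does not start on a codon boundary")
    if seq[index:index + 6] != wild_type:
        raise ValueError("wild-type codon pair not found at this position")
    if translate(wild_type) != translate(replacement):
        raise ValueError("replacement is not synonymous")
    return seq[:index] + replacement + seq[index + 6:]


# Wild-type / replacement pairs as listed in the embodiment above.
LISTED = [
    ("GACGAT", "GATGAT"), ("ATCAAC", "ATTAAT"), ("ATCAAA", "ATTAAA"),
    ("GGTATC", "GGAATT"), ("AAACAG", "AAACAA"),
]
print(all(translate(w) == translate(r) for w, r in LISTED))

# Toy sequence with hypothetical coordinates: ATG GAC GAT ATC AAA TAA.
toy = "ATGGACGATATCAAATAA"
edited = replace_codon_pair(toy, 4, "GACGAT", "GATGAT")
edited = replace_codon_pair(edited, 10, "ATCAAA", "ATTAAA")
print(edited, translate(toy) == translate(edited))
```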
  • a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: ATCAAA (nucleotides 121 - 126); GACGAT (nucleotides 160 - 165); TATTTC (nucleotides 361 - 366); ACCATT (nucleotides 373 - 378); GGTATC (nucleotides 469 - 474); TTTGCA (nucleotides 520 - 525).
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: ATCAAA (nucleotides 121 - 126) replaced with ATTAAA; GACGAT (nucleotides 160 - 165) replaced with GATGAT; TATTTC (nucleotides 361 - 366) replaced with TACTTC; ACCATT (nucleotides 373 - 378) replaced with ACAATT; GGTATC (nucleotides 469 - 474) replaced with GGAATT; TTTGCA (nucleotides 520 - 525) replaced with TTCGCG.
  • the nucleotide sequence is optimized for expression in K. lactis.
  • a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: ACATGG (nucleotides 73 - 78); GTCGAT (nucleotides 136 - 141); CTCTAT (nucleotides 247 - 252); GGTATC (nucleotides 469 - 474); GCATGG (nucleotides 523 - 528).
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: ACATGG (nucleotides 73 - 78) replaced with ACCTGG; GTCGAT (nucleotides 136 - 141) replaced with GTCGAC; CTCTAT (nucleotides 247 - 252) replaced with TTGTAT; GGTATC (nucleotides 469 - 474) replaced with GGCATT; GCATGG (nucleotides 523 - 528) replaced with GCTTGG.
  • the nucleotide sequence is optimized for expression in Z. mobilis.
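The embodiments identify codon pairs by nucleotide range, while other embodiments refer to the amino acid positions those pairs encode. The sketch below converts between the two for the Z. mobilis list above. It assumes, without confirmation from this section, that nucleotide 1 is the first base of the start codon so that codon n covers nucleotides 3n-2 to 3n; the function name is hypothetical.

```python
def codon_pair_to_residues(start_nt, end_nt):
    """Map a 1-based, inclusive nucleotide range to the two residue numbers it encodes."""
    if end_nt - start_nt != 5:
        raise ValueError("a codon pair spans exactly six nucleotides")
    if (start_nt - 1) % 3 != 0:
        raise ValueError("range does not start on a codon boundary")
    first_residue = (start_nt - 1) // 3 + 1
    return first_residue, first_residue + 1


# Nucleotide ranges as listed in the embodiment above.
Z_MOBILIS_PAIRS = {
    "ACATGG": (73, 78),
    "GTCGAT": (136, 141),
    "CTCTAT": (247, 252),
    "GGTATC": (469, 474),
    "GCATGG": (523, 528),
}

for pair, (start, end) in Z_MOBILIS_PAIRS.items():
    print(pair, codon_pair_to_residues(start, end))
```

The same check is a quick way to spot extraction damage in a listed range, since a range that is not six nucleotides long or not in frame raises an error.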
  • L-ribulose-5-P 4-epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • L-ribulose-5-P 4-epimerase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing arabinose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the L-ribulose-5-P 4-epimerase retains at least 75% of the enzymatic activity of wild-type AraD (SEQ ID NO: 242) under normal physiological conditions.
  • a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 7-217 of SEQ ID NO: 242 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 7-217 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 7-217 when expressed in the native organism.
  • no replacement codon encoding amino acids 7-217 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair CTGGCG when expressed in the native organism.
  • a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-7 of SEQ ID NO: 242 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-7 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-7 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1 -7 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GATCTC when expressed in the native organism.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: ATCAAA (nucleotides 22 - 27); TTGAAC (nucleotides 286 - 291); TTGAAC (nucleotides 700 - 705); ATCAAG (nucleotides 115 - 120); ATCAAG (nucleotides 553 - 558); ATCAAG (nucleotides 733 - 738); GCCAAG (nucleotides 748 - 753); GCCAAG (nucleotides 901 - 906). In some such nucleotide sequences, at least 3, or 4, or 5.
  • ATCAAA nucleotides 22 - 27
  • TTGAAC nucleotides 286 - 291
  • TTGAAC nucleotides 700 - 705 replaced with TTAAAT
  • ATCAAG nucleotides 115 - 120
  • ATCAAG nucleotides 553 - 558
  • ATTAAA nucleotides 733 - 738
  • ATTAAA nucleotides 748 - 753 replaced with GCAAAA
  • GCCAAG nucleotides 901 - 906 replaced with GCTAAA.
  • the nucleotide sequence is optimized for
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GAAGAG (nucleotides 220 - 225); TTCCTC (nucleotides 229 - 234); ATTGCC (nucleotides 349 - 354); ATCGCC (nucleotides 898 - 903); GACTGG (nucleotides 940 - 945).
  • At least 3 of the following codon pair replacements have been made: GAAGAG (nucleotides 220 - 225) replaced with GAAGAA; TTCCTC (nucleotides 229 - 234) replaced with TTCCTG; ATTGCC (nucleotides 349 - 354) replaced with ATCGCG; ATCGCC (nucleotides 898 - 903) replaced with ATCGCG; GACTGG (nucleotides 940 - 945) replaced with GATTGG.
  • the nucleotide sequence is optimized for expression in E. coli.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TCCAAG (nucleotides 238 - 243); ATCAAG (nucleotides 115 - 120); ATCAAG (nucleotides 553 - 558); ATCAAG (nucleotides 733 - 738); TTCAAG (nucleotides 355 - 360); TTCAAC (nucleotides 859 - 864); TTCAAC (nucleotides 925 - 930); ATCAAA (nucleotides 22 - 27); GTCAAG (nucleotides 184 - 189); GTCAAG (nucleotides 211 - 216); GACGAA (nucleotides 199 - 204); GGTATC (nucleotides 802 - 807); TTGAAC (nucleotides 286 - 291); TTGAAC (nucleotides
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TCCAAG (nucleotides 238 - 243) replaced with TCTAAA; ATCAAG (nucleotides 115 - 120) replaced with ATTAAA; ATCAAG (nucleotides 553 - 558) replaced with ATTAAG; ATCAAG (nucleotides 733 - 738) replaced with ATTAAG; TTCAAG (nucleotides 355 - 360) replaced with TTTAAA; TTCAAC (nucleotides 859 - 864) replaced with TTTAAT; TTCAAC (nucleotides 925 - 930) replaced with TTTAAT; ATCAAA (nucleotides 22 - 27) replaced
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: ATCAAA (nucleotides 22 - 27 ); TTGAAC (nucleotides 286
  • TTCCCA nucleotides 343 - 348
  • TTCCCA nucleotides 511 - 516
  • TTGAAC nucleotides 700 - 705
  • GCCAAG nucleotides 748 - 753
  • GGTATC nucleotides 802
  • GCCAAG nucleotides 901 - 906.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • At least 3 of the following codon pair replacements have been made: ATCAAA (nucleotides 22 - 27) replaced with ATAAAA; TTGAAC (nucleotides 286 - 291) replaced with TTAAAT; TTCCCA (nucleotides 343 - 348) replaced with TTCCCT; TTCCCA (nucleotides 511 - 516) replaced with TTCCCT; TTGAAC (nucleotides 700 - 705) replaced with TTAAAC; GCCAAG (nucleotides 748 - 753) replaced with GCTAAA; GGTATC (nucleotides 802 - 807) replaced with GGAATT; GCCAAG (nucleotides 901 - 906) replaced with GCTAAA.
  • the nucleotide sequence is optimized for expression in K. lactis.
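The replacements listed for this embodiment are stated to encode identical amino acids or conservative substitutions. A minimal sketch of how that property could be checked is given below. It assumes the standard genetic code; the function names `translate_pair` and `is_silent` are illustrative and do not come from the specification. The codon pairs are the eight replacements recited above.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the host and source organisms both use the standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_pair(pair):
    """Translate a six-nucleotide codon pair into two one-letter amino acids."""
    return CODON_TABLE[pair[:3]] + CODON_TABLE[pair[3:]]


def is_silent(original, replacement):
    """True when both codon pairs encode the same two amino acids."""
    return translate_pair(original) == translate_pair(replacement)


# (first nucleotide, original pair, replacement pair) as recited above.
REPLACEMENTS = [
    (22, "ATCAAA", "ATAAAA"),
    (286, "TTGAAC", "TTAAAT"),
    (343, "TTCCCA", "TTCCCT"),
    (511, "TTCCCA", "TTCCCT"),
    (700, "TTGAAC", "TTAAAC"),
    (748, "GCCAAG", "GCTAAA"),
    (802, "GGTATC", "GGAATT"),
    (901, "GCCAAG", "GCTAAA"),
]

for start, original, replacement in REPLACEMENTS:
    print(start, original, "->", replacement, is_silent(original, replacement))
```

Under the standard code each of these eight replacements is synonymous, so the check covers only the "identical amino acids" branch; conservative substitutions would need a separate similarity rule.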
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GCCGGT (nucleotides 91 - 96 ); GCCGGT (nucleotides 121 - 126 ); GCCTTG (nucleotides 283 - 288 ); GCCGGT (nucleotides 478 - 483 ); GCTTTG (nucleotides 520 - 525 ); GCCGGT (nucleotides 628 - 633 ); GCTTTG (nucleotides 697 - 702 ); GCTATT (nucleotides 739 - 744 ); GCCAAG (nucleotides 748 - 753 ); GGTATC (nucleotides 802 - 807 ); GCCAAG (nucleotides 901 - 906 ).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GCCGGT (nucleotides 91 - 96) replaced with GCGGGT; GCCGGT (nucleotides 121 - 126) replaced with GCTGGT; GCCTTG (nucleotides 283 - 288) replaced with GCTCTT; GCCGGT (nucleotides 478 - 483) replaced with GCTGGC; GCTTTG (nucleotides 520 - 525) replaced with GCTCTT; GCCGGT (nucleotides 628 - 633) replaced with GCTGGA; GCTTTG (nucleotides 697 - 702) replaced with GCTCTT
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
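The threshold test for a highly-overrepresented codon pair can be sketched as follows. This is an illustration under stated assumptions, not the applicants' implementation: the translational kinetics values in `example_values` are invented for the example, the function name `highly_overrepresented` is hypothetical, and the population standard deviation is used because the text does not say which estimator applies.

```python
from statistics import pstdev


def highly_overrepresented(values, k=2.0):
    """Return codon pairs whose value exceeds k times the standard deviation.

    values maps a codon pair to its translational kinetics value for one
    host organism; k is one of the multipliers recited above (5, 3, 2.5, 2).
    """
    sd = pstdev(list(values.values()))  # assumption: population std. dev.
    return sorted(pair for pair, v in values.items() if v > k * sd)


# Hypothetical values for illustration only; not data from the specification.
example_values = {
    "GCCGGT": 9.0,
    "GCTGGT": -1.0,
    "ATCAAA": 1.0,
    "ATAAAA": -1.0,
    "TTGAAC": 0.0,
}

print(highly_overrepresented(example_values, k=2.0))
print(highly_overrepresented(example_values, k=2.5))
```

A stricter multiplier flags fewer pairs, which is why the embodiments recite a descending series of thresholds.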
  • a xylose reductase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing xylose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the xylose reductase retains at least 75% of the enzymatic activity of wild-type Xyr (SEQ ID NO: 266) under normal physiological conditions.
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 9-306 of SEQ ID NO: 266 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 9-306 of SEQ ID NO: 266 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 9-306 when expressed in the native organism.
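The z-score ceiling described in the two preceding items can be illustrated with a short sketch. The z scores below are invented for the example and the function names are hypothetical; the sketch only shows the arithmetic of taking the mean or median of the five highest wild-type z scores and comparing each replacement against a stated percentage of that value.

```python
from heapq import nlargest
from statistics import mean, median


def cap_from_wild_type(wild_z, percent, use_median=False):
    """Ceiling = percent of the mean (or median) of the five highest wild-type z scores."""
    top_five = nlargest(5, wild_z)
    center = median(top_five) if use_median else mean(top_five)
    return center * percent / 100.0


def replacements_within_cap(replacement_z, wild_z, percent=150.0, use_median=False):
    """True when no replacement z score is more than the computed ceiling."""
    cap = cap_from_wild_type(wild_z, percent, use_median)
    return all(z <= cap for z in replacement_z)


# Hypothetical z scores for illustration only.
wild_type_z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.5]

print(cap_from_wild_type(wild_type_z, 150.0))
print(replacements_within_cap([5.9, 1.0], wild_type_z, percent=150.0))
print(replacements_within_cap([6.1], wild_type_z, percent=150.0))
```

The `percent` argument corresponds to the recited alternatives (400%, 300%, 200%, 150% or 100%); choosing between mean and median matters only when the five highest values are skewed.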
  • a xylose reductase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-9 of SEQ ID NO: 266 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-9 of SEQ ID NO: 266 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-9 when expressed in the native organism.
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 274 - 279); GATATC (nucleotides 325 - 330); CTTTAT (nucleotides 682 - 687); GGGTTT (nucleotides 901 - 906); TTTGCC (nucleotides 904 - 909); GCCATT (nucleotides 1159 - 1164); GATATT (nucleotides 1180 - 1185); TTGAAA (nucleotides 1291 - 1296); GAAAGT (nucleotides 1402 - 1407).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 274 - 279) replaced with TTAAGT; GATATC (nucleotides 325 - 330) replaced with GACATT; CTTTAT (nucleotides 682 - 687) replaced with CTATAT; GGGTTT (nucleotides 901 - 906) replaced with GGTTTT; TTTGCC (nucleotides 904 - 909) replaced with TTTGCA; GCCATT (nucleotides 1159 - 1164) replaced with GCTATT; GATATT (nucleotides 1180 - 1185) replaced with GATATA; TTGAAA (
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TTCTGG (nucleotides 25 - 30 ); AGCCAG (nucleotides 43 - 48 ); GAAGAG (nucleotides 61 - 66 ); ACGCTG (nucleotides 67 - 72 ); CTGGAA (nucleotides 70 - 75 ); CTTTCC (nucleotides 274 - 279 ); ATTGCC (nucleotides 436 - 441 ); GAAGTG (nucleotides 460 - 465 ); GCCAGA (nucleotides 532 - 537 ); GCGGTA (nucleotides 562 - 567 ); GATCTC (nucleotides 634 - 639 ); GAAGTG (nucleotides 643 - 648 ); GTGATG (nucleotides 646 - 651 ); CAGGCG
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: TTCTGG (nucleotides 25 - 30 ) replaced with TTTTGG; AGCCAG (nucleotides 43 - 48 ) replaced with TCTCAG; GAAGAG (nucleotides 61 - 66 ) replaced with GAAGAA; ACGCTG (nucleotides 67 - 72 ) replaced with ACCCTC; CTGGAA (nucleotides 70 - 75 ) replaced with CTCGAA; CTTTCC (nucleotides 274 - 279 ) replaced with CTGAGC; ATTGCC (nucleotides 436 - 441 ) replaced with ATCGCG; GAAGTG (nucleotides 460 - 465
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 274 - 279 ); GATATC (nucleotides 325 - 330 ); ATCAAC (nucleotides 403 - 408 ); GACGAA (nucleotides 733
  • TCGTTT nucleotides 829 - 834
  • AAACAG nucleotides 853 - 858
  • GGGTTT nucleotides 901 - 906
  • TTTGCC nucleotides 904 - 909
  • GATATT nucleotides 1180
  • TTGAAA nucleotides 1291 - 1296
  • AAACTG nucleotides 1438 - 1443
  • CTGAAA nucleotides 1441 - 1446
  • CTTCAA nucleotides 1480 - 1485.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • CTTTCC (nucleotides 274 - 279) replaced with TTATCT; GATATC (nucleotides 325 - 330) replaced with GACATT; ATCAAC (nucleotides 403 - 408) replaced with ATTAAT; GACGAA (nucleotides 733 - 738) replaced with GATGAA; TCGTTT (nucleotides 829 - 834) replaced with TCTTTT; AAACAG (nucleotides 853 - 858) replaced with AAACAA; GGGTTT (nucleotides 901 - 906) replaced with GGATTC; TTTGCC (nucleotides 904 - 909) replaced with TTCGCT; GATATT (nucleotides 1180 - 1185) replaced with GATATA; TTGAAA
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 274 - 279); GATATC (nucleotides 325 - 330); GTGAAA (nucleotides 463 - 468); GGGTTT (nucleotides 901 - 906); TTTGCC (nucleotides 904 - 909); GCCATT (nucleotides 1159 - 1164); TTGAAA (nucleotides 1291 - 1296); AAATGG (nucleotides 1456 - 1461). In some such nucleotide sequences, at least 3, or 4, or 5.
  • CTTTCC nucleotides 274 - 279
  • GATATC nucleotides 325 - 330
  • GTGAAA nucleotides 463 - 468
  • GGGTTT nucleotides 901 - 906
  • TTTGCC nucleotides 904 - 909
  • GCCATT nucleotides 1159 - 1164
  • TTGAAA nucleotides 1291 - 1296
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTGTTA (nucleotides 184 - 189 ); ACATGG (nucleotides 229 - 234 ); GAAGGC (nucleotides 268 - 273 ); AACAGC (nucleotides 361
  • GCGGCT nucleotides 496 - 501
  • GTAACG nucleotides 565 - 570
  • ATCGGG nucleotides 628 - 633
  • CTTTAT nucleotides 682 - 687
  • GCTTTT nucleotides 790 - 795
  • GCCGGT nucleotides 907 - 912
  • GCTTTG nucleotides 1066
  • AAAGAC nucleotides 1237 - 1242
  • GCATGG nucleotides 1309 - 1314
  • CTTGAT nucleotides 1375 - 1380
  • CTTTAC nucleotides 1471 - 1476.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • CTGTTA nucleotides 184 - 189
  • ACATGG nucleotides 229 - 234
  • GAAGGC nucleotides 268 - 273
  • AACAGC nucleotides 361
  • nucleotide sequence is optimized for expression in Z. mobilis.
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • L-arabinose isomerase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing arabinose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the L-arabinose isomerase retains at least 75% of the enzymatic activity of wild-type AraA (SEQ ID NO: 290) under normal physiological conditions.
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 7-487 of SEQ ID NO: 290 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 7-487 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 7-487 when expressed in the native organism.
  • no replacement codon encoding amino acids 7-487 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GGCGGA when expressed in the native organism.
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-8 of SEQ ID NO: 290 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-8 when expressed in the native organism.
  • At least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair AAGGAT when expressed in the native organism.
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 274 - 279); CAGTTT (nucleotides 313 - 318); AATATT (nucleotides 361 - 366); ATCAAA (nucleotides 523 - 528); CTTTAT (nucleotides 703 - 708); GTGGAA (nucleotides 1204 - 1209).
  • CTTTCC nucleotides 274 - 279
  • CAGTTT nucleotides 313 - 318
  • AATATT nucleotides 361 - 366
  • ATCAAA nucleotides 523 - 528
  • CTTTAT nucleotides 703 - 708
  • GTGGAA nucleotides 1204 - 1209
  • At least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 274 - 279 ) replaced with TTGTCT; CAGTTT (nucleotides 313 - 318 ) replaced with CAATTT; AATATT (nucleotides 361 - 366 ) replaced with AACATT; ATCAAA (nucleotides 523 - 528 ) replaced with ATTAAG; CTTTAT (nucleotides 703 - 708 ) replaced with TTGTAT; GTGGAA (nucleotides 1204 - 1209 ) replaced with GTTGAA.
  • the nucleotide sequence is optimized for expression in S. cerevisiae.
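Each replacement in this embodiment is identified by a 1-based nucleotide range. A minimal sketch of applying one such replacement to a coding sequence follows. It assumes the reading frame begins at nucleotide 1, so every recited start position falls on a codon boundary. The sequence `TOY_SEQ` is a made-up 15-nucleotide example, not SEQ ID NO: 301, and the function name is hypothetical.

```python
def apply_codon_pair_replacement(seq, start, original, replacement):
    """Replace the codon pair beginning at 1-based position `start`.

    Raises ValueError if the pair is not six nucleotides long, if `start`
    is off the codon boundary, or if the sequence does not contain
    `original` at that position.
    """
    if len(original) != 6 or len(replacement) != 6:
        raise ValueError("a codon pair is six nucleotides long")
    if (start - 1) % 3 != 0:
        raise ValueError("start is not on a codon boundary")
    i = start - 1  # convert to 0-based index
    if seq[i:i + 6] != original:
        raise ValueError("expected %s at %d, found %s" % (original, start, seq[i:i + 6]))
    return seq[:i] + replacement + seq[i + 6:]


# Toy sequence for illustration: ATG GCT CTT TCC GAA (CTTTCC at 7 - 12).
TOY_SEQ = "ATGGCTCTTTCCGAA"

# Same original/replacement pair as the CTTTCC -> TTGTCT entry above,
# but at a toy position rather than nucleotides 274 - 279.
print(apply_codon_pair_replacement(TOY_SEQ, 7, "CTTTCC", "TTGTCT"))
```

Checking the original pair before substituting guards against off-by-one errors when positions are transcribed from a listing such as the one above.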
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: AGCCAG (nucleotides 43 - 48); GAAGAG (nucleotides 61 - 66); GCGGTA (nucleotides 67 - 72); GAAGAG (nucleotides 82 - 87); TCGCTG (nucleotides 163 - 168); GAAGAG (nucleotides 190 - 195); GAAGAG (nucleotides 208 - 213); CTTTCC (nucleotides 274 - 279); ATCGCC (nucleotides 436 - 441); GCCGGA (nucleotides 439 - 444); GCGGTA (nucleotides 562 - 567); GATCTC (nucleotides 634 - 639); GCGGCA (nucleotides 727 - 7
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: AGCCAG (nucleotides 43 - 48) replaced with TCTCAG; GAAGAG (nucleotides 61 - 66) replaced with GAAGAA; GCGGTA (nucleotides 67 - 72) replaced with GCTGTT; GAAGAG (nucleotides 82 - 87) replaced with GAAGAA; TCGCTG (nucleotides 163 - 168) replaced with TCTCTG; GAAGAG (nucleotides 190 - 195) replaced with GAAGAA; GAAGAG (nucleotides 208 - 213) replaced with GAAGAA; CTTTCC (nucleotides 2
  • nucleotide sequence is optimized for expression in E. coli.
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: TCCAAA (nucleotides 91 - 96); AAACTG (nucleotides 181 - 186); GACGAA (nucleotides 205 - 210); GCCAAA (nucleotides 253 - 258); CTTTCC (nucleotides 274 - 279); CAGTTT (nucleotides 313 - 318); AATATT (nucleotides 361 - 366); ATCAAA (nucleotides 523 - 528); GTCAAG (nucleotides 742
  • TTTGAC nucleotides 1126 - 1131
  • AAGTTT nucleotides 1474 - 1479.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • TCCAAA nucleotides 91 - 96
  • AAACTG nucleotides 181 - 186
  • GACGAA nucleotides 205 - 210
  • GCCAAA nucleotides 253 - 258
  • CTTTCC nucleotides 274 - 279
  • CAGTTT nucleotides 313 - 318
  • AATATT nucleotides 361 - 366
  • nucleotide sequence is optimized for expression in P. pastoris.
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GTGTTT (nucleotides 22 - 27); CTTTCC (nucleotides 274 - 279); CAGTTT (nucleotides 313 - 318); AAATGG (nucleotides 481 - 486); ATCAAA (nucleotides 523 - 528); GTGTTT (nucleotides 1123 - 1128); AAATGG (nucleotides 1444 - 1449).
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GTGTTT (nucleotides 22 - 27) replaced with GTTTTC; CTTTCC (nucleotides 274 - 279) replaced with TTGTCT; CAGTTT (nucleotides 313 - 318) replaced with CAATTC; AAATGG (nucleotides 481 - 486) replaced with AAGTGG; ATCAAA (nucleotides 523 - 528) replaced with ATTAAA; GTGTTT (nucleotides 1123
  • nucleotide sequence is optimized for expression in K. lactis.
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the at least 3 codon pairs to be replaced are selected from the following: GTCAGA (nucleotides 175 - 180); GCCGGA (nucleotides 439 - 444); CAGCTT (nucleotides 598 - 603); ATCAAT (nucleotides 649 - 654); CTTTAT (nucleotides 703 - 708); GAAGGC (nucleotides 718 - 723); GCAAGG (nucleotides 730 - 735); GCCTTT (nucleotides 805 - 810); CAGCTT (nucleotides 844 - 849); GAAGGC (nucleotides 880 - 885); ATCAAT (nucleotides 1195 - 1200); TCGGCT (nucleotides 1288 - 1293); CTCGAT (nucleotides 1363
  • At least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3 of the following codon pair replacements have been made: GTCAGA (nucleotides 175 - 180 ) replaced with GTTCGT; GCCGGA (nucleotides 439 - 444 ) replaced with GCTGGT; CAGCTT (nucleotides 598 - 603 ) replaced with CAGTTG; ATCAAT (nucleotides 649 - 654 ) replaced with ATTAAT; CTTTAT (nucleotides 703 - 708 ) replaced with TTGTAT; GAAGGC (nucleotides 718 - 723 ) replaced with GAGGGC; GCAAGG (nucleotides 730 - 735 ) replaced with GCTCGT; GCCTTT (nucleo
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • L-arabinose isomerase-encoding nucleotide sequence having at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
  • a system for metabolizing arabinose comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme.
  • the L-arabinose isomerase retains at least 75% of the enzymatic activity of wild-type AraA (SEQ ID NO: 302) under normal physiological conditions.
  • a L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 9-483 of SEQ ID NO: 302 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism.
  • no replacement codon encoding amino acids 9-483 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 9-483 when expressed in the native organism.
  • no replacement codon encoding amino acids 9-483 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair CTGGTG when expressed in the native organism.
  • L-arabinose isomerase-encoding nucleotide sequence wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least 1.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-8 when expressed in the native organism.
  • at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GAAGTG when expressed in the native organism.
  • isolated polynucleotides comprising any of the nucleotide sequences provided herein. Also provided herein are isolated polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9.
  • isolated polypeptides encoded by any of the nucleotide sequences provided herein, provided that the amino acid sequence of said polypeptide is not SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302.
  • expression systems comprising: an expression vector in a host organism, wherein the expression vector includes any of the polynucleotides provided herein operably linked to an expression control sequence. Also provided herein are expression systems, comprising: an expression vector in a host organism, wherein the expression vector includes two or more polynucleotides provided herein, each polynucleotide being operably linked to the same or different expression control sequences.
  • expression systems for metabolizing xylose comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • expression systems for metabolizing xylose comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: xylose isomerase and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 267, 271, 273, 275, 277, 279, 281, 283, 285 or 287.
  • Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 267, 271, 273, 275, 277, 279, 281, 283, 285 or 287.
  • one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189 or 191.
  • Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189 or 191.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of said enzyme.
  • each encoded enzyme retains at least 75% of the enzymatic activity of wild-type polypeptide (SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302) under normal physiological conditions.
  • expression systems for metabolizing arabinose comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: L-arabinitol 4-dehydrogenase, L-xylulose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • expression systems for metabolizing arabinose comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
  • one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
  • Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
  • one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 291, 295, 297, 299, 303, 305, 307, 309 or 311.
  • Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 291, 295, 297, 299, 303, 305, 307, 309 or 311.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of said enzyme.
  • each encoded enzyme retains at least 75% of the enzymatic activity of wild-type polypeptide (SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302) under normal physiological conditions.
  • cells comprising any of the polynucleotides provided herein.
  • the cell expresses the polypeptide encoded by said polynucleotide.
  • Also provided herein are methods of introducing a polynucleotide into a host cell comprising: providing a host cell; and contacting said host cell with any of the polynucleotides provided herein under conditions that permit the polynucleotide to be introduced into the host cell.
  • Also provided herein are methods of expressing a polypeptide comprising: providing a cell comprising any of the polynucleotides provided herein; and placing the cell under conditions that permit the cell to express the polypeptide encoded by the DNA sequence, whereby said encoded polypeptide is expressed by said cell.
  • Also provided herein are methods of metabolizing a sugar comprising: providing a sugar comprising at least one covalent bond; providing a polypeptide encoded by any of the polynucleotides provided herein; and contacting said sugar with said polypeptide under conditions that permit said polypeptide to break or form at least one covalent bond of said sugar, whereby at least one covalent bond of said sugar is broken or formed.
  • integrable polynucleotides for modifying an endogenous nucleotide sequence in a cell comprising: a removable selectable marker cassette comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site, wherein said removable selectable marker cassette is flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence.
  • integrable polynucleotides further comprise a heterologous nucleic acid flanked by said 5' nucleic acid sequence with homology to an endogenous sequence and said 3' nucleic acid sequence with homology to an endogenous sequence.
  • the heterologous nucleic acid comprises a sequence encoding a polypeptide.
  • the heterologous nucleic acid comprises a regulatory sequence.
  • the sequence encoding a polypeptide is operatively linked to said regulatory sequence.
  • the regulatory sequence comprises a promoter sequence and a terminator sequence.
  • the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein.
  • the heterologous nucleic acid encodes a polypeptide that catalyzes a reaction in a sugar degradation pathway.
  • the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 181, 183, 185,
  • the selectable marker can be selected for or can be selected against. In some such integrable polynucleotides, the selectable marker can be selected for and can be selected against. In some such integrable polynucleotides, the selectable marker is selected from the group consisting of URA3, TRP1, CAN1, KlURA3, CYH2, LYS2 and MET15. In some such integrable polynucleotides, the nucleic acid sequence with homology to an endogenous sequence comprises a genomic repetitive element. In some such integrable polynucleotides, the nucleic acid sequence with homology to an endogenous sequence comprises Ty1 DNA or Ty3 DNA.
  • the site-specific recombinase recognition site comprises a loxP sequence. In some such integrable polynucleotides, the site-specific recombinase recognition site comprises a frt sequence. In some such integrable polynucleotides, the integrable polynucleotide comprises a PCR product.
  • cells comprising any of the integrable polynucleotides provided herein. Some such cells comprise a gene encoding a site-specific recombinase. In some such cells, the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. Some such cells are S. cerevisiae cells.
  • Also provided herein are methods of modifying an endogenous sequence in a cell comprising: providing a cell with at least one of the integrable polynucleotides provided; and selecting for a cell comprising said at least one integrable polynucleotide integrated into the genome of the cell. Some such methods further comprise excising at least one selectable marker from said at least one cell comprising said at least one integrable polynucleotide integrated therein; and selecting for a cell in which said at least one selectable marker has been excised. In some such methods, the excising of said selectable marker comprises providing said cell with a site-specific recombinase.
  • the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. In some such methods, the site-specific recombinase is expressed from an endogenous gene or from a heterologous nucleic acid.
  • the providing a cell with at least one integrable polynucleotide comprises providing a cell with a plurality of integrable polynucleotides, wherein said plurality of integrable polynucleotides comprises at least a first integrable polynucleotide comprising a first selectable marker and a second integrable polynucleotide comprising a second selectable marker.
  • the plurality comprises 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides.
  • cells comprising an endogenous sequence modified by any of such methods provided herein.
  • the modified endogenous sequence comprises an insertion, a deletion or a mutation.
  • cells comprising a removable selectable marker cassette integrated into said cell comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site; and a heterologous nucleic acid integrated into said cell, wherein said removable selectable marker is juxtaposed to said heterologous nucleic acid.
  • cells comprising: a heterologous nucleic acid integrated into said cell, and a site-specific recombinase recognition site integrated into said cell, wherein said site-specific recombinase recognition site is juxtaposed to said heterologous nucleic acid.
  • the site-specific recombinase recognition site comprises a loxP or frt sequence.
  • the cell is a S. cerevisiae cell.
  • the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein.
  • the heterologous nucleic acid encodes a polypeptide that catalyzes a reaction in a sugar degradation pathway.
  • the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 195
  • Figure 1 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in P. stipitis of nucleic acid sequences encoding the xylose reductase enzyme of P. stipitis (Xyr), plotted as a function of codon pair position.
  • Figures 2-6 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 2-6 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding Xyr, plotted as a function of codon pair position.
  • Figure 2A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 2B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 3A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 3B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 4A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 4B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 5A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 5B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 6A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 6B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 7 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in C. parapsilosis of nucleic acid sequences encoding the xylose reductase enzyme of C. parapsilosis (Xyl1), plotted as a function of codon pair position.
  • Figures 8-12 depict effects of Translational Engineering™ on protein expression levels.
  • Each of Figures 8-12 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding Xyl1, plotted as a function of codon pair position.
  • Figure 8A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the Xyl1 protein.
  • Figure 8B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 9A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the Xyl1 protein.
  • Figure 9B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 10A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the Xyl1 protein.
  • Figure 10B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 11A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the Xyl1 protein.
  • Figure 11B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 12A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the Xyl1 protein.
  • Figure 12B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 13 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in P. stipitis of nucleic acid sequences encoding the xylitol dehydrogenase enzyme of P. stipitis (Xdh), plotted as a function of codon pair position.
  • Figures 14-18 depict effects of Translational Engineering™ on protein expression levels.
  • Each of Figures 14-18 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding Xdh, plotted as a function of codon pair position.
  • Figure 14A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the Xdh protein.
  • Figure 14B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 15A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the Xdh protein.
  • Figure 15B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 16A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the Xdh protein.
  • Figure 16B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 17A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the Xdh protein.
  • Figure 17B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 18A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the Xdh protein.
  • Figure 18B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 19 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in P. stipitis of nucleic acid sequences encoding the D-xylulokinase enzyme of P. stipitis (XKI), plotted as a function of codon pair position.
  • Figures 20-24 depict effects of Translational Engineering™ on protein expression levels.
  • Each of Figures 20-24 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding XKI, plotted as a function of codon pair position.
  • Figure 20A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the XKI protein.
  • Figure 20B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 21A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the XKI protein.
  • Figure 21B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 22A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the XKI protein.
  • Figure 22B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 23A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the XKI protein.
  • Figure 23B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 24A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XKI protein.
  • Figure 24B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 25 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in T. reesei of nucleic acid sequences encoding the L-arabinitol 4-dehydrogenase enzyme of T. reesei (LAD1), plotted as a function of codon pair position.
  • Figures 26-30 depict effects of Translational Engineering™ on protein expression levels.
  • Each of Figures 26-30 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding LAD1, plotted as a function of codon pair position.
  • Figure 26A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LAD1 protein.
  • Figure 26B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 27A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LAD1 protein.
  • Figure 27B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 28A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LAD1 protein.
  • Figure 28B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 29A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LAD1 protein.
  • Figure 29B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 30A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LAD1 protein.
  • Figure 30B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 31 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in A. monospora of nucleic acid sequences encoding the L-xylulose reductase enzyme of A. monospora (LXR), plotted as a function of codon pair position.
  • Figures 32-36 depict effects of Translational Engineering™ on protein expression levels.
  • Each of Figures 32-36 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding LXR, plotted as a function of codon pair position.
  • Figure 32A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 32B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 33A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 33B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 34A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 34B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 35A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 35B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 36A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 36B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 37 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in T. reesei of nucleic acid sequences encoding the L-xylulose reductase enzyme of T. reesei (LXR), plotted as a function of codon pair position.
  • Figures 38-42 depict effects of Translational Engineering™ on protein expression levels.
  • Each of Figures 38-42 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding LXR, plotted as a function of codon pair position.
  • Figure 38A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 38B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 39A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 39B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 40A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 40B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 41A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 41B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 42A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LXR protein.
  • Figure 42B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 43 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the xylose isomerase enzyme of E. coli (XylA), plotted as a function of codon pair position.
  • Figures 44-48 depict effects of Translation Engineering™ on protein expression levels.
  • Each of Figures 44-48 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding XylA, plotted as a function of codon pair position.
  • Figure 44A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the XylA protein.
  • Figure 44B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 45A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the XylA protein.
  • Figure 45B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 46A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the XylA protein.
  • Figure 46B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 47A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the XylA protein.
  • Figure 47B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 48A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XylA protein.
  • Figure 48B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 49 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-arabinose isomerase enzyme of E. coli (AraA), plotted as a function of codon pair position.
  • Figures 50-54 depict effects of Translation Engineering™ on protein expression levels.
  • Each of Figures 50-54 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraA, plotted as a function of codon pair position.
  • Figure 50A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 50B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 51A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 51B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 52A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 52B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 53A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 53B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 54A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 54B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 55 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-ribulokinase enzyme of E. coli (AraB), plotted as a function of codon pair position.
  • Figures 56-60 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 56-60 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraB, plotted as a function of codon pair position.
  • Figure 56A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraB protein.
  • Figure 56B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 57A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraB protein.
  • Figure 57B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 58A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraB protein.
  • Figure 58B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 59A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraB protein.
  • Figure 59B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 60A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraB protein.
  • Figure 60B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 61 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-ribulose-5-P 4-epimerase enzyme of E. coli (AraD), plotted as a function of codon pair position.
  • Figures 62-66 depict effects of Translation Engineering™ on protein expression levels.
  • Each of Figures 62-66 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraD, plotted as a function of codon pair position.
  • Figure 62A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraD protein.
  • Figure 62B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 63A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraD protein.
  • Figure 63B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 64A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraD protein.
  • Figure 64B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 65A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraD protein.
  • Figure 65B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 66A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraD protein.
  • Figure 66B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figures 67-71 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 67-71 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the xylose reductase enzyme of C. tenuis (Xyr), plotted as a function of codon pair position.
  • Figure 67A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 67B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 68A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 68B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 69A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 69B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 70A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 70B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 71A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the Xyr protein.
  • Figure 71B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 72 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-arabinose isomerase enzyme of E. coli (AraA), plotted as a function of codon pair position.
  • Figures 73-77 depict effects of Translation Engineering™ on protein expression levels.
  • Each of Figures 73-77 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraA, plotted as a function of codon pair position.
  • Figure 73A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 73B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 74A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 74B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 75A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 75B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 76A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 76B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 77A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 77B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 78 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-arabinose isomerase enzyme of E. coli (AraA), plotted as a function of codon pair position.
  • Figures 79-83 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 79-83 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraA, plotted as a function of codon pair position.
  • Figure 79A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 79B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
  • Figure 80A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 80B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
  • Figure 81A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 81B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
  • Figure 82A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 82B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
  • Figure 83A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraA protein.
  • Figure 83B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 84A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XynA protein.
  • Figure 84B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
  • Figure 85 depicts a Western blot analysis of expression in S. cerevisiae of the AraBAD enzymes. As shown in the figure, AraB and AraD are expressed and soluble. AraA is also well expressed (as seen in a denaturing purification, not shown). F denotes flowthrough and E denotes eluate of the His-tagged proteins on a Ni2+-NTA column (Qiagen).
  • Figure 86 depicts a Western blot analysis showing expression in S. cerevisiae of P. stipitis xylose reductase (XYR).
  • The native gene is compared to the HotRod gene, which was modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae. Time points are indicated as minutes after induction with galactose.
  • Figure 87 depicts a Western blot analysis of expression of the HotRod version of the XKI enzyme in S. cerevisiae.
  • The gene was expressed from the PGAL promoter in the pYES2 vector (Invitrogen), and purified under either denaturing or native conditions using the 6-His tag located at the N-terminus of the enzyme. These results show that this enzyme is soluble when expressed in yeast.
  • Biomass is the earth's most attractive alternative fuel source and its most sustainable energy resource, and it is regenerated by the bioconversion of carbon dioxide.
  • Ethanol produced from biomass is today the most widely used biofuel when blended with gasoline. As the carbon dioxide released by combustion is recycled into biomass, the use of biofuels can significantly reduce the accumulation of greenhouse gas.
  • Ethanol is just one example of the uses of biomass harvesting using industrial enzymes. The technologies associated with biomass harvesting are similarly applicable in the production of other biofuels, fine chemicals as well as other diverse applications.
  • Lignocellulosic biomass is composed predominantly of cellulose, hemicellulose, and lignin and is naturally resistant to chemical and biologic conversion.
  • An economical biomass-to-ethanol process critically depends on the rapid and efficient conversion of all of the sugars present in both its cellulose and hemicellulose fractions. While many microorganisms can ferment the glucose component in cellulose to ethanol, efficient conversion of the pentose sugars in the hemicellulose fraction, particularly xylose and arabinose, has been hindered by the lack of a suitable biocatalyst.
  • Xylose is the predominant pentose sugar derived from hemicellulose, but arabinose can constitute a significant amount of the pentose sugars derived from various agricultural residues and other herbaceous crops, such as switchgrass.
  • Xylose metabolism: Xylose is metabolized in the pentose phosphate pathway (PPP), which it enters through D-xylulose, and is converted by transketolase (TLK), generating D-fructose-6-phosphate and D-glyceraldehyde-3-phosphate (GAP), which can be converted in a redox-neutral way to equimolar amounts of CO2 and ethanol.
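As a worked illustration of the equimolar CO2 and ethanol statement, the following minimal sketch checks the commonly cited overall stoichiometry for pentose fermentation, 3 xylose -> 5 ethanol + 5 CO2, and derives the corresponding theoretical mass yield. This overall equation is textbook stoichiometry and is not stated in the text above; the function and variable names are illustrative only.

```python
# Sketch: element balance and theoretical ethanol yield for
# 3 C5H10O5 (xylose) -> 5 C2H6O (ethanol) + 5 CO2.
# The overall equation is an assumption taken from general biochemistry,
# not from the document itself.

ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}  # g/mol

XYLOSE = {"C": 5, "H": 10, "O": 5}
ETHANOL = {"C": 2, "H": 6, "O": 1}
CO2 = {"C": 1, "O": 2}


def molar_mass(formula):
    """Molar mass in g/mol of a formula given as {element: count}."""
    return sum(ATOMIC_MASS[el] * n for el, n in formula.items())


def is_balanced():
    """Check that every element balances across the overall equation."""
    for el in ATOMIC_MASS:
        left = 3 * XYLOSE.get(el, 0)
        right = 5 * ETHANOL.get(el, 0) + 5 * CO2.get(el, 0)
        if left != right:
            return False
    return True


def theoretical_ethanol_yield():
    """Grams of ethanol per gram of xylose at the stoichiometric maximum."""
    return (5 * molar_mass(ETHANOL)) / (3 * molar_mass(XYLOSE))


if __name__ == "__main__":
    print(is_balanced(), round(theoretical_ethanol_yield(), 3))
```

Under this stoichiometry, ethanol and CO2 are produced in a 1:1 molar ratio, which matches the equimolar statement above.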
  • D-xylose is reduced to xylitol by a xylose reductase (XR; e.g., Xyr, XYL1, Xyl1p) and then xylitol is oxidized to D-xylulose by a xylitol dehydrogenase (XDH; e.g., XYL2, Xyl2p).
  • XK D-xylulokinase
  • The rate of the two-step reduction/oxidation reactions to generate D-xylulose, and hence feed the PPP and eventually generate ethanol, is governed by the cofactor requirements of the first two reactions, which affect cellular demands for oxygen.
  • XDH from Pichia stipitis is strictly NAD+-dependent.
  • L-arabinose metabolism: In yeast, filamentous fungi and other eukaryotes, the L-arabinose pathway consists of five enzymes: aldose reductase (ARD), L-arabinitol 4-dehydrogenase (LAD), L-xylulose reductase (LXR), xylitol dehydrogenase (XDH), and xylulokinase (XKI), converting L-arabinose to L-arabitol, L-xylulose, xylitol, D-xylulose, and D-xylulose-5-P, respectively.
  • The bacterial pathway for L-arabinose utilization does not use redox reactions like the yeast/fungal system, but consists of L-arabinose isomerase (AraA), L-ribulokinase (AraB), and L-ribulose-5-P 4-epimerase (AraD), converting L-arabinose to L-ribulose, L-ribulose-5-P, and D-xylulose-5-P, respectively (Lee et al. (1986) Gene 47:231-244).
  • The expression of the E. coli pathway in S. cerevisiae did not result in either growth on L-arabinose or production of ethanol from L-arabinose (Sedlak et al. (2001) 28:16-24). It was suggested that the main problem was the low activity of B. licheniformis L-arabinose isomerase in yeast.
  • Some translational pauses result from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, exogenous expression of genes foreign to the expression organism can lead to inefficient translation. Even when the gene is translated efficiently enough that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pauses can improve protein expression.
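The idea that particular codon pairs are associated with predicted pauses can be sketched as a scan over adjacent codon pairs of a coding sequence. In this minimal sketch the z-score table, the threshold, and the convention that a higher z score indicates a predicted pause are all assumptions made for illustration; the actual organism-specific values and cutoffs are not given in this passage.

```python
# Sketch: list adjacent codon pairs and flag those whose (hypothetical)
# z score exceeds a chosen threshold in a given expression organism.
from typing import Dict, List, Tuple


def codon_pairs(cds: str) -> List[Tuple[int, str]]:
    """Return (codon_pair_position, hexamer) for each adjacent codon pair."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [(i, codons[i] + codons[i + 1]) for i in range(len(codons) - 1)]


def predicted_pauses(cds: str, z_scores: Dict[str, float],
                     threshold: float) -> List[Tuple[int, str, float]]:
    """Flag codon pairs whose z score is above the threshold.

    Pairs missing from the table are skipped rather than guessed.
    """
    return [(pos, pair, z_scores[pair])
            for pos, pair in codon_pairs(cds)
            if pair in z_scores and z_scores[pair] > threshold]


if __name__ == "__main__":
    # Hypothetical scores for one organism; real tables are organism-specific.
    example_scores = {"ATGGCT": 0.4, "GCTAAA": 3.2, "AAATAA": -0.7}
    print(predicted_pauses("ATGGCTAAATAA", example_scores, threshold=2.0))
```

Because pausing properties vary between organisms, the same sequence would be scanned against a different table for each intended expression host.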
  • A translational pause can serve to slow translation of the nascent amino acid chain.
  • The pause(s) can serve to facilitate proper polypeptide folding, post-translational modification, re-organization/folding at protein domain boundaries, or other steps toward arriving at the native, active wild type protein.
  • One or more pauses that are predicted to be present in native translation of sugar catabolic enzymes is/are preserved in a modified sugar catabolic enzyme-encoding polynucleotide provided in accordance with the teachings herein.
  • A codon pair in the modified sugar catabolic enzyme-encoding polynucleotide can be selected to have a predicted translational kinetics value that is at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, or 99% that of the native codon pair whose predicted pause is to be preserved; further, the codon pair in the modified sugar catabolic enzyme-encoding polynucleotide can be selected to be located within 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 codons of the native codon pair whose predicted pause is to be preserved.
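The two selection criteria for a preserved pause (a minimum fraction of the native translational kinetics value, and a maximum distance in codons from the native pair) can be expressed as a simple predicate. This is a minimal sketch; the function name, parameter names, and default values are illustrative choices, and positive kinetics values are assumed.

```python
# Sketch: does a candidate codon pair preserve a predicted native pause?
# min_fraction corresponds to the 50%-99% options and max_distance to the
# 30-to-1 codon options listed above; both defaults are illustrative.

def preserves_pause(native_value: float, native_pos: int,
                    candidate_value: float, candidate_pos: int,
                    min_fraction: float = 0.5, max_distance: int = 30) -> bool:
    """Return True if the candidate meets both preservation criteria."""
    if native_value <= 0:
        raise ValueError("native_value is assumed to be positive")
    strong_enough = candidate_value >= min_fraction * native_value
    close_enough = abs(candidate_pos - native_pos) <= max_distance
    return strong_enough and close_enough


if __name__ == "__main__":
    # Candidate at 92.5% of the native value, 5 codons away.
    print(preserves_pause(4.0, 100, 3.7, 105, min_fraction=0.9, max_distance=10))
```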
  • Translation EngineeringTM refers to a process used to modify the translational kinetics of a polypeptide-encoding nucleic sequence.
  • Translation EngineeringTM can be applied to modify the translational kinetics of a polypeptide-encoding nucleic sequence when expressed in its native organism.
  • This process alters the polypeptide-encoding nucleic sequence to optimize codon usage and codon pair usage in the organism in which the polypeptide-encoding nucleic sequence is expressed.
  • Sequence modifications can be made to place or prevent restriction sites in the sequence, eliminate strong RNA secondary structures and avoid inadvertent Shine-Dalgarno sequences.
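Screening a designed sequence for unwanted restriction sites and inadvertent Shine-Dalgarno-like motifs can be sketched as a motif search. The choice of enzymes (EcoRI, BamHI) and the use of the AGGAGG consensus as the Shine-Dalgarno pattern are assumptions for illustration; the passage does not specify which sites are screened. RNA secondary structure prediction is not covered by this sketch.

```python
# Sketch: report positions of selected restriction sites and a
# Shine-Dalgarno-like motif in a DNA sequence (0-based positions).
import re
from typing import Dict, List

RESTRICTION_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}  # illustrative set
SHINE_DALGARNO = "AGGAGG"  # consensus motif, used here as a simple proxy


def find_motif(seq: str, motif: str) -> List[int]:
    """Return all start positions of motif in seq, including overlaps."""
    return [m.start() for m in re.finditer(f"(?={motif})", seq.upper())]


def screen_sequence(seq: str) -> Dict[str, List[int]]:
    """Map each motif name to its hit positions; motifs with no hits are omitted."""
    hits = {name: find_motif(seq, site) for name, site in RESTRICTION_SITES.items()}
    hits["Shine-Dalgarno-like"] = find_motif(seq, SHINE_DALGARNO)
    return {name: positions for name, positions in hits.items() if positions}


if __name__ == "__main__":
    print(screen_sequence("ATGGAATTCAAAGGAGGTTAA"))
```

A hit would then be removed by a synonymous codon change, subject to the codon pair considerations described in this section.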
  • Translation EngineeringTM involves modifying the translational kinetics of a polypeptide-encoding nucleic sequence by removing, preserving, and/or inserting translational pauses into the polypeptide-encoding nucleic sequence.
  • Sugar catabolic enzyme-encoding nucleotide sequences with refined translational kinetics and methods of making same are provided herein.
  • Provided is a sugar catabolic enzyme-encoding DNA sequence wherein the encoded sequence has amino acid sequence identity with wild-type sugar catabolic enzyme, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing input-sequence codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
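Replacing a codon pair with a different pair encoding identical amino acids can be sketched by enumerating the synonymous pairs under the standard genetic code and choosing the one with the lowest score. The z-score table and the "lowest score wins" rule are assumptions for illustration, and conservative amino acid substitutions are not handled here.

```python
# Sketch: enumerate synonymous codon pairs (standard genetic code) and pick
# the candidate with the lowest hypothetical z score.
from itertools import product
from typing import Dict, List

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_pairs(pair: str) -> List[str]:
    """All codon pairs encoding the same two amino acids as the input pair."""
    pair = pair.upper()
    aa1, aa2 = CODON_TO_AA[pair[:3]], CODON_TO_AA[pair[3:]]
    firsts = [c for c, aa in CODON_TO_AA.items() if aa == aa1]
    seconds = [c for c, aa in CODON_TO_AA.items() if aa == aa2]
    return [f + s for f, s in product(firsts, seconds)]


def best_replacement(pair: str, z_scores: Dict[str, float]) -> str:
    """Pick the scored synonymous pair with the lowest z score.

    Only pairs present in the table are considered; if none are scored,
    the input pair is returned unchanged.
    """
    scored = [p for p in synonymous_pairs(pair) if p in z_scores]
    if not scored:
        return pair.upper()
    return min(scored, key=lambda p: z_scores[p])


if __name__ == "__main__":
    example_scores = {"GCTAAA": 3.2, "GCCAAG": -1.1, "GCAAAA": 0.5}
    print(best_replacement("GCTAAA", example_scores))
```

Note that adjacent codon pairs overlap, so changing one codon alters two pairs; a practical procedure would rescore the neighboring pairs after each replacement.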
  • The resultant sugar catabolic enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length.
  • Expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression.
  • Expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated sugar catabolic enzyme.
  • Expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where one or more predicted pauses are preserved from the native expression profile or are added to preserve expression of active and/or soluble sugar catabolic enzyme.
  • The sugar catabolic enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels; higher enzymatic activity; greater protein stability and resistance to degradation; and increased solubility.
  • The term "sugar catabolic enzyme" refers to the enzymes encoded by the nucleotide sequences provided herein, and includes xylose reductase, xylitol dehydrogenase, D-xylulokinase, L-arabinitol 4-dehydrogenase, L-xylulose reductase, xylose isomerase, L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase enzymes.
  • nucleic acid sequences encoding the xylose reductase enzyme of P. stipitis (Xyr) are provided.
  • the nucleotide sequences provided herein include the native sequence from P. stipitis shown in the sequence listing (SEQ ID NO: 1) which encodes the Xyr amino acid sequence (SEQ ID NO: 2).
  • nucleic acid sequences encoding the xylose reductase enzyme of C. parapsilosis are provided.
  • the nucleotide sequences provided herein include the native sequence from C. parapsilosis shown in the sequence listing (SEQ ID NO: 25) which encodes the Xyl1 amino acid sequence (SEQ ID NO: 26).
  • nucleic acid sequences encoding the xylitol dehydrogenase enzyme of P. stipitis are provided.
  • the nucleotide sequences provided herein include the native sequence from P. stipitis shown in the sequence listing (SEQ ID NO: 49) which encodes the Xdh amino acid sequence (SEQ ID NO: 50).
  • nucleic acid sequences encoding the D-xylulokinase enzyme of P. stipitis are provided.
  • the nucleotide sequences provided herein include the native sequence from P. stipitis shown in the sequence listing (SEQ ID NO: 73) which encodes the XKI amino acid sequence (SEQ ID NO: 74).
  • nucleic acid sequences encoding the L-arabinitol 4-dehydrogenase enzyme of T. reesei are provided.
  • the nucleotide sequences provided herein include the native sequence from T. reesei shown in the sequence listing (SEQ ID NO: 97) which encodes the LAD1 amino acid sequence (SEQ ID NO: 98).
  • nucleic acid sequences encoding the L-xylulose reductase enzyme of A. monospora are provided.
  • the nucleotide sequences provided herein include the native sequence from A. monospora shown in the sequence listing (SEQ ID NO: 121) which encodes the LXR amino acid sequence (SEQ ID NO: 122).
  • nucleic acid sequences encoding the L-xylulose reductase enzyme of T. reesei are provided.
  • the nucleotide sequences provided herein include the native sequence from T. reesei shown in the sequence listing (SEQ ID NO: 145) which encodes the LXR amino acid sequence (SEQ ID NO: 146).
  • nucleic acid sequences encoding the xylose isomerase enzyme of E. coli are provided.
  • the nucleotide sequences provided herein include the native sequence from E. coli shown in the sequence listing (SEQ ID NO: 169) which encodes the XylA amino acid sequence (SEQ ID NO: 170).
  • nucleic acid sequences encoding the L-arabinose isomerase enzyme of E. coli are provided.
  • the nucleotide sequences provided herein include the native sequence from E. coli shown in the sequence listing (SEQ ID NO: 193) which encodes the AraA amino acid sequence (SEQ ID NO: 194).
  • nucleic acid sequences encoding the L-ribulokinase enzyme of E. coli are provided.
  • the nucleotide sequences provided herein include the native sequence from E. coli shown in the sequence listing (SEQ ID NO: 217) which encodes the AraB amino acid sequence (SEQ ID NO: 218).
  • nucleic acid sequences encoding the L-ribulose-5-P 4-epimerase enzyme of E. coli are provided.
  • the nucleotide sequences provided herein include the native sequence from E. coli shown in the sequence listing (SEQ ID NO: 241) which encodes the AraD amino acid sequence (SEQ ID NO: 242).
  • nucleic acid sequences encoding the xylose reductase enzyme of C. tenuis are provided.
  • the nucleotide sequences provided herein include the native sequence from C. tenuis shown in the sequence listing (SEQ ID NO: 265) which encodes the Xyr amino acid sequence (SEQ ID NO: 266).
  • nucleic acid sequences encoding the L-arabinose isomerase enzyme of B. subtilis are provided.
  • the nucleotide sequences provided herein include the native sequence from B. subtilis shown in the sequence listing (SEQ ID NO: 289) which encodes the AraA amino acid sequence (SEQ ID NO: 290).
  • nucleic acid sequences encoding the L-arabinose isomerase enzyme of B. licheniformis are provided.
  • the nucleotide sequences provided herein include the native sequence from B. licheniformis shown in the sequence listing (SEQ ID NO: 301) which encodes the AraA amino acid sequence (SEQ ID NO: 302).
  • nucleic acid sequences encoding sugar catabolic enzymes with refined translational kinetics for expression in S. cerevisiae (SEQ ID NOS: 3, 27, 51, 75, 99, 123, 147, 171, 195, 219, 243, 267, 291, 303), E. coli (SEQ ID NOS: 9, 33, 57, 81, 105, 129, 153, 177, 201, 225, 249, 273, 293 and 305), P.
  • nucleotide sequences may be added 3' or 5' of any nucleic acid, for example, to facilitate hybridization of PCR primers, to add cloning restriction sites or other sites that facilitate cloning and/or expression. Accordingly, provided in the sequence listing are nucleic acid sequences with additional 5' and 3' cloning and/or PCR sequences, and which encode sugar catabolic enzymes with refined translational kinetics for expression in S.
  • E. coli (SEQ ID NOS: 11, 13, 35, 37, 59, 61, 83, 85, 107, 109, 131, 133, 155, 157, 179, 181, 203, 205, 227, 229, 251, 253, 275, 277) and P.
  • sugar catabolic enzyme amino acid sequences encoded by the nucleotide sequences with refined translational kinetics described herein.
  • sugar catabolic enzyme nucleic acid sequences with refined translational kinetics SEQ ID NOS: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175,
  • sugar catabolic enzyme-encoding DNA sequences wherein the encoded sequence has amino acid sequence identity with an original sugar catabolic enzyme polypeptide and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
  • the host organism is not human, E. coli or S. cerevisiae.
  • a xylose reductase polynucleotide encodes a polypeptide having xylose reductase activity.
  • Xylose reductase and like terms refers to the enzymatic conversion of xylose to xylitol.
  • a method for measuring xylose reductase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in Rawat et al. ((1998) J. Biol. Chem. 273:9415-9423), hereby incorporated by reference in its entirety.
  • a xylitol dehydrogenase polynucleotide encodes a polypeptide having xylitol dehydrogenase activity.
  • Xylitol dehydrogenase and like terms refers to the enzymatic conversion of xylitol to D-xylulose.
  • a method for measuring xylitol dehydrogenase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in Ko et al. ((2006) Appl. Environ. Microbiol. 72:4207-4213), hereby incorporated by reference in its entirety.
  • a D-xylulokinase polynucleotide encodes a polypeptide having D-xylulokinase activity.
  • D-xylulokinase and like terms refers to the enzymatic conversion of D-xylulose to D-xylulose-5-phosphate.
  • a method for measuring D-xylulokinase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in Dills et al. ((1994) Protein Expr. Purif. 5:259-265), hereby incorporated by reference in its entirety.
  • L-arabinitol 4-dehydrogenase polynucleotide encodes a polypeptide having L-arabinitol 4-dehydrogenase activity.
  • L-arabinitol 4-dehydrogenase and like terms refers to the enzymatic conversion of L-arabinose to L-arabitol.
  • a method for measuring L-arabinitol 4-dehydrogenase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in U.S. Patent Application No. 2003/0186402, hereby incorporated by reference in its entirety.
  • L-xylulose reductase polynucleotide encodes a polypeptide having L-xylulose reductase activity.
  • L-xylulose reductase and like terms refers to the enzymatic conversion of L-xylulose to xylitol.
  • a method for measuring L-xylulose reductase activity is exemplified by a known method as described in Verho et al. ((2004) J. Biol. Chem. 279:14746-14751), hereby incorporated by reference in its entirety.
  • a xylose isomerase polynucleotide encodes a polypeptide having xylose isomerase activity.
  • Xylose isomerase and like terms refers to the enzymatic conversion of xylose to D-xylulose.
  • a method for measuring xylose isomerase activity is exemplified by a known method in which an enzymatic reaction is carried out and xylulose production is monitored by spectrophotometry, as described in U.S. Patent No. 6,475,768, hereby incorporated by reference in its entirety.
  • L-arabinose isomerase polynucleotide encodes a polypeptide having L-arabinose isomerase activity.
  • L-arabinose isomerase and like terms refers to the enzymatic conversion of L-arabinose to L-ribulose.
  • a method for measuring L-arabinose isomerase activity is exemplified by a known method in which an enzymatic reaction is carried out and ribulose absorbance at 560 nm is monitored by spectrophotometry, as described in Lee et al. ((2005) Appl. Environ. Microbiol. 71:7888-7896), hereby incorporated by reference in its entirety.
  • L-ribulokinase polynucleotide encodes a polypeptide having L-ribulokinase activity.
  • L-ribulokinase and like terms refers to the enzymatic conversion of L-ribulose to L-ribulose-5-P.
  • a method for measuring L-ribulokinase activity is exemplified by a known method in which an enzymatic reaction is carried out and DPNH absorbance at 340 nm is monitored by spectrophotometry, as described by Lee and Englesberg ((1962) Proc. Natl. Acad. Sci. 48:335), hereby incorporated by reference in its entirety.
  • L-ribulose-5-P 4-epimerase polynucleotide encodes a polypeptide having L-ribulose-5-P 4-epimerase activity.
  • L-ribulose-5-P 4-epimerase and like terms refers to the enzymatic conversion of L-ribulose-5-P to D-xylulose-5-P.
  • a method for measuring L-ribulose-5-P 4-epimerase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in Becker and Boles ((2003) Appl. Environ. Microbiol. 69:4144-50), hereby incorporated by reference in its entirety.
  • the polynucleotides provided herein encode polypeptides that have sugar catabolism activity.
  • a sugar catabolic enzyme-encoding polynucleotide comprising any of the DNA sequences provided herein can be transcribed and the resulting RNA translated to produce a polypeptide with sugar catabolic enzyme activity.
  • As used herein, the term nucleotide sequence is used to refer to any polynucleotide sequence.
  • DNA sequence is used herein to refer to the nucleotide sequences presented herein.
  • RNA equivalent nucleotide sequences are also described by the DNA sequences presented herein. As is well known in the art, an equivalent RNA sequence can be substituted for a DNA sequence by a T to U substitution (i.e., replacing thymine in the DNA sequence with uracil in the RNA sequence).
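The T to U substitution described above can be illustrated with a minimal sketch. The function name `dna_to_rna` is an illustrative assumption and does not appear in the source.

```python
def dna_to_rna(dna: str) -> str:
    """Return the RNA equivalent of a DNA sequence by replacing T with U."""
    # Upper-case first so that lower-case input is handled the same way.
    return dna.upper().replace("T", "U")


print(dna_to_rna("ATGGCTTAA"))  # AUGGCUUAA
```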
  • the sugar catabolic enzyme-encoding DNA sequence is adapted for expression in a heterologous host organism.
  • a DNA sequence that has been adapted for expression is a DNA sequence that has been inserted into an expression vector or otherwise modified to contain regulatory elements necessary for expression of the DNA in the host cell, positioned in such a manner as to permit expression of the DNA in the host cell.
  • regulatory elements required for expression include promoter sequences, transcription initiation sequences and, optionally, enhancer sequences.
  • a DNA sequence may be inserted into a plasmid vector adapted for expression in a bacterial cell, such as E. coli, or a eukaryotic cell, such as S. cerevisiae or other yeast, or any other host organism.
  • a heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • translational kinetics of an mRNA into polypeptide can be changed in order to achieve any of a variety of expression profiles. For example, translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all translational pauses predicted to occur within an autonomous folding unit of a nascent protein. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all over-represented codon pairs.
  • a pause or translation slowing codon pair can queue ribosomes back to the beginning of the coding sequence, thereby inhibiting further ribosome attachment to the message which can result in down-regulation of protein expression levels as the rate of translation initiation readily saturates and the slowest translation step time becomes rate limiting. It is also proposed herein that the presence of a pause or translational slowing codon pair can stall or detach a ribosome. It is also proposed herein that the presence of a pause or translational slowing codon pair can expose naked mRNA, which is then subject to message degradation.
  • Organism-specific codon usage and codon pair usage, and the presence of organism-specific pause sites result in gene translation that is highly adapted to the original host organism.
  • ribosomal pausing sites that may be functional in a human cell will typically be scrambled, random, or not appropriate or not recognized in the proper context in a bacterium or other non-native host.
  • a heterologous cDNA or synthetic polynucleotide has a random but high probability of inadvertently encoding a pause site somewhere, often leading to protein expression and/or activity failure.
  • Methods for refining translational kinetics of an mRNA into polypeptide can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2008/0046192, published on February 21, 2008, which is incorporated by reference herein in its entirety.
  • a polypeptide-encoding nucleotide can be designed to be predicted to be translated rapidly along its entire length.
  • some polypeptide-encoding nucleotides provided herein are those that have been engineered to remove all predicted pauses. Expression of such a polypeptide-encoding nucleotide can result in improved protein expression levels and improved levels of active and/or natively folded polypeptide expression.
  • a test of translation pausing or slowing as a result of codon pair usage can be performed by comparing a series of genes that have random pauses with modified genes where codon pairs predicted to cause translational pauses are replaced. Unmodified genes moved from their source organism and expressed in a heterologous host can have an altered set of codon pairs predicted to cause a translational pause or ribosomal slowing (e.g., an altered set of over-represented codon pairs), resulting in altered configuration and location of presumed pause sites.
  • translational kinetics of an mRNA into sugar catabolic enzyme-encoding polypeptide can be changed in order to remove some or all translational pauses or replace other codon pairs that cause translational slowing, message instability and degradation, and poor protein translation, expression, and functional properties. While not intending to be limited to the following, it is believed that, for at least some proteins, reduction or elimination of translational pauses can serve to increase the expression level and/or quality and characteristics of the protein. Accordingly, by removing some or all translational pauses or replacing other codon pairs that cause translational slowing, the expression levels and/or quality of an expressed protein can be increased.
  • the sugar catabolic enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels, higher enzymatic activity, greater protein stability, resistance to degradation, and increased solubility compared to the original native gene when expressed in a heterologous host.
  • sugar catabolic enzyme-encoding nucleotide sequences that have been modified to have one or more translational pauses or slowing sites removed by modifying one or more codon pairs to a corresponding codon pair that is less likely to cause a translational pause or slowing. While in some embodiments it is preferred to replace all codon pairs predicted to cause a translational pause or slowing, in other embodiments, it is sufficient to replace a subset of codon pairs predicted to cause a translational pause or slowing. For example, expression levels can be increased by replacing at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more codon pairs predicted to cause a translational pause or slowing. In another example, at least 10%, 20%, 30%.
  • codon pairs predicted to cause a translational pause or slowing are replaced by, for example, substituting different codon pairs that encode the same amino acids.
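As a minimal sketch of substituting different codon pairs that encode the same amino acids, the following enumerates every synonymous alternative to a given codon pair. It assumes the standard genetic code; the names `CODON_TABLE` and `synonymous_pairs` are illustrative and are not taken from the source.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}


def synonymous_pairs(pair: str) -> list[str]:
    """List all other codon pairs encoding the same two amino acids as `pair`."""
    first_aa, second_aa = CODON_TABLE[pair[:3]], CODON_TABLE[pair[3:]]
    firsts = [c for c, aa in CODON_TABLE.items() if aa == first_aa]
    seconds = [c for c, aa in CODON_TABLE.items() if aa == second_aa]
    return [x + y for x in firsts for y in seconds if x + y != pair]


# Ala-Lys has 4 x 2 = 8 encodings, so 7 alternatives to GCTAAA remain.
print(len(synonymous_pairs("GCTAAA")))  # 7
```

A pair such as ATGTGG (Met-Trp) has no synonymous alternative, which is the situation in which a replacement cannot be made without changing the encoded amino acids.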
  • translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses predicted to occur within an autonomous folding unit of a protein.
  • an autonomous folding unit of a protein refers to an element of the overall protein structure that is self-stabilizing and often folds independently of the rest of the protein chain. Such autonomous folding units typically correspond to a protein domain.
  • expression of a gene in a heterologous host organism can result in translational pauses located in regions that inhibit protein expression and/or protein folding.
  • preserving or inserting a translational pause in a region predicted to separate autonomous folding units of a protein can result in improved folding and/or solubility of expressed proteins.
  • provided herein are methods of changing translational kinetics of an mRNA into polypeptide by preserving, relative to native, or inserting one or more translational pauses in one or more regions predicted to separate autonomous folding units of a protein, thereby improving the folding and/or solubility of the expressed protein.
  • one step can include identifying predicted autonomous folding units of a protein.
  • Methods for identifying predicted autonomous folding units of a protein or protein domains are known in the art, and include alignment of amino acid sequences with protein sequences having known structures, and threading amino acid sequences against template protein domain databases.
  • Such methods can employ any of a variety of software algorithms in searching any of a variety of databases known in the art for predicting the location of protein domains.
  • the results of such methods will typically include an identification of the amino acids predicted to be present in a particular domain, and also can include an identification of the domain itself, and an identification of the secondary structural element, if any, in which each amino acid sequence of a domain is located.
  • in some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to remove a translational pause not present in the expression profile of the polypeptide in the native host organism. For example, there may be no codon pairs that are not predicted to cause a translational pause or slowing and that encode a corresponding pair of amino acids. In such instances, several options are available: the codon pair that is least likely to cause a translational pause or slowing can be selected; an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made.
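Two of the options listed above (selecting the least likely codon pair, or making no change) can be sketched as a simple fallback rule. The function `choose_replacement`, the `scores` mapping of codon pairs to translational kinetics values, and the returned notes are hypothetical; the amino acid change option is deliberately left out of this sketch.

```python
def choose_replacement(original, candidates, scores, threshold):
    """Pick a synonymous codon pair for `original`, with a fallback.

    `scores` maps each codon pair to an assumed translational kinetics value,
    where a higher value means a pause is more likely.
    """
    # Preferred case: some candidate is not predicted to pause.
    below = [c for c in candidates if scores[c] < threshold]
    if below:
        return min(below, key=scores.get), "below threshold"
    # Fallback: take the least likely pair, keeping the original if nothing
    # scores lower than it.
    best = min([*candidates, original], key=scores.get)
    if best == original:
        return original, "no change"
    return best, "least likely available"
```

For example, with `scores = {"P0": 4.0, "P1": 2.5, "P2": 3.0}` and a threshold of 2.0, `choose_replacement("P0", ["P1", "P2"], scores, 2.0)` returns `("P1", "least likely available")`.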
  • One option in a computational method is to request human input in order to resolve the issue.
  • the computational method may, for example, involve the use of a computer that is programmed to request human input.
  • the computer may be programmed to make a selection, or combination of selections, such that multiple genes, or Ordered Gene Sets or small permutation libraries are designed and synthetically produced for use in expression analysis.
  • when an amino acid insertion, deletion or mutation is made in order to change translational kinetics, it is preferable to select a change that is predicted not to substantially influence the final three-dimensional structure of the protein and/or the activity of the protein.
  • Such an amino acid insertion, deletion or mutation can include, for example, a conservative amino acid substitution such as the conservative substitutions shown in Table 1.
  • the substitutions shown are based on amino acid physical-chemical properties, and as such, are independent of organism.
  • the conservative amino acid substitution is a substitution listed under the heading of exemplary substitutions.
  • codon pairs predicted to cause a translational pause or slowing are treated equally.
  • one or more different threshold levels can be established for differential treatment of codon pairs, where codon pairs above a highest threshold are the codon pairs most likely to cause a translational pause or slowing, and succeedingly lower codon pair threshold-based groups correspond to succeedingly lower likelihoods of the respective codon pairs causing a translational pause or slowing.
  • with such codon pair groupings, different numbers or percentages of codon pairs can be replaced for each of these different threshold-based groups. For example, 95% or more codon pairs above a highest threshold level can be replaced, while 90% or less of all codon pairs between that level and an intermediate threshold level are replaced.
  • codon pairs likely to cause a translational pause or slowing can be segregated into two or more different threshold-based groups, three or more different threshold-based groups, four or more different threshold-based groups, five or more different threshold-based groups, six or more different threshold-based groups, or more. Discussion of specific thresholds is provided elsewhere herein; however, typically the higher the threshold, the higher the likelihood of a translational pause or slowing caused by a codon pair with a translational kinetics value greater than the threshold. In embodiments in which codon pairs likely to cause a translational pause or slowing can be segregated into two or more different threshold-based groups, different numbers or percentages of codon pairs can be replaced for each codon pair group.
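The segregation of codon pairs into threshold-based groups can be sketched as follows. The function `group_by_threshold` is a hypothetical helper, and treating a value equal to a threshold as meeting it is an assumption of this sketch rather than something the source specifies.

```python
from bisect import bisect_right


def group_by_threshold(values, thresholds):
    """Group codon-pair indices by how many thresholds each value meets.

    Group 0 holds pairs below every threshold; the highest-numbered group
    holds pairs at or above the highest threshold.
    """
    cuts = sorted(thresholds)
    groups = {level: [] for level in range(len(cuts) + 1)}
    for index, value in enumerate(values):
        # bisect_right counts the thresholds that are <= value.
        groups[bisect_right(cuts, value)].append(index)
    return groups


print(group_by_threshold([0.2, 1.6, 3.1, 5.0, 2.0], [1.5, 3, 5]))
# {0: [0], 1: [1, 4], 2: [2], 3: [3]}
```

A different replacement percentage could then be applied to each group.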
  • codon pairs above a highest threshold are replaced, while the same or a lower percentage of codon pairs are replaced from codon pair groups corresponding to one or more lower thresholds.
  • the same or a lower percentage of codon pairs are replaced.
  • all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair is located within an autonomous folding unit.
  • all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair can be replaced without requiring a change in the encoded polypeptide sequence.
  • all codon pairs above a highest threshold are replaced, while a codon pair above a first higher intermediate threshold is replaced only if the codon pair can be replaced without changing the encoded polypeptide sequence or with only a conservative change to the encoded polypeptide sequence, while a codon pair above a second lower intermediate threshold is replaced only if the codon pair can be replaced without requiring any change in the encoded polypeptide sequence.
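The three-tier rule just described can be written as a short decision function. This is a sketch only: the function name, the parameter names and the `change` labels ("none", "conservative", "other") are illustrative assumptions, with `change` describing the effect on the encoded polypeptide of the best available replacement.

```python
def should_replace(value, change, highest, upper_mid, lower_mid):
    """Decide whether to replace a codon pair under a three-tier rule.

    value   -- translational kinetics value of the codon pair
    change  -- "none", "conservative" or "other": effect of the best
               available replacement on the encoded polypeptide
    """
    if value > highest:
        return True  # always replaced
    if value > upper_mid:
        return change in ("none", "conservative")
    if value > lower_mid:
        return change == "none"  # only if the polypeptide is unchanged
    return False


print(should_replace(4.0, "conservative", highest=5, upper_mid=3, lower_mid=2))  # True
print(should_replace(2.5, "conservative", highest=5, upper_mid=3, lower_mid=2))  # False
```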
  • an evaluation method can be used that determines the degree to which a codon pair should be replaced according to the translational kinetics value of the codon pair, where the degree to which the codon pair should be replaced can be counterbalanced by any of a variety of user-determined factors such as, for example, presence of the codon pair within or between autonomous folding units, and degree of change to the encoded polypeptide sequence.
  • a translational kinetics value of a codon pair is a representation of the degree to which it is expected that a codon pair is associated with a translational pause. Methods of determining the translational kinetics value of a codon pair are discussed elsewhere herein. Such translational kinetics values can be normalized to facilitate comparison of translational kinetics values between species. In some embodiments, the translational value can be the degree of over-representation of a codon pair. An over-represented codon pair is a codon pair which is present in a protein-encoding sequence in higher abundance than would be expected if all codon pairs were statistically randomly abundant.
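One simple way to express the degree of over-representation is the ratio of observed codon pair counts to the counts expected if codons paired at random. The sketch below uses that ratio purely as an illustration; it is not asserted to be the statistic used in the methods incorporated by reference, and `codon_pair_ratios` is a hypothetical name.

```python
from collections import Counter


def codon_pair_ratios(coding_sequences):
    """Return observed/expected ratios for each adjacent codon pair.

    The expected count assumes codons pair independently, in proportion
    to their individual frequencies. A ratio above 1 indicates an
    over-represented pair under that assumption.
    """
    codon_counts, pair_counts = Counter(), Counter()
    for seq in coding_sequences:
        usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))
    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    ratios = {}
    for (first, second), observed in pair_counts.items():
        expected = (
            (codon_counts[first] / total_codons)
            * (codon_counts[second] / total_codons)
            * total_pairs
        )
        ratios[first + second] = observed / expected
    return ratios
```

In practice such counts would be gathered over the full set of coding sequences of the host organism rather than over a single gene.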
  • a codon pair predicted to cause a translational pause or slowing is a codon pair whose likelihood of causing a translational pause or slowing is at least one standard deviation above the mean translational kinetics value, where a particular translational kinetics value above the mean translational kinetics value in this context refers to a translational kinetics value indicative of a greater likelihood of causing translational pausing or slowing, relative to a mean translational kinetics value, and is not strictly limited to a particular mathematical relationship (e.g., greater than the mean) since the depiction of propensity to cause a translational pause by a translational kinetics value can be selected to be negative or positive, based on the selected implementation by one skilled in the art.
  • over-represented codon pairs may be graphically displayed as a positive function in a SpeedPlot™, as depicted in Figure 1, where a positive deflection or peak above a selected threshold describes a translational pause or slowing at the exact nucleotide location as defined by the abscissa.
  • a threshold for the translational kinetics value of codon pairs that are predicted to cause a translational pause or slowing can be set in accordance with the method and level of stringency desired by one skilled in the art.
  • a threshold value can be set to 5, or 3, or 2, or 1.5 standard deviations or more above the mean.
  • Typical threshold values can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean.
  • a plurality of thresholds can be applied in the herein-provided methods in segregating codon pairs into a plurality of groups. Each threshold of such a plurality can be a different value selected from 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean.
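A threshold expressed in standard deviations above the mean can be applied as in the following sketch. The helper names are illustrative, the population standard deviation is an arbitrary choice made here, and higher values are assumed to indicate a greater likelihood of pausing.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each translational kinetics value in standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: nothing stands out
    return [(v - mu) / sigma for v in values]


def predicted_pauses(values, cutoff=1.5):
    """Return indices of codon pairs at least `cutoff` standard deviations above the mean."""
    return [i for i, z in enumerate(z_scores(values)) if z >= cutoff]


values = [1, 1, 1, 1, 1, 1, 1, 1, 1, 11]  # mean 2, standard deviation 3
print(predicted_pauses(values, cutoff=1.5))  # [9]
```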
  • translational kinetics of an mRNA into polypeptide can be changed to add or retain one or more translational pauses predicted to occur before, after or within an autonomous folding unit of a protein, or between autonomous folding units. While not intending to be limited to the following, it is proposed that translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure in the domain prior to further downstream translation and reorganization or reconfiguration of the growing polypeptide or domain. By modifying the translational kinetics of complex multi-domain proteins it may be possible to experimentally alter the time each domain has available to organize.
  • Folding of a heterologously-expressed gene having two or more independent domains can be altered by the presence of pause sites between the domains. Refolding studies indicate that the time it takes for a protein to settle into its final configuration may take longer than the translation of the protein. Pausing may allow each domain to partially organize and commit to a particular, independent fold. Other co-translational events, such as those associated with co-factors, protein subunits, protein complexes, membranes, chaperones, secretion, or proteolysis complexes, also can depend on the kinetics of the emerging nascent polypeptide. Pauses can be introduced by engineering one codon pair predicted to cause a translational pause or slowing, or two or more such codon pairs into the sequence to facilitate these co-translational interactions.
  • typically a translational pause is preserved, which refers to maintaining the same codon pair for a polypeptide-encoding nucleotide sequence that is expressed in the native host organism, or, when the polypeptide-encoding nucleotide sequence is heterologously expressed, changing the codon pair as appropriate to have a translational kinetics value comparable to or closest to the translational kinetics value of the native codon pair in the native host organism.
  • proximal codon pairs can be selected to be replaced in order to introduce a translational pause or slowing.
  • one of the 1, 2, 3, 4 or 5 most proximal codon pairs upstream (5' of the desired pause site) or one of the 1, 2, 3, 4 or 5 most proximal codon pairs downstream (3' of the desired pause site) can be chosen for replacement to introduce the translational pause or slowing.
  • the selected codon pair for replacement to introduce the translational pause or slowing is the codon pair closest to the originally desired codon pair location of the translational pause or slowing, provided the desired translational pause or slowing can be attained (e.g., 1 codon pair upstream or downstream is typically selected instead of 2 codon pairs upstream or downstream, provided the desired translational pause or slowing can be attained).
  • a translational pause or slowing can be introduced by selecting a replacement codon pair encoding a conservative amino acid substitution, such as the conservative substitutions shown in Table 1.
  • replacement of a proximal codon pair to introduce a translational pause or slowing is preferred over replacement of a codon pair resulting in a change in the encoded amino acid sequence.
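The nearest-first choice of a proximal codon pair can be sketched as a small search. Here `can_pause` is a hypothetical caller-supplied predicate reporting whether a synonymous replacement at a given codon pair index can produce the desired pause, and trying the upstream position before the downstream one at equal distance is an arbitrary assumption of this sketch.

```python
def pick_pause_position(desired, can_pause, max_offset=5):
    """Return the codon pair index closest to `desired` where a pause can be introduced.

    Positions are codon pair indices along the coding sequence. Returns
    None if no position within `max_offset` pairs qualifies.
    """
    if can_pause(desired):
        return desired
    for offset in range(1, max_offset + 1):
        # Upstream (5') is tried before downstream (3') at each distance.
        for position in (desired - offset, desired + offset):
            if position >= 0 and can_pause(position):
                return position
    return None


print(pick_pause_position(10, lambda p: p in {7, 12}))  # 12
```

If no position qualifies, a conservative amino acid substitution could be considered instead, consistent with the preference stated above.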
  • graphical displays of translational kinetics values of one or more proteins can be used to provide information to assist in the selection of a translational pause or slowing to preserve or insert in a redesigned polypeptide-encoding nucleotide sequence.
  • graphical displays of translational kinetics values can permit, for example, alignment of homologous proteins from different species and an identification, based on this alignment, of predicted translational pause or slowing sites that are conserved in the aligned proteins.
  • Such predicted translational pause or slowing sites can be preserved or inserted in a redesigned polypeptide-encoding nucleotide sequence.
  • regions between autonomous folding units in one or more proteins within a particular species can be graphically examined for the presence or absence of predicted pause sites.
  • Such graphical display methods can result in an identification of a region between autonomous folding units in which a translational pause or slowing is desirably preserved in a redesigned polypeptide-encoding sequence.
  • Methods for identifying and selecting conserved translational pauses can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007.
  • the codon pair translation kinetics values can be compared with a database of related gene sequences and conserved pause sites can be identified.
  • a synthetic gene can be designed wherein at least one conserved pause site is maintained to provide a synthetic gene with modified translation kinetics.
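Identification of conserved pause sites across related sequences can be sketched as below, assuming the translational kinetics values of the homologs have already been placed on a common alignment, with `None` marking gaps. The function name and the `min_fraction` parameter are illustrative assumptions and do not come from the cited publications.

```python
def conserved_pause_sites(aligned_profiles, threshold, min_fraction=1.0):
    """Return alignment positions where enough homologs show a predicted pause.

    aligned_profiles -- one list of values per homolog, all on the same
                        alignment coordinates; None marks a gap
    min_fraction     -- fraction of homologs that must meet the threshold
    """
    length = min(len(profile) for profile in aligned_profiles)
    sites = []
    for pos in range(length):
        hits = sum(
            1
            for profile in aligned_profiles
            if profile[pos] is not None and profile[pos] >= threshold
        )
        if hits / len(aligned_profiles) >= min_fraction:
            sites.append(pos)
    return sites


profiles = [
    [0.1, 3.2, 0.5, 2.9],
    [0.3, 2.8, None, 0.2],
    [0.0, 4.0, 0.4, 3.1],
]
print(conserved_pause_sites(profiles, threshold=2.5))  # [1]
```

Positions returned by such a check are candidates for pause sites to maintain in a synthetic gene.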
  • codon pairs are associated with translational pauses, and can thereby influence translational kinetics of an mRNA into polypeptide.
  • the methods of changing translational kinetics provided herein will typically be performed by modifying or designing one or more nucleotide sequences encoding a polypeptide to be expressed. Accordingly, provided herein are methods of modifying a gene or designing a synthetic nucleotide sequence encoding the polypeptide encoded by the gene, collectively referred to herein as redesigning a polypeptide-encoding gene sequence or redesigning a polypeptide-encoding nucleotide sequence.
  • a sugar catabolic enzyme-encoding DNA sequence wherein the encoded sequence has at least a 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type sugar catabolic enzyme polypeptide sequence as set forth in SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302.
  • At least 1, 2 or 3 codon pairs of a polynucleotide sequence encoding the sugar catabolic enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the DNA sequence is optimized for expression in S. cerevisiae, E. coli, P. pastoris, K. lactis or Z. mobilis.
  • a sugar catabolic enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in the wild-type nucleotide sequence and which encode a functional domain of the sugar catabolic enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for functional domains are known in the art.
  • the replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. That is, the embodiments in which one or more codon pairs encoding amino acids of a functional domain of one of the encoded polypeptides provided herein have been replaced include embodiments in which the nucleotide sequence encoding the functional domain is changed to increase the predicted translational kinetics of translation of the functional domain. As provided herein, incomplete translation, improper folding, or other protein expression shortcomings can result from the presence of one or more translational pauses in a heterologously-expressed polypeptide.
  • the replacement codons i.e., the codons added as replacements for the wild type codons, are typically predicted to be less likely to cause a translational pause.
  • the replacement codon can have a translational kinetics value in the heterologous host organism that is 95% ; 90% : 85%, 80% : 75%, 70%, or less, than the translational kinetics value of the wild type codon pair when expressed in the heterologous host organism.
  • the replacement codon is selected to have a translational kinetics value similar to the translational kinetics value of the wild type codon pair in the native organism.
  • the z score of at least one replacement codon pair when expressed in the heterologous host organism can be no more than 250%, 200%, 150%, 125% or 100% of the z score for the wild type codon pair when expressed in the native organism.
  • a sugar catabolic enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between domains of the sugar catabolic enzyme, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the domains are known in the art and are described in detail below.
  • a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the aldo/keto reductase domain of the xylose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for aldo/keto reductase domains are known in the art.
  • the aldo/keto reductase domain includes at least amino acids 6-300 or 5-301.
  • a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the aldo/keto reductase domain of the xylose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the aldo/keto reductase domain are described hereinabove.
  • a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the aldo/keto reductase domain of the xylose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for aldo/keto reductase domains are known in the art.
  • the aldo/keto reductase domain includes at least amino acids 11-306, 12-307 or 3-324.
  • a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the aldo/keto reductase domain of the xylose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the aldo/keto reductase domain are described hereinabove.
  • a xylitol dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the alcohol dehydrogenase GroES-like domain of the xylitol dehydrogenase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase GroES-like domains are known in the art.
  • the alcohol dehydrogenase GroES-like domain includes at least amino acids 28-146 or 27-147.
  • a xylitol dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the zinc-binding dehydrogenase domain of the xylitol dehydrogenase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for zinc-binding dehydrogenase domains are known in the art.
  • the zinc-binding dehydrogenase domain includes at least amino acids 175-314 or 174-315.
  • a xylitol dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the zinc-binding dehydrogenase domain and the alcohol dehydrogenase GroES-like domain of the xylitol dehydrogenase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the zinc-binding dehydrogenase domain and the alcohol dehydrogenase GroES-like domain are described hereinabove.
  • a xylitol dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the alcohol dehydrogenase GroES-like domain of the xylitol dehydrogenase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase GroES-like domain are described hereinabove.
  • a D-xylulokinase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the FGGY carbohydrate kinase domain of the D-xylulokinase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for FGGY carbohydrate kinase domains are known in the art.
  • the FGGY carbohydrate kinase domain includes at least amino acids 12-312 or 11-313.
  • a D-xylulokinase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the FGGY carbohydrate kinase domain of the D-xylulokinase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the FGGY carbohydrate kinase domain are described hereinabove.
  • the conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase GroES-like domains are known in the art.
  • the alcohol dehydrogenase GroES-like domain includes at least amino acids 54-163 or 53-164.
  • the conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase zinc binding domains are known in the art.
  • the alcohol dehydrogenase zinc binding domain includes at least amino acids 191-365 or 192-366.
  • a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the alcohol dehydrogenase GroES-like domain of the L-arabinitol 4-dehydrogenase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase GroES-like domain are described hereinabove.
  • a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the alcohol dehydrogenase GroES-like domain and the alcohol dehydrogenase zinc binding domain of the L-arabinitol 4-dehydrogenase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase zinc binding domain are described hereinabove.
  • a L-xylulose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the short-chain dehydrogenase/reductase domain of the L-xylulose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for short-chain dehydrogenase/reductase domains are known in the art.
  • the short-chain dehydrogenase/reductase domain includes at least amino acids 13-194 or 8-267.
  • a L-xylulose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the short-chain dehydrogenase/reductase domain of the L-xylulose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the short-chain dehydrogenase/reductase domain are described hereinabove.
  • a L-xylulose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the short-chain dehydrogenase/reductase domain of the L-xylulose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for short-chain dehydrogenase/reductase domains are known in the art.
  • the short-chain dehydrogenase/reductase domain includes at least amino acids 20-193 or 10-261.
  • a xylose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the xylose isomerase type TIM barrel domain of the xylose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for xylose isomerase type TIM barrel domains are known in the art.
  • the xylose isomerase type TIM barrel domain includes at least amino acids 77-285 or 76-286.
  • a xylose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the xylose isomerase type TIM barrel domain of the xylose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the xylose isomerase type TIM barrel domain are described hereinabove.
  • a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for arabinose isomerase domains are known in the art.
  • the arabinose isomerase domain includes at least amino acids 9-471 or 8-472.
  • a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the arabinose isomerase domain are described hereinabove.
  • a L-ribulokinase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the carbohydrate kinase domain of the L-ribulokinase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for carbohydrate kinase domains are known in the art.
  • the carbohydrate kinase domain includes at least amino acids 59-549 or 60-548.
  • a L-ribulokinase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the carbohydrate kinase domain of the L-ribulokinase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the carbohydrate kinase domain are described hereinabove.
  • a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the aldolase domain of the L-ribulose-5-P 4-epimerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for aldolase domains are known in the art.
  • the aldolase domain includes at least amino acids 7-218 or 8-217.
  • a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the aldolase domain of the L-ribulose-5-P 4-epimerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the aldolase domain are described hereinabove.
  • a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the aldo/keto reductase domain of the xylose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for aldo/keto reductase domains are known in the art.
  • the aldo/keto reductase domain includes at least amino acids 10-305 or 9-306.
  • a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the aldo/keto reductase domain of the xylose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the aldo/keto reductase domain are described hereinabove.
  • a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for arabinose isomerase domains are known in the art.
  • the arabinose isomerase domain includes at least amino acids 7-487.
  • a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the arabinose isomerase domain are described hereinabove.
  • a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for arabinose isomerase domains are known in the art.
  • the arabinose isomerase domain includes at least amino acids 9-483.
  • a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
  • the conserved amino acid sequence pattern and domain boundaries for the arabinose isomerase domain are described hereinabove.
  • polypeptide-encoding nucleotide sequence provided herein to modify the translational kinetics of the polypeptide-encoding nucleotide sequence, where the polypeptide-encoding nucleotide sequence is altered such that one or more codon pairs have a decreased likelihood of causing a translational pause or slowing relative to the unaltered polypeptide-encoding nucleotide sequence.
  • one or more nucleotides of a polypeptide-encoding nucleotide sequence can be changed such that a codon pair containing the changed nucleotides has a translational kinetics value indicative of a decreased likelihood of causing a translational pause or slowing relative to the unchanged polypeptide-encoding nucleotide sequence.
  • the redesigned polypeptide-encoding nucleotide sequence need not possess a high degree of identity to the polypeptide-encoding nucleotide sequence of the original gene; in some embodiments, the redesigned polypeptide-encoding nucleotide sequence will have at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% nucleotide identity with the polypeptide-encoding nucleotide sequence of the original gene.
  • an original gene refers to a gene for which codon pair refinement is to be performed; such original genes can be.
  • polynucleotide sequence will be completely synthetic, and will bear much lower identity with the original gene, e.g., no more than 90%, 80%, 70%, 60%, 50%, 40%, or lower.
  • the resulting sequence can be designed to: (1) reduce or eliminate translational problems caused by inappropriate ribosome pausing, such as those caused by over-represented codon pairs or other codon pairs with translational values predictive of a translational pause; (2) have codon usage refined to avoid over-reliance on rare codons; (3) reduce in number or remove particular restriction sites, splice sites, internal Shine-Dalgarno sequences, or other sites that may cause problems in cloning or in interactions with the host organism; or (4) have controlled RNA secondary structure to avoid detrimental translational termination effects, translation initiation effects, or RNA processing, which can arise from, for example, RNA self-hybridization.
  • this sequence also can be designed to avoid oligonucleotides that mis-hybridize, resulting in genes that can be assembled from refined oligonucleotides that by thermodynamic necessity only pair up in the desired manner, using methods known in the art, as exemplified in U.S. Patent Publication No. 2005/0106590, which is hereby incorporated by reference in its entirety.
  • polypeptide-encoding nucleotide sequence it is not possible to modify the polypeptide-encoding nucleotide sequence to suitably modify the translational kinetics of the mRNA into polypeptide without modifying the amino acid sequence of the encoded polypeptide.
  • an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made.
  • the change is preferably predicted to not substantially influence the final three-dimensional structure of the protein and/or the activity of the protein.
  • Such non-identical polypeptides can vary by containing one or more insertions, deletions and/or mutations.
  • polypeptide sequence can vary according to the purpose of the change; typically such a change results in a polypeptide that is at least 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical to the wild type polypeptide sequence.
  • the sequence of the polynucleotide can be generated, optionally in conjunction with optimization of a plurality of parameters where one such parameter can be codon pair usage, where the resultant polynucleotide can be prepared by assembly of a plurality of oligonucleotides sufficiently small to be synthesized by known oligonucleotide synthetic methods.
  • Methods known in the art for optimizing multiple parameters in synthetic nucleotide sequences can be applied to optimizing the parameters recited in the present claims. Such methods may advantageously include those exemplified in U.S. Patent App. Publication No. 2005/0106590, U.S. Patent App. Publication No. 2007/0009928, and R. H.
  • an exemplary method for generating a sequence can also include dividing the desired sequence into a plurality of partially overlapping segments; optimizing the melting temperatures of the overlapping regions of each segment to disfavor hybridization to the overlapping segments which are non-adjacent in the desired sequence; allowing the overlapping regions of single stranded segments which are adjacent to one another in the desired sequence to hybridize to one another under conditions which disfavor hybridization of non-adjacent segments; and filling in, ligating, or repairing the gaps between the overlapping regions, thereby forming a double-stranded DNA with the desired sequence.
  • This process can be performed manually or can be automated, e.g., in a general purpose digital computer.
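
As a rough illustration of the segmentation step described above, the following Python sketch divides a sequence into partially overlapping segments and estimates the melting temperature of each overlap. It is a minimal sketch under stated assumptions, not the method of the cited publications: the function names (`split_overlapping`, `wallace_tm`, `overlap_tms`) are hypothetical, fixed segment and overlap lengths are assumed, and the Wallace rule (2 °C per A/T, 4 °C per G/C) stands in for whatever thermodynamic model a real design tool would use.

```python
def split_overlapping(seq, seg_len, overlap):
    """Split seq into segments of seg_len that share `overlap` bases with the next one."""
    if not 0 < overlap < seg_len:
        raise ValueError("overlap must be positive and smaller than seg_len")
    step = seg_len - overlap
    segments = []
    start = 0
    while True:
        segments.append(seq[start:start + seg_len])
        if start + seg_len >= len(seq):
            break  # the last segment reaches the end of the sequence
        start += step
    return segments


def wallace_tm(oligo):
    """Rough melting temperature (deg C) by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def overlap_tms(segments, overlap):
    """Estimated Tm of the junction between each pair of adjacent segments."""
    return [wallace_tm(seg[-overlap:]) for seg in segments[:-1]]
```

A design loop could compare these junction estimates with the estimated stability of unintended pairings between non-adjacent segments and adjust codon choices until the gap between the two is acceptable.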
  • the search of possible codon assignments is mapped into an anytime branch and bound computerized algorithm developed for biological applications.
  • a synthetic nucleotide sequence for the polynucleotides provided herein, where the synthetic nucleotide sequence also is typically designed to have desirable translational kinetics properties, such as the removal of some or all codon pairs predicted to result in a translational pause or slowing.
  • Such design methods include determining a set of partially overlapping segments with optimized melting temperatures, and determining the translational kinetics of the synthetic sequence, where if it is desired to change the translational kinetics of the synthetic gene, the sequences of the overlapping segments are modified and refined in order to approximate the desired translational kinetics while still possessing acceptable hybridization properties. In some embodiments, this process is performed iteratively.
  • a criterion is established for selecting codon pairs having high translational kinetics values to be replaced with codon pairs having lower translational kinetics values unless a codon pair of this group is the site of a planned pause.
  • the top 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% of codon pairs ranked by translational kinetics values can be replaced by codon pairs having lower translational kinetics values, such as translational kinetics value below a user defined level that can be, for example, a translational kinetics value equal to or below the translational kinetics values of codon pairs not in the top selected percentage, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced).
  • all codon pairs above a user-selected translational kinetics value such as more than 5, 4.5, 4, 3.5, 3, 2.5 or 2 standard deviations above the mean translational kinetics value can be replaced by codon pairs having lower translational kinetics values, such as translational kinetics value below a user defined level that can be, for example, a translational kinetics value that is 4, 3.5, 3, 2.5, 2, 1.5 or 1 standard deviations less than the mean translational kinetics value, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced).
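
The standard-deviation criterion described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `flag_pairs`, the `scores` dictionary of codon pair to translational kinetics value, and the choice to compute the mean and standard deviation over that dictionary are all assumptions made for the example, and real values would come from a host-specific data set.

```python
from statistics import mean, pstdev


def flag_pairs(sequence, scores, k=2.0, planned_pauses=frozenset()):
    """Return indices of codon pairs scoring more than k SDs above the mean.

    sequence        in-frame coding sequence (string of A/C/G/T)
    scores          hypothetical dict: 6-nt codon pair -> translational kinetics value
    planned_pauses  indices of codon pairs that are deliberately left in place
    """
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    values = list(scores.values())
    cutoff = mean(values) + k * pstdev(values)
    flagged = []
    for i in range(len(codons) - 1):
        if i in planned_pauses:
            continue  # a planned pause is not necessarily replaced
        pair = codons[i] + codons[i + 1]
        # Pairs missing from the table are treated as unremarkable (score 0.0).
        if scores.get(pair, 0.0) > cutoff:
            flagged.append(i)
    return flagged
```

The top-percentage criterion could be implemented the same way by replacing the cutoff with the corresponding percentile of the score distribution.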
  • polynucleotide sequence design methods provided herein can be employed where a plurality of properties of the polynucleotide sequences can be refined in addition to codon pair usage properties, where such properties can include, but are not limited to, melting temperature gap between oligonucleotides of synthetic gene, average codon usage, average codon pair chi-squared (e.g., z score), worst codon usage, worst codon pair (e.g., z score), maximum usage in adjacent codons, Shine-Dalgarno sequence (for E. coli expression), occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, user-prohibited sequences (e.g., other restriction sites), codon usage of a specific codon above user-specified limit, and out-of-frame stop codons (framecatchers).
  • additional properties that can be considered in a process of designing a polynucleotide sequence include, but are not limited to, occurrences of RNA splice sites, occurrences of polyA sites, and occurrence of ribosome binding sequence.
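
One of the sequence properties listed above, runs of 5 or more consecutive G's or C's and 6 or more consecutive A's or T's, is straightforward to check with a regular expression. The sketch below is an assumption-laden illustration; the name `homopolymer_runs` is hypothetical and the thresholds are taken directly from the text.

```python
import re

# Runs of >=5 G or C, or >=6 A or T, as named in the property list.
RUNS = re.compile(r"G{5,}|C{5,}|A{6,}|T{6,}")


def homopolymer_runs(seq):
    """Return (start_index, run) for every flagged homopolymer run in seq."""
    return [(m.start(), m.group()) for m in RUNS.finditer(seq.upper())]
```

A design tool could treat a non-empty result as a constraint violation and try alternative synonymous codons at the reported positions.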
  • a process of designing a polynucleotide sequence can include constraints including, but not limited to, minimum melting temperature gap between oligonucleotides of synthetic gene, minimum average codon usage, maximum average codon pair chi-squared (z score), minimum absolute codon usage, maximum absolute codon pair (z score), minimum maximum usage in adjacent codons, no Shine-Dalgarno sequence (for E.
  • additional constraints can include, but are not limited to, minimum occurrences of RNA splice sites, minimum occurrences of polyA sites, and occurrence of ribosome binding sequence.
  • a process of designing a polynucleotide sequence can include preferences including, but not limited to, prefer high average codon usage, prefer low average codon pair chi-squared, prefer larger melting temperature gap, prefer more out of frame stop codons (framecatchers), and optionally prefer evenly distributed codon usage.
  • Any of a variety of nucleotide sequence refinement/optimization methods known in the art can be used to refine the polynucleotide sequence according to the codon pair usage properties, and according to any of the additional properties specifically described above, or other properties that are refined in nucleotide sequence redesign methods known in the art.
  • a branch and bound method is employed to refine the polynucleotide sequence according to codon pair usage properties and at least one additional property, such as codon usage.
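
To make the branch and bound idea concrete, the following Python sketch searches the synonymous codon assignments for a short peptide and keeps the assignment with the lowest total codon pair cost. It is a simplified depth-first branch and bound, not the anytime algorithm referred to in this document, and everything named here (`refine`, `codon_table`, `pair_cost`) is a hypothetical stand-in. The pruning step assumes all pair costs are non-negative, so a partial assignment that already costs as much as the best complete one cannot improve on it.

```python
def refine(peptide, codon_table, pair_cost):
    """Pick one codon per residue so that the summed codon pair cost is minimal.

    peptide      string of one-letter amino acid codes
    codon_table  hypothetical dict: amino acid -> list of synonymous codons
    pair_cost    hypothetical dict: 6-nt codon pair -> non-negative cost
    """
    best = {"cost": float("inf"), "codons": None}

    def search(pos, chosen, cost):
        if cost >= best["cost"]:
            return  # bound: this branch cannot beat the best solution so far
        if pos == len(peptide):
            best["cost"], best["codons"] = cost, list(chosen)
            return
        for codon in codon_table[peptide[pos]]:  # branch on each synonymous codon
            step = pair_cost.get(chosen[-1] + codon, 0.0) if chosen else 0.0
            chosen.append(codon)
            search(pos + 1, chosen, cost + step)
            chosen.pop()

    search(0, [], 0.0)
    return "".join(best["codons"]), best["cost"]
```

Additional properties, such as codon usage, could be folded in by adding a per-codon penalty to `step`, provided the combined cost stays non-negative so that the bound remains valid.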
  • the methods provided herein can further include analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that stop codons are added to at least one said frame shift.
  • the generating step further includes analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that one or more stop codons in one, two or three reading frames are added downstream of polypeptide-encoding region of the nucleotide sequence.
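
The frame-shift analysis described above amounts to reading the candidate sequence in the +1 and +2 frames and looking for stop codons. A minimal Python sketch follows; the function name `out_of_frame_stops` is hypothetical, and only the three standard stop codons are assumed.

```python
STOPS = {"TAA", "TAG", "TGA"}


def out_of_frame_stops(seq):
    """Count stop codons in the +1 and +2 reading frames of a coding sequence."""
    seq = seq.upper()
    counts = {}
    for shift in (1, 2):
        counts[shift] = sum(
            1 for i in range(shift, len(seq) - 2, 3) if seq[i:i + 3] in STOPS
        )
    return counts
```

When two synonymous candidates are otherwise equivalent, a design procedure could prefer the one with the higher counts, since a ribosome that slips out of frame would then terminate sooner.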
  • methods for redesigning a polypeptide-encoding gene for expression in a host organism, by providing a data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide.
  • Also provided herein are methods for redesigning a polypeptide-encoding gene for expression in a host organism by providing a first data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a second data set representative of at least one additional desired property of the synthetic gene, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, both (i) codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the first data set, and (ii) nucleotides that provide a desired property, with reference to the second data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide.
  • a branch and bound method is employed to refine the polypeptide-encoding nucleotide sequence according to codon pair usage properties of the first data set and according to the properties of the second data set.
  • the second data set contains codon preferences representative of codon usage by the host organism, including the most common codons used by the host organism for a given amino acid.
  • a sugar catabolic enzyme-encoding DNA sequence wherein the encoded sequence has at least 50%, 60%, 70%, 75%, 80% or 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99%, amino acid sequence identity to the wild type sugar catabolic enzyme polypeptide sequence as set forth in the sequence listing.
  • the polynucleotide provided herein is adapted for expression in a heterologous host organism.
  • a heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism.
  • the host organism is not human, E. coli or S. cerevisiae.
  • At least 1, 2 or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
  • the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein.
  • a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than a designated threshold, wherein a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value.
  • a sugar catabolic enzyme-encoding DNA sequence having at least 75% sequence identity with an original sugar catabolic enzyme polypeptide sequence as set forth in the sequence listing and adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organisms are selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (long-tailed monkey); M. mulatta (monkey); E. coli K12 W3110; E.
  • the methods provided herein can include analyzing the candidate polynucleotide sequence to confirm that no codon pairs are predicted to cause a translational pause in the host organism by more than a designated threshold.
  • the likelihood that a particular codon pair will cause translational pausing or slowing in an organism can be represented by a translational kinetics value.
  • the translational kinetics value can be expressed in any of a variety of manners in accordance with the guidance provided herein. In one example, a translational kinetics value can be expressed in terms of the mean translational kinetics value and the corresponding standard deviation for all codon pairs in an organism.
  • the translational kinetics value for a particular codon pair can be expressed in terms of the number of standard deviations that separate the translational kinetics value of the codon pair from the mean translational kinetics value.
  • a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value.
  • the methods provided herein also include generating a candidate nucleotide sequence according to codon usage.
  • With respect to codon usage, as is known in the art, different organisms can have different preferences for the three-nucleotide codon sequence encoding a particular amino acid. As a result, translation can often be improved by using the most common three-nucleotide codon sequence encoding a particular amino acid.
  • some methods provided herein also include generating a candidate nucleotide sequence such that codon utilization is non-randomly biased in favor of codons most commonly used by the host organism. Codon usage preferences are known in the art for a variety of organisms and methods for selecting the more commonly used codons are well known in the art.
  • the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize the predicted translational kinetics. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, the conflict is resolved by selecting the nucleotide sequence predicted to be translated more rapidly, for example, due to fewer predicted translational pauses.
  • the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize codon pair usage preferences. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, codon pair usage will be accorded more weight in order to resolve the conflict between the more than one possible nucleotide sequences.
  • the methods provided herein can include identifying at least one instance of a conflict between selecting common codons and avoiding codon pairs predicted to cause a translational pause; in such instances, the conflict is resolved in favor of avoiding codon pairs predicted to cause a translational pause.
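By way of a non-limiting sketch, the conflict-resolution rule described above (avoidance of predicted pauses takes priority over use of common codons) could be expressed as follows. The function `pick_codon`, the dictionaries `pair_z` and `codon_freq`, and the default threshold of 2 standard deviations are hypothetical names and values chosen for illustration; pairs absent from `pair_z` are assumed to have a value of 0.

```python
def pick_codon(prev_codon, candidates, pair_z, codon_freq, threshold=2.0):
    """Choose one of the synonymous `candidates` to follow `prev_codon`.

    pair_z maps (codon, codon) to a translational kinetics value in
    standard deviations from the mean; codon_freq maps a codon to its
    usage frequency in the host. Both are assumed inputs.
    """
    def z(codon):
        return pair_z.get((prev_codon, codon), 0.0)

    # Codons whose pair with prev_codon is not predicted to pause.
    acceptable = [c for c in candidates if z(c) <= threshold]
    if acceptable:
        # No conflict remains: take the most commonly used codon.
        return max(acceptable, key=lambda c: codon_freq.get(c, 0.0))
    # Every option exceeds the threshold: take the least pause-prone pair.
    return min(candidates, key=z)
```

In this sketch a common codon is used only when doing so does not create a codon pair above the threshold, which mirrors resolving the conflict in favor of avoiding predicted pauses.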
  • Some embodiments provided herein include generating a candidate polynucleotide sequence encoding the polypeptide sequence, the candidate polynucleotide sequence having a non-random codon pair usage, such that the codon pairs encoding any particular pair of amino acids have the lowest translational kinetics values.
  • the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the encoded amino acid sequence is not altered.
  • the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the three dimensional structure of the encoded polypeptide is not substantially altered.
  • the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that no more than conservative amino acid changes are made to the encoded polypeptide.
  • the methods provided herein can further include a step of refining or altering the candidate polynucleotide sequence in accordance with a second nucleotide sequence property to be refined.
  • the methods further include generating or refining a candidate polynucleotide sequence encoding a polypeptide sequence such that the candidate polynucleotide sequence has a non-random codon usage, where the most common codons used by the host organism are over-represented in the candidate polynucleotide sequence.
  • the methods can include refining or altering the candidate polynucleotide sequence in accordance with any of a variety of additional properties provided herein, including but not limited to, melting temperature gap between oligonucleotides of synthetic gene, Shine-Dalgarno sequence, occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, or any other user-prohibited sequences. Further, any of a variety of combinations of these properties can be additionally included in the nucleotide sequence refinement methods provided herein.
  • the method provided herein can further include an evaluation step in which after the candidate polynucleotide sequence is altered, the sequence is compared with at least a portion of a data set of a property against which the sequence was refined.
  • the candidate nucleotide sequence can be compared to each property considered in the refinement, and, if the values for all properties are deemed to be acceptable or desired, no further sequence alteration is required. If the values for fewer than all properties are deemed to be acceptable or desired, the candidate nucleotide sequence can be subjected to further sequence alteration and evaluation.
  • sequence alteration steps of methods provided herein can be performed iteratively. That is, one or more steps of altering the nucleotide sequence can be performed, and the candidate nucleotide sequence can be evaluated to determine whether or not further sequence alteration is necessary and/or desirable. These steps can be repeated until values for all properties are deemed to be acceptable or desired, or until no further improvement can be achieved.
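The iterative alter-and-evaluate cycle described above can be illustrated with the following minimal sketch, which considers a single property (codon pair values above a threshold). The function `refine`, the `synonyms` mapping (each codon to all codons encoding the same amino acid) and the `pair_z` table are assumptions introduced for illustration, and the greedy replacement strategy is one simple choice rather than the method of the text.

```python
def refine(codons, synonyms, pair_z, threshold=2.0, max_rounds=10):
    """Iteratively replace codons so that fewer adjacent codon pairs
    exceed `threshold`; stops when acceptable or when no round improves."""
    codons = list(codons)

    def z(a, b):
        return pair_z.get((a, b), 0.0)

    def excess(seq):
        # Total amount by which codon pairs exceed the threshold.
        return sum(max(0.0, z(a, b) - threshold) for a, b in zip(seq, seq[1:]))

    for _ in range(max_rounds):
        before = excess(codons)
        if before == 0.0:
            break  # every pair is acceptable
        for i in range(1, len(codons)):
            if z(codons[i - 1], codons[i]) > threshold:
                # The current codon is listed first so that ties keep it.
                options = [codons[i]] + list(synonyms.get(codons[i], []))
                codons[i] = min(
                    options,
                    key=lambda c: excess(codons[:i] + [c] + codons[i + 1:]),
                )
        if excess(codons) >= before:
            break  # no further improvement can be achieved
    return codons
```

Because only synonymous codons are substituted, the encoded amino acid sequence is unchanged, provided the `synonyms` table is correct.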
  • the methods and sequences provided herein include determination and use of translational kinetics values for codon pairs. As provided herein, such a translational kinetics value can be calculated and/or empirically measured, and the final translational kinetics value used in graphical displays and methods of predicting translational kinetics can be a refined value resultant from two or more types of codon pair translational kinetics information.
  • codon pair translational kinetics information that can be used in refining or replacing a translational kinetics value for a codon pair include, for example, values of observed versus expected codon pair frequencies in a particular organism, normalized values of observed versus expected codon pair frequencies in a particular organism, the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species, the degree to which observed versus expected codon pair frequency values are conserved at predicted pause sites such as boundaries between autonomous folding units in related proteins across two or more species, the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, and empirical measurement of translational kinetics for a codon pair.
  • the values of observed versus expected codon pair frequencies in a host organism can be determined by any of a variety of methods known in the art for statistically evaluating observed occurrences relative to expected occurrences. Regardless of the statistical method used, this typically involves obtaining codon sequence data for the organism, for example, on a gene-by-gene basis. In some embodiments, the analysis is focused only on the coding regions of the genome. Because the analysis is a statistical one, a large database is preferred. Initially, the total number of codons is determined and the number of times each of the 61 non-terminating codons appears is determined.
  • the expected frequency of each of the 3721 (61²) possible non-terminating codon pairs is calculated, typically by multiplying together the frequencies with which each of the component codons appears.
  • This frequency analysis can be carried out on a global basis, analyzing all of the sequences in the database together; however, it is typically done on a local basis, analyzing each sequence individually. This will tend to minimize the statistical effect of an unusually high proportion of rare codons in a sequence.
  • the expected number of occurrences of each codon pair is calculated by, for example, multiplying the expected frequency by the number of pairs in the sequence. This information can then be added to a global table, and each next succeeding sequence can be analyzed in like manner.
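As a hedged illustration of the per-sequence ("local") calculation described above, the following sketch computes observed and expected codon pair counts for one sequence and accumulates them into a global table. The function names are hypothetical, and each input is assumed to be a list of non-terminating codons that has already been split in frame.

```python
from collections import Counter, defaultdict
from itertools import product

def observed_and_expected(codons):
    """Observed and expected codon pair counts for a single sequence."""
    if len(codons) < 2:
        return Counter(), {}
    n_pairs = len(codons) - 1
    total = len(codons)
    freq = Counter(codons)
    observed = Counter(zip(codons, codons[1:]))
    # Expected count = freq(a) * freq(b) * number of pairs in the sequence.
    expected = {
        (a, b): (freq[a] / total) * (freq[b] / total) * n_pairs
        for a, b in product(freq, repeat=2)
    }
    return observed, expected

def accumulate(sequences):
    """Add the per-sequence counts of each sequence to a global table."""
    global_obs = defaultdict(float)
    global_exp = defaultdict(float)
    for codons in sequences:
        observed, expected = observed_and_expected(codons)
        for pair, count in observed.items():
            global_obs[pair] += count
        for pair, count in expected.items():
            global_exp[pair] += count
    return global_obs, global_exp
```

Within each sequence the expected counts sum to the number of codon pairs, so the global observed and expected totals remain comparable.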
  • the values of observed versus expected codon pair frequencies are chi-squared values, such as chi-squared 2 (chisq2) values or chi-squared 3 (chisq3) values.
  • Methods for calculating chi-squared values can be performed according to any method known in the art, as exemplified in U.S. Patent No. 5,082,767, which is incorporated by reference herein in its entirety.
  • a new value chi-squared 2 (chisq2) can be calculated as follows. For each group of codon pairs encoding the same amino acid pair (i.e., 400 groups), the sums of the expected and observed values are tallied; any non-randomness in amino acid pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal. The new chi-squared, chisq2, is evaluated using these new expected values.
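The rescaling step for chisq2 described above can be sketched as follows, for illustration only. The helper names and the `aa_of` mapping (codon to amino acid) are assumptions, and the per-pair term (observed - expected)² / expected is the conventional chi-squared contribution, assumed here because the text does not spell out the formula.

```python
from collections import defaultdict

def rescale_expected(observed, expected, aa_of):
    """Multiply each expected value by [sum observed / sum expected]
    for its amino acid pair group, so that the group sums are equal."""
    sum_obs = defaultdict(float)
    sum_exp = defaultdict(float)
    for (a, b), e in expected.items():
        group = (aa_of[a], aa_of[b])
        sum_exp[group] += e
        sum_obs[group] += observed.get((a, b), 0.0)
    rescaled = {}
    for (a, b), e in expected.items():
        group = (aa_of[a], aa_of[b])
        factor = sum_obs[group] / sum_exp[group] if sum_exp[group] else 0.0
        rescaled[(a, b)] = e * factor
    return rescaled

def chi_squared_terms(observed, expected):
    """Per-pair chi-squared contribution, skipping zero expectations."""
    return {
        pair: (observed.get(pair, 0.0) - e) ** 2 / e
        for pair, e in expected.items()
        if e > 0
    }
```

Calling `chi_squared_terms(observed, rescale_expected(observed, expected, aa_of))` then gives values in which any amino acid pair bias has been factored out.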
  • a new value chi-squared 3 (chisq3) can be calculated. Correction is made only for those dinucleotides formed between adjacent codon pairs; any bias of dinucleotides within codons (codon triplet positions I-II and II-III) will directly affect codon usage and is, therefore, automatically taken into account in the underlying calculations.
  • the sums of the expected and observed values are tallied; any non-randomness in dinucleotide pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal.
  • the new chi-squared, chisq3, is evaluated using these new expected values.
  • Dinucleotide bias represents a smaller effect in yeast, and only a very minor one in E. coli.
  • Although the predominant dinucleotide bias in human is the well-known CpG deficit, other dinucleotides are also very highly biased. For example, there is a deficit of TA, as well as an excess of TG, CA and CT. Overall, the deficit of CpG contributes only 35% of the total dinucleotide bias in the human database, and 17% in yeast.
  • the values of observed versus expected codon pair frequencies in a host organism herein can be normalized. Normalization permits different sets of values of observed versus expected codon pair frequencies to be compared by placing these values on the same numerical scale. For example, normalized codon pair frequency values can be compared between different organisms, or can be compared for different codon pair frequency value calculations within a particular organism (e.g., different calculations based on input sequence information or based on different calculations such as chisq1 or chisq2 or chisq3). Typically, normalization results in codon pair frequency values that are described in terms of their mean and standard deviation from the mean.
  • An exemplary method for normalizing codon pair frequency values is the calculation of z scores.
  • the z score for an item indicates how far and in what direction that item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
  • the mathematics of the z score transformation are such that if every item in a distribution is converted to its z score, the transformed scores will have a mean of zero and a standard deviation of one.
  • the z score transformation can be especially useful when seeking to compare the relative standings of items from distributions with different means and/or different standard deviations. Z scores are especially informative when the distribution to which they refer is normal. In a normal distribution, the distance between the mean and a given z score cuts off a fixed proportion of the total area under the curve.
  • An exemplary method for determining z scores for codon pair chi-squared values is as follows: First, a list of all 3721 possible non-terminating codon pairs is generated. Second, for the i-th codon pair, the i-th chi-squared value is calculated, where the i-th chi-squared value is denoted C_i. The chi-squared value, C_i, is given the sign of (observed - expected), so that over-represented codon pairs are assigned a positive C_i and under-represented codon pairs are assigned a negative C_i.
  • the mean chi-squared value is calculated where the mean is denoted m.
  • the standard deviation of the chi-squared values is calculated, where the standard deviation is denoted s.
  • a z score is calculated by subtracting the mean then dividing by the standard deviation, wherein the i-th z score is denoted z_i, that is, z_i = (C_i - m) / s.
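The z score procedure set out above can be sketched as follows, for illustration only. The function names are hypothetical; the per-pair chi-squared term (observed - expected)² / expected and the use of the population standard deviation are assumptions, since the text does not specify either.

```python
from statistics import mean, pstdev

def signed_chi_squared(observed, expected):
    """Chi-squared value per codon pair, given the sign of
    (observed - expected): positive for over-represented pairs."""
    signed = {}
    for pair, e in expected.items():
        if e <= 0:
            continue  # no expectation, no defined value
        o = observed.get(pair, 0.0)
        c = (o - e) ** 2 / e
        signed[pair] = c if o >= e else -c
    return signed

def z_scores(values):
    """Convert each value to (value - mean) / standard deviation."""
    data = list(values.values())
    m = mean(data)
    s = pstdev(data)  # population standard deviation (assumed)
    if s == 0:
        raise ValueError("all values are identical; z scores are undefined")
    return {key: (v - m) / s for key, v in values.items()}
```

A codon pair could then be compared against a designated threshold, for example by testing whether its z score exceeds 2.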
  • provided herein are methods of refining the predictive capability of a translational kinetics value of a codon pair in a host organism by providing an initial translational kinetics value based on the value of observed codon pair frequency versus expected codon pair frequency for a codon pair in a host organism, providing additional translational kinetics data for the codon pair in the host organism, and modifying the initial translational kinetics value according to the additional codon pair translational kinetics data to generate a refined translational kinetics value for the codon pair in the host organism.
  • the translational kinetics data that can be used to refine translational kinetics values and methods of modifying translational kinetics values according to such additional translational kinetics data to generate a refined translational kinetics value for a codon pair in a host organism are provided below.
  • translational kinetics data that can be used to refine translational kinetics values are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair.
  • Recurrence-based refinement of translational kinetics values is based on the investigation of multiple polypeptide-encoding nucleotide sequences to determine whether or not there are multiple occurrences of either codon pairs or predicted translational kinetics values in those sequences.
  • Recurrence-based refinement of translational kinetics can be performed using any of a variety of known sequence comparison methods consistent with the examples provided herein. For purposes of exemplification, and not for limitation, the following example of recurrence-based refinement of translational kinetics is provided.
  • the predicted translational kinetics value for a codon pair can be refined according to the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species.
  • related proteins are proteins having homologous amino acid sequences and/or similar three dimensional structures.
  • Related proteins having homologous amino acid sequences will typically have at least about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% sequence identity.
  • Related proteins having similar three dimensional structures will typically share similar secondary structure topology and similar relative positioning of secondary structural elements; exemplary related proteins having three dimensional structures are members of the same SCOP-classified Family (see, e.g., Murzin A. G., Brenner S.
  • the codon pair located at the position on a protein that is confirmed as, or considered to have an increased likelihood of, containing an actual translational pause or slowing can itself be confirmed as being, or considered to have an increased likelihood of being, a functional translational kinetics signal.
  • a codon pair located at a position on a protein that is confirmed as not containing, or considered to have a decreased likelihood of containing, an actual translational pause or slowing can itself be confirmed as not acting, or considered to have a decreased likelihood of acting, as a functional translational kinetics signal.
  • initially predicted translational kinetics data e.g., data based on values of observed codon pair frequency versus expected codon pair frequency
  • the predicted translational kinetics value for a codon pair can be refined according to the presence of the codon pair at a location predicted by methods other than codon pair frequency methods to contain a translational pause or slowing site.
  • a predicted location is a boundary location between autonomous folding units of a protein.
  • translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a secondary structural element of a protein and/or a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure by the nascent protein prior to further downstream translation, and thereby allowing each domain to partially organize and commit to a particular, independent fold.
  • codon pairs can be associated with translational pauses between autonomous folding units of a protein, where autonomous folding units can be secondary structural elements such as an alpha helix, or can be tertiary structural elements such as a protein domain.
  • the presence of a codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likelihood that the codon pair acts to pause or slow translation.
  • predicted translational kinetics data e.g., data based on values of observed codon pair frequency versus expected codon pair frequency
  • predicted translational kinetics data can be modified according to the presence of the codon pair at a boundary location between autonomous folding units of a protein, which can increase the likelihood that the codon pair acts to pause or slow translation.
  • an over-represented codon pair that is present at a boundary location between autonomous folding units of a protein can be confirmed as acting as a translational pause or slowing codon pair.
  • a single observation of the codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likely translational pause or slowing properties of a codon pair.
  • typically a plurality of observations will be used to more accurately estimate the translational pause or slowing properties of a codon pair.
  • methods of using, for example, predicted boundary locations can be combined with methods that are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair in methods of refining a predicted translational kinetics value for a codon pair.
  • a protein present in two or more species can have conserved boundary locations between autonomous folding units of the protein, and recurrent presence of an over-represented codon pair at the boundary locations can confirm the likelihood of an actual translational pause at that boundary location, leading to confirmation, or increased likelihood, that the corresponding codon pair for the respective species acts as a translational pause or slowing codon pair.
  • two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of an over-represented codon pair at the boundary locations can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
  • Such recurrence-based methods also can be used to confirm or indicate increased likelihood that a non-over-represented codon pair (e.g., an under-represented codon pair or a represented-as-expected codon pair) acts as a translational pause or slowing codon pair.
  • two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of a non-over-represented codon pair at the boundary locations, particularly if no over-represented codon pair is present, can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
  • Such recurrence-based methods also can be used to confirm or indicate the likelihood that a codon pair, such as an over-represented codon pair, does not act as a translational pause or slowing codon pair.
  • two or more proteins of the same species can have boundary locations between autonomous folding units, and consistent absence of a non-over-represented codon pair at the boundary locations can confirm or indicate increased likelihood that the codon pair does not act as a translational pause or slowing codon pair.
  • the predicted translational kinetics value for a codon pair can be refined according to empirical measurement of translational kinetics for a codon pair.
  • the influence of a codon pair on translational kinetics can be experimentally measured, and these experimental measurements can be used to refine or replace the predicted translational kinetics values for a codon pair.
  • Several methods of experimentally measuring the translational kinetics of a codon pair are known in the art, and can be used herein, as exemplified in Irwin et al., J. Biol. Chem., (1995) 270:22801.
  • One such exemplary assay is based on the observation that a ribosome pausing at a site near the beginning of an mRNA coding sequence can inhibit translation initiation by physically interfering with the attachment of a new ribosome to the message, and, thus, the codon pair to be assayed can be placed at the beginning of a polypeptide-encoding nucleotide sequence and the effect of the codon pair on translational initiation can be measured as an indication of the ability of the codon pair to cause a translational pause.
  • Another such exemplary assay is based on the fact that the transit time of a ribosome through the leader polypeptide coding region of the leader RNA of the trp operon sets the basal level of transcription through the trp attenuator, and, thus, the codon pair to be assayed can be placed into a trpLep leader polypeptide codon region, and level of expression can be inversely indicative of the translational pause properties of the codon pair, due to a faster translation causing formation of a stem-loop attenuator in the leader RNA, which results in transcriptional attenuation.
  • the methods provided herein for calculation of translational kinetics values can be applied to the native organism of the polypeptide of SEQ ID NOS: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302, and also can be applied to a selected organism in which the polypeptide of SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302, or a modification thereof, is to be heterologously expressed.
  • the nucleotide sequence information of an organism can be used to calculate chi-squared values in accordance with the methods provided herein, and the translational kinetics values can be based on these chi-squared values as well as on additional translational kinetics information provided herein, including, but not limited to, codon pairs conserved in domain boundaries and empirically measured translational kinetics for a codon pair.
  • the translational kinetics data described herein can be combined in such a manner as to provide a refined translational kinetics value for a codon pair in a host organism.
  • Methods of combining predictive data to arrive at a refined predictive value are known in the art and can be used herein.
  • an hypothesis H is that a given sequence feature, e.g., a given codon pair, has utility for translational kinetics engineering, e.g., creates a translational pause site.
  • H) P(D1 & D2 & D3 & D4
  • H) P(D1
  • H) P(Di
  • P(Di is correct) and P(Di is not correct) can be estimated a priori by the correlation of Di with previous experimental measurements.
  • H) are obtained by observing whether or not hypothesis H is consistent with observed data item Di. More complex and powerful Bayesian approaches are also well known to the art. The fully general approach rewrites P(D
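The equations in the preceding passage are truncated in this text, so the following is not a reconstruction of them. It is a generic sketch of how several data items D1, D2, ... bearing on a hypothesis H can be combined with Bayes' rule under the simplifying assumption that the data items are conditionally independent given H and given not-H. The function name `posterior` and all numerical inputs are hypothetical.

```python
def posterior(prior, likelihoods):
    """Posterior probability P(H | D1, ..., Dn).

    prior        P(H) before any data is considered
    likelihoods  list of (P(Di | H), P(Di | not H)) tuples, one per
                 data item, assumed conditionally independent
    """
    p_h = prior
    p_not_h = 1.0 - prior
    for given_h, given_not_h in likelihoods:
        p_h *= given_h
        p_not_h *= given_not_h
    evidence = p_h + p_not_h
    if evidence == 0:
        raise ValueError("the data are impossible under both hypotheses")
    return p_h / evidence
```

For instance, a codon pair with a prior pause probability of 0.1 that is supported by two data items with likelihood pairs (0.8, 0.2) and (0.6, 0.3) would receive a posterior of about 0.47 under these assumptions.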
  • the translational kinetics values for a codon pair can be refined by consideration of, for example, chi-squared value of observed versus expected codon pair frequency and the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, for example, at protein structure domain boundaries.
  • An over-represented codon pair which is present with above-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting higher predicted translational pause properties of the codon pair.
  • an over-represented codon pair which is present with below-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting lower predicted translational pause properties of the codon pair.
  • the translational kinetics values for a codon pair can be refined by consideration of, for example, experimentally measured translation step times in one species and the degree to which codon pairs that correspond to measured pause sites in the first species are conserved across homologous proteins in other species, for example, in a multiple sequence alignment.
  • When an over-represented codon pair in another species is aligned with above-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting higher predicted translational pause properties of that codon pair in the other species.
  • When an over-represented codon pair in another species is aligned with below-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting lower predicted translational pause properties of that codon pair in the other species.
  • translational kinetics values for codon pairs can be determined.
  • the translational kinetics values can be organized according to the likelihood of causing a translational pause or slowing based on any method known in the art.
  • the translational kinetics values for two or more codon pairs, up to all codon pairs, in an organism are determined, and the mean translational kinetics value and associated standard deviation are calculated. Based on this, the translational kinetics value for a particular codon pair can be described in terms of the number of standard deviations by which the translational kinetics value for the particular codon pair differs from the mean translational kinetics value. Accordingly, reference herein to mean translational kinetics values and standard deviations, whether or not applied to a particular expression of translational kinetics value, can be applied to any of a variety of expressions of translational kinetics values provided herein.
  • Such a graphical display provides a visual display of the predicted translational influence, including translational pause or slowing for numerous or all codon pairs of a polypeptide-encoding nucleotide sequence.
  • This visual display can be used in methods of modifying polypeptide-encoding nucleotide sequences in order to thereby modify the predicted translational kinetics of the mRNA into polypeptide in methods such as those provided herein.
  • the graphical displays can be used to identify one or more codon pairs to be modified in a polypeptide-encoding nucleotide sequence.
  • the graphical displays can be used in analyzing a polypeptide-encoding nucleotide sequence prior to modifying the polypeptide-encoding nucleotide sequence, or can be used in analyzing a modified polypeptide-encoding nucleotide sequence to determine, for example, whether or not further modifications are desired.
  • the graphical displays can be created using translational kinetics values based on any of the methods for determining translational kinetics values provided herein or otherwise known in the art. For example, chi-squared as a function of codon pair position, chi-squared 2 as a function of codon position, or chi-squared 3 as a function of codon pair position, translational kinetics values thereof, empirical measurement of translational pause of codon pairs in a host organism, estimated translational pause capability based on observed presence and/or recurrence of a codon pair at predicted pause site, and variations and combinations thereof as provided herein.
  • the exact format of the graphical displays can take any of a variety of forms, and the specific form is typically selected for ease of analysis and comparison between plots.
  • the abscissa typically lists the position along the nucleotide sequence or polypeptide sequence, and can be represented by nucleotide position, codon position, codon pair position, amino acid position, or amino acid pair position.
  • the ordinate typically lists the translational kinetics value of the codon pair, such as, but not limited to, a translational kinetics value of codon pair frequency, including, but not limited to, the z score of chisq1, the z score of chisq2, the z score of chisq3, the empirically measured value, and the refined translational kinetics value.
  • the sequence position can be plotted along the ordinate and the translational kinetics value can be plotted along the abscissa.
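  • The data behind such a display can be sketched as a list of (position, value) points, with codon pair position on the abscissa and a translational kinetics value on the ordinate. The helper names, the example sequence, the per-pair values, and the threshold of 3.0 are all assumptions made for illustration; codon pairs without a supplied value default to 0.0.

```python
def codon_pairs(cds):
    """Yield (codon_pair_position, hexamer) for each pair of adjacent codons."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    for pos in range(len(codons) - 1):
        yield pos + 1, codons[pos] + codons[pos + 1]


def kinetics_profile(cds, value_by_pair, default=0.0):
    """Return [(position, value)] points, one per codon pair, ready for plotting."""
    return [(pos, value_by_pair.get(pair, default)) for pos, pair in codon_pairs(cds)]


def predicted_pauses(profile, threshold):
    """Return positions whose value meets or exceeds the chosen threshold (a peak)."""
    return [pos for pos, value in profile if value >= threshold]


# Hypothetical sequence and values (not taken from the specification).
example_cds = "ATGGCGAAAGAACTG"
example_values = {"GCGAAA": 3.2, "AAAGAA": -0.5}
example_profile = kinetics_profile(example_cds, example_values)
example_peaks = predicted_pauses(example_profile, threshold=3.0)
```

  • Swapping the two elements of each point gives the alternative orientation in which sequence position is plotted along the ordinate.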
  • As an example, a graphical display of translational kinetics is depicted in Figure 1, where each positive deflection or peak describes a predicted translational pause or slowing at the nucleotide location as defined by the abscissa.
  • Comparing plots
  • a set of graphical displays including at least a first graphical display and a second graphical display, are prepared. These sets of displays can be compared in order to determine the difference in predicted translational efficiency or translational kinetics of the two plots.
  • the plots can differ according to any of a variety of criteria. For example, each plot can represent a different polypeptide-encoding nucleotide sequence, each plot can represent a different host organism, each plot can represent differently determined translational kinetics values, or any combination thereof.
  • any number of different graphical displays can be compared in accordance with the methods provided herein, for example, 2, 3, 4, 5, 6, 7, 8 or more different graphical displays can be compared.
  • two plots will represent different polypeptide-encoding nucleotide sequences, the same sequence in different host organisms, or different sequences in different host organisms.
  • Comparison of different graphical displays can be used to analyze the predicted change in translational kinetics as a result of the difference represented by the graphical displays. For example, comparison of the same polypeptide-encoding nucleotide sequence in different host organisms can be used to analyze any predicted translational pauses that can be removed. Accordingly, provided herein are methods of analyzing translational kinetics of an mRNA into polypeptide in a host organism by comparing two graphical displays to understand or predict the differences in translational kinetics of the mRNA into polypeptide, where the differences in the graphical displays can be as a result of, for example, a difference in the polypeptide-encoding nucleotide sequence or a difference in the host organism.
  • a graphical display of the translational kinetics values of codon pairs for the original polypeptide- encoding nucleotide sequence in the heterologous host can be compared to a graphical display of the translational kinetics values of codon pairs for a modified polypeptide- encoding nucleotide sequence in the heterologous host, and it can be determined whether or not the modification to the polypeptide-encoding nucleotide sequence resulted in improved translational kinetics.
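  • A comparison of an original and a modified profile can be sketched as a set difference over the positions that exceed a threshold. The function name, the example profiles, and the threshold are hypothetical, and the sketch only reports which predicted pause sites were removed, introduced, or retained; whether a given change counts as an improvement depends on the goal of the modification.

```python
def compare_profiles(original, modified, threshold):
    """Summarize how predicted pause positions differ between two profiles.

    Each profile is a list of (position, value) points covering the same positions.
    """
    if len(original) != len(modified):
        raise ValueError("profiles must cover the same positions")
    before = {pos for pos, value in original if value >= threshold}
    after = {pos for pos, value in modified if value >= threshold}
    return {
        "removed": sorted(before - after),
        "introduced": sorted(after - before),
        "retained": sorted(before & after),
    }


# Hypothetical profiles for one sequence before and after modification.
original_profile = [(1, 0.2), (2, 3.4), (3, 4.1), (4, -0.3)]
modified_profile = [(1, 0.2), (2, 0.9), (3, 4.1), (4, 3.5)]
summary = compare_profiles(original_profile, modified_profile, threshold=3.0)
```

  • A non-empty "introduced" list in this sketch would suggest that further modification of the sequence may be desired.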
  • the nucleic acid sequences provided herein can be present in a polynucleotide (e.g., DNA or RNA molecule).
  • the polynucleotides can be inserted into a replicable vector for cloning (e.g., amplification of the DNA) or for expression.
  • Various vectors are publicly available and are known in the art.
  • the vector can, for example, be in the form of a plasmid, cosmid, viral particle, or phage.
  • the appropriate nucleic acid sequence can be inserted into the vector by any of a variety of procedures known in the art.
  • Vector components can generally include, but are not limited to, one or more of a signal sequence, an origin of replication, one or more marker genes, an enhancer element, a promoter, and a transcription termination sequence. Construction of suitable vectors containing one or more of these components employs standard ligation techniques which are known to the skilled artisan.
  • the encoded polypeptide can be produced recombinantly not only directly, but also as a fusion polypeptide with a heterologous polypeptide, which can be, e.g., a signal sequence or other polypeptide having a specific cleavage site at the N- terminus of the mature protein or polypeptide.
  • the signal sequence can be a component of the vector, or it can be a part of the polynucleotide that is inserted into the vector.
  • the signal sequence can be a prokaryotic signal sequence selected, for example, from the group of the alkaline phosphatase, penicillinase, lpp, or heat-stable enterotoxin II leaders.
  • the signal sequence can be, e.g., the yeast invertase leader, alpha factor leader (including Saccharomyces and Kluyveromyces α-factor leaders, the latter described in U.S. Patent No. 5,010,182), or acid phosphatase leader, the C. albicans glucoamylase leader (EP 362,179 published 4 April 1990), or the signal described in WO 90/13646 published 15 November 1990.
  • mammalian signal sequences can be used to direct secretion of the protein, such as signal sequences from secreted polypeptides of the same or related species, as well as viral secretory leaders.
  • Both expression and cloning vectors contain a polynucleotide that permits the vector to replicate in one or more selected host cells. Such sequences are well known for a variety of bacteria, yeast, and viruses.
  • the origin of replication from the plasmid pBR322 is suitable for most Gram-negative bacteria, the 2μ plasmid origin is suitable for yeast, and various viral origins (SV40, polyoma, adenovirus, VSV or BPV) are useful for cloning vectors in mammalian cells.
  • Expression and cloning vectors will typically contain a selection gene, also termed a selectable marker.
  • Typical selection genes encode proteins that (a) confer resistance to antibiotics or other toxins, e.g., ampicillin, neomycin, methotrexate, or tetracycline, (b) complement auxotrophic deficiencies, or (c) supply critical nutrients not available from complex media, e.g., the gene encoding D-alanine racemase for Bacilli.
  • Suitable selectable markers for mammalian cells are those that enable the identification of cells competent to take up the polynucleotide-containing vector, such as DHFR or thymidine kinase.
  • An appropriate host cell when wild-type DHFR is employed is the CHO cell line deficient in DHFR activity, prepared and propagated as described by Urlaub et al., Proc. Natl. Acad. Sci. USA, 77:4216 (1980).
  • a suitable selection gene for use in yeast is the trp1 gene present in the yeast plasmid YRp7 [Stinchcomb et al., Nature, 282:39 (1979); Kingsman et al., Gene, 7:141 (1979); Tschemper et al., Gene, 10:157 (1980)].
  • the trp1 gene provides a selection marker for a mutant strain of yeast lacking the ability to grow in tryptophan, for example, ATCC No. 44076 or PEP4-1 [Jones, Genetics, 85:12 (1977)].
  • Expression and cloning vectors usually contain a promoter operably linked to the polynucleotide provided herein to direct mRNA synthesis. Promoters recognized by a variety of potential host cells are well known. Promoters suitable for use with prokaryotic hosts include the β-lactamase and lactose promoter systems [Chang et al., Nature, 275:615 (1978); Goeddel et al., Nature, 281:544 (1979)], alkaline phosphatase, and a tryptophan (trp) promoter system [Goeddel, Nucleic Acids Res., 8:4057 (1980); EP 36,776].
  • Promoters for use in bacterial systems also will contain a Shine-Dalgarno (S. D.) sequence operably linked to the polynucleotide provided herein.
  • Suitable promoting sequences for use with yeast hosts include the promoters for 3-phosphoglycerate kinase [Hitzeman et al., J. Biol. Chem., 255:2073 (1980)] or other glycolytic enzymes [Hess et al., J. Adv. Enzyme Reg., 7:149 (1968); Holland, Biochemistry, 17:4900 (1978)], such as enolase, glyceraldehyde-3-phosphate dehydrogenase, hexokinase, pyruvate decarboxylase, phosphofructokinase, glucose-6-phosphate isomerase, 3-phosphoglycerate mutase, pyruvate kinase, triosephosphate isomerase, phosphoglucose isomerase, and glucokinase.
  • Other yeast promoters, which are inducible promoters having the additional advantage of transcription controlled by growth conditions, are the promoter regions for alcohol dehydrogenase 2, isocytochrome C, acid phosphatase, degradative enzymes associated with nitrogen metabolism, metallothionein, glyceraldehyde-3-phosphate dehydrogenase, and enzymes responsible for maltose and galactose utilization. Suitable vectors and promoters for use in yeast expression are further described in EP 73,657.
  • Transcription from vectors in mammalian host cells is controlled, for example, by promoters obtained from the genomes of viruses such as polyoma virus, fowlpox virus (UK 2,211,504 published 5 July 1989), adenovirus (such as Adenovirus 2), bovine papilloma virus, avian sarcoma virus, cytomegalovirus, a retrovirus, hepatitis-B virus and Simian Virus 40 (SV40), from heterologous mammalian promoters, e.g., the actin promoter or an immunoglobulin promoter, and from heat-shock promoters, provided such promoters are compatible with the host cell systems.
  • Enhancers are cis-acting elements of DNA, usually from about 10 to 300 bp, that act on a promoter to increase its transcription.
  • Many enhancer sequences are now known from mammalian genes (globin, elastase, albumin, α-fetoprotein, and insulin).
  • Typically, however, an enhancer from a eukaryotic cell virus is used. Examples include the SV40 enhancer on the late side of the replication origin (bp 100-270), the cytomegalovirus early promoter enhancer, the polyoma enhancer on the late side of the replication origin, and adenovirus enhancers.
  • the enhancer can be spliced into the vector at a position 5' or 3' to the polynucleotide provided herein, but is preferably located at a site 5' from the promoter.
  • Expression vectors used in eukaryotic host cells will also contain sequences necessary for the termination of transcription and for stabilizing the mRNA. Such sequences are commonly available from the 5' and, occasionally, 3' untranslated regions of eukaryotic or viral DNAs or cDNAs. These regions contain nucleotide segments transcribed as polyadenylated fragments in the untranslated portion of the mRNA transcribed from the polynucleotide provided herein.
  • Host cells are transfected or transformed with expression or cloning vectors described herein for polypeptide production and cultured in conventional nutrient media modified as appropriate for inducing promoters, selecting transformants, or amplifying the genes encoding the desired sequences.
  • the culture conditions, such as media, temperature, pH and the like, can be selected by the skilled artisan without undue experimentation. In general, principles, protocols, and practical techniques for maximizing the productivity of cell cultures can be found in Mammalian Cell Biotechnology: a Practical Approach, M. Butler, ed. (IRL Press, 1991) and Sambrook et al., supra.
  • Methods of eukaryotic cell transfection and prokaryotic cell transformation are known to the ordinarily skilled artisan, for example, CaCl2, CaPO4, liposome-mediated transfection, and electroporation. Depending on the host cell used, transformation is performed using standard techniques appropriate to such cells. The calcium treatment employing calcium chloride, as described in Sambrook et al., supra, or electroporation is generally used for prokaryotes. Infection with Agrobacterium tumefaciens is used for transformation of certain plant cells, as described by Shaw et al., Gene, 23:315 (1983) and WO 89/05859 published 29 June 1989.
  • electroporation, bacterial protoplast fusion with intact cells, or polycations, e.g., polybrene or polyornithine, can also be used.
  • Suitable host cells for cloning or expressing the DNA in the vectors herein include prokaryote, yeast, or higher eukaryote cells.
  • Suitable prokaryotes include but are not limited to eubacteria, such as Gram-negative or Gram-positive organisms, for example, Enterobacteriaceae such as E. coli.
  • Various E. coli strains are publicly available, such as E. coli K12 strain MM294 (ATCC 31,446); E. coli X1776 (ATCC 31,537); E. coli strain W3110 (ATCC 27,325) and K5 772 (ATCC 53,635).
  • suitable prokaryotic host cells include Enterobacteriaceae such as Escherichia, e.g., E. coli, Enterobacter, Erwinia, Klebsiella, Proteus, Salmonella, e.g., Salmonella typhimurium, Serratia, e.g., Serratia marcescens, and Shigella, as well as Bacilli such as B. subtilis and B. licheniformis (e.g., B. licheniformis 41P disclosed in DD 266,710 published 12 April 1989), Pseudomonas such as P. aeruginosa, and Streptomyces. These examples are illustrative rather than limiting.
  • Strain W3110 is one particularly preferred host or parent host because it is a common host strain for recombinant DNA product fermentations. Preferably, the host cell secretes minimal amounts of proteolytic enzymes.
  • strain W3110 can be modified to effect a genetic mutation in the genes encoding proteins endogenous to the host, with examples of such hosts including E. coli W3110 strain 1A2, which has the complete genotype tonA; E. coli W3110 strain 9E4, which has the complete genotype tonA ptr3; E. coli W3110 strain 27C7 (ATCC 55,244), which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT kanr; E. coli W3110 strain 37D6, which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT rbs7 ilvG kanr; E. coli W3110 strain 40B4, which is strain 37D6 with a non-kanamycin resistant degP deletion mutation; and an E. coli strain having mutant periplasmic protease disclosed in U.S. Patent No. 4,946,783 issued 7 August 1990.
  • in vitro methods of cloning, e.g., PCR or other nucleic acid polymerase reactions, are suitable.
  • eukaryotic microbes such as filamentous fungi or yeast are suitable cloning or expression hosts for polynucleotide-containing vectors.
  • Saccharomyces cerevisiae is a commonly used lower eukaryotic host microorganism.
  • Others include Schizosaccharomyces pombe (Beach and Nurse, Nature, 290:140 [1981]; EP 139,383 published 2 May 1985); Kluyveromyces hosts (U.S. Patent No. 4,943,529; Fleer et al., Bio/Technology, 9:968-975 (1991)) such as, e.g., K. lactis (MW98-8C, CBS683, CBS4574; Louvencourt et al., J. Bacteriol., 154(2):737-742 [1983]), K. fragilis (ATCC 12,424), K. bulgaricus (ATCC 16,045), K. wickeramii (ATCC 24,178), K. waltii (ATCC 56,500), K. drosophilarum (ATCC 36,906; Van den Berg et al., Bio/Technology, 8:135 (1990)), K. thermotolerans, and K.
  • Schwanniomyces such as Schwanniomyces occidentalis (EP 394,538 published 31 October 1990); and filamentous fungi such as, e.g., Neurospora, Penicillium, Tolypocladium (WO 91/00357 published 10 January 1991), and Aspergillus hosts such as A. nidulans (Ballance et al., Biochem. Biophys. Res. Commun., 112:284-289 [1983]; Tilburn et al., Gene, 26:205-221 [1983]; Yelton et al., Proc. Natl. Acad. Sci.
  • Methylotrophic yeasts are suitable herein and include, but are not limited to, yeast capable of growth on methanol selected from the genera consisting of Hansenula, Candida, Kloeckera, Pichia, Saccharomyces, Torulopsis, and Rhodotorula.
  • a list of specific species that are exemplary of this class of yeasts can be found in C. Anthony, The Biochemistry of Methylotrophs, 269 (1982).
  • Suitable host cells for the expression of glycosylated polypeptides are derived from multicellular organisms.
  • invertebrate cells include insect cells such as Drosophila S2 and Spodoptera Sf9, as well as plant cells.
  • useful mammalian host cell lines include Chinese hamster ovary (CHO) and COS cells. More specific examples include monkey kidney CV1 line transformed by SV40 (COS-7, ATCC CRL 1651); human embryonic kidney line (293 or 293 cells subcloned for growth in suspension culture, Graham et al., J. Gen Virol., 36:59 (1977)); Chinese hamster ovary cells/-DHFR (CHO, Urlaub and Chasin, Proc. Natl. Acad. Sci.
  • mouse Sertoli cells (TM4, Mather, Biol. Reprod., 23:243-251 (1980))
  • human lung cells (WI-38, ATCC CCL 75)
  • human liver cells (Hep G2, HB 8065)
  • mouse mammary tumor (MMT 060562, ATCC CCL51). The selection of the appropriate host cell is deemed to be within the skill in the art.
  • Gene amplification and/or expression can be measured in a sample directly, for example, by conventional Southern blotting, Northern blotting to quantitate the transcription of mRNA [Thomas, Proc. Natl. Acad. Sci. USA, 77:5201-5205 (1980)], dot blotting (DNA analysis), or in situ hybridization, using an appropriately labeled probe, based on the sequences provided herein.
  • antibodies can be employed that can recognize specific duplexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes. The antibodies in turn can be labeled and the assay can be carried out where the duplex is bound to a surface, so that upon the formation of duplex on the surface, the presence of antibody bound to the duplex can be detected.
  • Gene expression can be measured by immunological methods, such as immunohistochemical staining of cells or tissue sections and assay of cell culture or body fluids, to quantitate directly the expression of gene product.
  • Antibodies useful for immunohistochemical staining and/or assay of sample fluids can be either monoclonal or polyclonal, and can be prepared in any mammal. Conveniently, the antibodies can be prepared against any polypeptide provided herein or against a synthetic peptide based on the sequences provided herein or against exogenous sequence fused to the polypeptide or fragment thereof and encoding a specific antibody epitope.
  • Polypeptides can be recovered from culture medium or from host cell lysates. If membrane-bound, the polypeptide can be released from the membrane using a suitable detergent solution (e.g., Triton-X 100) or by enzymatic cleavage. Cells employed in expression of polypeptides can be disrupted by various physical or chemical means, such as freeze-thaw cycling, sonication, mechanical disruption, or cell lysing agents, as is known in the art.
  • the following procedures are exemplary of suitable purification procedures: fractionation on an ion-exchange column; ethanol precipitation; reverse phase HPLC; chromatography on silica or on an anion-exchange resin such as DEAE; chromatofocusing; SDS-PAGE; ammonium sulfate precipitation; gel filtration using, for example, Sephadex G-75; protein A Sepharose columns to remove contaminants such as IgG; and metal chelating columns to bind epitope-tagged forms of the polypeptide.
  • an expression system comprising an expression vector in a host organism, wherein the expression vector includes a DNA sequence of the embodiments provided herein operably linked to an expression control sequence.
  • an expression vector is a DNA or RNA vector that is capable of transforming a host cell and of effecting expression of a specified nucleic acid molecule.
  • the expression vector is also capable of replicating within the host cell.
  • Expression vectors can be either prokaryotic or eukaryotic, and are typically viruses or plasmids.
  • operably linked refers to functional linkage between a nucleic acid expression control sequence (such as a promoter, or array of transcription factor binding sites) and a second nucleic acid sequence, wherein the expression control sequence directs transcription of the nucleic acid corresponding to the second sequence.
  • An operably linked expression vector can also include secretion signals and other modifying sequences, and can encode chaperones and proteins for a variety of organisms and systems.
  • the methods include inserting a polypeptide- encoding nucleotide sequence designed by the methods provided herein into a cell, and expressing the polypeptide-encoding nucleotide sequence under conditions suitable for gene expression. Additionally provided expression methods include cell-free expression systems as known in the art, where such methods include providing a polypeptide- encoding nucleotide sequence designed by the methods provided herein and contacting the polypeptide-encoding nucleotide sequence with a cell-free expression system under conditions suitable for protein translation.
  • the expression levels of one or more enzymes in a metabolic pathway are individually manipulated. Differential metabolic expression levels can be manipulated using methods known in the art. For example, by selecting a specific promoter with a desired transcriptional level, one can vary the expression level of the gene that is operably linked to the promoter. Similarly, one may select an expression vector that produces the desired levels of expression. Accordingly, one can manipulate expression of the various components of the metabolic systems described herein by selecting a specific promoter with a desired level of transcriptional activation. Additionally, one can predict and manipulate expression of various components of the systems provided herein using a mathematical tool for modeling a metabolic pathway. Such tools are known in the art, for example, as described by Yang et al. (J. Biol. Chem (2005) 280(12):11224-32) and by Yang et al. (Bioinformatics (2005) 6:774-780), each of which is hereby incorporated by reference in its entirety.
  • Endogenous sequences include genomic sequences of a cell. Such genomic sequences can include sequences previously modified by the constructs, methods and systems provided herein. Modifications of endogenous sequences can include insertions, deletions and mutations. In some embodiments, a modification can include the insertion of a heterologous sequence. Heterologous sequences include exogenous nucleic acid sequences and can include sequences with homology to endogenous sequences.
  • integrable polynucleotides for modifying endogenous nucleotide sequences in a cell are provided.
  • Such integrable polynucleotides can contain sequences with homology to endogenous sequences and a removable selectable marker cassette.
  • the removable selectable marker cassette can include a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence.
  • integrable polynucleotides can also contain heterologous sequences.
  • the heterologous sequences and removable selectable marker cassette can be flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence.
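  • The layout described above (5' homology, heterologous sequence, removable selectable marker cassette, 3' homology) can be sketched as a small data structure. The class name, field names, and every sequence string below are hypothetical placeholders rather than real homology arms, genes, or recombinase recognition sequences; the same site string is used on both sides of the marker to model recognition sequences in the same orientation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrablePolynucleotide:
    homology_5: str        # 5' sequence with homology to an endogenous sequence
    heterologous: str      # heterologous sequence, e.g. a polypeptide-encoding sequence
    recombinase_site: str  # site-specific recombinase recognition sequence
    marker: str            # selectable marker
    homology_3: str        # 3' sequence with homology to an endogenous sequence

    def marker_cassette(self):
        """Selectable marker flanked by 5' and 3' recognition sequences."""
        return self.recombinase_site + self.marker + self.recombinase_site

    def sequence(self):
        """Homology arms flanking the heterologous sequence and the marker cassette."""
        return (self.homology_5 + self.heterologous
                + self.marker_cassette() + self.homology_3)


# Placeholder strings only; none of these is a real biological sequence.
example = IntegrablePolynucleotide(
    homology_5="AAAAAAAAAA",
    heterologous="ATGCCCTAA",
    recombinase_site="TTTTGGGGTTTT",
    marker="ATGCACTAA",
    homology_3="CCCCCCCCCC",
)
example_sequence = example.sequence()
```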
  • integrable polynucleotides can include episomal nucleic acids, such as plasmids and YACs.
  • integrable polynucleotides can include autonomous replication sequences such as ColE1, Ori, oriT, 2μm, CEN/ARS.
  • integrable polynucleotides can include linearized episomal nucleic acids, for example, plasmids cut with a restriction enzyme.
  • integrable polynucleotides can include PCR products.
  • a removable selectable cassette can contain a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence.
  • Removable selectable marker cassettes can be used to select for integration of an integrable polynucleotide into the genome of a cell. Subsequent to integration of the integrable polynucleotide, the removable selectable marker cassette can be excised, if desired, from the genome of the cell. Because the number of known selectable markers is limited, one advantage of excising a selectable marker from the genome of a cell is that the selectable marker can be used repeatedly.
  • the same selectable marker can be used in a second integrable polynucleotide to modify the genome of a cell previously modified by the first integrable polynucleotide.
  • the selectable marker can allow selection for a cell in which the selectable marker has integrated into the cell's genome.
  • Selectable markers can be antibiotic resistance genes against compounds, for example, kanamycin, ampicillin, tetracycline, chloramphenicol, spectinomycin, gentamycin, zeomycin, or streptomycin. More selectable markers can be genes capable of complementing strains of yeast having well characterized metabolic deficiencies, for example, tryptophan or histidine deficient mutants.
  • a selectable marker can be used to select against cells that retain the selectable marker. In such embodiments, cells which do not express the selectable marker will be selected for.
  • a selectable marker can be selected for and against.
  • Examples of such selectable markers include, but are not limited to, URA3 (Boeke, J. D., LaCroute, F., and Fink, G. R. (1984)), TRP1 (Toyn, J. H., Gunyuzlu, P. L., White, W. H., Thompson, L. A., and Hollis, G. F. (2000). A counterselection for the tryptophan pathway in yeast: 5-fluoroanthranilic acid resistance. Yeast 16, 553-560), CAN1 (Whelan, W. L., Gocke, E., and Manney, T. R. (1979). The CAN1 locus of Saccharomyces cerevisiae: fine-structure analysis and forward mutation rates. Genetics 35-51), KlURA3, CYH2, LYS2 and MET15 (Singh, A. and Sherman, F. (1975). Genetic and physiological characterization of met15 mutants of Saccharomyces cerevisiae: a selective system for forward and reverse mutations. Genetics 75-97).
  • Such examples can typically be used in conjunction with specific strains of Saccharomyces cerevisiae which are non-functional for specific genes.
  • a first selection of the selectable marker can be made to select for incorporation of the selectable marker and a second selection of the selectable marker can be made to select against maintaining the selectable marker.
  • Such embodiments can find particular application when the same selectable marker is utilized iteratively, namely, two or more times, for the separate incorporation of two or more heterologous polynucleotides into the host organism.
  • the selectable marker can be flanked by site-specific recombinase recognition sequences.
  • site-specific recombinase recognition sequences allow a site-specific recombinase to excise the selectable marker from an integrable polynucleotide integrated into the genome of a cell.
  • sequence-specific recombinase target sites include, but are not limited to, loxP sites, frt sites, att sites and dif sites.
  • the site-specific recombinase recognition sequences can be loxP sites recognized by the CRE recombinase.
  • the CRE recombinase can be a CRE recombinase optimized for expression in a particular organism, for example, S. cerevisiae, using methods known in the art.
  • the site-specific recombinase recognition sequence can be frt sites recognized by the FLP recombinase.
  • flanking loxP sites or flanking frt sites should be in the same orientation, that is, the sites should be in tandem orientation.
  • CRE recombinase or FLP recombinase expressed in a cell can excise the sequence between loxP sites or frt sites, respectively.
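  • The outcome of excision between two recognition sequences in tandem orientation can be sketched as a string operation that removes the intervening sequence and one copy of the site, leaving a single site behind. The function name and the example genome are hypothetical; the default site is the commonly cited 34-bp loxP sequence and should be verified before any real use. The sketch models only the product of recombination between direct repeats on one strand, not the enzymology, and it does not model inverted sites.

```python
# Commonly cited 34-bp loxP sequence (assumption: verify against a primary source).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def excise(genome, site=LOXP):
    """Return `genome` with the segment between the first two direct repeats of
    `site` removed, keeping one copy of the site. Unchanged if fewer than two sites."""
    first = genome.find(site)
    if first == -1:
        return genome
    second = genome.find(site, first + len(site))
    if second == -1:
        return genome
    # Keep everything through the first site, then resume after the second site.
    return genome[:first + len(site)] + genome[second + len(site):]


# Hypothetical integrated locus: flank + site + marker + site + flank.
integrated = "AAAA" + LOXP + "ATGAAATAA" + LOXP + "CCCC"
after_excision = excise(integrated)
```

  • Because one recognition sequence remains after excision in this sketch, a second call on the result returns it unchanged.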
  • the site-specific recombinase can be expressed from a plasmid. In other embodiments, the site-specific recombinase can be expressed from an inducible endogenous gene.
  • integration of an integrable polynucleotide into the genome of a cell can be mediated by a variety of processes.
  • Such processes can include, but are not limited to, random integration, homologous recombination, or site-specific recombination.
  • integrable polynucleotides can contain sequences with homology to endogenous sequences. Such sequences with homology to endogenous sequences can direct integration of integrable polynucleotides to certain locations in a cell's genome, specifically, the location of the endogenous sequence.
  • One advantage of directing integration of integrable polynucleotides to particular locations of the genome is that the integrable polynucleotides can be directed to locations of the genome that, for example, can contain enhancer elements, locus control regions, or can be more permissive for expression of a heterologous sequence contained within an integrable polynucleotide.
  • sequences with homology to endogenous sequences can be more than about 5 nucleotides, more than about 10 nucleotides, more than about 15 nucleotides, more than about 20 nucleotides, more than about 25 nucleotides, more than about 30 nucleotides, more than about 35 nucleotides, more than about 40 nucleotides, more than about 45 nucleotides, more than about 50 nucleotides, more than about 100 nucleotides, more than about 500 nucleotides, more than about 1 kilobase, more than about 2 kilobases, more than about 3 kilobases, more than about 4 kilobases, or more than about 5 kilobases in length.
  • Sequences with homology to endogenous sequences can be 100% identical or can have at least 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, or 70% identity to the endogenous sequence.
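  • Percent identity between a homology sequence and an endogenous sequence can be sketched as the fraction of matching positions. This simplified illustration assumes the two sequences are already aligned, ungapped, and of equal length, which the specification does not require; the function name and example sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of positions at which two pre-aligned, equal-length sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


# Hypothetical 10-nucleotide sequences differing at one position.
example_identity = percent_identity("ATGCATGCAT", "ATGCATGCAA")
```

  • A sequence comparison for sequences of unequal length would first need an alignment step, which is outside the scope of this sketch.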
  • sequences with homology to endogenous sequences can contain sequences with homology to genomic repetitive elements, such as long interspersed repeats (LINEs), short interspersed repeats (SINEs), or retrotransposon DNA, such as long terminal repeats (LTR).
  • genomic repetitive elements can be Ty1 or Ty3 elements.
  • integrable polynucleotides containing sequences with homology to genomic repetitive elements may integrate at more than one site in the genome of a cell.
  • sequences with homology to endogenous sequences can contain δ sequences. δ sequences are a component of the LTR of the Ty1 retrotransposon and are distributed throughout the S. cerevisiae genome.
  • Vectors containing δ sequences for integration into S. cerevisiae are known in the art, as exemplified in Lee F. W. and Da Silva N.A., Sequential delta-integration for the regulated insertion of cloned genes in Saccharomyces cerevisiae. Biotechnol Prog. (1997) 13(4): 368-373.
  • the 5' nucleic acid sequence with homology to an endogenous sequence and the 3' nucleic acid sequence with homology to an endogenous sequence can contain δ sequences.
  • Vectors containing heterologous sequences flanked by δ sequences are known in the art to have an increased stability for expression of heterologous sequences contained therein (Lee F. W.
  • an integrable polynucleotide can contain heterologous sequences.
  • Such heterologous sequences can include sequences encoding polypeptides.
  • the heterologous sequences can encode genes important in sugar metabolism, cellulose metabolism, arabinose metabolism, and xylose metabolism.
  • heterologous sequences can contain regulatory elements operatively linked to a sequence encoding a polypeptide.
  • regulatory elements can include, for example, promoters, enhancers, and terminator sequences. Promoters may be constitutive or inducible. Suitable promoters for use in prokaryotic hosts include, but are not limited to, the trp, lac and phage promoters, tRNA promoters and glycolytic enzyme promoters.
  • Useful yeast promoters include, but are not limited to, the promoter regions for metallothionein, 3-phosphoglycerate kinase or other glycolytic enzymes such as enolase or glyceraldehyde-3-phosphate dehydrogenase and the enzymes responsible for maltose and galactose utilization.
  • Appropriate mammalian promoters include, but are not limited to, the early and late promoters from SV40 and promoters derived from murine Moloney leukemia virus (MLV), mouse mammary tumor virus (MMTV), avian sarcoma viruses, adenovirus II, bovine papilloma virus and polyomas.
  • a heterologous sequence can contain the PGK1 promoter, the TEF1 promoter, the CYC1 terminator, and combinations thereof.
  • heterologous sequences encode and express the gene of interest in a cell in which the heterologous sequence has integrated.
  • a cell can contain any of the integrable polynucleotides described herein.
  • a cell can be a prokaryotic cell or a eukaryotic cell.
  • prokaryotic cells include Escherichia coli, and Clostridium species.
  • eukaryotic cells include, but are not limited to, fungi and yeast cells, such as Saccharomyces cerevisiae, Pichia pastoris, Zymomonas mobilis, Kluyveromyces lactis, Kluyveromyces marxianus, Trichoderma species, and Aspergillus species; mammalian cells, such as Chinese hamster cells; avian cells; and insect cells.
  • the cell can contain an integrable polynucleotide integrated into the genome of a cell.
  • a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which the removable selectable marker is juxtaposed to said heterologous nucleic acid.
  • a removable selectable marker can be juxtaposed to a heterologous nucleic acid where the removable selectable marker and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the removable selectable marker and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilobases, less than about 1 kilobases, less than about 2 kilobases, less
  • a cell can contain an integrable polynucleotide integrated into the genome of the cell where the removable selectable cassette has been excised from the integrated polynucleotide.
  • a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which a site-specific recombinase recognition site is juxtaposed to the heterologous nucleic acid.
  • a site-specific recombinase recognition site can be juxtaposed to a heterologous nucleic acid where the site-specific recombinase recognition site and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the site-specific recombinase recognition site and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilob
  • a cell can contain a plurality of integrable polynucleotides.
  • a cell can contain a plurality of different integrable polynucleotides containing different selectable markers.
  • a cell contains no more than about 1, no more than about 2, no more than about 3, no more than about 4, no more than about 5, no more than about 6, no more than about 7, no more than about 8, no more than about 9, or no more than about 10 different selectable markers.
  • the number of selectable markers a cell can contain can include the number of different selectable markers compatible with the methods and compositions described herein.
  • a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell.
  • a cell can contain 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 45 or more, or 50 or more different integrable polynucleotides that have integrated into the genome of the cell.
  • a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell where some integrable polynucleotides contain selectable markers, and some integrable polynucleotides have no selectable marker. In even more embodiments, a cell can contain a plurality of different integrable polynucleotides where some or all of the selectable markers have been excised.
  • methods to modify an endogenous sequence in a cell can include providing a cell with any integrable polynucleotide described herein, and selecting for at least one cell containing the integrable polynucleotide integrated into the genome of the cell.
  • a plurality of different integrable polynucleotides can be provided to a cell.
  • the plurality of different integrable polynucleotides can include 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides.
  • the plurality of integrable polynucleotides can include integrable polynucleotides with different selectable markers.
  • One advantage of providing a cell with a plurality of polynucleotides with different selectable markers includes the ability to make more than one modification to endogenous sequences in a cell simultaneously.
  • the plurality of integrable polynucleotides can include integrable polynucleotides with different heterologous sequences.
  • the plurality of integrable polynucleotides can include integrable polynucleotides with different flanking sequences with homology to endogenous sequences.
  • At least one selectable marker can be used iteratively.
  • a cell can be produced from a first round of modification(s) using the methods described herein.
  • a cell can be provided with a first integrable polynucleotide containing a selectable marker, a cell can be selected for containing the integrable polynucleotide integrated into the cell's genome, the selection cassette can be excised from a cell containing an integrated integrable polynucleotide, and a cell can be selected for having the selection cassette excised.
  • a cell containing the modifications of the first round can undergo at least a second round of modifications using a second integrable polynucleotide containing the same selectable marker as the first integrable polynucleotide.
  • a selectable marker can be reused and is used iteratively.
  • a cell can be provided with a plurality of integrable polynucleotides containing a set of different selectable markers in a first round of modifications.
  • a cell containing the modifications of the first round of modifications can be provided with a plurality of integrable polynucleotides containing the same set of different selectable markers as the first round of modifications.
  • the integrable polynucleotide can be provided to a cell as a linearized plasmid.
  • the integrable polynucleotide can be provided to a cell as a PCR product.
  • Methods of PCR are well known in the art.
  • the template for the PCR can comprise a sequence for an integrable polynucleotide, for example, a vector containing the integrable polynucleotide sequence.
  • the initial template for PCR may not contain the entire sequence for an integrable polynucleotide.
  • One advantage of using PCR to generate the integrable polynucleotide includes the ability to incorporate additional sequences to the ends of the initial PCR template.
  • PCR primers with tails can be designed and used to amplify the initial PCR template and incorporate the additional sequences in the tails into the amplified product.
  • Such additional tail sequences can be 2 nucleotides, 3 nucleotides, 4 nucleotides, 5 nucleotides, 6 nucleotides, 7 nucleotides, 8 nucleotides, 9 nucleotides, 10 nucleotides, 11 nucleotides, 12 nucleotides, 13 nucleotides, 14 nucleotides, 15 nucleotides, 16 nucleotides, 17 nucleotides, 18 nucleotides, 19 nucleotides, 20 nucleotides, 21 nucleotides, 22 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27 nucleotides, 28 nucleotides, 29 nucleotides, 30 nucleotides, 31 nucleotides, 32 nucleotides, 33 nucleotides, 34 nucleotides, 35 nucleotides, 36 nucleotides, 37 nucleotides, 38
  • primers for the PCR can be designed to add sequences with homology to endogenous sequences to the initial PCR template.
  • an integrable polynucleotide with flanking sequences with homology to endogenous sequences can be generated.
  • additional tail sequences can include Ty1 sequences.
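The tailed-primer approach described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names (`reverse_complement`, `tailed_primers`), the `anneal_len` parameter, and any sequences passed in are hypothetical and are not taken from this application. It assumes all inputs are given as top-strand sequences written 5' to 3'.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tailed_primers(template: str, upstream_homology: str,
                   downstream_homology: str, anneal_len: int = 20):
    """Build a forward and a reverse primer whose 5' tails add homology arms.

    template            -- top strand of the initial PCR template
    upstream_homology   -- top-strand sequence to add before the template
    downstream_homology -- top-strand sequence to add after the template
    anneal_len          -- number of template bases each primer anneals to
                           (hypothetical default, not from the source)
    """
    if len(template) < anneal_len:
        raise ValueError("template is shorter than the annealing region")
    # Forward primer: 5' tail (upstream homology) + start of the template.
    forward = upstream_homology + template[:anneal_len]
    # Reverse primer: written 5'->3' on the bottom strand, so its 5' tail is
    # the reverse complement of the downstream homology.
    reverse = reverse_complement(template[-anneal_len:] + downstream_homology)
    return forward, reverse


def expected_product(template: str, upstream_homology: str,
                     downstream_homology: str) -> str:
    """Top strand of the amplified product carrying both homology arms."""
    return upstream_homology + template + downstream_homology
```

Calling `tailed_primers(template, up, down)` with two homology arms returns primers whose amplification product is `expected_product(template, up, down)`, that is, the initial template flanked by sequences with homology to the chosen endogenous locus.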
  • methods to modify an endogenous sequence in a cell can also include excising the selectable marker from the integrable polynucleotide integrated into the genome of the cell.
  • One advantage of excising a selectable marker integrated into the genome of a cell is that the selectable marker can be re-used to select for another modification in a subsequent round of modifications.
  • a selectable marker can be excised from an integrated site by site-specific recombination using a site-specific recombinase expressed in the cell.
  • Site-specific recombinases can include CRE recombinase to excise sequences between tandem loxP sites, and FLP recombinase to excise sequences between tandem frt sites.
  • the site-specific recombinase can be expressed from a plasmid transformed into the cell.
  • the site-specific recombinase can be expressed from an inducible endogenous gene. It is contemplated that in instances where more than one type of selectable marker has integrated into the cell's genome, all the different selectable markers can be excised simultaneously by the expression of at least one type of site-specific recombinase.
  • the selectable markers of an integrable polynucleotide containing the URA3 marker flanked by loxP sites, and an integrable polynucleotide containing the TRP1 marker flanked by loxP sites can both be excised from sites where the integrable polynucleotides have integrated into the cell by expression in the cell of CRE recombinase.
  • a cell can be provided with a plurality of integrable polynucleotides which contain different recombinase recognition sequences.
  • the plurality of integrable polynucleotides can include some integrable polynucleotides that contain one type of recombinase recognition sequences, such as loxP sites, and some integrable polynucleotides can contain another type of recombinase recognition sequences, such as frt sites.
  • a cell in which a selectable marker has been excised can be identified by selecting against cells that retain the marker. Methods for such negative selection are well known in the art.
  • An exemplary eukaryotic system for xylose metabolism is a cassette of enzymes that can include xylose reductase (XR), xylitol dehydrogenase (XDH), and xylulokinase (XKI).
  • An exemplary bacterial system for xylose metabolism is a cassette of enzymes that can include xylose isomerase (XylA) and xylulokinase (XKI).
  • one or more, or all of the enzymes are heterologous to the one or more host organisms.
  • the translational kinetics of each of the nucleotide sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, 3, 4, 5 or 6 or more codon pairs present in the original sequence for each enzyme.
  • a silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change.
  • a codon pair in the modified polynucleotide can be selected to preserve or insert a predicted pause.
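The definition of a silent permutation given above can be checked mechanically. The sketch below is illustrative only: the helper names `translate` and `is_silent` are hypothetical, and it assumes the standard genetic code, so a source or host organism that uses a variant code would need a different table.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard code applies; organisms with a variant code need another table).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a coding sequence whose length is a multiple of three."""
    cds = cds.upper()
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def is_silent(original: str, modified: str) -> bool:
    """True if the nucleotides changed but the encoded amino acids did not."""
    return (len(original) == len(modified)
            and original.upper() != modified.upper()
            and translate(original) == translate(modified))
```

For example, changing the codon pair `GCT AAA` to `GCC AAG` is silent because both encode Ala-Lys, whereas changing it to `GCT AGA` is not because the second codon then encodes Arg.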
  • Also provided herein are systems for arabinose metabolism comprising one or more host organisms that collectively include nucleotide sequences operably encoding at least two enzymes from bacterial or eukaryotic pathways.
  • An exemplary eukaryotic system for arabinose metabolism is a cassette of enzymes that can include aldose reductase (ARD), L-arabinitol 4-dehydrogenase (LAD), L-xylulose reductase (LXR), xylitol dehydrogenase (XDH), and xylulokinase (XKI).
  • An exemplary bacterial system for arabinose metabolism is a cassette of enzymes that can include L-arabinose isomerase (AraA), L-ribulokinase (AraB), and L-ribulose-5-P 4-epimerase (AraD).
  • one or more, or all of the enzymes are heterologous to the one or more host organisms.
  • the translational kinetics of each of the nucleotide sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, 3, 4, 5 or 6 or more codon pairs present in the original sequence for each enzyme.
  • a silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change.
  • the at least 1, 2, 3, 4, 5 or 6 or more substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism.
  • a codon pair in the modified polynucleotide can be selected to preserve or insert a predicted pause.
  • the stoichiometry of enzymes in a pathway can affect the overall efficiency of biomass conversion. Accordingly, provided herein are systems of two or more enzymes wherein one of the two or more enzymes in the pathway has a translational pause. Also provided herein are two or more enzymes wherein two of the enzymes in the pathway have a translational pause.
  • xylose reductase can have a pause
  • xylitol dehydrogenase can have a pause
  • xylulokinase can have a pause
  • combinations thereof can have pauses.
  • xylose isomerase can have a pause
  • xylulokinase can have a pause
  • both enzymes can have a pause.
  • aldose reductase can have a pause
  • L-arabinitol 4-dehydrogenase (LAD) can have a pause
  • L-xylulose reductase (LXR) can have a pause
  • xylitol dehydrogenase (XDH) can have a pause
  • xylulokinase (XKI) can have a pause
  • L-arabinose isomerase (AraA) can have a pause
  • L-ribulokinase (AraB) can have a pause
  • L-ribulose-5-P 4-epimerase (AraD) can have a pause, or combinations thereof can have pauses.
  • AraA and AraB do not have pauses, while AraD contains a pause; it is contemplated that such an arrangement would result in AraA and AraB having high levels of activity, with AraD retaining low levels of activity.
  • the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
  • each encoded enzyme in the system has at least 50%, 60%, 70%, 80%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the original sequence of the enzyme.
  • one or more of the enzymes in the system retains at least 75% of the enzymatic activity of the enzyme encoded by the original sequence under conditions suitable for metabolism of xylose. Methods for measuring the activity of the enzymes in the system are known in the art.
  • Also provided are methods of hydrolyzing a carbohydrate comprising providing a carbohydrate comprising at least one glycosidic bond, providing a polypeptide encoded by any of the polynucleotides provided herein, and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one glycosidic bond of said carbohydrate, whereby at least one glycosidic bond of said carbohydrate is hydrolyzed.
  • the carbohydrate is cellulose.
  • the carbohydrate comprises two or more β-1,4-linked glucose units.
  • Such methods can be performed using the cells and systems provided herein. Such methods can be performed in order to provide smaller polysaccharides and/or monosaccharides which can be used by a cell or processed extracellularly according to any one of a variety of known methods in the art.
  • a polynucleotide containing an improved-expression nucleotide sequence calculated in accordance with the teachings herein can be prepared by known methods, such as, for example, assembly of overlapping oligonucleotides which can be solid phase synthesized, as is described in U.S. Patent Number 7,262,031, and U.S. Patent Publication Numbers 2005/0106590 and 2007/0009928.
  • the prepared polynucleotide can then be amplified by PCR methodologies or by insertion into a vector, transformation into cells, and subsequent harvesting of the vector from the cells. Examples of such methods for amplification of a polynucleotide are provided in Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y.
  • the polynucleotide itself or amplicon thereof can be inserted into an expression vector configured to produce the polypeptide encoded by the inserted polynucleotide.
  • the expression vector is then inserted into cells, and according to the expression vector used, the cells are treated under conditions suitable for polypeptide expression.
  • the expressed polypeptide can be analyzed and manipulated as desired.
  • the expressed polypeptide can be analyzed by Western blot analysis using a known antibody to the expressed polypeptide or using an anti-polypeptide antibody generated by known methods.
  • the expressed polypeptide also can be subjected to one or more purification steps to increase the purity of the expressed polypeptide.
  • Various analytical and purification methods, as well as antibody-generation methods, are known in the art, as exemplified in Ausubel, supra.
  • This example describes optimization of a nucleotide sequence encoding Xyr for expression in yeast.
  • the chi-squared value "chisq1" was generated from the expected and observed values determined.
  • the chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • the chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
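The normalization step described above, in which each chisq3 value is reported as a number of standard deviations from the mean over all codon pairs, can be sketched as follows. The function names and the input layout (a dictionary keyed by codon pair) are hypothetical, and the use of the population standard deviation rather than the sample standard deviation is an assumption, since the text does not say which is meant.

```python
from statistics import mean, pstdev


def z_scores(chisq3: dict) -> dict:
    """Normalize chisq3 values to z scores.

    chisq3 maps a codon pair, e.g. ("GCT", "AAA"), to its chisq3 value.
    Assumption: the standard deviation is the population form (pstdev).
    """
    values = list(chisq3.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all chisq3 values are identical; z is undefined")
    return {pair: (value - mu) / sigma for pair, value in chisq3.items()}


def pairs_above(z: dict, threshold: float = 3.0) -> list:
    """Codon pairs whose z score exceeds the threshold, highest first."""
    flagged = [pair for pair, score in z.items() if score > threshold]
    return sorted(flagged, key=lambda pair: z[pair], reverse=True)
```

With z scores in hand, `pairs_above(z, 3.0)` lists the codon pairs that exceed the cutoff of 3 used in the examples.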
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for S. cerevisiae.
  • the nucleotide sequence encoding Xyr (SEQ ID NO: 1) was derived from Genbank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in P. stipitis was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. stipitis as a function of codon pair position.
  • the graphical display is provided in Figure 1.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • the graphical display is provided in Figure 2A.
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 3) was found to encode a protein (SEQ ID NO: 4) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 3) encoding the Xyr protein (SEQ ID NO: 4) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 2B.
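One way to picture the modification step (removing codon pairs with z scores greater than 3 while keeping 100% amino acid identity) is the greedy sketch below. It is not the procedure used in this application, which is not spelled out in this passage. The function names are hypothetical, the standard genetic code is assumed, codon pairs missing from the z-score table are treated as having a z score of 0, and only the second codon of an offending pair is ever changed, so removal of every high-scoring pair is not guaranteed.

```python
from itertools import product

# Standard genetic code in TCAG order (assumption: standard code applies).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group codons by the amino acid they encode.
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def split_codons(cds: str) -> list:
    """Split a coding sequence into codons; length must be a multiple of 3."""
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3].upper() for i in range(0, len(cds), 3)]


def translate(cds: str) -> str:
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def reduce_high_z_pairs(cds: str, z: dict, threshold: float = 3.0) -> str:
    """Greedy left-to-right synonymous replacement of high-z codon pairs.

    z maps (codon, next_codon) to a z score; missing pairs count as 0.0
    (an assumption made for this sketch).
    """
    codons = split_codons(cds)
    for i in range(len(codons) - 1):
        if z.get((codons[i], codons[i + 1]), 0.0) <= threshold:
            continue
        best = None  # (worst neighbouring z score, candidate codon)
        for alt in SYNONYMS[CODON_TABLE[codons[i + 1]]]:
            left = z.get((codons[i], alt), 0.0)
            right = (z.get((alt, codons[i + 2]), 0.0)
                     if i + 2 < len(codons) else 0.0)
            worst = max(left, right)
            if best is None or worst < best[0]:
                best = (worst, alt)
        codons[i + 1] = best[1]  # synonymous, so the protein is unchanged
    return "".join(codons)
```

Because every substitution is drawn from the synonymous codons of the same amino acid, `translate(reduce_high_z_pairs(cds, z))` equals `translate(cds)`, which mirrors the 100% amino acid sequence identity reported for the modified sequences.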
  • Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
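The observed and expected counts described above can be illustrated with a short sketch. The exact definitions of chisq1, chisq2 and chisq3 are given in Example 1 and are not reproduced in this passage, so the code below is an assumption: it takes the expected count of a pair to be the total number of pairs multiplied by the two individual codon frequencies, and uses the textbook chi-squared term (observed minus expected, squared, divided by expected). The function names are hypothetical.

```python
from collections import Counter


def split_codons(cds: str) -> list:
    """Split a coding sequence into codons, ignoring trailing partial codons."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_pair_statistics(coding_sequences) -> dict:
    """Return {(codon1, codon2): (observed, expected, chi_squared)}.

    Assumption: "used randomly" means a pair's expected count is
    total_pairs * freq(codon1) * freq(codon2). Pairs that never occur in the
    input are omitted from the result.
    """
    codon_counts = Counter()
    pair_counts = Counter()
    for cds in coding_sequences:
        codons = split_codons(cds.upper())
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))  # adjacent codon pairs

    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    stats = {}
    for (first, second), observed in pair_counts.items():
        expected = (total_pairs
                    * (codon_counts[first] / total_codons)
                    * (codon_counts[second] / total_codons))
        chi_squared = (observed - expected) ** 2 / expected
        stats[(first, second)] = (observed, expected, chi_squared)
    return stats
```

In practice the input would be the set of non-redundant coding regions for the host organism; the toy inputs used to exercise this sketch are far too small to give meaningful statistics.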
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for E. coli.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3A.
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 9) was found to encode a protein (SEQ ID NO: 10) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 9) encoding the Xyr protein (SEQ ID NO: 10) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3B.
  • This example describes optimization of a nucleotide sequence encoding Xyr for expression in P. pastoris.
  • Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for P. pastoris.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4A.
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 15) was found to encode a protein (SEQ ID NO: 16) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 15) encoding the Xyr protein (SEQ ID NO: 16) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4B.
  • Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for K. lactis.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5A.
  • the resulting nucleotide sequence (SEQ ID NO: 21) was found to encode a protein (SEQ ID NO: 22) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 21) encoding the Xyr protein (SEQ ID NO: 22) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5B.
  • This example describes optimization of a nucleotide sequence encoding Xyr for expression in Z. mobilis.
  • Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for Z. mobilis.
  • a graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6A.
  • the nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 23) was found to encode a protein (SEQ ID NO: 24) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 23) encoding the Xyr protein (SEQ ID NO: 24) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6B. EXAMPLE 6
  • Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
  • This example describes optimization of a nucleotide sequence encoding Xyl1 for expression in yeast.
  • the chi-squared value "chisq1" was generated from the expected and observed values determined.
  • the chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • the chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for S. cerevisiae.
  • the nucleotide sequence encoding Xyl1 (SEQ ID NO: 25) was derived from Genbank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in C. parapsilosis was prepared by plotting z scores of translational kinetics values for codon pair utilization in C. parapsilosis as a function of codon pair position.
  • the graphical display is provided in Figure 7.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position.
  • the graphical display is provided in Figure 8A.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 27) was found to encode a protein (SEQ ID NO: 28) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 27) encoding the Xyl1 protein (SEQ ID NO: 28) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 8B.
  • Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position.
  • the graphical display is provided in Figure 9A.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 33) was found to encode a protein (SEQ ID NO: 34) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 33) encoding the Xyl1 protein (SEQ ID NO: 34) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 9B.
  • This example describes optimization of a nucleotide sequence encoding Xyl1 for expression in P. pastoris.
  • Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for P. pastoris.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 10A.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 39) was found to encode a protein (SEQ ID NO: 40) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 39) encoding the Xyl1 protein (SEQ ID NO: 40) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 10B.
  • This example describes optimization of a nucleotide sequence encoding Xyl1 for expression in K. lactis.
  • Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for K. lactis.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 11A.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 45) was found to encode a protein (SEQ ID NO: 46) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 45) encoding the Xyl1 protein (SEQ ID NO: 46) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 11B.
  • This example describes optimization of a nucleotide sequence encoding Xyl1 for expression in Z. mobilis.
  • Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for Z. mobilis.
  • a graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 12A.
  • the nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 47) was found to encode a protein (SEQ ID NO: 48) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 47) encoding the Xyl1 protein (SEQ ID NO: 48) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 12B.
  • Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 8 and of native Xyl1 protein is examined by Western blot analysis.
  • Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) Φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG).
  • An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 µg/ml ampicillin and grown at 37°C to an OD600 of 0.5.
  • Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hr at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-Xyl1 antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
  • This example describes optimization of a nucleotide sequence encoding Xdh for expression in yeast.
  • the chi-squared value "chisq1" was generated by the expected and observed values determined.
  • the chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • the chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for S. cerevisiae.
  • the nucleotide sequence encoding Xdh (SEQ ID NO: 49) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in P. stipitis was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. stipitis as a function of codon pair position.
  • the graphical display is provided in Figure 13.
  • the graphical display is provided in Figure 14A.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 51) was found to encode a protein (SEQ ID NO: 52) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 51) encoding the Xdh protein (SEQ ID NO: 52) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 14B.
  • Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for E. coli.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 15A.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 57) was found to encode a protein (SEQ ID NO: 58) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 57) encoding the Xdh protein (SEQ ID NO: 58) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 15B.
  • This example describes optimization of a nucleotide sequence encoding Xdh for expression in P. pastoris.
  • Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for P. pastoris.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 16A.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 63) was found to encode a protein (SEQ ID NO: 64) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 63) encoding the Xdh protein (SEQ ID NO: 64) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 16B.
  • This example describes optimization of a nucleotide sequence encoding Xdh for expression in K. lactis.
  • Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for K. lactis.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 17A.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 63) was found to encode a protein (SEQ ID NO: 64) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 63) encoding the Xdh protein (SEQ ID NO: 64) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 17B.
  • This example describes optimization of a nucleotide sequence encoding Xdh for expression in Z. mobilis.
  • Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for Z. mobilis.
  • a graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 18A.
  • the nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3.
  • the resulting nucleotide sequence (SEQ ID NO: 21) was found to encode a protein (SEQ ID NO: 22) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50).
  • a graphical display for the codon pair utilization-modified gene (SEQ ID NO: 21) encoding the Xdh protein (SEQ ID NO: 22) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 18B.
  • Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-Xdh antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
  • the chi-squared value "chisq1" was generated by the expected and observed values determined.
  • the chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2."
  • the chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3."
  • z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 values.
  • the nucleotide sequence for the gene encoding the XKl protein was modified to optimize codon usage for S. cerevisiae.
  • the nucleotide sequence encoding XKI (SEQ ID NO: 73) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
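
The chi-squared and z-score steps named in the examples above can be sketched as follows. This is a minimal illustration and not the procedure of Example 1: it covers only a first-stage value (comparable to chisq1) and the z-score normalization, whereas the examples report z scores of chisq3. It assumes that the expected count of a codon pair is the product of the two single-codon frequencies scaled by the number of pairs, and that the chi-squared term carries a sign so that over-represented pairs score positive. All function names and the toy sequences are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def codon_pairs(orf):
    """Return the in-frame codons and the overlapping adjacent codon pairs of an ORF."""
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    pairs = [codons[i] + codons[i + 1] for i in range(len(codons) - 1)]
    return codons, pairs


def signed_chisq(orfs):
    """Signed (O - E)^2 / E per codon pair; E assumes random pairing (an assumption)."""
    codon_counts, pair_counts = Counter(), Counter()
    for orf in orfs:
        codons, pairs = codon_pairs(orf)
        codon_counts.update(codons)
        pair_counts.update(pairs)
    n_codons = sum(codon_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for pair, observed in pair_counts.items():
        left, right = pair[:3], pair[3:]
        expected = n_pairs * (codon_counts[left] / n_codons) * (codon_counts[right] / n_codons)
        sign = 1 if observed >= expected else -1
        scores[pair] = sign * (observed - expected) ** 2 / expected
    return scores


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values.values())
    sd = pstdev(values.values())
    if sd == 0:
        return {k: 0.0 for k in values}
    return {k: (v - mu) / sd for k, v in values.items()}


# Toy coding regions (hypothetical); a real run would use a host's non-redundant ORF set.
toy_orfs = ["ATGGCTGCTGCTGCTAAATAA", "ATGAAAGGTGCTGCTGGTTAA"]
z = z_scores(signed_chisq(toy_orfs))
flagged = sorted(pair for pair, score in z.items() if score > 3)
print(flagged)
```

With a data set this small the scores carry no biological meaning; the sketch only shows where the observed counts, expected counts, and normalization enter.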

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Engineering & Computer Science (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Microbiology (AREA)
  • Biotechnology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • General Chemical & Material Sciences (AREA)
  • Micro-Organisms Or Cultivation Processes Thereof (AREA)
  • Enzymes And Modification Thereof (AREA)

Abstract

Provided are polynucleotide sequences and synthetic genes encoding xylose- and arabinose-metabolizing enzymes for expression in a host organism with improved and/or refined translational kinetics, and methods of making same. The resultant xylose- and arabinose-metabolizing enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length. Expression of the resultant xylose- and arabinose-metabolizing enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression. In addition, expression of the resultant xylose- and arabinose-metabolizing enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded and functional polypeptide expression in cases where inappropriate or excessive translational pauses cause expression of inactive, insoluble, aggregated or otherwise dysfunctional or minimally active xylose- and arabinose-metabolizing enzyme.

Description

XYLOSE- AND ARABINOSE- METABOLIZING ENZYME -ENCODING NUCLEOTIDE SEQUENCES WITH REFINED TRANSLATIONAL KINETICS
AND METHODS OF MAKING SAME
BACKGROUND
Field of the Invention
[0001] The present invention relates to refining the translational kinetics of an mRNA into polypeptide, and polypeptide-encoding nucleotide sequences which have refined translational properties.
Description of the Related Art
[0002] Recent innovations have shown that enzymes can be useful for industrial applications. However, production of large amounts of functional enzyme is often limited. Despite the burgeoning knowledge of expression systems and recombinant DNA, significant obstacles remain when one attempts to express a foreign or synthetic gene in a non-native host organism. Often, a synthetic gene, even when coupled with a strong promoter, is inefficiently translated and can produce a low yield of protein, a faulty protein, or in many cases, low yields of an inactive protein. The same is frequently true of exogenous genes foreign to the expression organism. Even when the gene is translated in a sufficiently efficient manner that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in structural and activity properties from the native protein expressed in the native organism.
[0003] The Saccharomyces yeasts have proven to be safe, effective and user-friendly microorganisms for large-scale production of industrial ethanol from glucose-based feedstocks. Recently, efforts have been made to use cellulosic biomass as feedstock for producing ethanol. However, the major fermentable sugars from hydrolysis of these feedstocks (such as rice and wheat straw, sugarcane bagasse, corn stover, corn fibre, softwood, hardwood and grasses) are D-glucose, L-arabinose and D-xylose. The Saccharomyces yeasts are not able to use arabinose or xylose for growth or production of ethanol. There is a need for recombinant yeast and other microorganisms that can co-ferment glucose, arabinose and xylose simultaneously to ethanol through expression of the enzymes involved in the arabinose and xylose fermentation pathways. Such pathways have been identified in yeast, filamentous fungi and other eukaryotes. Related pathways utilizing distinct enzymes have been identified in bacteria.
[0004] Despite knowledge in the art related to expression of a foreign or synthetic gene in a host organism, many sugar catabolic enzymes do not express well in host organisms such as Escherichia coli or Saccharomyces cerevisiae. As a result, large-scale production is limited. Therefore, there is a continued need for improved expression of these enzymes.
SUMMARY
[0005] Some translational pauses are resultant from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, exogenous expression of genes foreign to the expression organism can lead to inefficient translation and poor expression. Even when the gene is translated in a sufficiently efficient manner that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pause structures coded for by specific di-codon nucleotide sequences in the open reading frame (ORF) can improve protein expression.
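
The idea of locating pause-associated di-codon sequences in an ORF can be illustrated with a short sketch. It assumes that a table of per-host codon pair z scores is already available; the table contents, the toy ORF, and the function name below are hypothetical, and the threshold of 3 is simply the cutoff used in the examples of this document.

```python
def flag_pause_pairs(orf, z_by_pair, threshold=3.0):
    """Return (first_nucleotide, codon_pair, z) for each in-frame pair above threshold.

    Nucleotide positions are 1-based, matching the "nucleotides 16-21" style
    used in the text. Pairs missing from the table are skipped.
    """
    hits = []
    for start in range(0, len(orf) - 5, 3):
        pair = orf[start:start + 6]
        z = z_by_pair.get(pair)
        if z is not None and z > threshold:
            hits.append((start + 1, pair, z))
    return hits


# Hypothetical z scores for a hypothetical host; not values from this document.
toy_scores = {"TTGAAC": 3.4, "GGTATT": 4.1, "ATGTTG": 0.2}
toy_orf = "ATGTTGAACGGTATTTAA"
hits = flag_pause_pairs(toy_orf, toy_scores)
print(hits)
```

Because adjacent codon pairs overlap by one codon, the scan advances three nucleotides at a time while reading six.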
[0006] In accordance with the above, provided herein are sugar catabolic enzyme-encoding nucleotide sequences with refined translational kinetics and methods of designing and synthesizing the same. In one embodiment is provided a sugar catabolic enzyme-encoding nucleotide sequence, wherein the encoded sequence has amino acid sequence identity with an original sugar catabolic enzyme polypeptide, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing original codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The resultant sugar catabolic enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length. Expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression. In addition, expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression products in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated enzyme.
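
Replacing a codon pair with a different pair that encodes identical amino acids can be sketched as a search over synonymous alternatives. The sketch below is one simple selection rule (take the synonymous pair with the lowest z score) and is not presented as the method used to produce the sequences of this document. It assumes the standard genetic code, a hypothetical z-score table, and a default score of 0 for pairs absent from that table; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_pairs(pair):
    """List every other codon pair encoding the same two amino acids."""
    aa1, aa2 = CODON_TABLE[pair[:3]], CODON_TABLE[pair[3:]]
    firsts = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    seconds = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return [a + b for a, b in product(firsts, seconds) if a + b != pair]


def best_replacement(pair, z_by_pair, default_z=0.0):
    """Pick the synonymous pair with the lowest z score (unknown pairs score default_z)."""
    candidates = synonymous_pairs(pair)
    if not candidates:
        return pair  # e.g. Met-Trp has no synonymous alternative
    return min(candidates, key=lambda p: z_by_pair.get(p, default_z))


# Hypothetical score table: only one alternative is scored below the default.
choice = best_replacement("TTGAAC", {"TTGAAC": 3.4, "TTAAAT": -1.0})
print(choice)
```

Changing one codon also changes the pairs it forms with its upstream and downstream neighbors, so a fuller implementation would re-score the flanking pairs after each substitution.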
[0007] Also provided herein are sugar catabolic enzyme-encoding nucleotide sequences, wherein the encoded sequence has amino acid sequence identity with an original sugar catabolic enzyme-encoding nucleotide sequence and is adapted for expression in a heterologous host organism, wherein at least 1, 2, or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein. In some embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0008] In some embodiments is provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGTATT (nucleotides 619-624); TTGAAC (nucleotides 16-21); TTGAAC (nucleotides 274-279); TTGAAC (nucleotides 670-675); TTGAAC (nucleotides 688-693); CTTTCT (nucleotides 286-291); GCCATT (nucleotides 181-186); TCTCCA (nucleotides 697-702); TCTCCA (nucleotides 751-756); ATCAAG (nucleotides 103-108); ATCAAG (nucleotides 541-546); ATCAAG (nucleotides 721-726); GCCAAG (nucleotides 889-894). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGTATT (nucleotides 619-624) replaced with GGAATT; TTGAAC (nucleotides 16-21) replaced with TTAAAT; TTGAAC (nucleotides 274-279) replaced with CTAAAT; TTGAAC (nucleotides 670-675) replaced with TTAAAT; TTGAAC (nucleotides 688-693) replaced with TTAAAT; CTTTCT (nucleotides 286-291) replaced with CTATCT; GCCATT (nucleotides 181-186) replaced with GCTATT; TCTCCA (nucleotides 697-702) replaced with TCACCA; TCTCCA (nucleotides 751-756) replaced with TCACCA; ATCAAG (nucleotides 103-108) replaced with ATTAAA; ATCAAG (nucleotides 541-546) replaced with ATTAAA; ATCAAG (nucleotides 721-726) replaced with ATTAAG; GCCAAG (nucleotides 889-894) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
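
The replacements listed above for S. cerevisiae can be checked mechanically for two properties: that each stated nucleotide range begins on a codon boundary, and that each replacement pair encodes the same two amino acids as the original pair. The sketch below does this under the assumption of the standard genetic code; the start positions and sequences are copied from the paragraph above, while the helper names are hypothetical.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# (first nucleotide, original codon pair, replacement codon pair)
REPLACEMENTS = [
    (619, "GGTATT", "GGAATT"),
    (16, "TTGAAC", "TTAAAT"),
    (274, "TTGAAC", "CTAAAT"),
    (670, "TTGAAC", "TTAAAT"),
    (688, "TTGAAC", "TTAAAT"),
    (286, "CTTTCT", "CTATCT"),
    (181, "GCCATT", "GCTATT"),
    (697, "TCTCCA", "TCACCA"),
    (751, "TCTCCA", "TCACCA"),
    (103, "ATCAAG", "ATTAAA"),
    (541, "ATCAAG", "ATTAAA"),
    (721, "ATCAAG", "ATTAAG"),
    (889, "GCCAAG", "GCTAAA"),
]


def translate(seq):
    """Translate an in-frame nucleotide string into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def check(start, original, replacement):
    """True if the pair starts on a codon boundary and the swap is synonymous."""
    in_frame = (start - 1) % 3 == 0
    synonymous = translate(original) == translate(replacement)
    return in_frame and synonymous


results = {(s, o, r): check(s, o, r) for s, o, r in REPLACEMENTS}
for (start, original, replacement), ok in sorted(results.items()):
    print(start, original, "->", replacement, translate(original), ok)
```

The same check could be run against the replacement lists given for the other host organisms by swapping in their entries.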
[0009] In some embodiments is provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAAGAT (nucleotides 136-141); CTTTCT (nucleotides 286-291); GAAGAT (nucleotides 415-420); ATTGCC (nucleotides 793-798); ATTGCC (nucleotides 886-891); GACTGG (nucleotides 928-933). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAAGAT (nucleotides 136-141) replaced with GAAGAT; CTTTCT (nucleotides 286-291) replaced with CTATCT; GAAGAT (nucleotides 415-420) replaced with GAAGAT; ATTGCC (nucleotides 793-798) replaced with ATCGCT; ATTGCC (nucleotides 886-891) replaced with ATAGCT; GACTGG (nucleotides 928-933) replaced with GATTGG. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0010] In some embodiments is provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TCCAAG (nucleotides 226-231); ATCAAG (nucleotides 103-108); ATCAAG (nucleotides 541-546); ATCAAG (nucleotides 721-726); TTCAAG (nucleotides 343-348); TTCAAC (nucleotides 913-918); ATCAAC (nucleotides 901-906); GGTATT (nucleotides 619-624); GTCAAG (nucleotides 172-177); GTCAAG (nucleotides 199-204); GTCAAG (nucleotides 460-465); GACGAA (nucleotides 187-192); GACGAA (nucleotides 865-870); GGTATC (nucleotides 193-198); CCAAGA (nucleotides 589-594); CCAAGA (nucleotides 823-828); TTGAAC (nucleotides 16-21); TTGAAC (nucleotides 274-279); TTGAAC (nucleotides 670-675); TTGAAC (nucleotides 688-693). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TCCAAG (nucleotides 226-231) replaced with TCTAAA; ATCAAG (nucleotides 103-108) replaced with ATTAAA; ATCAAG (nucleotides 541-546) replaced with ATTAAA; ATCAAG (nucleotides 721-726) replaced with ATTAAG; TTCAAG (nucleotides 343-348) replaced with TTTAAA; TTCAAC (nucleotides 913-918) replaced with TTTAAT; ATCAAC (nucleotides 901-906) replaced with ATTAAT; GGTATT (nucleotides 619-624) replaced with GGAATT; GTCAAG (nucleotides 172-177) replaced with GTTAAA; GTCAAG (nucleotides 199-204) replaced with GTTAAA; GTCAAG (nucleotides 460-465) replaced with GTTAAA; GACGAA (nucleotides 187-192) replaced with GATGAA; GACGAA (nucleotides 865-870) replaced with GATGAA; GGTATC (nucleotides 193-198) replaced with GGAATT; CCAAGA (nucleotides 589-594) replaced with CCTAGA; CCAAGA (nucleotides 823-828) replaced with CCTCGT; TTGAAC (nucleotides 16-21) replaced with TTAAAT; TTGAAC (nucleotides 274-279) replaced with CTAAAT; TTGAAC (nucleotides 670-675) replaced with TTAAAT; TTGAAC (nucleotides 688-693) replaced with TTAAAT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0011] In some embodiments is provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAC (nucleotides 16-21); AAGAAG (nucleotides 175-180); GCCATT (nucleotides 181-186); GGTATC (nucleotides 193-198); TTGAAC (nucleotides 274-279); CTTTCT (nucleotides 286-291); TTCCCA (nucleotides 331-336); TTCCCA (nucleotides 499-504); TTGAAC (nucleotides 670-675); TTGAAC (nucleotides 688-693); GCCAAG (nucleotides 889-894). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAC (nucleotides 16-21) replaced with TTAAAC; AAGAAG (nucleotides 175-180) replaced with AAAAAG; GCCATT (nucleotides 181-186) replaced with GCTATT; GGTATC (nucleotides 193-198) replaced with GGAATT; TTGAAC (nucleotides 274-279) replaced with TTAAAT; CTTTCT (nucleotides 286-291) replaced with TTATCT; TTCCCA (nucleotides 331-336) replaced with TTTCCA; TTCCCA (nucleotides 499-504) replaced with TTTCCA; TTGAAC (nucleotides 670-675) replaced with TTAAAT; TTGAAC (nucleotides 688-693) replaced with TTAAAT; GCCAAG (nucleotides 889-894) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0012] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GCCGGT (nucleotides 166 - 171); GGTATC (nucleotides 193 - 198); GCCTTG (nucleotides 271 - 276); GCCGGT (nucleotides 466 - 471); GCTTTG (nucleotides 508 - 513); GGTATT (nucleotides 619 - 624); GCTTTG (nucleotides 685 - 690); AACAGC (nucleotides 850 - 855); GCCAAG (nucleotides 889 - 894). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GCCGGT (nucleotides 166 - 171) replaced with GCTGGT; GGTATC (nucleotides 193 - 198) replaced with GGCATT; GCCTTG (nucleotides 271 - 276) replaced with GCCCTT; GCCGGT (nucleotides 466 - 471) replaced with GCTGGT; GCTTTG (nucleotides 508 - 513) replaced with GCGTTG; GGTATT (nucleotides 619 - 624) replaced with GGCATT; GCTTTG (nucleotides 685 - 690) replaced with GCTCTT; AACAGC (nucleotides 850 - 855) replaced with AATTCT; GCCAAG (nucleotides 889 - 894) replaced with GCCAAA. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0013] Also provided herein is a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
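The threshold test for a highly-overrepresented codon pair can be sketched in a few lines of Python. This is an illustrative reading only: it assumes that each codon pair already has a translational kinetics value for the host, and it compares that value directly with k times the population standard deviation of all values. How the values themselves are computed is not described in this paragraph, and the numbers below are hypothetical.

```python
from statistics import pstdev


def highly_overrepresented(values: dict[str, float], k: float = 2.0) -> list[str]:
    """Return codon pairs whose value exceeds k times the standard deviation.

    `values` maps codon pairs to translational kinetics values for one host.
    """
    sd = pstdev(list(values.values()))
    return sorted(pair for pair, v in values.items() if v > k * sd)


# Hypothetical values for illustration only; not data from the disclosure.
example_values = {
    "TTGAAC": 6.0,
    "TTAAAT": 1.0,
    "GGTATC": -1.0,
    "GGAATT": 1.0,
    "GCCAAG": -1.0,
}

flagged_at_2 = highly_overrepresented(example_values, k=2.0)
flagged_at_2_5 = highly_overrepresented(example_values, k=2.5)
```

With these toy numbers the standard deviation is about 2.56, so the pair valued at 6.0 is flagged at k = 2 but not at k = 2.5, which shows how the choice among the recited multipliers (5, 3, 2.5 or 2) changes which pairs qualify.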
[0014] Also provided herein is a xylose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0015] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0016] In some embodiments, provided herein is a system for metabolizing xylose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the xylose reductase retains at least 75% of the enzymatic activity of wild-type Xyr (SEQ ID NO: 2) under normal physiological conditions.
[0017] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 5-301 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 5-301 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 5-301 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 5-301 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TCCAAG when expressed in the native organism.
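Two bookkeeping steps implied by this paragraph are mapping a codon pair's nucleotide coordinates to the amino acid positions it encodes, and comparing a replacement z score with a percentage of a wild-type z score. A minimal Python sketch of both follows. The helper names are hypothetical, the rule that both residues must fall inside the stated amino acid range is an assumed reading, and the z scores in the usage note are invented for illustration.

```python
def codon_pair_residues(start_nt: int) -> tuple[int, int]:
    """Map the 1-based first nucleotide of a codon pair to its two residue numbers."""
    if (start_nt - 1) % 3 != 0:
        raise ValueError("start_nt is not on a codon boundary")
    first = (start_nt + 2) // 3
    return first, first + 1


def encodes_within(start_nt: int, first_aa: int, last_aa: int) -> bool:
    """True if both residues of the codon pair lie within [first_aa, last_aa]."""
    a, b = codon_pair_residues(start_nt)
    return first_aa <= a and b <= last_aa


def within_percent(z_replacement: float, z_wild_type: float, percent: float) -> bool:
    """True if z_replacement is no more than `percent` percent of z_wild_type."""
    return z_replacement <= (percent / 100.0) * z_wild_type
```

For example, the codon pair at nucleotides 16 - 21 encodes residues 6 and 7, so `encodes_within(16, 5, 301)` is true, and a replacement with a hypothetical z score of 3.0 satisfies the 150% limit against a wild-type z score of 2.5 because 3.0 does not exceed 3.75. The comparison assumes positive z scores; negative values would need a separately stated convention.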
[0018] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-5 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-5 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair CCTTCT when expressed in the native organism.
[0019] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 382 - 387); TTGAAG (nucleotides 694 - 699); ATCAAA (nucleotides 190 - 195); TTGAAC (nucleotides 34 - 39); TTGAAC (nucleotides 313 - 318); GCCATT (nucleotides 901 - 906); GCTACT (nucleotides 10 - 15); ATCAAG (nucleotides 121 - 126); ATCAAG (nucleotides 202 - 207); ATCAAG (nucleotides 559 - 564). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 382 - 387) replaced with AAAAAG; TTGAAG (nucleotides 694 - 699) replaced with TTAAAA; ATCAAA (nucleotides 190 - 195) replaced with ATTAAA; TTGAAC (nucleotides 34 - 39) replaced with TTAAAT; TTGAAC (nucleotides 313 - 318) replaced with TTAAAT; GCCATT (nucleotides 901 - 906) replaced with GCTATA; GCTACT (nucleotides 10 - 15) replaced with GCTACC; ATCAAG (nucleotides 121 - 126) replaced with ATTAAA; ATCAAG (nucleotides 202 - 207) replaced with ATTAAA; ATCAAG (nucleotides 559 - 564) replaced with ATTAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0020] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAAGAG (nucleotides 226 - 231); ATTGCC (nucleotides 748 - 753); ATTGCC (nucleotides 904 - 909). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAAGAG (nucleotides 226 - 231) replaced with GAAGAA; ATTGCC (nucleotides 748 - 753) replaced with ATTGCG; ATTGCC (nucleotides 904 - 909) replaced with ATCGCG. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAAGAG (nucleotides 226 - 231); ACCTGG (nucleotides 454 - 459); TTGCAG (nucleotides 574 - 579); ATTGCC (nucleotides 748 - 753); TTGCAG (nucleotides 895 - 900); ATTGCC (nucleotides 904 - 909). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAAGAG (nucleotides 226 - 231) replaced with GAAGAA; ACCTGG (nucleotides 454 - 459) replaced with ACTTGG; TTGCAG (nucleotides 574 - 579) replaced with CTCCAG; ATTGCC (nucleotides 748 - 753) replaced with ATTGCG; TTGCAG (nucleotides 895 - 900) replaced with CTCCAG; ATTGCC (nucleotides 904 - 909) replaced with ATCGCG.
In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0021] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 382 - 387); TCCAAG (nucleotides 244 - 249); ATCAAG (nucleotides 121 - 126); ATCAAG (nucleotides 202 - 207); ATCAAG (nucleotides 559 - 564); TTCAAC (nucleotides 931 - 936); ATCAAA (nucleotides 190 - 195); GTCAAG (nucleotides 217 - 222); GTCAAG (nucleotides 739 - 744); GGTATC (nucleotides 187 - 192); GGTATC (nucleotides 505 - 510); CCAAGA (nucleotides 823 - 828); TTGAAC (nucleotides 34 - 39); TTGAAC (nucleotides 313 - 318). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 382 - 387) replaced with AAGAAG; TCCAAG (nucleotides 244 - 249) replaced with TCTAAA; ATCAAG (nucleotides 121 - 126) replaced with ATTAAA; ATCAAG (nucleotides 202 - 207) replaced with ATCAAA; ATCAAG (nucleotides 559 - 564) replaced with ATCAAA; TTCAAC (nucleotides 931 - 936) replaced with TTCAAC; ATCAAA (nucleotides 190 - 195) replaced with ATCAAA; GTCAAG (nucleotides 217 - 222) replaced with GTTAAA; GTCAAG (nucleotides 739 - 744) replaced with GTTAAA; GGTATC (nucleotides 187 - 192) replaced with GGTATC; GGTATC (nucleotides 505 - 510) replaced with GGTATC; CCAAGA (nucleotides 823 - 828) replaced with CCGCGC; TTGAAC (nucleotides 34 - 39) replaced with CTGAAC; TTGAAC (nucleotides 313 - 318) replaced with CTGAAC. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0022] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAC (nucleotides 34 - 39); GGTATC (nucleotides 187 - 192); ATCAAA (nucleotides 190 - 195); AAGAAG (nucleotides 271 - 276); TTGAAC (nucleotides 313 - 318); TTCCCA (nucleotides 349 - 354); AAGAAA (nucleotides 382 - 387); GGTATC (nucleotides 505 - 510); TTGAAG (nucleotides 694 - 699); GCCATT (nucleotides 901 - 906). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAC (nucleotides 34 - 39) replaced with TTAAAT; GGTATC (nucleotides 187 - 192) replaced with GGAATT; ATCAAA (nucleotides 190 - 195) replaced with ATTAAA; AAGAAG (nucleotides 271 - 276) replaced with AAAAAA; TTGAAC (nucleotides 313 - 318) replaced with TTAAAT; TTCCCA (nucleotides 349 - 354) replaced with TTTCCA; AAGAAA (nucleotides 382 - 387) replaced with AAAAAA; GGTATC (nucleotides 505 - 510) replaced with GGAATC; TTGAAG (nucleotides 694 - 699) replaced with TTAAAA; GCCATT (nucleotides 901 - 906) replaced with GCTATC. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0023] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 187 - 192); GAAGGC (nucleotides 208 - 213); GCTTTG (nucleotides 289 - 294); GCTTTG (nucleotides 463 - 468); GGTATC (nucleotides 505 - 510); GCCTTG (nucleotides 571 - 576); GCCTTG (nucleotides 703 - 708). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGTATC (nucleotides 187 - 192) replaced with GGGATT; GAAGGC (nucleotides 208 - 213) replaced with GAAGGG; GCTTTG (nucleotides 289 - 294) replaced with GCCCTT; GCTTTG (nucleotides 463 - 468) replaced with GCCCTT; GGTATC (nucleotides 505 - 510) replaced with GGCATT; GCCTTG (nucleotides 571 - 576) replaced with GCCTTA; GCCTTG (nucleotides 703 - 708) replaced with GCATTG. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0024] Also provided herein is a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0025] Also provided herein is a xylose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0026] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0027] In some embodiments, provided herein is a system for metabolizing xylose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the xylose reductase retains at least 75% of the enzymatic activity of wild-type Xyl1 (SEQ ID NO: 26) under normal physiological conditions.
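The 75% amino acid sequence identity criterion recited above can be illustrated with a short Python sketch. It assumes the two protein sequences are already aligned, have equal length and contain no gaps; in practice an alignment tool would be used first, and the disclosure does not specify one in this section. The function names are hypothetical, and the 8-residue strings in the usage note are toy inputs, not the disclosed sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of positions with identical residues in two aligned sequences."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


def meets_identity(aligned_a: str, aligned_b: str, threshold: float = 75.0) -> bool:
    """True if the percent identity is at least `threshold`."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For instance, two toy 8-residue strings that differ at two positions, such as `"MPSIKLNS"` and `"MPSIKLAA"`, have 75% identity and would just meet the threshold.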
[0028] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 11-306 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 11-306 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 11-306 when expressed in the native organism.
[0029] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 25 and which encode amino acids 1-11 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-11 of SEQ ID NO: 26 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-11 when expressed in the native organism.
[0030] In some embodiments are provided a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 106 - 111); TTGAAG (nucleotides 637 - 642); CTTTTG (nucleotides 565 - 570); GGTATT (nucleotides 277 - 282); TTGAAC (nucleotides 25 - 30); ACTTTG (nucleotides 880 - 885); GCCATT (nucleotides 790 - 795); GCTACT (nucleotides 349 - 354); GCTACT (nucleotides 664 - 669); ATCAAG (nucleotides 709 - 714); ATCAAG (nucleotides 772 - 777); GCCAAG (nucleotides 583 - 588); GCCAAG (nucleotides 646 - 651). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 106 - 111) replaced with AAAAAG; TTGAAG (nucleotides 637 - 642) replaced with TTAAAA; CTTTTG (nucleotides 565 - 570) replaced with TTGTTG; GGTATT (nucleotides 277 - 282) replaced with GGAATA; TTGAAC (nucleotides 25 - 30) replaced with TTAAAT; ACTTTG (nucleotides 880 - 885) replaced with ACATTG; GCCATT (nucleotides 790 - 795) replaced with GCTATT; GCTACT (nucleotides 349 - 354) replaced with GCTACC; GCTACT (nucleotides 664 - 669) replaced with GCAACT; ATCAAG (nucleotides 709 - 714) replaced with ATTAAA; ATCAAG (nucleotides 772 - 777) replaced with ATTAAA; GCCAAG (nucleotides 583 - 588) replaced with GCTAAA; GCCAAG (nucleotides 646 - 651) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0031] In some embodiments are provided a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CCTTCC (nucleotides 13 - 18); AAGAAA (nucleotides 106 - 111); GTCAGC (nucleotides 448 - 453); CTCGGT (nucleotides 460 - 465); GTTGCC (nucleotides 535 - 540); TTTGGT (nucleotides 544 - 549); GCTGAA (nucleotides 760 - 765); ATTGCC (nucleotides 793 - 798); GTCAGC (nucleotides 841 - 846). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CCTTCC (nucleotides 13 - 18) replaced with CCATCT; AAGAAA (nucleotides 106 - 111) replaced with AAAAAG; GTCAGC (nucleotides 448 - 453) replaced with GTTTCA; CTCGGT (nucleotides 460 - 465) replaced with TTGGGT; GTTGCC (nucleotides 535 - 540) replaced with GTTGCT; TTTGGT (nucleotides 544 - 549) replaced with TTCGGT; GCTGAA (nucleotides 760 - 765) replaced with GCTGAG; ATTGCC (nucleotides 793 - 798) replaced with ATTGCT; GTCAGC (nucleotides 841 - 846) replaced with GTATCT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0032] In some embodiments are provided a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AAGAAA (nucleotides 106 - 111); TCCAAG (nucleotides 361 - 366); TCCAAG (nucleotides 502 - 507); TCCAAG (nucleotides 682 - 687); ATCAAG (nucleotides 709 - 714); ATCAAG (nucleotides 772 - 777); TTCAAG (nucleotides 406 - 411); TTCAAG (nucleotides 1012 - 1017); CTTTTG (nucleotides 565 - 570); TTCAAC (nucleotides 676 - 681); TTCAAC (nucleotides 907 - 912); GGTATT (nucleotides 277 - 282); GTCAAG (nucleotides 103 - 108); GTCAAG (nucleotides 430 - 435); GTCAAG (nucleotides 1063 - 1068); GACGAA (nucleotides 298 - 303); GGTATC (nucleotides 115 - 120); TTGAAC (nucleotides 25 - 30); TTTGAC (nucleotides 937 - 942). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AAGAAA (nucleotides 106 - 111) replaced with AAAAAG; TCCAAG (nucleotides 361 - 366) replaced with TCTAAA; TCCAAG (nucleotides 502 - 507) replaced with TCTAAA; TCCAAG (nucleotides 682 - 687) replaced with TCTAAA; ATCAAG (nucleotides 709 - 714) replaced with ATTAAA; ATCAAG (nucleotides 772 - 777) replaced with ATTAAA; TTCAAG (nucleotides 406 - 411) replaced with TTTAAA; TTCAAG (nucleotides 1012 - 1017) replaced with TTTAAA; CTTTTG (nucleotides 565 - 570) replaced with TTGTTG; TTCAAC (nucleotides 676 - 681) replaced with TTTAAT; TTCAAC (nucleotides 907 - 912) replaced with TTTAAT; GGTATT (nucleotides 277 - 282) replaced with GGAATA; GTCAAG (nucleotides 103 - 108) replaced with GTTAAA; GTCAAG (nucleotides 430 - 435) replaced with GTTAAA; GTCAAG (nucleotides 1063 - 1068) replaced with GTTAAA; GACGAA (nucleotides 298 - 303) replaced with GATGAA; GGTATC (nucleotides 115 - 120) replaced with GGAATT; TTGAAC (nucleotides 25 - 30) replaced with TTAAAT; TTTGAC (nucleotides 937 - 942) replaced with TTCGAT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0033] In some embodiments is provided a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAC (nucleotides 25 - 30); AAGAAA (nucleotides 106 - 111); GGTATC (nucleotides 115 - 120); GGTACC (nucleotides 388 - 393); CTTTTG (nucleotides 565 - 570); GCCAAG (nucleotides 583 - 588); TTGAAG (nucleotides 637 - 642); GCCAAG (nucleotides 646 - 651); GCCATT (nucleotides 790 - 795); TTCCCA (nucleotides 847 - 852). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAC (nucleotides 25 - 30) replaced with TTAAAT; AAGAAA (nucleotides 106 - 111) replaced with AAAAAG; GGTATC (nucleotides 115 - 120) replaced with GGAATC; GGTACC (nucleotides 388 - 393) replaced with GGTACA; CTTTTG (nucleotides 565 - 570) replaced with CTCTTG; GCCAAG (nucleotides 583 - 588) replaced with GCTAAA; TTGAAG (nucleotides 637 - 642) replaced with TTAAAG; GCCAAG (nucleotides 646 - 651) replaced with GCTAAA; GCCATT (nucleotides 790 - 795) replaced with GCAATC; TTCCCA (nucleotides 847 - 852) replaced with TTCCCT. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
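By way of illustration only, the requirement that a replacement codon pair encode identical amino acids can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code applies to the host; the names CODON_TABLE, translate, is_silent and K_LACTIS_XDH_EXAMPLES are hypothetical and do not appear in the specification. The hexamers are taken from the replacement list for K. lactis above.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (assumption: the host uses the standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string into one-letter amino acid codes."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_silent(wild_type: str, replacement: str) -> bool:
    """True if both codon pairs encode the same two amino acids."""
    return translate(wild_type) == translate(replacement)


# (wild-type pair, replacement pair) as listed for K. lactis above.
K_LACTIS_XDH_EXAMPLES = [
    ("TTGAAC", "TTAAAT"),
    ("AAGAAA", "AAAAAG"),
    ("GGTATC", "GGAATC"),
    ("GCCAAG", "GCTAAA"),
    ("TTCCCA", "TTCCCT"),
]

for wt, new in K_LACTIS_XDH_EXAMPLES:
    print(wt, "->", new, translate(wt), "silent" if is_silent(wt, new) else "changed")
```

A conservative amino acid substitution would fail this check by design; detecting one would require a separate substitution table, which is not sketched here.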
[0034] In some embodiments is provided a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GATGCC (nucleotides 61 - 66); GGTATC (nucleotides 115 - 120); GCCGGT (nucleotides 205 - 210); GGTATT (nucleotides 277 - 282); GAAGGC (nucleotides 367 - 372); GCCAAG (nucleotides 583 - 588); GCCAAG (nucleotides 646 - 651); ACTTTG (nucleotides 880 - 885); GCTATT (nucleotides 1021 - 1026); GAAGCC (nucleotides 1027 - 1032); GTCAGA (nucleotides 1042 - 1047); GCCGGT (nucleotides 1048 - 1053). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GATGCC (nucleotides 61 - 66) replaced with GATGCT; GGTATC (nucleotides 115 - 120) replaced with GGCATT; GCCGGT (nucleotides 205 - 210) replaced with GCTGGA; GGTATT (nucleotides 277 - 282) replaced with GGCATT; GAAGGC (nucleotides 367 - 372) replaced with GAAGGT; GCCAAG (nucleotides 583 - 588) replaced with GCTAAA; GCCAAG (nucleotides 646 - 651) replaced with GCCAAA; ACTTTG (nucleotides 880 - 885) replaced with ACCTTG; GCTATT (nucleotides 1021 - 1026) replaced with GCGATT; GAAGCC (nucleotides 1027 - 1032) replaced with GAGGCT; GTCAGA (nucleotides 1042 - 1047) replaced with GTTCGT; GCCGGT (nucleotides 1048 - 1053) replaced with GCTGGA. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
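By way of illustration only, the nucleotide ranges recited above are 1-based, span six nucleotides, and begin on a codon boundary, so each range maps directly to a pair of adjacent codons. The following minimal sketch shows that mapping and a guarded in-place replacement. The function names and the toy sequence are hypothetical; the toy sequence is not SEQ ID NO: 49 and is constructed only so that GATGCC sits at nucleotides 61 - 66.

```python
def codon_numbers(first_nt: int, last_nt: int) -> tuple[int, int]:
    """Map a 1-based, six-nucleotide range to the 1-based numbers of its two codons."""
    if last_nt - first_nt != 5 or (first_nt - 1) % 3 != 0:
        raise ValueError("range must cover six in-frame nucleotides")
    first_codon = (first_nt - 1) // 3 + 1
    return first_codon, first_codon + 1


def replace_pair(seq: str, first_nt: int, expected: str, new: str) -> str:
    """Swap one codon pair, refusing to edit if the wild-type hexamer is not found there."""
    start = first_nt - 1  # convert to 0-based index
    if len(expected) != 6 or len(new) != 6:
        raise ValueError("codon pairs must be six nucleotides long")
    if seq[start:start + 6] != expected:
        raise ValueError(f"expected {expected} at nucleotide {first_nt}")
    return seq[:start] + new + seq[start + 6:]


# Hypothetical toy open reading frame: ATG, 19 filler codons, GATGCC, stop.
toy = "ATG" + "GCT" * 19 + "GATGCC" + "TAA"

print(codon_numbers(61, 66))
edited = replace_pair(toy, 61, "GATGCC", "GATGCT")
print(edited[60:66])
```

Checking the expected wild-type hexamer before editing guards against off-by-one errors in the coordinates.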
[0035] Also provided herein is a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
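By way of illustration only, the threshold test for a highly-overrepresented codon pair can be sketched as follows. The translational kinetics values shown are invented for the example and are not taken from any host organism; the use of the population standard deviation is likewise an assumption, since the paragraph above does not specify which estimator is used. The function name is hypothetical.

```python
from statistics import pstdev


def highly_overrepresented(values: dict[str, float], k: float = 2.0) -> set[str]:
    """Return codon pairs whose value exceeds k times the standard deviation
    of all values supplied for the host (population SD is an assumption)."""
    sd = pstdev(list(values.values()))
    return {pair for pair, value in values.items() if value > k * sd}


# Hypothetical translational kinetics values for six codon pairs in one host.
example_values = {
    "GCCAAG": 6.0,
    "GCTAAA": -0.5,
    "TTGAAC": 1.0,
    "TTAAAT": -1.5,
    "GGTATC": 0.5,
    "GGAATC": -1.0,
}

for k in (2.0, 2.5, 3.0, 5.0):
    print(k, sorted(highly_overrepresented(example_values, k)))
```

In practice the standard deviation would be computed over all codon pairs of the host organism rather than over a handful of pairs as in this toy example.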
[0036] Also provided herein is a xylitol dehydrogenase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0037] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0038] In some embodiments, provided herein is a system for metabolizing xylose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the xylitol dehydrogenase retains at least 75% of the enzymatic activity of wild-type Xdh (SEQ ID NO: 50) under normal physiological conditions.
[0039] In some embodiments is provided a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 28-146 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 28-146 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 28-146 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 28-146 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair AAGAAA when expressed in the native organism.
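By way of illustration only, the percentage caps on z scores recited above can be sketched as a simple comparison against a reference value derived from the five highest wild-type z scores. All z scores below are invented for the example, the function names are hypothetical, and the sketch assumes the reference value is positive so that a percentage of it is meaningful.

```python
from statistics import mean, median


def top_five_reference(wild_type_z: list[float], how: str = "mean") -> float:
    """Mean or median of the five highest wild-type z scores in a segment."""
    top = sorted(wild_type_z, reverse=True)[:5]
    return mean(top) if how == "mean" else median(top)


def within_cap(replacement_z: float, reference_z: float, percent: float) -> bool:
    """True if the replacement z score is no more than `percent` of the reference."""
    return replacement_z <= reference_z * percent / 100.0


# Hypothetical z scores of wild-type codon pairs in the native organism.
wild_type_z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.5]
# Hypothetical z scores of replacement codon pairs in the heterologous host.
replacement_z = [0.8, 2.5, 5.0]

reference = top_five_reference(wild_type_z, how="mean")
print(reference)
print(all(within_cap(z, reference, 150) for z in replacement_z))
```

The "no replacement codon ... more than" language corresponds to requiring the cap to hold for every replacement in the segment, which is why the check uses all().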
[0040] In some embodiments is provided a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 175-314 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 175-314 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 175-314 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 175-314 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair TCCAAG when expressed in the native organism.
[0041] In some embodiments is provided a xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 146-175 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 146-175 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 146-175 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair TCCAAG when expressed in the native organism.
[0042] In some embodiments is provided a D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 1858 - 1863); TTGAAG (nucleotides 67 - 72); TTGAAG (nucleotides 793 - 798); GAAAGT (nucleotides 1849 - 1854); GGTATT (nucleotides 283 - 288); GGTATT (nucleotides 1213 - 1218); GGGTTC (nucleotides 43 - 48); TTGAAC (nucleotides 1276 - 1281); ACTTTG (nucleotides 1366 - 1371); GCCATT (nucleotides 190 - 195); GATATC (nucleotides 490 - 495); GATATC (nucleotides 679 - 684); TCTCAA (nucleotides 1021 - 1026); TTCCCC (nucleotides 262 - 267); ATCAAG (nucleotides 1261 - 1266); ATCAAG (nucleotides 1606 - 1611); GCCAAG (nucleotides 1717 - 1722); GCCAAG (nucleotides 1840 - 1845). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAA (nucleotides 1858 - 1863) replaced with TTAAAA; TTGAAG (nucleotides 67 - 72) replaced with TTAAAA; TTGAAG (nucleotides 793 - 798) replaced with TTAAAA; GAAAGT (nucleotides 1849 - 1854) replaced with GAATCA; GGTATT (nucleotides 283 - 288) replaced with GGAATT; GGTATT (nucleotides 1213 - 1218) replaced with GGAATT; GGGTTC (nucleotides 43 - 48) replaced with GGTTTT; TTGAAC (nucleotides 1276 - 1281) replaced with TTAAAT; ACTTTG (nucleotides 1366 - 1371) replaced with ACTCTA; GCCATT (nucleotides 190 - 195) replaced with GCTATT; GATATC (nucleotides 490 - 495) replaced with GATATA; GATATC (nucleotides 679 - 684) replaced with GACATT; TCTCAA (nucleotides 1021 - 1026) replaced with TCACAA; TTCCCC (nucleotides 262 - 267) replaced with TTTCCA; ATCAAG (nucleotides 1261 - 1266) replaced with ATTAAG; ATCAAG (nucleotides 1606 - 1611) replaced with ATTAAA; GCCAAG (nucleotides 1717 - 1722) replaced with GCTAAA; GCCAAG (nucleotides 1840 - 1845) replaced with GCTAAG. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0043] In some embodiments is provided a D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAAGAG (nucleotides 451 - 456); GAAGAG (nucleotides 703 - 708); TTCCTC (nucleotides 37 - 42); GCCAGT (nucleotides 613 - 618); GCCAGT (nucleotides 1693 - 1698); AAAGAG (nucleotides 442 - 447); GCCAGA (nucleotides 1099 - 1104); GCCAGA (nucleotides 1552 - 1557); AGCCAG (nucleotides 379 - 384); ATTGCC (nucleotides 847 - 852); GCCTGT (nucleotides 1666 - 1671). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAAGAG (nucleotides 451 - 456) replaced with GAAGAA; GAAGAG (nucleotides 703 - 708) replaced with GAAGAA; TTCCTC (nucleotides 37 - 42) replaced with TTCCTG; GCCAGT (nucleotides 613 - 618) replaced with GCGTCT; GCCAGT (nucleotides 1693 - 1698) replaced with GCTAGC; AAAGAG (nucleotides 442 - 447) replaced with AAAGAA; GCCAGA (nucleotides 1099 - 1104) replaced with GCTCGT; GCCAGA (nucleotides 1552 - 1557) replaced with GCTCGT; AGCCAG (nucleotides 379 - 384) replaced with TCTCAG; ATTGCC (nucleotides 847 - 852) replaced with ATCGCG; GCCTGT (nucleotides 1666 - 1671) replaced with GCTTGC. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0044] In some embodiments is provided a D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TCGTTG (nucleotides 934 - 939); GATATC (nucleotides 490 - 495); GATATC (nucleotides 679 - 684); ATCAAG (nucleotides 1261 - 1266); ATCAAG (nucleotides 1606 - 1611); AAGTTT (nucleotides 1498 - 1503); TTCAAG (nucleotides 403 - 408); TTCAAG (nucleotides 556 - 561); TTGAAA (nucleotides 1858 - 1863); TTCAAC (nucleotides 268 - 273); TTCAAC (nucleotides 697 - 702); TTCAAC (nucleotides 877 - 882); TTCAAC (nucleotides 1198 - 1203); ATCAAC (nucleotides 133 - 138); ATCAAC (nucleotides 166 - 171); ATCAAC (nucleotides 1750 - 1755); GGTATT (nucleotides 283 - 288); GGTATT (nucleotides 1213 - 1218); GTCAAG (nucleotides 1795 - 1800); GACGAA (nucleotides 172 - 177); GACGAA (nucleotides 1117 - 1122); GGTATC (nucleotides 781 - 786); GGGTTC (nucleotides 43 - 48); TCTTTG (nucleotides 1543 - 1548); TCGTTA (nucleotides 370 - 375); TTGAAC (nucleotides 1276 - 1281). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TCGTTG (nucleotides 934 - 939) replaced with TCTCTG; GATATC (nucleotides 490 - 495) replaced with GACATC; GATATC (nucleotides 679 - 684) replaced with GACATC; ATCAAG (nucleotides 1261 - 1266) replaced with ATCAAA; ATCAAG (nucleotides 1606 - 1611) replaced with ATCAAA; AAGTTT (nucleotides 1498 - 1503) replaced with AAGTTC; TTCAAG (nucleotides 403 - 408) replaced with TTCAAA; TTCAAG (nucleotides 556 - 561) replaced with TTCAAA; TTGAAA (nucleotides 1858 - 1863) replaced with CTGAAA; TTCAAC (nucleotides 268 - 273) replaced with TTCAAC; TTCAAC (nucleotides 697 - 702) replaced with TTTAAC; TTCAAC (nucleotides 877 - 882) replaced with TTCAAC; TTCAAC (nucleotides 1198 - 1203) replaced with TTCAAC; ATCAAC (nucleotides 133 - 138) replaced with ATCAAC; ATCAAC (nucleotides 166 - 171) replaced with ATCAAC; ATCAAC (nucleotides 1750 - 1755) replaced with ATCAAC; GGTATT (nucleotides 283 - 288) replaced with GGTATC; GGTATT (nucleotides 1213 - 1218) replaced with GGTATC; GTCAAG (nucleotides 1795 - 1800) replaced with GTTAAA; GACGAA (nucleotides 172 - 177) replaced with GACGAA; GACGAA (nucleotides 1117 - 1122) replaced with GACGAA; GGTATC (nucleotides 781 - 786) replaced with GGTATC; GGGTTC (nucleotides 43 - 48) replaced with GGTTTC; TCTTTG (nucleotides 1543 - 1548) replaced with TCTCTC; TCGTTA (nucleotides 370 - 375) replaced with TCCCTG; TTGAAC (nucleotides 1276 - 1281) replaced with CTGAAC. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0045] In some embodiments is provided a D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGGTTC (nucleotides 43 - 48); TTGAAG (nucleotides 67 - 72); GCCATT (nucleotides 190 - 195); AAGAAG (nucleotides 250 - 255); TTCCCC (nucleotides 262 - 267); TCGTTA (nucleotides 370 - 375); GGTAAA (nucleotides 439 - 444); GATATC (nucleotides 490 - 495); GATATC (nucleotides 679 - 684); GGTATC (nucleotides 781 - 786); TTGAAG (nucleotides 793 - 798); TTTGTC (nucleotides 859 - 864); TCGTTG (nucleotides 934 - 939); AAGAAG (nucleotides 1150 - 1155); TTCCCA (nucleotides 1222 - 1227); TTGAAC (nucleotides 1276 - 1281); AAGAAG (nucleotides 1525 - 1530); GCCAAG (nucleotides 1717 - 1722); AAGAAG (nucleotides 1720 - 1725); AAATGG (nucleotides 1804 - 1809); GCCAAG (nucleotides 1840 - 1845); TTGAAA (nucleotides 1858 - 1863). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGGTTC (nucleotides 43 - 48) replaced with GGTTTC; TTGAAG (nucleotides 67 - 72) replaced with TTAAAG; GCCATT (nucleotides 190 - 195) replaced with GCTATT; AAGAAG (nucleotides 250 - 255) replaced with AAAAAG; TTCCCC (nucleotides 262 - 267) replaced with TTTCCG; TCGTTA (nucleotides 370 - 375) replaced with TCTTTA; GGTAAA (nucleotides 439 - 444) replaced with GGAAAA; GATATC (nucleotides 490 - 495) replaced with GACATT; GATATC (nucleotides 679 - 684) replaced with GACATT; GGTATC (nucleotides 781 - 786) replaced with GGTATA; TTGAAG (nucleotides 793 - 798) replaced with TTAAAG; TTTGTC (nucleotides 859 - 864) replaced with TTCGTT; TCGTTG (nucleotides 934 - 939) replaced with TCATTG; AAGAAG (nucleotides 1150 - 1155) replaced with AAAAAG; TTCCCA (nucleotides 1222 - 1227) replaced with TTTCCA; TTGAAC (nucleotides 1276 - 1281) replaced with TTAAAT; AAGAAG (nucleotides 1525 - 1530) replaced with AAAAAG; GCCAAG (nucleotides 1717 - 1722) replaced with GCTAAA; AAGAAG (nucleotides 1720 - 1725) replaced with AAAAAG; AAATGG (nucleotides 1804 - 1809) replaced with AAGTGG; GCCAAG (nucleotides 1840 - 1845) replaced with GCGAAA; TTGAAA (nucleotides 1858 - 1863) replaced with TTAAAA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
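By way of illustration only, some of the codon pairs recited above overlap by one codon, for example GCCAAG (nucleotides 1717 - 1722) and AAGAAG (nucleotides 1720 - 1725), which share the codon at nucleotides 1720 - 1722. When both replacements are made, the shared codon must be the same in each. The following minimal sketch checks this; the function name is hypothetical, and the second data set is a deliberately conflicting example invented for contrast.

```python
def overlap_conflicts(replacements: list[tuple[int, str]]) -> list[int]:
    """Given (first nucleotide, replacement hexamer) entries, return the 1-based
    start positions of codons that two entries would set to different triplets."""
    chosen: dict[int, str] = {}  # codon start position -> replacement codon
    conflicts: list[int] = []
    for first_nt, new_pair in replacements:
        for offset in (0, 3):  # first and second codon of the pair
            position = first_nt + offset
            codon = new_pair[offset:offset + 3]
            if chosen.setdefault(position, codon) != codon:
                conflicts.append(position)
    return conflicts


# Two overlapping replacements from the K. lactis list above.
from_text = [(1717, "GCTAAA"), (1720, "AAAAAG")]
# Hypothetical conflicting variant: the shared codon would be AAA in one and AAG in the other.
conflicting = [(1717, "GCTAAA"), (1720, "AAGAAA")]

print(overlap_conflicts(from_text))
print(overlap_conflicts(conflicting))
```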
[0046] In some embodiments is provided a D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TCGACT (nucleotides 55 - 60); AACAGC (nucleotides 136 - 141); GATGCC (nucleotides 220 - 225); GGTATT (nucleotides 283 - 288); TCCGGT (nucleotides 289 - 294); GATGCC (nucleotides 478 - 483); GCCTTG (nucleotides 481 - 486); GAAGCC (nucleotides 649 - 654); GGTATC (nucleotides 781 - 786); ATCAAT (nucleotides 784 - 789); ACCGGA (nucleotides 907 - 912); ATTATC (nucleotides 928 - 933); GCTTTG (nucleotides 958 - 963); ATTATC (nucleotides 994 - 999); GGTATT (nucleotides 1213 - 1218); AACAGC (nucleotides 1279 - 1284); ACTTTG (nucleotides 1366 - 1371); ATTATC (nucleotides 1603 - 1608); GAAGCC (nucleotides 1714 - 1719); GCCAAG (nucleotides 1717 - 1722); GCCAAG (nucleotides 1840 - 1845). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TCGACT (nucleotides 55 - 60) replaced with TCTACC; AACAGC (nucleotides 136 - 141) replaced with AATTCT; GATGCC (nucleotides 220 - 225) replaced with GACGCG; GGTATT (nucleotides 283 - 288) replaced with GGCATT; TCCGGT (nucleotides 289 - 294) replaced with AGCGGT; GATGCC (nucleotides 478 - 483) replaced with GATGCT; GCCTTG (nucleotides 481 - 486) replaced with GCTTTA; GAAGCC (nucleotides 649 - 654) replaced with GAGGCC; GGTATC (nucleotides 781 - 786) replaced with GGTATA; ATCAAT (nucleotides 784 - 789) replaced with ATAAAC; ACCGGA (nucleotides 907 - 912) replaced with ACGGGA; ATTATC (nucleotides 928 - 933) replaced with ATTATT; GCTTTG (nucleotides 958 - 963) replaced with GCTCTA; ATTATC (nucleotides 994 - 999) replaced with ATTATT; GGTATT (nucleotides 1213 - 1218) replaced with GGCATC; AACAGC (nucleotides 1279 - 1284) replaced with AATTCT; ACTTTG (nucleotides 1366 - 1371) replaced with ACCTTG; ATTATC (nucleotides 1603 - 1608) replaced with ATTATT; GAAGCC (nucleotides 1714 - 1719) replaced with GAAGCT; GCCAAG (nucleotides 1717 - 1722) replaced with GCTAAA; GCCAAG (nucleotides 1840 - 1845) replaced with GCGAAA. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0047] Also provided herein is a D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0048] Also provided herein is a D-xylulokinase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0049] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0050] In some embodiments, provided herein is a system for metabolizing xylose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the D-xylulokinase retains at least 75% of the enzymatic activity of wild-type XKI (SEQ ID NO: 74) under normal physiological conditions.
[0051] In some embodiments is provided a D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 12-312 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 12-312 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 12-312 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 12-312 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GATATC when expressed in the native organism.
[0052] In some embodiments is provided a D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-12 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-12 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-12 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-12 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GATGCT when expressed in the native organism.
[0053] In some embodiments is provided an L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CGCTAC (nucleotides 454 - 459); GCCAAG (nucleotides 562 - 567); CTCGGT (nucleotides 574 - 579); GATATC (nucleotides 946 - 951); CGCTAC (nucleotides 964 - 969); GCCATT (nucleotides 1102 - 1107). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CGCTAC (nucleotides 454 - 459) replaced with AGGTAT; GCCAAG (nucleotides 562 - 567) replaced with GCTAAA; CTCGGT (nucleotides 574 - 579) replaced with TTGGGT; GATATC (nucleotides 946 - 951) replaced with GATATA; CGCTAC (nucleotides 964 - 969) replaced with AGATAT; GCCATT (nucleotides 1102 - 1107) replaced with GCTATT. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0054] In some embodiments are provided a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTGGCG (nucleotides 688 - 693); GCCAGC (nucleotides 856 - 861); ATCCTC (nucleotides 262 - 267); GCCAGT (nucleotides 928 - 933); CTCGGC (nucleotides 265 - 270); GTCAGC (nucleotides 775 - 780); TTCCCG (nucleotides 1045 - 1050); CTCGGT (nucleotides 574 - 579); TTCTGG (nucleotides 214 - 219); GCGCTG (nucleotides 517 - 522); ATCGCC (nucleotides 292 - 297). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTGGCG (nucleotides 688 - 693) replaced with CTCGCG; GCCAGC (nucleotides 856 - 861) replaced with GCGTCT; ATCCTC (nucleotides 262 - 267) replaced with ATCCTG; GCCAGT (nucleotides 928 - 933) replaced with GCGTCT; CTCGGC (nucleotides 265 - 270) replaced with CTGGGT; GTCAGC (nucleotides 775 - 780) replaced with GTTAGC; TTCCCG (nucleotides 1045 - 1050) replaced with TTCCCA; CTCGGT (nucleotides 574 - 579) replaced with CTGGGC; TTCTGG (nucleotides 214 - 219) replaced with TTTTGG; GCGCTG (nucleotides 517 - 522) replaced with GCTCTG; ATCGCC (nucleotides 292 - 297) replaced with ATCGCT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0055] In some embodiments are provided a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 946 - 951); AAGTTT (nucleotides 862 - 867); GTCAAG (nucleotides 55 - 60); GTCAAG (nucleotides 1063 - 1068); GCCAAA (nucleotides 763 - 768); GGTATC (nucleotides 190 - 195); AAGAAT (nucleotides 898 - 903); TCCAAA (nucleotides 1024 - 1029). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GATATC (nucleotides 946 - 951) replaced with GACATC; AAGTTT (nucleotides 862 - 867) replaced with AAATTC; GTCAAG (nucleotides 55 - 60) replaced with GTTAAA; GTCAAG (nucleotides 1063 - 1068) replaced with GTTAAG; GCCAAA (nucleotides 763 - 768) replaced with GCGAAA; GGTATC (nucleotides 190 - 195) replaced with GGTATT; AAGAAT (nucleotides 898 - 903) replaced with AAAAAC; TCCAAA (nucleotides 1024 - 1029) replaced with TCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0056] In some embodiments are provided a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGTATC (nucleotides 190 - 195); CTGCGA (nucleotides 448 - 453); GCCAAG (nucleotides 562 - 567); GATATC (nucleotides 946 - 951); GCCATT (nucleotides 1102 - 1107). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGTATC (nucleotides 190 - 195) replaced with GGAATT; CTGCGA (nucleotides 448 - 453) replaced with TTGAGG; GCCAAG (nucleotides 562 - 567) replaced with GCTAAA; GATATC (nucleotides 946 - 951) replaced with GATATA; GCCATT (nucleotides 1102 - 1107) replaced with GCAATT. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0057] In some embodiments are provided a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GTCGAT (nucleotides 16 - 21); GGGGCA (nucleotides 40 - 45); GATGCC (nucleotides 127 - 132); GGTATC (nucleotides 190 - 195); GCCAAG (nucleotides 562 - 567); GCCGGT (nucleotides 643 - 648); AGCCGT (nucleotides 682 - 687); TCGGCT (nucleotides 748 - 753); GTCGAT (nucleotides 943 - 948); GATGCC (nucleotides 1057 - 1062). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GTCGAT (nucleotides 16 - 21) replaced with GTTGAT; GGGGCA (nucleotides 40 - 45) replaced with GGCGCT; GATGCC (nucleotides 127 - 132) replaced with GACGCC; GGTATC (nucleotides 190 - 195) replaced with GGTATA; GCCAAG (nucleotides 562 - 567) replaced with GCTAAG; GCCGGT (nucleotides 643 - 648) replaced with GCTGGG; AGCCGT (nucleotides 682 - 687) replaced with TCTCGT; TCGGCT (nucleotides 748 - 753) replaced with TCTGCA; GTCGAT (nucleotides 943 - 948) replaced with GTTGAT; GATGCC (nucleotides 1057 - 1062) replaced with GATGCT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0058] Also provided herein is a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
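By way of non-limiting illustration, the threshold test for a highly-overrepresented codon pair described above can be sketched as follows. The sketch assumes that translational kinetics values for the host organism are available as a mapping from codon pair to value, and that the population standard deviation is intended; the disclosure does not specify either point here, and the function name and example values are hypothetical.

```python
# Illustrative sketch only; names and the choice of population standard
# deviation are assumptions, not taken from the disclosure.
from statistics import pstdev


def highly_overrepresented(kinetics_values, multiplier=2.0):
    """Return the codon pairs whose value exceeds multiplier x standard deviation.

    kinetics_values: dict mapping a six-nucleotide codon pair to its
    translational kinetics value for one host organism.
    multiplier: 5, 3, 2.5 or 2 in the embodiments described above.
    """
    threshold = multiplier * pstdev(list(kinetics_values.values()))
    return {pair for pair, value in kinetics_values.items() if value > threshold}
```

With the hypothetical values {"AAAAAA": 0.0, "CCCCCC": 0.0, "GGGGGG": 0.0, "TTTTTT": 4.0}, the standard deviation is about 1.73, so TTTTTT is flagged at a multiplier of 2 but not at a multiplier of 2.5.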
[0059] Also provided herein is a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UT189; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0060] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0061] In some embodiments, provided herein is a system for metabolizing arabinose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: arabinose dehydrogenase, L-arabinitol 4-dehydrogenase, and L-xylulose reductase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the L-arabinitol 4-dehydrogenase retains at least 75% of the enzymatic activity of wild-type LAD1 (SEQ ID NO: 98) under normal physiological conditions.
[0062] In some embodiments are provided a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 53-164 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 53-164 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 53-164 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 53-164 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair AAGATT when expressed in the native organism.
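By way of non-limiting illustration, the two z score comparisons recited above for a pause-reducing replacement can be written as simple predicates. The sketch assumes that a higher z score corresponds to a codon pair more likely to cause a translational pause, and that the wild type z score in the native organism is positive so that a percentage of it is meaningful; the function names and any values supplied to them are hypothetical.

```python
# Illustrative sketch only; function names are assumptions, and z scores
# must be supplied by the user (none are given in this passage).


def less_likely_to_pause(replacement_z_host, wild_type_z_host):
    """True if the replacement pair has a lower z score than the wild type
    pair, both evaluated in the heterologous host organism."""
    return replacement_z_host < wild_type_z_host


def within_native_limit(replacement_z_host, wild_type_z_native, max_fraction=1.5):
    """True if the replacement z score in the host is no more than
    max_fraction (150% by default) of the wild type z score in the
    native organism."""
    if wild_type_z_native <= 0:
        raise ValueError("percentage comparison assumes a positive wild type z score")
    return replacement_z_host <= max_fraction * wild_type_z_native
```

For example, with a wild type z score of 4.0 in the native organism, a replacement z score of 3.0 in the host satisfies the 150% limit (6.0), whereas a replacement z score of 7.0 does not.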
[0063] In some embodiments are provided a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 192-366 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 192-366 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 192-366 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 192-366 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GAGATT when expressed in the native organism.
[0064] In some embodiments are provided a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-53 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-53 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-53 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-53 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GTCAAG when expressed in the native organism.
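By way of non-limiting illustration, the reference value recited above (the mean or median of the five highest wild type z scores within a region) and the comparison of replacement z scores against a percentage of that reference can be sketched as follows. The function names are hypothetical, and the z scores are inputs the user must supply; none are given in this passage.

```python
# Illustrative sketch only; names are assumptions, not from the disclosure.
from statistics import mean, median


def top_five_reference(wild_type_z_native, use_median=False):
    """Mean (or median) of the five highest wild type z scores for a region,
    evaluated in the native organism."""
    top = sorted(wild_type_z_native, reverse=True)[:5]
    return median(top) if use_median else mean(top)


def any_exceeds(replacement_z_host, wild_type_z_native, fraction, use_median=False):
    """True if at least one replacement z score in the heterologous host is
    more than fraction x the top-five reference (e.g. fraction=0.75 for 75%)."""
    reference = top_five_reference(wild_type_z_native, use_median)
    return any(z > fraction * reference for z in replacement_z_host)
```

The pause-preserving embodiments require `any_exceeds(...)` to be true for the chosen percentage, while the pause-reducing embodiments, which recite that no replacement exceeds the limit, correspond to `not any_exceeds(...)`.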
[0065] In some embodiments are provided a L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 164-192 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 164-192 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 164-192 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 164-192 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GCGCTG when expressed in the native organism.
[0066] In some embodiments are provided a L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GGTATT (nucleotides 619 - 624); TTGAAC (nucleotides 16 - 21); TTGAAC (nucleotides 274 - 279); TTGAAC (nucleotides 670 - 675); TTGAAC (nucleotides 688 - 693); CTTTCT (nucleotides 286 - 291); GCCATT (nucleotides 181 - 186); TCTCCA (nucleotides 697 - 702); TCTCCA (nucleotides 751 - 756); ATCAAG (nucleotides 103 - 108); ATCAAG (nucleotides 541 - 546); ATCAAG (nucleotides 721 - 726); GCCAAG (nucleotides 889 - 894). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GGTATT (nucleotides 619 - 624) replaced with GGAATT; TTGAAC (nucleotides 16 - 21) replaced with TTAAAT; TTGAAC (nucleotides 274 - 279) replaced with CTAAAT; TTGAAC (nucleotides 670 - 675) replaced with TTAAAT; TTGAAC (nucleotides 688 - 693) replaced with TTAAAT; CTTTCT (nucleotides 286 - 291) replaced with CTATCT; GCCATT (nucleotides 181 - 186) replaced with GCTATT; TCTCCA (nucleotides 697 - 702) replaced with TCACCA; TCTCCA (nucleotides 751 - 756) replaced with TCACCA; ATCAAG (nucleotides 103 - 108) replaced with ATTAAA; ATCAAG (nucleotides 541 - 546) replaced with ATTAAA; ATCAAG (nucleotides 721 - 726) replaced with ATTAAG; GCCAAG (nucleotides 889 - 894) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0067] In some embodiments are provided a L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GCCTGT (nucleotides 58 - 63); CTTGAT (nucleotides 124 - 129); GCCTGT (nucleotides 226 - 231); GAAGAT (nucleotides 346 - 351); CTTTCT (nucleotides 748 - 753); GCCAGC (nucleotides 781 - 786). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GCCTGT (nucleotides 58 - 63) replaced with GCATGT; CTTGAT (nucleotides 124 - 129) replaced with TTGGAT; GCCTGT (nucleotides 226 - 231) replaced with GCTTGT; GAAGAT (nucleotides 346 - 351) replaced with GAAGAT; CTTTCT (nucleotides 748 - 753) replaced with TTGTCT; GCCAGC (nucleotides 781 - 786) replaced with GCATCA. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0068] In some embodiments are provided a L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAC (nucleotides 16 - 21); ATCAAG (nucleotides 103 - 108); GTCAAG (nucleotides 172 - 177); GACGAA (nucleotides 187 - 192); GGTATC (nucleotides 193 - 198); GTCAAG (nucleotides 199 - 204); TCCAAG (nucleotides 226 - 231); TTGAAC (nucleotides 274 - 279); TTCAAG (nucleotides 343 - 348); GTCAAG (nucleotides 460 - 465); ATCAAG (nucleotides 541 - 546); CCAAGA (nucleotides 589 - 594); GGTATT (nucleotides 619 - 624); TTGAAC (nucleotides 670 - 675); TTGAAC (nucleotides 688 - 693); ATCAAG (nucleotides 721 - 726); CCAAGA (nucleotides 823 - 828); GACGAA (nucleotides 865 - 870); ATCAAC (nucleotides 901 - 906); TTCAAC (nucleotides 913 - 918). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAC (nucleotides 16 - 21) replaced with TTAAAT; ATCAAG (nucleotides 103 - 108) replaced with ATTAAA; GTCAAG (nucleotides 172 - 177) replaced with GTTAAA; GACGAA (nucleotides 187 - 192) replaced with GATGAA; GGTATC (nucleotides 193 - 198) replaced with GGAATT; GTCAAG (nucleotides 199 - 204) replaced with GTTAAA; TCCAAG (nucleotides 226 - 231) replaced with TCTAAA; TTGAAC (nucleotides 274 - 279) replaced with CTAAAT; TTCAAG (nucleotides 343 - 348) replaced with TTTAAA; GTCAAG (nucleotides 460 - 465) replaced with GTTAAA; ATCAAG (nucleotides 541 - 546) replaced with ATTAAA; CCAAGA (nucleotides 589 - 594) replaced with CCTAGA; GGTATT (nucleotides 619 - 624) replaced with GGAATT; TTGAAC (nucleotides 670 - 675) replaced with TTAAAT; TTGAAC (nucleotides 688 - 693) replaced with TTAAAT; ATCAAG (nucleotides 721 - 726) replaced with ATTAAG; CCAAGA (nucleotides 823 - 828) replaced with CCTCGT; GACGAA (nucleotides 865 - 870) replaced with GATGAA; ATCAAC (nucleotides 901 - 906) replaced with ATTAAT; TTCAAC (nucleotides 913 - 918) replaced with TTTAAT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0069] In some embodiments are provided a L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GATATC (nucleotides 127 - 132); TTGAAG (nucleotides 190 - 195); TTGAAA (nucleotides 196 - 201); GTGTTT (nucleotides 262 - 267); TTTGCT (nucleotides 265 - 270); TTCCCA (nucleotides 337 - 342); GCCAAG (nucleotides 358 - 363); TTTGCT (nucleotides 421 - 426); ATCAAA (nucleotides 436 - 441); GGTATC (nucleotides 445 - 450); GCCATT (nucleotides 490 - 495); GGTATC (nucleotides 688 - 693); CTTTCT (nucleotides 748 - 753). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GATATC (nucleotides 127 - 132) replaced with GACATT; TTGAAG (nucleotides 190 - 195) replaced with TTAAAG; TTGAAA (nucleotides 196 - 201) replaced with TTAAAG; GTGTTT (nucleotides 262 - 267) replaced with GTTTTC; TTTGCT (nucleotides 265 - 270) replaced with TTCGCT; TTCCCA (nucleotides 337 - 342) replaced with TTCCCT; GCCAAG (nucleotides 358 - 363) replaced with GCTAAA; TTTGCT (nucleotides 421 - 426) replaced with TTCGCT; ATCAAA (nucleotides 436 - 441) replaced with ATTAAA; GGTATC (nucleotides 445 - 450) replaced with GGAATT; GCCATT (nucleotides 490 - 495) replaced with GCAATT; GGTATC (nucleotides 688 - 693) replaced with GGCATT; CTTTCT (nucleotides 748 - 753) replaced with TTGTCT. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0070] In some embodiments are provided a L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: ACTTTT (nucleotides 19 - 24); GCTTTG (nucleotides 118 - 123); CTTGAT (nucleotides 124 - 129); GCCAAG (nucleotides 358 - 363); GCCTTT (nucleotides 418 - 423); GGTATC (nucleotides 445 - 450); ACTTTG (nucleotides 562 - 567); ATCAAT (nucleotides 649 - 654); GGTATC (nucleotides 688 - 693). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: ACTTTT (nucleotides 19 - 24) replaced with ACCTTT; GCTTTG (nucleotides 118 - 123) replaced with GCTCTT; CTTGAT (nucleotides 124 - 129) replaced with TTGGAC; GCCAAG (nucleotides 358 - 363) replaced with GCTAAG; GCCTTT (nucleotides 418 - 423) replaced with GCTTTC; GGTATC (nucleotides 445 - 450) replaced with GGGATT; ACTTTG (nucleotides 562 - 567) replaced with ACCTTG; ATCAAT (nucleotides 649 - 654) replaced with ATTAAT; GGTATC (nucleotides 688 - 693) replaced with GGCATC. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0071] Also provided herein is a L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0072] Also provided herein is a L-xylulose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UT189; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0073] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0074] In some embodiments, provided herein is a system for metabolizing arabinose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: arabinose dehydrogenase, L-arabinitol 4-dehydrogenase, and L-xylulose reductase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the L-xylulose reductase retains at least 75% of the enzymatic activity of wild-type LXR (SEQ ID NO: 122) under normal physiological conditions.
[0075] In some embodiments are provided a L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 8-267 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 8-267 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 8-267 when expressed in the native organism.
[0076] In some embodiments is provided an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-8 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 122 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-8 when expressed in the native organism.
[0077] In some embodiments is provided an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAG (nucleotides 49 - 54); TTTGCC (nucleotides 583 - 588); GATATT (nucleotides 766 - 771); AGCGAT (nucleotides 364 - 369); GCCAAG (nucleotides 529 - 534); GCCAAG (nucleotides 700 - 705). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAG (nucleotides 49 - 54) replaced with TTAAAA; TTTGCC (nucleotides 583 - 588) replaced with TTTGCT; GATATT (nucleotides 766 - 771) replaced with GATATA; AGCGAT (nucleotides 364 - 369) replaced with TCAGAT; GCCAAG (nucleotides 529 - 534) replaced with GCAAAA; GCCAAG (nucleotides 700 - 705) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
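Each nucleotide range recited above spans six nucleotides, that is, two adjacent codons. As a minimal sketch, and assuming the ranges are 1-based, inclusive, and counted from the first nucleotide of the coding sequence (an interpretation, not a statement from the disclosure), the following Python maps a range to the positions of the two encoded amino acids. The function name `codon_pair_position` is hypothetical.

```python
def codon_pair_position(start: int, end: int) -> tuple[int, int]:
    """Map a 1-based, inclusive nucleotide range to the two codon numbers it covers.

    Assumes numbering starts at the first nucleotide of the coding sequence.
    """
    if end - start != 5:
        raise ValueError("a codon pair spans exactly six nucleotides")
    if (start - 1) % 3 != 0:
        raise ValueError("range does not begin on a codon boundary")
    first = (start - 1) // 3 + 1
    return first, first + 1


if __name__ == "__main__":
    # Two of the ranges recited for the L-xylulose reductase sequence.
    print(codon_pair_position(49, 54))
    print(codon_pair_position(766, 771))
```

Under these assumptions, nucleotides 49 - 54 cover codons 17 and 18, and every range recited in this section begins on a codon boundary.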
[0078] In some embodiments is provided an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GATCTC (nucleotides 37 - 42); ATTGCC (nucleotides 313 - 318); GCCGGA (nucleotides 322 - 327); GCCAGC (nucleotides 361 - 366); CTGGCG (nucleotides 550 - 555); TTTGCC (nucleotides 583 - 588); GTCAGC (nucleotides 733 - 738). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GATCTC (nucleotides 37 - 42) replaced with GATTTG; ATTGCC (nucleotides 313 - 318) replaced with ATTGCT; GCCGGA (nucleotides 322 - 327) replaced with GCTGGA; GCCAGC (nucleotides 361 - 366) replaced with GCTTCA; CTGGCG (nucleotides 550 - 555) replaced with TTGGCT; TTTGCC (nucleotides 583 - 588) replaced with TTTGCT; GTCAGC (nucleotides 733 - 738) replaced with GTTTCA. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
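The replacements listed above are recited as encoding identical amino acids. A minimal sketch of how that property could be checked is shown below; it assumes the standard genetic code applies to the host, and the helper names (`translate`, `is_silent`) are illustrative rather than part of the disclosure.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate an in-frame DNA string into one-letter amino acid codes."""
    if len(nucleotides) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    return "".join(
        CODON_TABLE[nucleotides[i:i + 3]] for i in range(0, len(nucleotides), 3)
    )


def is_silent(original_pair: str, replacement_pair: str) -> bool:
    """True when both codon pairs encode the same two amino acids."""
    return translate(original_pair) == translate(replacement_pair)


# Codon pair replacements recited above for expression in E. coli.
E_COLI_LXR_REPLACEMENTS = [
    ("GATCTC", "GATTTG"),
    ("ATTGCC", "ATTGCT"),
    ("GCCGGA", "GCTGGA"),
    ("GCCAGC", "GCTTCA"),
    ("CTGGCG", "TTGGCT"),
    ("TTTGCC", "TTTGCT"),
    ("GTCAGC", "GTTTCA"),
]

if __name__ == "__main__":
    for original, replacement in E_COLI_LXR_REPLACEMENTS:
        print(original, "->", replacement, "silent:", is_silent(original, replacement))
```

A conservative amino acid substitution would fail this check by design; testing for conservativeness would need a substitution grouping that this sketch does not attempt.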
[0079] In some embodiments is provided an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GTCAAG (nucleotides 220 - 225); TTCAAG (nucleotides 436 - 441); AAGAAG (nucleotides 439 - 444); GGCCAC (nucleotides 448 - 453); GGCCAC (nucleotides 484 - 489); TTTGCC (nucleotides 583 - 588); GATATT (nucleotides 766 - 771). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GTCAAG (nucleotides 220 - 225) replaced with GTTAAA; TTCAAG (nucleotides 436 - 441) replaced with TTTAAA; AAGAAG (nucleotides 439 - 444) replaced with AAAAAG; GGCCAC (nucleotides 448 - 453) replaced with GGACAT; GGCCAC (nucleotides 484 - 489) replaced with GGACAC; TTTGCC (nucleotides 583 - 588) replaced with TTCGCT; GATATT (nucleotides 766 - 771) replaced with GATATA; GCCAAG (nucleotides 700 - 705) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0080] In some embodiments is provided an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAG (nucleotides 49 - 54); AAGAAG (nucleotides 439 - 444); GCCAAG (nucleotides 529 - 534); TTTGCC (nucleotides 583 - 588); GCCAAG (nucleotides 700 - 705). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAG (nucleotides 49 - 54) replaced with TTAAAG; AAGAAG (nucleotides 439 - 444) replaced with AAAAAG; GCCAAG (nucleotides 529 - 534) replaced with GCCAAA; TTTGCC (nucleotides 583 - 588) replaced with TTCGCT; GCCAAG (nucleotides 700 - 705) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0081] In some embodiments is provided an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTGAT (nucleotides 34 - 39); GATGCC (nucleotides 304 - 309); GCCTTT (nucleotides 307 - 312); GCCGGA (nucleotides 322 - 327); GCCAAG (nucleotides 529 - 534); GCCGGT (nucleotides 535 - 540); AACAGC (nucleotides 595 - 600); GATGCC (nucleotides 697 - 702); GCCAAG (nucleotides 700 - 705). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTGAT (nucleotides 34 - 39) replaced with TTGGAT; GATGCC (nucleotides 304 - 309) replaced with GATGCT; GCCTTT (nucleotides 307 - 312) replaced with GCTTTC; GCCGGA (nucleotides 322 - 327) replaced with GCTGGA; GCCAAG (nucleotides 529 - 534) replaced with GCTAAG; GCCGGT (nucleotides 535 - 540) replaced with GCCGGG; AACAGC (nucleotides 595 - 600) replaced with AATTCT; GATGCC (nucleotides 697 - 702) replaced with GATGCT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0082] Also provided herein is an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
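The threshold for a highly-overrepresented codon pair can be illustrated with a short sketch. It assumes that translational kinetics values for the host are available as a mapping from codon pair to value and that the population standard deviation is intended; the paragraph above specifies neither, and the values used here are invented for illustration only.

```python
from statistics import pstdev


def highly_overrepresented(values: dict[str, float], multiple: float) -> list[str]:
    """Return codon pairs whose value exceeds `multiple` standard deviations.

    `values` maps codon pairs to translational kinetics values for one host.
    The population standard deviation is an assumption of this sketch.
    """
    sd = pstdev(list(values.values()))
    return sorted(pair for pair, value in values.items() if value > multiple * sd)


if __name__ == "__main__":
    # Hypothetical values, not data from the disclosure.
    example_values = {
        "CTGGCG": 10.0,
        "TTGGCT": 0.0,
        "GATTTG": 0.0,
        "ATTGCT": 0.0,
        "GCTGGA": 0.0,
    }
    for multiple in (2, 2.5, 3, 5):
        print(multiple, highly_overrepresented(example_values, multiple))
```

The comparison is strict, matching the wording "greater than"; a pair whose value equals the threshold exactly is not flagged.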
[0083] Also provided herein is an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
[0084] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0085] In some embodiments, provided herein is a system for metabolizing arabinose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: arabinose dehydrogenase, L-arabinitol 4-dehydrogenase, and L-xylulose reductase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the L-xylulose reductase retains at least 75% of the enzymatic activity of wild-type LXR (SEQ ID NO: 146) under normal physiological conditions.
[0086] In some embodiments is provided an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 10-261 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 10-261 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 10-261 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 10-261 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair AAGACG when expressed in the native organism.
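The cap described above compares every replacement z score against a percentage of the mean or median of the five highest wild-type z scores. A minimal sketch follows, assuming the z scores have already been computed elsewhere and that the reference value is positive; the function names and the numbers in the usage example are hypothetical.

```python
from heapq import nlargest
from statistics import mean, median


def reference_z(wild_type_z: list[float], use_median: bool = False) -> float:
    """Mean (or median) of the five highest wild-type z scores in the region."""
    top_five = nlargest(5, wild_type_z)
    return median(top_five) if use_median else mean(top_five)


def within_cap(
    replacement_z: list[float],
    wild_type_z: list[float],
    percent: float,
    use_median: bool = False,
) -> bool:
    """True when no replacement z score exceeds `percent` of the reference value."""
    cap = percent / 100 * reference_z(wild_type_z, use_median)
    return all(z <= cap for z in replacement_z)


if __name__ == "__main__":
    # Hypothetical z scores, not data from the disclosure.
    wild_type = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.5]
    print(within_cap([3.9, 1.0], wild_type, percent=100))
    print(within_cap([4.5], wild_type, percent=100))
    print(within_cap([4.5], wild_type, percent=150))
```

The same comparison could be made against the z score of a single named wild-type codon pair by passing a one-element list as `wild_type_z`.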
[0087] In some embodiments is provided an L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-10 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-10 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-10 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-10 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GCCAAC when expressed in the native organism.
[0088] In some embodiments is provided a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAGTTT (nucleotides 262 - 267); TTTGCC (nucleotides 130 - 135); GTGGAA (nucleotides 943 - 948); GCCATT (nucleotides 856 - 861); CAGTTT (nucleotides 766 - 771); CAAAGT (nucleotides 1033 - 1038); GGCCAA (nucleotides 1201 - 1206); TTTTTC (nucleotides 265 - 270). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAGTTT (nucleotides 262 - 267) replaced with GAGTTC; TTTGCC (nucleotides 130 - 135) replaced with TTTGCT; GTGGAA (nucleotides 943 - 948) replaced with GTTGAA; GCCATT (nucleotides 856 - 861) replaced with GCTATA; CAGTTT (nucleotides 766 - 771) replaced with CAATTT; CAAAGT (nucleotides 1033 - 1038) replaced with CAATCT; GGCCAA (nucleotides 1201 - 1206) replaced with GGTCAA; TTTTTC (nucleotides 265 - 270) replaced with TTCTTT. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0089] In some embodiments is provided a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTGGCG (nucleotides 226 - 231); CTGGCG (nucleotides 1093 - 1098); CTGGTG (nucleotides 94 - 99); CTGGTG (nucleotides 958 - 963); GAAGAG (nucleotides 115 - 120); GAAGAG (nucleotides 391 - 396); GAAGAG (nucleotides 946 - 951); CTGGCA (nucleotides 376 - 381); CTGGCA (nucleotides 820 - 825); CTGGCA (nucleotides 1213 - 1218); TTTGCC (nucleotides 130 - 135); ACGCTG (nucleotides 586 - 591); ACGCTG (nucleotides 817 - 822); AAAGAG (nucleotides 337 - 342); AAAGAG (nucleotides 781 - 786); TTCCAG (nucleotides 673 - 678); CTGGAA (nucleotides 775 - 780); CTGGAA (nucleotides 1285 - 1290); TTCCCG (nucleotides 931 - 936); GCGGCA (nucleotides 496 - 501); GTGATG (nucleotides 961 - 966); GCGCTG (nucleotides 955 - 960); GCGCTG (nucleotides 1096 - 1101). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTGGCG (nucleotides 226 - 231) replaced with TTGGCT; CTGGCG (nucleotides 1093 - 1098) replaced with TTGGCA; CTGGTG (nucleotides 94 - 99) replaced with TTGGTT; CTGGTG (nucleotides 958 - 963) replaced with TTGGTT; GAAGAG (nucleotides 115 - 120) replaced with GAGGAA; GAAGAG (nucleotides 391 - 396) replaced with GAAGAA; GAAGAG (nucleotides 946 - 951) replaced with GAAGAA; CTGGCA (nucleotides 376 - 381) replaced with TTAGCT; CTGGCA (nucleotides 820 - 825) replaced with TTGGCT; CTGGCA (nucleotides 1213 - 1218) replaced with TTGGCT; TTTGCC (nucleotides 130 - 135) replaced with TTTGCT; ACGCTG (nucleotides 586 - 591) replaced with ACATTG; ACGCTG (nucleotides 817 - 822) replaced with ACATTG; AAAGAG (nucleotides 337 - 342) replaced with AAAGAA; AAAGAG (nucleotides 781 - 786) replaced with AAAGAA; TTCCAG (nucleotides 673 - 678) replaced with TTTCAA; CTGGAA (nucleotides 775 - 780) replaced with TTAGAA; CTGGAA (nucleotides 1285 - 1290) replaced with TTGGAA; TTCCCG (nucleotides 931 - 936) replaced with TTTCCA; GCGGCA (nucleotides 496 - 501) replaced with GCTGCT; GTGATG (nucleotides 961 - 966) replaced with GTTATG; GCGCTG (nucleotides 955 - 960) replaced with GCTTTG; GCGCTG (nucleotides 1096 - 1101) replaced with GCATTA. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0090] In some embodiments is provided a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAGTTT (nucleotides 262 - 267); TTTGCC (nucleotides 130 - 135); AAACTG (nucleotides 790 - 795); GCCAAA (nucleotides 1018 - 1023); GCCAAA (nucleotides 1225 - 1230); CTGAAA (nucleotides 760 - 765); CTGAAA (nucleotides 1099 - 1104); CTGAAA (nucleotides 1195 - 1200); GACGAA (nucleotides 88 - 93); AAACAG (nucleotides 763 - 768); GGCCAA (nucleotides 1201 - 1206); CTGGTA (nucleotides 1294 - 1299); TCGTTA (nucleotides 331 - 336); TTTGAC (nucleotides 13 - 18); CAGTTT (nucleotides 766 - 771). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAGTTT (nucleotides 262 - 267) replaced with GAGTTC; TTTGCC (nucleotides 130 - 135) replaced with TTTGCT; AAACTG (nucleotides 790 - 795) replaced with AAATTA; GCCAAA (nucleotides 1018 - 1023) replaced with GCTAAA; GCCAAA (nucleotides 1225 - 1230) replaced with GCTAAA; CTGAAA (nucleotides 760 - 765) replaced with CTAAAA; CTGAAA (nucleotides 1099 - 1104) replaced with TTAAAA; CTGAAA (nucleotides 1195 - 1200) replaced with TTAAAG; GACGAA (nucleotides 88 - 93) replaced with GATGAA; AAACAG (nucleotides 763 - 768) replaced with AAACAA; GGCCAA (nucleotides 1201 - 1206) replaced with GGTCAA; CTGGTA (nucleotides 1294 - 1299) replaced with TTGGTT; TCGTTA (nucleotides 331 - 336) replaced with TCTTTA; TTTGAC (nucleotides 13 - 18) replaced with TTTGAT; CAGTTT (nucleotides 766 - 771) replaced with CAATTT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0091] In some embodiments is provided a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTTGCC (nucleotides 130 - 135); GAGTTT (nucleotides 262 - 267); TCGTTA (nucleotides 331 - 336); CAGTTT (nucleotides 766 - 771); TTCCAT (nucleotides 835 - 840); GCCATT (nucleotides 856 - 861); GGCCAA (nucleotides 1201 - 1206). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTTGCC (nucleotides 130 - 135) replaced with TTCGCT; GAGTTT (nucleotides 262 - 267) replaced with GAATTT; TCGTTA (nucleotides 331 - 336) replaced with AGTTTA; CAGTTT (nucleotides 766 - 771) replaced with CAATTC; TTCCAT (nucleotides 835 - 840) replaced with TTCCAC; GCCATT (nucleotides 856 - 861) replaced with GCTATT; GGCCAA (nucleotides 1201 - 1206) replaced with GGTCAA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0092] In some embodiments is provided a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GCCTAT (nucleotides 7 - 12); CTCGAT (nucleotides 22 - 27); GAAGGC (nucleotides 40 - 45); ATCAAT (nucleotides 346 - 351); AAGCTG (nucleotides 406 - 411); CTGTTA (nucleotides 589 - 594); GATGCC (nucleotides 736 - 741); GATGCC (nucleotides 1015 - 1020). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GCCTAT (nucleotides 7 - 12) replaced with GCTTAT; CTCGAT (nucleotides 22 - 27) replaced with TTGGAT; GAAGGC (nucleotides 40 - 45) replaced with GAAGGT; ATCAAT (nucleotides 346 - 351) replaced with ATTAAT; AAGCTG (nucleotides 406 - 411) replaced with AAATTG; CTGTTA (nucleotides 589 - 594) replaced with TTGTTG; GATGCC (nucleotides 736 - 741) replaced with GACGCC; GATGCC (nucleotides 1015 - 1020) replaced with GATGCT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
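Applying a list of position-specific replacements such as the one above can be sketched as follows. The sketch assumes 1-based coordinates and verifies that the wild-type codon pair is actually present at each stated position before substituting. The function name `apply_replacements` and the 27-nucleotide toy sequence are hypothetical; the toy sequence is not SEQ ID NO: 169 and was constructed only so that two of the recited pairs sit at their recited positions.

```python
def apply_replacements(
    sequence: str, replacements: list[tuple[int, str, str]]
) -> str:
    """Apply (start, wild_type_pair, new_pair) substitutions to a coding sequence.

    `start` is the 1-based position of the first nucleotide of the codon pair.
    Each wild-type pair is checked against the unmodified input sequence.
    """
    bases = list(sequence)
    for start, wild_type_pair, new_pair in replacements:
        index = start - 1
        if len(wild_type_pair) != 6 or len(new_pair) != 6:
            raise ValueError("codon pairs must be six nucleotides long")
        found = sequence[index:index + 6]
        if found != wild_type_pair:
            raise ValueError(
                f"expected {wild_type_pair} at position {start}, found {found!r}"
            )
        bases[index:index + 6] = new_pair
    return "".join(bases)


if __name__ == "__main__":
    # Toy sequence (hypothetical): GCCTAT at 7-12 and CTCGAT at 22-27.
    toy = "ATGCAAGCCTATTTTGAAAAACTCGAT"
    edited = apply_replacements(
        toy,
        [(7, "GCCTAT", "GCTTAT"), (22, "CTCGAT", "TTGGAT")],
    )
    print(edited)
```

Where two recited codon pairs overlap by one codon, the later substitution in the list overwrites the shared codon, so the order of the list matters in that case.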
[0093] Also provided herein is a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0094] Also provided herein is a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
[0095] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0096] In some embodiments, provided herein is a system for metabolizing xylose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose isomerase and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the xylose isomerase retains at least 75% of the enzymatic activity of wild-type XylA (SEQ ID NO: 170) under normal physiological conditions.
[0097] In some embodiments is provided a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 76-286 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 76-286 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 76-286 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 76-286 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GAAGAG when expressed in the native organism.
[0098] In some embodiments is provided a xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-76 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-76 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-76 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-76 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair CTGGTG when expressed in the native organism.
[0099] In some embodiments are provided a L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTGAAA (nucleotides 148 - 153); ATCAAC (nucleotides 268 - 273); ATCAAG (nucleotides 598 - 603); CTCGGT (nucleotides 1111 - 1116); GGTATT (nucleotides 1114 - 1119); GGATTT (nucleotides 1489 - 1494). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTGAAA (nucleotides 148 - 153) replaced with TTAAAA; ATCAAC (nucleotides 268 - 273) replaced with ATTAAT; ATCAAG (nucleotides 598 - 603) replaced with ATAAAA; CTCGGT (nucleotides 1111 - 1116) replaced with TTGGGA; GGTATT (nucleotides 1114 - 1119) replaced with GGAATT; GGATTT (nucleotides 1489 - 1494) replaced with GGTTTT. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
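As a rough illustration of what "different codon pairs encoding identical amino acids" means in practice, the Python sketch below translates each wild-type and replacement pair from the preceding paragraph with the standard genetic code and checks that the encoded dipeptide is unchanged. It also shows one way to apply a replacement at 1-based nucleotide coordinates. The helper names and the 12-nucleotide toy sequence are hypothetical; the toy sequence is not SEQ ID NO: 193.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate an in-frame nucleotide string into one-letter amino acids."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt), 3))


def is_silent(wild_pair, new_pair):
    """True if both codon pairs encode the same two amino acids."""
    return translate(wild_pair) == translate(new_pair)


def apply_replacement(seq, start, wild_pair, new_pair):
    """Replace the 6-nt codon pair beginning at 1-based position `start`."""
    i = start - 1
    if seq[i:i + 6] != wild_pair:
        raise ValueError(f"expected {wild_pair} at {start}, found {seq[i:i + 6]}")
    return seq[:i] + new_pair + seq[i + 6:]


# Wild-type and replacement pairs as listed in the paragraph above.
PAIRS = [
    ("TTGAAA", "TTAAAA"), ("ATCAAC", "ATTAAT"), ("ATCAAG", "ATAAAA"),
    ("CTCGGT", "TTGGGA"), ("GGTATT", "GGAATT"), ("GGATTT", "GGTTTT"),
]
all_silent = all(is_silent(w, n) for w, n in PAIRS)

toy = "ATGTTGAAAGGT"  # hypothetical sequence; TTGAAA occupies positions 4-9
edited = apply_replacement(toy, 4, "TTGAAA", "TTAAAA")
```

The coordinate check inside `apply_replacement` is a design choice for this sketch: it refuses to edit when the expected wild-type pair is not found at the stated position, which guards against off-by-one errors in the nucleotide ranges.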
[0100] In some embodiments are provided a L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTCGAC (nucleotides 142 - 147); ATCCTC (nucleotides 226 - 231); ATCCTC (nucleotides 640 - 645); GACTGG (nucleotides 1081 - 1086); GTGGTG (nucleotides 1180 - 1185); GTGGTG (nucleotides 1096 - 1101); TTGCTG (nucleotides 1093 - 1098); CTCGGC (nucleotides 1327 - 1332); CTCGGC (nucleotides 922 - 927); CTGGAA (nucleotides 229 - 234); CTGGAA (nucleotides 649 - 654); CTGGAA (nucleotides 298 - 303); AGCCAG (nucleotides 1039 - 1044); ATTGCC (nucleotides 1195 - 1200); GAAGTG (nucleotides 760 - 765); GAAGTG (nucleotides 799 - 804); GAAGTG (nucleotides 1054 - 1059); CAGGCG (nucleotides 43 - 48); GATCTC (nucleotides 1072 - 1077); CTCGGT (nucleotides 22 - 27); GTGATG (nucleotides 559 - 564); GCGCTG (nucleotides 1477 - 1482); GCGCTG (nucleotides 496 - 501); GCGCTG (nucleotides 1192 - 1197); GCGCTG (nucleotides 1111 - 1116); GCGCTG (nucleotides 958 - 963); GCGCTG (nucleotides 109 - 114); CTCGAC (nucleotides 328 - 333); ATCCTC (nucleotides 682 - 687); ATCCTC (nucleotides 1279 - 1284); GACTGG (nucleotides 1366 - 1371); GTGGTG (nucleotides 1462 - 1467). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTCGAC (nucleotides 142 - 147) replaced with TTAGTT; ATCCTC (nucleotides 226 - 231) replaced with TTAGTT; ATCCTC (nucleotides 640 - 645) replaced with TTGGTT; GACTGG (nucleotides 1081 - 1086) replaced with GAAGAA; GTGGTG (nucleotides 1180 - 1185) replaced with GCTTCT; GTGGTG (nucleotides 1096 - 1101) replaced with TTGGAT; TTGCTG (nucleotides 1093 - 1098) replaced with ATTTTG; CTCGGC (nucleotides 1327 - 1332) replaced with ATTTTG; CTCGGC (nucleotides 922 - 927) replaced with GATTGG; CTGGAA (nucleotides 229 - 234) replaced with GTTGTT; CTGGAA (nucleotides 649 - 654) replaced with GTTGTT; CTGGAA (nucleotides 298 - 303) replaced with TTGTTG; AGCCAG (nucleotides 1039 - 1044) replaced with TTGGGT; ATTGCC (nucleotides 1195 - 1200) replaced with TTGGGT; GAAGTG (nucleotides 760 - 765) replaced with TTGGAA; GAAGTG (nucleotides 799 - 804) replaced with TTAGAG; GAAGTG (nucleotides 1054 - 1059) replaced with TTGGAA; CAGGCG (nucleotides 43 - 48) replaced with TCACAA; GATCTC (nucleotides 1072 - 1077) replaced with ATTGCT; CTCGGT (nucleotides 22 - 27) replaced with GAAGTT; GTGATG (nucleotides 559 - 564) replaced with GAAGTA; GCGCTG (nucleotides 1477 - 1482) replaced with GAAGTT; GCGCTG (nucleotides 496 - 501) replaced with CAAGCA; GCGCTG (nucleotides 1192 - 1197) replaced with GATTTG; GCGCTG (nucleotides 1111 - 1116) replaced with TTGGGA; GCGCTG (nucleotides 958 - 963) replaced with GTAATG; GCGCTG (nucleotides 109 - 114) replaced with GCTTTA; CTCGAC (nucleotides 328 - 333) replaced with GCTTTG; ATCCTC (nucleotides 682 - 687) replaced with GCTTTG; ATCCTC (nucleotides 1279 - 1284) replaced with GCATTG; GACTGG (nucleotides 1366 - 1371) replaced with GCTTTA; GTGGTG (nucleotides 1462 - 1467) replaced with GCTTTG. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0101] In some embodiments are provided a L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GACGAT (nucleotides 208 - 213); GACGAT (nucleotides 1129 - 1134); ATCAAG (nucleotides 598 - 603); AAACTG (nucleotides 127 - 132); AAACTG (nucleotides 139 - 144); AAACTG (nucleotides 1261 - 1266); TTGAAA (nucleotides 148 - 153); CTTCCA (nucleotides 862 - 867); TTCAAC (nucleotides 319 - 324); ATCAAC (nucleotides 268 - 273); GGTATT (nucleotides 1114 - 1119); GCCAAA (nucleotides 256 - 261); CTGAAA (nucleotides 526 - 531); CTGAAA (nucleotides 853 - 858); AAACAG (nucleotides 508 - 513); AAACAG (nucleotides 856 - 861). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GACGAT (nucleotides 208 - 213) replaced with GATGAT; GACGAT (nucleotides 1129 - 1134) replaced with GATGAT; ATCAAG (nucleotides 598 - 603) replaced with ATAAAA; AAACTG (nucleotides 127 - 132) replaced with AAATTG; AAACTG (nucleotides 139 - 144) replaced with AAATTA; AAACTG (nucleotides 1261 - 1266) replaced with AAATTG; TTGAAA (nucleotides 148 - 153) replaced with TTAAAA; CTTCCA (nucleotides 862 - 867) replaced with TTGCCA; TTCAAC (nucleotides 319 - 324) replaced with TTTAAT; ATCAAC (nucleotides 268 - 273) replaced with ATTAAT; GGTATT (nucleotides 1114 - 1119) replaced with GGAATT; GCCAAA (nucleotides 256 - 261) replaced with GCTAAA; CTGAAA (nucleotides 526 - 531) replaced with TTAAAG; CTGAAA (nucleotides 853 - 858) replaced with TTAAAA; AAACAG (nucleotides 508 - 513) replaced with AAACAA; AAACAG (nucleotides 856 - 861) replaced with AAACAA. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0102] In some embodiments are provided a L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTTGTC (nucleotides 31 - 36); GTCATT (nucleotides 34 - 39); TTGAAA (nucleotides 148 - 153); GACGAT (nucleotides 208 - 213); CAGCAG (nucleotides 892 - 897); GAGAAA (nucleotides 1018 - 1023); GAGAAA (nucleotides 1084 - 1089); GACGTT (nucleotides 1099 - 1104); GGTATT (nucleotides 1114 - 1119); GACGAT (nucleotides 1129 - 1134); GTGAAA (nucleotides 1237 - 1242); GCGTTT (nucleotides 1450 - 1455). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTTGTC (nucleotides 31 - 36) replaced with TTCGTT; GTCATT (nucleotides 34 - 39) replaced with GTTATT; TTGAAA (nucleotides 148 - 153) replaced with TTAAAG; GACGAT (nucleotides 208 - 213) replaced with GATGAT; CAGCAG (nucleotides 892 - 897) replaced with CAACAA; GAGAAA (nucleotides 1018 - 1023) replaced with GAAAAA; GAGAAA (nucleotides 1084 - 1089) replaced with GAAAAA; GACGTT (nucleotides 1099 - 1104) replaced with GATGTT; GGTATT (nucleotides 1114 - 1119) replaced with GGAATT; GACGAT (nucleotides 1129 - 1134) replaced with GATGAT; GTGAAA (nucleotides 1237 - 1242) replaced with GTTAAA; GCGTTT (nucleotides 1450 - 1455) replaced with GCGTTC.
In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
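Some of the listed codon pairs overlap by one codon: TTTGTC at nucleotides 31 - 36 and GTCATT at nucleotides 34 - 39 share the codon at positions 34 - 36. When both replacements are applied, the shared codon must be the same in each. The Python sketch below is one way to check this; the function name and the conflicting counter-example are hypothetical.

```python
def shared_codon_conflicts(replacements):
    """Return 1-based codon start positions assigned two different codons.

    `replacements` is a list of (start, new_pair) tuples, where `start` is the
    1-based first nucleotide of a 6-nt codon pair and `new_pair` its replacement.
    """
    assigned = {}
    conflicts = []
    for start, new_pair in replacements:
        for offset in (0, 3):  # first and second codon of the pair
            pos = start + offset
            codon = new_pair[offset:offset + 3]
            if pos in assigned and assigned[pos] != codon:
                conflicts.append(pos)
            assigned.setdefault(pos, codon)
    return conflicts


# Overlapping replacements as listed in the paragraph above.
listed = [(31, "TTCGTT"), (34, "GTTATT")]
listed_conflicts = shared_codon_conflicts(listed)

# Hypothetical counter-example: the two entries disagree at codon 34 - 36.
bad = [(31, "TTCGTC"), (34, "GTTATT")]
bad_conflicts = shared_codon_conflicts(bad)
```

For the listed pair, both replacements place GTT at positions 34 - 36, so no conflict is reported.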
[0103] In some embodiments are provided a L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GCTATT (nucleotides 184 - 189); GACAGT (nucleotides 340 - 345); GCGGTT (nucleotides 499 - 504); GCGGTT (nucleotides 628 - 633); GTCGAT (nucleotides 688 - 693); CAGCTT (nucleotides 859 - 864); GAAGGC (nucleotides 916 - 921); ACCTAT (nucleotides 1006 - 1011); GGTATT (nucleotides 1114 - 1119); AAAGAC (nucleotides 1456 - 1461). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GCTATT (nucleotides 184 - 189) replaced with GCCATT; GACAGT (nucleotides 340 - 345) replaced with GACTCC; GCGGTT (nucleotides 499 - 504) replaced with GCCGTT; GCGGTT (nucleotides 628 - 633) replaced with GCCGTC; GTCGAT (nucleotides 688 - 693) replaced with GTTGAT; CAGCTT (nucleotides 859 - 864) replaced with CAGTTG; GAAGGC (nucleotides 916 - 921) replaced with GAGGGT; ACCTAT (nucleotides 1006 - 1011) replaced with ACGTAC; GGTATT (nucleotides 1114 - 1119) replaced with GGCATA; AAAGAC (nucleotides 1456 - 1461) replaced with AAAGAT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0104] Also provided herein is a L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
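The "highly-overrepresented" threshold in the preceding paragraph can be sketched in Python as follows. The translational kinetics values below are invented for illustration, the function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which is intended.

```python
from statistics import pstdev


def highly_overrepresented(values, k):
    """Codon pairs whose value exceeds k times the standard deviation.

    `values` maps codon pair -> translational kinetics value for one host.
    The population standard deviation is an assumption of this sketch.
    """
    sd = pstdev(list(values.values()))
    return {pair for pair, v in values.items() if v > k * sd}


# Hypothetical values for a handful of codon pairs in some host organism.
toy_values = {
    "GCGCTG": 6.0,
    "CTGGAA": 1.0,
    "GATATT": 0.0,
    "TTGAAA": -1.0,
    "GGTATT": -2.0,
}

flagged_2sd = highly_overrepresented(toy_values, 2)
flagged_3sd = highly_overrepresented(toy_values, 3)
```

In a real analysis the standard deviation would be taken over the values for all codon pairs observed in the host organism, not a five-entry toy table, so the multipliers 2, 2.5, 3 and 5 would select progressively smaller sets of pairs.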
[0105] Also provided herein is a L-arabinose isomerase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0106] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0107] In some embodiments, provided herein is a system for metabolizing arabinose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the L-arabinose isomerase retains at least 75% of the enzymatic activity of wild-type AraA (SEQ ID NO: 194) under normal physiological conditions.
[0108] In some embodiments are provided a L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 8-472 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 8-472 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 8-472 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 8-472 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair CTGGTG when expressed in the native organism.
[0109] In some embodiments are provided a L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-8 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-8 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GAAGTG when expressed in the native organism.
[0110] In some embodiments are provided a L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 562 - 567); GGTATT (nucleotides 445 - 450); GGTATT (nucleotides 943 - 948); GAGTTT (nucleotides 319 - 324); GGATTT (nucleotides 979 - 984); TTTGCC (nucleotides 322 - 327); GATATC (nucleotides 1018 - 1023); CTTTAT (nucleotides 1603 - 1608); GATATT (nucleotides 586 - 591); GATATT (nucleotides 736 - 741); GGCCAA (nucleotides 1000 - 1005). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 562 - 567) replaced with TTGAGT; GGTATT (nucleotides 445 - 450) replaced with GGAATT; GGTATT (nucleotides 943 - 948) replaced with GGAATT; GAGTTT (nucleotides 319 - 324) replaced with GAATTT; GGATTT (nucleotides 979 - 984) replaced with GGATTT; TTTGCC (nucleotides 322 - 327) replaced with TTTGCA; GATATC (nucleotides 1018 - 1023) replaced with GACATT; CTTTAT (nucleotides 1603 - 1608) replaced with TTGTAT; GATATT (nucleotides 586 - 591) replaced with GACATT; GATATT (nucleotides 736 - 741) replaced with GATATA; GGCCAA (nucleotides 1000 - 1005) replaced with GGACAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0111] In some embodiments are provided a L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTGGCG (nucleotides 304 - 309); GAAGAG (nucleotides 73 - 78); GAAGAG (nucleotides 385 - 390); GCCAGC (nucleotides 64 - 69); GCCAGC (nucleotides 1105 - 1110); CTTTCC (nucleotides 562 - 567); CTCGAC (nucleotides 1183 - 1188); TTTGCC (nucleotides 322 - 327); GGGCAA (nucleotides 118 - 123); ATCCTC (nucleotides 685 - 690); GACTGG (nucleotides 544 - 549); GACTGG (nucleotides 1186 - 1191); GCCAGT (nucleotides 658 - 663); GCCAGT (nucleotides 1543 - 1548); GTGGTG (nucleotides 796 - 801); GTGGTG (nucleotides 970 - 975); GTGGTG (nucleotides 1177 - 1182); CTCGGC (nucleotides 778 - 783); GCGGTA (nucleotides 1549 - 1554); GACAGC (nucleotides 499 - 504); CTGGAA (nucleotides 991 - 996); CTGGAA (nucleotides 1057 - 1062); AGCCAG (nucleotides 1108 - 1113); ATTGCC (nucleotides 904 - 909); GCCGGG (nucleotides 610 - 615); CTCGGT (nucleotides 1471 - 1476); GCCTGG (nucleotides 1027 - 1032); GCGGCA (nucleotides 187 - 192); GTGATG (nucleotides 1363 - 1368); GGCGCA (nucleotides 832 - 837); GGCGCA (nucleotides 841 - 846); GGCGCA (nucleotides 847 - 852); GGCGCA (nucleotides 1309 - 1314); TTCTGG (nucleotides 466 - 471); GCGCTG (nucleotides 307 - 312); GCGCTG (nucleotides 1129 - 1134); GCGCTG (nucleotides 1369 - 1374); ATCGCC (nucleotides 79 - 84); ATCGCC (nucleotides 1348 - 1353). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTGGCG (nucleotides 304 - 309) replaced with CTGGCT; GAAGAG (nucleotides 73 - 78) replaced with GAAGAA; GAAGAG (nucleotides 385 - 390) replaced with GAAGAA; GCCAGC (nucleotides 64 - 69) replaced with GCGTCT; GCCAGC (nucleotides 1105 - 1110) replaced with GCGTCT; CTTTCC (nucleotides 562 - 567) replaced with CTGTCT; CTCGAC (nucleotides 1183 - 1188) replaced with CTGGAT; TTTGCC (nucleotides 322 - 327) replaced with TTTGCG; GGGCAA (nucleotides 118 - 123) replaced with GGTCAG; ATCCTC (nucleotides 685 - 690) replaced with ATCCTG; GACTGG (nucleotides 544 - 549) replaced with GATTGG; GACTGG (nucleotides 1186 - 1191) replaced with GATTGG; GCCAGT (nucleotides 658 - 663) replaced with GCGTCC; GCCAGT (nucleotides 1543 - 1548) replaced with GCTTCT; GTGGTG (nucleotides 796 - 801) replaced with GTTGTT; GTGGTG (nucleotides 970 - 975) replaced with GTTGTT; GTGGTG (nucleotides 1177 - 1182) replaced with GTTGTT; CTCGGC (nucleotides 778 - 783) replaced with CTGGGT; GCGGTA (nucleotides 1549 - 1554) replaced with GCGGTT; GACAGC (nucleotides 499 - 504) replaced with GATTCT; CTGGAA (nucleotides 991 - 996) replaced with CTGGAG; CTGGAA (nucleotides 1057 - 1062) replaced with CTCGAA; AGCCAG (nucleotides 1108 - 1113) replaced with TCTCAG; ATTGCC (nucleotides 904 - 909) replaced with ATCGCG; GCCGGG (nucleotides 610 - 615) replaced with GCGGGT; CTCGGT (nucleotides 1471 - 1476) replaced with TTGGGT; GCCTGG (nucleotides 1027 - 1032) replaced with GCGTGG; GCGGCA (nucleotides 187 - 192) replaced with GCTGCT; GTGATG (nucleotides 1363 - 1368) replaced with GTTATG; GGCGCA (nucleotides 832 - 837) replaced with GGTGCG; GGCGCA (nucleotides 841 - 846) replaced with GGTGCA; GGCGCA (nucleotides 847 - 852) replaced with GGTGCT; GGCGCA (nucleotides 1309 - 1314) replaced with GGCGCT; TTCTGG (nucleotides 466 - 471) replaced with TTTTGG; GCGCTG (nucleotides 307 - 312) replaced with GCTCTG; GCGCTG (nucleotides 1129 - 1134) replaced with GCGCTC; GCGCTG (nucleotides 1369 - 1374) replaced with GCTCTG; ATCGCC (nucleotides 79 - 84) replaced with ATTGCG; ATCGCC (nucleotides 1348 - 1353) replaced with ATCGCG. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0112] In some embodiments are provided a L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAGTTT (nucleotides 319 - 324); GATATC (nucleotides 1018 - 1023); GATATT (nucleotides 586 - 591); GATATT (nucleotides 736 - 741); TTTGCC (nucleotides 322 - 327); CTTCCA (nucleotides 1651 - 1656); ATCAAC (nucleotides 1099 - 1104); GGTATT (nucleotides 445 - 450); GGTATT (nucleotides 943 - 948); GCCAAA (nucleotides 1147 - 1152); CTGAAA (nucleotides 193 - 198); CTGAAA (nucleotides 1087 - 1092); CTGAAA (nucleotides 1228 - 1233); AAACAG (nucleotides 913 - 918); GGCCAA (nucleotides 1000 - 1005); CTGGTA (nucleotides 865 - 870); CTTTCC (nucleotides 562 - 567); TTTGAC (nucleotides 817 - 822). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAGTTT (nucleotides 319 - 324) replaced with GAATTT; GATATC (nucleotides 1018 - 1023) replaced with GACATC; GATATT (nucleotides 586 - 591) replaced with GACATC; GATATT (nucleotides 736 - 741) replaced with GACATC; TTTGCC (nucleotides 322 - 327) replaced with TTTGCG; CTTCCA (nucleotides 1651 - 1656) replaced with CTCCCG; ATCAAC (nucleotides 1099 - 1104) replaced with ATCAAC; GGTATT (nucleotides 445 - 450) replaced with GGTATC; GGTATT (nucleotides 943 - 948) replaced with GGTATC; GCCAAA (nucleotides 1147 - 1152) replaced with GCTAAA; CTGAAA (nucleotides 193 - 198) replaced with CTGAAA; CTGAAA (nucleotides 1087 - 1092) replaced with CTGAAA; CTGAAA (nucleotides 1228 - 1233) replaced with CTGAAA; AAACAG (nucleotides 913 - 918) replaced with AAACAG; GGCCAA (nucleotides 1000 - 1005) replaced with GGTCAG; CTGGTA (nucleotides 865 - 870) replaced with CTCGTT; CTTTCC (nucleotides 562 - 567) replaced with CTGTCT; TTTGAC (nucleotides 817 - 822) replaced with TTTGAC. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
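Several entries in the list above give a replacement whose sequence is identical to the original codon pair. A short Python check, sketched here under the assumption that entries are held as (wild-type pair, start position, replacement pair) tuples, can separate those entries from the ones that actually change nucleotides. The function name is hypothetical, and only a subset of the listed entries is shown.

```python
def unchanged_entries(entries):
    """Return (pair, start) for entries whose replacement equals the original."""
    return [(wild, start) for wild, start, new in entries if wild == new]


# Subset of the entries listed in the paragraph above.
entries = [
    ("GAGTTT", 319, "GAATTT"),
    ("ATCAAC", 1099, "ATCAAC"),
    ("CTGAAA", 193, "CTGAAA"),
    ("CTGAAA", 1087, "CTGAAA"),
    ("CTGAAA", 1228, "CTGAAA"),
    ("AAACAG", 913, "AAACAG"),
    ("GGCCAA", 1000, "GGTCAG"),
    ("TTTGAC", 817, "TTTGAC"),
]

same = unchanged_entries(entries)
changed = [e for e in entries if e[0] != e[2]]
```

Entries returned by `unchanged_entries` would not contribute to the "at least 3" replaced codon pairs, since no nucleotide differs from the wild-type sequence at those positions.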
[0113] In some embodiments are provided a L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAGTTT (nucleotides 319 - 324); TTTGCC (nucleotides 322 - 327); CTTTCC (nucleotides 562 - 567); GGTACC (nucleotides 568 - 573); GGCCAA (nucleotides 1000 - 1005); GATATC (nucleotides 1018 - 1023); TTTGCT (nucleotides 1486 - 1491). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAGTTT (nucleotides 319 - 324) replaced with GAGTTC; TTTGCC (nucleotides 322 - 327) replaced with TTCGCT; CTTTCC (nucleotides 562 - 567) replaced with TTGTCT; GGTACC (nucleotides 568 - 573) replaced with GGAACT; GGCCAA (nucleotides 1000 - 1005) replaced with GGACAA; GATATC (nucleotides 1018 - 1023) replaced with GACATT; TTTGCT (nucleotides 1486 - 1491) replaced with TTCGCT. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0114] In some embodiments are provided a L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTCGAT (nucleotides 19 - 24); GCTTTG (nucleotides 46 - 51); GATGCC (nucleotides 130 - 135); GACAGT (nucleotides 256 - 261); GCACCG (nucleotides 277 - 282); GATGCC (nucleotides 286 - 291); AAAGAC (nucleotides 358 - 363); GCGGTT (nucleotides 370 - 375); CGCTAT (nucleotides 433 - 438); GGTATT (nucleotides 445 - 450); GACAGC (nucleotides 499 - 504); TCCGGT (nucleotides 565 - 570); CGGGCA (nucleotides 931 - 936); GGTATT (nucleotides 943 - 948); GTGCCT (nucleotides 973 - 978); CAGCTT (nucleotides 1063 - 1068); GCATGG (nucleotides 1141 - 1146); GCCTTT (nucleotides 1303 - 1308); CAGCTT (nucleotides 1600 - 1605); CTTTAT (nucleotides 1603 - 1608); CGCTAT (nucleotides 1612 - 1617). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTCGAT (nucleotides 19 - 24) replaced with TTGGAT; GCTTTG (nucleotides 46 - 51) replaced with GCCCTT; GATGCC (nucleotides 130 - 135) replaced with GATGCT; GACAGT (nucleotides 256 - 261) replaced with GATTCT; GCACCG (nucleotides 277 - 282) replaced with GCCCCG; GATGCC (nucleotides 286 - 291) replaced with GACGCC; AAAGAC (nucleotides 358 - 363) replaced with AAAGAT; GCGGTT (nucleotides 370 - 375) replaced with GCCGTT; CGCTAT (nucleotides 433 - 438) replaced with CGTTAT; GGTATT (nucleotides 445 - 450) replaced with GGCATC; GACAGC (nucleotides 499 - 504) replaced with GATTCT; TCCGGT (nucleotides 565 - 570) replaced with TCTGGC; CGGGCA (nucleotides 931 - 936) replaced with CGTGCC; GGTATT (nucleotides 943 - 948) replaced with GGTATA; GTGCCT (nucleotides 973 - 978) replaced with GTTCCG; CAGCTT (nucleotides 1063 - 1068) replaced with CAGTTG; GCATGG (nucleotides 1141 - 1146) replaced with GCCTGG; GCCTTT (nucleotides 1303 - 1308) replaced with GCCTTC; CAGCTT (nucleotides 1600 - 1605) replaced with CAGTTG; CTTTAT (nucleotides 1603 - 1608) replaced with TTGTAT; CGCTAT (nucleotides 1612 - 1617) replaced with CGTTAT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0115] Also provided herein is a L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein.
In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3. or 2.5. or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S.cerevisiae.
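The threshold test described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout and the numeric values are hypothetical, and the text does not say whether a population or sample standard deviation is meant (the population form is assumed here).

```python
from statistics import pstdev

def highly_overrepresented(values, multiple=3.0):
    """Return the codon pairs whose translational kinetics value exceeds
    `multiple` times the standard deviation of all values supplied for
    the host organism (population standard deviation is an assumption)."""
    threshold = multiple * pstdev(values.values())
    return {pair for pair, value in values.items() if value > threshold}

# Hypothetical translational kinetics values for a handful of codon pairs;
# these numbers are invented for illustration and are not from the document.
toy_values = {"CTGGCG": 9.0, "GAAGAG": 1.0, "ATCAAA": -1.0,
              "GGTATC": 0.5, "AACGTC": -0.5}

flagged = highly_overrepresented(toy_values, multiple=2.0)
```

In practice the standard deviation would be taken over the values for all codon pairs of the host organism, so a toy table of five entries only demonstrates the comparison itself.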
[0116] Also provided herein is a L-ribulokinase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0117] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0118] In some embodiments, provided herein is a system for metabolizing arabinose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase (AraA), L-ribulokinase (AraB), and L-ribulose-5-P 4-epimerase (AraD); wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the L-ribulokinase retains at least 75% of the enzymatic activity of wild-type AraB (SEQ ID NO: 218) under normal physiological conditions.
[0119] In some embodiments are provided a L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 59-549 of SEQ ID NO: 218 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 59-549 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 59-549 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 59-549 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair CTGGCG when expressed in the native organism.
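One way to read the z score ceilings above is sketched below in Python. The function names and the z score values are hypothetical, and the sketch assumes the reference value (mean or median of the five highest wild-type z scores) is positive, since a percentage of a negative reference would be ambiguous.

```python
from statistics import mean, median

def reference_z(wild_type_z, use_median=False):
    """Mean (or median) of the five highest wild-type z scores."""
    top_five = sorted(wild_type_z, reverse=True)[:5]
    return median(top_five) if use_median else mean(top_five)

def within_ceiling(replacement_z, wild_type_z, percent=150, use_median=False):
    """True if no replacement z score exceeds `percent` percent of the
    reference value computed from the wild-type z scores."""
    ceiling = reference_z(wild_type_z, use_median) * percent / 100
    return all(z <= ceiling for z in replacement_z)

# Hypothetical z scores; not taken from the document.
wild_type = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0, -1.0]

ok = within_ceiling([2.5, 5.5, -0.3], wild_type, percent=150)
too_high = within_ceiling([2.5, 6.5], wild_type, percent=150)
```

Here the five highest wild-type scores average 4.0, so the 150% ceiling is 6.0: the first set of replacement scores passes and the second does not.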
[0120] In some embodiments are provided a L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-59 of SEQ ID NO: 218 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-59 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-59 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-59 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GAAGAG when expressed in the native organism.
[0121] In some embodiments are provided a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the 3 codon pairs to be replaced are selected from the following: AACGTC (nucleotides 82 - 87); ATCAAA (nucleotides 121 - 126); GGCCAG (nucleotides 322 - 327); GCAGAA (nucleotides 403 - 408); ATCAAC (nucleotides 409 - 414); AACGTC (nucleotides 439 - 444); GGTATC (nucleotides 469 - 474); CCGCAG (nucleotides 613 - 618). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AACGTC (nucleotides 82 - 87) replaced with AATGTT; ATCAAA (nucleotides 121 - 126) replaced with ATTAAA; GGCCAG (nucleotides 322 - 327) replaced with GGTCAA; GCAGAA (nucleotides 403 - 408) replaced with GCTGAA; ATCAAC (nucleotides 409 - 414) replaced with ATTAAT; AACGTC (nucleotides 439 - 444) replaced with AATGTA; GGTATC (nucleotides 469 - 474) replaced with GGAATT; CCGCAG (nucleotides 613 - 618) replaced with CCACAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
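The replacements listed above for S. cerevisiae can be checked mechanically. The Python sketch below is a minimal illustration, not the method of the document: it assumes the standard genetic code and assumes the coding sequence starts at nucleotide 1, so that a codon pair is in frame when its first nucleotide minus one is divisible by three. It tests only for identical amino acids; conservative substitutions, which the text also permits, are not modeled. The helper names are hypothetical.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, enumerated in TCAG order for each codon position.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(nt):
    """Translate an in-frame nucleotide string into one-letter amino acids."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt), 3))

# (first nucleotide, last nucleotide, original pair, replacement pair),
# copied from the list in the paragraph above.
REPLACEMENTS = [
    (82, 87, "AACGTC", "AATGTT"), (121, 126, "ATCAAA", "ATTAAA"),
    (322, 327, "GGCCAG", "GGTCAA"), (403, 408, "GCAGAA", "GCTGAA"),
    (409, 414, "ATCAAC", "ATTAAT"), (439, 444, "AACGTC", "AATGTA"),
    (469, 474, "GGTATC", "GGAATT"), (613, 618, "CCGCAG", "CCACAA"),
]

def is_silent_in_frame(start, end, old, new):
    """True if the pair spans six in-frame nucleotides and the replacement
    encodes exactly the same two amino acids as the original."""
    in_frame = (start - 1) % 3 == 0 and end - start == 5
    return in_frame and translate(old) == translate(new)

results = [is_silent_in_frame(*r) for r in REPLACEMENTS]
```

Every entry in `results` is true for this list, which is consistent with the replacements being silent changes rather than conservative substitutions.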
[0122] In some embodiments are provided a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTGGCG (nucleotides 40 - 45); GAAGAG (nucleotides 571 - 576); ACGCTG (nucleotides 637 - 642); GTCAGC (nucleotides 85 - 90); CTGGAA (nucleotides 568 - 573); ACGCCA (nucleotides 229 - 234); TTCCCG (nucleotides 259 - 264); GAAGTG (nucleotides 193 - 198); CAGGCG (nucleotides 316 - 321); GATCTC (nucleotides 10 - 15); GCGCTG (nucleotides 43 - 48). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTGGCG (nucleotides 40 - 45) replaced with TTGGCG; GAAGAG (nucleotides 571 - 576) replaced with GAAGAA; ACGCTG (nucleotides 637 - 642) replaced with ACATTG; GTCAGC (nucleotides 85 - 90) replaced with GTTTCA; CTGGAA (nucleotides 568 - 573) replaced with TTGGAA; ACGCCA (nucleotides 229 - 234) replaced with ACTCCA; TTCCCG (nucleotides 259 - 264) replaced with TTTCCA; GAAGTG (nucleotides 193 - 198) replaced with GAAGTT; CAGGCG (nucleotides 316 - 321) replaced with CAAGCT; GATCTC (nucleotides 10 - 15) replaced with GATTTA; GCGCTG (nucleotides 43 - 48) replaced with GCGTTG. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
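Two pairs of entries in the list above overlap by one codon (nucleotides 40 - 45 with 43 - 48, and 568 - 573 with 571 - 576), so both replacements must agree on the shared codon if they are applied together. The Python sketch below shows one way to merge such overlapping edits into a single nine-nucleotide edit; the function name and tuple layout are hypothetical.

```python
def merge_overlapping(first, second):
    """Merge two (start, old, new) codon-pair edits that share one codon.

    `start` is the 1-based first nucleotide of each pair. The second pair
    must begin exactly three nucleotides after the first, and both edits
    must agree on the shared codon before and after replacement.
    """
    (s1, old1, new1), (s2, old2, new2) = first, second
    if s2 - s1 != 3:
        raise ValueError("pairs do not overlap by exactly one codon")
    if old1[3:] != old2[:3]:
        raise ValueError("original pairs disagree on the shared codon")
    if new1[3:] != new2[:3]:
        raise ValueError("replacements disagree on the shared codon")
    # Keep the first pair whole and append the last codon of the second.
    return s1, old1 + old2[3:], new1 + new2[3:]

# Overlapping entries copied from the list in the paragraph above.
merged_a = merge_overlapping((40, "CTGGCG", "TTGGCG"), (43, "GCGCTG", "GCGTTG"))
merged_b = merge_overlapping((568, "CTGGAA", "TTGGAA"), (571, "GAAGAG", "GAAGAA"))
```

Both overlapping pairs in the list merge without conflict, because each leaves the shared codon (GCG and GAA, respectively) unchanged.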
[0123] In some embodiments are provided a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GACGAT (nucleotides 160 - 165); ATCAAC (nucleotides 409 - 414); ATCAAA (nucleotides 121 - 126); GGTATC (nucleotides 469 - 474); AAACAG (nucleotides 463 - 468). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GACGAT (nucleotides 160 - 165) replaced with GATGAT; ATCAAC (nucleotides 409 - 414) replaced with ATTAAT; ATCAAA (nucleotides 121 - 126) replaced with ATTAAA; GGTATC (nucleotides 469 - 474) replaced with GGAATT; AAACAG (nucleotides 463 - 468) replaced with AAACAA. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0124] In some embodiments are provided a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: ATCAAA (nucleotides 121 - 126); GACGAT (nucleotides 160 - 165); TATTTC (nucleotides 361 - 366); ACCATT (nucleotides 373 - 378); GGTATC (nucleotides 469 - 474); TTTGCA (nucleotides 520 - 525). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: ATCAAA (nucleotides 121 - 126) replaced with ATTAAA; GACGAT (nucleotides 160 - 165) replaced with GATGAT; TATTTC (nucleotides 361 - 366) replaced with TACTTC; ACCATT (nucleotides 373 - 378) replaced with ACAATT; GGTATC (nucleotides 469 - 474) replaced with GGAATT; TTTGCA (nucleotides 520 - 525) replaced with TTCGCG. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0125] In some embodiments are provided a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: ACATGG (nucleotides 73 - 78); GTCGAT (nucleotides 136 - 141); CTCTAT (nucleotides 247 - 252); GGTATC (nucleotides 469 - 474); GCATGG (nucleotides 523 - 528). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: ACATGG (nucleotides 73 - 78) replaced with ACCTGG; GTCGAT (nucleotides 136 - 141) replaced with GTCGAC; CTCTAT (nucleotides 247 - 252) replaced with TTGTAT; GGTATC (nucleotides 469 - 474) replaced with GGCATT; GCATGG (nucleotides 523 - 528) replaced with GCTTGG. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0126] Also provided herein is a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S.cerevisiae.
[0127] Also provided herein is a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0128] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0129] In some embodiments, provided herein is a system for metabolizing arabinose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the L-ribulose-5-P 4-epimerase retains at least 75% of the enzymatic activity of wild-type AraD (SEQ ID NO: 242) under normal physiological conditions.
[0130] In some embodiments are provided a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 7-217 of SEQ ID NO: 242 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 7-217 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 7-217 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 7-217 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair CTGGCG when expressed in the native organism.
[0131] In some embodiments are provided a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-7 of SEQ ID NO: 242 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-7 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-7 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-7 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair GATCTC when expressed in the native organism.
[0132] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: ATCAAA (nucleotides 22 - 27); TTGAAC (nucleotides 286 - 291); TTGAAC (nucleotides 700 - 705); ATCAAG (nucleotides 115 - 120); ATCAAG (nucleotides 553 - 558); ATCAAG (nucleotides 733 - 738); GCCAAG (nucleotides 748 - 753); GCCAAG (nucleotides 901 - 906). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: ATCAAA (nucleotides 22 - 27) replaced with ATTAAA; TTGAAC (nucleotides 286 - 291) replaced with TTAAAT; TTGAAC (nucleotides 700 - 705) replaced with TTAAAT; ATCAAG (nucleotides 115 - 120) replaced with ATTAAA; ATCAAG (nucleotides 553 - 558) replaced with ATTAAA; ATCAAG (nucleotides 733 - 738) replaced with ATTAAA; GCCAAG (nucleotides 748 - 753) replaced with GCAAAA; GCCAAG (nucleotides 901 - 906) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0133] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GAAGAG (nucleotides 220 - 225); TTCCTC (nucleotides 229 - 234); ATTGCC (nucleotides 349 - 354); ATCGCC (nucleotides 898 - 903); GACTGG (nucleotides 940 - 945). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GAAGAG (nucleotides 220 - 225) replaced with GAAGAA; TTCCTC (nucleotides 229 - 234) replaced with TTCCTG; ATTGCC (nucleotides 349 - 354) replaced with ATCGCG; ATCGCC (nucleotides 898 - 903) replaced with ATCGCG; GACTGG (nucleotides 940 - 945) replaced with GATTGG. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0134] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TCCAAG (nucleotides 238 - 243); ATCAAG (nucleotides 115 - 120); ATCAAG (nucleotides 553 - 558); ATCAAG (nucleotides 733 - 738); TTCAAG (nucleotides 355 - 360); TTCAAC (nucleotides 859 - 864); TTCAAC (nucleotides 925 - 930); ATCAAA (nucleotides 22 - 27); GTCAAG (nucleotides 184 - 189); GTCAAG (nucleotides 211 - 216); GACGAA (nucleotides 199 - 204); GGTATC (nucleotides 802 - 807); TTGAAC (nucleotides 286 - 291); TTGAAC (nucleotides 700 - 705). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TCCAAG (nucleotides 238 - 243) replaced with TCTAAA; ATCAAG (nucleotides 115 - 120) replaced with ATTAAA; ATCAAG (nucleotides 553 - 558) replaced with ATTAAG; ATCAAG (nucleotides 733 - 738) replaced with ATTAAG; TTCAAG (nucleotides 355 - 360) replaced with TTTAAA; TTCAAC (nucleotides 859 - 864) replaced with TTTAAT; TTCAAC (nucleotides 925 - 930) replaced with TTTAAT; ATCAAA (nucleotides 22 - 27) replaced with ATTAAA; GTCAAG (nucleotides 184 - 189) replaced with GTTAAA; GTCAAG (nucleotides 211 - 216) replaced with GTTAAG; GACGAA (nucleotides 199 - 204) replaced with GATGAA; GGTATC (nucleotides 802 - 807) replaced with GGAATT; TTGAAC (nucleotides 286 - 291) replaced with TTAAAT; TTGAAC (nucleotides 700 - 705) replaced with TTAAAT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0135] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: ATCAAA (nucleotides 22 - 27); TTGAAC (nucleotides 286 - 291); TTCCCA (nucleotides 343 - 348); TTCCCA (nucleotides 511 - 516); TTGAAC (nucleotides 700 - 705); GCCAAG (nucleotides 748 - 753); GGTATC (nucleotides 802 - 807); GCCAAG (nucleotides 901 - 906). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: ATCAAA (nucleotides 22 - 27) replaced with ATAAAA; TTGAAC (nucleotides 286 - 291) replaced with TTAAAT; TTCCCA (nucleotides 343 - 348) replaced with TTCCCT; TTCCCA (nucleotides 511 - 516) replaced with TTCCCT; TTGAAC (nucleotides 700 - 705) replaced with TTAAAC; GCCAAG (nucleotides 748 - 753) replaced with GCTAAA; GGTATC (nucleotides 802 - 807) replaced with GGAATT; GCCAAG (nucleotides 901 - 906) replaced with GCTAAA. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0136] In some embodiments are provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GCCGGT (nucleotides 91 - 96); GCCGGT (nucleotides 121 - 126); GCCTTG (nucleotides 283 - 288); GCCGGT (nucleotides 478 - 483); GCTTTG (nucleotides 520 - 525); GCCGGT (nucleotides 628 - 633); GCTTTG (nucleotides 697 - 702); GCTATT (nucleotides 739 - 744); GCCAAG (nucleotides 748 - 753); GGTATC (nucleotides 802 - 807); GCCAAG (nucleotides 901 - 906). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GCCGGT (nucleotides 91 - 96) replaced with GCGGGT; GCCGGT (nucleotides 121 - 126) replaced with GCTGGT; GCCTTG (nucleotides 283 - 288) replaced with GCTCTT; GCCGGT (nucleotides 478 - 483) replaced with GCTGGC; GCTTTG (nucleotides 520 - 525) replaced with GCTCTT; GCCGGT (nucleotides 628 - 633) replaced with GCTGGA; GCTTTG (nucleotides 697 - 702) replaced with GCTCTT; GCTATT (nucleotides 739 - 744) replaced with GCCATT; GCCAAG (nucleotides 748 - 753) replaced with GCGAAA; GGTATC (nucleotides 802 - 807) replaced with GGCATA; GCCAAG (nucleotides 901 - 906) replaced with GCCAAA. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0137] Also provided herein is a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0138] Also provided herein is a xylose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0139] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0140] In some embodiments, provided herein is a system for metabolizing xylose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects, the xylose reductase retains at least 75% of the enzymatic activity of wild-type Xyr (SEQ ID NO: 266) under normal physiological conditions.
[0141] In some embodiments is provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 9-306 of SEQ ID NO: 266 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 9-306 of SEQ ID NO: 266 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 9-306 when expressed in the native organism.
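The z score limits recited above can be expressed as simple comparisons, sketched below for illustration only. The sketch assumes that z scores have already been computed for each codon pair in the relevant organism and that the wild type reference values are positive; the function names and the example values are hypothetical.

```python
from statistics import mean, median


def within_pair_limit(replacement_z, wild_type_z, limit_pct=150.0):
    """True when the replacement z score is no more than limit_pct percent
    of the z score of the wild type codon pair it replaces."""
    return replacement_z <= wild_type_z * limit_pct / 100.0


def top_five_reference(wild_type_zs, use_median=False):
    """Mean (or median) of the five highest wild type z scores."""
    top = sorted(wild_type_zs, reverse=True)[:5]
    return median(top) if use_median else mean(top)


def all_below_reference(replacement_zs, wild_type_zs, limit_pct=100.0,
                        use_median=False):
    """True when no replacement z score exceeds limit_pct percent of the
    reference computed from the five highest wild type z scores."""
    reference = top_five_reference(wild_type_zs, use_median)
    return all(z <= reference * limit_pct / 100.0 for z in replacement_zs)


# Hypothetical wild type z scores in the native organism.
WILD_TYPE_ZS = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0, -1.0]
```

The `limit_pct` argument stands in for the alternative percentages (400%, 300%, 200%, 150% or 100%) listed in the paragraph above.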
[0142] In some embodiments is provided a xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-9 of SEQ ID NO: 266 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-9 of SEQ ID NO: 266 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-9 when expressed in the native organism.
[0143] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 274-279); GATATC (nucleotides 325-330); CTTTAT (nucleotides 682-687); GGGTTT (nucleotides 901-906); TTTGCC (nucleotides 904-909); GCCATT (nucleotides 1159-1164); GATATT (nucleotides 1180-1185); TTGAAA (nucleotides 1291-1296); GAAAGT (nucleotides 1402-1407). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 274-279) replaced with TTAAGT; GATATC (nucleotides 325-330) replaced with GACATT; CTTTAT (nucleotides 682-687) replaced with CTATAT; GGGTTT (nucleotides 901-906) replaced with GGTTTT; TTTGCC (nucleotides 904-909) replaced with TTTGCA; GCCATT (nucleotides 1159-1164) replaced with GCTATT; GATATT (nucleotides 1180-1185) replaced with GATATA; TTGAAA (nucleotides 1291-1296) replaced with TTAAAA; GAAAGT (nucleotides 1402-1407) replaced with GAATCT. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
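As an illustrative check only, the replacements listed above can be confirmed to encode identical amino acids by translating each codon pair with the standard genetic code, and each nucleotide range can be mapped to the pair of codons it spans. The helper names below are hypothetical, and the sketch assumes the standard genetic code applies to the host organism.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acid codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_silent(original_pair, replacement_pair):
    """True when both codon pairs encode the same two amino acids."""
    return translate(original_pair) == translate(replacement_pair)


def codon_numbers(first_nucleotide):
    """1-based codon numbers spanned by a pair starting at a 1-based nucleotide."""
    if (first_nucleotide - 1) % 3:
        raise ValueError("codon pair does not start on a codon boundary")
    first = (first_nucleotide - 1) // 3 + 1
    return first, first + 1


# Replacements recited above, keyed by the first nucleotide of each pair.
REPLACEMENTS = {
    274: ("CTTTCC", "TTAAGT"), 325: ("GATATC", "GACATT"),
    682: ("CTTTAT", "CTATAT"), 901: ("GGGTTT", "GGTTTT"),
    904: ("TTTGCC", "TTTGCA"), 1159: ("GCCATT", "GCTATT"),
    1180: ("GATATT", "GATATA"), 1291: ("TTGAAA", "TTAAAA"),
    1402: ("GAAAGT", "GAATCT"),
}

all_silent = all(is_silent(old, new) for old, new in REPLACEMENTS.values())
```

For example, nucleotides 274-279 span codons 92 and 93, and both CTTTCC and TTAAGT translate to leucine followed by serine.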
[0144] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TTCTGG (nucleotides 25-30); AGCCAG (nucleotides 43-48); GAAGAG (nucleotides 61-66); ACGCTG (nucleotides 67-72); CTGGAA (nucleotides 70-75); CTTTCC (nucleotides 274-279); ATTGCC (nucleotides 436-441); GAAGTG (nucleotides 460-465); GCCAGA (nucleotides 532-537); GCGGTA (nucleotides 562-567); GATCTC (nucleotides 634-639); GAAGTG (nucleotides 643-648); GTGATG (nucleotides 646-651); CAGGCG (nucleotides 763-768); GAAGTG (nucleotides 835-840); TTTGCC (nucleotides 904-909); CGGATG (nucleotides 943-948); GAAGTG (nucleotides 1048-1053); AAAGAG (nucleotides 1114-1119); TTCCGC (nucleotides 1195-1200). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TTCTGG (nucleotides 25-30) replaced with TTTTGG; AGCCAG (nucleotides 43-48) replaced with TCTCAG; GAAGAG (nucleotides 61-66) replaced with GAAGAA; ACGCTG (nucleotides 67-72) replaced with ACCCTC; CTGGAA (nucleotides 70-75) replaced with CTCGAA; CTTTCC (nucleotides 274-279) replaced with CTGAGC; ATTGCC (nucleotides 436-441) replaced with ATCGCG; GAAGTG (nucleotides 460-465) replaced with GAAGTT; GCCAGA (nucleotides 532-537) replaced with GCACGC; GCGGTA (nucleotides 562-567) replaced with GCGGTT; GATCTC (nucleotides 634-639) replaced with GATTTG; GAAGTG (nucleotides 643-648) replaced with GAAGTT; GTGATG (nucleotides 646-651) replaced with GTTATG; CAGGCG (nucleotides 763-768) replaced with CAGGCT; GAAGTG (nucleotides 835-840) replaced with GAAGTT; TTTGCC (nucleotides 904-909) replaced with TTCGCT; CGGATG (nucleotides 943-948) replaced with CGTATG; GAAGTG (nucleotides 1048-1053) replaced with GAGGTT; AAAGAG (nucleotides 1114-1119) replaced with AAGGAG; TTCCGC (nucleotides 1195-1200) replaced with TTTCGT. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0145] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 274-279); GATATC (nucleotides 325-330); ATCAAC (nucleotides 403-408); GACGAA (nucleotides 733-738); TCGTTT (nucleotides 829-834); AAACAG (nucleotides 853-858); GGGTTT (nucleotides 901-906); TTTGCC (nucleotides 904-909); GATATT (nucleotides 1180-1185); TTGAAA (nucleotides 1291-1296); AAACTG (nucleotides 1438-1443); CTGAAA (nucleotides 1441-1446); CTTCAA (nucleotides 1480-1485). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 274-279) replaced with TTATCT; GATATC (nucleotides 325-330) replaced with GACATT; ATCAAC (nucleotides 403-408) replaced with ATTAAT; GACGAA (nucleotides 733-738) replaced with GATGAA; TCGTTT (nucleotides 829-834) replaced with TCTTTT; AAACAG (nucleotides 853-858) replaced with AAACAA; GGGTTT (nucleotides 901-906) replaced with GGATTC; TTTGCC (nucleotides 904-909) replaced with TTCGCT; GATATT (nucleotides 1180-1185) replaced with GATATA; TTGAAA (nucleotides 1291-1296) replaced with TTAAAA; AAACTG (nucleotides 1438-1443) replaced with AAATTG; CTGAAA (nucleotides 1441-1446) replaced with TTGAAG; CTTCAA (nucleotides 1480-1485) replaced with TTGCAA. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
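Some of the codon pairs listed above overlap by one codon, for example nucleotides 901-906 and 904-909, so the two replacements must agree on the shared codon. The following is a minimal sketch, offered for illustration only, of collapsing codon-pair replacements into per-codon replacements and detecting any conflict; the function name and the input format are hypothetical.

```python
def merge_pair_replacements(replacements):
    """Collapse codon-pair replacements into per-codon replacements.

    replacements maps the 1-based first nucleotide of a codon pair to a tuple
    (original_pair, replacement_pair). Returns a dict mapping the 1-based
    first nucleotide of each codon to its replacement codon, and raises
    ValueError when two overlapping pairs disagree on a shared codon.
    """
    per_codon = {}
    for start, (_original, new) in sorted(replacements.items()):
        for offset in (0, 3):
            position = start + offset
            codon = new[offset:offset + 3]
            if per_codon.setdefault(position, codon) != codon:
                raise ValueError(
                    f"conflicting replacements at nucleotide {position}")
    return per_codon


# Overlapping pairs taken from the replacement list recited above.
OVERLAPPING_PAIRS = {
    901: ("GGGTTT", "GGATTC"),
    904: ("TTTGCC", "TTCGCT"),
    1438: ("AAACTG", "AAATTG"),
    1441: ("CTGAAA", "TTGAAG"),
}

merged = merge_pair_replacements(OVERLAPPING_PAIRS)
```

In this list the shared codon at nucleotides 904-906 becomes TTC in both replacements and the shared codon at nucleotides 1441-1443 becomes TTG in both, so the merge succeeds without conflict.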
[0146] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 274-279); GATATC (nucleotides 325-330); GTGAAA (nucleotides 463-468); GGGTTT (nucleotides 901-906); TTTGCC (nucleotides 904-909); GCCATT (nucleotides 1159-1164); TTGAAA (nucleotides 1291-1296); AAATGG (nucleotides 1456-1461). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 274-279) replaced with TTGTCC; GATATC (nucleotides 325-330) replaced with GACATT; GTGAAA (nucleotides 463-468) replaced with GTTAAA; GGGTTT (nucleotides 901-906) replaced with GGTTTC; TTTGCC (nucleotides 904-909) replaced with TTCGCA; GCCATT (nucleotides 1159-1164) replaced with GCTATT; TTGAAA (nucleotides 1291-1296) replaced with TTAAAG; AAATGG (nucleotides 1456-1461) replaced with AAGTGG. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0147] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTGTTA (nucleotides 184-189); ACATGG (nucleotides 229-234); GAAGGC (nucleotides 268-273); AACAGC (nucleotides 361-366); GCGGCT (nucleotides 496-501); GTAACG (nucleotides 565-570); ATCGGG (nucleotides 628-633); CTTTAT (nucleotides 682-687); GCTTTT (nucleotides 790-795); GCCGGT (nucleotides 907-912); GCTTTG (nucleotides 1066-1071); AAAGAC (nucleotides 1237-1242); GCATGG (nucleotides 1309-1314); CTTGAT (nucleotides 1375-1380); CTTTAC (nucleotides 1471-1476). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTGTTA (nucleotides 184-189) replaced with TTGTTG; ACATGG (nucleotides 229-234) replaced with ACCTGG; GAAGGC (nucleotides 268-273) replaced with GAGGGC; AACAGC (nucleotides 361-366) replaced with AACTCT; GCGGCT (nucleotides 496-501) replaced with GCCGCA; GTAACG (nucleotides 565-570) replaced with GTTACC; ATCGGG (nucleotides 628-633) replaced with ATTGGT; CTTTAT (nucleotides 682-687) replaced with TTGTAT; GCTTTT (nucleotides 790-795) replaced with GCATTC; GCCGGT (nucleotides 907-912) replaced with GCTGGT; GCTTTG (nucleotides 1066-1071) replaced with GCCTTA; AAAGAC (nucleotides 1237-1242) replaced with AAAGAT; GCATGG (nucleotides 1309-1314) replaced with GCTTGG; CTTGAT (nucleotides 1375-1380) replaced with TTGGAT; CTTTAC (nucleotides 1471-1476) replaced with TTATAT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0148] Also provided herein is an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0149] Also provided herein is an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis; and Schizosaccharomyces pombe.
[0150] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0151] In some embodiments, provided herein is a system for metabolizing arabinose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein transcriptional kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects, the L-arabinose isomerase retains at least 75% of the enzymatic activity of wild-type AraA (SEQ ID NO: 290) under normal physiological conditions.
[0152] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 7-487 of SEQ ID NO: 290 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 7-487 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 7-487 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 7-487 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the wild type codon pair GGCGGA when expressed in the native organism.
[0153] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-8 of SEQ ID NO: 290 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-8 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the wild type codon pair AAGGAT when expressed in the native organism.
[0154] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: CTTTCC (nucleotides 274-279); CAGTTT (nucleotides 313-318); AATATT (nucleotides 361-366); ATCAAA (nucleotides 523-528); CTTTAT (nucleotides 703-708); GTGGAA (nucleotides 1204-1209). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: CTTTCC (nucleotides 274-279) replaced with TTGTCT; CAGTTT (nucleotides 313-318) replaced with CAATTT; AATATT (nucleotides 361-366) replaced with AACATT; ATCAAA (nucleotides 523-528) replaced with ATTAAG; CTTTAT (nucleotides 703-708) replaced with TTGTAT; GTGGAA (nucleotides 1204-1209) replaced with GTTGAA. In certain aspects, the nucleotide sequence is optimized for expression in S. cerevisiae.
[0155] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: AGCCAG (nucleotides 43-48); GAAGAG (nucleotides 61-66); GCGGTA (nucleotides 67-72); GAAGAG (nucleotides 82-87); TCGCTG (nucleotides 163-168); GAAGAG (nucleotides 190-195); GAAGAG (nucleotides 208-213); CTTTCC (nucleotides 274-279); ATCGCC (nucleotides 436-441); GCCGGA (nucleotides 439-444); GCGGTA (nucleotides 562-567); GATCTC (nucleotides 634-639); GCGGCA (nucleotides 727-732); CAGGCG (nucleotides 751-756); ATCCTC (nucleotides 1015-1020); CTCGGC (nucleotides 1018-1023); GAAGTG (nucleotides 1036-1041); ATTGCC (nucleotides 1051-1056). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: AGCCAG (nucleotides 43-48) replaced with TCTCAG; GAAGAG (nucleotides 61-66) replaced with GAAGAA; GCGGTA (nucleotides 67-72) replaced with GCTGTT; GAAGAG (nucleotides 82-87) replaced with GAAGAA; TCGCTG (nucleotides 163-168) replaced with TCTCTG; GAAGAG (nucleotides 190-195) replaced with GAAGAA; GAAGAG (nucleotides 208-213) replaced with GAAGAA; CTTTCC (nucleotides 274-279) replaced with CTGTCT; ATCGCC (nucleotides 436-441) replaced with ATCGCT; GCCGGA (nucleotides 439-444) replaced with GCTGGT; GCGGTA (nucleotides 562-567) replaced with GCGGTT; GATCTC (nucleotides 634-639) replaced with GACTTG; GCGGCA (nucleotides 727-732) replaced with GCTGCT; CAGGCG (nucleotides 751-756) replaced with CAGGCT; ATCCTC (nucleotides 1015-1020) replaced with ATCCTG; CTCGGC (nucleotides 1018-1023) replaced with CTGGGT; GAAGTG (nucleotides 1036-1041) replaced with GAAGTT; ATTGCC (nucleotides 1051-1056) replaced with ATCGCG. In certain aspects, the nucleotide sequence is optimized for expression in E. coli.
[0156] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: TCCAAA (nucleotides 91-96); AAACTG (nucleotides 181-186); GACGAA (nucleotides 205-210); GCCAAA (nucleotides 253-258); CTTTCC (nucleotides 274-279); CAGTTT (nucleotides 313-318); AATATT (nucleotides 361-366); ATCAAA (nucleotides 523-528); GTCAAG (nucleotides 742-747); TTTGAC (nucleotides 1126-1131); AAGTTT (nucleotides 1474-1479). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: TCCAAA (nucleotides 91-96) replaced with TCTAAA; AAACTG (nucleotides 181-186) replaced with AAATTG; GACGAA (nucleotides 205-210) replaced with GATGAA; GCCAAA (nucleotides 253-258) replaced with GCTAAA; CTTTCC (nucleotides 274-279) replaced with TTGTCT; CAGTTT (nucleotides 313-318) replaced with CAATTT; AATATT (nucleotides 361-366) replaced with AACATT; ATCAAA (nucleotides 523-528) replaced with ATTAAA; GTCAAG (nucleotides 742-747) replaced with GTTAAA; TTTGAC (nucleotides 1126-1131) replaced with TTTGAT; AAGTTT (nucleotides 1474-1479) replaced with AAATTT. In certain aspects, the nucleotide sequence is optimized for expression in P. pastoris.
[0157] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GTGTTT (nucleotides 22-27); CTTTCC (nucleotides 274-279); CAGTTT (nucleotides 313-318); AAATGG (nucleotides 481-486); ATCAAA (nucleotides 523-528); GTGTTT (nucleotides 1123-1128); AAATGG (nucleotides 1444-1449). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GTGTTT (nucleotides 22-27) replaced with GTTTTC; CTTTCC (nucleotides 274-279) replaced with TTGTCT; CAGTTT (nucleotides 313-318) replaced with CAATTC; AAATGG (nucleotides 481-486) replaced with AAGTGG; ATCAAA (nucleotides 523-528) replaced with ATTAAA; GTGTTT (nucleotides 1123-1128) replaced with GTTTTC; AAATGG (nucleotides 1444-1449) replaced with AAGTGG. In certain aspects, the nucleotide sequence is optimized for expression in K. lactis.
[0158] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the at least 3 codon pairs to be replaced are selected from the following: GTCAGA (nucleotides 175-180); GCCGGA (nucleotides 439-444); CAGCTT (nucleotides 598-603); ATCAAT (nucleotides 649-654); CTTTAT (nucleotides 703-708); GAAGGC (nucleotides 718-723); GCAAGG (nucleotides 730-735); GCCTTT (nucleotides 805-810); CAGCTT (nucleotides 844-849); GAAGGC (nucleotides 880-885); ATCAAT (nucleotides 1195-1200); TCGGCT (nucleotides 1288-1293); CTCGAT (nucleotides 1363-1368); ATCAAT (nucleotides 1402-1407). In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some aspects of the above embodiments, at least 3 of the following codon pair replacements have been made: GTCAGA (nucleotides 175-180) replaced with GTTCGT; GCCGGA (nucleotides 439-444) replaced with GCTGGT; CAGCTT (nucleotides 598-603) replaced with CAGTTG; ATCAAT (nucleotides 649-654) replaced with ATTAAT; CTTTAT (nucleotides 703-708) replaced with TTGTAT; GAAGGC (nucleotides 718-723) replaced with GAGGGC; GCAAGG (nucleotides 730-735) replaced with GCTCGT; GCCTTT (nucleotides 805-810) replaced with GCTTTC; CAGCTT (nucleotides 844-849) replaced with CAGTTG; GAAGGC (nucleotides 880-885) replaced with GAGGGA; ATCAAT (nucleotides 1195-1200) replaced with ATTAAT; TCGGCT (nucleotides 1288-1293) replaced with TCTGCT; CTCGAT (nucleotides 1363-1368) replaced with TTGGAC; ATCAAT (nucleotides 1402-1407) replaced with ATTAAT. In certain aspects, the nucleotide sequence is optimized for expression in Z. mobilis.
[0159] Also provided herein is an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, and wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein, wherein a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than 5, or 3, or 2.5, or 2 times the standard deviation of translational kinetics values for the host organism. In certain embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0160] Also provided herein is an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster; Kluyveromyces lactis; Zymomonas mobilis and Schizosaccharomyces pombe.
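The threshold definition of a highly-overrepresented codon pair given above can be illustrated as follows. This is a minimal sketch under stated assumptions: the translational kinetics values are taken as already computed for the host organism and supplied as a dictionary, the standard deviation is the population standard deviation over those values, and the function name `highly_overrepresented` and the example values are hypothetical.

```python
from statistics import pstdev


def highly_overrepresented(values, k=3.0):
    """Return codon pairs whose value exceeds k times the standard deviation.

    values: dict mapping a six-nucleotide codon pair to its translational
            kinetics value in the host organism (assumed precomputed).
    k:      multiplier; the text names 5, 3, 2.5 and 2 as alternatives.
    """
    sd = pstdev(list(values.values()))
    return {pair for pair, value in values.items() if value > k * sd}


# Hypothetical values, for illustration only.
example_values = {"AAAAAA": 0.0, "CTGGTG": 10.0, "GAAGTG": -1.0, "ATCAAT": 1.0}
flagged_at_2 = highly_overrepresented(example_values, k=2.0)
flagged_at_3 = highly_overrepresented(example_values, k=3.0)
```

A lower multiplier flags more codon pairs as candidates for replacement; a higher multiplier restricts replacement to the most extreme pairs.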
[0161] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes the nucleotide sequence of the embodiments provided herein, operably linked to an expression control sequence.
[0162] In some embodiments, provided herein is a system for metabolizing arabinose, comprising one or more host organisms that collectively include nucleotide sequences operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the nucleotide sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three replaced codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some aspects, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of the enzyme. In some aspects the L-arabinose isomerase retains at least 75% of the enzymatic activity of wild-type AraA (SEQ ID NO: 302) under normal physiological conditions.
[0163] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 9-483 of SEQ ID NO: 302 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 9-483 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 9-483 when expressed in the native organism. In certain aspects, no replacement codon encoding amino acids 9-483 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 400%, or 300%, or 200%, or 150% or 100% of the z score of the wild type codon pair CTGGTG when expressed in the native organism.
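The percentage ceilings in the preceding paragraph can be sketched as a simple comparison. The sketch below assumes that z scores for the wild type codon pairs in the native organism and for the replacement codon pairs in the heterologous host are already available as lists of numbers, and that the reference value is positive (a percentage of a negative z score would need a separate convention). The names `reference_z` and `within_ceiling` are hypothetical.

```python
from statistics import mean, median


def reference_z(wild_type_z, use_median=False):
    """Mean (or median) of the five highest wild type z scores."""
    top_five = sorted(wild_type_z, reverse=True)[:5]
    return median(top_five) if use_median else mean(top_five)


def within_ceiling(replacement_z, wild_type_z, percent, use_median=False):
    """True if no replacement z score exceeds percent% of the reference value.

    replacement_z: z scores of replacement codon pairs in the heterologous host.
    wild_type_z:   z scores of wild type codon pairs in the native organism.
    percent:       ceiling, e.g. 400, 300, 200, 150 or 100.
    """
    ceiling = reference_z(wild_type_z, use_median) * percent / 100.0
    return all(z <= ceiling for z in replacement_z)


# Hypothetical z scores, for illustration only.
wild_type = [6.0, 5.0, 4.0, 3.0, 2.0, -1.0]
ok_at_100 = within_ceiling([3.9, 1.0], wild_type, 100)
ok_at_150 = within_ceiling([6.5], wild_type, 150)
ok_at_200 = within_ceiling([6.5], wild_type, 200)
```

The single-pair variants (for example, comparison against the wild type pair CTGGTG) follow the same pattern with the reference value replaced by that one z score.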
[0164] In some embodiments is provided an L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-8 of SEQ ID NO: 302 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. In certain aspects, the z score of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the z score for the wild type codon pair when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the mean or median of the five highest z scores of the wild type codon pairs encoding amino acids 1-8 when expressed in the native organism. In certain aspects, at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 200%, or 100%, or 75%, or 50% or 40% of the z score of the wild type codon pair GAAGTG when expressed in the native organism.
[0165] Also provided herein are isolated polynucleotides comprising any of the nucleotide sequences provided herein. Also provided herein are isolated polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 267, 271, 273, 275, 277, 279, 281, 283, 285, 287, 291, 295, 297, 299, 303, 305, 307, 309 or 311. Also provided herein are isolated polypeptides encoded by any of the nucleotide sequences provided herein, provided that the amino acid sequence of said polypeptide is not SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302.
[0166] Also provided herein are expression systems, comprising: an expression vector in a host organism, wherein the expression vector includes any of the polynucleotides provided herein operably linked to an expression control sequence. Also provided herein are expression systems, comprising: an expression vector in a host organism, wherein the expression vector includes two or more polynucleotides provided herein, each polynucleotide being operably linked to the same or different expression control sequences. Also provided herein are expression systems for metabolizing xylose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
Also provided herein are expression systems for metabolizing xylose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: xylose isomerase and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some such systems, one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 267, 271, 273, 275, 277, 279, 281, 283, 285 or 287. Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 267, 271, 273, 275, 277, 279, 281, 283, 285 or 287. In some such systems, one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189 or 191. Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189 or 191.
In some such systems, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some such systems, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of said enzyme. In some such systems, each encoded enzyme retains at least 75% of the enzymatic activity of wild-type polypeptide (SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302) under normal physiological conditions.
[0167] Also provided herein are expression systems for metabolizing arabinose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: L-arabinitol 4-dehydrogenase, L-xylulose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. Also provided herein are expression systems for metabolizing arabinose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs. In some such systems, one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167. In some such systems, one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 291, 295, 297, 299, 303, 305, 307, 309 or 311. Some such systems comprise two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 291, 295, 297, 299, 303, 305, 307, 309 or 311. In some such systems, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe. In some such systems, each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of said enzyme. In some such systems, each encoded enzyme retains at least 75% of the enzymatic activity of wild-type polypeptide (SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302) under normal physiological conditions.
[0168] Also provided herein are cells comprising any of the polynucleotides provided herein. In some such cells, the cell expresses the polypeptide encoded by said polynucleotide.
[0169] Also provided herein are methods of introducing a polynucleotide into a host cell comprising: providing a host cell; and contacting said host cell with any of the polynucleotides provided herein under conditions that permit the polynucleotide to be introduced into the host cell.
[0170] Also provided herein are methods of expressing a polypeptide comprising: providing a cell comprising any of the polynucleotides provided herein; and placing the cell under conditions that permit the cell to express the polypeptide encoded by the DNA sequence, whereby said encoded polypeptide is expressed by said cell.
[0171] Also provided herein are methods of metabolizing a sugar comprising: providing a sugar comprising at least one covalent bond; providing a polypeptide encoded by any of the polynucleotides provided herein; and contacting said sugar with said polypeptide under conditions that permit said polypeptide to break or form at least one covalent bond of said sugar, whereby at least one covalent bond of said sugar is broken or formed.
[0172] Also provided herein are integrable polynucleotides for modifying an endogenous nucleotide sequence in a cell comprising: a removable selectable marker cassette comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site, wherein said removable selectable marker cassette is flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence. Some such integrable polynucleotides further comprise a heterologous nucleic acid flanked by said 5' nucleic acid sequence with homology to an endogenous sequence and said 3' nucleic acid sequence with homology to an endogenous sequence. In some such integrable polynucleotides, the heterologous nucleic acid comprises a sequence encoding a polypeptide. In some such integrable polynucleotides, the heterologous nucleic acid comprises a regulatory sequence. In some such integrable polynucleotides, the sequence encoding a polypeptide is operatively linked to said regulatory sequence. In some such integrable polynucleotides, the regulatory sequence comprises a promoter sequence and a terminator sequence. In some such integrable polynucleotides, the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein. In some embodiments, the heterologous nucleic acid encodes a polypeptide that catalyzes a reaction in a sugar degradation pathway.
In some such integrable polynucleotides, the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 267, 271, 273, 275, 277, 279, 281, 283, 285, 287, 291, 295, 297, 299, 303, 305, 307, 309 or 311. In some such integrable polynucleotides, the selectable marker can be selected for or can be selected against. In some such integrable polynucleotides, the selectable marker can be selected for and can be selected against. In some such integrable polynucleotides, the selectable marker is selected from the group consisting of URA3, TRP1, CAN1, KlURA3, CYH2, LYS2 and MET15. In some such integrable polynucleotides, the nucleic acid sequence with homology to an endogenous sequence comprises a genomic repetitive element. In some such integrable polynucleotides, the nucleic acid sequence with homology to an endogenous sequence comprises Ty1 DNA or Ty3 DNA. In some such integrable polynucleotides, the site-specific recombinase recognition site comprises a loxP sequence. In some such integrable polynucleotides, the site-specific recombinase recognition site comprises a frt sequence. In some such integrable polynucleotides, the integrable polynucleotide comprises a PCR product.
[0173] Also provided herein are cells comprising any of the integrable polynucleotides provided herein. Some such cells comprise a gene encoding a site-specific recombinase. In some such cells, the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. Some such cells are S. cerevisiae cells.
[0174] Also provided herein are methods of modifying an endogenous sequence in a cell comprising: providing a cell with at least one of the integrable polynucleotides provided; and selecting for a cell comprising said at least one integrable polynucleotide integrated into the genome of the cell. Some such methods further comprise excising at least one selectable marker from said at least one cell comprising said at least one integrable polynucleotide integrated therein; and selecting for a cell in which said at least one selectable marker has been excised. In some such methods, excising said selectable marker comprises providing said cell with a site-specific recombinase. In some such methods, the site-specific recombinase comprises a CRE recombinase or a FLP recombinase. In some such methods, the site-specific recombinase is expressed from an endogenous gene or from a heterologous nucleic acid. In some such methods, providing a cell with at least one integrable polynucleotide comprises providing a cell with a plurality of integrable polynucleotides, wherein said plurality of integrable polynucleotides comprises at least a first integrable polynucleotide comprising a first selectable marker and a second integrable polynucleotide comprising a second selectable marker. In some such methods, the plurality comprises 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, or 10 or more different integrable polynucleotides. Also provided are cells comprising an endogenous sequence modified by any of such methods provided herein. In some such cells, the modified endogenous sequence comprises an insertion, a deletion or a mutation.
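The layout of an integrable polynucleotide and the marker excision step described above can be sketched as a small data model. This is an illustrative assumption about part order, not a construct taken from the specification: the class `Part` and the functions `build_integrable` and `excise_marker` are hypothetical, parts are represented by labels rather than sequences, and the two recognition sites are assumed to be in the same orientation so that recombination removes the marker and leaves a single site behind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    kind: str  # one of "homology", "payload", "site", "marker"


def build_integrable(marker, payload, site="loxP"):
    """Assemble: 5' homology, heterologous payload, site-marker-site, 3' homology."""
    return [
        Part("5' homology", "homology"),
        Part(payload, "payload"),
        Part(site, "site"),
        Part(marker, "marker"),
        Part(site, "site"),
        Part("3' homology", "homology"),
    ]


def excise_marker(parts):
    """Model recombinase excision: drop everything after the first site up to
    and including the second site, leaving one recognition site in place."""
    site_positions = [i for i, part in enumerate(parts) if part.kind == "site"]
    if len(site_positions) != 2:
        raise ValueError("expected exactly two recombinase recognition sites")
    first, second = site_positions
    return parts[:first + 1] + parts[second + 1:]


# Hypothetical example: URA3 marker flanked by loxP sites, next to a payload.
integrated = build_integrable(marker="URA3", payload="heterologous nucleic acid")
after_excision = excise_marker(integrated)
```

In this model, the integrated product after excision retains the heterologous nucleic acid juxtaposed to a single recognition site, and the freed marker can be reused for a further round of integration.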
[0175] Also provided are cells comprising a removable selectable marker cassette integrated into said cell comprising a selectable marker flanked by a 5' site-specific recombinase recognition site and a 3' site-specific recombinase recognition site; and a heterologous nucleic acid integrated into said cell, wherein said removable selectable marker is juxtaposed to said heterologous nucleic acid. Also provided are cells comprising: a heterologous nucleic acid integrated into said cell, and a site-specific recombinase recognition site integrated into said cell, wherein said site-specific recombinase recognition site is juxtaposed to said heterologous nucleic acid. In some such cells, the site-specific recombinase recognition site comprises a loxP or frt sequence. In some such cells, the cell is a S. cerevisiae cell. In some such cells, the heterologous nucleic acid comprises a polynucleotide in accordance with any of the polynucleotides provided herein. In some such cells, the heterologous nucleic acid encodes a polypeptide that catalyzes a reaction in a sugar degradation pathway. In some such cells, the heterologous nucleic acid comprises SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 267, 271, 273, 275, 277, 279, 281, 283, 285, 287, 291, 295, 297, 299, 303, 305, 307, 309 or 311.
BRIEF DESCRIPTION OF THE DRAWINGS
[0176] Figure 1 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in P. stipitis of nucleic acid sequences encoding the xylose reductase enzyme of P. stipitis (Xyr), plotted as a function of codon pair position.
[0177] Figures 2-6 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 2-6 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding Xyr, plotted as a function of codon pair position.
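The displays described in these figures plot a z score against codon pair position. A minimal sketch of how such a profile could be tabulated is given below; it assumes that codon pairs are the overlapping pairs of adjacent codons (so a sequence of n codons has n-1 pairs, numbered from 1) and that z scores for the host organism are supplied in a lookup table. The names `codon_pairs` and `z_profile`, the default value for pairs missing from the table, and the example values are hypothetical.

```python
def codon_pairs(dna):
    """Return (position, pair) for each overlapping pair of adjacent codons."""
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [(pos, codons[pos - 1] + codons[pos]) for pos in range(1, len(codons))]


def z_profile(dna, z_table, default=0.0):
    """Return (position, z score) points suitable for plotting.

    z_table: dict mapping a six-nucleotide codon pair to its z score in the
             host organism (assumed precomputed); missing pairs get `default`.
    """
    return [(pos, z_table.get(pair, default)) for pos, pair in codon_pairs(dna)]


# Hypothetical sequence and z score, for illustration only.
example_pairs = codon_pairs("ATGGTCAGATAA")
example_profile = z_profile("ATGGTCAGATAA", {"GTCAGA": 4.2})
```

Comparing the profile of a native sequence with that of a modified sequence in the same host would show whether peaks above a chosen threshold have been removed.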
[0178] Figure 2A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the Xyr protein. Figure 2B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0179] Figure 3A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the Xyr protein. Figure 3B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0180] Figure 4A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the Xyr protein. Figure 4B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0181] Figure 5A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the Xyr protein. Figure 5B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0182] Figure 6A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the Xyr protein. Figure 6B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0183] Figure 7 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in C. parapsilosis of nucleic acid sequences encoding the xylose reductase enzyme of C. parapsilosis (Xyl1), plotted as a function of codon pair position.
[0184] Figures 8-12 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 8-12 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding Xyl1, plotted as a function of codon pair position.
[0185] Figure 8A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the Xyl1 protein. Figure 8B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0186] Figure 9A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the Xyl1 protein. Figure 9B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0187] Figure 10A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the Xyl1 protein. Figure 10B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0188] Figure 11A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the Xyl1 protein. Figure 11B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0189] Figure 12A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the Xyl1 protein. Figure 12B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the Xyl1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0190] Figure 13 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in P. stipitis of nucleic acid sequences encoding the xylitol dehydrogenase enzyme of P. stipitis (Xdh), plotted as a function of codon pair position.
[0191] Figures 14-18 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 14-18 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding Xdh, plotted as a function of codon pair position.
[0192] Figure 14A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the Xdh protein. Figure 14B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0193] Figure 15A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the Xdh protein. Figure 15B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0194] Figure 16A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the Xdh protein. Figure 16B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0195] Figure 17A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the Xdh protein. Figure 17B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0196] Figure 18A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the Xdh protein. Figure 18B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the Xdh which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0197] Figure 19 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in P. stipitis of nucleic acid sequences encoding the D-xylulokinase enzyme of P. stipitis (XKI), plotted as a function of codon pair position.
[0198] Figures 20-40 depict effects of Translational Engineering™ on protein expression levels. Each of Figures 20-40 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding XKI, plotted as a function of codon pair position.
[0199] Figure 20A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the XKI protein. Figure 20B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0200] Figure 21A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the XKI protein. Figure 21B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0201] Figure 22A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the XKI protein. Figure 22B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0202] Figure 23A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the XKI protein. Figure 23B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0203] Figure 24A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XKI protein. Figure 24B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XKI which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
|0204] Figure 25 depicts a graphical display of z scores of translational kinetics values for codon pair utililization in T. reesei of nucleic acid sequences encoding the L-arabinitol 4-dehydrogenase enzyme of T. reesei (LADl ), plotted as a function of codon pair position.
|0205| Figures 26-30 depicts effects of Translational eEngineering™ on protein expression levels. Each of Figures 26-30 depict graphical displays of z scores of translational kinetics values for codon pair utililization of nucleic acid sequences encoding LADl , plotted as a function of codon pair position.
[0206] Figure 26A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LAD1 protein. Figure 26B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0207] Figure 27A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LAD1 protein. Figure 27B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0208] Figure 28A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LAD1 protein. Figure 28B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0209] Figure 29A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LAD1 protein. Figure 29B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0210] Figure 30A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LAD1 protein. Figure 30B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LAD1 which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0211] Figure 31 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in A. monospora of nucleic acid sequences encoding the L-xylulose reductase enzyme of A. monospora (LXR), plotted as a function of codon pair position.
[0212] Figures 32-36 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 32-36 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding LXR, plotted as a function of codon pair position.
[0213] Figure 32A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LXR protein. Figure 32B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0214] Figure 33A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LXR protein. Figure 33B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0215] Figure 34A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LXR protein. Figure 34B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0216] Figure 35A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LXR protein. Figure 35B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0217] Figure 36A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LXR protein. Figure 36B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0218] Figure 37 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in T. reesei of nucleic acid sequences encoding the L-xylulose reductase enzyme of T. reesei (LXR), plotted as a function of codon pair position.
[0219] Figures 38-42 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 38-42 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding LXR, plotted as a function of codon pair position.
[0220] Figure 38A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the LXR protein. Figure 38B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0221] Figure 39A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the LXR protein. Figure 39B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0222] Figure 40A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the LXR protein. Figure 40B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0223] Figure 41A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the LXR protein. Figure 41B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0224] Figure 42A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the LXR protein. Figure 42B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the LXR which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0225] Figure 43 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the xylose isomerase enzyme of E. coli (XylA), plotted as a function of codon pair position.
[0226] Figures 44-48 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 44-48 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding XylA, plotted as a function of codon pair position.
[0227] Figure 44A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the XylA protein. Figure 44B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0228] Figure 45A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the XylA protein. Figure 45B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0229] Figure 46A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the XylA protein. Figure 46B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0230] Figure 47A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the XylA protein. Figure 47B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0231] Figure 48A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XylA protein. Figure 48B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XylA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0232] Figure 49 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-arabinose isomerase enzyme of E. coli (AraA), plotted as a function of codon pair position.
[0233] Figures 50-54 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 50-54 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraA, plotted as a function of codon pair position.
[0234] Figure 50A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraA protein. Figure 50B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0235] Figure 51A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraA protein. Figure 51B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0236] Figure 52A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraA protein. Figure 52B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0237] Figure 53A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraA protein. Figure 53B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0238] Figure 54A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraA protein. Figure 54B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0239] Figure 55 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-ribulokinase enzyme of E. coli (AraB), plotted as a function of codon pair position.
[0240] Figures 56-60 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 56-60 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraB, plotted as a function of codon pair position.
[0241] Figure 56A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraB protein. Figure 56B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0242] Figure 57A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraB protein. Figure 57B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0243] Figure 58A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraB protein. Figure 58B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0244] Figure 59A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraB protein. Figure 59B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0245] Figure 60A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraB protein. Figure 60B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraB which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0246] Figure 61 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-ribulose-5-P 4-epimerase enzyme of E. coli (AraD), plotted as a function of codon pair position.
[0247] Figures 62-66 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 62-66 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraD, plotted as a function of codon pair position.
[0248] Figure 62A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraD protein. Figure 62B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0249] Figure 63A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraD protein. Figure 63B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0250] Figure 64A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraD protein. Figure 64B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0251] Figure 65A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraD protein. Figure 65B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0252] Figure 66A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraD protein. Figure 66B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraD which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0253] Figures 67-71 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 67-71 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding the xylose reductase enzyme of C. tenuis (Xyr), plotted as a function of codon pair position.
[0254] Figure 67A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the Xyr protein. Figure 67B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0255] Figure 68A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the Xyr protein. Figure 68B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0256] Figure 69A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the Xyr protein. Figure 69B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0257] Figure 70A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the Xyr protein. Figure 70B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0258] Figure 71A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the Xyr protein. Figure 71B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the Xyr which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0259] Figure 72 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-arabinose isomerase enzyme of E. coli (AraA), plotted as a function of codon pair position.
[0260] Figures 73-77 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 73-77 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraA, plotted as a function of codon pair position.
[0261] Figure 73A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraA protein. Figure 73B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0262] Figure 74A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraA protein. Figure 74B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0263] Figure 75A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraA protein. Figure 75B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0264] Figure 76A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraA protein. Figure 76B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0265] Figure 77A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraA protein. Figure 77B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0266] Figure 78 depicts a graphical display of z scores of translational kinetics values for codon pair utilization in E. coli of nucleic acid sequences encoding the L-arabinose isomerase enzyme of E. coli (AraA), plotted as a function of codon pair position.
[0267] Figures 79-83 depict effects of Translation Engineering™ on protein expression levels. Each of Figures 79-83 depicts graphical displays of z scores of translational kinetics values for codon pair utilization of nucleic acid sequences encoding AraA, plotted as a function of codon pair position.
[0268] Figure 79A depicts a graphical display of the S. cerevisiae expression of the native nucleic acid sequence encoding the AraA protein. Figure 79B depicts a graphical display of the S. cerevisiae expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae.
[0269] Figure 80A depicts a graphical display of the E. coli expression of the native nucleic acid sequence encoding the AraA protein. Figure 80B depicts a graphical display of the E. coli expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in E. coli.
[0270] Figure 81A depicts a graphical display of the P. pastoris expression of the native nucleic acid sequence encoding the AraA protein. Figure 81B depicts a graphical display of the P. pastoris expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in P. pastoris.
[0271] Figure 82A depicts a graphical display of the K. lactis expression of the native nucleic acid sequence encoding the AraA protein. Figure 82B depicts a graphical display of the K. lactis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in K. lactis.
[0272] Figure 83A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the AraA protein. Figure 83B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the AraA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0273] Figure 84A depicts a graphical display of the Z. mobilis expression of the native nucleic acid sequence encoding the XynA protein. Figure 84B depicts a graphical display of the Z. mobilis expression of a nucleic acid sequence encoding the XynA which has been modified to eliminate codon pairs that are predicted to cause a translational pause in Z. mobilis.
[0274] Figure 85 depicts a Western blot analysis of expression in S. cerevisiae of the AraBAD enzymes. As shown in the figure, AraB and AraD are expressed and soluble. AraA is also well expressed (as seen in a denaturing purification, not shown). F denotes flowthrough and E denotes eluate of the His-tagged proteins on a Ni++ NTA column (Qiagen).
[0275] Figure 86 depicts a Western blot analysis showing expression in S. cerevisiae of P. stipitis xylose reductase (XYR). The native gene is compared to the HotRod gene, which was modified to eliminate codon pairs that are predicted to cause a translational pause in S. cerevisiae. Time points are indicated as minutes after induction with galactose.
[0276] Figure 87 depicts a Western blot analysis of expression of the HotRod version of the XKI enzyme in S. cerevisiae. The gene was expressed from the PGAL promoter in the pYES2 vector (Invitrogen), and the enzyme was purified under either denaturing or native conditions using the 6-His tag located at the N-terminus of the enzyme. These results show that this enzyme is soluble when expressed in yeast.
DETAILED DESCRIPTION
[0277] Biomass is the earth's most attractive alternative fuel source and its most sustainable energy resource, and it is regenerated through the bioconversion of carbon dioxide. Ethanol produced from biomass, blended with gasoline, is today the most widely used biofuel. Because the carbon dioxide released by combustion is recycled into biomass, the use of biofuels can significantly reduce the accumulation of greenhouse gases. Ethanol is just one example of a product obtained from biomass using industrial enzymes. The technologies associated with biomass harvesting are similarly applicable to the production of other biofuels and fine chemicals, as well as to other diverse applications.
[0278] Lignocellulosic biomass is composed predominantly of cellulose, hemicellulose, and lignin and is naturally resistant to chemical and biological conversion. An economical biomass-to-ethanol process critically depends on the rapid and efficient conversion of all of the sugars present in both its cellulose and hemicellulose fractions. While many microorganisms can ferment the glucose component in cellulose to ethanol, efficient conversion of the pentose sugars in the hemicellulose fraction, particularly xylose and arabinose, has been hindered by the lack of a suitable biocatalyst. Xylose is the predominant pentose sugar derived from hemicellulose, but arabinose can constitute a significant amount of the pentose sugars derived from various agricultural residues and other herbaceous crops, such as switchgrass.
[0279] Xylose metabolism. Xylose is metabolized in the pentose phosphate pathway (PPP), which it enters through D-xylulose and in which it is converted by transketolase (TLK), generating D-fructose-6-phosphate and D-glyceraldehyde-3-phosphate (GAP), which can be converted in a redox-neutral way to equimolar amounts of CO2 and ethanol. In yeast, filamentous fungi and other eukaryotes, this process proceeds via a two-step reduction and oxidation: first, D-xylose is reduced to xylitol by a xylose reductase (XR; e.g., Xyr, XYL1, Xyl1p), and then xylitol is oxidized to D-xylulose by a xylitol dehydrogenase (XDH; e.g., XYL2, Xyl2p). Finally, D-xylulose is converted to xylulose-5P by D-xylulokinase (XK).
[0280] The rate of the two-step reduction/oxidation reactions that generate D-xylulose, and hence feed the PPP and eventually generate ethanol, is governed by the cofactor requirements of the first two reactions, which affect cellular demands for oxygen. For example, the first enzyme, XR from Pichia stipitis, has a preference for NADPH (Km = 11 μM) (Jeppsson et al. (2006) Biotechnol. Bioeng. 93:665-673) but can also use NADH (Verduyn et al. (1985) Biochem. J. 226:669-677). Conversely, XDH from Pichia stipitis is strictly NAD+-dependent.
[0281] In bacteria, the conversion is more direct, utilizing a xylose isomerase (XylA) without the requirement of a reducing/oxidizing cofactor. Researchers have tried to express xylose isomerase in Saccharomyces cerevisiae in order to create an improved xylose fermenter. Ueng et al. ((1985) Biotechnol. Lett. 7:153-158) cloned the gene for xylose isomerase from E. coli, and Chan et al. ((1986) Biotechnol. Lett. 8:231-234) expressed it in Schizosaccharomyces pombe. These are the only researchers to have reported success with this approach. Amore et al. ((1989) Nucleic Acids Res. 17:7515) expressed the genes from Bacillus and Actinoplanes in S. cerevisiae, but the resulting enzymes were not catalytically active. Sarthy et al. ((1987) Appl. Environ. Microbiol. 53:1996-2000) expressed E. coli xylose isomerase in S. cerevisiae but found that the protein had only about 10^-3 as much activity as the native protein from E. coli.
[0282] Arabinose metabolism. In yeast, filamentous fungi and other eukaryotes, the L-arabinose pathway consists of five enzymes: aldose reductase (ARD), L-arabinitol 4-dehydrogenase (LAD), L-xylulose reductase (LXR), xylitol dehydrogenase (XDH), and xylulokinase (XKI), converting L-arabinose to L-arabitol, L-xylulose, xylitol, D-xylulose, and D-xylulose-5-P, respectively.
[0283] The bacterial pathway for L-arabinose utilization does not use redox reactions like the yeast/fungal system, but consists of L-arabinose isomerase (AraA), L-ribulokinase (AraB), and L-ribulose-5-P 4-epimerase (AraD), converting L-arabinose to L-ribulose, L-ribulose-5-P, and D-xylulose-5-P, respectively (Lee et al. (1986) Gene 47:231-244). However, the expression of the E. coli pathway in S. cerevisiae did not result in either growth on L-arabinose or production of ethanol from L-arabinose (Sedlak et al. (2001) 28:16-24). It was suggested that the main problem was the low activity of L-arabinose isomerase in yeast.
[0284] When Becker and Boles ((2003) Appl. Environ. Microbiol. 69:4144-50) tried to establish an L-arabinose utilization pathway in S. cerevisiae, the B. licheniformis AraA gene did not produce any L-arabinose isomerase activity in their yeast strains. Nevertheless, the corresponding enzymes from B. subtilis and Mycobacterium smegmatis were active in yeast but did not promote growth on L-arabinose. By using sequential transfer of yeast transformants in media containing L-arabinose as a breeding strategy, a strain that exhibited fast growth on L-arabinose and a high fermentative performance with L-arabinose was selected. Molecular analysis of this strain revealed that efficient utilization of L-arabinose resulted from balanced stoichiometry of the L-arabinose-utilizing enzymes with high L-arabinose uptake.
[0285] Thus, it is desirable to recombinantly express enzymes from bacterial pathways, fungal pathways, or both, in host organisms to more efficiently ferment arabinose and xylose from biomass. Yet, despite knowledge in the art related to expression of a foreign or synthetic gene in a host organism, many sugar catabolic enzymes do not express well in host organisms such as E. coli or S. cerevisiae. Accordingly, provided herein are hydrolysis enzyme-encoding nucleotide sequences and methods of making the same for improved expression of sugar catabolic enzymes.
[0286] Some translational pauses result from the presence of particular codon pairs in the nucleotide sequence encoding the polypeptide to be translated. As provided herein, inappropriate or excessive translation pauses can reduce protein expression considerably. Further, the translational pausing properties of codon pairs vary from organism to organism. As a result, exogenous expression of genes foreign to the expression organism can lead to inefficient translation. Even when the gene is translated in a sufficiently efficient manner that recoverable quantities of the translation product are produced, the protein is often inactive, insoluble, aggregated, or otherwise different in properties from the native protein. Thus, removing inappropriate or excessive translation pauses can improve protein expression.
[0287] Methods of determining patterns of codon pair utilization are known in the art, as exemplified by U.S. Patent Number 5,082,767 (which is incorporated by reference herein in its entirety), which describes analysis of patterns of nonrandom codon pair usage. The information obtained from codon pair utilization analysis can be used to construct and express altered or synthetic genes having desired levels of translational efficiency, to introduce translational pause sites into heterologous genes, and to ascertain relationship or ancestral origin of nucleotide sequences in accordance with the methods provided herein and the knowledge in the art.
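By way of illustration only, nonrandom codon pair usage of the kind referred to above can be sketched as a comparison of observed and expected codon pair counts. The following is a minimal sketch and is not the method of U.S. Patent Number 5,082,767: the function names, the null model (expected pair counts taken from the product of single-codon frequencies), and the standardized residual used as a score are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_pair_scores(sequences):
    """Return a standardized residual for every adjacent codon pair seen in the input.

    Assumption for illustration: the expected count of a pair is the product of the
    two single-codon frequencies times the total number of pairs. Positive scores
    mark over-represented pairs, negative scores under-represented pairs.
    """
    codon_counts = Counter()
    pair_counts = Counter()
    for cds in sequences:
        codons = split_codons(cds)
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))

    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (first, second), observed in pair_counts.items():
        expected = (
            (codon_counts[first] / total_codons)
            * (codon_counts[second] / total_codons)
            * total_pairs
        )
        scores[(first, second)] = (observed - expected) / sqrt(expected)
    return scores
```

In practice such counts would be gathered over a large set of coding sequences from the expression organism, since a score computed from a single short sequence carries little statistical weight.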
[0288] A translational pause can serve to slow translation of the nascent amino acid chain. In some instances when such translational pauses arise in translation in native genes in the native organism, the pause(s) can serve to facilitate proper polypeptide folding, post-translational modification, re-organization/folding at protein domain boundaries, or other steps toward arriving at the native, active wild type protein. Accordingly, in some embodiments provided herein, one or more pauses that are predicted to be present in native translation of sugar catabolic enzymes is/are preserved in a modified hydrolysis enzyme-encoding polynucleotide provided in accordance with the teachings herein. For example, a codon pair in the modified sugar catabolic enzyme-encoding polynucleotide can be selected to have a predicted translational kinetics value that is at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, or 99% that of the native codon pair whose predicted pause is to be preserved; further, the codon pair in the modified sugar catabolic enzyme-encoding polynucleotide can be selected to be located within 30, 20, 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 codons of the native codon pair whose predicted pause is to be preserved.
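The two pause-preservation criteria above (a minimum fraction of the native translational kinetics value, and a maximum distance in codons) can be sketched as a simple check. The sketch assumes that translational kinetics values are positive numbers where larger means a stronger predicted pause, and that one value is available per codon pair position; the function name `pause_is_preserved` and its parameters are hypothetical.

```python
def pause_is_preserved(native_value, native_index, modified_values,
                       min_fraction=0.75, max_distance=5):
    """Check whether a native predicted pause is preserved in a modified sequence.

    native_value    -- predicted kinetics value of the native codon pair (assumed > 0)
    native_index    -- position of that codon pair, counted in codons
    modified_values -- predicted kinetics value for each codon pair position
                       of the modified sequence
    min_fraction    -- e.g. 0.75 for "at least 75% that of the native codon pair"
    max_distance    -- e.g. 5 for "within 5 codons of the native codon pair"
    """
    lo = max(0, native_index - max_distance)
    hi = min(len(modified_values), native_index + max_distance + 1)
    return any(v >= min_fraction * native_value for v in modified_values[lo:hi])
```

For example, a native pause of value 4.0 at position 10 counts as preserved under the defaults if any codon pair at positions 5 through 15 of the modified sequence has a value of at least 3.0.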
[0289] Accordingly, as used herein, Translation Engineering™ refers to a process used to modify the translational kinetics of a polypeptide-encoding nucleic sequence. For example, Translation Engineering™ can be applied to modify the translational kinetics of a polypeptide-encoding nucleic sequence when expressed in its native organism. In another example, Translation Engineering™ can be applied to modify the translational kinetics of a polypeptide-encoding nucleic sequence when expressed in its native organism. In some embodiments, this process alters the polypeptide-encoding nucleic sequence to optimize codon usage and codon pair usage in the organism in which the polypeptide-encoding nucleic sequence is expressed. For example, sequence modifications can be made to place or prevent restriction sites in the sequence, eliminate strong RNA secondary structures and avoid inadvertent Shine-Dalgarno sequences. Additionally, Translation Engineering™ involves modifying the translational kinetics of a polypeptide-encoding nucleic sequence by removing, preserving, and/or inserting translational pauses into the polypeptide-encoding nucleic sequence.
[0290] In accordance with the above, provided herein are sugar catabolic enzyme-encoding nucleotide sequences with refined translational kinetics and methods of making same. In one embodiment, provided is a sugar catabolic enzyme-encoding DNA sequence, wherein the encoded sequence has amino acid sequence identity with wild-type sugar catabolic enzyme, and wherein predicted translation pauses in the expression organism have been removed or reduced by replacing input-sequence codon pairs with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some embodiments, the resultant sugar catabolic enzyme-encoding nucleotide is predicted to be translated rapidly along its entire length. Expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved protein expression levels in cases where inappropriate or excessive translation pauses reduce protein expression. In addition, expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where inappropriate or excessive translation pauses cause expression of inactive, insoluble or aggregated sugar catabolic enzyme. In some embodiments, expression of the resultant sugar catabolic enzyme-encoding nucleotide is predicted to result in improved levels of active and/or natively folded polypeptide expression in cases where one or more predicted pauses are preserved from the native expression profile or are added to preserve expression of active and/or soluble sugar catabolic enzyme. Thus, the sugar catabolic enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels; higher enzymatic activity; greater protein stability and resistance to degradation; and increased solubility.
[0291] As used herein the term sugar catabolic enzyme refers to the enzymes encoded by the nucleotide sequences provided herein, and includes xylose reductase, xylitol dehydrogenase, D-xylulokinase, L-arabinitol 4-dehydrogenase, L-xylulose reductase, xylose isomerase, L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase enzymes.
[0292] Accordingly, nucleic acid sequences encoding the xylose reductase enzyme of P. stipitis (Xyr) are provided. The nucleotide sequences provided herein include the native sequence from P. stipitis shown in the sequence listing (SEQ ID NO: 1) which encodes the Xyr amino acid sequence (SEQ ID NO: 2).
[0293] Further, nucleic acid sequences encoding the xylose reductase enzyme of C. parapsilosis (Xyl1) are provided. The nucleotide sequences provided herein include the native sequence from C. parapsilosis shown in the sequence listing (SEQ ID NO: 25) which encodes the Xyl1 amino acid sequence (SEQ ID NO: 26).
[0294] Further, nucleic acid sequences encoding the xylitol dehydrogenase enzyme of P. stipitis (Xdh) are provided. The nucleotide sequences provided herein include the native sequence from P. stipitis shown in the sequence listing (SEQ ID NO: 49) which encodes the Xdh amino acid sequence (SEQ ID NO: 50).
[0295] Further, nucleic acid sequences encoding the D-xylulokinase enzyme of P. stipitis (XKI) are provided. The nucleotide sequences provided herein include the native sequence from P. stipitis shown in the sequence listing (SEQ ID NO: 73) which encodes the XKI amino acid sequence (SEQ ID NO: 74).
[0296] Further, nucleic acid sequences encoding the L-arabinitol 4-dehydrogenase enzyme of T. reesei (LAD1) are provided. The nucleotide sequences provided herein include the native sequence from T. reesei shown in the sequence listing (SEQ ID NO: 97) which encodes the LAD1 amino acid sequence (SEQ ID NO: 98).
[0297] Further, nucleic acid sequences encoding the L-xylulose reductase enzyme of A. monospora (LXR) are provided. The nucleotide sequences provided herein include the native sequence from A. monospora shown in the sequence listing (SEQ ID NO: 121) which encodes the LXR amino acid sequence (SEQ ID NO: 122).
[0298] Further, nucleic acid sequences encoding the L-xylulose reductase enzyme of T. reesei (LXR) are provided. The nucleotide sequences provided herein include the native sequence from T. reesei shown in the sequence listing (SEQ ID NO: 145) which encodes the LXR amino acid sequence (SEQ ID NO: 146).
[0299] Further, nucleic acid sequences encoding the xylose isomerase enzyme of E. coli (XylA) are provided. The nucleotide sequences provided herein include the native sequence from E. coli shown in the sequence listing (SEQ ID NO: 169) which encodes the XylA amino acid sequence (SEQ ID NO: 170).
[0300] Further, nucleic acid sequences encoding the L-arabinose isomerase enzyme of E. coli (AraA) are provided. The nucleotide sequences provided herein include the native sequence from E. coli shown in the sequence listing (SEQ ID NO: 193) which encodes the AraA amino acid sequence (SEQ ID NO: 194).
[0301] Further, nucleic acid sequences encoding the L-ribulokinase enzyme of E. coli (AraB) are provided. The nucleotide sequences provided herein include the native sequence from E. coli shown in the sequence listing (SEQ ID NO: 217) which encodes the AraB amino acid sequence (SEQ ID NO: 218).
[0302] Further, nucleic acid sequences encoding the L-ribulose-5-P 4-epimerase enzyme of E. coli (AraD) are provided. The nucleotide sequences provided herein include the native sequence from E. coli shown in the sequence listing (SEQ ID NO: 241) which encodes the AraD amino acid sequence (SEQ ID NO: 242).
[0303] Further, nucleic acid sequences encoding the xylose reductase enzyme of C. tenuis (Xyr) are provided. The nucleotide sequences provided herein include the native sequence from C. tenuis shown in the sequence listing (SEQ ID NO: 265) which encodes the Xyr amino acid sequence (SEQ ID NO: 266).
[0304] Further, nucleic acid sequences encoding the L-arabinose isomerase enzyme of B. subtilis (AraA) are provided. The nucleotide sequences provided herein include the native sequence from B. subtilis shown in the sequence listing (SEQ ID NO: 289) which encodes the AraA amino acid sequence (SEQ ID NO: 290).
[0305] Further, nucleic acid sequences encoding the L-arabinose isomerase enzyme of B. licheniformis (AraA) are provided. The nucleotide sequences provided herein include the native sequence from B. licheniformis shown in the sequence listing (SEQ ID NO: 301) which encodes the AraA amino acid sequence (SEQ ID NO: 302).
[0306] Further, provided herein are nucleic acid sequences encoding sugar catabolic enzymes with refined translational kinetics for expression in S. cerevisiae (SEQ ID NOS: 3, 27, 51, 75, 99, 123, 147, 171, 195, 219, 243, 267, 291, 303), E. coli (SEQ ID NOS: 9, 33, 57, 81, 105, 129, 153, 177, 201, 225, 249, 273, 293 and 305), P. pastoris (SEQ ID NOS: 15, 39, 63, 87, 111, 135, 159, 183, 207, 231, 255, 279, 295 and 307), and K. lactis (SEQ ID NOS: 21, 45, 69, 93, 117, 141, 165, 189, 213, 237, 261, 285, 297 and 309). Also provided herein are sequences where additional sequence has been added to the 3' or 5' ends, or both. As will be understood by one of skill in the art, nucleotide sequences may be added 3' or 5' of any nucleic acid, for example, to facilitate hybridization of PCR primers, to add cloning restriction sites or other sites that facilitate cloning and/or expression. Accordingly, provided in the sequence listing are nucleic acid sequences with additional 5' and 3' cloning and/or PCR sequences, and which encode sugar catabolic enzymes with refined translational kinetics for expression in S. cerevisiae (SEQ ID NOS: 5, 7, 29, 31, 53, 55, 77, 79, 101, 103, 125, 127, 149, 151, 173, 175, 197, 199, 221, 223, 245, 247, 269, 271), E. coli (SEQ ID NOS: 11, 13, 35, 37, 59, 61, 83, 85, 107, 109, 131, 133, 155, 157, 179, 181, 203, 205, 227, 229, 251, 253, 275, 277) and P. pastoris (SEQ ID NOS: 17, 19, 41, 43, 65, 67, 89, 91, 113, 115, 137, 139, 161, 163, 185, 187, 209, 211, 233, 235, 257, 259, 281, 283).
[0307] Further, provided in the sequence listing are sugar catabolic enzyme amino acid sequences encoded by the nucleotide sequences with refined translational kinetics described herein. Thus, sugar catabolic enzyme nucleic acid sequences with refined translational kinetics (SEQ ID NOS: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 267, 271, 273, 275, 277, 279, 281, 283, 285, 287, 291, 295, 297, 299, 303, 305, 307, 309 and 311) respectively encode the amino acid sequences shown in the sequence listing (SEQ ID NOS: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214, 216, 220, 222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 244, 248, 250, 252, 254, 256, 258, 260, 262, 264, 268, 270, 272, 274, 276, 278, 280, 282, 284, 286, 288, 292, 294, 296, 298, 300, 304, 306, 308, 310 and 312).
[0308] Also provided herein are sugar catabolic enzyme-encoding DNA sequences, wherein the encoded sequence has amino acid sequence identity with an original sugar catabolic enzyme polypeptide and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein. In some embodiments, the host organism is not human, E. coli or S. cerevisiae.
[0309] As used herein, a xylose reductase polynucleotide encodes a polypeptide having xylose reductase activity. Xylose reductase and like terms refers to the enzymatic conversion of xylose to xylitol. A method for measuring xylose reductase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in Rawat et al. ((1998) J. Biol. Chem. 273:9415-9423), hereby incorporated by reference in its entirety.
[0310] As used herein, a xylitol dehydrogenase polynucleotide encodes a polypeptide having xylitol dehydrogenase activity. Xylitol dehydrogenase and like terms refers to the enzymatic conversion of xylitol to D-xylulose. A method for measuring xylitol dehydrogenase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in Ko et al. ((2006) Appl. Environ. Microbiol. 72:4207-4213), hereby incorporated by reference in its entirety.
[0311] As used herein, a D-xylulokinase polynucleotide encodes a polypeptide having D-xylulokinase activity. D-xylulokinase and like terms refers to the enzymatic conversion of D-xylulose to D-xylulose-5-phosphate. A method for measuring D-xylulokinase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in Dills et al. ((1994) Protein Expr. Purif. 5:259-265), hereby incorporated by reference in its entirety.
[0312] As used herein, a L-arabinitol 4-dehydrogenase polynucleotide encodes a polypeptide having L-arabinitol 4-dehydrogenase activity. L-arabinitol 4-dehydrogenase and like terms refers to the enzymatic conversion of L-arabinose to L-arabitol. A method for measuring L-arabinitol 4-dehydrogenase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in U.S. Patent Application No. 2003/0186402, hereby incorporated by reference in its entirety.
[0313] As used herein, a L-xylulose reductase polynucleotide encodes a polypeptide having L-xylulose reductase activity. L-xylulose reductase and like terms refers to the enzymatic conversion of L-xylulose to xylitol. A method for measuring L-xylulose reductase activity is exemplified by a known method as described in Verho et al. ((2004) J. Biol. Chem. 279:14746-14751), hereby incorporated by reference in its entirety.
[0314] As used herein, a xylose isomerase polynucleotide encodes a polypeptide having xylose isomerase activity. Xylose isomerase and like terms refers to the enzymatic conversion of xylose to D-xylulose. A method for measuring xylose isomerase activity is exemplified by a known method in which an enzymatic reaction is carried out and xylulose production is monitored by spectrophotometry, as described in U.S. Patent No. 6,475,768, hereby incorporated by reference in its entirety.
[0315] As used herein, a L-arabinose isomerase polynucleotide encodes a polypeptide having L-arabinose isomerase activity. L-arabinose isomerase and like terms refers to the enzymatic conversion of L-arabinose to L-ribulose. A method for measuring L-arabinose isomerase activity is exemplified by a known method in which an enzymatic reaction is carried out and ribulose absorbance at 560 nm is monitored by spectrophotometry, as described in Lee et al. ((2005) Appl. Environ. Microbiol. 71:7888-7896), hereby incorporated by reference in its entirety.
[0316] As used herein, a L-ribulokinase polynucleotide encodes a polypeptide having L-ribulokinase activity. L-ribulokinase and like terms refers to the enzymatic conversion of L-ribulose to L-ribulose-5-P. A method for measuring L-ribulokinase activity is exemplified by a known method in which an enzymatic reaction is carried out and DPNH absorbance at 340 nm is monitored by spectrophotometry, as described by Lee and Englesberg ((1962) Proc. Natl. Acad. Sci. 48:335), hereby incorporated by reference in its entirety.
[0317] As used herein, a L-ribulose-5-P 4-epimerase polynucleotide encodes a polypeptide having L-ribulose-5-P 4-epimerase activity. L-ribulose-5-P 4-epimerase and like terms refers to the enzymatic conversion of L-ribulose-5-P to D-xylulose-5-P. A method for measuring L-ribulose-5-P 4-epimerase activity is exemplified by a known method in which an enzymatic reaction is carried out and NADPH absorbance at 340 nm is monitored by spectrophotometry, as described in Becker and Boles ((2003) Appl. Environ. Microbiol. 69:4144-50), hereby incorporated by reference in its entirety.
[0318] The polynucleotides provided herein encode polypeptides that have sugar catabolism activity. Thus, a sugar catabolic enzyme-encoding polynucleotide comprising any of the DNA sequences provided herein can be transcribed and the resulting RNA translated to produce a polypeptide with sugar catabolic enzyme activity.
[0319] As used herein, the term nucleotide sequence is used to refer to any polynucleotide sequence. The term DNA sequence is used herein to refer to the nucleotide sequences presented herein. As will be understood by one of skill in the art, equivalent RNA nucleotide sequences are also described by the DNA sequences presented herein. As is well known in the art, an equivalent RNA sequence can be substituted for a DNA sequence by a T to U substitution (i.e., replacing thymine in the DNA sequence with uracil in the RNA sequence).
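The T to U substitution described above is a simple character replacement. A minimal sketch follows; the function name `dna_to_rna` is hypothetical and the example sequence is arbitrary, not taken from the sequence listing.

```python
def dna_to_rna(dna):
    """Return the RNA equivalent of a DNA sequence by replacing thymine (T) with uracil (U).

    Case is preserved, and all other characters are left unchanged.
    """
    return dna.replace("T", "U").replace("t", "u")


# Arbitrary illustrative sequence (not from the sequence listing)
print(dna_to_rna("ATGGCTTAA"))
```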
[0320] In some embodiments, the sugar catabolic enzyme-encoding DNA sequence is adapted for expression in a heterologous host organism. As used herein, a DNA sequence that has been adapted for expression is a DNA sequence that has been inserted into an expression vector or otherwise modified to contain regulatory elements necessary for expression of the DNA in the host cell, positioned in such a manner as to permit expression of the DNA in the host cell. Such regulatory elements required for expression include promoter sequences, transcription initiation sequences and, optionally, enhancer sequences. For example, a DNA sequence may be inserted into a plasmid vector adapted for expression in a bacterial cell, such as E. coli, or a eukaryotic cell, such as S. cerevisiae or other yeast, or any other host organism.
[0321] A heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism. In certain aspects, the host organism is not human, E. coli or S. cerevisiae.
Changes to translational kinetics
[0322] The methods and sequences provided herein permit modification of the translational kinetics of an mRNA into a sugar catabolic enzyme-encoding polypeptide. Translational kinetics of an mRNA into polypeptide can be changed in order to achieve any of a variety of expression profiles. For example, translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all translational pauses predicted to occur within an autonomous folding unit of a nascent protein. In another example, translational kinetics of an mRNA into polypeptide can be changed in order to replace some or all over-represented codon pairs.
[0323] It is proposed herein that the presence of a pause or translation slowing codon pair can queue ribosomes back to the beginning of the coding sequence, thereby inhibiting further ribosome attachment to the message which can result in down-regulation of protein expression levels as the rate of translation initiation readily saturates and the slowest translation step time becomes rate limiting. It is also proposed herein that the presence of a pause or translational slowing codon pair can stall or detach a ribosome. It is also proposed herein that the presence of a pause or translational slowing codon pair can expose naked mRNA, which is then subject to message degradation. It is also proposed herein that the presence of a pause or translational slowing codon pair can decouple translation from transcription, leading to protein expression failure. For these reasons and more, methods for analyzing, designing and producing gene sequences and polynucleotides to remove or decrease in number, or selectively preserve or insert, pauses, or to replace or modify translational slowing codon pairs, have great utility.
[0324] Organism-specific codon usage and codon pair usage, and the presence of organism-specific pause sites, result in gene translation that is highly adapted to the original host organism. For example, ribosomal pausing sites that may be functional in a human cell will typically be scrambled, random, or not appropriate or not recognized in the proper context in a bacterium or other non-native host. A heterologous cDNA or synthetic polynucleotide has a random but high probability of inadvertently encoding a pause site somewhere, often leading to protein expression and/or activity failure.
[0325] Differences between codon pair (pause signal) coding among bacteria or among vertebrates are sufficient to make cross-family gene expression unpredictable. For example, in various organisms such as bacteria, a significant pause or translational slowing can result in premature transcription termination and/or messenger degradation. Even in eukaryotes there is a coupling between export of mRNA from the nucleus and translation; thus a different, but still effective system of clearing untranslated mRNA exists in eukaryotes.
[0326] Methods for refining translational kinetics of an mRNA into polypeptide can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2008/0046192, published on February 21, 2008, which is incorporated by reference herein in its entirety. For example, a polypeptide-encoding nucleotide can be designed to be predicted to be translated rapidly along its entire length. Thus, some polypeptide-encoding nucleotides provided herein are those that have been engineered to remove all predicted pauses. Expression of such a polypeptide-encoding nucleotide can result in improved protein expression levels and improved levels of active and/or natively folded polypeptide expression.
[0327] Further methods of refining translational kinetic values are contemplated herein, as can be seen in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007, each of which is incorporated by reference herein in its entirety.
[0328] As provided herein, a test of translation pausing or slowing as a result of codon pair usage can be performed by comparing a series of genes that have random pauses with modified genes where codon pairs predicted to cause translational pauses are replaced. Unmodified genes moved from their source organism and expressed in a heterologous host can have an altered set of codon pairs predicted to cause a translational pause or ribosomal slowing (e.g., an altered set of over-represented codon pairs), resulting in altered configuration and location of presumed pause sites. Creation of synthetic codon-pair-optimized genes can have a dramatic effect on expression: expression of difficult-to-express genes can be seen for the first time, or improved at least 2-fold, 3-fold, 4-fold, 5-fold, 6-fold, 7-fold, 8-fold, 9-fold, 10-fold, 12-fold, 15-fold, 20-fold, 25-fold, 30-fold, or more, relative to unmodified polypeptide-encoding nucleic acid sequences.
[0329] In some embodiments, translational kinetics of an mRNA into sugar catabolic enzyme-encoding polypeptide can be changed in order to remove some or all translational pauses or replace other codon pairs that cause translational slowing, message instability and degradation, and poor protein translation, expression, and functional properties. While not intending to be limited to the following, it is believed that, for at least some proteins, reduction or elimination of translational pauses can serve to increase the expression level and/or quality and characteristics of the protein. Accordingly, by removing some or all translational pauses or replacing other codon pairs that cause translational slowing, the expression levels and/or quality of an expressed protein can be increased.
[0330] For example, the sugar catabolic enzyme-encoding nucleotide sequences provided herein allow for one or more of the following results: higher expression levels, higher enzymatic activity, greater protein stability, resistance to degradation, and increased solubility compared to the original native gene when expressed in a heterologous host.
[0331] Thus, also provided herein are sugar catabolic enzyme-encoding nucleotide sequences that have been modified to have one or more translational pauses or slowing sites removed by modifying one or more codon pairs to a corresponding codon pair that is less likely to cause a translational pause or slowing. While in some embodiments it is preferred to replace all codon pairs predicted to cause a translational pause or slowing, in other embodiments, it is sufficient to replace a subset of codon pairs predicted to cause a translational pause or slowing. For example, expression levels can be increased by replacing at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more codon pairs predicted to cause a translational pause or slowing. In another example, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% of codon pairs predicted to cause a translational pause or slowing are replaced by, for example, substituting different codon pairs that encode the same amino acids.
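Substituting a different codon pair that encodes the same two amino acids can be sketched as follows. This is a minimal greedy illustration, not the method of the incorporated publications: it assumes the standard genetic code, and it assumes a hypothetical dictionary `scores` that maps codon pairs to predicted translational kinetics values for the host organism (pairs absent from the dictionary are treated as 0.0). Because each codon belongs to two overlapping pairs, a candidate replacement is judged by the worst score among the pairs it touches.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(codons):
    """Translate a list of codons into a one-letter amino acid string ('*' = stop)."""
    return "".join(CODON_TABLE[c] for c in codons)


def synonymous_pairs(c1, c2):
    """All codon pairs that encode the same two amino acids as (c1, c2)."""
    first = [c for c, a in CODON_TABLE.items() if a == CODON_TABLE[c1]]
    second = [c for c, a in CODON_TABLE.items() if a == CODON_TABLE[c2]]
    return [(x, y) for x in first for y in second]


def remove_pauses(codons, scores, threshold):
    """Greedy left-to-right replacement of codon pairs scoring above `threshold`."""
    codons = list(codons)
    for i in range(len(codons) - 1):
        if scores.get((codons[i], codons[i + 1]), 0.0) <= threshold:
            continue

        def local_worst(x, y):
            # Worst score among the (up to three) pairs affected by placing x, y at i, i+1
            trial = codons[:i] + [x, y] + codons[i + 2:]
            lo, hi = max(0, i - 1), min(len(trial) - 1, i + 2)
            return max(scores.get((trial[j], trial[j + 1]), 0.0) for j in range(lo, hi))

        best = min(synonymous_pairs(codons[i], codons[i + 1]), key=lambda p: local_worst(*p))
        if local_worst(*best) < local_worst(codons[i], codons[i + 1]):
            codons[i], codons[i + 1] = best
    return codons
```

The encoded amino acid sequence is unchanged by construction, since only synonymous codons are swapped in. Replacing only a subset of pairs, as contemplated above, would amount to applying the same step to a chosen selection of positions rather than to every pair above the threshold.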
[0332] In some embodiments, translational kinetics of an mRNA into polypeptide can be changed in order to remove some or all translational pauses predicted to occur within an autonomous folding unit of a protein. As used herein, an autonomous folding unit of a protein refers to an element of the overall protein structure that is self-stabilizing and often folds independently of the rest of the protein chain. Such autonomous folding units typically correspond to a protein domain. As provided herein, expression of a gene in a heterologous host organism can result in translational pauses located in regions that inhibit protein expression and/or protein folding. Since the presence of codon pairs predicted to cause a translational pause or slowing in protein-encoding regions separating regions encoding different autonomous folding units of the protein can serve to pause or slow translation, it is also contemplated that removal of translational pauses predicted to occur within an autonomous folding unit of a protein, particularly for heterologously-expressed proteins, can result in improved expression levels and/or folding of expressed proteins. Accordingly, provided herein are methods of changing translational kinetics of an mRNA into polypeptide by removing some or all translational pauses predicted to occur within an autonomous folding unit of a protein, thereby increasing expression levels and/or improving the folding of the expressed protein.
[0333] It is further contemplated that preserving or inserting a translational pause in a region predicted to separate autonomous folding units of a protein, particularly for heterologously-expressed proteins, can result in improved folding and/or solubility of expressed proteins. Accordingly, provided herein are methods of changing translational kinetics of an mRNA into polypeptide by preserving, relative to native, or inserting one or more translational pauses in one or more regions predicted to separate autonomous folding units of a protein, thereby improving the folding and/or solubility of the expressed protein.
[0334] In the methods provided herein that include changing translational kinetics of an mRNA into polypeptide by modifying codon pairs with regard to their location within or outside of autonomous folding units of proteins, one step can include identifying predicted autonomous folding units of a protein. Methods for identifying predicted autonomous folding units of a protein or protein domains are known in the art, and include alignment of amino acid sequences with protein sequences having known structures, and threading amino acid sequences against template protein domain databases. Such methods can employ any of a variety of software algorithms in searching any of a variety of databases known in the art for predicting the location of protein domains. The results of such methods will typically include an identification of the amino acids predicted to be present in a particular domain, and also can include an identification of the domain itself, and an identification of the secondary structural element, if any, in which each amino acid sequence of a domain is located.
[0335] In some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to remove a translational pause not present in the expression profile of the polypeptide in the native host organism. For example, there may be no codon pairs that are not predicted to cause a translational pause or slowing and that encode a corresponding pair of amino acids. In such instances, several options are available: the codon pair that is least likely to cause a translational pause or slowing can be selected; an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made. One option in a computational method is to request human input in order to resolve the issue. The computational method may, for example, involve the use of a computer that is programmed to request human input. Alternatively, the computer may be programmed to make a selection, or combination of selections, such that multiple genes, or Ordered Gene Sets or small permutation libraries are designed and synthetically produced for use in expression analysis. In methods in which an amino acid insertion, deletion or mutation is made in order to change translational kinetics, it is preferable to select a change that is predicted not to substantially influence the final three-dimensional structure of the protein and/or the activity of the protein. Such an amino acid insertion, deletion or mutation can include, for example, a conservative amino acid substitution such as the conservative substitutions shown in Table 1. The substitutions shown are based on amino acid physical-chemical properties, and as such, are independent of organism. In some embodiments, the conservative amino acid substitution is a substitution listed under the heading of exemplary substitutions.
Table 1

Original Residue    Conservative Substitutions    Exemplary Substitutions
Ala (A)             val; leu; ile                 val
Arg (R)             lys; gln; asn                 lys
Asn (N)             gln; his; lys; arg            gln
Asp (D)             glu                           glu
Cys (C)             ser                           ser
Gln (Q)             asn                           asn
Glu (E)             asp                           asp
Gly (G)             pro; ala                      ala
His (H)             asn; gln; lys; arg            arg
Ile (I)             leu; val; met; ala; phe       leu
Leu (L)             ile; val; met; ala; phe       ile
Lys (K)             arg; gln; asn                 arg
Met (M)             leu; phe; ile                 leu
Phe (F)             leu; val; ile; ala; tyr       leu
Pro (P)             ala                           ala
Ser (S)             thr                           thr
Thr (T)             ser                           ser
Trp (W)             tyr; phe                      tyr
Tyr (Y)             trp; phe; thr; ser            phe
Val (V)             ile; leu; met; phe; ala       leu
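The exemplary substitutions of Table 1 can be held in a lookup table for the case where a conservative amino acid change is chosen because no synonymous codon pair avoids a predicted pause. The sketch below only encodes the table and applies one substitution at a given position; the names `EXEMPLARY_SUBSTITUTION` and `substitute` are hypothetical, and deciding whether a substitution is acceptable for structure or activity is outside its scope.

```python
# Exemplary substitutions from Table 1, keyed by one-letter residue code
EXEMPLARY_SUBSTITUTION = {
    "A": "V", "R": "K", "N": "Q", "D": "E", "C": "S",
    "Q": "N", "E": "D", "G": "A", "H": "R", "I": "L",
    "L": "I", "K": "R", "M": "L", "F": "L", "P": "A",
    "S": "T", "T": "S", "W": "Y", "Y": "F", "V": "L",
}


def substitute(protein, index):
    """Return a copy of `protein` with the residue at `index` replaced by its
    exemplary conservative substitution from Table 1."""
    replacement = EXEMPLARY_SUBSTITUTION[protein[index]]
    return protein[:index] + replacement + protein[index + 1:]


# Arbitrary illustrative peptide: the alanine at position 1 becomes valine
print(substitute("MAKR", 1))
```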
[0336] While in some embodiments all codon pairs predicted to cause a translational pause or slowing are treated equally, in other embodiments one or more different threshold levels can be established for differential treatment of codon pairs, where codon pairs above a highest threshold are the codon pairs most likely to cause a translational pause or slowing, and successively lower codon pair threshold-based groups correspond to successively lower likelihoods of the respective codon pairs causing a translational pause or slowing. Based on the codon pair groupings, different numbers or percentages of codon pairs can be replaced for each of these different threshold-based groups. For example, 95% or more of the codon pairs above a highest threshold level can be replaced, while 90% or less of all codon pairs between that level and an intermediate threshold level are replaced. As contemplated herein, codon pairs likely to cause a translational pause or slowing can be segregated into two or more different threshold-based groups, three or more different threshold-based groups, four or more different threshold-based groups, five or more different threshold-based groups, six or more different threshold-based groups, or more. Discussion of specific thresholds is provided elsewhere herein; however, typically the higher the threshold, the higher the likelihood of a translational pause or slowing caused by a codon pair with a translational kinetics value greater than the threshold. In embodiments in which codon pairs likely to cause a translational pause or slowing are segregated into two or more different threshold-based groups, different numbers or percentages of codon pairs can be replaced for each codon pair group. For example, in one embodiment, at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% of codon pairs above a highest threshold are replaced, while the same or a lower percentage of codon pairs are replaced from codon pair groups corresponding to one or more lower thresholds. Typically, for each successively lower threshold group, the same or a lower percentage of codon pairs are replaced. In one example, all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair is located within an autonomous folding unit. In another example, all codon pairs above a highest threshold are replaced, while a codon pair above an intermediate threshold is replaced only if the codon pair can be replaced without requiring a change in the encoded polypeptide sequence. In another example, all codon pairs above a highest threshold are replaced, while a codon pair above a first higher intermediate threshold is replaced only if the codon pair can be replaced without changing the encoded polypeptide sequence or with only a conservative change to the encoded polypeptide sequence, while a codon pair above a second lower intermediate threshold is replaced only if the codon pair can be replaced without requiring any change in the encoded polypeptide sequence. While the above discussion has been applied to the use of a plurality of threshold levels, it will be readily apparent to one skilled in the art that, in place of threshold levels, an evaluation method can be used that determines the degree to which a codon pair should be replaced according to the translational kinetics value of the codon pair, where the degree to which the codon pair should be replaced can be counterbalanced by any of a variety of user-determined factors such as, for example, presence of the codon pair within or between autonomous folding units, and degree of change to the encoded polypeptide sequence.
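The last of the foregoing examples, in which three thresholds govern replacement, can be sketched as follows. This is an illustration under stated assumptions, not a required implementation: the threshold values (3.0, 2.0 and 1.5 standard deviations above the mean) and the names Change and should_replace are hypothetical.

```python
from enum import Enum


class Change(Enum):
    """Smallest change to the encoded polypeptide that a replacement would require."""
    SYNONYMOUS = 0        # no change to the encoded polypeptide sequence
    CONSERVATIVE = 1      # a conservative substitution (e.g., per Table 1)
    NONCONSERVATIVE = 2   # any other change


# Hypothetical thresholds, in standard deviations above the mean.
HIGHEST = 3.0
UPPER_INTERMEDIATE = 2.0
LOWER_INTERMEDIATE = 1.5


def should_replace(z_score, best_available_change):
    """Apply the three-tier replacement rule to one codon pair."""
    if z_score > HIGHEST:
        # All codon pairs above the highest threshold are replaced.
        return True
    if z_score > UPPER_INTERMEDIATE:
        # Replace only with no change or a conservative change.
        return best_available_change in (Change.SYNONYMOUS, Change.CONSERVATIVE)
    if z_score > LOWER_INTERMEDIATE:
        # Replace only if the polypeptide sequence is unchanged.
        return best_available_change is Change.SYNONYMOUS
    return False
```

The continuous evaluation method mentioned above would replace the fixed tiers with a score weighed against the same user-determined factors.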
[0337] In accordance with the methods and sequences provided herein, a translational kinetics value of a codon pair is a representation of the degree to which it is expected that a codon pair is associated with a translational pause. Methods of determining the translational kinetics value of a codon pair are discussed elsewhere herein. Such translational kinetics values can be normalized to facilitate comparison of translational kinetics values between species. In some embodiments, the translational kinetics value can be the degree of over-representation of a codon pair. An over-represented codon pair is a codon pair which is present in a protein-encoding sequence in higher abundance than would be expected if all codon pairs were statistically randomly abundant. When translational kinetics values of codon pairs are determined, a codon pair predicted to cause a translational pause or slowing is a codon pair whose likelihood of causing a translational pause or slowing is at least one standard deviation above the mean translational kinetics value, where a particular translational kinetics value above the mean translational kinetics value in this context refers to a translational kinetics value indicative of a greater likelihood of causing translational pausing or slowing, relative to a mean translational kinetics value, and is not strictly limited to a particular mathematical relationship (e.g., greater than the mean) since the depiction of propensity to cause a translational pause by a translational kinetics value can be selected to be negative or positive, based on the selected implementation by one skilled in the art. For example, over-represented codon pairs may be graphically displayed as a positive function in a SpeedPlot™, as depicted in Figure 1, where a positive deflection or peak above a selected threshold describes a translational pause or slowing at the exact nucleotide location as defined by the abscissa.
In the methods provided herein, a threshold for the translational kinetics value of codon pairs that are predicted to cause a translational pause or slowing can be set in accordance with the method and level of stringency desired by one skilled in the art. For example, when it is desired to identify only a small number of the codon pairs most likely to cause a translational pause or slowing, a threshold value can be set to 5, or 3, or 2, or 1.5 standard deviations or more above the mean. Typical threshold values can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean. As provided herein, a plurality of thresholds can be applied in the herein-provided methods in segregating codon pairs into a plurality of groups. Each threshold of such a plurality can be a different value selected from 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 and 5 or more standard deviations above the mean.
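The standard-deviation threshold described above can be illustrated with a short sketch. It assumes that a higher translational kinetics value indicates a greater likelihood of pausing (the sign convention may be reversed in a given implementation, as noted above) and uses the population standard deviation; the function name pause_candidates and the input format are hypothetical.

```python
from statistics import mean, pstdev


def pause_candidates(values, threshold_sd=1.0):
    """Return codon pairs whose value lies at least `threshold_sd` SDs above the mean.

    values: mapping of codon pair -> translational kinetics value, where a
    higher value is assumed to mean a greater likelihood of pausing.
    Returns a mapping of codon pair -> z score for the flagged pairs.
    """
    observed = list(values.values())
    mu = mean(observed)
    sigma = pstdev(observed)
    if sigma == 0:
        return {}  # no spread, so no pair stands out
    flagged = {}
    for pair, value in values.items():
        z = (value - mu) / sigma
        if z >= threshold_sd:
            flagged[pair] = z
    return flagged
```

Raising threshold_sd to, for example, 3 or 5 restricts the result to the small number of codon pairs most likely to cause a pause.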
[0338] In some embodiments, translational kinetics of an mRNA into polypeptide can be changed to add or retain one or more translational pauses predicted to occur before, after or within an autonomous folding unit of a protein, or between autonomous folding units. While not intending to be limited to the following, it is proposed that translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure in the domain prior to further downstream translation and reorganization or reconfiguration of the growing polypeptide or domain. By modifying the translational kinetics of complex multi-domain proteins it may be possible to experimentally alter the time each domain has available to organize. Folding of a heterologously-expressed gene having two or more independent domains can be altered by the presence of pause sites between the domains. Refolding studies indicate that the time it takes for a protein to settle into its final configuration may be longer than the time required for translation of the protein. Pausing may allow each domain to partially organize and commit to a particular, independent fold. Other co-translational events, such as those associated with co-factors, protein subunits, protein complexes, membranes, chaperones, secretion, or proteolysis complexes, also can depend on the kinetics of the emerging nascent polypeptide. Pauses can be introduced by engineering one codon pair predicted to cause a translational pause or slowing, or two or more such codon pairs, into the sequence to facilitate these co-translational interactions.
[0339] As such, provided herein is the recognition that the presence of codon pairs predicted to cause a translational pause or slowing in protein-encoding regions separating regions encoding different autonomous folding units of the protein can serve to pause translation and facilitate folding of the nascent translated protein, where autonomous folding units can be secondary structural elements such as an alpha helix, or can be tertiary structural elements such as a protein domain. Accordingly, provided herein are methods of changing translational kinetics of an mRNA into polypeptide by including or preserving one or more translational pauses predicted to occur before, after, or between autonomous folding units of a protein, thereby increasing the likelihood that the translated protein will be properly folded. In such embodiments, typically a translational pause is preserved, which refers to maintaining the same codon pair for a polypeptide-encoding nucleotide sequence that is expressed in the native host organism, or, when the polypeptide-encoding nucleotide sequence is heterologously expressed, changing the codon pair as appropriate to have a translational kinetics value comparable to or closest to the translational kinetics value of the native codon pair in the native host organism.
[0340] In some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to preserve or insert a translational pause without causing a change to the encoded amino acid sequence. For example, there may be no codon pairs that are predicted to cause a translational pause or slowing and that encode the same pair of amino acids as encoded in the original sequence. In such instances, several options are available. First, proximal codon pairs can be selected to be replaced in order to introduce a translational pause or slowing. For example, one of the 1, 2, 3, 4 or 5 most proximal codon pairs upstream (5' of the desired pause site) or one of the 1, 2, 3, 4 or 5 most proximal codon pairs downstream (3' of the desired pause site) can be chosen for replacement to introduce the translational pause or slowing. Typically in such instances, the selected codon pair for replacement to introduce the translational pause or slowing is the codon pair closest to the originally desired codon pair location of the translational pause or slowing, provided the desired translational pause or slowing can be attained (e.g., 1 codon pair upstream or downstream is typically selected instead of 2 codon pairs upstream or downstream). Alternatively, a translational pause or slowing can be introduced by selecting a replacement codon pair encoding a conservative amino acid substitution, such as the conservative substitutions shown in Table 1. In some embodiments, replacement of a proximal codon pair to introduce a translational pause or slowing is preferred over replacement of a codon pair resulting in a change in the encoded amino acid sequence.
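The proximal codon pair selection described above amounts to a nearest-first search around the desired pause site. The sketch below is illustrative only: nearest_pause_site and the can_pause callback are hypothetical names, positions are zero-based codon pair indices, and trying the upstream candidate before the downstream candidate at equal distance is an arbitrary assumption not stated herein.

```python
def nearest_pause_site(desired, can_pause, max_offset=5):
    """Find the codon pair position closest to `desired` that can host a pause.

    can_pause(position) is a caller-supplied test that returns True when a
    pause-inducing codon pair can be placed at that position without changing
    the encoded amino acid sequence (and False for out-of-range positions).
    Returns the chosen position, or None if nothing within `max_offset` works.
    """
    if can_pause(desired):
        return desired
    for offset in range(1, max_offset + 1):
        # Upstream (5') is tried before downstream (3'); this order is an assumption.
        for candidate in (desired - offset, desired + offset):
            if candidate >= 0 and can_pause(candidate):
                return candidate
    return None
```

If the search returns None, the alternative described above, a replacement codon pair encoding a conservative substitution, can be considered.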
[0341] Further methods of modifying polypeptide encoding nucleotide sequences are contemplated herein, as can be seen in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007, each of which is incorporated by reference herein in its entirety.
[0342] Further, provided herein is the recognition that predicted pause sites may be conserved across different proteins in the same species, or in related proteins across two or more species. In some embodiments, graphical displays of translational kinetics values of one or more proteins can be used to provide information to assist in the selection of a translational pause or slowing to preserve or insert in a redesigned polypeptide-encoding nucleotide sequence. In particular, graphical displays of translational kinetics values can permit, for example, alignment of homologous proteins from different species and an identification, based on this alignment, of predicted translational pause or slowing sites that are conserved in the aligned proteins. Such predicted translational pause or slowing sites can be preserved or inserted in a redesigned polypeptide-encoding nucleotide sequence. In another example, regions between autonomous folding units in one or more proteins within a particular species can be graphically examined for the presence or absence of predicted pause sites. Such graphical display methods can result in an identification of a region between autonomous folding units in which a translational pause or slowing is desirably preserved in a redesigned polypeptide-encoding sequence.
[0343] Methods for identifying and selecting conserved translational pauses can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007. For example, the codon pair translational kinetics values can be compared with a database of related gene sequences and conserved pause sites can be identified. Additionally, a synthetic gene can be designed wherein at least one conserved pause site is maintained to provide a synthetic gene with modified translational kinetics.
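One simple way to identify conserved pause sites from aligned homologs is to scan the alignment column by column. The following sketch is offered only as an illustration and is not the method of the publications cited above; it assumes each homolog has already been aligned and scored, with one z score per alignment column and None marking a gap. The name conserved_pause_sites and its parameters are hypothetical.

```python
def conserved_pause_sites(aligned_profiles, threshold_sd=1.5, min_fraction=1.0):
    """Return alignment columns where a predicted pause is shared by the homologs.

    aligned_profiles: list of equal-length lists, one per homolog, holding the
    z score of the codon pair at each alignment column (None for a gap).
    A column is reported when at least `min_fraction` of the homologs have a
    z score at or above `threshold_sd` there.
    """
    n = len(aligned_profiles)
    sites = []
    for column, scores in enumerate(zip(*aligned_profiles)):
        hits = sum(1 for s in scores if s is not None and s >= threshold_sd)
        if hits / n >= min_fraction:
            sites.append(column)
    return sites
```

Columns returned by such a scan are candidates for pauses to preserve or insert in the redesigned sequence.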
Redesign of polypeptide-encoding nucleotide sequence
[0344] As provided herein, codon pairs are associated with translational pauses, and can thereby influence translational kinetics of an mRNA into polypeptide. Thus, the methods of changing translational kinetics provided herein will typically be performed by modifying or designing one or more nucleotide sequences encoding a polypeptide to be expressed. Accordingly, provided herein are methods of modifying a gene or designing a synthetic nucleotide sequence encoding the polypeptide encoded by the gene, collectively referred to herein as redesigning a polypeptide-encoding gene sequence or redesigning a polypeptide-encoding nucleotide sequence. Also included in the various embodiments provided herein are redesigned gene sequences encoding polypeptides that are not identical to the polypeptide encoded by the original gene.
[0345] In some embodiments, provided is a sugar catabolic enzyme-encoding DNA sequence, wherein the encoded sequence has at least 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type sugar catabolic enzyme polypeptide sequence as set forth in SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302.
[0346] In certain embodiments, at least 1, 2 or 3 codon pairs of a polynucleotide sequence encoding the sugar catabolic enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In some such nucleotide sequences, at least 3, or 4, or 5, or 6 or more of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. In certain aspects, the DNA sequence is optimized for expression in S. cerevisiae, E. coli, P. pastoris, K. lactis or Z. mobilis.
[0347] In some embodiments, provided is a sugar catabolic enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode a functional domain of the sugar catabolic enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for functional domains are known in the art.
[0348] Typically in such embodiments, the replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism. That is, the embodiments in which one or more codon pairs encoding amino acids of a functional domain of one of the encoded polypeptides provided herein have been replaced include embodiments in which the nucleotide sequence encoding the functional domain is changed to increase the predicted rate of translation of the functional domain. As provided herein, incomplete translation, improper folding, or other protein expression shortcomings can result from the presence of one or more translational pauses in a heterologously-expressed polypeptide. In some embodiments, removal of one or more of these pauses can increase the speed of translation of the functional domain, and thereby increase the quantity of protein produced and/or increase the amount of stable, properly folded, active, and/or soluble protein produced.
[0349] In such embodiments, the replacement codons, i.e., the codons added as replacements for the wild type codons, are typically predicted to be less likely to cause a translational pause. For example, the replacement codon can have a translational kinetics value in the heterologous host organism that is 95%, 90%, 85%, 80%, 75%, 70%, or less, than the translational kinetics value of the wild type codon pair when expressed in the heterologous host organism. In some embodiments, the replacement codon is selected to have a translational kinetics value similar to the translational kinetics value of the wild type codon pair in the native organism. For example, the z score of at least one replacement codon pair when expressed in the heterologous host organism can be no more than 250%, 200%, 150%, 125% or 100% of the z score for the wild type codon pair when expressed in the native organism.
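The two percentage criteria above can be expressed as simple comparisons. This sketch assumes positive values in which a larger number indicates a greater likelihood of pausing; the function names and default limits are illustrative choices drawn from the ranges recited above.

```python
def is_less_pausing_in_host(replacement_value, wild_type_value, max_fraction=0.95):
    """True if the replacement's value in the heterologous host is at most
    `max_fraction` of the wild type codon pair's value in that same host.
    Assumes positive values where larger means more likely to pause.
    """
    return replacement_value <= max_fraction * wild_type_value


def is_similar_to_native(replacement_z_in_host, wild_type_z_in_native, max_ratio=2.5):
    """True if the replacement's z score in the heterologous host is no more than
    `max_ratio` (e.g., 2.5 for 250%) times the wild type z score in the native organism.
    Assumes positive z scores.
    """
    return replacement_z_in_host <= max_ratio * wild_type_z_in_native
```

The first test compares values within the heterologous host, while the second compares the host value against the native-organism value.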
[0350] In some embodiments, provided is a sugar catabolic enzyme-encoding DNA sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between domains of the sugar catabolic enzyme have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the domains are known in the art and are described in detail below.
[0351] In some embodiments, provided is a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the aldo/keto reductase domain of the xylose reductase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for aldo/keto reductase domains are known in the art. In the case of the xylose reductase of SEQ ID NO: 2, the aldo/keto reductase domain includes at least amino acids 6-300 or 5-301.
[0352] In some embodiments, provided is a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the N-terminus and the aldo/keto reductase domain of the xylose reductase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the aldo/keto reductase domain are described hereinabove.
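To apply the domain boundaries recited above in a computational method, an amino acid range can be converted to the codon pairs that lie entirely within it. The sketch below is illustrative: the name codon_pairs_in_range is hypothetical, residues are numbered from 1, and the treatment of a codon pair that straddles a domain boundary (here, excluded from both regions) is an assumption not specified herein.

```python
def codon_pairs_in_range(first_residue, last_residue):
    """List the codon pairs, as (i, i + 1) residue numbers, lying wholly inside
    the inclusive, 1-based amino acid range first_residue..last_residue.
    """
    return [(i, i + 1) for i in range(first_residue, last_residue)]


# Example using the boundaries recited for SEQ ID NO: 2 (amino acids 6-300).
domain_pairs = codon_pairs_in_range(6, 300)
# Region between the N-terminus and the domain: amino acids 1-5.
n_terminal_pairs = codon_pairs_in_range(1, 5)
```

The resulting lists can then be passed to whichever routine scores and replaces codon pairs within each region.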
[0353] In some embodiments, provided is a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the aldo/keto reductase domain of the xylose reductase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for aldo/keto reductase domains are known in the art. In the case of the xylose reductase of SEQ ID NO: 26, the aldo/keto reductase domain includes at least amino acids 11-306, 12-307 or 3-324.
[0354] In some embodiments, provided is a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the N-terminus and the aldo/keto reductase domain of the xylose reductase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the aldo/keto reductase domain are described hereinabove.
[0355] In some embodiments, provided is a xylitol dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the alcohol dehydrogenase GroES-like domain of the xylitol dehydrogenase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase GroES-like domains are known in the art. In the case of the xylitol dehydrogenase of SEQ ID NO: 50, the alcohol dehydrogenase GroES-like domain includes at least amino acids 28-146 or 27-147.
[0356] In some embodiments, provided is a xylitol dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the zinc-binding dehydrogenase domain of the xylitol dehydrogenase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for zinc-binding dehydrogenase domains are known in the art. In the case of the xylitol dehydrogenase of SEQ ID NO: 50, the zinc-binding dehydrogenase domain includes at least amino acids 175-314 or 174-315.
[0357] In some embodiments, provided is a xylitol dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the zinc-binding dehydrogenase domain and the alcohol dehydrogenase GroES-like domain of the xylitol dehydrogenase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the zinc-binding dehydrogenase domain and the alcohol dehydrogenase GroES-like domain are described hereinabove.
[0358] In some embodiments, provided is a xylitol dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the N-terminus and the alcohol dehydrogenase GroES-like domain of the xylitol dehydrogenase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase GroES-like domain are described hereinabove.
[0359] In some embodiments, provided is a D-xylulokinase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the FGGY carbohydrate kinase domain of the D-xylulokinase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for FGGY carbohydrate kinase domains are known in the art. In the case of the D-xylulokinase of SEQ ID NO: 74, the FGGY carbohydrate kinase domain includes at least amino acids 12-312 or 11-313.
[0360] In some embodiments, provided is a D-xylulokinase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the N-terminus and the FGGY carbohydrate kinase domain of the D-xylulokinase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the FGGY carbohydrate kinase domain are described hereinabove.
[0361] In some embodiments, provided is an L-arabinitol 4-dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the alcohol dehydrogenase GroES-like domain of the L-arabinitol 4-dehydrogenase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase GroES-like domains are known in the art. In the case of the L-arabinitol 4-dehydrogenase of SEQ ID NO: 98, the alcohol dehydrogenase GroES-like domain includes at least amino acids 54-163 or 53-164.
[0362] In some embodiments, provided is an L-arabinitol 4-dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the alcohol dehydrogenase zinc binding domain of the L-arabinitol 4-dehydrogenase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for alcohol dehydrogenase zinc binding domains are known in the art. In the case of the L-arabinitol 4-dehydrogenase of SEQ ID NO: 98, the alcohol dehydrogenase zinc binding domain includes at least amino acids 191-365 or 192-366.
[0363] In some embodiments, provided is an L-arabinitol 4-dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the N-terminus and the alcohol dehydrogenase GroES-like domain of the L-arabinitol 4-dehydrogenase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase GroES-like domain are described hereinabove.
[0364] In some embodiments, provided is an L-arabinitol 4-dehydrogenase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the alcohol dehydrogenase GroES-like domain and the alcohol dehydrogenase zinc binding domain of the L-arabinitol 4-dehydrogenase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the alcohol dehydrogenase zinc binding domain are described hereinabove.
[0365] In some embodiments, provided is an L-xylulose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the short-chain dehydrogenase/reductase domain of the L-xylulose reductase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for short-chain dehydrogenase/reductase domains are known in the art. In the case of the L-xylulose reductase of SEQ ID NO: 122, the short-chain dehydrogenase/reductase domain includes at least amino acids 13-194 or 8-267.
[0366] In some embodiments, provided is an L-xylulose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the N-terminus and the short-chain dehydrogenase/reductase domain of the L-xylulose reductase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the short-chain dehydrogenase/reductase domain are described hereinabove.
[0367] In some embodiments, provided is an L-xylulose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the short-chain dehydrogenase/reductase domain of the L-xylulose reductase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for short-chain dehydrogenase/reductase domains are known in the art. In the case of the L-xylulose reductase of SEQ ID NO: 146, the short-chain dehydrogenase/reductase domain includes at least amino acids 20-193 or 10-261.
[0368] In some embodiments, provided is a xylose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the xylose isomerase type TIM barrel domain of the xylose isomerase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for xylose isomerase type TIM barrel domains are known in the art. In the case of the xylose isomerase of SEQ ID NO: 170, the xylose isomerase type TIM barrel domain includes at least amino acids 77-285 or 76-286.
[0369] In some embodiments, provided is a xylose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs that are present in the wild-type nucleotide sequence and that encode the region between the N-terminus and the xylose isomerase type TIM barrel domain of the xylose isomerase have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the xylose isomerase type TIM barrel domain are described hereinabove.
[0370] In some embodiments, provided is a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for arabinose isomerase domains are known in the art. In the case of the L-arabinose isomerase of SEQ ID NO: 194, the arabinose isomerase domain includes at least amino acids 9-471 or 8-472.
[0371] In some embodiments, provided is a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the arabinose isomerase domain are described hereinabove.
[0372] In some embodiments, provided is a L-ribulokinase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the carbohydrate kinase domain of the L-ribulokinase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for carbohydrate kinase domains are known in the art. In the case of the L-ribulokinase of SEQ ID NO: 218, the carbohydrate kinase domain includes at least amino acids 59-549 or 60-548.
[0373] In some embodiments, provided is a L-ribulokinase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the carbohydrate kinase domain of the L-ribulokinase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the carbohydrate kinase domain are described hereinabove.
[0374] In some embodiments, provided is a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the aldolase domain of the L-ribulose-5-P 4-epimerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for aldolase domains are known in the art. In the case of the L-ribulose-5-P 4-epimerase of SEQ ID NO: 242, the aldolase domain includes at least amino acids 7-218 or 8-217.
[0375] In some embodiments, provided is a L-ribulose-5-P 4-epimerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the aldolase domain of the L-ribulose-5-P 4-epimerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the aldolase domain are described hereinabove.
[0376] In some embodiments, provided is a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the aldo/keto reductase domain of the xylose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for aldo/keto reductase domains are known in the art. In the case of the xylose reductase of SEQ ID NO: 266, the aldo/keto reductase domain includes at least amino acids 10-305 or 9-306.
[0377] In some embodiments, provided is a xylose reductase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the aldo/keto reductase domain of the xylose reductase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the aldo/keto reductase domain are described hereinabove.
[0378] In some embodiments, provided is a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for arabinose isomerase domains are known in the art. In the case of the L-arabinose isomerase of SEQ ID NO: 290, the arabinose isomerase domain includes at least amino acids 7-487.
[0379] In some embodiments, provided is a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the arabinose isomerase domain are described hereinabove.
[0380] In some embodiments, provided is a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for arabinose isomerase domains are known in the art. In the case of the L-arabinose isomerase of SEQ ID NO: 302, the arabinose isomerase domain includes at least amino acids 9-483.
[0381] In some embodiments, provided is a L-arabinose isomerase-encoding nucleotide sequence adapted for expression in a heterologous host organism, wherein at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more codon pairs present in wild-type nucleotide sequence and which encode the region between the N-terminus and the arabinose isomerase domain of the L-arabinose isomerase, have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof. The conserved amino acid sequence pattern and domain boundaries for the arabinose isomerase domain are described hereinabove.
[0382] Thus, provided herein are methods for redesigning the polypeptide-encoding nucleotide sequence provided herein to modify the translational kinetics of the polypeptide-encoding nucleotide sequence, where the polypeptide-encoding nucleotide sequence is altered such that one or more codon pairs have a decreased likelihood of causing a translational pause or slowing relative to the unaltered polypeptide-encoding nucleotide sequence. For example, one or more nucleotides of a polypeptide-encoding nucleotide sequence can be changed such that a codon pair containing the changed nucleotides has a translational kinetics value indicative of a decreased likelihood of causing a translational pause or slowing relative to the unchanged polypeptide-encoding nucleotide sequence.
[0383] While it will be understood by those of skill in the art that a redesigned polypeptide-encoding nucleotide sequence need not possess a high degree of identity to the polypeptide-encoding nucleotide sequence of the original gene, in some embodiments, the redesigned polypeptide-encoding nucleotide sequence will have at least 50%, 60%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% nucleotide identity with the polypeptide-encoding nucleotide sequence of the original gene. As used herein, an original gene refers to a gene for which codon pair refinement is to be performed; such original genes can be, for example, wild type genes, native genes, naturally occurring mutant genes, other mutant genes such as site-directed mutant genes or engineered or completely synthetic genes. In other embodiments, the polynucleotide sequence will be completely synthetic, and will bear much lower identity with the original gene, e.g., no more than 90%, 80%, 70%, 60%, 50%, 40%, or lower.
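By way of non-limiting illustration, the nucleotide identity between an original gene and a redesigned sequence can be estimated as sketched below. The sketch assumes that the two sequences have equal length and are already aligned position by position, as is the case when only synonymous codon substitutions are made; the function name `percent_identity` is hypothetical and no alignment is performed.

```python
def percent_identity(original: str, redesigned: str) -> float:
    """Percent of positions carrying the same nucleotide.

    Assumption: both sequences have equal length and are aligned
    position by position (no gaps, no alignment step).
    """
    if len(original) != len(redesigned):
        raise ValueError("sequences must have the same length")
    if not original:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(original.upper(), redesigned.upper()))
    return 100.0 * matches / len(original)


# Two synonymous substitutions in a nine-nucleotide toy sequence.
print(percent_identity("ATGGCTAAA", "ATGGCCAAG"))
```

Sequences differing in length, for example after an amino acid insertion or deletion, would first require an alignment step, which is outside the scope of this sketch.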
[0384] Because of the redundancy of the triplet genetic code it is possible to preserve amino acid sequence coding while redesigning the polypeptide-encoding gene nucleotide sequence. Polypeptide-encoding nucleotide sequences can be redesigned to be convenient to work with and specifically tailored to a particular host and vector system of choice. The resulting sequence can be designed to: (1) reduce or eliminate translational problems caused by inappropriate ribosome pausing, such as those caused by over-represented codon pairs or other codon pairs with translational values predictive of a translational pause; (2) have codon usage refined to avoid over-reliance on rare codons; (3) reduce in number or remove particular restriction sites, splice sites, internal Shine-Dalgarno sequences, or other sites that may cause problems in cloning or in interactions with the host organism; or (4) have controlled RNA secondary structure to avoid detrimental translational termination effects, translation initiation effects, or RNA processing, which can arise from, for example, RNA self-hybridization. When a synthetic polypeptide-encoding nucleotide sequence is to be used, this sequence also can be designed to avoid oligonucleotides that mis-hybridize, resulting in genes that can be assembled from refined oligonucleotides that by thermodynamic necessity only pair up in the desired manner, using methods known in the art, as exemplified in U.S. Patent Publication No. 2005/0106590, which is hereby incorporated by reference in its entirety.
[0385] In some instances, it is not possible to modify the polypeptide-encoding nucleotide sequence to suitably modify the translational kinetics of the mRNA into polypeptide without modifying the amino acid sequence of the encoded polypeptide. In such instances, an amino acid insertion, deletion or mutation can be introduced to yield a codon pair that is not predicted to cause a translational pause or slowing; or no change is made. In methods in which an amino acid insertion, deletion or mutation is made in order to change translational kinetics, the change is preferably predicted to not substantially influence the final three-dimensional structure of the protein and/or the activity of the protein. Such non-identical polypeptides can vary by containing one or more insertions, deletions and/or mutations. Although the nature and degree of change to the polypeptide sequence can vary according to the purpose of the change, typically such a change results in a polypeptide that is at least 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identical to the wild type polypeptide sequence.
[0386] In some embodiments, the sequence of the polynucleotide can be generated, optionally in conjunction with optimization of a plurality of parameters where one such parameter can be codon pair usage, where the resultant polynucleotide can be prepared by assembly of a plurality of oligonucleotides sufficiently small to be synthesized by known oligonucleotide synthetic methods. Methods known in the art for optimizing multiple parameters in synthetic nucleotide sequences can be applied to optimizing the parameters recited in the present claims. Such methods may advantageously include those exemplified in U.S. Patent App. Publication No. 2005/0106590, U.S. Patent App. Publication No. 2007/0009928, and R. H. Lathrop et al. "Multi-Queue Branch-and-Bound Algorithm for Anytime Optimal Search with Biological Applications" in Proc. Intl. Conf. on Genome Informatics, Tokyo, Dec. 17-19, 2001 pp. 73-82; in Genome Informatics 2001 (Genome Informatics Series No. 12), Universal Academy Press, which are incorporated herein by reference in their entireties. Briefly, in addition to optimizing the various parameters, an exemplary method for generating a sequence can also include dividing the desired sequence into a plurality of partially overlapping segments; optimizing the melting temperatures of the overlapping regions of each segment to disfavor hybridization to the overlapping segments which are non-adjacent in the desired sequence; allowing the overlapping regions of single stranded segments which are adjacent to one another in the desired sequence to hybridize to one another under conditions which disfavor hybridization of non-adjacent segments; and filling in, ligating, or repairing the gaps between the overlapping regions, thereby forming a double-stranded DNA with the desired sequence. This process can be performed manually or can be automated, e.g., in a general purpose digital computer.
In one embodiment, the search of possible codon assignments is mapped into an anytime branch and bound computerized algorithm developed for biological applications.
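By way of non-limiting illustration, the step of dividing a desired sequence into partially overlapping segments and estimating the melting temperature of each overlap can be sketched as follows. The sketch is a simplification under stated assumptions: all segments are taken from a single strand (alternating strands are not modeled), the segment and overlap lengths are fixed, and the melting temperature is approximated by the Wallace rule (2 °C per A or T, 4 °C per G or C), which is only a rough estimate for short oligonucleotides and is not the method of the incorporated references. All function names are hypothetical.

```python
def split_into_overlapping_segments(seq, segment_len, overlap_len):
    """Cut seq into segments of segment_len that share overlap_len bases."""
    if not 0 < overlap_len < segment_len:
        raise ValueError("require 0 < overlap_len < segment_len")
    step = segment_len - overlap_len
    segments = []
    start = 0
    while True:
        segments.append(seq[start:start + segment_len])
        if start + segment_len >= len(seq):
            break  # the last segment reaches the end of the sequence
        start += step
    return segments


def wallace_tm(oligo):
    """Rough melting temperature in degrees C (Wallace rule; an assumption)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def overlap_tms(segments, overlap_len):
    """Estimated Tm of the overlap between each pair of adjacent segments."""
    return [wallace_tm(seg[-overlap_len:]) for seg in segments[:-1]]


segments = split_into_overlapping_segments("ATGGCTAAAGGTCCTGAAAC", 8, 4)
print(segments)
print(overlap_tms(segments, 4))
```

A design method of the kind described above would compare such overlap melting temperatures against those of unintended pairings between non-adjacent segments, and adjust synonymous codon choices or segment boundaries to widen the gap between them.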
[0387] Accordingly, provided herein are methods of designing a synthetic nucleotide sequence for the polynucleotides provided herein, where the synthetic nucleotide sequence also is typically designed to have desirable translational kinetics properties, such as the removal of some or all codon pairs predicted to result in a translational pause or slowing. Such design methods include determining a set of partially overlapping segments with optimized melting temperatures, and determining the translational kinetics of the synthetic sequence, where if it is desired to change the translational kinetics of the synthetic gene, the sequences of the overlapping segments are modified and refined in order to approximate the desired translational kinetics while still possessing acceptable hybridization properties. In some embodiments, this process is performed iteratively. In some embodiments, a criterion is established for selecting codon pairs having high translational kinetics values to be replaced with codon pairs having lower translational kinetics values unless a codon pair of this group is the site of a planned pause. For example, the top 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% of codon pairs ranked by translational kinetics values can be replaced by codon pairs having lower translational kinetics values, such as translational kinetics value below a user defined level that can be, for example, a translational kinetics value equal to or below the translational kinetics values of codon pairs not in the top selected percentage, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced).
In another example, all codon pairs above a user-selected translational kinetics value, such as more than 5, 4.5, 4, 3.5, 3, 2.5 or 2 standard deviations above the mean translational kinetics value, can be replaced by codon pairs having lower translational kinetics values, such as translational kinetics value below a user defined level that can be, for example, a translational kinetics value that is 4, 3.5, 3, 2.5, 2, 1.5 or 1 standard deviations less than the mean translational kinetics value, unless a codon pair of this group is the site of a planned pause (in which case it is not necessarily replaced). Further synthetic nucleotide sequence refinement methods can be employed where additional properties of the synthetic nucleotide sequence can be refined in addition to hybridization and codon pair usage properties, where such properties can include, for example, codon usage, reduced number of restriction sites or Shine-Dalgarno sequences, or reduced detrimental RNA secondary structure, as described above.
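By way of non-limiting illustration, the two selection criteria described above (a top percentage of codon pairs ranked by translational kinetics value, and codon pairs lying more than a chosen number of standard deviations above the mean) can be sketched as follows. The sketch assumes that the translational kinetics values of the codon pairs along the sequence are supplied as a list, that the host mean and standard deviation are supplied by the user, and that planned pause sites are given as a set of list indices; the function names and numerical values are hypothetical.

```python
import math


def flag_by_sd(values, host_mean, host_sd, threshold_sd, planned_pauses=frozenset()):
    """Indices of codon pairs more than threshold_sd SDs above the host mean."""
    return [
        i
        for i, v in enumerate(values)
        if (v - host_mean) / host_sd > threshold_sd and i not in planned_pauses
    ]


def flag_top_percent(values, percent, planned_pauses=frozenset()):
    """Indices of the top `percent` of codon pairs ranked by value."""
    n = math.ceil(len(values) * percent / 100)
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    # Planned pause sites are kept even if they rank in the top group.
    return sorted(i for i in ranked[:n] if i not in planned_pauses)


# Hypothetical translational kinetics values for ten consecutive codon pairs.
values = [0.1, 3.2, -0.5, 2.6, 0.0, 1.0, -1.2, 4.1, 0.3, 0.2]
print(flag_by_sd(values, host_mean=0.0, host_sd=1.0, threshold_sd=2.5))
print(flag_by_sd(values, 0.0, 1.0, 2.5, planned_pauses={3}))
print(flag_top_percent(values, percent=20))
```

The flagged indices identify candidates for replacement only; choosing the replacement codon pairs, and confirming that their values fall below the user defined level, is a separate step.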
[0388] Those skilled in the art will recognize that various optimization methods can be used, e.g., simulated annealing, genetic algorithms, branch and bound techniques, hill-climbing, Monte Carlo methods, other search strategies, and the like. Thus, the methods provided herein for designing the polynucleotide sequences provided herein, that include optimization of a plurality of parameters, where one such parameter is codon pair usage, can be implemented by applying those parameters to art-recognized algorithms or techniques. Advantageously, sequence design is performed using an optimization method that designs a synthetic nucleotide sequence encoding the polypeptide to be expressed.
[0389] The polynucleotide sequence design methods provided herein can be employed where a plurality of properties of the polynucleotide sequences can be refined in addition to codon pair usage properties, where such properties can include, but are not limited to, melting temperature gap between oligonucleotides of synthetic gene, average codon usage, average codon pair chi-squared (e.g., z score), worst codon usage, worst codon pair (e.g., z score), maximum usage in adjacent codons, Shine-Dalgarno sequence (for E. coli expression), occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, user-prohibited sequences (e.g., other restriction sites), codon usage of a specific codon above user-specified limit, and out-of-frame stop codons (framecatchers). In embodiments that include expression in a eukaryotic host organism, additional properties that can be considered in a process of designing a polynucleotide sequence include, but are not limited to, occurrences of RNA splice sites, occurrences of polyA sites, and occurrence of ribosome binding sequence. For example, a process of designing a polynucleotide sequence can include constraints including, but not limited to, minimum melting temperature gap between oligonucleotides of synthetic gene, minimum average codon usage, maximum average codon pair chi-squared (z score), minimum absolute codon usage, maximum absolute codon pair (z score), minimum maximum usage in adjacent codons, no Shine-Dalgarno sequence (for E. coli expression), no occurrences of 5 consecutive G's or 5 consecutive C's, no occurrences of 6 consecutive A's or 6 consecutive T's, no long exactly repeated subsequences, no cloning restriction sites, no user-prohibited sequences (e.g., other restriction sites), and optionally no codon usage of a specific codon above user-specified limit.
In embodiments that include expression in a eukaryotic host organism, additional constraints can include, but are not limited to, minimum occurrences of RNA splice sites, minimum occurrences of polyA sites, and occurrence of ribosome binding sequence. A process of designing a polynucleotide sequence can include preferences including, but not limited to, prefer high average codon usage, prefer low average codon pair chi-squared, prefer larger melting temperature gap, prefer more out of frame stop codons (framecatchers), and optionally prefer evenly distributed codon usage. Any of a variety of nucleotide sequence refinement/optimization methods known in the art can be used to refine the polynucleotide sequence according to the codon pair usage properties, and according to any of the additional properties specifically described above, or other properties that are refined in nucleotide sequence redesign methods known in the art. In some embodiments, a branch and bound method is employed to refine the polynucleotide sequence according to codon pair usage properties and at least one additional property, such as codon usage.
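By way of non-limiting illustration, two of the constraints listed above (no runs of 5 consecutive G's or C's or 6 consecutive A's or T's, and no user-prohibited sequences) can be checked as sketched below. The function name `find_violations` is hypothetical, and the EcoRI recognition site GAATTC is used only as an example of a user-prohibited sequence.

```python
import re


def find_violations(seq, prohibited_sites=()):
    """Return (kind, position, matched text) for each constraint violation."""
    seq = seq.upper()
    violations = []
    # Runs of 5 or more G's or C's, or 6 or more A's or T's.
    for match in re.finditer(r"G{5,}|C{5,}|A{6,}|T{6,}", seq):
        violations.append(("homopolymer", match.start(), match.group()))
    # Every occurrence (including overlapping ones) of each prohibited site.
    for site in prohibited_sites:
        site = site.upper()
        start = seq.find(site)
        while start != -1:
            violations.append(("prohibited", start, site))
            start = seq.find(site, start + 1)
    return violations


for violation in find_violations("ATGGGGGGCTGAATTCAAAAAAT", ["GAATTC"]):
    print(violation)
```

In a multi-property design process, each reported violation could either exclude a candidate sequence outright (a constraint) or add to its cost (a preference), depending on how the property is treated.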
[0390] In some embodiments, the methods provided herein can further include analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that stop codons are added to at least one said frame shift. In additional embodiments, the generating step further includes analyzing at least a portion of the candidate polynucleotide sequence in frame shift, and selecting codons for the candidate polynucleotide sequence such that one or more stop codons in one, two or three reading frames are added downstream of polypeptide-encoding region of the nucleotide sequence.
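By way of non-limiting illustration, the analysis of a candidate sequence in frame shift can be sketched as a count of stop codons in the two shifted reading frames of the coding strand. The sketch assumes the standard stop codons TAA, TAG and TGA and that the sequence begins at the first nucleotide of the intended reading frame; the function name `out_of_frame_stops` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def out_of_frame_stops(seq):
    """Count stop codons in the +1 and +2 shifted frames of the coding strand."""
    seq = seq.upper()
    counts = {}
    for shift in (1, 2):
        codons = (seq[i:i + 3] for i in range(shift, len(seq) - 2, 3))
        counts[shift] = sum(codon in STOP_CODONS for codon in codons)
    return counts


# Frame 0 reads ATG CTG AAA TCG; the +1 frame reads TGC TGA AAT CGC.
print(out_of_frame_stops("ATGCTGAAATCGCC"))
```

A design process that prefers more out-of-frame stop codons could compare these counts between synonymous candidates and favor the candidate with the higher count.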
[0391] In some embodiments, methods are provided for redesigning a polypeptide-encoding gene for expression in a host organism, by providing a data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide.
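By way of non-limiting illustration, generating a polynucleotide sequence for a desired polypeptide with reference to a codon pair data set can be sketched as follows. The sketch uses dynamic programming over synonymous codons to minimize the sum of translational kinetics values of adjacent codon pairs; this is a technique substituted here for brevity and is not asserted to be the search method of the embodiments described herein. The codon table is a deliberately tiny subset, and the pair values are hypothetical.

```python
def design_sequence(protein, codons_for, pair_value, default=0.0):
    """Pick one codon per residue so the summed codon pair values are minimal.

    codons_for: amino acid -> list of synonymous codons (user supplied).
    pair_value: (codon, next codon) -> translational kinetics value
                (hypothetical data; pairs missing from it score `default`).
    """
    if not protein:
        raise ValueError("protein must be non-empty")
    # best[c] = (lowest total so far, codon path) for paths ending in codon c
    best = {c: (0.0, [c]) for c in codons_for[protein[0]]}
    for aa in protein[1:]:
        nxt = {}
        for c in codons_for[aa]:
            score, path = min(
                ((s + pair_value.get((p, c), default), pth)
                 for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            nxt[c] = (score, path + [c])
        best = nxt
    score, path = min(best.values(), key=lambda item: item[0])
    return "".join(path), score


codons_for = {"M": ["ATG"], "K": ["AAA", "AAG"], "L": ["CTG", "TTA"]}
pair_value = {
    ("ATG", "AAA"): 3.0, ("ATG", "AAG"): 0.5,
    ("AAA", "CTG"): -1.0, ("AAA", "TTA"): 0.0,
    ("AAG", "CTG"): 2.5, ("AAG", "TTA"): -0.5,
}
print(design_sequence("MKL", codons_for, pair_value))
```

Minimizing the sum is only one possible objective; an objective that instead penalizes only those pairs exceeding a designated threshold would follow the same structure with a different per-pair cost.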
[0392] Also provided herein are methods for redesigning a polypeptide-encoding gene for expression in a host organism, by providing a first data set representative of codon pair translational kinetics for the host organism which includes translational kinetics values of the codon pairs utilized by the host organism, providing a second data set representative of at least one additional desired property of the synthetic gene, providing a desired polypeptide sequence for expression in the host organism, and generating a polynucleotide sequence encoding the polypeptide sequence by analyzing candidate nucleotides to select, where possible, both (i) codon pairs that are predicted not to cause a translational pause in the host organism, with reference to the first data set, and (ii) nucleotides that provide a desired property, with reference to the second data set, thereby providing a candidate polynucleotide sequence encoding the desired polypeptide. In some embodiments, a branch and bound method is employed to refine the polypeptide-encoding nucleotide sequence according to codon pair usage properties of the first data set and according to the properties of the second data set. In some embodiments, the second data set contains codon preferences representative of codon usage by the host organism, including the most common codons used by the host organism for a given amino acid.
[0393] Accordingly, provided herein is a sugar catabolic enzyme-encoding DNA sequence, wherein the encoded sequence has at least a 50%, 60%, 70%, 75%, 80%, 85%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity to the wild type sugar catabolic enzyme polypeptide sequence as set forth in the sequence listing. In certain aspects of the above embodiments, the polynucleotide provided herein is adapted for expression in a heterologous host organism. A heterologous host organism is an organism used to express DNA, RNA or protein that is foreign to the host organism. In certain aspects, the host organism is not human, E. coli or S. cerevisiae.
[0394] In certain aspects of the above embodiments, at least 1, 2 or 3 codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein. In selected embodiments, the at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism are highly-overrepresented codon pairs therein and have been replaced with codon pairs that are not highly-overrepresented therein. As described further below, a highly-overrepresented codon pair is a codon pair that has a translational kinetics value greater than a designated threshold, wherein a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value.
[0395] Also provided herein is a sugar catabolic enzyme-encoding DNA sequence, having at least a 75% sequence identity with an original sugar catabolic enzyme polypeptide sequence as set forth in the sequence listing and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the original sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organisms are selected from the following: Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster and Schizosaccharomyces pombe.
[0396] Thus, the methods provided herein can include analyzing the candidate polynucleotide sequence to confirm that no codon pairs are predicted to cause a translational pause in the host organism by more than a designated threshold. As described elsewhere herein, the likelihood that a particular codon pair will cause translational pausing or slowing in an organism (or the relative predicted magnitude thereof) can be represented by a translational kinetics value. The translational kinetics value can be expressed in any of a variety of manners in accordance with the guidance provided herein. In one example, a translational kinetics value can be expressed in terms of the mean translational kinetics value and the corresponding standard deviation for all codon pairs in an organism. For example, the translational kinetics value for a particular codon pair can be expressed in terms of the number of standard deviations that separate the translational kinetics value of the codon pair from the mean translational kinetics value. In methods that include analyzing the candidate polynucleotide sequence to confirm that no codon pairs are predicted to cause a translational pause in the host organism by more than a designated threshold, a threshold value can be at least 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 3, 3.5, 4, 4.5 or 5 or more standard deviations above the mean translational kinetics value. Although such a method is described in terms of a binary scoring of a codon pair as either at least or less than the threshold value, one skilled in the art, in view of the teachings herein, will recognize that multiple thresholds can be used, or methods can be used that weight a codon pair along a continuum according to the translational kinetics value, based on the teachings provided herein and the general knowledge in the art.
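By way of non-limiting illustration, expressing a translational kinetics value as a number of standard deviations from the mean, and weighting a codon pair along a continuum rather than by binary scoring, can be sketched as follows. The sketch assumes that a table of raw values for all codon pairs of the host organism is available and uses the population standard deviation; the three-entry table, the function names, and the linear form of the penalty are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(pair_values):
    """Express each raw value as standard deviations from the table mean."""
    mu = mean(pair_values.values())
    sd = pstdev(pair_values.values())
    if sd == 0:
        raise ValueError("all values are identical; z scores are undefined")
    return {pair: (value - mu) / sd for pair, value in pair_values.items()}


def pause_penalty(z, threshold=2.0):
    """Continuous weighting: zero up to the threshold, then growing linearly."""
    return max(0.0, z - threshold)


# Toy table; a real table would cover all codon pairs of the host organism.
table = {("AAA", "CTG"): 1.0, ("AAG", "CTG"): 2.0, ("AAG", "TTA"): 3.0}
print(z_scores(table))
print(pause_penalty(3.5), pause_penalty(1.0))
```

Binary scoring corresponds to testing whether the z score is at least the threshold; the linear penalty shown is one of many possible continuous weightings.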
[0397] In some embodiments, in addition to generating a candidate nucleotide sequence according to codon pair usage properties, the methods provided herein also include generating a candidate nucleotide sequence according to codon usage. As is known in the art, different organisms can have different preference for the three-nucleotide codon sequence encoding a particular amino acid. As a result, translation can often be improved by using the most common three-nucleotide codon sequence encoding a particular amino acid. Thus, some methods provided herein also include generating a candidate nucleotide sequence such that codon utilization is non-randomly biased in favor of codons most commonly used by the host organism. Codon usage preferences are known in the art for a variety of organisms and methods for selecting the more commonly used codons are well known in the art.
[0398] In some embodiments, the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize the predicted translational kinetics. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, the conflict is resolved by selecting the nucleotide sequence predicted to be translated more rapidly, for example, due to fewer predicted translational pauses. In some embodiments, the methods of redesigning a polypeptide-encoding nucleotide sequence are based on a plurality of properties, where a conflict in the preferred nucleotide sequence arising from the plurality of properties is determined in order to optimize codon pair usage preferences. That is, when the plurality of properties being optimized would lead to more than one possible nucleotide sequence depending on which property is to be accorded more weight, typically, codon pair usage will be accorded more weight in order to resolve the conflict between the more than one possible nucleotide sequences. In one example, the methods provided herein can include identifying at least one instance of a conflict between selecting common codons and avoiding codon pairs predicted to cause a translational pause; in such instances, the conflict is resolved in favor of avoiding codon pairs predicted to cause a translational pause.
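By way of non-limiting illustration, resolving a conflict in favor of avoiding predicted translational pauses can be sketched as a lexicographic comparison of candidate sequences: candidates are compared first by their number of predicted pause pairs, and common-codon usage is considered only to break ties. The candidate records, their field names, and their values are hypothetical.

```python
def resolve_conflict(candidates):
    """Prefer fewer predicted pause pairs; break ties by common-codon usage."""
    return min(
        candidates,
        key=lambda c: (c["pause_pairs"], -c["common_codon_fraction"]),
    )


# Hypothetical candidates encoding the same polypeptide.
candidates = [
    {"seq": "ATGAAACTG", "pause_pairs": 1, "common_codon_fraction": 1.00},
    {"seq": "ATGAAGTTA", "pause_pairs": 0, "common_codon_fraction": 0.67},
    {"seq": "ATGAAGCTT", "pause_pairs": 0, "common_codon_fraction": 0.33},
]
print(resolve_conflict(candidates)["seq"])
```

A weighted sum of the two properties, with the larger weight on codon pair usage, would be an alternative way to accord codon pair usage more weight without making it strictly dominant.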
[0399] Some embodiments provided herein include generating a candidate polynucleotide sequence encoding the polypeptide sequence, the candidate polynucleotide sequence having a non-random codon pair usage, such that the codon pairs encoding any particular pair of amino acids have the lowest translational kinetics values. In some embodiments, the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the encoded amino acid sequence is not altered. In some embodiments, the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that the three dimensional structure of the encoded polypeptide is not substantially altered. In some embodiments, the candidate polynucleotide sequence encoding the polypeptide sequence is generated and/or altered such that no more than conservative amino acid changes are made to the encoded polypeptide.
[0400] The methods provided herein can further include a step of refining or altering the candidate polynucleotide sequence in accordance with a second nucleotide sequence property to be refined. For example, in embodiments in which codon usage is also refined, the methods further include generating or refining a candidate polynucleotide sequence encoding a polypeptide sequence such that the candidate polynucleotide sequence has a non-random codon usage, where the most common codons used by the host organism are over-represented in the candidate polynucleotide sequence.
The methods can include refining or altering the candidate polynucleotide sequence in accordance with any of a variety of additional properties provided herein, including but not limited to, melting temperature gap between oligonucleotides of a synthetic gene, Shine-Dalgarno sequence, occurrences of 5 consecutive G's or 5 consecutive C's, occurrences of 6 consecutive A's or 6 consecutive T's, long exactly repeated subsequences, cloning restriction sites, or any other user-prohibited sequences. Further, any of a variety of combinations of these properties can be additionally included in the nucleotide sequence refinement methods provided herein.
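By way of illustration only, a screen for some of the sequence features listed above might be sketched as follows. The function name find_prohibited, the pattern labels, and the extra_sites parameter are hypothetical and are not taken from the methods described herein; only the run lengths (5 consecutive G's or C's, 6 consecutive A's or T's) follow the text.

```python
import re

# Hypothetical screening patterns; the run lengths follow the text above.
PROHIBITED_PATTERNS = {
    "5 consecutive G": re.compile(r"G{5}"),
    "5 consecutive C": re.compile(r"C{5}"),
    "6 consecutive A": re.compile(r"A{6}"),
    "6 consecutive T": re.compile(r"T{6}"),
}


def find_prohibited(seq, extra_sites=()):
    """Return sorted (label, start) tuples for prohibited motifs in seq.

    extra_sites is an assumed hook for user-prohibited sequences such as
    cloning restriction sites, given as plain DNA strings.
    """
    seq = seq.upper()
    hits = []
    # Homopolymer runs (non-overlapping matches).
    for label, pattern in PROHIBITED_PATTERNS.items():
        for match in pattern.finditer(seq):
            hits.append((label, match.start()))
    # User-supplied literal sites (overlapping matches allowed).
    for site in extra_sites:
        site = site.upper()
        start = seq.find(site)
        while start != -1:
            hits.append((site, start))
            start = seq.find(site, start + 1)
    return sorted(hits, key=lambda hit: hit[1])
```

A candidate sequence returning an empty list would pass this particular screen; other properties, such as melting temperature gap or long repeated subsequences, are not covered by this sketch.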
[0401] The method provided herein can further include an evaluation step in which after the candidate polynucleotide sequence is altered, the sequence is compared with at least a portion of a data set of a property against which the sequence was refined. In such methods, it is possible to compare the candidate sequence to the data set in order to determine whether or not the candidate sequence possesses the desired or acceptable properties with respect to the data set. For example, subsequent to a round of nucleotide sequence refinement, it can be evaluated whether or not the codon pairs of the candidate sequence have acceptable translational kinetics values. If the values are deemed to be acceptable or desired, no further sequence alteration is required with respect to the property. In view of the methods provided herein which can be directed to the refinement or optimization of a plurality of properties, the candidate nucleotide sequence can be compared to each property considered in the refinement, and, if the values for all properties are deemed to be acceptable or desired, no further sequence alteration is required. If the values for fewer than all properties are deemed to be acceptable or desired, the candidate nucleotide sequence can be subjected to further sequence alteration and evaluation.
[0402] Thus, it is contemplated herein that the sequence alteration steps of methods provided herein can be performed iteratively. That is, one or more steps of altering the nucleotide sequence can be performed, and the candidate nucleotide sequence can be evaluated to determine whether or not further sequence alteration is necessary and/or desirable. These steps can be repeated until values for all properties are deemed to be acceptable or desired, or until no further improvement can be achieved.
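The iterative alteration and evaluation described above can be pictured, purely as a sketch, with the loop below. The names refine, checks, alter and max_rounds are hypothetical; the acceptance tests and the alteration step are supplied by the caller and are not specified here.

```python
def refine(sequence, checks, alter, max_rounds=100):
    """Alternate evaluation and alteration until all checks pass.

    checks: dict mapping a property name to a function(sequence) -> bool
            that reports whether the property is acceptable (assumed form).
    alter:  function(sequence, failing_names) -> new sequence (assumed form).
    Returns (sequence, all_acceptable).
    """
    for _ in range(max_rounds):
        # Evaluation step: compare the candidate against every property.
        failing = [name for name, ok in checks.items() if not ok(sequence)]
        if not failing:
            return sequence, True
        # Alteration step: ask for a new candidate addressing the failures.
        candidate = alter(sequence, failing)
        if candidate == sequence:
            # No further improvement can be achieved.
            return sequence, False
        sequence = candidate
    return sequence, False
```

The loop stops either when every property is acceptable or when the alteration step returns an unchanged sequence, mirroring the two stopping conditions stated above; the max_rounds limit is an added safeguard and is an assumption of this sketch.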
Determination of translational kinetics values for codon pairs
[0403] The methods and sequences provided herein include determination and use of translational kinetics values for codon pairs. As provided herein, such a translational kinetics value can be calculated and/or empirically measured, and the final translational kinetics value used in graphical displays and methods of predicting translational kinetics can be a refined value resultant from two or more types of codon pair translational kinetics information. The various types of codon pair translational kinetics information that can be used in refining or replacing a translational kinetics value for a codon pair include, for example, values of observed versus expected codon pair frequencies in a particular organism, normalized values of observed versus expected codon pair frequencies in a particular organism, the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species, the degree to which observed versus expected codon pair frequency values are conserved at predicted pause sites such as boundaries between autonomous folding units in related proteins across two or more species, the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, and empirical measurement of translational kinetics for a codon pair.
[0404] The values of observed versus expected codon pair frequencies in a host organism can be determined by any of a variety of methods known in the art for statistically evaluating observed occurrences relative to expected occurrences. Regardless of the statistical method used, this typically involves obtaining codon sequence data for the organism, for example, on a gene-by-gene basis. In some embodiments, the analysis is focused only on the coding regions of the genome. Because the analysis is a statistical one, a large database is preferred. Initially, the total number of codons is determined and the number of times each of the 61 non-terminating codons appears is determined. From this information, the expected frequency of each of the 3721 ($61^2$) possible non-terminating codon pairs is calculated, typically by multiplying together the frequencies with which each of the component codons appears. This frequency analysis can be carried out on a global basis, analyzing all of the sequences in the database together; however, it is typically done on a local basis, analyzing each sequence individually. This will tend to minimize the statistical effect of an unusually high proportion of rare codons in a sequence. After the frequency data is obtained, for each sequence in the database, the expected number of occurrences of each codon pair is calculated by, for example, multiplying the expected frequency by the number of pairs in the sequence. This information can then be added to a global table, and each next succeeding sequence can be analyzed in like manner. This analysis results in a table of expected and observed values for each of the 3721 non-terminating codon pairs. The statistical significance of the variation between the expected and observed values can then be calculated, and the resulting information can be used in further practice of the various examples and embodiments provided herein.
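As a minimal sketch of the local (per-sequence) tabulation just described, and not a definitive implementation, the following code accumulates observed and expected codon pair counts into global tables. The function names codons_of and pair_tables are hypothetical. The sketch assumes each input is an in-frame coding sequence with at most one terminal stop codon and no internal stop codons.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_of(cds):
    """Split a coding sequence into codons, dropping a trailing stop codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    return codons


def pair_tables(sequences):
    """Accumulate observed and expected codon pair counts over sequences.

    Expected counts are computed per sequence from that sequence's own
    codon frequencies (the "local" basis) and summed into global tables.
    """
    observed = Counter()
    expected = Counter()
    for cds in sequences:
        codons = codons_of(cds)
        n_pairs = len(codons) - 1
        if n_pairs < 1:
            continue
        freq = Counter(codons)
        total = len(codons)
        # Observed: count each adjacent codon pair in the sequence.
        for a, b in zip(codons, codons[1:]):
            observed[(a, b)] += 1
        # Expected: product of component codon frequencies times pair count.
        for a, count_a in freq.items():
            for b, count_b in freq.items():
                expected[(a, b)] += (count_a / total) * (count_b / total) * n_pairs
    return observed, expected
```

Because the expected frequencies within a sequence sum to one, the expected counts contributed by each sequence sum to its number of codon pairs, which provides a simple consistency check on the tables.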
[0405] In some embodiments, the values of observed versus expected codon pair frequencies are chi-squared values, such as chi-squared 2 (chisq2) values or chi-squared 3 (chisq3) values. Methods for calculating chi-squared values can be performed according to any method known in the art, as exemplified in U.S. Patent No. 5,082,767, which is incorporated by reference herein in its entirety. The result of chi-squared calculations is a list of 3,721 non-terminating codon pairs, each with an expected and observed value, together with a value for chi-squared (chisq1): $\mathrm{chisq1} = (\mathrm{observed} - \mathrm{expected})^2 / \mathrm{expected}$
[0406] In order to remove the contribution to chi-squared of non-randomness in amino acid pairs, a new value chi-squared 2 (chisq2) can be calculated as follows. For each group of codon pairs encoding the same amino acid pair (i.e., 400 groups), the sums of the expected and observed values are tallied; any non-randomness in amino acid pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal. The new chi-squared, chisq2, is evaluated using these new expected values. Calculation methods for removing the contribution to chi-squared of non-randomness in amino acid pairs are known in the art, as exemplified in Gutman and Hatfield, Proc. Natl. Acad. Sci. USA. (1989) 86:3699-3703.
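The chi-squared calculation and the amino acid pair correction described above might be sketched as follows. This is an illustration under stated assumptions rather than the calculation of the cited references: the function names chisq, rescale_expected and chisq2 are hypothetical, the standard genetic code is assumed, and the expected table is assumed to list every codon pair of interest (observed pairs absent from the expected table are ignored).

```python
# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def chisq(observed, expected):
    """(observed - expected)^2 / expected for each pair with expected > 0."""
    return {
        pair: (observed.get(pair, 0) - e) ** 2 / e
        for pair, e in expected.items()
        if e > 0
    }


def rescale_expected(observed, expected, group_of):
    """Multiply expected values by [sum observed / sum expected] per group."""
    sum_obs, sum_exp = {}, {}
    for pair, e in expected.items():
        g = group_of(pair)
        sum_exp[g] = sum_exp.get(g, 0.0) + e
        sum_obs[g] = sum_obs.get(g, 0.0) + observed.get(pair, 0)
    return {
        pair: e * sum_obs[group_of(pair)] / sum_exp[group_of(pair)]
        for pair, e in expected.items()
        if sum_exp[group_of(pair)] > 0
    }


def amino_acid_pair(pair):
    """Group key: the amino acid pair encoded by a codon pair."""
    return CODON_TABLE[pair[0]], CODON_TABLE[pair[1]]


def chisq2(observed, expected):
    """Chi-squared after removing amino acid pair non-randomness."""
    return chisq(observed, rescale_expected(observed, expected, amino_acid_pair))
```

For example, if two Ala-Ala codon pairs each have an expected value of 10 but observed values of 30 and 10, the group factor is 40/20 = 2, both expected values become 20, and each pair receives a chisq2 of 5, whereas the uncorrected values would be 40 and 0.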
[0407] Further, in order to remove the contribution to chi-squared of non-randomness in dinucleotides, a new value chi-squared 3 (chisq3) can be calculated. Correction is made only for those dinucleotides formed between adjacent codon pairs; any bias of dinucleotides within codons (codon triplet positions I-II and II-III) will directly affect codon usage and is, therefore, automatically taken into account in the underlying calculations. For each dinucleotide pair formed between adjacent codon pairs (i.e., 16 pairs), the sums of the expected and observed values are tallied; any non-randomness in dinucleotide pairs is reflected in the difference between these two values. Therefore, each of the expected values within the group is multiplied by the factor [sum observed/sum expected], so that the sums of the expected and observed values within the group are equal. The new chi-squared, chisq3, is evaluated using these new expected values.
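A minimal sketch of the dinucleotide correction is given below. It groups codon pairs by the dinucleotide spanning the codon junction (position III of the first codon and position I of the second) and rescales the expected values within each of the 16 groups. The function names are hypothetical, and whether the input expected values have already been corrected for amino acid pair bias is left to the caller, since the text above does not fix that order.

```python
def junction_dinucleotide(pair):
    """Dinucleotide formed between adjacent codons: base III + base I."""
    first, second = pair
    return first[2] + second[0]


def chisq3(observed, expected):
    """Chi-squared after rescaling expected values per junction dinucleotide.

    observed, expected: dicts keyed by (codon, codon) tuples. Pairs missing
    from expected are ignored (an assumption of this sketch).
    """
    sum_obs, sum_exp = {}, {}
    for pair, e in expected.items():
        g = junction_dinucleotide(pair)
        sum_exp[g] = sum_exp.get(g, 0.0) + e
        sum_obs[g] = sum_obs.get(g, 0.0) + observed.get(pair, 0)
    result = {}
    for pair, e in expected.items():
        g = junction_dinucleotide(pair)
        if sum_exp[g] <= 0:
            continue
        # New expected value: scaled by [sum observed / sum expected].
        new_e = e * sum_obs[g] / sum_exp[g]
        if new_e > 0:
            result[pair] = (observed.get(pair, 0) - new_e) ** 2 / new_e
    return result
```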
[0408] As provided herein, and as will be readily apparent to those skilled in the statistical art, further values chi-squared N (chisqN) could be calculated similarly by removing one or more other variables in like fashion.
[0409] Analyses of the E. coli, S. cerevisiae, and human databases illustrate two important features. First, there is a highly significant codon pair bias in all three species, even after the amino acid nearest neighbor bias (chisq2) and the dinucleotide bias (chisq3) are discounted. Second, the effect associated with dinucleotide bias, i.e., the difference between chisq2 and chisq3, is much more pronounced in eukaryotes than in E. coli. It is by far the predominant effect in mammals, representing two thirds of the amount of chisq2 in excess of its expectation in human. Mouse and rat data exhibit a very similar pattern. Dinucleotide bias represents a smaller effect in yeast, and only a very minor one in E. coli. Although the predominant dinucleotide bias in human is the well-known CpG deficit, other dinucleotides are also very highly biased. For example, there is a deficit of TA, as well as an excess of TG, CA and CT. Overall, the deficit of CpG contributes only 35% of the total dinucleotide bias in the human database, and 17% in yeast.
[0410] As provided herein, the values of observed versus expected codon pair frequencies in a host organism can be normalized. Normalization permits different sets of values of observed versus expected codon pair frequencies to be compared by placing these values on the same numerical scale. For example, normalized codon pair frequency values can be compared between different organisms, or can be compared for different codon pair frequency value calculations within a particular organism (e.g., different calculations based on input sequence information or based on different calculations such as chisq1 or chisq2 or chisq3). Typically, normalization results in codon pair frequency values that are described in terms of their mean and standard deviation from the mean.
[0411] An exemplary method for normalizing codon pair frequency values is the calculation of z scores. The z score for an item indicates how far and in what direction that item deviates from its distribution's mean, expressed in units of its distribution's standard deviation. The mathematics of the z score transformation are such that if every item in a distribution is converted to its z score, the transformed scores will have a mean of zero and a standard deviation of one. The z score transformation can be especially useful when seeking to compare the relative standings of items from distributions with different means and/or different standard deviations. Z scores are especially informative when the distribution to which they refer is normal. In a normal distribution, the distance between the mean and a given z score cuts off a fixed proportion of the total area under the curve.
[0412] An exemplary method for determining z scores for codon pair chi-squared values is as follows: First, a list of all 3721 possible non-terminating codon pairs is generated. Second, for the i-th codon pair, the i-th chi-squared value is calculated, where the i-th chi-squared value is denoted $c_i$. The chi-squared value, $c_i$, is given the sign of (observed - expected), so that over-represented codon pairs are assigned a positive $c_i$ and under-represented codon pairs are assigned a negative $c_i$. The formula for $c_i$ is: $c_i = \operatorname{sgn}(\mathrm{obs}_i - \mathrm{exp}_i) \cdot (\mathrm{obs}_i - \mathrm{exp}_i)^2 / \mathrm{exp}_i$
[0413] Third, the mean chi-squared value is calculated, where the mean is denoted $m$. The formula for the mean is: $m = \left(\sum_i c_i\right) / 3721$, where $\sum_i$ means sum over $i$. Fourth, the standard deviation of the chi-squared values is calculated, where the standard deviation is denoted $s$. The formula for the standard deviation is: $s = \sqrt{\sum_i (c_i - m)^2 / 3721}$. Fifth, for the i-th chi-squared value $c_i$, a z score is calculated by subtracting the mean then dividing by the standard deviation, wherein the i-th z score is denoted $z_i$. The formula for the z score is: $z_i = (c_i - m) / s$
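The five steps above can be sketched as follows, with the caveat that this is an illustration and not a definitive implementation. The function names signed_chisq and z_scores are hypothetical. The sketch divides by the number of codon pairs actually present in the supplied table, which equals 3721 only when the table is complete, and it assumes the signed chi-squared values are not all identical (otherwise the standard deviation is zero).

```python
import math


def signed_chisq(obs, exp):
    """Chi-squared value carrying the sign of (observed - expected)."""
    diff = obs - exp
    sign = (diff > 0) - (diff < 0)
    return sign * diff ** 2 / exp


def z_scores(observed, expected):
    """Convert signed chi-squared values to z scores.

    observed, expected: dicts keyed by codon pair; pairs with a
    non-positive expected value are skipped.
    """
    c = {
        pair: signed_chisq(observed.get(pair, 0), e)
        for pair, e in expected.items()
        if e > 0
    }
    n = len(c)
    mean = sum(c.values()) / n
    # Population standard deviation (divide by n, as in the formula above).
    std = math.sqrt(sum((v - mean) ** 2 for v in c.values()) / n)
    return {pair: (v - mean) / std for pair, v in c.items()}
```

By construction the returned z scores have a mean of zero and a standard deviation of one, so over-represented pairs appear as positive values and under-represented pairs as negative values relative to the rest of the table.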
[0414] The above-described values of observed codon pair frequency versus expected codon pair frequency can be used as first approximations of translational kinetics of a polypeptide-encoding nucleotide sequence. However, such values are not true predictors of translational kinetics, and refinement of such values to more accurately predict translational kinetics can be performed according to the methods provided herein. Thus, provided herein are methods of refining the predictive capability of a translational kinetics value of a codon pair in a host organism by providing an initial translational kinetics value based on the value of observed codon pair frequency versus expected codon pair frequency for a codon pair in a host organism, providing additional translational kinetics data for the codon pair in the host organism, and modifying the initial translational kinetics value according to the additional codon pair translational kinetics data to generate a refined translational kinetics value for the codon pair in the host organism. The translational kinetics data that can be used to refine translational kinetics values and methods of modifying translational kinetics values according to such additional translational kinetics data to generate a refined translational kinetics value for a codon pair in a host organism are provided below.
[0415] In one embodiment, translational kinetics data that can be used to refine translational kinetics values are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair. Recurrence-based refinement of translational kinetics values is based on the investigation of multiple polypeptide-encoding nucleotide sequences to determine whether or not there are multiple occurrences of either codon pairs or predicted translational kinetics values in those sequences. Recurrence-based refinement of translational kinetics can be performed using any of a variety of known sequence comparison methods consistent with the examples provided herein. For purposes of exemplification, and not for limitation, the following example of recurrence-based refinement of translational kinetics is provided.
[0416] In one exemplary embodiment, the predicted translational kinetics value for a codon pair can be refined according to the degree to which observed versus expected codon pair frequency values are conserved in related proteins across two or more species. As provided herein, related proteins are proteins having homologous amino acid sequences and/or similar three dimensional structures. Related proteins having homologous amino acid sequences will typically have at least about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% sequence identity. Related proteins having similar three dimensional structures will typically share similar secondary structure topology and similar relative positioning of secondary structural elements; exemplary related proteins having three dimensional structures are members of the same SCOP-classified Family (see, e.g., Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.).
[0417] The observed versus expected codon pair frequency values for any given codon pair can vary from species to species. However, as provided herein, evolutionarily related proteins in different species will typically conserve some or all translational pause or slowing sites. Based on this, an observed conservation of one or more predicted translational pause or slowing sites in evolutionarily related proteins of different species can confirm or increase the likelihood that a translational pause or slowing site is a functional translational kinetics signal. The codon pair located at the position on a protein that is confirmed as, or considered to have an increased likelihood of, containing an actual translational pause or slowing can itself be confirmed as being, or considered to have an increased likelihood of being, a functional translational kinetics signal. 
Similarly, a codon pair located at a position on a protein that is confirmed as not containing, or considered to have a decreased likelihood of containing, an actual translational pause or slowing, can itself be confirmed as not acting, or considered to have a decreased likelihood of acting, as a functional translational kinetics signal. Accordingly, initially predicted translational kinetics data, e.g., data based on values of observed codon pair frequency versus expected codon pair frequency, can be modified according to conserved codon pair frequency values across two or more species, which can lead to the codon pair being confirmed as: being a functional translational kinetics signal; being considered to have an increased likelihood of being a functional translational kinetics signal; being confirmed as not acting as an actual translational pause codon pair; or being considered to have a decreased likelihood of being a functional translational kinetics signal.
[0418] In another embodiment, the predicted translational kinetics value for a codon pair can be refined according to the presence of the codon pair at a location predicted by methods other than codon pair frequency methods to contain a translational pause or slowing site. One example of such a predicted location is a boundary location between autonomous folding units of a protein. While not intending to be limited to the following, it is proposed that translational pauses are present in wild type genes in order to slow translation of a nascent polypeptide subsequent to translation of a secondary structural element of a protein and/or a protein domain, thus providing time for acquisition of secondary and at least partial tertiary structure by the nascent protein prior to further downstream translation, and thereby allowing each domain to partially organize and commit to a particular, independent fold. As such, it is proposed herein that codon pairs can be associated with translational pauses between autonomous folding units of a protein, where autonomous folding units can be secondary structural elements such as an alpha helix, or can be tertiary structural elements such as a protein domain. Thus, the presence of a codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likelihood that the codon pair acts to pause or slow translation. Accordingly, predicted translational kinetics data, e.g., data based on values of observed codon pair frequency versus expected codon pair frequency, can be modified according to the presence of the codon pair at a boundary location between autonomous folding units of a protein, which can increase the likelihood that the codon pair acts to pause or slow translation. For example, an over-represented codon pair that is present at a boundary location between autonomous folding units of a protein can be confirmed as acting as a translational pause or slowing codon pair.
[0419] In the above embodiment, a single observation of the codon pair at a boundary location between autonomous folding units of a protein can confirm or increase the likely translational pause or slowing properties of a codon pair. However, typically a plurality of observations will be used to more accurately estimate the translational pause or slowing properties of a codon pair. Thus, methods of using, for example, predicted boundary locations can be combined with methods that are based on recurrence of a codon pair and/or recurrence of a predicted translational kinetics value associated with a codon pair in methods of refining a predicted translational kinetics value for a codon pair. For example, a protein present in two or more species can have conserved boundary locations between autonomous folding units of the protein, and recurrent presence of an over-represented codon pair at the boundary locations can confirm the likelihood of an actual translational pause at that boundary location, leading to confirmation, or increased likelihood, that the corresponding codon pair for the respective species acts as a translational pause or slowing codon pair. In another example, two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of an over-represented codon pair at the boundary locations can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
[0420] Such recurrence-based methods also can be used to confirm or indicate increased likelihood that a non-over-represented codon pair (e.g., an under-represented codon pair or a represented-as-expected codon pair) acts as a translational pause or slowing codon pair. For example, two or more proteins of the same species can have boundary locations between autonomous folding units, and recurrent presence of a non-over-represented codon pair at the boundary locations, particularly if no over-represented codon pair is present, can confirm or indicate the likelihood of an actual translational pause at that boundary location, leading to confirmation or indication of increased likelihood that the corresponding codon pair acts as a translational pause or slowing codon pair.
[0421] Such recurrence-based methods also can be used to confirm or indicate the likelihood that a codon pair, such as an over-represented codon pair, does not act as a translational pause or slowing codon pair. For example, two or more proteins of the same species can have boundary locations between autonomous folding units, and consistent absence of a non-over-represented codon pair at the boundary locations can confirm or indicate increased likelihood that the codon pair does not act as a translational pause or slowing codon pair.
[0422] In another embodiment, the predicted translational kinetics value for a codon pair can be refined according to empirical measurement of translational kinetics for a codon pair. The influence of a codon pair on translational kinetics can be experimentally measured, and these experimental measurements can be used to refine or replace the predicted translational kinetics values for a codon pair. Several methods of experimentally measuring the translational kinetics of a codon pair are known in the art, and can be used herein, as exemplified in Irwin et al., J. Biol. Chem., (1995) 270:22801. One such exemplary assay is based on the observation that a ribosome pausing at a site near the beginning of an mRNA coding sequence can inhibit translation initiation by physically interfering with the attachment of a new ribosome to the message, and, thus, the codon pair to be assayed can be placed at the beginning of a polypeptide-encoding nucleotide sequence and the effect of the codon pair on translational initiation can be measured as an indication of the ability of the codon pair to cause a translational pause. Another such exemplary assay is based on the fact that the transit time of a ribosome through the leader polypeptide coding region of the leader RNA of the trp operon sets the basal level of transcription through the trp attenuator, and, thus, the codon pair to be assayed can be placed into a trpLep leader polypeptide codon region, and level of expression can be inversely indicative of the translational pause properties of the codon pair, due to a faster translation causing formation of a stem-loop attenuator in the leader RNA, which results in transcriptional attenuation.
[0423] As will be apparent to one skilled in the art, the methods provided herein for calculation of translational kinetics values can be applied to the native organism of the polypeptide of SEQ ID NOS: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302, and also can be applied to a selected organism in which the polypeptide of SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302, or a modification thereof, is to be heterologously expressed. For example, the nucleotide sequence information of an organism can be used to calculate chi-squared values in accordance with the methods provided herein, and the translational kinetics values can be based on these chi-squared values as well as on additional translational kinetics information provided herein, including, but not limited to, codon pairs conserved in domain boundaries and empirically measured translational kinetics for a codon pair. Exemplary organisms for which translational kinetics values can be calculated and used to prepare a nucleotide sequence encoding a sugar catabolic enzyme protein provided herein include Pichia pastoris; Oryctolagus cuniculus (rabbit); Macaca fascicularis (Long-tailed monkey); M. mulatta (Monkey); E. coli K12 W3110; E. coli UTI89; E. coli O157:H7 EDL933; E. coli O157:H7 str. Sakai; Bombyx mori; Spodoptera frugiperda; Drosophila melanogaster and Schizosaccharomyces pombe.
Calculation methods of modifying translational kinetics values based on additional translational kinetics data
[0424] The translational kinetics data described herein can be combined in such a manner as to provide a refined translational kinetics value for a codon pair in a host organism. Methods of combining predictive data to arrive at a refined predictive value are known in the art and can be used herein.
[0425] Estimates for translational kinetics values are informed by a number of knowledge sources known to those skilled in the art, including but not limited to experimental measurement, conservation at protein structural boundaries and across homologous families, statistical inference from genomic sequence data, and the like as provided elsewhere herein. All these disparate knowledge sources must be integrated into an overall estimate for purposes of gene design and engineering. The general problem of integrating diverse and disparate knowledge sources is ubiquitous and well-studied in many different engineering fields, e.g., distributed sensor fusion in remote sensing, bagging classifiers in machine learning, heterogeneous database integration in data warehouses, or perceptual integration in artificial intelligence. Many useful and applicable approaches are known to the art.
[0426] While many approaches are possible, those skilled in the art agree that the method of Bayes [Bayes, T., 1764. An essay toward solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London 53:370-418. Reprinted pp. 131-153 in "Studies in the History of Statistics and Probability" (ed. Pearson, E. S., Kendall, M. G.), Charles Griffin, London, 1970.] has rigorous foundations in probability and many successes in bioinformatics [Baldi, P., and Brunak, S., 2001. Bioinformatics: The Machine Learning Approach, MIT Press, Cambridge, MA, USA]. Using the Bayesian approach as an example here, without intending to exclude other well-known approaches, the Bayesian approach seeks to choose a hypothesis H that is most probable given the observed data D.
[0427] Operationally, this means to choose H so as to maximize the probability of H given D, written P(H|D). By Bayes's rule, this may be rewritten as P(H|D) = P(D|H) * P(H) / P(D). This is equivalent to maximizing P(D|H) * P(H) because P(D) is constant for all H. The term P(H) is identified with the degree of belief in hypothesis H before the data was observed. The term P(D|H), read "the probability of D given H," is identified with how well hypothesis H predicts the observed data D. Thus, the Bayesian approach seeks to find an hypothesis that is a priori likely and also explains the data well.
[0428] In this example, an hypothesis H is that a given sequence feature, e.g., a given codon pair, has utility for translational kinetics engineering, e.g., creates a translational pause site. The observed data D may have several observations, e.g., D = D1 & D2 & D3 & D4, where D1 = an experimental measurement, D2 = conserved at protein structural domain boundaries, D3 = conserved across homologous protein families, and D4 = indicated as over-represented by statistical analysis that yields a high chisq3 value. In this case, the term P(D|H) = P(D1 & D2 & D3 & D4 | H), which indicates to choose an hypothesis that explains each observed datum. Of course, different data sources have different rates and magnitudes of observational error. This falls naturally into the Bayesian approach because the probability framework extends naturally to encompass the probability of observational error, as P(D|H) = P(D|H) * P(D is correct) + P(not D|H) * P(D is not correct). For example, an experimental measurement D1 that has been confirmed by replicate testing would have a very low probability of error, and therefore it would dominate the estimate if available.
[0429] In the general case, where no experimental measurement is available, several Bayesian approaches are commonly employed. The simplest, which often works well, is named "Naive Bayes" because it assumes conditional independence among the individual observed data items. In this case, P(D|H) = P(D1 & D2 & D3 & D4 | H) = P(D1|H) * P(D2|H) * P(D3|H) * P(D4|H), where each of the individual terms is further expanded as P(Di|H) = P(Di|H) * P(Di is correct) + P(not Di|H) * P(Di is not correct) as indicated above. The terms P(Di is correct) and P(Di is not correct) can be estimated a priori by the correlation of Di with previous experimental measurements. The terms P(Di|H) and P(not Di|H) are obtained by observing whether or not hypothesis H is consistent with observed data item Di. 
More complex and powerful Bayesian approaches are also well known to the art. The fully general approach rewrites P(D|H) = P(D1 & D2 & D3 & D4 | H) = P(D4 | D3 & D2 & D1 & H) * P(D3 | D2 & D1 & H) * P(D2 | D1 & H) * P(D1 | H). Many other approaches, both Bayesian and others, are well known to the art.
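Purely as an illustration of the Naive Bayes combination outlined above, the sketch below computes a posterior probability that a codon pair creates a translational pause site from several data items, each weighted by its probability of being correct. The function name naive_bayes_posterior and the representation of each data item as a (supports_h, reliability) tuple are hypothetical. The sketch further assumes a two-hypothesis comparison (H versus not H) in which an item that would be seen with probability p under H would be seen with probability 1 - p under not H; this symmetric error model is a simplifying assumption and is not stated in the text.

```python
def naive_bayes_posterior(prior_h, evidence):
    """Posterior P(H | D1 & ... & Dn) under conditional independence.

    prior_h:  P(H), the degree of belief before observing the data.
    evidence: iterable of (supports_h, reliability) tuples, where
              supports_h is True if H is consistent with the data item and
              reliability is P(Di is correct), assumed strictly between
              0 and 1.
    """
    like_h = 1.0
    like_not_h = 1.0
    for supports_h, reliability in evidence:
        # P(Di | H): the item is correct if it agrees with H, and in error
        # otherwise.
        p_if_h = reliability if supports_h else 1.0 - reliability
        like_h *= p_if_h
        # Assumed symmetric error model for the alternative hypothesis.
        like_not_h *= 1.0 - p_if_h
    numerator = like_h * prior_h
    denominator = numerator + like_not_h * (1.0 - prior_h)
    return numerator / denominator
```

With a prior of 0.5, a single supporting item with reliability 0.9 yields a posterior of 0.9, two supporting items with reliability 0.8 each yield approximately 0.94, and two equally reliable items that disagree leave the posterior at 0.5. A highly reliable item, such as a replicated experimental measurement, therefore dominates less reliable items, in keeping with the discussion above.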
[0430] By way of example, the translational kinetics values for a codon pair can be refined by consideration of, for example, chi-squared value of observed versus expected codon pair frequency and the degree to which codon pairs are conserved at predicted pause sites across different proteins in the same species, for example, at protein structure domain boundaries. An over-represented codon pair which is present with above-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting higher predicted translational pause properties of the codon pair. In contrast, an over-represented codon pair which is present with below-random frequency at boundary locations between autonomous folding units of proteins in the same species can have a translational kinetics value reflecting lower predicted translational pause properties of the codon pair.
[0431] As another example, the translational kinetics values for a codon pair can be refined by consideration of, for example, experimentally measured translation step times in one species and the degree to which codon pairs that correspond to measured pause sites in the first species are conserved across homologous proteins in other species, for example, in a multiple sequence alignment. When an over-represented codon pair in another species is aligned with above-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting higher predicted translational pause properties of that codon pair in the other species. In contrast, when an over-represented codon pair in another species is aligned with below-random frequency to a codon pair that corresponds to a measured translation pause site in the first species, it can have a translational kinetics value reflecting lower predicted translational pause properties of that codon pair in the other species.
[0432] In various embodiments described herein, translational kinetics values for codon pairs, including refined translational kinetics values, can be determined. The translational kinetics values can be organized according to the likelihood of causing a translational pause or slowing based on any method known in the art. In one example, the translational kinetics values for two or more codon pairs, up to all codon pairs, in an organism are determined, and the mean translational kinetics value and associated standard deviation are calculated. Based on this, the translational kinetics value for a particular codon pair can be described in terms of the number of standard deviations by which the translational kinetics value for the particular codon pair differs from the mean translational kinetics value. Accordingly, reference herein to mean translational kinetics values and standard deviations, whether or not applied to a particular expression of translational kinetics value, can be applied to any of a variety of expressions of translational kinetics values provided herein.
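The standard-deviation expression described above corresponds to a conventional z score. A minimal sketch follows, assuming that the translational kinetics values are held in a dictionary keyed by codon pair and that the population standard deviation is used; the choice between population and sample standard deviation, and the function name, are assumptions made for illustration.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Express each codon pair's value as a multiple of standard deviations
    from the mean of all supplied values.

    values: {(codon1, codon2): translational kinetics value}
    """
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation (an assumption)
    if sigma == 0:
        # All values identical: no pair deviates from the mean.
        return {pair: 0.0 for pair in values}
    return {pair: (v - mu) / sigma for pair, v in values.items()}
```

Because the result is dimensionless, the same conversion can be applied to any of the expressions of translational kinetics value provided herein before they are compared or plotted.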
Graphical analysis of translational kinetics
[0433] Also provided herein are methods of analyzing translational kinetics of an mRNA into polypeptide encoded by a gene in a host organism by determining translational kinetics values for codon pairs in the host organism and generating a graphical display of the translational kinetics values of actual codon pairs of an original polypeptide-encoding nucleotide sequence of a heterologous gene as a function of codon position. Such a graphical display provides a visual display of the predicted translational influence, including translational pause or slowing for numerous or all codon pairs of a polypeptide-encoding nucleotide sequence. This visual display can be used in methods of modifying polypeptide-encoding nucleotide sequences in order to thereby modify the predicted translational kinetics of the mRNA into polypeptide in methods such as those provided herein. For example, the graphical displays can be used to identify one or more codon pairs to be modified in a polypeptide-encoding nucleotide sequence. The graphical displays can be used in analyzing a polypeptide-encoding nucleotide sequence prior to modifying the polypeptide-encoding nucleotide sequence, or can be used in analyzing a modified polypeptide-encoding nucleotide sequence to determine, for example, whether or not further modifications are desired.
[0434] Methods for creating and using graphical displays can be performed according to any method known in the art, as exemplified in U.S. Patent Publication No. 2007/0298503, published on December 27, 2007, and U.S. Patent Publication No. 2007/0275399, published on November 29, 2007, which are incorporated by reference herein in their entireties. In particular, graphical displays as described therein can be created to illustrate the translational kinetics of an original or redesigned polypeptide-encoding nucleotide sequence in the native or a heterologous organism, or to illustrate differences and/or similarities of translation kinetics of a polypeptide-encoding nucleotide sequence in which one or more codon pairs have been modified. Additionally, numerous normalized graphical displays can be created to illustrate differences and/or similarities of translation kinetics of a polypeptide-encoding nucleotide sequence when expressed in two or more different organisms.
[0435] The graphical displays can be created using translational kinetics values based on any of the methods for determining translational kinetics values provided herein or otherwise known in the art. For example, the plotted values can be chi-squared as a function of codon pair position, chi-squared 2 as a function of codon position, or chi-squared 3 as a function of codon pair position, translational kinetics values thereof, empirical measurement of translational pause of codon pairs in a host organism, estimated translational pause capability based on observed presence and/or recurrence of a codon pair at a predicted pause site, and variations and combinations thereof as provided herein.
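One way to prepare the data behind such a display can be sketched as follows. The sketch assumes that translational kinetics values are already available in a dictionary keyed by codon pair, assigns a default value to pairs that are absent from the dictionary, and renders a simple text bar chart in place of a true graphical display. The function names, the default value, and the text rendering are hypothetical and are used only for illustration.

```python
def kinetics_profile(seq, values, default=0.0):
    """Return [(codon pair position, value)] for an in-frame coding sequence.

    Position 1 is the pair formed by the first and second codons.
    """
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return [(pos, values.get(pair, default))
            for pos, pair in enumerate(zip(codons, codons[1:]), start=1)]

def text_plot(profile, width=40):
    """Render one line per codon pair position, with bar length scaled so
    that the largest absolute value spans `width` characters."""
    peak = max((abs(v) for _, v in profile), default=0.0) or 1.0
    lines = []
    for pos, v in profile:
        bar = "#" * round(abs(v) / peak * width)
        sign = "+" if v >= 0 else "-"
        lines.append(f"{pos:4d} {sign} {bar}")
    return "\n".join(lines)
```

The list returned by the hypothetical `kinetics_profile` could equally be passed to any plotting tool, with position on one axis and the translational kinetics value on the other.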
[0436] The exact format of the graphical displays can take any of a variety of forms, and the specific form is typically selected for ease of analysis and comparison between plots. For example, the abscissa typically lists the position along the nucleotide sequence or polypeptide sequence, and can be represented by nucleotide position, codon position, codon pair position, amino acid position, or amino acid pair position. In such instances, the ordinate typically lists the translational kinetics value of the codon pair, such as, but not limited to, a translational kinetics value of codon pair frequency, including, but not limited to, the z score of chisq1, the z score of chisq2, the z score of chisq3, the empirically measured value, and the refined translational kinetics value. In alternative embodiments, the sequence position can be plotted along the ordinate and the translational kinetics value can be plotted along the abscissa.
[0437] As an example, a graphical display of translational kinetics is depicted in Figure 1, where each positive deflection or peak describes a predicted translational pause or slowing at the nucleotide location as defined by the abscissa.
Comparing plots
[0438] Also contemplated herein are methods in which a set of graphical displays, including at least a first graphical display and a second graphical display, are prepared. These sets of displays can be compared in order to determine the difference in predicted translational efficiency or translational kinetics of the two plots. The plots can differ according to any of a variety of criteria. For example, each plot can represent a different polypeptide-encoding nucleotide sequence, each plot can represent a different host organism, each plot can represent differently determined translational kinetics values, or any combination thereof. As will be apparent to one skilled in the art, any number of different graphical displays can be compared in accordance with the methods provided herein, for example, 2, 3, 4, 5, 6, 7, 8 or more different graphical displays can be compared. Typically, two plots will represent different polypeptide-encoding nucleotide sequences, the same sequence in different host organisms, or different sequences in different host organisms.
[0439] Comparison of different graphical displays can be used to analyze the predicted change in translational kinetics as a result of the difference represented by the graphical displays. For example, comparison of the same polypeptide-encoding nucleotide sequence in different host organisms can be used to analyze any predicted translational pauses that can be removed. Accordingly, provided herein are methods of analyzing translational kinetics of an mRNA into polypeptide in a host organism by comparing two graphical displays to understand or predict the differences in translational kinetics of the mRNA into polypeptide, where the differences in the graphical displays can be as a result of, for example, a difference in the polypeptide-encoding nucleotide sequence or a difference in the host organism. Upon determination of the differences in translational kinetics, it can be evaluated whether or not the change in translational kinetics as a result of the underlying difference between the two graphical displays is desirable. Such comparison methods also can lead to an identification of further modifications, e.g., further modifications to the polypeptide-encoding nucleotide sequence to further improve translational kinetics. Accordingly, it is contemplated herein that such comparison methods can be carried out iteratively.
[0440] In embodiments where it is desired to improve expression of a polypeptide-encoding nucleotide sequence in a particular heterologous host, a graphical display of the translational kinetics values of codon pairs for the original polypeptide-encoding nucleotide sequence in the heterologous host can be compared to a graphical display of the translational kinetics values of codon pairs for a modified polypeptide-encoding nucleotide sequence in the heterologous host, and it can be determined whether or not the modification to the polypeptide-encoding nucleotide sequence resulted in improved translational kinetics.
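The comparison of an original and a modified sequence described above can be sketched as a position-by-position comparison of two series of translational kinetics values. The sketch assumes that both series cover the same codon pair positions (as is the case when only synonymous codon changes are made), that the values are expressed as z scores, and that a value at or above a chosen threshold marks a predicted pause. The threshold of 2.0 and the function name are hypothetical.

```python
def compare_profiles(original, modified, threshold=2.0):
    """Compare two equal-length lists of per-position values.

    Returns the 1-based codon pair positions at which a predicted pause
    (value >= threshold) was removed or newly introduced by the change.
    """
    if len(original) != len(modified):
        raise ValueError("profiles must cover the same codon pair positions")
    removed, introduced = [], []
    for pos, (before, after) in enumerate(zip(original, modified), start=1):
        if before >= threshold > after:
            removed.append(pos)
        elif after >= threshold > before:
            introduced.append(pos)
    return {"removed": removed, "introduced": introduced}
```

In an iterative use of such a comparison, positions reported as introduced would be candidates for further modification, and the comparison would be repeated on the newly modified sequence.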
Methods of inserting polynucleotide into vector, transforming cells, expressing polynucleotide, and purifying polypeptide
[0441] The nucleic acid sequences provided herein can be present in a polynucleotide (e.g., DNA or RNA molecule). Thus, in one embodiment, provided are polynucleotides containing the nucleic acid sequences provided herein. The polynucleotides can be inserted into a replicable vector for cloning (e.g., amplification of the DNA) or for expression. Various vectors are publicly available and are known in the art. The vector can, for example, be in the form of a plasmid, cosmid, viral particle, or phage. The appropriate nucleic acid sequence can be inserted into the vector by any of a variety of procedures known in the art. Typically, DNA is inserted into an appropriate restriction endonuclease site(s) using techniques known in the art or the DNA is inserted by any of a variety of PCR methodologies. Vector components can generally include, but are not limited to, one or more of a signal sequence, an origin of replication, one or more marker genes, an enhancer element, a promoter, and a transcription termination sequence. Construction of suitable vectors containing one or more of these components employs standard ligation techniques which are known to the skilled artisan.
[0442] The encoded polypeptide can be produced recombinantly not only directly, but also as a fusion polypeptide with a heterologous polypeptide, which can be, e.g., a signal sequence or other polypeptide having a specific cleavage site at the N-terminus of the mature protein or polypeptide. In general, the signal sequence can be a component of the vector, or it can be a part of the polynucleotide that is inserted into the vector. The signal sequence can be a prokaryotic signal sequence selected, for example, from the group of the alkaline phosphatase, penicillinase, lpp, or heat-stable enterotoxin II leaders. For yeast secretion the signal sequence can be, e.g., the yeast invertase leader, alpha factor leader (including Saccharomyces and Kluyveromyces α-factor leaders, the latter described in U.S. Patent No. 5,010,182), or acid phosphatase leader, the C. albicans glucoamylase leader (EP 362,179 published 4 April 1990), or the signal described in WO 90/13646 published 15 November 1990. In mammalian cell expression, mammalian signal sequences can be used to direct secretion of the protein, such as signal sequences from secreted polypeptides of the same or related species, as well as viral secretory leaders.
[0443] Both expression and cloning vectors contain a polynucleotide that permits the vector to replicate in one or more selected host cells. Such sequences are well known for a variety of bacteria, yeast, and viruses. The origin of replication from the plasmid pBR322 is suitable for most Gram-negative bacteria, the 2μ plasmid origin is suitable for yeast, and various viral origins (SV40, polyoma, adenovirus, VSV or BPV) are useful for cloning vectors in mammalian cells.
[0444] Expression and cloning vectors will typically contain a selection gene, also termed a selectable marker. Typical selection genes encode proteins that (a) confer resistance to antibiotics or other toxins, e.g., ampicillin, neomycin, methotrexate, or tetracycline, (b) complement auxotrophic deficiencies, or (c) supply critical nutrients not available from complex media, e.g., the gene encoding D-alanine racemase for Bacilli.
[0445] Examples of suitable selectable markers for mammalian cells are those that enable the identification of cells competent to take up the polynucleotide-containing vector, such as DHFR or thymidine kinase. An appropriate host cell when wild-type DHFR is employed is the CHO cell line deficient in DHFR activity, prepared and propagated as described by Urlaub et al., Proc. Natl. Acad. Sci. USA, 77:4216 (1980). A suitable selection gene for use in yeast is the trp1 gene present in the yeast plasmid YRp7 [Stinchcomb et al., Nature, 282:39 (1979); Kingsman et al., Gene, 7:141 (1979); Tschemper et al., Gene, 10:157 (1980)]. The trp1 gene provides a selection marker for a mutant strain of yeast lacking the ability to grow in tryptophan, for example, ATCC No. 44076 or PEP4-1 [Jones, Genetics, 85:12 (1977)].
[0446] Expression and cloning vectors usually contain a promoter operably linked to the polynucleotide provided herein to direct mRNA synthesis. Promoters recognized by a variety of potential host cells are well known. Promoters suitable for use with prokaryotic hosts include the β-lactamase and lactose promoter systems [Chang et al., Nature, 275:615 (1978); Goeddel et al., Nature, 281:544 (1979)], alkaline phosphatase, a tryptophan (trp) promoter system [Goeddel, Nucleic Acids Res., 8:4057 (1980); EP 36,776], and hybrid promoters such as the tac promoter [deBoer et al., Proc. Natl. Acad. Sci. USA, 80:21-25 (1983)]. Promoters for use in bacterial systems also will contain a Shine-Dalgarno (S.D.) sequence operably linked to the polynucleotide provided herein.
[0447] Examples of suitable promoting sequences for use with yeast hosts include the promoters for 3-phosphoglycerate kinase [Hitzeman et al., J. Biol. Chem., 255:2073 (1980)] or other glycolytic enzymes [Hess et al., J. Adv. Enzyme Reg., 7:149 (1968); Holland, Biochemistry, 17:4900 (1978)], such as enolase, glyceraldehyde-3-phosphate dehydrogenase, hexokinase, pyruvate decarboxylase, phosphofructokinase, glucose-6-phosphate isomerase, 3-phosphoglycerate mutase, pyruvate kinase, triosephosphate isomerase, phosphoglucose isomerase, and glucokinase.
[0448] Other yeast promoters, which are inducible promoters having the additional advantage of transcription controlled by growth conditions, are the promoter regions for alcohol dehydrogenase 2, isocytochrome C, acid phosphatase, degradative enzymes associated with nitrogen metabolism, metallothionein, glyceraldehyde-3-phosphate dehydrogenase, and enzymes responsible for maltose and galactose utilization. Suitable vectors and promoters for use in yeast expression are further described in EP 73,657.
[0449] Transcription from vectors in mammalian host cells is controlled, for example, by promoters obtained from the genomes of viruses such as polyoma virus, fowlpox virus (UK 2,211,504 published 5 July 1989), adenovirus (such as Adenovirus 2), bovine papilloma virus, avian sarcoma virus, cytomegalovirus, a retrovirus, hepatitis-B virus and Simian Virus 40 (SV40), from heterologous mammalian promoters, e.g., the actin promoter or an immunoglobulin promoter, and from heat-shock promoters, provided such promoters are compatible with the host cell systems.
[0450] Transcription by higher eukaryotes can be increased by inserting an enhancer sequence into the vector. Enhancers are cis-acting elements of DNA, usually about from 10 to 300 bp, that act on a promoter to increase its transcription. Many enhancer sequences are now known from mammalian genes (globin, elastase, albumin, α-fetoprotein, and insulin). Typically, however, one will use an enhancer from a eukaryotic cell virus. Examples include the SV40 enhancer on the late side of the replication origin (bp 100-270), the cytomegalovirus early promoter enhancer, the polyoma enhancer on the late side of the replication origin, and adenovirus enhancers. The enhancer can be spliced into the vector at a position 5' or 3' to the polynucleotide provided herein, but is preferably located at a site 5' from the promoter.
[0451] Expression vectors used in eukaryotic host cells (yeast, fungi, insect, plant, animal, human, or nucleated cells from other multicellular organisms) will also contain sequences necessary for the termination of transcription and for stabilizing the mRNA. Such sequences are commonly available from the 5' and, occasionally, 3' untranslated regions of eukaryotic or viral DNAs or cDNAs. These regions contain nucleotide segments transcribed as polyadenylated fragments in the untranslated portion of the mRNA transcribed from the polynucleotide provided herein.
[0452] Still other methods, vectors, and host cells suitable for adaptation to the synthesis of the encoded proteins in recombinant vertebrate cell culture are described in Gething et al., Nature, 293:620-625 (1981); Mantei et al., Nature, 281:40-46 (1979); EP 117,060; and EP 117,058.
[0453] Host cells are transfected or transformed with expression or cloning vectors described herein for polypeptide production and cultured in conventional nutrient media modified as appropriate for inducing promoters, selecting transformants, or amplifying the genes encoding the desired sequences. The culture conditions, such as media, temperature, pH and the like, can be selected by the skilled artisan without undue experimentation. In general, principles, protocols, and practical techniques for maximizing the productivity of cell cultures can be found in Mammalian Cell Biotechnology: a Practical Approach, M. Butler, ed. (IRL Press, 1991) and Sambrook et al., supra.
[0454] Methods of eukaryotic cell transfection and prokaryotic cell transformation are known to the ordinarily skilled artisan, for example, CaCl2, CaPO4, liposome-mediated and electroporation. Depending on the host cell used, transformation is performed using standard techniques appropriate to such cells. The calcium treatment employing calcium chloride, as described in Sambrook et al., supra, or electroporation is generally used for prokaryotes. Infection with Agrobacterium tumefaciens is used for transformation of certain plant cells, as described by Shaw et al., Gene, 23:315 (1983) and WO 89/05859 published 29 June 1989. For mammalian cells without such cell walls, the calcium phosphate precipitation method of Graham and van der Eb, Virology, 52:456-457 (1978) can be employed. General aspects of mammalian cell host system transfections have been described in U.S. Patent No. 4,399,216. Transformations into yeast are typically carried out according to the method of Van Solingen et al., J. Bact., 130:946 (1977) and Hsiao et al., Proc. Natl. Acad. Sci. (USA), 76:3829 (1979). However, other methods for introducing DNA into cells, such as by nuclear microinjection, electroporation, bacterial protoplast fusion with intact cells, or polycations, e.g., polybrene, polyornithine, can also be used. For various techniques for transforming mammalian cells, see Keown et al., Methods in Enzymology, 185:527-537 (1990) and Mansour et al., Nature, 336:348-352 (1988).
[0455] Suitable host cells for cloning or expressing the DNA in the vectors herein include prokaryote, yeast, or higher eukaryote cells. Suitable prokaryotes include but are not limited to eubacteria, such as Gram-negative or Gram-positive organisms, for example, Enterobacteriaceae such as E. coli. Various E. coli strains are publicly available, such as E. coli K12 strain MM294 (ATCC 31,446); E. coli X1776 (ATCC 31,537); E. coli strain W3110 (ATCC 27,325) and K5 772 (ATCC 53,635). Other suitable prokaryotic host cells include Enterobacteriaceae such as Escherichia, e.g., E. coli, Enterobacter, Erwinia, Klebsiella, Proteus, Salmonella, e.g., Salmonella typhimurium, Serratia, e.g., Serratia marcescens, and Shigella, as well as Bacilli such as B. subtilis and B. licheniformis (e.g., B. licheniformis 41P disclosed in DD 266,710 published 12 April 1989), Pseudomonas such as P. aeruginosa, and Streptomyces. These examples are illustrative rather than limiting. Strain W3110 is one particularly preferred host or parent host because it is a common host strain for recombinant DNA product fermentations. Preferably, the host cell secretes minimal amounts of proteolytic enzymes. For example, strain W3110 can be modified to effect a genetic mutation in the genes encoding proteins endogenous to the host, with examples of such hosts including E. coli W3110 strain 1A2, which has the complete genotype tonA; E. coli W3110 strain 9E4, which has the complete genotype tonA ptr3; E. coli W3110 strain 27C7 (ATCC 55,244), which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT kanr; E. coli W3110 strain 37D6, which has the complete genotype tonA ptr3 phoA E15 (argF-lac)169 degP ompT rbs7 ilvG kanr; E. coli W3110 strain 40B4, which is strain 37D6 with a non-kanamycin resistant degP deletion mutation; and an E. coli strain having mutant periplasmic protease disclosed in U.S. Patent No. 4,946,783 issued 7 August 1990. Alternatively, in vitro methods of cloning, e.g., PCR or other nucleic acid polymerase reactions, are suitable.
[0456] In addition to prokaryotes, eukaryotic microbes such as filamentous fungi or yeast are suitable cloning or expression hosts for polynucleotide-containing vectors. Saccharomyces cerevisiae is a commonly used lower eukaryotic host microorganism. Others include Schizosaccharomyces pombe (Beach and Nurse, Nature, 290:140 [1981]; EP 139,383 published 2 May 1985); Kluyveromyces hosts (U.S. Patent No. 4,943,529; Fleer et al., Bio/Technology, 9:968-975 (1991)) such as, e.g., K. lactis (MW98-8C, CBS683, CBS4574; Louvencourt et al., J. Bacteriol., 154(2):737-742 [1983]), K. fragilis (ATCC 12,424), K. bulgaricus (ATCC 16,045), K. wickerhamii (ATCC 24,178), K. waltii (ATCC 56,500), K. drosophilarum (ATCC 36,906; Van den Berg et al., Bio/Technology, 8:135 (1990)), K. thermotolerans, and K. marxianus; Yarrowia (EP 402,226); Pichia pastoris (EP 183,070; Sreekrishna et al., J. Basic Microbiol., 28:265-278 [1988]); Candida; Trichoderma reesei (EP 244,234); Neurospora crassa (Case et al., Proc. Natl. Acad. Sci. USA, 76:5259-5263 [1979]); Schwanniomyces such as Schwanniomyces occidentalis (EP 394,538 published 31 October 1990); and filamentous fungi such as, e.g., Neurospora, Penicillium, Tolypocladium (WO 91/00357 published 10 January 1991), and Aspergillus hosts such as A. nidulans (Ballance et al., Biochem. Biophys. Res. Commun., 112:284-289 [1983]; Tilburn et al., Gene, 26:205-221 [1983]; Yelton et al., Proc. Natl. Acad. Sci. USA, 81:1470-1474 [1984]) and A. niger (Kelly and Hynes, EMBO J., 4:475-479 [1985]). Methylotrophic yeasts are suitable herein and include, but are not limited to, yeast capable of growth on methanol selected from the genera consisting of Hansenula, Candida, Kloeckera, Pichia, Saccharomyces, Torulopsis, and Rhodotorula. A list of specific species that are exemplary of this class of yeasts can be found in C. Anthony, The Biochemistry of Methylotrophs, 269 (1982).
[0457] Suitable host cells for the expression of glycosylated polypeptides are derived from multicellular organisms. Examples of invertebrate cells include insect cells such as Drosophila S2 and Spodoptera Sf9, as well as plant cells. Examples of useful mammalian host cell lines include Chinese hamster ovary (CHO) and COS cells. More specific examples include monkey kidney CV1 line transformed by SV40 (COS-7, ATCC CRL 1651); human embryonic kidney line (293 or 293 cells subcloned for growth in suspension culture, Graham et al., J. Gen Virol., 36:59 (1977)); Chinese hamster ovary cells/-DHFR (CHO, Urlaub and Chasin, Proc. Natl. Acad. Sci. USA, 77:4216 (1980)); mouse Sertoli cells (TM4, Mather, Biol. Reprod., 23:243-251 (1980)); human lung cells (W138, ATCC CCL 75); human liver cells (Hep G2, HB 8065); and mouse mammary tumor (MMT 060562, ATCC CCL51). The selection of the appropriate host cell is deemed to be within the skill in the art.
[0458] Gene amplification and/or expression can be measured in a sample directly, for example, by conventional Southern blotting, Northern blotting to quantitate the transcription of mRNA [Thomas, Proc. Natl. Acad. Sci. USA, 77:5201-5205 (1980)], dot blotting (DNA analysis), or in situ hybridization, using an appropriately labeled probe, based on the sequences provided herein. Alternatively, antibodies can be employed that can recognize specific duplexes, including DNA duplexes, RNA duplexes, and DNA-RNA hybrid duplexes or DNA-protein duplexes. The antibodies in turn can be labeled and the assay can be carried out where the duplex is bound to a surface, so that upon the formation of duplex on the surface, the presence of antibody bound to the duplex can be detected.
[0459] Gene expression, alternatively, can be measured by immunological methods, such as immunohistochemical staining of cells or tissue sections and assay of cell culture or body fluids, to quantitate directly the expression of gene product. Antibodies useful for immunohistochemical staining and/or assay of sample fluids can be either monoclonal or polyclonal, and can be prepared in any mammal. Conveniently, the antibodies can be prepared against any polypeptide provided herein or against a synthetic peptide based on the sequences provided herein or against exogenous sequence fused to the polypeptide or fragment thereof and encoding a specific antibody epitope.
[0460] Polypeptides can be recovered from culture medium or from host cell lysates. If membrane-bound, a polypeptide can be released from the membrane using a suitable detergent solution (e.g., Triton-X 100) or by enzymatic cleavage. Cells employed in expression of polypeptides can be disrupted by various physical or chemical means, such as freeze-thaw cycling, sonication, mechanical disruption, or cell lysing agents, as is known in the art.
[0461] It may be desired to purify polypeptides. The following procedures are exemplary of suitable purification procedures: fractionation on an ion-exchange column; ethanol precipitation; reverse phase HPLC; chromatography on silica or on a cation-exchange resin such as DEAE; chromatofocusing; SDS-PAGE; ammonium sulfate precipitation; gel filtration using, for example, Sephadex G-75; protein A Sepharose columns to remove contaminants such as IgG; and metal chelating columns to bind epitope-tagged forms of the polypeptide. Various additional known methods of protein purification can be employed; exemplary methods are described in Deutscher, Methods in Enzymology, 182 (1990); Scopes, Protein Purification: Principles and Practice, Springer-Verlag, New York (1982). The purification step(s) selected will depend, for example, on the nature of the production process used and the particular polypeptide produced.
[0462] Also provided herein is an expression system, comprising an expression vector in a host organism, wherein the expression vector includes a DNA sequence of the embodiments provided herein operably linked to an expression control sequence. As used herein, an expression vector is a DNA or RNA vector that is capable of transforming a host cell and of effecting expression of a specified nucleic acid molecule. Typically, the expression vector is also capable of replicating within the host cell. Expression vectors can be either prokaryotic or eukaryotic, and are typically viruses or plasmids.
[0463] The term operably linked refers to functional linkage between a nucleic acid expression control sequence (such as a promoter, or array of transcription factor binding sites) and a second nucleic acid sequence, wherein the expression control sequence directs transcription of the nucleic acid corresponding to the second sequence. An operably linked expression vector can also include secretion signals and other modifying sequences, and can encode chaperones and proteins for a variety of organisms and systems.
[0464] Also provided herein are methods of expressing a polypeptide-encoding nucleotide sequence generated by the methods provided herein. Methods of expressing polypeptides from polypeptide-encoding nucleotide sequences are known in the art, as exemplified, for example, by the techniques described in Maniatis et al., 1989, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, N.Y. and Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y. The methods include inserting a polypeptide-encoding nucleotide sequence designed by the methods provided herein into a cell, and expressing the polypeptide-encoding nucleotide sequence under conditions suitable for gene expression. Additionally provided expression methods include cell-free expression systems as known in the art, where such methods include providing a polypeptide-encoding nucleotide sequence designed by the methods provided herein and contacting the polypeptide-encoding nucleotide sequence with a cell-free expression system under conditions suitable for protein translation.
Metabolic Engineering
[0465] In certain embodiments, the expression levels of one or more enzymes in a metabolic pathway are individually manipulated. Differential metabolic expression levels can be manipulated using methods known in the art. For example, by selecting a specific promoter with a desired transcriptional level, one can vary the expression level of the gene that is operably linked to the promoter. Similarly, one may select an expression vector that produces the desired levels of expression.
[0466] Accordingly, one can manipulate expression of the various components of the metabolic systems described herein by selecting a specific promoter with a desired level of transcriptional activation. Additionally, one can predict and manipulate expression of various components of the systems provided herein using a mathematical tool for modeling a metabolic pathway. Such tools are known in the art, for example, as described by Yang et al. (J. Biol. Chem (2005) 280(12):11224-32) and by Yang et al. (Bioinformatics (2005) 6:774-780), each of which is hereby incorporated by reference in its entirety.
Vectors for insertion of polynucleotide into cells
[0467] Nucleic acid constructs, methods and systems for modifying endogenous sequences also are provided herein. Endogenous sequences include genomic sequences of a cell. Such genomic sequences can include sequences previously modified by the constructs, methods and systems provided herein. Modifications of endogenous sequences can include insertions, deletions and mutations. In some embodiments, a modification can include the insertion of a heterologous sequence. Heterologous sequences include exogenous nucleic acid sequences and can include sequences with homology to endogenous sequences.
Integrable polynucleotides
[0468] In some embodiments, integrable polynucleotides for modifying endogenous nucleotide sequences in a cell are provided. Such integrable polynucleotides can contain sequences with homology to endogenous sequences and a removable selectable marker cassette. The removable selectable marker cassette can include a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence. In more embodiments, integrable polynucleotides can also contain heterologous sequences. In such embodiments, the heterologous sequences and removable selectable marker cassette can be flanked by a 5' nucleic acid sequence with homology to an endogenous sequence and a 3' nucleic acid sequence with homology to an endogenous sequence.
[0469] In some embodiments, integrable polynucleotides can include episomal nucleic acids, such as plasmids and YACs. In such embodiments, integrable polynucleotides can include autonomous replication sequences such as ColE1, Ori, oriT, 2 μm, and CEN/ARS. In more embodiments, integrable polynucleotides can include linearized episomal nucleic acids, for example, plasmids cut with a restriction enzyme. In certain embodiments, integrable polynucleotides can include PCR products.
[0470] The following describes aspects of integrable polynucleotides, namely, removable selection cassettes, sequences with homology to endogenous sequences, and heterologous sequences contained therein.
[0471] The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the instant specification, for example, Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 2000).
Removable selectable marker cassettes
[0472] In some embodiments, a removable selectable cassette can contain a selectable marker flanked by a 5' site-specific recombinase recognition sequence and a 3' site-specific recombinase recognition sequence. Removable selectable marker cassettes can be used to select for integration of an integrable polynucleotide into the genome of a cell. Subsequent to integration of the integrable polynucleotide, the removable selectable marker cassette can be excised, if desired, from the genome of the cell. Because the number of known selectable markers is limited, one advantage of excising a selectable marker from the genome of a cell is that the selectable marker can be used repeatedly. That is, after excising the selectable marker of a first integrable polynucleotide from a cell, the same selectable marker can be used in a second integrable polynucleotide to modify the genome of a cell previously modified by the first integrable polynucleotide.
[0473] In some embodiments, the selectable marker can allow selection for a cell in which the selectable marker has integrated into the cell's genome. Selectable markers can be antibiotic resistance genes against compounds, for example, kanamycin, ampicillin, tetracycline, chloramphenicol, spectinomycin, gentamycin, zeomycin, or streptomycin. Additional selectable markers can be genes capable of complementing strains of yeast having well characterized metabolic deficiencies, for example, tryptophan or histidine deficient mutants. In more embodiments, a selectable marker can be used to select against cells that retain the selectable marker. In such embodiments, cells which do not express the selectable marker will be selected for. In further embodiments, a selectable marker can be selected for and against. Examples of selectable markers that can be used in conjunction with the constructions and methods described herein can include, but are not limited to, URA3 (Boeke, J. D., LaCroute, F., and Fink, G. R. (1984). A positive selection for mutants lacking orotidine-5'-phosphate decarboxylase activity in yeast: 5-fluoro-orotic acid resistance. Mol. Gen. Genet. 197, 345-346), TRP1 (Toyn, J. H., Gunyuzlu, P. L., White, W. H., Thompson, L. A., and Hollis, G. F. (2000). A counterselection for the tryptophan pathway in yeast: 5-fluoroanthranilic acid resistance. Yeast 16, 553-560), CAN1 (Whelan, W. L., Gocke, E., and Manney, T. R. (1979). The CAN1 locus of Saccharomyces cerevisiae: fine-structure analysis and forward mutation rates. Genetics 35-51), KlURA3, CYH2, LYS2 and MET15 (Singh, A. and Sherman, F. (1975). Genetic and physiological characterization of met15 mutants of Saccharomyces cerevisiae: a selective system for forward and reverse mutations. Genetics 75-97). Such examples can typically be used in conjunction with specific strains of Saccharomyces cerevisiae which are non-functional for specific genes.
In embodiments in which the selectable marker can be selected for or selected against, a first selection of the selectable marker can be made to select for incorporation of the selectable marker and a second selection of the selectable marker can be made to select against maintaining the selectable marker. Such embodiments can find particular application when the same selectable marker is utilized iteratively, namely, two or more times, for the separate incorporation of two or more heterologous polynucleotides into the host organism.
[0474] In some embodiments, the selectable marker can be flanked by site-specific recombinase recognition sequences. Such sequences allow a site-specific recombinase to excise the selectable marker from an integrable polynucleotide integrated into the genome of a cell. Examples of sequence-specific recombinase target sites include, but are not limited to, loxP sites, frt sites, att sites and dif sites. In certain embodiments, the site-specific recombinase recognition sequences can be loxP sites recognized by the CRE recombinase. In further embodiments, the CRE recombinase can be a CRE recombinase optimized for expression in a particular organism, for example, S. cerevisiae, using methods known in the art. In more embodiments, the site-specific recombinase recognition sequences can be frt sites recognized by the FLP recombinase.
[0475] To excise an intervening piece of DNA, for example, DNA encoding a selectable marker, the flanking loxP sites or flanking frt sites should be in the same orientation, that is, the sites should be in tandem orientation. CRE recombinase or FLP recombinase expressed in a cell can excise the sequence between loxP sites or frt sites, respectively. In some embodiments, the site-specific recombinase can be expressed from a plasmid. In other embodiments, the site-specific recombinase can be expressed from an inducible endogenous gene. The use of an inducible CRE recombinase in yeast to delete endogenous sequences flanked by loxP sites is known in the art, as exemplified in Sauer B. Functional expression of the cre-lox site-specific recombination system in the yeast Saccharomyces cerevisiae. Mol. Cell. Biol. (1987) 7, 2087-2096.
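The orientation requirement described above can be illustrated with a minimal Python sketch. It models excision as a plain string operation and is only an illustration under stated assumptions, not a model of recombinase biochemistry: the function names and the toy flanking and marker sequences are hypothetical, and the 34 bp string is the commonly cited loxP sequence. When two sites are found in the same (tandem) orientation, the intervening DNA and one copy of the site are removed, leaving a single site behind; when the second site is inverted, nothing is excised.

```python
# Commonly cited 34 bp loxP site (two 13 bp arms around an asymmetric 8 bp spacer).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def excise_between_tandem_sites(genome: str, site: str = LOXP) -> str:
    """Remove the DNA between the first two same-orientation copies of `site`.

    One copy of the site remains at the locus. If fewer than two
    same-orientation copies are present, the sequence is returned unchanged.
    """
    first = genome.find(site)
    if first == -1:
        return genome
    second = genome.find(site, first + len(site))
    if second == -1:
        return genome
    return genome[: first + len(site)] + genome[second + len(site):]


# Hypothetical toy locus: flank + loxP + marker + loxP + flank.
marker = "CCCGGGCCC"
tandem = "AAAA" + LOXP + marker + LOXP + "TTTT"
inverted = "AAAA" + LOXP + marker + reverse_complement(LOXP) + "TTTT"

excised = excise_between_tandem_sites(tandem)
unchanged = excise_between_tandem_sites(inverted)
```

In this sketch `excised` retains one loxP site between the flanks, which corresponds to the recombinase recognition site left juxtaposed to the integrated sequence after the marker is removed.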
Sequences with homology to endogenous sequences
[0476] In some embodiments, integration of an integrable polynucleotide into the genome of a cell can be mediated by a variety of processes. Such processes can include, but are not limited to, random integration, homologous recombination, or site-specific recombination.
[0477] In some embodiments, integrable polynucleotides can contain sequences with homology to endogenous sequences. Such sequences with homology to endogenous sequences can direct integration of integrable polynucleotides to certain locations in a cell's genome, specifically, the location of the endogenous sequence. One advantage of directing integration of integrable polynucleotides to particular locations of the genome is that the integrable polynucleotides can be directed to locations of the genome that, for example, can contain enhancer elements, locus control regions, or can be more permissive for expression of a heterologous sequence contained within an integrable polynucleotide. In certain embodiments, sequences with homology to endogenous sequences can be more than about 5 nucleotides, more than about 10 nucleotides, more than about 15 nucleotides, more than about 20 nucleotides, more than about 25 nucleotides, more than about 30 nucleotides, more than about 35 nucleotides, more than about 40 nucleotides, more than about 45 nucleotides, more than about 50 nucleotides, more than about 100 nucleotides, more than about 500 nucleotides, more than about 1 kilobase, more than about 2 kilobases, more than about 3 kilobases, more than about 4 kilobases, or more than about 5 kilobases in length. Sequences with homology to endogenous sequences can be 100% identical or can have at least 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, or 70% identity to the endogenous sequence.
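As a minimal sketch of how a percent identity figure such as those listed above might be computed, the following Python snippet compares a homology arm with its endogenous target. It assumes the two sequences are already aligned without gaps and are of equal length; the function name, the toy sequences and the 90% cutoff are hypothetical choices for illustration only.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length, ungapped sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


# Hypothetical 20-nucleotide homology arm differing from its target at one position.
arm = "ACGTACGTACGTACGTACGT"
target = "ACGTACGTACGTACGTACGA"

identity = percent_identity(arm, target)   # 19 of 20 positions match
meets_cutoff = identity >= 90.0            # illustrative threshold only
```

Sequences that differ in length or contain insertions and deletions would first need a pairwise alignment, which this sketch deliberately leaves out.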
[0478] In particular embodiments, the sequences with homology to endogenous sequences can contain sequences with homology to genomic repetitive elements, such as long interspersed repeats (LINEs), short interspersed repeats (SINEs), or retrotransposon DNA, such as long terminal repeats (LTR). In certain embodiments, genomic repetitive elements can be Ty1 or Ty3 elements. In some embodiments, integrable polynucleotides containing sequences with homology to genomic repetitive elements may integrate at more than one site in the genome of a cell. In further embodiments, sequences with homology to endogenous sequences can contain δ sequences. δ sequences are a component of the LTR of the Ty1 retrotransposon and are distributed throughout the S. cerevisiae genome. Vectors containing δ sequences for integration into S. cerevisiae are known in the art, as exemplified in Lee F. W. and Da Silva N. A., Sequential delta-integration for the regulated insertion of cloned genes in Saccharomyces cerevisiae. Biotechnol Prog. (1997) 13(4): 368-373. In certain embodiments, the 5' nucleic acid sequence with homology to an endogenous sequence and the 3' nucleic acid sequence with homology to an endogenous sequence can contain δ sequences. Vectors containing heterologous sequences flanked by δ sequences are known in the art to have an increased stability for expression of heterologous sequences contained therein (Lee F. W. and Da Silva N. A., Improved efficiency and stability of multiple cloned gene insertions at the delta sequences of Saccharomyces cerevisiae. Appl Microbiol Biotechnol (1997) 48(3): 339-345). Without wishing to be bound to any one theory, the increased stability of integrated vectors containing two δ sequences may be due to the vector integrating into the yeast genome by double-crossover integration.
Heterologous sequences
[0479] In addition to a removable selectable cassette and sequences with homology to endogenous sequences, in some embodiments, an integrable polynucleotide can contain heterologous sequences. Such heterologous sequences can include sequences encoding polypeptides. In more embodiments, the heterologous sequences can encode genes important in sugar metabolism, cellulose metabolism, arabinose metabolism, and xylose metabolism. In particular embodiments, a heterologous sequence can comprise one or more of the nucleotide sequences provided herein, such as, for example, one or more of SEQ ID NOs:(2x+1), where x=0 to 101.
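The indexing expression 2x+1 for x = 0 to 101 simply enumerates the odd sequence identifiers from 1 through 203, that is, 102 nucleotide sequences. A short Python sketch makes the arithmetic explicit; the variable name is hypothetical.

```python
# Nucleotide sequence identifiers given by SEQ ID NO = 2x + 1 for x = 0..101.
nucleotide_seq_ids = [2 * x + 1 for x in range(0, 102)]

first_id = nucleotide_seq_ids[0]    # x = 0
last_id = nucleotide_seq_ids[-1]    # x = 101
count = len(nucleotide_seq_ids)
```

All identifiers in the list are odd, consistent with a nucleotide sequence being assigned an odd number (for example, SEQ ID NO: 1).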
[0480] In some embodiments, heterologous sequences can contain regulatory elements operatively linked to a sequence encoding a polypeptide. Such regulatory elements can include, for example, promoters, enhancers, and terminator sequences. Promoters may be constitutive or inducible. Suitable promoters for use in prokaryotic hosts include, but are not limited to, the trp, lac and phage promoters, tRNA promoters and glycolytic enzyme promoters. Useful yeast promoters include, but are not limited to, the promoter regions for metallothionein, 3-phosphoglycerate kinase or other glycolytic enzymes such as enolase or glyceraldehyde-3-phosphate dehydrogenase and the enzymes responsible for maltose and galactose utilization. Appropriate mammalian promoters include, but are not limited to, the early and late promoters from SV40 and promoters derived from murine Moloney leukemia virus (MLV), mouse mammary tumor virus (MMTV), avian sarcoma viruses, adenovirus II, bovine papilloma virus and polyomas. In certain embodiments, a heterologous sequence can contain the PGK1 promoter, the TEF1 promoter, the CYC1 terminator, and combinations thereof.
[0481] In some embodiments, heterologous sequences encode and express the gene of interest in a cell in which the heterologous sequence has integrated.
Cells
[0482] In some embodiments, a cell can contain any of the integrable polynucleotides described herein. Such a cell can be a prokaryotic cell or a eukaryotic cell. Examples of prokaryotic cells include Escherichia coli and Clostridium species. Examples of eukaryotic cells include, but are not limited to, fungi and yeast cells, such as, Saccharomyces cerevisiae, Pichia pastoris, Zymomonas mobilis, Kluyveromyces lactis, Kluyveromyces marxianus, Trichoderma species, and Aspergillus species; mammalian cells, such as Chinese hamster cells; avian cells; and insect cells.
[0483] In some embodiments, the cell can contain an integrable polynucleotide integrated into the genome of a cell. In such embodiments, a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which the removable selectable marker is juxtaposed to said heterologous nucleic acid. A removable selectable marker can be juxtaposed to a heterologous nucleic acid where the removable selectable marker and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the removable selectable marker and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilobases, less than about 1 kilobase, less than about 2 kilobases, less than about 3 kilobases, less than about 4 kilobases, less than about 5 kilobases, or less than about 10 kilobases.
[0484] In more embodiments, a cell can contain an integrable polynucleotide integrated into the genome of the cell where the removable selectable cassette has been excised from the integrated polynucleotide. In such embodiments, a cell can contain a heterologous nucleic acid integrated into the genome of the cell in which a site-specific recombinase recognition site is juxtaposed to the heterologous nucleic acid. A site-specific recombinase recognition site can be juxtaposed to a heterologous nucleic acid where the site-specific recombinase recognition site and the heterologous nucleic acid are adjacent to one another on a sequence, for example, the site-specific recombinase recognition site and the heterologous nucleic acid can be immediately adjacent to one another, or separated by less than 1 nucleotide, less than about 5 nucleotides, less than about 10 nucleotides, less than about 20 nucleotides, less than about 30 nucleotides, less than about 40 nucleotides, less than about 50 nucleotides, less than about 60 nucleotides, less than about 70 nucleotides, less than about 80 nucleotides, less than about 90 nucleotides, less than about 100 nucleotides, less than about 200 nucleotides, less than about 300 nucleotides, less than about 400 nucleotides, less than about 0.5 kilobases, less than about 1 kilobase, less than about 2 kilobases, less than about 3 kilobases, less than about 4 kilobases, less than about 5 kilobases, or less than about 10 kilobases.
[0485] In further embodiments, a cell can contain a plurality of integrable polynucleotides. In such embodiments, a cell can contain a plurality of different integrable polynucleotides containing different selectable markers. Typically, a cell contains no more than about 1, no more than about 2, no more than about 3, no more than about 4, no more than about 5, no more than about 6, no more than about 7, no more than about 8, no more than about 9, or no more than about 10 different selectable markers. However, it is contemplated that the number of selectable markers a cell can contain can include the number of different selectable markers compatible with the methods and compositions described herein. In some embodiments, a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell. In such embodiments, a cell can contain 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 45 or more, or 50 or more different integrable polynucleotides that have integrated into the genome of the cell. In more embodiments, a cell can contain a plurality of different integrable polynucleotides that have integrated into the genome of the cell where some integrable polynucleotides contain selectable markers, and some integrable polynucleotides have no selectable marker. In even more embodiments, a cell can contain a plurality of different integrable polynucleotides where some or all of the selectable markers have been excised.
Methods of modifying endogenous sequences
[0486] In addition to the nucleic acids and compositions described, also provided are methods of modifying endogenous sequences in cells. In some embodiments, methods to modify an endogenous sequence in a cell can include providing a cell with any integrable polynucleotide described herein, and selecting for at least one cell containing the integrable polynucleotide integrated into the genome of the cell.
[0487] In some embodiments, a plurality of different integrable polynucleotides can be provided to a cell. In such embodiments, the plurality of different integrable polynucleotides can include 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more different integrable polynucleotides.
[0488] In certain embodiments, the plurality of integrable polynucleotides can include integrable polynucleotides with different selectable markers. One advantage of providing a cell with a plurality of polynucleotides with different selectable markers is the ability to make more than one modification to endogenous sequences in a cell simultaneously. Thus, also contemplated herein are methods that include providing a cell with a plurality of different integrable polynucleotides simultaneously. In more embodiments, the plurality of integrable polynucleotides can include integrable polynucleotides with different heterologous sequences. In even more embodiments, the plurality of integrable polynucleotides can include integrable polynucleotides with different flanking sequences with homology to endogenous sequences.
[0489] In some embodiments, at least one selectable marker can be used iteratively. In such embodiments, a cell can be produced from a first round of modification(s) using the methods described herein. In other words, a cell can be provided with a first integrable polynucleotide containing a selectable marker, a cell can be selected for containing the integrable polynucleotide integrated into the cell's genome, the selection cassette can be excised from a cell containing an integrated integrable polynucleotide, and a cell can be selected for having the selection cassette excised. Subsequent to the first round of modifications, a cell containing the modifications of the first round can undergo at least a second round of modifications using a second integrable polynucleotide containing the same selectable marker as the first integrable polynucleotide. As such, a selectable marker can be reused and is used iteratively. In more embodiments, a cell can be provided with a plurality of integrable polynucleotides containing a set of different selectable markers in a first round of modifications. In at least a second subsequent round of modifications, a cell containing the modifications of the first round of modifications can be provided with a plurality of integrable polynucleotides containing the same set of different selectable markers as the first round of modifications.
[0490] In certain embodiments, the integrable polynucleotide can be provided to a cell as a linearized plasmid.
[0491] In more embodiments, the integrable polynucleotide can be provided to a cell as a PCR product. Methods of PCR are well known in the art. In such embodiments, the template for the PCR can comprise a sequence for an integrable polynucleotide, for example, a vector containing the integrable polynucleotide sequence. In more embodiments, the initial template for PCR may not contain the entire sequence for an integrable polynucleotide. One advantage of using PCR to generate the integrable polynucleotide includes the ability to incorporate additional sequences to the ends of the initial PCR template. This ability to incorporate additional sequences reduces the number of subcloning steps required to generate an integrable polynucleotide. For example, PCR primers with tails can be designed and used to amplify the initial PCR template and incorporate the additional sequences in the tails into the amplified product. Such additional tail sequences can be 2 nucleotides, 3 nucleotides, 4 nucleotides, 5 nucleotides, 6 nucleotides, 7 nucleotides, 8 nucleotides, 9 nucleotides, 10 nucleotides, 11 nucleotides, 12 nucleotides, 13 nucleotides, 14 nucleotides, 15 nucleotides, 16 nucleotides, 17 nucleotides, 18 nucleotides, 19 nucleotides, 20 nucleotides, 21 nucleotides, 22 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27 nucleotides, 28 nucleotides, 29 nucleotides, 30 nucleotides, 31 nucleotides, 32 nucleotides, 33 nucleotides, 34 nucleotides, 35 nucleotides, 36 nucleotides, 37 nucleotides, 38 nucleotides, 39 nucleotides, 40 nucleotides, or more than 40 nucleotides in length. In certain embodiments, primers for the PCR can be designed to add sequences with homology to endogenous sequences to the initial PCR template. In such embodiments, an integrable polynucleotide with flanking sequences with homology to endogenous sequences can be generated. In particular embodiments, additional tail sequences can include Ty1 sequences.
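The tailed-primer approach described above can be sketched in Python as follows. This is an illustrative sketch only: the function names, the toy sequences and the 10-nucleotide annealing length are hypothetical, and practical primer design would also consider melting temperature and secondary structure, which are not modeled here. Each primer consists of a 5' tail that carries the homology sequence and a 3' portion that anneals to the initial template.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def design_tailed_primers(template: str, upstream_homology: str,
                          downstream_homology: str, anneal_len: int = 20):
    """Return (forward, reverse) primers, both written 5' to 3'.

    The forward primer is the upstream homology tail followed by the start of
    the template. The reverse primer is the reverse complement of the end of
    the template followed by the downstream homology, so its 5' tail is the
    reverse complement of the downstream homology.
    """
    if anneal_len > len(template):
        raise ValueError("annealing region is longer than the template")
    forward = upstream_homology + template[:anneal_len]
    reverse = (reverse_complement(downstream_homology)
               + reverse_complement(template[-anneal_len:]))
    return forward, reverse


def expected_product(template: str, upstream_homology: str,
                     downstream_homology: str) -> str:
    """Top strand of the amplicon: template flanked by the two homology tails."""
    return upstream_homology + template + downstream_homology


# Hypothetical toy sequences.
template = "ATGGCTAGCTAGGATCCGATTACA"
up, down = "AAAACCCC", "GGGGTTTT"

fwd, rev = design_tailed_primers(template, up, down, anneal_len=10)
product = expected_product(template, up, down)
```

The amplified product begins with the forward primer and ends with the reverse complement of the reverse primer, so the homology sequences end up at both termini without a subcloning step.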
[0492] In some embodiments, methods to modify an endogenous sequence in a cell can also include excising the selectable marker from the integrable polynucleotide integrated into the genome of the cell. One advantage of excising a selectable marker integrated into the genome of a cell is that the selectable marker can be re-used to select for another modification in a subsequent round of modifications. In certain embodiments, a selectable marker can be excised from an integrated site by site-specific recombination using a site-specific recombinase expressed in the cell. Site-specific recombinases can include CRE recombinase to excise sequences between tandem loxP sites, and FLP recombinase to excise sequences between tandem frt sites. In some embodiments, the site-specific recombinase can be expressed from a plasmid transformed into the cell. Alternatively, the site-specific recombinase can be expressed from an inducible endogenous gene. It is contemplated that in instances where more than one type of selectable marker has integrated into the cell's genome, all the different selectable markers can be excised simultaneously by the expression of at least one type of site-specific recombinase. For example, the selectable markers of an integrable polynucleotide containing the URA3 marker flanked by loxP sites, and an integrable polynucleotide containing the TRP1 marker flanked by loxP sites, can both be excised from sites where the integrable polynucleotides have integrated into the cell by expression in the cell of CRE recombinase. In other embodiments, a cell can be provided with a plurality of integrable polynucleotides which contain different recombinase recognition sequences.
In other words, the plurality of integrable polynucleotides can include some integrable polynucleotides that contain one type of recombinase recognition sequences, such as loxP sites, and some integrable polynucleotides can contain another type of recombinase recognition sequences, such as frt sites.
[0493] In some embodiments, a cell in which a selectable marker has been excised can be identified by selecting against cells that retain the marker. Methods for such negative selection are well known in the art.
Systems for Xylose and Arabinose Metabolism
[0494] Also provided herein are systems for xylose metabolism, comprising one or more host organisms that collectively include nucleotide sequences operably encoding at least two enzymes from bacterial or eukaryotic pathways. An exemplary eukaryotic system for xylose metabolism is a cassette of enzymes that can include xylose reductase (XR), xylitol dehydrogenase (XDH), and xylulokinase (XKI). An exemplary bacterial system for xylose metabolism is a cassette of enzymes that can include xylose isomerase (XylA) and xylulokinase (XKI). In certain aspects, one or more, or all of the enzymes are heterologous to the one or more host organisms. In certain aspects, the translational kinetics of each of the nucleotide sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, 3, 4, 5 or 6 or more codon pairs present in the original sequence for each enzyme. A silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change. In certain aspects, the at least 1, 2, 3, 4, 5 or 6 or more substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism. In certain aspects, a codon pair in the modified polynucleotide can be selected to preserve or insert a predicted pause.
[0495] Also provided herein are systems for arabinose metabolism, comprising one or more host organisms that collectively include nucleotide sequences operably encoding at least two enzymes from bacterial or eukaryotic pathways. An exemplary eukaryotic system for arabinose metabolism is a cassette of enzymes that can include aldose reductase (ARD), L-arabinitol 4-dehydrogenase (LAD), L-xylulose reductase (LXR), xylitol dehydrogenase (XDH), and xylulokinase (XKI). An exemplary bacterial system for arabinose metabolism is a cassette of enzymes that can include L-arabinose isomerase (AraA), L-ribulokinase (AraB), and L-ribulose-5-P 4-epimerase (AraD). In certain aspects, one or more, or all of the enzymes are heterologous to the one or more host organisms. In certain aspects, the translational kinetics of each of the nucleotide sequences encoding the enzymes has been increased by silent permutation or conservative amino acid substitution of at least 1, 2, 3, 4, 5 or 6 or more codon pairs present in the original sequence for each enzyme. A silent permutation is a change to one or more nucleotides of a codon such that the encoded amino acid does not change. In certain aspects, the at least 1, 2, 3, 4, 5 or 6 or more substituted codon pairs are predicted to cause a translational pause or slowing in the host organism, and the substituting codon pair is typically a codon pair not predicted to cause a translational pause or slowing in the host organism. In certain aspects, a codon pair in the modified polynucleotide can be selected to preserve or insert a predicted pause.
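A silent permutation of a codon pair can be sketched in Python as follows. The sketch enumerates every codon pair that encodes the same two amino acids under the standard genetic code and, when the original pair exceeds a pause threshold, selects the synonymous pair with the lowest score. The function names, the score values, the threshold, and the choice to treat unscored pairs as 0.0 are all hypothetical assumptions for illustration; they are not values or procedures taken from this specification.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate an in-frame DNA string whose length is a multiple of three."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def silent_alternatives(codon_pair: str) -> list:
    """All six-nucleotide codon pairs encoding the same two amino acids."""
    first_aa = CODON_TABLE[codon_pair[:3]]
    second_aa = CODON_TABLE[codon_pair[3:]]
    firsts = [c for c, aa in CODON_TABLE.items() if aa == first_aa]
    seconds = [c for c, aa in CODON_TABLE.items() if aa == second_aa]
    return [a + b for a in firsts for b in seconds]


def pick_non_pausing_pair(codon_pair: str, scores: dict,
                          pause_threshold: float) -> str:
    """Keep the pair if it scores below the threshold, else pick the lowest-scoring synonym.

    Pairs missing from `scores` are treated as 0.0 (an assumption of this sketch).
    """
    if scores.get(codon_pair, 0.0) < pause_threshold:
        return codon_pair
    return min(silent_alternatives(codon_pair),
               key=lambda pair: scores.get(pair, 0.0))


# Hypothetical scores; higher values stand for a predicted pause.
toy_scores = {"CTGGAA": 4.2, "TTGGAA": -0.5, "CTCGAG": 1.0}
original = "CTGGAA"  # encodes Leu-Glu
replacement = pick_non_pausing_pair(original, toy_scores, pause_threshold=3.0)
```

Because adjacent codon pairs overlap by one codon, changing a codon alters the scores of both pairs that contain it; a full redesign would re-evaluate the neighboring pairs after each substitution, which this sketch does not do.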
[0496] It is contemplated herein that the stoichiometry of enzymes in a pathway can affect the overall efficiency of biomass conversion. Accordingly, provided herein are systems of two or more enzymes wherein one of the two or more enzymes in the pathway has a translational pause. Also provided herein are two or more enzymes wherein two of the enzymes in the pathway have a translational pause. For example, in the eukaryotic cassette of xylose metabolizing enzymes described above, xylose reductase (XR) can have a pause, xylitol dehydrogenase (XDH) can have a pause, xylulokinase (XKI) can have a pause, or combinations thereof can have pauses. As a further example, in the bacterial cassette of xylose metabolizing enzymes described above, xylose isomerase (XylA) can have a pause, xylulokinase (XKI) can have a pause, or both enzymes can have a pause. As a further example, in the eukaryotic cassette of arabinose metabolizing enzymes described above, aldose reductase (ARD) can have a pause, L-arabinitol 4-dehydrogenase (LAD) can have a pause, L-xylulose reductase (LXR) can have a pause, xylitol dehydrogenase (XDH) can have a pause, xylulokinase (XKI) can have a pause, and combinations thereof can have pauses. As a further example, in the bacterial cassette of arabinose metabolizing enzymes described above, L-arabinose isomerase (AraA) can have a pause, L-ribulokinase (AraB) can have a pause, and L-ribulose-5-P 4-epimerase (AraD) can have a pause, or combinations thereof can have pauses. Thus, in one such example, AraA and AraB do not have pauses, while AraD contains a pause; it is contemplated that such an arrangement would result in AraA and AraB having high levels of activity, with AraD retaining low levels of activity.
[0497] In some aspects, the one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
[0498] In some aspects, each encoded enzyme in the system has at least 50%, 60%, 70%, 80%, and more typically at least 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% amino acid sequence identity with the original sequence of the enzyme.
[0499] In some aspects, one or more of the enzymes in the system retains at least 75% of the enzymatic activity of the enzyme encoded by the original sequence under conditions suitable for metabolism of xylose. Methods for measuring the activity of the enzymes in the system are known in the art.
[0500] Also provided are methods of hydrolyzing a carbohydrate comprising providing a carbohydrate comprising at least one glycosidic bond, providing a polypeptide encoded by any of the polynucleotides provided herein, and contacting said carbohydrate with said polypeptide under conditions that permit said polypeptide to hydrolyze at least one glycosidic bond of said carbohydrate, whereby at least one glycosidic bond of said carbohydrate is hydrolyzed. In some such embodiments, the carbohydrate is cellulose. In some such embodiments, the carbohydrate comprises two or more β-1,4-linked glucose units. Typically such methods can be performed using the cells and systems provided herein. Such methods can be performed in order to provide smaller polysaccharides and/or monosaccharides which can be used by a cell or processed extracellularly according to any one of a variety of known methods in the art.
[0501] The following examples are included for illustrative purposes only and are not intended to limit the scope of the invention.
EXAMPLES
[0502] The methods provided herein below exemplify calculation of nucleotide sequences for improved expression in selected heterologous organisms, and expression of polynucleotides containing such sequences. It will be understood by those skilled in the art that any of a variety of known molecular techniques can be utilized in order to implement the following examples. For example, a polynucleotide containing an improved-expression nucleotide sequence calculated in accordance with the teachings herein can be prepared by known methods, such as, for example, assembly of overlapping oligonucleotides which can be solid phase synthesized, as is described in U.S. Patent Number 7,262,031, and U.S. Patent Publication Numbers 2005/0106590 and 2007/0009928. The prepared polynucleotide can then be amplified by PCR methodologies or by insertion into a vector, transformation into cells, and subsequent harvesting of the vector from the cells. Examples of such methods for amplification of a polynucleotide are provided in Ausubel et al., 2008, Current Protocols in Molecular Biology, Greene Publishing Associates and Wiley Interscience, N.Y. The polynucleotide itself or amplicon thereof can be inserted into an expression vector configured to produce the polypeptide encoded by the inserted polynucleotide. The expression vector is then inserted into cells, and according to the expression vector used, the cells are treated under conditions suitable for polypeptide expression. Any of a variety of expression vectors, cell types, and polypeptide expression methodologies known in the art can be used, and examples of such methodologies are provided in Ausubel, supra. The expressed polypeptide can be analyzed and manipulated as desired. For example, the expressed polypeptide can be analyzed by Western blot analysis using a known antibody to the expressed polypeptide or using an anti-polypeptide antibody generated by known methods.
The expressed polypeptide also can be subjected to one or more purification steps to increase the purity of the expressed polypeptide. Various analytical and purification method, as well as antibody-generation methods are known in the art, as exemplified in Ausubel, supra. EXAMPLE 1
[0503] This example describes optimization of a nucleotide sequence encoding Xyr for expression in yeast.
[0504] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values so determined. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
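By way of illustration only, the first and last steps of the calculation described above (observed counts, expected counts under random usage, a chi-squared value per codon pair, and normalization to z scores) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the published method: the function names are hypothetical, the expected count is taken as the product of single-codon frequencies, the chi-squared value is given a sign (positive for over-represented pairs) as an assumption, and the chisq2 and chisq3 corrections for amino acid pair and dinucleotide bias are not reproduced.

```python
from collections import Counter
from statistics import mean, pstdev


def codons(cds):
    """Split a coding sequence into codons, ignoring a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def codon_pair_z_scores(coding_sequences):
    """Return {(codon1, codon2): z score} for every codon pair observed.

    Assumption: this normalizes an uncorrected, signed chi-squared value
    (analogous to chisq1); the chisq2/chisq3 corrections are omitted.
    """
    singles = Counter()   # occurrences of each codon
    observed = Counter()  # occurrences of each adjacent codon pair
    for cds in coding_sequences:
        cs = codons(cds)
        singles.update(cs)
        observed.update(zip(cs, cs[1:]))

    n_codons = sum(singles.values())
    n_pairs = sum(observed.values())

    signed_chisq = {}
    for (a, b), obs in observed.items():
        # Expected count if the two codons were used independently.
        exp = n_pairs * (singles[a] / n_codons) * (singles[b] / n_codons)
        sign = 1.0 if obs >= exp else -1.0
        signed_chisq[(a, b)] = sign * (obs - exp) ** 2 / exp

    mu = mean(signed_chisq.values())
    sigma = pstdev(signed_chisq.values())
    if sigma == 0:
        return {pair: 0.0 for pair in signed_chisq}
    # Report each value as a number of standard deviations from the mean.
    return {pair: (v - mu) / sigma for pair, v in signed_chisq.items()}


if __name__ == "__main__":
    toy = ["ATGGCTGCTGCTTAA", "ATGAAAGCTAAATAA"]  # toy input, not real data
    for pair, z in sorted(codon_pair_z_scores(toy).items()):
        print(pair, round(z, 3))
```

With a real data set, the input would be the non-redundant coding regions of the host organism rather than the two toy sequences shown.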
[0505] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for S. cerevisiae. The nucleotide sequence encoding Xyr (SEQ ID NO: 1) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
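As a hedged illustration of deriving a coding sequence by removing untranslated sequence, the following Python sketch joins exon intervals from a genomic record. The function name, the toy sequence, and the exon coordinates are hypothetical and are not taken from the GenBank record cited above.

```python
def coding_sequence(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) into one coding sequence.

    Everything outside the listed intervals (5' untranslated region,
    introns, 3' flanking sequence) is discarded.
    """
    cds = "".join(genomic[start:end] for start, end in sorted(exons))
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of three")
    return cds.upper()


if __name__ == "__main__":
    # Toy record: lowercase is untranslated, uppercase is coding.
    utr, exon1, intron, exon2, tail = "cccc", "ATGGCT", "gtaagtttag", "GCTTAA", "cc"
    genomic = utr + exon1 + intron + exon2 + tail
    start1 = len(utr)
    start2 = start1 + len(exon1) + len(intron)
    exons = [(start1, start1 + len(exon1)), (start2, start2 + len(exon2))]
    print(coding_sequence(genomic, exons))
```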
[0506] A graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in P. stipitis was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. stipitis as a function of codon pair position. The graphical display is provided in Figure 1.
[0507] A graphical display for the native gene (SEQ ID NO: 1 ) encoding the Xyr protein (SEQ ID NO: 2) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 2A.
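The graphical displays described above plot a z score against codon pair position. A minimal Python sketch of how such a profile could be tabulated is given below; it assumes a precomputed dictionary of z scores keyed by codon pair, and the function names, the default score for unlisted pairs, and the text rendering are illustrative assumptions rather than the plotting method used to prepare the Figures.

```python
def z_score_profile(cds, z_scores, default=0.0):
    """Return [(position, z)] for each adjacent codon pair in cds.

    position is the 1-based index of the first codon of the pair.
    Pairs missing from z_scores receive `default` (an assumption).
    """
    cds = cds.upper()
    cs = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return [(i + 1, z_scores.get(pair, default))
            for i, pair in enumerate(zip(cs, cs[1:]))]


def text_plot(profile, threshold=3.0):
    """Render the profile as text, flagging pairs above the threshold."""
    lines = []
    for pos, z in profile:
        bar = "#" * max(0, round(z))
        flag = " <-- above threshold" if z > threshold else ""
        lines.append(f"{pos:4d} {z:6.2f} {bar}{flag}")
    return "\n".join(lines)


if __name__ == "__main__":
    toy_scores = {("ATG", "GCT"): 3.5, ("GCT", "AAA"): -1.0}  # made-up values
    print(text_plot(z_score_profile("ATGGCTAAATAA", toy_scores)))
```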
[0508] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 3) was found to encode a protein (SEQ ID NO: 4) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 3) encoding the Xyr protein (SEQ ID NO: 4) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 2B.
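One way the modification described above could be approached is sketched below in Python: codon pairs with a z score above 3 are replaced by synonymous codons, and the result is checked for 100% amino acid identity. This is a simple greedy sketch under stated assumptions, not the procedure actually used to generate SEQ ID NO: 3; the function names are hypothetical, the standard genetic code is assumed, and pairs absent from the score table are treated as having a z score of 0.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def split_codons(cds):
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds):
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def max_pair_z(cds, z_scores):
    cs = split_codons(cds)
    return max((z_scores.get(p, 0.0) for p in zip(cs, cs[1:])), default=0.0)


def refine(cds, z_scores, threshold=3.0, max_passes=10):
    """Greedily swap in synonymous codons to lower pair z scores above threshold."""
    cs = split_codons(cds)

    def z(a, b):
        return z_scores.get((a, b), 0.0)

    def local_worst(i, codon):
        # Highest z score among the pairs that position i takes part in.
        scores = []
        if i > 0:
            scores.append(z(cs[i - 1], codon))
        if i + 1 < len(cs):
            scores.append(z(codon, cs[i + 1]))
        return max(scores)

    for _ in range(max_passes):
        changed = False
        for i in range(len(cs) - 1):
            if z(cs[i], cs[i + 1]) <= threshold:
                continue
            # Try the second codon of the pair first, then the first codon.
            for j in (i + 1, i):
                options = SYNONYMS[CODON_TABLE[cs[j]]]
                best = min(options, key=lambda c: local_worst(j, c))
                if local_worst(j, best) < local_worst(j, cs[j]):
                    cs[j] = best
                    changed = True
                if z(cs[i], cs[i + 1]) <= threshold:
                    break
        if not changed:
            break
    return "".join(cs)


if __name__ == "__main__":
    toy_scores = {("GCT", "GCT"): 4.0}  # made-up value for illustration
    native = "ATGGCTGCTTAA"
    modified = refine(native, toy_scores)
    print(modified, translate(modified) == translate(native),
          max_pair_z(modified, toy_scores))
```

A greedy pass of this kind is not guaranteed to remove every high-scoring pair, which is why the sketch reports the remaining maximum z score and the identity check separately.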
EXAMPLE 2
[0509] This example describes optimization of a nucleotide sequence encoding Xyr for expression in bacteria.
[0510] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0511] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3A.
[0512] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 9) was found to encode a protein (SEQ ID NO: 10) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 9) encoding the Xyr protein (SEQ ID NO: 10) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 3B.
EXAMPLE 3
[0513] This example describes optimization of a nucleotide sequence encoding Xyr for expression in P. pastoris.
[0514] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0515] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4A.
[0516] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 15) was found to encode a protein (SEQ ID NO: 16) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 15) encoding the Xyr protein (SEQ ID NO: 16) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 4B.
EXAMPLE 4
[0517] This example describes optimization of a nucleotide sequence encoding Xyr for expression in K. lactis.
[0518] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0519] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5A.
[0520] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 21) was found to encode a protein (SEQ ID NO: 22) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 21) encoding the Xyr protein (SEQ ID NO: 22) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 5B.
EXAMPLE 5
[0521] This example describes optimization of a nucleotide sequence encoding Xyr for expression in Z. mobilis.
[0522] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0523] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 1) encoding the Xyr protein (SEQ ID NO: 2) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6A.
[0524] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 23) was found to encode a protein (SEQ ID NO: 24) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 2). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 23) encoding the Xyr protein (SEQ ID NO: 24) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 6B.
EXAMPLE 6
[0525] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 2 and native Xyr protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-Xyr antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
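The culture volumes implied by the protocol above follow from simple dilution arithmetic, sketched here in Python. The stock concentrations (100 mg/ml ampicillin, 20% L-arabinose) and the function names are hypothetical and are not stated in the protocol; the 1:100 inoculation is approximated as one part inoculum per 100 parts medium, and the small volume added is neglected.

```python
def inoculum_volume_ul(culture_volume_ml, dilution_factor):
    """Volume of overnight culture (in microliters) for a 1:dilution_factor inoculation."""
    return culture_volume_ml * 1000.0 / dilution_factor


def stock_volume_ul(final_conc, stock_conc, culture_volume_ml):
    """C1 * V1 = C2 * V2 with both concentrations in the same unit.

    Returns the stock volume in microliters.
    """
    return final_conc / stock_conc * culture_volume_ml * 1000.0


if __name__ == "__main__":
    culture_ml = 5.0
    print("inoculum (ul):", inoculum_volume_ul(culture_ml, 100))
    # Ampicillin: 100 ug/ml final from an assumed 100 mg/ml (100000 ug/ml) stock.
    print("ampicillin stock (ul):", stock_volume_ul(100.0, 100000.0, culture_ml))
    # L-arabinose: 0.002% or 0.02% final from an assumed 20% stock.
    for pct in (0.002, 0.02):
        print(f"arabinose stock for {pct}% (ul):", stock_volume_ul(pct, 20.0, culture_ml))
```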
[0526] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 7
[0527] This example describes optimization of a nucleotide sequence encoding Xyl1 for expression in yeast.
[0528] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values so determined. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
[0529] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for S. cerevisiae. The nucleotide sequence encoding Xyl1 (SEQ ID NO: 25) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
[0530] A graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in C. parapsilosis was prepared by plotting z scores of translational kinetics values for codon pair utilization in C. parapsilosis as a function of codon pair position. The graphical display is provided in Figure 7.
[0531] A graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 8A.
[0532] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 27) was found to encode a protein (SEQ ID NO: 28) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 27) encoding the Xyl1 protein (SEQ ID NO: 28) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 8B.
EXAMPLE 8
[0533] This example describes optimization of a nucleotide sequence encoding Xyl1 for expression in bacteria.
[0534] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0535] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 9A.
[0536] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 33) was found to encode a protein (SEQ ID NO: 34) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 33) encoding the Xyl1 protein (SEQ ID NO: 34) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 9B.
EXAMPLE 9
[0537] This example describes optimization of a nucleotide sequence encoding Xyl1 for expression in P. pastoris.
[0538] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0539] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 10A.
[0540] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 39) was found to encode a protein (SEQ ID NO: 40) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 39) encoding the Xyl1 protein (SEQ ID NO: 40) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 10B.
EXAMPLE 10
[0541] This example describes optimization of a nucleotide sequence encoding XyIl for expression in K. lactis.
[0542] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0543] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 11A.
[0544] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 45) was found to encode a protein (SEQ ID NO: 46) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 45) encoding the Xyl1 protein (SEQ ID NO: 46) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 11B.
EXAMPLE 11
[0545] This example describes optimization of a nucleotide sequence encoding Xyl1 for expression in Z. mobilis.
[0546] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0547] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 25) encoding the Xyl1 protein (SEQ ID NO: 26) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 12A.
[0548] The nucleotide sequence for the gene encoding the Xyl1 protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 47) was found to encode a protein (SEQ ID NO: 48) with 100% amino acid sequence identity to wild-type Xyl1 (SEQ ID NO: 26). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 47) encoding the Xyl1 protein (SEQ ID NO: 48) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 12B.
EXAMPLE 12
[0549] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 8 and native Xyl1 protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-Xyl1 antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
[0550] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 13
[0551] This example describes optimization of a nucleotide sequence encoding Xdh for expression in yeast.
[0552] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values so determined. The chisq1 was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
[0553] The nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for S. cerevisiae. The nucleotide sequence encoding Xdh (SEQ ID NO: 49) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
[0554] A graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in P. stipitis was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. stipitis as a function of codon pair position. The graphical display is provided in Figure 13.
[0555] A graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 14A.
[0556] The nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 51) was found to encode a protein (SEQ ID NO: 52) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 51) encoding the Xdh protein (SEQ ID NO: 52) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 14B.
EXAMPLE 14
[0557] This example describes optimization of a nucleotide sequence encoding Xdh for expression in bacteria.
[0558] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0559] The nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 15A.
[0560] The nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 57) was found to encode a protein (SEQ ID NO: 58) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 57) encoding the Xdh protein (SEQ ID NO: 58) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 15B.
EXAMPLE 15
[0561] This example describes optimization of a nucleotide sequence encoding Xdh for expression in P. pastoris.
[0562] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0563] The nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 16A.
[0564] The nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 63) was found to encode a protein (SEQ ID NO: 64) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 63) encoding the Xdh protein (SEQ ID NO: 64) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 16B.
EXAMPLE 16
[0565] This example describes optimization of a nucleotide sequence encoding Xdh for expression in K. lactis.
[0566] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0567] The nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 17A.
[0568] The nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 63) was found to encode a protein (SEQ ID NO: 64) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 63) encoding the Xdh protein (SEQ ID NO: 64) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 17B.
EXAMPLE 17
[0569] This example describes optimization of a nucleotide sequence encoding Xdh for expression in Z. mobilis.
[0570] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0571] The nucleotide sequence for the gene encoding the Xdh protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 49) encoding the Xdh protein (SEQ ID NO: 50) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 18A.
[0572] The nucleotide sequence for the gene encoding the Xdh protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 21) was found to encode a protein (SEQ ID NO: 22) with 100% amino acid sequence identity to wild-type Xdh (SEQ ID NO: 50). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 21) encoding the Xdh protein (SEQ ID NO: 22) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 18B.
EXAMPLE 18
[0573] Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 14 and of the native Xdh protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and the supernatant and pellet fractions are resolved on a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-Xdh antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
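The volumes implied by the culture conditions above follow from simple dilution arithmetic (C1 x V1 = C2 x V2). The sketch below is illustrative only: the stock concentrations (100 mg/ml ampicillin, 2% L-arabinose) are assumptions that do not appear in this document, and the helper names are hypothetical.

```python
def inoculum_volume_ul(culture_ml, dilution=100):
    """Volume of overnight culture (in microliters) for a 1:dilution inoculation."""
    return culture_ml * 1000 / dilution


def stock_volume_ul(final_conc, stock_conc, culture_ml):
    """Volume of stock (in microliters) giving final_conc in culture_ml.

    final_conc and stock_conc must be in the same units (C1 * V1 = C2 * V2).
    """
    return final_conc / stock_conc * culture_ml * 1000


# 1:100 inoculation into 5 ml of LB medium
overnight_ul = inoculum_volume_ul(5)

# 100 ug/ml ampicillin from an assumed 100 mg/ml (100,000 ug/ml) stock
ampicillin_ul = stock_volume_ul(100, 100_000, 5)

# 0.002% and 0.02% L-arabinose from an assumed 2% stock
arabinose_low_ul = stock_volume_ul(0.002, 2, 5)
arabinose_high_ul = stock_volume_ul(0.02, 2, 5)
```

Under these assumed stock concentrations, a 5 ml culture takes about 50 μl of overnight culture, 5 μl of ampicillin stock, and 5 μl or 50 μl of L-arabinose stock.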
[0574] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 19
[0575] This example describes optimization of a nucleotide sequence encoding XKI for expression in yeast.
[0576] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values. The chisq1 was re-calculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was re-calculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
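The first and last steps of the calculation described above (observed versus expected codon pair counts, and normalization to z scores) can be sketched as follows. This is an illustration under stated assumptions rather than the published method: the expected count assumes independent pairing of codons at their overall frequencies, the sign convention (positive for over-represented pairs) is an assumption, the chisq2 and chisq3 corrections for amino acid pair and dinucleotide bias are omitted, and all function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def codons(seq):
    """Split an in-frame coding sequence into codons, ignoring a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_pair_chisq1(coding_seqs):
    """Return {(codon_a, codon_b): signed chi-squared} over adjacent codon pairs.

    Expected count under random pairing: E = f(a) * f(b) * N_pairs.
    Assumed sign convention: positive if the pair is over-represented.
    """
    codon_counts, pair_counts = Counter(), Counter()
    for seq in coding_seqs:
        cs = codons(seq.upper())
        codon_counts.update(cs)
        pair_counts.update(zip(cs, cs[1:]))
    n_codons = sum(codon_counts.values())
    n_pairs = sum(pair_counts.values())
    chisq = {}
    for (a, b), observed in pair_counts.items():
        expected = codon_counts[a] / n_codons * codon_counts[b] / n_codons * n_pairs
        sign = 1 if observed >= expected else -1
        chisq[(a, b)] = sign * (observed - expected) ** 2 / expected
    return chisq


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    data = list(values.values())
    mu, sigma = mean(data), pstdev(data)
    if sigma == 0:
        return {key: 0.0 for key in values}
    return {key: (v - mu) / sigma for key, v in values.items()}
```

In a full implementation, z_scores would be applied to the chisq3 values rather than to chisq1 as produced here.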
[0577] The nucleotide sequence for the gene encoding the XKI protein was modified to optimize codon usage for S. cerevisiae. The nucleotide sequence encoding XKI (SEQ ID NO: 73) was derived from GenBank accession number M16190 by removing untranslated sequence (5' untranslated region and introns).
[0578] A graphical display for the native gene (SEQ ID NO: 73) encoding the XKI protein (SEQ ID NO: 74) in P. stipitis was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. stipitis as a function of codon pair position. The graphical display is provided in Figure 19.
[0579] A graphical display for the native gene (SEQ ID NO: 73) encoding the XKI protein (SEQ ID NO: 74) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 20A.
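The data behind such a graphical display is a list of z scores indexed by codon pair position. A minimal sketch of how that profile could be assembled is given below; the table pair_z of per-organism z scores, the default of 0.0 for pairs missing from the table, and the function names are all assumptions made for illustration.

```python
def z_profile(seq, pair_z, default=0.0):
    """Return [(position, z)] for each adjacent codon pair, positions starting at 1.

    pair_z maps (codon_a, codon_b) to a z score for the host organism;
    pairs absent from the table receive `default` (an assumption).
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    cs = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [
        (pos, pair_z.get((a, b), default))
        for pos, (a, b) in enumerate(zip(cs, cs[1:]), start=1)
    ]


def pairs_above(profile, threshold=3.0):
    """Positions of codon pairs whose z score exceeds the threshold."""
    return [pos for pos, z in profile if z > threshold]
```

The list returned by z_profile can be passed to any plotting tool with position on the horizontal axis and z score on the vertical axis, and pairs_above identifies the codon pairs that would be candidates for replacement.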
[0580] The nucleotide sequence for the gene encoding the XKI protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 75) was found to encode a protein (SEQ ID NO: 76) with 100% amino acid sequence identity to wild-type XKI (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 75) encoding the XKI protein (SEQ ID NO: 76) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 20B.
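One way to remove codon pairs with z scores greater than 3 while leaving the encoded protein unchanged is a left-to-right pass that swaps the downstream codon of each offending pair for a synonymous codon. The sketch below is a simple greedy heuristic offered for illustration, not the algorithm used to produce the sequences described here; it assumes the standard genetic code and a hypothetical table pair_z of host-specific z scores, with missing pairs treated as 0.0.

```python
# Standard genetic code (assumption), built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def remove_high_z_pairs(seq, pair_z, threshold=3.0):
    """Greedy synonymous recoding of codon pairs whose z score exceeds threshold.

    Only the downstream codon of an offending pair is changed, and a candidate
    is accepted only if both pairs it participates in end up at or below the
    threshold. Pairs with no acceptable synonym are left unchanged.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    cs = [seq[i:i + 3] for i in range(0, usable, 3)]

    def z(a, b):
        return pair_z.get((a, b), 0.0)

    for i in range(len(cs) - 1):
        if z(cs[i], cs[i + 1]) <= threshold:
            continue
        for alt in SYNONYMS[CODON_TABLE[cs[i + 1]]]:
            left_ok = z(cs[i], alt) <= threshold
            right_ok = i + 2 >= len(cs) or z(alt, cs[i + 2]) <= threshold
            if left_ok and right_ok:
                cs[i + 1] = alt
                break
    return "".join(cs)
```

Because every substitution is drawn from the synonyms of the original codon, the translated protein is unchanged by construction. A more thorough search would also consider replacing the upstream codon when no downstream synonym is acceptable.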
EXAMPLE 20
[0581] This example describes optimization of a nucleotide sequence encoding XKI for expression in bacteria.
[0582] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0583] The nucleotide sequence for the gene encoding the XKI protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 73) encoding the XKI protein (SEQ ID NO: 74) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 21A.
[0584] The nucleotide sequence for the gene encoding the XKI protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 81) was found to encode a protein (SEQ ID NO: 82) with 100% amino acid sequence identity to wild-type XKI (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 81) encoding the XKI protein (SEQ ID NO: 82) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 21B.
EXAMPLE 21
[0585] This example describes optimization of a nucleotide sequence encoding XKI for expression in P. pastoris.
[0586] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0587] The nucleotide sequence for the gene encoding the XKI protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 73) encoding the XKI protein (SEQ ID NO: 74) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 22A.
[0588] The nucleotide sequence for the gene encoding the XKI protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 87) was found to encode a protein (SEQ ID NO: 88) with 100% amino acid sequence identity to wild-type XKI (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 87) encoding the XKI protein (SEQ ID NO: 88) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 22B.
EXAMPLE 22
[0589] This example describes optimization of a nucleotide sequence encoding XKI for expression in K. lactis.
[0590] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0591] The nucleotide sequence for the gene encoding the XKI protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 73) encoding the XKI protein (SEQ ID NO: 74) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 23A.
[0592] The nucleotide sequence for the gene encoding the XKI protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 93) was found to encode a protein (SEQ ID NO: 94) with 100% amino acid sequence identity to wild-type XKI (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 93) encoding the XKI protein (SEQ ID NO: 94) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 23B.
EXAMPLE 23
[0593] This example describes optimization of a nucleotide sequence encoding XKI for expression in Z. mobilis.
[0594] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0595] The nucleotide sequence for the gene encoding the XKI protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 73) encoding the XKI protein (SEQ ID NO: 74) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 24A.
[0596] The nucleotide sequence for the gene encoding the XKI protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 95) was found to encode a protein (SEQ ID NO: 96) with 100% amino acid sequence identity to wild-type XKI (SEQ ID NO: 74). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 95) encoding the XKI protein (SEQ ID NO: 96) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 24B.
EXAMPLE 24
[0597] Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 20 and of the native XKI protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and the supernatant and pellet fractions are resolved on a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-XKI antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
[0598] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 25
[0599] This example describes optimization of a nucleotide sequence encoding LAD1 for expression in yeast.
[0600] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values. The chisq1 was re-calculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was re-calculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
[0601] The nucleotide sequence for the gene encoding the LAD1 protein was modified to optimize codon usage for S. cerevisiae.
[0602] A graphical display for the native gene (SEQ ID NO: 97) encoding the LAD1 protein (SEQ ID NO: 98) in T. reesei was prepared by plotting z scores of translational kinetics values for codon pair utilization in T. reesei as a function of codon pair position. The graphical display is provided in Figure 25.
[0603] A graphical display for the native gene (SEQ ID NO: 97) encoding the LAD1 protein (SEQ ID NO: 98) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 26A.
[0604] The nucleotide sequence for the gene encoding the LAD1 protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 99) was found to encode a protein (SEQ ID NO: 100) with 100% amino acid sequence identity to wild-type LAD1 (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 99) encoding the LAD1 protein (SEQ ID NO: 100) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 26B.
EXAMPLE 26
[0605] This example describes optimization of a nucleotide sequence encoding LAD1 for expression in bacteria.
[0606] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0607] The nucleotide sequence for the gene encoding the LAD1 protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 97) encoding the LAD1 protein (SEQ ID NO: 98) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 27A.
[0608] The nucleotide sequence for the gene encoding the LAD1 protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 105) was found to encode a protein (SEQ ID NO: 106) with 100% amino acid sequence identity to wild-type LAD1 (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 105) encoding the LAD1 protein (SEQ ID NO: 106) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 27B.
EXAMPLE 27
[0609] This example describes optimization of a nucleotide sequence encoding LAD1 for expression in P. pastoris.
[0610] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0611] The nucleotide sequence for the gene encoding the LAD1 protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 97) encoding the LAD1 protein (SEQ ID NO: 98) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 28A.
[0612] The nucleotide sequence for the gene encoding the LAD1 protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 111) was found to encode a protein (SEQ ID NO: 112) with 100% amino acid sequence identity to wild-type LAD1 (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 111) encoding the LAD1 protein (SEQ ID NO: 112) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 28B.
EXAMPLE 28
[0613] This example describes optimization of a nucleotide sequence encoding LAD1 for expression in K. lactis.
[0614] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0615] The nucleotide sequence for the gene encoding the LAD1 protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 97) encoding the LAD1 protein (SEQ ID NO: 98) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 29A.
[0616] The nucleotide sequence for the gene encoding the LAD1 protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 117) was found to encode a protein (SEQ ID NO: 118) with 100% amino acid sequence identity to wild-type LAD1 (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 117) encoding the LAD1 protein (SEQ ID NO: 118) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 29B.
EXAMPLE 29
[0617] This example describes optimization of a nucleotide sequence encoding LAD1 for expression in Z. mobilis.
[0618] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0619] The nucleotide sequence for the gene encoding the LAD1 protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 97) encoding the LAD1 protein (SEQ ID NO: 98) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 30A.
[0620] The nucleotide sequence for the gene encoding the LAD1 protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 119) was found to encode a protein (SEQ ID NO: 120) with 100% amino acid sequence identity to wild-type LAD1 (SEQ ID NO: 98). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 119) encoding the LAD1 protein (SEQ ID NO: 120) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 30B.
EXAMPLE 30
[0621] Expression in E. coli of the codon-optimized, codon pair utilization-based modification (Hot-Rod) from Example 26 and of the native LAD1 protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the culture is grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and the supernatant and pellet fractions are resolved on a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LAD1 antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL Plus (Amersham, Buckinghamshire, UK) according to manufacturer's instructions.
[0622] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 31
[0623] This example describes optimization of a nucleotide sequence encoding LXR for expression in yeast.
[0624] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values. The chisq1 was re-calculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was re-calculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
[0625] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for S. cerevisiae.
[0626] A graphical display for the native gene (SEQ ID NO: 121) encoding the LXR protein (SEQ ID NO: 122) in A. monospora was prepared by plotting z scores of translational kinetics values for codon pair utilization in A. monospora as a function of codon pair position. The graphical display is provided in Figure 31.
[0627] A graphical display for the native gene (SEQ ID NO: 121) encoding the LXR protein (SEQ ID NO: 122) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 32A.
[0628] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 123) was found to encode a protein (SEQ ID NO: 124) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 123) encoding the LXR protein (SEQ ID NO: 124) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 32B.
EXAMPLE 32
[0629] This example describes optimization of a nucleotide sequence encoding LXR for expression in bacteria.
[0630] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0631] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 121) encoding the LXR protein (SEQ ID NO: 122) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 33A.
[0632] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 129) was found to encode a protein (SEQ ID NO: 130) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 129) encoding the LXR protein (SEQ ID NO: 130) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 33B.
EXAMPLE 33
[0633] This example describes optimization of a nucleotide sequence encoding LXR for expression in P. pastoris.
[0634] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0635] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 121) encoding the LXR protein (SEQ ID NO: 122) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 34A.
[0636] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 135) was found to encode a protein (SEQ ID NO: 136) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 135) encoding the LXR protein (SEQ ID NO: 136) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 34B.
EXAMPLE 34
[0637] This example describes optimization of a nucleotide sequence encoding LXR for expression in K. lactis.
[0638] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0639] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 121) encoding the LXR protein (SEQ ID NO: 122) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 35A.
[0640] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 141) was found to encode a protein (SEQ ID NO: 142) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 141) encoding the LXR protein (SEQ ID NO: 142) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 35B.
EXAMPLE 35
[0641] This example describes optimization of a nucleotide sequence encoding LXR for expression in Z. mobilis.
[0642] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0643] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 121) encoding the LXR protein (SEQ ID NO: 122) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 36A.
[0644] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 143) was found to encode a protein (SEQ ID NO: 144) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 122). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 143) encoding the LXR protein (SEQ ID NO: 144) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 36B.
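The data series behind a graphical display of this kind (z score as a function of codon pair position) might be assembled as in the following sketch. The lookup table `z`, the default score of 0.0 for codon pairs absent from the table, and the function names are hypothetical and serve only to illustrate the idea; plotting is left to any charting tool.

```python
def z_profile(cds, z, default=0.0):
    """Return [(codon pair position, z score)] along a coding sequence.

    `z` maps (codon, codon) tuples to z scores for the host organism.
    Positions are 1-based; pairs missing from `z` receive `default`.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    pairs = zip(codons, codons[1:])
    return [(pos, z.get(pair, default)) for pos, pair in enumerate(pairs, start=1)]


def flagged_positions(profile, threshold=3.0):
    """Return the positions whose z score exceeds the threshold."""
    return [pos for pos, score in profile if score > threshold]
```

For a modified gene of the kind described above, `flagged_positions` would be expected to return an empty list at a threshold of 3.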
EXAMPLE 36
[0645] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 32 and native LXR protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LXR antibody diluted 1:20,000. Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
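As a worked illustration of the culture quantities in the protocol above, the following sketch computes the inoculum volume and reagent masses for a 5 ml culture. It assumes that 1:100 means one part overnight culture per 100 parts final volume and that the L-arabinose percentages are weight per volume (g per 100 ml); neither convention is stated in the text, and the function names are hypothetical.

```python
def inoculum_ul(culture_ml, dilution=100):
    """Volume of overnight culture (in microliters) for a 1:dilution inoculum."""
    return culture_ml * 1000 / dilution


def ampicillin_ug(culture_ml, ug_per_ml=100.0):
    """Mass of ampicillin (in micrograms) at the stated working concentration."""
    return culture_ml * ug_per_ml


def arabinose_mg(culture_ml, percent_wv):
    """Mass of L-arabinose (in milligrams), assuming % w/v = g per 100 ml."""
    return percent_wv / 100 * culture_ml * 1000


if __name__ == "__main__":
    print(inoculum_ul(5.0))          # overnight culture to add, in microliters
    print(ampicillin_ug(5.0))        # ampicillin, in micrograms
    print(arabinose_mg(5.0, 0.002))  # low induction level, in milligrams
    print(arabinose_mg(5.0, 0.02))   # high induction level, in milligrams
```

Under these assumptions a 5 ml culture takes about 50 μl of overnight culture, 500 μg of ampicillin, and about 0.1 mg or 1 mg of L-arabinose for the two induction levels.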
[0646] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 37
[0647] This example describes optimization of a nucleotide sequence encoding LXR for expression in yeast.
[0648] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 value was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 value was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
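The final normalization step described above, in which each chisq3 value is expressed as a number of standard deviations from the mean chisq3 value, can be sketched as follows. The use of the population standard deviation (rather than the sample standard deviation) is an assumption of this sketch, as is the function name.

```python
from statistics import mean, pstdev


def z_scores(chisq3):
    """Convert {codon pair: chisq3} into {codon pair: z score}.

    Each value is reported as the number of standard deviations
    from the mean chisq3 value across all codon pairs.
    """
    values = list(chisq3.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("all chisq3 values are identical; z scores are undefined")
    return {pair: (value - mu) / sigma for pair, value in chisq3.items()}
```

A codon pair with a z score greater than 3 in the resulting table lies more than three standard deviations above the mean chisq3 value for the organism.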
[0649] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for S. cerevisiae.
[0650] A graphical display for the native gene (SEQ ID NO: 145) encoding the LXR protein (SEQ ID NO: 146) in T. reesei was prepared by plotting z scores of translational kinetics values for codon pair utilization in T. reesei as a function of codon pair position. The graphical display is provided in Figure 37.
[0651] A graphical display for the native gene (SEQ ID NO: 145) encoding the LXR protein (SEQ ID NO: 146) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 38A.
[0652] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 147) was found to encode a protein (SEQ ID NO: 148) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 147) encoding the LXR protein (SEQ ID NO: 148) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 38B.
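One simple way to remove codon pairs having z scores above a threshold while keeping 100% amino acid sequence identity is a greedy synonymous substitution, sketched below. This is not represented to be the procedure used to obtain the sequences of this example; the greedy left-to-right strategy, the default score of 0.0 for unlisted pairs, and all function names are assumptions. The sketch uses the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(codons):
    """Translate a list of codons into a one-letter amino acid string."""
    return "".join(CODON_TABLE[c] for c in codons)


def remove_high_z_pairs(cds, z, threshold=3.0):
    """Greedily swap synonymous codons so that no pair scores above threshold.

    `z` maps (codon, codon) tuples to z scores; unlisted pairs score 0.0.
    A pair is left unchanged if no synonymous codon fixes it.
    """
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    protein = translate(codons)

    def score(a, b):
        return z.get((a, b), 0.0)

    for i in range(len(codons) - 1):
        if score(codons[i], codons[i + 1]) <= threshold:
            continue
        # Try synonymous replacements for the second codon of the pair.
        for candidate in SYNONYMS[CODON_TABLE[codons[i + 1]]]:
            left_ok = score(codons[i], candidate) <= threshold
            right_ok = (i + 2 >= len(codons)
                        or score(candidate, codons[i + 2]) <= threshold)
            if left_ok and right_ok:
                codons[i + 1] = candidate
                break
    if translate(codons) != protein:
        raise RuntimeError("amino acid sequence changed")
    return "".join(codons)
```

Because only synonymous codons are substituted, the encoded protein is unchanged by construction; the final check merely confirms this.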
EXAMPLE 38
[0653] This example describes optimization of a nucleotide sequence encoding LXR for expression in bacteria.
[0654] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0655] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 145) encoding the LXR protein (SEQ ID NO: 146) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 39A.
[0656] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 153) was found to encode a protein (SEQ ID NO: 154) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 153) encoding the LXR protein (SEQ ID NO: 154) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 39B.
EXAMPLE 39
[0657] This example describes optimization of a nucleotide sequence encoding LXR for expression in P. pastoris.
[0658] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0659] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 145) encoding the LXR protein (SEQ ID NO: 146) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 40A.
[0660] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 159) was found to encode a protein (SEQ ID NO: 160) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 159) encoding the LXR protein (SEQ ID NO: 160) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 40B.
EXAMPLE 40
[0661] This example describes optimization of a nucleotide sequence encoding LXR for expression in K. lactis.
[0662] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0663] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 145) encoding the LXR protein (SEQ ID NO: 146) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 41A.
[0664] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 165) was found to encode a protein (SEQ ID NO: 166) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 165) encoding the LXR protein (SEQ ID NO: 166) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 41B.
EXAMPLE 41
[0665] This example describes optimization of a nucleotide sequence encoding LXR for expression in Z. mobilis.
[0666] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0667] The nucleotide sequence for the gene encoding the LXR protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 145) encoding the LXR protein (SEQ ID NO: 146) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 42A.
[0668] The nucleotide sequence for the gene encoding the LXR protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 167) was found to encode a protein (SEQ ID NO: 168) with 100% amino acid sequence identity to wild-type LXR (SEQ ID NO: 146). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 167) encoding the LXR protein (SEQ ID NO: 168) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 42B.
EXAMPLE 42
[0669] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 38 and native LXR protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-LXR antibody diluted 1:20,000. Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0670] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 43
[0671] This example describes optimization of a nucleotide sequence encoding XylA for expression in yeast.
[0672] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 value was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 value was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
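The recalculation that removes the influence of amino acid pair frequencies is not spelled out in this paragraph. One plausible reading, offered only as a hypothetical sketch, is to compute the expected count of each codon pair within its amino acid pair: the observed number of occurrences of the amino acid pair multiplied by the fraction of each amino acid that is encoded by the given codon. The function name and this formulation are assumptions; the sketch uses the standard genetic code.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def chisq2_expected(coding_sequences):
    """Expected codon pair counts conditioned on amino acid pair counts.

    expected(c1, c2) = N(aa1, aa2) * f(c1 | aa1) * f(c2 | aa2)
    (assumed formulation; not taken from the cited method).
    """
    codon_counts = Counter()
    aa_counts = Counter()
    aa_pair_counts = Counter()
    for cds in coding_sequences:
        codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
        for codon in codons:
            codon_counts[codon] += 1
            aa_counts[CODON_TABLE[codon]] += 1
        for a, b in zip(codons, codons[1:]):
            aa_pair_counts[(CODON_TABLE[a], CODON_TABLE[b])] += 1
    expected = {}
    for (aa1, aa2), n_pairs in aa_pair_counts.items():
        for c1, c2 in product(codon_counts, repeat=2):
            if CODON_TABLE[c1] == aa1 and CODON_TABLE[c2] == aa2:
                f1 = codon_counts[c1] / aa_counts[aa1]
                f2 = codon_counts[c2] / aa_counts[aa2]
                expected[(c1, c2)] = n_pairs * f1 * f2
    return expected
```

Under this formulation the expected counts for all codon pairs encoding a given amino acid pair sum to the observed count of that amino acid pair, so any bias in amino acid pair frequencies no longer contributes to the chi-squared value.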
[0673] The nucleotide sequence for the gene encoding the XylA protein was modified to optimize codon usage for S. cerevisiae.
[0674] A graphical display for the native gene (SEQ ID NO: 169) encoding the XylA protein (SEQ ID NO: 170) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 43.
[0675] A graphical display for the native gene (SEQ ID NO: 169) encoding the XylA protein (SEQ ID NO: 170) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 44A.
[0676] The nucleotide sequence for the gene encoding the XylA protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 171) was found to encode a protein (SEQ ID NO: 172) with 100% amino acid sequence identity to wild-type XylA (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 171) encoding the XylA protein (SEQ ID NO: 172) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 44B.
EXAMPLE 44
[0677] This example describes optimization of a nucleotide sequence encoding XylA for expression in bacteria.
[0678] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0679] The nucleotide sequence for the gene encoding the XylA protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 169) encoding the XylA protein (SEQ ID NO: 170) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 45A.
[0680] The nucleotide sequence for the gene encoding the XylA protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 177) was found to encode a protein (SEQ ID NO: 178) with 100% amino acid sequence identity to wild-type XylA (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 177) encoding the XylA protein (SEQ ID NO: 178) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 45B.
EXAMPLE 45
[0681] This example describes optimization of a nucleotide sequence encoding XylA for expression in P. pastoris.
[0682] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0683] The nucleotide sequence for the gene encoding the XylA protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 169) encoding the XylA protein (SEQ ID NO: 170) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 46A.
[0684] The nucleotide sequence for the gene encoding the XylA protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 183) was found to encode a protein (SEQ ID NO: 184) with 100% amino acid sequence identity to wild-type XylA (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 183) encoding the XylA protein (SEQ ID NO: 184) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 46B.
EXAMPLE 46
[0685] This example describes optimization of a nucleotide sequence encoding XylA for expression in K. lactis.
[0686] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0687] The nucleotide sequence for the gene encoding the XylA protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 169) encoding the XylA protein (SEQ ID NO: 170) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 47A.
[0688] The nucleotide sequence for the gene encoding the XylA protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 189) was found to encode a protein (SEQ ID NO: 190) with 100% amino acid sequence identity to wild-type XylA (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 189) encoding the XylA protein (SEQ ID NO: 190) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 47B.
EXAMPLE 47
[0689] This example describes optimization of a nucleotide sequence encoding XylA for expression in Z. mobilis.
[0690] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0691] The nucleotide sequence for the gene encoding the XylA protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 169) encoding the XylA protein (SEQ ID NO: 170) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 48A.
[0692] The nucleotide sequence for the gene encoding the XylA protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 191) was found to encode a protein (SEQ ID NO: 192) with 100% amino acid sequence identity to wild-type XylA (SEQ ID NO: 170). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 191) encoding the XylA protein (SEQ ID NO: 192) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 48B.
EXAMPLE 48
[0693] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 44 and native XylA protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hours at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-XylA antibody diluted 1:20,000. Rabbit IgG is visualized using a HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0694] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 49
[0695] This example describes optimization of a nucleotide sequence encoding AraA for expression in yeast.
[0696] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 value was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 value was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
[0697] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for S. cerevisiae.
[0698] A graphical display for the native gene (SEQ ID NO: 193) encoding the AraA protein (SEQ ID NO: 194) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 49.
[0699] A graphical display for the native gene (SEQ ID NO: 193) encoding the AraA protein (SEQ ID NO: 194) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 50A.
[0700] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 195) was found to encode a protein (SEQ ID NO: 196) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 195) encoding the AraA protein (SEQ ID NO: 196) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 50B.
EXAMPLE 50
[0701] This example describes optimization of a nucleotide sequence encoding AraA for expression in bacteria.
[0702] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0703] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 193) encoding the AraA protein (SEQ ID NO: 194) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 51A.
[0704] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 201) was found to encode a protein (SEQ ID NO: 202) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 201) encoding the AraA protein (SEQ ID NO: 202) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 51B.
EXAMPLE 51
[0705] This example describes optimization of a nucleotide sequence encoding AraA for expression in P. pastoris.
[0706] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0707] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 193) encoding the AraA protein (SEQ ID NO: 194) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 52A.
[0708] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 207) was found to encode a protein (SEQ ID NO: 208) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 207) encoding the AraA protein (SEQ ID NO: 208) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 52B.
EXAMPLE 52
[0709] This example describes optimization of a nucleotide sequence encoding AraA for expression in K. lactis.
[0710] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0711] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 193) encoding the AraA protein (SEQ ID NO: 194) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 53A.
[0712] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 213) was found to encode a protein (SEQ ID NO: 214) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 213) encoding the AraA protein (SEQ ID NO: 214) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 53B.
EXAMPLE 53
[0713] This example describes optimization of a nucleotide sequence encoding AraA for expression in Z. mobilis.
[0714] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0715] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 193) encoding the AraA protein (SEQ ID NO: 194) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 54A.
[0716] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 215) was found to encode a protein (SEQ ID NO: 216) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 194). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 215) encoding the AraA protein (SEQ ID NO: 216) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 54B.
EXAMPLE 54
[0717] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 50 and native AraA protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-AraA antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0718] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 55
[0719] This example describes optimization of a nucleotide sequence encoding AraB for expression in yeast.
[0720] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values so determined. The chisq1 value was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 value was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
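The calculation described in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of Hatfield and Gutman: the function names are hypothetical, the expected count is taken as the product of the first-position and second-position codon counts divided by the total number of pairs, the chi-squared term is given the sign of (observed - expected), and the chisq2 and chisq3 corrections for amino acid pair and dinucleotide bias are omitted because their formulas are not reproduced here. Only the final normalization to z scores follows the wording above directly.

```python
from collections import Counter
from statistics import mean, pstdev


def codon_pairs(cds):
    """Split an in-frame coding sequence into adjacent codon pairs."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return list(zip(codons, codons[1:]))


def signed_chisq1(sequences):
    """Signed (obs - exp)^2 / exp for each observed codon pair.

    Assumption: 'random' pair usage means the two codons of a pair are
    independent, so exp = first_count * second_count / total_pairs.
    """
    observed = Counter()
    for cds in sequences:
        observed.update(codon_pairs(cds))
    total = sum(observed.values())
    first, second = Counter(), Counter()
    for (a, b), n in observed.items():
        first[a] += n
        second[b] += n
    table = {}
    for (a, b), obs in observed.items():
        exp = first[a] * second[b] / total
        sign = 1.0 if obs >= exp else -1.0  # assumed sign convention
        table[(a, b)] = sign * (obs - exp) ** 2 / exp
    return table


def z_scores(values):
    """Express each value as standard deviations from the mean of all pairs."""
    vals = list(values.values())
    mu, sd = mean(vals), pstdev(vals)
    if sd == 0:
        raise ValueError("all values are identical; z scores are undefined")
    return {pair: (v - mu) / sd for pair, v in values.items()}
```

In this sketch the z scores would be computed from chisq1; in the method described above they are computed from chisq3, so the corrected values would be passed to `z_scores` instead.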
[0721] The nucleotide sequence for the gene encoding the AraB protein was modified to optimize codon usage for S. cerevisiae.
[0722] A graphical display for the native gene (SEQ ID NO: 217) encoding the AraB protein (SEQ ID NO: 218) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 55.
[0723] A graphical display for the native gene (SEQ ID NO: 217) encoding the AraB protein (SEQ ID NO: 218) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 56A.
[0724] The nucleotide sequence for the gene encoding the AraB protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 219) was found to encode a protein (SEQ ID NO: 220) with 100% amino acid sequence identity to wild-type AraB (SEQ ID NO: 218). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 219) encoding the AraB protein (SEQ ID NO: 220) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 56B.
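One way such a modification could be carried out is sketched below in Python. This is an illustrative greedy pass, not the method used to generate SEQ ID NO: 219: the function names and the replacement strategy are assumptions, the input is assumed to be an uppercase, in-frame DNA sequence whose length is a multiple of three, codon pairs absent from the z score table are treated as having a score of 0, and the standard genetic code is used. A single pass of this kind is not guaranteed to remove every pair above the threshold, so a real design would re-check the result.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame DNA sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def remove_high_z_pairs(cds, z, threshold=3.0):
    """Greedy sketch: swap the second codon of each high-z pair for the
    synonymous codon that gives the lowest z score with both neighbours."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]

    def local_max(i, candidate):
        # Worst z score among the pairs that would contain the candidate.
        score = z.get((codons[i - 1], candidate), 0.0)
        if i + 1 < len(codons):
            score = max(score, z.get((candidate, codons[i + 1]), 0.0))
        return score

    for i in range(1, len(codons)):
        if z.get((codons[i - 1], codons[i]), 0.0) <= threshold:
            continue
        aa = CODON_TABLE[codons[i]]
        synonyms = [c for c, a in CODON_TABLE.items() if a == aa]
        codons[i] = min(synonyms, key=lambda c: local_max(i, c))
    return "".join(codons)
```

Because only synonymous codons are substituted, `translate` returns the same protein for the input and the output, which corresponds to the 100% amino acid sequence identity reported above.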
EXAMPLE 56
[0725] This example describes optimization of a nucleotide sequence encoding AraB for expression in bacteria.
[0726] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0727] The nucleotide sequence for the gene encoding the AraB protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 217) encoding the AraB protein (SEQ ID NO: 218) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 57A.
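The data behind a graphical display of this kind can be sketched in Python as a list of (position, z score) points. The function names are hypothetical, codon pair positions are numbered from 1 as an assumption, and pairs missing from the z score table are assigned 0.0 for illustration only. The resulting list could be passed to any plotting library.

```python
def z_profile(cds, z):
    """Return (codon pair position, z score) points along a coding sequence."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    pairs = zip(codons, codons[1:])
    return [(pos, z.get(pair, 0.0)) for pos, pair in enumerate(pairs, start=1)]


def positions_above(profile, threshold=3.0):
    """List the codon pair positions whose z score exceeds the threshold."""
    return [pos for pos, score in profile if score > threshold]
```

Comparing `positions_above` for a native gene and for its modified counterpart gives a quick check that the pairs with z scores greater than 3 have been removed.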
[0728] The nucleotide sequence for the gene encoding the AraB protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 225) was found to encode a protein (SEQ ID NO: 226) with 100% amino acid sequence identity to wild-type AraB (SEQ ID NO: 218). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 225) encoding the AraB protein (SEQ ID NO: 226) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 57B.
EXAMPLE 57
[0729] This example describes optimization of a nucleotide sequence encoding AraB for expression in P. pastoris.
[0730] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0731] The nucleotide sequence for the gene encoding the AraB protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 217) encoding the AraB protein (SEQ ID NO: 218) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 58A.
[0732] The nucleotide sequence for the gene encoding the AraB protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 231) was found to encode a protein (SEQ ID NO: 232) with 100% amino acid sequence identity to wild-type AraB (SEQ ID NO: 218). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 231) encoding the AraB protein (SEQ ID NO: 232) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 58B.
EXAMPLE 58
[0733] This example describes optimization of a nucleotide sequence encoding AraB for expression in K. lactis.
[0734] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0735] The nucleotide sequence for the gene encoding the AraB protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 217) encoding the AraB protein (SEQ ID NO: 218) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 59A.
[0736] The nucleotide sequence for the gene encoding the AraB protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 237) was found to encode a protein (SEQ ID NO: 238) with 100% amino acid sequence identity to wild-type AraB (SEQ ID NO: 218). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 237) encoding the AraB protein (SEQ ID NO: 238) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 59B.
EXAMPLE 59
[0737] This example describes optimization of a nucleotide sequence encoding AraB for expression in Z. mobilis.
[0738] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0739] The nucleotide sequence for the gene encoding the AraB protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 217) encoding the AraB protein (SEQ ID NO: 218) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 60A.
[0740] The nucleotide sequence for the gene encoding the AraB protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 239) was found to encode a protein (SEQ ID NO: 240) with 100% amino acid sequence identity to wild-type AraB (SEQ ID NO: 218). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 239) encoding the AraB protein (SEQ ID NO: 240) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 60B.
EXAMPLE 60
[0741] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 56 and native AraB protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-AraB antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0742] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 61
[0743] This example describes optimization of a nucleotide sequence encoding AraD for expression in yeast.
[0744] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values so determined. The chisq1 value was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 value was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
[0745] The nucleotide sequence for the gene encoding the AraD protein was modified to optimize codon usage for S. cerevisiae.
[0746] A graphical display for the native gene (SEQ ID NO: 241) encoding the AraD protein (SEQ ID NO: 242) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 61.
[0747] A graphical display for the native gene (SEQ ID NO: 241) encoding the AraD protein (SEQ ID NO: 242) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 62A.
[0748] The nucleotide sequence for the gene encoding the AraD protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 243) was found to encode a protein (SEQ ID NO: 244) with 100% amino acid sequence identity to wild-type AraD (SEQ ID NO: 242). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 243) encoding the AraD protein (SEQ ID NO: 244) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 62B.
EXAMPLE 62
[0749] This example describes optimization of a nucleotide sequence encoding AraD for expression in bacteria.
[0750] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0751] The nucleotide sequence for the gene encoding the AraD protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 241) encoding the AraD protein (SEQ ID NO: 242) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 63A.
[0752] The nucleotide sequence for the gene encoding the AraD protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 249) was found to encode a protein (SEQ ID NO: 250) with 100% amino acid sequence identity to wild-type AraD (SEQ ID NO: 242). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 249) encoding the AraD protein (SEQ ID NO: 250) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 63B.
EXAMPLE 63
[0753] This example describes optimization of a nucleotide sequence encoding AraD for expression in P. pastoris.
[0754] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0755] The nucleotide sequence for the gene encoding the AraD protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 241) encoding the AraD protein (SEQ ID NO: 242) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 64A.
[0756] The nucleotide sequence for the gene encoding the AraD protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 255) was found to encode a protein (SEQ ID NO: 256) with 100% amino acid sequence identity to wild-type AraD (SEQ ID NO: 242). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 255) encoding the AraD protein (SEQ ID NO: 256) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 64B.
EXAMPLE 64
[0757] This example describes optimization of a nucleotide sequence encoding AraD for expression in K. lactis.
[0758] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0759] The nucleotide sequence for the gene encoding the AraD protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 241) encoding the AraD protein (SEQ ID NO: 242) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 65A.
[0760] The nucleotide sequence for the gene encoding the AraD protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 261) was found to encode a protein (SEQ ID NO: 262) with 100% amino acid sequence identity to wild-type AraD (SEQ ID NO: 242). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 261) encoding the AraD protein (SEQ ID NO: 262) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 65B.
EXAMPLE 65
[0761] This example describes optimization of a nucleotide sequence encoding AraD for expression in Z. mobilis.
[0762] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0763] The nucleotide sequence for the gene encoding the AraD protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 241) encoding the AraD protein (SEQ ID NO: 242) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 66A.
[0764] The nucleotide sequence for the gene encoding the AraD protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 263) was found to encode a protein (SEQ ID NO: 264) with 100% amino acid sequence identity to wild-type AraD (SEQ ID NO: 242). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 263) encoding the AraD protein (SEQ ID NO: 264) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 66B.
EXAMPLE 66
[0765] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 62 and native AraD protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hrs at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-AraD antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0766] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 67
[0767] This example describes optimization of a nucleotide sequence encoding Xyr for expression in yeast.
[0768] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values so determined. The chisq1 value was recalculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 value was recalculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of number of standard deviations from the mean chisq3 value.
[0769] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for S. cerevisiae.
[0770] A graphical display for the native gene (SEQ ID NO: 265) encoding the Xyr protein (SEQ ID NO: 266) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 67A.
[0771] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 267) was found to encode a protein (SEQ ID NO: 268) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 266). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 267) encoding the Xyr protein (SEQ ID NO: 268) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 67B.
EXAMPLE 68
[0772] This example describes optimization of a nucleotide sequence encoding Xyr for expression in bacteria.
[0773] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0774] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 265) encoding the Xyr protein (SEQ ID NO: 266) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 68A.
[0775] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 273) was found to encode a protein (SEQ ID NO: 274) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 266). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 273) encoding the Xyr protein (SEQ ID NO: 274) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 68B.
EXAMPLE 69
[0776] This example describes optimization of a nucleotide sequence encoding Xyr for expression in P. pastoris.
[0777] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0778] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 265) encoding the Xyr protein (SEQ ID NO: 266) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 69A.
[0779] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 279) was found to encode a protein (SEQ ID NO: 280) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 266). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 279) encoding the Xyr protein (SEQ ID NO: 280) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 69B.
EXAMPLE 70
[0780] This example describes optimization of a nucleotide sequence encoding Xyr for expression in K. lactis.
[0781] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0782] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 265) encoding the Xyr protein (SEQ ID NO: 266) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 70A.
[0783] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 285) was found to encode a protein (SEQ ID NO: 286) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 266). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 285) encoding the Xyr protein (SEQ ID NO: 286) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 70B.
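The modification step in these examples has two properties that can be checked mechanically: no codon pair in the modified gene has a z score above 3, and the encoded protein is unchanged. The sketch below illustrates both with a simple greedy synonymous substitution. It is an assumption for illustration, not the algorithm used to produce the SEQ ID NOs; the function names, the toy sequence, and the `z_table` values are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    return "".join(CODON_TABLE[c] for c in split_codons(seq))


def synonyms(codon):
    """Other codons that encode the same amino acid as `codon`."""
    return [c for c, aa in CODON_TABLE.items()
            if aa == CODON_TABLE[codon] and c != codon]


def remove_high_z_pairs(seq, z_table, threshold=3.0):
    """Greedy sketch: swap the second codon of each offending pair for a
    synonym that keeps both neighbouring pairs at or below the threshold.
    Pairs with no acceptable synonym (e.g. Met, Trp) are left unchanged."""
    codons = split_codons(seq)

    def z(a, b):
        return z_table.get((a, b), 0.0)  # unknown pairs assumed neutral

    for i in range(len(codons) - 1):
        if z(codons[i], codons[i + 1]) <= threshold:
            continue
        for alt in synonyms(codons[i + 1]):
            left_ok = z(codons[i], alt) <= threshold
            right_ok = i + 2 >= len(codons) or z(alt, codons[i + 2]) <= threshold
            if left_ok and right_ok:
                codons[i + 1] = alt
                break
    return "".join(codons)


native = "ATGAAAGCTTAA"
toy_z = {("AAA", "GCT"): 4.2}           # hypothetical z score
modified = remove_high_z_pairs(native, toy_z)
print(modified)
print(translate(native) == translate(modified))  # identical protein
```

A production method would also need to reconsider the first codon of a pair and to weigh codon usage, which this sketch ignores.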
EXAMPLE 71
[0784] This example describes optimization of a nucleotide sequence encoding Xyr for expression in Z. mobilis.
[0785] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0786] The nucleotide sequence for the gene encoding the Xyr protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 265) encoding the Xyr protein (SEQ ID NO: 266) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 71A.
[0787] The nucleotide sequence for the gene encoding the Xyr protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 287) was found to encode a protein (SEQ ID NO: 288) with 100% amino acid sequence identity to wild-type Xyr (SEQ ID NO: 266). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 287) encoding the Xyr protein (SEQ ID NO: 288) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 71B.
EXAMPLE 72
[0788] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 68 and native Xyr protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hr at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-Xyr antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
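The culture volumes implied by this protocol follow from simple dilution arithmetic. The sketch below works them out under stated assumptions: "1:100" is read as one part overnight culture in 100 parts final volume, the L-arabinose percentages are taken as final concentrations, and the 20% L-arabinose stock is a hypothetical value that the protocol does not specify. The small volume added is ignored.

```python
def inoculum_volume_ul(culture_ml, dilution=100):
    """Volume of overnight culture (in microliters) for a 1:dilution inoculation."""
    return culture_ml * 1000.0 / dilution


def stock_volume_ul(final_percent, culture_ml, stock_percent):
    """Volume of stock (in microliters) giving final_percent in culture_ml."""
    return final_percent * culture_ml * 1000.0 / stock_percent


print(inoculum_volume_ul(5))                 # overnight culture for 5 ml at 1:100
for final in (0.002, 0.02):
    # 20% stock is an assumed concentration, not taken from the protocol
    print(final, stock_volume_ul(final, 5, 20))
```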
[0789] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 73
[0790] This example describes optimization of a nucleotide sequence encoding AraA for expression in yeast.
[0791] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 was re-calculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was re-calculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
[0792] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for S. cerevisiae.
[0793] A graphical display for the native gene (SEQ ID NO: 289) encoding the AraA protein (SEQ ID NO: 290) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 72.
[0794] A graphical display for the native gene (SEQ ID NO: 289) encoding the AraA protein (SEQ ID NO: 290) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 73A.
[0795] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 291) was found to encode a protein (SEQ ID NO: 292) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 290). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 291) encoding the AraA protein (SEQ ID NO: 292) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 73B.
EXAMPLE 74
[0796] This example describes optimization of a nucleotide sequence encoding AraA for expression in bacteria.
[0797] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0798] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 289) encoding the AraA protein (SEQ ID NO: 290) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 74A.
[0799] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 293) was found to encode a protein (SEQ ID NO: 294) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 290). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 293) encoding the AraA protein (SEQ ID NO: 294) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 74B.
EXAMPLE 75
[0800] This example describes optimization of a nucleotide sequence encoding AraA for expression in P. pastoris.
[0801] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0802] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 289) encoding the AraA protein (SEQ ID NO: 290) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 75A.
[0803] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 295) was found to encode a protein (SEQ ID NO: 296) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 290). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 295) encoding the AraA protein (SEQ ID NO: 296) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 75B.
EXAMPLE 76
[0804] This example describes optimization of a nucleotide sequence encoding AraA for expression in K. lactis.
[0805] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0806] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 289) encoding the AraA protein (SEQ ID NO: 290) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 76A.
[0807] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 297) was found to encode a protein (SEQ ID NO: 298) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 290). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 297) encoding the AraA protein (SEQ ID NO: 298) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 76B.
EXAMPLE 77
[0808] This example describes optimization of a nucleotide sequence encoding AraA for expression in Z. mobilis.
[0809] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0810] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 289) encoding the AraA protein (SEQ ID NO: 290) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 77A.
[0811] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 299) was found to encode a protein (SEQ ID NO: 300) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 290). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 299) encoding the AraA protein (SEQ ID NO: 300) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 77B.
EXAMPLE 78
[0812] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 74 and native AraA protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hr at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-AraA antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0813] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 79
[0814] This example describes optimization of a nucleotide sequence encoding AraA for expression in yeast.
[0815] Chi-squared values for S. cerevisiae were determined using previously reported methods (Hatfield and Gutman, "Codon Pair Utilization Bias in Bacteria, Yeast, and Mammals" in Transfer RNA in Protein Synthesis, Hatfield, Lee and Pirtle Eds. CRC Press (Boca Raton, FL) 1993). Briefly, non-redundant protein coding regions for S. cerevisiae were obtained from the GenBank sequence database (75,403 codon pairs in 177 sequences for S. cerevisiae) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. The chi-squared value "chisq1" was generated from the expected and observed values determined. The chisq1 was re-calculated to remove any influence of non-randomness in amino acid pair frequencies, yielding "chisq2." The chisq2 was re-calculated to remove any influence of non-randomness in dinucleotide frequencies, yielding "chisq3." The z scores of chisq3 were calculated by determining the mean chisq3 value and corresponding standard deviation for all codon pairs, and normalizing each chisq3 value to be reported in terms of the number of standard deviations from the mean chisq3 value.
[0816] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for S. cerevisiae.
[0817] A graphical display for the native gene (SEQ ID NO: 301) encoding the AraA protein (SEQ ID NO: 302) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 78.
[0818] A graphical display for the native gene (SEQ ID NO: 301) encoding the AraA protein (SEQ ID NO: 302) in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 79A.
[0819] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in S. cerevisiae greater than 3. The resulting nucleotide sequence (SEQ ID NO: 303) was found to encode a protein (SEQ ID NO: 304) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 302). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 303) encoding the AraA protein (SEQ ID NO: 304) expressed in S. cerevisiae was prepared by plotting z scores of translational kinetics values for codon pair utilization in S. cerevisiae as a function of codon pair position. The graphical display is provided in Figure 79B.
EXAMPLE 80
[0820] This example describes optimization of a nucleotide sequence encoding AraA for expression in bacteria.
[0821] Chi-squared values for E. coli were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for E. coli were obtained from the GenBank sequence database (75,096 codon pairs in 237 sequences for E. coli) to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0822] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for E. coli. A graphical display for the native gene (SEQ ID NO: 301) encoding the AraA protein (SEQ ID NO: 302) in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 80A.
[0823] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in E. coli greater than 3. The resulting nucleotide sequence (SEQ ID NO: 305) was found to encode a protein (SEQ ID NO: 306) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 302). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 305) encoding the AraA protein (SEQ ID NO: 306) expressed in E. coli was prepared by plotting z scores of translational kinetics values for codon pair utilization in E. coli as a function of codon pair position. The graphical display is provided in Figure 80B.
EXAMPLE 81
[0824] This example describes optimization of a nucleotide sequence encoding AraA for expression in P. pastoris.
[0825] Chi-squared values for P. pastoris were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for P. pastoris were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0826] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for P. pastoris. A graphical display for the native gene (SEQ ID NO: 301) encoding the AraA protein (SEQ ID NO: 302) in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 81A.
[0827] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in P. pastoris greater than 3. The resulting nucleotide sequence (SEQ ID NO: 307) was found to encode a protein (SEQ ID NO: 308) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 302). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 307) encoding the AraA protein (SEQ ID NO: 308) expressed in P. pastoris was prepared by plotting z scores of translational kinetics values for codon pair utilization in P. pastoris as a function of codon pair position. The graphical display is provided in Figure 81B.
EXAMPLE 82
[0828] This example describes optimization of a nucleotide sequence encoding AraA for expression in K. lactis.
[0829] Chi-squared values for K. lactis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for K. lactis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0830] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for K. lactis. A graphical display for the native gene (SEQ ID NO: 301) encoding the AraA protein (SEQ ID NO: 302) in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 82A.
[0831] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in K. lactis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 309) was found to encode a protein (SEQ ID NO: 310) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 302). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 309) encoding the AraA protein (SEQ ID NO: 310) expressed in K. lactis was prepared by plotting z scores of translational kinetics values for codon pair utilization in K. lactis as a function of codon pair position. The graphical display is provided in Figure 82B.
EXAMPLE 83
[0832] This example describes optimization of a nucleotide sequence encoding AraA for expression in Z. mobilis.
[0833] Chi-squared values for Z. mobilis were determined as described in Example 1, with the following differences. Briefly, non-redundant protein coding regions for Z. mobilis were obtained from GenBank sequences to determine an observed number of occurrences for each codon pair. The expected number of occurrences of each codon pair was calculated under the assumption that the codon pairs are used randomly. Chi-squared values chisq1, chisq2, chisq3 and z scores of chisq3 were calculated as described in Example 1.
[0834] The nucleotide sequence for the gene encoding the AraA protein was modified to optimize codon usage for Z. mobilis. A graphical display for the native gene (SEQ ID NO: 301) encoding the AraA protein (SEQ ID NO: 302) in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 83A.
[0835] The nucleotide sequence for the gene encoding the AraA protein was modified to no longer contain codon pairs having z scores in Z. mobilis greater than 3. The resulting nucleotide sequence (SEQ ID NO: 311) was found to encode a protein (SEQ ID NO: 312) with 100% amino acid sequence identity to wild-type AraA (SEQ ID NO: 302). A graphical display for the codon pair utilization-modified gene (SEQ ID NO: 311) encoding the AraA protein (SEQ ID NO: 312) expressed in Z. mobilis was prepared by plotting z scores of translational kinetics values for codon pair utilization in Z. mobilis as a function of codon pair position. The graphical display is provided in Figure 83B.
EXAMPLE 84
[0836] Expression in E. coli of the codon optimized, codon pair utilization-based modification (Hot-Rod) from Example 80 and native AraA protein is examined by Western blot analysis. Each vector is transformed into E. coli strain TOP10 (F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 deoR recA1 araD139 Δ(ara-leu)7697 galU galK rpsL (StrR) endA1 nupG). An overnight culture is inoculated at 1:100 into 5 ml of LB medium plus 100 μg/ml ampicillin and grown at 37°C to an OD600 of 0.5. Protein expression is induced by addition of 0.002 or 0.02% L-arabinose, and the cultures are grown for 3 hr at 37°C. Cells are harvested by centrifugation and the cell pellets are resuspended in phosphate buffered saline. Cells are disrupted by sonication, and supernatant and pellet fractions are resolved in a 4-20% SDS-polyacrylamide gel (Pierce). Proteins are transferred to Immobilon-P (Millipore, Bedford, MA) and are incubated with rabbit polyclonal anti-AraA antibody diluted 1:20,000. Rabbit IgG is visualized using an HRP-conjugated secondary antibody and ECL + Plus (Amersham, Buckinghamshire, UK) according to the manufacturer's instructions.
[0837] Western blot analysis demonstrates that changes to a polypeptide-encoding nucleic acid sequence can increase expression of the polypeptide, particularly when the polypeptide is heterologously expressed.
EXAMPLE 85
Scheme for introduction of heterologous sequences into S. cerevisiae.
[0838] Nucleic acid constructs can be prepared, for example, as shown in Figure 54A (upper panel). Figure 54A (upper panel) shows nucleic acid constructs for expressing heterologous genes in S. cerevisiae. A yeast copy-number control element (CEN/ARS or 2 μm) was introduced into the EcoRI site of the polylinker of the bacterial vector pUC18. The PGK1 promoter sequence (PGK1p) and CYC1 terminator (CYC1t) sequences were introduced into a unique site (SspI, B) separated by a restriction site (D, SpeI/XhoI) which can be used for cloning of the heterologous gene of interest (GENE X) by ligation or recombination rescue cloning. In order to select for this gene in yeast, the desired nutritional MARKER (URA3, TRP1, CAN1, or MET15) was introduced to the polylinker in the SmaI site flanked by recognition sites for the P1 phage Cre recombinase (loxP). Figure 54B (lower panel) shows a scheme for the integration of heterologous gene expression cassettes. Stable expression of combinations of genes is achieved through sequential or simultaneous integration of heterologous genes into yeast chromosomes via recombinational replacement of Ty1 elements (ending in delta repeats, open boxes) inserted at positions which allow substantial gene expression. Primers containing outside ends with similarity to target genomic sequences (black boxes) and inside ends which overlap the PGK1p and loxP sequences are used in a PCR reaction to amplify a fragment containing GENE X and the selectable MARKER. The PCR fragment is integrated via a double crossover in terminal regions of homology with the genome and integrants are selected. In order to recycle selectable markers, cells are transformed with a plasmid expressing the Cre recombinase and cells in which the MARKER is lost by Cre-mediated recombination between the flanking loxP sites are selected by growth on medium containing the appropriate reverse selection agent.
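The marker recycling step can be pictured as a string operation: recombination between two loxP sites in the same orientation removes the intervening MARKER and leaves a single loxP site behind. The sketch below models only that outcome. The 34-bp loxP sequence is the one present in the marker primers of this example; the function name `cre_excise` and the short flanking and marker sequences are hypothetical placeholders, and direct-repeat orientation of the two sites is assumed.

```python
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"  # 34-bp loxP site


def cre_excise(seq):
    """Model Cre-mediated excision between the first two direct-repeat loxP sites.

    Returns the sequence with the intervening segment and one loxP copy removed.
    If fewer than two sites are found, the (uppercased) sequence is returned as is.
    """
    seq = seq.upper()
    first = seq.find(LOXP)
    if first == -1:
        return seq
    second = seq.find(LOXP, first + len(LOXP))
    if second == -1:
        return seq
    # Keep everything up to the first site, then resume at the second site.
    return seq[:first] + seq[second:]


# Placeholder cassette: upstream flank, loxP, "marker", loxP, downstream flank.
cassette = "GGATCC" + LOXP + "ATGTCTAAA" + LOXP + "TTTAAA"
recycled = cre_excise(cassette)
print(recycled.count(LOXP))   # number of loxP sites remaining
print(len(cassette) - len(recycled))
```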
Construction of vectors containing removable selectable cassettes
[0839] 1. The PGK1 promoter region was amplified from genomic S. cerevisiae DNA using primers PGK1-FOR (5'-AATATTaggcattgcaagaattactcgtgagtaagg-3') and PGK1-REV (5'-ACTAGTatatttgttgtaaaaagtagataattacttcc-3'), which places an SspI site at the 5' end of the construct and a SpeI site at the 3' end of the construct. Next, the CYC1 terminator was amplified from plasmid pNB2258 using primers CYC1-FOR (5'-ACTAGTgatatctgcgcaCTCGAGtcatgtaattagttatgtcacgc-3') and CYC1-REV (5'-AATATTggccgcaaattaaagccttcgagcgtcccaaaaccttctc-3'). This amplified product therefore has SpeI-12N-XhoI restriction sites at the 5' end and an SspI site at the 3' end. These two cassettes were then digested with SspI and SpeI and ligated together. The ligated fragment composed of the PGK1-CYCterm with flanking SspI sites was then ligated into the SspI site of pUC18, creating vectors pXP13 (forward direction) and pXP14 (reverse direction).
[0840] 2. The TEF1 promoter region was amplified from genomic S. cerevisiae DNA using primers TEF-1-FOR-SspI (5'-AATATTaccgcgaatccttacatcac-3') and TEF1-REV (5'-ccACTAGTtttgtaattaaaacttagattagattgctatgc-3'), which places an SspI site at the 5' end of the construct and a SpeI site at the 3' end of the construct. This was digested with SspI and SpeI. Next, the CYC1 terminator fragment described in section 1 with SpeI-12N-XhoI restriction sites at the 5' end and an SspI site at the 3' end was ligated with the TEF1 promoter fragment. The ligated fragment composed of the TEF1-CYC1term with flanking SspI sites was then ligated into the SspI site of pUC18, creating vectors pXP17 (forward direction) and pXP18 (reverse direction).
[0841] 3. The 2 μm origin was amplified from plasmid pRS425 using primers 2um-FOR (5'-GAATTCaacgaagcatctgtgcttcattttgtagaa-3') and 2um-REV (5'-GAATTCgtatgatccaatatcaaaggaaatgatagc-3'). These primers place EcoRI sites at each side of the 2 μm origin cassette. Following sequence verification, this cassette was ligated into the pXP13 and pXP17 vectors described above, creating vectors pXP200 and pXP400, respectively.
[0842] 4. In a separate construction series, the CEN/ARS origin was amplified from plasmid pRS315 using primers CEN/ARS-FOR (5'-GAATTCatcacgtgctataaaaataattataattt-3') and CEN/ARS-REV (5'-GAATTCgtaacttacacgcgcctcgtatcttttaatg-3'). These primers place EcoRI sites at each side of the CEN/ARS origin cassette. Following sequence verification, this cassette was ligated into the pXP13 and pXP17 vectors described above, creating vectors pXP100 and pXP300, respectively.
[0843] 5. Each of the four selection markers was amplified from plasmids with the addition of loxP recombination sites at both the 5' and 3' ends, as well as SmaI restriction sites for downstream cloning.
[0844] a. The CAN1 marker was amplified from plasmid pRS319a using primers CAN1-FOR (5'-CCCGGGATACTTCGTATAGCATACATTATACGAAGTTATgggcccattatgaatacgcacctctatgtatttccg-3') and CAN1-REV (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATggtgaatcatcgataaaaataaatatactgag-3'). This fragment was cloned into the pCRBlunt II cloning vector from Invitrogen and sequence verified. The unique SpeI site within this cloned fragment was then replaced by site-directed mutagenesis, while preserving the amino acid context, using primers CAN1-SDM-FOR (5'-cattcaaggtactgaactcgttggtatcactgctggtg-3') and CAN1-SDM-REV (5'-caccagcagtgataccaacgagttcagtaccttgaaatg-3'). This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300, creating plasmids pXP100CAN, pXP100CAN-REV, pXP300CAN, and pXP300CAN-REV. The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400, creating plasmids pXP200CAN, pXP200CAN-REV, pXP400CAN and pXP400CAN-REV.
[0845] b. The MET15 marker was amplified from plasmid pRS401 using primers MET-FOR (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATgccatcctcatgaaaactgtgtaacataataaccg-3') and MET-REV (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATgtatagtacttgtgagagaaagtaggttatac-3'). This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300, creating plasmids pXP100MET, pXP100MET-REV, pXP300MET and pXP300MET-REV. The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400, creating plasmids pXP200MET, pXP200MET-REV, pXP400MET and pXP400MET-REV.
[0846] c. The TRP1 marker was amplified from plasmid pRS314 using primers TRP-FOR (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATaacgacattactatatatataatataggaagc-3') and TRP-REV (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATcaggcaagtgcacaaacaatacttaaataaatactactc-3'). This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300, creating plasmids pXP100TRP, pXP100TRP-REV, pXP300TRP and pXP300TRP-REV. The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400, creating plasmids pXP200TRP, pXP200TRP-REV, pXP400TRP and pXP400TRP-REV.
[0847] d. The URA3 marker was amplified from plasmid pRSl lό using primers URA-FOR (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATcagggtccataaagctttcaattcatc-3') and URA-REV (5'-CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTATgggtaataactgatataattaaattgaagctct-3'). This construct was then ligated into the unique SmaI site of plasmids pXP100 and pXP300, creating plasmids pXP100URA, pXP100URA-REV, pXP300URA and pXP300URA-REV. The same construct was also ligated into the unique SmaI site of plasmids pXP200 and pXP400, creating plasmids pXP200URA, pXP200URA-REV, pXP400URA and pXP400URA-REV.
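The primer designs in steps 1 to 5 rely on restriction sites and loxP sites placed at the 5' end of each primer. As an illustration only, the following Python sketch shows one way to confirm by simple string matching that a primer carries the stated sites. The function name `site_positions` and the dictionary layout are hypothetical and not part of the described method; the recognition sequences are the standard ones for the named enzymes, and the loxP sequence is the 34-nucleotide site as it appears in the marker primers above.

```python
# Standard recognition sequences for the enzymes named in the construction steps.
SITES = {
    "SspI": "AATATT",
    "SpeI": "ACTAGT",
    "XhoI": "CTCGAG",
    "EcoRI": "GAATTC",
    "SmaI": "CCCGGG",
}

# 34-nt loxP site as written in the marker primers (e.g. MET-FOR).
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"

# A few primers copied from the text (5' to 3'); case marks site vs. template.
PRIMERS = {
    "PGK1-FOR": "AATATTaggcattgcaagaattactcgtgagtaagg",
    "CYC1-FOR": "ACTAGTgatatctgcgcaCTCGAGtcatgtaattagttatgtcacgc",
    "MET-FOR": (
        "CCCGGGATAACTTCGTATAGCATACATTATACGAAGTTAT"
        "gccatcctcatgaaaactgtgtaacataataaccg"
    ),
}


def site_positions(primer):
    """Map each enzyme name to the 0-based offsets where its site occurs."""
    seq = primer.upper()
    hits = {}
    for name, site in SITES.items():
        offsets = [
            i for i in range(len(seq) - len(site) + 1) if seq.startswith(site, i)
        ]
        if offsets:
            hits[name] = offsets
    return hits


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        has_loxp = LOXP in seq.upper()
        print(name, site_positions(seq), "loxP" if has_loxp else "no loxP")
```

For CYC1-FOR, the SpeI site starts at offset 0 and the XhoI site at offset 18, which matches the SpeI-12N-XhoI arrangement described in step 1. A check of this kind only looks at the primer strings and does not model digestion or ligation.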
[0848] Since modifications will be apparent to those of skill in this art, it is intended that this invention be limited only by the scope of the appended claims.

Claims

WHAT IS CLAIMED IS:
1. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 of the following codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGTATT (nucleotides 619-624)
TTGAAC (nucleotides 16-21)
TTGAAC (nucleotides 274-279)
TTGAAC (nucleotides 670-675)
TTGAAC (nucleotides 688-693)
CTTTCT (nucleotides 286-291)
GCCATT (nucleotides 181-186)
TCTCCA (nucleotides 697-702)
TCTCCA (nucleotides 751-756)
ATCAAG (nucleotides 103-108)
ATCAAG (nucleotides 541-546)
ATCAAG (nucleotides 721-726)
GCCAAG (nucleotides 889-894).
2. The nucleotide sequence of Claim 1, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
3. The nucleotide sequence of Claim 1, in which at least 3 of the following codon pair replacements have been made:
GGTATT (nucleotides 619-624) replaced with GGAATT
TTGAAC (nucleotides 16-21) replaced with TTAAAT
TTGAAC (nucleotides 274-279) replaced with CTAAAT
TTGAAC (nucleotides 670-675) replaced with TTAAAT
TTGAAC (nucleotides 688-693) replaced with TTAAAT
CTTTCT (nucleotides 286-291) replaced with CTATCT
GCCATT (nucleotides 181-186) replaced with GCTATT
TCTCCA (nucleotides 697-702) replaced with TCACCA
TCTCCA (nucleotides 751-756) replaced with TCACCA
ATCAAG (nucleotides 103-108) replaced with ATTAAA
ATCAAG (nucleotides 541-546) replaced with ATTAAA
ATCAAG (nucleotides 721-726) replaced with ATTAAG
GCCAAG (nucleotides 889-894) replaced with GCTAAA.
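The replacements recited in Claim 3 are said to encode identical amino acids. As a minimal sketch, and not the method used to derive the claimed sequences, the Python below checks that property for a codon pair and converts a 1-based nucleotide position into residue numbers. It assumes the standard genetic code, which is an assumption on our part: some yeasts read CTG as serine rather than leucine, so a different table may be needed for a given source or host organism. The helper names `translate_pair`, `is_synonymous` and `residue_numbers` are hypothetical.

```python
from itertools import product

# Standard genetic code (assumed here), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_pair(pair):
    """Translate a six-nucleotide codon pair into two one-letter residues."""
    pair = pair.upper()
    if len(pair) != 6:
        raise ValueError("a codon pair must be six nucleotides long")
    return CODON_TABLE[pair[:3]] + CODON_TABLE[pair[3:]]


def is_synonymous(original, replacement):
    """True when both codon pairs encode the same two amino acids."""
    return translate_pair(original) == translate_pair(replacement)


def residue_numbers(first_nt):
    """Residue numbers of the two codons, given the 1-based start nucleotide."""
    if (first_nt - 1) % 3 != 0:
        raise ValueError("start position is not on a codon boundary")
    first = (first_nt - 1) // 3 + 1
    return first, first + 1


if __name__ == "__main__":
    # Three replacement entries copied from Claim 3.
    for start, old, new in [
        (619, "GGTATT", "GGAATT"),
        (274, "TTGAAC", "CTAAAT"),
        (889, "GCCAAG", "GCTAAA"),
    ]:
        print(start, residue_numbers(start), translate_pair(old),
              is_synonymous(old, new))
```

This check covers only the "identical amino acids" branch of the claim language; deciding whether a non-identical replacement is a conservative substitution would need a separate, explicitly chosen substitution scheme.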
4. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 of the following codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAAGAT (nucleotides 136 - 141)
CTTTCT (nucleotides 286 - 291)
GAAGAT (nucleotides 415 - 420)
ATTGCC (nucleotides 793 - 798)
ATTGCC (nucleotides 886 - 891)
GACTGG (nucleotides 928 - 933).
5. The nucleotide sequence of Claim 4, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
6. The nucleotide sequence of Claim 4, in which at least 3 of the following codon pair replacements have been made:
GAAGAT (nucleotides 136 - 141) replaced with GAAGAT
CTTTCT (nucleotides 286 - 291) replaced with CTATCT
GAAGAT (nucleotides 415 - 420) replaced with GAAGAT
ATTGCC (nucleotides 793 - 798) replaced with ATCGCT
ATTGCC (nucleotides 886 - 891) replaced with ATAGCT
GACTGG (nucleotides 928 - 933) replaced with GATTGG.
7. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 of the following codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TCCAAG (nucleotides 226-231)
ATCAAG (nucleotides 103-108)
ATCAAG (nucleotides 541-546)
ATCAAG (nucleotides 721-726)
TTCAAG (nucleotides 343-348)
TTCAAC (nucleotides 913-918)
ATCAAC (nucleotides 901-906)
GGTATT (nucleotides 619-624)
GTCAAG (nucleotides 172-177)
GTCAAG (nucleotides 199-204)
GTCAAG (nucleotides 460-465)
GACGAA (nucleotides 187-192)
GACGAA (nucleotides 865-870)
GGTATC (nucleotides 193-198)
CCAAGA (nucleotides 589-594)
CCAAGA (nucleotides 823-828)
TTGAAC (nucleotides 16-21)
TTGAAC (nucleotides 274-279)
TTGAAC (nucleotides 670-675)
TTGAAC (nucleotides 688-693).
8. The nucleotide sequence of Claim 7, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
9. The nucleotide sequence of Claim 7, in which at least 3 of the following codon pair replacements have been made:
TCCAAG (nucleotides 226-231) replaced with TCTAAA
ATCAAG (nucleotides 103-108) replaced with ATTAAA
ATCAAG (nucleotides 541-546) replaced with ATTAAA
ATCAAG (nucleotides 721-726) replaced with ATTAAG
TTCAAG (nucleotides 343-348) replaced with TTTAAA
TTCAAC (nucleotides 913-918) replaced with TTTAAT
ATCAAC (nucleotides 901-906) replaced with ATTAAT
GGTATT (nucleotides 619-624) replaced with GGAATT
GTCAAG (nucleotides 172-177) replaced with GTTAAA
GTCAAG (nucleotides 199-204) replaced with GTTAAA
GTCAAG (nucleotides 460-465) replaced with GTTAAA
GACGAA (nucleotides 187-192) replaced with GATGAA
GACGAA (nucleotides 865-870) replaced with GATGAA
GGTATC (nucleotides 193-198) replaced with GGAATT
CCAAGA (nucleotides 589-594) replaced with CCTAGA
CCAAGA (nucleotides 823-828) replaced with CCTCGT
TTGAAC (nucleotides 16-21) replaced with TTAAAT
TTGAAC (nucleotides 274-279) replaced with CTAAAT
TTGAAC (nucleotides 670-675) replaced with TTAAAT
TTGAAC (nucleotides 688-693) replaced with TTAAAT.
10. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 of the following codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAC (nucleotides 16 - 21)
AAGAAG (nucleotides 175 - 180)
GCCATT (nucleotides 181 - 186)
GGTATC (nucleotides 193 - 198)
TTGAAC (nucleotides 274 - 279)
CTTTCT (nucleotides 286 - 291)
TTCCCA (nucleotides 331 - 336)
TTCCCA (nucleotides 499 - 504)
TTGAAC (nucleotides 670 - 675)
TTGAAC (nucleotides 688 - 693)
GCCAAG (nucleotides 889 - 894).
11. The nucleotide sequence of Claim 10, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
12. The nucleotide sequence of Claim 10, in which at least 3 of the following codon pair replacements have been made:
TTGAAC (nucleotides 16 - 21) replaced with TTAAAC
AAGAAG (nucleotides 175 - 180) replaced with AAAAAG
GCCATT (nucleotides 181 - 186) replaced with GCTATT
GGTATC (nucleotides 193 - 198) replaced with GGAATT
TTGAAC (nucleotides 274 - 279) replaced with TTAAAT
CTTTCT (nucleotides 286 - 291) replaced with TTATCT
TTCCCA (nucleotides 331 - 336) replaced with TTTCCA
TTCCCA (nucleotides 499 - 504) replaced with TTTCCA
TTGAAC (nucleotides 670 - 675) replaced with TTAAAT
TTGAAC (nucleotides 688 - 693) replaced with TTAAAT
GCCAAG (nucleotides 889 - 894) replaced with GCTAAA.
13. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 2, wherein at least 3 of the following codon pairs of SEQ ID NO: 1 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GCCGGT (nucleotides 166 - 171)
GGTATC (nucleotides 193 - 198)
GCCTTG (nucleotides 271 - 276)
GCCGGT (nucleotides 466 - 471)
GCTTTG (nucleotides 508 - 513)
GGTATT (nucleotides 619 - 624)
GCTTTG (nucleotides 685 - 690)
AACAGC (nucleotides 850 - 855)
GCCAAG (nucleotides 889 - 894).
14. The nucleotide sequence of Claim 13, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
15. The nucleotide sequence of Claim 13, in which at least 3 of the following codon pair replacements have been made:
GCCGGT (nucleotides 166 - 171) replaced with GCTGGT
GGTATC (nucleotides 193 - 198) replaced with GGCATT
GCCTTG (nucleotides 271 - 276) replaced with GCCCTT
GCCGGT (nucleotides 466 - 471) replaced with GCTGGT
GCTTTG (nucleotides 508 - 513) replaced with GCGTTG
GGTATT (nucleotides 619 - 624) replaced with GGCATT
GCTTTG (nucleotides 685 - 690) replaced with GCTCTT
AACAGC (nucleotides 850 - 855) replaced with AATTCT
GCCAAG (nucleotides 889 - 894) replaced with GCCAAA.
16. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
17. The nucleotide sequence of Claim 16, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
18. The nucleotide sequence of Claim 16, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
19. A xylose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
20. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 5-301 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
21. The xylose reductase-encoding nucleotide sequence of Claim 20, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
22. The xylose reductase-encoding nucleotide sequence of any of Claims 20-21, wherein no replacement codon encoding amino acids 5-301 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair TCCAAG when expressed in the native organism.
23. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-318 of wild-type xylose reductase as set forth in SEQ ID NO: 2 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-5 of SEQ ID NO: 2 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
24. The xylose reductase-encoding nucleotide sequence of Claim 23, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
25. The xylose reductase-encoding nucleotide sequence of any of Claims 23-24, wherein at least one replacement codon encoding amino acids 1-5 of SEQ ID NO: 2 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair CCTTCT when expressed in the native organism.
26. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AAGAAA (nucleotides 382 - 387)
TTGAAG (nucleotides 694 - 699)
ATCAAA (nucleotides 190 - 195)
TTGAAC (nucleotides 34 - 39)
TTGAAC (nucleotides 313 - 318)
GCCATT (nucleotides 901 - 906)
GCTACT (nucleotides 10 - 15)
ATCAAG (nucleotides 121 - 126)
ATCAAG (nucleotides 202 - 207)
ATCAAG (nucleotides 559 - 564).
27. The nucleotide sequence of Claim 26, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
28. The nucleotide sequence of Claim 26, in which at least 3 of the following codon pair replacements have been made:
AAGAAA (nucleotides 382 - 387) replaced with AAAAAG
TTGAAG (nucleotides 694 - 699) replaced with TTAAAA
ATCAAA (nucleotides 190 - 195) replaced with ATTAAA
TTGAAC (nucleotides 34 - 39) replaced with TTAAAT
TTGAAC (nucleotides 313 - 318) replaced with TTAAAT
GCCATT (nucleotides 901 - 906) replaced with GCTATA
GCTACT (nucleotides 10 - 15) replaced with GCTACC
ATCAAG (nucleotides 121 - 126) replaced with ATTAAA
ATCAAG (nucleotides 202 - 207) replaced with ATTAAA
ATCAAG (nucleotides 559 - 564) replaced with ATTAAA.
29. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAAGAG (nucleotides 226 - 231)
ACCTGG (nucleotides 454 - 459)
TTGCAG (nucleotides 574 - 579)
ATTGCC (nucleotides 748 - 753)
TTGCAG (nucleotides 895 - 900)
ATTGCC (nucleotides 904 - 909).
30. The nucleotide sequence of Claim 29, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
31. The nucleotide sequence of Claim 29, in which at least 3 of the following codon pair replacements have been made:
GAAGAG (nucleotides 226 - 231) replaced with GAAGAA
ACCTGG (nucleotides 454 - 459) replaced with ACTTGG
TTGCAG (nucleotides 574 - 579) replaced with CTCCAG
ATTGCC (nucleotides 748 - 753) replaced with ATTGCG
TTGCAG (nucleotides 895 - 900) replaced with CTCCAG
ATTGCC (nucleotides 904 - 909) replaced with ATCGCG.
32. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AAGAAA (nucleotides 382 - 387)
TCCAAG (nucleotides 244 - 249)
ATCAAG (nucleotides 121 - 126)
ATCAAG (nucleotides 202 - 207)
ATCAAG (nucleotides 559 - 564)
TTCAAC (nucleotides 931 - 936)
ATCAAA (nucleotides 190 - 195)
GTCAAG (nucleotides 217 - 222)
GTCAAG (nucleotides 739 - 744)
GGTATC (nucleotides 187 - 192)
GGTATC (nucleotides 505 - 510)
CCAAGA (nucleotides 823 - 828)
TTGAAC (nucleotides 34 - 39)
TTGAAC (nucleotides 313 - 318).
33. The nucleotide sequence of Claim 32, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
34. The nucleotide sequence of Claim 32, in which at least 3 of the following codon pair replacements have been made:
AAGAAA (nucleotides 382 - 387) replaced with AAGAAG
TCCAAG (nucleotides 244 - 249) replaced with TCTAAA
ATCAAG (nucleotides 121 - 126) replaced with ATTAAA
ATCAAG (nucleotides 202 - 207) replaced with ATCAAA
ATCAAG (nucleotides 559 - 564) replaced with ATCAAA
TTCAAC (nucleotides 931 - 936) replaced with TTCAAC
ATCAAA (nucleotides 190 - 195) replaced with ATCAAA
GTCAAG (nucleotides 217 - 222) replaced with GTTAAA
GTCAAG (nucleotides 739 - 744) replaced with GTTAAA
GGTATC (nucleotides 187 - 192) replaced with GGTATC
GGTATC (nucleotides 505 - 510) replaced with GGTATC
CCAAGA (nucleotides 823 - 828) replaced with CCGCGC
TTGAAC (nucleotides 34 - 39) replaced with CTGAAC
TTGAAC (nucleotides 313 - 318) replaced with CTGAAC.
35. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAC (nucleotides 34 - 39)
GGTATC (nucleotides 187 - 192)
ATCAAA (nucleotides 190 - 195)
AAGAAG (nucleotides 271 - 276)
TTGAAC (nucleotides 313 - 318)
TTCCCA (nucleotides 349 - 354)
AAGAAA (nucleotides 382 - 387)
GGTATC (nucleotides 505 - 510)
TTGAAG (nucleotides 694 - 699)
GCCATT (nucleotides 901 - 906).
36. The nucleotide sequence of Claim 35, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
37. The nucleotide sequence of Claim 35, in which at least 3 of the following codon pair replacements have been made:
TTGAAC (nucleotides 34 - 39) replaced with TTAAAT
GGTATC (nucleotides 187 - 192) replaced with GGAATT
ATCAAA (nucleotides 190 - 195) replaced with ATTAAA
AAGAAG (nucleotides 271 - 276) replaced with AAAAAA
TTGAAC (nucleotides 313 - 318) replaced with TTAAAT
TTCCCA (nucleotides 349 - 354) replaced with TTTCCA
AAGAAA (nucleotides 382 - 387) replaced with AAAAAA
GGTATC (nucleotides 505 - 510) replaced with GGAATC
TTGAAG (nucleotides 694 - 699) replaced with TTAAAA
GCCATT (nucleotides 901 - 906) replaced with GCTATC.
38. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26, wherein at least 3 of the following codon pairs of SEQ ID NO: 25 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGTATC (nucleotides 187 - 192)
GAAGGC (nucleotides 208 - 213)
GCTTTG (nucleotides 289 - 294)
GCTTTG (nucleotides 463 - 468)
GGTATC (nucleotides 505 - 510)
GCCTTG (nucleotides 571 - 576)
GCCTTG (nucleotides 703 - 708).
39. The nucleotide sequence of Claim 38, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
40. The nucleotide sequence of Claim 38, in which at least 3 of the following codon pair replacements have been made:
GGTATC (nucleotides 187 - 192) replaced with GGGATT
GAAGGC (nucleotides 208 - 213) replaced with GAAGGG
GCTTTG (nucleotides 289 - 294) replaced with GCCCTT
GCTTTG (nucleotides 463 - 468) replaced with GCCCTT
GGTATC (nucleotides 505 - 510) replaced with GGCATT
GCCTTG (nucleotides 571 - 576) replaced with GCCTTA
GCCTTG (nucleotides 703 - 708) replaced with GCATTG.
41. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
42. The nucleotide sequence of Claim 41, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
43. The nucleotide sequence of Claim 41, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
44. A xylose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
45. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 25 and which encode amino acids 11-306 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
46. The xylose reductase-encoding nucleotide sequence of Claim 45, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
47. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 26 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 25 and which encode amino acids 1-11 of SEQ ID NO: 26 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
48. The xylose reductase-encoding nucleotide sequence of Claim 47, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
49. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AAGAAA (nucleotides 106 - 111)
TTGAAG (nucleotides 637 - 642)
CTTTTG (nucleotides 565 - 570)
GGTATT (nucleotides 277 - 282)
TTGAAC (nucleotides 25 - 30)
ACTTTG (nucleotides 880 - 885)
GCCATT (nucleotides 790 - 795)
GCTACT (nucleotides 349 - 354)
GCTACT (nucleotides 664 - 669)
ATCAAG (nucleotides 709 - 714)
ATCAAG (nucleotides 772 - 777)
GCCAAG (nucleotides 583 - 588)
GCCAAG (nucleotides 646 - 651).
50. The nucleotide sequence of Claim 49, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
51. The nucleotide sequence of Claim 49, in which at least 3 of the following codon pair replacements have been made:
AAGAAA (nucleotides 106 - 111) replaced with AAAAAG
TTGAAG (nucleotides 637 - 642) replaced with TTAAAA
CTTTTG (nucleotides 565 - 570) replaced with TTGTTG
GGTATT (nucleotides 277 - 282) replaced with GGAATA
TTGAAC (nucleotides 25 - 30) replaced with TTAAAT
ACTTTG (nucleotides 880 - 885) replaced with ACATTG
GCCATT (nucleotides 790 - 795) replaced with GCTATT
GCTACT (nucleotides 349 - 354) replaced with GCTACC
GCTACT (nucleotides 664 - 669) replaced with GCAACT
ATCAAG (nucleotides 709 - 714) replaced with ATTAAA
ATCAAG (nucleotides 772 - 777) replaced with ATTAAA
GCCAAG (nucleotides 583 - 588) replaced with GCTAAA
GCCAAG (nucleotides 646 - 651) replaced with GCTAAA.
52. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CCTTCC (nucleotides 13 - 18)
AAGAAA (nucleotides 106 - 111)
GTCAGC (nucleotides 448 - 453)
CTCGGT (nucleotides 460 - 465)
GTTGCC (nucleotides 535 - 540)
TTTGGT (nucleotides 544 - 549)
GCTGAA (nucleotides 760 - 765)
ATTGCC (nucleotides 793 - 798)
GTCAGC (nucleotides 841 - 846).
53. The nucleotide sequence of Claim 52, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
54. The nucleotide sequence of Claim 52, in which at least 3 of the following codon pair replacements have been made:
CCTTCC (nucleotides 13 - 18) replaced with CCATCT
AAGAAA (nucleotides 106 - 111) replaced with AAAAAG
GTCAGC (nucleotides 448 - 453) replaced with GTTTCA
CTCGGT (nucleotides 460 - 465) replaced with TTGGGT
GTTGCC (nucleotides 535 - 540) replaced with GTTGCT
TTTGGT (nucleotides 544 - 549) replaced with TTCGGT
GCTGAA (nucleotides 760 - 765) replaced with GCTGAG
ATTGCC (nucleotides 793 - 798) replaced with ATTGCT
GTCAGC (nucleotides 841 - 846) replaced with GTATCT.
55. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AAGAAA (nucleotides 106 - 111)
TCCAAG (nucleotides 361 - 366)
TCCAAG (nucleotides 502 - 507)
TCCAAG (nucleotides 682 - 687)
ATCAAG (nucleotides 709 - 714)
ATCAAG (nucleotides 772 - 777)
TTCAAG (nucleotides 406 - 411)
TTCAAG (nucleotides 1012 - 1017)
CTTTTG (nucleotides 565 - 570)
TTCAAC (nucleotides 676 - 681)
TTCAAC (nucleotides 907 - 912)
GGTATT (nucleotides 277 - 282)
GTCAAG (nucleotides 103 - 108)
GTCAAG (nucleotides 430 - 435)
GTCAAG (nucleotides 1063 - 1068)
GACGAA (nucleotides 298 - 303)
GGTATC (nucleotides 115 - 120)
TTGAAC (nucleotides 25 - 30)
TTTGAC (nucleotides 937 - 942).
56. The nucleotide sequence of Claim 55, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
57. The nucleotide sequence of Claim 55, in which at least 3 of the following codon pair replacements have been made:
AAGAAA (nucleotides 106 - 111) replaced with AAAAAG
TCCAAG (nucleotides 361 - 366) replaced with TCTAAA
TCCAAG (nucleotides 502 - 507) replaced with TCTAAA
TCCAAG (nucleotides 682 - 687) replaced with TCTAAA
ATCAAG (nucleotides 709 - 714) replaced with ATTAAA
ATCAAG (nucleotides 772 - 777) replaced with ATTAAA
TTCAAG (nucleotides 406 - 411) replaced with TTTAAA
TTCAAG (nucleotides 1012 - 1017) replaced with TTTAAA
CTTTTG (nucleotides 565 - 570) replaced with TTGTTG
TTCAAC (nucleotides 676 - 681) replaced with TTTAAT
TTCAAC (nucleotides 907 - 912) replaced with TTTAAT
GGTATT (nucleotides 277 - 282) replaced with GGAATA
GTCAAG (nucleotides 103 - 108) replaced with GTTAAA
GTCAAG (nucleotides 430 - 435) replaced with GTTAAA
GTCAAG (nucleotides 1063 - 1068) replaced with GTTAAA
GACGAA (nucleotides 298 - 303) replaced with GATGAA
GGTATC (nucleotides 115 - 120) replaced with GGAATT
TTGAAC (nucleotides 25 - 30) replaced with TTAAAT
TTTGAC (nucleotides 937 - 942) replaced with TTCGAT.
58. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 324 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAC (nucleotides 25 - 30 )
AAGAAA (nucleotides 106 - 111 )
GGTATC (nucleotides 115 - 120 )
GGTACC (nucleotides 388 - 393 )
CTTTTG (nucleotides 565 - 570 )
GCCAAG (nucleotides 583 - 588 )
TTGAAG (nucleotides 637 - 642 )
GCCAAG (nucleotides 646 - 651 )
GCCATT (nucleotides 790 - 795 )
TTCCCA (nucleotides 847 - 852 ).
59. The nucleotide sequence of Claim 58, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
60. The nucleotide sequence of Claim 58, in which at least 3 of the following codon pair replacements have been made:
TTGAAC (nucleotides 25 - 30 ) replaced with TTAAAT
AAGAAA (nucleotides 106 - 111 ) replaced with AAAAAG
GGTATC (nucleotides 115 - 120 ) replaced with GGAATC
GGTACC (nucleotides 388 - 393 ) replaced with GGTACA
CTTTTG (nucleotides 565 - 570 ) replaced with CTCTTG
GCCAAG (nucleotides 583 - 588 ) replaced with GCTAAA
TTGAAG (nucleotides 637 - 642 ) replaced with TTAAAG
GCCAAG (nucleotides 646 - 651 ) replaced with GCTAAA
GCCATT (nucleotides 790 - 795 ) replaced with GCAATC
TTCCCA (nucleotides 847 - 852 ) replaced with TTCCCT.
61. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 324 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50, wherein at least 3 of the following codon pairs of SEQ ID NO: 49 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GATGCC (nucleotides 61 - 66 )
GGTATC (nucleotides 115 - 120 )
GCCGGT (nucleotides 205 - 210 )
GGTATT (nucleotides 277 - 282 )
GAAGGC (nucleotides 367 - 372 )
GCCAAG (nucleotides 583 - 588 )
GCCAAG (nucleotides 646 - 651 )
ACTTTG (nucleotides 880 - 885 )
GCTATT (nucleotides 1021 - 1026 )
GAAGCC (nucleotides 1027 - 1032 )
GTCAGA (nucleotides 1042 - 1047 )
GCCGGT (nucleotides 1048 - 1053 ).
62. The nucleotide sequence of Claim 61, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
63. The nucleotide sequence of Claim 61, in which at least 3 of the following codon pair replacements have been made:
GATGCC (nucleotides 61 - 66 ) replaced with GATGCT
GGTATC (nucleotides 115 - 120 ) replaced with GGCATT
GCCGGT (nucleotides 205 - 210 ) replaced with GCTGGA
GGTATT (nucleotides 277 - 282 ) replaced with GGCATT
GAAGGC (nucleotides 367 - 372 ) replaced with GAAGGT
GCCAAG (nucleotides 583 - 588 ) replaced with GCTAAA
GCCAAG (nucleotides 646 - 651 ) replaced with GCCAAA
ACTTTG (nucleotides 880 - 885 ) replaced with ACCTTG
GCTATT (nucleotides 1021 - 1026 ) replaced with GCGATT
GAAGCC (nucleotides 1027 - 1032 ) replaced with GAGGCT
GTCAGA (nucleotides 1042 - 1047 ) replaced with GTTCGT
GCCGGT (nucleotides 1048 - 1053 ) replaced with GCTGGA.
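The "at least 3 of the following codon pair replacements" test can be illustrated with a short script. This is only a sketch under stated assumptions: SEQ ID NO: 49 is not reproduced in this section, so `build_toy_sequence` creates a hypothetical placeholder sequence of `N` filler with the wild-type pairs stamped in at the listed positions, and the nucleotide ranges are read as 1-based and inclusive. All function names are illustrative.

```python
# (start, wild-type pair, replacement pair); start is the 1-based position of the
# first nucleotide, as in "nucleotides 25 - 30". Entries copied from Claim 60.
CLAIM_60 = [
    (25, "TTGAAC", "TTAAAT"),
    (106, "AAGAAA", "AAAAAG"),
    (115, "GGTATC", "GGAATC"),
    (388, "GGTACC", "GGTACA"),
    (565, "CTTTTG", "CTCTTG"),
    (583, "GCCAAG", "GCTAAA"),
    (637, "TTGAAG", "TTAAAG"),
    (646, "GCCAAG", "GCTAAA"),
    (790, "GCCATT", "GCAATC"),
    (847, "TTCCCA", "TTCCCT"),
]


def build_toy_sequence(length, table):
    """Hypothetical stand-in for SEQ ID NO: 49: 'N' filler plus wild-type pairs."""
    seq = ["N"] * length
    for start, wild, _repl in table:
        seq[start - 1:start + 5] = wild  # 1-based inclusive range -> Python slice
    return "".join(seq)


def apply_replacements(seq, table, starts):
    """Apply the table entries whose start position is in `starts`."""
    chars = list(seq)
    for start, wild, repl in table:
        if start not in starts:
            continue
        if seq[start - 1:start + 5] != wild:
            raise ValueError(f"expected {wild} at nucleotides {start}-{start + 5}")
        chars[start - 1:start + 5] = repl
    return "".join(chars)


def count_replacements_made(seq, table):
    """Count entries whose replacement pair is present at its listed position."""
    return sum(seq[start - 1:start + 5] == repl for start, _wild, repl in table)


toy = build_toy_sequence(900, CLAIM_60)
edited = apply_replacements(toy, CLAIM_60, {25, 583, 847})
meets_threshold = count_replacements_made(edited, CLAIM_60) >= 3
```

Replacing a pair leaves the sequence length unchanged, so the coordinates of the remaining entries stay valid after each substitution.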
64. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
65. The nucleotide sequence of Claim 64, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
66. The nucleotide sequence of Claim 64, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
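The threshold in Claim 66 can be sketched as a simple comparison against 1.5 times the standard deviation of the host organism's values. The example below is an illustration only: the translational kinetics values are invented, the codon pairs chosen are arbitrary, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since this section does not say how the statistic is computed.

```python
import statistics


def pairs_above_threshold(values, multiplier=1.5):
    """Return codon pairs whose value exceeds multiplier * standard deviation.

    `values` maps codon pair -> translational kinetics value for one host.
    The population standard deviation is an assumption of this sketch.
    """
    sd = statistics.pstdev(values.values())
    cutoff = multiplier * sd
    return sorted(pair for pair, value in values.items() if value > cutoff)


# Invented values for a hypothetical host organism (not taken from the patent).
hypothetical_values = {
    "AAGAAA": 4.0,
    "AAAAAG": -1.0,
    "TCCAAG": 1.0,
    "TCTAAA": 0.0,
    "GTCAAG": -4.0,
}

flagged = pairs_above_threshold(hypothetical_values)
```

With these invented numbers the standard deviation is about 2.61, so the cutoff is about 3.91 and only the pair valued at 4.0 exceeds it.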
67. A xylitol dehydrogenase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
68. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 28-146 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
69. The xylitol dehydrogenase-encoding nucleotide sequence of Claim 68, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
70. The xylitol dehydrogenase-encoding nucleotide sequence of any of Claims 68-69, wherein no replacement codon encoding amino acids 28-146 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair AAGAAA when expressed in the native organism.
71. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 175-314 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
72. The xylitol dehydrogenase-encoding nucleotide sequence of Claim 71, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
73. The xylitol dehydrogenase-encoding nucleotide sequence of any of Claims 71-72, wherein no replacement codon encoding amino acids 175-314 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair TCCAAG when expressed in the native organism.
74. A xylitol dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 363 of wild-type xylitol dehydrogenase as set forth in SEQ ID NO: 50 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 146-175 of SEQ ID NO: 50 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
75. The xylitol dehydrogenase-encoding nucleotide sequence of Claim 74, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
76. The xylitol dehydrogenase-encoding nucleotide sequence of any of Claims 74-75, wherein at least one replacement codon encoding amino acids 146-175 of SEQ ID NO: 50 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair TCCAAG when expressed in the native organism.
77. A D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAA (nucleotides 1858 - 1863)
TTGAAG (nucleotides 67 - 72)
TTGAAG (nucleotides 793 - 798)
GAAAGT (nucleotides 1849 - 1854)
GGTATT (nucleotides 283 - 288)
GGTATT (nucleotides 1213 - 1218)
GGGTTC (nucleotides 43 - 48)
TTGAAC (nucleotides 1276 - 1281)
ACTTTG (nucleotides 1366 - 1371)
GCCATT (nucleotides 190 - 195)
GATATC (nucleotides 490 - 495)
GATATC (nucleotides 679 - 684)
TCTCAA (nucleotides 1021 - 1026)
TTCCCC (nucleotides 262 - 267)
ATCAAG (nucleotides 1261 - 1266)
ATCAAG (nucleotides 1606 - 1611)
GCCAAG (nucleotides 1717 - 1722)
GCCAAG (nucleotides 1840 - 1845).
78. The nucleotide sequence of Claim 77, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
79. The nucleotide sequence of Claim 77, in which at least 3 of the following codon pair replacements have been made:
TTGAAA (nucleotides 1858 - 1863) replaced with TTAAAA
TTGAAG (nucleotides 67 - 72) replaced with TTAAAA
TTGAAG (nucleotides 793 - 798) replaced with TTAAAA
GAAAGT (nucleotides 1849 - 1854) replaced with GAATCA
GGTATT (nucleotides 283 - 288) replaced with GGAATT
GGTATT (nucleotides 1213 - 1218) replaced with GGAATT
GGGTTC (nucleotides 43 - 48) replaced with GGTTTT
TTGAAC (nucleotides 1276 - 1281) replaced with TTAAAT
ACTTTG (nucleotides 1366 - 1371) replaced with ACTCTA
GCCATT (nucleotides 190 - 195) replaced with GCTATT
GATATC (nucleotides 490 - 495) replaced with GATATA
GATATC (nucleotides 679 - 684) replaced with GACATT
TCTCAA (nucleotides 1021 - 1026) replaced with TCACAA
TTCCCC (nucleotides 262 - 267) replaced with TTTCCA
ATCAAG (nucleotides 1261 - 1266) replaced with ATTAAG
ATCAAG (nucleotides 1606 - 1611) replaced with ATTAAA
GCCAAG (nucleotides 1717 - 1722) replaced with GCTAAA
GCCAAG (nucleotides 1840 - 1845) replaced with GCTAAG.
80. A D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAAGAG (nucleotides 451 - 456)
GAAGAG (nucleotides 703 - 708)
TTCCTC (nucleotides 37 - 42)
GCCAGT (nucleotides 613 - 618)
GCCAGT (nucleotides 1693 - 1698)
AAAGAG (nucleotides 442 - 447)
GCCAGA (nucleotides 1099 - 1104)
GCCAGA (nucleotides 1552 - 1557)
AGCCAG (nucleotides 379 - 384)
ATTGCC (nucleotides 847 - 852)
GCCTGT (nucleotides 1666 - 1671).
81. The nucleotide sequence of Claim 80, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
82. The nucleotide sequence of Claim 80, in which at least 3 of the following codon pair replacements have been made:
GAAGAG (nucleotides 451 - 456) replaced with GAAGAA
GAAGAG (nucleotides 703 - 708) replaced with GAAGAA
TTCCTC (nucleotides 37 - 42) replaced with TTCCTG
GCCAGT (nucleotides 613 - 618) replaced with GCGTCT
GCCAGT (nucleotides 1693 - 1698) replaced with GCTAGC
AAAGAG (nucleotides 442 - 447) replaced with AAAGAA
GCCAGA (nucleotides 1099 - 1104) replaced with GCTCGT
GCCAGA (nucleotides 1552 - 1557) replaced with GCTCGT
AGCCAG (nucleotides 379 - 384) replaced with TCTCAG
ATTGCC (nucleotides 847 - 852) replaced with ATCGCG
GCCTGT (nucleotides 1666 - 1671) replaced with GCTTGC.
83. A D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TCGTTG (nucleotides 934 - 939)
GATATC (nucleotides 490 - 495)
GATATC (nucleotides 679 - 684)
ATCAAG (nucleotides 1261 - 1266)
ATCAAG (nucleotides 1606 - 1611)
AAGTTT (nucleotides 1498 - 1503)
TTCAAG (nucleotides 403 - 408)
TTCAAG (nucleotides 556 - 561)
TTGAAA (nucleotides 1858 - 1863)
TTCAAC (nucleotides 268 - 273)
TTCAAC (nucleotides 697 - 702)
TTCAAC (nucleotides 877 - 882)
TTCAAC (nucleotides 1198 - 1203)
ATCAAC (nucleotides 133 - 138)
ATCAAC (nucleotides 166 - 171)
ATCAAC (nucleotides 1750 - 1755)
GGTATT (nucleotides 283 - 288)
GGTATT (nucleotides 1213 - 1218)
GTCAAG (nucleotides 1795 - 1800)
GACGAA (nucleotides 172 - 177)
GACGAA (nucleotides 1117 - 1122)
GGTATC (nucleotides 781 - 786)
GGGTTC (nucleotides 43 - 48)
TCTTTG (nucleotides 1543 - 1548)
TCGTTA (nucleotides 370 - 375)
TTGAAC (nucleotides 1276 - 1281).
84. The nucleotide sequence of Claim 83, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
85. The nucleotide sequence of Claim 83, in which at least 3 of the following codon pair replacements have been made:
TCGTTG (nucleotides 934 - 939) replaced with TCTCTG
GATATC (nucleotides 490 - 495) replaced with GACATC
GATATC (nucleotides 679 - 684) replaced with GACATC
ATCAAG (nucleotides 1261 - 1266) replaced with ATCAAA
ATCAAG (nucleotides 1606 - 1611) replaced with ATCAAA
AAGTTT (nucleotides 1498 - 1503) replaced with AAGTTC
TTCAAG (nucleotides 403 - 408) replaced with TTCAAA
TTCAAG (nucleotides 556 - 561) replaced with TTCAAA
TTGAAA (nucleotides 1858 - 1863) replaced with CTGAAA
TTCAAC (nucleotides 268 - 273) replaced with TTCAAC
TTCAAC (nucleotides 697 - 702) replaced with TTTAAC
TTCAAC (nucleotides 877 - 882) replaced with TTCAAC
TTCAAC (nucleotides 1198 - 1203) replaced with TTCAAC
ATCAAC (nucleotides 133 - 138) replaced with ATCAAC
ATCAAC (nucleotides 166 - 171) replaced with ATCAAC
ATCAAC (nucleotides 1750 - 1755) replaced with ATCAAC
GGTATT (nucleotides 283 - 288) replaced with GGTATC
GGTATT (nucleotides 1213 - 1218) replaced with GGTATC
GTCAAG (nucleotides 1795 - 1800) replaced with GTTAAA
GACGAA (nucleotides 172 - 177) replaced with GACGAA
GACGAA (nucleotides 1117 - 1122) replaced with GACGAA
GGTATC (nucleotides 781 - 786) replaced with GGTATC
GGGTTC (nucleotides 43 - 48) replaced with GGTTTC
TCTTTG (nucleotides 1543 - 1548) replaced with TCTCTC
TCGTTA (nucleotides 370 - 375) replaced with TCCCTG
TTGAAC (nucleotides 1276 - 1281) replaced with CTGAAC.
86. A D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGGTTC (nucleotides 43 - 48 )
TTGAAG (nucleotides 67 - 72 )
GCCATT (nucleotides 190 - 195 )
AAGAAG (nucleotides 250 - 255 )
TTCCCC (nucleotides 262 - 267 )
TCGTTA (nucleotides 370 - 375 )
GGTAAA (nucleotides 439 - 444 )
GATATC (nucleotides 490 - 495 )
GATATC (nucleotides 679 - 684 )
GGTATC (nucleotides 781 - 786 )
TTGAAG (nucleotides 793 - 798 )
TTTGTC (nucleotides 859 - 864 )
TCGTTG (nucleotides 934 - 939 )
AAGAAG (nucleotides 1150 - 1155 )
TTCCCA (nucleotides 1222 - 1227 )
TTGAAC (nucleotides 1276 - 1281 )
AAGAAG (nucleotides 1525 - 1530 )
GCCAAG (nucleotides 1717 - 1722 )
AAGAAG (nucleotides 1720 - 1725 )
AAATGG (nucleotides 1804 - 1809 )
GCCAAG (nucleotides 1840 - 1845 )
TTGAAA (nucleotides 1858 - 1863 ).
87. The nucleotide sequence of Claim 86, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
88. The nucleotide sequence of Claim 86, in which at least 3 of the following codon pair replacements have been made:
GGGTTC (nucleotides 43 - 48 ) replaced with GGTTTC
TTGAAG (nucleotides 67 - 72 ) replaced with TTAAAG
GCCATT (nucleotides 190 - 195 ) replaced with GCTATT
AAGAAG (nucleotides 250 - 255 ) replaced with AAAAAG
TTCCCC (nucleotides 262 - 267 ) replaced with TTTCCG
TCGTTA (nucleotides 370 - 375 ) replaced with TCTTTA
GGTAAA (nucleotides 439 - 444 ) replaced with GGAAAA
GATATC (nucleotides 490 - 495 ) replaced with GACATT
GATATC (nucleotides 679 - 684 ) replaced with GACATT
GGTATC (nucleotides 781 - 786 ) replaced with GGTATA
TTGAAG (nucleotides 793 - 798 ) replaced with TTAAAG
TTTGTC (nucleotides 859 - 864 ) replaced with TTCGTT
TCGTTG (nucleotides 934 - 939 ) replaced with TCATTG
AAGAAG (nucleotides 1150 - 1155 ) replaced with AAAAAG
TTCCCA (nucleotides 1222 - 1227 ) replaced with TTTCCA
TTGAAC (nucleotides 1276 - 1281 ) replaced with TTAAAT
AAGAAG (nucleotides 1525 - 1530 ) replaced with AAAAAG
GCCAAG (nucleotides 1717 - 1722 ) replaced with GCTAAA
AAGAAG (nucleotides 1720 - 1725 ) replaced with AAAAAG
AAATGG (nucleotides 1804 - 1809 ) replaced with AAGTGG
GCCAAG (nucleotides 1840 - 1845 ) replaced with GCGAAA
TTGAAA (nucleotides 1858 - 1863 ) replaced with TTAAAA.
89. A D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74, wherein at least 3 of the following codon pairs of SEQ ID NO: 73 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TCGACT (nucleotides 55 - 60 ) replaced with TCTACC
AACAGC (nucleotides 136 - 141 ) replaced with AATTCT
GATGCC (nucleotides 220 - 225 )
GGTATT (nucleotides 283 - 288 )
TCCGGT (nucleotides 289 - 294 )
GATGCC (nucleotides 478 - 483 )
GCCTTG (nucleotides 481 - 486 )
GAAGCC (nucleotides 649 - 654 )
GGTATC (nucleotides 781 - 786 )
ATCAAT (nucleotides 784 - 789 )
ACCGGA (nucleotides 907 - 912 )
ATTATC (nucleotides 928 - 933 )
GCTTTG (nucleotides 958 - 963 )
ATTATC (nucleotides 994 - 999 )
GGTATT (nucleotides 1213 - 1218 )
AACAGC (nucleotides 1279 - 1284 )
ACTTTG (nucleotides 1366 - 1371 )
ATTATC (nucleotides 1603 - 1608 )
GAAGCC (nucleotides 1714 - 1719 )
GCCAAG (nucleotides 1717 - 1722 )
GCCAAG (nucleotides 1840 - 1845 ).
90. The nucleotide sequence of Claim 89, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
91. The nucleotide sequence of Claim 89, in which at least 3 of the following codon pair replacements have been made:
TCGACT (nucleotides 55 - 60 ) replaced with TCTACC AACAGC (nucleotides 136 - 141 ) replaced with AATTCT GATGCC (nucleotides 220 - 225 ) replaced with GACGCG GGTATT (nucleotides 283 - 288 ) replaced with GGCATT TCCGGT (nucleotides 289 - 294 ) replaced with AGCGGT GATGCC (nucleotides 478 - 483 ) replaced with GATGCT GCCTTG (nucleotides 481 - 486 ) replaced with GCTTTA GAAGCC (nucleotides 649 - 654 ) replaced with GAGGCC GGTATC (nucleotides 781 - 786 ) replaced with GGTATA ATCAAT (nucleotides 784 - 789 ) replaced with ATAAAC ACCGGA (nucleotides 907 - 912 ) replaced with ACGGGA ATTATC (nucleotides 928 - 933 ) replaced with ATTATT GCTTTG (nucleotides 958 - 963 ) replaced with GCTCTA ATTATC (nucleotides 994 - 999 ) replaced with ATTATT GGTATT (nucleotides 1213 - 1218 ) replaced with GGCATC AACAGC (nucleotides 1279 - 1284 ) replaced with AATTCT ACTTTG (nucleotides 1366 - 1371 ) replaced with ACCTTG ATTATC (nucleotides 1603 - 1608 ) replaced with ATTATT GAAGCC (nucleotides 1714 - 1719 ) replaced with GAAGCT GCCAAG (nucleotides 1717 - 1722 ) replaced with GCTAAA GCCAAG (nucleotides 1840 - 1845 ) replaced with GCGAAA.
92. A D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
93. The nucleotide sequence of Claim 92, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
94. The nucleotide sequence of Claim 92, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
95. A D-xylulokinase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
96. A D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 12-312 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
97. The D-xylulokinase-encoding nucleotide sequence of Claim 96, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
98. The D-xylulokinase-encoding nucleotide sequence of any of Claims 96-97, wherein no replacement codon encoding amino acids 12-312 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair GATATC when expressed in the native organism.
99. A D-xylulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-622 of wild-type D-xylulokinase as set forth in SEQ ID NO: 74 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-12 of SEQ ID NO: 74 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
100. The D-xylulokinase-encoding nucleotide sequence of Claim 99, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
101. The D-xylulokinase-encoding nucleotide sequence of any of Claims 99-100, wherein at least one replacement codon encoding amino acids 1-12 of SEQ ID NO: 74 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GATGCT when expressed in the native organism.
102. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CGCTAC (nucleotides 454 - 459 )
GCCAAG (nucleotides 562 - 567 )
CTCGGT (nucleotides 574 - 579 )
GATATC (nucleotides 946 - 951 )
CGCTAC (nucleotides 964 - 969 )
GCCATT (nucleotides 1102 - 1107 ).
103. The nucleotide sequence of Claim 102, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
104. The nucleotide sequence of Claim 102, in which at least 3 of the following codon pair replacements have been made:
CGCTAC (nucleotides 454 - 459 ) replaced with AGGTAT
GCCAAG (nucleotides 562 - 567 ) replaced with GCTAAA
CTCGGT (nucleotides 574 - 579 ) replaced with TTGGGT
GATATC (nucleotides 946 - 951 ) replaced with GATATA
CGCTAC (nucleotides 964 - 969 ) replaced with AGATAT
GCCATT (nucleotides 1102 - 1107 ) replaced with GCTATT.
105. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTGGCG (nucleotides 688 - 693) GCCAGC (nucleotides 856 - 861 ) ATCCTC (nucleotides 262 - 267) GCCAGT (nucleotides 928 - 933) CTCGGC (nucleotides 265 - 270) GTCAGC (nucleotides 775 - 780) TTCCCG (nucleotides 1045 - 1050) CTCGGT (nucleotides 574 - 579) TTCTGG (nucleotides 214 - 219) GCGCTG (nucleotides 517 - 522) ATCGCC (nucleotides 292 - 297).
106. The nucleotide sequence of Claim 105, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
107. The nucleotide sequence of Claim 105, in which at least 3 of the following codon pair replacements have been made:
CTGGCG (nucleotides 688 - 693) replaced with CTCGCG GCCAGC (nucleotides 856 - 861 ) replaced with GCGTCT ATCCTC (nucleotides 262 - 267) replaced with ATCCTG GCCAGT (nucleotides 928 - 933) replaced with GCGTCT CTCGGC (nucleotides 265 - 270) replaced with CTGGGT GTCAGC (nucleotides 775 - 780) replaced with GTTAGC TTCCCG (nucleotides 1045 - 1050) replaced with TTCCCA CTCGGT (nucleotides 574 - 579) replaced with CTGGGC TTCTGG (nucleotides 214 - 219) replaced with TTTTGG GCGCTG (nucleotides 517 - 522) replaced with GCTCTG ATCGCC (nucleotides 292 - 297) replaced with ATCGCT.
108. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GATATC (nucleotides 946 - 951) AAGTTT (nucleotides 862 - 867) GTCAAG (nucleotides 55 - 60) GTCAAG (nucleotides 1063 - 1068) GCCAAA (nucleotides 763 - 768) GGTATC (nucleotides 190 - 195) AAGAAT (nucleotides 898 - 903) TCCAAA (nucleotides 1024 - 1029).
109. The nucleotide sequence of Claim 108, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
110. The nucleotide sequence of Claim 108, in which at least 3 of the following codon pair replacements have been made:
GATATC (nucleotides 946 - 951 ) replaced with GACATC AAGTTT (nucleotides 862 - 867) replaced with AAATTC GTCAAG (nucleotides 55 - 60) replaced with GTTAAA GTCAAG (nucleotides 1063 - 1068) replaced with GTTAAG GCCAAA (nucleotides 763 - 768) replaced with GCGAAA GGTATC (nucleotides 190 - 195) replaced with GGTATT AAGAAT (nucleotides 898 - 903) replaced with AAAAAC TCCAAA (nucleotides 1024 - 1029) replaced with TCTAAA.
111. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGTATC (nucleotides 190 - 195) CTGCGA (nucleotides 448 - 453) GCCAAG (nucleotides 562 - 567) GATATC (nucleotides 946 - 951) GCCATT (nucleotides 1102 - 1107).
112. The nucleotide sequence of Claim 111, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
113. The nucleotide sequence of Claim 111, in which at least 3 of the following codon pair replacements have been made:
GGTATC (nucleotides 190 - 195) replaced with GGAATT CTGCGA (nucleotides 448 - 453) replaced with TTGAGG GCCAAG (nucleotides 562 - 567) replaced with GCTAAA GATATC (nucleotides 946 - 951) replaced with GATATA GCCATT (nucleotides 1102 - 1107) replaced with GCAATT.
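Whether a candidate sequence contains "at least 3" of the recited replacements can be counted position by position. The sketch below is illustrative only: the function names are hypothetical, and the toy wild-type sequence is a synthetic placeholder built for the demonstration, not SEQ ID NO: 97. Nucleotide positions are treated as 1-based and inclusive, as in the claims.

```python
def place(seq, start_nt, hexamer):
    """Return seq with hexamer written at the 1-based position start_nt."""
    i = start_nt - 1
    return seq[:i] + hexamer + seq[i + 6:]


def count_replacements(wild_type, candidate, replacements):
    """Count how many recited replacements are present in candidate.

    replacements: iterable of (original_hexamer, start_nt, new_hexamer).
    Raises ValueError if the wild-type sequence does not carry the
    original hexamer at the stated position.
    """
    made = 0
    for original, start_nt, new in replacements:
        i = start_nt - 1
        if wild_type[i:i + 6] != original:
            raise ValueError(f"{original} not found at nucleotide {start_nt}")
        if candidate[i:i + 6] == new:
            made += 1
    return made


# Replacements as recited in Claim 113.
CLAIM_113 = [
    ("GGTATC", 190, "GGAATT"),
    ("CTGCGA", 448, "TTGAGG"),
    ("GCCAAG", 562, "GCTAAA"),
    ("GATATC", 946, "GATATA"),
    ("GCCATT", 1102, "GCAATT"),
]

# Synthetic placeholder (NOT the real wild-type gene): 370 alanine codons
# with the recited wild-type hexamers written in at their stated positions.
toy_wild_type = "GCT" * 370
for original, start_nt, _ in CLAIM_113:
    toy_wild_type = place(toy_wild_type, start_nt, original)

# Apply only the first three replacements to make a toy variant.
toy_variant = toy_wild_type
for _, start_nt, new in CLAIM_113[:3]:
    toy_variant = place(toy_variant, start_nt, new)

meets_threshold = count_replacements(toy_wild_type, toy_variant, CLAIM_113) >= 3
print(meets_threshold)
```

With a real sequence, `toy_wild_type` would be replaced by the wild-type coding sequence and `toy_variant` by the sequence under examination.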
114. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98, wherein at least 3 of the following codon pairs of SEQ ID NO: 97 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GTCGAT (nucleotides 16 - 21 ) GGGGCA (nucleotides 40 - 45 ) GATGCC (nucleotides 127 - 132 ) GGTATC (nucleotides 190 - 195 ) GCCAAG (nucleotides 562 - 567 ) GCCGGT (nucleotides 643 - 648 ) AGCCGT (nucleotides 682 - 687 ) TCGGCT (nucleotides 748 - 753 ) GTCGAT (nucleotides 943 - 948 ) GATGCC (nucleotides 1057 - 1062 ).
115. The nucleotide sequence of Claim 114, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
116. The nucleotide sequence of Claim 114, in which at least 3 of the following codon pair replacements have been made:
GTCGAT (nucleotides 16 - 21 ) replaced with GTTGAT GGGGCA (nucleotides 40 - 45 ) replaced with GGCGCT GATGCC (nucleotides 127 - 132 ) replaced with GACGCC GGTATC (nucleotides 190 - 195 ) replaced with GGTATA GCCAAG (nucleotides 562 - 567 ) replaced with GCTAAG GCCGGT (nucleotides 643 - 648 ) replaced with GCTGGG AGCCGT (nucleotides 682 - 687 ) replaced with TCTCGT TCGGCT (nucleotides 748 - 753 ) replaced with TCTGCA GTCGAT (nucleotides 943 - 948 ) replaced with GTTGAT GATGCC (nucleotides 1057 - 1062 ) replaced with GATGCT.
117. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
118. The nucleotide sequence of Claim 117, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
119. The nucleotide sequence of Claim 117, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
120. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
121. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 53-164 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
122. The L-arabinitol 4-dehydrogenase-encoding nucleotide sequence of Claim 121 wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
123. The L-arabinitol 4-dehydrogenase-encoding nucleotide sequence of any of Claims 121 -122, wherein no replacement codon encoding amino acids 53-164 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair AAGATT when expressed in the native organism.
124. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 -377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 192-366 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
125. The L-arabinitol 4-dehydrogenase-encoding nucleotide sequence of Claim 124, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
126. The L-arabinitol 4-dehydrogenase-encoding nucleotide sequence of any of Claims 124-125, wherein no replacement codon encoding amino acids 192-366 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair GAGATT when expressed in the native organism.
127. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1 -53 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
128. The L-arabinitol 4-dehydrogenase-encoding nucleotide sequence of Claim 127, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
129. The L-arabinitol 4-dehydrogenase-encoding nucleotide sequence of any of Claims 127-128, wherein at least one replacement codon encoding amino acids 1 -53 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GTCAAG when expressed in the native organism.
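The nucleotide coordinates used in the codon pair lists and the amino acid ranges used in the region-based claims are related by simple arithmetic. The sketch below assumes that nucleotide 1 is the first base of the start codon and that every recited pair is in frame; both function names are hypothetical.

```python
def codon_pair_residues(start_nt, end_nt):
    """Map a 1-based, inclusive nucleotide range to the two residue numbers.

    Assumes numbering starts at the first base of the coding sequence.
    """
    if end_nt - start_nt != 5:
        raise ValueError("a codon pair spans exactly six nucleotides")
    if (start_nt - 1) % 3 != 0:
        raise ValueError("range does not start on a codon boundary")
    first = (start_nt - 1) // 3 + 1
    return first, first + 1


def pair_in_region(start_nt, end_nt, first_aa, last_aa):
    """True if both residues of the codon pair fall within first_aa..last_aa."""
    a, b = codon_pair_residues(start_nt, end_nt)
    return first_aa <= a and b <= last_aa


# GTCAAG occurs at nucleotides 55 - 60 and at 1063 - 1068 in the lists above.
print(codon_pair_residues(55, 60))
print(pair_in_region(55, 60, 1, 53))
print(pair_in_region(1063, 1068, 1, 53))
```

Under these assumptions the GTCAAG pair at nucleotides 55 - 60 corresponds to residues 19 and 20, inside the amino acids 1-53 region, while the occurrence at nucleotides 1063 - 1068 corresponds to residues 355 and 356 and lies outside it.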
130. A L-arabinitol 4-dehydrogenase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-377 of wild-type L-arabinitol 4-dehydrogenase as set forth in SEQ ID NO: 98 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 164-192 of SEQ ID NO: 98 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
131. The L-arabinitol 4-dehydrogenase-encoding nucleotide sequence of Claim 130, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
132. The L-arabinitol 4-dehydrogenase-encoding nucleotide sequence of any of Claims 130-131, wherein at least one replacement codon encoding amino acids 164-192 of SEQ ID NO: 98 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GCGCTG when expressed in the native organism.
133. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GGTATT (nucleotides 619 - 624)
TTGAAC (nucleotides 16 - 21 )
TTGAAC (nucleotides 274 - 279)
TTGAAC (nucleotides 670 - 675)
TTGAAC (nucleotides 688 - 693)
CTTTCT (nucleotides 286 - 291 )
GCCATT (nucleotides 181 - 186)
TCTCCA (nucleotides 697 - 702)
TCTCCA (nucleotides 751 - 756)
ATCAAG (nucleotides 103 - 108)
ATCAAG (nucleotides 541 - 546)
ATCAAG (nucleotides 721 - 726)
GCCAAG (nucleotides 889 - 894).
134. The nucleotide sequence of Claim 133, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
135. The nucleotide sequence of Claim 133, in which at least 3 of the following codon pair replacements have been made:
GGTATT (nucleotides 619 - 624) replaced with GGAATT TTGAAC (nucleotides 16 - 21 ) replaced with TTAAAT TTGAAC (nucleotides 274 - 279) replaced with CTAAAT TTGAAC (nucleotides 670 - 675) replaced with TTAAAT TTGAAC (nucleotides 688 - 693) replaced with TTAAAT CTTTCT (nucleotides 286 - 291) replaced with CTATCT GCCATT (nucleotides 181 - 186) replaced with GCTATT TCTCCA (nucleotides 697 - 702) replaced with TCACCA TCTCCA (nucleotides 751 - 756) replaced with TCACCA ATCAAG (nucleotides 103 - 108) replaced with ATTAAA ATCAAG (nucleotides 541 - 546) replaced with ATTAAA ATCAAG (nucleotides 721 - 726) replaced with ATTAAG GCCAAG (nucleotides 889 - 894) replaced with GCTAAA.
136. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GCCTGT (nucleotides 58 - 63 ) CTTGAT (nucleotides 124 - 129 ) GCCTGT (nucleotides 226 - 231 ) GAAGAT (nucleotides 346 - 351 ) CTTTCT (nucleotides 748 - 753 ) GCCAGC (nucleotides 781 - 786 ).
137. The nucleotide sequence of Claim 136, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
138. The nucleotide sequence of Claim 136, in which at least 3 of the following codon pair replacements have been made:
GCCTGT (nucleotides 58 - 63 ) replaced with GCATGT CTTGAT (nucleotides 124 - 129 ) replaced with TTGGAT GCCTGT (nucleotides 226 - 231 ) replaced with GCTTGT GAAGAT (nucleotides 346 - 351 ) replaced with GAAGAT CTTTCT (nucleotides 748 - 753 ) replaced with TTGTCT GCCAGC (nucleotides 781 - 786 ) replaced with GCATCA.
139. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAC (nucleotides 16 - 21) ATCAAG (nucleotides 103 - 108) GTCAAG (nucleotides 172 - 177) GACGAA (nucleotides 187 - 192) GGTATC (nucleotides 193 - 198) GTCAAG (nucleotides 199 - 204) TCCAAG (nucleotides 226 - 231) TTGAAC (nucleotides 274 - 279) TTCAAG (nucleotides 343 - 348) GTCAAG (nucleotides 460 - 465) ATCAAG (nucleotides 541 - 546) CCAAGA (nucleotides 589 - 594) GGTATT (nucleotides 619 - 624) TTGAAC (nucleotides 670 - 675) TTGAAC (nucleotides 688 - 693) ATCAAG (nucleotides 721 - 726) CCAAGA (nucleotides 823 - 828) GACGAA (nucleotides 865 - 870) ATCAAC (nucleotides 901 - 906) TTCAAC (nucleotides 913 - 918).
140. The nucleotide sequence of Claim 139, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
141. The nucleotide sequence of Claim 139, in which at least 3 of the following codon pair replacements have been made:
TTGAAC (nucleotides 16 - 21) replaced with TTAAAT ATCAAG (nucleotides 103 - 108) replaced with ATTAAA GTCAAG (nucleotides 172 - 177) replaced with GTTAAA GACGAA (nucleotides 187 - 192) replaced with GATGAA GGTATC (nucleotides 193 - 198) replaced with GGAATT GTCAAG (nucleotides 199 - 204) replaced with GTTAAA TCCAAG (nucleotides 226 - 231) replaced with TCTAAA TTGAAC (nucleotides 274 - 279) replaced with CTAAAT TTCAAG (nucleotides 343 - 348) replaced with TTTAAA GTCAAG (nucleotides 460 - 465) replaced with GTTAAA ATCAAG (nucleotides 541 - 546) replaced with ATTAAA CCAAGA (nucleotides 589 - 594) replaced with CCTAGA GGTATT (nucleotides 619 - 624) replaced with GGAATT TTGAAC (nucleotides 670 - 675) replaced with TTAAAT TTGAAC (nucleotides 688 - 693) replaced with TTAAAT ATCAAG (nucleotides 721 - 726) replaced with ATTAAG CCAAGA (nucleotides 823 - 828) replaced with CCTCGT GACGAA (nucleotides 865 - 870) replaced with GATGAA ATCAAC (nucleotides 901 - 906) replaced with ATTAAT TTCAAC (nucleotides 913 - 918) replaced with TTTAAT.
142. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 324 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GATATC (nucleotides 127 - 132 ) TTGAAG (nucleotides 190 - 195 ) TTGAAA (nucleotides 196 - 201 ) GTGTTT (nucleotides 262 - 267 ) TTTGCT (nucleotides 265 - 270 ) TTCCCA (nucleotides 337 - 342 ) GCCAAG (nucleotides 358 - 363 ) TTTGCT (nucleotides 421 - 426 ) ATCAAA (nucleotides 436 - 441 ) GGTATC (nucleotides 445 - 450 ) GCCATT (nucleotides 490 - 495 ) GGTATC (nucleotides 688 - 693 ) CTTTCT (nucleotides 748 - 753 ).
143. The nucleotide sequence of Claim 142, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
144. The nucleotide sequence of Claim 142, in which at least 3 of the following codon pair replacements have been made:
GATATC (nucleotides 127 - 132 ) replaced with GACATT TTGAAG (nucleotides 190 - 195 ) replaced with TTAAAG TTGAAA (nucleotides 196 - 201 ) replaced with TTAAAG GTGTTT (nucleotides 262 - 267 ) replaced with GTTTTC TTTGCT (nucleotides 265 - 270 ) replaced with TTCGCT TTCCCA (nucleotides 337 - 342 ) replaced with TTCCCT GCCAAG (nucleotides 358 - 363 ) replaced with GCTAAA TTTGCT (nucleotides 421 - 426 ) replaced with TTCGCT ATCAAA (nucleotides 436 - 441 ) replaced with ATTAAA GGTATC (nucleotides 445 - 450 ) replaced with GGAATT GCCATT (nucleotides 490 - 495 ) replaced with GCAATT GGTATC (nucleotides 688 - 693 ) replaced with GGCATT CTTTCT (nucleotides 748 - 753 ) replaced with TTGTCT.
145. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 324 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122, wherein at least 3 of the following codon pairs of SEQ ID NO: 121 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
ACTTTT (nucleotides 19 - 24) GCTTTG (nucleotides 118 - 123) CTTGAT (nucleotides 124 - 129) GCCAAG (nucleotides 358 - 363) GCCTTT (nucleotides 418 - 423) GGTATC (nucleotides 445 - 450) ACTTTG (nucleotides 562 - 567) ATCAAT (nucleotides 649 - 654) GGTATC (nucleotides 688 - 693).
146. The nucleotide sequence of Claim 145, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
147. The nucleotide sequence of Claim 145, in which at least 3 of the following codon pair replacements have been made:
ACTTTT (nucleotides 19 - 24) replaced with ACCTTT GCTTTG (nucleotides 118 - 123) replaced with GCTCTT CTTGAT (nucleotides 124 - 129) replaced with TTGGAC GCCAAG (nucleotides 358 - 363) replaced with GCTAAG GCCTTT (nucleotides 418 - 423) replaced with GCTTTC GGTATC (nucleotides 445 - 450) replaced with GGGATT ACTTTG (nucleotides 562 - 567) replaced with ACCTTG ATCAAT (nucleotides 649 - 654) replaced with ATTAAT GGTATC (nucleotides 688 - 693) replaced with GGCATC.
148. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
149. The nucleotide sequence of Claim 148, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
150. The nucleotide sequence of Claim 148, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
151. A L-xylulose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1 -272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
152. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 8-267 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
153. The L-xylulose reductase-encoding nucleotide sequence of Claim 152, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
154. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 272 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 122 and is adapted for expression in a heterologous host organism, wherein at least 1 , 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1 -8 of SEQ ID NO: 122 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
155. The L-xylulose reductase-encoding nucleotide sequence of Claim 154, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
156. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAG (nucleotides 49 - 54) TTTGCC (nucleotides 583 - 588) GATATT (nucleotides 766 - 771) AGCGAT (nucleotides 364 - 369) GCCAAG (nucleotides 529 - 534) GCCAAG (nucleotides 700 - 705).
157. The nucleotide sequence of Claim 156, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
158. The nucleotide sequence of Claim 156, in which at least 3 of the following codon pair replacements have been made:
TTGAAG (nucleotides 49 - 54) replaced with TTAAAA TTTGCC (nucleotides 583 - 588) replaced with TTTGCT GATATT (nucleotides 766 - 771) replaced with GATATA AGCGAT (nucleotides 364 - 369) replaced with TCAGAT GCCAAG (nucleotides 529 - 534) replaced with GCAAAA GCCAAG (nucleotides 700 - 705) replaced with GCTAAA.
159. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GATCTC (nucleotides 37 - 42) ATTGCC (nucleotides 313 - 318) GCCGGA (nucleotides 322 - 327) GCCAGC (nucleotides 361 - 366) CTGGCG (nucleotides 550 - 555) TTTGCC (nucleotides 583 - 588) GTCAGC (nucleotides 733 - 738).
160. The nucleotide sequence of Claim 159, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
161. The nucleotide sequence of Claim 159, in which at least 3 of the following codon pair replacements have been made:
GATCTC (nucleotides 37 - 42) replaced with GATTTG ATTGCC (nucleotides 313 - 318) replaced with ATTGCT GCCGGA (nucleotides 322 - 327) replaced with GCTGGA GCCAGC (nucleotides 361 - 366) replaced with GCTTCA CTGGCG (nucleotides 550 - 555) replaced with TTGGCT TTTGCC (nucleotides 583 - 588) replaced with TTTGCT GTCAGC (nucleotides 733 - 738) replaced with GTTTCA.
162. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GTCAAG (nucleotides 220 - 225 ) TTCAAG (nucleotides 436 - 441 ) AAGAAG (nucleotides 439 - 444 ) GGCCAC (nucleotides 448 - 453 ) GGCCAC (nucleotides 484 - 489 ) TTTGCC (nucleotides 583 - 588 ) GATATT (nucleotides 766 - 771 ).
163. The nucleotide sequence of Claim 162, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
164. The nucleotide sequence of Claim 162, in which at least 3 of the following codon pair replacements have been made:
GTCAAG (nucleotides 220 - 225 ) replaced with GTTAAA TTCAAG (nucleotides 436 - 441 ) replaced with TTTAAA AAGAAG (nucleotides 439 - 444 ) replaced with AAAAAG GGCCAC (nucleotides 448 - 453 ) replaced with GGACAT GGCCAC (nucleotides 484 - 489 ) replaced with GGACAC TTTGCC (nucleotides 583 - 588 ) replaced with TTCGCT GATATT (nucleotides 766 - 771 ) replaced with GATATA.
165. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1 - 324 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAG (nucleotides 49 - 54 ) AAGAAG (nucleotides 439 - 444 ) GCCAAG (nucleotides 529 - 534 ) TTTGCC (nucleotides 583 - 588 ) GCCAAG (nucleotides 700 - 705 ).
166. The nucleotide sequence of Claim 165, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
167. The nucleotide sequence of Claim 165, in which at least 3 of the following codon pair replacements have been made:
TTGAAG (nucleotides 49 - 54 ) replaced with TTAAAG AAGAAG (nucleotides 439 - 444 ) replaced with AAAAAG GCCAAG (nucleotides 529 - 534 ) replaced with GCCAAA TTTGCC (nucleotides 583 - 588 ) replaced with TTCGCT GCCAAG (nucleotides 700 - 705 ) replaced with GCTAAA.
168. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 324 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146, wherein at least 3 of the following codon pairs of SEQ ID NO: 145 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTGAT (nucleotides 34 - 39 ) GATGCC (nucleotides 304 - 309 ) GCCTTT (nucleotides 307 - 312 ) GCCGGA (nucleotides 322 - 327 ) GCCAAG (nucleotides 529 - 534 ) GCCGGT (nucleotides 535 - 540 ) AACAGC (nucleotides 595 - 600 ) GATGCC (nucleotides 697 - 702 ) GCCAAG (nucleotides 700 - 705 ).
169. The nucleotide sequence of Claim 168, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
170. The nucleotide sequence of Claim 168, in which at least 3 of the following codon pair replacements have been made:
CTTGAT (nucleotides 34 - 39 ) replaced with TTGGAT GATGCC (nucleotides 304 - 309 ) replaced with GATGCT GCCTTT (nucleotides 307 - 312 ) replaced with GCTTTC GCCGGA (nucleotides 322 - 327 ) replaced with GCTGGA GCCAAG (nucleotides 529 - 534 ) replaced with GCTAAG GCCGGT (nucleotides 535 - 540 ) replaced with GCCGGG AACAGC (nucleotides 595 - 600 ) replaced with AATTCT GATGCC (nucleotides 697 - 702 ) replaced with GATGCT GCCAAG (nucleotides 700 - 705 ) replaced with GCTAAA.
171. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1- 266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
172. The nucleotide sequence of Claim 171, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
173. The nucleotide sequence of Claim 171, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
174. A L-xylulose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1 -266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
175. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 10-261 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
176. The L-xylulose reductase-encoding nucleotide sequence of Claim 175, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
177. The L-xylulose reductase-encoding nucleotide sequence of any of Claims 175-176, wherein no replacement codon encoding amino acids 10-261 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair AAGACG when expressed in the native organism.
178. A L-xylulose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-266 of wild-type L-xylulose reductase as set forth in SEQ ID NO: 146 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-10 of SEQ ID NO: 146 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
179. The L-xylulose reductase-encoding nucleotide sequence of Claim 178, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
180. The L-xylulose reductase-encoding nucleotide sequence of any of Claims 178-179, wherein at least one replacement codon encoding amino acids 1-10 of SEQ ID NO: 146 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GCCAAC when expressed in the native organism.
181. A xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAGTTT (nucleotides 262 - 267)
TTTGCC (nucleotides 130 - 135)
GTGGAA (nucleotides 943 - 948)
GCCATT (nucleotides 856 - 861)
CAGTTT (nucleotides 766 - 771)
CAAAGT (nucleotides 1033 - 1038)
GGCCAA (nucleotides 1201 - 1206)
TTTTTC (nucleotides 265 - 270).
182. The nucleotide sequence of Claim 181, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
183. The nucleotide sequence of Claim 181, in which at least 3 of the following codon pair replacements have been made:
GAGTTT (nucleotides 262 - 267) replaced with GAGTTC
TTTGCC (nucleotides 130 - 135) replaced with TTTGCT
GTGGAA (nucleotides 943 - 948) replaced with GTTGAA
GCCATT (nucleotides 856 - 861) replaced with GCTATA
CAGTTT (nucleotides 766 - 771) replaced with CAATTT
CAAAGT (nucleotides 1033 - 1038) replaced with CAATCT
GGCCAA (nucleotides 1201 - 1206) replaced with GGTCAA
TTTTTC (nucleotides 265 - 270) replaced with TTCTTT.
184. A xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTGGCG (nucleotides 226 - 231)
CTGGCG (nucleotides 1093 - 1098)
CTGGTG (nucleotides 94 - 99)
CTGGTG (nucleotides 958 - 963)
GAAGAG (nucleotides 115 - 120)
GAAGAG (nucleotides 391 - 396)
GAAGAG (nucleotides 946 - 951)
CTGGCA (nucleotides 376 - 381)
CTGGCA (nucleotides 820 - 825)
CTGGCA (nucleotides 1213 - 1218)
TTTGCC (nucleotides 130 - 135)
ACGCTG (nucleotides 586 - 591)
ACGCTG (nucleotides 817 - 822)
AAAGAG (nucleotides 337 - 342)
AAAGAG (nucleotides 781 - 786)
TTCCAG (nucleotides 673 - 678)
CTGGAA (nucleotides 775 - 780)
CTGGAA (nucleotides 1285 - 1290)
TTCCCG (nucleotides 931 - 936)
GCGGCA (nucleotides 496 - 501)
GTGATG (nucleotides 961 - 966)
GCGCTG (nucleotides 955 - 960)
GCGCTG (nucleotides 1096 - 1101).
185. The nucleotide sequence of Claim 184, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
186. The nucleotide sequence of Claim 184, in which at least 3 of the following codon pair replacements have been made:
CTGGCG (nucleotides 226 - 231) replaced with TTGGCT
CTGGCG (nucleotides 1093 - 1098) replaced with TTGGCA
CTGGTG (nucleotides 94 - 99) replaced with TTGGTT
CTGGTG (nucleotides 958 - 963) replaced with TTGGTT
GAAGAG (nucleotides 115 - 120) replaced with GAGGAA
GAAGAG (nucleotides 391 - 396) replaced with GAAGAA
GAAGAG (nucleotides 946 - 951) replaced with GAAGAA
CTGGCA (nucleotides 376 - 381) replaced with TTAGCT
CTGGCA (nucleotides 820 - 825) replaced with TTGGCT
CTGGCA (nucleotides 1213 - 1218) replaced with TTGGCT
TTTGCC (nucleotides 130 - 135) replaced with TTTGCT
ACGCTG (nucleotides 586 - 591) replaced with ACATTG
ACGCTG (nucleotides 817 - 822) replaced with ACATTG
AAAGAG (nucleotides 337 - 342) replaced with AAAGAA
AAAGAG (nucleotides 781 - 786) replaced with AAAGAA
TTCCAG (nucleotides 673 - 678) replaced with TTTCAA
CTGGAA (nucleotides 775 - 780) replaced with TTAGAA
CTGGAA (nucleotides 1285 - 1290) replaced with TTGGAA
TTCCCG (nucleotides 931 - 936) replaced with TTTCCA
GCGGCA (nucleotides 496 - 501) replaced with GCTGCT
GTGATG (nucleotides 961 - 966) replaced with GTTATG
GCGCTG (nucleotides 955 - 960) replaced with GCTTTG
GCGCTG (nucleotides 1096 - 1101) replaced with GCATTA.
187. A xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAGTTT (nucleotides 262 - 267)
TTTGCC (nucleotides 130 - 135)
AAACTG (nucleotides 790 - 795)
GCCAAA (nucleotides 1018 - 1023)
GCCAAA (nucleotides 1225 - 1230)
CTGAAA (nucleotides 760 - 765)
CTGAAA (nucleotides 1099 - 1104)
CTGAAA (nucleotides 1195 - 1200)
GACGAA (nucleotides 88 - 93)
AAACAG (nucleotides 763 - 768)
GGCCAA (nucleotides 1201 - 1206)
CTGGTA (nucleotides 1294 - 1299)
TCGTTA (nucleotides 331 - 336)
TTTGAC (nucleotides 13 - 18)
CAGTTT (nucleotides 766 - 771).
188. The nucleotide sequence of Claim 187, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
189. The nucleotide sequence of Claim 187, in which at least 3 of the following codon pair replacements have been made:
GAGTTT (nucleotides 262 - 267) replaced with GAGTTC
TTTGCC (nucleotides 130 - 135) replaced with TTTGCT
AAACTG (nucleotides 790 - 795) replaced with AAATTA
GCCAAA (nucleotides 1018 - 1023) replaced with GCTAAA
GCCAAA (nucleotides 1225 - 1230) replaced with GCTAAA
CTGAAA (nucleotides 760 - 765) replaced with CTAAAA
CTGAAA (nucleotides 1099 - 1104) replaced with TTAAAA
CTGAAA (nucleotides 1195 - 1200) replaced with TTAAAG
GACGAA (nucleotides 88 - 93) replaced with GATGAA
AAACAG (nucleotides 763 - 768) replaced with AAACAA
GGCCAA (nucleotides 1201 - 1206) replaced with GGTCAA
CTGGTA (nucleotides 1294 - 1299) replaced with TTGGTT
TCGTTA (nucleotides 331 - 336) replaced with TCTTTA
TTTGAC (nucleotides 13 - 18) replaced with TTTGAT
CAGTTT (nucleotides 766 - 771) replaced with CAATTT.
190. A xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTTGCC (nucleotides 130 - 135)
GAGTTT (nucleotides 262 - 267)
TCGTTA (nucleotides 331 - 336)
CAGTTT (nucleotides 766 - 771)
TTCCAT (nucleotides 835 - 840)
GCCATT (nucleotides 856 - 861)
GGCCAA (nucleotides 1201 - 1206).
191. The nucleotide sequence of Claim 190, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
192. The nucleotide sequence of Claim 190, in which at least 3 of the following codon pair replacements have been made:
TTTGCC (nucleotides 130 - 135) replaced with TTCGCT
GAGTTT (nucleotides 262 - 267) replaced with GAATTT
TCGTTA (nucleotides 331 - 336) replaced with AGTTTA
CAGTTT (nucleotides 766 - 771) replaced with CAATTC
TTCCAT (nucleotides 835 - 840) replaced with TTCCAC
GCCATT (nucleotides 856 - 861) replaced with GCTATT
GGCCAA (nucleotides 1201 - 1206) replaced with GGTCAA.
193. A xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose isomerase as set forth in SEQ ID NO: 170, wherein at least 3 of the following codon pairs of SEQ ID NO: 169 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GCCTAT (nucleotides 7 - 12)
CTCGAT (nucleotides 22 - 27)
GAAGGC (nucleotides 40 - 45)
ATCAAT (nucleotides 346 - 351)
AAGCTG (nucleotides 406 - 411)
CTGTTA (nucleotides 589 - 594)
GATGCC (nucleotides 736 - 741)
GATGCC (nucleotides 1015 - 1020).
194. The nucleotide sequence of Claim 193, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
195. The nucleotide sequence of Claim 193, in which at least 3 of the following codon pair replacements have been made:
GCCTAT (nucleotides 7 - 12) replaced with GCTTAT
CTCGAT (nucleotides 22 - 27) replaced with TTGGAT
GAAGGC (nucleotides 40 - 45) replaced with GAAGGT
ATCAAT (nucleotides 346 - 351) replaced with ATTAAT
AAGCTG (nucleotides 406 - 411) replaced with AAATTG
CTGTTA (nucleotides 589 - 594) replaced with TTGTTG
GATGCC (nucleotides 736 - 741) replaced with GACGCC
GATGCC (nucleotides 1015 - 1020) replaced with GATGCT.
196. A xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
197. The nucleotide sequence of Claim 196, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
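The notion of an over-represented codon pair can be illustrated with a simple observed-to-expected ratio. This is a sketch only, not the statistic used to derive the values referred to in these claims: it assumes expected pair counts follow from the individual codon frequencies alone, omits any normalization for amino acid pair bias, and uses a hypothetical function name and toy input.

```python
from collections import Counter


def codon_pair_ratios(coding_seqs):
    """Observed/expected ratio for each adjacent codon pair.

    Assumption: expected count = (number of pairs) * f(codon1) * f(codon2),
    where f is the codon's overall frequency in the input sequences.
    """
    codon_counts = Counter()
    pair_counts = Counter()
    for seq in coding_seqs:
        usable = len(seq) - len(seq) % 3
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))
    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    ratios = {}
    for (first, second), observed in pair_counts.items():
        expected = (
            total_pairs
            * (codon_counts[first] / total_codons)
            * (codon_counts[second] / total_codons)
        )
        ratios[first + second] = observed / expected
    return ratios


# Toy input: GAA GAG GAA GAG, in which GAAGAG occurs twice and GAGGAA once.
ratios = codon_pair_ratios(["GAAGAGGAAGAG"])
```

A ratio well above 1 marks a pair that occurs more often than its constituent codons would predict; a genome-scale analysis would use all coding sequences of the host organism rather than a single toy string.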
198. The nucleotide sequence of Claim 196, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
199. A xylose isomerase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
200. A xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 76-286 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
201. The xylose isomerase-encoding nucleotide sequence of Claim 200, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
202. The xylose isomerase-encoding nucleotide sequence of any of Claims 200-201, wherein no replacement codon encoding amino acids 76-286 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair GAAGAG when expressed in the native organism.
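The claims identify codon pairs both by nucleotide coordinates and by the amino acid range they encode. The relationship between the two can be sketched as below, assuming nucleotide 1 is the first base of the first codon, so that codon k spans nucleotides 3k-2 to 3k. The function names are hypothetical.

```python
def codon_pair_residues(first_nt: int, last_nt: int) -> tuple:
    """Return the 1-based amino acid positions encoded by a codon pair
    given its 1-based, inclusive nucleotide coordinates."""
    if last_nt - first_nt != 5:
        raise ValueError("a codon pair spans exactly six nucleotides")
    if (first_nt - 1) % 3 != 0:
        raise ValueError("coordinates are not in frame with the first codon")
    first_aa = (first_nt - 1) // 3 + 1
    return first_aa, first_aa + 1


def pair_in_region(first_nt: int, last_nt: int, region_start: int, region_end: int) -> bool:
    """True if both residues of the codon pair lie within the amino acid
    range region_start..region_end (inclusive)."""
    first_aa, second_aa = codon_pair_residues(first_nt, last_nt)
    return region_start <= first_aa and second_aa <= region_end


# Nucleotides 226 - 231 encode amino acids 76 and 77.
residues = codon_pair_residues(226, 231)
in_middle_region = pair_in_region(226, 231, 76, 286)
```

Under this convention a pair at nucleotides 94 - 99 encodes amino acids 32 and 33, which fall within an amino acid 1-76 region, while a pair at nucleotides 1201 - 1206 encodes amino acids 401 and 402 and falls outside a 76-286 region.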
203. A xylose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-440 of wild-type xylose isomerase as set forth in SEQ ID NO: 170 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-76 of SEQ ID NO: 170 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
204. The xylose isomerase-encoding nucleotide sequence of Claim 203, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
205. The xylose isomerase-encoding nucleotide sequence of any of Claims 203-204, wherein at least one replacement codon encoding amino acids 1-76 of SEQ ID NO: 170 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair CTGGTG when expressed in the native organism.
206. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTGAAA (nucleotides 148 - 153)
ATCAAC (nucleotides 268 - 273)
ATCAAG (nucleotides 598 - 603)
CTCGGT (nucleotides 1111 - 1116)
GGTATT (nucleotides 1114 - 1119)
GGATTT (nucleotides 1489 - 1494).
207. The nucleotide sequence of Claim 206, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
208. The nucleotide sequence of Claim 206, in which at least 3 of the following codon pair replacements have been made:
TTGAAA (nucleotides 148 - 153) replaced with TTAAAA
ATCAAC (nucleotides 268 - 273) replaced with ATTAAT
ATCAAG (nucleotides 598 - 603) replaced with ATAAAA
CTCGGT (nucleotides 1111 - 1116) replaced with TTGGGA
GGTATT (nucleotides 1114 - 1119) replaced with GGAATT
GGATTT (nucleotides 1489 - 1494) replaced with GGTTTT.
209. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTCGAC (nucleotides 142 - 147)
ATCCTC (nucleotides 226 - 231)
ATCCTC (nucleotides 640 - 645)
GACTGG (nucleotides 1081 - 1086)
GTGGTG (nucleotides 1180 - 1185)
GTGGTG (nucleotides 1096 - 1101)
TTGCTG (nucleotides 1093 - 1098)
CTCGGC (nucleotides 1327 - 1332)
CTCGGC (nucleotides 922 - 927)
CTGGAA (nucleotides 229 - 234)
CTGGAA (nucleotides 649 - 654)
CTGGAA (nucleotides 298 - 303)
AGCCAG (nucleotides 1039 - 1044)
ATTGCC (nucleotides 1195 - 1200)
GAAGTG (nucleotides 760 - 765)
GAAGTG (nucleotides 799 - 804)
GAAGTG (nucleotides 1054 - 1059)
CAGGCG (nucleotides 43 - 48)
GATCTC (nucleotides 1072 - 1077)
CTCGGT (nucleotides 22 - 27)
GTGATG (nucleotides 559 - 564)
GCGCTG (nucleotides 1477 - 1482)
GCGCTG (nucleotides 496 - 501)
GCGCTG (nucleotides 1192 - 1197)
GCGCTG (nucleotides 1111 - 1116)
GCGCTG (nucleotides 958 - 963)
GCGCTG (nucleotides 109 - 114)
CTCGAC (nucleotides 328 - 333)
ATCCTC (nucleotides 682 - 687)
ATCCTC (nucleotides 1279 - 1284)
GACTGG (nucleotides 1366 - 1371)
GTGGTG (nucleotides 1462 - 1467).
210. The nucleotide sequence of Claim 209, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
211. The nucleotide sequence of Claim 209, in which at least 3 of the following codon pair replacements have been made:
CTCGAC (nucleotides 142 - 147) replaced with TTAGTT
ATCCTC (nucleotides 226 - 231) replaced with TTAGTT
ATCCTC (nucleotides 640 - 645) replaced with TTGGTT
GACTGG (nucleotides 1081 - 1086) replaced with GAAGAA
GTGGTG (nucleotides 1180 - 1185) replaced with GCTTCT
GTGGTG (nucleotides 1096 - 1101) replaced with TTGGAT
TTGCTG (nucleotides 1093 - 1098) replaced with ATTTTG
CTCGGC (nucleotides 1327 - 1332) replaced with ATTTTG
CTCGGC (nucleotides 922 - 927) replaced with GATTGG
CTGGAA (nucleotides 229 - 234) replaced with GTTGTT
CTGGAA (nucleotides 649 - 654) replaced with GTTGTT
CTGGAA (nucleotides 298 - 303) replaced with TTGTTG
AGCCAG (nucleotides 1039 - 1044) replaced with TTGGGT
ATTGCC (nucleotides 1195 - 1200) replaced with TTGGGT
GAAGTG (nucleotides 760 - 765) replaced with TTGGAA
GAAGTG (nucleotides 799 - 804) replaced with TTAGAG
GAAGTG (nucleotides 1054 - 1059) replaced with TTGGAA
CAGGCG (nucleotides 43 - 48) replaced with TCACAA
GATCTC (nucleotides 1072 - 1077) replaced with ATTGCT
CTCGGT (nucleotides 22 - 27) replaced with GAAGTT
GTGATG (nucleotides 559 - 564) replaced with GAAGTA
GCGCTG (nucleotides 1477 - 1482) replaced with GAAGTT
GCGCTG (nucleotides 496 - 501) replaced with CAAGCA
GCGCTG (nucleotides 1192 - 1197) replaced with GATTTG
GCGCTG (nucleotides 1111 - 1116) replaced with TTGGGA
GCGCTG (nucleotides 958 - 963) replaced with GTAATG
GCGCTG (nucleotides 109 - 114) replaced with GCTTTA
CTCGAC (nucleotides 328 - 333) replaced with GCTTTG
ATCCTC (nucleotides 682 - 687) replaced with GCTTTG
ATCCTC (nucleotides 1279 - 1284) replaced with GCATTG
GACTGG (nucleotides 1366 - 1371) replaced with GCTTTA
GTGGTG (nucleotides 1462 - 1467) replaced with GCTTTG.
212. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GACGAT (nucleotides 208 - 213)
GACGAT (nucleotides 1129 - 1134)
ATCAAG (nucleotides 598 - 603)
AAACTG (nucleotides 127 - 132)
AAACTG (nucleotides 139 - 144)
AAACTG (nucleotides 1261 - 1266)
TTGAAA (nucleotides 148 - 153)
CTTCCA (nucleotides 862 - 867)
TTCAAC (nucleotides 319 - 324)
ATCAAC (nucleotides 268 - 273)
GGTATT (nucleotides 1114 - 1119)
GCCAAA (nucleotides 256 - 261)
CTGAAA (nucleotides 526 - 531)
CTGAAA (nucleotides 853 - 858)
AAACAG (nucleotides 508 - 513)
AAACAG (nucleotides 856 - 861).
213. The nucleotide sequence of Claim 212, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
214. The nucleotide sequence of Claim 212, in which at least 3 of the following codon pair replacements have been made:
GACGAT (nucleotides 208 - 213) replaced with GATGAT
GACGAT (nucleotides 1129 - 1134) replaced with GATGAT
ATCAAG (nucleotides 598 - 603) replaced with ATAAAA
AAACTG (nucleotides 127 - 132) replaced with AAATTG
AAACTG (nucleotides 139 - 144) replaced with AAATTA
AAACTG (nucleotides 1261 - 1266) replaced with AAATTG
TTGAAA (nucleotides 148 - 153) replaced with TTAAAA
CTTCCA (nucleotides 862 - 867) replaced with TTGCCA
TTCAAC (nucleotides 319 - 324) replaced with TTTAAT
ATCAAC (nucleotides 268 - 273) replaced with ATTAAT
GGTATT (nucleotides 1114 - 1119) replaced with GGAATT
GCCAAA (nucleotides 256 - 261) replaced with GCTAAA
CTGAAA (nucleotides 526 - 531) replaced with TTAAAG
CTGAAA (nucleotides 853 - 858) replaced with TTAAAA
AAACAG (nucleotides 508 - 513) replaced with AAACAA
AAACAG (nucleotides 856 - 861) replaced with AAACAA.
215. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTTGTC (nucleotides 31 - 36)
GTCATT (nucleotides 34 - 39)
TTGAAA (nucleotides 148 - 153)
GACGAT (nucleotides 208 - 213)
CAGCAG (nucleotides 892 - 897)
GAGAAA (nucleotides 1018 - 1023)
GAGAAA (nucleotides 1084 - 1089)
GACGTT (nucleotides 1099 - 1104)
GGTATT (nucleotides 1114 - 1119)
GACGAT (nucleotides 1129 - 1134)
GTGAAA (nucleotides 1237 - 1242)
GCGTTT (nucleotides 1450 - 1455).
216. The nucleotide sequence of Claim 215, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
217. The nucleotide sequence of Claim 215, in which at least 3 of the following codon pair replacements have been made:
TTTGTC (nucleotides 31 - 36) replaced with TTCGTT
GTCATT (nucleotides 34 - 39) replaced with GTTATT
TTGAAA (nucleotides 148 - 153) replaced with TTAAAG
GACGAT (nucleotides 208 - 213) replaced with GATGAT
CAGCAG (nucleotides 892 - 897) replaced with CAACAA
GAGAAA (nucleotides 1018 - 1023) replaced with GAAAAA
GAGAAA (nucleotides 1084 - 1089) replaced with GAAAAA
GACGTT (nucleotides 1099 - 1104) replaced with GATGTT
GGTATT (nucleotides 1114 - 1119) replaced with GGAATT
GACGAT (nucleotides 1129 - 1134) replaced with GATGAT
GTGAAA (nucleotides 1237 - 1242) replaced with GTTAAA
GCGTTT (nucleotides 1450 - 1455) replaced with GCGTTC.
218. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194, wherein at least 3 of the following codon pairs of SEQ ID NO: 193 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GCTATT (nucleotides 184 - 189)
GACAGT (nucleotides 340 - 345)
GCGGTT (nucleotides 499 - 504)
GCGGTT (nucleotides 628 - 633)
GTCGAT (nucleotides 688 - 693)
CAGCTT (nucleotides 859 - 864)
GAAGGC (nucleotides 916 - 921)
ACCTAT (nucleotides 1006 - 1011)
GGTATT (nucleotides 1114 - 1119)
AAAGAC (nucleotides 1456 - 1461).
219. The nucleotide sequence of Claim 218, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
220. The nucleotide sequence of Claim 218, in which at least 3 of the following codon pair replacements have been made:
GCTATT (nucleotides 184 - 189) replaced with GCCATT
GACAGT (nucleotides 340 - 345) replaced with GACTCC
GCGGTT (nucleotides 499 - 504) replaced with GCCGTT
GCGGTT (nucleotides 628 - 633) replaced with GCCGTC
GTCGAT (nucleotides 688 - 693) replaced with GTTGAT
CAGCTT (nucleotides 859 - 864) replaced with CAGTTG
GAAGGC (nucleotides 916 - 921) replaced with GAGGGT
ACCTAT (nucleotides 1006 - 1011) replaced with ACGTAC
GGTATT (nucleotides 1114 - 1119) replaced with GGCATA
AAAGAC (nucleotides 1456 - 1461) replaced with AAAGAT.
221. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
222. The nucleotide sequence of Claim 221, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
223. The nucleotide sequence of Claim 221, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
224. A L-arabinose isomerase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
225. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 8-472 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
226. The L-arabinose isomerase-encoding nucleotide sequence of Claim 225, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
227. The L-arabinose isomerase-encoding nucleotide sequence of any of Claims 225-226, wherein no replacement codon encoding amino acids 8-472 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair CTGGTG when expressed in the native organism.
228. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-500 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 194 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-8 of SEQ ID NO: 194 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
229. The L-arabinose isomerase-encoding nucleotide sequence of Claim 228, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
230. The L-arabinose isomerase-encoding nucleotide sequence of any of Claims 228-229, wherein at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 194 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GAAGTG when expressed in the native organism.
231. A L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 of the following codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCC (nucleotides 562 - 567)
GGTATT (nucleotides 445 - 450)
GGTATT (nucleotides 943 - 948)
GAGTTT (nucleotides 319 - 324)
GGATTT (nucleotides 979 - 984)
TTTGCC (nucleotides 322 - 327)
GATATC (nucleotides 1018 - 1023)
CTTTAT (nucleotides 1603 - 1608)
GATATT (nucleotides 586 - 591)
GATATT (nucleotides 736 - 741)
GGCCAA (nucleotides 1000 - 1005).
232. The nucleotide sequence of Claim 231, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
233. The nucleotide sequence of Claim 231, in which at least 3 of the following codon pair replacements have been made:
CTTTCC (nucleotides 562 - 567) replaced with TTGAGT
GGTATT (nucleotides 445 - 450) replaced with GGAATT
GGTATT (nucleotides 943 - 948) replaced with GGAATT
GAGTTT (nucleotides 319 - 324) replaced with GAATTT
GGATTT (nucleotides 979 - 984) replaced with GGATTT
TTTGCC (nucleotides 322 - 327) replaced with TTTGCA
GATATC (nucleotides 1018 - 1023) replaced with GACATT
CTTTAT (nucleotides 1603 - 1608) replaced with TTGTAT
GATATT (nucleotides 586 - 591) replaced with GACATT
GATATT (nucleotides 736 - 741) replaced with GATATA
GGCCAA (nucleotides 1000 - 1005) replaced with GGACAA.
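As an illustration only, and not part of the claims: the replacements recited above are described as encoding identical amino acids, and the nucleotide ranges are 1-based positions in the coding sequence. A minimal Python sketch, assuming the standard genetic code (NCBI translation table 1) and using hypothetical helper names, can check that a listed wild-type codon pair and its replacement translate to the same two residues and convert a nucleotide start position into codon numbers:

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
# Assumption: the host and native organisms both use this table.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_pair(pair: str) -> str:
    """Translate a six-nucleotide codon pair into two one-letter amino acids."""
    return CODON_TABLE[pair[:3]] + CODON_TABLE[pair[3:]]


def is_synonymous(wild: str, replacement: str) -> bool:
    """True if both codon pairs encode the same two amino acids."""
    return translate_pair(wild) == translate_pair(replacement)


def codon_numbers(first_nucleotide: int) -> tuple[int, int]:
    """Map the 1-based first nucleotide of a codon pair to its two codon numbers."""
    first = (first_nucleotide - 1) // 3 + 1
    return first, first + 1


# Three of the replacements recited in Claim 233: (start, wild type, replacement).
EXAMPLES = [
    (562, "CTTTCC", "TTGAGT"),
    (445, "GGTATT", "GGAATT"),
    (319, "GAGTTT", "GAATTT"),
]

for start, wild, repl in EXAMPLES:
    print(start, codon_numbers(start), translate_pair(wild),
          translate_pair(repl), is_synonymous(wild, repl))
```

The check covers only the "identical amino acids" case; testing for conservative substitutions would need a substitution scheme that this sketch does not assume.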
234. A L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 of the following codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTGGCG (nucleotides 304 - 309)
GAAGAG (nucleotides 73 - 78)
GAAGAG (nucleotides 385 - 390)
GCCAGC (nucleotides 64 - 69)
GCCAGC (nucleotides 1105 - 1110)
CTTTCC (nucleotides 562 - 567)
CTCGAC (nucleotides 1183 - 1188)
TTTGCC (nucleotides 322 - 327)
GGGCAA (nucleotides 118 - 123)
ATCCTC (nucleotides 685 - 690)
GACTGG (nucleotides 544 - 549)
GACTGG (nucleotides 1186 - 1191)
GCCAGT (nucleotides 658 - 663)
GCCAGT (nucleotides 1543 - 1548)
GTGGTG (nucleotides 796 - 801)
GTGGTG (nucleotides 970 - 975)
GTGGTG (nucleotides 1177 - 1182)
CTCGGC (nucleotides 778 - 783)
GCGGTA (nucleotides 1549 - 1554)
GACAGC (nucleotides 499 - 504)
CTGGAA (nucleotides 991 - 996)
CTGGAA (nucleotides 1057 - 1062)
AGCCAG (nucleotides 1108 - 1113)
ATTGCC (nucleotides 904 - 909)
GCCGGG (nucleotides 610 - 615)
CTCGGT (nucleotides 1471 - 1476)
GCCTGG (nucleotides 1027 - 1032)
GCGGCA (nucleotides 187 - 192)
GTGATG (nucleotides 1363 - 1368)
GGCGCA (nucleotides 832 - 837)
GGCGCA (nucleotides 841 - 846)
GGCGCA (nucleotides 847 - 852)
GGCGCA (nucleotides 1309 - 1314)
TTCTGG (nucleotides 466 - 471)
GCGCTG (nucleotides 307 - 312)
GCGCTG (nucleotides 1129 - 1134)
GCGCTG (nucleotides 1369 - 1374)
ATCGCC (nucleotides 79 - 84)
ATCGCC (nucleotides 1348 - 1353).
235. The nucleotide sequence of Claim 234, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
236. The nucleotide sequence of Claim 234, in which at least 3 of the following codon pair replacements have been made:
CTGGCG (nucleotides 304 - 309) replaced with CTGGCT
GAAGAG (nucleotides 73 - 78) replaced with GAAGAA
GAAGAG (nucleotides 385 - 390) replaced with GAAGAA
GCCAGC (nucleotides 64 - 69) replaced with GCGTCT
GCCAGC (nucleotides 1105 - 1110) replaced with GCGTCT
CTTTCC (nucleotides 562 - 567) replaced with CTGTCT
CTCGAC (nucleotides 1183 - 1188) replaced with CTGGAT
TTTGCC (nucleotides 322 - 327) replaced with TTTGCG
GGGCAA (nucleotides 118 - 123) replaced with GGTCAG
ATCCTC (nucleotides 685 - 690) replaced with ATCCTG
GACTGG (nucleotides 544 - 549) replaced with GATTGG
GACTGG (nucleotides 1186 - 1191) replaced with GATTGG
GCCAGT (nucleotides 658 - 663) replaced with GCGTCC
GCCAGT (nucleotides 1543 - 1548) replaced with GCTTCT
GTGGTG (nucleotides 796 - 801) replaced with GTTGTT
GTGGTG (nucleotides 970 - 975) replaced with GTTGTT
GTGGTG (nucleotides 1177 - 1182) replaced with GTTGTT
CTCGGC (nucleotides 778 - 783) replaced with CTGGGT
GCGGTA (nucleotides 1549 - 1554) replaced with GCGGTT
GACAGC (nucleotides 499 - 504) replaced with GATTCT
CTGGAA (nucleotides 991 - 996) replaced with CTGGAG
CTGGAA (nucleotides 1057 - 1062) replaced with CTCGAA
AGCCAG (nucleotides 1108 - 1113) replaced with TCTCAG
ATTGCC (nucleotides 904 - 909) replaced with ATCGCG
GCCGGG (nucleotides 610 - 615) replaced with GCGGGT
CTCGGT (nucleotides 1471 - 1476) replaced with TTGGGT
GCCTGG (nucleotides 1027 - 1032) replaced with GCGTGG
GCGGCA (nucleotides 187 - 192) replaced with GCTGCT
GTGATG (nucleotides 1363 - 1368) replaced with GTTATG
GGCGCA (nucleotides 832 - 837) replaced with GGTGCG
GGCGCA (nucleotides 841 - 846) replaced with GGTGCA
GGCGCA (nucleotides 847 - 852) replaced with GGTGCT
GGCGCA (nucleotides 1309 - 1314) replaced with GGCGCT
TTCTGG (nucleotides 466 - 471) replaced with TTTTGG
GCGCTG (nucleotides 307 - 312) replaced with GCTCTG
GCGCTG (nucleotides 1129 - 1134) replaced with GCGCTC
GCGCTG (nucleotides 1369 - 1374) replaced with GCTCTG
ATCGCC (nucleotides 79 - 84) replaced with ATTGCG
ATCGCC (nucleotides 1348 - 1353) replaced with ATCGCG.
237. A L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 of the following codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAGTTT (nucleotides 319 - 324)
GATATC (nucleotides 1018 - 1023)
GATATT (nucleotides 586 - 591)
GATATT (nucleotides 736 - 741)
TTTGCC (nucleotides 322 - 327)
CTTCCA (nucleotides 1651 - 1656)
ATCAAC (nucleotides 1099 - 1104)
GGTATT (nucleotides 445 - 450)
GGTATT (nucleotides 943 - 948)
GCCAAA (nucleotides 1147 - 1152)
CTGAAA (nucleotides 193 - 198)
CTGAAA (nucleotides 1087 - 1092)
CTGAAA (nucleotides 1228 - 1233)
AAACAG (nucleotides 913 - 918)
GGCCAA (nucleotides 1000 - 1005)
CTGGTA (nucleotides 865 - 870)
CTTTCC (nucleotides 562 - 567)
TTTGAC (nucleotides 817 - 822).
238. The nucleotide sequence of Claim 237, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
239. The nucleotide sequence of Claim 237, in which at least 3 of the following codon pair replacements have been made:
GAGTTT (nucleotides 319 - 324) replaced with GAATTT
GATATC (nucleotides 1018 - 1023) replaced with GACATC
GATATT (nucleotides 586 - 591) replaced with GACATC
GATATT (nucleotides 736 - 741) replaced with GACATC
TTTGCC (nucleotides 322 - 327) replaced with TTTGCG
CTTCCA (nucleotides 1651 - 1656) replaced with CTCCCG
ATCAAC (nucleotides 1099 - 1104) replaced with ATCAAC
GGTATT (nucleotides 445 - 450) replaced with GGTATC
GGTATT (nucleotides 943 - 948) replaced with GGTATC
GCCAAA (nucleotides 1147 - 1152) replaced with GCTAAA
CTGAAA (nucleotides 193 - 198) replaced with CTGAAA
CTGAAA (nucleotides 1087 - 1092) replaced with CTGAAA
CTGAAA (nucleotides 1228 - 1233) replaced with CTGAAA
AAACAG (nucleotides 913 - 918) replaced with AAACAG
GGCCAA (nucleotides 1000 - 1005) replaced with GGTCAG
CTGGTA (nucleotides 865 - 870) replaced with CTCGTT
CTTTCC (nucleotides 562 - 567) replaced with CTGTCT
TTTGAC (nucleotides 817 - 822) replaced with TTTGAC.
240. A L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 of the following codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAGTTT (nucleotides 319 - 324)
TTTGCC (nucleotides 322 - 327)
CTTTCC (nucleotides 562 - 567)
GGTACC (nucleotides 568 - 573)
GGCCAA (nucleotides 1000 - 1005)
GATATC (nucleotides 1018 - 1023)
TTTGCT (nucleotides 1486 - 1491).
241. The nucleotide sequence of Claim 240, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
242. The nucleotide sequence of Claim 240, in which at least 3 of the following codon pair replacements have been made:
GAGTTT (nucleotides 319 - 324) replaced with GAGTTC
TTTGCC (nucleotides 322 - 327) replaced with TTCGCT
CTTTCC (nucleotides 562 - 567) replaced with TTGTCT
GGTACC (nucleotides 568 - 573) replaced with GGAACT
GGCCAA (nucleotides 1000 - 1005) replaced with GGACAA
GATATC (nucleotides 1018 - 1023) replaced with GACATT
TTTGCT (nucleotides 1486 - 1491) replaced with TTCGCT.
243. A L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218, wherein at least 3 of the following codon pairs of SEQ ID NO: 217 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTCGAT (nucleotides 19 - 24 )
GCTTTG (nucleotides 46 - 51 )
GATGCC (nucleotides 130 - 135 )
GACAGT (nucleotides 256 - 261 )
GCACCG (nucleotides 277 - 282 )
GATGCC (nucleotides 286 - 291 )
AAAGAC (nucleotides 358 - 363 )
GCGGTT (nucleotides 370 - 375 )
CGCTAT (nucleotides 433 - 438 )
GGTATT (nucleotides 445 - 450 )
GACAGC (nucleotides 499 - 504 )
TCCGGT (nucleotides 565 - 570 )
CGGGCA (nucleotides 931 - 936 )
GGTATT (nucleotides 943 - 948 )
GTGCCT (nucleotides 973 - 978 )
CAGCTT (nucleotides 1063 - 1068 )
GCATGG (nucleotides 1141 - 1146 )
GCCTTT (nucleotides 1303 - 1308 )
CAGCTT (nucleotides 1600 - 1605 )
CTTTAT (nucleotides 1603 - 1608 )
CGCTAT (nucleotides 1612 - 1617 ).
244. The nucleotide sequence of Claim 243, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
245. The nucleotide sequence of Claim 243, in which at least 3 of the following codon pair replacements have been made:
CTCGAT (nucleotides 19 - 24) replaced with TTGGAT
GCTTTG (nucleotides 46 - 51) replaced with GCCCTT
GATGCC (nucleotides 130 - 135) replaced with GATGCT
GACAGT (nucleotides 256 - 261) replaced with GATTCT
GCACCG (nucleotides 277 - 282) replaced with GCCCCG
GATGCC (nucleotides 286 - 291) replaced with GACGCC
AAAGAC (nucleotides 358 - 363) replaced with AAAGAT
GCGGTT (nucleotides 370 - 375) replaced with GCCGTT
CGCTAT (nucleotides 433 - 438) replaced with CGTTAT
GGTATT (nucleotides 445 - 450) replaced with GGCATC
GACAGC (nucleotides 499 - 504) replaced with GATTCT
TCCGGT (nucleotides 565 - 570) replaced with TCTGGC
CGGGCA (nucleotides 931 - 936) replaced with CGTGCC
GGTATT (nucleotides 943 - 948) replaced with GGTATA
GTGCCT (nucleotides 973 - 978) replaced with GTTCCG
CAGCTT (nucleotides 1063 - 1068) replaced with CAGTTG
GCATGG (nucleotides 1141 - 1146) replaced with GCCTGG
GCCTTT (nucleotides 1303 - 1308) replaced with GCCTTC
CAGCTT (nucleotides 1600 - 1605) replaced with CAGTTG
CTTTAT (nucleotides 1603 - 1608) replaced with TTGTAT
CGCTAT (nucleotides 1612 - 1617) replaced with CGTTAT.
246. A L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
247. The nucleotide sequence of Claim 246, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
248. The nucleotide sequence of Claim 246, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
249. A L-ribulokinase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
250. A L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 59-549 of SEQ ID NO: 218 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
251. The L-ribulokinase-encoding nucleotide sequence of Claim 250, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
252. The L-ribulokinase-encoding nucleotide sequence of any of Claims 250-251, wherein no replacement codon encoding amino acids 59-549 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair CTGGCG when expressed in the native organism.
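For illustration only, and not part of the claims: the percentage limits recited in Claims 251 and 252 can be read as simple ratio tests. The sketch below is one literal reading, assuming positive values so that "no more than 150%" and "more than 200%" behave as ordinary multiples. The function names and all numeric inputs are hypothetical and are not taken from the specification.

```python
def meets_claim_251(value_pairs, limit=1.5):
    """At least one replacement value is no more than 150% of its wild-type value.

    value_pairs: iterable of (replacement_value_in_host, wild_type_value_in_native).
    Assumption: values are positive translational kinetics values.
    """
    return any(repl <= limit * wild for repl, wild in value_pairs)


def meets_claim_252(replacement_z_scores, reference_z, limit=2.0):
    """No replacement z score exceeds 200% of the reference wild-type z score.

    Assumption: reference_z is positive (here, the z score of the wild-type
    pair named in the claim, evaluated in the native organism).
    """
    return all(z <= limit * reference_z for z in replacement_z_scores)


# Hypothetical numbers chosen only to exercise both branches.
print(meets_claim_251([(4.0, 2.0), (2.5, 2.0)]))   # second pair satisfies the limit
print(meets_claim_252([1.0, 3.9], reference_z=2.0))
print(meets_claim_252([1.0, 4.1], reference_z=2.0))
```

If the underlying values can be negative, a percentage comparison is ambiguous and this reading would need to be revisited against the definitions in the specification.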
253. A L-ribulokinase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-566 of wild-type L-ribulokinase as set forth in SEQ ID NO: 218 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 1-59 of SEQ ID NO: 218 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
254. The L-ribulokinase-encoding nucleotide sequence of Claim 253, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
255. The L-ribulokinase-encoding nucleotide sequence of any of Claims 253-254, wherein at least one replacement codon encoding amino acids 1-59 of SEQ ID NO: 218 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GAAGAG when expressed in the native organism.
256. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least the following codon pair of SEQ ID NO: 241 has been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AACGTC (nucleotides 82 - 87)
ATCAAA (nucleotides 121 - 126)
GGCCAG (nucleotides 322 - 327)
GCAGAA (nucleotides 403 - 408)
ATCAAC (nucleotides 409 - 414)
AACGTC (nucleotides 439 - 444)
GGTATC (nucleotides 469 - 474)
CCGCAG (nucleotides 613 - 618).
257. The nucleotide sequence of Claim 256, in which one or more of the following codon pair has been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
258. The nucleotide sequence of Claim 256, in which at least the following codon pair replacement has been made:
AACGTC (nucleotides 82 - 87) replaced with AATGTT
ATCAAA (nucleotides 121 - 126) replaced with ATTAAA
GGCCAG (nucleotides 322 - 327) replaced with GGTCAA
GCAGAA (nucleotides 403 - 408) replaced with GCTGAA
ATCAAC (nucleotides 409 - 414) replaced with ATTAAT
AACGTC (nucleotides 439 - 444) replaced with AATGTA
GGTATC (nucleotides 469 - 474) replaced with GGAATT
CCGCAG (nucleotides 613 - 618) replaced with CCACAA.
259. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 of the following codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTGGCG (nucleotides 40 - 45)
GAAGAG (nucleotides 571 - 576)
ACGCTG (nucleotides 637 - 642)
GTCAGC (nucleotides 85 - 90)
CTGGAA (nucleotides 568 - 573)
ACGCCA (nucleotides 229 - 234)
TTCCCG (nucleotides 259 - 264)
GAAGTG (nucleotides 193 - 198)
CAGGCG (nucleotides 316 - 321)
GATCTC (nucleotides 10 - 15)
GCGCTG (nucleotides 43 - 48).
260. The nucleotide sequence of Claim 259, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
261. The nucleotide sequence of Claim 259, in which at least 3 of the following codon pair replacements have been made:
CTGGCG (nucleotides 40 - 45) replaced with TTGGCG
GAAGAG (nucleotides 571 - 576) replaced with GAAGAA
ACGCTG (nucleotides 637 - 642) replaced with ACATTG
GTCAGC (nucleotides 85 - 90) replaced with GTTTCA
CTGGAA (nucleotides 568 - 573) replaced with TTGGAA
ACGCCA (nucleotides 229 - 234) replaced with ACTCCA
TTCCCG (nucleotides 259 - 264) replaced with TTTCCA
GAAGTG (nucleotides 193 - 198) replaced with GAAGTT
CAGGCG (nucleotides 316 - 321) replaced with CAAGCT
GATCTC (nucleotides 10 - 15) replaced with GATTTA
GCGCTG (nucleotides 43 - 48) replaced with GCGTTG.
262. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 of the following codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GACGAT (nucleotides 160 - 165)
ATCAAC (nucleotides 409 - 414)
ATCAAA (nucleotides 121 - 126)
GGTATC (nucleotides 469 - 474)
AAACAG (nucleotides 463 - 468).
263. The nucleotide sequence of Claim 262, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
264. The nucleotide sequence of Claim 262, in which at least 3 of the following codon pair replacements have been made:
GACGAT (nucleotides 160 - 165) replaced with GATGAT
ATCAAC (nucleotides 409 - 414) replaced with ATTAAT
ATCAAA (nucleotides 121 - 126) replaced with ATTAAA
GGTATC (nucleotides 469 - 474) replaced with GGAATT
AAACAG (nucleotides 463 - 468) replaced with AAACAA.
265. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 of the following codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
ATCAAA (nucleotides 121 - 126)
GACGAT (nucleotides 160 - 165)
TATTTC (nucleotides 361 - 366)
ACCATT (nucleotides 373 - 378)
GGTATC (nucleotides 469 - 474)
TTTGCA (nucleotides 520 - 525).
266. The nucleotide sequence of Claim 265, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
267. The nucleotide sequence of Claim 265, in which at least 3 of the following codon pair replacements have been made:
ATCAAA (nucleotides 121 - 126) replaced with ATTAAA
GACGAT (nucleotides 160 - 165) replaced with GATGAT
TATTTC (nucleotides 361 - 366) replaced with TACTTC
ACCATT (nucleotides 373 - 378) replaced with ACAATT
GGTATC (nucleotides 469 - 474) replaced with GGAATT
TTTGCA (nucleotides 520 - 525) replaced with TTCGCG.
268. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242, wherein at least 3 of the following codon pairs of SEQ ID NO: 241 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
ACATGG (nucleotides 73 - 78)
GTCGAT (nucleotides 136 - 141)
CTCTAT (nucleotides 247 - 252)
GGTATC (nucleotides 469 - 474)
GCATGG (nucleotides 523 - 528).
269. The nucleotide sequence of Claim 268, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
270. The nucleotide sequence of Claim 268, in which at least 3 of the following codon pair replacements have been made:
ACATGG (nucleotides 73 - 78) replaced with ACCTGG
GTCGAT (nucleotides 136 - 141) replaced with GTCGAC
CTCTAT (nucleotides 247 - 252) replaced with TTGTAT
GGTATC (nucleotides 469 - 474) replaced with GGCATT
GCATGG (nucleotides 523 - 528) replaced with GCTTGG.
271. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
272. The nucleotide sequence of Claim 271, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
273. The nucleotide sequence of Claim 271, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
274. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
275. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 7-217 of SEQ ID NO: 242 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
276. The L-ribulose-5-P 4-epimerase-encoding nucleotide sequence of Claim 275, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
277. The L-ribulose-5-P 4-epimerase-encoding nucleotide sequence of any of Claims 275-276, wherein no replacement codon encoding amino acids 7-217 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair CTGGCG when expressed in the native organism.
278. A L-ribulose-5-P 4-epimerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-231 of wild-type L-ribulose-5-P 4-epimerase as set forth in SEQ ID NO: 242 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-7 of SEQ ID NO: 242 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
279. The L-ribulose-5-P 4-epimerase-encoding nucleotide sequence of Claim 278, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
280. The L-ribulose-5-P 4-epimerase-encoding nucleotide sequence of any of Claims 278-279, wherein at least one replacement codon encoding amino acids 1-7 of SEQ ID NO: 242 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GATCTC when expressed in the native organism.
281. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 of the following codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
ATCAAA (nucleotides 22 - 27)
TTGAAC (nucleotides 286 - 291)
TTGAAC (nucleotides 700 - 705)
ATCAAG (nucleotides 115 - 120)
ATCAAG (nucleotides 553 - 558)
ATCAAG (nucleotides 733 - 738)
GCCAAG (nucleotides 748 - 753)
GCCAAG (nucleotides 901 - 906).
282. The nucleotide sequence of Claim 281, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
283. The nucleotide sequence of Claim 281, in which at least 3 of the following codon pair replacements have been made:
ATCAAA (nucleotides 22 - 27) replaced with ATTAAA
TTGAAC (nucleotides 286 - 291) replaced with TTAAAT
TTGAAC (nucleotides 700 - 705) replaced with TTAAAT
ATCAAG (nucleotides 115 - 120) replaced with ATTAAA
ATCAAG (nucleotides 553 - 558) replaced with ATTAAA
ATCAAG (nucleotides 733 - 738) replaced with ATTAAA
GCCAAG (nucleotides 748 - 753) replaced with GCAAAA
GCCAAG (nucleotides 901 - 906) replaced with GCTAAA.
284. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 of the following codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GAAGAG (nucleotides 220 - 225)
TTCCTC (nucleotides 229 - 234)
ATTGCC (nucleotides 349 - 354)
ATCGCC (nucleotides 898 - 903)
GACTGG (nucleotides 940 - 945).
285. The nucleotide sequence of Claim 284, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
286. The nucleotide sequence of Claim 284, in which at least 3 of the following codon pair replacements have been made:
GAAGAG (nucleotides 220 - 225) replaced with GAAGAA
TTCCTC (nucleotides 229 - 234) replaced with TTCCTG
ATTGCC (nucleotides 349 - 354) replaced with ATCGCG
ATCGCC (nucleotides 898 - 903) replaced with ATCGCG
GACTGG (nucleotides 940 - 945) replaced with GATTGG.
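For illustration only, and not part of the claims: a condition of the form "at least 3 of the following codon pair replacements have been made" can be checked mechanically against a candidate coding sequence. The sketch below uses the positions and replacement pairs recited in Claim 286. The candidate sequence is a toy placeholder built for the example, not SEQ ID NO: 265, and the function names are hypothetical.

```python
# (1-based first nucleotide, wild-type pair, replacement pair), as recited in Claim 286.
CLAIM_286 = [
    (220, "GAAGAG", "GAAGAA"),
    (229, "TTCCTC", "TTCCTG"),
    (349, "ATTGCC", "ATCGCG"),
    (898, "ATCGCC", "ATCGCG"),
    (940, "GACTGG", "GATTGG"),
]


def count_replacements(candidate: str, replacements) -> int:
    """Count listed positions where the candidate carries the recited replacement."""
    made = 0
    for start, wild, repl in replacements:
        window = candidate[start - 1:start + 5]  # six nucleotides, 1-based start
        # Entries whose replacement equals the wild type are not counted as changes.
        if repl != wild and window == repl:
            made += 1
    return made


def toy_sequence(length: int, fragments) -> str:
    """Build a placeholder sequence of 'A's with fragments written at 1-based starts."""
    seq = ["A"] * length
    for start, frag in fragments:
        seq[start - 1:start + 5] = frag
    return "".join(seq)


# Toy candidate: three recited replacements made, two positions left as wild type.
candidate = toy_sequence(945, [
    (220, "GAAGAA"), (229, "TTCCTG"), (349, "ATCGCG"),
    (898, "ATCGCC"), (940, "GACTGG"),
])
made = count_replacements(candidate, CLAIM_286)
print(made, made >= 3)  # prints: 3 True
```

The same check applies to the other "replaced with" lists in this section by swapping in the corresponding table.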
287. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 of the following codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TCCAAG (nucleotides 238 - 243)
ATCAAG (nucleotides 115 - 120)
ATCAAG (nucleotides 553 - 558)
ATCAAG (nucleotides 733 - 738)
TTCAAG (nucleotides 355 - 360)
TTCAAC (nucleotides 859 - 864)
TTCAAC (nucleotides 925 - 930)
ATCAAA (nucleotides 22 - 27)
GTCAAG (nucleotides 184 - 189)
GTCAAG (nucleotides 211 - 216)
GACGAA (nucleotides 199 - 204)
GGTATC (nucleotides 802 - 807)
TTGAAC (nucleotides 286 - 291)
TTGAAC (nucleotides 700 - 705).
288. The nucleotide sequence of Claim 287, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
289. The nucleotide sequence of Claim 287, in which at least 3 of the following codon pair replacements have been made:
TCCAAG (nucleotides 238 - 243) replaced with TCTAAA ATCAAG (nucleotides 115 - 120) replaced with ATTAAA ATCAAG (nucleotides 553 - 558) replaced with ATTAAG ATCAAG (nucleotides 733 - 738) replaced with ATTAAG TTCAAG (nucleotides 355 - 360) replaced with TTTAAA TTCAAC (nucleotides 859 - 864) replaced with TTTAAT TTCAAC (nucleotides 925 - 930) replaced with TTTAAT ATCAAA (nucleotides 22 - 27) replaced with ATTAAA GTCAAG (nucleotides 184 - 189) replaced with GTTAAA GTCAAG (nucleotides 211 - 216) replaced with GTTAAG GACGAA (nucleotides 199 - 204) replaced with GATGAA GGTATC (nucleotides 802 - 807) replaced with GGAATT TTGAAC (nucleotides 286 - 291) replaced with TTAAAT TTGAAC (nucleotides 700 - 705) replaced with TTAAAT.
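Applying a recited replacement to a sequence amounts to overwriting six nucleotides at a 1-based position after confirming that the wild-type pair is actually there. The following sketch illustrates this with the ATCAAA (nucleotides 22 - 27) replacement of Claim 289. The flanking sequence is invented for the example and is not SEQ ID NO: 265; the function name `apply_codon_pair_edits` is hypothetical.

```python
def apply_codon_pair_edits(seq, edits):
    """Apply (start, wild_type_pair, replacement_pair) edits to seq.

    start is a 1-based nucleotide position. Each edit is checked against
    the original sequence before it is applied."""
    out = list(seq)
    for start, wild, new in edits:
        i = start - 1
        if len(wild) != 6 or len(new) != 6:
            raise ValueError("codon pairs must be 6 nucleotides long")
        if seq[i:i + 6] != wild:
            raise ValueError(f"expected {wild} at nucleotide {start}")
        out[i:i + 6] = new
    return "".join(out)

# Invented 30-nucleotide toy sequence with ATCAAA at nucleotides 22-27.
toy = "ATG" * 7 + "ATCAAA" + "TAA"
edited = apply_codon_pair_edits(toy, [(22, "ATCAAA", "ATTAAA")])
print(edited)
```

Checking the wild-type pair first guards against off-by-one errors in the recited coordinates.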
290. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 of the following codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
ATCAAA (nucleotides 22 - 27 ) TTGAAC (nucleotides 286 - 291 ) TTCCCA (nucleotides 343 - 348 ) TTCCCA (nucleotides 511 - 516 ) TTGAAC (nucleotides 700 - 705 ) GCCAAG (nucleotides 748 - 753 ) GGTATC (nucleotides 802 - 807 ) GCCAAG (nucleotides 901 - 906 ).
291. The nucleotide sequence of Claim 290, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
292. The nucleotide sequence of Claim 290, in which at least 3 of the following codon pair replacements have been made:
ATCAAA (nucleotides 22 - 27 ) replaced with ATAAAA TTGAAC (nucleotides 286 - 291 ) replaced with TTAAAT TTCCCA (nucleotides 343 - 348 ) replaced with TTCCCT TTCCCA (nucleotides 511 - 516 ) replaced with TTCCCT TTGAAC (nucleotides 700 - 705 ) replaced with TTAAAC GCCAAG (nucleotides 748 - 753 ) replaced with GCTAAA GGTATC (nucleotides 802 - 807 ) replaced with GGAATT GCCAAG (nucleotides 901 - 906 ) replaced with GCTAAA.
293. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type xylose reductase as set forth in SEQ ID NO: 266, wherein at least 3 of the following codon pairs of SEQ ID NO: 265 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GCCGGT (nucleotides 91 - 96 ) GCCGGT (nucleotides 121 - 126 ) GCCTTG (nucleotides 283 - 288 ) GCCGGT (nucleotides 478 - 483 ) GCTTTG (nucleotides 520 - 525 ) GCCGGT (nucleotides 628 - 633 ) GCTTTG (nucleotides 697 - 702 ) GCTATT (nucleotides 739 - 744 ) GCCAAG (nucleotides 748 - 753 ) GGTATC (nucleotides 802 - 807 ) GCCAAG (nucleotides 901 - 906 ).
294. The nucleotide sequence of Claim 293, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
295. The nucleotide sequence of Claim 293, in which at least 3 of the following codon pair replacements have been made:
GCCGGT (nucleotides 91 - 96 ) replaced with GCGGGT GCCGGT (nucleotides 121 - 126 ) replaced with GCTGGT GCCTTG (nucleotides 283 - 288 ) replaced with GCTCTT GCCGGT (nucleotides 478 - 483 ) replaced with GCTGGC GCTTTG (nucleotides 520 - 525 ) replaced with GCTCTT GCCGGT (nucleotides 628 - 633 ) replaced with GCTGGA GCTTTG (nucleotides 697 - 702 ) replaced with GCTCTT GCTATT (nucleotides 739 - 744 ) replaced with GCCATT GCCAAG (nucleotides 748 - 753 ) replaced with GCGAAA GGTATC (nucleotides 802 - 807 ) replaced with GGCATA GCCAAG (nucleotides 901 - 906 ) replaced with GCCAAA.
296. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
297. The nucleotide sequence of Claim 296, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
298. The nucleotide sequence of Claim 296, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
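The threshold in Claim 298 compares a codon pair's translational kinetics value with 1.5 times the standard deviation of such values for the host. The sketch below shows only that comparison. How the translational kinetics value is computed, and its sign convention, are defined elsewhere in the specification and are not reproduced here; all values below are invented, the use of the population standard deviation is an assumption, and the function name is hypothetical.

```python
from statistics import pstdev

def pairs_exceeding_sd_multiple(values, k=1.5):
    """Return the codon pairs whose value is greater than k times the
    population standard deviation of all values for the host."""
    threshold = k * pstdev(values.values())
    return {pair for pair, value in values.items() if value > threshold}

# Invented values for six codon pairs in a hypothetical host organism.
toy_values = {
    "GAAGAA": 6.0, "GAAGAG": 1.0, "TTCCTG": -1.0,
    "TTCCTC": 0.5, "ATCGCG": -0.5, "ATTGCC": -2.0,
}
selected = pairs_exceeding_sd_multiple(toy_values)
print(sorted(selected))
```

In practice the standard deviation would be taken over all codon pairs tabulated for the host, not over a handful of pairs as in this toy input.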
299. A xylose reductase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
300. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO: 1 and which encode amino acids 9-306 of SEQ ID NO: 266 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
301. The xylose reductase-encoding nucleotide sequence of Claim 300, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
302. A xylose reductase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-322 of wild-type xylose reductase as set forth in SEQ ID NO: 266 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-9 of SEQ ID NO: 266 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
303. The xylose reductase-encoding nucleotide sequence of Claim 302, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
304. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 of the following codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCC (nucleotides 274 - 279 )
GATATC (nucleotides 325 - 330 )
CTTTAT (nucleotides 682 - 687 )
GGGTTT (nucleotides 901 - 906 )
TTTGCC (nucleotides 904 - 909 )
GCCATT (nucleotides 1159 - 1164 )
GATATT (nucleotides 1180 - 1185 )
TTGAAA (nucleotides 1291 - 1296 )
GAAAGT (nucleotides 1402 - 1407 ).
305. The nucleotide sequence of Claim 304, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
306. The nucleotide sequence of Claim 304, in which at least 3 of the following codon pair replacements have been made:
CTTTCC (nucleotides 274 - 279 ) replaced with TTAAGT GATATC (nucleotides 325 - 330 ) replaced with GACATT CTTTAT (nucleotides 682 - 687 ) replaced with CTATAT GGGTTT (nucleotides 901 - 906 ) replaced with GGTTTT TTTGCC (nucleotides 904 - 909 ) replaced with TTTGCA GCCATT (nucleotides 1159 - 1164 ) replaced with GCTATT GATATT (nucleotides 1180 - 1185 ) replaced with GATATA TTGAAA (nucleotides 1291 - 1296 ) replaced with TTAAAA GAAAGT (nucleotides 1402 - 1407 ) replaced with GAATCT.
307. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 of the following codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TTCTGG (nucleotides 25 - 30 ) AGCCAG (nucleotides 43 - 48 ) GAAGAG (nucleotides 61 - 66 ) ACGCTG (nucleotides 67 - 72 ) CTGGAA (nucleotides 70 - 75 ) CTTTCC (nucleotides 274 - 279 ) ATTGCC (nucleotides 436 - 441 ) GAAGTG (nucleotides 460 - 465 ) GCCAGA (nucleotides 532 - 537 ) GCGGTA (nucleotides 562 - 567 ) GATCTC (nucleotides 634 - 639 ) GAAGTG (nucleotides 643 - 648 ) GTGATG (nucleotides 646 - 651 ) CAGGCG (nucleotides 763 - 768 ) GAAGTG (nucleotides 835 - 840 ) TTTGCC (nucleotides 904 - 909 ) CGGATG (nucleotides 943 - 948 ) GAAGTG (nucleotides 1048 - 1053 ) AAAGAG (nucleotides 1114 - 1119 ) TTCCGC (nucleotides 1195 - 1200 ).
308. The nucleotide sequence of Claim 307, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
309. The nucleotide sequence of Claim 307, in which at least 3 of the following codon pair replacements have been made:
TTCTGG (nucleotides 25 - 30 ) replaced with TTTTGG AGCCAG (nucleotides 43 - 48 ) replaced with TCTCAG GAAGAG (nucleotides 61 - 66 ) replaced with GAAGAA ACGCTG (nucleotides 67 - 72 ) replaced with ACCCTC CTGGAA (nucleotides 70 - 75 ) replaced with CTCGAA CTTTCC (nucleotides 274 - 279 ) replaced with CTGAGC ATTGCC (nucleotides 436 - 441 ) replaced with ATCGCG GAAGTG (nucleotides 460 - 465 ) replaced with GAAGTT GCCAGA (nucleotides 532 - 537 ) replaced with GCACGC GCGGTA (nucleotides 562 - 567 ) replaced with GCGGTT GATCTC (nucleotides 634 - 639 ) replaced with GATTTG GAAGTG (nucleotides 643 - 648 ) replaced with GAAGTT GTGATG (nucleotides 646 - 651 ) replaced with GTTATG CAGGCG (nucleotides 763 - 768 ) replaced with CAGGCT GAAGTG (nucleotides 835 - 840 ) replaced with GAAGTT TTTGCC (nucleotides 904 - 909 ) replaced with TTCGCT CGGATG (nucleotides 943 - 948 ) replaced with CGTATG GAAGTG (nucleotides 1048 - 1053 ) replaced with GAGGTT AAAGAG (nucleotides 1114 - 1119 ) replaced with AAGGAG TTCCGC (nucleotides 1195 - 1200 ) replaced with TTTCGT.
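Some of the recited codon pairs overlap by one codon, for example ACGCTG (nucleotides 67 - 72) and CTGGAA (nucleotides 70 - 75), so both replacements must agree on the shared codon if they are made together. The following sketch checks that agreement for a subset of the Claim 309 replacements. The function name `shared_codon_conflicts` is hypothetical, and the final "conflicting" edit list is invented to show what a disagreement looks like.

```python
def shared_codon_conflicts(edits):
    """Find codons that two edits would set to different values.

    edits: list of (start, wild_type_pair, replacement_pair), where start
    is the 1-based nucleotide position of the first codon of the pair.
    Returns a list of (codon_start, first_value, conflicting_value)."""
    chosen = {}
    conflicts = []
    for start, _wild, new in edits:
        for offset in (0, 3):
            pos = start + offset
            codon = new[offset:offset + 3]
            if pos in chosen and chosen[pos] != codon:
                conflicts.append((pos, chosen[pos], codon))
            chosen.setdefault(pos, codon)
    return conflicts

# Overlapping replacements recited in Claim 309.
claim_309_subset = [
    (67, "ACGCTG", "ACCCTC"),
    (70, "CTGGAA", "CTCGAA"),
    (643, "GAAGTG", "GAAGTT"),
    (646, "GTGATG", "GTTATG"),
]
conflicts = shared_codon_conflicts(claim_309_subset)

# Invented counter-example: the second edit disagrees on the codon at 70.
bad = shared_codon_conflicts([(67, "ACGCTG", "ACCCTC"), (70, "CTGGAA", "TTGGAA")])
print(conflicts, bad)
```

For the recited replacements the shared codons agree (CTC at nucleotide 70 and GTT at nucleotide 646), so no conflict is reported.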
310. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 of the following codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCC (nucleotides 274 - 279 ) GATATC (nucleotides 325 - 330 ) ATCAAC (nucleotides 403 - 408 ) GACGAA (nucleotides 733 - 738 ) TCGTTT (nucleotides 829 - 834 ) AAACAG (nucleotides 853 - 858 ) GGGTTT (nucleotides 901 - 906 ) TTTGCC (nucleotides 904 - 909 ) GATATT (nucleotides 1180 - 1185 ) TTGAAA (nucleotides 1291 - 1296 ) AAACTG (nucleotides 1438 - 1443 ) CTGAAA (nucleotides 1441 - 1446 ) CTTCAA (nucleotides 1480 - 1485 ).
311. The nucleotide sequence of Claim 310, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
312. The nucleotide sequence of Claim 310, in which at least 3 of the following codon pair replacements have been made:
CTTTCC (nucleotides 274 - 279 ) replaced with TTATCT GATATC (nucleotides 325 - 330 ) replaced with GACATT ATCAAC (nucleotides 403 - 408 ) replaced with ATTAAT GACGAA (nucleotides 733 - 738 ) replaced with GATGAA TCGTTT (nucleotides 829 - 834 ) replaced with TCTTTT AAACAG (nucleotides 853 - 858 ) replaced with AAACAA GGGTTT (nucleotides 901 - 906 ) replaced with GGATTC TTTGCC (nucleotides 904 - 909 ) replaced with TTCGCT GATATT (nucleotides 1180 - 1185 ) replaced with GATATA TTGAAA (nucleotides 1291 - 1296 ) replaced with TTAAAA AAACTG (nucleotides 1438 - 1443 ) replaced with AAATTG CTGAAA (nucleotides 1441 - 1446 ) replaced with TTGAAG CTTCAA (nucleotides 1480 - 1485 ) replaced with TTGCAA.
313. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 of the following codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCC (nucleotides 274 - 279 ) GATATC (nucleotides 325 - 330 ) GTGAAA (nucleotides 463 - 468 ) GGGTTT (nucleotides 901 - 906 ) TTTGCC (nucleotides 904 - 909 ) GCCATT (nucleotides 1159 - 1164 ) TTGAAA (nucleotides 1291 - 1296 ) AAATGG (nucleotides 1456 - 1461 ).
314. The nucleotide sequence of Claim 313, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
315. The nucleotide sequence of Claim 313, in which at least 3 of the following codon pair replacements have been made:
CTTTCC (nucleotides 274 - 279 ) replaced with TTGTCC GATATC (nucleotides 325 - 330 ) replaced with GACATT GTGAAA (nucleotides 463 - 468 ) replaced with GTTAAA GGGTTT (nucleotides 901 - 906 ) replaced with GGTTTC TTTGCC (nucleotides 904 - 909 ) replaced with TTCGCA GCCATT (nucleotides 1159 - 1164 ) replaced with GCTATT TTGAAA (nucleotides 1291 - 1296 ) replaced with TTAAAG AAATGG (nucleotides 1456 - 1461 ) replaced with AAGTGG.
316. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290, wherein at least 3 of the following codon pairs of SEQ ID NO: 289 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTGTTA (nucleotides 184 - 189 ) ACATGG (nucleotides 229 - 234 ) GAAGGC (nucleotides 268 - 273 ) AACAGC (nucleotides 361 - 366 ) GCGGCT (nucleotides 496 - 501 ) GTAACG (nucleotides 565 - 570 ) ATCGGG (nucleotides 628 - 633 ) CTTTAT (nucleotides 682 - 687 ) GCTTTT (nucleotides 790 - 795 ) GCCGGT (nucleotides 907 - 912 ) GCTTTG (nucleotides 1066 - 1071 ) AAAGAC (nucleotides 1237 - 1242 ) GCATGG (nucleotides 1309 - 1314 ) CTTGAT (nucleotides 1375 - 1380 ) CTTTAC (nucleotides 1471 - 1476 ).
317. The nucleotide sequence of Claim 316, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
318. The nucleotide sequence of Claim 316, in which at least 3 of the following codon pair replacements have been made:
CTGTTA (nucleotides 184 - 189 ) replaced with TTGTTG ACATGG (nucleotides 229 - 234 ) replaced with ACCTGG GAAGGC (nucleotides 268 - 273 ) replaced with GAGGGC AACAGC (nucleotides 361 - 366 ) replaced with AACTCT GCGGCT (nucleotides 496 - 501 ) replaced with GCCGCA GTAACG (nucleotides 565 - 570 ) replaced with GTTACC ATCGGG (nucleotides 628 - 633 ) replaced with ATTGGT CTTTAT (nucleotides 682 - 687 ) replaced with TTGTAT GCTTTT (nucleotides 790 - 795 ) replaced with GCATTC GCCGGT (nucleotides 907 - 912 ) replaced with GCTGGT GCTTTG (nucleotides 1066 - 1071 ) replaced with GCCTTA AAAGAC (nucleotides 1237 - 1242 ) replaced with AAAGAT GCATGG (nucleotides 1309 - 1314 ) replaced with GCTTGG CTTGAT (nucleotides 1375 - 1380 ) replaced with TTGGAT CTTTAC (nucleotides 1471 - 1476 ) replaced with TTATAT.
319. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
320. The nucleotide sequence of Claim 319, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
321. The nucleotide sequence of Claim 319, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
322. A L-arabinose isomerase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatta (Monkey)
Escherichia coli K12 W3110
Escherichia coli UTI89
Escherichia coli O157:H7 EDL933
Escherichia coli O157:H7 str. Sakai
Bombyx mori
Spodoptera frugiperda
Drosophila melanogaster
Schizosaccharomyces pombe.
323. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 7-487 of SEQ ID NO: 290 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pair is predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
324. The L-arabinose isomerase-encoding nucleotide sequence of Claim 323, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
325. The L-arabinose isomerase-encoding nucleotide sequence of any of Claims 323-324, wherein no replacement codon encoding amino acids 7-487 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair GGCGGA when expressed in the native organism.
326. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-496 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 290 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-8 of SEQ ID NO: 290 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
327. The L-arabinose isomerase-encoding nucleotide sequence of Claim 326, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
328. The L-arabinose isomerase-encoding nucleotide sequence of any of Claims 326-327, wherein at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 290 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair AAGGAT when expressed in the native organism.
329. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 of the following codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
CTTTCC (nucleotides 274 - 279 ) CAGTTT (nucleotides 313 - 318 ) AATATT (nucleotides 361 - 366 ) ATCAAA (nucleotides 523 - 528 ) CTTTAT (nucleotides 703 - 708 ) GTGGAA (nucleotides 1204 - 1209 ).
330. The nucleotide sequence of Claim 329, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
331. The nucleotide sequence of Claim 329, in which at least 3 of the following codon pair replacements have been made:
CTTTCC (nucleotides 274 - 279 ) replaced with TTGTCT CAGTTT (nucleotides 313 - 318 ) replaced with CAATTT AATATT (nucleotides 361 - 366 ) replaced with AACATT ATCAAA (nucleotides 523 - 528 ) replaced with ATTAAG CTTTAT (nucleotides 703 - 708 ) replaced with TTGTAT GTGGAA (nucleotides 1204 - 1209 ) replaced with GTTGAA.
332. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 of the following codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
AGCCAG (nucleotides 43 - 48 ) GAAGAG (nucleotides 61 - 66 ) GCGGTA (nucleotides 67 - 72 ) GAAGAG (nucleotides 82 - 87 ) TCGCTG (nucleotides 163 - 168 ) GAAGAG (nucleotides 190 - 195 ) GAAGAG (nucleotides 208 - 213 ) CTTTCC (nucleotides 274 - 279 ) ATCGCC (nucleotides 436 - 441 ) GCCGGA (nucleotides 439 - 444 ) GCGGTA (nucleotides 562 - 567 ) GATCTC (nucleotides 634 - 639 ) GCGGCA (nucleotides 727 - 732 ) CAGGCG (nucleotides 751 - 756 ) ATCCTC (nucleotides 1015 - 1020 ) CTCGGC (nucleotides 1018 - 1023 ) GAAGTG (nucleotides 1036 - 1041 ) ATTGCC (nucleotides 1051 - 1056 ).
333. The nucleotide sequence of Claim 332, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
334. The nucleotide sequence of Claim 332, in which at least 3 of the following codon pair replacements have been made:
AGCCAG (nucleotides 43 - 48 ) replaced with TCTCAG GAAGAG (nucleotides 61 - 66 ) replaced with GAAGAA GCGGTA (nucleotides 67 - 72 ) replaced with GCTGTT GAAGAG (nucleotides 82 - 87 ) replaced with GAAGAA TCGCTG (nucleotides 163 - 168 ) replaced with TCTCTG GAAGAG (nucleotides 190 - 195 ) replaced with GAAGAA GAAGAG (nucleotides 208 - 213 ) replaced with GAAGAA CTTTCC (nucleotides 274 - 279 ) replaced with CTGTCT ATCGCC (nucleotides 436 - 441 ) replaced with ATCGCT GCCGGA (nucleotides 439 - 444 ) replaced with GCTGGT GCGGTA (nucleotides 562 - 567 ) replaced with GCGGTT GATCTC (nucleotides 634 - 639 ) replaced with GACTTG GCGGCA (nucleotides 727 - 732 ) replaced with GCTGCT CAGGCG (nucleotides 751 - 756 ) replaced with CAGGCT ATCCTC (nucleotides 1015 - 1020 ) replaced with ATCCTG CTCGGC (nucleotides 1018 - 1023 ) replaced with CTGGGT GAAGTG (nucleotides 1036 - 1041 ) replaced with GAAGTT ATTGCC (nucleotides 1051 - 1056 ) replaced with ATCGCG.
335. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 of the following codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
TCCAAA (nucleotides 91 - 96 ) AAACTG (nucleotides 181 - 186 ) GACGAA (nucleotides 205 - 210 ) GCCAAA (nucleotides 253 - 258 ) CTTTCC (nucleotides 274 - 279 ) CAGTTT (nucleotides 313 - 318 ) AATATT (nucleotides 361 - 366 ) ATCAAA (nucleotides 523 - 528 ) GTCAAG (nucleotides 742 - 747 ) TTTGAC (nucleotides 1126 - 1131 ) AAGTTT (nucleotides 1474 - 1479 ).
336. The nucleotide sequence of Claim 335, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
337. The nucleotide sequence of Claim 335, in which at least 3 of the following codon pair replacements have been made:
TCCAAA (nucleotides 91 - 96 ) replaced with TCTAAA AAACTG (nucleotides 181 - 186 ) replaced with AAATTG GACGAA (nucleotides 205 - 210 ) replaced with GATGAA GCCAAA (nucleotides 253 - 258 ) replaced with GCTAAA CTTTCC (nucleotides 274 - 279 ) replaced with TTGTCT CAGTTT (nucleotides 313 - 318 ) replaced with CAATTT AATATT (nucleotides 361 - 366 ) replaced with AACATT ATCAAA (nucleotides 523 - 528 ) replaced with ATTAAA GTCAAG (nucleotides 742 - 747 ) replaced with GTTAAA TTTGAC (nucleotides 1126 - 1131 ) replaced with TTTGAT AAGTTT (nucleotides 1474 - 1479 ) replaced with AAATTT.
338. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 of the following codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GTGTTT (nucleotides 22 - 27 ) CTTTCC (nucleotides 274 - 279 ) CAGTTT (nucleotides 313 - 318 ) AAATGG (nucleotides 481 - 486 ) ATCAAA (nucleotides 523 - 528 ) GTGTTT (nucleotides 1123 - 1128 ) AAATGG (nucleotides 1444 - 1449 ).
339. The nucleotide sequence of Claim 338, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
340. The nucleotide sequence of Claim 338, in which at least 3 of the following codon pair replacements have been made:
GTGTTT (nucleotides 22 - 27 ) replaced with GTTTTC CTTTCC (nucleotides 274 - 279 ) replaced with TTGTCT CAGTTT (nucleotides 313 - 318 ) replaced with CAATTC AAATGG (nucleotides 481 - 486 ) replaced with AAGTGG ATCAAA (nucleotides 523 - 528 ) replaced with ATTAAA GTGTTT (nucleotides 1123 - 1128 ) replaced with GTTTTC AAATGG (nucleotides 1444 - 1449 ) replaced with AAGTGG.
341. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-324 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302, wherein at least 3 of the following codon pairs of SEQ ID NO: 301 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof:
GTCAGA (nucleotides 175 - 180 ) GCCGGA (nucleotides 439 - 444 ) CAGCTT (nucleotides 598 - 603 ) ATCAAT (nucleotides 649 - 654 ) CTTTAT (nucleotides 703 - 708 ) GAAGGC (nucleotides 718 - 723 ) GCAAGG (nucleotides 730 - 735 ) GCCTTT (nucleotides 805 - 810 ) CAGCTT (nucleotides 844 - 849 ) GAAGGC (nucleotides 880 - 885 ) ATCAAT (nucleotides 1195 - 1200 ) TCGGCT (nucleotides 1288 - 1293 ) CTCGAT (nucleotides 1363 - 1368 ) ATCAAT (nucleotides 1402 - 1407 ).
342. The nucleotide sequence of Claim 341, in which at least 5 of the specified codon pairs have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof.
343. The nucleotide sequence of Claim 341, in which at least 3 of the following codon pair replacements have been made:
GTCAGA (nucleotides 175 - 180 ) replaced with GTTCGT GCCGGA (nucleotides 439 - 444 ) replaced with GCTGGT CAGCTT (nucleotides 598 - 603 ) replaced with CAGTTG ATCAAT (nucleotides 649 - 654 ) replaced with ATTAAT CTTTAT (nucleotides 703 - 708 ) replaced with TTGTAT GAAGGC (nucleotides 718 - 723 ) replaced with GAGGGC GCAAGG (nucleotides 730 - 735 ) replaced with GCTCGT GCCTTT (nucleotides 805 - 810 ) replaced with GCTTTC CAGCTT (nucleotides 844 - 849 ) replaced with CAGTTG GAAGGC (nucleotides 880 - 885 ) replaced with GAGGGA ATCAAT (nucleotides 1195 - 1200 ) replaced with ATTAAT TCGGCT (nucleotides 1288 - 1293 ) replaced with TCTGCT CTCGAT (nucleotides 1363 - 1368 ) replaced with TTGGAC ATCAAT (nucleotides 1402 - 1407 ) replaced with ATTAAT.
344. A L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, and wherein the host organism is not human, E. coli or S. cerevisiae.
345. The nucleotide sequence of Claim 344, wherein said at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause are highly-overrepresented codon pairs.
346. The nucleotide sequence of Claim 344, wherein a codon pair predicted to be less likely to cause a translational pause is a codon pair that has a translational kinetics value greater than 1.5 times the standard deviation of translational kinetics values for the host organism.
347. A L-arabinose isomerase-encoding nucleotide sequence, having at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least three codon pairs of the wild-type sequence that are predicted to cause a translational pause in the host organism have been replaced with codon pairs that are predicted to be less likely to cause a translational pause therein, wherein the host organisms are as follows:
Pichia pastoris
Oryctolagus cuniculus (rabbit)
Macaca fascicularis (Long-tailed monkey)
Macaca mulatto (Monkey) Escherichia coli Kl 2 W31 10 Escherichia coli UTl 89 Escherichia co//O157:H7 EDL933 Escherichia coli O157:H7 str. Sakai Bombyx mori Spodoptera frugiperda Drosophila melanogaster Schizosaccharomyces pombe.
348. An L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 9-483 of SEQ ID NO: 302 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least three replacement codon pairs are predicted to be less likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pairs when expressed in the heterologous host organism.
349. The L-arabinose isomerase-encoding nucleotide sequence of Claim 348, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is no more than 150% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
350. The L-arabinose isomerase-encoding nucleotide sequence of any of Claims 348-349, wherein no replacement codon encoding amino acids 9-483 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 200% of the z score of the wild type codon pair CTGGTG when expressed in the native organism.
351. An L-arabinose isomerase-encoding nucleotide sequence, wherein the encoded sequence has at least a 75% amino acid sequence identity with amino acids 1-493 of wild-type L-arabinose isomerase as set forth in SEQ ID NO: 302 and is adapted for expression in a heterologous host organism, wherein at least 1, 2 or 3 codon pairs present in SEQ ID NO:1 and which encode amino acids 1-8 of SEQ ID NO: 302 have been replaced with different codon pairs encoding identical amino acids or conservative amino acid substitutions thereof, and wherein at least one replacement codon pair is predicted to be equally or more likely to cause a translational pause in the heterologous host organism relative to the respective wild type codon pair when expressed in the heterologous host organism.
352. The L-arabinose isomerase-encoding nucleotide sequence of Claim 351, wherein the translational kinetics value of at least one replacement codon pair when expressed in the heterologous host organism is at least 75% of the translational kinetics value for the wild type codon pair when expressed in the native organism.
353. The L-arabinose isomerase-encoding nucleotide sequence of any of Claims 351-352, wherein at least one replacement codon encoding amino acids 1-8 of SEQ ID NO: 302 has a z score for expression in the heterologous host that is more than 75% of the z score of the wild type codon pair GAAGTG when expressed in the native organism.
354. An isolated polynucleotide comprising the nucleotide sequence of any of Claims 1-353.
355. An isolated polynucleotide comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 267, 271, 273, 275, 277, 279, 281, 283, 285, 287, 291, 295, 297, 299, 303, 305, 307, 309 or 311.
356. An isolated polypeptide encoded by the nucleotide sequence of any of Claims 1-353, provided that the amino acid sequence of said polypeptide is not SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302.
357. An expression system, comprising: an expression vector in a host organism, wherein the expression vector includes the polynucleotide of Claim 354 or Claim 355 operably linked to an expression control sequence.
358. An expression system, comprising: an expression vector in a host organism, wherein the expression vector includes two or more polynucleotides in accordance with Claim 354 or Claim 355, each polynucleotide being operably linked to the same or different expression control sequences.
359. A system for metabolizing xylose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: xylose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
360. A system for metabolizing xylose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: xylose isomerase and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the polynucleotides encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
361. A system for metabolizing arabinose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: L-arabinitol 4-dehydrogenase, L-xylulose reductase, xylitol dehydrogenase, and xylulokinase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
362. A system for metabolizing arabinose, comprising: one or more host organisms that collectively include polynucleotides operably encoding the following enzymes: L-arabinose isomerase, L-ribulokinase, and L-ribulose-5-P 4-epimerase; wherein the enzymes are heterologous to the one or more host organisms, and wherein translational kinetics of each of the DNA sequences encoding the enzymes has been modified to replace at least three codon pairs present in the original sequence for each enzyme, wherein the at least three codon pairs are predicted to cause a translational pause in the host organism, and wherein said modification results in silent permutation or conservative amino acid substitution of said at least three codon pairs.
363. The system of any of Claims 357 or 359, wherein one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 267, 271, 273, 275, 277, 279, 281, 283, 285 or 287.
364. The system of any of Claims 357-359, comprising two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 267, 271, 273, 275, 277, 279, 281, 283, 285 or 287.
365. The system of any of Claims 357 or 360, wherein one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189 or 191.
366. The system of any of Claims 357, 358 or 360, comprising two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189 or 191.
367. The system of any of Claims 357 or 361, wherein one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
368. The system of any of Claims 357, 358 or 361, comprising two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165 or 167.
369. The system of any of Claims 357 or 362, wherein one or more of said polynucleotides comprises the nucleotide sequence of SEQ ID NOs: 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 291, 295, 297, 299, 303, 305, 307, 309 or 311.
370. The system of any of Claims 357, 358 or 362, comprising two or more polynucleotides comprising the nucleotide sequence of SEQ ID NOs: 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 291, 295, 297, 299, 303, 305, 307, 309 or 311.
371. The system of any one of Claims 357-370, wherein said one or more host organisms are selected from the group consisting of: Saccharomyces cerevisiae, Pichia pastoris, Escherichia coli, Bombyx mori, Spodoptera frugiperda, Drosophila melanogaster, Kluyveromyces lactis, Zymomonas mobilis and Schizosaccharomyces pombe.
372. The system of any of Claims 357-371, wherein each encoded enzyme has at least a 75% amino acid sequence identity with the original sequence of said enzyme.
373. The system of any of Claims 357-372, wherein each encoded enzyme retains at least 75% of the enzymatic activity of wild-type polypeptide (SEQ ID NO: 2, 26, 50, 74, 98, 122, 146, 170, 194, 218, 242, 266, 290 or 302) under normal physiological conditions.
374. A cell comprising the polynucleotide of Claim 354 or Claim 355.
375. The cell of Claim 374, wherein said cell expresses the polypeptide encoded by said polynucleotide.
376. A method of introducing a polynucleotide into a host cell comprising: providing a host cell; and contacting said host cell with the polynucleotide of Claim 354 or Claim 355 under conditions that permit the polynucleotide to be introduced into the host cell.
377. A method of expressing a polypeptide comprising: providing a cell comprising the polynucleotide of Claim 354 or Claim 355; and placing the cell under conditions that permit the cell to express the polypeptide encoded by the DNA sequence, whereby said encoded polypeptide is expressed by said cell.
378. A method of metabolizing a sugar comprising: providing a sugar comprising at least one covalent bond; providing a polypeptide encoded by the polynucleotide of Claim 354 or Claim 355; and contacting said sugar with said polypeptide under conditions that permit said polypeptide to break or form at least one covalent bond of said sugar, whereby at least one covalent bond of said sugar is broken or formed.
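The codon-pair replacements recited in the claims above are given as a six-nucleotide sequence, a 1-based nucleotide range, and a replacement sequence. As an illustration only, the following Python sketch shows one way such a replacement could be applied to a coding sequence and checked for being silent (encoding identical amino acids). The function names, the toy sequence, and the use of the standard genetic code are assumptions made for this example and are not taken from the specification; the sketch does not compute translational kinetics values or z scores.

```python
# Illustrative sketch only: helper names and the toy sequence are hypothetical.
# Assumes the standard genetic code (NCBI translation table 1).

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate a DNA sequence codon by codon, ignoring any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def is_silent(old_pair, new_pair):
    """Return True if two codon pairs encode the same two amino acids."""
    return translate(old_pair) == translate(new_pair)


def replace_codon_pair(seq, start, end, old_pair, new_pair):
    """Replace the codon pair at 1-based inclusive nucleotides start..end.

    Raises ValueError if the range is not six in-frame nucleotides or if the
    sequence at that position does not match old_pair.
    """
    if end - start + 1 != 6 or len(old_pair) != 6 or len(new_pair) != 6:
        raise ValueError("a codon pair must span exactly six nucleotides")
    if (start - 1) % 3 != 0:
        raise ValueError("codon pair is not in the reading frame")
    if seq[start - 1:end] != old_pair:
        raise ValueError("sequence at the given position does not match")
    return seq[:start - 1] + new_pair + seq[end:]


# Toy coding sequence (hypothetical): ATG, two codon pairs, stop codon.
toy = "ATGGTCAGAGCCGGATAA"
edited = replace_codon_pair(toy, 4, 9, "GTCAGA", "GTTCGT")
edited = replace_codon_pair(edited, 10, 15, "GCCGGA", "GCTGGT")
```

In this toy case both the original and the edited sequence translate to the same peptide, so the two replacements are silent. Checking for conservative amino acid substitutions, or scoring pairs for a given host organism, would need additional data that this sketch does not include.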
PCT/US2008/006353 2007-05-14 2008-05-14 Xylose- and arabinose- metabolizing enzyme -encoding nucleotide sequences with refined translational kinetics and methods of making same WO2008144012A2 (en)

Applications Claiming Priority (24)

Application Number Priority Date Filing Date Title
US91787807P 2007-05-14 2007-05-14
US60/917,878 2007-05-14
US93813407P 2007-05-15 2007-05-15
US60/938,134 2007-05-15
US93887607P 2007-05-18 2007-05-18
US93890107P 2007-05-18 2007-05-18
US60/938,876 2007-05-18
US60/938,901 2007-05-18
US93920707P 2007-05-21 2007-05-21
US93917907P 2007-05-21 2007-05-21
US60/939,179 2007-05-21
US60/939,207 2007-05-21
US94034807P 2007-05-25 2007-05-25
US60/940,348 2007-05-25
US94151707P 2007-06-01 2007-06-01
US94138207P 2007-06-01 2007-06-01
US94139307P 2007-06-01 2007-06-01
US60/941,517 2007-06-01
US60/941,393 2007-06-01
US60/941,382 2007-06-01
US94192507P 2007-06-04 2007-06-04
US60/941,925 2007-06-04
US94748807P 2007-07-02 2007-07-02
US60/947,488 2007-07-02

Publications (2)

Publication Number Publication Date
WO2008144012A2 true WO2008144012A2 (en) 2008-11-27
WO2008144012A3 WO2008144012A3 (en) 2009-04-30

Family

ID=39941917

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/006353 WO2008144012A2 (en) 2007-05-14 2008-05-14 Xylose- and arabinose- metabolizing enzyme -encoding nucleotide sequences with refined translational kinetics and methods of making same

Country Status (1)

Country Link
WO (1) WO2008144012A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017176875A1 (en) * 2016-04-08 2017-10-12 E I Du Pont De Nemours And Company Arabinose isomerases for yeast
US10724040B2 (en) 2015-07-15 2020-07-28 The Penn State Research Foundation mRNA sequences to control co-translational folding of proteins
WO2021231621A1 (en) * 2020-05-13 2021-11-18 Novozymes A/S Improved microorganisms for arabinose fermentation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0450430A2 (en) * 1990-03-26 1991-10-09 Rhein Biotech Gesellschaft für biotechnologische Prozesse und Produkte mbH DNA sequence comprising a structural gene coding for xylose reductase or xylose reductase and xylitol dehydrogenase
WO2004042043A2 (en) * 2002-11-05 2004-05-21 Affinium Pharmaceuticals, Inc. Crystal structures of bacterial ribulose-phosphate 3-epimerases
WO2004044129A2 (en) * 2002-11-06 2004-05-27 Diversa Corporation Xylose isomerases, nucleic acids encoding them and methods for making and using them
WO2005113774A2 (en) * 2004-05-19 2005-12-01 Biotechnology Research And Development Corporation Methods for production of xylitol in microorganisms
WO2006009434A1 (en) * 2004-07-16 2006-01-26 Technische Universiteit Delft Metabolic engineering of xylose fermenting eukaryotic cells
US20060292566A1 (en) * 2002-11-08 2006-12-28 The University Of Queensland Method for optimising gene expressing using synonymous codon optimisation
WO2007021879A2 (en) * 2005-08-10 2007-02-22 Zuchem, Inc. Production of l-ribose and other rare sugars

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GUSTAFSSON C ET AL: "Codon bias and heterologous protein expression" TRENDS IN BIOTECHNOLOGY, ELSEVIER PUBLICATIONS, CAMBRIDGE, GB, vol. 22, no. 7, 1 July 2004 (2004-07-01), pages 346-353, XP004520507 ISSN: 0167-7799 *
JEPPSSON MARIE ET AL: "The expression of a Pichia stipitis xylose reductase mutant with higher K(M) for NADPH increases ethanol production from xylose in recombinant Saccharomyces cerevisiae." BIOTECHNOLOGY AND BIOENGINEERING 5 MAR 2006, vol. 93, no. 4, 5 March 2006 (2006-03-05), pages 665-673, XP002504734 ISSN: 0006-3592 *
JOHANSSON B ET AL: "XYLULOKINASE OVEREXPRESSION IN TWO STRAINS OF SACCHAROMYCES CEREVISIAE ALSO EXPRESSING XYLOSE REDUCTASE AND XYLITOL DEHYDROGENASE AND ITS EFFECT ON FERMENTATION OF XYLOSE AND LIGNOCELLULOSIC HYDROLYSATE" APPLIED AND ENVIRONMENTAL MICROBIOLOGY, AMERICAN SOCIETY FOR MICROBIOLOGY, US, vol. 67, no. 9, 1 September 2001 (2001-09-01), pages 4249-4255, XP009063768 ISSN: 0099-2240 *

Also Published As

Publication number Publication date
WO2008144012A3 (en) 2009-04-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08754523

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08754523

Country of ref document: EP

Kind code of ref document: A2