NZ717658B2 - Automated screening of enzyme variants
- Publication number
- NZ717658B2
- Authority
- NZ
- New Zealand
- Prior art keywords
- substrate
- variants
- enzyme
- poses
- active
- Prior art date
Links
- 102000004190 Enzymes Human genes 0.000 title claims abstract description 196
- 108090000790 Enzymes Proteins 0.000 title claims abstract description 195
- 239000000758 substrate Substances 0.000 claims abstract description 229
- 238000006555 catalytic reaction Methods 0.000 claims abstract description 15
- 230000027455 binding Effects 0.000 claims description 175
- 238000006243 chemical reaction Methods 0.000 claims description 80
- 238000003032 molecular docking Methods 0.000 claims description 67
- 150000002500 ions Chemical class 0.000 claims description 49
- 230000035772 mutation Effects 0.000 claims description 46
- 230000003993 interaction Effects 0.000 claims description 34
- 230000003197 catalytic Effects 0.000 claims description 33
- 102000004316 Oxidoreductases Human genes 0.000 claims description 30
- 108090000854 Oxidoreductases Proteins 0.000 claims description 30
- 238000000338 in vitro Methods 0.000 claims description 30
- 229920000272 Oligonucleotide Polymers 0.000 claims description 29
- 239000000126 substance Substances 0.000 claims description 25
- 238000002703 mutagenesis Methods 0.000 claims description 22
- 231100000350 mutagenesis Toxicity 0.000 claims description 22
- 230000002194 synthesizing Effects 0.000 claims description 22
- 230000002349 favourable Effects 0.000 claims description 19
- 238000006722 reduction reaction Methods 0.000 claims description 19
- -1 erase Proteins 0.000 claims description 17
- 150000002576 ketones Chemical class 0.000 claims description 17
- 238000006460 hydrolysis reaction Methods 0.000 claims description 13
- 238000000126 in silico method Methods 0.000 claims description 13
- 238000000329 molecular dynamics simulation Methods 0.000 claims description 11
- 241000894007 species Species 0.000 claims description 11
- 230000024881 catalytic activity Effects 0.000 claims description 10
- 150000002466 imines Chemical class 0.000 claims description 10
- 230000003647 oxidation Effects 0.000 claims description 10
- 238000007254 oxidation reaction Methods 0.000 claims description 10
- 102000010909 EC 1.4.3.4 Human genes 0.000 claims description 8
- 108010062431 EC 1.4.3.4 Proteins 0.000 claims description 8
- 239000002253 acid Substances 0.000 claims description 8
- 125000002252 acyl group Chemical group 0.000 claims description 7
- 108010033272 EC 3.5.5.1 Proteins 0.000 claims description 6
- 102000004195 Isomerases Human genes 0.000 claims description 6
- 108090000769 Isomerases Proteins 0.000 claims description 6
- 102000003960 Ligases Human genes 0.000 claims description 6
- 108090000364 Ligases Proteins 0.000 claims description 6
- 239000007806 chemical reaction intermediate Substances 0.000 claims description 6
- 150000003944 halohydrins Chemical class 0.000 claims description 6
- 102000002004 Cytochrome P-450 Enzyme System Human genes 0.000 claims description 5
- 108010015742 Cytochrome P-450 Enzyme System Proteins 0.000 claims description 5
- 238000005695 dehalogenation reaction Methods 0.000 claims description 5
- 238000002922 simulated annealing Methods 0.000 claims description 5
- 238000007614 solvation Methods 0.000 claims description 5
- 238000005891 transamination reaction Methods 0.000 claims description 5
- 102000004317 Lyases Human genes 0.000 claims description 4
- 108090000856 Lyases Proteins 0.000 claims description 4
- 238000005411 Van der Waals force Methods 0.000 claims description 4
- 108010013164 halohydrin dehalogenase Proteins 0.000 claims description 4
- 125000000468 ketone group Chemical group 0.000 claims description 4
- 102000004157 Hydrolases Human genes 0.000 claims description 3
- 108090000604 Hydrolases Proteins 0.000 claims description 3
- 238000006317 isomerization reaction Methods 0.000 claims description 3
- 238000006220 Baeyer-Villiger oxidation reaction Methods 0.000 claims description 2
- NLZUEZXRPGMBCV-UHFFFAOYSA-N Butylhydroxytoluene Chemical compound CC1=CC(C(C)(C)C)=C(O)C(C(C)(C)C)=C1 NLZUEZXRPGMBCV-UHFFFAOYSA-N 0.000 claims 1
- 102000004169 proteins and genes Human genes 0.000 abstract description 184
- 108090000623 proteins and genes Proteins 0.000 abstract description 154
- 230000000694 effects Effects 0.000 abstract description 75
- 239000002831 pharmacologic agent Substances 0.000 abstract description 23
- 238000004590 computer program Methods 0.000 abstract description 9
- 235000018102 proteins Nutrition 0.000 description 179
- 238000000034 method Methods 0.000 description 163
- 239000003446 ligand Substances 0.000 description 104
- 229940088598 Enzyme Drugs 0.000 description 102
- 229940110715 ENZYMES FOR TREATMENT OF WOUNDS AND ULCERS Drugs 0.000 description 66
- 229940020899 hematological Enzymes Drugs 0.000 description 65
- 229920001184 polypeptide Polymers 0.000 description 48
- 150000007523 nucleic acids Chemical class 0.000 description 47
- 229920003013 deoxyribonucleic acid Polymers 0.000 description 46
- 108020004707 nucleic acids Proteins 0.000 description 43
- 235000001014 amino acid Nutrition 0.000 description 37
- 125000004429 atoms Chemical group 0.000 description 37
- 150000001413 amino acids Chemical class 0.000 description 36
- 238000003752 polymerase chain reaction Methods 0.000 description 35
- 239000000047 product Substances 0.000 description 30
- 238000003041 virtual screening Methods 0.000 description 25
- 125000003729 nucleotide group Chemical group 0.000 description 24
- 238000005215 recombination Methods 0.000 description 24
- 239000002773 nucleotide Substances 0.000 description 23
- 210000004027 cells Anatomy 0.000 description 22
- 239000000543 intermediate Substances 0.000 description 22
- 229920000023 polynucleotide Polymers 0.000 description 22
- 239000002157 polynucleotide Substances 0.000 description 22
- VYPSYNLAJGMNEJ-UHFFFAOYSA-N silicium dioxide Chemical compound O=[Si]=O VYPSYNLAJGMNEJ-UHFFFAOYSA-N 0.000 description 22
- 238000004166 bioassay Methods 0.000 description 20
- 230000001131 transforming Effects 0.000 description 20
- 239000000203 mixture Substances 0.000 description 18
- 230000015572 biosynthetic process Effects 0.000 description 17
- 238000003786 synthesis reaction Methods 0.000 description 17
- 125000003275 alpha amino acid group Chemical group 0.000 description 16
- 230000002068 genetic Effects 0.000 description 16
- KWOLFJPFCHCOCG-UHFFFAOYSA-N methylphenylketone Chemical compound CC(=O)C1=CC=CC=C1 KWOLFJPFCHCOCG-UHFFFAOYSA-N 0.000 description 16
- 239000002609 media Substances 0.000 description 14
- 238000004458 analytical method Methods 0.000 description 13
- 230000000875 corresponding Effects 0.000 description 12
- 229920001850 Nucleic acid sequence Polymers 0.000 description 11
- 150000001875 compounds Chemical class 0.000 description 11
- 239000000377 silicon dioxide Substances 0.000 description 11
- XJLXINKUBYWONI-NNYOXOHSSA-N Nicotinamide adenine dinucleotide phosphate Chemical compound NC(=O)C1=CC=C[N+]([C@H]2[C@@H]([C@H](O)[C@@H](COP([O-])(=O)OP(O)(=O)OC[C@@H]3[C@H]([C@@H](OP(O)(O)=O)[C@@H](O3)N3C4=NC=NC(N)=C4N=C3)O)O2)O)=C1 XJLXINKUBYWONI-NNYOXOHSSA-N 0.000 description 10
- 230000002401 inhibitory effect Effects 0.000 description 10
- 230000000670 limiting Effects 0.000 description 10
- 238000007481 next generation sequencing Methods 0.000 description 10
- 239000000523 sample Substances 0.000 description 10
- 230000000295 complement Effects 0.000 description 9
- 239000000243 solution Substances 0.000 description 9
- 229920002287 Amplicon Polymers 0.000 description 8
- 125000000539 amino acid group Chemical group 0.000 description 8
- 239000011324 bead Substances 0.000 description 8
- 238000005516 engineering process Methods 0.000 description 8
- 238000004519 manufacturing process Methods 0.000 description 8
- 229920000160 (ribonucleotides)n+m Polymers 0.000 description 7
- 125000000524 functional group Chemical group 0.000 description 7
- 239000007788 liquid Substances 0.000 description 7
- 238000004886 process control Methods 0.000 description 7
- 108090000765 processed proteins & peptides Proteins 0.000 description 7
- 102000004196 processed proteins & peptides Human genes 0.000 description 7
- 230000002441 reversible Effects 0.000 description 7
- 239000007787 solid Substances 0.000 description 7
- 101700011961 DPOM Proteins 0.000 description 6
- 241000196324 Embryophyta Species 0.000 description 6
- 101710029649 MDV043 Proteins 0.000 description 6
- 101700061424 POLB Proteins 0.000 description 6
- 101700054624 RF1 Proteins 0.000 description 6
- 230000003321 amplification Effects 0.000 description 6
- 238000004422 calculation algorithm Methods 0.000 description 6
- 238000010367 cloning Methods 0.000 description 6
- 238000006731 degradation reaction Methods 0.000 description 6
- 238000011156 evaluation Methods 0.000 description 6
- 238000010348 incorporation Methods 0.000 description 6
- 239000003112 inhibitor Substances 0.000 description 6
- 239000000463 material Substances 0.000 description 6
- 238000005259 measurement Methods 0.000 description 6
- 238000005457 optimization Methods 0.000 description 6
- 238000011160 research Methods 0.000 description 6
- 239000004065 semiconductor Substances 0.000 description 6
- 125000001493 tyrosinyl group Chemical group [H]OC1=C([H])C([H])=C(C([H])=C1[H])C([H])([H])C([H])(N([H])[H])C(*)=O 0.000 description 6
- 102000005922 Amidases Human genes 0.000 description 5
- 108020003076 Amidases Proteins 0.000 description 5
- VILAVOFMIJHSJA-UHFFFAOYSA-N Dicarbon monoxide Chemical compound [C]=C=O VILAVOFMIJHSJA-UHFFFAOYSA-N 0.000 description 5
- 238000007792 addition Methods 0.000 description 5
- CURLTUGMZLYLDI-UHFFFAOYSA-N carbon dioxide Chemical compound O=C=O CURLTUGMZLYLDI-UHFFFAOYSA-N 0.000 description 5
- 125000002915 carbonyl group Chemical group [*:2]C([*:1])=O 0.000 description 5
- 239000007795 chemical reaction product Substances 0.000 description 5
- 238000003776 cleavage reaction Methods 0.000 description 5
- 238000009396 hybridization Methods 0.000 description 5
- 229910052739 hydrogen Inorganic materials 0.000 description 5
- 239000001257 hydrogen Substances 0.000 description 5
- 238000003780 insertion Methods 0.000 description 5
- 230000001404 mediated Effects 0.000 description 5
- 230000015654 memory Effects 0.000 description 5
- 238000003199 nucleic acid amplification method Methods 0.000 description 5
- 125000004430 oxygen atoms Chemical group O* 0.000 description 5
- 230000005610 quantum mechanics Effects 0.000 description 5
- 238000004805 robotic Methods 0.000 description 5
- 230000000707 stereoselective Effects 0.000 description 5
- 238000006467 substitution reaction Methods 0.000 description 5
- WQZGKKKJIJFFOK-GASJEMHNSA-N D-Glucose Natural products OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O WQZGKKKJIJFFOK-GASJEMHNSA-N 0.000 description 4
- 238000001712 DNA sequencing Methods 0.000 description 4
- 101710006860 PIG28 Proteins 0.000 description 4
- 101700008650 PPIB Proteins 0.000 description 4
- 108091005771 Peptidases Proteins 0.000 description 4
- 102000035443 Peptidases Human genes 0.000 description 4
- 108090000340 Transaminases Proteins 0.000 description 4
- 102000003929 Transaminases Human genes 0.000 description 4
- QTBSBXVTEAMEQO-UHFFFAOYSA-N acetic acid Chemical compound CC(O)=O QTBSBXVTEAMEQO-UHFFFAOYSA-N 0.000 description 4
- 229910052782 aluminium Inorganic materials 0.000 description 4
- XAGFODPZIPBFFR-UHFFFAOYSA-N aluminum Chemical compound [Al] XAGFODPZIPBFFR-UHFFFAOYSA-N 0.000 description 4
- 230000003042 antagonistic Effects 0.000 description 4
- 230000000903 blocking Effects 0.000 description 4
- 229910052799 carbon Inorganic materials 0.000 description 4
- 238000004113 cell culture Methods 0.000 description 4
- 239000003153 chemical reaction reagent Substances 0.000 description 4
- 230000001419 dependent Effects 0.000 description 4
- 230000029087 digestion Effects 0.000 description 4
- 238000009826 distribution Methods 0.000 description 4
- LFQSCWFLJHTTHZ-UHFFFAOYSA-N ethanol Chemical compound CCO LFQSCWFLJHTTHZ-UHFFFAOYSA-N 0.000 description 4
- 239000007850 fluorescent dye Substances 0.000 description 4
- 239000008103 glucose Substances 0.000 description 4
- 238000003384 imaging method Methods 0.000 description 4
- 238000004949 mass spectrometry Methods 0.000 description 4
- 125000002496 methyl group Chemical group [H]C([H])([H])* 0.000 description 4
- 230000003278 mimic Effects 0.000 description 4
- 230000036961 partial Effects 0.000 description 4
- 125000001997 phenyl group Chemical group [H]C1=C([H])C([H])=C(*)C([H])=C1[H] 0.000 description 4
- 229920000642 polymer Polymers 0.000 description 4
- 238000006116 polymerization reaction Methods 0.000 description 4
- 230000000717 retained Effects 0.000 description 4
- 238000003530 single readout Methods 0.000 description 4
- 238000002741 site-directed mutagenesis Methods 0.000 description 4
- 239000001226 triphosphate Substances 0.000 description 4
- ZKHQWZAMYRWXGA-KQYNXXCUSA-N Adenosine triphosphate Chemical compound C1=NC=2C(N)=NC=NC=2N1[C@@H]1O[C@H](COP(O)(=O)OP(O)(=O)OP(O)(O)=O)[C@@H](O)[C@H]1O ZKHQWZAMYRWXGA-KQYNXXCUSA-N 0.000 description 3
- 108010031132 Alcohol Oxidoreductases Proteins 0.000 description 3
- 102000005751 Alcohol Oxidoreductases Human genes 0.000 description 3
- 102000018832 Cytochromes Human genes 0.000 description 3
- 108010052832 Cytochromes Proteins 0.000 description 3
- 108010029182 EC 4.2.2.10 Proteins 0.000 description 3
- 241000282619 Hylobates lar Species 0.000 description 3
- OUYCCCASQSFEME-QMMMGPOBSA-N L-tyrosine Chemical compound OC(=O)[C@@H](N)CC1=CC=C(O)C=C1 OUYCCCASQSFEME-QMMMGPOBSA-N 0.000 description 3
- 108010029541 Laccase Proteins 0.000 description 3
- 108020004999 Messenger RNA Proteins 0.000 description 3
- 101700080605 NUC1 Proteins 0.000 description 3
- 101710038849 OB2597_18631 Proteins 0.000 description 3
- 239000004365 Protease Substances 0.000 description 3
- 101710007375 SAV2584 Proteins 0.000 description 3
- 101710038074 SCO0300 Proteins 0.000 description 3
- 101710038075 SCO3172 Proteins 0.000 description 3
- 229940035295 Ting Drugs 0.000 description 3
- 102000004357 Transferases Human genes 0.000 description 3
- 108090000992 Transferases Proteins 0.000 description 3
- 239000000556 agonist Substances 0.000 description 3
- 125000003277 amino group Chemical group 0.000 description 3
- 210000004102 animal cell Anatomy 0.000 description 3
- 238000000137 annealing Methods 0.000 description 3
- 239000005557 antagonist Substances 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 3
- 230000015556 catabolic process Effects 0.000 description 3
- 239000011248 coating agent Substances 0.000 description 3
- 238000000576 coating method Methods 0.000 description 3
- 238000005094 computer simulation Methods 0.000 description 3
- 101700056950 cpnB Proteins 0.000 description 3
- 238000002790 cross-validation Methods 0.000 description 3
- 230000004059 degradation Effects 0.000 description 3
- 239000010432 diamond Substances 0.000 description 3
- 239000003814 drug Substances 0.000 description 3
- 229940079593 drugs Drugs 0.000 description 3
- 238000006911 enzymatic reaction Methods 0.000 description 3
- 101700019977 ethA Proteins 0.000 description 3
- 238000006062 fragmentation reaction Methods 0.000 description 3
- 239000011521 glass Substances 0.000 description 3
- 101710003733 hapE Proteins 0.000 description 3
- 238000004128 high performance liquid chromatography Methods 0.000 description 3
- 238000009114 investigational therapy Methods 0.000 description 3
- 229920002106 messenger RNA Polymers 0.000 description 3
- 150000002825 nitriles Chemical class 0.000 description 3
- 101700006494 nucA Proteins 0.000 description 3
- 230000003287 optical Effects 0.000 description 3
- 229910052760 oxygen Inorganic materials 0.000 description 3
- 239000001301 oxygen Substances 0.000 description 3
- MYMOFIZGZYHOMD-UHFFFAOYSA-N oxygen Chemical compound O=O MYMOFIZGZYHOMD-UHFFFAOYSA-N 0.000 description 3
- 101700058784 pamO Proteins 0.000 description 3
- 239000011148 porous material Substances 0.000 description 3
- 238000000734 protein sequencing Methods 0.000 description 3
- 238000002708 random mutagenesis Methods 0.000 description 3
- 238000009790 rate-determining step (RDS) Methods 0.000 description 3
- 230000002829 reduced Effects 0.000 description 3
- 238000007841 sequencing by ligation Methods 0.000 description 3
- 150000003384 small molecules Chemical class 0.000 description 3
- 239000002904 solvent Substances 0.000 description 3
- 210000001519 tissues Anatomy 0.000 description 3
- 238000004627 transmission electron microscopy Methods 0.000 description 3
- 235000011178 triphosphate Nutrition 0.000 description 3
- 238000010200 validation analysis Methods 0.000 description 3
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 3
- 102000013142 Amylases Human genes 0.000 description 2
- 108010065511 Amylases Proteins 0.000 description 2
- 229940025131 Amylases Drugs 0.000 description 2
- 210000000349 Chromosomes Anatomy 0.000 description 2
- 108050008938 Glucoamylase Proteins 0.000 description 2
- 102000004867 Hydro-Lyases Human genes 0.000 description 2
- 108090001042 Hydro-Lyases Proteins 0.000 description 2
- 102000005385 Intramolecular Transferases Human genes 0.000 description 2
- 108010031311 Intramolecular Transferases Proteins 0.000 description 2
- 150000008575 L-amino acids Chemical class 0.000 description 2
- 239000004367 Lipase Substances 0.000 description 2
- 102100003028 MANBA Human genes 0.000 description 2
- 102100008175 MGAM Human genes 0.000 description 2
- 229920002521 Macromolecule Polymers 0.000 description 2
- 108091005503 Nucleic proteins Proteins 0.000 description 2
- QKFJKGMPGYROCL-UHFFFAOYSA-N Phenyl isothiocyanate Chemical compound S=C=NC1=CC=CC=C1 QKFJKGMPGYROCL-UHFFFAOYSA-N 0.000 description 2
- XPPKVPWEQAFLFU-UHFFFAOYSA-J Pyrophosphate Chemical compound [O-]P([O-])(=O)OP([O-])([O-])=O XPPKVPWEQAFLFU-UHFFFAOYSA-J 0.000 description 2
- 108060006943 RdRp Proteins 0.000 description 2
- PYMYPHUHKUWMLA-LMVFSUKVSA-N Ribose Natural products OC[C@@H](O)[C@@H](O)[C@@H](O)C=O PYMYPHUHKUWMLA-LMVFSUKVSA-N 0.000 description 2
- 150000001336 alkenes Chemical class 0.000 description 2
- 150000001408 amides Chemical class 0.000 description 2
- 235000019418 amylase Nutrition 0.000 description 2
- 102000004965 antibodies Human genes 0.000 description 2
- 108090001123 antibodies Proteins 0.000 description 2
- 150000001479 arabinose derivatives Chemical class 0.000 description 2
- 102000005936 beta-Galactosidase Human genes 0.000 description 2
- 108010005774 beta-Galactosidase Proteins 0.000 description 2
- 108010055059 beta-Mannosidase Proteins 0.000 description 2
- OKTJSMMVPCPJKN-UHFFFAOYSA-N carbon Chemical compound [C] OKTJSMMVPCPJKN-UHFFFAOYSA-N 0.000 description 2
- 125000004432 carbon atoms Chemical group C* 0.000 description 2
- 229910002092 carbon dioxide Inorganic materials 0.000 description 2
- 108091006028 chimera Proteins 0.000 description 2
- 239000005515 coenzyme Substances 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 238000010192 crystallographic characterization Methods 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 229910003460 diamond Inorganic materials 0.000 description 2
- 238000010790 dilution Methods 0.000 description 2
- 235000011180 diphosphates Nutrition 0.000 description 2
- 238000001962 electrophoresis Methods 0.000 description 2
- 238000005755 formation reaction Methods 0.000 description 2
- RRHGJUQNOFWUDK-UHFFFAOYSA-N isoprene Chemical compound CC(=C)C=C RRHGJUQNOFWUDK-UHFFFAOYSA-N 0.000 description 2
- 108090001060 lipase Proteins 0.000 description 2
- 102000004882 lipase Human genes 0.000 description 2
- 235000019421 lipase Nutrition 0.000 description 2
- 150000002632 lipids Chemical class 0.000 description 2
- 235000019689 luncheon sausage Nutrition 0.000 description 2
- 238000001819 mass spectrum Methods 0.000 description 2
- 239000012528 membrane Substances 0.000 description 2
- 229910052751 metal Inorganic materials 0.000 description 2
- 239000002184 metal Substances 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000006011 modification reaction Methods 0.000 description 2
- 238000010369 molecular cloning Methods 0.000 description 2
- 239000002858 neurotransmitter agent Substances 0.000 description 2
- PXHVJJICTQNCMI-UHFFFAOYSA-N nickel Chemical compound [Ni] PXHVJJICTQNCMI-UHFFFAOYSA-N 0.000 description 2
- 230000037361 pathway Effects 0.000 description 2
- 108010044725 pectate disaccharide-lyase Proteins 0.000 description 2
- 108010087558 pectate lyase Proteins 0.000 description 2
- 239000012071 phase Substances 0.000 description 2
- 229940117953 phenylisothiocyanate Drugs 0.000 description 2
- 239000010318 polygalacturonic acid Substances 0.000 description 2
- 230000003334 potential Effects 0.000 description 2
- 238000002360 preparation method Methods 0.000 description 2
- 230000004952 protein activity Effects 0.000 description 2
- 230000001850 reproductive Effects 0.000 description 2
- 238000005096 rolling process Methods 0.000 description 2
- 238000007480 sanger sequencing Methods 0.000 description 2
- 238000002804 saturated mutagenesis Methods 0.000 description 2
- 238000009738 saturating Methods 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 239000007790 solid phase Substances 0.000 description 2
- 230000003595 spectral Effects 0.000 description 2
- 210000004215 spores Anatomy 0.000 description 2
- 238000003860 storage Methods 0.000 description 2
- 230000035897 transcription Effects 0.000 description 2
- UNXRWKVEANCORM-UHFFFAOYSA-I triphosphate(5-) Chemical compound [O-]P([O-])(=O)OP([O-])(=O)OP([O-])([O-])=O UNXRWKVEANCORM-UHFFFAOYSA-I 0.000 description 2
- 238000011144 upstream manufacturing Methods 0.000 description 2
- 230000003612 virological Effects 0.000 description 2
- 238000002424 x-ray crystallography Methods 0.000 description 2
- AEMOLEFTQBMNLQ-BKBMJHBISA-N α-D-galacturonic acid Chemical compound O[C@H]1O[C@H](C(O)=O)[C@H](O)[C@H](O)[C@H]1O AEMOLEFTQBMNLQ-BKBMJHBISA-N 0.000 description 2
- GRWFGVWFFZKLTI-UHFFFAOYSA-N (+-)-2-pinene Chemical compound CC1=CCC2C(C)(C)C1C2 GRWFGVWFFZKLTI-UHFFFAOYSA-N 0.000 description 1
- PQMRRAQXKWFYQN-UHFFFAOYSA-N 1-phenyl-2-sulfanylideneimidazolidin-4-one Chemical class S=C1NC(=O)CN1C1=CC=CC=C1 PQMRRAQXKWFYQN-UHFFFAOYSA-N 0.000 description 1
- ZYWGAHRBGCEFAO-UHFFFAOYSA-N 2,2-dimethyl-1-(4-methylphenyl)propan-1-one Chemical compound CC1=CC=C(C(=O)C(C)(C)C)C=C1 ZYWGAHRBGCEFAO-UHFFFAOYSA-N 0.000 description 1
- STYQHICBPYRHQK-UHFFFAOYSA-N 4-bromobenzenesulfonamide Chemical compound NS(=O)(=O)C1=CC=C(Br)C=C1 STYQHICBPYRHQK-UHFFFAOYSA-N 0.000 description 1
- IRLPACMLTUPBCL-KQYNXXCUSA-N 5'-adenylyl sulfate Chemical compound C1=NC=2C(N)=NC=NC=2N1[C@@H]1O[C@H](COP(O)(=O)OS(O)(=O)=O)[C@@H](O)[C@H]1O IRLPACMLTUPBCL-KQYNXXCUSA-N 0.000 description 1
- 108010011619 6-Phytase Proteins 0.000 description 1
- 102100011382 ABHD2 Human genes 0.000 description 1
- 102100001770 AMY2B Human genes 0.000 description 1
- 102000034451 ATPases Human genes 0.000 description 1
- 108091006096 ATPases Proteins 0.000 description 1
- 108010013043 Acetylesterase Proteins 0.000 description 1
- 108091022082 Acyl transferases Proteins 0.000 description 1
- 102000019632 Acyl transferases Human genes 0.000 description 1
- 102000003677 Aldehyde-Lyases Human genes 0.000 description 1
- 108090000072 Aldehyde-Lyases Proteins 0.000 description 1
- 241001156002 Anthonomus pomorum Species 0.000 description 1
- 229910014033 C-OH Inorganic materials 0.000 description 1
- 240000002804 Calluna vulgaris Species 0.000 description 1
- 235000007575 Calluna vulgaris Nutrition 0.000 description 1
- 108090000209 Carbonic Anhydrases Proteins 0.000 description 1
- 102000003846 Carbonic Anhydrases Human genes 0.000 description 1
- 102000004031 Carboxy-Lyases Human genes 0.000 description 1
- 108090000489 Carboxy-Lyases Proteins 0.000 description 1
- 210000002421 Cell Wall Anatomy 0.000 description 1
- 108010084185 Cellulases Proteins 0.000 description 1
- 102000005575 Cellulases Human genes 0.000 description 1
- 108010022172 Chitinases Proteins 0.000 description 1
- 102000012286 Chitinases Human genes 0.000 description 1
- 229920001405 Coding region Polymers 0.000 description 1
- 108020004705 Codon Proteins 0.000 description 1
- 229920000453 Consensus sequence Polymers 0.000 description 1
- ATDGTVJJHBUTRL-UHFFFAOYSA-N Cyanogen bromide Chemical compound BrC#N ATDGTVJJHBUTRL-UHFFFAOYSA-N 0.000 description 1
- 102000003849 Cytochrome P450 Human genes 0.000 description 1
- 108050008488 Cytochrome P450 Proteins 0.000 description 1
- 229910014570 C—OH Inorganic materials 0.000 description 1
- GZCGUPFRVQAUEE-KCDKBNATSA-N D-(+)-Galactose Natural products OC[C@@H](O)[C@H](O)[C@H](O)[C@@H](O)C=O GZCGUPFRVQAUEE-KCDKBNATSA-N 0.000 description 1
- 150000008574 D-amino acids Chemical class 0.000 description 1
- ZAQJHHRNXZUBTE-WUJLRWPWSA-N D-xylulose Chemical compound OC[C@@H](O)[C@H](O)C(=O)CO ZAQJHHRNXZUBTE-WUJLRWPWSA-N 0.000 description 1
- 102000004163 DNA-directed RNA polymerases Human genes 0.000 description 1
- 108090000626 DNA-directed RNA polymerases Proteins 0.000 description 1
- 102000007698 EC 1.1.1.1 Human genes 0.000 description 1
- 108010021809 EC 1.1.1.1 Proteins 0.000 description 1
- 102000016912 EC 1.1.1.21 Human genes 0.000 description 1
- 108010053754 EC 1.1.1.21 Proteins 0.000 description 1
- 108010015776 EC 1.1.3.4 Proteins 0.000 description 1
- 108010018734 EC 1.1.3.5 Proteins 0.000 description 1
- 108010015133 EC 1.1.3.9 Proteins 0.000 description 1
- 108010054320 EC 1.11.1.14 Proteins 0.000 description 1
- 108010053835 EC 1.11.1.6 Proteins 0.000 description 1
- 102000016938 EC 1.11.1.6 Human genes 0.000 description 1
- 108030004480 EC 1.2.1.80 Proteins 0.000 description 1
- 108020004530 EC 2.2.1.2 Proteins 0.000 description 1
- 108030002489 EC 3.2.-.- Proteins 0.000 description 1
- 108030006203 EC 4.2.2.23 Proteins 0.000 description 1
- 101700008821 EXO Proteins 0.000 description 1
- 101700083023 EXRN Proteins 0.000 description 1
- 108010001817 Endo-1,4-beta Xylanases Proteins 0.000 description 1
- 229940066758 Endopeptidases Drugs 0.000 description 1
- 108010059378 Endopeptidases Proteins 0.000 description 1
- 102000005593 Endopeptidases Human genes 0.000 description 1
- 229940109526 Ery Drugs 0.000 description 1
- BJHIKXHVCXFQLS-UYFOZJQFSA-N Fructose Natural products OC[C@@H](O)[C@@H](O)[C@H](O)C(=O)CO BJHIKXHVCXFQLS-UYFOZJQFSA-N 0.000 description 1
- 239000005715 Fructose Substances 0.000 description 1
- 108091005957 GFP derivatives Proteins 0.000 description 1
- 101700012085 GRE3 Proteins 0.000 description 1
- 108010093031 Galactosidases Proteins 0.000 description 1
- 102000002464 Galactosidases Human genes 0.000 description 1
- 108010056771 Glucosidases Proteins 0.000 description 1
- 102000004366 Glucosidases Human genes 0.000 description 1
- 108020000311 Glutamate Synthase Proteins 0.000 description 1
- 102000019483 Glycosyltransferases Human genes 0.000 description 1
- 108091022077 Glycosyltransferases Proteins 0.000 description 1
- 241001191009 Gymnomyza Species 0.000 description 1
- 229920000209 Hexadimethrine bromide Polymers 0.000 description 1
- 229940088597 Hormone Drugs 0.000 description 1
- OUUQCZGPVNCOIJ-UHFFFAOYSA-N Hydroperoxyl Chemical compound O[O] OUUQCZGPVNCOIJ-UHFFFAOYSA-N 0.000 description 1
- JDNTWHVOXJZDSN-UHFFFAOYSA-N Iodoacetic acid Chemical compound OC(=O)CI JDNTWHVOXJZDSN-UHFFFAOYSA-N 0.000 description 1
- 241000694408 Isomeris Species 0.000 description 1
- 241001527806 Iti Species 0.000 description 1
- XUJNEKJLAYXESH-REOHCLBHSA-N L-cysteine Chemical compound SC[C@H](N)C(O)=O XUJNEKJLAYXESH-REOHCLBHSA-N 0.000 description 1
- 108010080864 Lactate Dehydrogenases Proteins 0.000 description 1
- 102000000428 Lactate Dehydrogenases Human genes 0.000 description 1
- 102000004856 Lectins Human genes 0.000 description 1
- 108090001090 Lectins Proteins 0.000 description 1
- 238000004510 Lennard-Jones potential Methods 0.000 description 1
- 108090000128 Lipoxygenases Proteins 0.000 description 1
- 102000003820 Lipoxygenases Human genes 0.000 description 1
- 239000005089 Luciferase Substances 0.000 description 1
- 108060001084 Luciferase family Proteins 0.000 description 1
- 240000001313 Lycium barbarum Species 0.000 description 1
- 108010074633 Mixed Function Oxygenases Proteins 0.000 description 1
- 102000008109 Mixed Function Oxygenases Human genes 0.000 description 1
- HNQBPUIXFDQDRJ-UHFFFAOYSA-N N-ethyl-N-(2-hydroxyethyl)nitrous amide Chemical compound CCN(N=O)CCO HNQBPUIXFDQDRJ-UHFFFAOYSA-N 0.000 description 1
- 125000001429 N-terminal alpha-amino-acid group Chemical group 0.000 description 1
- 102100015085 NCOR2 Human genes 0.000 description 1
- 101700070835 NCOR2 Proteins 0.000 description 1
- 238000005481 NMR spectroscopy Methods 0.000 description 1
- 108090000913 Nitrate Reductases Proteins 0.000 description 1
- 108010020526 Nova antigen Proteins 0.000 description 1
- 108020005203 Oxidases Proteins 0.000 description 1
- 101710035539 PG14 Proteins 0.000 description 1
- 101710040957 PGN1 Proteins 0.000 description 1
- 108090000284 Pepsin A Proteins 0.000 description 1
- 108090000437 Peroxidases Proteins 0.000 description 1
- 102000003992 Peroxidases Human genes 0.000 description 1
- 108010064785 Phospholipases Proteins 0.000 description 1
- 102000015439 Phospholipases Human genes 0.000 description 1
- 108091000081 Phosphotransferases Proteins 0.000 description 1
- 102000030951 Phosphotransferases Human genes 0.000 description 1
- 102000003935 Phosphotransferases (Phosphomutases) Human genes 0.000 description 1
- 108090000337 Phosphotransferases (Phosphomutases) Proteins 0.000 description 1
- PJGSXYOJTGTZAV-UHFFFAOYSA-N Pinacolone Chemical compound CC(=O)C(C)(C)C PJGSXYOJTGTZAV-UHFFFAOYSA-N 0.000 description 1
- 108010059820 Polygalacturonase Proteins 0.000 description 1
- 102000014961 Protein Precursors Human genes 0.000 description 1
- 108010078762 Protein Precursors Proteins 0.000 description 1
- AUNGANRZJHBGPY-SCRDCRAPSA-N Riboflavin Chemical compound OC[C@@H](O)[C@@H](O)[C@@H](O)CN1C=2C=C(C)C(C)=CC=2N=C2C1=NC(=O)NC2=O AUNGANRZJHBGPY-SCRDCRAPSA-N 0.000 description 1
- 108020004682 Single-Stranded DNA Proteins 0.000 description 1
- 102100008098 TALDO1 Human genes 0.000 description 1
- 108010043652 Transketolase Proteins 0.000 description 1
- 102000014701 Transketolase Human genes 0.000 description 1
- GETQZCLCWQTVFV-UHFFFAOYSA-N Trimethylamine Chemical compound CN(C)C GETQZCLCWQTVFV-UHFFFAOYSA-N 0.000 description 1
- 102000004142 Trypsin Human genes 0.000 description 1
- 108090000631 Trypsin Proteins 0.000 description 1
- 241000700605 Viruses Species 0.000 description 1
- 101700006119 XYL1 Proteins 0.000 description 1
- 101700047052 XYLA Proteins 0.000 description 1
- 101700051122 XYLD Proteins 0.000 description 1
- 101700065756 XYN4 Proteins 0.000 description 1
- 101700001256 Xyn Proteins 0.000 description 1
- 108010093941 acetylxylan esterase Proteins 0.000 description 1
- 230000002378 acidificating Effects 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 150000003838 adenosines Chemical class 0.000 description 1
- 238000005273 aeration Methods 0.000 description 1
- 150000001299 aldehydes Chemical class 0.000 description 1
- 102000005840 alpha-Galactosidase Human genes 0.000 description 1
- 108010030291 alpha-Galactosidase Proteins 0.000 description 1
- 108010084650 alpha-N-arabinofuranosidase Proteins 0.000 description 1
- 108010061261 alpha-glucuronidase Proteins 0.000 description 1
- 108020004134 amidinotransferase family Proteins 0.000 description 1
- QGZKDVFQNNGYKY-UHFFFAOYSA-N ammonia Chemical compound N QGZKDVFQNNGYKY-UHFFFAOYSA-N 0.000 description 1
- 210000003484 anatomy Anatomy 0.000 description 1
- 125000003118 aryl group Chemical group 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 229920001222 biopolymer Polymers 0.000 description 1
- 239000007853 buffer solution Substances 0.000 description 1
- 201000011510 cancer Diseases 0.000 description 1
- 150000001720 carbohydrates Chemical class 0.000 description 1
- 235000014633 carbohydrates Nutrition 0.000 description 1
- 239000001569 carbon dioxide Substances 0.000 description 1
- 239000011203 carbon fibre reinforced carbon Substances 0.000 description 1
- 150000001735 carboxylic acids Chemical class 0.000 description 1
- 239000000969 carrier Substances 0.000 description 1
- 125000002091 cationic group Chemical group 0.000 description 1
- 150000001768 cations Chemical class 0.000 description 1
- 238000010370 cell cloning Methods 0.000 description 1
- 239000006143 cell culture media Substances 0.000 description 1
- 230000001413 cellular Effects 0.000 description 1
- 210000003850 cellular structures Anatomy 0.000 description 1
- 238000002144 chemical decomposition reaction Methods 0.000 description 1
- 229910052729 chemical element Inorganic materials 0.000 description 1
- 239000003795 chemical substances by application Substances 0.000 description 1
- 238000004587 chromatography analysis Methods 0.000 description 1
- 101710027542 codAch2 Proteins 0.000 description 1
- 230000023298 conjugation with cellular fusion Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 235000018417 cysteine Nutrition 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 230000003247 decreasing Effects 0.000 description 1
- 239000005547 deoxyribonucleotide Substances 0.000 description 1
- 125000002637 deoxyribonucleotide group Chemical group 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 239000003989 dielectric material Substances 0.000 description 1
- 238000009510 drug design Methods 0.000 description 1
- 238000007876 drug discovery Methods 0.000 description 1
- 230000036267 drug metabolism Effects 0.000 description 1
- 239000002359 drug metabolite Substances 0.000 description 1
- 239000000975 dye Substances 0.000 description 1
- 238000000132 electrospray ionisation Methods 0.000 description 1
- 239000000839 emulsion Substances 0.000 description 1
- 150000002118 epoxides Chemical class 0.000 description 1
- 150000002148 esters Chemical class 0.000 description 1
- 230000005284 excitation Effects 0.000 description 1
- 108010092086 exo-poly-alpha-galacturonosidase Proteins 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 108060000072 faeA Proteins 0.000 description 1
- 238000002397 field ionisation mass spectrometry Methods 0.000 description 1
- 239000010408 film Substances 0.000 description 1
- 238000001917 fluorescence detection Methods 0.000 description 1
- 238000000799 fluorescence microscopy Methods 0.000 description 1
- 238000002866 fluorescence resonance energy transfer Methods 0.000 description 1
- 238000001506 fluorescence spectroscopy Methods 0.000 description 1
- 238000007672 fourth generation sequencing Methods 0.000 description 1
- 230000005714 functional activity Effects 0.000 description 1
- 101700052622 gcdA Proteins 0.000 description 1
- 239000003365 glass fiber Substances 0.000 description 1
- 235000019420 glucose oxidase Nutrition 0.000 description 1
- 150000004676 glycans Polymers 0.000 description 1
- 230000003899 glycosylation Effects 0.000 description 1
- 238000006206 glycosylation reaction Methods 0.000 description 1
- PCHJSUWPFVWCPO-UHFFFAOYSA-N gold Chemical compound [Au] PCHJSUWPFVWCPO-UHFFFAOYSA-N 0.000 description 1
- 239000010931 gold Substances 0.000 description 1
- 229910052737 gold Inorganic materials 0.000 description 1
- 239000001963 growth media Substances 0.000 description 1
- 150000003278 haem Chemical class 0.000 description 1
- 230000036541 health Effects 0.000 description 1
- 108010002430 hemicellulase Proteins 0.000 description 1
- 229920001519 homopolymer Polymers 0.000 description 1
- 239000005556 hormone Substances 0.000 description 1
- UFHFLCQGNIYNRP-UHFFFAOYSA-N hydrogen Chemical compound [H][H] UFHFLCQGNIYNRP-UHFFFAOYSA-N 0.000 description 1
- 125000004435 hydrogen atoms Chemical group [H]* 0.000 description 1
- 230000003301 hydrolyzing Effects 0.000 description 1
- GPRLSGONYQIRFK-UHFFFAOYSA-N hydron Chemical compound [H+] GPRLSGONYQIRFK-UHFFFAOYSA-N 0.000 description 1
- 230000002209 hydrophobic Effects 0.000 description 1
- 125000002887 hydroxy group Chemical group [H]O* 0.000 description 1
- XLYOFNOQVPJJNP-UHFFFAOYSA-M hydroxyl anion Chemical compound [OH-] XLYOFNOQVPJJNP-UHFFFAOYSA-M 0.000 description 1
- 238000011065 in-situ storage Methods 0.000 description 1
- 230000002779 inactivation Effects 0.000 description 1
- 239000002054 inoculum Substances 0.000 description 1
- 230000002452 interceptive Effects 0.000 description 1
- 230000002427 irreversible Effects 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 238000005304 joining Methods 0.000 description 1
- 150000002605 large molecules Chemical class 0.000 description 1
- 239000002523 lectin Substances 0.000 description 1
- 239000006193 liquid solution Substances 0.000 description 1
- 238000011068 load Methods 0.000 description 1
- 238000006977 lyase reaction Methods 0.000 description 1
- PWHULOQIROXLJO-UHFFFAOYSA-N manganese Chemical compound [Mn] PWHULOQIROXLJO-UHFFFAOYSA-N 0.000 description 1
- 229910052748 manganese Inorganic materials 0.000 description 1
- 239000011572 manganese Substances 0.000 description 1
- 239000003550 marker Substances 0.000 description 1
- 230000013011 mating Effects 0.000 description 1
- 230000005055 memory storage Effects 0.000 description 1
- 230000002503 metabolic Effects 0.000 description 1
- 108060004795 methyltransferase family Proteins 0.000 description 1
- 230000000813 microbial Effects 0.000 description 1
- 230000002906 microbiologic Effects 0.000 description 1
- 238000000386 microscopy Methods 0.000 description 1
- 230000000051 modifying Effects 0.000 description 1
- 238000000302 molecular modelling Methods 0.000 description 1
- 238000005817 monooxygenase reaction Methods 0.000 description 1
- 229910052759 nickel Inorganic materials 0.000 description 1
- 238000010641 nitrile hydrolysis reaction Methods 0.000 description 1
- 229910052757 nitrogen Inorganic materials 0.000 description 1
- 125000004433 nitrogen atoms Chemical group N* 0.000 description 1
- 230000000269 nucleophilic Effects 0.000 description 1
- 239000003921 oil Substances 0.000 description 1
- 210000000056 organs Anatomy 0.000 description 1
- 101700074470 pac Proteins 0.000 description 1
- 239000002245 particle Substances 0.000 description 1
- 102000004251 pectinacetylesterase Human genes 0.000 description 1
- 108010072638 pectinacetylesterase Proteins 0.000 description 1
- 108020004410 pectinesterase Proteins 0.000 description 1
- 229940111202 pepsin Drugs 0.000 description 1
- 230000002093 peripheral Effects 0.000 description 1
- 230000000144 pharmacologic effect Effects 0.000 description 1
- 150000002989 phenols Chemical class 0.000 description 1
- 238000006366 phosphorylation reaction Methods 0.000 description 1
- 230000000865 phosphorylative Effects 0.000 description 1
- 238000000596 photon cross correlation spectroscopy Methods 0.000 description 1
- 230000000704 physical effect Effects 0.000 description 1
- 238000000053 physical method Methods 0.000 description 1
- 101710035540 plaa2 Proteins 0.000 description 1
- 229920000844 poly(butylene succinate-co-adipate) Polymers 0.000 description 1
- 238000002264 polyacrylamide gel electrophoresis Methods 0.000 description 1
- 229920001690 polydopamine Polymers 0.000 description 1
- 229920001282 polysaccharide Polymers 0.000 description 1
- 239000005017 polysaccharide Substances 0.000 description 1
- 150000004804 polysaccharides Polymers 0.000 description 1
- 235000020004 porter Nutrition 0.000 description 1
- 230000004481 post-translational protein modification Effects 0.000 description 1
- 238000005381 potential energy Methods 0.000 description 1
- 230000003389 potentiating Effects 0.000 description 1
- OZAIFHULBGXAKX-UHFFFAOYSA-N precursor Substances N#CC(C)(C)N=NC(C)(C)C#N OZAIFHULBGXAKX-UHFFFAOYSA-N 0.000 description 1
- 101700015794 pro Proteins 0.000 description 1
- 235000019833 protease Nutrition 0.000 description 1
- 125000006239 protecting group Chemical group 0.000 description 1
- 230000036678 protein binding Effects 0.000 description 1
- 230000012846 protein folding Effects 0.000 description 1
- 108010048769 pullulanase Proteins 0.000 description 1
- 239000000376 reactant Substances 0.000 description 1
- 239000012429 reaction media Substances 0.000 description 1
- 239000011541 reaction mixture Substances 0.000 description 1
- 238000010188 recombinant method Methods 0.000 description 1
- 238000006479 redox reaction Methods 0.000 description 1
- 238000007670 refining Methods 0.000 description 1
- 230000001105 regulatory Effects 0.000 description 1
- 230000027756 respiratory electron transport chain Effects 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 108091007521 restriction endonucleases Proteins 0.000 description 1
- 125000000548 ribosyl group Chemical group C1([C@H](O)[C@H](O)[C@H](O1)CO)* 0.000 description 1
- 238000010187 selection method Methods 0.000 description 1
- 238000002864 sequence alignment Methods 0.000 description 1
- XUIMIQQOPSSXEZ-UHFFFAOYSA-N silicon Chemical compound [Si] XUIMIQQOPSSXEZ-UHFFFAOYSA-N 0.000 description 1
- 229910052710 silicon Inorganic materials 0.000 description 1
- 239000010703 silicon Substances 0.000 description 1
- 238000004513 sizing Methods 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- 238000007250 stereoselective catalysis Methods 0.000 description 1
- 230000003637 steroidlike Effects 0.000 description 1
- 229940086735 succinate Drugs 0.000 description 1
- KDYFGRWQOYBRFD-UHFFFAOYSA-L succinate(2-) Chemical compound [O-]C(=O)CCC([O-])=O KDYFGRWQOYBRFD-UHFFFAOYSA-L 0.000 description 1
- 229910052717 sulfur Inorganic materials 0.000 description 1
- 229920002258 tannic acid Polymers 0.000 description 1
- 235000015523 tannic acid Nutrition 0.000 description 1
- 125000000999 tert-butyl group Chemical group [H]C([H])([H])C(*)(C([H])([H])[H])C([H])([H])[H] 0.000 description 1
- 230000001225 therapeutic Effects 0.000 description 1
- 239000010409 thin film Substances 0.000 description 1
- 238000007671 third-generation sequencing Methods 0.000 description 1
- 230000001550 time effect Effects 0.000 description 1
- 230000002588 toxic Effects 0.000 description 1
- 231100000331 toxic Toxicity 0.000 description 1
- 230000001052 transient Effects 0.000 description 1
- 229960001322 trypsin Drugs 0.000 description 1
- 239000012588 trypsin Substances 0.000 description 1
- 230000021037 unidirectional conjugation Effects 0.000 description 1
- 230000035899 viability Effects 0.000 description 1
- 230000002034 xenobiotic Effects 0.000 description 1
- 239000002676 xenobiotic agent Substances 0.000 description 1
- 101700065693 xlnA Proteins 0.000 description 1
- 101700006979 xyl2 Proteins 0.000 description 1
- 101710017636 xynS20E Proteins 0.000 description 1
- WQZGKKKJIJFFOK-PHYPRBDBSA-N α-D-galactose Chemical compound OC[C@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@H]1O WQZGKKKJIJFFOK-PHYPRBDBSA-N 0.000 description 1
- WQZGKKKJIJFFOK-VFUOTHLCSA-N β-D-glucose Chemical compound OC[C@H]1O[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1O WQZGKKKJIJFFOK-VFUOTHLCSA-N 0.000 description 1
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1058—Directional evolution of libraries, e.g. evolution of libraries is achieved by mutagenesis and screening or selection of mixed population of organisms
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1089—Design, preparation, screening or analysis of libraries using computer algorithms
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
- G16B35/20—Screening of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/60—In silico combinatorial chemistry
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/60—In silico combinatorial chemistry
- G16C20/64—Screening of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C99/00—Subject matter not provided for in other groups of this subclass
Abstract
Disclosed are methods for identifying bio-molecules with desired properties (or which are most suitable for a round of directed evolution) from complex bio-molecule libraries or sets of such libraries. Some embodiments of the present disclosure provide methods for virtually screening proteins for beneficial properties. Some embodiments of the present disclosure provide methods for virtually screening enzymes for desired activity and/or selectivity for catalytic reactions involving particular substrates. Some embodiments combine screening and directed evolution to design and develop proteins and enzymes having desired properties. Systems and computer program products implementing the methods are also provided.
Description
AUTOMATED SCREENING OF ENZYME VARIANTS
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims benefit under 35 U.S.C. § 119(e) to U.S. Provisional
Patent Application No. 61/883,838, entitled: AUTOMATED SCREENING OF
ENZYME VARIANTS, filed September 27, 2013, which is herein incorporated by
reference in its entirety for all purposes.
BACKGROUND
Protein design has long been known to be a difficult task if for no other reason
than the combinatorial explosion of possible molecules that constitute searchable
sequence space. The sequence space of proteins is immense and is impossible to
explore exhaustively using methods currently known in the art, which are often
limited by the time and cost required to identify useful polypeptides. Part of the
problem arises from the great number of polypeptide variants that must be sequenced,
screened and assayed. Directed evolution methods increase the efficiency in honing
in on the candidate biomolecules having advantageous properties. Today, directed
evolution of proteins is dominated by various high throughput screening and
recombination formats, often performed iteratively.
Various computational techniques have also been proposed for exploring
sequence-activity space. Relatively speaking, these techniques are in their infancy
and significant advances are still needed. Accordingly, new ways for improving
the efficiency of screening, sequencing, and assaying candidate biomolecules are
highly desirable.
SUMMARY
The present disclosure relates to the fields of molecular biology, molecular
evolution, bioinformatics, and digital systems. Systems, including digital systems,
and system software for performing these methods are also provided. Methods of the
present disclosure have utility in the optimization of proteins for industrial and
therapeutic use. The methods and systems are especially useful for designing and
developing enzymes having desired activity and selectivity for catalytic reactions of
particular substrates.
Certain aspects of the present disclosure relate to methods for virtually
screening proteins having beneficial properties and/or guiding directed evolution
programs. The disclosure presents methods for identifying bio-molecules with
desired properties (or which are most suitable for directed evolution toward such
properties) from complex bio-molecule libraries or sets of such libraries. Some
embodiments of the present disclosure provide methods for virtually screening
enzymes for desired activity and selectivity for catalytic reactions on particular
substrates. Some embodiments combine screening and directed evolution to design
and develop proteins and enzymes having desired properties. Systems and computer
program products implementing the methods are also provided.
Some embodiments of the disclosure provide methods for screening a
plurality of different enzyme variants for activity with a substrate. In some
embodiments, the method is implemented using a computer system that includes one
or more processors and system memory. The method includes: (a) for each enzyme
variant, docking, by the computer system, a computational representation of the
substrate to a computational representation of an active site of the enzyme variant,
wherein docking (i) generates a plurality of poses of the substrate in the active site,
and (ii) identifies energetically favorable poses of the substrate in the active site; (b)
for each energetically favorable pose, determining whether the pose is active, wherein
an active pose meets one or more constraints for the substrate to undergo catalysis in
the active site; and (c) selecting at least one of the enzyme variants determined to
have one or more active poses.
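Steps (a) through (c) can be illustrated with a minimal Python sketch. It assumes the poses have already been produced by a docking program; the `Pose` fields, the variant names, the energy cutoff, and the distance cutoff are all hypothetical values chosen for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Pose:
    energy: float             # docking energy; lower is assumed more favorable
    reactive_distance: float  # hypothetical distance (angstroms) between reacting atoms


def screen_variants(
    poses_by_variant: Dict[str, List[Pose]],
    is_favorable: Callable[[Pose], bool],
    is_active: Callable[[Pose], bool],
) -> Dict[str, int]:
    """Return the variants having at least one active pose, with their counts."""
    selected = {}
    for variant, poses in poses_by_variant.items():
        # (a)(ii): keep only the energetically favorable poses
        favorable = [p for p in poses if is_favorable(p)]
        # (b): test each favorable pose against the catalytic constraints
        active = [p for p in favorable if is_active(p)]
        # (c): select variants with one or more active poses
        if active:
            selected[variant] = len(active)
    return selected


# Hypothetical docking output for two variants
poses = {
    "variant_A": [Pose(-42.0, 3.1), Pose(-40.5, 5.2), Pose(-10.0, 3.0)],
    "variant_B": [Pose(-38.0, 6.0), Pose(-12.0, 3.2)],
}
result = screen_variants(
    poses,
    is_favorable=lambda p: p.energy <= -30.0,        # illustrative cutoff
    is_active=lambda p: p.reactive_distance <= 3.5,  # illustrative constraint
)
print(result)
```

In this sketch only `variant_A` is selected: `variant_B` has a pose that satisfies the distance constraint, but that pose is not energetically favorable, so it is never tested for activity.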
In some embodiments, the constraints include one or more of the following:
position, distance, angle, and torsion constraints. In some embodiments, the
constraints include a distance between a particular moiety on the substrate and a
particular residue or residue moiety in the active site. In some embodiments, the
constraints include a distance between a particular moiety on the ligand and an ideally
positioned native ligand in the active site.
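As a rough illustration of distance and angle constraints, the following Python sketch tests one pose against two geometric criteria. The atom roles (a substrate carbonyl carbon and oxygen, and a donor oxygen on an active-site residue) and the threshold values are assumptions made for this example; the disclosure does not specify them here.

```python
import math
from typing import Sequence

Point = Sequence[float]


def angle_deg(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at vertex b formed by the points a-b-c."""
    ba = [x - y for x, y in zip(a, b)]
    bc = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(ba, bc))
    cos_theta = dot / (math.hypot(*ba) * math.hypot(*bc))
    # Clamp to [-1, 1] to guard against floating-point round-off
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def meets_constraints(
    carbonyl_c: Point,
    carbonyl_o: Point,
    donor_o: Point,
    max_dist: float = 3.5,     # hypothetical distance limit, angstroms
    min_angle: float = 100.0,  # hypothetical angle limit, degrees
) -> bool:
    """True if the pose satisfies both the distance and the angle constraint."""
    d = math.dist(carbonyl_o, donor_o)
    ang = angle_deg(carbonyl_c, carbonyl_o, donor_o)
    return d <= max_dist and ang >= min_angle


# Two hypothetical poses: the donor oxygen is 2.8 and 5.0 angstroms away
good = meets_constraints((1.2, 0.0, 0.0), (0.0, 0.0, 0.0), (-2.8, 0.0, 0.0))
bad = meets_constraints((1.2, 0.0, 0.0), (0.0, 0.0, 0.0), (-5.0, 0.0, 0.0))
print(good, bad)
```

A torsion constraint could be added in the same way by computing a dihedral angle over four atoms and comparing it with an allowed range.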
In some embodiments, the computational representation of the substrate
represents a species along the reaction coordinate for the enzyme activity. The
species is selected from the substrate, a reaction intermediate of the substrate, or a
transition state of the substrate. In some embodiments, the variants screened are
selected from a panel of enzymes that can turn over multiple substrates and wherein
the members of the panel possess at least one mutation relative to a reference
sequence. In some embodiments, at least one mutation is a single-residue mutation.
In some embodiments, at least one mutation is in the active site of the enzyme. In
some embodiments, the plurality of variants include one or more enzymes that can
catalyze a chemical reaction selected from ketone reduction, transamination,
oxidation, nitrile hydrolysis, imine reduction, enone reduction, acyl hydrolysis, and
halohydrin dehalogenation. In some embodiments, the enzyme is selected from
ketone reductase, transaminase, cytochrome P450, Baeyer-Villiger monooxygenase,
monoamine oxidase, nitrilase, imine reductase, enone reductase, acylase, and
halohydrin dehalogenase. However, it is not intended that the present invention be
limited to any particular enzyme or class of enzyme, as any suitable enzyme finds use
in the methods of the present invention. In some embodiments, the variants are
members of a library produced by one or more rounds of directed evolution in vitro
and/or in silico.
In some embodiments, the method screens at least about ten different variants.
In other embodiments the method screens at least about a thousand different variants.
In some embodiments, the computational representations of active sites are
provided from 3-D homology models for the plurality of variants. In some
embodiments, methods are provided for producing the 3-D homology models for
protein variants. In some embodiments, the method is applied to screen a plurality of
substrates.
Some embodiments provide a method for identifying the constraints for the
substrate to undergo the catalyzed chemical transformation by identifying one or more
poses of a native substrate, a reaction intermediate of the native substrate, or a
transition state of the native substrate when the native substrate undergoes the
catalyzed chemical transformation by a wild-type enzyme.
Some embodiments provide a method for applying a set of one or more enzyme
constraints to the plurality of enzyme variants, wherein the one or more enzyme
constraints are similar to the constraints of a wild-type enzyme when a native
substrate undergoes a catalyzed chemical transformation in the presence of the
wild-type enzyme.
In some embodiments, the plurality of poses of the substrate is obtained by
docking operations including one or more of the following: high temperature
molecular dynamics, random rotation, refinement by grid-based simulated annealing,
and a final grid-based or full force field minimization. In some embodiments, the
plurality of poses of the ligand comprises at least about 10 poses of the substrate in
the active site.
In some embodiments, the selecting of variants in (c) above involves
identifying variants determined to have large numbers of active poses by comparison
to other variants. In some embodiments, the selecting in (c) involves ranking the
variants by one or more of the following: the number of active poses the variants
have, docking scores of the active poses, and binding energies of the active poses.
Then variants are selected based on rank. In some embodiments, the docking scores
are based on van der Waals force and electrostatic interaction. In some embodiments,
the binding energies are based on one or more of the following: van der Waals force,
electrostatic interaction, and solvation energy.
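The ranking described above can be sketched as follows. This is a minimal illustration, assuming each variant is summarized by the binding energies of its active poses, that lower energy is better, and that the number of active poses takes priority over binding energy; the disclosure lists these criteria but does not fix an order, and the variant names and values are hypothetical.

```python
from typing import Dict, List


def rank_variants(active_pose_energies: Dict[str, List[float]]) -> List[str]:
    """Rank variants by active-pose count (descending), then best energy (ascending)."""

    def sort_key(item):
        name, energies = item
        best = min(energies) if energies else float("inf")
        # Negate the count so that more active poses sort first;
        # the name is a final tie-breaker to keep the ordering deterministic.
        return (-len(energies), best, name)

    ordered = sorted(active_pose_energies.items(), key=sort_key)
    return [name for name, _ in ordered]


# Hypothetical binding energies of the active poses of four variants
data = {
    "V1": [-35.0, -33.2],
    "V2": [-41.0],
    "V3": [-36.5, -30.1],
    "V4": [],
}
ranking = rank_variants(data)
print(ranking)
```

Here `V3` and `V1` both have two active poses, so the tie is broken by the lower best binding energy of `V3`. Selecting the top-ranked variants is then a slice of the returned list.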
In some embodiments, the screening method also involves preparing a
plurality of oligonucleotides containing or encoding at least a portion of at least one
selected variant. The method further involves performing one or more rounds of
directed evolution using the plurality of oligonucleotides. In some embodiments,
preparing a plurality of oligonucleotides involves synthesizing the oligonucleotides
using a nucleic acid synthesizer. In some embodiments, performing one or more
rounds of directed evolution comprises fragmenting and recombining the plurality of
oligonucleotides. In some embodiments, performing one or more rounds of directed
evolution involves performing saturation mutagenesis on the plurality of
oligonucleotides.
In some embodiments, the screened enzyme variant has desired catalytic
activity and/or selectivity. The method of some embodiments also involves
synthesizing the enzyme selected from screening.
In some embodiments, the screening method can be expanded to screen
biomolecules other than enzymes. Some embodiments provide a method for
screening a plurality of protein variants for interaction with a ligand. The method
involves: (a) for each protein variant, docking, by the computer system, a
computational representation of the ligand to a computational representation of an
active site of the enzyme variant, wherein docking (i) generates a plurality of poses of
the ligand in the active site, and (ii) identifies energetically favorable poses of the
ligand in the active site; (b) for each energetically favorable pose, determining
whether the pose is active, wherein an active pose meets one or more constraints for
the ligand to undergo a particular interaction with the protein variant; and (c) selecting at
least one of the protein variants determined to have one or more active poses. In
some embodiments, the ligand can be selected from a substrate, an intermediate, a
transition state, a product, an inhibitor, an agonist, and/or an antagonist.
In some embodiments, computer program products and computer systems
implementing the methods for screening enzymes and proteins are also provided.
These and other features are presented below with reference to the associated drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates geometric constraints for identifying active poses for a
catalytic reaction of pro-R selectivity, the reaction involving a ketone reductase
enzyme with a tyrosine moiety, an acetophenone substrate, and the cofactor NADPH.
Figure 2 is a flow chart presenting a workflow for analyzing potential activity
of candidate biomolecules in some implementations.
Figure 3A is a flowchart showing an example of a workflow for designing
biomolecule sequences according to some embodiments of the disclosure.
Figure 3B is a flowchart showing an example of a workflow for designing
biomolecule sequences, which involves synthesizing and assaying sequences obtained
from virtual screening.
Figure 3C is a flowchart showing an example of a workflow for designing
biomolecule sequences, which combines in vitro directed evolution and virtual
screening in each round of multiple iterations.
Figure 4 shows an exemplary digital device that can be implemented
according to some embodiments of the current disclosure.
Figure 5 provides a plot of data showing the binding energy and selectivity of
best variants from a second round of directed evolution and the backbones for
round 1 (Rd1BB) and round 2 (Rd2BB).
Figure 6A shows model fitness of a sequence activity model built using data
from a virtual protein screening system according to some embodiments.
Figure 6B shows cross validation data indicating that the sequence activity
model as constructed in Figure 6A was accurate in predicting binding energy.
Figure 6C shows the coefficients for various mutations according to the
sequence activity model as constructed in Figure 6A.
Figure 7 shows quantities indicating conversion on X axis and selectivity on Y
axis from virtually screening ketoreductase variants for enantioselective production of
(R)-1,1,1-trifluoropropanol from 1,1,1-trifluoropropanone.
Figure 8 shows quantities indicating conversion and hits (variants with certain
level of improvement) from virtual directed evolution of P450 for regioselective C-H
oxidation to C-OH.
DETAILED DESCRIPTION
Screening of proteins and enzymes may be performed in actual ways that
involve measurements of the chemical and physical properties of protein and enzyme
molecules interacting with ligands and substrates. Actual measurements consume time
and resources, and underlying physical and chemical mechanisms are often difficult to
visualize or manipulate. The “virtual” screening methods and systems disclosed
herein provide tools to visualize or manipulate the structure and dynamics of
enzymes, proteins, and their substrates and ligands. These tools can save time and/or
materials for studying the molecules.
In some embodiments, virtual screening of proteins or enzymes is used in
directed evolution of proteins of interest. Virtual screening is used in place of physical
screening during various stages of these directed evolution embodiments, making it
possible to study a large number of molecules and reactions without requiring the
physical materials or the time required by actual screening. These embodiments can
speed up the processes for obtaining proteins and enzymes having desired properties.
Materials and resources may also be saved in the processes. Some embodiments are
especially useful for designing and developing enzymes having desired activity and/or
selectivity for catalytic reactions involving particular substrates.
I. DEFINITIONS
Unless defined otherwise herein, all technical and scientific terms used herein
have the same meaning as commonly understood by one of ordinary skill in the art.
Various scientific dictionaries that include the terms included herein are well known
and available to those in the art. Any methods and materials similar or equivalent to
those described herein find use in the practice of the embodiments disclosed herein.
The terms defined immediately below are more fully understood by reference
to the specification as a whole. The definitions are for the purpose of describing
particular embodiments only and aiding in understanding the complex concepts
described in this specification. They are not intended to limit the full scope of the
disclosure. Specifically, it is to be understood that this disclosure is not limited to the
particular sequences, compositions, algorithms, systems, methodology, protocols, and
reagents described, as these may vary, depending upon the context they are used by
those of skill in the art.
As used in this specification and appended claims, the singular forms “a”,
“an”, and “the” include plural referents unless the content and context clearly dictates
otherwise. Thus, for example, reference to “a device” includes a combination of two
or more such devices, and the like. Unless indicated otherwise, an “or” conjunction is
intended to be used in its correct sense as a Boolean logical operator, encompassing
both the selection of features in the alternative (A or B, where the selection of A is
mutually exclusive from B) and the selection of features in conjunction (A or B,
where both A and B are selected).
“Docking” as used herein, refers to the computational process for simulating
and/or characterizing the binding of a computational representation of a molecule
(e.g., a substrate or ligand) to a computational representation of an active site of a
biomolecule (e.g., an enzyme or protein). Docking is typically implemented in a
computer system using a “docker” computer program. Typically, the result of a
docking process is a computational representation of the molecule “docked” in the
active site in a specific “pose.” A plurality of docking processes may be carried out
between the same computational representation of a molecule and the same
computational representation of an active site resulting in a plurality of different
“poses” of the molecule in the active site. The evaluation of the structure,
conformation, and energetics of the plurality of different “poses” in the computational
representation of the active site can identify certain “poses” as more energetically
favorable for binding between the ligand and the biomolecule.
In some embodiments, poses generated from docking are evaluated to
determine if they are “active” for a desired interaction with the biomolecule. “Active
poses” are those meeting one or more constraints for an activity under consideration.
A “constraint” may limit a pose’s structure, geometry, conformation, energetics, etc.
In certain embodiments, an “active pose” of a computational representation of a
substrate in the active site of an enzyme satisfies conditions for catalysis by the
enzyme. When docking identifies numerous active poses of a computational
representation of a substrate in the computational representation of the active site, the
specific enzyme represented may be selected as favorable for catalyzing the chemical
transformation of the substrate to product.
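The classification of poses as active or inactive can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the dictionary keys, the 4.0 Å distance cutoff, and the -30.0 energy cutoff are hypothetical values chosen only for illustration, and a pose is treated as active only when it satisfies every constraint.

```python
import math

# Hypothetical thresholds, chosen only for illustration.
MAX_HYDRIDE_DISTANCE = 4.0   # angstroms (assumed geometric constraint)
MAX_INTERACTION_ENERGY = -30.0  # assumed energetic constraint, lower is better


def is_active(pose, constraints):
    """A pose is active only if it satisfies every constraint."""
    return all(check(pose) for check in constraints)


constraints = [
    # Geometric constraint: reacting atoms must be close enough.
    lambda p: math.dist(p["carbonyl_carbon"], p["cofactor_hydride"])
    <= MAX_HYDRIDE_DISTANCE,
    # Energetic constraint: interaction energy must be favorable.
    lambda p: p["interaction_energy"] <= MAX_INTERACTION_ENERGY,
]

# Three made-up poses of one substrate in one variant's active site.
poses = [
    {"carbonyl_carbon": (0.0, 0.0, 0.0), "cofactor_hydride": (3.0, 0.0, 0.0),
     "interaction_energy": -42.0},
    {"carbonyl_carbon": (0.0, 0.0, 0.0), "cofactor_hydride": (6.5, 0.0, 0.0),
     "interaction_energy": -45.0},
    {"carbonyl_carbon": (0.0, 0.0, 0.0), "cofactor_hydride": (2.8, 1.0, 0.0),
     "interaction_energy": -12.0},
]

active_count = sum(is_active(p, constraints) for p in poses)
print(active_count)
```

In this sketch the count of active poses could serve as one basis for comparing variants, since the text associates numerous active poses with an enzyme that is favorable for catalysis.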
A “docker” is a computer program that computationally simulates and/or
characterizes the docking process between a computational representation of a
molecule (e.g., a substrate or ligand) and a computational representation of an active
site of interest in a protein or other biological molecule.
Dockers are typically implemented as software that may be temporarily or
permanently stored in association with hardware such as a processor or processors.
Commercially available docking programs include CDocker (Accelrys), DOCK
(University of California, San Francisco), AutoDock (Scripps Research Institute),
FlexX (tripos.com), GOLD (ccdc.cam.ac.uk), and GLIDE (schrodinger.com).
Docking using a docker typically generates “poses” of computational
representations of substrates and ligands with respect to active sites. These poses may
be used in generating a docking score or otherwise assessing docking. In some
embodiments, poses are associated with interaction energy values calculated by a
docker. Some poses are energetically more favorable than other poses. In some
embodiments, the docker permits a user to specify a number of poses (n) to use in
assessing docking. Only the top n poses with the best docking scores are considered
in assessing docking. In some embodiments, only poses with favorable interaction
energy that meet defined criteria are selected to be classified as active or inactive
poses.
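The selection of the top n poses, followed by an energy filter, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the behavior of any particular docker: the `Pose` class, the convention that a higher docking score is better, and the energy cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pose:
    pose_id: str
    docking_score: float       # assumed convention: higher is better
    interaction_energy: float  # assumed convention: lower is more favorable


def top_n_poses(poses, n):
    """Return the n poses with the best docking scores."""
    return sorted(poses, key=lambda p: p.docking_score, reverse=True)[:n]


def favorable_poses(poses, max_energy):
    """Keep only poses whose interaction energy is at or below max_energy."""
    return [p for p in poses if p.interaction_energy <= max_energy]


# Made-up poses for illustration.
poses = [
    Pose("p1", 41.0, -35.2),
    Pose("p2", 55.5, -48.9),
    Pose("p3", 12.3, 4.1),
    Pose("p4", 47.8, -40.0),
]

# Consider only the top 3 poses, then keep those meeting the energy criterion.
selected = favorable_poses(top_n_poses(poses, n=3), max_energy=-38.0)
print([p.pose_id for p in selected])
```

Only the poses that survive both steps would then be classified as active or inactive.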
In some embodiments, a docker can determine that a substrate or ligand is
likely to bind with a biomolecule if one or more poses of the substrate or ligand have
favorable interaction energy with the biomolecule. A bound ligand may act as an
agonist or antagonist. Various dockers output a docking score or other measure of
binding between the substrate or ligand and the biomolecule. For some combinations
of biomolecule active site with a substrate or ligand, the docking program will
determine that binding is unlikely to occur. In such cases, the docking program will
output a conclusion that the substrate or ligand does not bind with the biomolecule.
A docker may be programmed to output an assessment of the likelihood that a
ligand will dock with the active site of a biomolecule or the quality of such docking,
should it occur. The likelihood and quality of docking indicate the likelihood that a
ligand will bind with a biomolecule. At one level, a docker determines whether a
ligand is likely to bind to a biomolecule’s active site. If the docker logic concludes
that binding is not likely or is highly unfavorable, it may output a “no refined poses
found” result. This may occur when all the conformations the docking program
generated have unfavorable van der Waals clashes and/or electrostatic repulsions with
the active site. In the above example of a docking procedure, if the second operation
fails to find a pose with soft energy less than the threshold, the docker may return a
result such as “no refined poses found.” Because soft energy primarily considers
nonbonded interactions including van der Waals and electrostatic forces, the “no
refined poses found” result means the ligand has severe steric clashes and/or
electrostatic repulsions with the biomolecule receptor for a given number of poses.
In certain embodiments, the docker outputs a docking score that represents the
interaction between the ligand and the biomolecule active site. Dockers may calculate
various features of the ligand-biomolecule interaction. In one example, the output is
simply the interaction energy between the ligand and the biomolecule. In another
embodiment, a total energy is output. The total energy may be understood to be a
combination of ligand-biomolecule interaction energy and ligand strain. In certain
implementations, such energy may be calculated using a force field such as
CHARMm.
In various embodiments, docking programs generate such outputs by
considering multiple poses of the ligand in the active site of the biomolecule. Each
pose will have its own associated energy values. In some embodiments, the docking
program ranks the poses and considers the energy associated with one or more of the
high-ranking poses. In some cases, it may average the energies of certain high-ranking
poses or otherwise perform a statistical analysis of the top ranking poses. In other
embodiments, it simply chooses the value associated with the top-ranked pose and
outputs this as the resulting energy for the docking.
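The two reporting strategies described above (using the top-ranked pose, or averaging several high-ranking poses) can be sketched in Python. The sketch assumes that the “combination” of interaction energy and ligand strain is a simple sum and that lower energies rank higher; both are assumptions made for illustration, and the function names are hypothetical.

```python
from statistics import mean


def total_energy(interaction_energy, ligand_strain):
    """Assumed combination: total energy = interaction energy + ligand strain."""
    return interaction_energy + ligand_strain


def docking_energy(pose_energies, top_k=1):
    """Rank poses (lowest energy first) and average the top_k of them.

    With top_k=1 this simply returns the energy of the top-ranked pose.
    """
    if not pose_energies:
        raise ValueError("no poses to evaluate")
    ranked = sorted(pose_energies)
    return mean(ranked[:top_k])


# Made-up (interaction energy, ligand strain) pairs for three poses.
energies = [
    total_energy(-40.0, 5.0),
    total_energy(-52.0, 8.0),
    total_energy(-30.0, 2.0),
]

best = docking_energy(energies)              # top-ranked pose only
avg_top2 = docking_energy(energies, top_k=2)  # average of two best poses
print(best, avg_top2)
```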
In some embodiments, the computational representation of a substrate
corresponds to a molecular species along the reaction coordinate of an enzymatic
reaction that is capable of converting the substrate molecule to the desired product
molecule. In some embodiments, the computational representation of the substrate
represents the substrate molecule per se. In some embodiments, the computational
representation of the substrate represents an intermediate structure of the substrate
that forms along the reaction coordinate (i.e., a “reaction intermediate of the
substrate”). In some embodiments, the computational representation of the substrate
represents a transition state structure that forms along the enzymatic reaction
coordinate (i.e., a “transition state of the substrate”).
In some embodiments, a computational representation of a ligand can
represent a molecular species that binds strongly to an enzyme or biomolecule but
does not proceed along a reaction coordinate to a desired product. For example, the
computational representation of the ligand can represent a strong inhibitor in order to
screen for inhibitors of an enzyme, or strong-binding antagonists or agonists of
proteins (e.g., receptors).
A “pose” is the position or orientation of a substrate or ligand with respect to
an active site of a biological molecule. In a pose, the three dimensional positions of
some or all atoms of the ligand are specified with respect to some or all positions of
atoms in the active site. While a ligand’s conformation is not its pose — because the
conformation does not consider the active site — the conformation can be used in
determining a pose. In some embodiments, a ligand’s orientation and conformation
together define a pose. In some embodiments, a pose only exists if a ligand’s
orientation/conformation combination meets a defined threshold energy level in the
reference active site.
Various computational mechanisms can be employed to generate poses for
docking. Examples include systematic or stochastic torsional searches about rotatable
bonds, molecular dynamics simulations, and genetic algorithms to “evolve” new low
energy conformations. These techniques are used to modify computational
representations of the ligand and/or active site to explore “pose space.”
Dockers evaluate poses to determine how the ligand interacts with the active
site. In some embodiments, they do this by calculating energy of interaction based on
one or more of the interaction types mentioned above (e.g., van der Waals forces).
This information is used to characterize docking and in some cases produce a docking
score. In some implementations, dockers rank poses based on docking scores. In
some implementations, dockers remove poses with unfavorable docking scores from
consideration.
In certain embodiments, a virtual protein screening system evaluates a pose to
determine whether the pose is active. A pose is deemed to be active if it meets
defined constraints known to be important for the desired activity under consideration.
As an example, the virtual protein screening system may determine whether a pose
supports catalytic transformation of the ligand in an active site.
A “ligand” is a molecule or complex that interacts with an active site of a
biomolecule to form a stable complex containing at least the ligand and biomolecule.
In addition to the ligand and biomolecule, the stable complex may include (sometimes
require) other chemical entities such as organic and inorganic cofactors (e.g.,
coenzymes and prosthetic groups), metal ions, and the like. Ligands may be agonists
or antagonists.
The “active site” of a biomolecule is a site defined by the structure of the
biomolecule which is capable of containing and/or binding all or part of a molecule
(e.g., a substrate or ligand). Many types of active sites are contemplated and some of
these are described elsewhere herein. Often the active site contains chemical and/or
physical features (e.g., amino acid residues) capable of forming binding interactions
with the substrate or ligand. In some embodiments (e.g., when the biomolecule is an
enzyme), the “active site” includes at least one catalytic residue and a plurality of
binding residues, and sometimes other chemical entities such as organic and inorganic
cofactors (e.g., coenzymes and prosthetic groups), metal ions, and the like. The at
least one catalytic residue of the active site may contain a catalytic moiety that
catalyzes the turnover of a substrate. The binding residues of the active site provide
binding interactions with the substrate to hold it in the active site in a stereoselective
and/or regioselective manner. Such interactions may include van der Waals
interactions, electrostatic interactions, hydrogen bonding, hydrophilic interactions,
hydrophobic interactions, solvent interactions, covalent bonding, etc.
In some embodiments, a computational representation of an active site can be
used for docking a computational representation of a substrate or ligand, thereby
generating poses that can be evaluated for favorable interaction with the active site
(e.g., determination of binding energy for poses).
In some embodiments, the computational representation of the active site is
defined geometrically by a sphere or other shape. In some embodiments, the active
site is defined by creating a sphere around the centroid of selected objects (e.g.,
ligands and/or other chemical entities in the structure template) with the radius
adjusted to include them. The minimum radius is 5 Å, but the active site size can be
expanded by increasing the sphere radius by 1 Å, 2 Å, 3 Å, 4 Å, 6 Å, 8 Å, 10 Å, and so
on. In some implementations, the size of the radius is selected to capture residues
proximate the substrate. Therefore, larger substrates will be associated with larger
radii and small substrates will be associated with smaller radii. It is not intended that
the present disclosure be limited to any particular values of radii. In some
embodiments, the active site can be defined from receptor cavities, where the active
site is derived from one of the cavities detected in the structure template. In some
embodiments, the active site can be defined from Protein Data Bank (PDB) site
records, as the PDB file of the structure template often has the active site defined using
site records. Since all the homology models will be created using the structure
template, the defined active site is transferable to all the homology models.
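The sphere-based definition of an active site can be sketched in Python as follows. This is a minimal sketch and not the disclosed implementation: the coordinates and residue names are made up, and it assumes that the radius is first fitted to enclose the selected atoms, then raised to the 5 Å minimum, and then optionally expanded by a fixed increment.

```python
import math

MIN_RADIUS = 5.0  # angstroms, the minimum radius mentioned in the text


def centroid(points):
    """Arithmetic mean position of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def site_sphere(selected_atoms, expand_by=0.0):
    """Sphere centered on the centroid, sized to include all selected atoms."""
    center = centroid(selected_atoms)
    fitted = max(math.dist(center, atom) for atom in selected_atoms)
    return center, max(fitted, MIN_RADIUS) + expand_by


def residues_in_site(residues, center, radius):
    """Names of residues having at least one atom inside the sphere."""
    return sorted(
        name
        for name, atoms in residues.items()
        if any(math.dist(center, atom) <= radius for atom in atoms)
    )


# Hypothetical ligand atoms and residue atoms (coordinates in angstroms).
ligand_atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.5, 0.0)]
residues = {
    "SER143": [(3.0, 1.0, 0.5)],
    "TYR156": [(1.0, 6.5, 0.0)],
    "LYS160": [(12.0, 0.0, 0.0)],
}

center, radius = site_sphere(ligand_atoms)
default_site = residues_in_site(residues, center, radius)

center2, radius2 = site_sphere(ligand_atoms, expand_by=2.0)
expanded_site = residues_in_site(residues, center2, radius2)
print(default_site, expanded_site)
```

Because every homology model is built on the same structure template, a sphere defined once in template coordinates could be reused for each model in this sketch.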
In some embodiments, the computational representation of the active site can
be defined by various three-dimensional shapes, such as a user customizable shape
(e.g., an ellipse or an irregular shape reflecting the structure of the substrate) with
reference to moieties on the substrate and/or the enzyme.
In some embodiments, the computational representation of the active site can
be defined to include amino acids that do not interact directly (e.g., via van der Waals
interactions, electrostatic interactions, hydrogen bonding) with the substrate or ligand
molecule in the active site, but which interact with other amino acids in the
computational representation of the active site, and thereby affect the evaluation of
poses of the substrate or ligand.
In some embodiments, residues contributing to catalysis and/or binding may
exist outside of the computational representation of the active site as defined above.
Such residues may be modified during directed evolution by considering residues
beyond the active site as candidates for mutation or recombination.
A “reaction intermediate” is a chemical entity generated from the substrate in
the transformation from substrate to reaction product. A “transition state” of a
substrate is the substrate in a state corresponding to the highest potential energy along
a reaction pathway. At a transition state that tends to have a fleeting existence,
colliding reactant molecules proceed to form products. In this disclosure, sometimes
when a substrate is described in a process, the intermediate and transition state may
also be suitable for the process. In such situations, the substrate, intermediate, and
transition state may collectively be referred to as “ligands.” In some cases, multiple
intermediates are generated in the catalytic transformation of a substrate. In certain
embodiments, the ligand species (substrate or intermediate or transition state) chosen
for analysis is one known to be associated with a rate limiting step in the catalytic
transformation. As an example, a substrate covalently bound to an enzyme cofactor
may be chemically modified in a rate limiting step. In such case, the substrate-
cofactor species is used in modeling the interaction.
A “ligand” is a molecule capable of binding to a biomolecule and can include
“substrate” molecules that are capable of binding and further undergoing a catalytic
chemical transformation. Some ligands bind with an active site but do not undergo a
catalytic transformation. Examples include ligands evaluated in the drug design field.
Such ligands may be small molecules chosen for their ability to non-covalently bind
with a target biomolecule for pharmacological purposes. In some cases, a ligand is
evaluated for its ability to potentiate, activate, or inhibit the natural behavior of a
biomolecule.
A “biomolecule” or “biological molecule” refers to a molecule that is
generally found in or produced by a biological organism. In some embodiments,
biological molecules comprise polymeric biological macromolecules having multiple
subunits (i.e., “biopolymers”). Typical biomolecules include proteins, enzymes, and
other polypeptides, DNA, RNA and other polynucleotides, and can also include
molecules that share some structural features with naturally occurring polymers such
as RNAs (formed from nucleotide subunits), DNAs (formed from nucleotide
subunits), and peptides or polypeptides (formed from amino acid subunits), including,
e.g., RNA analogues, DNA analogues, polypeptide analogues, peptide nucleic acids
(PNAs), combinations of RNA and DNA (e.g., chimeraplasts), or the like. It is not
intended that biomolecules be limited to any particular molecule, as any suitable
biological molecule finds use in the present disclosure, including but not limited to,
e.g., lipids, carbohydrates, or other organic molecules that are made by one or more
genetically encodable molecules (e.g., one or more enzymes or enzyme pathways) or
the like. Of particular interest for some aspects of this disclosure are biomolecules
having active sites that interact with a ligand to effect a chemical or biological
transformation, e.g., catalysis of a substrate, activation of biomolecules, or
inactivation of the biomolecules, specifically enzymes.
In some embodiments, a “beneficial property” or “activity” is an increase or
decrease in one or more of the following: catalytic rate (kcat), substrate binding
affinity (KM), catalytic efficiency (kcat/KM), substrate specificity, chemoselectivity,
regioselectivity, stereoselectivity, stereospecificity, ligand specificity, receptor
agonism, receptor antagonism, conversion of a cofactor, oxygen stability, protein
expression level, solubility, thermoactivity, thermostability, pH activity, pH stability
(e.g., at alkaline or acidic pH), glucose inhibition, and/or resistance to inhibitors (e.g.,
acetic acid, lectins, tannic acids, and phenolic compounds) and proteases. Other
desired activities may include an altered profile in response to a particular stimulus
(e.g., altered temperature and/or pH profiles). In the context of rational ligand design,
optimization of targeted covalent inhibition (TCI) is a type of activity. In some
embodiments, two or more variants screened as described herein act on the same
substrate but differ with respect to one or more of the following activities: rate of
product formation, percent conversion of a substrate to a product, selectivity, and/or
percent conversion of a cofactor. It is not intended that the present disclosure be
limited to any particular beneficial property and/or desired activity.
In some embodiments, “activity” is used to describe the more limited concept
of an enzyme’s ability to catalyze the turnover of a substrate to a product. A related
enzyme characteristic is its “selectivity” for a particular product such as an
enantiomer or regioselective product. The broad definition of “activity” presented
herein includes selectivity, although conventionally selectivity is sometimes viewed
as distinct from enzyme activity.
The terms “protein,” “polypeptide,” and “peptide” are used interchangeably to
denote a polymer of at least two amino acids covalently linked by an amide bond,
regardless of length or post-translational modification (e.g., glycosylation,
phosphorylation, lipidation, myristoylation, ubiquitination, etc.). In some cases, the
polymer has at least about 30 amino acid residues, and usually at least about 50 amino
acid residues. More typically, they contain at least about 100 amino acid residues.
The terms include compositions conventionally considered to be fragments of full-
length proteins or peptides. Included within this definition are D- and L-amino acids,
and mixtures of D- and L-amino acids. The polypeptides described herein are not
restricted to the genetically encoded amino acids. Indeed, in addition to genetically
encoded amino acids, the polypeptides described herein may be made up of, either in
whole or in part, naturally-occurring and/or synthetic non-encoded amino acids. In
some embodiments, a polypeptide is a portion of the full-length ancestral or parental
polypeptide, containing amino acid additions or deletions (e.g., gaps) and/or
substitutions, as compared to the amino acid sequence of the full-length parental
polypeptide, while still retaining functional activity (e.g., catalytic activity).
A “wild type” or “wildtype” (WT) biomolecule or organism is one that has the
phenotype of the typical form of a species as it occurs in nature. Sometimes a wild
type biomolecule has been isolated from a naturally occurring source. Other times, it
is derived in the laboratory environment. Usually, wild type biomolecules are related
to or encoded by genetic sequences of normal or reference genomes as opposed to
mutant genomes. Included within the definition of “wild type biomolecules” are
recombinant forms of a polypeptide or polynucleotide having a sequence identical to
the native form. A substrate or ligand that reacts with a wild-type biomolecule is
sometimes considered a “native” substrate or ligand.
As used herein, the terms “variant,” “mutant,” “mutant sequence,” and
“variant sequence” refer to a biological sequence that differs in some respect from a
standard or reference sequence (e.g., in some embodiments, a parental sequence).
The difference may be referred to as a “mutation”. In some embodiments, a mutant is
a polypeptide or polynucleotide sequence that has been altered by at least one
substitution, insertion, cross-over, deletion, and/or other genetic operation. For
purposes of the present disclosure, mutants and variants are not limited to a particular
method by which they are generated. In some embodiments, a mutant or variant
sequence has increased, decreased, or substantially similar activities or properties, in
comparison to the parental sequence. In some embodiments, the variant polypeptide
comprises one or more amino acid residues that have been mutated, as compared to
the amino acid sequence of the wild-type polypeptide (e.g., a parent polypeptide). In
some embodiments, one or more amino acid residues of the polypeptide are held
constant, are invariant, or are not mutated as compared to a parent polypeptide in the
variant polypeptides making up a plurality of polypeptides. In some embodiments,
the parent polypeptide is used as the basis for generating variants with improved
stability, activity, or any other desired property.
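The relationship between a parent sequence and a variant sequence can be illustrated with a short Python sketch that lists the substitutions distinguishing the two. It is a simplified illustration only: it assumes equal-length protein character strings (substitutions only, with no insertions or deletions), the sequences are made up, and the labels follow the common “parent residue, 1-based position, variant residue” convention.

```python
def list_substitutions(parent, variant):
    """Return substitution labels such as 'A4G' for equal-length sequences."""
    if len(parent) != len(variant):
        raise ValueError("this sketch handles substitutions only")
    return [
        f"{p}{position}{v}"
        for position, (p, v) in enumerate(zip(parent, variant), start=1)
        if p != v
    ]


# Made-up parent and variant protein character strings.
parent_seq = "MKTAYIAKQR"
variant_seq = "MKTGYIAKQW"

mutations = list_substitutions(parent_seq, variant_seq)
print(mutations)
```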
As used herein, the terms “enzyme variant” and “variant enzyme” are used in
reference to enzymes that are similar to a reference enzyme, particularly in their
function, but have mutations in their amino acid sequence that make them different in
sequence from the wild-type or another reference enzyme. Enzyme variants can be
made by a wide variety of different mutagenesis techniques well known to those
skilled in the art. In addition, mutagenesis kits are also available from many
commercial molecular biology suppliers. Methods are available to make specific
substitutions at defined amino acids (site-directed), specific or random mutations in a
localized region of the gene (regio-specific) or random mutagenesis over the entire
gene (e.g., saturation mutagenesis). Numerous suitable methods are known to those
in the art to generate enzyme variants, including but not limited to site-directed
mutagenesis of single-stranded DNA or double-stranded DNA using PCR, cassette
mutagenesis, gene synthesis, error-prone PCR, shuffling, and chemical saturation
mutagenesis, or any other suitable method known in the art. After the variants are
produced, they can be screened for the desired property (e.g., high or increased; or
low or reduced activity, increased thermal and/or alkaline stability, etc.).
A “panel of enzymes” is a group of enzymes selected such that each member
of the panel catalyzes the same chemical reaction. In some embodiments, the
members of the panel can collectively turn over multiple substrates, each undergoing
the same reaction. Often the panel members are chosen to efficiently turn over
multiple substrates. In some cases, the panels are commercially available. In other
cases, they are proprietary to an entity. For example, a panel may include various
enzymes identified as hits in a screening procedure. In certain embodiments, one or
more members of a panel exist only as a computational representation. In other
words, the enzyme is a virtual enzyme.
A “model” is a representation of the structure of a biomolecule or ligand. It is
sometimes provided as a collection of three-dimensional positions for the atoms or
moieties of the entity being represented. Models often contain computationally-
produced representations of the active sites or other aspects of the enzyme variants.
Examples of models relevant to the embodiments herein are produced from homology
modeling, protein threading, or ab initio protein modeling using a routine such as
Rosetta (rosettacommons.org/software/) or Molecular Dynamics simulations.
A “homology model” is a three dimensional model of a protein or portion of a
protein containing at least the active site of a ligand under consideration. Homology
modeling relies on the observation that protein structures tend to be conserved
amongst homologous proteins. A homology model provides three dimensional
positions of residues including backbone and side chains. The model is generated
from a structure template of a homologous protein likely to resemble the structure of
the modeled sequence. In some embodiments, a structure template is used in two
steps: “align sequence to templates” and “build homology models”.
The “align sequence to templates” step aligns the model sequence to one or
more structure template sequences and produces an input sequence alignment for
building the homology model. The alignment identifies gaps and other regions of
dissimilarity between the model sequence and the structure template sequence(s).
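How an alignment exposes gaps and other regions of dissimilarity can be sketched in Python. The sketch does not perform the alignment itself; it assumes an already aligned pair of made-up sequences in which “-” marks a gap, and it reports each run of non-matching columns as a (state, start, end) tuple using 0-based, end-exclusive column indices.

```python
from itertools import groupby


def column_state(model_char, template_char):
    """Classify one alignment column."""
    if model_char == "-":
        return "gap_in_model"
    if template_char == "-":
        return "gap_in_template"
    return "match" if model_char == template_char else "mismatch"


def dissimilar_regions(aligned_model, aligned_template):
    """Return runs of gap or mismatch columns as (state, start, end)."""
    if len(aligned_model) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    states = [column_state(m, t) for m, t in zip(aligned_model, aligned_template)]
    regions = []
    column = 0
    for state, run in groupby(states):
        length = len(list(run))
        if state != "match":
            regions.append((state, column, column + length))
        column += length
    return regions


# Made-up aligned sequences for illustration.
aligned_model = "MKT--AYIAKQR"
aligned_template = "MKTLLAYLAK-R"

regions = dissimilar_regions(aligned_model, aligned_template)
print(regions)
```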
The “build homology models” step uses structural features of the structure
template to derive spatial restraints which, in turn, are used to generate, e.g., model
protein structures using conjugate gradient and simulated annealing optimization
procedures. The structural features of the template may be obtained from a technique
such as NMR or x-ray crystallography. Examples of such techniques can be found in
the review article, “A Guide to Template Based Structure Prediction,” by Qu X,
Swanson R, Day R, Tsai J. Curr Protein Pept Sci. 2009 Jun;10(3):270-85.
The term “active conformation” is used in reference to a conformation of a
protein (e.g., an enzyme) that allows the protein to cause a substrate to undergo a
chemical transformation (e.g., a catalytic reaction).
An “active pose” is one in which a ligand is likely to undergo a catalytic
transformation or perform some desired role such as covalently binding with the
binding site.
The terms “oxidoreduction,” “oxidation-reduction,” and “redox” are used
interchangeably with reference to a reversible chemical reaction in which one reaction
is an oxidation and the reverse is a reduction. The terms are also used to refer to all
chemical reactions in which atoms have their oxidation state changed; in general,
redox reactions involve the transfer of electrons between species. This can be either a
simple redox process, such as the oxidation of carbon to yield carbon dioxide (CO2)
or the reduction of carbon by hydrogen to yield methane (CH4), or a complex process
such as the oxidation of glucose (C6H12O6) in the human body through a series of
complex electron transfer processes.
An “oxidoreductase” is an enzyme that catalyzes an oxidoreduction reaction.
The term “transferation” is used herein to refer to a chemical reaction that
transfers a functional group from one compound to another compound. A
“transferase” is used to refer to any of various enzymes that catalyze a transferation
reaction.
The term “hydrolysis” is used to refer to a chemical reaction in which water
reacts with a compound to produce other compounds, which reaction involves the
splitting of a chemical bond by the addition of the hydrogen cation and the hydroxide
anion from the water.
A “hydrolase” is an enzyme that catalyzes a hydrolysis reaction.
The term “isomerization” is used to refer to a chemical reaction that converts a
compound into an isomer.
An “isomerase” is an enzyme that catalyzes an isomerization reaction, causing
its substrate to change into an isomeric form.
The term “ligation” is used herein to refer to any chemical reactions that join
two molecules by forming a new chemical bond. In some embodiments, a ligation
reaction involves hydrolysis of a small chemical group dependent to one of the larger
molecules. In some embodiments, an enzyme catalyzes the linking together of two
compounds, e.g., enzymes that catalyze joining of C-O, C-S, C-N, etc. An enzyme
that catalyzes a ligation reaction is referred to as a “ligase”.
A “lyase” is an enzyme that catalyzes the breaking of various chemical bonds
by means other than hydrolysis and oxidation. In some embodiments, a lyase reaction
forms a new double bond or a new ring structure.
A “ketoreductase” is an enzyme that typically uses cofactor NADPH to
stereospecifically reduce a keto group to a hydroxyl group (See e.g., variants
disclosed in WO2008103248A2, WO2009029554A2, WO2009036404A2,
WO2009042984A1, WO2009046153A1, and WO2010025238A2).
A “transaminase” or an “aminotransferase” is an enzyme that catalyzes a
transamination reaction between an amino acid and an α-keto acid, in which the
amine group NH2 on the amino acid is exchanged with the keto group =O on the α-
keto acid (See e.g., variants disclosed in WO2010081053A2 and
WO2010099501A2).
The “cytochrome” proteins (abbreviated as “CYP”) are enzymes involved in
oxidation of organic substances. One example is cytochrome P450 enzymes. The
substrates of CYP enzymes include, but are not limited to metabolic intermediates
such as lipids and steroidal hormones, as well as xenobiotic substances such as drugs
and other toxic chemicals. CYPs are the major enzymes involved in drug metabolism
and bioactivation. CYPs use a variety of small and large molecules as substrates in
enzymatic reactions. The most common reaction catalyzed by cytochrome P450 is a
monooxygenase reaction, e.g., insertion of one atom of oxygen into an organic
substrate (RH) while the other oxygen atom is reduced to water. Cytochrome P450
enzymes belong to a superfamily of proteins containing a heme cofactor and,
therefore, are hemoproteins. In general, they are terminal oxidase enzymes in
electron transfer chains. The MicroCyp® screening plates and enzymes available
from Codexis are useful in production of drug metabolites and novel lead compounds
(See e.g., variants disclosed in WO2002083868A2, WO2005017105A2,
WO2005017116A2, and WO2003008563A2).
A “Baeyer-Villiger monooxygenase” is an enzyme that employs NADPH and
molecular oxygen to catalyze a Baeyer-Villiger oxidation reaction, in which an
oxygen atom is inserted into a carbon-carbon bond of a carbonylic substrate (See
e.g., variants in WO2011071982A2 and WO2012078800A2).
A “monoamine oxidase” (MAO) (EC 1.4.3.4) is an enzyme that catalyzes the
oxidation of monoamines, which are neurotransmitters and neuromodulators that
contain one amino group that is connected to an aromatic ring by a two-carbon chain
(-CH2-CH2-). MAOs belong to the protein family of flavin-containing amine
oxidoreductases (See e.g., variants in WO2010008828A2).
A “nitrilase” or nitrile aminohydrolase (EC 3.5.5.1) is an enzyme that
catalyzes the hydrolysis of nitriles to carboxylic acids and ammonia, without the
formation of “free” amide intermediates (See e.g., variants in WO2011011630A2).
An “imine reductase” is an enzyme that catalyzes the reduction of an imine
functional group containing a carbon-nitrogen double bond, breaking the double bond
by causing an electron to be donated to the nitrogen atom.
An “enone reductase” is an enzyme that catalyzes the reduction of an enone
functional group, which includes a conjugated system of an alkene and a ketone,
breaking the keto- or alkene double bond (See e.g., variants disclosed in
WO2010075574A2).
An “acylase” is an enzyme that catalyzes the hydrolytic cleavage of acyl
amide or acyl ester bonds (See e.g., variants of penicillin G acylase in
WO2010054319A2).
A “halohydrin dehalogenase” (“HHDH”) is an enzyme involved in the
degradation of vicinal halohydrins. In Agrobacterium radiobacter AD1, for instance,
it catalyzes the dehalogenation of halohydrins to produce the corresponding epoxides
(See e.g., variants disclosed in WO2010080635A2).
The term “sequence” is used herein to refer to the order and identity of any
biological sequences including but not limited to a whole genome, whole
chromosome, chromosome segment, collection of gene sequences for interacting
genes, gene, nucleic acid sequence, protein, peptide, polypeptide, polysaccharide, etc.
In some contexts, a “sequence” refers to the order and identity of amino acid residues
in a protein (i.e., a protein sequence or protein character string) or to the order and
identity of nucleotides in a nucleic acid (i.e., a nucleic acid sequence or nucleic acid
character string). A sequence may be represented by a character string. A “nucleic
acid sequence” refers to the order and identity of the nucleotides comprising a nucleic
acid. A “protein sequence” refers to the order and identity of the amino acids
comprising a protein or peptide.
“Codon” refers to a specific sequence of three consecutive nucleotides that is
part of the genetic code and that specifies a particular amino acid in a protein or starts
or stops protein synthesis.
The term “gene” is used broadly to refer to any segment of DNA or other
nucleic acid associated with a biological function. Thus, genes include coding
sequences and optionally, the regulatory sequences required for their expression.
Genes also optionally include non-expressed nucleic acid segments that, for example,
form recognition sequences for other proteins. Genes can be obtained from a variety
of sources, including cloning from a source of interest or synthesizing from known or
predicted sequence information, and may include sequences designed to have desired
parameters.
A “moiety” is a part of a molecule that may include either whole functional
groups or parts of functional groups as substructures, while functional groups are
groups of atoms or bonds within molecules that are responsible for the characteristic
chemical reactions of those molecules.
“Screening” refers to the process in which one or more properties of one or
more bio-molecules are determined. For example, typical screening processes include
those in which one or more properties of one or more members of one or more
libraries are determined. Screening can be performed computationally using
computational models of biomolecules and of the virtual environment of the biomolecules.
In some embodiments, virtual protein screening systems are provided for selected
enzymes of desired activity and selectivity.
An “expression system” is a system for expressing a protein or peptide
encoded by a gene or other nucleic acid.
“Directed evolution,” “guided evolution,” or “artificial evolution” refers to in
silico, in vitro, or in vivo processes of artificially changing one or more biomolecule
sequences (or a character string representing that sequence) by artificial selection,
mutation, recombination, or other manipulation. In some embodiments, directed
evolution occurs in a reproductive population in which (1) there are varieties of
individuals, (2) some varieties have heritable genetic information, and (3) some
varieties differ in fitness. Reproductive success is determined by the outcome of
selection for a predetermined property such as a beneficial property. The
reproductive population can be, e.g., a physical population in an in vitro process or a
virtual population in a computer system in an in silico process.
Directed evolution methods can be readily applied to polynucleotides to
generate variant libraries that can be expressed, screened, and assayed. Mutagenesis
and directed evolution methods are well known in the art (See e.g., US Patent Nos.
,605,793, 721, 6,132,970, 6,420,175, 6,277,638, 6,365,408, 6,602,986,
7,288,375, 6,287,861, 053, 6,576,467, 6,444,468, 5,811238, 6,117,679,
6,165,793, 6,180,406, 6,291,242, 017, 6,395,547, 6,506,602, 6,519,065,
6,506,603, 6,413,774, 6,573,098, 030, 356, 497, 7,868,138,
,834,252, 5,928,905, 6,489,146, 6,096,548, 6,387,702, 6,391,552, 6,358,742,
6,482,647, 6,335,160, 6,653,072, 6,355,484, 6,03,344, 6,319,713, 6,613,514,
6,455,253, 6,579,678, 6,586,182, 6,406,855, 6,946,296, 7,534,564, 7,776,598,
5,837,458, 6,391,640, 6,309,883, 7,105,297, 7,795,030, 6,326,204, 6,251,674,
6,716,631, 6,528,311, 6,287,862, 6,335,198, 6,352,859, 6,379,964, 7,148,054,
7,629,170, 7,620,500, 6,365,377, 740, 6,406,910, 6,413,745, 6,436,675,
664, 7,430,477, 7,873,499, 7,702,464, 7,783,428, 7,747,391, 7,747,393,
7,751,986, 6,376,246, 6,426,224, 6,423,542, 6,479,652, 6,319,714, 453,
6,368,861, 7,421,347, 7,058,515, 312, 7,620,502, 7,853,410, 7,957,912,
7,904,249, and all related non-US counterparts; Ling et al., Anal. Biochem.,
254(2):157-78 [1997]; Dale et al., Meth. Mol. Biol., 57:369-74 [1996]; Smith, Ann.
Rev. Genet., 19:423-462 [1985]; Botstein et al., Science, 229:1193-1201 [1985];
Carter, Biochem. J., 7 [1986]; Kramer et al., Cell, 38:879-887 [1984]; Wells et
al., Gene, 34:315-323 [1985]; Minshull et al., Curr. Op. Chem. Biol., 3:284-290
; Christians et al., Nat. Biotechnol., 17:259-264 [1999]; Crameri et al., ,
391:288-291 [1998]; Crameri, et al., Nat. Biotechnol., 15:436-438 [1997]; Zhang et
al., Proc. Nat. Acad. Sci. U.S.A., 94:4504-4509 [1997]; Crameri et al., Nat.
Biotechnol., 14:315-319 [1996]; Stemmer, Nature, 9-391 [1994]; Stemmer,
Proc. Nat. Acad. Sci. USA, 91:10747-10751 [1994]; WO 95/22625; WO 8;
WO 97/35966; WO 98/27230; WO 00/42651; WO 01/75767; and ,
all of which are incorporated herein by reference).
In certain embodiments, directed evolution methods generate protein variant
libraries by recombining genes encoding variants developed from a parent protein, as
well as by recombining genes encoding variants in a parent protein variant library.
The methods may employ oligonucleotides containing sequences or subsequences
encoding at least one protein of a parental variant library. Some of the
oligonucleotides of the parental variant library may be closely related, differing only
in the choice of codons for alternate amino acids selected to be varied by
recombination with other variants. The method may be performed for one or multiple
cycles until desired results are achieved. If multiple cycles are used, each typically
involves a screening step to identify those variants that have acceptable or improved
performance and are candidates for use in at least one subsequent recombination
cycle. In some embodiments, the screening step involves a virtual protein screening
system for determining the catalytic activity and selectivity of enzymes for desired
substrates.
In some embodiments, directed evolution methods generate protein variants by
site-directed mutagenesis at defined residues. These defined residues are typically
identified by structural analysis of binding sites, quantum chemistry analysis,
sequence homology analysis, sequence-activity models, etc. Some embodiments
employ saturation mutagenesis, in which one tries to generate all possible (or as close
to all as possible) mutations at a specific site, or narrow region of a gene.
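As an illustration of saturation mutagenesis carried out in silico, the following minimal sketch enumerates every single-residue substitution at one position of a protein character string. The function name `saturation_variants`, the example sequence, and the restriction to the 20 standard amino acids are assumptions made for illustration only; they are not taken from the disclosure.

```python
# Minimal sketch (assumed names): enumerate all substitutions at one site.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def saturation_variants(parent: str, position: int) -> list[str]:
    """Return every single-site variant of `parent` at 0-based `position`."""
    original = parent[position]
    return [
        parent[:position] + residue + parent[position + 1:]
        for residue in AMINO_ACIDS
        if residue != original  # skip the parent residue itself
    ]


# Hypothetical five-residue parent; position 2 (threonine) is saturated.
variants = saturation_variants("MKTAY", 2)
print(len(variants))  # 19
```

Saturating a narrow region rather than a single site would repeat the same enumeration over each position in the region, at the cost of a library that grows multiplicatively with the number of positions.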
“Shuffling” and “gene shuffling” are types of directed evolution methods that
recombine a collection of fragments of the parental polynucleotides through a series
of chain extension cycles. In certain embodiments, one or more of the chain
extension cycles is self-priming; i.e., performed without the addition of primers other
than the fragments themselves. Each cycle involves annealing single stranded
fragments through hybridization, subsequent elongation of annealed fragments
through chain extension, and denaturing. Over the course of shuffling, a growing
nucleic acid strand is typically exposed to multiple different annealing partners in a
process sometimes referred to as “template switching,” which involves switching one
nucleic acid domain from one nucleic acid with a second domain from a second
nucleic acid (i.e., the first and second nucleic acids serve as templates in the shuffling
procedure).
Template switching frequently produces chimeric sequences, which result
from the introduction of crossovers between fragments of different origins. The
crossovers are created through template switched recombinations during the multiple
cycles of annealing, extension, and denaturing. Thus, shuffling typically leads to
production of variant polynucleotide sequences. In some embodiments, the variant
sequences comprise a “library” of variants (i.e., a group comprising multiple
variants). In some embodiments of these libraries, the variants contain sequence
segments from two or more parent polynucleotides.
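The effect of template switching can be mimicked computationally. The sketch below is a simplified in silico stand-in, not the laboratory procedure: it assumes the parent sequences are already aligned to equal length, and it models each crossover as a switch to the next parent at a chosen index. The names `chimera` and `shuffled_library` and all parameter choices are hypothetical.

```python
import random


def chimera(parents: list[str], crossovers: list[int], start: int = 0) -> str:
    """Copy from one parent, switching to the next parent at each crossover index."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned to the same length")
    pieces, template, previous = [], start, 0
    for point in sorted(crossovers) + [length]:
        pieces.append(parents[template][previous:point])
        template = (template + 1) % len(parents)  # the template switch
        previous = point
    return "".join(pieces)


def shuffled_library(parents, size, max_crossovers=3, seed=0):
    """Draw `size` random chimeras; duplicates collapse, so fewer may be returned."""
    rng = random.Random(seed)
    length = len(parents[0])
    library = set()
    for _ in range(size):
        count = rng.randint(1, max_crossovers)
        points = rng.sample(range(1, length), count)
        library.add(chimera(parents, points, start=rng.randrange(len(parents))))
    return sorted(library)


print(chimera(["AAAAAA", "BBBBBB"], [2, 4]))  # AABBAA
```

Real shuffling depends on hybridization between homologous fragments, which this sketch does not model; crossover positions here are chosen uniformly at random.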
When two or more parental polynucleotides are employed, the individual
parental polynucleotides are sufficiently homologous that fragments from different
parents hybridize under the annealing conditions employed in the shuffling cycles. In
some embodiments, the shuffling permits recombination of parent polynucleotides
having relatively limited/low homology levels. Often, the individual parent
polynucleotides have distinct and/or unique domains and/or other sequence
characteristics of interest. When using parent polynucleotides having distinct
sequence characteristics, shuffling can produce highly diverse variant
polynucleotides.
Various shuffling techniques are known in the art (See e.g., US Patent Nos.
6,917,882, 7,776,598, 8,029,988, 7,024,312, and 7,795,030, all of which are
incorporated herein by reference in their entireties).
Some directed evolution techniques employ “Gene Splicing by Overlap
Extension” or “gene SOEing,” which is a PCR-based method of recombining DNA
sequences without reliance on restriction sites and of directly generating mutated
DNA fragments in vitro. In some implementations of the technique, initial PCRs
generate overlapping gene segments that are used as template DNA for a second PCR
to create a full-length product. Internal PCR primers generate overlapping,
complementary 3’ ends on intermediate segments and introduce nucleotide
substitutions, insertions or deletions for gene splicing. Overlapping strands of these
intermediate segments hybridize at the 3’ region in the second PCR and are extended to
generate the full-length product. In various applications, the full-length product is
amplified by flanking primers that can include restriction enzyme sites for inserting
the product into an expression vector for cloning purposes (See e.g., Horton, et al.,
Biotechniques, 8(5): 528-35 [1990]).
“Mutagenesis” is the process of introducing a
mutation into a standard or reference sequence such as a parent nucleic acid or parent
polypeptide.
Site-directed mutagenesis is one example of a useful technique for introducing
mutations, although any suitable method finds use. Thus, alternatively or in addition,
the mutants may be provided by gene synthesis, saturating random mutagenesis, semi-
synthetic combinatorial libraries of residues, recursive sequence recombination
(“RSR”) (See e.g., US Patent Application Publ. No. 2006/0223143, incorporated by
reference herein in its entirety), gene shuffling, error-prone PCR, and/or any other
suitable method.
One example of a suitable saturation mutagenesis procedure is described in
US Patent Application Publ. No. 2010/0093560, which is incorporated herein by
reference in its entirety.
A “fragment” is any portion of a sequence of nucleotides or amino acids.
Fragments may be produced using any suitable method known in the art, including
but not limited to cleaving a polypeptide or polynucleotide sequence. In some
embodiments, fragments are produced by using nucleases that cleave polynucleotides.
In some additional embodiments, fragments are generated using chemical and/or
biological synthesis techniques. In some embodiments, fragments comprise
subsequences of at least one parental sequence, generated using partial chain
elongation of complementary nucleic acid(s). In some embodiments involving in
silico techniques, virtual fragments are generated computationally to mimic the results
of fragments generated by chemical and/or biological techniques. In some
embodiments, polypeptide fragments exhibit the activity of the full-length
polypeptide, while in some other embodiments, the polypeptide fragments do not
have the activity exhibited by the full-length polypeptide.
“Parental polypeptide,” “parental polynucleotide,” “parent nucleic acid,” and
“parent” are generally used to refer to the wild-type polypeptide, wild-type
polynucleotide, or a variant used as a starting point in a diversity generation procedure
such as directed evolution. In some embodiments, the parent itself is produced via
shuffling or other diversity generation procedure(s). In some embodiments, mutants
used in directed evolution are directly related to a parent polypeptide. In some
embodiments, the parent polypeptide is stable when exposed to extremes of
temperature, pH and/or solvent conditions and can serve as the basis for generating
variants for shuffling. In some embodiments, the parental polypeptide is not stable to
extremes of temperature, pH and/or solvent conditions, and the parental polypeptide is
evolved to make robust variants.
A “parent nucleic acid” encodes a parental polypeptide.
A “library” or “population” refers to a collection of at least two different
molecules, character strings, and/or models, such as nucleic acid sequences (e.g.,
genes, oligonucleotides, etc.) or expression products (e.g., enzymes or other proteins)
therefrom. A library or population generally includes a number of different
molecules. For example, a library or population typically includes at least about 10
different molecules. Large libraries typically include at least about 100 different
molecules, more typically at least about 1000 different molecules. For some
applications, the library includes at least about 10000 or more different molecules.
However, it is not intended that the present invention be limited to a specific number
of different molecules. In certain embodiments, the library contains a number of
variant or chimeric nucleic acids or proteins generated by a directed evolution
procedure.
Two nucleic acids are “recombined” when sequences from each of the two
nucleic acids are combined to produce progeny nucleic acid(s). Two sequences are
“directly” recombined when both of the nucleic acids are substrates for
recombination.
“Selection” refers to the process in which one or more bio-molecules are
identified as having one or more properties of interest. Thus, for example, one can
screen a library to determine one or more properties of one or more library members.
If one or more of the library members is/are identified as possessing a property of
interest, it is selected. Selection can include the isolation of a library member, but this
is not necessary. Further, selection and screening can be, and often are, simultaneous.
Some embodiments disclosed herein provide systems and methods for screening and
selecting enzymes of desirable activity and/or selectivity.
The term “sequence-activity model” refers to any mathematical model that
describes the relationship between activities, characteristics, or properties of biological
molecules on the one hand, and various biological sequences on the other hand.
A “reference sequence” is a sequence from which variation of sequence is
effected. In some cases, a “reference sequence” is used to define the variations. Such a
sequence may be one predicted by a model to have the highest value (or one of the
highest values) of the desired activity. In another case, the reference sequence may be
that of a member of an original protein variant library. In certain embodiments, a
reference sequence is the sequence of a parent protein or nucleic acid.
“Next-generation sequencing” and “high-throughput sequencing” are
sequencing techniques that parallelize the sequencing process, producing thousands or
millions of sequences at once. Examples of suitable next-generation sequencing
methods include, but are not limited to, single molecule real-time sequencing (e.g.,
Pacific Biosciences, Menlo Park, California), ion semiconductor sequencing (e.g., Ion
Torrent, South San Francisco, California), pyrosequencing (e.g., 454, Branford,
Connecticut), sequencing by ligation (e.g., SOLiD sequencing of Life Technologies,
Carlsbad, California), sequencing by synthesis and reversible terminator (e.g.,
Illumina, San Diego, California), nucleic acid imaging technologies such as
transmission electron microscopy, and the like.
A “genetic algorithm” is a process that mimics evolutionary processes.
Genetic algorithms (GAs) are used in a wide variety of fields to solve problems which
are not fully characterized or too complex to allow full characterization, but for which
some analytical evaluation is available. That is, GAs are used to solve problems that
can be evaluated by some quantifiable measure for the relative value of a solution (or
at least the relative value of one potential solution in comparison to another). In the
context of the present disclosure, a genetic algorithm is a process for selecting or
manipulating character strings in a computer, typically where the character string
corresponds to one or more biological molecules (e.g., nucleic acids, proteins, or the
like) or data used to train a model such as a sequence activity model.
In a typical implementation, a genetic algorithm provides and evaluates a
population of character strings in a first generation. A “fitness function” evaluates the
members of the population and ranks them based on one or more criteria such as high
activity. High ranking character strings are selected for promotion to a second
generation and/or mating to produce “children character strings” for the second
generation. The population in the second generation is similarly evaluated by the
fitness function, and high ranking members are promoted and/or mated as with the
first generation. The genetic algorithm continues in this manner for subsequent
generations until a “convergence criterion” is met, at which point the algorithm
concludes with one or more high ranking individuals.
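The generational loop just described can be sketched in a few lines. This is a generic illustration under stated assumptions rather than the specific algorithm of any embodiment: the function name `evolve`, the single-point crossover, the per-residue mutation rate, and the use of either a fitness threshold or a fixed generation count as the convergence criterion are all hypothetical choices, and the fitness function is supplied by the caller.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed: one-letter amino acid codes


def evolve(fitness, length, pop_size=30, generations=50, keep=10,
           mutation_rate=0.05, target_fitness=None, seed=0):
    """Return the final population of character strings, ranked best first."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Convergence criterion (assumed): stop once the best member is good enough.
        if target_fitness is not None and fitness(ranked[0]) >= target_fitness:
            break
        elite = ranked[:keep]  # high ranking strings are promoted unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mother, father = rng.sample(elite, 2)  # mating of two elite members
            cut = rng.randrange(1, length)         # single-point crossover
            child = mother[:cut] + father[cut:]
            child = "".join(
                rng.choice(ALPHABET) if rng.random() < mutation_rate else residue
                for residue in child
            )
            children.append(child)
        population = elite + children
    return sorted(population, key=fitness, reverse=True)


# Toy fitness (purely illustrative): count tryptophan residues.
ranked = evolve(lambda s: s.count("W"), length=8, target_fitness=8)
print(ranked[0])
```

In a screening context the toy fitness function would be replaced by a score derived from a sequence-activity model or from a virtual screening result.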
The term “genetic operation” (or “GO”) refers to biological and/or
computational genetic operations, wherein all changes in any population of any type
of character strings (and thus in any physical properties of physical objects encoded
by such strings) can be described as a result of random and/or predetermined
application of a finite set of logical algebraic functions. Examples of GO include but
are not limited to multiplication, crossover, recombination, mutation, ligation,
fragmentation, etc.
II. VIRTUAL PROTEIN SCREENING
In some embodiments, a virtual protein screening system is configured to
perform various operations associated with computationally identifying biomolecule
variants that are likely to have a desirable activity such as efficiently and selectively
catalyzing a reaction at a defined temperature. The virtual protein screening system
may take as inputs representations of one or more ligands that are intended
to interact with the variants. The system may take as other inputs representations of
the biomolecule variants, or at least the active sites of these variants. The
representations may contain three-dimensional positions of atoms and/or moieties of
the ligands and/or variants. Homology models are examples of the representations of
the biomolecule variants. The virtual protein screening system may apply docking
information and activity constraints to assess the functioning of the variants.
In certain embodiments, a virtual protein screening system applies one or more
constraints to distinguish active and inactive poses. Such poses may be generated by a
docker as described above or by another tool. A ligand pose is evaluated in its
environment to determine whether one or more features of the ligand are positioned in
the environment so as to result in a catalytic transformation or other defined activity.
The environment in question is typically an active site of an enzyme or other
biomolecule.
If one assumes that a substrate or other ligand binds to an active site of the
biomolecule, the question to be asked is whether it binds in an “active” way. A typical
docking program can tell one whether or not a ligand will bind to the active site, but
does not tell one whether it binds in an “active” way.
In certain embodiments, activity is determined by considering one or more
poses generated by a docker or other tool. Each pose is evaluated to determine
whether it meets constraints associated with an activity of interest (e.g., a “desired
activity”). An active pose is one in which the ligand is likely to undergo a catalytic
transformation or perform some defined role such as covalently binding with the
binding site.
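The pose-by-pose evaluation described above can be pictured as a filter. In the minimal sketch below, a pose is represented as a mapping from atom labels to coordinates and each constraint as a function returning true or false; a pose is treated as active only if it satisfies every constraint. This representation, the names `max_distance` and `active_poses`, the atom labels, and the coordinates are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]
Pose = Dict[str, Point]              # atom label -> (x, y, z) in angstroms
Constraint = Callable[[Pose], bool]  # returns True when the pose satisfies it


def max_distance(atom_a: str, atom_b: str, limit: float) -> Constraint:
    """Build a constraint requiring two labelled atoms to lie within `limit`."""
    return lambda pose: math.dist(pose[atom_a], pose[atom_b]) <= limit


def active_poses(poses: List[Pose], constraints: List[Constraint]) -> List[Pose]:
    """Keep only the poses that satisfy every constraint."""
    return [pose for pose in poses if all(check(pose) for check in constraints)]


# Hypothetical coordinates: a substrate carbonyl oxygen and a tyrosine hydroxyl oxygen.
poses = [
    {"substrate_O": (0.0, 0.0, 0.0), "tyr_OH": (2.0, 0.0, 0.0)},
    {"substrate_O": (0.0, 0.0, 0.0), "tyr_OH": (6.0, 0.0, 0.0)},
]
constraints = [max_distance("substrate_O", "tyr_OH", 3.0)]
print(len(active_poses(poses, constraints)))  # 1
```

Both poses might be reported as bound by a docking program; only the first would be counted as active under this hypothetical constraint.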
When considering catalytic turnover of a substrate as the activity, the virtual
protein screening system may be configured to identify poses known to be associated
with a particular reaction. In some embodiments, this involves considering a reaction
intermediate or transition state rather than the substrate itself. In addition to turnover,
poses may be evaluated for other types of activity such as stereoselective synthesis of
enantiomers, binding to a receptor of a target biomolecule identified as important for
drug discovery, regioselective conversion of products, etc. In some cases, the activity
is irreversible or reversible covalent binding such as targeted covalent inhibition
(TCI).
Constraints may be determined directly, manually, automatically, empirically,
and/or based on previously known information. In one approach, a researcher
evaluates the active site and a native substrate for a wild-type protein. This is because
the wild-type protein is known to have been evolved by nature for its native substrate
and hence has an optimal catalytic constant (kcat). In some cases, crystal structures of
the wild-type protein and native substrate or an intermediate complex have been solved. The
constraint can then be set up based on structural analysis. This is referred to as a
“direct approach” for determining the constraint. In cases where such crystal
structures are not available, the evaluation may be conducted with a docking program
for example. Using the program, the researcher identifies constraints associated with a
catalytic transformation of the native substrate in the wild-type protein. This is
referred to as a manual or empirical approach for determining constraints. In another
approach, constraints are determined using quantum mechanics calculations. For
example, a researcher can optimize the substrate or intermediate or transition state in
the presence of functional groups of the catalytic residues (e.g., Tyr) and/or cofactors
(e.g., NADPH), using quantum mechanics and set the constraint to resemble those
states. This approach is sometimes referred to as an automatic or ab initio approach.
An example of a commercial tool using this approach is Gaussian available from
www|.|Gaussian.com.
Constraints may take various forms. In certain embodiments, some or all of these
constraints are geometric constraints that specify the relative position(s) of one or more
atoms in a ligand pose in a three-dimensional space. In some embodiments, the space
may be defined with respect to the positions of atoms in an active site.
A “geometric constraint” is a constraint that evaluates the geometry of two or
more participant moieties or other chemical elements. In certain embodiments, one of
the participants is a moiety or other chemical species on the ligand. In some
embodiments, another of the participants is a moiety or other chemical feature of an
active site of a biomolecule. The moiety or other chemical feature of the active site
may be associated with residues on the biomolecule active site (e.g., an amino acid
residue side-chain), a feature on a cofactor or other compound that is typically
associated with the active site and/or catalysis, and the like. As an example, in the
reduction of ketones by a ketoreductase protein, the carbonyl group of the substrate
may be one participant in a geometric constraint and a tyrosine moiety of an enzyme
active site may be a second participant in the geometric constraint.
In general, geometric constraints are made with respect to a ligand on the one
hand and one or more features of the binding environment on the other hand. In some
embodiments, the environment may include residue positions of the peptide backbone
(or side-chains) and/or cofactors or other non-backbone materials that normally reside
in an active site.
The geometry of the participants in the geometric constraint may be defined in
terms of distance between moieties, angles between moieties, torsional relation
between moieties, etc. Sometimes, a constraint includes multiple basic geometric
constraints used to characterize activity. For example, a constraint on the position of
a substrate may be defined by distances between two or more pairs of atoms. An
example is shown in Figure 1. In the case of a torsional relation, the constraint may
be appropriate when a substrate and a feature of the active site environment are
viewed as essentially parallel plates sharing a common axis of rotation. The relative
angular position of these plates around the axis defines the torsional constraint.
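The three basic geometric measures named above (distance, angle, and torsion) can each be computed from atomic coordinates. The following is a minimal sketch using only standard vector arithmetic; the function names are hypothetical, coordinates are assumed to be in angstroms, and the torsion follows the common convention of a signed dihedral angle about the axis joining the two central points.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def distance(p, q):
    """Distance between two points."""
    return norm(sub(p, q))


def angle(p, q, r):
    """Angle in degrees at vertex q formed by the points p-q-r."""
    u, v = sub(p, q), sub(r, q)
    cosine = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))  # clamp rounding error


def torsion(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 axis."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)  # normals of the two planes
    return math.degrees(math.atan2(norm(b1) * dot(b0, n2), dot(n1, n2)))


print(distance((0, 0, 0), (3, 4, 0)))  # 5.0
```

For the torsional constraint, `p1` and `p2` would lie on the common axis of rotation, with `p0` on one entity (e.g., the substrate) and `p3` on the other.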
Figure 1 depicts an example of a workflow that may be employed to identify
geometric constraints for identifying active poses. The depicted workflow assumes
that the wild type enzyme is a ketone reductase and the native substrate is
acetophenone. As depicted in the top left corner of Figure 1, the native reaction
converts acetophenone to a corresponding alcohol by stereoselective catalysis. The
reaction introduces a chiral center at the acetyl carbon of the ketone substrate. The
wild-type ketone reductase controls the conversion so that only the R enantiomer is
produced. The reaction is accomplished in the presence of NADPH as a cofactor. The
reaction is depicted schematically in the top left corner of Figure 1.
In the top right corner of Figure 1, the mechanism of catalysis and selectivity
is depicted. This mechanism is considered when defining geometric constraints used
to distinguish active from inactive poses. As part of the process, a researcher or
automated system determines the orientation of the acetophenone substrate with
respect to its catalytic environment in the wild-type ketone reductase. In general, the
relevant environment includes the surrounding residues, cofactors, etc. present when
the catalytic transformation takes place.
In the depicted example, the relevant features of the active site environment in
the wild-type ketone reductase are the positions of atoms in (1) a tyrosine residue in
the backbone of the wild-type enzyme and (2) the cofactor, NADPH. Other relevant
environmental features of the substrate in the active poses are sub-pockets within the
active site. These are not shown in Figure 1. One of the sub-pockets accommodates
the phenyl group of the acetophenone substrate and another accommodates the methyl
group of the acetophenone. Together these sub-pockets hold the substrate in an
orientation that dictates the stereospecificity of the reaction. In some embodiments,
the above information is gathered based on structural analysis of the crystal structure
of the wild-type ketone reductase and native acetophenone substrate complex. Hence,
the geometric constraints can be directly determined.
The catalytic mechanism of ketoreductase is depicted by a sequence of arrows
shown in the depicted arrangement (top right corner of Figure 1). Specifically, the
NADPH donates electrons through a hydride ion that couples with the carbonyl
carbon of the acetophenone. Concurrently, an electron pair from the carbonyl oxygen
of the acetophenone is donated to the proton of the tyrosine residue, and an electron
pair from the hydroxyl oxygen of the tyrosine is donated to the proton of the ribose
moiety of NADP(H), hence completing the substrate’s conversion to the
corresponding alcohol. As noted, the reaction proceeds while the substrate’s phenyl
group is held in one larger sub-pocket, its methyl group is held in a smaller
sub-pocket, and its ketone group is held in close proximity to the tyrosine hydroxyl
group.
As further shown in Figure 1, the wild-type ketone reductase is evolved to a
variant ketone reductase that specifically catalyzes the conversion of a different
substrate, called a “desired substrate,” herein. As depicted in the middle of Figure 1, the
desired reaction is a conversion of methyl tert-butyl ketone to the S enantiomer of the
corresponding alcohol (1-tert-butyl ethyl alcohol). The reaction is presumed to be
catalyzed in an active site of a variant enzyme optimized for the conversion and with
the cofactor NADPH.
To ensure that the reaction unfolds with the desired stereospecificity, one or
more constraints should be determined. Note that the native substrate is converted by
the wild-type ketone reductase to the R enantiomer and the desired substrate is to be
converted by the variant to the S enantiomer. Therefore, one may consider that the
tert-butyl group of the desired substrate should be positioned in the sub-pocket that
normally accommodates the methyl group of the native acetophenone substrate and
the methyl group of the desired substrate should be positioned in the sub-pocket that
accommodates the phenyl group of the native substrate.
With this in mind, a set of positional constraints may be defined as depicted in
the lower left corner of Figure 1. As shown therein, various constraints are defined
with respect to the three-dimensional position of the native substrate as it sits in the
WT enzyme active site in the crystal structure, in order to obtain maximum turnover
(kcat). In other words, the orientation of the key functional group of the native
substrate, including carbonyl carbon and carbonyl oxygen that dictate catalytic
turnover and either of the two carbons next to the carbonyl carbon that dictate
stereoselectivity, as determined with respect to the diagram in the top right corner of
Figure 1 is translated into X, Y, Z coordinates. Since homology models of all the
variants were built using the WT structure as template, the X, Y, Z coordinates are
transferable to the variants. With this frame of reference, the positions of the key
functional group (C1(C2)C=O) of the desired substrate can be compared to the
positions of the corresponding 4 atoms of the native substrate as they are predicted to
sit in an optimal orientation toward the catalytic tyrosine residue and NADPH
cofactor. It is noteworthy that the residues for catalysis (e.g., tyrosine) and residues
for cofactor (NADPH) binding are conserved in all the variants and only subtle
conformational or positional changes are expected for this tyrosine and NADPH in all
the variants. With this in mind, the positional constraints depicted in the bottom left
corner of Figure 1 specify a range of positions of the desired substrate’s carbonyl
carbon atom, carbonyl oxygen atom, and central tert-butyl atom with respect to
corresponding positions of the native substrate’s carbonyl carbon atom, carbonyl
oxygen atom, and methyl carbon atom. The range of positional differences between
the desired substrate’s atoms and the native substrate’s corresponding atoms is
depicted by the distances d1, d2, and d3. As an example, each of these distances may
be required to be 1 angstrom or more or less in order for a pose of the desired
substrate to be deemed an active pose. The constraint values are usually set to be a
range that allows certain flexibility reflecting subtle conformational changes of the
catalytic tyrosine and cofactor in a variant. In some implementations, the criteria for
these distances are refined by machine learning algorithms.
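The positional comparison against the native substrate can be sketched as follows. The reference coordinates, the atom labels, and the function names are hypothetical placeholders; the sketch assumes that the variant's homology model shares the wild-type frame of reference, as stated above, and that each of d1, d2, and d3 is checked against a tolerance of 1 angstrom taken as an example value.

```python
import math

# Hypothetical reference positions (angstroms) of the native substrate's key atoms
# in the wild-type frame: carbonyl carbon, carbonyl oxygen, and the neighboring
# carbon whose placement governs stereoselectivity.
NATIVE = {
    "carbonyl_C": (0.0, 0.0, 0.0),
    "carbonyl_O": (1.2, 0.0, 0.0),
    "stereo_C": (-0.8, 1.3, 0.0),
}


def positional_deviations(pose, reference=NATIVE):
    """Return d1, d2, d3: distance of each key atom from its reference position."""
    return {atom: math.dist(pose[atom], reference[atom]) for atom in reference}


def is_active(pose, reference=NATIVE, tolerance=1.0):
    """A pose is deemed active when every deviation is within the tolerance."""
    deviations = positional_deviations(pose, reference)
    return all(value <= tolerance for value in deviations.values())


# Two hypothetical docked poses of the desired substrate in a variant.
near_native = {"carbonyl_C": (0.2, 0.0, 0.0), "carbonyl_O": (1.3, 0.1, 0.0),
               "stereo_C": (-0.7, 1.0, 0.2)}
flipped = {"carbonyl_C": (0.2, 0.0, 0.0), "carbonyl_O": (1.3, 0.1, 0.0),
           "stereo_C": (-0.8, -1.3, 0.0)}
print(is_active(near_native), is_active(flipped))  # True False
```

In the second pose the carbonyl group is well placed for turnover, but the stereo-determining carbon sits on the wrong side, so the pose fails the third distance check.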
In the examples above, the positions of the three relevant atoms of the desired
substrate approximate those of the native substrate. The ketoreductase variants
docked with the desired substrate in poses satisfying the above positional constraints
are considered to be catalytically active and S selective.
In general, the virtual protein screening system may apply geometric
constraints of any of various types. In some implementations, it uses the absolute
distance between participants. For example, the distance between an oxygen atom in
the carbonyl group of a substrate and an atom of a tyrosine group of an active site
may be specified as a constraint (e.g., the distance between these atoms must be 2 Å ±
0.5 Å). In another example, the angle between one line defined by the axis between
the carbon and oxygen atoms in a carbonyl group and another line along an axis of a
phenyl group in an active site is 120° ± 20°.
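A constraint of the form "target value plus or minus a tolerance" can be represented as a small data structure. The sketch below uses the two example windows given in the text; the class name `RangeConstraint`, its fields, and the measured values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    """A measured quantity must fall within target +/- tolerance."""
    name: str
    target: float
    tolerance: float

    def is_satisfied(self, measured: float) -> bool:
        return abs(measured - self.target) <= self.tolerance


# The two example windows from the text: 2 +/- 0.5 angstroms and 120 +/- 20 degrees.
oxygen_distance = RangeConstraint("carbonyl O to tyrosine atom (angstroms)", 2.0, 0.5)
axis_angle = RangeConstraint("C=O axis to phenyl axis (degrees)", 120.0, 20.0)

# Hypothetical measurements from one pose.
print(oxygen_distance.is_satisfied(2.3), axis_angle.is_satisfied(95.0))  # True False
```

Keeping the target and tolerance as data rather than hard-coding them makes it straightforward to adjust the windows later, for example after laboratory screening results become available.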
The bottom right of Figure 1 depicts examples of types of geometric
constraints, each defined between one or more atoms of the desired substrate and one
or more atoms of the enzyme or a cofactor (or other entity) within a binding pocket. A
ce constraint is defined as the distance between an atom on the substrate and an
atom on an active site residue, a cofactor, etc. In angle constraint is defined for a pose
by the r relation between two or more axes defined on the substrate and its
environment. The axes may be covalent bonds, lines between atoms of the substrate
and a moiety in the binding pocket, etc. For example, an angle may be defined
between one axis defined between two atoms on the substrate and another axis
defined as the separation between an atom on a residue and an atom on the substrate.
In some other embodiments, one axis is defined between two atoms on a residue side
chain and another axis is defined by separation between an atom on the substrate and
an atom on the residue. An additional type of geometric constraint is depicted in the
bottom right comer of Figure 1. This type of constraint is referred to as a “torsional
constraint” and assumes that two distinct entities in the binding pocket (one of them
typically being all or part of the substrate) share a common axis of rotation. The
torsional constraint may be defined by a range of angular positions of one of the
entities with respect to the other around the common axis of rotation.
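A torsional constraint of this kind can be sketched as a dihedral-angle check about the shared axis. The sketch below assumes four reference points (one on each entity plus the two axis endpoints); the function names `torsion_angle` and `in_torsion_range` are hypothetical, and the sign convention is one common choice rather than one stated in the source.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 axis, in [-180, 180]."""
    b0, b1, b2 = _sub(p0, p1), _sub(p2, p1), _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    axis = tuple(x / length for x in b1)
    # Project both arms onto the plane perpendicular to the rotation axis.
    v = _sub(b0, tuple(_dot(b0, axis) * x for x in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * x for x in axis))
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))

def in_torsion_range(angle, low, high):
    """True if angle lies on the arc running from low up to high (degrees).

    The modulo arithmetic lets the allowed range wrap through +/-180.
    """
    return (angle - low) % 360.0 <= (high - low) % 360.0
```

For example, `in_torsion_range(torsion_angle(a, b, c, d), 150.0, -150.0)` accepts poses within 30 degrees of the anti arrangement.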
In general, the geometric constraint may be applied with respect to some
preset geometric position or orientation of a substrate moiety within a binding pocket.
Such position or orientation may be specified by, for example, a representative
position of an active moiety in a native substrate in a binding pocket. As an example,
the carbon and oxygen atoms of the carbonyl group of the substrate under
consideration must be within 1 Å of the locations of the carbon and oxygen atoms of a
carbonyl group in a native substrate in the binding pocket. See the positional
constraint shown in the lower left corner of Figure 1. Note that the positional
constraints in the lower left corner of Figure 1 exist between the desired substrate and
the native substrate. However, the positional constraints can be translated into
relations between the desired substrate and enzyme variants, which correspond to the
geometric constraints in the lower middle and right corner of Figure 1.
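As a minimal sketch of the positional constraint just described, the check below compares named atoms of a pose against reference positions taken from a native substrate. The atom labels, the coordinates, and the function name `satisfies_positional_constraint` are hypothetical; only the 1 Å limit comes from the example in the text.

```python
import math

def satisfies_positional_constraint(pose_atoms, reference_atoms, max_deviation=1.0):
    """True if every reference atom has its pose counterpart within max_deviation.

    pose_atoms and reference_atoms map atom labels to (x, y, z) in angstroms.
    """
    return all(
        math.dist(pose_atoms[label], ref_xyz) <= max_deviation
        for label, ref_xyz in reference_atoms.items()
    )

# Hypothetical carbonyl positions of the native substrate in the binding pocket.
native_carbonyl = {"C": (10.0, 4.0, 7.0), "O": (11.2, 4.0, 7.0)}
# Hypothetical carbonyl positions in one docked pose of the desired substrate.
docked_pose = {"C": (10.3, 4.4, 7.1), "O": (11.6, 4.5, 7.2)}

pose_ok = satisfies_positional_constraint(docked_pose, native_carbonyl)
```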
In addition to determining the geometric constraints directly, manually, or
automatically using computer systems, the constraints can also be refined by
screening results. For example, if one or more variants are identified as being
active while others are identified as being inactive for the desired reaction
through laboratory screening, their poses can be further evaluated and the constraints
can be trained on those results.
While the example depicted in Figure 1 uses a relatively small and simple
molecule (a tert-butyl ketone) as a desired substrate, much larger and more
complex substrates are often evaluated in a directed evolution effort.
Figure 2 presents a workflow for analyzing the potential activity of candidate
biomolecules in some implementations. While many different activities may be
considered, the one that will be emphasized in this embodiment is catalytic
transformation of the substrate. The transformation may be enantioselective or
regioselective. In such case, the variants are enzymes. In the description of this
Figure, when the term “substrate” is used, the concept extends to related ligands such
as reaction intermediates or transition states that are important in a rate determining
step in the catalytic transformation of the substrate to a reaction product.
As shown in Figure 2, the process begins by identifying constraints for
distinguishing active from inactive poses of the substrate. See block 201. In some
cases, the constraints are identified by docking. In such processes, a researcher takes
into consideration the interaction of the substrate or reaction intermediate or transition
state with the enzyme active site. In the process, she identifies constraints that result
in the desired activity (e.g., stereospecific catalytic transformation of the substrate). The
researcher may do this with the aid of structure analysis, a docking program and/or
quantum mechanics calculations that present a representation of an enzyme and
associated substrate, intermediate, or transition state. Docking done with a docker is
sometimes referred to as an “empirical” docking approach and optimization done with
a quantum mechanics tool is sometimes referred to as an “ab initio” approach. In
some embodiments, the docking is performed with a wild type enzyme and the native
substrate, intermediate, or transition state. See block 201. As explained above, some
constraints are geometrical constraints representing the relative positions of moieties
in the desired substrate and moieties in the native substrate or an associated cofactor
as shown in the lower left corner of Figure 1. In some implementations, constraints
can be defined as relations between desired substrates and enzyme variants, such as
the geometric constraints shown in the lower middle and right corner of Figure 1.
In some cases, constraints for active poses can be identified by techniques
other than docking a native substrate in a wild type enzyme. For instance, it is
possible to identify moieties relevant for a catalytic reaction and define relations
between the identified moieties using quantum mechanics and molecular dynamics
tools.
Returning to the process shown in Figure 2, the virtual protein screening
system creates or receives structural models for each of multiple variant biomolecules
that are to be considered for activity. See block 203. As explained, the structural
models are three-dimensional computationally-produced representations of the active
sites or other aspects of the enzyme variants. These models may be saved for later
use in a database or other data repository. In some cases, at least one of the models is
created for use in the work flow. In some cases, at least one of the models was
previously created, in which case the process simply receives such models.
Multiple models, each for a different biomolecule sequence are used in the
process shown in Figure 2. This should be contrasted with conventional work flows
utilizing docking programs. Conventional work flows focus on a single target or
sequence. In some cases, a conventional work flow considers multiple instances of a
receptor, but these are based on the same sequence. Each of the instances has
different three-dimensional coordinates generated from NMR or molecular dynamics
simulations.
The structural models used in the Figure 2 process may vary from one another
by the insertion, deletion, or replacement in the models of one or more amino acid
residues at positions associated with the active site or with some other position in the
enzyme’s sequence. Structural models may be created by various techniques. In one
embodiment, they are created by homology modeling.
With the activity constraints and structural models in place, the virtual protein
screening system iterates over the variants that have been selected for consideration.
Control of the iteration is illustrated by a block 205, which indicates that the next
variant enzyme under consideration is selected for analysis. This operation and the
remaining operations of Figure 2 may be implemented by software or digital logic.
For the variant enzyme currently under consideration, the virtual protein
screening system first attempts to dock the desired substrate to the active site of the
variant. See block 207. This process may correspond to a conventional docking
procedure. Therefore, a docker may be employed to determine whether or not the
substrate is capable of docking with the active site in the variant. This decision is
represented in a block 209. Note that the desired substrate is sometimes different
from the native substrate, which may have been used to generate the constraints.
If the virtual protein screening system determines that docking is unlikely to
be successful, process control is directed to a block 220, where the system determines
whether there are any further variants to consider. If there are no further variants to
consider, the process is completed with an optional operation 223, as indicated. If, on
the other hand, one or more variants remain to be considered, process control is
directed back to process step 205 where the next variant for consideration is selected.
This variant is then evaluated for its ability to dock the substrate under consideration
as described above with reference to blocks 207 and 209.
If it turns out that the variant under consideration can successfully dock with
the substrate, process control is directed to a portion of the algorithm where multiple
poses are considered and each evaluated for activity. As described below, this analysis
is depicted by blocks 211, 213, 215, and 217.
As shown, the process iterates over multiple available poses. In various
embodiments, a docker helps select the poses. As explained, dockers may generate
numerous poses of a substrate in an active site. They may also rank poses based on one
or more criteria such as docking score, energetic considerations, etc. Total energy
and/or interaction energy may be considered, as described elsewhere. Regardless of
how poses are generated and/or ranked, the work flow may be configured to consider
a specified number of poses. The number of poses to be considered can be set
arbitrarily. In one embodiment, at least about the top 10 poses are considered. In
another embodiment, at least about 20 poses are considered, or at least about 50
poses, or at least about 100 poses. However, it is not intended that the present
invention be limited to a specific number of poses.
As depicted at block 211, the process selects the next pose for analysis. The
currently selected pose is then evaluated against the constraints identified in block
201, to determine whether the pose is an active pose. As explained, such constraints
may be geometric constraints that determine whether one or more moieties of the
substrate are located within the active site, such that the substrate is likely to undergo
a desired catalytic transformation.
If the evaluation conducted at block 213 indicates that the current pose is not
an active pose, the virtual protein screening system then determines whether there are
any further poses to consider for the current variant under consideration. See block
215. Assuming that there are more poses to consider, process control is directed back
to block 211, where the next pose is considered.
Assuming that the virtual protein screening system determines at block 213
that the pose under consideration is active, it notes this pose for later consideration.
See block 217. In some embodiments, the virtual protein screening system may keep a
running tally of the number of active poses for the variant currently under
consideration.
After appropriately noting that the current pose is active, process control is
directed to block 215, where the virtual protein screening system determines whether
there are any further poses to consider. After repeating the consideration of all
available poses for the variant under consideration, the virtual protein screening
system determines that there are no further poses to consider and process control is
directed to a block 218, which characterizes the likely activity of the current variant.
Characterization can be made by various techniques, including but not limited to the
number of active poses and associated docking scores for the variant under
consideration and other considerations as described herein. After the operation of
block 218 is complete, process control is directed to decision operation 220, which
determines whether there are any further variants to consider. If there are additional
variants to consider, process control is directed to block 205, where the workflow
continues as described above.
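The nested iteration over variants and poses (blocks 205 through 220) can be sketched as below. This is an illustrative outline under stated assumptions, not the implementation of the described system: `dock` and `is_active` are hypothetical callables supplied by the caller, a failed docking is represented by an empty pose list, and `max_poses` stands in for the configurable number of poses to consider.

```python
from dataclasses import dataclass, field

@dataclass
class VariantResult:
    name: str
    docked: bool
    active_poses: list = field(default_factory=list)

def screen_variants(variants, dock, is_active, max_poses=10):
    """Tally active poses per variant, following the Figure 2 control flow."""
    results = []
    for variant in variants:                      # block 205: next variant
        poses = dock(variant)                     # block 207: attempt docking
        if not poses:                             # block 209: docking unlikely
            results.append(VariantResult(variant, docked=False))
            continue                              # block 220: more variants?
        result = VariantResult(variant, docked=True)
        for pose in poses[:max_poses]:            # blocks 211/215: next pose
            if is_active(pose):                   # block 213: check constraints
                result.active_poses.append(pose)  # block 217: note active pose
        results.append(result)                    # block 218: characterize
    return results
```

Keeping the active poses themselves, rather than only a count, leaves them available for later scoring.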
After considering all variants in the workflow, the virtual protein screening
system may rank them based on one or more criteria, such as the number of active
poses the variants have, one or more docking scores of the active poses, and/or one or
more binding energies of the active poses. See block 223. Only the poses identified
as active poses (block 217) need to be evaluated in performing the ranking of block
223. In this way, the operations in the work flow serve to filter inactive poses from
active poses and save computational effort associated with ranking the variants.
While not shown in Figure 2, variants may be selected for further investigation based
on their rankings.
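One possible ranking for block 223 is sketched below. The ordering rule (more active poses first, then the most negative docking score among the active poses) is an assumption chosen for illustration; the text permits other criteria, such as binding energies. The function name `rank_variants` and the sample scores are hypothetical.

```python
def rank_variants(active_scores):
    """Rank variants from most to least promising.

    active_scores maps a variant name to the docking scores of its active
    poses only (more negative is taken to be more favorable).
    """
    def sort_key(item):
        name, scores = item
        best = min(scores) if scores else float("inf")
        # Negate the count so that variants with more active poses sort first.
        return (-len(scores), best, name)

    return [name for name, _ in sorted(active_scores.items(), key=sort_key)]

ranking = rank_variants({
    "A": [-30.0, -25.0],
    "B": [-40.0],
    "C": [],
    "D": [-35.0, -20.0],
})
```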
In certain embodiments, a protocol to calculate binding energies is executed to
evaluate the energetics of each active pose of a variant. In some implementations, the
protocol may consider van der Waals force, electrostatic interaction, and solvation
energy. Solvation is typically not considered in calculations performed by dockers.
Various solvation models are available for calculating binding energies, including, but
not limited to, distance-dependent dielectrics, Generalized Born with pairwise
summation (GenBorn), Generalized Born with Implicit Membrane (GBIM),
Generalized Born with Molecular Volume integration (GBMV), Generalized Born
with a simple switching (GBSW), and the Poisson-Boltzmann equation with non-
polar surface area (PBSA). Protocols for calculating binding energies are different or
separate from docker programs. They generally produce results that are more accurate
than docking scores, due in part to the inclusion of solvation effects in their
calculations. In various implementations, binding energies are calculated only for
poses that are deemed to be active.
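Of the solvation treatments listed, the distance-dependent dielectric is the simplest to illustrate. The sketch below contrasts a plain Coulomb term with one whose dielectric grows linearly with separation. It is not a full binding energy protocol, and it does not represent the Generalized Born or Poisson-Boltzmann models. The constant, the default `scale` value, and the function names are assumptions made for illustration.

```python
# Approximate Coulomb conversion constant, kcal*angstrom/(mol*e^2).
COULOMB_KCAL = 332.06

def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy (kcal/mol) with a constant dielectric."""
    return COULOMB_KCAL * q1 * q2 / (dielectric * r)

def coulomb_distance_dielectric(r, q1, q2, scale=4.0):
    """Coulomb energy with a dielectric eps(r) = scale * r.

    Screening increases with distance, which crudely mimics solvent
    between well-separated charges.
    """
    return COULOMB_KCAL * q1 * q2 / (scale * r * r)

def electrostatic_energy(pairs, scale=4.0):
    """Sum the screened term over (r, q1, q2) ligand-receptor atom pairs."""
    return sum(coulomb_distance_dielectric(r, q1, q2, scale) for r, q1, q2 in pairs)
```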
A. Generation of Models of Multiple Biomolecules Each Containing an
Active Site
A computer system may provide three-dimensional models for a plurality of
protein variants. The three-dimensional models are computational representations of
some or all of the protein variants’ full-length sequences. Typically, at a minimum,
the computational representations cover at least the protein variants’ active sites.
In some cases, the three-dimensional models are homology models prepared
using an appropriately designed computer system. The three-dimensional models
employ a structural template in which the protein variants vary from one another in
their amino acid sequences. Generally, a structural template is a structure previously
solved by X-ray crystallography or NMR for a sequence that is homologous to the
model sequence. The quality of the homology model is dependent on the sequence
identity and resolution of the structure template. In certain embodiments, the three-
dimensional models may be stored in a database for use as needed for current or
future projects.
Three-dimensional models of the protein variants may be produced by
techniques other than homology modeling. One example is protein threading, which
also requires a structure template. Another example is ab initio or de novo protein
modeling, which does not require a structure template and is based on underlying
physical principles. Examples of ab initio techniques include molecular dynamics
simulations and simulations using the Rosetta software suite.
In some embodiments, the protein variants vary from one another in their
active sites. In some cases, the active sites differ from one another by at least one
mutation in the amino acid sequence of the active site. The mutation(s) may be made
in a wild type protein sequence or some other reference protein sequence. In some
cases, two or more of the protein variants share the same amino acid sequence for the
active site but differ in the amino acid sequence for another region of the protein. In
some cases, two protein variants differ from one another by at least about 2 amino
acids, or at least about 3 amino acids, or at least about 4 amino acids. However, it is
not intended that the present invention be limited to a specific number of amino acid
differences between protein variants.
In certain embodiments, the plurality of variants includes members of a library
generated by one or more rounds of directed evolution. Diversity generation
techniques used in directed evolution include gene shuffling, mutagenesis,
recombination and the like. Examples of directed evolution techniques are described
in US Patent Application Publ. No. 2006/0223143, which is incorporated herein by
reference in its entirety.
In some implemented processes, the plurality of variants includes at least about
ten different variants, or at least about 100 different variants, or at least about one
thousand different variants. However, it is not intended that the present invention be
limited to a specific number of protein variants.
B. Docking a Ligand in Multiple Different Protein Variants
As explained herein, docking is conducted by an appropriately programmed
computer system that uses a computational representation of a ligand and
computational representations of the active sites of the generated plurality of variants.
As an example, a docker may be configured to perform some or all of the
following operations:
1. Generate a set of ligand conformations using high-temperature
molecular dynamics with random seeds. The docker may generate such
conformations without consideration of the ligand’s environment. Hence, the
docker may identify favorable conformations by considering only internal
strain or other considerations specific to the ligand alone. The number of
conformations to be generated can be set arbitrarily. In one embodiment, at
least about 10 conformations are generated. In another embodiment, at least
about 20 conformations are generated, or at least about 50 conformations, or at
least about 100 conformations. However, it is not intended that the present
invention be limited to a specific number of conformations.
2. Generate random orientations of the conformations by translating the
center of the ligand to a specified location within the receptor active site, and
performing a series of random rotations. The number of orientations to refine
can be set arbitrarily. In one embodiment, at least about 10 orientations are
generated. In another embodiment, at least about 20 orientations are generated,
or at least about 50 orientations, or at least about 100 orientations. However,
it is not intended that the present invention be limited to any specific number
of orientations. In certain embodiments, the docker calculates a “softened”
energy to generate further combinations of orientation and conformation. The
docker calculates softened energy using physically unrealistic assumptions
about the permissibility of certain orientations in an active site. For example,
the docker may assume that ligand atoms and active site atoms can occupy
essentially the same space, which is impossible based on Pauli repulsion and
steric considerations. This softened assumption can be implemented by, for
example, employing a relaxed form of the Lennard-Jones potential when
exploring conformation space. By using a softened energy calculation, the
docker allows a more complete exploration of conformations than possible
using physically realistic energy considerations. If the softened energy of a
conformation in a particular orientation is less than a specified threshold, the
conformation-orientation is kept. These low energy conformations are retained
as “poses”. In certain implementations, this process continues until either a
desired number of low-energy poses is found, or a maximum number of bad
poses is found.
3. Subject each retained pose from step 2 to simulated annealing
molecular dynamics to refine the pose. The temperature is increased to a high
value then cooled to the target temperature. The docker may do this to
provide a more physically realistic orientation and/or conformation than is
provided by the softened energy calculation.
4. Perform a final minimization of the ligand in the rigid receptor using
non-softened potential. This provides a more accurate energy value for the
retained poses. However, the minimization may provide only partial information
about the poses’ energies.
5. For each final pose, calculate the total energy (receptor-ligand
interaction energy plus ligand internal energy) and the interaction energy alone.
The calculation may be performed using CHARMm. The poses are sorted by
CHARMm energy and the top scoring (most negative, thus favorable to
binding) poses are retained. In some embodiments, this step (or step 4)
removes poses that are energetically unfavorable.
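The softened energy of step 2 and its stopping rule can be sketched as follows. The soft-core form shown is one illustrative way to relax a Lennard-Jones potential and is not claimed to be the form used by any particular docker; the parameter `alpha` and the function names are hypothetical. The `collect_poses` function follows the rule stated in step 2: stop once a desired number of low-energy poses is found or a maximum number of bad poses is reached.

```python
def lennard_jones(r, well_depth, r_min):
    """Standard 12-6 Lennard-Jones energy; grows without bound as r nears 0."""
    ratio6 = (r_min / r) ** 6
    return well_depth * (ratio6 ** 2 - 2.0 * ratio6)

def soft_lennard_jones(r, well_depth, r_min, alpha):
    """Soft-core variant: r**6 becomes r**6 + alpha, so overlaps stay finite.

    With alpha = 0 this reduces to the standard form above.
    """
    ratio6 = r_min ** 6 / (r ** 6 + alpha)
    return well_depth * (ratio6 ** 2 - 2.0 * ratio6)

def collect_poses(candidates, energy_fn, threshold, n_wanted, max_bad):
    """Keep candidates whose softened energy is below threshold.

    Stops when n_wanted poses are kept or max_bad candidates are rejected.
    """
    kept, bad = [], 0
    for pose in candidates:
        if energy_fn(pose) < threshold:
            kept.append(pose)
            if len(kept) >= n_wanted:
                break
        else:
            bad += 1
            if bad >= max_bad:
                break
    return kept
```

Retained poses would then go on to the annealing and non-softened minimization of steps 3 and 4, which this sketch does not cover.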
The following reference provides an example of a docker’s functioning: Wu et
al., Detailed Analysis of Grid-Based Molecular Docking: A Case Study of CDOCKER
— A CHARMm-Based MD Docking Algorithm, J. Computational Chem., Vol. 24, No.
13, pp 1549-62 (2003), which is incorporated herein by reference in its entirety.
A docker such as the one described here may provide one or more pieces of
information used by the screening system to identify high-performing variants. Such
information includes the identity of variants for which docking with the desired
substrate is unlikely. Such variants need not be evaluated for activity, etc. Other
information provided by the docker includes sets of poses (one set for each variant)
that can be considered for activity. Still other information includes docking scores of
the poses in the sets.
C. Determine Whether Poses of the Docked Ligand are Active
For a protein variant that successfully docks with the ligand, the virtual protein
screening system performs the following operations: (i) consider a plurality of poses of
the computational representation of the ligand in the active site of the protein variant
under consideration, and (ii) determine which, if any, of the plurality of poses is active.
An active pose is one meeting one or more constraints for the ligand to bind
under defined conditions (rather than arbitrary binding conditions). If the ligand is a
substrate and the protein is an enzyme, active binding may be binding that allows the
substrate to undergo a catalyzed chemical transformation, particularly a
stereospecific transformation. In some implementations, the constraints are geometrical
constraints defining a range of relative positions of one or more atoms in the ligand
and one or more atoms in the protein and/or cofactor associated with the protein.
In some cases, constraints are identified from one or more conformations of a
native substrate and/or subsequent intermediate when it undergoes a catalyzed
chemical transformation by a wild type enzyme. In certain embodiments, the
constraints include (i) a distance between a particular moiety on the substrate and/or
subsequent intermediate and a particular residue or residue moiety in the active site,
(ii) a distance between a particular moiety on the substrate and/or subsequent
intermediate and a particular cofactor in the active site, and/or (iii) a distance between
a particular moiety on the substrate and/or subsequent intermediate and a particular
moiety on an ideally positioned native substrate and/or subsequent intermediate in the
active site. In certain embodiments, the constraints can include angles between
chemical bonds, torsion around axes, or strain at chemical bonds.
The plurality of poses of the computational representation of the substrate
and/or subsequent intermediate may be generated with respect to a computational
representation of the protein variant under consideration. The plurality of poses may
be generated by various techniques. General examples of such techniques include
systematic or stochastic torsional searches about rotatable bonds, molecular dynamics
simulations, and genetic algorithms designed to locate low energy conformations. In
one example, the poses are generated using high temperature molecular dynamics,
followed by random rotation, refinement by grid-based simulated annealing, and a
final grid-based or force field minimization to generate a conformation and/or
orientation of the substrate and/or subsequent intermediate in the computational
representation of the active site. Some of these operations are optional, e.g., refinement by
grid-based simulated annealing, and grid-based or force field minimization.
In certain embodiments, the number of poses considered is at least about 10,
or at least about 20, or at least about 50, or at least about 100, or at least about 200, or
at least about 500. However, it is not intended that the present invention be limited to
a specific number of poses considered.
If the project is successful, at least one of the variants is determined to have
one or more poses that are active and energetically favorable. In certain embodiments,
a variant selected for further consideration is one determined to have large numbers of
active conformations in comparison with other variants. In certain embodiments, the
variants are selected by ranking the variants based on the number of active poses they
have, one or more docking scores for the active poses, and/or one or more binding
energies of the active poses. As examples, the types of docking scores that may be
considered include scores based on van der Waals force and/or electrostatic
interaction. As examples, the types of binding energies that may be considered
include van der Waals force, electrostatic interaction, and solvation energy.
A protein variant determined to support one or more active poses may be
selected for further investigation, synthesis, production, etc. In one example, a
selected protein variant is used to seed one or more rounds of directed evolution. As
an example, a round of directed evolution may include (i) preparing a plurality of
oligonucleotides containing or encoding at least a portion of the selected protein
variant, and (ii) performing a round of directed evolution using the plurality of
oligonucleotides. The oligonucleotides may be prepared by any suitable means,
including but not limited to gene synthesis, fragmentation of a nucleic acid encoding
some or all of the selected protein variant, etc. In certain embodiments, the round of
directed evolution includes fragmenting and recombining the plurality of
oligonucleotides. In certain embodiments, the round of directed evolution includes
performing saturation mutagenesis on the plurality of oligonucleotides.
Catalyzed chemical transformations that may be screened using constraints
include, but are not limited to, for example, ketone reduction, transamination,
oxidation, nitrile hydrolysis, imine reduction, enone reduction, acyl hydrolysis, and
halohydrin dehalogenation. Examples of enzyme classes that may provide the
multiple variants evaluated using constraints include, but are not limited to: ketone
reductases, transaminases, cytochrome P450s, Baeyer-Villiger monooxygenases,
monoamine oxidases, nitrilases, imine reductases, enone reductases, acylases, and
halohydrin dehalogenases. In the context of rational ligand design, optimization of
targeted covalent inhibition (TCI) is a type of activity that may be screened for using
constraints. An example of a TCI application is described in Singh et al., The
resurgence of covalent drugs, Nature Reviews Drug Discovery, vol. 10, pp. 307-317
(2011), which is incorporated herein by reference in its entirety. In some
implementations, the TCI activity is found by identifying a nucleophilic amino acid
(e.g., cysteine) in a protein. The process described herein can help identify inhibitors
that satisfy constraints defining an ideal orientation of an electrophilic moiety
important for the inhibition (a putative inhibitor) that can react with the biomolecule
to be inhibited.
III. USING THE VIRTUAL PROTEIN SCREENING SYSTEM TO DESIGN
ENZYMES
Some embodiments provide processes for virtually modeling and screening
enzymes using a virtual protein screening system, thereby identifying enzymes having
desired properties, e.g., catalytic activity and selectivity. In some embodiments, a
family of actual enzymes can be virtually modeled and screened as an initial variant
library. Some embodiments can iteratively use one or more enzymes selected by
virtual screening from the initial library as parent polypeptides or reference sequences
to generate a new variant library by in silico, in vitro, or in vivo techniques. In some
embodiments, one or more enzymes ranked highly by the system as described herein
are selected as parent polypeptide(s). The new variant library includes protein
sequences that are different from the sequences of the parent polypeptides, and/or can
be used as precursors to introduce subsequent variation(s).
In some embodiments, the parent polypeptides are modified in a directed
evolution procedure by performing mutagenesis and/or a recombination-based
diversity generation mechanism to generate the new library of protein variants. In
some embodiments, the parent polypeptides are altered by at least one substitution,
insertion, cross-over, deletion, and/or other genetic operation. The directed evolution
may be implemented directly on the polypeptides (e.g., in an in silico process) or
indirectly on the nucleic acids encoding the polypeptides (e.g., in an in vitro process).
The new library may be used to generate new homology models for further screening
and directed evolution.
In some embodiments, the modeling, screening, and evolution of enzymes are
carried out iteratively in silico until one or more enzymes meeting certain criteria are
obtained. For instance, the criteria may be a specified binding energy or score, or an
improvement thereof. Other embodiments may combine in silico and physical (e.g.,
in vitro or in vivo) techniques. For instance, it is possible to start an enzyme design
process using sequences derived by in vitro screening and sequencing. In vitro
sequencing may be performed by next-generation sequencing. Then, the enzyme
design process may use in silico methods for directed evolution, modeling, and further
screening. The process can finally use in vitro and/or in vivo techniques to validate an
enzyme in a biological system. Other combinations and orders of in silico and
physical techniques are suitable for various applications. Indeed, it is not intended
that the present invention be limited to any specific combination and/or order of
methods.
In some embodiments, preparation of polypeptide sequences is performed in
silico. In other embodiments, polypeptides are generated by synthesizing
oligonucleotides or nucleic acid sequences using a nucleic acid synthesizer and
translating the nucleotide sequences to obtain the polypeptides.
As stated above, in some embodiments, the selected enzyme may be modified
by performing one or more recombination-based diversity generation mechanisms to
generate the new library of protein variants. Such recombination mechanisms
include, but are not limited to, e.g., shuffling, template switching, Gene Splicing by
Overlap Extension, error-prone PCR, semi-synthetic combinatorial libraries of
residues, recursive sequence recombination (“RSR”) (see, e.g., US Patent Application
Publ. No. 2006/0223143, incorporated by reference herein in its entirety). In some
embodiments, some of these recombination mechanisms may be implemented in
vitro. In some embodiments, some of these recombination mechanisms may be
implemented computationally in silico to mimic the biological mechanisms.
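As a loose illustration of a computational recombination step, the sketch below applies single-point crossover to pairs of equal-length parent sequences. It is a deliberately simplified stand-in and does not reproduce shuffling, template switching, or any of the other named mechanisms. The function names and the choice of a single crossover point are assumptions.

```python
import random

def single_point_crossover(parent_a, parent_b, point):
    """Child takes residues before `point` from parent_a and the rest from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    return parent_a[:point] + parent_b[point:]

def recombine(parents, n_children, rng):
    """Create n_children by crossing randomly chosen pairs of distinct parents."""
    children = []
    for _ in range(n_children):
        a, b = rng.sample(parents, 2)
        # Choose an interior point so both parents contribute residues.
        point = rng.randrange(1, len(a))
        children.append(single_point_crossover(a, b, point))
    return children

# Hypothetical aligned parent sequences; a seeded generator keeps runs repeatable.
library = recombine(["MKTAYIAK", "MRTGYLAR", "MKSAYLGK"], n_children=4, rng=random.Random(0))
```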
Some embodiments include selecting one or more positions in a protein
sequence and conducting site-directed mutation methods such as saturation
mutagenesis at the one or more positions so selected. In some embodiments, the
positions are selected by evaluating the structure of the active site and/or constraints
related to the catalytic reaction as discussed elsewhere in the document. Combining
virtual screening with sequence-activity modeling finds use in some embodiments. In
these embodiments, the process of directed evolution may select the positions by
evaluating the coefficients of the terms of a sequence-activity model, thereby
identifying one or more residues that contribute to the activity of interest. U.S.
Patent No. 7,783,428 (herein incorporated by reference in its entirety) provides
examples of sequence activity models that can be used to identify amino acids for
mutagenesis.
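An in silico analogue of saturation mutagenesis at selected positions can be sketched as below: each chosen position is replaced in turn by each of the other 19 standard amino acids. The function name, the 1-based position convention, and the mutation labels (wild-type residue, position, new residue) are choices made for this illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def saturation_mutants(sequence, positions):
    """Return {label: mutant_sequence} for every single substitution.

    positions are 1-based; labels look like "K2A" (wild type, position, mutant).
    """
    mutants = {}
    for pos in positions:
        wild_type = sequence[pos - 1]
        for residue in AMINO_ACIDS:
            if residue != wild_type:
                label = f"{wild_type}{pos}{residue}"
                mutants[label] = sequence[:pos - 1] + residue + sequence[pos:]
    return mutants

# Hypothetical short sequence with two positions selected for mutagenesis.
variants = saturation_mutants("MKTAY", positions=[2, 4])
```

Each resulting sequence could then be modeled and screened like any other variant.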
In some embodiments, the method involves selecting one or more members of
the new protein variant library for production. One or more of these variants may
then be synthesized and/or expressed in an expression system. In a specific
embodiment, the method continues in the following manner: (i) providing an
expression system from which a selected member of the new protein variant library
can be expressed; and (ii) expressing the selected member of the new protein variant
library.
Figures 3A-3C are flowcharts showing examples of workflows for designing
biomolecule sequences, which implement various combinations of elements described
elsewhere herein. Figure 3A shows a flowchart for a process 300 that starts by
receiving sequence information of multiple starting sequences from a panel of
biomolecules, such as a panel of enzymes. See block 302. The process then performs
a virtual screening of the currently received sequences using a virtual protein
screening system. See block 304. In some embodiments, the virtual protein screening
system can create three-dimensional homology models of the starting sequences, and
dock one or more substrates with the homology models by considering poses of the
substrates as described above, thereby generating docking scores for the starting
sequences. The virtual protein screening system can also calculate interaction energy
and internal energy of the docking participants (the enzymes and the substrates).
Moreover, the virtual protein screening system can evaluate various constraints of
poses to determine whether the poses are active, i.e., the substrates bind with the
enzyme in a manner that is likely to cause a catalytic conversion of the substrate.
Furthermore, in some embodiments, evaluation of the constraints also provides
inference regarding whether the catalytic reaction is enantioselective
and/or regioselective. In some embodiments, the process selects one or more
sequences based on the binding energy, activity, and selectivity determined by the
virtual screening system. See block 306. The process then evaluates whether it is
necessary to conduct further investigation of the selected sequences in step 308. If so,
the process in this example computationally mutates the selected sequences. The
mutations are based on the various diversity generation mechanisms described above,
such as mutagenesis or recombination. See block 310. The computationally mutated
sequences are then provided for a new round of virtual screening by the virtual protein
screening system. See block 304. The virtual screening and mutation may carry on
for iterations, until no further investigation of sequences is necessary, which may be
determined by preset criteria such as a specific number of iterations and/or a
particular level of desired activity. At that point, the process of designing
biomolecules (e.g., enzymes) is finished at step 312.
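The iteration among blocks 304 to 312 can be illustrated with a minimal Python sketch. It is not the disclosed screening system: `toy_score` is a hypothetical stand-in for a docking-based score, and `mutate`, `design_loop`, and all parameter names and default values are assumptions introduced only for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(seq, target="MKTAYIAKQR"):
    """Hypothetical stand-in for a virtual screening score (higher is better).
    A real system would dock substrates into a homology model instead."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def mutate(seq, rng):
    """Computationally introduce one random point mutation (block 310)."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]

def design_loop(starting, score=toy_score, max_rounds=20, goal=0.9,
                keep=5, offspring=20, seed=0):
    """Screen, select, and mutate until a preset criterion is met."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    rng = random.Random(seed)
    pool = list(starting)                                    # block 302
    for round_no in range(1, max_rounds + 1):
        ranked = sorted(pool, key=score, reverse=True)       # block 304
        selected = ranked[:keep]                             # block 306
        # Block 308: stop on desired activity or on the iteration limit.
        if score(selected[0]) >= goal or round_no == max_rounds:
            return selected, round_no                        # block 312
        # Block 310: keep the selected sequences and add mutated copies.
        pool = selected + [mutate(rng.choice(selected), rng)
                           for _ in range(offspring)]

if __name__ == "__main__":
    best, rounds = design_loop(["AAAAAAAAAA", "MKAAAAAAAA"], seed=1)
    print(rounds, best[0], toy_score(best[0]))
```

Because the selected sequences are carried into the next pool, the best score in this sketch never decreases from one round to the next.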
Figure 3B shows a flowchart for a process 320 for directed evolution of
biomolecules such as enzymes, which process has some similar and some different
elements compared to process 300. Process 320 starts by in vitro synthesis of
multiple starting sequences of biomolecules (e.g., enzymes), which may be necessary
or useful when a pre-existing panel of biomolecules is not available. See block 322.
The synthesized sequences may also be assayed to collect data for the sequences,
which data may be useful for designing biomolecules of desired properties and
cannot be obtained by the virtual screening system. The process then performs a
virtual screening of the synthesized sequences using a virtual protein screening
system, depicted in block 324, which is similar to step 304 in process 300. The
process then selects one or more sequences based on the binding energy, activity, and
selectivity determined by the virtual screening system. See block 326. The process
then evaluates whether it is necessary to perform further directed evolution of the
selected sequences in step 328. If so, the process in this example mutates the selected
sequences in silico or in vitro. The mutations are based on the various diversity
generation mechanisms described above. See block 330. The mutated sequences are
then provided for a new round of virtual screening by the virtual protein screening
system. See block 324. The virtual screening and selection may carry on for
iterations, until no further evolution of sequences is necessary, which may be
determined by preset criteria such as a specific number of iterations and/or a
particular level of desired activity. At which point, the sequences selected by the
virtual screening system are synthesized and expressed to produce actual enzymes.
See block 332. The expressed enzymes can be assayed for activities of interest, which
can be used to validate the results of the virtual screening process. See block 334.
After the assay, the directed evolution process is concluded at step 336.
Figure 3C shows a flowchart for a process 340 for directed evolution of
biomolecules such as enzymes. Process 340 starts by in vitro directed evolution to
derive multiple starting sequences of biomolecules (e.g., enzymes). See block 342.
As in process 320, the derived sequences are assayed to determine whether the
sequences meet certain criteria, such as desired activity or selectivity. Sequences
meeting the criteria are determined as hits for further development. See block 344.
The process then performs a virtual screening of the hits using a virtual protein
screening system, depicted in block 346, which is similar to step 304 in process 300.
In some embodiments, the process also selects one or more sequences based on the
binding energy, activity, and selectivity determined by the virtual screening system as
described above. The process then evaluates whether it is necessary to perform a
further round of directed evolution of the selected sequences in step 348. If so, the
process provides the selected sequences for a further round of in vitro directed
evolution in a new iteration, see block 342. The virtual screening and selection may
carry on for iterations, until no further evolution of sequences is necessary, which
may be determined by preset criteria. At which point, the process of designing
biomolecules (e.g., enzymes) is finished at step 350.
IV. GENERATING A PROTEIN VARIANT LIBRARY
Protein variant libraries comprise groups of multiple proteins having one or
more residues that vary from member to member in a library. These libraries may be
generated using the methods described herein and/or any suitable means known in the
art. In various embodiments, these libraries provide candidate enzymes for the virtual
protein screening system. In some embodiments, the libraries may be provided and
screened in silico in initial rounds, and resulting proteins selected by the virtual
screening system from a later or final round may be sequenced and/or screened in
vitro. Because the initial rounds of screening are performed in silico, the time and
cost for screening can be reduced significantly. The number of proteins included in a
protein variant library can be easily increased in the initial rounds of screening in
some implementations compared to conventional physical screening. It is not
intended that the present disclosure be limited to any particular number of proteins in
the protein libraries used in the methods of the present disclosure. It is further not
intended that the present disclosure be limited to any particular protein variant library
or libraries.
In one example, the protein variant library is generated from one or more
naturally occurring proteins, which may be encoded by a single gene family in some
embodiments, or a panel of enzymes in other embodiments. Other starting points
include, but are not limited to recombinants of known proteins and/or novel synthetic
proteins. From these “seed” or “starting” proteins, the library may be generated by
various techniques. In one case, the library is generated by virtual processes that
reflect biological or chemical techniques, e.g., DNA fragmentation-mediated
recombination as described in Stemmer (1994) Proceedings of the National Academy
of Sciences, USA, 10747-10751 and WO 95/22625 (both of which are incorporated
herein by reference), synthetic oligonucleotide-mediated recombination as described
in Ness et al. (2002) Nature Biotechnology 20:1251-1255 and WO 00/42561 (both of
which are incorporated herein by reference), or nucleic acids encoding part or all of
one or more parent proteins. Combinations of these methods may be used (e.g.,
recombination of DNA fragments and synthetic oligonucleotides) as well as other
recombination-based methods known in the art, for example, WO97/20078 and
WO98/27230, both of which are incorporated herein by reference. Any suitable
methods used to generate protein variant libraries find use in the present disclosure.
Indeed, it is not intended that the present disclosure be limited to any particular
method for producing variant libraries.
In some embodiments, a single “starting” sequence (which may be an
“ancestor” sequence) may be employed for purposes of defining a group of mutations
used in the modeling process. In some embodiments, there is more than one starting
sequence. In some additional embodiments, at least one of the starting sequences is a
wild-type sequence. In certain embodiments, the mutations are (a) identified in the
literature as affecting substrate specificity, selectivity, stability, and/or any other
property of interest and/or (b) computationally predicted to improve protein folding
patterns (e.g., packing the interior residues of a protein), improve ligand binding,
improve subunit interactions, or improve family shuffling methods between multiple
diverse homologs, etc. It is not intended that the present invention be limited to any
specific choice of property/ies of interest or function(s).
In some embodiments, the mutations may be virtually introduced into the
starting sequence and the proteins may be virtually screened for beneficial properties.
Site-directed mutagenesis is one example of a useful technique for introducing
mutations, although any suitable method finds use. Thus, alternatively or in addition,
the mutants may be provided by gene synthesis, saturating random mutagenesis, semi-
synthetic combinatorial libraries of residues, directed evolution, recursive sequence
recombination (“RSR”) (See e.g., US Patent Application Publ. No. 2006/0223143,
incorporated by reference herein in its entirety), gene shuffling, error-prone PCR,
and/or any other suitable method. One example of a suitable saturation mutagenesis
procedure is described in US Patent Application Publ. No. 2010/0093560, which is
incorporated herein by reference in its entirety.
The starting sequence need not be identical to the amino acid sequence of a
wild type protein. However, in some embodiments, the starting sequence is the
sequence of a wild type protein. In some embodiments, the starting sequence includes
mutations not present in the wild-type protein. In some embodiments, the starting
sequence is a consensus sequence derived from a group of proteins having a common
property, e.g., a family of proteins.
In some embodiments, catalyzed chemical transformations that may be
screened using the virtual screening system include, but are not limited to, for
example, ketone reduction, transamination, oxidation, nitrile hydrolysis, imine
reduction, enone reduction, acyl hydrolysis, and halohydrin dehalogenation.
Examples of enzyme classes that may provide the multiple variants evaluated include,
but are not limited to, ketone reductases, transaminases, cytochrome P450s, Baeyer-
Villiger monooxygenases, monoamine oxidases, nitrilases, imine reductases, enone
reductases, acylases, and halohydrin dehalogenases.
A non-limiting representative list of families or classes of enzymes which may
serve as sources of parent sequences includes, but is not limited to, the following:
oxidoreductases (E.C.1); transferases (E.C.2); hydrolases (E.C.3); lyases (E.C.4);
isomerases (E.C.5) and ligases (E.C.6). More specific but non-limiting subgroups of
oxidoreductases include dehydrogenases (e.g., alcohol dehydrogenases (carbonyl
reductases), xylulose reductases, aldehyde reductases, farnesol dehydrogenase, lactate
dehydrogenases, arabinose dehydrogenases, glucose dehydrogenases, fructose
dehydrogenases, xylose reductases and succinate dehydrogenases), oxidases
(e.g., glucose oxidases, hexose oxidases, galactose oxidases and laccases),
monoamine oxidases, lipoxygenases, peroxidases, aldehyde dehydrogenases,
reductases, long-chain acyl-[acyl-carrier-protein] reductases, acyl-CoA
dehydrogenases, ene-reductases, synthases (e.g., glutamate synthases), nitrate
reductases, mono and di-oxygenases, and catalases. More specific but non-limiting
subgroups of transferases include methyl, amidino, and carboxyl transferases,
transketolases, transaldolases, acyltransferases, glycosyltransferases, transaminases,
transglutaminases and polymerases. More specific but non-limiting subgroups of
hydrolases include ester hydrolases, peptidases, glycosylases, amylases, cellulases,
hemicellulases, xylanases, chitinases, glucosidases, glucanases, glucoamylases,
acylases, galactosidases, pullulanases, phytases, lactases, arabinosidases,
nucleosidases, nitrilases, phosphatases, lipases, phospholipases, proteases, ATPases,
and dehalogenases. More specific but non-limiting subgroups of lyases include
decarboxylases, aldolases, hydratases, dehydratases (e.g., carbonic anhydrases),
synthases (e.g., isoprene, pinene and farnesene synthases), pectinases (e.g., pectin
lyases) and halohydrin dehydrogenases. More specific, but non-limiting subgroups of
isomerases include racemases, epimerases, isomerases (e.g., xylose, arabinose, ribose,
glucose, galactose and mannose isomerases), tautomerases, and mutases (e.g., acyl
transferring mutases, phosphomutases, and aminomutases). More specific but non-
limiting subgroups of ligases include ester synthases. Other families or classes of
enzymes which may be used as sources of parent sequences include transaminases,
proteases, kinases, and synthases. This list, while illustrating certain specific aspects
of the possible enzymes of the disclosure, is not considered exhaustive and does not
portray the limitations or circumscribe the scope of the disclosure.
In some cases, the candidate enzymes useful in the methods described herein
are capable of catalyzing an enantioselective reaction such as an enantioselective
reduction reaction, for example. Such enzymes can be used to make intermediates
useful in the synthesis of pharmaceutical compounds for example.
In some embodiments, the candidate enzymes are selected from endoxylanases
(EC 3.2.1.8); β-xylosidases (EC 3.2.1.37); alpha-L-arabinofuranosidases (EC
3.2.1.55); alpha-glucuronidases (EC 3.2.1.139); acetylxylanesterases (EC 3.1.1.72);
feruloyl esterases (EC 3.1.1.73); coumaroyl esterases (EC 3.1.1.73);
alpha-galactosidases (EC 3.2.1.22); beta-galactosidases (EC 3.2.1.23); beta-
mannanases (EC 3.2.1.78); beta-mannosidases (EC 3.2.1.25); endo-
polygalacturonases (EC 3.2.1.15); pectin methyl esterases (EC 3.1.1.11); endo-
galactanases (EC 3.2.1.89); pectin acetyl esterases (EC 3.1.1.6); endo-pectin lyases
(EC 4.2.2.10); pectate lyases (EC 4.2.2.2); alpha rhamnosidases (EC 3.2.1.40);
exopoly-alpha-galacturonosidase (EC 3.2.1.82); alpha-galacturonidase (EC 3.2.1.67);
exopolygalacturonate lyases (EC 4.2.2.9); rhamnogalacturonan endolyases (EC
4.2.2.B3); galacturonan acetylesterases (EC 3.2.1.B11); galacturonan
galacturonohydrolases (EC 3.2.1.B11); endo-arabinanases (EC 3.2.1.99); laccases
(EC 1.10.3.2); manganese-dependent peroxidases (EC 1.10.3.2); amylases (EC
3.2.1.1), glucoamylases (EC 3.2.1.3), proteases, lipases, and lignin peroxidases (EC
1.11.1.14). Any combination of one, two, three, four, five, or more than five enzymes
finds use in the compositions of the present disclosure. It is not intended that the
present invention be limited to any particular number of enzymes and/or enzyme
classes.
It is not intended that the present invention be limited to any particular method
for generating systematically varied sequences, as any suitable method finds use. In
one or more embodiments of the disclosure, a single starting sequence is modified in
various ways to generate the library. In some embodiments, the library is generated
by systematically varying the individual residues of the starting sequence. The set of
systematically varied sequences of a library can be designed a priori using design of
experiment (DOE) methods to define the sequences in the data set. A description of
DOE methods can be found in Diamond, W.J. (2001) Practical Experiment Designs:
for Engineers and Scientists, John Wiley & Sons and in “Practical Experimental
Design for Engineers and Scientists” by William J Drummond (1981) Van Nostrand
Reinhold Co New York, “Statistics for experimenters” George E.P. Box, William G
Hunter and J. Stuart Hunter (1978) John Wiley and Sons, New York, or, e.g., on the
World Wide Web at itl.nist.gov/div898/handbook/. There are several computational
packages available to perform the relevant mathematics, including Statistics Toolbox
(MATLAB®), JMP®, STATISTICA®, and STAT-EASE® DESIGN EXPERT®.
The result is a systematically varied and orthogonally dispersed data set of sequences
that is suitable for screening by the virtual protein screening system disclosed herein.
DOE-based data sets can also be readily generated using either Plackett-Burman or
Fractional Factorial Designs, as known in the art. Diamond, W.J. (2001).
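As a minimal sketch of how a fractional factorial design can define a set of systematically varied sequences, the following Python example builds a two-level half-fraction design and maps each run to a variant. It is an illustration under stated assumptions rather than the method of any cited package: the function names, the parent sequence, and the mutation list are hypothetical, and each factor is assumed to be one position that is either left as the parent residue (-1) or mutated (+1).

```python
from itertools import product

def half_fraction(k):
    """Two-level half-fraction design with 2**(k-1) runs.
    The last factor is set to the product of the first k-1 factors."""
    if k < 3:
        raise ValueError("k must be at least 3 for balanced, orthogonal columns")
    runs = []
    for levels in product((-1, 1), repeat=k - 1):
        last = 1
        for x in levels:
            last *= x
        runs.append(levels + (last,))
    return runs

def design_to_variants(parent, mutations, runs):
    """Map design runs to sequences.
    mutations: list of (zero-based position, replacement residue);
    level +1 applies the mutation, level -1 keeps the parent residue."""
    variants = []
    for run in runs:
        seq = list(parent)
        for level, (pos, residue) in zip(run, mutations):
            if level == 1:
                seq[pos] = residue
        variants.append("".join(seq))
    return variants

if __name__ == "__main__":
    parent = "MKTAYIAK"                                  # hypothetical parent
    mutations = [(1, "R"), (3, "G"), (4, "F"), (6, "V")]  # hypothetical mutations
    for variant in design_to_variants(parent, mutations, half_fraction(4)):
        print(variant)
```

With four candidate mutations, the half fraction uses 8 variants instead of the 16 in the full factorial, while each mutation still appears in exactly half of the variants and every pair of design columns remains orthogonal.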
Because initial rounds of screening can be performed in silico with high
efficiency, some embodiments may use some or all available sequences to provide the
protein variant library when the number of variants is usually too large to be screened
with conventional physical methods. For instance, for a sequence with 15 positions,
each having 20 possible amino acids, there are 300 possible position vs. amino acid
pairs, and 20^15 (about 3.3 × 10^19) different variant sequences. In some implementations, a library
can include hundreds, thousands, tens of thousands, hundreds of thousands, or more
variants from this possible pool depending on the available computing power and
application needs. It is not intended that the present disclosure be limited to any
particular number of variants in the libraries.
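The size of this sequence space follows from simple counting, which the short Python sketch below reproduces. The function name `library_size` is introduced here only for illustration.

```python
def library_size(positions, residues_per_position=20):
    """Return (position/residue pairs, total distinct variant sequences)."""
    pairs = positions * residues_per_position
    variants = residues_per_position ** positions
    return pairs, variants

if __name__ == "__main__":
    pairs, variants = library_size(15)
    print(pairs)               # number of position vs. amino acid pairs
    print(f"{variants:.2e}")   # number of distinct variant sequences
```

The pairs grow linearly with the number of positions while the variant count grows exponentially, which is why only a sampled subset of the pool can be screened, even in silico.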
V. SEQUENCING PROTEIN VARIANTS
In some embodiments, physical protein variants are used to generate
computational models of active sites of the protein variants used in virtual screening
as described above. In some embodiments, protein variants obtained from virtual
screening are physically generated using various methods described above. In some
embodiments, the physically generated protein variants are assayed for their reaction
against one or more ligands of interest. In various embodiments, the sequences of the
physical protein variants are ascertained by protein sequencing methods, some of
which methods are further described below.
Protein sequencing involves determining the amino acid sequence of a protein.
Some protein sequencing techniques also determine the conformation the protein adopts,
and the extent to which it is complexed with any non-peptide molecules. Mass
spectrometry and the Edman degradation reaction may be used to directly determine
the sequence of amino acids of a protein.
The Edman degradation reaction allows the ordered amino acid composition
of a protein to be discovered. In some embodiments, automated Edman sequencers
can be used to determine the sequence of protein variants. Automated Edman
sequencers are able to sequence peptides of increasingly longer sequences, e.g., up to
approximately 50 amino acids long. In some embodiments, a protein sequencing
process implementing Edman degradation involves one or more of the following:
--Break disulfide bridges in the protein with a reducing agent, e.g., 2-mercaptoethanol.
A protecting group such as iodoacetic acid may be used to prevent
bonds from re-forming
--Separate and purify individual chains of the protein complex if there are
more than one
--Determine the amino acid composition of each chain
--Determine the terminal amino acids of each chain
--Break each chain into fragments, e.g., fragments under 50 amino acids long.
--Separate and purify the fragments
--Determine the sequence of each fragment using the Edman degradation
reaction
--Repeat the above steps applying a different pattern of cleavage to provide
additional read(s) of amino acid sequences
--Construct the sequence of the overall protein from amino acid sequence
reads
In some implementations, peptides longer than about 50-70 amino acids need
to be broken up into small fragments to facilitate sequencing by Edman reactions.
Digestion of longer sequences can be performed by endopeptidases such as trypsin or
pepsin, or by chemical reagents such as cyanogen bromide. Different enzymes give
different cleavage patterns, and the overlap between fragments can be used to
construct an overall sequence.
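The way different reagents yield different, overlapping fragment sets can be sketched in Python with simplified cleavage rules. The rules used here are common textbook approximations (trypsin cleaving after K or R unless the next residue is P, cyanogen bromide cleaving after M) and are assumptions of this example, as are the function names and the example peptide.

```python
def cleave(seq, cut_after, blocked_by=""):
    """Split seq after every residue in cut_after, unless the following
    residue is listed in blocked_by. Returns the fragments in order."""
    fragments, start = [], 0
    for i, residue in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in cut_after and (not nxt or nxt not in blocked_by):
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        fragments.append(seq[start:])
    return fragments

def trypsin_digest(seq):
    """Simplified trypsin rule: cut after K or R, but not before P."""
    return cleave(seq, "KR", blocked_by="P")

def cnbr_digest(seq):
    """Simplified cyanogen bromide rule: cut after M."""
    return cleave(seq, "M")

if __name__ == "__main__":
    peptide = "AKRPGMKLLMSR"   # hypothetical example peptide
    print(trypsin_digest(peptide))
    print(cnbr_digest(peptide))
```

Because the two digests cut at different sites, a fragment from one digest spans a junction between fragments of the other, which is the overlap information used to order the reads.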
During the Edman degradation reaction, the peptide to be sequenced is
adsorbed onto a solid surface of a substrate. In some embodiments, one suitable
substrate is glass fiber coated with polybrene, a cationic polymer. The Edman reagent,
phenylisothiocyanate (PITC), is added to the adsorbed peptide, together with a mildly
basic buffer solution of trimethylamine. This reaction solution reacts with the amine
group of the N-terminal amino acid. The terminal amino acid can then be selectively
detached by the addition of anhydrous acid. The derivative then isomerises to give a
substituted phenylthiohydantoin, which can be washed off and identified by
chromatography. Then the cycle can be repeated.
In some embodiments, mass spectrometry can be used to determine an amino
acid sequence by determining the mass-to-charge ratios of fragments of the amino
acid sequence. The mass spectrum including peaks corresponding to multiply charged
fragments can be determined, where the distance between the peaks corresponding to
different isotopes is inversely proportional to the charge on the fragment. The mass
spectrum is analyzed, e.g., by comparison against a database of previously sequenced
proteins to determine the sequences of the fragments. This process is then repeated
with a different digestion enzyme, and the overlaps in the sequences are used to
construct a complete amino acid sequence.
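The inverse relationship between isotope peak spacing and charge can be sketched as follows. The example assumes that adjacent isotope peaks differ by about 1.003355 Da (the 13C/12C mass difference) and that ions are protonated; the function names and peak values are hypothetical.

```python
C13_C12_DELTA = 1.003355   # Da, mass difference between 13C and 12C
PROTON_MASS = 1.007276     # Da

def charge_from_isotope_spacing(mz_peaks):
    """Estimate the charge z from the m/z values of adjacent isotope peaks.
    Spacing between adjacent peaks is approximately C13_C12_DELTA / z."""
    peaks = sorted(mz_peaks)
    if len(peaks) < 2:
        raise ValueError("need at least two isotope peaks")
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    return round(C13_C12_DELTA / mean_spacing)

def neutral_mass(mz, z):
    """Neutral mass of a protonated ion observed at m/z with charge z."""
    return z * (mz - PROTON_MASS)

if __name__ == "__main__":
    peaks = [500.000, 500.502, 501.003]   # hypothetical isotope envelope
    z = charge_from_isotope_spacing(peaks)
    print(z, neutral_mass(peaks[0], z))
```

A spacing near 0.5 m/z therefore indicates a doubly charged fragment, and a spacing near 0.33 m/z a triply charged one.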
Peptides are often easier to prepare and analyze for mass spectrometry than
whole proteins. In some embodiments, electrospray ionization is used for delivering
the peptides to the spectrometer. The protein is digested by an endoprotease, and the
resulting solution is passed through a high-pressure liquid chromatography column.
At the end of this column, the solution is sprayed into the mass spectrometer, the
solution being charged with a positive potential. The charge on solution droplets
causes them to fragment into single ions. The peptides are then fragmented and the
mass-to-charge ratios of the fragments measured.
It is also possible to indirectly determine an amino acid sequence from the
DNA or mRNA sequence encoding the protein. Nucleic acid sequencing methods,
e.g., various next generation sequencing methods, may be used to determine DNA or
RNA sequences. In some implementations, a protein sequence is newly isolated
without knowledge of the nucleotides encoding the protein. In such implementations,
one may first determine a short polypeptide sequence using one of the direct protein
sequencing methods. A complementary marker for the protein’s RNA can be
determined from this short sequence. This can then be used to isolate the mRNA
coding for the protein, which can then be replicated in a polymerase chain reaction to
yield a significant amount of DNA, which can then be sequenced using DNA
sequencing methods. The amino acid sequence of the protein can then be deduced
from the DNA sequence. In the deduction, it is necessary to take into account the
amino acids removed after the mRNA has been translated.
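Deducing an amino acid sequence from a coding DNA sequence can be sketched with the standard genetic code. The helper names below are hypothetical, and `strip_initiator_met` models only one simple case of post-translational removal (the initiator methionine); signal peptides and other processing are not handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(coding_sequence, to_stop=True):
    """Translate a coding DNA (or mRNA) sequence read in frame from its start.
    Unknown codons are reported as 'X'; '*' marks a stop codon."""
    dna = coding_sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)

def strip_initiator_met(protein):
    """Illustrative post-translational step: drop a leading methionine."""
    return protein[1:] if protein.startswith("M") else protein

if __name__ == "__main__":
    encoded = translate("ATGGCCAAATGA")
    print(encoded, strip_initiator_met(encoded))
```

The second helper shows why the deduced sequence and the mature protein sequence can differ even when the DNA read is error free.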
In one or more embodiments, nucleic acid sequence data can be used in
various stages in the process of directed evolution of proteins. In one or more
embodiments, sequence data can be obtained using bulk sequencing methods
including, for example, Sanger sequencing or Maxam-Gilbert sequencing, which are
considered the first generation sequencing methods. Sanger sequencing, which
involves using labeled dideoxy chain terminators, is well known in the art; see, e.g.,
Sanger et al., Proceedings of the National Academy of Sciences of the United States
of America 74, 5463-5467 (1977). Maxam-Gilbert sequencing, which involves
performing multiple partial chemical degradation reactions on fractions of the nucleic
acid sample followed by detection and analysis of the fragments to infer the sequence,
is also well known in the art; see, e.g., Maxam et al., Proceedings of the National
Academy of Sciences of the United States of America 74, 560-564 (1977). Another
bulk sequencing method is sequencing by hybridization, in which the sequence of a
sample is deduced based on its hybridization properties to a plurality of sequences,
e.g., on a microarray or gene chip; see, e.g., Drmanac, et al., Nature Biotechnology 16,
54-58 (1998).
In one or more embodiments, nucleic acid sequence data is obtained using
next-generation sequencing methods. Next-generation sequencing is also referred to
as high-throughput sequencing. The techniques parallelize the sequencing process,
producing thousands or millions of sequences at once. Examples of suitable next-
generation sequencing methods include, but are not limited to, single molecule real-
time sequencing (e.g., Pacific Biosciences of Menlo Park, California), Ion
semiconductor sequencing (e.g., Ion Torrent of South San Francisco, California),
pyrosequencing (e.g., 454 of Branford, Connecticut), sequencing by ligation (e.g.,
SOLiD sequencing owned by Life Technologies of Carlsbad, California), sequencing
by synthesis and reversible terminator (e.g., Illumina of San Diego, California),
nucleic acid imaging technologies such as transmission electron microscopy, and the
like.
In general, next-generation sequencing methods typically use an in vitro
cloning step to amplify individual DNA molecules. Emulsion PCR (emPCR) isolates
individual DNA molecules along with primer-coated beads in aqueous droplets within
an oil phase. PCR produces copies of the DNA molecule, which bind to primers on
the bead, followed by immobilization for later sequencing. emPCR is used in the
methods by Margulies et al. (commercialized by 454 Life Sciences, Branford, CT),
Shendure and Porreca et al. (also known as “polony sequencing”) and SOLiD
sequencing (Applied Biosystems Inc., Foster City, CA). See M. Margulies, et al.
(2005) “Genome sequencing in microfabricated high-density picolitre reactors”
Nature 437: 376-380; J. Shendure, et al. (2005) “Accurate Multiplex Polony
Sequencing of an Evolved Bacterial Genome” Science 309 (5741): 1728-1732. In
vitro clonal amplification can also be carried out by “bridge PCR,” where fragments
are amplified upon primers attached to a solid surface. Braslavsky et al. developed a
single-molecule method (commercialized by Helicos Biosciences Corp., Cambridge,
MA) that omits this amplification step, directly fixing DNA molecules to a surface. I.
Braslavsky, et al. (2003) “Sequence information can be obtained from single DNA
molecules” Proceedings of the National Academy of Sciences of the United States of
America 100: 3960-3964.
DNA molecules that are physically bound to a surface can be sequenced in
parallel. In “sequencing by synthesis,” a complementary strand is built based on the
sequence of a template strand using a DNA polymerase. Like dye-termination
electrophoretic sequencing, reversible terminator methods (commercialized by
Illumina, Inc., San Diego, CA and Helicos Biosciences Corp., Cambridge, MA) use
reversible versions of dye-terminators, adding one nucleotide at a time, and detect
fluorescence at each position in real time, by repeated removal of the blocking group
to allow polymerization of another nucleotide. “Pyrosequencing” also uses DNA
polymerization, adding one nucleotide at a time and detecting and quantifying the
number of nucleotides added to a given location through the light emitted by the
release of attached pyrophosphates (commercialized by 454 Life Sciences, Branford,
CT). See M. Ronaghi, et al. (1996). “Real-time DNA sequencing using detection of
pyrophosphate release” Analytical Biochemistry 242: 84-89.
Specific examples of next-generation sequencing methods are described in
further details below. One or more implementations of the current invention may use
one or more of the following sequencing methods without deviating from the
principles of the invention.
Single molecule real time sequencing (also known as SMRT) is a parallelized
single molecule DNA sequencing by synthesis technology developed by Pacific
Biosciences. Single molecule real time sequencing utilizes the zero-mode waveguide
(ZMW). A single DNA polymerase enzyme is affixed at the bottom of a ZMW with a
single molecule of DNA as a template. The ZMW is a structure that creates an
illuminated observation volume that is small enough to observe only a single
nucleotide of DNA (also known as a base) being incorporated by DNA polymerase.
Each of the four DNA bases is attached to one of four different fluorescent dyes.
When a nucleotide is incorporated by the DNA polymerase, the fluorescent tag is
cleaved off and diffuses out of the observation area of the ZMW where its
fluorescence is no longer observable. A detector detects the fluorescent signal of the
nucleotide incorporation, and the base call is made according to the corresponding
fluorescence of the dye.
Another applicable single molecule sequencing technology is the Helicos True
Single Molecule Sequencing (tSMS) technology (e.g., as described in Harris T.D. et
al., Science 320:106-109 [2008]). In the tSMS technique, a DNA sample is cleaved
into strands of approximately 100 to 200 nucleotides, and a polyA sequence is added
to the 3’ end of each DNA strand. Each strand is labeled by the addition of a
fluorescently labeled adenosine nucleotide. The DNA strands are then hybridized to a
flow cell, which contains millions of oligo-T capture sites that are immobilized to the
flow cell surface. In certain embodiments the templates can be at a density of about
100 million templates/cm². The flow cell is then loaded into an instrument, e.g.,
HeliScope™ sequencer, and a laser illuminates the surface of the flow cell, revealing
the position of each template. A CCD camera can map the position of the templates
on the flow cell surface. The template fluorescent label is then cleaved and washed
away. The sequencing reaction begins by introducing a DNA polymerase and a
fluorescently labeled nucleotide. The oligo-T nucleic acid serves as a primer. The
polymerase incorporates the labeled nucleotides to the primer in a template directed
manner. The polymerase and unincorporated nucleotides are removed. The templates
that have directed incorporation of the fluorescently labeled nucleotide are discerned
by imaging the flow cell surface. After imaging, a cleavage step removes the
fluorescent label, and the process is repeated with other fluorescently labeled
nucleotides until the desired read length is achieved. Sequence information is
collected with each nucleotide addition step. Whole genome sequencing by single
molecule sequencing technologies excludes or typically obviates PCR-based
amplification in the preparation of the sequencing libraries, and the methods allow for
direct measurement of the sample, rather than measurement of copies of that sample.
Ion Semiconductor Sequencing is a method of DNA sequencing based on the
detection of hydrogen ions that are released during the polymerization of DNA. This
is a method of “sequencing by synthesis,” during which a complementary strand is
built based on the sequence of a template strand. A microwell containing a template
DNA strand to be sequenced is flooded with a single species of deoxyribonucleotide
triphosphate (dNTP). If the introduced dNTP is complementary to the leading
template nucleotide, it is incorporated into the growing complementary strand. This
causes the release of a hydrogen ion that triggers an ISFET ion sensor, which
indicates that a reaction has occurred. If homopolymer repeats are present in the
template sequence, multiple dNTP molecules will be incorporated in a single cycle.
This leads to a corresponding number of released hydrogens and a proportionally
higher electronic signal. This technology differs from other sequencing technologies
in that no modified nucleotides or optics are used. Ion semiconductor sequencing
may also be referred to as ion torrent sequencing, pH-mediated sequencing, silicon
sequencing, or semiconductor sequencing.
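The flow-by-flow logic described above, in which one dNTP species is introduced at a time and the signal scales with the number of bases incorporated, can be sketched in Python. This is an idealized, noise-free model: the flow order "TACG", the function names, and the example read are assumptions of this example and do not describe any instrument's actual software.

```python
def flowgram(read, flow_order="TACG"):
    """Simulate ideal signals for the strand being synthesized.
    Each flow adds one dNTP species; the signal equals the number of
    consecutive bases of that species incorporated (0 if none)."""
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    signals, pos, flow = [], 0, 0
    while pos < len(read):
        base = flow_order[flow % len(flow_order)]
        count = 0
        while pos < len(read) and read[pos] == base:
            count += 1      # homopolymer run: several bases in one flow
            pos += 1
        signals.append((base, count))
        flow += 1
    return signals

def call_bases(signals):
    """Convert (base, signal) pairs back to a sequence by rounding signals."""
    return "".join(base * round(signal) for base, signal in signals)

if __name__ == "__main__":
    flows = flowgram("TTAGC")
    print(flows)
    print(call_bases(flows))
```

Rounding the signal to the nearest integer is what makes long homopolymer runs the hardest case in practice, since a small relative error in a large signal can change the called run length.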
In pyrosequencing, the pyrophosphate ion released by the polymerization
reaction is reacted with adenosine 5’ phosphosulfate by ATP sulfurylase to produce
ATP; the ATP then drives the conversion of luciferin to oxyluciferin plus light by
luciferase. As the fluorescence is transient, no separate step to eliminate fluorescence
is necessary in this method. One type of deoxyribonucleotide triphosphate (dNTP) is
added at a time, and sequence information is discerned according to which dNTP
generates significant signal at a reaction site. The commercially available Roche GS
FLX instrument acquires sequence using this method. This technique and
applications thereof are discussed in detail, for example, in Ronaghi et al., Analytical
Biochemistry 242, 84-89 (1996) and Margulies et al., Nature 437, 376-380 (2005)
(corrigendum at Nature 441, 120 (2006)). A commercially available pyrosequencing
technology is 454 sequencing (Roche) (e.g., as described in Margulies, M. et al.
Nature 437:376-380 [2005]).
In ligation sequencing, a ligase enzyme is used to join a partially double-
stranded oligonucleotide with an overhang to the nucleic acid being sequenced, which
has an overhang; in order for ligation to occur, the overhangs must be complementary.
The bases in the overhang of the partially double-stranded oligonucleotide can be
identified according to a fluorophore conjugated to the partially double-stranded
oligonucleotide and/or to a secondary oligonucleotide that hybridizes to another part
of the partially double-stranded oligonucleotide. After acquisition of fluorescence
data, the ligated complex is cleaved upstream of the ligation site, such as by a type IIs
restriction enzyme, for example, BbvI, which cuts at a site a fixed distance from its
recognition site (which was included in the partially double stranded oligonucleotide).
This cleavage reaction exposes a new overhang just upstream of the previous
overhang, and the process is repeated. This technique and applications thereof are
discussed in detail, for example, in Brenner et al., Nature Biotechnology 18, 630-634
(2000). In some embodiments, ligation sequencing is adapted to the methods of the
invention by obtaining a rolling circle amplification product of a circular nucleic acid
molecule, and using the rolling circle amplification product as the template for
ligation sequencing.
A commercially available example of ligation sequencing technology is the
SOLiD™ technology (Applied Biosystems). In SOLiD™ sequencing-by-ligation,
genomic DNA is sheared into fragments, and adaptors are attached to the 5’ and 3’
ends of the fragments to generate a fragment library. Alternatively, internal adaptors
can be introduced by ligating adaptors to the 5’ and 3’ ends of the fragments,
circularizing the fragments, digesting the circularized fragment to generate an internal
adaptor, and attaching adaptors to the 5’ and 3’ ends of the resulting fragments to
generate a mate-paired library. Next, clonal bead populations are prepared in
microreactors containing beads, primers, template, and PCR components. Following
PCR, the templates are denatured and beads are enriched to separate the beads with
extended templates. Templates on the selected beads are subjected to a 3’
modification that permits bonding to a glass slide. The sequence can be determined
by sequential hybridization and ligation of partially random oligonucleotides with a
central determined base (or pair of bases) that is identified by a specific fluorophore.
After a color is recorded, the ligated oligonucleotide is cleaved and removed and the
process is then repeated.
In reversible terminator sequencing, a fluorescent dye-labeled nucleotide
analog that is a reversible chain terminator due to the presence of a blocking group is
incorporated in a single-base extension reaction. The identity of the base is
determined according to the fluorophore; in other words, each base is paired with a
different fluorophore. After fluorescence/sequence data is acquired, the fluorophore
and the blocking group are chemically removed, and the cycle is repeated to acquire
the next base of sequence information. The Illumina GA instrument operates by this
method. This technique and applications thereof are discussed in detail, for example,
in Ruparel et al., Proceedings of the National Academy of Sciences of the United
States of America 102, 5932-5937 (2005), and Harris et al., Science 320, 106-109
(2008).
A commercially available example of a reversible terminator sequencing
method is Illumina’s sequencing-by-synthesis and reversible terminator-based
sequencing (e.g. as described in Bentley et al., Nature 6:53-59 [2009]). Illumina’s
sequencing technology relies on the attachment of fragmented genomic DNA to a
planar, optically transparent surface on which oligonucleotide anchors are bound.
Template DNA is end-repaired to generate 5'-phosphorylated blunt ends, and the
polymerase activity of Klenow fragment is used to add a single A base to the 3' end of
the blunt phosphorylated DNA fragments. This addition prepares the DNA fragments
for ligation to oligonucleotide adapters, which have an overhang of a single T base at
their 3' end to increase ligation efficiency. The adapter oligonucleotides are
complementary to the flow-cell anchors. Under limiting-dilution conditions, adapter-
modified, single-stranded template DNA is added to the flow cell and immobilized by
hybridization to the anchors. Attached DNA fragments are extended and bridge
amplified to create an ultra-high density sequencing flow cell with hundreds of
millions of clusters, each containing ~1,000 copies of the same template. The
templates are sequenced using a robust four-color DNA sequencing-by-synthesis
technology that employs reversible terminators with removable fluorescent dyes.
High-sensitivity fluorescence detection is achieved using laser excitation and total
internal reflection optics. Short sequence reads of about 20-40 bp, e.g., 36 bp, are
aligned against a repeat-masked reference genome, and unique mappings of the short
sequence reads to the reference genome are identified using specially developed data
analysis pipeline software. Non-repeat-masked reference genomes can also be used.
Whether repeat-masked or non-repeat-masked reference genomes are used, only reads
that map uniquely to the reference genome are counted. After completion of the first
read, the templates can be regenerated in situ to enable a second read from the
opposite end of the fragments. Thus, either single-end or paired end sequencing of
the DNA fragments can be used. Partial sequencing of DNA fragments present in the
sample is performed, and sequence tags comprising reads of predetermined length, e.g.,
36 bp, are mapped to a known reference genome and counted.
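The rule that only uniquely mapping reads are counted can be illustrated with a minimal sketch. It assumes exact string matching on a single strand as a stand-in for a real aligner, which is a simplification and not the pipeline software referred to above; the function name and the toy sequences are hypothetical.

```python
from collections import Counter


def count_unique_tags(reads, reference):
    """Count reads per reference position, keeping only reads that
    match the reference at exactly one position (exact match only)."""
    counts = Counter()
    for read in reads:
        first = reference.find(read)
        if first == -1:
            continue  # read does not map at all
        if reference.find(read, first + 1) != -1:
            continue  # read maps to more than one position
        counts[first] += 1
    return counts


reference = "ACGTACGTTTGACCA"
reads = ["TTGACC", "ACGT", "GGGG", "TTGACC"]
print(count_unique_tags(reads, reference))
```

Here "ACGT" is discarded because it occurs twice in the toy reference, and "GGGG" is discarded because it does not occur at all.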
In nanopore sequencing, a single stranded nucleic acid molecule is threaded
through a pore, e.g., using an electrophoretic driving force, and sequence is deduced
by analyzing data obtained as the single stranded nucleic acid molecule passes
through the pore. The data can be ion current data, wherein each base alters the
current, e.g., by partially blocking the current passing through the pore to a different,
distinguishable degree.
In another illustrative, but non-limiting, embodiment, the methods described
herein comprise obtaining sequence information using transmission electron
microscopy (TEM). The method comprises utilizing single atom resolution
transmission electron microscope imaging of high-molecular weight (150 kb or
greater) DNA selectively labeled with heavy atom markers and arranging these
molecules on ultra-thin films in ultra-dense (3 nm strand-to-strand) parallel arrays with
consistent base-to-base spacing. The electron microscope is used to image the
molecules on the films to determine the position of the heavy atom markers and to
extract base sequence information from the DNA. The method is further described in
PCT patent publication .
In another illustrative, but non-limiting, embodiment, the methods described
herein comprise obtaining sequence information using third-generation sequencing.
In third-generation sequencing, a slide with an aluminum coating with many small
(~50 nm) holes is used as a zero mode waveguide (see, e.g., Levene et al., Science 299,
682-686 (2003)). The aluminum surface is protected from attachment of DNA
polymerase by polyphosphonate chemistry, e.g., polyvinylphosphonate chemistry (see,
e.g., Korlach et al., Proceedings of the National Academy of Sciences of the United
States of America 105, 1176-1181 (2008)). This results in preferential attachment of
the DNA polymerase molecules to the exposed silica in the holes of the aluminum
coating. This setup allows evanescent wave phenomena to be used to reduce
fluorescence background, allowing the use of higher concentrations of fluorescently
labeled dNTPs. The fluorophore is attached to the terminal phosphate of the dNTPs,
such that fluorescence is released upon incorporation of the dNTP, but the
fluorophore does not remain attached to the newly incorporated nucleotide, meaning
that the complex is immediately ready for another round of incorporation. By this
method, incorporation of dNTPs into individual primer-template complexes
present in the holes of the aluminum coating can be detected. See, e.g., Eid et al.,
Science 323, 133-138 (2009).
VI. ASSAYING GENE AND PROTEIN VARIANTS
In some embodiments, polynucleotides generated in connection with methods
of the present invention are optionally cloned into cells to express protein variants for
activity screening (or used in in vitro transcription reactions to make products which
are screened). Furthermore, the nucleic acids encoding protein variants can be
enriched, sequenced, expressed, amplified in vitro or treated in any other common
recombinant method.
General texts that describe molecular biological techniques useful herein,
including cloning, mutagenesis, library construction, screening assays, cell culture
and the like include Berger and Kimmel, Guide to Molecular Cloning Techniques,
Methods in Enzymology volume 152 Academic Press, Inc., San Diego, CA (Berger);
Sambrook et al., Molecular Cloning - A Laboratory Manual (2nd Ed.), Vol. 1-3, Cold
Spring Harbor Laboratory, Cold Spring Harbor, New York, 1989 (Sambrook) and
Current Protocols in Molecular Biology, F.M. Ausubel et al., eds., Current Protocols,
a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons,
Inc., New York (supplemented through 2000) (Ausubel). Methods of transducing
cells, including plant and animal cells, with nucleic acids are generally available, as
are methods of expressing proteins encoded by such nucleic acids. In addition to
Berger, Ausubel and Sambrook, useful general references for culture of animal cells
include Freshney (Culture of Animal Cells, a Manual of Basic Technique, third
edition Wiley-Liss, New York) and the references cited therein, Humason
(Animal Tissue Techniques, fourth edition W.H. Freeman and Company (1979)) and
Ricciardelli, et al., In Vitro Cell Dev. Biol. 25:1016-1024 (1989). References for
plant cell cloning, culture and regeneration include Payne et al. (1992) Plant Cell and
Tissue Culture in Liquid Systems John Wiley & Sons, Inc. New York, NY (Payne);
and Gamborg and Phillips (eds) (1995) Plant Cell, Tissue and Organ Culture;
Fundamental Methods Springer Lab Manual, Springer-Verlag (Berlin Heidelberg
New York) (Gamborg). A variety of cell culture media are described in Atlas and
Parks (eds) The Handbook of Microbiological Media (1993) CRC Press, Boca Raton,
FL (Atlas). Additional information for plant cell culture is found in available
commercial literature such as the Life Science Research Cell Culture Catalogue
(1998) from Sigma-Aldrich, Inc (St Louis, MO) (Sigma-LSRCCC) and, e.g., the Plant
Culture Catalogue and supplement (1997) also from Sigma-Aldrich, Inc (St Louis,
MO) (Sigma-PCCS).
Examples of techniques sufficient to direct persons of skill through in vitro
amplification methods, useful e.g., for amplifying oligonucleotide recombined nucleic
acids, include polymerase chain reactions (PCR), ligase chain reactions (LCR), Qβ-
replicase amplifications and other RNA polymerase mediated techniques (e.g.,
NASBA). These techniques are found in Berger, Sambrook, and Ausubel, supra, as
well as in Mullis et al., (1987) U.S. Patent No. 202; PCR Protocols A Guide to
Methods and Applications (Innis et al. eds) Academic Press Inc. San Diego, CA
(1990) (Innis); Arnheim & Levinson (October 1, 1990) C&EN 36-47; The Journal Of
NIH Research (1991) 3, 81-94; Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86,
1173; Guatelli et al. (1990) Proc. Natl. Acad. Sci. USA 87, 1874; Lomell et al. (1989)
J. Clin. Chem 35, 1826; Landegren et al., (1988) Science 241, 1077-1080; Van Brunt
(1990) Biotechnology 8, 291-294; Wu and Wallace, (1989) Gene 4, 560; Barringer et
al. (1990) Gene 89, 117, and Sooknanan and Malek (1995) Biotechnology 13: 563-
564. Improved methods of cloning in vitro amplified nucleic acids are described in
Wallace et al., U.S. Pat. No. 5,426,039. Improved methods of amplifying large
nucleic acids by PCR are summarized in Cheng et al. (1994) Nature 369: 684-685 and
the references therein, in which PCR amplicons of up to 40kb are generated. One of
skill will appreciate that essentially any RNA can be converted into a double stranded
DNA suitable for restriction digestion, PCR expansion and sequencing using reverse
transcriptase and a polymerase. See, Ausubel, Sambrook and Berger, all supra.
In one preferred method, reassembled sequences are checked for incorporation
of family-based recombination oligonucleotides. This can be done by cloning and
sequencing the nucleic acids, and/or by restriction digestion, e.g., as essentially taught
in Sambrook, Berger and Ausubel, supra. In addition, sequences can be PCR
amplified and sequenced directly. Thus, in addition to, e.g., Sambrook, Berger,
Ausubel and Innis (supra), additional PCR sequencing methodologies are also
particularly useful. For example, direct sequencing of PCR generated amplicons by
selectively incorporating boronated nuclease resistant nucleotides into the amplicons
during PCR and digestion of the amplicons with a nuclease to produce sized template
fragments has been performed (Porter et al. (1997) Nucleic Acids Research
(8):1611-1617). In the methods, four PCR reactions on a template are performed,
in each of which one of the nucleotide triphosphates in the PCR reaction mixture is
partially substituted with a 2’deoxynucleoside 5’-[P-borano]-triphosphate. The
boronated nucleotide is stochastically incorporated into PCR products at varying
positions along the PCR amplicon in a nested set of PCR fragments of the template.
An exonuclease that is blocked by incorporated boronated nucleotides is used to
cleave the PCR amplicons. The cleaved amplicons are then separated by size using
polyacrylamide gel electrophoresis, providing the sequence of the amplicon. An
advantage of this method is that it uses fewer biochemical manipulations than
performing standard Sanger-style sequencing of PCR amplicons.
Synthetic genes are amenable to conventional cloning and expression
approaches; thus, properties of the genes and proteins they encode can readily be
examined after their expression in a host cell. Synthetic genes can also be used to
generate polypeptide products by in vitro (cell-free) transcription and translation.
Polynucleotides and polypeptides can thus be examined for their ability to bind a
variety of predetermined ligands, small molecules and ions, or polymeric and
heteropolymeric substances, including other proteins and polypeptide epitopes, as
well as microbial cell walls, viral particles, surfaces and membranes.
For example, many physical methods can be used for detecting
polynucleotides encoding phenotypes associated with catalysis of chemical reactions
by either polynucleotides directly, or by encoded polypeptides. Solely for the purpose
of illustration, and depending on the specifics of particular pre-determined chemical
reactions of interest, these methods may include a multitude of techniques known in
the art which account for a physical difference between substrate(s) and product(s), or
for changes in the reaction media associated with chemical reaction (e.g. changes in
electromagnetic emissions, adsorption, dissipation, and fluorescence, whether UV,
visible or infrared (heat)). These methods also can be selected from any combination
of the following: mass-spectrometry; nuclear magnetic resonance; isotopically labeled
materials, partitioning and spectral methods accounting for isotope distribution or
labeled product formation; spectral and chemical methods to detect accompanying
changes in ion or elemental compositions of reaction product(s) (including changes in
pH, inorganic and organic ions and the like). Other methods of physical assays,
suitable for use in the methods herein, can be based on the use of biosensors specific
for reaction product(s), including those comprising antibodies with reporter
properties, or those based on in vivo affinity recognition coupled with expression and
activity of a reporter gene. Enzyme-coupled assays for reaction product detection and
cell life-death-growth selections in vivo can also be used where appropriate.
Regardless of the specific nature of the physical assays, they all are used to select a
desired activity, or combination of desired activities, provided or encoded by a
biomolecule of interest.
The specific assay used for the selection will depend on the application. Many
assays for proteins, receptors, ligands, enzymes, substrates and the like are known.
Formats include binding to immobilized components, cell or organismal viability,
production of reporter compositions, and the like.
High throughput assays are particularly suitable for screening libraries
employed in the present invention. In high throughput assays, it is possible to screen
up to several thousand different variants in a single day. For example, each well of a
microtiter plate can be used to run a separate assay, or, if concentration or incubation
time effects are to be observed, every 5-10 wells can test a single variant (e.g., at
different concentrations). Thus, a single standard microtiter plate can assay about 100
(e.g., 96) reactions. If 1536 well plates are used, then a single plate can easily assay
from about 100 to about 1500 different reactions. It is possible to assay several
different plates per day; assay screens for up to about 6,000-20,000 different assays
(i.e., involving different nucleic acids, encoded proteins, concentrations, etc.) are
possible using the integrated systems of the invention. More recently, microfluidic
approaches to reagent manipulation have been developed, e.g., by Caliper
Technologies (Mountain View, CA) which can provide very high throughput
microfluidic assay systems.
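The plate arithmetic above can be made concrete with a short calculation. The plate counts per day used below are hypothetical values chosen for illustration, not figures from this disclosure.

```python
def variants_per_day(wells_per_plate, wells_per_variant, plates_per_day):
    """Number of distinct variants that can be assayed in one day.

    Integer division drops wells left over when the plate size is not a
    multiple of the number of wells assigned to each variant.
    """
    variants_per_plate = wells_per_plate // wells_per_variant
    return variants_per_plate * plates_per_day


# One variant per well on 96-well plates, 60 plates per day (assumed).
print(variants_per_day(96, 1, 60))

# Five concentrations per variant on 1536-well plates, 20 plates per day (assumed).
print(variants_per_day(1536, 5, 20))
```

Both assumed scenarios fall within the range of about 6,000-20,000 assays per day mentioned above when counted per well.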
High throughput screening systems are commercially available (see, e.g.,
Zymark Corp., Hopkinton, MA; Air Technical Industries, Mentor, OH; Beckman
Instruments, Inc. Fullerton, CA; Precision Systems, Inc., Natick, MA, etc.). These
systems typically automate entire procedures including all sample and reagent
pipetting, liquid dispensing, timed incubations, and final readings of the microplate in
detector(s) appropriate for the assay. These configurable systems provide high
throughput and rapid start up as well as a high degree of flexibility and customization.
The manufacturers of such systems provide detailed protocols for various high
throughput screening assays. Thus, for example, Zymark Corp. provides technical
bulletins describing screening systems for detecting the modulation of gene
transcription, ligand binding, and the like.
A variety of commercially available peripheral equipment and software is
available for digitizing, storing and analyzing a digitized video or digitized optical or
other assay images, e.g., using PC (Intel x86 or Pentium chip-compatible), MAC OS,
WINDOWS™ family, or UNIX based (e.g., SUN™ work station) computers.
Systems for analysis typically include a digital computer specifically
programmed to perform specialized algorithms using software for directing one or
more steps of one or more of the methods herein, and, optionally, also include, e.g., a
next generation sequencing platform control software, high-throughput liquid control
software, image analysis software, data interpretation software, a robotic liquid
control armature for transferring solutions from a source to a destination operably
linked to the digital computer, an input device (e.g., a computer keyboard) for
entering data to the digital computer to control operations or high throughput liquid
transfer by the robotic liquid control armature and, optionally, an image scanner for
digitizing label signals from labeled assay components. The image scanner can
interface with image analysis software to provide a measurement of probe label
intensity. Typically, the probe label intensity measurement is interpreted by the data
interpretation software to show whether the labeled probe hybridizes to the DNA on
the solid support.
In some embodiments, cells, viral plaques, spores or the like, comprising in
vitro oligonucleotide-mediated recombination products or physical embodiments of in
silico recombined nucleic acids, can be separated on solid media to produce
individual colonies (or plaques). Using an automated colony picker (e.g., the Q-bot,
Genetix, U.K.), colonies or plaques are identified, picked, and up to 10,000 different
mutants inoculated into 96 well microtiter dishes containing two 3 mm glass
balls/well. The Q-bot does not pick an entire colony but rather inserts a pin through
the center of the colony and exits with a small sampling of cells, (or mycelia) and
spores (or viruses in plaque applications). The time the pin is in the colony, the
number of dips to inoculate the culture medium, and the time the pin is in that
medium each affect inoculum size, and each parameter can be controlled and
optimized.
The uniform process of automated colony picking such as the Q-bot decreases
human handling error and increases the rate of establishing cultures (roughly 10,000/4
hours). These cultures are optionally shaken in a temperature and humidity controlled
incubator. Optional glass balls in the microtiter plates act to promote uniform
aeration of cells and the dispersal of cellular (e.g., mycelial) fragments similar to the
blades of a fermentor. Clones from cultures of interest can be isolated by limiting
dilution. As also described supra, plaques or cells constituting libraries can also be
screened directly for the production of proteins, either by detecting hybridization,
protein activity, protein binding to antibodies, or the like. To increase the chances of
identifying a pool of sufficient size, a prescreen that increases the number of mutants
processed by 10-fold can be used. The goal of the primary screen is to quickly
identify mutants having equal or better product titers than the parent strain(s) and to
move only these mutants forward to liquid cell culture for subsequent analysis.
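A minimal sketch of the primary-screen decision follows: keep only mutants whose product titer is equal to or better than that of the parent strain, ordered best first. The mutant identifiers, titer values, and function name are hypothetical.

```python
def primary_screen(titers, parent_titer):
    """Return the IDs of mutants whose titer is at least the parent's,
    sorted from highest to lowest titer."""
    hits = {mutant: titer for mutant, titer in titers.items()
            if titer >= parent_titer}
    return sorted(hits, key=hits.get, reverse=True)


# Titers in arbitrary units; the parent strain is the reference point.
measured = {"v1": 0.8, "v2": 1.3, "v3": 1.0, "v4": 2.1}
print(primary_screen(measured, parent_titer=1.0))
```

In practice the threshold would usually account for assay noise, for example by requiring a margin above the parent; that refinement is omitted here.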
One approach to screening diverse libraries is to use a massively parallel solid-
phase procedure to screen cells expressing polynucleotide variants, e.g.,
polynucleotides that encode enzyme variants. Massively parallel solid-phase
screening apparatus using absorption, fluorescence, or FRET are available. See, e.g.,
U.S. Pat. No. 5,914,245 to Bylina, et al. (1999); see also, http://www|.|kairos-
scientific.com/; Youvan et al. (1999) “Fluorescence Imaging Micro-
Spectrophotometer (FIMS)” Biotechnology et alia, <www|.|et-al.com> 1:1-16; Yang
et al. (1998) “High Resolution Imaging Microscope (HIRIM)” Biotechnology et alia,
<www|.|et-al.com> 4:1-20; and Youvan et al. (1999) “Calibration of Fluorescence
Resonance Energy Transfer in Microscopy Using Genetically Engineered GFP
Derivatives on Nickel Chelating Beads” posted at www|.|kairos-scientific.com.
Following screening by these techniques, molecules of interest are typically isolated,
and optionally sequenced using methods that are known in the art. The sequence
information is then used as set forth herein to design a new protein variant library.
Similarly, a number of well-known robotic systems have also been developed
for solution phase chemistries useful in assay systems. These systems include
automated workstations like the automated synthesis apparatus developed by Takeda
Chemical Industries, LTD. (Osaka, Japan) and many robotic systems utilizing robotic
arms (Zymate II, Zymark Corporation, Hopkinton, Mass.; Orca, Beckman Coulter,
Inc. (Fullerton, CA)) which mimic the manual synthetic operations performed by a
scientist. Any of the above devices are suitable for use with the present invention,
e.g., for high-throughput screening of molecules encoded by nucleic acids evolved as
described herein. The nature and implementation of modifications to these devices (if
any) so that they can operate as discussed herein will be apparent to persons skilled in
the relevant art.
VII. DIGITAL APPARATUS AND SYSTEMS
As should be apparent, embodiments described herein employ processes
acting under control of instructions and/or data stored in or transferred through one or
more computer systems. Embodiments disclosed herein also relate to systems and
apparatus (e.g., equipment) for performing these operations. In some embodiments,
the apparatus is specially designed and/or constructed for the required purposes, or it
may be a general-purpose computer selectively activated or reconfigured by a
computer program and/or data structure stored in the computer. The processes
provided by the present disclosure are not inherently related to any particular
computer or other specific apparatus. In particular, various general-purpose machines
find use with programs written in accordance with the teachings herein. However, in
some embodiments, a specialized apparatus is constructed to perform the required
method operations. One embodiment of a particular structure for a variety of these
machines is described below.
In addition, certain embodiments of the present disclosure relate to computer-
readable media or computer program products that include program instructions
and/or data (including data structures) for performing various computer-implemented
operations. Examples of computer-readable media include, but are not limited to,
magnetic media such as hard disks; optical media such as CD-ROM devices and
holographic devices; magneto-optical media; and semiconductor memory devices,
such as flash memory. Hardware devices such as read-only memory devices (ROM)
and random access memory devices (RAM) may be configured to store program
instructions. Hardware devices such as application-specific integrated circuits
(ASICs) and programmable logic devices (PLDs) may be configured to execute and
store program instructions. It is not intended that the present disclosure be limited to
any particular computer-readable media or any other computer program products that
include instructions and/or data for performing computer-implemented operations.
Examples of program instructions include, but are not limited to low-level
code such as produced by a compiler, and files containing higher level code that may
be executed by the computer using an interpreter. Further, the program instructions
include, but are not limited to machine code, source code and any other code that
directly or indirectly controls operation of a computing machine in accordance with
the present disclosure. The code may specify input, output, calculations, conditionals,
branches, iterative loops, etc.
In one illustrative example, code embodying methods disclosed herein is
embodied in a fixed media or transmissible program component containing logic
instructions and/or data that when loaded into an appropriately configured computing
device causes the device to perform virtual screening of one or more biomolecule
variants interacting with one or more ligands. Figure 4 shows an example digital
device 800 that is a logical apparatus that can read instructions from media 817,
network port 819, user input keyboard 809, user input 811, or other inputting means.
Apparatus 800 can thereafter use those instructions to direct statistical operations in
data space, e.g., to evaluate a geometric relation between a ligand moiety and one or
more features of an active site, cofactor, etc. (e.g., to determine a distance between the
position of a native substrate in an active site and the position of a substrate under
consideration in the active site of a protein variant). One type of logical apparatus
that can embody disclosed embodiments is a computer system as in computer system
800 comprising CPU 807, optional user input devices keyboard 809, and GUI
pointing device 811, as well as peripheral components such as disk drives 815 and
monitor 805 (which displays GO modified character strings and provides for
simplified selection of subsets of such character strings by a user). Fixed media 817 is
optionally used to program the overall system and can include, e.g., a disk-type
optical or magnetic media or other electronic memory storage element.
Communication port 819 can be used to program the system and can represent any
type of communication connection.
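By way of a hedged illustration, the distance determination mentioned above (between the position of a native substrate and that of a substrate under consideration) might be computed as follows. The sketch assumes that a substrate's "position" is summarized by the centroid of its atom coordinates, which is only one possible choice; the function names and coordinates are hypothetical.

```python
import math


def centroid(coords):
    """Mean (x, y, z) position of a list of atom coordinates."""
    n = len(coords)
    return tuple(sum(atom[i] for atom in coords) / n for i in range(3))


def substrate_displacement(native_coords, candidate_coords):
    """Distance between the centroids of two docked substrates, in the
    same units as the input coordinates (e.g., angstroms)."""
    return math.dist(centroid(native_coords), centroid(candidate_coords))


native = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
candidate = [(0.0, 3.0, 4.0), (2.0, 3.0, 4.0)]
print(substrate_displacement(native, candidate))
```

Atom-by-atom measures such as a root-mean-square deviation over matched atoms would serve the same purpose when the two substrates share a common scaffold.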
Certain embodiments can also be embodied within the circuitry of an
application specific integrated circuit (ASIC) or programmable logic device (PLD).
In such a case, the embodiments are implemented in a computer readable descriptor
language that can be used to create an ASIC or PLD. Some embodiments of the
present disclosure are implemented within the circuitry or logic processors of a
variety of other digital apparatus, such as PDAs, laptop computer systems, displays,
image editing equipment, etc.
In some embodiments, the present disclosure relates to a computer program
product comprising one or more computer-readable storage media having stored
thereon computer-executable instructions that, when executed by one or more
processors of a computer system, cause the computer system to implement a method
for virtual screening of protein variants and/or in silico directed evolution of proteins
having desired activity. Such a method may be any method described herein such as
those encompassed by the figures and pseudocode. In some embodiments, for
example, the method receives sequence data for a plurality of enzymes, creates three-
dimensional homology models of biological molecules, docks the homology models of
enzymes with one or more computational representations of substrates, and selects
enzymes having desired catalytic activity and selectivity. In some embodiments, the
method can further develop variant libraries from variants that have been highly
ranked by the screening process. The variant libraries can be used in re-iterative
directed evolution and screening, which can result in enzymes of desired beneficial
properties.
In some embodiments, the docking of the homology models of enzymes with
one or more computational representations of substrates is conducted by a docking
program on a computer system that uses a computational representation of a ligand
and computational representations of the active sites of a plurality of variants as
described herein. In various embodiments, methods for determining docking involve
evaluating the binding energy between a pose of the substrate and the enzyme. For a
protein variant that successfully docks with the ligand, the virtual protein screening
system considers a plurality of poses of the computational representation of the ligand
in the active site of the protein variant under consideration, and determines which if
any of the plurality of poses is active. In various embodiments, methods for
determining active poses involve evaluating the geographical constraints defining a
range of relative positions of one or more atoms in the ligand and one or more atoms
in the protein and/or cofactor associated with the protein.
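The pose-evaluation logic described above can be sketched as follows, under stated assumptions: each constraint is taken to be an allowed distance range between one named ligand atom and one named protein or cofactor atom, and a lower docking energy is taken to be more favorable. The class, function, and atom names are hypothetical and do not correspond to any particular docking program.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    name: str
    energy: float        # docking score; lower is assumed more favorable
    ligand_atoms: dict   # atom name -> (x, y, z)


def is_active(pose, site_atoms, constraints):
    """A pose is active if every (ligand_atom, site_atom, min, max)
    constraint is satisfied by the corresponding interatomic distance."""
    for ligand_atom, site_atom, low, high in constraints:
        d = math.dist(pose.ligand_atoms[ligand_atom], site_atoms[site_atom])
        if not low <= d <= high:
            return False
    return True


def best_active_pose(poses, site_atoms, constraints):
    """Lowest-energy pose among the active ones, or None if none is active."""
    active = [p for p in poses if is_active(p, site_atoms, constraints)]
    return min(active, key=lambda p: p.energy) if active else None


site = {"cofactor_C4": (0.0, 0.0, 0.0)}
rules = [("carbonyl_C", "cofactor_C4", 0.0, 4.5)]
poses = [
    Pose("A", -7.0, {"carbonyl_C": (0.0, 0.0, 3.0)}),
    Pose("B", -9.0, {"carbonyl_C": (0.0, 0.0, 8.0)}),
    Pose("C", -6.0, {"carbonyl_C": (0.0, 4.0, 0.0)}),
]
best = best_active_pose(poses, site, rules)
print(best.name if best else None)
```

Pose B has the most favorable energy in this toy data but lies outside the allowed distance range, so it is not considered active; the geometric filter is applied before ranking by energy.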
VIII. EMBODIMENTS IN WEBSITES AND CLOUD COMPUTING
The Internet includes computers, information appliances, and computer
networks that are interconnected through communication links. The interconnected
computers exchange information using various services, such as electronic mail, ftp,
the World Wide Web (“WWW”) and other services, including secure services. The
WWW service can be understood as allowing a server computer system (e.g., a Web
server or a Web site) to send web pages of information to a remote client information
appliance or computer system. The remote client computer system can then display
the web pages. Generally, each resource (e.g., computer or web page) of the WWW
is uniquely identifiable by a Uniform Resource Locator (“URL”). To view or interact
with a specific web page, a client computer system specifies a URL for that web page
in a request. The request is forwarded to a server that supports that web page. When
the server receives the request, it sends that web page to the client information system.
When the client computer system receives that web page, it can display the web page
using a browser or can interact with the web page or interface as otherwise provided.
A browser is a logic module that effects the requesting of web pages and displaying or
interacting with web pages.
Currently, displayable web pages are typically defined using a Hyper Text
Markup Language (“HTML”). HTML provides a standard set of tags that define how
a web page is to be displayed. An HTML document contains various tags that control
the displaying of text, graphics, controls, and other features. The HTML document
may contain URLs of other Web pages available on that server computer system or
other server computer systems. URLs can also indicate other types of interfaces,
including such things as CGI scripts or executable interfaces, that information
appliances use to communicate with remote information appliances or servers without
necessarily displaying information to a user.
The Internet is ally conducive to providing information services to one
or more remote customers. Services can include items (e.g., music or stock quotes)
that are delivered electronically to a purchaser over the Internet. Services can also
include ng orders for items (e.g., groceries, books, or chemical or biologic
compounds, etc.) that may be delivered through conventional distribution channels
(e.g., a common carrier). es may also e handling orders for items, such as
e or theater reservations, that a purchaser accesses at a later time. A server
er system may provide an electronic version of an interface that lists items or
es that are available. A user or a potential purchaser may access the interface
using a browser and select various items of interest. When the user has completed
selecting the items desired, the server computer system may then prompt the user for
information needed to complete the service. This transaction-specific order
information may include the purchaser's name or other identification, an identification
for payment (such as a corporate purchase order number or account number), or
additional information needed to complete the service, such as flight information.
Among services of particular interest that can be provided over the Internet
and over other networks are biological data and biological databases. Such services
include a variety of services provided by the National Center for Biotechnology
Information (NCBI) of the National Institutes of Health (NIH). NCBI is charged with
creating automated systems for storing and analyzing knowledge about molecular
biology, biochemistry, and genetics; facilitating the use of such databases and
software by the research and medical community; coordinating efforts to gather
biotechnology information both nationally and internationally; and performing
research into advanced methods of computer-based information processing for
analyzing the structure and function of biologically important molecules.
NCBI holds responsibility for the GenBank® DNA sequence database. The
database has been constructed from sequences submitted by individual laboratories
and by data exchange with the international nucleotide sequence databases, the
European Molecular Biology Laboratory (EMBL) and the DNA Database of Japan
(DDBJ), and includes patent sequence data submitted to the U.S. Patent and
Trademark Office. In addition to GenBank®, NCBI supports and distributes a variety
of databases for the medical and scientific communities. These include the Online
Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB)
of 3D protein structures, the Unique Human Gene Sequence Collection (UniGene), a
Gene Map of the Human Genome, the Taxonomy Browser, and the Cancer Genome
Anatomy Project (CGAP), in collaboration with the National Cancer Institute. Entrez
is NCBI's search and retrieval system that provides users with integrated access to
sequence, mapping, taxonomy, and structural data. Entrez also provides graphical
views of sequences and chromosome maps. A feature of Entrez is the ability to
retrieve related sequences, structures, and references. BLAST, as described herein, is
a program for sequence similarity searching developed at NCBI for identifying genes
and genetic features that can execute sequence searches against the entire DNA
database. Additional software tools provided by NCBI include: Open Reading Frame
Finder (ORF Finder), Electronic PCR, and the sequence submission tools, Sequin and
BankIt. NCBI's various databases and software tools are available from the WWW or
by FTP or by e-mail servers. Further information is available at
www.ncbi.nlm.nih.gov.
Some biological data available over the internet is data that is generally
viewed with a special browser “plug-in” or other executable code. One example of
such a system is CHIME, a browser plug-in that allows an interactive virtual 3-
dimensional display of molecular structures, including biological molecular
structures. Further information regarding CHIME is available at
www.mdlchime.com/chime/.
A variety of companies and institutions provide online systems for ordering
biological compounds. Examples of such systems can be found at
www.genosys.com/oligo_custinfo.cfm or
www.genomictechnologies.com/Qbrowser2_FP.html. Typically, these systems
accept some descriptor of a desired biological compound (such as an oligonucleotide,
DNA strand, RNA strand, amino acid sequence, etc.) and then the requested
compound is manufactured and is shipped to the customer in a liquid solution or other
appropriate form.
As the methods provided herein may be implemented on a website as further
described below, the computational results or physical results involving polypeptides
or polynucleotides produced by some embodiments of the disclosure may be provided
through the Internet in ways similar to the biological information and compounds
described above.
To further illustrate, the methods of this invention can be implemented in a
localized or distributed computing environment. In a distributed environment, the
methods may be implemented on a single computer comprising multiple processors or
on a multiplicity of computers. The computers can be linked, e.g., through a common
bus, but more preferably the computer(s) are nodes on a network. The network can be
a generalized or a dedicated local or wide-area network and, in certain preferred
embodiments, the computers may be components of an Intranet or an Internet.
In one Internet embodiment, a client system typically executes a Web browser
and is coupled to a server computer executing a Web server. The Web browser is
typically a program such as IBM's Web Explorer, Microsoft’s Internet Explorer,
NetScape, Opera, or Mosaic. The Web server is typically, but not necessarily, a
program such as IBM's HTTP Daemon or other www daemon (e.g., LINUX-based
forms of the program). The client computer is bi-directionally coupled with the server
computer over a line or via a wireless system. In turn, the server computer is bi-
directionally coupled with a website (server hosting the website) providing access to
software implementing the methods of this invention.
As mentioned, a user of a client connected to the Intranet or Internet may
cause the client to request resources that are part of the web site(s) hosting the
application(s) providing an implementation of the methods of this invention. Server
program(s) then process the request to return the specified resources (assuming they
are currently available). The standard naming convention (i.e., Uniform Resource
Locator (“URL”)) encompasses several types of location names, presently including
subclasses such as Hypertext Transport Protocol (“http”), File Transport Protocol
(“ftp”), gopher, and Wide Area Information Service (“WAIS”). When a resource is
downloaded, it may include the URLs of additional resources. Thus, the user of the
client can easily learn of the existence of new resources that he or she had not
specifically requested.
The software implementing the method(s) of this invention can run locally on
the server hosting the website in a true client-server architecture. Thus, the client
computer posts requests to the host server which runs the requested process(es)
locally and then downloads the results back to the client. Alternatively, the methods
of this invention can be implemented in a “multi-tier” format in which a component of
the method(s) is performed locally by the client. This can be implemented by
software downloaded from the server on request by the client (e.g., a Java application)
or it can be implemented by software “permanently” installed on the client.
In one embodiment the application(s) implementing the methods of this
invention are divided into frames. In this paradigm, it is helpful to view an
application not so much as a collection of features or functionality but, instead, as a
collection of discrete frames or views. A typical application, for instance, generally
includes a set of menu items, each of which invokes a particular frame--that is, a form
which manifests certain functionality of the application. With this perspective, an
application is viewed not as a monolithic body of code but as a collection of applets,
or bundles of functionality. In this manner, from within a browser, a user would select
a Web page link which would, in turn, invoke a particular frame of the application
(i.e., a sub-application). Thus, for example, one or more frames may provide
functionality for inputting and/or encoding biological molecule(s) into one or more
data spaces, while another frame provides tools for refining a model of the data space.
In certain embodiments, the methods of this invention are implemented as one
or more frames providing, e.g., the following functionalit(ies): function(s) to encode
two or more biological molecules into character strings to provide a collection of two
or more different initial character strings wherein each of said biological molecules
comprises a selected set of subunits; functions to select at least two substrings from
the character strings; functions to concatenate the substrings to form one or more
product strings about the same length as one or more of the initial character strings;
functions to add (place) the product strings to a collection of strings; functions to
create and manipulate computational representations/models of enzymes and
substrates; functions to dock a computational representation of a substrate (e.g., a
ligand) with the computational representation of an enzyme (e.g., a protein); functions
to apply molecular dynamics to molecular models; functions to calculate various
constraints between molecules that affect chemical reactions involving the molecules
(e.g., distance or angle between a substrate moiety and an enzyme active site); and
functions to implement any feature set forth herein.
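The string-handling functions listed above (encoding molecules as character strings, selecting substrings, and concatenating them into product strings of about the same length) can be sketched as follows. This is only an illustrative sketch: it assumes single-letter amino acid strings of equal length as the encoding, and the names `select_substrings` and `recombine`, the random crossover scheme, and the example sequences are hypothetical rather than taken from the disclosure.

```python
import random

def select_substrings(parents, cut_points, rng):
    """Cut at the given points and take each segment from a randomly chosen parent."""
    bounds = [0, *sorted(cut_points), len(parents[0])]
    return [rng.choice(parents)[a:b] for a, b in zip(bounds, bounds[1:])]

def recombine(parents, n_products, n_crossovers=1, seed=0):
    """Concatenate substrings of equal-length parents into product strings."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("this sketch assumes parents of equal length")
    rng = random.Random(seed)
    products = []
    for _ in range(n_products):
        cut_points = rng.sample(range(1, length), n_crossovers)
        products.append("".join(select_substrings(parents, cut_points, rng)))
    return products

# Made-up single-letter amino acid strings standing in for encoded molecules.
parents = ["MKTAYIAKQR", "MKSAYLAKQK"]
library = recombine(parents, n_products=5, n_crossovers=2)
```

Because every segment keeps its original coordinates, each product string has the same length as its parents and each position carries a residue found at that position in one of them.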
One or more of these functionalities may also be implemented exclusively on
a server or on a client computer. These functions, e.g., functions for creating or
manipulating computational models of biological molecules, can provide one or more
windows wherein the user can insert or manipulate representation(s) of biological
molecules. In addition, the functions also, optionally, provide access to private
and/or public databases accessible through a local network and/or the intranet
whereby one or more sequences contained in the databases can be input into the
methods of this invention. Thus, for example, in one embodiment, the user can,
optionally, have the ability to request a search of GenBank® and input one or more of
the sequences returned by such a search into an encoding and/or a diversity generating
function.
Methods of implementing Internet and/or Intranet embodiments of
computational and/or data access processes are well known to those of skill in the art
and are documented in great detail (see, e.g., Cluer et al. (1992) “A General
Framework for the Optimization of Object-Oriented Queries,” Proc SIGMOD
International Conference on Management of Data, San Diego, California, Jun. 2-5,
1992, SIGMOD Record, vol. 21, Issue 2, Jun., 1992; Stonebraker, M., Editor; ACM
Press, pp. 2; ISO-ANSI, Working Draft, “Information Technology-Database
Language SQL,” Jim Melton, Editor, International Organization for Standardization
and American National Standards Institute, Jul. 1992; Microsoft Corporation, “ODBC
2.0 Programmer's Reference and SDK Guide. The Microsoft Open Database Standard
for Microsoft Windows™ and Windows NT™, Microsoft Open Database
Connectivity™ Software Development Kit,” 1992, 1993, 1994 Microsoft Press, pp.
3-30 and 41-56; ISO Working Draft, “Database Language SQL-Part 2:Foundation
(SQL/Foundation),” CD9075-2:199.chi.SQL, Sep. 11, 1997, and the like). Additional
relevant details regarding web-based applications are found in WO 00/42559, entitled
“METHODS OF POPULATING DATA STRUCTURES FOR USE IN
EVOLUTIONARY SIMULATIONS,” by Selifonov and Stemmer.
In some embodiments, the methods for exploring, screening, and/or
developing polynucleotide or polypeptide sequences can be implemented as a multi-
user system on a computer system with a plurality of processing units and memories
distributed over a computer network, wherein the network may include intranet on
LAN and/or the Internet. In some embodiments, the distributed computing
architecture involves a “cloud,” which is a collection of computer systems available
over a computer network for computation and data storage. The computing
environment involving a cloud is referred to as a cloud computing environment. In
some embodiments, one or more users can access the computers of the cloud
distributed over an intranet and/or the Internet. In some embodiments, a user may
remotely access, through a web client, server computers that implement the methods
for screening and/or developing protein variants described above.
In some embodiments involving a cloud computing environment, virtual
machines (VMs) are provisioned on the server computers, and the results of the
virtual machines can be sent back to the user. A virtual machine (VM) is a software-
based emulation of a computer. Virtual machines may be based on specifications of a
hypothetical computer or emulate the computer architecture and functions of a real
world computer. The structure and functions of VMs are well known in the art.
Typically, a VM is installed on a host platform that includes system hardware, and the
VM itself includes virtual system hardware and guest software.
The host system hardware for a VM includes one or more Central Processing
Units (CPUs), memory, one or more hard disks and various other devices. The VM’s
virtual system hardware includes one or more virtual CPUs, virtual memory, one or
more virtual hard disks and one or more virtual devices. The VM’s guest software
includes guest system software and guest applications. In some implementations,
guest system software includes a guest operating system with drivers for virtual
devices. In some implementations, the VM’s guest applications include at least one
instance of a virtual protein screening system as described above.
In some embodiments, the number of provisioned VMs can be scaled to the
computational load of the problem to be solved. In some embodiments, a user can
request a virtual machine from a cloud, the VM including a virtual screening system.
In some embodiments, the cloud computing environment can provision a VM based
on the user request. In some embodiments a VM may exist in a previously stored VM
image, which can be stored in an image repository. The cloud computing
environment can search and transfer the image to a server or a user system. The cloud
computing environment can then boot the image on the server or user system.
IX. EXAMPLES
Example 1
The following example illustrates a process of virtually screening enzyme
variants and developing enzymes of desired catalytic activity and selectivity
implementing various embodiments.
In summary, the process involved creating 3-dimensional homology models of
an actual panel of enzymes and virtually screening the members of the enzyme panel
to select a first variant that (a) docked with the substrate in an active pose, (b) docked
in a pro-S conformation, and (c) had the lowest total binding energy (or docking
score) among those that docked in active poses and in a pro-S conformation. The
process then used the first variant as a round-1 backbone, or parental sequence, to
create a round-1 virtual variant library using virtual mutagenesis techniques for virtual
directed evolution. Then, the process created models of members of the round-1
virtual variant library, screened the round-1 virtual variant library, and selected a
second variant as a round-2 backbone using similar selection methods as in selecting
the round-1 backbone. The process also selected additional variants from the round-1
virtual variant library. The additional variants (a) docked with the substrate in active
poses, and (b) had low total binding energy (or docking score) among those that docked
in active poses. The process then recombined the round-2 backbone with the
additional variants to introduce diversity into a round-2 variant library. Finally, the
process computationally modeled, screened and selected variants, yielding virtual
enzyme variants with improved activity and selectivity compared to the round-1 and
round-2 backbones.
More specifically, the example process started by creating 194 homology
models of an actual panel of enzymes. These enzymes catalyze a native substrate that
is structurally or functionally related to a desired substrate. The process docked the
desired substrate to the homology models, and virtually screened members of the
actual enzyme panel to find only one variant that (a) docked with the desired substrate
in an active pose, and (b) docked in a pro-S conformation. Successful docking in an
active pose suggested that the ligand was likely to undergo a catalytic transformation
or perform some desired role such as covalently binding with the binding site. The
docking of the desired substrate and the panel members was performed by docking
methods described in detail above. The functionally relevant moieties of the desired
substrate were compared to the native substrate by placing the two substrates in the
same X, Y, Z coordinates in a docking space. Whether a pose of the desired substrate
was active, pro-S, or pro-R, was determined by the distance between the moieties of
the desired substrate and the native substrate. The distance criterion was set at 1.25 Å
for this example. The criterion value and rules (requiring the mean, min, max, etc. of
the distances to be smaller than the criterion) may be adjusted in different applications
and at various rounds of directed evolution.
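The distance criterion can be illustrated with a short Python sketch. It assumes that each moiety is given as a list of atom coordinates in angstroms, paired one-to-one with the corresponding atoms of the native substrate in the same coordinate frame; the function names, the coordinates, and the default choice of the maximum as the summary rule are hypothetical and only show how a mean, min, or max rule could be applied.

```python
import math
from statistics import mean

def moiety_distances(pose_atoms, reference_atoms):
    """Distances (in angstroms) between paired atoms of two moieties."""
    return [math.dist(p, r) for p, r in zip(pose_atoms, reference_atoms)]

def meets_criterion(pose_atoms, reference_atoms, cutoff=1.25, rule=max):
    """True if the chosen summary (max, min, or mean) of the distances is below cutoff."""
    return rule(moiety_distances(pose_atoms, reference_atoms)) < cutoff

# Made-up coordinates: two atoms of the native substrate moiety and of a docked pose.
native = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)]
pose = [(0.3, 0.4, 0.0), (1.2, 0.0, 1.0)]   # paired distances of about 0.5 and 1.0

loose = meets_criterion(pose, native)                               # max rule, 1.25 cutoff
strict = meets_criterion(pose, native, cutoff=1.0)                  # max rule, 1.0 cutoff
strict_mean = meets_criterion(pose, native, cutoff=1.0, rule=mean)  # mean rule, 1.0 cutoff
```

With these made-up coordinates the pose passes the 1.25 cutoff under the max rule, fails a 1.0 cutoff under the max rule, and passes the 1.0 cutoff under the mean rule, which shows how both the criterion value and the rule change the outcome.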
It was found that this variant could bind the substrate in both pro-S and pro-R
conformations. It was suspected that the variant might not be very selective. To
derive an active and S-selective enzyme for the desired substrate, this variant was
selected as a round-1 backbone to create a round-1 variant library by mutagenesis in
the first round of directed evolution in silico. There were 15 active site positions
identified in this round-1 backbone, and 19 amino acids possible for each position that
would be different from the round-1 backbone variant, amounting to 285 different
possible point mutations. In round-1 evolution, 1000 mutants were generated for the
round-1 variant library, each mutant having a random number of mutations, wherein
the random number was selected from a Gaussian distribution of mean=4 and SD=2.
The mutations were randomly chosen from the 285 possible point mutations.
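The library construction described above can be sketched in Python. The sketch makes several assumptions that the text does not state: the Gaussian draw is rounded and clamped to between 1 and the number of mutable positions, at most one substitution is applied per position, and the backbone sequence and position indices are made up for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def point_mutations(backbone, positions):
    """All single substitutions at the given 0-based positions (19 per position)."""
    return [(pos, aa) for pos in positions for aa in AMINO_ACIDS if aa != backbone[pos]]

def random_library(backbone, positions, size=1000, mean=4, sd=2, seed=0):
    """Build mutants whose mutation count follows a rounded, clamped Gaussian."""
    rng = random.Random(seed)
    pool = point_mutations(backbone, positions)
    library = []
    for _ in range(size):
        # Clamping to [1, number of positions] is an assumption of this sketch.
        n = min(len(positions), max(1, round(rng.gauss(mean, sd))))
        chosen = {}
        for pos, aa in rng.sample(pool, len(pool)):  # shuffled copy of the pool
            if len(chosen) == n:
                break
            chosen.setdefault(pos, aa)  # keep the first substitution seen per position
        seq = list(backbone)
        for pos, aa in chosen.items():
            seq[pos] = aa
        library.append("".join(seq))
    return library

backbone = "MSTAGKVIKCKAAVLWEEGKPLSIEEVEVA"   # made-up 30-residue sequence
positions = list(range(5, 20))                # 15 hypothetical active site positions
library = random_library(backbone, positions)
```

With 15 positions and 19 alternative residues per position, `point_mutations` returns the 285 candidates mentioned in the text; changing `mean`, `sd`, and the mutation pool gives the corresponding setup for a later round.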
Then, the process used docking and screening methods similar to those
described above for the actual enzyme panel, with the exception that the criterion for
determining activity and selectivity of poses was set at a more stringent value of 1 Å
as opposed to 1.25 Å. The process identified one variant as comprising the mutation
having the lowest total binding energy among all mutants that would bind in active
and pro-S poses. In fact, the mutation in this variant prevented the substrate from
binding in an undesired pro-R conformation, representing a beneficial mutation for
selectivity. The process thus selected this variant as the backbone for a round-2
directed evolution.
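The backbone selection rule (lowest binding energy among variants that dock in active, pro-S poses) can be sketched as below. The record layout, the per-variant summary of docking results, and the optional `require_exclusive` flag that excludes variants also binding in pro-R poses are assumptions made for illustration; the variant names and energies are made up.

```python
from dataclasses import dataclass

@dataclass
class DockingResult:
    variant: str
    active: bool           # at least one pose meets the activity criterion
    pro_s: bool            # binds in a pro-S pose
    pro_r: bool            # binds in a pro-R pose
    binding_energy: float  # kcal/mol, lower is more favorable

def select_backbone(results, require_exclusive=False):
    """Return the lowest-energy variant that is active and pro-S, or None."""
    candidates = [
        r for r in results
        if r.active and r.pro_s and not (require_exclusive and r.pro_r)
    ]
    return min(candidates, key=lambda r: r.binding_energy, default=None)

# Made-up screening summary for four hypothetical variants.
results = [
    DockingResult("variant_A", True, True, True, -4.0),
    DockingResult("variant_B", True, True, False, -2.5),
    DockingResult("variant_C", True, False, True, -6.0),
    DockingResult("variant_D", False, True, False, -7.0),
]
best = select_backbone(results)
best_selective = select_backbone(results, require_exclusive=True)
```

In this made-up data, `best` is variant_A, while requiring exclusive pro-S binding selects variant_B even though its energy is less favorable, mirroring the trade-off between binding energy and selectivity discussed in the example.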
However, the binding energy of the round-2 backbone at 0.38303 kcal/mol
was relatively high even compared to that determined for the round-1 backbone
(-4.005 kcal/mol), suggesting that evolution could further improve the beneficial
properties of the enzyme. A round-2 directed evolution was carried out in silico by
introducing 29 mutations into the round-2 backbone. The 29 mutations were derived
from 29 variants of the round-1 library having the lowest binding energy among all
variants obtained from the round-1 evolution. In round-2 evolution, 1000 mutants
were generated to produce the round-2 variant library, each mutant having a random
number of mutations, wherein the random number was selected from a Gaussian
distribution of mean=6 and SD=4. The mutations were randomly chosen from the 29
possible mutations derived from 29 variants.
Then, the process used docking and screening methods similar to those
described above to determine that most variants favored binding the substrate in a
desired pro-S conformation only, and at least 10 variants had better binding energy
than the round-1 and round-2 backbones. See Table 1 for the binding energies of the
improved variants from round-2 evolution and the round-1 and round-2 backbones.
In addition to showing the data of Table 1, Figure 5 shows the selectivity of the 10
improved variants from round-2 evolution, as well as the round-1 and round-2
backbones. The Figure illustrates that virtual screening of the enzyme panel first
identified the round-1 backbone that had a low binding energy, but was not S-
selective. The process then improved S-selectivity using in silico directed evolution
(mutagenesis), to obtain the round-2 backbone. The process finally improved
substrate binding in round-2 evolution through recombination, yielding enzyme
variants that had high affinity with the desired substrate and were enantioselective.
Table 1. Binding Energies of Variants from Round-2 Evolution
Binding Energy (kcal/mol)
Rd2 Variant 8
Rd2 Variant 7
Rd2 Variant 6
Rd2 Variant 5
Rd2 Variant 4
Rd2 Variant 3
Rd2 Variant 2
Rd2 Variant 1
Rd2BB
Rd1BB
The diversity provided in the two rounds of evolution was generated by
mutagenesis and recombination, inspired by biological genetic operations. In some
applications, the virtual protein screening method may be combined with sequence-activity
models that guide directed evolution methods. A sequence-activity model
was built with multiple linear regression techniques according to methods described
in U.S. Patent No. 7,783,428. In Figure 6A, the sequence-activity model’s predicted
binding energies are plotted against the observed energies obtained by the virtual
screening system for a test set of sequences. Cross validation of the sequence-activity
model was performed by testing a validation set of sequences left out from the test set.
The model accounts for 90.9% of the variance in the test set (R²=0.909). Cross
validation data in Figure 6B show that the sequence-activity model was accurate in
predicting binding energy from the sequences of particular mutations at particular
positions, accounting for 82.9% of the variance in the validation set (R²=0.829).
The model may be used to identify amino acids for mutagenesis. Among
other ways to use a sequence-activity model to guide directed evolution, one way
relies on the regression coefficients for a particular mutation of a specific residue at a
specific position, which reflect the mutation’s contribution to protein activity.
Specifically, a process of directed evolution could select the positions for mutation by
evaluating the coefficients of the terms of the sequence-activity model to identify one
or more amino acids that contribute substantially to the binding energy calculated by the
virtual screening system. For instance, in this example, mutation 1 has a large
positive coefficient, indicating that mutation 1 increases the activity to a large extent.
See Figure 6C. On the contrary, mutation 27 has a large negative coefficient,
suggesting this mutation should be avoided in order to obtain a high activity, as
illustrated in Figure 6C.
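An additive sequence-activity model of this general kind can be sketched with ordinary least squares. The sketch below is not the procedure of U.S. Patent No. 7,783,428; it is a generic multiple linear regression solved through the normal equations, with one 0/1 indicator per mutation. The three mutations, the noise-free responses, and all function names are made up for illustration, and whether a positive or a negative coefficient is desirable depends on whether the modeled response is an activity or a binding energy.

```python
import itertools

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(features, response):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    X = [[1.0, *row] for row in features]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, response)) for i in range(k)]
    return solve(xtx, xty)

def predict(coef, row):
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row))

def r_squared(coef, features, response):
    mean_y = sum(response) / len(response)
    ss_res = sum((y - predict(coef, row)) ** 2 for row, y in zip(features, response))
    ss_tot = sum((y - mean_y) ** 2 for y in response)
    return 1 - ss_res / ss_tot

# Made-up data: presence (1) or absence (0) of three hypothetical mutations,
# with noise-free responses generated from known additive contributions.
features = list(itertools.product([0, 1], repeat=3))
response = [-4.0 - 1.0 * a + 0.5 * b - 0.25 * c for a, b, c in features]

coef = fit(features, response)
fit_quality = r_squared(coef, features, response)
ranked = sorted(range(1, len(coef)), key=lambda j: coef[j])  # mutation indices by coefficient
```

Ranking the fitted coefficients, as in the last line, is one simple way to shortlist mutations to keep or avoid in the next round.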
Example 2
Example 2 provides an experimental validation of virtually screening
ketoreductase variants for the R-enantiomer of a chiral alcohol from a pro-chiral
ketone, as the reaction shown at the top of Figure 7.
The process involved creating 3-dimensional homology models of two
existing Panels of ketoreductase enzyme variants (96 wells format for each Panel) and
virtually screening the 192 members of the ketoreductase Panels to select variants that
(a) docked with the substrate in an active pose, (b) docked in a pro-R conformation,
and (c) had favorable docking score.
The process identified 24 variants that can lead to active and energetically
favorable poses, which may be prioritized for further development and screening. To
validate the utility and validity of the virtual in silico screening results, the process
also performed in vitro screening for all 192 members with a standard protocol, and
substrate/products were detected with high-performance liquid chromatography
(HPLC).
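The results are shown in Figure 7, where the x-axis is % conversion, calculated as
$$\%\,\text{conversion} = \frac{\text{PeakArea}_{(R)\text{-alcohol}} + \text{PeakArea}_{(S)\text{-alcohol}}}{\text{PeakArea}_{(R)\text{-alcohol}} + \text{PeakArea}_{(S)\text{-alcohol}} + \text{PeakArea}_{\text{ketone}}} \times 100\%$$
and the y-axis is % e.e. toward the desired R product (an index of
enantioselectivity), calculated as
$$\%\,\text{e.e.} = \frac{\text{PeakArea}_{(R)\text{-alcohol}} - \text{PeakArea}_{(S)\text{-alcohol}}}{\text{PeakArea}_{(R)\text{-alcohol}} + \text{PeakArea}_{(S)\text{-alcohol}}} \times 100\%.$$
The 24 variants prioritized by virtual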
screening were marked as red squares and the remaining variants were
marked as blue diamonds. The results suggest: 1) virtual screening can help
determine if a desired conversion is feasible with a set of enzyme variants before any
in vitro screening; 2) a good number of the predicted variants indeed gave high activity
(% conversion) and enantioselectivity (% e.e.), despite the fact that such a small and
flexible substrate is usually considered to be a challenge for modeling. Virtual
screening can therefore filter out very unlikely options for in vitro screening and
select fewer samples to test (24 vs. 192 in this case), which can lead to significant time-
and cost-savings.
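The two HPLC-derived quantities plotted in Figure 7 can be computed from peak areas as in the following sketch. The function names and the peak areas are made up; only the two ratios come from the definitions given for the figure axes.

```python
def percent_conversion(area_r, area_s, area_ketone):
    """Alcohol peak area (R plus S) as a percentage of alcohol plus remaining ketone."""
    return 100.0 * (area_r + area_s) / (area_r + area_s + area_ketone)

def percent_ee_r(area_r, area_s):
    """Enantiomeric excess toward the R alcohol, in percent."""
    return 100.0 * (area_r - area_s) / (area_r + area_s)

# Made-up peak areas for one hypothetical well.
conversion = percent_conversion(area_r=90.0, area_s=10.0, area_ketone=100.0)  # 50.0
ee = percent_ee_r(area_r=90.0, area_s=10.0)                                   # 80.0
```

A well with no detectable alcohol peaks would make the e.e. denominator zero, so a real pipeline would need to decide how to report such wells; the sketch simply lets the division error surface.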
Example 3
Example 3 provides an experimental validation of virtual directed evolution of
transaminase for stereoselective C=O reduction to CH-NH2, as the reaction shown at
the top of Figure 8.
The process involved creating 3-dimensional homology models of 228 virtual
sequences from in silico saturated mutagenesis of 12 active site positions of the
backbone (12 positions × 19 AA/position = 228 variants, 1 mutation/variant) and
virtually screening the 228 virtual variants to select variants that (a) docked with the
substrate in an active pose, (b) docked in a conformation that leads to the desired
stereoselectivity, and (c) had the lowest total binding energy among those that docked
in active poses and in a targeted conformation.
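The enumeration of single-mutation variants (12 positions × 19 alternative amino acids = 228) can be sketched as follows. The backbone sequence, the position indices, and the variant naming scheme (wild-type residue, 1-based position, new residue) are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def saturation_variants(backbone, positions):
    """Map a mutation name such as 'G5A' to the full sequence carrying that substitution."""
    variants = {}
    for pos in positions:
        wild_type = backbone[pos]
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                name = f"{wild_type}{pos + 1}{aa}"   # 1-based position in the name
                variants[name] = backbone[:pos] + aa + backbone[pos + 1:]
    return variants

backbone = "MSTAGKVIKCKAAVLWEEGKPLSIEEVEVA"   # made-up 30-residue sequence
positions = range(4, 16)                      # 12 hypothetical active site positions
variants = saturation_variants(backbone, positions)
```

Each entry in `variants` would then be modeled and docked individually, so the size of the dictionary is the number of docking jobs for this screening round.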
The process then identified 12 variants or 12 mutations that can lead to active
and energetically favorable poses. The 12 mutations were used to synthesize a
library, which was screened in vitro. The in vitro screening was carried out for 360
variants (one or more mutations per variant) with a proprietary protocol.
Substrate/products were detected with HPLC.
The results for the best variants from in vitro screening are shown in Figure 8,
where the x-axis is the samples screened, and the y-axis is FIOPC, defined as Fold
Improvement Over Positive Control and calculated as
$$\text{FIOPC} = \frac{\%\text{Conversion}_{\text{Variant}} - \%\text{Conversion}_{\text{NegativeControl}}}{\%\text{Conversion}_{\text{PositiveControl}} - \%\text{Conversion}_{\text{NegativeControl}}} \times 100\%.$$
The Positive Control is the backbone of virtual screening and in vitro screening and
the Negative Control is the empty vector without enzyme.
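The FIOPC calculation can be sketched in Python as below. The function returns the plain ratio, which is the scale on which fold values such as 1.5 or 2 are expressed; multiplying by 100 gives the percentage form of the definition. The function name and the conversion values are made up for illustration and are not data from the example.

```python
def fiopc(conv_variant, conv_positive, conv_negative):
    """Fold improvement over positive control, after subtracting the negative control."""
    return (conv_variant - conv_negative) / (conv_positive - conv_negative)

# Made-up % conversion values for a variant, the backbone, and the empty vector.
fold = fiopc(conv_variant=50.0, conv_positive=22.0, conv_negative=2.0)   # 2.4
as_percent = 100.0 * fold
```

By construction, a variant that performs exactly like the positive control scores 1.0 and one that performs like the negative control scores 0.0.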
The in vitro library screening resulted in 13% of the variants having a FIOPC
> 1.5 and 5.3% with a FIOPC > 2. The top hit had a FIOPC of 2.4. Virtual screening
can therefore filter out deleterious mutations for in vitro screening and help design
more targeted libraries, which can lead to significant time- and cost-savings. For
example, if the saturated mutagenesis step had been done in vitro, at least another 800
variants would have needed to be screened.
While the foregoing has been described in some detail for purposes of clarity
and understanding, it will be clear to one skilled in the art from a reading of this
disclosure that various changes in form and detail can be made without departing
from the true scope of the disclosure. For example, all the techniques and apparatus
described above may be used in various combinations. All publications, patents,
patent applications, or other documents cited in this application are incorporated by
reference in their entirety for all purposes to the same extent as if each individual
publication, patent, patent application, or other document were individually indicated
to be incorporated by reference for all purposes.
Claims (31)
1. A method for selecting and synthesizing or expressing an enzyme variant, the method comprising: (a) creating or receiving a structural model for each of the plurality of different enzyme variants, wherein the plurality of different enzyme variants comprises at least ten different variants, and wherein each structural model contains a three dimensional computational representation of an active site of an enzyme variant; (b) for each enzyme variant, docking, by a computer system, a computational representation of the substrate to the three dimensional computational representation of the active site of the enzyme variant, wherein docking (i) generates a plurality of poses of the substrate in the active site, wherein a pose comprises a position or orientation of the substrate with respect to the active site of the enzyme variant, and (ii) identifies energetically favorable poses of the substrate in the active site, wherein an energetically favorable pose is a pose having an energy that is favorable for binding between the substrate and the enzyme variant; (c) for each energetically favorable pose, determining whether the pose is active, wherein an active pose meets one or more constraints for the substrate to undergo a catalytic reaction in the active site; (d) selecting at least one of the enzyme variants having an active site in which the substrate has one or more active poses as determined in (c); and (e) synthesizing or expressing the at least one of the enzyme variants selected in (d).
2. The method of claim 1, wherein the computational representation of the substrate represents a species along the reaction coordinate for the enzyme activity, the species being selected from the substrate, a reaction intermediate of the substrate, or a transition state of the substrate.
3. The method of claim 1 or claim 2, wherein the plurality of enzyme variants comprise a panel of enzymes that can turn over multiple substrates and wherein the members of the panel possess at least one mutation relative to a reference sequence.
4. The method of claim 3, wherein the at least one mutation is a single-residue mutation in the active site of the enzyme.
5. The method of any one of the preceding claims, wherein the plurality of variants comprise one or more enzymes that can catalyze a chemical reaction selected from oxidoreduction, transferation, hydrolysis, isomerization, ligation, and chemical bond breaking by a reaction other than hydrolysis, oxidation, or reduction.
6. The method of claim 5, wherein the enzyme is selected from oxidoreductase, transferase, hydrolase, isomerase, ligase, and lyase.
7. The method of claim 5, wherein the plurality of variants comprise one or more enzymes that can catalyze a chemical reaction selected from ketone reduction, transamination, oxidation, nitrile hydrolysis, imine reduction, enone reduction, acyl hydrolysis, and halohydrin dehalogenation.
8. The method of claim 7, wherein the enzyme is selected from ketone reductase, transaminase, cytochrome P450, Baeyer–Villiger monooxygenase, monoamine oxidase, nitrilase, imine reductase, enone reductase, acylase, and halohydrin dehalogenase.
9. The method of any one of the preceding claims, wherein the plurality of variants comprises members of a library produced by one or more rounds of directed evolution in vitro and/or in silico.
10. The method of any one of the preceding claims, wherein the plurality of variants comprises at least about a thousand different variants.
11. The method of any one of the preceding claims, wherein the computational representations of active sites are provided from 3-D homology models for the plurality of variants.
12. The method of claim 11, further comprising producing said 3-D homology models for the plurality of variants.
13. The method of any one of the preceding claims, wherein the computational representation of the substrate is a 3-D model of the substrate.
14. The method of any one of the preceding claims, wherein the method is applied to screen a plurality of substrates.
15. The method of any one of the preceding claims, further comprising identifying the constraints for the substrate to undergo the catalyzed chemical transformation by identifying one or more poses of a native substrate, a reaction intermediate of the native substrate, or a transition state of the native substrate when the native substrate undergoes the catalyzed chemical transformation by a wild-type enzyme.
16. The method of any one of the preceding claims, wherein the constraints comprise one or more of the following: position, distance, angle, and torsion constraints.
17. The method of any one of the preceding claims, wherein the constraints comprise a distance n a particular moiety on the substrate and a particular residue or residue moiety in the active site.
18. The method of any one of the preceding claims, wherein the constraints comprise a distance between a particular moiety on the substrate and a particular e or residue moiety on a cofactor.
19. The method of any one of the preceding claims, wherein the constraints comprise a distance n a particular moiety on the substrate and an ideally positioned native substrate in the active site.
20. The method of any one of the preceding claims, the method further comprising applying a set of one or more enzyme constraints to the plurality of enzyme variants, wherein the one or more enzyme constraints are similar to the constraints of a ype enzyme when a native substrate undergoes a catalyzed chemical transformation in the presence of the ype enzyme.
21. The method of any one of the preceding claims, wherein the plurality of poses of the substrate is obtained by one or more docking operations selected from the group consisting of: high temperature molecular dynamics, random rotation, refinement by grid-based simulated annealing, grid-based or full force field minimization, and any combinations thereof.
22. The method of any one of the preceding claims, wherein the plurality of poses of the substrates comprises at least about 10 poses of the substrate in the active site.
23. The method of any one of the preceding claims, wherein the selecting in (d) comprises identifying variants determined to have large numbers of active poses by comparison to other variants.
24. The method of any one of the preceding claims, wherein the selecting in (d) comprises: ranking the variants by one or more of the following: the number of active poses the variants have, docking scores of the active poses, and binding energies of the active poses; and selecting variants based on their ranks.
25. The method of claim 24, wherein the docking scores are based on van der Waals force and electrostatic interaction.
26. The method of claim 24, wherein the binding energies are based on one or more of the following: van der Waals force, electrostatic interaction, and solvation energy.
27. The method of any one of the preceding claims, wherein (e) comprises: preparing a plurality of oligonucleotides containing or encoding at least a portion of the at least one variant selected in (d); and performing one or more rounds of directed evolution using the plurality of oligonucleotides.
28. The method of claim 27, wherein preparing a plurality of oligonucleotides comprises synthesizing the oligonucleotides using a nucleic acid synthesizer.
29. The method of claim 27 or claim 28, wherein performing one or more rounds of directed evolution comprises fragmenting and recombining the plurality of oligonucleotides.
30. The method of any one of claims 27 to 29, wherein performing one or more rounds of directed evolution comprises performing saturation mutagenesis on the plurality of oligonucleotides.
31. The method of any one of the preceding claims, wherein the at least one enzyme variant has desired catalytic activity and/or selectivity.
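The distance constraint of claims 16 and 17 and the ranking of claims 23 and 24 can be illustrated with a minimal sketch. It is not the claimed implementation: the `Pose` record, the function names, the 3.5 angstrom style cutoff passed as `max_distance`, and the convention that a lower binding energy is more favorable are all assumptions made for illustration. A pose is treated as active when a chosen substrate moiety lies within the cutoff of a chosen active-site atom, and variants are ranked first by their number of active poses, then by the mean binding energy of those poses.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Hypothetical record for one docked pose of the substrate."""
    substrate_atom: tuple    # (x, y, z) of the substrate moiety of interest
    binding_energy: float    # assumed convention: lower is more favorable


def is_active_pose(pose, site_atom, max_distance):
    """Apply a single distance constraint between substrate and active site."""
    return math.dist(pose.substrate_atom, site_atom) <= max_distance


def summarize(poses, site_atom, max_distance):
    """Return (number of active poses, mean binding energy of active poses)."""
    active = [p for p in poses if is_active_pose(p, site_atom, max_distance)]
    if not active:
        # No active pose: sort this variant last on the energy tie-breaker.
        return 0, math.inf
    mean_energy = sum(p.binding_energy for p in active) / len(active)
    return len(active), mean_energy


def rank_variants(poses_by_variant, site_atoms, max_distance):
    """Rank variant names: most active poses first, then lowest mean energy."""
    summaries = {
        name: summarize(poses, site_atoms[name], max_distance)
        for name, poses in poses_by_variant.items()
    }
    return sorted(summaries, key=lambda n: (-summaries[n][0], summaries[n][1]))
```

Angle, torsion, or position constraints could be added as further predicates combined inside `is_active_pose`, and docking scores could replace or supplement the binding energy in the sort key; the single distance test is kept here only for brevity.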
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361883838P | 2013-09-27 | 2013-09-27 | |
US61/883,838 | 2013-09-27 | ||
PCT/US2014/057899 WO2015048572A1 (en) | 2013-09-27 | 2014-09-26 | Automated screening of enzyme variants |
Publications (2)
Publication Number | Publication Date |
---|---|
NZ717658A (en) | 2020-11-27 |
NZ717658B2 (en) | 2021-03-02 |