WO2024000579A1 - Machine learning-guided biological sequence engineering method and device - Google Patents
Machine learning-guided biological sequence engineering method and device
- Publication number
- WO2024000579A1 (application PCT/CN2022/103382)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- biological
- fitness
- sequences
- model
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
-
- C—CHEMISTRY; METALLURGY
- C40—COMBINATORIAL TECHNOLOGY
- C40B—COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
- C40B10/00—Directed molecular evolution of macromolecules, e.g. RNA, DNA or proteins
Definitions
- The invention belongs to the field of bioengineering, and specifically relates to a machine learning-guided biological sequence engineering method and device.
- Protein directed evolution uses multiple rounds of genetic mutation and phenotypic screening, and is usually limited by the throughput of experimental technology.
- Machine learning, employed as a model-based virtual screening technology, has reduced the screening burden to a level acceptable for manual experimentation.
- Robotics allows large-scale screening and rapid iteration in protein engineering; however, it has not yet been applied to machine learning-assisted directed evolution.
- The protein fitness landscape (also called fitness terrain or topography) is a metaphorical, high-dimensional surface that relates amino acid sequences to a property of interest, namely fitness. Exploring this terrain is extremely challenging for protein engineering: 1. the search space grows exponentially with the length of the amino acid sequence; 2. functional proteins are extremely rare; 3. the number of high-value protein sequences decays exponentially as fitness increases; 4. epistasis makes the fitness terrain rugged; 5. experimental testing is expensive, inefficient and energy-consuming. As an effective protein engineering strategy, directed evolution applies multiple rounds of mutation library construction and screening, but fixes only the best mutations in each round. This greedy strategy can get stuck in local optima, especially on rugged fitness terrain.
- Machine learning algorithms are increasingly used both to model protein fitness terrain and to guide protein engineering.
- Supervised learning is used to predict various properties, including three-dimensional structure, thermal stability, fluorescence intensity, ligand binding affinity, and catalytic performance.
- Gene and protein sequences in public databases (such as UniProt) are accumulating at an unprecedented rate.
- Unsupervised learning models such as UniRep, TAPE, ESM-1v and ProtT5-XL-U50 learn protein representations by discovering hidden patterns in large amounts of unlabeled protein sequence data.
- Existing fitness models are still limited by lack of data and by biases.
- Sequence-function label data can cover only a local area of the entire design space. Both limitations cause the learned model to behave pathologically in areas not covered by the data.
- MLDE: machine learning-assisted directed evolution.
- ECNet used homologous sequence information as supervision to predict high-order mutation effects, and successfully engineered TEM-1 β-lactamase, achieving an 8-fold enhancement in ampicillin resistance with 2-6 mutations.
- A biofoundry accelerates the closed-loop "design-build-test-learn" process through physical and informational automation.
- Automation allows large-scale library construction and screening in a short time, so that large quantities of high-quality new sequence-function data can be collected iteratively to improve model prediction and sequence design.
- This data capability is underutilized in algorithm-guided protein engineering.
- Bayesian optimization (BO) obtains candidates by optimizing a given sampling (acquisition) function, which balances exploration and exploitation.
- Bayesian optimization is widely used in protein engineering and in pathway and fermentation strain design.
- To use Bayesian optimization to guide protein engineering on combinatorial fitness terrains, Bayesian optimization needs to be improved.
- Standard BO obtains only one sample at a time, whereas the robot can build and test batches of samples in parallel.
- Batch BO has been used to design biological sequences. Batch methods fall into two categories: iteratively generating batches of data, or acquiring all data in one go. However, how to balance exploration and exploitation within a single batch remains an open question.
- BO locates the global optimum through a brute-force search of the entire design space. For a combinatorial mutation library of a protein (N sites give a design space of 20^N sequences), the computational cost of BO increases exponentially with the problem dimension.
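The exponential growth of the design space can be made concrete with a short calculation. This is an illustrative sketch only; the function name `design_space_size` is introduced here and does not come from the patent.

```python
# Size of a combinatorial mutation library: 20 amino acids at each of N sites.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 natural amino acids


def design_space_size(n_sites: int, alphabet_size: int = len(AMINO_ACIDS)) -> int:
    """Number of distinct sequences when n_sites positions vary freely."""
    return alphabet_size ** n_sites


for n in (1, 2, 4, 8, 12):
    print(f"N = {n:2d} sites -> {design_space_size(n):,} sequences")
```

Even a 4-site library already contains 160,000 sequences, which is why evaluating an acquisition function over every sequence quickly becomes impractical as N grows.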
- The present invention proposes a Bayesian Optimization-guided Evolutionary Algorithm (BO-EVO) to achieve efficient iteration between machine learning models and robot experiments and to obtain high-value new protein variants economically.
- The method of the present invention combines BO and an evolutionary algorithm (EVO), and uses EVO to avoid the excessive computation caused by BO's brute-force search of the entire design space for the global optimum.
- The exploratory nature of BO is used to offset the greediness and under-exploration of the evolutionary algorithm. Combining EVO and BO is therefore expected to achieve efficient and scalable computation and exploration, and to provide an efficient solution for biological sequence engineering.
- One aspect of the present invention provides a machine learning-guided biological sequence engineering method, which includes the following steps:
- S2) Model training: perform machine learning on the obtained biological sequences and their corresponding fitness data to obtain a fitness prediction model and the model uncertainty for the biological sequences;
- S3) Obtain the first seed: obtain the first seed sequence from the biological sequences and corresponding fitness data obtained in step S1) through fitness-based sampling;
- S4) Mutation subspace generation: randomly mutate the sites to be mutated in the seed sequence obtained in step S3) or S6); the set of mutated biological sequences forms a mutation subspace, in which the single-point mutation rate of random mutation is the reciprocal of the number of sites to be mutated;
- S5) Bayesian optimization of the mutation subspace: Bayesian optimization is applied to the mutation subspace to select a single candidate sequence for experimental query, where the sampling function is the upper confidence bound (UCB) and the surrogate model is the fitness prediction model obtained in S2);
- In step S1), the experimentally measured sequences and corresponding fitness data can be the above-mentioned candidate sequences and the fitness results obtained after testing them, or other known sequences and their corresponding fitness test results.
- The biological sequence is a protein sequence, a polypeptide sequence, a ribonucleic acid sequence, or a deoxyribonucleic acid sequence.
- The first seed sequence obtained may be a natural sequence or a mutant sequence.
- In step S4), the number of mutated biological sequences in the mutation subspace is 1.0-3.0 times the number of sites to be mutated.
- In step S5), the value of the coefficient in UCB is 0.05-0.25.
- In step S7), the preset throughput is between 200 and 500.
- In step S10), the preset number of rounds is at least 2.
- In step S8), the experiments on the batch of candidate sequences to be experimentally queried consist of constructing the biological sequences by engineering methods and testing their fitness values.
- Another aspect of the present invention provides a machine learning-guided biological sequence engineering device, which is a device capable of carrying out steps S1)-S10) above.
- The biological sequence engineering device includes the following modules: a biological sequence and fitness data acquisition module, a machine learning model module, and a biological sequence recommendation module;
- The biological sequence and fitness data acquisition module is used to store and call sequences and their fitness data, convert the called sequences into a digital code, and return the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module;
- The machine learning model module is used to call biological sequences from the biological sequence and fitness data acquisition module, perform machine learning on the called data, and form a prediction model between biological sequence and fitness together with the model uncertainty;
- The biological sequence recommendation module is used to obtain batches of candidate sequences to be experimentally queried under the guidance of the machine learning model module.
- The biological sequence recommendation module can call the data in the biological sequence and fitness data acquisition module, obtain the first seed sequence by fitness-based sampling, randomly mutate the first seed sequence, and use the randomly mutated sequences to form a mutation subspace.
- Bayesian optimization is used to select a single candidate sequence in the mutation subspace, where the sampling function is UCB and the surrogate model is the prediction model formed by the machine learning model module;
- If the uncertainty of the candidate sequence is not higher than 2 times the uncertainty of the first seed sequence of this round, the candidate sequence is used as the next seed sequence of this round and used to form the next mutation subspace of this round; if the uncertainty is higher than 2 times the uncertainty of the first seed sequence of this round, the next seed sequence of this round is obtained from the called sequences by fitness-based sampling. The next cycle is then started, until the preset throughput is reached and a batch of candidate sequences is obtained.
- The biological sequence recommendation module also includes a unit that starts the next round. This unit transmits the generated batch of candidate sequences to the biological sequence and fitness data acquisition module and directs that module to return the corresponding fitness values for the batch; the batch of candidate sequences and their fitness values are merged into the existing biological sequence and fitness data, which are then used to guide the machine learning model module to generate a new prediction model.
- The new prediction model in the machine learning model module is used to guide the next round of recommendations by the biological sequence recommendation module.
- The biological sequence and fitness data acquisition module includes a recording unit for recording information, a coding unit for converting the called sequences into a digital code, and a batch experiment unit that returns the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module.
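The coding unit converts a sequence into a digital code, but the text does not say which encoding is used. The sketch below assumes one-hot encoding of amino acids, a common choice; the names `one_hot_encode` and `AA_INDEX` are hypothetical.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 natural amino acids, one-letter codes
AA_INDEX: Dict[str, int] = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> List[float]:
    """Flatten a protein sequence into a one-hot vector of length 20 * len(sequence).

    Assumption: the patent only says "digital code"; one-hot is used here for illustration.
    """
    vector: List[float] = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1.0  # raises KeyError for non-standard residues
        vector.extend(row)
    return vector


encoded = one_hot_encode("VDGV")  # the four GB1 wild-type residues at the mutated sites
print(len(encoded), "features,", int(sum(encoded)), "non-zero entries")
```

A fixed-length numeric vector of this kind is what a kernel-based model such as a Gaussian process can consume directly.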
- The biological sequence engineering method of the present invention uses batch "dry-wet" iteration for optimization, which not only reduces the cost and volume of wet experiments, but also improves algorithm performance.
- "Dry" or "dry experiment" refers to data collection, training, simulation or prediction by computer;
- "wet" or "wet experiment" refers to actual laboratory experiments.
- The invention iterates between the algorithm and wet experiments: the algorithm guides the wet experiments to explore the fitness terrain and thereby obtain new sequence-fitness data, and the data from the wet experiments update the machine learning model, improving algorithm performance and further guiding wet-experiment exploration.
- Bayesian optimization methods usually use "brute force" to evaluate the entire search space.
- For protein sequences, there are at least 20 types of amino acids (the natural amino acids) as constituent units.
- As a protein sequence is extended, the amount of data required becomes very large.
- Using conventional Bayesian optimization methods to explore the entire space by brute force requires excessive computation, resulting in a significant increase in exploration costs.
- The present invention generates a subspace through an evolutionary method and applies Bayesian optimization within the subspace, which effectively alleviates the scalability problem of Bayesian optimization. Obtaining subspaces in batches and selecting candidate sequences from them also ensures scalability in both scale and data. Therefore, the method of the present invention is fast, requires less computing resources, and can be used for engineering tasks on longer sequences.
- The present invention considers not only the relationship between fitness and sequence but also the uncertainty of the predicted value, and establishes an uncertainty model for it. Outside the support of its training data, a machine learning model has difficulty predicting accurately by extrapolation. Estimating and using the uncertainty of the predicted value therefore effectively prevents the optimization algorithm from exploiting inaccurate model predictions outside the data support.
- The method of the present invention improves the efficiency of biological sample engineering and evolution. By sampling less than 1% of all possible mutations, the method of the present invention can achieve a performance improvement of more than 7 times.
- The Bayesian optimization-guided evolutionary algorithm used in the present invention does not require structures or homologous sequences related to the target.
- Figure 1 is a flow chart of the method of the present invention.
- Figure 2 is a schematic diagram of the BO-EVO solution.
- a is the concept diagram of BO-EVO. Iteratively, candidate descendant sequences are generated by random mutation of the parent sequence, and BO selects one sequence from the candidates.
- The biological sequence recommendation module proposes batches of sequences by interacting with the other two modules. First, sequences and their corresponding fitness data are obtained from known databases or experimental results. Machine learning is performed on the data from the data acquisition module to obtain a prediction model; the prediction model predicts sequences and guides the generation of new sequences to be queried in the recommendation module. The new sequence and fitness data obtained by synthesizing and testing the sequences proposed by the recommendation module are further used to train the prediction model.
- Figure 3 shows the results of the fitness exploration algorithm.
- a is the success rate achieved by each round.
- b is the maximum (top) and average (bottom) fitness of all sequences obtained.
- Some specific embodiments of the present invention provide a machine learning-guided biological sequence engineering method, which includes the following steps:
- The sequences and their corresponding fitness data can be obtained from a known fitness terrain, from a known data set of sequences and corresponding fitness, or from experimentally measured sequences and corresponding fitness data.
- The experimentally measured sequences and corresponding fitness data may be the sequences to be queried obtained according to the method of the present invention and the fitness results obtained after testing them, or other known sequences and their corresponding fitness test results.
- The biological sequence is a protein sequence, a polypeptide sequence, a ribonucleic acid sequence, or a deoxyribonucleic acid sequence.
- When the sequences and their corresponding fitness data come from a known fitness topography,
- the fitness topography can be an empirical fitness topography or a statistical fitness topography.
- S2) Model training: perform machine learning on the obtained biological sequences and their corresponding fitness data to obtain the fitness prediction model and the model uncertainty for the predicted sequences;
- The prediction model can be a Gaussian process model, a Bayesian neural network, an ensemble model, an evidential deep learning model, or another model that predicts uncertainty.
- A Gaussian process regression (GPR) model is used as the prediction model.
- GPR: Gaussian process regression.
- A Gaussian process model is completely described by its mean function and covariance function.
- The mean corresponds to fitness, and the covariance is used as an estimate of uncertainty.
- The kernel used is the RBF kernel.
- The RBF kernel hyperparameters are estimated by the maximum likelihood method: GPyTorch with CUDA acceleration is used as a more efficient Gaussian process implementation than scikit-learn, and gradient descent with the Adam optimizer is used to maximize the marginal likelihood.
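For reference, the standard textbook form of Gaussian process regression with an RBF kernel is written out below. This is a sketch under stated assumptions: a zero prior mean and Gaussian observation noise are assumed, and the symbols ($\sigma_f$ for signal scale, $\ell$ for length scale, $\sigma_n$ for noise) are chosen here and are not taken from the patent, whose own notation did not survive extraction.

```latex
% RBF (squared-exponential) kernel between encoded sequences x and x'
k(x, x') = \sigma_f^2 \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\ell^2} \right)

% Posterior mean (predicted fitness) and variance (uncertainty) at a new sequence x_*,
% with K_{ij} = k(x_i, x_j) over the n training sequences and (k_*)_i = k(x_i, x_*)
\mu(x_*) = k_*^\top \left( K + \sigma_n^2 I \right)^{-1} y
\sigma^2(x_*) = k(x_*, x_*) - k_*^\top \left( K + \sigma_n^2 I \right)^{-1} k_*

% Log marginal likelihood maximized over (\sigma_f, \ell, \sigma_n), e.g. with Adam
\log p(y \mid X) = -\tfrac{1}{2}\, y^\top \left( K + \sigma_n^2 I \right)^{-1} y
                   - \tfrac{1}{2} \log \det\!\left( K + \sigma_n^2 I \right)
                   - \tfrac{n}{2} \log 2\pi
```

The posterior mean plays the role of the fitness prediction and the posterior standard deviation plays the role of the model uncertainty used in the later steps.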
- S3) Obtain the first seed: obtain the first seed sequence, based on fitness, from the biological sequences and corresponding fitness data obtained in step S1);
- The obtained seed sequence can be a natural sequence or a mutant sequence.
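The text says the seed is obtained by sampling based on fitness but does not give the sampling distribution. The sketch below assumes fitness-proportional (roulette-wheel) sampling, which is one plausible reading; `sample_seed` and the example data are hypothetical.

```python
import random
from typing import Dict


def sample_seed(measured: Dict[str, float], rng: random.Random) -> str:
    """Draw one seed sequence with probability proportional to its fitness.

    Assumption: fitness-proportional sampling; negative fitness is clipped to zero.
    Falls back to uniform sampling when every weight is zero.
    """
    sequences = sorted(measured)  # fixed order keeps the draw reproducible
    weights = [max(measured[s], 0.0) for s in sequences]
    if sum(weights) <= 0.0:
        return rng.choice(sequences)
    return rng.choices(sequences, weights=weights, k=1)[0]


measured = {"VDGV": 0.1, "AAAA": 0.0, "VDGA": 0.4}  # hypothetical normalized fitness
seed = sample_seed(measured, random.Random(0))
print("seed:", seed)
```

Sampling rather than always taking the best sequence lets lower-fitness sequences occasionally seed a lineage, which keeps some diversity in the search.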
- S4) Mutation subspace generation: randomly mutate the sites to be mutated in the seed sequence obtained in step S3) or S6).
- The set of mutated biological sequences forms a mutation subspace, and the mutation rate is the reciprocal of the number of sites to be mutated;
- For example, if the sequence to be screened is 50 amino acids long and 4 amino acid sites are to be mutated, the point mutation rate is 1/4 = 0.25.
- The mutated biological sequence is a single-point mutation sequence.
- The number of randomly mutated biological sequences generated in the mutation subspace is 1.0-3.0 times the number of sites to be mutated.
- The number of biological sequences in the mutation subspace can be adjusted as needed. It is usually not necessary to include in the mutation subspace all biological sequences that satisfy the point mutation rate.
- Although the present invention can be carried out with a mutation subspace of any size, the size may affect the efficiency of the method.
- The size of the mutation subspace can be selected according to the extent of expansion of the target sequence, or according to the screening and confirmation of mutation sites given in the prior art.
- For example, 4-12 randomly mutated sequences can be included in the mutation subspace for Bayesian optimization.
- The number of sites to be mutated is counted in the smallest units of the biological sequence:
- the smallest unit in a polypeptide or protein sequence is an amino acid,
- and the smallest unit in a DNA sequence is a nucleotide.
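The mutation subspace generation described in S4) can be sketched as follows. Several details are assumptions made for illustration: each target site is mutated independently with probability 1/(number of sites), duplicates and the unmutated seed are discarded, and the names `mutate` and `mutation_subspace` are hypothetical.

```python
import random
from typing import List, Sequence, Set

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seed: str, sites: Sequence[int], rng: random.Random) -> str:
    """Mutate each target site independently with probability 1 / len(sites)."""
    rate = 1.0 / len(sites)
    residues = list(seed)
    for pos in sites:
        if rng.random() < rate:
            # Always substitute a different residue so a "mutation" changes the site.
            residues[pos] = rng.choice([a for a in AMINO_ACIDS if a != residues[pos]])
    return "".join(residues)


def mutation_subspace(seed: str, sites: Sequence[int], size: int, rng: random.Random) -> List[str]:
    """Collect `size` distinct mutants of `seed` (assumption: the seed itself is excluded)."""
    if size > len(AMINO_ACIDS) ** len(sites) - 1:
        raise ValueError("requested subspace is larger than the number of possible mutants")
    subspace: Set[str] = set()
    while len(subspace) < size:
        child = mutate(seed, sites, rng)
        if child != seed:
            subspace.add(child)
    return sorted(subspace)


sites = [0, 1, 2, 3]  # 4 sites to be mutated -> point mutation rate 0.25
subspace = mutation_subspace("VDGV", sites, size=2 * len(sites), rng=random.Random(0))
print(subspace)
```

With 4 sites, a subspace of 8 sequences sits in the middle of the stated range of 1.0-3.0 times the number of sites, so the surrogate model only has to score a handful of sequences per cycle.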
- S5) Bayesian optimization of the mutation subspace: Bayesian optimization is applied to the mutation subspace to select a single candidate sequence for experimental query, where the sampling function is UCB and the surrogate model is the fitness prediction model obtained in S2);
- The value of the coefficient in UCB is 0.05-0.25, such as 0.1 or 0.2;
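Selecting a single candidate with UCB can be sketched as below. The common form mean + beta * standard deviation is assumed, and the coefficient is called `beta` here because the patent's own symbol is not legible; `predict` is a hypothetical stand-in for the surrogate model and returns a (mean, standard deviation) pair.

```python
from typing import Callable, Iterable, Tuple

Predictor = Callable[[str], Tuple[float, float]]  # sequence -> (mean, std)


def ucb_score(mean: float, std: float, beta: float) -> float:
    """Upper confidence bound: predicted fitness plus a weighted uncertainty bonus."""
    return mean + beta * std


def select_candidate(subspace: Iterable[str], predict: Predictor, beta: float = 0.1) -> str:
    """Return the sequence in the subspace with the highest UCB score."""
    return max(subspace, key=lambda seq: ucb_score(*predict(seq), beta))


# Hypothetical surrogate output for two mutants: (predicted fitness, uncertainty).
toy_predictions = {"VDGA": (0.50, 0.10), "VDGC": (0.45, 1.00)}
chosen = select_candidate(toy_predictions, toy_predictions.__getitem__, beta=0.1)
greedy = select_candidate(toy_predictions, toy_predictions.__getitem__, beta=0.0)
print("UCB choice:", chosen, "| greedy choice:", greedy)
```

In this toy case the uncertain mutant wins under UCB (0.55 against 0.51) while the greedy choice with a zero coefficient picks the mutant with the higher predicted mean, which illustrates how the coefficient trades exploration against exploitation.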
- S6) New seed generation: first, use the prediction model obtained in S2) to predict the uncertainty of the candidate sequence obtained in S5). If the uncertainty is not higher than 2 times the uncertainty of the first seed sequence of this round, the candidate sequence is used as the next seed sequence in this round and is used to form the next mutation subspace; if the uncertainty is higher than 2 times the uncertainty of the first seed sequence of this round, the next seed sequence in this round is obtained according to step S3) and used to form the next mutation subspace;
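The seed update rule in S6) is a simple threshold test and can be sketched as follows. The function name `next_seed` and the `resample` callback (standing in for the fitness-based sampling of S3) are hypothetical.

```python
from typing import Callable


def next_seed(
    candidate: str,
    candidate_std: float,
    first_seed_std: float,
    resample: Callable[[], str],
    factor: float = 2.0,
) -> str:
    """Keep the candidate as the next seed unless its uncertainty is too high.

    The candidate is accepted when its predicted uncertainty does not exceed
    `factor` times the uncertainty of the first seed of the round; otherwise a
    fresh seed is drawn by `resample`.
    """
    if candidate_std <= factor * first_seed_std:
        return candidate
    return resample()


accepted = next_seed("VDGA", candidate_std=0.15, first_seed_std=0.10, resample=lambda: "VDGV")
rejected = next_seed("WWWW", candidate_std=0.30, first_seed_std=0.10, resample=lambda: "VDGV")
print(accepted, rejected)
```

The threshold keeps a lineage from drifting into regions where the surrogate model is extrapolating far beyond its training data.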
- The preset throughput is between 200 and 500 cycles, that is, 200-500 candidate sequences to be experimentally queried are obtained.
- Obtaining one candidate sequence to be experimentally queried constitutes one cycle. After each cycle, step S6) is performed to evaluate and determine the seed of the next cycle.
- The experiments in step S8) are conducted by first using chemical or biological methods to obtain the candidate sequences to be experimentally queried, and then actually testing the fitness value of each candidate sequence.
- In step S8), the experiments on the batch of candidate sequences to be experimentally queried consist of synthesizing the biological sequences by engineering methods and testing their fitness values.
- The preset number of rounds is at least 2.
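The cycle and round structure described above can be tied together in one compact sketch. Everything here is a toy under loud assumptions: the surrogate is a nearest-neighbour stand-in rather than a Gaussian process, seed sampling is assumed to be fitness-proportional, `toy_oracle` replaces the wet experiment, and the throughput is 5 rather than 200-500 so the example runs instantly.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Predictor = Callable[[str], Tuple[float, float]]


def fit_toy_surrogate(data: Dict[str, float]) -> Predictor:
    """Hypothetical stand-in for S2: predict from the nearest measured sequence."""
    def predict(seq: str) -> Tuple[float, float]:
        dist, nearest = min((sum(a != b for a, b in zip(seq, s)), s) for s in data)
        return data[nearest], (dist + 1) / (len(seq) + 1)  # (mean, uncertainty)
    return predict


def sample_seed(data: Dict[str, float], rng: random.Random) -> str:
    """S3, assumed fitness-proportional; uniform when all weights are zero."""
    seqs = sorted(data)
    weights = [max(data[s], 0.0) for s in seqs]
    if sum(weights) <= 0.0:
        return rng.choice(seqs)
    return rng.choices(seqs, weights=weights, k=1)[0]


def mutate(seed: str, sites: Sequence[int], rng: random.Random) -> str:
    """S4: mutate each target site with probability 1 / (number of sites)."""
    residues = list(seed)
    for pos in sites:
        if rng.random() < 1.0 / len(sites):
            residues[pos] = rng.choice([a for a in AMINO_ACIDS if a != residues[pos]])
    return "".join(residues)


def propose_batch(data, sites, throughput, subspace_size, beta, rng) -> List[str]:
    predict = fit_toy_surrogate(data)                       # S2
    seed = sample_seed(data, rng)                           # S3
    first_seed_std = predict(seed)[1]
    batch: List[str] = []
    for _ in range(100 * throughput):                       # safety cap on cycles
        if len(batch) >= throughput:                        # S7: throughput reached
            break
        children = {mutate(seed, sites, rng) for _ in range(subspace_size)}  # S4
        pool = sorted(c for c in children if c not in data and c not in batch)
        if not pool:                                        # nothing new: restart lineage
            seed = sample_seed(data, rng)
            continue
        candidate = max(pool, key=lambda s: predict(s)[0] + beta * predict(s)[1])  # S5
        batch.append(candidate)
        within = predict(candidate)[1] <= 2.0 * first_seed_std                     # S6
        seed = candidate if within else sample_seed(data, rng)
    return batch


def run_rounds(oracle, data, sites, rounds, throughput, rng, subspace_size=8, beta=0.1):
    for _ in range(rounds):                                 # S10: repeat for preset rounds
        batch = propose_batch(data, sites, throughput, subspace_size, beta, rng)
        data.update({seq: oracle(seq) for seq in batch})    # S8 measure, S9 merge
    return data


def toy_oracle(seq: str, target: str = "FWAA") -> float:
    """Hypothetical fitness: fraction of positions matching an arbitrary target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


rng = random.Random(0)
measured = {"VDGV": toy_oracle("VDGV")}
measured = run_rounds(toy_oracle, measured, sites=[0, 1, 2, 3], rounds=2, throughput=5, rng=rng)
print(len(measured), "sequences measured; best fitness", max(measured.values()))
```

The inner loop corresponds to the cycles that fill one batch, and the outer loop corresponds to the dry-wet rounds in which new measurements retrain the model.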
- The biological sequence engineering device includes the following modules: a biological sequence and fitness data acquisition module, a machine learning model module, and a biological sequence recommendation module;
- The biological sequence and fitness data acquisition module is used to store and call sequences and their fitness data, convert the called sequences into a digital code, and return the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module;
- The biological sequence and fitness data acquisition module includes a recording unit for recording information, a coding unit for converting the called sequences into a digital code, and a batch experiment unit that returns the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module.
- The machine learning model module is used to call biological sequences from the biological sequence and fitness data acquisition module, perform machine learning and training on the called data, and form a prediction model between biological sequence and fitness together with the model uncertainty.
- The biological sequence recommendation module is used to obtain batches of candidate sequences to be experimentally queried under the guidance of the machine learning model module.
- The biological sequence recommendation module can call the data in the biological sequence and fitness data acquisition module, obtain the first seed sequence by fitness-based sampling, randomly mutate the first seed sequence, and use the randomly mutated sequences to form a mutation subspace.
- Bayesian optimization is used to select a single candidate sequence in the mutation subspace, where the sampling function is UCB and the surrogate model is the prediction model formed by the machine learning model module;
- If the uncertainty of the candidate sequence is not higher than 2 times the uncertainty of the first seed sequence of this round, the candidate sequence is used as the next seed sequence of this round and used to form the next mutation subspace of this round; if the uncertainty is higher than 2 times the uncertainty of the first seed sequence of this round, the next seed sequence of this round is obtained from the called sequences by fitness-based sampling. The next cycle is then started, until the preset throughput is reached and a batch of candidate sequences is obtained.
- The biological sequence recommendation module also includes a unit that starts the next round, which transmits the generated batch of candidate sequences to the biological sequence and fitness data acquisition module and directs that module to return the corresponding fitness values for the batch; the batch of candidate sequences and their fitness values are merged into the existing biological sequence and fitness data, which are then used to guide the machine learning model module to generate a new prediction model.
- The new prediction model in the machine learning model module is used to guide the next round of recommendations by the biological sequence recommendation module.
- The GB1 terrain is a combinatorial empirical fitness terrain of 4 sites (V39, D40, G41 and V54) of the B1 domain of protein G; its wild-type (WT) sequence is from the Protein Data Bank (PDB ID: 2GI9).
- A total of 149,361 sequences were measured experimentally, and the fitness of the remaining sequences was estimated from the measured data.
- The fitness of a sequence is determined by stability (i.e. fraction of folded protein) and function (i.e. binding affinity to IgG Fc). Fitness is normalized by the difference between the global maximum and the global minimum so that values lie between 0 and 1; the WT fitness is approximately 0.1.
- step S3 Obtain the first seed: Obtain the first seed sequence from the biological sequence obtained in step S1) and its corresponding fitness data through fitness-based sampling;
- Mutation subspace generation random mutation is performed on the seed sequence obtained in step S3) or S6), and the set of mutated biological sequences forms a mutation subspace, in which the single-point mutation rate of random mutation is the reciprocal of the number of sites to be mutated, 4; The number of biological sequences in the subspace is 1-3 times the number of sites to be mutated, 4. Within this range, 8 is selected for experiments;
- S6 New seed generation: First, use the prediction model obtained in S2) to predict the uncertainty of the candidate sequence obtained in S5). If the uncertainty is not higher than 2 times the uncertainty of the first seed sequence of this round, then the candidate The sequence is used as the first seed sequence of the next round and is used to form the mutation subspace of the next round; if the uncertainty is higher than 2 times the uncertainty of the first seed sequence of this round, the first seed sequence of the next round is obtained according to step S3) ;
- the BO-EVO algorithm suitable for biological sequence transformation was developed through GB1 fitness terrain. This algorithm can greatly increase the calculation speed and reduce the calculation requirements.
- Example 2 Use empirical fitness terrain PhoQ and use NK model as protein fitness terrain to verify the BO-EVO algorithm
- the BO-EVO algorithm obtained in Example 1 was further verified using different fitness terrains.
- the PhoQ terrain is similar to the GB1 terrain and is also a 4-point experience-based terrain. PhoQ uses enrichment rate as fitness. Using the fitness normalization strategy, the WT fitness is approximately 0.02.
- NK terrain is modified from the original NK terrain.
- Example 2 A similar method to Example 1 is used to further confirm the BO-EVO algorithm of the present invention with PhoQ terrain and NK terrain. Experimental results show that the BO-EVO algorithm of the present invention can be applied to various types of terrain.
- RhlA is a key enzyme in the synthesis of the lipid moiety of the important biosurfactant rhamnolipid (RL).
- the enzyme specificity of RhlA determines the chemical structure of the lipid part, which in turn affects the physical, chemical and biological activities of the corresponding RL molecules.
- Rha-C18, Rha-C20, Rha-C22 and Rha-C24 were quantified.
- Rha-C20 is the main product of E. coli cells containing wild-type RhlA (sequence UniProt ID: Q51559). This experiment aims to increase the yield and proportion of the product Rha-C18 (the fitness) by engineering RhlA. Using the Rha-C18 yield and proportion of wild-type RhlA as the reference, fitness was standardized so that the fitness of wild-type RhlA is 1.
- Example 4: Comparative experiments on fitness landscape exploration algorithms
- a standalone evolutionary algorithm (AdaLead) and a standalone BO algorithm are used as baselines for BO-EVO to test whether combining the two exploration strategies is necessary.
- the performance of random mutation (Random) was also evaluated as a baseline and is expected to be the worst of the four algorithms (Figure 2).
- AdaLead is better than random mutation, but this purely evolutionary algorithm is clearly inferior to BO-EVO and BO, both of which use the UCB acquisition function to balance exploration and exploitation.
- by combining BO and EVO, the invention greatly reduces the number of sequences evaluated by the surrogate model, making it fast and light on computing resources. Selecting top sequences by UCB also better balances exploration and exploitation.
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Organic Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Biotechnology (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biochemistry (AREA)
- Wood Science & Technology (AREA)
- Zoology (AREA)
- Biophysics (AREA)
- Microbiology (AREA)
- Plant Pathology (AREA)
- Physics & Mathematics (AREA)
- Chemical Kinetics & Catalysis (AREA)
- General Chemical & Material Sciences (AREA)
- Medicinal Chemistry (AREA)
- Peptides Or Proteins (AREA)
Abstract
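The present invention provides a machine-learning-guided method and device for engineering biological sequences. Specifically, it provides a Bayesian-optimization-guided evolutionary algorithm (BO-EVO) that combines Bayesian optimization (BO) with an evolutionary algorithm (EVO). EVO addresses the excessive computational cost that BO incurs when it locates the global optimum by brute-force search over the entire design space, while the exploratory nature of BO offsets the greediness and under-exploration of EVO. This enables efficient iteration between machine learning models and robotic experiments, so that high-value new protein variants can be obtained economically. The method of the invention is expected to achieve efficient and scalable computation and exploration and to provide an efficient scheme for engineering biological sequences.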
Description
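The present invention belongs to the field of bioengineering and specifically relates to a machine-learning-guided method and device for engineering biological sequences.
Directed evolution of proteins applies multiple rounds of gene mutagenesis and phenotypic screening and is usually limited by the throughput of experimental techniques. Using model-based virtual screening, machine learning has reduced the screening burden to a level acceptable for manual experiments. Robotics, on the other hand, allows protein engineering to screen in large batches and iterate quickly; however, it has not yet been applied in machine-learning-assisted directed evolution.
A protein fitness landscape is a metaphorical, high-dimensional surface that relates amino acid sequences to a property of interest, namely fitness. Exploring this landscape by protein engineering is highly challenging: (1) the search space grows exponentially with the length of the amino acid sequence; (2) functional proteins are extremely rare; (3) the number of high-value protein sequences decays exponentially as fitness increases; (4) epistasis makes the fitness landscape rugged; (5) experimental testing is expensive, inefficient and labor-intensive. As an effective protein engineering strategy, directed evolution applies multiple rounds of mutant library construction and screening, but fixes only the best mutation in each round. Such a greedy strategy can become trapped in local optima, especially on rugged fitness landscapes.
Machine learning algorithms are increasingly used both to model protein fitness landscapes and to guide protein engineering. To learn sequence-structure-function relationships from labeled data, supervised learning is used to predict a variety of properties, including three-dimensional structure, thermostability, fluorescence intensity, ligand binding affinity and catalytic performance. Meanwhile, gene and protein sequences are accumulating in public databases (e.g. UniProt) at an unprecedented rate. Unsupervised models such as UniRep, TAPE, ESM-1v and ProtT5-XL-U50 learn protein representations by discovering hidden patterns in large amounts of unlabeled protein sequence data. Despite initial proofs of concept, existing fitness models are still limited by data scarcity and bias. For example, evolutionary data lack label diversity because sequences with low fitness and with extremely high fitness are missing. Likewise, because functional characterization is expensive and laborious, sequence-function labeled data cover only local regions of the design space. Both cause the learned models to behave pathologically in regions not covered by data.
To guide protein engineering, both supervised and unsupervised models have been used to improve sample efficiency. For example, machine-learning-assisted directed evolution (MLDE) first pre-screens all possible sequences in the design space with a physics-based ddG method to obtain 384 candidate sequences for experimental characterization, trains a fitness model on them, and then uses the model to prioritize the best 96 sequences. On the GB1 dataset, the success rate of MLDE is 81 times that of directed evolution. In addition, ECNet uses homologous sequence information as supervision to predict higher-order mutational effects and successfully engineered TEM-1 β-lactamase, achieving an 8-fold increase in ampicillin resistance with 2-6 mutations. Furthermore, by learning protein representations without supervision from more than 20 million natural sequences, low-N protein engineering obtained a fitness model from only 24 sequence-function data points. Optimizing this fitness model by simulated annealing successfully identified improved proteins on the GFP and TEM-1 β-lactamase datasets. Notably, constrained by manual labor, existing machine-learning-guided protein engineering methods all try to reduce experimental throughput and the number of iteration rounds.
Meanwhile, in bioengineering, biofoundries accelerate the closed "design-build-test-learn" loop through physical and informational automation. Applied to protein engineering, automation allows large-scale library construction and screening in a short time, so large batches of high-quality new sequence-function data can be collected iteratively to improve model predictions and sequence design. However, this data capability has not been fully exploited in algorithm-guided protein engineering.
Bayesian optimization is a principled method for optimizing "black-box" functions. It obtains candidates by optimizing a given acquisition function, which is responsible for balancing exploration and exploitation. Bayesian optimization is widely used in protein engineering and in pathway and fermentation strain design.
To use Bayesian optimization to guide protein engineering on combinatorial fitness landscapes, it needs to be improved. First, although BO acquires only one sample at a time, batches of samples can be built and tested in parallel by robots. Batch BO has been used to design biological sequences. Batch methods fall into two classes: generating the batch iteratively, or acquiring the whole batch at once. However, how to balance exploration and exploitation within a single batch remains an open question. Second, BO locates the global optimum by brute-force search over the entire design space; for a combinatorial mutant library of a protein (N sites, a design space of about 20^N), the computational cost of BO grows exponentially with the dimension of the problem.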
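Summary of the Invention
The invention addresses how to balance exploration within a single batch during batch processing, how to reduce the excessive computational cost of BO on mutant protein libraries, and the problems of ill-conditioned machine learning models, insufficient exploration and limited scalability in biological sequence design tasks. It proposes a Bayesian-optimization-guided evolutionary algorithm (BO-EVO) to enable efficient iteration between machine learning models and robotic experiments, so that high-value new protein variants can be obtained economically. The method combines BO and EVO: EVO addresses the excessive computational cost that BO incurs when it locates the global optimum by brute-force search over the entire design space, while the exploratory nature of BO offsets the greediness and under-exploration of the evolutionary algorithm. Combining EVO and BO is therefore expected to achieve efficient and scalable computation and exploration and to provide an efficient scheme for engineering biological sequences.
One aspect of the present invention provides a machine-learning-guided method for engineering biological sequences, comprising the following steps:
S1) Data acquisition: obtain biological sequences and their corresponding fitness data;
S2) Model training: perform machine learning on the obtained biological sequences and corresponding fitness data to obtain a fitness prediction model for biological sequences together with the model uncertainty;
S3) Obtaining the first seed: obtain the first seed sequence by fitness-based sampling from the biological sequences and corresponding fitness data obtained in step S1);
S4) Mutation subspace generation: randomly mutate the sites to be mutated in the seed sequence obtained in step S3) or S6); the set of mutated biological sequences forms the mutation subspace, where the single-site mutation rate of the random mutation is the reciprocal of the number of sites to be mutated;
S5) Bayesian optimization on the mutation subspace: use Bayesian optimization on the mutation subspace to select a single candidate sequence for experimental query, where the acquisition function is the upper confidence bound (UCB) and the surrogate model is the fitness prediction model obtained in S2);
S6) New seed generation: first use the prediction model obtained in S2) to predict the uncertainty of the candidate sequence obtained in S5). If this uncertainty is not higher than twice the uncertainty of the first seed sequence of the current round, the candidate sequence becomes the next seed sequence of the current round and is used to form the next mutation subspace; if it is higher than twice the uncertainty of the first seed sequence of the current round, the next seed sequence of the current round is obtained according to step S3) and used to form the next mutation subspace;
S7) Repeat the cycle S4) to S6) until the preset throughput is reached, obtaining a batch of candidate sequences for experimental query;
S8) Carry out experiments on the batch of candidate sequences obtained in S7): obtain the candidate sequences and measure their corresponding fitness values;
S9) Add the candidate sequences and their corresponding fitness values to the biological sequence and fitness data of S1) and use them for the next round of model training, obtaining the next round's fitness prediction model and model uncertainty;
S10) Use the latest seed sequence obtained in S6) as the first seed sequence of the next round, and repeat S4) to S9) until the preset number of rounds is reached or the desired mutant is found.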
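Further, in step S1), the biological sequences and corresponding fitness data are obtained from a known fitness landscape, from a known dataset of sequences and corresponding fitness, or from experimentally measured sequences and corresponding fitness data.
Still further, in step S1), the sequences and corresponding fitness data obtained from experimental measurement may be the above candidate sequences together with the fitness results obtained by testing them, or other known experimental results for sequences and corresponding fitness.
Further, the sequence is a protein sequence, a polypeptide sequence, a ribonucleic acid sequence or a deoxyribonucleic acid sequence.
Further, in step S2), the prediction model is selected from a Gaussian process model, a Bayesian neural network prediction model, an ensemble model, an evidential deep learning model, or another model that predicts uncertainty.
Further, in step S2), the sequences are converted to a numerical representation before the prediction model is trained.
Further, in step S3), the first seed sequence obtained may be a natural sequence or a mutant sequence.
Further, in step S4), the number of mutated biological sequences in the mutation subspace is 1.0-3.0 times the number of sites to be mutated.
Further, in step S5), the value of β in the UCB is 0.05-0.25.
Further, in step S7), the preset throughput is limited to between 200 and 500.
Further, in step S10), the preset number of rounds is at least 2.
Further, in step S8), the experiments on the batch of candidate sequences are carried out by constructing the biological sequences and testing their fitness values using engineered methods.
Further, in step S8), the experiments are carried out by first obtaining the candidate sequences for experimental query by chemical or biological methods and then actually measuring the fitness value of each candidate sequence.
Another aspect of the present invention provides a machine-learning-guided device for engineering biological sequences, the device being capable of implementing steps S1)-S10) above.
The device comprises the following modules: a biological sequence and fitness data acquisition module, a machine learning model module, and a biological sequence recommendation module;
The biological sequence and fitness data acquisition module is used to store and retrieve sequences and their fitness data, to convert retrieved sequences into numerical encodings, and to return the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module;
The machine learning model module is used to retrieve biological sequences from the biological sequence and fitness data acquisition module, perform machine learning on the retrieved data, and form a prediction model relating biological sequences to their fitness, together with the model uncertainty;
The biological sequence recommendation module is used to obtain, under the guidance of the machine learning model module, a batch of candidate sequences for experimental query. It can retrieve data from the biological sequence and fitness data acquisition module, obtain the first seed sequence by fitness-based sampling, randomly mutate the first seed sequence, form a mutation subspace from the random mutants, and use Bayesian optimization within the mutation subspace to select a single candidate sequence, where the acquisition function is UCB and the surrogate model is the prediction model formed by the machine learning model module;
It then determines the uncertainty of the single candidate sequence. If the uncertainty is not higher than twice the uncertainty of the first seed sequence of the current round, the candidate sequence becomes the next seed sequence of the current round and is used to form the mutation subspace of the current round; if it is higher than twice the uncertainty of the first seed sequence of the current round, the next seed sequence of the current round is obtained from the retrieved sequences by fitness-based sampling. The next cycle is then started, until the preset throughput is reached and a batch of candidate sequences is obtained.
Further, the biological sequence recommendation module also contains a unit that starts the next round. This unit transmits the generated batch of candidate sequences to the biological sequence and fitness data acquisition module and directs that module to return the corresponding fitness values for the batch; the batch of candidate sequences and their fitness values are merged into the existing biological sequence and fitness data and used to guide the machine learning model module in generating a new prediction model, which in turn guides the next round of recommendations by the biological sequence recommendation module.
Further, the biological sequence and fitness data acquisition module comprises a recording unit that records information, an encoding unit that converts retrieved sequences into numerical encodings, and a batch experiment unit that returns the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module.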
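1) The biological sequence engineering method of the invention optimizes through batched "dry-wet" iteration, which both reduces the cost and volume of wet experiments and improves algorithm performance. Here "dry" or "dry experiment" means data collection, training, simulation or prediction by computer, and "wet" or "wet experiment" means actual laboratory experiments. The invention iterates between algorithm and wet experiment: the algorithm guides wet experiments in exploring the fitness landscape, yielding new sequence-fitness data; the data from wet experiments update the machine learning model, which improves algorithm performance and further guides wet-lab exploration.
2) In conventional Bayesian optimization, the acquisition function is usually optimized by "brute-force" evaluation of the entire search space. For protein sequences, the building blocks (amino acids) come in at least 20 types (the natural amino acids), so the amount of data becomes very large as the sequence lengthens; brute-force exploration of the whole space with conventional Bayesian optimization demands too much computation and greatly raises the cost of exploration. The invention generates subspaces by an evolutionary method and applies Bayesian optimization within each subspace, which effectively alleviates the scalability problem of Bayesian optimization; obtaining subspaces in batches and screening candidate sequences from them preserves both scale and data extensibility. The method is therefore fast, requires few computing resources, and can be used for engineering longer sequences.
3) In modeling, the invention considers not only the relationship between fitness and sequence but also the uncertainty of the predicted value, and builds an uncertainty model for it. Because machine learning models are difficult to extrapolate accurately outside the support of the data, estimating and using the uncertainty of predictions effectively prevents the optimization algorithm from "exploiting" the model's inaccurate predictions outside the data support.
4) The method improves the efficiency of modifying and evolving biological samples: by sampling fewer than 1% of all possible mutants, it achieves a more than 7-fold improvement in performance.
5) The Bayesian-optimization-guided evolutionary algorithm used in the invention does not require structures or homologous sequences related to the target.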
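Figure 1 is a flowchart of the method of the invention.
Figure 2 is a schematic diagram of the BO-EVO scheme.
a: conceptual diagram of BO-EVO. Iteratively, candidate offspring sequences are generated by random mutation of a parent sequence, and BO selects one sequence from the candidates.
b: the modules of BO-EVO, comprising the biological sequence and fitness data acquisition module, the machine learning model module and the biological sequence recommendation module. The recommendation module proposes batches of sequences by interacting with the other two modules. Sequences and their fitness data are first obtained from known databases or experimental results. Machine learning on the data from the acquisition module yields a prediction model; the model makes predictions for sequences and guides the recommendation module in generating new sequences to query; synthesizing and assaying the sequences from the recommendation module yields new sequence and fitness data, which are fed back into training the prediction model.
Figure 3 shows the results of the fitness exploration algorithms.
a: success rate achieved up to each round.
b: maximum (top) and mean (bottom) fitness of all sequences obtained.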
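To make the above objects, features and advantages of the invention clearer, specific embodiments of the invention are described in detail below; they should not be construed as limiting the scope in which the invention can be practiced.
With reference to Figure 1 and Figure 2a, some specific embodiments of the invention provide a machine-learning-guided method for engineering biological sequences, comprising the following steps:
S1) Data acquisition: obtain biological sequences and their corresponding fitness data;
The sequences and corresponding fitness data may be obtained from a known fitness landscape, from a known dataset of sequences and corresponding fitness, or from experimentally measured sequences and corresponding fitness data. The experimentally obtained sequences and fitness data may be the query sequences obtained by the method of the invention together with the fitness results obtained by testing them, or other known experimental results for sequences and corresponding fitness.
The biological sequence is a protein sequence, a polypeptide sequence, a ribonucleic acid sequence or a deoxyribonucleic acid sequence.
In a specific embodiment, the sequences and corresponding fitness data come from a known fitness landscape, which may be an empirical or a statistical fitness landscape, for example the empirical 4-site combinatorial (20^4 = 160,000) GB1 fitness landscape, the empirical fitness landscape PhoQ and the statistical NK landscape. Combined with wet-lab engineering of the RhlA enzyme, the use of different fitness landscapes demonstrates that the method of the invention is general.
S2) Model training: perform machine learning on the obtained biological sequences and corresponding fitness data to obtain a fitness prediction model for sequences together with the model uncertainty;
The prediction model may be a Gaussian process model, a Bayesian neural network prediction model, an ensemble model, an evidential deep learning model, or another model that predicts uncertainty.
In a specific embodiment, Gaussian process regression (GPR) is used as the prediction model. A Gaussian process is fully described by its mean function and covariance function. The mean corresponds to the fitness, and the covariance serves as the estimate of uncertainty. The kernel used is the RBF kernel, whose parameters k and γ are estimated by maximum likelihood. GPyTorch with CUDA acceleration is used as a more efficient Gaussian process implementation than scikit-learn, and the marginal likelihood is maximized by gradient descent with the Adam optimizer.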
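For orientation, the relations behind a GPR surrogate of this kind can be written out as below. This is a sketch under assumptions: the RBF kernel is taken in one common parameterisation in which $k$ is an output scale and $\gamma$ an inverse squared length scale (the text names the parameters but not their roles), the prior mean is taken as zero, and $\sigma_n^2$ is an assumed observation-noise variance. Here $K$ is the kernel matrix over the training sequences, $\mathbf{y}$ their fitness values, and $\mathbf{k}_*$ the kernel vector between a new sequence encoding $x_*$ and the training set.

```latex
\begin{aligned}
k_{\mathrm{RBF}}(x, x') &= k \exp\!\left(-\gamma \,\lVert x - x' \rVert^{2}\right) \\
\mu(x_*) &= \mathbf{k}_*^{\top} \left(K + \sigma_n^{2} I\right)^{-1} \mathbf{y} \\
\sigma^{2}(x_*) &= k_{\mathrm{RBF}}(x_*, x_*) - \mathbf{k}_*^{\top} \left(K + \sigma_n^{2} I\right)^{-1} \mathbf{k}_*
\end{aligned}
```

Under these assumptions, $\mu(x_*)$ plays the role of the predicted fitness and $\sigma(x_*)$ the role of the model uncertainty used in steps S5) and S6).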
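During training of the prediction model, sequences must be converted to numerical form; for example, amino acid sequences are converted to a numerical representation with the ESM protein language model.
S3) Obtaining the first seed: obtain the first seed sequence by fitness-based sampling from the biological sequences and corresponding fitness data obtained in step S1);
The seed sequence obtained may be a natural sequence or a mutant sequence.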
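The text does not spell out how "fitness-based sampling" weights the sequences. The minimal sketch below assumes the simplest reading, in which a seed is drawn with probability proportional to its non-negative fitness; the function name `sample_seed` and the uniform fallback are illustrative assumptions, not part of the described method.

```python
import random


def sample_seed(sequences, fitness, rng=random):
    """Draw one seed with probability proportional to its fitness.

    Assumption: proportional weighting; negative fitness is clipped to zero.
    `rng` may be the `random` module or a `random.Random` instance.
    """
    if not sequences or len(sequences) != len(fitness):
        raise ValueError("sequences and fitness must be non-empty and aligned")
    weights = [max(f, 0.0) for f in fitness]
    if sum(weights) == 0.0:
        # No sequence has positive fitness: fall back to a uniform draw.
        return rng.choice(sequences)
    return rng.choices(sequences, weights=weights, k=1)[0]
```

Other weightings (for example rank-based or temperature-scaled) would fit the same interface; the proportional rule is chosen here only because it is the shortest to state.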
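S4) Mutation subspace generation: randomly mutate the sites to be mutated in the seed sequence obtained in step S3) or S6); the set of mutated biological sequences forms the mutation subspace, and the mutation rate is the reciprocal of the number of sites to be mutated;
For example, if the sequence to be screened is 50 amino acids long and 4 amino acid sites are to be mutated, the point mutation rate is the reciprocal of the number of sites to be mutated (4 amino acids), i.e. 0.25; in other words, across the sites to be mutated, a mutated biological sequence is in expectation a single-point mutant. Limiting the mutation rate of the random mutation keeps the subspace local.
In some specific embodiments, the number of randomly mutated biological sequences generated in the mutation subspace is 1.0-3.0 times the number of sites to be mutated. The number of sequences in the mutation subspace can be adjusted as needed; it is usually unnecessary to include every sequence that satisfies the point mutation rate. Although the invention can be carried out with a mutation subspace of any size, the size may affect the efficiency of the method. It can be chosen according to how far the target sequence is to be extended, or confirmed by screening the possible mutation sites reported in the prior art. For example, if the extended sequence range is a protein that may carry 4 amino acid mutations (i.e. there are 4 amino acid sites to be mutated), 4-12 randomly mutated sequences can be included in the mutation subspace for Bayesian optimization. The number of sites to be mutated is counted in the smallest unit of the biological sequence at the sites to be mutated: for polypeptide or protein sequences the smallest unit is the amino acid, whereas for DNA sequences it is the nucleotide.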
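Step S4) can be sketched as follows for a protein with 4 target sites. The per-site rate of 1/(number of sites) and the subspace size come from the text; the helper names, the choice to exclude the unmutated seed, and the choice to keep only distinct mutants are illustrative assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def mutate(seed, sites, rng):
    """Mutate each target site independently with probability 1/len(sites)."""
    rate = 1.0 / len(sites)
    residues = list(seed)
    for pos in sites:
        if rng.random() < rate:
            # Assumption: the replacement always differs from the current residue.
            residues[pos] = rng.choice([a for a in AMINO_ACIDS if a != residues[pos]])
    return "".join(residues)


def mutation_subspace(seed, sites, size, rng, max_tries=10_000):
    """Collect up to `size` distinct mutants of `seed`, excluding the seed itself."""
    subspace = set()
    tries = 0
    while len(subspace) < size and tries < max_tries:
        candidate = mutate(seed, sites, rng)
        if candidate != seed:
            subspace.add(candidate)
        tries += 1
    return sorted(subspace)
```

With 4 sites, `size` would be chosen between 4 and 12 according to the 1.0-3.0x rule; Example 1 uses 8.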
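S5) Bayesian optimization on the mutation subspace: use Bayesian optimization on the mutation subspace to select a single candidate sequence for experimental query, where the acquisition function is UCB and the surrogate model is the fitness prediction model obtained in S2);
In some specific embodiments, the value of β in the UCB is 0.05-0.25, for example 0.1 or 0.2;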
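Because the subspace is small, the acquisition function in S5) can simply be evaluated on every member and the maximizer returned. The sketch below assumes the UCB form mean + β·std (some formulations use the square root of β instead; the text gives only the range of β) and a hypothetical `predict(seq)` callable that returns the surrogate's mean and standard deviation.

```python
def ucb(mean, std, beta=0.2):
    """Upper confidence bound; assumed form: mean + beta * std."""
    return mean + beta * std


def select_candidate(subspace, predict, beta=0.2):
    """Return the sequence in `subspace` with the highest UCB score.

    `predict(seq)` is a hypothetical surrogate interface returning (mean, std).
    """
    def score(seq):
        mean, std = predict(seq)
        return ucb(mean, std, beta)

    return max(subspace, key=score)
```

A larger β shifts the choice toward uncertain sequences (exploration), while β = 0 reduces to picking the highest predicted fitness (exploitation).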
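S6) New seed generation: first use the prediction model obtained in S2) to predict the uncertainty of the candidate sequence obtained in S5). If this uncertainty is not higher than twice the uncertainty of the first seed sequence of the current round, the candidate sequence becomes the next seed sequence of the current round and is used to form the next mutation subspace; if it is higher than twice the uncertainty of the first seed sequence of the current round, the next seed sequence of the current round is obtained according to step S3) and used to form the next mutation subspace;
S7) Repeat the cycle S4) to S6) until the preset throughput is reached, obtaining a batch of candidate sequences for experimental query;
The preset throughput is limited to between 200 and 500 cycles, i.e. 200-500 candidate sequences for experimental query are obtained.
Obtaining one candidate sequence for experimental query constitutes one cycle. After a cycle is completed, step S6) is also applied to evaluate and determine the first seed of the next cycle.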
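The cycle S4) to S7), including the 2x uncertainty rule of S6), can be sketched as a single loop. All callables (`make_subspace`, `select`, `uncertainty`, `resample_seed`) are hypothetical interfaces standing in for the steps described above; whether repeated candidates are filtered from the batch is not stated in the text, so this sketch keeps them as they come.

```python
def build_batch(first_seed, make_subspace, select, uncertainty, resample_seed, throughput):
    """Collect `throughput` candidates for one round and return (batch, last_seed).

    make_subspace(seed) -> list of mutants        (S4)
    select(subspace)    -> one candidate          (S5)
    uncertainty(seq)    -> surrogate uncertainty  (S6)
    resample_seed()     -> fitness-based seed     (S3, used as fallback in S6)
    """
    # The threshold is tied to the first seed of the round and stays fixed.
    threshold = 2.0 * uncertainty(first_seed)
    seed, batch = first_seed, []
    while len(batch) < throughput:
        candidate = select(make_subspace(seed))
        batch.append(candidate)
        if uncertainty(candidate) <= threshold:
            seed = candidate          # still in a region the model can be trusted
        else:
            seed = resample_seed()    # too uncertain: restart from known data
    return batch, seed
```

The fallback to `resample_seed()` is what keeps the random walk from drifting into regions where the surrogate extrapolates poorly.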
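S8) Carry out experiments on the batch of candidate sequences obtained in S7): obtain the candidate sequences and measure their corresponding fitness values.
In step S8), the experiments are carried out by first obtaining the candidate sequences for experimental query by chemical or biological methods and then actually measuring the fitness value of each candidate sequence.
In step S8), the experiments on the batch of candidate sequences are carried out by synthesizing the biological sequences and testing their fitness values using engineered methods.
S9) Add the candidate sequences and their corresponding fitness values to the biological sequence and fitness data of S1) and use them for the next round of model training, obtaining the next round's fitness prediction model and model uncertainty;
S10) Use the latest seed sequence obtained in S6) as the first seed sequence of the next round, and repeat S4) to S9) until the preset number of rounds is reached or the desired mutant is found.
Obtaining the experimental results for one batch of candidate sequences constitutes one round; the resulting sequences and fitness data are added to the biological sequence and fitness data of S1) as the start of the next round. In the next round, the augmented set of sequences and fitness data is used to train that round's prediction model; starting with the second round, the seed sequence obtained in step S6) of the first round serves as the first seed sequence of the round, and the cycles of that round are carried out.
The preset number of rounds is at least 2.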
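The round structure of S8) to S10) amounts to a short outer loop around model training, batch proposal and measurement. In this sketch `train`, `build_batch` and `measure` are hypothetical callables, and `target` is an assumed way of expressing "the desired mutant has been found" as a fitness threshold.

```python
def run_rounds(data, first_seed, train, build_batch, measure, n_rounds=2, target=None):
    """Iterate dry (model) and wet (measurement) steps for up to `n_rounds` rounds.

    data               : dict mapping sequence -> fitness            (S1)
    train(data)        -> surrogate model                            (S2, S9)
    build_batch(m, s)  -> (batch, last_seed) for model m and seed s  (S4-S7)
    measure(batch)     -> dict mapping sequence -> measured fitness  (S8)
    """
    data = dict(data)  # work on a copy so the caller's data is untouched
    seed = first_seed
    for _ in range(n_rounds):
        model = train(data)                      # refit on all data so far
        batch, seed = build_batch(model, seed)   # seed carries over between rounds (S10)
        data.update(measure(batch))              # merge new measurements (S9)
        if target is not None and max(data.values()) >= target:
            break                                # desired mutant found
    return data, seed
```

Keeping the measurement step behind a callable makes it easy to swap a lookup in a known landscape for a robotic assay.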
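With reference to Figure 2b, some specific embodiments of the invention also provide a machine-learning-guided device for engineering biological sequences, the device being capable of implementing steps S1)-S10) above.
The device comprises the following modules: a biological sequence and fitness data acquisition module, a machine learning model module, and a biological sequence recommendation module;
The biological sequence and fitness data acquisition module is used to store and retrieve sequences and their fitness data, to convert retrieved sequences into numerical encodings, and to return the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module;
Specifically, the biological sequence and fitness data acquisition module comprises a recording unit that records information, an encoding unit that converts retrieved sequences into numerical encodings, and a batch experiment unit that returns the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module.
The machine learning model module is used to retrieve biological sequences from the biological sequence and fitness data acquisition module, perform machine learning and training on the retrieved data, and form a prediction model relating biological sequences to their fitness, together with the model uncertainty;
The biological sequence recommendation module is used to obtain, under the guidance of the machine learning model module, a batch of candidate sequences for experimental query. It can retrieve data from the biological sequence and fitness data acquisition module, obtain the first seed sequence by fitness-based sampling, randomly mutate the first seed sequence, form a mutation subspace from the random mutants, and use Bayesian optimization within the mutation subspace to select a single candidate sequence, where the acquisition function is UCB and the surrogate model is the prediction model formed by the machine learning model module;
It then determines the uncertainty of the single candidate sequence. If the uncertainty is not higher than twice the uncertainty of the first seed sequence of the current round, the candidate sequence becomes the next seed sequence of the current round and is used to form the mutation subspace of the current round; if it is higher than twice the uncertainty of the first seed sequence of the current round, the next seed sequence of the current round is obtained from the retrieved sequences by fitness-based sampling. The next cycle is then started, until the preset throughput is reached and a batch of candidate sequences is obtained.
The biological sequence recommendation module also contains a unit that starts the next round. This unit transmits the generated batch of candidate sequences to the biological sequence and fitness data acquisition module and directs that module to return the corresponding fitness values for the batch; the batch of candidate sequences and their fitness values are merged into the existing biological sequence and fitness data and used to guide the machine learning model module in generating a new prediction model, which in turn guides the next round of recommendations by the biological sequence recommendation module.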
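Example 1: Developing the BO-EVO algorithm on the GB1 fitness landscape
S1) Data acquisition: biological sequences and their corresponding fitness data are obtained from the GB1 fitness landscape. The GB1 landscape is a combinatorial empirical fitness landscape over 4 sites (V39, D40, G41 and V54) of the protein G B1 domain; its wild-type (WT) sequence is from the Protein Data Bank (PDB ID: 2GI9). The landscape consists of 20^4 = 160,000 sequences; 149,361 sequences were measured experimentally and the rest were imputed from the measured data. The fitness of a sequence is determined by stability (i.e. the fraction of folded protein) and function (i.e. binding affinity to IgG Fc). Fitness is normalized using the difference between the global maximum and the global minimum so that values lie between 0 and 1; the WT fitness is about 0.1.
S2) Model training: machine learning is performed on the obtained biological sequences and corresponding fitness data, using Gaussian process regression (GPR), to obtain a fitness prediction model for biological sequences together with the model uncertainty;
S3) Obtaining the first seed: obtain the first seed sequence by fitness-based sampling from the biological sequences and corresponding fitness data obtained in step S1);
S4) Mutation subspace generation: the seed sequence obtained in step S3) or S6) is randomly mutated, and the set of mutated biological sequences forms the mutation subspace. The single-site mutation rate is the reciprocal of the number of sites to be mutated (4); the number of biological sequences in the subspace is 1-3 times the number of sites to be mutated (4), and 8 was chosen within this range for the experiments;
S5) Bayesian optimization on the mutation subspace: use Bayesian optimization on the mutation subspace to select a single candidate sequence for experimental query, where the acquisition function is UCB and the surrogate model is the fitness prediction model obtained in S2); β = 0.2.
S6) New seed generation: first use the prediction model obtained in S2) to predict the uncertainty of the candidate sequence obtained in S5). If the uncertainty is not higher than twice the uncertainty of the first seed sequence of the current round, the candidate sequence is used as the first seed sequence of the next round and to form the mutation subspace of the next round; if the uncertainty is higher than twice the uncertainty of the first seed sequence of the current round, the first seed sequence of the next round is obtained according to step S3);
S7) Repeat S4) to S6) until the preset throughput is reached, obtaining 384 candidate sequences for experimental query;
S8) Look up the fitness values of the batch of candidate sequences obtained in S7).
S9) Add the candidate sequences and their corresponding fitness values to the biological sequence and fitness data of S1) and use them for the next round of model training, obtaining the next round's fitness prediction model and its uncertainty;
S10) Use the latest seed sequence obtained in S6) as the first seed sequence of the next round, and repeat S4) to S9) until the preset number of rounds is reached or the desired mutant is found.
The BO-EVO algorithm, suitable for engineering biological sequences, was developed on the GB1 fitness landscape. The algorithm greatly increases computation speed and reduces computational requirements.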
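The fitness normalization used for the GB1 landscape in this example can be sketched as ordinary min-max scaling. This assumes that "normalizing by the global maximum minus the global minimum" means subtracting the global minimum and dividing by the range; the function name and the dictionary interface are illustrative.

```python
def min_max_normalize(fitness):
    """Scale a {sequence: fitness} mapping so that values lie in [0, 1].

    Assumed formula: (f - global_min) / (global_max - global_min).
    """
    lo, hi = min(fitness.values()), max(fitness.values())
    if hi == lo:
        raise ValueError("cannot normalize a landscape with constant fitness")
    return {seq: (f - lo) / (hi - lo) for seq, f in fitness.items()}
```

After this scaling the best sequence in the landscape has fitness 1 and the worst has fitness 0, which puts landscapes such as GB1 and PhoQ on a common scale.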
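Example 2: Validating the BO-EVO algorithm on the empirical fitness landscape PhoQ and on the NK model used as a protein fitness landscape
The BO-EVO algorithm obtained in Example 1 was further validated on different fitness landscapes.
The PhoQ landscape is similar to the GB1 landscape and is also a 4-site empirical landscape. PhoQ uses the enrichment ratio as fitness. With the fitness normalization strategy, the WT fitness is approximately 0.02.
The NK landscape is adapted from the original NK landscape.
A method similar to that of Example 1 was used to further confirm the BO-EVO algorithm of the invention on the PhoQ and NK landscapes. The experimental results show that the BO-EVO algorithm can be applied to several types of landscape.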
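Example 3: Confirming the BO-EVO algorithm by engineering the RhlA enzyme
RhlA is a key enzyme in the synthesis of the lipid moiety of the important biosurfactant rhamnolipid (RL). The enzyme specificity of RhlA determines the chemical structure of the lipid moiety, which in turn affects the physicochemical and biological activities of the corresponding RL molecules. However, it is difficult to change the enzyme specificity of RhlA by (semi-)rational design or directed evolution. In this example, BO-EVO is applied to engineer the enzyme specificity of RhlA through iterative feedback between a machine learning model and robotic experiments.
To evaluate the enzyme specificity of RhlA, matrix-assisted laser desorption/ionization time-of-flight (MALDI-ToF) mass spectrometry (MS) was used to quantify four RL products, Rha-C18, Rha-C20, Rha-C22 and Rha-C24, in liquid cultures of recombinant E. coli serving as the production host.
In microplate culture, Rha-C20 is the main product of E. coli cells containing wild-type RhlA (sequence UniProt ID: Q51559). This experiment aims to increase the yield and proportion of the product Rha-C18 (the fitness) by engineering RhlA. Using the Rha-C18 yield and proportion of wild-type RhlA as the reference, fitness was standardized so that the fitness of wild-type RhlA is 1.
To apply BO-EVO, R74, A101, L148 and S173 were chosen as the four target residues for combinatorial mutation, because existing knowledge indicates that many mutations at these residues significantly enhance Rha-C18 production. Over the BO-EVO iterations, the cumulative maximum fitness increased round by round and reached 7.35 in round 4, while the number of mutants that had to be quantified by robotic experiments was less than 1% of the entire design space. The experimental results in Figure 3 show that the method of the invention, combined with robotic experiments, can efficiently improve the enzyme specificity of RhlA.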
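Example 4: Comparative experiments on fitness landscape exploration algorithms
A standalone evolutionary algorithm (AdaLead) and a standalone BO algorithm were used as baselines for BO-EVO to test whether combining the two exploration strategies is necessary. The performance of random mutation (Random) was also evaluated as a baseline and was expected to be the worst of the four algorithms (Figure 2). In terms of the success rate of finding the globally optimal sequence after five rounds (Figure 2a) and the maximum and mean fitness of all sequences recommended by the algorithm (Figure 2b), AdaLead outperforms random mutation, but this purely evolutionary algorithm is clearly inferior to BO-EVO and BO, both of which use the UCB acquisition function to balance exploration and exploitation. These results show the importance of considering both sequence fitness and model uncertainty when exploring rugged fitness landscapes. On the other hand, by brute-force searching the entire design space (160,000 sequences) in each iteration, standalone BO achieved better performance than BO-EVO (Figure 5), whereas BO-EVO evaluates only 3,072 sequences per round (just 1.92% of the entire design space); the detailed settings of the four exploration algorithms are compared in Table 1. Although BO-EVO performs less well than pure BO, its computation time is almost constant when exploring combinatorial mutation landscapes, whereas the computation time of pure BO grows exponentially with the number of target residues and is therefore not scalable. The algorithm of the invention is scalable and can be used for engineering longer sequences.
Taking the experimental conditions and results together, the invention greatly reduces the number of sequences evaluated by the surrogate model by combining BO and EVO, making it fast and light on computing resources. Selecting top sequences by UCB also better balances exploration and exploitation.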
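The evaluation counts quoted in this example are consistent with a subspace of 8 sequences scored in each of 384 cycles, as in Example 1. The short calculation below reproduces them; the function names are illustrative, and the final loop is a hypothetical extrapolation that keeps the subspace at twice the number of sites.

```python
def design_space_size(n_sites, alphabet_size=20):
    """Number of combinatorial variants over `n_sites` positions."""
    return alphabet_size ** n_sites


def surrogate_evaluations_per_round(subspace_size, throughput):
    """BO-EVO scores one mutation subspace per cycle, `throughput` cycles per round."""
    return subspace_size * throughput


full = design_space_size(4)                          # 20^4 = 160,000 sequences
per_round = surrogate_evaluations_per_round(8, 384)  # 8 * 384 = 3,072 sequences
fraction = per_round / full                          # 0.0192, i.e. 1.92%

# Hypothetical scaling: brute-force BO must score the whole space per selection,
# while the BO-EVO count grows only linearly with the number of sites.
for n in range(4, 9):
    print(n, design_space_size(n), surrogate_evaluations_per_round(2 * n, 384))
```

The contrast between the two columns printed by the loop is the scalability argument made in the text: the design space grows as 20^N while the per-round surrogate workload of BO-EVO stays in the thousands.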
Table 1
Claims (10)
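- A machine-learning-guided method for engineering biological sequences, characterized in that it comprises the following steps: S1) Data acquisition: obtain biological sequences and their corresponding fitness data; S2) Model training: perform machine learning on the obtained biological sequences and corresponding fitness data to obtain a fitness prediction model for biological sequences together with the model uncertainty; S3) Obtaining the first seed: obtain the first seed sequence by fitness-based sampling from the biological sequences and corresponding fitness data obtained in step S1); S4) Mutation subspace generation: randomly mutate the sites to be mutated in the seed sequence obtained in step S3) or S6); the set of mutated biological sequences forms the mutation subspace, where the single-site mutation rate of the random mutation is the reciprocal of the number of sites to be mutated; S5) Bayesian optimization on the mutation subspace: use Bayesian optimization on the mutation subspace to select a single candidate sequence for experimental query, where the acquisition function is UCB and the surrogate model is the fitness prediction model obtained in S2); S6) New seed generation: first use the prediction model obtained in S2) to predict the uncertainty of the candidate sequence obtained in S5); if the uncertainty is not higher than twice the uncertainty of the first seed sequence of the current round, the candidate sequence becomes the next seed sequence of the current round and is used to form the next mutation subspace; if the uncertainty is higher than twice the uncertainty of the first seed sequence of the current round, the next seed sequence of the current round is obtained according to step S3) and used to form the next mutation subspace; S7) Repeat the cycle S4) to S6) until the preset throughput is reached, obtaining a batch of candidate sequences for experimental query; S8) Carry out experiments on the batch of candidate sequences obtained in S7): obtain the candidate sequences and measure their corresponding fitness values; S9) Add the candidate sequences and their corresponding fitness values to the biological sequence and fitness data of S1) and use them for the next round of model training, obtaining the next round's fitness prediction model and model uncertainty; S10) Use the latest seed sequence obtained in S6) as the first seed sequence of the next round, and repeat S4) to S9) until the preset number of rounds is reached or the desired mutant is found.
- The biological sequence engineering method according to claim 1, characterized in that, in step S1), the biological sequences and corresponding fitness data are obtained from a known fitness landscape, from a known dataset of sequences and corresponding fitness, or from experimentally measured biological sequences and corresponding fitness data; preferably, in step S1), the sequences and corresponding fitness data obtained from experimental measurement are the candidate sequences together with the fitness results obtained by testing them, or other known experimental results for biological sequences and corresponding fitness.
- The biological sequence engineering method according to claim 1, characterized in that the sequence is a protein sequence, a polypeptide sequence, a ribonucleic acid sequence or a deoxyribonucleic acid sequence.
- The biological sequence engineering method according to claim 1, characterized in that, in step S2), the prediction model is selected from a Gaussian process model, a Bayesian neural network prediction model, an ensemble model, an evidential deep learning model, or another model that predicts uncertainty.
- The biological sequence engineering method according to claim 1, characterized in that, in step S2), the sequences are converted to a numerical representation before the prediction model is trained.
- The biological sequence engineering method according to claim 1, characterized in that, in step S4), the number of mutated biological sequences in the mutation subspace is 1.0-3.0 times the number of sites to be mutated.
- The biological sequence engineering method according to claim 1, characterized in that, in step S5), the value of β in the UCB is 0.05-0.25.
- The biological sequence engineering method according to claim 1, characterized in that, in step S7), the preset throughput is limited to between 200 and 500.
- A machine-learning-guided device for engineering biological sequences, characterized in that the device is capable of implementing steps S1)-S10) of the biological sequence engineering method according to any one of claims 1-8; preferably, the device comprises the following modules: a biological sequence and fitness data acquisition module, a machine learning model module, and a biological sequence recommendation module; the biological sequence and fitness data acquisition module is used to store and retrieve sequences and their fitness data, to convert retrieved sequences into numerical encodings, and to return the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module; the machine learning model module is used to retrieve biological sequences from the biological sequence and fitness data acquisition module, perform machine learning and training on the retrieved data, and form a prediction model relating biological sequences to their fitness, together with the model uncertainty; the biological sequence recommendation module is used to obtain, under the guidance of the machine learning model module, a batch of candidate sequences for experimental query; it can retrieve data from the biological sequence and fitness data acquisition module, obtain the first seed sequence by fitness-based sampling, randomly mutate the first seed sequence, form a mutation subspace from the random mutants, and use Bayesian optimization within the mutation subspace to select a single candidate sequence, where the acquisition function is UCB and the surrogate model is the prediction model formed by the machine learning model module; it determines the uncertainty of the single candidate sequence, and if the uncertainty is not higher than twice the uncertainty of the first seed sequence of the current round, the candidate sequence is used as the first seed sequence of the next round and to form the mutation subspace of the next round; if the uncertainty is higher than twice the uncertainty of the first seed sequence of the current round, the first seed sequence of the next round is obtained from the retrieved sequences by fitness-based sampling; the next round is then started, until the preset number of rounds is reached and a batch of candidate sequences is obtained; preferably, the biological sequence recommendation module also contains a unit that starts the next round, which transmits the generated batch of candidate sequences to the biological sequence and fitness data acquisition module and directs that module to return the corresponding fitness values for the batch; the batch of candidate sequences and their fitness values are merged into the existing biological sequence and fitness data and used to guide the machine learning model module in generating a new prediction model, which in turn guides the next round of recommendations by the biological sequence recommendation module.
- The biological sequence engineering device according to claim 9, characterized in that the biological sequence and fitness data acquisition module comprises a recording unit that records information, an encoding unit that converts retrieved sequences into numerical encodings, and a batch experiment unit that returns the corresponding fitness values for the batch of candidate sequences generated by the biological sequence recommendation module.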
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/103382 WO2024000579A1 (zh) | 2022-07-01 | 2022-07-01 | Machine-learning-guided method and device for engineering biological sequences |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/103382 WO2024000579A1 (zh) | 2022-07-01 | 2022-07-01 | Machine-learning-guided method and device for engineering biological sequences |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024000579A1 (zh) | 2024-01-04 |
Family
ID=89383898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/103382 WO2024000579A1 (zh) | Machine-learning-guided method and device for engineering biological sequences | 2022-07-01 | 2022-07-01 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024000579A1 (zh) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210225455A1 (en) * | 2018-08-15 | 2021-07-22 | Zymergen Inc. | Bioreachable prediction tool with biological sequence selection |
CN114005493A (zh) * | 2021-10-29 | 2022-02-01 | 上海商汤智能科技有限公司 | 一种生物序列检索方法、装置、电子设备及存储介质 |
CN114651064A (zh) * | 2019-09-13 | 2022-06-21 | 芝加哥大学 | 使用机器学习对蛋白质和其它序列定义的生物分子进行进化数据驱动设计的方法和设备 |
-
2022
- 2022-07-01 WO PCT/CN2022/103382 patent/WO2024000579A1/zh unknown
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210225455A1 (en) * | 2018-08-15 | 2021-07-22 | Zymergen Inc. | Bioreachable prediction tool with biological sequence selection |
CN114651064A (zh) * | 2019-09-13 | 2022-06-21 | 芝加哥大学 | 使用机器学习对蛋白质和其它序列定义的生物分子进行进化数据驱动设计的方法和设备 |
CN114005493A (zh) * | 2021-10-29 | 2022-02-01 | 上海商汤智能科技有限公司 | 一种生物序列检索方法、装置、电子设备及存储介质 |
Non-Patent Citations (2)
Title |
---|
JIAN-LIN SHAO, SHI DING-HUA, WANG YI-FEI: "Application of Bayesian Neural Networks to Biological Sequence Analysis", NATURE MAGAZINE, vol. 26, no. 2, 30 April 2004 (2004-04-30), pages 108 - 111, XP093119273 * |
ZERJU LUO, ZHU SI-MING, HE MIAO: "Multiple Alignment Analysis Based on Hidden Markov Models", ACTA SCIENTIARUM NATURALIUM UNIVERSITATIS SUNYATSENI, vol. 44, no. 2, 25 March 2005 (2005-03-25), pages 9 - 13, XP093119277 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Basith et al. | iGHBP: computational identification of growth hormone binding proteins from sequences using extremely randomised tree | |
Camproux et al. | A hidden markov model derived structural alphabet for proteins | |
US11620544B2 (en) | Method, apparatus, and computer-readable medium for efficiently optimizing a phenotype with a specialized prediction model | |
JP2005505031A (ja) | 多重破壊表現ライブラリから生成される遺伝子調節ネットワークを用いた生物学的発見 | |
US20210257049A1 (en) | Method, apparatus, and computer-readable medium for efficiently optimizing a phenotype with a combination of a generative and a predictive model | |
WO2021217138A1 (en) | Method for efficiently optimizing a phenotype with a combination of a generative and a predictive model | |
Huang et al. | Harnessing deep learning for population genetic inference | |
Meluzzi et al. | Computational approaches for inferring 3D conformations of chromatin from chromosome conformation capture data | |
CN115249514A (zh) | 一种机器学习引导的生物序列工程改造方法及装置 | |
Raza et al. | iPro-TCN: prediction of DNA promoters recognition and their strength using temporal convolutional network | |
Lee et al. | Survival prediction and variable selection with simultaneous shrinkage and grouping priors | |
Thiel et al. | Sampling globally and locally correct RNA 3D structures using Ernwin, SPQR and experimental SAXS data | |
WO2024000579A1 (zh) | 一种机器学习引导的生物序列工程改造方法及装置 | |
US20230108368A1 (en) | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples | |
US20230122168A1 (en) | Conformal Inference for Optimization | |
WO2022221587A1 (en) | Artificial intelligence-based analysis of protein three-dimensional (3d) structures | |
WO2022221593A1 (en) | Efficient voxelization for deep learning | |
Thareja et al. | Applications of deep learning models in bioinformatics | |
Corander | Is there a real Bayesian revolution in pattern recognition for bioinformatics? | |
Tsapalou | Inferring the Additive and Epistatic Genetic Architecture involved in Speciation Using Neural Networks | |
Mishra | Deep Learning Based Convolute Neural Approach in The Prediction of RNA Structure | |
EP4413575A1 (en) | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples | |
JP2024529837A (ja) | 変異体病原性予測のためのタンパク質コンタクトマップの深層学習に基づく使用 | |
JP2024538478A (ja) | ギャップ付き及び非ギャップタンパク質サンプルを使用した変異体病原性予測器の複合学習及び転移学習 | |
WO2023059750A1 (en) | Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22948678 Country of ref document: EP Kind code of ref document: A1 |