CN117275582A - Construction of amino acid sequence generation model and method for obtaining protein variant
- Publication number
- CN117275582A (application number CN202310832292.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N9/00—Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
- C12N9/0004—Oxidoreductases (1.)
- C12N9/0006—Oxidoreductases (1.) acting on CH-OH groups as donors (1.1)
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Y—ENZYMES
- C12Y101/00—Oxidoreductases acting on the CH-OH group of donors (1.1)
- C12Y101/01—Oxidoreductases acting on the CH-OH group of donors (1.1) with NAD+ or NADP+ as acceptor (1.1.1)
- C12Y101/01037—Malate dehydrogenase (1.1.1.37)
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/20—Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention provides a method for constructing an amino acid sequence generation model and a method for obtaining protein variants, so as to generate high-quality protein sequences that have a reasonable structure and corresponding actual functions. Specifically, the construction comprises the following steps: constructing a dataset for generation, in which all actually existing amino acid sequences corresponding to the target protein are collected from a public protein database, clustered, and divided into a training dataset and an evaluation dataset; constructing a network model structure, in which a generator network and a discriminator network are constructed to form a preliminary TPGAN model; model training and evaluation, in which the training dataset is input to the preliminary model, the generator network and the discriminator network are optimized and iteratively trained simultaneously using the back-propagation algorithm, and the evaluation dataset is used to adjust the preliminary model to avoid overfitting, yielding an adjusted model; and obtaining the generation model, in which the adjusted model is verified to obtain a generation model that can be used to generate amino acid sequences of the target protein.
Description
Technical Field
The invention belongs to the technical field of biology, and particularly relates to a construction method of an amino acid sequence generation model and a protein variant obtaining method.
Background
Proteins are important basic substances in living bodies, and the diversity of amino acid sequences is important for the survival and reproduction of living bodies.
The TPGAN model (Transformer-based protein generative adversarial network) is a large language model that can effectively generate brand-new protein sequences (amino acid sequences) and has wide application value.
However, conventional experiment-based protein sequence prediction methods have certain limitations, such as high cost and long time consumption, which has also motivated the research and exploration of the TPGAN model based on deep learning technology.
Disclosure of Invention
The invention provides a construction of an amino acid sequence generation model and a method for obtaining protein variants. It aims to describe in detail a specific implementation of the TPGAN model and its application to protein sequence generation, so as to obtain high-quality protein sequences that have a reasonable structure and corresponding actual functions, with a view to wider technical adoption and application.
For this purpose, the present invention provides the following technical solutions.
The present invention provides a construction of an amino acid sequence generation model for generating an amino acid sequence of a target protein, comprising: constructing a dataset for generation, in which all actually existing amino acid sequences corresponding to the target protein are collected from a public protein database, preprocessed, and clustered based on their percent identity; a certain number of clusters are randomly selected from all clusters containing 5 or fewer sequences and pooled as an evaluation set, the number of randomly selected clusters accounting for 20% or less of the total number of clusters obtained by clustering, and the sequences of the remaining clusters are pooled as a training dataset; constructing a network model structure, in which a generator network and a discriminator network are constructed to form a preliminary TPGAN model; model training and evaluation, in which the training dataset is input to the preliminary model, the generator network and the discriminator network are optimized and iteratively trained simultaneously using the back-propagation algorithm, and the evaluation dataset is used to adjust the preliminary model to avoid overfitting, yielding an adjusted model; and obtaining the generation model, in which the adjusted model is verified to obtain a generation model that can be used to generate amino acid sequences of the target protein.
The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: wherein the preprocessing comprises de-duplication and de-noising, and discarding sequences longer than 500 amino acids.
The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: wherein constructing the generator network comprises constructing an autoencoder and constructing a generator. The autoencoder is constructed as follows: an encoder and a decoder are built from Transformer modules, each using a four-layer network, with a multi-head attention mechanism applied in between. The generator is a neural network built from three fully connected layers; it takes as input noise that follows a Gaussian distribution and, using a KL divergence loss, transforms it through several hidden layers into a vector that follows a normal distribution, which is passed to the decoder and decoded to generate a new amino acid sequence.
The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: wherein the discriminator network judges whether an amino acid sequence generated by the generator network is reasonable; preferably, the discriminator network is a 3-layer MLP model.
The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: the discriminator network receives real amino acid sequences and amino acid sequences generated by the generator network and, using binary cross-entropy as the loss function, learns the differences between real and generated sequences through several hidden layers, so as to judge whether a received amino acid sequence is a real one.
The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: wherein a plurality of loss functions are optimized simultaneously during training and the corresponding hyperparameters are adjusted; preferably, the learning rate is set to 1e-4, the dropout rate to 0.1, and the batch size to 8.
The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: wherein the adjusted model is verified as follows: the amino acid sequences generated by the model after each round of training and adjustment are compared against a protein database using BLAST software, and the generation model is obtained when the comparison result is improved 3-fold relative to the comparison result from the initial training.
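The verification criterion above can be sketched as a simple check. The text does not say how the BLAST "comparison result" is summarized, so this sketch assumes it is the mean BLAST score over the generated sequences; the function names and the `factor` parameter are hypothetical.

```python
def mean_score(scores):
    """Average of a non-empty list of BLAST scores."""
    return sum(scores) / len(scores)


def meets_validation_criterion(initial_scores, current_scores, factor=3.0):
    """Return True when the current mean score is at least `factor` times
    the mean score obtained after the initial training round.

    Assumption: the 'comparison result' is the mean BLAST score.
    """
    return mean_score(current_scores) >= factor * mean_score(initial_scores)
```

A training loop could call `meets_validation_criterion` after each round of training and adjustment and stop once it returns `True`.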
The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: wherein the protein of interest is a variant of malate dehydrogenase.
The invention also provides a method for obtaining a protein variant, comprising the following steps: randomly generating a plurality of amino acid sequences using the generation model; calculating, with BLAST software, a similarity score between each generated amino acid sequence and the sequence library of a protein database, and selecting the amino acid sequences with the 100 highest BLAST scores; predicting the three-dimensional structure of each selected amino acid sequence using AlphaFold2 to obtain a pLDDT score, and retaining the amino acid sequences with pLDDT > 90; comparing the three-dimensional structure of each retained sequence with the wild-type crystal structure to obtain the structural RMSD, and at the same time analyzing the conserved sites of the wild type; and selecting the amino acid sequences that retain all conserved sites and have RMSD < 2.0 to obtain the corresponding protein variants, which are then tested against the wild type in functional comparison tests to select the desired protein variants.
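The screening cascade above can be sketched as a chain of filters. This is a minimal illustration, not the patented implementation: it assumes the BLAST score, pLDDT and RMSD have already been computed by external tools and stored on each candidate, and that conserved sites are given as 0-based positions in sequences aligned position-by-position to the wild type. The `Candidate` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    sequence: str
    blast_score: float  # similarity score from BLAST (precomputed)
    plddt: float        # confidence score from structure prediction
    rmsd: float         # structural deviation from the wild-type structure


def retains_conserved_sites(seq, wild_type, conserved_positions):
    """True if seq carries the wild-type residue at every conserved position."""
    return all(
        pos < len(seq) and seq[pos] == wild_type[pos]
        for pos in conserved_positions
    )


def screen(candidates, wild_type, conserved_positions,
           top_n=100, min_plddt=90.0, max_rmsd=2.0):
    """Apply the filters in the order given in the text."""
    # 1. Keep the top_n candidates by BLAST score.
    top = sorted(candidates, key=lambda c: c.blast_score, reverse=True)[:top_n]
    # 2. Keep confidently predicted structures (pLDDT > 90).
    confident = [c for c in top if c.plddt > min_plddt]
    # 3. Keep structures close to wild type that retain all conserved sites.
    return [
        c for c in confident
        if c.rmsd < max_rmsd
        and retains_conserved_sites(c.sequence, wild_type, conserved_positions)
    ]
```

The sequences returned by `screen` would then go on to the functional comparison tests against the wild type.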
The obtaining method provided by the invention also has the following characteristics: wherein, when the protein variant is a malate dehydrogenase variant, the desired malate dehydrogenase variant preferably has an enzyme activity that is at least 1-fold that of the wild type.
Drawings
FIG. 1 is a flow chart showing the construction of an amino acid sequence generation model described in example 1;
FIG. 2 is a graph of the fitted equation involved in Example 2.
Detailed Description
The following detailed description of the invention is provided in connection with the accompanying drawings. With respect to the specific methods or materials used in the embodiments, those skilled in the art may make conventional substitutions based on the technical idea of the present invention and are not limited to the specific descriptions of the embodiments.
The methods used in the examples are conventional methods unless otherwise specified; the materials, reagents and the like used, unless otherwise specified, are all commercially available.
Malate dehydrogenase (abbreviated MDH, EC 1.1.1.37) is an enzyme that is widely present in organisms, including plants, animals, and microorganisms. Its function is to catalyze the redox reaction between malic acid and NAD+, converting malic acid to oxaloacetic acid while reducing NAD+ to NADH.
Malate dehydrogenase plays an important physiological role in organisms and is involved in the regulation of many metabolic pathways, such as tricarboxylic acid cycle, photosynthesis, respiration, etc. In plants, malate dehydrogenase is also involved in regulating plant response to environmental adaptation, such as regulating root growth and development, adaptation to acidic soil, and the like.
Therefore, research on malate dehydrogenase has important significance for deeply knowing metabolic pathways and regulation mechanisms thereof in organisms and improving agricultural production efficiency.
Variants herein refer to proteins whose amino acid sequence differs from that of the wild type but which retain the essential properties of the wild type.
Enzyme activity herein is expressed in enzyme activity units: 1 unit of enzyme activity is the amount of enzyme that converts 1 μmol of substrate, or 1 μmol of the relevant group in the substrate, in 1 minute under specific conditions (25 °C, with other conditions optimal).
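As a small worked illustration of the unit definition, the sketch below converts a measured amount of converted substrate into enzyme activity units and expresses a variant's activity as a multiple of the wild type's. The function names are hypothetical and the numbers in the usage note are made up for illustration only.

```python
def enzyme_units(substrate_converted_umol, minutes):
    """Enzyme activity in units (U): micromoles of substrate converted per minute."""
    if minutes <= 0:
        raise ValueError("reaction time must be positive")
    return substrate_converted_umol / minutes


def relative_activity(variant_units, wild_type_units):
    """Variant activity as a multiple of the wild-type activity."""
    return variant_units / wild_type_units
```

For instance, converting 6 μmol of substrate in 3 minutes corresponds to `enzyme_units(6.0, 3.0)`, i.e. 2 U.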
Kcat is the catalytic constant of the enzyme, also called the turnover number, i.e., the number of substrate molecules that one enzyme molecule converts into product per unit time. Kcat can be used to measure the catalytic efficiency of an enzyme: the greater the Kcat value, the higher the catalytic efficiency.
The Michaelis constant Km is defined as the substrate concentration at which the enzyme operates at half its maximum catalytic rate; it therefore describes the affinity of an enzyme for a particular substrate. Knowledge of Km values is crucial for a quantitative understanding of the enzymatic and regulatory interactions between enzymes and metabolites, since it bears on the intracellular concentrations of metabolites. The smaller the Km value, the greater the affinity of the enzyme for the substrate; conversely, the larger the Km value, the smaller the affinity.
Kcat/Km combines Kcat and Km; it can be used not only to measure the catalytic efficiency of an enzyme but also to indicate how close the enzyme is to catalytic perfection.
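For reference, these quantities are related by the standard Michaelis-Menten rate law. The text does not state this model explicitly, so it is given here as the usual textbook assumption, with v the reaction rate, [S] the substrate concentration, [E]_0 the total enzyme concentration, and k_cat and K_m corresponding to Kcat and Km above.

```latex
\begin{gather}
v = \frac{V_{\max}\,[S]}{K_m + [S]}, \qquad V_{\max} = k_{cat}\,[E]_0 \\
[S] = K_m \;\Rightarrow\; v = \tfrac{1}{2}\,V_{\max} \\
[S] \ll K_m \;\Rightarrow\; v \approx \frac{k_{cat}}{K_m}\,[E]_0\,[S]
\end{gather}
```

The second line restates the definition of Km as the substrate concentration at half the maximum rate, and the third shows why the ratio Kcat/Km acts as an efficiency measure at low substrate concentration.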
Example 1
The present embodiment provides a construction of an amino acid sequence generation model for generating an amino acid sequence of a target protein, comprising the steps of: the method comprises the steps of constructing a data set for generation, constructing a network model structure, training a model, evaluating the model and obtaining a generation model.
The protein of interest refers to a protein having a desired function, which is finally obtained, for example, a variant of malate dehydrogenase.
The construction process is explained in detail as follows (see FIG. 1):
Constructing the dataset for generation specifically comprises: collecting from public protein databases all actually existing amino acid sequences corresponding to the target protein, preprocessing them, and clustering them based on their percent identity; a certain number of clusters are then randomly selected from all clusters containing 5 or fewer sequences and pooled as the evaluation set, the number of randomly selected clusters accounting for 20% or less of the total number of clusters obtained by clustering. For example, if clustering yields 50 clusters, of which 30 contain 5 or fewer sequences, then 10 of those 30 clusters are randomly selected and pooled as the evaluation set. "All actually existing amino acid sequences corresponding to the target protein" means amino acid sequences that already exist in reality, namely the wild-type protein corresponding to the final target protein and all other variants relative to the wild type.
Optionally, when the number of clusters containing 5 or fewer sequences is less than or equal to 20% of the total number of clusters obtained by clustering, all clusters containing 5 or fewer sequences are pooled as the evaluation set. For example, if clustering yields 50 clusters, of which 5 contain 5 or fewer sequences, then all 5 of those clusters are pooled as the evaluation set.
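The split rule described above can be sketched as follows. This is a minimal illustration that assumes clustering has already been done and that each cluster is given as a list of sequences; the function name, the `seed` parameter and the choice to take exactly the 20% quota (the text allows "20% or less") are assumptions.

```python
import random


def split_clusters(clusters, max_small=5, eval_percent=20, seed=0):
    """Split clusters into (training sequences, evaluation sequences, chosen ids)."""
    rng = random.Random(seed)
    # Only clusters with at most `max_small` sequences are eligible for evaluation.
    small = [i for i, c in enumerate(clusters) if len(c) <= max_small]
    # Evaluation clusters may not exceed eval_percent of all clusters.
    quota = len(clusters) * eval_percent // 100
    if len(small) <= quota:
        chosen = set(small)                      # optional case: take them all
    else:
        chosen = set(rng.sample(small, quota))   # random selection up to the quota
    train = [s for i, c in enumerate(clusters) if i not in chosen for s in c]
    evaluation = [s for i, c in enumerate(clusters) if i in chosen for s in c]
    return train, evaluation, chosen
```

With 50 clusters of which 30 are small, the sketch selects 10 evaluation clusters; with only 5 small clusters, it selects all 5, matching the two examples in the text.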
Preferably, the protein of interest for which the construction process is directed is a variant of malate dehydrogenase.
Constructing the network model structure specifically comprises: constructing the generator network and the discriminator network to form a preliminary TPGAN model.
Model training and evaluation specifically comprise: inputting the training dataset to the preliminary model, back-propagating according to the loss functions so that the generator network and the discriminator network are optimized and iteratively trained simultaneously, and using the evaluation dataset to adjust the preliminary model to avoid overfitting, yielding an adjusted model.
Obtaining the generation model: the adjusted model is verified to obtain a generation model that can be used to generate amino acid sequences of the target protein.
The TPGAN model uses generative adversarial network technology: by learning the arrangement and distribution patterns of amino acids in protein sequences, the model extracts the features of protein sequences and generates brand-new protein sequences based on those features.
Compared with an ordinary generative adversarial network, a Transformer-based large pre-trained protein language model is added. A large model pre-trained on massive numbers of protein sequences can extract the regularities of the protein language more effectively, and the attention mechanism in the Transformer lets the model learn its weights automatically from the data more effectively, thereby providing more weights for the model.
In one example, the preprocessing described above includes de-duplicating and de-noising the collected real amino acid sequences, and discarding sequences longer than 500 amino acids.
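A minimal sketch of this preprocessing step follows. The text does not define de-noising, so the rule used here (dropping any sequence that contains a character outside the 20 standard amino acids) is an assumption, as are the function name `preprocess` and its parameters.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def preprocess(sequences, max_length=500):
    """Return de-duplicated, de-noised sequences no longer than max_length."""
    seen = set()
    kept = []
    for raw in sequences:
        seq = raw.strip().upper()
        if not seq or seq in seen:
            continue  # skip empty entries and exact duplicates
        if len(seq) > max_length:
            continue  # discard sequences longer than 500 amino acids
        if not set(seq) <= STANDARD_AA:
            continue  # assumed de-noising rule: non-standard residues
        seen.add(seq)
        kept.append(seq)
    return kept
```

The order of the input is preserved, so the first occurrence of each duplicated sequence is the one that is kept.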
In an example, constructing the generation network includes constructing an autoencoder and constructing a generator.
The autoencoder is constructed as follows: the encoder and the decoder are each built from Transformer modules with four network layers, and a multi-head attention mechanism is applied in between. The input to the encoder is an amino acid sequence and its output is a vector.
The generator is a neural network built from three fully connected layers. It takes noise drawn from a Gaussian distribution as input and, through the computation of several hidden layers and using a KL divergence loss, transforms it into a vector that follows a normal distribution. This vector is passed to the decoder, which decodes it into the probability of each amino acid at each site, and the probabilities are finally converted into an amino acid sequence.
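The last step, turning the decoder's per-site probabilities into an amino acid sequence, might look like the sketch below. The text does not say how the conversion is done; taking the most probable residue at each site is an assumption (sampling from the distribution would be another option), and the alphabet ordering and the name `probabilities_to_sequence` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def probabilities_to_sequence(site_probs, alphabet=AMINO_ACIDS):
    """Pick the most probable residue at every site.

    site_probs: one list of probabilities per site, each with one
    entry per letter of `alphabet`.
    """
    residues = []
    for probs in site_probs:
        if len(probs) != len(alphabet):
            raise ValueError("each site needs one probability per residue")
        best = max(range(len(probs)), key=probs.__getitem__)
        residues.append(alphabet[best])
    return "".join(residues)
```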
In one example, the discrimination network judges whether an amino acid sequence generated by the generation network is reasonable; preferably, the discrimination network is a neural network model consisting of 3 fully connected layers. Specifically, the discrimination network receives the real amino acid sequences of the training data set and the amino acid sequences generated by the generation network and, using binary cross entropy as the loss function, learns the difference between real and generated amino acid sequences through the computation of several hidden layers in order to judge whether a received sequence is real. The received real and generated amino acid sequences are converted to numerical form; an output of 1 means the sequence is judged to be real, and 0 means it is not.
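Claim 5 names binary cross entropy as the discrimination network's loss. A plain-Python sketch of that loss, with label 1 for real sequences and 0 for generated ones, is given here for orientation only; the function names and the clipping constant `eps` are assumptions, and a real implementation would use a deep learning framework's built-in loss.

```python
import math


def binary_cross_entropy(predictions, labels, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(predictions)


def discriminator_loss(scores_real, scores_generated):
    """Loss when real sequences are labelled 1 and generated ones 0."""
    predictions = list(scores_real) + list(scores_generated)
    labels = [1.0] * len(scores_real) + [0.0] * len(scores_generated)
    return binary_cross_entropy(predictions, labels)
```

The loss falls as the network scores real sequences closer to 1 and generated sequences closer to 0.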
In an example, when the generation network and the discrimination network are optimized and iteratively trained at the same time using the back propagation algorithm, several loss functions are optimized simultaneously and the corresponding hyperparameters are tuned; preferably, the learning rate is set to 1e-4, the dropout rate to 0.1, and the batch size to 8.
In one example, the adjusted model is verified as follows: after each round of training, the amino acid sequences generated by the adjusted model are aligned against a protein database using BLAST software to obtain similarity scores, and the generation model is obtained when the score is improved 3-fold compared with the alignment score obtained after the initial training.
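The acceptance criterion can be expressed as a small check over per-round scores. This sketch reads "improved by 3 times" as "at least 3 times the initial score", which is an interpretation; how the per-round score is aggregated from individual BLAST hits is not stated in the text, so the input here is simply one positive number per training round. The name `first_passing_round` is hypothetical.

```python
def first_passing_round(round_scores, factor=3.0):
    """Return the index of the first round whose score reaches
    `factor` times the score of round 0, or None if no round does."""
    if not round_scores:
        return None
    baseline = round_scores[0]
    for index, score in enumerate(round_scores[1:], start=1):
        if score >= factor * baseline:
            return index
    return None
```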
The embodiment also provides a method for obtaining a protein variant, which comprises the following steps:
randomly generating a plurality of amino acid sequences using the generation model obtained by the training above;
calculating, with BLAST software, a similarity score between each generated amino acid sequence and the sequences of the protein database, specifically by aligning each generated sequence against the collected real amino acid sequences, and selecting the amino acid sequences whose BLAST scores rank in the top 100;
predicting the three-dimensional structure of each selected amino acid sequence with AlphaFold2 to obtain a pLDDT score, and retaining the amino acid sequences with pLDDT > 90;
comparing the three-dimensional structure of each retained amino acid sequence with the wild-type crystal structure to obtain the structural RMSD, and at the same time analyzing the conserved sites of the wild type;
selecting the amino acid sequences that retain all conserved sites and have RMSD < 2.0 to obtain the corresponding protein variants, and testing their function against the wild type to select the desired protein variants.
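The computational part of this screening cascade can be sketched as a chain of filters. The sketch assumes the BLAST score, pLDDT, RMSD and conserved-site check have already been computed for each candidate by the external tools; the `Candidate` record and the `screen` function are hypothetical names, and in the actual workflow structure prediction would only be run on the sequences that survive the BLAST ranking.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    blast_score: float
    plddt: float
    rmsd: float
    conserved_sites_kept: bool


def screen(candidates, top_n=100, min_plddt=90.0, max_rmsd=2.0):
    """Apply the three computational filters in the order given in the text."""
    # 1. Keep the top-N sequences by BLAST score.
    ranked = sorted(candidates, key=lambda c: c.blast_score, reverse=True)[:top_n]
    # 2. Keep confident structure predictions (pLDDT > 90).
    confident = [c for c in ranked if c.plddt > min_plddt]
    # 3. Keep sequences with all conserved sites and RMSD < 2.0.
    return [c for c in confident if c.conserved_sites_kept and c.rmsd < max_rmsd]
```

The sequences that pass are the ones that would go on to expression and functional comparison with the wild type.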
In one example, the protein variant targeted by the method is a malate dehydrogenase variant;
preferably, a malate dehydrogenase variant is a desired variant when its enzyme activity is at least 1 time that of the wild type (the wild-type malate dehydrogenase has the amino acid sequence shown in SEQ ID NO: 13). In one example, the malate dehydrogenase variant obtained by the method has an amino acid sequence as shown in any one of SEQ ID NOs: 1-12, or an amino acid sequence having at least 85%, 90%, 95% or more identity to any one of SEQ ID NOs: 1-12.
SEQ ID NOs: 1-13 are as follows:
SEQ ID NO:1:
MKVAVLGAAGGIGQALALLLKTQLPAGSELSLYDIAPVTPGVALDLSHIPTNVEVKGFSGEDATPALEGADVVLISAGVARKPGMDRSDLFNINAGIVRNLVEKIAKTFPSAIIGIITNPVNTTVAIAAEVLKKAGKYDKNKLFGVTTLDIIRSETFVAELKGKDPVEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVAALTKRIQNAGTEVVEAKAGGGSATLSMGQAAVRFGLSLVRGLQGENGVVECALVEGDGKHARFCAQPLLLGKNGVEERKSYGDL SAFEQQALEGMLATLKTDITLGEEFVKK;
SEQ ID NO:2:
MKVAVLGAAGGIGQALALLLKTQLPSGSELTLYDIAPVTPGVAVDLSHIPTAVKITGFSGEDAAPALEGADIVVISAGVRRKPGMDRSDLAPVNYGIVENLTKQIAKVTPDAIVGIITNPVNATVAVAEAVLEKAGVYDPRKLFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGRTIIPLLSQVEGVTFTPEEVKALTRRIQNAGTEVVEAKAGGGSATLSMGQAAARFVLDLVAAKEGAENIVRDALVKNDGSYAHFFTRPCLLGTDGIKEVLSIGELSEFEKARLEASRPYLSAEIAKGFAYVNT;
SEQ ID NO:3:
MKVAVLGAAGGIGQALALLLKTQLPSGSTLTLYDIAPVTPGVAVDLSHIPTAVKIEGFTGEDAAPALEGADIVVISAGVRRKPGMDRSDLKPVNFGIVENLTKQIAEVTPDAIILIITNPVNTTVAIAAEVLKKAGVYDPKRLFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGKTIIPLLSKVEGLTFTDEEVEELTKRIQNAGTEVVEAKAGGGSATLSMGQAAARTVLAVARARAGAENVVLDVLVEGDGSYARFFTRPCLLGTDGVKEILSIGELSDFEKKRLEESIPYMKEEIDAGYDYVNN;
SEQ ID NO:4:
MKVAVLGAAGGIGQALALLLKTQLPAGSELSLYDIAPVTPGVAVDLSHIPTAVKVKGFSGEDHTPALEGADVVLISAGVARKPGMDRSDLFNVNAGIVKNLVEQIAKTFPKAIIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDVIRSETFVAELKPKDPVEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVAALTKRIQNAGTEVVEAKAGGGSATLSMGQAAVRFGLSLVRGLQGENGVVECALVEGDGKHARFCAQPLLLGKNGVEERKSYGKLSAFEQQALEGMLATLKKDITLGEEFVKKGSPAATAAERILVVVITDRN;
SEQ ID NO:5:
MKKKVTVVGAGNVGATAAQEIAEKESRDVVLDDGMEGLPQGKALDVLQAGPLIGQSARISGTNDSSGTAGSDVVVITAGIPRKPGMSRDDLIGTNADIVKSVTENVVKLSPKAYIIVVSNPLDAMGYTAFSATGFPIERVIGMAGALDSARFRAFIAMELNVSAGNIQAVVLGGHGDTMVPLKRRTTVAGIPITSLMSAEGIEVIVMRTRMGGAEIVILLKTGSAYAAPSASEATMVDSIVKDQKRILPCALYLEGEYGASGICVGVPVKLGANGVEEIVDIKLQEEEKLLISISAKAVREMNKVLSVL;
SEQ ID NO:6:
MKVAVLGAAGGIGQALALLLKTQLPAGSELSLYDIAPVTPGVALDLSHIPTNVEVKGFSGEDATPALEGADVVLISAGVARKPGMDRSDLFNINAGIVRNLVEQIAKTFPKAIIGIITNPVNTTVAIAAEVLKKAGKYDKNKLFGVTTLDIIRSETFVAELKGKDPVEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVAALTKRIQNAGTEVVEAKAGGGSATLSMGQAAVRFGLSLVRGLQGENGVVECALVEGDGKHARFCAQPLLLGKNGVEERKSYGDLSAFEQQALEGMLATLKKDITTGE;
SEQ ID NO:7:
MKVAVLGAAGGIGQALALLLKTQLPAGSELSLYDIAPVTPGVAADLSHIPTNVFVKGFSGEDATPALEGADVVLISAGVARKPGMDRSDLFNVNAGIVKNLVEQIAKTFPKAIIGIITNPVNTTVAIAAEVLKKAGKYDKNKLFGVTTLDVIRSETFVAELKPKDPVEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAVRFGLSLVRALQGENGVVECALVEGDGKHARFCAQPLLLGKNGVEERKSYGDLSAFEQQALDGMLATLKKDITTME;
SEQ ID NO:8:
MKVAVLGAAGGIGQALALLLKTQLPSGSELTLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDASPALEGADVVVISAGVRRKPGMDRSDLAPVNFGIVENLTRQIAKVTPNAIVGIITNPVNSTVAVAAEVLKKEGVYDPKR LFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGETIIPLLSQVKGLTFSDEEIRDLTARIQNAGTEVVEAKAGGGSATLSMGQAAARFVLDVVAALEGEKNIIRDALVENDGSYARFFTAPCLLGTDGIEKVLSIGTLSAFEKAQLAASRPIMNAEIDKGFDYVNK;
SEQ ID NO:9:
MKVTVVGAGAVGATCAENIANKQIASEVVLLDIKEGFAEGKALDIMQTASLNGFDTKITGVTNDYSKTAGSDVVVITSGIPRKPGMTREELIGINAGIVKSVTENLLKLSPDRIIIVVSNPMDTMTYLAFKATGLPKNRIIGMGGALDSVRFRYFLSLALNVSASDLQAMVIGGHGDTTMIPLIRLATLNSIPVSKMLAGEELDEVAQDTMVGGATLTKLIGTSAWYAPGAAVATLVDSIVKDQKKIFPCSVYLEGEYGQKDICIGVPVILGANGVEKIVDIDLQDAEKAKLSKSADAVREMNKVLSV;
SEQ ID NO:10:
MVLKKILVGGAGNVGHTAANRAADERIGVVVLFDIVAGVPQGKELDIAESGPNEGFDRKTKGTNDYAGIAGSDVVIITAGIPRKPGMSRDDLLEINAKIVKSVVEGILKYSPDAIVIVVSNPLDVMVWVAQKFSGFPKNRVLGMAGVLDSSRFKYFEAEYLEVSMEDVLAFVLGGHGDTMVPLVRYDTVAGIPVTELLDSPEIAAIVERTRGGGAEIVTLLKTGSAYYAPSAAVAELVEAILPDTKKILPVAAHLAGEYGVSDMFVGVPVKLGSHGVEGIIEGKLTEAEDAAFQSSAESVDEGLAVLAAL;
SEQ ID NO:11:
MKVAVLGAAGGIGQALALLLKTRLPAGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFAGEDPTPALEGADVVLISAGVARKPGMDRSDLFNINAGIVKNLVEQNAKIFPKAIIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFIVTTLDVIRSETFVAELKGLDPAEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVAALTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVRALQGENGVVECALVEGDGKHARFGAQPLLLGKNGVEAVKSYGKLSA FEQQALEGMLATLKADIVLGEEFVKK;
SEQ ID NO:12:
MKVAVLGAAGGIGQALALLLKTQLPSGSELKLYDIAPVTPGVAVDLSHIPTAVRIEGFTGEDATPALEGADVVVISAGVRRKPGMDRSDLIPVNFGIVENLIKQIAETTPDAVILIITNPVNSTVAVAAEVLEKAGVYDPKRLFGVTTLDIIRSNTFVAELKGKQPGEVEVRVIGGHSGETIIPLLSQVEGVTFTEEEKKELTDRIQNAGTEVVEAKAGGGSATLSMGQAAARTVLAVVRALRGEKDVVLDLLVKGDGSYSEFFTAPCLLGKDGVEEILSIGELDEYEKELLESSLPYLNRLIAIGKDYVNN;
SEQ ID NO:13:
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVRRKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVRALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK。
Example 2
This example demonstrates the effectiveness of the method of Example 1 using the malate dehydrogenase variants obtained.
Plasmids containing the variants were constructed and synthesized by Beijing Qingke Biotechnology Co., Ltd., with wild-type malate dehydrogenase used as the comparative example.
1. Recombinant Escherichia coli culture and crude enzyme preparation
1. Transformation of the plasmid into E. coli BL21(DE3): on an ultra-clean bench, add 2 μL of plasmid at a concentration of 50 mg/L to 100 μL of BL21(DE3) competent cell suspension. Mix by flicking the tube or by pipetting, and place on ice for 30 min; heat shock for 90 s in a 42 °C water bath, then quickly transfer to ice and cool for 5 min without shaking. On the ultra-clean bench, add 0.9 mL of LB liquid medium to the cell suspension, mix well, and culture with shaking at 37 °C and 150-225 rpm for 45 min so that the cells recover to a normal growth state. Centrifuge at 4000 rpm for 1 min, remove supernatant on the ultra-clean bench until 200 μL of bacterial suspension remains, and resuspend by pipetting;
2. Plating: spread the 200 μL of suspension on an LB medium plate containing kanamycin using a disposable spreader. Seal the plate with sealing film and leave it until the liquid is fully absorbed. Invert the plate and culture at 37 °C for 12-24 h until transformants appear;
3. Culturing the primary seed culture: pick 1 single colony into 10 mL of LB liquid medium containing kanamycin at a final concentration of 50 μg/mL. Culture conditions: 37 °C, 200 rpm, 12-18 h;
4. Sequencing: send 2 mL of the cultured primary seed culture for gene sequencing to check whether the sequence of the target MDH variant gene fragment on the plasmid is correct;
5. Glycerol stock preservation: on an ultra-clean bench, open a sterile storage tube, add 0.5 mL of primary seed culture with a pipette, then add 0.5 mL of sterilized 50% glycerol, mix well with the pipette, and close the cap. Store in a -80 °C freezer;
6. Culturing the primary seed culture from the glycerol stock: once sequencing has confirmed the sequence, inoculate 10 μL of the MDH variant glycerol stock into 10 mL of LB liquid medium containing kanamycin at a final concentration of 50 μg/mL. Culture conditions: 37 °C, 200 rpm, 12-18 h;
7. Culturing the secondary seed culture: inoculate 10 mL of the MDH variant primary seed culture into 200 mL of LB medium containing kanamycin at a final concentration of 50 μg/mL. Culture conditions: 37 °C, 180 rpm; sample during culture and measure OD600 until the OD value reaches 0.6-0.8;
8. Induction: after the secondary seed culture has grown, add aqueous IPTG solution to it on an ultra-clean bench so that the final IPTG concentration reaches 0.5 mM. Culture conditions: 25 °C, 180 rpm, 16-24 h;
9. Collecting cells by low-temperature centrifugation: dispense the induced culture into 50 mL centrifuge tubes, 40 mL per tube, centrifuge at 4 °C and 8000 rpm for 5 min, discard the supernatant, and keep the cell pellet;
10. Washing: add 5 mL of Tris-HCl buffer (50 mM, pH 6.8-7.2) to the cell pellet in each centrifuge tube, resuspend by pipetting, combine the tubes into one, vortex, centrifuge at 8000 rpm for 5 min, discard the supernatant, and keep the pellet;
11. Ultrasonic disruption: add 15 mL of Tris-HCl buffer (50 mM, pH 6.8-7.2) to the washed pellet in a 50 mL centrifuge tube, resuspend by pipetting, and vortex to mix. Set the ultrasonic disruptor to a working power of 230 W, with 3 s of sonication followed by a 7 s pause, for a total working time of 40 min. Keep the centrifuge tube in an ice-water bath while the cells are being disrupted;
12. Removing the precipitate by centrifugation: centrifuge the disrupted suspension at 4 °C and 10000 rpm for 40 min. Discard the precipitate and keep the supernatant, which is the crude enzyme solution of the MDH variant.
2. Enzyme activity assay:
1. Preheat the ultraviolet spectrophotometer for 30 min in advance, set the wavelength to 340 nm, and zero the background absorbance with distilled water;
2. Keep the distilled water to be used at 37 °C in a water bath;
Add, in order, 760 μL of distilled water, 10 μL of 0.8 mM NADH aqueous solution, 10 μL of 1.6 mM malic acid aqueous solution, and 20 μL of the crude enzyme solution to be tested to a 1 mL quartz cuvette, mix thoroughly by pipetting, and immediately record the initial absorbance A1 and the absorbance A2 after 1 min of reaction at a wavelength of 340 nm, keeping the reaction solution at 37 °C throughout the reaction;
3. Enzyme activity calculation: under optimal conditions, the amount of enzyme that converts 1 μmol of substrate in 1 min is defined as 1 U. The enzyme activity of the crude enzyme solution is calculated as $\text{enzyme activity (U/mL)} = \Delta A \times V_{\text{reaction}} / (\varepsilon \times l \times t \times V_{\text{enzyme}})$, where $V_{\text{reaction}}$ is the volume of the reaction solution and $V_{\text{enzyme}}$ is the volume of the enzyme solution. The terms of the formula are explained in Table 1:
Both the wild-type MDH and the MDH variants were tested and their enzyme activities calculated as described above; the results are shown in Table 2:
Conclusion: as Table 2 shows, the enzyme activities of all 12 MDH variants are significantly higher than that of wild-type MDH;
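The enzyme activity formula above can be evaluated as in the following sketch. Because Table 1 is not reproduced here, the parameter meanings are assumptions: ΔA is taken as the absolute difference between A1 and A2, the light path as 1 cm, and ε as the commonly used NADH extinction coefficient of 6.22 mM⁻¹·cm⁻¹ at 340 nm. The default volumes follow the assay described in the text (800 μL total reaction volume, 20 μL crude enzyme solution, 1 min reaction).

```python
NADH_EXTINCTION_MM = 6.22  # mM^-1 cm^-1 at 340 nm (assumed value of epsilon)


def crude_enzyme_activity(a1, a2, v_reaction_ml=0.8, v_enzyme_ml=0.02,
                          t_min=1.0, path_cm=1.0,
                          epsilon_mm=NADH_EXTINCTION_MM):
    """Enzyme activity in U/mL from two absorbance readings at 340 nm.

    delta_A / (epsilon * l) is the concentration change in mM, i.e. in
    umol/mL; multiplying by the reaction volume gives umol converted,
    and dividing by time and enzyme volume gives U/mL.
    """
    delta_a = abs(a1 - a2)
    return delta_a * v_reaction_ml / (epsilon_mm * path_cm * t_min * v_enzyme_ml)
```

For example, under these assumptions an absorbance change of 0.622 in 1 min corresponds to 4 U/mL of crude enzyme solution.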
3. Measurement of the $k_{cat}$ and $K_M$ values of wild-type MDH and the 12 MDH variants:
The method for measuring the $k_{cat}$ and $K_M$ values of MDH is described using wild-type MDH as an example:
after the wild-type MDH crude enzyme solution was purified, the concentration of the purified enzyme was measured by the Bradford method.
The reaction was designed to catalyze the conversion of the substrates L-malate and NAD+ to oxaloacetate and NADH with wild-type MDH, and the concentration of NADH generated by the reaction was measured by High Performance Liquid Chromatography (HPLC).
The reaction system: the total reaction volume was 10 mL, and 1 mL of purified wild-type MDH enzyme solution was added. Nine L-malic acid concentrations [S] were set: 5 μM, 10 μM, 20 μM, 40 μM, 80 μM, 160 μM, 320 μM, 640 μM and 1280 μM; L-malic acid was prepared as a 10 mM stock solution before use, and the volume to add was calculated from the desired concentration. NAD+ was set at 2 mM; it was likewise prepared as a 10 mM stock solution before use, and the volume to add was calculated from the desired concentration. The reaction volume was made up to 10 mL with ultrapure water.
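The loading volumes follow from the dilution relation C1·V1 = C2·V2. A short sketch, with the hypothetical helper name `loading_volume_ul` and the stock concentration and total volume from the text as defaults:

```python
GRADIENT_UM = [5, 10, 20, 40, 80, 160, 320, 640, 1280]


def loading_volume_ul(target_um, stock_mm=10.0, total_ml=10.0):
    """Volume of stock (in microlitres) needed for a final concentration
    of `target_um` micromolar in `total_ml` millilitres."""
    stock_um = stock_mm * 1000.0   # mM -> uM
    total_ul = total_ml * 1000.0   # mL -> uL
    return target_um * total_ul / stock_um


# Volumes of 10 mM L-malic acid stock for each point of the gradient.
malate_volumes = {c: loading_volume_ul(c) for c in GRADIENT_UM}
# Volume of 10 mM NAD+ stock for a final concentration of 2 mM (2000 uM).
nad_volume = loading_volume_ul(2000)
```

Because the stock is 10 mM and the final volume is 10 mL, the volume in microlitres is numerically equal to the target concentration in micromolar.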
Reaction sampling and detection: the reaction temperature was 37 °C and the reaction time was 1 min; 1 mL of reaction solution was taken and inactivated for 1 min in an 80 °C water bath. The inactivated sample was diluted to a suitable concentration, the NADH absorption peak at a wavelength of 340 nm was detected by HPLC, and the actual NADH concentration in the reaction solution was calculated from the standard curve of an NADH standard. The reaction rate v was calculated from the concentration of NADH formed (NADH is formed in equimolar amounts with oxaloacetate).
As shown in FIG. 2, the Lineweaver-Burk equation of wild-type MDH under the experimental reaction conditions was fitted linearly using the double-reciprocal plot method: with the reciprocal of the initial L-malic acid concentration, 1/[S], as the abscissa and the reciprocal of the reaction rate measured at each concentration, 1/v, as the ordinate, a scatter plot was drawn in Excel and the corresponding linear equation y = kx + b was calculated, where k corresponds to $K_M/V_{max}$ and b to $1/V_{max}$ in the Lineweaver-Burk equation. The values of the variables are shown in Table 3.
From the fitted equation, $V_{max} = 1/105.3 = 9.50 \times 10^{-3}$ (mol/min) and $K_M = 0.0126 \times V_{max} = 1.20 \times 10^{-4}$ (mol/L).
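The double-reciprocal fit was done in Excel; an equivalent least-squares fit can be sketched in plain Python as follows. The function names are hypothetical, and the slope 0.0126 and intercept 105.3 are the values used in the calculation above.

```python
def params_from_line(k, b):
    """Convert slope k = K_M/Vmax and intercept b = 1/Vmax into (Vmax, K_M)."""
    vmax = 1.0 / b
    km = k * vmax
    return vmax, km


def lineweaver_burk_fit(substrate, rates):
    """Least-squares line through (1/[S], 1/v); returns (Vmax, K_M)."""
    xs = [1.0 / s for s in substrate]
    ys = [1.0 / v for v in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx              # slope
    b = mean_y - k * mean_x    # intercept
    return params_from_line(k, b)


# Reproducing the arithmetic in the text from the fitted line.
vmax_wt, km_wt = params_from_line(0.0126, 105.3)
```

Vmax and K_M come out in whatever units the rates and substrate concentrations were supplied in.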
By definition, $k_{cat}$ is $V_{max}$ divided by the molar amount of enzyme in the reaction. $V_{max} = 9.50 \times 10^{-3}$ mol/min $= 1.58 \times 10^{-4}$ mol/s. The Bradford method gave a concentration of 1.01 mg/mL for the purified wild-type MDH stock solution, 1 mL of which was used in the reaction, and the molecular weight of wild-type MDH is 34458.63 Da, so $k_{cat} = V_{max} / (1.01\ \text{mg} / 34458.63\ \text{g·mol}^{-1}) = 5387\ \text{s}^{-1}$.
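The same calculation can be written out step by step. This is a sketch with a hypothetical function name; evaluated with the rounded inputs quoted above, it lands within about one percent of the reported value.

```python
def kcat_per_second(vmax_mol_per_min, enzyme_mg, molecular_weight_g_per_mol):
    """kcat = Vmax / (moles of enzyme), with Vmax converted to mol/s."""
    vmax_mol_per_s = vmax_mol_per_min / 60.0
    enzyme_mol = enzyme_mg * 1e-3 / molecular_weight_g_per_mol  # mg -> g -> mol
    return vmax_mol_per_s / enzyme_mol


# 1 mL of 1.01 mg/mL purified enzyme, molecular weight 34458.63 Da.
kcat_wt = kcat_per_second(9.50e-3, 1.01, 34458.63)
```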
The $k_{cat}$ and $K_M$ values of the other MDH variants were measured in the same way as for wild-type MDH. The measurement results are shown in Table 4:
From Example 2 it can be seen that the malate dehydrogenase variants obtained with the model and method constructed in Example 1 have improved catalytic efficiency toward malate and NAD+; in particular, the enzyme activity of the variants is 1-5 times that of the wild type. In other words, the model and method constructed in Example 1 can produce high-quality protein sequences that have a reasonable structure and the corresponding practical function, and therefore have considerable application potential.
Claims (10)
1. A construction of an amino acid sequence generation model for generating an amino acid sequence of a protein of interest, comprising:
constructing a data set for generation: collecting from a public protein database all actually existing amino acid sequences corresponding to the target protein, preprocessing them, clustering them based on the percentage identity of the actually existing amino acid sequences, and randomly selecting a number of clusters from among all clusters containing 5 or fewer sequences to serve as an evaluation set, wherein the number of randomly selected clusters is 20% or less of the total number of clusters obtained by clustering, and the remaining sequences are merged into a training data set;
constructing a network model structure: constructing a generation network and a discrimination network to form a preliminary TPGAN model;
model training and evaluation: inputting the training data set to the preliminary model, optimizing and iteratively training the generation network and the discrimination network at the same time using a back propagation algorithm, and adjusting the preliminary model with the evaluation data set to avoid overfitting, so as to obtain an adjusted model;
obtaining the generation model: verifying the adjusted model to obtain a generation model that can be used to generate amino acid sequences of the target protein.
2. The construction of claim 1, wherein:
wherein the preprocessing comprises de-duplication and de-noising, and discarding sequences longer than 500 amino acids.
3. Construction according to claim 1 or 2, characterized in that:
wherein the generation network construction includes an autoencoder construction and a generator construction,
the autoencoder is constructed as follows: an encoder and a decoder are each built from Transformer modules with four network layers, and a multi-head attention mechanism is applied in between;
the generator is a neural network built from three fully connected layers, which takes noise drawn from a Gaussian distribution as input, transforms it through the computation of several hidden layers and using a KL divergence loss into a vector that follows a normal distribution, and passes the vector to the decoder, which decodes it into the probability of each amino acid at each position; the probabilities are finally converted into a new amino acid sequence.
4. A construction according to claim 3, wherein:
wherein the discrimination network discriminates whether the amino acid sequence generated by the generation network is reasonable, and preferably, the discrimination network is a 3-layer MLP model.
5. The construction of claim 4, wherein:
the discrimination network receives the training data set and the amino acid sequence generated by the generation network, and learns the difference between the real amino acid sequence and the generated amino acid sequence in the training data set by using binary cross entropy as a loss function and calculating a plurality of hidden layers so as to judge whether the received amino acid sequence is the real amino acid sequence.
6. The construction according to claim 5, wherein:
wherein a plurality of loss functions are optimized simultaneously in the training, and the corresponding hyperparameters are tuned,
preferably, the learning rate is set to 1e-4, the dropout rate to 0.1, and the batch size to 8.
7. A construction according to claim 3, wherein:
the verification of the adjusted model is as follows: after each round of training, the amino acid sequences generated by the adjusted model are aligned against the protein database using BLAST software, and the generation model is obtained when the alignment result is improved 3-fold compared with the alignment result obtained after the initial training.
8. The construction of claim 1, wherein:
wherein the protein of interest is a variant of malate dehydrogenase.
9. A method for obtaining a protein variant, comprising:
randomly generating a number of amino acid sequences using the generation model of any one of claims 1-8;
calculating, with BLAST software, a similarity score between each generated amino acid sequence and the sequences of the protein database, and selecting the amino acid sequences whose BLAST scores rank in the top 100;
predicting the three-dimensional structure of each selected amino acid sequence with AlphaFold2 to obtain a pLDDT score, and retaining the amino acid sequences with pLDDT > 90;
comparing the three-dimensional structure of each retained amino acid sequence with the wild-type crystal structure to obtain the structural RMSD, and at the same time analyzing the conserved sites of the wild type;
selecting the amino acid sequences that retain all conserved sites and have RMSD < 2.0 to obtain the corresponding protein variants, and testing their function against the wild type to select the desired protein variants.
10. The obtaining method according to claim 9, characterized in that:
wherein the protein variant is a malate dehydrogenase variant, and preferably, a malate dehydrogenase variant whose enzyme activity is at least 1 time that of the wild type is the desired malate dehydrogenase variant.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310832292.4A CN117275582A (en) | 2023-07-07 | 2023-07-07 | Construction of amino acid sequence generation model and method for obtaining protein variant |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117275582A true CN117275582A (en) | 2023-12-22 |
Family
ID=89201572
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310832292.4A Pending CN117275582A (en) | 2023-07-07 | 2023-07-07 | Construction of amino acid sequence generation model and method for obtaining protein variant |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117275582A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||