CN117275582A

CN117275582A - Construction of amino acid sequence generation model and method for obtaining protein variant

Info

Publication number: CN117275582A
Application number: CN202310832292.4A
Authority: CN
Inventors: 王晨阳; 胡亦朗; 夏晋
Original assignee: Shanghai Zhuyao Technology Co ltd
Current assignee: Shanghai Zhuyao Technology Co ltd
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-12-22

Abstract

The invention provides a method for constructing an amino acid sequence generation model and a method for obtaining a protein variant, so as to generate high-quality protein sequences that have a reasonable structure and a corresponding actual function. Specifically, the construction comprises the following steps: constructing a data set for generation, in which all actually existing amino acid sequences corresponding to the target protein are collected from a public protein database, clustered, and divided into a training data set and an evaluation data set; constructing a network model structure, in which a generation network and a discrimination network are constructed to form a TPGAN preliminary model; model training and evaluation, in which the training data set is input into the preliminary model, the generation network and the discrimination network are optimized simultaneously and trained iteratively using a back propagation algorithm, and the preliminary model is adjusted using the evaluation data set to avoid overfitting, giving an adjusted model; and obtaining the generation model, in which the adjusted model is verified to obtain a generation model that can be used to generate amino acid sequences of the target protein.

Description

Construction of amino acid sequence generation model and method for obtaining protein variant

Technical Field

The invention belongs to the technical field of biology, and particularly relates to a construction method of an amino acid sequence generation model and a protein variant obtaining method.

Background

Proteins are important basic substances in living bodies, and the diversity of amino acid sequences is important for the survival and reproduction of living bodies.

The TPGAN model (Transformer-based protein generative adversarial network) is a large language model that can effectively generate brand-new protein sequences (amino acid sequences) and has wide application value.

However, conventional experiment-based protein sequence prediction methods have limitations such as high cost and long turnaround time, which has motivated research into the TPGAN model based on deep learning.

Disclosure of Invention

The invention provides a method for constructing an amino acid sequence generation model and a method for obtaining a protein variant. It aims to describe in detail a specific implementation of the TPGAN model and its application to protein sequence generation, so as to obtain high-quality protein sequences that have a reasonable structure and a corresponding actual function, with a view to broader technical adoption and application.

For this purpose, the present invention provides the following technical solutions.

The present invention provides a method for constructing an amino acid sequence generation model for generating an amino acid sequence of a target protein, comprising: constructing a data set for generation, in which all actually existing amino acid sequences corresponding to the target protein are collected from a public protein database, preprocessed, and clustered based on their percentage identity; a number of clusters is randomly selected from all clusters containing 5 or fewer sequences and pooled as an evaluation data set, the number of randomly selected clusters accounting for 20% or less of the total number of clusters obtained by clustering, and the sequences of the remaining clusters are pooled as a training data set; constructing a network model structure, in which a generation network and a discrimination network are constructed to form a TPGAN preliminary model; model training and evaluation, in which the training data set is input into the preliminary model, the generation network and the discrimination network are optimized simultaneously and trained iteratively using a back propagation algorithm, and the preliminary model is adjusted using the evaluation data set to avoid overfitting, giving an adjusted model; and obtaining the generation model, in which the adjusted model is verified to obtain a generation model that can be used to generate amino acid sequences of the target protein.

The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: the preprocessing comprises de-duplication and de-noising, and discarding sequences longer than 500 amino acids.

The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: the generation network construction comprises an autoencoder construction and a generator construction. The autoencoder is constructed as follows: an encoder and a decoder are built from Transformer modules, each using a four-layer network, with a multi-head attention mechanism applied in between. The generator is a neural network built from three fully connected layers; it takes as input noise drawn from a Gaussian distribution and, using a KL divergence loss, transforms it through several hidden layers into a vector that follows a normal distribution, which is passed to the decoder and decoded to generate a new amino acid sequence.

The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: the discrimination network judges whether an amino acid sequence generated by the generation network is reasonable; preferably, the discrimination network is a 3-layer MLP model.

The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: the discrimination network receives real amino acid sequences and amino acid sequences generated by the generation network and, using binary cross entropy as the loss function, learns the differences between real and generated amino acid sequences through several hidden layers, so as to judge whether a received amino acid sequence is a real one.

The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: several loss functions are optimized simultaneously during training and the corresponding hyperparameters are adjusted; preferably, the hyperparameters are set as follows: learning rate 1e-4, dropout rate 0.1, and batch size 8.

The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: the adjusted model is verified as follows: after each round of training and adjustment, the amino acid sequences generated by the model are aligned against a protein database using BLAST software, and the generation model is obtained when the alignment result has improved 3-fold relative to the result obtained after the initial training.

The construction of the amino acid sequence generation model provided by the invention also has the following characteristics: wherein the protein of interest is a variant of malate dehydrogenase.

The invention also provides a method for obtaining a protein variant, comprising the following steps: randomly generating a plurality of amino acid sequences using the generation model; calculating, with BLAST software, the similarity score of each generated amino acid sequence against the sequence library of a protein database, and selecting the amino acid sequences with the top 100 BLAST scores; predicting the three-dimensional structure of each selected amino acid sequence with AlphaFold2 to obtain a pLDDT score, and retaining the amino acid sequences with pLDDT > 90; comparing the three-dimensional structure of each retained amino acid sequence with the wild-type crystal structure to obtain the structural RMSD, and at the same time analyzing the conserved sites of the wild type; and selecting the amino acid sequences in which all conserved sites are retained and RMSD < 2.0 to obtain the corresponding protein variants, which are tested for function in comparison with the wild type to select the desired protein variants.

The obtaining method provided by the invention also has the following characteristics: when the protein variant is a malate dehydrogenase variant, a desired malate dehydrogenase variant is preferably one whose enzyme activity is at least 1-fold that of the wild type.

Drawings

FIG. 1 is a flow chart of the construction of the amino acid sequence generation model described in Example 1;

FIG. 2 is a graph of the fitted equation involved in Example 2.

Detailed Description

The following detailed description of the invention is provided in connection with the accompanying drawings. With respect to the specific methods or materials used in the embodiments, those skilled in the art may perform conventional alternatives based on the technical idea of the present invention and are not limited to the specific descriptions of the embodiments of the present invention.

The methods used in the examples are conventional methods unless otherwise specified; the materials, reagents and the like used, unless otherwise specified, are all commercially available.

Malate dehydrogenase (abbreviated as MDH, EC 1.1.1.37) is an enzyme that is widely present in organisms, including plants, animals, and microorganisms. It catalyzes the redox reaction between malic acid and NAD+, converting malic acid to oxaloacetic acid while reducing NAD+ to NADH.

Malate dehydrogenase plays an important physiological role in organisms and is involved in the regulation of many metabolic pathways, such as tricarboxylic acid cycle, photosynthesis, respiration, etc. In plants, malate dehydrogenase is also involved in regulating plant response to environmental adaptation, such as regulating root growth and development, adaptation to acidic soil, and the like.

Therefore, research on malate dehydrogenase has important significance for deeply knowing metabolic pathways and regulation mechanisms thereof in organisms and improving agricultural production efficiency.

Variants herein refer to proteins whose amino acid sequence differs from that of the wild type but which retain the essential properties of the wild type.

The enzyme activity herein is expressed in enzyme activity units: 1 unit of enzyme activity is the amount of enzyme that converts 1 μmol of substrate, or 1 μmol of the relevant group in the substrate, in 1 minute under specified conditions (25 °C, with all other conditions optimal).

kcat is the catalytic constant of the enzyme, also called the turnover number, i.e., the number of substrate molecules that one enzyme molecule converts into product per unit time. kcat can be used to measure the catalytic efficiency of an enzyme: the larger the kcat value, the higher the catalytic efficiency.

The Michaelis constant K_m is defined as the substrate concentration at which the enzyme operates at half its maximum catalytic rate; it therefore describes the affinity of an enzyme for a particular substrate. Knowledge of K_m values is crucial for a quantitative understanding of enzymatic and regulatory interactions between enzymes and metabolites, because K_m relates the reaction rate to the intracellular concentration of a metabolite. K_m reflects the affinity of the enzyme for the substrate: the smaller the K_m value, the greater the affinity; conversely, the larger the K_m value, the smaller the affinity.

kcat/K_m combines kcat and K_m; it can be used not only to measure the catalytic efficiency of an enzyme but also to indicate how close the enzyme is to catalytic perfection.

Example 1

The present embodiment provides a method for constructing an amino acid sequence generation model for generating an amino acid sequence of a target protein, comprising the steps of: constructing a data set for generation, constructing a network model structure, model training and evaluation, and obtaining the generation model.

The protein of interest refers to the finally obtained protein having the desired function, for example a variant of malate dehydrogenase.

The construction process is explained in detail as follows (see FIG. 1):

Constructing the data set for generation specifically comprises: collecting from public protein databases all actually existing amino acid sequences corresponding to the target protein, preprocessing them, and clustering them based on their percentage identity; a number of clusters is randomly selected from all clusters containing 5 or fewer sequences and pooled as the evaluation data set, the number of randomly selected clusters accounting for 20% or less of the total number of clusters obtained by clustering. For example, if clustering yields 50 clusters, of which 30 contain 5 or fewer sequences, 10 of those 30 clusters are randomly selected and pooled as the evaluation data set. "All actually existing amino acid sequences corresponding to the target protein" means amino acid sequences that already exist in reality, namely the wild-type protein corresponding to the final target protein and all other variants relative to the wild type.

Optionally, when the number of clusters containing 5 or fewer sequences is less than or equal to 20% of the total number of clusters obtained by clustering, all clusters containing 5 or fewer sequences are pooled as the evaluation data set. For example, if clustering yields 50 clusters, of which 5 contain 5 or fewer sequences, all 5 of those clusters are pooled as the evaluation data set.
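
The evaluation-set selection rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_clusters`, its parameters, and the representation of clusters as lists of strings are assumptions made for the example, and the clustering itself (by percentage identity) is assumed to have been done already by a separate tool.

```python
import random


def split_clusters(clusters, max_eval_cluster_size=5, max_eval_fraction=0.2, seed=0):
    """Split clustered sequences into a training set and an evaluation set.

    clusters: list of clusters, each cluster being a list of sequences.
    Returns (training_sequences, evaluation_sequences, evaluation_cluster_ids).
    """
    rng = random.Random(seed)
    # Only clusters with 5 or fewer sequences are eligible for evaluation.
    small = [i for i, c in enumerate(clusters) if len(c) <= max_eval_cluster_size]
    # At most 20% of all clusters may be used for evaluation.
    quota = int(len(clusters) * max_eval_fraction + 1e-9)
    if len(small) <= quota:
        eval_ids = set(small)  # optional rule: take every small cluster
    else:
        eval_ids = set(rng.sample(small, quota))
    train = [s for i, c in enumerate(clusters) if i not in eval_ids for s in c]
    evaluation = [s for i, c in enumerate(clusters) if i in eval_ids for s in c]
    return train, evaluation, eval_ids
```

With 50 clusters of which 30 are small, the sketch selects 10 small clusters for evaluation; with only 5 small clusters, it selects all 5, matching the two examples in the text.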

Preferably, the protein of interest for which the construction process is directed is a variant of malate dehydrogenase.

Constructing the network model structure specifically comprises: constructing the generation network and the discrimination network to form the TPGAN preliminary model.

Model training and evaluation specifically comprise: inputting the training data set into the preliminary model, optimizing and iteratively training the generation network and the discrimination network simultaneously by back propagation of the loss functions, and adjusting the preliminary model using the evaluation data set to avoid overfitting, giving an adjusted model.

Obtaining the generation model: the adjusted model is verified to obtain a generation model that can be used to generate amino acid sequences of the target protein.

The TPGAN model adopts generative adversarial network technology: by learning the arrangement and distribution patterns of amino acids in protein sequences, the model extracts the features of protein sequences and generates brand-new protein sequences based on those features.

Compared with an ordinary generative adversarial network, a large Transformer-based pre-trained protein language model is added. A large model pre-trained on massive numbers of protein sequences can extract the regular features of the protein language more effectively, and the attention mechanism in the Transformer lets the weights be learned from the data automatically, which provides the model with better-informed weights.

In one example, the preprocessing described above comprises de-duplicating and de-noising the collected actually existing amino acid sequences and discarding sequences longer than 500 amino acids.
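
A minimal Python sketch of this preprocessing step is given below. The text does not define what de-noising involves, so the sketch assumes, purely for illustration, that it means dropping sequences containing characters outside the 20 standard amino acid letters; the function name `preprocess` and the constant `STANDARD_AMINO_ACIDS` are likewise hypothetical.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def preprocess(sequences, max_length=500):
    """De-duplicate, de-noise and length-filter raw amino acid sequences."""
    seen = set()
    cleaned = []
    for raw in sequences:
        # Normalize: remove whitespace and line breaks, use upper case.
        seq = "".join(raw.split()).upper()
        # Discard empty sequences and sequences longer than 500 residues.
        if not seq or len(seq) > max_length:
            continue
        # Assumed de-noising rule: keep only standard amino acid letters.
        if not set(seq) <= STANDARD_AMINO_ACIDS:
            continue
        # De-duplication: keep the first occurrence only.
        if seq in seen:
            continue
        seen.add(seq)
        cleaned.append(seq)
    return cleaned
```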

In an example, the construction of the generation network comprises an autoencoder construction and a generator construction.

The autoencoder is constructed as follows: the encoder and the decoder are built from Transformer modules, each using a four-layer network, with a multi-head attention mechanism applied in between; the input of the encoder is an amino acid sequence and its output is a vector.

The generator is a neural network built from three fully connected layers. It takes as input noise drawn from a Gaussian distribution and, using a KL divergence loss, transforms it through several hidden layers into a vector that follows a normal distribution; this vector is passed to the decoder, which decodes it into a probability for each site, and the result is finally converted into an amino acid sequence.
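
The last step, converting the per-site probabilities produced by the decoder into an amino acid sequence, can be illustrated as follows. The text does not say how the conversion is done; this sketch assumes a simple greedy choice (the most probable residue at each site) over the 20 standard amino acids, and the names `AMINO_ACIDS` and `decode_site_probabilities` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def decode_site_probabilities(site_probs, alphabet=AMINO_ACIDS):
    """Turn per-site probability vectors into an amino acid sequence.

    site_probs: one probability vector per site, each with one entry
    per letter of `alphabet`. Assumed rule: pick the most probable letter.
    """
    residues = []
    for probs in site_probs:
        if len(probs) != len(alphabet):
            raise ValueError("each site needs one probability per residue type")
        best = max(range(len(probs)), key=probs.__getitem__)
        residues.append(alphabet[best])
    return "".join(residues)
```

Sampling from the per-site distribution instead of taking the maximum would be an equally plausible reading; the greedy rule is chosen here only because it is deterministic.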

In one example, the discrimination network judges whether an amino acid sequence generated by the generation network is reasonable; preferably, the discrimination network is a neural network model consisting of 3 fully connected layers. Specifically, the discrimination network receives real amino acid sequences from the training data set and amino acid sequences generated by the generation network and, using binary cross entropy as the loss function, learns the differences between them through several hidden layers so as to judge whether a received amino acid sequence is real: each received real or generated amino acid sequence is scored, with an output of 1 meaning that the sequence is judged to be real and 0 meaning that it is not.
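
For reference, the binary cross entropy used by the discrimination network can be written out directly. The sketch below is a plain-Python version of the standard formula, with label 1 for real sequences and 0 for generated ones as in the text; the function name and the clipping constant `eps` are choices made for this example, not details taken from the source.

```python
import math


def binary_cross_entropy(predictions, labels, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        # Clip to avoid log(0) for fully confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(predictions)
```

A discriminator that outputs 0.9 for a real sequence and 0.1 for a generated one gets a lower loss than one that outputs 0.5 for both, which is the signal back propagation uses to separate the two classes.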

In an example, when the generation network and the discrimination network are optimized simultaneously and trained iteratively using the back propagation algorithm, several loss functions are optimized at the same time and the corresponding hyperparameters are adjusted; preferably, the hyperparameters are set as follows: learning rate 1e-4, dropout rate 0.1, and batch size 8.

In one example, the adjusted model is verified as follows: after each round of training and adjustment, the amino acid sequences generated by the model are aligned against a protein database using BLAST software to obtain similarity scores, and the generation model is obtained when the scores have improved 3-fold relative to the alignment scores obtained after the initial training.
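
The stopping criterion can be sketched as a simple check over per-round scores. The text does not specify how the BLAST scores of many generated sequences are aggregated into one number per round, nor whether "improved 3-fold" means reaching three times the initial score; the sketch assumes one aggregate score per round (for example a mean) and the threshold `score >= 3 * initial`. The function name `first_validated_round` is hypothetical.

```python
def first_validated_round(round_scores, factor=3.0):
    """Return the index of the first round whose score reaches
    `factor` times the score of the initial round, or None.

    round_scores[0] is the aggregate BLAST score after the initial training.
    """
    if not round_scores:
        return None
    baseline = round_scores[0]
    for idx, score in enumerate(round_scores[1:], start=1):
        if score >= factor * baseline:
            return idx
    return None
```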

The embodiment also provides a method for obtaining a protein variant, which comprises the following steps:

randomly generating a plurality of amino acid sequences using the generation model obtained by the training described above;

calculating, with BLAST software, the similarity score of each generated amino acid sequence against the sequence library of the protein database, specifically by aligning it against the collected real amino acid sequences, and selecting the amino acid sequences with the top 100 BLAST scores;

predicting the three-dimensional structure of each selected amino acid sequence with AlphaFold2 to obtain a pLDDT score, and retaining the amino acid sequences with pLDDT > 90;

comparing the three-dimensional structure of each retained amino acid sequence with the wild-type crystal structure to obtain the structural RMSD, and at the same time analyzing the conserved sites of the wild type;

selecting the amino acid sequences in which all conserved sites are retained and RMSD < 2.0 to obtain the corresponding protein variants, and testing their function in comparison with the wild type to select the desired protein variants.
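
The computational part of this screening cascade amounts to three successive filters, which the following sketch applies to precomputed values. It assumes that the BLAST score, pLDDT, RMSD and conserved-site check for each candidate have already been obtained from the respective tools; the `Candidate` record and the function `screen_candidates` are hypothetical names introduced for the example. In the procedure described above, structure prediction is only run on the top 100 sequences, whereas here all values are supplied up front for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    sequence: str
    blast_score: float              # similarity score from BLAST
    plddt: float                    # confidence score from structure prediction
    rmsd: float                     # structural RMSD versus the wild-type structure
    conserved_sites_retained: bool  # True if every conserved site is kept


def screen_candidates(candidates, top_n=100, min_plddt=90.0, max_rmsd=2.0):
    """Apply the three filters in order and return the surviving candidates."""
    # 1. Keep the top-N sequences by BLAST score.
    ranked = sorted(candidates, key=lambda c: c.blast_score, reverse=True)[:top_n]
    # 2. Keep sequences with pLDDT > 90.
    confident = [c for c in ranked if c.plddt > min_plddt]
    # 3. Keep sequences with all conserved sites retained and RMSD < 2.0.
    return [c for c in confident if c.conserved_sites_retained and c.rmsd < max_rmsd]
```

The survivors of this filter are the candidates that would go on to the wet-lab functional comparison with the wild type.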

In one example, the protein variant targeted by the obtaining method is a malate dehydrogenase variant; preferably, a malate dehydrogenase variant is a desired variant when its enzyme activity is at least 1-fold that of the wild type (the wild-type malate dehydrogenase has the amino acid sequence shown in SEQ ID NO: 13). In one example, the malate dehydrogenase variant obtained by the method has an amino acid sequence as shown in any one of SEQ ID NO: 1-12, or an amino acid sequence having at least 85%, 90%, 95% or more identity with any one of SEQ ID NO: 1-12.

SEQ ID NOS 1-13 are shown in detail as follows:

SEQ ID NO:1：

MKVAVLGAAGGIGQALALLLKTQLPAGSELSLYDIAPVTPGVALDLSHIPTNVEVKGFSGEDATPALEGADVVLISAGVARKPGMDRSDLFNINAGIVRNLVEKIAKTFPSAIIGIITNPVNTTVAIAAEVLKKAGKYDKNKLFGVTTLDIIRSETFVAELKGKDPVEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVAALTKRIQNAGTEVVEAKAGGGSATLSMGQAAVRFGLSLVRGLQGENGVVECALVEGDGKHARFCAQPLLLGKNGVEERKSYGDLSAFEQQALEGMLATLKTDITLGEEFVKK；

SEQ ID NO:2：

MKVAVLGAAGGIGQALALLLKTQLPSGSELTLYDIAPVTPGVAVDLSHIPTAVKITGFSGEDAAPALEGADIVVISAGVRRKPGMDRSDLAPVNYGIVENLTKQIAKVTPDAIVGIITNPVNATVAVAEAVLEKAGVYDPRKLFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGRTIIPLLSQVEGVTFTPEEVKALTRRIQNAGTEVVEAKAGGGSATLSMGQAAARFVLDLVAAKEGAENIVRDALVKNDGSYAHFFTRPCLLGTDGIKEVLSIGELSEFEKARLEASRPYLSAEIAKGFAYVNT；

SEQ ID NO:3：

MKVAVLGAAGGIGQALALLLKTQLPSGSTLTLYDIAPVTPGVAVDLSHIPTAVKIEGFTGEDAAPALEGADIVVISAGVRRKPGMDRSDLKPVNFGIVENLTKQIAEVTPDAIILIITNPVNTTVAIAAEVLKKAGVYDPKRLFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGKTIIPLLSKVEGLTFTDEEVEELTKRIQNAGTEVVEAKAGGGSATLSMGQAAARTVLAVARARAGAENVVLDVLVEGDGSYARFFTRPCLLGTDGVKEILSIGELSDFEKKRLEESIPYMKEEIDAGYDYVNN；

SEQ ID NO:4：

MKVAVLGAAGGIGQALALLLKTQLPAGSELSLYDIAPVTPGVAVDLSHIPTAVKVKGFSGEDHTPALEGADVVLISAGVARKPGMDRSDLFNVNAGIVKNLVEQIAKTFPKAIIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDVIRSETFVAELKPKDPVEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVAALTKRIQNAGTEVVEAKAGGGSATLSMGQAAVRFGLSLVRGLQGENGVVECALVEGDGKHARFCAQPLLLGKNGVEERKSYGKLSAFEQQALEGMLATLKKDITLGEEFVKKGSPAATAAERILVVVITDRN；

SEQ ID NO:5：

MKKKVTVVGAGNVGATAAQEIAEKESRDVVLDDGMEGLPQGKALDVLQAGPLIGQSARISGTNDSSGTAGSDVVVITAGIPRKPGMSRDDLIGTNADIVKSVTENVVKLSPKAYIIVVSNPLDAMGYTAFSATGFPIERVIGMAGALDSARFRAFIAMELNVSAGNIQAVVLGGHGDTMVPLKRRTTVAGIPITSLMSAEGIEVIVMRTRMGGAEIVILLKTGSAYAAPSASEATMVDSIVKDQKRILPCALYLEGEYGASGICVGVPVKLGANGVEEIVDIKLQEEEKLLISISAKAVREMNKVLSVL；

SEQ ID NO:6：

MKVAVLGAAGGIGQALALLLKTQLPAGSELSLYDIAPVTPGVALDLSHIPTNVEVKGFSGEDATPALEGADVVLISAGVARKPGMDRSDLFNINAGIVRNLVEQIAKTFPKAIIGIITNPVNTTVAIAAEVLKKAGKYDKNKLFGVTTLDIIRSETFVAELKGKDPVEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVAALTKRIQNAGTEVVEAKAGGGSATLSMGQAAVRFGLSLVRGLQGENGVVECALVEGDGKHARFCAQPLLLGKNGVEERKSYGDLSAFEQQALEGMLATLKKDITTGE；

SEQ ID NO:7：

MKVAVLGAAGGIGQALALLLKTQLPAGSELSLYDIAPVTPGVAADLSHIPTNVFVKGFSGEDATPALEGADVVLISAGVARKPGMDRSDLFNVNAGIVKNLVEQIAKTFPKAIIGIITNPVNTTVAIAAEVLKKAGKYDKNKLFGVTTLDVIRSETFVAELKPKDPVEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAVRFGLSLVRALQGENGVVECALVEGDGKHARFCAQPLLLGKNGVEERKSYGDLSAFEQQALDGMLATLKKDITTME；

SEQ ID NO:8：

MKVAVLGAAGGIGQALALLLKTQLPSGSELTLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDASPALEGADVVVISAGVRRKPGMDRSDLAPVNFGIVENLTRQIAKVTPNAIVGIITNPVNSTVAVAAEVLKKEGVYDPKRLFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGETIIPLLSQVKGLTFSDEEIRDLTARIQNAGTEVVEAKAGGGSATLSMGQAAARFVLDVVAALEGEKNIIRDALVENDGSYARFFTAPCLLGTDGIEKVLSIGTLSAFEKAQLAASRPIMNAEIDKGFDYVNK；

SEQ ID NO:9：

MKVTVVGAGAVGATCAENIANKQIASEVVLLDIKEGFAEGKALDIMQTASLNGFDTKITGVTNDYSKTAGSDVVVITSGIPRKPGMTREELIGINAGIVKSVTENLLKLSPDRIIIVVSNPMDTMTYLAFKATGLPKNRIIGMGGALDSVRFRYFLSLALNVSASDLQAMVIGGHGDTTMIPLIRLATLNSIPVSKMLAGEELDEVAQDTMVGGATLTKLIGTSAWYAPGAAVATLVDSIVKDQKKIFPCSVYLEGEYGQKDICIGVPVILGANGVEKIVDIDLQDAEKAKLSKSADAVREMNKVLSV；

SEQ ID NO:10：

MVLKKILVGGAGNVGHTAANRAADERIGVVVLFDIVAGVPQGKELDIAESGPNEGFDRKTKGTNDYAGIAGSDVVIITAGIPRKPGMSRDDLLEINAKIVKSVVEGILKYSPDAIVIVVSNPLDVMVWVAQKFSGFPKNRVLGMAGVLDSSRFKYFEAEYLEVSMEDVLAFVLGGHGDTMVPLVRYDTVAGIPVTELLDSPEIAAIVERTRGGGAEIVTLLKTGSAYYAPSAAVAELVEAILPDTKKILPVAAHLAGEYGVSDMFVGVPVKLGSHGVEGIIEGKLTEAEDAAFQSSAESVDEGLAVLAAL；

SEQ ID NO:11：

MKVAVLGAAGGIGQALALLLKTRLPAGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFAGEDPTPALEGADVVLISAGVARKPGMDRSDLFNINAGIVKNLVEQNAKIFPKAIIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFIVTTLDVIRSETFVAELKGLDPAEVDVPVIGGHSGVTILPLLSQVPGVSFTNQEVAALTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVRALQGENGVVECALVEGDGKHARFGAQPLLLGKNGVEAVKSYGKLSAFEQQALEGMLATLKADIVLGEEFVKK；

SEQ ID NO:12：

MKVAVLGAAGGIGQALALLLKTQLPSGSELKLYDIAPVTPGVAVDLSHIPTAVRIEGFTGEDATPALEGADVVVISAGVRRKPGMDRSDLIPVNFGIVENLIKQIAETTPDAVILIITNPVNSTVAVAAEVLEKAGVYDPKRLFGVTTLDIIRSNTFVAELKGKQPGEVEVRVIGGHSGETIIPLLSQVEGVTFTEEEKKELTDRIQNAGTEVVEAKAGGGSATLSMGQAAARTVLAVVRALRGEKDVVLDLLVKGDGSYSEFFTAPCLLGKDGVEEILSIGELDEYEKELLESSLPYLNRLIAIGKDYVNN；

SEQ ID NO:13：

MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVRRKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIIRSNTFVAELKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVRALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGQEFVNK。

Example 2

This example demonstrates the effectiveness of the method of Example 1 using the malate dehydrogenase variants obtained.

Plasmids containing the variants were constructed and synthesized by Beijing Qingke Biotechnology Co., Ltd.; wild-type malate dehydrogenase was used as the comparative example.

1. Recombinant Escherichia coli culture and crude enzyme preparation

1. Transformation of the plasmid into E. coli BL21(DE3): on a clean bench, add 2 μL of plasmid at a concentration of 50 mg/L to 100 μL of BL21(DE3) competent cell suspension. Mix by flicking the tube by hand or by gentle pipetting, and place on ice for 30 min; heat-shock for 90 s in a 42 °C water bath, then quickly transfer to ice and cool for 5 min, avoiding shaking. On the clean bench, add 0.9 mL of LB liquid medium to the cell suspension, mix well, and culture with shaking at 37 °C and 150-225 rpm for 45 min so that the cells recover to a normal growth state. Centrifuge at 4000 rpm for 1 min, remove supernatant on the clean bench until 200 μL of bacterial suspension remains, and resuspend by pipetting;

2. Plating: spread the 200 μL of bacterial suspension on an LB medium plate containing kanamycin using a disposable spreader. Seal the plate with sealing film and let it stand until the liquid is fully absorbed. Invert the plate and culture at 37 °C for 12-24 h until transformants appear;

3. Culturing the primary seed culture: pick 1 single colony into 10 mL of LB liquid medium containing kanamycin at a final concentration of 50 μg/mL and grow the primary seed culture. Culture conditions: 37 °C, 200 rpm, 12-18 h;

4. Sequencing: send 2 mL of the primary seed culture for gene sequencing to check whether the sequence of the target gene fragment of the MDH variant on the plasmid is correct;

5. Glycerol stock: on the clean bench, open a sterile cryopreservation tube, add 0.5 mL of primary seed culture with a pipette, then add 0.5 mL of sterilized 50% glycerol, mix well with the pipette, and close the cap. Store in a -80 °C freezer;

6. Growing the primary seed culture from the glycerol stock: once sequencing confirms that the sequence is correct, inoculate 10 μL of the MDH variant glycerol stock into 10 mL of LB liquid medium containing kanamycin at a final concentration of 50 μg/mL. Culture conditions: 37 °C, 200 rpm, 12-18 h;

7. Culturing the secondary seed culture: inoculate 10 mL of the MDH variant primary seed culture into 200 mL of LB medium containing kanamycin at a final concentration of 50 μg/mL. Culture conditions: 37 °C, 180 rpm; sample during culture and measure OD600 until it reaches 0.6-0.8;

8. Induction: after the secondary seed culture has been grown, add an aqueous IPTG solution on the clean bench so that the final IPTG concentration in the culture reaches 0.5 mM. Culture conditions: 25 °C, 180 rpm, 16-24 h;

9. Collecting cells by low-temperature centrifugation: dispense the induced culture into 50 mL centrifuge tubes, 40 mL per tube, centrifuge at 4 °C and 8000 rpm for 5 min, discard the supernatant, and keep the cell pellet;

10. Washing: add 5 mL of Tris-HCl buffer (50 mM, pH 6.8-7.2) to the cell pellet in each centrifuge tube, resuspend by pipetting, combine the contents of several tubes into one tube, vortex, centrifuge at 8000 rpm for 5 min, discard the supernatant, and keep the cell pellet;

11. Ultrasonic disruption: add 15 mL of Tris-HCl buffer (50 mM, pH 6.8-7.2) to the washed cell pellet in a 50 mL centrifuge tube, resuspend by pipetting, and vortex to mix. Set the ultrasonic disruptor to a working power of 230 W, with cycles of 3 s on and 7 s off, for a total working time of 40 min. Keep the centrifuge tube in an ice-water bath while the cells are being disrupted;

12. Removing the precipitate by centrifugation: centrifuge the sonicated suspension at 4 °C and 10000 rpm for 40 min. Discard the precipitate and keep the supernatant, which is the crude enzyme solution of the MDH variant.

2. Enzyme activity assay:

1. Preheat the ultraviolet spectrophotometer for 30 min in advance, set the wavelength to 340 nm, and zero the background absorbance with distilled water;

2. Keep the distilled water to be used at 37 °C in a water bath;

Add, in order, 760 μL of distilled water, 10 μL of 0.8 mM NADH aqueous solution, 10 μL of 1.6 mM L-malic acid aqueous solution, and 20 μL of the crude enzyme solution to be tested to a 1 mL quartz cuvette, mix thoroughly by pipetting, and immediately record the initial absorbance A1 and the absorbance A2 after 1 min of reaction at a wavelength of 340 nm, keeping the reaction solution at 37 °C throughout the reaction;

3. Enzyme activity calculation: under optimal conditions, the amount of enzyme that converts 1 μmol of substrate in 1 min is 1 U. Here, the enzyme activity of the crude enzyme solution (U/mL) = ΔA × V_reaction solution / (ε × l × t × V_enzyme solution). The notes for each term in the formula are given in Table 1:

Both wild-type MDH and the MDH variants were tested and the enzyme activity was calculated as described above; the results are shown in Table 2:

Conclusion: as can be seen from Table 2, the enzyme activities of the 12 MDH variants are significantly better than that of wild-type MDH.
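
The enzyme activity formula used in this assay can be sketched as a small Python function. Table 1 is not reproduced in this text, so the meaning of each symbol is assumed here: ΔA is the absorbance change |A1 - A2| at 340 nm, ε is the millimolar extinction coefficient of NADH at 340 nm (the commonly used value of 6.22 mM^-1·cm^-1 is assumed), l is the optical path length in cm (1 cm assumed), t is the reaction time in min, and the two volumes are in mL. The function name and parameter names are hypothetical.

```python
def crude_enzyme_activity(delta_a, v_reaction_ml, v_enzyme_ml,
                          time_min=1.0, path_cm=1.0, epsilon_mm_cm=6.22):
    """Enzyme activity of the crude enzyme solution in U/mL.

    delta_a / (epsilon * path) gives the NADH concentration change in mM
    (equal to umol/mL); multiplying by the reaction volume gives umol,
    and dividing by time and enzyme volume gives U per mL of enzyme solution.
    epsilon_mm_cm = 6.22 is an assumed literature value for NADH at 340 nm.
    """
    return delta_a * v_reaction_ml / (epsilon_mm_cm * path_cm * time_min * v_enzyme_ml)


# Assay volumes from the protocol: 760 + 10 + 10 + 20 uL = 0.8 mL total,
# of which 0.02 mL is crude enzyme solution. delta_a = 0.1 is made up.
activity = crude_enzyme_activity(delta_a=0.1, v_reaction_ml=0.8, v_enzyme_ml=0.02)
```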

3. Measurement of the kcat and K_M values of wild-type MDH and the 12 MDH variants:

The method for measuring the kcat and K_M values of MDH is described below, taking wild-type MDH as an example:

after the wild-type MDH crude enzyme solution was purified, the concentration of the purified enzyme was measured by the Bradford method.

The reaction was designed to catalyze the conversion of the substrates L-malate and NAD+ to oxaloacetate and NADH with wild-type MDH, and the concentration of NADH generated by the reaction was measured by High Performance Liquid Chromatography (HPLC).

The reaction system: the total volume of the reaction was 10mL, and the addition amount of the wild-type MDH-purified enzyme solution was 1mL. L-malic acid concentration 9 gradients [ S ] were set: 5. Mu.M, 10. Mu.M, 20. Mu.M, 40. Mu.M, 80. Mu.M, 160. Mu.M, 320. Mu.M, 640. Mu.M, 1280. Mu.M, L-malic acid was prepared as a 10mM stock solution at the time of use, and the loading volume was calculated from the desired concentration. NAD+ was set at 2mM, and NAD+ was prepared as a 10mM stock solution at the time of use, and the loading volume was calculated from the desired concentration. The whole reaction volume was made up to 10mL with ultrapure water.

Reaction sampling and detection: the reaction was run at 37 °C for 1 min, after which 1 mL of the reaction solution was taken and heat-inactivated for 1 min in an 80 °C water bath. The inactivated sample was diluted to a suitable concentration, the NADH absorption peak at 340 nm was detected by HPLC, and the actual NADH concentration in the reaction solution was calculated from a standard curve prepared with NADH standard. The reaction rate v was calculated from the concentration of NADH formed (NADH is formed in an amount equimolar to oxaloacetate).

As shown in FIG. 2, the Lineweaver-Burk equation for wild-type MDH under the experimental reaction conditions was obtained by linear fitting using the double-reciprocal plot method: the reciprocal 1/[S] of the initial L-malic acid concentration was taken as the abscissa and the reciprocal 1/v of the reaction rate measured at each concentration as the ordinate, a scatter plot was made in Excel, and the corresponding linear equation y = kx + b was calculated, where k corresponds to K_M/Vmax and b to 1/Vmax in the Lineweaver-Burk equation. The values of the variables are shown in Table 3.

From the fitted equation, Vmax = 1/105.3 = 9.50×10^-3 (mol/min) and K_M = 0.0126 × Vmax = 1.20×10^-4 (mol/L).
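The double-reciprocal fit can be reproduced outside Excel with an ordinary least-squares line. The following sketch is illustrative only: the function names are hypothetical and the demonstration data are synthetic (generated from the Michaelis-Menten equation with made-up constants), since Table 3 is not reproduced here. The slope 0.0126 and intercept 105.3 in the last step are the values implied by the calculation above.

```python
def lineweaver_burk_fit(substrate, rate):
    """Least-squares line through (1/[S], 1/v); returns slope k and intercept b."""
    if len(substrate) != len(rate) or len(substrate) < 2:
        raise ValueError("need at least two matching (S, v) pairs")
    xs = [1.0 / s for s in substrate]
    ys = [1.0 / v for v in rate]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b


def kinetic_constants(k, b):
    """Convert slope k = K_M/Vmax and intercept b = 1/Vmax to (Vmax, K_M)."""
    vmax = 1.0 / b
    km = k * vmax
    return vmax, km


# Synthetic check: noise-free Michaelis-Menten data with made-up constants.
true_vmax, true_km = 2.0, 50.0
s_values = [5, 10, 20, 40, 80, 160, 320, 640, 1280]
v_values = [true_vmax * s / (true_km + s) for s in s_values]
fit_vmax, fit_km = kinetic_constants(*lineweaver_burk_fit(s_values, v_values))

# Slope and intercept implied by the calculation in the text.
patent_vmax, patent_km = kinetic_constants(0.0126, 105.3)
print(f"Vmax = {patent_vmax:.2e}, K_M = {patent_km:.2e}")
```

On noise-free data the fit recovers the constants used to generate it. On real measurements the low-concentration points dominate a double-reciprocal fit, so the result is sensitive to error in the smallest rates.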

By definition, kcat = Vmax / (molar amount of enzyme in the reaction). Vmax = 9.50×10^-3 mol/min = 1.58×10^-4 mol/s. The concentration of the purified wild-type MDH stock solution measured by the Bradford method was 1.01 mg/mL, 1 mL was used in the reaction, and the molecular weight of wild-type MDH is 34458.63 Da, so kcat = Vmax / (1.01 mg / 34458.63 g·mol^-1) = 5387 s^-1.
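The kcat arithmetic can be retraced in a few lines. This is a sketch with a hypothetical function name; it uses only the values quoted above and carries the intermediate results unrounded, so it lands close to, but not exactly on, the reported 5387 s^-1.

```python
def kcat_per_second(vmax_mol_per_min, enzyme_mass_mg, molecular_weight_g_per_mol):
    """kcat = Vmax / (moles of enzyme in the reaction), in s^-1."""
    vmax_mol_per_s = vmax_mol_per_min / 60.0
    # mg -> g, then divide by the molecular weight to get moles of enzyme.
    enzyme_mol = enzyme_mass_mg * 1e-3 / molecular_weight_g_per_mol
    return vmax_mol_per_s / enzyme_mol


# 1 mL of a 1.01 mg/mL stock contains 1.01 mg of enzyme.
kcat = kcat_per_second(9.50e-3, 1.01 * 1.0, 34458.63)
print(f"kcat = {kcat:.0f} s^-1")
```

The unrounded result is roughly 5.4×10^3 s^-1, within 1% of the reported value; the small difference is consistent with rounding of intermediate figures.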

The kcat and K_M values of the other MDH variants were measured in the same way as for wild-type MDH. The measurement results are shown in Table 4:

As can be seen from Example 2, the malate dehydrogenase variants obtained with the model and method constructed in Example 1 show improved catalytic efficiency toward malate and NAD+; in particular, the enzyme activity of the variants is 1 to 5 times that of the wild type. That is, the model and method constructed in Example 1 can generate high-quality protein sequences with reasonable structure and corresponding practical function, and therefore have considerable application potential.

Claims

1. A construction of an amino acid sequence generation model for generating an amino acid sequence of a protein of interest, comprising:

constructing a data set for generation, collecting all actually existing amino acid sequences corresponding to target proteins from a public protein database, preprocessing, clustering based on the consistency percentage of the actually existing amino acid sequences, randomly selecting a certain number of clusters from all clusters with the number of sequences less than or equal to 5 in the clusters to be used as an evaluation set, wherein the total number of the randomly selected clusters as the evaluation set accounts for 20% or less of the total number of all clusters obtained by clustering, and the rest sequences are merged into a training data set;

constructing a network model structure, and performing generating network construction and judging network construction to form a TPGAN preliminary model;

model training and evaluation, wherein a training data set is input by adopting the preliminary model, a generating network and a judging network are simultaneously optimized and iteratively trained by utilizing a back propagation algorithm, and the preliminary model is adjusted by adopting the evaluation data set to avoid overfitting so as to obtain an adjusted model;
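The data-set split recited in claim 1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: clusters are assumed to be already computed and passed in as lists of sequences, the function name and the fixed random seed are hypothetical, and the sketch simply takes the largest number of evaluation clusters that the 20% limit allows, whereas the claim only requires "a certain number".

```python
import random


def split_clusters(clusters, max_eval_cluster_size=5, eval_fraction=0.2, seed=0):
    """Split clustered sequences into (training set, evaluation set).

    Only clusters with at most max_eval_cluster_size sequences may be drawn
    for evaluation, and at most eval_fraction of all clusters are drawn.
    """
    rng = random.Random(seed)
    small_ids = [i for i, c in enumerate(clusters)
                 if len(c) <= max_eval_cluster_size]
    # Assumption: use as many evaluation clusters as the limit permits.
    n_eval = min(int(eval_fraction * len(clusters)), len(small_ids))
    eval_ids = set(rng.sample(small_ids, n_eval))
    eval_set = [seq for i in sorted(eval_ids) for seq in clusters[i]]
    # All remaining sequences are merged into the training set.
    train_set = [seq for i, c in enumerate(clusters)
                 if i not in eval_ids for seq in c]
    return train_set, eval_set


# Toy clusters with placeholder sequence names (not real proteins).
sizes = [1, 2, 3, 8, 9, 4, 5, 12, 6, 2]
toy_clusters = [[f"c{i}_s{j}" for j in range(n)] for i, n in enumerate(sizes)]
train_set, eval_set = split_clusters(toy_clusters)
print(len(train_set), len(eval_set))
```

Drawing whole clusters rather than individual sequences keeps near-identical sequences from appearing in both sets, which is what makes the evaluation set useful for detecting overfitting.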

2. The construction of claim 1, wherein:

wherein the pretreatment comprises de-duplication and de-noising, and discarding sequences with amino acid lengths exceeding 500.

3. Construction according to claim 1 or 2, characterized in that:

wherein the generating network construction includes an autoencoder construction and a generator construction,

the autoencoder is constructed as follows: an encoder and a decoder are built from Transformer modules, each using four network layers, with a multi-head attention mechanism applied in between;

the generator is a neural network built from three fully connected layers; it takes noise conforming to a Gaussian distribution as input, transforms it through the computation of several hidden layers, using a KL divergence loss, into a vector conforming to a normal distribution, and passes the vector to the decoder, which decodes it into the probability of each amino acid at each position; the probabilities are finally converted into a new amino acid sequence.

4. A construction according to claim 3, wherein:

wherein the discrimination network discriminates whether the amino acid sequence generated by the generation network is reasonable, and preferably, the discrimination network is a 3-layer MLP model.

5. The construction of claim 4, wherein:

the discrimination network receives the training data set and the amino acid sequence generated by the generation network, and learns the difference between the real amino acid sequence and the generated amino acid sequence in the training data set by using binary cross entropy as a loss function and calculating a plurality of hidden layers so as to judge whether the received amino acid sequence is the real amino acid sequence.
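For reference, the binary cross entropy named in claim 5 can be written out directly. The sketch below is a plain-Python illustration of the loss only, not of the discrimination network itself; the function name, the clipping constant and the example probabilities are assumptions. Labels are 1 for real sequences and 0 for generated ones.

```python
import math


def binary_cross_entropy(predicted, labels, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    if len(predicted) != len(labels) or not predicted:
        raise ValueError("predicted and labels must be non-empty and equal in length")
    total = 0.0
    for p, y in zip(predicted, labels):
        # Clip probabilities so that log(0) never occurs.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(predicted)


# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
undecided = binary_cross_entropy([0.5, 0.5], [1, 0])
# A more confident, correct discriminator gets a lower loss.
confident = binary_cross_entropy([0.9, 0.1], [1, 0])
print(undecided, confident)
```

An undecided discriminator scores ln 2 per example, and the loss falls toward zero as its predictions for real and generated sequences separate.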

6. The construction according to claim 5, wherein:

wherein a plurality of loss functions are optimized simultaneously during the training, and the corresponding hyperparameters are adjusted,

preferably, the hyperparameters are adjusted to a learning rate of 1e-4, a dropout rate of 0.1 and a batch size of 8.

7. A construction according to claim 3, wherein:

the verification of the adjusted model is as follows: after each round of training, the amino acid sequences generated by the adjusted model are compared against the protein database using BLAST software, and the generation model is obtained when the comparison result is improved 3-fold relative to the comparison result from the initial training.

8. The construction of claim 1, wherein:

wherein the protein of interest is a variant of malate dehydrogenase.

9. A method for obtaining a protein variant, comprising:

randomly generating a number of amino acid sequences using the generation model of any one of claims 1-8;

calculating the similarity score between each generated amino acid sequence and the sequence library of the protein database using BLAST software, and selecting the amino acid sequences with the top 100 BLAST scores;

10. The obtaining method according to claim 9, characterized in that:

wherein, when the protein variant is a malate dehydrogenase variant, preferably, a malate dehydrogenase variant whose enzyme activity is at least 1-fold that of the wild type is the desired malate dehydrogenase variant.
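The selection steps of claims 9 and 10 can be sketched as a ranking followed by a threshold. The sketch assumes that the BLAST scores have already been parsed into a dictionary keyed by sequence identifier and that the measured activities are available in the same form; the function names, identifiers and numbers are hypothetical.

```python
def select_top_candidates(blast_scores, top_n=100):
    """Return the identifiers of the top_n sequences by BLAST score."""
    ranked = sorted(blast_scores.items(), key=lambda item: item[1], reverse=True)
    return [seq_id for seq_id, _ in ranked[:top_n]]


def desired_variants(activities, wild_type_activity, min_fold=1.0):
    """Keep variants whose activity is at least min_fold times the wild type."""
    threshold = min_fold * wild_type_activity
    return sorted(seq_id for seq_id, activity in activities.items()
                  if activity >= threshold)


# Toy data: 250 generated sequences with made-up scores.
toy_scores = {f"gen_{i:03d}": float(i) for i in range(250)}
top_100 = select_top_candidates(toy_scores)

# Made-up activities (U/mL) for three candidates and the wild type.
toy_activities = {"gen_249": 12.0, "gen_248": 3.5, "gen_247": 4.0}
kept = desired_variants(toy_activities, wild_type_activity=4.0)
print(len(top_100), kept)
```

Ranking by score before any wet-lab work keeps the number of sequences that must be expressed and assayed to a fixed, manageable size.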