CN116230074A - Protein structure prediction method, model training method, device, equipment and medium - Google Patents


Info

Publication number
CN116230074A
CN116230074A (application CN202211606821.0A)
Authority
CN
China
Prior art keywords
feature vector
protein
training
amino acid
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211606821.0A
Other languages
Chinese (zh)
Inventor
熊袁鹏
刘子敬
幺宝刚
Current Assignee
International Digital Economy Academy IDEA
Original Assignee
International Digital Economy Academy IDEA
Priority date
Filing date
Publication date
Application filed by International Digital Economy Academy IDEA filed Critical International Digital Economy Academy IDEA
Priority to CN202211606821.0A
Publication of CN116230074A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a protein structure prediction method, a model training method, a device, equipment and a medium, relating to the technical fields of biological information, deep learning and computer applications. The training method of the protein structure prediction model comprises the following steps: acquiring a training dataset comprising known protein sequences and the physicochemical properties of amino acid residues; generating a first feature vector containing protein sequence information according to the protein sequence; clustering the physicochemical properties of the amino acid residues, and generating a second feature vector containing physicochemical information of the amino acid residues according to the clustered physicochemical properties; and training the feature extraction network and the structure prediction network with the first feature vector and the second feature vector to obtain a protein structure prediction model. With the training method of the protein structure prediction model provided by the invention, no complex model is required to extract the input features, the calculation speed is high, and the training time of the protein structure prediction model is effectively saved.

Description

Protein structure prediction method, model training method, device, equipment and medium
Technical Field
The invention relates to the technical fields of biological information, deep learning and computer application, in particular to a protein structure prediction method, a model training method, a device, equipment and a medium.
Background
Proteins are essentially one or more chains of amino acids that fold into a specific three-dimensional structure, and it is this structure that confers a specific function. Although experimental means such as single-particle cryo-electron microscopy, X-ray crystallography and nuclear magnetic resonance can accurately measure the three-dimensional structure of a protein and obtain its spatial information in the native state, these experimental techniques suffer from high cost, long turnaround times and other drawbacks.
In recent years, artificial intelligence technology and theory have made great progress and are widely applied in the biopharmaceutical field, and a batch of methods for predicting the three-dimensional structure of proteins have emerged. A trained deep neural network can predict the properties of a protein from its amino acid sequence, based primarily on the distances between pairs of amino acids and the angles between the chemical bonds connecting them. From this known information, the angle and distance information of the folded protein can be deduced, and thus the structure of the whole protein. Current protein structure prediction methods can mainly be divided into: (1) directly predicting the atomic positions of all amino acids in the protein to recover the three-dimensional structure information; among these, direct prediction via multiple sequence alignment has the longest history and is the more effective scheme in the prior art. For example, the AlphaFold2 tool developed by the UK artificial intelligence company DeepMind can predict the structure of general proteins with accuracy comparable to experiments. The technology mainly completes the prediction of the protein structure by introducing multiple-sequence-alignment information and a complex information-interaction mechanism. (2) Predicting the inter-atomic angles and distances on the amino acids, and then obtaining the three-dimensional structure of the protein through a complex "energy minimization" optimization method. The most typical schemes of this kind include IgFold, DeepAb and RoseTTAFold.
However, existing protein structure prediction methods often process only the raw sequence information and ignore the physicochemical properties of the amino acids, so a more complex model is required for feature extraction; the complex model incurs a large computational cost, which prolongs both the training time of the protein structure prediction model and the prediction time of the protein structure.
Accordingly, the prior art is still in need of improvement and development.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a protein structure prediction method, a model training method, a device, equipment and a medium, to solve the problem that existing protein structure prediction methods ignore the physicochemical properties of amino acids and therefore require a more complex model for feature extraction.
The technical scheme of the invention is as follows:
in a first aspect of the present invention, there is provided a training method of a protein structure prediction model, wherein the protein structure prediction model includes a feature extraction network and a structure prediction network, the training method of the protein structure prediction model including the steps of:
acquiring a training dataset comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
generating a first feature vector containing protein sequence information according to the protein sequence;
clustering the physicochemical properties of the amino acid residues to obtain the clustered physicochemical properties of the amino acid residues, and generating a second feature vector containing the physicochemical information of the amino acid residues according to the clustered physicochemical properties;
And training the feature extraction network and the structure prediction network by using the first feature vector and the second feature vector to obtain the protein structure prediction model.
Optionally, the training dataset further comprises a position-coding sequence acquired based on the protein sequence. The steps of generating a first feature vector containing protein sequence information according to the protein sequence, and of clustering the physicochemical properties of the amino acid residues and generating a second feature vector containing physicochemical information of the amino acid residues according to the clustered physicochemical properties, specifically comprise the steps of:
converting the position coding sequence into a position coding feature vector;
coding amino acids in the protein sequence to obtain a sequence feature vector, and splicing the sequence feature vector and the position coding feature vector to obtain a first feature vector;
clustering the physicochemical properties of the amino acid residues to obtain a plurality of representative amino acid residue physicochemical properties capable of representing each amino acid;
encoding the amino acids in the protein sequence according to the plurality of representative amino acid residue physicochemical properties to obtain a physicochemical property feature vector;
And splicing the physicochemical property feature vector with the position-coding feature vector to obtain the second feature vector.
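Illustratively, the generation of the second feature vector can be sketched as follows (a toy NumPy illustration; the property table, its values and the function names are illustrative assumptions, not the real clustered Peptides-library data):

```python
import numpy as np

# Hypothetical toy property table: hydropathy, net charge, molecular weight
# per amino acid (illustrative values, not the real representative properties).
PROPS = {
    "A": [1.8, 0.0, 89.1],
    "R": [-4.5, 1.0, 174.2],
    "N": [-3.5, 0.0, 132.1],
}

def physchem_features(seq):
    # Encode each residue by its representative physicochemical properties.
    return np.array([PROPS[aa] for aa in seq], dtype=float)       # shape (L, k)

def second_feature_vector(seq, pos_feat):
    # Splice the physicochemical property feature vector with the
    # position-coding feature vector to obtain the second feature vector.
    return np.concatenate([physchem_features(seq), pos_feat], axis=1)
```

For a three-residue sequence with a 4-dimensional position-coding feature, the second feature vector then has shape (3, 3 + 4).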
Optionally, the step of encoding the amino acids in the protein sequence to obtain a sequence feature vector specifically includes:
providing a pre-training model;
inputting the protein sequence into the pre-training model, and training the pre-training model;
inputting the protein sequence into a trained pre-training model, and encoding amino acids in the protein sequence to obtain a sequence feature vector.
Optionally, the method for training the pre-training model includes the steps of:
randomly masking one amino acid in the protein sequence as input, and adjusting the weights of the pre-training model through the backpropagation algorithm until the pre-training model can recover the masked amino acid in the protein sequence.
Optionally, when the protein is multi-chain, the protein sequence in the training dataset is the protein sequence obtained by splicing the protein sequences of each sub-chain, and the position-coding sequence in the training dataset is the position-coding sequence obtained by splicing the position-coding sequences of each sub-chain.
Optionally, before training the feature extraction network and the structure prediction network by using the first feature vector and the second feature vector, the method further includes the step of:
carrying out relative position coding on the different sub-chains of the protein and splicing to obtain a third feature vector.
Optionally, the training dataset further comprises the real three-dimensional coordinates of all atoms in the protein sequence; these are the three-dimensional coordinate sequence obtained by splicing the real three-dimensional coordinate sequences of all atoms of each sub-chain;
the step of training the feature extraction network and the structure prediction network to obtain the protein structure prediction model specifically comprises the following steps:
inputting the first feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed first feature vector;
inputting the second feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed second feature vector;
fusing the processed first feature vector and the processed second feature vector through a cross attention network to obtain a fourth feature vector;
Inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in a protein sequence;
obtaining a loss value according to the predicted three-dimensional coordinates of all atoms in the protein sequence and the real three-dimensional coordinates of all atoms in the protein sequence;
and adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value until the loss value converges to a preset value, so as to obtain the protein structure prediction model.
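The loss computation in the steps above can be illustrated as follows (a sketch assuming a per-atom squared-error loss restricted by the atom mask M; the exact loss function is not named here):

```python
import numpy as np

def coordinate_loss(pred, true, mask):
    # pred, true: (L, 37, 3) predicted and real atom coordinates;
    # mask: (L, 37), 1 where the virtual atom actually exists.
    sq = ((pred - true) ** 2).sum(axis=-1)        # per-atom squared error
    return float((sq * mask).sum() / mask.sum())  # mean over existing atoms
```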
Optionally, the step of fusing the processed first feature vector and the processed second feature vector through a cross-attention network to obtain a fourth feature vector specifically includes:
inputting the processed first feature vector and the processed second feature vector into a cross attention network, performing matrix multiplication calculation first, and then performing softmax calculation to obtain a result vector;
performing matrix multiplication calculation on the processed first feature vector and a result vector obtained through softmax calculation to obtain a feature A;
performing matrix multiplication calculation on the processed second feature vector and a result vector obtained through softmax calculation to obtain a feature B;
And splicing the feature A and the feature B to obtain the fourth feature vector.
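The fusion steps above can be sketched as follows (a simplified NumPy illustration that folds away the learned projections of the real cross-attention network, which contains trainable parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(x1, x2):
    # Matrix multiplication first, then softmax, giving one result matrix
    # that is applied to both inputs; the outputs are spliced.
    attn = softmax(x1 @ x2.T, axis=-1)                 # (L, L)
    feat_a = attn @ x1                                 # feature A, (L, d)
    feat_b = attn @ x2                                 # feature B, (L, d)
    return np.concatenate([feat_a, feat_b], axis=-1)   # fourth feature vector
```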
Optionally, the step of inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain the predicted three-dimensional coordinates of all atoms in the protein specifically includes:
inputting the third feature vector and the fourth feature vector into a first fully-connected network to obtain the predicted three-dimensional coordinates of the core heavy atoms;
obtaining the predicted three-dimensional coordinates of the oxygen atoms from the predicted three-dimensional coordinates of the core heavy atoms;
And inputting the third feature vector and the fourth feature vector into a second fully-connected network to obtain the predicted three-dimensional coordinates of the atoms other than the core heavy atoms and the oxygen atoms.
In a second aspect of the present invention, there is provided a method for predicting a protein structure, comprising the steps of:
inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into a protein structure prediction model trained by the training method disclosed by the invention to obtain the structure of the protein to be detected.
In a third aspect of the present invention, there is provided a training apparatus for a protein structure prediction model, wherein the protein structure prediction model includes a feature extraction network and a structure prediction network, the training apparatus for a protein structure prediction model including:
A data acquisition unit for acquiring a training data set comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
a data processing unit, configured to generate a first feature vector containing protein sequence information according to the protein sequence, and to cluster the physicochemical properties of the amino acid residues and generate a second feature vector containing the physicochemical information of the amino acid residues according to the clustered physicochemical properties;
and the model training unit is used for training the feature extraction network and the structure prediction network according to the first feature vector and the second feature vector to obtain the protein structure prediction model.
In a fourth aspect of the present invention, there is provided a protein structure prediction apparatus comprising:
the structure prediction unit is used for inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into the protein structure prediction model obtained by training by adopting the training device disclosed by the invention, so as to obtain the structure of the protein to be detected.
In a fifth aspect of the present invention, a computer readable storage medium is provided, wherein the computer readable storage medium stores a computer program which, when executed by a processor, implements the training method of the present invention as described above or implements the prediction method of the present invention as described above.
In a sixth aspect of the present invention, an electronic device is provided, comprising a memory and a processor, said memory having stored thereon a computer program executable on said processor, said computer program, when executed by said processor, implementing the training method of the present invention as described above or implementing the prediction method of the present invention as described above.
The beneficial effects are that: the invention fully considers the physicochemical properties of amino acid residues, extracts representative physicochemical properties by clustering a plurality of physicochemical properties of the amino acid residues, and uses these properties, together with the features containing protein sequence information, as the input of the protein structure prediction model during training. The training method of the protein structure prediction model provided by the invention therefore has comprehensive input features, does not need a complex model to extract the input features, has a high calculation speed, and effectively saves the training time of the protein structure prediction model.
Drawings
FIG. 1 is a flow chart of a training method of a protein structure prediction model according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a protein structure file according to an embodiment of the present invention.
FIG. 3 is a flow chart of a method for predicting protein structure according to an embodiment of the invention.
Fig. 4 is a training schematic diagram of a pre-training model according to an embodiment of the present invention.
Fig. 5 (a) is a schematic diagram of relative position coding in an embodiment of the present invention, and (b) is a schematic diagram of multi-chain fusion in an embodiment of the present invention.
FIG. 6 is a flow chart of a method for training a protein structure prediction model according to another embodiment of the present invention.
Fig. 7 (a) is a schematic structural diagram of a feature extraction network module according to an embodiment of the invention; (b) is a schematic diagram of a principle of a bidirectional cross-attention mechanism.
FIG. 8 is a flow chart of a method for predicting protein structure according to another embodiment of the invention.
FIG. 9 is a block diagram of a training device for protein structure prediction model in an embodiment of the present invention.
FIG. 10 is a block diagram showing the structure of a protein structure prediction apparatus according to an embodiment of the present invention.
FIG. 11 is a schematic diagram showing the three-dimensional structure of a protein predicted in an embodiment of the present invention.
FIG. 12 shows the protein structure prediction results of the current mainstream protein structure prediction algorithms OmegaFold and AlphaFold2 as reported in their papers.
Detailed Description
The invention provides a protein structure prediction method, a model training method, a device, equipment and a medium, and the invention is further described in detail below in order to make the purposes, the technical schemes and the effects of the invention clearer and more definite. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As used herein, the term "comprising" and its variants are to be interpreted as the open-ended term "including but not limited to". The term "based on" is to be interpreted as "based at least in part on". The terms "first", "second", and the like may refer to different objects or to the same object. As used herein, a "network" or "neural network" is capable of processing inputs and providing corresponding outputs; each layer of the neural network includes one or more nodes (also referred to as processing nodes or neurons), and each node processes inputs from the previous layer. The terms "neural network", "network" and "neural network model" are used interchangeably herein; "physicochemical properties of amino acids" and "physicochemical properties of amino acid residues" are used interchangeably; and the terms "vector", "feature vector" and "matrix" may also be referred to as, or understood as, tensors.
The embodiment of the invention provides a training method of a protein structure prediction model, wherein as shown in fig. 1, the protein structure prediction model comprises a feature extraction network and a structure prediction network, and the training method of the protein structure prediction model comprises the following steps:
s11, acquiring a training data set, wherein the training data set comprises a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
s12, generating a first eigenvector containing protein sequence information according to the protein sequence;
s13, clustering the physicochemical properties of the amino acid residues to obtain the physicochemical properties of the clustered amino acid residues, and generating a second eigenvector containing the physicochemical information of the amino acid residues according to the physicochemical properties of the clustered amino acid residues;
s14, training the feature extraction network and the structure prediction network by using the first feature vector and the second feature vector to obtain the protein structure prediction model.
In the embodiment of the invention, the physicochemical properties of amino acid residues are fully considered (when amino acids form a peptide bond, the amino group of one and the carboxyl group of another are dehydrated; the part of each amino acid that remains after participating in peptide bond formation is called the amino acid residue). The physicochemical properties of amino acid residues are numerous and largely redundant, so only a few representative properties need to be selected to describe each amino acid. The physicochemical properties are therefore first clustered, the representative properties are extracted, and these are then used together with the features containing protein sequence information as the input of the protein structure prediction model for training. The training method of the protein structure prediction model provided by the invention thus has comprehensive input features, does not need a complex model to extract them, has a high calculation speed, effectively saves the training time of the protein structure prediction model, and further saves the prediction time of the protein structure.
In step S11, the known protein sequences may be obtained from public or self-constructed protein structure or antibody structure databases; for example, complete protein structure files can be downloaded from the US public websites RCSB PDB or SAbDab, and the physicochemical properties of the amino acid residues in the protein sequence may be obtained from the Peptides library in the R language. A typical protein structure file is shown in fig. 2, in which the column numbered 1 is the atom number, the column numbered 2 is the amino acid type, the column numbered 3 is the amino acid number, the column numbered 4 is the three-dimensional coordinates of the atom, and the column numbered 5 is the atom type.
Wherein the protein sequence S consists of L amino acids s, i.e. S = {s_1, s_2, ..., s_L}. If the protein is multi-chain, the sequences of each sub-chain need to be spliced.
The obtained training dataset not only comprises the known protein sequence S and the physicochemical properties of the amino acid residues in the protein sequence; it can also include more protein information according to actual needs, such as the real three-dimensional coordinate information C of all atoms and the amino acid type information A.
In the present invention, all amino acids are represented by default with 37 virtual atom types, the information of atoms that do not exist is set to zero, and whether each virtual atom position exists in a given amino acid s is recorded by separate mask information M. The real atoms of each amino acid are prior art; reference is made, for example, to reference 1. In addition, the training dataset may further comprise a position-coding array PE acquired based on the protein sequence. Thus, the training dataset may comprise the physicochemical properties of the amino acid residues in the protein sequence and a set of data consisting of S, PE, M, C and A, which may be denoted {S, PE, M, C, A}. Each item of data in the training dataset is used in the specific steps below.
The more data used for deep learning, the better the model performance. Thus, when the protein is multi-chain, each item of {S, PE, M, C, A} in the training dataset is spliced to augment the training dataset and thereby enhance model performance. For example, the protein sequence is obtained by splicing the protein sequences of each sub-chain; the position-coding sequence is obtained by splicing the position-coding sequences of each sub-chain; and the real three-dimensional coordinates of all atoms in the protein sequence are obtained by splicing the real three-dimensional coordinate sequences of all atoms of each sub-chain.
Illustratively, for double-stranded proteins, the training dataset may be augmented with the following strategy:
for a protein (e.g., diabody) that contains two sub-chains [ H, L ], the respective training data set for each sub-chain is { S_H, PE_H, M_H, C_H, A_H }, { S_L, PE_L, M_L, C_L, A_L }. Wherein S_H represents the protein sequence of the H chain, S_L represents the protein sequence of the L chain, PE_H represents the position-encoded sequence of the H chain, PE_L represents the position-encoded sequence of the L chain, and the meanings of the remaining symbols are the same.
When the training datasets of the two chains are spliced, there are two ordering strategies: H chain first or L chain first. Swapping the order effectively amounts to data augmentation, since both the H-before-L and the L-before-H concatenations are valid structural data. Therefore, both splicing orders are used in the invention, so that double the data is constructed from double-chain proteins to enhance model performance.
The manner of splicing the data for each sub-chain of the double-stranded protein will be described in detail below using the protein sequence S in the training data set as an example.
For a protein comprising two sub-chains [H, L], suppose the H chain consists of M amino acids, with protein sequence S_H = {s_1_H, s_2_H, ..., s_M_H}, and the L chain consists of N amino acids, with protein sequence S_L = {s_1_L, s_2_L, ..., s_N_L}. Then the protein sequences obtained from the protein with the two sub-chains [H, L] are S = {s_1_H, s_2_H, ..., s_M_H, s_1_L, s_2_L, ..., s_N_L} and S = {s_1_L, s_2_L, ..., s_N_L, s_1_H, s_2_H, ..., s_M_H}.
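The two splicing orders can be expressed as follows (a minimal illustration):

```python
def splice_chains(h_seq, l_seq):
    # Both concatenation orders are valid structural data, so each
    # double-chain protein yields two training examples.
    return [h_seq + l_seq, l_seq + h_seq]
```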
Most proteins in nature, particularly antibodies, often have multi-chain structures, whereas prior-art methods of protein structure prediction are often only applicable to single-chain proteins, which is a disadvantage for predicting the many multi-chain proteins in nature. Thus, the training dataset further comprises a position-coding sequence acquired based on the protein sequence. In step S12, in some embodiments, the step of generating the first feature vector containing protein sequence information according to the protein sequence specifically includes:
S121, converting the position coding sequence into a position coding feature vector;
s122, coding amino acids in the protein sequence to obtain a sequence feature vector, and splicing the sequence feature vector and the position coding feature vector to obtain a first feature vector.
In this embodiment, the sequence feature vector and the position-coding feature vector are spliced, so that the model can distinguish single-chain information and multi-chain information, and the prediction of the multi-chain protein structure is facilitated by the protein structure prediction model.
In order to enable the model to distinguish single-chain from multi-chain information, independent position-coding information is introduced in step S121, i.e. the position codes of each sub-chain of the protein are calculated independently and spliced together. For example, for a double-chain protein of total length L, the sub-chain of length L_1 has position code [1, 2, 3, ..., L_1] and the sub-chain of length L_2 has position code [1, 2, 3, ..., L_2]; the final position-coding arrays (PE) are then [1, 2, 3, ..., L_1, 1, 2, 3, ..., L_2] and [1, 2, 3, ..., L_2, 1, 2, 3, ..., L_1]. Further, the position-coding sequence (PE) is converted into a position-coding feature vector of dimension L×n, where L is the total length of the protein (the total number of amino acids) and n has no physical meaning and is set according to actual needs.
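The per-chain position codes described above can be generated as follows (a minimal sketch):

```python
def per_chain_positions(chain_lengths):
    # Position codes restart at 1 for each sub-chain, e.g. [1..L1, 1..L2].
    pe = []
    for n in chain_lengths:
        pe.extend(range(1, n + 1))
    return pe
```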
The position-coding sequence (PE) may be converted into a position-coding feature vector of dimension L×n via operations including, but not limited to, the following pseudo-code (a PyTorch-style sinusoidal encoding, where position is a (length, 1) tensor holding the per-chain position codes, length is the total protein length, and d_model corresponds to n):

import math
import torch

pe = torch.zeros(length, d_model)
div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                     * -(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position.float() * div_term)
pe[:, 1::2] = torch.cos(position.float() * div_term)
After this operation, with d_model set to 128, the position-coding sequence (PE) is converted into a position-coding feature vector of dimension L×128.
In step S122, in one embodiment, the step of encoding the amino acids in the protein sequence to obtain a sequence feature vector specifically includes:
s1220, providing a pre-training model;
s1221, inputting the protein sequence into the pre-training model, and training the pre-training model;
S1222, inputting the protein sequence into the trained pre-training model, and encoding the amino acids in the protein sequence to obtain a sequence feature vector of dimension L×m, where L is the length of the protein sequence and m has no physical meaning and varies with the chosen pre-training model.
In this embodiment, using a single-sequence pre-training model instead of a sequence-search scheme saves training time for the protein structure model. Specifically, the pre-training model captures information on the protein sequence that is helpful for structure prediction, such as similarity and evolutionary information, without relying on the highly time-consuming multiple sequence alignment scheme, thereby further reducing training time and improving training efficiency.
In steps S1220-S1221, the pre-training model is a neural network model obtained by self-supervised training on a large number of protein or antibody sequences. A common training objective for such a model is to predict the current amino acid from its context. Specifically, for the protein sequences S, an amino acid in each sequence is randomly masked as input, and the model is required to recover the masked amino acid; the weights in the amino acid coding module are adjusted by gradient backpropagation to obtain the optimal result. As shown in fig. 4 (where the letters V, S, E, Q represent amino acids), the complete training process is a prediction from the lower sequence to the upper sequence: amino acid E in the lower protein sequence is masked, and the pre-training model (amino acid coding module) must recover amino acid E (see the upper sequence).
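The masked-prediction training pair of fig. 4 can be illustrated as follows (a toy sketch; the mask token and function name are ours, and the real pre-training model recovers the residue with a neural network rather than returning the answer directly):

```python
def mask_sequence(seq, pos, mask_token="#"):
    # build a self-supervised training pair: masked input -> amino acid to recover
    masked = seq[:pos] + mask_token + seq[pos + 1:]
    return masked, seq[pos]

# mask amino acid E in the sequence V-S-E-Q, as in fig. 4
x, y = mask_sequence("VSEQ", 2)
print(x, y)  # → VS#Q E
```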
The pre-training model can then be used to encode the amino acids on a protein sequence; in step S1222, encoding requires only a single feed-forward pass through the pre-training model.
The invention is not limited to a particular pre-training model; as examples, ESM-1b, ESM-3b, ESM-15b, AntiBERTy, etc. may be used. ESM-1b has good generalization capability and can be applied to any sequence; its basic framework is the Transformer, a model now common in the field of natural language processing (NLP). The value of m differs among pre-training models: with ESM-1b, a sequence feature vector of dimension L×1280 is obtained; with ESM-3b or ESM-15b, a sequence feature vector of dimension L×2560; with AntiBERTy, a sequence feature vector of dimension L×512. For convenience of description, the sequence feature vector of dimension L×1280 obtained with ESM-1b is used as the example below.
In step S122, the operation of splicing the sequence feature vector and the position-coding feature vector is performed with the torch framework. Specifically, the splicing operation may be written as torch.cat([S, PE], dim=-1), which is independent of the pre-training model.
In some specific embodiments, when the dimension of the sequence feature vector is L×1280 and the dimension of the position-coding feature vector is L×128, splicing the two yields a first feature vector of dimension L×1408.
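The dimensions work out as follows (NumPy is used here purely for illustration; the patent performs the same splice with torch.cat([S, PE], dim=-1)):

```python
import numpy as np

L = 5                        # protein length
S = np.zeros((L, 1280))      # sequence feature vector from ESM-1b
PE = np.zeros((L, 128))      # position-coding feature vector
first = np.concatenate([S, PE], axis=-1)
print(first.shape)  # → (5, 1408)
```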
As noted above, most proteins in nature, particularly antibodies, have multi-chain structures, and the protein prediction model should be able to predict them. Therefore, in step S13, in some embodiments, as shown in fig. 3, the step of clustering the physicochemical properties of the amino acid residues to obtain the clustered physicochemical properties, and generating the second feature vector containing amino acid residue physicochemical information from them, specifically includes:
S131, clustering the physicochemical properties of the amino acid residues to obtain a plurality of representative physicochemical properties capable of characterizing each amino acid;
S132, encoding the amino acids in the protein sequence according to these representative physicochemical properties to obtain a physicochemical property feature vector;
S133, splicing the physicochemical property feature vector with the position-coding feature vector to obtain the second feature vector.
Because amino acid residues have numerous physicochemical properties, most of which are redundant, it suffices to select the most representative ones to describe an amino acid. Therefore, in step S131, the physicochemical properties of the amino acid residues are clustered (this may be implemented in a residue property clustering module) to obtain a plurality of representative properties capable of characterizing each amino acid. In some embodiments, the properties are clustered with any clustering algorithm, and the six most representative ones are selected as the feature representation of each amino acid. Specifically, the six physicochemical properties are: hydrophobicity (H1), side-chain volume (V), polarity (P1), pH at the isoelectric point (pI), the negative logarithm of the dissociation constant of the -COOH group (pKa), and the net charge index of the side chain (NCI).
As an example, the clustering algorithm is a K-means clustering algorithm, a spectral clustering algorithm, a hierarchical clustering algorithm, etc., but is not limited thereto.
In step S132, each physicochemical property of each amino acid is a real number; therefore, encoding a protein sequence of length L (i.e., mapping the amino acid characters to their physicochemical properties) yields a physicochemical property feature vector of dimension L×6.
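The character-to-property mapping can be sketched as follows (the numeric values here are illustrative placeholders, not the actual clustered property values):

```python
# six properties per residue: H1, V, P1, pI, pKa, NCI (values illustrative only)
PROPS = {
    "A": [0.62, 88.6, 8.1, 6.00, 2.34, 0.007],
    "G": [0.48, 60.1, 9.0, 5.97, 2.34, 0.179],
}

def encode_physchem(seq):
    # maps a length-L sequence to an L x 6 physicochemical feature matrix
    return [PROPS[aa] for aa in seq]

m = encode_physchem("AGA")
print(len(m), len(m[0]))  # → 3 6
```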
In step S133, when a position-coding feature vector of dimension L×128 is used, splicing the physicochemical property feature vector with the position-coding feature vector yields a second feature vector of dimension L×134. For the splicing method, refer to the splicing of the sequence feature vector and the position-coding feature vector above.
When the protein is multi-chain, the model should obtain the relative positions of the amino acids on the whole chain more accurately (such relative position information often determines the three-dimensional structure of the amino acids on the protein), so that it can more easily distinguish whether amino acids come from the same chain and determine their positions on a specific chain, further strengthening its handling of multi-chain proteins. Accordingly, in some embodiments, training the feature extraction network and the structure prediction network with the first feature vector and the second feature vector further comprises the step of:
S140, performing relative position encoding on the different sub-chains of the protein and splicing to obtain a third feature vector.
In this embodiment, the different sub-chains of the protein are relatively position-encoded to form a two-dimensional matrix, each chain having its own independent relative-position encoding; this may be accomplished by techniques including, but not limited to, pseudo-code operations. Illustratively, fig. 5(a) shows the relative position encoding of a protein sequence of length 6, which is in fact a 6×6 matrix. When the input protein sequence has two sub-chains of lengths L1 and L2 (total length L), the corresponding relative-position encoding results are spliced in the manner shown in fig. 5(b): the dark block is the relative-position encoding of the sub-chain of length L1, the light block is that of the sub-chain of length L2, and the blank white blocks are two-dimensional matrices whose elements are all 0. That is, the position-coding matrices of the sub-chains of lengths L1 and L2 lie on the diagonal of the spliced matrix (dimension L×L), further strengthening the model's handling of multi-chain proteins. After this multi-chain fusion operation, a third feature vector of dimension L×L×1 is obtained.
By way of example, the pseudocode (for a single sub-chain of length L) is:

embed = []
for i in range(L):
    a = list(range(-i, 0))
    b = list(range(L - i))
    embed.append(a + b)  # row i holds the relative offsets j - i for each column j
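The pseudocode above, together with the fig. 5(b) splicing, can be combined into one small sketch (function names are ours; cross-chain entries stay zero, as described):

```python
def rel_pos(L):
    # relative position encoding of a single chain: entry (i, j) = j - i
    return [[j - i for j in range(L)] for i in range(L)]

def splice_chains(chain_lengths):
    # place each chain's matrix on the diagonal; off-diagonal blocks remain 0
    total = sum(chain_lengths)
    M = [[0] * total for _ in range(total)]
    offset = 0
    for Lc in chain_lengths:
        block = rel_pos(Lc)
        for i in range(Lc):
            for j in range(Lc):
                M[offset + i][offset + j] = block[i][j]
        offset += Lc
    return M

M = splice_chains([2, 3])
print(M[0][1], M[1][0], M[0][2])  # → 1 -1 0
```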
In step S14, the true three-dimensional coordinates C of all atoms in the protein sequences of the training dataset are required. When the protein is multi-chain, the true three-dimensional coordinates of all atoms in the protein sequence are the coordinate sequences obtained by splicing the true coordinate sequences of all atoms of each sub-chain, in the splicing manner described in the example above. In some embodiments, the feature extraction network includes a self-attention network and a cross-attention network, and the structure prediction network is a fully-connected network; as shown in figs. 3 and 6, training the feature extraction network and the structure prediction network to obtain the protein structure prediction model specifically includes:
s141, inputting the first feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed first feature vector;
s142, inputting the second feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed second feature vector;
s143, fusing the processed first feature vector and the processed second feature vector through a cross attention network to obtain a fourth feature vector;
S144, inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in the protein sequence;
s145, obtaining a loss value according to the predicted three-dimensional coordinates of all atoms in the protein sequence and the real three-dimensional coordinates of all atoms in the protein sequence;
and S146, adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value until the loss value converges to a preset value, and obtaining the protein structure prediction model.
In step S141, as shown in fig. 7(a), the first feature vector is input into a self-attention network (a standard self-attention basic module) and processed several times (specifically, 3 or 5 times) through the self-attention mechanism to obtain a processed first feature vector (containing protein sequence information). Its dimension may be L×1280, where 1280 is a freely set parameter that may be adjusted in practice according to final test performance; 512, 1280, 2560, or the like may generally be chosen.
In step S142, likewise as shown in fig. 7(a), the second feature vector is input into a self-attention network and processed several times (specifically, 3 or 5 times) through the self-attention mechanism to obtain a processed second feature vector (containing amino acid residue physicochemical information), with the same dimension options as above.
In step S143, the processed first feature vector and the processed second feature vector are fused by a cross-attention mechanism, so that better performance can be obtained with a lighter model.
The processed first and second feature vectors are fused through a cross-attention network (cross-attention basic module) to obtain a fourth feature vector; its dimension may still be L×1280, with 1280 again a freely set parameter adjustable according to final test performance (512, 1280, or 2560 may generally be chosen). The cross-attention basic module is bidirectional. Illustratively, as shown in fig. 7(b), the processed first feature vector (feature 1 in the figure) and the processed second feature vector (feature 2 in the figure) are matrix-multiplied and passed through softmax to obtain a result vector. The processed first and second feature vectors (features 1/2 in the figure) are then taken alternately as inputs and each matrix-multiplied with this result vector: multiplying the softmax result with the processed first feature vector outputs feature A, and multiplying it with the processed second feature vector outputs feature B. Finally, features A and B are spliced (the A|B operation in the figure) to obtain the fourth feature vector. The fusion of the processed first and second feature vectors through the cross-attention network may also be performed by direct summation or by front-back splicing.
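A minimal NumPy sketch of the bidirectional fusion just described (the real module presumably includes learned projections, which are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_fuse(feat1, feat2):
    # pairing representation: matrix-multiply the two features, then softmax
    weights = softmax(feat1 @ feat2.T)       # (L, L) result vector
    feature_a = weights @ feat1              # result vector x feature 1
    feature_b = weights @ feat2              # result vector x feature 2
    return np.concatenate([feature_a, feature_b], axis=-1)  # A|B splice

L, d = 4, 8
fourth = cross_fuse(np.ones((L, d)), np.ones((L, d)))
print(fourth.shape)  # → (4, 16)
```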
In step S144, in some embodiments, the fully-connected network includes a first fully-connected network and a second fully-connected network, and the step of inputting the third feature vector and the fourth feature vector into the fully-connected network to obtain predicted three-dimensional coordinates of all atoms in the protein specifically includes:
s1441, inputting the third feature vector and the fourth feature vector into a first fully-connected network to obtain a predicted three-dimensional coordinate of a core heavy atom;
s1442, obtaining a predicted three-dimensional coordinate of the oxygen atom according to the predicted three-dimensional coordinate of the core heavy atom;
s1443, inputting the third feature vector and the fourth feature vector into a second full-connection network to obtain predicted three-dimensional coordinates of other atoms except the core heavy atom and the oxygen atom.
In this embodiment, owing to the specificity of protein structure, the four core heavy atoms (the backbone carbon, α-carbon, nitrogen, and β-carbon) belong to the same rotation group, and their three-dimensional coordinates must be predicted separately. Therefore, in step S1441, the third and fourth feature vectors are input into the first fully-connected network to obtain the predicted three-dimensional coordinates of the four core heavy atoms. The oxygen atom is generally determined directly by fixed bond lengths and bond angles: once the coordinates of the four core heavy atoms are determined, the oxygen coordinates follow. Therefore, in step S1442, the predicted three-dimensional coordinates of the oxygen atom are obtained from those of the core heavy atoms. The remaining 32 atoms are called side-chain atoms (as stated above, all amino acids are given 37 virtual atom types by default in the present invention); their three-dimensional coordinates are determined by particular torsion angles and must be predicted separately. Therefore, in step S1443, the third and fourth feature vectors are input into the second fully-connected network to obtain the predicted three-dimensional coordinates of the atoms other than the core heavy atoms and the oxygen atom. The predicted three-dimensional coordinates Ĉ of all atoms are thus obtained, in the format L×37×3, i.e., the three-dimensional coordinates of all 37 virtual atoms of each of the L amino acids.
In step S145, a loss value is obtained from the predicted three-dimensional coordinates Ĉ of all atoms in the protein sequence and the true three-dimensional coordinates C of all atoms in the protein sequence. The loss value measures how close the predicted coordinates are to the true coordinates: the smaller the loss value, the closer they are.
Specifically, the loss value may be calculated with a loss function; the loss function L in the present invention may be obtained by combining the Kabsch alignment loss (KLoss, see reference 1), the frame-aligned point error loss (FAPE, see reference 2), and the torsion angle loss (TorsionLoss), as follows:

L = KLoss + FAPE + TorsionLoss (1)
The inputs to the three loss functions in equation (1) are (C, Ĉ), i.e., the true and predicted three-dimensional coordinates. During calculation, however, the structure model (i.e., the structure prediction network) outputs coordinates Ĉ for all 37 virtual atoms of each amino acid on the protein sequence, whereas each amino acid actually possesses a different number of atoms; the absent atoms are meaningless and should not be used to compute the final loss. The calculation of each loss term constituting the above loss function should therefore be restricted by the atomic mask information M of the amino acids described above, i.e., the predicted coordinates input to the loss function should be M·Ĉ, so that the input to the loss function is (C, M·Ĉ).
In step S146, in some embodiments, as shown in fig. 6, the step of adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network (including the first fully-connected network and the second fully-connected network) according to the loss value until the loss value converges to a preset value, and the step of obtaining the protein structure prediction model specifically includes:
s1461, judging whether the loss value converges to a preset value, if so, stopping training to obtain a protein structure prediction model;
and S1462, if not, adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value.
The parameters (e.g., weights) in the self-attention network, the cross-attention network, and the fully-connected networks are adjusted according to the loss value until the loss value converges to the preset value, yielding the protein structure prediction model. Specifically, the node weights of these networks are randomly initialized; to obtain the optimal weights, the loss function is computed from the predicted three-dimensional coordinates obtained during training and the true three-dimensional coordinates, and the node weights of each neural network are adjusted by gradient backpropagation. The final weights are generally obtained after multiple rounds of backpropagation (typically 300 rounds), yielding the trained protein structure prediction model. Gradients in this step are not backpropagated into the pre-training model: in the present method, the pre-training model is trained first and the protein structure prediction model afterwards, the two trainings being carried out independently.
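Steps S1461-S1462 amount to the following loop (a schematic with a dummy loss function; the real step adjusts the network weights by backpropagation):

```python
def train_until_converged(loss_step, preset, max_rounds=300):
    # S1461: stop training when the loss converges to the preset value
    # S1462: otherwise keep adjusting parameters (here: just take another round)
    loss = float("inf")
    for rnd in range(max_rounds):
        loss = loss_step(rnd)
        if loss <= preset:
            break
    return loss, rnd

loss, rnd = train_until_converged(lambda r: 1.0 / (r + 1), preset=0.1)
print(round(loss, 3), rnd)  # → 0.1 9
```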
In the prior art, protein structure inference time is too long to be applied to large-scale structure prediction tasks, for two main reasons. First, existing methods rely on multiple sequence alignment, whose basic principle is to compare the protein sequence to be predicted against an existing protein sequence library to obtain sequence similarity information. Although the speed of alignment algorithms has been greatly optimized, constructing a multiple sequence alignment remains extremely time-consuming, taking hours or even tens of hours in extreme cases of long input sequences. Second, to give the model stronger predictive capability, existing methods often need to learn information useful for structure prediction through repeated internal information interaction, and such complex models incur substantial computational cost, prolonging prediction time. Based on the above, an embodiment of the invention further provides a protein structure prediction method: an artificial intelligence method for predicting the three-dimensional structure of a protein based on single-sequence features ("single sequence" meaning that only the sequence information is required as input, with no additional inputs) and the physicochemical property information of the amino acid residues on the protein. As shown in fig. 8, the protein structure prediction method includes the steps of:
S15, inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into a protein structure prediction model trained by the training method according to the embodiment of the invention, so as to obtain the structure of the protein to be detected.
In this embodiment, the protein sequence of the structure to be tested and the physicochemical properties of the amino acid residues in the protein sequence need to be preprocessed before being input into the protein structure prediction model, specifically as follows:
A feature vector containing protein sequence information is generated with the trained pre-training model and spliced with the position-coding feature vector to obtain the first feature vector to be tested (for the specific method, refer to the training method of the protein prediction model above; it is not repeated here).
The physicochemical properties of the amino acid residues are clustered to obtain the clustered properties (the six clustered residue physicochemical properties may also be used directly), and a feature vector containing amino acid residue physicochemical information is generated from them and spliced with the position-coding feature vector to obtain the second feature vector to be tested (likewise, refer to the training method above).
When the protein is multi-chain, the different sub-chains of the protein are relatively position-encoded and spliced to obtain the third feature vector to be tested (likewise, refer to the training method above).
And inputting the obtained first feature vector to be detected and the second feature vector to be detected into a trained protein structure prediction model.
In this embodiment, the physicochemical properties of amino acid residues are fully considered: representative residue physicochemical properties are extracted by clustering many such properties, and these, together with features containing protein sequence information, serve as the input to the protein prediction model, increasing the information available to the model. The protein structure prediction method provided by the invention therefore needs no complex model to extract input features and computes quickly, effectively saving the prediction time of the protein prediction model and improving prediction efficiency. In addition, most current protein structure prediction methods have inference times on the order of minutes or even hours, which is unfavorable for large-scale protein structure prediction tasks. In this embodiment, a single-sequence pre-training model replaces the excessively time-consuming multiple sequence alignment scheme, capturing sequence information helpful for structure prediction, such as similarity and evolutionary information, so that protein structure prediction time can be optimized to the second level. The protein structure prediction model provided by this embodiment can also predict multi-chain proteins, so the method applies to both single-chain and multi-chain proteins and has a wider application range.
In summary, the prediction method provided in this embodiment needs no complex model for feature extraction, has short inference time and high prediction efficiency, and can predict multi-chain protein structures.
Furthermore, the present invention can do more than predict the structure of the protein: in some embodiments, a predicted protein structure file is also generated from the amino acid type information A and the output three-dimensional coordinates Ĉ of the above protein sequences, as shown in fig. 2. Specifically, in the predicted protein structure file, each line represents the information of one atom of one amino acid. The column numbered 1 in fig. 2 holds the data line number; the column numbered 2 is the amino acid type, corresponding to the information in A; the column numbered 3 is the amino acid sequence number, corresponding to the subscript of the information in A; the column numbered 4 holds the three-dimensional coordinates of the corresponding atom, taken from Ĉ; the column numbered 5 is the atom type, determined by the information in A, i.e., the true atomic composition of each amino acid. Since each amino acid corresponds to multiple lines of information, atoms from the same amino acid share the same contents in columns 2 and 3. Other columns of protein structure information are filled with 0 values as placeholders.
To make the finally generated protein structure more realistic, in some embodiments, as shown in fig. 8, after inputting the protein sequence of the structure to be tested and the physicochemical properties of its amino acid residues into the protein structure prediction model to obtain the structure of the protein to be tested, the method further includes:
S16, optimizing the obtained structure of the protein to be tested. The specific optimization method is not limited in this embodiment; as examples, OpenMM, PDBFixer, or the like may be used.
The protein structure prediction method provided by the invention is described below with reference to fig. 3. The protein sequence of the structure to be tested is input into the pre-training model, which produces the pre-trained feature representation of the protein sequence, i.e., the sequence feature vector. The physicochemical properties of the amino acid residues in the protein sequence of the structure to be tested are clustered by the residue property clustering module to obtain the six representative residue physicochemical properties, forming a feature vector, i.e., the physicochemical property feature vector. The sequence feature vector and the residue physicochemical property feature vector are each fused with the position-coding feature vector to obtain the first and second feature vectors, which are then input into the trained structure prediction model. The specific process is as follows:
The first and second feature vectors are respectively input into the trained self-attention basic modules (self-attention networks whose basic framework is the Transformer) and then into the cross-attention basic module for more complex feature extraction. That is, the first and second feature vectors are each processed by the self-attention mechanism to obtain the processed first and second feature vectors; then, in the trained cross-attention basic module (not directly illustrated in fig. 3), a sequence feature representation carrying residue physicochemical properties is obtained (the processed first and second feature vectors are matrix-multiplied to form a pairing feature representation) and fused with the features from the self-attention basic modules (the processed first and second feature vectors, used alternately as inputs) to obtain the fourth feature vector (this process corresponds to fig. 7(b)). Meanwhile, the different sub-chains of the protein are relatively position-encoded and spliced to obtain the third feature vector. Finally, the third and fourth feature vectors are input into the structure module (the trained structure prediction network), which integrates them and outputs the predicted protein backbone information (the number of output proteins is determined by the number of input protein sequences); the predicted protein backbone is then optimized and the predicted protein structure is output.
The invention fully considers the physicochemical properties of amino acid residues: representative physicochemical properties are extracted by clustering the physicochemical properties of a plurality of amino acid residues, and these properties, together with features containing protein sequence information, are used as the input information for training the protein prediction model. Therefore, the training method of the protein structure prediction model provided by the invention does not require complex input feature extraction, is fast to compute, and thus effectively shortens the prediction time of the protein structure. Combined with measures such as using a single-sequence pre-training model in place of sequence searching, the method can reduce protein structure prediction time to the level of seconds. In addition, the invention can predict the structure of multi-chain proteins, overcoming the limitation of the prior art, which can only predict single-chain protein structures.
To verify the accuracy of the protein structure prediction method provided by the present invention, the SAbDab antibody database (an antibody database established by the Oxford Protein Informatics Group in the UK under an open-source protocol, which gathers all antibody structure data in the PDB three-dimensional protein structure database) was used as training data for model training, and the widely used Root Mean Square Deviation (RMSD) was used as the evaluation index of the predicted structure. The predicted structure of the antibody with PDB ID 7phu obtained by the prediction method of the invention is shown in FIG. 11, and the results reported in the papers of the current mainstream prediction algorithms OmegaFold and AlphaFold2 are shown in FIG. 12. The reported RMSD of OmegaFold is 1.82 and that of AlphaFold2 is 5.83, whereas the RMSD of the prediction method of the present invention is 0.91, which fully demonstrates the accuracy of the prediction method provided by the present invention.
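RMSD between a predicted and a real structure is conventionally computed after optimal rigid-body superposition of the two coordinate sets, for example with the Kabsch algorithm cited in reference [2]. A minimal numpy sketch (not the patent's own evaluation code):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid alignment.

    Centers both structures, finds the optimal rotation via SVD of the
    covariance matrix (Kabsch algorithm), then measures the residual
    root-mean-square deviation.
    """
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)              # covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # optimal rotation
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

With this metric, a lower value means the predicted atoms lie closer to the experimental ones; an RMSD below 1 Å, as reported above, indicates near-atomic agreement.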
The embodiment of the invention also provides a training device of a protein structure prediction model, wherein as shown in fig. 9, the protein structure prediction model comprises a feature extraction network and a structure prediction network, and the training device of the protein structure prediction model comprises:
A data acquisition unit 1 for acquiring a training data set comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
a data processing unit 2 for generating a first eigenvector containing protein sequence information from the protein sequence; clustering the physicochemical properties of the amino acid residues to obtain the physicochemical properties of the clustered amino acid residues, and generating a second eigenvector containing the physicochemical information of the amino acid residues according to the physicochemical properties of the clustered amino acid residues;
and the model training unit 3 is used for training the feature extraction network and the structure prediction network according to the first feature vector and the second feature vector to obtain the protein structure prediction model.
In some embodiments, the data acquisition unit 1 is specifically configured to:
the protein sequence, the physicochemical properties of the amino acid residues in the protein sequence, the position coding sequence obtained based on the protein sequence, the real three-dimensional coordinate information of the atoms, the mask information (recording whether the virtual atoms at each position of each amino acid exist), and the like are obtained from existing protein structure files and peptide libraries.
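The items acquired above can be grouped into a single training sample. The container below is purely illustrative (field names and shapes are assumptions, not taken from the patent):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """One entry of the training data set described above (illustrative)."""
    sequence: str                   # amino acid sequence of length L
    residue_properties: np.ndarray  # physicochemical properties per residue
    position_codes: np.ndarray      # position coding sequence, shape (L,)
    true_coords: np.ndarray         # real 3-D coordinates of all atoms, (n_atoms, 3)
    atom_mask: np.ndarray           # 1 where the (virtual) atom exists, else 0
```

Each unit of the training device then reads the fields it needs: the data processing unit consumes the sequence, properties, and position codes; the loss computation consumes the true coordinates and mask.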
In some embodiments, the data processing unit 2 is specifically configured to:
converting the position coding sequence into a position coding feature vector;
coding amino acids in the protein sequence to obtain a sequence feature vector, and splicing the sequence feature vector and the position coding feature vector to obtain a first feature vector. Specifically, the protein sequence is input into a pre-training model and the pre-training model is trained; the protein sequence is then input into the trained pre-training model, and the amino acids in the protein sequence are encoded to obtain a feature vector of dimension L×m.
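The splicing of the L×m sequence feature vector with the position coding feature vector can be sketched as below. The patent does not specify the form of the position encoding, so a standard sinusoidal encoding is assumed here purely for illustration; the pre-trained L×m embedding is taken as a given input.

```python
import numpy as np

def position_encoding(L, d):
    """Sinusoidal position coding feature vector of shape (L, d) (assumed form)."""
    pos = np.arange(L)[:, None].astype(float)
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    # even feature indices use sine, odd indices use cosine
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def first_feature_vector(seq_features, d_pos=8):
    """Splice an (L, m) sequence feature vector with an (L, d_pos) position encoding."""
    L = seq_features.shape[0]
    return np.concatenate([seq_features, position_encoding(L, d_pos)], axis=-1)
```

The second feature vector is formed the same way, with the physicochemical property feature vector in place of the sequence feature vector.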
In some embodiments, the data processing unit 2 is specifically configured to:
clustering the physicochemical properties of the amino acid residues to obtain a plurality of representative amino acid residue physicochemical properties capable of characterizing each amino acid; encoding the amino acids in the protein sequence according to these representative physicochemical properties to obtain a physicochemical property feature vector; and splicing the physicochemical property feature vector with the position coding feature vector to obtain a second feature vector.
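The clustering step can be illustrated with a tiny k-means over physicochemical property scales, each scale being a 20-dimensional vector of per-amino-acid values. The choice of 6 clusters follows the description above; the k-means implementation and the residue encoding are illustrative stand-ins, since the patent does not name a specific clustering algorithm.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: returns (centers, labels) for the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def representative_properties(property_table, k=6):
    """property_table: (n_properties, 20) values of each scale over the 20 amino acids.
    Returns (k, 20): k representative property scales (the cluster centers)."""
    centers, _ = kmeans(property_table, k)
    return centers

def encode_sequence(seq, centers, aa_order="ACDEFGHIKLMNPQRSTVWY"):
    """Physicochemical property feature vector: (L, k) representative values per residue."""
    idx = [aa_order.index(a) for a in seq]
    return centers[:, idx].T
```

Each residue is thus described by its values under the 6 representative scales rather than by the full redundant property set.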
In some embodiments, the data processing unit 2 is further configured to:
and (3) carrying out relative position coding on different sub-chains of the protein, and splicing to obtain a third feature vector.
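One common way to realize relative position coding across sub-chains is to offset each chain's residue indices by a large constant, so that inter-chain residue pairs always fall outside the intra-chain relative-distance range. The patent does not fix the exact scheme, so the gap value below is an illustrative assumption:

```python
def spliced_position_codes(chain_lengths, chain_gap=200):
    """Residue position codes for a spliced multi-chain input.

    A jump of `chain_gap` between consecutive sub-chains marks the chain
    break, so relative offsets across chains remain distinguishable from
    offsets within a chain (chain_gap is an assumed value).
    """
    codes, offset = [], 0
    for length in chain_lengths:
        codes.extend(range(offset, offset + length))
        offset += length + chain_gap
    return codes
```

Splicing these codes over all sub-chains, and embedding them, yields the third feature vector that accompanies the fourth feature vector into the structure prediction network.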
In some embodiments, when the protein is multi-chain, the data processing unit 2 is further configured to:
and splicing each data in the training data set.
For example, the protein sequences of the sub-chains are spliced; the position coding sequences of the sub-chains are spliced; the real three-dimensional coordinate sequences of all atoms of the sub-chains are spliced, and so on. The splicing manner is as in the above examples and will not be described again here.
In some embodiments, the model training unit 3 is specifically configured to:
inputting the first feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed first feature vector;
inputting the second feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed second feature vector;
fusing the processed first feature vector and the processed second feature vector through a cross attention network to obtain a fourth feature vector;
inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in a protein sequence;
obtaining a loss value according to the predicted three-dimensional coordinates of all atoms in the protein sequence and the real three-dimensional coordinates of all atoms in the protein sequence;
and adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value until the loss value converges to a preset value, so as to obtain the protein structure prediction model. Specifically, inputting the third feature vector and the fourth feature vector into a first fully-connected network to obtain a predicted three-dimensional coordinate of a core heavy atom; obtaining predicted three-dimensional coordinates of oxygen atoms according to the predicted three-dimensional coordinates of the core heavy atoms; and inputting the third characteristic vector and the fourth characteristic vector into a second full-connection network to obtain the predicted three-dimensional coordinates of other atoms except the core heavy atom and the oxygen atom.
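The loss value in the steps above compares predicted and real atom coordinates under the mask information. The exact loss function is not specified in this passage, so a masked mean-squared error is used here purely as an illustrative sketch:

```python
import numpy as np

def coordinate_loss(pred, true, mask):
    """Masked MSE between predicted and real 3-D coordinates (illustrative).

    pred, true: (n_atoms, 3) coordinate arrays
    mask:       (n_atoms,) with 1 where the (virtual) atom exists, else 0

    Atoms absent from the structure (mask == 0) contribute nothing, so the
    model is not penalized for positions that have no ground truth.
    """
    sq = ((pred - true) ** 2).sum(axis=-1)   # per-atom squared distance
    return float((sq * mask).sum() / mask.sum())
```

The scalar returned here is what the training unit back-propagates through the self-attention, cross-attention, and fully-connected networks until convergence.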
The embodiment of the invention also provides a protein structure prediction device, as shown in fig. 10, including:
and the structure prediction unit 4 is used for inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into a protein structure prediction model trained by the training device to obtain the structure of the protein to be detected.
In this embodiment, the structure prediction unit further includes a data processing unit of the protein sequence to be detected, for processing the protein sequence to be detected, and the specific processing procedure can be referred to the data processing unit in the training device of the protein structure prediction model.
In some embodiments, as shown in fig. 10, the protein structure prediction apparatus further comprises:
and the structure optimization unit 5 is used for optimizing the structure of the protein to be tested.
In this embodiment, the structure optimization unit may be based on optimization tools such as OpenMM and PDBFixer, but is not limited thereto.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the training method of the protein structure model according to the embodiment of the invention is realized or the prediction method of the protein structure according to the invention is realized.
The computer readable medium described in this embodiment may be a computer readable storage medium or a computer readable signal medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments, a computer readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The embodiment of the invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and when the computer program is executed by the processor, the training method of the protein structure model or the prediction method of the protein structure according to the embodiment of the invention are realized.
In this embodiment, the memory may be a volatile memory, such as a random access memory; the memory may also be a non-volatile memory such as read-only memory, flash memory, hard disk, etc. The processor may be a central processing unit, controller, microcontroller, microprocessor, or other data processing chip.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.
References
1. https://github.com/aqlaboratory/openfold/blob/main/openfold/np/residue_constants.py#L356
2. Subramanian R, Sarkar S, Labrador M, et al. Orientation invariant gait matching algorithm based on the Kabsch alignment[C]//IEEE International Conference on Identity, Security and Behavior Analysis (ISBA 2015). IEEE, 2015: 1-8.
3. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold[J]. Nature, 2021, 596(7873): 583-589.

Claims (14)

1. The training method of the protein structure prediction model is characterized in that the protein structure prediction model comprises a characteristic extraction network and a structure prediction network, and comprises the following steps:
Acquiring a training dataset comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
generating a first eigenvector containing protein sequence information according to the protein sequence;
clustering the physicochemical properties of the amino acid residues to obtain the physicochemical properties of the clustered amino acid residues, and generating a second eigenvector containing the physicochemical information of the amino acid residues according to the physicochemical properties of the clustered amino acid residues;
and training the feature extraction network and the structure prediction network by using the first feature vector and the second feature vector to obtain the protein structure prediction model.
2. The training method of claim 1, wherein the training dataset further comprises a position coding sequence obtained based on the protein sequence;
generating a first eigenvector containing protein sequence information according to the protein sequence; clustering the physicochemical properties of the amino acid residues to obtain clustered physicochemical properties of the amino acid residues, and generating a second eigenvector containing physicochemical information of the amino acid residues according to the clustered physicochemical properties of the amino acid residues specifically comprises the steps of:
Converting the position coding sequence into a position coding feature vector;
coding amino acids in the protein sequence to obtain a sequence feature vector, and splicing the sequence feature vector and the position coding feature vector to obtain a first feature vector;
clustering the physicochemical properties of the amino acid residues to obtain a plurality of representative physicochemical properties of the amino acid residues capable of representing each amino acid;
encoding amino acids in the protein sequence according to the plurality of representative amino acid residue physicochemical properties capable of representing each amino acid to obtain physicochemical property feature vectors;
and splicing the physicochemical characteristic feature vector with the position coding characteristic vector to obtain a second characteristic vector.
3. The training method according to claim 2, wherein the step of encoding the amino acids in the protein sequence to obtain a sequence feature vector specifically comprises:
providing a pre-training model;
inputting the protein sequence into the pre-training model, and training the pre-training model;
inputting the protein sequence into a trained pre-training model, and encoding amino acids in the protein sequence to obtain a sequence feature vector.
4. A training method as claimed in claim 3, characterized in that the method of training the pre-training model comprises the steps of:
randomly masking one amino acid in the protein sequence as input, and adjusting the weights of the pre-training model through a back-propagation algorithm until the pre-training model can recover the masked amino acid in the protein sequence.
5. The training method of claim 2, wherein when the protein is multi-chain, the protein sequences in the training dataset are protein sequences after splicing the protein sequences of each sub-chain; the position coding sequence in the training data set is the position coding sequence after the position coding sequence of each sub-chain is spliced.
6. The training method of claim 5, wherein the training of the feature extraction network and the structure prediction network using the first feature vector and the second feature vector further comprises the steps of:
and (3) carrying out relative position coding on different sub-chains of the protein, and splicing to obtain a third feature vector.
7. The training method of claim 6, wherein the training dataset further comprises true three-dimensional coordinates of all atoms in the protein sequence; the real three-dimensional coordinates of all atoms in the protein sequence are three-dimensional coordinate sequence spliced by the real three-dimensional coordinate sequence of all atoms of each sub-chain;
The step of training the feature extraction network and the structure prediction network to obtain the protein structure prediction model specifically comprises the following steps:
inputting the first feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed first feature vector;
inputting the second feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed second feature vector;
fusing the processed first feature vector and the processed second feature vector through a cross attention network to obtain a fourth feature vector;
inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in a protein sequence;
obtaining a loss value according to the predicted three-dimensional coordinates of all atoms in the protein sequence and the real three-dimensional coordinates of all atoms in the protein sequence;
and adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value until the loss value converges to a preset value, so as to obtain the protein structure prediction model.
8. The training method of claim 7, wherein the step of fusing the processed first feature vector and the processed second feature vector together via a cross-attention network to obtain a fourth feature vector comprises:
inputting the processed first feature vector and the processed second feature vector into a cross attention network, performing matrix multiplication calculation first, and then performing softmax calculation to obtain a result vector;
performing matrix multiplication calculation on the processed first feature vector and a result vector obtained through softmax calculation to obtain a feature A;
performing matrix multiplication calculation on the processed second feature vector and a result vector obtained through softmax calculation to obtain a feature B;
and splicing the feature A and the feature B to obtain the fourth feature vector.
9. The training method of claim 7, wherein the step of inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in the protein specifically comprises:
inputting the third feature vector and the fourth feature vector into a first fully-connected network to obtain a predicted three-dimensional coordinate of a core heavy atom;
Obtaining predicted three-dimensional coordinates of oxygen atoms according to the predicted three-dimensional coordinates of the core heavy atoms;
and inputting the third characteristic vector and the fourth characteristic vector into a second full-connection network to obtain the predicted three-dimensional coordinates of other atoms except the core heavy atom and the oxygen atom.
10. A method for predicting a protein structure, comprising the steps of:
inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into a protein structure prediction model obtained by training by the training method according to any one of claims 1-9, so as to obtain the structure of the protein to be detected.
11. A training device for a protein structure prediction model, wherein the protein structure prediction model includes a feature extraction network and a structure prediction network, the training device for a protein structure prediction model comprising:
a data acquisition unit for acquiring a training data set comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
a data processing unit, configured to generate a first feature vector containing protein sequence information according to the protein sequence; clustering the physicochemical properties of the amino acid residues to obtain the physicochemical properties of the clustered amino acid residues, and generating a second eigenvector containing the physicochemical information of the amino acid residues according to the physicochemical properties of the clustered amino acid residues;
And the model training unit is used for training the feature extraction network and the structure prediction network according to the first feature vector and the second feature vector to obtain the protein structure prediction model.
12. A protein structure prediction apparatus, comprising:
the structure prediction unit is used for inputting the physical and chemical properties of the protein sequence of the structure to be detected and the amino acid residues in the protein sequence into the protein structure prediction model trained by the training device of claim 11 to obtain the structure of the protein to be detected.
13. A computer readable storage medium, characterized in that it stores a computer program, which, when executed by a processor, implements the training method of any one of claims 1-9 or implements the prediction method of claim 10.
14. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, which, when executed by the processor, implements the training method of any of claims 1-9 or implements the prediction method of claim 10.
CN202211606821.0A 2022-12-14 2022-12-14 Protein structure prediction method, model training method, device, equipment and medium Pending CN116230074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211606821.0A CN116230074A (en) 2022-12-14 2022-12-14 Protein structure prediction method, model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116230074A true CN116230074A (en) 2023-06-06

Family

ID=86588145

Country Status (1)

Country Link
CN (1) CN116230074A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935952A (en) * 2023-09-18 2023-10-24 浙江大学杭州国际科创中心 Method and device for training protein prediction model based on graph neural network
CN117095743A (en) * 2023-10-17 2023-11-21 山东鲁润阿胶药业有限公司 Polypeptide spectrum matching data analysis method and system for small molecular peptide donkey-hide gelatin
CN117275582A (en) * 2023-07-07 2023-12-22 上海逐药科技有限公司 Construction of amino acid sequence generation model and method for obtaining protein variant

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147866A (en) * 2018-06-28 2019-01-04 南京理工大学 Residue prediction technique is bound based on sampling and the protein-DNA of integrated study
CN111063393A (en) * 2019-12-26 2020-04-24 青岛科技大学 Prokaryotic acetylation site prediction method based on information fusion and deep learning
CN112233723A (en) * 2020-10-26 2021-01-15 上海天壤智能科技有限公司 Protein structure prediction method and system based on deep learning
CN112289370A (en) * 2020-12-28 2021-01-29 武汉金开瑞生物工程有限公司 Protein structure prediction method and device based on multitask time domain convolutional neural network
US20210104294A1 (en) * 2019-10-02 2021-04-08 The General Hospital Corporation Method for predicting hla-binding peptides using protein structural features
CN114333980A (en) * 2021-08-27 2022-04-12 腾讯科技(深圳)有限公司 Method and device for model training, protein feature extraction and function prediction
CN114974397A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Training method of protein structure prediction model and protein structure prediction method
CN114974398A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN115116559A (en) * 2022-06-21 2022-09-27 北京百度网讯科技有限公司 Method, device, equipment and medium for determining and training atomic coordinates in amino acid
US20220375538A1 (en) * 2021-05-11 2022-11-24 International Business Machines Corporation Embedding-based generative model for protein design

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RATUL CHOWDHURY et al.: "Single-sequence protein structure prediction using language models from deep learning", bioRxiv *
Liu Zinan et al.: "A Review of Protein Structure Prediction", Chinese Journal of Medical Physics, vol. 37, no. 9
Zhang Ansheng; Wang Aiping: "Protein secondary structure prediction based on deep learning", Computer Simulation, no. 01, 15 January 2015 (2015-01-15) *

Similar Documents

Publication Publication Date Title
CN116230074A (en) Protein structure prediction method, model training method, device, equipment and medium
Giuffrida et al. Pheno‐deep counter: A unified and versatile deep learning architecture for leaf counting
Jia et al. Quantum neural network states: A brief review of methods and applications
CN114464247A (en) Method and device for predicting binding affinity based on antigen and antibody sequences
Wang et al. GanDTI: A multi-task neural network for drug-target interaction prediction
CN114581770A (en) TransUnnet-based automatic extraction processing method for remote sensing image building
WO2021106706A1 (en) Amino acid sequence searching device, vaccine, amino acid sequence searching method, and amino acid sequence searching program
CN114913917B (en) Drug target affinity prediction method based on digital twin and distillation BERT
Brigato et al. Image classification with small datasets: Overview and benchmark
CN112652358A (en) Drug recommendation system, computer equipment and storage medium for regulating and controlling disease target based on three-channel deep learning
Sekhar et al. Protein class prediction based on Count Vectorizer and long short term memory
CN115131700A (en) Training method of two-way hierarchical mixed model for weakly supervised audio and video content analysis
CN115101145A (en) Medicine virtual screening method based on adaptive meta-learning
Khishe et al. Variable-length CNNs evolved by digitized chimp optimization algorithm for deep learning applications
Quan et al. CrackViT: a unified CNN-transformer model for pixel-level crack extraction
Du et al. Improving protein domain classification for third-generation sequencing reads using deep learning
Ning et al. A symbolic characters aware model for solving geometry problems
CN113591892A (en) Training data processing method and device
CN116189776A (en) Antibody structure generation method based on deep learning
CN115497564A (en) Antigen identification model establishing method and antigen identification method
KR20220111215A (en) Apparatus and method for predicting drug-target interaction using deep neural network model based on self-attention
Ding et al. RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios
CN115083539A (en) Method, device and equipment for reconstructing molecular structure and readable storage medium
Kavitha et al. Explainable AI for Detecting Fissures on Concrete Surfaces Using Transfer Learning
Song et al. Bio-Inspired Computing Models and Algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination