CN116230074A - Protein structure prediction method, model training method, device, equipment and medium - Google Patents


Info

Publication number
CN116230074A
CN116230074A (application CN202211606821.0A)
Authority
CN
China
Prior art keywords
feature vector
protein
training
amino acid
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211606821.0A
Other languages
Chinese (zh)
Inventor
熊袁鹏
刘子敬
幺宝刚
Current Assignee
International Digital Economy Academy IDEA
Original Assignee
International Digital Economy Academy IDEA
Priority date
Filing date
Publication date
Application filed by International Digital Economy Academy IDEA filed Critical International Digital Economy Academy IDEA
Priority to CN202211606821.0A
Publication of CN116230074A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a protein structure prediction method, a model training method, a device, equipment and a medium, relating to the technical fields of biological information, deep learning and computer applications. The training method of the protein structure prediction model comprises the following steps: acquiring a training dataset comprising known protein sequences and the physicochemical properties of amino acid residues; generating a first feature vector containing protein sequence information according to the protein sequence; clustering the physicochemical properties of the amino acid residues, and generating a second feature vector containing physicochemical information of the amino acid residues according to the clustered physicochemical properties; and training the feature extraction network and the structure prediction network with the first feature vector and the second feature vector to obtain a protein structure prediction model. With the training method of the protein structure prediction model provided by the invention, no complex model is required to extract the input features, the calculation speed is high, and the training time of the protein structure prediction model is effectively saved.

Description

Protein structure prediction method, model training method, device, equipment and medium
Technical Field
The invention relates to the technical fields of biological information, deep learning and computer application, in particular to a protein structure prediction method, a model training method, a device, equipment and a medium.
Background
Proteins are essentially one or more chains of amino acids that fold into a specific three-dimensional structure, and it is this structure that confers a specific function. Although experimental means such as single-particle cryo-electron microscopy, X-ray crystallography and nuclear magnetic resonance can accurately measure the three-dimensional structure of a protein and obtain its spatial information in the native state, these experimental techniques suffer from high cost, long turnaround times and other drawbacks.
In recent years, artificial intelligence technology and theory have made great progress and are widely applied in the biopharmaceutical field, and a batch of methods for predicting the three-dimensional structure of proteins have emerged. A trained deep neural network can predict the properties of a protein from its amino acid sequence, based primarily on the distances between pairs of amino acids and the angles between the chemical bonds connecting them. From this known information, the angle and distance information of the folded protein can be deduced, and thus the structure of the whole protein. Current protein structure prediction methods can mainly be divided into: (1) directly predicting the atomic positions of all amino acids in the protein to recover the three-dimensional structure information; among these, direct prediction via multiple sequence alignment has the longest history and is the more effective scheme in the prior art. For example, the AlphaFold2 tool developed by the UK artificial intelligence company DeepMind can predict the structure of general proteins with accuracy comparable to experiments. The technology mainly completes the prediction of the protein structure by introducing multiple-sequence-alignment information and a complex information-interaction mechanism. (2) Predicting the inter-atomic angles and distances on the amino acids, and then obtaining the three-dimensional structure of the protein through a complex "energy minimization" optimization method. The most typical schemes of this kind include IgFold, DeepAb and RoseTTAFold.
However, existing protein structure prediction methods often process only the raw sequence information and ignore the physicochemical properties of the amino acids, so a more complex model is required for feature extraction; the complex model incurs a large computational cost, which prolongs both the training time of the protein structure prediction model and the prediction time of the protein structure.
Accordingly, the prior art is still in need of improvement and development.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention aims to provide a protein structure prediction method, a model training method, a device, equipment and a medium, to solve the problem that existing protein structure prediction methods ignore the physicochemical properties of amino acids and therefore require a more complex model for feature extraction.
The technical scheme of the invention is as follows:
in a first aspect of the present invention, there is provided a training method of a protein structure prediction model, wherein the protein structure prediction model includes a feature extraction network and a structure prediction network, the training method of the protein structure prediction model including the steps of:
acquiring a training dataset comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
generating a first feature vector containing protein sequence information according to the protein sequence;
clustering the physicochemical properties of the amino acid residues to obtain the clustered physicochemical properties of the amino acid residues, and generating a second feature vector containing the physicochemical information of the amino acid residues according to the clustered physicochemical properties;
And training the feature extraction network and the structure prediction network by using the first feature vector and the second feature vector to obtain the protein structure prediction model.
Optionally, the training dataset further comprises a position-coding sequence acquired based on the protein sequence. The steps of generating a first feature vector containing protein sequence information according to the protein sequence, and of clustering the physicochemical properties of the amino acid residues and generating a second feature vector containing physicochemical information of the amino acid residues according to the clustered physicochemical properties, specifically comprise the steps of:
converting the position coding sequence into a position coding feature vector;
coding amino acids in the protein sequence to obtain a sequence feature vector, and splicing the sequence feature vector and the position coding feature vector to obtain a first feature vector;
clustering the physicochemical properties of the amino acid residues to obtain a plurality of representative amino acid residue physicochemical properties capable of representing each amino acid;
encoding the amino acids in the protein sequence according to the plurality of representative amino acid residue physicochemical properties to obtain a physicochemical property feature vector;
And splicing the physicochemical property feature vector with the position-coding feature vector to obtain the second feature vector.
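Illustratively, the generation of the second feature vector can be sketched as follows (a toy NumPy illustration; the property table, its values and the function names are illustrative assumptions, not the real clustered Peptides-library data):

```python
import numpy as np

# Hypothetical toy property table: hydropathy, net charge, molecular weight
# per amino acid (illustrative values, not the real representative properties).
PROPS = {
    "A": [1.8, 0.0, 89.1],
    "R": [-4.5, 1.0, 174.2],
    "N": [-3.5, 0.0, 132.1],
}

def physchem_features(seq):
    # Encode each residue by its representative physicochemical properties.
    return np.array([PROPS[aa] for aa in seq], dtype=float)       # shape (L, k)

def second_feature_vector(seq, pos_feat):
    # Splice the physicochemical property feature vector with the
    # position-coding feature vector to obtain the second feature vector.
    return np.concatenate([physchem_features(seq), pos_feat], axis=1)
```

For a three-residue sequence with a 4-dimensional position-coding feature, the second feature vector then has shape (3, 3 + 4).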
Optionally, the step of encoding the amino acids in the protein sequence to obtain a sequence feature vector specifically includes:
providing a pre-training model;
inputting the protein sequence into the pre-training model, and training the pre-training model;
inputting the protein sequence into a trained pre-training model, and encoding amino acids in the protein sequence to obtain a sequence feature vector.
Optionally, the method for training the pre-training model includes the steps of:
randomly masking one amino acid in the protein sequence as input, and adjusting the weights of the pre-training model through the backpropagation algorithm until the pre-training model can recover the masked amino acid in the protein sequence.
Optionally, when the protein is multi-chain, the protein sequence in the training dataset is the protein sequence obtained by splicing the protein sequences of each sub-chain, and the position-coding sequence in the training dataset is the position-coding sequence obtained by splicing the position-coding sequences of each sub-chain.
Optionally, before training the feature extraction network and the structure prediction network by using the first feature vector and the second feature vector, the method further includes the step of:
carrying out relative position coding on the different sub-chains of the protein and splicing to obtain a third feature vector.
Optionally, the training dataset further comprises the real three-dimensional coordinates of all atoms in the protein sequence; these are the three-dimensional coordinate sequence obtained by splicing the real three-dimensional coordinate sequences of all atoms of each sub-chain;
the step of training the feature extraction network and the structure prediction network to obtain the protein structure prediction model specifically comprises the following steps:
inputting the first feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed first feature vector;
inputting the second feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed second feature vector;
fusing the processed first feature vector and the processed second feature vector through a cross attention network to obtain a fourth feature vector;
Inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in a protein sequence;
obtaining a loss value according to the predicted three-dimensional coordinates of all atoms in the protein sequence and the real three-dimensional coordinates of all atoms in the protein sequence;
and adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value until the loss value converges to a preset value, so as to obtain the protein structure prediction model.
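The loss computation in the steps above can be illustrated as follows (a sketch assuming a per-atom squared-error loss restricted by the atom mask M; the exact loss function is not named here):

```python
import numpy as np

def coordinate_loss(pred, true, mask):
    # pred, true: (L, 37, 3) predicted and real atom coordinates;
    # mask: (L, 37), 1 where the virtual atom actually exists.
    sq = ((pred - true) ** 2).sum(axis=-1)        # per-atom squared error
    return float((sq * mask).sum() / mask.sum())  # mean over existing atoms
```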
Optionally, the step of fusing the processed first feature vector and the processed second feature vector through a cross-attention network to obtain a fourth feature vector specifically includes:
inputting the processed first feature vector and the processed second feature vector into a cross attention network, performing matrix multiplication calculation first, and then performing softmax calculation to obtain a result vector;
performing matrix multiplication calculation on the processed first feature vector and a result vector obtained through softmax calculation to obtain a feature A;
performing matrix multiplication calculation on the processed second feature vector and a result vector obtained through softmax calculation to obtain a feature B;
And splicing the feature A and the feature B to obtain the fourth feature vector.
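The fusion steps above can be sketched as follows (a simplified NumPy illustration that folds away the learned projections of the real cross-attention network, which contains trainable parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(x1, x2):
    # Matrix multiplication first, then softmax, giving one result matrix
    # that is applied to both inputs; the outputs are spliced.
    attn = softmax(x1 @ x2.T, axis=-1)                 # (L, L)
    feat_a = attn @ x1                                 # feature A, (L, d)
    feat_b = attn @ x2                                 # feature B, (L, d)
    return np.concatenate([feat_a, feat_b], axis=-1)   # fourth feature vector
```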
Optionally, the step of inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain the predicted three-dimensional coordinates of all atoms in the protein specifically includes:
inputting the third feature vector and the fourth feature vector into a first fully-connected network to obtain the predicted three-dimensional coordinates of the core heavy atoms;
obtaining the predicted three-dimensional coordinates of the oxygen atoms from the predicted three-dimensional coordinates of the core heavy atoms;
And inputting the third feature vector and the fourth feature vector into a second fully-connected network to obtain the predicted three-dimensional coordinates of the atoms other than the core heavy atoms and the oxygen atoms.
In a second aspect of the present invention, there is provided a method for predicting a protein structure, comprising the steps of:
inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into a protein structure prediction model trained by the training method disclosed by the invention to obtain the structure of the protein to be detected.
In a third aspect of the present invention, there is provided a training apparatus for a protein structure prediction model, wherein the protein structure prediction model includes a feature extraction network and a structure prediction network, the training apparatus for a protein structure prediction model including:
A data acquisition unit for acquiring a training data set comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
a data processing unit, configured to generate a first feature vector containing protein sequence information according to the protein sequence, and to cluster the physicochemical properties of the amino acid residues and generate a second feature vector containing the physicochemical information of the amino acid residues according to the clustered physicochemical properties;
and the model training unit is used for training the feature extraction network and the structure prediction network according to the first feature vector and the second feature vector to obtain the protein structure prediction model.
In a fourth aspect of the present invention, there is provided a protein structure prediction apparatus comprising:
the structure prediction unit is used for inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into the protein structure prediction model obtained by training by adopting the training device disclosed by the invention, so as to obtain the structure of the protein to be detected.
In a fifth aspect of the present invention, a computer readable storage medium is provided, wherein the computer readable storage medium stores a computer program which, when executed by a processor, implements the training method of the present invention as described above or implements the prediction method of the present invention as described above.
In a sixth aspect of the present invention, an electronic device is provided, comprising a memory and a processor, said memory having stored thereon a computer program executable on said processor, said computer program, when executed by said processor, implementing the training method of the present invention as described above or implementing the prediction method of the present invention as described above.
The beneficial effects are that: the invention fully considers the physicochemical properties of amino acid residues, extracts representative physicochemical properties by clustering a plurality of physicochemical properties of the amino acid residues, and uses these properties, together with the features containing protein sequence information, as the input of the protein structure prediction model during training. The training method of the protein structure prediction model provided by the invention therefore has comprehensive input features, does not need a complex model to extract the input features, has a high calculation speed, and effectively saves the training time of the protein structure prediction model.
Drawings
FIG. 1 is a flow chart of a training method of a protein structure prediction model according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a protein structure file according to an embodiment of the present invention.
FIG. 3 is a flow chart of a method for predicting protein structure according to an embodiment of the invention.
Fig. 4 is a training schematic diagram of a pre-training model according to an embodiment of the present invention.
Fig. 5 (a) is a schematic diagram of relative position coding in an embodiment of the present invention, and (b) is a schematic diagram of multi-chain fusion in an embodiment of the present invention.
FIG. 6 is a flow chart of a method for training a protein structure prediction model according to another embodiment of the present invention.
Fig. 7 (a) is a schematic structural diagram of a feature extraction network module according to an embodiment of the invention; (b) is a schematic diagram of a principle of a bidirectional cross-attention mechanism.
FIG. 8 is a flow chart of a method for predicting protein structure according to another embodiment of the invention.
FIG. 9 is a block diagram of a training device for protein structure prediction model in an embodiment of the present invention.
FIG. 10 is a block diagram showing the structure of a protein structure prediction apparatus according to an embodiment of the present invention.
FIG. 11 is a schematic diagram showing the three-dimensional structure of a protein predicted in an embodiment of the present invention.
FIG. 12 shows the protein structure prediction results of the current mainstream protein structure prediction algorithms OmegaFold and AlphaFold2 as reported in their papers.
Detailed Description
The invention provides a protein structure prediction method, a model training method, a device, equipment and a medium, and the invention is further described in detail below in order to make the purposes, the technical schemes and the effects of the invention clearer and more definite. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
As used herein, the term "comprising" and its variants are to be interpreted as the open-ended term "including but not limited to". The term "based on" is to be interpreted as "based at least in part on". The terms "first", "second", and the like may refer to different objects or to the same object. As used herein, a "network" or "neural network" is capable of processing inputs and providing corresponding outputs; each layer of the neural network includes one or more nodes (also referred to as processing nodes or neurons), and each node processes inputs from the previous layer. The terms "neural network", "network" and "neural network model" are used interchangeably herein; "physicochemical properties of amino acids" and "physicochemical properties of amino acid residues" are used interchangeably; and the terms "vector", "feature vector" and "matrix" may also be referred to as, or understood as, tensors.
The embodiment of the invention provides a training method of a protein structure prediction model, wherein as shown in fig. 1, the protein structure prediction model comprises a feature extraction network and a structure prediction network, and the training method of the protein structure prediction model comprises the following steps:
s11, acquiring a training data set, wherein the training data set comprises a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
s12, generating a first eigenvector containing protein sequence information according to the protein sequence;
s13, clustering the physicochemical properties of the amino acid residues to obtain the physicochemical properties of the clustered amino acid residues, and generating a second eigenvector containing the physicochemical information of the amino acid residues according to the physicochemical properties of the clustered amino acid residues;
s14, training the feature extraction network and the structure prediction network by using the first feature vector and the second feature vector to obtain the protein structure prediction model.
In the embodiment of the invention, the physicochemical properties of amino acid residues are fully considered (when amino acids form a peptide bond, the amino group of one and the carboxyl group of another are dehydrated; the part of each amino acid that remains after participating in peptide bond formation is called the amino acid residue). The physicochemical properties of amino acid residues are numerous and largely redundant, so only a few representative properties need to be selected to describe each amino acid. The physicochemical properties are therefore first clustered, the representative properties are extracted, and these are then used together with the features containing protein sequence information as the input of the protein structure prediction model for training. The training method of the protein structure prediction model provided by the invention thus has comprehensive input features, does not need a complex model to extract them, has a high calculation speed, effectively saves the training time of the protein structure prediction model, and further saves the prediction time of the protein structure.
In step S11, the known protein sequences may be obtained from public or self-constructed protein structure or antibody structure databases; for example, complete protein structure files can be downloaded from the US public websites RCSB PDB or SAbDab, and the physicochemical properties of the amino acid residues in the protein sequence may be obtained from the Peptides library in the R language. A typical protein structure file is shown in fig. 2, in which the column numbered 1 is the atom number, the column numbered 2 is the amino acid type, the column numbered 3 is the amino acid number, the column numbered 4 is the three-dimensional coordinates of the atom, and the column numbered 5 is the atom type.
Wherein the protein sequence S consists of L amino acids s, i.e. S = {s_1, s_2, ..., s_L}. If the protein is multi-chain, the sequences of each sub-chain need to be spliced.
The obtained training dataset not only comprises the known protein sequence S and the physicochemical properties of the amino acid residues in the protein sequence; it can also include more protein information according to actual needs, such as the real three-dimensional coordinate information C of all atoms and the amino acid type information A.
In the present invention, all amino acids are represented by default with 37 virtual atom types, the information of atoms that do not exist is set to zero, and whether each virtual atom position exists in a given amino acid s is recorded by separate mask information M. The real atoms of each amino acid are prior art; reference is made, for example, to reference 1. In addition, the training dataset may further comprise a position-coding array PE acquired based on the protein sequence. Thus, the training dataset may comprise the physicochemical properties of the amino acid residues in the protein sequence and a set of data consisting of S, PE, M, C and A, which may be denoted {S, PE, M, C, A}. Each item of data in the training dataset is used in the specific steps below.
The more data used for deep learning, the better the model performance. Thus, when the protein is multi-chain, each item of {S, PE, M, C, A} in the training dataset is spliced to augment the training dataset and thereby enhance model performance. For example, the protein sequence is obtained by splicing the protein sequences of each sub-chain; the position-coding sequence is obtained by splicing the position-coding sequences of each sub-chain; and the real three-dimensional coordinates of all atoms in the protein sequence are obtained by splicing the real three-dimensional coordinate sequences of all atoms of each sub-chain.
Illustratively, for double-stranded proteins, the training dataset may be augmented with the following strategy:
for a protein (e.g., diabody) that contains two sub-chains [ H, L ], the respective training data set for each sub-chain is { S_H, PE_H, M_H, C_H, A_H }, { S_L, PE_L, M_L, C_L, A_L }. Wherein S_H represents the protein sequence of the H chain, S_L represents the protein sequence of the L chain, PE_H represents the position-encoded sequence of the H chain, PE_L represents the position-encoded sequence of the L chain, and the meanings of the remaining symbols are the same.
When the training datasets of the two chains are spliced, there are two ordering strategies: H chain first or L chain first. Swapping the order effectively amounts to data augmentation, since both the H-before-L and the L-before-H concatenations are valid structural data. Therefore, both splicing orders are used in the invention, so that double the data is constructed from double-chain proteins to enhance model performance.
The manner of splicing the data for each sub-chain of the double-stranded protein will be described in detail below using the protein sequence S in the training data set as an example.
For a protein comprising two sub-chains [H, L], suppose the H chain consists of M amino acids, with protein sequence S_H = {s_1_H, s_2_H, ..., s_M_H}, and the L chain consists of N amino acids, with protein sequence S_L = {s_1_L, s_2_L, ..., s_N_L}. Then the protein sequences obtained from the protein with the two sub-chains [H, L] are S = {s_1_H, s_2_H, ..., s_M_H, s_1_L, s_2_L, ..., s_N_L} and S = {s_1_L, s_2_L, ..., s_N_L, s_1_H, s_2_H, ..., s_M_H}.
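The two splicing orders can be expressed as follows (a minimal illustration):

```python
def splice_chains(h_seq, l_seq):
    # Both concatenation orders are valid structural data, so each
    # double-chain protein yields two training examples.
    return [h_seq + l_seq, l_seq + h_seq]
```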
Most proteins in nature, particularly antibodies, often have multi-chain structures, whereas prior-art methods of protein structure prediction are often only applicable to single-chain proteins, which is a disadvantage for predicting the many multi-chain proteins in nature. Thus, the training dataset further comprises a position-coding sequence acquired based on the protein sequence. In step S12, in some embodiments, the step of generating the first feature vector containing protein sequence information according to the protein sequence specifically includes:
S121, converting the position coding sequence into a position coding feature vector;
s122, coding amino acids in the protein sequence to obtain a sequence feature vector, and splicing the sequence feature vector and the position coding feature vector to obtain a first feature vector.
In this embodiment, the sequence feature vector and the position-coding feature vector are spliced, so that the model can distinguish single-chain information and multi-chain information, and the prediction of the multi-chain protein structure is facilitated by the protein structure prediction model.
In order to enable the model to distinguish single-chain from multi-chain information, independent position-coding information is introduced in step S121, i.e. the position codes of each sub-chain of the protein are calculated independently and spliced together. For example, for a double-chain protein of total length L, the sub-chain of length L_1 has position code [1, 2, 3, ..., L_1] and the sub-chain of length L_2 has position code [1, 2, 3, ..., L_2]; the final position-coding arrays (PE) are then [1, 2, 3, ..., L_1, 1, 2, 3, ..., L_2] and [1, 2, 3, ..., L_2, 1, 2, 3, ..., L_1]. Further, the position-coding sequence (PE) is converted into a position-coding feature vector of dimension L×n, where L is the total length of the protein (the total number of amino acids) and n has no physical meaning and is set according to actual needs.
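The per-chain position codes described above can be generated as follows (a minimal sketch):

```python
def per_chain_positions(chain_lengths):
    # Position codes restart at 1 for each sub-chain, e.g. [1..L1, 1..L2].
    pe = []
    for n in chain_lengths:
        pe.extend(range(1, n + 1))
    return pe
```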
The position-coding sequence (PE) may be converted into a position-coding feature vector of dimension L×n via operations including, but not limited to, the following pseudo-code (a PyTorch-style sinusoidal encoding, where position is a (length, 1) tensor holding the per-chain position codes, length is the total protein length, and d_model corresponds to n):

import math
import torch

pe = torch.zeros(length, d_model)
div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                     * -(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position.float() * div_term)
pe[:, 1::2] = torch.cos(position.float() * div_term)
After this operation, with d_model set to 128, the position-coding sequence (PE) is converted into a position-coding feature vector of dimension L×128.
In step S122, in one embodiment, the step of encoding the amino acids in the protein sequence to obtain a sequence feature vector specifically includes:
s1220, providing a pre-training model;
s1221, inputting the protein sequence into the pre-training model, and training the pre-training model;
S1222, inputting the protein sequence into the trained pre-training model, and encoding the amino acids in the protein sequence to obtain a sequence feature vector of dimension L×m, where L is the length of the protein sequence and m has no physical meaning and varies with the chosen pre-training model.
In this embodiment, using a single-sequence pre-training model instead of a sequence-search scheme saves training time for the protein structure model. Specifically, the pre-training model captures information on the protein sequence that is helpful for structure prediction, such as similarity and evolutionary information, without relying on the highly time-consuming multiple sequence alignment scheme, thereby further reducing training time and improving training efficiency.
In steps S1220-S1221, the pre-training model is a neural network model obtained by self-supervised training on a large number of protein or antibody sequences. A common training objective for such a model is to predict the current amino acid from its context. Specifically, for the protein sequences S, an amino acid in each sequence is randomly masked as input, and the model is required to recover the masked amino acid; the weights in the amino acid coding module are adjusted by gradient backpropagation to obtain the optimal result. As shown in fig. 4 (where the letters V, S, E, Q represent amino acids), the complete training process is a prediction from the lower sequence to the upper sequence: amino acid E in the lower protein sequence is masked, and the pre-training model (amino acid coding module) must recover amino acid E (see the upper sequence).
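The masked-prediction training pair of fig. 4 can be illustrated as follows (a toy sketch; the mask token and function name are ours, and the real pre-training model recovers the residue with a neural network rather than returning the answer directly):

```python
def mask_sequence(seq, pos, mask_token="#"):
    # build a self-supervised training pair: masked input -> amino acid to recover
    masked = seq[:pos] + mask_token + seq[pos + 1:]
    return masked, seq[pos]

# mask amino acid E in the sequence V-S-E-Q, as in fig. 4
x, y = mask_sequence("VSEQ", 2)
print(x, y)  # → VS#Q E
```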
The pre-training model can then be used to encode the amino acids on a protein sequence; in step S1222, encoding requires only a single feed-forward pass through the pre-training model.
The invention is not limited to a particular pre-training model; as examples, ESM-1b, ESM-3b, ESM-15b, AntiBERTy, etc. may be used. ESM-1b has good generalization capability and can be applied to any sequence; its basic framework is the Transformer, a model now common in the field of natural language processing (NLP). The value of m differs among pre-training models: with ESM-1b, a sequence feature vector of dimension L×1280 is obtained; with ESM-3b or ESM-15b, a sequence feature vector of dimension L×2560; with AntiBERTy, a sequence feature vector of dimension L×512. For convenience of description, the sequence feature vector of dimension L×1280 obtained with ESM-1b is used as the example below.
In step S122, the operation of splicing the sequence feature vector and the position-coding feature vector is performed with the torch framework. Specifically, the splicing operation may be written as torch.cat([S, PE], dim=-1), which is independent of the pre-training model.
In some specific embodiments, when the dimension of the sequence feature vector is L×1280 and the dimension of the position-coding feature vector is L×128, splicing the two yields a first feature vector of dimension L×1408.
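The dimensions work out as follows (NumPy is used here purely for illustration; the patent performs the same splice with torch.cat([S, PE], dim=-1)):

```python
import numpy as np

L = 5                        # protein length
S = np.zeros((L, 1280))      # sequence feature vector from ESM-1b
PE = np.zeros((L, 128))      # position-coding feature vector
first = np.concatenate([S, PE], axis=-1)
print(first.shape)  # → (5, 1408)
```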
As noted above, most proteins in nature, particularly antibodies, have multi-chain structures, and the protein prediction model should be able to predict them. Therefore, in step S13, in some embodiments, as shown in fig. 3, the step of clustering the physicochemical properties of the amino acid residues to obtain the clustered physicochemical properties, and generating the second feature vector containing amino acid residue physicochemical information from them, specifically includes:
S131, clustering the physicochemical properties of the amino acid residues to obtain a plurality of representative physicochemical properties capable of characterizing each amino acid;
S132, encoding the amino acids in the protein sequence according to these representative physicochemical properties to obtain a physicochemical property feature vector;
S133, splicing the physicochemical property feature vector with the position-coding feature vector to obtain the second feature vector.
Because amino acid residues have numerous physicochemical properties, most of which are redundant, it suffices to select the most representative ones to describe an amino acid. Therefore, in step S131, the physicochemical properties of the amino acid residues are clustered (this may be implemented in a residue property clustering module) to obtain a plurality of representative properties capable of characterizing each amino acid. In some embodiments, the properties are clustered with any clustering algorithm, and the six most representative ones are selected as the feature representation of each amino acid. Specifically, the six physicochemical properties are: hydrophobicity (H1), side-chain volume (V), polarity (P1), pH at the isoelectric point (pI), the negative logarithm of the dissociation constant of the -COOH group (pKa), and the net charge index of the side chain (NCI).
As an example, the clustering algorithm is a K-means clustering algorithm, a spectral clustering algorithm, a hierarchical clustering algorithm, etc., but is not limited thereto.
In step S132, each physicochemical property of each amino acid is a real number; therefore, encoding a protein sequence of length L (i.e., mapping the amino acid characters to their physicochemical properties) yields a physicochemical property feature vector of dimension L×6.
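The character-to-property mapping can be sketched as follows (the numeric values here are illustrative placeholders, not the actual clustered property values):

```python
# six properties per residue: H1, V, P1, pI, pKa, NCI (values illustrative only)
PROPS = {
    "A": [0.62, 88.6, 8.1, 6.00, 2.34, 0.007],
    "G": [0.48, 60.1, 9.0, 5.97, 2.34, 0.179],
}

def encode_physchem(seq):
    # maps a length-L sequence to an L x 6 physicochemical feature matrix
    return [PROPS[aa] for aa in seq]

m = encode_physchem("AGA")
print(len(m), len(m[0]))  # → 3 6
```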
In step S133, when a position-coding feature vector of dimension L×128 is used, splicing the physicochemical property feature vector with the position-coding feature vector yields a second feature vector of dimension L×134. For the splicing method, refer to the splicing of the sequence feature vector and the position-coding feature vector above.
When the protein is multi-chain, the model should obtain the relative positions of the amino acids on the whole chain more accurately (such relative position information often determines the three-dimensional structure of the amino acids on the protein), so that it can more easily distinguish whether amino acids come from the same chain and determine their positions on a specific chain, further strengthening its handling of multi-chain proteins. Accordingly, in some embodiments, training the feature extraction network and the structure prediction network with the first feature vector and the second feature vector further comprises the step of:
S140, performing relative position encoding on the different sub-chains of the protein and splicing to obtain a third feature vector.
In this embodiment, the different sub-chains of the protein are relatively position-encoded to form a two-dimensional matrix, each chain having its own independent relative-position encoding; this may be accomplished by techniques including, but not limited to, pseudo-code operations. Illustratively, fig. 5(a) shows the relative position encoding of a protein sequence of length 6, which is in fact a 6×6 matrix. When the input protein sequence has two sub-chains of lengths L1 and L2 (total length L), the corresponding relative-position encoding results are spliced in the manner shown in fig. 5(b): the dark block is the relative-position encoding of the sub-chain of length L1, the light block is that of the sub-chain of length L2, and the blank white blocks are two-dimensional matrices whose elements are all 0. That is, the position-coding matrices of the sub-chains of lengths L1 and L2 lie on the diagonal of the spliced matrix (dimension L×L), further strengthening the model's handling of multi-chain proteins. After this multi-chain fusion operation, a third feature vector of dimension L×L×1 is obtained.
By way of example, the pseudocode (for a single sub-chain of length L) is:

embed = []
for i in range(L):
    a = list(range(-i, 0))
    b = list(range(L - i))
    embed.append(a + b)  # row i holds the relative offsets j - i for each column j
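The pseudocode above, together with the fig. 5(b) splicing, can be combined into one small sketch (function names are ours; cross-chain entries stay zero, as described):

```python
def rel_pos(L):
    # relative position encoding of a single chain: entry (i, j) = j - i
    return [[j - i for j in range(L)] for i in range(L)]

def splice_chains(chain_lengths):
    # place each chain's matrix on the diagonal; off-diagonal blocks remain 0
    total = sum(chain_lengths)
    M = [[0] * total for _ in range(total)]
    offset = 0
    for Lc in chain_lengths:
        block = rel_pos(Lc)
        for i in range(Lc):
            for j in range(Lc):
                M[offset + i][offset + j] = block[i][j]
        offset += Lc
    return M

M = splice_chains([2, 3])
print(M[0][1], M[1][0], M[0][2])  # → 1 -1 0
```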
In step S14, the true three-dimensional coordinates C of all atoms in the protein sequences of the training dataset are required. When the protein is multi-chain, the true three-dimensional coordinates of all atoms in the protein sequence are the coordinate sequences obtained by splicing the true coordinate sequences of all atoms of each sub-chain, in the splicing manner described in the example above. In some embodiments, the feature extraction network includes a self-attention network and a cross-attention network, and the structure prediction network is a fully-connected network; as shown in figs. 3 and 6, training the feature extraction network and the structure prediction network to obtain the protein structure prediction model specifically includes:
s141, inputting the first feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed first feature vector;
s142, inputting the second feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed second feature vector;
s143, fusing the processed first feature vector and the processed second feature vector through a cross attention network to obtain a fourth feature vector;
S144, inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in the protein sequence;
s145, obtaining a loss value according to the predicted three-dimensional coordinates of all atoms in the protein sequence and the real three-dimensional coordinates of all atoms in the protein sequence;
and S146, adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value until the loss value converges to a preset value, and obtaining the protein structure prediction model.
In step S141, as shown in fig. 7(a), the first feature vector is input into a self-attention network (a standard self-attention basic module) and processed several times (specifically, 3 or 5 times) through the self-attention mechanism to obtain a processed first feature vector (containing protein sequence information). Its dimension may be L×1280, where 1280 is a freely set parameter that may be adjusted in practice according to final test performance; 512, 1280, 2560, or the like may generally be chosen.
In step S142, likewise as shown in fig. 7(a), the second feature vector is input into a self-attention network and processed several times (specifically, 3 or 5 times) through the self-attention mechanism to obtain a processed second feature vector (containing amino acid residue physicochemical information), with the same dimension options as above.
In step S143, the processed first feature vector and the processed second feature vector are fused by a cross-attention mechanism, so that better performance can be obtained with a lighter model.
The processed first and second feature vectors are fused through a cross-attention network (cross-attention basic module) to obtain a fourth feature vector; its dimension may still be L×1280, with 1280 again a freely set parameter adjustable according to final test performance (512, 1280, or 2560 may generally be chosen). The cross-attention basic module is bidirectional. Illustratively, as shown in fig. 7(b), the processed first feature vector (feature 1 in the figure) and the processed second feature vector (feature 2 in the figure) are matrix-multiplied and passed through softmax to obtain a result vector. The processed first and second feature vectors (features 1/2 in the figure) are then taken alternately as inputs and each matrix-multiplied with this result vector: multiplying the softmax result with the processed first feature vector outputs feature A, and multiplying it with the processed second feature vector outputs feature B. Finally, features A and B are spliced (the A|B operation in the figure) to obtain the fourth feature vector. The fusion of the processed first and second feature vectors through the cross-attention network may also be performed by direct summation or by front-back splicing.
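A minimal NumPy sketch of the bidirectional fusion just described (the real module presumably includes learned projections, which are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_fuse(feat1, feat2):
    # pairing representation: matrix-multiply the two features, then softmax
    weights = softmax(feat1 @ feat2.T)       # (L, L) result vector
    feature_a = weights @ feat1              # result vector x feature 1
    feature_b = weights @ feat2              # result vector x feature 2
    return np.concatenate([feature_a, feature_b], axis=-1)  # A|B splice

L, d = 4, 8
fourth = cross_fuse(np.ones((L, d)), np.ones((L, d)))
print(fourth.shape)  # → (4, 16)
```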
In step S144, in some embodiments, the fully-connected network includes a first fully-connected network and a second fully-connected network, and the step of inputting the third feature vector and the fourth feature vector into the fully-connected network to obtain predicted three-dimensional coordinates of all atoms in the protein specifically includes:
s1441, inputting the third feature vector and the fourth feature vector into a first fully-connected network to obtain a predicted three-dimensional coordinate of a core heavy atom;
s1442, obtaining a predicted three-dimensional coordinate of the oxygen atom according to the predicted three-dimensional coordinate of the core heavy atom;
s1443, inputting the third feature vector and the fourth feature vector into a second full-connection network to obtain predicted three-dimensional coordinates of other atoms except the core heavy atom and the oxygen atom.
In this embodiment, owing to the specificity of protein structure, the four core heavy atoms (the backbone carbon, α-carbon, nitrogen, and β-carbon) belong to the same rotation group, and their three-dimensional coordinates must be predicted separately. Therefore, in step S1441, the third and fourth feature vectors are input into the first fully-connected network to obtain the predicted three-dimensional coordinates of the four core heavy atoms. The oxygen atom is generally determined directly by fixed bond lengths and bond angles: once the coordinates of the four core heavy atoms are determined, the oxygen coordinates follow. Therefore, in step S1442, the predicted three-dimensional coordinates of the oxygen atom are obtained from those of the core heavy atoms. The remaining 32 atoms are called side-chain atoms (as stated above, all amino acids are given 37 virtual atom types by default in the present invention); their three-dimensional coordinates are determined by particular torsion angles and must be predicted separately. Therefore, in step S1443, the third and fourth feature vectors are input into the second fully-connected network to obtain the predicted three-dimensional coordinates of the atoms other than the core heavy atoms and the oxygen atom. The predicted three-dimensional coordinates Ĉ of all atoms are thus obtained, in the format L×37×3, i.e., the three-dimensional coordinates of all 37 virtual atoms of each of the L amino acids.
In step S145, a loss value is obtained from the predicted three-dimensional coordinates Ĉ of all atoms in the protein sequence and the true three-dimensional coordinates C of all atoms in the protein sequence. The loss value measures how close the predicted coordinates are to the true coordinates: the smaller the loss value, the closer they are.
Specifically, the loss value may be calculated with a loss function; the loss function L in the present invention may be obtained by combining the Kabsch alignment loss (KLoss, see reference 1), the frame-aligned point error loss (FAPE, see reference 2), and the torsion angle loss (TorsionLoss), as follows:

L = KLoss + FAPE + TorsionLoss (1)
The inputs to the three loss functions in equation (1) are (C, Ĉ), i.e., the true and predicted three-dimensional coordinates. During calculation, however, the structure model (i.e., the structure prediction network) outputs coordinates Ĉ for all 37 virtual atoms of each amino acid on the protein sequence, whereas each amino acid actually possesses a different number of atoms; the absent atoms are meaningless and should not be used to compute the final loss. The calculation of each loss term constituting the above loss function should therefore be restricted by the atomic mask information M of the amino acids described above, i.e., the predicted coordinates input to the loss function should be M·Ĉ, so that the input to the loss function is (C, M·Ĉ).
In step S146, in some embodiments, as shown in fig. 6, the step of adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network (including the first fully-connected network and the second fully-connected network) according to the loss value until the loss value converges to a preset value, and the step of obtaining the protein structure prediction model specifically includes:
s1461, judging whether the loss value converges to a preset value, if so, stopping training to obtain a protein structure prediction model;
and S1462, if not, adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value.
The parameters (e.g., weights) in the self-attention network, the cross-attention network, and the fully-connected networks are adjusted according to the loss value until the loss value converges to the preset value, yielding the protein structure prediction model. Specifically, the node weights of these networks are randomly initialized; to obtain the optimal weights, the loss function is computed from the predicted three-dimensional coordinates obtained during training and the true three-dimensional coordinates, and the node weights of each neural network are adjusted by gradient backpropagation. The final weights are generally obtained after multiple rounds of backpropagation (typically 300 rounds), yielding the trained protein structure prediction model. Gradients in this step are not backpropagated into the pre-training model: in the present method, the pre-training model is trained first and the protein structure prediction model afterwards, the two trainings being carried out independently.
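Steps S1461-S1462 amount to the following loop (a schematic with a dummy loss function; the real step adjusts the network weights by backpropagation):

```python
def train_until_converged(loss_step, preset, max_rounds=300):
    # S1461: stop training when the loss converges to the preset value
    # S1462: otherwise keep adjusting parameters (here: just take another round)
    loss = float("inf")
    for rnd in range(max_rounds):
        loss = loss_step(rnd)
        if loss <= preset:
            break
    return loss, rnd

loss, rnd = train_until_converged(lambda r: 1.0 / (r + 1), preset=0.1)
print(round(loss, 3), rnd)  # → 0.1 9
```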
In the prior art, protein structure inference time is too long to be applied to large-scale structure prediction tasks, for two main reasons. First, existing methods rely on multiple sequence alignment, whose basic principle is to compare the protein sequence to be predicted against an existing protein sequence library to obtain sequence similarity information. Although the speed of alignment algorithms has been greatly optimized, constructing a multiple sequence alignment remains extremely time-consuming, taking hours or even tens of hours in extreme cases of long input sequences. Second, to give the model stronger predictive capability, existing methods often need to learn information useful for structure prediction through repeated internal information interaction, and such complex models incur substantial computational cost, prolonging prediction time. Based on the above, an embodiment of the invention further provides a protein structure prediction method: an artificial intelligence method for predicting the three-dimensional structure of a protein based on single-sequence features ("single sequence" meaning that only the sequence information is required as input, with no additional inputs) and the physicochemical property information of the amino acid residues on the protein. As shown in fig. 8, the protein structure prediction method includes the steps of:
S15, inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into a protein structure prediction model trained by the training method according to the embodiment of the invention, so as to obtain the structure of the protein to be detected.
In this embodiment, the protein sequence of the structure to be tested and the physicochemical properties of the amino acid residues in the protein sequence need to be preprocessed before being input into the protein structure prediction model, specifically as follows:
A feature vector containing protein sequence information is generated with the trained pre-training model and spliced with the position-coding feature vector to obtain the first feature vector to be tested (for the specific method, refer to the training method of the protein prediction model above; it is not repeated here).
The physicochemical properties of the amino acid residues are clustered to obtain the clustered properties (the six clustered residue physicochemical properties may also be used directly), and a feature vector containing amino acid residue physicochemical information is generated from them and spliced with the position-coding feature vector to obtain the second feature vector to be tested (likewise, refer to the training method above).
When the protein is multi-chain, the different sub-chains of the protein are relatively position-encoded and spliced to obtain the third feature vector to be tested (likewise, refer to the training method above).
And inputting the obtained first feature vector to be detected and the second feature vector to be detected into a trained protein structure prediction model.
In this embodiment, the physicochemical properties of amino acid residues are fully considered: representative residue physicochemical properties are extracted by clustering many such properties, and these, together with features containing protein sequence information, serve as the input to the protein prediction model, increasing the information available to the model. The protein structure prediction method provided by the invention therefore needs no complex model to extract input features and computes quickly, effectively saving the prediction time of the protein prediction model and improving prediction efficiency. In addition, most current protein structure prediction methods have inference times on the order of minutes or even hours, which is unfavorable for large-scale protein structure prediction tasks. In this embodiment, a single-sequence pre-training model replaces the excessively time-consuming multiple sequence alignment scheme, capturing sequence information helpful for structure prediction, such as similarity and evolutionary information, so that protein structure prediction time can be optimized to the second level. The protein structure prediction model provided by this embodiment can also predict multi-chain proteins, so the method applies to both single-chain and multi-chain proteins and has a wider application range.
In summary, the prediction method provided in this embodiment needs no complex model for feature extraction, has short inference time and high prediction efficiency, and can predict multi-chain protein structures.
Furthermore, the present invention can do more than predict the structure of the protein: in some embodiments, a predicted protein structure file is also generated from the amino acid type information A and the output three-dimensional coordinates Ĉ of the above protein sequences, as shown in fig. 2. Specifically, in the predicted protein structure file, each line represents the information of one atom of one amino acid. The column numbered 1 in fig. 2 holds the data line number; the column numbered 2 is the amino acid type, corresponding to the information in A; the column numbered 3 is the amino acid sequence number, corresponding to the subscript of the information in A; the column numbered 4 holds the three-dimensional coordinates of the corresponding atom, taken from Ĉ; the column numbered 5 is the atom type, determined by the information in A, i.e., the true atomic composition of each amino acid. Since each amino acid corresponds to multiple lines of information, atoms from the same amino acid share the same contents in columns 2 and 3. Other columns of protein structure information are filled with 0 values as placeholders.
To make the finally generated protein structure more realistic, in some embodiments, as shown in fig. 8, after inputting the protein sequence of the structure to be tested and the physicochemical properties of its amino acid residues into the protein structure prediction model to obtain the structure of the protein to be tested, the method further includes:
S16, optimizing the obtained structure of the protein to be tested. The specific optimization method is not limited in this embodiment; as examples, OpenMM, PDBFixer, or the like may be used.
The protein structure prediction method provided by the invention is described below with reference to fig. 3. The protein sequence of the structure to be tested is input into the pre-training model, which produces the pre-trained feature representation of the protein sequence, i.e., the sequence feature vector. The physicochemical properties of the amino acid residues in the protein sequence of the structure to be tested are clustered by the residue property clustering module to obtain the six representative residue physicochemical properties, forming a feature vector, i.e., the physicochemical property feature vector. The sequence feature vector and the residue physicochemical property feature vector are each fused with the position-coding feature vector to obtain the first and second feature vectors, which are then input into the trained structure prediction model. The specific process is as follows:
The first and second feature vectors are respectively input into the trained self-attention basic modules (self-attention networks whose basic framework is the Transformer) and then into the cross-attention basic module for more complex feature extraction. That is, the first and second feature vectors are each processed by the self-attention mechanism to obtain the processed first and second feature vectors; then, in the trained cross-attention basic module (not directly illustrated in fig. 3), a sequence feature representation carrying residue physicochemical properties is obtained (the processed first and second feature vectors are matrix-multiplied to form a pairing feature representation) and fused with the features from the self-attention basic modules (the processed first and second feature vectors, used alternately as inputs) to obtain the fourth feature vector (this process corresponds to fig. 7(b)). Meanwhile, the different sub-chains of the protein are relatively position-encoded and spliced to obtain the third feature vector. Finally, the third and fourth feature vectors are input into the structure module (the trained structure prediction network), which integrates them and outputs the predicted protein backbone information (the number of output proteins is determined by the number of input protein sequences); the predicted protein backbone is then optimized and the predicted protein structure is output.
The invention fully considers the physicochemical properties of amino acid residues: representative physicochemical properties are extracted by clustering the physicochemical properties of a plurality of amino acid residues, and these properties, together with features containing protein sequence information, are used as the input information for training the protein prediction model. Therefore, the training method of the protein structure prediction model provided by the invention does not require complex input feature extraction, is fast to compute, and thus effectively shortens the prediction time of the protein structure. Combined with measures such as using a single-sequence pre-training model in place of sequence searching, the method can reduce protein structure prediction time to the level of seconds. In addition, the invention can predict the structure of multi-chain proteins, overcoming the limitation of the prior art, which can only predict single-chain protein structures.
To verify the accuracy of the protein structure prediction method provided by the present invention, the SAbDab antibody database (an antibody database established by the Oxford Protein Informatics Group in the UK under an open-source protocol, which gathers all antibody structure data in the PDB three-dimensional protein structure database) was used as training data for model training, and the widely used Root Mean Square Deviation (RMSD) was used as the evaluation index of the predicted structure. The predicted structure of the antibody with PDB ID 7phu obtained by the prediction method of the invention is shown in FIG. 11, and the results reported in the papers of the current mainstream prediction algorithms OmegaFold and AlphaFold2 are shown in FIG. 12. The reported RMSD of OmegaFold is 1.82 and that of AlphaFold2 is 5.83, whereas the RMSD of the prediction method of the present invention is 0.91, which fully demonstrates the accuracy of the prediction method provided by the present invention.
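RMSD between a predicted and a real structure is conventionally computed after optimal rigid-body superposition of the two coordinate sets, for example with the Kabsch algorithm cited in reference [2]. A minimal numpy sketch (not the patent's own evaluation code):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid alignment.

    Centers both structures, finds the optimal rotation via SVD of the
    covariance matrix (Kabsch algorithm), then measures the residual
    root-mean-square deviation.
    """
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)              # covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # optimal rotation
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

With this metric, a lower value means the predicted atoms lie closer to the experimental ones; an RMSD below 1 Å, as reported above, indicates near-atomic agreement.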
The embodiment of the invention also provides a training device of a protein structure prediction model, wherein as shown in fig. 9, the protein structure prediction model comprises a feature extraction network and a structure prediction network, and the training device of the protein structure prediction model comprises:
A data acquisition unit 1 for acquiring a training data set comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
a data processing unit 2 for generating a first eigenvector containing protein sequence information from the protein sequence; clustering the physicochemical properties of the amino acid residues to obtain the physicochemical properties of the clustered amino acid residues, and generating a second eigenvector containing the physicochemical information of the amino acid residues according to the physicochemical properties of the clustered amino acid residues;
and the model training unit 3 is used for training the feature extraction network and the structure prediction network according to the first feature vector and the second feature vector to obtain the protein structure prediction model.
In some embodiments, the data acquisition unit 1 is specifically configured to:
the protein sequence, the physicochemical properties of the amino acid residues in the protein sequence, the position coding sequence obtained based on the protein sequence, the real three-dimensional coordinate information of the atoms, the mask information (recording whether the virtual atoms at each position of each amino acid exist), and the like are obtained from existing protein structure files and peptide libraries.
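The items acquired above can be grouped into a single training sample. The container below is purely illustrative (field names and shapes are assumptions, not taken from the patent):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """One entry of the training data set described above (illustrative)."""
    sequence: str                   # amino acid sequence of length L
    residue_properties: np.ndarray  # physicochemical properties per residue
    position_codes: np.ndarray      # position coding sequence, shape (L,)
    true_coords: np.ndarray         # real 3-D coordinates of all atoms, (n_atoms, 3)
    atom_mask: np.ndarray           # 1 where the (virtual) atom exists, else 0
```

Each unit of the training device then reads the fields it needs: the data processing unit consumes the sequence, properties, and position codes; the loss computation consumes the true coordinates and mask.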
In some embodiments, the data processing unit 2 is specifically configured to:
converting the position coding sequence into a position coding feature vector;
coding amino acids in the protein sequence to obtain a sequence feature vector, and splicing the sequence feature vector and the position coding feature vector to obtain a first feature vector. Specifically, the protein sequence is input into a pre-training model and the pre-training model is trained; the protein sequence is then input into the trained pre-training model, and the amino acids in the protein sequence are encoded to obtain a feature vector of dimension L×m.
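The splicing of the L×m sequence feature vector with the position coding feature vector can be sketched as below. The patent does not specify the form of the position encoding, so a standard sinusoidal encoding is assumed here purely for illustration; the pre-trained L×m embedding is taken as a given input.

```python
import numpy as np

def position_encoding(L, d):
    """Sinusoidal position coding feature vector of shape (L, d) (assumed form)."""
    pos = np.arange(L)[:, None].astype(float)
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    # even feature indices use sine, odd indices use cosine
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def first_feature_vector(seq_features, d_pos=8):
    """Splice an (L, m) sequence feature vector with an (L, d_pos) position encoding."""
    L = seq_features.shape[0]
    return np.concatenate([seq_features, position_encoding(L, d_pos)], axis=-1)
```

The second feature vector is formed the same way, with the physicochemical property feature vector in place of the sequence feature vector.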
In some embodiments, the data processing unit 2 is specifically configured to:
clustering the physicochemical properties of the amino acid residues to obtain a plurality of representative amino acid residue physicochemical properties capable of characterizing each amino acid; encoding the amino acids in the protein sequence according to these representative physicochemical properties to obtain a physicochemical property feature vector; and splicing the physicochemical property feature vector with the position coding feature vector to obtain a second feature vector.
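The clustering step can be illustrated with a tiny k-means over physicochemical property scales, each scale being a 20-dimensional vector of per-amino-acid values. The choice of 6 clusters follows the description above; the k-means implementation and the residue encoding are illustrative stand-ins, since the patent does not name a specific clustering algorithm.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: returns (centers, labels) for the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def representative_properties(property_table, k=6):
    """property_table: (n_properties, 20) values of each scale over the 20 amino acids.
    Returns (k, 20): k representative property scales (the cluster centers)."""
    centers, _ = kmeans(property_table, k)
    return centers

def encode_sequence(seq, centers, aa_order="ACDEFGHIKLMNPQRSTVWY"):
    """Physicochemical property feature vector: (L, k) representative values per residue."""
    idx = [aa_order.index(a) for a in seq]
    return centers[:, idx].T
```

Each residue is thus described by its values under the 6 representative scales rather than by the full redundant property set.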
In some embodiments, the data processing unit 2 is further configured to:
and (3) carrying out relative position coding on different sub-chains of the protein, and splicing to obtain a third feature vector.
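One common way to realize relative position coding across sub-chains is to offset each chain's residue indices by a large constant, so that inter-chain residue pairs always fall outside the intra-chain relative-distance range. The patent does not fix the exact scheme, so the gap value below is an illustrative assumption:

```python
def spliced_position_codes(chain_lengths, chain_gap=200):
    """Residue position codes for a spliced multi-chain input.

    A jump of `chain_gap` between consecutive sub-chains marks the chain
    break, so relative offsets across chains remain distinguishable from
    offsets within a chain (chain_gap is an assumed value).
    """
    codes, offset = [], 0
    for length in chain_lengths:
        codes.extend(range(offset, offset + length))
        offset += length + chain_gap
    return codes
```

Splicing these codes over all sub-chains, and embedding them, yields the third feature vector that accompanies the fourth feature vector into the structure prediction network.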
In some embodiments, when the protein is multi-chain, the data processing unit 2 is further configured to:
and splicing each data in the training data set.
For example, the protein sequences of the sub-chains are spliced; the position coding sequences of the sub-chains are spliced; the real three-dimensional coordinate sequences of all atoms of the sub-chains are spliced, and so on. The splicing manner is as in the above examples and will not be described again here.
In some embodiments, the model training unit 3 is specifically configured to:
inputting the first feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed first feature vector;
inputting the second feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed second feature vector;
fusing the processed first feature vector and the processed second feature vector through a cross attention network to obtain a fourth feature vector;
inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in a protein sequence;
obtaining a loss value according to the predicted three-dimensional coordinates of all atoms in the protein sequence and the real three-dimensional coordinates of all atoms in the protein sequence;
and adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value until the loss value converges to a preset value, so as to obtain the protein structure prediction model. Specifically, inputting the third feature vector and the fourth feature vector into a first fully-connected network to obtain a predicted three-dimensional coordinate of a core heavy atom; obtaining predicted three-dimensional coordinates of oxygen atoms according to the predicted three-dimensional coordinates of the core heavy atoms; and inputting the third characteristic vector and the fourth characteristic vector into a second full-connection network to obtain the predicted three-dimensional coordinates of other atoms except the core heavy atom and the oxygen atom.
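The loss value in the steps above compares predicted and real atom coordinates under the mask information. The exact loss function is not specified in this passage, so a masked mean-squared error is used here purely as an illustrative sketch:

```python
import numpy as np

def coordinate_loss(pred, true, mask):
    """Masked MSE between predicted and real 3-D coordinates (illustrative).

    pred, true: (n_atoms, 3) coordinate arrays
    mask:       (n_atoms,) with 1 where the (virtual) atom exists, else 0

    Atoms absent from the structure (mask == 0) contribute nothing, so the
    model is not penalized for positions that have no ground truth.
    """
    sq = ((pred - true) ** 2).sum(axis=-1)   # per-atom squared distance
    return float((sq * mask).sum() / mask.sum())
```

The scalar returned here is what the training unit back-propagates through the self-attention, cross-attention, and fully-connected networks until convergence.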
The embodiment of the invention also provides a protein structure prediction device, as shown in fig. 10, including:
and the structure prediction unit 4 is used for inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into a protein structure prediction model trained by the training device to obtain the structure of the protein to be detected.
In this embodiment, the structure prediction unit further includes a data processing unit of the protein sequence to be detected, for processing the protein sequence to be detected, and the specific processing procedure can be referred to the data processing unit in the training device of the protein structure prediction model.
In some embodiments, as shown in fig. 10, the protein structure prediction apparatus further comprises:
and the structure optimization unit 5 is used for optimizing the structure of the protein to be tested.
In this embodiment, the structure optimization unit may be based on optimization tools such as OpenMM and PDBFixer, but is not limited thereto.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the training method of the protein structure model according to the embodiment of the invention is realized or the prediction method of the protein structure according to the invention is realized.
The computer readable medium described in this embodiment may be a computer readable storage medium or a computer readable signal medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments, a computer readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The embodiment of the invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and when the computer program is executed by the processor, the training method of the protein structure model or the prediction method of the protein structure according to the embodiment of the invention are realized.
In this embodiment, the memory may be a volatile memory, such as a random access memory; the memory may also be a non-volatile memory such as read-only memory, flash memory, hard disk, etc. The processor may be a central processing unit, controller, microcontroller, microprocessor, or other data processing chip.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.
References
1. https://github.com/aqlaboratory/openfold/blob/main/openfold/np/residue_constants.py#L356
2. Subramanian R, Sarkar S, Labrador M, et al. Orientation invariant gait matching algorithm based on the Kabsch alignment[C]//IEEE International Conference on Identity, Security and Behavior Analysis (ISBA 2015). IEEE, 2015: 1-8.
3. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold[J]. Nature, 2021, 596(7873): 583-589.

Claims (14)

1. The training method of the protein structure prediction model is characterized in that the protein structure prediction model comprises a characteristic extraction network and a structure prediction network, and comprises the following steps:
Acquiring a training dataset comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
generating a first eigenvector containing protein sequence information according to the protein sequence;
clustering the physicochemical properties of the amino acid residues to obtain the physicochemical properties of the clustered amino acid residues, and generating a second eigenvector containing the physicochemical information of the amino acid residues according to the physicochemical properties of the clustered amino acid residues;
and training the feature extraction network and the structure prediction network by using the first feature vector and the second feature vector to obtain the protein structure prediction model.
2. The training method of claim 1, wherein the training dataset further comprises a position coding sequence obtained based on the protein sequence;
generating a first eigenvector containing protein sequence information according to the protein sequence; clustering the physicochemical properties of the amino acid residues to obtain clustered physicochemical properties of the amino acid residues, and generating a second eigenvector containing physicochemical information of the amino acid residues according to the clustered physicochemical properties of the amino acid residues specifically comprises the steps of:
Converting the position coding sequence into a position coding feature vector;
coding amino acids in the protein sequence to obtain a sequence feature vector, and splicing the sequence feature vector and the position coding feature vector to obtain a first feature vector;
clustering the physicochemical properties of the amino acid residues to obtain a plurality of representative physicochemical properties of the amino acid residues capable of representing each amino acid;
encoding amino acids in the protein sequence according to the plurality of representative amino acid residue physicochemical properties capable of representing each amino acid to obtain physicochemical property feature vectors;
and splicing the physicochemical characteristic feature vector with the position coding characteristic vector to obtain a second characteristic vector.
3. The training method according to claim 2, wherein the step of encoding the amino acids in the protein sequence to obtain a sequence feature vector specifically comprises:
providing a pre-training model;
inputting the protein sequence into the pre-training model, and training the pre-training model;
inputting the protein sequence into a trained pre-training model, and encoding amino acids in the protein sequence to obtain a sequence feature vector.
4. A training method as claimed in claim 3, characterized in that the method of training the pre-training model comprises the steps of:
randomly masking one amino acid in the protein sequence as input, and adjusting the weights of the pre-training model through a back-propagation algorithm until the pre-training model can recover the masked amino acid in the protein sequence.
5. The training method of claim 2, wherein when the protein is multi-chain, the protein sequences in the training dataset are protein sequences after splicing the protein sequences of each sub-chain; the position coding sequence in the training data set is the position coding sequence after the position coding sequence of each sub-chain is spliced.
6. The training method of claim 5, wherein the training of the feature extraction network and the structure prediction network using the first feature vector and the second feature vector further comprises the steps of:
and (3) carrying out relative position coding on different sub-chains of the protein, and splicing to obtain a third feature vector.
7. The training method of claim 6, wherein the training dataset further comprises true three-dimensional coordinates of all atoms in the protein sequence; the real three-dimensional coordinates of all atoms in the protein sequence are three-dimensional coordinate sequence spliced by the real three-dimensional coordinate sequence of all atoms of each sub-chain;
The step of training the feature extraction network and the structure prediction network to obtain the protein structure prediction model specifically comprises the following steps:
inputting the first feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed first feature vector;
inputting the second feature vector into a self-attention network, and processing for a plurality of times through a self-attention mechanism to obtain a processed second feature vector;
fusing the processed first feature vector and the processed second feature vector through a cross attention network to obtain a fourth feature vector;
inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in a protein sequence;
obtaining a loss value according to the predicted three-dimensional coordinates of all atoms in the protein sequence and the real three-dimensional coordinates of all atoms in the protein sequence;
and adjusting parameters in the self-attention network, the cross-attention network and the fully-connected network according to the loss value until the loss value converges to a preset value, so as to obtain the protein structure prediction model.
8. The training method of claim 7, wherein the step of fusing the processed first feature vector and the processed second feature vector together via a cross-attention network to obtain a fourth feature vector comprises:
inputting the processed first feature vector and the processed second feature vector into a cross attention network, performing matrix multiplication calculation first, and then performing softmax calculation to obtain a result vector;
performing matrix multiplication calculation on the processed first feature vector and a result vector obtained through softmax calculation to obtain a feature A;
performing matrix multiplication calculation on the processed second feature vector and a result vector obtained through softmax calculation to obtain a feature B;
and splicing the feature A and the feature B to obtain the fourth feature vector.
9. The training method of claim 7, wherein the step of inputting the third feature vector and the fourth feature vector into a fully-connected network to obtain predicted three-dimensional coordinates of all atoms in the protein specifically comprises:
inputting the third feature vector and the fourth feature vector into a first fully-connected network to obtain a predicted three-dimensional coordinate of a core heavy atom;
Obtaining predicted three-dimensional coordinates of oxygen atoms according to the predicted three-dimensional coordinates of the core heavy atoms;
and inputting the third characteristic vector and the fourth characteristic vector into a second full-connection network to obtain the predicted three-dimensional coordinates of other atoms except the core heavy atom and the oxygen atom.
10. A method for predicting a protein structure, comprising the steps of:
inputting the protein sequence of the structure to be detected and the physicochemical properties of the amino acid residues in the protein sequence into a protein structure prediction model obtained by training by the training method according to any one of claims 1-9, so as to obtain the structure of the protein to be detected.
11. A training device for a protein structure prediction model, wherein the protein structure prediction model includes a feature extraction network and a structure prediction network, the training device for a protein structure prediction model comprising:
a data acquisition unit for acquiring a training data set comprising a known protein sequence and physicochemical properties of amino acid residues in the protein sequence;
a data processing unit, configured to generate a first feature vector containing protein sequence information according to the protein sequence; clustering the physicochemical properties of the amino acid residues to obtain the physicochemical properties of the clustered amino acid residues, and generating a second eigenvector containing the physicochemical information of the amino acid residues according to the physicochemical properties of the clustered amino acid residues;
And the model training unit is used for training the feature extraction network and the structure prediction network according to the first feature vector and the second feature vector to obtain the protein structure prediction model.
12. A protein structure prediction apparatus, comprising:
the structure prediction unit is used for inputting the physical and chemical properties of the protein sequence of the structure to be detected and the amino acid residues in the protein sequence into the protein structure prediction model trained by the training device of claim 11 to obtain the structure of the protein to be detected.
13. A computer readable storage medium, characterized in that it stores a computer program, which, when executed by a processor, implements the training method of any one of claims 1-9 or implements the prediction method of claim 10.
14. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, which, when executed by the processor, implements the training method of any of claims 1-9 or implements the prediction method of claim 10.
CN202211606821.0A 2022-12-14 2022-12-14 Protein structure prediction method, model training method, device, equipment and medium Pending CN116230074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211606821.0A CN116230074A (en) 2022-12-14 2022-12-14 Protein structure prediction method, model training method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116230074A true CN116230074A (en) 2023-06-06

Family

ID=86588145

Country Status (1)

Country Link
CN (1) CN116230074A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935952A (en) * 2023-09-18 2023-10-24 浙江大学杭州国际科创中心 Method and device for training protein prediction model based on graph neural network
CN117095743A (en) * 2023-10-17 2023-11-21 山东鲁润阿胶药业有限公司 Polypeptide spectrum matching data analysis method and system for small molecular peptide donkey-hide gelatin
CN117275582A (en) * 2023-07-07 2023-12-22 上海逐药科技有限公司 Construction of amino acid sequence generation model and method for obtaining protein variant

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147866A (en) * 2018-06-28 2019-01-04 南京理工大学 Residue prediction technique is bound based on sampling and the protein-DNA of integrated study
CN111063393A (en) * 2019-12-26 2020-04-24 青岛科技大学 Prokaryotic acetylation site prediction method based on information fusion and deep learning
CN112233723A (en) * 2020-10-26 2021-01-15 上海天壤智能科技有限公司 Protein structure prediction method and system based on deep learning
CN112289370A (en) * 2020-12-28 2021-01-29 武汉金开瑞生物工程有限公司 Protein structure prediction method and device based on multitask time domain convolutional neural network
US20210104294A1 (en) * 2019-10-02 2021-04-08 The General Hospital Corporation Method for predicting hla-binding peptides using protein structural features
CN114333980A (en) * 2021-08-27 2022-04-12 腾讯科技(深圳)有限公司 Method and device for model training, protein feature extraction and function prediction
CN114974397A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Training method of protein structure prediction model and protein structure prediction method
CN114974398A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN115116559A (en) * 2022-06-21 2022-09-27 北京百度网讯科技有限公司 Method, device, equipment and medium for determining and training atomic coordinates in amino acid
US20220375538A1 (en) * 2021-05-11 2022-11-24 International Business Machines Corporation Embedding-based generative model for protein design

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RATUL CHOWDHURY et al.: "Single-sequence protein structure prediction using language models from deep learning", bioRxiv *
Liu Zinan et al.: "A Review of Protein Structure Prediction", Chinese Journal of Medical Physics, vol. 37, no. 9
Zhang Ansheng; Wang Aiping: "Protein secondary structure prediction based on deep learning", Computer Simulation, no. 01, 15 January 2015 (2015-01-15) *

Similar Documents

Publication Publication Date Title
CN116230074A (en) Protein structure prediction method, model training method, device, equipment and medium
Giuffrida et al. Pheno‐deep counter: A unified and versatile deep learning architecture for leaf counting
Jia et al. Quantum neural network states: A brief review of methods and applications
CN114464247A (en) Method and device for predicting binding affinity based on antigen and antibody sequences
Wang et al. GanDTI: A multi-task neural network for drug-target interaction prediction
CN114581770A (en) TransUnnet-based automatic extraction processing method for remote sensing image building
WO2021106706A1 (en) Amino acid sequence searching device, vaccine, amino acid sequence searching method, and amino acid sequence searching program
CN114913917B (en) Drug target affinity prediction method based on digital twin and distillation BERT
Brigato et al. Image classification with small datasets: Overview and benchmark
CN112652358A (en) Drug recommendation system, computer equipment and storage medium for regulating and controlling disease target based on three-channel deep learning
Sekhar et al. Protein class prediction based on Count Vectorizer and long short term memory
CN115131700A (en) Training method of two-way hierarchical mixed model for weakly supervised audio and video content analysis
CN115101145A (en) Medicine virtual screening method based on adaptive meta-learning
Khishe et al. Variable-length CNNs evolved by digitized chimp optimization algorithm for deep learning applications
Quan et al. CrackViT: a unified CNN-transformer model for pixel-level crack extraction
Du et al. Improving protein domain classification for third-generation sequencing reads using deep learning
Ning et al. A symbolic characters aware model for solving geometry problems
CN113591892A (en) Training data processing method and device
CN116189776A (en) Antibody structure generation method based on deep learning
CN115497564A (en) Antigen identification model establishing method and antigen identification method
KR20220111215A (en) Apparatus and method for predicting drug-target interaction using deep neural network model based on self-attention
Ding et al. RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios
CN115083539A (en) Method, device and equipment for reconstructing molecular structure and readable storage medium
Kavitha et al. Explainable AI for Detecting Fissures on Concrete Surfaces Using Transfer Learning
Song et al. Bio-Inspired Computing Models and Algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination