Background
Protein function plays an important role in every life activity, and that function is determined mainly by the protein's structure. Predicting the solvent accessibility of a protein is a key step in predicting its structure. A method that predicts protein solvent accessibility accurately is therefore of considerable value for understanding protein function, analyzing the relationships among biomolecules, designing new drugs, and similar applications.
A review of the literature shows that many methods for predicting the solvent accessibility of protein residues have been proposed, such as SANN (Joo, K.; Lee, S.J.; Lee, J. Sann: solvent accessibility prediction of proteins by nearest neighbor method. Proteins: Structure, Function, and Bioinformatics, 2012, 80, 1791-1797) and SPIDER3 (Heffernan R et al. (2017) Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics, 33(18): 2842-2849). Although existing methods can predict protein solvent accessibility, they generally rely on large training data sets and machine learning algorithms, so their computational cost is high. In addition, noise and data imbalance in the training sets have not received enough attention, so the prediction accuracy cannot be guaranteed to be optimal and the prediction efficiency needs further improvement.
In view of the above, existing methods for predicting protein solvent accessibility fall well short of practical requirements in both computational cost and prediction accuracy, and improvement is urgently needed.
Disclosure of Invention
To overcome the shortcomings of existing protein solvent accessibility prediction methods in computational cost and prediction accuracy, the invention provides a protein solvent accessibility prediction method based on an iterative search strategy that has low computational cost and high prediction accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for predicting protein solvent accessibility based on an iterative search strategy, the method comprising the steps of:
1) inputting protein sequence information to be subjected to solvent accessibility prediction, wherein the number of protein residues is L, and recording the information as S;
2) for the given protein sequence S, generating the corresponding multiple sequence alignment using the HHblits tool and recording it as MSA, wherein the MSA contains N aligned sequences indexed by n = 1, …, N, each aligned sequence contains L elements, and each element belongs to the element set R = {R_1, …, R_r, …, R_21}, the set R consisting of the twenty common amino acids and one gap element;
3) for the given multiple sequence alignment MSA, generating the corresponding position-specific frequency matrix, wherein the entry for position l and element type R_r is obtained by comparing the l-th element of each aligned sequence in the MSA with R_r: an aligned sequence counts toward the entry when its l-th element is of the same element type as R_r, and does not count otherwise;
4) for any two protein sequences S_X and S_Y with given multiple sequence alignments MSA_X and MSA_Y, calculating the similarity sim(S_X, S_Y) between them and obtaining their sequence alignment ali, as follows:
4.1) from MSA_X and MSA_Y, obtaining the position-specific frequency matrices corresponding to S_X and S_Y using step 3);
4.2) constructing a similarity matrix XY from the two position-specific frequency matrices;
4.3) from the similarity matrix XY, obtaining the sequence alignment ali of S_X and S_Y using the Needleman-Wunsch dynamic programming algorithm, and calculating the similarity sim(S_X, S_Y) of S_X and S_Y, wherein, when ali(l_X) ≠ -1, ali(l_X) is the index of the residue in S_Y that is aligned with the l_X-th residue of S_X; otherwise, ali(l_X) = -1 indicates that the l_X-th residue of S_X is aligned with a gap element;
5) for each protein in the PDB database, generating the corresponding multiple sequence alignment using step 2), so as to form a set of multiple sequence alignments, wherein I denotes the total number of protein sequences in the PDB database;
6) from the multiple sequence alignment MSA of the input sequence S and the set generated in step 5), calculating the similarity between MSA and each element of the set using step 4), and taking the protein sequences in the PDB database that correspond to the M elements with the highest similarity, together with their sequence alignments, to form a new multiple sequence alignment MSA_new, which replaces the original MSA of the input sequence S; step 6) is then executed again, and the iteration terminates when the MSA of the input sequence S has converged;
7) for each PDB database protein contained in the MSA obtained in step 6), calculating the corresponding solvent accessibility information from its three-dimensional structure using the DSSP tool, so as to form a set of solvent accessibility information that gives, for each such protein, the solvent accessibility of its l-th residue;
8) from the solvent accessibility information obtained in step 7), predicting the solvent accessibility information of the input protein sequence S residue by residue, wherein, for the l-th residue of S, when ali_m(l) ≠ -1, ali_m(l) is the index of the residue in the m-th sequence of the MSA that is aligned with the l-th residue of S; otherwise, ali_m(l) = -1 indicates that the l-th residue of S is aligned with a gap element.
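As an illustration of step 3), the following minimal Python sketch builds a position-specific frequency matrix from a toy alignment. The function name `position_specific_frequency_matrix`, the ordering of the element set in `ALPHABET`, the normalization by N, and the treatment of unknown symbols as gaps are all assumptions made for illustration; the text above does not specify them.

```python
# Element set R: twenty common amino acids plus one gap element ("-").
# The ordering of the elements is an assumption made for this sketch.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def position_specific_frequency_matrix(msa):
    """Return an L x 21 matrix of per-position element frequencies.

    `msa` is a list of N aligned sequences of equal length L.
    Assumption: counts are divided by N so that each row sums to 1.
    """
    if not msa:
        raise ValueError("MSA must contain at least one aligned sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    column_of = {element: r for r, element in enumerate(ALPHABET)}
    counts = [[0.0] * len(ALPHABET) for _ in range(length)]
    for seq in msa:
        for l, element in enumerate(seq):
            # Assumption: symbols outside the element set count as gaps.
            counts[l][column_of.get(element, column_of["-"])] += 1.0

    n = len(msa)
    return [[count / n for count in row] for row in counts]


psfm = position_specific_frequency_matrix(["AC-", "ACD", "GCD", "A-D"])
print(psfm[0][ALPHABET.index("A")])  # 0.75
```

In the toy alignment, three of the four aligned sequences carry "A" at the first position, which gives the frequency 0.75.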
The technical conception of the invention is as follows: first, from the input protein sequence whose solvent accessibility is to be determined, the corresponding multiple sequence alignment is generated using the HHblits tool and the corresponding position-specific frequency matrix is derived from it, and the same operation is carried out for each protein sequence in the PDB database; second, the similarity between the position-specific frequency matrix of the input protein sequence and that of each protein in the PDB database is calculated; third, the protein sequences most similar to the input protein are retrieved from the PDB database together with their structure information and used as template proteins; fourth, the solvent accessibility information of each template protein is calculated using the DSSP tool; finally, the solvent accessibility of the input protein sequence is predicted from the solvent accessibility information of the template proteins. The invention thus provides a protein solvent accessibility prediction method based on an iterative search strategy with low computational cost and high prediction accuracy.
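The similarity calculation between two frequency matrices can be sketched with a small Needleman-Wunsch implementation that returns an alignment array in which -1 marks a residue aligned with a gap. The dot-product scoring in `similarity_matrix`, the linear gap penalty of -0.5, and the definition of `sim` as the mean matched score over the residues of the first sequence are assumptions chosen for illustration, not formulas taken from the text.

```python
def similarity_matrix(psfm_x, psfm_y):
    """Assumed scoring: dot product of the two frequency rows."""
    return [[sum(a * b for a, b in zip(row_x, row_y)) for row_y in psfm_y]
            for row_x in psfm_x]


def needleman_wunsch(xy, gap=-0.5):
    """Globally align two sequences from their similarity matrix `xy`.

    Returns (ali, sim). ali[i] is the index in the second sequence that is
    aligned with residue i of the first sequence, or -1 for a gap.
    """
    lx, ly = len(xy), len(xy[0])
    score = [[0.0] * (ly + 1) for _ in range(lx + 1)]
    for i in range(1, lx + 1):
        score[i][0] = i * gap
    for j in range(1, ly + 1):
        score[0][j] = j * gap
    for i in range(1, lx + 1):
        for j in range(1, ly + 1):
            score[i][j] = max(score[i - 1][j - 1] + xy[i - 1][j - 1],
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover the alignment.
    ali = [-1] * lx
    i, j = lx, ly
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + xy[i - 1][j - 1]:
            ali[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1

    # Assumed similarity: mean matched score over the first sequence.
    sim = sum(xy[k][ali[k]] for k in range(lx) if ali[k] != -1) / lx
    return ali, sim


# Toy similarity matrix: 3 residues in the first sequence, 2 in the second.
ali, sim = needleman_wunsch([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
print(ali)  # [0, -1, 1]
```

In the toy case the middle residue of the first sequence has no good partner, so it is aligned with a gap and reported as -1.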
The beneficial effects of the invention are as follows: on the one hand, multiple sequence alignment information is obtained from the protein sequence and enriched through an iterative search strategy, which lays the groundwork for further improving the accuracy of protein solvent accessibility prediction; on the other hand, similarity and sequence alignment information are calculated from the multiple sequence alignments of the proteins, which improves both the efficiency and the accuracy of protein solvent accessibility prediction.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 and 2, a protein solvent accessibility prediction method based on an iterative search strategy comprises the following steps:
1) inputting protein sequence information to be subjected to solvent accessibility prediction, wherein the number of protein residues is L, and recording the information as S;
2) for the given protein sequence S, generating the corresponding multiple sequence alignment using the HHblits tool and recording it as MSA, wherein the MSA contains N aligned sequences indexed by n = 1, …, N, each aligned sequence contains L elements, and each element belongs to the element set R = {R_1, …, R_r, …, R_21}, the set R consisting of the twenty common amino acids and one gap element;
3) for the given multiple sequence alignment MSA, generating the corresponding position-specific frequency matrix, wherein the entry for position l and element type R_r is obtained by comparing the l-th element of each aligned sequence in the MSA with R_r: an aligned sequence counts toward the entry when its l-th element is of the same element type as R_r, and does not count otherwise;
4) for any two protein sequences S_X and S_Y with given multiple sequence alignments MSA_X and MSA_Y, calculating the similarity sim(S_X, S_Y) between them and obtaining their sequence alignment ali, as follows:
4.1) from MSA_X and MSA_Y, obtaining the position-specific frequency matrices corresponding to S_X and S_Y using step 3);
4.2) constructing a similarity matrix XY from the two position-specific frequency matrices;
4.3) from the similarity matrix XY, obtaining the sequence alignment ali of S_X and S_Y using the Needleman-Wunsch dynamic programming algorithm, and calculating the similarity sim(S_X, S_Y) of S_X and S_Y, wherein, when ali(l_X) ≠ -1, ali(l_X) is the index of the residue in S_Y that is aligned with the l_X-th residue of S_X; otherwise, ali(l_X) = -1 indicates that the l_X-th residue of S_X is aligned with a gap element;
5) for each protein in the PDB database, generating the corresponding multiple sequence alignment using step 2), so as to form a set of multiple sequence alignments, wherein I denotes the total number of protein sequences in the PDB database;
6) from the multiple sequence alignment MSA of the input sequence S and the set generated in step 5), calculating the similarity between MSA and each element of the set using step 4), and taking the protein sequences in the PDB database that correspond to the M elements with the highest similarity, together with their sequence alignments, to form a new multiple sequence alignment MSA_new, which replaces the original MSA of the input sequence S; step 6) is then executed again, and the iteration terminates when the MSA of the input sequence S has converged;
7) for each PDB database protein contained in the MSA obtained in step 6), calculating the corresponding solvent accessibility information from its three-dimensional structure using the DSSP tool, so as to form a set of solvent accessibility information that gives, for each such protein, the solvent accessibility of its l-th residue;
8) from the solvent accessibility information obtained in step 7), predicting the solvent accessibility information of the input protein sequence S residue by residue, wherein, for the l-th residue of S, when ali_m(l) ≠ -1, ali_m(l) is the index of the residue in the m-th sequence of the MSA that is aligned with the l-th residue of S; otherwise, ali_m(l) = -1 indicates that the l-th residue of S is aligned with a gap element.
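The iterative search of step 6) can be sketched as a loop that rescans the library until the selected templates stop changing. Everything named here is hypothetical: `iterative_template_search`, the `similarity` and `build_msa` callables, the `max_iter` safeguard, and in particular the convergence test (the same M templates selected in two consecutive rounds), which the text does not define.

```python
def iterative_template_search(query_msa, library, similarity, build_msa,
                              top_m=5, max_iter=10):
    """Repeat the top-M library search until the hit list stops changing.

    library    : dict mapping a protein id to its precomputed MSA
    similarity : callable (msa_a, msa_b) -> (sim, ali), as in step 4)
    build_msa  : callable (hits) -> new MSA, hits = [(protein id, ali), ...]
    """
    msa = query_msa
    hits = []
    previous_ids = None
    for _ in range(max_iter):
        scored = []
        for protein_id, library_msa in library.items():
            sim, ali = similarity(msa, library_msa)
            scored.append((sim, protein_id, ali))
        # Highest similarity first; ties broken by id for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        hits = [(protein_id, ali) for _, protein_id, ali in scored[:top_m]]
        ids = [protein_id for protein_id, _ in hits]
        if ids == previous_ids:
            break  # assumed convergence: same templates as last round
        msa = build_msa(hits)  # MSA_new replaces the previous MSA
        previous_ids = ids
    return msa, hits


# Toy stand-ins: an "MSA" is a set of labels and similarity is Jaccard overlap.
toy_library = {"p1": {"a", "b"}, "p2": {"b", "c"}, "p3": {"x", "y"}}
toy_similarity = lambda a, b: (len(a & b) / len(a | b), None)
toy_build = lambda hits: set().union(*(toy_library[pid] for pid, _ in hits))

final_msa, final_hits = iterative_template_search(
    {"a"}, toy_library, toy_similarity, toy_build, top_m=2)
print([pid for pid, _ in final_hits])  # ['p1', 'p2']
```

In the toy run the first round selects p1 and p2, the rebuilt alignment selects the same two proteins again, and the loop stops.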
In this embodiment, taking the solvent accessibility prediction of protein 1ibaA as an example, a protein solvent accessibility prediction method based on an iterative search strategy comprises the following steps:
1) inputting protein sequence information to be subjected to solvent accessibility prediction, wherein the number of protein residues is L, and the protein sequence information is marked as S, wherein L is 78;
2) for the given protein sequence S, generating the corresponding multiple sequence alignment using the HHblits tool and recording it as MSA, wherein the MSA contains N aligned sequences indexed by n = 1, …, N, each aligned sequence contains L elements, and each element belongs to the element set R = {R_1, …, R_r, …, R_21}, the set R consisting of the twenty common amino acids and one gap element;
3) for the given multiple sequence alignment MSA, generating the corresponding position-specific frequency matrix, wherein the entry for position l and element type R_r is obtained by comparing the l-th element of each aligned sequence in the MSA with R_r: an aligned sequence counts toward the entry when its l-th element is of the same element type as R_r, and does not count otherwise;
4) for any two protein sequences S_X and S_Y with given multiple sequence alignments MSA_X and MSA_Y, calculating the similarity sim(S_X, S_Y) between them and obtaining their sequence alignment ali, as follows:
4.1) from MSA_X and MSA_Y, obtaining the position-specific frequency matrices corresponding to S_X and S_Y using step 3);
4.2) constructing a similarity matrix XY from the two position-specific frequency matrices;
4.3) from the similarity matrix XY, obtaining the sequence alignment ali of S_X and S_Y using the Needleman-Wunsch dynamic programming algorithm, and calculating the similarity sim(S_X, S_Y) of S_X and S_Y, wherein, when ali(l_X) ≠ -1, ali(l_X) is the index of the residue in S_Y that is aligned with the l_X-th residue of S_X; otherwise, ali(l_X) = -1 indicates that the l_X-th residue of S_X is aligned with a gap element;
5) for each protein in the PDB database, generating the corresponding multiple sequence alignment using step 2), so as to form a set of multiple sequence alignments, wherein I denotes the total number of protein sequences in the PDB database;
6) from the multiple sequence alignment MSA of the input sequence S and the set generated in step 5), calculating the similarity between MSA and each element of the set using step 4), and taking the protein sequences in the PDB database that correspond to the M elements with the highest similarity, together with their sequence alignments, to form a new multiple sequence alignment MSA_new, which replaces the original MSA of the input sequence S; step 6) is then executed again, and the iteration terminates when the MSA of the input sequence S has converged;
7) for each PDB database protein contained in the MSA obtained in step 6), calculating the corresponding solvent accessibility information from its three-dimensional structure using the DSSP tool, so as to form a set of solvent accessibility information that gives, for each such protein, the solvent accessibility of its l-th residue;
8) from the solvent accessibility information obtained in step 7), predicting the solvent accessibility information of the input protein sequence S residue by residue, wherein, for the l-th residue of S, when ali_m(l) ≠ -1, ali_m(l) is the index of the residue in the m-th sequence of the MSA that is aligned with the l-th residue of S; otherwise, ali_m(l) = -1 indicates that the l-th residue of S is aligned with a gap element.
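One way to read step 8) is as a per-residue transfer of template solvent accessibility through the alignments ali_m. The sketch below assumes an unweighted average over the templates whose aligned position is not a gap, and returns `None` where no template covers a residue; the function name `predict_solvent_accessibility`, the averaging rule, and the `None` convention are assumptions, since the text does not state the combining formula.

```python
def predict_solvent_accessibility(alignments, template_sa):
    """Transfer template solvent accessibility to the input sequence.

    alignments  : list of M lists; alignments[m][l] is ali_m(l), the index
                  in template m aligned with residue l of S, or -1 for a gap
    template_sa : list of M lists; template_sa[m][k] is the solvent
                  accessibility of residue k of template m (e.g. from DSSP)
    """
    if not alignments:
        return []
    length = len(alignments[0])
    predicted = []
    for l in range(length):
        # Collect the values of all templates that cover residue l.
        values = [sa[ali[l]]
                  for ali, sa in zip(alignments, template_sa)
                  if ali[l] != -1]
        # Assumption: plain average; None when no template covers residue l.
        predicted.append(sum(values) / len(values) if values else None)
    return predicted


toy_alignments = [[0, 1, -1], [1, -1, -1]]
toy_template_sa = [[10.0, 20.0], [30.0, 50.0]]
print(predict_solvent_accessibility(toy_alignments, toy_template_sa))
# [30.0, 20.0, None]
```

The first residue is covered by both toy templates (values 10.0 and 50.0), the second by one template only, and the third by none.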
Taking the solvent accessibility prediction of protein 1ibaA as an example, the solvent accessibility result file obtained for protein 1ibaA by the above method is shown in FIG. 2.
The above describes the solvent accessibility prediction result obtained by the invention for protein 1ibaA and is not intended to limit the scope of the invention. Various modifications and improvements may be made without departing from the basic content of the invention, and such modifications and improvements are not excluded from its scope of protection.