CN112216345A - Protein solvent accessibility prediction method based on iterative search strategy

Info

Publication number: CN112216345A
Application number: CN202011030157.0A
Authority: CN
Inventors: 胡俊; 樊学强; 董世建; 白岩松; 张贵军
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Guangzhou Zhaoji Biotechnology Co ltd; Shenzhen Xinrui Gene Technology Co ltd
Priority date: 2020-09-27
Filing date: 2020-09-27
Publication date: 2021-01-12
Anticipated expiration: 2040-09-27
Also published as: CN112216345B

Abstract

First, for an input protein sequence whose solvent accessibility is to be determined, the HHBlits tool is used to generate the corresponding multiple sequence alignment, from which a position-specific frequency matrix is generated; the same operation is carried out for each protein sequence in the PDB database. Second, the similarity between the position-specific frequency matrix of the input protein sequence and that of each protein in the PDB database is calculated. Third, the protein sequences most similar to the input protein, together with their structure information, are retrieved from the PDB database and used as template proteins. Fourth, the solvent accessibility information of each template protein is calculated using the DSSP tool. Finally, the solvent accessibility of the input protein sequence is predicted from the solvent accessibility information of the template proteins. The method has low computational cost and high prediction accuracy.

Description

Protein solvent accessibility prediction method based on iterative search strategy

Technical Field

The invention relates to the fields of bioinformatics, pattern recognition and computer application, in particular to a protein solvent accessibility prediction method based on an iterative search strategy.

Background

The biological functions of proteins play an important role in every life activity, and the biological function of a protein is mainly determined by its structure. Predicting the solvent accessibility of a protein is a key step in predicting its structure. Accurate prediction of protein solvent accessibility is therefore of guiding significance for understanding protein function, analyzing the interactions among biomolecules, designing new drugs, and similar tasks.

A survey of the literature shows that many methods for predicting the solvent accessibility of protein residues have been proposed, such as SANN (Joo, K.; Lee, S.J.; Lee, J. Sann: Solvent accessibility prediction of proteins by nearest neighbor method. Proteins: Structure, Function, and Bioinformatics, 2012, 80, 1791) and SPIDER3 (Heffernan, R. et al. (2017) Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics, 33(18), 2842-2849). Although the existing methods can predict protein solvent accessibility, they generally rely on large training data sets and machine learning algorithms, so their computational cost is high. In addition, noise and data imbalance in the training sets have not received enough attention, so the prediction accuracy cannot be guaranteed to be optimal, and the prediction efficiency needs further improvement.

In view of the above, the existing methods for predicting protein solvent accessibility fall well short of practical requirements in computational cost and prediction accuracy, and improvement is urgently needed.

Disclosure of Invention

To overcome the shortcomings of the existing protein solvent accessibility prediction methods in computational cost and prediction accuracy, the invention provides a protein solvent accessibility prediction method based on an iterative search strategy that has low computational cost and high prediction accuracy.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a method for predicting protein solvent accessibility based on an iterative search strategy, the method comprising the steps of:

1) input the protein sequence whose solvent accessibility is to be predicted, recorded as S, where the number of residues is L;

2) for the given protein sequence S, generate the corresponding multiple sequence alignment using the HHBlits tool, recorded as

$$\mathrm{MSA} = \{s_1, \ldots, s_n, \ldots, s_N\}$$

where $s_n$ denotes the n-th aligned sequence in the MSA, N is the total number of aligned sequences in the MSA, each aligned sequence contains L elements, and each element belongs to the element set $R = \{R_1, \ldots, R_r, \ldots, R_{21}\}$, which consists of the twenty common amino acids and one gap element;

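3) for the given multiple sequence alignment MSA, generate the corresponding position-specific frequency matrix, recorded as

$$\mathrm{PSFM} = (p_{l,r})_{L \times 21}, \qquad p_{l,r} = \frac{1}{N} \sum_{n=1}^{N} \delta(s_n^l, R_r)$$

where $s_n^l$ denotes the l-th element of the n-th aligned sequence $s_n$ in the MSA; $\delta(s_n^l, R_r) = 1$ when $s_n^l$ and $R_r$ are the same element type, and $\delta(s_n^l, R_r) = 0$ otherwise;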

4) for any two protein sequences $S^X$ and $S^Y$ with multiple sequence alignments $\mathrm{MSA}^X$ and $\mathrm{MSA}^Y$, calculate their similarity $\mathrm{sim}(S^X, S^Y)$ and obtain their sequence alignment information ali as follows:

4.1) from $\mathrm{MSA}^X$ and $\mathrm{MSA}^Y$, use step 3) to obtain the position-specific frequency matrices $\mathrm{PSFM}^X$ and $\mathrm{PSFM}^Y$ corresponding to $S^X$ and $S^Y$;

4.2) construct a similarity matrix XY from $\mathrm{PSFM}^X$ and $\mathrm{PSFM}^Y$ [formula not reproduced];

4.3) using the Needleman-Wunsch dynamic programming algorithm on the similarity matrix XY, obtain the sequence alignment information ali of $S^X$ and $S^Y$, and calculate the similarity $\mathrm{sim}(S^X, S^Y)$ [formula not reproduced], where, when $\mathrm{ali}(l^X) \neq -1$, $\mathrm{ali}(l^X)$ is the index of the residue in $S^Y$ aligned with the $l^X$-th residue of $S^X$, and [formula not reproduced]; otherwise, $\mathrm{ali}(l^X) = -1$ indicates that the $l^X$-th residue of $S^X$ is aligned with a gap element, and [formula not reproduced];

5) for each protein $S_i$ in the PDB database, generate the corresponding multiple sequence alignment $\mathrm{MSA}_i$ using step 2), forming a multiple sequence alignment set recorded as $\{\mathrm{MSA}_1, \ldots, \mathrm{MSA}_i, \ldots, \mathrm{MSA}_I\}$, where I denotes the total number of protein sequences in the PDB database;

6) from the multiple sequence alignment MSA of the input sequence S and the set generated in step 5), use step 4) to calculate the similarity between MSA and each element of the set; take the protein sequences in the PDB database corresponding to the M elements with the highest similarity, together with their sequence alignment information, to form a new multiple sequence alignment $\mathrm{MSA}^{new}$, which replaces the original MSA of the input sequence S; then execute step 6) again, terminating the iteration when the MSA of the input sequence S has converged;

7) for each PDB database protein contained in the MSA obtained in step 6), calculate the corresponding solvent accessibility information from its three-dimensional structure information using the DSSP tool, forming a solvent accessibility information set recorded as $\{SA_1, \ldots, SA_m, \ldots, SA_M\}$, where $SA_m$ is the solvent accessibility information of the m-th such protein and $SA_m^l$ denotes the solvent accessibility information of its l-th residue;

8) from the solvent accessibility information set obtained in step 7), predict the solvent accessibility information of the input protein sequence S as [formula not reproduced], where the l-th entry is the solvent accessibility information of the l-th residue in S; when $\mathrm{ali}^m(l) \neq -1$, $\mathrm{ali}^m(l)$ is the index of the residue in the m-th sequence of the MSA aligned with the l-th residue of S, and [formula not reproduced]; otherwise, $\mathrm{ali}^m(l) = -1$ indicates that the l-th residue of S is aligned with a gap element, and [formula not reproduced].
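Step 3) builds a position-specific frequency matrix from a multiple sequence alignment. The following is a minimal sketch of that step, assuming each entry is the fraction of aligned sequences that carry a given element at a given position. The normalisation by N, the ordering of the 21-symbol alphabet, the treatment of unknown symbols as gaps, and the name `build_psfm` are all assumptions of this sketch and are not taken from the patent text.

```python
from typing import List

# The 21-symbol element set: twenty common amino acids plus "-" for the gap element.
# The ordering is an arbitrary choice made for this sketch.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def build_psfm(msa: List[str]) -> List[List[float]]:
    """Return an L x 21 matrix of per-position element frequencies."""
    if not msa:
        raise ValueError("MSA must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    column_of = {symbol: r for r, symbol in enumerate(ALPHABET)}
    counts = [[0.0] * len(ALPHABET) for _ in range(length)]
    for seq in msa:
        for l, symbol in enumerate(seq):
            # Assumption: symbols outside the alphabet (e.g. "X") count as gaps.
            counts[l][column_of.get(symbol, column_of["-"])] += 1.0

    n = float(len(msa))
    return [[count / n for count in row] for row in counts]


toy_msa = ["ACD", "AC-", "GCD", "ACD"]
psfm = build_psfm(toy_msa)
```

In the toy alignment, three of the four sequences have `A` at the first position, so the corresponding entry is 0.75, and every row of the matrix sums to 1.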

The technical conception of the invention is as follows: first, for an input protein sequence whose solvent accessibility is to be determined, the HHBlits tool is used to generate the corresponding multiple sequence alignment, from which a position-specific frequency matrix is generated, and the same operation is carried out for each protein sequence in the PDB database; second, the similarity between the position-specific frequency matrix of the input protein sequence and that of each protein in the PDB database is calculated; third, the protein sequences most similar to the input protein, together with their structure information, are retrieved from the PDB database and used as template proteins; fourth, the solvent accessibility information of each template protein is calculated using the DSSP tool; finally, the solvent accessibility of the input protein sequence is predicted from the solvent accessibility information of the template proteins. The invention thus provides a protein solvent accessibility prediction method based on an iterative search strategy that has low computational cost and high prediction accuracy.
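The similarity calculation relies on the Needleman-Wunsch alignment named in step 4.3). The following is a minimal sketch of a global alignment over a precomputed similarity matrix that returns an alignment map in which -1 marks a residue aligned with a gap, as in the text. The linear gap penalty and its default value, the tie-breaking order in the traceback, and the function name `needleman_wunsch` are assumptions of this sketch; the patent text does not state them.

```python
from typing import List, Tuple


def needleman_wunsch(xy: List[List[float]], gap: float = -0.5) -> Tuple[List[int], float]:
    """Globally align X (rows of xy) with Y (columns of xy).

    Returns (ali, score): ali[i] is the index in Y aligned with position i of X,
    or -1 if position i of X is aligned with a gap; score is the optimal total.
    """
    n = len(xy)
    m = len(xy[0]) if n else 0

    # score[i][j]: best score for the first i residues of X and first j of Y.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + xy[i - 1][j - 1],  # residue i with residue j
                score[i - 1][j] + gap,  # residue i of X against a gap
                score[i][j - 1] + gap,  # residue j of Y against a gap
            )

    # Trace back from the bottom-right corner, preferring diagonal moves on ties.
    ali = [-1] * n
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + xy[i - 1][j - 1]:
            ali[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return ali, score[n][m]


# Toy similarity matrix: X has 3 residues, Y has 4.
toy_xy = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
ali, total = needleman_wunsch(toy_xy)
```

On the toy matrix the second residue of Y is left unaligned, so the three residues of X map to positions 0, 2 and 3 of Y at the cost of one gap penalty.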

The beneficial effects of the invention are as follows: on the one hand, multiple sequence alignment information is obtained from the protein sequence, and the iterative search strategy extracts more useful information, laying the groundwork for further improving the prediction accuracy of protein solvent accessibility; on the other hand, similarity and sequence alignment information are calculated from the multiple sequence alignments of the proteins, which improves both the efficiency and the accuracy of protein solvent accessibility prediction.
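The iterative search of step 6) can be sketched as a loop that ranks the database against the current query, keeps the M best entries, rebuilds the query from them, and stops when the kept set no longer changes. This is a hedged illustration only: treating "the MSA has converged" as "the selected set is unchanged between rounds", the `max_rounds` safeguard, and the names `iterative_template_search`, `score` and `rebuild` are assumptions of this sketch. The toy example replaces alignments with plain numbers so that the loop can be run on its own.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

P = TypeVar("P")  # whatever the caller uses to represent an MSA or profile


def iterative_template_search(
    query: P,
    database: Dict[str, P],
    score: Callable[[P, P], float],
    rebuild: Callable[[List[str]], P],
    top_m: int,
    max_rounds: int = 20,
) -> Tuple[List[str], int]:
    """Repeat 'rank, keep top M, rebuild query' until the kept set stops changing.

    Returns the selected entry names (sorted) and the number of ranking rounds.
    """
    previous: frozenset = frozenset()
    for round_no in range(1, max_rounds + 1):
        ranked = sorted(
            database,
            key=lambda name: score(query, database[name]),
            reverse=True,
        )
        selected = ranked[:top_m]
        if frozenset(selected) == previous:
            return sorted(selected), round_no  # converged
        previous = frozenset(selected)
        query = rebuild(selected)  # plays the role of replacing MSA by MSA^new
    return sorted(previous), max_rounds  # stopped without convergence


# Toy stand-in: each "profile" is a number, similarity is negative distance,
# and the rebuilt query is the mean of the selected entries.
values = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0, "e": 10.0}
templates, rounds = iterative_template_search(
    query=6.2,
    database=values,
    score=lambda q, t: -abs(q - t),
    rebuild=lambda names: sum(values[n] for n in names) / len(names),
    top_m=3,
)
```

In a real pipeline, `score` would wrap the similarity of step 4) and `rebuild` would assemble the new alignment from the selected PDB sequences and their alignment information.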

Drawings

FIG. 1 is a schematic diagram of a protein solvent accessibility prediction method based on an iterative search strategy.

FIG. 2 shows the result file of the solvent accessibility prediction for protein 1ibaA obtained with the protein solvent accessibility prediction method based on an iterative search strategy.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, the protein solvent accessibility prediction method based on an iterative search strategy comprises steps 1) to 8) as set out in the Disclosure of Invention above.

In this embodiment, the solvent accessibility prediction of protein 1ibaA is taken as an example, and the protein solvent accessibility prediction method based on an iterative search strategy comprises the following steps:

1) input the protein sequence whose solvent accessibility is to be predicted, recorded as S, where the number of residues is L = 78;

steps 2) to 8) are then carried out on S as set out in the Disclosure of Invention above.

Taking the prediction of the solvent accessibility of protein 1ibaA as an example, the solvent accessibility result file obtained for protein 1ibaA by the above method is shown in FIG. 2.

The above describes the result of predicting the solvent accessibility of protein 1ibaA according to the invention and is not intended to limit the scope of the invention. Various modifications and improvements may be made without departing from the basic content of the invention, and they are not excluded from its scope of protection.

Claims

1. A method for predicting the solvent accessibility of a protein based on an iterative search strategy, the method comprising the steps of:

wherein $s_n$ denotes the n-th aligned sequence in the MSA, N is the total number of aligned sequences in the MSA, each aligned sequence contains L elements, and each element belongs to the element set $R = \{R_1, \ldots, R_r, \ldots, R_{21}\}$, which consists of the twenty common amino acids and one gap element;