Background
The biological function of a protein plays an important role in every life activity, and that function is mainly determined by the protein's structure. Predicting solvent accessibility is a key step in protein structure prediction. An accurate method for predicting protein solvent accessibility is therefore of great value for understanding protein function, analyzing interactions among biomolecules, designing new drugs, and related tasks.
A survey of the literature shows that many methods have been proposed for predicting the solvent accessibility of protein residues, such as SANN (Joo, K.; Lee, S.J.; Lee, J. Sann: Solvent accessibility prediction of proteins by nearest neighbor method. Proteins: Structure, Function, and Bioinformatics 2012, 80, 1791) and SPIDER3 (Heffernan R et al. (2017) Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics, 33(18): 2842-2849). Although existing methods can predict protein solvent accessibility, they generally rely on large training data sets and machine learning algorithms, so their computational cost is high. They also pay insufficient attention to noisy information and data imbalance in the training sets, so prediction accuracy cannot be guaranteed to be optimal, and prediction efficiency needs further improvement.
In view of the above, existing methods for predicting protein solvent accessibility fall well short of practical requirements in both computational cost and prediction accuracy, and improvement is urgently needed.
Disclosure of Invention
To overcome the high computational cost and low prediction accuracy of existing protein solvent accessibility prediction methods, the invention provides a protein solvent accessibility prediction method based on multi-view learning, which has low computational cost and high prediction accuracy.
The technical solution adopted by the invention to solve this technical problem is as follows:
A protein solvent accessibility prediction method based on multi-view learning, the method comprising the following steps:
1) inputting the sequence information of a protein with L residues whose solvent accessibility is to be predicted, denoted S;
2) letting S_X denote any given protein sequence, with L_X residues;
3) for the protein sequence S_X, generating the corresponding multiple sequence alignment using the HHblits tool, denoted $MSA = \{m_1, \dots, m_n, \dots, m_N\}$, where $m_n$ represents the n-th aligned sequence in the MSA, N is the total number of aligned sequences in the MSA, each aligned sequence contains L_X elements, and each element belongs to the set $R = \{R_1, \dots, R_r, \dots, R_{21}\}$, which consists of the twenty common amino acids and one gap element;
4) for the given MSA, generating the corresponding position-specific frequency matrix, denoted $PSFM = (f_{l,r})_{L_X \times 21}$, where $f_{l,r} = \frac{1}{N}\sum_{n=1}^{N} \delta(m_n^l, R_r)$, $m_n^l$ represents the l-th element of $m_n$, and $\delta(m_n^l, R_r) = 1$ when $m_n^l$ and $R_r$ are the same element type, otherwise $\delta(m_n^l, R_r) = 0$;
5) for the given protein sequence S_X, generating the corresponding position-specific scoring matrix using the PSI-BLAST tool, denoted PSSM;
6) for the given protein sequence S_X, generating the corresponding secondary structure information using the PSIPRED tool, denoted PSS;
7) collecting all proteins with annotated tertiary structure information from the PDB library, then generating the corresponding solvent accessibility labels from their tertiary structure information using the DSSP tool, denoted $Dataset = \{(S_i, Y_i)\}_{i=1}^{N}$, where S_i denotes the i-th protein in Dataset, Y_i denotes the label information corresponding to S_i, i = 1, 2, …, N, and N is the total number of protein sequences in Dataset;
8) building a deep multi-view feature learning neural network framework consisting of four pipelines, denoted I, II, III and IV;
9) pipelines I and II each consist of a two-layer bidirectional long short-term memory recurrent neural network (BiLSTM), three linear layers (FC) and a two-layer attention mechanism module (SENet); they extract the evolutionary information in the position-specific frequency matrix and the position-specific scoring matrix respectively, and their outputs are denoted ① and ② respectively;
10) pipeline III consists of a two-layer BiLSTM, three linear layers (FC) and a two-layer attention mechanism module (SENet) and extracts the secondary structure information; its output is denoted ③;
11) pipeline IV consists of three linear layers (FC) and a two-layer attention mechanism module (SENet); its output is denoted ④;
12) according to steps 3) to 6), generating the feature information of every S_i in Dataset, denoted $PSFM_i$, $PSSM_i$ and $PSS_i$ respectively, where i = 1, 2, …, N and N is the total number of protein sequences; these features and the corresponding labels Y_i compose the sample set $\mathcal{S} = \{(PSFM_i, PSSM_i, PSS_i, Y_i)\}_{i=1}^{N}$;
13) using the deep multi-view feature learning neural network framework built in steps 8) to 11) to learn a prediction model on the sample set $\mathcal{S}$, denoted DMVFL;
14) during the training of DMVFL, using the mean square error function to compute the losses between the label and each of the outputs ①, ②, ③ and ④ of steps 9) to 11), recorded as $Loss = \sum_{t=1}^{T} \mathrm{MSE}(y, y_t)$, where T = 4, y is the label, and $y_t$ is the predicted solvent accessibility given by output t;
15) generating the corresponding feature information of the protein S to be predicted through steps 3) to 6), and inputting the feature information into the trained model DMVFL to obtain the solvent accessibility information of the protein S.
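The frequency matrix of steps 3) and 4) can be illustrated with a minimal sketch. It assumes the alignment is already available as a list of equal-length strings (running HHblits is not shown), and the function name, the ordering of the 21 symbols and the rule that unknown symbols count as gaps are all illustrative assumptions rather than part of the method as described.

```python
# Minimal sketch of steps 3)-4): position-specific frequency matrix from an MSA.
# ALPHABET order and all names here are illustrative assumptions.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # twenty amino acids plus one gap symbol


def position_specific_frequency_matrix(msa):
    """Return an L_X x 21 matrix of per-position element frequencies."""
    if not msa:
        raise ValueError("MSA must contain at least one aligned sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    index = {symbol: r for r, symbol in enumerate(ALPHABET)}
    counts = [[0] * len(ALPHABET) for _ in range(length)]
    for seq in msa:
        for pos, symbol in enumerate(seq):
            # Assumption: any symbol outside the twenty amino acids is a gap.
            counts[pos][index.get(symbol, index["-"])] += 1

    n = len(msa)  # total number of aligned sequences
    return [[c / n for c in row] for row in counts]


toy_msa = ["ACD-", "ACE-", "GCDK", "ACDK"]  # toy alignment, not real data
psfm = position_specific_frequency_matrix(toy_msa)
print(psfm[0][ALPHABET.index("A")])  # 0.75: three of four sequences have A
```

Each row of the result sums to 1, so every position is described by a frequency distribution over the 21 element types.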
The technical conception of the invention is as follows. In this protein solvent accessibility prediction method based on multi-view learning, the HHblits tool is first used to generate a multiple sequence alignment from the input sequence of the protein whose solvent accessibility is to be determined, and a position-specific frequency matrix is generated from that alignment. Second, from the same input sequence, the PSI-BLAST tool generates a position-specific scoring matrix and the PSIPRED tool generates secondary structure information. Third, a multi-view learning neural network framework is built; all proteins with annotated tertiary structure information are collected from the PDB library, their labels are calculated from the tertiary structure information using the DSSP tool, their feature information is generated and combined with the corresponding labels to form a data set, and a prediction model is learned on this data set using the framework. Finally, the feature information of the protein to be predicted is input into the model to obtain the solvent accessibility prediction. The invention thus provides a protein solvent accessibility prediction method based on multi-view learning with low computational cost and high prediction accuracy.
The beneficial effects of the invention are as follows: on the one hand, multiple sequence alignment information is obtained from the protein sequence, and a multi-view learning strategy extracts more useful information from it, laying the groundwork for further improving the accuracy of protein solvent accessibility prediction; on the other hand, more effective information is mined from several kinds of information derived from the protein sequence, improving both the efficiency and the accuracy of the prediction.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to Fig. 1, a protein solvent accessibility prediction method based on multi-view learning includes the following steps:
1) inputting the sequence information of a protein with L residues whose solvent accessibility is to be predicted, denoted S;
2) letting S_X denote any given protein sequence, with L_X residues;
3) for the protein sequence S_X, generating the corresponding multiple sequence alignment using the HHblits tool, denoted $MSA = \{m_1, \dots, m_n, \dots, m_N\}$, where $m_n$ represents the n-th aligned sequence in the MSA, N is the total number of aligned sequences in the MSA, each aligned sequence contains L_X elements, and each element belongs to the set $R = \{R_1, \dots, R_r, \dots, R_{21}\}$, which consists of the twenty common amino acids and one gap element;
4) for the given MSA, generating the corresponding position-specific frequency matrix, denoted $PSFM = (f_{l,r})_{L_X \times 21}$, where $f_{l,r} = \frac{1}{N}\sum_{n=1}^{N} \delta(m_n^l, R_r)$, $m_n^l$ represents the l-th element of $m_n$, and $\delta(m_n^l, R_r) = 1$ when $m_n^l$ and $R_r$ are the same element type, otherwise $\delta(m_n^l, R_r) = 0$;
5) for the given protein sequence S_X, generating the corresponding position-specific scoring matrix using the PSI-BLAST tool, denoted PSSM;
6) for the given protein sequence S_X, generating the corresponding secondary structure information using the PSIPRED tool, denoted PSS;
7) collecting all proteins with annotated tertiary structure information from the PDB library, then generating the corresponding solvent accessibility labels from their tertiary structure information using the DSSP tool, denoted $Dataset = \{(S_i, Y_i)\}_{i=1}^{N}$, where S_i denotes the i-th protein in Dataset, Y_i denotes the label information corresponding to S_i, i = 1, 2, …, N, and N is the total number of protein sequences in Dataset;
8) building a deep multi-view feature learning neural network framework consisting of four pipelines, denoted I, II, III and IV;
9) pipelines I and II each consist of a two-layer bidirectional long short-term memory recurrent neural network (BiLSTM), three linear layers (FC) and a two-layer attention mechanism module (SENet); they extract the evolutionary information in the position-specific frequency matrix and the position-specific scoring matrix respectively, and their outputs are denoted ① and ② respectively;
10) pipeline III consists of a two-layer BiLSTM, three linear layers (FC) and a two-layer attention mechanism module (SENet) and extracts the secondary structure information; its output is denoted ③;
11) pipeline IV consists of three linear layers (FC) and a two-layer attention mechanism module (SENet); its output is denoted ④;
12) according to steps 3) to 6), generating the feature information of every S_i in Dataset, denoted $PSFM_i$, $PSSM_i$ and $PSS_i$ respectively, where i = 1, 2, …, N and N is the total number of protein sequences; these features and the corresponding labels Y_i compose the sample set $\mathcal{S} = \{(PSFM_i, PSSM_i, PSS_i, Y_i)\}_{i=1}^{N}$;
13) using the deep multi-view feature learning neural network framework built in steps 8) to 11) to learn a prediction model on the sample set $\mathcal{S}$, denoted DMVFL;
14) during the training of DMVFL, using the mean square error function to compute the losses between the label and each of the outputs ①, ②, ③ and ④ of steps 9) to 11), recorded as $Loss = \sum_{t=1}^{T} \mathrm{MSE}(y, y_t)$, where T = 4, y is the label, and $y_t$ is the predicted solvent accessibility given by output t;
15) generating the corresponding feature information of the protein S to be predicted through steps 3) to 6), and inputting the feature information into the trained model DMVFL to obtain the solvent accessibility information of the protein S.
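The training loss of step 14) might be sketched as follows. The text states only that a mean square error is computed between the label and each of the T = 4 pipeline outputs; combining the four terms by plain summation with equal weights is an assumption made here for illustration, as are the function names and the toy values.

```python
# Minimal sketch of step 14): MSE between the label and each pipeline output.
# Summing the four terms with equal weight is an assumption, not stated in the text.

def mse(predicted, target):
    """Mean squared error between two equal-length lists of per-residue values."""
    if len(predicted) != len(target):
        raise ValueError("prediction and label must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multi_view_loss(outputs, label):
    """Combine the per-pipeline MSE terms over all T outputs (here by summation)."""
    return sum(mse(y_t, label) for y_t in outputs)


label = [0.0, 0.5, 1.0]          # toy per-residue solvent accessibility labels
outputs = [
    [0.0, 0.5, 1.0],             # output of pipeline I (perfect prediction)
    [0.5, 0.5, 1.0],             # output of pipeline II
    [0.0, 0.0, 1.0],             # output of pipeline III
    [0.0, 0.5, 0.0],             # output of pipeline IV
]
loss = multi_view_loss(outputs, label)
print(round(loss, 6))  # 0.5
```

Because every pipeline is penalised against the same label, each view is pushed to be predictive on its own rather than relying on the others.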
In this embodiment, taking the solvent accessibility prediction of protein 1ibaA as an example, a protein solvent accessibility prediction method based on multi-view learning comprises the following steps:
1) inputting the sequence information of a protein with 76 residues whose solvent accessibility is to be predicted, denoted S;
2) letting S_X denote any given protein sequence, with L_X residues;
3) for the protein sequence S_X, generating the corresponding multiple sequence alignment using the HHblits tool, denoted $MSA = \{m_1, \dots, m_n, \dots, m_N\}$, where $m_n$ represents the n-th aligned sequence in the MSA, N is the total number of aligned sequences in the MSA, each aligned sequence contains L_X elements, and each element belongs to the set $R = \{R_1, \dots, R_r, \dots, R_{21}\}$, which consists of the twenty common amino acids and one gap element;
4) for the given MSA, generating the corresponding position-specific frequency matrix, denoted $PSFM = (f_{l,r})_{L_X \times 21}$, where $f_{l,r} = \frac{1}{N}\sum_{n=1}^{N} \delta(m_n^l, R_r)$, $m_n^l$ represents the l-th element of $m_n$, and $\delta(m_n^l, R_r) = 1$ when $m_n^l$ and $R_r$ are the same element type, otherwise $\delta(m_n^l, R_r) = 0$;
5) for the given protein sequence S_X, generating the corresponding position-specific scoring matrix using the PSI-BLAST tool, denoted PSSM;
6) for the given protein sequence S_X, generating the corresponding secondary structure information using the PSIPRED tool, denoted PSS;
7) collecting all proteins with annotated tertiary structure information from the PDB library, then generating the corresponding solvent accessibility labels from their tertiary structure information using the DSSP tool, denoted $Dataset = \{(S_i, Y_i)\}_{i=1}^{N}$, where S_i denotes the i-th protein in Dataset, Y_i denotes the label information corresponding to S_i, i = 1, 2, …, N, and N is the total number of protein sequences in Dataset;
8) building a deep multi-view feature learning neural network framework consisting of four pipelines, denoted I, II, III and IV;
9) pipelines I and II each consist of a two-layer bidirectional long short-term memory recurrent neural network (BiLSTM), three linear layers (FC) and a two-layer attention mechanism module (SENet); they extract the evolutionary information in the position-specific frequency matrix and the position-specific scoring matrix respectively, and their outputs are denoted ① and ② respectively;
10) pipeline III consists of a two-layer BiLSTM, three linear layers (FC) and a two-layer attention mechanism module (SENet) and extracts the secondary structure information; its output is denoted ③;
11) pipeline IV consists of three linear layers (FC) and a two-layer attention mechanism module (SENet); its output is denoted ④;
12) according to steps 3) to 6), generating the feature information of every S_i in Dataset, denoted $PSFM_i$, $PSSM_i$ and $PSS_i$ respectively, where i = 1, 2, …, N and N is the total number of protein sequences; these features and the corresponding labels Y_i compose the sample set $\mathcal{S} = \{(PSFM_i, PSSM_i, PSS_i, Y_i)\}_{i=1}^{N}$;
13) using the deep multi-view feature learning neural network framework built in steps 8) to 11) to learn a prediction model on the sample set $\mathcal{S}$, denoted DMVFL;
14) during the training of DMVFL, using the mean square error function to compute the losses between the label and each of the outputs ①, ②, ③ and ④ of steps 9) to 11), recorded as $Loss = \sum_{t=1}^{T} \mathrm{MSE}(y, y_t)$, where T = 4, y is the label, and $y_t$ is the predicted solvent accessibility given by output t;
15) generating the corresponding feature information of the protein S to be predicted through steps 3) to 6), and inputting the feature information into the trained model DMVFL to obtain the solvent accessibility information of the protein S.
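The attention mechanism module named in steps 9) to 11) can be illustrated with a squeeze-and-excitation style gate. The text does not specify how SENet is applied to sequence features, so the sketch below assumes the common channel-wise form: average each feature channel over all residue positions, pass the result through two small linear layers (ReLU, then sigmoid), and rescale each channel by its gate. The function name, the absence of bias terms and the hand-set weights are illustrative assumptions only.

```python
import math

# Minimal sketch of a squeeze-and-excitation style channel gate.
# Assumed form; the source does not specify how SENet is wired into each pipeline.

def squeeze_excite(features, w1, w2):
    """features: L x C per-residue features; w1: C x H weights; w2: H x C weights."""
    length = len(features)
    channels = len(features[0])
    hidden_size = len(w1[0])

    # Squeeze: average every channel over all residue positions.
    squeezed = [sum(row[c] for row in features) / length for c in range(channels)]

    # Excitation: bottleneck linear layer with ReLU, then linear layer with sigmoid.
    hidden = [
        max(0.0, sum(squeezed[c] * w1[c][h] for c in range(channels)))
        for h in range(hidden_size)
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(hidden[h] * w2[h][c] for h in range(hidden_size))))
        for c in range(channels)
    ]

    # Scale: reweight each channel of every residue by its gate.
    scaled = [[row[c] * gates[c] for c in range(channels)] for row in features]
    return scaled, gates


toy_features = [[1.0, 2.0], [3.0, 4.0]]   # 2 residues x 2 channels (toy values)
toy_w1 = [[1.0], [0.0]]                   # hand-set weights, not trained
toy_w2 = [[0.0, 1.0]]
scaled, gates = squeeze_excite(toy_features, toy_w1, toy_w2)
print(gates[0])  # 0.5, because the first gate receives a zero pre-activation
```

In a trained network the two weight matrices would be learned, so the gates would emphasise the feature channels most useful for the prediction.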
Taking protein 1ibaA as an example, its solvent accessibility was obtained using the above method.
The above describes the solvent accessibility prediction result obtained by the invention for protein 1ibaA and is not intended to limit the scope of the invention. Various modifications and improvements may be made without departing from the basic content of the invention, and such modifications and improvements are not excluded from its scope of protection.