CN116230075A

CN116230075A - Protein domain boundary prediction method based on hybrid network model

Info

Publication number: CN116230075A
Application number: CN202310215993.3A
Authority: CN
Inventors: 张贵军; 汪乾梁; 彭春翔; 张金龙; 朱海涛; 周晓根
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2023-03-08
Filing date: 2023-03-08
Publication date: 2023-06-06

Abstract

A protein domain boundary prediction method based on a hybrid network model. A hybrid deep learning network combining a convolutional network and a long short-term memory (LSTM) network is designed to extract information from the input features, which comprise the protein amino acid sequence, the position-specific scoring matrix (PSSM), the secondary structure and the solvent accessibility. The extracted deep features are classified using a random forest, giving each residue a probability score in [0, 1]; whether each residue lies in a domain boundary region is then judged against a boundary threshold obtained from a large amount of experimental data. The domain boundary segmentation rules are learned from a large dataset, which is of considerable significance for protein domain segmentation. The invention improves the reliability of boundary prediction.

Description

Protein domain boundary prediction method based on hybrid network model

Technical Field

The invention relates to the fields of bioinformatics and computer application, in particular to a protein domain boundary prediction method based on a hybrid network model.

Background

Protein domains are fundamental units of protein structure, folding, function, evolution and design. They are partially compact structural units.

Protein domain detection methods can be divided into two categories: structure-based domain segmentation methods and sequence-based domain segmentation methods. Structure-based methods require experimental or predicted protein structures for domain identification. However, determining the three-dimensional structure of a protein by wet-lab experiments is time-consuming, laborious and costly. With the development of high-throughput sequencing technology, the number of known protein sequences has far outgrown the number of known structures, so structure-based domain segmentation methods can no longer meet practical demand.

Sequence-based methods include homology-based methods and ab initio methods. Homology-based methods are limited in that prediction accuracy drops sharply if a good homologous template cannot be found. Ab initio methods can overcome this limitation, because most of them use statistical methods to predict domain boundaries; the task can be viewed as a classification problem for each residue of the protein. However, early ab initio methods achieved an accuracy of only 25% to 40%, because their input features captured only short-range information and ignored long-range information.

Disclosure of Invention

In order to overcome the limitations and low accuracy of existing protein domain boundary prediction methods, the invention provides a hybrid network model that extracts both short-range and long-range features, overcoming the defect that conventional network models attend only to short-range features; finally, a random forest classifier is introduced to classify the features and predict the probability that each residue is a boundary point, further improving the reliability of boundary prediction.

The technical scheme adopted for solving the technical problems is as follows:

A method of protein domain boundary prediction based on a hybrid network model, the method comprising the steps of:

1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH; first, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then protein sequences with a chain length of more than 1500 are removed, giving 78653 sequences; second, redundancy is removed using CD-HIT at a 30% sequence similarity threshold; finally, 20336 sequences are obtained as the dataset;

2) Extracting input features;

3) Constructing a network frame;

4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data, which are input into the feature extraction network; data dimension reduction, feature extraction and back propagation are performed over a set number of training epochs, and the trained model parameters are finally obtained;

5) Determining a boundary threshold: using the trained model, 200 examples are predicted, and the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds; the threshold giving the highest MCC is taken as the final domain boundary threshold. The calculation formula is:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

wherein TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively;

6) Domain boundary point determination: for the input protein sequence, the trained network predicts the probability that each residue lies in a domain boundary region; each probability is compared with the boundary threshold, and residues whose probability is greater than the threshold determined in step 5) are marked as a possible domain boundary region; the residue M with the highest probability in the possible domain boundary region is then taken as the candidate cut point; according to the secondary structure predicted in step 2), if residue M is not in a loop region, the candidate cut point is moved to the loop-region residue F nearest to residue M, and if residue M is in a loop region, M is the final cut point.

Further, in step 2), the process of extracting the input features is as follows:

2.1 According to the amino acid sequence, each amino acid is encoded as follows: the 20 amino acids plus one gap symbol are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;

2.2 According to the amino acid sequence, PSIBLAST is used to generate a multiple sequence alignment from the NR database, from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method; the steps are as follows:

2.2.1 Acquiring a multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the multiple sequence alignment file is filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file, and the number of effective sequences S_val is calculated; the formula is as follows:

wherein S is the number of sequences in the multiple sequence alignment file, m and n are two mutually different sequences in the multiple sequence alignment file, the indicator term takes the value 1 if sequences m and n are identical at the i-th residue position of the input sequence and 0 otherwise, and L represents the length of the input sequence;

2.2.2 Calculating the sequence PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which each amino acid appears at a given position of the sequence is calculated; the formula is as follows:

wherein N_A is the number of occurrences of amino acid A in a given column of the effective multiple sequence alignment file; to avoid sparse frequency data, the frequency f_i(A) is transformed as follows:

a PSSM of 21 x L is obtained; horizontal traversal and vertical traversal are then performed on the amino acid frequency features of the sequence, changing the feature dimension to 42 x L, wherein L represents the length of the input sequence;
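The column counting behind f_i(A) in step 2.2.2 can be sketched in Python as follows. The normalisation used here (count divided by the number of aligned sequences) is an assumption made for illustration and is not the formula of the method; the smoothing transformation and the traversal to 42 x L are not reproduced. The names `ALPHABET` and `column_frequencies` are illustrative.

```python
from collections import Counter

# 20 amino acids followed by the gap symbol, in the order of the encoding table.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def column_frequencies(msa):
    """Return an L x 21 table of per-column symbol frequencies for aligned sequences."""
    if not msa:
        raise ValueError("alignment is empty")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for i in range(length):
        # counts[A] plays the role of N_A for column i.
        counts = Counter(seq[i] for seq in msa)
        # Assumption: normalise by the number of aligned sequences.
        profile.append([counts.get(a, 0) / len(msa) for a in ALPHABET])
    return profile
```

Symbols outside the 21-letter alphabet are simply ignored in this sketch, so a column containing them would not sum to 1.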

2.3 According to the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH; for the secondary structure, (1, 0, 0) represents an alpha helix, (0, 1, 0) represents a beta sheet and (0, 0, 1) represents a loop region, giving a 3 x L secondary structure feature; for solvent accessibility, (1, 0) and (0, 1) represent the exposed and buried states of each residue, respectively, giving a 2 x L feature.

Still further, in the step 3), the process of constructing the network frame is as follows:

3.1 The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21, respectively; the multi-scale convolution layer extracts the short-range features of the input sequence;

3.2 The second layer of the network is a stacked BLSTM module made up of three BLSTM layers, each consisting of two LSTMs that scan the protein sequence from the N-terminus and the C-terminus, respectively; the BLSTMs extract the long-range features of the input sequence;

3.3 The multi-scale convolution layer and the BLSTMs extract deep features from the input features, and a random forest then predicts, from the extracted deep features, the probability that each residue lies in a domain boundary region.
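To illustrate how convolution branches with kernel sizes 11, 15 and 21 can each yield one value per residue, the following pure-Python sketch applies a zero-padded one-dimensional convolution per branch and stacks the results. It is a simplification under stated assumptions: a single input channel instead of the 68 fused channels, placeholder averaging kernels instead of learned weights, and no BLSTM or random forest stage. The function names are illustrative.

```python
KERNEL_SIZES = (11, 15, 21)  # kernel sizes named in step 3.1


def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over a 1-D signal with zero padding.

    The padding of len(kernel) // 2 on each side keeps the output length
    equal to the input length, so every residue keeps one output value.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def multi_scale_features(signal, kernels):
    """Run one convolution branch per kernel and stack the outputs per position."""
    branches = [conv1d_same(signal, k) for k in kernels]
    return [list(values) for values in zip(*branches)]


# Placeholder averaging kernels; a trained network would learn these weights.
kernels = [[1.0 / k] * k for k in KERNEL_SIZES]
```

Each branch sees a window of a different width around the same residue, which is why the combined output describes local context at several scales.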

Further, in step 6), according to the secondary structure predicted in step 2.3), if residue M is not in a loop region, the candidate cut point is moved to the loop-region residue F nearest to residue M, and if residue M is in a loop region, M is the final cut point.

The beneficial effects of the invention are mainly as follows: within a deep learning framework, a protein domain boundary prediction method based on a hybrid network model is provided that starts from the sequence alone, avoiding the limitations of conventional structure-based, template-based and homology-based methods. Short-range and long-range features are extracted by the hybrid network model, and domain boundary points are predicted by combining secondary structure information with prior knowledge of domain boundaries, which improves the reliability of the prediction.

Description of the drawings

FIG. 1 is an overall flow chart of the deep-learning-based method for predicting protein domain boundaries from protein sequences.

FIG. 2 shows the probability, obtained by the deep-learning-based protein domain boundary prediction method, that each residue lies in a domain boundary region.

FIG. 3 shows the prediction result for protein T0789 obtained by the deep-learning-based protein domain boundary prediction method.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIGS. 1 to 3, a protein domain boundary prediction method based on deep learning from a protein sequence includes the steps of:

1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH. First, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then protein sequences with a chain length of more than 1500 are removed, giving 78653 sequences; second, redundancy is removed using CD-HIT at a 30% sequence similarity threshold; finally, 20336 sequences are obtained as the dataset;
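The length filters of step 1) can be sketched in Python as follows. The `Chain` record and its field names are hypothetical, and "a domain length of less than 40" is read here as "any domain shorter than 40 residues", which is an assumption; the CD-HIT redundancy removal is an external tool run and is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chain:
    """Hypothetical record: a full-length sequence and the lengths of its domains."""
    sequence: str
    domain_lengths: List[int]


MIN_CHAIN_LEN = 80     # chains shorter than this are removed
MIN_DOMAIN_LEN = 40    # chains with a shorter domain are removed (assumed reading)
MAX_CHAIN_LEN = 1500   # chains longer than this are removed


def keep_chain(chain: Chain) -> bool:
    """Return True if the chain passes the chain-length and domain-length filters."""
    n = len(chain.sequence)
    if n < MIN_CHAIN_LEN or n > MAX_CHAIN_LEN:
        return False
    return all(d >= MIN_DOMAIN_LEN for d in chain.domain_lengths)


def filter_chains(chains: List[Chain]) -> List[Chain]:
    """Keep only the chains that pass every filter."""
    return [c for c in chains if keep_chain(c)]
```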

2) The input feature extraction process is as follows:

2.1 According to the amino acid sequence, each amino acid is encoded as follows: the 20 amino acids plus one gap symbol are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;
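As an illustration of step 2.1, the following Python sketch maps a sequence to an L x 21 one-hot matrix. It assumes the integer codes run consecutively from 1 to 21 in the listed order, with "-" standing for the gap; the names `ALPHABET`, `CODE` and `one_hot` are illustrative, and letters outside the 21 listed symbols are not handled.

```python
# One-letter codes in the order of the encoding table, then the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

# Integer code of each symbol: A -> 1, C -> 2, ..., Y -> 20, gap -> 21.
CODE = {symbol: index + 1 for index, symbol in enumerate(ALPHABET)}


def one_hot(sequence):
    """Return an L x 21 list of rows, each with a single 1 at the symbol's code."""
    rows = []
    for symbol in sequence.upper():
        row = [0] * len(ALPHABET)
        row[CODE[symbol] - 1] = 1  # raises KeyError for unlisted symbols
        rows.append(row)
    return rows
```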

2.2 According to the amino acid sequence, PSIBLAST is used to generate a multiple sequence alignment from the NR database (https://www.ncbi.nlm.nih.gov/), from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method; the steps are as follows:

2.2.1 Acquiring a multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the multiple sequence alignment file is filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file, and the number of effective sequences S_val is calculated; the formula is as follows:

wherein S is the number of sequences in the multiple sequence alignment file, m and n are two mutually different sequences in the multiple sequence alignment file, the indicator term takes the value 1 if sequences m and n are identical at the i-th residue position of the input sequence and 0 otherwise, and L represents the length of the input sequence;

2.2.2 Calculating the sequence PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which each amino acid appears at a given position of the sequence is calculated; the formula is as follows:

2.3 According to the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH (Scratch Protein Predictor, uci.edu); for the secondary structure, (1, 0, 0) represents an alpha helix, (0, 1, 0) represents a beta sheet and (0, 0, 1) represents a loop region, giving a 3 x L secondary structure feature; for solvent accessibility, (1, 0) and (0, 1) represent the exposed and buried states of each residue, respectively, giving a 2 x L feature;
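The encoding of step 2.3 can be sketched in Python as follows. The three-dimensional vectors follow the helix, sheet and loop convention of the step; the one-letter labels used as dictionary keys ("H", "E", "C" and "e", "b") are assumptions about how the predictions are written down and are not taken from the SCRATCH output format.

```python
# Secondary structure: helix, sheet, loop (labels are assumed, vectors follow step 2.3).
SS_CODE = {"H": (1, 0, 0), "E": (0, 1, 0), "C": (0, 0, 1)}

# Solvent accessibility: exposed, buried (labels are assumed).
SA_CODE = {"e": (1, 0), "b": (0, 1)}


def encode_structure(sec_struct, accessibility):
    """Return one 5-value tuple per residue: 3 for secondary structure, 2 for accessibility."""
    if len(sec_struct) != len(accessibility):
        raise ValueError("both predictions must cover the same residues")
    return [SS_CODE[s] + SA_CODE[a] for s, a in zip(sec_struct, accessibility)]
```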

3) Constructing the network framework; the process is as follows:

3.1 The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21, respectively; the multi-scale convolution layer extracts the short-range features of the input sequence;

3.2 The second layer of the network is a stacked BLSTM module made up of three BLSTM layers, each consisting of two LSTMs that scan the protein sequence from the N-terminus and the C-terminus, respectively; the BLSTMs extract the long-range features of the input sequence;

3.3 The multi-scale convolution layer and the BLSTMs extract deep features from the input features, and a random forest then predicts, from the extracted deep features, the probability that each residue lies in a domain boundary region;

4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data, which are input into the feature extraction network; data dimension reduction, feature extraction and back propagation are performed over a set number of training epochs, and the trained model parameters are finally obtained;
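The fusion of the four feature blocks in step 4) amounts to a per-residue concatenation, since 21 + 42 + 3 + 2 = 68. A minimal Python sketch follows, assuming each block is stored with one row per residue (so the result is L rows of 68 values, the transpose of the 68 x L layout in the text); `fuse_features` and `FEATURE_WIDTHS` are illustrative names.

```python
# Widths of the one-hot, PSSM, secondary structure and solvent accessibility blocks.
FEATURE_WIDTHS = (21, 42, 3, 2)


def fuse_features(one_hot, pssm, sec_struct, solvent):
    """Concatenate four per-residue feature tables into rows of 68 values."""
    tables = (one_hot, pssm, sec_struct, solvent)
    length = len(one_hot)
    for table, width in zip(tables, FEATURE_WIDTHS):
        if len(table) != length:
            raise ValueError("all feature tables must cover the same residues")
        if any(len(row) != width for row in table):
            raise ValueError(f"expected rows of width {width}")
    fused = []
    for rows in zip(*tables):
        merged = []
        for row in rows:
            merged.extend(row)  # keep block order: one-hot, PSSM, SS, SA
        fused.append(merged)
    return fused
```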

5) Determining a boundary threshold: using the trained model, 200 examples are predicted, and the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds; the threshold giving the highest MCC is taken as the final domain boundary threshold. The calculation formula is as follows:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

wherein TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively;
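The threshold search of step 5) can be sketched in Python as follows, using the standard definition of the Matthews correlation coefficient over true and false positives and negatives. The candidate thresholds, the pooling of all residues into one list, and the function names are assumptions made for illustration; the text does not state which thresholds are scanned.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def confusion(probs, labels, threshold):
    """Count TP, TN, FP, FN when residues with prob > threshold are called boundary."""
    tp = tn = fp = fn = 0
    for p, y in zip(probs, labels):
        predicted = p > threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def best_threshold(probs, labels, candidates):
    """Return the candidate threshold with the highest MCC."""
    return max(candidates, key=lambda t: mcc(*confusion(probs, labels, t)))
```

On ties, `max` returns the first candidate in the list, so the order of `candidates` decides which of several equally good thresholds is kept.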

6) Domain boundary point determination: for the input protein sequence, the trained network predicts the probability that each residue lies in a domain boundary region; each probability is compared with the boundary threshold, and residues whose probability is greater than the threshold determined in step 5) are marked as a possible domain boundary region. The residue M with the highest probability in the possible domain boundary region is then taken as the candidate cut point. According to the secondary structure predicted in step 2.3), if residue M is not in a loop region, the candidate cut point is moved to the loop-region residue F nearest to residue M, and if residue M is in a loop region, M is the final cut point.
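The cut point selection of step 6) can be sketched in Python as follows. The sketch treats all residues above the threshold as a single possible boundary region, uses 0-based indices, and assumes loop residues are labelled "C"; the behaviour when no loop residue exists (keep M) and the function name are also assumptions.

```python
def choose_cut_point(probs, sec_struct, threshold, loop_label="C"):
    """Return the 0-based index of the final cut point, or None if no residue qualifies."""
    # Possible boundary region: residues whose probability exceeds the threshold.
    region = [i for i, p in enumerate(probs) if p > threshold]
    if not region:
        return None
    # Candidate cut point M: the highest-probability residue in the region.
    m = max(region, key=lambda i: probs[i])
    if sec_struct[m] == loop_label:
        return m
    loops = [i for i, s in enumerate(sec_struct) if s == loop_label]
    if not loops:
        return m  # assumption: keep M when the chain has no loop residue
    # Residue F: the loop residue nearest to M (the lower index wins a tie).
    return min(loops, key=lambda i: abs(i - m))
```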

Taking CASP11 protein T0789 with sequence length 295 as an example, a protein domain boundary prediction method based on deep learning comprises the following steps:

1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH; first, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then protein sequences with a chain length of more than 1500 are removed, giving 78653 sequences; second, redundancy is removed using CD-HIT at a 30% sequence similarity threshold; finally, 20336 sequences are obtained as the dataset;

2) The input feature extraction process is as follows:

2.1 According to the amino acid sequence, each amino acid is encoded as follows: the 20 amino acids plus one gap symbol are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;

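2.2.1 Acquiring a multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the multiple sequence alignment file is filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file. The number of effective sequences S_val is calculated; the formula is as follows:

wherein S is the number of sequences in the multiple sequence alignment file, m and n are two mutually different sequences in the multiple sequence alignment file, the indicator term takes the value 1 if sequences m and n are identical at the i-th residue position of the input sequence and 0 otherwise, and L represents the length of the input sequence;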

3) Constructing the network framework; the process is as follows:

3.3 The multi-scale convolution layer and the BLSTMs extract deep features from the input features, and a random forest then predicts, from the extracted deep features, the probability that each residue lies in a domain boundary region;

4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data, which are input into the feature extraction network; data dimension reduction, feature extraction and back propagation are performed over a set number of training epochs, and the trained model parameters are finally obtained;

6) Domain boundary point determination: for the input protein sequence T0789, the probability that each residue lies in a domain boundary region is predicted by the trained network and compared with the boundary threshold of 0.61 determined in step 5); residues whose probability is greater than 0.61 form the possible domain boundary region (residues 126 to 184). The residue M with the highest probability (glycine 159) is then taken from the possible domain boundary region as the candidate cut point. According to the secondary structure predicted in step 2.3), if residue M is in a loop region, M is the final cut point. The prediction is correct because the predicted cut point lies on the same loop as the experimentally determined cut point T (histidine 151).

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic spirit thereof, and the scope thereof is determined by the claims that follow.

Claims

1. A method for predicting protein domain boundaries based on a hybrid network model, the method comprising the steps of:

3) Constructing a network frame;

5) Determining a boundary threshold: using the trained model, 200 examples are predicted, and the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds; the threshold giving the highest MCC is taken as the final domain boundary threshold, and the calculation formula is as follows:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

wherein TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively;

2. The method for predicting protein domain boundaries based on a hybrid network model as claimed in claim 1, wherein in step 2), the process of extracting the input features is as follows: according to the amino acid sequence, each amino acid is encoded as follows: the 20 amino acids plus one gap symbol are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21; the encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding; according to the amino acid sequence, PSIBLAST is used to generate a multiple sequence alignment from the NR database, from which a position-specific scoring matrix PSSM is constructed as one of the input features of the method, with the following steps: acquiring a multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the multiple sequence alignment file is filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file, and the number of effective sequences S_val is calculated; the formula is as follows:

wherein S is the number of sequences in the multiple sequence alignment file, m and n are two mutually different sequences in the multiple sequence alignment file, the indicator term takes the value 1 if sequences m and n are identical at the i-th residue position of the input sequence and 0 otherwise, and L represents the length of the input sequence;

3. The method for predicting protein domain boundaries based on hybrid network model as claimed in claim 2, wherein in the step 3), the process of constructing the network frame is as follows:

4. The method according to claim 2, wherein in step 6), if residue M is not in a loop region, the candidate cut point is moved to the loop-region residue F nearest to residue M, and if residue M is in a loop region, M is the final cut point.