CN116230075A - Protein domain boundary prediction method based on hybrid network model - Google Patents

Protein domain boundary prediction method based on hybrid network model

Info

Publication number
CN116230075A
Authority
CN
China
Prior art keywords
sequence
residue
protein
domain
domain boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310215993.3A
Other languages
Chinese (zh)
Inventor
张贵军
汪乾梁
彭春翔
张金龙
朱海涛
周晓根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202310215993.3A
Publication of CN116230075A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A protein domain boundary prediction method based on a hybrid network model. A hybrid deep learning network combining a convolutional network and a long short-term memory (LSTM) network extracts information from the input features, which comprise the protein amino acid sequence, the position-specific scoring matrix (PSSM), the secondary structure and the solvent accessibility. The extracted deep features are classified with a random forest to obtain a probability score in [0, 1] for each residue, and each residue is judged to be inside or outside a domain boundary region by comparing its score with a boundary threshold determined from a large amount of experimental data. The domain boundary segmentation rules are learned from a large dataset, which benefits protein domain segmentation. The invention improves the reliability of boundary prediction.

Description

Protein domain boundary prediction method based on hybrid network model
Technical Field
The invention relates to the fields of bioinformatics and computer application, in particular to a protein domain boundary prediction method based on a hybrid network model.
Background
Protein domains are fundamental units of protein structure, folding, function, evolution and design. They are locally compact structural units.
Protein domain detection methods fall into two categories: structure-based domain segmentation methods and sequence-based domain segmentation methods. Structure-based methods require an experimental or predicted protein structure for domain identification. However, determining the three-dimensional structure of a protein by wet-lab experiments is time-consuming, laborious and costly. With the development of high-throughput sequencing technology, the number of known protein sequences now far exceeds the number of determined structures, so structure-based domain segmentation methods can no longer meet practical demand.
Sequence-based methods include homology-based methods and ab initio methods. Homology-based methods are limited in that prediction accuracy drops sharply if a good homologous template cannot be found. Ab initio methods overcome this limitation because they mostly use statistical methods to predict domain boundaries; the task can be viewed as a classification problem for each residue of the protein. However, early ab initio methods were only 25%-40% accurate, because their input features captured only short-range information and ignored long-range information.
Disclosure of Invention
To overcome the limitations and low accuracy of existing protein domain boundary prediction methods, the invention provides a hybrid network model that extracts both short-range and long-range features, overcoming the drawback that traditional network models attend only to short-range features. A random forest classifier is then applied to the extracted features to predict the probability that each residue is a boundary point, which further improves the reliability of boundary prediction.
The technical scheme adopted for solving the technical problems is as follows:
A protein domain boundary prediction method based on a hybrid network model, the method comprising the following steps:
1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH; first, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then protein sequences with a chain length of more than 1500 are removed, giving 78653 sequences; second, redundancy is removed using CD-HIT at 30% sequence similarity; finally, 20336 sequences are obtained as the dataset;
2) Extracting input features;
3) Constructing the network framework;
4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data and input to the feature extraction network; over a set number of training epochs, dimensionality reduction, feature extraction and back-propagation are performed, finally yielding the trained model parameters;
5) Determining the boundary threshold: using the trained model, 200 examples are predicted; the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds, and the threshold giving the highest MCC is taken as the final domain boundary threshold. The MCC is calculated as:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative residue predictions, respectively;
6) Domain boundary point determination: for the input protein sequence, the trained network predicts the probability that each residue lies in a domain boundary region; each probability is compared with the boundary threshold determined in step 5), and residues whose probability exceeds the threshold are marked as a possible domain boundary region; the residue M with the highest probability in the possible domain boundary region is then taken as the candidate cut point; according to the secondary structure predicted in step 2), if residue M is not in a loop region, the candidate cut point is moved to the residue F in the loop region nearest to M, and if M is in a loop region, M is the final cut point.
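The threshold selection in step 5) can be sketched as a simple sweep over candidate thresholds. This is a minimal illustration, not the patented implementation: the function names, the candidate grid (0.05 to 0.95 in steps of 0.05), the toy probabilities and labels, and the choice to return 0 when the MCC denominator is zero are all assumptions made here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero (assumption)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def confusion(probs, labels, threshold):
    """Count TP, TN, FP, FN when a residue is called 'boundary' if prob > threshold."""
    tp = tn = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = p > threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def best_threshold(probs, labels, candidates):
    """Return (threshold, mcc) with the highest MCC; ties keep the first candidate."""
    best = None
    for t in candidates:
        score = mcc(*confusion(probs, labels, t))
        if best is None or score > best[1]:
            best = (t, score)
    return best

# Toy per-residue probabilities and true boundary labels (made up for illustration).
probs = [0.10, 0.20, 0.35, 0.70, 0.80, 0.65, 0.30, 0.15]
labels = [0, 0, 0, 1, 1, 1, 0, 0]
candidates = [i / 100 for i in range(5, 100, 5)]
threshold, score = best_threshold(probs, labels, candidates)
print(threshold, score)
```

In practice the predictions of all 200 examples would be pooled (or scored per protein and averaged) before the sweep; the text does not say which, so that choice is left open here.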
Further, in step 2), the input features are extracted as follows:
2.1 Each residue is encoded according to the amino acid sequence; the 20 amino acids plus one gap symbol are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;
2.2 Based on the amino acid sequence, PSIBLAST is used to generate a multiple sequence alignment from the NR database, from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method, as follows:
2.2.1 Acquiring the multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the alignment file is filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file, and the number of effective sequences S_val is calculated as follows:
Figure BDA0004114903530000031
Figure BDA0004114903530000032
where S is the number of sequences in the multiple sequence alignment file; m and n are two different sequences in that file; the indicator term (Figure BDA0004114903530000033) equals 1 if sequences m and n have the same residue at position i of the input sequence and 0 otherwise; and L is the length of the input sequence;
2.2.2 Calculating the sequence PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which amino acid A appears at position i of the sequence is calculated as follows:
Figure BDA0004114903530000034
where N_A is the number of times amino acid A occurs in the given column of the effective multiple sequence alignment file; to avoid sparse frequency data, the frequency f_i(A) is transformed as follows:
Figure BDA0004114903530000035
This gives a PSSM of dimension 21 x L; the amino acid frequency features of the sequence are then traversed horizontally and vertically, extending the feature dimension to 42 x L, where L is the length of the input sequence;
2.3 Based on the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH. For the secondary structure, (1, 0, 0) represents an alpha helix, (0, 1, 0) a beta sheet and (0, 0, 1) a loop region, giving a 3 x L secondary structure feature; for solvent accessibility, (1, 0) and (0, 1) represent the exposed and buried states of each residue, respectively, giving a 2 x L feature.
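The encodings of steps 2.1 and 2.3 and their fusion with the PSSM into 68 features per residue can be sketched as follows. This is an illustrative sketch under stated assumptions: the one-letter state codes ("H", "E", "C" for secondary structure; "e", "b" for exposed and buried) are hypothetical labels chosen here rather than the output format of SCRATCH, unknown residue letters are mapped to the gap class, the 42-column PSSM is assumed to be precomputed, and the result is stored as L rows of 68 values (the transpose of the 68 x L layout in the text).

```python
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"  # codes 1..20 as listed in step 2.1
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AA_ORDER)}
GAP_INDEX = 21  # code of the gap symbol

# Hypothetical state letters; the one-hot vectors follow step 2.3.
SS_CODES = {"H": (1, 0, 0), "E": (0, 1, 0), "C": (0, 0, 1)}
SA_CODES = {"e": (1, 0), "b": (0, 1)}

def one_hot(position, size):
    """Vector of length `size` with a single 1 at 0-based `position`."""
    return [1 if i == position else 0 for i in range(size)]

def encode_sequence(seq):
    """L x 21 one-hot encoding; unknown letters fall back to the gap class (assumption)."""
    return [one_hot(AA_INDEX.get(aa, GAP_INDEX) - 1, 21) for aa in seq]

def fuse_features(seq, pssm42, ss, sa):
    """Concatenate 21 + 42 + 3 + 2 = 68 features for each residue."""
    if not (len(pssm42) == len(ss) == len(sa) == len(seq)):
        raise ValueError("all feature tracks must have length L")
    rows = []
    for aa_row, pssm_row, s, a in zip(encode_sequence(seq), pssm42, ss, sa):
        if len(pssm_row) != 42:
            raise ValueError("each PSSM row must have 42 values")
        rows.append(aa_row + list(pssm_row) + list(SS_CODES[s]) + list(SA_CODES[a]))
    return rows

# Toy 4-residue example with an all-zero placeholder PSSM.
seq = "ACDG"
features = fuse_features(seq, [[0.0] * 42 for _ in seq], "HECC", "ebbe")
print(len(features), len(features[0]))
```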
Still further, in step 3), the network framework is constructed as follows:
3.1 The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21, respectively; the multi-scale convolution layer extracts short-range features of the input sequence;
3.2 The second layer of the network is a BLSTM block formed by stacking three bidirectional LSTMs (BLSTMs); each BLSTM consists of two LSTMs that scan the protein sequence from the N-terminus and the C-terminus, respectively. The BLSTM block extracts long-range features of the input sequence;
3.3 The multi-scale convolution layer and the BLSTM block extract deep features from the input features, and a random forest then uses the extracted deep features to predict, for each residue, the probability of being in a domain boundary region.
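To make the multi-scale convolution of step 3.1 concrete, the sketch below applies three one-dimensional convolutions with kernel sizes 11, 15 and 21 to an L x 68 input and concatenates their outputs along the channel axis. It covers only the convolution stage; the BLSTM block and the random forest are not shown. Zero ("same") padding, the ReLU activation, two filters per scale and the random weights standing in for learned parameters are all assumptions, since the text does not specify them.

```python
import random

def conv1d_same(x, kernel):
    """Single-output-channel 1D convolution with zero padding.

    x: L rows of C_in values; kernel: k rows of C_in weights.
    The output has length L, so sequence positions stay aligned with residues.
    """
    k, half, length = len(kernel), len(kernel) // 2, len(x)
    out = []
    for i in range(length):
        acc = 0.0
        for j in range(k):
            pos = i + j - half
            if 0 <= pos < length:  # positions outside the sequence act as zeros
                acc += sum(a * w for a, w in zip(x[pos], kernel[j]))
        out.append(acc)
    return out

def multi_scale_conv(x, kernel_sizes=(11, 15, 21), filters_per_scale=2, seed=0):
    """Run one bank of filters per kernel size and concatenate the channels."""
    rng = random.Random(seed)
    c_in = len(x[0])
    channels = []
    for k in kernel_sizes:
        for _ in range(filters_per_scale):
            # Random weights are placeholders for trained parameters.
            kernel = [[rng.uniform(-0.1, 0.1) for _ in range(c_in)] for _ in range(k)]
            channels.append([max(0.0, v) for v in conv1d_same(x, kernel)])  # ReLU
    return [list(col) for col in zip(*channels)]  # L rows, one column per filter

# Toy input: 30 residues with 68 features each.
x = [[1.0] * 68 for _ in range(30)]
out = multi_scale_conv(x)
print(len(out), len(out[0]))
```

Each output position of the size-21 branch sees 10 residues on either side, which is why this stage is described as capturing short-range context; longer dependencies are left to the recurrent layers.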
Further, in step 6), according to the secondary structure predicted in step 2.3), if residue M is not in a loop region, the candidate cut point is moved to the residue F in the loop region nearest to M; if M is in a loop region, M is the final cut point.
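The cut-point rule of step 6) can be sketched as below. This is a simplified illustration that assumes a single cut point per chain: the function name, the 0-based indexing, the use of "C" to mark loop residues, the tie-break toward the lower index when two loop residues are equally close, and the fallback of keeping M when no loop residue exists are all assumptions made here.

```python
def choose_cut_point(probs, ss, threshold):
    """Return the 0-based index of the final cut point, or None if no residue
    exceeds the threshold.

    probs: per-residue boundary probabilities.
    ss: secondary structure string, with 'C' marking loop residues (hypothetical code).
    """
    region = [i for i, p in enumerate(probs) if p > threshold]
    if not region:
        return None
    m = max(region, key=lambda i: probs[i])  # candidate cut point M
    if ss[m] == "C":
        return m  # M is already in a loop region
    loops = [i for i, s in enumerate(ss) if s == "C"]
    if not loops:
        return m  # assumption: keep M when the chain has no loop residue
    # Nearest loop residue F; ties go to the lower index (assumption).
    return min(loops, key=lambda i: (abs(i - m), i))

# Toy example: M is residue 3 (in a helix), so the cut moves to loop residue 5.
probs = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1]
cut = choose_cut_point(probs, "CHHHHCC", 0.61)
print(cut)
```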
The beneficial effects of the invention are mainly as follows: within a deep learning framework, a protein domain boundary prediction method based on a hybrid network model is provided that starts from the sequence alone, thereby avoiding the limitations of traditional structure-based, template-based and homology-based methods. Short-range and long-range features are extracted with the hybrid network model, and domain boundary points are predicted by combining secondary structure information with prior knowledge of domain boundaries, which improves the reliability of the prediction.
Description of the drawings
FIG. 1 is the overall flow chart of the deep learning based method for predicting protein domain boundaries from protein sequences.
FIG. 2 shows the probability, obtained by the deep learning based protein domain boundary prediction method, that each residue lies in a domain boundary region.
FIG. 3 shows the prediction result for protein T0789 obtained by the deep learning based protein domain boundary prediction method.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to FIGS. 1 to 3, a deep learning based method for predicting protein domain boundaries from a protein sequence includes the following steps:
1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH. First, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then protein sequences with a chain length of more than 1500 are removed, giving 78653 sequences; second, redundancy is removed using CD-HIT at 30% sequence similarity; finally, 20336 sequences are obtained as the dataset;
2) The input feature extraction process is as follows:
2.1 Each residue is encoded according to the amino acid sequence; the 20 amino acids plus one gap symbol are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;
2.2 Based on the amino acid sequence, PSIBLAST is used to generate a multiple sequence alignment from the NR database (https://www.ncbi.nlm.nih.gov/), from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method, as follows:
2.2.1 Acquiring the multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the alignment file is filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file, and the number of effective sequences S_val is calculated as follows:
Figure BDA0004114903530000051
Figure BDA0004114903530000052
where S is the number of sequences in the multiple sequence alignment file; m and n are two different sequences in that file; the indicator term (Figure BDA0004114903530000053) equals 1 if sequences m and n have the same residue at position i of the input sequence and 0 otherwise; and L is the length of the input sequence;
2.2.2 Calculating the sequence PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which amino acid A appears at position i of the sequence is calculated as follows:
Figure BDA0004114903530000061
where N_A is the number of times amino acid A occurs in the given column of the effective multiple sequence alignment file; to avoid sparse frequency data, the frequency f_i(A) is transformed as follows:
Figure BDA0004114903530000062
This gives a PSSM of dimension 21 x L; the amino acid frequency features of the sequence are then traversed horizontally and vertically, extending the feature dimension to 42 x L, where L is the length of the input sequence;
2.3 Based on the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH (Scratch Protein Predictor (uci.edu)). For the secondary structure, (1, 0, 0) represents an alpha helix, (0, 1, 0) a beta sheet and (0, 0, 1) a loop region, giving a 3 x L secondary structure feature; for solvent accessibility, (1, 0) and (0, 1) represent the exposed and buried states of each residue, respectively, giving a 2 x L feature;
3) The network framework is constructed as follows:
3.1 The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21, respectively; the multi-scale convolution layer extracts short-range features of the input sequence;
3.2 The second layer of the network is a BLSTM block formed by stacking three bidirectional LSTMs (BLSTMs); each BLSTM consists of two LSTMs that scan the protein sequence from the N-terminus and the C-terminus, respectively; the BLSTM block extracts long-range features of the input sequence;
3.3 The multi-scale convolution layer and the BLSTM block extract deep features from the input features, and a random forest then uses the extracted deep features to predict, for each residue, the probability of being in a domain boundary region;
4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data and input to the feature extraction network; over a set number of training epochs, dimensionality reduction, feature extraction and back-propagation are performed, finally yielding the trained model parameters;
5) Determining the boundary threshold: using the trained model, 200 examples are predicted; the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds, and the threshold giving the highest MCC is taken as the final domain boundary threshold. The MCC is calculated as:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative residue predictions, respectively;
6) Domain boundary point determination: for the input protein sequence, the trained network predicts the probability that each residue lies in a domain boundary region; each probability is compared with the boundary threshold determined in step 5), and residues whose probability exceeds the threshold are marked as a possible domain boundary region. The residue M with the highest probability in the possible domain boundary region is then taken as the candidate cut point. According to the secondary structure predicted in step 2.3), if residue M is not in a loop region, the candidate cut point is moved to the residue F in the loop region nearest to M; if M is in a loop region, M is the final cut point.
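The length filters used in step 1) to build the dataset can be sketched as a simple predicate. This is an illustration only: the `Chain` record, its field names and the representation of each domain as one inclusive 1-based residue range are hypothetical (CATH domains can be discontinuous), and the CD-HIT redundancy removal at 30% sequence similarity is an external tool step that is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chain:
    """Hypothetical record for one protein chain and its annotated domains."""
    chain_id: str
    sequence: str
    domains: List[Tuple[int, int]]  # inclusive (start, end), 1-based; assumed continuous

def keep_chain(chain, min_chain=80, max_chain=1500, min_domain=40):
    """Apply the length filters of step 1).

    A chain is dropped if it is shorter than 80 residues, longer than 1500
    residues, or contains any domain shorter than 40 residues.
    """
    n = len(chain.sequence)
    if n < min_chain or n > max_chain:
        return False
    return all(end - start + 1 >= min_domain for start, end in chain.domains)

# Toy chains with placeholder sequences.
chains = [
    Chain("ok", "A" * 100, [(1, 50), (51, 100)]),
    Chain("short_chain", "A" * 79, [(1, 79)]),
    Chain("short_domain", "A" * 100, [(1, 30), (31, 100)]),
    Chain("long_chain", "A" * 1501, [(1, 700), (701, 1501)]),
]
kept = [c.chain_id for c in chains if keep_chain(c)]
print(kept)
```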
Taking CASP11 protein T0789 with sequence length 295 as an example, a protein domain boundary prediction method based on deep learning comprises the following steps:
1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH; first, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then protein sequences with a chain length of more than 1500 are removed, giving 78653 sequences; second, redundancy is removed using CD-HIT at 30% sequence similarity; finally, 20336 sequences are obtained as the dataset;
2) The input feature extraction process is as follows:
2.1 Each residue is encoded according to the amino acid sequence; the 20 amino acids plus one gap symbol are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;
2.2 Based on the amino acid sequence, PSIBLAST is used to generate a multiple sequence alignment from the NR database (https://www.ncbi.nlm.nih.gov/), from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method, as follows:
2.2.1 Acquiring the multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the alignment file is filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file. The number of effective sequences S_val is calculated as follows:
Figure BDA0004114903530000081
Figure BDA0004114903530000082
where S is the number of sequences in the multiple sequence alignment file; m and n are two different sequences in that file; the indicator term (Figure BDA0004114903530000083) equals 1 if sequences m and n have the same residue at position i of the input sequence and 0 otherwise; and L is the length of the input sequence;
2.2.2 Calculating the sequence PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which amino acid A appears at position i of the sequence is calculated as follows:
Figure BDA0004114903530000084
where N_A is the number of times amino acid A occurs in the given column of the effective multiple sequence alignment file; to avoid sparse frequency data, the frequency f_i(A) is transformed as follows:
Figure BDA0004114903530000085
This gives a PSSM of dimension 21 x L; the amino acid frequency features of the sequence are then traversed horizontally and vertically, extending the feature dimension to 42 x L, where L is the length of the input sequence;
2.3 Based on the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH (Scratch Protein Predictor (uci.edu)). For the secondary structure, (1, 0, 0) represents an alpha helix, (0, 1, 0) a beta sheet and (0, 0, 1) a loop region, giving a 3 x L secondary structure feature; for solvent accessibility, (1, 0) and (0, 1) represent the exposed and buried states of each residue, respectively, giving a 2 x L feature;
3) The network framework is constructed as follows:
3.1 The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21, respectively; the multi-scale convolution layer extracts short-range features of the input sequence;
3.2 The second layer of the network is a BLSTM block formed by stacking three bidirectional LSTMs (BLSTMs); each BLSTM consists of two LSTMs that scan the protein sequence from the N-terminus and the C-terminus, respectively. The BLSTM block extracts long-range features of the input sequence;
3.3 The multi-scale convolution layer and the BLSTM block extract deep features from the input features, and a random forest then uses the extracted deep features to predict, for each residue, the probability of being in a domain boundary region;
4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data and input to the feature extraction network; over a set number of training epochs, dimensionality reduction, feature extraction and back-propagation are performed, finally yielding the trained model parameters;
5) Determining the boundary threshold: using the trained model, 200 examples are predicted; the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds, and the threshold giving the highest MCC is taken as the final domain boundary threshold. The MCC is calculated as:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative residue predictions, respectively;
6) Domain boundary point determination: for the input protein sequence T0789, the trained network predicts the probability that each residue lies in a domain boundary region; each probability is compared with the boundary threshold 0.61 determined in step 5), and residues whose probability exceeds 0.61 are marked as the possible domain boundary region (residues 126 to 184). The residue M with the highest probability (glycine 159) is then taken as the candidate cut point. According to the secondary structure predicted in step 2.3), residue M is in a loop region, so M is the final cut point. The prediction is considered correct because M lies on the same loop as the experimentally determined cut point T (histidine 151).
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic spirit thereof, and the scope thereof is determined by the claims that follow.

Claims (4)

1. A method for predicting protein domain boundaries based on a hybrid network model, the method comprising the steps of:
1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH; first, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then protein sequences with a chain length of more than 1500 are removed, giving 78653 sequences; second, redundancy is removed using CD-HIT at 30% sequence similarity; finally, 20336 sequences are obtained as the dataset;
2) Extracting input features;
3) Constructing the network framework;
4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data and input to the feature extraction network; over a set number of training epochs, dimensionality reduction, feature extraction and back-propagation are performed, finally yielding the trained model parameters;
5) Determining the boundary threshold: using the trained model, 200 examples are predicted; the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds, and the threshold giving the highest MCC is taken as the final domain boundary threshold. The MCC is calculated as:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative residue predictions, respectively;
6) Domain boundary point determination: for the input protein sequence, the trained network predicts the probability that each residue lies in a domain boundary region; each probability is compared with the boundary threshold determined in step 5), and residues whose probability exceeds the threshold are marked as a possible domain boundary region; the residue M with the highest probability in the possible domain boundary region is then taken as the candidate cut point; according to the secondary structure predicted in step 2), if residue M is not in a loop region, the candidate cut point is moved to the residue F in the loop region nearest to M, and if M is in a loop region, M is the final cut point.
2. The method for predicting protein domain boundaries based on a hybrid network model as claimed in claim 1, wherein in step 2) the input features are extracted as follows:
2.1 Each residue is encoded according to the amino acid sequence; the 20 amino acids plus one gap symbol are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;
2.2 Based on the amino acid sequence, PSIBLAST is used to generate a multiple sequence alignment from the NR database, from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method, as follows:
2.2.1 Acquiring the multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the alignment file is filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file, and the number of effective sequences S_val is calculated as follows:
Figure FDA0004114903520000021
Figure FDA0004114903520000022
wherein S is the number of sequences in the multiple sequence alignment file, m and n are two distinct sequences in that file, the indicator term (Figure FDA0004114903520000023) is 1 if sequences m and n have the same residue at the i-th position of the input sequence and 0 otherwise, and L represents the length of the input sequence;
2.2.2) Calculating the sequence PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which amino acid A appears at position i of the sequence is calculated by the following formula:
Figure FDA0004114903520000024
wherein N is A The number of times of the amino acid A in a certain list in the effective multi-sequence alignment file is counted; to prevent the situation of sparse frequency spectrum data, the frequency spectrum f i (A) The following conversion is carried out:
Figure FDA0004114903520000025
a 21 x L PSSM is thereby obtained; the amino acid frequency features of the sequence are then traversed horizontally and vertically, which changes the feature dimension to 42 x L, wherein L represents the length of the input sequence;
2.3) According to the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH, wherein the secondary structure uses (1, 0, 0) to represent alpha helix, (0, 1, 0) to represent beta sheet, and (0, 0, 1) to represent loop region, giving a 3 x L-dimensional secondary structure feature; solvent accessibility uses (1, 0) and (0, 1) to represent the exposed and buried states of each residue, respectively, giving a 2 x L-dimensional feature.
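The encodings described in this claim, and their fusion into the 68 x L input of step 4), can be illustrated with the sketch below. It is a sketch under stated assumptions: the integer codes follow the numbering listed in the claim (alphabetical by one-letter code, gap last), the state letters `H`/`E`/`C` and `e`/`b` are hypothetical labels for the predicted secondary structure and solvent accessibility states, and the 42 x L PSSM is replaced by a zero placeholder because its formulas are only given as figures.

```python
# Integer codes 1..20 for the amino acids, 21 for the gap symbol.
AA_CODE = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
AA_CODE["-"] = 21

# Hypothetical state letters: H = alpha helix, E = beta sheet, C = loop.
SS_ONE_HOT = {"H": (1, 0, 0), "E": (0, 1, 0), "C": (0, 0, 1)}
# Hypothetical state letters: e = exposed, b = buried.
SA_ONE_HOT = {"e": (1, 0), "b": (0, 1)}


def transpose(rows):
    """Turn an L x d list of per-residue vectors into a d x L feature matrix."""
    return [list(col) for col in zip(*rows)]


def encode_sequence(seq):
    """One-hot encode a protein sequence; returns a 21 x L matrix."""
    per_residue = []
    for aa in seq:
        vec = [0] * 21
        vec[AA_CODE[aa] - 1] = 1
        per_residue.append(vec)
    return transpose(per_residue)


def encode_states(states, table):
    """One-hot encode per-residue state labels; returns a d x L matrix."""
    return transpose([list(table[s]) for s in states])


seq, ss, sa = "ACD-Y", "HHECC", "ebbee"
pssm_placeholder = [[0.0] * len(seq) for _ in range(42)]  # stands in for the real PSSM

# Stack all blocks along the feature axis: 21 + 42 + 3 + 2 = 68 rows.
fused = (
    encode_sequence(seq)
    + pssm_placeholder
    + encode_states(ss, SS_ONE_HOT)
    + encode_states(sa, SA_ONE_HOT)
)
print(len(fused), len(fused[0]))
```

Keeping every block in feature-by-position layout means the fusion is a plain row concatenation, and a length mismatch between predictors shows up immediately as ragged rows.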
3. The method for predicting protein domain boundaries based on a hybrid network model as claimed in claim 2, wherein in step 3), the process of constructing the network framework is as follows:
3.1) The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21, respectively; the multi-scale convolution layer extracts short-range features of the input sequence;
3.2) The second layer of the network is a BLSTM block made up of three stacked BLSTMs, each consisting of two LSTMs that scan the protein sequence in opposite directions, one from the N-terminus and one from the C-terminus; the BLSTMs extract long-range features of the input sequence;
3.3) The multi-scale convolution layer and the BLSTMs extract deep features from the input features, and a random forest then uses the extracted deep features to predict, for each residue, the probability of belonging to a domain boundary region.
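The multi-scale convolution idea in 3.1) can be illustrated with a small pure-Python sketch. It covers only the convolution branch; the BLSTM stack and the random forest are not shown. It assumes zero ("same") padding so that each branch keeps the sequence length L, operates on a single feature channel rather than the 68 channels of the real input, and uses hypothetical averaging kernels in place of learned weights.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length."""
    k = len(kernel)
    pad = k // 2  # valid for the odd kernel sizes used here
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def multi_scale_conv(signal, kernel_sizes=(11, 15, 21)):
    """Run one convolution per kernel size and stack the results as channels."""
    # Hypothetical averaging kernels stand in for trained weights.
    return [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]


signal = [1.0] * 30  # toy single-channel input of length L = 30
features = multi_scale_conv(signal)
print(len(features), [len(channel) for channel in features])
```

Because all three branches return length-L outputs, they can be stacked channel-wise and passed on to a sequence model that expects one feature vector per residue.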
4. The method according to claim 2, wherein in step 6), if residue M is not in a loop region, the candidate cut point is moved to the residue F in the loop region nearest to residue M, and if residue M is in a loop region, M is determined to be the final cut point.
CN202310215993.3A 2023-03-08 2023-03-08 Protein domain boundary prediction method based on hybrid network model Withdrawn CN116230075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310215993.3A CN116230075A (en) 2023-03-08 2023-03-08 Protein domain boundary prediction method based on hybrid network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310215993.3A CN116230075A (en) 2023-03-08 2023-03-08 Protein domain boundary prediction method based on hybrid network model

Publications (1)

Publication Number Publication Date
CN116230075A true CN116230075A (en) 2023-06-06

Family

ID=86578445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310215993.3A Withdrawn CN116230075A (en) 2023-03-08 2023-03-08 Protein domain boundary prediction method based on hybrid network model

Country Status (1)

Country Link
CN (1) CN116230075A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118038990A (en) * 2024-04-11 2024-05-14 山东大学 Multi-level chromatin topological structure domain identification method and system based on community discovery
CN118038990B (en) * 2024-04-11 2024-07-16 山东大学 Multi-level chromatin topological structure domain identification method and system based on community discovery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230606