CN116230075A - Protein domain boundary prediction method based on hybrid network model - Google Patents
Protein domain boundary prediction method based on hybrid network model
- Publication number
- CN116230075A CN202310215993.3A
- Authority
- CN
- China
- Prior art keywords
- sequence
- residue
- protein
- domain
- domain boundary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
A protein domain boundary prediction method based on a hybrid network model. A hybrid deep learning network combining a convolutional network and a long short-term memory (LSTM) network is designed to extract information from the input features, which comprise the protein amino acid sequence, the position-specific scoring matrix (PSSM), the secondary structure and the solvent accessibility. The extracted deep features are classified with a random forest to obtain a probability score in [0, 1] for each residue, and each residue is judged to be inside or outside a domain boundary region by comparing its score with a boundary threshold determined from a large amount of experimental data. The domain boundary segmentation rules are learned from a large dataset, which represents a meaningful advance for protein domain segmentation. The invention improves the reliability of boundary prediction.
Description
Technical Field
The invention relates to the fields of bioinformatics and computer application, in particular to a protein domain boundary prediction method based on a hybrid network model.
Background
Protein domains are fundamental units of protein structure, folding, function, evolution and design. They are partially compact structural units.
Protein domain detection methods fall into two categories: structure-based domain segmentation methods and sequence-based domain segmentation methods. Structure-based methods require an experimental or predicted protein structure for domain identification. However, determining the three-dimensional structure of a protein by wet-lab experiments is time-consuming, laborious and costly. With the development of high-throughput sequencing technology, the number of known protein sequences far exceeds the number of known structures, so structure-based domain segmentation methods can no longer meet practical demands.
Sequence-based methods include homology-based methods and ab initio methods. Homology-based methods are limited in that prediction accuracy drops sharply if no good homologous template can be found. Ab initio methods overcome this limitation, since they mostly use statistical methods to predict domain boundaries; the task can be viewed as a classification problem for each residue of the protein. However, early ab initio methods were only 25%-40% accurate, because their input features captured only short-range information and ignored long-range information.
Disclosure of Invention
To overcome the limitations and low accuracy of existing protein domain boundary prediction methods, the invention provides a hybrid network model that extracts both short-range and long-range features, overcoming the shortcoming of traditional network models that attend only to short-range features. A random forest classifier is then used to classify the features and predict the probability that each residue is a boundary point, which further improves the reliability of boundary prediction.
The technical scheme adopted for solving the technical problems is as follows:
A protein domain boundary prediction method based on a hybrid network model, the method comprising the following steps:
1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH. First, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then sequences with a chain length of more than 1500 are removed, giving 78653 sequences. Second, redundancy is removed using CD-HIT at 30% sequence similarity. Finally, 20336 sequences are obtained as the dataset;
2) Extracting input features;
3) Constructing the network framework;
4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data and input into the feature extraction network; dimension reduction, feature extraction and back-propagation are performed for a set number of training rounds, finally yielding the trained model parameters;
5) Determining the boundary threshold: using the trained model, 200 examples are predicted, the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds, and the threshold corresponding to the highest MCC is taken as the final domain boundary threshold. The calculation formula is:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives of the per-residue boundary predictions;
6) Domain boundary point determination: for an input protein sequence, the trained network is used to predict the probability that each residue lies in a domain boundary region, and this probability is compared with the boundary threshold determined in step 5); residues whose probability is greater than the threshold are marked as a possible domain boundary region. The residue M with the highest probability in the possible domain boundary region is then taken as the candidate cut point. According to the secondary structure predicted in step 2), if residue M is not in a loop region, the candidate cut point is moved to the residue F in the loop region nearest to M; if residue M is in a loop region, M is the final cut point.
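The length filters of step 1) can be illustrated with a minimal sketch. It assumes, for illustration only, that each chain is given as a sequence string plus a list of contiguous domain segments with 1-based inclusive (start, end) indices; the function names are hypothetical, and the CD-HIT redundancy removal is an external tool run that is not reproduced here.

```python
# Hypothetical sketch of the step 1) length filters (not the CD-HIT step).
MIN_CHAIN_LEN = 80     # chains shorter than this are removed
MAX_CHAIN_LEN = 1500   # chains longer than this are removed
MIN_DOMAIN_LEN = 40    # chains containing a shorter domain are removed


def keep_chain(sequence, domains):
    """Return True if the chain passes all three length filters.

    sequence: amino acid string of the full chain
    domains:  list of (start, end) tuples, 1-based and inclusive (assumed layout)
    """
    chain_len = len(sequence)
    if chain_len < MIN_CHAIN_LEN or chain_len > MAX_CHAIN_LEN:
        return False
    return all(end - start + 1 >= MIN_DOMAIN_LEN for start, end in domains)


def filter_chains(chains):
    """chains: dict mapping a chain ID to a (sequence, domains) pair."""
    return {cid: rec for cid, rec in chains.items() if keep_chain(*rec)}
```

A chain of length 100 with two 50-residue domains passes, whereas a 79-residue chain, a 1501-residue chain, or a chain containing a 30-residue domain is dropped.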
Further, in step 2), the process of extracting the input features is as follows:
2.1) According to the amino acid sequence, each amino acid is encoded as follows. The 20 amino acids plus one gap are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;
2.2) According to the amino acid sequence, PSI-BLAST is used to generate a multiple sequence alignment from the NR database, from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method, as follows:
2.2.1) Acquiring the multiple sequence alignment file: using the PSI-BLAST tool with the maximum sequence similarity set to S_max = 90% and the coverage set to cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the alignment is filtered by the sequence similarity SS to obtain the effective multiple sequence alignment file, and the number of effective sequences S_val is calculated by the following formula:
where S is the number of sequences in the multiple sequence alignment file, m and n are two different sequences in the file, the indicator term equals 1 if sequences m and n have the same residue at the i-th position of the input sequence and 0 otherwise, and L is the length of the input sequence;
2.2.2) Calculating the PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which amino acid A appears at position i of the sequence is calculated by the following formula:
where N_A is the number of times amino acid A appears in the given column of the effective multiple sequence alignment file. To avoid sparse frequency data, f_i(A) is transformed as follows:
This yields a 21 x L PSSM; the amino acid frequency profile is then traversed horizontally and vertically, which changes the feature dimension to 42 x L, where L is the length of the input sequence;
2.3) According to the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH. For the secondary structure, (1, 0, 0) represents an alpha helix, (0, 1, 0) a beta sheet and (0, 0, 1) a loop region, giving a 3 x L feature; for solvent accessibility, (1, 0) and (0, 1) represent the exposed and buried states of each residue respectively, giving a 2 x L feature.
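The encodings of step 2) and their fusion into 68 features per residue (21 + 42 + 3 + 2) can be sketched as follows. This is an illustrative sketch under stated assumptions: the state letters H/E/C and E/B are hypothetical labels chosen here rather than the SCRATCH output format, unknown residue symbols are mapped to the gap class, the PSSM is taken as an already computed L x 42 list, and the result is stored residue-major (L x 68), i.e. the transpose of the 68 x L layout named in the text.

```python
# One-letter codes in the order given in the text: A:1, C:2, ..., Y:20, gap:21.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP_CODE = 21
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

# Hypothetical state labels (assumption, not the SCRATCH file format).
SS_CODE = {"H": (1, 0, 0), "E": (0, 1, 0), "C": (0, 0, 1)}  # helix, sheet, loop
SA_CODE = {"E": (1, 0), "B": (0, 1)}                        # exposed, buried


def encode(sequence):
    """Map each residue to its integer code; anything unknown becomes the gap."""
    return [CODE.get(res, GAP_CODE) for res in sequence.upper()]


def one_hot(sequence):
    """Return an L x 21 one-hot matrix as a list of rows."""
    rows = []
    for code in encode(sequence):
        row = [0] * 21
        row[code - 1] = 1
        rows.append(row)
    return rows


def encode_states(labels, table):
    """One-hot encode a per-residue state string with the given lookup table."""
    return [list(table[c]) for c in labels]


def fuse_features(seq_rows, pssm_rows, ss_rows, sa_rows):
    """Concatenate the four per-residue feature blocks into L x 68 rows."""
    length = len(seq_rows)
    if not (len(pssm_rows) == len(ss_rows) == len(sa_rows) == length):
        raise ValueError("all feature blocks must cover the same L residues")
    return [list(a) + list(b) + list(c) + list(d)
            for a, b, c, d in zip(seq_rows, pssm_rows, ss_rows, sa_rows)]
```

For a sequence of length L this gives L rows of 68 values each; transposing the result gives the 68 x L layout.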
Still further, in step 3), the process of constructing the network framework is as follows:
3.1) The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21 respectively; the multi-scale convolution layer extracts the short-range features of the input sequence;
3.2) The second layer of the network is a stack of three bidirectional LSTM (BLSTM) layers; each BLSTM consists of two LSTMs that scan the protein sequence from the N-terminus and from the C-terminus respectively. The BLSTM stack extracts the long-range features of the input sequence;
3.3) The multi-scale convolution layer and the BLSTM stack extract deep features from the input features, and a random forest then predicts, from the extracted deep features, the probability that each residue lies in a domain boundary region.
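The idea behind the multi-scale convolution layer of step 3.1) can be illustrated with a plain-Python sketch. It is not the trained network: it uses a single input channel, zero padding so that every branch keeps length L, and uniform averaging weights as placeholders for learned kernels. Only the kernel sizes 11, 15 and 21 come from the text; everything else is an assumption for illustration.

```python
def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal with zero padding.

    The output has the same length as the input, so branches with
    different kernel sizes can be stacked position by position.
    """
    k = len(kernel)
    half = k // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]


def multi_scale_conv(signal, kernel_sizes=(11, 15, 21)):
    """Apply one branch per kernel size and return the stacked outputs.

    Placeholder weights: each kernel simply averages its window; a trained
    layer would learn its weights instead.
    """
    branches = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k
        branches.append(conv1d_same(signal, kernel))
    return branches  # shape: len(kernel_sizes) x L
```

Each branch sees a window of 11, 15 or 21 residues around every position, which is why this layer captures short-range context, while the BLSTM stack that follows is responsible for long-range context.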
Further, in step 6), according to the secondary structure predicted in step 2.3), if residue M is not in a loop region, the candidate cut point is moved to the residue F in the loop region nearest to M; if residue M is in a loop region, M is the final cut point.
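The cut-point rule of step 6) can be sketched as below. The sketch assumes a single possible boundary region, 0-based residue indices, and a secondary structure string in which the hypothetical label "C" marks loop residues; breaking ties toward the lower index and falling back to M when no loop residue exists are choices made here, not stated in the text.

```python
def find_cut_point(probs, secondary_structure, threshold, loop_label="C"):
    """Return the 0-based index of the final cut point, or None.

    probs:               per-residue boundary probabilities in [0, 1]
    secondary_structure: string of per-residue states, same length as probs
    threshold:           boundary threshold determined in step 5)
    """
    # Residues above the threshold form the possible domain boundary region.
    region = [i for i, p in enumerate(probs) if p > threshold]
    if not region:
        return None

    # Candidate cut point M: the highest-probability residue in the region.
    m = max(region, key=lambda i: probs[i])
    if secondary_structure[m] == loop_label:
        return m

    # Otherwise move to the nearest loop residue F (lower index wins ties).
    loops = [i for i, s in enumerate(secondary_structure) if s == loop_label]
    if not loops:
        return m  # assumption: keep M if the chain has no predicted loop
    return min(loops, key=lambda i: (abs(i - m), i))
```

With probabilities [0.1, 0.2, 0.7, 0.9, 0.65, 0.3], states "CHHHHC" and threshold 0.61, the candidate M is index 3 (a helix residue), so the cut point moves to the nearest loop residue at index 5.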
The beneficial effects of the invention are mainly as follows: within a deep learning framework, a protein domain boundary prediction method based on a hybrid network model is provided that starts from the sequence alone, thereby avoiding the limitations of traditional structure-based, template-based and homology-based methods. Short-range and long-range features are extracted by the hybrid network model, and domain boundary points are predicted by combining secondary structure information with prior knowledge of domain boundaries, which improves the reliability of the prediction.
Description of the drawings
FIG. 1 is the overall flow chart of the deep-learning-based method for predicting protein domain boundaries from protein sequences.
FIG. 2 shows the probability, obtained by the method, that each residue lies in a domain boundary region.
FIG. 3 shows the prediction result of the method for protein T0789.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to FIG. 1 to FIG. 3, a deep-learning-based method for predicting protein domain boundaries from a protein sequence includes the following steps:
1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH. First, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then sequences with a chain length of more than 1500 are removed, giving 78653 sequences. Second, redundancy is removed using CD-HIT at 30% sequence similarity. Finally, 20336 sequences are obtained as the dataset;
2) The input feature extraction process is as follows:
2.1) According to the amino acid sequence, each amino acid is encoded as follows. The 20 amino acids plus one gap are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;
2.2) According to the amino acid sequence, PSI-BLAST is used to generate a multiple sequence alignment from the NR database (https://www.ncbi.nlm.nih.gov/), from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method, as follows:
2.2.1) Acquiring the multiple sequence alignment file: using the PSI-BLAST tool with the maximum sequence similarity set to S_max = 90% and the coverage set to cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the alignment is filtered by the sequence similarity SS to obtain the effective multiple sequence alignment file, and the number of effective sequences S_val is calculated by the following formula:
where S is the number of sequences in the multiple sequence alignment file, m and n are two different sequences in the file, the indicator term equals 1 if sequences m and n have the same residue at the i-th position of the input sequence and 0 otherwise, and L is the length of the input sequence;
2.2.2) Calculating the PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which amino acid A appears at position i of the sequence is calculated by the following formula:
where N_A is the number of times amino acid A appears in the given column of the effective multiple sequence alignment file. To avoid sparse frequency data, f_i(A) is transformed as follows:
This yields a 21 x L PSSM; the amino acid frequency profile is then traversed horizontally and vertically, which changes the feature dimension to 42 x L, where L is the length of the input sequence;
2.3) According to the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH (Scratch Protein Predictor (uci.edu)). For the secondary structure, (1, 0, 0) represents an alpha helix, (0, 1, 0) a beta sheet and (0, 0, 1) a loop region, giving a 3 x L feature; for solvent accessibility, (1, 0) and (0, 1) represent the exposed and buried states of each residue respectively, giving a 2 x L feature;
3) Constructing the network framework, as follows:
3.1) The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21 respectively; the multi-scale convolution layer extracts the short-range features of the input sequence;
3.2) The second layer of the network is a stack of three bidirectional LSTM (BLSTM) layers; each BLSTM consists of two LSTMs that scan the protein sequence from the N-terminus and from the C-terminus respectively. The BLSTM stack extracts the long-range features of the input sequence;
3.3) The multi-scale convolution layer and the BLSTM stack extract deep features from the input features, and a random forest then predicts, from the extracted deep features, the probability that each residue lies in a domain boundary region;
4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3 x L) and solvent accessibility (2 x L) are fused into 68 x L feature data and input into the feature extraction network; dimension reduction, feature extraction and back-propagation are performed for a set number of training rounds, finally yielding the trained model parameters;
5) Determining the boundary threshold: using the trained model, 200 examples are predicted, the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds, and the threshold corresponding to the highest MCC is taken as the final domain boundary threshold. The calculation formula is:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives of the per-residue boundary predictions;
6) Domain boundary point determination: for an input protein sequence, the trained network is used to predict the probability that each residue lies in a domain boundary region, and this probability is compared with the boundary threshold determined in step 5); residues whose probability is greater than the threshold are marked as a possible domain boundary region. The residue M with the highest probability in the possible domain boundary region is then taken as the candidate cut point. According to the secondary structure predicted in step 2.3), if residue M is not in a loop region, the candidate cut point is moved to the residue F in the loop region nearest to M; if residue M is in a loop region, M is the final cut point.
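The threshold selection of step 5) can be sketched as a simple sweep: for each candidate threshold the per-residue predictions are compared with the true boundary labels, the MCC is computed, and the threshold with the highest MCC is kept. The function names, the candidate grid and the convention of returning 0 when the MCC denominator is zero are assumptions made for this illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


def confusion(probs, labels, threshold):
    """Count TP, TN, FP, FN when residues with prob > threshold are positive."""
    tp = tn = fp = fn = 0
    for p, y in zip(probs, labels):
        predicted = p > threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and y:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def best_threshold(probs, labels, thresholds):
    """Return the candidate threshold with the highest MCC (first one on ties)."""
    return max(thresholds, key=lambda t: mcc(*confusion(probs, labels, t)))
```

In practice the probabilities and labels of all residues of the 200 predicted examples would be pooled before the sweep, for example over a grid such as 0.01, 0.02, ..., 0.99.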
Taking the CASP11 protein T0789, with a sequence length of 295, as an example, the deep-learning-based protein domain boundary prediction method comprises the following steps:
1) Sequence dataset construction: the full-length sequences and domain boundaries of proteins are extracted according to the domain information in the protein domain classification database CATH. First, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and then sequences with a chain length of more than 1500 are removed, giving 78653 sequences. Second, redundancy is removed using CD-HIT at 30% sequence similarity. Finally, 20336 sequences are obtained as the dataset;
2) The input feature extraction process is as follows:
2.1) According to the amino acid sequence, each amino acid is encoded as follows. The 20 amino acids plus one gap are represented as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21. The encoded protein sequence is then converted into an L x 21 matrix using one-hot encoding;
2.2) According to the amino acid sequence, PSI-BLAST is used to generate a multiple sequence alignment from the NR database (https://www.ncbi.nlm.nih.gov/), from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method, as follows:
2.2.1) Acquiring the multiple sequence alignment file: using the PSI-BLAST tool with the maximum sequence similarity set to S_max = 90% and the coverage set to cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the alignment is filtered by the sequence similarity SS to obtain the effective multiple sequence alignment file, and the number of effective sequences S_val is calculated by the following formula:
where S is the number of sequences in the multiple sequence alignment file, m and n are two different sequences in the file, the indicator term equals 1 if sequences m and n have the same residue at the i-th position of the input sequence and 0 otherwise, and L is the length of the input sequence;
2.2.2) Calculating the PSSM from the multiple sequence alignment file: first, the frequency f_i(A) with which amino acid A appears at position i of the sequence is calculated by the following formula:
where N_A is the number of times amino acid A appears in the given column of the effective multiple sequence alignment file. To avoid sparse frequency data, f_i(A) is transformed as follows:
This yields a 21 x L PSSM; the amino acid frequency profile is then traversed horizontally and vertically, which changes the feature dimension to 42 x L, where L is the length of the input sequence;
2.3) According to the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH (Scratch Protein Predictor (uci.edu)). For the secondary structure, (1, 0, 0) represents an alpha helix, (0, 1, 0) a beta sheet and (0, 0, 1) a loop region, giving a 3 x L feature; for solvent accessibility, (1, 0) and (0, 1) represent the exposed and buried states of each residue respectively, giving a 2 x L feature;
3) Building the network framework, as follows:
3.1 The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21, respectively; the multi-scale convolution layer extracts short-range features of the input sequence;
3.2 The second layer of the network is a BLSTM block made up of three stacked BLSTMs, each consisting of two LSTMs that scan the protein sequence from the N-terminus and from the C-terminus, respectively; the BLSTMs extract long-range features of the input sequence;
3.3 The multi-scale convolution layer and the BLSTMs extract deep features from the input features, and a random forest then uses the extracted deep features to predict, for each residue, the probability that it lies in a domain boundary region;
4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3*L) and solvent accessibility (2*L) features are fused into 68*L feature data and input into the feature extraction network; data dimension reduction, feature extraction and back propagation are performed for the set number of training rounds, finally yielding the trained model parameters;
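The feature fusion of step 4) amounts to stacking the four per-residue feature tracks along the channel axis (21 + 42 + 3 + 2 = 68). The sketch below is one possible way to do this with plain nested lists; the names `fuse_features` and `zeros` are hypothetical, and the all-zero inputs stand in for real features.

```python
def fuse_features(one_hot, pssm, sec_struct, solvent):
    """Stack per-residue feature tracks (channels x L) into a 68 x L matrix."""
    tracks = [
        ("one_hot", one_hot, 21),
        ("pssm", pssm, 42),
        ("sec_struct", sec_struct, 3),
        ("solvent", solvent, 2),
    ]
    length = len(one_hot[0])
    fused = []
    for name, track, channels in tracks:
        if len(track) != channels:
            raise ValueError(f"{name}: expected {channels} channels, got {len(track)}")
        if any(len(row) != length for row in track):
            raise ValueError(f"{name}: every channel must have length {length}")
        fused.extend(list(row) for row in track)
    return fused


def zeros(channels, length):
    """Placeholder feature track filled with zeros (illustrative only)."""
    return [[0.0] * length for _ in range(channels)]


L = 7  # toy sequence length
fused = fuse_features(zeros(21, L), zeros(42, L), zeros(3, L), zeros(2, L))
```

The shape checks make a mismatch between tracks fail early, before the data reaches the feature extraction network.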
5) Determining the boundary threshold: based on the trained model, 200 instances are predicted and the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds; the threshold that gives the highest MCC is used as the final domain boundary threshold. In its standard form, the MCC is calculated as follows:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively;
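The threshold search of step 5) can be sketched as a simple sweep over candidate thresholds, keeping the one with the highest MCC. The function names, the toy probabilities and labels, and the threshold grid are assumptions for illustration; returning 0.0 when the MCC denominator is zero is a common convention rather than something stated in the method.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when the denominator vanishes.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(probs, labels, thresholds):
    """Return (threshold, score) with the highest MCC; first one wins ties."""
    best_t, best_score = None, -2.0
    for t in thresholds:
        tp = tn = fp = fn = 0
        for p, y in zip(probs, labels):
            predicted = p > t
            if predicted and y == 1:
                tp += 1
            elif predicted and y == 0:
                fp += 1
            elif not predicted and y == 1:
                fn += 1
            else:
                tn += 1
        score = mcc(tp, tn, fp, fn)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy per-residue probabilities and boundary labels (illustrative only).
probs = [0.10, 0.30, 0.55, 0.65, 0.80, 0.90]
labels = [0, 0, 0, 1, 1, 1]
grid = [k / 100 for k in range(5, 100, 5)]
threshold, score = best_threshold(probs, labels, grid)
```

On this toy data any threshold between the highest negative and the lowest positive probability separates the classes perfectly, so the sweep returns an MCC of 1.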
6) Determining the domain boundary point: for the input protein sequence T0789, the trained network predicts the probability that each residue lies in a domain boundary region, and each probability is compared with the boundary threshold 0.61 determined in step 5); residues whose probability exceeds the threshold form the possible domain boundary region (residues 126 to 184). The residue M with the highest probability (glycine 159) is then selected from this region as the candidate cut point. According to the secondary structure predicted in step 2.3), if residue M is in a loop region, M is the final cut point. The prediction was correct, since the cut point T (histidine 151) lies on the same loop as the experimentally determined cut point.
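The decision rule of step 6), together with the relocation to the nearest loop residue described in the claims, can be sketched as follows. This is a simplified illustration: the function name `pick_cut_point`, the use of the letters H, E and C for helix, sheet and loop, the 0-based indices, and the fallbacks when no residue passes the threshold or no loop residue exists are all assumptions.

```python
def pick_cut_point(probs, sec_struct, threshold):
    """Return the 0-based index of the final cut point, or None.

    probs:      per-residue probability of lying in a domain boundary region
    sec_struct: per-residue secondary structure, 'H', 'E' or 'C' (loop)
    """
    # Residues above the threshold form the possible boundary region.
    candidates = [i for i, p in enumerate(probs) if p > threshold]
    if not candidates:
        return None  # assumption: no boundary is reported
    # Candidate cut point M: highest-probability residue in that region.
    m = max(candidates, key=lambda i: probs[i])
    if sec_struct[m] == "C":
        return m
    # Otherwise move to the nearest loop residue (lower index wins ties).
    loops = [i for i, s in enumerate(sec_struct) if s == "C"]
    if not loops:
        return m  # assumption: keep M when no loop residue exists
    return min(loops, key=lambda i: abs(i - m))


# Toy example: the peak at index 3 sits in a helix, so it moves to index 4.
probs = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1]
cut = pick_cut_point(probs, "CCHHCCC", threshold=0.61)
```

Moving the cut to a loop residue reflects the rule that the final cut point should not fall inside a helix or sheet.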
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic spirit thereof, and the scope thereof is determined by the claims that follow.
Claims (4)
1. A method for predicting protein domain boundaries based on a hybrid network model, the method comprising the steps of:
1) Sequence dataset construction: extracting the full-length sequences and domain boundaries of proteins according to the domain information of the protein domain classification database CATH; first, protein sequences with a chain length of less than 80 or a domain length of less than 40 are removed from the extracted sequences, and protein sequences with a chain length of more than 1500 are then removed, giving 78653 sequences; second, redundancy is removed using CD-HIT at 30% sequence similarity; finally, 20336 sequences are obtained as the dataset;
2) Extracting input features;
3) Constructing a network frame;
4) Training model parameters: the sequence encoding (21 x L), PSSM (42 x L), secondary structure (3*L) and solvent accessibility (2*L) features are fused into 68 x L feature data and input into the feature extraction network; data dimension reduction, feature extraction and back propagation are performed for the set number of training rounds, finally yielding the trained model parameters;
5) Determining the boundary threshold: based on the trained model, 200 instances are predicted and the Matthews correlation coefficient (MCC) is calculated for different domain boundary thresholds; the threshold that gives the highest MCC is used as the final domain boundary threshold. In its standard form, the MCC is calculated as follows:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively;
6) Determining the domain boundary point: for the input protein sequence, the trained network predicts the probability that each residue lies in a domain boundary region, and each probability is compared with the boundary threshold determined in step 5); residues whose probability exceeds the threshold form the possible domain boundary region; the residue M with the highest probability is then selected from the possible domain boundary region as the candidate cut point; according to the secondary structure predicted in step 2), if residue M is not in a loop region, the candidate cut point is moved to the loop-region residue F nearest to M, and if residue M is in a loop region, M is the final cut point.
2. The method for predicting protein domain boundaries based on a hybrid network model as claimed in claim 1, wherein in step 2), the process of input feature extraction is as follows:
2.1 Each amino acid is encoded according to the amino acid sequence, with the 20 amino acids plus one gap symbol expressed as: alanine A: 1, cysteine C: 2, aspartic acid D: 3, glutamic acid E: 4, phenylalanine F: 5, glycine G: 6, histidine H: 7, isoleucine I: 8, lysine K: 9, leucine L: 10, methionine M: 11, asparagine N: 12, proline P: 13, glutamine Q: 14, arginine R: 15, serine S: 16, threonine T: 17, valine V: 18, tryptophan W: 19, tyrosine Y: 20, gap: 21; the encoded protein sequence is then converted into a matrix of L x 21 using one-hot encoding;
2.2 According to the amino acid sequence, PSIBLAST is used to generate a multiple sequence alignment from the NR database, from which a position-specific scoring matrix (PSSM) is constructed as one of the input features of the method, comprising the following steps:
2.2.1 Acquiring a multiple sequence alignment file: using the PSIBLAST tool with maximum sequence similarity S_max = 90% and coverage cov = 75%, the NR sequence database is searched to obtain a multiple sequence alignment file composed of homologous sequences of the target sequence; the alignment file is then filtered by the sequence similarity SS to obtain an effective multiple sequence alignment file, and the number of effective sequences S_val is calculated by the following formula:
where S is the number of sequences in the multiple sequence alignment file, m and n are two distinct sequences in the file, the indicator term equals 1 if sequences m and n have the same residue at the i-th position of the input sequence and 0 otherwise, and L is the length of the input sequence;
2.2.2 Calculating the PSSM of the sequence from the multiple sequence alignment file: first, the frequency f_i(A) with which amino acid A appears at position i of the sequence is calculated by the following formula:
where N_A is the number of occurrences of amino acid A in a given column of the effective multiple sequence alignment file; to avoid sparse frequency data, the frequency f_i(A) is transformed as follows:
a PSSM of 21 x L is obtained; horizontal and vertical traversals are then performed on the amino acid frequency features, which changes the feature dimension to 42 x L, where L is the length of the input sequence;
2.3 According to the amino acid sequence, the secondary structure and solvent accessibility of the protein are predicted using SCRATCH, where (1, 0, 0) represents an alpha helix, (0, 1, 0) a beta sheet and (0, 0, 1) a loop region, giving a secondary structure feature of dimension 3*L; solvent accessibility is represented as a 2*L-dimensional feature, using (1, 0) and (0, 1) for the exposed and buried states of each residue, respectively.
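The encodings recited above can be illustrated with a short sketch. The amino acid numbering follows the list in this claim; the helper names, the mapping of unrecognised symbols to the gap index, and the single-letter state labels ('H', 'E', 'C' for secondary structure, 'e' and 'b' for exposed and buried) are assumptions and may differ from the labels actually emitted by SCRATCH.

```python
# Amino acid order matching the numbering 1..20; the gap symbol is 21.
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: k + 1 for k, aa in enumerate(AA_ORDER)}
GAP_INDEX = 21

# One-hot vectors for predicted states (the label letters are hypothetical).
SS_ONE_HOT = {"H": (1, 0, 0), "E": (0, 1, 0), "C": (0, 0, 1)}
ACC_ONE_HOT = {"e": (1, 0), "b": (0, 1)}


def encode_sequence(seq):
    """Map each residue to its integer code; unknown symbols become gaps."""
    return [AA_TO_INDEX.get(aa, GAP_INDEX) for aa in seq.upper()]


def one_hot(indices, size=21):
    """Convert integer codes (1-based) into an L x size one-hot matrix."""
    return [[1 if idx == k + 1 else 0 for k in range(size)] for idx in indices]


codes = encode_sequence("ACDY-")
matrix = one_hot(codes)
ss_features = [SS_ONE_HOT[s] for s in "HHECC"]
acc_features = [ACC_ONE_HOT[a] for a in "ebbee"]
```

Here `matrix` has one row per residue (L x 21), while `ss_features` and `acc_features` hold the 3-dimensional and 2-dimensional state vectors per residue.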
3. The method for predicting protein domain boundaries based on a hybrid network model as claimed in claim 2, wherein in step 3), the process of constructing the network framework is as follows:
3.1 The first layer of the network is a one-dimensional multi-scale convolution layer formed by combining three one-dimensional convolution layers with kernel sizes of 11, 15 and 21, respectively; the multi-scale convolution layer extracts short-range features of the input sequence;
3.2 The second layer of the network is a BLSTM block made up of three stacked BLSTMs, each consisting of two LSTMs that scan the protein sequence from the N-terminus and from the C-terminus, respectively; the BLSTMs extract long-range features of the input sequence;
3.3 The multi-scale convolution layer and the BLSTMs extract deep features from the input features, and a random forest then uses the extracted deep features to predict, for each residue, the probability that it lies in a domain boundary region.
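The multi-scale convolution of step 3.1 can be illustrated in a heavily simplified form: a single input channel is filtered with three kernels of sizes 11, 15 and 21, and the outputs are kept side by side. The zero padding, the uniform averaging weights and the function names are assumptions made for illustration; a trained network would learn its weights over all 68 input channels, and the BLSTM and random forest stages are not shown.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):  # positions outside the sequence count as zero
                acc += w * signal[j]
        out.append(acc)
    return out


def multi_scale_conv(signal, kernel_sizes=(11, 15, 21)):
    """Apply one filter per kernel size and stack the resulting channels."""
    # Placeholder weights: uniform averaging kernels instead of learned ones.
    return [conv1d_same(signal, [1.0 / size] * size) for size in kernel_sizes]


signal = [1.0] * 40  # toy single-channel input of length 40
features = multi_scale_conv(signal)
```

Each kernel size sees a different window around a residue, which is how the three branches capture short-range context at several scales while keeping one output value per residue.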
4. The method according to claim 2, wherein in the step 6), if the M residue is not in the loop region, the candidate cut point is moved to the residue F in the loop region nearest to the M residue, and if the M residue is in the loop region, the M is the final cut point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310215993.3A CN116230075A (en) | 2023-03-08 | 2023-03-08 | Protein domain boundary prediction method based on hybrid network model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116230075A (en) | 2023-06-06 |
Family
ID=86578445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310215993.3A Withdrawn CN116230075A (en) | 2023-03-08 | 2023-03-08 | Protein domain boundary prediction method based on hybrid network model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116230075A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118038990A (en) * | 2024-04-11 | 2024-05-14 | 山东大学 | Multi-level chromatin topological structure domain identification method and system based on community discovery |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118038990A (en) * | 2024-04-11 | 2024-05-14 | 山东大学 | Multi-level chromatin topological structure domain identification method and system based on community discovery |
CN118038990B (en) * | 2024-04-11 | 2024-07-16 | 山东大学 | Multi-level chromatin topological structure domain identification method and system based on community discovery |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20230606 |