CN113643756A - Protein interaction site prediction method based on deep learning - Google Patents

Protein interaction site prediction method based on deep learning

Info

Publication number
CN113643756A
Authority
CN
China
Prior art keywords
residues
model
data
dimensional
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110909991.5A
Other languages
Chinese (zh)
Inventor
王兵
李敏杰
米春风
杨海娟
王子
周阳
汪文艳
卢琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Technology AHUT
Original Assignee
Anhui University of Technology AHUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Technology AHUT filed Critical Anhui University of Technology AHUT
Priority to CN202110909991.5A priority Critical patent/CN113643756A/en
Publication of CN113643756A publication Critical patent/CN113643756A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a deep learning-based protein interaction site prediction method, which belongs to the technical field of bioinformatics analysis. Because non-interacting residues far outnumber interacting residues in protein sequences, a down-sampling strategy is adopted to eliminate the class imbalance and obtain a high-quality, low-bias data set. The balanced data set is divided into a training set and a test set; a variational autoencoder further extracts high-level abstract features of the protein sequence from the training set, and a multilayer perceptron classifies the amino acid residues. The trained model is tested on the test set to obtain the prediction results. The method has low computational cost and high prediction accuracy.

Description

Protein interaction site prediction method based on deep learning
Technical Field
The invention relates to the technical field of bioinformatics analysis, in particular to a deep learning-based protein interaction site prediction method.
Background
As one of the most common molecules in cells, proteins are of great importance for regulating various metabolic pathways and numerous biological processes. Generally, proteins do not act in isolation; they perform their respective tasks by interacting with each other, i.e., through protein-protein interactions (abbreviated PPIs hereinafter). In addition, the study of protein interactions provides a new perspective for medical diagnosis and treatment and promotes the design of new drugs and the development of biomedicine. Therefore, predicting PPIs has become a fundamental topic of systems biology and has attracted increasing attention.
At present, methods for predicting protein interactions fall into two main categories: biological methods and computational methods. In the traditional biological field, interaction data can be collected by yeast two-hybrid screening, protein chips, synthetic lethality analysis and other techniques. However, these methods are time-consuming and labor-intensive, their prediction efficiency is limited, and a considerable proportion of false negatives and false positives is frequently observed in their results. Therefore, with the rapid development of computer technology, computational methods, which were originally only an auxiliary means, have become the mainstream approach for predicting protein interactions.
Many methods for predicting protein-protein interactions and interaction interfaces have been proposed. The Chinese patent application with publication number CN111210871A, published on May 29, 2020, discloses a protein-protein interaction prediction method based on deep forests: the sequence information, physicochemical property information and evolutionary information of a protein pair are fused as the initial features of a sample, an elastic net is used for feature selection to eliminate redundant and irrelevant features, and the fused optimal feature vector is fed into a multi-granularity cascade deep forest to predict the protein-protein interaction. The Chinese patent application with publication number CN112259157A, published on January 22, 2021, discloses a protein interaction prediction method based on a sampling strategy for non-interacting protein pairs that fuses biological semantics, in which protein pairs with different molecular functions, biological processes and cellular components are sampled and combined according to GO term semantic similarity to obtain a subset of non-interacting pairs (NIPs).
The above methods still leave problems to be solved, which restrict the development of protein interaction prediction: (1) how to extract protein features and represent sequence information; (2) the impact of imbalanced protein interaction sample data; (3) how to efficiently select and design PPIs classifiers; (4) existing prediction models cannot adequately handle the massive amount of protein interaction data. A protein interaction site prediction method based on deep learning is therefore provided.
Disclosure of Invention
The technical problem to be solved by the invention is to address the issues identified in the background art. A deep learning-based protein interaction site prediction method is provided, in which features are extracted from the primary sequence information of the protein, the influence of class imbalance in the data set is eliminated, and a variational autoencoder and a multilayer perceptron algorithm are used to classify protein residues.
The invention solves the above technical problems through the following technical scheme, which comprises the following steps:
S1: Data collection
Obtaining protein sequence information from a publicly available protein interaction data set used as the reference data set;
S2: Feature extraction
Generating a position-specific scoring matrix from the protein sequence information and extracting physicochemical features of the protein sequence;
S3: Feature fusion
Extracting the features of each residue and its adjacent residues with a sliding-window technique, combining all extracted features into a feature-space data set, and then attaching the label of each residue to form the input of the model;
S4: Class balancing
Sampling representative samples with a down-sampling method to obtain a class-balanced data set for training the model;
S5: Building the classifier
Dividing the class-balanced data set into a training set and a test set in proportion, further extracting high-level abstract features of the protein sequence from the training set with a variational autoencoder, and classifying the amino acid residues with a multilayer perceptron;
S6: Model evaluation and validation
Testing the trained model on the test set to obtain the prediction results, and using a public independent data set as a validation set to verify the robustness of the model.
Further, in step S1, the reference data sets are Dset186 and Dtestset72, which contain 186 and 72 protein sequences, respectively. Dset186 is used to train the model and Dtestset72 serves as an independent validation set; the amino acid residues marked as interaction sites in the data sets are taken as positive samples, and the residues marked as non-interaction sites as negative samples.
Still further, in step S2, the protein sequence features in the data set consist of the following three types: position-specific scoring matrix (PSSM) features, the hydropathy index (HI) and the relative solvent accessibility (RSA). The PSSM is a 20-dimensional matrix generated by PSI-BLAST that describes the evolutionary conservation information of the 20 amino acids, while HI and RSA are both 1-dimensional numerical protein sequence features.
Further, the specific processing procedure of step S3 is:
S31: A sliding window of size 9 is placed on the PSSM features. For each row of residues, the feature values of that residue and its 8 adjacent residues are extracted and their average is taken as the updated value of that residue; the sliding window is applied to each row in turn, giving another 20-dimensional feature vector. Together with the original 20-dimensional PSSM features without the sliding window, the PSSM features therefore have 40 dimensions;
S32: Sliding windows of sizes 1, 3, 5, 7 and 9 are applied in the same feature-averaging manner to obtain a 5-dimensional hydropathy index feature vector and a 5-dimensional relative solvent accessibility feature vector;
S33: The augmented feature data are combined, each row of features representing one amino acid residue with 50 feature dimensions; together with the label of each extracted amino acid residue this forms a 51-dimensional data set used as the input of the model.
Further, in step S33, the label value of an amino acid residue is -1 or 1: -1 represents a negative sample, i.e. a non-interacting residue, and 1 represents a positive sample, i.e. an interacting residue.
Further, the processing procedure of step S4 is as follows: the NearMiss down-sampling algorithm measures the distance between positive and negative samples with the K-nearest-neighbor rule and selects and retains the negative samples whose average distance to the farthest positive samples is smallest, until the ratio of positive to negative samples, i.e. of interacting to non-interacting residues, is 1:1.
Further, in step S5, the class-balanced data set is divided into a training set and a test set at a ratio of 8:2, and the model is trained on the training set. The model comprises a variational autoencoder and a multilayer perceptron classifier connected in sequence. During training, the training set is fed into the variational autoencoder (VAE), which uses a neural network to learn the data features automatically and eliminate redundant features; the 30-dimensional abstract features of the intermediate hidden layer of the network are then extracted for the downstream classification task. These 30-dimensional abstract features are fed into a multilayer perceptron (MLP) classifier to identify whether each residue is an interacting residue.
Furthermore, the variational autoencoder comprises an encoder and a decoder. In the encoder, the input data pass through a fully connected layer FC1 with 512 neurons followed by a Dropout layer, and then through two separate fully connected layers to obtain the mean (z_mean) and the logarithm of the variance (z_log_var) of 30 Gaussian distributions; random noise (epsilon) drawn from a Gaussian distribution is introduced, and a Lambda layer linearly fuses epsilon, z_mean and z_log_var to obtain the intermediate hidden variable z. This process is called the sampling process. In the decoder, the intermediate hidden variable z passes through a fully connected layer FC1 and a Dropout layer and is then connected to a fully connected layer that outputs 50-dimensional data similar to the input.
Further, the multilayer perceptron classifier comprises a Lambda layer, three fully connected layers and a Dropout layer. The Lambda layer combines the mean (z_mean) and the logarithm of the variance (z_log_var) for the transfer of data within the neural network, and a softmax function finally performs binary classification of the residues.
Further, in step S6, the trained model is tested on the test set to obtain the accuracy, recall, precision, F1-value and MCC value of the model prediction; Dtestset72 is used as an independent validation set and is processed through the same steps S1-S6 without changing any parameter of the model, and the resulting evaluation indices verify the generalization capability of the model.
Compared with the prior art, the invention has the following advantages. The deep learning-based protein interaction site prediction method handles the imbalanced sample distribution with the NearMiss down-sampling algorithm, which prevents the model from being biased towards the majority class in order to maximize overall prediction accuracy while neglecting the prediction accuracy of the minority class, i.e. the interacting residues. The NearMiss algorithm deletes part of the majority class according to the distances between samples and takes global information into account during sampling, so the retained majority-class data are more representative; in addition, down-sampling also increases the running speed of the model. The feature-space data set is further extracted and compressed with a variational autoencoder, an unsupervised learning algorithm that compresses the input and extracts its most representative information, with the aim of reducing the dimensionality of the input and lightening the processing burden of the neural network while ensuring that important features are not lost. In short, it extracts higher-level, more abstract features of the input and facilitates the subsequent classification work, so the method is worthy of popularization and application.
Drawings
FIG. 1 is a schematic flow chart of a deep learning-based protein interaction site prediction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the combination of the variational autoencoder and the multilayer perceptron classifier according to an embodiment of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
The embodiment provides the following technical scheme: a deep learning-based protein interaction site prediction method in which features are extracted from the primary sequence information of the protein, the influence of class imbalance in the data set is eliminated, and a variational autoencoder and a multilayer perceptron algorithm are used to classify protein residues.
Referring to fig. 1, the method for predicting protein interaction sites based on deep learning of the present invention specifically comprises the following steps:
1) The published protein interaction data set is downloaded as the reference data set to obtain the sequence information of the proteins.
The data sets selected for this embodiment are the public reference data sets Dset186 and Dtestset72, which were constructed in 2010 by the Japanese researchers Murakami and Mizuguchi and contain 186 and 72 protein sequences, respectively; Dset186 is used to train the model and Dtestset72 serves as the independent validation set. For a protein amino acid, if the absolute solvent accessible surface area lost upon binding to another amino acid is not less than the threshold value given in the original formula image (not reproduced here), the amino acid is an interacting residue; otherwise it is a non-interacting residue. The amino acid residues marked as interaction sites in the data set are taken as positive samples and labeled 1; the residues marked as non-interaction sites are taken as negative samples and labeled -1. The ratios of interacting to non-interacting residues and of interacting residues to the total number of residues in Dset186 and Dtestset72 are shown in Table 1.
Table 1. Number of residues in the data sets
[Table 1 is provided as an image in the original publication and is not reproduced here.]
2) A position-specific scoring matrix is generated from the protein sequence information, and the physicochemical features of the protein sequence are extracted.
The protein sequence features in the data set consist of three types: position-specific scoring matrix (PSSM) features, the hydropathy index (HI) and the relative solvent accessibility (RSA). The PSSM is generated by running the PSI-BLAST algorithm against NCBI's non-redundant (NR) sequence database for three iterations with an e-value threshold of 0.001, and each amino acid is encoded as a vector of 20 elements. A PSSM, or position-specific scoring matrix, is the matrix used in protein BLAST searches in which an amino acid substitution score is given separately for each position of a protein multiple sequence alignment; a Tyr-Trp substitution at position A of the alignment may therefore receive a very different score from the same substitution at position B. PSSM scores are usually shown as positive or negative integers: a positive value indicates that the given amino acid substitution occurs at that position more frequently than expected by chance, while a negative value indicates that it occurs less frequently than expected. The PSSM is obtained by submitting the protein sequence as a FASTA file. The hydropathy index is another important feature for identifying PPIs; the hydropathy index of an amino acid is a value describing the degree of hydrophilicity or hydrophobicity of its side chain. The hydropathy index was introduced in 1982 by Jack Kyte and Russell Doolittle, and the larger the index, the more hydrophobic the amino acid. Because the absolute solvent accessible surface area of a protein varies in magnitude and its physical properties cannot be judged from this value alone, the solvent accessible surface area is converted into relative solvent accessibility for further analysis of protein structure and properties; the relative solvent accessibility is obtained with the online server SANN. Both HI and RSA are 1-dimensional numerical protein sequence features.
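The PSSM generation described above can be reproduced, for example, with the NCBI BLAST+ psiblast program. The following Python sketch shows one way to run it and to parse the resulting ASCII matrix; the file paths and database name are placeholders, and the parser assumes the standard layout written by the -out_ascii_pssm option (an illustration, not code from the patent).

```python
# Hypothetical helpers for generating and loading a PSSM with NCBI BLAST+.
import subprocess

import numpy as np

def run_psiblast(fasta_path: str, db: str, pssm_path: str) -> None:
    """Run three PSI-BLAST iterations (e-value 0.001) and write the ASCII PSSM."""
    subprocess.run(
        ["psiblast",
         "-query", fasta_path,        # protein sequence in FASTA format
         "-db", db,                   # e.g. a local copy of the NR database
         "-num_iterations", "3",
         "-evalue", "0.001",
         "-out_ascii_pssm", pssm_path],
        check=True)

def load_pssm(pssm_path: str) -> np.ndarray:
    """Parse the first 20 score columns of an ASCII PSSM into an (L, 20) array."""
    rows = []
    with open(pssm_path) as fh:
        for line in fh:
            fields = line.split()
            # Data rows start with the residue index followed by the amino acid.
            if len(fields) >= 22 and fields[0].isdigit():
                rows.append([int(x) for x in fields[2:22]])
    return np.asarray(rows, dtype=np.float32)
```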
3) Considering that whether an amino acid residue in a protein chain is an interacting residue is closely related to the properties of its adjacent residues, a sliding-window technique is used to extract the features of each residue and its neighbors; all extracted features are combined into a feature-space data set, which is then combined with the label of each residue to form the input of the model.
A sliding window of size (2n+1) means that the target amino acid at the center and its 2n adjacent amino acids are taken as the input features of the target amino acid. First, a sliding window of size 9 is placed on the PSSM features: for each row of residues, the feature values of that residue and its 8 adjacent residues are extracted and their average is taken as the updated value of that residue. Applying the window to each row in turn yields another 20-dimensional feature vector, which, together with the original 20-dimensional PSSM features, gives 40 PSSM feature dimensions in total. Then, sliding windows of sizes 1, 3, 5, 7 and 9 are applied in the same feature-averaging manner to obtain a 5-dimensional hydropathy index feature vector and a 5-dimensional relative solvent accessibility feature vector. For a given protein sequence, the first four and last four residues of the chain are discarded after the sliding-window step, because a window of size 9 cannot be applied to them. The augmented feature data are combined, each row representing one amino acid residue with 50 feature dimensions; together with the label of each residue this forms a 51-dimensional data set used as the input of the model. The label value of an amino acid residue is -1 or 1, where -1 represents a negative sample, i.e. a non-interacting residue, and 1 represents a positive sample, i.e. an interacting residue.
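As an illustration of this windowing scheme, the sketch below averages the PSSM rows over a window of 9 and the HI/RSA values over windows of 1, 3, 5, 7 and 9, drops the four boundary residues at each end and appends the -1/1 labels to form 51-dimensional rows. The array names are assumptions; only the window sizes and dimensions follow the text.

```python
import numpy as np

def window_average(feature: np.ndarray, window: int) -> np.ndarray:
    """Average a per-residue feature over a centered window of odd size."""
    half = window // 2
    out = np.empty_like(feature, dtype=np.float32)
    for i in range(len(feature)):
        lo, hi = max(0, i - half), min(len(feature), i + half + 1)
        out[i] = feature[lo:hi].mean(axis=0)
    return out

def build_features(pssm, hi, rsa, labels):
    """Assemble the 50 feature dimensions plus the label column (51 in total)."""
    parts = [pssm, window_average(pssm, 9)]                              # 20 + 20
    parts += [window_average(hi, w)[:, None] for w in (1, 3, 5, 7, 9)]   # 5
    parts += [window_average(rsa, w)[:, None] for w in (1, 3, 5, 7, 9)]  # 5
    data = np.hstack(parts + [labels[:, None]])
    # The first and last four residues of the chain are discarded, as described
    # above, so the edge clamping inside window_average never reaches the output.
    return data[4:-4]
```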
4) To address the class imbalance of the data set, representative samples are selected with a down-sampling technique to obtain a class-balanced data set for training the model.
Using a data set with an imbalanced sample distribution leads to poor accuracy and robustness of the model, which is a common problem in PPIs prediction. The class imbalance of the data set is handled as follows: the NearMiss down-sampling algorithm measures the distance between positive and negative samples with the K-nearest-neighbor rule. The NearMiss algorithm has three versions with different selection rules, and experiments show that version 2 gives the best prediction results. The rule of version 2 is to select and retain the negative samples whose average distance to the farthest positive samples is smallest, until the ratio of positive to negative samples, i.e. of interacting to non-interacting residues, is 1:1.
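A minimal sketch of this balancing step, using the NearMiss implementation in the imbalanced-learn package, is given below; the n_neighbors value and variable names are illustrative, and version=2 corresponds to the rule described above.

```python
import numpy as np
from imblearn.under_sampling import NearMiss

def balance(dataset: np.ndarray):
    """Undersample the majority class (label -1) until the ratio is 1:1."""
    X, y = dataset[:, :-1], dataset[:, -1].astype(int)
    sampler = NearMiss(version=2, n_neighbors=3)
    X_bal, y_bal = sampler.fit_resample(X, y)
    return X_bal, y_bal
```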
5) The class-balanced data set is divided into a training set and a test set in proportion; a variational autoencoder further extracts high-level abstract features of the protein sequence from the training set, and a multilayer perceptron classifies the amino acid residues.
First, the class-balanced data set is divided into a training set and a test set at a ratio of 8:2, and the model is trained on the training set. The training set is fed into the constructed variational autoencoder (VAE), which uses a neural network to learn the data features automatically and eliminate redundant features; the 30-dimensional abstract features of the intermediate hidden layer of the network are then extracted for the downstream classification task. Because the features extracted by the variational autoencoder are highly representative, a simple multilayer perceptron (MLP) classifier follows the variational autoencoder to identify whether a residue is an interacting residue.
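The 8:2 split can be done, for example, with scikit-learn; the stratification and random seed below are assumptions, not requirements stated in the text.

```python
from sklearn.model_selection import train_test_split

# X_bal, y_bal are the balanced arrays from the sketch in step 4.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0)
```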
With reference to FIG. 2, the VAE is constructed as follows: an unsupervised learning network is designed whose input and output are the training set obtained after splitting the protein data set, and the deep neural network makes the output as similar to the input as possible. The invention adds a softmax multilayer perceptron on top of the variational autoencoder network to obtain dimensionality reduction and classification output simultaneously. The 50-dimensional feature data of the training set are first encoded by the encoder to obtain the intermediate hidden layer, and the decoder then reconstructs the input 50-dimensional feature vector. The variational autoencoder comprises an encoder and a decoder. In the encoder, the input data pass through a fully connected layer FC1 with 512 neurons followed by a Dropout layer, and then through two separate fully connected layers to obtain the mean (z_mean) and the logarithm of the variance (z_log_var) of 30 Gaussian distributions; random noise (epsilon) drawn from a Gaussian distribution is introduced, and a Lambda layer linearly fuses epsilon, z_mean and z_log_var to obtain the intermediate hidden variable z. This process is called the sampling process. In the decoder, the intermediate hidden variable z passes through a fully connected layer FC1 and a Dropout layer and is then connected to a fully connected layer that outputs 50-dimensional data similar to the input. The dimension of the intermediate hidden layer is set to 30, lower than the 50 dimensions of the originally constructed data set, so that useful features are extracted while the feature dimensionality is reduced, making the model faster and the prediction more accurate. The VAE uses the KL divergence together with the reconstruction error as its loss function.
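A possible tf.keras realization of this architecture is sketched below. The layer sizes follow the description (FC1 with 512 neurons, 30-dimensional latent variables, 50-dimensional input and reconstruction); the dropout rate, activation functions, optimizer and the use of mean squared error as the reconstruction loss are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

INPUT_DIM, LATENT_DIM = 50, 30

def sampling(args):
    """Reparameterization: z = z_mean + exp(0.5 * z_log_var) * epsilon."""
    z_mean, z_log_var = args
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon

class KLLoss(layers.Layer):
    """Adds the KL divergence between q(z|x) and N(0, I) to the model loss."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
            1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
        self.add_loss(kl)
        return z_mean

# Encoder: FC1(512) -> Dropout -> two parallel Dense layers for z_mean and
# z_log_var -> Lambda sampling layer producing the hidden variable z.
inputs = layers.Input(shape=(INPUT_DIM,))
h = layers.Dense(512, activation="relu", name="FC1")(inputs)
h = layers.Dropout(0.3)(h)
z_mean = layers.Dense(LATENT_DIM, name="z_mean")(h)
z_log_var = layers.Dense(LATENT_DIM, name="z_log_var")(h)
z_mean = KLLoss()([z_mean, z_log_var])
z = layers.Lambda(sampling, name="z")([z_mean, z_log_var])

# Decoder: 512-unit layer -> Dropout -> 50-dimensional reconstruction.
d = layers.Dense(512, activation="relu")(z)
d = layers.Dropout(0.3)(d)
reconstruction = layers.Dense(INPUT_DIM)(d)

vae = Model(inputs, reconstruction, name="vae")
vae.compile(optimizer="adam", loss="mse")  # reconstruction error + KL term
# vae.fit(X_train, X_train, epochs=50, batch_size=128)  # unsupervised training
```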
The multilayer perceptron classifier comprises a Lambda layer, three fully connected layers and a Dropout layer. The Lambda layer combines the mean (z_mean) and the logarithm of the variance (z_log_var) for the transfer of data within the neural network, and a softmax function finally performs binary classification of the residues.
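Continuing the sketch above, one possible form of this classifier head is shown below; the hidden-layer sizes, dropout rate and the choice of the same reparameterization sampling as the Lambda combination of z_mean and z_log_var are assumptions.

```python
from tensorflow.keras import layers, Model

# Lambda layer combining z_mean and z_log_var from the shared encoder.
latent = layers.Lambda(sampling, name="mlp_lambda")([z_mean, z_log_var])
m = layers.Dense(64, activation="relu")(latent)
m = layers.Dropout(0.3)(m)
m = layers.Dense(32, activation="relu")(m)
probs = layers.Dense(2, activation="softmax")(m)  # third fully connected layer

# Because the encoder is shared with the VAE above, the KL term added by the
# KLLoss layer is also part of this model's loss.
classifier = Model(inputs, probs, name="vae_mlp")
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])

# The data set stores labels as -1/1; map them to 0/1 before training, e.g.
# y01_train = (y_train == 1).astype("int32")
# classifier.fit(X_train, y01_train, epochs=50, batch_size=128)
```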
6) The trained model is tested on the test set to obtain evaluation indices such as prediction accuracy, recall and F1-value. To verify the effectiveness of the model, a public independent data set is used as a validation set to verify its robustness.
The protein interaction sites are obtained through the classification model and are then tested and evaluated on the test set. The evaluation indices are as follows:
Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall:
Recall = TP / (TP + FN)
Precision:
Precision = TP / (TP + FP)
F1-value:
F1 = 2 × Precision × Recall / (Precision + Recall)
MCC value:
MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Here TP is the number of true positives, i.e. positive samples predicted correctly; TN is the number of true negatives, i.e. negative samples predicted correctly; FP is the number of false positives, i.e. negative samples incorrectly predicted as positive; FN is the number of false negatives, i.e. positive samples incorrectly predicted as negative. The F1-value is the weighted harmonic mean of precision and recall, combining the two results; the higher the F1-value, the more effective the method. MCC is a good measure for imbalanced problems and is essentially a correlation coefficient between the true and predicted values, ranging from -1 to 1, where -1 represents the worst prediction and 1 the best.
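These indices can be computed, for example, with scikit-learn; in the sketch below y_true holds the -1/1 labels and y_prob the predicted probability of the positive class (the names are illustrative).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """Threshold the positive-class probability at 0.5 and compute the indices."""
    y_pred = np.where(y_prob >= 0.5, 1, -1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, pos_label=1),
        "precision": precision_score(y_true, y_pred, pos_label=1),
        "F1": f1_score(y_true, y_pred, pos_label=1),
        "MCC": matthews_corrcoef(y_true, y_pred),
    }
```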
The results of this example for the recognition of protein interaction sites are detailed in Table 2.
Table 2. Classification performance evaluation of the model
Data set      Accuracy   Recall   Precision   F1-value   MCC
Dset186       0.855      0.758    0.938       0.838      0.722
Dtestset72    0.763      0.680    0.823       0.744      0.535
As can be seen from Table 2, the classification accuracy of this embodiment on Dset186 reaches 85.5% and the recall reaches 75.8%, indicating that interacting residues are predicted correctly at a good level. The F1-value and MCC reach 83.8% and 72.2%, respectively, showing that the overall classification performance of the model is good and that it can accurately predict whether a residue is an interacting or a non-interacting residue. Because Dtestset72 contains less data, its evaluation indices are not as good as those of Dset186 but remain at a good level, indicating that the model is robust.
To further evaluate the performance of the model, it is compared with four existing models: PSIVER, LORIS, CRF and SSWRF. The model of this embodiment is denoted VAEMLP. The comparison of VAEMLP with the other models on the Dset186 and Dtestset72 data sets is shown in Table 3.
Table 3. Comparison with existing protein interaction prediction methods
[Table 3 is provided as an image in the original publication and is not reproduced here.]
The results in Table 3 show that the evaluation indices of the model constructed by the invention are significantly improved and that it can effectively identify protein interaction sites.
In summary, the deep learning-based protein interaction site prediction method of the above embodiment handles the imbalanced sample distribution with the NearMiss down-sampling algorithm, which prevents the model from being biased towards the majority class in order to maximize overall prediction accuracy while neglecting the prediction accuracy of the minority class, i.e. the interacting residues. The NearMiss algorithm deletes part of the majority class according to the distances between samples and takes global information into account during sampling, so the retained majority-class data are more representative; in addition, down-sampling also increases the running speed of the model. The feature-space data set is further extracted and compressed with a variational autoencoder, an unsupervised learning algorithm that compresses the input and extracts its most representative information, with the aim of reducing the dimensionality of the input and lightening the processing burden of the neural network while ensuring that important features are not lost; in short, it extracts higher-level, more abstract features of the input and facilitates the subsequent classification work.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A protein interaction site prediction method based on deep learning is characterized by comprising the following steps:
S1: Data collection
Obtaining protein sequence information from a publicly available protein interaction data set used as the reference data set;
S2: Feature extraction
Generating a position-specific scoring matrix from the protein sequence information and extracting physicochemical features of the protein sequence;
S3: Feature fusion
Extracting the features of each residue and its adjacent residues with a sliding-window technique, combining all extracted features into a feature-space data set, and then attaching the label of each residue to form the input of the model;
S4: Class balancing
Sampling representative samples with a down-sampling method to obtain a class-balanced data set for training the model;
S5: Building the classifier
Dividing the class-balanced data set into a training set and a test set in proportion, further extracting high-level abstract features of the protein sequence from the training set with a variational autoencoder, and classifying the amino acid residues with a multilayer perceptron;
S6: Model evaluation and validation
Testing the trained model on the test set to obtain the prediction results, and using a public independent data set as a validation set to verify the robustness of the model.
2. The method for predicting protein interaction sites based on deep learning of claim 1, wherein: in step S1, the reference data sets are Dset186 and Dtestset72, which contain 186 and 72 protein sequences, respectively; Dset186 is used to train the model and Dtestset72 serves as an independent validation set; the amino acid residues marked as interaction sites in the data sets are taken as positive samples, and the residues marked as non-interaction sites as negative samples.
3. The method for predicting protein interaction sites based on deep learning of claim 1, wherein: in step S2, the protein sequence features in the data set comprise the following three types: PSSM features, HI and RSA; the PSSM is a 20-dimensional matrix generated by PSI-BLAST that describes the evolutionary conservation information of the 20 amino acids, and HI and RSA are both 1-dimensional numerical protein sequence features.
4. The method for predicting protein interaction sites based on deep learning of claim 3, wherein: the specific processing procedure of step S3:
S31: A sliding window of size 9 is placed on the PSSM features; for each row of residues, the feature values of that residue and its 8 adjacent residues are extracted and their average is taken as the updated value of that residue; the sliding window is applied to each row in turn, giving another 20-dimensional feature vector, which, together with the original 20-dimensional PSSM features without the sliding window, gives 40 PSSM feature dimensions;
S32: Sliding windows of sizes 1, 3, 5, 7 and 9 are applied in the same feature-averaging manner to obtain a 5-dimensional hydropathy index feature vector and a 5-dimensional relative solvent accessibility feature vector;
S33: The augmented feature data are combined, each row of features representing one amino acid residue with 50 feature dimensions; together with the label of each extracted amino acid residue this forms a 51-dimensional data set used as the input of the model.
5. The method for predicting protein interaction sites based on deep learning of claim 4, wherein: in step S33, the label value of an amino acid residue is -1 or 1: -1 represents a negative sample, i.e. a non-interacting residue, and 1 represents a positive sample, i.e. an interacting residue.
6. The method for predicting protein interaction sites based on deep learning of claim 1, wherein: the processing procedure of step S4 is as follows: the NearMiss down-sampling algorithm measures the distance between positive and negative samples with the K-nearest-neighbor rule and selects and retains the negative samples whose average distance to the farthest positive samples is smallest, until the ratio of positive to negative samples, i.e. of interacting to non-interacting residues, is 1:1.
7. The method for predicting protein interaction sites based on deep learning of claim 1, wherein: in step S5, the class-balanced data set is divided into a training set and a test set at a ratio of 8:2, and the model is trained on the training set; the model comprises a variational autoencoder and a multilayer perceptron classifier connected in sequence; during training, the training set is fed into the variational autoencoder, which uses a neural network to learn the data features automatically and eliminate redundant features, and the 30-dimensional abstract features of the intermediate hidden layer of the network are extracted for the downstream classification task; the 30-dimensional abstract features are then fed into a multilayer perceptron classifier to identify whether each residue is an interacting residue.
8. The method for predicting protein interaction sites based on deep learning of claim 7, wherein: the variational autoencoder comprises an encoder and a decoder; in the encoder, the input data pass through a fully connected layer FC1 with 512 neurons followed by a Dropout layer, and then through two fully connected layers to obtain the mean z_mean and the logarithm of the variance z_log_var of 30 Gaussian distributions; random noise epsilon drawn from a Gaussian distribution is introduced, and a Lambda layer linearly fuses epsilon, z_mean and z_log_var to obtain the intermediate hidden variable z, this process being called the sampling process; in the decoder, the intermediate hidden variable z passes through a fully connected layer FC1 and a Dropout layer and is then connected to a fully connected layer that outputs 50-dimensional data similar to the input.
9. The method for predicting protein interaction sites based on deep learning of claim 7, wherein: the multi-layer perceptron classifier comprises a Lambda layer, three fully-connected layers and a Dropout layer, wherein one fully-connected layer is arranged between the Lambda layer and the Dropout layer, the other two fully-connected layers are sequentially connected with the Dropout layer, the Lambda layer combines a mean value z _ mean and a logarithm of variance z _ log _ var for transmission of neural network data, and finally, the residue is subjected to binary classification by utilizing a softmax function.
10. The method for predicting protein interaction sites based on deep learning of claim 1, wherein: in step S6, the trained model is tested on the test set to obtain the accuracy, recall, precision, F1-value and MCC value of the model prediction; Dtestset72 is used as an independent validation set and is processed through the same steps S1-S6 without changing any parameter of the model, and the resulting evaluation indices verify the generalization capability of the model.
CN202110909991.5A 2021-08-09 2021-08-09 Protein interaction site prediction method based on deep learning Pending CN113643756A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110909991.5A CN113643756A (en) 2021-08-09 2021-08-09 Protein interaction site prediction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110909991.5A CN113643756A (en) 2021-08-09 2021-08-09 Protein interaction site prediction method based on deep learning

Publications (1)

Publication Number Publication Date
CN113643756A true CN113643756A (en) 2021-11-12

Family

ID=78420324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110909991.5A Pending CN113643756A (en) 2021-08-09 2021-08-09 Protein interaction site prediction method based on deep learning

Country Status (1)

Country Link
CN (1) CN113643756A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936748A (en) * 2021-11-17 2022-01-14 西安电子科技大学 Molecular recognition characteristic function prediction method based on ensemble learning
CN114550824A (en) * 2022-01-29 2022-05-27 河南大学 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss
CN115512763A (en) * 2022-09-06 2022-12-23 北京百度网讯科技有限公司 Method for generating polypeptide sequence, method and device for training polypeptide generation model
CN115512763B (en) * 2022-09-06 2023-10-24 北京百度网讯科技有限公司 Polypeptide sequence generation method, and training method and device of polypeptide generation model

Similar Documents

Publication Publication Date Title
CN113643756A (en) Protein interaction site prediction method based on deep learning
CN110109835B (en) Software defect positioning method based on deep neural network
CN108766559B (en) Clinical decision support method and system for intelligent disease screening
CN112711953A (en) Text multi-label classification method and system based on attention mechanism and GCN
Liu et al. A novel method based on deep learning for aligned fingerprints matching
Jagan Mohan et al. A novel four-step feature selection technique for diabetic retinopathy grading
CN113392894A (en) Cluster analysis method and system for multi-group mathematical data
Bennet et al. A Hybrid Approach for Gene Selection and Classification Using Support Vector Machine.
Uddin et al. Machine learning based diabetes detection model for false negative reduction
CN114743600A (en) Gate-controlled attention mechanism-based deep learning prediction method for target-ligand binding affinity
CN113674862A (en) Acute renal function injury onset prediction method based on machine learning
Ahmad et al. Diagnosis of cardiovascular disease using deep learning technique
CN113838018A (en) Cnn-former-based hepatic fibrosis lesion detection model training method and system
Ateş et al. The investigation of the success of different machine learning methods in breast cancer diagnosis
CN116245139B (en) Training method and device for graph neural network model, event detection method and device
CN117195027A (en) Cluster weighted clustering integration method based on member selection
Andi et al. Analysis of the random forest and grid search algorithms in early detection of diabetes mellitus disease
Çelik Determination and Classification of Importance of Attributes Used in Diagnosing Pregnant Women's Birth Method
CN114999628A (en) Method for searching significant characteristics of degenerative knee osteoarthritis by machine learning
CN113971984A (en) Classification model construction method and device, electronic equipment and storage medium
Siahmarzkooh ACO-based Type 2 Diabetes Detection using Artificial Neural Networks.
Sinha et al. A study of feature selection and extraction algorithms for cancer subtype prediction
CN117912591B (en) Kinase-drug interaction prediction method based on deep contrast learning
Walsh et al. Evolution of convolutional neural networks for lymphoma classification
CN112885409B (en) Colorectal cancer protein marker selection system based on feature selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination