CN113223620B - Protein solubility prediction method based on multi-dimensional sequence embedding - Google Patents
- Publication number
- CN113223620B (application CN202110521651.5A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- protein
- layer
- amino acid
- embedding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Computational Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Investigating Or Analysing Biological Materials (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a protein solubility prediction method based on multi-dimensional sequence embedding, comprising the following steps: (1) acquiring a set of protein amino acid sequences; (2) performing enhanced representation of the amino acid sequence of each protein; (3) calculating structural information for the amino acid sequence of each protein; (4) acquiring a training sample set and a test sample set; (5) constructing a protein solubility prediction model H based on multi-dimensional sequence embedding; (6) iteratively training the protein solubility prediction model H; (7) obtaining the protein solubility prediction result. In training the model and obtaining the prediction result, each amino acid sequence is given an enhanced representation and supplemented with structural information, and multi-dimensional sequence embedding is performed, which increases the amount of information and improves the accuracy of protein solubility prediction. The method can be used to screen amino acid sequences for protein synthesis.
Description
Technical Field
The invention belongs to the technical field of bioinformatics and relates to a protein solubility prediction method, in particular to a neural-network-based deep learning method for protein solubility prediction using multi-dimensional sequence embedding. It can be used to screen for protein amino acid sequences that are soluble in a protein expression system, providing a reference for protein synthesis.
Background
The development of genetic engineering and cloning techniques has enabled research and industry to synthesize and isolate proteins on a large scale in protein expression systems. Commonly used protein expression systems include E. coli, yeast, insect cell and mammalian cell expression systems. However, many proteins are insoluble when heterologously expressed in these systems, so the synthesized proteins have no biological activity, and the efficient production of active, soluble proteins remains a major challenge.
The solubility of a protein under given experimental conditions is a trait that is ultimately determined by its sequence. By studying the amino acid sequence patterns of insoluble and soluble proteins and developing computational methods for predicting protein solubility, experimental work can be concentrated on soluble proteins, improving the efficiency of large-scale screening.
Protein solubility prediction means predicting, by mining patterns in existing protein solubility data, whether a protein amino acid sequence under study will be soluble after synthesis. Existing protein solubility prediction methods fall into two categories: traditional machine learning methods based on feature engineering, and deep learning methods based on neural networks. The traditional machine learning methods extract a series of statistical features from the amino acid sequence of a protein and obtain the final prediction model by training a machine learning classifier. These methods depend on extensive feature engineering and on experience in feature selection; the amount of information finally extracted is limited and the protein cannot be characterized comprehensively, which lowers the upper limit of their prediction accuracy. The neural-network-based deep learning methods automatically learn a feature representation from the amino acid sequence of a protein and predict protein solubility end to end. Such methods typically use convolutional neural networks to extract features from the amino acid sequence. For example, Khurana et al., in "DeepSol: a deep learning framework for sequence-based protein solubility prediction", published in Bioinformatics in 2018, disclose the protein solubility prediction method DeepSol.
DeepSol learns a feature representation from the amino acid sequence of a single protein using only a convolutional neural network. Although the method can extract features from the amino acid sequence automatically, the amount of information the amino acid sequence provides is limited, and information is lost during the convolution and pooling operations, which limits the improvement in prediction accuracy.
Disclosure of Invention
The object of the present invention is to overcome the above deficiencies of the prior art by providing a protein solubility prediction method based on multi-dimensional sequence embedding, which increases the amount of information by giving protein amino acid sequences an enhanced representation and supplementing them with structural information, and which obtains vector representations of a protein from multiple dimensions and fuses them to improve prediction accuracy.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) Obtaining the amino acid sequence set of the protein:
downloading, from a protein solubility dataset, the amino acid sequences of M proteins, $X=\{X^{(1)},X^{(2)},\ldots,X^{(m)},\ldots,X^{(M)}\}$, and their corresponding solubility labels, $y=\{y^{(1)},y^{(2)},\ldots,y^{(m)},\ldots,y^{(M)}\}$, wherein $M\geq 10000$; $X^{(m)}$ denotes the amino acid sequence of the m-th protein, which is a vector of length $L_m$ composed of the 20 kinds of amino acids, its l-th element being the amino acid at position $l$ of the sequence; $y^{(m)}$ denotes the solubility label of $X^{(m)}$, $y^{(m)}=0$ indicating that $X^{(m)}$ is insoluble and $y^{(m)}=1$ that $X^{(m)}$ is soluble;
(2) Performing enhanced representation of each protein amino acid sequence $X^{(m)}$:
combining, in order from front to back, the amino acids at every two adjacent positions of each protein amino acid sequence $X^{(m)}$ to obtain the binary combination sequence set $B=\{B^{(1)},B^{(2)},\ldots,B^{(m)},\ldots,B^{(M)}\}$, and combining the amino acids at every three adjacent positions of $X^{(m)}$ to obtain the ternary combination sequence set $T=\{T^{(1)},T^{(2)},\ldots,T^{(m)},\ldots,T^{(M)}\}$, wherein $B^{(m)}$ denotes the binary combination sequence corresponding to $X^{(m)}$, which has length $L_m-1$ and is composed of the 400 kinds of amino acid binary combinations, and $T^{(m)}$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which has length $L_m-2$ and is composed of the 8000 kinds of amino acid ternary combinations;
(3) Calculating the structural information of each protein amino acid sequence $X^{(m)}$:
(3a) using the ACCPro5 software package, calculating the relative solvent accessibility of each protein amino acid sequence $X^{(m)}$ at a threshold of 25% and over the threshold interval 0-95%, respectively, to obtain the relative solvent accessibility sequence set $RSA2=\{RSA2^{(1)},\ldots,RSA2^{(m)},\ldots,RSA2^{(M)}\}$, with 2 accessibility classes, at the 25% threshold, and the relative solvent accessibility sequence set $RSA20=\{RSA20^{(1)},\ldots,RSA20^{(m)},\ldots,RSA20^{(M)}\}$, with 20 accessibility classes, over the 0-95% threshold interval, wherein each element of $RSA2^{(m)}$ is either $e$ or $-$, $e$ indicating that the amino acid at the corresponding position is accessible to solvent and $-$ indicating that it is not;
(3b) using the SSpro5 software package, calculating the three-state secondary structure sequence and the eight-state secondary structure sequence of each protein amino acid sequence $X^{(m)}$ to obtain, for X, the three-state secondary structure sequence set $SS3=\{SS3^{(1)},\ldots,SS3^{(m)},\ldots,SS3^{(M)}\}$, with 3 secondary structure classes, and the eight-state secondary structure sequence set $SS8=\{SS8^{(1)},\ldots,SS8^{(m)},\ldots,SS8^{(M)}\}$, with 8 secondary structure classes;
(4) Acquiring a training sample set and a test sample set:
(4a) initializing to L, L = 1200, the lengths of each protein amino acid sequence $X^{(m)}$ and of its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ with 2 and 20 accessibility classes, and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$, wherein a sequence shorter than L is padded with 0 and the part of a sequence exceeding L is deleted;
(4b) combining all the length-initialized sequences into the multi-dimensional sequence representation sample set of the proteins, $D=\{D^{(1)},D^{(2)},\ldots,D^{(m)},\ldots,D^{(M)}\}$, taking N multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the training sample set Train, and taking the remaining S multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the test sample set Test, wherein $D^{(m)}$ denotes a multi-dimensional sequence representation sample with 7 dimensions in total, comprising the protein amino acid sequence $X'^{(m)}$ of length L and its corresponding binary combination sequence $B'^{(m)}$, ternary combination sequence $T'^{(m)}$, relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 accessibility classes, and three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$, that is, $D^{(m)}=(X'^{(m)},B'^{(m)},T'^{(m)},RSA2'^{(m)},RSA20'^{(m)},SS3'^{(m)},SS8'^{(m)})$, and $S=M-N$;
(5) Constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:
constructing a protein solubility prediction model H comprising 7 parallel embedding layers, which implement the multi-dimensional sequence embedding, and a prediction layer, wherein a convolution pooling module is connected between each embedding layer and the prediction layer, each convolution pooling module comprising a one-dimensional convolution layer, a global max pooling layer and a concat layer stacked in sequence, and the prediction layer comprising a plurality of fully connected layers and a sigmoid layer stacked in sequence;
(6) Iteratively training the protein solubility prediction model H:
(6a) randomly initializing all parameters in the embedding layers, the convolution pooling modules and the prediction layer, letting the iteration counter be c and the maximum number of iterations be C, $C\geq 1$, and setting c = 0;
(6b) taking the training sample set Train as the input of the protein solubility prediction model H, wherein the 7 embedding layers respectively embed the 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$, $SS8'^{(n)}$ of each training sample, the 7 convolution pooling modules extract features from the corresponding embedding results, and the prediction layer predicts, from the features extracted by the 7 convolution pooling modules, the probability that the protein amino acid sequence $X'^{(n)}$ in each training sample is soluble, giving the set of soluble probabilities $p_{train}$ corresponding to Train;
(6c) calculating, with the cross-entropy loss function, the loss value Loss between $p_{train}$ and the solubility labels $y_{train}$, and updating all parameters in the embedding layers, the convolution pooling modules and the prediction layer from Loss by gradient descent;
(6d) judging whether $c\geq C$ holds; if so, obtaining the trained multi-dimensional sequence embedding convolutional neural network model H'; otherwise, setting c = c + 1 and returning to step (6b);
(7) Obtaining the protein solubility prediction result:
feeding the test sample set Test into the trained multi-dimensional sequence embedding convolutional neural network model H' and performing forward propagation to obtain the set of probabilities that the S samples are soluble, according to which the s-th sample in the test sample set is predicted to be either soluble or insoluble.
Compared with the prior art, the invention has the following advantages:
1. The protein solubility prediction model constructed by the invention comprises 7 embedding layers, each connected to its own convolution pooling module. In training the model and obtaining the prediction result, the 7 embedding layers and their convolution pooling modules embed the 7 sequences separately, so that different feature representations of a protein are learned from sequences of multiple dimensions; these representations complement one another and jointly characterize the protein.
2. In acquiring the training sample set and the test sample set, the invention gives the amino acid sequence of each protein an enhanced representation, which compensates for the information lost during convolution and pooling when feature representations are learned from the amino acid sequence alone; in addition, the four kinds of structural information included for each amino acid sequence in the training and test sample sets increase the amount of information available when training the model and obtaining the prediction result.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to fig. 1 and the specific examples.
Referring to fig. 1, the present invention includes the steps of:
Step 1) Obtaining the amino acid sequence set of the proteins:
this example downloads, from the DeepSol protein solubility dataset for the E. coli expression system, the amino acid sequences of M proteins, $X=\{X^{(1)},X^{(2)},\ldots,X^{(m)},\ldots,X^{(M)}\}$, and their corresponding solubility labels, $y=\{y^{(1)},y^{(2)},\ldots,y^{(m)},\ldots,y^{(M)}\}$, where M = 71421; $X^{(m)}$ denotes the amino acid sequence of the m-th protein, which is a vector of length $L_m$ composed of the 20 kinds of amino acids, its l-th element being the amino acid at position $l$ of the sequence; $y^{(m)}$ denotes the solubility label of $X^{(m)}$, $y^{(m)}=0$ indicating that $X^{(m)}$ is insoluble and $y^{(m)}=1$ that $X^{(m)}$ is soluble;
Step 2) Performing enhanced representation of each protein amino acid sequence $X^{(m)}$:
combining, in order from front to back, the amino acids at every two adjacent positions of each protein amino acid sequence $X^{(m)}$ to obtain the binary combination sequence set $B=\{B^{(1)},B^{(2)},\ldots,B^{(m)},\ldots,B^{(M)}\}$, and combining the amino acids at every three adjacent positions of $X^{(m)}$ to obtain the ternary combination sequence set $T=\{T^{(1)},T^{(2)},\ldots,T^{(m)},\ldots,T^{(M)}\}$, wherein $B^{(m)}$ denotes the binary combination sequence corresponding to $X^{(m)}$, which has length $L_m-1$ and is composed of the 400 kinds of amino acid binary combinations, and $T^{(m)}$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which has length $L_m-2$ and is composed of the 8000 kinds of amino acid ternary combinations.
The binary and ternary enhanced representations of the protein amino acid sequence allow the model to learn more complex patterns in the amino acid sequence and, at the same time, compensate for the information lost in the subsequent convolution and pooling operations, improving the accuracy of protein solubility prediction.
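The enhanced representation of step 2) can be sketched as a sliding window over the sequence. The following is a minimal illustration, not the patented implementation: the function names, the example sequence and the choice to number the combinations from 1 (leaving 0 free for padding) are assumptions made here.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_sequence(seq, k):
    """Return the overlapping k-residue combinations of seq, front to back."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_vocabulary(k):
    """Map every possible k-residue combination to an integer index.

    Indices start at 1; reserving 0 for padding is an assumption of this sketch.
    """
    combos = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return {combo: index for index, combo in enumerate(combos, start=1)}


x = "MKTAYIAK"               # hypothetical sequence of length 8
b = kmer_sequence(x, 2)      # binary combinations, length 8 - 1
t = kmer_sequence(x, 3)      # ternary combinations, length 8 - 2
```

The vocabulary sizes produced by `kmer_vocabulary(2)` and `kmer_vocabulary(3)` are 400 and 8000, matching the counts stated in the text.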
Step 3) Calculating the structural information of each protein amino acid sequence $X^{(m)}$:
Step 3a) using the ACCPro5 software package, calculating the relative solvent accessibility of each protein amino acid sequence $X^{(m)}$ at a threshold of 25% and over the threshold interval 0-95%, respectively, to obtain the relative solvent accessibility sequence set $RSA2=\{RSA2^{(1)},\ldots,RSA2^{(m)},\ldots,RSA2^{(M)}\}$, with 2 accessibility classes, at the 25% threshold, and the relative solvent accessibility sequence set $RSA20=\{RSA20^{(1)},\ldots,RSA20^{(m)},\ldots,RSA20^{(M)}\}$, with 20 accessibility classes, over the 0-95% threshold interval, wherein each element of $RSA2^{(m)}$ is either $e$ or $-$, $e$ indicating that the amino acid at the corresponding position is accessible to solvent and $-$ indicating that it is not, and a larger value in $RSA20^{(m)}$ indicates that the amino acid at that position is more likely to be accessible to solvent; the relative solvent accessibility sequences reflect the structure of the protein amino acid sequence and thus enlarge the information content of the dataset;
Step 3b) using the SSpro5 software package, calculating the three-state secondary structure sequence and the eight-state secondary structure sequence of each protein amino acid sequence $X^{(m)}$ to obtain, for X, the three-state secondary structure sequence set $SS3=\{SS3^{(1)},\ldots,SS3^{(m)},\ldots,SS3^{(M)}\}$, with 3 secondary structure classes, and the eight-state secondary structure sequence set $SS8=\{SS8^{(1)},\ldots,SS8^{(m)},\ldots,SS8^{(M)}\}$, with 8 secondary structure classes, wherein the three-state and eight-state secondary structures describe the secondary structure of the protein at different granularities: the three-state secondary structure comprises α-helix, β-strand and coil, and the eight-state secondary structure further subdivides these into eight classes;
the three-state and eight-state secondary structure sequences provide structural information during model training and solubility prediction, which increases the amount of information and improves the accuracy of protein solubility prediction.
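Before a structural sequence can enter an embedding layer, its per-residue symbols have to be turned into integer codes. The sketch below shows one possible encoding; it is an illustration only. The symbols `e` and `-` come from the text, whereas the letters `H`, `E`, `C` for helix, strand and coil, the helper names and the example strings are assumptions of this sketch.

```python
def build_index(alphabet):
    """Assign the codes 1..n to the symbols of alphabet; 0 stays free for padding."""
    return {symbol: code for code, symbol in enumerate(alphabet, start=1)}


def encode(states, index):
    """Turn a per-residue state string into a list of integer codes."""
    return [index[state] for state in states]


RSA2_INDEX = build_index("e-")   # accessible / not accessible, as in the text
SS3_INDEX = build_index("HEC")   # assumed letters for helix, strand, coil

rsa2_codes = encode("ee--e", RSA2_INDEX)   # hypothetical 5-residue example
ss3_codes = encode("CHHEC", SS3_INDEX)     # hypothetical 5-residue example
```

The same two helpers would apply to the 20-class and eight-state sequences once their symbol alphabets are known.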
Step 4) Acquiring a training sample set and a test sample set:
Step 4a) initializing to L, L = 1200, the lengths of each protein amino acid sequence $X^{(m)}$ and of its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ with 2 and 20 accessibility classes, and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$, wherein a sequence shorter than L is padded with 0 and the part of a sequence exceeding L is deleted;
the lengths are initialized to L because a neural-network-based deep learning model requires inputs of the same shape, whereas the amino acid sequences of different proteins generally differ in length and cannot meet this requirement. L is set to 1200 because most protein amino acid sequences in the dataset are no longer than 1200; because the model uses global max pooling, padding with 0 does not affect the training and prediction of the model, and a length of 1200 keeps the protein amino acid sequences relatively complete.
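The length initialization of step 4a) amounts to a pad-or-truncate operation on each integer-coded sequence. A minimal sketch follows; the function name is hypothetical, and padding at the end of the sequence and cutting off its tail are assumptions, since the text only states that short sequences are filled with 0 and excess parts are deleted.

```python
def pad_or_truncate(codes, length=1200, pad_value=0):
    """Return a list of exactly `length` items.

    Shorter inputs are padded with pad_value at the end (assumed position);
    longer inputs lose everything beyond `length`.
    """
    kept = list(codes[:length])
    return kept + [pad_value] * (length - len(kept))


short = pad_or_truncate([4, 8, 15], length=5)    # padded with two zeros
long = pad_or_truncate(range(10), length=4)      # tail removed
default = pad_or_truncate([4, 8, 15])            # uses L = 1200
```

Applying the same function to all 7 sequences of a protein gives the equal-shape inputs that the parallel embedding layers need.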
Step 4b) combining all the length-initialized sequences into the multi-dimensional sequence representation sample set of the proteins, $D=\{D^{(1)},D^{(2)},\ldots,D^{(m)},\ldots,D^{(M)}\}$, taking N multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the training sample set Train, and taking the remaining S multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the test sample set Test, wherein $D^{(m)}$ denotes a multi-dimensional sequence representation sample with 7 dimensions in total, comprising the protein amino acid sequence $X'^{(m)}$ of length L and its corresponding binary combination sequence $B'^{(m)}$, ternary combination sequence $T'^{(m)}$, relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 accessibility classes, and three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$, that is, $D^{(m)}=(X'^{(m)},B'^{(m)},T'^{(m)},RSA2'^{(m)},RSA20'^{(m)},SS3'^{(m)},SS8'^{(m)})$, with N = 69420 and S = 2001;
Step 5) Constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:
constructing a protein solubility prediction model H comprising 7 parallel embedding layers, which implement the multi-dimensional sequence embedding, and a prediction layer, wherein each embedding layer embeds one dimension of the multi-dimensional sequence representation samples, a convolution pooling module is connected between each embedding layer and the prediction layer, each convolution pooling module comprising a one-dimensional convolution layer, a global max pooling layer and a concat layer stacked in sequence, and the prediction layer comprising a plurality of fully connected layers and a sigmoid layer stacked in sequence;
the 7 embedding layers and their corresponding convolution pooling modules extract features of the protein from the sequences of the 7 dimensions, characterizing the protein from different angles; the prediction layer fuses the features extracted from the 7 dimensions to predict protein solubility, which improves the accuracy of the prediction;
the embedding dimensions of the 7 embedding layers are set to 64,5,32,5 and 10, respectively;
the structure of the convolution pooling module is: one-dimensional convolution layer → pooling layer → concat layer, wherein the one-dimensional convolution layer consists of K one-dimensional convolution units, expressed as a set of two-tuples $\{(k_j,q_j)\mid j=1,\ldots,K\}$, $k_j$ denoting the convolution kernel size and $q_j$ the number of convolution kernels of the j-th unit;
parameter settings of the convolution pooling module:
the K one-dimensional convolution units of the one-dimensional convolution layer are set to {(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)};
the pooling mode of the pooling layer is set to global max pooling;
the structure of the prediction layer is: first fully connected layer → second fully connected layer → sigmoid layer;
parameter settings of the prediction layer:
the first fully connected layer has 128 neurons and uses the ReLU activation function;
the second fully connected layer has 64 neurons and uses the ReLU activation function;
the sigmoid layer has 1 neuron and uses the sigmoid activation function;
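To make the convolution pooling module more concrete, the sketch below slides a single convolution kernel over an embedded sequence and keeps only the largest response, which is what one kernel followed by global max pooling produces. It is a simplified illustration under stated assumptions: no bias, no activation function, windows that lie entirely inside the sequence, and a hypothetical function name. The feature counts at the end assume that the prediction layer receives the outputs of all 7 modules concatenated, which the text implies but does not spell out.

```python
# Convolution units (kernel size, number of kernels) as listed in the text.
KERNEL_UNITS = [(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)]
NUM_BRANCHES = 7  # one embedding layer plus convolution pooling module per sequence


def conv1d_global_max(embedded, kernel):
    """Apply one kernel along the sequence axis and return the maximum response.

    embedded: list of L vectors (one per position), each of size d.
    kernel:   list of k weight vectors, each of size d.
    """
    k = len(kernel)
    responses = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        responses.append(sum(w * x
                             for row_w, row_x in zip(kernel, window)
                             for w, x in zip(row_w, row_x)))
    return max(responses)


# Toy example: 4 positions, embedding size 1, kernel size 2.
toy = conv1d_global_max([[1.0], [2.0], [3.0], [0.0]], [[1.0], [1.0]])

# Each kernel contributes one value after global max pooling.
features_per_branch = sum(q for _, q in KERNEL_UNITS)
fused_features = NUM_BRANCHES * features_per_branch
```

Because every kernel yields a single number regardless of sequence length, the concat layer of each module has a fixed output size, here 224 values per module.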
Step 6) Iteratively training the protein solubility prediction model H:
Step 6a) randomly initializing all parameters in the embedding layers, the convolution pooling modules and the prediction layer, letting the iteration counter be c and the maximum number of iterations be C, $C\geq 1$, and setting c = 0 and C = 3;
Step 6b) taking the training sample set Train as the input of the protein solubility prediction model H, wherein the 7 embedding layers respectively embed the 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$, $SS8'^{(n)}$ of each training sample, the 7 convolution pooling modules extract features from the corresponding embedding results, and the prediction layer predicts, from the features extracted by the 7 convolution pooling modules, the probability that the protein amino acid sequence $X'^{(n)}$ in each training sample is soluble, giving the set of soluble probabilities $p_{train}$ corresponding to Train;
Step 6c) calculating, with the cross-entropy loss function, the loss value Loss between $p_{train}$ and the solubility labels $y_{train}$, and updating all parameters in the embedding layers, the convolution pooling modules and the prediction layer from Loss by gradient descent;
Step 6d) judging whether $c\geq C$ holds; if so, obtaining the trained multi-dimensional sequence embedding convolutional neural network model H'; otherwise, setting c = c + 1 and returning to step 6b);
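The exact loss formula of step 6c) is not reproduced in this text. For a single sigmoid output and 0/1 labels, the usual form of the cross-entropy loss is the binary cross-entropy, sketched below as an illustration only. Averaging over the samples and clipping the probabilities with `eps` are assumptions of this sketch, and the function name is hypothetical.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels.

    Probabilities are clipped to [eps, 1 - eps] so that log() stays finite.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


uninformative = binary_cross_entropy([0.5, 0.5], [1, 0])   # equals ln 2
confident = binary_cross_entropy([0.9, 0.1], [1, 0])       # smaller loss
```

Gradient descent then moves the parameters in the direction that lowers this value, so predictions closer to the labels give a smaller Loss.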
Step 7) Obtaining the protein solubility prediction result:
feeding the test sample set Test into the trained multi-dimensional sequence embedding convolutional neural network model H' and performing forward propagation to obtain the set of probabilities that the S samples are soluble, according to which the s-th sample in the test sample set is predicted to be either soluble or insoluble.
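Turning the predicted probabilities of step 7) into soluble or insoluble calls requires a decision threshold, whose value is not legible in this text. The sketch below uses 0.5, the common default for a sigmoid output; this value and the function name are assumptions, not the patent's stated rule.

```python
def predict_labels(probs, threshold=0.5):
    """Map probabilities to labels: 1 (soluble) at or above threshold, else 0.

    threshold=0.5 is an assumed cut-off chosen for illustration.
    """
    return [1 if p >= threshold else 0 for p in probs]


labels = predict_labels([0.91, 0.18, 0.50])   # hypothetical model outputs
```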
The technical effects of the invention are further explained by combining simulation experiments as follows:
1. simulation conditions and contents:
The simulation experiment was run on an Intel(R) Xeon(R) Gold 5115 CPU (20 cores, 2.40 GHz) with 48 GB of memory and a Tesla P40 graphics card, on the Red Hat 4.8.5-11 platform, using Python 3.6.2 with Tensorflow-gpu-1.12 and Keras-2.2.4; the dataset used was the DeepSol protein solubility dataset for the E. coli expression system.
The prediction accuracy of the present invention was compared by simulation with that of the existing protein solubility prediction method DeepSol; the results are shown in Table 1.
2. Analysis of simulation results:
The evaluation indexes adopted for protein solubility prediction are Accuracy and AUC.
(1) Accuracy = (TP + TN)/(TP + FN + FP + TN), where TP is the number of samples that are actually positive and correctly predicted as positive, TN is the number of samples that are actually negative and correctly predicted as negative, FP is the number of samples that are actually negative but incorrectly predicted as positive, and FN is the number of samples that are actually positive but incorrectly predicted as negative; positive means soluble and negative means insoluble.
(2) The AUC (Area under curve) is the Area under the ROC curve (receiver operating characteristic curve), the abscissa of the ROC curve is the False Positive Rate FPR (False Positive Rate), the ordinate is the True Positive Rate TPR (True Positive Rate), FPR = FP/(TN + FP), TPR = TP/(TP + FN).
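Both evaluation indexes can be computed in a few lines. The sketch below is illustrative: accuracy is taken as the fraction of correctly classified samples, (TP + TN) over all samples, and AUC is computed through its rank interpretation, the probability that a randomly chosen soluble sample receives a higher score than a randomly chosen insoluble one, which equals the area under the ROC curve. Function names and the example numbers are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of samples whose predicted class matches the true class."""
    return (tp + tn) / (tp + tn + fp + fn)


def rates(tp, tn, fp, fn):
    """Return (FPR, TPR), one point on the ROC curve."""
    return fp / (tn + fp), tp / (tp + fn)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Each (positive, negative) pair counts 1 if the positive scores higher,
    0.5 on a tie and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


acc = accuracy(tp=40, tn=45, fp=5, fn=10)
fpr, tpr = rates(tp=40, tn=45, fp=5, fn=10)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a test set of a few thousand sequences; sorting-based implementations scale better for larger sets.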
Table 1 compares the Accuracy and AUC values of the present invention and the prior art on the PaRSnIP dataset.
TABLE 1
Method | Accuracy | AUC |
Prior Art | 0.77 | 0.86 |
The invention | 0.79 | 0.87 |
As can be seen from Table 1, the Accuracy and AUC of the present invention (0.79 and 0.87) are both higher than those of the prior art (0.77 and 0.86), showing that the invention improves the accuracy of protein solubility prediction.
The foregoing description is only an example of the present invention and should not be construed as limiting the invention in any way, and it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the principles and arrangements of the invention, but such changes and modifications are within the scope of the invention as defined by the appended claims.
Claims (2)
1. A protein solubility prediction method based on multidimensional sequence embedding is characterized by comprising the following steps:
(1) Obtaining the amino acid sequence set of the proteins:
downloading, from a protein solubility dataset, the amino acid sequences of M proteins, $X=\{X^{(1)},X^{(2)},\ldots,X^{(m)},\ldots,X^{(M)}\}$, and their corresponding solubility labels, $y=\{y^{(1)},y^{(2)},\ldots,y^{(m)},\ldots,y^{(M)}\}$, wherein $M\geq 10000$; $X^{(m)}$ denotes the amino acid sequence of the m-th protein, which is a vector of length $L_m$ composed of the 20 kinds of amino acids, its l-th element being the amino acid at position $l$ of the sequence; $y^{(m)}$ denotes the solubility label of $X^{(m)}$, $y^{(m)}=0$ indicating that $X^{(m)}$ is insoluble and $y^{(m)}=1$ that $X^{(m)}$ is soluble;
(2) Performing enhanced representation of the amino acid sequence $X^{(m)}$ of each protein:
in order from front to back, the amino acids at every two consecutive positions of the amino acid sequence $X^{(m)}$ of each protein are combined to obtain a binary combination sequence set $B = \{B^{(1)}, B^{(2)}, \ldots, B^{(m)}, \ldots, B^{(M)}\}$, and the amino acids at every three consecutive positions of $X^{(m)}$ are combined to obtain a ternary combination sequence set $T = \{T^{(1)}, T^{(2)}, \ldots, T^{(m)}, \ldots, T^{(M)}\}$, wherein $B^{(m)}$ denotes the binary combination sequence corresponding to $X^{(m)}$, which is composed of the 400 kinds of binary amino acid combinations and has length $L_m - 1$, and $T^{(m)}$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which is composed of the 8000 kinds of ternary amino acid combinations and has length $L_m - 2$;
(3) Calculating the structural information of the amino acid sequence $X^{(m)}$ of each protein:
(3a) using the ACCPro5 software package, the relative solvent accessibility of the amino acid sequence $X^{(m)}$ of each protein is calculated at a 25% threshold and over the 0-95% threshold interval, to obtain the relative solvent accessibility sequence set $RSA2 = \{RSA2^{(1)}, \ldots, RSA2^{(m)}, \ldots, RSA2^{(M)}\}$, whose number of relative solvent accessibility categories under the 25% threshold is 2, and the relative solvent accessibility sequence set $RSA20 = \{RSA20^{(1)}, \ldots, RSA20^{(m)}, \ldots, RSA20^{(M)}\}$, whose number of relative solvent accessibility categories over the 0-95% threshold interval is 20, wherein $RSA2^{(m)} = (r_1^{(m)}, \ldots, r_l^{(m)}, \ldots, r_{L_m}^{(m)})$, $r_l^{(m)} \in \{e, -\}$, $e$ denotes that $x_l^{(m)}$ can come into contact with the solvent, and $-$ denotes that $x_l^{(m)}$ cannot come into contact with the solvent;
(3b) using the SSpro5 software package, the three-state secondary structure sequence and the eight-state secondary structure sequence of the amino acid sequence $X^{(m)}$ of each protein are calculated, to obtain the three-state secondary structure sequence set $SS3 = \{SS3^{(1)}, \ldots, SS3^{(m)}, \ldots, SS3^{(M)}\}$ corresponding to $X$, whose number of secondary structure classes is 3, and the eight-state secondary structure sequence set $SS8 = \{SS8^{(1)}, \ldots, SS8^{(m)}, \ldots, SS8^{(M)}\}$, whose number of secondary structure classes is 8;
(4) Acquiring a training sample set and a testing sample set:
(4a) initializing to L, where L = 1200, the lengths of the amino acid sequence $X^{(m)}$ of each protein and of its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ with 2 and 20 relative solvent accessibility categories, and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$; during initialization, a sequence shorter than L is padded with 0, and the part of a sequence that exceeds L is deleted;
(4b) combining all of the length-initialized sequences into a multi-dimensional sequence representation sample set $D = \{D^{(1)}, D^{(2)}, \ldots, D^{(m)}, \ldots, D^{(M)}\}$ of the proteins, taking N multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as a training sample set Train, and taking the remaining S multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as a test sample set Test, wherein $D^{(m)}$ denotes a multi-dimensional sequence representation sample with 7 dimensions in total, comprising the amino acid sequence $X'^{(m)}$ of the protein with length L, its corresponding binary combination sequence $B'^{(m)}$ and ternary combination sequence $T'^{(m)}$, the relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 relative solvent accessibility categories, and the three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$, that is, $D^{(m)} = (X'^{(m)}, B'^{(m)}, T'^{(m)}, RSA2'^{(m)}, RSA20'^{(m)}, SS3'^{(m)}, SS8'^{(m)})$, and $S = M - N$;
(5) Constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:
constructing a protein solubility prediction model comprising 7 parallel-arranged embedding layers for realizing multi-dimensional sequence embedding and a prediction layer, wherein a convolution pooling module is loaded between each embedding layer and the prediction layer, and the convolution pooling module comprises a one-dimensional convolution layer, a global maximum pooling layer and a concat layer which are sequentially stacked; the prediction layer comprises a plurality of fully-connected layers and a sigmoid layer which are sequentially stacked;
(6) Performing iterative training on the protein solubility prediction model H:
(6a) randomly initializing all parameters in the embedding layers, the convolution pooling modules and the prediction layer, denoting the iteration counter as c and the maximum number of iterations as C, where C ≥ 1, and setting c = 0;
(6b) taking the training sample set Train as the input of the protein solubility prediction model H; the 7 embedding layers respectively embed the 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$ and $SS8'^{(n)}$ of each training sample; the 7 convolution pooling modules extract features from the respective embedding results; and the prediction layer predicts, from the features extracted by the 7 convolution pooling modules, the probability that the amino acid sequence $X'^{(n)}$ of the protein in each training sample is soluble, to obtain the soluble probability set $p_{train}$ corresponding to Train;
(6c) calculating the cross-entropy loss value Loss between $p_{train}$ and the solubility labels $y_{train}$ using a cross-entropy loss function, and updating all parameters in the embedding layers, the convolution pooling modules and the prediction layer by gradient descent using the loss value Loss;
(6d) judging whether c ≥ C holds; if so, obtaining the trained multi-dimensional sequence embedding convolutional neural network model H'; otherwise, letting c = c + 1 and returning to step (6b);
(7) Obtaining a protein solubility prediction:
taking the test sample set Test as the input of the trained multi-dimensional sequence embedding convolutional neural network model H' and performing forward propagation to obtain the set of probabilities that the S samples are predicted to be soluble, wherein, depending on the value of its predicted probability, the s-th sample in the test sample set is predicted to be soluble or predicted to be insoluble.
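The sequence preparation recited in steps (2) and (4a) of claim 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `AMINO_ACIDS`, `build_vocab`, `encode_kmers` and `pad_or_truncate` are hypothetical, the index assigned to each combination is an arbitrary choice (the claim only fixes 0 as the padding value), and residues outside the 20 standard one-letter codes are not handled.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (one-letter codes)


def build_vocab(k):
    """Map each k-residue combination to an index >= 1; 0 is reserved for padding.

    The numbering scheme is an assumption; 20**k combinations exist
    (20, 400 and 8000 for k = 1, 2 and 3).
    """
    combos = product(AMINO_ACIDS, repeat=k)
    return {"".join(c): i for i, c in enumerate(combos, start=1)}


def encode_kmers(sequence, k, vocab):
    """Combine every k consecutive residues, front to back.

    A sequence of length L_m yields L_m - k + 1 combinations.
    Raises KeyError for residues outside AMINO_ACIDS.
    """
    return [vocab[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


def pad_or_truncate(ids, length=1200):
    """Pad with 0 up to `length`, or delete the part that exceeds `length`."""
    return ids[:length] + [0] * max(0, length - len(ids))


# Hypothetical toy sequence of 9 residues.
seq = "MKVLAAGIV"
x = pad_or_truncate(encode_kmers(seq, 1, build_vocab(1)))  # amino acid sequence
b = pad_or_truncate(encode_kmers(seq, 2, build_vocab(2)))  # binary combinations
t = pad_or_truncate(encode_kmers(seq, 3, build_vocab(3)))  # ternary combinations
print(len(x), len(b), len(t))  # 1200 1200 1200
```

Before padding, the toy sequence gives 9 single residues, 8 binary combinations and 7 ternary combinations, which matches the lengths $L_m$, $L_m - 1$ and $L_m - 2$ stated in step (2).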
2. The protein solubility prediction method based on multi-dimensional sequence embedding according to claim 1, wherein, in the protein solubility prediction model H of step (5):
the embedding dimensions of the 7 embedding layers were set to 64,5,32,5 and 10, respectively;
the structure of the convolution pooling module is: one-dimensional convolution layer → pooling layer → concat layer, wherein the one-dimensional convolution layer is composed of K one-dimensional convolution units expressed as a set of two-tuples $\{(k_j, q_j)\}_{j=1}^{K}$, where $k_j$ denotes the convolution kernel size and $q_j$ denotes the number of convolution kernels;
parameter setting of the convolution pooling module:
the K one-dimensional convolution units of the one-dimensional convolution layer are set to {(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)};
the pooling mode of the pooling layer is set to global maximum pooling;
the structure of the prediction layer is: a first fully connected layer → a second fully connected layer → a sigmoid layer;
parameter setting of prediction layer:
the number of neurons of the first fully connected layer is set to 128, and its activation function is set to the ReLU function;
the number of neurons of the second fully connected layer is set to 64, and its activation function is set to the ReLU function;
the number of neurons of the sigmoid layer is set to 1, and its activation function is set to the sigmoid function.
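The convolution pooling module of claim 2 (one-dimensional convolution units, global maximum pooling, concat) can be illustrated with a small pure-Python sketch. Several points are assumptions for illustration: the function names are hypothetical, the convolution uses no padding and no activation because the claim does not specify them, random kernels stand in for learned weights, and the toy input (length 40, embedding dimension 5) is much shorter than the claimed L = 1200.

```python
import random

# (kernel size, number of kernels) for the K = 7 convolution units of claim 2.
UNITS = [(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)]


def conv1d_valid(x, kernel):
    """Slide one kernel over x without padding.

    x is a list of L embedding vectors; kernel is a list of k weight vectors
    of the same dimension. Returns L - k + 1 scalar responses.
    """
    k, dim = len(kernel), len(kernel[0])
    return [
        sum(x[i + j][c] * kernel[j][c] for j in range(k) for c in range(dim))
        for i in range(len(x) - k + 1)
    ]


def conv_pool_module(x, units, rng):
    """Apply every kernel of every unit, keep each kernel's global maximum,
    and concatenate the maxima into one feature vector."""
    dim = len(x[0])
    features = []
    for k, q in units:
        for _ in range(q):
            # Random weights are placeholders for trained parameters.
            kernel = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(k)]
            features.append(max(conv1d_valid(x, kernel)))
    return features


rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(40)]  # toy embedded sequence
features = conv_pool_module(x, UNITS, rng)
print(len(features))  # 224
```

Each module therefore yields 7 × 32 = 224 features regardless of sequence length, because global maximum pooling keeps one value per kernel. If the outputs of the 7 parallel modules are concatenated before the prediction layer (an assumption, since the claim does not state how they are merged), the first fully connected layer would receive 7 × 224 = 1568 inputs.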
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110521651.5A CN113223620B (en) | 2021-05-13 | 2021-05-13 | Protein solubility prediction method based on multi-dimensional sequence embedding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113223620A (en) | 2021-08-06 |
CN113223620B (en) | 2023-02-07 |
Family
ID=77095548
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110521651.5A Active CN113223620B (en) | 2021-05-13 | 2021-05-13 | Protein solubility prediction method based on multi-dimensional sequence embedding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113223620B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113851192B (en) * | 2021-09-15 | 2023-06-30 | 安庆师范大学 | Training method and device for amino acid one-dimensional attribute prediction model and attribute prediction method |
CN114582423A (en) * | 2022-02-26 | 2022-06-03 | 河南省健康元生物医药研究院有限公司 | Protein solubility prediction method based on combined machine learning model |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817276B (en) * | 2019-01-29 | 2023-05-23 | 鲁东大学 | Protein secondary structure prediction method based on deep neural network |
US20210134389A1 (en) * | 2019-10-31 | 2021-05-06 | Pharmcadd Co., Ltd. | Method for training protein structure prediction apparatus, protein structure prediction apparatus and method for predicting protein structure based on molecular dynamics |
CN112767997B (en) * | 2021-02-04 | 2023-04-25 | 齐鲁工业大学 | Protein secondary structure prediction method based on multi-scale convolution attention neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||