CN113223620B - Protein solubility prediction method based on multi-dimensional sequence embedding - Google Patents


Info

Publication number
CN113223620B
Authority
CN
China
Prior art keywords
sequence
protein
layer
amino acid
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110521651.5A
Other languages
Chinese (zh)
Other versions
CN113223620A (en
Inventor
鱼亮
武相
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110521651.5A priority Critical patent/CN113223620B/en
Publication of CN113223620A publication Critical patent/CN113223620A/en
Application granted granted Critical
Publication of CN113223620B publication Critical patent/CN113223620B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a protein solubility prediction method based on multi-dimensional sequence embedding, comprising the following steps: (1) acquiring a set of protein amino acid sequences; (2) performing enhanced representation of the amino acid sequence of each protein; (3) calculating structural information for the amino acid sequence of each protein; (4) acquiring a training sample set and a test sample set; (5) constructing a protein solubility prediction model H based on multi-dimensional sequence embedding; (6) iteratively training the protein solubility prediction model H; (7) obtaining the protein solubility prediction result. During model training and prediction, each amino acid sequence is given an enhanced representation and supplemented with structural information, and the resulting sequences are embedded along multiple dimensions. This increases the amount of information available to the model and improves the accuracy of protein solubility prediction. The method can be used to screen amino acid sequences for protein synthesis.

Description

Protein solubility prediction method based on multi-dimensional sequence embedding
Technical Field
The invention belongs to the technical field of bioinformatics and relates to a protein solubility prediction method, in particular to a neural-network-based deep learning method for protein solubility prediction using multi-dimensional sequence embedding. It can be used to screen protein amino acid sequences that are soluble in a protein expression system and to provide a reference for protein synthesis.
Background
The development of genetic engineering and cloning techniques has enabled research and industry to synthesize and isolate proteins on a large scale in protein expression systems. Commonly used protein expression systems include E. coli, yeast, insect cell, and mammalian cell expression systems. However, many proteins are insoluble when expressed heterologously, so the synthesized protein has no biological activity; efficient production of active, soluble proteins therefore remains a major challenge.
The solubility of a protein under given experimental conditions is a trait ultimately determined by its sequence. By studying the amino acid sequence patterns of insoluble and soluble proteins and developing computational methods for protein solubility, experimental work can be concentrated on soluble proteins, improving the efficiency of large-scale screening.
Protein solubility prediction means predicting, by mining patterns in existing protein solubility data, whether the amino acid sequence of a protein under study will be soluble after synthesis. Existing protein solubility prediction methods fall into two categories: traditional machine learning methods based on feature engineering, and deep learning methods based on neural networks. Traditional machine learning methods extract a series of statistical features from the amino acid sequence of a protein and train a machine learning classifier to obtain the final protein solubility prediction model. These methods depend heavily on feature engineering and on experience in feature selection; the amount of information finally extracted is limited and cannot characterize the protein comprehensively, which lowers the upper limit of their prediction accuracy. Deep learning methods use a neural network to learn feature representations automatically from the amino acid sequence of a protein and predict protein solubility end to end. Such methods typically use convolutional neural networks to extract features from the amino acid sequence. For example, Khurana et al., in "DeepSol: a deep learning framework for sequence-based protein solubility prediction", published in Bioinformatics in 2018, disclose the protein solubility prediction method DeepSol.
DeepSol learns feature representations with a convolutional neural network from the amino acid sequence of a single protein only. Although it extracts features automatically, the amino acid sequence alone provides a limited amount of information, and further information is lost in the convolution and pooling operations, which limits the improvement in prediction accuracy.
Disclosure of Invention
The object of the present invention is to overcome the above deficiencies of the prior art by providing a protein solubility prediction method based on multi-dimensional sequence embedding. The method increases the amount of information by giving protein amino acid sequences an enhanced representation and supplementing them with structural information, and it obtains vector representations of a protein from multiple dimensions and fuses them to improve prediction accuracy.
To achieve this object, the technical scheme adopted by the invention comprises the following steps:
(1) Obtaining the amino acid sequence set of the proteins:
downloading from a protein solubility dataset the amino acid sequences of M proteins, $X = \{X^{(1)}, X^{(2)}, \dots, X^{(m)}, \dots, X^{(M)}\}$, and their corresponding solubility labels, $y = \{y^{(1)}, y^{(2)}, \dots, y^{(m)}, \dots, y^{(M)}\}$, wherein $M \ge 10000$; $X^{(m)} = (x^{(m)}_1, \dots, x^{(m)}_l, \dots, x^{(m)}_{L_m})$ denotes the amino acid sequence of the m-th protein, which is composed of the 20 amino acids, has length $L_m$, and is treated as an element of a vector space; $x^{(m)}_l$ denotes the amino acid at position $l$ of the amino acid sequence of the m-th protein; $y^{(m)}$ denotes the solubility label of $X^{(m)}$, where $y^{(m)} = 0$ means that $X^{(m)}$ is insoluble and $y^{(m)} = 1$ means that $X^{(m)}$ is soluble;
(2) Performing enhanced representation of each protein amino acid sequence $X^{(m)}$:
proceeding from front to back through each protein amino acid sequence $X^{(m)}$, combining the amino acids at every two adjacent positions to obtain the binary combination sequence set $B = \{B^{(1)}, B^{(2)}, \dots, B^{(m)}, \dots, B^{(M)}\}$, and combining the amino acids at every three adjacent positions to obtain the ternary combination sequence set $T = \{T^{(1)}, T^{(2)}, \dots, T^{(m)}, \dots, T^{(M)}\}$, wherein $B^{(m)} = (b^{(m)}_1, \dots, b^{(m)}_{L_m - 1})$ denotes the binary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 1$ and is formed from the 400 possible binary amino acid combinations, and $T^{(m)} = (t^{(m)}_1, \dots, t^{(m)}_{L_m - 2})$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 2$ and is formed from the 8000 possible ternary amino acid combinations;
(3) Calculating structural information for the amino acid sequence $X^{(m)}$ of each protein:
(3a) using the ACCPro5 software package to calculate the relative solvent accessibility of the amino acid sequence $X^{(m)}$ of each protein, both at a 25% threshold and over the 0-95% threshold interval, obtaining the set of relative solvent accessibility sequences with 2 accessibility classes at the 25% threshold, $RSA2 = \{RSA2^{(1)}, \dots, RSA2^{(m)}, \dots, RSA2^{(M)}\}$, and the set of relative solvent accessibility sequences with 20 accessibility classes over the 0-95% threshold interval, $RSA20 = \{RSA20^{(1)}, \dots, RSA20^{(m)}, \dots, RSA20^{(M)}\}$, wherein each sequence holds one class per position of $X^{(m)}$; in $RSA2^{(m)}$, e indicates that the amino acid at that position can contact the solvent and the other class indicates that it cannot;
(3b) using the SSpro5 software package to calculate the three-state secondary structure sequence $SS3^{(m)}$ and the eight-state secondary structure sequence $SS8^{(m)}$ of the amino acid sequence $X^{(m)}$ of each protein, obtaining for X the three-state secondary structure sequence set $SS3 = \{SS3^{(1)}, \dots, SS3^{(m)}, \dots, SS3^{(M)}\}$ with 3 secondary structure classes and the eight-state secondary structure sequence set $SS8 = \{SS8^{(1)}, \dots, SS8^{(m)}, \dots, SS8^{(M)}\}$ with 8 secondary structure classes;
(4) Acquiring a training sample set and a test sample set:
(4a) initializing to a common length L, with L = 1200, the amino acid sequence $X^{(m)}$ of each protein and its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ with 2 and 20 accessibility classes, and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$; during initialization, a sequence shorter than L is filled with 0, and the part of a sequence that exceeds L is deleted;
(4b) combining all the length-initialized sequences into the set of multi-dimensional sequence representation samples of the proteins, $D = \{D^{(1)}, D^{(2)}, \dots, D^{(m)}, \dots, D^{(M)}\}$; taking N multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the training sample set $Train = \{(D^{(n)}, y^{(n)})\}_{n=1}^{N}$, and taking the remaining S multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the test sample set $Test = \{(D^{(s)}, y^{(s)})\}_{s=1}^{S}$, wherein $D^{(m)} = (X'^{(m)}, B'^{(m)}, T'^{(m)}, RSA2'^{(m)}, RSA20'^{(m)}, SS3'^{(m)}, SS8'^{(m)})$ denotes a multi-dimensional sequence representation sample of 7 dimensions in total, comprising the length-L protein amino acid sequence $X'^{(m)}$, its corresponding binary combination sequence $B'^{(m)}$ and ternary combination sequence $T'^{(m)}$, the relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 accessibility classes, and the three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$, and $S = M - N$;
(5) Constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:
constructing a protein solubility prediction model comprising 7 parallel-arranged embedding layers for realizing multi-dimensional sequence embedding and a prediction layer, wherein a convolution pooling module is loaded between each embedding layer and the prediction layer, and the convolution pooling module comprises a one-dimensional convolution layer, a global maximum pooling layer and a concat layer which are sequentially stacked; the prediction layer comprises a plurality of full connection layers and a sigmoid layer which are sequentially stacked;
(6) Iteratively training the protein solubility prediction model H:
(6a) randomly initializing all parameters in the embedding layers, the convolution pooling modules, and the prediction layer; denoting the iteration counter by c and the maximum number of iterations by C, with $C \ge 1$, and setting c = 0;
(6b) taking the training sample set Train as the input of the protein solubility prediction model H; the 7 embedding layers respectively embed the 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$, $SS8'^{(n)}$ of each training sample $D^{(n)}$; the 7 convolution pooling modules extract features from the corresponding embedding results; and the prediction layer uses the features extracted by the 7 convolution pooling modules to predict the probability $p^{(n)}$ that the protein amino acid sequence $X'^{(n)}$ in $D^{(n)}$ is soluble, giving the set of soluble probabilities corresponding to Train, $p_{train} = \{p^{(1)}, \dots, p^{(n)}, \dots, p^{(N)}\}$;
(6c) using the cross-entropy loss function to calculate the loss value Loss between $p_{train}$ and the solubility labels $y_{train}$, and updating all parameters in the embedding layers, the convolution pooling modules, and the prediction layer from Loss by gradient descent, wherein:
$$Loss = -\frac{1}{N}\sum_{n=1}^{N}\left[y^{(n)}\log p^{(n)} + \left(1 - y^{(n)}\right)\log\left(1 - p^{(n)}\right)\right]$$
(6d) judging whether $c \ge C$; if so, obtaining the trained multi-dimensional sequence embedding convolutional neural network model H'; otherwise setting c = c + 1 and returning to step (6b);
(7) Obtaining the protein solubility prediction result:
taking the test sample set Test as the input of the trained multi-dimensional sequence embedding convolutional neural network model H' and performing forward propagation to obtain the set of probabilities that the S samples are predicted to be soluble, $p_{test} = \{p^{(1)}, \dots, p^{(s)}, \dots, p^{(S)}\}$; depending on its probability $p^{(s)}$, the s-th sample in the test sample set is predicted to be either soluble or insoluble.
Compared with the prior art, the invention has the following advantages:
1. The protein solubility prediction model constructed by the invention comprises 7 embedding layers and a convolution pooling module connected to each embedding layer. During model training and prediction, the 7 embedding layers and their convolution pooling modules embed the 7 sequences separately and learn different feature representations of the protein from sequences of multiple dimensions. These representations complement one another and jointly characterize the protein.
2. In acquiring the training sample set and the test sample set, the invention performs enhanced representation of the amino acid sequence of each protein, which compensates for the information lost in the convolution and pooling operations when feature representations are learned from the amino acid sequence alone. In addition, the four kinds of structural information on the amino acid sequence of each protein contained in the training and test sample sets increase the amount of information available during model training and prediction.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to fig. 1 and the specific examples.
Referring to fig. 1, the present invention includes the steps of:
Step 1) obtaining the amino acid sequence set of the proteins:
this example downloads from the protein solubility dataset of DeepSol the amino acid sequences of M proteins, $X = \{X^{(1)}, X^{(2)}, \dots, X^{(m)}, \dots, X^{(M)}\}$, and their corresponding solubility labels, $y = \{y^{(1)}, y^{(2)}, \dots, y^{(m)}, \dots, y^{(M)}\}$, where M = 71421; $X^{(m)} = (x^{(m)}_1, \dots, x^{(m)}_l, \dots, x^{(m)}_{L_m})$ denotes the amino acid sequence of the m-th protein, which is composed of the 20 amino acids, has length $L_m$, and is treated as an element of a vector space; $x^{(m)}_l$ denotes the amino acid at position $l$ of the amino acid sequence of the m-th protein; $y^{(m)}$ denotes the solubility label of $X^{(m)}$, where $y^{(m)} = 0$ means that $X^{(m)}$ is insoluble and $y^{(m)} = 1$ means that $X^{(m)}$ is soluble;
Step 2) performing enhanced representation of each protein amino acid sequence $X^{(m)}$:
proceeding from front to back through each protein amino acid sequence $X^{(m)}$, the amino acids at every two adjacent positions are combined to obtain the binary combination sequence set $B = \{B^{(1)}, B^{(2)}, \dots, B^{(m)}, \dots, B^{(M)}\}$, and the amino acids at every three adjacent positions are combined to obtain the ternary combination sequence set $T = \{T^{(1)}, T^{(2)}, \dots, T^{(m)}, \dots, T^{(M)}\}$, wherein $B^{(m)} = (b^{(m)}_1, \dots, b^{(m)}_{L_m - 1})$ denotes the binary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 1$ and is formed from the 400 possible binary amino acid combinations, and $T^{(m)} = (t^{(m)}_1, \dots, t^{(m)}_{L_m - 2})$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 2$ and is formed from the 8000 possible ternary amino acid combinations.
Enhancing the protein amino acid sequence with these multi-residue combination sequences lets the model learn more complex patterns in the amino acid sequence, compensates for information lost in the subsequent convolution and pooling operations, and improves the accuracy of protein solubility prediction.
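The enhanced representation of step 2 can be sketched as overlapping windows of two and three residues. This is a minimal illustration, not the patented implementation: the function names `combine` and `vocabulary`, the example sequence, and the choice to reserve index 0 for padding are assumptions introduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues

def combine(sequence, n):
    """Return the overlapping n-residue combinations of a sequence, front to back."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

def vocabulary(n):
    """Index every possible n-residue combination from 1 (0 is kept free for padding, an assumption)."""
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=n), start=1)}

x = "MKTAYIAK"        # hypothetical sequence with L_m = 8
b = combine(x, 2)     # binary combination sequence, length L_m - 1
t = combine(x, 3)     # ternary combination sequence, length L_m - 2
```

The vocabulary sizes agree with the counts given in the text: 20 × 20 = 400 binary combinations and 20 × 20 × 20 = 8000 ternary combinations.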
Step 3) calculating structural information for the amino acid sequence $X^{(m)}$ of each protein:
step 3a) using the ACCPro5 software package to calculate the relative solvent accessibility of the amino acid sequence $X^{(m)}$ of each protein, both at a 25% threshold and over the 0-95% threshold interval, obtaining the set of relative solvent accessibility sequences with 2 accessibility classes at the 25% threshold, $RSA2 = \{RSA2^{(1)}, \dots, RSA2^{(m)}, \dots, RSA2^{(M)}\}$, and the set of relative solvent accessibility sequences with 20 accessibility classes over the 0-95% threshold interval, $RSA20 = \{RSA20^{(1)}, \dots, RSA20^{(m)}, \dots, RSA20^{(M)}\}$, wherein each sequence holds one class per position of $X^{(m)}$; in $RSA2^{(m)}$, e indicates that the amino acid at that position can contact the solvent and the other class indicates that it cannot; in $RSA20^{(m)}$, a larger value indicates that the amino acid at that position is more likely to be accessible to the solvent. The relative solvent accessibility sequences reflect the structure of the protein amino acid sequence and thus enlarge the information content of the dataset;
step 3b) using the SSpro5 software package to calculate the three-state secondary structure sequence $SS3^{(m)}$ and the eight-state secondary structure sequence $SS8^{(m)}$ of the amino acid sequence $X^{(m)}$ of each protein, obtaining for X the three-state secondary structure sequence set $SS3 = \{SS3^{(1)}, \dots, SS3^{(m)}, \dots, SS3^{(M)}\}$ with 3 secondary structure classes and the eight-state secondary structure sequence set $SS8 = \{SS8^{(1)}, \dots, SS8^{(m)}, \dots, SS8^{(M)}\}$ with 8 secondary structure classes;
the three-state and eight-state secondary structures represent the secondary structure information of the protein at different granularities: the three-state secondary structure comprises alpha helix, beta strand, and coil, and the eight-state secondary structure further subdivides alpha helix, beta strand, and coil into eight classes;
the three-state and eight-state secondary structure sequences provide structural information during model training and prediction, which increases the amount of information and improves the precision of protein solubility prediction.
Step 4) acquiring a training sample set and a test sample set:
step 4a) initializing to a common length L, with L = 1200, the amino acid sequence $X^{(m)}$ of each protein and its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ with 2 and 20 accessibility classes, and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$; during initialization, a sequence shorter than L is filled with 0, and the part of a sequence that exceeds L is deleted;
the lengths are initialized to L because a neural-network-based deep learning model requires inputs of identical shape, whereas the amino acid sequences of different proteins generally differ in length. L is set to 1200 because most protein amino acid sequences in the dataset are no longer than 1200; this length preserves the relative integrity of the sequences, and because the model uses global max pooling, filling with 0 does not affect the training and prediction of the model.
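The length initialization of step 4a can be sketched as follows. The helper name `pad_or_truncate` is hypothetical, the input is assumed to be already integer-encoded, and appending the zeros at the end of the sequence is an assumption, since the text does not state at which end the filling takes place.

```python
def pad_or_truncate(indices, length=1200, pad_value=0):
    """Force an integer-encoded sequence to a fixed length."""
    if len(indices) >= length:
        # Delete the part that exceeds the target length.
        return list(indices[:length])
    # Fill with zeros (appended at the end here, which is an assumption).
    return list(indices) + [pad_value] * (length - len(indices))

short = pad_or_truncate([5, 3, 9], length=6)
long = pad_or_truncate(list(range(1, 11)), length=6)
```

The same helper would be applied to all seven sequences of a protein, so that every dimension of a sample has the same length L.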
Step 4b) combining all the length-initialized sequences into the set of multi-dimensional sequence representation samples of the proteins, $D = \{D^{(1)}, D^{(2)}, \dots, D^{(m)}, \dots, D^{(M)}\}$; taking N multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the training sample set $Train = \{(D^{(n)}, y^{(n)})\}_{n=1}^{N}$, and taking the remaining S multi-dimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the test sample set $Test = \{(D^{(s)}, y^{(s)})\}_{s=1}^{S}$, wherein $D^{(m)} = (X'^{(m)}, B'^{(m)}, T'^{(m)}, RSA2'^{(m)}, RSA20'^{(m)}, SS3'^{(m)}, SS8'^{(m)})$ denotes a multi-dimensional sequence representation sample of 7 dimensions in total, comprising the length-L protein amino acid sequence $X'^{(m)}$, its corresponding binary combination sequence $B'^{(m)}$ and ternary combination sequence $T'^{(m)}$, the relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 accessibility classes, and the three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$, N = 69420, S = 2001;
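As a quick check of the sample counts, and as a sketch of how samples and labels might be paired, consider the following. The text does not state how the N training samples are selected, so cutting the list by position is purely an assumption, and `split_samples` and the toy data are hypothetical.

```python
M, N = 71421, 69420      # totals stated in this example
S = M - N                # the remaining samples form the test set

def split_samples(samples, labels, n_train):
    """Pair each sample with its label, then cut the list after n_train items."""
    pairs = list(zip(samples, labels))
    return pairs[:n_train], pairs[n_train:]

# Toy data standing in for multi-dimensional sequence representation samples.
toy_samples = ["D1", "D2", "D3", "D4", "D5"]
toy_labels = [1, 0, 1, 1, 0]
train, test = split_samples(toy_samples, toy_labels, n_train=3)
```

With the stated values, M - N gives 2001, which agrees with the value of S in the text.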
Step 5) constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:
constructing a protein solubility prediction model comprising 7 parallel-arranged embedding layers for realizing multi-dimensional sequence embedding and a prediction layer, wherein one embedding layer is used for embedding one dimension of a multi-dimensional sequence representation sample set, and a convolution pooling module is loaded between each embedding layer and the prediction layer and comprises a one-dimensional convolution layer, a global maximum pooling layer and a concat layer which are sequentially stacked; the prediction layer comprises a plurality of full connection layers and a sigmoid layer which are sequentially stacked;
the 7 embedding layers and their corresponding convolution pooling modules extract features of the protein from the sequences of the 7 dimensions, so that the protein is characterized from different angles; the prediction layer fuses the features extracted from the 7 sequences to predict the solubility of the protein comprehensively, which improves the accuracy of protein solubility prediction;
the embedding dimensions of the 7 embedding layers are set to 64,5,32,5 and 10, respectively;
the structure of the convolution pooling module is: one-dimensional convolution layer → pooling layer → concat layer, wherein the one-dimensional convolution layer consists of K one-dimensional convolution units, expressed as a set of two-tuples $\{(k_j, q_j)\}_{j=1}^{K}$, where $k_j$ denotes the size of the convolution kernel and $q_j$ denotes the number of convolution kernels;
parameter settings of the convolution pooling module:
the K one-dimensional convolution units of the one-dimensional convolution layer are set to {(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)};
the pooling mode of the pooling layer is set to global max pooling;
the structure of the prediction layer is: first fully connected layer → second fully connected layer → sigmoid layer;
parameter settings of the prediction layer:
the first fully connected layer has 128 neurons and uses the ReLU activation function;
the second fully connected layer has 64 neurons and uses the ReLU activation function;
the sigmoid layer has 1 neuron and uses the sigmoid activation function;
wherein $\mathrm{ReLU}(x) = \max(x, 0)$ and $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$.
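To make the convolution pooling module concrete, the sketch below applies a single convolution kernel followed by global max pooling in plain Python, then counts the features that would reach the prediction layer. It is a simplified illustration: bias terms and activation functions are omitted, `conv1d_global_max` is a hypothetical helper, and the final count assumes that the outputs of all 7 modules are concatenated before the first fully connected layer.

```python
def conv1d_global_max(embedded, kernel):
    """Slide one kernel over an embedded sequence and keep the maximum response.

    embedded: list of positions, each a list of floats (one embedding vector).
    kernel:   list of k weight vectors, each as wide as an embedding vector.
    """
    k = len(kernel)
    responses = []
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        responses.append(sum(w * v
                             for w_row, v_row in zip(kernel, window)
                             for w, v in zip(w_row, v_row)))
    return max(responses)  # global max pooling: one value per kernel

KERNEL_SIZES = [3, 5, 7, 9, 11, 13, 15]   # k_j values from the text
FILTERS_PER_SIZE = 32                      # q_j value from the text
BRANCHES = 7                               # one module per sequence dimension

# Each kernel yields one pooled value, so a module concatenates 7 * 32 values.
features_per_module = len(KERNEL_SIZES) * FILTERS_PER_SIZE
# Assumption: the 7 module outputs are concatenated before the prediction layer.
features_into_prediction_layer = BRANCHES * features_per_module

peak = conv1d_global_max([[1.0], [2.0], [3.0], [0.0]], [[1.0], [1.0]])
```

Because global max pooling keeps only one value per kernel, the size of the pooled feature vector depends on the number of kernels and not on the sequence length L.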
Step 6) iteratively training the protein solubility prediction model H:
step 6a) randomly initializing all parameters in the embedding layers, the convolution pooling modules, and the prediction layer; denoting the iteration counter by c and the maximum number of iterations by C, with $C \ge 1$, and setting c = 0 and C = 3;
step 6b) taking the training sample set Train as the input of the protein solubility prediction model H; the 7 embedding layers respectively embed the 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$, $SS8'^{(n)}$ of each training sample $D^{(n)}$; the 7 convolution pooling modules extract features from the corresponding embedding results; and the prediction layer uses the features extracted by the 7 convolution pooling modules to predict the probability $p^{(n)}$ that the protein amino acid sequence $X'^{(n)}$ in $D^{(n)}$ is soluble, giving the set of soluble probabilities corresponding to Train, $p_{train} = \{p^{(1)}, \dots, p^{(n)}, \dots, p^{(N)}\}$;
step 6c) using the cross-entropy loss function to calculate the loss value Loss between $p_{train}$ and the solubility labels $y_{train}$, and updating all parameters in the embedding layers, the convolution pooling modules, and the prediction layer from Loss by gradient descent, wherein:
$$Loss = -\frac{1}{N}\sum_{n=1}^{N}\left[y^{(n)}\log p^{(n)} + \left(1 - y^{(n)}\right)\log\left(1 - p^{(n)}\right)\right]$$
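The cross-entropy loss of step 6c can be illustrated with a small pure-Python function. This is a sketch under assumptions: the loss is averaged over the samples, probabilities are clipped by a small `eps` for numerical safety, and the example probabilities and labels are invented for illustration.

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities p and 0/1 labels y."""
    total = 0.0
    for p_n, y_n in zip(p, y):
        p_n = min(max(p_n, eps), 1.0 - eps)  # keep log() away from 0
        total += y_n * math.log(p_n) + (1 - y_n) * math.log(1.0 - p_n)
    return -total / len(p)

# Hypothetical predictions for three training samples and their labels.
loss = binary_cross_entropy([0.9, 0.2, 0.6], [1, 0, 1])
```

The loss falls toward 0 as the predicted probabilities approach the labels, which is what the gradient descent update of step 6c drives toward.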
step 6d) judging whether $c \ge C$; if so, obtaining the trained multi-dimensional sequence embedding convolutional neural network model H'; otherwise setting c = c + 1 and returning to step 6b);
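The control flow of steps 6a to 6d can be traced with a short sketch. Read literally, with the counter starting at 0 and the stopping test performed after each pass, the loop makes C + 1 passes over the training set; whether that is the intended count is not stated, so this is only one reading. The function `run_training` and the no-op training pass are hypothetical.

```python
def run_training(max_iterations, train_one_pass):
    """Follow steps 6a-6d literally and return the number of training passes made."""
    c = 0                          # step 6a: iteration counter starts at 0
    passes = 0
    while True:
        train_one_pass()           # steps 6b and 6c: forward pass, loss, update
        passes += 1
        if c >= max_iterations:    # step 6d: stop once the counter reaches C
            return passes
        c += 1                     # otherwise increment and repeat from step 6b

passes = run_training(3, lambda: None)   # C = 3 as in this example
```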
Step 7) obtaining the protein solubility prediction result:
taking the test sample set Test as the input of the trained multi-dimensional sequence embedding convolutional neural network model H' and performing forward propagation to obtain the set of probabilities that the S samples are predicted to be soluble, $p_{test} = \{p^{(1)}, \dots, p^{(s)}, \dots, p^{(S)}\}$; depending on its probability $p^{(s)}$, the s-th sample in the test sample set is predicted to be either soluble or insoluble.
The technical effects of the invention are further explained by combining simulation experiments as follows:
1. simulation conditions and contents:
The simulation experiment was performed with Python 3.6.2, Tensorflow-gpu-1.12, and Keras-2.2.4 on a Red Hat 4.8.5-11 platform with an Intel(R) Xeon(R) Gold 5115 CPU (20 cores, 2.40 GHz), 48 GB of memory, and a Tesla P40 graphics card, using the protein solubility dataset of the DeepSol E. coli expression system.
The prediction accuracy of the present invention was compared by simulation with that of the existing protein solubility prediction method DeepSol; the results are shown in Table 1.
2. And (3) simulation result analysis:
the evaluation indexes adopted for the prediction Accuracy of the protein solubility comprise Accuracy and AUC.
(1) Accuracy = (TP + TN)/(TP + FN + FP + TN), where TP is the number of samples that are actually positive and are correctly predicted as positive, TN is the number of samples that are actually negative and are correctly predicted as negative, FP is the number of samples that are actually negative but are incorrectly predicted as positive, and FN is the number of samples that are actually positive but are incorrectly predicted as negative; positive means soluble and negative means insoluble.
(2) AUC (area under the curve) is the area under the ROC (receiver operating characteristic) curve. The abscissa of the ROC curve is the false positive rate, FPR = FP/(TN + FP), and the ordinate is the true positive rate, TPR = TP/(TP + FN).
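The two evaluation indexes can be computed as in the following Python sketch. Accuracy is taken as the fraction of correctly predicted samples, (TP + TN) over all samples. AUC is computed through the pairwise-ranking form, which equals the area under the ROC curve: the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. The function names and the toy data are hypothetical, and the 0.5 threshold used to obtain hard predictions is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN); 1 = soluble (positive), 0 = insoluble (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples that are predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def fpr_tpr(tp, tn, fp, fn):
    """One ROC point: (false positive rate, true positive rate)."""
    return fp / (tn + fp), tp / (tp + fn)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the patent's experiments.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
counts = confusion_counts(y_true, y_pred)
```

Accuracy depends on the chosen threshold, whereas AUC is threshold-free, which is why the two indexes are reported together.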
Table 1 shows the results of comparing Accuracy and AUC values on the pasrenip dataset for the present invention and the prior art.
TABLE 1
Method         Accuracy  AUC
Prior Art      0.77      0.86
The invention  0.79      0.87
As can be seen from Table 1, both the Accuracy and the AUC of the present invention are higher than those of the prior art (0.79 versus 0.77 and 0.87 versus 0.86), so the accuracy of protein solubility prediction is improved.
The foregoing description is only an example of the present invention and should not be construed as limiting the invention in any way, and it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the principles and arrangements of the invention, but such changes and modifications are within the scope of the invention as defined by the appended claims.

Claims (2)

1. A protein solubility prediction method based on multi-dimensional sequence embedding, characterized by comprising the following steps:
(1) Obtaining the amino acid sequence set of the proteins:
downloading, from a protein solubility data set, the amino acid sequences of M proteins $X = \{X^{(1)}, X^{(2)}, \dots, X^{(m)}, \dots, X^{(M)}\}$ and their corresponding solubility labels $y = \{y^{(1)}, y^{(2)}, \dots, y^{(m)}, \dots, y^{(M)}\}$, wherein $M \geq 10000$,
Figure FDA0003064247210000011
$X^{(m)}$ denotes the amino acid sequence of the m-th protein, which is composed of the 20 kinds of amino acids and has length $L_m$,
Figure FDA0003064247210000012
denotes a vector space,
Figure FDA0003064247210000013
Figure FDA0003064247210000014
denotes the amino acid at position $l$ in the amino acid sequence of the m-th protein, $y^{(m)}$ denotes the solubility label of $X^{(m)}$, $y^{(m)} = 0$ means that $X^{(m)}$ is insoluble, and $y^{(m)} = 1$ means that $X^{(m)}$ is soluble;
(2) Performing enhanced representation of the amino acid sequence $X^{(m)}$ of each protein:
in order from front to back, combining the amino acids at every two positions of the amino acid sequence $X^{(m)}$ of each protein to obtain the binary combination sequence set $B = \{B^{(1)}, B^{(2)}, \dots, B^{(m)}, \dots, B^{(M)}\}$,
Figure FDA0003064247210000015
and combining the amino acids at every three positions of $X^{(m)}$ to obtain the ternary combination sequence set $T = \{T^{(1)}, T^{(2)}, \dots, T^{(m)}, \dots, T^{(M)}\}$,
Figure FDA0003064247210000016
wherein
Figure FDA0003064247210000017
$B^{(m)}$ denotes the binary combination sequence corresponding to $X^{(m)}$, which is composed of the 400 kinds of binary amino acid combinations and has length L-1,
Figure FDA0003064247210000018
$T^{(m)}$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which is composed of the 8000 kinds of ternary amino acid combinations and has length L-2;
(3) Calculating the structural information of the amino acid sequence $X^{(m)}$ of each protein:
(3a) using the ACCPro5 software package, calculating the relative solvent accessibility of the amino acid sequence $X^{(m)}$ of each protein at a 25% threshold and over the 0-95% threshold interval, to obtain the relative solvent accessibility sequence representation set with 2 relative solvent accessibility categories under the 25% threshold, $RSA2 = \{RSA2^{(1)}, \dots, RSA2^{(m)}, \dots, RSA2^{(M)}\}$,
Figure FDA0003064247210000021
and the relative solvent accessibility sequence representation set with 20 relative solvent accessibility categories over the 0-95% threshold interval, $RSA20 = \{RSA20^{(1)}, \dots, RSA20^{(m)}, \dots, RSA20^{(M)}\}$,
Figure FDA0003064247210000022
wherein
Figure FDA0003064247210000023
$r_l^{(m)} \in \{e, -\}$, e means that
Figure FDA0003064247210000024
can be in contact with the solvent, and - means that
Figure FDA0003064247210000025
cannot be in contact with the solvent,
Figure FDA0003064247210000026
Figure FDA0003064247210000027
(3b) for the amino acid sequence $X^{(m)}$ of each protein, calculating with the SSpro5 software package its three-state secondary structure sequence
Figure FDA0003064247210000028
and its eight-state secondary structure sequence
Figure FDA0003064247210000029
to obtain the three-state secondary structure sequence set corresponding to X with 3 secondary structure categories, $SS3 = \{SS3^{(1)}, \dots, SS3^{(m)}, \dots, SS3^{(M)}\}$, and the eight-state secondary structure sequence set with 8 secondary structure categories, $SS8 = \{SS8^{(1)}, \dots, SS8^{(m)}, \dots, SS8^{(M)}\}$, wherein
Figure FDA00030642472100000210
Figure FDA00030642472100000211
(4) Acquiring a training sample set and a test sample set:
(4a) initializing to a common length L, L = 1200, the amino acid sequence $X^{(m)}$ of each protein and its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ with 2 and 20 relative solvent accessibility categories, and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$; during initialization, a sequence shorter than L is padded with 0, and the part of a sequence that exceeds L is deleted;
(4b) combining all length-initialized sequences of each protein into a multi-dimensional sequence representation sample to obtain the sample set $D = \{D^{(1)}, D^{(2)}, \dots, D^{(m)}, \dots, D^{(M)}\}$, and taking N multi-dimensional sequence representation samples together with the solubility labels of the amino acid sequences they contain as the training sample set
Figure FDA00030642472100000212
and taking the remaining S multi-dimensional sequence representation samples together with the solubility labels of the amino acid sequences they contain as the test sample set
Figure FDA00030642472100000213
wherein $D^{(m)}$ denotes a multi-dimensional sequence representation sample of 7 dimensions in total, comprising the protein amino acid sequence $X'^{(m)}$ of length L and its corresponding binary combination sequence $B'^{(m)}$, ternary combination sequence $T'^{(m)}$, relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 relative solvent accessibility categories, and three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$, $D^{(m)} = (X'^{(m)}, B'^{(m)}, T'^{(m)}, RSA2'^{(m)}, RSA20'^{(m)}, SS3'^{(m)}, SS8'^{(m)})$,
Figure FDA0003064247210000031
$S = M - N$;
(5) Constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:
constructing a protein solubility prediction model comprising 7 embedding layers arranged in parallel, which realize the multi-dimensional sequence embedding, and a prediction layer, wherein a convolution pooling module is connected between each embedding layer and the prediction layer; the convolution pooling module comprises a one-dimensional convolution layer, a global max pooling layer and a concat layer stacked in sequence; the prediction layer comprises a plurality of fully connected layers and a sigmoid layer stacked in sequence;
(6) Iteratively training the protein solubility prediction model H:
(6a) randomly initializing all parameters in the embedding layers, the convolution pooling modules and the prediction layer, letting the iteration number be c and the maximum iteration number be C, C ≥ 1, and setting c = 0;
(6b) taking the training sample set Train as the input of the protein solubility prediction model H, the 7 embedding layers embedding, for each training sample
Figure FDA0003064247210000032
its 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$ and $SS8'^{(n)}$ respectively, the 7 convolution pooling modules extracting features from the corresponding embedding results, and the prediction layer using the features extracted by the 7 convolution pooling modules to predict, for the protein amino acid sequence $X'^{(n)}$ contained in
Figure FDA0003064247210000033
the probability of being soluble
Figure FDA0003064247210000034
to obtain the set of soluble probabilities corresponding to Train:
Figure FDA0003064247210000035
(6c) using a cross-entropy loss function to calculate the cross-entropy loss value Loss between $p_{train}$ and the solubility labels $y_{train}$, and using the gradient descent method to update all parameters in the embedding layers, the convolution pooling modules and the prediction layer according to the loss value Loss, wherein:
Figure FDA0003064247210000036
(6d) judging whether c ≥ C holds; if so, obtaining the trained multi-dimensional sequence embedding convolutional neural network model H'; otherwise, letting c = c + 1 and executing step (6b);
(7) Obtaining the protein solubility prediction result:
taking the test sample set Test as the input of the trained multi-dimensional sequence embedding convolutional neural network model H' and performing forward propagation to obtain the set of probabilities that the S samples are predicted to be soluble:
Figure FDA0003064247210000041
when
Figure FDA0003064247210000042
holds, the s-th sample in the test sample set is predicted to be soluble; when
Figure FDA0003064247210000043
holds, the s-th sample is predicted to be insoluble.
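Steps (2) and (4a) of claim 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not spell out: the combinations are taken over adjacent, overlapping positions (which matches the stated lengths of one and two fewer than the sequence length), tokens are mapped to integer ids starting at 1 so that 0 can serve as the padding value, padding is appended at the end, and the 20 amino acids are written in the usual one-letter code. The names `build_vocab`, `combine` and `encode_and_pad` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def build_vocab(k):
    """Map every k-residue combination to an integer id; 0 is kept for padding."""
    return {"".join(c): i + 1 for i, c in enumerate(product(AMINO_ACIDS, repeat=k))}


def combine(seq, k):
    """Overlapping k-residue combinations read front to back.

    The result has len(seq) - k + 1 tokens; k = 1 returns the residues themselves.
    """
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def encode_and_pad(tokens, vocab, length=1200):
    """Encode tokens as ids, cut off anything beyond `length`, pad with 0."""
    ids = [vocab[t] for t in tokens][:length]  # KeyError on non-standard residues
    return ids + [0] * (length - len(ids))


seq = "MKTAYIAK"                       # toy sequence, not from the data set
binary = combine(seq, 2)               # ['MK', 'KT', 'TA', 'AY', 'YI', 'IA', 'AK']
ternary = combine(seq, 3)              # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']
b_ids = encode_and_pad(binary, build_vocab(2))
t_ids = encode_and_pad(ternary, build_vocab(3))
```

With 20 amino acids the binary vocabulary has 20 × 20 = 400 entries and the ternary vocabulary has 20 × 20 × 20 = 8000 entries, matching the counts given in step (2).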
2. The protein solubility prediction method based on multi-dimensional sequence embedding according to claim 1, wherein, in the protein solubility prediction model H of step (5):
the embedding dimensions of the 7 embedding layers were set to 64,5,32,5 and 10, respectively;
the structure of the convolution pooling module is: one-dimensional convolution layer → pooling layer → concat layer, wherein the one-dimensional convolution layer consists of K one-dimensional convolution units, expressed as a set of two-tuples
Figure FDA0003064247210000044
where $k_j$ denotes the size of the convolution kernels and $q_j$ denotes the number of convolution kernels;
parameter settings of the convolution pooling module:
the K one-dimensional convolution units of the one-dimensional convolution layer are set to {(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)};
the pooling mode of the pooling layer is set to global max pooling;
the structure of the prediction layer is: first fully connected layer → second fully connected layer → sigmoid layer;
parameter settings of the prediction layer:
the number of neurons of the first fully connected layer is set to 128, and its activation function is set to the ReLU function;
the number of neurons of the second fully connected layer is set to 64, and its activation function is set to the ReLU function;
the number of neurons of the sigmoid layer is set to 1, and its activation function is set to the sigmoid function;
wherein ReLU(x) = max(x, 0),
Figure FDA0003064247210000045
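The convolution pooling module of claim 2 can be illustrated with a small pure-Python sketch that shows why its output size does not depend on the sequence length. The assumptions are stated plainly: random kernels stand in for trained weights, the convolution uses no padding and no bias, no activation is applied inside the convolution, and the sigmoid is written in its standard form 1/(1 + e^(-x)), which is presumably what the formula image shows. The function `conv_pool_module` and the toy input sizes are hypothetical.

```python
import math
import random

# (kernel size k_j, number of kernels q_j) for the K = 7 convolution units
CONV_UNITS = [(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)]


def relu(x):
    return max(x, 0.0)


def sigmoid(x):
    """Standard logistic function (assumed form of the sigmoid activation)."""
    return 1.0 / (1.0 + math.exp(-x))


def conv_pool_module(embedded, units, rng):
    """1-D convolution, global max pooling, then concatenation.

    embedded: list of L embedding vectors, each of dimension d.
    Returns one pooled value per kernel, i.e. sum of q_j values in total.
    """
    d = len(embedded[0])
    features = []
    for k, q in units:
        for _ in range(q):
            # Random kernel of shape (k, d); a trained model would learn these.
            kernel = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(k)]
            best = -math.inf
            for start in range(len(embedded) - k + 1):
                act = sum(kernel[i][j] * embedded[start + i][j]
                          for i in range(k) for j in range(d))
                best = max(best, act)  # global max pooling over positions
            features.append(best)      # concat across all kernels
    return features


rng = random.Random(0)
toy_input = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(20)]  # L=20, d=4
features = conv_pool_module(toy_input, CONV_UNITS, rng)
per_module = len(features)       # 7 units x 32 kernels = 224
total_features = 7 * per_module  # 7 parallel modules feed the prediction layer
```

Under these assumptions each module emits 224 values, so the first fully connected layer (128 neurons) would receive 7 × 224 = 1568 features regardless of the sequence length.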
CN202110521651.5A 2021-05-13 2021-05-13 Protein solubility prediction method based on multi-dimensional sequence embedding Active CN113223620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110521651.5A CN113223620B (en) 2021-05-13 2021-05-13 Protein solubility prediction method based on multi-dimensional sequence embedding

Publications (2)

Publication Number Publication Date
CN113223620A CN113223620A (en) 2021-08-06
CN113223620B true CN113223620B (en) 2023-02-07

Family

ID=77095548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110521651.5A Active CN113223620B (en) 2021-05-13 2021-05-13 Protein solubility prediction method based on multi-dimensional sequence embedding

Country Status (1)

Country Link
CN (1) CN113223620B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113851192B (en) * 2021-09-15 2023-06-30 安庆师范大学 Training method and device for amino acid one-dimensional attribute prediction model and attribute prediction method
CN114582423A (en) * 2022-02-26 2022-06-03 河南省健康元生物医药研究院有限公司 Protein solubility prediction method based on combined machine learning model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817276B (en) * 2019-01-29 2023-05-23 鲁东大学 Protein secondary structure prediction method based on deep neural network
US20210134389A1 (en) * 2019-10-31 2021-05-06 Pharmcadd Co., Ltd. Method for training protein structure prediction apparatus, protein structure prediction apparatus and method for predicting protein structure based on molecular dynamics
CN112767997B (en) * 2021-02-04 2023-04-25 齐鲁工业大学 Protein secondary structure prediction method based on multi-scale convolution attention neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant