CN113223620B

CN113223620B - Protein solubility prediction method based on multi-dimensional sequence embedding

Info

Publication number: CN113223620B
Application number: CN202110521651.5A
Authority: CN
Inventors: 鱼亮; 武相
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-05-13
Filing date: 2021-05-13
Publication date: 2023-02-07
Anticipated expiration: 2041-05-13
Also published as: CN113223620A

Abstract

The invention provides a protein solubility prediction method based on multi-dimensional sequence embedding, comprising the following steps: (1) acquiring a set of protein amino acid sequences; (2) performing enhanced representation of the amino acid sequence of each protein; (3) calculating structural information for the amino acid sequence of each protein; (4) acquiring a training sample set and a test sample set; (5) constructing a protein solubility prediction model H based on multi-dimensional sequence embedding; (6) iteratively training the protein solubility prediction model H; (7) obtaining the protein solubility prediction result. During model training and prediction, each amino acid sequence is given an enhanced representation and supplemented with structural information, and multi-dimensional sequence embedding is performed. This increases the amount of information available to the model and improves the accuracy of protein solubility prediction. The method can be used to screen amino acid sequences for protein synthesis.

Description

Protein solubility prediction method based on multi-dimensional sequence embedding

Technical Field

The invention belongs to the technical field of bioinformatics and relates to a protein solubility prediction method, in particular to a neural-network-based deep learning method that predicts protein solubility using multi-dimensional sequence embedding. The method can be used to screen for protein amino acid sequences that are soluble in a protein expression system, providing a reference for protein synthesis.

Background

The development of genetic engineering and cloning techniques has enabled research and industry to synthesize and isolate proteins on a large scale in protein expression systems. Commonly used protein expression systems include E. coli, yeast, insect cell and mammalian cell expression systems. However, many proteins are insoluble when expressed heterologously in these systems, which yields proteins without biological activity, so the efficient production of active, soluble proteins remains a major challenge.

The solubility of a protein under given experimental conditions is a trait ultimately determined by its sequence. By studying the amino acid sequence patterns of insoluble and soluble proteins and developing computational methods for predicting protein solubility, experimental work can be concentrated on soluble proteins, improving the efficiency of large-scale screening.

Protein solubility prediction refers to predicting, by mining patterns in existing protein solubility data, whether a protein with a given amino acid sequence will be soluble after synthesis. Existing protein solubility prediction methods fall into two main categories: methods based on traditional machine learning with feature engineering, and methods based on deep learning with neural networks. Traditional machine learning methods extract a series of statistical features from the amino acid sequence of a protein and train a machine learning classifier to obtain the final protein solubility prediction model. Such methods depend on extensive feature engineering and on experience in feature selection; the amount of information ultimately extracted is limited and cannot characterize the protein comprehensively, which lowers the upper limit of their prediction accuracy. Deep learning methods use a neural network to learn feature representations automatically from the amino acid sequence of a protein and predict protein solubility end to end. Such methods typically use convolutional neural networks to extract features from the amino acid sequence. For example, Khurana et al., in "DeepSol: a deep learning framework for sequence-based protein solubility prediction", published in Bioinformatics in 2018, disclose the protein solubility prediction method DeepSol. DeepSol learns feature representations only from the amino acid sequence of a single protein using a convolutional neural network. Although the method extracts features automatically from the amino acid sequence, the amount of information the sequence alone provides is limited, and information is lost during the convolution and pooling operations, which limits any improvement in prediction accuracy.

Disclosure of Invention

The object of the present invention is to overcome the above deficiencies of the prior art by providing a protein solubility prediction method based on multi-dimensional sequence embedding. The method increases the amount of information by performing enhanced representation of protein amino acid sequences and supplementing them with structural information, and it obtains vector representations of proteins from multiple dimensions and integrates them to improve prediction accuracy.

To achieve this object, the technical solution adopted by the invention comprises the following steps:

(1) Obtaining the amino acid sequence set of the protein:

downloading, from a protein solubility dataset, the amino acid sequences of M proteins $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(m)}, \ldots, X^{(M)}\}$ and their corresponding solubility labels $y = \{y^{(1)}, y^{(2)}, \ldots, y^{(m)}, \ldots, y^{(M)}\}$, wherein $M \geq 10000$;

$X^{(m)}$ denotes the amino acid sequence of the m-th protein, which has length $L_m$, is composed of the 20 amino acids and is represented as a vector;

$x_l^{(m)}$ denotes the amino acid at position $l$ in the amino acid sequence of the m-th protein; $y^{(m)}$ denotes the solubility label of $X^{(m)}$, where $y^{(m)} = 0$ means that $X^{(m)}$ is insoluble and $y^{(m)} = 1$ means that $X^{(m)}$ is soluble;

(2) Performing enhanced representation of the amino acid sequence $X^{(m)}$ of each protein:

proceeding from front to back through each protein amino acid sequence $X^{(m)}$, combining the amino acids at every two adjacent positions to obtain the set of binary combination sequences $B = \{B^{(1)}, B^{(2)}, \ldots, B^{(m)}, \ldots, B^{(M)}\}$, and combining the amino acids at every three adjacent positions to obtain the set of ternary combination sequences $T = \{T^{(1)}, T^{(2)}, \ldots, T^{(m)}, \ldots, T^{(M)}\}$, wherein:

$B^{(m)}$ denotes the binary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 1$ and is composed of the 400 possible binary amino acid combinations;

$T^{(m)}$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 2$ and is composed of the 8000 possible ternary amino acid combinations;

(3) Calculating the structural information of the amino acid sequence $X^{(m)}$ of each protein:

(3a) Using the ACCPro5 software package, calculating the relative solvent accessibility of the amino acid sequence $X^{(m)}$ of each protein at a 25% threshold and over the 0-95% threshold interval, to obtain the set of relative solvent accessibility sequences with 2 accessibility classes at the 25% threshold, $RSA2 = \{RSA2^{(1)}, \ldots, RSA2^{(m)}, \ldots, RSA2^{(M)}\}$, and the set of relative solvent accessibility sequences with 20 accessibility classes over the 0-95% threshold interval, $RSA20 = \{RSA20^{(1)}, \ldots, RSA20^{(m)}, \ldots, RSA20^{(M)}\}$,

wherein each element $r_l^{(m)}$ of $RSA2^{(m)}$ satisfies $r_l^{(m)} \in \{e, -\}$; e indicates that the amino acid at position $l$ can contact the solvent, and - indicates that it cannot;

(3b) Using the SSpro5 software package, calculating the three-state secondary structure sequence and the eight-state secondary structure sequence of the amino acid sequence $X^{(m)}$ of each protein, to obtain the set of three-state secondary structure sequences corresponding to X, with 3 secondary structure classes, $SS3 = \{SS3^{(1)}, \ldots, SS3^{(m)}, \ldots, SS3^{(M)}\}$, and the set of eight-state secondary structure sequences, with 8 secondary structure classes, $SS8 = \{SS8^{(1)}, \ldots, SS8^{(m)}, \ldots, SS8^{(M)}\}$;

(4) Acquiring a training sample set and a testing sample set:

(4a) Initializing to L, with L = 1200, the lengths of the amino acid sequence $X^{(m)}$ of each protein and of its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ (with 2 and 20 accessibility classes), and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$; during initialization, a sequence shorter than L is padded with 0, and the part of a sequence exceeding L is deleted;

(4b) Combining all the length-initialized sequences into the set of multidimensional sequence representation samples of the proteins, $D = \{D^{(1)}, D^{(2)}, \ldots, D^{(m)}, \ldots, D^{(M)}\}$; taking N multidimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the training sample set Train, and taking the remaining S multidimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the test sample set Test,

wherein $D^{(m)}$ denotes a multidimensional sequence representation sample of 7 dimensions in total, comprising the length-L protein amino acid sequence $X'^{(m)}$, its corresponding binary combination sequence $B'^{(m)}$ and ternary combination sequence $T'^{(m)}$, the relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 accessibility classes, and the three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$: $D^{(m)} = (X'^{(m)}, B'^{(m)}, T'^{(m)}, RSA2'^{(m)}, RSA20'^{(m)}, SS3'^{(m)}, SS8'^{(m)})$,

$S = M - N$;

(5) Constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:

constructing a protein solubility prediction model H comprising 7 parallel embedding layers, which realize the multi-dimensional sequence embedding, and a prediction layer, wherein a convolution pooling module is connected between each embedding layer and the prediction layer; each convolution pooling module comprises a one-dimensional convolution layer, a global max pooling layer and a concat layer stacked in sequence; the prediction layer comprises a plurality of fully connected layers and a sigmoid layer stacked in sequence;

(6) Iteratively training the protein solubility prediction model H:

(6a) Randomly initializing all parameters of the embedding layers, the convolution pooling modules and the prediction layer; denoting the iteration counter by c and the maximum number of iterations by C, with $C \geq 1$, and setting c = 0;

(6b) Taking the training sample set Train as the input of the protein solubility prediction model H: the 7 embedding layers respectively embed the 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$ and $SS8'^{(n)}$ of each training sample; the 7 convolution pooling modules extract features from the corresponding embedding results; and the prediction layer uses the features extracted by the 7 convolution pooling modules to predict the probability that the protein amino acid sequence $X'^{(n)}$ in each training sample is soluble, yielding the set of soluble probabilities $p_{train}$ corresponding to Train;

(6c) Using the cross-entropy loss function, calculating the cross-entropy loss value Loss between $p_{train}$ and the solubility labels $y_{train}$, and updating all parameters of the embedding layers, the convolution pooling modules and the prediction layer from Loss by gradient descent;

(6d) Judging whether $c \geq C$ holds; if so, the trained multidimensional-sequence-embedding convolutional neural network model H' is obtained; otherwise, letting c = c + 1 and returning to step (6b);

(7) Obtaining the protein solubility prediction result:

taking the test sample set Test as the input of the trained multidimensional-sequence-embedding convolutional neural network model H' and performing forward propagation to obtain the set of probabilities that the S samples are predicted to be soluble; depending on the value of its predicted probability, the s-th sample in the test sample set is predicted to be either soluble or insoluble.

Compared with the prior art, the invention has the following advantages:

1. The protein solubility prediction model constructed by the invention comprises 7 embedding layers, each connected to a convolution pooling module. During model training and prediction, the 7 embedding layers and their convolution pooling modules embed the 7 sequences separately and learn different feature representations of a protein from sequences of multiple dimensions; these representations complement one another and jointly characterize the protein.

2. In acquiring the training and test sample sets, the invention performs enhanced representation of the amino acid sequence of each protein, which compensates for the information lost in the convolution and pooling operations when feature representations are learned from the amino acid sequence alone. In addition, the four kinds of structural information for each amino acid sequence contained in the training and test sample sets increase the amount of information available during model training and prediction.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention.

Detailed Description

The invention is described in further detail below with reference to FIG. 1 and a specific example.

Referring to FIG. 1, the invention comprises the following steps:

Step 1) Obtaining the set of protein amino acid sequences:

In this example, the amino acid sequences of M proteins $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(m)}, \ldots, X^{(M)}\}$ and their corresponding solubility labels $y = \{y^{(1)}, y^{(2)}, \ldots, y^{(m)}, \ldots, y^{(M)}\}$ are downloaded from the DeepSol protein solubility dataset for the E. coli expression system, where M = 71421;

$X^{(m)}$ denotes the amino acid sequence of the m-th protein, which has length $L_m$, is composed of the 20 amino acids and is represented as a vector;

$x_l^{(m)}$ denotes the amino acid at position $l$ in the amino acid sequence of the m-th protein; $y^{(m)}$ denotes the solubility label of $X^{(m)}$, where $y^{(m)} = 0$ means that $X^{(m)}$ is insoluble and $y^{(m)} = 1$ means that $X^{(m)}$ is soluble;

Step 2) Performing enhanced representation of the amino acid sequence $X^{(m)}$ of each protein:

Proceeding from front to back through each protein amino acid sequence $X^{(m)}$, the amino acids at every two adjacent positions are combined to obtain the set of binary combination sequences $B = \{B^{(1)}, B^{(2)}, \ldots, B^{(m)}, \ldots, B^{(M)}\}$, and the amino acids at every three adjacent positions are combined to obtain the set of ternary combination sequences $T = \{T^{(1)}, T^{(2)}, \ldots, T^{(m)}, \ldots, T^{(M)}\}$, wherein:

$B^{(m)}$ denotes the binary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 1$ and is composed of the 400 possible binary amino acid combinations;

$T^{(m)}$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 2$ and is composed of the 8000 possible ternary amino acid combinations.

This multi-element enhanced representation of the protein amino acid sequence lets the model learn more complex patterns in the sequence and compensates for the information lost in the subsequent convolution and pooling operations, improving the accuracy of protein solubility prediction.
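The enhanced representation of step 2) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names `combine` and `build_vocabulary`, the ordering of the 20-letter alphabet, and the choice to reserve index 0 for padding are all assumptions introduced here.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (this ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def combine(sequence, k):
    """Return the overlapping k-position combinations of a sequence, front to back."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_vocabulary(k):
    """Map every possible k-combination to an integer index, keeping 0 free for padding."""
    return {"".join(p): i + 1 for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}

x = "MKTAYIAK"                       # toy amino acid sequence of length 8
b = combine(x, 2)                    # binary combination sequence, length 8 - 1
t = combine(x, 3)                    # ternary combination sequence, length 8 - 2
binary_vocab = build_vocabulary(2)   # 20 ** 2 = 400 entries
ternary_vocab = build_vocabulary(3)  # 20 ** 3 = 8000 entries
b_indices = [binary_vocab[c] for c in b]
t_indices = [ternary_vocab[c] for c in t]
```

Integer indices of this kind are what an embedding layer would typically consume; the vocabulary sizes 400 and 8000 match the counts stated in step 2).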

Step 3) Calculating the structural information of the amino acid sequence $X^{(m)}$ of each protein:

Step 3a) Using the ACCPro5 software package, the relative solvent accessibility of the amino acid sequence $X^{(m)}$ of each protein is calculated at a 25% threshold and over the 0-95% threshold interval, to obtain the set of relative solvent accessibility sequences with 2 accessibility classes at the 25% threshold, $RSA2 = \{RSA2^{(1)}, \ldots, RSA2^{(m)}, \ldots, RSA2^{(M)}\}$, and the set of relative solvent accessibility sequences with 20 accessibility classes over the 0-95% threshold interval, $RSA20 = \{RSA20^{(1)}, \ldots, RSA20^{(m)}, \ldots, RSA20^{(M)}\}$,

wherein each element $r_l^{(m)}$ of $RSA2^{(m)}$ satisfies $r_l^{(m)} \in \{e, -\}$; e indicates that the amino acid at position $l$ can contact the solvent, and - indicates that it cannot;

in $RSA20^{(m)}$, a larger value indicates a higher likelihood that the amino acid at that position is accessible to the solvent. The relative solvent accessibility sequences reflect the structure of the protein amino acid sequence and thereby enlarge the information content of the dataset;

Step 3b) Using the SSpro5 software package, the three-state secondary structure sequence and the eight-state secondary structure sequence of the amino acid sequence $X^{(m)}$ of each protein are calculated, to obtain the set of three-state secondary structure sequences corresponding to X, with 3 secondary structure classes, $SS3 = \{SS3^{(1)}, \ldots, SS3^{(m)}, \ldots, SS3^{(M)}\}$, and the set of eight-state secondary structure sequences, with 8 secondary structure classes, $SS8 = \{SS8^{(1)}, \ldots, SS8^{(m)}, \ldots, SS8^{(M)}\}$;

the three-state and eight-state secondary structures represent the secondary structure information of the protein at different granularities: the three-state secondary structure comprises the alpha helix, the beta strand and the coil, while the eight-state secondary structure further subdivides the alpha helix, beta strand and coil into eight classes;

the three-state and eight-state secondary structure sequences provide structural information during model training and prediction, which increases the amount of information and improves the accuracy of protein solubility prediction.
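Before the structural sequences can be embedded, their symbols have to be turned into integer indices. The sketch below shows one way to do this. The alphabets are assumptions: `HEC` for the three states, the DSSP-style letters `HGIEBTSC` for the eight states, and `e`/`-` for the two accessibility classes. The symbols actually emitted by SSpro5 and ACCPro5 should be checked against the tools' documentation.

```python
# Hypothetical symbol alphabets; verify against the actual SSpro5 / ACCPro5 output.
SS3_ALPHABET = "HEC"        # helix, strand, coil
SS8_ALPHABET = "HGIEBTSC"   # DSSP-style eight-state codes
RSA2_ALPHABET = "e-"        # can contact the solvent / cannot contact the solvent

def encode(symbols, alphabet):
    """Map each symbol to a 1-based integer index (0 is kept free for padding)."""
    index = {s: i + 1 for i, s in enumerate(alphabet)}
    return [index[s] for s in symbols]

ss3 = encode("CHHHEEC", SS3_ALPHABET)    # -> [3, 1, 1, 1, 2, 2, 3]
ss8 = encode("CHHGEET", SS8_ALPHABET)    # -> [8, 1, 1, 2, 4, 4, 6]
rsa2 = encode("e--ee-e", RSA2_ALPHABET)  # -> [1, 2, 2, 1, 1, 2, 1]
```

Each structural sequence has the same length as the amino acid sequence it was predicted from, so the encoded lists line up position by position with the original sequence.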

Step 4) Acquiring a training sample set and a test sample set:

Step 4a) The lengths of the amino acid sequence $X^{(m)}$ of each protein and of its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ (with 2 and 20 accessibility classes), and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$ are initialized to L, with L = 1200; during initialization, a sequence shorter than L is padded with 0, and the part of a sequence exceeding L is deleted;

The lengths are initialized to L because a neural-network-based deep learning model requires inputs of the same shape, whereas the amino acid sequences of different proteins generally differ in length and cannot meet this requirement. L is set to 1200 because most protein amino acid sequences in the dataset are no longer than 1200, so this length keeps the sequences largely intact. Because the model uses global max pooling, padding with 0 does not affect the training and prediction of the model.
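The length initialization of step 4a) amounts to padding or truncating each integer-encoded sequence. A minimal sketch follows; the function name `fix_length` is hypothetical, and padding on the right is an assumption, since the text does not say on which side the zeros are added.

```python
def fix_length(indices, length=1200, pad_value=0):
    """Pad on the right with pad_value, or truncate, so the result has exactly `length` items."""
    if len(indices) >= length:
        return list(indices[:length])
    return list(indices) + [pad_value] * (length - len(indices))

short = fix_length([5, 3, 9], length=6)            # -> [5, 3, 9, 0, 0, 0]
long_ = fix_length(list(range(1, 11)), length=6)   # -> [1, 2, 3, 4, 5, 6]
default = fix_length([7] * 1500)                   # truncated to the default L = 1200
```

Reserving 0 for padding only works if no real symbol is encoded as 0, which is why an encoding scheme would normally start its indices at 1.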

Step 4b) All the length-initialized sequences are combined into the set of multidimensional sequence representation samples of the proteins, $D = \{D^{(1)}, D^{(2)}, \ldots, D^{(m)}, \ldots, D^{(M)}\}$; N multidimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, are taken as the training sample set Train, and the remaining S multidimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, are taken as the test sample set Test,

wherein $D^{(m)}$ denotes a multidimensional sequence representation sample of 7 dimensions in total, comprising the length-L protein amino acid sequence $X'^{(m)}$, its corresponding binary combination sequence $B'^{(m)}$ and ternary combination sequence $T'^{(m)}$, the relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 accessibility classes, and the three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$: $D^{(m)} = (X'^{(m)}, B'^{(m)}, T'^{(m)}, RSA2'^{(m)}, RSA20'^{(m)}, SS3'^{(m)}, SS8'^{(m)})$, with N = 69420 and S = 2001;
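The sample structure of step 4b) can be pictured as a 7-field record per protein. The sketch below is illustrative only: the `Sample` type and the `split` helper are hypothetical names, and taking the first N samples for training is an assumption, since the text does not state how the N training samples are chosen.

```python
from collections import namedtuple

# One multidimensional sequence representation sample: seven equal-length sequences.
Sample = namedtuple("Sample", ["x", "b", "t", "rsa2", "rsa20", "ss3", "ss8"])

def split(samples, labels, n_train):
    """Pair samples with labels; use the first n_train for training, the rest for testing."""
    train = list(zip(samples[:n_train], labels[:n_train]))
    test = list(zip(samples[n_train:], labels[n_train:]))
    return train, test

M, N = 71421, 69420
S = M - N   # number of test samples in this example
```

With the figures of this example, M - N gives the 2001 test samples stated above.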

Step 5) constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:

A protein solubility prediction model H is constructed comprising 7 parallel embedding layers, which realize the multi-dimensional sequence embedding, and a prediction layer, wherein each embedding layer embeds one dimension of the multidimensional sequence representation sample set, and a convolution pooling module is connected between each embedding layer and the prediction layer; each convolution pooling module comprises a one-dimensional convolution layer, a global max pooling layer and a concat layer stacked in sequence; the prediction layer comprises a plurality of fully connected layers and a sigmoid layer stacked in sequence;

the 7 embedding layers and their corresponding convolution pooling modules extract features of the protein from the sequences of 7 dimensions, characterizing the protein from different angles; the prediction layer fuses the features extracted from the 7 dimensions to predict the solubility of the protein comprehensively, which improves the accuracy of protein solubility prediction;

the embedding dimensions of the 7 embedding layers are set to 64,5,32,5 and 10, respectively;

the structure of the convolution pooling module is: one-dimensional convolution layer → pooling layer → concat layer, wherein the one-dimensional convolution layer consists of K one-dimensional convolution units, expressed as a set of two-tuples $\{(k_j, q_j)\}_{j=1}^{K}$, where $k_j$ denotes the convolution kernel size and $q_j$ the number of convolution kernels of the j-th unit;

parameter settings of the convolution pooling module:

the K one-dimensional convolution units of the one-dimensional convolution layer are set to {(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)};

the pooling mode of the pooling layer is set to global max pooling;
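The convolution pooling module can be sketched in plain Python to show how the seven (kernel size, kernel count) units lead to a fixed-length feature vector. This is a simplified illustration under stated assumptions: kernels are random rather than trained, there is no bias or activation, and the convolution uses no padding; none of these details is specified in the text, and the function names are hypothetical.

```python
import random

# (kernel size, number of kernels) for each one-dimensional convolution unit.
KERNEL_UNITS = [(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)]

def conv1d_global_max(embedded, kernel):
    """Slide one kernel along the sequence and keep only the maximum response."""
    k = len(kernel)
    best = float("-inf")
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        response = sum(w * v for w_row, v_row in zip(kernel, window)
                       for w, v in zip(w_row, v_row))
        best = max(best, response)
    return best

def conv_pool_module(embedded, units=KERNEL_UNITS, seed=0):
    """Apply every convolution unit, pool each kernel globally, and concatenate."""
    rng = random.Random(seed)
    dim = len(embedded[0])
    features = []
    for size, count in units:
        for _ in range(count):
            kernel = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(size)]
            features.append(conv1d_global_max(embedded, kernel))
    return features

# Toy input: a length-20 sequence already embedded into 4 dimensions.
rng = random.Random(1)
embedded = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(20)]
features = conv_pool_module(embedded)   # 7 units x 32 kernels = 224 features
```

Because global max pooling keeps one value per kernel, the output length depends only on the number of kernels and not on the sequence length.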

the structure of the prediction layer is: first fully connected layer → second fully connected layer → sigmoid layer;

parameter settings of the prediction layer:

the number of neurons in the first fully connected layer is set to 128, and its activation function is the ReLU function;

the number of neurons in the second fully connected layer is set to 64, and its activation function is the ReLU function;

the number of neurons in the sigmoid layer is set to 1, and its activation function is the sigmoid function;

wherein $\mathrm{ReLU}(x) = \max(x, 0)$ and $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$;
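The forward pass of the prediction layer can be sketched as below. Weights are random and untrained, so the output is not meaningful as a prediction. The input size of 7 x 7 x 32 = 1568 is an assumption: it presumes the features of the 7 branches are concatenated, each branch contributing 7 kernel sizes times 32 kernels. All function names are hypothetical.

```python
import math
import random

def relu(x):
    return max(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """Fully connected layer: one activated weighted sum per weight row."""
    return [activation(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Randomly initialised weights and zero biases (illustrative only)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def predict(features, layers):
    """First layer (128, ReLU) -> second layer (64, ReLU) -> sigmoid layer (1)."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    hidden1 = dense(features, w1, b1, relu)
    hidden2 = dense(hidden1, w2, b2, relu)
    return dense(hidden2, w3, b3, sigmoid)[0]

rng = random.Random(0)
n_features = 7 * 7 * 32   # assumed concatenation of all branch features
layers = [random_layer(n_features, 128, rng),
          random_layer(128, 64, rng),
          random_layer(64, 1, rng)]
probability = predict([rng.uniform(0, 1) for _ in range(n_features)], layers)
```

The single sigmoid neuron maps the 64 hidden activations to a value strictly between 0 and 1, which is read as the probability that the protein is soluble.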

Step 6) Iteratively training the protein solubility prediction model H:

Step 6a) All parameters of the embedding layers, the convolution pooling modules and the prediction layer are randomly initialized; the iteration counter is denoted by c and the maximum number of iterations by C, with $C \geq 1$, and c = 0 and C = 3 are set;

Step 6b) The training sample set Train is taken as the input of the protein solubility prediction model H: the 7 embedding layers respectively embed the 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$ and $SS8'^{(n)}$ of each training sample; the 7 convolution pooling modules extract features from the corresponding embedding results; and the prediction layer uses the features extracted by the 7 convolution pooling modules to predict the probability that the protein amino acid sequence $X'^{(n)}$ in each training sample is soluble, yielding the set of soluble probabilities $p_{train}$ corresponding to Train;

Step 6c) Using the cross-entropy loss function, the cross-entropy loss value Loss between $p_{train}$ and the solubility labels $y_{train}$ is calculated, and all parameters of the embedding layers, the convolution pooling modules and the prediction layer are updated from Loss by gradient descent;
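The loss formula itself is not reproduced in this text. Assuming the standard binary cross-entropy averaged over the N training samples (an assumption; the normalization used in the original formula may differ), with $p^{(n)}$ the predicted soluble probability and $y^{(n)}$ the solubility label of the n-th training sample, it would read:

```latex
\mathrm{Loss} = -\frac{1}{N} \sum_{n=1}^{N}
  \left[ y^{(n)} \log p^{(n)} + \left(1 - y^{(n)}\right) \log\left(1 - p^{(n)}\right) \right]
```

For a soluble sample ($y^{(n)} = 1$) only the first term contributes, so the loss falls as $p^{(n)}$ approaches 1; for an insoluble sample only the second term contributes.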

Step 6d) Whether $c \geq C$ holds is judged; if so, the trained multidimensional-sequence-embedding convolutional neural network model H' is obtained; otherwise, c = c + 1 is set and step 6b) is executed again;
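The control flow of steps 6a) to 6d) can be sketched with a deliberately small stand-in model. The sketch below uses logistic regression in place of the 7-branch network (a substitution made here purely for brevity), full-batch gradient descent, and a hypothetical learning rate of 0.5; only the loop structure mirrors the steps above. Read literally, starting from c = 0 and stopping once c ≥ C with C = 3 gives four passes over the training set.

```python
import math
import random

def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / len(labels)

def forward(weights, bias, rows):
    """Stand-in model: logistic regression instead of the 7-branch network."""
    return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(weights, r)) + bias)))
            for r in rows]

def train(rows, labels, max_iterations=3, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in rows[0]]   # step 6a: random initialisation
    bias = 0.0
    c, losses = 0, []
    while True:
        probs = forward(weights, bias, rows)              # step 6b: predict p_train
        losses.append(cross_entropy(probs, labels))       # step 6c: loss value
        errors = [p - y for p, y in zip(probs, labels)]   # gradient of the loss w.r.t. logits
        for j in range(len(weights)):
            grad = sum(e * r[j] for e, r in zip(errors, rows)) / len(rows)
            weights[j] -= learning_rate * grad
        bias -= learning_rate * sum(errors) / len(rows)
        if c >= max_iterations:                           # step 6d: stop once c >= C
            return weights, bias, losses
        c += 1

rows = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
labels = [1, 0, 1, 0]
weights, bias, losses = train(rows, labels)
```

The recorded `losses` list has one entry per pass, which makes it easy to check that the loss decreases from the first pass to the last.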

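Step 7) Obtaining the protein solubility prediction result:

The test sample set Test is taken as the input of the trained multidimensional-sequence-embedding convolutional neural network model H', and forward propagation is performed to obtain the set of probabilities that the S samples are predicted to be soluble; depending on the value of its predicted probability, the s-th sample in the test sample set is predicted to be either soluble or insoluble.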
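Turning the predicted probabilities into soluble/insoluble labels requires a decision threshold, which is not reproduced in this text. The sketch below assumes a threshold of 0.5, the conventional choice for a sigmoid output; both the value and the function name `classify` are assumptions.

```python
def classify(probabilities, threshold=0.5):
    """Label a sample soluble (1) if its probability reaches the threshold, else insoluble (0)."""
    return [1 if p >= threshold else 0 for p in probabilities]

p_test = [0.91, 0.12, 0.50, 0.49]
predictions = classify(p_test)   # -> [1, 0, 1, 0]
```

Whether a probability exactly equal to the threshold counts as soluble is also a convention chosen here, not something the text specifies.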

The technical effects of the invention are further illustrated by the following simulation experiments:

1. Simulation conditions and contents:

The simulation experiments were run on a Red Hat 4.8.5-11 platform with an Intel(R) Xeon(R) Gold 5115 CPU (20 cores, 2.40 GHz), 48 GB of memory and a Tesla P40 graphics card, using Python 3.6.2 with tensorflow-gpu 1.12 and Keras 2.2.4. The dataset used was the DeepSol protein solubility dataset for the E. coli expression system.

The prediction accuracy of the invention was compared by simulation with that of the existing protein solubility prediction method DeepSol; the results are shown in Table 1.

2. Analysis of simulation results:

The evaluation metrics used for protein solubility prediction are Accuracy and AUC.

(1) Accuracy = (TP + TN)/(TP + FN + FP + TN), where TP is the number of samples that are actually positive and are correctly predicted as positive, TN is the number of samples that are actually negative and are correctly predicted as negative, FP is the number of samples that are actually negative but are incorrectly predicted as positive, and FN is the number of samples that are actually positive but are incorrectly predicted as negative; positive means soluble and negative means insoluble.

(2) AUC (area under the curve) is the area under the ROC (receiver operating characteristic) curve, whose abscissa is the false positive rate (FPR) and whose ordinate is the true positive rate (TPR), with FPR = FP/(TN + FP) and TPR = TP/(TP + FN).
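The two metrics can be computed as in the following sketch. The toy labels and scores are made up for illustration, the 0.5 threshold is an assumption, and the AUC is computed by pairwise ranking (the fraction of positive/negative pairs in which the positive sample scores higher, ties counting one half), which equals the area under the ROC curve.

```python
def confusion_counts(labels, predictions):
    """Count TP, FN, FP, TN with soluble = 1 (positive) and insoluble = 0 (negative)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking; ties count as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold
tp, fn, fp, tn = confusion_counts(labels, predictions)
acc = accuracy(tp, fn, fp, tn)       # (2 + 2) / 6
tpr = tp / (tp + fn)                 # 2 / 3
fpr = fp / (tn + fp)                 # 1 / 3
roc_auc = auc(labels, scores)        # 7 of 9 pairs ranked correctly
```

Unlike Accuracy, the AUC does not depend on the decision threshold, since it is computed from the ranking of the scores alone.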

Table 1 shows the results of comparing Accuracy and AUC values on the pasrenip dataset for the present invention and the prior art.

TABLE 1

Method	Accuracy	AUC
Prior Art	0.77	0.86
The invention	0.79	0.87

As can be seen from Table 1, both the Accuracy and the AUC of the invention are higher than those of the prior art (0.79 versus 0.77 and 0.87 versus 0.86, respectively), showing that the invention improves the accuracy of protein solubility prediction.

The foregoing description is only an example of the present invention and should not be construed as limiting the invention in any way, and it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the principles and arrangements of the invention, but such changes and modifications are within the scope of the invention as defined by the appended claims.

Claims

1. A protein solubility prediction method based on multidimensional sequence embedding is characterized by comprising the following steps:

(1) Obtaining the amino acid sequence set of the protein:

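downloading, from a protein solubility dataset, the amino acid sequences of M proteins $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(m)}, \ldots, X^{(M)}\}$ and their corresponding solubility labels $y = \{y^{(1)}, y^{(2)}, \ldots, y^{(m)}, \ldots, y^{(M)}\}$, wherein $M \geq 10000$;

$X^{(m)}$ denotes the amino acid sequence of the m-th protein, which has length $L_m$, is composed of the 20 amino acids and is represented as a vector;

$x_l^{(m)}$ denotes the amino acid at position $l$ in the amino acid sequence of the m-th protein; $y^{(m)}$ denotes the solubility label of $X^{(m)}$, where $y^{(m)} = 0$ means that $X^{(m)}$ is insoluble and $y^{(m)} = 1$ means that $X^{(m)}$ is soluble;

(2) Performing enhanced representation of the amino acid sequence $X^{(m)}$ of each protein:

proceeding from front to back through each protein amino acid sequence $X^{(m)}$, combining the amino acids at every two adjacent positions to obtain the set of binary combination sequences $B = \{B^{(1)}, B^{(2)}, \ldots, B^{(m)}, \ldots, B^{(M)}\}$, and combining the amino acids at every three adjacent positions to obtain the set of ternary combination sequences $T = \{T^{(1)}, T^{(2)}, \ldots, T^{(m)}, \ldots, T^{(M)}\}$, wherein:

$B^{(m)}$ denotes the binary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 1$ and is composed of the 400 possible binary amino acid combinations;

$T^{(m)}$ denotes the ternary combination sequence corresponding to $X^{(m)}$, which has length $L_m - 2$ and is composed of the 8000 possible ternary amino acid combinations;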

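(3) Calculating the structural information of the amino acid sequence $X^{(m)}$ of each protein:

(3a) Using the ACCPro5 software package, calculating the relative solvent accessibility of the amino acid sequence $X^{(m)}$ of each protein at a 25% threshold and over the 0-95% threshold interval, to obtain the set of relative solvent accessibility sequences with 2 accessibility classes at the 25% threshold, $RSA2 = \{RSA2^{(1)}, \ldots, RSA2^{(m)}, \ldots, RSA2^{(M)}\}$, and the set of relative solvent accessibility sequences with 20 accessibility classes over the 0-95% threshold interval, $RSA20 = \{RSA20^{(1)}, \ldots, RSA20^{(m)}, \ldots, RSA20^{(M)}\}$,

wherein each element $r_l^{(m)}$ of $RSA2^{(m)}$ satisfies $r_l^{(m)} \in \{e, -\}$; e indicates that the amino acid at position $l$ can contact the solvent, and - indicates that it cannot;

(3b) Using the SSpro5 software package, calculating the three-state secondary structure sequence and the eight-state secondary structure sequence of the amino acid sequence $X^{(m)}$ of each protein, to obtain the set of three-state secondary structure sequences corresponding to X, with 3 secondary structure classes, $SS3 = \{SS3^{(1)}, \ldots, SS3^{(m)}, \ldots, SS3^{(M)}\}$, and the set of eight-state secondary structure sequences, with 8 secondary structure classes, $SS8 = \{SS8^{(1)}, \ldots, SS8^{(m)}, \ldots, SS8^{(M)}\}$;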

(4) Acquiring a training sample set and a testing sample set:

(4a) Initializing to L, with L = 1200, the lengths of the amino acid sequence $X^{(m)}$ of each protein and of its corresponding binary combination sequence $B^{(m)}$, ternary combination sequence $T^{(m)}$, relative solvent accessibility sequences $RSA2^{(m)}$ and $RSA20^{(m)}$ (with 2 and 20 accessibility classes), and three-state and eight-state secondary structure sequences $SS3^{(m)}$ and $SS8^{(m)}$; during initialization, a sequence shorter than L is padded with 0, and the part of a sequence exceeding L is deleted;

(4b) Combining all the length-initialized sequences into the set of multidimensional sequence representation samples of the proteins, $D = \{D^{(1)}, D^{(2)}, \ldots, D^{(m)}, \ldots, D^{(M)}\}$; taking N multidimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the training sample set Train, and taking the remaining S multidimensional sequence representation samples, together with the solubility labels of the amino acid sequences they contain, as the test sample set Test,

wherein $D^{(m)}$ denotes a multidimensional sequence representation sample of 7 dimensions in total, comprising the length-L protein amino acid sequence $X'^{(m)}$, its corresponding binary combination sequence $B'^{(m)}$ and ternary combination sequence $T'^{(m)}$, the relative solvent accessibility sequences $RSA2'^{(m)}$ and $RSA20'^{(m)}$ with 2 and 20 accessibility classes, and the three-state and eight-state secondary structure sequences $SS3'^{(m)}$ and $SS8'^{(m)}$: $D^{(m)} = (X'^{(m)}, B'^{(m)}, T'^{(m)}, RSA2'^{(m)}, RSA20'^{(m)}, SS3'^{(m)}, SS8'^{(m)})$,

$S = M - N$;

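(5) Constructing a protein solubility prediction model H based on multi-dimensional sequence embedding:

constructing a protein solubility prediction model H comprising 7 parallel embedding layers, which realize the multi-dimensional sequence embedding, and a prediction layer, wherein a convolution pooling module is connected between each embedding layer and the prediction layer; each convolution pooling module comprises a one-dimensional convolution layer, a global max pooling layer and a concat layer stacked in sequence; the prediction layer comprises a plurality of fully connected layers and a sigmoid layer stacked in sequence;

(6) Iteratively training the protein solubility prediction model H:

(6a) Randomly initializing all parameters of the embedding layers, the convolution pooling modules and the prediction layer; denoting the iteration counter by c and the maximum number of iterations by C, with $C \geq 1$, and setting c = 0;

(6b) Taking the training sample set Train as the input of the protein solubility prediction model H: the 7 embedding layers respectively embed the 7 sequences $X'^{(n)}$, $B'^{(n)}$, $T'^{(n)}$, $RSA2'^{(n)}$, $RSA20'^{(n)}$, $SS3'^{(n)}$ and $SS8'^{(n)}$ of each training sample; the 7 convolution pooling modules extract features from the corresponding embedding results; and the prediction layer uses the features extracted by the 7 convolution pooling modules to predict the probability that the protein amino acid sequence $X'^{(n)}$ in each training sample is soluble, yielding the set of soluble probabilities $p_{train}$ corresponding to Train;

(6c) Using the cross-entropy loss function, calculating the cross-entropy loss value Loss between $p_{train}$ and the solubility labels $y_{train}$, and updating all parameters of the embedding layers, the convolution pooling modules and the prediction layer from Loss by gradient descent;

(6d) Judging whether $c \geq C$ holds; if so, the trained multidimensional-sequence-embedding convolutional neural network model H' is obtained; otherwise, letting c = c + 1 and returning to step (6b);

(7) Obtaining the protein solubility prediction result:

taking the test sample set Test as the input of the trained multidimensional-sequence-embedding convolutional neural network model H' and performing forward propagation to obtain the set of probabilities that the S samples are predicted to be soluble; depending on the value of its predicted probability, the s-th sample in the test sample set is predicted to be either soluble or insoluble.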

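2. The protein solubility prediction method based on multi-dimensional sequence embedding of claim 1, wherein in the protein solubility prediction model H of step (5):

the embedding dimensions of the 7 embedding layers are set to 64,5,32,5 and 10, respectively;

the structure of the convolution pooling module is: one-dimensional convolution layer → pooling layer → concat layer, wherein the one-dimensional convolution layer consists of K one-dimensional convolution units, expressed as a set of two-tuples $\{(k_j, q_j)\}_{j=1}^{K}$, where $k_j$ denotes the convolution kernel size and $q_j$ the number of convolution kernels of the j-th unit;

parameter settings of the convolution pooling module:

the K one-dimensional convolution units of the one-dimensional convolution layer are set to {(3, 32), (5, 32), (7, 32), (9, 32), (11, 32), (13, 32), (15, 32)};

the pooling mode of the pooling layer is set to global max pooling;

the structure of the prediction layer is: first fully connected layer → second fully connected layer → sigmoid layer;

parameter settings of the prediction layer:

the number of neurons in the first fully connected layer is set to 128, and its activation function is the ReLU function;

the number of neurons in the second fully connected layer is set to 64, and its activation function is the ReLU function;

the number of neurons in the sigmoid layer is set to 1, and its activation function is the sigmoid function;

wherein $\mathrm{ReLU}(x) = \max(x, 0)$ and $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$.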