Disclosure of Invention
The invention aims to avoid the defects of the prior art and provides a lysine acetylation site prediction method based on a modular dense convolutional network.
The purpose of the invention is realized by the following technical measures: a lysine acetylation site prediction method based on a modular dense convolutional network is designed, comprising the following steps:
describing lysine acetylation sites from three aspects of protein structural characteristics, protein original sequences and amino acid physicochemical attribute information, and constructing site initial characteristic space;
adopting a modular dense convolutional network to extract high-level features of the protein structural characteristics, protein original sequence, and amino acid physicochemical properties from the initial feature space of the site, while attending to both low-level and high-level features through dense skip connections;
introducing a squeeze-and-excitation (SE) layer to evaluate feature importance, weighting each feature map, and realizing adaptive dynamic fusion of the three types of information;
constructing a lysine acetylation site classifier based on the fusion characteristics and the softmax layer, and predicting potential lysine acetylation sites;
training a lysine acetylation site prediction model based on a modular dense convolutional network;
the proposed model was evaluated by four types of experiments of ten-fold cross validation, independent test, model generalization ability test, and recognition ability to unknown lysine acetylation sites.
Wherein, describing lysine acetylation sites from three aspects of protein structural characteristics, protein original sequences and amino acid physicochemical attribute information, and the step of constructing site initial feature space comprises the following steps:
(1) collecting and preprocessing experimental data of lysine acetylation sites;
(2) converting the collected protein data into numerical vectors by encoding, constructing the site initial feature space, and taking it as the input of the prediction model.
Wherein, the experimental data collection and preprocessing of lysine acetylation sites comprises the following steps:
6078, 3645 and 1860 experimentally validated human, mouse and E. coli lysine acetylated protein entries were collected and downloaded from the Protein Lysine Modification Database (PLMD).
Considering that the SPIDER3 server cannot handle protein sequences containing non-standard amino acids, these sequences are deleted manually. Taking the human data as an example, sequence redundancy is removed with the CD-HIT tool (threshold set to 0.4) to avoid model bias caused by high sequence homology, leaving 4977 acetylated protein sequences. 10% (498) of the 4977 filtered acetylated protein sequences are randomly selected to construct an independent test data set, and the remaining sequences form the training data set, which facilitates comparison with other lysine acetylation site predictors.
Converting the collected protein data into numerical vectors by encoding, constructing the site initial feature space, and using it as the input of the prediction model comprises the following steps:
(1) encoding the protein original sequence information of the site with one-of-21 coding; for a motif of length L, an L × 21-dimensional vector representation of the protein original sequence information is obtained;
(2) encoding the amino acid physicochemical attribute information of the site with Atchley factors, each amino acid residue being represented by 5 Atchley factors; for a motif of length L, an L × 5-dimensional vector representation of the amino acid physicochemical attribute information is obtained;
(3) obtaining protein structural characteristic information with SPIDER3, comprising 8 indices from 3 attributes: secondary structure probabilities (α-helix P(H), β-strand P(E), coil P(C)), local backbone torsion angles (φ, ψ, θ, τ), and accessible surface area (ASA). For a motif of length L, an L × 8-dimensional vector representation of the protein structural characteristic information is obtained.
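For clarity, the following is a minimal Python/NumPy sketch of how the three initial feature matrices described above could be assembled for one site motif. The alphabet ordering, the placeholder Atchley table (real factor values omitted), and the assumption that the SPIDER3 output has already been parsed into an L × 8 array are illustrative choices, not details taken from the invention.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"               # 21 symbols for one-of-21 coding
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
# Placeholder table: the five published Atchley factors per residue would go here.
ATCHLEY = {aa: np.zeros(5) for aa in ALPHABET}

def one_of_21(motif):
    """Encode a length-L motif as an L x 21 one-hot matrix."""
    x = np.zeros((len(motif), 21))
    for i, aa in enumerate(motif):
        x[i, AA_INDEX.get(aa, AA_INDEX["X"])] = 1.0
    return x

def atchley_encode(motif):
    """Encode a length-L motif as an L x 5 matrix of Atchley factors."""
    return np.stack([ATCHLEY.get(aa, ATCHLEY["X"]) for aa in motif])

def build_site_features(motif, spider3_features):
    """Return the three initial feature matrices (L x 21, L x 5, L x 8) for one site."""
    assert spider3_features.shape == (len(motif), 8)
    return one_of_21(motif), atchley_encode(motif), spider3_features
```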
Adopting a modular dense convolutional network to extract the high-level features of the protein structural characteristics, protein original sequence, and amino acid physicochemical properties from the site initial feature space, while attending to both low-level and high-level features through dense skip connections, comprises the following steps:
(1) introducing the design idea of a modular network structure and constructing the structure, sequence, and physicochemical information modules;
(2) extracting the high-level features of each module with stacked dense convolutional blocks, and realizing information complementarity between features of different levels by attending to low-level and high-level features simultaneously through dense skip connections.
Introducing the design idea of a modular network structure and constructing the structure, sequence, and physicochemical information modules comprises the following steps:
three feature extraction submodules, namely a structure module, a sequence module, and a physicochemical module, are constructed based respectively on the protein structural characteristics, the protein original sequence, and the amino acid physicochemical properties; the parameter spaces of the submodules are mutually independent, which effectively avoids crosstalk among the three types of information and improves feature quality.
Extracting the high-level features of each module with stacked dense convolutional blocks, attending to low-level and high-level features simultaneously through dense skip connections, and realizing information complementarity between features of different levels, comprises the following steps:
since the network structures of the structural module, the sequence module and the physicochemical module are the same, only the sequence module is explained here:
(1) first, the sequence module receives as input the one-of-21 code of the site motif of length L, and then generates a low-level profile of the original sequence information of the protein by means of the one-dimensional convolution layer, as shown in formula (1).
X_0 = σ(I * W + b)    (1)
where I is the one-of-21 code vector; W is the weight matrix of the convolution filters, in which S is the filter size (S = 3) and D is the number of filters (D = 96); b is the bias term; σ is the activation function; and X_0 is the output of the one-dimensional convolution layer, with size L × D.
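A minimal Keras-style sketch of the one-dimensional convolution of formula (1) is given below. The motif length, the 'same' padding, and the ReLU activation are assumptions; the filter size S = 3 and filter number D = 96 follow the description above. (The invention states that the implementation uses standalone Keras 2.1.6; this sketch is written against the tf.keras API.)

```python
from tensorflow.keras import Input, layers

L = 31                                            # assumed motif length
seq_input = Input(shape=(L, 21), name="one_of_21")
# Formula (1): X0 = sigma(I * W + b); S = 3, D = 96, 'same' padding keeps length L.
x0 = layers.Conv1D(filters=96, kernel_size=3, padding="same",
                   activation="relu", name="seq_conv0")(seq_input)   # shape (L, 96)
```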
(2) extracting a high-level feature representation of the protein original sequence information with a dense convolutional block; the dense convolution process is shown in formula (2).
X_l = σ([X_0; X_1; ...; X_{l-1}] * W′ + b′)    (2)
where X_{l-1} is the feature map generated by the (l-1)-th convolutional layer in the dense convolutional block, and [·; ·] denotes concatenation along the feature dimension. W′ is the weight matrix, in which D′ is the total number of filters of convolutional layers 1 to l-1 in the dense convolutional block and D″ is the number of filters of the l-th convolutional layer in the block (D″ = 32). b′ is the bias term, σ is the activation function, and X_l is the feature map generated by the l-th convolutional layer in the dense convolutional block. The output of the dense convolutional block is the low-level feature map X_0 concatenated along the feature dimension with the feature maps X_1, X_2, ..., X_l generated by each convolutional layer in the block, i.e. [X_0; X_1; ...; X_l].
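The dense convolution of formula (2) can be sketched as follows; the number of convolutional layers per block and the activation are assumptions of this sketch, while the growth rate D″ = 32 follows the description above.

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth=32, kernel_size=3):
    """Dense convolution of formula (2): each layer sees [X0; X1; ...; X_{l-1}]."""
    feats = [x]                                           # starts with the block input X0
    for _ in range(num_layers):                           # num_layers is an assumption
        concat = layers.Concatenate(axis=-1)(feats)       # concatenation along the feature dimension
        xl = layers.Conv1D(growth, kernel_size, padding="same",
                           activation="relu")(concat)     # X_l, growth rate D'' = 32
        feats.append(xl)
    return layers.Concatenate(axis=-1)(feats)             # block output [X0; X1; ...; Xl]
```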
(3) applying a transition layer to perform a convolution operation and an activation operation on the feature map of the protein original sequence information obtained in step (2); the transition layer process is shown in formula (3).
X = σ([X_0; X_1; ...; X_l] * W″ + b″)    (3)
where W″ is the weight matrix, in which S′ is the filter size (S′ = 1); b″ is the bias term; σ is the activation function; and X is the output of the transition layer. An average pooling operation is then applied to the transition layer output to reduce the dimension of the feature map and lower the risk of model overfitting.
(4) repeating steps (2) and (3) to form stacked dense convolutional blocks. The fourth repetition of step (2) is not followed by step (3); instead, a global average pooling operation is applied.
Through this process, the sequence module extracts the high-level feature X^(seq) of the protein original sequence of the site.
Similarly, the physicochemical and structural modules extract the high-level features of the amino acid physicochemical properties and protein structural characteristics of the site through the same process.
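The following sketch illustrates how one module could stack dense convolutional blocks and transition layers as described in steps (2) to (4). The number of transition-layer filters is an assumption; for simplicity, the global average pooling of step (4) is deferred to the squeeze step of the SE layer sketched later, which is an arrangement assumption of this sketch rather than a detail stated above.

```python
from tensorflow.keras import layers

def transition_layer(x, filters=96):
    """Transition layer of formula (3): width-1 convolution followed by average pooling."""
    x = layers.Conv1D(filters, kernel_size=1, padding="same", activation="relu")(x)
    return layers.AveragePooling1D(pool_size=2)(x)        # shrink the feature map

def feature_module(x0, num_blocks=4):
    """One module (sequence, physicochemical, or structure): stacked dense blocks."""
    x = x0
    for i in range(num_blocks):
        x = dense_block(x)                                # sketched above
        if i < num_blocks - 1:
            x = transition_layer(x)
    # The text applies global average pooling after the last block; in this sketch
    # that pooling is performed by the SE layer's squeeze step (see the later sketch).
    return x
```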
Introducing a squeeze-and-excitation (SE) layer to evaluate feature importance, weighting each feature map, and realizing adaptive dynamic fusion of the three types of information comprises the following steps:
(1) introducing a squeeze-and-excitation (SE) layer to evaluate the importance of the features and weighting each feature map;
(2) adaptively and dynamically fusing the three types of information: protein structural characteristics, protein original sequence, and amino acid physicochemical properties.
Wherein, introducing the squeeze-and-excitation (SE) layer to evaluate feature importance and weighting each feature map comprises the following steps:
the sequence module is taken as an example for explanation:
(1) squeeze: the global spatial information of the high-level feature X^(seq) extracted by the sequence module is compressed into a channel descriptor by global average pooling; the squeeze process is shown in formula (4).
z_c = F_sq(X_c^(seq)) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} X_c^(seq)(i, j)    (4)
where z_c is the channel statistic of X_c^(seq), the c-th feature map of X^(seq); F_sq(·) denotes the squeeze operation; and W and H denote the width and height of the feature map, respectively. After the statistic of every feature map of X^(seq) is computed, the channel descriptor z of X^(seq) is obtained.
(2) excitation: two fully-connected (FC) layers capture the channel dependencies of X^(seq) and learn a specific weight for each feature map of X^(seq); the excitation process is shown in formula (5).
s = F_ex(z, W) = σ(W_2 * δ(W_1 * z))    (5)
where s denotes the specific weights of the feature maps of X^(seq) learned by the excitation operation, and F_ex(·) denotes the excitation operation. δ and σ denote the activation functions of the two fully-connected layers: the former is a ReLU function, whose layer (with parameters W_1) reduces the dimension by a reduction ratio r (r = 16); the latter is a Sigmoid function, whose layer (with parameters W_2) restores the dimension so that the dimension of s equals the number of channels of the feature X^(seq).
(3) scale: X^(seq) is rescaled by the activations s to obtain the output X̃^(seq) of the SE layer, where x̃_c^(seq) is the c-th feature map of X̃^(seq) and is calculated as shown in formula (6).
x̃_c^(seq) = F_scale(X_c^(seq), s_c) = s_c · X_c^(seq)    (6)
where F_scale(·) denotes that each value of the feature map X_c^(seq) is multiplied by the weight s_c.
Similarly, the physicochemical and structural modules also get weighted high-level features through the SE layer.
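A minimal sketch of the SE layer of formulas (4) to (6), applied to a module feature map of shape (L, C), is given below. The reduction ratio r = 16 follows the description; treating the motif length as the spatial extent pooled in formula (4) is an assumption of this sketch.

```python
from tensorflow.keras import layers

def se_layer(feature_map, reduction=16):
    """Squeeze-and-excitation of formulas (4)-(6) on an (L, C) feature map."""
    channels = int(feature_map.shape[-1])
    z = layers.GlobalAveragePooling1D()(feature_map)               # squeeze, formula (4)
    s = layers.Dense(channels // reduction, activation="relu")(z)  # excitation: W1 + ReLU
    s = layers.Dense(channels, activation="sigmoid")(s)            # excitation: W2 + Sigmoid, formula (5)
    s = layers.Reshape((1, channels))(s)                           # broadcast over positions
    return layers.Multiply()([feature_map, s])                     # scale, formula (6)
```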
The self-adaptive dynamic fusion of three types of information of protein structural characteristics, protein original sequences and amino acid physical and chemical properties comprises the following steps:
the SE layer is realized based on global average pooling and two full connection layers (FC), and the network structures of the structure module, the sequence module and the physicochemical module are the same, and the SE layer obtains weighted high-level features. Then, the output of each submodule is connected in series to obtain a fusion characteristic for classification
As the SE layer weights different feature graphs, the feature fusion process has self-adaptive dynamic characteristics.
Wherein, constructing a lysine acetylation site classifier based on the fusion characteristics and the softmax layer, predicting potential lysine acetylation sites, comprises the following steps:
The softmax layer receives the fused high-level feature as input and obtains the prediction class of the sample after a weighted summation and an activation operation; the forward propagation process of the softmax layer is shown in formula (7).
P(y = i | x) = softmax(W · x + b)    (7)
where W is the weight matrix and b is the bias term of the softmax layer. P(y = i | x) denotes the probability that sample x is predicted as class i ∈ {0, 1}, and the class with the highest probability is the prediction of the softmax classifier.
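The adaptive fusion of the three module outputs and the softmax classifier of formula (7) can be sketched as follows; layer names and the pooling of each module output before concatenation are illustrative assumptions. In a full model, `inputs` would be the three Input tensors of shape (L, 21), (L, 5), and (L, 8), and `se_outputs` the SE-weighted outputs of the corresponding feature modules.

```python
from tensorflow.keras import Model, layers

def build_classifier(inputs, se_outputs):
    """Fuse the SE-weighted module outputs and classify with softmax (formula (7))."""
    pooled = [layers.GlobalAveragePooling1D()(f) for f in se_outputs]
    fusion = layers.Concatenate(name="fusion")(pooled)      # fused feature for classification
    probs = layers.Dense(2, activation="softmax",
                         name="softmax")(fusion)            # P(y = i | x), i in {0, 1}
    return Model(inputs=inputs, outputs=probs)
```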
Training the lysine acetylation site prediction model based on the modular dense convolutional network comprises the following steps:
(1) cross entropy is used as the cost function to minimize the training error:
L_C = -(1/N) Σ_{j=1}^{N} log P(y = y_j | x_j)    (8)
where N is the total number of training samples, y_j is the true label of the j-th input motif, and x_j is the j-th input motif.
(2) L2 regularization is used during training to mitigate overfitting; the final objective function of the model is:
min_W (L_C + λ Σ (||W||_2)^2)    (9)
where λ is the regularization coefficient and ||W||_2 is the L2 norm of the weight matrix.
(3) The objective function is optimized with the Adam optimizer, with the learning rate and batch size set to 0.0001 and 1000, respectively. An early stopping strategy and the dropout technique are used to further prevent the model from overfitting.
(4) A class re-weighting method is adopted to increase the influence of the positive samples, forcing the model to learn abstract patterns from the minority positive class.
(5) The deep learning model is implemented with Keras 2.1.6 and TensorFlow 1.13.1, and model training and testing are carried out on a workstation running Ubuntu 18.04.1 LTS and equipped with an Nvidia Tesla V100-PCIE-32GB GPU.
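A minimal training sketch covering steps (1) to (4) is given below: cross-entropy loss (formula (8)), Adam with learning rate 0.0001 and batch size 1000, early stopping, and class re-weighting. The L2 penalty of formula (9) would be attached per layer via kernel_regularizer in a full model; the regularization coefficient, early-stopping patience, epoch count, and class weights shown here are assumptions, and the sketch uses the tf.keras API rather than the standalone Keras 2.1.6 named above.

```python
from tensorflow.keras import callbacks, optimizers

def train(model, x_train, y_train, x_val, y_val):
    # x_train / x_val are lists of three arrays matching the three module inputs.
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),   # step (3)
                  loss="sparse_categorical_crossentropy",          # formula (8)
                  metrics=["accuracy"])
    early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                         restore_best_weights=True)
    class_weight = {0: 1.0, 1: 5.0}   # step (4): up-weight the minority positive class (assumed ratio)
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              batch_size=1000, epochs=200,
              class_weight=class_weight, callbacks=[early_stop])
    return model
```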
Wherein, the proposed model is evaluated through four types of experiments including ten-fold cross validation, independent test, model generalization ability test and recognition ability of unknown lysine acetylation sites, and comprises the following steps:
(1) comparing, under the same benchmark training data set and using ten-fold cross-validation, the performance of the lysine acetylation site prediction model based on the modular dense convolutional network with that of other prediction methods;
(2) the prediction capability of a lysine acetylation site prediction model based on a modular dense convolutional network is further compared with that of other models in an independent test mode;
(3) further verifying that the lysine acetylation site prediction model based on the modularized dense convolution network has better generalization capability by adopting a generalization experiment mode;
(4) on an independent test set, top 20 candidate sites were validated and the ability to identify unknown lysine acetylation sites based on a lysine acetylation site prediction model of a modular dense convolutional network was evaluated.
Different from the prior art, the lysine acetylation site prediction method based on the modular dense convolutional network introduces protein structural characteristics and combines them with the protein original sequence and the amino acid physicochemical properties to construct the site feature space; it adopts a modular dense convolutional network to capture feature information at different levels, reducing information loss and information crosstalk during feature learning; and it introduces a squeeze-and-excitation layer to evaluate the importance of different features, improving the abstraction capability of the network for identifying potential lysine acetylation sites. The method effectively addresses the problems that existing methods consider only protein sequence-level information and that their feature learning is inefficient; it predicts potential lysine acetylation sites more accurately, reduces the cost of experimentally verifying lysine acetylation sites, and improves the efficiency of research on lysine acetylation modification.
Detailed Description
The technical solution of the present invention will be further described in more detail with reference to the following embodiments. It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a lysine acetylation site prediction method based on a modular dense convolutional network according to the present invention.
The method comprises the following steps:
s110: describing lysine acetylation sites from three aspects of protein structural characteristics, protein original sequences and amino acid physicochemical attribute information, and constructing site initial feature space.
The step S110 includes:
1. collecting and preprocessing experimental data of lysine acetylation sites;
2. converting the collected protein data into numerical vectors by encoding, constructing the site initial feature space, and taking it as the input of the prediction model.
Experimental data collection and preprocessing of lysine acetylation sites comprises the following steps:
6078, 3645 and 1860 experimentally validated human, mouse and E. coli lysine acetylated protein entries were collected and downloaded from the Protein Lysine Modification Database (PLMD).
Considering that the SPIDER3 server cannot handle protein sequences containing non-standard amino acids, these sequences are deleted manually. Taking the human data as an example, sequence redundancy is removed with the CD-HIT tool (threshold set to 0.4) to avoid model bias caused by high sequence homology, leaving 4977 acetylated protein sequences. 10% (498) of the 4977 filtered acetylated protein sequences are randomly selected to construct an independent test data set, and the remaining sequences form the training data set, which facilitates comparison with other lysine acetylation site predictors.
Converting the collected protein data into numerical vectors by encoding, constructing the site initial feature space, and using it as the input of the prediction model comprises the following steps:
(1) encoding the protein original sequence information of the site with one-of-21 coding; for a motif of length L, an L × 21-dimensional vector representation of the protein original sequence information is obtained;
(2) encoding the amino acid physicochemical attribute information of the site with Atchley factors, each amino acid residue being represented by 5 Atchley factors; for a motif of length L, an L × 5-dimensional vector representation of the amino acid physicochemical attribute information is obtained;
(3) obtaining protein structural characteristic information with SPIDER3, comprising 8 indices from 3 attributes: secondary structure probabilities (α-helix P(H), β-strand P(E), coil P(C)), local backbone torsion angles (φ, ψ, θ, τ), and accessible surface area (ASA). For a motif of length L, an L × 8-dimensional vector representation of the protein structural characteristic information is obtained.
S120: and (3) adopting a modularized dense convolution network, respectively extracting high-level characteristics of protein structural characteristics, protein original sequences and amino acid physicochemical properties from the initial characteristic space of the sites, and simultaneously paying attention to low-level characteristics and high-level characteristics through dense jump connection.
The step S120 includes:
1. introducing the design idea of a modular network structure and constructing the structure, sequence, and physicochemical information modules;
2. extracting the high-level features of each module with stacked dense convolutional blocks, and realizing information complementarity between features of different levels by attending to low-level and high-level features simultaneously through dense skip connections.
Introducing the design idea of a modular network structure and constructing the structure, sequence, and physicochemical information modules comprises the following steps:
three feature extraction submodules, namely a structure module, a sequence module, and a physicochemical module, are constructed based respectively on the protein structural characteristics, the protein original sequence, and the amino acid physicochemical properties; the parameter spaces of the submodules are mutually independent, which effectively avoids crosstalk among the three types of information and improves feature quality.
Extracting the high-level features of each module with stacked dense convolutional blocks, attending to low-level and high-level features simultaneously through dense skip connections, and realizing information complementarity between features of different levels, comprises the following steps:
since the network structures of the structural module, the sequence module and the physicochemical module are the same, only the sequence module is explained here:
(1) first, the sequence module receives as input the one-of-21 code of the site motif of length L, and then generates a low-level profile of the original sequence information of the protein by means of the one-dimensional convolution layer, as shown in formula (1).
X_0 = σ(I * W + b)    (1)
where I is the one-of-21 code vector; W is the weight matrix of the convolution filters, in which S is the filter size (S = 3) and D is the number of filters (D = 96); b is the bias term; σ is the activation function; and X_0 is the output of the one-dimensional convolution layer, with size L × D.
(2) extracting a high-level feature representation of the protein original sequence information with a dense convolutional block; the dense convolution process is shown in formula (2).
X_l = σ([X_0; X_1; ...; X_{l-1}] * W′ + b′)    (2)
where X_{l-1} is the feature map generated by the (l-1)-th convolutional layer in the dense convolutional block, and [·; ·] denotes concatenation along the feature dimension. W′ is the weight matrix, in which D′ is the total number of filters of convolutional layers 1 to l-1 in the dense convolutional block and D″ is the number of filters of the l-th convolutional layer in the block (D″ = 32). b′ is the bias term, σ is the activation function, and X_l is the feature map generated by the l-th convolutional layer in the dense convolutional block. The output of the dense convolutional block is the low-level feature map X_0 concatenated along the feature dimension with the feature maps X_1, X_2, ..., X_l generated by each convolutional layer in the block, i.e. [X_0; X_1; ...; X_l].
(3) applying a transition layer to perform a convolution operation and an activation operation on the feature map of the protein original sequence information obtained in step (2); the transition layer process is shown in formula (3).
X = σ([X_0; X_1; ...; X_l] * W″ + b″)    (3)
where W″ is the weight matrix, in which S′ is the filter size (S′ = 1); b″ is the bias term; σ is the activation function; and X is the output of the transition layer. An average pooling operation is then applied to the transition layer output to reduce the dimension of the feature map and lower the risk of model overfitting.
(4) repeating steps (2) and (3) to form stacked dense convolutional blocks. The fourth repetition of step (2) is not followed by step (3); instead, a global average pooling operation is applied.
Through this process, the sequence module extracts the high-level feature X^(seq) of the protein original sequence of the site.
Similarly, the physicochemical and structural modules extract the high-level features of the amino acid physicochemical properties and protein structural characteristics of the site through the same process.
S130: and (3) introducing a compression-excitation (SE) layer to evaluate the importance of the features, weighting each feature map, and realizing the self-adaptive dynamic fusion of the three types of information.
The step S130 includes:
1. introducing a squeeze-and-excitation (SE) layer to evaluate the importance of the features and weighting each feature map;
2. adaptively and dynamically fusing the three types of information: protein structural characteristics, protein original sequence, and amino acid physicochemical properties.
Introducing a squeeze-and-excitation (SE) layer to evaluate feature importance and weighting each feature map comprises the following steps:
the sequence module is taken as an example for explanation:
(1) squeeze: the global spatial information of the high-level feature X^(seq) extracted by the sequence module is compressed into a channel descriptor by global average pooling; the squeeze process is shown in formula (4).
z_c = F_sq(X_c^(seq)) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} X_c^(seq)(i, j)    (4)
where z_c is the channel statistic of X_c^(seq), the c-th feature map of X^(seq); F_sq(·) denotes the squeeze operation; and W and H denote the width and height of the feature map, respectively. After the statistic of every feature map of X^(seq) is computed, the channel descriptor z of X^(seq) is obtained.
(2) excitation: two fully-connected (FC) layers capture the channel dependencies of X^(seq) and learn a specific weight for each feature map of X^(seq); the excitation process is shown in formula (5).
s = F_ex(z, W) = σ(W_2 * δ(W_1 * z))    (5)
where s denotes the specific weights of the feature maps of X^(seq) learned by the excitation operation, and F_ex(·) denotes the excitation operation. δ and σ denote the activation functions of the two fully-connected layers: the former is a ReLU function, whose layer (with parameters W_1) reduces the dimension by a reduction ratio r (r = 16); the latter is a Sigmoid function, whose layer (with parameters W_2) restores the dimension so that the dimension of s equals the number of channels of the feature X^(seq).
(3) scale: X^(seq) is rescaled by the activations s to obtain the output X̃^(seq) of the SE layer, where x̃_c^(seq) is the c-th feature map of X̃^(seq) and is calculated as shown in formula (6).
x̃_c^(seq) = F_scale(X_c^(seq), s_c) = s_c · X_c^(seq)    (6)
where F_scale(·) denotes that each value of the feature map X_c^(seq) is multiplied by the weight s_c.
Similarly, the physicochemical and structural modules also get weighted high-level features through the SE layer.
The self-adaptive dynamic fusion of three kinds of information of protein structure characteristic, protein original sequence and amino acid physical and chemical properties includes the following steps:
the SE layer is realized based on global average pooling and two full connection layers (FC), and the network structures of the structure module, the sequence module and the physicochemical module are the same, and the SE layer obtains weighted high-level features. Then, the output of each submodule is connected in series to obtain a fusion characteristic for classification
As the SE layer weights different feature graphs, the feature fusion process has self-adaptive dynamic characteristics.
S140: and constructing a lysine acetylation site classifier based on the fusion characteristics and the softmax layer, and predicting potential lysine acetylation sites.
The step S140 includes:
The softmax layer receives the fused high-level feature as input and obtains the prediction class of the sample after a weighted summation and an activation operation; the forward propagation process of the softmax layer is shown in formula (7).
P(y = i | x) = softmax(W · x + b)    (7)
where W is the weight matrix and b is the bias term of the softmax layer. P(y = i | x) denotes the probability that sample x is predicted as class i ∈ {0, 1}, and the class with the highest probability is the prediction of the softmax classifier.
S150: training is based on a modular dense convolutional network lysine acetylation site prediction model.
The step S150 includes:
1. cross entropy is used as the cost function to minimize the training error:
L_C = -(1/N) Σ_{j=1}^{N} log P(y = y_j | x_j)    (8)
where N is the total number of training samples, y_j is the true label of the j-th input motif, and x_j is the j-th input motif.
2. L2 regularization is used during training to mitigate overfitting; the final objective function of the model is:
min_W (L_C + λ Σ (||W||_2)^2)    (9)
where λ is the regularization coefficient and ||W||_2 is the L2 norm of the weight matrix.
3. The objective function is optimized with the Adam optimizer, with the learning rate and batch size set to 0.0001 and 1000, respectively. An early stopping strategy and the dropout technique are used to further prevent the model from overfitting.
4. A class re-weighting method is adopted to increase the influence of the positive samples, forcing the model to learn abstract patterns from the minority positive class.
5. The deep learning model is implemented with Keras 2.1.6 and TensorFlow 1.13.1, and model training and testing are carried out on a workstation running Ubuntu 18.04.1 LTS and equipped with an Nvidia Tesla V100-PCIE-32GB GPU.
S160: the proposed model was evaluated by four types of experiments of ten-fold cross validation, independent test, model generalization ability test, and recognition ability to unknown lysine acetylation sites.
The step S160 includes:
1. comparing, under the same benchmark training data set and using ten-fold cross-validation, the performance of the lysine acetylation site prediction model based on the modular dense convolutional network with that of other prediction methods;
2. the prediction capability of a lysine acetylation site prediction model based on a modular dense convolutional network is further compared with that of other models in an independent test mode;
3. further verifying that the lysine acetylation site prediction model based on the modularized dense convolution network has better generalization capability by adopting a generalization experiment mode;
4. on an independent test set, top 20 candidate sites were validated and the ability to identify unknown lysine acetylation sites based on a lysine acetylation site prediction model of a modular dense convolutional network was evaluated.
The performance of a lysine acetylation site prediction model based on a modularized dense convolutional network and the performance of other prediction methods are compared under the same reference training data set by adopting ten-fold cross validation, and the method comprises the following steps of:
(1) the model of the invention is compared, by ten-fold cross-validation, with other existing lysine acetylation site prediction models: MusiteDeep, CapsNet, DeepAcet, PSKACEPred, EnsemblePail, GPS-PAIL2.0, and ProAcePred.
(2) The performance of the model is evaluated using six statistical measures: sensitivity (Sn), specificity (Sp), accuracy (Acc), precision (Pre), Matthews correlation coefficient (MCC), and geometric mean (G-mean), defined as follows:
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
Acc = (TP + TN) / (TP + TN + FP + FN)
Pre = TP / (TP + FP)
MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
G-mean = sqrt(Sn × Sp)
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The MCC and G-mean indices reflect model quality well when positive and negative samples are imbalanced. In addition, the area under the receiver operating characteristic (ROC) curve (AUC) and the area under the precision-recall (PR) curve (AUPR) are used to measure the overall performance of the model; the higher the AUC and AUPR values, the better the overall performance. The comparison results are shown in the accompanying drawings of the specification.
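The measures listed above can be computed as in the following sketch, which uses scikit-learn for AUC and AUPR; the decision threshold of 0.5 and the input format (0/1 labels and positive-class probabilities) are assumptions of this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_score, threshold=0.5):
    """Compute Sn, Sp, Acc, Pre, MCC, G-mean, AUC, and AUPR for binary predictions."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    sn = tp / (tp + fn)                                   # sensitivity
    sp = tn / (tn + fp)                                   # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    mcc = ((tp * tn - fp * fn)
           / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    g_mean = np.sqrt(sn * sp)
    auc = roc_auc_score(y_true, y_score)
    aupr = average_precision_score(y_true, y_score)
    return dict(Sn=sn, Sp=sp, Acc=acc, Pre=pre, MCC=mcc,
                G_mean=g_mean, AUC=auc, AUPR=aupr)
```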
Further comparing the prediction capability of the lysine acetylation site prediction model based on the modular dense convolutional network with that of other models through independent tests comprises the following steps:
For models distributed as stand-alone tools, the models are trained on the training data and then used to predict potential lysine acetylation sites on the independent test data set; for models that only provide Web services, the prediction performance is tested on the independent test data set only. The results show that the lysine acetylation site prediction model based on the modular dense convolutional network achieves the highest MCC, G-mean, AUC, and AUPR, performs best on the independent test data set, and has better lysine acetylation site prediction capability than the other prediction methods. The results of the independent tests are shown in the accompanying drawings of the specification.
Further verifying, through generalization experiments, that the lysine acetylation site prediction model based on the modular dense convolutional network has good generalization capability comprises the following steps:
Generalization experiments predict lysine acetylation sites on the human data set at a redundancy-removal threshold of 0.3, on the mouse data set at thresholds of 0.4 and 0.3, and on the E. coli data set at thresholds of 0.4 and 0.3. The lysine acetylation site prediction model based on the modular dense convolutional network shows good generalization capability, is applicable to data sets of different species, and provides a useful reference for predicting lysine acetylation modification sites of other species. The results of the generalization ability test are shown in the accompanying drawings of the specification.
Wherein, on an independent test set, the candidate sites with the top 20 rank are verified, and the ability of identifying unknown lysine acetylation sites based on a lysine acetylation site prediction model of a modularized dense convolutional network is evaluated, comprising the following steps:
the top 20 candidate sites predicted to be lysine acetylated by the model of the invention are listed according to the results of the independent test set and these 20 candidate sites were examined manually in the lysine modification database PLMD and the protein database Uniprot (https:// www.uniprot.org). Through statistical validation, found that 20 candidate sites in 13 are truly acetylated, 65%. The results of the first 20 candidate sites for independent testing of acetylated proteins by human are shown in the attached figure of the specification.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.