Disclosure of Invention
The invention aims to avoid the defects of the prior art and provides a lysine acetylation site prediction method based on a modular dense convolutional network.
The purpose of the invention is realized by the following technical measures: a lysine acetylation site prediction method based on a modular dense convolutional network is designed, comprising the following steps:
describing lysine acetylation sites from three aspects of protein structural characteristics, protein original sequences and amino acid physicochemical attribute information, and constructing site initial characteristic space;
adopting a modular dense convolutional network to extract high-level features of the protein structural characteristics, protein original sequence, and amino acid physicochemical properties from the initial feature space of the site, while attending to both low-level and high-level features through dense skip connections;
introducing a squeeze-and-excitation (SE) layer to evaluate feature importance, weighting each feature map, and realizing adaptive dynamic fusion of the three types of information;
constructing a lysine acetylation site classifier based on the fusion characteristics and the softmax layer, and predicting potential lysine acetylation sites;
training a lysine acetylation site prediction model based on a modular dense convolutional network;
the proposed model was evaluated by four types of experiments of ten-fold cross validation, independent test, model generalization ability test, and recognition ability to unknown lysine acetylation sites.
Wherein, describing lysine acetylation sites from three aspects of protein structural characteristics, protein original sequences and amino acid physicochemical attribute information, and the step of constructing site initial feature space comprises the following steps:
(1) collecting and preprocessing experimental data of lysine acetylation sites;
(2) converting the collected protein data into numerical vectors by encoding, constructing the site initial feature space, and taking it as the input of the prediction model.
Wherein, the experimental data collection and preprocessing of lysine acetylation sites comprises the following steps:
6078, 3645 and 1860 experimentally validated human, mouse and E. coli lysine acetylated protein entries were collected and downloaded from the Protein Lysine Modification Database (PLMD).
Considering that the SPIDER3 server cannot handle protein sequences containing non-standard amino acids, these sequences are deleted manually. Taking the human data as an example, sequence redundancy is removed with the CD-HIT tool (threshold set to 0.4) to avoid model bias caused by high sequence homology, leaving 4977 acetylated protein sequences. 10% (498) of the 4977 filtered acetylated protein sequences are randomly selected to construct an independent test data set, and the remaining sequences form the training data set, which facilitates comparison with other lysine acetylation site predictors.
Converting the collected protein data into numerical vectors by encoding, constructing the site initial feature space, and using it as the input of the prediction model comprises the following steps:
(1) encoding the protein original sequence information of the site with one-of-21 coding; for a motif of length L, an L × 21-dimensional vector representation of the protein original sequence information is obtained;
(2) encoding the amino acid physicochemical attribute information of the site with Atchley factors, each amino acid residue being represented by 5 Atchley factors; for a motif of length L, an L × 5-dimensional vector representation of the amino acid physicochemical attribute information is obtained;
(3) obtaining protein structural characteristic information with SPIDER3, comprising 8 indices from 3 attributes: secondary structure probabilities (α-helix P(H), β-strand P(E), coil P(C)), local backbone torsion angles (φ, ψ, θ, τ), and accessible surface area (ASA). For a motif of length L, an L × 8-dimensional vector representation of the protein structural characteristic information is obtained.
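For clarity, the following is a minimal Python/NumPy sketch of how the three initial feature matrices described above could be assembled for one site motif. The alphabet ordering, the placeholder Atchley table (real factor values omitted), and the assumption that the SPIDER3 output has already been parsed into an L × 8 array are illustrative choices, not details taken from the invention.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"               # 21 symbols for one-of-21 coding
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
# Placeholder table: the five published Atchley factors per residue would go here.
ATCHLEY = {aa: np.zeros(5) for aa in ALPHABET}

def one_of_21(motif):
    """Encode a length-L motif as an L x 21 one-hot matrix."""
    x = np.zeros((len(motif), 21))
    for i, aa in enumerate(motif):
        x[i, AA_INDEX.get(aa, AA_INDEX["X"])] = 1.0
    return x

def atchley_encode(motif):
    """Encode a length-L motif as an L x 5 matrix of Atchley factors."""
    return np.stack([ATCHLEY.get(aa, ATCHLEY["X"]) for aa in motif])

def build_site_features(motif, spider3_features):
    """Return the three initial feature matrices (L x 21, L x 5, L x 8) for one site."""
    assert spider3_features.shape == (len(motif), 8)
    return one_of_21(motif), atchley_encode(motif), spider3_features
```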
Adopting a modular dense convolutional network to extract the high-level features of the protein structural characteristics, protein original sequence, and amino acid physicochemical properties from the site initial feature space, while attending to both low-level and high-level features through dense skip connections, comprises the following steps:
(1) introducing the design idea of a modular network structure and constructing the structure, sequence, and physicochemical information modules;
(2) extracting the high-level features of each module with stacked dense convolutional blocks, and realizing information complementarity between features of different levels by attending to low-level and high-level features simultaneously through dense skip connections.
Introducing the design idea of a modular network structure and constructing the structure, sequence, and physicochemical information modules comprises the following steps:
three feature extraction submodules, namely a structure module, a sequence module, and a physicochemical module, are constructed based respectively on the protein structural characteristics, the protein original sequence, and the amino acid physicochemical properties; the parameter spaces of the submodules are mutually independent, which effectively avoids crosstalk among the three types of information and improves feature quality.
Extracting the high-level features of each module with stacked dense convolutional blocks, attending to low-level and high-level features simultaneously through dense skip connections, and realizing information complementarity between features of different levels, comprises the following steps:
since the network structures of the structural module, the sequence module and the physicochemical module are the same, only the sequence module is explained here:
(1) first, the sequence module receives as input the one-of-21 code of the site motif of length L, and then generates a low-level profile of the original sequence information of the protein by means of the one-dimensional convolution layer, as shown in formula (1).
X_0 = σ(I * W + b)    (1)
where I is the one-of-21 code vector; W is the weight matrix of the convolution filters, in which S is the filter size (S = 3) and D is the number of filters (D = 96); b is the bias term; σ is the activation function; and X_0 is the output of the one-dimensional convolution layer, with size L × D.
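A minimal Keras-style sketch of the one-dimensional convolution of formula (1) is given below. The motif length, the 'same' padding, and the ReLU activation are assumptions; the filter size S = 3 and filter number D = 96 follow the description above. (The invention states that the implementation uses standalone Keras 2.1.6; this sketch is written against the tf.keras API.)

```python
from tensorflow.keras import Input, layers

L = 31                                            # assumed motif length
seq_input = Input(shape=(L, 21), name="one_of_21")
# Formula (1): X0 = sigma(I * W + b); S = 3, D = 96, 'same' padding keeps length L.
x0 = layers.Conv1D(filters=96, kernel_size=3, padding="same",
                   activation="relu", name="seq_conv0")(seq_input)   # shape (L, 96)
```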
(2) extracting a high-level feature representation of the protein original sequence information with a dense convolutional block; the dense convolution process is shown in formula (2).
X_l = σ([X_0; X_1; ...; X_{l-1}] * W′ + b′)    (2)
where X_{l-1} is the feature map generated by the (l-1)-th convolutional layer in the dense convolutional block, and [·; ·] denotes concatenation along the feature dimension. W′ is the weight matrix, in which D′ is the total number of filters of convolutional layers 1 to l-1 in the dense convolutional block and D″ is the number of filters of the l-th convolutional layer in the block (D″ = 32). b′ is the bias term, σ is the activation function, and X_l is the feature map generated by the l-th convolutional layer in the dense convolutional block. The output of the dense convolutional block is the low-level feature map X_0 concatenated along the feature dimension with the feature maps X_1, X_2, ..., X_l generated by each convolutional layer in the block, i.e. [X_0; X_1; ...; X_l].
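The dense convolution of formula (2) can be sketched as follows; the number of convolutional layers per block and the activation are assumptions of this sketch, while the growth rate D″ = 32 follows the description above.

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth=32, kernel_size=3):
    """Dense convolution of formula (2): each layer sees [X0; X1; ...; X_{l-1}]."""
    feats = [x]                                           # starts with the block input X0
    for _ in range(num_layers):                           # num_layers is an assumption
        concat = layers.Concatenate(axis=-1)(feats)       # concatenation along the feature dimension
        xl = layers.Conv1D(growth, kernel_size, padding="same",
                           activation="relu")(concat)     # X_l, growth rate D'' = 32
        feats.append(xl)
    return layers.Concatenate(axis=-1)(feats)             # block output [X0; X1; ...; Xl]
```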
(3) applying a transition layer to perform a convolution operation and an activation operation on the feature map of the protein original sequence information obtained in step (2); the transition layer process is shown in formula (3).
X = σ([X_0; X_1; ...; X_l] * W″ + b″)    (3)
where W″ is the weight matrix, in which S′ is the filter size (S′ = 1); b″ is the bias term; σ is the activation function; and X is the output of the transition layer. An average pooling operation is then applied to the transition layer output to reduce the dimension of the feature map and lower the risk of model overfitting.
(4) repeating steps (2) and (3) to form stacked dense convolutional blocks. The fourth repetition of step (2) is not followed by step (3); instead, a global average pooling operation is applied.
Through this process, the sequence module extracts the high-level feature X^(seq) of the protein original sequence of the site.
Similarly, the physicochemical and structural modules extract the high-level features of the amino acid physicochemical properties and protein structural characteristics of the site through the same process.
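The following sketch illustrates how one module could stack dense convolutional blocks and transition layers as described in steps (2) to (4). The number of transition-layer filters is an assumption; for simplicity, the global average pooling of step (4) is deferred to the squeeze step of the SE layer sketched later, which is an arrangement assumption of this sketch rather than a detail stated above.

```python
from tensorflow.keras import layers

def transition_layer(x, filters=96):
    """Transition layer of formula (3): width-1 convolution followed by average pooling."""
    x = layers.Conv1D(filters, kernel_size=1, padding="same", activation="relu")(x)
    return layers.AveragePooling1D(pool_size=2)(x)        # shrink the feature map

def feature_module(x0, num_blocks=4):
    """One module (sequence, physicochemical, or structure): stacked dense blocks."""
    x = x0
    for i in range(num_blocks):
        x = dense_block(x)                                # sketched above
        if i < num_blocks - 1:
            x = transition_layer(x)
    # The text applies global average pooling after the last block; in this sketch
    # that pooling is performed by the SE layer's squeeze step (see the later sketch).
    return x
```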
Introducing a squeeze-and-excitation (SE) layer to evaluate feature importance, weighting each feature map, and realizing adaptive dynamic fusion of the three types of information comprises the following steps:
(1) introducing a squeeze-and-excitation (SE) layer to evaluate the importance of the features and weighting each feature map;
(2) adaptively and dynamically fusing the three types of information: protein structural characteristics, protein original sequence, and amino acid physicochemical properties.
Wherein, introducing the squeeze-and-excitation (SE) layer to evaluate feature importance and weighting each feature map comprises the following steps:
the sequence module is taken as an example for explanation:
(1) squeeze: the global spatial information of the high-level feature X^(seq) extracted by the sequence module is compressed into a channel descriptor by global average pooling; the squeeze process is shown in formula (4).
z_c = F_sq(X_c^(seq)) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} X_c^(seq)(i, j)    (4)
where z_c is the channel statistic of X_c^(seq), the c-th feature map of X^(seq); F_sq(·) denotes the squeeze operation; and W and H denote the width and height of the feature map, respectively. After the statistic of every feature map of X^(seq) is computed, the channel descriptor z of X^(seq) is obtained.
(2) excitation: two fully-connected (FC) layers capture the channel dependencies of X^(seq) and learn a specific weight for each feature map of X^(seq); the excitation process is shown in formula (5).
s = F_ex(z, W) = σ(W_2 * δ(W_1 * z))    (5)
where s denotes the specific weights of the feature maps of X^(seq) learned by the excitation operation, and F_ex(·) denotes the excitation operation. δ and σ denote the activation functions of the two fully-connected layers: the former is a ReLU function, whose layer (with parameters W_1) reduces the dimension by a reduction ratio r (r = 16); the latter is a Sigmoid function, whose layer (with parameters W_2) restores the dimension so that the dimension of s equals the number of channels of the feature X^(seq).
(3) scale: X^(seq) is rescaled by the activations s to obtain the output X̃^(seq) of the SE layer, where x̃_c^(seq) is the c-th feature map of X̃^(seq) and is calculated as shown in formula (6).
x̃_c^(seq) = F_scale(X_c^(seq), s_c) = s_c · X_c^(seq)    (6)
where F_scale(·) denotes that each value of the feature map X_c^(seq) is multiplied by the weight s_c.
Similarly, the physicochemical and structural modules also get weighted high-level features through the SE layer.
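A minimal sketch of the SE layer of formulas (4) to (6), applied to a module feature map of shape (L, C), is given below. The reduction ratio r = 16 follows the description; treating the motif length as the spatial extent pooled in formula (4) is an assumption of this sketch.

```python
from tensorflow.keras import layers

def se_layer(feature_map, reduction=16):
    """Squeeze-and-excitation of formulas (4)-(6) on an (L, C) feature map."""
    channels = int(feature_map.shape[-1])
    z = layers.GlobalAveragePooling1D()(feature_map)               # squeeze, formula (4)
    s = layers.Dense(channels // reduction, activation="relu")(z)  # excitation: W1 + ReLU
    s = layers.Dense(channels, activation="sigmoid")(s)            # excitation: W2 + Sigmoid, formula (5)
    s = layers.Reshape((1, channels))(s)                           # broadcast over positions
    return layers.Multiply()([feature_map, s])                     # scale, formula (6)
```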
The self-adaptive dynamic fusion of three types of information of protein structural characteristics, protein original sequences and amino acid physical and chemical properties comprises the following steps:
the SE layer is realized based on global average pooling and two full connection layers (FC), and the network structures of the structure module, the sequence module and the physicochemical module are the same, and the SE layer obtains weighted high-level features. Then, the output of each submodule is connected in series to obtain a fusion characteristic for classification
As the SE layer weights different feature graphs, the feature fusion process has self-adaptive dynamic characteristics.
Wherein, constructing a lysine acetylation site classifier based on the fusion characteristics and the softmax layer, predicting potential lysine acetylation sites, comprises the following steps:
The softmax layer receives the fused high-level feature as input and obtains the prediction class of the sample after a weighted summation and an activation operation; the forward propagation process of the softmax layer is shown in formula (7).
P(y = i | x) = softmax(W · x + b)    (7)
where W is the weight matrix and b is the bias term of the softmax layer. P(y = i | x) denotes the probability that sample x is predicted as class i ∈ {0, 1}, and the class with the highest probability is the prediction of the softmax classifier.
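The adaptive fusion of the three module outputs and the softmax classifier of formula (7) can be sketched as follows; layer names and the pooling of each module output before concatenation are illustrative assumptions. In a full model, `inputs` would be the three Input tensors of shape (L, 21), (L, 5), and (L, 8), and `se_outputs` the SE-weighted outputs of the corresponding feature modules.

```python
from tensorflow.keras import Model, layers

def build_classifier(inputs, se_outputs):
    """Fuse the SE-weighted module outputs and classify with softmax (formula (7))."""
    pooled = [layers.GlobalAveragePooling1D()(f) for f in se_outputs]
    fusion = layers.Concatenate(name="fusion")(pooled)      # fused feature for classification
    probs = layers.Dense(2, activation="softmax",
                         name="softmax")(fusion)            # P(y = i | x), i in {0, 1}
    return Model(inputs=inputs, outputs=probs)
```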
Training the lysine acetylation site prediction model based on the modular dense convolutional network comprises the following steps:
(1) cross entropy is used as the cost function to minimize the training error:
L_C = -(1/N) Σ_{j=1}^{N} log P(y = y_j | x_j)    (8)
where N is the total number of training samples, y_j is the true label of the j-th input motif, and x_j is the j-th input motif.
(2) L2 regularization is used during training to mitigate overfitting; the final objective function of the model is:
min_W (L_C + λ Σ (||W||_2)^2)    (9)
where λ is the regularization coefficient and ||W||_2 is the L2 norm of the weight matrix.
(3) The objective function is optimized with the Adam optimizer, with the learning rate and batch size set to 0.0001 and 1000, respectively. An early stopping strategy and the dropout technique are used to further prevent the model from overfitting.
(4) A class re-weighting method is adopted to increase the influence of the positive samples, forcing the model to learn abstract patterns from the minority positive class.
(5) The deep learning model is implemented with Keras 2.1.6 and TensorFlow 1.13.1, and model training and testing are carried out on a workstation running Ubuntu 18.04.1 LTS and equipped with an Nvidia Tesla V100-PCIE-32GB GPU.
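A minimal training sketch covering steps (1) to (4) is given below: cross-entropy loss (formula (8)), Adam with learning rate 0.0001 and batch size 1000, early stopping, and class re-weighting. The L2 penalty of formula (9) would be attached per layer via kernel_regularizer in a full model; the regularization coefficient, early-stopping patience, epoch count, and class weights shown here are assumptions, and the sketch uses the tf.keras API rather than the standalone Keras 2.1.6 named above.

```python
from tensorflow.keras import callbacks, optimizers

def train(model, x_train, y_train, x_val, y_val):
    # x_train / x_val are lists of three arrays matching the three module inputs.
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),   # step (3)
                  loss="sparse_categorical_crossentropy",          # formula (8)
                  metrics=["accuracy"])
    early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                         restore_best_weights=True)
    class_weight = {0: 1.0, 1: 5.0}   # step (4): up-weight the minority positive class (assumed ratio)
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              batch_size=1000, epochs=200,
              class_weight=class_weight, callbacks=[early_stop])
    return model
```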
Wherein, the proposed model is evaluated through four types of experiments including ten-fold cross validation, independent test, model generalization ability test and recognition ability of unknown lysine acetylation sites, and comprises the following steps:
(1) comparing, under the same benchmark training data set and using ten-fold cross-validation, the performance of the lysine acetylation site prediction model based on the modular dense convolutional network with that of other prediction methods;
(2) the prediction capability of a lysine acetylation site prediction model based on a modular dense convolutional network is further compared with that of other models in an independent test mode;
(3) further verifying that the lysine acetylation site prediction model based on the modularized dense convolution network has better generalization capability by adopting a generalization experiment mode;
(4) on an independent test set, top 20 candidate sites were validated and the ability to identify unknown lysine acetylation sites based on a lysine acetylation site prediction model of a modular dense convolutional network was evaluated.
Different from the prior art, the lysine acetylation site prediction method based on the modular dense convolutional network introduces protein structural characteristics and combines them with the protein original sequence and the amino acid physicochemical properties to construct the site feature space; it adopts a modular dense convolutional network to capture feature information at different levels, reducing information loss and information crosstalk during feature learning; and it introduces a squeeze-and-excitation layer to evaluate the importance of different features, improving the abstraction capability of the network for identifying potential lysine acetylation sites. The method effectively addresses the problems that existing methods consider only protein sequence-level information and that their feature learning is inefficient; it predicts potential lysine acetylation sites more accurately, reduces the cost of experimentally verifying lysine acetylation sites, and improves the efficiency of research on lysine acetylation modification.
Detailed Description
The technical solution of the present invention will be further described in more detail with reference to the following embodiments. It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a lysine acetylation site prediction method based on a modular dense convolutional network according to the present invention.
The method comprises the following steps:
s110: describing lysine acetylation sites from three aspects of protein structural characteristics, protein original sequences and amino acid physicochemical attribute information, and constructing site initial feature space.
The step S110 includes:
1. collecting and preprocessing experimental data of lysine acetylation sites;
2. converting the collected protein data into numerical vectors by encoding, constructing the site initial feature space, and taking it as the input of the prediction model.
Experimental data collection and preprocessing of lysine acetylation sites comprises the following steps:
6078, 3645 and 1860 experimentally validated human, mouse and E. coli lysine acetylated protein entries were collected and downloaded from the Protein Lysine Modification Database (PLMD).
Considering that the SPIDER3 server cannot handle protein sequences containing non-standard amino acids, these sequences are deleted manually. Taking the human data as an example, sequence redundancy is removed with the CD-HIT tool (threshold set to 0.4) to avoid model bias caused by high sequence homology, leaving 4977 acetylated protein sequences. 10% (498) of the 4977 filtered acetylated protein sequences are randomly selected to construct an independent test data set, and the remaining sequences form the training data set, which facilitates comparison with other lysine acetylation site predictors.
Converting the collected protein data into numerical vectors by encoding, constructing the site initial feature space, and using it as the input of the prediction model comprises the following steps:
(1) encoding the protein original sequence information of the site with one-of-21 coding; for a motif of length L, an L × 21-dimensional vector representation of the protein original sequence information is obtained;
(2) encoding the amino acid physicochemical attribute information of the site with Atchley factors, each amino acid residue being represented by 5 Atchley factors; for a motif of length L, an L × 5-dimensional vector representation of the amino acid physicochemical attribute information is obtained;
(3) obtaining protein structural characteristic information with SPIDER3, comprising 8 indices from 3 attributes: secondary structure probabilities (α-helix P(H), β-strand P(E), coil P(C)), local backbone torsion angles (φ, ψ, θ, τ), and accessible surface area (ASA). For a motif of length L, an L × 8-dimensional vector representation of the protein structural characteristic information is obtained.
S120: and (3) adopting a modularized dense convolution network, respectively extracting high-level characteristics of protein structural characteristics, protein original sequences and amino acid physicochemical properties from the initial characteristic space of the sites, and simultaneously paying attention to low-level characteristics and high-level characteristics through dense jump connection.
The step S120 includes:
1. introducing the design idea of a modular network structure and constructing the structure, sequence, and physicochemical information modules;
2. extracting the high-level features of each module with stacked dense convolutional blocks, and realizing information complementarity between features of different levels by attending to low-level and high-level features simultaneously through dense skip connections.
Introducing the design idea of a modular network structure and constructing the structure, sequence, and physicochemical information modules comprises the following steps:
three feature extraction submodules, namely a structure module, a sequence module, and a physicochemical module, are constructed based respectively on the protein structural characteristics, the protein original sequence, and the amino acid physicochemical properties; the parameter spaces of the submodules are mutually independent, which effectively avoids crosstalk among the three types of information and improves feature quality.
Extracting the high-level features of each module with stacked dense convolutional blocks, attending to low-level and high-level features simultaneously through dense skip connections, and realizing information complementarity between features of different levels, comprises the following steps:
since the network structures of the structural module, the sequence module and the physicochemical module are the same, only the sequence module is explained here:
(1) first, the sequence module receives as input the one-of-21 code of the site motif of length L, and then generates a low-level profile of the original sequence information of the protein by means of the one-dimensional convolution layer, as shown in formula (1).
X_0 = σ(I * W + b)    (1)
where I is the one-of-21 code vector; W is the weight matrix of the convolution filters, in which S is the filter size (S = 3) and D is the number of filters (D = 96); b is the bias term; σ is the activation function; and X_0 is the output of the one-dimensional convolution layer, with size L × D.
(2) extracting a high-level feature representation of the protein original sequence information with a dense convolutional block; the dense convolution process is shown in formula (2).
X_l = σ([X_0; X_1; ...; X_{l-1}] * W′ + b′)    (2)
where X_{l-1} is the feature map generated by the (l-1)-th convolutional layer in the dense convolutional block, and [·; ·] denotes concatenation along the feature dimension. W′ is the weight matrix, in which D′ is the total number of filters of convolutional layers 1 to l-1 in the dense convolutional block and D″ is the number of filters of the l-th convolutional layer in the block (D″ = 32). b′ is the bias term, σ is the activation function, and X_l is the feature map generated by the l-th convolutional layer in the dense convolutional block. The output of the dense convolutional block is the low-level feature map X_0 concatenated along the feature dimension with the feature maps X_1, X_2, ..., X_l generated by each convolutional layer in the block, i.e. [X_0; X_1; ...; X_l].
(3) applying a transition layer to perform a convolution operation and an activation operation on the feature map of the protein original sequence information obtained in step (2); the transition layer process is shown in formula (3).
X = σ([X_0; X_1; ...; X_l] * W″ + b″)    (3)
where W″ is the weight matrix, in which S′ is the filter size (S′ = 1); b″ is the bias term; σ is the activation function; and X is the output of the transition layer. An average pooling operation is then applied to the transition layer output to reduce the dimension of the feature map and lower the risk of model overfitting.
(4) repeating steps (2) and (3) to form stacked dense convolutional blocks. The fourth repetition of step (2) is not followed by step (3); instead, a global average pooling operation is applied.
Through this process, the sequence module extracts the high-level feature X^(seq) of the protein original sequence of the site.
Similarly, the physicochemical and structural modules extract the high-level features of the amino acid physicochemical properties and protein structural characteristics of the site through the same process.
S130: and (3) introducing a compression-excitation (SE) layer to evaluate the importance of the features, weighting each feature map, and realizing the self-adaptive dynamic fusion of the three types of information.
The step S130 includes:
1. introducing a squeeze-and-excitation (SE) layer to evaluate the importance of the features and weighting each feature map;
2. adaptively and dynamically fusing the three types of information: protein structural characteristics, protein original sequence, and amino acid physicochemical properties.
Introducing a squeeze-and-excitation (SE) layer to evaluate feature importance and weighting each feature map comprises the following steps:
the sequence module is taken as an example for explanation:
(1) squeeze: the global spatial information of the high-level feature X^(seq) extracted by the sequence module is compressed into a channel descriptor by global average pooling; the squeeze process is shown in formula (4).
z_c = F_sq(X_c^(seq)) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} X_c^(seq)(i, j)    (4)
where z_c is the channel statistic of X_c^(seq), the c-th feature map of X^(seq); F_sq(·) denotes the squeeze operation; and W and H denote the width and height of the feature map, respectively. After the statistic of every feature map of X^(seq) is computed, the channel descriptor z of X^(seq) is obtained.
(2) excitation: two fully-connected (FC) layers capture the channel dependencies of X^(seq) and learn a specific weight for each feature map of X^(seq); the excitation process is shown in formula (5).
s = F_ex(z, W) = σ(W_2 * δ(W_1 * z))    (5)
where s denotes the specific weights of the feature maps of X^(seq) learned by the excitation operation, and F_ex(·) denotes the excitation operation. δ and σ denote the activation functions of the two fully-connected layers: the former is a ReLU function, whose layer (with parameters W_1) reduces the dimension by a reduction ratio r (r = 16); the latter is a Sigmoid function, whose layer (with parameters W_2) restores the dimension so that the dimension of s equals the number of channels of the feature X^(seq).
(3) scale: X^(seq) is rescaled by the activations s to obtain the output X̃^(seq) of the SE layer, where x̃_c^(seq) is the c-th feature map of X̃^(seq) and is calculated as shown in formula (6).
x̃_c^(seq) = F_scale(X_c^(seq), s_c) = s_c · X_c^(seq)    (6)
where F_scale(·) denotes that each value of the feature map X_c^(seq) is multiplied by the weight s_c.
Similarly, the physicochemical and structural modules also get weighted high-level features through the SE layer.
The self-adaptive dynamic fusion of three kinds of information of protein structure characteristic, protein original sequence and amino acid physical and chemical properties includes the following steps:
the SE layer is realized based on global average pooling and two full connection layers (FC), and the network structures of the structure module, the sequence module and the physicochemical module are the same, and the SE layer obtains weighted high-level features. Then, the output of each submodule is connected in series to obtain a fusion characteristic for classification
As the SE layer weights different feature graphs, the feature fusion process has self-adaptive dynamic characteristics.
S140: and constructing a lysine acetylation site classifier based on the fusion characteristics and the softmax layer, and predicting potential lysine acetylation sites.
The step S140 includes:
The softmax layer receives the fused high-level feature as input and obtains the prediction class of the sample after a weighted summation and an activation operation; the forward propagation process of the softmax layer is shown in formula (7).
P(y = i | x) = softmax(W · x + b)    (7)
where W is the weight matrix and b is the bias term of the softmax layer. P(y = i | x) denotes the probability that sample x is predicted as class i ∈ {0, 1}, and the class with the highest probability is the prediction of the softmax classifier.
S150: training is based on a modular dense convolutional network lysine acetylation site prediction model.
The step S150 includes:
1. cross entropy is used as the cost function to minimize the training error:
L_C = -(1/N) Σ_{j=1}^{N} log P(y = y_j | x_j)    (8)
where N is the total number of training samples, y_j is the true label of the j-th input motif, and x_j is the j-th input motif.
2. L2 regularization is used during training to mitigate overfitting; the final objective function of the model is:
min_W (L_C + λ Σ (||W||_2)^2)    (9)
where λ is the regularization coefficient and ||W||_2 is the L2 norm of the weight matrix.
3. The objective function is optimized with the Adam optimizer, with the learning rate and batch size set to 0.0001 and 1000, respectively. An early stopping strategy and the dropout technique are used to further prevent the model from overfitting.
4. A class re-weighting method is adopted to increase the influence of the positive samples, forcing the model to learn abstract patterns from the minority positive class.
5. The deep learning model is implemented with Keras 2.1.6 and TensorFlow 1.13.1, and model training and testing are carried out on a workstation running Ubuntu 18.04.1 LTS and equipped with an Nvidia Tesla V100-PCIE-32GB GPU.
S160: the proposed model was evaluated by four types of experiments of ten-fold cross validation, independent test, model generalization ability test, and recognition ability to unknown lysine acetylation sites.
The step S160 includes:
1. comparing, under the same benchmark training data set and using ten-fold cross-validation, the performance of the lysine acetylation site prediction model based on the modular dense convolutional network with that of other prediction methods;
2. the prediction capability of a lysine acetylation site prediction model based on a modular dense convolutional network is further compared with that of other models in an independent test mode;
3. further verifying that the lysine acetylation site prediction model based on the modularized dense convolution network has better generalization capability by adopting a generalization experiment mode;
4. on an independent test set, top 20 candidate sites were validated and the ability to identify unknown lysine acetylation sites based on a lysine acetylation site prediction model of a modular dense convolutional network was evaluated.
The performance of a lysine acetylation site prediction model based on a modularized dense convolutional network and the performance of other prediction methods are compared under the same reference training data set by adopting ten-fold cross validation, and the method comprises the following steps of:
(1) the model of the invention is compared, by ten-fold cross-validation, with other existing lysine acetylation site prediction models: MusiteDeep, CapsNet, DeepAcet, PSKACEPred, EnsemblePail, GPS-PAIL2.0, and ProAcePred.
(2) The performance of the model is evaluated using six statistical measures: sensitivity (Sn), specificity (Sp), accuracy (Acc), precision (Pre), Matthews correlation coefficient (MCC), and geometric mean (G-mean), defined as follows:
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
Acc = (TP + TN) / (TP + TN + FP + FN)
Pre = TP / (TP + FP)
MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
G-mean = sqrt(Sn × Sp)
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The MCC and G-mean indices reflect model quality well when positive and negative samples are imbalanced. In addition, the area under the receiver operating characteristic (ROC) curve (AUC) and the area under the precision-recall (PR) curve (AUPR) are used to measure the overall performance of the model; the higher the AUC and AUPR values, the better the overall performance. The comparison results are shown in the accompanying drawings of the specification.
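The measures listed above can be computed as in the following sketch, which uses scikit-learn for AUC and AUPR; the decision threshold of 0.5 and the input format (0/1 labels and positive-class probabilities) are assumptions of this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_score, threshold=0.5):
    """Compute Sn, Sp, Acc, Pre, MCC, G-mean, AUC, and AUPR for binary predictions."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    sn = tp / (tp + fn)                                   # sensitivity
    sp = tn / (tn + fp)                                   # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    mcc = ((tp * tn - fp * fn)
           / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    g_mean = np.sqrt(sn * sp)
    auc = roc_auc_score(y_true, y_score)
    aupr = average_precision_score(y_true, y_score)
    return dict(Sn=sn, Sp=sp, Acc=acc, Pre=pre, MCC=mcc,
                G_mean=g_mean, AUC=auc, AUPR=aupr)
```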
Further comparing the prediction capability of the lysine acetylation site prediction model based on the modular dense convolutional network with that of other models through independent tests comprises the following steps:
For models distributed as stand-alone tools, the models are trained on the training data and then used to predict potential lysine acetylation sites on the independent test data set; for models that only provide Web services, the prediction performance is tested on the independent test data set only. The results show that the lysine acetylation site prediction model based on the modular dense convolutional network achieves the highest MCC, G-mean, AUC, and AUPR, performs best on the independent test data set, and has better lysine acetylation site prediction capability than the other prediction methods. The results of the independent tests are shown in the accompanying drawings of the specification.
Further verifying, through generalization experiments, that the lysine acetylation site prediction model based on the modular dense convolutional network has good generalization capability comprises the following steps:
Generalization experiments predict lysine acetylation sites on the human data set at a redundancy-removal threshold of 0.3, on the mouse data set at thresholds of 0.4 and 0.3, and on the E. coli data set at thresholds of 0.4 and 0.3. The lysine acetylation site prediction model based on the modular dense convolutional network shows good generalization capability, is applicable to data sets of different species, and provides a useful reference for predicting lysine acetylation modification sites of other species. The results of the generalization ability test are shown in the accompanying drawings of the specification.
Wherein, on an independent test set, the candidate sites with the top 20 rank are verified, and the ability of identifying unknown lysine acetylation sites based on a lysine acetylation site prediction model of a modularized dense convolutional network is evaluated, comprising the following steps:
the top 20 candidate sites predicted to be lysine acetylated by the model of the invention are listed according to the results of the independent test set and these 20 candidate sites were examined manually in the lysine modification database PLMD and the protein database Uniprot (https:// www.uniprot.org). Through statistical validation, found that 20 candidate sites in 13 are truly acetylated, 65%. The results of the first 20 candidate sites for independent testing of acetylated proteins by human are shown in the attached figure of the specification.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.