CN112820350B - Lysine propionylation prediction method and system based on transfer learning


Info

Publication number
CN112820350B
CN112820350B
Authority
CN
China
Prior art keywords
lysine
neural network
recurrent neural
sequence
propionylation
Prior art date
Legal status
Active
Application number
CN202110289477.6A
Other languages
Chinese (zh)
Other versions
CN112820350A (en)
Inventor
Li Ang
Chen Min
Tan Yan
Deng Yingwei
Sun Xudong
Current Assignee
Hunan Institute of Technology
Original Assignee
Hunan Institute of Technology
Priority date
Filing date
Publication date
Application filed by Hunan Institute of Technology filed Critical Hunan Institute of Technology
Priority to CN202110289477.6A priority Critical patent/CN112820350B/en
Publication of CN112820350A publication Critical patent/CN112820350A/en
Application granted granted Critical
Publication of CN112820350B publication Critical patent/CN112820350B/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention relates to a lysine propionylation prediction method and system based on transfer learning, in the technical field of biological information. By means of transfer learning, the invention solves the problem that the existing propionylation sample data are too few to train a deep learning model well, and can quickly and effectively predict lysine propionylation modification.

Description

Lysine propionylation prediction method and system based on transfer learning
Technical Field
The invention relates to the technical field of biological information, in particular to a lysine propionylation prediction method and system based on transfer learning.
Background
Protein propionylation is a novel lysine acylation modification first found on histones in 2007; through acetyltransferases, the propionyl group, a larger acyl moiety, is transferred to lysine from its coenzyme A form. Recent studies have found that some acetyltransferases such as PCAF, p300 and CBP can catalyze propionylation, while SIRT1 and SIRT2 can remove propionylation modifications. Current studies indicate that lysine propionylation plays a regulatory role in metabolic processes and is a marker of active chromatin.
The identification and systematic analysis of modification sites are important for the study of protein post-translational modification (PTM), and the identification of propionylation sites is a key basis for further exploring the function and role of propionylated proteins in pathophysiology. Traditional methods for identifying propionylated substrate proteins include high-throughput mass spectrometry (MS), PTMap combined with protein sequence alignment, and nano liquid chromatography coupled with MS. In recent years, computational identification of PTMs has made remarkable progress, and various prediction algorithms and systems have appeared, such as a method that predicts lysine crotonylation by combining position- and composition-based relative features with a statistical matrix, and a method that predicts prokaryotic lysine acetylation sites by extracting sequence-based physicochemical and evolutionary-information features; however, methods and systems for predicting lysine propionylation remain rare. In addition, most prior-art prediction algorithms that combine sequence feature information with feature screening and optimization are trained only on "small" samples and generalize poorly, meaning that even though they achieve high prediction accuracy on an experimental data set, their actual accuracy is probably much worse.
Disclosure of Invention
One of the purposes of the invention is to provide a lysine propionylation prediction method based on transfer learning, which is used for quickly and effectively predicting lysine propionylation modification.
In order to achieve the purpose, the lysine propionylation prediction method based on the transfer learning adopts the following means:
1) training a deep recurrent neural network model with known lysine malonylation modification data, then fine-tuning it with known lysine propionylation modification data, and using the resulting model as a feature extractor;
2) taking a support vector machine model, parameter-optimized and trained with known lysine propionylated protein sequence features, as the final classifier;
3) extracting the target sequence features of the protein to be analyzed with the feature extractor, inputting the extracted target sequence features into the final classifier, predicting the propionylation modification sites and outputting the prediction result.
Wherein, before step 1), the method further comprises: constructing a deep recurrent neural network model, and inputting known lysine malonylation modification data into the deep recurrent neural network model for training; then inputting known lysine propionylation modification data into the trained deep recurrent neural network model to fine-tune it.
Before step 2), the method further comprises: constructing a support vector machine; segmenting known lysine propionylated protein sequences into peptide fragment sequences to form positive and negative sample sets; extracting sequence features from the positive and negative sample sets with the feature extractor; optimizing the window size and hyperparameters of the support vector machine using the extracted sequence features; and training the support vector machine model.
Further, when the deep recurrent neural network model is trained, known lysine malonylated proteins are first segmented into peptide fragment sequences to form lysine malonylation modification data containing corresponding positive and negative sample sets, and these data are then input into the deep recurrent neural network model for training.
Furthermore, when the trained deep recurrent neural network model is fine-tuned, known lysine propionylated proteins are first segmented into peptide fragment sequences to form lysine propionylation modification data containing corresponding positive and negative sample sets, and these data are then input into the trained deep recurrent neural network model for fine-tuning.
In addition, in step 3), extracting the target sequence features of the protein to be analyzed with the feature extractor means first segmenting the protein sequence to be analyzed into peptide fragment sequences and then extracting the target sequence features from those peptide fragments with the feature extractor.
Wherein, when each protein sequence is divided into peptide fragment sequences, the corresponding protein sequence is divided into peptide fragments centered on a lysine and containing n amino acid residues upstream and n downstream; for segmented peptide fragments with fewer than n residues upstream and/or downstream, the front end and/or tail end of the corresponding fragment is padded with the character 'X'; wherein n is a natural number of 1 or more.
When the deep recurrent neural network model is constructed, its framework is set to consist, in sequence, of an embedding layer, a first bidirectional long short-term memory network layer, a bidirectional gated recurrent unit layer, a second bidirectional long short-term memory network layer, a dropout layer, a flattening layer, a fully connected layer and an output layer; the embedding layer converts the integer indices of the amino acid characters of the input peptide fragment sequence into embedding vectors, and the output of the fully connected layer is taken as the sequence features to be extracted.
In addition, the invention also relates to a lysine propionylation prediction system based on transfer learning, which comprises the following components:
a feature extractor comprising a deep recurrent neural network model trained on known lysine malonylation modification data and then fine-tuned with known lysine propionylation modification data;
a final classifier comprising a support vector machine model that is parameter-optimized and trained with known lysine propionylated sequence features;
the lysine propionylation prediction system predicts the propionylation modification sites of the protein to be analyzed according to the lysine propionylation prediction method and outputs a prediction result.
Further, the transfer learning-based lysine propionylation prediction system further comprises a sequence divider, which divides each protein sequence into peptide fragment sequences centered on a lysine and containing n amino acid residues upstream and n downstream, padding the front end and/or tail end of any segmented fragment with fewer than n residues upstream and/or downstream with the character 'X'; wherein n is a natural number of 1 or more.
The invention inputs known lysine malonylation modification data directly into a deep recurrent neural network model: the model is first trained with the malonylation data and then fine-tuned with known lysine propionylation modification data. The output of the second-to-last layer (the fully connected layer) of the trained and fine-tuned model is regarded as the propionylation sequence features, so the trained model can serve as a feature extractor. The invention then uses this feature extractor to extract the sequence features of known lysine propionylated proteins, optimizes the parameters of the support vector machine (window size and hyperparameters) with these features, and trains the support vector machine; the trained support vector machine serves as the final classifier to predict lysine propionylation in unknown protein sequences. By means of transfer learning, the invention solves the problem that existing propionylation data samples are too few to train a deep learning model well, and can quickly and effectively predict lysine propionylation modification sites.
Description of the drawings:
fig. 1 is an exemplary flow diagram of the transfer learning-based lysine propionylation prediction method.
FIG. 2 is a diagram showing an example of a protein sequence divided into peptide fragments.
Fig. 3 is a framework diagram of a deep Recurrent Neural Network (RNN) model constructed in the embodiment.
Fig. 4(a) is a graph comparing the 10-fold cross-validation performance of the prediction method of the embodiment with that of PropPred, and fig. 4(b) is a graph comparing their performance in independent tests.
Detailed Description
To facilitate understanding of those skilled in the art, the present invention will be further described with reference to specific embodiments and drawings, which are not intended to limit the present invention.
Fig. 1 shows a specific implementation flow of the transfer learning-based lysine propionylation prediction method in the following examples.
First, forming the data samples.
1. 192 proteins containing 413 propionyl lysine sites were downloaded from the PLMD database, 18 propionylated proteins were retrieved from the Uniprot database, and after combining the two protein datasets and deleting duplicate proteins, a total of 207 unique proteins were obtained.
2. Sequence similarity clustering was performed with the sequence clustering software CD-HIT, with the sequence identity threshold set to 0.7; 189 proteins were obtained as experimental data, with the sequence similarity between any two below 0.7.
3. 4/5 of the 189 proteins (151) were randomly selected as positive training samples, containing 304 sites; the remaining 38 proteins were used as test samples, containing 104 positive sites. Because lysine sites greatly outnumber propionylated lysine sites, the positive and negative samples are unbalanced; lysine sites that do not undergo PTM were therefore randomly selected as negative samples so that the ratio of positive to negative samples is 1:1. The training set consisted of 304 positive and 304 negative lysine sites, while the test set consisted of 104 positive and 104 negative lysine sites.
4. 3429 malonylated proteins containing 9584 malonylation sites were downloaded, and the same number of lysine sites that did not undergo malonylation were randomly selected as negative samples. The malonylation data set thus contained 9584 malonylation sites and 9584 non-malonylated lysine sites.
5. Sequence division. As shown in fig. 2, each protein sequence was divided into peptide fragments centered on a lysine with n amino acid residues upstream and n downstream; peptide fragments with fewer than n residues upstream and/or downstream are padded at the front and/or tail with the character "X". Each peptide fragment is thus a fixed-size window of 2n+1 residues. The propionylation and malonylation data sets yielded 816 and 19168 peptide fragments, respectively.
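This segmentation rule can be sketched in a few lines of Python (a minimal illustration; the function name and the default n = 14, which gives the 29-residue window chosen later, are ours, not the patent's):

```python
def segment_protein(sequence: str, n: int = 14) -> list:
    """Cut a protein into peptides of length 2*n + 1, each centered on a
    lysine (K), padding with 'X' where fewer than n residues of context
    exist upstream or downstream."""
    peptides = []
    for i, residue in enumerate(sequence):
        if residue != "K":
            continue
        upstream = sequence[max(0, i - n):i]
        downstream = sequence[i + 1:i + 1 + n]
        peptides.append("X" * (n - len(upstream)) + upstream
                        + "K"
                        + downstream + "X" * (n - len(downstream)))
    return peptides

# Example with a short window (n = 3): the first and last K get 'X' pads.
print(segment_protein("MKTAYIAKQRQISFVK", n=3))
# ['XXMKTAY', 'YIAKQRQ', 'SFVKXXX']
```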
Second, constructing and training the model.
6. Construct a deep recurrent neural network (RNN) model, then train and fine-tune it. The constructed model is mainly composed of an embedding layer, two bidirectional long short-term memory (LSTM) layers, a bidirectional gated recurrent unit (GRU) layer, a dropout layer, a flattening layer, a fully connected layer and an output layer, as shown in fig. 3. For convenience, the bidirectional LSTM layer located higher in the figure is referred to as the "first bidirectional LSTM layer", and the one located lower as the "second bidirectional LSTM layer".
In the deep Recurrent Neural Network (RNN) model shown in fig. 3:
(1) The embedding layer is a bridge from text to numeric vectors, converting the integer indices of amino acid characters into embedding vectors.
(2) The long short-term memory (LSTM) network is a variant of the recurrent neural network (RNN). A recurrent neural network shares its weights across time steps, and its output at the current step depends not only on the current input but also on the output of the previous step. Because a plain recurrent neural network cannot remember information from earlier inputs over long spans, the long short-term memory network was designed. It includes three gates: a forget gate, which selectively forgets past information; an input gate, which memorizes selected current information; and an output gate. All three gates use sigmoid activation functions whose outputs range from 0 to 1, where an output of 0 passes no information and an output of 1 passes all information. In addition, the LSTM includes a candidate memory cell that fuses the current memory with the past memory.
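For reference, the standard formulation of these gates (supplied here for clarity; the patent does not give the formulas explicitly; $\sigma$ is the sigmoid function, $\odot$ the element-wise product, $x_t$ the current input, $h_{t-1}$ the previous hidden state) is:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory cell)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$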
(3) The gated recurrent unit (GRU) is a variant of the long short-term memory network. Compared with the LSTM, the GRU contains only two gates, a reset gate and an update gate, and no separate candidate memory cell: the reset gate determines which past information to forget, and the update gate removes some past information and adds some new information. The GRU performs fewer operations than the LSTM, so it computes faster.
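Correspondingly, the standard GRU update (again supplied here for reference, not taken from the patent) is:

$$
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

With only two gates and no separate cell state, the GRU has fewer parameters per step, which is the source of the speed advantage noted above.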
(4) The dropout layer prevents the neural network from overfitting: during training, some neurons are dropped with a certain probability (their weights are not updated), while all neurons are used during testing.
(5) The flattening layer is a bridge between the long short-term memory network layer and the fully connected layer; its purpose is simply to reshape the input so that it can be connected to the subsequent fully connected layer. In a multilayer perceptron, the fully connected layer corresponds to a hidden layer. The number of neurons in the output layer determines the number of class labels.
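As a concrete illustration, the architecture of fig. 3 could be assembled in Keras roughly as follows (a sketch under our own assumptions: the patent does not specify layer widths, the embedding dimension or the dropout rate, so all sizes here are illustrative; the 29-residue window and the 21-letter alphabet of 20 amino acids plus the pad character 'X' follow the embodiment):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM, GRU,
                                     Dropout, Flatten, Dense)

WINDOW = 29   # 2 * n + 1 with n = 14
VOCAB = 21    # 20 amino acids + padding character 'X'

model = Sequential([
    Embedding(input_dim=VOCAB, output_dim=32, input_length=WINDOW),
    Bidirectional(LSTM(64, return_sequences=True)),  # first Bi-LSTM layer
    Bidirectional(GRU(64, return_sequences=True)),   # Bi-GRU layer
    Bidirectional(LSTM(64, return_sequences=True)),  # second Bi-LSTM layer
    Dropout(0.5),                                    # dropout layer
    Flatten(),                                       # flattening layer
    Dense(128, activation="relu"),                   # fully connected layer
    Dense(1, activation="sigmoid"),                  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```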
After the deep recurrent neural network (RNN) model is constructed, the malonylation data set containing 19168 peptide fragments is input into it for training. The sample data of the training set (the positive and negative sample sets, containing 304 positive propionyl-lysine sites and 304 negative lysine sites) are then fed into the trained deep RNN model for fine-tuning. Note that fine-tuning here can be understood as a second training; it is called fine-tuning because the training-set sample introduced into the model in this second pass is much smaller than the malonylation data set input the first time. The trained and fine-tuned deep RNN model subsequently serves as the feature extractor.
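The two training passes could then look like this (a sketch; X_mal/y_mal and X_prop_train/y_prop_train are assumed to be the integer-encoded malonylation and propionylation peptides with their labels, and the epoch counts and the reduced fine-tuning learning rate are our assumptions, not values from the patent):

```python
from tensorflow.keras.optimizers import Adam

# First pass: train on the large malonylation set (19168 peptides).
model.fit(X_mal, y_mal, epochs=20, batch_size=64, validation_split=0.1)

# Second pass ("fine-tuning"): continue training on the much smaller
# propionylation training set (304 positive + 304 negative sites),
# typically with a lower learning rate so the pretrained weights are
# only gently adjusted.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_prop_train, y_prop_train, epochs=10, batch_size=32)
```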
7. Construct a support vector machine model, and perform parameter optimization and training. The support vector machine is a statistical learning algorithm. Take binary classification with $n$ training samples $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$, where $y_i \in \{+1, -1\}$, as an example. The support vector machine seeks a hyperplane $f(x) = w \cdot x + b$ that separates the samples labeled $+1$ from those labeled $-1$; that is, positive samples satisfy $f(x) = w \cdot x + b > 0$ and negative samples satisfy $f(x) = w \cdot x + b < 0$. In practice, many hyperplanes meet this requirement, and the support vector machine finds the one that maximizes the separation margin. This is modeled as the minimization problem

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2$$

subject to the constraints

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, 3, \ldots, n.$$

In the real world, training samples cannot always be completely separated by any hyperplane, i.e., some samples fall on the wrong side. To solve this problem, the support vector machine introduces slack variables $\xi_i$ and rewrites the objective function as

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i,$$

where $C$, called the penalty factor, is a user-specified hyperparameter, and the constraints are rewritten as

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, 3, \ldots, n.$$

The objective function is composed of a structural risk term and an empirical risk term, and the penalty factor controls the balance between the two. A further advantage of the support vector machine is the kernel trick: when samples cannot be separated in the low-dimensional space but can be separated in a high-dimensional space, the kernel function first maps the inseparable samples from the low-dimensional space into the high-dimensional space, where a separating hyperplane

$$f(x) = w \cdot \Phi(x) + b$$

is sought, where $\Phi(x)$ is the feature map induced by the kernel function. The corresponding constraints are updated to

$$y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, 3, \ldots, n.$$

The support vector machine can be solved through duality theory and Lagrangian optimization.
After the support vector machine model is constructed, the propionylation training set data obtained in step 3 are input into the trained and fine-tuned deep recurrent neural network (RNN) model, which acts as the feature extractor to extract sequence features, and the extracted sequence features are input into the support vector machine for parameter optimization and training. In this embodiment, 10-fold cross-validation on the training set is used to find a good window size. The performance at various window sizes is listed in Table 1 below; the statistics show that a window size of 29 yields the best cross-validation performance, so the window size was set to 29 in subsequent experiments.
[Table 1: 10-fold cross-validation performance at various window sizes]
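Taking the output of the fully connected (second-to-last) layer as the sequence features can be sketched as follows, assuming the Sequential model from the earlier sketch:

```python
from tensorflow.keras import Model

# The second-to-last layer of the sketched model is the fully connected
# layer; its activations serve as the extracted sequence features.
extractor = Model(inputs=model.input, outputs=model.layers[-2].output)

features_train = extractor.predict(X_prop_train)  # fed to the SVM below
```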
In addition, the hyperparameters of the support vector machine classifier were optimized according to the statistics in Table 2 below; the optimum is C = 1, kernel = rbf, and gamma = scale or auto.
[Table 2: hyperparameter optimization results for the support vector machine classifier]
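A sketch of this hyperparameter search with scikit-learn, consistent with the reported optimum (C = 1, RBF kernel, gamma 'scale' or 'auto'); the candidate grid itself is our assumption, since the patent reports only the selected values:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(features_train, y_prop_train)
print(search.best_params_)  # expected: {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
```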
Third, verification and testing.
8. The test set obtained in step 3 is input into the trained and fine-tuned deep recurrent neural network (RNN) model, which serves as the feature extractor to extract propionylation sequence features.
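The final classification step can be sketched the same way (X_prop_test/y_prop_test are the assumed integer-encoded test peptides and their labels; with scikit-learn's default refit=True, the best estimator from the grid search is already fitted on the full training features):

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

features_test = extractor.predict(X_prop_test)         # step 8: extract features
pred = search.best_estimator_.predict(features_test)   # final SVM classifier

print("ACC:", accuracy_score(y_prop_test, pred))
print("MCC:", matthews_corrcoef(y_prop_test, pred))
```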
In the prior art there are two main computational methods for propionylation prediction, PropPred and PropSeek; PropSeek uses different training and test sets and is inferior to the method of this embodiment in terms of SN and MCC, so PropPred is chosen for comparative testing against the method of this embodiment (hereinafter simply "the method"). PropPred uses 250 optimal features and a window size of 25 residues. Table 3 below lists the performance of PropPred under 10-fold cross-validation on the training set and in independent testing on the test set.
[Table 3: performance of PropPred under 10-fold cross-validation on the training set and in independent testing on the test set]
The 10-fold cross-validation results of the method (curve 1) and PropPred (curve 2) are shown in fig. 4(a): although the curve of the method is slightly lower than PropPred's in its later segment, it is significantly better than PropPred in the early segment closest to the upper-left corner. Fig. 4(b) shows the independent-test results of the method (curve 1), PropPred (curve 2) and the plain deep recurrent neural network (RNN) method (curve 3); in independent testing, the method is significantly superior to both PropPred and the deep RNN method.
In the above embodiment, statistical comparison of the protein data sets found that 600 (40.8%) of the 1471 known propionylation sites overlap with malonylation sites, and that malonylation sites far outnumber propionylation sites. The deep recurrent neural network model is therefore first trained on the malonylation data samples and then fine-tuned (which can also be understood as a second training) with the propionylation data samples; this transfer-learning approach solves the prior-art problem that propionylation data samples are too few to train a deep learning model well. Combined with the verification and test results, the prediction method provided in this embodiment meets the requirement of quickly and effectively predicting propionylated lysine modification sites.
Based on the lysine propionylation prediction method in the above embodiment, a transfer learning-based lysine propionylation prediction system is also provided, which includes a feature extractor and a final classifier, wherein the feature extractor includes the deep recurrent neural network model trained with known lysine malonylation modification data and then fine-tuned with known lysine propionylation modification data, and the final classifier includes the support vector machine model parameter-optimized and trained with known lysine propionylated sequence features. The lysine propionylation prediction system predicts the propionylation modification sites of a protein to be analyzed (an unknown protein) according to the prediction method in the above example and outputs the prediction result. Of course, a sequence divider can also be added to the lysine propionylation prediction system, which automatically divides each protein sequence into peptide fragment sequences centered on a lysine and containing n amino acid residues upstream and n downstream, padding the front end and/or tail end of any segmented fragment with fewer than n residues upstream and/or downstream with the character 'X'; wherein n is a natural number of 1 or more. Those skilled in the art will understand that the above lysine propionylation prediction system can be packaged in a portable storage medium to run, or stored in the cloud to run online; the lysine propionylation prediction process may be executed by a computer capable of running the prediction system, or by a server located in the cloud.
The above embodiments are preferred implementations of the present invention, and the present invention can be implemented in other ways without departing from the spirit of the present invention.
Finally, it should be emphasized that some descriptions of the present invention have been simplified to facilitate understanding of the improvements over the prior art, and other elements have been omitted for clarity; those of ordinary skill in the art will recognize that such omitted elements may also constitute subject matter of the present invention.

Claims (7)

1. A lysine propionylation prediction method based on transfer learning, characterized by comprising the following steps:
1) constructing a deep recurrent neural network model, wherein the framework of the deep recurrent neural network model is set to consist, in sequence, of an embedding layer, a first bidirectional long short-term memory network layer, a bidirectional gated recurrent unit layer, a second bidirectional long short-term memory network layer, a dropout layer, a flattening layer, a fully connected layer and an output layer;
2) training the deep recurrent neural network model, namely segmenting known lysine malonylated proteins into peptide fragment sequences to form lysine malonylation modification data containing corresponding positive and negative sample sets, and inputting the lysine malonylation modification data into the deep recurrent neural network model for training;
3) fine-tuning the trained deep recurrent neural network model, namely segmenting known lysine propionylated proteins into peptide fragment sequences to form lysine propionylation modification data containing corresponding positive and negative sample sets, and inputting the lysine propionylation modification data into the trained deep recurrent neural network model for fine-tuning;
4) using the deep recurrent neural network model, trained with the known lysine malonylation modification data and then fine-tuned with the known lysine propionylation modification data, as a feature extractor;
5) using a support vector machine model, parameter-optimized and trained with known lysine propionylated protein sequence features, as the final classifier;
6) extracting the target sequence features of the protein to be analyzed with the feature extractor, inputting the extracted target sequence features into the final classifier, predicting the propionylation modification sites and outputting the prediction result.
2. The transfer learning-based lysine propionylation prediction method according to claim 1, further comprising, before step 5):
constructing a support vector machine; segmenting known lysine propionylated protein sequences into peptide fragment sequences to form positive and negative sample sets; extracting sequence features from the positive and negative sample sets with the feature extractor; optimizing the window size and hyperparameters of the support vector machine using the extracted sequence features; and training the support vector machine model.
3. The transfer learning-based lysine propionylation prediction method according to claim 1, wherein: in step 6), extracting the target sequence features of the protein to be analyzed with the feature extractor means first dividing the protein sequence to be analyzed into peptide fragment sequences and then extracting the target sequence features from the peptide fragment sequences with the feature extractor.
4. The transfer learning-based lysine propionylation prediction method according to claim 2 or 3, wherein: when each protein sequence is divided into peptide fragment sequences, the corresponding protein sequence is divided into peptide fragments centered on a lysine and containing n amino acid residues upstream and n downstream; for segmented peptide fragments with fewer than n residues upstream and/or downstream, the front end and/or tail end of the corresponding fragment is padded with the character 'X'; wherein n is a natural number of 1 or more.
5. The transfer learning-based lysine propionylation prediction method according to claim 3, wherein: the embedding layer converts the integer indices of the amino acid characters of the input peptide fragment sequence into embedding vectors, and the output of the fully connected layer is taken as the sequence features to be extracted.
6. A lysine propionylation prediction system based on transfer learning is characterized by comprising:
a feature extractor comprising a deep recurrent neural network model trained on known lysine malonylation modification data and then fine-tuned with known lysine propionylation modification data;
a final classifier comprising a support vector machine model that is parameter-optimized and trained with known lysine propionylated sequence features;
the lysine propionylation prediction system predicts a propionylated modification site of a protein to be analyzed according to the lysine propionylation prediction method of any one of claims 1 to 5 and outputs the prediction result.
7. The transfer learning-based lysine propionylation prediction system of claim 6, further comprising:
a sequence divider for dividing each protein sequence into peptide fragment sequences centered on a lysine and containing n amino acid residues upstream and n downstream, and for padding the front end and/or tail end of any segmented peptide fragment sequence with fewer than n residues upstream and/or downstream with the character 'X'; wherein n is a natural number of 1 or more.
CN202110289477.6A 2021-03-18 2021-03-18 Lysine propionylation prediction method and system based on transfer learning Active CN112820350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110289477.6A CN112820350B (en) 2021-03-18 2021-03-18 Lysine propionylation prediction method and system based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110289477.6A CN112820350B (en) 2021-03-18 2021-03-18 Lysine propionylation prediction method and system based on transfer learning

Publications (2)

Publication Number Publication Date
CN112820350A (en) 2021-05-18
CN112820350B (en) 2022-08-09

Family

ID=75863406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110289477.6A Active CN112820350B (en) 2021-03-18 2021-03-18 Lysine propionylation prediction method and system based on transfer learning

Country Status (1)

Country Link
CN (1) CN112820350B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936742A (en) * 2021-09-14 2022-01-14 上海中科新生命生物科技有限公司 Peptide spectrum retention time prediction method and system based on mass spectrometry
CN114093427B (en) * 2021-11-12 2023-06-09 杭州电子科技大学 Antiviral peptide prediction method based on deep learning and machine learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3022907A1 (en) * 2016-05-04 2017-11-09 Deep Genomics Incorporated Methods and systems for producing an expanded training set for machine learning using biological sequences
JP2019152535A (en) * 2018-03-02 2019-09-12 学校法人 名城大学 MEASUREMENT METHOD FOR SPECIFICALLY DETECTING PROPANOYL-MODIFIED SITES IN AMYLOID-β PROTEIN
CN111081311A (en) * 2019-12-26 2020-04-28 青岛科技大学 Protein lysine malonylation site prediction method based on deep learning
WO2021026037A1 (en) * 2019-08-02 2021-02-11 Flagship Pioneering Innovations Vi, Llc Machine learning guided polypeptide design

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060728A (en) * 2019-04-10 2019-07-26 浙江科技学院 RNA secondary structure prediction method based on recurrent neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3022907A1 (en) * 2016-05-04 2017-11-09 Deep Genomics Incorporated Methods and systems for producing an expanded training set for machine learning using biological sequences
JP2019152535A (en) * 2018-03-02 2019-09-12 学校法人 名城大学 MEASUREMENT METHOD FOR SPECIFICALLY DETECTING PROPANOYL-MODIFIED SITES IN AMYLOID-β PROTEIN
WO2021026037A1 (en) * 2019-08-02 2021-02-11 Flagship Pioneering Innovations Vi, Llc Machine learning guided polypeptide design
CN111081311A (en) * 2019-12-26 2020-04-28 青岛科技大学 Protein lysine malonylation site prediction method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Object recognition algorithm based on deep convolutional neural networks; Huang Bin et al.; Journal of Computer Applications; 2016-12-10 (No. 12); full text *
Application of deep learning in drug design and discovery; Li Wei et al.; Acta Pharmaceutica Sinica; 2019-04-09 (No. 05); full text *

Also Published As

Publication number Publication date
CN112820350A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN112820350B (en) Lysine propionylation prediction method and system based on transfer learning
CN108897989B (en) Biological event extraction method based on candidate event element attention mechanism
Wei et al. An improved protein structural classes prediction method by incorporating both sequence and structure information
US9053391B2 (en) Supervised and semi-supervised online boosting algorithm in machine learning framework
Busia et al. Next-step conditioned deep convolutional neural networks improve protein secondary structure prediction
CN108108762B (en) Nuclear extreme learning machine for coronary heart disease data and random forest classification method
Kuang et al. Protein backbone angle prediction with machine learning approaches
Kuncheva et al. On the window size for classification in changing environments
Menegaux et al. Continuous embeddings of DNA sequencing reads and application to metagenomics
CN107463802A (en) A kind of Forecasting Methodology of protokaryon protein acetylation sites
Zhang et al. A local boosting algorithm for solving classification problems
CN113420163B (en) Heterogeneous information network knowledge graph completion method and device based on matrix fusion
CN104966105A (en) Robust machine error retrieving method and system
CN110289050A (en) A kind of drug based on figure convolution sum term vector-target interaction prediction method
Lee et al. Protein family classification with neural networks
Juraszek et al. Transition path sampling of protein conformational changes
CN113268612A (en) Heterogeneous information network knowledge graph completion method and device based on mean value fusion
CN113762417B (en) Method for enhancing HLA antigen presentation prediction system based on deep migration
Sha et al. DeepSADPr: A hybrid-learning architecture for serine ADP-ribosylation site prediction
Naik et al. A global-best harmony search based gradient descent learning FLANN (GbHS-GDL-FLANN) for data classification
Ganapathiraju et al. Transmembrane helix prediction using amino acid property features and latent semantic analysis
Zou et al. SVM learning from imbalanced data by GA sampling for protein domain prediction
Hawkins et al. Identifying novel peroxisomal proteins
Wang et al. Protein subcellular localization prediction by combining ProtBert and BiGRU
CN109947945A (en) Word-based vector sum integrates the textstream classification method of SVM

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Li Ang

Inventor after: Chen Min

Inventor after: Tan Yan

Inventor after: Deng Yingwei

Inventor after: Sun Xudong

Inventor before: Li Ang

Inventor before: Chen Min

Inventor before: Tan Yan

Inventor before: Deng Yingwei

GR01 Patent grant
GR01 Patent grant