CN112820350B - Lysine propionylation prediction method and system based on transfer learning


Info

Publication number
CN112820350B
CN112820350B
Authority
CN
China
Prior art keywords
lysine
neural network
recurrent neural
sequence
propionylation
Prior art date
Legal status
Active
Application number
CN202110289477.6A
Other languages
Chinese (zh)
Other versions
CN112820350A (en)
Inventor
Li Ang
Chen Min
Tan Yan
Deng Yingwei
Sun Xudong
Current Assignee
Hunan Institute of Technology
Original Assignee
Hunan Institute of Technology
Priority date
Filing date
Publication date
Application filed by Hunan Institute of Technology filed Critical Hunan Institute of Technology
Priority to CN202110289477.6A priority Critical patent/CN112820350B/en
Publication of CN112820350A publication Critical patent/CN112820350A/en
Application granted granted Critical
Publication of CN112820350B publication Critical patent/CN112820350B/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention relates to a lysine propionylation prediction method and system based on transfer learning, in the technical field of biological information. By means of transfer learning, the invention solves the problem that the existing propionylation sample data are too few to train a deep learning model well, and can quickly and effectively predict lysine propionylation modification.

Description

Lysine propionylation prediction method and system based on transfer learning
Technical Field
The invention relates to the technical field of biological information, in particular to a lysine propionylation prediction method and system based on transfer learning.
Background
Protein propionylation is a novel lysine acylation modification first found on histones in 2007; through acetyltransferases, the propionyl group, a larger acyl moiety, is transferred to lysine from its coenzyme A form. Recent studies have found that some acetyltransferases such as PCAF, p300 and CBP can catalyze propionylation, while SIRT1 and SIRT2 can remove propionylation modifications. Current studies indicate that lysine propionylation plays a regulatory role in metabolic processes and is a marker of active chromatin.
The identification and systematic analysis of modification sites are important for the study of protein post-translational modification (PTM), and the identification of propionylation sites is a key basis for further exploring the function and role of propionylated proteins in pathophysiology. Traditional methods for identifying propionylated substrate proteins include high-throughput mass spectrometry (MS), PTMap combined with protein sequence alignment, and nano liquid chromatography coupled with MS. In recent years, computational identification of PTMs has made remarkable progress, and various prediction algorithms and systems have appeared, such as a method that predicts lysine crotonylation by combining position- and composition-based relative features with a statistical matrix, and a method that predicts prokaryotic lysine acetylation sites by extracting sequence-based physicochemical and evolutionary-information features; however, methods and systems for predicting lysine propionylation remain rare. In addition, most prior-art prediction algorithms that combine sequence feature information with feature screening and optimization are trained only on "small" samples and generalize poorly, meaning that even though they achieve high prediction accuracy on an experimental data set, their actual accuracy is probably much worse.
Disclosure of Invention
One of the purposes of the invention is to provide a lysine propionylation prediction method based on transfer learning, which is used for quickly and effectively predicting lysine propionylation modification.
In order to achieve the purpose, the lysine propionylation prediction method based on the transfer learning adopts the following means:
1) training a deep recurrent neural network model with known lysine malonylation modification data, then fine-tuning it with known lysine propionylation modification data, and using the resulting model as a feature extractor;
2) taking a support vector machine model, parameter-optimized and trained with known lysine propionylated protein sequence features, as the final classifier;
3) extracting the target sequence features of the protein to be analyzed with the feature extractor, inputting the extracted target sequence features into the final classifier, predicting the propionylation modification sites and outputting the prediction result.
Wherein, before step 1), the method further comprises: constructing a deep recurrent neural network model, and inputting known lysine malonylation modification data into the deep recurrent neural network model for training; then inputting known lysine propionylation modification data into the trained deep recurrent neural network model to fine-tune it.
Before step 2), the method further comprises: constructing a support vector machine; segmenting known lysine propionylated protein sequences into peptide fragment sequences to form positive and negative sample sets; extracting sequence features from the positive and negative sample sets with the feature extractor; optimizing the window size and hyperparameters of the support vector machine using the extracted sequence features; and training the support vector machine model.
Further, when the deep recurrent neural network model is trained, known lysine malonylated proteins are first segmented into peptide fragment sequences to form lysine malonylation modification data containing corresponding positive and negative sample sets, and these data are then input into the deep recurrent neural network model for training.
Furthermore, when the trained deep recurrent neural network model is fine-tuned, known lysine propionylated proteins are first segmented into peptide fragment sequences to form lysine propionylation modification data containing corresponding positive and negative sample sets, and these data are then input into the trained deep recurrent neural network model for fine-tuning.
In addition, in step 3), extracting the target sequence features of the protein to be analyzed with the feature extractor means first segmenting the protein sequence to be analyzed into peptide fragment sequences and then extracting the target sequence features from those peptide fragments with the feature extractor.
Wherein, when each protein sequence is divided into peptide fragment sequences, the corresponding protein sequence is divided into peptide fragments centered on a lysine and containing n amino acid residues upstream and n downstream; for segmented peptide fragments with fewer than n residues upstream and/or downstream, the front end and/or tail end of the corresponding fragment is padded with the character 'X'; wherein n is a natural number of 1 or more.
When the deep recurrent neural network model is constructed, its framework is set to consist, in sequence, of an embedding layer, a first bidirectional long short-term memory network layer, a bidirectional gated recurrent unit layer, a second bidirectional long short-term memory network layer, a dropout layer, a flattening layer, a fully connected layer and an output layer; the embedding layer converts the integer indices of the amino acid characters of the input peptide fragment sequence into embedding vectors, and the output of the fully connected layer is taken as the sequence features to be extracted.
In addition, the invention also relates to a lysine propionylation prediction system based on transfer learning, which comprises the following components:
a feature extractor comprising a deep recurrent neural network model trained on known lysine malonylation modification data and then fine-tuned with known lysine propionylation modification data;
a final classifier comprising a support vector machine model that is parameter-optimized and trained with known lysine propionylated sequence features;
the lysine propionylation prediction system predicts the propionylation modification sites of the protein to be analyzed according to the lysine propionylation prediction method and outputs a prediction result.
Further, the transfer learning-based lysine propionylation prediction system further comprises a sequence divider, which divides each protein sequence into peptide fragment sequences centered on a lysine and containing n amino acid residues upstream and n downstream, padding the front end and/or tail end of any segmented fragment with fewer than n residues upstream and/or downstream with the character 'X'; wherein n is a natural number of 1 or more.
The invention inputs known lysine malonylation modification data directly into a deep recurrent neural network model: the model is first trained with the malonylation data and then fine-tuned with known lysine propionylation modification data. The output of the second-to-last layer (the fully connected layer) of the trained and fine-tuned model is regarded as the propionylation sequence features, so the trained model can serve as a feature extractor. The invention then uses this feature extractor to extract the sequence features of known lysine propionylated proteins, optimizes the parameters of the support vector machine (window size and hyperparameters) with these features, and trains the support vector machine; the trained support vector machine serves as the final classifier to predict lysine propionylation in unknown protein sequences. By means of transfer learning, the invention solves the problem that existing propionylation data samples are too few to train a deep learning model well, and can quickly and effectively predict lysine propionylation modification sites.
Description of the drawings:
fig. 1 is an exemplary flow diagram of the transfer learning-based lysine propionylation prediction method.
FIG. 2 is a diagram showing an example of a protein sequence divided into peptide fragments.
Fig. 3 is a framework diagram of a deep Recurrent Neural Network (RNN) model constructed in the embodiment.
Fig. 4(a) is a graph comparing the 10-fold cross-validation performance of the prediction method of the embodiment with that of PropPred, and fig. 4(b) is a graph comparing their performance in independent tests.
Detailed Description
To facilitate understanding of those skilled in the art, the present invention will be further described with reference to specific embodiments and drawings, which are not intended to limit the present invention.
Fig. 1 shows a specific implementation flow of the transfer learning-based lysine propionylation prediction method in the following examples.
First, forming the data samples.
1. 192 proteins containing 413 propionyl lysine sites were downloaded from the PLMD database, 18 propionylated proteins were retrieved from the Uniprot database, and after combining the two protein datasets and deleting duplicate proteins, a total of 207 unique proteins were obtained.
2. Sequence similarity clustering was performed with the sequence clustering software CD-HIT, with the sequence identity threshold set to 0.7; 189 proteins were obtained as experimental data, with the sequence similarity between any two below 0.7.
3. 4/5 of the 189 proteins (151) were randomly selected as positive training samples, containing 304 sites; the remaining 38 proteins were used as test samples, containing 104 positive sites. Because lysine sites greatly outnumber propionylated lysine sites, the positive and negative samples are unbalanced; lysine sites that do not undergo PTM were therefore randomly selected as negative samples so that the ratio of positive to negative samples is 1:1. The training set consisted of 304 positive and 304 negative lysine sites, while the test set consisted of 104 positive and 104 negative lysine sites.
4. 3429 malonylated proteins containing 9584 malonylation sites were downloaded, and the same number of lysine sites that did not undergo malonylation were randomly selected as negative samples. The malonylation data set thus contained 9584 malonylation sites and 9584 non-malonylated lysine sites.
5. Sequence division. As shown in fig. 2, each protein sequence was divided into peptide fragments centered on a lysine with n amino acid residues upstream and n downstream; peptide fragments with fewer than n residues upstream and/or downstream are padded at the front and/or tail with the character "X". Each peptide fragment is thus a fixed-size window of 2n+1 residues. The propionylation and malonylation data sets yielded 816 and 19168 peptide fragments, respectively.
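This segmentation rule can be sketched in a few lines of Python (a minimal illustration; the function name and the default n = 14, which gives the 29-residue window chosen later, are ours, not the patent's):

```python
def segment_protein(sequence: str, n: int = 14) -> list:
    """Cut a protein into peptides of length 2*n + 1, each centered on a
    lysine (K), padding with 'X' where fewer than n residues of context
    exist upstream or downstream."""
    peptides = []
    for i, residue in enumerate(sequence):
        if residue != "K":
            continue
        upstream = sequence[max(0, i - n):i]
        downstream = sequence[i + 1:i + 1 + n]
        peptides.append("X" * (n - len(upstream)) + upstream
                        + "K"
                        + downstream + "X" * (n - len(downstream)))
    return peptides

# Example with a short window (n = 3): the first and last K get 'X' pads.
print(segment_protein("MKTAYIAKQRQISFVK", n=3))
# ['XXMKTAY', 'YIAKQRQ', 'SFVKXXX']
```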
Second, constructing and training the model.
6. Construct a deep recurrent neural network (RNN) model, then train and fine-tune it. The constructed model is mainly composed of an embedding layer, two bidirectional long short-term memory (LSTM) layers, a bidirectional gated recurrent unit (GRU) layer, a dropout layer, a flattening layer, a fully connected layer and an output layer, as shown in fig. 3. For convenience, the bidirectional LSTM layer located higher in the figure is referred to as the "first bidirectional LSTM layer", and the one located lower as the "second bidirectional LSTM layer".
In the deep Recurrent Neural Network (RNN) model shown in fig. 3:
(1) The embedding layer is a bridge from text to numeric vectors, converting the integer indices of amino acid characters into embedding vectors.
(2) The long short-term memory (LSTM) network is a variant of the recurrent neural network (RNN). A recurrent neural network shares its weights across time steps, and its output at the current step depends not only on the current input but also on the output of the previous step. Because a plain recurrent neural network cannot remember information from earlier inputs over long spans, the long short-term memory network was designed. It includes three gates: a forget gate, which selectively forgets past information; an input gate, which memorizes selected current information; and an output gate. All three gates use sigmoid activation functions whose outputs range from 0 to 1, where an output of 0 passes no information and an output of 1 passes all information. In addition, the LSTM includes a candidate memory cell that fuses the current memory with the past memory.
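For reference, the standard formulation of these gates (supplied here for clarity; the patent does not give the formulas explicitly; $\sigma$ is the sigmoid function, $\odot$ the element-wise product, $x_t$ the current input, $h_{t-1}$ the previous hidden state) is:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory cell)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$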
(3) The gated recurrent unit (GRU) is a variant of the long short-term memory network. Compared with the LSTM, the GRU contains only two gates, a reset gate and an update gate, and no separate candidate memory cell: the reset gate determines which past information to forget, and the update gate removes some past information and adds some new information. The GRU performs fewer operations than the LSTM, so it computes faster.
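Correspondingly, the standard GRU update (again supplied here for reference, not taken from the patent) is:

$$
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

With only two gates and no separate cell state, the GRU has fewer parameters per step, which is the source of the speed advantage noted above.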
(4) The dropout layer prevents the neural network from overfitting: during training, some neurons are dropped with a certain probability (their weights are not updated), while all neurons are used during testing.
(5) The flattening layer is a bridge between the long short-term memory network layer and the fully connected layer; its purpose is simply to reshape the input so that it can be connected to the subsequent fully connected layer. In a multilayer perceptron, the fully connected layer corresponds to a hidden layer. The number of neurons in the output layer determines the number of class labels.
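As a concrete illustration, the architecture of fig. 3 could be assembled in Keras roughly as follows (a sketch under our own assumptions: the patent does not specify layer widths, the embedding dimension or the dropout rate, so all sizes here are illustrative; the 29-residue window and the 21-letter alphabet of 20 amino acids plus the pad character 'X' follow the embodiment):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM, GRU,
                                     Dropout, Flatten, Dense)

WINDOW = 29   # 2 * n + 1 with n = 14
VOCAB = 21    # 20 amino acids + padding character 'X'

model = Sequential([
    Embedding(input_dim=VOCAB, output_dim=32, input_length=WINDOW),
    Bidirectional(LSTM(64, return_sequences=True)),  # first Bi-LSTM layer
    Bidirectional(GRU(64, return_sequences=True)),   # Bi-GRU layer
    Bidirectional(LSTM(64, return_sequences=True)),  # second Bi-LSTM layer
    Dropout(0.5),                                    # dropout layer
    Flatten(),                                       # flattening layer
    Dense(128, activation="relu"),                   # fully connected layer
    Dense(1, activation="sigmoid"),                  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```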
After the deep recurrent neural network (RNN) model is constructed, the malonylation data set containing 19168 peptide fragments is input into it for training. The sample data of the training set (the positive and negative sample sets, containing 304 positive propionyl-lysine sites and 304 negative lysine sites) are then fed into the trained deep RNN model for fine-tuning. Note that fine-tuning here can be understood as a second training; it is called fine-tuning because the training-set sample introduced into the model in this second pass is much smaller than the malonylation data set input the first time. The trained and fine-tuned deep RNN model subsequently serves as the feature extractor.
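The two training passes could then look like this (a sketch; X_mal/y_mal and X_prop_train/y_prop_train are assumed to be the integer-encoded malonylation and propionylation peptides with their labels, and the epoch counts and the reduced fine-tuning learning rate are our assumptions, not values from the patent):

```python
from tensorflow.keras.optimizers import Adam

# First pass: train on the large malonylation set (19168 peptides).
model.fit(X_mal, y_mal, epochs=20, batch_size=64, validation_split=0.1)

# Second pass ("fine-tuning"): continue training on the much smaller
# propionylation training set (304 positive + 304 negative sites),
# typically with a lower learning rate so the pretrained weights are
# only gently adjusted.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_prop_train, y_prop_train, epochs=10, batch_size=32)
```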
7. Construct a support vector machine model, and perform parameter optimization and training. The support vector machine is a statistical learning algorithm. Take binary classification with $n$ training samples $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$, where $y_i \in \{+1, -1\}$, as an example. The support vector machine seeks a hyperplane $f(x) = w \cdot x + b$ that separates the samples labeled $+1$ from those labeled $-1$; that is, positive samples satisfy $f(x) = w \cdot x + b > 0$ and negative samples satisfy $f(x) = w \cdot x + b < 0$. In practice, many hyperplanes meet this requirement, and the support vector machine finds the one that maximizes the separation margin. This is modeled as the minimization problem

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2$$

subject to the constraints

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, 3, \ldots, n.$$

In the real world, training samples cannot always be completely separated by any hyperplane, i.e., some samples fall on the wrong side. To solve this problem, the support vector machine introduces slack variables $\xi_i$ and rewrites the objective function as

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i,$$

where $C$, called the penalty factor, is a user-specified hyperparameter, and the constraints are rewritten as

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, 3, \ldots, n.$$

The objective function is composed of a structural risk term and an empirical risk term, and the penalty factor controls the balance between the two. A further advantage of the support vector machine is the kernel trick: when samples cannot be separated in the low-dimensional space but can be separated in a high-dimensional space, the kernel function first maps the inseparable samples from the low-dimensional space into the high-dimensional space, where a separating hyperplane

$$f(x) = w \cdot \Phi(x) + b$$

is sought, where $\Phi(x)$ is the feature map induced by the kernel function. The corresponding constraints are updated to

$$y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, 3, \ldots, n.$$

The support vector machine can be solved through duality theory and Lagrangian optimization.
After the support vector machine model is constructed, the propionylation training set data obtained in step 3 are input into the trained and fine-tuned deep recurrent neural network (RNN) model, which acts as the feature extractor to extract sequence features, and the extracted sequence features are input into the support vector machine for parameter optimization and training. In this embodiment, 10-fold cross-validation on the training set is used to find a good window size. The performance at various window sizes is listed in Table 1 below; the statistics show that a window size of 29 yields the best cross-validation performance, so the window size was set to 29 in subsequent experiments.
[Table 1: 10-fold cross-validation performance at various window sizes]
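Taking the output of the fully connected (second-to-last) layer as the sequence features can be sketched as follows, assuming the Sequential model from the earlier sketch:

```python
from tensorflow.keras import Model

# The second-to-last layer of the sketched model is the fully connected
# layer; its activations serve as the extracted sequence features.
extractor = Model(inputs=model.input, outputs=model.layers[-2].output)

features_train = extractor.predict(X_prop_train)  # fed to the SVM below
```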
In addition, the hyperparameters of the support vector machine classifier were optimized according to the statistics in Table 2 below; the optimum is C = 1, kernel = rbf, and gamma = scale or auto.
[Table 2: hyperparameter optimization results for the support vector machine classifier]
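A sketch of this hyperparameter search with scikit-learn, consistent with the reported optimum (C = 1, RBF kernel, gamma 'scale' or 'auto'); the candidate grid itself is our assumption, since the patent reports only the selected values:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(features_train, y_prop_train)
print(search.best_params_)  # expected: {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
```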
Third, verification and testing.
8. The test set obtained in step 3 is input into the trained and fine-tuned deep recurrent neural network (RNN) model, which serves as the feature extractor to extract propionylation sequence features.
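The final classification step can be sketched the same way (X_prop_test/y_prop_test are the assumed integer-encoded test peptides and their labels; with scikit-learn's default refit=True, the best estimator from the grid search is already fitted on the full training features):

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

features_test = extractor.predict(X_prop_test)         # step 8: extract features
pred = search.best_estimator_.predict(features_test)   # final SVM classifier

print("ACC:", accuracy_score(y_prop_test, pred))
print("MCC:", matthews_corrcoef(y_prop_test, pred))
```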
In the prior art there are two main computational methods for propionylation prediction, PropPred and PropSeek; PropSeek uses different training and test sets and is inferior to the method of this embodiment in terms of SN and MCC, so PropPred is chosen for comparative testing against the method of this embodiment (hereinafter simply "the method"). PropPred uses 250 optimal features and a window size of 25 residues. Table 3 below lists the performance of PropPred under 10-fold cross-validation on the training set and in independent testing on the test set.
[Table 3: performance of PropPred under 10-fold cross-validation on the training set and in independent testing on the test set]
The 10-fold cross-validation results of the method (curve 1) and PropPred (curve 2) are shown in fig. 4(a): although the curve of the method is slightly lower than PropPred's in its later segment, it is significantly better than PropPred in the early segment closest to the upper-left corner. Fig. 4(b) shows the independent-test results of the method (curve 1), PropPred (curve 2) and the plain deep recurrent neural network (RNN) method (curve 3); in independent testing, the method is significantly superior to both PropPred and the deep RNN method.
In the above embodiment, statistical comparison of the protein data sets found that 600 (40.8%) of the 1471 known propionylation sites overlap with malonylation sites, and that malonylation sites far outnumber propionylation sites. The deep recurrent neural network model is therefore first trained on the malonylation data samples and then fine-tuned (which can also be understood as a second training) with the propionylation data samples; this transfer-learning approach solves the prior-art problem that propionylation data samples are too few to train a deep learning model well. Combined with the verification and test results, the prediction method provided in this embodiment meets the requirement of quickly and effectively predicting propionylated lysine modification sites.
Based on the lysine propionylation prediction method in the above embodiment, a transfer learning-based lysine propionylation prediction system is also provided, which includes a feature extractor and a final classifier, wherein the feature extractor includes the deep recurrent neural network model trained with known lysine malonylation modification data and then fine-tuned with known lysine propionylation modification data, and the final classifier includes the support vector machine model parameter-optimized and trained with known lysine propionylated sequence features. The lysine propionylation prediction system predicts the propionylation modification sites of a protein to be analyzed (an unknown protein) according to the prediction method in the above example and outputs the prediction result. Of course, a sequence divider can also be added to the lysine propionylation prediction system, which automatically divides each protein sequence into peptide fragment sequences centered on a lysine and containing n amino acid residues upstream and n downstream, padding the front end and/or tail end of any segmented fragment with fewer than n residues upstream and/or downstream with the character 'X'; wherein n is a natural number of 1 or more. Those skilled in the art will understand that the above lysine propionylation prediction system can be packaged in a portable storage medium to run, or stored in the cloud to run online; the lysine propionylation prediction process may be executed by a computer capable of running the prediction system, or by a server located in the cloud.
The above embodiments are preferred implementations of the present invention, and the present invention can be implemented in other ways without departing from the spirit of the present invention.
Finally, it should be emphasized that some descriptions of the present invention have been simplified to facilitate understanding of the improvements over the prior art, and other elements have been omitted for clarity; those of ordinary skill in the art will recognize that such omitted elements may also constitute subject matter of the present invention.

Claims (7)

1. A lysine propionylation prediction method based on transfer learning, characterized by comprising the following steps:
1) constructing a deep recurrent neural network model, wherein the framework of the deep recurrent neural network model is set to consist, in sequence, of an embedding layer, a first bidirectional long short-term memory network layer, a bidirectional gated recurrent unit layer, a second bidirectional long short-term memory network layer, a dropout layer, a flattening layer, a fully connected layer and an output layer;
2) training the deep recurrent neural network model, namely segmenting known lysine malonylated proteins into peptide fragment sequences to form lysine malonylation modification data containing corresponding positive and negative sample sets, and inputting the lysine malonylation modification data into the deep recurrent neural network model for training;
3) fine-tuning the trained deep recurrent neural network model, namely segmenting known lysine propionylated proteins into peptide fragment sequences to form lysine propionylation modification data containing corresponding positive and negative sample sets, and inputting the lysine propionylation modification data into the trained deep recurrent neural network model for fine-tuning;
4) using the deep recurrent neural network model, trained with the known lysine malonylation modification data and then fine-tuned with the known lysine propionylation modification data, as a feature extractor;
5) using a support vector machine model, parameter-optimized and trained with known lysine propionylated protein sequence features, as the final classifier;
6) extracting the target sequence features of the protein to be analyzed with the feature extractor, inputting the extracted target sequence features into the final classifier, predicting the propionylation modification sites and outputting the prediction result.
2. The transfer learning-based lysine propionylation prediction method according to claim 1, further comprising, before step 5):
constructing a support vector machine; segmenting known lysine propionylated protein sequences into peptide fragment sequences to form positive and negative sample sets; extracting sequence features from the positive and negative sample sets with the feature extractor; optimizing the window size and hyperparameters of the support vector machine using the extracted sequence features; and training the support vector machine model.
3. The transfer learning-based lysine propionylation prediction method according to claim 1, wherein: in step 6), extracting the target sequence features of the protein to be analyzed with the feature extractor means first dividing the protein sequence to be analyzed into peptide fragment sequences and then extracting the target sequence features from the peptide fragment sequences with the feature extractor.
4. The transfer learning-based lysine propionylation prediction method according to claim 2 or 3, wherein: when each protein sequence is divided into peptide fragment sequences, the corresponding protein sequence is divided into peptide fragments centered on a lysine and containing n amino acid residues upstream and n downstream; for segmented peptide fragments with fewer than n residues upstream and/or downstream, the front end and/or tail end of the corresponding fragment is padded with the character 'X'; wherein n is a natural number of 1 or more.
5. The transfer learning-based lysine propionylation prediction method according to claim 3, wherein: the embedding layer converts the integer indices of the amino acid characters of the input peptide fragment sequence into embedding vectors, and the output of the fully connected layer is taken as the sequence features to be extracted.
6. A lysine propionylation prediction system based on transfer learning is characterized by comprising:
a feature extractor comprising a deep recurrent neural network model trained on known lysine malonylation modification data and then fine-tuned with known lysine propionylation modification data;
a final classifier comprising a support vector machine model that is parameter-optimized and trained with known lysine propionylated sequence features;
the lysine propionylation prediction system predicts a propionylated modification site of a protein to be analyzed according to the lysine propionylation prediction method of any one of claims 1 to 5 and outputs the prediction result.
7. The transfer learning-based lysine propionylation prediction system of claim 6, further comprising:
a sequence divider for dividing each protein sequence into peptide fragment sequences centered on a lysine and containing n amino acid residues upstream and n downstream, and for padding the front end and/or tail end of any segmented peptide fragment sequence with fewer than n residues upstream and/or downstream with the character 'X'; wherein n is a natural number of 1 or more.
CN202110289477.6A 2021-03-18 2021-03-18 Lysine propionylation prediction method and system based on transfer learning Active CN112820350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110289477.6A CN112820350B (en) 2021-03-18 2021-03-18 Lysine propionylation prediction method and system based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110289477.6A CN112820350B (en) 2021-03-18 2021-03-18 Lysine propionylation prediction method and system based on transfer learning

Publications (2)

Publication Number Publication Date
CN112820350A (en) 2021-05-18
CN112820350B (en) 2022-08-09

Family

ID=75863406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110289477.6A Active CN112820350B (en) 2021-03-18 2021-03-18 Lysine propionylation prediction method and system based on transfer learning

Country Status (1)

Country Link
CN (1) CN112820350B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936742A (en) * 2021-09-14 2022-01-14 上海中科新生命生物科技有限公司 Peptide spectrum retention time prediction method and system based on mass spectrometry
CN114093427B (en) * 2021-11-12 2023-06-09 杭州电子科技大学 Antiviral peptide prediction method based on deep learning and machine learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3022907A1 (en) * 2016-05-04 2017-11-09 Deep Genomics Incorporated Methods and systems for producing an expanded training set for machine learning using biological sequences
JP2019152535A (en) * 2018-03-02 2019-09-12 学校法人 名城大学 MEASUREMENT METHOD FOR SPECIFICALLY DETECTING PROPANOYL-MODIFIED SITES IN AMYLOID-β PROTEIN
CN111081311A (en) * 2019-12-26 2020-04-28 青岛科技大学 Protein lysine malonylation site prediction method based on deep learning
WO2021026037A1 (en) * 2019-08-02 2021-02-11 Flagship Pioneering Innovations Vi, Llc Machine learning guided polypeptide design

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060728A (en) * 2019-04-10 2019-07-26 浙江科技学院 RNA secondary structure prediction method based on recurrent neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3022907A1 (en) * 2016-05-04 2017-11-09 Deep Genomics Incorporated Methods and systems for producing an expanded training set for machine learning using biological sequences
JP2019152535A (en) * 2018-03-02 2019-09-12 学校法人 名城大学 MEASUREMENT METHOD FOR SPECIFICALLY DETECTING PROPANOYL-MODIFIED SITES IN AMYLOID-β PROTEIN
WO2021026037A1 (en) * 2019-08-02 2021-02-11 Flagship Pioneering Innovations Vi, Llc Machine learning guided polypeptide design
CN111081311A (en) * 2019-12-26 2020-04-28 青岛科技大学 Protein lysine malonylation site prediction method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Object recognition algorithm based on deep convolutional neural networks; Huang Bin et al.; Journal of Computer Applications; 2016-12-10 (No. 12); full text *
Application of deep learning in drug design and discovery; Li Wei et al.; Acta Pharmaceutica Sinica; 2019-04-09 (No. 05); full text *

Also Published As

Publication number Publication date
CN112820350A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN112820350B (en) Lysine propionylation prediction method and system based on transfer learning
CN108897989B (en) Biological event extraction method based on candidate event element attention mechanism
Wei et al. An improved protein structural classes prediction method by incorporating both sequence and structure information
US9053391B2 (en) Supervised and semi-supervised online boosting algorithm in machine learning framework
Busia et al. Next-step conditioned deep convolutional neural networks improve protein secondary structure prediction
CN108108762B (en) Nuclear extreme learning machine for coronary heart disease data and random forest classification method
Kuang et al. Protein backbone angle prediction with machine learning approaches
Kuncheva et al. On the window size for classification in changing environments
Menegaux et al. Continuous embeddings of DNA sequencing reads and application to metagenomics
CN107463802A (en) A kind of Forecasting Methodology of protokaryon protein acetylation sites
Zhang et al. A local boosting algorithm for solving classification problems
CN113420163B (en) Heterogeneous information network knowledge graph completion method and device based on matrix fusion
CN104966105A (en) Robust machine error retrieving method and system
CN110289050A (en) A kind of drug based on figure convolution sum term vector-target interaction prediction method
Lee et al. Protein family classification with neural networks
Juraszek et al. Transition path sampling of protein conformational changes
CN113268612A (en) Heterogeneous information network knowledge graph completion method and device based on mean value fusion
CN113762417B (en) Method for enhancing HLA antigen presentation prediction system based on deep migration
Sha et al. DeepSADPr: A hybrid-learning architecture for serine ADP-ribosylation site prediction
Naik et al. A global-best harmony search based gradient descent learning FLANN (GbHS-GDL-FLANN) for data classification
Ganapathiraju et al. Transmembrane helix prediction using amino acid property features and latent semantic analysis
Zou et al. SVM learning from imbalanced data by GA sampling for protein domain prediction
Hawkins et al. Identifying novel peroxisomal proteins
Wang et al. Protein subcellular localization prediction by combining ProtBert and BiGRU
CN109947945A (en) Word-based vector sum integrates the textstream classification method of SVM

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Li Ang

Inventor after: Chen Min

Inventor after: Tan Yan

Inventor after: Deng Yingwei

Inventor after: Sun Xudong

Inventor before: Li Ang

Inventor before: Chen Min

Inventor before: Tan Yan

Inventor before: Deng Yingwei

GR01 Patent grant
GR01 Patent grant