CN110488020A - Protein glycation site identification method - Google Patents


Publication number
CN110488020A
Authority
CN
China
Prior art keywords
peptide chain
sample set
artificial
sample
protein glycation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910734943.XA
Other languages
Chinese (zh)
Other versions
CN110488020B (en)
Inventor
杨润涛
陈金桂
张承进
张丽娜
宋勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Application filed by Shandong University
Priority to CN201910734943.XA
Publication of CN110488020A
Application granted
Publication of CN110488020B
Legal status: Active

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Urology & Nephrology (AREA)
  • Immunology (AREA)
  • Biomedical Technology (AREA)
  • Hematology (AREA)
  • Cell Biology (AREA)
  • Physiology (AREA)
  • Microbiology (AREA)
  • Genetics & Genomics (AREA)
  • Food Science & Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Biochemistry (AREA)
  • General Physics & Mathematics (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present application provides a protein glycation site identification method, comprising: collecting protein glycation site data and extracting peptide chains from the data to obtain a peptide chain sample set, each peptide chain being centered on a lysine; encoding each amino acid of the peptide chains with a one-hot vector to obtain a peptide chain training set represented by one-hot vectors; training LSTM RNNs to generate artificial peptide chain samples and constructing an artificial peptide chain sample set; dividing each peptide chain in the peptide chain sample set and the artificial peptide chain sample set into a series of biological words, and constructing the features of each peptide chain from those biological words using ProtVec; and training a CNN to obtain a predictor that identifies protein glycation sites. The protein glycation site identification method provided by the present application reduces the complexity of feature extraction and improves the accuracy of protein glycation site identification.

Description

Protein glycation site identification method
Technical field
The present application relates to the technical field of protein function prediction, and in particular to a protein glycation site identification method.
Background
In 1912, L. C. Maillard discovered and described the glycation reaction for the first time. Glycation is one of the most important post-translational modification (PTM) processes. Through redox or excessive redox reactions of sugars, it generates aldehyde or ketone groups bearing a carbonyl. The carbonyl oxygen is negatively charged, and under high-glucose conditions the carbonyl group can undergo a non-enzymatic glycation reaction with nucleophilic groups in biomolecules such as proteins, DNA and lipids, forming advanced glycation end products (AGEs). Long-term accumulation of AGEs in the human body causes two main cytological effects: (1) intermolecular bonding or cross-linking between extracellular and intracellular proteins, which changes the physiological characteristics of extracellular matrix (ECM) proteins; (2) when AGEs bind to the cell-surface receptor for AGE (RAGE), the start of complex signal transduction pathways that eventually lead to the production of pro-inflammatory mediators and reactive oxygen species. Studies have shown that these molecular-level changes are closely related to the pathogenesis of many diseases, such as diabetes, nephritis, atherosclerosis, cataract and Alzheimer's disease.
Studies have shown that the formation of AGE compounds is mostly related to lysine; that is, protein glycation mostly occurs at lysine residues, as shown in Table 1. Identifying lysine glycation sites can therefore help researchers understand pathogenesis and provides a useful theoretical basis for curing disease.
Table 1: Structures of AGEs
However, determining the function of each site of a protein purely by traditional experimental means is costly and time-consuming, so in recent years various machine learning methods have been used to identify protein glycation sites. In 2006, M. B. Johansen et al. obtained the first dataset of lysine glycation sites by collecting related papers and established a neural-network-based glycation site predictor. Later, Liu Yan et al. developed an improved predictor on the basis of this dataset using the support vector machine algorithm. In 2016, Xu Yan et al. discussed the role of sequence information and position-specific amino acid propensity in glycation site prediction, and then trained the predictor "Gly-PseAAC" on the CPLM database (Compendium of Protein Lysine Modifications). In 2017, Zhao Xiaowei et al. encoded peptides using secondary structure information, k-spaced amino acid pairs and AAindex, applied a new two-step feature selection algorithm to filter redundant features, and constructed a prediction model based on the support vector machine algorithm. In 2018, M. M. Islam et al. trained a predictor named "iProtGly-SS", which extracts features from sequence and secondary structure information and then finds the best feature set using a feature selection algorithm. Although these methods have been developed to identify glycation sites, they still suffer from low accuracy and overly complex feature extraction.
Summary of the invention
The present application provides a protein glycation site identification method for identifying protein glycation sites, which reduces the complexity of feature extraction and improves the accuracy of protein glycation site identification.
The present application provides a protein glycation site identification method, the method comprising:
Collecting protein glycation site data and extracting peptide chains from the protein glycation site data to obtain a peptide chain sample set, where each peptide chain is centered on a lysine and has the form $P = A_{-\eta}A_{-(\eta-1)}\cdots A_{-2}A_{-1}KA_{1}A_{2}\cdots A_{\eta-1}A_{\eta}$, K is lysine, η is the number of amino acids upstream or downstream of the lysine, and A is one of the 20 natural amino acids;
Encoding each amino acid of the peptide chains with a one-hot vector to obtain a peptide chain training set represented by one-hot vectors, where lysine is encoded as 000000000001000000000;
Training LSTM RNNs on the peptide chain training set to generate artificial peptide chain samples, and constructing an artificial peptide chain sample set;
Dividing each peptide chain in the peptide chain sample set and the artificial peptide chain sample set into a series of biological words, and constructing the features of each peptide chain in the peptide chain sample set and the artificial peptide chain sample set from the biological words using ProtVec;
Training a CNN (convolutional neural network) on the features of each peptide chain in the peptide chain sample set and the artificial peptide chain sample set, constructed from the biological words using ProtVec, to obtain a predictor;
Identifying protein glycation sites based on the predictor.
Optionally, in the above protein glycation site identification method, the method further comprises:
When the number of amino acids upstream or downstream of the lysine in a peptide chain is less than η, extending the peptide chain with the symbol X, where the one-hot vector encoding of X is 000000000000000000001.
Optionally, in the above protein glycation site identification method, η = 24.
Optionally, in the above protein glycation site identification method, extracting peptide chains from the protein glycation site dataset to obtain the peptide chain sample set comprises:
Extracting peptide chains from the protein glycation site dataset, and using CD-HIT to select, from the extracted peptide chains, those with similarity below 50% to generate the peptide chain sample set.
Optionally, in the above protein glycation site identification method, constructing artificial peptide chain samples using LSTM RNNs to obtain the artificial peptide chain sample set comprises:
Constructing artificial peptide chain samples using LSTM RNNs, using CD-HIT to select the artificial peptide chain samples with similarity below 50%, and randomly drawing a number of these artificial peptide chain samples to form the artificial peptide chain sample set.
Optionally, in the above protein glycation site identification method, the method further comprises:
Using the artificial peptide chain samples of the artificial peptide chain sample set as positive samples, where the number of samples in the artificial peptide chain sample set plus the number of positive samples in the peptide chain sample set equals the number of negative samples in the peptide chain sample set.
Optionally, in the above protein glycation site identification method, the method further comprises:
Using GlyNN and Gly-PseAAC to judge whether each artificial peptide chain sample with similarity below 50%, as selected by CD-HIT, is a glycated peptide chain;
When both GlyNN and Gly-PseAAC judge an artificial peptide chain sample to be a glycated peptide chain, taking that artificial peptide chain sample as a sample of the artificial peptide chain sample set.
The protein glycation site identification method provided by the present application collects protein glycation site data and extracts peptide chains from the data to obtain a peptide chain sample set; encodes each amino acid of the peptide chains with a one-hot vector to obtain a peptide chain training set represented by one-hot vectors; constructs artificial peptide chain samples from the peptide chain training set using LSTM RNNs to obtain an artificial peptide chain sample set; divides each peptide chain in the peptide chain sample set and the artificial peptide chain sample set into a series of biological words and constructs the features of each peptide chain from the biological words using ProtVec; and trains a CNN to obtain a predictor, which is used to identify protein glycation sites. By constructing an artificial peptide chain sample set, the method addresses a problem shared by all existing glycation site prediction tasks: the number of negative samples (peptide chains whose lysine is not glycated) is far larger than the number of corresponding positive samples (peptide chains whose lysine is glycated). By constructing the features of each peptide chain from biological words using ProtVec, the biological words in a peptide chain are converted to vector representations, so that NLP techniques are used to describe the characteristics of glycation sites. This simplifies the feature extraction process and improves the accuracy of protein glycation site identification.
Brief description of the drawings
To illustrate the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without any creative effort.
Fig. 1 is a flow chart of the protein glycation site identification method provided by the embodiment of the present application;
Fig. 2 is the internal structure of an LSTM unit provided by the embodiment of the present application;
Fig. 3 is an RNN architecture diagram provided by the embodiment of the present application;
Fig. 4a is the RNN loss curve for 2000 iterations in the embodiment of the present application;
Fig. 4b is the RNN loss curve for 1000 iterations in the embodiment of the present application;
Fig. 5 shows the proportion of glycation-site amino acids among all lysines in the embodiment of the present application;
Fig. 6 shows that the artificial positive samples generated by the RNN in the embodiment of the present application follow the same distribution in terms of global charge and hydrophobicity;
Fig. 7 is the structure of the Skip-gram model in the embodiment of the present application;
Fig. 8 shows the relationship among D, D*, p, q and q* in the embodiment of the present application;
Fig. 9 is the architecture of the convolutional neural network in the embodiment of the present application;
Fig. 10 shows the ROC curves and AUC values in the embodiment of the present application.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Fig. 1 is a flow chart of the protein glycation site identification method provided by the embodiment of the present application. As shown in Fig. 1, the method includes:
S101: Collect protein glycation site data and extract peptide chains from the protein glycation site data to obtain a peptide chain sample set.
In the embodiment of the present application, peptide chains are extracted from the collected protein glycation site data to obtain the peptide chain sample set. Each peptide chain is centered on a lysine and has the form $P = A_{-\eta}A_{-(\eta-1)}\cdots A_{-2}A_{-1}KA_{1}A_{2}\cdots A_{\eta-1}A_{\eta}$, where K is lysine, η is the number of amino acids upstream or downstream of the lysine and is a positive integer, and A is one of the 20 natural amino acids. In the embodiment of the present application, η = 24, which ensures that each peptide chain carries enough information.
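The window extraction in S101 can be sketched as follows. This is a minimal illustration rather than the applicants' implementation: the function name `extract_windows` and the demo sequence are hypothetical, and the padding symbol X is the one this application uses when fewer than η residues are available on one side of the lysine.

```python
def extract_windows(sequence, eta=24, pad="X"):
    """Return (position, window) pairs for every lysine (K) in a protein sequence.

    Each window has length 2*eta + 1 and is centered on the lysine; positions
    beyond either end of the sequence are filled with the padding symbol.
    """
    padded = pad * eta + sequence + pad * eta
    windows = []
    for pos, residue in enumerate(sequence):
        if residue == "K":
            # In the padded string the lysine sits at index pos + eta, so the
            # slice [pos, pos + 2*eta + 1) is centered on it.
            windows.append((pos, padded[pos:pos + 2 * eta + 1]))
    return windows


# Hypothetical toy sequence, with a small eta so the result is easy to read.
for pos, window in extract_windows("MKTAYIAKQR", eta=3):
    print(pos, window)
```

With η = 24 every window has 49 residues, which is the peptide length used later when the features are built.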
The glycation site training dataset can be obtained from existing papers or databases. Specifically:
Dataset A comes from CPLM, an experimentally verified protein lysine modification database that collects and identifies 12 kinds of lysine PTM sites. A query shows that 72 proteins in the database contain glycation sites, giving 323 positive samples and 2046 negative samples in total.
Dataset B was collected manually by Johansen and includes 89 positive samples and 126 negative samples.
Merging dataset A and dataset B yields a protein glycation site dataset containing 412 positive samples and 2172 negative samples.
Further, in the embodiment of the present application, to avoid redundancy and homology bias, peptide chains with similarity greater than 50% are removed with the CD-HIT tool, and the remainder serves as the dataset for building our glycation site predictor. The peptide chain sample set extracted from the protein glycation site data in this way contains 155 positive samples and 674 negative samples.
S102: Encode each amino acid of the peptide chains with a one-hot vector to obtain a peptide chain training set represented by one-hot vectors.
In the embodiment of the present application, 21 letters are used to represent the amino acids in a peptide chain, covering the 20 natural amino acids and one special amino acid: "A", "R", "N", "D", "C", "Q", "E", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V", "X". The 20 natural amino acids are encoded in the order "A", "R", "N", "D", "C", "Q", "E", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"; for example, "A" is alanine and "R" is arginine. "X" is the special amino acid: when the number of amino acids upstream or downstream of the lysine in a peptide chain is less than η, the symbol X is used to extend the peptide chain. Each amino acid in a peptide chain is represented by a one-hot vector. For example, the one-hot encoding of alanine (A) is 100000000000000000000, that of arginine (R) is 010000000000000000000, that of lysine (K) is 000000000001000000000, and that of the special amino acid (X) is 000000000000000000001.
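The encoding in S102 can be sketched in a few lines. The 21-letter ordering below assumes that the tenth letter is I (isoleucine), which is the ordering consistent with the codes given for A, R, K and X; the helper names are hypothetical.

```python
# 20 natural amino acids followed by the padding symbol X (ordering assumed
# from the example codes: A is position 1, R is 2, K is 12, X is 21).
ALPHABET = "ARNDCQEGHILKMFPSTWYVX"


def one_hot(residue):
    """Return the 21-element one-hot vector of a single residue."""
    return [1 if letter == residue else 0 for letter in ALPHABET]


def encode_peptide(peptide):
    """Return one one-hot row per residue of the peptide chain."""
    return [one_hot(residue) for residue in peptide]


def as_bits(vector):
    """Render a one-hot vector as the bit string used in the text."""
    return "".join(map(str, vector))


print(as_bits(one_hot("K")))  # 000000000001000000000
```

A 49-residue peptide chain therefore becomes a 49 x 21 matrix with exactly one 1 per row.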
S103: Train LSTM RNNs on the peptide chain training set to generate artificial peptide chain samples, and construct an artificial peptide chain sample set.
The peptide chain training set is used as the training data of the LSTM RNNs, whose structure is shown in Fig. 2. Before the LSTM RNNs are built for model training and artificial positive sample generation, a start character B ('start') is added at the beginning of each training sequence; its one-hot encoding is 000000000000000000000. From left to right, the internal structure of the LSTM RNNs consists of the forget gate, the input gate and the output gate.
In the embodiment of the present application, during model training, the signal from the previous hidden state and the current input signal $x^{(t)}$ are passed to the sigmoid function together.
The forget gate decides, from the result of this function, which information should be retained, as shown in formula (1).
The output value $f^{(t)}$ lies between 0 and 1. The closer it is to 0, the more of the corresponding state information is discarded; the closer it is to 1, the more is retained.
The input gate is used to update the cell state; $i^{(t)}$ determines which information in $i'^{(t)}$ needs to be retained, as shown in formulas (2) and (3).
The output-gate signal $o^{(t)}$ is used to update the long-term memory $h_1^{(t)}$, as shown in formulas (4) and (5).
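Formulas (1) to (5) are not reproduced in this text. For orientation only, a standard LSTM cell is commonly written as below. The weight matrices $W$, $U$, the biases $b$, the cell state $c$ and the hidden state $h$ are assumed notation from the general LSTM literature and need not match the symbols of the original formulas.

```latex
\begin{aligned}
f^{(t)}  &= \sigma\!\left(W_f x^{(t)} + U_f h^{(t-1)} + b_f\right) \\
i^{(t)}  &= \sigma\!\left(W_i x^{(t)} + U_i h^{(t-1)} + b_i\right) \\
i'^{(t)} &= \tanh\!\left(W_c x^{(t)} + U_c h^{(t-1)} + b_c\right) \\
c^{(t)}  &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot i'^{(t)} \\
o^{(t)}  &= \sigma\!\left(W_o x^{(t)} + U_o h^{(t-1)} + b_o\right) \\
h^{(t)}  &= o^{(t)} \odot \tanh\!\left(c^{(t)}\right)
\end{aligned}
```

In this form the forget gate $f^{(t)}$ scales the previous cell state, the input gate $i^{(t)}$ scales the candidate $i'^{(t)}$, and the output gate $o^{(t)}$ controls how much of the cell state is exposed at step $t$.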
As shown in Fig. 3, the RNN is used to generate samples that contain glycation sites; information flows from top to bottom and from left to right. For each amino acid residue $x^{(t-1)}$, the LSTM RNNs predict the next residue $x'^{(t-1)}$. The generated sequence is compared with the actual peptide chain, and the cross-entropy loss is computed to obtain the optimal network weights; the loss function is shown in formula (6).
Here, the i-th amino acid residue predicted by the LSTM RNNs and $y_i$, the actual i-th amino acid residue in the training data, are both represented by one-hot vectors, and $2\eta+1$ is the length of the peptide chain.
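Formula (6) is not reproduced in this text. As a hedged sketch of the cross-entropy comparison described above, the function below sums $-\sum_k y_{i,k}\log \hat{y}_{i,k}$ over the positions of a peptide chain. Whether the original loss is summed or averaged over the $2\eta+1$ positions is an assumption here, and the function name is hypothetical.

```python
import math


def cross_entropy(predicted, actual, eps=1e-12):
    """Cross-entropy between predicted distributions and one-hot targets.

    predicted: one probability distribution per position.
    actual:    one one-hot vector per position, same shape as predicted.
    The loss is summed over positions (an assumption; it could equally be
    averaged over the 2*eta + 1 positions).
    """
    total = 0.0
    for y_hat, y in zip(predicted, actual):
        # Clamp probabilities away from zero so the logarithm stays finite.
        total -= sum(t * math.log(max(p, eps)) for p, t in zip(y_hat, y))
    return total


# Toy example with a 2-letter alphabet and a single position.
print(cross_entropy([[0.5, 0.5]], [[1, 0]]))
```

A perfect prediction gives a loss of zero, and the loss grows as probability mass moves away from the actual residue.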
For simple LSTM RNN models with few layers and neurons, the validation error does not fully converge; too many layers, however, lead to overfitting and a significant drop in training speed. Based on these considerations, the network architecture parameters used in the embodiment of the present application are listed in Table 2:
Table 2: RNN hyperparameters
Fig. 4a and Fig. 4b show the RNN loss curves for different numbers of iterations in the embodiment of the present application: Fig. 4a for 2000 iterations and Fig. 4b for 1000 iterations. After 2000 iterations the network is completely stable, with an average training loss of 0.78 ± 0.06. To prevent overfitting, the network is sampled at a stage where it is gradually stabilizing, for example at 1000 iterations.
Each sample is generated cyclically, starting from the start character "B", until the sequence reaches the maximum length with a lysine at the center of the peptide chain. To control the diversity of the sequences during sampling, a temperature factor T is introduced into the softmax function in the embodiment of the present application.
The temperature-controlled probability P is given by formula (7).
The diversity of the artificial glycated sequences is directly affected by the value of the temperature factor T: the larger T is, the flatter the distribution over residues; the smaller T is, the more concentrated the distribution.
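Formula (7) is not reproduced in this text. A softmax with a temperature factor is usually written $P_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$, where $z$ are the network outputs before normalization; that form, the name `logits` and the helper names below are assumptions used only to illustrate how T changes the sampling distribution.

```python
import math
import random


def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by the temperature factor T."""
    scaled = [z / T for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, T=1.0, rng=random):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, T)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits for three residues.
logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 1.25):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

Running the loop shows the most likely residue losing probability mass as T rises, which is the flattening effect described above.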
Compared with the SMOTE algorithm, LSTM RNNs are suited to high-dimensional data, and what they generate is a readable peptide chain rather than a pile of abstract feature numbers, which makes it possible to filter the generated samples with existing glycation site predictors. The peptide chains generated at different T are screened with the CD-HIT tool at a 50% threshold, in the same way that CD-HIT is used to select peptide chains with similarity below 50% from those extracted from the protein glycation site dataset; Gly-PseAAC and GlyNN are then used to decide whether each generated artificial peptide chain is a glycated peptide chain. Specifically, if an artificial peptide chain is judged to be a glycated peptide chain by both classification websites, it is classified as an artificial positive sample. The number of artificial peptide chain samples in the artificial peptide chain sample set can be determined from the number of negative samples in the peptide chain sample set. Preferably, the number of samples in the artificial peptide chain sample set plus the number of positive samples in the peptide chain sample set equals the number of negative samples in the peptide chain sample set.
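The consensus filter and the balancing rule can be sketched as follows. GlyNN and Gly-PseAAC are web tools, so they are represented here by hypothetical callables that return True for a peptide chain judged to be glycated; the sample counts 155 and 674 are the ones reported for the peptide chain sample set.

```python
def artificial_needed(n_positive, n_negative):
    """Number of artificial positives that balances the two classes."""
    return max(n_negative - n_positive, 0)


def consensus_positive(peptides, predictors):
    """Keep only the peptides that every predictor labels as glycated."""
    return [p for p in peptides if all(predict(p) for predict in predictors)]


# 155 positive and 674 negative samples remain after the CD-HIT filtering.
print(artificial_needed(155, 674))  # 519
```

Under the preferred balancing rule, 519 artificial positive samples would therefore be drawn from the peptide chains that pass both predictors.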
The lines in Fig. 5 show, from top to bottom, the percentage of all artificially generated sequences that pass the CD-HIT filter, and the proportion of lysines shown as glycation sites by the classification tools GlyNN and Gly-PseAAC. That is, they respectively show the percentage of RNN-generated samples that CD-HIT retains as peptide chains with similarity below 50%, the percentage of RNN-generated samples that the GlyNN tool reports as having no glycation site, and the percentage of RNN-generated samples that the Gly-PseAAC tool reports as having no glycation site.
As can be seen from Fig. 5, at any temperature the number of artificial positive samples generated by the RNN that contain a glycation site differs little. In the embodiment of the present application, sequence validity begins to converge when the temperature factor T is 1.25, so sampling is performed at T = 1.25.
As can be seen from Fig. 6, the original sequences and the artificial sequences, each made up of amino acids, follow the same distribution in terms of global charge and hydrophobicity. The dark color represents the original sequences, and blue represents the artificial positive samples generated by the RNN at T = 1.25. The properties of the amino acids near a lysine play an important role in glycation: positively charged amino acids close to the lysine in the primary or three-dimensional structure catalyze the initiation of lysine glycation. In their recent glycation prediction study, M. M. Islam et al. used the hydrophobicity of the peptide chain as a feature and achieved good classification performance.
Compared with the SMOTE algorithm, LSTM RNNs are suited to high-dimensional data, and the artificial samples they generate are readable peptide chains rather than a pile of abstract feature numbers. This allows the generated samples to be filtered more accurately with existing glycation site predictors.
S104: Divide each peptide chain in the peptide chain sample set and the artificial peptide chain sample set into a series of biological words, and construct the features of each peptide chain in the peptide chain sample set and the artificial peptide chain sample set from the biological words using ProtVec.
Continuous distributed representation originates from NLP. Because protein sequences and natural language are clearly similar, a peptide chain sequence is cut into a series of independent biological words, each made up of three adjacent amino acids. The dataset is taken from Swiss-Prot and totals 1,640,370 biological words. The Skip-gram model is a multilayer neural network; as shown in Fig. 7, it consists of an input layer, an embedding lookup (hidden) layer and a prediction layer, and the result of the embedding lookup layer is the distributed representation of the target word. The final goal of the model is to learn useful word embeddings that capture the co-occurrence relationship between a target word and its context. To this end, Asgari trained the model on biological words and their contexts, and the peptide chains are encoded with the continuous distributed representations of biological words trained by Asgari. In the embodiment of the present application, a peptide chain is regarded as a sentence and three adjacent amino acids as a biological word, so there are 9261 biological words in total. Using the ProtVec model that Asgari trained on a Skip-gram network, the biological words in a peptide chain are converted to vector representations, so that NLP techniques are used to describe the characteristics of glycation sites.
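The split into biological words can be sketched as follows. The sketch assumes overlapping three-residue words with a stride of one, which matches the statement elsewhere in this application that a peptide chain of length L consists of L - 2 biological words; the function name and the demo peptide are hypothetical.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYVX"  # 20 natural amino acids plus padding X


def to_words(peptide, n=3):
    """Overlapping n-residue words: a peptide of length L yields L - n + 1."""
    return [peptide[i:i + n] for i in range(len(peptide) - n + 1)]


# Every three-letter combination over the 21-letter alphabet is a word.
print(len(ALPHABET) ** 3)   # 9261
print(to_words("ALMEKQ"))   # ['ALM', 'LME', 'MEK', 'EKQ']
```

The vocabulary size of 9261 is simply 21 to the third power.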
For a given biological word $\omega_a$, Asgari sets the vector size of its continuous distributed representation to 100 and finds the corresponding representation by maximizing the following average log-probability function.
Here, L is the number of biological words, m and n are the numbers of rows and columns of the hidden-layer parameter U, $\omega_a$ is the center of the window, and k is the context window size; the formula also involves the continuous distributed representation of an arbitrary word in the vocabulary. Because the probability is computed with the softmax function, the denominator requires an exponential of every word in the vocabulary with the target word followed by a sum, which is inefficient.
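The average log-probability function itself is not reproduced in this text. In the general Skip-gram literature it takes the form below; the input and output vectors $v$, $v'$ and the vocabulary $V$ are assumed notation and need not match the symbols of the original formula.

```latex
\frac{1}{L}\sum_{a=1}^{L}\;\sum_{\substack{-k \le b \le k \\ b \ne 0}}
\log p\left(\omega_{a+b} \mid \omega_a\right),
\qquad
p\left(\omega_{a+b} \mid \omega_a\right) =
\frac{\exp\left(v'^{\top}_{\omega_{a+b}}\, v_{\omega_a}\right)}
     {\sum_{\omega \in V} \exp\left(v'^{\top}_{\omega}\, v_{\omega_a}\right)}
```

The sum over all $\omega \in V$ in the denominator is the costly step: with 9261 biological words it has to be evaluated for every training pair.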
To reduce the computational cost, negative sampling is used for optimization. Its core idea is that, when processing the training pair $(\omega_a, \omega_{a+b})$ formed by the target word and a context word, a small number of noise pairs $(\omega_a, \omega^*)$ are added, and only these pairs enter the exponentiation and summation. The purpose of negative sampling is to maximize the following function:
D and D* are both corpora, and θ denotes the biological word vectors being trained in the network. Assuming k equals 2, Fig. 8 shows the relationship among D, D*, $\omega_a$, $\omega_{a+b}$ and $\omega^*$. In panel a, the original protein sequence is decomposed into biological words. In panel b, D is the set of context words for the center word LME with k = 2, and D* is the set of randomly selected words. In panel c, (p, q) is the set of pairs formed by the center word and its context words (positive samples), and (p, q*) is the set of randomly generated pairs (negative samples).
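The construction of positive and negative pairs can be sketched as follows. This is a simplified illustration: the negatives are drawn uniformly from a toy vocabulary, whereas practical Skip-gram implementations usually draw them from a smoothed word-frequency distribution, and all names are hypothetical.

```python
import random


def training_pairs(words, k=2):
    """Positive (center, context) pairs within k words on each side."""
    pairs = []
    for a, center in enumerate(words):
        for b in range(-k, k + 1):
            if b != 0 and 0 <= a + b < len(words):
                pairs.append((center, words[a + b]))
    return pairs


def negative_pairs(words, vocabulary, per_center=2, seed=0):
    """Noise pairs: each center word paired with randomly chosen words.

    Uniform sampling is a simplification, and a sampled word may by chance
    also be a true context word.
    """
    rng = random.Random(seed)
    return [(center, rng.choice(vocabulary))
            for center in words for _ in range(per_center)]


# Toy sentence of biological words with LME as one of the center words.
words = ["ALM", "LME", "MEK", "EKQ", "KQR"]
print([pair for pair in training_pairs(words, k=2) if pair[0] == "LME"])
```

For the center word LME with k = 2 this yields the context words ALM, MEK and EKQ, because only one word lies to its left in the toy sentence.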
Continuous distributed representation offers a new approach to the problem of classifying protein post-translational modification functions. Using the ProtVec trained by Asgari to construct features, a peptide chain of 49 amino acids is encoded as a vector of length 47*100 = 4700.
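The feature length can be checked with a small sketch. The embedding table below is filled with random numbers and merely stands in for the pretrained ProtVec vectors, which are not part of this text; the helper names are hypothetical.

```python
import random


def to_words(peptide, n=3):
    """Overlapping three-residue biological words."""
    return [peptide[i:i + n] for i in range(len(peptide) - n + 1)]


def build_toy_embedding(words, dim=100, seed=0):
    """Random stand-in for a pretrained word-to-vector table (assumption)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for w in sorted(set(words))}


def encode_peptide(peptide, embedding, dim=100):
    """Concatenate the vectors of all words; unknown words map to zeros."""
    features = []
    for word in to_words(peptide):
        features.extend(embedding.get(word, [0.0] * dim))
    return features


peptide = "A" * 24 + "K" + "A" * 24  # hypothetical 49-residue peptide chain
embedding = build_toy_embedding(to_words(peptide))
features = encode_peptide(peptide, embedding)
print(len(to_words(peptide)), len(features))  # 47 4700
```

Equivalently, the 4700 values can be kept as a 47 x 100 matrix with one row per biological word, which is the natural shape for a convolutional input.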
S105: According to the features of each peptide chain in the peptide chain sample set and the artificial peptide chain sample set, constructed from the biological words by ProtVec, a predictor is obtained by CNN training, and protein glycation sites are identified with the predictor.
By means of multiple convolutions, local connections and shared weights, a CNN extracts the underlying nonlinear structure and high-level features from the raw data and learns a large number of mapping relationships between input and output, which can be expressed without an explicit mathematical formula. CNNs are now widely used for various classification tasks. In the embodiment of the present application, a convolutional neural network is applied to predict the glycation sites of peptide chains; the network comprises an input layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer, as shown in Fig. 9.
First, a peptide chain of length L consists of L-2 biological words, and the input layer takes the vectors produced by ProtVec encoding.
For glycation classification, the feature map of a biological word is an indivisible whole, so a convolution kernel with the same width as the word feature map is chosen. For example, given a filter of size m*n, where n equals the length of the continuous distributed representation of a biological word, the filter is convolved with x_{i:i+m-1} in the convolutional layer to obtain the new feature conv_{j,i}.
where the operator in the formula is the convolution operator, j is the index of the convolution kernel, the kernel covers rows i to i+m-1 of the input data, W_j is the convolution kernel parameter, b is the bias term, and f is the rectified linear unit (ReLU) activation function.
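The convolution formula is missing from this text. From the symbol definitions given here it presumably has the following form; the operator symbol ⊗ is an assumption, since the original symbol was lost.

```latex
conv_{j,i} = f\left(W_j \otimes x_{i:i+m-1} + b\right), \qquad f(z) = \max(0, z)
```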
The convolution kernels act on the encoded peptide chain to generate the feature map Conv. The pooling layer further filters the feature map, selecting its maximum value pool_{j,*} = max{conv_{j,1:i-m+1}} as the feature corresponding to convolution kernel k_j. The advantage of this approach is that it retains the most important part of each feature map and greatly reduces the computational complexity of the model.
Conv = (conv_{1,1}, conv_{1,2}, ..., conv_{J,i-m+1})
Pool = (pool_{1,*}, pool_{2,*}, ..., pool_{J,*})
Finally, all pooling results are merged and the model output is generated by the softmax function; the position with the larger value corresponds to the predicted class of the sample. The predictor is thus obtained, and protein glycation sites are identified with it.
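The forward pass described above (full-width convolution, ReLU, max pooling per kernel, then softmax) can be sketched in plain Python, so that no third-party library is needed. This is an illustrative sketch under stated assumptions, not the DeepGly implementation: the function names, the single linear output layer (`out_weights`, `out_biases`) and the tiny dimensions in the usage example are all hypothetical.

```python
import math


def relu(z):
    """Rectified linear unit."""
    return max(z, 0.0)


def conv_features(x, kernel, bias):
    """Slide one m*n kernel over the rows of x (one row per biological word).

    x is a list of rows of length n; kernel is a list of m rows of length n,
    so the kernel spans the full width of each word vector.
    Returns the feature map of this kernel, one value per row position.
    """
    m = len(kernel)
    feature_map = []
    for i in range(len(x) - m + 1):
        total = bias
        for r in range(m):
            total += sum(w * v for w, v in zip(kernel[r], x[i + r]))
        feature_map.append(relu(total))
    return feature_map


def max_pool(feature_map):
    """Keep only the largest activation of one feature map."""
    return max(feature_map)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, kernels, biases, out_weights, out_biases):
    """Return (predicted class index, class probabilities) for one peptide."""
    pooled = [max_pool(conv_features(x, k, b)) for k, b in zip(kernels, biases)]
    scores = [sum(w * p for w, p in zip(row, pooled)) + c
              for row, c in zip(out_weights, out_biases)]
    probs = softmax(scores)
    return probs.index(max(probs)), probs
```

With a toy input of four 2-dimensional word vectors, `conv_features([[1, 0], [0, 1], [1, 1], [0, 0]], [[1, 1], [1, 1]], -1)` gives the feature map `[1, 2, 1]`, and max pooling keeps the value 2.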
In the embodiment of the present application, the predictor (DeepGly for short) is obtained using a scalable deep learning framework; the network hyperparameters are shown in Table 3.
Table 3: Network hyperparameters
The protein glycation site identification method provided by the embodiment of the present application is described in detail below with reference to a specific example.
In the embodiment of the present application, protein glycation site data are collected and peptide chains are extracted from them to obtain a peptide chain sample set containing 412 positive samples and 2172 negative samples. CD-HIT is used to delete data whose similarity exceeds 50%, leaving 155 positive samples and 674 negative samples. Artificial peptide chain samples are constructed with LSTM RNNs to obtain an artificial peptide chain sample set, and 519 samples are randomly selected from it to supplement the positive samples (155 + 519 = 674). In this way, the largest protein lysine glycation data set to date is obtained, with 674 positive samples and 674 negative samples, which effectively solves the data set imbalance problem.
Four indices are used to measure the classification performance of DeepGly: ACC (accuracy), SEN (sensitivity), SPC (specificity) and MCC (Matthews correlation coefficient).
They are defined as follows:
where, for glycation sites, TP (True Positive) is the number predicted correctly and FN (False Negative) is the number predicted incorrectly; for non-glycation sites, TN (True Negative) is the number predicted correctly and FP (False Positive) is the number predicted incorrectly.
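The defining formulas are not reproduced in this text. The sketch below uses the standard definitions of the four indices, taking TP and FN as glycation sites predicted correctly and incorrectly, and TN and FP as non-glycation sites predicted correctly and incorrectly. The function name and the choice of returning 0.0 for MCC when its denominator is zero are assumptions.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, SEN, SPC and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # fraction of all sites predicted correctly
    sen = tp / (tp + fn)                    # fraction of glycation sites recovered
    spc = tn / (tn + fp)                    # fraction of non-glycation sites recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is reported as 0.0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SEN": sen, "SPC": spc, "MCC": mcc}
```

For instance, with hypothetical counts TP = 90, TN = 80, FP = 20 and FN = 10, the accuracy is 170/200 = 0.85, the sensitivity 0.9 and the specificity 0.8.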
Three methods are commonly used to assess prediction performance: k-fold cross-validation, leave-one-out (LOO) testing and independent data set testing. The embodiment of the present application uses k-fold cross-validation, which reduces variance by averaging the results of training on k different partitions, so that model performance is less sensitive to how the data are divided. Specifically, the classification performance of DeepGly is assessed under 4-fold, 6-fold, 8-fold and 10-fold cross-validation. To avoid sampling bias, training is repeated 10 times with different random seeds. The results are shown in Table 4.
Table 4: Results under different cross-validation settings
The data in Table 4 show that all accuracies exceed 80%, with a maximum of 91.84%. The ROC (receiver operating characteristic) curves and AUC (area under the ROC curve) values are shown in Fig. 10. The ROC curves lie close together and the maximum AUC is 94.4%, which indicates the robustness of the predictor.
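The evaluation protocol described above (k-fold cross-validation repeated with different random seeds, results averaged) can be sketched as follows. The helper names and the `evaluate` callback, which stands in for training and scoring a model on one split, are hypothetical; the original does not say how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k, seed):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


def repeated_cross_validation(n_samples, k, seeds, evaluate):
    """Average evaluate(train, test) over every fold of every seed."""
    scores = []
    for seed in seeds:
        for train, test in k_fold_indices(n_samples, k, seed):
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

A call such as `repeated_cross_validation(1348, 10, range(10), evaluate)` would correspond to 10-fold cross-validation on 674 + 674 samples repeated with 10 seeds, with `evaluate` supplied by the user.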
The DeepGly predictor obtained by the protein glycation site identification method provided by the embodiment of the present application is compared with several existing predictors; the results are shown in Table 5.
Table 5: Comparison with other predictors
On data set A, the accuracies of the classifiers Gly-PseAAC and iProtGly-SS are 68.91% and 81.64%, respectively, whereas DeepGly performs markedly better in accuracy, specificity, AUC and other measures. On data set B, the prediction accuracy of DeepGly is 92.05%. The protein glycation site identification method provided by the embodiment of the present application therefore improves the accuracy of protein glycation site identification.
The protein glycation site identification method provided by the embodiment of the present application builds a deep learning framework based on a recurrent neural network (RNN) and a convolutional neural network (CNN) to identify glycation sites in peptide chains. The method demonstrates the possibility of solving the data set imbalance problem by generating artificial positive samples with LSTM RNNs, the feasibility of using continuous distributed representations as features, and the effectiveness of building a CNN to classify peptide chains. Motivated by the clear analogy between protein sequences and natural language, LSTM RNNs are used to remember long- and short-term information in the sequence and connect it to the current task, so that the network model generates a peptide chain data set of the required form within a specified range. ProtVec is then used to convert the biological sequences into vector representations. Finally, a glycation site prediction model, DeepGly, is developed using a simple CNN architecture, a structure that has been applied successfully to sentence classification in NLP. Tested on several benchmark data sets, DeepGly shows competitive results on the glycation site classification task. In addition, the predictor obtained by the method is especially effective for learning tasks with large data sets, and its performance remains competitive even on small data sets. The model is simple to implement and fast to train: training takes only a few minutes on an NVIDIA GeForce GTX 1060 3GB graphics processing unit. It can therefore provide strong support for glycation site identification in related drug development and toxicology research.
The embodiments in this specification are described in a progressive manner. For identical or similar parts the embodiments may refer to one another, and each embodiment focuses on its differences from the others; for related details, refer to the description of the method embodiments. Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed here. This application is intended to cover any variations, uses or adaptations of the invention that follow its general principles, including common knowledge or customary technical means in the art that are not disclosed herein. The specification and examples are to be regarded as exemplary only, and the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (7)

1. A protein glycation site identification method, characterized in that the method comprises:
collecting protein glycation site data, and extracting peptide chains from the protein glycation site data to obtain a peptide chain sample set, wherein each peptide chain is centered on lysine and has the form P = A_{-η}A_{-(η-1)}...A_{-2}A_{-1}KA_1A_2...A_{η-1}A_η, K is lysine, η is the number of amino acids upstream or downstream of the lysine, and A is one of the 20 natural amino acids;
encoding each amino acid of the peptide chains with a one-hot vector to obtain a peptide chain training set represented by one-hot vectors, wherein lysine is 000000000001000000000;
training LSTM RNNs on the peptide chain training set to obtain artificial peptide chain samples, and constructing an artificial peptide chain sample set;
dividing each peptide chain in the peptide chain sample set and the artificial peptide chain sample set into a series of biological words, and constructing, based on the biological words and by means of ProtVec, the features of each peptide chain in the peptide chain sample set and the artificial peptide chain sample set;
obtaining a predictor by CNN training according to the features of each peptide chain in the peptide chain sample set and the artificial peptide chain sample set constructed from the biological words by ProtVec, and identifying protein glycation sites with the predictor.
2. The protein glycation site identification method according to claim 1, characterized in that the method further comprises:
when the number of amino acids upstream or downstream of the lysine in a peptide chain is less than η, extending the peptide chain with the symbol X, wherein the one-hot vector encoding of X is 000000000000000000001.
3. The protein glycation site identification method according to claim 1, characterized in that η = 24.
4. The protein glycation site identification method according to claim 1, characterized in that extracting peptide chains from the protein glycation site data set to obtain a peptide chain sample set comprises:
extracting peptide chains from the protein glycation site data set, and using CD-HIT to select, from the extracted peptide chains, those whose similarity is lower than 50% to generate the peptide chain sample set.
5. The protein glycation site identification method according to claim 1, characterized in that constructing artificial peptide chain samples with LSTM RNNs to obtain an artificial peptide chain sample set comprises:
constructing artificial peptide chain samples with LSTM RNNs, using CD-HIT to select, from the artificial peptide chain samples, those whose similarity is lower than 50%, and randomly taking several of these artificial peptide chain samples to form the artificial peptide chain sample set.
6. The protein glycation site identification method according to claim 5, characterized in that the method further comprises:
taking the artificial peptide chain samples of the artificial peptide chain sample set as positive samples, wherein the number of samples in the artificial peptide chain sample set plus the number of positive samples in the peptide chain sample set equals the number of negative samples in the peptide chain sample set.
7. The protein glycation site identification method according to claim 6, characterized in that the method further comprises:
using GlyNN and Gly-PseAAC to judge whether the artificial peptide chain samples with similarity lower than 50%, selected from the artificial peptide chain samples by CD-HIT, are glycated peptide chains;
when GlyNN and Gly-PseAAC judge an artificial peptide chain sample to be a glycated peptide chain, taking that artificial peptide chain sample as a sample of the artificial peptide chain sample set.
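As an illustration of the peptide form and one-hot encoding recited in claims 1 to 3, the following sketch extracts lysine-centered windows with X padding and encodes single residues. The symbol ordering is an assumption: amino acids sorted by three-letter name (Ala, Arg, Asn, ...) followed by X, which is one ordering consistent with the stated codes for lysine and X. The function names are hypothetical.

```python
# Assumed ordering: amino acids sorted by three-letter name, then the padding
# symbol X. Under this ordering K is the 12th symbol and X the 21st, matching
# the one-hot codes stated in claims 1 and 2.
ALPHABET = "ARNDCQEGHILKMFPSTWYV" + "X"


def one_hot(residue):
    """Return the 21-character one-hot string for a single residue."""
    return "".join("1" if symbol == residue else "0" for symbol in ALPHABET)


def lysine_windows(sequence, eta=24):
    """Return one window of length 2*eta + 1 for every lysine in the sequence.

    Positions that run past either end of the sequence are filled with X.
    """
    windows = []
    for pos, residue in enumerate(sequence):
        if residue != "K":
            continue
        left = sequence[max(0, pos - eta):pos].rjust(eta, "X")
        right = sequence[pos + 1:pos + 1 + eta].ljust(eta, "X")
        windows.append(left + "K" + right)
    return windows
```

With η = 24 each window has 49 residues; for a short toy sequence, `lysine_windows("AKC", eta=2)` returns `['XAKCX']`.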
CN201910734943.XA 2019-08-09 2019-08-09 Protein saccharification site identification method Active CN110488020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910734943.XA CN110488020B (en) 2019-08-09 2019-08-09 Protein saccharification site identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910734943.XA CN110488020B (en) 2019-08-09 2019-08-09 Protein saccharification site identification method

Publications (2)

Publication Number Publication Date
CN110488020A true CN110488020A (en) 2019-11-22
CN110488020B CN110488020B (en) 2022-12-13

Family

ID=68549653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910734943.XA Active CN110488020B (en) 2019-08-09 2019-08-09 Protein saccharification site identification method

Country Status (1)

Country Link
CN (1) CN110488020B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100070438A1 (en) * 2006-10-31 2010-03-18 Keio University Method for predicting interaction between protein and chemical
US20190018019A1 (en) * 2017-07-17 2019-01-17 Bioinformatics Solutions Inc. Methods and systems for de novo peptide sequencing using deep learning
CN109726510A (en) * 2019-01-23 2019-05-07 山东大学 A kind of protein glycation site identification method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ehsaneddin Asgari et al.: "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics", PLOS ONE *
Ehsaneddin Asgari et al.: "DeepPrime2Sec: Deep Learning for Protein Secondary Structure Prediction from the Primary Sequences", Bioinformatics *
Song Jianghua (宋江华): "Application Research on Protein Glycosylation Based on Deep Learning" (基于深度学习的蛋白质糖基化的应用研究), China Master's Theses Full-text Database, Basic Sciences Series *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081311A (en) * 2019-12-26 2020-04-28 青岛科技大学 Protein lysine malonylation site prediction method based on deep learning
CN112067577A (en) * 2020-08-18 2020-12-11 武汉工程大学 Method, device and equipment for identifying overproof cream pigment based on support vector machine
CN116705141A (en) * 2022-12-15 2023-09-05 西北大学 Method for screening Alzheimer disease prevention peptide from walnut enzymolysis product based on CNN-LSTM neural network
CN116705141B (en) * 2022-12-15 2024-01-09 西北大学 Method for screening Alzheimer disease prevention peptide from walnut enzymolysis product based on CNN-LSTM neural network

Also Published As

Publication number Publication date
CN110488020B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN103544392B (en) Medical science Gas Distinguishing Method based on degree of depth study
CN103714261B (en) Intelligent auxiliary medical treatment decision supporting method of two-stage mixed model
CN110488020A (en) A kind of protein glycation site identification method
Wang et al. Predicting the impacts of mutations on protein-ligand binding affinity based on molecular dynamics simulations and machine learning methods
CN108921604B (en) Advertisement click rate prediction method based on cost-sensitive classifier integration
CN108763865A (en) A kind of integrated learning approach of prediction DNA protein binding sites
CN109325517A (en) A kind of figure classification method of the Recognition with Recurrent Neural Network model based on Attention
CN113096814A (en) Alzheimer disease classification prediction method based on multi-classifier fusion
KR102213670B1 (en) Method for prediction of drug-target interactions
CN110689523A (en) Personalized image information evaluation method based on meta-learning and information data processing terminal
CN114220540A (en) Construction method and application of diabetic nephropathy risk prediction model
CN115050477B (en) Bethes-optimized RF and LightGBM disease prediction method
CN113470816A (en) Machine learning-based diabetic nephropathy prediction method, system and prediction device
CN109411016A (en) Genetic mutation site detection method, device, equipment and storage medium
Sinha et al. Analyzing chronic disease biomarkers using electrochemical sensors and artificial neural networks
CN111649779A (en) Oil well oil content and total flow rate measuring method based on dense neural network and application
CN115185937A (en) SA-GAN architecture-based time sequence anomaly detection method
CN103164631A (en) Intelligent coordinate expression gene analyzer
Khorashadizade et al. An intelligent feature selection method using binary teaching-learning based optimization algorithm and ANN
CN109920478A (en) A kind of microorganism-disease relationship prediction technique filled based on similitude and low-rank matrix
Reddy et al. AdaBoost for Parkinson's disease detection using robust scaler and SFS from acoustic features
Firnando et al. Analyzing InceptionV3 and InceptionResNetV2 with Data Augmentation for Rice Leaf Disease Classification
CN113780378A (en) Disease high risk group prediction device
Faraji-Biregani et al. Diabetes prediction recommender system based on artificial neural networks and sine-cosine optimization algorithm
Hakim Performance Evaluation of Machine Learning Techniques for Early Prediction of Brain Strokes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant