CN109243527A - A digestion-probability-assisted method for predicting peptide detectability - Google Patents
- Publication number
- CN109243527A, CN201810901107.1A
- Authority
- CN
- China
- Prior art keywords
- peptide
- digestion
- protein
- cleavage site
- restriction enzyme
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
Abstract
The invention discloses a digestion-probability-assisted method for predicting peptide detectability, comprising the steps of: 1) screening high-confidence proteins from all identified proteins; 2) constructing a training set for a cleavage-site digestion probability prediction model; 3) training the cleavage-site digestion probability prediction model; 4) performing theoretical digestion on all high-confidence proteins to obtain theoretical digestion peptides; 5) predicting the digestion probability of all theoretical digestion peptides; 6) constructing a peptide detectability training set; 7) training a peptide detectability model; 8) predicting the peptide detectability of all theoretical digestion peptides of other proteins. In keeping with the shotgun proteomics workflow, the invention takes the enzymatic digestion of proteins into account when predicting peptide detectability, which significantly improves the accuracy of the prediction.
Description
Technical field
The present invention relates to methods for predicting peptide detectability in proteomics, and in particular to a method for predicting peptide detectability in shotgun proteomics.
Background art
Targeted proteomics experiments, such as those using the MRM strategy, can selectively detect and quantify peptides and proteins of interest. This approach can be used to verify candidate biomarkers rapidly. The first step in developing an MRM experiment is to select representative peptides for the candidate proteins. Methods for selecting representative peptides fall into two main classes: methods based on experimental data and methods based on computation. Methods based on experimental data have limitations. First, not all proteins have existing experimental data. Second, whether a peptide is detected is influenced by many factors, so a peptide identified in a previous experiment will not necessarily be detected in the next one. Researchers have therefore paid increasing attention to computational methods. However, the mechanism of peptide detection remains unclear, which hinders the development of algorithms that accurately predict representative peptides.
So far, scientists have made considerable efforts to explore the mechanism of peptide detection. Some years ago, Le Bihan et al. (reference: Le Bihan, T., Robinson, M.D., Stewart, I.I. & Figeys, D. Definition and Characterization of a "Trypsinosome" from Specific Peptide Characteristics by Nano-HPLC-MS/MS and in Silico Analysis of Complex Protein Mixtures. J. Proteome Res. 3, 1138-1148 (2004)) and Ethier et al. (reference: Ethier, M. & Figeys, D. Strategy to Design Improved Proteomic Experiments Based on Statistical Analyses of the Chemical Properties of Identified Peptides. J. Proteome Res. 4, 2201-2206 (2005)) proposed empirical formulas based on hydrophobicity, peptide length and isoelectric point. In recent years, many machine-learning-based methods for predicting representative peptides have been developed. In these methods, designing the features that describe a peptide is a key problem. It is well known that many factors influence whether a peptide can be detected in a proteomics experiment, such as the physicochemical properties of the peptide, the abundance of the protein to which the peptide belongs, and the identification process. Tang et al. (reference: Tang, H. et al. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics 22, e481-8 (2006)) proposed the concept of peptide detectability and built a machine learning algorithm for predicting peptide detectability from 175 features derived from the peptide sequence. Later, Sanders et al. (reference: Sanders, W.S., Bridges, S.M., McCarthy, F.M., Nanduri, B. & Burgess, S.C. Prediction of peptides observable by mass spectrometry applied at the experimental set level. BMC Bioinformatics 8 Suppl 7, S23 (2007)), Mallick et al. (reference: Mallick, P. et al. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 25, 125-31 (2007)) and Eyers et al. (reference: Eyers, C.E. et al. CONSeQuence: Prediction of Reference Peptides for Absolute Quantitative Proteomics Using Consensus Machine Learning Approaches. Mol. Cell. Proteomics 10, M110.003384-M110.003384 (2011)) developed peptide detectability algorithms based on 596, 1010 and 1186 features, respectively. These algorithms mainly consider features derived from AAindex (reference: Kawashima, S., Ogata, H. & Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Res. 27, 368-369 (1999)) and from the peptide sequence. More recently, Muntel et al. (reference: Muntel, J. et al. Abundance-based classifier for the prediction of mass spectrometric peptide detectability upon enrichment (PPA). Mol. Cell. Proteomics 14, 430-440 (2015)) added protein abundance to the peptide detectability prediction model as an additional feature and obtained better results. However, protein abundance is usually unknown unless a mass spectrometry experiment has been carried out in advance.
Although many methods for predicting peptide detectability have been proposed, their accuracy remains unsatisfactory. How to predict peptide detectability accurately is therefore a technical problem that urgently needs to be solved.
Summary of the invention
In view of the technical problems of the prior art, the object of the present invention is to provide a more accurate method for predicting peptide detectability by taking full account of protein digestion information.
To achieve this object, the present invention provides a digestion-probability-assisted method for predicting peptide detectability, comprising:
Step 1) screening high-confidence proteins from all identified proteins;
Step 2) constructing a training set for a cleavage-site digestion probability prediction model;
Step 3) training the cleavage-site digestion probability prediction model;
Step 4) performing theoretical digestion on all high-confidence proteins to obtain theoretical digestion peptides;
Step 5) predicting the digestion probability of all theoretical digestion peptides;
Step 6) constructing a peptide detectability training set;
Step 7) training a peptide detectability model;
Step 8) predicting the peptide detectability of all theoretical digestion peptides of other proteins.
In the above technical solution, in step 1), all identified proteins are first sorted in descending order by the spectral count of the protein and, separately, by the sequence coverage of the protein; proteins that rank in the top 50% of both orderings are then taken as high-confidence proteins. The spectral count of a protein is the sum of the spectral counts of the peptides associated with that protein. The sequence coverage of a protein is the proportion of the total protein sequence length that is covered once the sequences of the associated peptides are mapped back to the protein sequence. Alternatively, the high-confidence proteins are those identified proteins whose spectral count is greater than a set threshold h1 and whose sequence coverage is greater than a set threshold h2.
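The top-50% screening rule can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_high_confidence`, the input layout and the way ties and odd counts are handled are all assumptions.

```python
def select_high_confidence(proteins, top_fraction=0.5):
    """proteins: dict mapping protein id -> (spectral_count, sequence_coverage).

    Hypothetical helper: keeps proteins ranked in the top fraction of BOTH
    descending orderings (by spectral count and by sequence coverage).
    """
    n_keep = int(len(proteins) * top_fraction)  # rounding down is an assumption
    by_count = sorted(proteins, key=lambda p: proteins[p][0], reverse=True)
    by_coverage = sorted(proteins, key=lambda p: proteins[p][1], reverse=True)
    return set(by_count[:n_keep]) & set(by_coverage[:n_keep])


# Toy data (made up for illustration only).
proteins = {
    "P1": (120, 0.80),
    "P2": (15, 0.65),
    "P3": (90, 0.10),
    "P4": (3, 0.05),
}
high_conf = select_high_confidence(proteins)
```

With the toy data, only `P1` is in the top half of both rankings. The threshold variant (h1, h2) would replace the two sorts with two simple comparisons.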
In the above technical solution, in step 2), constructing the training set for the cleavage-site digestion probability prediction model comprises:
Step 2-1) The identified peptides associated with the high-confidence proteins are mapped back to the sequences of the high-confidence proteins from step 1). Occasionally a peptide may match several positions in one protein; in that case the invention considers only the first match. Cleavage-site information in the high-confidence protein sequences is then collected according to the mapped positions of the identified peptides. The cleavage-site information comprises: 1) the sum of the spectral counts of the peptides on the left of the cleavage site, denoted parameter L; 2) the sum of the spectral counts of the peptides on the right of the cleavage site, denoted parameter R; 3) the sum of the spectral counts of the peptides in which the cleavage site is a missed cleavage site, denoted parameter O. Among the cleavage sites in the high-confidence protein sequences from step 1), a cleavage site that satisfies the following conditions is classed as a positive site: 1) L is greater than or equal to 1; 2) R is greater than or equal to 1; 3) O is equal to 0. A cleavage site that satisfies the following conditions is classed as a negative site: 1) L is equal to 0; 2) R is equal to 0; 3) O is greater than or equal to 2.
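The L/R/O bookkeeping and the labelling rule can be sketched as below. The coordinate convention (0-based, inclusive peptide positions, cleavage after the K/R residue at index `site`) and both function names are assumptions made for illustration; the text does not fix them.

```python
def site_counts(site, peptides):
    """Return (L, R, O) for one cleavage site.

    site: 0-based index of the K/R residue; the cut is assumed to fall after it.
    peptides: list of (start, end, spectral_count), 0-based inclusive positions.
    """
    L = sum(c for s, e, c in peptides if e == site)        # peptides ending at the site
    R = sum(c for s, e, c in peptides if s == site + 1)    # peptides starting right after it
    O = sum(c for s, e, c in peptides if s <= site < e)    # peptides spanning the site
    return L, R, O


def label_cleavage_site(L, R, O):
    """1 = positive site, 0 = negative site, None = not used for training."""
    if L >= 1 and R >= 1 and O == 0:
        return 1
    if L == 0 and R == 0 and O >= 2:
        return 0
    return None


# Toy mapped peptides (made up): (start, end, spectral_count).
mapped = [(0, 4, 3), (5, 9, 2), (5, 14, 4)]
counts_site4 = site_counts(4, mapped)
counts_site9 = site_counts(9, mapped)
```

Sites that satisfy neither rule, such as the site at index 9 in the toy data, are simply left out of the training set under this reading.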
Step 2-2) For the positive and negative sites from step 2-1), the T amino acids adjacent to the site on each side are taken to form a (2T+1)-mer of length 2T+1 (T is a natural number, typically 4). If a cleavage site happens to lie near the N-terminus or C-terminus of the protein sequence, so that there are not enough amino acids on its left or right to form the (2T+1)-mer, the missing positions are filled with the placeholder "Z". For each (2T+1)-mer, every amino acid except the one in the central position (lysine or arginine) is converted into a 21-dimensional 0-1 vector. In this way each (2T+1)-mer is converted into a 42T-dimensional 0-1 vector. The label of the positive sites from step 2-1) is set to 1, and the label of the negative sites from step 2-1) is set to 0.
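A minimal sketch of the window extraction and 0-1 encoding described in step 2-2). The alphabet ordering and the function names are assumptions; residues outside the 20 standard amino acids are not handled here.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"  # 20 amino acids plus the placeholder "Z" (order assumed)


def site_window(sequence, site, T=4):
    """Return the (2T+1)-mer centred on `site` (0-based index of K or R), padded with "Z"."""
    padded = "Z" * T + sequence + "Z" * T
    return padded[site:site + 2 * T + 1]


def encode_window(window):
    """One-hot encode every residue except the central one: 2T residues x 21 = 42T values."""
    T = len(window) // 2
    flanks = window[:T] + window[T + 1:]
    vector = []
    for residue in flanks:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(residue)] = 1
        vector.extend(one_hot)
    return vector


# Toy protein sequence (made up); the K at index 1 is close to the N-terminus.
window = site_window("MKTAYIAKQR", site=1, T=4)
vector = encode_window(window)
```

For T = 4 the window has 9 residues and the encoded vector has 168 entries, of which exactly 8 are 1.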
In the above technical solution, in step 3), the digestion probability prediction model is trained because of the characteristics of shotgun proteomics. A typical shotgun proteomics experiment can be divided into two processes: the enzymatic digestion of proteins and the detection of peptides. An easily overlooked fact is that which peptides the protease has actually produced is unknown; that is, it is not known exactly which peptides actually enter the peptide detection process. Accurate prediction of peptide detectability therefore needs to take the digestion of proteins into account. Training the cleavage-site digestion probability prediction model means training a random forest model on the cleavage-site digestion probability training set constructed in step 2). As the number of trees in a random forest increases, its error decreases but its running time increases. In preliminary tests, setting the number of trees in the random forest to 200 gave good results in both accuracy and efficiency. The number of features randomly selected at each node uses the default value, i.e. the square root of the total number of features. The AUC under ten-fold cross-validation is used as the index for evaluating the model.
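The evaluation protocol can be illustrated without committing to a particular random forest library. The sketch below only shows how folds might be formed and how an AUC can be computed from predicted scores; the interleaved fold assignment and both function names are assumptions, and the classifier itself is left out.

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k interleaved folds (no shuffling; a simplification)."""
    return [list(range(i, n, k)) for i in range(k)]


def auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    ranked correctly, counting ties as half. Quadratic time; fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted cleavage probabilities (made up).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
toy_auc = auc(labels, scores)
folds = k_fold_indices(25, k=10)
```

In a ten-fold run, each fold in turn would be held out, the model trained on the other nine, and the held-out scores pooled or averaged before computing the AUC.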
In the above technical solution, in step 4), performing theoretical digestion on all high-confidence proteins means digesting the sequences of the high-confidence proteins from step 1) in silico into theoretical digestion peptides. The parameters are set as follows: 1) the maximum and minimum allowed peptide lengths are set to the maximum and minimum lengths of all identified peptides, respectively; 2) the maximum number of missed cleavage sites allowed in a peptide is set to 2.
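A minimal sketch of theoretical (in silico) digestion with a length window and a missed-cleavage limit. It assumes cleavage after every K or R with no proline exception, which the text does not specify; the default length limits of 6 and 38 are borrowed from step 8-1), and the function name is hypothetical.

```python
def theoretical_digest(sequence, min_len=6, max_len=38, max_missed=2):
    """Return (peptide, start, n_missed) tuples for a protein sequence."""
    # Cut positions: protein start, the position after each internal K/R, protein end.
    cuts = [0] + [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in "KR"] + [len(sequence)]
    peptides = []
    for i in range(len(cuts) - 1):
        for missed in range(max_missed + 1):
            j = i + 1 + missed
            if j >= len(cuts):
                break
            peptide = sequence[cuts[i]:cuts[j]]
            if min_len <= len(peptide) <= max_len:
                peptides.append((peptide, cuts[i], missed))
    return peptides


# Toy protein sequence (made up).
digest = theoretical_digest("MAGSLKDEFRTTPLWK")
peptide_sequences = [p for p, _, _ in digest]
```

In the toy example the fully cleaved peptide `DEFR` is dropped because it is shorter than the minimum length, while the longer peptides containing it as part of a missed cleavage are kept.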
In the technical solution, in step 5), predicting the digestion probability of all theoretical digestion peptides comprises:
Step 5-1) For every theoretical digestion peptide from step 4), obtain the (2T+1)-mers corresponding to its left-end cleavage site, its right-end cleavage site and its missed cleavage sites (if any). These (2T+1)-mers are represented as 42T-dimensional vectors by the method of step 2-2).
Step 5-2) The trained digestion probability prediction model from step 3) is used to predict the digestion probability of the cleavage sites from step 5-1).
Step 5-3) From the cleavage-site digestion probabilities predicted in step 5-2), the digestion probability of each theoretical digestion peptide is calculated according to the following formula:
$$e_{pep} = e_l \cdot e_r \cdot \prod_{i=1}^{N} (1 - e_i)$$
where $e_{pep}$ denotes the digestion probability of the peptide;
$e_l$ denotes the digestion probability of the cleavage site at the left end of the peptide;
$e_r$ denotes the digestion probability of the cleavage site at the right end of the peptide;
$e_i$ denotes the digestion probability of the i-th missed cleavage site in the peptide;
$N$ denotes the number of missed cleavage sites in the peptide.
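A minimal sketch of the peptide-level digestion probability, assuming it is the product of the two terminal cleavage probabilities and of (1 - e_i) for every missed cleavage site, which is the reading suggested by the symbol definitions above. The function name is hypothetical.

```python
import math


def peptide_digestion_probability(e_left, e_right, e_missed=()):
    """Both termini must be cleaved and every internal (missed) site must stay intact."""
    return e_left * e_right * math.prod(1.0 - e for e in e_missed)


# Toy site probabilities (made up).
no_missed = peptide_digestion_probability(0.9, 0.8)
one_missed = peptide_digestion_probability(0.9, 0.8, [0.5])
```

Under this reading, a peptide whose internal site is very likely to be cut gets a low probability, even when both of its ends are cut efficiently.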
In the technical solution, in step 6), constructing the peptide detectability training set comprises:
Step 6-1) Among all theoretical digestion peptides of the high-confidence proteins from step 4), peptides that were identified with a spectral count greater than 1 are classed as positive peptides, and peptides that were not identified are classed as negative peptides. The label of positive peptides is set to 1 and the label of negative peptides is set to 0.
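The labelling rule of step 6-1) might look like the sketch below. Leaving out peptides identified by exactly one spectrum is an interpretation of the rule (they are neither "spectral count greater than 1" nor "not identified"), and the function name and data layout are assumptions.

```python
def build_detectability_labels(theoretical_peptides, spectral_counts):
    """spectral_counts: dict peptide sequence -> spectral count of identified peptides."""
    labels = {}
    for peptide in theoretical_peptides:
        count = spectral_counts.get(peptide, 0)
        if count > 1:
            labels[peptide] = 1   # identified with more than one spectrum
        elif count == 0:
            labels[peptide] = 0   # never identified
        # count == 1: ambiguous under this reading, so the peptide is skipped
    return labels


# Toy data (made up).
labels = build_detectability_labels(
    ["AAAAAAK", "CCCCCCR", "DDDDDDK"],
    {"AAAAAAK": 3, "DDDDDDK": 1},
)
```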
Step 6-2) Compute 588 attributes for the positive and negative peptides from step 6-1), so that each peptide can be represented by a vector x = (x1, x2, x3, …, x588). Of these 588 attributes, the first 23 are features related to the peptide sequence, such as the peptide length, the number of missed cleavage sites in the peptide, the peptide mass, and the frequency of each amino acid in the peptide. The middle 544 are physicochemical properties of amino acids from AAindex (reference: Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., and Kanehisa, M.; AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202-D205 (2008)), averaged over the peptide. The next 20 are physicochemical properties taken from earlier studies (references: Braisted, J.C. et al. BMC Bioinformatics 9, 529 (2008); Webb-Robertson, B.J. et al. Bioinformatics 26, 1677-1683 (2010); Eyers, C.E. et al. Mol Cell Proteomics 10, M110 003384 (2011); Tang, H. et al. Bioinformatics 22, e481-488 (2006)). The last feature is the peptide digestion probability calculated in step 5-3). This feature is proposed by the present invention and is used in a peptide detectability model for the first time.
In the technical solution, in step 7), training the peptide detectability model comprises:
Step 7-1) Feature selection is carried out on the training set constructed in step 6). The feature selection method used by the invention is mRMR (reference: Ding, C. & Peng, H. Minimum Redundancy Feature Selection from Microarray Gene Expression Data. in Proceedings of the IEEE Computer Society Conference on Bioinformatics 523-- (IEEE Computer Society, 2003)). Following the ranking produced by feature selection, the top 50 attributes are added one at a time to train random forest models on the training set from step 6). The tests show that the random forest model performs best with 31 features.
Step 7-2) A random forest model is trained on the peptide detectability training set constructed in step 6) using the 31 features selected in step 7-1). The number of trees in the random forest is set to 200. The number of features randomly selected at each node uses the default value, i.e. the square root of the total number of features. The AUC under ten-fold cross-validation is again used as the index for evaluating the model. Because the peptide detectability training set is imbalanced, down-sampling is used when training the random forest (reference: Chen, C. & Liaw, A. Using random forest to learn imbalanced data. Discovery (2004)).
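Down-sampling for an imbalanced training set can be sketched as follows. This is a single, global down-sampling of the majority class, which is a simplification chosen for illustration; the cited balanced random forest approach draws a balanced sample for each tree instead. The function name and the fixed seed are assumptions.

```python
import random


def downsample_majority(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 3 positive and 10 negative peptides, identified by index.
all_labels = [1] * 3 + [0] * 10
all_samples = list(range(13))
balanced_samples, balanced_labels = downsample_majority(all_samples, all_labels)
```

In the worked example later in the document the negative set is roughly seven times larger than the positive set, which is the kind of imbalance this step is meant to address.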
In the technical solution, in step 8), predicting the peptide detectability of all theoretical digestion peptides of other proteins comprises:
Step 8-1) The protein sequences to be predicted are theoretically digested into theoretical digestion peptides, and the digestion probability of each theoretical digestion peptide is predicted using the cleavage-site digestion probability prediction model. The digestion parameters are set as follows: the maximum allowed peptide length is set to 38; the minimum allowed peptide length is set to 6; the maximum number of missed cleavage sites allowed in a peptide is set to 2.
Step 8-2) Each theoretical digestion peptide from step 8-1) is converted into a 588-dimensional numerical vector by the method of step 6-2).
Step 8-3) The numerical vectors obtained in step 8-2) are input into the peptide detectability model trained in step 7-2) to predict the peptide detectability of all theoretical digestion peptides from step 8-1).
The present invention also provides a device for predicting representative peptides for targeted proteomics. The device comprises: a peptide and protein identification module, a peptide detectability prediction module, and a representative peptide prediction module.
The peptide and protein identification module uses protein identification software to carry out the basic interpretation of the spectra.
The peptide detectability prediction module comprises the following parts:
1) constructing a peptide detectability prediction training set;
2) training a peptide detectability model;
3) predicting the peptide detectability of all theoretical digestion peptides of the candidate proteins in targeted proteomics.
For each candidate protein, the representative peptide prediction module ranks the peptides associated with the protein by the peptide detectability calculated above and takes the top M peptides (M is typically 5) as representative peptides.
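The ranking step of the representative peptide prediction module can be sketched in a few lines. The function name and the toy scores are made up for illustration.

```python
def representative_peptides(detectability, M=5):
    """detectability: dict peptide -> predicted detectability for one candidate protein."""
    ranked = sorted(detectability, key=detectability.get, reverse=True)
    return ranked[:M]


# Toy predicted detectabilities for one protein (made up).
toy_scores = {"PEPTIDEAK": 0.2, "PEPTIDEBK": 0.9, "PEPTIDECR": 0.5}
top_two = representative_peptides(toy_scores, M=2)
```

If a protein has fewer than M theoretical digestion peptides, the slice simply returns all of them.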
The invention has the following advantages:
1. It is the first to add the peptide digestion probability to a peptide detectability model.
2. In keeping with the shotgun proteomics workflow, it takes the enzymatic digestion of proteins into account when predicting peptide detectability, which significantly improves the accuracy of the prediction.
Brief description of the drawings
Fig. 1 is a flow chart of the digestion-probability-assisted peptide detectability prediction method of the present invention;
Fig. 2 is the ten-fold cross-validation ROC curve of the cleavage-site digestion probability prediction model;
Fig. 3 shows the performance of the peptide detectability model as the features chosen by feature selection are added one at a time;
Fig. 4 is the ROC curve for the performance evaluation of the peptide detectability model on the test set.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
Suppose there is a protein sample. The protein mixture is first digested by standard biochemical techniques to form a peptide mixture solution, which is then analysed by liquid chromatography-mass spectrometry to generate experimental tandem mass spectrometry data. The tandem mass spectrometry data contain three-dimensional information: chromatographic retention time, mass-to-charge ratio, and mass spectral response signal intensity. Identification software is then needed to determine which peptides and proteins are present in the spectra and how the peptides relate to the proteins. For example, software such as MaxQuant (reference: Cox, J. and Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol, 2008, 26, pp 1367-72) and pFind (reference: Wang L.H. et al. pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun Mass Spectrom, 2007, 21, 2985-2991) provides this function. The results obtained for this sample are used to construct the training set.
Suppose a second protein sample is processed in the same way to obtain identified peptides and proteins. The results obtained for that sample are used to construct the test set.
On the basis of the above data, and with reference to Fig. 1, the specific implementation of the method of the invention is described below.
From the MaxQuant identification results used to construct the training set, 43088 identified peptides and 3959 proteins were extracted. Screening by protein sequence coverage and protein spectral count yielded 1556 high-confidence proteins, corresponding to 29172 identified peptides. These identified peptides were mapped back to their protein sequences, and the digestion status at each site was counted. Then, following the rules specified above for constructing the digestion probability prediction model training set, 7778 cleavage sites and 4854 missed cleavage sites were obtained. A random forest model was trained on the training set built by vectorising these sites. As shown in Fig. 2, the AUC of the random forest model under 10-fold cross-validation reaches 0.9756, which shows that the digestion probability model trained by the invention predicts the digestion status of a cleavage site very well. Next, identified peptides with more than one spectrum were taken as the positive set (25363 peptides), and theoretical digestion peptides that were not identified were taken as the negative set (180284 peptides). The trained digestion probability prediction model was then used to predict the digestion probability of each peptide in the peptide detectability training set, and the other attributes of the training-set peptides were calculated.
To characterise the detectability of a peptide, the invention converts each peptide into a 1x588 numerical vector. A peptide is essentially an ordered sequence of amino acids. One representation of amino acids uses one capital letter per amino acid; for example, alanine is represented by the letter A and cysteine by the letter C. A peptide can thus be written as a string of letters. The feature representation is illustrated below with the peptide ARNDCEQK. In a mass spectrometer, peptides that are too short or too long cannot be detected, so peptide length is an important factor in whether a peptide can be detected. The length of this peptide is 8. Trypsin normally cleaves a protein sequence into peptides at the C-terminal side of lysine or arginine, so a lysine (K) or arginine (R) occurring inside a peptide (not at its C-terminus) is generally regarded as the result of a missed cleavage. The digestion status of a peptide strongly affects its mass spectrometry signal, so the number of missed cleavage sites in a peptide is also an important feature. For example, the peptide ARNDCEQK contains exactly one missed cleavage site, R. Summing the masses of the amino acids in the peptide gives a peptide mass of 963.43 Da. There are 20 common amino acids in biology, and the invention uses a 20-dimensional amino acid frequency vector to represent the amino acid composition of a peptide. For example, with a fixed ordering of the amino acids, each amino acid in ARNDCEQK occurs exactly once; dividing by the peptide length 8, the feature value at the position of each of these amino acids is 1/8, and the feature value at the positions of the remaining amino acids is 0. According to the AAindex database, each amino acid has 544 quantified physicochemical properties, and the quantified properties of the amino acids in a peptide are averaged to give features of the peptide. For example, suppose the 544 physicochemical properties of the amino acids in the peptide ARNDCEQK are $p_A, p_R, p_N, p_D, p_C, p_E, p_Q, p_K$.
Then the feature of the peptide is $\frac{1}{8}(p_A + p_R + p_N + p_D + p_C + p_E + p_Q + p_K)$,
where each $p_X$ denotes a 1x544 vector.
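The sequence-derived features of the ARNDCEQK example can be reproduced with a short sketch. The amino acid ordering, the function names and the two-dimensional toy property table are assumptions; real AAindex vectors have 544 entries per residue and are not reproduced here.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # one fixed ordering of the 20 common amino acids (assumed)


def sequence_features(peptide):
    """Return (length, number of missed cleavage sites, 20-dim frequency vector)."""
    length = len(peptide)
    # K or R anywhere except the C-terminal residue counts as a missed cleavage site.
    missed = sum(1 for aa in peptide[:-1] if aa in "KR")
    frequencies = [peptide.count(aa) / length for aa in AMINO_ACIDS]
    return length, missed, frequencies


def mean_property_vector(peptide, properties):
    """Average per-residue property vectors over the peptide.

    properties: dict amino acid -> list of property values.
    """
    dim = len(next(iter(properties.values())))
    return [sum(properties[aa][k] for aa in peptide) / len(peptide) for k in range(dim)]


length, missed, frequencies = sequence_features("ARNDCEQK")

# Hypothetical 2-dimensional property table, for illustration only.
toy_properties = {aa: [float(i), 1.0] for i, aa in enumerate(AMINO_ACIDS)}
toy_mean = mean_property_vector("ARND", toy_properties)
```

For ARNDCEQK this gives a length of 8, one missed cleavage site, and a frequency vector with eight entries equal to 1/8 and twelve zeros, matching the description above.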
Next, the last 20 physicochemical properties of the peptide are calculated following the references (Braisted, J.C. et al. BMC Bioinformatics 9, 529 (2008); Webb-Robertson, B.J. et al. Bioinformatics 26, 1677-1683 (2010); Eyers, C.E. et al. Mol Cell Proteomics 10, M110 003384 (2011); Tang, H. et al. Bioinformatics 22, e481-488 (2006)). Notably, calculating these features uses not only the amino acid sequence of the peptide itself but also the amino acid sequence adjacent to the peptide.
Finally, the peptide digestion probability calculated in the preceding step is used as the last feature of the peptide, the 588th dimension.
The features used by the invention cover almost all features used so far in research on peptide detectability. Such a large number of features, however, brings a problem: redundancy among features. To address this, the invention uses the mRMR method for feature selection and selects 50 features that are highly relevant to whether a peptide is detected and have minimal redundancy among themselves. To determine how many features are needed, the invention adds these 50 features to the random forest model one at a time and records the change in the 10-fold cross-validation AUC of the model. As shown in Fig. 3, the model performs best with 31 features; adding further features degrades its performance.
For the MaxQuant identification results used to construct the test set, the theoretical digestion peptides of each protein are represented as 1x31 vectors following the steps above. The peptide detectability of these theoretical digestion peptides is predicted with the trained peptide detectability model. As shown in Fig. 4, the trained peptide detectability model still predicts well on the test set, which shows that the model of the invention generalises well.
At this point, the above operations have completed the prediction of peptide detectability for all peptides.
The present invention also provides a device for predicting representative peptides for targeted proteomics. The device comprises: a peptide and protein identification module, a peptide detectability prediction module, and a representative peptide prediction module.
The peptide and protein identification module uses protein identification software to carry out the basic interpretation of the spectra.
The peptide detectability prediction module comprises the following parts:
1) constructing a peptide detectability prediction training set;
2) training a peptide detectability model;
3) predicting the peptide detectability of all theoretical digestion peptides of the candidate proteins in targeted proteomics.
For each candidate protein, the representative peptide prediction module ranks the peptides associated with the protein by the peptide detectability calculated above and takes the top N peptides (N is typically 5) as representative peptides.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to the embodiments, those skilled in the art should understand that modifications or equivalent replacements of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.
Claims (10)
1. a kind of peptide fragment detectability prediction technique of digestion probability auxiliary, step include:
1) high credible albumen is filtered out in albumen from having identified;
2) restriction enzyme site digestion Probabilistic Prediction Model training set, training restriction enzyme site digestion Probabilistic Prediction Model are constructed;
3) theoretical digestion is carried out to all high credible albumen, obtains theoretical digestion peptide fragment;
4) the digestion probability of all theoretical digestion peptide fragments is predicted using the restriction enzyme site digestion Probabilistic Prediction Model;
5) peptide fragment detectability training set, training peptide fragment detectability model are constructed;
6) the peptide fragment detectability of all theoretical digestion peptide fragments of peptide fragment detectability model prediction setting albumen is utilized.
2. The method of claim 1, wherein the high-confidence proteins are those identified proteins whose spectral count is greater than a set threshold h1 and whose sequence coverage is greater than a set threshold h2; or the high-confidence proteins are the top-ranked identified proteins when the identified proteins are sorted in descending order of spectral count and sequence coverage.
3. The method of claim 1, wherein constructing the training set for the cleavage-site digestion probability prediction model comprises:
3-1) mapping the identified peptides in the identified-peptide set that are associated with the high-confidence proteins back onto the sequences of those proteins, then collecting the cleavage-site information in each protein sequence according to the mapped positions of the identified peptides; and dividing the cleavage sites into positive sites and negative sites according to the cleavage-site information;
3-2) taking the T adjacent amino acids on each side of a cleavage site to form a (2T+1)-mer of length 2T+1; if the cleavage site lies near the N-terminus or C-terminus of the protein sequence, so that there are not enough amino acids to its left or right to form the (2T+1)-mer, filling the missing positions with the placeholder "Z";
3-3) for each (2T+1)-mer, converting every amino acid except the one in the middle position into a 21-dimensional 0-1 vector, so that each (2T+1)-mer is converted into a 42T-dimensional 0-1 vector; setting the label of the positive sites to 1 and the label of the negative sites to 0.
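As an illustrative sketch of steps 3-2) and 3-3), the code below extracts the window of 2T+1 residues around a cleavage site, pads it with "Z", and one-hot encodes it. It assumes that the 21 dimensions correspond to the 20 standard amino acids plus the placeholder "Z"; the claim does not spell this out, and the names `ALPHABET`, `window` and `encode` are hypothetical.

```python
from typing import List

# Assumption: 20 standard amino acids plus the placeholder "Z" give 21 symbols.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"


def window(sequence: str, site: int, t: int) -> str:
    """Return the (2T+1)-mer centred on 0-based index `site`, padded with 'Z'."""
    padded = "Z" * t + sequence + "Z" * t
    # In the padded string the site sits at index site + t, so the window
    # starts at index site and spans 2T+1 characters.
    return padded[site : site + 2 * t + 1]


def encode(mer: str) -> List[int]:
    """One-hot encode every residue except the central one (2T x 21 = 42T values)."""
    t = len(mer) // 2
    vector: List[int] = []
    for i, residue in enumerate(mer):
        if i == t:
            continue  # the cleavage-site residue itself is not encoded
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(residue)] = 1  # raises ValueError for unknown symbols
        vector.extend(one_hot)
    return vector


# Site at index 1 (the K) of a short sequence, with T = 3.
mer = window("MKWVTFISLLR", site=1, t=3)
features = encode(mer)
```

Here only one residue precedes the site, so two "Z" placeholders fill the left side of the window, and the resulting vector has 42 x 3 = 126 entries.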
4. The method of claim 3, wherein the cleavage-site information comprises: the sum of the spectral counts of the peptides on the left side of the cleavage site, denoted as parameter L; the sum of the spectral counts of the peptides on the right side of the cleavage site, denoted as parameter R; and the sum of the spectral counts of the peptides in which the cleavage site is a missed cleavage site, denoted as parameter O; cleavage sites satisfying conditions 1) to 3) are classified as positive sites: 1) L is greater than or equal to 1, 2) R is greater than or equal to 1, 3) O is equal to 0; cleavage sites satisfying conditions a) to c) are classified as negative sites: a) L is equal to 0, b) R is equal to 0, c) O is greater than or equal to 2.
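The labelling rule of claim 4 can be sketched as below. The sketch reads each group of conditions as a conjunction (all three must hold) and leaves every other site unlabelled; both points are an interpretation of the claim, and the function name `label_site` is hypothetical.

```python
from typing import Optional


def label_site(left: int, right: int, missed: int) -> Optional[int]:
    """Label a cleavage site from its spectral-count sums L, R and O.

    Returns 1 for a positive site, 0 for a negative site, and None when the
    site meets neither set of conditions (assumed to be excluded from training).
    """
    if left >= 1 and right >= 1 and missed == 0:
        return 1  # cut on both sides, never observed as a missed cleavage
    if left == 0 and right == 0 and missed >= 2:
        return 0  # only ever observed inside peptides, at least twice
    return None
```

For example, a site with L = 3, R = 2, O = 0 is positive, a site with L = 0, R = 0, O = 4 is negative, and a site with L = 1, R = 0, O = 1 is left out.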
5. The method of claim 3, wherein predicting the digestion probability of all theoretical digested peptides comprises:
5-1) obtaining the (2T+1)-mers corresponding to the left-end cleavage site, the right-end cleavage site and the missed cleavage sites of every theoretical digested peptide, and converting them into 42T-dimensional 0-1 vectors;
5-2) using the digestion probability prediction model to predict the digestion probability of each cleavage site from step 5-1);
5-3) calculating the digestion probability of every theoretical digested peptide from the cleavage-site digestion probabilities predicted in step 5-2), according to the formula (not reproduced in this text); wherein e_pep denotes the digestion probability of the peptide, e_l denotes the digestion probability of the cleavage site at the left end of the peptide, e_r denotes the digestion probability of the cleavage site at the right end of the peptide, e_i denotes the digestion probability of the i-th missed cleavage site in the peptide, and n denotes the number of missed cleavage sites in the peptide.
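The formula of step 5-3) did not survive in this text. Purely as an illustration, the sketch below assumes that cleavage events at different sites are independent, so that a peptide is released when both terminal sites are cut and none of its n internal missed cleavage sites is cut, i.e. e_pep = e_l x e_r x (1 - e_1) x ... x (1 - e_n). This product form is an assumption made for the example and may differ from the claimed formula.

```python
import math
from typing import Sequence


def peptide_digestion_probability(
    e_l: float, e_r: float, e_missed: Sequence[float]
) -> float:
    """Combine site-level probabilities into a peptide-level probability.

    ASSUMPTION (not taken from the patent text): sites are independent, both
    termini must be cut, and every internal missed cleavage site stays uncut.
    """
    uncut_internal = math.prod(1.0 - e for e in e_missed)  # 1.0 when n == 0
    return e_l * e_r * uncut_internal
```

Under this assumption a peptide with terminal probabilities 0.9 and 0.8 and one missed cleavage site of probability 0.5 gets 0.9 x 0.8 x 0.5 = 0.36.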
6. The method of claim 1, wherein performing theoretical digestion on all high-confidence proteins means digesting the sequences of the high-confidence proteins in silico into theoretical digested peptides; wherein the maximum and minimum allowed peptide lengths are set to the maximum and minimum lengths, respectively, of all identified peptides, and the maximum number of missed cleavage sites allowed in a peptide is set to 2.
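A minimal sketch of the in silico digestion in claim 6 follows. The claim does not name the enzyme, so the sketch assumes a simple trypsin-like rule (cleave after K or R, with no proline exception); the function name `theoretical_digest` and its parameters are hypothetical. In the claimed method `min_len` and `max_len` would be taken from the shortest and longest identified peptides.

```python
from typing import List


def theoretical_digest(
    sequence: str,
    min_len: int,
    max_len: int,
    max_missed: int = 2,
    cleave_after: str = "KR",  # ASSUMPTION: trypsin-like specificity
) -> List[str]:
    """Enumerate theoretical peptides with at most `max_missed` missed cleavages."""
    # Fragment boundaries: sequence start, the position after each cleavage
    # residue (the last residue is skipped to avoid a duplicate end), and the end.
    cuts = [0]
    cuts += [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in cleave_after]
    cuts.append(len(sequence))

    peptides: List[str] = []
    for start in range(len(cuts) - 1):
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(cuts):
                break  # ran past the C-terminus
            peptide = sequence[cuts[start] : cuts[end]]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides


peptides = theoretical_digest("AAKGGRCCKDD", min_len=3, max_len=6)
```

Each peptide produced with `missed` greater than zero contains that many internal cleavage sites, which are the missed cleavage sites referred to in claim 5.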
7. The method of claim 1, wherein constructing the peptide detectability training set comprises: among all theoretical digested peptides obtained in step 3), classifying the peptides that have been identified with a spectral count greater than 1 as positive peptides, and classifying the peptides that have not been identified as negative peptides; setting the label of the positive peptides to 1 and the label of the negative peptides to 0; and then calculating the attribute features of the positive peptides and the negative peptides respectively and converting them into numerical vectors, the attribute features including the digestion probability.
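The labelling in claim 7 can be sketched as follows. Under the wording of the claim, a peptide identified exactly once is neither positive nor negative, so the sketch leaves it out of the training set; that treatment, and the name `label_peptides`, are assumptions made for illustration.

```python
from typing import Dict, Iterable


def label_peptides(
    theoretical: Iterable[str], spectral_counts: Dict[str, int]
) -> Dict[str, int]:
    """Assign 1 to confidently identified peptides and 0 to unidentified ones."""
    labels: Dict[str, int] = {}
    for peptide in theoretical:
        count = spectral_counts.get(peptide, 0)  # absent means never identified
        if count > 1:
            labels[peptide] = 1
        elif count == 0:
            labels[peptide] = 0
        # count == 1: matches neither definition, so it is skipped (assumption)
    return labels
```

The feature vector of each labelled peptide would then include its digestion probability alongside the other attribute features.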
8. The method of claim 1, wherein training the peptide detectability model comprises: performing feature selection on the peptide detectability training set using the mRMR feature selection method; then training random forest models successively on the selected attribute features, and determining the m features for which the random forest model performs best; and then training a random forest model on the peptide detectability training set using those m features.
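One way to read the model selection in claim 8 is as a search over growing prefixes of the mRMR-ranked feature list. The sketch below shows only that search; `evaluate` is a hypothetical callback standing in for "train a random forest on these features and return a validation score", and the prefix-wise reading is itself an assumption.

```python
from typing import Callable, List, Sequence, Tuple


def best_feature_prefix(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Try the top-1, top-2, ... ranked features and keep the best-scoring set."""
    best_features: List[str] = []
    best_score = float("-inf")
    for m in range(1, len(ranked_features) + 1):
        candidate = list(ranked_features[:m])
        score = evaluate(candidate)  # e.g. a cross-validated score (hypothetical)
        if score > best_score:  # strict '>' keeps the smaller set on ties
            best_features, best_score = candidate, score
    return best_features, best_score
```

The returned list corresponds to the m features on which the final random forest model would be trained.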
9. The method of claim 1, wherein predicting the peptide detectability of all theoretical digested peptides of the specified protein comprises: theoretically digesting the sequence of the specified protein into a number of theoretical digested peptides, and using the cleavage-site digestion probability prediction model to predict the digestion probability of each theoretical digested peptide; then generating a corresponding numerical vector from the attribute features of each theoretical digested peptide of the specified protein, the attribute features of a theoretical digested peptide including its digestion probability; and then inputting the numerical vectors into the trained peptide detectability model to predict the peptide detectability of the theoretical digested peptides of the specified protein.
10. The method of claim 1, wherein the cleavage-site digestion probability prediction model is a random forest model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810901107.1A CN109243527B (en) | 2018-08-09 | 2018-08-09 | Enzyme digestion probability-assisted peptide fragment detectability prediction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810901107.1A CN109243527B (en) | 2018-08-09 | 2018-08-09 | Enzyme digestion probability-assisted peptide fragment detectability prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109243527A true CN109243527A (en) | 2019-01-18 |
CN109243527B CN109243527B (en) | 2020-04-17 |
Family
ID=65071437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810901107.1A Active CN109243527B (en) | 2018-08-09 | 2018-08-09 | Enzyme digestion probability-assisted peptide fragment detectability prediction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109243527B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103116713A (en) * | 2013-02-25 | 2013-05-22 | 浙江大学 | Method of predicting interaction between chemical compounds and proteins based on random forest |
CN107609352A (en) * | 2017-11-02 | 2018-01-19 | 中国科学院新疆理化技术研究所 | A kind of Forecasting Methodology of protein self-interaction |
Non-Patent Citations (1)
Title |
---|
常乘: "定量蛋白质组算法研究与应用", 《中国博士学位论文全文数据库 基础科学辑》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110349621A (en) * | 2019-06-04 | 2019-10-18 | 中国科学院计算技术研究所 | Peptide fragment-spectrogram matching confidence the method for inspection, system, storage medium and device |
CN114093415A (en) * | 2021-11-19 | 2022-02-25 | 中国科学院数学与系统科学研究院 | Peptide fragment detectability prediction method |
CN114093415B (en) * | 2021-11-19 | 2022-06-03 | 中国科学院数学与系统科学研究院 | Peptide fragment detectability prediction method and system |
Also Published As
Publication number | Publication date |
---|---|
CN109243527B (en) | 2020-04-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Pedrioli | Trans-proteomic pipeline: a pipeline for proteomic analysis | |
Deutsch et al. | A guided tour of the Trans‐Proteomic Pipeline | |
Wang et al. | pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry | |
WO2019050966A2 (en) | Automated sample workflow gating and data analysis | |
Alves et al. | Advancement in protein inference from shotgun proteomics using peptide detectability | |
CN105956416B (en) | A kind of method of fast automatic analyzing prokaryote protein gene group data | |
Lu et al. | A suffix tree approach to the interpretation of tandem mass spectra: applications to peptides of non-specific digestion and post-translational modifications | |
Curran et al. | Computer aided manual validation of mass spectrometry-based proteomic data | |
Luo et al. | Protein quantitation using iTRAQ: Review on the sources of variations and analysis of nonrandom missingness | |
CN109243527A (en) | A kind of peptide fragment detectability prediction technique of digestion probability auxiliary | |
WO2023207453A1 (en) | Traditional chinese medicine ingredient analysis method and system based on spectral clustering | |
CN111582315A (en) | Sample data processing method and device and electronic equipment | |
CN109346125B (en) | Rapid and accurate protein binding pocket structure alignment method | |
CN108491690A (en) | The peptide fragment quantitative efficacy prediction technique of peptide fragment in a kind of proteomics | |
CN110310706B (en) | Label-free absolute quantitative method for protein | |
CN112328951B (en) | Processing method of experimental data of analysis sample | |
CN108388774A (en) | A kind of on-line analysis of polypeptide spectrum matched data | |
Iravani et al. | An Interpretable Deep Learning Approach for Biomarker Detection in LC-MS Proteomics Data | |
CN103488913A (en) | A computational method for mapping peptides to proteins using sequencing data | |
CN111898807A (en) | Tobacco yield prediction method based on whole genome selection and application | |
CN113782094A (en) | Modification site prediction method, modification site prediction device, computer device, and storage medium | |
Xu et al. | Prediction of acetylation and succinylation in proteins based on multilabel learning RankSVM | |
Mir et al. | In vivo ChIP-Seq of nuclear receptors: a rough guide to transform frozen tissues into high-confidence genome-wide binding profiles | |
Bessant | Proteome informatics | |
Källberg et al. | An improved machine learning protocol for the identification of correct Sequest search results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||