CN109243527A - Digestion-probability-assisted peptide detectability prediction method - Google Patents

Digestion-probability-assisted peptide detectability prediction method

Info

Publication number
CN109243527A
Authority
CN
China
Prior art keywords
peptide fragment
digestion
albumen
enzyme site
restriction enzyme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810901107.1A
Other languages
Chinese (zh)
Other versions
CN109243527B (en
Inventor
常乘
付岩
高志强
朱云平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING PROTEOME RESEARCH CENTER
Institute of Pharmacology and Toxicology of AMMS
Academy of Military Medical Sciences AMMS of PLA
Academy of Mathematics and Systems Science of CAS
Original Assignee
BEIJING PROTEOME RESEARCH CENTER
Institute of Pharmacology and Toxicology of AMMS
Academy of Mathematics and Systems Science of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING PROTEOME RESEARCH CENTER, Institute of Pharmacology and Toxicology of AMMS, Academy of Mathematics and Systems Science of CAS
Priority to CN201810901107.1A
Publication of CN109243527A
Application granted
Publication of CN109243527B
Legal status: Active

Landscapes

  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention discloses a digestion-probability-assisted method for predicting peptide detectability. The steps are: 1) screening high-confidence proteins from all identified proteins; 2) constructing a training set for a cleavage-site digestion probability prediction model; 3) training the cleavage-site digestion probability prediction model; 4) performing theoretical (in silico) digestion of all high-confidence proteins to obtain theoretical digestion peptides; 5) predicting the digestion probability of all theoretical digestion peptides; 6) constructing a peptide detectability training set; 7) training a peptide detectability model; 8) predicting the peptide detectability of all theoretical digestion peptides of other proteins. In keeping with the workflow of shotgun proteomics, the invention takes the enzymatic digestion of proteins into account when predicting peptide detectability, which significantly improves prediction accuracy.

Description

Digestion-probability-assisted peptide detectability prediction method
Technical field
The present invention relates to methods for predicting peptide detectability in proteomics, and in particular to a method for predicting peptide detectability in shotgun proteomics.
Background
Targeted proteomics experiments, such as the MRM (multiple reaction monitoring) strategy, can selectively detect and quantify peptides and proteins of interest. The approach can be used to verify candidate biomarkers rapidly. The first step in developing an MRM experiment is to select representative peptides for the candidate proteins. Methods for selecting representative peptides fall into two main classes: methods based on experimental data and methods based on computation. Methods based on experimental data have limitations. First, not every protein has existing experimental data. Second, whether a peptide is detected depends on many factors, so a peptide identified in an earlier experiment is not necessarily detected in the next one. Researchers have therefore paid increasing attention to computational methods. However, the mechanism of peptide detection is still unclear, which has hindered the development of algorithms that accurately predict representative peptides.
To date, considerable effort has gone into exploring the mechanism of peptide detection. Some years ago, Le Bihan et al. (reference: Le Bihan, T., Robinson, M.D., Stewart, I.I. & Figeys, D. Definition and Characterization of a "Trypsinosome" from Specific Peptide Characteristics by Nano-HPLC-MS/MS and in Silico Analysis of Complex Protein Mixtures. J. Proteome Res. 3, 1138-1148 (2004)) and Ethier et al. (reference: Ethier, M. & Figeys, D. Strategy to Design Improved Proteomic Experiments Based on Statistical Analyses of the Chemical Properties of Identified Peptides. J. Proteome Res. 4, 2201-2206 (2005)) proposed empirical formulas based on hydrophobicity, peptide length and isoelectric point. In recent years, many machine-learning methods for predicting representative peptides have been developed. In these methods, designing the features that describe a peptide is a key problem. It is well known that many factors influence whether a peptide is detected in a proteomics experiment, such as the physicochemical properties of the peptide, the abundance of the protein it belongs to, and the identification process. Tang et al. (reference: Tang, H. et al. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics 22, e481-8 (2006)) proposed the concept of peptide detectability and built a machine-learning algorithm that predicts peptide detectability from 175 features derived from the peptide sequence. Later, Sanders et al. (reference: Sanders, W.S., Bridges, S.M., McCarthy, F.M., Nanduri, B. & Burgess, S.C. Prediction of peptides observable by mass spectrometry applied at the experimental set level. BMC Bioinformatics 8 Suppl 7, S23 (2007)), Mallick et al. (reference: Mallick, P. et al. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 25, 125-31 (2007)) and Eyers et al. (reference: Eyers, C.E. et al. CONSeQuence: Prediction of Reference Peptides for Absolute Quantitative Proteomics Using Consensus Machine Learning Approaches. Mol. Cell. Proteomics 10, M110.003384 (2011)) developed peptide detectability algorithms based on 596, 1010 and 1186 features, respectively. These algorithms mainly consider features from AAindex (reference: Kawashima, S., Ogata, H. & Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Res. 27, 368-369 (1999)) and from the peptide sequence. More recently, Muntel et al. (reference: Muntel, J. et al. Abundance-based classifier for the prediction of mass spectrometric peptide detectability upon enrichment (PPA). Mol. Cell. Proteomics 14, 430-440 (2015)) added protein abundance to the peptide detectability prediction model as an additional feature and achieved better results. However, protein abundance is usually unknown unless a mass spectrometry experiment has been carried out in advance.
Although many peptide detectability methods have been proposed, the accuracy of current methods is still unsatisfactory. How to predict peptide detectability accurately is therefore a technical problem that urgently needs to be solved.
Summary of the invention
In view of the technical problems in the prior art, the object of the present invention is to provide a more accurate peptide detectability prediction method by taking full account of information on the enzymatic digestion of proteins.
To achieve this object, the present invention provides a digestion-probability-assisted peptide detectability prediction method, comprising:
Step 1) screening high-confidence proteins from all identified proteins;
Step 2) constructing a training set for a cleavage-site digestion probability prediction model;
Step 3) training the cleavage-site digestion probability prediction model;
Step 4) performing theoretical digestion of all high-confidence proteins to obtain theoretical digestion peptides;
Step 5) predicting the digestion probability of all theoretical digestion peptides;
Step 6) constructing a peptide detectability training set;
Step 7) training a peptide detectability model;
Step 8) predicting the peptide detectability of all theoretical digestion peptides of other proteins.
In the above technical solution, in step 1), all identified proteins are first sorted in descending order by protein spectral count and, separately, by protein sequence coverage; the proteins that fall in the top 50% of both rankings are taken as high-confidence proteins. The spectral count of a protein is the sum of the spectral counts of the peptides associated with that protein. The sequence coverage of a protein is the fraction of the total protein sequence length that is covered once the sequences of its associated peptides are mapped back to the protein sequence. Alternatively, high-confidence proteins are those identified proteins whose spectral count is greater than a set threshold h1 and whose sequence coverage is greater than a set threshold h2.
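The top-50% screening rule can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the dictionary layout and the handling of ties are all assumptions.

```python
def select_high_confidence(proteins, top_fraction=0.5):
    """Keep proteins ranked in the top fraction by BOTH criteria.

    proteins: dict mapping a protein ID to (spectral_count, sequence_coverage).
    The data layout is hypothetical; ties are broken by sort order.
    """
    n_keep = int(len(proteins) * top_fraction)
    # Two independent descending rankings, as described in step 1).
    by_count = sorted(proteins, key=lambda p: proteins[p][0], reverse=True)
    by_coverage = sorted(proteins, key=lambda p: proteins[p][1], reverse=True)
    # A protein is high-confidence only if it is in the top part of both lists.
    return set(by_count[:n_keep]) & set(by_coverage[:n_keep])


example = {"P1": (120, 0.80), "P2": (90, 0.10), "P3": (5, 0.75), "P4": (3, 0.05)}
print(select_high_confidence(example))
```

Only P1 is in the top half of both rankings in this toy input, so it is the only protein retained.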
In the above technical solution, in step 2), constructing the training set for the cleavage-site digestion probability prediction model comprises:
Step 2-1) Map the identified peptides that are associated with high-confidence proteins back to the high-confidence protein sequences of step 1). In rare cases a peptide may match several positions in one protein; in that case the present invention considers only the first match. Based on the positions to which the identified peptides map, collect the cleavage-site information in the high-confidence protein sequences. The cleavage-site information comprises: 1) the sum of the spectral counts of the peptides to the left of the cleavage site, denoted L; 2) the sum of the spectral counts of the peptides to the right of the cleavage site, denoted R; 3) the sum of the spectral counts of the peptides in which the cleavage site is a missed cleavage site, denoted O. Among the cleavage sites in the high-confidence protein sequences of step 1), a site that meets all of the following conditions is classified as a positive site: 1) L is greater than or equal to 1; 2) R is greater than or equal to 1; 3) O is equal to 0. A site that meets all of the following conditions is classified as a negative site: 1) L is equal to 0; 2) R is equal to 0; 3) O is greater than or equal to 2.
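The positive/negative rule of step 2-1) reduces to a small decision function. The sketch below assumes the three counts L, R and O have already been collected for a site; the function name and the use of None for sites that fit neither class are illustrative choices.

```python
def classify_site(L, R, O):
    """Label a cleavage site from its spectral-count sums.

    L: spectral counts of peptides to the left of the site
    R: spectral counts of peptides to the right of the site
    O: spectral counts of peptides that contain the site as a missed cleavage
    Returns 1 (positive), 0 (negative) or None (site not used for training).
    """
    if L >= 1 and R >= 1 and O == 0:
        return 1  # cleaved on both sides and never observed uncleaved
    if L == 0 and R == 0 and O >= 2:
        return 0  # only ever observed as a missed cleavage
    return None   # ambiguous evidence: excluded from the training set


print(classify_site(3, 2, 0), classify_site(0, 0, 2), classify_site(1, 0, 1))
```

Sites with mixed evidence, such as L = 1, R = 0, O = 1, fall into neither class under the stated conditions and are left out of the training set in this sketch.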
Step 2-2) For each positive and negative site from step 2-1), take the T adjacent amino acids on each side of the site to form a (2T+1)-mer of length 2T+1 (T is a natural number, typically 4). If a cleavage site happens to lie near the N-terminus or C-terminus of the protein sequence, so that there are not enough amino acids to its left or right to form a (2T+1)-mer, the missing positions are filled with the placeholder "Z". For each (2T+1)-mer, every amino acid other than the central one (lysine or arginine) is converted into a 21-dimensional 0-1 vector. Each (2T+1)-mer is thus converted into a 42T-dimensional 0-1 vector. The label of the positive sites from step 2-1) is set to 1 and the label of the negative sites is set to 0.
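The window extraction and one-hot encoding of step 2-2) might look like the sketch below. The 21-letter alphabet (20 standard amino acids plus the placeholder Z), its ordering and the function names are assumptions; the patent does not specify how the 21 dimensions are ordered.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"  # 20 amino acids + placeholder "Z" (assumed order)


def site_window(protein, site_index, T=4):
    """Return the (2T+1)-mer centred on protein[site_index], padded with 'Z'."""
    padded = "Z" * T + protein + "Z" * T
    # After padding, the site sits at index site_index + T in the padded string.
    return padded[site_index:site_index + 2 * T + 1]


def encode_window(window):
    """One-hot encode the 2T flanking residues; the central K/R is skipped."""
    T = len(window) // 2
    flanks = window[:T] + window[T + 1:]
    vector = []
    for aa in flanks:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(aa)] = 1  # raises ValueError for non-standard residues
        vector.extend(one_hot)
    return vector


w = site_window("MKAAAAAAAA", 1)   # K at 0-based position 1, near the N-terminus
v = encode_window(w)
print(w, len(v), sum(v))
```

With T = 4 the vector has 8 x 21 = 168 entries, which matches the 42T dimensions stated in the text, and exactly 2T of them are 1.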
In the above technical solution, the reason a digestion probability prediction model is trained in step 3) lies in the nature of shotgun proteomics. A typical shotgun proteomics experiment can be divided into two processes: the enzymatic digestion of proteins and the detection of peptides. An easily overlooked fact is that it is unknown which peptides the protease has actually released. In other words, it is not known exactly which peptides actually enter the peptide detection process. Accurate prediction of peptide detectability therefore needs to take the digestion of proteins into account. Training the cleavage-site digestion probability prediction model means training a random forest model on the cleavage-site digestion probability training set constructed in step 2). As the number of trees in a random forest increases, its error decreases, but the running time increases. In preliminary tests by the inventors, setting the number of trees to 200 gave a good balance between accuracy and efficiency. The number of features randomly selected at each node is left at its default value, namely the square root of the total number of features. The AUC under ten-fold cross-validation is used as the measure of model performance.
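The evaluation protocol (AUC under ten-fold cross-validation) can be sketched in plain Python. The text does not name a random forest library, so the model itself is omitted; one option, as an assumption, would be scikit-learn's `RandomForestClassifier(n_estimators=200, max_features="sqrt")`. The helper names below are hypothetical.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and split into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Each fold serves once as the held-out set; the model is trained on the rest
# and its held-out scores are passed to auc().
folds = k_fold_indices(25)
print(len(folds), auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In the toy call, three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75.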
In the above technical solution, in step 4), performing theoretical digestion of all high-confidence proteins means digesting the sequences of the high-confidence proteins of step 1) in silico into theoretical digestion peptides. The parameters are set as follows: 1) the maximum and minimum allowed peptide lengths are set to the maximum and minimum lengths of all identified peptides, respectively; 2) the maximum number of missed cleavage sites allowed in a peptide is set to 2.
In the above technical solution, in step 5), predicting the digestion probability of all theoretical digestion peptides comprises:
Step 5-1) For every theoretical digestion peptide from step 4), obtain the (2T+1)-mers corresponding to its left-end cleavage site, its right-end cleavage site and its missed cleavage sites (if any). Represent these (2T+1)-mers as 42T-dimensional vectors by the method of step 2-2).
Step 5-2) Use the digestion probability prediction model trained in step 3) to predict the digestion probability of the cleavage sites from step 5-1).
Step 5-3) From the cleavage-site digestion probabilities predicted in step 5-2), calculate the digestion probability of every theoretical digestion peptide according to the following formula:
where e_pep denotes the digestion probability of the peptide;
e_l denotes the digestion probability of the cleavage site at the left end of the peptide;
e_r denotes the digestion probability of the cleavage site at the right end of the peptide;
e_i denotes the digestion probability of the i-th missed cleavage site within the peptide;
N denotes the number of missed cleavage sites in the peptide.
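The formula referred to in step 5-3) does not appear in this text. One form that is consistent with the symbol definitions above, offered here as an assumption rather than as the patent's own expression, multiplies the probabilities that both terminal sites are cleaved by the probabilities that each internal missed cleavage site is not cleaved:

```latex
e_{pep} = e_{l} \cdot e_{r} \cdot \prod_{i=1}^{N} \left( 1 - e_{i} \right)
```

Under this reading, a peptide with no missed cleavage sites (N = 0) has e_pep = e_l · e_r, and every additional missed cleavage site with a high predicted digestion probability lowers e_pep.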
In the above technical solution, in step 6), constructing the peptide detectability training set comprises:
Step 6-1) Among all theoretical digestion peptides of the high-confidence proteins from step 4), classify those that were identified with a spectral count greater than 1 as positive peptides, and those that were not identified as negative peptides. Set the label of positive peptides to 1 and the label of negative peptides to 0.
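The labelling rule of step 6-1) can be sketched as follows. The function name and data layout are hypothetical, and leaving out peptides identified with exactly one spectrum is an interpretation of the stated conditions, since such peptides are neither "spectral count greater than 1" nor "not identified".

```python
def label_peptides(theoretical_peptides, spectral_counts):
    """Assign detectability labels to theoretical digestion peptides.

    spectral_counts: dict peptide -> spectral count of identified peptides;
    peptides absent from the dict were not identified.
    """
    labels = {}
    for pep in theoretical_peptides:
        count = spectral_counts.get(pep, 0)
        if count > 1:
            labels[pep] = 1   # positive: identified with more than one spectrum
        elif count == 0:
            labels[pep] = 0   # negative: never identified
        # count == 1: neither class under the stated rule, so it is skipped
    return labels


print(label_peptides(["AAAK", "CCCR", "DDDK"], {"AAAK": 3, "CCCR": 1}))
```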
Step 6-2) Calculate 588 attributes for the positive and negative peptides of step 6-1), so that each peptide can be represented by a vector x = (x1, x2, x3, …, x588). Of these 588 attributes, the first 23 are features related to the peptide sequence, such as the peptide length, the number of missed cleavage sites in the peptide, the peptide mass and the frequency of each amino acid in the peptide. The next 544 are amino acid physicochemical properties from AAindex (reference: Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T. & Kanehisa, M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202-D205 (2008)), averaged over the peptide. The following 20 are physicochemical properties taken from earlier studies (references: Braisted, J.C. et al. BMC Bioinformatics 9, 529 (2008); Webb-Robertson, B.J. et al. Bioinformatics 26, 1677-1683 (2010); Eyers, C.E. et al. Mol. Cell. Proteomics 10, M110.003384 (2011); Tang, H. et al. Bioinformatics 22, e481-488 (2006)). The last feature is the peptide digestion probability calculated in step 5-3). This feature is proposed by the present invention and is used in a peptide detectability model for the first time.
In the above technical solution, in step 7), training the peptide detectability model comprises:
Step 7-1) Perform feature selection on the training set constructed in step 6). The feature selection method used is mRMR (reference: Ding, C. & Peng, H. Minimum Redundancy Feature Selection from Microarray Gene Expression Data. in Proceedings of the IEEE Computer Society Conference on Bioinformatics 523-- (IEEE Computer Society, 2003)). Following the ranking produced by feature selection, the top 50 attributes are added one at a time and a random forest model is trained on the training set of step 6) at each step. The test results show that the random forest model performs best with 31 features.
Step 7-2) On the peptide detectability training set constructed in step 6), train a random forest model using the 31 features selected in step 7-1). The number of trees in the random forest is set to 200. The number of features randomly selected at each node is left at its default value, namely the square root of the total number of features. As before, the AUC under ten-fold cross-validation is used as the measure of model performance. Because the peptide detectability training set is imbalanced, down-sampling is used when training the random forest (reference: Chen, C. & Liaw, A. Using random forest to learn imbalanced data. Discovery (2004)).
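Down-sampling of the majority class can be sketched as below. This is a simplified, one-off variant given only for illustration: the cited balanced random forest approach draws a balanced sample separately for each tree, and the text does not say which variant was used. The function name and seed handling are assumptions.

```python
import random


def downsample(samples, labels, seed=0):
    """Randomly drop majority-class items until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority item plus an equally sized random subset of the majority.
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in keep], [labels[i] for i in keep]


X = list(range(10))
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Xs, ys = downsample(X, y)
print(Xs, ys)
```

In the worked example of this patent the negative set is roughly seven times larger than the positive set, which is the kind of imbalance this step is meant to correct.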
In the above technical solution, in step 8), predicting the peptide detectability of all theoretical digestion peptides of other proteins comprises:
Step 8-1) Digest the protein sequences to be predicted in silico to obtain theoretical digestion peptides, and predict the digestion probability of each theoretical digestion peptide using the cleavage-site digestion probability prediction model. The digestion parameters are set as follows: the maximum allowed peptide length is 38; the minimum allowed peptide length is 6; the maximum number of missed cleavage sites allowed in a peptide is 2.
Step 8-2) Convert each theoretical digestion peptide from step 8-1) into a 588-dimensional numeric vector by the method of step 6-2).
Step 8-3) Input the numeric vectors obtained in step 8-2) into the peptide detectability model trained in step 7-2) to predict the peptide detectability of all theoretical digestion peptides from step 8-1).
The present invention also provides a device for predicting representative peptides in targeted proteomics. The device comprises a peptide and protein identification module, a peptide detectability prediction module and a representative peptide prediction module.
The peptide and protein identification module uses protein identification software to carry out the basic interpretation of the spectra.
The peptide detectability prediction module comprises the following parts:
1) constructing the peptide detectability training set;
2) training the peptide detectability model;
3) predicting the peptide detectability of all theoretical digestion peptides of the candidate proteins in targeted proteomics.
For each candidate protein, the representative peptide prediction module ranks the peptides associated with that protein by the peptide detectability calculated above and takes the top M peptides (M is typically 5) as representative peptides.
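The ranking performed by the representative peptide prediction module amounts to a sort followed by a cut-off. A minimal sketch, with a hypothetical function name and input layout:

```python
def representative_peptides(detectability, m=5):
    """Return the m peptides of one protein with the highest predicted detectability.

    detectability: dict peptide -> predicted detectability score.
    """
    ranked = sorted(detectability.items(), key=lambda kv: kv[1], reverse=True)
    return [pep for pep, _ in ranked[:m]]


scores = {"AAAK": 0.2, "CCCR": 0.9, "DDDK": 0.5}
print(representative_peptides(scores, m=2))
```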
The present invention has the following advantages:
1. It is the first to add the peptide digestion probability to a peptide detectability model.
2. In keeping with the workflow of shotgun proteomics, it takes the enzymatic digestion of proteins into account when predicting peptide detectability, which significantly improves the accuracy of peptide detectability prediction.
Brief description of the drawings
Fig. 1 is a flow chart of the digestion-probability-assisted peptide detectability prediction method of the present invention;
Fig. 2 is the ten-fold cross-validation ROC curve of the cleavage-site digestion probability prediction model;
Fig. 3 shows the performance of the peptide detectability model as the features ranked by feature selection are added one by one;
Fig. 4 is the ROC curve evaluating the performance of the peptide detectability model on the test set.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and specific embodiments.
Suppose there is a protein sample. The protein mixture is first digested by existing biochemical techniques into a peptide mixture solution, and experimental tandem mass spectrometry data are then generated by liquid chromatography-mass spectrometry. The tandem mass spectrometry data contain three dimensions of information: chromatographic retention time, mass-to-charge ratio and mass spectrometric signal intensity. Identification software is then needed to determine which peptides and proteins are present in the spectra and how the peptides relate to the proteins. Software such as MaxQuant (reference: Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26, 1367-72 (2008)) and pFind (reference: Wang, L.H. et al. pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun. Mass Spectrom. 21, 2985-2991 (2007)) provides this function. The results obtained from this sample are used to construct the training set.
Suppose there is a second protein sample, from which identified peptides and proteins are obtained by the same procedure. The results obtained from this sample are used to construct the test set.
Based on these data and with reference to Fig. 1, the implementation of the method of the present invention is described below.
From the MaxQuant identification results used to construct the training set, 43088 identified peptides and 3959 proteins were extracted. Screening by protein sequence coverage and protein spectral count yielded 1556 high-confidence proteins, corresponding to 29172 identified peptides. These identified peptides were mapped back to their protein sequences and the digestion events at each site were counted. Then, following the rules given above for constructing the digestion probability training set, 7778 cleaved sites and 4854 missed cleavage sites were obtained. These sites were vectorized to form the training set, on which a random forest model was trained. As shown in Fig. 2, the AUC of the ten-fold cross-validation of the random forest model reaches 0.9756, which shows that the digestion probability model trained by the present invention predicts the digestion behaviour of a cleavage site very well. Next, identified peptides with a spectral count greater than 1 were taken as the positive set (25363 peptides), and theoretical digestion peptides that were not identified were taken as the negative set (180284 peptides). The trained digestion probability prediction model was then used to predict the digestion probability of each peptide in the peptide detectability training set, and the other attributes of the training-set peptides were calculated.
To characterize the detectability of a peptide, the present invention converts each peptide into a 1x588 numeric vector. A peptide is essentially an ordered sequence of amino acids. One way to write amino acids is with one capital letter per amino acid; for example, alanine is written A and cysteine is written C. A peptide can thus be written as a string of letters. The feature representation is illustrated below using the peptide ARNDCEQK. In a mass spectrometer, peptides that are too short or too long cannot be detected, so peptide length is an important factor in whether a peptide is detected. The length of the example peptide is 8. Trypsin normally cleaves a protein sequence into peptides at the C-terminal side of lysine or arginine, so a lysine (K) or arginine (R) inside a peptide (not at its C-terminus) is generally regarded as the result of a missed cleavage. The digestion state of a peptide strongly affects its mass spectrometric signal, so the number of missed cleavage sites in the peptide is also an important feature. The peptide ARNDCEQK, for example, contains exactly one missed cleavage site, R. Adding up the masses of the amino acids in the peptide gives a peptide mass of 963.43 Da. There are 20 common amino acids in biology, and the present invention represents the amino acid composition of a peptide by a 20-dimensional amino acid frequency vector. For example, with the amino acids in a fixed order, each amino acid in ARNDCEQK occurs exactly once; dividing by the peptide length of 8 gives a feature value of 1/8 at the position of each of these amino acids and 0 at the positions of the remaining amino acids. According to the AAindex database, each amino acid has 544 quantified physicochemical properties, and the values for the amino acids in a peptide are averaged to give the features of the peptide. For example, suppose the 544 physicochemical properties of the amino acids in the peptide ARNDCEQK are given by the vectors $v_A, v_R, v_N, v_D, v_C, v_E, v_Q, v_K$.
Then the feature of the peptide is $\bar{v} = \frac{1}{8}\left(v_A + v_R + v_N + v_D + v_C + v_E + v_Q + v_K\right)$,
where each $v$ denotes a 1x544 vector.
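The sequence-derived features of the ARNDCEQK example can be reproduced with the sketch below. The residue masses are standard monoisotopic values; treating the quoted 963.43 Da as residue masses plus one water and one proton is an assumption that happens to match the number in the text. The function names and the alphabetical amino acid order are illustrative.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER, PROTON = 18.010565, 1.007276
ORDER = sorted(MONO)  # a fixed (here alphabetical) amino acid order


def sequence_features(pep):
    """Return length, missed-cleavage count, mass and 20-dim frequency vector."""
    length = len(pep)
    missed = sum(aa in "KR" for aa in pep[:-1])  # K/R not at the C-terminus
    mass = sum(MONO[aa] for aa in pep) + WATER + PROTON  # assumed [M+H]+ convention
    freq = [pep.count(aa) / length for aa in ORDER]
    return length, missed, mass, freq


def mean_property(pep, table):
    """Average per-residue property vectors over the peptide (AAindex-style)."""
    n = len(next(iter(table.values())))
    return [sum(table[aa][k] for aa in pep) / len(pep) for k in range(n)]


length, missed, mass, freq = sequence_features("ARNDCEQK")
print(length, missed, round(mass, 2), freq.count(1 / 8))
```

`mean_property` takes any table of per-residue property vectors; with the 544 AAindex properties it would give the averaged feature described above, but the table itself is not reproduced here.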
Next, the last 20 physicochemical properties of the peptide are calculated with reference to earlier work (Braisted, J.C. et al. BMC Bioinformatics 9, 529 (2008); Webb-Robertson, B.J. et al. Bioinformatics 26, 1677-1683 (2010); Eyers, C.E. et al. Mol. Cell. Proteomics 10, M110.003384 (2011); Tang, H. et al. Bioinformatics 22, e481-488 (2006)). Note that calculating these features uses not only the amino acid sequence of the peptide itself but also the amino acid sequence adjacent to the peptide.
Finally, the peptide digestion probability calculated in the previous step is taken as the last, 588th, feature of the peptide.
The features used by the present invention cover nearly all features used in peptide detectability studies to date. Having so many features, however, brings a problem: redundancy among them. To address this, the present invention uses the mRMR method for feature selection, obtaining 50 features that are highly relevant to whether a peptide is detected and minimally redundant with one another. To determine how many features are needed, these 50 features are added to the random forest model one by one and the change in the ten-fold cross-validation AUC is recorded. As shown in Fig. 3, the model performs best with 31 features; adding further features degrades performance.
For the MaxQuant identification results used to construct the test set, each theoretical digestion peptide of each protein is represented as a 1x31 vector following the steps above. The peptide detectability of these theoretical digestion peptides is predicted with the trained peptide detectability model. As shown in Fig. 4, the trained model still shows good predictive performance on the test set, which indicates that the model of the present invention generalizes well.
This completes the prediction of peptide detectability for all peptides.
The present invention also provides a device for predicting representative peptides in targeted proteomics. The device comprises a peptide and protein identification module, a peptide detectability prediction module and a representative peptide prediction module.
The peptide and protein identification module uses protein identification software to carry out the basic interpretation of the spectra.
The peptide detectability prediction module comprises the following parts:
1) constructing the peptide detectability training set;
2) training the peptide detectability model;
3) predicting the peptide detectability of all theoretical digestion peptides of the candidate proteins in targeted proteomics.
For each candidate protein, the representative peptide prediction module ranks the peptides associated with that protein by the peptide detectability calculated above and takes the top N peptides (N is typically 5) as representative peptides.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art should understand that modifications or equivalent substitutions of the technical solution of the present invention that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims (10)

1. A digestion-probability-assisted peptide detectability prediction method, comprising the steps of:
1) screening high-confidence proteins from the identified proteins;
2) constructing a training set for a cleavage-site digestion probability prediction model, and training the cleavage-site digestion probability prediction model;
3) performing theoretical digestion on all high-confidence proteins to obtain theoretical digested peptides;
4) predicting the digestion probability of all theoretical digested peptides using the cleavage-site digestion probability prediction model;
5) constructing a peptide detectability training set, and training a peptide detectability model;
6) predicting, using the peptide detectability model, the peptide detectability of all theoretical digested peptides of a specified protein.
2. The method of claim 1, characterized in that the high-confidence proteins are those identified proteins whose spectral count is greater than a set threshold h1 and whose sequence coverage is greater than a set threshold h2; or the high-confidence proteins are the top-ranked identified proteins when sorted in descending order of spectral count and sequence coverage.
3. The method of claim 1, characterized in that the step of constructing the training set for the cleavage-site digestion probability prediction model comprises:
3-1) tracing the identified peptides in the identified peptide set that are associated with the high-confidence proteins back to the protein sequences of the high-confidence proteins, then collecting cleavage-site information in the protein sequences according to the positions to which the identified peptides map; and dividing the cleavage sites into positive sites and negative sites according to the cleavage-site information;
3-2) taking the T adjacent amino acids on each side of a cleavage site to form a (2T+1)-mer of length 2T+1; if the cleavage site lies near the N-terminus or C-terminus of the protein sequence, so that there are not enough amino acids to the left or right of the site to form the (2T+1)-mer, padding with the placeholder "Z";
3-3) for each (2T+1)-mer, converting each amino acid other than the one at the central position into a 21-dimensional 0-1 vector, thereby converting each (2T+1)-mer into a 42T-dimensional 0-1 vector; setting the label of the positive sites to 1 and the label of the negative sites to 0.
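The window extraction and 0-1 encoding of steps 3-2) and 3-3) can be illustrated with a short sketch. It assumes that the site index points at the residue in the central position of the window, and that the 21 dimensions are the 20 standard amino acids plus the placeholder "Z"; the source does not spell out either detail, and the names `site_window` and `encode_window` are hypothetical.

```python
# 20 standard amino acids plus the placeholder Z (assumed 21-letter alphabet).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def site_window(sequence, pos, t):
    """Return the (2T+1)-mer centered on 0-based index pos, padded with Z."""
    left = sequence[max(0, pos - t):pos]
    right = sequence[pos + 1:pos + 1 + t]
    return "Z" * (t - len(left)) + left + sequence[pos] + right + "Z" * (t - len(right))

def encode_window(window):
    """One-hot encode every residue except the central one: 2T x 21 = 42T values."""
    t = len(window) // 2
    flanks = window[:t] + window[t + 1:]
    vec = []
    for aa in flanks:
        one_hot = [0] * len(ALPHABET)
        one_hot[INDEX[aa]] = 1  # raises KeyError for non-standard residues
        vec.extend(one_hot)
    return vec
```

For T = 3 each site yields a 126-dimensional vector containing exactly six ones, one per flanking position.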
4. The method of claim 3, characterized in that the cleavage-site information comprises: the sum of the spectral counts of the peptides on the left of the cleavage site, denoted as parameter L; the sum of the spectral counts of the peptides on the right of the cleavage site, denoted as parameter R; and the sum of the spectral counts of the peptides in which the cleavage site is a missed cleavage site, denoted as parameter O; cleavage sites satisfying conditions 1) to 3) are classified as positive sites: 1) L is greater than or equal to 1, 2) R is greater than or equal to 1, 3) O is equal to 0; cleavage sites satisfying conditions a) to c) are classified as negative sites: a) L is equal to 0, b) R is equal to 0, c) O is greater than or equal to 2.
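The labeling rule of claim 4 can be written as a small function. The sketch assumes that the three conditions in each group must hold simultaneously and that sites meeting neither group are left out of the training set; the function name `label_site` is hypothetical.

```python
def label_site(l, r, o):
    """Label a cleavage site from its spectral-count sums L, R and O.

    Returns 1 for a positive site, 0 for a negative site, and None for a
    site that satisfies neither rule (assumed to be excluded from training).
    """
    if l >= 1 and r >= 1 and o == 0:
        return 1
    if l == 0 and r == 0 and o >= 2:
        return 0
    return None
```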
5. The method of claim 3, characterized in that the step of predicting the digestion probability of all theoretical digested peptides comprises:
5-1) obtaining the (2T+1)-mers corresponding to the left cleavage site, the right cleavage site and the missed cleavage sites of all theoretical digested peptides, and converting them into 42T-dimensional 0-1 vectors;
5-2) predicting the digestion probability of each cleavage site in step 5-1) using the digestion probability prediction model;
5-3) calculating the digestion probability of all theoretical digested peptides from the digestion probabilities of the cleavage sites predicted in step 5-2), according to the formula [formula image not preserved in this text]; wherein e_pep denotes the digestion probability of the peptide, e_l denotes the digestion probability of the cleavage site at the left end of the peptide, e_r denotes the digestion probability of the cleavage site at the right end of the peptide, e_i denotes the digestion probability of the i-th missed cleavage site in the peptide, and n denotes the number of missed cleavage sites in the peptide.
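Because the formula of step 5-3) is not reproduced in this text, the following sketch only shows one form that is consistent with the symbol definitions: both terminal sites are cleaved and every missed cleavage site stays uncleaved, that is, e_pep = e_l × e_r × Π(1 − e_i) for i = 1..n. This product form is an assumption, not a quotation of the claim, and the function name is hypothetical.

```python
import math

def peptide_digestion_probability(e_left, e_right, e_missed):
    """Assumed form: e_pep = e_l * e_r * prod(1 - e_i) over missed sites.

    e_left, e_right: predicted digestion probabilities of the terminal sites.
    e_missed: iterable of predicted probabilities for the missed cleavage sites.
    """
    return e_left * e_right * math.prod(1.0 - e for e in e_missed)
```

Under this assumption a peptide without missed cleavage sites (n = 0) reduces to the product of its two terminal site probabilities.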
6. The method of claim 1, characterized in that performing theoretical digestion on all high-confidence proteins means digesting the sequences of the high-confidence proteins in silico into theoretical digested peptides; wherein the maximum and minimum allowed peptide lengths are set to the maximum and minimum lengths of all identified peptides, respectively, and the maximum number of missed cleavage sites allowed in a peptide is set to 2.
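An in silico digestion with length limits and up to two missed cleavage sites can be sketched as below. The enzyme rule (cleavage after K or R, with no exception for a following proline) is an assumption chosen for illustration, since the claim does not name an enzyme; the function name and parameters are likewise hypothetical.

```python
def theoretical_digest(sequence, min_len, max_len, max_missed=2, cleave_after="KR"):
    """Return (peptide, missed_cleavages) pairs for one protein sequence."""
    # Cut positions: sequence start, the position after each cleavable
    # residue (except the last residue), and the sequence end.
    cuts = [0]
    cuts += [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in cleave_after]
    cuts.append(len(sequence))
    peptides = []
    for i in range(len(cuts) - 1):
        for missed in range(max_missed + 1):
            j = i + 1 + missed
            if j >= len(cuts):
                break
            pep = sequence[cuts[i]:cuts[j]]
            if min_len <= len(pep) <= max_len:
                peptides.append((pep, missed))
    return peptides
```

In the method described here, `min_len` and `max_len` would be taken from the shortest and longest identified peptides.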
7. The method of claim 1, characterized in that the step of constructing the peptide detectability training set comprises: among all theoretical digested peptides obtained in step 3), classifying those that were identified with a spectral count greater than 1 as positive peptides, and classifying those that were not identified as negative peptides; setting the label of the positive peptides to 1 and the label of the negative peptides to 0; then computing the attribute features of the positive peptides and the negative peptides separately and converting them into numerical vectors, the attribute features including the digestion probability.
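The labeling in claim 7 leaves peptides that were identified with a spectral count of exactly 1 in neither class. A minimal sketch, assuming such peptides are dropped from the training set and that spectral counts are available as a mapping from peptide sequence to count (both assumptions; the function name is hypothetical):

```python
def label_peptides(theoretical_peptides, spectral_counts):
    """Map each usable theoretical peptide to a label: 1 positive, 0 negative.

    spectral_counts: dict of peptide -> spectral count for identified peptides;
    peptides absent from the dict are treated as not identified.
    """
    labels = {}
    for pep in theoretical_peptides:
        count = spectral_counts.get(pep, 0)
        if count > 1:
            labels[pep] = 1
        elif count == 0:
            labels[pep] = 0
        # count == 1: ambiguous evidence, assumed to be excluded
    return labels
```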
8. The method of claim 1, characterized in that the step of training the peptide detectability model comprises: performing feature selection on the peptide detectability training set using the feature selection method mRMR; then training random forest models successively on the selected attribute features, and determining the m features for which the random forest model performs best; then training a random forest model on the peptide detectability training set using the m features.
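One way to read "training random forest models successively" is to evaluate the top-1, top-2, ... prefixes of the mRMR-ranked feature list and keep the prefix with the best score. The sketch below shows only that selection loop. The `evaluate` callback, which would train a random forest on the given features and return a validation score, is a hypothetical placeholder, and the prefix interpretation itself is an assumption.

```python
def select_feature_count(ranked_features, evaluate):
    """Choose the best-scoring prefix of an mRMR-ranked feature list.

    ranked_features: feature names ordered by mRMR rank (best first).
    evaluate: callable taking a list of features and returning a score,
              e.g. cross-validated performance of a random forest (assumed).
    Returns (selected_features, best_score).
    """
    best_score, best_m = float("-inf"), 0
    for m in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:m])
        if score > best_score:  # ties keep the smaller feature set
            best_score, best_m = score, m
    return ranked_features[:best_m], best_score
```

The final model would then be retrained on the full training set with the selected m features.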
9. The method of claim 1, characterized in that the step of predicting the peptide detectability of all theoretical digested peptides of the specified protein comprises: theoretically digesting the protein sequence of the specified protein into a number of theoretical digested peptides, and predicting the digestion probability of each theoretical digested peptide using the cleavage-site digestion probability prediction model; then generating a corresponding numerical vector from the attribute features of each theoretical digested peptide of the specified protein, the attribute features including the corresponding digestion probability; then inputting the numerical vectors into the trained peptide detectability model to predict the peptide detectability of the theoretical digested peptides of the specified protein.
10. The method of claim 1, characterized in that the cleavage-site digestion probability prediction model is a random forest model.
CN201810901107.1A 2018-08-09 2018-08-09 Enzyme digestion probability-assisted peptide fragment detectability prediction method Active CN109243527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810901107.1A CN109243527B (en) 2018-08-09 2018-08-09 Enzyme digestion probability-assisted peptide fragment detectability prediction method

Publications (2)

Publication Number Publication Date
CN109243527A (en) 2019-01-18
CN109243527B (en) 2020-04-17

Family

ID=65071437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810901107.1A Active CN109243527B (en) 2018-08-09 2018-08-09 Enzyme digestion probability-assisted peptide fragment detectability prediction method

Country Status (1)

Country Link
CN (1) CN109243527B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116713A (en) * 2013-02-25 2013-05-22 浙江大学 Method of predicting interaction between chemical compounds and proteins based on random forest
CN107609352A (en) * 2017-11-02 2018-01-19 中国科学院新疆理化技术研究所 A kind of Forecasting Methodology of protein self-interaction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
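常乘 (Chang Cheng): "定量蛋白质组算法研究与应用" [Research and Application of Quantitative Proteomics Algorithms], 《中国博士学位论文全文数据库 基础科学辑》 [China Doctoral Dissertations Full-text Database, Basic Sciences] *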

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349621A (en) * 2019-06-04 2019-10-18 中国科学院计算技术研究所 Peptide fragment-spectrogram matching confidence the method for inspection, system, storage medium and device
CN114093415A (en) * 2021-11-19 2022-02-25 中国科学院数学与系统科学研究院 Peptide fragment detectability prediction method
CN114093415B (en) * 2021-11-19 2022-06-03 中国科学院数学与系统科学研究院 Peptide fragment detectability prediction method and system

Also Published As

Publication number Publication date
CN109243527B (en) 2020-04-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant