CN109243527A

CN109243527A - A digestion-probability-assisted method for predicting peptide detectability

Info

Publication number: CN109243527A
Application number: CN201810901107.1A
Authority: CN
Inventors: 常乘; 付岩; 高志强; 朱云平
Original assignee: BEIJING PROTEOME RESEARCH CENTER; Institute of Pharmacology and Toxicology of AMMS; Academy of Mathematics and Systems Science of CAS
Current assignee: BEIJING PROTEOME RESEARCH CENTER; Institute of Pharmacology and Toxicology of AMMS; Academy of Military Medical Sciences AMMS of PLA; Academy of Mathematics and Systems Science of CAS
Priority date: 2018-08-09
Filing date: 2018-08-09
Publication date: 2019-01-18
Anticipated expiration: 2038-08-09
Also published as: CN109243527B

Abstract

The invention discloses a digestion-probability-assisted method for predicting peptide detectability. The steps are: 1) screen high-confidence proteins from all identified proteins; 2) construct a training set for the cleavage-site cleavage probability prediction model; 3) train the cleavage-site cleavage probability prediction model; 4) perform theoretical digestion on all high-confidence proteins to obtain theoretical digestion peptides; 5) predict the digestion probability of all theoretical digestion peptides; 6) construct a peptide detectability training set; 7) train the peptide detectability model; 8) predict the peptide detectability of all theoretical digestion peptides of other proteins. In keeping with the shotgun proteomics workflow, the invention takes the enzymatic digestion of proteins into account when predicting peptide detectability, which significantly improves the accuracy of peptide detectability prediction.

Description

A digestion-probability-assisted method for predicting peptide detectability

Technical field

The present invention relates to methods for predicting peptide detectability in proteomics, and in particular to a method for predicting peptide detectability in shotgun proteomics.

Background art

Targeted proteomics experiments, such as the MRM experimental strategy, can selectively detect and quantify peptides and proteins of interest. This approach can be used to verify candidate biomarkers rapidly. The first step in developing an MRM experiment is to select representative peptides for the candidate proteins. Methods for selecting representative peptides fall into two main classes: methods based on experimental data and methods based on computation. Methods based on experimental data have limitations. First, not all proteins have existing experimental data. Second, whether a peptide is detected is influenced by many factors, so a peptide identified in a previous experiment will not necessarily be detected in the next one. Researchers have therefore paid increasing attention to computational methods. However, the mechanism of peptide detection remains unclear, which hinders the development of algorithms that accurately predict representative peptides.

To date, scientists have made considerable efforts to explore the mechanism of peptide detection. Some years ago, Le Bihan et al. (reference: Le Bihan, T., Robinson, M.D., Stewart, I.I. & Figeys, D. Definition and Characterization of a "Trypsinosome" from Specific Peptide Characteristics by Nano-HPLC-MS/MS and in Silico Analysis of Complex Protein Mixtures. J. Proteome Res. 3, 1138-1148 (2004)) and Ethier et al. (reference: Ethier, M. & Figeys, D. Strategy to Design Improved Proteomic Experiments Based on Statistical Analyses of the Chemical Properties of Identified Peptides. J. Proteome Res. 4, 2201-2206 (2005)) proposed empirical formulas based on hydrophobicity, peptide length and isoelectric point. In recent years, many machine-learning-based methods for predicting representative peptides have been developed. In these methods, designing the features that describe a peptide is a key issue. It is well known that many factors affect whether a peptide can be detected in a proteomics experiment, such as the physicochemical properties of the peptide, the abundance of the protein to which the peptide belongs, and the identification process. Tang et al. (reference: Tang, H. et al. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics 22, e481-8 (2006)) proposed the concept of peptide detectability and built a machine learning algorithm for predicting peptide detectability from 175 features derived from the peptide sequence. Later, Sanders et al. (reference: Sanders, W.S., Bridges, S.M., McCarthy, F.M., Nanduri, B. & Burgess, S.C. Prediction of peptides observable by mass spectrometry applied at the experimental set level. BMC Bioinformatics 8 Suppl 7, S23 (2007)), Mallick et al. (reference: Mallick, P. et al. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 25, 125-31 (2007)) and Eyers et al. (reference: Eyers, C.E. et al. CONSeQuence: Prediction of Reference Peptides for Absolute Quantitative Proteomics Using Consensus Machine Learning Approaches. Mol. Cell. Proteomics 10, M110.003384-M110.003384 (2011)) developed peptide detectability algorithms based on 596, 1010 and 1186 features, respectively. These algorithms mainly consider features derived from AAindex (reference: Kawashima, S., Ogata, H. & Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Res. 27, 368-369 (1999)) and from the peptide sequence. More recently, Muntel et al. (reference: Muntel, J. et al. Abundance-based classifier for the prediction of mass spectrometric peptide detectability upon enrichment (PPA). Mol. Cell. Proteomics 14, 430-440 (2015)) added protein abundance as an additional feature to the peptide detectability prediction model and achieved better results. However, protein abundance is usually unknown unless a mass spectrometry experiment has been carried out in advance.

Although many methods for predicting peptide detectability have been proposed, their accuracy remains unsatisfactory. How to predict peptide detectability accurately is therefore a technical problem that urgently needs to be solved.

Summary of the invention

In view of the technical problems in the prior art, the object of the present invention is to provide a more accurate method for predicting peptide detectability by taking full account of protein digestion information.

To achieve this object, the present invention provides a digestion-probability-assisted method for predicting peptide detectability, comprising:

Step 1) screening high-confidence proteins from all identified proteins;

Step 2) constructing a training set for the cleavage-site cleavage probability prediction model;

Step 3) training the cleavage-site cleavage probability prediction model;

Step 4) performing theoretical digestion on all high-confidence proteins to obtain theoretical digestion peptides;

Step 5) predicting the digestion probability of all theoretical digestion peptides;

Step 6) constructing a peptide detectability training set;

Step 7) training the peptide detectability model;

Step 8) predicting the peptide detectability of all theoretical digestion peptides of other proteins.

In the above technical solution, in step 1), all identified proteins are first sorted in descending order by protein spectral count and, separately, by protein sequence coverage; the proteins that rank in the top 50% of both orderings are then taken as high-confidence proteins. The spectral count of a protein is the sum of the spectral counts of the peptides associated with that protein. The sequence coverage of a protein is the proportion of the total protein sequence length that is covered once the sequences of the peptides associated with the protein have been mapped back to the protein sequence. Alternatively, the high-confidence proteins are those identified proteins whose spectral count is greater than a set threshold h1 and whose sequence coverage is greater than a set threshold h2.
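The screening of step 1) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the invention: the function names, the toy protein names and the toy scores are hypothetical, and "top 50% of both orderings" is read as the intersection of the two top halves.

```python
def sequence_coverage(protein, peptides):
    """Fraction of protein residues covered by at least one identified peptide."""
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)  # only the first match is considered
        if start >= 0:
            for i in range(start, start + len(peptide)):
                covered[i] = True
    return sum(covered) / len(protein)


def top_fraction(scores, fraction=0.5):
    """Names that fall in the top `fraction` of a descending sort by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:int(len(ranked) * fraction)])


def select_high_confidence(spectral_counts, coverages, fraction=0.5):
    """Proteins in the top fraction of both the spectral-count and coverage rankings."""
    return top_fraction(spectral_counts, fraction) & top_fraction(coverages, fraction)


# Toy data (hypothetical): protein name -> spectral count / sequence coverage.
spectral_counts = {"P1": 40, "P2": 5, "P3": 22, "P4": 1}
coverages = {"P1": 0.6, "P2": 0.7, "P3": 0.1, "P4": 0.05}
high_confidence = select_high_confidence(spectral_counts, coverages)
```

The threshold variant (spectral count above h1 and coverage above h2) would replace the two rankings with two simple comparisons.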

In the above technical solution, in step 2), constructing the training set of the cleavage-site cleavage probability prediction model comprises:

Step 2-1) The identified peptides in the identified peptide set that are associated with high-confidence proteins are mapped back to the high-confidence protein sequences of step 1). Occasionally a peptide may match several positions in one protein; in that case the invention considers only the first match. Cleavage-site information in the high-confidence protein sequences is collected according to the mapped positions of the identified peptides. The cleavage-site information comprises: 1) the sum of the spectral counts of the peptides to the left of the cleavage site, denoted parameter L; 2) the sum of the spectral counts of the peptides to the right of the cleavage site, denoted parameter R; 3) the sum of the spectral counts of the peptides that contain the cleavage site as a missed cleavage site, denoted parameter O. Among the cleavage sites in the high-confidence protein sequences of step 1), a site that satisfies all of the following conditions is classed as a positive site: 1) L is greater than or equal to 1; 2) R is greater than or equal to 1; 3) O is equal to 0. A site that satisfies all of the following conditions is classed as a negative site: 1) L is equal to 0; 2) R is equal to 0; 3) O is greater than or equal to 2.
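The counting and classification of step 2-1) can be sketched as follows. This is a minimal illustration, not the implementation of the invention. The function name and the toy protein are hypothetical, and three readings are assumed: candidate sites are all K and R residues, a peptide "to the left" of a site is one that ends at the site, and a peptide "to the right" is one that starts immediately after it.

```python
def label_cleavage_sites(protein, peptide_counts):
    """Return {site index: 1 (positive) or 0 (negative)} for K/R sites.

    peptide_counts maps an identified peptide sequence to its spectral count.
    Sites that satisfy neither rule are left out of the result.
    """
    # stats[i] = [L, R, O] for the K/R residue at index i
    stats = {i: [0, 0, 0] for i, aa in enumerate(protein) if aa in "KR"}
    for peptide, count in peptide_counts.items():
        start = protein.find(peptide)  # only the first match is considered
        if start < 0:
            continue
        end = start + len(peptide) - 1
        if end in stats:
            stats[end][0] += count        # peptide ends at the site: L
        if start - 1 in stats:
            stats[start - 1][1] += count  # peptide starts right after the site: R
        for i in range(start, end):
            if i in stats:
                stats[i][2] += count      # site lies inside the peptide: O
    labels = {}
    for i, (L, R, O) in stats.items():
        if L >= 1 and R >= 1 and O == 0:
            labels[i] = 1
        elif L == 0 and R == 0 and O >= 2:
            labels[i] = 0
    return labels


# Toy example (hypothetical data).
site_labels = label_cleavage_sites(
    "AAKGGRCCKDD", {"AAK": 3, "GGR": 2, "CCKDD": 2}
)
```

In the toy protein, the K at index 2 and the R at index 5 are each flanked by identified peptides and never spanned, so they are positive; the K at index 8 is only ever seen inside CCKDD, so it is negative.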

Step 2-2) For each positive site and negative site of step 2-1), the T amino acids adjacent to the site on each side are taken to form a (2T+1)-mer of length 2T+1 (T is a natural number; T is generally 4). If a cleavage site happens to lie near the N-terminus or C-terminus of the protein sequence, so that there are not enough amino acids to the left or right of the site to form the (2T+1)-mer, the missing positions are filled with the placeholder "Z". For each (2T+1)-mer, every amino acid except the one in the middle position (lysine or arginine) is converted into a 21-dimensional 0-1 vector. Each (2T+1)-mer is thereby converted into a 42T-dimensional 0-1 vector. The label of the positive sites of step 2-1) is set to 1, and the label of the negative sites of step 2-1) is set to 0.
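The windowing and 0-1 encoding of step 2-2) can be sketched as follows. The function names and the ordering of the 21-letter alphabet are assumptions made for illustration; the text fixes only the window length, the placeholder "Z" and the 21-dimensional encoding per flanking residue.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZ"  # 20 amino acids plus the placeholder Z


def site_window(protein, site, T=4):
    """(2T+1)-mer centred on the residue at index `site`, padded with Z."""
    padded = "Z" * T + protein + "Z" * T
    center = site + T
    return padded[center - T:center + T + 1]


def encode_window(window):
    """One-hot encode every residue except the central one (42T values)."""
    T = len(window) // 2
    flanks = window[:T] + window[T + 1:]
    vector = []
    for aa in flanks:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(aa)] = 1  # raises ValueError for unknown letters
        vector.extend(one_hot)
    return vector


# A site near the N-terminus: three placeholder positions are needed on the left.
window = site_window("AKGGGGGGG", 1, T=4)
features = encode_window(window)
```

With T = 4 the vector has 8 x 21 = 168 entries, of which exactly 8 are 1.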

In the above technical solution, in step 3), the cleavage probability prediction model is trained because of the characteristics of shotgun proteomics. A typical shotgun proteomics experiment can be divided into two processes: enzymatic digestion of the proteins and detection of the peptides. An easily overlooked fact is that it is unknown which peptides the protease has actually released. That is, it cannot be known exactly which peptides actually entered the peptide detection process. Accurate prediction of peptide detectability therefore needs to take the digestion of the proteins into account. Training the cleavage-site cleavage probability prediction model means training a random forest model on the cleavage probability training set constructed in step 2). As the number of trees in the random forest increases, the error of the random forest decreases, but the running time increases. In preliminary tests by the inventors, setting the number of trees in the random forest to 200 gave a good balance between accuracy and efficiency. The number of features randomly selected at each node uses the default value, i.e. the square root of the total number of features. The AUC under ten-fold cross-validation is used as the metric for evaluating model performance.
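The AUC used as the evaluation metric can be computed directly from true labels and predicted scores. The sketch below illustrates only the metric; it does not train a random forest or perform the ten-fold split, and the function name `roc_auc` is hypothetical. It uses the pairwise definition of AUC: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores (hypothetical): one positive is ranked below one negative.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, so a rank-based computation would be preferable for training sets of the size reported in the embodiment.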

In the above technical solution, in step 4), performing theoretical digestion on all high-confidence proteins means digesting in silico the sequences of the high-confidence proteins of step 1) into theoretical digestion peptides. The parameters are set as follows: 1) the maximum and minimum permitted peptide lengths are set to the maximum and minimum lengths, respectively, of all identified peptides; 2) the maximum number of missed cleavage sites permitted in a peptide is set to 2.
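The theoretical digestion of step 4) can be sketched as follows. The sketch assumes a simple cleavage rule, a cut after every K or R with no exception for a following proline, which the text does not specify; the function names and the toy sequence are hypothetical.

```python
def tryptic_pieces(protein):
    """Split a protein after every K or R (fully cleaved pieces)."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece without K/R
    return pieces


def digest(protein, min_len, max_len, max_missed=2):
    """Theoretical peptides with up to `max_missed` missed cleavage sites."""
    pieces = tryptic_pieces(protein)
    peptides = []
    for i in range(len(pieces)):
        for missed in range(max_missed + 1):
            if i + missed >= len(pieces):
                break
            # Joining missed + 1 consecutive pieces leaves `missed` internal sites.
            peptide = "".join(pieces[i:i + missed + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides


# Toy example (hypothetical sequence and length limits).
theoretical = digest("AAKGGRCCKDD", min_len=3, max_len=8, max_missed=2)
```

In practice `min_len` and `max_len` would be taken from the shortest and longest identified peptides, as described above.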

In the above technical solution, in step 5), predicting the digestion probability of all theoretical digestion peptides comprises:

Step 5-1) For every theoretical digestion peptide of step 4), the (2T+1)-mers corresponding to its left-end cleavage site, its right-end cleavage site and its missed cleavage sites (if any) are obtained. These (2T+1)-mers are expressed as 42T-dimensional vectors by the method of step 2-2).

Step 5-2) The cleavage probability of each cleavage site of step 5-1) is predicted with the cleavage probability prediction model trained in step 3).

Step 5-3) From the cleavage-site cleavage probabilities predicted in step 5-2), the digestion probability of every theoretical digestion peptide is calculated by the following formula:

$$e_{pep} = e_l \cdot e_r \cdot \prod_{i=1}^{N} (1 - e_i)$$

where $e_{pep}$ denotes the digestion probability of the peptide;

$e_l$ denotes the cleavage probability of the cleavage site at the left end of the peptide;

$e_r$ denotes the cleavage probability of the cleavage site at the right end of the peptide;

$e_i$ denotes the cleavage probability of the $i$-th missed cleavage site in the peptide;

$N$ denotes the number of missed cleavage sites in the peptide.
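The calculation of step 5-3) can be sketched as follows. The sketch assumes the product form $e_{pep} = e_l \cdot e_r \cdot \prod_{i=1}^{N}(1 - e_i)$, that is, both ends of the peptide are cleaved and every internal site is missed, with the sites treated as independent. The function name and the probabilities are hypothetical.

```python
import math


def peptide_digestion_probability(e_l, e_r, missed_site_probs=()):
    """Probability that both ends are cleaved and every internal site is missed."""
    # math.prod of an empty sequence is 1, which covers peptides with N = 0.
    return e_l * e_r * math.prod(1.0 - e_i for e_i in missed_site_probs)


# Toy values (hypothetical): one missed cleavage site with probability 0.5.
e_pep = peptide_digestion_probability(0.9, 0.8, [0.5])
```

Peptides at a protein terminus have only one real cleavage site; how such ends are scored is not stated in the text and is left open here.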

In the above technical solution, in step 6), constructing the peptide detectability training set comprises:

Step 6-1) Among all theoretical digestion peptides of the high-confidence proteins obtained in step 4), peptides that were identified with a spectral count greater than 1 are classed as positive peptides, and peptides that were not identified are classed as negative peptides. The label of the positive peptides is set to 1 and the label of the negative peptides is set to 0.
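The labelling rule of step 6-1) can be sketched as follows. The function name and the toy data are hypothetical. Peptides identified with a spectral count of exactly 1 satisfy neither rule and are left out of the training set in this sketch.

```python
def label_peptides(theoretical_peptides, spectral_counts):
    """Map each theoretical peptide to 1 (positive) or 0 (negative)."""
    labels = {}
    for peptide in theoretical_peptides:
        count = spectral_counts.get(peptide, 0)  # 0 means not identified
        if count > 1:
            labels[peptide] = 1
        elif count == 0:
            labels[peptide] = 0
        # count == 1: neither positive nor negative, so the peptide is skipped
    return labels


# Toy example (hypothetical): GGR was identified only once.
peptide_labels = label_peptides(["AAK", "GGR", "CCK"], {"AAK": 3, "GGR": 1})
```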

Step 6-2) 588 attributes are calculated for the positive and negative peptides of step 6-1), so that each peptide can be represented by a vector x=(x₁,x₂,x₃,…,x₅₈₈). Of these 588 attributes, the first 23 are features related to the peptide sequence, for example the peptide length, the number of missed cleavage sites in the peptide, the peptide mass, and the frequency of occurrence of each amino acid in the peptide. The next 544 are the physicochemical properties of amino acids from AAindex (reference: Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., and Kanehisa, M.; AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202-D205 (2008)), averaged over the peptide. A further 20 physicochemical properties are taken from earlier studies (references: Braisted, J.C. et al. BMC Bioinformatics 9, 529 (2008); Webb-Robertson, B.J. et al. Bioinformatics 26, 1677-1683 (2010); Eyers, C.E. et al. Mol Cell Proteomics 10, M110 003384 (2011); Tang, H. et al. Bioinformatics 22, e481-488 (2006)). The last feature is the peptide digestion probability calculated in step 5-3). This feature is proposed by the present invention and is used in a peptide detectability model for the first time.

In the above technical solution, in step 7), training the peptide detectability model comprises:

Step 7-1) Feature selection is performed on the training set constructed in step 6). The feature selection method used by the invention is mRMR (reference: Ding, C. & Peng, H. Minimum Redundancy Feature Selection from Microarray Gene Expression Data. In Proceedings of the IEEE Computer Society Conference on Bioinformatics 523-- (IEEE Computer Society, 2003)). Following the ranking produced by feature selection, the top 50 attributes are added one at a time, and a random forest model is trained on the training set of step 6) each time. The test results show that the random forest model performs best when 31 features are used.

Step 7-2) A random forest model is trained on the peptide detectability training set constructed in step 6) using the 31 features selected in step 7-1). The number of trees in the random forest is set to 200. The number of features randomly selected at each node uses the default value, i.e. the square root of the total number of features. As before, the AUC under ten-fold cross-validation is used as the metric for evaluating model performance. Because the peptide detectability training set is imbalanced, down-sampling is used when training the random forest (reference: Chen, C. & Liaw, A. Using random forest to learn imbalanced data. Discovery (2004)).
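The down-sampling mentioned in step 7-2) can be sketched as follows. This is a deliberately simple variant that balances the whole training set once by discarding randomly chosen majority-class samples. The cited approach balances the sample drawn for each tree, which this sketch does not reproduce. The function name and the fixed seed are assumptions made for illustration.

```python
import random


def downsample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are the same size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy example (hypothetical): 2 positives against 8 negatives.
toy_samples = list(range(10))
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced_samples, balanced_labels = downsample(toy_samples, toy_labels)
```

In the embodiment the negative set is roughly seven times the size of the positive set, which is the kind of imbalance this step addresses.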

In the above technical solution, in step 8), predicting the peptide detectability of all theoretical digestion peptides of other proteins comprises:

Step 8-1) The protein sequences to be predicted are digested in silico into theoretical digestion peptides, and the digestion probability of each theoretical digestion peptide is predicted using the cleavage-site cleavage probability prediction model. The digestion parameters are set as follows: the maximum permitted peptide length is 38; the minimum permitted peptide length is 6; the maximum number of missed cleavage sites permitted in a peptide is 2.

Step 8-2) Each theoretical digestion peptide of step 8-1) is converted into a 588-dimensional numerical vector by the method of step 6-2).

Step 8-3) The numerical vectors obtained in step 8-2) are input into the peptide detectability model trained in step 7-2) to predict the peptide detectability of all theoretical digestion peptides of step 8-1).

The present invention also provides a device for predicting representative peptides in targeted proteomics. The device comprises a peptide and protein identification module, a peptide detectability prediction module and a representative peptide prediction module.

The peptide and protein identification module uses protein identification software to perform the basic interpretation of the spectra.

The peptide detectability prediction module comprises the following parts:

1) constructing the peptide detectability training set;

2) training the peptide detectability model;

3) predicting the peptide detectability of all theoretical digestion peptides of the candidate proteins in targeted proteomics.

For each candidate protein, the representative peptide prediction module ranks the peptides associated with the protein by the peptide detectability calculated above and takes the top M peptides (M is generally 5) as representative peptides.
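The ranking performed by the representative peptide prediction module can be sketched as follows. The function name and the detectability scores are hypothetical; the default of 5 follows the value given in the text.

```python
def representative_peptides(detectability, m=5):
    """Top-m peptides of one protein, ranked by predicted detectability."""
    ranked = sorted(detectability, key=detectability.get, reverse=True)
    return ranked[:m]


# Toy scores (hypothetical) for the peptides of a single candidate protein.
scores = {"AAK": 0.20, "GGR": 0.90, "CCKDD": 0.50}
top_two = representative_peptides(scores, m=2)
```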

The invention has the following advantages:

1. It is the first to add the peptide digestion probability to a peptide detectability model.

2. In keeping with the shotgun proteomics workflow, it takes the enzymatic digestion of proteins into account when predicting peptide detectability, which significantly improves the accuracy of peptide detectability prediction.

Brief description of the drawings

Fig. 1 is a flow chart of the digestion-probability-assisted peptide detectability prediction method of the present invention;

Fig. 2 is the ten-fold cross-validation ROC plot of the cleavage-site cleavage probability prediction model;

Fig. 3 shows the performance of the peptide detectability model as the features ranked by feature selection are added in turn;

Fig. 4 is the ROC plot evaluating the performance of the peptide detectability model on the test set.

Detailed description of embodiments

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

Suppose there is a protein sample. The protein mixture is first digested by existing biochemical techniques into a peptide mixture solution, and experimental tandem mass spectrometry data are then generated by liquid chromatography-mass spectrometry. The tandem mass spectrometry data contain three dimensions of information: chromatographic retention time, mass-to-charge ratio and mass spectrometric response signal intensity. Identification software is then needed to determine which peptides and proteins are present in the spectra and how the peptides relate to the proteins. Software such as MaxQuant (reference: Cox, J. and Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol, 2008, 26, pp 1367-72) and pFind (reference: Wang L.H. et al. pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun Mass Spectrom, 2007, 21, 2985-2991) provides this function. The results obtained for this sample are used to construct the training set.

Suppose a second protein sample is processed in the same way to obtain identified peptides and proteins. The results obtained for this sample are used to construct the test set.

Based on the above background data, and with reference to Fig. 1, the specific implementation of the method of the present invention is described below.

From the MaxQuant identification results used to construct the training set, 43088 identified peptides and 3959 proteins were extracted. Screening by protein sequence coverage and protein spectral count gave 1556 high-confidence proteins, corresponding to 29172 identified peptides. These identified peptides were mapped back to their protein sequences, and the cleavage events at each site were counted. Then, following the rules set out above for constructing the cleavage probability prediction model training set, 7778 cleaved sites and 4854 missed cleavage sites were obtained. These sites were vectorized to form the training set, on which a random forest model was trained. As shown in Fig. 2, the ten-fold cross-validation AUC of this random forest model reached 0.9756, which shows that the cleavage probability model trained by the invention predicts the cleavage behaviour of a site very well. Next, identified peptides with a spectral count greater than 1 were taken as the positive set (25363 peptides), and theoretical digestion peptides that were not identified were taken as the negative set (180284 peptides). The trained cleavage probability prediction model was then used to predict the digestion probability of each peptide in the peptide detectability training set, and the other attributes of the training-set peptides were calculated.

To characterize the detectability of a peptide, the invention converts each peptide into a 1x588 numerical vector. A peptide is essentially an ordered sequence of amino acids. One way of representing amino acids is to denote each amino acid by a capital letter; for example, alanine is denoted by the letter A and cysteine by the letter C. A peptide can thus be written as a string of letters. The feature representation of a peptide is illustrated below using the peptide ARNDCEQK as an example. In a mass spectrometer, peptides that are too short or too long cannot be detected, so peptide length is an important factor in whether a peptide can be detected. The length of the example peptide is 8. Trypsin usually cleaves a protein sequence into peptides on the C-terminal side of lysine or arginine, so a lysine (K) or arginine (R) inside a peptide (not at its C-terminus) is generally taken to result from a missed cleavage. The digestion state of a peptide has a large effect on its mass spectrometric signal, so the number of missed cleavage sites in the peptide is also an important feature. For example, the peptide ARNDCEQK contains exactly one missed cleavage site, R. Adding up the masses of the amino acids in the peptide gives a peptide mass of 963.43 Da. There are 20 common amino acids in biology, and the invention uses a 20-dimensional amino acid frequency vector to represent the amino acid composition of a peptide. For example, with a fixed ordering of the amino acids, each amino acid that occurs in the peptide ARNDCEQK happens to occur exactly once; dividing by the peptide length of 8, the feature value at the position of each of these amino acids is 1/8, and the feature value at the positions of the remaining amino acids is 0. According to the AAindex database, each amino acid has 544 quantified physicochemical properties, and the quantified properties of the amino acids in a peptide are averaged to give features of the peptide. For example, suppose the 544 physicochemical properties of each amino acid in the peptide ARNDCEQK are as follows:

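A: $\mathbf{v}_A$, R: $\mathbf{v}_R$, N: $\mathbf{v}_N$, D: $\mathbf{v}_D$, C: $\mathbf{v}_C$, E: $\mathbf{v}_E$, Q: $\mathbf{v}_Q$, K: $\mathbf{v}_K$

Then the feature of the peptide is

$$\frac{1}{8}\left(\mathbf{v}_A + \mathbf{v}_R + \mathbf{v}_N + \mathbf{v}_D + \mathbf{v}_C + \mathbf{v}_E + \mathbf{v}_Q + \mathbf{v}_K\right)$$

where each $\mathbf{v}$ denotes a 1x544 vector.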
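The sequence-derived features described for the example peptide ARNDCEQK can be sketched as follows. The function names, the alphabetical ordering of the frequency vector and the two-property toy table are assumptions made for illustration. The residue masses are standard monoisotopic values. Adding one water and one proton to the residue sum is an assumption made here because it reproduces the 963.43 Da quoted in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed (alphabetical) ordering, an assumption

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added once per peptide (terminal H and OH)
PROTON = 1.007276   # assumed: the quoted mass matches the singly protonated form


def sequence_features(peptide):
    """Length, missed cleavage count, mass and 20-dimensional frequency vector."""
    length = len(peptide)
    # K or R anywhere except the C-terminal residue counts as a missed cleavage.
    missed = sum(aa in "KR" for aa in peptide[:-1])
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON
    frequencies = [peptide.count(aa) / length for aa in AMINO_ACIDS]
    return length, missed, mass, frequencies


def mean_properties(peptide, property_table):
    """Average each per-residue property over the peptide."""
    n_props = len(next(iter(property_table.values())))
    return [
        sum(property_table[aa][j] for aa in peptide) / len(peptide)
        for j in range(n_props)
    ]


length, missed, mass, frequencies = sequence_features("ARNDCEQK")
# Toy table with 2 properties per residue instead of the 544 from AAindex.
toy_means = mean_properties("AK", {"A": [1.0, 0.0], "K": [3.0, 2.0]})
```

The length, missed cleavage count, mass and 20 frequencies together account for the 23 sequence-related features; `mean_properties` applied to a 544-column table would give the AAindex block.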

Next, with reference to the literature (Braisted, J.C. et al. BMC Bioinformatics 9, 529 (2008); Webb-Robertson, B.J. et al. Bioinformatics 26, 1677-1683 (2010); Eyers, C.E. et al. Mol Cell Proteomics 10, M110 003384 (2011); Tang, H. et al. Bioinformatics 22, e481-488 (2006)), the last 20 physicochemical properties of the peptide are calculated. Note that calculating these features uses not only the amino acid sequence of the peptide itself but also the amino acid sequence flanking the peptide.

Finally, the peptide digestion probability calculated in the previous step is used as the last feature of the peptide, the 588th dimension.

The features used by the invention cover almost all the features used in peptide detectability research to date. So many features also bring a problem: there is redundancy among them. To solve this problem, the invention uses the mRMR method for feature selection, selecting 50 features that are highly relevant to whether a peptide is detected and minimally redundant with one another. To determine the required number of features, these 50 features are added one at a time to the random forest model, and the change in the model's ten-fold cross-validation AUC is recorded. As shown in Fig. 3, the model performs best with 31 features; adding further features degrades model performance.

For the MaxQuant identification results used to construct the test set, the theoretical digestion peptides of each protein are represented as 1x31 vectors following the steps above. The peptide detectability of these theoretical digestion peptides is predicted with the trained peptide detectability model. As shown in Fig. 4, the trained peptide detectability model still shows good predictive performance on the test set, which indicates that the model of the invention generalizes well.

At this point, the above operations have completed the prediction of peptide detectability for all peptides.

The peptide detectability prediction module comprises the following parts:

1) constructing the peptide detectability training set;

2) training the peptide detectability model.

For each candidate protein, the representative peptide prediction module ranks the peptides associated with the protein by the peptide detectability calculated above and takes the top N peptides (N is generally 5) as representative peptides.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims

1. a kind of peptide fragment detectability prediction technique of digestion probability auxiliary, step include:

1) high credible albumen is filtered out in albumen from having identified；

2) restriction enzyme site digestion Probabilistic Prediction Model training set, training restriction enzyme site digestion Probabilistic Prediction Model are constructed；

3) theoretical digestion is carried out to all high credible albumen, obtains theoretical digestion peptide fragment；

4) the digestion probability of all theoretical digestion peptide fragments is predicted using the restriction enzyme site digestion Probabilistic Prediction Model；

5) peptide fragment detectability training set, training peptide fragment detectability model are constructed；

6) the peptide fragment detectability of all theoretical digestion peptide fragments of peptide fragment detectability model prediction setting albumen is utilized.

2. the method as described in claim 1, which is characterized in that the credible albumen of height is the spectrogram for having identified albumen in albumen Number is greater than albumen of the sequence coverage greater than given threshold h2 of given threshold h1 and albumen；Or the credible albumen of height is The forward albumen of the sequence coverage descending sort of the spectrogram number and albumen of having identified albumen in albumen.

3. The method according to claim 1, characterized in that the step of constructing the training set for the cleavage-site digestion probability prediction model comprises:

3-1) tracing the identified peptides that are associated with the high-confidence proteins in the identified peptide set back to the protein sequences of the high-confidence proteins, then collecting the cleavage-site information in the protein sequences according to the trace-back positions of the identified peptides; and dividing the cleavage sites into positive sites and negative sites according to the cleavage-site information;

3-2) taking the T adjacent amino acids on each side of a cleavage site to form a (2T+1)-mer of length 2T+1; if the cleavage site lies near the N-terminus or C-terminus of the protein sequence, so that there are not enough amino acids to the left or right of the cleavage site to form the (2T+1)-mer, filling the missing positions with the placeholder "Z";

3-3) for each (2T+1)-mer, converting every amino acid other than the one at the middle position into a 21-dimensional 0-1 vector, so that each (2T+1)-mer is converted into a 42T-dimensional 0-1 vector; setting the label of the positive sites to 1 and the label of the negative sites to 0.
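Steps 3-2) and 3-3) can be sketched as below. The sketch assumes that the 21 dimensions correspond to the 20 standard amino acids plus the placeholder "Z", and that the cleavage site is addressed by the index of the middle residue; both are assumptions, and the names `window` and `encode` are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "Z"  # assumed 21 symbols: 20 residues + placeholder


def window(seq, site, T):
    """Return the (2T+1)-mer centred on seq[site], padded with 'Z'."""
    chars = []
    for j in range(site - T, site + T + 1):
        chars.append(seq[j] if 0 <= j < len(seq) else "Z")
    return "".join(chars)


def encode(mer):
    """One-hot encode every position except the middle one.

    A (2T+1)-mer yields 2T * 21 = 42T values of 0 or 1.
    Residues outside ALPHABET raise ValueError in this sketch.
    """
    T = len(mer) // 2
    vec = []
    for pos, aa in enumerate(mer):
        if pos == T:
            continue  # the middle (cleavage) residue is not encoded
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(aa)] = 1
        vec.extend(one_hot)
    return vec
```

For example, with T = 3 a site at index 1 of "MKTAYIAK" gives the padded window "ZZMKTAY", which encodes to a vector of length 126.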

4. The method according to claim 3, characterized in that the cleavage-site information comprises: the sum of the spectrum counts of the peptides to the left of the cleavage site, denoted as parameter L; the sum of the spectrum counts of the peptides to the right of the cleavage site, denoted as parameter R; and the sum of the spectrum counts of the peptides in which the cleavage site is a missed cleavage site, denoted as parameter O; a cleavage site that satisfies conditions 1) to 3) is classified as a positive site: 1) L is greater than or equal to 1, 2) R is greater than or equal to 1, 3) O is equal to 0; a cleavage site that satisfies conditions a) to c) is classified as a negative site: a) L is equal to 0, b) R is equal to 0, c) O is greater than or equal to 2.
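The labelling rule of claim 4 can be sketched as follows. The sketch assumes that an identified peptide is given as a hypothetical tuple (start, end, spectrum_count) with inclusive 0-based positions on the protein, that a "left" peptide is one ending at the site, and that a "right" peptide is one starting immediately after it. Sites that meet neither set of conditions are assumed to be left out of the training set.

```python
def site_counts(site, peptides):
    """Return (L, R, O) for the cleavage site after residue index `site`.

    peptides: iterable of (start, end, spectrum_count), inclusive indices.
    """
    L = R = O = 0
    for start, end, count in peptides:
        if end == site:
            L += count            # peptide ends at the site (left peptide)
        elif start == site + 1:
            R += count            # peptide starts right after the site
        elif start <= site < end:
            O += count            # site is a missed cleavage inside the peptide
    return L, R, O


def label_site(L, R, O):
    """Return 1 for a positive site, 0 for a negative site, else None."""
    if L >= 1 and R >= 1 and O == 0:
        return 1
    if L == 0 and R == 0 and O >= 2:
        return 0
    return None  # ambiguous evidence: not used for training
```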

5. The method according to claim 3, characterized in that the step of predicting the digestion probability of all theoretical digestion peptides comprises:

5-1) obtaining the (2T+1)-mers corresponding to the left-end cleavage site, the right-end cleavage site and the missed cleavage sites of all theoretical digestion peptides, and converting them into 42T-dimensional 0-1 vectors;

5-2) predicting the digestion probability of each cleavage site of step 5-1) using the digestion probability prediction model;

5-3) calculating, from the cleavage-site digestion probabilities predicted in step 5-2) and according to the formula, the digestion probability of all theoretical digestion peptides; wherein e_pep denotes the digestion probability of the peptide, e_l denotes the digestion probability of the cleavage site at the left end of the peptide, e_r denotes the digestion probability of the cleavage site at the right end of the peptide, e_i denotes the digestion probability of the i-th missed cleavage site within the peptide, and n denotes the number of missed cleavage sites within the peptide.
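The formula referred to in step 5-3) is not reproduced in this text. One form that is consistent with the symbol definitions is e_pep = e_l × e_r × (1 − e_1) × … × (1 − e_n): both ends are cleaved and none of the n internal sites is cleaved. The sketch below implements that form purely as an assumption; the function name is illustrative.

```python
from math import prod


def peptide_digestion_probability(e_l, e_r, missed):
    """Assumed form: e_pep = e_l * e_r * product of (1 - e_i).

    e_l, e_r: predicted digestion probabilities of the two end sites.
    missed:   predicted digestion probabilities e_1..e_n of the missed
              cleavage sites inside the peptide (may be empty).
    """
    return e_l * e_r * prod(1.0 - e_i for e_i in missed)
```

Under this assumed form, a peptide with no missed cleavage site has a digestion probability equal to the product of its two end-site probabilities, and each readily cleaved internal site lowers the value.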

6. The method according to claim 1, characterized in that performing theoretical digestion on all high-confidence proteins means digesting the sequences of the high-confidence proteins in silico into theoretical digestion peptides; wherein the maximum and minimum allowed peptide lengths are set to the maximum and minimum lengths of all identified peptides, respectively, and the maximum number of missed cleavage sites allowed in a peptide is set to 2.
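A theoretical digestion with at most two missed cleavage sites can be sketched as follows. The claim does not name the enzyme; the sketch assumes a trypsin-like rule (cleavage after K or R, with no proline exception), and the function name and return format are illustrative.

```python
def theoretical_digest(seq, min_len, max_len, max_missed=2, cleave_after="KR"):
    """Return (start, peptide, missed_cleavages) tuples for one protein.

    Assumption: the enzyme cleaves after every residue in `cleave_after`.
    """
    # Positions at which a peptide may start or end (exclusive end).
    cuts = [0]
    cuts += [i + 1 for i, aa in enumerate(seq[:-1]) if aa in cleave_after]
    cuts.append(len(seq))

    peptides = []
    for a in range(len(cuts) - 1):
        # Spanning b - a segments means b - a - 1 missed cleavage sites.
        for b in range(a + 1, min(a + max_missed + 2, len(cuts))):
            pep = seq[cuts[a]:cuts[b]]
            if min_len <= len(pep) <= max_len:
                peptides.append((cuts[a], pep, b - a - 1))
    return peptides


def length_bounds(identified_peptides):
    """Shortest and longest length among the identified peptides."""
    lengths = [len(p) for p in identified_peptides]
    return min(lengths), max(lengths)
```

In use, `length_bounds` would be applied to the identified peptide set and its result passed as `min_len` and `max_len`, matching the length limits described in the claim.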

7. The method according to claim 1, characterized in that the step of constructing the peptide detectability training set comprises: among all theoretical digestion peptides obtained in step 3), classifying the peptides that were identified with a spectrum count greater than 1 as positive peptides, and classifying the peptides that were not identified as negative peptides; setting the label of the positive peptides to 1 and the label of the negative peptides to 0; then calculating the attribute features of the positive peptides and the negative peptides and converting them into numerical vectors, the attribute features including the digestion probability.
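The labelling part of claim 7 can be sketched as below. The mapping `spectrum_counts` from peptide sequence to spectrum count is a hypothetical input format. Peptides identified with exactly one spectrum are neither positive nor negative under the wording of the claim, so the sketch leaves them out; that reading is an assumption.

```python
def label_peptides(theoretical, spectrum_counts):
    """Map each usable theoretical peptide to label 1 or 0.

    theoretical:     iterable of peptide sequences from in silico digestion.
    spectrum_counts: dict of identified peptide -> spectrum count.
    """
    labelled = {}
    for pep in theoretical:
        n = spectrum_counts.get(pep, 0)
        if n > 1:
            labelled[pep] = 1   # identified with more than one spectrum
        elif n == 0:
            labelled[pep] = 0   # never identified
        # n == 1: weak evidence either way, excluded from the training set
    return labelled
```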

8. The method according to claim 1, characterized in that the step of training the peptide detectability model comprises: performing feature selection on the peptide detectability training set using the feature selection method mRMR; then training random forest models successively on the selected attribute features, and determining the m features for which the random forest model performs best; then training a random forest model on the peptide detectability training set using the m features.
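The search for the m features in claim 8 can be read as an incremental selection over the mRMR ranking: train on the top 1, top 2, ... features and keep the best-scoring prefix. The sketch below shows only that search loop, under that assumption. The `evaluate` callback is hypothetical; in practice it might train a random forest on the given features and return a cross-validated score.

```python
def choose_feature_count(ranked_features, evaluate):
    """Return (best feature prefix, its score).

    ranked_features: feature names ordered by the ranking step (e.g. mRMR).
    evaluate:        hypothetical callback taking a list of feature names
                     and returning a score where higher is better.
    """
    best_m, best_score = 0, float("-inf")
    for m in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:m])
        if score > best_score:      # ties keep the smaller feature set
            best_m, best_score = m, score
    return ranked_features[:best_m], best_score
```

The final model would then be retrained on the full training set using only the returned features.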

9. The method according to claim 1, characterized in that the step of predicting the peptide detectability of all theoretical digestion peptides of the specified protein comprises: theoretically digesting the protein sequence of the specified protein into a number of theoretical digestion peptides, and predicting the digestion probability of each theoretical digestion peptide using the cleavage-site digestion probability prediction model; then generating a corresponding numerical vector from the attribute features of each theoretical digestion peptide of the specified protein, the attribute features of a theoretical digestion peptide including its digestion probability; then inputting the numerical vectors into the trained peptide detectability model to predict the peptide detectability of the theoretical digestion peptides of the specified protein.

10. The method according to claim 1, characterized in that the cleavage-site digestion probability prediction model is a random forest model.