CN107622184B

CN107622184B - Evaluation method for amino acid reliability and modification site positioning

Info

Publication number: CN107622184B
Application number: CN201710904787.8A
Authority: CN
Inventors: 杨皓; 迟浩; 曾文锋; 周文婧; 王钊伟; 王瑞敏; 牛秀南; 陈振霖; 刘超; 贺思敏
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2020-01-21
Anticipated expiration: 2037-09-29
Also published as: CN107622184A

Abstract

The invention provides a training method of an amino acid reliability evaluation model. The method comprises the following steps: generating a background peptide fragment set of the amino acid to be trained according to a training peptide fragment containing the amino acid to be trained; extracting a plurality of features from the training peptide and the amino acid to be trained; and training a classification model by taking the extracted multiple features as input vectors and taking whether the amino acid to be trained is correct as output, so as to obtain an amino acid reliability evaluation model. The amino acid credibility evaluation model obtained by the invention can be used for evaluating the amino acid credibility and the positioning of the modification sites, so that the accuracy of evaluating the amino acid credibility is improved, and the evaluation performance of positioning the modification sites is improved.

Description

Evaluation method for amino acid reliability and modification site positioning

Technical Field

The invention relates to the technical field of biology, and in particular to a method for evaluating amino acid reliability and modification site localization.

Background

Mass spectrometry has become a routine means for biologists to analyze biological samples, and peptide fragment and protein identification is a key step in that analysis. At present, peptide fragment identification methods based on tandem mass spectrometry data fall mainly into two categories: database search methods and de novo sequencing methods. Database search methods depend heavily on the quality of the database: if the correct peptide fragment is not in the database, the identification result will be wrong. De novo sequencing does not depend on database information and derives the peptide fragment sequence directly from the spectrum, so it can find peptide fragments that are absent from the database, such as those carrying mutations or unexpected modifications. A growing number of de novo sequencing algorithms exist, including SHERENGA, PEAKS, PepNovo, pNovo, pNovo+, UniNovo, Novor, and Open-pNovo, which supports the identification of unexpected modifications.

However, because de novo sequencing does not use database information as a prior, it inevitably reports very similar peptide fragment sequences, resulting in a very high error rate. According to literature reports, nearly 40% of the high-scoring results obtained from de novo sequencing are erroneous. How to control the false discovery rate (FDR) in the field of de novo sequencing therefore remains an urgent problem.

Experience shows that a peptide fragment sequence reported by de novo sequencing is often only partially correct: some contiguous subsequences are right while the remaining residues are wrong. Based on this characteristic, the reliability of each amino acid in the peptide fragment sequence can be evaluated, a subsequence consisting of highly reliable amino acids can be extracted as a sequence tag, and the database can then be searched using the sequence tag to obtain a reported peptide fragment sequence. So far, however, no literature describes specifically how to evaluate the reliability of amino acids, and the accuracy of amino acid reliability evaluation has not been studied in depth.
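The sequence-tag idea described above can be illustrated with a minimal sketch. It assumes that one reliability score per residue is already available and that a "highly reliable" residue is simply one whose score reaches a chosen threshold; the function name `extract_sequence_tags`, the threshold and the minimum tag length are illustrative assumptions, not part of the patent.

```python
from typing import List, Tuple


def extract_sequence_tags(peptide: str, scores: List[float],
                          threshold: float, min_len: int = 3) -> List[Tuple[int, str]]:
    """Return (start_index, tag) for each maximal run of residues whose
    reliability score is >= threshold and whose length is >= min_len."""
    if len(peptide) != len(scores):
        raise ValueError("one score per residue is required")
    tags: List[Tuple[int, str]] = []
    start = None
    # A sentinel score below any threshold closes a run that reaches the end.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                tags.append((start, peptide[start:i]))
            start = None
    return tags
```

For a hypothetical peptide `AQPSKLT` with scores `[0.9, 0.95, 0.8, 0.2, 0.9, 0.9, 0.1]` and threshold 0.7, only the run `AQP` is long enough to be kept as a tag.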

Therefore, there is a need for improvements in the prior art to accurately assess the reliability of amino acids and thereby reduce the error rate of detecting peptide fragment sequences in de novo sequencing.

Disclosure of Invention

Accordingly, the present invention has been made to overcome the above-mentioned drawbacks of the prior art and to provide a method for evaluating the reliability of amino acids and a method for evaluating the localization of modified sites.

According to a first aspect of the invention, an amino acid reliability assessment model training method is provided. The method comprises the following steps:

step 1: generating a background peptide fragment set of the amino acid to be trained according to a training peptide fragment containing the amino acid to be trained;

step 2: extracting a plurality of features from the training peptide and the amino acid to be trained;

and step 3: and training a classification model by taking the extracted multiple features as input vectors and taking whether the amino acid to be trained is correct as output, so as to obtain an amino acid reliability evaluation model.

In the amino acid reliability evaluation model training method of the invention, the step 1 comprises the following steps:

enumerating a subsequence of predetermined length for the amino acids to be trained, wherein the subsequence comprises the amino acids to be trained and other amino acids in the training peptide segment;

enumerating all amino acid permutations whose mass is equal to the mass of said subsequence;

and splicing the enumerated amino acid permutations with the remaining sequence of the training peptide fragment to obtain the background peptide fragment set of the amino acid to be trained.

In the amino acid reliability evaluation model training method of the invention, the step 2 comprises the following steps:

calculating the peptide-spectrum match score psm₁, the spectral peak intensity matching ratio psm₂ and the matched spectral peak number ratio psm₃ of the training peptide fragment as the first feature, the second feature and the third feature, respectively;

calculating the peptide-spectrum match score psm′₁, the spectral peak intensity matching ratio psm′₂ and the matched spectral peak number ratio psm′₃ of the best background peptide fragment in the background peptide fragment set of the amino acid to be trained, and calculating the differences between the scores of the training peptide fragment and those of the best background peptide fragment, expressed as psm₁ − psm′₁, psm₂ − psm′₂ and psm₃ − psm′₃, as the fourth feature, the fifth feature and the sixth feature, respectively, wherein the best background peptide fragment is the background peptide fragment with the highest peptide-spectrum match score in the background peptide fragment set of the amino acid to be trained;

and taking the position information and the class information of the amino acid to be trained, and the length information of the training peptide fragment, as the seventh feature, the eighth feature and the ninth feature, respectively.

In the amino acid credibility assessment model training method, in step 3, the classification model comprises any one of a support vector machine, a decision tree, a random forest and a Bayesian network.

According to a second aspect of the present invention, there is provided a method for assessing the reliability of an amino acid. The evaluation method comprises the following steps:

step 51: generating a background peptide fragment set of the amino acid to be evaluated according to an original peptide fragment containing the amino acid to be evaluated;

step 52: extracting a plurality of features from the original peptide fragment and the amino acid to be evaluated;

step 53: and inputting the extracted features into an amino acid reliability evaluation model obtained by the amino acid reliability evaluation model training method to obtain reliability scoring distribution of the amino acid to be evaluated.

The method for assessing the reliability of an amino acid of the present invention further comprises:

fitting the credibility scoring distribution of the amino acid to be evaluated into Gamma distribution;

calculating the false amino-acid rate of the amino acid to be evaluated based on the Gamma distributions:

$$\mathrm{FAR} = \frac{p_w\,\Gamma(X \mid \alpha_w, \beta_w)}{p_w\,\Gamma(X \mid \alpha_w, \beta_w) + p_r\,\Gamma(X \mid \alpha_r, \beta_r)}$$

wherein FAR represents the false amino-acid rate of the amino acid to be evaluated, p_w and p_r are the prior probabilities of incorrect and correct amino acids respectively, Γ(X | α_w, β_w) denotes the area of the score distribution of incorrect amino acids above the scoring threshold X, Γ(X | α_r, β_r) denotes the area of the score distribution of correct amino acids above the scoring threshold X, X represents the scoring threshold for the amino acid to be evaluated, α_w, β_w are the Gamma parameters of the score distribution of incorrect amino acids, and α_r, β_r are the Gamma parameters of the score distribution of correct amino acids.

According to a third aspect of the present invention, there is provided a method for assessing the location of a modification site. The evaluation method comprises the following steps:

enumerating, for a given peptide fragment sequence, the candidate modification sites at which phosphorylation modification can occur;

obtaining, according to the amino acid reliability evaluation method described above, the reliability score of phosphorylation modification at each candidate site.

In the method for evaluating the location of a modification site of the present invention, the method further comprises calculating the probability of phosphorylation modification at each candidate modification site using the following formula:

wherein p_i denotes the prior probability of the i-th modification site, s_i denotes the reliability score of candidate phosphorylation site i, and t_i indicates whether candidate phosphorylation site i is phosphorylated: t_i = 1 indicates phosphorylation and t_i = 0 indicates no phosphorylation.

Compared with the prior art, the invention has the advantages that: the reliability of the amino acid is evaluated by using a machine learning method, and the accuracy is high; the concept of False Amino-acid Rate (FAR) at the Amino acid level was first proposed and used for quality control in the field of de novo sequencing; the evaluation of the amino acid reliability and the evaluation of the modification site location are unified, and the evaluation performance of the modification site location is improved.

Drawings

The invention is illustrated and described only by way of example and not by way of limitation in the scope of the invention as set forth in the following drawings, in which:

FIG. 1 shows a flow diagram of an amino acid confidence model training method according to one embodiment of the invention;

FIG. 2 shows a schematic diagram of generating a background peptide stretch according to one embodiment of the present invention;

FIG. 3 shows a flow diagram of a method for assessing amino acid reliability according to one embodiment of the invention;

FIG. 4 shows a schematic of amino acid confidence score distribution versus Gamma distribution;

FIG. 5 shows a flow chart of a method of assessing modification site localization according to one embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In brief, the method for assessing the amino acid reliability of the present invention comprises two processes, wherein the first process is to obtain a model for assessing the amino acid reliability by training using a machine learning method, and the second process is to obtain the reliability of the amino acid to be assessed using the trained model. FIG. 1 shows a flow diagram of a method for training an amino acid reliability assessment model according to one embodiment of the invention.

Step S110, selecting training sample

In this step, training samples, including positive and negative samples, will be selected for machine learning.

In one example, the process of selecting training samples includes:

Step 111: Search the data set corresponding to the biological sample used to obtain the training samples against a database, and take the results with a false discovery rate (FDR) of at most 1% as the labeling set.

The data set is a data set obtained by putting a biological sample (containing a lot of peptide fragments, and the information of the peptide fragments needs to be analyzed) into a mass spectrometer, wherein the data set usually contains tens of thousands of spectrograms, each spectrogram corresponds to a peptide fragment sequence, and a computer is needed to analyze the peptide fragment sequence of each spectrogram.

Searching the data set generated by the mass spectrometer using a database (i.e., a known gene library) refers to matching and scoring the data set in the database to find the peptide fragment sequence corresponding to each spectrogram in the data set and having the best score in the database.

In order to ensure the accuracy of the obtained training samples, the search results with the false discovery rate FDR less than or equal to 1% are selected as the labeling set.

Step 112: Perform de novo sequencing on the same data set using de novo sequencing software, so that peptide fragment sequences are obtained directly from the spectrum information without database information.

Step 113: For each amino acid obtained by de novo sequencing, if it is consistent with the type of the amino acid in the labeling set, it is regarded as a positive sample; otherwise, it is regarded as a negative sample.
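The labeling rule of step 113 can be sketched as follows. The sketch assumes the de novo sequence and the labeling-set sequence for the same spectrum have equal length and are compared position by position; the patent does not state how residues are aligned, so this alignment rule and the function name `label_residues` are assumptions.

```python
from typing import List


def label_residues(de_novo: str, labeled: str) -> List[int]:
    """Return 1 (positive sample) where the de novo residue equals the
    labeling-set residue at the same position, else 0 (negative sample)."""
    if len(de_novo) != len(labeled):
        # Assumption of this sketch: unequal lengths are not handled.
        raise ValueError("sequences must have equal length")
    return [1 if a == b else 0 for a, b in zip(de_novo, labeled)]
```

With the labeling-set sequence `AQPSK`, the de novo result `QAPSK` would yield two negative samples (the first two residues) and three positive samples.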

Through step S110, positive-sample and negative-sample peptide fragment sequences for training are obtained. Hereinafter, a peptide fragment sequence used for training is referred to as a training peptide fragment or a training peptide fragment sequence, and the amino acids it contains are referred to as amino acids to be trained.

Step S120, generating the background peptide fragments of the amino acid to be trained.

For the training peptide fragment a₁a₂…a_l, suppose the amino acid to be trained is a_i, with i ranging from 1 to l, where l is the length of the peptide fragment and generally ranges from 6 to 30. According to one embodiment of the present invention, the step of generating the background peptide fragments comprises:

Step 121: For the amino acid a_i to be trained, enumerate all subsequences of length k that contain it, where k is the length of the enumerated subsequences and generally takes a value between 2 and 5; if k is too large, the algorithm becomes slow;

Step 122: Assuming that k is chosen to be 3, there are three subsequence forms, namely a_{i-2}a_{i-1}a_i, a_{i-1}a_ia_{i+1} and a_ia_{i+1}a_{i+2};

Step 123: Enumerate all amino acid permutations whose mass is equal to the mass of each of these three subsequences, giving three corresponding sets S₁, S₂ and S₃;

Step 124: Splice each of the three sets with the remaining sequence of the training peptide fragment to obtain the background peptide fragments a₁…a_{i-3}S₁a_{i+1}…a_l, a₁…a_{i-2}S₂a_{i+2}…a_l and a₁…a_{i-1}S₃a_{i+3}…a_l. Because S₁, S₂ and S₃ are sets, each spliced form is itself a set containing a very large number of background peptide fragments; for example, a₁…a_{i-3}S₁a_{i+1}…a_l is a set.

In this step, the purpose of generating the background peptide fragments is to determine whether the amino acid is correct by comparing the spectral characteristics of the training peptide fragment with those of the background peptide fragments. FIG. 2 shows a schematic of the process of generating background peptide fragments; in the spectrum of FIG. 2, the abscissa m/z is the mass-to-charge ratio (mass divided by charge) and the ordinate is the intensity of the spectral peak. In FIG. 2, the correct peptide fragment sequence (i.e., the training peptide fragment sequence) is AQPSK and the correctness of the first amino acid A is to be determined. All amino acid permutations whose mass is equal to the mass of AQPS are enumerated, for example QAPS, APSQ, APQS, …, TQPG. Splicing each of these permutations with the remaining sequence of the training peptide fragment yields the background peptide fragments QAPSK, APSQK, APQSK, …, TQPGK. As the corresponding spectrum in FIG. 2 shows, the training peptide fragment AQPSK matches 7 peaks (y4, y3, y2, y1, b2, b3 and b4), giving a score of 7, whereas the background peptide fragment QAPSK matches 6 peaks (y3, y2, y1, b2, b3 and b4) and the background peptide fragment TQPGK matches 2 peaks (y1 and b4). Both background peptide fragments match fewer peaks than the training peptide fragment AQPSK, so the reliability of amino acid A in the training peptide fragment is high.
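Steps 121 to 124 can be sketched as a small enumeration. The sketch below makes several assumptions that the patent does not state: residue masses are standard monoisotopic values stored as integers in units of 1e-5 Da, "equal mass" means equal within a tolerance `tol`, enumerated sequences may differ in length from the window they replace, and the original subsequence itself is excluded from the background set. The names `equal_mass_sequences` and `background_peptides` are illustrative.

```python
from typing import Dict, List, Set

# Monoisotopic residue masses in units of 1e-5 Da (integers avoid float drift).
MASS: Dict[str, int] = {
    "G": 5702146, "A": 7103711, "S": 8703203, "P": 9705276, "V": 9906841,
    "T": 10104768, "C": 10300919, "L": 11308406, "I": 11308406, "N": 11404293,
    "D": 11502694, "Q": 12805858, "K": 12809496, "E": 12904259, "M": 13104049,
    "H": 13705891, "F": 14706841, "R": 15610111, "Y": 16306333, "W": 18607931,
}


def equal_mass_sequences(target: int, tol: int) -> List[str]:
    """Every amino acid sequence whose summed residue mass is within tol of target."""
    found: List[str] = []

    def extend(prefix: str, remaining: int) -> None:
        if prefix and abs(remaining) <= tol:
            found.append(prefix)
        for aa, m in MASS.items():
            if remaining - m >= -tol:  # prune branches that overshoot the target
                extend(prefix + aa, remaining - m)

    extend("", target)
    return found


def background_peptides(peptide: str, i: int, k: int, tol: int = 1000) -> Set[str]:
    """Background set for residue i (0-based), using every length-k window covering i."""
    background: Set[str] = set()
    for start in range(max(0, i - k + 1), min(i, len(peptide) - k) + 1):
        window = peptide[start:start + k]
        target = sum(MASS[aa] for aa in window)
        for alt in equal_mass_sequences(target, tol):
            if alt != window:  # assumption: the original window is not background
                background.add(peptide[:start] + alt + peptide[start + k:])
    return background
```

Calling `background_peptides("AQPSK", 0, 4)` reproduces the FIG. 2 situation: the single window AQPS is replaced by every equal-mass alternative, so the result contains QAPSK, APSQK, APQSK and TQPGK among others. The enumeration grows quickly with k, which is consistent with the remark that large k makes the algorithm slow.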

It should be understood that fig. 2 only schematically illustrates the process and meaning of generating the background peptide fragment, and the present invention is to evaluate the correctness of the amino acid to be trained by a machine learning method using a plurality of features of the extracted background peptide fragment according to the following detailed description.

Step S130, extracting and selecting the characteristics of the training peptide segment and the amino acid to be trained

The aim of the step is to select the characteristics which can effectively evaluate the credibility of the amino acid to be trained from the training peptide segment and the background peptide segment of the amino acid to be trained.

In one example, the extracted features include at least two of the following: 1) the peptide-spectrum match score of the training peptide fragment; 2) the spectral peak intensity matching ratio of the training peptide fragment; 3) the matched spectral peak number ratio of the training peptide fragment; 4) the difference in peptide-spectrum match score between the training peptide fragment and the best-scoring background peptide fragment; 5) the difference in spectral peak intensity matching ratio between the training peptide fragment and the best-scoring background peptide fragment; 6) the difference in matched spectral peak number ratio between the training peptide fragment and the best-scoring background peptide fragment; 7) amino acid position information (e.g., from 1 to the peptide fragment length l); 8) amino acid class information; 9) peptide fragment length information.

Specifically, the process of extracting the nine-dimensional features includes:

calculating the peptide-spectrum match score psm₁, the spectral peak intensity matching ratio psm₂ and the matched spectral peak number ratio psm₃ of the training peptide fragment a₁a₂…a_l as feature 1, feature 2 and feature 3;

calculating the peptide-spectrum match score, the spectral peak intensity matching ratio and the matched spectral peak number ratio of all background peptide fragments, finding the background peptide fragment with the highest peptide-spectrum match score, whose three scores are denoted psm′₁, psm′₂ and psm′₃ respectively, and calculating the differences between the training peptide fragment scores and the best background peptide fragment scores, expressed as psm₁ − psm′₁, psm₂ − psm′₂ and psm₃ − psm′₃, as feature 4, feature 5 and feature 6;

calculating the position information of the amino acid a_i to be trained, the amino acid class information (the class information indicates the type of the amino acid; there are 20 amino acids in total, represented by the 26 capital letters with B, J, O, U, X and Z removed), and the length information of the training peptide fragment (i.e., for the training peptide fragment a₁a₂…a_l, the length is l), as feature 7, feature 8 and feature 9.
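The nine-dimensional feature vector can be sketched as below. The patent does not define the peptide-spectrum match scoring function, so this sketch substitutes the number of matched singly charged b/y ions for psm₁ (as in the FIG. 2 illustration), takes psm₂ as matched intensity over total intensity, and takes psm₃ as matched theoretical ions over all theoretical ions. Those definitions, the 0.02 m/z tolerance, the integer class encoding and the names `theoretical_ions`, `psm_scores` and `nine_features` are all assumptions.

```python
from typing import List, Sequence, Tuple

# Monoisotopic residue masses in daltons.
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.00728, 18.01056
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 26 capital letters minus B, J, O, U, X, Z
Peak = Tuple[float, float]  # (m/z, intensity)


def theoretical_ions(peptide: str) -> List[float]:
    """Singly charged b1..b(l-1) followed by y1..y(l-1)."""
    masses = [MASS[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:j]) + PROTON for j in range(1, n)]
    y = [sum(masses[n - j:]) + WATER + PROTON for j in range(1, n)]
    return b + y


def psm_scores(peptide: str, spectrum: Sequence[Peak], tol: float = 0.02):
    """Return (psm1, psm2, psm3) under the stand-in definitions described above."""
    ions = theoretical_ions(peptide)
    matched_peaks = set()
    matched_ions = 0
    for mz in ions:
        hits = [j for j, (pmz, _) in enumerate(spectrum) if abs(pmz - mz) <= tol]
        if hits:
            matched_ions += 1
            matched_peaks.update(hits)
    total_int = sum(inten for _, inten in spectrum)
    matched_int = sum(spectrum[j][1] for j in matched_peaks)
    psm2 = matched_int / total_int if total_int else 0.0
    psm3 = matched_ions / len(ions) if ions else 0.0
    return float(matched_ions), psm2, psm3


def nine_features(peptide: str, i: int, spectrum: Sequence[Peak],
                  background: Sequence[str], tol: float = 0.02) -> list:
    """Features 1-9 for residue i (0-based) of the peptide."""
    own = psm_scores(peptide, spectrum, tol)
    # Best background peptide = highest psm1 among the background set.
    best = max((psm_scores(b, spectrum, tol) for b in background),
               key=lambda s: s[0], default=(0.0, 0.0, 0.0))
    return [own[0], own[1], own[2],
            own[0] - best[0], own[1] - best[1], own[2] - best[2],
            i + 1, ALPHABET.index(peptide[i]), len(peptide)]
```

As a check against the FIG. 2 example, a synthetic spectrum holding the b2, b3, b4 and y1 to y4 ions of AQPSK gives a matched-ion count of 7 for AQPSK, 6 for QAPSK and 2 for TQPGK, so feature 4 for the first residue is 1.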

Step S140, training the classification model to obtain an amino acid reliability evaluation model

A classifier is trained on the obtained positive and negative samples by a machine learning method to obtain a trained classification model, namely the amino acid reliability evaluation model. Positive samples represent correct amino acids and negative samples represent incorrect amino acids. The training input is the nine-dimensional feature vector of each amino acid in the positive and negative samples, extracted by the process of step S130, and the training output is the score of the amino acid being a correct amino acid or an incorrect amino acid.

In this step, the classifier may be a Support Vector Machine (SVM) or another type such as a decision tree, a random forest (RF) or a Bayesian network. In one embodiment, when an SVM is used for classification, the radial basis kernel function is used, although other kernel functions, or even linear classification without a kernel function, can also be used.
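To show the shape of the training interface (feature vectors in, a real-valued score out), the following sketch trains a plain perceptron. This is a deliberate substitution: the embodiment uses an SVM with a radial basis kernel, and a perceptron is only the simplest member of the "linear classification without a kernel function" family mentioned above. The two-dimensional toy vectors stand in for features 4 and 5 and are invented for illustration; real features would normally be scaled first.

```python
from typing import List, Sequence, Tuple


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(X: Sequence[Sequence[float]], y: Sequence[int],
                     epochs: int = 100) -> Tuple[List[float], float]:
    """Labels are +1 (correct amino acid) or -1 (incorrect amino acid)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) <= 0:  # misclassified: move the boundary
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:  # converged on the training data
            break
    return w, b


def reliability_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed distance-like score; larger means more likely correct."""
    return dot(w, x) + b
```

A call such as `w, b = train_perceptron(X, y)` followed by `reliability_score(w, b, features)` mirrors the train-then-score flow of steps S140 and S340, with the perceptron score playing the role that the SVM score plays in the embodiment.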

The reliability of any amino acid to be evaluated can be assessed using the trained classification model; see the flow chart of the amino acid reliability evaluation method shown in FIG. 3. This embodiment takes an SVM classification model as an example and specifically includes:

Step S310, preprocessing the spectrum of the original peptide fragment to be evaluated.

The purpose of this step is to remove, before de novo sequencing, the large number of isotope peaks and noise peaks from the spectrum corresponding to the original peptide fragment containing the amino acid to be evaluated, so that they do not interfere with the de novo sequencing algorithm; for example, peaks near the parent ion are removed, as are neutral-loss peaks such as those arising from the loss of water and ammonia molecules.

In one example, the process of preprocessing the spectrogram comprises:

enumerating the charges in the spectrum corresponding to the original peptide fragment, and finding all isotope peak clusters according to the mass difference between every two spectral peaks; determining the charge from the mass difference between two peaks in an isotope peak cluster: if the mass difference is approximately 1/n, the charge is +n; converting the monoisotopic peak to its singly charged mass according to the charge, and removing the other isotope peaks; and removing the parent ion peak and the parent ion water-loss and ammonia-loss peaks from the spectrum.
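The charge determination and conversion to singly charged mass can be sketched as follows, on the understanding that adjacent isotope peaks of an ion with charge +n are spaced by roughly 1/n on the m/z axis. The sketch uses the 13C/12C mass difference (about 1.00336 Da) as the isotope spacing and 1.007276 Da as the proton mass; the tolerance, the maximum charge and the function names are assumptions.

```python
from typing import Optional

PROTON = 1.007276           # proton mass in Da
ISOTOPE_SPACING = 1.003355  # approximate 13C - 12C mass difference in Da


def infer_charge(mz_a: float, mz_b: float,
                 max_charge: int = 4, tol: float = 0.01) -> Optional[int]:
    """Return n if two peaks are spaced by about ISOTOPE_SPACING / n, else None."""
    diff = abs(mz_b - mz_a)
    for n in range(1, max_charge + 1):
        if abs(diff - ISOTOPE_SPACING / n) <= tol:
            return n
    return None


def to_singly_charged(mz: float, charge: int) -> float:
    """m/z the same species would have if it carried a single proton."""
    return mz * charge - (charge - 1) * PROTON
```

For instance, two peaks at m/z 500.0000 and 500.5017 are spaced by about half a dalton, so the sketch reports charge 2 and converts the monoisotopic peak to a singly charged m/z of about 998.9927.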

Step S320, generating the background peptide fragments of the amino acid to be evaluated.

Using a procedure similar to step S120, background peptide fragments are generated for the amino acid a_i to be evaluated in the original peptide fragment sequence a₁a₂…a_l.

Step S330, extracting and selecting the characteristics of the original peptide segment and the amino acid to be evaluated

Features of the amino acid to be evaluated are extracted and selected using a process similar to step S130. Similarly, the extracted features include at least two of the following: 1) the peptide-spectrum match score of the original peptide fragment; 2) the spectral peak intensity matching ratio of the original peptide fragment; 3) the matched spectral peak number ratio of the original peptide fragment; 4) the difference in peptide-spectrum match score between the original peptide fragment and the best-scoring background peptide fragment; 5) the difference in spectral peak intensity matching ratio between the original peptide fragment and the best-scoring background peptide fragment; 6) the difference in matched spectral peak number ratio between the original peptide fragment and the best-scoring background peptide fragment; 7) amino acid position information (e.g., from 1 to the peptide fragment length l); 8) amino acid class information; 9) peptide fragment length information.

Step S340, obtaining the reliability score distribution of the amino acid to be evaluated using the trained amino acid reliability evaluation model.

In this step, the extracted features of the original peptide fragment and the amino acid to be evaluated are input into the obtained amino acid reliability evaluation model; all amino acids to be evaluated are scored with the trained SVM model, and the scores are plotted as a score distribution. For convenience in the following description, the amino acid reliability score is called the SVM-Score.

The reliability score obtained in this step can indicate whether the amino acid to be evaluated is correct; for example, if the score is higher than a predetermined threshold, the reliability of the amino acid to be evaluated is considered high.

Step S350, using Gamma to fit the amino acid confidence score distribution.

The purpose of this step is to further process the obtained score distribution so that the reliability of the amino acid can be assessed more accurately.

Since the SVM-Score distribution resembles a Gamma distribution, in this example Gamma distributions are used to fit the amino acid reliability score distribution. For example, the fitting may be done using the EM (expectation maximization) method in combination with Gamma distributions. Because the scores of the amino acids to be evaluated necessarily come from two distributions (there are two classes: correct amino acids and incorrect amino acids), two Gamma distributions Γ(X | α_w, β_w) and Γ(X | α_r, β_r) are fitted to the SVM-Scores of the incorrect and correct amino acids respectively, where X represents the SVM-Score, α_w, β_w are the Gamma parameters of the distribution of incorrect results, and α_r, β_r are the Gamma parameters of the distribution of correct results.
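One way to fit two Gamma components with EM is sketched below. It is not claimed to be the patent's procedure: the M-step here uses weighted moment matching (shape = mean²/variance, rate = mean/variance) instead of an exact maximum-likelihood update, the Gamma is parameterized by shape and rate, scores are assumed to be strictly positive (raw SVM decision values would need shifting), and the starting point is a split at the median. The helper `synthetic_scores` only generates made-up data for trying the fit.

```python
import math
import random
from typing import List, Sequence, Tuple


def gamma_pdf(x: float, shape: float, rate: float) -> float:
    """Gamma density (shape/rate form), evaluated in log space for stability."""
    log_p = (shape * math.log(rate) + (shape - 1.0) * math.log(x)
             - rate * x - math.lgamma(shape))
    return math.exp(log_p)


def moment_fit(xs: Sequence[float], ws: Sequence[float]) -> Tuple[float, float]:
    """Weighted moment matching: returns (shape, rate)."""
    total = sum(ws)
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    return mean * mean / var, mean / var


def fit_two_gammas(scores: Sequence[float], iterations: int = 50):
    """Return ([p_w, p_r], [(alpha_w, beta_w), (alpha_r, beta_r)])."""
    xs = sorted(scores)
    half = len(xs) // 2
    # Initial guess: lower half of the scores is "wrong", upper half "right".
    params = [moment_fit(xs[:half], [1.0] * half),
              moment_fit(xs[half:], [1.0] * (len(xs) - half))]
    priors = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of the "wrong" component for each score.
        resp_w: List[float] = []
        for x in xs:
            pw = priors[0] * gamma_pdf(x, *params[0])
            pr = priors[1] * gamma_pdf(x, *params[1])
            resp_w.append(pw / (pw + pr) if pw + pr > 0 else 0.5)
        resp_r = [1.0 - r for r in resp_w]
        # M-step: update priors and component parameters.
        priors = [sum(resp_w) / len(xs), sum(resp_r) / len(xs)]
        params = [moment_fit(xs, resp_w), moment_fit(xs, resp_r)]
    return priors, params


def synthetic_scores(seed: int = 0, n: int = 1000) -> List[float]:
    """Made-up scores: a low-scoring and a high-scoring Gamma population."""
    rng = random.Random(seed)
    return ([rng.gammavariate(2.0, 1.0) for _ in range(n)]
            + [rng.gammavariate(20.0, 0.5) for _ in range(n)])
```

Note that `random.gammavariate` takes a scale parameter while the fit returns a rate, so the mean of a fitted component is shape divided by rate. The fitted priors correspond to p_w and p_r in the false amino-acid rate calculation of step S360.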

FIG. 4 shows a schematic of SVM-Score distribution versus Gamma distribution, with the abscissa representing the Score value and the ordinate representing the proportion of amino acids corresponding to the Score (percent), where "real data" represents the actual scored distribution including all correct and incorrect amino acids, "real incorrect" and "real correct" represent the actual scored distribution of incorrect and correct amino acids, respectively, and "affected incorrect" and "affected correct" represent the estimated Gamma distribution of incorrect and correct amino acids, respectively, using the EM algorithm.

Step S360, calculating the false amino-acid rate (FAR) of the amino acid to be evaluated:

$$\mathrm{FAR} = \frac{p_w\,\Gamma(X \mid \alpha_w, \beta_w)}{p_w\,\Gamma(X \mid \alpha_w, \beta_w) + p_r\,\Gamma(X \mid \alpha_r, \beta_r)}$$

wherein p_w represents the prior probability of an incorrect amino acid, p_w × Γ(X | α_w, β_w) represents the number of incorrect amino acids exceeding the threshold, p_r represents the prior probability of a correct amino acid, and p_r × Γ(X | α_r, β_r) represents the number of correct amino acids exceeding the threshold. The False Amino-acid Rate (FAR) is proposed by the present invention for the first time: at a given score threshold, it is the number of incorrect amino acids exceeding the threshold divided by the total number of amino acids exceeding the threshold. The FAR can be used for quality control in the field of de novo sequencing.
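The FAR at a threshold X, i.e. the prior-weighted tail area of the incorrect distribution divided by the sum of the prior-weighted tail areas of both distributions, can be computed as in the sketch below. It assumes a shape/rate parameterization of the Gamma distribution (the text does not say whether β is a rate or a scale) and evaluates the upper tail with a standard power series for the regularized lower incomplete gamma function, which is adequate for moderate arguments but not tuned for extreme ones. Function names are illustrative.

```python
import math
from typing import Tuple


def gamma_sf(x: float, shape: float, rate: float) -> float:
    """Upper-tail area P(S > x) of a Gamma(shape, rate) distribution."""
    if x <= 0.0:
        return 1.0
    z = rate * x
    # Series: P(a, z) = z^a e^-z / Gamma(a) * sum_n z^n / (a (a+1) ... (a+n)).
    term = 1.0 / shape
    total = term
    for n in range(1, 100000):
        term *= z / (shape + n)
        total += term
        if term < 1e-16 * total:
            break
    lower = total * math.exp(shape * math.log(z) - z - math.lgamma(shape))
    return min(1.0, max(0.0, 1.0 - lower))


def false_amino_acid_rate(threshold: float,
                          p_w: float, wrong: Tuple[float, float],
                          p_r: float, right: Tuple[float, float]) -> float:
    """FAR = weighted wrong tail / (weighted wrong tail + weighted right tail)."""
    w = p_w * gamma_sf(threshold, *wrong)
    r = p_r * gamma_sf(threshold, *right)
    return w / (w + r) if w + r > 0 else 0.0
```

As a sanity check with shape 1 (an exponential distribution), rates 1.0 and 0.5 and equal priors, the FAR at threshold 2 is e⁻² / (e⁻² + e⁻¹) = 1 / (1 + e), and it decreases as the threshold rises. A threshold for a target FAR such as 5% could then be found by scanning candidate thresholds.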

According to another aspect of the present invention, the trained classification model can be used to evaluate the location of modification sites, as shown in FIG. 5, which includes the following steps:

Step S510, enumerating, for a given peptide fragment sequence, the candidate modification sites at which phosphorylation can occur.

For example, suppose the given peptide fragment sequence is WQSHTPPYAEK, a phosphorylation modification has occurred on the sequence, and phosphorylation can occur on the three amino acids S, T and Y. The specific process for locating these three candidate modification sites is:

enumerating all forms of WQSHTPPYAEK in which a phosphorylation modification occurs at a candidate site: WQpSHTPPYAEK, WQSHpTPPYAEK and WQSHTPPpYAEK, where "pS" denotes phosphorylation of amino acid S, "pT" denotes phosphorylation of amino acid T, and "pY" denotes phosphorylation of amino acid Y. A phosphorylated site can be written uniformly as "pX", denoting phosphorylation of amino acid X; in this embodiment X can be any one of S, T and Y, and pX can be regarded as a new amino acid.
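The enumeration of candidate sites can be sketched in a few lines. The sketch assumes exactly one phosphorylation per peptide, as in the example, and marks the modified residue by prefixing it with "p"; the function name is illustrative.

```python
from typing import List, Tuple


def enumerate_phospho_candidates(peptide: str,
                                 residues: str = "STY") -> List[Tuple[int, str]]:
    """Return (0-based site index, peptide with pX at that site) for each candidate."""
    return [(i, peptide[:i] + "p" + peptide[i:])
            for i, aa in enumerate(peptide) if aa in residues]
```

For WQSHTPPYAEK this yields the three forms WQpSHTPPYAEK, WQSHpTPPYAEK and WQSHTPPpYAEK, each of which can then be scored by treating pX as a new amino acid.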

Step S520, obtaining the reliability score of phosphorylation modification at each candidate site using the trained amino acid reliability evaluation model.

The amino acid reliability evaluation method provided by the invention is used to calculate the reliability score of the new amino acid pX. In this embodiment there are three candidate modification sites S, T and Y, whose scores are denoted s₁, s₂ and s₃.

Step S530, calculating the probability of phosphorylation modification at each candidate modification site.

In this example, Bayes' formula is used to calculate the probability that phosphorylation occurs at each candidate modification site, i.e.:

wherein p_i denotes the prior probability of the i-th modification site, s_i denotes the reliability score of candidate phosphorylation site i obtained by the method of the present invention, and t_i indicates whether phosphorylation has occurred at the i-th site: t_i = 1 indicates phosphorylation and t_i = 0 indicates no phosphorylation.

In conclusion, the invention unifies two problems of amino acid reliability evaluation and modification site location, and considers the modified amino acid as a new amino acid, so that the method for evaluating the amino acid reliability can also be applied to the evaluation of the modification site location.

The methods of the present invention may be implemented in software, hardware, or a combination of software and hardware. To further validate the effect of the invention, the inventors implemented the method as software and compared it with PEAKS and Novor, currently the only two software tools that support the evaluation of amino acid reliability. The results show that the method of the invention performs much better than both on three real data sets; for example, with the FAR controlled at 5%, the invention identifies 124.8% more amino acids than PEAKS, the better-performing of the two. On three phosphorylation-enriched data sets, the method of the invention also outperforms the current software tools Ascore and phosphoRS in modification site localization; for example, with the FAR controlled at 1%, the method identifies 67.5% more phosphorylation sites than Ascore and 65.6% more than phosphoRS, while covering 98% of the results of Ascore and phosphoRS, with 21% of its results identified by the method alone.

It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.

The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.

The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method for training an amino acid reliability assessment model, comprising the following steps:

step 2: extracting a plurality of features from the training peptide fragment, the amino acid to be trained and the background peptide fragment set of the amino acid to be trained;

2. The method for training an amino acid reliability assessment model according to claim 1, wherein the step 1 comprises:

enumerating, from the training peptide fragment, sets of amino acids whose mass is equal to the mass of the subsequence;

and splicing the amino acid set with the rest sequences in the training peptide segment to obtain the background peptide segment set of the amino acid to be trained.

3. The method for training an amino acid reliability assessment model according to claim 1, wherein the step 2 comprises:

calculating the peptide-spectrum match score psm′₁, the matched spectral peak intensity proportion psm′₂ and the matched spectral peak number proportion psm′₃ of the best background peptide fragment in the background peptide fragment set of the amino acid to be trained, and calculating the differences between the scores of the training peptide fragment and those of the best background peptide fragment, expressed as psm₁ − psm′₁, psm₂ − psm′₂ and psm₃ − psm′₃, as a fourth feature, a fifth feature and a sixth feature respectively, wherein the best background peptide fragment is the background peptide fragment with the highest peptide-spectrum match score in the background peptide fragment set of the amino acid to be trained;

4. The method for training an amino acid reliability assessment model according to any one of claims 1 to 3, wherein in step 3, the classification model comprises any one of a support vector machine, a decision tree, a random forest and a Bayesian network.

5. A method for assessing the reliability of amino acids, comprising:

step 53: inputting the extracted features into an amino acid reliability assessment model obtained by the method for training an amino acid reliability assessment model according to any one of claims 1 to 4, to obtain the reliability score distribution of the amino acid to be assessed.

6. The method for assessing amino acid reliability according to claim 5, further comprising:

wherein FAR represents the false discovery rate of the amino acids to be evaluated; p_w and p_r represent the prior probabilities of incorrect and correct amino acids, respectively; Γ(X | α_w, β_w) represents the area of the score distribution of incorrect amino acids above the score threshold X; Γ(X | α_r, β_r) represents the area of the score distribution of correct amino acids above the score threshold X; X represents the score of the amino acid to be evaluated; α_w, β_w represent the gamma parameters of the score distribution of incorrect amino acids; and α_r, β_r represent the gamma parameters of the score distribution of correct amino acids.

7. A method of assessing modification site localization, comprising:

obtaining a reliability score for the phosphorylation modification of each candidate site by using the method for assessing amino acid reliability according to claim 5.

8. The method of assessing modification site localization of claim 7, further comprising calculating the probability of each candidate modification site for phosphorylation modification using the following formula:

9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.

10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the steps of the method of any of claims 1 to 8 are implemented when the processor executes the program.
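The background-set construction recited in claim 2 (enumerating amino acid sets whose mass equals that of a subsequence, then splicing them with the rest of the training peptide) can be sketched in Python as follows. The window definition around the amino acid of interest, the mass tolerance, the reduced residue table, and the function names are assumptions made for illustration; the claim text does not fix them.

```python
# Monoisotopic residue masses (Da); a reduced alphabet for illustration.
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
}


def equal_mass_strings(target, tol, max_len):
    """All residue strings (order matters) with mass within tol of target."""
    results = []

    def extend(prefix, remaining):
        if prefix and abs(remaining) <= tol:
            results.append(prefix)
        if len(prefix) == max_len:
            return
        for aa, m in MASS.items():
            if m <= remaining + tol:  # prune branches that overshoot
                extend(prefix + aa, remaining - m)

    extend("", target)
    return results


def background_peptides(peptide, pos, half_width=1, tol=0.01):
    """Replace the window around `pos` with every equal-mass alternative.

    Assumption: the subsequence is the residue at `pos` plus `half_width`
    neighbours on each side, and alternatives may be one residue longer.
    """
    lo = max(0, pos - half_width)
    hi = min(len(peptide), pos + half_width + 1)
    window = peptide[lo:hi]
    target = sum(MASS[aa] for aa in window)
    alternatives = equal_mass_strings(target, tol, len(window) + 1)
    spliced = {peptide[:lo] + alt + peptide[hi:] for alt in alternatives}
    spliced.discard(peptide)  # the training peptide itself is not background
    return sorted(spliced)


# N (114.04293 Da) is nearly isobaric with GG (114.04292 Da).
print(background_peptides("PNK", 1, half_width=0))
```

Every spliced peptide has the same precursor mass as the training peptide within the tolerance, which is what allows it to compete with the training peptide for the same spectrum.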