Disclosure of Invention
Accordingly, the present invention has been made to overcome the above-mentioned drawbacks of the prior art and to provide a method for evaluating the reliability of amino acids and a method for evaluating the localization of modified sites.
According to a first aspect of the invention, an amino acid reliability assessment model training method is provided. The method comprises the following steps:
step 1: generating a background peptide fragment set of the amino acid to be trained according to a training peptide fragment containing the amino acid to be trained;
step 2: extracting a plurality of features from the training peptide and the amino acid to be trained;
step 3: training a classification model, taking the extracted features as the input vector and whether the amino acid to be trained is correct as the output, so as to obtain an amino acid reliability evaluation model.
In the amino acid reliability evaluation model training method of the invention, the step 1 comprises the following steps:
enumerating subsequences of a predetermined length for the amino acid to be trained, wherein each subsequence comprises the amino acid to be trained and other amino acids in the training peptide fragment;
for said training peptide fragment, enumerating all amino acid permutations whose mass equals the mass of said subsequence;
and splicing the amino acid permutations with the remaining sequence of the training peptide fragment to obtain the background peptide fragment set of the amino acid to be trained.
In the amino acid reliability evaluation model training method of the invention, the step 2 comprises the following steps:
calculating the peptide-spectrum matching score psm_1, the spectral peak intensity matching ratio psm_2 and the matched spectral peak number ratio psm_3 of the training peptide fragment as the first feature, the second feature and the third feature, respectively;
calculating the peptide-spectrum matching score psm'_1, the spectral peak intensity matching ratio psm'_2 and the matched spectral peak number ratio psm'_3 of the best background peptide fragment in the background peptide fragment set of the amino acid to be trained, and calculating the differences between the scores of the training peptide fragment and those of the best background peptide fragment, expressed as psm_1 - psm'_1, psm_2 - psm'_2 and psm_3 - psm'_3, as the fourth feature, the fifth feature and the sixth feature, respectively, wherein the best background peptide fragment is the background peptide fragment with the highest peptide-spectrum matching score in the background peptide fragment set of the amino acid to be trained;
and taking the position information and the class information of the amino acid to be trained and the length information of the training peptide fragment as the seventh feature, the eighth feature and the ninth feature, respectively.
In the amino acid credibility assessment model training method, in step 3, the classification model comprises any one of a support vector machine, a decision tree, a random forest and a Bayesian network.
According to a second aspect of the present invention, there is provided a method for assessing the reliability of an amino acid. The evaluation method comprises the following steps:
step 51: generating a background peptide fragment set of the amino acid to be evaluated according to an original peptide fragment containing the amino acid to be evaluated;
step 52: extracting a plurality of features from the original peptide fragment and the amino acid to be evaluated;
step 53: and inputting the extracted features into an amino acid reliability evaluation model obtained by the amino acid reliability evaluation model training method to obtain reliability scoring distribution of the amino acid to be evaluated.
The method for assessing the reliability of an amino acid of the present invention further comprises:
fitting the reliability score distribution of the amino acid to be evaluated with Gamma distributions;
calculating the false amino-acid rate of the amino acid to be evaluated based on the Gamma distributions:
$$\mathrm{FAR} = \frac{p_w\,\Gamma(X \mid \alpha_w, \beta_w)}{p_w\,\Gamma(X \mid \alpha_w, \beta_w) + p_r\,\Gamma(X \mid \alpha_r, \beta_r)}$$
wherein FAR represents the false amino-acid rate of the amino acid to be evaluated, p_w and p_r represent the prior probabilities of incorrect and correct amino acids, respectively, Γ(X | α_w, β_w) denotes the area of the score distribution of incorrect amino acids above the scoring threshold X, Γ(X | α_r, β_r) denotes the area of the score distribution of correct amino acids above the scoring threshold X, X represents the scoring threshold for the amino acid to be evaluated, α_w and β_w are the Gamma parameters of the score distribution of incorrect amino acids, and α_r and β_r are the Gamma parameters of the score distribution of correct amino acids.
According to a third aspect of the present invention, there is provided a method for assessing the location of a modification site. The evaluation method comprises the following steps:
enumerating candidate modification sites for a given peptide stretch sequence where phosphorylation modifications can occur;
and obtaining the reliability score of the phosphorylation modification at each candidate site according to the above amino acid reliability evaluation method.
In the method for evaluating the location of a modification site of the present invention, the method further comprises calculating the probability of phosphorylation modification at each candidate modification site using the following formula:
wherein p_i denotes the prior probability of the i-th modification site, s_i denotes the reliability score of candidate phosphorylation site i, and t_i indicates whether candidate phosphorylation site i is phosphorylated: t_i = 1 indicates phosphorylation, and t_i = 0 indicates no phosphorylation.
Compared with the prior art, the invention has the following advantages: the reliability of amino acids is evaluated by a machine learning method with high accuracy; the concept of a False Amino-acid Rate (FAR) at the amino acid level is proposed for the first time and used for quality control in the field of de novo sequencing; and the evaluation of amino acid reliability and the evaluation of modification site localization are unified, which improves the performance of modification site localization evaluation.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In brief, the method for assessing the amino acid reliability of the present invention comprises two processes, wherein the first process is to obtain a model for assessing the amino acid reliability by training using a machine learning method, and the second process is to obtain the reliability of the amino acid to be assessed using the trained model. FIG. 1 shows a flow diagram of a method for training an amino acid reliability assessment model according to one embodiment of the invention.
Step S110, selecting training sample
In this step, training samples, including positive and negative samples, will be selected for machine learning.
In one example, the process of selecting training samples includes:
Step 111: searching the data set corresponding to the biological sample used to obtain the training samples against a database, and taking the results whose false discovery rate (FDR) is less than or equal to 1% as the annotation set.
The data set is obtained by analyzing a biological sample (which contains many peptide fragments whose information needs to be analyzed) with a mass spectrometer. It usually contains tens of thousands of spectra, each corresponding to a peptide sequence, and a computer is needed to determine the peptide sequence of each spectrum.
Searching the data set generated by the mass spectrometer against a database (i.e., a known gene library) means matching and scoring each spectrum in the data set against the database to find the best-scoring peptide sequence in the database for that spectrum.
To ensure the accuracy of the training samples, only search results with a false discovery rate (FDR) less than or equal to 1% are selected as the annotation set.
Step 112: sequencing the same data set using de novo sequencing software, which derives the peptide sequence directly from the spectrum information without database information.
Step 113: for each amino acid obtained by de novo sequencing, if it is consistent with the type of the corresponding amino acid in the annotation set, it is regarded as a positive sample; otherwise, it is regarded as a negative sample.
Through step S110, positive-sample and negative-sample peptide sequences for training are obtained. Hereinafter, a peptide sequence used for training is referred to as a training peptide fragment or a training peptide fragment sequence, and the amino acids it contains are referred to as amino acids to be trained.
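The labeling rule of step 113 can be illustrated with a minimal Python sketch. It assumes, purely for simplicity, that the de novo sequence and the annotated sequence have the same length and can be compared position by position; the function name `label_amino_acids` is hypothetical and is not part of the described method.

```python
def label_amino_acids(de_novo, annotated):
    """Label each de novo amino acid as positive (1) or negative (0).

    Assumption: both sequences have equal length and are aligned
    position by position; real data may need a mass-based alignment.
    """
    if len(de_novo) != len(annotated):
        raise ValueError("sequences must have the same length in this sketch")
    # Each entry is (1-based position, amino acid, label).
    return [(pos, aa, int(aa == ref))
            for pos, (aa, ref) in enumerate(zip(de_novo, annotated), start=1)]


labels = label_amino_acids("QAPSK", "AQPSK")
print(labels)
# [(1, 'Q', 0), (2, 'A', 0), (3, 'P', 1), (4, 'S', 1), (5, 'K', 1)]
```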
Step S120, generating background peptide fragments of the amino acid to be trained.
For a training peptide fragment a_1a_2…a_l, let the amino acid to be trained be a_i, where i ranges from 1 to l and l denotes the length of the peptide fragment, generally between 6 and 30. According to one embodiment of the present invention, the step of generating the background peptide fragments comprises:
Step 121: for the amino acid a_i to be trained, enumerating all subsequences of length k that contain a_i, where k denotes the length of the enumerated subsequences and generally takes a value between 2 and 5; if k is too large, the algorithm becomes slow;
Step 122: assuming k = 3, there are three subsequence forms, namely a_{i-2}a_{i-1}a_i, a_{i-1}a_ia_{i+1} and a_ia_{i+1}a_{i+2};
Step 123: enumerating all amino acid permutations whose masses equal the masses of these three subsequences, giving three corresponding sets S_1, S_2 and S_3;
Step 124: splicing the three sets with the remaining sequence of the training peptide fragment to obtain the background peptide sequences a_1…a_{i-3}S_1a_{i+1}…a_l, a_1…a_{i-2}S_2a_{i+2}…a_l and a_1…a_{i-1}S_3a_{i+3}…a_l. Splicing each of S_1, S_2 and S_3 thus yields a background peptide fragment that is itself a set containing a very large number of background peptide fragments; for example, a_1…a_{i-3}S_1a_{i+1}…a_l is such a set.
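Steps 121 to 124 can be sketched in Python as follows. This is only an illustration under stated assumptions: candidate permutations are restricted to the same length k as the window, masses are compared with a hypothetical tolerance of 0.001 Da, and the training peptide itself is excluded from its background set. The function names are hypothetical; the residue masses are standard monoisotopic values.

```python
from itertools import product

# Monoisotopic residue masses in daltons (standard values).
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def equal_mass_sequences(sub, tol=0.001):
    """All sequences of the same length as `sub` with the same total mass."""
    target = sum(MASS[a] for a in sub)
    return ["".join(c) for c in product(MASS, repeat=len(sub))
            if abs(sum(MASS[a] for a in c) - target) <= tol]


def background_peptides(peptide, i, k):
    """Background set for the amino acid at 0-based index i, window length k."""
    result = set()
    first = max(0, i - k + 1)            # leftmost window start containing i
    last = min(i, len(peptide) - k)      # rightmost window start containing i
    for start in range(first, last + 1):
        window = peptide[start:start + k]
        for alt in equal_mass_sequences(window):
            candidate = peptide[:start] + alt + peptide[start + k:]
            if candidate != peptide:     # assumption: exclude the peptide itself
                result.add(candidate)
    return result


background = background_peptides("AQPSK", i=0, k=4)
print(len(background), "TQPGK" in background)
```

With the peptide AQPSK and a window covering AQPS, the resulting set contains the background peptides named in the description, such as QAPSK, APSQK, APQSK and TQPGK.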
In this step, the purpose of generating the background peptide fragments is to determine whether the amino acid is correct by comparing the spectral characteristics of the training peptide fragment with those of the background peptide fragments. See FIG. 2 for a schematic of the process of generating background peptide fragments; in the spectrum of FIG. 2, the abscissa m/z represents the mass-to-charge ratio, i.e., mass divided by charge, and the ordinate represents the intensity of the spectral peak. As shown in FIG. 2, assume that the correct peptide sequence (i.e., the training peptide sequence) is AQPSK and that the correctness of the first amino acid A needs to be determined. All amino acid permutations whose mass equals the mass of AQPS are enumerated, for example: QAPS, APSQ, APQS, …, TQPG. Splicing all background amino acid permutations with the remaining sequence of the training peptide fragment generates the background peptide fragments, including QAPSK, APSQK, APQSK, …, TQPGK. As can be seen from the corresponding spectrum in FIG. 2, the number of matched peaks of the training peptide sequence AQPSK is 7 (y4, y3, y2, y1, b2, b3 and b4), denoted as a score of 7, whereas the number of matched peaks of the background peptide sequence QAPSK is 6 (y3, y2, y1, b2, b3 and b4) and that of the background peptide sequence TQPGK is 2 (y1 and b4). Both background peptide sequences match fewer peaks than the 7 matched by the training peptide sequence AQPSK, so the reliability of amino acid A in the training peptide sequence is relatively high.
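The peak counts in the AQPSK example can be reproduced with a short Python sketch. Since the actual spectrum of FIG. 2 is not available here, the sketch assumes a hypothetical spectrum consisting of exactly the seven singly charged AQPSK ions named above (b1 is taken to be unobserved), and a hypothetical matching tolerance of 0.02 Da. The function names are illustrative only.

```python
PROTON, WATER = 1.007276, 18.010565
# Monoisotopic residue masses (Da); only the residues used in this example.
MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "T": 101.04768, "Q": 128.05858, "K": 128.09496}


def by_ions(peptide):
    """Singly charged b- and y-ion m/z values, keyed by ion name."""
    ions = {}
    for n in range(1, len(peptide)):
        ions[f"b{n}"] = sum(MASS[a] for a in peptide[:n]) + PROTON
        ions[f"y{n}"] = sum(MASS[a] for a in peptide[-n:]) + WATER + PROTON
    return ions


def matched_peaks(peptide, spectrum, tol=0.02):
    """Names of theoretical ions lying within `tol` of an observed peak."""
    return sorted(name for name, mz in by_ions(peptide).items()
                  if any(abs(mz - peak) <= tol for peak in spectrum))


# Hypothetical spectrum: the seven AQPSK ions named in the text (b1 absent).
reference = by_ions("AQPSK")
spectrum = [reference[n] for n in ("b2", "b3", "b4", "y1", "y2", "y3", "y4")]

for candidate in ("AQPSK", "QAPSK", "TQPGK"):
    print(candidate, len(matched_peaks(candidate, spectrum)))
# AQPSK 7
# QAPSK 6
# TQPGK 2
```

QAPSK loses only y4, because swapping the first two residues leaves every other fragment mass unchanged, while TQPGK keeps only y1 and b4.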
It should be understood that fig. 2 only schematically illustrates the process and meaning of generating the background peptide fragment, and the present invention is to evaluate the correctness of the amino acid to be trained by a machine learning method using a plurality of features of the extracted background peptide fragment according to the following detailed description.
Step S130, extracting and selecting the characteristics of the training peptide segment and the amino acid to be trained
The aim of the step is to select the characteristics which can effectively evaluate the credibility of the amino acid to be trained from the training peptide segment and the background peptide segment of the amino acid to be trained.
In one example, the extracted features include at least two of: 1) the peptide-spectrum matching score of the training peptide fragment; 2) the spectral peak intensity matching ratio of the training peptide fragment; 3) the matched spectral peak number ratio of the training peptide fragment; 4) the difference in peptide-spectrum matching score between the training peptide fragment and the best-scoring background peptide fragment; 5) the difference in spectral peak intensity matching ratio between the training peptide fragment and the best-scoring background peptide fragment; 6) the difference in matched spectral peak number ratio between the training peptide fragment and the best-scoring background peptide fragment; 7) amino acid position information (e.g., from 1 to the peptide length l); 8) amino acid class information; 9) peptide fragment length information.
Specifically, the process of extracting the nine-dimensional features includes:
calculating the peptide-spectrum matching score psm_1, the spectral peak intensity matching ratio psm_2 and the matched spectral peak number ratio psm_3 of the training peptide fragment a_1a_2…a_l, as feature 1, feature 2 and feature 3;
calculating the peptide-spectrum matching scores, the spectral peak intensity matching ratios and the matched spectral peak number ratios of all background peptide fragments, finding the background peptide fragment with the highest peptide-spectrum matching score, whose three scores are denoted psm'_1, psm'_2 and psm'_3, and calculating the differences between the scores of the training peptide fragment and those of the best background peptide fragment, expressed as psm_1 - psm'_1, psm_2 - psm'_2 and psm_3 - psm'_3, as feature 4, feature 5 and feature 6;
calculating the position information of the amino acid a_i to be trained, its amino acid class information (which indicates the type of the amino acid; the 20 amino acids are represented by the 26 capital letters with B, J, O, U, X and Z removed), and the length information of the training peptide fragment (i.e., l for the training peptide fragment a_1a_2…a_l), as feature 7, feature 8 and feature 9.
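Assembling the nine-dimensional feature vector can be sketched as follows. The three peptide-spectrum scores are taken as given inputs, because their exact scoring functions are not specified here; encoding the amino acid class as an integer index is an assumption made for illustration, and `nine_features` is a hypothetical name.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 26 capital letters minus B, J, O, U, X, Z


def nine_features(train_psm, background_psms, position, residue, length):
    """Build the nine-dimensional feature vector for one amino acid.

    train_psm       -- (psm_1, psm_2, psm_3) of the training peptide
    background_psms -- list of (psm'_1, psm'_2, psm'_3), one per background peptide
    position        -- 1-based position of the amino acid in the peptide
    residue         -- one-letter code of the amino acid
    length          -- peptide length l
    """
    # Best background peptide: the one with the highest matching score.
    best = max(background_psms, key=lambda scores: scores[0])
    diffs = [t - b for t, b in zip(train_psm, best)]
    # Assumption: the amino acid class is encoded as an integer index.
    return list(train_psm) + diffs + [position, AMINO_ACIDS.index(residue), length]


features = nine_features((7.0, 0.75, 0.5),
                         [(6.0, 0.5, 0.25), (2.0, 0.25, 0.125)],
                         position=1, residue="A", length=5)
print(features)
# [7.0, 0.75, 0.5, 1.0, 0.25, 0.25, 1, 0, 5]
```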
Step S140, training the classification model to obtain an amino acid reliability evaluation model
A classifier is trained by a machine learning method using the obtained positive and negative samples, yielding a trained classification model, i.e., the amino acid reliability evaluation model. Positive samples represent correct amino acids and negative samples represent incorrect amino acids. The training input is the nine-dimensional feature vector of each amino acid in the positive and negative samples, extracted by the process of step S130, and the training output is a score indicating whether the amino acid is correct or incorrect.
In this step, the classifier may be a support vector machine (SVM) or another type such as a decision tree, a random forest (RF) or a Bayesian network. In one embodiment, when an SVM is used for training and classification, the radial basis kernel function is used; other kernel functions, or even a linear classifier without a kernel function, may also be used.
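As a rough illustration of the training step, the following pure-Python sketch fits the linear (non-kernel) variant mentioned above by stochastic subgradient descent on the hinge loss. It is a stand-in, not the radial basis kernel SVM of the embodiment; the hyperparameters, the function names and the toy two-dimensional data are all assumptions chosen to keep the example short.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    X -- list of feature vectors; y -- labels, +1 (correct) or -1 (incorrect).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # inside the margin: hinge-loss subgradient step
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # outside the margin: only the regulariser acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def svm_score(w, b, x):
    """Signed score; larger values suggest a correct amino acid."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy two-dimensional data (hypothetical), standing in for nine-dimensional vectors.
X = [[2, 2], [3, 3], [2, 3], [0, 0], [-1, 0], [0, -1]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([svm_score(w, b, x) > 0 for x in X])
```

In practice the nine features have very different ranges, so some form of feature scaling would normally precede training; that step is omitted here.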
The reliability of any amino acid to be evaluated can be assessed using the trained classification model; see the flow chart of the amino acid reliability evaluation method shown in FIG. 3. This embodiment is described taking an SVM classification model as an example and specifically includes:
Step S310, preprocessing the spectrum of the original peptide fragment to be evaluated.
The purpose of this step is to remove, before de novo sequencing, the large number of isotope peaks and noise peaks from the spectrum corresponding to the original peptide fragment containing the amino acid to be evaluated, so as to avoid interference with the de novo sequencing algorithm; for example, peaks near the parent ion are removed, as are neutral-loss peaks such as water-loss and ammonia-loss peaks.
In one example, the process of preprocessing the spectrogram comprises:
enumerating the charge states in the spectrum corresponding to the original peptide fragment, and searching for all isotope peak clusters according to the mass difference between every two spectral peaks; determining the charge from the mass difference between two adjacent peaks in an isotope peak cluster: if the mass difference is approximately 1/n, the charge is +n; converting the monoisotopic peak to its singly charged mass according to the charge, and removing the other isotope peaks; and removing the parent ion peak and the water-loss and ammonia-loss peaks of the parent ion from the spectrum.
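The charge determination and the conversion to singly charged mass can be sketched as follows. The isotope spacing of about 1.00335 Da and the proton mass are standard values; the maximum charge, the tolerance and the function names are assumptions made for illustration.

```python
ISOTOPE_SPACING = 1.00335   # approx. mass difference between isotope peaks (Da)
PROTON = 1.007276           # proton mass (Da)


def charge_from_spacing(delta_mz, max_charge=4, tol=0.01):
    """Return the charge n whose expected spacing 1.00335/n matches delta_mz."""
    for n in range(1, max_charge + 1):
        if abs(delta_mz - ISOTOPE_SPACING / n) <= tol:
            return n
    return None  # the two peaks are not neighbouring isotope peaks


def to_singly_charged(mz, charge):
    """Convert the m/z of an n-charged monoisotopic peak to its 1+ equivalent."""
    return mz * charge - (charge - 1) * PROTON


# Two peaks about 0.5017 apart are consistent with a doubly charged cluster.
n = charge_from_spacing(500.5017 - 500.0)
print(n, round(to_singly_charged(500.0, n), 4))
# 2 998.9927
```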
Step S320, generating background peptide fragments of the amino acid to be evaluated.
Using a procedure similar to step S120, background peptide fragments are generated for the amino acid a_i to be evaluated in the original peptide sequence a_1a_2…a_l.
Step S330, extracting and selecting the characteristics of the original peptide segment and the amino acid to be evaluated
Features of the amino acid to be evaluated are extracted and selected using a process similar to step S130. Similarly, the extracted features include at least two of the following: 1) the peptide-spectrum matching score of the original peptide fragment; 2) the spectral peak intensity matching ratio of the original peptide fragment; 3) the matched spectral peak number ratio of the original peptide fragment; 4) the difference in peptide-spectrum matching score between the original peptide fragment and the best-scoring background peptide fragment; 5) the difference in spectral peak intensity matching ratio between the original peptide fragment and the best-scoring background peptide fragment; 6) the difference in matched spectral peak number ratio between the original peptide fragment and the best-scoring background peptide fragment; 7) amino acid position information (e.g., from 1 to the peptide length l); 8) amino acid class information; 9) peptide fragment length information.
Step S340, obtaining the reliability score distribution of the amino acids to be evaluated using the trained amino acid reliability evaluation model.
In this step, the extracted features of the original peptide fragment and the amino acid to be evaluated are input into the trained amino acid reliability evaluation model, and all amino acids to be evaluated are scored with the trained SVM model and plotted as a score distribution. For convenience of the subsequent description, the amino acid reliability score is named SVM-Score.
From the reliability scores obtained in this step, it can be identified whether an amino acid to be evaluated is correct; for example, if its score is higher than a predetermined threshold, the reliability of the amino acid is considered high.
Step S350, fitting the amino acid reliability score distribution with Gamma distributions.
The purpose of this step is to further process the obtained score distribution so that the reliability of the amino acids can be determined more accurately.
Since the SVM-Score distribution is similar to a Gamma distribution, in this example Gamma distributions are used to fit the amino acid reliability score distribution. For example, the fitting may be done using the EM (expectation maximization) method in combination with the Gamma distribution. Since the score distribution of the amino acids to be evaluated necessarily consists of two distributions (there are two classes: correct amino acids and incorrect amino acids), two Gamma distributions Γ(X | α_w, β_w) and Γ(X | α_r, β_r) are used to fit the SVM-Score of incorrect and correct amino acids, respectively, where X represents the SVM-Score, α_w and β_w are the Gamma parameters of the distribution of incorrect results, and α_r and β_r are the Gamma parameters of the distribution of correct results.
FIG. 4 shows a schematic of SVM-Score distribution versus Gamma distribution, with the abscissa representing the Score value and the ordinate representing the proportion of amino acids corresponding to the Score (percent), where "real data" represents the actual scored distribution including all correct and incorrect amino acids, "real incorrect" and "real correct" represent the actual scored distribution of incorrect and correct amino acids, respectively, and "affected incorrect" and "affected correct" represent the estimated Gamma distribution of incorrect and correct amino acids, respectively, using the EM algorithm.
Step S360, calculating the false amino-acid rate (FAR) of the amino acid to be evaluated:
$$\mathrm{FAR} = \frac{p_w\,\Gamma(X \mid \alpha_w, \beta_w)}{p_w\,\Gamma(X \mid \alpha_w, \beta_w) + p_r\,\Gamma(X \mid \alpha_r, \beta_r)}$$
wherein p_w represents the prior probability of an incorrect amino acid, p_w × Γ(X | α_w, β_w) indicates the number of incorrect amino acids exceeding the threshold, p_r represents the prior probability of a correct amino acid, and p_r × Γ(X | α_r, β_r) indicates the number of correct amino acids exceeding the threshold. The False Amino-acid Rate (FAR), proposed for the first time by the present invention, is, at a given score threshold, the number of incorrect amino acids exceeding the threshold divided by the total number of amino acids exceeding the threshold. The FAR can be used for quality control in the field of de novo sequencing.
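The FAR definition above, incorrect amino acids above the threshold divided by all amino acids above the threshold, can be evaluated numerically with the following sketch. It assumes that each Gamma distribution is parameterised by a shape α and a scale β (the description does not state which convention is used) and computes the upper-tail area with a standard series for the regularized incomplete gamma function. The function names and the parameter values in the example are hypothetical.

```python
import math


def gamma_sf(x, alpha, beta):
    """Upper-tail area P(score > x) of a Gamma(shape=alpha, scale=beta)."""
    if x <= 0:
        return 1.0
    z = x / beta
    # Series for the regularized lower incomplete gamma function P(alpha, z).
    term = 1.0 / alpha
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (alpha + n)
        total += term
        n += 1
    lower = total * math.exp(alpha * math.log(z) - z - math.lgamma(alpha))
    return max(0.0, 1.0 - lower)


def false_amino_acid_rate(x, p_w, p_r, wrong, right):
    """FAR at threshold x; `wrong` and `right` are (alpha, beta) pairs."""
    w = p_w * gamma_sf(x, *wrong)   # incorrect amino acids above the threshold
    r = p_r * gamma_sf(x, *right)   # correct amino acids above the threshold
    return w / (w + r) if (w + r) > 0 else 0.0


# Hypothetical parameters: with alpha = 1 the Gamma reduces to an exponential,
# so the result can be checked by hand: e^-2 / (e^-2 + e^-1) = 1 / (1 + e).
far = false_amino_acid_rate(2.0, 0.5, 0.5, wrong=(1.0, 1.0), right=(1.0, 2.0))
print(round(far, 4))
# 0.2689
```

A score threshold for a target FAR, such as 5%, could then be found by scanning x until the computed value drops below the target.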
According to another aspect of the present invention, the trained classification model can be used to evaluate the location of modification sites, as shown in FIG. 5, which includes the following steps:
Step S510, enumerating, for a given peptide sequence, the candidate modification sites at which phosphorylation can occur.
For example, suppose that a phosphorylation modification has occurred on the given peptide sequence WQSHTPPYAEK and that phosphorylation can occur on the three amino acids S, T and Y. The specific process for locating these three candidate modification sites is:
enumerating all modification sites of WQSHTPPYAEK at which phosphorylation can occur: WQpSHTPPYAEK, WQSHpTPPYAEK and WQSHTPPpYAEK, wherein "pS" indicates phosphorylation of amino acid S, "pT" indicates phosphorylation of amino acid T, and "pY" indicates phosphorylation of amino acid Y. A phosphorylated site can be uniformly denoted "pX", i.e., phosphorylation of amino acid X; in this embodiment, X can be any one of S, T and Y, and pX can be regarded as a new amino acid.
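The enumeration of candidate sites for this example can be written as a short Python sketch; the function name `phospho_candidates` is hypothetical, and the "p" prefix follows the pS/pT/pY notation used above.

```python
def phospho_candidates(peptide, sites="STY"):
    """Return one modified sequence per candidate phosphorylation site."""
    return [peptide[:i] + "p" + peptide[i:]
            for i, aa in enumerate(peptide) if aa in sites]


candidates = phospho_candidates("WQSHTPPYAEK")
print(candidates)
# ['WQpSHTPPYAEK', 'WQSHpTPPYAEK', 'WQSHTPPpYAEK']
```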
Step S520, obtaining the reliability score of the phosphorylation modification at each candidate site using the trained amino acid reliability evaluation model.
The reliability score of the new amino acid pX is calculated using the amino acid reliability evaluation method provided by the invention. In this embodiment there are three candidate modification sites S, T and Y, whose scores are denoted s_1, s_2 and s_3.
Step S530, calculating the probability of phosphorylation modification at each candidate modification site.
In this example, Bayes' formula is used to calculate the probability that phosphorylation occurs at each candidate modification site, i.e.:
wherein p_i denotes the prior probability of the i-th modification site, s_i denotes the reliability score of candidate phosphorylation site i obtained by the method of the present invention, and t_i indicates whether the i-th site is phosphorylated: t_i = 1 indicates phosphorylation, and t_i = 0 indicates no phosphorylation.
In conclusion, the invention unifies two problems of amino acid reliability evaluation and modification site location, and considers the modified amino acid as a new amino acid, so that the method for evaluating the amino acid reliability can also be applied to the evaluation of the modification site location.
The methods of the present invention may be implemented in software, hardware, or a combination of software and hardware. To further validate the effect of the invention, the inventors implemented the method as software and compared it with PEAKS and Novor, currently the only two software tools that support the evaluation of amino acid reliability. The results show that the method of the invention performs much better than these two tools on three real data sets; for example, with the FAR controlled at 5%, the invention identifies 124.8% more amino acids than PEAKS, the better-performing of the two. The method of the invention also outperforms the currently used software tools Ascore and phosphoRS in modification site localization on three phosphorylation-enriched data sets; for example, with the FAR controlled at 1%, the method identifies 67.5% more phosphorylation sites than Ascore and 65.6% more than phosphoRS, while covering 98% of the results of Ascore and phosphoRS, with 21% of its results identified by the method alone.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.