CN107622184B - Evaluation method for amino acid reliability and modification site positioning - Google Patents

Info

Publication number
CN107622184B
CN107622184B CN201710904787.8A CN201710904787A CN107622184B CN 107622184 B CN107622184 B CN 107622184B CN 201710904787 A CN201710904787 A CN 201710904787A CN 107622184 B CN107622184 B CN 107622184B
Authority
CN
China
Prior art keywords
amino acid
training
peptide
trained
psm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710904787.8A
Other languages
Chinese (zh)
Other versions
CN107622184A (en)
Inventor
杨皓
迟浩
曾文锋
周文婧
王钊伟
王瑞敏
牛秀南
陈振霖
刘超
贺思敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201710904787.8A
Publication of CN107622184A
Application granted
Publication of CN107622184B
Current legal status: Active
Anticipated expiration

Abstract

The invention provides a training method of an amino acid reliability evaluation model. The method comprises the following steps: generating a background peptide fragment set of the amino acid to be trained according to a training peptide fragment containing the amino acid to be trained; extracting a plurality of features from the training peptide and the amino acid to be trained; and training a classification model by taking the extracted multiple features as input vectors and taking whether the amino acid to be trained is correct as output, so as to obtain an amino acid reliability evaluation model. The amino acid credibility evaluation model obtained by the invention can be used for evaluating the amino acid credibility and the positioning of the modification sites, so that the accuracy of evaluating the amino acid credibility is improved, and the evaluation performance of positioning the modification sites is improved.

Description

Evaluation method for amino acid reliability and modification site positioning
Technical Field
The invention relates to the field of biotechnology, and in particular to a method for evaluating amino acid reliability and modification site localization.
Background
Mass spectrometry has become a routine means for biologists to analyze biological samples, and peptide fragment and protein identification has become a key step in that analysis. At present, peptide fragment identification methods based on tandem mass spectrometry data mainly fall into two categories: database search methods and de novo sequencing methods. Database search methods depend heavily on the quality of the database: if the correct peptide fragment is not in the database, the identification result will be wrong. De novo sequencing does not depend on database information; it derives the peptide fragment sequence directly from the spectrogram and can therefore find new peptide fragments that are not in the database, such as those carrying mutations or unexpected modifications. The number of de novo sequencing algorithms keeps growing and includes SHERENGA, PEAKS, PepNovo, pNovo, pNovo+, UniNovo, Novor, and Open-pNovo, which supports the identification of unexpected modifications.
However, because de novo sequencing does not use database information as a prior, it inevitably reports peptide fragment sequences that are very similar to the correct one but wrong, which results in very high error rates. According to literature reports, nearly 40% of the high-scoring results obtained from de novo sequencing are erroneous. How to control the false discovery rate (FDR) in de novo sequencing therefore remains an urgent problem.
Experience shows that in a de novo sequencing result, part of the peptide fragment sequence (a contiguous stretch) is often correct while the rest is wrong. Based on this property, the reliability of each amino acid in the sequence can be evaluated, a subsequence consisting of highly reliable amino acids can be extracted as a sequence tag, and the database can then be searched with this tag to obtain the reported peptide fragment sequence. So far, however, no literature describes how to evaluate amino acid reliability specifically, and the accuracy of amino acid reliability evaluation has not been studied in depth.
Therefore, there is a need for improvements in the prior art to accurately assess the reliability of amino acids and thereby reduce the error rate of detecting peptide fragment sequences in de novo sequencing.
Disclosure of Invention
Accordingly, the present invention has been made to overcome the above-mentioned drawbacks of the prior art and to provide a method for evaluating the reliability of amino acids and a method for evaluating the localization of modified sites.
According to a first aspect of the invention, an amino acid reliability assessment model training method is provided. The method comprises the following steps:
step 1: generating a background peptide fragment set of the amino acid to be trained according to a training peptide fragment containing the amino acid to be trained;
step 2: extracting a plurality of features from the training peptide and the amino acid to be trained;
step 3: training a classification model by taking the extracted features as the input vector and whether the amino acid to be trained is correct as the output, so as to obtain an amino acid reliability evaluation model.
In the amino acid reliability evaluation model training method of the invention, the step 1 comprises the following steps:
enumerating subsequences of a predetermined length for the amino acid to be trained, wherein each subsequence comprises the amino acid to be trained and other amino acids in the training peptide fragment;
enumerating all amino acid permutations whose mass is equal to the mass of the subsequence;
and splicing the enumerated amino acid permutations with the remaining sequence of the training peptide fragment to obtain the background peptide fragment set of the amino acid to be trained.
In the amino acid reliability evaluation model training method of the invention, the step 2 comprises the following steps:
calculating the peptide-spectrum match score $psm_1$, the matched spectral peak intensity proportion $psm_2$, and the matched spectral peak number proportion $psm_3$ of the training peptide fragment as the first feature, the second feature, and the third feature, respectively;
calculating the peptide-spectrum match score $psm'_1$, the matched spectral peak intensity proportion $psm'_2$, and the matched spectral peak number proportion $psm'_3$ of the best background peptide fragment in the background peptide fragment set of the amino acid to be trained, and calculating the differences between the scores of the training peptide fragment and those of the best background peptide fragment, expressed as $psm_1 - psm'_1$, $psm_2 - psm'_2$, and $psm_3 - psm'_3$, as the fourth feature, the fifth feature, and the sixth feature, respectively, wherein the best background peptide fragment is the background peptide fragment with the highest peptide-spectrum match score in the background peptide fragment set of the amino acid to be trained;
and taking the position information of the amino acid to be trained, its class information, and the length information of the training peptide fragment as the seventh feature, the eighth feature, and the ninth feature, respectively.
In the amino acid credibility assessment model training method, in step 3, the classification model comprises any one of a support vector machine, a decision tree, a random forest and a Bayesian network.
According to a second aspect of the present invention, there is provided a method for assessing the reliability of an amino acid. The evaluation method comprises the following steps:
step 51: generating a background peptide fragment set of the amino acid to be evaluated according to an original peptide fragment containing the amino acid to be evaluated;
step 52: extracting a plurality of features from the original peptide fragment and the amino acid to be evaluated;
step 53: inputting the extracted features into an amino acid reliability evaluation model obtained by the above amino acid reliability evaluation model training method to obtain the reliability score distribution of the amino acid to be evaluated.
The method for assessing the reliability of an amino acid of the present invention further comprises:
fitting the credibility scoring distribution of the amino acid to be evaluated into Gamma distribution;
calculating the false amino-acid rate of the amino acid to be evaluated based on the Gamma distributions:
$$\mathrm{FAR} = \frac{p_w\,\Gamma(X \mid \alpha_w, \beta_w)}{p_w\,\Gamma(X \mid \alpha_w, \beta_w) + p_r\,\Gamma(X \mid \alpha_r, \beta_r)}$$
wherein FAR represents the false amino-acid rate of the amino acid to be evaluated; $p_w$ and $p_r$ are the prior probabilities of incorrect and correct amino acids, respectively; $\Gamma(X \mid \alpha_w, \beta_w)$ denotes the area of the score distribution of incorrect amino acids above the scoring threshold $X$; $\Gamma(X \mid \alpha_r, \beta_r)$ denotes the area of the score distribution of correct amino acids above the scoring threshold $X$; $X$ is the scoring threshold for the amino acid to be evaluated; $\alpha_w, \beta_w$ are the Gamma parameters of the score distribution of incorrect amino acids; and $\alpha_r, \beta_r$ are the Gamma parameters of the score distribution of correct amino acids.
According to a third aspect of the present invention, there is provided a method for assessing the location of a modification site. The evaluation method comprises the following steps:
enumerating candidate modification sites for a given peptide stretch sequence where phosphorylation modifications can occur;
according to the amino acid reliability evaluation method, the reliability score of phosphorylation modification of each candidate site is obtained.
In the method for evaluating the location of a modification site of the present invention, the method further comprises calculating the probability of phosphorylation modification at each candidate modification site using the following formula:
Figure BDA0001423772310000032
wherein $p_i$ denotes the prior probability of the i-th modification site, $s_i$ denotes the reliability score of candidate phosphorylation site $i$, and $t_i$ indicates whether candidate phosphorylation site $i$ is phosphorylated: $t_i = 1$ indicates phosphorylation and $t_i = 0$ indicates no phosphorylation.
Compared with the prior art, the invention has the advantages that: the reliability of the amino acid is evaluated by using a machine learning method, and the accuracy is high; the concept of False Amino-acid Rate (FAR) at the Amino acid level was first proposed and used for quality control in the field of de novo sequencing; the evaluation of the amino acid reliability and the evaluation of the modification site location are unified, and the evaluation performance of the modification site location is improved.
Drawings
The following drawings illustrate and describe the invention by way of example only and do not limit its scope:
FIG. 1 shows a flow diagram of an amino acid confidence model training method according to one embodiment of the invention;
FIG. 2 shows a schematic diagram of generating a background peptide stretch according to one embodiment of the present invention;
FIG. 3 shows a flow diagram of a method for assessing amino acid reliability according to one embodiment of the invention;
FIG. 4 shows a schematic of amino acid confidence score distribution versus Gamma distribution;
FIG. 5 shows a flow chart of a method of assessing modification site localization according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In brief, the method for assessing the amino acid reliability of the present invention comprises two processes, wherein the first process is to obtain a model for assessing the amino acid reliability by training using a machine learning method, and the second process is to obtain the reliability of the amino acid to be assessed using the trained model. FIG. 1 shows a flow diagram of a method for training an amino acid reliability assessment model according to one embodiment of the invention.
Step S110, selecting training sample
In this step, training samples, including positive and negative samples, will be selected for machine learning.
In one example, the process of selecting training samples includes:
step 111: and searching a data set corresponding to the biological sample used for obtaining the training sample by using the database, and taking the result of which the wig occurrence rate FDR is less than or equal to 1% as an annotation set.
The data set is a data set obtained by putting a biological sample (containing a lot of peptide fragments, and the information of the peptide fragments needs to be analyzed) into a mass spectrometer, wherein the data set usually contains tens of thousands of spectrograms, each spectrogram corresponds to a peptide fragment sequence, and a computer is needed to analyze the peptide fragment sequence of each spectrogram.
Searching the data set generated by the mass spectrometer using a database (i.e., a known gene library) refers to matching and scoring the data set in the database to find the peptide fragment sequence corresponding to each spectrogram in the data set and having the best score in the database.
In order to ensure the accuracy of the obtained training samples, the search results with the false discovery rate FDR less than or equal to 1% are selected as the labeling set.
Step 112: sequencing the same data set with de novo sequencing software, which derives the peptide fragment sequence directly from the spectrogram information without database information.
Step 113: for each amino acid obtained by de novo sequencing, if it is consistent with the type of the corresponding amino acid in the labeling set, it is regarded as a positive sample; otherwise, it is regarded as a negative sample.
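The labeling rule of step 113 can be sketched as follows. This is only an illustration under stated assumptions: the function name `label_amino_acids` is hypothetical, and the sketch assumes that the de novo sequence and the labeling-set sequence are compared position by position, which the text does not spell out.

```python
from typing import List


def label_amino_acids(de_novo: str, annotated: str) -> List[bool]:
    """Label each de novo amino acid: True = positive sample, False = negative.

    Assumption: amino acids are compared at the same position; positions
    beyond the end of the annotated sequence count as negative samples.
    """
    labels = []
    for pos, residue in enumerate(de_novo):
        matches = pos < len(annotated) and residue == annotated[pos]
        labels.append(matches)
    return labels


# De novo result AQPSK against a database annotation QAPSK:
# the first two amino acids become negative samples, the rest positive.
print(label_amino_acids("AQPSK", "QAPSK"))
```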
Through this step S110, a positive sample peptide fragment sequence and a negative sample peptide fragment sequence for training can be obtained, hereinafter, the peptide fragment sequence for training is referred to as a training peptide fragment or a training peptide fragment sequence, and the amino acids included in the training peptide fragment sequence are referred to as amino acids to be trained.
Step S120, generating the background peptide fragments of the amino acid to be trained
For a training peptide fragment $a_1 a_2 \ldots a_l$, let the amino acid to be trained be $a_i$, with $i$ ranging from 1 to $l$, where $l$ is the length of the peptide fragment and generally ranges from 6 to 30. According to one embodiment of the present invention, the step of generating the background peptide fragments comprises:
step 121: for the amino acid $a_i$ to be trained, enumerating all subsequences of length $k$ that contain it, where $k$ generally takes a value between 2 and 5; if $k$ is too large, the algorithm becomes slow;
step 122: assuming that $k$ is chosen to be 3, there are three subsequence forms, namely $a_{i-2}a_{i-1}a_i$, $a_{i-1}a_i a_{i+1}$, and $a_i a_{i+1} a_{i+2}$;
step 123: enumerating all amino acid permutations whose masses are equal to the masses of these three subsequences, which yields three sets: $S_1$, $S_2$, and $S_3$;
step 124: splicing the three sets with the remaining sequence of the training peptide fragment to obtain the background peptide fragments $a_1 \ldots a_{i-3} S_1 a_{i+1} \ldots a_l$, $a_1 \ldots a_{i-2} S_2 a_{i+2} \ldots a_l$, and $a_1 \ldots a_{i-1} S_3 a_{i+3} \ldots a_l$. Each of $S_1$, $S_2$, and $S_3$ is thus spliced into a background peptide fragment that is itself a set containing a very large number of sequences; for example, $a_1 \ldots a_{i-3} S_1 a_{i+1} \ldots a_l$ is such a set.
In this step, background peptide fragments are generated so that the correctness of an amino acid can be judged by comparing how well the training peptide fragment and its background peptide fragments match the spectrum. FIG. 2 shows the process schematically; in the spectrum of FIG. 2, the abscissa m/z is the mass-to-charge ratio (mass divided by charge) and the ordinate is the spectral peak intensity. Suppose the correct peptide fragment sequence (i.e., the training peptide fragment sequence) is AQPSK and the correctness of the first amino acid A is to be determined. All amino acid permutations whose mass equals that of AQPS are enumerated, for example QAPS, APSQ, APQS, …, TQPG. Splicing each of these permutations with the remaining sequence of the training peptide fragment yields the background peptide fragments QAPSK, APSQK, APQSK, …, TQPGK. In the corresponding spectrum of FIG. 2, the training peptide fragment AQPSK matches 7 spectral peaks (y4, y3, y2, y1, b2, b3, and b4), giving a score of 7, whereas the background peptide fragment QAPSK matches 6 peaks (y3, y2, y1, b2, b3, and b4) and the background peptide fragment TQPGK matches 2 peaks (y1 and b4). Both background peptide fragments match fewer peaks than AQPSK, so the reliability of amino acid A in the training peptide fragment is high.
It should be understood that fig. 2 only schematically illustrates the process and meaning of generating the background peptide fragment, and the present invention is to evaluate the correctness of the amino acid to be trained by a machine learning method using a plurality of features of the extracted background peptide fragment according to the following detailed description.
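The enumeration and splicing of steps 121 to 124 can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the names `equal_mass_sequences` and `background_peptides`, the 0.01 Da mass tolerance, the restriction to replacement sequences of the same length as the window, and the exclusion of the original peptide from its own background set are all assumptions made for illustration. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
MIN_MASS = min(RESIDUE_MASS.values())
MAX_MASS = max(RESIDUE_MASS.values())


def equal_mass_sequences(sub, tol=0.01):
    """Enumerate same-length sequences whose mass equals that of `sub`."""
    target = sum(RESIDUE_MASS[a] for a in sub)
    k = len(sub)
    found = []

    def extend(prefix, mass):
        remaining = k - len(prefix)
        if remaining == 0:
            if abs(mass - target) <= tol and prefix != sub:
                found.append(prefix)
            return
        # Prune branches that can no longer reach the target mass.
        if mass + remaining * MIN_MASS > target + tol:
            return
        if mass + remaining * MAX_MASS < target - tol:
            return
        for residue, m in RESIDUE_MASS.items():
            extend(prefix + residue, mass + m)

    extend("", 0.0)
    return found


def background_peptides(peptide, i, k=3, tol=0.01):
    """Background set for the amino acid at 0-based index `i` of `peptide`."""
    background = set()
    first = max(0, i - k + 1)
    last = min(i, len(peptide) - k)
    for start in range(first, last + 1):      # every k-window containing i
        window = peptide[start:start + k]
        for alt in equal_mass_sequences(window, tol):
            background.add(peptide[:start] + alt + peptide[start + k:])
    background.discard(peptide)
    return background


# Example from FIG. 2: first amino acid of AQPSK with the window AQPS (k = 4).
bg = background_peptides("AQPSK", 0, k=4)
print(len(bg), "QAPSK" in bg, "TQPGK" in bg)
```

Because I and L have identical masses, the sketch also produces variants that differ only by an I/L swap; whether these should be kept is a design choice the text does not address.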
Step S130, extracting and selecting the characteristics of the training peptide segment and the amino acid to be trained
The aim of the step is to select the characteristics which can effectively evaluate the credibility of the amino acid to be trained from the training peptide segment and the background peptide segment of the amino acid to be trained.
In one example, the extracted features include at least two of the following: 1) the peptide-spectrum match score of the training peptide fragment; 2) the matched spectral peak intensity proportion of the training peptide fragment; 3) the matched spectral peak number proportion of the training peptide fragment; 4) the difference in peptide-spectrum match score between the training peptide fragment and the best-scoring background peptide fragment; 5) the difference in matched spectral peak intensity proportion between the training peptide fragment and the best-scoring background peptide fragment; 6) the difference in matched spectral peak number proportion between the training peptide fragment and the best-scoring background peptide fragment; 7) amino acid position information (e.g., from 1 to the peptide fragment length l); 8) amino acid class information; 9) peptide fragment length information.
Specifically, the process of extracting the nine-dimensional features includes:
calculating the peptide-spectrum match score $psm_1$, the matched spectral peak intensity proportion $psm_2$, and the matched spectral peak number proportion $psm_3$ of the training peptide fragment $a_1 a_2 \ldots a_l$ as features 1, 2, and 3;
calculating the peptide-spectrum match score, the matched spectral peak intensity proportion, and the matched spectral peak number proportion of all background peptide fragments, finding the background peptide fragment with the highest peptide-spectrum match score, whose three scores are denoted $psm'_1$, $psm'_2$, and $psm'_3$, and calculating the differences between the scores of the training peptide fragment and those of the best background peptide fragment, expressed as $psm_1 - psm'_1$, $psm_2 - psm'_2$, and $psm_3 - psm'_3$, as features 4, 5, and 6;
determining the position information of the amino acid $a_i$ to be trained, its class information (which of the 20 amino acid types it is, each represented by one of the 26 capital letters with B, J, O, U, X, and Z excluded), and the length information of the training peptide fragment (i.e., $l$ for the training peptide fragment $a_1 a_2 \ldots a_l$) as features 7, 8, and 9.
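The nine-dimensional feature vector can be sketched as follows. The patent does not define how the peptide-spectrum match score is computed, so this sketch substitutes a stand-in: the number of singly charged b and y ions matched within a tolerance, as in the FIG. 2 illustration. The names `fragment_mzs`, `psm_metrics`, and `feature_vector`, the 0.02 tolerance, and the encoding of the amino acid class as an alphabet index are assumptions.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.007276, 18.010565
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed class encoding: index in this string


def fragment_mzs(peptide):
    """Singly charged b ions (b1..) followed by y ions (y1..)."""
    masses = [RESIDUE_MASS[a] for a in peptide]
    b_ions, y_ions, prefix, suffix = [], [], 0.0, 0.0
    for m in masses[:-1]:
        prefix += m
        b_ions.append(prefix + PROTON)
    for m in reversed(masses[1:]):
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions + y_ions


def psm_metrics(peptide, peaks, tol=0.02):
    """Return (stand-in score, matched intensity proportion, matched peak proportion)."""
    theo = fragment_mzs(peptide)
    score = sum(1 for t in theo if any(abs(mz - t) <= tol for mz, _ in peaks))
    matched = [(mz, inten) for mz, inten in peaks
               if any(abs(mz - t) <= tol for t in theo)]
    total = sum(inten for _, inten in peaks)
    intensity_ratio = sum(inten for _, inten in matched) / total if total else 0.0
    count_ratio = len(matched) / len(peaks) if peaks else 0.0
    return score, intensity_ratio, count_ratio


def feature_vector(peptide, i, peaks, background):
    """Features 1-9 for the amino acid at 0-based index `i`."""
    own = psm_metrics(peptide, peaks)
    # Best background peptide = highest stand-in score; zeros if the set is empty.
    best = max((psm_metrics(bg, peaks) for bg in background),
               key=lambda m: m[0], default=(0, 0.0, 0.0))
    diffs = [o - b for o, b in zip(own, best)]
    return list(own) + diffs + [i + 1, AMINO_ACIDS.index(peptide[i]), len(peptide)]


# Synthetic spectrum: all b/y ions of AQPSK except b1, each with intensity 1.0.
spectrum = [(mz, 1.0) for mz in fragment_mzs("AQPSK")[1:]]
print(feature_vector("AQPSK", 0, spectrum, ["QAPSK", "TQPGK"]))
```

On this synthetic spectrum the stand-in scores of AQPSK, QAPSK, and TQPGK are 7, 6, and 2, which agrees with the matched-peak counts described for FIG. 2.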
Step S140, training the classification model to obtain an amino acid reliability evaluation model
A classifier is trained by machine learning on the positive and negative samples obtained above, yielding a trained classification model, i.e., the amino acid reliability evaluation model. Positive samples represent correct amino acids and negative samples represent wrong amino acids. The training input is the nine-dimensional feature vector of each amino acid in the positive and negative samples, extracted by the process of step S130, and the training output is a score indicating whether the amino acid is correct or wrong.
In this step, the classifier may be a support vector machine (SVM) or another type, such as a decision tree, a random forest (RF), or a Bayesian network. In one embodiment, when an SVM is trained, its radial basis function kernel is used; other kernel functions, or even linear classification without a kernel function, can also be used.
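As an illustration of step S140, the following sketch trains an RBF-kernel SVM with scikit-learn. The choice of scikit-learn, the feature standardization step, the function name `train_reliability_model`, and the toy feature values are assumptions of this sketch; the patent does not name a library or give hyperparameters.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_reliability_model(features, labels):
    """Fit an RBF-kernel SVM on nine-dimensional feature vectors.

    Standardizing the features first is an assumption of this sketch; it
    keeps large-valued features (e.g. peptide length) from dominating.
    """
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(features, labels)
    return model


# Toy data (made up for illustration): label 1 = correct amino acid, 0 = wrong.
X = [
    [7, 1.0, 1.0, 1, 0.14, 0.14, 1, 0, 5],
    [9, 0.9, 0.8, 3, 0.30, 0.25, 4, 15, 8],
    [6, 0.8, 0.9, 2, 0.20, 0.20, 2, 13, 6],
    [3, 0.4, 0.3, -3, -0.35, -0.40, 1, 5, 5],
    [2, 0.3, 0.4, -4, -0.40, -0.30, 3, 0, 7],
    [4, 0.5, 0.4, -2, -0.25, -0.30, 5, 16, 9],
]
y = [1, 1, 1, 0, 0, 0]

model = train_reliability_model(X, y)
# The signed distance to the decision boundary can serve as the reliability score.
scores = model.decision_function(X)
print(scores)
```

Using `decision_function` rather than `predict` gives a continuous score per amino acid, which is what a later score-distribution fit would need.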
The confidence level of any amino acid to be tested can be evaluated by using the trained classification model, and the flow chart of the method for evaluating the amino acid confidence level shown in figure 3 is referred. The embodiment is introduced by taking an SVM classification model as an example, and specifically includes:
Step S310, preprocessing the spectrogram of the original peptide fragment to be evaluated.
The purpose of this step is to remove, before de novo sequencing, the large number of isotope peaks and noise peaks from the spectrogram corresponding to the original peptide fragment that contains the amino acid to be evaluated, so that they do not interfere with the de novo sequencing algorithm; for example, peaks near the parent ion are removed, as are its neutral-loss peaks, such as those from the loss of water and ammonia molecules.
In one example, the process of preprocessing the spectrogram comprises:
enumerating charges in the spectrogram corresponding to the original peptide fragment, and searching for all isotope peak clusters according to the mass difference between every two spectral peaks; determining the charge from the mass difference between two adjacent peaks in an isotope peak cluster, where a mass difference of about 1/n corresponds to a charge of +n; converting the monoisotopic peak to its singly charged mass according to the charge, and removing the other isotope peaks; and removing the parent ion peak and the parent ion water-loss and ammonia-loss peaks from the spectrogram.
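The preprocessing described above can be sketched as follows. This is a simplified illustration, not the patented procedure: the function names, the 0.01 tolerance, the maximum charge of 3, and the greedy lowest-m/z-first cluster search are assumptions. The isotope spacing of about 1.00335 Da (divided by the charge) and the neutral-loss masses of water and ammonia are standard values.

```python
ISOTOPE_SPACING = 1.00335   # approximate 13C - 12C mass difference (Da)
PROTON = 1.007276
WATER_LOSS, AMMONIA_LOSS = 18.010565, 17.026549


def _find_peak(peaks, target, tol, used):
    """Index of the first unused peak within `tol` of `target`, else None."""
    for j, (mz, _) in enumerate(peaks):
        if not used[j] and abs(mz - target) <= tol:
            return j
    return None


def deisotope(peaks, max_charge=3, tol=0.01):
    """Keep one singly charged monoisotopic peak per isotope cluster."""
    peaks = sorted(peaks)
    used = [False] * len(peaks)
    result = []
    for i, (mz, intensity) in enumerate(peaks):
        if used[i]:
            continue
        used[i] = True
        charge = 1
        # Try higher charges first: a +2 cluster also contains peaks 1 Da apart.
        for n in range(max_charge, 0, -1):
            j = _find_peak(peaks, mz + ISOTOPE_SPACING / n, tol, used)
            if j is None:
                continue
            charge, step = n, 1
            while j is not None:              # mark the rest of the cluster
                used[j] = True
                step += 1
                j = _find_peak(peaks, mz + step * ISOTOPE_SPACING / n, tol, used)
            break
        # Convert the monoisotopic peak to its singly charged m/z.
        result.append((mz * charge - (charge - 1) * PROTON, intensity))
    return result


def remove_precursor_peaks(peaks, precursor_mz, charge, tol=0.01):
    """Drop the parent ion peak and its water-loss and ammonia-loss peaks."""
    targets = [precursor_mz,
               precursor_mz - WATER_LOSS / charge,
               precursor_mz - AMMONIA_LOSS / charge]
    return [(mz, inten) for mz, inten in peaks
            if all(abs(mz - t) > tol for t in targets)]


raw = [(300.0, 50), (301.0034, 10), (500.0, 100), (500.5017, 60),
       (501.0033, 20), (700.0, 5)]
print(deisotope(raw))
```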
Step S320, generating the background peptide fragments of the amino acid to be evaluated.
Using a process similar to step S120, background peptide fragments are generated for the amino acid $a_i$ to be evaluated in the original peptide fragment sequence $a_1 a_2 \ldots a_l$ under test.
Step S330, extracting and selecting the characteristics of the original peptide segment and the amino acid to be evaluated
The features of the amino acid to be evaluated are extracted and selected using a process similar to step S130. As before, the extracted features include at least two of the following: 1) the peptide-spectrum match score of the original peptide fragment; 2) the matched spectral peak intensity proportion of the original peptide fragment; 3) the matched spectral peak number proportion of the original peptide fragment; 4) the difference in peptide-spectrum match score between the original peptide fragment and the best-scoring background peptide fragment; 5) the difference in matched spectral peak intensity proportion between the original peptide fragment and the best-scoring background peptide fragment; 6) the difference in matched spectral peak number proportion between the original peptide fragment and the best-scoring background peptide fragment; 7) amino acid position information (e.g., from 1 to the peptide fragment length l); 8) amino acid class information; 9) peptide fragment length information.
Step S340, obtaining the reliability score distribution of the amino acids to be evaluated by using the trained amino acid reliability evaluation model.
In this step, the extracted features of the original peptide fragment and of the amino acids to be evaluated are input into the trained amino acid reliability evaluation model; all amino acids to be evaluated are scored with the trained SVM model, and the scores are plotted as a score distribution. For convenience of subsequent description, the amino acid reliability score is named SVM-Score.
The reliability score obtained in this step can be used to identify whether an amino acid to be evaluated is correct; for example, if the score is higher than a predetermined threshold, the reliability of the amino acid is considered high.
Step S350, using Gamma to fit the amino acid confidence score distribution.
The purpose of this step is to process the obtained score distribution further so that the reliability of the amino acids can be estimated more accurately.
Since the SVM-Score distribution is similar to the Gamma distribution, in this example the Gamma distribution is used to fit the amino acid reliability score distribution. For example, the fitting may be done using the EM (expectation maximization) method in combination with the Gamma distribution. Since the scores of the amino acids to be evaluated necessarily follow two distributions (there are two classes: correct amino acids and incorrect amino acids), two Gamma distributions, $\Gamma(X \mid \alpha_w, \beta_w)$ and $\Gamma(X \mid \alpha_r, \beta_r)$, are used to fit the SVM-Score of incorrect and correct amino acids, respectively, where $X$ represents the SVM-Score, $\alpha_w, \beta_w$ are the Gamma parameters of the distribution of incorrect results, and $\alpha_r, \beta_r$ are the Gamma parameters of the distribution of correct results.
FIG. 4 shows a schematic of SVM-Score distribution versus Gamma distribution, with the abscissa representing the Score value and the ordinate representing the proportion of amino acids corresponding to the Score (percent), where "real data" represents the actual scored distribution including all correct and incorrect amino acids, "real incorrect" and "real correct" represent the actual scored distribution of incorrect and correct amino acids, respectively, and "affected incorrect" and "affected correct" represent the estimated Gamma distribution of incorrect and correct amino acids, respectively, using the EM algorithm.
Step S360, calculating the false amino-acid rate FAR of the amino acid to be evaluated:
$$\mathrm{FAR} = \frac{p_w\,\Gamma(X \mid \alpha_w, \beta_w)}{p_w\,\Gamma(X \mid \alpha_w, \beta_w) + p_r\,\Gamma(X \mid \alpha_r, \beta_r)}$$
wherein $p_w$ represents the prior probability of a wrong amino acid, $p_w \times \Gamma(X \mid \alpha_w, \beta_w)$ represents the number of wrong amino acids exceeding the threshold, $p_r$ represents the prior probability of a correct amino acid, and $p_r \times \Gamma(X \mid \alpha_r, \beta_r)$ represents the number of correct amino acids exceeding the threshold. The False Amino-acid Rate (FAR) is proposed for the first time by the present invention; at a given score threshold, it is the number of wrong amino acids exceeding the threshold divided by the total number of amino acids exceeding the threshold. The FAR can be used for quality control in the field of de novo sequencing.
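The FAR definition above (wrong amino acids above the threshold divided by all amino acids above the threshold) can be evaluated numerically as in the following sketch. It uses only the Python standard library and computes the Gamma tail area from the power series of the regularized lower incomplete gamma function. The function names, the shape/scale parameterization of the Gamma distribution, and the example parameter values are assumptions; the series is adequate for moderate values of threshold/scale but is not tuned for extreme arguments.

```python
import math


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (shape + n)
        total += term
        n += 1
    return total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))


def false_amino_acid_rate(threshold, p_wrong, wrong_params, p_right, right_params):
    """FAR at `threshold`; *_params are (shape, scale) of the fitted Gammas."""
    tail_wrong = 1.0 - gamma_cdf(threshold, *wrong_params)   # area above threshold
    tail_right = 1.0 - gamma_cdf(threshold, *right_params)
    wrong = p_wrong * tail_wrong
    right = p_right * tail_right
    return wrong / (wrong + right) if wrong + right > 0 else 0.0


# Made-up parameters: wrong scores ~ Gamma(1, 1), correct scores ~ Gamma(1, 2).
print(false_amino_acid_rate(2.0, 0.5, (1.0, 1.0), 0.5, (1.0, 2.0)))
```

With shape 1 the Gamma distribution reduces to an exponential distribution, so the tail areas in this example are exp(-2) and exp(-1) and the result can be checked by hand as 1 / (1 + e).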
According to another aspect of the present invention, the trained classification model can be used to evaluate the location of modification sites, as shown in FIG. 5, which includes the following steps:
Step S510, enumerating, for a given peptide fragment sequence, the candidate modification sites at which phosphorylation modification can occur.
For example, for a given peptide stretch sequence WQSHTPPYAEK, a phosphorylation modification has occurred on that sequence, assuming that the phosphorylation modification can occur at S, T, Y three amino acids. The specific process for locating these three candidate modification sites is:
All forms of WQSHTPPYAEK with a phosphorylation modification at one candidate site are enumerated: WQpSHTPPYAEK, WQSHpTPPYAEK, and WQSHTPPpYAEK, where "pS", "pT", and "pY" denote phosphorylation of amino acids S, T, and Y, respectively. A phosphorylated site can be written uniformly as "pX", denoting phosphorylation of amino acid X; in this embodiment, X can be any one of S, T, and Y, and pX can be regarded as a new amino acid.
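The enumeration of candidate sites can be sketched as follows. The function name `enumerate_phospho_candidates` and the returned (position, sequence) pairs are illustrative assumptions; the "p" prefix follows the pS/pT/pY notation used above, and only singly phosphorylated forms are generated, as in the example.

```python
def enumerate_phospho_candidates(peptide, sites="STY"):
    """Return (1-based position, modified sequence) for each candidate site."""
    candidates = []
    for pos, residue in enumerate(peptide):
        if residue in sites:
            # Insert "p" before the residue, e.g. S -> pS.
            candidates.append((pos + 1, peptide[:pos] + "p" + peptide[pos:]))
    return candidates


for position, sequence in enumerate_phospho_candidates("WQSHTPPYAEK"):
    print(position, sequence)
```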
Step S520, obtaining the reliability score of phosphorylation modification at each candidate site by using the trained amino acid reliability evaluation model.
The reliability score of the new amino acid pX is calculated by the amino acid reliability evaluation method provided by the invention. In this embodiment there are three candidate modification sites, S, T, and Y, whose scores are denoted $s_1$, $s_2$, and $s_3$.
Step S530, calculating the probability of phosphorylation modification at each candidate modification site.
The probability of phosphorylation at each candidate modification site is calculated. In this example, Bayes' formula is used to calculate the probability that phosphorylation occurred at the candidate site, i.e.:
Figure BDA0001423772310000091
wherein $p_i$ denotes the prior probability of the i-th modification site, $s_i$ denotes the reliability score of candidate phosphorylation site $i$ obtained by the method of the present invention, and $t_i$ indicates whether the i-th site is phosphorylated: $t_i = 1$ indicates phosphorylation and $t_i = 0$ indicates no phosphorylation.
In conclusion, the invention unifies two problems of amino acid reliability evaluation and modification site location, and considers the modified amino acid as a new amino acid, so that the method for evaluating the amino acid reliability can also be applied to the evaluation of the modification site location.
The methods of the present invention may be implemented in software, hardware, or a combination of software and hardware. To further validate the effect of the invention, the inventors implemented the method as software and compared it with PEAKS and Novor, currently the only two software tools that support amino acid reliability evaluation. On three real data sets, the method of the invention performed considerably better than both: for example, with the FAR controlled at 5%, it identified 124.8% more amino acids than PEAKS, the better performing of the two. In modification site localization on three phosphorylation-enriched data sets, the method also outperformed the current software Ascore and phosphoRS: for example, with the FAR controlled at 1%, it identified 67.5% more phosphorylation sites than Ascore and 65.6% more than phosphoRS, while covering 98% of the results of Ascore and phosphoRS; 21% of its results were identified by the method of the invention alone.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. An amino acid credibility assessment model training method comprises the following steps:
step 1: generating a background peptide fragment set of the amino acid to be trained according to a training peptide fragment containing the amino acid to be trained;
step 2: extracting a plurality of features from the training peptide fragment, the amino acid to be trained and the background peptide fragment set of the amino acid to be trained;
step 3: training a classification model by taking the extracted features as input vectors and whether the amino acid to be trained is correct as output, so as to obtain an amino acid reliability evaluation model.
2. The method for training an amino acid reliability assessment model according to claim 1, wherein the step 1 comprises:
enumerating a subsequence of predetermined length for the amino acids to be trained, wherein the subsequence comprises the amino acids to be trained and other amino acids in the training peptide segment;
enumerating a set of amino acids from the training peptide stretch having a mass equal to the mass of the subsequence;
and splicing the amino acid set with the rest sequences in the training peptide segment to obtain the background peptide segment set of the amino acid to be trained.
3. The method for training an amino acid reliability assessment model according to claim 1, wherein the step 2 comprises:
calculating the peptide-spectrum matching score psm_1, the spectral peak intensity matching proportion psm_2, and the matched spectral peak number proportion psm_3 of the training peptide segment as the first feature, the second feature and the third feature, respectively;
calculating the peptide-spectrum matching score psm'_1, the spectral peak intensity matching proportion psm'_2, and the matched spectral peak number proportion psm'_3 of the best background peptide segment in the background peptide segment set of the amino acid to be trained, and calculating the differences between the scores of the training peptide segment and those of the best background peptide segment, expressed as psm_1 - psm'_1, psm_2 - psm'_2 and psm_3 - psm'_3, as the fourth feature, the fifth feature and the sixth feature, respectively, wherein the best background peptide segment is the background peptide segment with the highest peptide-spectrum matching score in the background peptide segment set of the amino acid to be trained;
and calculating the position information and the class information of the amino acid to be trained, and the length information of the training peptide segment, as the seventh feature, the eighth feature and the ninth feature, respectively.
4. The method for training an amino acid reliability assessment model according to any one of claims 1 to 3, wherein in step 3, the classification model comprises any one of a support vector machine, a decision tree, a random forest and a Bayesian network.
5. A method for assessing the reliability of amino acids, comprising:
step 51: generating a background peptide fragment set of the amino acid to be evaluated according to an original peptide fragment containing the amino acid to be evaluated;
step 52: extracting a plurality of features from the original peptide fragment and the amino acid to be evaluated;
step 53: inputting the extracted features into an amino acid credibility assessment model obtained by the amino acid credibility assessment model training method of any one of claims 1 to 4 to obtain credibility scoring distribution of the amino acid to be assessed.
6. The method for assessing amino acid reliability according to claim 5, further comprising:
fitting the credibility scoring distribution of the amino acid to be evaluated into Gamma distribution;
calculating the false occurrence rate of the amino acid to be evaluated based on the Gamma distribution:
Figure FDA0002270643520000021
wherein FAR denotes the false discovery rate of the amino acid to be evaluated; p_w and p_r denote the prior probabilities of incorrect and correct amino acids, respectively; Γ(X | α_w, β_w) denotes the area of the score distribution of incorrect amino acids above the scoring threshold X; Γ(X | α_r, β_r) denotes the area of the score distribution of correct amino acids above the scoring threshold X; X denotes the score of the amino acid to be evaluated; α_w, β_w denote the Gamma parameters of the score distribution of incorrect amino acids; and α_r, β_r denote the Gamma parameters of the score distribution of correct amino acids.
7. A method of assessing modification site localization, comprising:
enumerating, for a given peptide sequence, the candidate modification sites at which phosphorylation modification can occur;
obtaining, by the method for assessing amino acid reliability according to claim 5, a reliability score for phosphorylation modification at each candidate site.
8. The method of assessing modification site localization of claim 7, further comprising calculating the probability of each candidate modification site for phosphorylation modification using the following formula:
Figure FDA0002270643520000022
wherein p_i denotes the prior probability of the i-th modification site, s_i denotes the confidence score of candidate phosphorylation site i, and t_i indicates whether candidate phosphorylation site i is phosphorylated: t_i = 1 indicates phosphorylation, and t_i = 0 indicates no phosphorylation.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the steps of the method of any of claims 1 to 8 are implemented when the processor executes the program.
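By way of non-limiting illustration, the three steps of claim 2 (enumerating subsequences that contain the amino acid, enumerating amino acid sets of equal mass, and splicing them with the rest of the peptide) could be sketched in Python as follows. The sketch assumes monoisotopic residue masses, a mass tolerance of 0.01 Da, and that a replacement may differ in length from the subsequence it replaces; none of these choices, nor the function names, come from the claims.

```python
from typing import Dict, List, Set

# Monoisotopic residue masses in Da (I and L are isobaric).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
MIN_MASS = min(RESIDUE_MASS.values())


def sequence_mass(seq: str) -> float:
    return sum(RESIDUE_MASS[aa] for aa in seq)


def isobaric_sequences(target_mass: float, tol: float = 0.01) -> List[str]:
    """Enumerate residue sequences whose total mass is within tol of target."""
    results: List[str] = []

    def extend(prefix: str, remaining: float) -> None:
        if abs(remaining) <= tol:
            results.append(prefix)
            return
        if remaining < MIN_MASS - tol:
            return  # no residue is light enough to fit
        for aa, mass in RESIDUE_MASS.items():
            if mass <= remaining + tol:
                extend(prefix + aa, remaining - mass)

    extend("", target_mass)
    return results


def background_peptides(peptide: str, index: int, window: int = 2,
                        tol: float = 0.01) -> Set[str]:
    """Background peptide set for the amino acid at peptide[index]."""
    background: Set[str] = set()
    first = max(0, index - window + 1)
    last = min(index, len(peptide) - window)
    for start in range(first, last + 1):  # every window covering index
        end = start + window
        sub = peptide[start:end]
        for alt in isobaric_sequences(sequence_mass(sub), tol):
            if alt != sub:
                # Splice the equal-mass replacement into the rest of the peptide.
                background.add(peptide[:start] + alt + peptide[end:])
    return background
```

For example, with a window of one residue, asparagine (N) in the peptide PNK can only be replaced by two glycines, which have the same mass within the assumed tolerance, so the background set is {"PGGK"}.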
CN201710904787.8A 2017-09-29 2017-09-29 Evaluation method for amino acid reliability and modification site positioning Active CN107622184B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710904787.8A CN107622184B (en) 2017-09-29 2017-09-29 Evaluation method for amino acid reliability and modification site positioning

Publications (2)

Publication Number Publication Date
CN107622184A (en) 2018-01-23
CN107622184B (en) 2020-01-21

Family

ID=61091431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710904787.8A Active CN107622184B (en) 2017-09-29 2017-09-29 Evaluation method for amino acid reliability and modification site positioning

Country Status (1)

Country Link
CN (1) CN107622184B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464804B (en) * 2020-11-26 2022-05-24 北京航空航天大学 Peptide fragment signal matching method based on neural network framework

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1586641A3 (en) * 1991-04-05 2008-04-09 The General Hospital Corporation Compounds for the treatment of hypercalcemia and hypocalcemia
CN104134015A (en) * 2014-07-25 2014-11-05 中国科学院计算技术研究所 Protein post-translational modification positioning method and protein post-translational modification positioning system
CN104182658A (en) * 2014-08-06 2014-12-03 中国科学院计算技术研究所 Tandem mass spectrogram identification method
CN104215729A (en) * 2014-08-18 2014-12-17 中国科学院计算技术研究所 Tandem-mass-spectrometry data parent-ion detection model training method and parent-ion detection method
CN106770605A (en) * 2016-11-14 2017-05-31 中国科学院计算技术研究所 De novo sequencing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
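Protein tandem mass spectrum identification algorithm based on target-decoy database feature information matching (基于正反库特征信息匹配的蛋白质二级质谱鉴定算法); Li Huamei (李华梅); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 2015-12-15 (No. 12); abstract and Sections 3.7-3.8 of the main text *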

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Yang Hao

Inventor after: He Simin

Inventor after: Chi Hao

Inventor after: Zeng Wenfeng

Inventor after: Zhou Wenjing

Inventor after: Wang Zhaowei

Inventor after: Wang Ruimin

Inventor after: Niu Xiunan

Inventor after: Chen Zhenlin

Inventor after: Liu Chao

Inventor before: Yang Hao

Inventor before: He Simin

Inventor before: Chi Hao

Inventor before: Zeng Wenfeng

Inventor before: Zhou Wenjing

Inventor before: Wang Zhaowei

Inventor before: Wang Ruimin

Inventor before: Niu Xiunan

Inventor before: Chen Zhenlin

Inventor before: Liu Chao