WO2020260419A1 - Protein probability model - Google Patents

Protein probability model

Info

Publication number
WO2020260419A1
Authority
WO
WIPO (PCT)
Prior art keywords
protein
peptides
decoy
peptide
target
Prior art date
Application number
PCT/EP2020/067751
Other languages
English (en)
Inventor
Gorka PRIETO AGUJETA
Jesús VÁZQUEZ COBOS
Original Assignee
Universidad Del Pais Vasco-Euskal Herriko Unibersitatea
Centro Nacional De Investigaciones Cardiovasculares Carlos Iii (F.S.P.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universidad Del Pais Vasco-Euskal Herriko Unibersitatea, Centro Nacional De Investigaciones Cardiovasculares Carlos Iii (F.S.P.) filed Critical Universidad Del Pais Vasco-Euskal Herriko Unibersitatea

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6848 Methods of protein analysis involving mass spectrometry
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR

Definitions

  • MS/MS spectra are matched against the theoretical spectra of peptides resulting from an in-silico digestion of a protein database. Each of these matches is called a peptide-to-spectrum match (PSM), and a score is assigned to each PSM depending on the similarity between the measured and theoretical spectra.
  • PSM peptide-to-spectrum match
  • the validity of identifications is checked by calculating the false discovery rate (FDR), which is usually set to a maximum of 1%. The same concept can be applied at the peptide level to control for the peptide FDR.
  • FDR false discovery rate
  • the most popular and widely-accepted strategy for computing FDRs in shotgun proteomics is the target-decoy approach.
  • two databases are used: a target database with all the protein sequences that could be present in the sample, and a decoy database with fictitious protein sequences that should not be detected. If the decoy database is correctly constructed (with the same size and statistical distributions as the target database), the probability of obtaining a false PSM in the target and decoy databases is identical.
  • Several strategies have been proposed for the calculation of FDR by the target-decoy approach. The number of above-threshold matches in the decoy database can be directly used to estimate the number of false matches in the target database.
  • spectra yielding high scores in the target database also tend to produce high scores in the decoy database
  • the search is usually performed against a concatenated target + decoy database, so that target and decoy sequences compete for the spectra.
  • different formulae can be used to calculate the FDR at PSM or peptide levels using the competition strategy, depending on the population of matches used to estimate FDR.
  • spectra are searched separately against the two databases, and target and decoy matches compete a posteriori, so that the FDR can be calculated in the original target sequence population.
  • the differences between these methods are small, and the competition target-decoy strategy for computing FDR is currently widely accepted in the proteomics community.
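As a minimal sketch of the competition strategy described above, the following Python function estimates the FDR at a score threshold from a concatenated search as the ratio of above-threshold decoy matches to above-threshold target matches. The function name, the (score, is_decoy) input layout and the example scores are illustrative assumptions, and this is only one of the several formulae the text alludes to.

```python
def target_decoy_fdr(matches, threshold):
    """Estimate FDR as (#decoys >= threshold) / (#targets >= threshold).

    `matches` is a list of (score, is_decoy) pairs, one per spectrum,
    holding the best match from a concatenated target + decoy search.
    Higher scores are assumed to be better (an assumption of this sketch).
    """
    decoys = sum(1 for score, is_decoy in matches if is_decoy and score >= threshold)
    targets = sum(1 for score, is_decoy in matches if not is_decoy and score >= threshold)
    if targets == 0:
        return 0.0  # nothing accepted, so there are no discoveries to be false
    return decoys / targets


# Hypothetical scores, for illustration only.
example = [(9.1, False), (8.7, False), (7.9, False),
           (7.5, True), (6.0, False), (5.2, True)]
fdr = target_decoy_fdr(example, threshold=7.0)  # 1 decoy over 3 targets
```

Raising the threshold until the estimate falls below the chosen limit (1% in the text) gives the accepted set of matches.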
  • the main goal of high-throughput proteomics is to identify not peptides, but the proteins present in a biological sample.
  • the identity of these proteins can be inferred from peptides identified but controlling the FDR at the protein level is not straightforward. Since each protein can be identified by several peptides, true peptide identifications tend to concentrate on the proteins present in the sample, whereas false peptide identifications are randomly distributed among the proteins in the database. Therefore, the ratio between the number of peptides and the number of proteins matched by these peptides is higher for target peptides than for decoy peptides, and consequently the protein-level FDR can be much larger than the peptide-level FDR.
  • Another way to control amplification of the error rate is to compute a protein-level score from the scores of their peptides and estimate the FDR by directly applying the target-decoy approach at the protein level.
  • Several protein-scoring methods have been proposed, and the widespread implementation of FDR quality control has allowed comparison of the different approaches.
  • the approach that currently seems the most efficient, especially for large datasets, is to assign each protein with the score of its best peptide.
  • This is somewhat paradoxical, since from a purely statistical standpoint one would expect to obtain more information and evidence for correct protein identification by considering all peptides mapping to a protein.
  • This peptide-to-protein paradox illustrates the urgent need for the development of a comprehensive protein scoring model that effectively integrates all the existing peptide information.
  • Such a model would facilitate the development of automated workflows with increased quality of protein information and improve annotation in protein databases.
  • Figure 1 Distribution of different decoy protein scores using three tissues from the Human Proteome Map as separate test datasets.
  • the red line represents the uniform distribution. Decoys above this line are overvalued and decoys below the line are undervalued.
  • Figure 3 Competition between target and decoy proteins for FDR estimation using a simulated dataset and LPG scores.
  • the dashed blue lines in (B) indicate the score threshold.
  • the LPG scores of the decoy proteins follow an exponential distribution since they correspond to the logarithm of the probability, which follows a uniform distribution, as previously presented.
  • TP true-positive
  • FP false-positive
  • (A) When no true-positive (TP) target proteins are present, decoy proteins and false-positive (FP) target proteins are distributed symmetrically across the diagonal. The FDRr definition is based on this symmetry.
  • (B) When there are true-positive target proteins, four regions can be defined considering the diagonal and the score threshold.
  • the decoy-only (do) region contains pairs in which the decoy protein has an above-threshold score and the target protein has a below-threshold score.
  • the decoy-best (db) region contains pairs for which both scores are above-threshold but the decoy-protein score is better than the target-protein score.
  • the target-only (to) and target-best (tb) regions are defined in a similar fashion.
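The four regions defined above can be sketched as a small classifier over target-decoy protein pairs. This is an illustration under stated assumptions: the function name is hypothetical, "above-threshold" is taken to include scores equal to the threshold, and ties between the two scores are assigned to the target-best region, none of which the text specifies.

```python
from collections import Counter


def classify_pair(target_score, decoy_score, threshold):
    """Return 'to', 'tb', 'do', 'db', or None for one target-decoy pair."""
    target_above = target_score >= threshold
    decoy_above = decoy_score >= threshold
    if not target_above and not decoy_above:
        return None  # neither protein passes the threshold
    if decoy_above and not target_above:
        return "do"  # decoy-only
    if target_above and not decoy_above:
        return "to"  # target-only
    # Both scores pass the threshold: the better score decides the region.
    return "db" if decoy_score > target_score else "tb"


# Hypothetical (target_score, decoy_score) pairs, for illustration only.
pairs = [(5.0, 1.0), (4.0, 3.5), (1.0, 3.2), (3.1, 4.4), (0.5, 0.7)]
regions = Counter(classify_pair(t, d, threshold=3.0) for t, d in pairs)
```

The resulting counts per region are the quantities a region-based FDR estimate would consume.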
  • Figure 4 Number of identifications using the different protein-level FDR algorithms for three sample tissues in the Human Proteome Map. For this comparison, the protein score has been calculated simply as the score of its best peptide. Note that the axes have different offset values to better highlight differences.
  • Figure 5 Number of identified genes as a function of the FDR threshold for different protein identification workflows, using three tissues from the Human Proteome Map as separate tests.
  • Figure 6 Venn diagrams showing the number of genes identified by different protein identification workflows in three tissues of the Human Proteome Map.
  • Figure 7 Comparison of target versus decoy peptides for each gene identified exclusively by any of the three different protein identification workflows discussed. The total number of peptides is considered, without filtering by FDR. Each point corresponds to a target-decoy pair. The comparison has been carried out in three tissues of the Human Proteome Map.
  • An aspect of the invention refers to a method carried out by a computer for selecting, identifying or classifying proteins present in a sample, preferably a biological sample, which comprises the following steps: i. matching MS/MS spectra obtained from the set of peptides present in the sample against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and
  • LPGF value (coLogarithm of Probability using the Gamma distribution for Filtered peptides), which is the probability of getting a decoy protein with an equal or lower p-value product from m identified peptides that are selected among n matched peptides, according to the following formulae:
  • LPF coLogarithm of Probability product using Filtered peptides
  • n is the total number of peptides matching the protein (considered as identified or not)
  • G(x; k; θ) is the cumulative distribution function (CDF) of the gamma distribution with shape k and scale θ
  • LPGM coLogarithm of Probability using the Gamma distribution for the Maximum peptide score
  • $\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^{n}\right)$, where p is preferably the lowest p-value among the set of p-values of all the peptides that match the protein, and wherein the peptide p-value is calculated from a score computed by matching MS/MS spectra obtained from the set of peptides present in the sample against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, and wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and wherein each of these matches is called a peptide-to-spectrum match (PSM); or wherein the LPGF value is calculated by using the following chi-squared distribution:
  • a further aspect of the invention refers to a method carried out by a computer for selecting, identifying or classifying proteins present in a sample, which comprises the following steps: a. calculating the peptide p-values from the set of peptides identified in the sample, characterized in that:
  • MS/MS spectra obtained from the set of peptides from the sample are matched against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, and wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and wherein each of these matches is called a peptide-to-spectrum match (PSM), and a score is assigned to each PSM depending on the similarity between the MS/MS spectrum and the peptide;
  • step a) iv. calculating the LPGF value from the peptide p-values obtained in step a) (coLogarithm of Probability using the Gamma distribution for Filtered peptides), which is the probability of getting a decoy protein with an equal or lower p-value product, using the p-values as obtained in step a), from m identified peptides that are selected among n matched peptides, according to
  • LPF is the cologarithm of the product of the p-values of the m identified peptides matching the protein, as identified in step a(i); n is the total number of peptides matching the protein (considered as identified or not)
  • G(x; k; θ) is the CDF of the gamma distribution with shape k and scale θ, and LPGM is given by
  • $\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^{n}\right)$, where p is the p-value obtained according to step a) of the peptide that matches the protein with the best score; or wherein the LPGF value is calculated by using the following chi-squared distribution:
  • Chi2(x; k) is the CDF of the chi-squared distribution with k degrees of freedom; wherein the protein probability provided by the LPGF value is used to classify or rank the proteins according to their probability and to optionally select or identify the proteins that are considered truly identified according to a statistical significance threshold.
  • a False Discovery Rate (FDR) value, preferably FDR ≤ 0.01, is used as the threshold for statistical significance of protein identification: the LPGF values of the target and decoy proteins are used to calculate the FDR of each protein, and all the proteins whose FDR is lower than the FDR threshold are considered as identified according to said significance threshold.
  • a further aspect of the invention refers to a data processing apparatus/device/system comprising means for carrying out the steps of any of the methods previously described.
  • a further aspect of the invention refers to a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of any of the methods previously described.
  • a further aspect of the invention refers to a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the steps of any of the methods previously described.
  • protein confidence values are calculated from peptide confidence values and peptide confidence values do not change as peptides are assigned to proteins, the protein confidence values are dependent on the initial confidence values of the peptides. These initial peptide confidence values are based on a model for the relationship between the amount of evidence in data for a peptide and the probability of correctness of a peptide.
  • the present invention first provides an accurate method for calculating peptide confidence values from a set of peptide-spectrum matches (PSM) obtained after searching against a target and a decoy protein database (concatenated or separated), wherein said PSMs are generated in a proteomics experiment in which one or more mass spectrometers that perform a plurality of scans of a sample produce a plurality of spectra and wherein a processor in communication with a target and a decoy protein database identify said set of PSM from the plurality of spectra.
  • PSM peptide-spectrum matches
  • the method comprises calculating the peptide p-values from the set of peptides identified in a sample, by the following steps: i. MS/MS spectra obtained from the set of peptides from the sample are matched against a target and a decoy protein database, separately or in the form of a concatenated target and decoy protein database, and wherein the target database is characterized by having all the protein sequences that could be present in the sample, and the decoy database contains sequences that should not be detected, and wherein each of these matches is called a peptide-to-spectrum match (PSM), and a score is assigned to each PSM depending on the similarity between the MS/MS spectrum and the peptide; ii.
  • a second aspect of the invention refers to a method for determining a protein probability in a sample, preferably from the peptide p-values obtained according to the method of the first aspect of the invention although other peptide p-values obtained according to any further methods known to the skilled person (for a review see (Nesvizhskii, 2010)) are also useful in the context of the second aspect of the invention; and calculating said probability in terms of protein confidence values for each protein that evaluates the likelihood that a candidate protein is correctly identified, wherein the method integrates the p-values of the identified peptides according to the following formula:
  • LPF is the cologarithm of the product of p-values of the m identified peptides matching the protein
  • n is the total number of peptides matching the protein (considered as identified or not)
  • G(x; k; θ) is the CDF of the gamma distribution with shape k and scale θ, and LPGM is given by
  • the previous method for determining a protein probability comprises the following steps:
  • b. carrying out the method for determining a protein probability in a sample by: i. calculating the LPGF (coLogarithm of Probability using the Gamma distribution for Filtered peptides) value from the peptide p-values obtained in step a) or obtained by any other methods, which is the probability of getting a decoy protein with an equal or lower p-value product, using the p-values as obtained in step a), from m identified peptides that are selected among n matched peptides, according to the following formulae:
  • LPF is the cologarithm of the product of the p-values of the m identified peptides matching the protein, as identified in step a(i); n is the total number of peptides matching the protein (considered as identified or not)
  • G(x; k; θ) is the CDF of the gamma distribution with shape k and scale θ, and LPGM is given by
  • $\mathrm{LPGM} = -\log_{10}\left(1 - (1 - p)^{n}\right)$, where p is the p-value obtained according to step a) of the peptide that matches the protein with the best score; or wherein the LPGF value is calculated by using the following chi-squared distribution:
  • Chi2(x; k) is the CDF of the chi-squared distribution with k degrees of freedom.
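The LPGM expression given above, -log10(1 - (1 - p)^n), can be evaluated directly. The sketch below is one way to do so in Python; the function name and the input checks are assumptions of this example, and `expm1`/`log1p` are used only so that very small p-values do not lose precision.

```python
import math


def lpgm(p, n):
    """Cologarithm of the probability that the best of n uniform p-values is <= p.

    p: best (lowest) peptide p-value of the protein, with 0 < p <= 1.
    n: total number of peptides matching the protein, with n >= 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    if p == 1.0:
        return 0.0  # 1 - (1 - 1)^n = 1, whose cologarithm is 0
    # 1 - (1 - p)^n, written as -expm1(n * log1p(-p)) for numerical stability.
    probability = -math.expm1(n * math.log1p(-p))
    return -math.log10(probability)


single = lpgm(0.01, 1)    # one matched peptide: the score is just -log10(p)
several = lpgm(0.01, 20)  # the same best p-value is less convincing among 20 peptides
```

With a single matched peptide the score reduces to the cologarithm of the peptide p-value; as n grows, the same best p-value yields a lower score.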
  • a third aspect of the invention refers to a method of selecting, identifying or classifying a protein from a set of proteins using as an indicator the protein probability obtained according to any of the methods of the second aspect of the invention.
  • a p-value, preferably p ≤ 0.05, is used as the threshold for statistical significance of protein identification, so that all the proteins with LPGF ≤ p are considered as identified according to said significance threshold in the protein probability model.
  • a False Discovery Rate (FDR) value, preferably FDR ≤ 0.01
  • the processor indicated in the first aspect may determine the peptide score values by using a heuristic.
  • a fourth aspect of the invention refers to a computer program or a computer program product, which may comprise or not a tangible computer-readable storage medium, wherein the computer program contents include a program with instructions being executed on a processor so as to perform a method for calculating peptide p-values in proteomic analysis, the method comprising: providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise at least an analysis module which determines peptide p-values for the set of peptides using the analysis module by following the methodology indicated in the first or second aspect of the invention.
  • Non-volatile media includes, for example, optical or magnetic disks.
  • Volatile media includes dynamic memory, such as memory.
  • Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus.
  • Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor for execution.
  • the instructions may initially be carried on the magnetic disk of a remote computer.
  • the remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem.
  • a modem local to computer system can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal.
  • An infra-red detector coupled to bus can receive the data carried in the infra-red signal and place the data on bus.
  • Bus carries the data to memory, from which processor retrieves and executes the instructions.
  • the instructions received by memory may optionally be stored on storage device either before or after execution by processor.
  • instructions configured to be executed by a processor to perform a method are stored on a computer-readable medium.
  • the computer-readable medium can be a device that stores digital information.
  • a computer-readable medium includes a compact disc read-only memory (CD-ROM) as is known in the art for storing software.
  • CD-ROM compact disc read-only memory
  • the computer-readable medium is accessed by a processor suitable for executing instructions configured to be executed.
  • the described implementation includes software but the present teachings may be implemented as a combination of hardware and software or in hardware alone.
  • a fifth aspect of the invention refers to a computer program or a computer program product, which may comprise or not a tangible computer-readable storage medium, wherein the computer program contents include a program with instructions being executed on a processor so as to perform a method for calculating protein probabilities in proteomic analysis, the method comprising: providing a system, wherein the system comprises distinct software modules, and wherein the distinct software modules comprise at least an analysis module, wherein the analysis module calculates a protein probability from the peptide p-values obtained according to the first aspect or from other peptide p-values obtained according to any further methods known to the skilled person, and wherein said protein probability is determined by following the methodology indicated in the second aspect of the invention.
  • the analysis module is used for identifying or classifying a protein from a set of proteins of a sample, preferably of a biological sample, by using as an indicator the protein probability obtained according to the fifth aspect of the invention.
  • the system further comprises a measurement module, and said measurement module obtains a plurality of spectra from one or more mass spectrometers that perform a plurality of scans of a sample using the measurement module; and/or identifies a plurality of peptides from the plurality of spectra using the analysis module; and/or identifies a plurality of proteins from the plurality of peptides using the analysis module.
  • a sixth aspect of the invention refers to a computer implemented system for calculating peptide p-values from a set of peptides obtained after searching against a target and a decoy protein database (concatenated or separated), wherein said peptides are identified in a proteomics experiment in which one or more mass spectrometers that perform a plurality of scans of a sample produce a plurality of spectra and wherein a processor in communication with a protein database and the one or more mass spectrometers identify said set of peptides from the plurality of spectra, and a set of proteins from the plurality of peptides, characterized in that the system comprises: a.
  • Computer system may preferably comprise a communication mechanism for communicating information, and a processor coupled with said communication mechanism for processing information.
  • Computer system may also include a memory, which can be a random access memory (RAM) or other dynamic storage device, coupled to the communication mechanism for determining base calls, and instructions to be executed by processor. Memory may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor.
  • Computer system may further include a read only memory (ROM) or other static storage device coupled to the communication mechanism for storing static information and instructions for processor.
  • ROM read only memory
  • a storage device such as a magnetic disk or optical disk may be provided and coupled to the communication mechanism for storing information and instructions.
  • a seventh aspect of the invention refers to a system for determining a protein probability from the peptide p-values obtained according to the system of the sixth aspect, and calculating said probability for each protein that evaluates the likelihood that a candidate protein is correctly identified, wherein the method integrates the p-values of the identified peptides according to the following formula:
  • LPF is the cologarithm of the product of p-values of the m identified peptides matching the protein
  • n is the total number of peptides matching the protein (considered as identified or not)
  • G(x; k; θ) is the CDF of the gamma distribution with shape k and scale θ
  • Chi2(x; k) is the CDF of the chi-squared distribution with k degrees of freedom.
  • HPM Human Proteome Map
  • the target fasta database has been generated from GENCODE 15 version 25, with the addition of 47 common contaminants from UniProt. To account for PSM ambiguities in leucine vs isoleucine assignments, we replaced all leucines in the database with isoleucines.
  • a separate decoy database was generated with Decoy Pyrat, a software tool that generates decoy sequences with minimal overlap between target and decoy peptides. Decoy Pyrat achieves this by first switching proteolytic cleavage sites with the preceding amino acid, reversing the database, and then shuffling any decoy sequences that become identical to target sequences.
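The first two steps of the decoy construction described above, together with the leucine-to-isoleucine replacement, can be sketched as follows. This is an illustration of the idea only and does not reproduce the Decoy Pyrat tool: the function names are hypothetical, tryptic cleavage after K or R is an assumption, and the final step (shuffling decoy peptides that coincide with target peptides) is omitted.

```python
CLEAVAGE_SITES = "KR"  # tryptic cleavage residues; an assumption of this sketch


def normalize_leucine(sequence):
    """Replace every leucine (L) with isoleucine (I), as done for the target database."""
    return sequence.replace("L", "I")


def make_decoy(sequence, sites=CLEAVAGE_SITES):
    """Swap each cleavage residue with the preceding residue, then reverse."""
    residues = list(sequence)
    for i in range(1, len(residues)):
        if residues[i] in sites:
            # Move the cleavage residue one position towards the N-terminus.
            residues[i - 1], residues[i] = residues[i], residues[i - 1]
    return "".join(reversed(residues))


# Toy sequence, for illustration only.
decoy = make_decoy("MAGKTLR")
```

A full implementation would then digest both databases in silico and reshuffle any decoy peptide that is identical to a target peptide.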
  • LPM stands for LP Maximum and assigns the protein the probability of its best peptide (i.e., the peptide with lowest probability).
  • LPS stands for LP Sum and estimates the protein probability as the product of all peptides of the protein that have been matched in the database search.
  • LPF which stands for LP Filter, is a variant of LPS that only integrates the probability of the identified peptides (i.e., peptides above a predefined peptide-level FDR threshold).
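The three scores just defined can be written compactly in Python. This is a sketch under assumptions: the function names are illustrative, the scores are taken as base-10 cologarithms (consistent with the cologarithm definition of LPF elsewhere in the text), and which peptides count as identified is supplied as a list of booleans rather than derived from a peptide-level FDR calculation.

```python
import math


def lpm(p_values):
    """LP Maximum: cologarithm of the best (lowest) peptide p-value."""
    return -math.log10(min(p_values))


def lps(p_values):
    """LP Sum: cologarithm of the product of all matched peptide p-values."""
    return -sum(math.log10(p) for p in p_values)


def lpf(p_values, identified):
    """LP Filter: like LPS, but using only the peptides flagged as identified."""
    return -sum(math.log10(p) for p, keep in zip(p_values, identified) if keep)


# Hypothetical protein with three matched peptides, two of them identified.
p_values = [1e-4, 1e-2, 0.5]
identified = [True, True, False]
scores = (lpm(p_values), lps(p_values), lpf(p_values, identified))
```

Summing logarithms instead of multiplying p-values avoids underflow when a protein is matched by many peptides.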
  • Eqs. 3-5 are constructed from calibrated (i.e., uniformly distributed) and independent decoy peptide probabilities; however, when applied to decoy proteins, none of these scores followed the expected uniform distribution (Figure 1). This can be explained in the case of LPM, since the probability of the best score is not a true measure of protein probability. However, LPS increases much faster than would be expected for a true probability, and this marked deviation from a uniform distribution is somewhat surprising, since proteomics specialists have often assumed that the product of peptide probabilities is a good estimate of protein probability. With LPF, i.e., when only the decoy peptides above the peptide-FDR threshold are considered in the protein score, the score still grows too fast, although the deviation from the uniform distribution is less pronounced.
  • The negative natural (Napierian) logarithm of the product of n independent and identically distributed uniform [0, 1] random variables follows a gamma(n, 1) distribution. Since peptide p-values are uniformly distributed, we can use the gamma distribution to construct a protein probability as follows:
  • $\mathrm{LPGS} = -\log_{10}\left(1 - G(\mathrm{LPS} \cdot \ln 10;\ n, 1)\right)$ (9)
  • $\mathrm{LPGS} = -\log_{10}\left(1 - \mathrm{Chi2}(2\,\mathrm{LPS} \cdot \ln 10;\ 2n)\right)$, where Chi2(x; k) is the CDF of the chi-squared distribution with k degrees of freedom.
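A minimal sketch of the LPGS calculation follows, using only the Python standard library. It relies on the fact that, for an integer shape n and unit scale, the gamma survival function 1 - G(x; n, 1) has the closed form exp(-x) multiplied by the sum of x^i / i! for i from 0 to n - 1. The function names are assumptions of this example, and the computation is carried out on the logarithm of the survival function so that large LPS values do not underflow.

```python
import math


def log_gamma_sf(x, n):
    """Natural log of 1 - G(x; n, 1) for integer shape n >= 1 and x >= 0."""
    term = 1.0
    total = 1.0
    for i in range(1, n):
        term *= x / i          # term now equals x**i / i!
        total += term
    return -x + math.log(total)


def lpgs(lps_value, n):
    """LPGS = -log10(1 - G(LPS * ln 10; n, 1)) for a protein with n matched peptides."""
    x = lps_value * math.log(10)  # convert the base-10 cologarithm to natural-log units
    return -log_gamma_sf(x, n) / math.log(10)


one_peptide = lpgs(3.0, 1)     # with a single peptide, LPGS equals LPS
three_peptides = lpgs(6.0, 3)  # the same kind of product is penalised for n = 3
```

The chi-squared form gives the same value, because twice a gamma(n, 1) variable follows a chi-squared distribution with 2n degrees of freedom.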
  • LPGS has the expected protein probability properties and accurately predicts the observed probability of matching a decoy protein in all samples across different search engines (Figure 1).
  • A similar deviation, although less pronounced, was also observed in the case of Comet (Figure 1).
  • a close inspection of the three best matching peptides from these decoy proteins revealed the presence of sequences containing high proportions of repeated amino acids such as Gly, Ala, Ser, Pro and Leu; all these sequences, typical of proteins like keratins and collagens, are highly homologous to target proteins and produce non-random PSMs.
  • these deviations are produced by imperfections in the algorithm used to detect overlaps between target and decoy peptides when generating the decoy database, and not by the protein probability algorithm.
  • LPGF accurately predicts the behavior of decoy proteins in all tissues and across all search engines (Figure 1) and thus acts as a true protein probability score. Note that, in contrast with LPGS, LPGF tolerates the imperfections in the decoy database well, showing negligible deviations from the expected trend for top-scoring proteins with MSFragger and Comet.
  • LPS and LPF do not account for protein length, since they only consider the final product of peptide p-values, thus introducing a bias toward larger proteins, which tend to be matched by more decoy peptides.
  • LPGS takes into account the number of peptides used to calculate the product, and LPGF includes not only the number of identified peptides but also the total number of peptides matching the protein. In general, this finding highlights the importance of using true probabilities that accurately reflect decoy protein behavior instead of empirical protein scores.
  • LPGM and LPM were consistently more sensitive than LPGS and LPS, respectively. This finding reflects the detrimental accumulation in LPGS and LPS of false target peptide matches, which only add random noise and contribute nothing to target protein identification. This phenomenon is avoided in LPGM and LPM, increasing their sensitivity despite the inclusion of only the best peptide in the score.
  • the third and most important pattern is that LPGF was consistently more sensitive than LPGM. This contrasts with LPF, which was less sensitive than LPM in most cases.
  • This finding shows that the inclusion of all relevant peptide information in a well-constructed protein probability score is always preferable to the simplification provided by using only the best peptide. More importantly, it resolves the peptide-protein paradox, demonstrating that the inclusion of relevant information produces better results than ignoring it, and establishing the basis for a rational approach to protein identification.
  • true protein identifications can be viewed as an increase in the score of proteins harboring true target peptide matches, producing a horizontal displacement of scores to the right.
  • the resulting distribution of target protein scores can thus be viewed as a superposition of the decoy distribution (containing the false-positive target protein scores) and the distribution of true-positive protein scores.
  • a protein score threshold is applied to select a population of positively identified target proteins.
  • the FDR of this population can be calculated in several ways. In the simplest approach, we use the number of above-threshold decoy proteins (d) to estimate the fraction of false positives in the population of above-threshold target proteins
  • FDR is estimated as:
  • D and T are the sizes of the decoy and target databases. This equation can be derived under the basic assumption that the ratio of decoy matches above (d) to decoy matches below (D - d) the protein score threshold is identical to the ratio of false-positive (FP) target matches above the threshold to target matches below the threshold (T - t, where t is the number of above-threshold target proteins):
  • peptide identifications are filtered according to their peptide-level FDR, and the FDR of the resulting list of proteins is calculated using the conventional approach (FDRn); this method is frequently used by researchers in the field.
  • FDRn: the conventional approach.
  • the proteins are assigned the score of their best peptide, and the picked FDR is used to validate the results.
  • FDRp(LPM) is used in the most recent version (3.0) of Percolator (ref. 12), a popular algorithm included in the Proteome Discoverer package.
  • the FDRp(LPM) workflow increased protein identification performance, presumably because the use of only the best peptide minimizes the error-rate increase for decoy proteins (Table 1 and Figure 5).
  • the algorithm proposed here (LPGF), in combination with the refined FDR, uses the information provided by all the peptides more efficiently, outperforming the other algorithms in all cases (Table 1 and Figure 5). This result was consistently reproduced when the same data were analysed with other search engines.
  • FDRr(LPGF) included most of the identified proteins, missing only a small number of proteins identified by the other workflows (Figure 6). Closer inspection revealed that all proteins identified only by FDRp(LPM) were wonder-hits: proteins identified from only one peptide. Moreover, these proteins identified only by FDRp(LPM) were matched by unexpectedly large numbers of target peptides (more than 50 in most cases). Wonder-hits were a minority of the proteins identified by FDRn(LPF) only; however, these proteins were also matched by an abnormally high number of target peptides.
  • Table 1. Number of identifications provided by the different workflows using three tissues of the Human Proteome Map as separate target datasets.
  • Target database size: 20407 genes.
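
The FDR estimates described in the bullets above can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and the symbol t (the number of above-threshold target proteins) is an assumption, since this excerpt does not define it. The refined estimate is derived only from the stated proportion d / (D - d) = FP / (T - t), which gives FP = d (T - t) / (D - d) and FDR = FP / t.

```python
def conventional_fdr(d: int, t: int) -> float:
    """Simplest estimate: above-threshold decoys over above-threshold targets."""
    if t == 0:
        return 0.0  # nothing was accepted, so nothing can be a false discovery
    return d / t


def refined_fdr(d: int, t: int, D: int, T: int) -> float:
    """Estimate derived from the proportion d / (D - d) = FP / (T - t).

    d, t: decoy and target proteins above the score threshold
    D, T: sizes of the decoy and target databases
    (t is an assumed symbol; the excerpt only names d, D, T and FP)
    """
    if t == 0:
        return 0.0
    if D - d <= 0:
        raise ValueError("at least one decoy must fall below the threshold")
    # Expected false-positive targets above the threshold
    fp = d * (T - t) / (D - d)
    return fp / t


if __name__ == "__main__":
    # Illustrative numbers only; 20407 is the target database size quoted above,
    # and an equal-sized decoy database is an assumption.
    print(conventional_fdr(d=10, t=1000))
    print(refined_fdr(d=10, t=1000, D=20407, T=20407))
```

When the target and decoy databases have the same size and t is at least d, the factor (T - t) / (D - d) is at most 1, so this sketch never returns a value above d / t.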

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hematology (AREA)
  • Immunology (AREA)
  • Chemical & Material Sciences (AREA)
  • Urology & Nephrology (AREA)
  • Food Science & Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Cell Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Microbiology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

In the present invention, we propose a protein probability model derived from analytical considerations that i) integrates the information provided by all identified peptides, ii) accurately predicts the random behavior of decoy proteins, and iii) efficiently resolves the peptide-protein paradox. Our results were validated by analysing the output of three search engines for several tissues of the Human Proteome Map.
PCT/EP2020/067751 2019-06-24 2020-06-24 Protein probability model WO2020260419A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP19382533.8 2019-06-24
EP19382533 2019-06-24

Publications (1)

Publication Number Publication Date
WO2020260419A1 true WO2020260419A1 (fr) 2020-12-30

Family

ID=67139679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/067751 WO2020260419A1 (fr) 2019-06-24 2020-06-24 Protein probability model

Country Status (1)

Country Link
WO (1) WO2020260419A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100179766A1 (en) * 2007-05-31 2010-07-15 The Regents Of The University Of California Method for Identifying Peptides Using Tandem Mass Spectra by Dynamically Determining the Number of Peptide Reconstructions Required
US20120191685A1 (en) * 2009-07-01 2012-07-26 Consejo Superior De Investigaciones Cientificas Method for identifying peptides and proteins from mass spectrometry data
US20130211734A1 (en) * 2010-05-14 2013-08-15 Dh Technologies Development Pte Ltd Systems and Methods for Calculating Protein Confidence Values

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BERN M W ET AL: "Two-Dimensional Target Decoy Strategy for Shotgun Proteomics", JOURNAL OF PROTEOME RESEARCH, vol. 10, no. 12, 2 December 2011 (2011-12-02), pages 5296 - 5301, XP055652882, ISSN: 1535-3893, DOI: 10.1021/pr200780j *
JEONG K ET AL: "False discovery rates in spectral identification", BMC BIOINFORMATICS, BIOMED CENTRAL, LONDON, GB, vol. 13, no. Suppl 16, 5 November 2012 (2012-11-05), pages S2, XP021119902, ISSN: 1471-2105, DOI: 10.1186/1471-2105-13-S16-S2 *
KÄLL L ET AL: "Assigning Significance to Peptides Identified by Tandem Mass Spectrometry Using Decoy Databases", JOURNAL OF PROTEOME RESEARCH, vol. 7, no. 1, 1 January 2008 (2008-01-01), pages 29 - 34, XP055652905, ISSN: 1535-3893, DOI: 10.1021/pr700600n *
KELLER A ET AL: "Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search", ANALYTICAL CHEMISTRY, vol. 74, no. 20, 12 September 2002 (2002-09-12), US, pages 5383 - 5392, XP055264381, ISSN: 0003-2700, DOI: 10.1021/ac025747h *

Similar Documents

Publication Publication Date Title
US10991453B2 (en) Alignment of nucleic acid sequences containing homopolymers based on signal values measured for nucleotide incorporations
Jones et al. Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines
Szklarczyk et al. Tracking repeats using significance and transitivity.
Fermin et al. LuciPHOr: algorithm for phosphorylation site localization with false localization rate estimation using modified target-decoy approach
US9354236B2 (en) Method for identifying peptides and proteins from mass spectrometry data
CN105956416B (zh) Method for rapid automated analysis of prokaryotic proteogenomics data
Ivanov et al. Empirical multidimensional space for scoring peptide spectrum matches in shotgun proteomics
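CN106033502B (zh) Method and device for identifying viruses
CN104034792A (zh) Protein tandem mass spectrum identification method based on mass-to-charge ratio error discrimination ability
EP1695255B1 (fr) Methods and systems for the detection of proteins and peptides
KR101936933B1 (ko) Method for detecting nucleotide sequence variation and device for detecting nucleotide sequence variation using the same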
Samaras et al. Reanalysis of ProteomicsDB using an accurate, sensitive, and scalable false discovery rate estimation approach for protein groups
CN113096737B (zh) Method and system for automated analysis of pathogen types
EP3896697A1 (fr) Method and device for identifying MHC class I-presented peptides from fragment-ion mass spectra
Jian et al. A novel algorithm for validating peptide identification from a shotgun proteomics search engine
WO2020260419A1 (fr) Protein probability model
Prieto et al. Protein probability model for high-throughput protein identification by mass spectrometry-based proteomics
Lopez-Fernandez et al. A comprehensive analysis about the influence of low-level preprocessing techniques on mass spectrometry data for sample classification
Sun et al. PPIRank-an advanced method for ranking protein-protein interactions in TAP/MS data
Shao et al. Computational prediction of human body-fluid protein
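CN112614542B (zh) Microorganism identification method, apparatus, device, and storage medium
JP2019185224A (ja) Method and apparatus for evaluating the identification quality of endogenous modified peptides
CN114420204B (zh) Method, computing device, and storage medium for predicting the copy number of a gene under test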
Raja et al. Quality control and annotation of variant peptides identified through Proteogenomics
US20220189581A1 (en) Method and apparatus for classification and/or prioritization of genetic variants

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20735512

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20735512

Country of ref document: EP

Kind code of ref document: A1