CN112415208A - Method for evaluating quality of proteomics mass spectrum data - Google Patents


Info

Publication number
CN112415208A
Authority
CN
China
Prior art keywords
score
scoring
spectrum
quality
secondary spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011282435.1A
Other languages
Chinese (zh)
Inventor
刘超
吴利则
宫鹏云
郭一洁
李威铮
汤敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202011282435.1A
Publication of CN112415208A
Legal status: Pending

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6848 Methods of protein analysis involving mass spectrometry
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N27/00 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means
    • G01N27/62 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode

Abstract

The invention discloses a method for evaluating the quality of proteomics mass spectrum data, which comprises the following steps: scoring the mass spectrum data obtained by testing to obtain primary scoring results, wherein the primary scoring results comprise an identification result score, a secondary spectrum acquisition number score, a secondary spectrum utilization rate score and a secondary spectrum quality score; and taking the average of the identification result score, the secondary spectrum acquisition number score, the secondary spectrum utilization rate score and the secondary spectrum quality score as the total score of the mass spectrum experiment. Because most of the factors influencing the mass spectrum experimental data are analyzed, the data of every acquired spectrum and every identified peptide fragment in the mass spectrum experiment can be extracted and inspected as a whole, without sampling. The method can therefore show the user, comprehensively, accurately, objectively and conveniently, how each factor influencing mass spectrum data quality performs, help the user improve the experiment, and further promote the clinical application of mass spectrometry.

Description

Method for evaluating quality of proteomics mass spectrum data
Technical Field
The invention relates to the technical field of biotechnology and proteomics, in particular to a method for evaluating the quality of proteomic mass spectrum data.
Background
HPLC-MS has in recent years become an important method for bottom-up proteomics. It is used to discover protein markers of diseases or drugs by exploring the protein components in biological cells or tissues and by studying molecular signaling pathways under different mechanisms or the interactions among different molecules. Researchers determine the types and amounts of protein contained in a sample from the MS identification results and draw conclusions by comparing different samples. Generally, a large number of peptide fragments and proteomes in the identification results is taken as evidence that the experimental data and the HPLC-MS experiment are of good quality. However, this approach cannot determine why the quality of the mass spectrum experimental data is high or low: although researchers know whether their mass spectrometry experiment performed well or poorly, they do not know why it reached that quality level and therefore cannot improve it under limited conditions. This leads to problems such as low reproducibility of mass spectrometry proteomics results and hinders the clinical application of mass spectrometry.
To investigate in depth why mass spectrum data quality varies, and to improve an experiment by targeting its weak points, researchers can at present only extract some of the factors influencing the mass spectrum data by hand and analyze them roughly. Each mass spectrometry experiment produces a huge amount of data, so a complete manual analysis is not feasible, and researchers can only spot-check one aspect to judge its quality. This is time-consuming and labor-intensive; more importantly, because spot checks are highly subject to chance, their results do not correctly reflect the weak points of a researcher's mass spectrometry experiment. Manual analysis also cannot show the quality of all the factors influencing the mass spectrum data. Currently, there is no comparable method that automatically analyzes mass spectrum data quality and the factors influencing it. Existing similar methods mostly analyze the identification results produced by a search, but no method analyzes the factors behind those results.
In view of this, the invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a method for evaluating the quality of proteomics mass spectrum data so as to improve the problems.
The invention is realized by the following steps:
the invention provides a method for evaluating the quality of proteomics mass spectrum data, which comprises the following steps: scoring the mass spectrum data obtained by testing to obtain a primary scoring result, wherein the primary scoring result comprises an identification result score, a secondary spectrum acquisition number score, a secondary spectrum utilization rate score and a secondary spectrum quality score; and taking the average score of the identification result score, the secondary spectrum acquisition number score, the secondary spectrum utilization rate score and the secondary spectrum quality score as the total score of mass spectrum data quality evaluation.
Optionally, each primary scoring result is calculated from secondary scoring results. For each secondary scoring item of the mass spectrum data, an actual value within a preset value interval is assigned a score according to a linear scoring standard, and an actual value outside the preset value interval takes the highest score or the lowest score.
Optionally, the secondary scoring result corresponding to the identification result score comprises a peptide fragment identification result score and a proteome identification result score;
optionally, the secondary scoring result corresponding to the secondary spectrum acquisition number score includes at least one of an average secondary spectrum acquisition speed score, a ratio score of the total secondary spectrum acquisition number to the total primary spectrum acquisition number, a maximum secondary spectrum acquisition speed score, and a spectrum effective acquisition time score determined by a spectrum dead volume;
optionally, the secondary scoring result corresponding to the secondary spectrum utilization rate score comprises at least one of a score of the ratio of the number of secondary spectra to the number of peptide fragments, a score of the median full peak width of the peptide fragment elution time, and a chromatographic tailing score;
optionally, the secondary scoring result corresponding to the secondary spectrum quality score includes at least one of an identification rate score, an average relative mass deviation score, a score of the standard deviation of the relative mass deviation, an enzyme digestion specificity score, a missed cleavage score and a cysteine modification score.
Optionally, the identification result score is an average score of the peptide identification result score and the proteome identification result score; the secondary scoring rule of peptide fragment identification result score is as follows: when the number of the identified peptide fragments per hour is less than or equal to 10000, the lowest score is taken, when the number of the identified peptide fragments per hour is greater than or equal to 35000, the highest score is taken, and when the number of the identified peptide fragments per hour is greater than 10000 and less than 35000, the calculation is carried out according to a linear scoring standard.
The secondary scoring rule of the proteome identification result score is as follows: when the number of identified proteomes per hour is less than or equal to 1500, the lowest score is taken; when the number of identified proteomes per hour is greater than or equal to 3000, the highest score is taken; and when the number of identified proteomes per hour is greater than 1500 and less than 3000, the score is calculated according to a linear scoring standard.
Optionally, the secondary spectrum acquisition number score is the secondary spectrum average acquisition speed score, and the secondary scoring rule of the secondary spectrum average acquisition speed score is as follows: when the average acquisition speed of the secondary spectrum is less than or equal to 3 Hz, the lowest score is taken; when it is greater than or equal to 40 Hz, the highest score is taken; and when it is greater than 3 Hz and less than 40 Hz, the score is calculated according to a linear scoring standard.
Optionally, the secondary spectrum utilization rate score is the score of the ratio of the number of secondary spectra to the number of peptide fragments, and the secondary scoring rule of this score is as follows: when the ratio of the number of secondary spectra to the number of peptide fragments is less than or equal to 1.2, the highest score is taken; when the ratio is greater than or equal to 2, the lowest score is taken; and when the ratio is greater than 1.2 and less than 2, the score is calculated according to a linear scoring standard.
Optionally, the secondary spectrum quality score is the identification rate score, and the secondary scoring rule of the identification rate score is as follows: when the identification rate is less than or equal to 20%, the lowest score is taken; when the identification rate is greater than or equal to 85%, the highest score is taken; and when the identification rate is greater than 20% and less than 85%, the score is calculated according to a linear scoring standard.
Optionally, the lowest score is set to 0.5 to 1 point and the highest score is set to 5 to 10 points; preferably, the lowest score is set to 1 point and the highest score is set to 5 points.
Optionally, the mass spectral data comprises data of all spectra acquired and identified peptide fragments in the mass spectral experiment.
Optionally, the mass spectrometry experiment total score and the plurality of primary scoring results are visualized, and a data visualization result is output.
The invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the above-described method of evaluating the quality of proteomics mass spectrum data.
The embodiments of the invention have at least the following beneficial effects: because most of the factors influencing the mass spectrum experimental data are analyzed, the data of every acquired spectrum and every identified peptide fragment in the mass spectrum experiment can be extracted and inspected as a whole, without sampling. The method can therefore show the user, comprehensively, accurately, objectively and conveniently, how each factor influencing mass spectrum data quality performs, help the user improve the experiment, and further promote the clinical application of mass spectrometry.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic diagram of an ideal case of mass spectrometric identification of a biological sample;
FIG. 2 is a schematic diagram of the actual situation of mass spectrometric identification of a protein sample;
FIG. 3 is a flow chart of a method for evaluating the quality of proteomics mass spectrum data provided by an embodiment of the present invention;
FIG. 4 is a graph showing the primary scoring results and the total score of the data in Example 1 of the present invention;
FIG. 5 is a radar chart of the secondary scoring results in Example 1 of the present invention;
FIG. 6 is a comparison of the actual situation of mass spectrometric identification of a protein sample with the results of different mass spectrometry experiments in Example 2 of the present invention;
FIG. 7 shows the primary scoring results output in Example 2 of the present invention and their comparison across different data;
FIG. 8 is a radar chart showing the secondary scoring results output in Example 2 of the present invention and their comparison across different data;
FIG. 9 is a data visualization of the full peak width of the chromatogram of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below. Where specific conditions are not specified in the examples, conventional conditions or the conditions recommended by the manufacturer were used. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.
The method for evaluating the quality of the proteomics mass spectrum data provided by the invention is specifically explained below.
The inventors studied the problems of existing mass spectrometry results. FIG. 1 shows the process of identifying peptide fragments by mass spectrometry under ideal conditions. Ideally, all peptide fragments in the sample are acquired by the mass spectrometer (the number of secondary spectra acquired is sufficient), each peptide fragment gives rise to exactly one secondary spectrum (the secondary spectrum utilization rate is high), and every secondary spectrum can be identified (the secondary spectrum quality is high). The total number of mass spectrometric identifications should then equal the total number of proteins and peptide fragments in the sample. In practice, however, as shown in FIG. 2, three things go wrong. First, a large proportion of the peptide fragments cannot be acquired within the prescribed time, that is, the number of secondary spectra acquired is not large enough, which means that the acquisition efficiency and identification rate of the mass spectrometer are not high enough. Second, some acquired secondary spectra cannot be identified, which means that the spectrum quality is not high enough and may lead to false identifications. Third, several secondary spectra are acquired repeatedly from some peptide fragments and their identification results point to the same peptide fragment, so the secondary spectrum utilization rate is not high enough; this lowers identification efficiency, and other peptide fragments may be missed while the same ones are acquired repeatedly. In practice, the quality of mass spectrum data depends on these three aspects. If all three are handled well, the quantity, efficiency and accuracy of the mass spectrometer's identification of the proteins in the sample improve greatly.
Based on this, the inventors have proposed the following rules (formulas) through a large number of studies and practices:
$$N_{\text{mass spectrometric identification results}} = N_{\text{number of secondary spectra acquired}} \times P_{\text{secondary spectrum utilization rate}} \times Q_{\text{secondary spectrum quality}}$$
This formula quantifies the identification result from the perspective of its influencing factors for the first time, where the × symbols between the factors denote the operational relationship that links the three factors. Based on this formula, the inventors propose the following method to evaluate the quality of mass spectrum data and thereby improve the detection process.
Some embodiments of the invention provide a method of assessing the quality of proteomic mass spectrometry data comprising: scoring the mass spectrum data obtained by testing to obtain a primary scoring result, wherein the primary scoring result comprises an identification result score, a secondary spectrum acquisition number score, a secondary spectrum utilization rate score and a secondary spectrum quality score; and taking the average score of the identification result score, the secondary spectrum acquisition number score, the secondary spectrum utilization rate score and the secondary spectrum quality score as the total score of mass spectrum data quality evaluation.
Scoring reveals the quality of each factor influencing the mass spectrum data, helps users explain why their mass spectrometry results are of the quality they are, and supports optimization of the experimental scheme.
The above-described method of the embodiments of the present invention may be implemented in software. Specifically, referring to FIG. 3, the method for evaluating the quality of proteomics mass spectrum data provided by some embodiments of the present invention may comprise the following steps:
S1: extracting and classifying the data input by the user.
According to the path input by the user, the data corresponding to the required scoring items are extracted from the data of a single mass spectrometry experiment and from the pFind identification result.
S2: scoring the mass spectrum data to obtain primary scoring results, wherein the primary scoring results comprise an identification result score, a secondary spectrum acquisition number score, a secondary spectrum utilization rate score and a secondary spectrum quality score.
The total score of each set of mass spectrum experimental data is determined by four primary scoring results, namely an identification result score, a secondary spectrum acquisition number score, a secondary spectrum utilization rate score and a secondary spectrum quality score. Each primary scoring result is calculated from secondary scoring results. The secondary scoring items are the specific factors influencing the mass spectrum experimental data, and they are grouped according to the primary scoring result that they influence or reflect. Each secondary scoring result is generated from the actual value of the item in the mass spectrum data and a pre-established scoring rule. A highest score and a lowest score are set, and, according to the characteristics of each secondary scoring item, the actual values corresponding to the highest score and the lowest score are set, which yields a linear scoring standard specific to that item. For each secondary scoring item of the user data, an actual value within the preset value interval is assigned a score according to the linear scoring standard. If the actual value lies outside the preset interval, the item takes whichever of the highest score and the lowest score corresponds to the nearer end of the interval.
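The linear scoring standard with clamping described above can be sketched as follows. This is a minimal illustration, not the inventors' software: the function name `linear_score` and its parameter names are assumptions, and the default lowest and highest scores of 1 and 5 follow the preferred values stated in this document.

```python
def linear_score(value, worst_value, best_value, min_score=1.0, max_score=5.0):
    """Map an actual value to a score by linear interpolation with clamping.

    worst_value is the actual value that earns min_score and best_value is
    the actual value that earns max_score. Passing worst_value > best_value
    handles items for which smaller actual values are better.
    (Names and signature are illustrative assumptions.)
    """
    if worst_value == best_value:
        raise ValueError("worst_value and best_value must differ")
    # Fraction of the way from the worst endpoint to the best endpoint.
    fraction = (value - worst_value) / (best_value - worst_value)
    # Values outside the preset interval take the nearest extreme score.
    fraction = min(1.0, max(0.0, fraction))
    return min_score + fraction * (max_score - min_score)
```

For instance, with the interval of 3 Hz to 40 Hz given in this document for the average secondary spectrum acquisition speed, `linear_score(21.5, 3, 40)` lies halfway between the two extremes, while any value at or below 3 Hz returns the lowest score.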
As an example, the scoring mode is explained using the secondary scoring item "median of the full peak width of the elution time". First, the software reconstructs the chromatographic elution curves of all peptide fragments analyzed in the mass spectrometry experiment and obtains the full peak width and the half peak width of the elution time of each peptide fragment. The preset scoring rule is as follows: if the median of the full peak width of the elution time is less than or equal to 0.3 minute, 5 points are given; if the median of the full peak width of the elution time is greater than or equal to 2 minutes, 1 point is given. If the median of the full peak width of the elution time is greater than 0.3 minute and less than 2 minutes, the linear scoring standard of this secondary scoring item is:
y (actual score) = -0.425x (actual value) + 1.85
If the actual value in the user data lies in this interval, the score is obtained from this model and the actual value. For example, if the median of the full peak width of the elution time in the user data is 1 minute, the score is 1.4 points (one decimal place is retained).
The items and the categories (corresponding primary scoring results) corresponding to all the secondary scoring results of the embodiment of the invention are shown in table 1. The design of the scoring criteria was mainly derived from experience and summary of experimental data in the "CNCP 2018" project. There are different rules for scoring items at each level.
TABLE 1
[Table 1 appears as images in the original publication: Figure BDA0002781225510000071, Figure BDA0002781225510000081]
Specifically, in some embodiments, the secondary scoring result corresponding to the identification result score comprises a peptide fragment identification result score and a proteome identification result score. The identification result score is the average of the scores of the two secondary scoring items 1.1 and 1.2, namely:
X1 (identification result score) = (X1.1 (peptide fragment identification result score) + X1.2 (proteome identification result score))/2
In mass spectrum data, the peptide fragment identification result and the proteome identification result reflect the identification result from two angles. If many peptide fragments are identified but few proteomes, the results are less useful; if many proteomes are identified but few peptide fragments, the results are less reliable. Therefore, the higher the average of the two items, the more numerous and the more reliable the identification results.
Further, the secondary scoring rule of the peptide fragment identification result score is as follows: when the number of identified peptide fragments per hour is less than or equal to 10000, the lowest score is taken; when the number of identified peptide fragments per hour is greater than or equal to 35000, the highest score is taken; and when the number of identified peptide fragments per hour is greater than 10000 and less than 35000, the score is calculated according to a linear scoring standard. The secondary scoring rule of the proteome identification result score is as follows: when the number of identified proteomes per hour is less than or equal to 1500, the lowest score is taken; when the number of identified proteomes per hour is greater than or equal to 3000, the highest score is taken; and when the number of identified proteomes per hour is greater than 1500 and less than 3000, the score is calculated according to a linear scoring standard.
The secondary spectrum acquisition number score, the secondary spectrum utilization rate score and the secondary spectrum quality score are each equal to the score of the first secondary scoring item listed under the corresponding category in Table 1, namely:
X2 (secondary spectrum acquisition number score) = X2.1 (secondary spectrum average acquisition speed score)
X3 (secondary spectrum utilization rate score) = X3.1 (score of the ratio of the number of secondary spectra to the number of peptide fragments)
X4 (secondary spectrum quality score) = X4.1 (identification rate score)
Under each primary scoring category, the first secondary scoring item directly reflects the quality of that primary item. The other secondary scoring items are factors that influence the primary item; they are therefore not used to calculate the primary score, but their scores indicate why the primary item received its score. If the primary scores tell users where the weak points of their mass spectrometry experiment lie, the secondary scoring items (other than the first item of each category) help users find the cause of the weakness within each primary item.
Specifically, the secondary scoring result corresponding to the secondary spectrum acquisition number score is the secondary spectrum average acquisition speed score, and its secondary scoring rule is as follows: when the average acquisition speed of the secondary spectrum is less than or equal to 3 Hz, the lowest score is taken; when it is greater than or equal to 40 Hz, the highest score is taken; and when it is greater than 3 Hz and less than 40 Hz, the score is calculated according to a linear scoring standard.
The secondary scoring result corresponding to the secondary spectrum utilization rate score is the score of the ratio of the number of secondary spectra to the number of peptide fragments, and its secondary scoring rule is as follows: when the ratio of the number of secondary spectra to the number of peptide fragments is less than or equal to 1.2, the highest score is taken; when the ratio is greater than or equal to 2, the lowest score is taken; and when the ratio is greater than 1.2 and less than 2, the score is calculated according to a linear scoring standard.
The secondary scoring result corresponding to the secondary spectrum quality score is the identification rate score, and its secondary scoring rule is as follows: when the identification rate is less than or equal to 20%, the lowest score is taken; when the identification rate is greater than or equal to 85%, the highest score is taken; and when the identification rate is greater than 20% and less than 85%, the score is calculated according to a linear scoring standard.
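The five rules stated above differ only in their endpoints and in whether a larger or a smaller actual value is better, so they can be held in one table. The sketch below is an illustration under assumptions: the names `RULES` and `score_item` and the dictionary layout are hypothetical, the item identifiers follow the numbering 1.1 to 4.1 used in this document, and the lowest and highest scores use the preferred values of 1 and 5.

```python
# Hypothetical rule table: item id -> (actual value that earns the lowest
# score, actual value that earns the highest score). Thresholds are the ones
# stated in the text; the table layout itself is an assumption.
RULES = {
    "1.1": (10000, 35000),  # peptide fragments identified per hour
    "1.2": (1500, 3000),    # proteomes identified per hour
    "2.1": (3, 40),         # average secondary spectrum acquisition speed, Hz
    "3.1": (2.0, 1.2),      # secondary spectra per peptide fragment (lower is better)
    "4.1": (20, 85),        # identification rate, percent
}
MIN_SCORE, MAX_SCORE = 1.0, 5.0


def score_item(item, value):
    """Score one secondary item by clamped linear interpolation."""
    worst, best = RULES[item]
    # Ordering the endpoints as (worst, best) encodes the direction, so the
    # same formula serves ascending and descending items.
    fraction = (value - worst) / (best - worst)
    fraction = min(1.0, max(0.0, fraction))
    return MIN_SCORE + fraction * (MAX_SCORE - MIN_SCORE)
```

Storing the endpoints in "worst, best" order avoids a separate direction flag: for item 3.1 the pair is simply written as (2.0, 1.2), so a ratio of 1.2 or less earns the highest score.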
It should be noted that, in some embodiments, the lowest score is set to 0.5 to 1 point and the highest score is set to 5 to 10 points; preferably, as shown in Table 1, the lowest score is set to 1 point and the highest score is set to 5 points. The mass spectrum data comprise the data of all acquired spectra and all identified peptide fragments in the mass spectrometry experiment.
S3: calculating the average of the identification result score, the secondary spectrum acquisition number score, the secondary spectrum utilization rate score and the secondary spectrum quality score as the total score of the mass spectrum experiment.
The more identification results there are, the better the quality of the mass spectrum data, which indicates that the mass spectrometry experiment detects the proteins in the sample more comprehensively. The higher the number of secondary spectra acquired, the more detection results the experiment yields within a given time, which reflects the efficiency and speed of the experiment. The higher the secondary spectrum utilization rate, the fewer wasted acquisitions the mass spectrometer makes, the higher the detection efficiency and the lower the miss rate. The secondary spectrum quality reflects whether the acquired secondary spectra can be clearly identified: the higher the secondary spectrum quality, the higher the accuracy of the identification results of the mass spectrometry experiment. These four parts therefore fully reflect the quality of the mass spectrometry experiment, and the total score is designed as the average of the four primary scoring results:
y (total score) = (X1 (identification result score) + X2 (secondary spectrum acquisition number score) + X3 (secondary spectrum utilization rate score) + X4 (secondary spectrum quality score))/4
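The aggregation from secondary scores to primary scores and then to the total score can be sketched as below. The function names and the dictionary keys are assumptions made for illustration; only the arithmetic (X1 as the mean of items 1.1 and 1.2, X2 to X4 taken from items 2.1, 3.1 and 4.1, and the total as the mean of X1 to X4) comes from the text.

```python
from statistics import mean


def primary_scores(secondary):
    """Derive the four primary scores from a dict of secondary scores.

    `secondary` maps item ids such as "1.1" to scores (hypothetical layout).
    """
    return {
        "X1": mean([secondary["1.1"], secondary["1.2"]]),  # identification result
        "X2": secondary["2.1"],  # secondary spectrum acquisition number
        "X3": secondary["3.1"],  # secondary spectrum utilization rate
        "X4": secondary["4.1"],  # secondary spectrum quality
    }


def total_score(primary):
    """Total score: plain average of the four primary scores."""
    return mean(primary[key] for key in ("X1", "X2", "X3", "X4"))


# Usage with illustrative numbers; the values for 3.1 and 4.1 are made up.
example = {"1.1": 5.0, "1.2": 5.0, "2.1": 3.4, "3.1": 4.0, "4.1": 3.0}
example_total = total_score(primary_scores(example))
```

Secondary items other than the five used here would be carried alongside for diagnosis only, since the text states that they do not enter the primary scores.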
In some embodiments, in order to enable better presentation of the results, the following steps may be further included:
and S4, visualizing the mass spectrum experiment total score and the multiple primary scoring results, and outputting a data visualization result.
Visualizing the data of each influencing factor in a reasonable mode; and outputting the scores of various influencing factors and the total data score in html, and outputting a data visualization result. And visualizing the actual data corresponding to all the previously scored secondary scoring results through software. The software intuitively judges the quality of different influence factors through scoring, and the data visualization enables a user to more carefully know the score of the influence factors, see the basis of the score and further analyze the reason of the quality of the influence factors. At the same time, the software also visualizes some influence factors that do not include the score, such as the distribution of the maximum implantation time of the ions.
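As a rough idea of what HTML output of the scores could look like, the sketch below renders the total score and the primary scores as a small table using only the Python standard library. The function name `render_report`, the page layout and the heading text are all assumptions; the charts mentioned in the text (for example radar charts) are not reproduced here.

```python
from html import escape


def render_report(primary, total):
    """Return a minimal HTML page listing the total and primary scores.

    `primary` maps a display name to a score (hypothetical layout).
    """
    rows = "\n".join(
        f"<tr><td>{escape(name)}</td><td>{score:.1f}</td></tr>"
        for name, score in primary.items()
    )
    return (
        "<html><body>\n"
        "<h1>Mass spectrum data quality report</h1>\n"
        f"<p>Total score: {total:.1f}</p>\n"
        "<table>\n<tr><th>Primary item</th><th>Score</th></tr>\n"
        f"{rows}\n</table>\n"
        "</body></html>\n"
    )
```

The returned string can be written to a file with the usual `open(path, "w", encoding="utf-8")` call; item names are escaped so that characters such as `&` or `<` in a label cannot break the markup.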
Some embodiments of the present invention also provide a storage medium having stored thereon a computer program that, when executed by a processor, performs the above-described method of assessing the quality of proteomic mass spectrometry data.
The features and properties of the present invention are described in further detail below with reference to examples.
Example 1
The data in Example 1 come from the article "Performance Evaluation of the Q Exactive HF-X for Shotgun Proteomics", published in the Journal of Proteome Research in 2018: the data set was obtained by the authors of that article by analyzing a sample on a Q Exactive HF-X (QE HF-X) mass spectrometer with the resolution set to 15000. First, the user loads the mass spectrum data, and the actual values of the items corresponding to the required secondary scoring results are extracted and calculated. Each secondary scoring result can then be calculated from the scoring rules shown in Table 1 and the actual data. As shown in Table 2, for item 1.1 (peptide identification result) the mass spectrometer identified 43052.81 peptides per hour in the user data. The linear scoring criterion for item 1.1 assigns 5 points when more than 35000 peptides are identified per hour and 1 point when fewer than 10000 are identified per hour. The number of peptides identified in this data set is well above 35000 per hour, so the data set scores 5 points on item 1.1. For item 1.2, the actual value in the user data is 4425.94, which is higher than the highest-score set value of item 1.2, so the data set scores 5 points on item 1.2. The actual value for item 2.1 is 24.96 Hz; applying the linear criterion defined by the highest-score and lowest-score set values of item 2.1 gives a score of 3.4. The scores of all other secondary scoring items are obtained in the same way. The score of the first primary scoring result, "1. Identification result", is the average of the scores of items 1.1 and 1.2: (5+5)/2 = 5 points. The score of the second primary scoring result, "2. Secondary spectrum acquisition number", is the score of item 2.1: 3.4 points. The scores of the third and fourth primary scoring items are the scores of items 3.1 and 4.1, respectively.
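The linear scoring used in this example can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `linear_score` is hypothetical, and the set values for item 2.1 (3 Hz and 40 Hz) are taken from claim 4 on the assumption that they are the ones listed in Table 1.

```python
def linear_score(value, low, high, min_score=1.0, max_score=5.0):
    """Score `value` linearly between `low` (lowest score) and `high` (highest score).

    Values at or beyond either set value receive the lowest or highest score.
    """
    if value <= low:
        return min_score
    if value >= high:
        return max_score
    return min_score + (value - low) / (high - low) * (max_score - min_score)


# Item 1.1: 43052.81 peptides per hour is above the 35000 set value.
print(linear_score(43052.81, 10000, 35000))  # 5.0

# Item 2.1: 24.96 Hz lies between the assumed 3 Hz and 40 Hz set values.
print(round(linear_score(24.96, 3, 40), 1))  # 3.4
```

With these set values, 24.96 Hz maps to 1 + (24.96 - 3) / (40 - 3) × 4 ≈ 3.37, which rounds to the 3.4 points reported for item 2.1.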
TABLE 2
Figure BDA0002781225510000121
Calculating the average of the primary scoring results gives the total score: (5+3.4+5+5)/4 = 4.6.
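The roll-up from secondary scores to primary scores to the total score in Example 1 can be sketched as below. The dictionary layout and the function names are assumptions for illustration. The scores for items 1.1, 1.2, and 2.1 are those stated in the text; the scores of 5 for items 3.1 and 4.1 are inferred from the primary scores used in the average above.

```python
from statistics import mean

# Secondary scores that feed each primary item in Example 1.
example_1 = {
    "1. Identification result": [5.0, 5.0],           # items 1.1 and 1.2
    "2. Secondary spectrum acquisition number": [3.4],  # item 2.1
    "3. Secondary spectrum utilization": [5.0],       # item 3.1 (inferred)
    "4. Secondary spectrum quality": [5.0],           # item 4.1 (inferred)
}


def primary_scores(secondary):
    """Average the secondary scores that make up each primary item."""
    return {name: mean(scores) for name, scores in secondary.items()}


def total_score(secondary):
    """Total score Y: the mean of the four primary scores."""
    return mean(primary_scores(secondary).values())


print(round(total_score(example_1), 1))  # 4.6
```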
As shown in Figs. 4 and 5, the primary scoring item scores and all secondary scoring item scores of the data are displayed to the user as a bar chart and a radar chart, respectively. The user can see clearly from the charts that the data are very good overall and that the weak point is an insufficient number of secondary spectra acquired. Specifically, at the level of the secondary scoring items, the proportion of secondary spectra among the spectra acquired by the mass spectrometer is insufficient. The weak points of the mass spectrum experiment are thus clear at a glance.
Example 2
This example uses two data sets, both from the article "Performance Evaluation of the Q Exactive HF-X for Shotgun Proteomics", published in the Journal of Proteome Research in 2018. Data set 1 is the data set used in Example 1; data set 2 was obtained by the authors with the resolution set to 15000 and the sample analyzed on a Q Exactive HF (QE HF) mass spectrometer (data set 1 used the QE HF-X). Both data sets were evaluated in the same manner as in Example 1, and the inventors then performed a comparative analysis.
First, as shown in Fig. 6, data set 2 (dashed lines) identified far more peptides than data set 1 (solid lines), and thus the mass spectrum data quality of data set 2 is higher than that of data set 1.
Second, data set 1 acquired far more secondary spectra than data set 2 (the uppermost dashed and solid lines). The identification rate (ID rate) is obtained, for the dashed and solid lines respectively, by dividing the final total of the lowest line (the total number of identified secondary spectra) by the final total of the uppermost line (the total number of acquired secondary spectra); the identification rate represents the quality of the spectra. It can be seen that the identification rate of data set 2 is also higher than that of data set 1. Third, dividing the final total of the lowest line by the final total of the middle line (the identified peptides) gives a ratio that indicates the utilization of the spectra (ideally one spectrum corresponds to one peptide, so the smaller the value, the better the utilization). It can be seen that the spectrum utilization of data set 1 is higher than that of data set 2. The three aspects above were scored as in Example 1, as shown in Fig. 7. With this method the user does not need to calculate the identification rate manually or search a complex data set for the total number of secondary spectra, and can see the quality of the three aspects at a glance.
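The two ratios described above can be computed directly from three counts. The sketch below is illustrative only: the function name `spectrum_ratios` and the example counts are hypothetical and are not taken from either data set.

```python
def spectrum_ratios(acquired_ms2, identified_ms2, identified_peptides):
    """Return (identification rate, identified secondary spectra per peptide).

    identification rate = identified secondary spectra / acquired secondary spectra
    spectra per peptide = identified secondary spectra / identified peptides
    (the closer the second ratio is to 1, the better the spectrum utilization)
    """
    if acquired_ms2 <= 0 or identified_peptides <= 0:
        raise ValueError("counts must be positive")
    id_rate = identified_ms2 / acquired_ms2
    spectra_per_peptide = identified_ms2 / identified_peptides
    return id_rate, spectra_per_peptide


# Hypothetical counts, for illustration only.
id_rate, spectra_per_peptide = spectrum_ratios(100_000, 60_000, 40_000)
print(f"ID rate: {id_rate:.1%}")                          # ID rate: 60.0%
print(f"Spectra per peptide: {spectra_per_peptide:.2f}")  # Spectra per peptide: 1.50
```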
Although data set 1 does not surpass data set 2 in every respect, the overall quality of data set 1 is better than that of data set 2. This results from the superior performance of the mass spectrometer used in the data set 1 experiment, as explained below.
The radar chart of the secondary scores in Fig. 8 shows the scores of the secondary scoring items of the two data sets and how they compare. Data set 1 surpasses data set 2 across the board in mass spectrometer performance. However, the chromatographic quality of data set 1, and the secondary spectrum utilization that it affects, are inferior to those of data set 2, and data set 1 is also inferior to data set 2 in enzyme digestion, a factor that influences secondary spectrum quality. If the chromatography and sample pretreatment conditions of the data set 1 experiment were improved, the utilization rate and quality of the secondary spectra could be improved and, combined with the performance of the mass spectrometer, the mass spectrum data quality could be raised a further step. The secondary scores also reveal the weak point of data set 2: the mass spectrometer performance is insufficient, and its secondary spectrum acquisition proportion, acquisition speed, and effective acquisition time are all inferior to those of data set 1. As a result, the quality of data set 2 cannot exceed that of data set 1 even though its chromatography and sample pretreatment are good.
FIG. 9 shows the visualization of the chromatographic data of Example 1 output by the present method. The full peak width of the chromatogram is a secondary scoring item, so this aspect of the mass spectrum experiment is scored and its actual values are visualized. In the traditional method of evaluating chromatographic quality, researchers randomly select several peptides, look up the elution time and concentration of each peptide spectrum by spectrum, and finally reconstruct the chromatographic elution curve to obtain the full peak width of a peptide. This manual method is time-consuming and laborious even for a single peptide, whereas real data contain tens of thousands of peptides, and a few randomly chosen peptides can hardly represent the whole. The present method reconstructs the chromatographic elution curves of all peptides in the mass spectrum data and visualizes the full peak widths as a histogram. From the histogram the user can understand the basis of the score, or make an independent evaluation of the item based on what the histogram shows. Compared with the traditional manual method, the present method extracts the very large volume of data on the user's behalf and integrates and processes it by category, so that the user can conveniently and intuitively assess the weak points of the mass spectrum data and improve the experiment. The figure shows that the chromatogram of this data set has a short median full peak width and little tailing (peptides that elute for longer than 2 minutes are rare). Thus items 3.2 and 3.3 of the secondary scores of the Example 1 data set both scored very high, close to full marks.
Items 3.2 and 3.3 are influencing factors of the third primary item, secondary spectrum utilization. Their high scores contributed to the good secondary spectrum utilization of this data set, which explains why the data set in Example 1 has a good score on the third item.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of assessing proteomic mass spectrometry data quality, wherein determinants of the identification of mass spectrometry data quality comprise secondary spectrum acquisition number, secondary spectrum utilization rate, and secondary spectrum quality, the method comprising:
scoring the mass spectrum data obtained by testing to obtain a primary scoring result, wherein the primary scoring result comprises an identification result score, a secondary spectrum acquisition number score, a secondary spectrum utilization rate score and a secondary spectrum quality score;
and taking the average score of the identification result score, the secondary spectrum acquisition number score, the secondary spectrum utilization rate score and the secondary spectrum quality score as a mass spectrum data quality evaluation total score.
2. The method of claim 1, wherein each of the primary scoring results is calculated from secondary scoring results, and for each secondary scoring of the mass spectral data, actual values within a preset range are assigned a score according to a linear scoring criterion, and actual values beyond the preset range are assigned the highest score or the lowest score;
preferably, the secondary scoring result corresponding to the identification result score comprises a peptide fragment identification result score and a proteome identification result score;
preferably, the secondary scoring result corresponding to the secondary spectrum acquisition number score includes at least one of an average secondary spectrum acquisition speed score, a score of the ratio of the total secondary spectrum acquisition number to the total primary spectrum acquisition number, a maximum secondary spectrum acquisition speed score, and a spectrum effective acquisition time score determined by a spectrum dead volume;
preferably, the secondary scoring result corresponding to the secondary spectrum utilization rate score comprises at least one of a score of the ratio of the number of secondary spectra to the number of peptide fragments, a score of the median full elution time of the peptide fragments, and a score of the chromatographic tailing condition;
preferably, the secondary scoring result corresponding to the secondary spectrum quality score comprises at least one of an identification rate score, an evaluation relative mass deviation score, a score of the standard deviation of the relative mass deviation, an enzyme digestion specificity score, a missed cleavage condition score, and a cysteine modification condition score.
3. The method of claim 2, wherein the identification result score is an average of the peptide identification result score and the proteomic identification result score;
the second-level scoring rule of the peptide fragment identification result score is as follows: when the number of sequences of the identified peptide fragment per hour is less than or equal to 10000, taking the lowest score, when the number of sequences of the identified peptide fragment per hour is greater than or equal to 35000, taking the highest score, and when the number of sequences of the identified peptide fragment per hour is greater than 10000 and less than 35000, calculating according to a linear scoring standard;
the secondary scoring rule of the proteome identification result score is as follows: when the number of identified proteomes per hour is less than or equal to 1500, the lowest score is taken, when the number of identified proteomes per hour is greater than or equal to 3000, the highest score is taken, and when the number of identified proteomes per hour is greater than 1500 and less than 3000, the score is calculated according to a linear scoring criterion.
4. The method according to claim 2 or 3, wherein the secondary spectrum acquisition number score is a secondary spectrum average acquisition speed score, and the secondary scoring rule of the secondary spectrum average acquisition speed score is: when the average acquisition speed of the secondary spectrum is less than or equal to 3 Hz, the lowest score is taken, when the average acquisition speed of the secondary spectrum is greater than or equal to 40 Hz, the highest score is taken, and when the average acquisition speed of the secondary spectrum is greater than 3 Hz and less than 40 Hz, the score is calculated according to a linear scoring criterion.
5. The method of claim 4, wherein the secondary spectrum utilization rate score is a score of the ratio of the number of secondary spectra to the number of peptide fragments, and the secondary scoring rule of the score of the ratio of the number of secondary spectra to the number of peptide fragments is: when the ratio of the number of secondary spectra to the number of peptide fragments is less than or equal to 1.2, the highest score is taken, when the ratio is greater than or equal to 2, the lowest score is taken, and when the ratio is greater than 1.2 and less than 2, the score is calculated according to a linear scoring criterion.
6. The method of claim 5, wherein the secondary spectrum quality score is an identification rate score whose secondary scoring rule is: when the identification rate is less than or equal to 20%, the lowest score is taken, when the identification rate is greater than or equal to 85%, the highest score is taken, and when the identification rate is greater than 20% and less than 85%, the score is calculated according to a linear scoring criterion.
7. The method according to claim 2, wherein the lowest score is set to 0.5 to 1 point and the highest score is set to 5 to 10 points, preferably the lowest score is set to 1 point and the highest score is set to 5 points.
8. The method of claim 1, wherein the mass spectral data comprises data for all spectra collected and identified peptide fragments in a mass spectral experiment.
9. The method of claim 1, wherein the mass spectrometry data quality assessment total score and the plurality of primary scoring results are visualized and a data visualization result is output.
10. A storage medium having stored thereon a computer program which, when executed by a processor, performs a method of assessing the quality of proteomic mass spectrometry data according to any one of claims 1 to 9.
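The set values recited in claims 3 to 6 can be collected into a single rule table, as in the sketch below. This is an illustration under stated assumptions, not the claimed implementation: the names `RULES` and `score_item`, the item keys, and the (worst, best) layout are hypothetical, and the preferred 1-point and 5-point limits of claim 7 are used. Interpolating from the worst set value towards the best one handles the item of claim 5, where a lower value earns a higher score, with the same code as the other items.

```python
# (worst set value, best set value) for each secondary item, from claims 3 to 6.
RULES = {
    "peptides_per_hour": (10000, 35000),    # claim 3
    "proteomes_per_hour": (1500, 3000),     # claim 3
    "ms2_avg_speed_hz": (3, 40),            # claim 4
    "ms2_per_peptide": (2.0, 1.2),          # claim 5: lower is better
    "identification_rate_pct": (20, 85),    # claim 6
}


def score_item(item, value, min_score=1.0, max_score=5.0):
    """Linear score between the worst and best set values, clamped outside them."""
    worst, best = RULES[item]
    fraction = (value - worst) / (best - worst)
    fraction = min(1.0, max(0.0, fraction))
    return min_score + fraction * (max_score - min_score)


print(score_item("identification_rate_pct", 52.5))  # 3.0
print(score_item("ms2_per_peptide", 1.0))           # 5.0
print(score_item("ms2_per_peptide", 2.5))           # 1.0
```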
CN202011282435.1A 2020-11-17 2020-11-17 Method for evaluating quality of proteomics mass spectrum data Pending CN112415208A (en)

Publications (1)

Publication Number Publication Date
CN112415208A true CN112415208A (en) 2021-02-26

Family

ID=74831360

Country Status (1)

Country Link
CN (1) CN112415208A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060008922A1 (en) * 1999-01-30 2006-01-12 Chace Donald H Clinical method for the genetic screening of newborns using tandem mass spectrometry and internal standards therefor
CN102495127A (en) * 2011-11-11 2012-06-13 暨南大学 Protein secondary mass spectrometric identification method based on probability statistic model
CN104076115A (en) * 2014-06-26 2014-10-01 云南民族大学 Protein second-level mass spectrum identification method based on peak intensity recognition capability
CN105527359A (en) * 2015-11-19 2016-04-27 云南民族大学 Tandem mass spectrometric identification method for protein based on matching between characteristic information of target database and decoy database
CN106771011A (en) * 2016-08-31 2017-05-31 云南白药天颐茶品有限公司 A kind of integrated sensory quality evaluating method of tealeaves
CN109863558A (en) * 2016-10-17 2019-06-07 布鲁克道尔顿有限公司 The appraisal procedure and mass spectrography and MALDI TOF mass spectrograph of mass spectrometric data
CN108268671A (en) * 2016-12-30 2018-07-10 中国电力科学研究院 A kind of safety on line analysis model quality testing system and its evaluation method
CN107123123A (en) * 2017-05-02 2017-09-01 电子科技大学 Image segmentation quality evaluating method based on convolutional neural networks
CN107727727A (en) * 2017-11-13 2018-02-23 复旦大学 A kind of protein identification method and system
CN108846880A (en) * 2018-04-25 2018-11-20 云南中烟工业有限责任公司 A kind of cigarette quality feature visualization method
EP3598135A1 (en) * 2018-07-20 2020-01-22 Univerzita Palackého v Olomouci Method of identification of entities from mass spectra
CN111208299A (en) * 2018-11-21 2020-05-29 中国科学院大连化学物理研究所 Qualitative and quantitative analysis method for cross-linked peptide fragments
CN110196814A (en) * 2019-06-12 2019-09-03 王轶昆 A kind of method for evaluating software quality

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
PAUL A. RUDNICK ET AL.: "Performance Metrics for Liquid Chromatography-Tandem Mass Spectrometry Systems in Proteomics Analyses", 《MOLECULAR & CELLULAR PROTEOMICS》 *
孙瑞祥等: ""自顶向下(top-down)"的蛋白质组学——蛋白质变体的规模化鉴定", 《生物化学与生物物理进展》 *
张成普: "基于串联质谱的肽段与修饰鉴定的质量控制算法研究与应用", 《中国优秀博硕士学位论文全文数据库(博士)》 *
翟芳: "鸟枪法蛋白质组学质谱平台性能标准和参考数据集的建立", 《中国优秀博硕士学位论文全文数据库(硕士)》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113284563A (en) * 2021-04-20 2021-08-20 厦门大学 Screening method and system for protein mass spectrum quantitative analysis result
CN113284563B (en) * 2021-04-20 2024-04-09 厦门大学 Screening method and system for protein mass spectrum quantitative analysis result
CN116106464A (en) * 2023-04-10 2023-05-12 西湖欧米(杭州)生物科技有限公司 Control system, evaluation system and method for mass spectrum data quality degree or probability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226