CN115359847A - Peak searching algorithm for proteomics series mass spectrogram - Google Patents

Publication number
CN115359847A
Authority
CN
China
Prior art keywords
peak
algorithm
data
value
smoothing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210953144.3A
Other languages
Chinese (zh)
Inventor
何情祖
黎玉林
郭欢
帅建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou Research Institute Of Guoke Wenzhou Institute Of Biomaterials And Engineering
Original Assignee
Wenzhou Research Institute Of Guoke Wenzhou Institute Of Biomaterials And Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou Research Institute Of Guoke Wenzhou Institute Of Biomaterials And Engineering
Priority to CN202210953144.3A
Publication of CN115359847A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10: Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

The peak-finding algorithm for proteomic tandem mass spectra is divided into three parts. Kernel regression is used for data smoothing, which avoids the peak flattening caused by a moving-average algorithm. To address the baseline drift caused by factors such as the instrument, an adaptive least squares method is adopted; the algorithm converges very quickly in iteration, handles the baseline problem well, and offsets the computational cost of the data smoothing stage. Finally, for peak searching, a one-dimensional continuous wavelet transform is used: the mass spectrum is treated as a combination of wavelets and processed locally, which fits the shape of the peaks in the mass spectrum well and is very fast to compute. Noise interference is reduced as far as possible during peak finding, and the peaks of the processed peptide fragments are identified as accurately as possible, so that they can be traced effectively in subsequent confidence assessment.

Description

Peak searching algorithm for proteomics series mass spectrogram
Technical Field
The invention relates to the technical field of protein analysis in proteomics, in particular to a peak searching algorithm for a proteomic tandem mass spectrogram.
Background
Proteomics is the science that studies the composition, location, changes and interaction rules of proteins in cells, tissues or organisms. Its development is of great significance for finding diagnostic markers of diseases, screening drug targets, toxicology research and similar work. Protein mass spectrometry can be applied in many fields, such as protein identification, protein quantitative analysis, protein structure identification and proteogenomics, and is an important link in proteomics.
Protein mass spectrometry analysis uses computational methods to match the experimental spectrum obtained for the measured protein against theoretical proteins, and finally determines which proteins are present in the sample under test.
Peak detection is one of the important preprocessing steps in proteomics data analysis based on mass spectrometry (MS), and the peak information in a mass spectrum often represents protein information. Because of the complexity of the signals in an MS spectrum and the multiple sources of noise, a high rate of false-positive peak identification is a major problem. The peak-finding algorithm therefore aims to reduce the noise signals in the mass spectrum and highlight the protein signals, thereby improving protein identification capability during protein mass spectrometry analysis and yielding more accurate identification results.
Currently, most peak detection algorithms identify peaks by searching for local maxima whose local signal-to-noise ratio (SNR) exceeds a certain threshold. The estimate of the SNR typically depends on the peak amplitude relative to the surrounding noise level. However, a high amplitude does not always guarantee a true peak; some noise sources can cause high-amplitude peaks. Conversely, low-amplitude peaks may still be true. To reduce the false-positive rate, peak detection algorithms impose various constraints. Although these constraints reduce the false-positive rate, they also reduce the sensitivity of the method, so that some peaks go undetected.
The existing software and algorithm mainly comprise:
PeakSeeker
PeakSeeker is a comprehensive algorithm for peak detection, peak overlap resolution and charge state assignment in mass spectra. Overlapping peaks are detected by examining the second derivative of the original mass spectrum, and the charge state distribution of the molecular species is determined by fitting a linear combination of charge envelopes to the entire experimental mass spectrum.
PeakSeeker deconvolves overlapping signals by applying a peak detection method based on the second derivative. The second derivative has been widely used for peak detection in chromatography, nuclear magnetic resonance and astronomical spectroscopy. PeakSeeker simulates charge envelopes to best fit the peaks in the mass spectrum. Goodness of fit is determined by a scoring function that combines mass error and intensity error.
Peak detection algorithm based on continuous wavelet transform (CWT)
Pan Du et al. proposed a peak detection algorithm based on the continuous wavelet transform (CWT). By converting the spectrum into wavelet space, it can identify peaks of different scales and amplitudes, which simplifies the pattern-matching problem and also provides a powerful technique for identifying and separating signals from spike noise and colored noise. This transformation, together with the additional information provided by the two-dimensional CWT coefficients, can greatly improve the effective SNR. Furthermore, with this technique no baseline removal or peak smoothing step is required before peak detection, which improves the sensitivity of peak detection under various conditions. CWT-based algorithms can identify both strong and weak peaks while maintaining a low false-positive rate.
SOMMS
SOMMS (Solving complex Macromolecular Mass Spectra) uses Gaussian curve fitting to simulate the hypothetical Mass spectrum of protein (sub) complexes within a given charge state window. In addition, the program can simulate the spectrum of a heterogeneous protein complex using binomial and multinomial distributions, which can calculate the zero charge spectrum and relatively quantify the abundance of each component in the mixture.
SOMMS can be used in two modes. In the first mode, the program uses binomial distributions to calculate the probability of each possible complex being formed. In the second mode, SOMMS can be used to fit the simulated spectrum to the original mass spectrum. In the latter case, the user can optimize the curve fit by shifting the calculated m/z values for each charge state to compensate for at very high m/z values
Automass
Automass uses an intensity-independent method to vary the charge state assignments for a set of peaks and checks the standard deviation of the masses of the minima and the periodicity of such deviations across different charge states. The overlapping signals are not directly evaluated because each peak is assumed to belong to exactly one envelope, and the boundaries between charge envelopes are modeled by game theory based processing.
The Automass algorithm can automatically minimize the standard deviation in a series of correlated ion peaks with different charge numbers. The algorithm assumes that the mass is constant and allows the correct charge state in the sequence of peaks to be determined. When the minimum standard deviation of the charge state sequence is found, the analysis produces a periodic pattern, which can be interpreted as a harmonic oscillator.
Baseline removal and smoothing are two preprocessing steps for MS data, usually performed before peak detection. However, different preprocessing algorithms can severely affect downstream analysis. Moreover, baseline removal and smoothing are irreversible: if a true peak is removed in these preprocessing steps, it can never be recovered in subsequent analysis.
Recent developments in deconvolution algorithms, which assume a constant mass and an ionic charge increment of 1, have largely focused on improved methods of assigning peak membership to the charge state distribution. The detection of individual peaks, which represents the early stage of data analysis, has been somewhat overlooked and has not been tailored to the breadth and heterogeneity of the ion signals observed from protein complexes.
The problem with these algorithms is that they typically require manual intervention or fine tuning of program parameters and only allow evaluation of mass distribution within a typically limited m/z range or require very high mass resolution.
Since these methods rely on local maxima detection, none of these methods can handle more complex overlap situations where the overlap signal causes peak vertex shifts and the underlying signal cannot simply be distributed between the charge envelopes. These conditions are particularly acute for protein mass spectrometry where high mass, multi-component and broadened ion signals can result in peak overlap. Distorted peak shapes can be assigned to erroneous charge envelopes, resulting in inaccurate mass and abundance determinations.
A peak-finding algorithm locates the local high points of a series of continuous or discrete data. The most common raw data in proteomics databases are discrete mass spectra, which contain the mass-to-charge data of the polypeptides produced by the instrument. The high points are usually generated by peptide fragments with sufficiently high ion intensity and concentration; they are the targets of mass spectrometry analysis, and these target data are contaminated by low-intensity noise.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a peak-finding algorithm for proteomic tandem mass spectra, which effectively eliminates noise interference and searches for the peaks of the signal, so as to assist the subsequent identification processes.
The technical solution adopted by the invention is as follows: the peak searching algorithm for the proteomic tandem mass spectrogram comprises the following steps of:
(1) Acquiring a sample, a data set thereof and data preprocessing: obtaining raw data from a database, and extracting an MS1 spectrogram from the raw data;
Before mass spectrum data is acquired with a mass spectrometer, proteins undergo processes such as enzymatic digestion, isotope labeling and phosphorylation. These processes cause problems such as peptide fragment ionization, overlapping peaks and changes in mass-to-charge ratio, and the instrument inevitably generates noise as well. To identify and classify proteins and peptides and to discover new ones, researchers often need to search for peak features in the data and carry out subsequent identification and research according to the mass-to-charge ratios corresponding to the peaks.
The ProteomeXchange Consortium, an open proteomics data-sharing alliance, sets strict and clear file identification requirements for the data submitted by its members, using RESULT-labeled protein identification files, RAW-labeled mass spectrum output files (based on the mzXML or mzML format), PEAK-labeled peak list files (based on the mzIdentML format), SEARCH-labeled search engine output files, and so on. The member databases with the largest contributions are PRIDE, iProX, jPOSTrepo and MassIVE; these 4 main databases will be compared below to provide the reader with a reference for database selection.
PRIDE originates from EMBL (the European Molecular Biology Laboratory) and is a structured, mass-spectrometry-based proteomics database. Researchers can submit and retrieve raw protein mass spectrometry experimental data through several officially provided channels, such as a web page, a client and an API, and PRIDE's data format is the mainstream submission format of the ProteomeXchange Consortium. An officially provided open-source tool, PRIDE Converter, helps researchers convert mass spectrometry data in 15 other XML-based formats to the recommended format. Every submission is assigned a ProteomeXchange identifier (PXD). Currently the data volume is up to 390 PB (over one million GB), with 82 million accesses per day. Of the 4 mainstream databases, PRIDE provides the easiest-to-use tools while having the largest size and the most access requests, and data is fully publicly accessible once submitted.
(2) Data smoothing: performing data smoothing operation on signals of an MS1 spectrogram, and reducing additional interference on data peaks as much as possible while reducing noise interference;
(3) Baseline calibration: performing baseline calibration on the smoothed MS1 spectrogram, correcting data drift and improving quantitative analysis;
(4) Wavelet transformation: performing a wavelet transform on the baseline-corrected MS1 spectrogram so as to highlight the peaks;
(5) Peak value searching: the first derivative of the spectrogram signal is calculated to determine the peak position in MS1.
The data smoothing in the step (2) comprises the following specific steps:
Because actually acquired mass spectrum data often contains a large amount of noise and interference, which distorts the waveform, the smoothing algorithm is used to reduce noise interference while minimizing any additional distortion of the data peaks. The invention adopts a kernel regression algorithm as the smoothing algorithm; the discrete kernel regression formula is shown as formula (1):
$$\hat{y}(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)} \tag{1}$$
where $K_h(x - x_i)$ is a Gaussian kernel, with the expression:
$$K_h(x - x_i) = \frac{1}{\sqrt{2\pi}\, h} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)$$
In the formula, $x_i$ are the observed data points; $x$ is the point at which the kernel function is evaluated, i.e. the point to be predicted; $y_i$ is each data point representing ion abundance; $h$ is the bandwidth, also called the smoothing parameter of the kernel regression.
The essence of the kernel regression algorithm for mass spectrum signals is to use weighting to give the noise as low a weight as possible, thereby reducing the influence of random noise. The whole data set is divided into multiple parts and each part is linearly weighted. The denominator $\sum_{i} K_h(x - x_i)$ sums the weights of all local sample points, so that the total weight is normalized to 1; the numerator $\sum_{i} K_h(x - x_i)\, y_i$ weights each $y_i$ in the part (the ion abundance, i.e. the signal intensity in the mass spectrum). In this way, the influence of noise is reduced when the abundance corresponding to each mass-to-charge ratio is estimated.
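The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes a normalized Gaussian kernel, evaluates the estimate at every observed point, and uses synthetic data. The names `gaussian_kernel`, `kernel_smooth`, `mz`, `raw` and `smoothed` are hypothetical.

```python
import math

def gaussian_kernel(u, h):
    """Gaussian kernel K_h(u) with bandwidth h (normalized form assumed)."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def kernel_smooth(x, y, h):
    """Kernel-regression estimate at each observed x: weighted mean of y."""
    result = []
    for x0 in x:
        weights = [gaussian_kernel(x0 - xi, h) for xi in x]
        total = sum(weights)  # denominator: normalizes the total weight to 1
        result.append(sum(w * yi for w, yi in zip(weights, y)) / total)
    return result

# Synthetic example: one Gaussian peak at index 10 plus alternating "noise".
mz = [float(i) for i in range(21)]
raw = [10.0 * math.exp(-0.5 * ((i - 10) / 2.0) ** 2) + 0.5 * (-1) ** i
       for i in range(21)]
smoothed = kernel_smooth(mz, raw, h=1.0)
peak_position = smoothed.index(max(smoothed))
```

The bandwidth `h` controls the trade-off: a larger value suppresses more noise but broadens the peaks.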
The data smoothing in the step (2) comprises the following specific steps:
The moving average method is adopted as the smoothing algorithm, as in formula (1.1):
$$p_t = \frac{1}{2n+1} \sum_{i=-n}^{n} x_{t+i} \tag{1.1}$$
where $p_t$ represents the filtering result at site $t$, $x_t$ represents the observed value of the signal at site $t$, and $n$ represents the sliding window radius.
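As a small illustration of the moving average with window radius n, the following Python sketch averages the 2n + 1 samples centered on each site. Truncating the window at the two ends of the signal is an assumption made here, since the text does not say how edges are handled; the function name `moving_average` is hypothetical.

```python
def moving_average(x, n):
    """Centered moving average with window radius n (window length 2n + 1).

    Assumption: near the edges the window is truncated to the samples
    that exist, so the output has the same length as the input.
    """
    out = []
    for t in range(len(x)):
        lo = max(0, t - n)
        hi = min(len(x), t + n + 1)
        window = x[lo:hi]
        out.append(sum(window) / len(window))
    return out

# A single sharp peak is spread over its neighbors and loses height.
spike = [0.0, 0.0, 9.0, 0.0, 0.0]
flattened = moving_average(spike, 1)
```

The example shows the peak flattening that motivates the choice of kernel regression as the preferred smoother: the peak of height 9 is reduced to a plateau of height 3.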
The step (2) of data smoothing comprises the following specific steps:
An exponential moving average method is adopted as the smoothing algorithm, as in formula (1.2):
$$p_t = w\, x_{t-1} + (1 - w)\, p_{t-1} \tag{1.2}$$
where $p_t$ represents the predicted value, $w$ represents the attenuation weight, and $x_t$ represents the observed value; the formula is recursive.
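The recursion can be written directly as a loop. The sketch below follows the formula as stated, using the previous observation and the previous prediction; seeding the recursion with the first observation is an assumption, because the text does not give an initial value. The name `exponential_moving_average` is hypothetical.

```python
def exponential_moving_average(x, w):
    """p[t] = w * x[t-1] + (1 - w) * p[t-1], seeded with p[0] = x[0] (assumed)."""
    p = [x[0]]
    for t in range(1, len(x)):
        p.append(w * x[t - 1] + (1.0 - w) * p[t - 1])
    return p

# The spike observed at index 2 shows up in the prediction at index 3.
ema = exponential_moving_average([1.0, 1.0, 5.0, 1.0], 0.5)
```

Because each prediction uses only past values, the smoothed signal lags the observations by one site, which is worth keeping in mind when peak positions matter.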
The step (2) of data smoothing comprises the following specific steps:
The SG filtering method is adopted as the smoothing algorithm, as in formula (1.3):
$$x_t = a_0 + a_1 t + a_2 t^2 + \dots + a_{k-1} t^{k-1} \tag{1.3}$$
where $a_n$ is the coefficient of the $n$-th term, typically determined using the least squares method; $t$ represents a signal site; $x_t$ represents the predicted value at site $t$.
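For a fixed window and polynomial degree, the least squares fit evaluated at the window center reduces to a fixed set of weights. The sketch below uses the standard weights for a 5-point window and a quadratic polynomial, (-3, 12, 17, 12, -3)/35; this window size and degree are chosen only for illustration and are not specified in the text. Leaving the two samples at each edge unchanged is likewise an assumption.

```python
# Standard Savitzky-Golay smoothing weights: 5-point window, quadratic fit.
SG_WEIGHTS = [-3.0 / 35.0, 12.0 / 35.0, 17.0 / 35.0, 12.0 / 35.0, -3.0 / 35.0]

def savitzky_golay_5pt(x):
    """Smooth x with a 5-point quadratic Savitzky-Golay filter.

    Assumption: the first and last two samples are copied unchanged,
    because the window does not fit there.
    """
    out = list(x)
    for t in range(2, len(x) - 2):
        out[t] = sum(w * x[t + k - 2] for k, w in enumerate(SG_WEIGHTS))
    return out

# A quadratic signal is reproduced exactly, since the local fit is quadratic.
parabola = [float(t * t) for t in range(7)]
sg_smoothed = savitzky_golay_5pt(parabola)
```

Because the local polynomial can follow curvature, this filter tends to preserve peak height better than a plain moving average of the same width.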
The specific steps of the baseline calibration in step (3) are as follows:
Since the instrument that acquires mass spectrum data is not perfect, even in the absence of a signal the dark current of the instrument and the charge exchange of ionized substances can cause strong deviations in the peak values. Mathematical methods are therefore needed to correct the data, eliminate spikes, and adjust the baseline. The invention adopts an adaptive least squares method as the baseline correction algorithm. It first defines the squared difference between the expected value and the actual values, and takes the expected value that minimizes this squared difference as the true value. The mathematical expression is formula (2):
$$S = \sum_{i} w_i\, (y - y_i)^2 \tag{2}$$
where $y_i$ is each data point representing ion abundance and $y$ is the estimate. The adaptive least squares method uses the computing power of the computer to iterate on $w_i$ automatically, finding the weights $w_i$ and the expected value $y$ that minimize $S$. This improves the performance of least squares, which is mainly suited to linear scenarios, when the baseline drifts.
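The text does not state how the weights are updated, so the following Python sketch is only one possible reading of "iterating on the weights". It assumes a straight-line baseline and an asymmetric update rule in which points lying above the current fit (likely peaks) receive a small weight; both choices, and the names `weighted_line_fit`, `adaptive_baseline` and `peak_weight`, are assumptions for illustration.

```python
def weighted_line_fit(x, y, w):
    """Minimize sum(w_i * (a*x_i + b - y_i)**2); return slope a, intercept b."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def adaptive_baseline(x, y, n_iter=20, peak_weight=1e-3):
    """Iteratively reweighted straight-line baseline (illustrative rule)."""
    w = [1.0] * len(y)
    baseline = list(y)
    for _ in range(n_iter):
        slope, intercept = weighted_line_fit(x, y, w)
        baseline = [slope * xi + intercept for xi in x]
        # Assumed update: down-weight points above the fit (likely peaks).
        new_w = [peak_weight if yi > bi else 1.0 for yi, bi in zip(y, baseline)]
        if new_w == w:  # weights stopped changing: converged
            break
        w = new_w
    return baseline

# Synthetic drifted spectrum: baseline 0.5*x + 2 with two peaks on top.
xs = [float(i) for i in range(20)]
ys = [0.5 * xi + 2.0 for xi in xs]
ys[5] += 10.0
ys[12] += 8.0
estimated = adaptive_baseline(xs, ys)
corrected = [yi - bi for yi, bi in zip(ys, estimated)]
```

On this synthetic input the weights settle after two passes, which is consistent with the fast convergence the text describes, although real spectra with curved baselines would need a more flexible baseline model.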
The specific steps of the baseline calibration in step (3) are as follows:
The least squares method is used as the baseline correction algorithm, as in formula (2.1):
$$S = \sum_{i} (y - y_i)^2 \tag{2.1}$$
where $S$ represents the objective function, $y$ represents the theoretical value, and $y_i$ represents the observed value.
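As a short worked step, and assuming for illustration that $y$ is a single constant fitted to $N$ observations, setting the derivative of the objective to zero shows that the least squares estimate is the mean of the observations:

```latex
\frac{\mathrm{d}S}{\mathrm{d}y} = 2 \sum_{i=1}^{N} (y - y_i) = 0
\quad\Longrightarrow\quad
y = \frac{1}{N} \sum_{i=1}^{N} y_i
```

Under the same assumption, the weighted objective $S = \sum_i w_i (y - y_i)^2$ is minimized by the weighted mean $y = \sum_i w_i y_i / \sum_i w_i$, which is why lowering the weights of peak points pulls the estimate toward the baseline.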
The specific steps of the wavelet transformation in the step (4) are as follows:
Considering that peak widths in practical mass spectra are often very large, we use the wavelet transform, which analyzes the data locally and is a good peak-finding method. During data processing, the wavelet transform algorithm treats the data as a number of wavelets (functions that decay quickly to 0 within a limited interval), so that the signal can be regarded as a combination of wavelets after scaling, shifting and similar operations. Even for discrete data such as mass spectra, a one-dimensional continuous wavelet algorithm can be adopted, with the data treated as continuous during processing. The invention adopts the wavelet transform as the waveform transformation method; the mathematical expression is formula (3):
$$W(a, b) = \int_{-\infty}^{+\infty} x(t)\, \psi_{a,b}(t)\, \mathrm{d}t \tag{3}$$
where
$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \psi\left(\frac{t - b}{a}\right)$$
Here $t$ is the site where the signal occurs, $x(t)$ is the signal strength, $a$ is the scale, $b$ is the amount of translation, and $\psi$ is the mother wavelet function used for translation and scaling.
When the wavelet transform is applied to spectrogram signals, the independent variable $t$ represents the mass-to-charge ratio and $x(t)$ represents the ion abundance, i.e. the signal intensity. $a$ is the sampling interval on the abscissa, and the algorithm performs the wavelet transform within each interval on this basis; $b$ is a parameter that must be set in the actual calculation and determines how wide a stretch of signal is treated as a combination of wavelets when the peak-finding algorithm is applied.
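A direct discretization of the transform can be sketched as follows. The text does not name the mother wavelet, so the Ricker ("Mexican hat") wavelet is assumed here, sites are taken as integer sample indices with unit spacing, and the test signal is synthetic. The names `ricker`, `cwt`, `spectrum` and `coeffs` are hypothetical.

```python
import math

def ricker(u):
    """Ricker (Mexican hat) mother wavelet; an assumed choice of psi."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - u * u) * math.exp(-0.5 * u * u)

def cwt(signal, scales):
    """Discrete approximation of W(a, b) for unit-spaced samples.

    Returns one row of coefficients per scale a; column b is the translation.
    """
    n = len(signal)
    rows = []
    for a in scales:
        row = []
        for b in range(n):
            acc = 0.0
            for t in range(n):
                acc += signal[t] * ricker((t - b) / a)
            row.append(acc / math.sqrt(a))
        rows.append(row)
    return rows

# Synthetic baseline-corrected spectrum: a large peak at 30, a small one at 45.
spectrum = [math.exp(-0.5 * ((t - 30) / 3.0) ** 2)
            + 0.4 * math.exp(-0.5 * ((t - 45) / 3.0) ** 2)
            for t in range(64)]
coeffs = cwt(spectrum, scales=[2.0, 4.0])
strongest = max(range(len(spectrum)), key=lambda b: coeffs[0][b])
```

This triple loop is written for clarity rather than speed; a practical implementation would use convolution, and the choice of scales has the effect on small and large peaks discussed for FIG. 4 and FIG. 5.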
The specific steps of the wavelet transformation in the step (4) are as follows:
The Fourier transform is used as the waveform transformation method, as in formula (3.1):
$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos(n \omega x) + b_n \sin(n \omega x) \right] \tag{3.1}$$
where $f(x)$ is the objective function, $a_n$ is the coefficient of the $(n+1)$-th cosine term, $b_n$ is the coefficient of the $(n+1)$-th sine term, and $\omega$ represents the frequency.
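To make the coefficients concrete, the sketch below estimates $a_n$ and $b_n$ from equally spaced samples covering exactly one period. It assumes the common convention in which the constant term of the series is $a_0/2$; the function name `fourier_coefficients` and the test signal are hypothetical.

```python
import math

def fourier_coefficients(samples, n_terms):
    """Estimate a_n and b_n (n = 0..n_terms) from one period of samples.

    Assumes N equally spaced samples over exactly one period and the
    convention f(x) = a_0/2 + sum(a_n cos(n w x) + b_n sin(n w x)).
    """
    count = len(samples)
    a, b = [], []
    for n in range(n_terms + 1):
        a.append(2.0 / count * sum(
            s * math.cos(2.0 * math.pi * n * k / count)
            for k, s in enumerate(samples)))
        b.append(2.0 / count * sum(
            s * math.sin(2.0 * math.pi * n * k / count)
            for k, s in enumerate(samples)))
    return a, b

# Test signal: constant 1, a cosine of order 2 (amplitude 3), a sine of order 1.
N = 64
wave = [1.0 + 3.0 * math.cos(2.0 * 2.0 * math.pi * k / N)
        + math.sin(2.0 * math.pi * k / N) for k in range(N)]
a_coef, b_coef = fourier_coefficients(wave, 3)
```

Unlike wavelet coefficients, each of these coefficients depends on the whole signal, so they carry no information about where along the m/z axis a feature occurs.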
The beneficial effects of the invention are as follows. The invention provides a peak-finding algorithm for proteomic tandem mass spectra that is divided into three parts. Kernel regression is used for data smoothing, which avoids the peak flattening caused by a moving-average algorithm and performs well on adjacent peaks and wide peaks. To address the baseline drift caused by factors such as the instrument, an adaptive least squares method is adopted; the algorithm converges very quickly in iteration, handles the baseline problem well, and offsets the computational cost of the data smoothing stage. Finally, for peak searching, a one-dimensional continuous wavelet transform is used: the mass spectrum is treated as a combination of wavelets and processed locally, which fits the shape of the peaks in the mass spectrum well and is very fast to compute. Noise interference is reduced as far as possible during peak finding, and the peaks of the processed peptide fragments are identified as accurately as possible, so that they can be traced effectively in subsequent confidence assessment.
Drawings
FIG. 1 is a schematic flow chart of the peak finding algorithm of the present invention.
FIG. 2 is a representation of the kernel regression algorithm in smoothing data.
FIG. 3 is a graph of the baseline correction effect of the adaptive least squares method.
FIG. 4 is a representation of the continuous wavelet algorithm on a narrower wavelet.
FIG. 5 is a representation of the continuous wavelet algorithm on a wider wavelet.
FIG. 6 is an MS1 spectrum from the PXD029773 data set.
FIG. 7 is the result of the peak-finding algorithm on the MS1 spectrum from the PXD029773 data set.
FIG. 8 is an MS1 spectrum from the PXD028735 data set.
FIG. 9 is the result of the peak-finding algorithm on the MS1 spectrum from the PXD028735 data set.
Detailed Description
The present invention is further described below in conjunction with the following examples and figures, with the understanding that the figures and the following examples are intended to illustrate, but not limit the invention.
Example 1:
Step one: obtaining a sample and its data set
Data set 1: PXD029773
In acquiring the data of PXD029773 [5], the proteins in each cell line were acquired by two different data acquisition methods. The first is data-dependent acquisition with parallel accumulation-serial fragmentation (DDA-PASEF) and the second is parallel accumulation-serial fragmentation combined with data-independent acquisition (diaPASEF). Protein assembly was performed after searching the Swiss-Prot Human database, the DDA dataset and the Spectronaut dataset using Peaks Studio. The results of the assembly contained the identified PSMs, peptides and proteins that were above the threshold for each HeLa and SiHa sample. In terms of coverage, approximately 6,090 and 7,298 proteins were quantified by DDA-PASEF for the HeLa and SiHa samples, while 13,339 and 8,773 proteins were quantified by diaPASEF for the HeLa and SiHa samples, respectively. In terms of consistency, the proportion of missing values for diaPASEF (about 2%) was lower than for the DDA counterparts (about 5-7%).
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the iProX repository, with the data set identifier PXD029773.
The data set may be obtained by the following web page links:
http://proteomecentral.proteomexchange.org/cgi/GetDatasetID=PXD029773
Data set 2: PXD028735
For the experimental data, a comprehensive LC-MS/MS data set was generated using samples consisting of whole-proteome digests of human K562, yeast and Escherichia coli (E. coli) [6]. The two hybrid proteome samples A and B contained known amounts of human, yeast and E. coli tryptic peptides as described by Navarro et al. Three preparations were made in succession to include process variability. Furthermore, QC samples were created by mixing one sixth of each of the six master batches (65% w/w human, 22.5% w/w yeast and 12.5%. These commercial lysates were measured individually and as triple hybrid proteome mixtures using six DDA and DIA acquisition methods available on LC-MS/MS platforms, namely the SCIEX TripleTOF 5600 and TripleTOF 6600+, Thermo Orbitrap QE HF-X, Waters Synapt G2-Si and Synapt XS, and Bruker timsTOF Pro.
The complete data set is publicly provided to the proteomics community by ProteomeXchange, with data set identifiers: PXD028735.
The data set may be obtained by the following web page links:
https://www.ebi.ac.uk/pride/archive/projects/PXD028735
The specific logical order of data acquisition is: read the download address, determine whether it is a PRIDE repository address, obtain the PRIDE PXD identifier, rewrite the download address as an executable FTP download link, and download the files marked as mzML to a folder named after the PXD identifier.
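The first steps of this logic can be sketched with the Python standard library. The helper names `is_pride_address` and `extract_pxd_id`, the substring check for PRIDE and the six-digit identifier pattern are assumptions for illustration. Building the FTP link and downloading the files are deliberately left out, because the text does not give the FTP path layout.

```python
import re

# Assumed pattern: "PXD" followed by six digits, as in the two data sets used.
PXD_PATTERN = re.compile(r"PXD\d{6}")

def is_pride_address(address):
    """Heuristic check (an assumption) that the address points to PRIDE."""
    return "ebi.ac.uk/pride" in address

def extract_pxd_id(address):
    """Return the PXD identifier embedded in a download address."""
    match = PXD_PATTERN.search(address)
    if match is None:
        raise ValueError("no PXD identifier found in: " + address)
    return match.group(0)

pride_url = "https://www.ebi.ac.uk/pride/archive/projects/PXD028735"
central_url = "http://proteomecentral.proteomexchange.org/cgi/GetDatasetID=PXD029773"

# The identifier doubles as the name of the local download folder.
folder_name = extract_pxd_id(pride_url)
other_id = extract_pxd_id(central_url)
```

Raising an error for an address without an identifier keeps a malformed link from silently producing an empty or misnamed folder.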
Step two: constructing a model:
1. Data smoothing
For data smoothing, the kernel regression algorithm used in the present application is formula (1). The data smoothing example uses simulated data, with the test result shown in FIG. 2: green is the assumed target mass spectrum, orange is the simulated actual signal, and red is the smoothed signal. It can be seen that smoothing the signal in this way separates the peak signals well in regions where peaks overlap, without flattening the peaks.
2. Baseline calibration
For baseline calibration, the method used in the present application is a weighted least squares method, as in formula (2). The baseline calibration example uses data from PRIDE, and the effect of baseline correction with the adaptive least squares method is shown in FIG. 3: red is the acquired mass spectral signal and blue is the result after baseline correction. It can be seen that the method effectively eliminates baseline drift. Performance is also good: in a test on a Zen 3 5900X CPU, baseline correction of 180,000 data points took less than 3 seconds.
3. Wavelet transform
For the wavelet transform, the method used in this application is a one-dimensional continuous wavelet algorithm, as in formula (3). Tests show that a wavelet that is too wide omits detailed information, while a wavelet that is too narrow needs sufficient data. The results are shown in FIG. 4 and FIG. 5, where FIG. 4 uses a narrower wavelet of (1, 10) and FIG. 5 uses a wider wavelet of (10, 100). The test with the narrow wavelet identified more small peaks, while the test with the wide wavelet agreed with the former on large peaks but ignored many small peaks. Clearly, when the continuous wavelet transform is adopted, close attention must be paid to the width setting of the wavelet, and quantitative analysis of mass spectra can depend heavily on parameter-tuning experience.
4. Peak finding
For the spectrogram after data processing, the peak search uses the fact that the first derivative of a peak has a downward zero crossing at the peak maximum, which can be used to locate the m/z value of the peak. If there is no noise in the signal, any data point with lower values on both sides is a peak maximum.
The first derivative of the signal is the rate of change of y with x, dy/dx, which is interpreted as the slope of the tangent to the signal at each point. Assuming that the x-interval between neighboring points is constant, the simplest algorithm for computing the first derivative is:
$$X'_j = X_j, \qquad Y'_j = \frac{Y_{j+1} - Y_{j-1}}{2\, \Delta X}, \qquad 1 < j < n$$
where $X'_j$ and $Y'_j$ are the X and Y values of the $j$-th point of the derivative, $n$ is the number of points in the signal, and $\Delta X$ is the difference between the X values of adjacent data points. This is called the central difference method; its advantage is that it does not shift the x-axis position of the derivative. A gap-segment derivative may also be calculated, in which the x-axis spacing between the points in the above expression is greater than 1, for example $Y_{j-2}$ and $Y_{j+2}$, or $Y_{j-3}$ and $Y_{j+3}$, and so on. It turns out that this is equivalent to applying a moving average (rectangular) smooth in addition to the derivative.
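The zero-crossing search can be sketched as follows. The central difference is used at interior points; using one-sided differences at the two end points, and reporting the higher of the two samples that bracket a crossing, are assumptions added here to make the sketch complete. The names `first_derivative`, `find_peaks` and `intensities` are hypothetical.

```python
def first_derivative(y, dx):
    """Central difference at interior points; one-sided at the ends (assumed)."""
    n = len(y)
    d = [0.0] * n
    for j in range(1, n - 1):
        d[j] = (y[j + 1] - y[j - 1]) / (2.0 * dx)
    d[0] = (y[1] - y[0]) / dx
    d[n - 1] = (y[n - 1] - y[n - 2]) / dx
    return d

def find_peaks(y, dx):
    """Indices where the first derivative crosses zero going downward."""
    d = first_derivative(y, dx)
    peaks = []
    for j in range(len(y) - 1):
        if d[j] > 0.0 and d[j + 1] <= 0.0:
            # The maximum is whichever bracketing sample is higher.
            peaks.append(j if y[j] >= y[j + 1] else j + 1)
    return peaks

# Synthetic intensities with maxima at indices 3 and 8.
intensities = [0.0, 1.0, 3.0, 7.0, 4.0, 2.0, 1.0, 2.0, 6.0, 3.0, 0.0]
peak_indices = find_peaks(intensities, dx=1.0)

# Keep at most the ten most intense peaks, strongest first.
top_peaks = sorted(peak_indices, key=lambda j: intensities[j], reverse=True)[:10]
```

On noisy data every small fluctuation produces a crossing, which is why the smoothing, baseline correction and wavelet steps come before this search.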
FIG. 6 is a first order mass spectrum (MS 1) from data set PXD029773, which we have performed peak finding by peak finding algorithm and the location of the ten peaks with the highest peak signal intensity is marked on FIG. 7.
Fig. 8 is a first order mass spectrum (MS 1) from the data set PXD028735 for which we have performed peak finding using a peak finding algorithm and plotted on fig. 9 the position of the ten peaks with the highest peak signal intensity.
The performance is as follows:
The peak-finding algorithm is divided into three parts. Kernel regression is used for data smoothing, which avoids the peak flattening caused by a moving-average algorithm. To address the baseline drift caused by factors such as the instrument, an adaptive least squares method is adopted; the algorithm converges very quickly in iteration, handles the baseline problem well, and offsets the computational cost of the data smoothing stage. Finally, for peak searching, a one-dimensional continuous wavelet transform is used: the mass spectrum is treated as a combination of wavelets and processed locally, which matches the shape of the peaks in the mass spectrum well and is very fast to compute.
Example 2:
1. data smoothing:
alternatives are the moving average method, the exponential moving average method, the SG Filter method (Savitzky Golay Filter).
The moving average method is as follows:
Figure BDA0003788918320000121
wherein p is t Representing the result of the filtering at time t, x t Represents the observed value of the t-site signal and n represents the sliding window radius.
The formula of the exponential moving average method is as follows:
p_t = w * x_{t-1} + (1 - w) * p_{t-1}
where p_t is the predicted value, w is the attenuation weight, and x_t is the observed value; the formula is recursive.
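A minimal sketch of this recursion is given below. The recursion needs a starting value that the text does not specify; seeding p_0 with the first observation is an assumption, as is the function name.

```python
def exponential_moving_average(x, w):
    """Recursive smoothing p_t = w * x_{t-1} + (1 - w) * p_{t-1}.

    Assumption: the recursion is seeded with p_0 = x_0.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if not x:
        return []
    p = [x[0]]
    for t in range(1, len(x)):
        p.append(w * x[t - 1] + (1.0 - w) * p[t - 1])
    return p


predicted = exponential_moving_average([1.0, 3.0, 5.0], 0.5)
```

Because p_t depends on x_{t-1} rather than x_t, the output lags the signal by one sample, which is worth keeping in mind when reading peak positions from the smoothed trace.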
The formula of the SG filtering method is as follows:
x_t = a_0 + a_1 * t + a_2 * t^2 + … + a_{k-1} * t^{k-1}
where a_n is the coefficient of the n-th term, typically determined by the least squares method; t is the signal site; and x_t is the predicted value at site t.
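For a fixed window and polynomial order, the least squares fit of the SG method reduces to a convolution with constant coefficients. The sketch below uses the well-known 5-point quadratic smoothing coefficients (-3, 12, 17, 12, -3)/35; the window size, the polynomial order, and the choice to leave the two points at each end unchanged are assumptions made for illustration.

```python
# 5-point Savitzky-Golay smoothing coefficients for a quadratic fit,
# to be divided by the normalisation factor 35.
SG5_QUADRATIC = (-3.0, 12.0, 17.0, 12.0, -3.0)


def savgol_5pt(x):
    """Smooth x with a 5-point quadratic Savitzky-Golay filter.

    Assumption: the first two and last two samples are returned
    unchanged because no full window is available there.
    """
    out = list(x)
    for t in range(2, len(x) - 2):
        acc = sum(c * x[t - 2 + k] for k, c in enumerate(SG5_QUADRATIC))
        out[t] = acc / 35.0
    return out


# A quadratic signal is reproduced exactly by a quadratic fit.
quadratic = [float(t * t) for t in range(7)]
smoothed = savgol_5pt(quadratic)
```

Since the local polynomial follows the curvature of a peak, this filter tends to preserve peak height better than a moving average of the same width.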
2. Baseline calibration
An alternative to baseline calibration is the least squares method, with the formula:
S = Σ (y - y_i)^2
where S is the objective function, y is the theoretical value, and y_i is the observed value.
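As one possible instance of this objective, the sketch below fits a straight-line baseline by ordinary least squares and subtracts it. The text does not state the form of the theoretical value y; the straight-line model and the function names are assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares line: minimises S = sum((a*x_i + b - y_i)^2)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def subtract_linear_baseline(x, y):
    """Return y with the fitted straight-line baseline removed."""
    slope, intercept = fit_line(x, y)
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With equal weights every point pulls on the fit, including the peaks themselves, so the estimated baseline is biased upward wherever peaks are present.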
3. Wavelet transform
An alternative to the wavelet transform is the Fourier transform, with the formula:
Figure BDA0003788918320000131
where f(x) is the objective function, a_n is the coefficient of the (n+1)-th cosine term, b_n is the coefficient of the (n+1)-th sine term,
Figure BDA0003788918320000132
representing the frequency.
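The cosine and sine coefficients of a sampled signal can be estimated numerically as sketched below. The formula image is not reproduced in this text, so the convention used here is an assumption: the signal covers exactly one period, the constant term is the mean, and the k-th coefficients are (2/N) times the sums of f_j·cos(2πkj/N) and f_j·sin(2πkj/N). The function name is hypothetical.

```python
import math


def fourier_coefficients(samples, k):
    """Return (a_k, b_k) for a signal sampled at N equally spaced points
    over one period. Valid for 0 <= k < N / 2.

    Assumed convention: f ~ a_0 + sum_k (a_k cos(k w x) + b_k sin(k w x)),
    so a_0 is the mean of the samples and b_0 is zero.
    """
    n = len(samples)
    if k == 0:
        return sum(samples) / n, 0.0
    a = 2.0 / n * sum(s * math.cos(2.0 * math.pi * k * j / n)
                      for j, s in enumerate(samples))
    b = 2.0 / n * sum(s * math.sin(2.0 * math.pi * k * j / n)
                      for j, s in enumerate(samples))
    return a, b


# Test signal: offset 2, cosine of amplitude 3 at k = 1, sine of 0.5 at k = 3.
N = 16
signal = [2.0 + 3.0 * math.cos(2.0 * math.pi * j / N)
          + 0.5 * math.sin(2.0 * math.pi * 3 * j / N) for j in range(N)]
a1, b1 = fourier_coefficients(signal, 1)
```

Unlike a wavelet coefficient, each Fourier coefficient summarises the whole signal, so it carries no information about where along the axis a peak lies.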
Conclusion
1. The application uses kernel regression for data smoothing, which avoids the peak flattening caused by the moving average algorithm.
2. The application uses an adaptive least squares method, which handles the baseline problem well and offsets the computational cost of the data smoothing stage.
3. The application searches for peaks with a one-dimensional continuous wavelet transform, treating the mass spectrum as a combination of wavelets and processing it locally.
Those skilled in the art should understand that, although the invention has been described in terms of the above specific embodiments, the inventive concept is not limited thereto, and any modification applying the inventive concept is intended to be included within the scope of the patent claims.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (9)

1. The peak searching algorithm for the proteomic tandem mass spectrogram is characterized by comprising the following steps of:
(1) Obtaining a sample and a data set thereof and data preprocessing: obtaining raw data from a database, and extracting an MS1 spectrogram from the raw data;
(2) Data smoothing: performing a data smoothing operation on the signals of the MS1 spectrogram, reducing noise interference while disturbing the data peaks as little as possible;
(3) Baseline calibration: performing baseline calibration on the smoothed MS1 spectrogram, correcting data drift and improving quantitative analysis;
(4) Wavelet transformation: performing a wavelet transform on the baseline-corrected MS1 spectrogram so as to highlight the peaks;
(5) Peak value searching: the first derivative of the spectrogram signal is calculated to determine the peak position in MS1.
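Step (5) can be illustrated with the sketch below, which locates peaks where the first difference of the signal changes sign from positive to negative and then keeps the most intense ones. It is a simplified sketch, not the claimed implementation; the function names and the parameter `k` are hypothetical, and flat-topped peaks are not detected by the strict sign test.

```python
def find_peaks_by_derivative(y):
    """Indices where the first difference goes from positive to negative."""
    diffs = [y[i + 1] - y[i] for i in range(len(y) - 1)]
    return [i + 1 for i in range(len(diffs) - 1)
            if diffs[i] > 0 and diffs[i + 1] < 0]


def top_k_peaks(y, k):
    """The k detected peaks with the highest signal intensity."""
    peaks = find_peaks_by_derivative(y)
    return sorted(peaks, key=lambda i: y[i], reverse=True)[:k]


signal = [0.0, 1.0, 0.0, 2.0, 5.0, 2.0, 0.0, 3.0, 0.0]
peaks = find_peaks_by_derivative(signal)
strongest = top_k_peaks(signal, 2)
```

On raw data every noise spike produces a sign change, which is why the smoothing, baseline calibration, and wavelet steps come first.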
2. The proteomic tandem mass spectrum peak finding algorithm as claimed in claim 1, wherein the step (2) of data smoothing comprises the following specific steps:
a kernel regression algorithm is adopted as the smoothing algorithm, and the discrete kernel regression formula is shown in formula (1):
Figure FDA0003788918310000011
where K_h(x - x_i) is a Gaussian kernel, expressed as follows:
Figure FDA0003788918310000012
In the formula, x_i are the observed data points; x is the point at which the kernel function is evaluated, i.e. the point to be predicted; y_i is the value of each data point, representing ion abundance; and h is the bandwidth, also called the smoothing parameter of the kernel regression.
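Kernel regression with a Gaussian kernel can be sketched as follows. The formula images are not reproduced in this text, so the sketch assumes the standard Nadaraya-Watson form, in which the prediction at x is the sum of K_h(x - x_i)·y_i divided by the sum of K_h(x - x_i). The function names are hypothetical.

```python
import math


def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h, evaluated at distance u."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def kernel_regression(x_obs, y_obs, x_query, h):
    """Nadaraya-Watson estimate at x_query (assumed form of formula (1))."""
    weights = [gaussian_kernel(x_query - xi, h) for xi in x_obs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query point too far from the data for bandwidth h")
    return sum(w * yi for w, yi in zip(weights, y_obs)) / total


# Smooth a small spectrum segment at each of its own m/z positions.
mz = [-1.0, 0.0, 1.0]
abundance = [0.0, 3.0, 0.0]
smoothed = [kernel_regression(mz, abundance, q, h=1.0) for q in mz]
```

A smaller bandwidth h keeps peaks sharper but removes less noise; a larger h does the opposite.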
3. The proteomic tandem mass spectrogram peak finding algorithm of claim 1, wherein the step (2) of data smoothing comprises the following specific steps:
the moving average method is adopted as the smoothing algorithm, as shown in formula (1.1):
Figure FDA0003788918310000013
where p_t is the filtering result at time t, x_t is the observed value of the signal at site t, and n is the sliding window radius.
4. The proteomic tandem mass spectrogram peak finding algorithm of claim 1, wherein the step (2) of data smoothing comprises the following specific steps:
an exponential moving average method is adopted as the smoothing algorithm, as shown in formula (1.2):
p_t = w * x_{t-1} + (1 - w) * p_{t-1}; (1.2)
where p_t is the predicted value, w is the attenuation weight, and x_t is the observed value; the formula is recursive.
5. The proteomic tandem mass spectrogram peak finding algorithm of claim 1, wherein the step (2) of data smoothing comprises the following specific steps:
the SG filtering method is adopted as the smoothing algorithm, as shown in formula (1.3):
x_t = a_0 + a_1 * t + a_2 * t^2 + … + a_{k-1} * t^{k-1}; (1.3)
where a_n is the coefficient of the n-th term, typically determined by the least squares method; t is the signal site; and x_t is the predicted value at site t.
6. The proteomic tandem mass spectrogram peak finding algorithm as claimed in claim 1, wherein the step (3) of baseline calibration comprises the following specific steps:
an adaptive least squares method is adopted as the baseline correction algorithm: the squared difference between the expected value and the actual value is defined, and the expected value that minimizes this squared difference is taken as the true value; the specific mathematical expression is shown in formula (2):
S = Σ w_i (y - y_i)^2; (2)
where y_i is the value of each data point, representing ion abundance, and y is the estimate.
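The weighted objective of formula (2) can be illustrated with an iteratively reweighted fit. The sketch below is a simplified stand-in and not the claimed procedure, whose details are not given in this section: it assumes a straight-line baseline and an asymmetric weighting rule in which points above the current fit (likely peaks) receive a small weight `p` and points below receive `1 - p`. The names `adaptive_baseline`, `p`, and `max_iter` are hypothetical.

```python
def weighted_line_fit(x, y, w):
    """Line minimising S = sum(w_i * (a*x_i + b - y_i)^2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def adaptive_baseline(x, y, p=0.001, max_iter=50):
    """Iteratively reweighted straight-line baseline (illustrative only)."""
    w = [1.0] * len(y)
    fit = list(y)
    for _ in range(max_iter):
        slope, intercept = weighted_line_fit(x, y, w)
        fit = [slope * xi + intercept for xi in x]
        # Assumed rule: down-weight points that lie above the current fit.
        new_w = [p if yi > fi else 1.0 - p for yi, fi in zip(y, fit)]
        if new_w == w:  # weights stable, so the fit has converged
            break
        w = new_w
    return fit


# Linear drift 1 + 0.1*x with one peak of height 10 in the middle.
x = [float(i) for i in range(11)]
y = [1.0 + 0.1 * xi for xi in x]
y[5] += 10.0
baseline = adaptive_baseline(x, y)
```

In this example the weights stabilise after the second fit, which is consistent with the fast convergence the description attributes to the adaptive method.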
7. The proteomic tandem mass spectrogram peak finding algorithm as claimed in claim 1, wherein the step (3) of baseline calibration comprises the following specific steps:
the least squares method is adopted as the baseline correction algorithm, as shown in formula (2.1):
S = Σ (y - y_i)^2; (2.1)
where S is the objective function, y is the theoretical value, and y_i is the observed value.
8. The proteomic tandem mass spectrogram peak finding algorithm as claimed in claim 1, wherein the wavelet transformation of step (4) comprises the following steps:
a wavelet transform is adopted as the waveform transformation method; the specific mathematical expression is shown in formula (3):
Figure FDA0003788918310000031
wherein
Figure FDA0003788918310000032
where t is the site at which the signal occurs, x(t) is the signal strength, a is the scale, b is the translation, and
Figure FDA0003788918310000033
is the mother wavelet function for translation and scaling;
when the wavelet transform is applied to spectrogram signals, the independent variable t represents the mass-to-charge ratio and x(t) represents the ion abundance, i.e. the signal intensity; a is the sampling interval on the abscissa, and the algorithm performs the wavelet transform within each such interval; b is a parameter that must be set in the actual calculation and determines how wide a stretch of signal is treated as a combination of wavelets when the peak-finding algorithm is applied.
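A one-dimensional continuous wavelet transform at a single scale can be sketched as below. The formula images are not reproduced in this text, so the sketch assumes the standard definition, in which the coefficient at scale a and translation b is (1/√a) times the sum over t of x(t)·ψ((t - b)/a). The choice of the Ricker (Mexican hat) function as mother wavelet is an assumption, since this section does not name one, and the function names are hypothetical.

```python
import math


def ricker(u):
    """Ricker (Mexican hat) mother wavelet, up to a constant factor."""
    return (1.0 - u * u) * math.exp(-0.5 * u * u)


def cwt_row(signal, a):
    """CWT coefficients of `signal` at scale a for every translation b.

    Direct O(N^2) summation over sample indices, intended for short
    signals and illustration only.
    """
    n = len(signal)
    row = []
    for b in range(n):
        acc = 0.0
        for t in range(n):
            acc += signal[t] * ricker((t - b) / a)
        row.append(acc / math.sqrt(a))
    return row


# A Gaussian-shaped peak centred at index 30.
peak = [math.exp(-0.5 * ((t - 30) / 2.0) ** 2) for t in range(61)]
coeffs = cwt_row(peak, 2.0)
strongest_b = max(range(len(coeffs)), key=lambda b: coeffs[b])
```

Because the Ricker wavelet has zero mean, a slowly varying offset contributes little to the coefficients, while a peak whose width matches the scale gives a strong response at its centre.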
9. The proteomic tandem mass spectrogram peak finding algorithm as claimed in claim 1, wherein the wavelet transformation of step (4) comprises the following steps:
the Fourier transform is adopted as the waveform transformation method, as shown in formula (3.1):
Figure FDA0003788918310000034
where f(x) is the objective function, a_n is the coefficient of the (n+1)-th cosine term, b_n is the coefficient of the (n+1)-th sine term,
Figure FDA0003788918310000035
representing the frequency.
CN202210953144.3A 2022-08-09 2022-08-09 Peak searching algorithm for proteomics series mass spectrogram Pending CN115359847A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210953144.3A CN115359847A (en) 2022-08-09 2022-08-09 Peak searching algorithm for proteomics series mass spectrogram

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210953144.3A CN115359847A (en) 2022-08-09 2022-08-09 Peak searching algorithm for proteomics series mass spectrogram

Publications (1)

Publication Number Publication Date
CN115359847A true CN115359847A (en) 2022-11-18

Family

ID=84001103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210953144.3A Pending CN115359847A (en) 2022-08-09 2022-08-09 Peak searching algorithm for proteomics series mass spectrogram

Country Status (1)

Country Link
CN (1) CN115359847A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304259A (en) * 2023-05-24 2023-06-23 药融云数字科技(成都)有限公司 Spectrogram data matching retrieval method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination