CN101055559A - Noise baseline identification method in mass spectrum data processing - Google Patents


Info

Publication number
CN101055559A
Authority
CN
China
Prior art keywords
noise
peak
spectrum peak
intensity
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2006100721693A
Other languages
Chinese (zh)
Other versions
CN100483394C (en)
Inventor
高文
张京芬
贺思敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CNB2006100721693A priority Critical patent/CN100483394C/en
Publication of CN101055559A publication Critical patent/CN101055559A/en
Application granted granted Critical
Publication of CN100483394C publication Critical patent/CN100483394C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention discloses a statistical method for identifying a noise baseline in mass spectra. The method includes the steps of: 1) dividing the spectral peaks of a mass spectrum into at least two classes according to the distribution of peak intensities; 2) calculating the intensity distribution parameters of each class of peaks obtained in step 1); 3) characterizing a generalized noise baseline using the intensity distribution parameters of the peak classes; 4) for each spectral peak in the mass spectrum, calculating its distance from the noise baseline and judging whether the peak is a valid ion peak. Through statistical classification, the invention reflects the actual intensity distribution of noise peaks in a mass spectrum and describes that noise distribution through a generalized noise baseline; the form is therefore flexible, and the search speed of identification software is improved.

Description

A noise baseline identification method in mass spectrum data processing
Technical field
The present invention relates to methods for preprocessing mass spectrometry data and extracting information from it, and in particular to a statistical method for identifying the noise baseline in a mass spectrum.
Background art
In biological experiments, the polypeptide to be identified is broken into fragment ions by collision-induced dissociation in a tandem mass spectrometer. The spectrometer measures the mass and abundance of these fragment ions, forming a tandem mass spectrum. Each fragment ion and its isotopic ions form corresponding spectral peaks in the tandem mass spectrum. Biological laboratories produce large volumes of mass spectrometry data every day, yet only about 10-30% of the spectra yield an identified peptide sequence; a large number of spectra produce no credible identification in a database search. One very important reason is that the preprocessing of mass spectrometry data is not good enough. The peaks useful for identification are the monoisotopic peaks of ions, and in a typical spectrum these account for only about 1-5% of all peaks. The overwhelming majority of peaks are either physical noise produced by the instrument or isotopic peaks of ions (called isotope noise), and this noise confuses identification. An important preprocessing problem is therefore to pick the valid peaks of a mass spectrum accurately, in other words to denoise the spectrum, with the aim of selecting as many of the monoisotopic ion peaks as possible.
One difficulty of mass spectrum denoising is that the distribution of instrument physical noise differs between spectra, and even within one spectrum it differs between mass intervals. Moreover, the peaks of many important ions have very low intensity and are mixed with the noise, which makes them hard to judge. In the prior art, the common methods of identifying noise are the threshold method and wavelet-analysis denoising. The threshold method is used in document 1: J.K. Eng, A.L. McCormack and J.R. Yates, "An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database", J Am Soc Mass Spectrom., 1994, 5, 976-989; document 2: J. Grossmann, F.F. Roos, M. Cieliebak, Z. Liptak, L.K. Mathis, M. Muller, W. Gruissem, and S. Baginsky, "AuDeNS: A Tool for Automatic De Novo Peptide Sequencing", J. Proteome Res., 2005, 4 (5), 1768-74; and document 3: M. Cannataro, P.H. Guzzi, T. Mazza, and P. Veltri, "Preprocessing, Management, and Analysis of Mass Spectrometry Proteomics Data", in Workshop "Workflows management: new abilities for the biological information overflow", NETTAB 2005. In this method, within a specific m/z interval, either the peaks above a given intensity threshold or a fixed number of the highest-ranked peaks by intensity are selected for the subsequent identification. Because intensity is not the most fundamental difference between noise and ion peaks (many important b-series ions have very low intensity), a simple threshold method, whether with a fixed threshold or a selective one, tends to lose important ion mass information.
In addition, some commonly used processing methods, such as the wavelet transform, are applied to remove noise from the raw tandem spectrum, as disclosed in document 4: T. Rejtar, H.S. Chen, V. Andreev, E. Moskovets, and B.L. Karger, "Increased Identification of Peptides by Enhanced Data Preprocessing of High-Resolution MALDI TOF/TOF Mass Spectra Prior to Database Searching", Anal. Chem., 2004, 76, 6017-6028; and document 5: E. Lange, C. Gropl, K. Reinert, O. Kohlbacher, and R. Hildebrandt, "High-Accuracy Peak Picking of Proteomics Data Using Wavelet Techniques", PSB 2006 Online Proceedings. However, these documents also point out that the parameters of the transform, such as the basis function of the wavelet transform, its order, and the decomposition level, all affect the reliability of the denoising.
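For contrast with the method proposed later, the two prior-art threshold strategies described above can be sketched as follows. This is only an illustrative sketch: the function names, the window width, and the sample peak list are assumptions and are not taken from the cited documents.

```python
from collections import defaultdict

def fixed_threshold(peaks, min_intensity):
    """Keep every (mz, intensity) peak whose intensity reaches min_intensity."""
    return [(mz, inten) for mz, inten in peaks if inten >= min_intensity]

def top_n_per_window(peaks, window=100.0, n=2):
    """Keep the n most intense peaks inside each m/z window of the given width."""
    bins = defaultdict(list)
    for mz, inten in peaks:
        bins[int(mz // window)].append((mz, inten))
    kept = []
    for key in sorted(bins):
        strongest = sorted(bins[key], key=lambda p: p[1], reverse=True)[:n]
        kept.extend(sorted(strongest))  # restore m/z order inside the window
    return kept

# Hypothetical peak list: (m/z, intensity)
peaks = [(101.1, 5.0), (150.2, 40.0), (180.3, 8.0), (220.5, 3.0), (260.0, 90.0)]
print(fixed_threshold(peaks, 8.0))
print(top_n_per_window(peaks, window=100.0, n=1))
```

Both filters decide on intensity alone, so a weak but genuine ion peak such as the hypothetical one at m/z 101.1 is discarded, which is the weakness the text attributes to threshold methods.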
In view of these shortcomings of the prior art, a new method of identifying noise in mass spectra is desirable, in particular one that identifies noise according to the distribution of peak intensities, that is, one that uses a generalized noise baseline.
Summary of the invention
The object of the present invention is to overcome the shortcomings of the prior art by providing a noise baseline identification method for mass spectrum data processing.
To achieve this object, the present invention adopts the following technical solution.
A noise baseline identification method in mass spectrum data processing comprises the steps of:
1) dividing the spectral peaks of a mass spectrum into at least two classes according to the distribution of peak intensities;
2) calculating the intensity distribution parameters of each class of peaks obtained in the previous step;
3) characterizing a generalized noise baseline with the intensity distribution parameters of the peak classes;
4) for each spectral peak in the mass spectrum, calculating its distance from the noise baseline and judging whether it is a valid ion peak.
In the above technical solution, the classification in step 1) divides the peaks according to the distribution trend of the peak intensities in the mass spectrum, such as a Gaussian distribution or a gamma distribution. The distribution trend is obtained from statistics of the peaks of the mass spectrum.
In the above technical solution, dividing into at least two classes in step 1) means dividing the peaks by intensity into two different classes, representing the noise class and the ion peak class respectively. The number of classes can be increased as needed; the more classes there are, the finer the division of the peaks. The basic purpose of the classification is to obtain the boundary between the noise peaks and the peaks of the other classes.
In the above technical solution, step 3) characterizes the generalized noise baseline with the intensity distribution parameters of the peak classes. For a Gaussian class, the mean and the standard deviation are used to represent the noise baseline: the mean describes the average peak intensity of the whole class, and the standard deviation describes how far the peak intensities of the class deviate from the mean, which can also be understood as the width of the distribution. For a gamma class, the noise baseline is represented by the parameters (α, β, γ), where α is the shape parameter of the gamma distribution, β is its scale parameter, and γ is its location parameter.
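The two parameterizations named above can be illustrated with a short sketch. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is intended. The gamma density follows the usual three-parameter form with shape α, scale β, and location γ.

```python
import math
from statistics import fmean, pstdev

def gaussian_baseline(intensities):
    """Return (mean, deviation) for one Gaussian peak class.

    Assumption: population standard deviation; the source does not specify.
    """
    return fmean(intensities), pstdev(intensities)

def gamma_pdf(x, alpha, beta, gamma_loc=0.0):
    """Density of a gamma distribution with shape alpha, scale beta, location gamma_loc."""
    z = x - gamma_loc
    if z <= 0:
        return 0.0
    # Work in log space so that large alpha does not overflow.
    log_pdf = ((alpha - 1) * math.log(z) - z / beta
               - math.lgamma(alpha) - alpha * math.log(beta))
    return math.exp(log_pdf)

# Hypothetical intensities of one peak class
print(gaussian_baseline([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(gamma_pdf(2.0, alpha=1.0, beta=2.0))
```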
In the above technical solution, step 4) uses the distance between a peak's intensity and the noise baseline as one criterion for judging noise: the farther a peak lies from the noise class, the more likely it is a valid peak. A larger or smaller distance therefore indicates a correspondingly larger or smaller probability that the peak is a valid peak.
The present invention proposes a method of identifying the noise baseline. It identifies noise in a mass spectrum from the basic intensity levels of the peak intensity distribution, which are called noise baselines. Unlike threshold filtering, the invention uses statistical learning to identify different baselines in the mass spectrum, and these baselines serve as one feature, rather than the only feature, for distinguishing noise from ion peaks.
Compared with the prior art, the advantages of the invention are:
1) It overcomes the drawback of determining the noise baseline empirically or heuristically; statistical classification better reflects the true intensity distribution of noise peaks in a mass spectrum.
2) Unlike existing methods that find a single definite noise baseline, this method describes the noise distribution in a mass spectrum through a generalized noise baseline. Its form is flexible and can be adjusted to the characteristics of spectra produced by different instruments, laboratories, and samples.
3) The method greatly improves the search speed of identification software.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments:
Embodiment 1
This embodiment divides the intensities of the peaks in a mass spectrum into three levels: 1) high-intensity fragment ion peaks; although the identity of the corresponding fragment ion may be unknown, a peak of sufficiently high intensity is very likely an ion peak; 2) low-intensity noise, which is ubiquitous along the m/z axis and whose intensity follows a normal distribution; this noise is related to the physical noise of the instrument; 3) a mixture of high-intensity noise and low-intensity fragment ion peaks.
This embodiment therefore identifies two kinds of noise baseline in the mass spectrum: a) the upper intensity limit of the low-intensity noise, referred to below as the global baseline; b) the lower intensity limit of the high-intensity fragment ion peaks, referred to below as the local baseline. Once identified, these two baselines serve as one feature for judging whether a peak is a valid peak. To this end, a statistical learning method, for example a Gaussian mixture model, classifies the peaks of the mass spectrum by intensity into different normal subclasses, and the mean and standard deviation of the normal subclasses represent the noise baseline. This noise baseline differs from the intensity-threshold baseline of the threshold method; it is a generalized noise baseline.
Although peaks of very low intensity in a mass spectrum are usually noise, the peaks of many important fragment ions are not intense either and are easily confused with noise. This embodiment therefore divides the peaks of the mass spectrum into three classes according to their intensity distribution: high-intensity ion peaks, low-intensity noise, and a mixture of high-intensity noise and low-intensity ion peaks.
Because noise arises randomly during collision-induced dissociation (CID), the noise intensity follows a normal distribution, and the intensity distribution of fragment ions is also approximately normal. A Gaussian mixture model (GMM) can therefore be used to classify the peaks of a mass spectrum into three classes. This classification identifies the high-intensity ion peaks and the low-intensity noise, and at the same time gives the intensity threshold of the mixture of high-intensity noise and low-intensity ion peaks, which helps subsequent identification, as those skilled in the art will appreciate.
Specifically, this embodiment proceeds in two levels. First, the peaks of the mass spectrum are divided into two normally distributed components, representing the distributions of high-intensity ion peaks and of noise peaks respectively. Then the noise peaks, together with the weaker peaks of the high-intensity fragment ion class, are pooled and divided again into two normally distributed components, representing the low-intensity noise peaks and the mixture of high-intensity noise and low-intensity fragment ion peaks respectively.
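The two-level split can be sketched with a small, self-contained EM fit of a two-component one-dimensional Gaussian mixture. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the initialization, the fixed iteration count, and the synthetic spectrum are all hypothetical; the second level here simply reuses the lower first-level component rather than pooling selected weak peaks; and assigning the lower-mean second-level component to the global baseline and the other to the local baseline is an assumption.

```python
import math
import random

def normal_pdf(x, mean, dev):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / dev) ** 2) / (dev * math.sqrt(2 * math.pi))

def fit_two_gaussians(xs, iters=200):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (means, devs, labels); component 0 has the lower mean and
    labels[i] is the hard assignment (0 or 1) of xs[i].
    """
    lo, hi = min(xs), max(xs)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]  # crude initial guess
    devs = [max((hi - lo) / 4, 1e-6)] * 2
    weights = [0.5, 0.5]
    resp = [0.5] * len(xs)
    for _ in range(iters):
        # E-step: responsibility of component 1 for every peak
        for i, x in enumerate(xs):
            p0 = weights[0] * normal_pdf(x, means[0], devs[0])
            p1 = weights[1] * normal_pdf(x, means[1], devs[1])
            resp[i] = p1 / (p0 + p1) if p0 + p1 > 0 else 0.5
        # M-step: re-estimate weight, mean and deviation of each component
        for k in (0, 1):
            r = resp if k == 1 else [1 - v for v in resp]
            nk = max(sum(r), 1e-12)
            means[k] = sum(ri * x for ri, x in zip(r, xs)) / nk
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, xs)) / nk
            devs[k] = max(math.sqrt(var), 1e-6)
            weights[k] = nk / len(xs)
    labels = [1 if v > 0.5 else 0 for v in resp]
    if means[0] > means[1]:  # keep component 0 as the lower-intensity one
        means.reverse()
        devs.reverse()
        labels = [1 - lab for lab in labels]
    return means, devs, labels

def two_level_baseline(intensities):
    """Return (GI_mean, GI_deviation, LI_mean, LI_deviation) from a two-level split."""
    # Level 1: separate the high-intensity ion peaks (label 1) from the rest.
    _, _, labels = fit_two_gaussians(intensities)
    low = [x for x, lab in zip(intensities, labels) if lab == 0]
    # Level 2: split the rest into low-intensity noise and the mixture class.
    means, devs, _ = fit_two_gaussians(low)
    return means[0], devs[0], means[1], devs[1]

# Synthetic intensities, for illustration only
random.seed(0)
spectrum = ([random.gauss(10, 2) for _ in range(300)]      # low-intensity noise
            + [random.gauss(30, 5) for _ in range(100)]    # noise / weak-ion mixture
            + [random.gauss(200, 30) for _ in range(30)])  # strong fragment ions
print(two_level_baseline(spectrum))
```

A hierarchical split is used instead of a single three-component fit because that is the order the text describes; a production implementation would also need a convergence test and a more careful initialization.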
A generalized noise baseline is characterized by the means and standard deviations of the second-level normal components. In other words, two baselines are calculated, the global baseline and the local baseline, denoted $I_{Baseline} = (GI_{mean}, GI_{deviation}, LI_{mean}, LI_{deviation})$. The value of each component of $I_{Baseline}$ is computed by the EM (Expectation-Maximization) algorithm; the components are in fact the mean and standard deviation parameters of the two normal components of the mixture model. Within $I_{Baseline}$, the global baseline represents the lower intensity limit of the high-intensity ion peaks, and the local baseline represents the upper intensity limit of the low-intensity noise peaks. A peak lying between the global and local baselines may be either noise or a fragment ion peak and must be distinguished by other known methods.
To aid understanding of the invention, the distance between a peak's intensity and the noise baseline is introduced here. Once the noise baseline has been determined, this distance serves as one criterion for judging noise: the larger the distance, the farther the peak lies from the noise and the more likely it is a valid peak. In this embodiment, the following two formulas express the distance between a peak's intensity and the noise baseline:
$$F_{RA1} = A_1 \cdot (I_{peak} - B_1 \cdot GI_{mean}) / GI_{deviation} \qquad (1)$$
$$F_{RA2} = A_2 \cdot (I_{peak} - B_2 \cdot LI_{mean}) / LI_{deviation} \qquad (2)$$
Here $A_1$, $B_1$, $A_2$, and $B_2$ are weights. The distance in effect reflects the ratio between a peak's distance from the center of the noise baseline and the width of the distribution of the whole noise class. $A_1$, $B_1$, $A_2$, and $B_2$ may all be set to 1; alternatively, in practical applications, the weights may be determined from statistical results so as to match the actual situation better.
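Formulas (1) and (2) translate directly into code. In the sketch below the function name, the tuple layout of the baseline, and the numeric values are illustrative assumptions; the default weights of 1 follow the option mentioned in the text.

```python
def baseline_distances(i_peak, baseline, a1=1.0, b1=1.0, a2=1.0, b2=1.0):
    """Return (F_RA1, F_RA2) for one peak intensity.

    baseline is (GI_mean, GI_deviation, LI_mean, LI_deviation).
    """
    gi_mean, gi_dev, li_mean, li_dev = baseline
    f_ra1 = a1 * (i_peak - b1 * gi_mean) / gi_dev  # formula (1)
    f_ra2 = a2 * (i_peak - b2 * li_mean) / li_dev  # formula (2)
    return f_ra1, f_ra2

# Hypothetical baseline and peak intensity
baseline = (10.0, 2.0, 30.0, 5.0)
print(baseline_distances(40.0, baseline))  # -> (15.0, 2.0)
```

With all weights equal to 1, each distance is simply the number of standard deviations by which the peak exceeds the corresponding baseline mean.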
This embodiment was applied to different data sets. Tests with the pFind and MASCOT software show that the invention outperforms the preprocessing function of the existing commercial software ProteinLynx™ Global Server version 2.0.5. Test results on eight protein data sets show that data processed by this method yield, on average, 50% more reliably identified polypeptides than data processed by ProteinLynx™ Global Server 2.0.5, and up to 180% more.
This method greatly improves the search speed of identification software. For example, tests with pFind version 1.5 show that search speed improves 5- to 10-fold after the data are preprocessed with this method, and tests with MASCOT version 2.0 show a 2- to 4-fold improvement.
Embodiment 2
This embodiment divides the intensities of the peaks in a mass spectrum into two levels: 1) fragment ion peaks, whose intensity follows a gamma distribution; 2) noise peaks, whose intensity follows a normal distribution.
This embodiment therefore identifies one kind of noise baseline in the mass spectrum: the upper intensity limit of the noise, referred to below as the global baseline. Once identified, the noise baseline serves as one feature for judging whether a peak is a valid peak. To this end, a statistical learning method using a Gaussian-gamma mixture model classifies the peaks of the mass spectrum by intensity into two subclasses, one normally distributed and one gamma distributed; the mean and standard deviation of the normal subclass represent the noise baseline.
The remainder is the same as in Embodiment 1.
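To make the Gaussian-gamma mixture of this embodiment concrete, the sketch below evaluates, for one peak, the posterior probability that it belongs to the normally distributed noise class. It shows only this classification step with already known parameters; how the parameters are estimated is not shown. All names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, dev):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / dev) ** 2) / (dev * math.sqrt(2 * math.pi))

def gamma_pdf(x, alpha, beta, loc=0.0):
    """Density of a gamma distribution with shape alpha, scale beta, location loc."""
    z = x - loc
    if z <= 0:
        return 0.0
    log_pdf = ((alpha - 1) * math.log(z) - z / beta
               - math.lgamma(alpha) - alpha * math.log(beta))
    return math.exp(log_pdf)

def noise_posterior(x, w_noise, noise_mean, noise_dev, alpha, beta, loc=0.0):
    """Probability that intensity x belongs to the Gaussian noise component."""
    p_noise = w_noise * normal_pdf(x, noise_mean, noise_dev)
    p_ion = (1.0 - w_noise) * gamma_pdf(x, alpha, beta, loc)
    total = p_noise + p_ion
    return p_noise / total if total > 0 else 0.5

# Hypothetical mixture: 80% noise ~ N(10, 2), 20% fragment ions ~ Gamma(shape 2, scale 50)
params = dict(w_noise=0.8, noise_mean=10.0, noise_dev=2.0, alpha=2.0, beta=50.0)
for intensity in (10.0, 20.0, 100.0):
    print(intensity, noise_posterior(intensity, **params))
```

In this illustration, the noise mean and deviation passed to `noise_posterior` are the quantities that the text says represent the noise baseline.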
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims (6)

1. A noise baseline identification method in mass spectrum data processing, comprising the steps of:
1) dividing the spectral peaks of a mass spectrum into at least two classes according to the distribution of peak intensities;
2) calculating the intensity distribution parameters of each class of peaks obtained in step 1);
3) characterizing a generalized noise baseline with the intensity distribution parameters of the peak classes;
4) for each spectral peak in the mass spectrum, calculating its distance from the noise baseline and judging whether it is a valid ion peak.
2. The noise baseline identification method in mass spectrum data processing according to claim 1, characterized in that the classification in step 1) divides the peaks according to the distribution trend of the peak intensities in the mass spectrum, the distribution trend being obtained from statistics of the peaks of the mass spectrum.
3. The noise baseline identification method in mass spectrum data processing according to claim 2, characterized in that the distribution trend comprises a Gaussian distribution or a gamma distribution.
4. The noise baseline identification method in mass spectrum data processing according to claim 1, characterized in that dividing into at least two classes in step 1) means dividing the peaks by intensity into two different classes, representing the noise class and the ion peak class respectively.
5. The noise baseline identification method in mass spectrum data processing according to claim 1, characterized in that in step 3) the generalized noise baseline is characterized with the intensity distribution parameters of the peak classes; for a Gaussian class, the mean and the standard deviation are used to represent the noise baseline, the mean describing the average peak intensity of the whole class and the standard deviation describing how far the peak intensities of the whole class deviate from the mean; for a gamma class, the noise baseline is represented by the parameters (α, β, γ), where α is the shape parameter of the gamma distribution, β is its scale parameter, and γ is its location parameter.
6. The noise baseline identification method in mass spectrum data processing according to any one of claims 1-5, characterized in that in step 4) the distance between the intensity of a peak and the noise baseline is used as one criterion for judging noise, a larger or smaller distance indicating a correspondingly larger or smaller probability that the peak is a valid peak.
CNB2006100721693A 2006-04-14 2006-04-14 Noise baseline identification method in mass spectrum data processing Expired - Fee Related CN100483394C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100721693A CN100483394C (en) 2006-04-14 2006-04-14 Noise baseline identification method in mass spectrum data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100721693A CN100483394C (en) 2006-04-14 2006-04-14 Noise baseline identification method in mass spectrum data processing

Publications (2)

Publication Number Publication Date
CN101055559A true CN101055559A (en) 2007-10-17
CN100483394C CN100483394C (en) 2009-04-29

Family

ID=38795399

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100721693A Expired - Fee Related CN100483394C (en) 2006-04-14 2006-04-14 Noise baseline identification method in mass spectrum data processing

Country Status (1)

Country Link
CN (1) CN100483394C (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101610519B (en) * 2009-07-07 2011-01-12 南京工业大学 Wireless real-time multimedia communication flow implementation method based on wavelet analysis and Gamma distribution
CN102289558A (en) * 2011-05-23 2011-12-21 公安部第一研究所 Baseline adjusting method based on random signal processing
CN109726667A (en) * 2018-12-25 2019-05-07 广州市锐博生物科技有限公司 Mass spectrometric data treating method and apparatus, computer equipment, computer storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2187035A (en) * 1986-01-27 1987-08-26 Eric James Sjoberg Pyrolysis mass spectrometer disease diagnosis aid
WO2002031484A2 (en) * 2000-10-11 2002-04-18 Ciphergen Biosystems, Inc. Methods for characterizing molecular interactions using affinity capture tandem mass spectrometry

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101610519B (en) * 2009-07-07 2011-01-12 南京工业大学 Wireless real-time multimedia communication flow implementation method based on wavelet analysis and Gamma distribution
CN102289558A (en) * 2011-05-23 2011-12-21 公安部第一研究所 Baseline adjusting method based on random signal processing
CN102289558B (en) * 2011-05-23 2014-10-01 公安部第一研究所 Baseline adjusting method based on random signal processing
CN109726667A (en) * 2018-12-25 2019-05-07 广州市锐博生物科技有限公司 Mass spectrometric data treating method and apparatus, computer equipment, computer storage medium

Also Published As

Publication number Publication date
CN100483394C (en) 2009-04-29

Similar Documents

Publication Publication Date Title
US10593530B2 (en) Method for identification of the monoisotopic mass of species of molecules
Huang et al. The influence of histidine on cleavage C-terminal to acidic residues in doubly protonated tryptic peptides
McDonnell et al. Higher sensitivity secondary ion mass spectrometry of biological molecules for high resolution, chemically specific imaging
JP5727503B2 (en) Multiple tandem mass spectrometry
JP6115288B2 (en) Peak detection method and system in mass spectrometry
CN101055558B (en) Mass spectrum effective peak selection method based on data isotope mode
CN100483394C (en) Noise baseline identification method in mass spectrum data processing
US8110793B2 (en) Tandem mass spectrometry with feedback control
JP2021081365A (en) Glycopeptide analyzer
JP2007263641A (en) Structure analysis system
Wirth et al. Post‐translational modification detection using metastable ions in reflector matrix‐assisted laser desorption/ionization‐time of flight mass spectrometry
Wu et al. Quality assessment of peptide tandem mass spectra
US11378581B2 (en) Monoisotopic mass determination of macromolecules via mass spectrometry
WO2004083233A2 (en) Peptide identification
Dodds et al. Systematic characterization of high mass accuracy influence on false discovery and probability scoring in peptide mass fingerprinting
US11600359B2 (en) Methods and systems for analysis of mass spectrometry data
Choo et al. Tandem mass spectrometry data quality assessment by self-convolution
Riter et al. Comparison of the Paul ion trap to the linear ion trap for use in global proteomics
US20080015785A1 (en) Mass Spectrometry Algorithm
Fazal et al. Multifactorial Understanding of Ion Abundance in Tandem Mass Spectrometry Experiments
Tsou Computational Framework for Data-Independent Acquisition Proteomics.
Yao Mass Spectrometry Based De Novo Peptide Sequencing Error Correction
Freestone et al. Group-walk, a rigorous approach to separate FDR analysis by TDC
Rogers Assessment of an amalgamative approach to protein identification
Beaudrie Assessment of tandem mass spectra quality to improve protein identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090429
