CN1769891A - A method for identifying peptides using tandem mass spectrometry data

Info

Publication number: CN1769891A
Application number: CN 200410088779
Authority: CN
Inventors: 高文; 付岩; 李德泉; 孙瑞祥; 贺思敏; 杨强; 曾嵘; 周虎; 陈益强; 王晓彪
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2004-11-03
Filing date: 2004-11-03
Publication date: 2006-05-10
Anticipated expiration: 2024-11-03
Also published as: CN100376895C

Abstract

The invention discloses a method for identifying peptides using tandem mass spectrometry data, comprising the steps of: experimentally fragmenting the peptide to be identified to generate an experimental tandem mass spectrum; theoretically fragmenting a plurality of candidate peptides in a database to generate a plurality of theoretical tandem mass spectra; computing the similarity between each theoretical tandem mass spectrum and the experimental tandem mass spectrum with a radial basis function kernel, the radial basis function including an exponential part; and, according to the computed similarities, selecting the peptide corresponding to the theoretical tandem mass spectrum most similar to the experimental tandem mass spectrum as the identification result. The method uses a radial basis function kernel to evaluate the similarity between the theoretical tandem mass spectra and the experimental tandem mass spectrum, and further emphasizes the positive correlation of consecutive fragment ions by summing over consecutive fragment ions in the exponential part of the kernel. It achieves higher accuracy than prior-art peptide identification methods and markedly reduces false positive results.

Description

A method for identifying peptides using tandem mass spectrometry data

Technical field

The present invention relates to a proteome analysis method, and specifically to a method for identifying peptide sequences.

Background technology

In current proteome research, protein identification based on tandem mass spectrometry is one of the most widely used techniques (reference: Aebersold, R. and Mann, M. Mass spectrometry-based proteomics. Nature, 2003, 422:198-207). One of its problems is how to automatically identify, from a tandem mass spectrum obtained by experiment, the sequence of the peptide that produced it. To identify the sequence of the peptide that produced an experimental tandem mass spectrum, database search methods are widely adopted (references: Eng, J.K., McCormack, A.L. and Yates, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom, 1994, 5:976-989; Perkins, D.N., Pappin, D.J., Creasy, D.M. and Cottrell, J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 1999, 20:3551-3567; Field, H.I., Fenyö, D. and Beavis, R.C. RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics, 2002, 2:36-47). In this method, the peptide sequences in a database are theoretically fragmented into fragment ions to generate theoretical tandem mass spectra; the peptide to be identified is fragmented into fragment ions in a mass spectrometer to generate an experimental tandem mass spectrum; each theoretical tandem mass spectrum is compared with the experimental tandem mass spectrum so that the candidate peptides in the database are scored; and finally, according to the scores, the peptide corresponding to the theoretical tandem mass spectrum most similar to the experimental tandem mass spectrum is selected as the identification result.

As can be seen, a key problem in the database search method is computing a suitable similarity between the theoretical tandem mass spectrum and the experimental tandem mass spectrum, that is, choosing a suitable peptide scoring algorithm. An unsuitable similarity measure, or peptide scoring algorithm, increases the number of wrong peptide identifications (false positive results), whereas a suitable peptide scoring algorithm reduces the false positive results of peptide identification.

The scoring functions used in existing peptide scoring algorithms usually assume that fragment ions occur independently of one another in a tandem mass spectrum, and therefore adopt linear scoring functions. In a linear scoring method, any correlation that may exist between fragment ions is ignored entirely: all ion matches between the experimental and theoretical mass spectra are treated equally when the total score is computed. In practice, the incomplete predictability of peptide fragmentation patterns, the unrecoverable information lost during fragmentation, and the enormous number of candidate peptides all cause random false matches to occur frequently, which may ultimately lead to wrong peptide identifications, that is, false positive results.
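The limitation of linear scoring can be illustrated with a minimal sketch. The dot-product form below is only one possible linear scoring function, chosen here as an assumption for illustration; the function name `linear_score` and the 0/1 intensity vectors are hypothetical.

```python
def linear_score(experimental, theoretical):
    """Linear score: each matched ion contributes independently of its neighbours."""
    return sum(c * t for c, t in zip(experimental, theoretical))


# Hypothetical intensities along one fragment ion series (six cleavage positions).
theoretical = [1, 1, 1, 1, 1, 1]
consecutive = [1, 1, 1, 0, 0, 0]  # three matches at adjacent cleavage positions
scattered = [1, 0, 1, 0, 1, 0]    # three isolated matches

print(linear_score(consecutive, theoretical))  # 3
print(linear_score(scattered, theoretical))    # 3
```

Both match patterns receive the same score, so a linear function cannot reward a run of consecutive matched ions over the same number of isolated, possibly random, matches.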

In fact, after a peptide is fragmented, theoretically or experimentally, into fragment ions, the consecutive fragment ions among them are potentially positively correlated. When positively correlated ions are matched simultaneously, these matches taken as a whole should intuitively be more credible than the same matches taken individually. Such positively correlated ions should therefore be emphasized to some extent, which in turn calls for a nonlinear peptide scoring function.

Summary of the invention

One object of the present invention is to provide a method for identifying peptides using tandem mass spectrometry data that adopts a new peptide scoring method; another object of the present invention is to provide a method for identifying peptides using tandem mass spectrometry data that takes the correlation of consecutive fragment ions into account.

To achieve these objects, the invention provides a method for identifying peptides using tandem mass spectrometry data, comprising the steps of:

experimentally fragmenting the peptide to be identified to generate an experimental tandem mass spectrum;

theoretically fragmenting a plurality of candidate peptides in a database to generate a plurality of theoretical tandem mass spectra;

computing the similarity between each of the theoretical tandem mass spectra and the experimental tandem mass spectrum with a radial basis function kernel, the radial basis function including an exponential part;

selecting, according to the computed similarities, the peptide corresponding to the theoretical tandem mass spectrum most similar to the experimental tandem mass spectrum as the identification result.

The method further comprises denoising the experimental tandem mass spectrum.

The step of generating the theoretical tandem mass spectra further comprises selecting the fragment ion types.

The exponential part of the radial basis function kernel comprises a summation over consecutive fragment ions.

The step of computing the similarity between the theoretical tandem mass spectra and the experimental tandem mass spectrum further comprises:

arranging the theoretical tandem mass spectrum and the experimental tandem mass spectrum into a matrix T and a matrix C, respectively, according to the selected fragment ion types and the cleavage positions of the fragment ions, the consecutive fragment ions occupying consecutive positions within one row of the matrix;

the radial basis function kernel having the form

\sum_{i=1}^{m} \sum_{j=1}^{n} \exp\left(-\gamma \sum_{k=j-l_1}^{j+l_2} (c_{ik} - t_{ik})^2\right),

where c_{ik} and t_{ik} are the elements of matrix C and matrix T, respectively, and are set to 0 when k ≤ 0 or k > n;

the integers l_1 and l_2 are equal to \lfloor (l-1)/2 \rfloor and \lceil (l-1)/2 \rceil, respectively;

the integer l is the number of consecutive fragment ions to be considered; and γ is an adjustable parameter. Preferably, l = 5 and 0.8 ≤ γ ≤ 1.

The method of the present invention uses a radial basis function kernel to evaluate the similarity between the theoretical tandem mass spectra and the experimental tandem mass spectrum, and further emphasizes the positive correlation of consecutive fragment ions by summing over consecutive fragment ions in the exponential part of the kernel. It achieves higher accuracy than prior-art peptide identification methods and markedly reduces false positive results.

Description of drawings

Fig. 1 is a schematic diagram of the formation of an exemplary peptide;

Fig. 2 is a schematic diagram of the six series of fragment ions that may result from peptide fragmentation;

Fig. 3 is an exemplary experimental tandem mass spectrum;

Fig. 4 is a schematic diagram of the predicted ion array in one embodiment, in which the dashed boxes indicate correlation windows;

Fig. 5 is a plot of the error rate of the RBF-KSDP method of the present invention as a function of its parameters.

Embodiment

The present invention is described in further detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, two amino acids can be joined by forming a peptide bond between the C-terminus of one and the N-terminus of the other, with the loss of one water molecule. A peptide is a sequence of amino acid residues linked to one another by peptide bonds, and this sequence determines the identity of the peptide.

To identify the amino acid sequence of a peptide, the peptide is ionized and then enters the mass spectrometer. In the mass spectrometer, peptide ions with a specific mass-to-charge ratio (m/z), which usually also share the same amino acid sequence, are fragmented by collision-induced dissociation (CID). Under low-energy CID, the backbone bonds usually break in three ways, producing six series of fragment ions: the a, b and c series at the N-terminus and the x, y and z series at the C-terminus, as shown in Fig. 2. Fig. 2 is an example of the fragment ions formed under CID from a peptide consisting of four amino acid residues. The letters a, b, c, x, y and z denote the fragment ion series, and the subscripts 1 to 3 denote the cleavage position of the peptide at which the fragment ion is produced. The symbol H⁺ in the upper right corner of Fig. 2 indicates that the peptide carries a positive charge.

The m/z values of these fragment ions are measured, forming a tandem mass spectrum, also called the experimental tandem mass spectrum. Fig. 3 shows an exemplary experimental tandem mass spectrum. The horizontal axis of the spectrum represents the m/z of the detected fragment ions, and the vertical axis represents their relative intensity. Besides peaks formed by predictable fragment ions, the spectrum may also contain peaks formed by unpredictable fragment ions (such as internal ions) and peaks that are physical or chemical noise. The experimentally obtained tandem mass spectrum therefore usually needs to be denoised. A simple approach is to keep a certain proportion of the most intense peaks and remove the rest; in one embodiment, for example, only the 200 most intense peaks are kept.
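The simple denoising rule described above can be sketched as follows. This is a minimal illustration, assuming a spectrum is held as a list of (m/z, intensity) pairs; the function name `keep_strongest_peaks` and the sample values are hypothetical.

```python
def keep_strongest_peaks(peaks, max_peaks=200):
    """Keep the max_peaks most intense peaks and return them sorted by m/z.

    peaks: list of (mz, intensity) tuples.
    """
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks]
    return sorted(strongest, key=lambda p: p[0])


# Hypothetical five-peak spectrum, reduced to its three most intense peaks.
spectrum = [(175.1, 30.0), (204.1, 2.0), (262.2, 55.0), (333.2, 1.5), (446.3, 80.0)]
print(keep_strongest_peaks(spectrum, max_peaks=3))
# [(175.1, 30.0), (262.2, 55.0), (446.3, 80.0)]
```

With the default `max_peaks=200` this corresponds to the embodiment that retains the 200 most intense peaks.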

To identify a peptide sequence from a tandem mass spectrum, the generation of a tandem mass spectrum must be simulated for each candidate peptide sequence in a database of known peptides. The simulated spectrum is called a theoretical tandem mass spectrum, and each candidate peptide sequence corresponds to one theoretical tandem mass spectrum. When generating a theoretical tandem mass spectrum, the fragment ion types to be considered are first selected according to the type and characteristics of the mass spectrometer. In one embodiment, for example, only the a, b and y series fragment ions of Fig. 2 are considered, because the a, b and y series (including singly and multiply charged ions and ions that have lost water or ammonia) usually dominate. It will be readily understood that those skilled in the art may, depending on the actual situation, select fragment ion types different from those of the above embodiment. After the fragment ion types have been selected, the peptide sequence is fragmented in simulation, and the mass-to-charge ratio (m/z) and intensity of every fragment ion of the specified types are predicted to form the theoretical mass spectrum. The mass-to-charge ratio of a fragment ion equals the molecular weight of the ion divided by its charge number. Predicting the theoretical intensity of fragment ions is itself a separate research problem; in the simplest case all intensities can be set to 1, which assumes that all ions are equally likely to occur.
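The simulated fragmentation can be sketched for the b and y series as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the residue mass table is a small subset of standard monoisotopic masses, the function name `theoretical_spectrum` is hypothetical, the charge is assumed to come from added protons, and each ion is indexed by its own series number.

```python
# Monoisotopic residue masses in Da (illustrative subset of the 20 amino acids).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
                "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259}
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # mass of a water molecule, Da


def theoretical_spectrum(peptide, charge=1):
    """Predict b- and y-ion m/z values for every cleavage position.

    A peptide with n + 1 residues has n cleavage positions. All theoretical
    intensities are taken as 1, the simple case described in the text.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(peptide) - 1
    # b_j holds the first j residues; m/z = ion mass / charge number.
    b = [(sum(masses[:j]) + charge * PROTON) / charge for j in range(1, n + 1)]
    # y_j holds the last j residues plus one water.
    y = [(sum(masses[-j:]) + WATER + charge * PROTON) / charge for j in range(1, n + 1)]
    return {"b": b, "y": y}


spec = theoretical_spectrum("GASK")  # hypothetical four-residue peptide
print(round(spec["b"][0], 4), round(spec["y"][0], 4))  # 58.0287 147.1128
```

Neutral-loss variants (loss of water or ammonia) and other series could be added as further entries of the returned dictionary in the same way.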

The predicted ions are arranged into an array according to the selected fragment ion types and the cleavage positions of the fragment ions; this array is called the predicted ion array. Fig. 4 shows an embodiment of a predicted ion array in which the selected fragment ion types are the b and y series, specifically b, b⁰, b*, b⁺⁺ and y, y⁰, y*, y⁺⁺. The superscript "++" indicates that the ion carries two positive charges, the absence of a superscript indicates one positive charge, the superscript "*" indicates that the ion has lost one ammonia molecule, and the superscript "0" indicates that the ion has lost one water molecule. The subscripts 1 to n of b, b⁰, b*, b⁺⁺ and y, y⁰, y*, y⁺⁺ denote the cleavage position of the peptide at which the fragment ion is produced. In Fig. 4, the fragment ion types run vertically and the cleavage positions run horizontally to form the predicted ion array.

The fragment ion intensities in the theoretical tandem mass spectrum are written as a matrix T, following the order of the predicted ion array:

T = \begin{pmatrix}
t_{1,1} & t_{1,2} & t_{1,3} & \cdots & t_{1,n} \\
t_{2,1} & t_{2,2} & t_{2,3} & \cdots & t_{2,n} \\
t_{3,1} & t_{3,2} & t_{3,3} & \cdots & t_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
t_{m,1} & t_{m,2} & t_{m,3} & \cdots & t_{m,n}
\end{pmatrix},

where, in correspondence with the predicted ion array, the subscript i of the element t_{i,j} of matrix T distinguishes the fragment ion types and the subscript j distinguishes the cleavage positions. The element t_{i,j} is the intensity, in the theoretical tandem mass spectrum, of the fragment ion at position (i, j) of the predicted ion array; for example, t_{2,3} corresponds to the intensity of the b_3^* ion of Fig. 4 in the theoretical tandem mass spectrum. m is the number of selected fragment ion types, and n + 1 is the number of amino acid residues in the peptide sequence, so that the peptide has n cleavage positions.

The intensities of the peaks in the experimental tandem mass spectrum are likewise written as a matrix C, following the order of the predicted ion array:

C = \begin{pmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,n} \\
c_{2,1} & c_{2,2} & c_{2,3} & \cdots & c_{2,n} \\
c_{3,1} & c_{3,2} & c_{3,3} & \cdots & c_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{m,1} & c_{m,2} & c_{m,3} & \cdots & c_{m,n}
\end{pmatrix},

where, if the experimental tandem mass spectrum contains one or more peaks whose mass-to-charge ratio matches that of the fragment ion at position (i, j) of the predicted ion array, c_{i,j} equals the sum of the intensities of the matching peaks in the experimental tandem mass spectrum; otherwise c_{i,j} = 0. As with the predicted ion array and the theoretical matrix T, the subscript i distinguishes the fragment ion types and the subscript j distinguishes the cleavage positions. Here, two mass-to-charge ratios match when the difference between the m/z of a peak in the experimental tandem mass spectrum and the m/z of the fragment ion at a position of the predicted ion array lies within a specified error range. The specified error range is typically about 1 Da for ion trap mass spectrometry data and about 0.4 Da for Q-TOF data.
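The construction of matrix C can be sketched as follows. This is a minimal illustration, assuming the predicted m/z values are already arranged as an m × n nested list (rows are fragment ion types, columns are cleavage positions) and the experimental spectrum is a list of (m/z, intensity) pairs; the function name `build_experimental_matrix` and the sample values are hypothetical.

```python
def build_experimental_matrix(predicted_mz, peaks, tolerance=1.0):
    """Return matrix C: for each predicted ion, the summed intensity of all
    experimental peaks whose m/z lies within `tolerance` Da, or 0 if none match."""
    return [[sum(intensity for mz, intensity in peaks if abs(mz - pred) <= tolerance)
             for pred in row]
            for row in predicted_mz]


# Hypothetical predicted ion array: one b-series row and one y-series row.
predicted = [[58.03, 129.07, 216.10],
             [147.11, 234.15, 305.18]]
# Hypothetical experimental peaks; two of them fall near the second b ion.
peaks = [(129.1, 40.0), (129.6, 10.0), (234.2, 75.0), (400.0, 5.0)]

C = build_experimental_matrix(predicted, peaks, tolerance=1.0)
print(C)  # [[0, 50.0, 0], [0, 75.0, 0]]
```

A tolerance of 1.0 mirrors the approximate value given for ion trap data; about 0.4 would be the counterpart for Q-TOF data. The peak at 400.0 matches no predicted ion and therefore contributes to no element of C.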

The similarity between the experimental mass spectrum and a theoretical mass spectrum is measured with formula (1); this method is called the RBF-KSDP scoring algorithm.

\sum_{i=1}^{m} \sum_{j=1}^{n} \exp\left(-\gamma \sum_{k=j-l_1}^{j+l_2} (c_{ik} - t_{ik})^2\right) \qquad (1)

where the integers l_1 and l_2 are equal to \lfloor (l-1)/2 \rfloor and \lceil (l-1)/2 \rceil, respectively (the symbols \lfloor \cdot \rfloor and \lceil \cdot \rceil denote rounding down and rounding up, respectively), so that the window from j - l_1 to j + l_2 spans l positions; the integer l (l < n) is the number of consecutive fragment ions to be considered, that is, the correlation window length; and γ is the parameter of the RBF kernel function. For k ≤ 0 or k > n, c_{ik} and t_{ik} are set to 0.
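Formula (1) can be sketched directly in code. This is a minimal illustration, assuming l_1 = ⌊(l−1)/2⌋ and l_2 = ⌈(l−1)/2⌉ (the choice consistent with a window of l positions around j), matrices stored as nested lists, and 0/1 intensities in the example. The function name `rbf_ksdp` and the sample data are hypothetical, and any intensity normalization is left out.

```python
import math


def rbf_ksdp(C, T, l=5, gamma=0.9):
    """Score matrix C (experimental) against matrix T (theoretical) with formula (1)."""
    m, n = len(T), len(T[0])
    l1 = (l - 1) // 2   # floor((l - 1) / 2)
    l2 = l - 1 - l1     # ceil((l - 1) / 2)
    score = 0.0
    for i in range(m):                  # sum over fragment ion types
        for j in range(1, n + 1):       # sum over cleavage positions (1-based)
            dist2 = 0.0
            for k in range(j - l1, j + l2 + 1):   # correlation window around j
                if 1 <= k <= n:         # elements outside 1..n are 0 and add nothing
                    dist2 += (C[i][k - 1] - T[i][k - 1]) ** 2
            score += math.exp(-gamma * dist2)
    return score


# Hypothetical single-row example with six cleavage positions.
T = [[1, 1, 1, 1, 1, 1]]
consecutive = [[1, 1, 1, 0, 0, 0]]  # three adjacent matches
scattered = [[1, 0, 1, 0, 1, 0]]    # three isolated matches

print(round(rbf_ksdp(consecutive, T, l=3, gamma=1.0), 3))  # 2.688
print(round(rbf_ksdp(scattered, T, l=3, gamma=1.0), 3))    # 1.742
```

Both spectra contain three matched ions, yet the run of adjacent matches scores higher because two of its windows agree with the theoretical spectrum at every position.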

Formula (1) is a concrete form of the radial basis function kernel exp(−γ‖x − y‖²). It comprises a summation over the fragment ion types (the sum over subscript i) and a summation over the cleavage positions (the sum over subscript j). In addition, the exponential part of formula (1) contains a summation over k, which runs over a window of length l centered at j. Scoring with formula (1) therefore takes the properties of consecutive fragment ions into account. Consecutive fragment ions are fragment ions of one fragment ion type that lie at consecutive cleavage positions; in Fig. 4, for example, three dashed boxes each enclose one group of consecutive fragment ions (the number of consecutive ions in a dashed box is the l of formula (1)). Consecutive fragment ions occupy consecutive positions within one row of the predicted ion array.
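The correspondence with the generic kernel can be written out explicitly. As a sketch, introduce window vectors for row i and position j; the names x_{ij} and y_{ij} are notation assumed here for illustration and do not appear in the original text.

```latex
x_{ij} = (c_{i,j-l_1}, \ldots, c_{i,j}, \ldots, c_{i,j+l_2}), \qquad
y_{ij} = (t_{i,j-l_1}, \ldots, t_{i,j}, \ldots, t_{i,j+l_2})

\| x_{ij} - y_{ij} \|^2 = \sum_{k=j-l_1}^{j+l_2} (c_{ik} - t_{ik})^2

\sum_{i=1}^{m} \sum_{j=1}^{n} \exp\left(-\gamma \| x_{ij} - y_{ij} \|^2\right)
  = \sum_{i=1}^{m} \sum_{j=1}^{n} \exp\left(-\gamma \sum_{k=j-l_1}^{j+l_2} (c_{ik} - t_{ik})^2\right)
```

Each term reaches its maximum value of 1 only when the experimental and theoretical intensities agree at every position of the window, which is how a run of consecutive matched ions earns more than the same number of isolated matches. For l = 1 the window shrinks to the single element (i, j) and each term depends on one ion only.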

All peptide sequences in the database can be ranked by the RBF-KSDP score between each of them and the experimental mass spectrum, thereby identifying the peptide sequence most likely to have produced the experimental tandem mass spectrum.
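The ranking step can be sketched as follows. This is a minimal illustration, assuming the scores of the candidate peptides have already been computed and collected in a dictionary; the function name `rank_candidates`, the peptide sequences and the score values are hypothetical.

```python
def rank_candidates(scores):
    """Sort candidate peptides by similarity score, highest first.

    scores: dict mapping peptide sequence -> similarity score.
    """
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores of three candidate peptides against one experimental spectrum.
scores = {"GSAK": 1.74, "GASK": 2.69, "AGSK": 1.10}
ranking = rank_candidates(scores)
best_peptide, best_score = ranking[0]
print(best_peptide)  # GASK
```

The top-ranked peptide is reported as the identification result; keeping the full ranking also makes it possible to inspect how far the best candidate is ahead of the runner-up.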

Fig. 5 shows an experimental result obtained with the identification method of the present invention. The horizontal axis of Fig. 5 is the value of γ in formula (1) and the vertical axis is the identification error rate; the curves show how the error rate varies with γ for l = 2 to 6. It can be seen from Fig. 5 that l = 5 and 0.8 ≤ γ ≤ 1 are preferred.

Claims

1. A method for identifying peptides using tandem mass spectrometry data, comprising the steps:

2. The method for identifying peptides using tandem mass spectrometry data according to claim 1, characterized in that it further comprises denoising the experimental tandem mass spectrum.

3. The method for identifying peptides using tandem mass spectrometry data according to claim 1, characterized in that the step of generating the theoretical tandem mass spectra further comprises selecting the fragment ion types.

4. The method for identifying peptides using tandem mass spectrometry data according to claim 1 or 3, characterized in that the exponential part of the radial basis function kernel comprises a summation over consecutive fragment ions.

5. The method for identifying peptides using tandem mass spectrometry data according to claim 4, characterized in that the step of computing the similarity between the theoretical tandem mass spectra and the experimental tandem mass spectrum further comprises:

arranging the theoretical tandem mass spectrum and the experimental tandem mass spectrum into a matrix T and a matrix C, respectively, according to the selected fragment ion types and the cleavage positions of the fragment ions, the consecutive fragment ions occupying consecutive positions within one row of the matrix;

the radial basis function kernel having the form

\sum_{i=1}^{m} \sum_{j=1}^{n} \exp\left(-\gamma \sum_{k=j-l_1}^{j+l_2} (c_{ik} - t_{ik})^2\right),

where c_{ik} and t_{ik} are the elements of matrix C and matrix T, respectively, and are set to 0 when k ≤ 0 or k > n; the integers l_1 and l_2 are equal to \lfloor (l-1)/2 \rfloor and \lceil (l-1)/2 \rceil, respectively;

the integer l is the number of consecutive fragment ions to be considered; and γ is an adjustable parameter.

6. The method for identifying peptides using tandem mass spectrometry data according to claim 5, characterized in that l = 5 and 0.8 ≤ γ ≤ 1.