CN100390537C

CN100390537C - A Method for Predicting the Molecular Formula of Ions Using Isotope Peaks of Fragment Ions in Tandem Mass Spectrometry

Info

Publication number: CN100390537C
Application number: CNB2004100908060A
Authority: CN
Inventors: 高文; 张京芬; 蔡津津; 贺思敏; 曾嵘; 陈润生; 王海鹏
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2004-11-12
Filing date: 2004-11-12
Publication date: 2008-05-28
Anticipated expiration: 2024-11-12
Also published as: CN1773276A

Abstract

The invention discloses a method for predicting the molecular formula of an ion from the isotope peaks of a fragment ion in a tandem mass spectrum. The method obtains the monoisotopic mass of the fragment ion, and the relative abundance of each isotope peak with respect to the monoisotopic peak, both from the tandem mass spectrum and from a general molecular formula in which the number of atoms of each element is undetermined. The masses and relative abundances obtained in these two ways are matched to find a non-negative integer solution for the undetermined atom counts in the general molecular formula, which yields the molecular formula of the fragment ion. The method uses the isotope peak information of fragment ions in the tandem mass spectrum and calculates the molecular formula of a fragment ion from the pattern of its isotope peaks. It can provide accurate molecular formula information for fragment ions, can be used to discriminate among the candidate sequences returned by database search methods for peptide identification, and provides a basis for de novo peptide sequencing methods to generate highly reliable candidate sequences.

Description

A Method for Predicting the Molecular Formula of Ions Using Isotope Peaks of Fragment Ions in Tandem Mass Spectrometry

Technical Field

The invention relates to a proteome analysis method, and in particular to a method for predicting the molecular formula of the fragment ions produced when a peptide sequence is fragmented.

Background Art

In current research on identifying peptide sequences and proteins using peptide mass fingerprinting, tandem mass spectrometry, database searching and de novo sequencing, both the preprocessing of mass spectrometry data and the postprocessing of identification results are very important.

The peptide to be identified is fragmented into fragment ions in the mass spectrometer, and the mass and abundance of these fragment ions are measured by the instrument to form a tandem mass spectrum. Each fragment ion and each of its isotopic ions produces a corresponding peak in the tandem mass spectrum. The isotope peaks of fragment ions can confuse the identification of peptides or proteins. For example, the mass differences between certain amino acid residues are about 0.34, 1 and 1.5 Da, while the mass-to-charge ratio (m/z) differences between adjacent isotope peaks of the same fragment ion in its singly, doubly and triply charged states are 1, 0.5 and 0.333, respectively. Because these residue mass differences overlap with the m/z differences between isotope peaks, the identification process has to decide whether a given peak in the tandem mass spectrum is the peak of one fragment ion or an isotope peak of another. The overlap becomes even more frequent when the masses of several amino acids are summed. One of the traditional data preprocessing tasks is therefore to identify the isotope peaks of a fragment ion and remove them.
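The spacing argument can be illustrated with a short sketch. It assumes only what the paragraph states, namely that adjacent isotope peaks of an ion with charge z lie about 1/z apart in m/z. The function names, the tolerance and the maximum charge are illustrative choices and are not part of the patent.

```python
# Illustrative sketch: function names, tol and max_charge are assumptions.

def isotope_spacing(charge):
    """Approximate m/z gap between adjacent isotope peaks for a given charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return 1.0 / charge


def infer_charge(mz_a, mz_b, max_charge=3, tol=0.02):
    """Return the charge whose isotope spacing matches the gap between two peaks.

    Returns None when the gap matches no charge state up to max_charge,
    in which case the two peaks are unlikely to be adjacent isotope peaks.
    """
    gap = abs(mz_b - mz_a)
    for charge in range(1, max_charge + 1):
        if abs(gap - isotope_spacing(charge)) <= tol:
            return charge
    return None
```

A gap that matches one of these spacings is only a hint: as the text notes, residue mass differences can produce the same gaps, which is why the peak pattern itself carries useful information.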

In fact, however, the distribution pattern of the isotope peaks of a fragment ion in the mass spectrum is closely related to the atomic composition (i.e., the molecular formula) of that fragment ion. A method is therefore needed that uses the isotope peaks of a fragment ion to predict its molecular formula. The predicted molecular formula can, on the one hand, provide more and more accurate information to database search and de novo methods for peptide identification and, on the other hand, provide additional evidence for the postprocessing of identification results.

Summary of the Invention

The object of the present invention is to provide a method for predicting the molecular formula of a fragment ion from the isotope peaks of that fragment ion in a tandem mass spectrum.

To achieve this object, the present invention provides a method for predicting the molecular formula of an ion from the isotope peaks of a fragment ion in a tandem mass spectrum, comprising:

Step 1): obtaining, from the tandem mass spectrum, the peak of the monoisotope of a fragment ion and the peak of at least one of its isotopes, and calculating the mass of the monoisotope of the fragment ion and the relative abundance between the peak of the monoisotope and the peak of the at least one isotope;

Step 2): providing a general molecular formula for the fragment ion, in which the number of atoms of each element is undetermined;

Step 3): using the general molecular formula to obtain the theoretical mass of the monoisotope of the fragment ion and the theoretical relative abundance between the monoisotope of the fragment ion and at least one of its isotopes, the theoretical mass and relative abundance being functions of the undetermined atom counts in the general molecular formula;

Step 4): matching the mass and relative abundance obtained in step 3) against the mass and relative abundance obtained from the tandem mass spectrum in step 1) to obtain a non-negative integer solution for the undetermined number of atoms of each element in the general molecular formula, thereby obtaining the molecular formula of the fragment ion.

In the above technical solution, the at least one isotope of the fragment ion in step 1) and step 3) includes the first isotope and the second isotope of the fragment ion.

In the above technical solution, the monoisotopic mass of the fragment ion obtained in step 1), together with the relative abundance between the monoisotopic peak and the peak of the at least one isotope, forms an experimental isotope distribution vector; the theoretical monoisotopic mass obtained in step 3), together with the theoretical relative abundance between the monoisotope and the at least one isotope, forms a theoretical isotope distribution vector; and the matching in step 4) uses the Euclidean distance between the experimental and theoretical isotope distribution vectors as the matching score.

In the above technical solution, the method further includes constraining the matching with chemical rule constraints that make the resulting molecular formula chemically meaningful.

In the above technical solution, obtaining the non-negative integer solution for the undetermined number of atoms of each element in the general molecular formula through the matching includes: obtaining, through the matching, a real-valued solution for the undetermined number of atoms of each element; and searching in the neighborhood of the real-valued solution for a non-negative integer solution.

In the above technical solution, the method further includes a step of filtering the non-negative integer solutions obtained in step 4). The filtering may include an average isotope distribution pattern method, which filters the non-negative integer solutions using the statistical relationship between the theoretical monoisotopic mass of the fragment ion and the relative abundances of the monoisotope and at least one of its isotopes. The filtering may include filtering the non-negative integer solutions with chemical rule constraints that make the resulting molecular formula chemically meaningful. The filtering may also include cross-validating the non-negative integer solutions of two fragment ions against each other.

The advantages of the present invention are:

1) The method makes full use of the isotope peak information of fragment ions in the tandem mass spectrum;

2) The method can quickly and accurately calculate the molecular formula of a fragment ion from the pattern of its isotope peaks in the tandem mass spectrum (the accuracy depends on the precision of the mass spectrum: the higher the precision, the more reliable the calculated formula);

3) The method can provide accurate molecular formula information for fragment ions and can be used to discriminate among the candidate sequences returned by database search methods for peptide identification;

4) The ion molecular formulas calculated by the method can guide de novo peptide sequencing methods toward highly reliable candidate sequences.

Detailed Description

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Denote the monoisotope of a fragment ion as P, its first isotope as P₁, its second isotope as P₂, and so on, with the N-th isotope denoted P_N. Here, the monoisotope P of the fragment ion is the ion in which every constituent element is present as its monoisotope (that is, with equal numbers of protons and neutrons). An isotope of the fragment ion is an ion with the same molecular formula as the monoisotopic fragment ion but carrying additional neutrons: the first isotope P₁ carries one extra neutron compared with the monoisotope P, the second isotope P₂ carries two extra neutrons, and so on. In the present invention, an isotope of a fragment ion is thus an ion that, taken as a whole, carries extra neutrons relative to the monoisotope of the fragment ion.

A peptide entering the mass spectrometer is ionized, and peptide ions with a specific mass-to-charge ratio (m/z) (which usually also share the same amino acid sequence) are cleaved into multiple fragment ions by collision-induced dissociation (CID). The m/z values of these fragment ions are measured to form a tandem mass spectrum, in which the horizontal axis is the mass-to-charge ratio (m/z) of the fragment ions and the vertical axis is the abundance of the detected fragment ions.

From the tandem mass spectrum, the peak of the monoisotope P of a fragment ion and the peak of at least one of its isotopes P₁ to P_N are selected. The aim of the invention is to predict the molecular formula corresponding to the monoisotope P of the fragment ion from the distribution of these isotope peaks. In one embodiment of the invention, only the monoisotope P of the fragment ion and its first isotope P₁ and second isotope P₂ are selected from the tandem mass spectrum. As those skilled in the art will readily appreciate from the description below, other embodiments may select the peak of only one isotope of the fragment ion (for example the first isotope P₁), or the peaks of more isotopes. The method can be carried out with any of these choices, but the number of isotope peaks selected affects the computational complexity and accuracy.

The ion mass M_e of the monoisotopic fragment ion P can also be obtained from the tandem mass spectrum, as is well known to those skilled in the art.

To simplify the calculations below, first define an experimental isotope distribution vector eIPV = (M_e, I₁, I₂), where M_e is the ion mass of the monoisotope P of the fragment ion obtained from the tandem mass spectrum, and I₁ and I₂ are the relative abundances of the peaks of the first isotope P₁ and the second isotope P₂ with respect to the peak of the monoisotope P. All of these values can be obtained from the tandem mass spectrum.
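As a minimal sketch of this definition, the function below builds eIPV from three picked peaks. The function name and the input layout are hypothetical, and the sketch assumes a singly charged ion so that M_e can be taken directly as the m/z of the monoisotopic peak.

```python
def experimental_ipv(peaks):
    """Build eIPV = (M_e, I1, I2) from the peaks of P, P1 and P2.

    peaks: three (m/z, intensity) pairs ordered as P, P1, P2.
    Assumption: the ion is singly charged, so M_e is the m/z of P.
    """
    if len(peaks) < 3:
        raise ValueError("need the peaks of P, P1 and P2")
    (mz_p, int_p), (_, int_p1), (_, int_p2) = peaks[:3]
    if int_p <= 0:
        raise ValueError("monoisotopic peak intensity must be positive")
    # Abundances of P1 and P2 are expressed relative to the monoisotopic peak.
    return (mz_p, int_p1 / int_p, int_p2 / int_p)
```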

Next, define a theoretical isotope distribution vector tIPV = (M, T₁, T₂), which is obtained from the general molecular formula of the fragment ion. Let the general molecular formula be C_n1 H_n2 N_n3 O_n4 S_n5, where n1 to n5, the numbers of atoms of each element, are undetermined parameters. In tIPV, M is the mass of the fragment ion given by the general molecular formula, and T₁ and T₂ are the relative abundances, given by the general molecular formula, of the first and second isotopic fragment ions with respect to the monoisotopic fragment ion. Specifically, tIPV is given by:

M = V × X  (1)

T₁ = n₁q_C + n₂q_H + n₃q_N + n₄q_O1 + n₅q_S1  (2)

$T_2 = n_4 q_{O2} + n_5 q_{S2} + \frac{1}{2} T_1^2 - \frac{1}{2} (n_1 q_C^2 + n_2 q_H^2 + n_3 q_N^2 + n_4 q_{O1}^2 + n_5 q_{S1}^2) \quad (3)$

where V = [12, 1, 14, 16, 32] holds the atomic weight of each element and X = [n1, n2, n3, n4, n5]^T; q_C, q_H and q_N are the natural relative abundances of ¹³C with respect to ¹²C, D with respect to H, and ¹⁵N with respect to ¹⁴N; q_O1 and q_O2 are the natural relative abundances of ¹⁷O and ¹⁸O with respect to ¹⁶O; and q_S1 and q_S2 are the natural relative abundances of ³³S and ³⁴S with respect to ³²S. All of these relative abundances are known values.

Thus, in the theoretical isotope distribution vector tIPV = (M, T₁, T₂), M, T₁ and T₂ are all functions of X = [n1, n2, n3, n4, n5].
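Formulas (1) to (3) can be evaluated directly for any atom-count vector X. The sketch below is illustrative only: the function name is hypothetical, and the abundance ratios are approximate natural values assumed for the example rather than figures taken from the patent.

```python
# Nominal atomic weights for C, H, N, O, S, as in V = [12, 1, 14, 16, 32].
V = (12, 1, 14, 16, 32)

# Assumed approximate natural abundance ratios (heavier isotope / lightest isotope).
Q_C, Q_H, Q_N = 0.010816, 0.000115, 0.003653   # 13C/12C, D/H, 15N/14N
Q_O1, Q_O2 = 0.000381, 0.002055                # 17O/16O, 18O/16O
Q_S1, Q_S2 = 0.007896, 0.044742                # 33S/32S, 34S/32S

Q1 = (Q_C, Q_H, Q_N, Q_O1, Q_S1)   # +1 isotope ratios, per element
Q2 = (0.0, 0.0, 0.0, Q_O2, Q_S2)   # +2 isotope ratios (only O and S appear in formula (3))


def theoretical_ipv(x):
    """Return tIPV = (M, T1, T2) for x = (n1, n2, n3, n4, n5), per formulas (1)-(3)."""
    m = sum(v * n for v, n in zip(V, x))                       # formula (1)
    t1 = sum(q * n for q, n in zip(Q1, x))                     # formula (2)
    t2 = (sum(q2 * n for q2, n in zip(Q2, x))
          + 0.5 * t1 ** 2
          - 0.5 * sum(n * q * q for q, n in zip(Q1, x)))       # formula (3)
    return (m, t1, t2)
```

A quick sanity check on formula (3): for a single carbon atom T₂ is zero, since one atom cannot carry two extra neutrons through ¹³C, while for two carbon atoms T₂ reduces to q_C², the chance that both atoms are ¹³C relative to both being ¹²C.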

In the present invention, the theoretical isotope distribution vector tIPV = (M, T₁, T₂) is matched against the experimental isotope distribution vector eIPV = (M_e, I₁, I₂) to find the molecular formula that best matches the experimental vector, that is, a non-negative integer solution for the atomic composition vector X = [n1, n2, n3, n4, n5] of the general molecular formula.

In one embodiment of the invention, the Euclidean distance E between the theoretical isotope distribution vector tIPV and the experimental isotope distribution vector eIPV is used as the matching score of tIPV and eIPV:

$E = \sqrt{\delta_m^2 + \delta_1^2 + \delta_2^2} = \sqrt{(M - M_e)^2 + (T_1 - I_1)^2 + (T_2 - I_2)^2} \quad (4)$

Substituting formulas (1) to (3) into (4) gives

δ_m = n₁*12 + n₂*1 + n₃*14 + n₄*16 + n₅*32 − M_e,  (5)

δ₁ = n₁*q_C + n₂*q_H + n₃*q_N + n₄*q_O1 + n₅*q_S1 − I₁,  (6)

$\delta_2 = n_4 q_{O2} + n_5 q_{S2} - \frac{1}{2} (n_1 q_C^2 + n_2 q_H^2 + n_3 q_N^2 + n_4 q_{O1}^2 + n_5 q_{S1}^2) + (n_1 q_C + n_2 q_H + n_3 q_N + n_4 q_{O1} + n_5 q_{S1}) I_1 - \frac{1}{2} I_1^2 - I_2 + \frac{1}{2} \delta_1^2. \quad (7)$
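The step from formula (3) to formula (7) can be spelled out as follows. The shorthand symbols S₁ (the sum in formula (2)) and S₂ (the sum of squared terms in formula (3)) are introduced here only for brevity and do not appear in the patent.

```latex
\begin{aligned}
S_1 &= n_1 q_C + n_2 q_H + n_3 q_N + n_4 q_{O1} + n_5 q_{S1}, \qquad
S_2 = n_1 q_C^2 + n_2 q_H^2 + n_3 q_N^2 + n_4 q_{O1}^2 + n_5 q_{S1}^2, \\
T_1 &= S_1 = \delta_1 + I_1
\;\Rightarrow\;
\tfrac{1}{2} T_1^2 = \tfrac{1}{2}\delta_1^2 + \delta_1 I_1 + \tfrac{1}{2} I_1^2
= \tfrac{1}{2}\delta_1^2 + (S_1 - I_1) I_1 + \tfrac{1}{2} I_1^2
= \tfrac{1}{2}\delta_1^2 + S_1 I_1 - \tfrac{1}{2} I_1^2, \\
\delta_2 &= T_2 - I_2
= n_4 q_{O2} + n_5 q_{S2} - \tfrac{1}{2} S_2 + S_1 I_1 - \tfrac{1}{2} I_1^2 - I_2 + \tfrac{1}{2}\delta_1^2 .
\end{aligned}
```

Every term except the last is linear in the atom counts n₁ to n₅, since I₁ and I₂ are measured constants.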

Ignoring the ½δ₁² term in formula (7), which is the only term that is not linear in X, gives [δ_m δ₁ δ₂]^T = AX + B, and therefore

$Q(X) = E^2 = \begin{bmatrix} \delta_m & \delta_1 & \delta_2 \end{bmatrix} \begin{bmatrix} \delta_m \\ \delta_1 \\ \delta_2 \end{bmatrix} = X^T A^T A X + 2 B^T A X + B^T B, \quad (8)$

so that:

$E = \sqrt{Q(X)} = \sqrt{X^T A^T A X + 2 B^T A X + B^T B} \quad (9)$

In formula (9), X = [n1, n2, n3, n4, n5]^T is the undetermined atomic composition vector of the fragment ion, and A and B are constant matrices built from known quantities. The known quantities are M_e, I₁ and I₂ obtained from the tandem mass spectrum, V = [12, 1, 14, 16, 32], and the relative abundances of the isotopes in formulas (2) and (3).
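To make the roles of A and B concrete, the sketch below assembles them row by row from formulas (5), (6) and the linear part of (7), and evaluates Q(X) as the sum of squared residuals. The function names are hypothetical and the abundance ratios are approximate natural values assumed for illustration.

```python
V = (12, 1, 14, 16, 32)                                   # nominal weights of C, H, N, O, S
Q1 = (0.010816, 0.000115, 0.003653, 0.000381, 0.007896)   # assumed +1 isotope ratios
Q2 = (0.0, 0.0, 0.0, 0.002055, 0.044742)                  # assumed +2 isotope ratios (O, S)


def linearized_system(m_e, i1, i2):
    """Return (A, B) such that [delta_m, delta_1, delta_2] is approximately A X + B.

    Row 0 comes from formula (5), row 1 from formula (6), and row 2 from
    formula (7) with the quadratic term in delta_1 dropped.
    """
    row_m = list(V)
    row_1 = list(Q1)
    row_2 = [q2 - 0.5 * q * q + q * i1 for q, q2 in zip(Q1, Q2)]
    a = [row_m, row_1, row_2]
    b = [-m_e, -i1, -0.5 * i1 * i1 - i2]
    return a, b


def q_value(a, b, x):
    """Evaluate Q(X) = |A X + B|^2, i.e. formula (8)."""
    deltas = [sum(c * n for c, n in zip(row, x)) + off for row, off in zip(a, b)]
    return sum(d * d for d in deltas)
```

When the experimental vector equals the exact theoretical vector of some formula, all three residuals vanish at that formula, so Q is zero there; changing the carbon count by one already contributes 12² to Q through the mass residual alone.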

Minimizing the Euclidean distance E of formula (9) yields a solution for X. To make the resulting molecular formula chemically meaningful, it is usually preferable to impose chemical rule constraints on formula (9) as well, for example:

● The fragment ion mass corresponding to the molecular formula given by X must lie in the range [M_e − δ, M_e + δ], where δ is the maximum m/z error and is determined by the measurement precision of the mass spectrometer. That is, |VX − M_e| ≤ δ must hold.

● For any element in the fragment ion's molecular formula, dividing the m/z of the ion by the mass number of the lightest isotope of that element and taking the integer part gives an upper bound on the number of atoms of that element. For example, the atomic weight of O is 16, so if the mass-to-charge ratio of the ion is m/z, the upper bound on the number of O atoms in the fragment ion is ⌊(m/z)/16⌋, i.e. n4 ≤ ⌊(m/z)/16⌋ in X. Similar constraints can be obtained for the other elements in the fragment ion.

● In a fragment ion, the number of C atoms must be less than the number of H atoms (i.e. n1 < n2 in X), the numbers of O and N atoms must each be less than the number of C atoms (i.e. n4 < n1 and n3 < n1), and so on. These constraints are implicit in the atomic composition of amino acid residues and of the main ion types, and those skilled in the art can readily derive them.

● In a singly charged ion, the sum of the numbers of H and N atoms is odd. The reason is that a singly charged ion has one unsaturated chemical bond, and H and N have odd valences whereas C, O and S have even valences.

It should be understood that those skilled in the art may construct other constraints with the same aim of making the molecular formula of the fragment ion chemically meaningful.
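The example constraints listed above can be checked for a candidate formula with a few comparisons. The following is a minimal sketch, not the patent's implementation: the function name is hypothetical, and it assumes a singly charged ion so that the measured mass M_e stands in for the m/z used in the upper-bound rule.

```python
V = (12, 1, 14, 16, 32)   # nominal weights of C, H, N, O, S


def satisfies_chemical_rules(x, m_e, delta):
    """Check a candidate x = (n1, n2, n3, n4, n5) against the example rules.

    m_e: measured monoisotopic mass (assumed singly charged, so m/z equals mass).
    delta: maximum allowed mass error.
    """
    n1, n2, n3, n4, _ = x
    if any(n < 0 for n in x):
        return False
    # Mass window: |V X - M_e| <= delta.
    if abs(sum(v * n for v, n in zip(V, x)) - m_e) > delta:
        return False
    # Per-element upper bound: integer part of (m/z) / lightest-isotope mass.
    if any(n > int(m_e // v) for n, v in zip(x, V)):
        return False
    # Ordering rules: C < H, O < C, N < C.
    if not (n1 < n2 and n4 < n1 and n3 < n1):
        return False
    # Parity rule for a singly charged ion: H + N is odd.
    return (n2 + n3) % 2 == 1
```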

Some or all of the above constraints, or of other constraints, can be expressed as a linear inequality DX ≤ G. Combined with formula (9), the minimization of the Euclidean distance E can then be solved by standard quadratic programming, as shown in formula (10):

$\min_X E = \sqrt{X^T A^T A X + 2 B^T A X + B^T B} \quad \text{subject to} \quad DX \le G \quad (10)$

The optimal X obtained from formula (10) by quadratic programming is a solution X_R in the real domain. To find the true molecular formula, X_R is taken as a starting point and non-negative integer candidate solutions for X are searched locally in its neighborhood. More precisely, every non-negative integer candidate formula within a distance d of X_R is scored, that is, its degree of match is evaluated with formula (9). The value of d is chosen to suit the ion mass range. This avoids enumerating all possible molecular formulas, so ion formulas can be predicted over a large mass range with high reliability and efficiency.
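The neighborhood search can be sketched as a small enumeration around X_R. This is an illustration under stated assumptions: the neighborhood is taken as a per-coordinate box of half-width d around X_R (the text does not specify the distance measure), and the function name, the `score` callable and the `top` parameter are hypothetical.

```python
import math
from itertools import product


def local_search(x_real, score, d=1, top=5):
    """Enumerate non-negative integer vectors near x_real and rank them by score.

    x_real: real-valued solution X_R, one value per element.
    score:  callable mapping an integer tuple to a number (lower is better),
            for example the distance E of formula (9).
    d:      half-width of the search box around each coordinate (assumed box shape).
    """
    ranges = [
        range(max(0, math.floor(v) - d), math.ceil(v) + d + 1)
        for v in x_real
    ]
    candidates = sorted(product(*ranges), key=score)
    return candidates[:top]
```

In practice `score` would evaluate formula (9) for each candidate; with five elements and a small d the box holds at most a few thousand points, which is why full enumeration of all formulas is avoided.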

Even after the local search, a certain number of candidate molecular formulas remain, including some that are not chemically valid and some that do not match the experimental tandem mass spectrum (referred to as invalid and impossible formulas, respectively). To improve prediction accuracy, as many of these as possible should be eliminated. In the present invention, candidate formulas can be filtered by one or more of the following: the average isotope distribution pattern, chemical rule constraints, and cross-validation. These methods are described below:

A. Average isotope distribution pattern

The average isotope distribution pattern is the statistical relationship among the components M, T₁ and T₂ of the theoretical isotope distribution vector tIPV = (M, T₁, T₂). To find the theoretical average isotope distribution pattern of fragment ions, the inventors calculated the mean and standard deviation of the isotope distributions of the theoretical fragment ions of the peptides produced by trypsin digestion of all proteins in an existing protein database, which revealed the relationships among M, T₁ and T₂. Specifically, the inventors first performed an in silico enzymatic digestion of the proteins in SWISS-PROT to obtain peptides, and then selected peptides with masses in the range 60 u to 3000 u, which corresponds to the standard range of Q-TOF MS/MS experimental spectra. It is also worth noting that the +2 isotope of sulfur, ³⁴S, is highly abundant in nature (its probability of occurrence is 0.04210, about 20 times that of ¹⁸O), while peptides containing five or more S atoms are in most cases very rare. The molecular formulas can therefore be divided into six classes, S⁰, S¹, S², S³, S⁴ and S⁵⁺, corresponding to peptide fragments containing 0, 1, 2, 3, 4, and 5 or more S atoms. The inventors compiled statistics for each of these six classes. The results show that T₁ is linear in the mass M, that T₂ is quadratic in M, and that T₂ increases with T₁ and is a quadratic function of T₁.

The distribution relationships among T₁, T₂ and M described above can thus be used to filter the candidate molecular formulas and exclude invalid and/or impossible ones.

B. Chemical rule constraints

The chemical rule constraints used here are similar to the constraint DX ≤ G in formula (10). The difference is that in formula (10), the constraint DX ≤ G is applied to the formula

$E = \sqrt{Q(X)} = \sqrt{X^T A^T A X + 2 B^T A X + B^T B}$

to obtain a real-domain solution X_R for X under this constraint, whereas here the constraints are applied to the non-negative integer candidate formulas found by searching in the neighborhood of X_R, in order to filter those candidates.

C. Cross-validation

In particular, the b-series fragment ions of a peptide are all homologous, including the b-, a-, b*-, a*-, b°- and a°-type ions. They share the same original amino acid sequence, so their isotope distribution patterns can be expected to be very similar. The same holds for the y-series ions. If the M_e values of two fragment ions in the spectrum differ by 28, 17 or 18, and the I₁ and I₂ values of the two ions are very close, the eIPVs of the two fragment ions can be regarded as homologous. Homologous eIPVs can then be used to cross-validate the prediction results. For example, for two homologous fragment ions, suppose the candidate formula list of one ion contains C_a1 H_a2 N_a3 O_a4 S_a5. If C_(a1−1) H_a2 N_a3 O_(a4−1) S_a5 does not appear in the candidate formula list of the other fragment ion, the candidate C_a1 H_a2 N_a3 O_a4 S_a5 can be regarded as a random match and excluded.
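The cross-validation example can be sketched as a set lookup. The function name and the loss table are hypothetical. The CO entry follows the example in the text (one C and one O fewer, mass difference 28); reading the 17 and 18 differences as NH₃ and H₂O losses is the usual interpretation of b*/b° ions and is an assumption here, not something the text states.

```python
# Neutral losses as (C, H, N, O, S) atom-count differences, keyed by nominal mass.
NEUTRAL_LOSSES = {
    28: (1, 0, 0, 1, 0),   # CO, as in the text's example
    17: (0, 3, 1, 0, 0),   # NH3 (assumed interpretation of the 17 difference)
    18: (0, 2, 0, 1, 0),   # H2O (assumed interpretation of the 18 difference)
}


def cross_validate(candidates_heavy, candidates_light, mass_difference=28):
    """Keep only heavier-ion candidates whose loss product is a lighter-ion candidate."""
    loss = NEUTRAL_LOSSES[mass_difference]
    light = set(candidates_light)
    kept = []
    for cand in candidates_heavy:
        expected = tuple(a - b for a, b in zip(cand, loss))
        if expected in light:
            kept.append(cand)
    return kept
```

Candidates that survive are supported by both ions; those dropped are treated as random matches, in line with the rule described above.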

Claims

1. A method for predicting the molecular formula of an ion from the isotope peaks of a fragment ion in a tandem mass spectrum, comprising:

Step 1): obtaining, from the tandem mass spectrum, the peak of the monoisotope of a fragment ion and the peak of at least one of its isotopes, and calculating the mass of the monoisotope of the fragment ion and the relative abundance between the peak of the monoisotope and the peak of the at least one isotope, as an experimental isotope distribution vector;

Step 2): providing a general molecular formula for the fragment ion, in which the number of atoms of each element is undetermined;

Step 3): using the general molecular formula to obtain the theoretical mass of the monoisotope of the fragment ion and the theoretical relative abundance between the monoisotope and at least one of its isotopes, as a theoretical isotope distribution vector, the theoretical mass and relative abundance being functions of the undetermined atom counts in the general molecular formula;

Step 4): minimizing the Euclidean distance between the theoretical isotope distribution vector obtained in step 3) and the experimental isotope distribution vector obtained from the tandem mass spectrum in step 1) to obtain a real-domain molecular formula of the fragment ion;

Step 5): performing a local search in the neighborhood of the real-domain molecular formula to obtain candidate molecular formulas in the non-negative integer domain, and obtaining the predicted molecular formula of the fragment ion from the candidate molecular formulas;

wherein the at least one isotope of the fragment ion in step 1) and step 3) includes the first isotope and the second isotope of the fragment ion; and

The experimental isotope distribution vector eIPV=(M _e , I ₁ , I ₂ ), wherein, Me _is the ion mass of the monoisotope of the fragment ion obtained from the tandem mass spectrometry, and I ₁ and I ₂ correspond to the ion mass of the fragment ion The relative abundance of the peaks of the first isotope and the second isotope relative to the peaks of the monoisotope;

The theoretical isotope distribution vector tIPV=(M, T ₁ , T ₂ ) is obtained as follows:

$$M = V \times X \tag{1}$$

$$T_1 = n_1 q_C + n_2 q_H + n_3 q_N + n_4 q_{O1} + n_5 q_{S1} \tag{2}$$

$$T_2 = n_4 q_{O2} + n_5 q_{S2} + \frac{1}{2} T_1^2 - \frac{1}{2} \left( n_1 q_C^2 + n_2 q_H^2 + n_3 q_N^2 + n_4 q_{O1}^2 + n_5 q_{S1}^2 \right) \tag{3}$$

wherein the general molecular formula is C_n1H_n2N_n3O_n4S_n5, n1 to n5 being the numbers of atoms of the respective elements in the general molecular formula of the fragment ion; M is the mass of the fragment ion obtained from the general molecular formula; T₁ and T₂ are the relative abundances, with respect to the monoisotopic fragment ion, of the first isotopic fragment ion and the second isotopic fragment ion obtained from the general molecular formula, respectively; V is the row vector composed of the atomic weights of the elements in the general molecular formula; X = [n1, n2, n3, n4, n5]^T; q_C, q_H and q_N are the natural relative abundances of ¹³C relative to ¹²C, D relative to H, and ¹⁵N relative to ¹⁴N, respectively; q_O1 and q_O2 are the natural relative abundances of ¹⁷O relative to ¹⁶O and ¹⁸O relative to ¹⁶O, respectively; and q_S1 and q_S2 are the natural relative abundances of ³³S relative to ³²S and ³⁴S relative to ³²S, respectively.
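As an illustration of equations (1) to (3), the following sketch evaluates tIPV for a given atom-count vector. The function name `tipv` is hypothetical. The numeric constants are assumptions supplied here, not values from the patent: V is taken to hold monoisotopic masses, the abundance ratios are approximate natural values, and the charge carrier and electron mass are ignored.

```python
# Assumed monoisotopic masses (Da) for C, H, N, O, S: the row vector V in Eq. (1).
V = (12.0, 1.00782503, 14.00307401, 15.99491462, 31.97207069)

# Approximate natural abundance ratios (illustrative values, not from the patent).
Q_C, Q_H, Q_N = 0.010816, 0.000115, 0.003653   # 13C/12C, D/H, 15N/14N
Q_O1, Q_O2 = 0.000381, 0.002055                # 17O/16O, 18O/16O
Q_S1, Q_S2 = 0.007895, 0.044741                # 33S/32S, 34S/32S


def tipv(n):
    """Return (M, T1, T2) for atom counts n = (n1, n2, n3, n4, n5).

    The counts may be real-valued, which is what the minimization in
    step 4) needs before the integer search of step 5).
    """
    n1, n2, n3, n4, n5 = n
    q_first = (Q_C, Q_H, Q_N, Q_O1, Q_S1)
    # Eq. (1): M = V x X
    m = sum(v * x for v, x in zip(V, n))
    # Eq. (2): first isotopic peak relative to the monoisotopic peak
    t1 = sum(x * q for x, q in zip(n, q_first))
    # Eq. (3): second isotopic peak relative to the monoisotopic peak
    t2 = (n4 * Q_O2 + n5 * Q_S2 + 0.5 * t1 ** 2
          - 0.5 * sum(x * q * q for x, q in zip(n, q_first)))
    return m, t1, t2


if __name__ == "__main__":
    # Hypothetical composition C10 H15 N3 O4, chosen only as a demonstration input.
    print(tipv((10, 15, 3, 4, 0)))
```

For two carbon atoms and nothing else, equation (3) reduces to q_C², the probability weight of both atoms being ¹³C, which is a quick sanity check on the ½T₁² correction term.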

2. The method for predicting the molecular formula of an ion using the isotope peaks of fragment ions in tandem mass spectrometry according to claim 1, characterized in that the Euclidean distance E between the theoretical isotope distribution vector tIPV and the experimental isotope distribution vector eIPV is used as the matching score of tIPV and eIPV:

$$E = \sqrt{\delta_m^2 + \delta_1^2 + \delta_2^2} = \sqrt{(M - M_e)^2 + (T_1 - I_1)^2 + (T_2 - I_2)^2} \tag{4}$$
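The matching score of equation (4) is a plain, unweighted Euclidean distance over the three vector components. A minimal sketch, with the hypothetical function name `match_score` and both vectors passed as 3-tuples:

```python
import math


def match_score(tipv, eipv):
    """Euclidean distance E between tIPV = (M, T1, T2) and eIPV = (M_e, I1, I2)."""
    m, t1, t2 = tipv
    m_e, i1, i2 = eipv
    return math.sqrt((m - m_e) ** 2 + (t1 - i1) ** 2 + (t2 - i2) ** 2)


if __name__ == "__main__":
    # Made-up vectors, used only to show the call shape.
    print(match_score((3.0, 0.0, 0.0), (0.0, 4.0, 0.0)))
```

A smaller E means a better match; E is zero only when the theoretical mass and both relative abundances coincide with the measured ones.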

3. The method for predicting the molecular formula of an ion using the isotope peaks of fragment ions in tandem mass spectrometry according to claim 1 or 2, characterized in that step 4) further comprises using chemical rules, which ensure that the obtained molecular formula is chemically meaningful, as constraints on the minimization of the Euclidean distance.

4. The method for predicting the molecular formula of an ion using the isotope peaks of fragment ions in tandem mass spectrometry according to claim 1, characterized in that the local search in step 5) refers to searching for candidate molecular formulas in the non-negative integer domain within a certain distance of the real-domain molecular formula.
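One way to read the local search of claim 4 is as an enumeration of all non-negative integer atom-count vectors lying within a fixed distance of the real-valued solution. The sketch below assumes a Euclidean ball of radius `radius`; the claim does not fix the distance measure or the range, so both the name `local_search` and the default radius are illustrative choices.

```python
import math
from itertools import product


def local_search(real_formula, radius=1.0):
    """List non-negative integer vectors within `radius` of `real_formula`."""
    ranges = []
    for x in real_formula:
        lo = max(0, math.ceil(x - radius))  # atom counts cannot be negative
        hi = math.floor(x + radius)
        ranges.append(range(lo, hi + 1))
    candidates = []
    # Enumerate the bounding box, then keep the points inside the ball.
    for cand in product(*ranges):
        dist = math.sqrt(sum((c - x) ** 2 for c, x in zip(cand, real_formula)))
        if dist <= radius:
            candidates.append(cand)
    return candidates


if __name__ == "__main__":
    # Made-up real-domain solution in (C, H, N, O, S) order.
    print(local_search((2.4, 3.6, 1.0, 1.0, 0.0), radius=0.7))
```

The surviving candidates would then be scored against eIPV and passed to the filtering steps of the later claims.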

5. The method for predicting the molecular formula of an ion using the isotope peaks of fragment ions in tandem mass spectrometry according to claim 1, characterized in that step 5) further comprises a step of filtering the candidate molecular formulas in the non-negative integer domain.

6. The method for predicting the molecular formula of an ion using the isotope peaks of fragment ions in tandem mass spectrometry according to claim 5, characterized in that the filtering step comprises an average isotope distribution pattern method, which filters the candidate molecular formulas in the non-negative integer domain using the statistical relationship between the theoretical monoisotopic mass of the fragment ion and the relative abundance of at least one isotope thereof with respect to the monoisotope.

7. The method for predicting the molecular formula of an ion using the isotope peaks of fragment ions in tandem mass spectrometry according to claim 5, characterized in that the filtering step comprises filtering the candidate molecular formulas in the non-negative integer domain using chemical rule constraints that ensure the obtained molecular formula is chemically meaningful.

8. The method for predicting the molecular formula of an ion using the isotope peaks of fragment ions in tandem mass spectrometry according to claim 5, characterized in that the filtering step comprises cross-validating the candidate molecular formulas, in the non-negative integer domain, of two fragment ions so as to filter the candidate molecular formulas of both fragment ions.