CN114166925B

CN114166925B - Denovo method and system for identifying N-sugar chain structure based on mass spectrum data

Info

Publication number: CN114166925B
Application number: CN202111235025.6A
Authority: CN
Inventors: 张军英; 杨芝; 吴金辉; 刘继源; 孙士生
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-10-22
Filing date: 2021-10-22
Publication date: 2024-03-26
Anticipated expiration: 2041-10-22
Also published as: CN114166925A

Abstract

The invention belongs to the technical field of glycomics and discloses a de novo method and system for identifying N-sugar chain structures based on mass spectrum data. The method comprises: extracting the structure and composition information of sugar chain fragment ions in the mass spectrum data; identifying the N-sugar chain based on basic peaks, cross peaks and a generalized monosaccharide dictionary; and reducing the search space of candidate structures with a pruning strategy to obtain the N-sugar chain structure corresponding to the mass spectrum. Based on N-sugar chain mass spectrum data and the idea of de novo sequencing, the invention extracts the structure and composition information of sugar chain fragment ions to identify the N-sugar chain structure corresponding to the mass spectrum. In the identification process, basic peaks and cross peaks are introduced and the N-sugar chain is identified on their basis; a generalized monosaccharide dictionary is introduced, which improves spectrum quality and the robustness of the identification method to noise in the mass spectrum data; and a pruning strategy reduces the search space of candidate structures. The invention thereby improves the quality of mass-spectrum-based identification.

Description

Denovo method and system for identifying N-sugar chain structure based on mass spectrum data

Technical Field

The invention belongs to the technical field of glycomics, and particularly relates to a de novo method and system for identifying N-sugar chain structures based on mass spectrum data.

Background

Protein glycosylation is a post-translational modification that is prevalent in organisms, and the N-sugar chain structure largely determines the biological function of a glycoprotein. With the rapid advance of mass spectrometry, identifying sugar chain structures from mass spectrometry data has become an important way of understanding the biological functions of glycoproteins.

An N-sugar chain is a tree-like structure with a fixed pentasaccharide core, and current methods for identifying N-sugar chain structures fall into three types: 1) database searching; 2) de novo sequencing; 3) the tag method. The tag method is a combination of database searching and de novo sequencing. Database searching and de novo sequencing are described separately below.

1. Database searching: with reference to databases such as GlycoSearchMS, GlycoPep DB and GlyDB, the glycopeptide mass spectrum of unknown structure is matched for similarity against real spectra annotated with sugar chain structures, yielding a similarity score, and the best-matching sugar chain structure is taken as the identification result. Algorithms based on this method include GRIP, ArMone 2.0, GlycoPep Detector, Byonic, Protein Prospector and pGlyco 2.0.

2. De novo sequencing: a de novo sequencing method typically consists of two processes, namely enumerating possible sugar chain structures and evaluating these candidate structures, with the highest-scoring sugar chain structure taken as the identification result. The ideal enumeration procedure generates as few candidate structures as possible for further evaluation without missing the target sugar chain structure.

The current de novo sequencing methods are mainly divided into three categories:

The first category is exhaustive search: given the precursor ion mass of the glycopeptide under study, the monosaccharide composition of the sugar chain can be readily calculated with a knapsack algorithm. STAT, StrOligo, OSCAR and similar tools enumerate all possible branching structures matching the monosaccharide composition. Since the number of candidate sugar chain structures increases exponentially with the number of monosaccharides, this strategy is only used to identify sugar chains with up to ten monosaccharide residues.

Constraining the candidate sugar chain structures with biosynthesis rules can greatly reduce the search space, but the biosynthesis rules that govern sugar chain formation are not completely known, which limits the general applicability of this approach.

The second category is heuristic search: the problem of generating candidate sugar chains has been proven NP-hard under the condition that each peak in the spectrum can be used only once. A number of heuristic methods address this; for example, each peak position is kept in only a limited number of substructures, which reduces the computational complexity and saves time and space. Prior art 1 proposes reconstructing the sugar chain structure stepwise and keeping a fixed number of high-quality structures in each iteration. Prior art 2 proposes an exact algorithm based on a fixed-parameter approach in which the parameter is the number of peaks: for mass spectra with a large number of peaks, only the k most intense peaks need to be restricted to a single use, while the other peaks can be used multiple times.

The third category is based on dynamic programming: like de novo peptide sequencing, GLYCH uses dynamic programming to find the most likely branching structure from tandem mass spectra, but it is only applicable to MS/MS spectra of released sugar chains and cannot process glycopeptide data. Prior art 3 formulates candidate structure generation as an integer linear programming problem and then uses dynamic programming to infer the most likely structure. To keep the computation manageable, dynamic programming methods typically return a fixed number of highest-scoring structures; for example, GLYCH reports up to 200 candidate structures for subsequent evaluation.

The advantage of de novo sequencing over database searching is that new sugar chain structures not included in any database can be identified, which is of great research value. Its drawback is that it requires high-quality mass spectra, whereas in practice various factors degrade spectrum quality; for example, the spectral peaks of fragment ions are often missing over consecutive positions, so the spectrum utilization rate is lower than that of database searching. Nevertheless, with the development of mass spectrometry, the method has promising research prospects in sugar chain structure identification.

Through the above analysis, the problems and defects existing in the prior art are as follows:

(1) Current database search-based methods are not able to identify unrecorded structures.

(2) Current de novo sequencing methods are strongly affected by noise in the mass spectrum data, so the identified structures are not robust.

The difficulty of solving the problems and the defects is as follows:

(1) Diversity in the N-sugar chain scale. Some N-sugar chains have a small number of monosaccharides, and some have a large number, i.e., the number of monosaccharides contained therein varies widely, and generally the larger the size of the sugar chain, the more difficult it is to identify it;

(2) Diversity of N-sugar chain structures. Although the N-sugar chain has a tree structure, its composition and the position of each monosaccharide on the sugar chain may be varied, which presents a great challenge for accurately identifying the sugar chain structure from mass spectrometry data;

(3) The stability of N-sugar chain structures varies. Different substructures differ in structural stability, that is, some fragment easily and some do not, and the stability of each substructure is unknown, which makes identifying the N-sugar chain structure from mass spectrum data difficult;

(4) The glycopeptides fed into the mass spectrometer are not necessarily pure; the sample may in fact be a mixture of several glycopeptides, which interferes with the identification of the enriched glycopeptide;

(5) The measurement noise of the mass spectrometer, isotope effects and the like also interfere with the identification of the sugar chain structure. These problems are usually handled by preprocessing, but the quality of the preprocessing of the mass spectrum data directly affects the performance of the subsequent identification algorithm, so high-quality preprocessing is an important guarantee of identification quality.

The significance of solving the above problems and defects is as follows:

On the basis of conventional preprocessing of N-glycopeptide mass spectrum data, the invention develops a de novo method and system for identifying N-sugar chain structures that is robust to the noise present in mass spectrum data.

The main challenge of glycomics is the characterization of complex glycan structures, which is crucial for understanding their role in biological processes. The sugar chains formed during glycosylation participate in the regulation of life activities of organisms; glycosylation can enhance the resistance of the modified protein to proteases, influence protein-protein interactions, and affect the spatial structure, biological activity, transport, localization and function of proteins. In some life activities, structural changes of the sugar chain attached to the peptide chain are an important cause of disease, and glycosylation also plays an important role in the solubility, stability and efficacy of many biopharmaceuticals. Accurately identifying the sugar chains produced by glycosylation is therefore of great significance for understanding the regulation of life activities, discovering the causes of disease, treating disease and designing drugs.

Disclosure of Invention

In view of the problems existing in the prior art, the invention provides a de novo method and system for identifying N-sugar chain structures based on mass spectrum data.

The invention is realized as follows: a de novo method for identifying N-sugar chain structures based on mass spectrum data comprises:

extracting the structure and composition information of sugar chain fragment ions in the mass spectrum data, identifying the N-sugar chain based on basic peaks, cross peaks and a generalized monosaccharide dictionary, and reducing the search space of candidate structures with a pruning strategy to obtain the N-sugar chain structure corresponding to the mass spectrum.

Further, the de novo method for identifying N-sugar chain structures based on mass spectrum data comprises the following steps:

Step one: read the mass spectrum data produced by the mass spectrometer and extract the data relevant to identification; convert the mass-to-charge ratio m/z of the mass spectrum into mass m by preprocessing the spectrum; judge whether a pentasaccharide core exists using the pentasaccharide-core-related spectral peak criterion, and if so, go to step two. Main function: preprocess the data and judge whether the spectrum is an N-sugar chain spectrum, since every N-sugar chain has a pentasaccharide core structure;

Step two: based on the monosaccharides and generalized monosaccharides, correct the masses of spectral peaks with monosaccharide characteristics in the mass spectrum to their theoretical masses. Main function: tolerate missing spectral peaks and the mass measurement errors of the mass spectrometer, thereby enhancing the robustness of the identification result to noise;

Step three: initialize the sugar chain structure as the root node of a tree; the sugar chain grows monosaccharides from this initial structure according to a certain rule; while growing, calculate the basic peaks and cross peaks of the structure after each monosaccharide is grown, and generate the theoretical mass spectrum of the structure from the calculated basic peaks and cross peaks. Main function: continuously expand the sugar chain structure tree through the growth process;

Step four: filter isomorphic structures out of the grown structures with a pruning strategy to obtain the N-sugar chain structure identification results; score and evaluate the identification results, and take the top-scoring structure as the identified sugar chain structure. Main function: filter out duplicate structures and score each remaining structure.

Further, in step one, the relevant data include: the sugar chain mass (glycan mass), the peptide chain mass (peptide mass), and the spectral peaks acquired at low energy (low-energy peaks).

Further, in the second step, correcting the mass of the spectral peak having monosaccharide characteristics in the mass spectrum to theoretical mass based on the monosaccharide and the generalized monosaccharide includes:

(1) Calculate the mass difference Δm of adjacent spectral peaks. If the mass of some monosaccharide or generalized monosaccharide lies in the range [Δm − δ, Δm + δ], the mass difference is matched with that monosaccharide or generalized monosaccharide, where δ is the correction tolerance with a value of 0.2;

(2) Update the mass difference to the mass of the matched monosaccharide or generalized monosaccharide, and recalculate the peak mass by adding that mass to obtain the corrected mass; the corrected mass of a spectral peak with monosaccharide characteristics is its theoretical mass.
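The correction in (1) and (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the residue masses are standard monoisotopic values that the text does not list, the names `MONO`, `mass_dictionary` and `correct_peaks` are hypothetical, generalized monosaccharides are limited to two residues, and the first peak is assumed to be an already correct reference.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da (standard values; an assumption, not from the text).
MONO = {"dHex": 146.0579, "Hex": 162.0528, "HexNAc": 203.0794,
        "NeuAc": 291.0954, "NeuGc": 307.0903}

def mass_dictionary(order=2):
    """Masses of all monosaccharides and generalized monosaccharides of up to `order` residues."""
    masses = set()
    for n in range(1, order + 1):
        for combo in combinations_with_replacement(MONO, n):
            masses.add(round(sum(MONO[g] for g in combo), 4))
    return sorted(masses)

def correct_peaks(masses, delta=0.2, order=2):
    """Correct adjacent mass differences to dictionary masses when within +/- delta."""
    table = mass_dictionary(order)
    observed = sorted(masses)
    corrected = [observed[0]]  # first peak is taken as the reference (assumption)
    for prev, cur in zip(observed, observed[1:]):
        diff = cur - prev
        # Dictionary entries lying in [diff - delta, diff + delta].
        hits = [m for m in table if abs(m - diff) <= delta]
        if hits:
            best = min(hits, key=lambda m: abs(m - diff))
            corrected.append(round(corrected[-1] + best, 4))
        else:
            corrected.append(cur)  # no monosaccharide characteristic: keep as measured
    return corrected

print(correct_peaks([0.0, 203.1, 406.0, 771.3]))
```

In this toy input the last gap of about 365.3 matches no single monosaccharide but does match a two-residue generalized monosaccharide (Hex plus HexNAc), which is how a missing intermediate peak is tolerated.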

Further, in step three, the rule by which the sugar chain structure grows monosaccharides is: each time, try to grow the lightest monosaccharide first.

Further, the basic peak and the cross peak are defined as follows:

Basic peak: a peak in spectrum S that is associated with only a single monosaccharide path. If the sums of the monosaccharide masses on each monosaccharide path of the sugar chain G corresponding to the spectrum S are denoted $b_1, b_2, \dots, b_{k_b}$, and $m = b_i$ for some $1 \le i \le k_b$, then the peak at mass m in the spectrum S is a basic peak;

Cross peak: a peak in S that involves two or more monosaccharide paths. If the sums of the monosaccharide masses over any two or more monosaccharide paths of G are denoted $c_1, c_2, \dots, c_{k_c}$, and $m = c_i$ for some $1 \le i \le k_c$, then the peak at mass m in the spectrum S is a cross peak;

A monosaccharide path is the set of monosaccharides on the path from the root node to any node in the tree structure.
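To make the two definitions concrete, the sketch below computes both kinds of peak for a toy tree shaped like the pentasaccharide core. It is one possible reading under stated assumptions: a cross peak is taken to be the mass of the union of two or more root paths that is not itself a single path, the tree encoding `CORE` and all function names are hypothetical, and the residue masses are standard values. The closure over unions is exponential in general, so this is only suitable for small examples.

```python
from itertools import combinations

MONO = {"Hex": 162.0528, "HexNAc": 203.0794}  # residue masses in Da (standard values)

# Hypothetical toy tree (a pentasaccharide core): node id -> (type, parent id).
# Id 0 is the mass-0 root and carries no monosaccharide.
CORE = {1: ("HexNAc", 0), 2: ("HexNAc", 1), 3: ("Hex", 2), 4: ("Hex", 3), 5: ("Hex", 3)}

def root_path(tree, node):
    """Monosaccharide path: ids of all monosaccharides from the root down to `node`."""
    path = []
    while node != 0:
        path.append(node)
        node = tree[node][1]
    return frozenset(path)

def total_mass(tree, nodes):
    """Sum of residue masses of the given nodes, rounded to 4 decimals."""
    return round(sum(MONO[tree[n][0]] for n in nodes), 4)

def basic_and_cross_peaks(tree):
    paths = {root_path(tree, n) for n in tree}
    # Close the set of paths under union; every newly created set joins two or more paths.
    closed = set(paths)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(closed), 2):
            if (a | b) not in closed:
                closed.add(a | b)
                changed = True
    basic = sorted({total_mass(tree, p) for p in paths})
    cross = sorted({total_mass(tree, u) for u in closed - paths})
    return basic, cross

basic, cross = basic_and_cross_peaks(CORE)
print("basic:", basic)
print("cross:", cross)
```

For this tree the two Hex branches give the same path mass, so four distinct basic masses result, and the only cross peak is the mass of the complete core.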

Further, in the third step, the theoretical mass spectrum generating method includes:

The theoretical mass spectrum is initialized as empty, which corresponds to a spectral peak at mass 0 of the theoretical mass spectrum and to the root node of the sugar chain tree structure.

Each time a monosaccharide is grown on the sugar chain tree structure, one unit of abundance is added at the corresponding mass of the theoretical mass spectrum and marked as a basic spectral peak, and one unit of abundance is added at the mass of each cross peak produced by growing the monosaccharide and marked as a cross spectral peak.

If several unit-intensity basic spectral peaks, cross spectral peaks, or both fall on the same mass of the theoretical mass spectrum, the abundance at that mass is the superposition of these marked spectral peaks.
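The accumulation rule can be illustrated with a small sketch. The function `grow` and the two containers are hypothetical names; the masses are the pentasaccharide-core-related masses quoted elsewhere in this document, and the cross-peak masses are passed in by hand rather than derived from a tree.

```python
from collections import Counter, defaultdict

def grow(spectrum, labels, basic_mass, cross_masses=()):
    """Record one grown monosaccharide in the theoretical spectrum."""
    spectrum[basic_mass] += 1          # one unit of abundance for the basic peak
    labels[basic_mass].add("basic")
    for m in cross_masses:             # one unit per cross peak created by this growth
        spectrum[m] += 1
        labels[m].add("cross")

# Empty spectrum: corresponds to the root node at mass 0.
spectrum, labels = Counter(), defaultdict(set)

# Growth of a pentasaccharide core: HexNAc, HexNAc, Hex, Hex, Hex.
grow(spectrum, labels, 203.0794)
grow(spectrum, labels, 406.1588)
grow(spectrum, labels, 568.2116)
grow(spectrum, labels, 730.2644)
grow(spectrum, labels, 730.2644, cross_masses=[892.3172])

for mass in sorted(spectrum):
    print(mass, spectrum[mass], sorted(labels[mass]))
```

The two Hex branches land on the same mass, so the abundance there is 2, which is the superposition described above.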

Further, in the third step, the growing of the monosaccharide further includes:

when the sugar chain structure has grown to the pentasaccharide core structure and a monosaccharide g is to be grown further, the following conditions must be satisfied:

the mass position at which monosaccharide g is grown has abundance in the mass spectrum to be identified; the proportion of the cross-peak mass positions required for growing the monosaccharide that have abundance in the mass spectrum to be identified is at least θ% (θ = 20), where θ% represents the support of the cross terms; the monosaccharide composition of the structure after growing the monosaccharide does not exceed that of the target sugar chain; and the grown structure satisfies the constraints of the biological rules below. If these conditions hold, both cases, growing the monosaccharide and not growing it, are examined on the basis of the existing structure. The constraints of the biological rules include:

△ (dHex) monosaccharides do not grow further;

the degree of the sugar chain tree is at most 4;

each monosaccharide has lost one water molecule, so the water mass is removed from its mass.
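The growth conditions and biological rules above can be gathered into a single predicate, sketched below under several assumptions: the spectrum is a plain mapping from mass to abundance, `tol` is an assumed matching tolerance that the text does not specify, compositions are `Counter` objects, a growth step with no cross peaks is treated as trivially supported, and all function names are hypothetical. Water loss is already reflected in the residue masses.

```python
from collections import Counter

# Residue masses in Da (standard values; water already removed).
MONO = {"dHex": 146.0579, "Hex": 162.0528, "HexNAc": 203.0794,
        "NeuAc": 291.0954, "NeuGc": 307.0903}

def has_abundance(spectrum, mass, tol=0.01):
    """True if the spectrum (mass -> abundance) has abundance at `mass`."""
    return any(abs(m - mass) <= tol and a > 0 for m, a in spectrum.items())

def may_grow(spectrum, g, basic_mass, cross_masses, composition, target,
             parent_type, parent_children, theta=20, max_degree=4):
    """Check the conditions for attaching monosaccharide `g` to a parent node."""
    # 1. The mass position reached by growing g must have abundance.
    if not has_abundance(spectrum, basic_mass):
        return False
    # 2. At least theta percent of the required cross-peak positions must have abundance.
    if cross_masses:
        supported = sum(has_abundance(spectrum, m) for m in cross_masses)
        if 100.0 * supported / len(cross_masses) < theta:
            return False
    # 3. The composition after growth must not exceed the target composition.
    grown = composition + Counter([g])
    if any(grown[k] > target[k] for k in grown):
        return False
    # 4. Biological rules: nothing grows on dHex; at most max_degree children per node.
    if parent_type == "dHex":
        return False
    return parent_children < max_degree

def candidates_lightest_first():
    """Monosaccharide types ordered by increasing mass (lightest tried first)."""
    return sorted(MONO, key=MONO.get)
```

A caller would iterate over `candidates_lightest_first()` for each attachment point and branch into the "grow" and "do not grow" cases whenever `may_grow` returns true.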

Further, the filtering of the structure obtained by using the pruning strategy comprises:

1) Judge whether the structure has the pentasaccharide core structure and whether it satisfies the biological rules; if not, the structure is filtered out;

2) Calculate the hash value of each structure; structures with equal hash values are judged to be isomorphic, and only one of the isomorphic structures is retained.

Further, the hash value calculation formula of the structure is as follows:

$Hs(x) = \left(\sum_i Hs(Son_x[i])^2\right) + Offset_x$;

where Hs(x) denotes the hash value of the tree rooted at node x, Son_x[i] denotes the i-th child node of root node x, Offset_x denotes the weight value of node x itself, and codeValue_x denotes the code value of node x; node x is one of 5 types in total, {Hex, HexNAc, NeuAc, NeuGc, dHex}, and maxbankh = 4 denotes the maximum degree of a node in the tree.
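The recursion can be sketched as below. The text does not show how Offset_x is derived from the code value and the maximum degree, so this example simply assumes Offset_x equals a per-type code value; the table `CODE`, the nested-tuple tree encoding and the function names are all hypothetical.

```python
# Hypothetical code values for the five monosaccharide types (an assumption).
CODE = {"Hex": 1, "HexNAc": 2, "NeuAc": 3, "NeuGc": 4, "dHex": 5}

def hs(node):
    """Hs(x) = sum of squared child hashes + Offset_x, with Offset_x assumed to be CODE[type].

    A node is a tuple (type, [children]).
    """
    kind, children = node
    return sum(hs(child) ** 2 for child in children) + CODE[kind]

def deduplicate(structures):
    """Keep one structure per hash value."""
    seen = {}
    for s in structures:
        seen.setdefault(hs(s), s)
    return list(seen.values())

a = ("HexNAc", [("Hex", []), ("dHex", [])])
b = ("HexNAc", [("dHex", []), ("Hex", [])])   # same tree, children in another order
c = ("HexNAc", [("Hex", [("dHex", [])])])     # different topology

print(hs(a), hs(b), hs(c))
print(len(deduplicate([a, b, c])))
```

Because the child hashes enter through a sum, the value does not depend on child order, which is what makes it usable for detecting isomorphic trees. With this simplified offset, different trees could in principle collide, so a practical implementation might confirm equality structurally before discarding a candidate.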

Another object of the present invention is to provide an application of the de novo method for identifying N-sugar chain structures based on mass spectrum data in the identification of sugar chain structures.

Combining all the above technical solutions, the advantages and positive effects of the invention are as follows: based on N-sugar chain mass spectrum data and the idea of de novo sequencing, the invention extracts the structure and composition information of sugar chain fragment ions in the mass spectrum data to identify the N-sugar chain structure corresponding to the mass spectrum. In the identification process, basic peaks and cross peaks are introduced and the N-sugar chain structure is identified on their basis; a generalized monosaccharide dictionary is introduced, which improves spectrum quality and the robustness of the identification method to noise in the spectrum; and a pruning strategy reduces the search space of candidate structures. The invention thereby improves the quality of mass-spectrum-based identification.

The method introduces generalized monosaccharides to improve spectrum quality, introduces basic peaks and cross peaks and builds a theoretical mass spectrum to improve the identification quality of N-sugar chains, uses hashing to remove redundant structures and improve identification efficiency, and evaluates the identified structures by scoring the match between the peaks of the measured mass spectrum and those of the theoretical mass spectrum, thereby achieving efficient, high-quality identification of N-sugar chain structures.

On the basis of improved spectrum quality, the invention can identify sugar chain structures that are not yet recorded in sugar chain databases and can reasonably evaluate the identification results.

Drawings

Fig. 1 is a schematic diagram of the de novo method for identifying N-sugar chain structures based on mass spectrum data according to an embodiment of the present invention.

Fig. 2 is a flowchart of the de novo method for identifying N-sugar chain structures based on mass spectrum data according to an embodiment of the present invention.

FIG. 3 shows the N-sugar pentasaccharide core structure provided by the example of the present invention.

Fig. 4 is a raw mass spectrum provided by an embodiment of the present invention.

Fig. 5 is the mass spectrum obtained after preprocessing the original mass spectrum, provided by an embodiment of the present invention.

Fig. 6 is a corrected mass spectrum provided by an embodiment of the present invention.

Fig. 7 shows an example of a sugar chain structure calculated from basic peaks and cross peaks, provided by an embodiment of the present invention.

Fig. 8 shows the ten top-ranked identification results and their scores, provided by an embodiment of the present invention.

Fig. 9 (a) is a comparison of the mass spectrum to be identified with the theoretical mass spectrum of the top-scoring identified sugar chain, provided by an embodiment of the present invention.

Fig. 9 (b) is the theoretical mass spectrum of the top-scoring identification result, provided by an embodiment of the present invention.

Fig. 10 is a graph of the decoy score distribution according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

In view of the problems existing in the prior art, the invention provides a de novo method and system for identifying N-sugar chain structures based on mass spectrum data, which is described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the de novo method for identifying N-sugar chain structures based on mass spectrum data provided by the embodiment of the invention comprises:

extracting the structure and composition information of sugar chain fragment ions in the mass spectrum data, identifying the N-sugar chain based on basic peaks, cross peaks, the monosaccharide dictionary and the generalized monosaccharide dictionary, reducing the search space of candidate structures with a pruning strategy, and finally scoring the candidate structures to obtain the N-sugar chain structure corresponding to the mass spectrum to be identified.

As shown in Fig. 2, the de novo method for identifying N-sugar chain structures based on mass spectrum data provided by the embodiment of the invention comprises the following steps:

S101: read the mass spectrum data produced by the mass spectrometer and extract the data relevant to identification; convert the mass-to-charge ratio m/z of the mass spectrum into mass m by preprocessing the mass spectrum data; judge whether a pentasaccharide core (whose structure is shown in Fig. 3) exists using the pentasaccharide-core-related spectral peak criterion, and if so, go to step S102;

S102: correct the spectral peaks of the mass spectrum having a pentasaccharide core based on the monosaccharides and generalized monosaccharides, so that the masses of spectral peaks with monosaccharide characteristics equal their theoretical masses;

S103: initialize the sugar chain structure as the root node of a tree; the sugar chain grows monosaccharides from this initial structure according to a certain rule; while growing, calculate the basic peaks and cross peaks of the structure after each monosaccharide is grown, and generate the theoretical mass spectrum of the structure from the calculated basic peaks and cross peaks;

S104: filter isomorphic structures out of the grown structures with a pruning strategy to obtain the N-sugar chain structure identification results; score and evaluate the identification results against the theoretical mass spectrum, and take the top-scoring structure as the identified N-sugar chain structure.

The relevant data provided by the embodiment of the invention include: the sugar chain mass (glycan mass), the peptide chain mass (peptide mass), and the spectral peaks acquired at low energy (low-energy peaks).

The spectrum peak correction of the mass spectrum with the pentasaccharide core based on the generalized monosaccharide provided by the embodiment of the invention comprises the following steps:

(1) Calculate the mass difference Δm of adjacent spectral peaks. If the mass of some monosaccharide or generalized monosaccharide lies in the range [Δm − δ, Δm + δ], the mass difference is matched with that monosaccharide or generalized monosaccharide, where δ is the correction tolerance with a value of 0.2;

(2) Update the mass difference to the mass of the matched monosaccharide or generalized monosaccharide, and recalculate the peak mass by adding that mass to obtain the corrected mass; the corrected mass of a spectral peak with monosaccharide characteristics is its theoretical mass.

The rule by which the sugar chain structure grows monosaccharides, provided by the embodiment of the invention, is: each time, try to grow the lightest monosaccharide first.

The basic peak and the cross peak provided by the embodiment of the invention comprise:

Basic peak: a peak in spectrum S that is associated with only a single monosaccharide path. If the sums of the monosaccharide masses on each monosaccharide path of the sugar chain G corresponding to the spectrum S are denoted $b_1, b_2, \dots, b_{k_b}$, and $m = b_i$ for some $1 \le i \le k_b$, then the peak at mass m in the spectrum S is a basic peak;

Cross peak: a peak in S that involves two or more monosaccharide paths. If the sums of the monosaccharide masses over any two or more monosaccharide paths of G are denoted $c_1, c_2, \dots, c_{k_c}$, and $m = c_i$ for some $1 \le i \le k_c$, then the peak at mass m in the spectrum S is a cross peak;

The theoretical mass spectrum generation method provided by the embodiment of the invention comprises the following steps:

The growth of the monosaccharide provided by the embodiment of the invention further comprises the following steps:

the mass position at which monosaccharide g is grown has abundance in the mass spectrum to be identified; the proportion of the cross-peak mass positions required for growing the monosaccharide that have abundance in the mass spectrum to be identified is at least θ% (θ = 20), where θ% represents the support of the cross terms; the monosaccharide composition of the structure after growing the monosaccharide does not exceed that of the target sugar chain; and the grown structure satisfies the constraints of the biological rules below. If these conditions hold, both cases, growing the monosaccharide and not growing it, are examined on the basis of the existing structure.

The constraints of the biological rules include: △ (dHex) monosaccharides do not grow further; the degree of the sugar chain tree is at most 4 (since a monosaccharide has at most five bonds, one of which links to its parent node, a single node of the tree structure can grow at most 4 branches); each monosaccharide has lost one water molecule, so the water mass is removed from its mass.

The filtering of the obtained structure by utilizing the pruning strategy provided by the embodiment of the invention comprises the following steps:

The hash value calculation formula of the structure provided by the embodiment of the invention is as follows:

$Hs(x) = \left(\sum_i Hs(Son_x[i])^2\right) + Offset_x$;

where Hs(x) denotes the hash value of the tree rooted at node x, Son_x[i] denotes the i-th child node of root node x, Offset_x denotes the weight value of node x itself, and codeValue_x denotes the code value of node x; node x is one of 5 types in total, {Hex, HexNAc, NeuAc, NeuGc, dHex}, and maxbankh = 4 denotes the maximum degree of a node in the tree.

The technical scheme of the invention is further described below with reference to specific embodiments.

Example 1:

The experimental data used in the invention are mouse brain glycopeptide data comprising 729 mass spectra in total; sugar chain spectrum screening yields 669 sugar chain mass spectra and 60 non-sugar chain mass spectra (which do not contain a pentasaccharide core structure). The sugar chain mass spectrum numbered 130 is selected as an example to illustrate the specific identification process and result, which are compared with the result of the recently published StrucGP algorithm. This experimental example is for illustrative purposes and is not intended to limit the scope of the present invention.

The specific identification process comprises the following steps:

Step 1: read the mass spectrum data produced by the mass spectrometer and extract the data relevant to identification, including the sugar chain mass (glycan mass), the peptide chain mass (peptide mass) and the spectral peaks acquired at low energy (low-energy peaks). Through operations such as removing isotope peak clusters and converting m/z into m, obtain the pairs (m, intensity) from the spectrum, where m/z is the mass-to-charge ratio, m is the mass of a sugar chain ion fragment, and intensity is the abundance corresponding to mass m in the spectrum; all values are numeric.

Step 2: among the (m, intensity) pairs obtained in step 1, filter out those whose masses are very close to each other. If the difference between two spectral peak masses is less than 0.15 Da, the peaks are considered very close and only the first (m, intensity) pair is retained. The reason is that a fragment may carry different charges after ionization and therefore initially appears in different isotope peak clusters; after the isotope peak clusters are collapsed and the spectrum is converted, the calculated m values differ very little or are even identical, and they essentially represent structural fragments with identical composition.

Step 3: subtract the peptide chain mass extracted in step 1 from each mass m of step 2 to obtain masses that represent only the Y-ion fragments of the sugar chain.

Fig. 4 is the original mass spectrum of the example, showing the ion mass-to-charge ratio m/z and the abundance (intensity) of the mass spectrum. The original mass spectrum contains a large number of dense isotope peak clusters, which affects the identification algorithm of the present invention.

Fig. 5 shows the original mass spectrum data after steps 1, 2 and 3. The preprocessing filters out most isotope peaks of the original mass spectrum, which greatly reduces their influence on the identification algorithm.
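Steps 1 to 3 can be sketched as follows. This is an illustration under assumptions rather than the actual preprocessing: isotope-cluster removal is not implemented, the m/z-to-mass conversion uses the usual neutral-mass formula with the proton mass (the text does not state its formula), "the first" close peak is taken to be the one with the lowest mass, and all names are hypothetical.

```python
PROTON = 1.007276  # proton mass in Da (assumption: standard neutral-mass conversion)

def to_neutral_mass(mz, charge):
    """Convert a mass-to-charge ratio to a neutral mass for a given positive charge."""
    return charge * (mz - PROTON)

def merge_close_peaks(peaks, tol=0.15):
    """peaks: list of (m, intensity). Drop peaks closer than `tol` Da to the last kept peak."""
    kept = []
    for m, intensity in sorted(peaks):
        if kept and m - kept[-1][0] < tol:
            continue  # same fragment observed at another charge state
        kept.append((m, intensity))
    return kept

def glycan_fragments(peaks, peptide_mass):
    """Subtract the peptide mass so that masses describe only the sugar chain part."""
    return [(m - peptide_mass, intensity) for m, intensity in peaks]

raw = [(1000.0, 5.0), (1000.1, 3.0), (1203.08, 2.0)]
merged = merge_close_peaks(raw)
print(glycan_fragments(merged, peptide_mass=1000.0))
```

Here the peak at 1000.1 is treated as a duplicate of the one at 1000.0, and the remaining peak ends up about one HexNAc above the peptide mass.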

Step 4: Screen the mass spectrum obtained in step 3 for N-sugar chain spectra. A spectrum is considered an N-sugar chain spectrum if it contains at least 3 of the pentasaccharide-core related peaks. The theoretical masses of the pentasaccharide-core related peaks are [203.0794, 406.1588, 568.2116, 730.2644, 892.3172]. The N-sugar chain spectrum screening procedure is as follows:

Step 4.1: Count the number num_core of spectral peaks in the spectrum that match the theoretical masses of the pentasaccharide-core related peaks within the error range. The matching error threshold is set to 2.1 Da: if the absolute difference between the mass of a spectral peak and a theoretical mass is within this threshold, the peak is considered matched and is regarded as having monosaccharide characteristics.

Step 4.2: Evaluate the number num_core of spectral peaks with monosaccharide characteristics obtained in step 4.1. If num_core < 3, the spectrum is considered a non-N-sugar chain spectrum; otherwise it is an N-sugar chain spectrum, and the spectral peaks with monosaccharide characteristics up to the pentasaccharide core are corrected so that their masses equal the theoretical masses of the pentasaccharide-core related peaks.
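The screening in steps 4.1 and 4.2 can be sketched as follows. The sketch counts how many of the five theoretical core masses are matched by at least one observed peak, which is one reading of num_core; the function names are hypothetical.

```python
from typing import Iterable

# Theoretical masses of the pentasaccharide-core related peaks (from the text).
CORE_MASSES = [203.0794, 406.1588, 568.2116, 730.2644, 892.3172]


def count_core_matches(masses: Iterable[float], tol: float = 2.1) -> int:
    """Number of core masses matched by some observed peak within `tol` Da."""
    observed = list(masses)
    return sum(any(abs(m - c) <= tol for m in observed) for c in CORE_MASSES)


def is_n_glycan_spectrum(masses: Iterable[float], tol: float = 2.1,
                         min_matches: int = 3) -> bool:
    """Step 4.2 (sketch): accept the spectrum if at least 3 core peaks match."""
    return count_core_matches(masses, tol) >= min_matches


if __name__ == "__main__":
    print(is_n_glycan_spectrum([203.1, 406.0, 569.5, 1000.0]))
    print(is_n_glycan_spectrum([203.1, 406.0, 1000.0]))
```

The first call matches three core masses and is accepted; the second matches only two and is rejected.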

To improve spectrum quality, the invention also corrects the remaining spectral peaks after the pentasaccharide core: peaks with monosaccharide characteristics are corrected to their theoretical masses using monosaccharides and generalized monosaccharides. Fig. 6 is the corrected mass spectrum of this example, which is used as the input to the identification algorithm.

The generalized monosaccharides are explained here using second-order generalized monosaccharides as an example. They comprise the following combinations: [□], [○], [△], [◆], [◇], [□,□], [□,○], [○,□], [□,△], [△,□], [□,◆], [◆,□], [□,◇], [◇,□], [○,○], [○,△], [△,○], [○,◆], [◆,○], [○,◇], [◇,○], [△,△], [△,◆], [◆,△], [△,◇], [◇,△], [◇,◇], [◇,◆], [◆,◆], [◆,◇], where each symbol refers to a monosaccharide type described in Table 1. If a spectral peak is missing between two spectral peaks, the mass difference between the adjacent peaks may correspond to a generalized monosaccharide. For example, if the mass difference between two adjacent spectral peaks m_i and m_{i+1} corresponds to the generalized monosaccharide composed of □ and ○, there are two possible cases of a missing peak: in one, the missing peak lies at mass m_i + m_○; in the other, it lies at mass m_i + m_□. In either case a spectral peak with abundance 0 is assumed at that mass. During N-sugar chain identification, each of these cases is examined one by one.
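A generalized monosaccharide dictionary of this kind can be sketched in Python as below. Table 1 is not reproduced in this text, so the five residue names and their monoisotopic residue masses are assumptions chosen for illustration (the first two agree with the pentasaccharide-core masses given above); `generalized_dictionary` and `implied_missing_peaks` are hypothetical helper names.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Assumed residue masses in Da (illustrative stand-in for Table 1).
RESIDUE_MASS: Dict[str, float] = {
    "HexNAc": 203.0794,
    "Hex": 162.0528,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
    "NeuGc": 307.0903,
}


def generalized_dictionary(order: int = 2) -> Dict[Tuple[str, ...], float]:
    """All ordered combinations of 1..order residues with their summed mass."""
    entries: Dict[Tuple[str, ...], float] = {}
    for k in range(1, order + 1):
        for combo in product(RESIDUE_MASS, repeat=k):
            entries[combo] = sum(RESIDUE_MASS[r] for r in combo)
    return entries


def implied_missing_peaks(m_i: float, combo: Sequence[str]) -> List[float]:
    """Masses of the zero-abundance peaks implied between m_i and m_i + mass(combo)."""
    masses: List[float] = []
    running = m_i
    for residue in combo[:-1]:  # the last residue lands on the observed peak
        running += RESIDUE_MASS[residue]
        masses.append(running)
    return masses


if __name__ == "__main__":
    d = generalized_dictionary()
    print(len(d))  # 5 single residues + 25 ordered pairs
    print(implied_missing_peaks(892.3172, ("Hex", "HexNAc")))
    print(implied_missing_peaks(892.3172, ("HexNAc", "Hex")))
```

The two orderings of the same pair share one summed mass but imply different missing-peak positions, which is why both cases are examined.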

If the mass difference between two spectral peaks in the mass spectrum matches the mass of a monosaccharide (see Table 1) within the error range Δ = 0.2, the two spectral peaks are considered to have monosaccharide characteristics.

The masses of the spectral peaks having monosaccharide characteristics are then corrected to theoretical masses. After the peaks with monosaccharide characteristics have been screened, they are corrected by the following steps:

(1) Calculate the mass difference Δm of adjacent spectral peaks. If the mass of a monosaccharide or generalized monosaccharide lies within the range [Δm - Δ, Δm + Δ], the mass difference is considered to match the mass of that monosaccharide or generalized monosaccharide;

(2) Update the mass difference to the mass of the matched monosaccharide or generalized monosaccharide, and recompute the peak mass by adding that mass; the result is the corrected mass, referred to as the theoretical mass.
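The two correction steps can be sketched as follows. The sketch assumes that the first peak is already at its theoretical mass, that each corrected mass is built on the previous corrected mass, and that the closest dictionary entry is taken when several match; none of these choices is stated in the text. `correct_masses` is a hypothetical name, and `unit_masses` stands for the masses of the monosaccharides and generalized monosaccharides.

```python
from typing import List, Sequence


def correct_masses(masses: Sequence[float], unit_masses: Sequence[float],
                   delta: float = 0.2) -> List[float]:
    """Snap adjacent-peak mass differences to dictionary masses within `delta`."""
    ordered = sorted(masses)
    if not ordered:
        return []
    corrected = [ordered[0]]  # assumption: the first peak is already theoretical
    for prev_obs, cur_obs in zip(ordered, ordered[1:]):
        dm = cur_obs - prev_obs  # step (1): observed mass difference
        candidates = [u for u in unit_masses if abs(u - dm) <= delta]
        if candidates:
            # step (2): replace the difference by the matched dictionary mass
            best = min(candidates, key=lambda u: abs(u - dm))
            corrected.append(corrected[-1] + best)
        else:
            corrected.append(cur_obs)  # no monosaccharide characteristic: keep as is
    return corrected


if __name__ == "__main__":
    units = [203.0794, 162.0528]
    print(correct_masses([203.0794, 406.20, 568.15], units))
```

With these inputs the second and third peaks are moved to approximately 406.1588 and 568.2116, the theoretical core masses.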

Step 5: Grow the sugar chain tree structure according to the following rule: each time, attempt to grow the lightest monosaccharide at the lightest possible peak position.

When the sugar chain structure has grown to the pentasaccharide core and a further monosaccharide g is to be grown, the following conditions must be satisfied: abundance exists in the mass spectrum to be identified at the mass position where the monosaccharide g is grown; the proportion of the cross peak mass positions required for growing the monosaccharide that have abundance in the mass spectrum to be identified is greater than or equal to θ%, where θ% (θ = 20) represents the support of the cross terms; the monosaccharide composition of the structure after growing the monosaccharide does not exceed that of the target sugar chain; and the grown structure satisfies the constraints of the biological rules. If these conditions hold, both cases, growing the monosaccharide and not growing it, are examined on the basis of the existing structure.
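The four growth conditions can be sketched as a single check. The matching tolerance `tol`, the treatment of an empty cross-peak list as satisfied, and the function names are assumptions; the biological-rule check is passed in as a precomputed boolean because the rules are evaluated elsewhere.

```python
from typing import Dict, Iterable, Sequence


def has_abundance(spectrum_masses: Iterable[float], mass: float, tol: float) -> bool:
    """True if some peak of the spectrum lies within `tol` of `mass`."""
    return any(abs(m - mass) <= tol for m in spectrum_masses)


def may_grow(spectrum_masses: Sequence[float], basic_mass: float,
             cross_masses: Sequence[float], composition_after: Dict[str, int],
             target_composition: Dict[str, int], obeys_bio_rules: bool,
             theta: float = 20.0, tol: float = 0.2) -> bool:
    """Sketch of the conditions for growing one more monosaccharide."""
    # Condition 1: the new basic peak must have abundance in the spectrum.
    if not has_abundance(spectrum_masses, basic_mass, tol):
        return False
    # Condition 2: at least theta percent of the required cross peaks are present.
    if cross_masses:
        hits = sum(has_abundance(spectrum_masses, c, tol) for c in cross_masses)
        if hits * 100 < theta * len(cross_masses):
            return False
    # Condition 3: the composition must not exceed that of the target sugar chain.
    if any(n > target_composition.get(k, 0) for k, n in composition_after.items()):
        return False
    # Condition 4: biological rules, evaluated by the caller.
    return obeys_bio_rules


if __name__ == "__main__":
    spectrum = [203.0794, 406.1588, 568.2116, 730.2644]
    print(may_grow(spectrum, 730.2644, [568.2116, 1000.0, 1100.0, 1200.0, 1300.0],
                   {"Hex": 2, "HexNAc": 2}, {"Hex": 5, "HexNAc": 2}, True))
```

In the usage example one of five required cross peaks is present, which is exactly 20% and therefore meets the support threshold.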

Each time a monosaccharide is grown, one unit of intensity is added at the corresponding mass, and the basic peak and cross peaks are labeled.

If, after growth is complete, several unit-intensity peaks (basic peaks, cross peaks, or both) fall on the same mass, the abundance at that mass is the number of times the peak was labeled. This forms the theoretical spectrum S' of the structure. The calculation of basic and cross peaks and the construction of the theoretical mass spectrum are illustrated in Fig. 7.

Fig. 7 shows an example of calculating the basic peaks and cross peaks of a sugar chain structure. The numbers in the circles are node numbers, and the mass of node i is denoted m_i. Nodes 2 and 3 are leaf nodes, and node 1 is an internal node. Assume node 3 is the monosaccharide that has just been grown. The structure before growth has the basic peaks [1] and [1,2] and no cross peaks, so its theoretical mass spectrum is [[m₁, 'b'], [m₁+m₂, 'b']], where 'b' marks a basic peak. Growing node 3 adds the basic peak [1,3] and the cross peak [1,2,3], so the theoretical mass spectrum becomes [[m₁, 'b'], [m₁+m₂, 'b'], [m₁+m₃, 'b'], [m₁+m₂+m₃, 'c']], where 'c' marks a cross peak.
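The Fig. 7 example can be reproduced with a short Python sketch. It treats a basic peak as the mass sum over one root-to-node path and a cross peak as the mass sum over a union of two or more paths that is not itself a single path. The brute-force enumeration over path combinations is an illustrative choice that is only practical for small trees, and the names `node_path` and `theoretical_peaks` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Tuple


def node_path(node: int, parent: Dict[int, Optional[int]]) -> FrozenSet[int]:
    """Set of nodes on the path from the root to `node`."""
    path = set()
    current: Optional[int] = node
    while current is not None:
        path.add(current)
        current = parent[current]
    return frozenset(path)


def theoretical_peaks(parent: Dict[int, Optional[int]],
                      mass: Dict[int, float]) -> List[Tuple[float, str]]:
    """Return (mass, label) pairs, with 'b' for basic and 'c' for cross peaks."""
    paths = {node_path(n, parent) for n in parent}
    crosses = set()
    for r in range(2, len(paths) + 1):
        for combo in combinations(paths, r):
            union = frozenset().union(*combo)
            if union not in paths:  # a union equal to a single path is a basic peak
                crosses.add(union)

    def total(nodes: FrozenSet[int]) -> float:
        return sum(mass[n] for n in nodes)

    peaks = [(total(p), "b") for p in paths] + [(total(c), "c") for c in crosses]
    return sorted(peaks)


if __name__ == "__main__":
    # Tree of Fig. 7: node 1 is the root, nodes 2 and 3 are its children.
    parent = {1: None, 2: 1, 3: 1}
    mass = {1: 10.0, 2: 3.0, 3: 5.0}  # placeholder masses m1, m2, m3
    print(theoretical_peaks(parent, mass))
```

With the placeholder masses the result lists three basic peaks (m₁, m₁+m₂, m₁+m₃) and one cross peak (m₁+m₂+m₃), matching the labeled spectrum described for Fig. 7.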

Step 6: The identified N-sugar chain structures are scored using the evaluation method of the paper "An approach for N-linked glycan identification from MS/MS spectra by target-decoy strategy". The scores are sorted in descending order and the first-ranked structure is selected as the final identification result. Fig. 8 shows the top 10 identification results of this example and their scores.

To show how well the mass spectrum of the identification result matches the input mass spectrum of the algorithm, the theoretical mass spectrum of the first-ranked sugar chain structure is compared with the mass spectrum of the example, as shown in Fig. 9(a). Most spectral lines of the two mass spectra match within the mass error range Δ = 2.1 Da. Fig. 9(b) is the theoretical mass spectrum of the first-ranked identification result of this example, with its spectral peaks labeled as cross peaks or basic peaks.

Step 7: The target-decoy strategy is used to obtain the p-value of each identification result, and multiple testing correction is applied to the p-values to obtain the FDR of the identification results.

Fig. 8 shows the top ten identification results of this example and their scores. Note that these structures are all isomers of one another, yet their scores differ, except that the top-ranked isomers share the same score.

Compared with the sugar chain structure resolved by the structGP tool, published in Nature Methods in 2021, the first-ranked structure identified in this example has a score of 24.65 (all 1000 decoys score lower, giving a p-value of 1/1001; the score distribution of the 1000 decoys is given in Fig. 10), whereas the N-sugar chain structure resolved by structGP has a score of only 22.65. This indicates that the method and system of the present invention identify N-sugar chain structures with higher quality.

N-sugar chain identification was performed on all 669 mass spectra using the system of the present invention, and 669 N-sugar chain structures were identified. For each mass spectrum, the mass differences were randomly rearranged 1000 times to obtain decoy spectra; the decoys were scored and compared with the score of the identification result to obtain the p-value of the identified structure. Applying the algorithm proposed by Storey (2002) to these p-values gave FDR values for the structures (maximum 7.3767e-04, minimum 1.1384e-06), so all 669 structures are identified at the FDR = 0.001 level. Even with the more conservative BH algorithm (1995), the FDR values were at most 0.0240 and at least 0.001, thereby identifying 662 structures at the FDR = 0.05 level.
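The decoy and p-value computation can be sketched in Python under stated assumptions. The sketch shuffles the gaps between consecutive peak masses to build a decoy mass list (how intensities are assigned to decoys is not specified in the text and is omitted), uses the empirical p-value (k + 1) / (N + 1), which reproduces the value 1/1001 when none of 1000 decoys reaches the target score, and implements the Benjamini-Hochberg adjustment; the Storey procedure is not shown. All function names are hypothetical.

```python
import random
from typing import List, Sequence


def decoy_masses(masses: Sequence[float], rng: random.Random) -> List[float]:
    """Shuffle the gaps between consecutive peaks and rebuild a decoy mass list."""
    ordered = sorted(masses)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    rng.shuffle(gaps)
    decoy = ordered[:1]
    for gap in gaps:
        decoy.append(decoy[-1] + gap)
    return decoy


def empirical_p_value(target_score: float, decoy_scores: Sequence[float]) -> float:
    """(number of decoys scoring at least the target + 1) / (number of decoys + 1)."""
    n_ge = sum(1 for s in decoy_scores if s >= target_score)
    return (n_ge + 1) / (len(decoy_scores) + 1)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(empirical_p_value(24.65, [10.0] * 1000))  # all decoys score lower
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Because the gaps are only permuted, a decoy keeps the first and last masses of the original spectrum while scrambling the intermediate peak positions.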

Table 1. Description of monosaccharide types

Note: In the table, the complete mass is the mass of the monosaccharide before a water molecule is removed, and the residual mass is the mass after a water molecule is removed; that is, the residual mass plus the mass of a water molecule equals the complete mass, wherein the mass of the water molecules

It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.

The foregoing describes only specific embodiments of the present invention, and the scope of protection of the invention is not limited thereto. Any modification, equivalent replacement, improvement or alternative made by those skilled in the art within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A Denovo method for identifying an N-sugar chain structure based on mass spectrum data, characterized in that the method extracts the structure and composition information of sugar chain fragment ions in the mass spectrum data and introduces a generalized monosaccharide dictionary to improve the robustness of the identified structure to noise in the mass spectrum data; introduces basic peaks and cross peaks, and grows the sugar chain structure by growing basic peaks with the support of cross peaks; and reduces the search space of candidate structures of the identification result by using a pruning strategy, finally identifying the N-sugar chain structure corresponding to the mass spectrum;

the Denovo method for identifying an N-sugar chain structure based on mass spectrum data comprises the following steps:

step one, reading the mass spectrum data processed by a mass spectrometer and extracting the related data relevant to identification; converting the mass-to-charge ratio m/z of the mass spectrum into mass m through preprocessing of the mass spectrum; judging whether the pentasaccharide core is present by the pentasaccharide-core related peak judging method, and proceeding to step two if the pentasaccharide core is present;

step two, correcting the masses of the spectral peaks with monosaccharide characteristics in the mass spectrum to theoretical masses based on monosaccharides and generalized monosaccharides;

step three, initializing a sugar chain structure as the root node of a tree, continuously growing the sugar chain from this initial structure by growing monosaccharides according to a given rule, calculating the basic peaks and cross peaks of the structure after each monosaccharide is grown, and generating the theoretical mass spectrum of the structure from the calculated basic peaks and cross peaks;

step four, filtering isomorphic structures among the grown structures through the pruning strategy to obtain the N-sugar chain structure identification results; scoring and evaluating the identification results with reference to the theoretical mass spectrum, wherein the first-ranked structure by score is the identified sugar chain structure;

in step two, the spectral peak mass correction of a mass spectrum with the pentasaccharide core, based on monosaccharides and generalized monosaccharides, comprises the following steps:

(1) Calculating the mass difference Δm of adjacent spectral peaks; if the mass of a monosaccharide or generalized monosaccharide lies within the range [Δm - Δ, Δm + Δ], the mass difference matches the mass of that monosaccharide or generalized monosaccharide, where Δ is the correction error of 0.2;

(2) Updating the mass difference to the mass of the corresponding monosaccharide or generalized monosaccharide, and recomputing the peak mass by adding that mass to obtain the corrected mass, wherein the corrected mass of a spectral peak with monosaccharide characteristics is the theoretical mass;

the basic peaks and cross peaks are defined as follows:

the basic peak is a peak in spectrum S that is associated with only a single monosaccharide path: the sums of the monosaccharide masses on each monosaccharide path of the sugar chain G corresponding to spectrum S are denoted b_i, 1 ≤ i ≤ k_b, and a peak in spectrum S at mass m = b_i is a basic peak;

the cross peak is a peak in S that is associated with two or more monosaccharide paths: the sums of the monosaccharide masses on any two or more monosaccharide paths of G are denoted c_i, 1 ≤ i ≤ k_c, and a peak in spectrum S at mass m = c_i is a cross peak;

the monosaccharide path is the set of monosaccharides on the path from the root node to any node in the tree structure;

the filtering of the structure obtained by using the pruning strategy comprises the following steps:

1) Judging whether the structure has the pentasaccharide core structure and whether it satisfies the biological rules, and filtering the structure out if it does not have the pentasaccharide core structure;

2) Calculating the hash values of the structures, judging structures to be isomorphic if their hash values are equal, and retaining only one structure from each group of isomorphic structures;

the hash value calculation formula of the structure is as follows:

$H_s(x) = \left(\sum_i H_s(\mathrm{Son}_x[i])^2\right) + \mathrm{Offset}_x$;

2. The Denovo method for identifying an N-sugar chain structure based on mass spectrum data according to claim 1, wherein in step one the related data include: the sugar chain mass (glycan mass), the peptide chain mass (peptide mass), and the spectral peaks obtained at low energy (lowenergy peaks).

3. The Denovo method for identifying an N-sugar chain structure based on mass spectrum data according to claim 1, wherein in step three the rule for growing a monosaccharide on the sugar chain structure comprises: attempting to grow the lightest monosaccharide each time.

4. The Denovo method for identifying an N-sugar chain structure based on mass spectrum data according to claim 1, wherein in step three the theoretical mass spectrum is generated as follows:

initializing the theoretical mass spectrum as empty, which corresponds to a spectral peak with mass 0 and to the root node of the sugar chain tree structure;

each time a monosaccharide is grown on the sugar chain tree structure, adding one unit of intensity at the corresponding mass of the theoretical mass spectrum and labeling it as a basic peak, and adding one unit of intensity at each cross peak mass generated by growing the monosaccharide and labeling it as a cross peak;

5. The Denovo method for identifying an N-sugar chain structure based on mass spectrum data according to claim 1, wherein in step three the growing of the monosaccharide further comprises:

if, in the mass spectrum to be identified, abundance exists at the mass position where the monosaccharide g is grown; the proportion of the cross peak mass positions required for growing the monosaccharide that have abundance in the mass spectrum to be identified is greater than or equal to θ%, where θ% represents the support of the cross terms and θ = 20; the monosaccharide composition of the structure after growing the monosaccharide does not exceed that of the target sugar chain; and the grown structure satisfies the constraints of the following biological rules, then both cases, growing the monosaccharide and not growing it, are examined on the basis of the existing structure;

the constraint of the biological rule includes:

no further monosaccharide is grown on a △ (dHex) monosaccharide;

the degree of the sugar chain tree is at most 4;

each monosaccharide has a water fraction and the water mass is removed.

6. Use of the de novo method for identifying a sugar chain structure based on mass spectrometry data according to any one of claims 1 to 5.