CN114166925A

CN114166925A - De novo identification method and system for N-sugar chain structures based on mass spectrum data

Info

Publication number: CN114166925A
Application number: CN202111235025.6A
Authority: CN
Inventors: 张军英; 杨芝; 吴金辉; 刘继源; 孙士生
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-10-22
Filing date: 2021-10-22
Publication date: 2022-03-11
Anticipated expiration: 2041-10-22
Also published as: CN114166925B

Abstract

The invention belongs to the technical field of glycomics and discloses a method and a system for de novo (Denovo) identification of N-sugar chain structures based on mass spectrum data. The method extracts the structure and composition information of sugar chain fragment ions from the mass spectrum data, identifies the N-sugar chain on the basis of basic peaks, cross peaks and a generalized monosaccharide dictionary, and uses a pruning strategy to reduce the search space of candidate structures, thereby obtaining the N-sugar chain structure corresponding to the mass spectrum. Basic peaks and cross peaks are introduced as the basis of identification; the generalized monosaccharide dictionary improves spectrogram quality and makes the method more robust to noise in the mass spectrum data; and the pruning strategy narrows the search space of candidate structures. The invention thereby improves the quality of mass spectrometric identification.

Description

De novo identification method and system for N-sugar chain structures based on mass spectrum data

Technical Field

The invention belongs to the technical field of glycomics, and particularly relates to a method and a system for de novo (Denovo) identification of N-sugar chain structures based on mass spectrum data.

Background

Glycosylation is a post-translational modification of proteins that is ubiquitous in organisms, and the N-sugar chain structure largely determines the biological function of glycoproteins. With the rapid improvement of mass spectrometry technology, identifying sugar chain structures from mass spectrometry data has become an important way of understanding the biological functions of glycoproteins.

An N-sugar chain is a tree structure with a fixed pentasaccharide core. Current methods for identifying N-sugar chain structures fall into three main types: 1) database search methods; 2) de novo sequencing (Denovo) methods; and 3) tagging methods. The tagging method combines the database search method with the de novo sequencing method. The database search method and the de novo sequencing method are described separately below.

1. Database search method: with reference to databases such as GlycoSearchMS, GlycoPep DB and GlyDB, the mass spectrum of a glycopeptide of unknown structure is matched for similarity against real spectra of annotated sugar chain structures to obtain a similarity score, and the best-matching sugar chain structure is taken as the identification result. Algorithms based on this method include GRIP, ArMone 2.0, GlycoPep Detector, Byonic, Protein Prospector, pGlyco 2.0 and the like.

2. De novo sequencing method: this generally consists of two processes, namely enumerating possible sugar chain structures and evaluating these candidate structures, with the highest-scoring sugar chain structure taken as the identification result. An ideal enumeration procedure should generate as few candidate structures as possible for further evaluation, but must not discard the target sugar chain structure.

At present, de novo sequencing methods are mainly classified into three categories:

The first category is exhaustive search: given the parent ion mass of the glycopeptide under study, the monosaccharide composition of the sugar chain can easily be calculated with the knapsack algorithm. Exhaustive search methods such as STAT, StrOligo and OSCAR then enumerate all possible branched structures that match the monosaccharide composition. Because the number of candidate sugar chain structures grows exponentially with the number of monosaccharides, this type of strategy is only used to identify sugar chains with at most ten monosaccharide residues.

Applying biosynthesis rules as constraints on the candidate sugar chain structures can greatly reduce the search space. In reality, however, the biological rules governing sugar chain formation are not completely known, which limits the general applicability of biosynthesis rules.

The second category is heuristic: generating candidate sugar chains has been proven to be NP-hard under the condition that each peak in the spectrum can be used only once. For this reason, a number of heuristic approaches exist; for example, only a limited number of substructures are retained for each peak position, reducing computational complexity to save time and space. Prior art 1 proposes reconstructing sugar chain structures stepwise and considering a fixed number of high-quality structures in each iteration. Prior art 2 proposes an exact algorithm based on a fixed-parameter algorithm, where the parameter is the number of peaks; for mass spectra with a large number of peaks, only the k most intense peaks at most need to be used once, while the other peaks can be used many times.

The third category is dynamic programming: similar to de novo peptide sequencing, GLYCH uses dynamic programming to find the most likely branch structure from tandem mass spectra, but it is only applicable to MS/MS spectra of released sugar chains and cannot process glycopeptide data. Prior art 3 formulates the candidate structure generation problem as an integer linear programming problem and then uses dynamic programming to infer the most likely structure. To keep the computation manageable, dynamic programming methods typically return a fixed number of the highest-scoring structures; for example, GLYCH reports a maximum of 200 candidate structures for subsequent evaluation.

Compared with the database search method, the de novo sequencing method has the advantage of being able to identify new sugar chain structures that are not included in any database, and therefore has great research value. Its drawback is that it requires high-quality mass spectra, whereas in practice various factors degrade spectrum quality, for example the loss of several consecutive fragment ion peaks. As a result, its spectrum utilization rate is lower than that of database search. Nevertheless, with the development of mass spectrometry technology, the method shows very promising prospects for sugar chain structure identification.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) current database search-based methods are unable to identify unlisted structures.

(2) The current methods based on de novo sequencing (Denovo) are greatly affected by the noise of mass spectrum data, resulting in low robustness of identifying structures.

The difficulty in solving the above problems and defects is:

(1) diversity in the N-sugar chain scale. Some N-sugar chains contain a small number of monosaccharides, and some N-sugar chains contain a large number of monosaccharides, i.e., the number of monosaccharides contained therein varies widely, and sugar chains with larger dimensions generally have higher identification difficulty;

(2) diversity of N-sugar chain structures. Although the N-sugar chain has a tree structure, the composition and the position of each monosaccharide on the sugar chain are possibly diversified, which brings great challenges to the accurate identification of the sugar chain structure from mass spectrum data;

(3) the N-sugar chain substructure is different in stability. The different substructures have different structural stability, namely some substructures are difficult to fragment, some substructures are easy to fragment, and the structural stability information of various substructures is unknown, so that the N-sugar chain structure identification based on mass spectrum data is difficult;

(4) glycopeptides fed into a mass spectrometer are not necessarily pure, so that actually, a plurality of glycopeptides are possibly mixed, and identification of enriched glycopeptides is interfered;

(5) the measurement noise, isotope effect and the like of the mass spectrometer also bring certain interference to sugar chain structure identification, generally, the interference is solved through pretreatment, the pretreatment effect of mass spectrum data directly influences the identification performance of a subsequent identification algorithm, and high-quality pretreatment is an important guarantee for improving the identification quality.

The significance of solving the problems and the defects is as follows:

On the basis of conventional preprocessing of N-glycopeptide mass spectrum data, the invention develops a method and a system for de novo identification of the N-sugar chain structure of N-glycopeptides that are robust to noise in the mass spectrum data.

The main challenge of glycomics is the characterization of complex glycan structures, which is crucial for understanding their role in biological processes. The sugar chain formed during glycosylation participates in the life regulation activity of organisms, and can enhance the protease resistance of the modified protein, thereby influencing the interaction between proteins, and influencing the spatial structure, biological activity, transportation, positioning, function and the like of the protein; in some life activities, structural changes of sugar chains attached to peptide chains are important causes of disease occurrence; glycosylation also plays an important role in the solubility, stability and efficacy of many biopharmaceuticals. Therefore, accurate identification of the sugar chain produced by glycosylation is of great significance for understanding life regulation activities, finding pathogenic causes, treatment of diseases and drug design.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a method and a system for de novo (Denovo) identification of N-sugar chain structures based on mass spectrum data.

The invention is realized as follows: a method for de novo identification of N-sugar chain structures based on mass spectrum data, comprising:

the structure and composition information of sugar chain fragment ions in mass spectrum data are extracted, N-sugar chain identification is carried out on the basis of a basic peak, a cross peak and a generalized monosaccharide dictionary, and a pruning strategy is utilized to reduce the search space of candidate structures of identification results, so that the N-sugar chain structure corresponding to the mass spectrum is obtained.

Further, the de novo method for identifying an N-sugar chain structure based on mass spectrum data comprises the following steps:

step one, reading the mass spectrum data processed by the mass spectrometer and extracting the data relevant to identification; preprocessing the mass spectrum to convert the mass-to-charge ratio m/z into mass m, judging whether a pentasaccharide core is present by means of the pentasaccharide-core related peak judging method, and if so, proceeding to step two. Main function: preprocessing the data and judging whether the spectrum comes from an N-sugar chain, since every N-sugar chain has a pentasaccharide core structure;

step two, correcting the spectral peaks with monosaccharide characteristics in the mass spectrum to their theoretical masses, based on the monosaccharides and the generalized monosaccharides. Main function: tolerating lost spectral peaks and the mass measurement errors of the mass spectrometer, thereby enhancing the robustness of the identification result to noise;

step three, initializing a sugar chain structure as the root node of a tree, repeatedly growing monosaccharides from this initial structure according to a given rule, calculating the basic peaks and cross peaks of the structure each time a monosaccharide is grown, and generating the theoretical mass spectrum of the structure from the calculated basic peaks and cross peaks. Main function: continuously expanding the sugar chain structure tree through the growth process;

step four, filtering isomorphic structures among the grown structures through a pruning strategy to obtain the N-sugar chain structure identification results; scoring and evaluating the obtained identification results, the highest-scoring structure being the identified sugar chain structure. Main function: filtering duplicate structures and scoring each remaining structure.

Further, in step one, the related data include: the sugar chain mass GlycanMass, the peptide chain mass PeptideMess, and the spectral peaks obtained at low energy.

Further, in step two, correcting the spectral peaks with monosaccharide characteristics in the mass spectrum to their theoretical masses based on the monosaccharides and the generalized monosaccharides includes:

(1) calculating the mass difference Δm between adjacent spectral peaks; if the mass of a monosaccharide or of a generalized monosaccharide lies within the range [Δm - Δ, Δm + Δ], the mass difference is considered to match that monosaccharide or generalized monosaccharide, where Δ is the correction error and takes the value 0.2;

(2) replacing the mass difference with the mass of the matched monosaccharide or generalized monosaccharide, and recalculating the peak mass as the previous peak mass plus the mass of that monosaccharide or generalized monosaccharide; the result is the corrected mass, i.e. the theoretical mass of the peak with monosaccharide characteristics.
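The two correction sub-steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation. The function name `correct_peaks`, the choice to leave unmatched peaks at their observed mass, and the assumption that the first peak is already at its theoretical mass are all introduced here. The monosaccharide masses are standard monoisotopic residue masses, and the dictionary passed in may also contain generalized monosaccharide masses.

```python
# Standard monoisotopic residue masses (Da); the patent's own table is not reproduced here.
MONO = {"Hex": 162.0528, "HexNAc": 203.0794, "dHex": 146.0579,
        "NeuAc": 291.0954, "NeuGc": 307.0903}


def correct_peaks(masses, dictionary, tol=0.2):
    """Snap each peak to previous corrected mass + dictionary mass when the
    observed gap to the previous peak matches a dictionary entry within tol."""
    masses = sorted(masses)
    if not masses:
        return []
    corrected = [masses[0]]  # assumption: the first peak is already theoretical
    for prev_obs, cur_obs in zip(masses, masses[1:]):
        gap = cur_obs - prev_obs                       # mass difference of adjacent peaks
        best = min(dictionary, key=lambda d: abs(d - gap))
        if abs(best - gap) <= tol:
            corrected.append(corrected[-1] + best)     # matched: use theoretical spacing
        else:
            corrected.append(cur_obs)                  # assumption: keep unmatched peaks
    return corrected


observed = [203.0794, 406.20, 568.15, 900.0]
corrected = correct_peaks(observed, list(MONO.values()))
```

In this example the second and third peaks are pulled onto 406.1588 and 568.2116, while the last peak matches no dictionary entry and is left unchanged.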

Further, in step three, the rule by which the sugar chain structure grows monosaccharides is: each growth attempt tries the lightest monosaccharide.

Further, the basic peaks and cross peaks are defined as follows:

a basic peak is a peak in the spectrogram S that is associated with only a single monosaccharide path: let the sums of the monosaccharide masses along each monosaccharide path of the sugar chain G corresponding to the spectrogram S be denoted $b_1, b_2, \ldots, b_{k_b}$; if $m = b_i$ for some $1 \le i \le k_b$, the peak at mass $m$ in the spectrogram S is a basic peak;

a cross peak is a peak in S that is associated with two or more monosaccharide paths: let the sums of the monosaccharide masses over any two or more monosaccharide paths of G be denoted $c_1, c_2, \ldots, c_{k_c}$; if $m = c_i$ for some $1 \le i \le k_c$, the peak at mass $m$ in the spectrogram S is a cross peak;

the monosaccharide path is a monosaccharide set on a path from a root node to any node in the tree structure.
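The definitions above can be illustrated with a small sketch. It rests on an interpretation that the text does not spell out: the mass associated with several monosaccharide paths is taken to be the mass of their union, with shared monosaccharides counted once. The tree encoding (a `parent` map plus a `mass` map) and the function names are hypothetical. Enumerating all path combinations is exponential and is only meant for tiny examples.

```python
from itertools import combinations


def path_to_root(node, parent):
    """Set of nodes on the path from the root down to `node` (inclusive)."""
    nodes = set()
    while node is not None:
        nodes.add(node)
        node = parent[node]
    return frozenset(nodes)


def basic_and_cross_peaks(parent, mass):
    """Return (basic, cross) peak masses for a tree given as parent/mass maps."""
    paths = {path_to_root(n, parent) for n in parent}
    basic = {round(sum(mass[n] for n in p), 4) for p in paths}
    cross = set()
    path_list = list(paths)
    for r in range(2, len(path_list) + 1):
        for combo in combinations(path_list, r):
            union = frozenset().union(*combo)
            if union not in paths:  # a union of nested paths is just a single path
                cross.add(round(sum(mass[n] for n in union), 4))
    return basic, cross


# Pentasaccharide-core-shaped example: HexNAc-HexNAc-Hex with two Hex branches.
parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 2}
mass = {0: 203.0794, 1: 203.0794, 2: 162.0528, 3: 162.0528, 4: 162.0528}
basic, cross = basic_and_cross_peaks(parent, mass)
```

For this tree the four basic masses are the core-related masses up to 730.2644, and the only cross peak is the full core at 892.3172, which requires both branches.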

Further, in step three, the theoretical mass spectrum generation method includes:

The theoretical mass spectrum is initialized as empty, and the spectral peak at mass 0 of the theoretical mass spectrum corresponds to the root node of the sugar chain tree structure.

Every time a monosaccharide grows on the sugar chain tree structure, the abundance of one unit intensity is increased on the corresponding mass of the theoretical mass spectrum and is marked as a basic spectrum peak, and the abundance of one unit intensity is increased on the mass position of a cross peak generated by growing the monosaccharide and is marked as a cross spectrum peak.

If the theoretical mass spectrum has a plurality of basic spectral peaks, cross spectral peaks or basic spectral peaks and cross spectral peaks of unit intensity on the same mass, the abundance on the mass is the superposition of the marked spectral peaks.
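The accumulation rule described above can be sketched with a counter keyed by mass. This is an illustrative sketch only: the input format (one `(basic_mass, cross_masses)` pair per grown monosaccharide) and the function name are assumptions, and the root at mass 0 contributes no abundance here, which is also an assumption.

```python
from collections import Counter


def theoretical_spectrum(growth_steps):
    """Accumulate unit intensities for the basic and cross peaks of each growth step."""
    spectrum = Counter()   # mass -> summed unit intensities
    labels = {}            # mass -> list of "basic" / "cross" marks
    for basic_mass, cross_masses in growth_steps:
        spectrum[basic_mass] += 1
        labels.setdefault(basic_mass, []).append("basic")
        for c in cross_masses:
            spectrum[c] += 1
            labels.setdefault(c, []).append("cross")
    return spectrum, labels


# Growing the pentasaccharide core: the last monosaccharide creates one cross peak.
steps = [
    (203.0794, []),
    (406.1588, []),
    (568.2116, []),
    (730.2644, []),
    (730.2644, [892.3172]),
]
spectrum, labels = theoretical_spectrum(steps)
```

The two branch monosaccharides produce basic peaks at the same mass, so the abundance at 730.2644 is the superposition of two unit intensities.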

Further, in step three, the growth of the monosaccharide further comprises:

once the sugar chain structure has grown to the pentasaccharide core structure, a further monosaccharide g is grown only if all of the following conditions are satisfied:

(1) abundance exists in the mass spectrum to be identified at the mass position of the grown monosaccharide g;

(2) the mass positions of the cross peaks required for growing the monosaccharide have abundance in the mass spectrum to be identified at a ratio of at least θ%, where θ% (θ = 20) represents the support degree of the cross terms;

(3) the monosaccharide composition of the structure after growing the monosaccharide does not exceed that of the target sugar chain;

(4) the grown structure satisfies the biological rule constraints below.

When these conditions hold, both cases, growing the monosaccharide and not growing it, are considered on the basis of the existing structure. The constraints of the biological rules include:

no further monosaccharide is grown after a Δ (dHex) monosaccharide;

the degree of the sugar chain tree is at most 4;

each monosaccharide carries a water moiety, the mass of which must be subtracted.
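The growth conditions can be collected into a single predicate, sketched below under several assumptions that are not taken from the source: the θ% support is interpreted as the percentage of required cross-peak masses that have abundance in the observed spectrum, peaks are matched with a hypothetical tolerance `tol`, and all function and parameter names are illustrative. Masses are assumed to be residue masses, i.e. with water already subtracted.

```python
def has_abundance(spectrum, mass, tol=0.01):
    """True if the observed spectrum has a peak with positive abundance near `mass`."""
    return any(abs(m - mass) <= tol and inten > 0 for m, inten in spectrum.items())


def may_grow(spectrum, basic_mass, cross_masses, composition_after,
             target_composition, parent_type, parent_children,
             theta=20.0, max_branch=4):
    """Check whether a monosaccharide may be attached under the stated conditions."""
    # Biological rules: nothing grows after dHex; a node has at most 4 children.
    if parent_type == "dHex" or parent_children >= max_branch:
        return False
    # Condition 1: the new basic peak must be present in the observed spectrum.
    if not has_abundance(spectrum, basic_mass):
        return False
    # Condition 2 (interpretation): at least theta% of the cross peaks are present.
    if cross_masses:
        hits = sum(has_abundance(spectrum, c) for c in cross_masses)
        if 100.0 * hits / len(cross_masses) < theta:
            return False
    # Condition 3: composition must not exceed that of the target sugar chain.
    return all(n <= target_composition.get(g, 0)
               for g, n in composition_after.items())


observed = {203.0794: 10.0, 406.1588: 8.0, 568.2116: 6.0}
target = {"HexNAc": 2, "Hex": 3}
ok = may_grow(observed, 568.2116, [], {"HexNAc": 2, "Hex": 1}, target, "HexNAc", 0)
no_peak = may_grow(observed, 730.2644, [], {"HexNAc": 2, "Hex": 2}, target, "Hex", 0)
after_dhex = may_grow(observed, 568.2116, [], {"HexNAc": 2, "Hex": 1}, target, "dHex", 0)
too_many = may_grow(observed, 406.1588, [], {"HexNAc": 3}, target, "HexNAc", 0)
```

A search would call such a predicate before each growth attempt and, when it returns true, branch into both the grown and the ungrown structure.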

Further, the filtering the structure obtained by using the pruning strategy includes:

1) judging whether the structure has a pentasaccharide core structure and satisfies the biological rules, and filtering it out if not;

2) calculating the hash value of each structure; structures with equal hash values are judged to be isomorphic, and only one structure is kept from each set of isomorphic structures.

Further, the hash value calculation formula of the structure is as follows:

$$Hs(x) = \Big(\sum_{i} Hs(Son_x[i])^2\Big) + Offset_x$$

where $Hs(x)$ denotes the hash value of the tree with x as its root node, $Son_x[i]$ denotes the i-th child node of the root node x, $Offset_x$ denotes the weight of node x itself, and $CodedValue_x$ denotes the coded value of node x; x has 5 types, namely {Hex, HexNAc, NeuAc, NeuGc, dHex}, and MaxBranch = 4 denotes the maximum degree of a node in the tree.
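The recursive hash can be sketched as below. The expression that derives Offset_x from CodedValue_x and MaxBranch is not reproduced in this text, so the sketch simply uses a hypothetical coded value per monosaccharide type as a stand-in for Offset_x; the specific numbers and the tuple encoding `(type, children)` are assumptions. Because the sum over children does not depend on their order, trees that differ only in child order receive the same hash. As with any hash, collisions between non-isomorphic trees are possible.

```python
# Hypothetical coded values standing in for Offset_x (not taken from the source).
CODED_VALUE = {"Hex": 101, "HexNAc": 103, "NeuAc": 107, "NeuGc": 109, "dHex": 113}


def tree_hash(node):
    """Hs(x) = sum of squared child hashes + offset of x; node = (type, children)."""
    mono, children = node
    return sum(tree_hash(child) ** 2 for child in children) + CODED_VALUE[mono]


def deduplicate(structures):
    """Keep one representative per hash value, treating equal hashes as isomorphic."""
    seen = {}
    for s in structures:
        seen.setdefault(tree_hash(s), s)
    return list(seen.values())


a = ("HexNAc", [("Hex", []), ("dHex", [])])
b = ("HexNAc", [("dHex", []), ("Hex", [])])    # same tree, children swapped
c = ("HexNAc", [("Hex", [("dHex", [])])])      # different topology
unique = deduplicate([a, b, c])
```

Here `a` and `b` collapse to a single representative while `c` is kept as a distinct structure.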

Another object of the present invention is to provide an application of the method for identifying a Denovo sugar chain structure based on mass spectrometry data in sugar chain structure identification.

By combining all the technical schemes, the invention has the advantages and positive effects that: the invention identifies the N-sugar chain structure corresponding to the mass spectrum by extracting the structure and composition information of sugar chain fragment ions in the mass spectrum data based on the mass spectrum data of the N-sugar chain and the thought of de novo sequencing (Denovo). In the identification process, a basic peak and a cross peak are introduced, and the N-sugar chain structure identification is carried out based on the basic peak and the cross peak; a generalized monosaccharide dictionary is introduced, so that the spectrogram quality is improved, and the robustness of the identification method to noise in a mass spectrogram is improved; and narrowing the search space of the candidate structure of the identification result by using a pruning strategy. The invention improves the quality of mass spectrometric identification.

The invention introduces generalized monosaccharide to improve spectrogram quality, introduces basic peaks and cross peaks, establishes a theoretical mass spectrum to improve N-sugar chain identification quality, adopts a Hash redundancy structure removal method to improve identification efficiency, evaluates an identification structure through matching and scoring of a mass spectrum spectrogram and a spectral peak of the theoretical mass spectrum, and achieves the purpose of high-efficiency and high-quality identification of the N-sugar chain structure.

The invention can identify sugar chain structures which are not recorded in the current sugar chain database on the basis of improving the spectrogram quality, and can reasonably evaluate the identification result.

Drawings

FIG. 1 is a schematic diagram of a Denovo method for identifying N-sugar chain structure based on mass spectrometry data according to an embodiment of the present invention.

FIG. 2 is a flow chart of a method for identifying a Denovo based on an N-sugar chain structure of mass spectrum data, provided in an example of the present invention.

FIG. 3 shows an N-sugar chain pentasaccharide core structure provided in an example of the present invention.

FIG. 4 is a raw mass spectrum provided by an embodiment of the present invention.

FIG. 5 is a mass spectrum of a raw mass spectrum after preprocessing provided by an embodiment of the invention.

FIG. 6 is a mass spectrum after correction according to an embodiment of the present invention.

FIG. 7 is an example of sugar chain structures calculated from the base peak and the cross peak provided in the examples of the present invention.

Fig. 8 shows top ten ranked evaluation results and scores thereof according to an embodiment of the present invention.

Fig. 9(a) is a comparison of the mass spectrum to be identified of the example of the present invention with the theoretical mass spectrum of the identified sugar chain ranked first by score.

Fig. 9(b) is a theoretical mass spectrum of the first-ranked identification provided by an embodiment of the invention.

Fig. 10 is a decoy score distribution chart for an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In view of the problems in the prior art, the invention provides a method and a system for de novo (Denovo) identification of N-sugar chain structures based on mass spectrum data, which are described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the Denovo method for identifying an N-sugar chain structure based on mass spectrometry data provided by the embodiment of the present invention comprises:

the method comprises the steps of extracting structure and composition information of sugar chain fragment ions in mass spectrum data, carrying out N-sugar chain identification based on a basic peak, a cross peak, a monosaccharide dictionary and a generalized monosaccharide dictionary, reducing a search space of an identification result candidate structure by using a pruning strategy, and finally grading the candidate structure to obtain an N-sugar chain structure corresponding to a mass spectrum to be identified.

As shown in fig. 2, the Denovo method for identifying an N-sugar chain structure based on mass spectrometry data provided by the embodiment of the present invention comprises the steps of:

S101, reading the mass spectrum data processed by the mass spectrometer and extracting the data relevant to identification; preprocessing the mass spectrum data to convert the mass-to-charge ratio m/z into mass m, judging whether a pentasaccharide core (whose structure is shown in FIG. 3) is present by means of the pentasaccharide-core related peak judging method, and if so, proceeding to step S102;

S102, based on the monosaccharides and the generalized monosaccharides, performing spectral peak correction on the mass spectrum containing the pentasaccharide core, correcting the spectral peaks with monosaccharide characteristics to their theoretical masses;

S103, initializing a sugar chain structure as the root node of a tree, growing monosaccharides from this initial structure according to a given rule, calculating the basic peaks and cross peaks of the structure each time a monosaccharide is grown, and generating the theoretical mass spectrum of the structure from the calculated basic peaks and cross peaks;

S104, filtering isomorphic structures among the grown structures through a pruning strategy to obtain the N-sugar chain structure identification results; scoring and evaluating the obtained identification results against the theoretical mass spectrum, the highest-scoring structure being the identified N-sugar chain structure.

The related data provided by the embodiment of the invention include: the sugar chain mass GlycanMass, the peptide chain mass PeptideMess, and the spectral peaks obtained at low energy.

The method provided by the embodiment of the invention for performing spectral peak correction on the mass spectrum containing the pentasaccharide core, based on the monosaccharides and the generalized monosaccharides, comprises the following steps:

(1) calculating the mass difference Δm between adjacent spectral peaks; if the mass of a monosaccharide or of a generalized monosaccharide lies within the range [Δm - Δ, Δm + Δ], the mass difference is considered to match that monosaccharide or generalized monosaccharide, where Δ is the correction error and takes the value 0.2;

(2) replacing the mass difference with the mass of the matched monosaccharide or generalized monosaccharide, and recalculating the peak mass as the previous peak mass plus the mass of that monosaccharide or generalized monosaccharide; the result is the corrected mass, i.e. the theoretical mass of the peak with monosaccharide characteristics.

The rule for growing monosaccharides on the sugar chain structure provided by the embodiment of the invention is: each growth attempt tries the lightest monosaccharide.

The basic peak and the cross peak provided by the embodiment of the invention comprise:

a basic peak is a peak in the spectrogram S that is associated with only a single monosaccharide path: let the sums of the monosaccharide masses along each monosaccharide path of the sugar chain G corresponding to the spectrogram S be denoted $b_1, b_2, \ldots, b_{k_b}$; if $m = b_i$ for some $1 \le i \le k_b$, the peak at mass $m$ in the spectrogram S is a basic peak;

a cross peak is a peak in S that is associated with two or more monosaccharide paths: let the sums of the monosaccharide masses over any two or more monosaccharide paths of G be denoted $c_1, c_2, \ldots, c_{k_c}$; if $m = c_i$ for some $1 \le i \le k_c$, the peak at mass $m$ is a cross peak.

The theoretical mass spectrum generation method provided by the embodiment of the invention comprises the following steps:

The growth of the monosaccharide provided by the embodiment of the invention further comprises the following steps:

A monosaccharide g is grown only if all of the following conditions are satisfied: abundance exists in the mass spectrum to be identified at the mass position of the grown monosaccharide g; the mass positions of the cross peaks required for growing the monosaccharide have abundance in the mass spectrum to be identified at a ratio of at least θ%, where θ% (θ = 20) represents the support degree of the cross terms; the monosaccharide composition of the structure after growing the monosaccharide does not exceed that of the target sugar chain; and the grown structure satisfies the biological rule constraints below. When these conditions hold, both cases, growing the monosaccharide and not growing it, are considered on the basis of the existing structure.

The constraints of the biological rules include: no further monosaccharide is grown after a Δ (dHex) monosaccharide; the degree of the sugar chain tree is at most 4 (since one monosaccharide has at most five bonds, one of which connects to the parent node, a single node of the tree structure can grow at most 4 branches); each monosaccharide carries a water moiety, the mass of which must be subtracted.

The filtering of the obtained structure by using the pruning strategy provided by the embodiment of the invention comprises the following steps:

The hash value calculation formula of the structure provided by the embodiment of the invention is as follows:

$$Hs(x) = \Big(\sum_{i} Hs(Son_x[i])^2\Big) + Offset_x$$

The technical solution of the present invention is further described with reference to the following specific embodiments.

Example 1:

The experimental data used in the invention are mouse brain glycopeptide data comprising 729 mass spectra in total. Sugar chain spectrum screening yields 669 sugar chain mass spectra and 60 non-sugar chain mass spectra (which do not contain a pentasaccharide core structure). The sugar chain mass spectrum numbered 130 is taken as an example to illustrate the specific identification process and identification result of the invention, which are compared with the result of the recently published StrucGP algorithm. This experimental example is for illustrative purposes and is not intended to limit the scope of the present invention.

The specific identification process comprises the following steps:

Step 1: the mass spectrum data processed by the mass spectrometer are read and the data relevant to identification are extracted, including the sugar chain mass GlycanMass, the peptide chain mass PeptideMess, and the spectral peaks lowEnergyPeaks obtained at low energy. Through operations such as converting m/z to m and removing isotope peak clusters, the spectrogram is converted from the form [m/z, intensity] to the form [m, intensity], where m/z is the mass-to-charge ratio, m is the mass of a sugar chain fragment ion, and intensity is the numerical abundance corresponding to mass m in the spectrogram.

Step 2: entries in the [m, intensity] list obtained in step 1 whose masses are very close are filtered. Two spectral peaks are considered very close if their mass difference is less than 0.15 Da, and only the first [m, intensity] is retained. The reason is that a fragment may carry different charges after ionization and therefore initially appears in different isotope peak clusters; after the isotope peak clusters are collapsed and the spectrogram is converted, the calculated m values differ very little or are even identical, and essentially represent structural fragments of the same composition.

Step 3: the peptide chain mass extracted in step 1 is subtracted from each mass m obtained in step 2, giving masses that represent only the sugar chain part of the Y-ion fragments.
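Steps 1 to 3 can be sketched as a short preprocessing routine. The sketch assumes that each peak already carries a known charge z (the isotope cluster removal that determines it is not shown), that the ions are protonated so that m = z·(m/z) − z·1.007276, and that "the first" of two close peaks means the one with the lower mass. The names `neutral_mass` and `preprocess` are illustrative.

```python
PROTON = 1.007276  # mass of a proton in Da


def neutral_mass(mz, z):
    """Convert m/z of a z-fold protonated ion to its neutral mass."""
    return mz * z - z * PROTON


def preprocess(peaks, peptide_mass, min_gap=0.15):
    """peaks: list of (mz, charge, intensity). Returns [(glycan_mass, intensity)]."""
    converted = sorted((neutral_mass(mz, z), inten) for mz, z, inten in peaks)
    kept = []
    for m, inten in converted:
        if kept and m - kept[-1][0] < min_gap:
            continue  # step 2: drop peaks within 0.15 Da of the previous kept peak
        kept.append((m, inten))
    # Step 3: remove the peptide mass so only the sugar chain part remains.
    return [(m - peptide_mass, inten) for m, inten in kept]


peptide_mass = 1000.0
raw = [
    (1204.086676, 1, 50.0),  # neutral mass 1203.0794
    (602.572276, 2, 20.0),   # neutral mass 1203.13, a near-duplicate of the first
    (1407.166076, 1, 30.0),  # neutral mass 1406.1588
]
glycan_peaks = preprocess(raw, peptide_mass)
```

The doubly charged near-duplicate is dropped, leaving two sugar chain fragment masses close to 203.0794 and 406.1588.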

Fig. 4 is the original mass spectrum of the example, which shows the ion mass-to-charge ratio m/z and the abundance information intensity of the mass spectrum, and it can be seen that there are a large number of dense isotope peak clusters in the original mass spectrum, which has a certain influence on the identification algorithm of the present invention.

Fig. 5 shows the effect on the original mass spectrum data of the operations in steps 1, 2 and 3. It can be seen that the preprocessing filters out most isotope peaks in the original mass spectrum, thereby greatly reducing their influence on the identification algorithm.

And 4, screening the N-sugar chain spectrum from the mass spectrum obtained in the step 3. If a spectrum contains at least 3 relevant peaks of the pentasaccharide core, the spectrum is considered as an N-sugar chain spectrum. The relevant peaks of the pentasaccharide core are represented as: □, 2 □, 2 □ + ●, 2 □ +2 ● and 2 □ +3 ●, and the corresponding theoretical masses are [203.0794, 406.1588, 568.2116, 730.2644 and 892.3172 ]. The N-sugar chain spectrum screening steps are as follows:

Step 4.1: count the number num_core of peaks in the spectrum that match, within the error range, the theoretical mass of a peak related to the pentasaccharide core. Here the matching error threshold is set to 2.1 Da: if the absolute value of the difference between a peak mass in the spectrum and one of the theoretical masses is within this threshold, the peak is considered matched and is regarded as having monosaccharide characteristics.

Step 4.2: if the number num_core of peaks with monosaccharide characteristics obtained in step 4.1 satisfies num_core < 3, the spectrum is considered a non-N-sugar chain spectrum; otherwise it is an N-sugar chain spectrum. In the latter case the peaks with monosaccharide characteristics up to the pentasaccharide core are corrected, the corrected peak mass being the theoretical mass of the corresponding core-related peak.
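By way of example only, steps 4.1 and 4.2 can be sketched as below. The theoretical masses and the 2.1 Da threshold are taken from the description; the function names and the rule of snapping a matched peak to its nearest core mass are assumptions of this sketch.

```python
# Theoretical masses of the pentasaccharide-core-related peaks (from step 4).
CORE_MASSES = [203.0794, 406.1588, 568.2116, 730.2644, 892.3172]


def nearest_core_mass(mass, tol=2.1):
    """Return the closest core mass if it lies within tol (Da), else None."""
    best = min(CORE_MASSES, key=lambda t: abs(t - mass))
    return best if abs(best - mass) <= tol else None


def screen_and_correct(masses, tol=2.1, min_core=3):
    """Return (is_n_glycan_spectrum, masses with matched core peaks corrected)."""
    matches = [nearest_core_mass(m, tol) for m in masses]
    num_core = sum(t is not None for t in matches)
    if num_core < min_core:
        return False, list(masses)
    corrected = [t if t is not None else m for m, t in zip(masses, matches)]
    return True, corrected


accepted, corrected_masses = screen_and_correct([203.5, 405.0, 569.0, 1000.0])
print(accepted, corrected_masses)
```

Here three of the four peaks fall within 2.1 Da of a core mass, so the spectrum is accepted and those three peaks are moved to their theoretical masses.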

In order to improve the spectrum quality, the method also corrects the remaining peaks beyond the pentasaccharide core: peaks with monosaccharide characteristics are corrected to their theoretical masses using monosaccharides and generalized monosaccharides. Fig. 6 is the corrected mass spectrum of the example, which is taken as the input of the identification algorithm.

Generalized monosaccharides are explained here taking the second-order generalized monosaccharides as an example, which comprise the following combinations: [[□], [○], [△], [◆], [◇], [□, □], [□, ○], [○, □], [□, △], [△, □], [□, ◆], [◆, □], [□, ◇], [◇, □], [○, ○], [○, △], [△, ○], [○, ◆], [◆, ○], [○, ◇], [◇, ○], [△, △], [△, ◆], [◆, △], [△, ◇], [◇, △], [◇, ◇], [◇, ◆], [◆, ◆], [◆, ◇]], where each symbol is as defined in the monosaccharide type description in Table 1. If a peak is missing between two peaks, the mass difference between the adjacent peaks may correspond to a generalized monosaccharide. For example, if the masses m_i and m_{i+1} of two adjacent peaks differ by the mass of the generalized monosaccharide [○, □], there are two possible cases of peak loss. In the first, [○, □], one peak is missing between the two peaks at mass m_i + m_○, so a peak with abundance 0 should exist at that mass. In the second, [□, ○], one peak is missing between the two peaks at mass m_i + m_□, so a peak with abundance 0 corresponds to that mass. In the N-sugar chain identification process, both cases are considered together.
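As a non-limiting sketch, the second-order generalized monosaccharide dictionary and the implied missing peaks can be generated as follows. Table 1 is not reproduced in this section, so the residue names and masses below are assumptions: the HexNAc and Hex values follow from the core masses given in step 4, and the remaining three are commonly used monoisotopic residue masses. The function names are likewise hypothetical.

```python
from itertools import product

# Residue masses in Da (assumed; Table 1 is not available in this section).
RESIDUE_MASS = {
    "HexNAc": 203.0794,
    "Hex": 162.0528,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
    "NeuGc": 307.0903,
}


def generalized_dictionary(order=2):
    """All ordered monosaccharide sequences of length 1..order with their total mass."""
    entries = []
    for length in range(1, order + 1):
        for combo in product(RESIDUE_MASS, repeat=length):
            entries.append((combo, sum(RESIDUE_MASS[g] for g in combo)))
    return entries


def missing_peak_masses(m_i, combo):
    """Masses of the zero-abundance peaks implied when `combo` bridges two peaks."""
    masses, running = [], m_i
    for residue in combo[:-1]:  # the last residue lands on the observed peak
        running += RESIDUE_MASS[residue]
        masses.append(running)
    return masses


dictionary = generalized_dictionary(order=2)
print(len(dictionary))  # 5 single residues + 25 ordered pairs
print(missing_peak_masses(892.3172, ("Hex", "HexNAc")))
print(missing_peak_masses(892.3172, ("HexNAc", "Hex")))
```

The two calls at the end correspond to the two peak-loss cases described above: the same mass difference yields a different missing-peak position depending on the order of the two monosaccharides.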

In the mass spectrum, two peaks are considered to have monosaccharide characteristics if the mass difference between them matches the mass of a monosaccharide (see Table 1) within the error range Δ = 0.2.

The masses of the peaks with monosaccharide characteristics in the mass spectrum are corrected to theoretical masses. After the peaks with monosaccharide characteristics have been screened, they are corrected as follows:

(1) calculating the mass difference Δm of adjacent spectral peaks; if the mass of some monosaccharide or generalized monosaccharide lies in the range [Δm - Δ, Δm + Δ], the mass difference is considered to match the mass of that monosaccharide or generalized monosaccharide;

(2) updating the mass difference to the mass of the matched monosaccharide or generalized monosaccharide, and recalculating the peak mass by adding that mass; the result is the corrected mass, which is referred to as the theoretical mass.
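Purely for illustration, correction steps (1) and (2) can be sketched as below. The sketch assumes that the first peak already sits at its theoretical mass, that the list `unit_masses` holds the masses of the monosaccharides and generalized monosaccharides in use, and that an unmatched peak is left unchanged; these choices and the function name are assumptions rather than part of the described method.

```python
def correct_masses(masses, unit_masses, delta=0.2):
    """Snap adjacent-peak mass differences to the nearest unit mass within delta.

    masses: ascending peak masses; the first is assumed to be theoretical already.
    unit_masses: masses of monosaccharides / generalized monosaccharides.
    """
    corrected = [masses[0]]
    for prev_obs, cur_obs in zip(masses, masses[1:]):
        diff = cur_obs - prev_obs  # step (1): observed mass difference
        match = min(unit_masses, key=lambda u: abs(u - diff))
        if abs(match - diff) <= delta:
            # Step (2): rebuild the mass from the corrected predecessor.
            corrected.append(corrected[-1] + match)
        else:
            corrected.append(cur_obs)  # no match: keep the observed mass
    return corrected


units = [162.0528, 203.0794]  # Hex and HexNAc residue masses
print(correct_masses([892.3172, 1054.40, 1257.43], units))
```

With these inputs the differences 162.0828 and 203.03 are within 0.2 of the Hex and HexNAc masses respectively, so both later peaks are rebuilt from theoretical increments.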

Step 5: the sugar chain tree structure is grown according to the following rule: at each step, the method tries to grow the lightest monosaccharide, starting from the position of the lightest spectral peak that can be reached.

When the sugar chain structure has grown to the pentasaccharide core structure and a further monosaccharide g is to be grown, the following conditions must be satisfied: (1) in the mass spectrum to be identified, abundance exists at the mass position reached by growing the monosaccharide g; (2) the mass positions of the cross peaks required for growing the monosaccharide have abundance in the mass spectrum to be identified in a proportion greater than or equal to θ%, where θ% (θ = 20) represents the support degree of the cross terms; (3) the monosaccharide composition of the structure after growth does not exceed that of the target sugar chain; (4) the structure after growth satisfies the biological rule constraints. If these conditions hold, both cases, growing the monosaccharide and not growing it, are considered on the basis of the existing structure.
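A minimal sketch of the growth conditions is given below for illustration. It reads the cross-peak support as the percentage of required cross-peak mass positions that carry abundance in the spectrum, which is one interpretation of the text and is an assumption, as are the function names, the matching tolerance `tol`, and the treatment of "no cross peaks required" as fully supported. The biological rule constraints are omitted.

```python
def has_abundance(observed_masses, mass, tol=0.2):
    """True if some observed peak lies within tol (Da) of the given mass."""
    return any(abs(m - mass) <= tol for m in observed_masses)


def cross_support_percent(observed_masses, cross_masses, tol=0.2):
    """Percentage of required cross-peak positions present in the spectrum."""
    if not cross_masses:
        return 100.0  # assumption: nothing to support counts as supported
    hits = sum(has_abundance(observed_masses, c, tol) for c in cross_masses)
    return 100.0 * hits / len(cross_masses)


def may_grow(observed_masses, new_peak_mass, cross_masses,
             composition, target, theta=20.0, tol=0.2):
    """Check conditions (1) to (3); biological rules are not modelled here."""
    if not has_abundance(observed_masses, new_peak_mass, tol):
        return False
    if cross_support_percent(observed_masses, cross_masses, tol) < theta:
        return False
    return all(n <= target.get(name, 0) for name, n in composition.items())


observed = [203.0794, 406.1588, 568.2116, 730.2644]
target = {"HexNAc": 2, "Hex": 3}
print(may_grow(observed, 730.2644, [], {"HexNAc": 2, "Hex": 2}, target))
```

When `may_grow` returns True, the search would branch into both the grown and the ungrown structure, as stated above.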

Each time a monosaccharide is grown, one unit of intensity is added to the abundance at the corresponding mass, and the peak is marked as a basal peak or a cross peak.

If, after the growth is finished, several basal peaks and cross peaks of unit intensity fall on the same mass, the abundance at that mass is the number of times the peak was marked; this forms the theoretical spectrum S' of the structure. For the calculation of the basal and cross peaks and the construction of the theoretical mass spectrum, see Fig. 7.

Fig. 7 shows a calculation example of the basal peaks and cross peaks of a sugar chain structure. The numbers in the circles are the node numbers, and the mass of the node numbered i is denoted m_i. The nodes numbered 2 and 3 are leaf nodes, and the node numbered 1 is an internal node. Assuming node 3 is the monosaccharide that has just been grown, the basal peaks of the structure before growth are [1] and [1, 2], there are no cross peaks, and the corresponding theoretical mass spectrum is [[m₁, 'b'], [m₁+m₂, 'b']], where 'b' marks a basal peak. The basal peak brought by growing node 3 is [1, 3] and the cross peak is [1, 2, 3], so the corresponding theoretical mass spectrum is [[m₁, 'b'], [m₁+m₂, 'b'], [m₁+m₃, 'b'], [m₁+m₂+m₃, 'c']], where 'c' marks a cross peak.
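The example of Fig. 7 can be reproduced with the following sketch, offered only as an illustration. It assumes that every theoretical peak corresponds to a connected set of nodes containing the root, that such a set is a basal peak when it forms a single path and a cross peak otherwise, and that the tree is given as a child-list dictionary; the function names and the numeric masses are hypothetical.

```python
from itertools import product


def rooted_subtrees(children, node):
    """All connected node sets that contain `node` and lie within its subtree."""
    options_per_child = []
    for child in children.get(node, []):
        # A child branch is either absent (empty set) or one of its own subtrees.
        options_per_child.append([frozenset()] + rooted_subtrees(children, child))
    return [frozenset({node}).union(*choice) for choice in product(*options_per_child)]


def theoretical_spectrum(children, mass, root):
    """Sorted list of (mass, label) with 'b' for basal and 'c' for cross peaks."""
    peaks = []
    for subset in rooted_subtrees(children, root):
        # A single path has at most one included child under every included node.
        single_path = all(
            sum(c in subset for c in children.get(n, [])) <= 1 for n in subset
        )
        peaks.append((sum(mass[n] for n in subset), "b" if single_path else "c"))
    return sorted(peaks)


children = {1: [2, 3]}            # node 1 is the root, nodes 2 and 3 are leaves
mass = {1: 10.0, 2: 3.0, 3: 5.0}  # illustrative masses m_1, m_2, m_3
print(theoretical_spectrum(children, mass, root=1))
```

Peaks that coincide in mass would be accumulated into a single abundance count when building S'; that accumulation is left out here to keep the labels visible.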

Step 6: the invention scores the identified N-sugar chain structures with reference to the evaluation method in the document "An approach for N-linked glycan identification from MS/MS spectra by target-decoy strategy", and sorts the N-sugar chain structures by score in descending order. Fig. 8 shows the top 10 identification results of the example and their scores; the first-ranked structure is selected as the final identification result.

In order to show the degree of matching between the mass spectrum of the identification result and the input mass spectrum of the algorithm, the theoretical mass spectrum of the first-ranked sugar chain structure is compared with the mass spectrum of the example, as shown in Fig. 9(a). It can be seen that most spectral lines of the two mass spectra match within the mass error range Δ = 2.1 Da. Fig. 9(b) is the theoretical mass spectrum of the first-ranked identification result of this example, with its spectral peaks labelled as cross peaks and basal peaks.

Step 7: the invention adopts a target-decoy strategy to obtain the p-value of each identification result and, on that basis, applies multiple testing to obtain the FDR of the identification results.

Fig. 8 shows the top ten identification results and their scores for this example, noting that these are all isomers, but with different scores, and the top ranked isomer is the same.

The first-ranked structure identified in this example had a score of 24.65. The scores of the 1000 decoys were all smaller than that of the first-ranked structure, so the p-value was 1/1001; the score distribution of the 1000 decoys is shown in Fig. 10. By comparison, the N-sugar chain structure resolved by the StrucGP tool, published in Nature Methods in 2021, scored only 22.65, indicating that the method and system of the present invention achieve a higher identification quality in the identification of N-sugar chain structures.
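The p-value of 1/1001 quoted above is consistent with the usual empirical estimate in which the target itself is counted once. A minimal sketch follows; the function name and the use of "greater than or equal" for ties are assumptions, and the decoy scores are placeholders rather than data from the example.

```python
def empirical_p_value(target_score, decoy_scores):
    """Empirical p-value: (decoys scoring at least the target + 1) / (decoys + 1)."""
    exceed = sum(score >= target_score for score in decoy_scores)
    return (exceed + 1) / (len(decoy_scores) + 1)


# Placeholder decoy scores, all below the target score of 24.65.
placeholder_decoys = [10.0 + 0.01 * i for i in range(1000)]
p_value = empirical_p_value(24.65, placeholder_decoys)
print(p_value)
```

With 1000 decoys and none reaching the target score, the estimate is (0 + 1) / (1000 + 1).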

N-sugar chain identification was performed on all 669 mass spectra using the system of the present invention, and 669 N-sugar chain structures were identified. For each mass spectrum, its mass differences were randomly rearranged 1000 times to obtain decoy spectra, which were scored and compared with the score of the identification result to obtain the p-value of the identified structure. Quality control of these p-values was performed using the algorithm proposed by Storey (2002), and the FDR values of these structures were obtained (maximum 7.3767e-04, minimum 1.1384e-06), so that all 669 structures were identified at the FDR = 0.001 level; even with quality control using the more conservative BH algorithm (1995) (FDR values of these structures: maximum 0.0240, minimum 0.001), 662 structures were identified at the FDR = 0.05 level.
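For illustration only, decoy generation by rearranging mass differences and the Benjamini-Hochberg (BH) adjustment can be sketched as follows. The sketch assumes that a decoy keeps the first peak mass and re-accumulates the shuffled differences, and it shows only the BH procedure, not Storey's method; the function names are hypothetical.

```python
import random


def decoy_spectrum(masses, rng):
    """Shuffle the adjacent mass differences and rebuild the peak masses."""
    diffs = [b - a for a, b in zip(masses, masses[1:])]
    rng.shuffle(diffs)
    decoy = [masses[0]]  # assumption: the first peak mass is kept fixed
    for d in diffs:
        decoy.append(decoy[-1] + d)
    return decoy


def benjamini_hochberg(p_values):
    """BH-adjusted p-values, returned in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value downwards
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


spectrum = [203.0794, 406.1588, 568.2116, 730.2644, 892.3172]
decoy = decoy_spectrum(spectrum, random.Random(0))
print(decoy)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A structure would then be reported at a given FDR level when its adjusted value does not exceed that level.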

TABLE 1 monosaccharide type description Table

Note: in the table, the complete mass indicates the mass of monosaccharide before water molecules are removed, and the residual mass indicates the mass of monosaccharide after water molecules are removed, i.e., the mass of water molecules added to the residual mass is equal to the complete mass, wherein the mass of water molecules is

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.

The above description is only for the purpose of illustrating specific embodiments of the present invention and is not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A Denovo method for identifying an N-sugar chain structure based on mass spectrum data, characterized in that the method extracts the structure and composition information of sugar chain fragment ions in the mass spectrum data; introduces a generalized monosaccharide dictionary to improve the robustness of the identified structure to noise in the mass spectrum data; introduces basal peaks and cross peaks, and grows the sugar chain structure by growing basal peaks under the support of cross peaks; and narrows the search space of the candidate structures of the identification result by using a pruning strategy, finally identifying the N-sugar chain structure corresponding to the mass spectrum.

2. The method for identifying Denovo based on an N-sugar chain structure of mass spectrometry data as claimed in claim 1, wherein the method comprises the steps of:

step one, reading mass spectrum data processed by a mass spectrometer, and extracting the data related to identification; converting the mass-to-charge ratio m/z of the mass spectrum into mass m by preprocessing the mass spectrum; judging whether a pentasaccharide core exists by using the pentasaccharide-core-related spectral peak judging method, and if so, proceeding to step two;

step two, correcting the masses of the spectral peaks with monosaccharide characteristics in the mass spectrum to theoretical masses based on monosaccharides and generalized monosaccharides;

step three, initializing a sugar chain structure as the root node of a tree, continuously growing the sugar chain from the initial structure by growing monosaccharides according to a certain rule, calculating the basal peaks and cross peaks of the structure after each monosaccharide is grown, and generating the theoretical mass spectrum of the structure based on the calculated basal peaks and cross peaks;

step four, filtering isomorphic structures among the grown structures through a pruning strategy to obtain the N-sugar chain structure identification results; and scoring and evaluating the identification results with reference to the theoretical mass spectra, the structure ranked first by score being the identified sugar chain structure.

3. The method for identifying Denovo based on N-sugar chain structure of mass spectrometry data as claimed in claim 2, wherein in step one, the related data comprises: the sugar chain mass GlycanMass, the peptide chain mass PeptideMass, and the low-energy spectral peaks obtained.

4. The method for identifying Denovo based on the N-sugar chain structure of mass spectrometry data as claimed in claim 2, wherein in step two, performing the peak mass correction on the mass spectrum having the pentasaccharide core based on the monosaccharides and the generalized monosaccharides comprises:

(1) calculating the mass difference Δm between adjacent spectral peaks; if the mass of a monosaccharide or generalized monosaccharide is within the range [Δm - Δ, Δm + Δ], the mass difference matches the mass of that monosaccharide or generalized monosaccharide, wherein Δ is the correction error with a value of 0.2;

(2) updating the mass difference to the mass of the corresponding monosaccharide or generalized monosaccharide, and recalculating the peak mass by adding the mass of the corresponding monosaccharide or generalized monosaccharide; the result is the corrected mass of the spectral peak with monosaccharide characteristics, namely the theoretical mass.

5. The method for identifying Denovo based on the N-sugar chain structure of mass spectrometry data as claimed in claim 2, wherein in step three, the rule by which the sugar chain structure grows monosaccharides comprises: at each step, trying to grow the lightest monosaccharide.

6. The method for identifying Denovo based on N-sugar chain structure of mass spectrometry data as claimed in claim 2, wherein the base peak and the cross peak comprise:

the basal peak is a peak in the spectrum S that is associated with only a single monosaccharide path: the sums of the monosaccharide masses on each monosaccharide path of the sugar chain G corresponding to the spectrum S are denoted b_i, 1 ≤ i ≤ k_b; if m = b_i for some i, 1 ≤ i ≤ k_b, the peak at mass m in the spectrum S is a basal peak;

the cross peak is a peak in S that is associated with two or more monosaccharide paths: the sums of the monosaccharide masses on any two or more monosaccharide paths of G are denoted C_i, 1 ≤ i ≤ k_c; if m = C_i for some i, 1 ≤ i ≤ k_c, the peak at mass m in the spectrum S is a cross peak.

7. The method for identifying Denovo based on the N-sugar chain structure of mass spectrometry data according to claim 2, wherein in step three, the theoretical mass spectrometry generation method comprises:

8. The method for identifying Denovo based on the N-sugar chain structure of mass spectrometry data as claimed in claim 2, wherein the growth of the monosaccharide further comprises, in step three:

(1) in the mass spectrum to be identified, abundance exists at the mass position reached by growing the monosaccharide g; (2) the mass positions of the cross peaks required for growing the monosaccharide have abundance in the mass spectrum to be identified in a proportion greater than or equal to θ%, where θ% represents the support degree of the cross terms and θ is 20; (3) the monosaccharide composition of the structure after growth does not exceed that of the target sugar chain; (4) the structure after growth satisfies the following biological rule constraints; if these conditions hold, both cases, growing the monosaccharide and not growing it, are considered on the basis of the existing structure.

The constraints of the biological rules include:

no further monosaccharide is grown after a △ (dHex) monosaccharide;

the number of sugar chain trees is 4 at most;

9. The Denovo method for identifying N-sugar chain structures based on mass spectrometry data as claimed in claim 2, wherein the filtering of the structures obtained by the pruning strategy comprises:

1) judging whether the structure has the pentasaccharide core structure and satisfies the biological rules, and filtering out the structure if it does not;

2) calculating hash values of the structures, judging the structures to be isomorphic if the hash values are equal, and only keeping one of the structures in the isomorphic structures;

the hash value calculation formula of the structure is as follows:

Hs(x)＝(∑Hs(Son_x[i])²)+Offset_x；
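Purely as an illustrative sketch of the hash formula above: Offset_x is not defined in this section, so the code assumes it is an integer code assigned to the monosaccharide type at node x, and Son_x is taken to be the list of children of node x. The function name and the example codes are hypothetical. Because the children contribute through a sum, the value does not depend on child order; equal values are treated as isomorphic in the claim, although a sum-of-squares hash can in principle collide.

```python
def structure_hash(node, children, offset):
    """Hs(x) = sum over children of Hs(child)**2, plus Offset_x."""
    child_sum = sum(structure_hash(c, children, offset) ** 2
                    for c in children.get(node, []))
    return child_sum + offset[node]


# Assumed type codes per node: node -> integer offset of its monosaccharide type.
offsets = {1: 1, 2: 2, 3: 3}

tree_a = {1: [2, 3]}      # node 1 with children 2 and 3
tree_b = {1: [3, 2]}      # same structure, children listed in the other order
tree_c = {1: [2], 2: [3]}  # a chain 1 -> 2 -> 3, a different structure

print(structure_hash(1, tree_a, offsets),
      structure_hash(1, tree_b, offsets),
      structure_hash(1, tree_c, offsets))
```

In this example the first two structures receive the same value and only one of them would be kept, while the chain receives a different value.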

10. Use of the Denovo method for identifying a sugar chain structure based on mass spectrometry data according to any one of claims 1 to 9.