CN113362899B

CN113362899B - Deep learning-based protein mass spectrum data analysis method and system

Info

Publication number: CN113362899B
Application number: CN202110425032.6A
Authority: CN
Inventors: 何情祖; 郭欢; 帅建伟; 韩家淮
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2023-12-19
Anticipated expiration: 2041-04-20
Also published as: CN113362899A

Abstract

The invention discloses a deep learning-based method and system for analyzing protein mass spectrometry data. The method comprises the following steps: obtaining DIA protein data of a sample; based on the DIA protein data, taking a slider that moves in a specific step along the retention time dimension as the minimum processing unit, deleting background ions with a low signal-to-noise ratio in the slider, and determining candidate parent ions and candidate fragment ions; feeding the extracted ion chromatograms of the candidate fragment ions into the encoder neural network of a variational autoencoder, embedding them into Euclidean space, and then dividing the candidate fragment ions into k classes with a k-means clustering algorithm; combining each fragment ion cluster with a corresponding parent ion based on a protein database to generate parent ion-fragment ion pairs; and re-evaluating the parent ion-fragment ion pairs by calculating the similarity between the fragment ions matched with the theoretical spectrum, and storing the parent ion-fragment ion pairs whose similarity exceeds a preset threshold as pseudo-tandem spectra. The invention can increase the number of identified peptides and proteins.

Description

Deep learning-based protein mass spectrum data analysis method and system

Technical Field

The invention relates to the field of protein analysis in proteomics, in particular to a method and a system for analyzing protein mass spectrum data based on deep learning.

Background

Mass spectrometry has long been the dominant technique for identifying and quantifying peptides and proteins. The typical identification strategy combines the data-dependent acquisition (DDA) mode with a search engine. In DDA mode, data are acquired serially: in each primary mass spectrum (MS1) scan only the k most intense peptide ions are selected and fragmented, and the resulting fragment ions form a tandem mass spectrum (secondary mass spectrum, MS2). The search engine matches each MS2 spectrum against the theoretical spectra of peptides to identify the peptide that produced it. However, because the k most intense peptide ions vary randomly between repeated DDA experiments, the reproducibility of the peptides identified by DDA is poor. To overcome this limitation, the data-independent acquisition (DIA) strategy emerged. SWATH (Sequential Windowed Acquisition of All Theoretical Fragment Ions) is a common DIA mode in which the mass spectrometer fragments all peptide ions within each MS1 isolation window and records the signals of all fragment ions. Consequently, the fragment signals of many peptides are mixed in the corresponding MS2 spectrum: the fragment ions in one MS2 spectrum derive from multiple parent ions and are also disturbed by unfragmented parent ions. DIA data are therefore highly complex and extremely difficult to analyze directly.

Current methods for analyzing DIA data in proteomics fall mainly into two groups: protein identification and quantification based on database-search workflows and statistical-test-based software; and protein identification and quantification using deep learning.

Unlike traditional protein identification methods, deep learning methods rely on computing power to train on large data sets, so that the machine learns the intrinsic rules and features of the data autonomously. The main deep learning methods are the following:

DeepNovo-DIA: [Tran N H, Zhang X, Xin L, et al. De novo peptide sequencing by deep learning [J]. Proceedings of the National Academy of Sciences of the United States of America, 2017, 114(31): 201705691.] DeepNovo-DIA, proposed in 2017 by Ming Li et al., combines de novo peptide sequencing with deep learning to identify the amino acid sequences of peptides directly from DIA spectra.

DIA-NN: In 2019, a collaboration of several institutions, including the Department of Biochemistry of the University of Cambridge and the Milner Therapeutics Institute, [Demichev V, Messner C B, Vernardis S I, et al. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput [J]. Nature Methods, 2020, 17(1): 41-44.] presented DIA-NN, a convenient integrated software package that uses deep neural networks together with new quantification and signal-correction strategies to process the results of DIA proteomics experiments. DIA-NN improves the identification and quantification performance of conventional DIA proteomics, is fast in high-throughput applications, and achieves accurate, deep proteome coverage when combined with fast chromatographic methods.

DeepDIA: In 2020, Fudan University and collaborators proposed DeepDIA, a new method for analyzing DIA proteomes [Yang Y, Liu X, Shen C, et al. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics [J]. Nature Communications, 11.]. A deep neural network model based on convolutional and recurrent neural networks predicts the secondary mass spectrum (MS/MS) and normalized retention time (iRT) of peptides, so that a list of peptides identified by DDA can be turned into the spectral library required for DIA analysis.

Many deep learning-based methods are now widely used in protein data analysis workflows, but the software developed so far relies on supervised learning, which limits its generalization ability. Many problems in the analysis of protein DIA data remain unsolved, so it is important to provide an unsupervised deep learning model for DIA data analysis, based on a k-means clustering algorithm and a variational autoencoder, that achieves accurate identification and quantification of proteins.

Disclosure of Invention

The main aim of the invention is to provide a deep learning-based method and system for analyzing protein mass spectrometry data that increase the number of identified peptides and proteins and thereby yield more accurate protein identification results.

The invention adopts the following technical scheme:

In a first aspect, the invention provides a deep learning-based method for analyzing protein mass spectrometry data, comprising the following steps:

S101, obtaining DIA protein data of a sample;

S102, based on the DIA protein data, taking a slider that moves in a specific step along the retention time dimension as the minimum processing unit, deleting background ions with a low signal-to-noise ratio in the slider, and determining candidate parent ions and candidate fragment ions;

S103, feeding the extracted ion chromatograms of the candidate fragment ions into the encoder neural network of a variational autoencoder, embedding them into Euclidean space, and then dividing the candidate fragment ions into k classes with a k-means clustering algorithm;

S104, combining each fragment ion cluster with a corresponding parent ion based on a protein database to generate parent ion-fragment ion pairs;

S105, re-evaluating the parent ion-fragment ion pairs by calculating the similarity between the fragment ions matched with the theoretical spectrum, and storing the parent ion-fragment ion pairs whose similarity exceeds a preset threshold as pseudo-tandem spectra.
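Step S103 operates on extracted ion chromatograms (XICs). As an illustrative, non-limiting sketch of how an XIC could be built from the spectra of one slider, the following Python code sums, for each spectrum, the intensities of the peaks lying within a mass tolerance of a target m/z. The function name `extract_xic`, the `(retention_time, mz_values, intensities)` spectrum layout and the 20 ppm tolerance are assumptions made for illustration; none of them is specified by the invention.

```python
import numpy as np

def extract_xic(spectra, target_mz, tol_ppm=20.0):
    """Build an XIC for one target m/z.

    spectra: iterable of (retention_time, mz_values, intensities) tuples
             (hypothetical layout chosen for this sketch).
    Returns two arrays: retention times and summed intensities.
    """
    tol = target_mz * tol_ppm * 1e-6  # absolute tolerance in m/z units
    rts, intensities = [], []
    for rt, mz_values, ints in spectra:
        mz_values = np.asarray(mz_values, dtype=float)
        ints = np.asarray(ints, dtype=float)
        mask = np.abs(mz_values - target_mz) <= tol
        rts.append(rt)
        intensities.append(ints[mask].sum())  # 0.0 when no peak matches
    return np.array(rts), np.array(intensities)
```

Calling `extract_xic(spectra, 500.0)` on the spectra of a slider returns one intensity value per spectrum, which is the kind of fixed-length trace that an encoder network can take as input.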

Preferably, taking the slider that moves in a specific step along the retention time dimension as the minimum processing unit specifically includes:

dividing each MS1 isolation window into fixed-width sliders along the retention time dimension; each slider is regarded as the minimum processing unit and contains a series of MS1 spectra and the corresponding MS2 spectra.
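The slider division described above can be sketched as follows. This is only an illustration: the width and step are expressed here as numbers of acquisition cycles, and the names `Slider` and `make_sliders` as well as the example values are hypothetical, since the invention does not fix them.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Slider:
    start_idx: int   # index of the first cycle in the slider
    end_idx: int     # index one past the last cycle (exclusive)
    rt_start: float  # retention time of the first cycle
    rt_end: float    # retention time of the last cycle

def make_sliders(cycle_rts: Sequence[float], width: int, step: int) -> List[Slider]:
    """Split the cycles of one MS1 isolation window into fixed-width sliders
    that advance by `step` cycles along the retention time dimension."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    n = len(cycle_rts)
    sliders = []
    for start in range(0, max(n - width, 0) + 1, step):
        end = min(start + width, n)
        if end > start:  # skip the empty slider produced when n == 0
            sliders.append(Slider(start, end, cycle_rts[start], cycle_rts[end - 1]))
    return sliders
```

With ten cycles, a width of 4 and a step of 2, the sketch yields four overlapping sliders covering cycles 0-3, 2-5, 4-7 and 6-9. The MS1 spectra and the corresponding MS2 spectra of each slider would then be processed together.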

Preferably, deleting the background ions with a low signal-to-noise ratio in the slider specifically includes:

deleting the background ions with a low signal-to-noise ratio in the slider by means of signal-to-noise-ratio-related algorithms, which comprise a peak-finding algorithm and a deisotoping algorithm.
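The invention does not disclose the details of the peak-finding and deisotoping algorithms. Purely as a hedged illustration of signal-to-noise filtering, the sketch below scores each chromatographic trace by the height of its apex above the median baseline, divided by a simple noise estimate (the median absolute difference between neighbouring points), and discards traces below a threshold. The estimator, the names `snr` and `filter_low_snr`, and the default threshold of 3.0 are assumptions; deisotoping is not shown.

```python
import numpy as np

def snr(xic) -> float:
    """Crude signal-to-noise ratio of one chromatographic trace (assumed estimator)."""
    xic = np.asarray(xic, dtype=float)
    baseline = np.median(xic)
    signal = xic.max() - baseline
    # Noise: median absolute difference between neighbouring points.
    noise = np.median(np.abs(np.diff(xic))) if xic.size > 1 else 0.0
    if noise == 0.0:
        # Flat trace -> no signal; isolated spike on a flat baseline -> infinite SNR.
        return float("inf") if signal > 0 else 0.0
    return float(signal / noise)

def filter_low_snr(xics: dict, min_snr: float = 3.0) -> dict:
    """Keep only the ions (keyed by m/z) whose trace reaches `min_snr`."""
    return {mz: trace for mz, trace in xics.items() if snr(trace) >= min_snr}
```

In this sketch a trace with a clear apex is retained as a candidate ion, while a trace that merely fluctuates around its baseline is treated as background and removed.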

Preferably, before step S103, the method further includes:

generating triplet data to train the encoder neural network of the variational autoencoder; the triplet data are obtained as follows:

storing six extracted ion chromatograms (XICs) of the quantified peptides;

randomly selecting the XICs of two fragments of the same peptide as the anchor sample and the positive sample; randomly selecting the negative sample from a different peptide;

combining the anchor data, the positive data, and the negative data into triplet data.
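The triplet construction above can be sketched as follows. The sketch assumes that the stored XICs are available as a dictionary mapping each peptide to its list of fragment XICs; the name `make_triplets`, this data layout and the fixed random seed are hypothetical choices made for illustration.

```python
import random
from typing import Dict, List, Tuple

XIC = List[float]

def make_triplets(peptide_xics: Dict[str, List[XIC]], n: int,
                  seed: int = 0) -> List[Tuple[XIC, XIC, XIC]]:
    """Draw n (anchor, positive, negative) triplets of fragment XICs."""
    rng = random.Random(seed)
    # Anchor and positive need two distinct XICs from the same peptide.
    eligible = [p for p, xics in peptide_xics.items() if len(xics) >= 2]
    non_empty = [p for p, xics in peptide_xics.items() if xics]
    if not eligible or len(non_empty) < 2:
        raise ValueError("need a peptide with >= 2 XICs and a second peptide")
    triplets = []
    for _ in range(n):
        pep = rng.choice(eligible)
        anchor, positive = rng.sample(peptide_xics[pep], 2)
        # The negative sample comes from a different peptide.
        other = rng.choice([p for p in non_empty if p != pep])
        negative = rng.choice(peptide_xics[other])
        triplets.append((anchor, positive, negative))
    return triplets
```

Each returned tuple holds two XICs of the same peptide followed by one XIC of a different peptide, which is the form of input that a triplet-based training objective expects.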

Preferably, the encoder neural network of the variational autoencoder comprises four branch networks, each consisting of one or more fully connected layers with a plurality of neurons; the output vectors of the four branch networks are concatenated end to end, and the concatenated vector is split into two vectors of equal dimension, one representing the standard deviation and the other representing the mean.

Preferably, step S104 specifically includes:

digesting the proteins in a protein database in silico into theoretical peptides according to the cleavage sites of a protease, the theoretical peptides forming a theoretical peptide database, and determining the parent ion corresponding to each fragment ion cluster according to the theoretical peptide database, so as to generate parent ion-fragment ion pairs; wherein the theoretical peptide database comprises several columns of peptide information, and each peptide corresponds to a unique peptide index.
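As a hedged illustration of the in silico digestion, the sketch below applies the commonly used trypsin rule (cleave after K or R unless the next residue is P) and assigns each distinct theoretical peptide a unique integer index. The choice of trypsin, the absence of missed cleavages, the function names and the columns `sequence` and `proteins` are assumptions; the invention only requires a protease with defined cleavage sites and a peptide table with unique indexes.

```python
import re
from typing import Dict, List

def digest_trypsin(sequence: str, min_len: int = 1) -> List[str]:
    """Cleave after K or R, except before P; no missed cleavages (assumed rule)."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if len(p) >= min_len]

def build_peptide_db(proteins: Dict[str, str], min_len: int = 1) -> Dict[int, dict]:
    """Map a unique peptide index to a row of peptide information."""
    db: Dict[int, dict] = {}
    index_of: Dict[str, int] = {}
    for name, seq in proteins.items():
        for pep in digest_trypsin(seq, min_len):
            if pep not in index_of:
                index_of[pep] = len(db)
                db[index_of[pep]] = {"sequence": pep, "proteins": [name]}
            elif name not in db[index_of[pep]]["proteins"]:
                # The same peptide can occur in several proteins.
                db[index_of[pep]]["proteins"].append(name)
    return db
```

For the toy input `{"P1": "AAKGGRPCCKLL", "P2": "AAKTTR"}` the sketch produces four indexed peptides (AAK, GGRPCCK, LL and TTR), with AAK attributed to both proteins. Further columns, such as parent ion and fragment ion m/z values, would be added to each row in practice.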

Preferably, step S105 specifically includes:

building an inverted index table in binary form, the inverted index table representing the intersection of two sets of peptide indexes, namely the index set obtained by the parent ion query and the index set obtained by the fragment ion query;

mapping parent ions to peptide indexes by their mass-to-charge ratios, and mapping fragment ions to peptide indexes by querying with the mass-to-charge ratios of the fragment ions;

calculating the scores of the peptides, ranking all theoretical peptides by score according to the inverted index table, and then calculating the similarity between the fragment ions of the cluster that match the highest-scoring theoretical peptide; if the similarity exceeds a specified threshold, the fragment ions in the cluster and the corresponding parent ion are grouped together and stored as a tandem spectrum.
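The invention does not state which similarity measure is applied to the matched fragment ions. As one possible, non-authoritative reading, the sketch below uses the mean pairwise cosine similarity between the XICs of the matched fragment ions and accepts the cluster when it reaches a threshold. The measure, the names `mean_pairwise_similarity` and `accept_as_pseudo_spectrum`, and the threshold of 0.9 are assumptions.

```python
from itertools import combinations
import numpy as np

def mean_pairwise_similarity(xics) -> float:
    """Mean cosine similarity over all pairs of fragment XICs (assumed measure)."""
    xics = np.asarray(xics, dtype=float)
    if len(xics) < 2:
        return 0.0  # a single fragment cannot be compared with anything
    sims = []
    for i, j in combinations(range(len(xics)), 2):
        denom = np.linalg.norm(xics[i]) * np.linalg.norm(xics[j])
        sims.append(float(xics[i] @ xics[j] / denom) if denom > 0 else 0.0)
    return float(np.mean(sims))

def accept_as_pseudo_spectrum(xics, threshold: float = 0.9) -> bool:
    """Keep the parent ion-fragment ion group only if its fragments co-elute."""
    return mean_pairwise_similarity(xics) >= threshold
```

Fragment ions of the same peptide are expected to co-elute, so their traces have nearly the same shape and a similarity close to 1, whereas traces that peak at different retention times score close to 0.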

In a second aspect, the invention provides a deep learning-based system for analyzing protein mass spectrometry data, comprising:

a data acquisition module for obtaining DIA protein data of a sample;

a preprocessing module for, based on the DIA protein data, taking a slider that moves in a specific step along the retention time dimension as the minimum processing unit, deleting background ions with a low signal-to-noise ratio in the slider, and determining candidate parent ions and candidate fragment ions;

an encoder module for feeding the extracted ion chromatograms of the candidate fragment ions into the encoder neural network of a variational autoencoder and embedding them into Euclidean space;

a cluster analysis module for dividing the output of the encoder module into k classes with a k-means clustering algorithm;

a PIndex index query module for combining each fragment ion cluster with a corresponding parent ion based on a protein database to generate parent ion-fragment ion pairs, and for re-evaluating the parent ion-fragment ion pairs by calculating the similarity between the fragment ions matched with the theoretical spectrum, the parent ion-fragment ion pairs whose similarity exceeds a preset threshold being stored as pseudo-tandem spectra.

Preferably, the deep learning-based system for analyzing protein mass spectrometry data further comprises:

a decoder module for reconstructing the latent features so as to restore the data as faithfully as possible.

In a third aspect, the invention provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the above deep learning-based method for analyzing protein mass spectrometry data when executing the computer program.

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention omits the step of performing a data-dependent acquisition (DDA) experiment in advance to build a spectral library for DIA analysis. The deep learning network adopts a variational autoencoder structure and a k-means clustering algorithm to detect the correspondence between peptide parent ions and peptide fragment ions, so that the latent features of the mass spectrometry data are learned autonomously without relying on an additional data acquisition experiment; the time spent on experiments and analysis is therefore greatly reduced, and better protein identification and quantification results are achieved;

(2) Because a chromatographic peak with a high signal-to-noise ratio indicates that an ion signal has been recorded, whereas a chromatographic peak with a low signal-to-noise ratio is an interfering signal generated by background ions, the invention filters the chromatographic peaks in the data preprocessing stage and retains the ions with a high signal-to-noise ratio, which both removes noise and reduces the amount of data to be analyzed.

The foregoing description is only an overview of the present invention, and is intended to provide a more clear understanding of the technical means of the present invention, so that it may be carried out in accordance with the teachings of the present specification, and to provide a more complete understanding of the above and other objects, features and advantages of the present invention, as exemplified by the following detailed description.

The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of the specific embodiments of the present invention when taken in conjunction with the accompanying drawings.

Drawings

FIG. 1 is a flow chart of a method for analyzing deep learning-based protein mass spectrometry data according to an embodiment of the present invention;

FIG. 2 is a structural flow chart of a method for analyzing deep learning-based protein mass spectrometry data according to an embodiment of the present invention; wherein a, b, c and d correspond to steps S102, S103, S104 and S105 in FIG. 1, respectively;

FIG. 3 is a schematic diagram of a deep learning model according to an embodiment of the present invention; wherein a shows the structure of the deep variational autoencoder (VAE) and the processing of the k-means clustering algorithm. The input data contain three parts: the XIC of the anchor fragment ion, the XIC of the positive fragment ion and the XIC of the negative fragment ion; the anchor and positive fragment ions come from the same parent ion, whereas the anchor and negative fragment ions belong to different parent ions (peptides). The triplet data are fed to a four-branch encoder network consisting of 1-2-2-1 fully connected (FC) layers; the output vectors of the four branch networks are concatenated end to end, and the encoder network outputs two vectors of equal size, one representing the standard deviation (sigma) and one representing the mean (mu), the mean vector representing the latent features of the input data. After training with the triplet loss, the anchor fragment ion lies closer to the positive fragment ion than to the negative fragment ion; b shows the peptide index algorithm (PIndex): the left part shows the protein database digested into sets Sk, where k is the unique index of a peptide, and the right part shows the parent ion and fragment ion queries, in which the peptide index is queried by the m/z values of the parent ions and of the fragment ions, respectively;

FIG. 4 is a Venn diagram of the peptides and proteins identified from the mouse L929 dataset according to an embodiment of the present invention; solid and dashed lines show the identification and quantification results, respectively, and the three circles represent the results of the software DIA-Umpire, the software of the present application, and the software Spectronaut, respectively;

FIG. 5 is a Venn diagram of the synthetic peptides (SGS peptides) found in the SGS human dataset according to an embodiment of the present invention, showing the 1-fold, 64-fold, 128-fold, 256-fold and 512-fold dilution steps; the three sections represent the numbers of peptides found by the software Spectronaut, the software DIA-Umpire and the software of the present application, respectively;

FIG. 6 shows the results of analyzing the HYE protein dataset acquired with 64 variable windows; wherein a compares the numbers of identified and quantified peptides and proteins obtained from the HYE dataset (samples A and B) by the software of the present application, DIA-Umpire and Spectronaut, where "Intersection" denotes the overlap between the software of the present application and DIA-Umpire or between the software of the present application and Spectronaut, and "only" denotes the results found exclusively by the software of the present application, DIA-Umpire or Spectronaut, respectively; b shows the log2-scale distribution of the intensities of the quantified peptides found in the HYE124 dataset (samples A and B together), where peptides identified by both the software of the present application and DIA-Umpire are shown as the intersection, and peptides reported only by Dear-DIA or only by DIA-Umpire are shown as "Dear-DIA only" and "DIA-Umpire only", respectively; c shows the protein composition found by the software of the present application in the sample A dataset and the sample B dataset, where Human, Yeast and E.coli denote human, yeast and Escherichia coli, respectively; the dashed boxes show the ground-truth protein composition, i.e. the prescribed mixing proportions, and the solid boxes show the protein composition found by the software of the present application; d shows the numbers of proteins found by the software of the present application, DIA-Umpire and Spectronaut in the sample A and sample B datasets, where Dear-DIA, DIA-Umpire and Spectronaut denote the quantification results of the software of the present application, DIA-Umpire and Spectronaut, respectively;

FIG. 7 is a block diagram of a deep learning-based analysis system for protein mass spectrometry data according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.

Referring to FIG. 1, the method for analyzing protein mass spectrometry data based on deep learning according to the present embodiment includes:

S101, obtaining DIA protein data of a sample;

Specifically, the embodiments of the present invention use highly complex sample datasets for benchmarking, including the L929 mouse dataset, the SWATH-MS Gold Standard (SGS) human dataset, and the HYE dataset acquired with 64 variable windows (instrument: AB Sciex TripleTOF 5600). Pseudo-tandem spectra are generated with the model of the present application and with the software DIA-Umpire, respectively, and the search software Comet and X!Tandem are then used to search the tandem spectra and identify peptides and proteins. All identified peptides and proteins are filtered at a 1% false discovery rate (FDR) at the protein level, calculated by the software MAYU, to build a library, and the peptides and proteins in the libraries of the present invention and of DIA-Umpire are then quantified with OpenSWATH. In addition, the benchmark datasets are also analyzed with the software Spectronaut, with a 1% protein-level FDR set to filter peptides and proteins.

Referring to FIG. 2, the specific steps of the method for analyzing protein mass spectrometry data based on deep learning are as follows:

Step one: obtaining samples and datasets, and data preprocessing:

A1. Obtaining the samples and their datasets:

Mouse dataset: obtained from L929 cell lysates and containing three samples, measured in SWATH mode with 100 variable primary mass spectrum (MS1) windows on a TripleTOF 5600 mass spectrometer (AB Sciex).

SGS human dataset: 422 synthetic peptides (SGS peptides) were diluted separately in HeLa cell lysate in 10 dilution steps (from 1-fold to 512-fold dilution), and three sets of DIA data were then acquired with SWATH-MS.

HYE124 dataset: includes two hybrid proteome samples, A and B. Sample A consists of 65% human, 15% yeast and 20% E. coli proteins, while sample B consists of 65% human, 30% yeast and 5% E. coli proteins.
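Because the mixing proportions of samples A and B are known, the expected abundance ratio of each species between the two samples follows directly from the percentages given above. The short Python sketch below computes the expected log2(A/B) ratios; the variable and function names are illustrative only.

```python
import math

# Mixing proportions (percent of total protein) stated for the HYE124 samples.
SAMPLE_A = {"human": 65, "yeast": 15, "ecoli": 20}
SAMPLE_B = {"human": 65, "yeast": 30, "ecoli": 5}

def expected_log2_ratios(a: dict, b: dict) -> dict:
    """Expected log2(A/B) abundance ratio per species."""
    return {species: math.log2(a[species] / b[species]) for species in a}

ratios = expected_log2_ratios(SAMPLE_A, SAMPLE_B)
# ratios == {'human': 0.0, 'yeast': -1.0, 'ecoli': 2.0}
```

Human proteins are thus expected to be unchanged between the samples, yeast proteins to be twice as abundant in sample B, and E. coli proteins to be four times as abundant in sample A, which is the ground truth against which quantification results on this dataset can be checked.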

A2. Data preprocessing:

Mass spectrometry data are essentially three-dimensional: mass-to-charge ratio (m/z), retention time, and intensity. The raw data read from the mass spectrometer contain noise, and each individual file is very large, so simple preprocessing, such as denoising and defining a finer processing unit, is necessary before deep learning. In a SWATH experiment, the mixed protein sample is enzymatically digested, separated preliminarily on a chromatographic column, and then enters the mass spectrometer sequentially. The signal of a peptide therefore forms a chromatographic peak along the retention time dimension, and chromatographic peaks are an important criterion for determining whether a peptide is present. The actual chromatographic peak of a peptide ion is roughly Gaussian, but it is generally asymmetric and may show interfering peaks at its beginning or end. In general, a chromatographic peak with a high signal-to-noise ratio indicates that an ion signal has been recorded, whereas a chromatographic peak with a low signal-to-noise ratio is an interfering signal generated by background ions. In the data preprocessing stage, the chromatographic peaks are filtered and the ions with a high signal-to-noise ratio are retained. The specific steps are as follows:

1. Each MS1 isolation window is first divided into fixed-width sliders along the retention time dimension. Each slider is regarded as the minimum processing unit and contains a series of MS1 spectra and the corresponding MS2 spectra.

2. Background ions with a low signal-to-noise ratio (SNR) in the slider are deleted by a peak-finding algorithm and a deisotoping algorithm, and candidate parent ions and fragment ions are determined.

3. Because a triplet loss is introduced into the model, triplet data must be generated to train the neural network. The training data are derived from OpenSWATH quantification results. Six high-intensity extracted ion chromatograms (XICs) of the quantified peptides are stored. The XICs of two fragments are then randomly selected from the same peptide as the anchor data and the positive data, and the negative data are randomly selected from a different peptide. Finally, the anchor data, the positive data, and the negative data are combined into triplet data.
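The exact form of the triplet loss is not given in this description. A commonly used formulation, shown here only as an assumed example, penalizes a triplet whenever the anchor embedding is not closer to the positive embedding than to the negative embedding by at least a margin: loss = max(d(a, p) - d(a, n) + margin, 0), with d the squared Euclidean distance. The function name and the margin of 1.0 are hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Mean triplet loss over a batch of embeddings of shape (batch, dim).

    Assumed standard form: max(d(a, p) - d(a, n) + margin, 0),
    where d is the squared Euclidean distance.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

The loss is zero once every anchor is sufficiently closer to its positive sample than to its negative sample, which is the geometric property that the subsequent k-means clustering in Euclidean space relies on.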

Step two: building the model:

Referring to FIG. 3, the deep learning model network is divided into four main parts: an encoder module, a decoder module, a cluster analysis module and a PIndex index query module. The decoder correspondingly reverses each operation of the encoder. The network captures features at different levels and integrates the learned shallow and deep features, and the decoder reconstructs the latent features to restore the data as faithfully as possible, which makes the network reliable. The latent features output by the encoder are fed into the cluster analysis module for clustering, and the clustering results are then used by the PIndex index algorithm to look up peptides and obtain the correspondence between parent ions and fragment ions.

B. An encoder module:

The encoder comprises four branch networks. The first branch network is a single fully connected layer with 384 neurons, followed by a dropout layer with a dropout rate of 0.2; the second branch network comprises two fully connected layers, the first with 192 neurons and the second with 384 neurons, the second being followed by a dropout layer with a dropout rate of 0.2; the third branch network comprises two fully connected layers, the first with 48 neurons and the second with 128 neurons, the second being followed by a dropout layer with a dropout rate of 0.2; the fourth branch network is a single fully connected layer with 128 neurons, followed by a dropout layer with a dropout rate of 0.2.

The output vectors of the four branch networks are concatenated end to end, and the concatenated vector is split into two vectors of equal dimension, one representing the standard deviation (sigma) and the other representing the mean (mu), as shown in FIG. 3a.
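The layer widths above imply a concatenated vector of 384 + 384 + 128 + 128 = 1024 values, which splits into two 512-dimensional vectors. The NumPy sketch below reproduces only this forward pass with randomly initialized weights. Several points are assumptions rather than disclosed facts: that every branch receives the same input, the ReLU activation, the input length of 32, the assignment of the first half to the mean and the second half to the standard deviation, and the omission of dropout (as at inference time). The sampling step uses the standard VAE reparameterization.

```python
import numpy as np

# Widths of the fully connected layers in the four encoder branches.
BRANCH_WIDTHS = [(384,), (192, 384), (48, 128), (128,)]

def init_encoder(input_dim: int, rng: np.random.Generator):
    """Random weights for each branch (illustration only, not trained)."""
    params = []
    for widths in BRANCH_WIDTHS:
        layers, fan_in = [], input_dim
        for width in widths:
            w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, width))
            layers.append((w, np.zeros(width)))
            fan_in = width
        params.append(layers)
    return params

def encode(x: np.ndarray, params):
    """Forward pass: four branches -> concatenate -> split into (mu, sigma)."""
    outputs = []
    for layers in params:
        h = x
        for w, b in layers:
            h = np.maximum(h @ w + b, 0.0)  # ReLU is an assumption
        outputs.append(h)
    concat = np.concatenate(outputs, axis=-1)   # shape (batch, 1024)
    mu, sigma = np.split(concat, 2, axis=-1)    # two (batch, 512) halves
    return mu, sigma

def sample_latent(mu, sigma, rng: np.random.Generator):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)
```

For a batch of five XICs of length 32, `encode` returns a mean and a standard deviation array of shape (5, 512) each; the mean vectors are the latent features passed on to clustering.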

C. A decoder module:

The decoder comprises four branch networks. The first branch network is a single fully connected layer with 128 neurons, followed by a dropout layer with a dropout rate of 0.2; the second branch network comprises two fully connected layers, the first with 384 neurons and the second with 192 neurons, the second being followed by a dropout layer with a dropout rate of 0.2; the third branch network comprises two fully connected layers, the first with 128 neurons and the second with 48 neurons, the second being followed by a dropout layer with a dropout rate of 0.2; the fourth branch network is a single fully connected layer with 384 neurons, followed by a dropout layer with a dropout rate of 0.2.

D. Cluster analysis module:

The cluster analysis module applies a k-means clustering algorithm based on the Euclidean distance metric: the latent features of the fragment ion chromatographic peaks, extracted by the encoder network of the VAE and mapped into Euclidean space, are assigned to k classes in the feature space, as shown in FIG. 3a.
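A minimal sketch of k-means clustering (Lloyd's algorithm) on the latent vectors is given below for illustration. The value of k, the random initialization from the data points and the iteration limit are assumptions; the invention does not specify how k is chosen.

```python
import numpy as np

def _assign(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Index of the nearest center (Euclidean distance) for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k: int, n_iter: int = 100, seed: int = 0):
    """Cluster latent vectors into k classes; returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = _assign(points, centers)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)  # an empty cluster keeps its previous center
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return _assign(points, centers), centers
```

Each resulting class is a candidate fragment ion cluster, that is, a group of fragment ions whose chromatographic profiles are similar in the latent space and which are therefore presumed to come from the same parent ion.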

E. PIndex index query module:

Ideally, the fragment ions in the same class come from the same parent ion. The index query module applies a peptide index algorithm named PIndex, a small-window search algorithm developed on the principle of the inverted index. PIndex comprises two parts: a digestion algorithm and a query algorithm. The digestion algorithm cleaves the proteins in the protein database into theoretical peptides according to the cleavage sites of the protease; the theoretical peptides form a theoretical peptide database that contains several columns of peptide information, and each peptide corresponds to a unique peptide index. The theoretical peptide database output by the digestion algorithm is used to determine the parent ion corresponding to each fragment ion cluster. As shown in FIG. 3b, the query algorithm first builds an inverted index table, which represents the intersection of two sets of peptide indexes: the index set obtained by the parent ion query and the index set obtained by the fragment ion query. The inverted query comprises two parts: one maps parent ions to peptide indexes by their mass-to-charge ratios, and the other maps fragment ions to peptide indexes by querying with the mass-to-charge ratios of the fragment ions. The module calculates the scores of the peptides, ranks all theoretical peptides by score according to the binary table, and then calculates the similarity between the fragment ions of the cluster that match the highest-scoring theoretical peptide. If the similarity exceeds a specified threshold, the fragment ions in the cluster and the corresponding parent ion are grouped together and stored as a tandem spectrum.
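The query part of PIndex can be illustrated with a simple inverted index that maps binned m/z values to sets of peptide indexes. The sketch below is not the disclosed implementation: the bin width of 0.05, the lookup of neighbouring bins, the score (the number of fragment ions of the cluster that match a candidate peptide) and all function names are assumptions, and the binary table representation is not reproduced.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Index = Dict[int, Set[int]]

def build_inverted_index(mz_by_peptide: Dict[int, Iterable[float]],
                         bin_width: float = 0.05) -> Index:
    """Map each m/z bin to the set of peptide indexes having an ion in that bin."""
    index: Index = defaultdict(set)
    for peptide_idx, mz_values in mz_by_peptide.items():
        for mz in mz_values:
            index[int(mz // bin_width)].add(peptide_idx)
    return index

def query(index: Index, mz: float, bin_width: float = 0.05) -> Set[int]:
    """Peptide indexes whose ions fall in the bin of `mz` or an adjacent bin."""
    b = int(mz // bin_width)
    hits: Set[int] = set()
    for neighbour in (b - 1, b, b + 1):  # tolerate values near a bin edge
        hits |= index.get(neighbour, set())
    return hits

def rank_candidates(parent_mz: float, fragment_mzs: Iterable[float],
                    parent_index: Index, fragment_index: Index,
                    bin_width: float = 0.05) -> List[Tuple[int, int]]:
    """Intersect the parent ion and fragment ion queries and rank peptides by score."""
    candidates = query(parent_index, parent_mz, bin_width)
    scores: Dict[int, int] = {}
    for mz in fragment_mzs:
        for pep in query(fragment_index, mz, bin_width) & candidates:
            scores[pep] = scores.get(pep, 0) + 1  # one point per matched fragment
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

The highest-ranked peptide of such a list is the candidate whose matched fragment ions would then be subjected to the similarity check against the specified threshold.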

The test performance on the above example datasets is as follows:

During identification on the mouse L929 mass spectrometry dataset, the present invention found 9332 peptides and 2681 proteins, and the software DIA-Umpire reported 7059 peptides and 1981 proteins. The present invention and DIA-Umpire share 5961 peptides and 1862 proteins. In quantification, the present invention, DIA-Umpire and Spectronaut found 8999, 6780 and 8594 peptides and 2427, 1761 and 2183 proteins, respectively. The identifications of the present invention cover 84% of the peptides and 93% of the proteins identified by DIA-Umpire, and 75% of the peptides and 88% of the proteins reported by Spectronaut. This coverage shows good reproducibility between the present invention, DIA-Umpire and Spectronaut (FIG. 4).

For the SGS human dataset, in which 422 synthetic peptides (SGS peptides) were diluted separately in HeLa cell lysate in 10 dilution steps (from 1-fold to 512-fold dilution) and three sets of DIA data were acquired with SWATH-MS, the present invention shows a sensitivity similar to that of DIA-Umpire and Spectronaut at the 1-fold dilution step. For dilution steps beyond 64-fold, however, the sensitivity of the present invention is higher than that of DIA-Umpire and Spectronaut (FIG. 5). Overall, across all 8 dilution steps, the present invention reported 36% more identified SGS peptides than DIA-Umpire (135 versus 99). Notably, the present invention still found some SGS peptides in the data at 256-fold and 512-fold dilution, whereas DIA-Umpire and Spectronaut failed to find any (FIG. 5).

The performance of the model of the present application and of the software DIA-Umpire was compared using the HYE dataset, which was specifically designed to verify the performance of DIA algorithms. The HYE124 dataset includes two hybrid proteome samples, A and B. Sample A consists of 65% human, 15% yeast and 20% E. coli protein, while sample B consists of 65% human, 30% yeast and 5% E. coli protein. With the two samples of the HYE124 dataset combined, the model of the present application and DIA-Umpire identified 25693 and 18872 peptide fragments in total, and 4554 and 3613 proteins in total, respectively; the identifications of the model of the present application cover 95% of the proteins and 88% of the peptide fragments identified by DIA-Umpire (fig. 6a). These results show that the model of the present application reproduces the results of DIA-Umpire well. Furthermore, the number of unique proteins found by the model of the present application is 6.3 times that found by DIA-Umpire (i.e., 1118 versus 177 proteins), and the number of unique peptide fragments is 4.1 times that found by DIA-Umpire (i.e., 8996 versus 2175 peptide fragments). The model of the present application is also superior to DIA-Umpire in terms of quantification (fig. 6a), and it finds a large number of proteins and peptide fragments that DIA-Umpire misses in both identification and quantification.
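The HYE124 figures above can be cross-checked with simple arithmetic. The sketch below assumes that "unique" means identified by one tool but not the other, so that subtracting the unique count from each total gives the shared count; the function name and variable names are illustrative only.

```python
def overlap_summary(total_a, unique_a, total_b, unique_b):
    """Return (shared count, fraction of B covered by A, ratio of unique counts)."""
    shared = total_a - unique_a
    # Both tools must agree on the size of the overlap.
    assert shared == total_b - unique_b, "inconsistent totals"
    return shared, shared / total_b, unique_a / unique_b

# A = model of the present application, B = DIA-Umpire (counts as reported above).
proteins = overlap_summary(4554, 1118, 3613, 177)
peptides = overlap_summary(25693, 8996, 18872, 2175)

for name, (shared, coverage, ratio) in (("proteins", proteins), ("peptides", peptides)):
    print(f"{name}: shared={shared}, coverage={coverage:.1%}, unique ratio={ratio:.1f}")
```

Under this assumption the reported totals are mutually consistent: both tools imply the same number of shared proteins and the same number of shared peptide fragments, and the coverage and fold values agree with the percentages and ratios stated in the text.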

In addition, the software Spectronaut found 23924 quantified peptide fragments and 3294 quantified proteins. The identifications of the model of the present application cover 92% of the proteins and 78% of the peptide fragments identified by Spectronaut, and the model of the present application found 3.7 times as many uniquely quantified proteins as Spectronaut (i.e., 957 versus 262 proteins). Low-abundance proteins and peptide fragments are difficult for mass spectrometry algorithms to identify because of interference from background noise. However, the model of the present application, Dear-DIA-XMBD, performs better than the software DIA-Umpire on this problem, because the intensity distribution of the proteins and peptide fragments it quantifies extends further into the low-intensity range (fig. 6b).

Sample A and sample B of the HYE124 dataset were also analyzed separately. For sample A, the proteins identified by the model of the present application were 64.2% human, 17.1% yeast and 18.7% E. coli. For sample B, the proteins identified by the model of the present application were 64.6% human, 27.2% yeast and 8.2% E. coli (fig. 6c). The model of the present application found more human, yeast and E. coli proteins than DIA-Umpire and Spectronaut (fig. 6d).

It should be noted that, in the deep learning model network structure of the present invention, the branch network part may be replaced by a recurrent neural network (RNN), a long short-term memory network (LSTM), a Transformer attention network or a convolutional network, which is not particularly limited in the present invention.

Further, although the encoder neural network of the variational autoencoder in the above example includes 4 branch sub-networks, in practical application the number of branches and the number of layers in each branch sub-network may be any number; more difficult classification tasks tend to call for deeper networks, and easier tasks for shallower ones.

Further, the parameters of the encoder network, such as the number of neurons in the fully connected layers, the size and number of convolution kernels, the learning rate and the optimizer, can be set to appropriate values according to the specific situation.
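The branch structure described above can be pictured with a small sketch in plain Python. It is only an illustration under stated assumptions: the layer sizes, the random weights, the ReLU activation, feeding the same input to every branch, and the order in which the concatenated vector is split into standard deviation and mean are all hypothetical. Only the overall shape (four fully connected branches, end-to-end concatenation, a split into two vectors of equal dimension) follows the description.

```python
import random

def dense(inputs, weights, biases):
    """One fully connected layer with ReLU activation (activation is an assumption)."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(rng, n_in, n_out):
    """Hypothetical untrained layer: random weights, zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def encode(xic, branches, rng):
    """Run every branch, concatenate the outputs, split, and sample an embedding."""
    joined = []
    for weights, biases in branches:
        joined.extend(dense(xic, weights, biases))
    half = len(joined) // 2
    # Assumed split order: first half standard deviation, second half mean.
    std, mean = joined[:half], joined[half:]
    # Reparameterization: z = mean + std * eps, with eps drawn from N(0, 1).
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

rng = random.Random(0)
branches = [random_layer(rng, n_in=8, n_out=4) for _ in range(4)]  # four branches
xic = [0.0, 0.1, 0.4, 0.9, 1.0, 0.6, 0.2, 0.0]  # hypothetical normalized chromatogram
z = encode(xic, branches, rng)
print(len(z))
```

With four branches of four outputs each, the concatenated vector has 16 entries and the embedding has 8. Changing the number of branches or the layer widths only changes these dimensions, which is why the description leaves them open.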

The invention adopts a variational autoencoder network structure and can autonomously learn the latent features of mass spectrometry data without relying on data-dependent acquisition experiments. Therefore, the time consumed by experiments and analysis can be greatly reduced, and better protein identification and quantification results can be achieved.

Referring to fig. 7, an analysis system for deep learning-based protein mass spectrometry data, comprising:

a data acquisition module 701 for acquiring DIA protein data of a sample;

a preprocessing module 702, configured to take, based on the DIA protein data, a slider moving in a specific step along the retention time dimension as a minimum processing unit, delete background ions with a low signal-to-noise ratio in the slider, and determine candidate parent ions and candidate fragment ions;

an encoder module 703, configured to feed the extracted ion chromatograms of the candidate fragment ions into the encoder neural network of the variational autoencoder, which embeds them into Euclidean space;

a cluster analysis module 704 for dividing the output of the encoder module into k classes with a k-means classification algorithm;

a PIndex index lookup module 705, configured to combine each fragment ion cluster with a corresponding parent ion based on a protein database to generate parent ion-fragment ion pairs, and to judge the parent ion-fragment ion pairs again by calculating the similarity between fragment ion pairs matched with the theoretical spectrum, wherein the parent ion-fragment ion pairs whose similarity exceeds a preset threshold are stored as pseudo-tandem spectra.

The analysis system based on the deep learning protein mass spectrum data further comprises:

The specific implementation of each module is the same as in the deep learning-based protein mass spectrometry data analysis method and is not repeated here.
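As an illustration of the step performed by the cluster analysis module, the following sketch groups embedded vectors into k classes with a plain k-means loop (Lloyd's algorithm). The function names, the fixed iteration count, the random initialization and the two-dimensional toy embeddings are assumptions for demonstration and are not taken from the invention.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20, seed=0):
    """Assign each point to one of k clusters; returns one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical embeddings of six fragment ions forming two well-separated groups.
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels = k_means(embeddings, k=2)
print(labels)
```

Each resulting cluster corresponds to a group of fragment ions that would then be passed on for matching against a parent ion.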

A computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the deep learning-based protein mass spectrometry data analysis method when executing the computer program.

The foregoing is merely illustrative of specific embodiments of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification of the present invention by using the design concept shall fall within the scope of the present invention.

Claims

1. A method for analyzing protein mass spectrometry data based on deep learning, comprising:

S101, obtaining DIA protein data of a sample;

S105, judging the parent ion-fragment ion pairs again by calculating the similarity between fragment ion pairs matched with the theoretical spectrum, and storing the parent ion-fragment ion pairs with the similarity exceeding a preset threshold as a pseudo-tandem spectrum;

the encoder neural network of the variational autoencoder comprises four branch networks, each branch network consists of fully connected layers comprising a plurality of neurons, and the fully connected layers comprise one or more layers; the output vectors of the four branch networks are concatenated end to end, and the concatenated vector is split into two vectors of equal dimension, wherein one vector represents the standard deviation and the other vector represents the mean.

2. The method for analyzing protein mass spectrometry data based on deep learning according to claim 1, wherein taking the slider moving in a specific step along the retention time dimension as a minimum processing unit specifically comprises:

3. The method for analyzing protein mass spectrometry data based on deep learning according to claim 1, wherein the deletion of background ions with a low signal-to-noise ratio in the slider specifically comprises:

4. The method for analyzing protein mass spectrometry data based on deep learning according to claim 1, further comprising, prior to S103:

storing the extraction chromatograms of the six quantitative peptide fragments;

5. The method for analyzing protein mass spectrometry data based on deep learning according to claim 1, wherein S104 specifically comprises:

dividing proteins in a protein database into theoretical peptide fragments according to the enzyme cleavage sites of a protease, wherein the theoretical peptide fragments form a theoretical peptide fragment database, and determining the parent ion corresponding to each fragment ion cluster according to the theoretical peptide fragment database so as to generate parent ion-fragment ion pairs; wherein the theoretical peptide fragment database comprises a series of peptide fragment information, and each peptide fragment corresponds to a unique peptide fragment index.

6. The method for analyzing protein mass spectrometry data based on deep learning according to claim 5, wherein S105 specifically comprises:

7. An analysis system for protein mass spectrometry data based on deep learning, comprising:

the PIndex index query module, which combines each fragment ion cluster with a corresponding parent ion based on a protein database to generate parent ion-fragment ion pairs, and judges the parent ion-fragment ion pairs again by calculating the similarity between fragment ion pairs matched with the theoretical spectrum, wherein the parent ion-fragment ion pairs with the similarity exceeding a preset threshold are stored as pseudo-tandem spectra;

8. The deep learning based protein mass spectrometry data analysis system of claim 7, further comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 6 when executing the computer program.