CN113362899A

CN113362899A - Deep learning-based protein mass spectrum data analysis method and system

Info

Publication number: CN113362899A
Application number: CN202110425032.6A
Authority: CN
Inventors: 何情祖; 郭欢; 帅建伟; 韩家淮
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2021-09-07
Anticipated expiration: 2041-04-20
Also published as: CN113362899B

Abstract

The invention discloses a method and a system for analyzing protein mass spectrometry data based on deep learning. The method comprises the following steps: obtaining DIA protein data of a sample; based on the DIA protein data, taking a slider that moves in a specific step along the retention time dimension as the minimum processing unit, deleting background ions with a low signal-to-noise ratio in the slider, and determining candidate parent ions and candidate daughter ions; inputting the extracted ion chromatograms of the candidate daughter ions into the encoder network of a variational autoencoder, embedding them into Euclidean space, and then dividing them into k classes using the k-means clustering algorithm; combining each fragment daughter ion cluster with the corresponding parent ion based on a protein database to generate parent ion-fragment daughter ion pairs; and re-evaluating the parent ion-fragment daughter ion pairs by calculating the similarity between the fragment daughter ions matched to the theoretical spectrum, and storing the pairs whose similarity exceeds a preset threshold as a pseudo-tandem spectrum. The invention can increase the number of identified peptide fragments and proteins.

Description

Deep learning-based protein mass spectrum data analysis method and system

Technical Field

The invention relates to the field of protein analysis in proteomics, in particular to a method and a system for analyzing protein mass spectrum data based on deep learning.

Background

Mass spectrometry has long been the dominant technique for identifying and quantifying peptide fragments and proteins. The typical identification strategy combines the Data Dependent Acquisition (DDA) mode with a search engine. In DDA mode, data are collected in tandem: when the primary mass spectrum (MS1) is scanned, only the k most intense peptide fragment ions are selected and fragmented, and the resulting fragment ions form a tandem mass spectrum (secondary mass spectrum, MS2). The search engine matches each MS2 spectrum against the theoretical spectra of peptide fragments to identify the peptide fragment corresponding to that MS2 spectrum. However, because the peptide fragment ions ranked in the top k by intensity vary randomly across repeated DDA experiments, the peptides identified by the DDA method are poorly reproducible. To overcome the limitations of the DDA mode, the Data Independent Acquisition (DIA) strategy has emerged. SWATH (Sequential Window Acquisition of all THeoretical fragment ion spectra) is a widely used DIA mode in which the mass spectrometer fragments all peptide fragment ions within each MS1 isolation window and acquires the signals of all fragment ions. As a result, the fragment signals of many peptides are mixed in the corresponding secondary mass spectrum (MS2); that is, the daughter ions of one secondary mass spectrum derive from multiple parent ions, and the fragment ions are further interfered with by parent ions that were not fragmented. DIA data are therefore highly complex, and direct analysis is extremely difficult.

Current methods for analyzing DIA data in proteomics fall mainly into two groups: protein identification and quantification based on a database-search workflow and software built on statistical tests; and protein identification and quantification using deep learning methods.

Unlike traditional protein identification methods, deep learning methods rely on computing power to train on large data sets so that the machine learns the internal rules and characteristics of the data on its own. The main deep learning methods are as follows:

DeepNovo-DIA: proposed by Li Ming et al. in 2017 [Tran N H, Zhang X, Xin L, et al. De novo peptide sequencing by deep learning [J]. Proceedings of the National Academy of Sciences of the United States of America, 2017, 114(31): 201705691.], DeepNovo-DIA combines de novo peptide sequencing with deep learning to identify the amino acid sequence of a peptide fragment directly from the DIA spectrum.

DIA-NN: in 2019, the Department of Biochemistry of the University of Cambridge, the Miner therapeutics research institute and others jointly proposed DIA-NN [Demichev V, Messner C B, Vernardis S I, et al. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput [J]. Nature Methods, 2020, 17(1): 41-44.], a convenient integrated software package that uses deep neural networks together with new quantification and signal correction strategies to process DIA proteomics data. DIA-NN improves both the qualitative and the quantitative capability of conventional DIA proteomics, is fast, especially in high-throughput applications, and achieves accurate deep proteome coverage when used with fast chromatographic methods.

DeepDIA: a DIA proteome analysis method [Yang Y, Liu X, Shen C, et al. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics [J]. Nature Communications, 11.]. Deep neural network models based on convolutional and recurrent neural networks are designed to predict the secondary mass spectra (MS/MS) and normalized retention times (iRT) of peptide fragments, and the spectral library required for DIA analysis is generated from the list of peptide fragments identified by DDA.

Many deep learning-based methods are now widely applied to protein data analysis, but the software developed so far relies on supervised learning, so its generalization capability is limited. Many problems in protein DIA data analysis remain to be solved. It is therefore important to provide an unsupervised deep learning model for DIA data analysis, based on the k-means clustering algorithm and a variational autoencoder, to achieve accurate identification and quantification of proteins.

Disclosure of Invention

The invention mainly aims to provide a method and a system for analyzing protein mass spectrum data based on deep learning, which are used for increasing the number of identified peptide fragments and proteins so as to obtain a more accurate protein identification result.

The invention adopts the following technical scheme:

in a first aspect, the present invention provides a method for analyzing protein mass spectrometry data based on deep learning, including:

S101, obtaining DIA protein data of a sample;

S102, based on the DIA protein data, taking a slider that moves in a specific step along the retention time dimension as the minimum processing unit, deleting background ions with a low signal-to-noise ratio in the slider, and determining candidate parent ions and candidate daughter ions;

S103, inputting the extracted ion chromatograms of the candidate daughter ions into the encoder network of a variational autoencoder, embedding them into Euclidean space, and then dividing them into k classes using the k-means clustering algorithm;

S104, combining each fragment daughter ion cluster with the corresponding parent ion based on a protein database to generate parent ion-fragment daughter ion pairs;

and S105, re-evaluating the parent ion-fragment daughter ion pairs by calculating the similarity between the fragment daughter ions matched to the theoretical spectrum, and storing the pairs whose similarity exceeds a preset threshold as a pseudo-tandem spectrum.

Preferably, taking the slider that moves in a specific step along the retention time dimension as the minimum processing unit specifically includes:

dividing the retention time dimension of each MS1 isolation window into fixed-width sliders; each slider is regarded as the minimum processing unit and contains a series of MS1 spectra and the corresponding MS2 spectra.
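The slider splitting can be illustrated with a minimal Python sketch. The text only states that the slider has a fixed width and moves in a specific step along the retention (dwell) time dimension; the parameter names `width` and `step`, the choice to count both in acquisition cycles, and the shortened final slider are assumptions made here for illustration.

```python
from typing import List, Tuple


def split_into_sliders(rt_values: List[float], width: int, step: int) -> List[Tuple[int, int]]:
    """Return (start, end) cycle-index ranges of fixed-width sliders.

    rt_values: retention times of consecutive cycles in one MS1 isolation window.
    width:     number of cycles per slider (hypothetical parameter).
    step:      number of cycles the slider advances per move (hypothetical parameter).
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    n = len(rt_values)
    sliders = []
    start = 0
    while start < n:
        end = min(start + width, n)  # the last slider may be shorter
        sliders.append((start, end))
        if end == n:
            break
        start += step
    return sliders


rts = [0.5 * i for i in range(10)]  # toy retention times in minutes
for start, end in split_into_sliders(rts, width=4, step=2):
    print(f"cycles {start}..{end - 1}, RT {rts[start]:.1f}-{rts[end - 1]:.1f}")
```

With `step` smaller than `width`, adjacent sliders overlap, so a chromatographic peak that straddles one slider boundary falls entirely inside a neighbouring slider.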

Preferably, deleting the background ions with a low signal-to-noise ratio in the slider specifically includes:

deleting the background ions with a low signal-to-noise ratio in the slider using signal-to-noise-ratio-related algorithms, which include a peak finding algorithm and a deisotoping algorithm.
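The peak finding and deisotoping algorithms themselves are not specified in the text. As a rough illustration of the filtering idea only, the following sketch keeps an ion when the apex of its chromatogram stands well above a crude noise level. The noise estimate (median of the trace) and the `min_snr` threshold are assumptions, and deisotoping is not shown.

```python
from statistics import median
from typing import Dict, List


def signal_to_noise(xic: List[float]) -> float:
    """Apex intensity divided by a crude noise level (median of the trace)."""
    noise = median(xic)
    apex = max(xic)
    if noise <= 0:
        return float("inf") if apex > 0 else 0.0
    return apex / noise


def keep_high_snr(xics: Dict[float, List[float]], min_snr: float) -> Dict[float, List[float]]:
    """Keep only the ions (keyed by m/z) whose chromatogram exceeds min_snr."""
    return {mz: trace for mz, trace in xics.items() if signal_to_noise(trace) >= min_snr}


# Toy slider content: m/z -> intensities over consecutive cycles.
slider_xics = {
    500.27: [10, 12, 80, 300, 85, 11, 9],   # clear chromatographic peak
    612.33: [10, 11, 12, 10, 11, 12, 10],   # flat background
}
kept = keep_high_snr(slider_xics, min_snr=5.0)
print(sorted(kept))
```

Only the ion with a clear peak survives the filter, which removes noise and reduces the amount of data passed to the later steps.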

Preferably, before S103, the method further includes:

generating triplet data to train the encoder network of the variational autoencoder; the triplet data are obtained as follows:

storing six extracted ion chromatograms (XICs) of each quantified peptide fragment;

randomly selecting two fragment XICs from the same peptide fragment as the anchor sample and the positive sample; randomly selecting the negative sample from a different peptide fragment;

combining the anchor sample, the positive sample and the negative sample into triplet data.
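The sampling steps above can be sketched as follows. The dictionary layout (peptide sequence mapped to its stored fragment XICs) and the function name `sample_triplet` are hypothetical and chosen only for illustration; the toy intensities are not real data.

```python
import random
from typing import Dict, List, Tuple

XIC = List[float]


def sample_triplet(peptide_xics: Dict[str, List[XIC]], rng: random.Random) -> Tuple[XIC, XIC, XIC]:
    """Draw one (anchor, positive, negative) triplet of fragment XICs."""
    # The anchor/positive peptide needs at least two stored XICs.
    eligible = [pep for pep, xics in peptide_xics.items() if len(xics) >= 2]
    if not eligible:
        raise ValueError("no peptide has two or more XICs")
    source = rng.choice(eligible)
    anchor, positive = rng.sample(peptide_xics[source], 2)
    # The negative comes from any other peptide that has at least one XIC.
    others = [pep for pep, xics in peptide_xics.items() if pep != source and xics]
    if not others:
        raise ValueError("no other peptide available for the negative sample")
    negative = rng.choice(peptide_xics[rng.choice(others)])
    return anchor, positive, negative


toy = {
    "PEPTIDEA": [[1.0, 5.0, 1.0], [2.0, 9.0, 2.0], [1.0, 4.0, 1.0]],
    "PEPTIDEB": [[7.0, 1.0, 0.0], [8.0, 2.0, 0.0]],
}
anchor, positive, negative = sample_triplet(toy, random.Random(0))
print(anchor, positive, negative)
```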

Preferably, the encoder network of the variational autoencoder comprises four branch networks, each of which consists of one or more fully-connected layers containing a plurality of neurons; the output vectors of the four branch networks are concatenated at the end, and the concatenated vector is split into two vectors of equal dimension, one representing the standard deviation and the other representing the mean.

Preferably, the S104 specifically includes:

dividing the proteins in the protein database into theoretical peptide fragments according to the cleavage sites of the protease, the theoretical peptide fragments forming a theoretical peptide fragment database, and determining the parent ion corresponding to each fragment daughter ion cluster according to the theoretical peptide fragment database to generate parent ion-fragment daughter ion pairs; wherein the theoretical peptide fragment database comprises a series of peptide fragment information, and each peptide fragment corresponds to a unique peptide fragment index.
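A minimal sketch of in-silico digestion with a unique index per peptide is given below. The text does not name the protease; the trypsin-like rule used here (cleave after K or R unless the next residue is P), the absence of missed cleavages, and the function names are assumptions for illustration.

```python
from typing import Dict, List


def digest(sequence: str, cleave_after: str = "KR", blocked_by: str = "P") -> List[str]:
    """Split a protein sequence at every cleavage site (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in cleave_after and not is_last and sequence[i + 1] not in blocked_by:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def build_peptide_index(proteins: Dict[str, str]) -> Dict[str, int]:
    """Assign each distinct theoretical peptide a unique integer index."""
    index: Dict[str, int] = {}
    for sequence in proteins.values():
        for peptide in digest(sequence):
            index.setdefault(peptide, len(index))
    return index


print(digest("MKWVTFISLLRPAAK"))
print(build_peptide_index({"P1": "AAKCCR", "P2": "AAKDDK"}))
```

A peptide shared by several proteins (here `AAK`) receives a single index, so later lookups by mass-to-charge ratio resolve to one entry per distinct sequence.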

Preferably, the S105 specifically includes:

establishing an inverted index table in binary form, wherein the inverted index table represents the intersection of two peptide fragment index sets, namely the index set obtained by querying with the parent ion and the index set obtained by querying with the fragment daughter ions;

mapping the parent ions to peptide fragment indexes by mass-to-charge ratio, mapping the fragment daughter ions to peptide fragment indexes, and querying the peptide fragment indexes by the mass-to-charge ratio of the fragment daughter ions;

calculating the scores of the peptide fragments, ranking all theoretical peptide fragments by score according to the inverted index table, and then calculating the similarity between the fragment daughter ions matched to the highest-scoring theoretical peptide fragment; if the similarity exceeds a specified threshold, the fragment daughter ions in the cluster and the corresponding parent ion are grouped into one class and stored as a tandem spectrum.
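The query can be pictured with the following sketch, which is not the PIndex implementation. It assumes that ions are binned by mass-to-charge ratio, that the score of a candidate peptide is the number of observed fragments it matches, and that the m/z values are toy numbers; `BIN_WIDTH` and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

BIN_WIDTH = 0.05  # hypothetical m/z bin width (acts as the search window)


def mz_bin(mz: float) -> int:
    return round(mz / BIN_WIDTH)


def build_index(entries: Iterable[Tuple[int, float]]) -> Dict[int, Set[int]]:
    """Map each m/z bin to the set of peptide indexes that have an ion there."""
    index: Dict[int, Set[int]] = defaultdict(set)
    for peptide_id, mz in entries:
        index[mz_bin(mz)].add(peptide_id)
    return index


def lookup(index: Dict[int, Set[int]], mz: float) -> Set[int]:
    """Peptide indexes found in the bin of mz or in its two neighbouring bins."""
    b = mz_bin(mz)
    return index.get(b - 1, set()) | index.get(b, set()) | index.get(b + 1, set())


def rank(parent_index: Dict[int, Set[int]], fragment_index: Dict[int, Set[int]],
         parent_mz: float, fragment_mzs: List[float]) -> List[Tuple[int, int]]:
    """Return (score, peptide index) pairs, best first."""
    candidates = lookup(parent_index, parent_mz)
    hits = [lookup(fragment_index, mz) for mz in fragment_mzs]
    # Binary table: one row per candidate, one 0/1 entry per observed fragment.
    table = {pid: [int(pid in h) for h in hits] for pid in candidates}
    return sorted(((sum(row), pid) for pid, row in table.items()), reverse=True)


# Toy theoretical database: (peptide index, m/z).
parents = build_index([(0, 500.26), (1, 500.27), (2, 650.40)])
fragments = build_index([(0, 175.12), (0, 288.20), (0, 401.29),
                         (1, 175.12), (1, 302.22), (2, 147.11)])
ranked = rank(parents, fragments, 500.265, [175.121, 288.203, 401.288])
print(ranked)
```

Peptide 0 matches all three observed fragments and ranks first; peptide 1 shares the parent m/z window but matches only one fragment, and peptide 2 is never considered because its parent ion lies outside the window.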

In a second aspect, a system for analyzing protein mass spectrometry data based on deep learning includes:

a data acquisition module for acquiring DIA protein data of a sample;

a preprocessing module, used for taking, based on the DIA protein data, a slider that moves in a specific step along the retention time dimension as the minimum processing unit, deleting background ions with a low signal-to-noise ratio in the slider, and determining candidate parent ions and candidate daughter ions;

an encoder module, used for inputting the extracted ion chromatograms of the candidate daughter ions into the encoder network of the variational autoencoder and embedding them into Euclidean space;

a cluster analysis module, used for dividing the output of the encoder module into k classes using the k-means clustering algorithm;

a PIndex index query module, used for combining each fragment daughter ion cluster with the corresponding parent ion based on a protein database to generate parent ion-fragment daughter ion pairs; and re-evaluating the parent ion-fragment daughter ion pairs by calculating the similarity between the fragment daughter ions matched to the theoretical spectrum, and storing the pairs whose similarity exceeds a preset threshold as a pseudo-tandem spectrum.

Preferably, the system for analyzing protein mass spectrometry data based on deep learning further includes:

a decoder module, used for reconstructing the latent features and restoring the data to the greatest possible extent.

In a third aspect, a computer device comprises a memory and a processor, the memory stores a computer program, and the processor implements the method for analyzing protein mass spectrometry data based on deep learning when executing the computer program.

Compared with the prior art, the invention has the following beneficial effects:

(1) the method omits the preliminary data dependent acquisition (DDA) experiment otherwise needed to build a spectral library for DIA analysis; the deep learning network adopts a variational autoencoder structure combined with the k-means clustering algorithm to detect the correspondence between peptide fragment parent ions and their fragment daughter ions, so that the latent features of the mass spectrometry data are learned automatically without relying on a DDA experiment; therefore, the invention can greatly reduce the time spent on experiments and analysis and can achieve better protein identification and quantification results;

(2) since a chromatographic peak with a high signal-to-noise ratio indicates that an ion signal has been recorded, whereas a chromatographic peak with a low signal-to-noise ratio is an interference signal generated by background ions, the chromatographic peaks are filtered in the data preprocessing stage and only ions with a high signal-to-noise ratio are retained, which both removes noise and reduces the volume of data to be analyzed.

The above description is only an overview of the technical solution of the present invention. To make the technical means of the invention clearer and the description easier to follow, specific embodiments of the invention are described below.

The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.

Drawings

FIG. 1 is a flow chart of a method for analyzing protein mass spectrometry data based on deep learning according to an embodiment of the present invention;

FIG. 2 is a schematic workflow of the method for analyzing protein mass spectrum data based on deep learning according to an embodiment of the present invention; wherein a, b, c and d correspond to steps S102, S103, S104 and S105 in fig. 1, respectively;

FIG. 3 is a schematic diagram of a deep learning model according to an embodiment of the invention; wherein a shows the structure of the deep variational autoencoder (VAE) and the processing by the k-means clustering algorithm. The input data comprise three parts: the anchor fragment ion chromatogram (XIC), the positive-sample fragment ion XIC and the negative-sample fragment ion XIC, wherein the anchor and positive XICs come from the same parent ion, while the anchor and negative XICs belong to different parent peptide fragments. The triplet data are fed to a four-branch encoder network with 1-2-2-1 fully-connected (FC) layers; the output vectors of the four branch networks are concatenated at the end, and the encoder network outputs two equal-sized vectors, one representing the standard deviation (σ) and one representing the mean (μ), the mean vector representing the latent features of the input data. After training with the triplet loss, the anchor ion lies closer to the positive ion than to the negative ion; b shows the peptide indexing algorithm (PIndex): the left part shows the protein database digested into sets Sk, where k is the unique index of a peptide fragment; the right part shows the parent ion and daughter ion queries, in which the peptide fragment index is looked up by the m/z and charge pair of the parent ion and by the m/z and charge pair of the daughter ion, respectively;

FIG. 4 is a Venn diagram of polypeptides and proteins identified from the mouse L929 dataset according to an embodiment of the present invention, the solid and dashed lines showing the identified and quantified results, respectively, and the three circles representing the identification results of the software DIA-Umpire, the software of the present application, and the software Spectronaut, respectively;

FIG. 5 is a Venn diagram of the synthetic peptide fragments (SGS peptide fragments) found in the SGS human dataset according to an embodiment of the present invention; the five Venn diagrams show the 1-fold, 64-fold, 128-fold, 256-fold and 512-fold dilution steps, and the three parts represent the numbers of peptide fragments found by the software Spectronaut, the software DIA-Umpire and the software of the present application, respectively;

FIG. 6 shows the results of analyzing the HYE124 protein dataset acquired with 64 variable windows; wherein a compares the numbers of peptide fragments and proteins identified and quantified from samples A and B of the HYE124 dataset by the software of the present application, DIA-Umpire and Spectronaut, where Intersection denotes the overlap between the software of the present application and DIA-Umpire, or between the software of the present application and Spectronaut, and the remaining parts show the results found only by the software of the present application, DIA-Umpire or Spectronaut, respectively; b shows the log2-scale intensity distributions of the quantified peptide fragments found in samples A and B of the HYE124 dataset: peptide fragments also identified by DIA-Umpire are shown as Intersection, and peptide fragments reported only by Dear-DIA or only by DIA-Umpire are shown as Dear-DIA only and DIA-Umpire only, respectively; c shows the protein composition (Human, Yeast and E. coli) found by the software of the present application in the sample A and sample B datasets; the dashed boxes represent the ground-truth protein composition, mixed in the specified proportions, and the solid boxes show the protein composition found by the software of the present application; d shows the numbers of identified proteins found by the software of the present application, DIA-Umpire and Spectronaut in the sample A and sample B datasets, where the labels Dear-DIA, DIA-Umpire and Spectronaut denote the quantitative results of the software of the present application, DIA-Umpire and Spectronaut, respectively;

fig. 7 is a block diagram of an analysis system for protein mass spectrometry data based on deep learning according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Referring to fig. 1, the method for analyzing protein mass spectrometry data based on deep learning of the present embodiment includes:

S101, obtaining DIA protein data of a sample;

In particular, the embodiments of the present invention were benchmarked using highly complex sample datasets, including the L929 mouse dataset, the SWATH-MS Gold Standard (SGS) human dataset, and the HYE124 dataset with 64 variable windows (instrument: AB Sciex TripleTOF 5600). Pseudo-tandem spectra were generated using the model of the present application and the software DIA-Umpire, respectively, and the peptide fragments and proteins in the tandem spectra were then searched and identified using the database search software Comet and X! Tandem. All identified peptide fragments and proteins were filtered at a 1% protein-level false discovery rate (FDR) calculated by the software MAYU, a spectral library was built, and the peptide fragments and proteins in the libraries of the present invention and of DIA-Umpire were then quantified separately using OpenSWATH. In addition, the benchmark datasets were analyzed using the software Spectronaut, with peptide fragments and proteins filtered at a 1% protein-level FDR.

Referring to fig. 2, the specific steps of a method for analyzing protein mass spectrum data based on deep learning are as follows:

Step one: obtaining the samples and their data sets, and data preprocessing:

A1. Obtaining the samples and their data sets:

The mouse data set was acquired from L929 cell lysates in triplicate, measured on a TripleTOF 5600 mass spectrometer (AB Sciex) in SWATH mode with 100 variable primary mass spectrometry (MS1) windows.

SGS human dataset: 422 synthetic peptide fragments (SGS peptide fragments) were diluted in HeLa cell lysates in 10 dilution steps (from 1-fold to 512-fold dilution), and DIA data were then acquired in triplicate by SWATH-MS.

The HYE124 data set includes two hybrid proteome samples, A and B. Sample A consisted of 65% human, 15% yeast and 20% E. coli proteins, while sample B consisted of 65% human, 30% yeast and 5% E. coli proteins.

A2. Data preprocessing:

Mass spectrometry data are mainly three-dimensional: mass-to-charge ratio (m/z), retention time and intensity. The raw data recorded by the mass spectrometer contain noise, and a single file is very large, so simple preprocessing, such as denoising and dividing the data into smaller processing units, is necessary before deep learning. In a SWATH experiment, the mixed protein sample is enzymatically digested, preliminarily separated on a chromatographic column, and then enters the mass spectrometer sequentially. The signal of a peptide fragment therefore appears as a chromatographic peak along the retention time dimension. Chromatographic peaks are an important basis for judging whether a peptide fragment is present. The true chromatographic peak of a peptide fragment ion resembles a Gaussian curve, but it is usually asymmetric, and interfering peaks may appear at the beginning or end of the peak. In general, a chromatographic peak with a high signal-to-noise ratio indicates that an ion signal has been recorded, whereas a chromatographic peak with a low signal-to-noise ratio is an interference signal generated by background ions. In the data preprocessing stage, the chromatographic peaks are filtered and ions with a high signal-to-noise ratio are retained. The specific steps are as follows:

1. The retention time dimension of each MS1 isolation window is first divided into fixed-width sliders. Each slider is regarded as the minimum processing unit and contains a series of MS1 spectra and the corresponding MS2 spectra.

2. Background ions with a low signal-to-noise ratio (SNR) in the slider are deleted by a peak finding algorithm and a deisotoping algorithm, and the candidate parent ions and daughter ions are determined.

3. Because a triplet loss is introduced in the model, triplet data must be generated to train the neural network. The training data come from the OpenSWATH quantification results. Six high-intensity extracted ion chromatograms (XICs) of each quantified peptide fragment are stored. Two fragment XICs are then randomly selected from the same peptide fragment as the anchor and positive data, and the negative data are randomly selected from a different peptide fragment. Finally, the anchor, positive and negative data are combined into triplet data.
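The text does not give the formula of the triplet loss. A common margin-based form, shown here as an assumption rather than as the loss actually used, penalizes a triplet whenever the anchor is not closer to the positive than to the negative by at least a margin. The Euclidean distance and the `margin` value are illustrative choices.

```python
import math
from typing import Sequence


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin), Euclidean d."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Latent vectors of three fragment XICs (toy values).
a, p = [0.0, 0.0], [0.1, 0.0]
print(triplet_loss(a, p, [3.0, 4.0]))  # negative far away: no penalty
print(triplet_loss(a, p, [0.5, 0.0]))  # negative too close: positive penalty
```

Minimizing such a loss pulls fragment XICs of the same peptide together in the latent space and pushes those of different peptides apart, which is the property the later clustering step relies on.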

Step two: constructing a model:

Referring to fig. 3, the deep learning model network is divided into four main parts: an encoder module, a decoder module, a cluster analysis module and a PIndex index query module. Each encoder operation has a corresponding decoder operation, which has the advantage that features at different levels can be captured and the learned shallow and deep features are integrated; the decoder reconstructs the latent features and restores the data to the greatest possible extent, which makes the network considerably reliable. The latent features output by the encoder are then input into the cluster analysis module for clustering, the clustering result is used as the index of the polypeptide, and the PIndex indexing algorithm is used to obtain the correspondence between polypeptide parent ions and fragment daughter ions.

B. An encoder module:

The encoder comprises four branch networks. The first branch network is a single fully-connected layer of 384 neurons followed by a dropout layer with a dropout rate of 0.2; the second branch network comprises two fully-connected layers, the first with 192 neurons and the second with 384 neurons, followed by a dropout layer with a dropout rate of 0.2; the third branch network comprises two fully-connected layers, the first with 48 neurons and the second with 128 neurons, followed by a dropout layer with a dropout rate of 0.2; the fourth branch network is a single fully-connected layer of 128 neurons followed by a dropout layer with a dropout rate of 0.2.

The output vectors of the four branch networks are concatenated at the end, and the concatenated vector is split into two vectors of equal dimension, one representing the standard deviation (σ) and the other the mean (μ), as shown in fig. 3a.
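The shapes implied by this architecture can be checked with a small sketch. It uses random, untrained weights and is a shape illustration only, not the trained model. The input length of 32 points, the omission of activation functions, biases and dropout, and the assignment of the first half to the mean are assumptions, since the text does not state them.

```python
import random
from typing import List, Tuple

Vector = List[float]

# Layer widths of the four encoder branches, as listed in the description.
BRANCH_WIDTHS = [[384], [192, 384], [48, 128], [128]]


def dense(x: Vector, out_dim: int, rng: random.Random) -> Vector:
    """Linear fully-connected layer with random (untrained) weights, no bias."""
    scale = 1.0 / len(x) ** 0.5
    return [sum(xi * rng.gauss(0.0, scale) for xi in x) for _ in range(out_dim)]


def encode(xic: Vector, rng: random.Random) -> Tuple[Vector, Vector]:
    """Run the four branches, concatenate their outputs, split into two halves."""
    concatenated: Vector = []
    for widths in BRANCH_WIDTHS:
        hidden = xic
        for width in widths:
            hidden = dense(hidden, width, rng)
        concatenated.extend(hidden)
    half = len(concatenated) // 2
    # Which half is the mean and which the standard deviation is an assumption.
    return concatenated[:half], concatenated[half:]


mu, sigma = encode([0.1] * 32, random.Random(0))
print(len(mu), len(sigma))
```

The branch outputs have 384, 384, 128 and 128 elements, so the concatenated vector has 1024 elements and each of the two output vectors has 512.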

C. A decoder module:

The decoder comprises four branch networks. The first branch network is a single fully-connected layer of 128 neurons followed by a dropout layer with a dropout rate of 0.2; the second branch network comprises two fully-connected layers, the first with 384 neurons and the second with 192 neurons, followed by a dropout layer with a dropout rate of 0.2; the third branch network comprises two fully-connected layers, the first with 128 neurons and the second with 48 neurons, followed by a dropout layer with a dropout rate of 0.2; the fourth branch network is a single fully-connected layer of 384 neurons followed by a dropout layer with a dropout rate of 0.2.

D. A cluster analysis module:

The cluster analysis module applies the k-means clustering algorithm, based on the Euclidean distance metric, to assign the latent features of the fragment ion chromatographic peaks, extracted by the VAE encoder network and mapped into Euclidean space, to k classes in the feature space, as shown in fig. 3a.
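For readers unfamiliar with k-means, the following is a plain implementation of Lloyd's algorithm on toy two-dimensional latent vectors. The initialization (k random points), the fixed number of iterations and the seed are illustrative assumptions; how k is chosen in the embodiment is not stated in the text.

```python
import math
import random
from typing import List, Sequence


def kmeans(points: Sequence[Sequence[float]], k: int,
           iterations: int = 20, seed: int = 0) -> List[int]:
    """Return a cluster label for every point (Lloyd's algorithm, Euclidean distance)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


latent = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels = kmeans(latent, k=2)
print(labels)
```

Fragment ions whose latent vectors receive the same label form one fragment daughter ion cluster, which is then passed to the index query step.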

E. PIndex index query module:

Ideally, fragment daughter ions of the same class come from the same precursor parent ion. The index query module mainly applies a peptide fragment indexing algorithm named PIndex, a small-window search algorithm developed on the basis of the inverted index principle. PIndex contains two parts: a digestion algorithm and a query algorithm. The digestion algorithm divides the proteins in the protein database into theoretical peptide fragments according to the cleavage sites of the protease; the theoretical peptide fragments form a theoretical peptide fragment database, which contains a series of peptide fragment information, and each peptide fragment corresponds to a unique peptide fragment index. The theoretical peptide fragment database output by the digestion algorithm is used to determine the parent ion corresponding to each fragment daughter ion cluster. As shown in fig. 3b, the query algorithm first builds an inverted index table, which represents the intersection of two peptide fragment index sets, namely the index set obtained by the parent ion query and the index set obtained by the fragment daughter ion query. The inverted query comprises two parts: one maps the parent ions to peptide fragment indexes by mass-to-charge ratio, and the other maps the fragment daughter ions to peptide fragment indexes, which are then queried by the mass-to-charge ratio of the fragment daughter ions. The module calculates the scores of the peptide fragments, ranks all theoretical peptide fragments by score according to the binary table, and then calculates the similarity between the fragment daughter ions matched to the highest-scoring theoretical peptide fragment. If the similarity exceeds a specified threshold, the fragment daughter ions in the cluster and the corresponding parent ion are grouped into one class and stored as a tandem spectrum.
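The similarity measure and the threshold are not specified in the text. One plausible reading, shown here purely as an assumption, is that fragment ions of the same peptide should co-elute, so the mean pairwise Pearson correlation of their chromatograms can serve as the similarity. `THRESHOLD` and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence

THRESHOLD = 0.8  # hypothetical similarity cutoff


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long traces (0.0 if either is flat)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def mean_pairwise_similarity(xics: List[Sequence[float]]) -> float:
    """Average Pearson correlation over all pairs of matched fragment XICs."""
    pairs = list(combinations(xics, 2))
    if not pairs:
        return 0.0
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# XICs of the fragments matched to the top-ranked theoretical peptide (toy values).
matched = [[1, 4, 9, 4, 1], [2, 8, 18, 8, 2], [1, 5, 10, 5, 1]]
similarity = mean_pairwise_similarity(matched)
keep_as_tandem_spectrum = similarity > THRESHOLD
print(round(similarity, 3), keep_as_tandem_spectrum)
```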

The test performance of the above example data is as follows:

In the identification of the mouse L929 mass spectrometry dataset, the invention found 9332 peptide fragments and 2681 proteins, whereas the software DIA-Umpire reported 7059 peptide fragments and 1981 proteins. The invention and DIA-Umpire shared 5961 peptide fragments and 1862 proteins. In the quantification, the invention, DIA-Umpire and Spectronaut found 8999, 6780 and 8594 polypeptides and 2427, 1761 and 2183 proteins, respectively. The identification results of the invention cover 84% of the peptide fragments and 93% of the proteins identified by DIA-Umpire. The invention also covers 75% of the peptide fragments and 88% of the proteins reported by Spectronaut. This coverage shows good reproducibility among the invention, DIA-Umpire and Spectronaut (fig. 4).

For the SGS human dataset, in which 422 synthetic peptides (SGS peptides) were diluted in HeLa cell lysates in 10 dilution steps (from 1-fold to 512-fold dilution) and triplicate DIA data were acquired by SWATH-MS, the present invention showed sensitivity similar to that of DIA-Umpire and Spectronaut at the 1× dilution step. However, for dilution steps greater than 64×, the sensitivity of the present invention is higher than that of DIA-Umpire and Spectronaut (FIG. 5). Overall, across all 8 dilution steps, the present invention reported 36% more identified SGS peptide fragments than DIA-Umpire (135 versus 99). Notably, the present invention still found some SGS peptide fragments in the data at 256-fold and 512-fold dilution, whereas DIA-Umpire and Spectronaut failed to find any (FIG. 5).

The performance of the model of the present application and of the software DIA-Umpire was also compared using the HYE124 data set, which was specifically designed to test the performance of DIA algorithms. The HYE124 data set includes two hybrid proteome samples, A and B. Sample A consists of 65% human, 15% yeast and 20% E. coli protein, while sample B consists of 65% human, 30% yeast and 5% E. coli protein. With the two samples of the HYE124 data set combined, the model and DIA-Umpire identified 25693 and 18872 peptide fragments and 4554 and 3613 proteins, respectively, and the identifications of the model cover 95% of the proteins and 88% of the peptide fragments identified by DIA-Umpire (FIG. 6a). These results show that the model of the present application can reproduce the results of DIA-Umpire well. Furthermore, the model found 6.3 times as many unique proteins as DIA-Umpire (i.e., 1118 vs. 177 proteins) and 4.1 times as many unique peptide fragments (i.e., 8996 vs. 2175 peptide fragments). The model of the present application is also superior to DIA-Umpire in terms of quantification (FIG. 6a), and it discovers a large number of proteins and peptide fragments that DIA-Umpire misses in identification and quantification.
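The coverage and fold figures quoted for the HYE124 data set follow from the identification counts by simple set arithmetic. The sketch below is illustrative only: the function name `overlap_stats` is hypothetical, and it assumes that the "unique" counts are identifications reported by one tool and not by the other. Under that assumption the quoted counts are mutually consistent.

```python
def overlap_stats(total_a, total_b, unique_a, unique_b):
    """Derive the shared count, the coverage of B by A, and the unique-count ratio.

    total_*  : identifications reported by each tool
    unique_* : identifications reported by that tool only (assumed meaning)
    """
    shared_a = total_a - unique_a
    shared_b = total_b - unique_b
    # Both subtractions must give the same intersection size.
    if shared_a != shared_b:
        raise ValueError("inconsistent counts")
    return {
        "shared": shared_a,
        "coverage_of_b": shared_a / total_b,
        "unique_ratio": unique_a / unique_b,
    }

# HYE124 counts quoted in the text (model vs. DIA-Umpire).
proteins = overlap_stats(4554, 3613, 1118, 177)
peptides = overlap_stats(25693, 18872, 8996, 2175)

for name, s in (("proteins", proteins), ("peptide fragments", peptides)):
    print(f"{name}: shared={s['shared']}, "
          f"coverage={s['coverage_of_b']:.0%}, "
          f"unique ratio={s['unique_ratio']:.1f}x")
# proteins: shared=3436, coverage=95%, unique ratio=6.3x
# peptide fragments: shared=16697, coverage=88%, unique ratio=4.1x
```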

In addition, the software Spectronaut found 23924 quantified peptide fragments and 3294 quantified proteins. The identifications of the model of the present application cover 92% of the proteins and 78% of the peptide fragments in the Spectronaut results, and the model found 3.7 times as many uniquely quantified proteins as Spectronaut (i.e., 957 vs. 262 proteins). Because of interference from background noise, low-abundance proteins and peptide fragments are difficult to identify with mass spectrometry algorithms. The model of the present application, Dear-DIA-XMBD, performs better than DIA-Umpire on this problem: the intensity distributions of its quantified proteins and peptide fragments are concentrated more in the low-intensity range (FIG. 6b).

Sample A and sample B of the HYE124 data set were also analyzed separately. For sample A, human, yeast and E. coli proteins accounted for 64.2%, 17.1% and 18.7%, respectively, of the proteins identified by the model of the present application. For sample B, the corresponding proportions were 64.6%, 27.2% and 8.2% (FIG. 6c). The model of the present application consistently found more human, yeast and E. coli proteins than DIA-Umpire and Spectronaut (FIG. 6d).

It should be noted that, in the deep learning network structure of the present invention, the branch network portion may be replaced by a recurrent neural network (RNN), a long short-term memory (LSTM) network, a Transformer (attention mechanism) network, or a convolutional network, and the present invention is not specifically limited in this respect.

Further, although the variational self-encoder coding neural network of the above example includes 4 sub-networks, in practical applications the number of branches and the number of neural network layers in each sub-network may be set to any number. Harder classification tasks tend to call for deeper networks, and easier tasks for shallower ones.

Furthermore, parameters of the variational self-encoder network, such as the number of neurons in the fully connected layers, the size and number of convolution kernels, the learning rate and the optimizer, can be chosen to suit the specific situation.
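The four-branch encoder can be sketched as follows. This document states that the output vectors of the four branch networks are concatenated and that the concatenated vector is divided into two vectors of equal dimension, one representing the mean and the other the standard deviation of the variational self-encoder (variational autoencoder, VAE). Everything else here is an assumption made for illustration: the layer sizes, the random weights, the idea that each branch receives its own input vector, which half is the mean, the softplus used to keep the standard deviation positive, and the standard VAE reparameterisation step.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer (no activation; the source does not specify one)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def encode(branch_inputs, branch_layers, rng):
    """Run each branch, concatenate the outputs, split into mean and std."""
    concat = []
    for x, (w, b) in zip(branch_inputs, branch_layers):
        concat.extend(dense(x, w, b))
    half = len(concat) // 2
    mean = concat[:half]  # assumed: first half is the mean
    # Softplus keeps the standard deviation positive (an assumption here).
    std = [math.log1p(math.exp(v)) for v in concat[half:]]
    # Reparameterisation: z = mean + std * eps, with eps drawn from N(0, 1).
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
    return mean, std, z

rng = random.Random(0)
n_in, n_out = 16, 4  # hypothetical sizes
layers = [init_layer(rng, n_in, n_out) for _ in range(4)]
inputs = [[rng.random() for _ in range(n_in)] for _ in range(4)]
mean, std, z = encode(inputs, layers, rng)
print(len(mean), len(std), len(z))  # 8 8 8
```

Changing the number of branches or stacking several `dense` calls per branch corresponds to the variations in branch count and depth mentioned above.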

The invention adopts a variational self-encoder network structure, can autonomously learn the latent features of mass spectrum data, and does not depend on data-dependent acquisition (DDA) experiments. Therefore, the invention can greatly reduce the time consumed by experiments and analysis and can achieve better protein identification and quantification results.

Referring to fig. 7, a system for analyzing protein mass spectrometry data based on deep learning includes:

a data acquisition module 701 for acquiring DIA protein data of a sample;

a pre-processing module 702 for, based on the DIA protein data, taking a slider that moves in a specific step along the dwell time dimension as the minimum processing unit, deleting background ions with low signal-to-noise ratio in the slider, and determining candidate parent ions and candidate daughter ions;

an encoder module 703 for inputting the extracted chromatograms of the candidate daughter ions into a variational self-encoder coding neural network, which embeds them into Euclidean space;

a cluster analysis module 704 for dividing the output of the encoder module into k classes using a k-means classification algorithm;

a PIndex index lookup module 705 for combining each fragment daughter ion cluster with a corresponding parent ion based on a protein database to generate parent ion-fragment daughter ion pairs, re-evaluating the parent ion-fragment daughter ion pairs by calculating the similarity between the fragment daughter ions that match the theoretical spectrum, and storing the parent ion-fragment daughter ion pairs whose similarity exceeds a preset threshold as pseudo-tandem spectra.
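The clustering step of module 704 can be illustrated with a minimal k-means (Lloyd's algorithm) sketch on embedded vectors. This is not the implementation of the invention: the function names are hypothetical, the two-dimensional points stand in for encoder outputs, and the centroids are initialised from evenly spaced input points as a simplification (k-means++ or random restarts are common in practice).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=50):
    """Lloyd's algorithm: returns (labels, centroids)."""
    # Simplified deterministic initialisation: evenly spaced input points.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D embeddings of six daughter-ion chromatograms.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Daughter ions that share a label form one fragment daughter ion cluster, which is then paired with candidate parent ions.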

The analysis system of protein mass spectrum data based on deep learning further comprises:

The specific implementation of each module is the same as the analysis method of protein mass spectrum data based on deep learning, and the embodiment of the invention is not repeated.
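As one illustration of the re-evaluation step, the sketch below scores a parent ion-fragment daughter ion pair by the similarity between the chromatograms of the fragment daughter ions that matched the theoretical spectrum, and keeps the pair only if the score exceeds a preset threshold. The source does not specify the similarity measure, how pairwise scores are aggregated, or the threshold value; cosine similarity averaged over all pairs and a threshold of 0.9 are assumptions, and all names and data are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length intensity traces."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_pairwise_similarity(traces):
    """Average cosine similarity over all pairs of fragment-ion traces."""
    pairs = list(combinations(traces, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

def keep_as_pseudo_spectrum(matched_traces, threshold=0.9):
    """Keep the pair only if the similarity exceeds the preset threshold."""
    return mean_pairwise_similarity(matched_traces) > threshold

# Hypothetical extracted chromatograms of fragment ions that matched
# the theoretical spectrum of one candidate peptide.
co_eluting = [
    [0.0, 2.0, 8.0, 3.0, 0.5],
    [0.1, 1.8, 7.5, 3.2, 0.4],
    [0.0, 1.0, 4.1, 1.4, 0.2],
]
interfered = [
    [0.0, 2.0, 8.0, 3.0, 0.5],
    [6.0, 5.0, 0.5, 0.2, 0.0],
    [0.0, 0.3, 0.2, 4.0, 7.0],
]
print(keep_as_pseudo_spectrum(co_eluting))  # True
print(keep_as_pseudo_spectrum(interfered))  # False
```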

A computer device comprises a memory and a processor, the memory storing a computer program, and the processor implementing the deep learning-based method for analyzing protein mass spectrum data when executing the computer program.

The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification of the present invention made using this design concept shall be regarded as an act infringing the scope of protection of the present invention.

Claims

1. A method for analyzing protein mass spectrum data based on deep learning is characterized by comprising the following steps:

S101, obtaining DIA protein data of a sample;

2. The method for analyzing protein mass spectrometry data based on deep learning of claim 1, wherein the slider moving in a specific step along the dwell time dimension is a minimum processing unit, and specifically comprises:

3. The method for analyzing protein mass spectrometry data based on deep learning of claim 1, wherein the deleting background ions with low signal-to-noise ratio in the slider specifically comprises:

4. The method for analyzing protein mass spectrometry data based on deep learning of claim 1, wherein before S103, the method further comprises:

storing the extracted chromatograms of six quantitative peptide fragments;

combining anchor data, positive data and negative data into triplet data.

5. The method of claim 1, wherein the variational self-encoder coding neural network comprises four branch networks, each branch network consisting of one or more fully connected layers, each comprising a plurality of neurons; the output vectors of the four branch networks are concatenated end to end, and the concatenated vector is divided into two vectors of equal dimension, one representing a standard deviation and the other representing a mean.

6. The method for analyzing protein mass spectrometry data based on deep learning of claim 1, wherein the step S104 specifically comprises:

7. The method for analyzing protein mass spectrometry data based on deep learning of claim 6, wherein the step S105 specifically comprises:

8. A system for analyzing protein mass spectrum data based on deep learning, characterized by comprising:

a data acquisition module for acquiring DIA protein data of a sample;

9. The deep learning based protein mass spectrometry data analysis system of claim 8, further comprising:

10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the computer program.