CN112185460B

CN112185460B - Heterogeneous data independent proteomics mass spectrometry analysis system and method

Info

Publication number: CN112185460B
Application number: CN202011005330.1A
Authority: CN
Inventors: 钟传奇; 陈希; 韩强强; 尚骏; 黄邵鑫; 刘宜子; 杜博贾; 杨勇; 周欣
Original assignee: Spectral Double Combined Wuhan Life Technology Co ltd
Current assignee: Spectral Double Combined Wuhan Life Technology Co ltd
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2022-07-08
Anticipated expiration: 2040-09-23
Also published as: CN112185460A

Abstract

The invention discloses a heterogeneous data-independent proteomics mass spectrometry analysis system and method. The system comprises a local client and a cloud high-performance server. The method comprises the following steps: (1) the local client reads local heterogeneous data-independent proteomics mass spectrum data and calls the cloud high-performance server to obtain a data interpreter; (2) after completing data interpretation, spectrogram extraction and pseudo-peptide fragment (decoy) generation locally, the local client submits the peptide fragment spectrogram data, the pseudo-peptide fragments and the target detection peptide fragments to the high-performance server; (3) the high-performance server performs data analysis on the peptide fragment spectrogram data, the pseudo-peptide fragments and the target detection peptide fragments provided by the local client, and returns the calculated proteomics analysis result to the local client. The invention thereby balances the privacy of the original data-independent proteomics mass spectrum data against the need for strong computing capacity.

Description

Heterogeneous data independent proteomics mass spectrometry analysis system and method

Technical Field

The invention belongs to the field of proteomics, and particularly relates to a heterogeneous library file data independent proteomics mass spectrometry analysis system and method.

Background

Traditional proteomics employs a Data Dependent Acquisition (DDA) strategy: protein samples are digested into peptide fragments, which are ionized and analyzed by mass spectrometry. In the full-scan mass spectrum, peptide signals above the noise level are selected and fragmented to produce tandem (MS/MS) mass spectra that can be matched against the spectra in a database. Although this method is very powerful, its selection of peptides for fragmentation is stochastic and always biased towards the peaks with the strongest signals. Therefore, quantification of low-abundance peptide fragments remains a challenge.

In the subsequently developed targeted analysis technique, Selected Reaction Monitoring (SRM), the mass spectrometer detects pre-specified peptide fragments with high sensitivity and high quantitative accuracy.

The proteomics research community now focuses on Data Independent Acquisition (DIA), which in theory combines the advantages of DDA and SRM. In DIA analysis, all peptide fragments within a given mass-to-charge ratio (m/z) window are fragmented, and the analysis is repeated window by window until the mass spectrometer has covered the entire m/z range. This enables accurate peptide quantification without being limited to pre-defined peptide fragments.

Because the data volume is extremely large, analysis of data-independent proteomics mass spectrum data has to rely on bioinformatics algorithms for regression fitting. However, as detection techniques continue to improve, the form and format of data-independent proteomics mass spectrum data are continuously updated, and existing analysis systems cannot be extended to remain compatible with the various kinds of such data. Meanwhile, a centralized cloud analysis system risks leaking the original detection data, which hinders commercial adoption. A new generation of data-independent proteomics mass spectrometry analysis system therefore needs to be developed.

Disclosure of Invention

In view of the above defects of, or needs for improvement in, the prior art, the invention provides a heterogeneous data-independent proteomics mass spectrometry analysis system and method. By combining a local service with a cloud service, the invention aims to support complex data-independent proteomics mass spectrometry data formats in an extensible and compatible way while, on the premise of ensuring data privacy and security, shortening data analysis time and reducing the demand on local computing capability. It thereby solves the technical problems that analysis software in the prior art cannot be extended to remain compatible with continuously updated data formats, places high demands on local computing power, requires long analysis times, carries a risk of data privacy disclosure, and is consequently difficult to commercialize.

In order to achieve the above object, according to an aspect of the present invention, there is provided a heterogeneous data independent proteomics mass spectrometry system, including a local client and a cloud high performance server;

the local client is used for acquiring data independent original data and target detection data, interpreting the data independent original data into data independent proteomic mass spectrum data in a standard format and interpreting the target detection data into library files in the standard format according to a data interpreter called from a cloud high-performance server, generating peptide fragment spectrogram data, pseudo peptide fragments and target detection peptide fragments according to the data independent proteomic mass spectrum data in the standard format and the library files in the standard format, and submitting the peptide fragment spectrogram data, the pseudo peptide fragments and the target detection peptide fragments to the high-performance server;

and the cloud high-performance server is used for performing data analysis according to the peptide fragment spectrogram data, the pseudo peptide fragments and the target detection peptide fragments provided by the local client, performing retention time regularization and regression calculation on the data analysis results, obtaining the sub-ion (product-ion) series strength and the false positive rate of the target detection peptide fragments as proteomics analysis results, and returning the proteomics analysis results to the local client.

Preferably, the heterogeneous data independent proteomics mass spectrometry system, wherein the local client is further configured to obtain and display proteomics analysis results from a cloud high-performance server.

Preferably, the heterogeneous data-independent proteomics mass spectrometry system, wherein the local client comprises: a data interpreter, a spectrogram extractor and a pseudopeptide segment generator;

the data interpreter is called from a high-performance server and used for reading the data-independent original data and the target detection data, identifying the data-independent original data and the target detection data of the currently supported types, respectively converting the data-independent original data and the target detection data into data-independent proteomics mass spectrum data in a standard format and library files in a standard format, submitting the data-independent proteomics mass spectrum data in the standard format and the library files in the standard format to a spectrogram extractor, and submitting the library files in the standard format to a pseudo-peptide fragment generator;

the spectrogram extractor is used for merging the data independent proteomics mass spectrum data in the standard format according to the library file in the standard format to obtain peptide fragment spectrogram data and submitting the peptide fragment spectrogram data to the cloud high-performance server;

and the pseudo peptide segment generator is used for performing a generating operation on the library file in the standard format to obtain pseudo peptide segments, and submitting the pseudo peptide segments and the target detection peptide segments to the cloud high-performance server.

Preferably, in the heterogeneous data independent proteomics mass spectrometry system, the merging process comprises cyclic scanning, convolution merging and noise reduction; the convolution merging is a Tophat convolution operation or a Bartlett convolution operation.

Preferably, the heterogeneous data independent proteomic mass spectrometry system, wherein the generating operation is an operation of maintaining the peptide fragment composition unchanged and changing the amino acid sequence.

Preferably, the heterogeneous data independent proteomics mass spectrometry system, the cloud high performance server thereof, comprises a data analyzer, a regularizer, and a quality controller;

the data analyzer is used for scoring based on chromatogram, mass spectrum and/or ion mobility according to the peptide fragment spectrogram data, the pseudo-peptide fragment and the target detection peptide fragment data, predicting signal values of the target detection peptide fragment and the pseudo-peptide fragment according to a scoring result and supplying the signal values to the regularizer;

the regularizer is used for carrying out retention time regularization and regression algorithm according to the signal values of the target detection peptide segment and the pseudo peptide segment to obtain the sub-ion series strength of the target detection peptide segment and the pseudo peptide segment, and submitting the sub-ion series strength to the quality controller;

and the quality controller is used for calculating the false positive rate of the peptide fragment according to the sub-ion series strength of the target detection peptide fragment and the pseudo peptide fragment and returning the false positive rate to the local client.

According to another aspect of the invention, a heterogeneous data independent proteomics mass spectrometry method is provided, which is characterized by applying the heterogeneous data independent proteomics mass spectrometry system provided by the invention.

Preferably, the heterogeneous data independent proteomics mass spectrometry method comprises the following steps:

(1) the local client reads local heterogeneous data independent proteomics mass spectrometry data, and calls a cloud high-performance server to obtain a data interpreter;

(2) after the local client locally finishes data interpretation, spectrogram extraction and pseudo peptide segment generation, the local client submits peptide segment spectrogram data, pseudo peptide segments and target detection peptide segments to a high-performance server;

(3) the high-performance server performs data analysis according to the peptide fragment spectrogram data, the pseudo peptide fragments and the target detection peptide fragments provided by the local client, performs retention time regularization and regression calculation on the data analysis results, obtains the sub-ion series strength and the false positive rate of the target detection peptide fragments as proteomics analysis results, and returns the proteomics analysis results to the local client.
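Step (2) determines what actually leaves the client. The following is a minimal sketch of that upload payload, under the assumption that the names `Submission`, `build_submission` and `make_decoy` are purely illustrative (none of them appears in the patent). It only shows that the three derived items are submitted while the raw instrument files stay local.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical upload payload for step (2); raw instrument data is deliberately absent."""
    peptide_spectra: dict[str, list[float]]  # peptide -> extracted signal trace
    decoy_peptides: list[str]                # pseudo-peptide fragments
    target_peptides: list[str]               # target detection peptide fragments


def build_submission(extracted, targets, make_decoy):
    """Assemble the payload from locally extracted traces and the target list.

    `make_decoy` is any callable that maps a target sequence to a decoy sequence.
    """
    return Submission(
        peptide_spectra=dict(extracted),
        decoy_peptides=[make_decoy(p) for p in targets],
        target_peptides=list(targets),
    )
```

For example, `build_submission({"PEPK": [1.0, 2.0]}, ["PEPK"], lambda s: s[::-1])` yields a payload with one target and one reversed decoy.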

Preferably, the heterogeneous data independent proteomic mass spectrometry method, when processing a high-throughput data set, the steps (1-2) and step (3) are performed in a distributed or integrated manner.

Preferably, in the heterogeneous data independent proteomics mass spectrometry analysis method, the distributed manner is: while the high-performance server analyzes the current batch of data independent proteomics mass spectrum data, the local client simultaneously processes the next batch of data independent proteomics mass spectrum data;

the integrated process, namely: the system is provided with a plurality of high-performance servers and one or more local clients, and task scheduling is performed on the high-performance servers, so that the shortest total processing time or the shortest processing time of the specific data independent proteomics mass spectrum data is realized.

In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:

According to the method and the system, the local client performs local preprocessing of the heterogeneous data independent proteomics mass spectrometry data and the cloud high-performance server performs the analysis. This balances the privacy of the original data independent proteomics mass spectrometry data against the need for strong computing capacity: it avoids the huge time cost and computing performance requirements incurred when the whole analysis is run locally, and it also avoids the risk of leaking original detection data and the huge data transmission bandwidth requirements incurred when the whole analysis is run on the cloud high-performance server.

Meanwhile, because the cloud continuously updates the data interpreter for data independent proteomics data and library files, the system can be extended to accommodate data independent proteomics data of different formats obtained by different detection means.

In the preferred scheme of the method, distributed or integrated task scheduling exploits the computing power of both the local client and the cloud high-performance server to further shorten computing time. This is particularly suitable for high-throughput operation: in the ideal case, the processing speed for high-throughput data is nearly doubled.
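The speed-up claimed above for the ideal case can be checked with simple arithmetic. The sketch below assumes a two-stage pipeline with constant per-batch times, where local preprocessing of one batch overlaps cloud analysis of the previous one; the function names and the example timings are illustrative assumptions, not figures from the patent.

```python
def sequential_time(n_batches: int, t_local: float, t_cloud: float) -> float:
    """Total time when each batch is preprocessed and then analyzed, one after another."""
    return n_batches * (t_local + t_cloud)


def pipelined_time(n_batches: int, t_local: float, t_cloud: float) -> float:
    """Total time when the client preprocesses batch i+1 while the server analyzes batch i."""
    if n_batches == 0:
        return 0.0
    # The first batch passes through both stages; each later batch adds the slower stage.
    return t_local + t_cloud + (n_batches - 1) * max(t_local, t_cloud)


def speedup(n_batches: int, t_local: float, t_cloud: float) -> float:
    """Ratio of sequential to pipelined total time."""
    return sequential_time(n_batches, t_local, t_cloud) / pipelined_time(n_batches, t_local, t_cloud)
```

With 100 batches and equal stage times of 10 units, the sequential total is 2000 units and the pipelined total is 1010 units, a speed-up of about 1.98. The gain shrinks as the two stage times diverge, which is presumably why the text qualifies the figure as holding in the ideal state.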

Drawings

FIG. 1 is a schematic diagram of the heterogeneous data independent proteomics mass spectrometry system structure provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The heterogeneous data independent proteomics mass spectrometry system provided by the invention comprises a local client and a cloud high-performance server, as shown in fig. 1;

the local client is used for acquiring data independent original data and target detection data, interpreting the data independent original data into data independent proteomic mass spectrum data in a standard format and interpreting the target detection data into library files in the standard format according to a data interpreter called from a cloud high-performance server, generating peptide fragment spectrogram data, pseudo peptide fragments and target detection peptide fragments according to the data independent proteomic mass spectrum data in the standard format and the library files in the standard format, and submitting the peptide fragment spectrogram data, the pseudo peptide fragments and the target detection peptide fragments to the high-performance server; the system is also used for obtaining and displaying a proteomics analysis result from the cloud high-performance server;

the local client includes: a data interpreter, a spectrogram extractor and a pseudopeptide segment generator;

the spectrogram extractor is used for merging the data independent proteomics mass spectrum data in the standard format according to the library file in the standard format to obtain peptide fragment spectrogram data, and submitting the peptide fragment spectrogram data to the cloud high-performance server; the merging processing comprises cyclic scanning, convolution merging and noise reduction; the convolution merging is preferably a Tophat convolution operation or a Bartlett convolution operation.

The pseudo-peptide fragment generator is used for performing a generating operation on the library file in the standard format to obtain pseudo-peptide fragments, and submitting the pseudo-peptide fragments and the target detection peptide fragments to the cloud high-performance server; the generating operation comprises operations such as random scrambling, inversion, pseudo-inversion and translation, which keep the peptide fragment composition unchanged while changing the amino acid sequence.

The cloud high-performance server is used for performing data analysis according to the peptide fragment spectrogram data, the pseudo peptide fragments and the target detection peptide fragments provided by the local client, performing retention time regularization and regression calculation on the data analysis results, obtaining the sub-ion series strength and the false positive rate of the target detection peptide fragments as proteomics analysis results, and returning the proteomics analysis results to the local client; the cloud high-performance server further comprises an update module for storing and updating the data interpreter, and the update module continually updates the data interpreter according to the types of data-independent proteomic mass spectrometry data currently in use.

The cloud high-performance server comprises a data analyzer, a regularizer and a quality controller;

the data analyzer is used for scoring based on chromatogram, mass spectrum and/or ion mobility according to the peptide fragment spectrogram data, the pseudo peptide fragment and the target detection peptide fragment data, predicting signal values of the target detection peptide fragment and the pseudo peptide fragment according to the scoring result, and providing the signal values to the regularizer;

the regularizer is used for carrying out retention time regularization and regression algorithm according to the signal values of the target detection peptide segment and the pseudo peptide segment to obtain the sub-ion series strength of the target detection peptide segment and the pseudo peptide segment, and submitting the sub-ion series strength to the quality controller.

And the quality controller is used for extracting the sub-ion series strength of the target detection peptide fragments, calculating the false positive rate of the peptide fragments according to the sub-ion series strength of the target detection peptide fragments and the pseudo peptide fragments, and returning the false positive rate to the local client.
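The patent does not spell out how the false positive rate is computed. A common target-decoy estimate, shown here only as a sketch under that assumption, counts how many pseudo (decoy) peptides pass a score threshold relative to how many targets pass it. The function name, the use of a single score per peptide and the equal-size target and decoy sets are all illustrative assumptions.

```python
def false_positive_rate(target_scores, decoy_scores, threshold):
    """Estimate the share of accepted targets that are false, using decoys as the null model.

    Assumes one score per peptide and roughly equal numbers of targets and decoys.
    """
    n_targets = sum(1 for s in target_scores if s >= threshold)
    n_decoys = sum(1 for s in decoy_scores if s >= threshold)
    if n_targets == 0:
        return 0.0
    # Cap at 1.0: more passing decoys than targets means no reliable identifications.
    return min(1.0, n_decoys / n_targets)
```

For instance, if four of five targets and one of five decoys score at or above the threshold, the estimate is 1/4 = 0.25.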

The method for performing heterogeneous data independent proteomics mass spectrometry by using the heterogeneous data independent proteomics mass spectrometry system provided by the invention comprises the following steps:

(1) the local client reads local heterogeneous data independent proteomics mass spectrum data, and calls a cloud high-performance server to obtain a data interpreter;

When processing a high-throughput data set, the steps (1-2) and (3) are performed in a distributed or integrated manner;

the distributed manner means that, while the high-performance server analyzes the current batch of data independent proteomics mass spectrum data, the local client simultaneously processes the next batch of data independent proteomics mass spectrum data;

the integrated process, namely: the system is provided with a plurality of high-performance servers and one or more local clients, and task scheduling is performed on the high-performance servers, so that the shortest total processing time or the shortest processing time of specific data independent proteomics mass spectrum data is realized.
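The integrated manner amounts to assigning analysis tasks to several servers so that the overall finishing time is short. The patent does not name a scheduling algorithm; the sketch below uses the longest-processing-time-first greedy heuristic as one plausible choice. The function name, the task labels and the assumption that task durations are known in advance are all illustrative.

```python
import heapq


def schedule(task_durations: dict[str, float], n_servers: int) -> tuple[dict[str, int], float]:
    """Greedy longest-processing-time-first assignment.

    Returns (task -> server index, finishing time of the busiest server).
    This is a heuristic: it does not guarantee the minimum total processing time.
    """
    heap = [(0.0, s) for s in range(n_servers)]  # (accumulated load, server index)
    heapq.heapify(heap)
    assignment: dict[str, int] = {}
    # Place the longest tasks first, always on the currently least-loaded server.
    for task, duration in sorted(task_durations.items(), key=lambda kv: -kv[1]):
        load, server = heapq.heappop(heap)
        assignment[task] = server
        heapq.heappush(heap, (load + duration, server))
    makespan = max(load for load, _ in heap)
    return assignment, makespan
```

With four tasks lasting 6, 4, 3 and 3 units on two servers, the heuristic pairs the 6-unit task with one 3-unit task and the 4-unit task with the other, finishing after 9 units. Prioritizing one specific data set, the second goal mentioned above, would need a different policy, such as placing that task first.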

The following are examples:

the heterogeneous data independent proteomics mass spectrometry system provided by the invention comprises a local client and a cloud high-performance server, as shown in figure 1;

the local client is used for acquiring data independent original data and target detection data, interpreting the data independent original data into data independent proteomic mass spectrum data in a standard format and interpreting the target detection data into a library file in the standard format according to a data interpreter called from the cloud high-performance server, generating peptide fragment spectrogram data, pseudo peptide fragments and target detection peptide fragments according to the data independent proteomic mass spectrum data in the standard format and the library file in the standard format, and submitting the peptide fragment spectrogram data, the pseudo peptide fragments and the target detection peptide fragments to the high-performance server; the local client is also used for obtaining and displaying proteomics analysis results from the cloud high-performance server;

the data-independent original data formats supported by the current data interpreter include raw, etc.; the library file formats supported by the current data interpreter are sptxt, blib and csv. The standard format of the data independent proteomics mass spectrometry data is mzML; the standard format of the library file is TraML.
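One way to picture an extensible data interpreter is as a registry that maps each supported input format to a conversion path, with updates from the cloud adding new entries. The sketch below is an assumption about structure only: the names `SUPPORTED`, `register_format` and `classify` are hypothetical, the actual file conversion is omitted, and the `.xyz` extension in the usage note is a made-up placeholder.

```python
from pathlib import Path

# Extension -> input kind, seeded with the formats listed above.
SUPPORTED = {
    ".raw": "ms_data",    # data-independent original data
    ".sptxt": "library",  # library files
    ".blib": "library",
    ".csv": "library",
}


def register_format(ext: str, kind: str) -> None:
    """Mimic an interpreter update that adds support for a new format."""
    if kind not in ("ms_data", "library"):
        raise ValueError(f"unknown kind: {kind!r}")
    SUPPORTED[ext.lower()] = kind


def classify(filename: str) -> str:
    """Return which conversion path a file takes, or raise if the format is unsupported."""
    ext = Path(filename).suffix.lower()
    try:
        return SUPPORTED[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext!r}") from None
```

Calling `register_format(".xyz", "ms_data")` would make `classify("run.xyz")` succeed without any other change to the client, which is the kind of extensibility the update module is meant to provide.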

The spectrogram extractor is similar to the chromatogram extractor of OpenSWATH (OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nature Biotechnology, 2014/3/10) and is used for merging the data-independent proteomics mass spectrum data in the standard format to obtain peptide fragment spectrogram data and submitting the data to the cloud high-performance server; the merging processing comprises cyclic scanning, convolution merging and noise reduction; the convolution merging is preferably a Tophat convolution operation or a Bartlett convolution operation.
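The two convolution options differ only in the shape of the kernel: a Tophat kernel weights all points in the window equally (rectangular), while a Bartlett kernel weights them triangularly. The sketch below illustrates both on a one-dimensional intensity trace. It assumes normalized kernels and zero padding at the edges, and it uses a Bartlett-style triangle with non-zero end weights; none of these details is specified in the patent.

```python
def tophat_kernel(width: int) -> list[float]:
    """Rectangular kernel, normalized so the weights sum to 1."""
    return [1.0 / width] * width


def bartlett_kernel(width: int) -> list[float]:
    """Triangular (Bartlett-style) kernel, normalized to sum to 1; use an odd width."""
    half = (width - 1) / 2
    raw = [1.0 - abs(i - half) / (half + 1) for i in range(width)]
    total = sum(raw)
    return [w / total for w in raw]


def smooth(signal: list[float], kernel: list[float]) -> list[float]:
    """Same-length convolution; samples beyond the edges are treated as zero."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = i + j - half
            if 0 <= k < len(signal):
                acc += w * signal[k]
        out.append(acc)
    return out
```

Smoothing the single spike `[0, 0, 4, 0, 0]` with `bartlett_kernel(3)` spreads it into `[0.0, 1.0, 2.0, 1.0, 0.0]`, which preserves the total intensity while damping point noise.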

The pseudo peptide segment generator is similar to the decoy generator of OpenSWATH and is used for performing a generating operation on the library file in the standard format to obtain pseudo peptide segments, and submitting the pseudo peptide segments and the target detection peptide segments to the cloud high-performance server; the generating operation comprises random scrambling, inversion, pseudo-inversion, translation and similar operations, which keep the peptide fragment composition unchanged while altering the amino acid sequence.
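The four generating operations can be illustrated on a plain amino acid string. The sketch below is one reading of the terms, not the patent's implementation: "translation" is interpreted as a cyclic shift, and both scrambling and pseudo-inversion are assumed to keep the C-terminal residue in place, as is common for tryptic peptides. All function names are hypothetical.

```python
import random


def reverse_decoy(seq: str) -> str:
    """Inversion: reverse the whole sequence."""
    return seq[::-1]


def pseudo_reverse_decoy(seq: str) -> str:
    """Pseudo-inversion: reverse everything except the C-terminal residue."""
    return seq[:-1][::-1] + seq[-1:]


def shuffle_decoy(seq: str, seed: int = 0) -> str:
    """Random scrambling: permute all residues except the C-terminal one."""
    body = list(seq[:-1])
    random.Random(seed).shuffle(body)
    return "".join(body) + seq[-1:]


def shift_decoy(seq: str, offset: int = 1) -> str:
    """Translation, read here as a cyclic shift by `offset` positions."""
    if not seq:
        return seq
    k = offset % len(seq)
    return seq[k:] + seq[:k]
```

Every operation returns a permutation of the input, so the amino acid composition, and therefore the precursor mass, is unchanged while the fragment ion pattern differs. For `"PEPTIDEK"`, inversion gives `"KEDITPEP"` and pseudo-inversion gives `"EDITPEPK"`.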

The cloud high-performance server is used for performing data analysis according to the peptide fragment spectrogram data, the pseudo peptide fragments and the target detection peptide fragments provided by the local client, performing retention time regularization and regression calculation on the data analysis results, obtaining the sub-ion series strength and the false positive rate of the target detection peptide fragments as proteomics analysis results, and returning the proteomics analysis results to the local client; the cloud high-performance server further comprises an update module for storing and updating the data interpreter, and the update module continuously updates the data interpreter according to the types of data independent proteomic mass spectrometry data currently in use.

the data Analyzer is similar to the Analyzer of OpenSWATH, and is used for scoring based on chromatogram, mass spectrum and/or ion mobility according to the peptide fragment spectrogram data, the pseudo peptide fragment and the target detection peptide fragment data, predicting the signal values of the target detection peptide fragment and the pseudo peptide fragment according to the scoring result, and providing the signal values to the regularizer;

the chromatogram-based scoring items include: cross-correlation (Cross-Correlation Score), intensity (Intensity Score), signal-to-noise ratio (Signal-to-Noise Score), EMG (Exponentially Modified Gaussian Score), relative intensity (Relative Intensity Score) and retention time (Retention Time Score); the mass spectrum-based scoring items include: isotope (Isotope Score), mass accuracy (Mass Accuracy Score) and ion series (Ion Series Score); the scoring items based on ion mobility include: ion mobility (Ion Mobility).
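As an illustration of one chromatogram-based item, the sketch below computes a zero-lag normalized correlation between the intensity traces of two fragment ions of the same peptide. This is a simplification chosen for brevity: cross-correlation scores in tools such as OpenSWATH also consider time lags and peak shape, and the function name is hypothetical.

```python
from math import sqrt


def normalized_xcorr(a: list[float], b: list[float]) -> float:
    """Zero-lag normalized correlation of two equally long traces, in the range [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    # A flat trace has no shape to compare, so report zero correlation.
    return num / den if den else 0.0
```

Two co-eluting fragments whose traces differ only in scale score close to 1, whereas an interfering signal with a different elution profile scores lower.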

The regularizer is similar to an RTNormalizer of OpenSWATH and is used for carrying out retention time regularization and an LDA linear regression algorithm according to the signal values of the target detection peptide segment and the pseudo-peptide segment to obtain the sub-ion series strength of the target detection peptide segment and the pseudo-peptide segment and submitting the sub-ion series strength to a quality controller;
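Retention time regularization is typically a mapping from the retention times observed in a run onto a common reference scale. The sketch below fits that mapping with ordinary least squares on a set of reference peptides. It assumes a linear relationship and covers only the retention time alignment, not the LDA step; the function names and the example numbers are illustrative.

```python
def fit_rt_line(observed: list[float], reference: list[float]) -> tuple[float, float]:
    """Least-squares fit of reference = slope * observed + intercept."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, reference))
    slope = sxy / sxx  # requires at least two distinct observed times
    return slope, mean_y - slope * mean_x


def normalize_rt(rt: float, slope: float, intercept: float) -> float:
    """Map an observed retention time onto the reference scale."""
    return slope * rt + intercept
```

If reference peptides observed at 10, 20 and 30 minutes have reference values 25, 45 and 65, the fit gives a slope of 2 and an intercept of 5, so a peak observed at 15 minutes maps to 35 on the reference scale.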

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A heterogeneous data independent proteomics mass spectrometry analysis system is characterized by comprising a local client and a cloud high-performance server;

2. The heterogeneous data-independent proteomic mass spectrometry system of claim 1, wherein the local client is further configured to obtain and display proteomic analysis results from a cloud-based high performance server.

3. The heterogeneous data-independent proteomics mass spectrometry system of claim 1, wherein the local client comprises: a data interpreter, a spectrogram extractor and a pseudopeptide segment generator;

the spectrogram extractor is used for merging the data independent proteomics mass spectrum data in the standard format according to the library file in the standard format to obtain peptide fragment spectrogram data, and submitting the peptide fragment spectrogram data to the cloud high-performance server;

4. The heterogeneous data independent proteomic mass spectrometry system of claim 3, wherein the merging process comprises cyclic scanning, convolution merging and noise reduction; the convolution merging is a Tophat convolution operation or a Bartlett convolution operation.

5. The heterogeneous data independent proteomic mass spectrometry system of claim 3, wherein the generating operation is an operation that maintains the peptide fragment composition and changes the amino acid sequence.

6. The heterogeneous data independent proteomics mass spectrometry system of claim 1, wherein the cloud high performance server comprises a data analyzer, a regularizer, and a quality controller;

the regularizer is used for carrying out retention time regularization and regression calculation according to the signal values of the target detection peptide segment and the pseudo peptide segment to obtain the sub-ion series strength of the target detection peptide segment and the pseudo peptide segment, and submitting the sub-ion series strength to the quality controller;

7. A heterogeneous data independent proteomics mass spectrometry method, characterized in that the heterogeneous data independent proteomics mass spectrometry system of any one of claims 1 to 6 is applied.

8. The heterogeneous data independent proteomic mass spectrometry method of claim 7, comprising the steps of:

9. The heterogeneous data independent proteomic mass spectrometry method of claim 8, wherein the steps (1-2) and (3) are performed in a distributed or integrated manner when processing high throughput data sets.

10. The heterogeneous data independent proteomic mass spectrometry method of claim 9, wherein the distributed manner is: while the high-performance server analyzes the current batch of data independent proteomics mass spectrum data, the local client simultaneously processes the next batch of data independent proteomics mass spectrum data;