CN113990387A - Application method based on IM-DIAT data structure and application thereof - Google Patents

Info

Publication number
CN113990387A
Authority
CN
China
Prior art keywords
data
diat
data structure
application method
mass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111243593.0A
Other languages
Chinese (zh)
Inventor
郭天南
张芳菲
胡一凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West Lake Laboratory Zhejiang Provincial Laboratory Of Life Sciences And Biomedicine
Original Assignee
West Lake Laboratory Zhejiang Provincial Laboratory Of Life Sciences And Biomedicine
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West Lake Laboratory Zhejiang Provincial Laboratory Of Life Sciences And Biomedicine filed Critical West Lake Laboratory Zhejiang Provincial Laboratory Of Life Sciences And Biomedicine
Priority to CN202111243593.0A priority Critical patent/CN113990387A/en
Publication of CN113990387A publication Critical patent/CN113990387A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention relates to an application method based on the IM-DIAT data structure and an application thereof. The scheme comprises the following steps: extracting the necessary attributes of the mass spectrum information from the raw mass spectrometry file; converting and calculating the necessary attributes to obtain a window index, a cycle index, an ion mobility and a mass-to-charge ratio, which serve as the four dimensions of the IM-DIAT data structure and correspond one-to-one to a signal intensity, thereby forming the IM-DIAT data structure; converting the IM-DIAT data structure, through image processing, into two-dimensional multi-channel image data usable by deep learning; and using the two-dimensional multi-channel image data as training data for deep learning to obtain a classification result. The method suits the new dia-PASEF acquisition mode and retains the ion mobility information. The data are formatted and structured into image-like data that deep learning can process effectively, which addresses the problem of judging whether thyroid tissue is benign or malignant.

Description

Application method based on IM-DIAT data structure and application thereof
Technical Field
The invention relates to the technical field of data-independent acquisition mass spectrometry proteomics, and in particular to an application method based on the IM-DIAT data structure and an application thereof.
Background
Data-dependent acquisition (DDA) combined with various separation methods is the most widely used mass spectrometry-based proteomics strategy for complex samples such as clinical samples. In contrast to DDA, data-independent acquisition (DIA) records the fragment spectra (MS2) of all possible precursors by sequentially isolating and fragmenting precursor windows. This overcomes the random selection of precursor ions in DDA and gives high protein coverage and high reproducibility.
dia-PASEF, implemented on a trapped ion mobility spectrometry instrument (timsTOF Pro), exploits the correlation between mass and ion mobility in parallel accumulation-serial fragmentation (PASEF) so that almost 100% of peptide precursor ions are transmitted. This greatly reduces the spectral complexity of data-independent acquisition (DIA) and improves the sensitivity and specificity of protein identification. With ion mobility added as a fourth dimension, the ion mobility separation and the quadrupole isolation can be scanned synchronously through PASEF, which raises the ion sampling efficiency of the quadrupole to 100%, compared with typical DIA methods that are limited to an ion sampling efficiency of only 1-3%, and thus substantially improves acquisition sensitivity. However, related software tools do not yet fully support this newly emerging type of data.
In addition, the traditional method first searches a library and then processes the search results. It has two drawbacks: extracting ion chromatograms (XICs) requires extensive computation that depends on the number of peptides in the library, and identifying peptide precursors in DIA-MS data often leaves a large number of missing values in the search result matrix.
Therefore, a method for implementing an omics data structure based on data-independent acquisition mass spectrometry, such as CN111370072B, has been proposed. Its DIAT tensor can be analyzed directly by a deep learning algorithm, which avoids the heavy computation needed to extract ion chromatograms (XICs), and a deep learning model for clinical sample classification can be built directly from files in this format. The method uses an end-to-end deep learning framework to construct a functional mapping from raw MS data to a diagnostic classifier and does not need to identify peptide precursors in DIA-MS data, so the missing-value problem is avoided. However, the method is not suitable for the new dia-PASEF acquisition mode and does not retain the important ion mobility information. An application method based on an IM-DIAT data structure, and an application thereof, are therefore urgently needed.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art and provides an application method based on the IM-DIAT data structure and an application thereof.
To achieve this purpose, the invention adopts the following technical scheme: the application method based on the IM-DIAT data structure comprises the following steps:
extracting the necessary attributes of the mass spectrum information from the raw mass spectrometry file;
converting and calculating the necessary attributes to obtain a window index, a cycle index, an ion mobility and a mass-to-charge ratio, which serve as the four dimensions of the IM-DIAT data structure and correspond one-to-one to a signal intensity, thereby forming the IM-DIAT data structure;
converting the IM-DIAT data structure into two-dimensional multi-channel image data which can be utilized by deep learning through image processing;
and taking the two-dimensional multi-channel image data as training data for deep learning to obtain a classification result.
The working principle and the beneficial effects are as follows: 1. Compared with the prior art, the method extracts and calculates a four-dimensional IM-DIAT data structure from the raw mass spectrometry file. Because the ion mobility dimension is added, peptide fragment ions that cannot be distinguished in the mass-to-charge ratio dimension can be separated in the ion mobility dimension. That is, with ion mobility added as the fourth dimension, the ion mobility separation and the quadrupole isolation can be scanned synchronously through PASEF, which raises the ion sampling efficiency of the quadrupole to 100%, compared with a typical DIA method limited to an ion sampling efficiency of 1-3%.
2. The method solves the problem that prior-art data in the DIAT (Data-Independent Acquisition Tensor) format is not suitable for the new dia-PASEF acquisition mode and does not retain the important ion mobility information. The DIAT tensor can be analyzed directly by a deep learning algorithm, which avoids the heavy computation needed to extract ion chromatograms (XICs);
3. The IM-DIAT data structure reduces the size of the original file and can be fed directly to a neural network, which yields a classification result. It is therefore well suited to the medical field, in particular to judging whether thyroid tissue is benign or malignant.
Further, the IM-DIAT data structure is based on proteomic quantification on a timsTOF Pro mass spectrometer.
Given the characteristics and functions of the timsTOF Pro mass spectrometer, the present application in effect compiles the useful data from the raw data acquired by the mass spectrometer. It can therefore be applied to dia-PASEF mass spectrometry data in general, such as proteomics, metabolomics and various small-molecule DIA mass spectrometry data, and has a wide range of application. By means of its trapped ion mobility spectrometry (TIMS) technology, the timsTOF Pro completes proteomics analysis more quickly, sensitively and stably, and its PASEF technology sets a new record for data acquisition speed and brings higher sensitivity and speed to proteomics. Therefore, in kits and mass spectrometry file analysis software products that rely on the above omics data analysis, the corresponding technical processes can be replaced by the IM-DIAT data structure or IM-DIAT data format and the corresponding analysis procedures of the present application.
Further, the image processing conversion comprises the following specific steps:
down-sampling and data augmentation are performed on the data of the IM-DIAT data structure;
performing max pooling, average pooling, and min pooling operations on the IM-DIAT data structure after data augmentation.
Because the windows and the ion mobility of mass spectrometry data have no continuous relationship, that is, the signal intensities of adjacent windows or adjacent ion mobilities are uncorrelated, whereas the mass-to-charge ratio and the cycle index are continuous variables, deep learning cannot use the data directly. The image processing conversion step therefore converts the original 4D data (window index, cycle index, ion mobility and mass-to-charge ratio) into two-dimensional multi-channel image data that deep learning can use directly.
Further, the deep learning model is trained on the two-dimensional multi-channel image data after data augmentation;
white noise is randomly added to each channel and a translation operation is performed on each channel;
and the max-pooled, average-pooled and min-pooled versions of each data item are predicted separately, and the mean of the three predictions is taken as the final predicted value to obtain the classification result, wherein each data item is two-dimensional multi-channel image data after the training and translation operations.
With this arrangement, the mass-to-charge ratio and the cycle index serve as the horizontal and vertical coordinates (width W and height H) of the two-dimensional multi-channel image. Because the width and height ranges differ between mass spectrometry data sets (they depend on the mass spectrometer), the data of the IM-DIAT data structure must first be down-sampled and then augmented so that it can be fed directly to a deep learning framework.
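The augmentation and prediction-averaging steps above could be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the patent's implementation: the function names `augment` and `ensemble_predict`, the noise scale, the shift range and the use of a circular shift for translation are all hypothetical, and `model` stands for any callable that returns a probability.

```python
import numpy as np

def augment(image, rng, noise_std=0.01, max_shift=4):
    """Add white noise and apply a random translation to each channel.

    image: array of shape (C, H, W). noise_std and max_shift are
    illustrative values, not taken from the source.
    """
    out = image + rng.normal(0.0, noise_std, size=image.shape)
    for c in range(out.shape[0]):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        # A circular shift is used here as a simple stand-in for translation.
        out[c] = np.roll(out[c], shift=(int(dy), int(dx)), axis=(0, 1))
    return out

def ensemble_predict(model, pooled_max, pooled_avg, pooled_min):
    """Average the model's predictions over the three pooled inputs."""
    preds = [model(x) for x in (pooled_max, pooled_avg, pooled_min)]
    return float(np.mean(preds))
```

Averaging over the three pooled views is a form of test-time ensembling: each view emphasizes a different summary of the same region, and the mean smooths out view-specific errors.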
Further, the parameters of the deep learning neural network model are updated with the Adam optimization method;
prediction data are obtained from the neural network model, and the binary cross-entropy loss between the prediction data and the true class is calculated as the loss function;
the loss function is minimized by computing the error gradient through back propagation and updating the parameters of the neural network model accordingly.
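The training loop described above (binary cross-entropy loss, gradient computation, Adam update) can be illustrated on a toy model. The sketch below is an assumption-laden stand-in: a tiny logistic model replaces the real neural network, the gradient is derived by hand rather than by automatic back propagation, and the names `bce_loss`, `adam_step` and `train_logistic` and all hyperparameter values are hypothetical.

```python
import math

def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy between predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; returns new (w, m, v)."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, grad, m, v):
        mi = b1 * mi + (1 - b1) * gi          # first-moment estimate
        vi = b2 * vi + (1 - b2) * gi * gi     # second-moment estimate
        m_hat = mi / (1 - b1 ** t)            # bias correction
        v_hat = vi / (1 - b2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v

def train_logistic(xs, ys, steps=200, lr=0.1):
    """Fit a tiny logistic model with Adam (stand-in for the real network)."""
    n = len(xs[0])
    w, m, v = [0.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)  # last = bias
    for t in range(1, steps + 1):
        grad = [0.0] * (n + 1)
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            # For a sigmoid output, d(BCE)/dz = p - y.
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi / len(xs)
            grad[-1] += (p - y) / len(xs)
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)
    return w
```

In a deep learning framework the hand-written gradient would be replaced by automatic differentiation, but the loss and the update rule have the same form.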
With this arrangement, the steps can be applied directly to classifying thyroid proteomic data, or thyroid tissue, as benign or malignant, which greatly improves identification efficiency.
The application of the IM-DIAT data structure includes a front end for inputting DIA data and a back end that performs the above application method based on the IM-DIAT data structure and outputs the result to the front end.
The method has the advantage of convenient operation, can directly input DIA data or other original mass spectrum data, and can directly display the data at the front end after the data is processed at the back end.
A computer program product comprising software code portions for performing the above IM-DIAT data structure based application method when said computer program product is run on a computer.
An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the IM-DIAT data structure based application method described above.
A non-transitory computer readable storage medium storing computer instructions for causing the computer to execute the IM-DIAT data structure-based application method described above.
Drawings
FIG. 1 is a flow diagram of the present invention for generating an IM-DIAT data structure;
FIG. 2 is a schematic diagram of the IM-DIAT data format structure of the present invention;
FIG. 3 is an effect diagram of the visualization scheme in generating an IM-DIAT data structure of the present invention;
FIG. 4 is a comparison of various compression schemes in generating an IM-DIAT data structure in accordance with the present invention;
FIG. 5 is a schematic diagram of the 2D ResNet framework of the present invention;
FIG. 6 is a logic diagram of an application of the present invention;
FIG. 7 is a flow chart of the method of the present invention;
FIG. 8 is a schematic diagram of various deep network methods applied to a set of urine samples comprising 19 COVID-19 and 39 non-COVID-19 samples;
FIG. 9 is a graph of the classification accuracy of the various deep network methods.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
It will be understood by those skilled in the art that in the present disclosure, the terms "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for ease of description and simplicity of description, and do not indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and thus, the above terms should not be construed as limiting the present invention.
dia-PASEF is the latest DIA technology on the timsTOF Pro. With dia-PASEF, data utilization is greatly improved and missing values are reduced; matching on 4D data effectively reduces false positives; and quantification is more accurate and more reproducible, which greatly improves the reliability of quantitative analysis and makes the technique suitable for quantitative proteome studies with large sample numbers. Existing software tools do not fully support this emerging type of data, and the additional dimension significantly increases the cost of data storage and target data extraction, the size of the stored files, and the computational cost. At present, mass spectrometry data formats fall roughly into three types: raw formats proprietary to the instrument manufacturers, formats based on the PSI standards, and other formats commonly used in practice, such as the emerging tensor formats. For the last type, reference can be made to the method for implementing a data structure based on data-independent acquisition mass spectrometry disclosed in Chinese granted patent publication No. CN111370072B, which discloses the DIAT tensor data, whose storage format is the DIAT format. However, data in the DIAT tensor format is not suitable for the new dia-PASEF acquisition mode and does not retain the important ion mobility information. A new data format that is derived from dia-PASEF raw data and can be applied directly to deep learning therefore needs to be developed. The following examples are proposed for this purpose:
Example 1
the application method based on the IM-DIAT data structure comprises the following steps:
Step 1: extract the necessary attributes of the mass spectrum information from the raw mass spectrometry file (DIA raw file). The raw file (.d file) is acquired in dia-PASEF mode. The necessary attributes are extracted through the Bruker proprietary library in opentims_bruker_bridge, which ships with the Python package TimsPy3, and a table (DataFrame) containing as much information as possible is obtained. The Bruker proprietary library is used in this step because the TOF-m/z and scan-dt values in Bruker's TIMS Data Format (TDF) files need to be decoded; TimsPy3 itself is an open-source Python toolkit.
TimsPy3 in this embodiment is built on a C++ library that is easily accessed from Python. It reads Bruker TDF files using the vendor's m/z conversion library and provides a practical interface for decoding the raw data for further software development. However, the parsed raw data format is not well suited to deep learning training and testing, and further reshaping and conversion of the data is still required.
Step 2: convert and calculate the information in the DataFrame to obtain the window index, the cycle index, the ion mobility and m/z (mass-to-charge ratio) as the four dimensions of the IM-DIAT data structure, thereby forming the IM-DIAT data structure, which is saved in the IM-DIA Tensor file format. Compared with the existing DIAT data structure, an ion mobility dimension is added, because ion mobility allows peptide fragment ions that cannot be distinguished in the m/z dimension to be separated. In this step, the conversion of the information in the DataFrame is in effect a reshaping according to the structure of the DIA (data-independent acquisition) windows: the window index, cycle index, ion mobility and m/z are aligned with an initially empty four-dimensional MS2 TOF intensity tensor. The file format finally generated is a four-dimensional dynamic data format. Reshaping consists of binning the data from the raw mass spectrometry file according to the mapping between the four IM-DIAT dimensions, which yields the indices of the IM-DIAT data structure, namely the window index, cycle index, ion mobility and mass-to-charge ratio.
In this embodiment, the dia-PASEF window information, namely 'WindowGroup' (window group), 'ScanNumBegin' (start scan number) and 'ScanNumEnd' (end scan number), is easily obtained with the TimsPyDIA function bundled with TimsPy3. The TimsPyDF function bundled with TimsPy3 is used to extract the data of all scans through the columns 'frame' (frame ID, also called sequence ID), 'scan' (scan number), 'tof' (time of flight), 'intensity' (intensity value), 'm/z' (mass-to-charge ratio), 'inv_ion_mobility' (ion mobility) and 'retention_time' (retention time).
The window index can therefore be calculated from the principle of dia-PASEF. Using the window group and the start and end scan numbers obtained from TimsPyDIA, the number of windows contained in each scan and the number of windows contained in one window group are derived. For example, in the public COVID-19 raw data set, each PASEF scan covers four windows and there are 16 window groups, giving 64 windows in total. The start and end IDs of each window in each window group are recorded. The window group containing a data point is obtained as the remainder of dividing its frame ID by the total number of window groups, and the window containing the data point is then located from its scan number, which finally yields the window index.
The cycle index is obtained by dividing the frame ID by the number of window groups and taking the integer part.
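The index arithmetic described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: the function name `locate`, the layout of `group_windows`, the half-open scan ranges and the flat window numbering (group times windows per group plus position) are assumptions, and the sketch ignores MS1 frames and any offset in frame numbering that a real reader would have to handle.

```python
def locate(frame_id, scan, n_window_groups, group_windows):
    """Map a (frame ID, scan number) pair to (cycle index, window index).

    group_windows[g] is a list of (scan_begin, scan_end) ranges, one per
    window of window group g. The flat window numbering used here is an
    assumption made for illustration.
    """
    cycle_index = frame_id // n_window_groups   # integer part of the quotient
    group = frame_id % n_window_groups          # remainder selects the group
    windows = group_windows[group]
    for pos, (begin, end) in enumerate(windows):
        if begin <= scan < end:
            return cycle_index, group * len(windows) + pos
    return cycle_index, None  # scan falls outside every window
```

With 16 window groups of four windows each, the window index runs from 0 to 63, matching the 64 windows of the example data set.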
In this embodiment, the ion mobility and m/z (mass-to-charge ratio) values must be binned after extraction, that is, their continuous values are discretized into integer subscripts, with precisions of 0.5 and 0.1 respectively by default. The binning precision can be customized by the user. If more of the original information needs to be retained, the bin width can be set sufficiently small, but the conversion time and the file size then increase accordingly. After compression, the IM-DIAT data can be as small as 3.66% of the original file and most commonly about 10%, which leaves the user room to choose a higher precision and retain more of the original information.
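As a rough illustration of the binning step, the sketch below discretizes continuous values into integer subscripts and accumulates intensities under the four IM-DIAT indices. The names `bin_index` and `build_sparse_tensor`, the lower bounds, the dictionary storage and the choice to sum intensities that fall into the same bin are assumptions made for illustration; the source does not state how collisions are combined.

```python
import math
from collections import defaultdict

def bin_index(value, lower, width):
    """Discretize a continuous value into an integer bin subscript."""
    return int(math.floor((value - lower) / width))

def build_sparse_tensor(peaks, mz_lower, mz_width, im_lower, im_width):
    """Accumulate intensities keyed by (window, cycle, IM bin, m/z bin).

    peaks: iterable of (window_index, cycle_index, ion_mobility, mz, intensity).
    Intensities landing in the same bin are summed (an assumption).
    """
    tensor = defaultdict(float)
    for window, cycle, im, mz, intensity in peaks:
        key = (window, cycle,
               bin_index(im, im_lower, im_width),
               bin_index(mz, mz_lower, mz_width))
        tensor[key] += intensity
    return dict(tensor)
```

A smaller bin width keeps more of the original resolution but produces more distinct keys, which is the trade-off between fidelity and file size described above.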
The IM-DIAT data structure is shown schematically in fig. 2. Panel A of fig. 2 shows TimsPy3 reading and transforming the coordinates, that is, the operations of the above steps, where Cycle index is the cycle index, Window index is the window index, and Binned m/z and Binned IM are the values obtained by binning m/z (mass-to-charge ratio) and ion mobility respectively; IM is the abbreviation of ion mobility. In FIG. 1, B is the file format of the resulting IM-DIA Tensor.
In summary, the whole workflow is end-to-end and requires no protein identification. Because the data are converted directly from the raw mass spectrometry data, that is, the useful data are extracted and arranged from the mass spectrometer's raw data file, the effective information content of the raw file is preserved. The data are read as a four-dimensional tensor with no restriction on reading order, which greatly improves the convenience and speed of data access. The prior art uses the MSConvert tool in the ProteoWizard software package to convert raw mass spectrometry files into mzXML files, which cannot be used for the new dia-PASEF acquisition mode and does not retain the important ion mobility information. The present scheme instead uses the Bruker proprietary library in opentims_bruker_bridge, bundled with the existing Python toolkit TimsPy3, to extract the necessary attributes of the mass spectrum information conveniently while retaining the ion mobility, and is therefore suitable for the new dia-PASEF acquisition mode.
Fig. 1 presents the full workflow of the present application. The entire workflow of IM-DIAT file generation is end-to-end and the order of reading is not limited. It starts from the raw file and finally generates the IM-DIA Tensor format, producing a data file with the suffix diat and visualization pictures ordered by window; the visualization, pooling and compression steps are described in Example 1.
The IM-DIAT data structure is stored in the IM-DIA Tensor file format and needs to be compressed. By default, IM-DIAT uses numpy as the carrier and compresses the data with numpy's savez_compressed. The IM-DIAT data format reduces the original file size by more than a factor of ten.
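The default compression path can be sketched with numpy's `savez_compressed`. The helper names `save_tensor` and `load_tensor`, the archive key `intensity` and the use of an in-memory buffer instead of a file on disk are assumptions made to keep the example self-contained; the actual on-disk layout of the diat file is not specified in the text.

```python
import io
import numpy as np

def save_tensor(tensor):
    """Serialize an array with numpy's savez_compressed into bytes."""
    buffer = io.BytesIO()
    np.savez_compressed(buffer, intensity=tensor)
    return buffer.getvalue()

def load_tensor(blob):
    """Restore the array stored under the 'intensity' key."""
    with np.load(io.BytesIO(blob)) as archive:
        return archive["intensity"]
```

Because most bins of a four-dimensional intensity tensor are empty, the zip-based compression of the `.npz` container shrinks the data considerably without losing information.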
For this purpose, fig. 2 compares the compression modes: the size of the raw files and of the IM-DIAT Tensor files generated with different m/z bin widths and an ion mobility bin width of 0.1, using 37 files from the public COVID-19 data set. The raw files average 12 GB, and the IM-DIAT files with 0.5 m/z bins average 450 MB.
Two other compression formats are also available. One is the Hierarchical Data Format (HDF), a file format designed specifically for storing and managing large data sets. It was originally developed by the National Center for Supercomputing Applications (NCSA) and is now maintained by the non-profit HDF Group. Here the File object of hdf5 is used to store the numpy data, with the compression option enabled and the compression level set to the highest. The resulting file is smaller than the default IM-DIAT format file, with no loss of data.
The other is scipy, which can also store matrices, passed in dictionary form, and provides compression. Experiments show that the resulting file with the suffix mat is 13% smaller than the original IM-DIAT format file.
To represent the distribution of signal intensity more intuitively, the IM-DIAT data format needs to be visualized.
Referring to fig. 4, the cycle index is the abscissa of the picture and the binned m/z (mass-to-charge ratio) is the ordinate, with an m/z bin width of 0.5. Signal intensity is mapped from low to high onto colors from green to red; because of picture limitations, only the shading is visible in the figure. The data are those of one window index of the file hectic-ZR-DIA-1_Slot1-19_1_1096.d in the public COVID-19 data set. In the original of fig. 4, the lowest intensity percentile is filled with pure green, the highest intensity percentile with pure red, and zero values with black. The color legend is shown at the upper right of fig. 4: the horizontal bar changes gradually from pure green to pure yellow, which is a change in the R value, and from the junction of the horizontal and vertical bars, where the color is pure yellow, the vertical bar changes gradually from pure yellow to pure red, which is a change in the G value. Pure green thus marks the lowest intensity and pure red the highest.
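The color mapping described above can be written as a small function. This is an illustrative sketch: the name `intensity_to_rgb`, the linear interpolation between the `low` and `high` anchor values (standing for the lowest and highest intensity percentiles) and the placement of pure yellow at the midpoint are assumptions, since the text does not give the exact interpolation.

```python
def intensity_to_rgb(value, low, high):
    """Map an intensity to (R, G, B): black for 0, then green, yellow, red."""
    if value == 0:
        return (0, 0, 0)                      # empty bins are drawn black
    if high <= low:
        return (0, 255, 0)                    # degenerate range: lowest colour
    t = min(max((value - low) / (high - low), 0.0), 1.0)
    if t <= 0.5:
        return (round(510 * t), 255, 0)       # green to yellow: R rises
    return (255, round(510 * (1.0 - t)), 0)   # yellow to red: G falls
```

Clamping `t` to the range 0 to 1 means that values below the low anchor are drawn pure green and values above the high anchor pure red.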
And selecting a part of the visual pictures with the granularity drawn, selecting a cycle index (loop index) from 2000 to 3000, selecting an m/z index (mass-to-charge ratio index) after a binning operation from 80000 to 81000, and selecting a value of m/z from 894.99 to 904.99. The binning size of m/z dimension in fine-grained pictures is 0.01, and it is obvious from the result that a plurality of bright and dark stripes are formed after binning operation of m/z.
As shown in fig. 7, the above steps correspond to extracting necessary attributes of mass spectrum information in the original mass spectrum file; and converting and calculating the necessary attributes to obtain a window index, a cycle index, ion mobility and a mass-to-charge ratio, and respectively corresponding to the signal kurtosis one by one to serve as four dimensions of an IM-DIAT data structure to form an IM-DIAT data structure or an IM-DIAT data file.
Although the data format parsed by the above steps could in principle be fed directly into deep learning, it is not well suited to deep learning training and testing, and further data transformation is still required.
Therefore, the IM-DIAT data structure needs to be converted through image processing into two-dimensional multi-channel image data that deep learning can use, and the two-dimensional multi-channel image data is then used as training data for deep learning to obtain a classification result. The two-dimensional multi-channel image data is suited to two-dimensional convolution: the window and the ion mobility, which have no spatial relationship, are converted into channels, while the mass-to-charge ratio and the cycle index, which are continuous variables, become the two dimensions of the image (adjacent mass-to-charge ratios and cycle indices are highly correlated, which matches the high correlation between neighboring pixels in an image and gives rise to recognizable patterns). Two-dimensional convolution is commonly used in computer vision and image processing (in video processing, for example, a CNN identifies each frame separately and the time dimension is not considered). The converted data can therefore be applied directly to deep learning.
Because the data format of a traditional DIA mass spectrum original file is encrypted and relatively disordered, the scheme innovatively makes use of the one-to-one correspondence between (window, mass-to-charge ratio, ion mobility, cycle index) and signal kurtosis (intensity) in mass spectrum data, arranges the mass spectrum data into 4D coordinate data, and labels each data item in advance to form an IM-DIAT data structure or an IM-DIAT data file for later use in training a neural network. Here, labeling means attaching a label to each data item so that a label is available during deep learning training, which is a conventional operation in deep learning.
For this purpose, the IM-DIAT data structure first needs to be preprocessed. There is no continuous relationship between the windows of mass spectrum data or between ion mobility values, i.e. the signal kurtosis (intensity) of adjacent windows or of adjacent ion mobilities is uncorrelated, whereas the mass-to-charge ratio and the cycle index are continuous variables. We therefore convert the 4D data (window, mass-to-charge ratio, ion mobility, cycle index) of the IM-DIAT data structure into two-dimensional multi-channel image data that can be used for deep learning. The number of channels is C = 64 (windows) × 100 (ion mobility) = 6400, and the mass-to-charge ratio and the cycle index are the horizontal and vertical coordinates of the image (width W and height H). The ranges of width and height differ between mass spectrum data sets (depending on the mass spectrometer), so the preprocessing step applies pooling: library functions in PyTorch down-sample the original 6400 × W × H data to a 6400 × 256 × 256 data format. Data augmentation is performed in the same step: max pooling, average pooling and min pooling are carried out with nn.AdaptiveMaxPool2d((256, 256)), nn.AdaptiveAvgPool2d((256, 256)) and a custom-built AdaptiveMinPool2d((256, 256)), which both unifies the dimensions and reduces them.
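The pooling step can be illustrated with a small pure-Python sketch of adaptive pooling on a single channel. It is not the PyTorch implementation used in the scheme, and the construction of the custom AdaptiveMinPool2d is not given in the text; the window boundaries below follow the usual adaptive-pooling convention (floor of the start, ceiling of the end), which is an assumption here. In PyTorch itself, a min pooling with the same windows could plausibly be obtained by negating the input and the output of nn.AdaptiveMaxPool2d.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def adaptive_pool2d(image: Sequence[Sequence[float]], out_h: int, out_w: int,
                    reduce: Callable[[List[float]], float]) -> Matrix:
    """Pool one channel to out_h x out_w regardless of its input size.

    Output cell (i, j) covers rows floor(i*H/out_h) up to ceil((i+1)*H/out_h)
    and the analogous column range, so every input pixel is used.
    """
    in_h, in_w = len(image), len(image[0])
    pooled: Matrix = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = ((i + 1) * in_h + out_h - 1) // out_h  # integer ceiling
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = ((j + 1) * in_w + out_w - 1) // out_w
            window = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Toy 4 x 4 channel, pooled to 2 x 2 in the three ways named in the text.
channel = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0],
           [9.0, 10.0, 11.0, 12.0],
           [13.0, 14.0, 15.0, 16.0]]
max_pooled = adaptive_pool2d(channel, 2, 2, max)
avg_pooled = adaptive_pool2d(channel, 2, 2, mean)
min_pooled = adaptive_pool2d(channel, 2, 2, min)
```

Because the output size is fixed and the window size adapts to the input, data with different W and H all end up with the same spatial dimensions, which is the purpose of the preprocessing step.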
Here, the original data is converted from the IM-DIAT data. PyTorch is an open-source Python machine learning library based on Torch, used for applications such as natural language processing. The 4D data of IM-DIAT is window × mass-to-charge ratio × ion mobility × cycle index, and the two-dimensional multi-channel image data usable for deep learning is N × C × H × W, where N is the batch size, i.e. the number of data items input at each training step (N = 1 at test time), C is the number of channels, here C = window × ion mobility, and H and W are the height and width of the picture in the conventional sense; after conversion, H = mass-to-charge ratio and W = cycle index.
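The mapping from the 4D coordinates to a C × H × W image might look like the sketch below. The window-major channel ordering, the function names and the use of already-binned integer indices are assumptions; the text only fixes C = 64 × 100 = 6400, H = mass-to-charge ratio and W = cycle index.

```python
from typing import Iterable, List, Tuple

NUM_WINDOWS = 64     # from the text
NUM_ION_BINS = 100   # from the text
NUM_CHANNELS = NUM_WINDOWS * NUM_ION_BINS  # C = 6400

# (window, ion mobility bin, m/z index, cycle index, intensity)
Point = Tuple[int, int, int, int, float]


def channel_index(window: int, ion_bin: int, num_ion_bins: int = NUM_ION_BINS) -> int:
    """Flatten (window, ion mobility bin) into one channel index.

    Window-major ordering is an assumption of this sketch.
    """
    if window < 0 or not 0 <= ion_bin < num_ion_bins:
        raise ValueError("window or ion mobility bin out of range")
    return window * num_ion_bins + ion_bin


def to_chw(points: Iterable[Point], height: int, width: int,
           num_windows: int = NUM_WINDOWS,
           num_ion_bins: int = NUM_ION_BINS) -> List[List[List[float]]]:
    """Scatter sparse 4D points into a dense C x H x W nested list."""
    channels = num_windows * num_ion_bins
    image = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for window, ion_bin, mz_idx, cycle_idx, intensity in points:
        c = channel_index(window, ion_bin, num_ion_bins)
        image[c][mz_idx][cycle_idx] += intensity  # H = m/z, W = cycle index
    return image


# Tiny hypothetical example: 2 windows x 3 ion mobility bins, 4 x 5 image.
tiny = to_chw([(1, 2, 3, 4, 7.0)], height=4, width=5, num_windows=2, num_ion_bins=3)
```

Stacking N such images gives the N × C × H × W batch described above.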
After the above steps, the 2D ResNet framework shown in fig. 5 can be used to classify thyroid proteomic data into two categories, benign and malignant; the method is applicable not only to thyroid but also to other tissues. During training, the data-augmented two-dimensional multi-channel image data is used, and in addition white noise is randomly added and a translation operation is applied to each channel to increase the robustness of the model. During testing, predictions are made on the max-pooled, average-pooled and min-pooled data of each test sample, and the predicted values are averaged to obtain the final predicted value. Data augmentation increases the stability of the model and makes overfitting less likely.
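The training-time augmentation and the test-time averaging described above could be sketched as follows. The noise level `sigma`, the shift range `max_shift`, the zero fill for vacated pixels, the independent shift per channel and the `model` callable are all assumptions introduced for illustration; the text does not specify them.

```python
import random
from typing import Callable, List, Sequence

Channel = List[List[float]]


def add_white_noise(channel: Channel, sigma: float, rng: random.Random) -> Channel:
    """Add zero-mean Gaussian noise to every pixel of one channel."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in channel]


def translate(channel: Channel, dy: int, dx: int, fill: float = 0.0) -> Channel:
    """Shift a channel by (dy, dx) pixels; vacated pixels are set to `fill`."""
    h, w = len(channel), len(channel[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = channel[y][x]
    return out


def augment(image: Sequence[Channel], sigma: float, max_shift: int,
            rng: random.Random) -> List[Channel]:
    """Apply noise and an independent random shift to each channel."""
    out = []
    for channel in image:
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        out.append(translate(add_white_noise(channel, sigma, rng), dy, dx))
    return out


def predict_with_pooled_views(model: Callable[[object], float],
                              views: Sequence[object]) -> float:
    """Average the predicted probabilities over the max/avg/min pooled views."""
    probabilities = [model(view) for view in views]
    return sum(probabilities) / len(probabilities)
```

At test time, `views` would hold the max-pooled, average-pooled and min-pooled versions of one sample, so the final prediction is the mean of three model outputs.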
Here, the term used for robustness in the original text is a transliteration of the English word "robust", meaning sturdy and strong. Robustness is the ability of a system to survive abnormal and dangerous conditions. For example, whether computer software avoids hanging or crashing under input errors, disk failures, network overload or intentional attacks is a measure of the software's robustness. Robustness also refers to a control system maintaining certain performance characteristics under perturbation of certain (structural, size) parameters. Depending on how performance is defined, it can be divided into stability robustness and performance robustness. A fixed controller designed with the robustness of the closed-loop system as its objective is called a robust controller.
The 2D ResNet framework is also called ResNet (Residual Neural Network), proposed by four researchers including Kaiming He of Microsoft Research. Using the ResNet unit, a 152-layer neural network was successfully trained and won the ILSVRC 2015 competition with a top-5 error rate of 3.57%, while having fewer parameters than VGGNet, which is a very prominent result. The ResNet structure greatly accelerates neural network training and considerably improves model accuracy. ResNet also transfers well to other architectures and can even be used directly within an InceptionNet network. The main idea of ResNet is to add direct connection channels to the network, following the idea of the Highway Network; this is the idea of residual learning. A traditional convolutional network or fully connected network suffers to some degree from information loss during information transmission, and also from vanishing or exploding gradients, so that very deep networks cannot be trained. ResNet alleviates this problem to a certain extent: the input information is bypassed directly to the output, which protects the integrity of the information, and the network only needs to learn the difference between input and output, which simplifies the learning objective and reduces its difficulty. The most distinctive feature of ResNet is its many bypasses that connect the input directly to later layers, a structure also known as shortcut or skip connections.
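The shortcut connection described above amounts to computing F(x) + x, where F is whatever the stacked layers learn. The following toy sketch uses plain Python lists in place of tensors, and `residual_fn` is a hypothetical stand-in for the convolutional layers of a real residual block.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x: the layers learn only the residual F, the input is bypassed."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("the shortcut requires F(x) and x to have the same shape")
    return [a + b for a, b in zip(fx, x)]
```

If the layers learn nothing (F(x) = 0), the block reduces to the identity, which illustrates why stacking such blocks does not have to degrade the signal passed to deeper layers.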
Therefore, the model trained through the above steps can cope with the problem that, because of the high dimensionality of the data, the number of available samples is comparatively small.
Finally, the parameters of the network are updated with an Adam-based gradient descent method, with an initial learning rate of 0.05 and Adam betas of (0.95, 0.9995). The neural network model obtained through the above steps outputs a prediction probability, and the BCE loss (binary cross-entropy loss) between this probability and the label (the true class) is used as the loss function. The error gradient is computed in the course of minimizing the loss function, and the network is updated through back propagation. After network training is finished, the classification result is obtained from the final predicted probability value. Table 1 below lists the image size at each step of the deep learning framework.
[Table 1 is reproduced as images in the original publication.]
TABLE 1
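The BCE loss and the Adam update described before Table 1 might be sketched for a plain list of scalar parameters as follows. The learning rate 0.05 and betas (0.95, 0.9995) come from the text; the `eps` values, the function names and the list-based representation are assumptions, and in practice these operations would be delegated to the deep learning library.

```python
import math
from typing import List, Sequence, Tuple


def bce_loss(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumption
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def adam_step(params: Sequence[float], grads: Sequence[float],
              m: Sequence[float], v: Sequence[float], t: int,
              lr: float = 0.05, betas: Tuple[float, float] = (0.95, 0.9995),
              eps: float = 1e-8) -> Tuple[List[float], List[float], List[float]]:
    """One Adam update at step t >= 1; returns (new_params, new_m, new_v)."""
    beta1, beta2 = betas
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

On the first step the bias-corrected moments equal the gradient and its square, so each parameter moves by roughly the learning rate in the direction opposite to its gradient.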
A major innovation of the scheme is its universality: it can be applied to classify mass spectrum data from different instruments, different omics and different macromolecules, and the same data can be fed into different classical networks, including but not limited to the ResNet series, the Inception series and the VGG series. The model established by the scheme can also be used for interpretability research with deep learning, and the peptide fragments found in key regions can be used in the search for biomarkers. To demonstrate this, referring to fig. 8, a group of urine samples containing 19 COVID-19 and 39 non-COVID-19 samples was used, and classification learning was performed with deep network methods such as ResNet, DenseNet and MobileNetV2. Referring to fig. 9, a classification accuracy above 90% was obtained both in a single experiment and in 5-fold cross-validation. FIG. 8 is a schematic diagram of applying the various deep network methods to the group of urine samples containing 19 COVID-19 and 39 non-COVID-19 samples, and FIG. 9 is a graph of the classification accuracy of the various deep network methods.
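A 5-fold split of the 58 urine samples could be organized as in the sketch below. Whether the original experiment stratified the folds by class is not stated, so the stratification, the random seed and the function name are assumptions made purely for illustration.

```python
import random
from typing import List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal samples out round-robin
    return folds


# 19 positive and 39 negative samples, matching the counts in the text.
labels = [1] * 19 + [0] * 39
folds = stratified_folds(labels)
```

Each fold then serves once as the test set while the remaining four folds are used for training.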
Example 2
Referring to fig. 6, the application of the IM-DIAT data structure comprises a front end for inputting DIA data and a back end that executes the above-described application method based on the IM-DIAT data structure and outputs the result to the front end.
Example 3
A computer program product comprising software code portions for performing the above IM-DIAT data structure-based application method when the computer program product is run on a computer.
Example 4
An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described IM-DIAT data structure-based application method. Examples include computers and cell phones.
Example 5
A non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the above IM-DIAT data structure-based application method. Examples include a USB drive and a removable hard disk.
Matters not described in detail in the present invention belong to the prior art and are therefore not elaborated here.
The computer system of the server for implementing the method of the embodiment of the present invention includes a central processing unit (CPU) that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) or a program loaded from a storage section into a Random Access Memory (RAM). In the RAM, various programs and data necessary for system operation are also stored. The CPU, ROM, and RAM are connected to each other via a bus. An input/output (I/O) interface is also connected to the bus.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card, a modem, or the like. The communication section performs communication processing via a network such as the internet. A drive is also connected to the I/O interface as needed. A removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive as necessary, so that a computer program read out therefrom is installed into the storage section as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program performs the above-described functions defined in the system of the present invention when executed by a Central Processing Unit (CPU).
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. Each block of the block diagrams or flowchart illustrations, and combinations of blocks in the block diagrams or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, and the described modules may also be disposed in a processor.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the process steps corresponding to the following method.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. The application method based on the IM-DIAT data structure is characterized by comprising the following steps:
extracting necessary attributes of mass spectrum information in the mass spectrum original file;
converting and calculating the necessary attributes to obtain a window index, a cycle index, ion mobility and a mass-to-charge ratio, which serve as the four dimensions of the IM-DIAT data structure and each correspond one-to-one to a signal kurtosis, to form the IM-DIAT data structure;
converting the IM-DIAT data structure into two-dimensional multi-channel image data which can be utilized by deep learning through image processing;
and taking the two-dimensional multi-channel image data as training data for deep learning to obtain a classification result.
2. The IM-DIAT data structure-based application method according to claim 1, wherein the IM-DIAT data structure is based on proteomics quantification by a TimsTOF Pro mass spectrometer.
3. The IM-DIAT data structure-based application method of claim 1, wherein the image processing conversion comprises the specific steps of:
down-sampling and data augmentation are performed on the data of the IM-DIAT data structure;
performing max pooling, average pooling, and min pooling operations on the IM-DIAT data structure after data augmentation.
4. The IM-DIAT data structure-based application method of claim 3, wherein deep learning is trained using data augmented two-dimensional multi-channel image data;
randomly adding white noise and performing translation operation on each channel;
and respectively predicting the maximum pooling data, the average pooling data and the minimum pooling data of each data, and taking the average value as a final predicted value to obtain a classification result, wherein each data is the data of the two-dimensional multi-channel image data after training and translation operations.
5. The IM-DIAT data structure-based application method of claim 4, wherein parameters of the deep-learned neural network model are updated based on Adam's gradient descent method;
obtaining prediction data based on the neural network model and calculating two-class cross entropy loss between the prediction data and the real class as a loss function;
calculating an error gradient by minimizing the loss function and updating the gradient of the neural network model by back propagation;
and obtaining a classification result by using the final predicted value after the steps are completed.
6. Use of an IM-DIAT data structure, comprising a front end for inputting DIA data and a back end for performing the IM-DIAT data structure based application method of any of claims 1-5 for output to the front end.
7. A computer program product, comprising software code portions for performing the IM-DIAT data structure based application method of any of claims 1-5 when said computer program product is run on a computer.
8. An electronic device, characterized by at least one processor; a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the IM-DIAT data structure based application method of any of claims 1-5.
9. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the IM-DIAT data structure based application method of any of claims 1-5.
CN202111243593.0A 2021-10-25 2021-10-25 Application method based on IM-DIAT data structure and application thereof Pending CN113990387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111243593.0A CN113990387A (en) 2021-10-25 2021-10-25 Application method based on IM-DIAT data structure and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111243593.0A CN113990387A (en) 2021-10-25 2021-10-25 Application method based on IM-DIAT data structure and application thereof

Publications (1)

Publication Number Publication Date
CN113990387A true CN113990387A (en) 2022-01-28

Family

ID=79741213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111243593.0A Pending CN113990387A (en) 2021-10-25 2021-10-25 Application method based on IM-DIAT data structure and application thereof

Country Status (1)

Country Link
CN (1) CN113990387A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034017A (en) * 2023-09-07 2023-11-10 云鉴康(杭州)医疗科技有限公司 Mass spectrogram classification method, system, medium and equipment based on deep learning
CN117034017B (en) * 2023-09-07 2024-03-19 云鉴康(杭州)医疗科技有限公司 Mass spectrogram classification method, system, medium and equipment based on deep learning
CN117972757A (en) * 2024-03-25 2024-05-03 贵州大学 Method and system for realizing safety analysis of mine data based on cloud platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination