WO2021174901A1 - Implementation method of molecular omics data structure based on data independent acquisition mass spectra - Google Patents

Implementation method of molecular omics data structure based on data independent acquisition mass spectra

Info

Publication number
WO2021174901A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
mass
mass spectrum
diat
charge ratio
Prior art date
Application number
PCT/CN2020/127823
Other languages
English (en)
French (fr)
Inventor
郭天南
栾钟治
李子青
张芳菲
禹韶阳
臧泽林
Original Assignee
西湖大学
Priority date
Filing date
Publication date
Application filed by 西湖大学
Priority to US17/597,648 (US20220284989A1)
Publication of WO2021174901A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/20 Identification of molecular entities, parts thereof or of chemical compositions
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N27/00 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means
    • G01N27/62 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/80 Data visualisation
    • H ELECTRICITY
    • H01 ELECTRIC ELEMENTS
    • H01J ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
    • H01J49/00 Particle spectrometers or separator tubes
    • H01J49/0027 Methods for using particle spectrometers
    • H01J49/0036 Step by step routines describing the handling of the data generated during a measurement

Definitions

  • The invention relates to the technical field of biomolecular omics mass spectrometry data, and in particular to a method for implementing a molecular omics data structure based on data-independent acquisition (DIA) mass spectrometry.
  • MS: mass spectrum
  • LC: liquid chromatography
  • MS/MS: tandem mass spectrometry
  • Data-dependent acquisition (DDA) relies on the intensity of the precursor ions in the MS1 spectrum of the sample. Selecting precursor ions for MS2 fragmentation by intensity rank introduces a degree of randomness, so identification reproducibility is relatively low.
  • Targeted monitoring (selected reaction monitoring, SRM) is a targeted method that can accurately analyze a set of predefined molecules, but its throughput is limited to a few hundred targets.
  • DIA: data-independent acquisition
  • DIA is a comprehensive ("holographic") data-independent acquisition quantitative technique. It divides the full scan range of the mass spectrometer into several windows and rapidly and cyclically selects, fragments, and detects all ions in each window, so that fragment information for all ions in the sample is obtained without omission or bias.
  • No target molecule needs to be specified, the scan points are evenly spaced, a spectral library suffices for qualitative confirmation and quantitative ion screening, and the data can be re-interrogated retrospectively.
  • For example, SWATH (sequential window acquisition of all theoretical mass spectra) divides the MS1 range into a series of adjacent precursor ion selection windows of 25 m/z or larger.
  • Within each window, every precursor ion is fragmented together with all other precursor ions, and the resulting multiplexed fragment ion spectra from that window are recorded simultaneously.
  • Fragment ions falling into the same precursor ion window are recorded systematically and without bias, which overcomes the randomness of precursor ion selection in DDA mode while retaining the high accuracy of the targeted method.
  • DIA mass spectrometry can reproducibly cover low-abundance molecules, producing a permanent digital map that represents all measurable molecular signals and serves as an electronic archive of biomolecular omics.
  • Most mass spectrometer manufacturers use proprietary data formats, such as the raw format of Thermo Fisher, the wiff format of Sciex, and the baf format of Bruker. Open conversion formats such as mzXML, mzML, and mz5 are available, but they generally suffer from low storage efficiency.
  • XML: Extensible Markup Language
  • XML-based file formats such as mzXML and mzML convert the data into human-readable text and cannot store binary data directly, so the converted files are significantly larger. In addition, XML files must be read sequentially, whereas mass spectrometry data analysis requires non-sequential access, which leads to low input/output (I/O) rates.
  • Although the mz5 format is an efficient data management and storage format based on HDF5 (Hierarchical Data Format version 5), it still retains the ontology of the mzML file content, not all of which is needed for DIA data analysis.
  • Existing DIA analysis software, such as OpenSWATH, Skyline, Spectronaut, and PeakView, can identify and quantify biomolecules.
  • However, these programs are difficult to operate and are costly in time and computation.
  • They also use only part of the MS2 spectra for peak-group inference, which produces unpredictable effects (for example, unavoidable missing values) that affect downstream statistical classification analysis.
  • Existing mass spectrometry data structures are therefore no longer suitable for storing and analyzing the large-scale data generated by the new data-independent acquisition methods.
  • To address the problems in the prior art, the present invention provides a biomolecular omics mass spectrometry data structure based on data-independent acquisition mass spectrometry and a method for implementing it.
  • The technical solution of the present invention is as follows.
  • A molecular omics data structure based on data-independent acquisition mass spectrometry is DIAT tensor data generated from raw mass spectrometry data. The DIAT tensor data has three dimensions: the first is the cycle index, the second is the pooled fragment ion mass-to-charge ratio, and the third is the index of the precursor ion window corresponding to the fragment ions.
  • A method for implementing the molecular omics data structure based on data-independent acquisition mass spectrometry comprises the following steps:
  • Step A: Convert the raw mass spectrometry data file into an mzXML file and, at the same time, centroid the raw data. The resulting mzXML file contains all necessary information of the MS1 and MS2 data.
  • Step B: Extract the required mass spectrometry data from the mzXML file obtained in Step A. The data contain at least the following attributes: scan level, scan index, retention time, precursor ion mass-to-charge ratio, fragment ion mass-to-charge ratio, and fragment ion intensity.
  • Step C: From the scan level and scan index of the data extracted in Step B, count the total number of cycles and assign cycle indices. At the same time, detect missing scans, fill every missing position with a 0 placeholder, and obtain the precursor ion window and cycle index for each fragment ion in the data.
  • Step D: Bin the data from Step C by fragment ion mass-to-charge ratio, summing the fragment ion intensities that fall into the same bin.
  • Step E: Reorder the data from Step D. The window index is obtained from the precursor ion mass-to-charge ratio of each MS2 spectrum, and the MS2 spectra with the same window index are arranged together in order of cycle index.
  • Step F: From the data processed in Step E, build the tensor of MS2 fragment ion intensities over three dimensions: cycle index, fragment ion mass-to-charge ratio, and index of the precursor ion window corresponding to the fragment ions.
  • As an improvement, Step G is also included: pooling is applied across dimensions to reduce the size of the tensor data, producing pooled DIAT tensor data.
  • Preferably, the pooling in Step G proceeds as follows. First, within each MS2 (second-order) window, the distribution of the non-zero values over the precursor ion mass-to-charge ratio is computed, which yields a pattern of alternating main and secondary peaks on a predefined grid. This alternating pattern is then used to pool the different mass-to-charge ratio regions: a nonlinear least-squares Gaussian fit to the peaks of the non-zero intensity distribution dynamically determines the upper and lower boundaries of each region to be merged. Finally, all grid cells without peaks are discarded and the rows of each main and secondary peak region are merged into a single row, which reduces the number of rows along the mass-to-charge ratio dimension.
  • As an improvement, after the pooled DIAT tensor data is obtained, it is rendered as a pseudo-color image for visualization.
  • As an improvement, after the pooled DIAT tensor data is obtained, the fragment ion intensities in the DIAT tensor data are converted to grayscale to serve as input for deep learning.
  • As the above description shows, the present invention has the following advantages:
  • The DIAT tensor data is converted directly from the original mass spectrometry data structure, which preserves the useful information content of the DIA data. The data is read as a three-dimensional tensor with no restriction on read order, which greatly improves the convenience and speed of reading. Stored as a DIAT file, the data is tens of times smaller than the mzXML file, which greatly reduces the storage space required for mass spectrometry data files.
  • The invention also allows DIA data to be inspected directly through images of the pooled DIAT file and allows computer vision algorithms to be applied directly to the DIAT tensor. This avoids the computationally expensive extraction of ion chromatograms (XIC), and a deep learning model for clinical sample classification can be built directly from files in this format.
  • XIC: extracted ion chromatogram
  • Figure 1 is a flow chart of the implementation method of the present invention;
  • Figure 2 is a schematic diagram of the raw mass spectrometry data of the present invention;
  • Figure 3 is a schematic diagram of the conversion of the raw mass spectrometry data format into DIAT tensor data;
  • Figure 4 is a schematic diagram of the cycle index of the DIAT tensor data;
  • Figure 5 is a schematic diagram of the DIAT tensor data;
  • Figure 6 compares the size of the DIAT file with the sizes of the mzXML file and the raw mass spectrometry data file;
  • Figure 7 is a schematic diagram of the pooled DIAT tensor data;
  • Figure 8 is a schematic diagram of the main and secondary peaks in the experimental data;
  • Figure 9 is a plot of the Gaussian distribution fit;
  • Figure 10 is a schematic diagram of the simulated main peaks;
  • Figure 11 is a schematic diagram of the two-dimensional visualization process;
  • Figure 12 is a schematic diagram of the grayscale result for proteomics;
  • Figure 13 is a schematic diagram of the grayscale result for metabolomics;
  • Figure 14 is a schematic diagram of the grayscale result for lipidomics.
  • As shown in Figure 1, a method for implementing the biomolecular omics mass spectrometry data structure based on data-independent acquisition mass spectrometry comprises the following specific steps:
  • Step A: Use the MSconvert tool in the ProteoWizard software package to convert the vendor's raw mass spectrometry data file into an mzXML file and, with the same tool, centroid the raw data. The resulting mzXML file contains all necessary information of the MS1 and MS2 data (Figure 2 is a schematic of the vendor's raw data file).
  • Step B: Write a read_mzxml_body function that uses the pyteomics toolkit to extract the required mass spectrometry data from the mzXML file obtained in Step A. The data contain at least the following attributes: scan level (MS level), scan index, retention time, precursor ion mass-to-charge ratio (peptide precursor m/z), fragment ion mass-to-charge ratio (fragment m/z), and fragment ion intensity.
  • Step C: Use the detect_missing_scan function to count the total number of cycles and assign cycle indices from the scan level and scan index of the data extracted in Step B (as shown in Figure 3). At the same time, detect missing scans, fill every missing position with a 0 placeholder, and obtain the precursor ion window and cycle index for each fragment ion in the data (as shown in Figure 4).
  • Step D: Use the binning function to bin the data from Step C by fragment ion mass-to-charge ratio, summing the fragment ion intensities that fall into the same bin. The bin size is set according to the mass accuracy of the particular mass spectrometer so that the overall integrity of the data is not affected.
  • Step E: The raw DIA data format is a repeating cycle of one MS1 spectrum followed by a series of MS2 spectra. The MS2 spectra within one acquisition cycle are relatively independent of each other, whereas the MS2 spectra that correspond to the same precursor ion mass-to-charge ratio in different cycles are related. The reorder_by_window function is therefore used to reorder the data from Step D: the window index is obtained from the precursor ion mass-to-charge ratio of each MS2 spectrum, and the MS2 spectra with the same window index are arranged together in order of cycle index.
  • Step F: From the data processed in Step E, generate the DIAT (Data-Independent Acquisition Tensor) tensor data.
  • The final result is a biomolecular omics mass spectrometry data structure based on data-independent acquisition mass spectrometry, as shown in Figure 5.
  • This data structure is DIAT tensor data with three dimensions: the first is the cycle index, the second is the fragment ion mass-to-charge ratio, and the third is the index of the precursor ion window corresponding to the fragment ions.
  • The DIAT tensor data is converted directly from the original mass spectrometry data structure, which preserves the useful information content of the DIA data. The data is read as a three-dimensional tensor with no restriction on read order, which greatly improves the convenience and speed of reading.
  • When the DIAT tensor data is stored as a DIAT file (in the .diat format), the file is tens of times smaller than the original mzXML file.
  • Figure 6 compares the size of the DIAT file generated from the example in Figure 2 with the sizes of the mzXML file and the raw mass spectrometry data file. The DIAT file is 30 times smaller than the raw data file and 1/60 the size of the mzXML file, which greatly reduces the storage space required for mass spectrometry data files.
  • (1) Add Step G: pooling is applied across dimensions to reduce the size of the tensor data, producing pooled DIAT tensor data (Figure 7 is a schematic of three-dimensional DIAT tensor data with the main and secondary peaks marked). The pooling can proceed as follows. First, within each MS2 (second-order) window, the distribution of the non-zero values over the precursor ion mass-to-charge ratio is computed, which yields a pattern of alternating main and secondary peaks on a predefined grid (as shown in Figure 8). This alternating pattern is then used to pool the different mass-to-charge ratio regions: a nonlinear least-squares Gaussian fit to the peaks of the non-zero intensity distribution dynamically determines the upper and lower boundaries of each region to be merged (as shown in Figure 9). Finally, the pooling_mz_peaks_by_window function discards all grid cells without peaks and merges the rows of each main and secondary peak region into a single row, which reduces the number of rows along the mass-to-charge ratio dimension by a factor of 50.
  • (3) After the pooled DIAT tensor data is obtained, the draw_diat function converts the fragment ion intensities in the DIAT tensor data to grayscale to serve as input for subsequent deep learning.
  • The grayscale method is, for example, as follows. Percentiles are used to discretize the non-zero intensity values into equal-frequency intervals, and each interval is assigned a color. The range 0 to 100 is divided into 256 equally spaced values, and these 256 floating-point numbers are passed to a percentile function to compute the 256 corresponding values of the non-zero intensities. These 256 values delimit 255 intervals, each with its own color, and the interval values run from 1 to 255.
  • Figures 12 to 14 show the grayscale results obtained for proteomics, metabolomics, and lipidomics, respectively.
  • In summary, the present invention has the following advantages:
  • The DIAT tensor data is converted directly from the original mass spectrometry data structure, which preserves the useful information content of the DIA data. The data is read as a three-dimensional tensor with no restriction on read order, which greatly improves the convenience and speed of reading. Stored as a DIAT file, the data is tens of times smaller than the mzXML file, which greatly reduces the storage space required for mass spectrometry data files.
  • The invention also allows DIA data to be inspected directly through images of the pooled DIAT file and allows computer vision algorithms to be applied directly to the DIAT tensor. This avoids the computationally expensive extraction of ion chromatograms (XIC), and a deep learning model for clinical sample classification can be built directly from files in this format.

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Physics & Mathematics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Electrochemistry (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • General Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Bioethics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Biophysics (AREA)
  • Signal Processing (AREA)
  • Molecular Biology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

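The present invention relates to the technical field of biomolecular omics mass spectrometry data, and in particular to a method for implementing a molecular omics data structure based on data-independent acquisition mass spectrometry. The mass spectrometry data structure is DIAT tensor data generated from raw mass spectrometry data and has three dimensions: the first is the cycle index, the second is the fragment ion mass-to-charge ratio, and the third is the index of the precursor ion window corresponding to the fragment ions. The DIAT tensor data described here is highly complete and is convenient and fast to read, and a DIAT file is tens of times smaller than the corresponding mzXML file. Images of the pooled DIAT file allow DIA mass spectrometry data to be inspected directly, and computer vision algorithms can be applied directly to the DIAT tensor, which avoids the computationally expensive extraction of ion chromatograms. A deep learning model for clinical sample classification can also be built directly from this file.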

Description

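Implementation method of molecular omics data structure based on data independent acquisition mass spectra
Technical Field
The present invention relates to the technical field of biomolecular omics mass spectrometry data, and in particular to a method for implementing a molecular omics data structure based on data-independent acquisition mass spectrometry.
Background
Omics based on mass spectrometry (MS) has been developing for decades and can now profile thousands of biomolecules in complex biological samples within hours. Biomolecules are separated by liquid chromatography (LC) and then identified and quantified from tandem mass spectrometry (MS/MS) fragment ion spectra. Applications include proteomics, metabolomics, and lipidomics.
Mass spectrometry-based omics currently uses the following acquisition modes:
1. Data-dependent acquisition (DDA): DDA relies on the intensity of the precursor ions in the MS1 spectrum of the sample. Selecting precursor ions for MS2 fragmentation by intensity rank introduces a degree of randomness, so identification reproducibility is relatively low;
2. Targeted monitoring (selected reaction monitoring, SRM): this targeted method can accurately analyze a set of predefined molecules, but its throughput is limited to a few hundred targets;
3. Data-independent acquisition (DIA): DIA is a comprehensive ("holographic") data-independent acquisition quantitative technique. It divides the full scan range of the mass spectrometer into several windows and rapidly and cyclically selects, fragments, and detects all ions in each window, so that fragment information for all ions in the sample is obtained without omission or bias. No target molecule needs to be specified, the scan points are evenly spaced, a spectral library suffices for qualitative confirmation and quantitative ion screening, and the data can be re-interrogated retrospectively. For example, SWATH (sequential window acquisition of all theoretical mass spectra) divides the MS1 range into a series of adjacent precursor ion selection windows of 25 m/z or larger. Within each window, every precursor ion is fragmented together with all other precursor ions, and the resulting multiplexed fragment ion spectra from that window are recorded simultaneously. Fragment ions falling into the same precursor ion window are thus recorded systematically and without bias, which overcomes the randomness of precursor ion selection in DDA mode while retaining the high accuracy of the targeted method. DIA mass spectrometry can reproducibly cover low-abundance molecules, producing a permanent digital map that represents all measurable molecular signals and serves as an electronic archive of biomolecular omics.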
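The window scheme described for DIA can be illustrated with a short sketch that maps a precursor m/z to the index of its isolation window. The 25 m/z width comes from the text; the 400 m/z starting point and the function name `window_index` are assumptions made for this example, and real methods may use variable-width or overlapping windows.

```python
def window_index(precursor_mz, first_window_start=400.0, window_width=25.0):
    """Return the 0-based index of the isolation window containing precursor_mz.

    first_window_start is a hypothetical value; window_width follows the
    25 m/z example in the text.
    """
    if precursor_mz < first_window_start:
        raise ValueError("precursor m/z lies below the first window")
    # Floor division gives the number of whole windows below this m/z.
    return int((precursor_mz - first_window_start) // window_width)


print(window_index(412.5))  # 0: falls in 400-425
print(window_index(425.0))  # 1: falls in 425-450
```

All precursors that map to the same index are fragmented together, which is why their fragment spectra are multiplexed.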
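In practice, most mass spectrometer manufacturers use proprietary data formats, such as the raw format of Thermo Fisher, the wiff format of Sciex, and the baf format of Bruker. Open conversion formats such as mzXML, mzML, and mz5 are available, but they generally suffer from low storage efficiency. For example, file formats based on Extensible Markup Language (XML), such as mzXML and mzML, convert the data into human-readable text and cannot store binary data directly, so the converted files are significantly larger. In addition, XML files must be read sequentially, whereas mass spectrometry data analysis requires non-sequential access, which leads to low input/output (I/O) rates. Although the mz5 format is an efficient data management and storage format based on HDF5 (Hierarchical Data Format version 5), it still retains the ontology of the mzML file content, not all of which is needed for DIA data analysis. Moreover, because DIA loses the direct link between precursor ions and fragment ions, co-eluting precursor ions are co-fragmented in the same window and produce highly complex fragment spectra. Prior information about the target molecules therefore has to be obtained from DDA, including the precursor m/z, the m/z of its fragment ions, and their relative intensities and retention times, before extracted ion chromatograms (XIC) are computed to infer the peak groups that belong to the targeted molecules. This consumes substantial computing resources and time and often distorts the data. Existing DIA analysis software, such as OpenSWATH, Skyline, Spectronaut, and PeakView, can identify and quantify biomolecules, but these programs are difficult to operate, are costly in time and computation, and use only part of the MS2 spectra for peak-group inference. This produces unpredictable effects (for example, unavoidable missing values) that affect downstream statistical classification analysis.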
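The size penalty of text-based formats can be illustrated with a small calculation. XML formats such as mzXML typically carry peak arrays as base64 text, an encoding that expands binary data by about one third before any markup is added. The array length and values below are made up for illustration.

```python
import base64
import struct

# 1000 made-up intensity values stored as 32-bit floats (4 bytes each).
values = [float(i) for i in range(1000)]
raw = struct.pack(f"<{len(values)}f", *values)

# XML cannot hold raw bytes, so they are text-encoded, here as base64.
encoded = base64.b64encode(raw)

print(len(raw))      # 4000 bytes of binary data
print(len(encoded))  # 5336 characters of base64 text
```

Base64 emits four characters for every three input bytes, so the text form is always about 33% larger than the binary it represents.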
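Therefore, existing mass spectrometry data structures are no longer suitable for storing and analyzing the large-scale data generated by the new data-independent acquisition methods.
Summary of the Invention
To address the problems in the prior art, the present invention provides a biomolecular omics mass spectrometry data structure based on data-independent acquisition mass spectrometry and a method for implementing it.
To achieve the above technical objective, the technical solution of the present invention is as follows:
1. A molecular omics data structure based on data-independent acquisition mass spectrometry, wherein the mass spectrometry data structure is DIAT tensor data generated from raw mass spectrometry data. The DIAT tensor data has three dimensions: the first is the cycle index, the second is the pooled fragment ion mass-to-charge ratio, and the third is the index of the precursor ion window corresponding to the fragment ions.
2. A method for implementing the molecular omics data structure based on data-independent acquisition mass spectrometry, comprising the following steps: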
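Step A: Convert the raw mass spectrometry data file into an mzXML file and, at the same time, centroid the raw data. The resulting mzXML file contains all necessary information of the MS1 and MS2 data;
Step B: Extract the required mass spectrometry data from the mzXML file obtained in Step A. The data contain at least the following attributes: scan level, scan index, retention time, precursor ion mass-to-charge ratio, fragment ion mass-to-charge ratio, and fragment ion intensity;
Step C: From the scan level and scan index of the data extracted in Step B, count the total number of cycles and assign cycle indices. At the same time, detect missing scans, fill every missing position with a 0 placeholder, and obtain the precursor ion window and cycle index for each fragment ion in the data;
Step D: Bin the data from Step C by fragment ion mass-to-charge ratio, summing the fragment ion intensities that fall into the same bin;
Step E: Reorder the data from Step D. The window index is obtained from the precursor ion mass-to-charge ratio of each MS2 spectrum, and the MS2 spectra with the same window index are arranged together in order of cycle index;
Step F: From the data processed in Step E, build the tensor of MS2 fragment ion intensities over three dimensions: cycle index, fragment ion mass-to-charge ratio, and index of the precursor ion window corresponding to the fragment ions.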
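As an improvement, Step G is also included: pooling is applied across dimensions to reduce the size of the tensor data, producing pooled DIAT tensor data.
Preferably, the pooling in Step G proceeds as follows. First, within each MS2 (second-order) window, the distribution of the non-zero values over the precursor ion mass-to-charge ratio is computed, which yields a pattern of alternating main and secondary peaks on a predefined grid. This alternating pattern is then used to pool the different mass-to-charge ratio regions: a nonlinear least-squares Gaussian fit to the peaks of the non-zero intensity distribution dynamically determines the upper and lower boundaries of each region to be merged. Finally, all grid cells without peaks are discarded and the rows of each main and secondary peak region are merged into a single row, which reduces the number of rows along the mass-to-charge ratio dimension.
As an improvement, the method further includes the following step: after the pooled DIAT tensor data is obtained, it is rendered as a pseudo-color image for visualization.
As an improvement, the method further includes the following step: after the pooled DIAT tensor data is obtained, the fragment ion intensities in the DIAT tensor data are converted to grayscale to serve as input for deep learning.
As the above description shows, the present invention has the following advantages:
The DIAT tensor data of the present invention is converted directly from the original mass spectrometry data structure, which preserves the useful information content of the DIA data. The data is read as a three-dimensional tensor with no restriction on read order, which greatly improves the convenience and speed of reading. Stored as a DIAT file, the data is tens of times smaller than the mzXML file, which greatly reduces the storage space required for mass spectrometry data files. The invention also allows DIA data to be inspected directly through images of the pooled DIAT file and allows computer vision algorithms to be applied directly to the DIAT tensor. This avoids the computationally expensive extraction of ion chromatograms (XIC), and a deep learning model for clinical sample classification can be built directly from files in this format. As the quality and quantity of DIA data grow, the technique has foreseeable potential in clinical diagnosis and offers an effective solution for disease subtyping.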
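Brief Description of the Drawings
Figure 1 is a flow chart of the implementation method of the present invention;
Figure 2 is a schematic diagram of the raw mass spectrometry data of the present invention;
Figure 3 is a schematic diagram of the conversion of the raw mass spectrometry data format into DIAT tensor data;
Figure 4 is a schematic diagram of the cycle index of the DIAT tensor data;
Figure 5 is a schematic diagram of the DIAT tensor data;
Figure 6 compares the size of the DIAT file with the sizes of the mzXML file and the raw mass spectrometry data file;
Figure 7 is a schematic diagram of the pooled DIAT tensor data;
Figure 8 is a schematic diagram of the main and secondary peaks in the experimental data;
Figure 9 is a plot of the Gaussian distribution fit;
Figure 10 is a schematic diagram of the simulated main peaks;
Figure 11 is a schematic diagram of the two-dimensional visualization process;
Figure 12 is a schematic diagram of the grayscale result for proteomics;
Figure 13 is a schematic diagram of the grayscale result for metabolomics;
Figure 14 is a schematic diagram of the grayscale result for lipidomics.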
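Detailed Description
Embodiments of the present invention are described in detail below with reference to Figures 1 to 14, without limiting the claims of the present invention in any way.
As shown in Figure 1, a method for implementing the biomolecular omics mass spectrometry data structure based on data-independent acquisition mass spectrometry comprises the following specific steps: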
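Step A: Use the MSconvert tool in the ProteoWizard software package to convert the vendor's raw mass spectrometry data file into an mzXML file and, with the same tool, centroid the raw data. The resulting mzXML file contains all necessary information of the MS1 and MS2 data (Figure 2 is a schematic of the vendor's raw data file);
Step B: Write a read_mzxml_body function that uses the pyteomics toolkit to extract the required mass spectrometry data from the mzXML file obtained in Step A. The data contain at least the following attributes: scan level (MS level), scan index, retention time, precursor ion mass-to-charge ratio (peptide precursor m/z), fragment ion mass-to-charge ratio (fragment m/z), and fragment ion intensity;
Step C: Use the detect_missing_scan function to count the total number of cycles (cycle number) and assign cycle indices (cycle index) from the scan level and scan index of the data extracted in Step B (as shown in Figure 3). At the same time, detect missing scans, fill every missing position with a 0 placeholder, and obtain the precursor ion window and cycle index for each fragment ion in the data (as shown in Figure 4);
Step D: Use the binning function to bin the data from Step C by fragment ion mass-to-charge ratio, summing the fragment ion intensities that fall into the same bin. The bin size is set according to the mass accuracy of the particular mass spectrometer so that the overall integrity of the data is not affected;
Step E: The raw DIA data format is a repeating cycle of one MS1 spectrum followed by a series of MS2 spectra. The MS2 spectra within one acquisition cycle are relatively independent of each other, whereas the MS2 spectra that correspond to the same precursor ion mass-to-charge ratio in different cycles are related. The reorder_by_window function is therefore used to reorder the data from Step D: the window index is obtained from the precursor ion mass-to-charge ratio of each MS2 spectrum, and the MS2 spectra with the same window index are arranged together in order of cycle index;
Step F: From the data processed in Step E, build the tensor of MS2 fragment ion intensities over three dimensions (cycle index, fragment ion mass-to-charge ratio, and index of the precursor ion window corresponding to the fragment ions) to generate the DIAT (Data-Independent Acquisition Tensor) tensor data.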
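Steps C to F can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patent's code: the scan dictionaries, the helpers `assign_cycles` and `build_diat`, the window start values, and the bin width are all hypothetical. A zero-initialized tensor plays the role of the 0 placeholders for missing scans, and indexing by window along the third axis has the same effect as the reordering in Step E.

```python
import math
from bisect import bisect_right


def assign_cycles(scans):
    """Yield (cycle_index, scan) for each MS2 scan; every MS1 scan opens a new cycle."""
    cycle = -1
    for scan in scans:
        if scan["ms_level"] == 1:
            cycle += 1
        elif cycle >= 0:
            yield cycle, scan


def build_diat(scans, window_starts, mz_min, mz_max, bin_width):
    """Return tensor[cycle][mz_bin][window] holding summed fragment intensities."""
    n_cycles = sum(1 for s in scans if s["ms_level"] == 1)
    n_bins = math.ceil((mz_max - mz_min) / bin_width)
    n_windows = len(window_starts)
    # Zeros stand in for missing scans (the Step C placeholders).
    tensor = [[[0.0] * n_windows for _ in range(n_bins)] for _ in range(n_cycles)]
    for cycle, scan in assign_cycles(scans):
        # Step E: window index from the precursor m/z (upper window edges are
        # not checked in this sketch).
        window = bisect_right(window_starts, scan["precursor_mz"]) - 1
        if window < 0:
            continue
        for mz, intensity in scan["peaks"]:
            mz_bin = int((mz - mz_min) // bin_width)  # Step D: binning
            if 0 <= mz_bin < n_bins:
                tensor[cycle][mz_bin][window] += intensity  # sum within a bin
    return tensor


# Toy data: two cycles, two windows; the second window is missing in cycle 1.
scans = [
    {"ms_level": 1},
    {"ms_level": 2, "precursor_mz": 412.5, "peaks": [(150.2, 10.0), (150.7, 5.0)]},
    {"ms_level": 2, "precursor_mz": 437.5, "peaks": [(300.1, 7.0)]},
    {"ms_level": 1},
    {"ms_level": 2, "precursor_mz": 412.5, "peaks": [(150.4, 2.0)]},
]
tensor = build_diat(scans, window_starts=[400.0, 425.0],
                    mz_min=100.0, mz_max=400.0, bin_width=1.0)

# Intensity trace of one m/z bin in window 0 across cycles.
trace = [tensor[c][50][0] for c in range(len(tensor))]
```

With the tensor in this shape, the trace of any m/z bin in any window across cycles is a simple slice, with no need to scan the file sequentially.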
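The above method yields a biomolecular omics mass spectrometry data structure based on data-independent acquisition mass spectrometry, as shown in Figure 5. This data structure is DIAT tensor data with three dimensions: the first is the cycle index, the second is the fragment ion mass-to-charge ratio, and the third is the index of the precursor ion window corresponding to the fragment ions. The DIAT tensor data is converted directly from the original mass spectrometry data structure, which preserves the useful information content of the DIA data. The data is read as a three-dimensional tensor with no restriction on read order, which greatly improves the convenience and speed of reading. When the DIAT (Data-Independent Acquisition Tensor) tensor data is stored as a DIAT file (in the .diat format), the file is tens of times smaller than the original mzXML file. Figure 6 compares the size of the DIAT file generated from the example in Figure 2 with the sizes of the mzXML file and the raw mass spectrometry data file. The DIAT file is 30 times smaller than the raw data file and 1/60 the size of the mzXML file, which greatly reduces the storage space required for mass spectrometry data files.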
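The text does not describe the on-disk layout of the .diat file. Purely to illustrate why a dense tensor permits reads in any order, the sketch below assumes a hypothetical row-major layout of 32-bit floats, in which the byte offset of any element follows directly from its three indices. The function names and the layout are assumptions.

```python
import io
import struct


def element_offset(cycle, mz_bin, window, n_bins, n_windows, item_size=4):
    """Byte offset of tensor[cycle][mz_bin][window] in an assumed row-major layout."""
    return ((cycle * n_bins + mz_bin) * n_windows + window) * item_size


def read_element(stream, cycle, mz_bin, window, n_bins, n_windows):
    """Seek straight to one float32 value without reading what precedes it."""
    stream.seek(element_offset(cycle, mz_bin, window, n_bins, n_windows))
    (value,) = struct.unpack("<f", stream.read(4))
    return value


# Toy tensor: 2 cycles x 3 m/z bins x 2 windows, filled with 0.0, 1.0, 2.0, ...
n_cycles, n_bins, n_windows = 2, 3, 2
flat = [float(i) for i in range(n_cycles * n_bins * n_windows)]
stream = io.BytesIO(struct.pack(f"<{len(flat)}f", *flat))

value = read_element(stream, cycle=1, mz_bin=2, window=0,
                     n_bins=n_bins, n_windows=n_windows)
print(value)  # 10.0
```

A sequentially parsed XML file offers no such shortcut, because the position of a given scan is only known after everything before it has been read or separately indexed.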
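Note that in the above method the number of cycles may differ between the mzXML files converted from one batch of raw data. The total number of cycles in each file therefore has to be counted, and the smallest cycle count in the batch, rounded down to a multiple of ten, is used as the uniform cycle count when reading the data of that batch. This keeps the number of scans consistent in subsequent data processing.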
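The batch rule above amounts to a one-line computation. The sketch below uses hypothetical helper names and made-up cycle counts.

```python
def uniform_cycle_count(cycle_counts):
    """Smallest cycle count in the batch, rounded down to a multiple of ten."""
    return (min(cycle_counts) // 10) * 10


def truncate_cycles(tensor, n_cycles):
    """Keep only the first n_cycles slices along the cycle axis."""
    return tensor[:n_cycles]


batch_counts = [1234, 1228, 1241]          # made-up cycle counts for three files
print(uniform_cycle_count(batch_counts))   # 1220
```

Truncating every tensor in the batch to this count gives all files the same first dimension.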
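After the above DIAT tensor data is obtained, the following improvements are made to the technical solution to further improve the performance of the data:
(1) Add Step G: pooling is applied across dimensions to reduce the size of the tensor data, producing pooled DIAT tensor data (Figure 7 is a schematic of three-dimensional DIAT tensor data with the main and secondary peaks marked). The pooling can proceed as follows. First, within each MS2 (second-order) window, the distribution of the non-zero values over the precursor ion mass-to-charge ratio is computed, which yields a pattern of alternating main and secondary peaks on a predefined grid (as shown in Figure 8). This alternating pattern is then used to pool the different mass-to-charge ratio regions: a nonlinear least-squares Gaussian fit to the peaks of the non-zero intensity distribution dynamically determines the upper and lower boundaries of each region to be merged (as shown in Figure 9). Finally, the pooling_mz_peaks_by_window function discards all grid cells without peaks and merges the rows of each main and secondary peak region into a single row, which reduces the number of rows along the mass-to-charge ratio dimension by a factor of 50. The alternating main and secondary peak pattern on a predefined grid can serve as the pooling rule because a simulation of the distribution of singly charged fragment ions across the entire human proteome (as shown in Figure 10) gave the same main-peak distribution pattern as the real experimental samples, while the secondary peaks can be explained as the mass-to-charge ratios of doubly charged fragment ions.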
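The boundary-finding and row-merging part of Step G can be sketched as follows. As a deliberate simplification, the Gaussian parameters are estimated here from weighted moments of the histogram rather than by the nonlinear least-squares fit named in the text. The functions `gaussian_bounds` and `pool_rows`, the width factor `k`, and the toy histogram are all hypothetical.

```python
import math


def gaussian_bounds(counts, k=2.0):
    """Estimate inclusive row bounds (lo, hi) of one peak from a histogram.

    counts[i] is the number of non-zero values in fine m/z row i. Mean and
    sigma come from weighted moments, a stand-in for a least-squares fit.
    """
    total = sum(counts)
    if total == 0:
        return None  # no peak in this grid cell: the rows would be discarded
    mean = sum(i * c for i, c in enumerate(counts)) / total
    var = sum(c * (i - mean) ** 2 for i, c in enumerate(counts)) / total
    sigma = math.sqrt(var)
    lo = max(0, math.floor(mean - k * sigma))
    hi = min(len(counts) - 1, math.ceil(mean + k * sigma))
    return lo, hi


def pool_rows(rows, bounds):
    """Merge rows lo..hi (inclusive) into one row by summing column-wise."""
    lo, hi = bounds
    return [sum(column) for column in zip(*rows[lo:hi + 1])]


# Toy histogram with one peak centred on row 4.
counts = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0]
bounds = gaussian_bounds(counts)

# Ten fine m/z rows by two columns (for example, two cycles).
rows = [[float(c), 1.0] for c in counts]
pooled = pool_rows(rows, bounds)
```

Each peak region collapses to a single row, so the m/z axis shrinks from the number of fine rows to the number of peaks.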
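(2) After the pooled DIAT tensor data is obtained, the draw_image function renders it as a pseudo-color image for visualization; Figure 11 is a schematic of the two-dimensional visualization. With this visualization, DIA mass spectrometry data can be inspected directly through images of the DIAT file, and computer vision algorithms can be applied directly to the DIAT tensor. This avoids the computationally expensive extraction of ion chromatograms (XIC), and a model for clinical sample classification can be built directly from this file.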
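(3) After the pooled DIAT tensor data is obtained, the draw_diat function converts the fragment ion intensities in the DIAT tensor data to grayscale to serve as input for subsequent deep learning. For example, the grayscale method can be as follows. Percentiles are used to discretize the non-zero intensity values into equal-frequency intervals, and each interval is assigned a color. The range 0 to 100 is divided into 256 equally spaced values, and these 256 floating-point numbers are passed to a percentile function to compute the 256 corresponding values of the non-zero intensities. These 256 values delimit 255 intervals, each with its own color, and the interval values run from 1 to 255. Figures 12 to 14 show the grayscale results obtained for proteomics, metabolomics, and lipidomics, respectively.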
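The percentile-based grayscale mapping can be sketched in plain Python as follows. The helper names, the linear-interpolation percentile, and the choice to map zero intensities to level 0 are assumptions made for this example.

```python
from bisect import bisect_right


def percentile(sorted_vals, q):
    """Percentile q (0..100) of a sorted list, using linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def grayscale_levels(intensities):
    """Map each intensity to 0 (zero intensity) or an interval value 1..255."""
    nonzero = sorted(v for v in intensities if v > 0)
    if not nonzero:
        return [0] * len(intensities)
    # 256 equally spaced percentile points in 0..100 give 256 edges,
    # which delimit 255 equal-frequency intervals.
    edges = [percentile(nonzero, 100.0 * i / 255) for i in range(256)]
    levels = []
    for v in intensities:
        if v <= 0:
            levels.append(0)
        else:
            # Number of edges <= v is the 1-based interval index.
            levels.append(min(255, max(1, bisect_right(edges, v))))
    return levels


levels = grayscale_levels([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
```

Because the edges follow the empirical distribution, each gray level covers roughly the same number of non-zero values, which prevents a few very intense fragments from compressing the rest of the image into near-black.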
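In summary, the present invention has the following advantages:
The DIAT tensor data of the present invention is converted directly from the original mass spectrometry data structure, which preserves the useful information content of the DIA data. The data is read as a three-dimensional tensor with no restriction on read order, which greatly improves the convenience and speed of reading. Stored as a DIAT file, the data is tens of times smaller than the mzXML file, which greatly reduces the storage space required for mass spectrometry data files. The invention also allows DIA data to be inspected directly through images of the pooled DIAT file and allows computer vision algorithms to be applied directly to the DIAT tensor. This avoids the computationally expensive extraction of ion chromatograms (XIC), and a deep learning model for clinical sample classification can be built directly from files in this format. As the quality and quantity of DIA data grow, the technique has foreseeable potential in clinical diagnosis and offers an effective solution for disease subtyping.
It should be understood that the above specific description serves only to illustrate the present invention and is not limited to the technical solutions described in the embodiments. Those of ordinary skill in the art will appreciate that the invention may still be modified or equivalently substituted to achieve the same technical effect; any such variant that meets the needs of use falls within the scope of protection of the present invention.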

Claims (5)

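  1. A method for implementing a molecular omics data structure based on data-independent acquisition mass spectrometry, comprising the following steps:
    Step A: Convert the raw mass spectrometry data file into an mzXML file and, at the same time, centroid the raw data, the resulting mzXML file containing all necessary information of the MS1 and MS2 data;
    Step B: Extract the required mass spectrometry data from the mzXML file obtained in Step A, the data containing at least the following attributes: scan level, scan index, retention time, precursor ion mass-to-charge ratio, fragment ion mass-to-charge ratio, and fragment ion intensity;
    Step C: From the scan level and scan index of the data extracted in Step B, count the total number of cycles and assign cycle indices, and at the same time detect missing scans, fill every missing position with a 0 placeholder, and obtain the precursor ion window and cycle index for each fragment ion in the data;
    Step D: Bin the data from Step C by fragment ion mass-to-charge ratio, summing the fragment ion intensities that fall into the same bin;
    Step E: Reorder the data from Step D, wherein the window index is obtained from the precursor ion mass-to-charge ratio of each MS2 spectrum and the MS2 spectra with the same window index are arranged together in order of cycle index;
    Step F: From the data processed in Step E, build the tensor of MS2 fragment ion intensities over three dimensions: cycle index, fragment ion mass-to-charge ratio, and index of the precursor ion window corresponding to the fragment ions.
  2. The method for implementing a molecular omics data structure based on data-independent acquisition mass spectrometry according to claim 1, characterized by further comprising Step G: applying pooling across dimensions to reduce the size of the tensor data, producing pooled DIAT tensor data.
  3. The method for implementing a molecular omics data structure based on data-independent acquisition mass spectrometry according to claim 2, characterized in that the pooling in Step G proceeds as follows: first, within each MS2 (second-order) window, the distribution of the non-zero values over the precursor ion mass-to-charge ratio is computed, which yields a pattern of alternating main and secondary peaks on a predefined grid; this alternating pattern is then used to pool the different mass-to-charge ratio regions, with a nonlinear least-squares Gaussian fit to the peaks of the non-zero intensity distribution dynamically determining the upper and lower boundaries of each region to be merged; finally, all grid cells without peaks are discarded and the rows of each main and secondary peak region are merged into a single row, which reduces the number of rows along the mass-to-charge ratio dimension.
  4. The method for implementing a molecular omics data structure based on data-independent acquisition mass spectrometry according to claim 2, characterized by further comprising the following step: after the pooled DIAT tensor data is obtained, rendering the DIAT tensor data as a pseudo-color image for visualization.
  5. The method for implementing a molecular omics data structure based on data-independent acquisition mass spectrometry according to claim 2, characterized by further comprising the following step: after the pooled DIAT tensor data is obtained, converting the fragment ion intensities in the DIAT tensor data to grayscale to serve as input for deep learning.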
PCT/CN2020/127823 2020-03-04 2020-11-10 Implementation method of molecular omics data structure based on data independent acquisition mass spectra WO2021174901A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/597,648 US20220284989A1 (en) 2020-03-04 2020-11-10 Implementation method of molecular omics data structure based on data independent acquisition mass spectra

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010144110.0A CN111370072B (zh) 2020-03-04 2020-03-04 Implementation method of molecular omics data structure based on data independent acquisition mass spectra
CN202010144110.0 2020-03-04

Publications (1)

Publication Number Publication Date
WO2021174901A1 true WO2021174901A1 (zh) 2021-09-10

Family

ID=71210184

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127823 WO2021174901A1 (zh) 2020-03-04 2020-11-10 Implementation method of molecular omics data structure based on data independent acquisition mass spectra

Country Status (3)

Country Link
US (1) US20220284989A1 (zh)
CN (1) CN111370072B (zh)
WO (1) WO2021174901A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
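CN114002368A (zh) * 2021-12-30 2022-02-01 天津市食品安全检测技术研究院 Method for determining illegally added ingredients in health foods by ultra-high performance liquid chromatography-quadrupole-time-of-flight high-resolution mass spectrometry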

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
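CN111370072B (zh) * 2020-03-04 2020-11-17 西湖大学 Implementation method of molecular omics data structure based on data independent acquisition mass spectra
CN114577972B (zh) * 2020-11-30 2023-05-12 中国科学院大连化学物理研究所 Method for screening protein markers for body fluid identification
CN114858958B (zh) * 2022-07-05 2022-11-01 西湖欧米(杭州)生物科技有限公司 Method, device, and storage medium for analyzing mass spectrometry data in quality assessment
CN115267033B (zh) * 2022-08-05 2024-06-14 西湖大学 Metaproteomics analysis method based on mass spectrometry data, and electronic device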

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
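CN104765984A (zh) * 2015-03-20 2015-07-08 同济大学 Method for rapid construction and searching of a biological mass spectrometry database
CN108072728A (zh) * 2016-11-16 2018-05-25 中国科学院大连化学物理研究所 Method for building a spectral library based on a data-dependent mass spectrometry scan mode, and application thereof
CN108351342A (zh) * 2015-08-20 2018-07-31 深圳华大生命科学研究院 Biomarkers of coronary heart disease
CN109416926A (zh) * 2016-04-11 2019-03-01 迪森德克斯公司 Mass spectrometry data analysis workflow
US20190228956A1 (en) * 2018-01-23 2019-07-25 Jeol Ltd. Apparatus and Method for Processing Mass Spectrum
CN111370072A (zh) * 2020-03-04 2020-07-03 西湖大学 Biomolecular omics mass spectrometry data structure based on data-independent acquisition technology, and implementation method thereof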

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2902197B2 (ja) * 1992-02-04 1999-06-07 株式会社日立製作所 Atmospheric pressure ionization mass spectrometer
WO2015191980A1 (en) * 2014-06-13 2015-12-17 Waters Technologies Corporation Intelligent target-based acquisition
CN108140060B (zh) * 2015-05-29 2022-06-28 沃特世科技公司 Techniques for processing mass spectrometry data
CN109828068B (zh) * 2017-11-23 2021-12-28 株式会社岛津制作所 Mass spectrometry data acquisition and analysis method

Also Published As

Publication number Publication date
US20220284989A1 (en) 2022-09-08
CN111370072B (zh) 2020-11-17
CN111370072A (zh) 2020-07-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20922826; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20922826; Country of ref document: EP; Kind code of ref document: A1)