WO2020199866A1 - Biometabolomics data processing and analysis methods and apparatuses, and application thereof - Google Patents

Biometabolomics data processing and analysis methods and apparatuses, and application thereof

Publication number
WO2020199866A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
biological
mass spectrometry
metabolomics
adjacent
Prior art date
Application number
PCT/CN2020/078647
Other languages
French (fr)
Chinese (zh)
Inventor
栾恩慧
李尉
龙巧云
李德华
王雅兰
宋佳平
李振宇
刘兵行
Original Assignee
深圳碳云智能数字生命健康管理有限公司
深圳微伴生物有限公司
深圳数字生命研究院
Priority date
Filing date
Publication date
Application filed by 深圳碳云智能数字生命健康管理有限公司, 深圳微伴生物有限公司, 深圳数字生命研究院
Publication of WO2020199866A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00 Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02 Column chromatography
    • G01N30/86 Signal analysis
    • G01N30/8624 Detection of slopes or peaks; baseline correction
    • G01N30/8631 Peaks
    • G01N30/8634 Peak quality criteria
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00 Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02 Column chromatography
    • G01N30/88 Integrated analysis systems specially adapted therefor, not covered by a single one of the groups G01N30/04 - G01N30/86

Definitions

  • The present invention relates to the technical field of metabolomics, and in particular to a biological metabolomics data processing method, analysis method, device, and application thereof.
  • Metabolomics is a new subject after genomics and proteomics. It is an important part of systems biology. It mainly investigates the dynamic changes of all small molecule metabolites and their contents before and after the biological system is stimulated or disturbed. Through the overall qualitative and quantitative analysis of all small molecule metabolites in the organism, the relationship between metabolites and physiological and pathological changes can be explored and discovered. Studies have shown that metabolome has important application value in the fields of early disease diagnosis, biomarker discovery, drug screening, toxicity evaluation, sports medicine, and nutrition.
  • LC-MS liquid chromatography-mass spectrometry
  • LC-MS technology has been further improved, and large-scale sample detection applications have also increased.
  • The test time for large-scale samples is long, and during long-term operation the sensitivity of the instrument decreases and the retention time drifts. Researchers therefore often run large-scale samples in batches to keep the instrument in good condition. This introduces another problem: the metabolome data of different samples and batches contain random and systematic errors and cannot be compared directly, so data integration is required.
  • A common approach is to use the XCMS method for data integration, which enables multi-sample metabolomics data analysis.
  • The present invention provides a biological metabolomics data processing method, analysis method and device that effectively solve the problem that complementary sample information between different batches cannot be used effectively during metabolome data processing, which results in poor metabolite detection repeatability and reduced coverage.
  • the present invention aims to provide a biological metabolomics data processing method, analysis method, device and application, which are suitable for processing larger scale metabolomics data.
  • The biological metabolomics data includes liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data, each of which includes primary mass spectrometry data. The biological metabolomics data processing method includes the step of integrating the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database.
  • the integration steps include:
  • S11 arbitrarily select one of the multiple biological samples as a reference sample, and perform correction on the time axis of other samples one by one according to the time axis of the reference sample;
  • S13 includes:
  • S131: Determine whether the [mzmin, mzmax] regions of the multiple identified characteristic peaks overlap or are adjacent. If they overlap, go to S133. If they do not overlap, further determine whether they are adjacent: if the interval between the [mzmin, mzmax] regions of the multiple identified characteristic peaks is less than the first preset threshold, they are determined to be adjacent, and the method goes to S133. If they are neither overlapping nor adjacent, the multiple identified characteristic peaks are determined to be independent characteristic peaks;
  • S132: Determine whether the [rtmin, rtmax] regions of the multiple identified characteristic peaks overlap or are adjacent. If they overlap, go to S133. If they do not overlap, further determine whether they are adjacent: if the interval between the [rtmin, rtmax] regions of the multiple identified characteristic peaks is less than the second preset threshold, they are determined to be adjacent, and the method goes to S133. If they are neither overlapping nor adjacent, the multiple identified characteristic peaks are determined to be independent characteristic peaks;
  • S134 Generate a feature list using data of all feature peaks to obtain a feature database.
  • The first preset threshold is set according to instrument parameters, and the second preset threshold is set according to the maximum time deviation in the retention time correction; preferably, the first preset threshold is set to 0.01 to 0.015 Da, and the second preset threshold is set to 10 to 15.
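The overlap and adjacency tests of S131 and S132, followed by merging and feature-list generation, can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Peak` class, the helper names, and the greedy merging order are assumptions, and the default tolerances are simply taken from the lower ends of the preferred ranges stated above.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    """A peak or feature region, expressed as [mzmin, mzmax, rtmin, rtmax]."""
    mzmin: float
    mzmax: float
    rtmin: float
    rtmax: float


def gap(lo1, hi1, lo2, hi2):
    """Distance between two closed intervals; 0.0 when they overlap."""
    return max(0.0, max(lo1, lo2) - min(hi1, hi2))


def mergeable(a, b, mz_tol=0.01, rt_tol=10.0):
    """True when both the mz and the rt regions overlap or are adjacent."""
    mz_gap = gap(a.mzmin, a.mzmax, b.mzmin, b.mzmax)
    rt_gap = gap(a.rtmin, a.rtmax, b.rtmin, b.rtmax)
    mz_ok = mz_gap == 0.0 or mz_gap < mz_tol   # overlap, or adjacent (first threshold)
    rt_ok = rt_gap == 0.0 or rt_gap < rt_tol   # overlap, or adjacent (second threshold)
    return mz_ok and rt_ok


def union(a, b):
    """Smallest region covering both peaks."""
    return Peak(min(a.mzmin, b.mzmin), max(a.mzmax, b.mzmax),
                min(a.rtmin, b.rtmin), max(a.rtmax, b.rtmax))


def merge_peaks(peaks, mz_tol=0.01, rt_tol=10.0):
    """Greedily merge peaks into features; unmerged peaks stay independent."""
    features = []
    for p in peaks:
        current = Peak(p.mzmin, p.mzmax, p.rtmin, p.rtmax)
        merged = True
        while merged:  # repeat, since a grown region may touch further features
            merged = False
            for f in features:
                if mergeable(current, f, mz_tol, rt_tol):
                    features.remove(f)
                    current = union(current, f)
                    merged = True
                    break
        features.append(current)
    return features
```

Peaks from one sample or from many samples can be passed in together, so a region found in one batch can extend a region found in another.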
  • mass spectrometry data also includes secondary mass spectrometry data
  • S13 also includes:
  • the third preset threshold is set to 40%.
  • the third preset threshold is set to 50%.
  • the third preset threshold is set to 60%.
  • the third preset threshold is set to 80%.
  • the mass spectrum data also includes the secondary mass spectrum data
  • S11 also includes the retention time correction of the primary mass spectrum data and the secondary mass spectrum data; preferably, the retention time correction is performed using the Obiwarp algorithm.
  • the algorithm for peak recognition is CentWave algorithm, matchedFilter algorithm or mzMine algorithm.
  • parameter settings of the peak recognition algorithm include: ppm: the resolution of the instrument used; peak width: set to 2-30; noise: set to 0; signal-to-noise ratio: set to 10.
  • Biological samples include human or animal body fluids, tissues or cells; plant roots, stems, leaves, fruits or seeds; or microbial cell culture fluid. Body fluids include urine, blood, saliva, cerebrospinal fluid or amniotic fluid; tissues include organ tissues, muscle tissues or tumor tissues; and cells include stem cells, somatic cells, tumor cells or microbial cells.
  • a method for analyzing biological metabolomics data sequentially includes the steps of biological metabolomics data processing and qualitative identification of metabolites through secondary mass spectrometry data information, wherein the biological metabolomics data processing adopts any of the above-mentioned biological metabolomics data processing methods of the present invention.
  • the step of qualitatively identifying metabolites through the data information of the secondary mass spectrum includes:
  • S23: Take all the secondary mass spectrum (MS2) mass-to-charge ratio data corresponding to the characteristic value selected in S22 as one side, and the MS2 mass-to-charge ratio data of the matched standard compounds found in S22 as the other side; score the similarity between the two, calculate an integral score, and qualitatively identify the metabolite based on that score.
  • S23 includes: for each of the multiple matched standard compounds, calculating the median of its similarity to the multiple MS2 spectra, and selecting the compound with the largest median; preferably, whether the compound matches is judged by whether its median is greater than the cut-off value.
  • mass-to-charge ratio data of standard compounds are obtained from existing databases, including NISTlib, HMDB or METLIN.
  • the analysis method also includes a step of quantifying biological metabolites.
  • the steps of quantifying biological metabolites include:
  • A method for detecting vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides includes: performing liquid chromatography-mass spectrometry and/or gas chromatography-mass spectrometry on a biological sample to obtain liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data; processing that data with any of the above-mentioned biological metabolomics data processing methods or biological metabolomics data analysis methods to obtain data results; and determining the vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides based on the data results.
  • The biological metabolomics data includes liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data, each of which includes primary mass spectrometry data. The biological metabolomics data processing device includes a database generating module that integrates the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database.
  • the database generating module includes:
  • the time axis correction sub-module is set to arbitrarily select one of multiple biological samples as a reference sample, and correct the time axis of other samples one by one according to the time axis of the reference sample;
  • the characteristic peak recognition sub-module is configured to perform peak recognition processing of the ion peaks of the primary mass spectrometer one by one for each sample after calibration to obtain multiple characteristic peaks;
  • the feature database forming sub-module is set to merge multiple identification feature peaks according to the principle of sample information complementarity to obtain feature databases of multiple biological samples.
  • The feature database forming sub-module includes a data integration unit, which is set to merge multiple identified characteristic peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent into one characteristic peak.
  • the feature database forming sub-module includes a first judgment unit, a second judgment unit, a data integration unit and a feature database forming unit:
  • The first judging unit is set to judge whether the [mzmin, mzmax] regions of the multiple identified characteristic peaks overlap or are adjacent. If they overlap, the peaks enter the data integration unit. If they do not overlap, it further judges whether they are adjacent: if the interval between the [mzmin, mzmax] regions is less than the first preset threshold, they are determined to be adjacent and enter the data integration unit. If they are neither overlapping nor adjacent, the multiple identified characteristic peaks are determined to be independent characteristic peaks;
  • The second judging unit is set to judge whether the [rtmin, rtmax] regions of the multiple identified characteristic peaks overlap or are adjacent. If they overlap, the peaks enter the data integration unit. If they do not overlap, it further judges whether they are adjacent: if the interval between the [rtmin, rtmax] regions is less than the second preset threshold, they are determined to be adjacent and enter the data integration unit. If they are neither overlapping nor adjacent, the multiple identified characteristic peaks are determined to be independent characteristic peaks;
  • The data integration unit is set to merge multiple identified characteristic peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent into one characteristic peak;
  • the feature database forming unit is configured to generate a feature list using the data of all feature peaks to obtain the feature database.
  • The first preset threshold is set according to instrument parameters, and the second preset threshold is set according to the maximum time deviation in the retention time correction; preferably, the first preset threshold is set to 0.01 to 0.015 Da, and the second preset threshold is set to 10 to 15.
  • the mass spectrometry data also includes secondary mass spectrometry data
  • The biological metabolomics data processing device further includes a peak merging validity verification sub-module configured to compare the secondary mass spectrometry data of multiple biological samples against the feature database, wherein the peak merging is determined to be valid when the comparison rate is greater than or equal to the third preset threshold.
  • the third preset threshold is set to 40%.
  • the third preset threshold is set to 50%.
  • the third preset threshold is set to 60%.
  • the third preset threshold is set to 80%.
  • the mass spectrum data also includes secondary mass spectrum data
  • the time axis correction sub-module is also configured to perform retention time correction on the primary mass spectrum data and the secondary mass spectrum data; preferably, the Obiwarp algorithm is used to perform retention time correction.
  • the algorithm for peak recognition is CentWave algorithm, matchedFilter algorithm or mzMine algorithm.
  • parameter settings of the peak recognition algorithm include: ppm: the resolution of the instrument used; peak width: set to 2-30; noise: set to 0; signal-to-noise ratio: set to 10.
  • a device for analyzing biological metabolomics data includes a module configured to process biological metabolomics data and a module configured to qualitatively identify metabolites through secondary mass spectrometry data information, wherein the module configured to process biological metabolomics data is any of the above-mentioned biological Metabolomics data processing device.
  • the module configured to qualitatively identify metabolites through the data information of the secondary mass spectrum includes:
  • the standard compound mass-to-charge ratio data acquisition sub-module is set to acquire the mass-to-charge ratio data of each standard compound
  • The standard compound matching sub-module is set to randomly select a characteristic value in the feature database obtained after the biological metabolomics data processing, find all the secondary mass spectrum (MS2) mass-to-charge ratio data corresponding to that characteristic value, and find the standard compounds that match it according to all of that MS2 mass-to-charge ratio data;
  • The integral qualitative sub-module is set to take all the MS2 mass-to-charge ratio data corresponding to the characteristic value selected in the standard compound matching sub-module as one side, and the MS2 mass-to-charge ratio data of the matched standard compounds found by the standard compound matching sub-module as the other side; to score the similarity between the two, calculate an integral score, and qualitatively identify the metabolite based on that score.
  • The integral qualitative sub-module is set to calculate, for each of the multiple matched standard compounds, the median of its similarity to the multiple secondary mass spectra, and to select the compound with the largest median; preferably, whether the compound matches is judged by whether its median is greater than the cut-off value.
  • mass-to-charge ratio data of standard compounds are obtained from existing databases, including NISTlib, HMDB or METLIN.
  • the analysis device further includes a module configured to quantify biological metabolites.
  • module set to quantify biological metabolites includes:
  • the time axis correction sub-module is set to correct the time axis of the sample to be quantified according to the time axis of the reference sample;
  • the relative quantification of biological metabolites sub-module is set to integrate the corresponding characteristic regions of the samples to be quantified in the established characteristic database to obtain the relative quantitative results of biological metabolites.
  • the present invention has at least the following beneficial effects:
  • The invention constructs a feature database, fixes a reference sample, and unifies the time axis, which ensures that subsequent samples are comparable in time. The metabolome data processing can therefore make effective use of complementary sample information between different batches, which effectively improves metabolite detection repeatability and coverage.
  • The present invention performs peak merging while constructing the feature database. The merged peaks cover a larger area, so quantification is more accurate even when only one sample is detected, and the method still works well for metabolites with poor chromatographic peak shapes. The larger coverage area is also more compatible with subsequent samples and effectively reduces the impact of retention time (RT) drift.
  • RT retention time
  • the present invention effectively improves the analysis efficiency of samples, so that subsequent samples are comparable in time, and the samples do not need to be rolled back, and can be widely used in business.
  • Figure 1 shows a schematic diagram of the process of constructing a feature database in an embodiment of the present invention;
  • Figure 2 shows a schematic diagram of the process of merging identified characteristic peaks in an embodiment of the present invention;
  • Figure 3 shows the sample retention time correction diagram of Example 1;
  • Figure 4 shows the ionization form diagrams of the 16 standard compounds in Example 1;
  • Figure 5 shows the similarity comparison between the 35 MS2 spectra of Example 1 and the 16 matched standard compounds;
  • Figure 6 shows the distribution of the number of missing feature values in Example 1;
  • Figure 7 shows a comparison of the coefficient of variation (CV) between the samples of Example 1 and Comparative Example 1;
  • Figure 8 shows the PCA results of Example 1 and Comparative Example 1;
  • Figure 9 shows a comparison of the number of metabolites identified in Example 1 and Comparative Example 1;
  • Figure 10 shows the mz and RT distributions of the MS2 precursor ions corresponding to FT08341 in Example 1;
  • Figure 11 shows the similarity of the 35 MS2 spectra of FT08341 in Example 1.
  • Metabolome: the dynamic totality of metabolites in an organism.
  • The metabolome usually refers collectively to small-molecule metabolites with a relative molecular mass of less than 1000 Da (Da: Dalton).
  • Mass spectrometry (MS): a method of ionizing the measured substance and using electric and magnetic fields to separate the moving ions according to their mass-to-charge ratios for detection.
  • Precursor ion: also known as parent ion, an ion that can further decompose and generate fragment ions.
  • Product ion: a fragment ion obtained by high-energy fragmentation of a certain molecular ion (parent ion).
  • Primary mass spectrum (MS1): the mass-to-charge ratio and intensity of all charged ions are detected to form a primary spectrum.
  • the signal in the primary mass spectrum is the precursor ion signal.
  • Secondary mass spectrum (MS2): parent ions are selected in a certain way and dissociated further, and the mass-to-charge ratio and intensity of the resulting product ions are analyzed to form a secondary spectrum.
  • Mass-to-charge ratio (mz): the ratio of the mass of a charged ion to its charge. It is a physical characteristic of the ion and has a fixed value, but the detected mz fluctuates because of the limited resolution of the instrument.
  • Retention time (RT): the time from the start of sample injection until the maximum concentration of the component appears after the column, that is, the time elapsed from the start of injection to the apex of the chromatographic peak of a given component.
  • the retention time of the component molecular ion
  • Ion peaks (peaks): ion peaks in a sample, expressed as [mzmin, mzmax, rtmin, rtmax].
  • Features have the same representation as peaks, [mzmin, mzmax, rtmin, rtmax]. Unlike a peak, a feature can represent the whole molecular ion (a peak is only part of a molecular ion, and one molecular ion can have multiple peaks). A feature can be merged from multiple peaks of one sample, or from multiple peaks of multiple samples.
  • PPM parts per million
  • The present invention proposes a new approach to metabolome data integration that can be applied to large-scale metabolome data analysis. It supports data correction and data integration for batches or single samples and is not affected by the test batch, while also improving metabolite coverage and the accuracy of qualitative and quantitative analysis.
  • The biological metabolomics data includes liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data, each of which includes primary mass spectrometry data. The biological metabolomics data processing method includes the step of integrating the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database, and the integration step includes:
  • S11 arbitrarily select one of the multiple biological samples as a reference sample, and perform correction on the time axis of other samples one by one according to the time axis of the reference sample;
  • Where the mass spectrometry data also includes secondary mass spectrometry data, S11 also includes retention time correction of the primary and secondary mass spectrum data to further improve the accuracy of the mass spectrometry data; preferably, the Obiwarp algorithm is used for retention time correction, which has the advantages of fast calculation speed and high data processing accuracy.
  • S13 includes: if the [mzmin, mzmax] regions of the multiple identified characteristic peaks overlap or are adjacent, and their [rtmin, rtmax] regions overlap or are adjacent, merging the multiple identified characteristic peaks into one characteristic peak. Preferably, S13 includes: S131, judging whether the [mzmin, mzmax] regions of the multiple identified characteristic peaks overlap; if they overlap, go to S133; if they do not overlap, further determine whether they are adjacent: if the interval between the [mzmin, mzmax] regions of the multiple identified characteristic peaks is less than the first preset threshold, they are determined to be adjacent, and the method proceeds to S133; if they are neither overlapping nor adjacent, the multiple identified characteristic peaks are determined to be independent characteristic peaks; S132, determining whether the [rtmin, rtmax] regions of the multiple identified characteristic peaks overlap or are adjacent; if they overlap, go to S133; if they do not overlap, further determine whether they are adjacent,
  • The merged characteristic peaks cover a larger area, so quantification is more accurate even when only one sample is detected (this is effective for metabolites with poor chromatographic peak shapes). The larger coverage area is also more compatible with subsequent samples and effectively reduces the impact of retention time (RT) drift.
  • The peak recognition algorithm is the CentWave algorithm, the matchedFilter algorithm or the mzMine algorithm, more preferably the CentWave algorithm, because this method improves sensitivity, limits its error, and finds the largest number of characteristic peaks most accurately. Merging peaks on the basis of the CentWave algorithm makes effective use of CentWave's ability to locate the region of maximum response.
  • The algorithm parameter settings follow the idea of "improving detection sensitivity as much as possible".
  • The parameter settings of the peak recognition algorithm include: ppm: set to the resolution of the instrument; peak width: set to 2-30; noise: set to 0; signal-to-noise ratio: set to 10.
  • The first preset threshold is set according to instrument parameters, and the second preset threshold is set according to the maximum time deviation in the retention time correction. Preferably, the first preset threshold is set to 0.01 to 0.015 Da; more preferably, it is set to 0.01 Da, 0.011 Da, 0.012 Da, 0.013 Da, 0.014 Da or 0.015 Da. Preferably, the second preset threshold is set to 10 to 15; more preferably, it is set to 10, 11, 12, 13, 14 or 15. These settings improve the effectiveness of peak merging and thereby the accuracy of the feature database.
  • Where the mass spectrometry data further includes secondary mass spectrometry data, S13 further includes: S135, comparing the secondary mass spectrometry data of multiple biological samples against the feature database generated in S134 to help determine whether the peak merging is effective. The higher the proportion of secondary mass spectrum data that maps to the feature database, the more effective the peak merging.
  • The secondary mass spectrometry data of multiple biological samples are compared against the feature database generated in S134, and the peak merging is determined to be effective when the comparison rate is greater than or equal to the third preset threshold.
  • In one more preferred embodiment, the third preset threshold is set to 40%; in another, it is set to 50%; in another, it is set to 60%; and in another, it is set to 80%.
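The validity check against the third preset threshold can be illustrated with a short sketch. How a secondary spectrum is "compared to" the feature database is not spelled out here, so the sketch assumes that an MS2 spectrum counts as matched when its precursor (mz, rt) falls inside some feature region; the function names and tuple layouts are hypothetical.

```python
def comparison_rate(precursors, features):
    """Fraction of MS2 precursors that fall inside at least one feature region.

    precursors: list of (mz, rt) pairs, one per MS2 spectrum (assumed layout).
    features:   list of (mzmin, mzmax, rtmin, rtmax) tuples.
    """
    if not precursors:
        return 0.0
    hits = sum(
        any(mzmin <= mz <= mzmax and rtmin <= rt <= rtmax
            for mzmin, mzmax, rtmin, rtmax in features)
        for mz, rt in precursors
    )
    return hits / len(precursors)


def merging_is_valid(precursors, features, threshold=0.4):
    """Peak merging is judged valid when the rate reaches the threshold."""
    return comparison_rate(precursors, features) >= threshold
```

The default of 0.4 corresponds to the 40% embodiment; the other embodiments would pass 0.5, 0.6 or 0.8.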
  • The biological metabolomics data processing method of the present invention is suitable for almost all biological samples that can be detected by liquid chromatography-mass spectrometry and/or gas chromatography-mass spectrometry.
  • Biological samples include but are not limited to human or animal body fluids, tissues or cells; plant roots, stems, leaves, fruits or seeds; or microbial cell culture fluid. Body fluids include urine, blood, saliva, cerebrospinal fluid, amniotic fluid and the like; tissues include organ tissue, muscle tissue, tumor tissue and the like; and cells include stem cells, somatic cells, tumor cells, microbial cells and the like.
  • The biological metabolomics data analysis method sequentially includes the steps of biological metabolomics data processing and qualitative identification of metabolites through secondary mass spectrometry data information, wherein the biological metabolomics data processing is carried out with any of the above-mentioned biological metabolomics data processing methods of the present invention. Since that processing method is not affected by the detection batch, the sample data volume of the feature database can be accumulated continuously, which continuously improves the accuracy of the qualitative identification of metabolites through the secondary mass spectrum data information.
  • the step of qualitatively identifying metabolites through the data information of the secondary mass spectrum includes:
  • S23: Take all the secondary mass spectrum (MS2) mass-to-charge ratio data corresponding to the characteristic value selected in S22 as one side, and the MS2 mass-to-charge ratio data of the matched standard compounds found in S22 as the other side; score the similarity between the two, calculate an integral score, and qualitatively identify the metabolite based on that score.
  • This method can effectively avoid the problem that the median is not representative, and the operation is simple.
  • A general dot-product calculation can also be used to score MS2 similarity. The multiple MS2 spectra belonging to the same feature are compared with the MS2 spectrum of the standard compound, and the feature is identified through the integral score.
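A dot-product similarity between two MS2 spectra is commonly computed as a cosine score over binned m/z values. The sketch below shows one such formulation; the bin width, the (mz, intensity) pair layout and the function name are assumptions for illustration, not details given in the text.

```python
import math


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Normalized dot product of two spectra given as (mz, intensity) pairs."""

    def binned(spec):
        # Sum intensities that fall into the same m/z bin.
        out = {}
        for mz, intensity in spec:
            key = round(mz / bin_width)
            out[key] = out.get(key, 0.0) + intensity
        return out

    a, b = binned(spec_a), binned(spec_b)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

A score near 1 indicates that the two spectra share fragment ions with similar relative intensities, and a score of 0 indicates no shared bins.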
  • S23 specifically includes: for each of the multiple matched standard compounds, calculating the median of its similarity to the multiple MS2 spectra, and selecting the compound with the largest median; more preferably, whether the compound matches is judged by whether its median is greater than the cut-off value.
  • Using the above steps not only includes the "representative" MS2, but also adds various possible MS2 of the compound, which increases the degree of matching with the standard compound.
  • the mass-to-charge ratio data of the standard compound is obtained from an existing database, for example, the database includes NISTlib, HMDB or METLIN.
  • the analysis method further includes a step of quantifying biological metabolites.
  • A unified time axis is determined, the retention time is corrected, and a feature database with a rich data volume is obtained. This maximizes the coverage area of the precursor ion (mz), reduces the influence of fluctuations in the mass-to-charge ratio (mz) and retention time (RT), and improves the accuracy of biological metabolite quantification.
  • the step of quantifying biological metabolites includes: S31, calibrating the time axis of the sample to be quantified according to the time axis of the reference sample; S32, integrating the corresponding feature area of the sample to be quantified in the established feature database to obtain the biological The result of relative quantification of metabolites.
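Step S32 can be pictured as integrating signal intensity inside a feature region. The sketch below is one possible reading, under stated assumptions: scans are already retention-time corrected (S31), intensities within the mz window are summed per scan, and the area over rt is taken with the trapezoidal rule. The data layout and the function name are hypothetical.

```python
def integrate_feature(scans, feature):
    """Relative abundance of one feature in one sample.

    scans:   list of (rt, [(mz, intensity), ...]) sorted by rt (assumed layout).
    feature: (mzmin, mzmax, rtmin, rtmax) taken from the feature database.
    """
    mzmin, mzmax, rtmin, rtmax = feature
    # Extracted ion chromatogram restricted to the feature's rt window.
    eic = [
        (rt, sum(i for mz, i in points if mzmin <= mz <= mzmax))
        for rt, points in scans
        if rtmin <= rt <= rtmax
    ]
    area = 0.0
    for (t0, y0), (t1, y1) in zip(eic, eic[1:]):
        area += 0.5 * (y0 + y1) * (t1 - t0)  # trapezoidal rule
    return area
```

Because the region comes from the merged feature database rather than from peak picking on the new sample, a single sample can be quantified without reprocessing earlier batches.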
  • A sample is selected as the reference sample (the reference sample is of the same type as the test samples; it can be understood as a standard that only needs to be tested once), and the time axes of the other samples are corrected against this sample, reference.xml. That is, a reference sample is fixed to ensure that subsequent samples are comparable in time.
  • the new sample is first corrected for retention time (RT).
  • the Obiwarp algorithm is used to perform retention time (RT) correction on the primary mass spectrum data (MS1) and secondary mass spectrum data (MS2).
  • Peak identification, specifically: on the corrected time axis, the CentWave algorithm is used to identify the primary mass spectrum ion peaks of each sample (findPeaks).
  • the peak recognition algorithm includes, but is not limited to, CentWave, matchedFilter and mzMine, with CentWave preferred. This method can improve sensitivity, limit error, and find the largest number of identification characteristic peaks (peak1, peak2, ..., peakn) most accurately.
  • the noise caused by high sensitivity and the problem of the same ion peak being divided into two ion peaks caused by the strict ppm setting are handled by sample information complementation.
  • peakwidth (peak width): set to 2~30.
  • This parameter setting is related to the column type and elution time, generally 1/10 of the elution time.
  • the purpose of selecting 2 as the lower limit is to identify very narrow peaks and improve the sensitivity of findPeaks.
  • noise: set to 0. This parameter represents the intensity of noise, and the purpose of setting it to 0 is to improve sensitivity. The greater the noise value, the lower the sensitivity.
  • This parameter represents the signal-to-noise ratio and uses the default parameters.
  • n and a each independently take positive integer values
  • m takes the value 0 or a positive integer
  • m and n each independently take positive integer values
  • the standard compound is obtained from an existing database (the database is mainly NISTlib, or HMDB, METLIN and other public databases).
  • This step calculates the median of similarity between each compound and multiple MS2s, and selects the compound with the largest median. According to whether the median of the compound is greater than the specified value (also called the cut-off value, the specified value can be determined as 0.5 to 1 according to the actual situation) to determine whether it matches.
  • the specified value also called the cut-off value
  • the quantitative method is as follows:
  • the present invention also provides an application of the above-mentioned biological metabolomics data processing method and biological metabolomics data analysis method in the identification of vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides. Since the above-mentioned biological metabolomics data processing method of the present invention is not affected by the test batch, it can continuously accumulate the sample data volume of the feature database, thereby also increasing the accuracy and precision of identifying vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides.
  • the present invention also provides a method for detecting vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides.
  • the detection method includes: performing liquid chromatography-mass spectrometry and/or gas chromatography-mass spectrometry on a biological sample to obtain liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data; processing the liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data of the biological sample with any of the above-mentioned biological metabolomics data processing methods or biological metabolomics data analysis methods to obtain data results; and deriving, based on the data results, the type and content of vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides.
  • the test results of vitamins, amino acids, lipids, steroids due to the advancement of the biological metabolomics data processing method and analysis method of the present invention,
  • a biological metabolomics data processing device is provided. The biological metabolomics data includes liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data; the liquid chromatography-mass spectrometry data includes primary mass spectrometry data, and the gas chromatography-mass spectrometry data includes primary mass spectrometry data; the biological metabolomics data processing device includes a database generation module that integrates the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database.
  • the database generation module includes a time axis correction sub-module, a characteristic peak recognition sub-module, and a feature database forming sub-module, where the time axis correction sub-module is set to arbitrarily select one of the multiple biological samples as a reference sample and correct the time axes of the other samples one by one according to the time axis of the reference sample; the characteristic peak recognition sub-module is set to perform peak identification of the primary mass spectrum ion peaks one by one for each corrected sample to obtain multiple identification characteristic peaks; and the feature database forming sub-module is set to merge the multiple identification characteristic peaks according to the principle of sample information complementarity to obtain a feature database of the multiple biological samples.
  • the device of the present invention selects one sample as the reference sample and corrects the time axes of the other samples according to it; that is, a unified coordinate axis is determined so that the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of the samples are comparable in time. On the corrected time axis, peak identification is performed on the primary mass spectrum ion peaks of each sample, and the peaks are then merged using the principle of information complementarity between samples to construct a feature database, which enables super-large-scale integration of metabolome data. Since all samples are calibrated on the time axis against the reference sample, data calibration and data integration can be performed in batches or on individual samples without being affected by the test batch, which makes the device suitable for commercial testing.
  • the mass spectrum data further includes secondary mass spectrum data
  • the time axis correction sub-module is further configured to perform retention time correction on the primary mass spectrum data and the secondary mass spectrum data to further improve the accuracy of the mass spectrum data; preferably, the Obiwarp algorithm is used for retention time correction, which has the advantages of fast calculation speed and high data processing accuracy.
  • the feature database forming sub-module includes a data integration unit, and the data integration unit is configured to merge into one characteristic peak the multiple identification characteristic peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent;
  • the feature database forming sub-module includes a first judgment unit, a second judgment unit, a data integration unit, and a feature database forming unit, wherein the first judgment unit is configured to judge whether the [mzmin, mzmax] regions of multiple identification characteristic peaks overlap; if they overlap, the data integration unit is entered; if they do not overlap, it is further determined whether they are adjacent.
  • the second judgment unit is set to judge whether the [rtmin, rtmax] regions of multiple identification characteristic peaks overlap or are adjacent: if they overlap, the data integration unit is entered; if they do not overlap, it is further determined whether they are adjacent, and if the interval between the [rtmin, rtmax] regions of the multiple identification characteristic peaks is less than the second preset threshold, they are determined to be adjacent and the data integration unit is entered; if they are neither overlapping nor adjacent, the multiple identification characteristic peaks are determined to be independent characteristic peaks; the data integration unit is set to merge into one characteristic peak the multiple identification characteristic peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent; the database forming
  • the merged characteristic peaks cover a larger area, so quantification can be more accurate even when only one sample is detected (this is effective for metabolites with poor chromatographic peak shapes); the larger coverage area is also more compatible with subsequent samples, effectively reducing the impact of retention time (RT) drift.
  • RT retention time
  • the peak recognition algorithm is the CentWave algorithm, the matchedFilter algorithm or the mzMine algorithm, more preferably the CentWave algorithm, because this method can improve sensitivity, limit error, and find the largest number of identification characteristic peaks most accurately. Merging peaks on the basis of the CentWave algorithm makes effective use of CentWave's positioning of the maximum response area.
  • the algorithm parameter setting follows the idea of "improving detection sensitivity as much as possible”.
  • the parameter settings of the peak recognition algorithm include: ppm: the resolution of the instrument is adopted; peak width: set to 2~30; noise: set to 0; SNR: set to 10.
  • the first preset threshold is set according to instrument parameters, and the second preset threshold is set according to the maximum value of the time deviation in the retention time correction; preferably, the first preset threshold is set to 0.01~0.015 Da, more preferably to 0.01 Da, 0.011 Da, 0.012 Da, 0.013 Da, 0.014 Da or 0.015 Da, and the second preset threshold is set to 10~15, more preferably to 10, 11, 12, 13, 14 or 15, to improve the effectiveness of peak merging and thereby the accuracy of the feature database.
  • the mass spectrometry data further includes secondary mass spectrometry data
  • the biological metabolomics data processing device further includes: a peak merging validity verification sub-module configured to compare the secondary mass spectrometry data of the multiple biological samples to the feature database to assist in judging the effectiveness of peak merging. The higher the proportion of secondary mass spectrum data that is matched to the feature database, the more effective the peak merging.
  • the secondary mass spectrometry data of the multiple biological samples are compared to the feature database, and when the comparison rate is greater than or equal to the third preset threshold, peak merging is determined to be effective; in a more preferred embodiment, the third preset threshold is set to 40%; in a more preferred embodiment, the third preset threshold is set to 50%; in a more preferred embodiment, the third preset threshold is set to 60%; in a more preferred embodiment, the third preset threshold is set to 80%.
  • a biological metabolomics data analysis device includes a module configured to process biological metabolomics data and a module configured to qualitatively identify metabolites through secondary mass spectrometry data information, wherein the module configured to process biological metabolomics data is the aforementioned biological metabolomics data processing device. Since the above-mentioned biological metabolomics data processing device of the present invention is not affected by the test batch, it can continuously accumulate the sample data volume of the feature database, thereby continuously improving the accuracy of qualitatively identifying metabolites through secondary mass spectrum data information.
  • the module configured to qualitatively identify metabolites through secondary mass spectrum data information includes a standard compound mass-to-charge ratio data acquisition sub-module and a standard compound matching sub-module, wherein the standard compound mass-to-charge ratio data acquisition sub-module is set to acquire the mass-to-charge ratio data of each standard compound
  • the standard compound matching sub-module is set to randomly select a characteristic value in the feature database obtained after the biological metabolomics data processing, find all the secondary mass spectrum mass-to-charge ratio data corresponding to that characteristic value, and find the matching standard compounds according to that data
  • the integration qualitative sub-module is set to take all the secondary mass spectrum mass-to-charge ratio data corresponding to the characteristic value selected in the standard compound matching sub-module as one side and the mass-to-charge ratio data of the matched standard compounds found in the standard compound matching sub-module as the other side, score the similarity between the two, and calculate
  • the integration qualitative sub-module is set to calculate, for each of the multiple matched standard compounds, the median of its similarity to the multiple secondary mass spectra, and to select the compound with the largest median; more preferably, a match is judged according to whether the median of that compound is greater than the cut-off value.
  • the above algorithm includes not only the "representative" MS2 but also the various possible MS2 spectra of the compound, which increases the degree of matching with the standard compound.
  • the mass-to-charge ratio data of the standard compound is obtained from an existing database, for example, the database includes NISTlib, HMDB or METLIN.
  • the analysis device further includes a module configured to quantify biological metabolites.
  • a module configured to quantify biological metabolites.
  • the module configured to quantify biological metabolites includes a time axis correction sub-module and a biological metabolite relative quantification sub-module, wherein the time axis correction sub-module is set to correct the time axis of the sample to be quantified according to the time axis of the reference sample;
  • the relative quantification of biological metabolites sub-module is set to integrate the corresponding characteristic regions of the samples to be quantified in the established characteristic database to obtain the relative quantitative results of biological metabolites.
  • the new sample is first corrected for retention time (RT).
  • the Obiwarp algorithm is used to correct the retention time (RT) of the primary mass spectrum data (MS1) and the secondary mass spectrum data (MS2).
  • the sample retention time correction in this embodiment is shown in Figure 3 (note: the horizontal axis is the retention time RT (unit: s), and the vertical axis is the amount by which the retention time of the sample deviates from that of the reference sample (unit: s), also called the retention time deviation).
  • the horizontal line is the reference sample, and the curve is the other samples.
  • the algorithm parameter setting follows the principle of "improving the detection sensitivity as much as possible”.
  • the specific settings in this embodiment are as follows:
  • Results: each sample has about 3,600 to 5,000 peaks, and the 101 samples have a total of 431,695 peaks.
  • the metabolome data of 101 samples are sequentially integrated.
  • the metabolite identification steps are as follows:
  • each standard compound contains 18 ionization forms, as shown in Table 2 (types of ionization forms of standard compounds). When a sample is tested, one or more ionized forms of a compound will be obtained. Table 3 lists 5 standard compounds (S0001-S0005) and the mz of 5 of their ionized forms.
  • S0001 is the number of the compound, M+ is the ionized form, 74 and so on are mz.
  • the horizontal axis is the standard compound, and the vertical axis is 35 MS2 of FT08341.
  • the first column is the similarity between the 35 MS2 spectra and the MS2 of n01701.
  • Standard compound (median): n01701 (0.041); n01694 (0.211); n01696 (0.890); n01835 (0.000); n01577 (0.000); n01578 (0.005); n01579 (0.003); n01440 (0.026)
  • Standard compound (median): n01444 (0.150); n01320 (0.000); L0194 (0.000); n00528 (0.012); n01419 (0.006); n01420 (0.494); n01421 (0.008); n01423 (0.004)
  • Table 5 shows the relative quantitative results (partial) of metabolites.
  • GXP104, GXP107, etc. are sample names, score is the matching score of the compound, and metabolite is the identified compound.
  • the name of the metabolite matched by FT08341 is Phe-Trp (the name of the metabolite is obtained from the database of standard compounds; the database is mainly NISTlib and can also be replaced with HMDB, METLIN and other public databases), and the score is 0.89, so the credibility is high.
  • the relative quantitative value of the metabolite is 8495.393221, and the relative quantitative value in GXP107 is 5096.885985.
  • the relative quantitative value of compound FT02707 in sample GXP104 is 12386.06788; the relative quantitative value of compound FT05421 in sample GXP104 is 2252.548371.
  • liquid chromatography-mass spectrometry data is processed.
  • gas chromatography-mass spectrometry data can also be processed by this method, and the same technical effect can be obtained.
  • Missing value filling: integrate the relevant area of the mzXML according to the unified coordinates. Missing value filling means integrating the area according to the coordinates that have been found.
  • The processing results of Example 1 and Comparative Example 1 are compared as follows:
  • Example 1 found 23799 features (feature non-missing values: 21273), whereas Comparative Example 1 found 4289 features (feature non-missing values: 4042). This shows that the technical solution of the present invention finds more features, with fewer feature missing values.
  • FIG. 6 shows the distribution of the number of feature missing values in Example 1. There are 101 samples in total, and 85% of the features have fewer than 20 missing values.
  • Figure 7 shows a comparison of the coefficient of variation (CV, standard deviation divided by the mean) between the samples of Example 1 and Comparative Example 1.
  • the median line (the first straight line from left to right in the coordinate system, parallel to the ordinate) represents the median, and the quartile line (the second straight line from left to right in the coordinate system) represents the upper quartile (75%); that is, the 50% of features to the left of the median line have CV values less than the value corresponding to the median line.
  • FIG. 8 shows the PCA results of Example 1 and Comparative Example 1; the consistency between the samples of Example 1 is better than that of Comparative Example 1.
  • the proportion that can be explained by PC1 and PC2 in Example 1 is greatly increased.
  • the distinction between experimental samples (solid circles) and QC samples (dashed circles) is also more obvious.
  • the beneficial effect obtained in Example 1 is also manifested in the full use of the MS2 information of the samples. On the one hand, this greatly improves the identification rate of MS2. On the other hand, it yields an MS2 database that can be used to evaluate new MS2 similarity algorithms.
  • FIG. 10 shows the mz and RT distributions of the MS2 precursor ions corresponding to FT08341 in Example 1. (1) If mz and RT are distributed in a very narrow range, it can be concluded that the spectra belong to the same precursor ion, so they can be used to evaluate the MS2 similarity algorithm. (2) If the range of mz and RT (mainly RT) is relatively wide, then, given an MS2 similarity algorithm, the corresponding MS2 spectra can be used to evaluate whether the precursor ions are the same, which assists in judging the rationality of peak merging.
  • Figure 11 shows the similarity of 35 MS2 spectra of FT08341 in Example 1.
  • the similarity comparison of multiple MS2s of the same feature can be used to help judge the effect of combining peaks into features.
  • a fixed reference sample can ensure that subsequent samples are comparable in time.
  • This method merges peaks based on the CentWave algorithm, and effectively utilizes CentWave's positioning of the maximum response area.
  • the merged peak covers a larger area, so quantification can be more accurate even when only one sample is tested (this is effective for metabolites with poor chromatographic peak shapes); the larger coverage area is also more compatible with subsequent samples, effectively reducing the impact of retention time (RT) drift.
  • the present invention has at least the following beneficial effects:
  • the invention constructs a feature database, fixes a reference sample, and unifies the time axis, which ensures that subsequent samples are comparable in time, so that metabolome data processing can make effective use of the complementarity of sample information between different batches and effectively improve the repeatability and coverage of metabolite detection.
  • the present invention performs peak merging in the process of constructing the feature database, and the merged peaks cover a larger area, so that quantification is more accurate even when only one sample is detected and remains effective for metabolites with poor chromatographic peak shapes; the larger coverage area is also more compatible with subsequent samples and effectively reduces the impact of retention time (RT) drift.
  • the present invention effectively improves the analysis efficiency of samples, so that subsequent samples are comparable in time, and the samples do not need to be rolled back, and can be widely used in business.
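
The median-based identification described in the items above (dot-product MS2 scoring, then selecting the standard compound with the largest median similarity and comparing it with a cut-off) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the spectrum representation, the binning width, the function names and the default cut-off of 0.5 (the lower end of the 0.5 to 1 range mentioned above) are all assumptions made for the example.

```python
import math
from statistics import median

def binned(spectrum, bin_width=0.01):
    """Sum intensities into m/z bins; spectrum is [(mz, intensity), ...] (assumed format)."""
    vec = {}
    for mz, intensity in spectrum:
        key = round(mz / bin_width)
        vec[key] = vec.get(key, 0.0) + intensity
    return vec

def dot_product_similarity(spec_a, spec_b, bin_width=0.01):
    """Normalized dot product (cosine) of two binned spectra, in [0, 1]."""
    a, b = binned(spec_a, bin_width), binned(spec_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def identify_feature(feature_ms2, standards, cutoff=0.5):
    """Pick the standard compound with the largest median similarity.

    feature_ms2: list of MS2 spectra belonging to one feature.
    standards:   {compound_id: reference MS2 spectrum} (hypothetical structure).
    Returns (best compound id, its median similarity, whether it exceeds the cut-off).
    """
    medians = {
        cid: median(dot_product_similarity(s, ref) for s in feature_ms2)
        for cid, ref in standards.items()
    }
    best = max(medians, key=medians.get)
    return best, medians[best], medians[best] > cutoff
```

Using the median over all MS2 spectra of a feature, rather than a single representative spectrum, makes the score less sensitive to one unusually good or bad spectrum.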

Abstract

Biometabolomics data processing and analysis methods and apparatuses, and an application thereof. The biometabolomics data processing method comprises a step of integrating liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of a plurality of biological samples to form a feature database. The integration step comprises: S11, randomly selecting one of the plurality of biological samples as a reference sample, and correcting the timelines of other samples one by one according to the timeline of the reference sample; S12, performing peak recognition processing of a primary mass spectrometry ion peak on the corrected samples one by one to obtain a plurality of recognition feature peaks; and S13, merging the plurality of recognition feature peaks according to the sample information complementation principle to obtain a feature database of the plurality of biological samples. The processing method can realize integration of mega-scale metabolomics data, and can achieve data correction and data integration in batches or of a single sample, without being affected by detection batches.

Description

Biological metabolomics data processing method, analysis method, device and application
Technical field
The present invention relates to the technical field of metabolomics, and in particular to a biological metabolomics data processing method, analysis method, device and application.
Background art
Metabolomics is a discipline that emerged after genomics and proteomics. It is an important part of systems biology and mainly examines the dynamic changes of all small-molecule metabolites and their contents before and after a biological system is stimulated or perturbed. Through overall qualitative and quantitative analysis of all small-molecule metabolites in an organism, the relationship between metabolites and physiological and pathological changes can be explored and discovered. Studies have shown that the metabolome has important application value in early disease diagnosis, biomarker discovery, drug screening, toxicity evaluation, sports medicine, nutrition and other fields.
With the rapid development of science and technology, research and detection methods for the metabolome keep emerging. At present, the most widely used and most powerful is liquid chromatography-mass spectrometry (LC-MS). In recent years, LC-MS technology has been further improved, and it is increasingly applied to large-scale sample detection. As the number of samples increases, a series of problems arises. For example, large-scale samples take a long time to measure, and during long runs the instrument may suffer reduced sensitivity, retention time drift and similar effects. Researchers therefore often run large-scale samples in batches, which keeps the instrument in good working condition but creates another problem: the metabolome data of different samples and different batches contain random errors and systematic errors, cannot be compared directly, and must be integrated. Some methods exist for integrating data across samples and batches; a common one is XCMS, which enables multi-sample metabolomics data analysis.
However, using methods such as XCMS to integrate metabolome data from different samples and batches also has problems and limitations. These methods currently require all sample data to be put together for integration; batches or single samples cannot be integrated separately. A fixed number of samples can be processed, with the processing time depending on the number of samples. The drawback is that processing time and difficulty grow with the number of samples; when the number of samples is very large, or new samples continually need to be integrated, this approach may no longer be suitable, and it is not conducive to commercial application. Existing methods also have other shortcomings: they cannot effectively exploit the complementarity of sample information between batches. Samples from different batches each have their own coordinates, so the information is hard to compare and hard to complement, some information is lost, and the repeatability and coverage of metabolite detection are reduced.
To solve the above problems, the present invention provides a biological metabolomics data processing method, analysis method and device that effectively address the inability, during metabolome data processing, to exploit the complementarity of sample information between batches, which leads to poor repeatability and reduced coverage of metabolite detection.
Summary of the invention
The present invention aims to provide a biological metabolomics data processing method, analysis method, device and application suitable for processing larger-scale metabolomics data.
To achieve the above objective, according to one aspect of the present invention, a biological metabolomics data processing method is provided. The biological metabolomics data includes liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data; the liquid chromatography-mass spectrometry data includes primary mass spectrometry data, and the gas chromatography-mass spectrometry data includes primary mass spectrometry data. The biological metabolomics data processing method includes a step of integrating the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database, and the integration step includes:
S11: arbitrarily selecting one of the multiple biological samples as a reference sample, and correcting the time axes of the other samples one by one according to the time axis of the reference sample;
S12: for each corrected sample, performing peak identification of the primary mass spectrum ion peaks one by one to obtain multiple identification characteristic peaks; and
S13: merging the multiple identification characteristic peaks according to the principle of sample information complementarity to obtain a feature database of the multiple biological samples.
进一步地,S13中:如果多个识别特征峰的[mzmin,mzmax]区域重叠或相邻,且[rtmin,rtmax]区域重叠或相邻,则将多个识别特征峰合并为一个特征峰。Further, in S13: if the [mzmin, mzmax] regions of the multiple identification feature peaks overlap or are adjacent, and the [rtmin, rtmax] regions overlap or are adjacent, then the multiple identification feature peaks are merged into one feature peak.
进一步地,S13包括:Further, S13 includes:
S131: determine whether the [mzmin, mzmax] regions of the multiple identified feature peaks overlap or are adjacent. If they overlap, go to S133. If they do not overlap, further determine whether they are adjacent: if the gap between the [mzmin, mzmax] regions of the multiple identified feature peaks is smaller than a first preset threshold, the regions are judged to be adjacent and the method goes to S133. If the regions neither overlap nor are adjacent, the multiple identified feature peaks are judged to be independent feature peaks;
S132: determine whether the [rtmin, rtmax] regions of the multiple identified feature peaks overlap or are adjacent. If they overlap, go to S133. If they do not overlap, further determine whether they are adjacent: if the gap between the [rtmin, rtmax] regions of the multiple identified feature peaks is smaller than a second preset threshold, the regions are judged to be adjacent and the method goes to S133. If the regions neither overlap nor are adjacent, the multiple identified feature peaks are judged to be independent feature peaks;
S133: if the multiple identified feature peaks satisfy both the overlap-or-adjacency condition of S131 and the overlap-or-adjacency condition of S132, merge them into one feature peak;
S134: generate a feature list from the data of all feature peaks, thereby obtaining the feature database.
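Steps S131 to S134 can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the `Peak` class, the function names and the default tolerances (`mz_tol=0.01`, `rt_tol=10.0`, taken from the lower ends of the preferred threshold ranges) are assumptions made for illustration, and the merged feature is assumed to be the bounding box of the peaks it absorbs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """A peak or feature box [mzmin, mzmax, rtmin, rtmax] (hypothetical type)."""
    mzmin: float
    mzmax: float
    rtmin: float
    rtmax: float


def gap(lo1, hi1, lo2, hi2):
    """Distance between two closed intervals; zero or negative means overlap."""
    return max(lo1, lo2) - min(hi1, hi2)


def mergeable(a, b, mz_tol=0.01, rt_tol=10.0):
    # S131: m/z regions overlap (gap <= 0) or are adjacent (gap < first threshold).
    mz_ok = gap(a.mzmin, a.mzmax, b.mzmin, b.mzmax) < mz_tol
    # S132: RT regions overlap or are adjacent (gap < second threshold).
    rt_ok = gap(a.rtmin, a.rtmax, b.rtmin, b.rtmax) < rt_tol
    # S133: both conditions must hold.
    return mz_ok and rt_ok


def union(a, b):
    """Bounding box of two peaks (assumed meaning of 'merged into one')."""
    return Peak(min(a.mzmin, b.mzmin), max(a.mzmax, b.mzmax),
                min(a.rtmin, b.rtmin), max(a.rtmax, b.rtmax))


def merge_peaks(peaks, mz_tol=0.01, rt_tol=10.0):
    """S134: return the feature list obtained after merging all peaks."""
    features = []
    for current in peaks:
        merged_any = True
        while merged_any:  # re-scan, since a grown box may reach other features
            merged_any = False
            for f in features:
                if mergeable(current, f, mz_tol, rt_tol):
                    features.remove(f)
                    current = union(current, f)
                    merged_any = True
                    break
        features.append(current)
    return features
```

Because `merge_peaks` only needs the current feature list and the new peaks, the same function could in principle be called again when peaks from a later sample or batch arrive, which is one way to read the batch-wise integration described in this document.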
Further, the first preset threshold is set according to instrument parameters, and the second preset threshold is set according to the maximum time deviation observed in retention time correction; preferably, the first preset threshold is set to 0.01 to 0.015 Da, and the second preset threshold is set to 10 to 15.
Further, the mass spectrometry data also include secondary mass spectrometry data, and S13 further includes:
S135: align the secondary mass spectrometry data of the multiple biological samples to the feature database generated in S134, wherein the peak merging is judged to be valid when the alignment rate is greater than or equal to a third preset threshold.
Further, the third preset threshold is set to 40%.
Further, the third preset threshold is set to 50%.
Further, the third preset threshold is set to 60%.
Further, the third preset threshold is set to 80%.
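The validity check of S135 can be sketched as follows. The text does not define how a secondary spectrum is aligned to the feature database, so this sketch assumes that a spectrum counts as aligned when its precursor m/z and retention time fall inside at least one feature box; the function names and the tuple layout are hypothetical.

```python
def alignment_rate(precursors, features):
    """Fraction of MS2 spectra whose precursor falls inside some feature box.

    precursors: list of (mz, rt) pairs, one per MS2 spectrum.
    features:   list of (mzmin, mzmax, rtmin, rtmax) tuples.
    """
    if not precursors:
        return 0.0
    hits = sum(
        any(mzmin <= mz <= mzmax and rtmin <= rt <= rtmax
            for mzmin, mzmax, rtmin, rtmax in features)
        for mz, rt in precursors
    )
    return hits / len(precursors)


def merging_is_valid(precursors, features, threshold=0.40):
    """S135: merging is judged valid when the rate reaches the third threshold."""
    return alignment_rate(precursors, features) >= threshold
```

The `threshold` argument corresponds to the third preset threshold, for which the text gives 40%, 50%, 60% and 80% as alternatives.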
Further, the mass spectrometry data also include secondary mass spectrometry data, and S11 further includes performing retention time correction on the primary and secondary mass spectrometry data; preferably, the retention time correction is performed using the Obiwarp algorithm.
Further, the peak detection algorithm is the CentWave algorithm, the matchedFilter algorithm or the mzMine algorithm.
Further, the parameter settings of the peak detection algorithm include: ppm: the resolution of the instrument; peak width: set to 2 to 30; noise: set to 0; signal-to-noise ratio: set to 10.
Further, the biological samples include body fluids, tissues or cells of humans or animals; roots, stems, leaves, fruits or seeds of plants; or cell culture media of microorganisms; wherein the body fluids include urine, blood, saliva, cerebrospinal fluid or amniotic fluid, the tissues include organ tissue, muscle tissue or tumor tissue, and the cells include stem cells, somatic cells, tumor cells or microbial cells.
According to another aspect of the present invention, a method for analyzing biological metabolomics data is provided. The analysis method includes, in order, a step of processing biological metabolomics data and a step of qualitatively identifying metabolites from secondary mass spectrometry data, wherein the biological metabolomics data processing is performed using any of the biological metabolomics data processing methods of the present invention described above.
Further, the step of qualitatively identifying metabolites from secondary mass spectrometry data includes:
S21: obtain the mass-to-charge ratio data of each standard compound;
S22: arbitrarily select one feature value from the feature database obtained by the biological metabolomics data processing, find all secondary mass spectrometry mass-to-charge ratio data corresponding to that feature value, and, based on all of these data, find the standard compounds that match them;
S23: taking all the secondary mass spectrometry mass-to-charge ratio data corresponding to the feature value selected in S22 as one party, and the secondary mass spectrometry mass-to-charge ratio data of the matched standard compound found in S22 as the other party, score the similarity between the two by calculating a dot-product score, and qualitatively identify the metabolite according to the score.
Further, S23 includes: for each of the multiple matched standard compounds, calculating the median of its similarities to the multiple secondary mass spectra, and selecting the compound with the largest median; preferably, whether the compound is a match is decided by whether its median is greater than a cutoff value.
Further, the mass-to-charge ratio data of the standard compounds are obtained from existing databases, including NISTlib, HMDB or METLIN.
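The scoring in S23 can be sketched in Python under stated assumptions. The text names a dot-product score and a median-based selection but gives no formula, so this sketch uses a normalized dot product (cosine similarity) over spectra binned by m/z; the bin width, the `cutoff` value of 0.5 and all function names are hypothetical choices, not values from the source.

```python
import math
from statistics import median


def binned(spectrum, bin_width=0.01):
    """Sum fragment intensities into m/z bins. spectrum: list of (mz, intensity)."""
    vec = {}
    for mz, intensity in spectrum:
        key = round(mz / bin_width)
        vec[key] = vec.get(key, 0.0) + intensity
    return vec


def dot_product_score(spec_a, spec_b, bin_width=0.01):
    """Normalized dot product of two binned spectra, in the range 0 to 1."""
    a, b = binned(spec_a, bin_width), binned(spec_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_match(ms2_spectra, library, cutoff=0.5):
    """Pick the standard compound with the largest median similarity.

    ms2_spectra: non-empty list of MS2 spectra belonging to one feature.
    library:     dict mapping compound name to its reference MS2 spectrum.
    Returns (name, median); name is None if the median is not above the cutoff.
    """
    medians = {
        name: median(dot_product_score(s, ref) for s in ms2_spectra)
        for name, ref in library.items()
    }
    name = max(medians, key=medians.get)
    best = medians[name]
    return (name, best) if best > cutoff else (None, best)
```

Using the median rather than the maximum makes the decision less sensitive to a single MS2 spectrum that happens to match well or badly.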
Further, the analysis method also includes a step of quantifying biological metabolites.
Further, the step of quantifying biological metabolites includes:
S31: correct the time axis of the sample to be quantified according to the time axis of the reference sample;
S32: for the sample to be quantified, integrate the corresponding feature regions of the established feature database to obtain relative quantification results for the biological metabolites.
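Step S32 can be sketched as follows, assuming the sample's time axis has already been corrected as in S31. The text does not specify the integration rule, so the trapezoidal rule over the extracted ion trace is an assumption, as are the function name and the data layout.

```python
def integrate_feature(scans, feature):
    """Integrate MS1 intensity inside one feature box.

    scans:   list of (rt, [(mz, intensity), ...]) sorted by rt, with rt
             already corrected against the reference sample (S31).
    feature: (mzmin, mzmax, rtmin, rtmax) taken from the feature database.
    """
    mzmin, mzmax, rtmin, rtmax = feature
    # Extracted ion trace: summed intensity within the m/z window per scan.
    trace = [
        (rt, sum(i for mz, i in points if mzmin <= mz <= mzmax))
        for rt, points in scans
        if rtmin <= rt <= rtmax
    ]
    # Trapezoidal integration over retention time.
    area = 0.0
    for (t0, y0), (t1, y1) in zip(trace, trace[1:]):
        area += 0.5 * (y0 + y1) * (t1 - t0)
    return area
```

Because the integration window comes from the feature database rather than from peaks detected in the sample itself, a value is obtained even when peak detection would have missed the peak in this particular sample.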
According to yet another aspect of the present invention, there is provided use of the above biological metabolomics data processing method or the above biological metabolomics data analysis method in the identification of vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides.
According to a further aspect of the present invention, a method for detecting vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides is provided. The detection method includes: performing liquid chromatography-mass spectrometry and/or gas chromatography-mass spectrometry on a biological sample to obtain liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data; processing the liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data of the biological sample using any of the above biological metabolomics data processing methods or biological metabolomics data analysis methods to obtain data results; and deriving the vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates or short peptides from the data results.
According to yet another aspect of the present invention, a biological metabolomics data processing apparatus is provided. The biological metabolomics data include liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data; the liquid chromatography-mass spectrometry data include primary mass spectrometry data, and the gas chromatography-mass spectrometry data include primary mass spectrometry data. The biological metabolomics data processing apparatus includes a database generation module that integrates the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database, and the database generation module includes:
a time axis correction submodule, configured to arbitrarily select one of the multiple biological samples as a reference sample and to correct the time axes of the other samples one by one according to the time axis of the reference sample;
a feature peak identification submodule, configured to perform peak detection on the primary mass spectrometry ion peaks of each corrected sample, one sample at a time, to obtain multiple identified feature peaks; and
a feature database forming submodule, configured to merge the multiple identified feature peaks according to the principle of complementary sample information, to obtain the feature database of the multiple biological samples.
Further, the feature database forming submodule includes a data integration unit, and the data integration unit is configured to merge into one feature peak multiple identified feature peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent.
Further, the feature database forming submodule includes a first judging unit, a second judging unit, a data integration unit and a feature database forming unit:
wherein the first judging unit is configured to determine whether the [mzmin, mzmax] regions of the multiple identified feature peaks overlap. If they overlap, processing passes to the data integration unit. If they do not overlap, the unit further determines whether they are adjacent: if the gap between the [mzmin, mzmax] regions of the multiple identified feature peaks is smaller than the first preset threshold, the regions are judged to be adjacent and processing passes to the data integration unit. If the regions neither overlap nor are adjacent, the multiple identified feature peaks are judged to be independent feature peaks;
the second judging unit is configured to determine whether the [rtmin, rtmax] regions of the multiple identified feature peaks overlap or are adjacent. If they overlap, processing passes to the data integration unit. If they do not overlap, the unit further determines whether they are adjacent: if the gap between the [rtmin, rtmax] regions of the multiple identified feature peaks is smaller than the second preset threshold, the regions are judged to be adjacent and processing passes to the data integration unit. If the regions neither overlap nor are adjacent, the multiple identified feature peaks are judged to be independent feature peaks;
the data integration unit is configured to merge into one feature peak multiple identified feature peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent;
the feature database forming unit is configured to generate a feature list from the data of all feature peaks, thereby obtaining the feature database.
Further, the first preset threshold is set according to instrument parameters, and the second preset threshold is set according to the maximum time deviation observed in retention time correction; preferably, the first preset threshold is set to 0.01 to 0.015 Da, and the second preset threshold is set to 10 to 15.
Further, the mass spectrometry data also include secondary mass spectrometry data, and the biological metabolomics data processing apparatus further includes a peak merging validity verification submodule, configured to align the secondary mass spectrometry data of the multiple biological samples to the feature database, wherein the peak merging is judged to be valid when the alignment rate is greater than or equal to a third preset threshold.
Further, the third preset threshold is set to 40%.
Further, the third preset threshold is set to 50%.
Further, the third preset threshold is set to 60%.
Further, the third preset threshold is set to 80%.
Further, the mass spectrometry data also include secondary mass spectrometry data, and the time axis correction submodule is further configured to perform retention time correction on the primary and secondary mass spectrometry data; preferably, the retention time correction is performed using the Obiwarp algorithm.
Further, the peak detection algorithm is the CentWave algorithm, the matchedFilter algorithm or the mzMine algorithm.
Further, the parameter settings of the peak detection algorithm include: ppm: the resolution of the instrument; peak width: set to 2 to 30; noise: set to 0; signal-to-noise ratio: set to 10.
According to a further aspect of the present invention, an apparatus for analyzing biological metabolomics data is provided. The analysis apparatus includes a module configured to process biological metabolomics data and a module configured to qualitatively identify metabolites from secondary mass spectrometry data, wherein the module configured to process biological metabolomics data is any of the biological metabolomics data processing apparatuses described above.
Further, the module configured to qualitatively identify metabolites from secondary mass spectrometry data includes:
a standard compound mass-to-charge ratio data acquisition submodule, configured to obtain the mass-to-charge ratio data of each standard compound;
a standard compound matching submodule, configured to arbitrarily select one feature value from the feature database obtained by the biological metabolomics data processing, find all secondary mass spectrometry mass-to-charge ratio data corresponding to that feature value, and, based on all of these data, find the standard compounds that match them;
a scoring and identification submodule, configured to take all the secondary mass spectrometry mass-to-charge ratio data corresponding to the feature value selected by the standard compound matching submodule as one party, and the secondary mass spectrometry mass-to-charge ratio data of the matched standard compound found by the standard compound matching submodule as the other party, score the similarity between the two by calculating a dot-product score, and qualitatively identify the metabolite according to the score.
Further, the scoring and identification submodule is configured to calculate, for each of the multiple matched standard compounds, the median of its similarities to the multiple secondary mass spectra, and to select the compound with the largest median; preferably, whether the compound is a match is decided by whether its median is greater than a cutoff value.
Further, the mass-to-charge ratio data of the standard compounds are obtained from existing databases, including NISTlib, HMDB or METLIN.
Further, the analysis apparatus also includes a module configured to quantify biological metabolites.
Further, the module configured to quantify biological metabolites includes:
a time axis correction submodule, configured to correct the time axis of the sample to be quantified according to the time axis of the reference sample;
a relative quantification submodule for biological metabolites, configured to integrate, for the sample to be quantified, the corresponding feature regions of the established feature database to obtain relative quantification results for the biological metabolites.
By implementing the above technical methods, the present invention has at least the following beneficial effects:
By applying the technical solution of the present invention, that is, by constructing a feature database, unifying the time axis, and merging peaks according to the principle of complementary information between samples, very large-scale metabolome data can be integrated; data correction and data integration can be carried out batch by batch or for a single sample, independent of the detection batch; and the solution is suitable for commercial testing.
The present invention constructs a feature database and fixes one reference sample to unify the time axis, which ensures that subsequent samples are comparable in time, so that complementary sample information across batches is used effectively during metabolome data processing, which improves the repeatability and coverage of metabolite detection.
The present invention merges peaks while constructing the feature database. The merged peaks cover a larger region, so that quantification is more accurate even when only one sample is measured, and the approach still works well for metabolites with poor chromatographic peak shapes. The larger covered region is also more compatible with subsequent samples and reduces the effect of retention time (RT) shifts.
Once the feature database has been established, the present invention improves the efficiency of sample analysis: subsequent samples are comparable in time and samples do not need to be rolled back, so the invention can be widely used commercially.
Description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Figure 1 is a schematic flow chart of constructing a feature database in an embodiment of the present invention;
Figure 2 is a schematic flow chart of merging identified feature peaks in an embodiment of the present invention;
Figure 3 shows the retention time correction plot of one sample in Example 1;
Figure 4 shows the ionization forms of the 16 standard compounds in Example 1;
Figure 5 shows the similarity comparison between the 35 MS2 spectra of Example 1 and the 16 matched standard compounds;
Figure 6 shows the distribution of the number of missing feature values in Example 1;
Figure 7 shows a comparison of the coefficient of variation (CV) between samples for Example 1 and Comparative Example 1;
Figure 8 shows the PCA results of Example 1 and Comparative Example 1;
Figure 9 shows a comparison of the number of metabolites identified in Example 1 and Comparative Example 1;
Figure 10 shows the mz and RT distribution of the MS2 parent ions corresponding to FT08341 in Example 1; and
Figure 11 shows the similarity of the 35 MS2 spectra of FT08341 in Example 1.
Detailed description
It should be noted that the embodiments in this application and the features in the embodiments can be combined with one another provided they do not conflict. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
The abbreviations and terms used in the present invention are explained as follows:
Metabolome: the dynamic totality of metabolites in an organism. The term usually covers only small-molecule metabolites with a relative molecular mass below about 1000 Da (Da: dalton).
Mass spectrometry (MS): a method in which the analyte is ionized and the moving ions are separated by electric and magnetic fields according to their mass-to-charge ratios and then detected.
Parent ion: also called precursor ion; an ion that can dissociate further to produce fragment ions.
Product ion: a fragment ion obtained by high-energy fragmentation of a molecular ion (parent ion).
Primary mass spectrometry (MS1): detects the mass-to-charge ratio and intensity of all charged ions to form a primary spectrum; the signals in the primary spectrum are parent ion signals.
Secondary mass spectrometry (MS2): selects parent ions in a defined manner, dissociates them further, and analyzes the mass-to-charge ratio and intensity of the resulting product ions to form a secondary spectrum.
Mass-to-charge ratio (mz): the ratio of the mass of a charged ion to its charge; it is a physical property of the ion and has a fixed value. Because instrument resolution is limited, the measured mz fluctuates.
Retention time (RT): the time from sample injection until the concentration of a separated sample component reaches its maximum after the column, that is, the time from injection to the apex of that component's chromatographic peak. For a given separation column, the retention time of a component (molecular ion) depends on its physical and chemical properties.
Ion peaks (peaks): ion peaks in a given sample, expressed as [mzmin, mzmax, rtmin, rtmax].
Features: have the same representation as peaks, [mzmin, mzmax, rtmin, rtmax]. Unlike a peak, a feature can represent the molecular ion as a whole (a peak is one part of the molecular ion, and one molecular ion can have multiple peaks). A feature can be obtained by merging multiple peaks from one sample or multiple peaks from multiple samples.
PPM: parts per million, a way of expressing a ratio.
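To make the PPM definition concrete, the following short Python sketch converts between a relative tolerance in ppm and an absolute m/z tolerance in Da. The function names and the numeric values are illustrative assumptions only; they are not instrument settings taken from this document.

```python
def ppm_to_da(mz, ppm):
    """Absolute m/z tolerance (Da) equal to `ppm` parts per million of `mz`."""
    return mz * ppm / 1e6


def ppm_error(observed_mz, theoretical_mz):
    """Relative deviation of an observed m/z from its theoretical value, in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# Illustrative values: 20 ppm at m/z 500 is an absolute window of about 0.01 Da.
tolerance = ppm_to_da(500.0, 20.0)
```

A ppm tolerance therefore widens with m/z, whereas a fixed Da tolerance does not.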
With LC-MS technology, metabolome measurements of large sample sets are currently performed sequentially and in batches, so there are deviations between samples and between batches, and the sample data from the same batch and from different batches must be integrated before any comparative analysis. Some existing tools for the integrated analysis of large-scale metabolomics data (for example, XCMS) have drawbacks: they need all sample data to be pooled for integration and cannot integrate data batch by batch or one sample at a time, and they also have shortcomings in the qualitative and quantitative analysis of metabolites.
In view of these shortcomings of the prior art, the present invention proposes a new approach to metabolome data integration that is applicable to large-scale metabolome data analysis, allows data correction and data integration batch by batch or for a single sample, is not affected by the detection batch, and also improves metabolite coverage and the accuracy of qualitative and quantitative analysis.
According to a typical embodiment of the present invention, a method for processing biological metabolomics data is provided. The biological metabolomics data include liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data; the liquid chromatography-mass spectrometry data include primary mass spectrometry data, and the gas chromatography-mass spectrometry data include primary mass spectrometry data. The biological metabolomics data processing method includes a step of integrating the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database, and the integration step includes:
S11: arbitrarily select one of the multiple biological samples as a reference sample, and correct the time axes of the other samples one by one according to the time axis of the reference sample;
S12: for each corrected sample, perform peak detection on the primary mass spectrometry ion peaks, one sample at a time, to obtain multiple identified feature peaks;
and S13: according to the principle of complementary sample information, merge the multiple identified feature peaks to obtain the feature database of the multiple biological samples.
In the technical solution of the present invention, one sample is first selected as the reference sample and the time axes of all other samples are corrected against it. This establishes a unified coordinate axis, so that the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of the samples are comparable in time. On the corrected time axis, peak detection is performed on the primary mass spectrometry ion peaks of each sample, and the peaks are then merged according to the principle of complementary information between samples to construct the feature database, which makes it possible to integrate very large-scale metabolome data. Because every sample is time-axis corrected against the same reference sample, data correction and data integration can be carried out batch by batch or for a single sample without being affected by the detection batch, which makes the method suitable for commercial testing.
In an embodiment of the present invention, the mass spectrometry data also include secondary mass spectrometry data, and S11 further includes performing retention time correction on the primary and secondary mass spectrometry data, which further improves the accuracy of the mass spectrometry data; preferably, the retention time correction uses the Obiwarp algorithm, which has the advantages of fast computation and high data processing accuracy.
In an embodiment of the present invention, S13 includes: if the [mzmin, mzmax] regions of multiple identified feature peaks overlap or are adjacent, and their [rtmin, rtmax] regions overlap or are adjacent, merging the multiple identified feature peaks into one feature peak. Preferably, S13 includes: S131, determining whether the [mzmin, mzmax] regions of the multiple identified feature peaks overlap; if they overlap, going to S133; if they do not overlap, further determining whether they are adjacent, wherein if the gap between the [mzmin, mzmax] regions of the multiple identified feature peaks is smaller than the first preset threshold, the regions are judged to be adjacent and the method goes to S133; if the regions neither overlap nor are adjacent, the multiple identified feature peaks are judged to be independent feature peaks; S132, determining whether the [rtmin, rtmax] regions of the multiple identified feature peaks overlap or are adjacent; if they overlap, going to S133; if they do not overlap, further determining whether they are adjacent, wherein if the gap between the [rtmin, rtmax] regions of the multiple identified feature peaks is smaller than the second preset threshold, the regions are judged to be adjacent and the method goes to S133; if the regions neither overlap nor are adjacent, the multiple identified feature peaks are judged to be independent feature peaks; S133, if the multiple identified feature peaks satisfy both the overlap-or-adjacency condition of S131 and the overlap-or-adjacency condition of S132, merging them into one feature peak; S134, generating a feature list from the data of all feature peaks, thereby obtaining the feature database. Feature peaks merged in this way cover a larger region, so quantification is more accurate even when only one sample is measured (this is particularly effective for metabolites with poor chromatographic peak shapes), and the larger covered region is more compatible with subsequent samples and reduces the effect of retention time (RT) shifts.
Preferably, the peak identification algorithm is the CentWave algorithm, the matchedFilter algorithm, or the mzMine algorithm, and more preferably the CentWave algorithm, because it improves sensitivity while limiting error and therefore finds the largest number of identified feature peaks with the highest accuracy. Merging peaks on top of the CentWave result makes effective use of CentWave's localization of the region of maximum response. In the present invention, the algorithm parameters follow the principle of "maximizing detection sensitivity". Preferably, the parameter settings of the peak identification algorithm include: ppm: the resolution of the instrument; peak width: 2 to 30; noise: 0; signal-to-noise ratio: 10.
In one embodiment of the present invention, the first preset threshold is set according to the instrument parameters, and the second preset threshold is set according to the maximum time deviation observed during retention time correction. Preferably, the first preset threshold is set to 0.01 to 0.015 Da, and more preferably to 0.01 Da, 0.011 Da, 0.012 Da, 0.013 Da, 0.014 Da, or 0.015 Da; the second preset threshold is preferably set to 10 to 15, and more preferably to 10, 11, 12, 13, 14, or 15. These settings improve the effectiveness of peak merging and thereby the accuracy of the feature database.
In one embodiment of the present invention, the mass spectrometry data preferably also include secondary mass spectrometry data, and S13 further includes: S135, mapping the secondary mass spectrometry data of the multiple biological samples onto the feature database generated in S134 to help judge the effectiveness of peak merging; the higher the proportion of secondary mass spectrometry data that maps onto the feature database, the more effective the peak merging. In one embodiment of the present invention, the secondary mass spectrometry data of the multiple biological samples are mapped onto the feature database generated in S134, and peak merging is judged to be effective when the mapping rate is greater than or equal to a third preset threshold. In one more preferred embodiment, the third preset threshold is set to 40%; in another more preferred embodiment, it is set to 50%; in another more preferred embodiment, it is set to 60%; and in another more preferred embodiment, it is set to 80%.
The biological metabolomics data processing method of the present invention is suitable for almost any biological sample that can be analyzed by liquid chromatography-mass spectrometry and/or gas chromatography-mass spectrometry. Such biological samples include, but are not limited to, human or animal body fluids, tissues, or cells; plant roots, stems, leaves, fruits, or seeds; and microbial cell culture media. Body fluids include urine, blood, saliva, cerebrospinal fluid, amniotic fluid, and the like; tissues include organ tissue, muscle tissue, tumor tissue, and the like; and cells include stem cells, somatic cells, tumor cells, microbial cells, and the like.
Within the inventive purpose of the present invention, a method for analyzing biological metabolomics data is also provided. The analysis method includes, in order, a step of biological metabolomics data processing and a step of qualitatively identifying metabolites from secondary mass spectrometry data, where the data processing is performed with any of the biological metabolomics data processing methods of the present invention described above. Because that data processing method is not affected by the measurement batch, the amount of sample data in the feature database can grow continuously, which in turn continuously improves the accuracy of qualitative metabolite identification from secondary mass spectrometry data.
According to a typical embodiment of the present invention, the step of qualitatively identifying metabolites from secondary mass spectrometry data includes:
S21, obtaining the mass-to-charge ratio data of each standard compound;
S22, selecting any one feature from the feature database obtained by the biological metabolomics data processing, finding all secondary mass spectrum mass-to-charge ratio data corresponding to that feature, and, based on all of those data, finding the set of standard compounds that match them;
S23, taking all secondary mass spectrum mass-to-charge ratio data corresponding to the feature selected in S22 as one side and the secondary mass spectrum mass-to-charge ratio data of the matched standard compounds found in S22 as the other side, scoring the similarity between the two sides by computing a dot-product score, and identifying the metabolite according to the score.
This approach effectively avoids the problem of the median being unrepresentative, and it is simple to carry out.
In one embodiment of the present invention, a general dot-product calculation may also be used to score MS2 similarity. In this approach, the multiple MS2 spectra belonging to the same feature are compared with the MS2 spectra of the standard compounds, and the feature is identified from the resulting scores.
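As a non-limiting illustration of such a dot-product score, the following Python sketch computes a normalized dot product (cosine similarity) between two centroided MS2 spectra. The function name `dot_product_score`, the fragment tolerance `mz_tol`, the greedy one-to-one fragment matching, and the absence of any intensity weighting are all assumptions made for the example; the description does not specify these details.

```python
import math


def dot_product_score(spec_a, spec_b, mz_tol=0.01):
    """Normalized dot product of two spectra given as lists of (mz, intensity).

    Fragments are paired greedily: each fragment of spec_a takes the closest
    unused fragment of spec_b within mz_tol (an assumed tolerance).
    """
    a = sorted(spec_a)
    b = sorted(spec_b)
    used = set()
    dot = 0.0
    for mz_a, int_a in a:
        best = None
        for k, (mz_b, _) in enumerate(b):
            if k in used or abs(mz_a - mz_b) > mz_tol:
                continue
            if best is None or abs(mz_a - mz_b) < abs(mz_a - b[best][0]):
                best = k
        if best is not None:
            used.add(best)
            dot += int_a * b[best][1]
    norm_a = math.sqrt(sum(i * i for _, i in a))
    norm_b = math.sqrt(sum(i * i for _, i in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


experimental = [(100.000, 1.0), (150.000, 1.0)]
reference = [(100.005, 1.0)]
score = dot_product_score(experimental, reference)
```

Because the pairing is one-to-one, the score stays between 0 and 1, with 1 meaning that the two spectra agree exactly in their matched fragments and relative intensities.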
Preferably, S23 specifically includes: for each of the matched standard compounds, computing the median of its similarity to the multiple secondary mass spectra, and selecting the compound with the largest median; more preferably, deciding whether the compound is a match according to whether its median is greater than a cut-off value. These steps take into account not only a "representative" MS2 spectrum but also the various MS2 spectra the compound may produce, which increases the degree of matching with the standard compounds.
In the present invention, the mass-to-charge ratio data of the standard compounds are obtained from existing databases such as NISTlib, HMDB, or METLIN.
In a typical embodiment of the present invention, the analysis method further includes a step of quantifying biological metabolites. After the data processing and qualitative identification steps described above, a unified time axis has been established, the retention times have been corrected, and a data-rich feature database has been obtained. This enlarges the region covered for each precursor ion (mz) as far as possible, reduces the effect of fluctuations in mass-to-charge ratio (mz) and retention time (RT), and improves the accuracy of metabolite quantification. Preferably, the quantification step includes: S31, correcting the time axis of the sample to be quantified against the time axis of the reference sample; and S32, integrating the sample to be quantified over the corresponding feature regions of the established feature database to obtain relative quantification results for the biological metabolites.
Based on the technical solution set out above, in one embodiment or example of the present invention the specific technical solution is as follows:
1. Build a features database to integrate metabolome data from multiple samples. The workflow for building the feature database is shown in Figure 1.
1) Fix a reference sample and unify the coordinate axis. Specifically, a unified coordinate axis is determined so that the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of the samples are comparable in time.
For the whole measurement process, one sample is chosen as the reference sample (the reference sample is of the same type as the test samples; it can be regarded as a standard and needs to be measured only once), and the time axes of all other samples are corrected against it. This sample is reference.xml. In other words, a single reference sample is fixed so that subsequent samples remain comparable in time.
2) Retention time correction. Specifically, each new sample first undergoes retention time (RT) correction. This step uses the Obiwarp algorithm and applies the RT correction to both the primary mass spectrometry data (MS1) and the secondary mass spectrometry data (MS2).
3) Peak identification. Specifically, on the corrected time axis, the CentWave algorithm is used to identify the primary mass spectrum ion peaks of each sample (findPeaks). Peak identification algorithms include, but are not limited to, CentWave, matchedFilter, and mzMine, with CentWave preferred. This method improves sensitivity while limiting error and finds the largest number of identified feature peaks (peak1, peak2, …, peakn) with the highest accuracy. The noise that comes with high sensitivity, and the splitting of a single ion peak into two that comes with a strict ppm setting, are left to be handled by sample information complementarity.
The algorithm parameters follow the principle of "maximizing detection sensitivity":
① ppm: set to the resolution of the instrument, according to the instrument type, which lowers the tolerance for error.
② peakwidth (peak width): set to 2 to 30. This parameter depends on the column type and the elution time and is generally about 1/10 of the elution time. The lower bound of 2 is chosen so that very narrow peaks are identified, which raises the sensitivity of findPeaks.
③ noise: set to 0. This parameter is the noise intensity; setting it to 0 raises sensitivity. The larger the noise value, the lower the sensitivity.
④ snthresh (signal-to-noise ratio): set to 10. This parameter is the signal-to-noise threshold, and the default value is used.
4) Merge peaks according to the principle of sample information complementarity. Specifically, the identified feature peaks (peaks) are merged according to this principle to produce unified coordinates, that is, the features database. The procedure is as follows (see Figure 2):
For the identified feature peaks peak1, peak2, …, peakn from multiple samples, the following determinations are made:
① Determine whether the [mzmin, mzmax] regions of the identified feature peaks overlap or are adjacent. If they overlap, go to ③. If they do not overlap, further determine whether they are adjacent: if the gap between the [mzmin, mzmax] regions of peak m+1, peak m+2, …, peak m+a is smaller than the first preset threshold, they are judged to be adjacent, and the procedure goes to ③. If they neither overlap nor are adjacent, the identified feature peaks are judged to be independent feature peaks.
② Determine whether the [rtmin, rtmax] regions of the identified feature peaks overlap or are adjacent. If they overlap, go to ③. If they do not overlap, determine whether they are adjacent: if the gap between the [rtmin, rtmax] regions of peak m+1, peak m+2, …, peak m+a is smaller than the second preset threshold absRt, they are judged to be adjacent, and the procedure goes to ③. If they neither overlap nor are adjacent, the identified feature peaks are judged to be independent feature peaks.
③ If peak m+1, peak m+2, …, peak m+a satisfy both the overlap-or-adjacency condition in ① and the overlap-or-adjacency condition in ②, they are judged to belong to the same feature peak. The coordinates of the new feature peak are the union of the coordinates of these a peaks. A feature list is then generated, which gives the feature database.
Here n and a are each independently a positive integer, m is 0 or a positive integer, and m < n.
④ Map the secondary mass spectrometry data of the multiple samples onto the feature database (each feature being the rectangular region defined by ["mzmin", "mzmax", "rtmin", "rtmax"]). This step helps judge the effectiveness of peak merging.
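The mapping in ④ can be illustrated, without limitation, by the following Python sketch, which checks each MS2 precursor against the feature rectangles and reports the fraction that falls inside at least one of them. The function names, the dictionary layout of a feature, and the default `min_rate` of 0.4 (one of the third preset threshold values mentioned in this description) are chosen for illustration only.

```python
def in_feature(feature, mz, rt):
    """True if the point (mz, rt) lies inside the feature rectangle."""
    return (feature["mzmin"] <= mz <= feature["mzmax"]
            and feature["rtmin"] <= rt <= feature["rtmax"])


def ms2_match_rate(features, ms2_precursors):
    """Fraction of MS2 precursors, given as (mz, rt), that map onto any feature."""
    if not ms2_precursors:
        return 0.0
    hits = sum(
        1 for mz, rt in ms2_precursors
        if any(in_feature(f, mz, rt) for f in features)
    )
    return hits / len(ms2_precursors)


def merge_is_effective(features, ms2_precursors, min_rate=0.4):
    """Peak merging is judged effective when the rate reaches the threshold."""
    return ms2_match_rate(features, ms2_precursors) >= min_rate


features = [{"mzmin": 100.00, "mzmax": 100.02, "rtmin": 50.0, "rtmax": 70.0}]
precursors = [(100.01, 55.0), (100.015, 69.0), (250.00, 55.0)]
rate = ms2_match_rate(features, precursors)
```

In this toy input two of the three precursors fall inside the single feature, so the merge would pass a 40% threshold but not an 80% one.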
2. Metabolite identification (qualitative).
Metabolites are identified from the secondary mass spectrometry (MS2) data as follows:
1) Obtain the mass-to-charge ratio (mz) of the standard compounds (the standard compounds are taken from existing databases, mainly NISTlib, although other public databases such as HMDB and METLIN may also be used).
2) Find the standard compounds whose mass-to-charge ratio (mz) matches that of the MS2 precursor ion (parameter setting: absMz = 0.015 Da, where Da is the dalton; absMz is set according to the instrument parameters and may be set anywhere in the range 0.01 to 0.015 Da).
3) Compare the experimentally obtained MS2 spectra with the matched standard compounds, score their similarity, and compute the dot-product score.
The results of the multiple MS2 spectra are then combined, and the best-matching standard compound is selected to identify the metabolite. This step computes, for each compound, the median of its similarity to the multiple MS2 spectra and selects the compound with the largest median. Whether the compound is accepted as a match is decided by whether its median is greater than a specified value (also called the cut-off value, which can be set between 0.5 and 1 depending on the situation).
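The candidate lookup of step 2) and the median-based selection described above can be sketched in Python as follows. This is an illustration under stated assumptions: `candidates_by_precursor` and `identify` are hypothetical names, the library is reduced to a mapping from compound name to precursor m/z, and the similarity scores are assumed to have been computed already (one score per MS2 spectrum of the feature).

```python
from statistics import median


def candidates_by_precursor(library, precursor_mz, abs_mz=0.015):
    """Standard compounds whose precursor m/z lies within abs_mz (Da) of the query."""
    return [name for name, mz in library.items()
            if abs(mz - precursor_mz) <= abs_mz]


def identify(scores_by_compound, cutoff=0.5):
    """Pick the compound with the largest median score; accept it only above cutoff.

    scores_by_compound maps a compound name to its list of similarity scores,
    one per MS2 spectrum belonging to the feature.
    Returns (name or None, best median or None).
    """
    best_name, best_median = None, None
    for name, scores in scores_by_compound.items():
        if not scores:
            continue
        m = median(scores)
        if best_median is None or m > best_median:
            best_name, best_median = name, m
    if best_median is None or best_median <= cutoff:
        return None, best_median
    return best_name, best_median


library = {"compound_A": 180.063, "compound_B": 180.100}
hits = candidates_by_precursor(library, 180.070)
scores = {"compound_A": [0.9, 0.8, 0.2], "compound_B": [0.6, 0.6, 0.6]}
result = identify(scores, cutoff=0.5)
```

Using the median means that a single poor MS2 spectrum (the 0.2 above) does not prevent an otherwise well-matched compound from being selected.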
3. Relative quantification of metabolites
After the two major steps above, the prerequisites for relative quantification are in place:
(1) The coordinates of the features, ["mzmin", "mzmax", "rtmin", "rtmax"]. These coordinates enlarge the region covered for each precursor ion (mz) as far as possible and reduce the effect of fluctuations in mass-to-charge ratio (mz) and retention time (RT).
(2) The MS2 database matched to the features, together with the MS2 identification results.
(3) For each feature, the fields ["mzmin", "mzmax", "rtmin", "rtmax", "metabolite", "adduct"], which allow the feature to be fully annotated (note: the "metabolite" and "adduct" information is obtained from a reference database such as NISTlib and is used in the qualitative identification process).
The quantification method is as follows:
1) Correct the time axis of the sample against the reference sample information (reference.xml).
2) Integrate the sample over its feature regions to obtain the relative quantification results for the metabolites.
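A minimal Python sketch of step 2) is given below for illustration. It assumes that the sample has already been RT-corrected, that it is available as a list of scans of the form (rt, [(mz, intensity), ...]) sorted by rt, and that "integration" means the trapezoidal area of the ion chromatogram extracted inside the feature rectangle; the description itself does not fix the integration rule, and `integrate_feature` is a hypothetical name.

```python
def integrate_feature(scans, feature):
    """Trapezoidal area of the signal inside one feature rectangle.

    scans: list of (rt, [(mz, intensity), ...]) sorted by rt (RT-corrected).
    feature: dict with keys mzmin, mzmax, rtmin, rtmax.
    """
    # Extracted ion chromatogram restricted to the feature's RT window.
    xic = []
    for rt, peaks in scans:
        if feature["rtmin"] <= rt <= feature["rtmax"]:
            total = sum(i for mz, i in peaks
                        if feature["mzmin"] <= mz <= feature["mzmax"])
            xic.append((rt, total))

    # Trapezoidal rule over consecutive scans (an assumed integration rule).
    area = 0.0
    for (rt0, i0), (rt1, i1) in zip(xic, xic[1:]):
        area += 0.5 * (i0 + i1) * (rt1 - rt0)
    return area


scans = [
    (0.0, [(100.0, 10.0)]),
    (1.0, [(100.0, 20.0), (300.0, 99.0)]),  # the 300.0 ion is outside the m/z window
    (2.0, [(100.0, 10.0)]),
    (5.0, [(100.0, 50.0)]),                 # outside the RT window
]
feature = {"mzmin": 99.99, "mzmax": 100.01, "rtmin": 0.0, "rtmax": 2.0}
area = integrate_feature(scans, feature)
```

Because the feature rectangle is fixed in the database, the same call can be applied to any later sample, which is what makes the areas comparable across batches.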
Within the overall inventive concept of the present invention, the present invention also provides the use of the above biological metabolomics data processing method and biological metabolomics data analysis method in the identification of vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates, or short peptides. Because the data processing method is not affected by the measurement batch, the amount of sample data in the feature database can grow continuously, which also improves the accuracy and precision with which vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates, or short peptides are identified.
Further, the present invention also provides a method for detecting vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates, or short peptides. The detection method includes: performing liquid chromatography-mass spectrometry and/or gas chromatography-mass spectrometry on a biological sample to obtain liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data; processing those data with any of the above biological metabolomics data processing methods or biological metabolomics data analysis methods to obtain data results; and deriving from the data results the type and content of the vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates, or short peptides. Likewise, owing to the advanced nature of the data processing and analysis methods of the present invention, the detection results for these compounds are also expected to be more accurate.
In addition, to facilitate implementation of the above methods, and within the inventive purpose of the present invention, a typical embodiment of the present invention provides a biological metabolomics data processing apparatus. The biological metabolomics data include liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data, each of which includes primary mass spectrometry data. The apparatus includes a database generation module that integrates the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database. The database generation module includes a time axis correction submodule, a feature peak identification submodule, and a feature database formation submodule. The time axis correction submodule is configured to select any one of the multiple biological samples as a reference sample and to correct the time axes of the other samples one by one against the time axis of the reference sample. The feature peak identification submodule is configured to perform peak identification on the primary mass spectrum ion peaks of each corrected sample, one sample at a time, to obtain multiple identified feature peaks. The feature database formation submodule is configured to merge the multiple identified feature peaks according to the principle of sample information complementarity to obtain the feature database of the multiple biological samples.
With the apparatus of the present invention, one sample is first chosen as the reference sample and the time axes of all other samples are corrected against it; that is, a unified coordinate axis is determined so that the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of the samples are comparable in time. On the corrected time axis, peak identification is performed on the primary mass spectrum ion peaks of each sample, and the peaks are then merged according to the principle of information complementarity between samples to build the feature database, which makes it possible to integrate metabolome data on a very large scale. Because every sample is corrected against the same reference sample, data correction and integration can be carried out batch by batch or sample by sample without being affected by the measurement batch, which makes the approach suitable for commercial testing.
In one embodiment of the present invention, the mass spectrometry data also include secondary mass spectrometry data, and the time axis correction submodule is further configured to perform retention time correction on both the primary and the secondary mass spectrometry data, which further improves the accuracy of the mass spectrometry data. Preferably, the Obiwarp algorithm is used for retention time correction, as it offers fast computation and high processing accuracy.
In one embodiment of the present invention, the feature database formation submodule includes a data integration unit configured to merge into one feature peak the multiple identified feature peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent. Preferably, the feature database formation submodule includes a first judgment unit, a second judgment unit, a data integration unit, and a feature database formation unit. The first judgment unit is configured to determine whether the [mzmin, mzmax] regions of the multiple identified feature peaks overlap; if they overlap, processing proceeds to the data integration unit; if they do not overlap, the unit further determines whether they are adjacent, judging them adjacent if the gap between the regions is smaller than the first preset threshold, in which case processing proceeds to the data integration unit (corresponding to S133); if the regions neither overlap nor are adjacent, the identified feature peaks are judged to be independent feature peaks. The second judgment unit is configured to determine whether the [rtmin, rtmax] regions of the multiple identified feature peaks overlap or are adjacent; if they overlap, processing proceeds to the data integration unit; if they do not overlap, the unit further determines whether they are adjacent, judging them adjacent if the gap between the regions is smaller than the second preset threshold, in which case processing proceeds to the data integration unit; if the regions neither overlap nor are adjacent, the identified feature peaks are judged to be independent feature peaks. The data integration unit is configured to merge into one feature peak the multiple identified feature peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent. The feature database formation unit generates a feature list from the data of all feature peaks, thereby obtaining the feature database. A feature peak merged in this way covers a larger region, so quantification can be more accurate even when only one sample is measured (this is particularly useful for metabolites with poor chromatographic peak shape), and the larger region accommodates subsequent samples better, effectively reducing the effect of retention time (RT) drift.
Preferably, the peak identification algorithm is the CentWave algorithm, the matchedFilter algorithm, or the mzMine algorithm, and more preferably the CentWave algorithm, because it improves sensitivity while limiting error and therefore finds the largest number of identified feature peaks with the highest accuracy. Merging peaks on top of the CentWave result makes effective use of CentWave's localization of the region of maximum response. In the present invention, the algorithm parameters follow the principle of "maximizing detection sensitivity". Preferably, the parameter settings of the peak identification algorithm include: ppm: the resolution of the instrument; peak width: 2 to 30; noise: 0; signal-to-noise ratio: 10.
In one embodiment of the present invention, the first preset threshold is set according to the instrument parameters, and the second preset threshold is set according to the maximum time deviation observed during retention time correction. Preferably, the first preset threshold is set to 0.01 to 0.015 Da, and more preferably to 0.01 Da, 0.011 Da, 0.012 Da, 0.013 Da, 0.014 Da, or 0.015 Da; the second preset threshold is preferably set to 10 to 15, and more preferably to 10, 11, 12, 13, 14, or 15. These settings improve the effectiveness of peak merging and thereby the accuracy of the feature database.
In one embodiment of the present invention, the mass spectrometry data preferably also include secondary mass spectrometry data, and the biological metabolomics data processing apparatus further includes a peak merging validation submodule configured to map the secondary mass spectrometry data of the multiple biological samples onto the feature database to help judge the effectiveness of peak merging; the higher the proportion of secondary mass spectrometry data that maps onto the feature database, the more effective the peak merging. In one embodiment of the present invention, the secondary mass spectrometry data of the multiple biological samples are mapped onto the feature database, and peak merging is judged to be effective when the mapping rate is greater than or equal to a third preset threshold. In one more preferred embodiment, the third preset threshold is set to 40%; in another more preferred embodiment, it is set to 50%; in another more preferred embodiment, it is set to 60%; and in another more preferred embodiment, it is set to 80%.
Within the inventive purpose of the present invention, an apparatus for analyzing biological metabolomics data is also provided. The analysis apparatus includes a module for biological metabolomics data processing and a module for qualitatively identifying metabolites from secondary mass spectrometry data, where the module for biological metabolomics data processing is the biological metabolomics data processing apparatus of the present invention described above. Because that apparatus is not affected by the measurement batch, the amount of sample data in the feature database can grow continuously, which in turn continuously improves the accuracy of qualitative metabolite identification from secondary mass spectrometry data.
According to a typical embodiment of the present invention, the module configured to qualitatively identify metabolites from MS2 data includes a standard-compound mass-to-charge ratio (m/z) data acquisition submodule, a standard-compound matching submodule, and a scoring identification submodule. The acquisition submodule is configured to obtain the m/z data of each standard compound. The matching submodule is configured to select any one feature from the feature database obtained by the biological metabolomics data processing, find all MS2 m/z data corresponding to that feature, and find the standard compounds that match those data. The scoring identification submodule is configured to score the similarity between, on one side, all MS2 m/z data corresponding to the feature selected in the matching submodule and, on the other side, the MS2 m/z data of the matched standard compounds found by the matching submodule, by computing a dot-product score, and to identify the metabolite according to the score. This method draws on the parameter settings of the knn algorithm and the merging approach of the density algorithm; it effectively avoids the problem of an unrepresentative median and is simple to operate.
Preferably, the scoring identification submodule is configured to compute, for each matched standard compound, the median of its similarities to the multiple MS2 spectra, and to select the compound with the largest median; more preferably, whether the compound is a match is decided by whether its median exceeds a cutoff value. This algorithm uses not only a "representative" MS2 spectrum but also the various possible MS2 spectra of the compound, which improves the match against the standard compounds.
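The dot-product scoring between one experimental MS2 spectrum and one standard-compound MS2 spectrum might be sketched as follows. The text names only a dot-product score; the greedy pairing of fragments within a fixed m/z tolerance, the normalization to a value between 0 and 1, and the names `cosine_score` and `tol` are illustrative assumptions, not the implementation described in the patent.

```python
import math

def cosine_score(spec_a, spec_b, tol=0.01):
    """Normalized dot product of two spectra given as lists of (mz, intensity).

    Each fragment in spec_a is paired with the closest still-unused fragment
    in spec_b that lies within `tol` Da (an assumed tolerance).
    """
    used = set()
    dot = 0.0
    for mz_a, int_a in spec_a:
        best, best_diff = None, tol
        for j, (mz_b, _) in enumerate(spec_b):
            diff = abs(mz_a - mz_b)
            if j not in used and diff <= best_diff:
                best, best_diff = j, diff
        if best is not None:
            used.add(best)
            dot += int_a * spec_b[best][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical spectra, for illustration only.
spectrum = [(100.0, 10.0), (150.0, 5.0)]
shifted = [(200.0, 10.0), (250.0, 5.0)]
same = cosine_score(spectrum, spectrum)
different = cosine_score(spectrum, shifted)
```

Identical spectra score close to 1 and spectra with no shared fragments score 0, which is the range in which the similarity values and the cutoff in this document are expressed.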
In the present invention, the m/z data of the standard compounds are obtained from existing databases, such as NISTlib, HMDB, or METLIN.
In a typical embodiment of the present invention, the analysis device further includes a module configured to quantify biological metabolites. The data processing and identification described above establish a unified time axis, correct the retention times, and yield a data-rich feature database. This enlarges the region covered by each precursor ion (mz) as far as possible, reduces the effect of fluctuations in the mass-to-charge ratio (mz) and retention time (RT), and improves the accuracy of metabolite quantification. Preferably, the quantification module includes a time-axis correction submodule and a relative quantification submodule: the time-axis correction submodule is configured to correct the time axis of the sample to be quantified against the time axis of the reference sample, and the relative quantification submodule is configured to integrate, for the sample to be quantified, over the corresponding feature regions of the established feature database to obtain the relative quantification results for the metabolites.
The beneficial effects of the present invention are further described below with reference to examples.
Taking 101 dried blood spot samples as an example, the metabolome data of these 101 samples were integrated, and analyzed both qualitatively and quantitatively, using the technical solution of the present invention (Example 1) and a prior-art method (Comparative Example 1), as follows.
Example 1
1. Building the features database
1) Determine a unified coordinate axis so that the samples are comparable in time. One fixed sample is chosen from all samples as the reference, and the time axes of all other samples are corrected against it; this sample is reference.xml.
2) Each new sample first undergoes retention time (RT) correction. This step uses the Obiwarp algorithm and corrects the RT of both the primary mass spectrometry data (MS1) and the secondary mass spectrometry data (MS2). The RT correction of one sample in this example is shown in Figure 3. (Note: the horizontal axis is the retention time RT, in seconds; the vertical axis is the retention time deviation, that is, how far the sample's retention time departs from that of the reference sample, in seconds. The horizontal line is the reference sample; the curves are the other samples.)
3) On the corrected time axis, run peak detection (findPeaks) on each sample with the CentWave algorithm, so as to find as many peaks as possible as accurately as possible. The noise introduced by high sensitivity, and the splitting of one peak into two caused by a strict ppm setting, are left to be resolved by the complementarity of sample information.
The algorithm parameters follow the principle of "maximizing detection sensitivity". The settings in this example are:
ppm: 10
Peakwidth: 2~30
Noise: 0
Snthresh: 10
Result: each sample has roughly 3,600 to 5,000 peaks; the 101 samples have 431,695 peaks in total.
4) Following the principle of sample-information complementarity, merge the peaks of the 101 samples to generate unified coordinates, that is, the features database. The procedure is as follows (see Figure 2):
For peak1, peak2, …, peakn from the 101 samples, make the following judgments:
① Do the [mzmin, mzmax] regions of the peaks from the 101 samples overlap or adjoin? If they overlap, go to step ③. If they do not overlap, check whether they are adjacent: if the gap between the [mzmin, mzmax] regions of peak m+1, peak m+2, …, peak m+a is smaller than the first preset threshold of 0.015 Da, they are judged adjacent; go to step ③. If they neither overlap nor are adjacent, the identified feature peaks are judged to be independent feature peaks.
② Do the [rtmin, rtmax] regions of the peaks from the 101 samples overlap or adjoin? If they overlap, go to step ③. If they do not overlap, check whether they are adjacent: set the second preset threshold absRt = 15; if the gap between the RT regions of peak m+1, peak m+2, …, peak m+a is smaller than absRt, they are judged adjacent; go to step ③. If they neither overlap nor are adjacent, the identified feature peaks are judged to be independent feature peaks.
③ If peak m+1, peak m+2, …, peak m+a satisfy both the overlap/adjacency condition of step ① and the overlap/adjacency condition of step ②, they are judged to belong to the same peak. The coordinates of the new peak are the union of those a peaks, and the feature list is generated. Here n and a are each independently positive integers, m is 0 or a positive integer, and m < n.
Result: after merging, 23,799 features are generated (the feature database).
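The merging rule of steps ① to ③ can be sketched in Python as below. This is a minimal illustration under stated assumptions: peaks are compared pairwise, groups are formed transitively with a union-find structure, the "union" of coordinates is taken as the bounding rectangle of the group, and absRt = 15 is read as seconds. The names `Peak`, `gap`, `same_feature`, and `merge_peaks` are hypothetical, and the sketch does not re-test peaks against a rectangle after it has grown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Peak:
    mzmin: float
    mzmax: float
    rtmin: float
    rtmax: float

def gap(lo1, hi1, lo2, hi2):
    """Distance between two intervals; zero or negative means they overlap."""
    return max(lo1, lo2) - min(hi1, hi2)

def same_feature(p, q, mz_tol=0.015, abs_rt=15.0):
    """True if p and q overlap or are adjacent in both mz and RT."""
    mz_ok = gap(p.mzmin, p.mzmax, q.mzmin, q.mzmax) < mz_tol
    rt_ok = gap(p.rtmin, p.rtmax, q.rtmin, q.rtmax) < abs_rt
    return mz_ok and rt_ok

def merge_peaks(peaks, mz_tol=0.015, abs_rt=15.0):
    """Group peaks that belong together and return one rectangle per group."""
    parent = list(range(len(peaks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            if same_feature(peaks[i], peaks[j], mz_tol, abs_rt):
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(peaks):
        groups.setdefault(find(i), []).append(p)

    # The merged feature takes the union (bounding rectangle) of its peaks.
    return [
        Peak(min(p.mzmin for p in g), max(p.mzmax for p in g),
             min(p.rtmin for p in g), max(p.rtmax for p in g))
        for g in groups.values()
    ]

# Hypothetical peaks: the first two are adjacent in mz (gap 0.003 Da) and in
# RT (gap 5 s); the third is far away in mz and stays independent.
peaks = [
    Peak(352.158, 352.162, 167.0, 175.0),
    Peak(352.165, 352.169, 180.0, 190.0),
    Peak(400.000, 400.010, 167.0, 175.0),
]
features = merge_peaks(peaks)
```

The pairwise comparison is quadratic in the number of peaks, which is acceptable for a sketch; with several hundred thousand peaks, as in this example, sorting by mz and comparing only neighbouring peaks would be the more practical choice.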
4) Align the MS2 spectra to the merged features (the rectangular regions defined by ["mzmin", "mzmax", "rtmin", "rtmax"]). The 101 samples contain 346,518 MS2 spectra in total, of which 279,568 align to features, an alignment rate of 80.68%, whereas the MS2 alignment rate of a single sample is only about 50%. In all, 6,916 features have corresponding MS2 spectra, and the number of MS2 spectra per feature ranges from 1 to 2,272.
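A minimal sketch of this alignment step: an MS2 spectrum counts as aligned when its precursor (mz, RT) falls inside a feature rectangle, and the alignment rate is the aligned fraction. The function names and the choice to stop at the first containing rectangle are assumptions made for illustration; the feature coordinates are those given for FT08341 in this document, while the two precursors are hypothetical.

```python
def assign_ms2(ms2_precursors, features):
    """Map the index of each MS2 precursor (mz, rt) to the first feature
    rectangle (mzmin, mzmax, rtmin, rtmax) that contains it."""
    hits = {}
    for idx, (mz, rt) in enumerate(ms2_precursors):
        for fid, (mzmin, mzmax, rtmin, rtmax) in features.items():
            if mzmin <= mz <= mzmax and rtmin <= rt <= rtmax:
                hits[idx] = fid
                break
    return hits

def alignment_rate(n_aligned, n_total):
    """Fraction of MS2 spectra that align to some feature."""
    return n_aligned / n_total if n_total else 0.0

features = {"FT08341": (352.158899, 352.168942, 167.3529, 189.8049)}
precursors = [(352.1652, 170.0), (500.0, 170.0)]  # hypothetical precursors
hits = assign_ms2(precursors, features)

# Counts reported in this example: 279,568 of 346,518 MS2 spectra aligned.
rate = alignment_rate(279568, 346518)
merging_valid = rate >= 0.80  # strictest third preset threshold named in the text
```

With the reported counts the rate rounds to 80.68%, so the merging passes even the 80% threshold.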
5) Using the information already in the feature database, integrate the metabolome data of the 101 samples one sample at a time.
The integration results are shown in Table 1:
Table 1
Figure PCTCN2020078647-appb-000001
2. Metabolite identification
Take one feature (number FT08341) as an example: its mz range is [352.158899, 352.168942] and its RT range is [167.3529, 189.8049].
The metabolite identification steps are as follows:
1) Obtain the mass-to-charge ratios (mz) of the standard compounds. Each standard compound has 18 ionization forms, listed in Table 2 (types of ionization forms of standard compounds). When a sample is measured, the mz of one or more ionization forms of a compound is obtained; each compound has one or more ionization forms. Table 3 lists 5 standard compounds (S0001-S0005) and the mz corresponding to 5 of their ionization forms.
Table 2
No.  Ionization form    No.  Ionization form    No.  Ionization form
1    M+                 7    (M-H+2Na)+         13   (M+CH3CN+Na)+
2    (M+H)+             8    (M-2H+3Na)+        14   (2M+H)+
3    (M+H-H2O)+         9    (M+K)+             15   (2M+NH4)+
4    (M+H-2H2O)+        10   (M-H+2K)+          16   (2M+Na)+
5    (M+NH4)+           11   (M-2H+3K)+         17   (2M+K)+
6    (M+Na)+            12   (M+CH3CN+H)+       18   (M+CH3COO+2H)
Table 3
Compound  M+     (M+H)+  (M+H-H2O)+  (M+H-2H2O)+  (M+NH4)+
S0001     74     75.01   57          38.99        92.03
S0002     112    113     95.02       77.01        130.1
S0003     116    117     99.01       81           134
S0004     116    117     99.01       81           134
S0005     117.1  118.1   100.1       82.04        135.1
Note: S0001 and so on are compound numbers, M+ and so on are ionization forms, and 74 and so on are mz values.
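The relation between the columns of Table 3 can be illustrated with a short Python sketch that derives the mz of an ionization form from a neutral monoisotopic mass. The mass shifts are standard monoisotopic adduct values and are not taken from this document; the neutral mass of 74.0 used for S0001 is an assumption read off the rounded M+ column, and `ADDUCT_SHIFT` and `adduct_mz` are hypothetical names.

```python
PROTON = 1.007276   # mass of H+ in Da
WATER = 18.010565   # monoisotopic mass of H2O in Da

# Mass shift added to the neutral mass M for a few singly charged forms.
ADDUCT_SHIFT = {
    "(M+H)+": PROTON,
    "(M+H-H2O)+": PROTON - WATER,
    "(M+H-2H2O)+": PROTON - 2 * WATER,
    "(M+NH4)+": 18.033823,
    "(M+Na)+": 22.989218,
    "(M+K)+": 38.963158,
}

def adduct_mz(neutral_mass, adduct):
    """mz of a singly charged ionization form of a compound."""
    return neutral_mass + ADDUCT_SHIFT[adduct]

# Assumed neutral mass for S0001; Table 3 reports rounded values only.
s0001 = {form: round(adduct_mz(74.0, form), 2) for form in ADDUCT_SHIFT}
```

Under that assumption the computed values for (M+H)+, (M+H-H2O)+, (M+H-2H2O)+, and (M+NH4)+ round to 75.01, 57.0, 38.99, and 92.03, in line with the S0001 row of Table 3.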
2) Find the standard compounds whose mz equals that of the MS2 precursor ions
Procedure:
a. Find all MS2 spectra aligned to FT08341, 35 in total;
b. Compute the median mz of the 35 MS2 precursor ions, named mzmed; the result is mzmed = 352.1652;
c. Search for standard compounds whose mz equals the mzmed of these 35 MS2 spectra; 16 are found. The ionization form, mass-to-charge ratio (mz), and intensity of these standard compounds are shown in Figure 4 (MS2 spectra of standard compounds whose mz is close to that of the MS2 of feature FT08341; n01701 and so on are compound numbers, and (M+) and so on are the ionization forms of the compounds).
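Steps a to c can be sketched as follows. The text speaks of standard compounds with the "same" mz, and Figure 4 of "close" mz, without stating a tolerance, so the 10 ppm window below is an assumption, as are the function name `candidate_standards`, the tuple layout of the standards list, and the compound identifiers in the usage example.

```python
from statistics import median

def candidate_standards(precursor_mzs, standards, ppm=10.0):
    """Return (mzmed, hits) for the MS2 spectra of one feature.

    precursor_mzs: precursor mz of every MS2 spectrum aligned to the feature.
    standards: list of (compound_id, ionization_form, mz) entries.
    hits: entries whose mz lies within `ppm` of the median precursor mz.
    """
    mzmed = median(precursor_mzs)
    tol = mzmed * ppm / 1e6
    hits = [(cid, form) for cid, form, mz in standards if abs(mz - mzmed) <= tol]
    return mzmed, hits

# Hypothetical inputs, for illustration only.
precursors = [352.1650, 352.1652, 352.1655]
standards = [
    ("std_A", "(M+H)+", 352.1660),
    ("std_B", "(M+H)+", 352.3000),
]
mzmed, hits = candidate_standards(precursors, standards)
```

Matching on each ionization form separately means a single compound can appear several times in `hits`, once per form whose mz falls inside the window.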
3) Compare the 35 experimentally obtained MS2 spectra with the 16 matched standard compounds for similarity; see Figure 5. The MS2 spectra of FT08341 are most similar to n01696 (most of them have a similarity to n01696 above 0.8, and the average similarity is the highest).
Note: in Figure 5 the horizontal axis is the standard compounds and the vertical axis is the 35 MS2 spectra of FT08341; for example, the first column shows the similarity of the 35 MS2 spectra to the MS2 of n01701.
4) Combine the results of the 35 MS2 spectra and select the best-matching standard compound to identify the metabolite.
Compute the median similarity between each of the 16 standard compounds and the 35 MS2 spectra; see Table 4. Compound n01696, which has the largest median, is selected. Because its median of 0.890 exceeds the specified value (cutoff = 0.5), n01696 is identified as the compound matching FT08341, which is consistent with Figure 5.
Table 4
Standard compound  n01701  n01694  n01696  n01835  n01577  n01578  n01579  n01440
Median             0.041   0.211   0.890   0.000   0.000   0.005   0.003   0.026
Standard compound  n01444  n01320  L0194   n00528  n01419  n01420  n01421  n01423
Median             0.150   0.000   0.000   0.012   0.006   0.494   0.008   0.004
The specified value can be set between 0.5 and 1 according to the actual situation; in this example the specified value is cutoff = 0.5.
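The selection rule of step 4) can be sketched as below: take the median similarity per standard compound, keep the compound with the largest median, and accept it only if that median exceeds the cutoff. The function name `identify` and the input layout are assumptions. Because the 35 individual similarity values are not listed in this document, the usage example feeds in the medians of Table 4 as one-element lists.

```python
from statistics import median

def identify(similarities, cutoff=0.5):
    """similarities: {compound_id: [similarity to each MS2 spectrum, ...]}.

    Returns (best_compound_or_None, medians).
    """
    medians = {cid: median(vals) for cid, vals in similarities.items() if vals}
    if not medians:
        return None, medians
    best = max(medians, key=medians.get)
    return (best if medians[best] > cutoff else None), medians

# Medians reported in Table 4, wrapped as one-element lists.
table4 = {
    "n01701": [0.041], "n01694": [0.211], "n01696": [0.890], "n01835": [0.000],
    "n01577": [0.000], "n01578": [0.005], "n01579": [0.003], "n01440": [0.026],
    "n01444": [0.150], "n01320": [0.000], "L0194": [0.000], "n00528": [0.012],
    "n01419": [0.006], "n01420": [0.494], "n01421": [0.008], "n01423": [0.004],
}
best, medians = identify(table4, cutoff=0.5)
strict_best, _ = identify(table4, cutoff=0.9)
```

With cutoff = 0.5 the result is n01696. The runner-up n01420 (0.494) would fail the cutoff even if it had ranked first, and raising the cutoff to 0.9 leaves FT08341 unidentified.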
3. Relative quantification of metabolites in a single sample
1) Correct the time axis of the sample against reference.xml.
2) Integrate over the features regions of the sample to determine the relative quantitative values of the metabolites.
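One way to picture the integration in step 2) is the following Python sketch, which sums the intensities inside the mz window of a feature for each scan and then integrates over retention time with the trapezoidal rule. The document does not specify the integration scheme, so this scheme, the `(rt, mz, intensity)` input layout, and the name `integrate_feature` are assumptions; the data points are hypothetical.

```python
def integrate_feature(points, feature):
    """Integrate signal inside a feature rectangle.

    points: iterable of (rt, mz, intensity) from an RT-corrected sample.
    feature: (mzmin, mzmax, rtmin, rtmax) from the feature database.
    """
    mzmin, mzmax, rtmin, rtmax = feature
    per_scan = {}
    for rt, mz, intensity in points:
        if mzmin <= mz <= mzmax and rtmin <= rt <= rtmax:
            per_scan[rt] = per_scan.get(rt, 0.0) + intensity

    # Trapezoidal rule over the scans that fall inside the RT window.
    rts = sorted(per_scan)
    area = 0.0
    for a, b in zip(rts, rts[1:]):
        area += 0.5 * (per_scan[a] + per_scan[b]) * (b - a)
    return area

ft08341 = (352.158899, 352.168942, 167.3529, 189.8049)
points = [
    (168.0, 352.160, 100.0),
    (169.0, 352.160, 300.0),
    (170.0, 352.165, 100.0),
    (170.0, 360.000, 999.0),  # outside the mz window, ignored
    (200.0, 352.160, 50.0),   # outside the RT window, ignored
]
area = integrate_feature(points, ft08341)
```

Because the rectangle comes from the shared feature database rather than from peaks detected in the sample itself, the same call works for a sample in which no peak was detected, which is how a feature database avoids per-batch regrouping.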
Table 5
Figure PCTCN2020078647-appb-000002
Figure PCTCN2020078647-appb-000003
Table 5 shows part of the relative quantification results for the metabolites. GXP104, GXP107, and so on are sample names, score is the matching score of the compound, and metabolite is the identified compound.
For example, the metabolite matched to FT08341 is Phe-Trp (the name of the matched metabolite is obtained from the standard-compound database, mainly NISTlib here; other public databases such as HMDB or METLIN can be substituted), with a score of 0.89, which indicates high confidence. In sample GXP104 the relative quantitative value of this metabolite is 8495.393221, while in GXP107 it is 5096.885985.
As another example, as shown in Table 5, the relative quantitative value of compound FT02707 in sample GXP104 is 12386.06788, and that of compound FT05421 in sample GXP104 is 2252.548371.
It can thus be seen that the technical solution of the present invention allows accurate relative quantification of metabolites.
The above example processes liquid chromatography-mass spectrometry data. Those skilled in the art will understand that gas chromatography-mass spectrometry data can also be processed by this method, with the same technical effect.
Comparative Example 1
1) Convert the raw instrument output files of the 101 dried blood spots to mzXML format;
2) Use the obiwarp method of XCMS to correct the retention time, correcting the time of each scan. The relevant parameters are set to:
ppm: 25
Peakwidth: 4~10
Noise: 10
Snthresh: 3
3) Use the CentWave method of XCMS to detect ion peaks (peaks);
4) The three steps above yield the peaks of each individual mzXML file. Peaks take the form ["mzmin", "mzmax", "rtmin", "rtmax", "into", "maxo"], where ["mzmin", "mzmax", "rtmin", "rtmax"] are the coordinates of the peak and ["into", "maxo"] are the quantitative information (integral and maximum);
5) Align and group the peaks (alignment & group) to obtain a series of features, ensuring that the ["mzmin", "mzmax", "rtmin", "rtmax"] regions do not overlap. Here mzmin is the minimum of the median mz across the samples and mzmax is the maximum of the median mz across the samples; the 101 samples yield 4,289 features in total;
6) Missing-value filling: integrate the relevant regions of each mzXML file according to the unified coordinates; that is, a missing value is filled by integrating over the region at the coordinates already found.
The results of Example 1 and Comparative Example 1 compare as follows:
1. Example 1 found 23,799 features, with 21,273 non-missing feature values; Comparative Example 1 found 4,289 features, with 4,042 non-missing feature values. This shows that the technical solution of the present invention finds more features, with a smaller number of missing feature values. Figure 6 shows the distribution of the number of missing feature values in Example 1: across the 101 samples, 85% of the features have fewer than 20 missing values.
2. Figure 7 compares the coefficient of variation (CV, standard deviation divided by the mean) between samples for Example 1 and Comparative Example 1. In Figure 7, the median line (the first vertical line from the left, parallel to the ordinate) marks the median, and the quartile line (the second vertical line from the left) marks the upper quartile (75%): 50% of the features have a CV below the value at the median line, and 75% of the features have a CV below the value at the quartile line. Figure 7 shows that the between-sample CV is smaller in Example 1. This is because Example 1 builds a feature database, which makes its features more stable, removes differences in peak elution time (RT), and captures the abundance of the molecular ion (mz) itself as fully as possible. The consistency of the QC samples also improves.
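The CV defined above (standard deviation divided by the mean) can be computed per feature across samples as in this short sketch. Whether the sample or the population standard deviation was used is not stated; the sample form is assumed here, and the function name and input values are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV of one feature's relative quantitative values across samples."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m  # sample standard deviation (assumed)

# Hypothetical quantitative values of one feature in three samples.
cv = coefficient_of_variation([8.0, 10.0, 12.0])
```

Computing this value for every feature and reading off the median and the upper quartile gives the two vertical lines described for Figure 7.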
Figure 8 shows the PCA results of Example 1 and Comparative Example 1, which indicate that the consistency between samples is better in Example 1 than in Comparative Example 1. On the one hand, the proportion explained by PC1 and PC2 is much larger in Example 1; on the other hand, the experimental samples (solid circles) and the QC samples (dashed circles) are more clearly separated.
In addition, Example 1 makes full use of the MS2 information of the samples. On the one hand this greatly raises the MS2 identification rate; on the other hand it produces an MS2 database that can be used to evaluate new MS2 similarity algorithms. These effects arise mainly because:
1) One feature has MS2 spectra from multiple samples, covering multiple fragmentation patterns of the precursor ion. This improves the efficiency of matching against standard compounds, so more metabolites can be identified.
2) Multiple MS2 spectra are compared with the standard compounds, which effectively avoids the problem of a single MS2 spectrum matching several compounds and reduces false positives. Figure 9 compares the number of metabolites identified in Example 1 and Comparative Example 1.
3) The correlation among multiple MS2 spectra belonging to the same feature can help judge the validity of the peaks merging algorithm.
Further, Figure 10 shows the mz and RT distributions of the MS2 precursor ions corresponding to FT08341 in Example 1. ① If mz and RT are distributed over a very narrow range, the spectra can be concluded to come from the same precursor ion and can therefore be used to evaluate MS2 similarity algorithms. ② If the range of mz and RT (mainly RT) is relatively wide, then, given an MS2 similarity algorithm, the corresponding MS2 spectra can be used to assess whether the precursor ions are the same, which in turn helps judge whether the peaks merging is reasonable.
Figure 11 shows the similarity among the 35 MS2 spectra of FT08341 in Example 1. Comparing the similarity of multiple MS2 spectra of the same feature can help judge how well peaks have been merged into features.
From the above description it can be seen that the above embodiments of the present invention achieve the following technical effects:
1. For time-axis correction, fixing one reference sample ensures that subsequent samples are comparable in time.
2. The method merges peaks on top of the CentWave algorithm and makes effective use of CentWave's localization of the region of maximum response.
3. It draws on the parameter settings of the knn algorithm and the merging approach of the density algorithm, effectively avoids the problem of an unrepresentative median, and is simple to operate.
4. A merged peak covers a larger region, so quantification is more accurate even when only one sample is measured (this is particularly effective for metabolites with poor chromatographic peak shape). The larger covered region also accommodates subsequent samples better and effectively reduces the effect of retention time (RT) drift.
5. Once the database is built, sample analysis becomes more efficient: subsequent samples are comparable in time and earlier samples do not need to be rolled back, which improves commercial usability.
The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Industrial applicability
Through the technical method of this application, the present invention has at least the following beneficial effects:
By building a feature database, unifying the time axis, and merging peaks according to the principle of information complementarity between samples, the technical solution of the present invention enables the integration of very large-scale metabolome data and allows data correction and data integration for batches or for single samples. It is not affected by the detection batch and is suitable for commercial testing.
The present invention builds a feature database, fixes one reference sample, and unifies the time axis, which ensures that subsequent samples are comparable in time. Metabolome data processing can therefore make effective use of the complementary sample information from different batches, which effectively improves the repeatability and coverage of metabolite detection.
The present invention merges peaks while building the feature database. A merged peak covers a larger region, so quantification is more accurate even when only one sample is measured, and the effect remains good for metabolites with poor chromatographic peak shape. The larger covered region also accommodates subsequent samples better and effectively reduces the effect of retention time (RT) drift.
Once the feature database is built, the present invention effectively improves the efficiency of sample analysis: subsequent samples are comparable in time, earlier samples do not need to be rolled back, and the method can be used widely in commercial settings.

Claims (31)

  1. 一种生物代谢组学数据处理方法,其中,所述生物代谢组学数据包括液相色谱-质谱数据或气相色谱-质谱数据,所述液相色谱-质谱数据包括一级质谱数据,所述气相色谱-质谱数据包括一级质谱数据;所述生物代谢组学数据处理方法包括将多个生物样本的液相色谱-质谱数据或气相色谱-质谱数据进行整合以形成特征数据库的步骤,所述整合的步骤包括:A method for processing biological metabolomics data, wherein the biological metabolomics data includes liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data, the liquid chromatography-mass spectrometry data includes primary mass spectrometry data, and the gas phase Chromatography-mass spectrometry data includes primary mass spectrometry data; the biological metabolomics data processing method includes the step of integrating liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of a plurality of biological samples to form a feature database. The steps include:
    S11,任意选取所述多个生物样本中的一个样本作为参照样本,根据所述参照样本的时间轴逐一对其他样本的时间轴进行校正;S11, arbitrarily selecting one of the plurality of biological samples as a reference sample, and correcting the time axes of other samples one by one according to the time axis of the reference sample;
    S12,对校正后的每一个样本,逐一进行一级质谱离子峰的峰识别处理,得到多个识别特征峰;以及S12, for each sample after calibration, perform peak identification processing of the ion peaks of the primary mass spectrum one by one to obtain multiple identification characteristic peaks; and
    S13,根据样本信息互补原则,对所述多个识别特征峰进行合并处理,得到所述多个生物样本的特征数据库。S13: According to the principle of complementary sample information, the multiple identification characteristic peaks are combined to obtain a characteristic database of the multiple biological samples.
  2. 根据权利要求1所述的生物代谢组学数据处理方法,所述S13中:如果所述多个识别特征峰的[mzmin,mzmax]区域重叠或相邻,且[rtmin,rtmax]区域重叠或相邻,则将所述多个识别特征峰合并为一个特征峰。The biological metabolomics data processing method according to claim 1, in said S13: if the [mzmin, mzmax] regions of the multiple identification characteristic peaks overlap or are adjacent, and the [rtmin, rtmax] regions overlap or are in phase Adjacent, merge the multiple identification characteristic peaks into one characteristic peak.
  3. The biological metabolomics data processing method according to claim 2, wherein S13 comprises:
    S131: determining whether the [mzmin, mzmax] regions of the multiple identified characteristic peaks overlap or are adjacent; if they overlap, proceeding to S133; if they do not overlap, further determining whether they are adjacent, wherein if the interval between the [mzmin, mzmax] regions of the multiple identified characteristic peaks is less than a first preset threshold, the regions are determined to be adjacent and the method proceeds to S133; if the regions are neither overlapping nor adjacent, determining that the multiple identified characteristic peaks are each independent characteristic peaks;
    S132: determining whether the [rtmin, rtmax] regions of the multiple identified characteristic peaks overlap or are adjacent; if they overlap, proceeding to S133; if they do not overlap, further determining whether they are adjacent, wherein if the interval between the [rtmin, rtmax] regions of the multiple identified characteristic peaks is less than a second preset threshold, the regions are determined to be adjacent and the method proceeds to S133; if the regions are neither overlapping nor adjacent, determining that the multiple identified characteristic peaks are each independent characteristic peaks;
    S133: if the multiple identified characteristic peaks satisfy both the overlapping-or-adjacent condition of S131 and the overlapping-or-adjacent condition of S132, merging the multiple identified characteristic peaks into one characteristic peak;
    S134: generating a feature list from the data of all characteristic peaks, thereby obtaining the feature database.
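The overlap-or-adjacent merging of S131 to S134 can be sketched in Python as follows. This is only an illustrative sketch, not the applicant's implementation: the `Peak` class, the function names, the default gap thresholds, and the choice to give a merged peak the union of the merged regions are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    """Hypothetical container for one identified characteristic peak."""
    mzmin: float
    mzmax: float
    rtmin: float
    rtmax: float


def overlap_or_adjacent(lo1, hi1, lo2, hi2, gap_threshold):
    """True if two intervals overlap, or the gap between them is below the threshold."""
    gap = max(lo1, lo2) - min(hi1, hi2)  # gap <= 0 means the intervals overlap or touch
    return gap <= 0 or gap < gap_threshold


def should_merge(a, b, mz_gap, rt_gap):
    """S133: both the m/z condition (S131) and the RT condition (S132) must hold."""
    return (overlap_or_adjacent(a.mzmin, a.mzmax, b.mzmin, b.mzmax, mz_gap)
            and overlap_or_adjacent(a.rtmin, a.rtmax, b.rtmin, b.rtmax, rt_gap))


def merge_peaks(peaks, mz_gap=0.01, rt_gap=10.0):
    """Return a feature list (S134) in which mergeable peaks are combined.

    Assumption: a merged peak spans the union of the regions it absorbed.
    """
    features = []
    for p in peaks:
        merged = Peak(p.mzmin, p.mzmax, p.rtmin, p.rtmax)
        changed = True
        while changed:  # repeat, because a grown region may reach further features
            changed = False
            for f in features:
                if should_merge(merged, f, mz_gap, rt_gap):
                    merged = Peak(min(merged.mzmin, f.mzmin), max(merged.mzmax, f.mzmax),
                                  min(merged.rtmin, f.rtmin), max(merged.rtmax, f.rtmax))
                    features.remove(f)
                    changed = True
                    break
        features.append(merged)
    return features


example_peaks = [
    Peak(100.000, 100.005, 50.0, 60.0),  # sample 1
    Peak(100.010, 100.015, 65.0, 70.0),  # sample 2: adjacent in both m/z and RT
    Peak(200.000, 200.005, 50.0, 60.0),  # unrelated m/z, stays independent
]
feature_list = merge_peaks(example_peaks)
```

Peaks that fail either condition remain separate entries in the feature list, which matches the "independent characteristic peaks" branch of S131 and S132.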
  4. The biological metabolomics data processing method according to claim 3, wherein the first preset threshold is set according to instrument parameters, and the second preset threshold is set according to the maximum time deviation observed in retention time correction;
    preferably, the first preset threshold is set to 0.01 to 0.015 Da, and the second preset threshold is set to 10 to 15.
  5. The biological metabolomics data processing method according to claim 3, wherein the mass spectrometry data further comprise secondary mass spectrometry data, and S13 further comprises:
    S135: matching the secondary mass spectrometry data of the multiple biological samples against the feature database generated in S134, wherein the peak merging is judged to be valid when the matching rate is greater than or equal to a third preset threshold;
    preferably, the third preset threshold is set to 40%; more preferably, the third preset threshold is set to 50%; more preferably, the third preset threshold is set to 60%; more preferably, the third preset threshold is set to 80%.
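The validity check of S135 could look roughly like the sketch below. The claim does not define how the matching rate is computed, so this example assumes (purely for illustration) that a secondary spectrum counts as matched when its precursor m/z and retention time fall inside some feature region; the function names and tuple layouts are hypothetical.

```python
def matching_rate(ms2_scans, features):
    """Fraction of secondary (MS2) scans that fall inside any feature region.

    ms2_scans: list of (precursor_mz, rt) tuples           (assumed layout)
    features:  list of (mzmin, mzmax, rtmin, rtmax) tuples (assumed layout)
    """
    if not ms2_scans:
        return 0.0
    hits = sum(
        any(mzmin <= mz <= mzmax and rtmin <= rt <= rtmax
            for mzmin, mzmax, rtmin, rtmax in features)
        for mz, rt in ms2_scans
    )
    return hits / len(ms2_scans)


def merging_is_valid(ms2_scans, features, threshold=0.4):
    """Peak merging is judged valid when the rate reaches the third preset threshold."""
    return matching_rate(ms2_scans, features) >= threshold


example_features = [(100.00, 100.02, 50.0, 70.0)]
example_scans = [(100.01, 60.0), (100.01, 80.0), (300.00, 60.0), (100.015, 55.0)]
rate = matching_rate(example_scans, example_features)
```

In this toy input two of the four scans land inside the feature region, so the merge would pass a 40% threshold but fail a 60% one.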
  6. The biological metabolomics data processing method according to claim 1, wherein the mass spectrometry data further comprise secondary mass spectrometry data, and S11 further comprises performing retention time correction on the primary mass spectrometry data and the secondary mass spectrometry data;
    preferably, the retention time correction is performed using the Obiwarp algorithm.
  7. The biological metabolomics data processing method according to claim 1, wherein the peak identification algorithm is the CentWave algorithm, the matchedFilter algorithm, or the mzMine algorithm.
  8. The biological metabolomics data processing method according to claim 7, wherein the parameter settings of the peak identification algorithm comprise: ppm: the resolution of the instrument; peak width: set to 2 to 30; noise: set to 0; signal-to-noise ratio: set to 10.
  9. The biological metabolomics data processing method according to claim 1, wherein the biological sample comprises a body fluid, tissue, or cells of a human or animal; a root, stem, leaf, fruit, or seed of a plant; or a cell culture fluid of a microorganism; wherein the body fluid comprises urine, blood, saliva, cerebrospinal fluid, or amniotic fluid, the tissue comprises organ tissue, muscle tissue, or tumor tissue, and the cells comprise stem cells, somatic cells, tumor cells, or microbial cells.
  10. A method for analyzing biological metabolomics data, comprising, in sequence, a step of biological metabolomics data processing and a step of qualitatively identifying metabolites from secondary mass spectrometry data, wherein the biological metabolomics data processing is performed using the biological metabolomics data processing method according to any one of claims 1 to 9.
  11. The analysis method according to claim 10, wherein the step of qualitatively identifying metabolites from secondary mass spectrometry data comprises:
    S21: obtaining mass-to-charge ratio data of each standard compound;
    S22: arbitrarily selecting one feature value from the feature database obtained after the biological metabolomics data processing, finding all secondary mass spectrometry mass-to-charge ratio data corresponding to that feature value, and finding the standard compounds that match all of those secondary mass spectrometry mass-to-charge ratio data;
    S23: taking all of the secondary mass spectrometry mass-to-charge ratio data corresponding to the feature value selected in S22 as one party, and the secondary mass spectrometry mass-to-charge ratio data of the matched standard compounds found in S22 as the other party, scoring the similarity between the two by calculating a dot-product score, and qualitatively identifying the metabolite according to the score.
  12. The analysis method according to claim 11, wherein S23 comprises: calculating, for each of the multiple matched standard compounds, the median of its similarity to the multiple secondary mass spectra, and selecting the compound with the largest median;
    preferably, whether the compound is a match is determined according to whether its median is greater than a cutoff value.
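The scoring described in S23 and refined in claim 12 can be illustrated with the following Python sketch. It rests on several assumptions not stated in the claims: spectra are lists of (m/z, intensity) pairs, peaks are aligned by fixed-width m/z binning, the dot product is normalized to a cosine score in [0, 1], and the cutoff default of 0.7 is arbitrary. All names are hypothetical.

```python
import math
from statistics import median


def binned(spectrum, bin_width=0.01):
    """Collapse (mz, intensity) pairs into a sparse vector keyed by m/z bin."""
    vec = {}
    for mz, intensity in spectrum:
        key = round(mz / bin_width)
        vec[key] = vec.get(key, 0.0) + intensity
    return vec


def dot_product_score(query, reference, bin_width=0.01):
    """Normalized dot product (cosine similarity) between two spectra."""
    q, r = binned(query, bin_width), binned(reference, bin_width)
    dot = sum(v * r.get(k, 0.0) for k, v in q.items())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0


def identify(query_spectra, candidates, cutoff=0.7):
    """Pick the candidate with the largest median score over all query spectra.

    Returns (name, median_score); name is None when the best median does not
    exceed the cutoff. query_spectra must be non-empty.
    """
    best_name, best_median = None, -1.0
    for name, ref in candidates.items():
        m = median(dot_product_score(q, ref) for q in query_spectra)
        if m > best_median:
            best_name, best_median = name, m
    if best_name is None or best_median <= cutoff:
        return None, best_median
    return best_name, best_median


# Toy library spectra and three secondary spectra linked to one feature.
library = {
    "compound_A": [(85.03, 100.0), (127.04, 50.0)],
    "compound_B": [(60.02, 100.0), (127.04, 10.0)],
}
queries = [
    [(85.03, 90.0), (127.04, 55.0)],
    [(85.03, 100.0), (127.04, 40.0)],
    [(85.03, 80.0), (60.02, 5.0)],
]
best, score = identify(queries, library)
```

Using the median rather than the maximum makes the choice robust to a single noisy secondary spectrum, which appears to be the motivation behind claim 12.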
  13. The analysis method according to claim 11, wherein the mass-to-charge ratio data of the standard compounds are obtained from an existing database, the database comprising NISTlib, HMDB, or METLIN.
  14. The analysis method according to claim 10, further comprising a step of quantifying biological metabolites.
  15. The method according to claim 14, wherein the step of quantifying biological metabolites comprises:
    S31: correcting the time axis of a sample to be quantified according to the time axis of a reference sample;
    S32: integrating the feature regions of the established feature database that correspond to the sample to be quantified, to obtain a relative quantification result for the biological metabolites.
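The integration in S32 can be sketched as below. The sketch assumes the time-axis correction of S31 has already been applied, that the sample is available as (rt, mz, intensity) centroid points, and that the area is taken over the extracted ion chromatogram with the trapezoidal rule; none of these details, nor the function name, come from the claims.

```python
def integrate_feature(points, mzmin, mzmax, rtmin, rtmax):
    """Relative abundance of one feature in one RT-corrected sample.

    points: iterable of (rt, mz, intensity) tuples (assumed layout).
    """
    # Build the extracted ion chromatogram: summed intensity per scan time
    # for all points that fall inside the feature region.
    eic = {}
    for rt, mz, intensity in points:
        if mzmin <= mz <= mzmax and rtmin <= rt <= rtmax:
            eic[rt] = eic.get(rt, 0.0) + intensity

    # Trapezoidal integration over retention time.
    times = sorted(eic)
    area = 0.0
    for t0, t1 in zip(times, times[1:]):
        area += 0.5 * (eic[t0] + eic[t1]) * (t1 - t0)
    return area


sample_points = [
    (10.0, 100.01, 0.0),
    (11.0, 100.01, 100.0),
    (12.0, 100.01, 0.0),
    (11.0, 250.00, 999.0),  # outside the m/z window, ignored
]
area = integrate_feature(sample_points, 100.00, 100.02, 5.0, 20.0)
```

Because every sample is integrated over the same feature region, the resulting areas are comparable across samples, which is what makes the quantification relative.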
  16. Use of the biological metabolomics data processing method according to any one of claims 1 to 9, or of the method for analyzing biological metabolomics data according to any one of claims 10 to 15, in the identification of vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates, or short peptides.
  17. A method for detecting vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates, or short peptides, characterized by comprising: performing liquid chromatography-mass spectrometry and/or gas chromatography-mass spectrometry detection on a biological sample to obtain liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data; processing the liquid chromatography-mass spectrometry data and/or gas chromatography-mass spectrometry data of the biological sample using the biological metabolomics data processing method according to any one of claims 1 to 9 or the method for analyzing biological metabolomics data according to any one of claims 10 to 15 to obtain a data result; and deriving the vitamins, amino acids, lipids, steroids, aromatic acids, neurotransmitters, pigments, carbohydrates, or short peptides from the data result.
  18. A biological metabolomics data processing apparatus, wherein the biological metabolomics data comprise liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data, the liquid chromatography-mass spectrometry data comprise primary mass spectrometry data, and the gas chromatography-mass spectrometry data comprise primary mass spectrometry data; the biological metabolomics data processing apparatus comprises a database generation module that integrates the liquid chromatography-mass spectrometry data or gas chromatography-mass spectrometry data of multiple biological samples to form a feature database, the database generation module comprising:
    a time axis correction submodule, configured to arbitrarily select one of the multiple biological samples as a reference sample, and to correct the time axes of the other samples one by one according to the time axis of the reference sample;
    a characteristic peak identification submodule, configured to perform peak identification on the primary mass spectrometry ion peaks of each corrected sample, one by one, to obtain multiple identified characteristic peaks; and
    a feature database forming submodule, configured to merge the multiple identified characteristic peaks according to the principle of sample information complementarity, to obtain the feature database of the multiple biological samples.
  19. The biological metabolomics data processing apparatus according to claim 18, wherein the feature database forming submodule comprises a data integration unit, the data integration unit being configured to merge multiple identified characteristic peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent into one characteristic peak.
  20. The biological metabolomics data processing apparatus according to claim 19, wherein the feature database forming submodule comprises a first judging unit, a second judging unit, the data integration unit, and a feature database forming unit:
    wherein the first judging unit is configured to determine whether the [mzmin, mzmax] regions of the multiple identified characteristic peaks overlap; if they overlap, to proceed to the data integration unit; if they do not overlap, to further determine whether they are adjacent, wherein if the interval between the [mzmin, mzmax] regions of the multiple identified characteristic peaks is less than a first preset threshold, the regions are determined to be adjacent and processing proceeds to S133; and if the regions are neither overlapping nor adjacent, to determine that the multiple identified characteristic peaks are each independent characteristic peaks;
    the second judging unit is configured to determine whether the [rtmin, rtmax] regions of the multiple identified characteristic peaks overlap or are adjacent; if they overlap, to proceed to the data integration unit; if they do not overlap, to further determine whether they are adjacent, wherein if the interval between the [rtmin, rtmax] regions of the multiple identified characteristic peaks is less than a second preset threshold, the regions are determined to be adjacent and processing proceeds to the data integration unit; and if the regions are neither overlapping nor adjacent, to determine that the multiple identified characteristic peaks are each independent characteristic peaks;
    the data integration unit is configured to merge multiple identified characteristic peaks whose [mzmin, mzmax] regions overlap or are adjacent and whose [rtmin, rtmax] regions overlap or are adjacent into one characteristic peak;
    the feature database forming unit is configured to generate a feature list from the data of all characteristic peaks, thereby obtaining the feature database.
  21. The biological metabolomics data processing apparatus according to claim 20, wherein the first preset threshold is set according to instrument parameters, and the second preset threshold is set according to the maximum time deviation observed in retention time correction;
    preferably, the first preset threshold is set to 0.01 to 0.015 Da, and the second preset threshold is set to 10 to 15.
  22. The biological metabolomics data processing apparatus according to claim 20, wherein the mass spectrometry data further comprise secondary mass spectrometry data, and the biological metabolomics data processing apparatus further comprises:
    a peak merging validity verification submodule, configured to match the secondary mass spectrometry data of the multiple biological samples against the feature database, wherein the peak merging is judged to be valid when the matching rate is greater than or equal to a third preset threshold;
    preferably, the third preset threshold is set to 40%; more preferably, the third preset threshold is set to 50%; more preferably, the third preset threshold is set to 60%; more preferably, the third preset threshold is set to 80%.
  23. The biological metabolomics data processing apparatus according to claim 18, wherein the mass spectrometry data further comprise secondary mass spectrometry data, and the time axis correction submodule is further configured to perform retention time correction on the primary mass spectrometry data and the secondary mass spectrometry data;
    preferably, the retention time correction is performed using the Obiwarp algorithm.
  24. The biological metabolomics data processing apparatus according to claim 18, wherein the peak identification algorithm is the CentWave algorithm, the matchedFilter algorithm, or the mzMine algorithm.
  25. The biological metabolomics data processing apparatus according to claim 24, wherein the parameter settings of the peak identification algorithm comprise: ppm: the resolution of the instrument; peak width: set to 2 to 30; noise: set to 0; signal-to-noise ratio: set to 10.
  26. An apparatus for analyzing biological metabolomics data, comprising a module configured for biological metabolomics data processing and a module configured to qualitatively identify metabolites from secondary mass spectrometry data, wherein the module configured for biological metabolomics data processing is the biological metabolomics data processing apparatus according to any one of claims 18 to 25.
  27. The analysis apparatus according to claim 26, wherein the module configured to qualitatively identify metabolites from secondary mass spectrometry data comprises:
    a standard compound mass-to-charge ratio data acquisition submodule, configured to obtain mass-to-charge ratio data of each standard compound;
    a standard compound matching submodule, configured to arbitrarily select one feature value from the feature database obtained after the biological metabolomics data processing, to find all secondary mass spectrometry mass-to-charge ratio data corresponding to that feature value, and to find the standard compounds that match all of those secondary mass spectrometry mass-to-charge ratio data;
    a scoring and identification submodule, configured to take all of the secondary mass spectrometry mass-to-charge ratio data corresponding to the feature value selected in the standard compound matching submodule as one party, and the secondary mass spectrometry mass-to-charge ratio data of the matched standard compounds found in the standard compound matching submodule as the other party, to score the similarity between the two by calculating a dot-product score, and to qualitatively identify the metabolite according to the score.
  28. The analysis apparatus according to claim 27, wherein the scoring and identification submodule is configured to calculate, for each of the multiple matched standard compounds, the median of its similarity to the multiple secondary mass spectra, and to select the compound with the largest median;
    preferably, whether the compound is a match is determined according to whether its median is greater than a cutoff value.
  29. The analysis apparatus according to claim 27, wherein the mass-to-charge ratio data of the standard compounds are obtained from an existing database, the database comprising NISTlib, HMDB, or METLIN.
  30. The analysis apparatus according to claim 26, further comprising a module configured to quantify biological metabolites.
  31. The apparatus according to claim 30, wherein the module configured to quantify biological metabolites comprises:
    a time axis correction submodule, configured to correct the time axis of a sample to be quantified according to the time axis of a reference sample;
    a biological metabolite relative quantification submodule, configured to integrate the feature regions of the established feature database that correspond to the sample to be quantified, to obtain a relative quantification result for the biological metabolites.
PCT/CN2020/078647 2019-03-22 2020-03-10 Biometabolomics data processing and analysis methods and apparatuses, and application thereof WO2020199866A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201910223490 2019-03-22
CN201910256090.3A CN111157664A (en) 2019-03-22 2019-03-29 Biological metabonomics data processing method, analysis method, device and application
CN201910256090.3 2019-03-29

Publications (1)

Publication Number Publication Date
WO2020199866A1 true WO2020199866A1 (en) 2020-10-08

Family

ID=70555706

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/078647 WO2020199866A1 (en) 2019-03-22 2020-03-10 Biometabolomics data processing and analysis methods and apparatuses, and application thereof

Country Status (2)

Country Link
CN (1) CN111157664A (en)
WO (1) WO2020199866A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113588847A (en) * 2021-09-26 2021-11-02 萱闱(北京)生物科技有限公司 Biological metabonomics data processing method, analysis method, device and application

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819751B (en) * 2020-12-31 2024-01-26 珠海碳云智能科技有限公司 Method and device for processing data of detection result of polypeptide chip
CN114200048B (en) * 2021-12-09 2024-03-22 哈尔滨脉图精准技术有限公司 LC-MS (liquid Crystal-mobile station) off-line data processing method and processing device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005113830A2 (en) * 2004-05-20 2005-12-01 Waters Investments Limited System and method for grouping precursor and fragment ions using selected ion chromatograms
CN103620401A (en) * 2011-06-29 2014-03-05 株式会社岛津制作所 Analysis data processing method and device
CN105486796A (en) * 2015-12-28 2016-04-13 中国检验检疫科学研究院 LC-Q-TOF/MS (liquid chromatography-quadrupole-time of flight/mass spectrometry) technology for detecting 544 kinds of pesticide residues in melons and fruit
CN105606742A (en) * 2014-11-18 2016-05-25 塞莫费雪科学(不来梅)有限公司 Method for time-alignment of chromatography-mass spectrometry data sets
US20170365458A1 (en) * 2016-06-03 2017-12-21 Woods Hole Oceanographic Institution Adduct-Based System and Methods for Analysis and Identification of Mass Spectrometry Data
CN108061776A (en) * 2016-11-08 2018-05-22 中国科学院大连化学物理研究所 A kind of metabolism group data peak match method for liquid chromatography-mass spectrography

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012125121A1 (en) * 2011-03-11 2012-09-20 Agency For Science, Technology And Research A method, an apparatus, and a computer program product for identifying metabolites from liquid chromatography-mass spectrometry measurements
CN102798684B (en) * 2011-05-21 2015-04-15 中国科学院大连化学物理研究所 Chemical profile analysis method based on retention time locking-gas chromatography-quadrupole mass spectrometry-selected ion monitoring mode
CN106970161B (en) * 2017-03-04 2019-09-27 宁夏医科大学 A kind of method of the non-target method rapid screening plant otherness metabolin of GC-MS
CN106841494B (en) * 2017-04-17 2018-03-20 宁夏医科大学 Plant otherness metabolin rapid screening method based on UPLC QTOF
CN107727727B (en) * 2017-11-13 2020-11-20 复旦大学 Protein identification method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005113830A2 (en) * 2004-05-20 2005-12-01 Waters Investments Limited System and method for grouping precursor and fragment ions using selected ion chromatograms
CN103620401A (en) * 2011-06-29 2014-03-05 株式会社岛津制作所 Analysis data processing method and device
CN105606742A (en) * 2014-11-18 2016-05-25 塞莫费雪科学(不来梅)有限公司 Method for time-alignment of chromatography-mass spectrometry data sets
CN105486796A (en) * 2015-12-28 2016-04-13 中国检验检疫科学研究院 LC-Q-TOF/MS (liquid chromatography-quadrupole-time of flight/mass spectrometry) technology for detecting 544 kinds of pesticide residues in melons and fruit
US20170365458A1 (en) * 2016-06-03 2017-12-21 Woods Hole Oceanographic Institution Adduct-Based System and Methods for Analysis and Identification of Mass Spectrometry Data
CN108061776A (en) * 2016-11-08 2018-05-22 中国科学院大连化学物理研究所 A kind of metabolism group data peak match method for liquid chromatography-mass spectrography

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIM P.SANGSTER ET AL.: "Investigation of analytical variation in metabonomic analysis using liquid chromatography/mass spectrometry", RAPID COMMUN. MASS SPECTROM., vol. 21, no. 18, 31 December 2007 (2007-12-31), XP020250870, DOI: 20200519141534Y *

Also Published As

Publication number Publication date
CN111157664A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
WO2020199866A1 (en) Biometabolomics data processing and analysis methods and apparatuses, and application thereof
Draper et al. Flow infusion electrospray ionisation mass spectrometry for high throughput, non-targeted metabolite fingerprinting: a review
AU2016204969B2 (en) Metabolic biomarkers of autism
EP2834835B1 (en) Method and apparatus for improved quantitation by mass spectrometry
Cho et al. After the feature presentation: technologies bridging untargeted metabolomics and biology
CN109725072A (en) A kind of targeting qualitative, quantitative metabonomic analysis methods of the screening biomarker for cancer based on LC-MS/MS technology
WO2021232943A1 (en) Metabolomics relative quantitative analysis method based on uplc/hmrs
US20130282300A1 (en) Combined Spectroscopic Method for Rapid Differentiation of Biological Samples
CN111562338B (en) Application of transparent renal cell carcinoma metabolic marker in renal cell carcinoma early screening and diagnosis product
US20140088885A1 (en) Method, an apparatus, and a computer program product for identifying metabolites from liquid chromatography-mass spectrometry measurements
CN110568174B (en) Construction and evaluation method of early liver cancer rat model
JP2013506438A (en) Metabolic biomarkers of drug-induced cardiotoxicity
Yang et al. Serum metabolic profiling study of endometriosis by using wooden-tip electrospray ionization mass spectrometry
Feng et al. Dynamic binning peak detection and assessment of various lipidomics liquid chromatography-mass spectrometry pre-processing platforms
Bjerrum Metabonomics: analytical techniques and associated chemometrics at a glance
Reaser et al. Non-targeted determination of 13C-labeling in the Methylobacterium extorquens AM1 metabolome using the two-dimensional mass cluster method and principal component analysis
Xu et al. Systematic optimization and evaluation of sample pretreatment methods for LC-MS-based metabolomics analysis of adherent mammalian cancer cells
Kalogeropoulou Pre-processing and analysis of high-dimensional plant metabolomics data
CN104364658B (en) For the method for diagnosing chronic valve disease
Li Development of Data Processing Methods in Chemical Isotope Labeling Liquid Chromatography-Mass Spectrometry-Based Metabolomics
Kenar Design and implementation of efficient workflows for computational metabolomics
GENGBO DEVELOPMENT OF COMPUTATIONAL METHODS FOR MASS SPECTROMETRY-BASED UNTARGETED METABOLOMICS DATA ANALYSIS
CN115266985A (en) UHPLC-QTOF-MS-based laryngocarcinoma patient serum lipidomics detection method
CN115825262A (en) Application of group of differential small molecule metabolites in preparation of reagent for detecting nasopharyngeal carcinoma
Yu Development of analytical workflows and bioinformatic programs for mass spectrometry-based metabolomics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20784301

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20784301

Country of ref document: EP

Kind code of ref document: A1