WO2021004355A1 - Method and device for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification - Google Patents

Method and device for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification

Info

Publication number
WO2021004355A1
Authority
WO
WIPO (PCT)
Prior art keywords
spectrum
signal
library
target
identified
Prior art date
Application number
PCT/CN2020/099769
Other languages
English (en)
French (fr)
Inventor
李德华
李尉
栾恩慧
龙巧云
宋佳平
李振宇
王雅兰
Original Assignee
深圳微伴生物有限公司
深圳数字生命研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳微伴生物有限公司, 深圳数字生命研究院
Publication of WO2021004355A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR

Definitions

  • The present invention relates to the technical field of metabolomics, in particular to a method and device for constructing a decoy library, constructing a target-decoy (target-bait) library, and metabolome FDR identification.
  • Metabolomics is a discipline that emerged after genomics and proteomics. It is an important part of systems biology. It mainly investigates the dynamic changes of all small molecular metabolites and their contents before and after the biological system is stimulated or disturbed. Through the overall qualitative and quantitative analysis of all small molecule metabolites in the organism, the relationship between metabolites and physiological and pathological changes can be explored and discovered. Studies have shown that metabolome has important application value in the fields of early disease diagnosis, biomarker discovery, drug screening, toxicity evaluation, sports medicine and nutrition.
  • Metabolites have different molecular structures, and each structure produces a characteristic secondary (MS2) spectrum signal. On this principle, different metabolites can be identified from their spectra.
  • The main difficulties of metabolome identification at present are: 1. the FDR of large-scale metabolome identification cannot be evaluated, and there is no effective quality control method; 2. the spectrum utilization and identification coverage of large-scale metabolite identification are low; 3. large-scale metabolite identification tools have low performance and poor operability, and cannot meet the needs of many commercial applications and scientific research. A high-performance, large-scale metabolome identification method (tool) capable of FDR quality control is therefore needed to meet the needs of scientific research and commercial applications.
  • The present invention aims to provide a method and device for constructing a decoy library, constructing a target-bait library, and metabolome FDR identification, so as to process large-scale metabolomics data.
  • A method for constructing a decoy library includes the following steps: S1, compare the mass-to-charge ratio M of the metabolite precursor ion of each spectrum in the target database with all other spectra in the target database one by one, store the spectra that have a product ion mass-to-charge ratio equal to M, and/or the sequence numbers of those spectra, in a signal spectrum index array, and traverse all spectra in the target database to generate a signal spectrum index two-dimensional array; S2, select one signal spectrum index array in the signal spectrum index two-dimensional array, store the product ion signals of each spectrum in that signal spectrum index array in the first signal warehouse, then randomly select a part of the product ion signals from the corresponding spectrum in the target database and copy them to array D, and randomly select a certain number of product ion signals from the first signal warehouse to fill array D, so that the number of product ion signals in array D is the same as that of the corresponding spectrum in the target database.
  • When a part of the product ion signals is randomly selected from the corresponding spectrum in the target database and copied to array D, the ratio of the number of selected product ion signals to the total number of product ion signals in that spectrum is h, and h is between 0.6 and 0.9; preferably, h is 0.775.
  • the parent ion information of the spectrum in the target database includes the retention time, mass-to-charge ratio, and charge information of the parent ion.
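  • As an illustration, steps S1 to S3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `build_decoy`, the dictionary layout of a spectrum, the m/z tolerance `tol`, the perturbed fraction `k`, and the fallback used when no other spectrum contains a product ion at M (all original peaks are kept) are introduced here for illustration. The sketch also does not re-check that perturbed peaks avoid the original m/z values.

```python
import random

def build_decoy(target, h=0.775, k=0.5, max_shift=1.0, tol=0.01, seed=0):
    """target: list of dicts {"precursor_mz": float, "peaks": [(mz, intensity), ...]}."""
    rng = random.Random(seed)
    decoys = []
    for i, spec in enumerate(target):
        M = spec["precursor_mz"]
        # S1: index the other spectra that contain a product ion at m/z M (within tol).
        index = [j for j, other in enumerate(target)
                 if j != i and any(abs(mz - M) <= tol for mz, _ in other["peaks"])]
        # S2: pool the product ion signals of the indexed spectra (the "first signal warehouse").
        warehouse = [peak for j in index for peak in target[j]["peaks"]]
        n = len(spec["peaks"])
        # Assumption: with an empty warehouse, keep every original peak.
        n_keep = min(n, round(h * n)) if warehouse else n
        D = rng.sample(spec["peaks"], n_keep)                     # fraction h from the target spectrum
        D += [rng.choice(warehouse) for _ in range(n - n_keep)]   # fill up to n from the warehouse
        # Randomly shift the m/z of a fraction k of the signals in D.
        for idx in rng.sample(range(n), round(k * n)):
            mz, intensity = D[idx]
            D[idx] = (mz + rng.uniform(-max_shift, max_shift), intensity)
        # S3: copy the precursor ion information of the target spectrum to the decoy.
        decoys.append({**spec, "peaks": sorted(D), "is_decoy": True})
    return decoys
```

  • Each decoy spectrum keeps the precursor information and the peak count of its target spectrum, so the two libraries can be merged and searched together.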
  • a method for constructing a target-bait library includes: selecting and forming a target database; constructing a decoy library; and merging the target database and the decoy library to obtain a target-bait library, wherein the decoy library is constructed by any of the aforementioned methods for constructing a decoy library.
  • a method for metabolome FDR identification includes: converting the original mass spectrum data into unified spectrum data and reading to obtain the spectrum to be identified; constructing a target-bait library; matching the spectrum to be identified with the target-bait library; and sorting the matching results, and FDR identification is performed on the matching results; among them, the target-bait library is constructed by the above-mentioned method of constructing the target-bait library.
  • The unified spectrum data is a spectrum data file containing mass-to-charge ratio and peak intensity information; preferably, this spectrum data file is further stored as a data linked list, and the spectrum information stored in the data linked list includes the spectrum number, parent ion retention time, mass-to-charge ratio, charge information, product ion mass-to-charge ratio and corresponding peak intensity information.
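  • A minimal sketch of reading such spectrum data from MGF-formatted text is given below. The class name `SpectrumInfo` and the function `parse_mgf` are hypothetical, a Python list stands in for the linked list, and only the standard MGF keys `PEPMASS`, `CHARGE` and `RTINSECONDS` are handled; real files may carry further fields.

```python
from dataclasses import dataclass, field

@dataclass
class SpectrumInfo:
    number: int                      # spectrum number
    precursor_mz: float = 0.0        # parent ion mass-to-charge ratio
    charge: str = ""                 # parent ion charge information
    retention_time: float = 0.0      # parent ion retention time (seconds)
    peaks: list = field(default_factory=list)  # (product ion m/z, intensity)

def parse_mgf(text):
    """Parse MGF text into a list of SpectrumInfo records."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = SpectrumInfo(number=len(spectra) + 1)
        elif line == "END IONS" and current is not None:
            spectra.append(current)
            current = None
        elif current is None or not line:
            continue  # skip blank lines and anything outside an ion block
        elif "=" in line:
            key, value = line.split("=", 1)
            if key == "PEPMASS":
                current.precursor_mz = float(value.split()[0])
            elif key == "CHARGE":
                current.charge = value
            elif key == "RTINSECONDS":
                current.retention_time = float(value)
        else:
            mz, intensity = line.split()[:2]
            current.peaks.append((float(mz), float(intensity)))
    return spectra
```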
  • Matching the spectrum to be identified with the target-bait library includes: normalizing the product ion signal intensity values of each spectrum to be identified and of each spectrum in the target-bait library; selecting a spectrum to be identified, obtaining its parent ion mass-to-charge ratio M, screening out the numbers of all spectra in the target-bait library whose parent ion mass-to-charge ratio is M and storing them in a spectrum number index array, then traversing each spectrum to be identified to obtain a spectrum number index two-dimensional array; storing the product ion signals of all spectra in the target-bait library in the second signal warehouse and taking the second signal warehouse as the overall distribution of signal peak intensity; and selecting a spectrum to be identified, testing all of its product ion spectrum signals against the second signal warehouse as the population to obtain the weights of the spectrum signals, then traversing each spectrum to be identified to obtain the weight array.
  • The normalization processing includes normalizing the signal intensity value of the product ion to the interval (0,1); preferably, the normalization processing includes dividing the signal intensity value of each product ion by the maximum product ion signal intensity value in the spectrum to which it belongs.
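  • The preferred normalization and the screening of candidate spectra by parent ion mass-to-charge ratio can be sketched as follows. The function names and the tolerance `tol` are assumptions made for illustration; the text only says that the parent ion mass-to-charge ratio "is M" and does not state a tolerance.

```python
def normalize(peaks):
    """Divide every intensity by the largest intensity in the same spectrum."""
    top = max(intensity for _, intensity in peaks)
    return [(mz, intensity / top) for mz, intensity in peaks]

def candidate_index(query_precursors, library_precursors, tol=0.01):
    """For each query precursor m/z, list the library spectrum numbers with a
    matching precursor m/z (the spectrum number index two-dimensional array)."""
    return [[j for j, lib_mz in enumerate(library_precursors)
             if abs(lib_mz - q_mz) <= tol]
            for q_mz in query_precursors]
```

  • Note that dividing by the maximum maps the most intense peak to exactly 1.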
  • The weight value is obtained by the following steps: with the second signal warehouse as the population, all product ion spectrum signals in the selected spectrum to be identified are tested to obtain a statistic for each of them, and the reciprocal of the obtained statistic is used as the weight of the product ion spectrum signal; preferably, the test is the Grubbs test, a box plot test or a normal distribution test.
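  • One possible reading of the weighting step, using a Grubbs-type statistic, is sketched below. It assumes the statistic for a signal x is |x - mean| / standard deviation, with the mean and standard deviation taken over the second signal warehouse; the function name and the small constant `eps`, which guards against division by zero, are additions made here and are not stated in the text.

```python
import statistics

def signal_weights(query_intensities, warehouse_intensities, eps=1e-9):
    """Weight each query signal by the reciprocal of its Grubbs-type statistic
    computed against the warehouse population."""
    mu = statistics.mean(warehouse_intensities)
    sd = statistics.stdev(warehouse_intensities)
    return [1.0 / (abs(x - mu) / sd + eps) for x in query_intensities]
```

  • Under this reading, signals close to the centre of the warehouse distribution receive large weights, and outlying signals receive small ones.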
  • the calculation formula is as follows:
  • is the correction coefficient, which is the reciprocal of the difference between the product ion signal of the spectrum to be identified and the product ion signal of the reference spectrum
  • I is the spectrogram product ion signal vector
  • w is the weight value of the spectrogram product ion spectrum signal to be identified
  • T is the theoretical signal matching rate of this match
  • E is the experimental signal matching rate of this match.
  • When the spectrum identification result sn makes FDR > x, the effective identification results of the batch are M = {s1, s2, s3, ..., s(n-1)}; preferably, x is less than or equal to 0.2, more preferably 0.05 or less, and still more preferably 0.01 or less.
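  • The cutoff rule can be sketched as follows, reading the criterion as "stop at the first result sn whose running FDR exceeds x". The FDR estimate used here (number of decoy hits divided by number of target hits among the results so far) is a common target-decoy estimate and is an assumption; the text does not state the estimator. The function name `fdr_filter` is hypothetical.

```python
def fdr_filter(matches, x=0.05):
    """matches: list of (score, is_decoy). Returns the results s1..s(n-1)
    ranked before the first result sn whose running FDR exceeds x."""
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    accepted = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / max(targets, 1)   # assumed estimator: decoy hits / target hits
        if fdr > x:
            break
        accepted.append((score, is_decoy))
    return accepted
```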
  • a bait library is provided.
  • the bait library is constructed by any one of the methods of constructing a bait library above.
  • a target-bait library is provided.
  • the target-bait library is constructed by any of the aforementioned methods for constructing a target-bait library.
  • A device for constructing a decoy library includes: a signal spectrum index two-dimensional array generation module, which is set to compare the mass-to-charge ratio M of the metabolite precursor ion of each spectrum in the target database with all other spectra in the target database one by one, store the spectra that have a product ion mass-to-charge ratio equal to M, and/or the sequence numbers of those spectra, in a signal spectrum index array, and traverse all spectra in the target database to generate the signal spectrum index two-dimensional array; a decoy library signal array generation module, which is set to select one signal spectrum index array in the signal spectrum index two-dimensional array, store the product ion signals of each spectrum in that signal spectrum index array in the first signal warehouse, randomly select a part of the product ion signals from the corresponding spectrum in the target database and copy them to array D, and randomly select a certain number of product ion signals from the first signal warehouse to fill array D, so that the number of product ion signals in array D is the same as that of the corresponding spectrum in the target database; then randomly select some of the signals in array D and randomly change their mass-to-charge ratios to avoid overlap with the mass-to-charge ratios of the corresponding spectrum in the target database; the n arrays D obtained in this way constitute the decoy library signal array, where n is a natural number and each array D has the same serial number as its corresponding spectrum; and a decoy library generation module, which is set to copy the precursor ion information of the corresponding spectrum in the target database to the decoy library signal array to form the decoy library.
  • When a part of the product ion signals is randomly selected from the corresponding spectrum in the target database and copied to array D, the ratio of the number of selected product ion signals to the total number of product ion signals in that spectrum is h, and h is between 0.6 and 0.9; preferably, h is 0.775.
  • Randomly changing the mass-to-charge ratio includes adding or subtracting a mass-to-charge ratio value of random size, where the perturbation value is less than the mass-to-charge ratio of the precursor ion; preferably, this includes uniformly increasing by a random mass-to-charge ratio, uniformly decreasing by a random mass-to-charge ratio, or randomly increasing/decreasing by a random mass-to-charge ratio; preferably, the perturbation is within ±1 Da; preferably, the proportion of the selected part of the signals to the total signals in array D is k.
  • the parent ion information of the spectrum in the target database includes the retention time, mass-to-charge ratio, and charge information of the parent ion.
  • A device for constructing a target-bait library includes: a target database generating module configured to select and form a target database; a decoy library building module configured to build a decoy library; and a merge module configured to merge the target database generated by the target database generating module with the decoy library built by the decoy library building module to obtain a target-bait library, wherein the decoy library building module is any of the above-mentioned devices for constructing a decoy library.
  • A device for FDR identification of metabolome includes: a unified format module, which is set to convert the original mass spectrum data into unified spectrum data and read it to obtain the spectrum to be identified; a target-bait library building module, which is set to build a target-bait library; a matching module, which is set to match the spectra to be identified obtained by the unified format module with the target-bait library built by the target-bait library building module; and an FDR identification module, which is set to sort the matching results of the matching module and perform FDR identification on them; wherein the target-bait library building module is the above device for constructing a target-bait library.
  • The unified spectrum data is a spectrum data file containing mass-to-charge ratio and peak intensity information; preferably, the unified format module stores this spectrum data file as a data linked list.
  • The spectrum information stored in the data linked list includes the number of the spectrum, the retention time of the parent ion, the mass-to-charge ratio, the charge information, the mass-to-charge ratio of the product ions, and the corresponding peak intensity information.
  • The matching module includes: a normalization processing sub-module, which is set to normalize the product ion signal intensity values of each spectrum to be identified and of each spectrum in the target-bait library; a spectrum number index two-dimensional array generation sub-module, which is set to select a spectrum to be identified, obtain its parent ion mass-to-charge ratio M, screen out the numbers of all spectra in the target-bait library whose parent ion mass-to-charge ratio is M and store them in a spectrum number index array, and traverse each spectrum to be identified to obtain the spectrum number index two-dimensional array; a weight array generation sub-module, which is set to store the product ion signals of all spectra in the target-bait library in the second signal warehouse, take the second signal warehouse as the overall distribution of signal peak intensity, select a spectrum to be identified, test all of its product ion spectrum signals against the second signal warehouse as the population to obtain the weights of the spectrum signals, and traverse each spectrum to be identified to obtain the weight array; a scoring sub-module, which is set to match and score the product ion signals of the spectrum to be identified against the product ion signals in the reference spectrum; and an identification result array generation module, which is set to select a spectrum number index array, match the spectrum to be identified against each spectrum traversed in the selected spectrum number index array, take the result with the highest matching score as the identification result of the spectrum to be identified, and traverse all elements in the spectrum number index two-dimensional array to obtain the identification result array of the spectra to be identified.
  • The normalization processing sub-module is set to normalize the product ion signal intensity value to the interval (0,1); preferably, the normalization processing includes dividing each product ion signal intensity value by the maximum product ion signal intensity value in the spectrum to which it belongs.
  • The weight array generation sub-module is set to test all product ion spectrum signals in the selected spectrum to be identified, with the second signal warehouse as the population, to obtain a statistic for each of them, and the reciprocal of the obtained statistic is used as the weight of the product ion spectrum signal; preferably, the test is the Grubbs test, a box plot test or a normal distribution test.
  • the scoring sub-module is set to define the product ion signal of the spectrum to be identified and the product ion signal of the reference spectrum into two arrays respectively with
  • the calculation formula is as follows:
  • is the correction coefficient, which is the reciprocal of the difference between the product ion signal of the spectrum to be identified and the product ion signal of the reference spectrum
  • I is the spectrogram product ion signal vector
  • w is the weight value of the spectrogram product ion spectrum signal to be identified
  • T is the theoretical signal matching rate of this match
  • E is the experimental signal matching rate of this match.
  • The effective identification results of the batch are M = {s1, s2, s3, ..., s(n-1)}; preferably, x is less than or equal to 0.2, more preferably less than or equal to 0.05, and still more preferably less than or equal to 0.01.
  • a storage medium stores a computer program, wherein the computer program is configured to execute the method of constructing a decoy library, the method of constructing a target-bait library, and/or the method of metabolome FDR identification during operation.
  • an electronic device includes a memory and a processor, and a computer program is stored in the memory.
  • the processor is configured to run the computer program to execute the method of constructing a decoy library, the method of constructing a target-bait library, and/or the method of metabolome FDR identification.
  • the method of randomly selecting signals based on the database and using the target database can effectively generate the bait library, and can be widely used in FDR and quality control.
  • The decoy library constructed by the method or device of the present invention has a high similarity to the target library and therefore a higher decoy ability, so it can be applied to FDR quality control of metabolome identification results where there are many isomers or the metabolites have high structural similarity.
  • the technical scheme of the present invention can be adjusted as needed to generate the similarity between the decoy library and the target library to meet the FDR quality control requirements of different situations (high similarity, medium similarity, or low similarity).
  • The method for metabolome FDR identification using the decoy library or the target-bait library obtained by the technical scheme of the present invention has the following advantages: 1) FDR quality control can be performed on the identification results, using the target-bait library strategy; 2) the spectra of metabolites can be identified quickly and with high throughput; 3) in the process of spectrum identification, the retention time limit on the parent ion is lifted and the matching range of the experimental spectrum is increased, which improves spectrum utilization and the coverage of metabolite identification.
  • Figure 1 shows a schematic diagram of the overall analysis process of the metabolome FDR identification method in an embodiment of the present invention
  • Figure 2 shows a schematic diagram of an exemplary MGF spectrogram file data format in an embodiment of the present invention
  • FIG. 3 shows a schematic diagram of the main flow of target-bait library generation in an embodiment of the present invention
  • FIG. 4 shows a schematic diagram of the main process of metabolite spectrum matching in an embodiment of the present invention
  • Figure 5 shows an example of a Passatutto_query.mgf format file obtained in Embodiment 1;
  • FIG. 6 shows an example of the Target_GNPS.mgf format file obtained in Embodiment 1;
  • FIG. 7 shows a schematic diagram of the generation process of the bait library in Embodiment 1;
  • Figure 8 shows an example of a schematic diagram of the signal warehouse S in Embodiment 1;
  • Figure 9a shows the target database spectrogram p1 in Example 1;
  • Figure 9b shows the array D1 in Example 1;
  • Figure 9c shows the spectrum obtained in Example 1 after a certain number of product ion signals are randomly selected from the signal warehouse S and filled into array D1;
  • FIG. 10 shows an example of a schematic diagram of the target-decoy library file Target_Decoy_GNPS.mgf generated in Embodiment 1;
  • FIG. 11 shows an example of a schematic diagram of comparison between the first query spectrum q1 in Example 1 and the first spectrum of the reference database, namely the target-bait library;
  • FIG. 12 shows the scoring and ranking of the comparison between the spectra to be queried and the reference library spectra in Embodiment 1;
  • Figure 13-1, Figure 13-2, Figure 13-3, Figure 13-4, Figure 13-5, Figure 13-6, Figure 13-7, Figure 13-8, Figure 13-9, Figure 13-10 and Figure 13-11 show the FDR quality control and output list of the Passatutto_query.mgf identification results in Example 1;
  • Figure 14 shows the FDR quality control performance of the XY-Meta target-bait library in Example 1;
  • Figure 15 shows a schematic diagram of the loading process of an XY-Meta decoy library
  • FIG. 16 shows a schematic diagram of the XY-Meta spectrum matching result of Example 1.
  • Figure 17 shows a schematic diagram of a XY-Meta semi-search metabolome identification process
  • Figure 18 shows a schematic diagram of an XY-Meta open search metabolome identification process
  • Figure 19 shows a schematic diagram of an XY-Meta iterative search metabolome identification process.
  • Metabolome refers to the dynamic overall collection of metabolites in an organism.
  • the metabolome usually refers to only small molecular metabolites with a relative molecular mass within 1000.
  • Mass-to-charge ratio (m/z): the ratio of the mass of a charged ion to its charge. It is a physical characteristic of the ion and has a definite value; limited by the resolution of the instrument, the detected m/z fluctuates.
  • Retention Time: the time from the start of sample injection to the appearance of the maximum concentration of the component after the column, that is, the time elapsed from the start of injection to the apex of the component's chromatographic peak. For a specific separation column, the retention time of a component (molecular ion) is related to its physical and chemical properties.
  • Molecular ion peaks (Peaks): the molecular ion peaks in a sample, expressed as [mzmin, mzmax, rtmin, rtmax].
  • Collision Induced Dissociation: the process of transferring energy to ions through collisions with neutral molecules; the energy transferred is sufficient to cause bond cleavage and rearrangement.
  • False-discovery rate (FDR): a method used to control multiple comparisons in multiple hypothesis testing; it describes the proportion of false positives that may occur in a large-scale identification.
  • Target library: a target reference library for the comparison of secondary spectra.
  • Decoy: a simulated reference library that theoretically has the same characteristics as the target library; the spectra in the decoy library do not appear in the target library.
  • Target-Decoy: an FDR quality control strategy that simulates the state of random matching of spectra through the decoy library, and then estimates the false discovery rate (FDR) of spectrum matching based on the statistical results.
  • Signal features: compound ions generate specific product ions through secondary fragmentation such as collision dissociation. The mass spectrometer can collect the signals of these product ions, and the signal data obtained are called the signal features of the compound.
  • Intensity: a measure of the abundance of an element or compound in mass spectrometry.
  • MS2: secondary spectrum.
  • Precursor ion/parent ion: the unfragmented substance (metabolite) ion, MS1.
  • Product ions: compound ions can generate characteristic fragment ions through collision-induced and other fragmentation methods in the mass spectrometer; these are called product ions.
  • Experimental spectrum: the secondary spectrum collected from an experimental sample during the experiment.
  • Reference spectrum: the standard secondary spectrum of a compound. The compound corresponding to an experimental spectrum can be determined by comparing the experimental spectrum with reference spectra.
  • Adducts: after ionization, metabolites can combine with H2O, H+ and NH4+ ions; these ions are called adducts.
  • Ion addition form: the new compound form produced when a metabolite combines with H2O, H+, NH4+, Na+ or K+ ions in the process of ionization.
  • MSconvert: software that converts mass spectrometry raw data into other file formats.
  • Spectrum_info: the data structure used to save the spectrum signals and attributes of a mass spectrum.
  • Signal warehouse: a numerical matrix composed of all the product ion signals of more than one secondary spectrum.
  • Signal spectrum: a secondary spectrum extracted from the target library; all the signals in it are added to the signal warehouse.
  • Signal spectrum index array: an array used to store the index numbers of the selected spectra in the target library.
  • Spectral number index array: an array used to store the numbers of candidate spectra in the spectral database.
  • Passatutto: a tool for evaluating the performance of metabolite decoy libraries. It carries a database of query spectra and standard reference spectra, and can achieve FDR quality control of the identification results.
  • Grubbs test: a hypothesis testing method, often used to test for a single outlier in a univariate data set that follows a normal distribution; if there is an outlier, it must be the maximum or minimum value in the data set.
  • Experimental signal matching rate: the ratio of the number of signals in the query spectrum that can be matched with the reference spectrum to the total number of signals in the query spectrum.
  • Theoretical signal matching rate: the ratio of the number of signals in the reference spectrum that can be matched with the query spectrum to the total number of signals in the reference spectrum.
  • Decoy ability: an indicator of the performance of the decoy library. When the query spectra are matched against the target-bait library, the more query spectra are matched to the decoy library, the stronger the decoy library's ability to decoy the model algorithm.
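  • The experimental and theoretical signal matching rates defined above can be computed as in the following minimal sketch. The function name and the m/z tolerance `tol` used to decide whether two signals match are assumptions; the definitions do not state how a match is decided.

```python
def matching_rates(query_mz, reference_mz, tol=0.01):
    """Return (E, T): the experimental and theoretical signal matching rates."""
    matched_query = sum(any(abs(q - r) <= tol for r in reference_mz) for q in query_mz)
    matched_reference = sum(any(abs(r - q) <= tol for q in query_mz) for r in reference_mz)
    E = matched_query / len(query_mz)            # share of query signals found in the reference
    T = matched_reference / len(reference_mz)    # share of reference signals found in the query
    return E, T
```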
  • Non-targeted metabolomics has a strong ability to identify unknown metabolites, high throughput and low cost, and is widely used in metabolic testing and scientific research on various samples; the total amount of samples and data for metabolic testing is unprecedentedly large. On the other hand, because non-targeted metabolome identification lacks stability and has poor reproducibility, the study of metabolome identification strategies has become an important and difficult point in non-targeted metabolomics.
  • non-targeted metabolome analysis tools have become a research hotspot, and many non-targeted metabolome analysis tools have appeared in the past 10 years. These metabolic tools have been very mature for the quantitative analysis of metabolome, but the large-scale identification of metabolites is still the bottleneck of non-targeted metabolome research.
  • the main problem of non-targeted metabolome identification is that the FDR of the identification result cannot be evaluated, which greatly limits the application of non-targeted metabonomics technology. If the FDR of metabolome identification can be evaluated reasonably, the accuracy and stability of metabolome identification can be improved, and the development and application of non-targeted metabolomics technology can be greatly promoted.
  • A method for constructing a decoy library includes the following steps: S1, compare the mass-to-charge ratio M of the metabolite precursor ion of each spectrum in the target database with all other spectra in the target database one by one, store the spectra that have a product ion mass-to-charge ratio equal to M, and/or the sequence numbers of those spectra, in a signal spectrum index array, and traverse all spectra in the target database to generate a signal spectrum index two-dimensional array; S2, select one signal spectrum index array in the signal spectrum index two-dimensional array, store the product ion signals of each spectrum in that signal spectrum index array in the first signal warehouse, then randomly select a part of the product ion signals from the corresponding spectrum in the target database and copy them to array D, and randomly select a certain number of product ion signals from the first signal warehouse to fill array D, so that the number of product ion signals in array D is the same as that of the corresponding spectrum in the target database.
  • the target database is used to generate a bait library based on the method of randomly selecting signals from the database.
  • the FDR of the identification result can be evaluated by the quality control module and quality control can be performed.
  • the performance of the decoy library of the present invention was evaluated using the Passatutto standard spectral library, and it was found that the decoy library constructed by the method of constructing the decoy library of the present invention has the same characteristics as the target library, and can effectively evaluate the FDR of the identification result.
  • a part of the product ion signals are randomly selected from the corresponding spectrum in the target database and copied to the array D.
  • the ratio of the number of selected product ion signals to the total number of product ion signals in the corresponding spectrum in the target database is h, h ≤ 1; the larger h is, the greater the similarity between the resulting decoy library and the target database.
  • preferably, h is 0.6 to 0.9, a range in which the resulting decoy library gives a better FDR quality control effect; h = 0.775 works best.
  • randomly changing the mass-to-charge ratio of a signal includes adding or subtracting a random mass-to-charge offset; the purpose of this disturbance is to avoid overlap with the original library spectrum P.
  • the disturbance value should be less than the parent ion mass-to-charge ratio.
  • adding or subtracting the random mass-to-charge offset includes uniformly increasing the mass-to-charge ratio, uniformly decreasing it, or randomly increasing or decreasing it; preferably, the disturbance is ⁇ 1Da; more preferably, the proportion of signals in the array D that are selected for disturbance is k, k ≤ 1. The larger the value of k, the greater the disturbance to the spectrum signals.
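The selection and disturbance steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `make_decoy_peaks`, the 1 Da bound `max_shift`, the default `k`, and the choice to top up the array D from the warehouse until it has as many signals as the target spectrum are all hypothetical.

```python
import random


def make_decoy_peaks(target_peaks, warehouse, precursor_mz,
                     h=0.775, k=0.5, max_shift=1.0, rng=None):
    """Build one decoy peak list (array D) for one target spectrum.

    target_peaks, warehouse: lists of (mz, intensity) tuples.
    h: fraction of the target spectrum's own peaks copied into D.
    k: fraction of the peaks in D whose m/z is disturbed.
    max_shift: assumed bound on the disturbance, in Da.
    """
    rng = rng or random.Random()
    n = len(target_peaks)
    n_keep = int(round(h * n))
    # Copy a random part of the target spectrum's product ion signals to D.
    decoy = rng.sample(target_peaks, n_keep)
    # Assumption: top up from the signal warehouse to the original peak count.
    n_fill = n - n_keep
    if n_fill > 0 and warehouse:
        decoy += [rng.choice(warehouse) for _ in range(n_fill)]
    # The disturbance must stay below the parent ion m/z.
    max_shift = min(max_shift, precursor_mz)
    n_shift = int(round(k * len(decoy)))
    for i in rng.sample(range(len(decoy)), n_shift):
        mz, intensity = decoy[i]
        decoy[i] = (mz + rng.uniform(-max_shift, max_shift), intensity)
    return sorted(decoy)
```

Raising `h` keeps more of the original signals and so yields a decoy closer to the target spectrum, while raising `k` disturbs more of the signals in D.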
  • the present invention generates the decoy library from the target database by perturbing spectral database signals, and further constructs the target-bait library to perform FDR quality control on the identification results. The similarity between the target library and the decoy library can thus be controlled, which adapts the method to metabolome identification of target data sets with different structural similarities and improves the accuracy and stability of metabolome identification.
  • the parent ion information of the spectra in the target database includes the retention time, mass-to-charge ratio, and charge information of the parent ions, so that the decoy library can have more comprehensive parent ion information.
  • a method for constructing a target-bait library includes: selecting and forming a target database; constructing a decoy library; and merging the target database and the decoy library to obtain a target-bait library, wherein the decoy library is constructed by the method of constructing a decoy library as described above. Therefore, the method of constructing a target-bait library also has the advantages mentioned in the above method of constructing a decoy library.
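A minimal Python sketch of the merge step follows. The dictionary layout, the `is_decoy` flag, and the `DECOY_` identifier prefix are assumptions made for illustration; the text only states that the two libraries are merged.

```python
def build_target_decoy_library(target_library, decoy_library):
    """Merge target and decoy spectra into one searchable library.

    Each spectrum is a dict with at least an 'id' key. The 'is_decoy'
    flag (an assumption) lets later FDR estimation tell hits apart.
    """
    merged = []
    for spec in target_library:
        merged.append({**spec, "is_decoy": False})
    for spec in decoy_library:
        merged.append({**spec, "id": "DECOY_" + str(spec["id"]), "is_decoy": True})
    return merged
```

Tagging each entry at merge time keeps the later FDR step independent of how the decoys were produced.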
  • a method for metabolome FDR identification includes: converting the original mass spectrum data into unified spectrum data and reading it to obtain the spectra to be identified; constructing a target-bait library; matching the spectra to be identified with the target-bait library; and sorting the matching results and performing FDR (false discovery rate) identification on them; wherein the target-bait library is constructed by the above-mentioned method of constructing the target-bait library.
  • the application of this metabolome FDR identification method can perform FDR quality control on the identification results.
  • the FDR quality control method uses a target-bait library strategy; metabolite spectra can be identified rapidly and at high throughput; and the retention time restriction on the parent ion is removed in the spectrum identification step, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.
  • the unified spectrum data is a spectrum data file containing mass-to-charge ratio and peak intensity information, where the spectrum data file includes, but is not limited to, files in MGF, mzXML, mzML, or tda formats.
  • the unified spectrum data is a spectrum data file in MGF format; preferably, the spectrum data file containing the mass-to-charge ratio and peak intensity information is further stored as a data linked list, and the spectrum information stored in the data linked list includes the spectrum number, parent ion retention time, mass-to-charge ratio, charge information, product ion mass-to-charge ratios and the corresponding peak intensity information.
  • the data linked list includes, but is not limited to, a singly linked list, doubly linked list, binary tree, hash or mapping.
  • the spectrum data file in the MGF format is stored as Spectrum_info, which is a singly linked list.
  • matching the spectra to be identified with the target-bait library includes: comparing each spectrum to be identified with each spectrum in the target-bait library, and normalizing the product ion signal intensity values in each spectrum to be identified; selecting a spectrum to be identified, obtaining its parent ion mass-to-charge ratio M, screening out the numbers of all spectra in the target-bait library whose precursor ion mass-to-charge ratio is M, and storing them in a spectrum number index array; traversing every spectrum to be identified to obtain a two-dimensional spectrum number index array; and storing the product ion signals of all spectra in the target-bait library in the second signal warehouse.
  • the second signal warehouse serves as the population distribution of signal peak intensities: a spectrum to be identified is selected, all of its product ion signals are tested against the second signal warehouse as the population to obtain the weights of the spectrum signals, and every spectrum to be identified is traversed to obtain the weight array; the product ion signals of the spectrum to be identified are matched and scored against the product ion signals of the reference spectrum; and a spectrum number index array is selected, the corresponding spectrum to be identified is matched against each spectrum listed in that array, the result with the highest matching score is taken as the identification result of that spectrum, and all elements of the two-dimensional spectrum number index array are traversed to obtain the identification result array of the spectra to be identified.
  • the similarity between the spectrum to be identified and the target-bait library can thus be compared; the degree of similarity is reflected by the matching score between the spectrum to be identified and the reference spectra in the target-bait library, so the best identification result for each spectrum to be identified can be screened out effectively.
  • the normalization processing includes normalizing the product ion signal intensity values to the interval (0, 1); preferably, the normalization processing includes dividing each product ion signal intensity value by the maximum product ion signal intensity in the spectrum to which it belongs. After normalization, the ion signal values of all spectra to be identified and all reference spectra lie in the same numerical interval, so that any spectrum to be identified can be compared pairwise with any reference spectrum.
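The preferred normalization, division by the largest product ion intensity in the same spectrum, can be sketched in Python as follows; the function name and the tuple layout are assumptions. Note that with this scheme the base peak maps to exactly 1.0.

```python
def normalize_peaks(peaks):
    """Scale intensities of one spectrum by its most intense product ion.

    peaks: list of (mz, intensity) tuples; returns a new list.
    """
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        # Degenerate spectrum: nothing meaningful to scale by.
        return [(mz, 0.0) for mz, _ in peaks]
    return [(mz, intensity / max_intensity) for mz, intensity in peaks]
```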
  • the weight value is obtained by the following steps: with the second signal warehouse as the population, all product ion signals in the selected spectrum to be identified are tested to obtain a statistic for each product ion signal, and the reciprocal of the statistic is taken as the weight of that product ion signal; the test is a Grubbs test, box plot test or normal distribution test.
  • the signal-to-noise ratio of the spectrum signals is introduced into the scoring algorithm for spectrum matching: the matching algorithm uses the Grubbs outlier test to calculate the weights of the spectrum signals, and these weights take part in the subsequent calculation of the spectrum matching score, which improves the noise resistance of spectrum matching.
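A hedged Python sketch of the Grubbs-based weighting follows. It assumes the statistic for a signal x is G = |x - mean| / s, with the mean and sample standard deviation taken over the second signal warehouse; the `eps` guard against division by zero and the fallback for a constant warehouse are additions made here for robustness.

```python
import statistics


def grubbs_weights(intensities, warehouse_intensities, eps=1e-6):
    """Weight each signal by the reciprocal of its Grubbs-type statistic.

    intensities: normalized intensities of one spectrum to be identified.
    warehouse_intensities: intensities pooled from the whole library,
    treated as the population (needs at least two values).
    """
    mean = statistics.fmean(warehouse_intensities)
    sd = statistics.stdev(warehouse_intensities)
    if sd == 0:
        # Assumption: a constant population gives every signal equal weight.
        return [1.0] * len(intensities)
    weights = []
    for x in intensities:
        g = abs(x - mean) / sd  # Grubbs statistic for a single value
        weights.append(1.0 / max(g, eps))
    return weights
```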
  • is the correction coefficient, which is the reciprocal of the difference between the product ion signal of the spectrum to be identified and the product ion signal of the reference spectrum
  • I is the product ion signal vector of the spectrum
  • w is the weight of the product ion signals of the spectrum to be identified
  • T is the theoretical signal matching rate of this match
  • E is the experimental signal matching rate of this match.
  • FDR is used to control the quality of the identification results. Taking results within an FDR threshold of 0.01 as valid identifications means that about 1% of the valid identifications may be false positives; with a threshold of 0.02, about 2% may be false positives.
  • a bait library is also provided.
  • the bait library is constructed by the method of constructing the bait library described above.
  • a target-bait library is also provided.
  • the target-bait library is constructed by the above method of constructing the target-bait library.
  • a new metabolome identification method, named XY-Meta, is provided; the specific technical solution is as follows:
  • the overall analysis process of XY-Meta mainly includes the conversion of spectrum raw data, spectrum data standardization, spectrum matching, identification result FDR quality control and matching result output.
  • the specific process is as follows:
  • the MGF format is a common data format for MS2 spectra. It records the spectrum number, retention time, mass-to-charge ratio, charge, and the product ion mass-to-charge ratios and peak intensities; a complete MGF file can be used for the analysis and identification of spectra.
  • MSconvert is used to convert the raw instrument file (the original mass spectrum data, also called the data or spectra to be identified, for example raw data from a Thermo Fisher instrument) into a spectrum data file in MGF format.
  • Figure 2 shows the data format of the MGF spectrogram file as an example.
  • the Spectrum_info structure stores the number of the spectrum, the retention time of the parent ion, the mass-to-charge ratio, the charge information, the mass-to-charge ratios of the product ions, and the corresponding peak intensity information.
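Reading an MGF file into a structure with these fields can be sketched in Python. `SpectrumInfo` and `parse_mgf` are hypothetical names, and a plain list stands in for the linked list; the keys `TITLE`, `PEPMASS`, `CHARGE` and `RTINSECONDS` and the `BEGIN IONS` / `END IONS` delimiters are the usual MGF conventions.

```python
from dataclasses import dataclass, field


@dataclass
class SpectrumInfo:
    title: str = ""               # spectrum number / title
    retention_time: float = 0.0   # parent ion retention time, seconds
    precursor_mz: float = 0.0     # parent ion mass-to-charge ratio
    charge: str = ""              # charge information, e.g. "1+"
    peaks: list = field(default_factory=list)  # (mz, intensity) tuples


def parse_mgf(text):
    """Parse MGF-formatted text into a list of SpectrumInfo records."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = SpectrumInfo()
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is None:
            continue  # ignore anything outside an ion block
        elif "=" in line:
            key, value = line.split("=", 1)
            key = key.upper()
            if key == "TITLE":
                current.title = value
            elif key == "PEPMASS":
                # PEPMASS may carry an optional intensity after the m/z.
                current.precursor_mz = float(value.split()[0])
            elif key == "CHARGE":
                current.charge = value
            elif key == "RTINSECONDS":
                current.retention_time = float(value)
        else:
            parts = line.split()
            if len(parts) >= 2:
                current.peaks.append((float(parts[0]), float(parts[1])))
    return spectra
```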
  • target-bait library generation includes: screening the target database by parent ion to obtain the signal spectra, pooling all signal spectra into the signal warehouse, randomly selecting signals from the signal warehouse to form decoy spectra and hence the decoy library, and merging the target database with the decoy library to obtain the target-bait library.
  • the specific process is as follows:
  • the value of h is 0.6 to 0.9, and a decoy library obtained with h in this range has a better FDR quality control effect.
  • the disturbance value should be less than the parent ion mass-to-charge ratio, preferably the disturbance is ⁇ 1Da,
  • Array A is the decoy library.
  • the spectrum to be identified and the target-bait library are obtained, and the spectrum matching algorithm is used to match the spectrum to be identified with the target-bait library.
  • the main process of metabolite spectrum matching is shown in Figure 4, including peak intensity normalization of the spectrum to be identified, peak intensity weight calculation, matching score, and matching result output. The specific process is as follows:
  • the matching rate T e/total_t; after signal matching is completed, the vector dot product algorithm is used to calculate the sum of the products of the product ion signals of the spectrum to be identified and the product ion signals of the reference spectrum.
  • the calculation formula is as follows:
  • is the correction coefficient, which is the reciprocal of the difference between the product ion signal of the spectrum to be identified and the product ion signal of the reference spectrum.
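Only the signal matching and dot-product step is illustrated in the Python sketch below. The correction coefficient, and the way T and E enter the final score, are deliberately not reproduced. The product ion tolerance `tol`, the greedy nearest-peak pairing, and the definitions of the two matching rates as matched signals over total signals are assumptions.

```python
def match_and_score(query, reference, weights=None, tol=0.01):
    """Pair product ions within a tolerance and sum weighted products.

    query, reference: lists of (mz, normalized_intensity) tuples.
    weights: one weight per query signal (defaults to 1.0 each).
    Returns (dot_sum, reference_match_rate, query_match_rate).
    """
    if weights is None:
        weights = [1.0] * len(query)
    used = set()
    dot_sum = 0.0
    for ref_mz, ref_int in reference:
        best, best_diff = None, tol
        for i, (q_mz, _) in enumerate(query):
            diff = abs(q_mz - ref_mz)
            if i not in used and diff <= best_diff:
                best, best_diff = i, diff
        if best is not None:
            used.add(best)  # each query signal is paired at most once
            dot_sum += weights[best] * query[best][1] * ref_int
    matched = len(used)
    ref_rate = matched / len(reference) if reference else 0.0
    query_rate = matched / len(query) if query else 0.0
    return dot_sum, ref_rate, query_rate
```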
  • Spectrum matching and result output: traverse the two-dimensional spectrum number index array H{h1, h2, h3...hn}, select a spectrum number index array hn, traverse all the spectrum numbers in hn, match the spectrum qn to be identified with each reference spectrum listed in hn, take the result with the highest matching score as the identification result of the spectrum qn, and then put the identification result of each spectrum into the array Score.
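The traversal of H and the selection of the best-scoring reference can be sketched in Python as below. The function names, the dictionary layout, and the precursor tolerance `tol` are assumptions, and the scoring function is passed in so that the sketch does not commit to a particular score.

```python
def build_candidate_index(queries, library, tol=0.01):
    """Return H: for each query, the library positions whose precursor
    m/z matches the query's precursor m/z within tol."""
    return [
        [j for j, ref in enumerate(library)
         if abs(ref["precursor_mz"] - q["precursor_mz"]) <= tol]
        for q in queries
    ]


def identify(queries, library, score_fn, tol=0.01):
    """Return the Score array: (library_position, score) per query,
    or None when no reference shares the query's precursor m/z."""
    index = build_candidate_index(queries, library, tol)
    results = []
    for q, candidates in zip(queries, index):
        best = None
        for j in candidates:
            s = score_fn(q, library[j])
            if best is None or s > best[1]:
                best = (j, s)  # keep the highest-scoring reference
        results.append(best)
    return results
```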
  • FDR of the identification result = decoy_score/(target_score + decoy_score). Preferably, in an embodiment of the present application, the FDR threshold is less than 0.2; more preferably it is less than 0.05, and most preferably it is 0.01. When the traversal reaches an identification result sn at which the FDR crosses the 0.01 threshold, the valid identification results of the batch are M{s1, s2, s3...s(n-1)}.
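The threshold traversal can be sketched in Python under two labeled assumptions: `decoy_score` and `target_score` are read as the running counts of decoy and target hits in the score-sorted list, and the traversal stops at the first result whose FDR strictly exceeds the threshold. The function name `fdr_filter` is hypothetical.

```python
def fdr_filter(results, threshold=0.01):
    """Keep the top-scoring results until the running FDR exceeds threshold.

    results: list of (score, is_decoy) tuples in any order.
    Returns (accepted_results, fdr_at_each_accepted_result).
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    accepted, fdrs = [], []
    decoys = targets = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / (targets + decoys)
        if fdr > threshold:
            break  # this result is sn; s1..s(n-1) are the valid results
        accepted.append((score, is_decoy))
        fdrs.append(fdr)
    return accepted, fdrs
```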
  • the FDR calculation process is shown in Table 1.
  • the output identification information includes: mass spectrum number, final score, FDR, metabolite annotation information, matching score, theoretical signal matching rate, experimental spectrum signal-to-noise ratio, theoretical spectrum parent ion mass-to-charge ratio, experimental spectrum parent ion mass-to-charge ratio, adduct type, adduct mass, and the number of matching signals.
  • the metabolome FDR identification method of the present invention has the following important features: 1) FDR quality control can be performed on the identification results, using a target-bait library strategy; 2) metabolite spectra can be identified rapidly and at high throughput; 3) the retention time restriction on the parent ion is removed in the spectrum identification step, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.
  • a device for constructing a decoy library includes a signal spectrum index two-dimensional array generation module, a decoy library signal array generation module, and a decoy library generation module. The signal spectrum index two-dimensional array generation module is configured to compare the precursor ion mass-to-charge ratio M of each spectrum in the target database with all other spectra in the target database one by one, store the spectra having a product ion mass-to-charge ratio equal to M, and/or the sequence numbers of those spectra, in a signal spectrum index array, and traverse all spectra in the target database to generate the signal spectrum index two-dimensional array. The decoy library signal array generation module is configured to select one signal spectrum index array from the two-dimensional array, store the product ion signals of each spectrum listed in that array in the first signal warehouse, and then randomly select a part of the product ion signals from the corresponding spectrum in the target database and copy them to the array D
  • the decoy library is generated from the target database by randomly selecting signals from the database.
  • the FDR of the identification result can be evaluated by the quality control module and quality control can be performed.
  • the performance of the decoy library of the present invention was evaluated using the Passatutto standard spectral library, and it was found that the decoy library constructed by the device for constructing the decoy library of the present invention has the same characteristics as the target library, and can effectively evaluate the FDR of the identification result.
  • in the decoy library signal array generation module, a part of the product ion signals is randomly selected from the corresponding spectrum in the target database and copied to the array D.
  • the ratio of the number of selected product ion signals to the total number of product ion signals of the corresponding spectrum in the target database is h, h ≤ 1; the larger h is, the greater the similarity between the resulting decoy library and the target database.
  • preferably, the value of h is 0.6 to 0.9; in a more preferred embodiment, the effect is best when h is 0.775.
  • randomly changing the mass-to-charge ratio of a signal includes adding or subtracting a random mass-to-charge offset; the purpose of this disturbance is to avoid overlap with the original library spectrum P, and the disturbance value should be less than the parent ion mass-to-charge ratio.
  • the present invention generates the decoy library from the target database by perturbing spectral database signals, and further constructs the target-bait library to perform FDR quality control on the identification results. The similarity between the target library and the decoy library can thus be controlled, which adapts the device to metabolome identification of target data sets with different structural similarities and improves the accuracy and stability of metabolome identification.
  • the precursor ion information of the spectra in the target database includes the retention time, mass-to-charge ratio and charge information of the precursor ions, etc., so that the decoy library has a more comprehensive precursor ion information.
  • a device for constructing a target-bait library includes a target database generating module, a decoy library building module, and a merging module.
  • the target database generating module is configured to select and form the target database; the decoy library building module is configured to build the decoy library; and the merging module is configured to merge the target database generated by the target database generating module with the decoy library constructed by the decoy library building module to obtain the target-bait library, where the decoy library building module is the above-mentioned device for constructing the decoy library. Therefore, the device for constructing a target-bait library also has the advantages mentioned for the above device for constructing a decoy library.
  • a device for metabolome FDR identification includes a unified format module, a target-bait library building module, a matching module, and an FDR identification module.
  • the unified format module is configured to convert the original mass spectrum data into unified spectrum data and read it to obtain the spectra to be identified;
  • the target-bait library building module is configured to build the target-bait library;
  • the matching module is configured to match the spectra to be identified from the unified format module with the target-bait library constructed by the target-bait library building module;
  • the FDR identification module is configured to sort the matching results of the matching module and perform FDR identification on them; wherein the target-bait library building module is the above device for constructing the target-bait library.
  • using this metabolome FDR identification device, FDR quality control can be performed on the identification results.
  • the FDR quality control method uses the target-bait library strategy; metabolite spectra can be identified rapidly and at high throughput; and the retention time restriction on the parent ion is removed in the spectrum identification step, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.
  • the unified spectrum data is a spectrum data file containing mass-to-charge ratio and peak intensity information, for example in MGF format; preferably, the unified format module stores the spectrum data file containing the mass-to-charge ratio and peak intensity information as a data linked list, and the spectrum information stored in the data linked list includes the number of the spectrum, the retention time of the parent ion, the mass-to-charge ratio, the charge information, the mass-to-charge ratios of the product ions, and the corresponding peak intensity information.
  • the data linked list includes, but is not limited to, a singly linked list, doubly linked list, binary tree, hash or mapping.
  • the spectrum data file in the MGF format is stored as Spectrum_info, which is a singly linked list.
  • the matching module includes a normalization processing sub-module, a spectrum sequence number index two-dimensional array generating sub-module, a weight array generating sub-module, a scoring sub-module, and an identification result array generating module, wherein:
  • the normalization processing sub-module is configured to compare each spectrum to be identified with each spectrum in the target-bait library, and to normalize the product ion signal intensity values in each spectrum to be identified;
  • the spectrum number index two-dimensional array generation sub-module is configured to select a spectrum to be identified, obtain its parent ion mass-to-charge ratio M, screen out the numbers of all spectra in the target-bait library whose parent ion mass-to-charge ratio is M and store them in a spectrum number index array, and traverse every spectrum to be identified to obtain a two-dimensional spectrum number index array;
  • the weight array generation sub-module is configured to store the product ion signals of all spectra in the target-bait library in the second signal warehouse, which serves as the population distribution of signal peak intensities; a spectrum to be identified is selected, all of its product ion signals are tested against the second signal warehouse as the population to obtain the weights of the spectrum signals, and every spectrum to be identified is traversed to obtain the weight array;
  • the scoring sub-module is configured to match and score the product ion signals of the spectrum to be identified against the product ion signals of the reference spectrum; and the identification result array generation module is configured to select a spectrum number index array, match the corresponding spectrum to be identified against each spectrum listed in that array, take the result with the highest matching score as the identification result of that spectrum, and traverse all elements of the two-dimensional spectrum number index array to obtain the identification result array of the spectra to be identified.
  • the normalization processing sub-module is configured to normalize the product ion signal intensity values to the interval (0, 1); preferably, the normalization processing includes dividing each product ion signal intensity value by the maximum product ion signal intensity in the spectrum to which it belongs.
  • the weight array generation sub-module is configured to test all product ion signals in the selected spectrum to be identified against the second signal warehouse as the population, to obtain a statistic for each product ion signal, and to take the reciprocal of the statistic as the weight of that product ion signal; the test is a Grubbs test, box plot test or normal distribution test.
  • the signal-to-noise ratio of the spectrum signals is introduced into the scoring algorithm for spectrum matching: the matching algorithm uses the Grubbs outlier test to calculate the weights of the spectrum signals, and these weights take part in the subsequent calculation of the spectrum matching score, which improves the noise resistance of spectrum matching.
  • the calculation formula is as follows:
  • is the correction coefficient, which is the reciprocal of the difference between the product ion signal of the spectrum to be identified and the product ion signal of the reference spectrum
  • I is the product ion signal vector of the spectrum
  • w is the weight of the product ion signals of the spectrum to be identified
  • T is the theoretical signal matching rate of this match
  • E is the experimental signal matching rate of this match.
  • the device for metabolome FDR identification (also called XY-Meta software) of the present invention can be developed using the Golang programming language.
  • the data structures and the code logic of the data index were carefully designed and repeatedly debugged to achieve multi-core parallel spectrum identification, which improves computer resource utilization and realizes high-performance metabolome identification.
  • the GNPS database is a public metabolite spectrum database, which contains the mass spectra of various natural metabolite standards and experimental samples collected on different instrument platforms.
  • the Passatutto tool organizes the mass spectra of a small number of metabolite standards in GNPS.
  • this standard library can be used to evaluate how well a target-bait library estimates FDR. This example uses Passatutto's standard database for metabolite identification.
  • the instrument and experimental parameters involved in using XY-Meta for metabolome identification mainly include: column type, charge mode, mass tolerance of precursor and product ions, and spectral signal preprocessing (parameters for hydrophilic columns):
  • XY-Meta generates target-bait library.
  • XY-Meta reads the target library Target_GNPS.mgf and generates the corresponding decoy library.
  • the process of generating the decoy library is shown in Figure 7.
  • from the signal spectrum index two-dimensional array R{r1, r2, r3...r4139}, select the first signal spectrum index array r1{p100, p103, p201...p3890} for element traversal: starting from the first spectrum in r1, store all the product ion signals of each spectrum in a signal warehouse S (Figure 8) (the signal warehouse S includes all ion signals in all spectra corresponding to the signal spectrum index two-dimensional array R).
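Building the index array R and a signal warehouse S can be sketched in Python as follows. The function names, the dictionary layout and the m/z tolerance `tol` (used in place of exact equality with M) are assumptions, and the warehouse is built here for one index array at a time.

```python
def build_signal_index(library, tol=0.01):
    """Return R: for each spectrum, the positions of all other spectra
    that contain a product ion at that spectrum's precursor m/z."""
    index = []
    for i, spec in enumerate(library):
        m = spec["precursor_mz"]
        hits = [
            j for j, other in enumerate(library)
            if j != i and any(abs(mz - m) <= tol for mz, _ in other["peaks"])
        ]
        index.append(hits)
    return index


def build_warehouse(library, spectrum_positions):
    """Return S: all product ion signals of the listed spectra."""
    return [peak for j in spectrum_positions for peak in library[j]["peaks"]]
```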
  • XY-Meta compares the query spectrum with the target-bait library.
  • Spectrum matching score: define the product ion signals of the spectrum to be identified and the product ion signals of the reference spectrum as two groups of signals; taking the reference spectrum as the basis, the signals of the spectrum to be identified are compared with the signals of the reference spectrum.
  • the vector dot product algorithm is used to calculate the sum of the products of the product ion signals of the spectrum to be identified and the product ion signals of the reference spectrum, which is 4.619.
  • Spectrum matching and result output: traverse the two-dimensional spectrum number index array H{h1, h2, h3...h2106}; starting from the first spectrum number index array h1, traverse all spectrum numbers in h1, match the spectrum q1 to be identified with all the reference spectra recorded in h1, take the result with the highest matching score as the identification result of q1, and then put the identification result of each spectrum into the array Score. All elements of H are traversed in turn to obtain the array Score{s1, s2, s3...s2106} of the identification results of the 2106 spectra to be identified, as shown in Figure 12.
  • XY-Meta performs FDR quality control and result output on the spectrum matching results.
  • the non-targeted metabolome identification process and the quality control process are implemented in one workflow, so that the FDR of the metabolome identification results can be controlled, mainly in:
  • XY-Meta generates the decoy library from the target database by randomly selecting signals from the database. After spectrum identification is completed, the quality control module (matching module, FDR identification module) evaluates the FDR of the identification results and performs quality control. The Passatutto standard spectral library was used to evaluate the performance of XY-Meta's target-bait library.
  • the decoy library generated by XY-Meta has the same characteristics as the target library, and can effectively evaluate the FDR of the identification result.
  • XY-Meta can adjust the similarity between the decoy library and the target database.
  • a decoy library with high similarity to the target database has strong decoy ability and is better suited to FDR quality control of metabolome identification results involving many isomers or highly similar metabolite structures; an example is the decoy library generated by the ion fragmentation tree method.
  • the decoy library with low similarity to the target database lacks the signal characteristics of the target database, and the decoy ability is insufficient, and the estimated FDR is lower than the actual one.
  • the target-bait library generated by using the default parameter settings of XY-Meta can meet most metabolome identification scenarios.
  • FIG. 14 shows the FDR quality control performance of the XY-Meta target-bait library; Simulation_level1 to Simulation_level11 correspond to similarities to the target of 0.66, 0.68, 0.70, 0.72, 0.74, 0.76, 0.78, 0.80, 0.82, 0.84 and 0.86, respectively.
  • Expect_FDR is the ideal curve.
  • XY-Meta can quickly generate a bait library based on the target library, without the help of Passatutto and metabolite spectrum prediction software CFM-ID and other tools to generate the bait library.
  • the decoy library generated by XY-Meta can be saved and reused locally, and the decoy library generated by other tools can be imported through the decoy library import option to realize the flexible construction of a metabolome identification database.
  • XY-Meta's decoy library loading process is shown in Figure 15.
  • a target library needs to be imported to generate the corresponding decoy library.
  • the generated bait library can be stored permanently, and the saved bait library can be used as an external bait library.
  • the external bait library can be imported for FDR quality control.
  • This example can identify large quantities of metabolite spectra at high speed, and effective FDR quality control improves the utilization of spectra, mainly in:
  • the present invention has good anti-noise ability for the identification of spectra.
  • the XY-Meta spectrum matching algorithm has a good anti-noise ability. Through effective FDR quality control, the spectrum with more noise signals can also be accurately identified.
  • the XY-Meta spectrum matching result is shown in Figure 16.
  • the FDR quality control strategy of the present invention is flexible to use and meets different scientific research and production needs, mainly in:
  • Database semi-search: the XY-Meta database search can skip the FDR control step after obtaining the identification results and output them directly. Users can then apply other tools for quality control of the identification results, which makes FDR control more flexible.
  • the semi-search metabolome identification process of XY-Meta can be shown in Figure 17.
  • the open search method widens the parent ion mass tolerance, so that an unknown adduct modification is absorbed by a larger mass error; this expands the matching range of the query spectrum during the search, so that the correct target spectrum can enter spectrum matching.
  • the side effect of open search is that it increases the amount of computation for identification and introduces more erroneous reference spectra, especially for metabolites with many isomers. Therefore, the open search strategy should use a stricter FDR threshold for quality control.
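The effect of widening the parent ion tolerance can be illustrated with a short Python sketch. The tolerance values and m/z values used below are hypothetical and chosen only to show how the candidate set grows.

```python
def precursor_candidates(query_mz, library_mzs, tol):
    """Return positions of library spectra whose precursor m/z lies
    within tol (in Da) of the query precursor m/z."""
    return [i for i, mz in enumerate(library_mzs) if abs(mz - query_mz) <= tol]


# Hypothetical precursor m/z values for three reference spectra.
LIBRARY_MZS = [180.063, 202.045, 218.019]

# Standard search: a narrow tolerance admits only the exact precursor.
standard_hits = precursor_candidates(202.045, LIBRARY_MZS, tol=0.01)

# Open search: a wide tolerance admits spectra that differ by an
# unknown adduct mass, at the cost of more candidates to score.
open_hits = precursor_candidates(202.045, LIBRARY_MZS, tol=25.0)
```

Because the wider window admits more incorrect references, the open search result would be filtered with a stricter FDR threshold, as the text recommends.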
  • XY-Meta's open search metabolome identification process can be shown in Figure 18.
  • Iterative database search When the target database is too large and the real target spectra are few, using the target-bait library strategy to perform FDR quality control on the identification results will often lead to the estimated FDR is too large, thereby reducing the effective spectrum The number of graphs, this problem often occurs when using the HMDB metabolite database full library for metabolome identification and macrometabolome identification.
  • the iterative search strategy of the database can effectively improve the accuracy and sensitivity of identification.
  • the iterative database search requires at least two database searches, and the initial database search is not controlled by FDR. According to the identification results, all matched theoretical spectra are assembled into a new spectral library, thereby reducing the volume of the target library. Import the newly generated metabolite spectrum library into the next search. After the last iteration, the identification results are controlled by FDR, and the metabolome identification results are finally output.
  • XY-Meta's iterative search metabolome identification process can be shown in Figure 19.
  • The present invention has at least the following beneficial effects:
  • Selecting signals at random from the database makes it possible to generate a decoy library effectively from the target database, and the approach is broadly applicable to FDR estimation and quality control.
  • The decoy library constructed by the method or apparatus of the present invention is highly similar to the target library and therefore has stronger decoying power; it is suitable for FDR quality control of metabolome identification results involving many isomers or metabolites of high structural similarity.
  • The similarity between the generated decoy library and the target library can be adjusted as needed to meet the FDR quality control requirements of different situations (high, medium, or low similarity).
  • The metabolome FDR identification method that uses the decoy library or target-decoy library obtained by the technical solution of the present invention has the following advantages: 1) FDR quality control can be applied to the identification results, using the target-decoy strategy; 2) metabolite spectra can be identified quickly and at high throughput; 3) during spectrum identification, the retention-time restriction on the parent ion is lifted, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.

Abstract

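A method and apparatus for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification. The method for constructing a decoy library comprises: S1, comparing the metabolite parent ion mass-to-charge ratio M of each spectrum in a target database one by one with all other spectra in the target database, storing each spectrum that contains a fragment ion whose mass-to-charge ratio equals M, and/or its serial number, in a signal-spectrum index array, and generating a two-dimensional signal-spectrum index array; S2, traversing all elements of the two-dimensional signal-spectrum index array to obtain n arrays D, the n arrays D forming a decoy library signal array; S3, copying the parent ion information of the target database spectrum corresponding to each subset of the decoy library signal array to the decoy library signal array, forming the decoy library. The metabolome identification method allows FDR quality control of the identification results, identifies spectra quickly and at high throughput, and improves spectrum utilization and the coverage of metabolite identification.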

Description

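Method and apparatus for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification
Technical Field
The present invention relates to the technical field of metabolomics, and in particular to a method and apparatus for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification.
Background
Metabolomics is a discipline that emerged after genomics and proteomics. It is an important part of systems biology and mainly examines the dynamic changes of all small-molecule metabolites, and of their abundances, before and after a biological system is stimulated or perturbed. By qualitatively and quantitatively analyzing all small-molecule metabolites in an organism as a whole, the relationship between metabolites and physiological or pathological changes can be explored. Studies have shown that the metabolome has important applications in early disease diagnosis, biomarker discovery, drug screening, toxicity evaluation, sports medicine, nutrition, and other fields.
With the rapid development of mass spectrometers, the accuracy, coverage, and speed of metabolite detection have improved considerably, and mass-spectrometry-based metabolomics is applied ever more widely; samples such as urine, plasma, saliva, cells, and tissue can all be analyzed for metabolites. As the volume of metabolic data grows, the demands on downstream data analysis grow as well, calling for higher-performance computing platforms and analysis tools. Metabolite identification is an important step in metabolomics analysis: by interpreting the acquired mass spectra and identifying the metabolite species present in a sample, the physiological and disease phenotypes of an organism can be further explained. Metabolites are fragmented by collision-induced dissociation in the mass spectrometer to produce MS2 spectra. In theory, different metabolites have different molecular structures, and different structures give distinctive MS2 signals; on this principle, the spectra of different metabolites can be recognized. The main difficulties of metabolome identification at present are: 1. the FDR of large-scale metabolome identification cannot be assessed, and there is no effective means of quality control; 2. spectrum utilization and identification coverage in large-scale metabolite identification are low; 3. tools for large-scale metabolite identification perform poorly and are hard to operate, and cannot meet many commercial and research needs. A high-performance, large-scale metabolome identification method (tool) capable of FDR quality control is therefore needed to meet research and commercial requirements.
Summary of the Invention
The present invention aims to provide a method and apparatus for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification, in order to process large-scale metabolomics data.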
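To achieve the above object, according to one aspect of the present invention, a method for constructing a decoy library is provided. The method comprises the following steps: S1, the metabolite parent ion mass-to-charge ratio M of each spectrum in the target database is compared one by one with all other spectra in the target database; each spectrum containing a fragment ion whose mass-to-charge ratio equals M, and/or its serial number, is stored in a signal-spectrum index array; after all spectra in the target database have been traversed, a two-dimensional signal-spectrum index array is generated; S2, one signal-spectrum index array in the two-dimensional signal-spectrum index array is selected, and the fragment ion signals of every spectrum in that index array are stored in a first signal warehouse; a portion of the fragment ion signals is then selected at random from the corresponding spectrum in the target database and copied into an array D, and a number of fragment ion signals is selected at random from the first signal warehouse to fill array D, so that the number of fragment ion signals in array D equals that of the corresponding spectrum in the target database; a portion of the signals in array D is then selected at random and their mass-to-charge ratios are changed at random to avoid overlap with the mass-to-charge ratios of the corresponding spectrum in the target database; all elements of the two-dimensional signal-spectrum index array are traversed to obtain n arrays D, which form the decoy library signal array; here n is a natural number and "corresponding" means having the same serial number; and S3, the parent ion information of the target database spectrum corresponding to each subset of the decoy library signal array is copied to the decoy library signal array, forming the decoy library.
Further, in S2, the number of fragment ion signals selected at random from the corresponding spectrum in the target database and copied into array D is a proportion h of all fragment ion signals of that spectrum, with h between 0.6 and 0.9; preferably, h is 0.775.
Further, in S2, changing the mass-to-charge ratio at random comprises adding or subtracting a mass-to-charge value of random size, the perturbation being smaller than the parent ion mass-to-charge ratio; preferably, this comprises uniformly adding a random value, uniformly subtracting a random value, or randomly adding or subtracting a random value; preferably, the perturbation is ±1 Da; preferably, the selected signals make up a proportion k of all signals in array D, with k < 1, more preferably k = 0.5.
Further, in S3, the parent ion information of a spectrum in the target database comprises the retention time, mass-to-charge ratio, and charge of the parent ion.
According to another aspect of the present invention, a method for constructing a target-decoy library is provided. The method comprises: selecting spectra to form a target database; constructing a decoy library; and merging the target database with the decoy library to obtain the target-decoy library, wherein the decoy library is constructed by any of the above methods for constructing a decoy library.
According to a further aspect of the present invention, a method for metabolome FDR identification is provided. The method comprises: converting raw mass spectrometry data into uniform spectrum data and reading it to obtain the query spectra (the spectra to be identified); constructing a target-decoy library; matching the query spectra against the target-decoy library; and sorting the matching results and performing FDR identification on them; wherein the target-decoy library is constructed by the above method for constructing a target-decoy library.
Further, the uniform spectrum data is a spectrum data file containing m/z and peak intensity information; preferably, the file is further stored as a linked data list whose spectrum information comprises the spectrum number, the parent ion retention time, mass-to-charge ratio, and charge, and the mass-to-charge ratios and corresponding peak intensities of the fragment ions.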
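Further, matching the query spectra against the target-decoy library comprises: comparing each query spectrum with each spectrum in the target-decoy library, and normalizing the fragment ion signal intensities of each query spectrum; selecting one query spectrum and obtaining its parent ion mass-to-charge ratio M, screening out the serial numbers of all spectra in the target-decoy library whose parent ion mass-to-charge ratio is M and storing them in a spectrum-number index array, and traversing every query spectrum to obtain a two-dimensional spectrum-number index array; storing the fragment ion signals of all spectra in the target-decoy library in a second signal warehouse, taking the second signal warehouse as the population distribution of signal peak intensities, selecting one query spectrum, testing all of its fragment ion signals against that population to obtain the signal weights, and traversing every query spectrum to obtain a weight array; scoring the match of the query spectrum's fragment ion signals on the basis of the fragment ion signals of the reference spectrum; and selecting one spectrum-number index array, matching the query spectrum against the spectra listed in it, taking the highest-scoring match as the identification result of the query spectrum, and traversing all elements of the two-dimensional spectrum-number index array to obtain the identification result array of the query spectra.
Further, the normalization comprises normalizing the fragment ion signal intensities into the interval (0,1); preferably, each fragment ion signal intensity is divided by the largest fragment ion signal intensity in its spectrum.
Further, the weights are obtained as follows: all fragment ion signals of the selected query spectrum are tested against the second signal warehouse as the population to obtain a statistic for each signal, and the reciprocal of the statistic is taken as the weight of that fragment ion signal; preferably, the test is the Grubbs test, the box-plot test, or a normal distribution test.
Further, scoring the match of the query spectrum's fragment ion signals on the basis of the fragment ion signals of the reference spectrum comprises: defining the fragment ion signals of the query spectrum and of the reference spectrum as two arrays
Figure PCTCN2020099769-appb-000001
Figure PCTCN2020099769-appb-000002
With the reference spectrum as the basis, the signals of the query spectrum are compared with those of the reference spectrum. Let total_e be the total number of signals in the query spectrum and e the number of them that match signals in the reference spectrum; the experimental signal match rate of this match is then E = e/total_e. Let total_t be the total number of signals in the reference spectrum, of which e match signals in the query spectrum; the theoretical signal match rate of this match is then T = e/total_t. After signal matching, a vector dot-product algorithm is used to compute the dot-product sum of the fragment ion signals of the query spectrum and of the reference spectrum, by the following formula:
Figure PCTCN2020099769-appb-000003
where μ is a correction coefficient, equal to the reciprocal of the difference between the fragment ion signals of the query spectrum and those of the reference spectrum,
Figure PCTCN2020099769-appb-000004
are the fragment ion signal vectors of the spectra, w is the weight of the fragment ion signals of the query spectrum, T is the theoretical signal match rate of this match, and E is the experimental signal match rate of this match.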
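Further, sorting the matching results and performing FDR identification on them comprises: sorting the identification result array of the query spectra by match score from high to low; letting target_score be the target database count and decoy_score the decoy library count, an identification result that is a target spectrum being counted as target_score+1 and one that is a decoy spectrum as decoy_score+1; the FDR of the identification results is FDR = decoy_score/(target_score+decoy_score); with the FDR threshold set to x, when the traversal reaches an identification result sn at which FDR ≥ x, the valid identification results of the batch are M{s1,s2,s3......s(n-1)}; preferably, x is at most 0.2, more preferably at most 0.05, and still more preferably at most 0.01.
According to yet another aspect of the present invention, a decoy library is provided. The decoy library is constructed by any of the above methods for constructing a decoy library.
According to a further aspect of the present invention, a target-decoy library is provided. The target-decoy library is constructed by any of the above methods for constructing a target-decoy library.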
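According to yet another aspect of the present invention, an apparatus for constructing a decoy library is provided. The apparatus comprises: a two-dimensional signal-spectrum index array generation module, configured to compare the metabolite parent ion mass-to-charge ratio M of each spectrum in the target database one by one with all other spectra in the target database, store each spectrum containing a fragment ion whose mass-to-charge ratio equals M, and/or its serial number, in a signal-spectrum index array, and, after traversing all spectra in the target database, generate a two-dimensional signal-spectrum index array; a decoy library signal array generation module, configured to select one signal-spectrum index array in the two-dimensional signal-spectrum index array, store the fragment ion signals of every spectrum in that index array in a first signal warehouse, then select at random a portion of the fragment ion signals of the corresponding spectrum in the target database and copy them into an array D, select at random a number of fragment ion signals from the first signal warehouse to fill array D so that the number of fragment ion signals in array D equals that of the corresponding spectrum in the target database, then select at random a portion of the signals in array D and change their mass-to-charge ratios at random to avoid overlap with the mass-to-charge ratios of the corresponding spectrum in the target database, and traverse all elements of the two-dimensional signal-spectrum index array to obtain n arrays D, which form the decoy library signal array, where n is a natural number and "corresponding" means having the same serial number; and a decoy library generation module, configured to copy the parent ion information of the target database spectrum corresponding to each subset of the decoy library signal array to the decoy library signal array, forming the decoy library.
Further, in the decoy library signal array generation module, the number of fragment ion signals selected at random from the corresponding spectrum in the target database and copied into array D is a proportion h of all fragment ion signals of that spectrum, with h between 0.6 and 0.9; preferably, h is 0.775.
Further, in the decoy library signal array generation module, changing the mass-to-charge ratio at random comprises adding or subtracting a mass-to-charge value of random size, the perturbation being smaller than the parent ion mass-to-charge ratio; preferably, this comprises uniformly adding a random value, uniformly subtracting a random value, or randomly adding or subtracting a random value; preferably, the perturbation is ±1 Da; preferably, the selected signals make up a proportion k of all signals in array D, with k < 1, more preferably k = 0.5.
Further, in the decoy library generation module, the parent ion information of a spectrum in the target database comprises the retention time, mass-to-charge ratio, and charge of the parent ion.
According to a further aspect of the present invention, an apparatus for constructing a target-decoy library is provided. The apparatus comprises: a target database generation module, configured to select spectra to form a target database; a decoy library construction module, configured to construct a decoy library; and a merging module, configured to merge the target database generated by the target database generation module with the decoy library constructed by the decoy library construction module to obtain the target-decoy library, wherein the decoy library construction module is any of the above apparatuses for constructing a decoy library.
According to yet another aspect of the present invention, an apparatus for metabolome FDR identification is provided. The apparatus comprises: a format unification module, configured to convert raw mass spectrometry data into uniform spectrum data and read it to obtain the query spectra; a target-decoy library construction module, configured to construct a target-decoy library; a matching module, configured to match the query spectra obtained by the format unification module against the target-decoy library constructed by the target-decoy library construction module; and an FDR identification module, configured to sort the matching results of the matching module and perform FDR identification on them; wherein the target-decoy library construction module is the above apparatus for constructing a target-decoy library.
Further, in the format unification module, the uniform spectrum data is a spectrum data file containing m/z and peak intensity information; preferably, the format unification module stores the file as a linked data list whose spectrum information comprises the spectrum number, the parent ion retention time, mass-to-charge ratio, and charge, and the mass-to-charge ratios and corresponding peak intensities of the fragment ions.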
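Further, the matching module comprises: a normalization submodule, configured to compare each query spectrum with each spectrum in the target-decoy library and normalize the fragment ion signal intensities of each query spectrum; a two-dimensional spectrum-number index array generation submodule, configured to select one query spectrum and obtain its parent ion mass-to-charge ratio M, screen out the serial numbers of all spectra in the target-decoy library whose parent ion mass-to-charge ratio is M and store them in a spectrum-number index array, and traverse every query spectrum to obtain a two-dimensional spectrum-number index array; a weight array generation submodule, configured to store the fragment ion signals of all spectra in the target-decoy library in a second signal warehouse, take the second signal warehouse as the population distribution of signal peak intensities, select one query spectrum, test all of its fragment ion signals against that population to obtain the signal weights, and traverse every query spectrum to obtain a weight array; a scoring submodule, configured to score the match of the query spectrum's fragment ion signals on the basis of the fragment ion signals of the reference spectrum; and an identification result array generation module, configured to select one spectrum-number index array, match the query spectrum against the spectra listed in it, take the highest-scoring match as the identification result of the query spectrum, and traverse all elements of the two-dimensional spectrum-number index array to obtain the identification result array of the query spectra.
Further, the normalization submodule is configured to normalize the fragment ion signal intensities into the interval (0,1); preferably, each fragment ion signal intensity is divided by the largest fragment ion signal intensity in its spectrum.
Further, the weight array generation submodule is configured to test all fragment ion signals of the selected query spectrum against the second signal warehouse as the population to obtain a statistic for each signal, and to take the reciprocal of the statistic as the weight of that fragment ion signal; preferably, the test is the Grubbs test, the box-plot test, or a normal distribution test.
Further, the scoring submodule is configured to define the fragment ion signals of the query spectrum and of the reference spectrum as two arrays
Figure PCTCN2020099769-appb-000005
Figure PCTCN2020099769-appb-000006
With the reference spectrum as the basis, the signals of the query spectrum are compared with those of the reference spectrum. Let total_e be the total number of signals in the query spectrum and e the number of them that match signals in the reference spectrum; the experimental signal match rate of this match is then E = e/total_e. Let total_t be the total number of signals in the reference spectrum, of which e match signals in the query spectrum; the theoretical signal match rate of this match is then T = e/total_t. After signal matching, a vector dot-product algorithm is used to compute the dot-product sum of the fragment ion signals of the query spectrum and of the reference spectrum, by the following formula:
Figure PCTCN2020099769-appb-000007
where μ is a correction coefficient, equal to the reciprocal of the difference between the fragment ion signals of the query spectrum and those of the reference spectrum,
Figure PCTCN2020099769-appb-000008
are the fragment ion signal vectors of the spectra, w is the weight of the fragment ion signals of the query spectrum, T is the theoretical signal match rate of this match, and E is the experimental signal match rate of this match.
Further, the FDR identification module is configured to execute the following instructions: sort the identification result array of the query spectra by match score from high to low; let target_score be the target database count and decoy_score the decoy library count, an identification result that is a target spectrum being counted as target_score+1 and one that is a decoy spectrum as decoy_score+1; the FDR of the identification results is FDR = decoy_score/(target_score+decoy_score); with the FDR threshold set to x, when the traversal reaches an identification result sn at which FDR ≥ x, the valid identification results of the batch are M{s1,s2,s3......s(n-1)}; preferably, x is at most 0.2, more preferably at most 0.05, and still more preferably at most 0.01.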
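According to a further aspect of the present invention, a storage medium is provided. A computer program is stored in the storage medium, the computer program being configured to execute, when run, the above method for constructing a decoy library, method for constructing a target-decoy library, and/or method for metabolome FDR identification.
According to yet another aspect of the present invention, an electronic device is provided. The electronic device comprises a memory and a processor; a computer program is stored in the memory, and the processor is configured to run the computer program to execute the above method for constructing a decoy library, method for constructing a target-decoy library, and/or method for metabolome FDR identification.
With the technical solution of the present invention, selecting signals at random from the database makes it possible to generate a decoy library effectively from the target database, and the approach is broadly applicable to FDR estimation and quality control. The decoy library constructed by the method or apparatus of the present invention is highly similar to the target library and therefore has stronger decoying power; it is suitable for FDR quality control of metabolome identification results involving many isomers or metabolites of high structural similarity. In addition, the similarity between the generated decoy library and the target library can be adjusted as needed to meet the FDR quality control requirements of different situations (high, medium, or low similarity). Further, the metabolome FDR identification method that uses the decoy library or target-decoy library obtained by the technical solution of the present invention has the following advantages: 1) FDR quality control can be applied to the identification results, using the target-decoy strategy; 2) metabolite spectra can be identified quickly and at high throughput; 3) during spectrum identification, the retention-time restriction on the parent ion is lifted, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.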
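Brief Description of the Drawings
The accompanying drawings, which form part of this application, provide a further understanding of the present invention. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Figure 1 shows the overall analysis workflow of the metabolome FDR identification method in one embodiment of the present invention;
Figure 2 shows an exemplary MGF spectrum file data format in one embodiment of the present invention;
Figure 3 shows the main workflow of target-decoy library generation in one embodiment of the present invention;
Figure 4 shows the main workflow of metabolite spectrum matching in one embodiment of the present invention;
Figure 5 shows an example of the Passatutto_query.mgf file obtained in Example 1;
Figure 6 shows an example of the Target_GNPS.mgf file obtained in Example 1;
Figure 7 shows the decoy library generation workflow in Example 1;
Figure 8 shows an example of the signal warehouse S in Example 1;
Figure 9a shows the target database spectrum p1 in Example 1, Figure 9b shows the array D1 in Example 1, and Figure 9c shows the spectrum obtained in Example 1 after a number of fragment ion signals were selected at random from the signal warehouse S and filled into array D1;
Figure 10 shows an example of the target-decoy library file Target_Decoy_GNPS.mgf generated in Example 1;
Figure 11 shows an example of the comparison in Example 1 between the first query spectrum q1 and the first spectrum of the reference database, i.e. the target-decoy library;
Figure 12 shows the score ranking of the comparisons between the query spectrum and the reference library spectra in Example 1;
Figures 13-1, 13-2, 13-3, 13-4, 13-5, 13-6, 13-7, 13-8, 13-9, 13-10, and 13-11 show the FDR quality control and output list of the Passatutto_query.mgf identification results in Example 1;
Figure 14 shows the FDR quality control performance of the XY-Meta target-decoy library in Example 1;
Figure 15 shows a decoy library loading workflow of XY-Meta;
Figure 16 shows the XY-Meta spectrum matching results of Example 1;
Figure 17 shows a semi-search metabolome identification workflow of XY-Meta;
Figure 18 shows an open-search metabolome identification workflow of XY-Meta; and
Figure 19 shows an iterative-search metabolome identification workflow of XY-Meta.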
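Detailed Description
It should be noted that, where no conflict arises, the embodiments of this application and the features of the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
The abbreviations and terms used in the present invention are explained as follows:
Metabolome: the dynamic, complete set of metabolites in an organism; the term usually covers only small-molecule metabolites with a relative molecular mass below 1000.
Mass-to-charge ratio (mz): the ratio of the mass of a charged ion to its charge. It is a physical property of the ion and has a fixed value; because instrument resolution is limited, the measured mz fluctuates.
Retention time (RT): the time from sample injection until the concentration of a separated component reaches its maximum after the column, that is, the time from injection to the apex of the component's chromatographic peak. For a given separation column, the retention time of a component (molecular ion) depends on its physicochemical properties.
Molecular ion peak (Peaks): a molecular ion peak in a sample, expressed as [mzmin,mzmax,rtmin,rtmax].
Collision-induced dissociation (CID): the process of transferring energy to an ion through collisions with neutral molecules; the energy transferred is sufficient to cause bond cleavage and rearrangement.
False discovery rate (FDR): a method used to control multiple comparisons in multiple hypothesis testing; here it describes the proportion of false positives that may occur in one large-scale identification.
Target library (Target): the target reference spectral library used for MS2 spectrum comparison.
Decoy library (Decoy): a simulated reference spectral library that in theory has the same characteristics as the target library; the spectra in the decoy library do not occur in the target library.
Target-decoy strategy (Target-Decoy): an FDR quality control strategy in which the decoy library simulates random spectrum matches and the false discovery rate of spectrum matching is estimated from the resulting counts, by the formula FDR=Decoy/(Target+Decoy).
Signal features: compound ions produce specific fragment ions through MS2 fragmentation such as collision-induced dissociation; the mass spectrometer records the signals of these fragment ions, and the resulting signal data are called the signal features of the compound.
Signal intensity (Intensity): a measure of the abundance of an element or compound in mass spectrometric detection.
MS2 spectrum: the data matrix of mass-to-charge ratios mz and signal intensities of the fragment ions obtained when a molecular ion (parent ion) undergoes collision-induced dissociation; referred to as MS2.
Parent ion / precursor ion: the unfragmented species (metabolite), observed at the MS1 level.
Fragment ion: compound ions can produce characteristic fragments in the mass spectrometer through fragmentation such as induced collision; these are called fragment ions.
Experimental spectrum: an MS2 spectrum acquired from an experimental sample in the experimental workflow.
Reference spectrum: the standard MS2 spectrum of a compound; comparison with an experimental spectrum makes it possible to determine the compound to which the experimental spectrum corresponds.
Adduct: after ionization, a metabolite can combine with ions such as H2O, H+, and NH4+; these ions are called adducts.
Ion adduct form: the new compound form produced when a metabolite combines with ions such as H2O, H+, NH4+, Na+, and K+ during ionization.
MSconvert: software that converts raw mass spectrometry data into other file formats.
Spectrum_info: the data structure used to store the signals and attributes of a mass spectrum.
Signal warehouse: a numeric matrix composed of all fragment ion signals of one or more MS2 spectra.
Signal spectrum: an MS2 spectrum drawn from the target library, all of whose signals are added to the signal warehouse.
Signal-spectrum index array: stores the index numbers of the target library spectra selected as signal spectra.
Spectrum-number index array: an array storing the serial numbers of candidate spectra in the spectral database.
Passatutto: a tool for evaluating the performance of metabolite decoy libraries; it ships with query spectra and a standard reference spectral database, and can perform FDR quality control of identification results.
Grubbs test: a hypothesis test commonly used to detect a single outlier in a univariate data set that follows a normal distribution; if there is an outlier, it must be the maximum or minimum of the data set.
Experimental signal match rate: the number of signals in the query spectrum that match signals of the reference spectrum, as a proportion of all signals in the query spectrum.
Theoretical signal match rate: the number of signals in the reference spectrum that match signals of the query spectrum, as a proportion of all signals in the reference spectrum.
Decoying power: a measure of decoy library performance; when query spectra are matched against the target-decoy library, the more query spectra match spectra in the decoy library, the stronger the decoy library's power to deceive the model algorithm.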
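In recent years, mass spectrometric detection has developed rapidly, with large gains in speed and resolution. Untargeted metabolomics is strong at recognizing unknown metabolites, is high-throughput, and is low in cost; it is widely used for metabolic detection and scientific research on all kinds of samples, and the total volume of samples and data is larger than ever. On the other hand, untargeted metabolome identification suffers from insufficient stability and poor reproducibility, which makes identification strategy a key difficulty of untargeted metabolomics. To improve the accuracy of large-scale metabolite identification and the stability of metabolome quantification, untargeted metabolome analysis tools have become a research focus, and many have appeared over the past ten years. Their strategies for quantitative metabolome analysis are mature, but large-scale metabolite identification remains the bottleneck of untargeted metabolome research. The main problem is that the FDR of the identification results cannot be assessed, which greatly limits the application of untargeted metabolomics. If the FDR of metabolome identification could be assessed reasonably, the accuracy and stability of identification would improve, greatly advancing the development and application of untargeted metabolomics.
So that untargeted metabolome identification can be applied quickly and stably in research or production, a typical embodiment of the present invention provides a method for constructing a decoy library. The method comprises the following steps: S1, the metabolite parent ion mass-to-charge ratio M of each spectrum in the target database is compared one by one with all other spectra in the target database; each spectrum containing a fragment ion whose mass-to-charge ratio equals M, and/or its serial number, is stored in a signal-spectrum index array; after all spectra in the target database have been traversed, a two-dimensional signal-spectrum index array is generated; S2, one signal-spectrum index array in the two-dimensional signal-spectrum index array is selected, and the fragment ion signals of every spectrum in that index array are stored in a first signal warehouse; a portion of the fragment ion signals is then selected at random from the corresponding spectrum in the target database and copied into an array D, and a number of fragment ion signals is selected at random from the first signal warehouse to fill array D, so that the number of fragment ion signals in array D equals that of the corresponding spectrum in the target database; a portion of the signals in array D is then selected at random and the mass-to-charge ratios of those selected signals are changed at random to avoid overlap with the mass-to-charge ratios of the corresponding spectrum in the target database; all elements of the two-dimensional signal-spectrum index array are traversed to obtain n arrays D, which form the decoy library signal array; here n is a natural number and "corresponding" means having the same serial number; and S3, the parent ion information of the target database spectrum corresponding to each subset of the decoy library signal array is copied to the decoy library signal array, forming the decoy library.
With the technical solution of the present invention, a decoy library is generated from the target database by selecting signals at random from the database; after spectrum identification, the FDR of the identification results can be assessed and controlled by the quality control module. The performance of the decoy library was evaluated with the Passatutto standard spectral library, and the decoy library constructed by the method of the present invention was found to have the same characteristics as the target library and to assess the FDR of identification results effectively.
In S2, the number of fragment ion signals selected at random from the corresponding spectrum in the target database and copied into array D is a proportion h of all fragment ion signals of that spectrum, with h < 1. The larger h is, the more similar the resulting decoy library is to the target database. Decoy libraries obtained with h between 0.6 and 0.9 give better FDR quality control, and h = 0.775 gives the best result.
In S2, changing the mass-to-charge ratio at random comprises adding or subtracting a mass-to-charge value of random size. The purpose is to add a perturbation so as to avoid overlap with the original library spectrum P; the perturbation should be smaller than the parent ion mass-to-charge ratio. Typically, this comprises uniformly adding a random value, uniformly subtracting a random value, or randomly adding or subtracting a random value; preferably, the perturbation is ±1 Da; more preferably, the selected signals make up a proportion k of all signals in array D, with k < 1. The larger k is, the more the spectrum signals are perturbed. The perturbation prevents the decoy spectrum from coinciding completely with the original spectrum: a larger k gives lower similarity and a smaller k gives higher similarity, so the similarity between decoy and original spectra can be tuned through k. In a preferred embodiment, k = 0.5, which gives a better decoy library. The present invention generates the decoy library from the target database by perturbing spectral database signals, and then builds a target-decoy library for FDR quality control of the identification results. This makes the similarity between target and decoy libraries controllable, so that the method adapts to target data sets of differing structural similarity and improves the accuracy and stability of metabolome identification.
In a typical embodiment of the present invention, in S3, the parent ion information of a spectrum in the target database comprises the retention time, mass-to-charge ratio, charge, and similar information of the parent ion, so that the decoy library carries fairly complete parent ion information.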
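According to a typical embodiment of the present invention, a method for constructing a target-decoy library is provided. The method comprises: selecting spectra to form a target database; constructing a decoy library; and merging the target database with the decoy library to obtain the target-decoy library, wherein the decoy library is constructed by the above method for constructing a decoy library. This method therefore shares the advantages described for the method for constructing a decoy library.
According to a typical embodiment of the present invention, a method for metabolome FDR identification is provided. The method comprises: converting raw mass spectrometry data into uniform spectrum data and reading it to obtain the query spectra; constructing a target-decoy library; matching the query spectra against the target-decoy library; and sorting the matching results and performing FDR (false discovery rate) identification on them; wherein the target-decoy library is constructed by the above method for constructing a target-decoy library.
This metabolome FDR identification method allows FDR quality control of the identification results using the target-decoy strategy; identifies metabolite spectra quickly and at high throughput; and lifts the retention-time restriction on the parent ion during spectrum identification, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.
Typically, the unified spectrum data is a spectrum data file containing m/z and peak intensity information; such files include but are not limited to files in MGF, mzXML, mzML, or tda format. In a preferred embodiment, the uniform spectrum data is a spectrum data file in MGF format. Preferably, the file is further stored as a linked data list whose spectrum information comprises the spectrum number, the parent ion retention time, mass-to-charge ratio, and charge, and the mass-to-charge ratios and corresponding peak intensities of the fragment ions. The data list includes but is not limited to a singly linked list, doubly linked list, binary tree, hash, or map. In a preferred embodiment of the present invention, the MGF spectrum data file is stored as Spectrum info, which is a kind of singly linked list.
According to a typical embodiment of the present invention, matching the query spectra against the target-decoy library comprises: comparing each query spectrum with each spectrum in the target-decoy library, and normalizing the fragment ion signal intensities of each query spectrum; selecting one query spectrum and obtaining its parent ion mass-to-charge ratio M, screening out the serial numbers of all spectra in the target-decoy library whose parent ion mass-to-charge ratio is M and storing them in a spectrum-number index array, and traversing every query spectrum to obtain a two-dimensional spectrum-number index array; storing the fragment ion signals of all spectra in the target-decoy library in a second signal warehouse, taking the second signal warehouse as the population distribution of signal peak intensities, selecting one query spectrum, testing all of its fragment ion signals against that population to obtain the signal weights, and traversing every query spectrum to obtain a weight array; scoring the match of the query spectrum's fragment ion signals on the basis of the fragment ion signals of the reference spectrum; and selecting one spectrum-number index array, matching (one) query spectrum against the spectra listed in it, taking the highest-scoring match as the identification result of the query spectrum, and traversing all elements of the two-dimensional spectrum-number index array to obtain the identification result array of the query spectra.
Matching the query spectra against the target-decoy library compares each query spectrum for similarity with the library. The degree of similarity is expressed by the match score between the query spectrum and a reference spectrum in the target-decoy library, which makes it possible to select the best identification result for each query spectrum.
In a typical embodiment of the present invention, the normalization comprises normalizing the fragment ion signal intensities into the interval (0,1); preferably, each fragment ion signal intensity is divided by the largest fragment ion signal intensity in its spectrum. Normalization brings the ion signal values of all query spectra and reference spectra into one numeric range; only then can the query spectra and all the reference spectra be compared pairwise.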
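The normalization step can be illustrated with a minimal Python sketch. The representation of a spectrum as a list of `(mz, intensity)` tuples and the function name `normalize_intensities` are assumptions made for illustration, not part of the source. Note that dividing by the spectrum maximum maps the base peak to exactly 1.0, so the resulting range is (0, 1] rather than a strictly open interval.

```python
def normalize_intensities(peaks):
    """Scale fragment intensities by the largest intensity in the same spectrum.

    `peaks` is a list of (mz, intensity) tuples; this layout is an
    illustrative assumption, not a structure defined by the source.
    """
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        raise ValueError("spectrum has no positive intensity to normalize by")
    # m/z values are left untouched; only intensities are rescaled.
    return [(mz, intensity / max_intensity) for mz, intensity in peaks]
```

Because each spectrum is scaled by its own maximum, query and reference spectra acquired at different absolute signal levels become directly comparable.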
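Preferably, the weights are obtained as follows: all fragment ion signals of the selected query spectrum are tested against the second signal warehouse as the population to obtain a statistic for each signal, and the reciprocal of the statistic is taken as the weight of that fragment ion signal; the test is the Grubbs test, the box-plot test, a normal distribution test, or the like. In matching the query spectra against the target-decoy library, the signal-to-noise ratio of the spectrum signals is brought into the scoring algorithm, and the matching algorithm uses the Grubbs outlier test to compute the signal weights that enter the subsequent match score, which improves the noise robustness of spectrum matching.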
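As a rough sketch of this weighting idea, the following Python code computes a Grubbs-style statistic G = |x − mean| / s for each query intensity against the pooled population and takes its reciprocal as the weight. The source does not give the exact statistic or how a statistic of zero is handled, so the formula, the `eps` floor, and the function name are assumptions.

```python
import statistics


def grubbs_weights(query_intensities, population, eps=1e-9):
    """Weight each query intensity by the reciprocal of a Grubbs-style statistic.

    `population` stands for the intensities pooled in the signal warehouse.
    The statistic G = |x - mean| / s and the `eps` floor (which avoids
    division by zero when x equals the mean) are illustrative assumptions.
    """
    mean = statistics.fmean(population)
    sd = statistics.stdev(population)  # sample standard deviation, needs >= 2 values
    if sd == 0:
        raise ValueError("population has zero spread; statistic is undefined")
    weights = []
    for x in query_intensities:
        g = abs(x - mean) / sd
        weights.append(1.0 / max(g, eps))
    return weights
```

Under this sketch, a signal whose intensity lies far from the bulk of the population receives a small weight, while a signal close to the population mean receives a large one.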
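In a typical embodiment of the present invention, scoring the match of the query spectrum's fragment ion signals on the basis of the fragment ion signals of the reference spectrum comprises: defining the fragment ion signals of the query spectrum and of the reference spectrum as two arrays
Figure PCTCN2020099769-appb-000009
Figure PCTCN2020099769-appb-000010
With the reference spectrum as the basis, the signals of the query spectrum are compared with those of the reference spectrum. Let total_e be the total number of signals in the query spectrum and e the number of them that match signals in the reference spectrum; the experimental signal match rate of this match is then E = e/total_e. Let total_t be the total number of signals in the reference spectrum, of which e match signals in the query spectrum; the theoretical signal match rate of this match is then T = e/total_t. After signal matching, a vector dot-product algorithm is used to compute the dot-product sum of the fragment ion signals of the query spectrum and of the reference spectrum, by the following formula:
Figure PCTCN2020099769-appb-000011
where μ is a correction coefficient, equal to the reciprocal of the difference between the fragment ion signals of the query spectrum and those of the reference spectrum,
Figure PCTCN2020099769-appb-000012
are the fragment ion signal vectors of the spectra, w is the weight of the fragment ion signals of the query spectrum, T is the theoretical signal match rate of this match, and E is the experimental signal match rate of this match.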
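The two match rates E and T can be sketched in Python as follows. The scoring formula itself is only available as an image in the source and is not reproduced here; the one-to-one greedy pairing of sorted m/z values and the default tolerance of 0.05 Da are assumptions chosen for illustration, as are the function names.

```python
def count_matches(query_mz, reference_mz, tol=0.05):
    """Count one-to-one m/z matches within `tol` using a two-pointer sweep.

    Pairing each peak at most once is an assumption; the source only
    states that a single count e is used for both match rates.
    """
    q, r = sorted(query_mz), sorted(reference_mz)
    i = j = e = 0
    while i < len(q) and j < len(r):
        diff = q[i] - r[j]
        if abs(diff) <= tol:
            e += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # query peak is too small to match any remaining reference peak
        else:
            j += 1  # reference peak is too small to match any remaining query peak
    return e


def match_rates(query_mz, reference_mz, tol=0.05):
    """Return (E, T): experimental and theoretical signal match rates."""
    e = count_matches(query_mz, reference_mz, tol)
    return e / len(query_mz), e / len(reference_mz)
```

With a 6-peak query and a 12-peak reference sharing two peaks, this gives E = 2/6 and T = 2/12, the same proportions as the comparison reported for query spectrum q1 in Example 1.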
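This scoring method takes account of the signal quality of both the query spectrum and the reference spectrum; when the signal quality of the reference spectrum is poor, the match score is not depressed to the point that no accurate identification can be obtained. In a typical embodiment of the present invention, sorting the matching results and performing FDR identification on them comprises: sorting the identification result array of the query spectra by match score from high to low; letting target_score be the target database count and decoy_score the decoy library count, an identification result that is a target spectrum being counted as target_score+1 and one that is a decoy spectrum as decoy_score+1; the FDR of the identification results is FDR = decoy_score/(target_score+decoy_score); with the FDR threshold set to x, when the traversal reaches an identification result sn at which FDR ≥ x, the valid identification results of the batch are M{s1,s2,s3......s(n-1)}; preferably, x is at most 0.2, more preferably at most 0.05, and still more preferably at most 0.01.
FDR provides quality control of the identification results: taking the results with FDR < 0.01 as valid means that the estimated proportion of false positives among the valid results is below 1%, and taking the results with FDR < 0.02 means that the estimated proportion of false positives is below 2%.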
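The ranked cut-off described above can be sketched in a few lines of Python. The representation of an identification result as a `(score, is_decoy)` tuple and the function name `fdr_filter` are illustrative assumptions.

```python
def fdr_filter(results, threshold=0.01):
    """Keep top-ranked results until the running FDR first reaches `threshold`.

    `results` is a list of (score, is_decoy) tuples, an illustrative layout.
    The running FDR is decoy / (target + decoy), as defined in the text.
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    target = decoy = 0
    accepted = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoy += 1
        else:
            target += 1
        if decoy / (target + decoy) >= threshold:
            break  # result s_n crosses the threshold; keep s_1 .. s_(n-1)
        accepted.append((score, is_decoy))
    return accepted
```

Note that, as in the text, the traversal stops at the first rank where the threshold is reached; it does not look further down the list for a later rank at which the ratio might fall below the threshold again.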
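Within the inventive concept of the present invention, a decoy library is also provided. The decoy library is constructed by the above method for constructing a decoy library.
Within the inventive concept of the present invention, a target-decoy library is also provided. The target-decoy library is constructed by the above method for constructing a target-decoy library.
Based on the technical solutions set out above, an embodiment or example of the present invention provides a new metabolome identification method, named XY-Meta. The specific technical solution is as follows:
The overall analysis workflow of XY-Meta (the metabolome FDR identification method) is shown in Figure 1. It mainly comprises conversion of the raw spectrum data, standardization of the spectrum data, spectrum matching, FDR quality control of the identification results, and output of the matching results. The detailed workflow is as follows:
1. Convert the raw metabolite mass spectrometry data into spectrum data and read it.
1) MGF is a common data format for MS2 spectra. It contains the spectrum number, retention time, mass-to-charge ratio, charge, and the mass-to-charge ratios and peak intensities of the fragment ions; a complete MGF file can be used to interpret and recognize spectra. MSconvert is used to convert the raw instrument files (the raw mass spectrometry data, which may also be called the data or spectra to be identified, for example data from a Thermo Fisher instrument) into spectrum data files in MGF format. Figure 2 shows the MGF spectrum file data format as an example.
2) The MGF file is read as text and parsed, and the spectra are stored in the Spectrum_info structure, which holds the spectrum number, the parent ion retention time, mass-to-charge ratio, and charge, and the mass-to-charge ratios and corresponding peak intensities of the fragment ions.
3) The spectrum data to be identified, Q (the query spectra), and the reference spectrum data are read by one uniform reading method and stored in computer memory.
2. Generate the target-decoy library.
The main workflow of target-decoy library generation is shown in Figure 3. The target database is screened by parent ion to obtain the signal spectra; all signal spectra are merged into a signal warehouse; signals are selected at random from the signal warehouse to form decoy spectra and thus the decoy library; and the target database and the decoy library are merged into the target-decoy library. The detailed workflow is as follows: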
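1) The target database contains n spectra P{p1,p2,p3......pn}. Starting from the first spectrum p1, whose metabolite parent ion mass-to-charge ratio is M, M is compared one by one with all spectra in the target library other than p1. If a spectrum pm contains at least one fragment ion whose mass-to-charge ratio equals M (an equal mass-to-charge ratio indicates that the fragment ion is similar to the parent ion), the serial number of pm, or the spectrum itself, is stored in the signal-spectrum index array rm{pm1,pm2,pm3.....} (pm1, pm2, pm3, ... denote the different spectra that satisfy the condition; every spectrum in this set has one or more fragment ions whose mass-to-charge ratio equals M). The loop continues until all spectra in the target database have been traversed, giving a two-dimensional signal-spectrum index array with n elements, R{r1,r2,r3......rn}.
2) The two-dimensional signal-spectrum index array R{r1,r2,r3......rn} is traversed. The first signal-spectrum index array r1{pm1,pm2,pm3.....} is selected and all its elements are traversed, storing the fragment ion signals of each spectrum in a signal warehouse S. Then, from the target database spectrum p1, whose serial number is the same as that of r1, a portion of the ion signals is selected at random and copied into another array D1. The selected fragment ion signals make up a proportion h of those in p1, with h < 1; the larger h is, the more similar the resulting decoy library is to the target database. In a preferred example of this application h is between 0.6 and 0.9, a range in which the decoy library gives better FDR quality control; in a more preferred example h is 0.775, which gives the best result. A number of fragment ion signals is then selected at random from the signal warehouse S and filled into D1, so that the number of fragment ion signals in D1 equals that in p1. Next, a portion of the signals in D1 is selected at random and a mass-to-charge value of random size is added to or subtracted from them. The purpose is to add a perturbation so as to avoid overlap with the original library spectrum P; the perturbation should be smaller than the parent ion mass-to-charge ratio and is preferably ±1 Da. The selected signals make up a proportion k of all signals in D1, with k < 1; in a preferred example, k = 0.5 gives the best result. All elements of the two-dimensional signal-spectrum index array R{r1,r2,r3......rn} are processed in turn in this way, giving n arrays D, which are stored in the decoy library signal array Decoy{D1,D2,D3......Dn}.
3) The decoy library signal array Decoy{D1,D2,D3......Dn} is traversed. For the decoy signal array Dn (here "n" stands for 1, 2, 3, ..., n, i.e. each subset of the decoy library signal array Decoy), the parent ion retention time, mass-to-charge ratio, charge, and similar information of the corresponding target library spectrum pn are copied to Dn, forming the decoy spectrum an that corresponds to the target spectrum pn. Looping over all subsets generates n decoy spectra, which are stored in the array A{a1,a2,a3......an}. Array A is the decoy library.
4) The target database P{p1,p2,p3......pn} and the decoy library A{a1,a2,a3......an} are merged into one array, the target-decoy library TD{t1,t2,t3......t2n} (that is, TD{p1,p2,p3......pn,a1,a2,a3......an}).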
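The four steps above can be condensed into a short Python sketch. It is an illustration under stated assumptions rather than the XY-Meta implementation: spectra are plain dicts with hypothetical keys (`precursor_mz`, `rt`, `charge`, `peaks`), "equal" m/z is taken to mean equal within a tolerance `tol`, and an empty signal warehouse falls back to the spectrum's own peaks, a case the source does not address.

```python
import random


def build_decoy_library(targets, h=0.775, k=0.5, max_shift=1.0, tol=0.01, seed=0):
    """Build one decoy spectrum per target spectrum (illustrative sketch).

    Each spectrum is a dict with keys 'precursor_mz', 'rt', 'charge' and
    'peaks' (a list of (mz, intensity) tuples); this layout is assumed.
    """
    rng = random.Random(seed)
    decoys = []
    for idx, spec in enumerate(targets):
        # Step 1: signal spectra are the other spectra that contain a
        # fragment whose m/z equals this spectrum's parent ion m/z.
        signal_spectra = [
            other for j, other in enumerate(targets)
            if j != idx and any(abs(mz - spec["precursor_mz"]) <= tol
                                for mz, _ in other["peaks"])
        ]
        # Step 2a: pool all of their fragment signals in a signal warehouse.
        warehouse = [peak for other in signal_spectra for peak in other["peaks"]]
        n = len(spec["peaks"])
        # Step 2b: keep a random fraction h of the target's own signals.
        decoy_peaks = rng.sample(spec["peaks"], round(h * n))
        # Step 2c: fill up to n signals from the warehouse. Falling back to
        # the spectrum's own peaks when the warehouse is empty is an assumption.
        pool = warehouse or spec["peaks"]
        decoy_peaks += [rng.choice(pool) for _ in range(n - len(decoy_peaks))]
        # Step 2d: shift the m/z of a random fraction k of the signals.
        for pos in rng.sample(range(n), round(k * n)):
            mz, intensity = decoy_peaks[pos]
            decoy_peaks[pos] = (mz + rng.uniform(-max_shift, max_shift), intensity)
        # Step 3: copy the parent ion information from the target spectrum.
        decoys.append({"precursor_mz": spec["precursor_mz"], "rt": spec["rt"],
                       "charge": spec["charge"], "peaks": sorted(decoy_peaks)})
    return decoys


def build_target_decoy_library(targets, **kwargs):
    """Step 4: concatenate the targets and their decoys."""
    return list(targets) + build_decoy_library(targets, **kwargs)
```

Raising `h` or lowering `k` in this sketch makes each decoy more similar to its target, which mirrors the similarity control described in the text.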
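3. Metabolite spectrum matching.
The two major steps above yield the query spectra and the target-decoy library; a spectrum matching algorithm then matches the query spectra against the target-decoy library. The main workflow of metabolite spectrum matching is shown in Figure 4 and comprises peak intensity normalization of the query spectra, peak intensity weight calculation, match scoring, and output of the matching results. The detailed workflow is as follows:
1) Signal peak intensity normalization: each spectrum in the query spectra Q{q1,q2,q3......qn} is traversed and compared with each spectrum in the target-decoy library TD{t1,t2,t3......t2n}. The fragment ion signal intensities in the spectra are normalized into the interval (0,1); that is, every fragment ion signal intensity in a spectrum is divided by the largest fragment ion signal intensity of that spectrum.
2) Candidate spectrum screening: the query spectra Q{q1,q2,q3......qn} are traversed. A spectrum qn is selected and its parent ion mass-to-charge ratio M is obtained; the serial numbers of all spectra in the target-decoy library TD{t1,t2,t3......t2n} whose parent ion mass-to-charge ratio is M are screened out and stored in the spectrum-number index array hn. This is done for every query spectrum in turn, generating n spectrum-number index arrays, which are stored in the two-dimensional spectrum-number index array H{h1,h2,h3......hn}.
3) Signal intensity weight calculation: the target-decoy library TD{t1,t2,t3......t2n} is traversed and the fragment ion signals of all spectra in TD are stored in the signal warehouse Signal, which is taken as the population distribution of signal peak intensities. The query spectra Q{q1,q2,q3......qn} are traversed and a query spectrum qn is selected; suppose it has m fragment ion signals. With Signal as the population, all fragment ion signals of qn are tested by the Grubbs test, the box-plot method, or a normal distribution test; the reciprocal of the resulting statistic t is taken as the weight wm of each signal and stored in the weight array W, finally giving the weight array W{w1,w2,w3......wm} for all fragment ions of qn.
4) Spectrum match scoring: the fragment ion signals of the query spectrum and of the reference spectrum are defined as two arrays
Figure PCTCN2020099769-appb-000013
Figure PCTCN2020099769-appb-000014
With the reference spectrum as the basis, the signals of the query spectrum are compared with those of the reference spectrum. Let total_e be the total number of signals in the query spectrum and e the number of them that match signals in the reference spectrum; the experimental signal match rate of this match is then E = e/total_e. Let total_t be the total number of signals in the reference spectrum, of which e match signals in the query spectrum; the theoretical signal match rate of this match is then T = e/total_t. After signal matching, a vector dot-product algorithm is used to compute the dot-product sum of the fragment ion signals of the query spectrum and of the reference spectrum, by the following formula:
Figure PCTCN2020099769-appb-000015
Note: μ is a correction coefficient, equal to the reciprocal of the difference between the fragment ion signals of the query spectrum and those of the reference spectrum,
Figure PCTCN2020099769-appb-000016
are the fragment ion signal vectors of the spectra, w is the weight of the fragment ion signals of the query spectrum, T is the theoretical signal match rate of this match, and E is the experimental signal match rate of this match.
5) Spectrum matching and result output: the two-dimensional spectrum-number index array H{h1,h2,h3......hn} is traversed. A spectrum-number index array hn is selected and all spectrum numbers in it are traversed; the query spectrum qn is matched against each reference spectrum listed in hn, and the highest-scoring match is taken as the identification result of qn. The identification result of each spectrum is then placed in the array Score. Doing this for all elements of H in turn gives the identification result array for the n query spectra, Score{s1,s2,s3......sn}.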
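Candidate screening (step 2) and best-hit selection (step 5) can be sketched as below. The scoring function is passed in as a parameter because the patent's scoring formula is only available as an image; `score_fn`, the dict keys, and the tolerance-based comparison of parent ion m/z are all assumptions made for illustration.

```python
def candidate_index(queries, library, tol=0.01):
    """For each query, list library positions whose parent ion m/z matches.

    Matching within `tol` stands in for "equal m/z"; the tolerance value
    and the 'precursor_mz' key are illustrative assumptions.
    """
    return [
        [i for i, ref in enumerate(library)
         if abs(ref["precursor_mz"] - query["precursor_mz"]) <= tol]
        for query in queries
    ]


def best_hits(queries, library, score_fn, tol=0.01):
    """Return (library position, score) of the top candidate per query, or None.

    `score_fn(query, reference)` is a hypothetical placeholder for the
    scoring formula, which this sketch does not reproduce.
    """
    hits = []
    for query, candidates in zip(queries, candidate_index(queries, library, tol)):
        if not candidates:
            hits.append(None)  # no reference shares this parent ion m/z
            continue
        scored = [(score_fn(query, library[i]), i) for i in candidates]
        score, position = max(scored, key=lambda pair: pair[0])
        hits.append((position, score))
    return hits
```

Restricting scoring to the candidate list keeps the number of pairwise comparisons proportional to the number of library spectra sharing the query's parent ion m/z rather than to the size of the whole target-decoy library.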
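4. Sort the matching results and apply FDR quality control to the identification results.
1) The spectrum identification result array Score{s1,s2,s3......sn} is sorted by match score from high to low. Let target_score be the target library count and decoy_score the decoy library count. Counting downward from the highest-scoring identification result, a result that is a target spectrum is counted as target_score+1 and a result that is a decoy spectrum as decoy_score+1.
2) The FDR of the identification results is FDR = decoy_score/(target_score+decoy_score). Preferably, in one embodiment of this application, the threshold is chosen below 0.2; in a more preferred embodiment the threshold is preferably below 0.05, and more preferably is 0.01. When the traversal reaches an identification result sn at which FDR ≥ 0.01, the valid identification results of the batch are M{s1,s2,s3......s(n-1)}. The FDR calculation is shown in Table 1.
Table 1
No.   Match score   Target   Decoy   FDR
1     s1            t1       d1      d1/(t1+d1)
2     s2            t2       d2      d2/(t2+d2)
3     s3            t3       d3      d3/(t3+d3)
n     sn            tn       dn      dn/(tn+dn)
5. Output the identification results.
The valid identification results M{s1,s2,s3......s(n-1)} are traversed, each spectrum identification result is collated, and the results are output in tsv format. The output identification information comprises: spectrum number, final score, FDR, metabolite annotation, match score, theoretical signal match rate, experimental spectrum signal-to-noise ratio, theoretical spectrum parent ion mass-to-charge ratio, experimental spectrum parent ion mass-to-charge ratio, adduct type, adduct mass, and number of matched signals.
The metabolome FDR identification method of the present invention has the following important features: 1) FDR quality control can be applied to the identification results, using the target-decoy strategy; 2) metabolite spectra can be identified quickly and at high throughput; 3) during spectrum identification, the retention-time restriction on the parent ion is lifted, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.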
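Within the inventive concept of the present invention, an apparatus for constructing a decoy library is also provided. The apparatus comprises a two-dimensional signal-spectrum index array generation module, a decoy library signal array generation module, and a decoy library generation module. The two-dimensional signal-spectrum index array generation module is configured to compare the metabolite parent ion mass-to-charge ratio M of each spectrum in the target database one by one with all other spectra in the target database, store each spectrum containing a fragment ion whose mass-to-charge ratio equals M, and/or its serial number, in a signal-spectrum index array, and, after traversing all spectra in the target database, generate a two-dimensional signal-spectrum index array. The decoy library signal array generation module is configured to select one signal-spectrum index array in the two-dimensional signal-spectrum index array, store the fragment ion signals of every spectrum in that index array in a first signal warehouse, then select at random a portion of the fragment ion signals of the corresponding spectrum in the target database and copy them into an array D, select at random a number of fragment ion signals from the first signal warehouse to fill array D so that the number of fragment ion signals in array D equals that of the corresponding spectrum in the target database, then select at random a portion of the signals in array D and change their mass-to-charge ratios at random to avoid overlap with the mass-to-charge ratios of the corresponding spectrum in the target database, and traverse all elements of the two-dimensional signal-spectrum index array to obtain n arrays D, which form the decoy library signal array, where n is a natural number and "corresponding" means having the same serial number. The decoy library generation module is configured to copy the parent ion information of the target database spectrum corresponding to each subset of the decoy library signal array to the decoy library signal array, forming the decoy library.
With the technical solution of the present invention, a decoy library is generated from the target database by selecting signals at random from the database; after spectrum identification, the FDR of the identification results can be assessed and controlled by the quality control module. The performance of the decoy library was evaluated with the Passatutto standard spectral library, and the decoy library constructed by the apparatus of the present invention was found to have the same characteristics as the target library and to assess the FDR of identification results effectively.
In the decoy library signal array generation module, the number of fragment ion signals selected at random from the corresponding spectrum in the target database and copied into array D is a proportion h of all fragment ion signals of that spectrum, with h < 1; the larger h is, the more similar the resulting decoy library is to the target database. In a preferred embodiment, h is between 0.6 and 0.9 so that the decoy library gives better FDR quality control; in a more preferred embodiment, h = 0.775 gives the best result.
In the decoy library signal array generation module, changing the mass-to-charge ratio at random comprises adding or subtracting a mass-to-charge value of random size. The purpose is to add a perturbation so as to avoid overlap with the original library spectrum P; the perturbation should be smaller than the parent ion mass-to-charge ratio. Typically, this comprises uniformly adding a random value, uniformly subtracting a random value, or randomly adding or subtracting a random value; preferably, the perturbation is ±1 Da; more preferably, the selected signals make up a proportion k of all signals in array D, with k < 1, more preferably k = 0.5. The present invention generates the decoy library from the target database by perturbing spectral database signals, and then builds a target-decoy library for FDR quality control of the identification results. This makes the similarity between target and decoy libraries controllable, so that the apparatus adapts to target data sets of differing structural similarity and improves the accuracy and stability of metabolome identification.
In a typical embodiment of the present invention, in the decoy library generation module, the parent ion information of a spectrum in the target database comprises the retention time, mass-to-charge ratio, charge, and similar information of the parent ion, so that the decoy library carries fairly complete parent ion information.
According to a typical embodiment of the present invention, an apparatus for constructing a target-decoy library is provided. The apparatus comprises a target database generation module, a decoy library construction module, and a merging module. The target database generation module is configured to select spectra to form a target database; the decoy library construction module is configured to construct a decoy library; and the merging module is configured to merge the target database generated by the target database generation module with the decoy library constructed by the decoy library construction module to obtain the target-decoy library, wherein the decoy library construction module is the above apparatus for constructing a decoy library. This apparatus therefore shares the advantages described for the apparatus for constructing a decoy library.
According to a typical embodiment of the present invention, an apparatus for metabolome FDR identification is provided. The apparatus comprises a format unification module, a target-decoy library construction module, a matching module, and an FDR identification module. The format unification module is configured to convert raw mass spectrometry data into uniform spectrum data and read it to obtain the query spectra; the target-decoy library construction module is configured to construct a target-decoy library; the matching module is configured to match the query spectra obtained by the format unification module against the target-decoy library constructed by the target-decoy library construction module; and the FDR identification module is configured to sort the matching results of the matching module and perform FDR identification on them; wherein the target-decoy library construction module is the above apparatus for constructing a target-decoy library.
This metabolome FDR identification apparatus allows FDR quality control of the identification results using the target-decoy strategy; identifies metabolite spectra quickly and at high throughput; and lifts the retention-time restriction on the parent ion during spectrum identification, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.
Typically, in the format unification module, the uniform spectrum data is a spectrum data file containing m/z and peak intensity information, for example in MGF format; preferably, the format unification module stores the file as a linked data list whose spectrum information comprises the spectrum number, the parent ion retention time, mass-to-charge ratio, and charge, and the mass-to-charge ratios and corresponding peak intensities of the fragment ions. The data list includes but is not limited to a singly linked list, doubly linked list, binary tree, hash, or map. In a preferred embodiment of the present invention, the MGF spectrum data file is stored as Spectrum info, which is a kind of singly linked list.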
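According to a typical embodiment of the present invention, the matching module comprises a normalization submodule, a two-dimensional spectrum-number index array generation submodule, a weight array generation submodule, a scoring submodule, and an identification result array generation module. The normalization submodule is configured to compare each query spectrum with each spectrum in the target-decoy library and normalize the fragment ion signal intensities of each query spectrum. The two-dimensional spectrum-number index array generation submodule is configured to select one query spectrum and obtain its parent ion mass-to-charge ratio M, screen out the serial numbers of all spectra in the target-decoy library whose parent ion mass-to-charge ratio is M and store them in a spectrum-number index array, and traverse every query spectrum to obtain a two-dimensional spectrum-number index array. The weight array generation submodule is configured to store the fragment ion signals of all spectra in the target-decoy library in a second signal warehouse, take the second signal warehouse as the population distribution of signal peak intensities, select one query spectrum, test all of its fragment ion signals against that population to obtain the signal weights, and traverse every query spectrum to obtain a weight array. The scoring submodule is configured to score the match of the query spectrum's fragment ion signals on the basis of the fragment ion signals of the reference spectrum. The identification result array generation module is configured to select one spectrum-number index array, match the query spectrum against the spectra listed in it, take the highest-scoring match as the identification result of the query spectrum, and traverse all elements of the two-dimensional spectrum-number index array to obtain the identification result array of the query spectra.
In a typical embodiment of the present invention, the normalization submodule is configured to normalize the fragment ion signal intensities into the interval (0,1); preferably, each fragment ion signal intensity is divided by the largest fragment ion signal intensity in its spectrum.
Preferably, the weight array generation submodule is configured to test all fragment ion signals of the selected query spectrum against the second signal warehouse as the population to obtain a statistic for each signal, and to take the reciprocal of the statistic as the weight of that fragment ion signal; the test is the Grubbs test, the box-plot test, a normal distribution test, or the like. In matching the query spectra against the target-decoy library, the signal-to-noise ratio of the spectrum signals is brought into the scoring algorithm, and the matching algorithm uses the Grubbs outlier test to compute the signal weights that enter the subsequent match score, which improves the noise robustness of spectrum matching.
In a typical embodiment of the present invention, the scoring submodule is configured to define the fragment ion signals of the query spectrum and of the reference spectrum as two arrays
Figure PCTCN2020099769-appb-000017
Figure PCTCN2020099769-appb-000018
With the reference spectrum as the basis, the signals of the query spectrum are compared with those of the reference spectrum. Let total_e be the total number of signals in the query spectrum and e the number of them that match signals in the reference spectrum; the experimental signal match rate of this match is then E = e/total_e. Let total_t be the total number of signals in the reference spectrum, of which e match signals in the query spectrum; the theoretical signal match rate of this match is then T = e/total_t. After signal matching, a vector dot-product algorithm is used to compute the dot-product sum of the fragment ion signals of the query spectrum and of the reference spectrum, by the following formula:
Figure PCTCN2020099769-appb-000019
where μ is a correction coefficient, equal to the reciprocal of the difference between the fragment ion signals of the query spectrum and those of the reference spectrum,
Figure PCTCN2020099769-appb-000020
are the fragment ion signal vectors of the spectra, w is the weight of the fragment ion signals of the query spectrum, T is the theoretical signal match rate of this match, and E is the experimental signal match rate of this match.
In a typical embodiment of the present invention, the FDR identification module is configured to execute the following instructions: sort the identification result array of the query spectra by match score from high to low; let target_score be the target database count and decoy_score the decoy library count, an identification result that is a target spectrum being counted as target_score+1 and one that is a decoy spectrum as decoy_score+1; the FDR of the identification results is FDR = decoy_score/(target_score+decoy_score); with the FDR threshold set to x, when the traversal reaches an identification result sn at which FDR ≥ x, the valid identification results of the batch are M{s1,s2,s3......s(n-1)}; preferably, x is at most 0.2, more preferably at most 0.05, and still more preferably at most 0.01.
The metabolome FDR identification apparatus of the present invention (which may also be called the XY-Meta software) can be developed in the Golang programming language. Its data-indexing structures and code logic have been carefully designed and repeatedly debugged, so that spectrum identification can be parallelized across multiple cores, improving the utilization of computing resources and achieving high-performance metabolome identification.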
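The beneficial effects of the present invention are further illustrated below with reference to an example.
Example 1
GNPS is a public database of metabolite mass spectra; it contains spectra of standards and experimental samples of various natural metabolites acquired on different instrument platforms. The Passatutto tool compiled the spectra of a small number of metabolite standards from GNPS into a standard library, which can be used to evaluate how well a target-decoy library estimates FDR. This example uses the Passatutto standard database for metabolite identification.
1. Obtain the evaluation data.
Passatutto is downloaded (https://bio.informatik.uni-jena.de/Passatutto/), and the standard spectral library and experimental spectral library in its main directory are converted to MGF format, giving the files Passatutto_query.mgf (shown in Figure 5) and Target_GNPS.mgf (shown in Figure 6).
2. Determine the main identification parameters of XY-Meta.
The instrument and experimental parameters involved in metabolome identification with XY-Meta are mainly the column type, the charge mode, the parent ion and fragment ion mass tolerances, and spectrum signal preprocessing (parameters for a hydrophilic column):
Column type: hplc_pattern=1 (the type is either hydrophilic or hydrophobic mode; this example uses hydrophilic mode).
Charge mode: electric_pattern=1 (the charge mode is either positive or negative, determined by the detection mode of the mass spectrometer).
Ion tolerance: tolerance_precur=0.01Da (at most ±300 Da), tolerance_isotope=0.05Da (note: the allowed range is at most 0.5 Da).
Spectrum signal preprocessing: clear=true and merge_tolerance=0.05Da (merge_tolerance must be at least tolerance_isotope).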
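The constraints stated for these parameters can be written down as a small Python check. The parameter names and values are taken from the text above; representing them as a Python dict and the function name `check_params` are assumptions, since the source does not say how XY-Meta actually receives its settings.

```python
# Parameter names and values as listed in the text; holding them in a dict
# is an illustrative assumption, not XY-Meta's documented input format.
params = {
    "hplc_pattern": 1,          # column type (hydrophilic mode in this example)
    "electric_pattern": 1,      # charge mode, set by the instrument's detection mode
    "tolerance_precur": 0.01,   # parent ion tolerance in Da
    "tolerance_isotope": 0.05,  # fragment ion tolerance in Da
    "clear": True,              # spectrum signal preprocessing switch
    "merge_tolerance": 0.05,    # peak merging tolerance in Da
}


def check_params(p):
    """Return a list of violated constraints; an empty list means all hold."""
    problems = []
    if abs(p["tolerance_precur"]) > 300:
        problems.append("tolerance_precur must be within +/-300 Da")
    if p["tolerance_isotope"] > 0.5:
        problems.append("tolerance_isotope must be at most 0.5 Da")
    if p["merge_tolerance"] < p["tolerance_isotope"]:
        problems.append("merge_tolerance must be at least tolerance_isotope")
    return problems
```

Running the check before a search catches inconsistent settings, such as a merge tolerance smaller than the fragment tolerance, before any spectra are processed.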
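3. XY-Meta generates the target-decoy library.
XY-Meta reads the target library Target_GNPS.mgf and generates the corresponding decoy library; the decoy library generation workflow is shown in Figure 7.
The specific steps are as follows:
1) The target database contains 4139 spectra P{p1,p2,p3......p4139}. Starting from the first spectrum p1, whose metabolite parent ion mass-to-charge ratio is 359.151, p1 is compared one by one with all spectra in the target library other than p1. If a spectrum pm contains one or more fragment ions whose mass-to-charge ratio equals 359.151, the serial number of pm is stored in the signal-spectrum index array r1{p100,p103,p201......p3890}. This process is repeated until all spectra in the target database have been traversed, giving a two-dimensional signal-spectrum index array with 4139 elements, R{r1,r2,r3......r4139}.
2) The two-dimensional signal-spectrum index array R{r1,r2,r3......r4139} is traversed. The first signal-spectrum index array r1{p100,p103,p201......p3890} is selected and its elements are traversed; starting from the first spectrum in r1, all fragment ion signals of each spectrum are stored in a signal warehouse S (Figure 8) (the signal warehouse S contains all ion signals of all spectra referenced by the two-dimensional signal-spectrum index array R). Then the target database spectrum p1, whose serial number is the same as that of r1, is selected; a proportion of 0.6 of the fragment ion signals of p1 (Figure 9a) is selected at random and copied into another array D1 (Figure 9b), and a number of fragment ion signals is selected at random from the signal warehouse S and filled into D1 (Figure 9c), so that the number of fragment ion signals in D1 equals that in p1. A proportion of 0.6 of the signals in D1 is then selected at random and a mass-to-charge value of random size is added to or subtracted from them. Finally, D1 is stored in the decoy library signal array Decoy. All elements of R{r1,r2,r3......r4139} are traversed and processed in this way, generating 4139 arrays D, which are stored in Decoy to give Decoy{D1,D2,D3......D4139}.
3) The decoy library signal array Decoy{D1,D2,D3......D4139} is traversed. Starting from the first decoy signal array D1, the parent ion retention time, mass-to-charge ratio, charge, and similar information of the corresponding target library spectrum p1 are copied to D1, forming the decoy spectrum a1 that corresponds to the target spectrum p1. Looping over every signal array in the decoy library signal array generates 4139 decoy spectra, which are stored in the array A{a1,a2,a3......a4139}. Array A is the decoy library.
4) The target database P{p1,p2,p3......p4139} and the decoy library A{a1,a2,a3......a4139} are merged into one array, the target-decoy library TD{t1,t2,t3......t8278}. The target-decoy library file Target_Decoy_GNPS.mgf is generated (Figure 10).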
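4. XY-Meta compares the query spectra with the target-decoy library.
1) Signal peak intensity normalization: each spectrum in the query spectra Q{q1,q2,q3......q2106} is traversed and compared with each spectrum in the target-decoy library TD{t1,t2,t3......t8278}. Each spectrum of Q and of TD is normalized individually, bringing the signal intensities into the interval (0,1).
2) Candidate spectrum screening: the query spectra Q{q1,q2,q3......q2106} are traversed. The spectrum q1 is selected and its parent ion mass-to-charge ratio, 182.0482, is obtained; the serial numbers of all spectra in the target-decoy library TD{t1,t2,t3......t8278} whose parent ion mass-to-charge ratio is 182.0482 are screened out and stored in the spectrum-number index array h1. This is done for every element of Q in turn, generating 2106 spectrum-number index arrays, which are stored in the two-dimensional spectrum-number index array H{h1,h2,h3......h2106}.
3) Signal intensity weight calculation: the target-decoy library TD{t1,t2,t3......t8278} is traversed and the fragment ion signals of all spectra in TD are stored in the signal warehouse Signal, which is taken as the population distribution of signal peak intensities. The query spectra Q{q1,q2,q3......q2106} are traversed, starting from the first query spectrum q1, which has 6 fragment ion signals. With Signal as the population, all fragment ion signals of q1 are tested by the Grubbs test; the reciprocal of the resulting statistic t is taken as the weight wm of each signal and stored in the weight array W, finally giving the weight array W{w1,w2,w3......w6} for all fragment ions of q1.
4) Spectrum match scoring: the fragment ion signals of the query spectrum and of the reference spectrum are defined as two arrays
Figure PCTCN2020099769-appb-000021
Figure PCTCN2020099769-appb-000022
With the reference spectrum as the basis, the signals of the query spectrum are compared with those of the reference spectrum. The first query spectrum q1 is compared with the first spectrum of the reference database, i.e. the target-decoy library (Figure 11). The query spectrum q1 has 6 signals in total, of which 2 match signals in the reference spectrum, so the experimental signal match rate of this match is E = 1/3. The reference spectrum has 12 signals in total, of which 2 match signals in the query spectrum, so the theoretical signal match rate of this match is T = 1/6. After signal matching, the dot-product sum of the fragment ion signals of the query spectrum and of the reference spectrum, computed with the vector dot-product algorithm, is 4.619.
5) Spectrum matching and result output: the two-dimensional spectrum-number index array H{h1,h2,h3......h2106} is traversed. Starting from the first spectrum-number index array h1, all spectrum numbers in h1 are traversed; the query spectrum q1 is matched against all reference spectra recorded in h1, and the highest-scoring match is taken as the identification result of q1. The identification result of each spectrum is then placed in the array Score. Looping over all elements of H in turn gives the identification result array for the 2106 query spectra, Score{s1,s2,s3......s2106}, as shown in Figure 12 (score ranking of the comparisons between the query spectrum and the reference library spectra). In Figure 12: ID: label; Score: match score; Reference_spectrum: spectrum number in the reference database; Match_Score: signal-matching dot product; TSNR: theoretical signal match rate; ESNR: experimental signal match rate; Query_precursor_mass: parent ion mass-to-charge ratio of the query spectrum; Reference_precursor_mass: parent ion mass-to-charge ratio of the reference database spectrum; Diviation_mass: parent ion mass-to-charge error between the query spectrum and the reference spectrum; Adduct: adduct type. The result with the highest match score (Score) is selected as the matching result of the query spectrum.
5. XY-Meta applies FDR quality control to the spectrum matching results and outputs the results.
1) The spectrum identification result array Score{s1,s2,s3......s2106} is sorted by match score from high to low. Let target_score be the target library count and decoy_score the decoy library count. Counting downward from the highest-scoring identification result, a result that is a target spectrum is counted as target_score+1 and a result that is a decoy spectrum as decoy_score+1.
2) The FDR of the identification results is FDR = decoy_score/(target_score+decoy_score), and the FDR threshold is set to 0.01. When the traversal reaches the 126th spectrum identification result, FDR = 0.015873 > 0.01, so the valid identification results of the batch are the identification results of the first 125 spectra, M{q1,q2,q3......q125}. Figures 13-1, 13-2, 13-3, 13-4, 13-5, 13-6, 13-7, 13-8, 13-9, 13-10, and 13-11 show the FDR quality control and output list of the Passatutto_query.mgf identification results. In these figures: ID: number; Score: match score; Target: target library match count; Decoy: decoy library match count; FDR: FDR estimate; Reference_spectrum: spectrum number in the reference database; Match_Score: signal-matching dot product; TSNR: theoretical signal match rate; ESNR: experimental signal match rate; Query_precursor_mass: parent ion mass-to-charge ratio of the query spectrum; Reference_precursor_mass: parent ion mass-to-charge ratio of the reference database spectrum; Diviation_mass: parent ion mass-to-charge error between the query spectrum and the reference spectrum; Adduct: adduct type; Adduct_mass: adduct mass; Peaks number: number of matched fragment ions. The identification results with FDR below 0.01 are taken as the final identification results.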
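As a worked check of the reported figure, the value 0.015873 at rank 126 is consistent with 2 decoy hits among the first 126 ranked results. The counts 2 and 124 below are inferred from the reported FDR value and the formula; they are not stated in the source.

```latex
\[
\mathrm{FDR}_{126}
  = \frac{\text{decoy\_score}}{\text{target\_score} + \text{decoy\_score}}
  = \frac{2}{124 + 2}
  = \frac{2}{126}
  \approx 0.015873 \;\ge\; 0.01,
\qquad
\mathrm{FDR}_{125} = \frac{1}{124 + 1} = 0.008 \;<\; 0.01 .
\]
```

Under this reading, the 126th result is the second decoy hit, which is why the accepted set ends at rank 125.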
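The above example of the present invention achieves the following technical effects:
A. This example implements the untargeted metabolome identification workflow and the quality control workflow in one workflow, so that the FDR of the metabolome identification results is controllable. This is shown mainly in the following:
1) XY-Meta generates the decoy library from the target database by selecting signals at random from the database; after spectrum identification, the quality control modules (the matching module and the FDR identification module) assess the FDR of the identification results and apply quality control. When the performance of the XY-Meta target-decoy library was evaluated with the Passatutto standard spectral library, the decoy library generated by XY-Meta had the same characteristics as the target library and assessed the FDR of the identification results effectively.
2) XY-Meta can adjust the similarity between the decoy library and the target database. A decoy library that is highly similar to the target database has strong decoying power and is better suited to FDR quality control of metabolome identification results involving many isomers or metabolites of high structural similarity; decoy libraries produced by the fragmentation-tree method are an example. Conversely, a decoy library with low similarity to the target database lacks the signal characteristics of the target database, has insufficient decoying power, and yields an FDR estimate that is lower than the true value. In general, the target-decoy library generated with XY-Meta's default parameter settings suits most metabolome identification scenarios.
The more similar the decoy library is to the target database, the stronger its decoying power and the more the FDR estimate may be too high; conversely, the more the decoy library differs from the target database, the weaker its decoying power and the more the FDR estimate may be too low. The performance of the XY-Meta target-decoy library was evaluated with the Passatutto standard spectral library. When the theoretical FDR agrees with the actual FDR, the points form the straight line y=x in the coordinate system. The evaluation showed that the FDR estimated by the XY-Meta target-decoy library fluctuates around the line y=x and finally approaches it, which indicates that the XY-Meta target-decoy library assesses the FDR of metabolome identification effectively. Figure 14 shows the FDR quality control performance of the XY-Meta target-decoy library. Note: Simulation_level1 to Simulation_level11 are the measured curves of predicted FDR against true FDR for decoy libraries at 11 levels of similarity to the target, namely 0.66, 0.68, 0.70, 0.72, 0.74, 0.76, 0.78, 0.80, 0.82, 0.84, and 0.86; Expect_FDR is the ideal curve. All measured curves fluctuate around the ideal curve. The curve for the decoy library with similarity 0.78 lies closest to the ideal curve in the range FDR < 0.1, so a similarity of 0.78 is the best value.
3) XY-Meta can generate a decoy library quickly from the target library, without relying on other tools such as Passatutto or the metabolite spectrum prediction software CFM-ID. The decoy library generated by XY-Meta can be saved locally and reused, and decoy libraries produced by other tools can be imported through the decoy library import option, so that the database for metabolome identification can be built flexibly.
Typically, the decoy library loading workflow of XY-Meta is as shown in Figure 15. When XY-Meta is used for metabolome identification and FDR quality control for the first time, a target library must be imported first so that the corresponding decoy library can be generated. The generated decoy library can be stored permanently, and a saved decoy library can be used as an external decoy library; when XY-Meta is used for metabolome identification, the external decoy library can be imported for FDR quality control.
B. This example can identify large batches of metabolite spectra at high speed, and effective FDR quality control improves spectrum utilization. This is shown mainly in the following:
1) Using 3 cores of an Intel i5-7500 processor in parallel, metabolome identification of the 2106 Passatutto experimental spectra used 2.5 GB of memory and took 1 minute 18 seconds in total. Spectrum comparison with the existing tool MZmatch takes about 1 h.
2) The FDR control performance of XY-Meta's built-in target-decoy library is close to that of a target-decoy library generated with Passatutto. The identification results obtained with the two target-decoy libraries are shown in Table 2:
Table 2
Figure PCTCN2020099769-appb-000023
This shows that, at the same FDR level, XY-Meta can improve spectrum utilization.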
C.本发明对谱图的识别具有良好的的抗噪能力。
XY-Meta的谱图匹配算法具有良好的抗噪能力,通过有效的FDR质控,存在较多噪声信号的谱图也能够进行准确地鉴定,XY-Meta谱图匹配结果见图16。
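D. The FDR quality control strategy of the present invention is flexible to use and meets different research and production needs. This is mainly reflected in the following:
1) Database semi-search: the XY-Meta database search workflow can skip the FDR control step after the identification results are obtained and output the identification results directly. Users can also apply other tools for FDR control of the identification results, which makes FDR control more flexible. The semi-search metabolome identification workflow of XY-Meta may be as shown in Figure 17.
2) Database open search: a conventional database search strategy assumes that the actual mass of the precursor ion equals its theoretical mass, and a typical database search mode presets several adduct forms for the precursor ion. In practice, however, the adduct ion forms bound to a precursor ion are often more numerous than, or different from, the theoretical adduct forms. As a result, in a conventional database search many spectra of correct metabolites are filtered out during matching, and the correct result is never matched. Open search widens the precursor ion mass tolerance so that unknown adduct modifications are absorbed by a larger mass error. This widens the matching range of the query spectrum during the search and lets the correct target spectrum enter spectrum matching. The side effects of open search are a larger computational load and the introduction of more incorrect reference spectra, especially for metabolites for which isomers are common. An open search strategy should therefore use a stricter FDR threshold for quality control. The open-search metabolome identification workflow of XY-Meta may be as shown in Figure 18.
3) Database iterative search: when the target database is very large and the true target spectra are few, FDR quality control of the identification results with the target-decoy strategy often overestimates the FDR and thereby reduces the number of valid spectra. This problem often arises when the full HMDB metabolite database is used for metabolome identification and when meta-metabolome identification is performed. An iterative database search strategy can effectively improve the accuracy and sensitivity of identification. Iterative database search requires at least two database searches. The first search is performed without FDR control, and all theoretical spectra matched in its identification results are collected into a new spectral library, which shrinks the target library; the newly generated metabolite spectral library is then imported into the next search. FDR control is applied to the identification results after the last iteration, and the metabolome identification results are then output. The iterative-search metabolome identification workflow of XY-Meta may be as shown in Figure 19.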
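The open search and iterative search strategies described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions and not the XY-Meta implementation: the library is reduced to a mapping from spectrum id to precursor m/z, the "best hit" is simply the candidate with the closest precursor m/z (standing in for the real spectrum matching score), and the names `candidates`, `search` and `iterative_search` as well as the tolerance values are hypothetical.

```python
def candidates(query_mz, library, tol):
    """Ids of library spectra whose precursor m/z lies within tol of the query.

    A narrow tol models a conventional search; a wide tol models open search.
    """
    return [rid for rid, ref_mz in library.items() if abs(ref_mz - query_mz) <= tol]


def search(queries, library, tol):
    """Toy search: best hit per query is the candidate with the closest precursor m/z."""
    hits = {}
    for qid, q_mz in queries.items():
        ids = candidates(q_mz, library, tol)
        if ids:
            hits[qid] = min(ids, key=lambda rid: abs(library[rid] - q_mz))
    return hits


def iterative_search(queries, library, tol, rounds=2):
    """Search repeatedly, shrinking the library to the matched spectra each round.

    No FDR control is applied in the early rounds; the caller applies it to
    the hits returned by the final round.
    """
    for _ in range(rounds - 1):
        matched = set(search(queries, library, tol).values())
        library = {rid: mz for rid, mz in library.items() if rid in matched}
    return search(queries, library, tol), library


# Illustrative precursor m/z values (made up for the example).
library = {"A": 180.063, "B": 202.045, "C": 500.0, "D": 181.0}
queries = {"q1": 180.064, "q2": 203.05}

narrow_hits = search(queries, library, tol=0.01)   # conventional search
open_hits = search(queries, library, tol=25.0)     # open search
final_hits, reduced_library = iterative_search(queries, library, tol=25.0)
```

In this toy data the narrow tolerance finds a hit only for q1, the wide tolerance also admits a candidate for q2, and the iterative search leaves a reduced library that contains only the spectra matched in the first pass.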
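The above are merely preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall fall within its scope of protection.
Industrial Applicability
Through the technical solution of the present application, the present invention has at least the following beneficial effects:
By applying the technical solution of the present invention, a decoy library can be generated effectively from the target database by randomly selecting signals from the database, and the decoy library can be widely used for FDR estimation and quality control. The decoy library built by the method or device of the present invention is highly similar to the target library, which gives it stronger decoy capability and makes it suitable for FDR quality control of metabolome identification results that involve many isomers or highly similar metabolite structures. In addition, the similarity between the generated decoy library and the target library can be adjusted as needed to meet the FDR quality control requirements of different situations (high, medium or low similarity). Furthermore, the metabolome FDR identification method that uses the decoy library or target-decoy library obtained with the technical solution of the present invention has the following advantages: 1) FDR quality control can be applied to the identification results, using the target-decoy strategy; 2) metabolite spectra can be identified quickly and at high throughput; 3) the retention time restriction on the precursor ion is lifted during spectrum identification, which widens the matching range of the experimental spectra and improves spectrum utilization and the coverage of metabolite identification.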
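The decoy construction summarized here (steps S1 to S3 of claim 1, with the preferred parameters of claims 2 and 3) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation. The function name `build_decoy_library`, the dictionary layout of a spectrum, the m/z tolerance `mz_tol` used to decide that a fragment "equals" the precursor m/z, sampling from the pool with replacement, and the fallback to the spectrum's own peaks when the pool is empty are all assumptions that the text does not specify.

```python
import random


def build_decoy_library(target, h=0.775, k=0.5, max_shift=1.0, mz_tol=0.01, seed=0):
    """Build one decoy spectrum per target spectrum.

    target: list of {"precursor_mz": float, "peaks": [(mz, intensity), ...]}
    h: fraction of the target's own fragment signals copied into D.
    k: fraction of the signals in D whose m/z is randomly shifted.
    """
    rng = random.Random(seed)
    decoys = []
    for i, spec in enumerate(target):
        # S1: index the other spectra that contain a fragment at this precursor m/z.
        index = [
            j for j, other in enumerate(target)
            if j != i and any(abs(mz - spec["precursor_mz"]) <= mz_tol
                              for mz, _ in other["peaks"])
        ]
        # S2: pool the fragment signals of the indexed spectra (first signal repository).
        pool = [peak for j in index for peak in target[j]["peaks"]]
        n = len(spec["peaks"])
        # Copy a fraction h of the target's own fragment signals into D.
        d = rng.sample(spec["peaks"], round(h * n))
        # Fill D from the pool until it holds n signals.
        # Assumption: if the pool is empty, the spectrum's own peaks are reused.
        source = pool or spec["peaks"]
        d += [rng.choice(source) for _ in range(n - len(d))]
        # Shift the m/z of a fraction k of D by a random amount within +/- max_shift.
        # A random shift makes overlap with the target unlikely but does not
        # strictly guarantee it.
        for idx in rng.sample(range(n), round(k * n)):
            mz, intensity = d[idx]
            d[idx] = (mz + rng.uniform(-max_shift, max_shift), intensity)
        # S3: copy the precursor information of the target spectrum to the decoy.
        decoys.append({"precursor_mz": spec["precursor_mz"], "peaks": sorted(d)})
    return decoys


# Small made-up target library for illustration.
target = [
    {"precursor_mz": 100.0, "peaks": [(50.0, 1.0), (60.0, 0.5), (70.0, 0.2), (80.0, 0.1)]},
    {"precursor_mz": 200.0, "peaks": [(100.0, 1.0), (120.0, 0.3), (150.0, 0.6), (180.0, 0.2)]},
    {"precursor_mz": 300.0, "peaks": [(100.0, 0.4), (200.0, 1.0), (250.0, 0.7)]},
]
decoys = build_decoy_library(target)
```

Raising `h` or lowering `k` keeps more of the target's own signals and so produces decoys that are more similar to the target library, which is the tuning knob the description associates with stronger decoy capability.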

Claims (28)

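  1. A method for constructing a decoy library, characterized by comprising the following steps:
    S1: comparing the metabolite precursor ion mass-to-charge ratio M of each spectrum in a target database, one by one, with all other spectra in the target database; storing the spectra that contain a fragment ion whose mass-to-charge ratio equals M, and/or the serial numbers of those spectra, in a signal spectrum index array; and traversing all spectra in the target database to generate a two-dimensional signal spectrum index array;
    S2: selecting one signal spectrum index array from the two-dimensional signal spectrum index array; storing the fragment ion signals of every spectrum in the signal spectrum index array in a first signal repository; then randomly selecting a portion of the fragment ion signals from the corresponding spectrum in the target database and copying them into an array D; randomly selecting a number of fragment ion signals from the first signal repository to fill the array D, so that the number of fragment ion signals in the array D equals the number of fragment ion signals of the corresponding spectrum in the target database; then randomly selecting a portion of the signals in the array D and randomly changing their mass-to-charge ratios to avoid overlap with the mass-to-charge ratios of the corresponding spectrum in the target database; and traversing all elements of the two-dimensional signal spectrum index array to obtain n arrays D, the n arrays D forming a decoy library signal array; wherein n is a natural number, and "corresponding" means having the same serial number; and
    S3: copying, to the decoy library signal array, the precursor ion information of the spectrum in the target database that corresponds to each subset of the decoy library signal array, to form the decoy library.
  2. The method according to claim 1, characterized in that, in S2, the number of fragment ion signals randomly selected from the corresponding spectrum in the target database and copied into the array D accounts for a proportion h of all fragment ion signals of the corresponding spectrum in the target database, and h is from 0.6 to 0.9; preferably, h is 0.775.
  3. The method according to claim 1, characterized in that, in S2, randomly changing the mass-to-charge ratios comprises: adding or subtracting a mass-to-charge ratio of random magnitude, the perturbation being smaller than the precursor ion mass-to-charge ratio;
    preferably, adding or subtracting a mass-to-charge ratio of random magnitude comprises uniformly adding a mass-to-charge ratio of random magnitude, uniformly subtracting a mass-to-charge ratio of random magnitude, or randomly adding/subtracting a mass-to-charge ratio of random magnitude;
    preferably, the perturbation is ±1 Da;
    preferably, the selected portion of signals accounts for a proportion k of the total signals in the array D, k < 1, more preferably k = 0.5.
  4. The method according to claim 1, characterized in that, in S3, the precursor ion information of the spectrum in the target database comprises the retention time, mass-to-charge ratio and charge information of the precursor ion.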
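  5. A method for constructing a target-decoy library, characterized by comprising:
    selecting spectra to form a target database;
    constructing a decoy library; and
    merging the target database with the decoy library to obtain the target-decoy library, wherein the decoy library is constructed by the method for constructing a decoy library according to any one of claims 1 to 4.
  6. A method for metabolome FDR identification, characterized by comprising:
    converting raw mass spectrometry data into uniform spectrum data and reading it to obtain spectra to be identified;
    constructing a target-decoy library;
    matching the spectra to be identified against the target-decoy library; and
    sorting the matching results and performing FDR identification on the matching results;
    wherein the target-decoy library is constructed by the method for constructing a target-decoy library according to claim 5.
  7. The method according to claim 6, characterized in that the uniform spectrum data is a spectrum data file containing mass-to-charge ratio and peak intensity information;
    preferably, the spectrum data file containing mass-to-charge ratio and peak intensity information is further stored as a linked data list, and the spectrum information stored in the linked data list comprises the spectrum number, the precursor ion retention time, mass-to-charge ratio and charge information, and the mass-to-charge ratios of the fragment ions with the corresponding peak intensity information.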
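  8. The method according to claim 6, characterized in that matching the spectra to be identified against the target-decoy library comprises:
    comparing each of the spectra to be identified with each spectrum in the target-decoy library, and normalizing the fragment ion signal intensity values in each of the spectra to be identified;
    selecting one of the spectra to be identified and obtaining its precursor ion mass-to-charge ratio M, filtering out the serial numbers of all spectra in the target-decoy library whose precursor ion mass-to-charge ratio is M and storing them in a spectrum serial number index array, and traversing each of the spectra to be identified to obtain a two-dimensional spectrum serial number index array;
    storing the fragment ion signals of all spectra in the target-decoy library in a second signal repository, taking the second signal repository as the population distribution of signal peak intensities, selecting one spectrum to be identified, testing all fragment ion signals of the selected spectrum to be identified against the second signal repository as the population to obtain the weights of the spectrum signals, and traversing each of the spectra to be identified to obtain a weight array;
    scoring the match of the fragment ion signals of the spectrum to be identified on the basis of the fragment ion signals in a reference spectrum; and
    selecting one spectrum serial number index array, matching the spectrum to be identified against the spectra traversed in the selected spectrum serial number index array, taking the result with the highest matching score as the identification result of the spectrum to be identified, and traversing all elements of the two-dimensional spectrum serial number index array to obtain an identification result array of the spectra to be identified.
  9. The method according to claim 8, characterized in that the normalization comprises normalizing the fragment ion signal intensity values to the interval (0, 1);
    preferably, the normalization comprises dividing each fragment ion signal intensity value by the maximum fragment ion signal intensity value of the spectrum to which it belongs.
  10. The method according to claim 8, characterized in that the weights are obtained by the following steps: testing all fragment ion signals of the selected spectrum to be identified against the second signal repository as the population to obtain a statistic for each fragment ion signal of the spectrum to be identified, and taking the reciprocal of the statistic as the weight of the fragment ion signal;
    preferably, the test is the Grubbs test, the box plot test or the normal distribution test.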
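  11. The method according to claim 8, characterized in that scoring the match of the fragment ion signals of the spectrum to be identified on the basis of the fragment ion signals in the reference spectrum comprises:
    defining the fragment ion signals of the spectrum to be identified and the fragment ion signals of the reference spectrum as two arrays, respectively
    Figure PCTCN2020099769-appb-100001
    Figure PCTCN2020099769-appb-100002
    on the basis of the reference spectrum, comparing the signals of the spectrum to be identified with the signals of the reference spectrum; letting the total number of signals in the spectrum to be identified be total_e, of which the number that can be matched to signals in the reference spectrum is e, the experimental signal matching rate of this match is E = e/total_e; letting the total number of signals in the reference spectrum be total_t, of which the number that can be matched to signals in the query spectrum is e, the theoretical signal matching rate of this match is T = e/total_t; after signal matching is complete, a vector dot product algorithm is used to calculate the dot product sum of the fragment ion signals of the spectrum to be identified and the fragment ion signals of the reference spectrum, according to the following formula:
    Figure PCTCN2020099769-appb-100003
    wherein μ is a correction coefficient, being the reciprocal of the difference between the fragment ion signal of the spectrum to be identified and the fragment ion signal of the reference spectrum,
    Figure PCTCN2020099769-appb-100004
    are the spectrum fragment ion signal vectors, w is the weight of the fragment ion signal of the spectrum to be identified, T is the theoretical signal matching rate of this match, and E is the experimental signal matching rate of this match.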
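  12. The method according to claim 8, characterized in that sorting the matching results and performing FDR identification on the matching results comprises:
    sorting the identification result array of the spectra to be identified by matching score from high to low; letting target_score be the target database count and decoy_score be the decoy library count, target_score is incremented by 1 if an identification result is a target spectrum, and decoy_score is incremented by 1 if an identification result is a decoy spectrum;
    the FDR of the identification results is FDR = decoy_score/(target_score + decoy_score); an FDR threshold x is set, and when the traversal reaches a spectrum identification result sn for which FDR ≥ x, the valid identification results of the batch are M{s1, s2, s3, ..., s(n-1)};
    preferably, x is less than or equal to 0.2, more preferably less than or equal to 0.05, and still more preferably less than or equal to 0.01.
  13. A decoy library, characterized in that it is constructed by the method for constructing a decoy library according to any one of claims 1 to 4.
  14. A target-decoy library, characterized in that it is constructed by the method for constructing a target-decoy library according to claim 5.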
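  15. A device for constructing a decoy library, characterized by comprising:
    a two-dimensional signal spectrum index array generation module, configured to compare the metabolite precursor ion mass-to-charge ratio M of each spectrum in a target database, one by one, with all other spectra in the target database, store the spectra that contain a fragment ion whose mass-to-charge ratio equals M, and/or the serial numbers of those spectra, in a signal spectrum index array, and traverse all spectra in the target database to generate a two-dimensional signal spectrum index array;
    a decoy library signal array generation module, configured to select one signal spectrum index array from the two-dimensional signal spectrum index array; store the fragment ion signals of every spectrum in the signal spectrum index array in a first signal repository; then randomly select a portion of the fragment ion signals from the corresponding spectrum in the target database and copy them into an array D; randomly select a number of fragment ion signals from the first signal repository to fill the array D, so that the number of fragment ion signals in the array D equals the number of fragment ion signals of the corresponding spectrum in the target database; then randomly select a portion of the signals in the array D and randomly change their mass-to-charge ratios to avoid overlap with the mass-to-charge ratios of the corresponding spectrum in the target database; and traverse all elements of the two-dimensional signal spectrum index array to obtain n arrays D, the n arrays D forming a decoy library signal array; wherein n is a natural number, and "corresponding" means having the same serial number; and
    a decoy library generation module, configured to copy, to the decoy library signal array, the precursor ion information of the spectrum in the target database that corresponds to each subset of the decoy library signal array, to form the decoy library.
  16. The device according to claim 15, characterized in that, in the decoy library signal array generation module, the number of fragment ion signals randomly selected from the corresponding spectrum in the target database and copied into the array D accounts for a proportion h of all fragment ion signals of the corresponding spectrum in the target database, and h is from 0.6 to 0.9; preferably, h is 0.775.
  17. The device according to claim 15, characterized in that, in the decoy library signal array generation module, randomly changing the mass-to-charge ratios comprises: adding or subtracting a mass-to-charge ratio of random magnitude, the perturbation being smaller than the precursor ion mass-to-charge ratio;
    preferably, adding or subtracting a mass-to-charge ratio of random magnitude comprises uniformly adding a mass-to-charge ratio of random magnitude, uniformly subtracting a mass-to-charge ratio of random magnitude, or randomly adding/subtracting a mass-to-charge ratio of random magnitude;
    preferably, the perturbation is ±1 Da;
    preferably, the selected portion of signals accounts for a proportion k of the total signals in the array D, k < 1, more preferably k = 0.5.
  18. The device according to claim 15, characterized in that, in the decoy library generation module, the precursor ion information of the spectrum in the target database comprises the retention time, mass-to-charge ratio and charge information of the precursor ion.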
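  19. A device for constructing a target-decoy library, characterized by comprising:
    a target database generation module, configured to select spectra to form a target database;
    a decoy library construction module, configured to construct a decoy library; and
    a merging module, configured to merge the target database generated by the target database generation module with the decoy library constructed by the decoy library construction module to obtain a target-decoy library, wherein the decoy library construction module is the device for constructing a decoy library according to any one of claims 15 to 18.
  20. A device for metabolome FDR identification, characterized by comprising:
    a format unification module, configured to convert raw mass spectrometry data into uniform spectrum data and read it to obtain spectra to be identified;
    a target-decoy library construction module, configured to construct a target-decoy library;
    a matching module, configured to match the spectra to be identified obtained by the format unification module against the target-decoy library constructed by the target-decoy library construction module; and
    an FDR identification module, configured to sort the matching results of the matching module and perform FDR identification on the matching results;
    wherein the target-decoy library construction module is the device for constructing a target-decoy library according to claim 19.
  21. The device according to claim 20, characterized in that, in the format unification module, the uniform spectrum data is a spectrum data file containing mass-to-charge ratio and peak intensity information;
    preferably, the format unification module stores the spectrum data file containing mass-to-charge ratio and peak intensity information as a linked data list, and the spectrum information stored in the linked data list comprises the spectrum number, the precursor ion retention time, mass-to-charge ratio and charge information, and the mass-to-charge ratios of the fragment ions with the corresponding peak intensity information.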
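  22. The device according to claim 20, characterized in that the matching module comprises:
    a normalization submodule, configured to compare each of the spectra to be identified with each spectrum in the target-decoy library, and normalize the fragment ion signal intensity values in each of the spectra to be identified;
    a two-dimensional spectrum serial number index array generation submodule, configured to select one of the spectra to be identified and obtain its precursor ion mass-to-charge ratio M, filter out the serial numbers of all spectra in the target-decoy library whose precursor ion mass-to-charge ratio is M and store them in a spectrum serial number index array, and traverse each of the spectra to be identified to obtain a two-dimensional spectrum serial number index array;
    a weight array generation submodule, configured to store the fragment ion signals of all spectra in the target-decoy library in a second signal repository, take the second signal repository as the population distribution of signal peak intensities, select one spectrum to be identified, test all fragment ion signals of the selected spectrum to be identified against the second signal repository as the population to obtain the weights of the spectrum signals, and traverse each of the spectra to be identified to obtain a weight array;
    a scoring submodule, configured to score the match of the fragment ion signals of the spectrum to be identified on the basis of the fragment ion signals in a reference spectrum; and
    an identification result array generation module, configured to select one spectrum serial number index array, match the spectrum to be identified against the spectra traversed in the selected spectrum serial number index array, take the result with the highest matching score as the identification result of the spectrum to be identified, and traverse all elements of the two-dimensional spectrum serial number index array to obtain an identification result array of the spectra to be identified.
  23. The device according to claim 22, characterized in that the normalization submodule is configured to normalize the fragment ion signal intensity values to the interval (0, 1);
    preferably, the normalization comprises dividing each fragment ion signal intensity value by the maximum fragment ion signal intensity value of the spectrum to which it belongs.
  24. The device according to claim 22, characterized in that the weight array generation submodule is configured to test all fragment ion signals of the selected spectrum to be identified against the second signal repository as the population to obtain a statistic for each fragment ion signal of the spectrum to be identified, and take the reciprocal of the statistic as the weight of the fragment ion signal;
    preferably, the test is the Grubbs test, the box plot test or the normal distribution test.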
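  25. The device according to claim 22, characterized in that the scoring submodule is configured to define the fragment ion signals of the spectrum to be identified and the fragment ion signals of the reference spectrum as two arrays, respectively
    Figure PCTCN2020099769-appb-100005
    Figure PCTCN2020099769-appb-100006
    on the basis of the reference spectrum, compare the signals of the spectrum to be identified with the signals of the reference spectrum; letting the total number of signals in the spectrum to be identified be total_e, of which the number that can be matched to signals in the reference spectrum is e, the experimental signal matching rate of this match is E = e/total_e; letting the total number of signals in the reference spectrum be total_t, of which the number that can be matched to signals in the query spectrum is e, the theoretical signal matching rate of this match is T = e/total_t; after signal matching is complete, a vector dot product algorithm is used to calculate the dot product sum of the fragment ion signals of the spectrum to be identified and the fragment ion signals of the reference spectrum, according to the following formula:
    Figure PCTCN2020099769-appb-100007
    wherein μ is a correction coefficient, being the reciprocal of the difference between the fragment ion signal of the spectrum to be identified and the fragment ion signal of the reference spectrum,
    Figure PCTCN2020099769-appb-100008
    are the spectrum fragment ion signal vectors, w is the weight of the fragment ion signal of the spectrum to be identified, T is the theoretical signal matching rate of this match, and E is the experimental signal matching rate of this match.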
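  26. The device according to claim 22, characterized in that the FDR identification module is configured to execute the following instructions:
    sorting the identification result array of the spectra to be identified by matching score from high to low; letting target_score be the target database count and decoy_score be the decoy library count, target_score is incremented by 1 if an identification result is a target spectrum, and decoy_score is incremented by 1 if an identification result is a decoy spectrum;
    the FDR of the identification results is FDR = decoy_score/(target_score + decoy_score); an FDR threshold x is set, and when the traversal reaches a spectrum identification result sn for which FDR ≥ x, the valid identification results of the batch are M{s1, s2, s3, ..., s(n-1)};
    preferably, x is less than or equal to 0.2, more preferably less than or equal to 0.05, and still more preferably less than or equal to 0.01.
  27. A storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 12.
  28. An electronic device, comprising a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 12.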
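
The normalization, weighting and matching-rate steps recited in claims 9 to 11 (and mirrored in claims 23 to 25) can be illustrated with the short Python sketch below. It is a hedged illustration, not the claimed implementation. All function names are hypothetical; the Grubbs-type statistic G = |x - mean| / s is used as one of the three tests the claims name; the small constant `eps` that avoids division by zero and the m/z matching tolerance `tol` are assumptions. The final scoring formula is available only as a figure in the source, so it is not reproduced here.

```python
import statistics


def normalize(intensities):
    """Divide each intensity by the largest intensity in the same spectrum."""
    top = max(intensities)
    return [x / top for x in intensities]


def grubbs_weights(query_intensities, population_intensities, eps=1e-9):
    """Weight of each query signal = reciprocal of its Grubbs-type statistic.

    The population is the pooled fragment intensities of the target-decoy
    library (the second signal repository). eps is an assumed guard against
    a statistic of exactly zero.
    """
    mean = statistics.fmean(population_intensities)
    sd = statistics.stdev(population_intensities)
    return [1.0 / (abs(x - mean) / sd + eps) for x in query_intensities]


def matching_rates(query_mz, reference_mz, tol=0.01):
    """Count matched signals e and return (e, E, T).

    E = e / total_e is the experimental signal matching rate and
    T = e / total_t is the theoretical signal matching rate.
    Each reference signal is paired with the closest unused query signal
    within tol (an assumed matching rule).
    """
    unused = sorted(query_mz)
    e = 0
    for ref in sorted(reference_mz):
        best = min(unused, key=lambda q: abs(q - ref), default=None)
        if best is not None and abs(best - ref) <= tol:
            unused.remove(best)
            e += 1
    E = e / len(query_mz) if query_mz else 0.0
    T = e / len(reference_mz) if reference_mz else 0.0
    return e, E, T


# Made-up values for illustration.
normalized = normalize([2.0, 4.0, 1.0])
weights = grubbs_weights([1.0, 0.3], [0.1, 0.2, 0.3, 0.4, 1.0])
e, E, T = matching_rates([100.0, 150.0, 200.0, 250.0], [100.005, 200.0, 300.0])
```

Because the weight is the reciprocal of the statistic, a signal whose intensity lies far from the population mean receives a smaller weight than one close to the mean. Dividing by the maximum maps the strongest peak to exactly 1, so this particular normalization yields values in (0, 1] rather than strictly inside (0, 1).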
PCT/CN2020/099769 2019-07-05 2020-07-01 Method and device for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification WO2021004355A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910606569.5A CN111883214B (zh) 2019-07-05 2019-07-05 Method and device for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification
CN201910606569.5 2019-07-05

Publications (1)

Publication Number Publication Date
WO2021004355A1 true WO2021004355A1 (zh) 2021-01-14

Family

ID=73154283

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/099769 WO2021004355A1 (zh) 2019-07-05 2020-07-01 Method and device for constructing a decoy library, constructing a target-decoy library, and metabolome FDR identification

Country Status (2)

Country Link
CN (1) CN111883214B (zh)
WO (1) WO2021004355A1 (zh)

Also Published As

Publication number Publication date
CN111883214B (zh) 2023-06-16
CN111883214A (zh) 2020-11-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20836538

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20/06/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20836538

Country of ref document: EP

Kind code of ref document: A1