CN111858570A - CCS data standardization method, database construction method and database system - Google Patents


Publication number
CN111858570A
Authority
CN
China
Prior art keywords
ccs
data
compound
ion mobility
mass spectrometer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010642071.7A
Other languages
Chinese (zh)
Inventor
朱正江
周智伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Organic Chemistry of CAS
Original Assignee
Shanghai Institute of Organic Chemistry of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Organic Chemistry of CAS
Priority to CN202010642071.7A
Publication of CN111858570A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Abstract

The invention discloses a CCS data standardization method, a database construction method and a database system. The CCS data standardization method comprises the following steps: performing quality inspection on the collected CCS data, removing abnormal values, calculating a unified CCS value, and assigning a confidence level. The database system comprises a standardization module, a prediction module and a database. The standardization module standardizes the collected, experimentally measured CCS data. The prediction model of the prediction module is trained on data processed by the standardization module and predicts the CCS of a compound from the input structural information of the compound to be predicted. The database contains the CCS data processed by the standardization module and the CCS data predicted by the prediction module. The invention provides users with a CCS data source with high coverage and high confidence.

Description

CCS data standardization method, database construction method and database system
Technical Field
The invention belongs to the technical field of databases, and relates to a data processing method, in particular to a data standardization and database construction method and a database system.
Background
The goal of non-targeted metabolomics is to comprehensively measure as many metabolites as possible in a complex system and to identify the key metabolites associated with phenotypic perturbations. Because living systems are highly complex, the metabolites produced by life processes are numerous, structurally complex, rich in isomers, and distributed over a wide concentration range. Liquid chromatography-mass spectrometry (LC-MS) is currently the main analytical method in non-targeted metabolomics, and metabolite identification remains its major bottleneck. The standard identification strategy is to match the experimentally measured primary mass spectra and tandem mass spectra (MS/MS or MS2) of a biological sample against standard spectral libraries (such as METLIN, MassBank and NIST) or against in silico predicted MS/MS spectra. However, standard spectral libraries have limited coverage, and in silico predictions lack accuracy. Other bioinformatics methods (e.g., GNPS, MetDNA) also use MS2 spectra together with molecular network algorithms for metabolite annotation. All of these strategies require high-quality experimental MS2 spectra. However, the MS2 spectra of low-molecular-weight metabolites are very sparse and often lack the characteristic fragment ions needed for reliable identification, and some metabolite isomers have highly similar MS2 spectra. Furthermore, many experimental factors, such as the complexity of the biological matrix, low concentrations and co-elution of isomeric metabolites, make high-quality MS2 spectra difficult to acquire. These problems lead to low coverage and high error rates in metabolite annotation, so additional physicochemical properties are needed for metabolite annotation.
In recent years, ion mobility-mass spectrometry (IM-MS) has become a promising technology in non-targeted metabolomics research because it provides multidimensional separation and high selectivity. IM-MS rapidly separates ions with subtle structural differences on the basis of their differing ion mobilities, and the ion mobility can be further characterized by the collision cross section (CCS) of the metabolite ion. IM-MS can therefore distinguish the metabolite isomers that are common in biological samples. Unlike retention time (RT) and MS2 spectra, which are susceptible to many experimental factors, CCS is highly reproducible across instruments and laboratories. CCS is also a distinctive physicochemical property that can be used to improve the accuracy of metabolite annotation. The construction of a CCS database is therefore of great significance for large-scale metabolite identification. In addition, a CCS database makes it possible to integrate IM-MS with existing LC-MS/MS (liquid chromatography-tandem mass spectrometry) methods, so that four-dimensional metabolomics data comprising MS1 (primary mass spectrum), RT (retention time), CCS (collision cross section) and MS/MS (tandem mass spectrum) can be acquired in a single analysis, enabling multi-dimensional metabolite identification.
Existing CCS databases are built in one of two ways: experimental measurement of standards, or theoretical calculation.
Experimental measurement requires purchasing the corresponding standard and measuring its collision cross section on an IM-MS instrument. The coverage of a database built this way is limited by the number of standards that can be purchased; standards for many compounds are unavailable or expensive. Moreover, different instrument platforms exist on the market, so experimentally measured values carry systematic errors and are affected by experimental conditions and operators. In addition, reported collision cross section data are scattered across journals and publication dates and are difficult to gather, and suitable methods for cross-validating the data are currently lacking. These deficiencies prevent experimental measurement of standards from yielding a high-coverage, high-confidence database.
In theoretical calculation, the collision cross section of a metabolite is computed with a computational chemistry tool (such as MOBCAL) using an ion-gas molecule collision model. The biggest limitation of this approach is its large calculation error: the calculated values deviate from experimental standard values by 3-30%. In addition, the approach requires researchers with a strong computational background and powerful computing resources. Even so, calculating the collision cross section of a single molecule often takes days or weeks. The time, manpower and cost required by theoretical calculation are therefore high, which limits its use for building a high-coverage, high-confidence collision cross section database.
In addition, the existing databases built by these two methods are scattered across different journals or databases, and no platform stores and manages the data in a unified way, which greatly hinders querying and using the data. Collision cross section data obtained by different methods may be inconsistent, the data are presented in inconsistent formats, and no indication of data confidence is given.
Disclosure of Invention
The invention aims to provide a CCS data standardization method, a database construction method and a database system, which provide a CCS data acquisition source with high coverage and high reliability for users.
A method of normalizing CCS data, comprising:
supplementing the CCS data of each experimental measurement collected from an ion mobility mass spectrometer platform with basic information related to the compound structure corresponding to the CCS data;
performing quality inspection and processing on the CCS data after the basic information is supplemented;
removing abnormal values of the CCS data subjected to quality inspection and processing;
for a compound having CCS data from multiple ion mobility mass spectrometer instrument platforms, calculating a unified CCS for the compound.
The quality inspection and processing of the CCS data comprises one or more of the following operations:
deleting CCS data for compounds having a chemical formula and/or adduct form and/or mass-to-charge ratio outside of specified ranges;
deleting CCS data from the same ion mobility mass spectrometer platform but with different CCSs;
calculating a maximum difference between the plurality of CCSs for an adduct ion having a plurality of CCSs; deleting CCS data for the adduct ion if the maximum difference is greater than a set threshold; otherwise, calculating an average of the plurality of CCSs as the CCS of the adduct ion;
wherein the maximum difference is a ratio between a difference between a maximum CCS and a minimum CCS and an average of the CCSs.
The removal of abnormal values from the CCS data after quality inspection and processing comprises:
fitting a trend line for each compound class to the quality-inspected and processed CCS data using a power function, and calculating a confidence interval; and deleting CCS data lying outside the set confidence interval.
Calculating a unified CCS for a compound having CCS data from multiple ion mobility mass spectrometer instrument platforms comprises: averaging the CCS values of the compound from all ion mobility mass spectrometer instrument platforms to obtain the unified CCS of the compound.
The normalization method further includes assigning a confidence level to each CCS data as follows:
for a unified CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer platforms, if the maximum CCS difference is smaller than a first set difference threshold and the instrument type of all of these platforms is DTIM-MS, the confidence of the unified CCS is the first level;
for a unified CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer platforms, if the maximum CCS difference is smaller than a second set difference threshold, the confidence of the unified CCS is the second level, with no restriction on the instrument types of the platforms;
CCS data collected from only one ion mobility mass spectrometer platform are assigned the third level of confidence, with no restriction on the instrument type of the platform;
for a unified CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer platforms, if the maximum CCS difference is greater than the second set difference threshold, the confidence of the unified CCS is marked as a conflict, with no restriction on the instrument types of the platforms;
the confidence decreases in the order: first level, second level, third level, conflict.
A construction method of a CCS database comprises the following steps:
collecting CCS data of experimental measurement from an ion mobility mass spectrometer platform;
carrying out standardization processing on the collected CCS data by adopting the standardization method of the CCS data, storing the processed CCS data, and generating a training data set;
training a machine learning-based prediction model based on the training data set so that the trained prediction model can predict and output a series of CCS values of the compound based on the input structural information of the compound to be predicted;
and predicting the CCS value of the compound by the prediction model based on the structural information of the compound to be predicted, which is input by a user, and storing and/or returning the predicted CCS.
A CCS database system, comprising:
the standardization module is used for carrying out standardization processing on CCS data of experimental measurement collected from an ion mobility mass spectrometer platform according to the standardization method;
the prediction module comprises a machine learning-based prediction model and is used for predicting the CCS value of the compound according to the structural information of the compound to be predicted, which is input by a user; the prediction model is obtained through training, and a training data set is obtained through data processed by the standardization module;
the database comprises the experimentally measured CCS data processed by the standardization module and the CCS data predicted by the prediction module.
The structural information of the compound comprises a plurality of molecular descriptors selected in the following manner: based on the training data set, selecting a plurality of molecular descriptors that contribute most to the predicted CCS value by a recursive feature elimination cross-validation method.
When the machine learning-based prediction model is trained, the accuracy of the predicted CCS data is evaluated as follows: the structural similarity between the predicted compound and every compound in the training data set is calculated to obtain structural similarity scores; on the premise that the higher the structural similarity score, the closer the CCS, the average of the N highest structural similarity scores is calculated as a parameter for evaluating the accuracy of the CCS predicted by the model, a larger value of the parameter indicating a more accurate predicted CCS; wherein N is a positive integer and is a set value.
The CCS database system also comprises a compound card establishing module, which is used for establishing the compound card for the CCS data stored in the database by adopting the following format: the compound card takes the structure of a compound as a unique identifier, and generates and displays the following information for each compound: CCS data for the compound initially gathered, basic information for the compound, uniform CCS for the compound, experimentally measured CCS data for the compound, predicted CCS for the compound, linkage of the compound in other databases.
Due to the adoption of the technical scheme, the invention has the following beneficial effects:
A brand-new data cleaning and standardization method is provided, making the collected data more reliable. It addresses the systematic errors that may exist between CCS data from different sources because of the different types of instrument platforms on the market, and enables a standardized database of experimentally measured CCS to be built.
In the prior art, the formats of the CCS data obtained by different modes and different sources are not uniform, so that obstacles are brought to data retrieval. The invention sets uniform serial numbers and formats for CCS data of all substances in the database, so that target data can be quickly and accurately found through retrieval of related fields.
A brand-new CCS prediction model based on machine learning is disclosed, and high-quality and high-coverage experimental measurement CCS data are used as a training data set to obtain an optimized CCS prediction model based on machine learning (based on a support vector machine algorithm), so that collision cross-sectional areas of a large number of small molecules can be rapidly predicted through the model, and a high-coverage and high-accuracy prediction CCS database is obtained.
In the prior art, there is no way to estimate the prediction accuracy of the predicted collision cross section of an individual molecule. Previous methods can only evaluate the overall performance of a model on a known external validation data set: small molecules with known experimentally measured CCS are input into the prediction model, and the error between the predicted and experimental values is used to estimate the approximate performance of the model. This approach cannot give an accuracy estimate for each compound, especially for compounds lacking experimental measurements. The invention obtains a structural similarity score by comparing the structure of the predicted molecule with all small-molecule compounds in the training data set and, on the premise that structurally similar molecules have relatively close collision cross sections, systematically estimates the accuracy of each predicted collision cross section.
At present, no unified database integrates all experimental measurements and the collision cross-sectional area obtained by prediction calculation. The invention integrates the collision cross-sectional areas obtained by the two methods by utilizing a uniform platform, displays all information of the compound in a compound card mode, numbers all the compounds uniformly and is convenient for data query.
To date, the invention has collected 14 data sets of experimentally measured collision cross sections, forming the largest experimentally measured CCS database currently known. A high-accuracy prediction model built on this experimental CCS database has predicted 1,670,596 compounds from 7 mainstream databases, namely KEGG, HMDB, LMSD, MINE, DrugBank, DSSTox and UNPD, for a total of 11,697,711 collision cross section records. This is the most comprehensive collision cross section database currently available. Finally, the database is suitable for, but not limited to, metabolites, drug molecules, natural products and pesticide molecules, and thus overcomes the small coverage and limited application range of conventional CCS databases.
Drawings
FIG. 1 is a flowchart of a method for standardizing CCS data according to a first embodiment of the present invention;
FIG. 2 is a power function fitting curve of lipids and lipid compounds according to one embodiment of the present invention;
FIG. 3 is a block diagram of a CCS database system according to a third embodiment of the present invention.
Detailed Description
The invention will be further described with reference to examples of embodiments shown in the drawings.
Example one
Because the CCS data obtained by different methods and different sources are not uniform in format, and there may be conflicts and errors between data, the data needs to be cleaned and standardized. The invention provides a method for standardizing CCS data, and FIG. 1 is a flow chart of the method for standardizing CCS data. The method comprises the following steps:
(1) complete information collection
For each experimentally measured CCS record collected from an ion mobility mass spectrometer platform, the complete information (meta information, i.e., basic information related to the compound structure) is obtained as follows: a unified database number is assigned to the collected CCS data according to the corresponding compound; structural representations of the compound, such as SMILES, InChI and InChIKey, are obtained with the CTS tool and the R package rinchi; the molecular formula and exact mass of the compound are calculated with the R package rcdk; and classification information for the compound is obtained with the ClassyFire software.
(2) Checking data quality
The quality inspection and processing of the collected CCS data specifically comprises the following steps:
The following low-quality CCS data are deleted first: data with no corresponding chemical formula, data whose adduct form is not within the specified range, and data whose m/z (mass-to-charge ratio) error exceeds 10 ppm.
CCS data from the same ion mobility mass spectrometer platform but with different CCS are deleted.
For adduct ions with multiple CCS values, the maximum difference between the values is calculated first. If the maximum difference is greater than a set threshold (0.5% in this example), all CCS data associated with the adduct ion are deleted; otherwise, the average of the CCS values is taken as the final CCS of the compound in that adduct form. Here the maximum difference is the ratio of the difference between the maximum and minimum CCS to the average CCS.
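The two numeric checks in this step can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the thresholds are the 10 ppm and 0.5% values stated for this example, and the CCS values in the usage note are made up.

```python
from statistics import mean

PPM_TOLERANCE = 10.0   # m/z error limit stated in this example
MAX_REL_DIFF = 0.005   # 0.5 % threshold stated in this example


def ppm_error(measured_mz, theoretical_mz):
    """Absolute m/z error in parts per million."""
    return abs(measured_mz - theoretical_mz) / theoretical_mz * 1e6


def max_relative_difference(ccs_values):
    """(max - min) / mean, the 'maximum difference' defined in the text."""
    return (max(ccs_values) - min(ccs_values)) / mean(ccs_values)


def merge_adduct_ccs(ccs_values, threshold=MAX_REL_DIFF):
    """Return the averaged CCS, or None when the values disagree too much."""
    if max_relative_difference(ccs_values) > threshold:
        return None  # all CCS data for this adduct ion would be deleted
    return mean(ccs_values)
```

For instance, `merge_adduct_ccs([200.0, 200.4, 200.8])` keeps the ion and returns the mean, whereas `merge_adduct_ccs([200.0, 203.0])` returns `None` because the spread is about 1.5% of the mean.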
(3) Removing outliers
Abnormal values are removed from the CCS data that have passed quality inspection and processing as follows: a trend line is fitted to the CCS data of each compound class using a power function, and a confidence interval is calculated; CCS data lying outside the set confidence interval are deleted.
The following is a specific example of the present embodiment, including the steps of:
(a) The CCS values are grouped by compound class, and the number n of CCS values in each class is counted. For compound classes with n ≥ 10, the fitting of step (b) is performed; compound classes with n < 10 are left unprocessed;
(b) for each class of CCS data, a power function of the mass-to-charge ratio (m/z) is fitted, $\mathrm{CCS} = a \times (m/z)^{b}$, where a and b are fitting coefficients, m is the mass of the corresponding compound and z is its charge, to obtain the trend line of the CCS for that class;
(c) the 99% confidence interval of the trend line is calculated, and CCS values lying outside this 99% interval are regarded as outliers (the abnormal values of this example). For example, in FIG. 2, methyl behenate lies outside the 99% confidence interval of the trend line for lipids and lipid compounds and is therefore judged to be an outlier and removed. In FIG. 2, the region between the two curves is the 99% confidence interval.
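Steps (a) to (c) can be sketched as below. The source does not say how the fit or the interval is computed, so this sketch makes two loudly stated assumptions: the power function is fitted by ordinary least squares on log-transformed data, and the 99% band is approximated as a constant-width band of 2.576 residual standard deviations in log space. The function names are hypothetical.

```python
import math
from statistics import mean

Z_99 = 2.576  # two-sided 99 % normal quantile (assumes roughly normal residuals)


def fit_power_law(mz, ccs):
    """Fit CCS = a * (m/z)**b by least squares on log-transformed data."""
    x = [math.log(v) for v in mz]
    y = [math.log(v) for v in ccs]
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


def flag_outliers(mz, ccs, min_points=10):
    """Return one boolean per point; True marks a CCS outside the 99 % band."""
    if len(ccs) < min_points:
        return [False] * len(ccs)  # classes with n < 10 are left untouched
    a, b = fit_power_law(mz, ccs)
    resid = [math.log(c) - math.log(a * m ** b) for m, c in zip(mz, ccs)]
    # Residual standard deviation, with two fitted parameters removed.
    s = math.sqrt(sum(r * r for r in resid) / (len(resid) - 2))
    return [abs(r) > Z_99 * s for r in resid]
```

A production implementation would more likely use a nonlinear fit and a proper prediction interval that widens away from the centre of the data; the simplified band is used here only to keep the example dependency-free.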
(4) Calculating a unified CCS value
Calculating a uniform CCS for the same compound having CCS data from multiple ion mobility mass spectrometer instrument platforms, comprising: the CCS of the compound from all ion mobility mass spectrometer instrument platforms were averaged to obtain a uniform CCS for the compound.
For an adduct ion, if it has multiple CCSs obtained from DTIM-MS (name of ion mobility Mass Spectrometry Instrument type), the uniform CCS is the average of the CCSs in DTIM-MS. Otherwise, all CCS from different instrument platforms will be averaged to compute a uniform CCS. In this example, 3539 unified CCS were generated in total.
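The averaging rule for one adduct ion might be sketched as follows. The `(instrument_type, ccs)` tuple layout and the function name are assumptions made for illustration, and the values in the usage note are invented.

```python
from statistics import mean


def unified_ccs(measurements):
    """measurements: list of (instrument_type, ccs) pairs for one adduct ion."""
    dtim = [ccs for kind, ccs in measurements if kind == "DTIM-MS"]
    # Prefer DTIM-MS when several DTIM-MS values exist; otherwise average all.
    pool = dtim if len(dtim) > 1 else [ccs for _, ccs in measurements]
    return mean(pool)
```

With two DTIM-MS values of 200.0 and 202.0 plus a TWIM-MS value of 210.0, only the DTIM-MS values are averaged; with a single DTIM-MS value and a single TWIM-MS value, both are averaged.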
(5) Assigning confidence levels
For each CCS data, a confidence is assigned as follows:
for a unified CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer instrument platforms, if the maximum CCS difference is smaller than a first set difference threshold (1% in this embodiment) and the instrument type of all of these platforms is DTIM-MS, the confidence of the unified CCS is the first level (Level 1);
for a unified CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer instrument platforms, if the maximum CCS difference is smaller than a second set difference threshold (3% in this example), the confidence of the unified CCS is the second level (Level 2), with no restriction on the instrument types of the platforms (e.g., DTIM-MS, TWIM-MS or TIMS-MS);
CCS data collected from only one ion mobility mass spectrometer instrument platform are assigned the third level (Level 3) of confidence, with no restriction on the instrument type of the platform (e.g., DTIM-MS, TWIM-MS or TIMS-MS);
for a unified CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer instrument platforms, if the maximum CCS difference is greater than the second set difference threshold, the confidence of the unified CCS is marked as a conflict, with no restriction on the instrument types of the platforms (e.g., DTIM-MS, TWIM-MS or TIMS-MS);
The confidence decreases in the order: first level, second level, third level, conflict.
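The level assignment above can be sketched as a small decision function. Several points are assumptions rather than statements of the source: the "maximum CCS difference" is taken to be the same relative measure (max minus min, divided by the mean) used in the quality check, a difference exactly equal to the second threshold is treated as a conflict, and the `(platform_id, instrument_type, ccs)` layout and function name are hypothetical.

```python
from statistics import mean

LEVEL1_THRESHOLD = 0.01  # 1 % in this embodiment
LEVEL2_THRESHOLD = 0.03  # 3 % in this embodiment


def assign_confidence(measurements):
    """measurements: list of (platform_id, instrument_type, ccs) for one ion."""
    platforms = {pid for pid, _, _ in measurements}
    if len(platforms) == 1:
        return "Level 3"  # data from a single platform only
    values = [ccs for _, _, ccs in measurements]
    rel_diff = (max(values) - min(values)) / mean(values)
    all_dtim = all(kind == "DTIM-MS" for _, kind, _ in measurements)
    if all_dtim and rel_diff < LEVEL1_THRESHOLD:
        return "Level 1"
    if rel_diff < LEVEL2_THRESHOLD:
        return "Level 2"
    return "Conflict"
```

The order of the checks matters: the single-platform case is resolved first, so the thresholds are only ever applied to data that come from at least two platforms.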
Example two
The embodiment discloses a method for constructing a CCS database, which comprises the following steps:
experimentally measured CCS data were collected from all ion mobility mass spectrometer instrument platforms.
The collected CCS data are standardized using the standardization method described above, the processed CCS data are stored, and a training data set is generated (e.g., by supplementing each record with a plurality of molecular descriptors). To address the problem that experimentally measured CCS databases are scattered and have limited coverage, this embodiment has so far collected 14 experimentally measured CCS data sets. These data sets come from 4 laboratories and 2 instrument platforms, yielding a total of 3539 unified CCS values.
Based on the training data set, a machine learning-based prediction model is trained, so that the trained model can predict and output a series of CCS values of a compound in different adduct forms from the input structural information of the compound to be predicted (such as a plurality of molecular descriptors of the compound). The molecular descriptors are calculated by inputting the compound structure into the rcdk package of the R software, and are selected as follows: based on all the standardized CCS data (dependent variable) and the molecular descriptors (independent variables), the molecular descriptors that contribute most to the predicted CCS are selected by recursive feature elimination with cross-validation (RFECV). In this example, 15 molecular descriptors in positive ion mode and 9 in negative ion mode are selected as inputs to the prediction model.
When the prediction model is trained, the accuracy of the CCS values it predicts is evaluated as follows: the structural similarity between the predicted compound and every compound in the training data set is calculated to obtain structural similarity scores; on the premise that the higher the structural similarity score, the closer the CCS values, the average of the N highest similarity scores is calculated to give a parameter called Representative Structural Similarity (RSS), which is used to evaluate the accuracy of the CCS value predicted by the model. Specifically, structural similarity can be defined on a range from 0 to 1, with 1 representing identical molecular structures and 0 representing completely different molecular structures.
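The RSS calculation can be sketched as follows. The source does not name the similarity measure, so Tanimoto similarity over fingerprints represented as sets of on-bits is an assumption chosen because it naturally ranges from 0 to 1; the default `n=5` and the function names are likewise hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def representative_structural_similarity(query_fp, training_fps, n=5):
    """Mean of the n highest similarity scores against the training set."""
    scores = sorted((tanimoto(query_fp, fp) for fp in training_fps), reverse=True)
    top = scores[:n]
    if not top:
        raise ValueError("training set is empty")
    return sum(top) / len(top)
```

A query whose RSS is close to 1 has near neighbours in the training set, so its predicted CCS would be expected to be more accurate than that of a query whose RSS is close to 0.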
And predicting the CCS value of the compound by the prediction model based on a plurality of molecular descriptors of the compound to be predicted, which are input by a user, and storing and/or returning the predicted CCS data.
The experimentally measured CCS data and the CCS data predicted by the prediction model together constitute the data of the database.
Example three
This embodiment discloses a database system, and fig. 3 is a schematic structural diagram of the database system.
The database system includes:
the standardization module is used for carrying out standardization processing on CCS data of experimental measurement collected from an ion mobility mass spectrometer platform according to the standardization method of the CCS data;
The prediction module comprises a prediction model based on machine learning and is used for predicting the CCS value of the compound to be predicted according to the structural information of the compound input by a user and storing and/or returning the predicted CCS data; the prediction model is obtained through training, and a training data set is obtained through CCS data processed by a standardization module;
the database comprises CCS data of experimental measurement processed by the standardization module and CCS data predicted by the prediction module;
the compound card establishing module is used for establishing a compound card for the CCS data stored in the database in the following format: the compound card takes the structure of a compound as a unique identifier, and generates and displays the following information for each compound: basic information for the compound, the unified CCS value for the compound, the experimentally measured CCS values for the compound, the predicted CCS value for the compound, and links to the compound in other databases. In this example, the basic information of the compound includes the name of the compound, molecular formula, exact mass, different types of structural representations, classification of the compound, and the like. Items that are missing for a compound are left blank on its compound card; for example, for a compound that has only predicted CCS data, the experimentally measured CCS values and the links in other databases are shown as blank.
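One possible in-memory shape for such a compound card is sketched below. The field names, the use of an InChIKey as the structure-derived identifier, and the `render` helper are all assumptions for illustration; the source only lists the categories of information a card displays.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompoundCard:
    inchikey: str                       # structure-derived identifier (assumed)
    name: Optional[str] = None
    formula: Optional[str] = None
    exact_mass: Optional[float] = None
    compound_class: Optional[str] = None
    unified_ccs: Dict[str, float] = field(default_factory=dict)            # adduct -> CCS
    experimental_ccs: Dict[str, List[float]] = field(default_factory=dict)
    predicted_ccs: Dict[str, float] = field(default_factory=dict)
    external_links: Dict[str, str] = field(default_factory=dict)           # database -> link

    def render(self, item):
        """Return the display text for one item; missing items show as blank."""
        value = getattr(self, item)
        if value is None or value == {}:
            return ""
        return str(value)
```

A card for a compound with only predicted data, such as `CompoundCard(inchikey="HYPOTHETICAL-KEY", predicted_ccs={"[M+H]+": 150.2})` (made-up values), renders its experimental CCS and external links as empty strings.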
In this embodiment, the structural information of the compound includes a plurality of molecular descriptors, which are selected as follows: based on the training data set, the molecular descriptors that contribute most to the predicted CCS value are selected by recursive feature elimination with cross-validation (RFECV). In this embodiment, 15 molecular descriptors are selected in this way for the positive ion mode and 9 for the negative ion mode. The molecular descriptors are calculated by inputting the compound structure into the rcdk package for R.
When the machine-learning-based prediction model is trained, the accuracy of the predicted CCS data is evaluated as follows:
the structural similarity between the compound to be predicted and every compound in the training data set is calculated to obtain structural similarity scores; based on the rule that the higher the structural similarity score, the more similar the CCS, the average of the N highest structural similarity scores is calculated and used as a parameter for evaluating the accuracy of the CCS given by the prediction model, where a larger value of the parameter indicates a more accurate predicted CCS; N is a positive integer and is a set value.
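The accuracy parameter described above can be sketched as follows. This is a hedged example rather than the method of this embodiment: the text does not specify which similarity measure is used, so the sketch assumes Tanimoto similarity over fingerprints represented as sets of bit positions, and the function names `tanimoto` and `top_n_mean_similarity` are hypothetical.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def top_n_mean_similarity(query: Set[int], training: Iterable[Set[int]], n: int) -> float:
    """Average of the N highest similarity scores between the query and the training set."""
    if n < 1:
        raise ValueError("N must be a positive integer")
    # Score the query against every training compound, highest first.
    scores = sorted((tanimoto(query, fp) for fp in training), reverse=True)
    top = scores[:n]
    if not top:
        raise ValueError("training set is empty")
    return sum(top) / len(top)
```

For example, with a query fingerprint `{1, 2, 3, 4}`, training fingerprints `{1, 2, 3, 4}`, `{1, 2}` and `{5, 6}`, and N = 2, the two highest scores are 1.0 and 0.5, so the parameter is 0.75.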
In addition, if the standardization module has already assigned confidence levels to the experimentally measured CCS data according to the CCS data standardization method of the present invention, the CCS database system of this embodiment further includes a predicted-value confidence assignment module, configured to label the CCS values predicted by the prediction module with a fourth confidence level (Level 4); the fourth level is the lowest confidence level.
The embodiments described above are presented to enable those skilled in the art to make and use the invention. It will be readily apparent to those skilled in the art that various modifications may be made to these embodiments, and that the generic principles described herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the embodiments described herein, and improvements and modifications made by those skilled in the art based on the disclosure of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for standardizing CCS data, comprising:
supplementing each item of experimentally measured CCS data collected from ion mobility mass spectrometer platforms with basic information on the compound structure corresponding to the CCS data;
performing quality inspection and processing on the CCS data supplemented with the basic information;
removing outliers from the CCS data that have undergone quality inspection and processing;
for a compound having CCS data from multiple ion mobility mass spectrometer platforms, calculating a uniform CCS for the compound.
2. The method for standardizing CCS data according to claim 1, wherein:
the quality inspection and processing of the collected CCS data comprises one or more of the following operations:
deleting CCS data for compounds whose chemical formula and/or adduct form and/or mass-to-charge ratio is outside a specified range;
deleting CCS data from the same ion mobility mass spectrometer platform but with different CCSs;
for an adduct ion having a plurality of CCSs, calculating a maximum difference between the plurality of CCSs; deleting the CCS data for the adduct ion if the maximum difference is greater than a set threshold; otherwise, calculating the average of the plurality of CCSs as the CCS of the adduct ion;
wherein the maximum difference is the ratio of the difference between the maximum CCS and the minimum CCS to the average of the CCSs.
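The adduct-ion operation in claim 2 can be sketched in a few lines. This is an illustrative sketch only: the function name `merge_adduct_ccs` and the parameter name `max_rel_diff` are hypothetical, and the claim does not state a threshold value, so the caller must supply one.

```python
from statistics import mean
from typing import Optional, Sequence


def merge_adduct_ccs(values: Sequence[float], max_rel_diff: float) -> Optional[float]:
    """Merge several CCS values reported for one adduct ion.

    Returns the average CCS, or None when the values disagree by more than
    max_rel_diff and the data for this adduct ion should be deleted.
    """
    if not values:
        raise ValueError("at least one CCS value is required")
    avg = mean(values)
    # Maximum difference as defined in the claim: (max - min) / average.
    rel_diff = (max(values) - min(values)) / avg
    if rel_diff > max_rel_diff:
        return None
    return avg
```

With an assumed threshold of 0.02, the values 200.0 and 202.0 (relative difference of about 0.01) are merged to 201.0, while 200.0 and 210.0 (relative difference of about 0.049) are rejected.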
3. The method for standardizing CCS data according to claim 1, wherein:
the removing of outliers from the CCS data that have undergone quality inspection and processing comprises:
fitting, with a power function, a trend line for each compound type to the CCS data that have undergone quality inspection and processing, and calculating a confidence interval; and deleting the CCS data that fall outside the confidence interval at the set threshold.
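One possible reading of the outlier step in claim 3 is sketched below, under stated assumptions. The claim does not name the independent variable or the form of the interval; the sketch assumes the trend line is CCS = a · (m/z)^b, fits it by least squares in log-log space, and substitutes a simple band of k residual standard deviations for the confidence interval. The function names and the default k = 3 are hypothetical, and the functions would be applied separately to the data of each compound type.

```python
import math
from typing import List, Sequence, Tuple


def fit_power_law(mz: Sequence[float], ccs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of CCS = a * mz**b in log-log space; returns (a, b)."""
    xs = [math.log(v) for v in mz]
    ys = [math.log(v) for v in ccs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


def flag_outliers(mz: Sequence[float], ccs: Sequence[float], k: float = 3.0) -> List[bool]:
    """Flag points whose log-residual from the trend line exceeds k residual std. deviations."""
    if len(mz) < 3:
        raise ValueError("at least three points are required")
    a, b = fit_power_law(mz, ccs)
    residuals = [math.log(c) - math.log(a * m ** b) for m, c in zip(mz, ccs)]
    # Residual standard deviation with n - 2 degrees of freedom (two fitted parameters).
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))
    return [abs(r) > k * sigma for r in residuals]
```

Entries flagged `True` would be the candidates for deletion; a formal confidence or prediction interval could replace the k-sigma band without changing the overall structure.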
4. The method for standardizing CCS data according to claim 1, wherein: calculating a uniform CCS for a compound having CCS data from multiple ion mobility mass spectrometer platforms comprises:
averaging the CCSs of the compound from all the ion mobility mass spectrometer platforms to obtain the uniform CCS for the compound.
5. The method for standardizing CCS data according to claim 1, wherein: the standardization method further comprises assigning a confidence level to each item of CCS data as follows:
for a uniform CCS calculated from experimentally measured CCS data acquired from different ion mobility mass spectrometer platforms, where the instrument type of all of the platforms is DTIM-MS and the maximum CCS difference is smaller than a first set difference threshold, assigning a first-level confidence to the uniform CCS;
for a uniform CCS calculated from experimentally measured CCS data acquired from different ion mobility mass spectrometer platforms, where the instrument type of the platforms is not limited and the maximum CCS difference is smaller than a second set difference threshold, assigning a second-level confidence to the uniform CCS;
for CCS data acquired from only one ion mobility mass spectrometer platform, where the instrument type of the platform is not limited, assigning a third-level confidence;
for a uniform CCS calculated from experimentally measured CCS data acquired from different ion mobility mass spectrometer platforms, where the instrument type of the platforms is not limited and the maximum CCS difference is greater than the second set difference threshold, marking the confidence of the uniform CCS as conflict;
wherein the confidence decreases in the order of the first level, the second level, the third level, and conflict.
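The averaging of claim 4 and the confidence levels of claim 5 can be combined in one small function. The sketch below is an illustration under assumptions, not the claimed implementation: it assumes the maximum CCS difference is the relative difference defined in claim 2, that the first threshold is not larger than the second, and that a difference exactly equal to the second threshold is treated as a conflict (the claim leaves this case open). The function name, the label strings, and the input layout are hypothetical.

```python
from statistics import mean
from typing import Dict, Tuple


def unify_and_grade(ccs_by_platform: Dict[str, Tuple[str, float]],
                    first_threshold: float,
                    second_threshold: float) -> Tuple[float, str]:
    """Return (uniform CCS, confidence label) for one compound.

    ccs_by_platform maps a platform name to (instrument type, CCS value).
    """
    if not ccs_by_platform:
        raise ValueError("at least one platform is required")
    values = [ccs for _, ccs in ccs_by_platform.values()]
    unified = mean(values)                      # claim 4: average over all platforms
    if len(values) == 1:
        return unified, "Level 3"               # data from a single platform
    rel_diff = (max(values) - min(values)) / unified
    all_dtims = all(kind == "DTIM-MS" for kind, _ in ccs_by_platform.values())
    if all_dtims and rel_diff < first_threshold:
        return unified, "Level 1"               # DTIM-MS platforms in close agreement
    if rel_diff < second_threshold:
        return unified, "Level 2"               # any instrument types, acceptable agreement
    return unified, "Conflict"                  # platforms disagree
```

For instance, with assumed thresholds of 0.01 and 0.03, two DTIM-MS measurements of 200.0 and 201.0 give a uniform CCS of 200.5 at Level 1, whereas 200.0 and 220.0 from any two platforms are marked as a conflict.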
6. A method for constructing a CCS database, comprising the following steps:
collecting experimentally measured CCS data from ion mobility mass spectrometer platforms;
standardizing the collected CCS data by the CCS data standardization method of any one of claims 1-5, storing the processed CCS data, and generating a training data set;
training a machine-learning-based prediction model on the training data set, so that the trained prediction model can predict and output a series of CCS values of a compound based on the input structural information of the compound to be predicted;
predicting, by the prediction model, the CCS value of the compound based on the structural information of the compound to be predicted that is input by a user, and storing and/or returning the predicted CCS.
7. A CCS database system, comprising:
a standardization module, configured to standardize experimentally measured CCS data collected from ion mobility mass spectrometer platforms according to the method of any one of claims 1-5;
a prediction module, comprising a machine-learning-based prediction model and configured to predict the CCS value of a compound according to the structural information of the compound to be predicted that is input by a user; the prediction model is obtained through training, and the training data set is obtained from the data processed by the standardization module;
a database, comprising the experimentally measured CCS data processed by the standardization module and the CCS data predicted by the prediction module.
8. The CCS database system of claim 7, wherein:
the structural information of the compound comprises a plurality of molecular descriptors selected in the following manner:
based on the training data set, selecting the molecular descriptors that contribute most to the predicted CCS value by recursive feature elimination with cross-validation.
9. The CCS database system of claim 7, wherein:
when the machine-learning-based prediction model is trained, the accuracy of the predicted CCS data is evaluated as follows:
calculating the structural similarity between the compound to be predicted and every compound in the training data set to obtain structural similarity scores; based on the rule that the higher the structural similarity score, the more similar the CCS, calculating the average of the N highest structural similarity scores as a parameter for evaluating the accuracy of the CCS given by the prediction model, wherein a larger value of the parameter indicates a more accurate predicted CCS;
wherein N is a positive integer and is a set value.
10. The CCS database system of claim 7, wherein: the system further comprises a compound card establishing module, configured to establish compound cards for the CCS data stored in the database in the following format:
each compound card takes the structure of a compound as its unique identifier, and generates and displays the following information for the compound: the initially collected CCS data for the compound, basic information on the compound, the uniform CCS of the compound, the experimentally measured CCS data for the compound, the predicted CCS of the compound, and links to the compound in other databases.
CN202010642071.7A 2020-07-06 2020-07-06 CCS data standardization method, database construction method and database system Pending CN111858570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010642071.7A CN111858570A (en) 2020-07-06 2020-07-06 CCS data standardization method, database construction method and database system

Publications (1)

Publication Number Publication Date
CN111858570A true CN111858570A (en) 2020-10-30

Family

ID=73153093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010642071.7A Pending CN111858570A (en) 2020-07-06 2020-07-06 CCS data standardization method, database construction method and database system

Country Status (1)

Country Link
CN (1) CN111858570A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103650100A (en) * 2011-04-28 2014-03-19 菲利普莫里斯生产公司 Computer-assisted structure identification
CN106250556A (en) * 2016-08-17 2016-12-21 贵州数据宝网络科技有限公司 Data digging method for big data analysis
CN108362761A (en) * 2018-01-08 2018-08-03 上海市刑事科学技术研究院 A kind of CCS Value Datas library, its method for building up and the application of drugs poisonous substance
CN111221809A (en) * 2020-01-08 2020-06-02 国电联合动力技术有限公司 Data cleaning method and system based on real-time database storage and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TOBIAS KIND 等: "Identification of small molecules using accurate mass MS/MS search", MASS SPECTROM, 24 April 2017 (2017-04-24) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634997A (en) * 2020-11-16 2021-04-09 中国科学院上海有机化学研究所 Sterol database establishment and sterol analysis method
WO2022179441A1 (en) * 2021-02-24 2022-09-01 International Business Machines Corporation Standardization in the context of data integration
US11550813B2 (en) 2021-02-24 2023-01-10 International Business Machines Corporation Standardization in the context of data integration
GB2618956A (en) * 2021-02-24 2023-11-22 Ibm Standardization in the context of data integration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination