CN114283877A

CN114283877A - Method for establishing metabolite model and metabonomics database thereof

Info

Publication number: CN114283877A
Application number: CN202110471744.1A
Authority: CN
Inventors: 赵爽; 韩伟
Original assignee: Xiamen Mailio Technology Co ltd
Current assignee: Xiamen Mailio Technology Co ltd
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2022-04-05

Abstract

The invention discloses a method for establishing a metabolite model and a metabonomics database thereof. First, a metabolite retention time model is established, and a new metabonomics database using predicted retention times is built on the basis of a known metabonomics database. During model building, the known metabolites are randomly divided into a training group and a test group. The metabolite descriptors (MDs) and retention times of the training-group metabolites are used to build the model with the support vector machine (SVM) method, and the test group is used to verify the model. The MDs of the new metabolites to be included are then supplied to the established model to obtain their retention times. The method aims to obtain predicted retention times of metabolites by computer-aided simulation when chemical standards for the metabolites are unavailable.

Description

Method for establishing metabolite model and metabonomics database thereof

Technical Field

The invention relates to the field of bioinformatics, in particular to a method for establishing a metabolite model and a metabonomics database thereof.

Background

Metabolite identification is an important step in non-targeted metabolomics. Through metabolite identification, peak signals collected by an instrument (such as a high performance liquid chromatography-mass spectrometry system) are converted into metabolite information, from which qualitative and quantitative results for the metabolites are obtained. Identification is typically done by comparing unknown signals with known information in a metabolite database, matching on one or more parameters to determine the identity of the metabolite. Depending on the number of parameters used, the accuracy of the match, and the matching threshold, metabolite identification can be classified as accurate identification (matching on two or more independent parameters) or putative identification (matching on only one parameter).

In metabolomics research, the effectiveness of metabolite identification is closely tied to the metabolomics database used. The more metabolites and metabolite classes the database covers, and the more detailed and accurate the parameter information for each metabolite, the better the identification: more metabolites can be identified, and the identification results are more accurate.

HP-CIL metabonomics refers to metabonomics based on high-performance chemical isotope labeling combined with liquid chromatography-mass spectrometry. In contrast to conventional metabolomics analysis, HP-CIL metabonomics derivatizes the samples with chemical isotope labeling reagents in the sample processing stage. The conventional metabolomics workflow is: sample pretreatment, sample preparation, instrument analysis, data processing, metabolite identification, biological analysis. Because the metabolites are detected in their original form, the information contained in the database consists of the parameters of the underivatized metabolites. The HP-CIL metabonomics workflow is: sample pretreatment, sample preparation, metabolite derivatization, instrument analysis, data processing, metabolite identification, biological analysis, and the detected signals come from the derivatized metabolites. Therefore, the database used for metabolite identification should contain the parameter information of the derivatized metabolites.

In HP-CIL metabonomics, the parameters used for metabolite identification are accurate mass, retention time and secondary mass spectrum (MS/MS) fragment information. Results obtained using at least two of these parameters (e.g., accurate mass and retention time) are accurate identifications; results obtained using only one parameter (generally accurate mass) are putative identifications with low reliability. Obtaining the retention time parameter generally requires analyzing the metabolite experimentally, which yields the experimental retention time.

The current method for establishing an HP-CIL metabonomics database comprises the following steps:

1. Purchasing chemical standards of the metabolites or synthesizing them in the laboratory.

2. Dissolving the chemical standard of each metabolite separately and derivatizing it with the corresponding derivatization reagent, following the same derivatization reaction steps.

3. Analyzing the derivatized metabolite standards by high performance liquid chromatography-mass spectrometry. Each derivatized standard is analyzed individually, and its accurate mass, experimental retention time and secondary mass spectrum (MS/MS) fragment information are collected.

4. Consolidating the collected information into a database. Each entry in the database is one metabolite and contains the information for the three parameters.

The information in a metabonomics database established in this way comes from real experiments and is accurate, and the retention time obtained is called the experimental retention time. Metabolite identifications made with this information are highly reliable. However, purchasing or synthesizing chemical standards is costly in both money and time, and some metabolites have no chemical standards available and therefore cannot be added to the database. Because chemical standards are difficult and expensive to obtain, databases established by the existing method are small and contain few metabolites (fewer than 1000). For convenience, a database established by this method is called a CIL metabonomics database.

Disclosure of Invention

To solve the problem of small database size caused by the difficulty of obtaining standards in the existing database-building method, the invention aims to provide a new database-building method. The new database obtains the predicted retention times of metabolites by computer-aided simulation in the absence of metabolite chemical standards.

In order to achieve the above object, the present invention provides a method for modeling a metabolite, comprising the steps of:

1) establishing a metabolite retention time model:

a) searching for the metabolites of a known metabolomics database on the PubChem website to obtain their SMILES structures and other related information;

b) analyzing all metabolites in the known metabolomics database with the Chemistry Development Kit (CDK), using the SMILES structural formulas obtained from PubChem, to obtain their CDK Descriptors;

c) combining the PubChem Descriptors and the CDK Molecular Descriptors to obtain a complete property representation of each metabolite, namely its metabolite descriptors (MDs);

d) combining the MDs of all metabolites with their corresponding retention times as recorded in the known metabolomics database;

e) randomly dividing all metabolites in the known metabolomics database into two groups, wherein one group contains 6/7 of the metabolites and is called the training group, and the other group contains the remaining 1/7 of the metabolites and is called the test group;

f) establishing a regression model from the MDs and retention times of the training-group metabolites using the support vector machine (SVM) method; the model is required to satisfy Q² > 0.7;

2) Creating a new metabolite list:

a) establishing metabolites to be included in the database;

b) obtaining the PubChem SMILES structural formulas and CDK Descriptors of these metabolites by the method described in step 1);

c) combining the PubChem Descriptors and CDK Descriptors of the metabolites to obtain complete property expression of each metabolite, namely MDs;

3) obtaining a predicted retention time:

a) according to the established model, using MDs of the new metabolite to obtain the retention time of the new metabolite;

b) a database is built with the list of new metabolites, the exact mass of the new metabolites and the retention time obtained.
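The descriptor-combining steps 1) c) and 2) c) can be pictured with a minimal Python sketch. It assumes each descriptor source is available as a mapping from metabolite name to descriptor values; the function name `merge_descriptors`, the `pubchem_`/`cdk_` prefixes, and all values shown are hypothetical placeholders, not data or code from the invention.

```python
def merge_descriptors(pubchem, cdk):
    """Merge two {metabolite: {descriptor: value}} tables into one MD table."""
    merged = {}
    # Only metabolites present in both sources get a complete descriptor set.
    for name in sorted(pubchem.keys() & cdk.keys()):
        row = {}
        # Prefix each descriptor with its source so equal names cannot collide.
        for key, value in pubchem[name].items():
            row["pubchem_" + key] = value
        for key, value in cdk[name].items():
            row["cdk_" + key] = value
        merged[name] = row
    return merged


# Placeholder values for illustration only; they are not real descriptor data.
pubchem_table = {"metabolite_A": {"XLogP": 1.0}, "metabolite_B": {"XLogP": 2.0}}
cdk_table = {"metabolite_A": {"nRotB": 3, "MLogP": 1.5}}

md_table = merge_descriptors(pubchem_table, cdk_table)
print(md_table)
```

In this sketch a metabolite missing from either source is dropped rather than given an incomplete descriptor row; whether the actual workflow does the same is not stated in the text.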

Further, the specific steps of establishing the model by the SVM method in step f) are as follows:

first, running the e1071 package in R;

secondly, using the MDs and retention times of the training-group metabolites as input variables;

thirdly, using a radial basis kernel to transform the MD data into a high-dimensional data space; the radial basis kernel equation is

$$K(u, v) = e^{-r \lVert u - v \rVert^{2}}$$

wherein u and v are variables, e is the natural constant, and r and cost are parameters (cost is used in model building and does not appear in the kernel equation itself);

fourthly, running the program after the parameters are determined to establish a regression model between the MD data and the retention time; the model can be expressed as rt ~ XLOGP + LipinskiFailures + nRotB + MLogP + nAtomLAC + ..., wherein each variable is an MD with an associated weight (w), and the model has an intercept (b); preferably, some of the values of w and b are as follows: w is 109.6916 for XLOGP, -45.92101 for LipinskiFailures, -31.93641 for nRotB, 128.7612 for MLogP and 96.8386 for nAtomLAC; and b is -1.53251329.

Furthermore, after the model is established, it is verified using the metabolites of the test group, with the following steps:

1) loading the established regression model (an .Rdata file) in R;

2) inputting the MDs of the test-group metabolites as variables;

3) running the program to obtain the predicted retention times, and comparing the predicted retention times with the experimental retention times;

the criteria for successful retention time prediction by the model are:

the predicted retention times and the experimental retention times of all metabolites in the test group are linearly correlated;

the difference between the predicted retention time and the experimental retention time is within a certain range for all metabolites; this range is used as the retention time threshold for metabolite identification and is preferably within 180 seconds.

Further, the known metabolomics database is a known CIL metabonomics database.

The invention also provides a method for establishing the metabonomics database, which is characterized in that the metabonomics database is obtained by applying the method.

The metabonomics database established by the invention has the following characteristics:

1) Because metabolite chemical standards are lacking, the retention time obtained by this method is a predicted retention time, not an experimental retention time. The predicted retention time is less accurate than the experimental retention time, so metabolite identifications made with it are less reliable than those made with the experimental retention time.

2) Metabolite identifications made with a database built in this way are regarded as high-confidence putative results. Because experimental information from chemical standards is lacking, they can only be putative; however, two independent parameters are used for identification (accurate mass and predicted retention time), so the results are more reliable than ordinary putative identifications and are therefore called high-confidence putative results.

3) The metabonomics database is established by computer-aided simulation and requires neither chemical standards nor a large number of experiments, which avoids large economic and time costs.

4) A database established in this way covers far more metabolites and can hold information on tens of thousands of metabolites.

Drawings

FIG. 1 is a flow chart of the operation of the method of the present invention as exemplified by CIL metabolomics database;

FIG. 2 is a graph showing the linear correlation between the retention times predicted by the established model and the actual experimental retention times.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting it. Where the examples do not specify particular techniques or conditions, they are carried out according to the techniques or conditions described in the literature of the art or according to the product specifications. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.

The following example is described with reference to the workflow diagram of FIG. 1, which illustrates the method of the invention using a CIL metabonomics database as an example.

Example 1:

Dns-Library is a CIL metabonomics database established by the prior-art method; its construction and contents are published in:

Tao Huan, Yiman Wu, Chenqu Tang, Guohui Lin and Liang Li, 2015, "DnsID in MyCompoundID for Rapid Identification of Dansylated Amine- and Phenol-Containing Metabolites in LC-MS-Based Metabolomics", Anal. Chem. 87, 9838–9845.

The database contains only 273 metabolites, so only a small number of metabolites can be identified from instrument data when it is used for metabolite identification. In the analysis of a human urine sample, 105 metabolites were identified using accurate mass and experimental retention time.

Using the method provided by the invention, a new metabonomics database using predicted retention times is established on the basis of this CIL metabonomics database according to the steps above. During model building, the 273 metabolites are randomly divided into a training group and a test group. Following the principle of using 1/7 of the metabolites as the test group, 234 of the 273 metabolites form the training group and 39 form the test group. The training group is used to build the model and the test group is used to verify it.
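The 6/7 to 1/7 split described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `split_training_test`, the fixed random seed, and the placeholder metabolite names are assumptions, and the text does not say how the random division was actually implemented.

```python
import random


def split_training_test(metabolites, test_fraction=1 / 7, seed=0):
    """Randomly split a list into (training, test) with the given test fraction."""
    shuffled = list(metabolites)
    # A seeded generator makes the illustrative split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder names standing in for the 273 database entries.
names = [f"metabolite_{i}" for i in range(273)]
training, test = split_training_test(names)
print(len(training), len(test))  # 234 39
```

With 273 entries, one seventh is exactly 39, which leaves 234 for training, matching the group sizes given in the example.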

The specific model establishing steps are as follows:

1) establishing a metabolite retention time model:

a) The 273 metabolites in the CIL metabonomics database are searched on the PubChem website (https://pubchem.ncbi.nlm.nih.gov/) to obtain their SMILES structures and other related information (see Table 1 for a few examples);

Table 1. SMILES structures and other related information for several metabolites

b) The 273 metabolites in the CIL metabonomics database are analyzed with the Chemistry Development Kit (https://cdk.github.io/), using the SMILES structural formulas obtained from PubChem, to obtain their CDK Descriptors;

Table 2. CDK Descriptors for several metabolites

c) The PubChem Descriptors and the CDK Molecular Descriptors are combined to obtain complete property representations of the 273 metabolites (namely, Metabolite Descriptors, MDs);

d) The MDs of the 273 metabolites are combined with the corresponding retention times (namely, Experimental RT) recorded in the CIL metabonomics database; Table 3 lists the experimental retention times of three of the metabolites;

Table 3. Experimental retention times of three metabolites

e) The 273 metabolites in the CIL metabonomics database are randomly divided into two groups: one group contains 6/7 of the metabolites, namely 234 metabolites, and is called the training group; the other group contains the remaining 1/7, namely 39 metabolites, and is called the test group; see Tables 4 and 5 for examples.

Table 4. Examples of training-group metabolites

Table 5. Examples of test-group metabolites

f) The MDs and retention times of the training-group metabolites are used to build the model with the support vector machine (SVM) method. The method converts low-dimensional data into a high-dimensional data space in order to identify the relationships among the variables of the training-group data and to establish a model describing the mathematical relationship between the MDs and the retention time. The specific steps are as follows:

first, running the e1071 package in R (https://cran.r-project.org/web/packages/e1071/index.html);

secondly, using the MDs and retention times of the training-group metabolites as input variables;

thirdly, transforming the MD data into a high-dimensional data space using a radial basis kernel. The radial basis kernel equation is

$$K(u, v) = e^{-r \lVert u - v \rVert^{2}}$$

where u and v are variables, e is the natural constant, and r is a parameter; a further parameter, cost, is used in model building but does not appear in this equation. In this example, the parameters are r = 0.000244140625 and cost = 32;

fourthly, running the program after the parameters are determined to establish a regression model between the MD data and the retention time (the model is a high-dimensional matrix model and is stored as an .Rdata file).

The model can be expressed as rt ~ XLOGP + LipinskiFailures + nRotB + MLogP + nAtomLAC + ..., where each variable is an MD with an associated weight (w), and the model has an intercept (b). In this example, some of the values of w and b are as follows: w is 109.6916 for XLOGP, -45.92101 for LipinskiFailures, -31.93641 for nRotB, 128.7612 for MLogP and 96.8386 for nAtomLAC; and b is -1.53251329.
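The radial basis kernel used in the third step can be written out as a short Python sketch. It assumes the kernel has the standard Gaussian form exp(-r·|u − v|²), which is the form documented for the radial kernel in e1071; the original equation image is not reproduced in this text, so this form is an assumption. The function name `rbf_kernel` and the example vectors are hypothetical.

```python
import math

# Parameter values reported in this example; both happen to be powers of two.
R_PARAM = 2 ** -12  # r = 0.000244140625
COST = 2 ** 5       # cost = 32 (used by the SVM fit, not by the kernel)


def rbf_kernel(u, v, r=R_PARAM):
    """Gaussian radial basis kernel: exp(-r * squared Euclidean distance)."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same number of descriptors")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-r * squared_distance)


# Identical descriptor vectors give a similarity of exactly 1.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
# Vectors 64 units apart give exp(-1), because 64 ** 2 * 2 ** -12 == 1.
print(rbf_kernel([0.0, 0.0], [64.0, 0.0]))
```

The small value of r means the kernel decays slowly with distance, so descriptor vectors have to differ by tens of units before their similarity drops noticeably.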

After the model is established, it is verified using the metabolites of the test group: the model is used to predict the retention times of the 39 test-group metabolites, with the following specific steps:

first, loading the established regression model (.Rdata file) in R;

secondly, using the MDs of the test-group metabolites as input variables;

thirdly, running the program to obtain the predicted retention times.

The predicted retention times are compared with the experimental retention times. The criteria for successful retention time prediction by the model are:

the predicted retention times and the experimental retention times of all test-group metabolites are linearly correlated, as shown in FIG. 2 and Table 6;

the difference between the predicted retention time and the experimental retention time is within a certain range for all metabolites (this range is used as the retention time threshold for metabolite identification; in this example it is within 180 seconds).

When the SVM model is built, the Q² value of the model is calculated; Q² > 0.7 is required.

Table 6. Predicted and experimental retention times (RT, in seconds) of five test-group metabolites

Name                     Experimental RT    Predicted RT    Difference between predicted and experimental RT
Citrulline               224.4              156.76          66.6398
5-Aminopentanoic acid    520.8              576             55.2
Homovanillic acid        990.6              1032            41.4
Serotonin                1479               1374.42         104.584
L-Thyronine              1526.4             1500.68         25.7188
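The two success criteria can be checked on the five metabolites of Table 6 with a minimal Python sketch. The experimental and predicted values are copied from the table and the differences are recomputed from those two columns, so they may not match the printed difference column digit for digit. The helper `pearson_r` and the variable names are hypothetical, and a correlation over five rows is not the R² reported for the full 39-metabolite test group.

```python
import math

RT_THRESHOLD_S = 180  # retention time threshold used in this example

# (name, experimental RT, predicted RT), copied from Table 6.
TABLE_6 = [
    ("Citrulline", 224.4, 156.76),
    ("5-Aminopentanoic acid", 520.8, 576.0),
    ("Homovanillic acid", 990.6, 1032.0),
    ("Serotonin", 1479.0, 1374.42),
    ("L-Thyronine", 1526.4, 1500.68),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Criterion 2: every absolute difference must stay within the threshold.
differences = {name: abs(pred - exp) for name, exp, pred in TABLE_6}
all_within_threshold = all(d <= RT_THRESHOLD_S for d in differences.values())

# Criterion 1: predicted and experimental RT should be linearly correlated.
r = pearson_r([exp for _, exp, _ in TABLE_6], [pred for _, _, pred in TABLE_6])

print(all_within_threshold)
print(round(r ** 2, 3))
```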

2) Creating a new metabolite list:

a) A list is established of 3281 metabolites that are desirable to include in the database but whose chemical standards are not readily available, for example 3-methyl-histidine, trans-4-hydroxy-L-proline, sepiapterin, malonyl-CoA and 2-hydroxyestradiol.

b) The PubChem SMILES structural formulas and CDK Descriptors of these metabolites are obtained by the method described above; some of the metabolites are shown in Tables 7 and 8.

Table 7 data sheet listing several metabolites

Table 8 data sheet listing several metabolites

c) The PubChem Descriptors and CDK Descriptors of these metabolites are combined to obtain a complete property representation of each metabolite (namely, Metabolite Descriptors, MDs);

3) obtaining a predicted retention time

a) Obtaining the retention time of the new metabolite using the MDs of the new metabolite according to the established model describing the mathematical relationship between the MDs and the retention time; table 9 lists the predicted retention times for several metabolites.

Table 9 lists the predicted retention time results for metabolites

b) The database is built with a list of new metabolites, the exact mass of the new metabolites (from the list of metabolites) and the retention times obtained.

The verification results are as follows:

The results show that when the retention times of the test-group metabolites are predicted with the newly established model, the predicted retention times are linearly correlated with the experimental retention times, as shown in FIG. 2, where R² = 0.9624. The cross-validation performed at the same time yields Q² = 0.792, showing that the model was established successfully. (The model is judged by performing cross-validation and using the resulting Q², which represents the predictive performance of the model, to assess its accuracy: the higher Q² is, that is, the closer to 1, the more accurate the model; Q² is generally required to be at least greater than 0.7.)
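The text does not give the formula for Q². A commonly used definition is Q² = 1 − PRESS/TSS, where PRESS is the sum of squared differences between the observed values and their cross-validated predictions, and TSS is the total sum of squares about the observed mean. The Python sketch below assumes that definition; the function name `q_squared` and the toy numbers are hypothetical and are not results from the invention.

```python
def q_squared(observed, cv_predicted):
    """Q^2 = 1 - PRESS / TSS for observed values and cross-validated predictions."""
    if len(observed) != len(cv_predicted):
        raise ValueError("observed and cv_predicted must have the same length")
    mean_observed = sum(observed) / len(observed)
    # PRESS: squared error of the held-out (cross-validated) predictions.
    press = sum((o - p) ** 2 for o, p in zip(observed, cv_predicted))
    # TSS: squared deviation of the observations from their own mean.
    tss = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - press / tss


# Toy retention times in seconds, for illustration only.
toy_observed = [100.0, 200.0, 300.0, 400.0]
toy_predicted = [110.0, 190.0, 310.0, 390.0]

q2 = q_squared(toy_observed, toy_predicted)
print(q2 > 0.7)  # the acceptance rule stated in the text
```

Under this definition a model that always predicts the mean scores 0 and a perfect model scores 1, which is consistent with the statement that values closer to 1 indicate a more accurate model.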

The newly established metabonomics database contains 3554 metabolites in total (the 273 metabolites of the original database plus the 3281 newly added metabolites).

To further verify the accuracy of the predicted retention times, standards of metabolites included by this method were purchased, and their experimental retention times were collected and compared with the predicted values. The verification results agree with the test-group results; see, for example, Table 10:

Table 10. Verification results: differences between predicted and experimental retention times

Detection of human urine:

Using human urine as the sample, 380 metabolites were identified, as shown in Table 11. Other novel metabolites were also found, as listed in Table 12.

The human urine sample was subjected to sample preparation, LC-MS analysis and data processing using the HP-CIL metabonomics technique, and metabolite identification was performed on the resulting peak pair list. The accurate mass and retention time of each unknown peak pair in the urine sample were matched against the accurate mass and retention time of the metabolites in the database. When the newly established database is used for matching, the retention time used is the one predicted by the model; the matching thresholds are 10 ppm for accurate mass (m/z) and 180 seconds for retention time.
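The two-parameter matching described above can be sketched in Python as follows. The 10 ppm and 180 second tolerances come from the text; the function names `ppm_error` and `match_peak`, the dictionary layout, and the database entries are hypothetical placeholders rather than real metabolite data or the actual identification software.

```python
def ppm_error(observed_mz, reference_mz):
    """Absolute mass error in parts per million relative to the reference m/z."""
    return abs(observed_mz - reference_mz) / reference_mz * 1e6


def match_peak(peak_mz, peak_rt, database, ppm_tol=10.0, rt_tol=180.0):
    """Return names of entries matching on both accurate mass and retention time."""
    return [
        entry["name"]
        for entry in database
        if ppm_error(peak_mz, entry["mz"]) <= ppm_tol
        and abs(peak_rt - entry["rt"]) <= rt_tol
    ]


# Placeholder entries; "rt" stands for the model-predicted retention time in seconds.
toy_database = [
    {"name": "hypothetical_A", "mz": 500.0000, "rt": 600.0},
    {"name": "hypothetical_B", "mz": 500.0040, "rt": 1000.0},
]

# Both entries are within 10 ppm of this peak, but only A is within 180 s.
print(match_peak(500.0020, 650.0, toy_database))
```

Requiring both tolerances at once is what separates this from a mass-only putative match: entry B passes the mass check in the example but is rejected on retention time.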

Table 11. The 380 metabolites identified in human urine

Table 12. Other novel metabolites

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims

1. A method of modeling a metabolite, comprising the steps of:

1) establishing a metabolite retention time model:

2) Creating a new metabolite list:

a) establishing metabolites to be included in the database;

3) obtaining a predicted retention time:

2. The method for modeling a metabolite according to claim 1, wherein the specific steps of modeling using the SVM method in step f) are as follows:

running an e1071 package by using R;

3. The method for modeling metabolites according to claim 1 wherein after the modeling, the metabolites of the test group are subjected to model validation by the steps of,

1) loading the established regression model (an .Rdata file) in R;

2) inputting MD of the test group metabolites as variables;

the model predicted retention time success criteria were:

4. The method for modeling metabolites according to claim 1, wherein said known metabolomics database is a known CIL metabonomics database.

5. A method for creating a metabolomics database, wherein the metabolomics database is obtained by using the metabolite data obtained by the method according to any one of claims 1-4.