CN114594171B

CN114594171B - Metabolome deep annotation method

Info

Publication number: CN114594171B
Application number: CN202011407735.8A
Authority: CN
Inventors: 许国旺; 李在芳; 王鑫欣; 亓彦鹏; 路鑫; 林晓惠; 赵春霞; 赵欣捷
Original assignee: Dalian University of Technology; Dalian Institute of Chemical Physics of CAS
Current assignee: Dalian University of Technology; Dalian Institute of Chemical Physics of CAS
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2023-12-15
Anticipated expiration: 2040-12-03
Also published as: CN114594171A

Abstract

The invention discloses a deep annotation method for the metabolome of complex biological samples. A biological sample extract is subjected to non-targeted metabolomics analysis by ultra-high performance liquid chromatography-high resolution mass spectrometry (UHPLC-HRMS) to obtain chromatography-mass spectrometry information on its metabolome. Candidate metabolites are screened from metabolomics databases by matching the experimental primary ion mass-to-charge ratios and experimental retention times in the non-targeted data, and a metabolite molecular structure association network is constructed from the molecular fingerprint similarity of the candidate metabolites. The metabolome is then identified on a large scale using the non-targeted UHPLC-HRMS experimental data, with the molecular structure association network as the background network. The method does not depend on a large-scale experimental secondary spectrum database and offers higher identification coverage and reliability.

Description

Metabolome deep annotation method

Technical Field

The invention relates to the fields of analytical chemistry and metabonomics, in particular to a metabolite deep annotation method based on a molecular structure association network.

Background Art

Metabolites are structurally diverse and species-specific, and metabolite identification has long been a bottleneck in metabolomics and analytical chemistry. Non-targeted ultra-high performance liquid chromatography-high resolution mass spectrometry (UHPLC-HRMS) is one of the main technologies of metabolomics research, and with the continuing progress of high-resolution mass spectrometry, generating high-throughput metabolomics data is no longer the main bottleneck. UHPLC-HRMS-based non-targeted methods can detect up to tens of thousands of mass spectral peaks (metabolic features) in a single analysis, yet these typically correspond to fewer than 1000 metabolites, of which usually only a few hundred can be identified. Because so little of the non-targeted data can be annotated, many of the differential metabolites that are discovered cannot be used in follow-up studies of function and mechanism, since their structures are unknown.

Reliable metabolite identification by mass spectrometry typically requires database searching and matching on accurate mass, retention time and secondary mass spectra (MS/MS). Metabolome databases currently record a large number of endogenous metabolites, but they lack chromatographic retention times and contain few experimental secondary spectra; most recorded secondary spectra are theoretically predicted and differ considerably from measured spectra. In addition, secondary spectra acquired on different types of mass spectrometer are poorly reproducible, which limits the identification power of database searching and seriously hampers the effective identification of metabolites. A deep annotation method for non-targeted UHPLC-HRMS metabolome data is therefore urgently needed.

Disclosure of Invention

The invention provides a method for large-scale identification of the metabolome. To this end, a biological sample extract is subjected to non-targeted metabolomics analysis by ultra-high performance liquid chromatography-high resolution mass spectrometry to obtain chromatography-mass spectrometry information on its metabolome; candidate metabolites are collected from metabolome databases on the basis of the non-targeted data; a metabolite molecular structure association network is constructed from the molecular fingerprint similarity of the candidate metabolites; and the metabolome is identified on a large scale using the non-targeted UHPLC-HRMS experimental data, with the molecular structure association network as the background network. The technical scheme adopted by the invention comprises the following steps:

First, non-targeted metabolomics analysis is performed on an extract of the biological sample to be tested by ultra-high performance liquid chromatography-high resolution mass spectrometry. Chromatography-mass spectrometry information on the extract metabolome is acquired, including the retention time t_{r,measured} of each experimentally detected metabolite peak, the primary mass spectrum information, i.e. the primary ion mass-to-charge ratio m/z_{measured}, and the corresponding secondary mass spectrum information, i.e. the mass-to-charge ratios and intensities of the secondary ions. Primary ions are the ions collected directly after a compound is ionized; secondary ions are the ions collected after the primary ions are fragmented by collision at a given energy;

Second, a molecular structure database of candidate metabolites is constructed from the primary ion mass-to-charge ratios m/z_{measured} and experimental retention times t_{r,measured} of all metabolite peaks of the biological sample extract obtained in the first step. The theoretical primary ion mass-to-charge ratio m/z_{theoretical} is calculated from the molecular formula of each metabolite in open-source metabolome databases, and the predicted retention time t_{r,predicted} of each metabolite is obtained from a retention time prediction model built on the structure-retention relationships of known metabolites. The m/z_{theoretical} and t_{r,predicted} of the database metabolites are matched against the m/z_{measured} and t_{r,measured} of the experimental metabolite peaks. Metabolites that simultaneously satisfy

|t_{r,predicted} - t_{r,measured}| < 2 min and |m/z_{theoretical} - m/z_{measured}| / m/z_{theoretical} × 1000000 < 5 ppm are taken as candidate metabolites and used to construct the candidate metabolite database; for each metabolite the database contains the simplified molecular-input line-entry system (SMILES) string, name, molecular formula, molecular structure and predicted retention time;

Third, a molecular structure association network of the metabolome is constructed. A molecular fingerprint is computed from the molecular structure of each metabolite in the candidate metabolite database; the fingerprint may be any one of the Morgan, MACCS, atom-pair and Daylight fingerprints. The similarity between the molecular fingerprints of every pair of candidate metabolites is calculated with the open-source toolkit RDKit. A similarity threshold is set; with metabolites as nodes and molecular fingerprint similarity as edges, every pair of metabolites whose similarity is greater than or equal to the threshold is connected, giving a metabolome-level molecular structure association network;

Fourth, metabolites are identified on a large scale using the molecular structure association network. The network constructed in the third step serves as the background network and the candidate metabolite database as the reference. Between 5 and 50 metabolites are selected from the background network and identified in the non-targeted UHPLC-HRMS metabolome experimental data using standard samples of those metabolites. These 5 to 50 metabolites serve as seed metabolites: they are mapped onto the established network and their adjacent metabolites are obtained from it, where adjacent metabolites are those connected to a seed directly by an edge. The secondary mass spectrum of a seed metabolite is assigned to each of its adjacent metabolites as a pseudo secondary mass spectrum. The search thresholds are |t_{r,predicted} - t_{r,measured}| < 2 min, |m/z_{theoretical} - m/z_{measured}| / m/z_{theoretical} × 1000000 < 5 ppm, and a similarity of at least 0.5 between the experimental secondary mass spectrum of the metabolite peak and the pseudo secondary mass spectrum of the adjacent metabolite. The experimental data are searched for a metabolite peak that matches the m/z_{theoretical}, t_{r,predicted} and pseudo secondary mass spectrum of each adjacent metabolite; if the match succeeds, the peak is identified. Identified metabolites are used as new seeds and the process is repeated until no new metabolites are identified. When an adjacent metabolite has several matching peaks, the matches are scored, a higher score indicating a more reliable identification, and a metabolite identified in this way is not used as a new seed. The score is:
Score = 0.25 × (1 - |m/z_{theoretical} - m/z_{measured}| × 1000000 / (m/z_{theoretical} × 5)) + 0.25 × (1 - |t_{r,predicted} - t_{r,measured}| / 2) + 0.5 × secondary spectrum similarity.
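The iterative identification described in the fourth step can be sketched as a breadth-first traversal of the network. This is only an illustration under stated assumptions: the function name `propagate`, the data layout and the `find_match` callable (which stands in for the threshold search against the experimental peak table) are hypothetical and do not come from the source.

```python
from collections import deque

def propagate(adjacency, seed_spectra, find_match):
    """Spread identifications from seed metabolites through the network.

    adjacency:    {metabolite: set of adjacent metabolites}
    seed_spectra: {seed metabolite: its secondary mass spectrum}
    find_match:   hypothetical callable (metabolite, pseudo_ms2) ->
                  None, or (peak_id, experimental_ms2, is_unique)
    """
    # Seeds were confirmed with standards, so they carry no peak id here.
    identified = {seed: "standard" for seed in seed_spectra}
    queue = deque(seed_spectra.items())
    while queue:
        seed, spectrum = queue.popleft()
        for neighbour in sorted(adjacency.get(seed, ())):
            if neighbour in identified:
                continue
            # The seed spectrum acts as the pseudo secondary spectrum.
            hit = find_match(neighbour, spectrum)
            if hit is None:
                continue
            peak_id, experimental_ms2, is_unique = hit
            identified[neighbour] = peak_id
            # Ambiguous (scored) identifications are not reused as seeds.
            if is_unique:
                queue.append((neighbour, experimental_ms2))
    return identified
```

The loop ends when the queue is empty, which corresponds to the stopping rule that no new metabolites are identified.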

On the premise that structurally similar metabolites have similar MS/MS spectra, the invention establishes a large-scale identification method guided by experimental data and based on a molecular structure association network, and thereby achieves structural identification of unknown metabolites. By establishing a candidate metabolite database and the corresponding molecular structure association network, metabolites that have no standard MS/MS spectrum can be identified through the network, so that structural identification no longer depends on a large-scale standard MS/MS database. The invention is a metabolome deep annotation method that is independent of large-scale experimental secondary spectrum databases, achieves large-scale and reliable metabolome identification, and markedly expands the coverage of metabolome annotation.
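The source does not name the measure used for secondary spectrum similarity. As one possible choice, the sketch below assumes a cosine similarity over peaks matched within an m/z tolerance; the function name, the greedy matching and the default tolerance of 0.01 are illustrative assumptions, not the patented procedure.

```python
import math

def cosine_similarity(spec_a, spec_b, mz_tol=0.01):
    """Cosine similarity of two spectra given as lists of (m/z, intensity).

    Each peak of spec_a is paired with the closest unused peak of spec_b
    that lies within mz_tol (an assumed tolerance).
    """
    used = set()
    dot = 0.0
    for mz_a, int_a in spec_a:
        best_j, best_delta = None, None
        for j, (mz_b, _) in enumerate(spec_b):
            if j in used:
                continue
            delta = abs(mz_a - mz_b)
            if delta <= mz_tol and (best_delta is None or delta < best_delta):
                best_j, best_delta = j, delta
        if best_j is not None:
            used.add(best_j)
            dot += int_a * spec_b[best_j][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Under this assumption, a peak would pass the spectral part of the search threshold when `cosine_similarity(experimental, pseudo) >= 0.5`.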

Drawings

FIG. 1 is a molecular structure association network (metabolite molecular fingerprint similarity threshold of 0.7);

FIG. 2 is a partial enlarged view of a molecular structure association network;

FIG. 3 is a schematic diagram of the metabolite identification process based on a molecular structure association network;

FIG. 4A is a molecular structure association network from the maize filament mass spectrometry positive ion mode;

FIG. 4B is a molecular structure association network from the maize filament mass spectrometry negative ion mode.

Detailed Description

The invention is described in detail below with reference to the attached drawings: the present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are provided, but the protection scope of the present invention is not limited to the following embodiments.

Example 1

To confirm the effectiveness and feasibility of the invention, a mixed standard of 173 hydroxycinnamamides (including N-cinnamyl-putrescine, N-(p-coumaryl)-cadaverine, N-(p-coumaryl)-agmatine, N'-Caffeoyl-feruloyl-putrescine, N',N''-Caffeoyl-feruloyl-spermidine and N,N',N''-Tris-feruloyl-spermine) was added to a plant extract at a final concentration of 100 to 200 ng/mL. UHPLC-HRMS data were acquired for the spiked plant extract, and the principle of the invention is illustrated by identifying the hydroxycinnamamides in the resulting non-targeted metabolomics data.

Extraction of plant tissue metabolome: the metabolites in maize filament were extracted by a plant metabolomics method. First, 50 mg of maize filament powder was weighed into a 1.5 mL centrifuge tube, 1.0 mL of methanol/water (4:1, v/v) extractant was added, and the mixture was vortexed for 5 min and centrifuged at 15000 rpm and 4 °C for 10 min. 700 μL of the supernatant was lyophilized in a vacuum centrifugal concentrator. 100 μL of methanol/water (4:1, v/v) was added to the lyophilized sample, which was vortexed for 1 min and centrifuged at 15000 rpm and 4 °C for 10 min in a high-speed centrifuge.

Non-targeted chromatography-mass spectrometry information acquisition: data were acquired on an ACQUITY ultra-high performance liquid chromatography system (UPLC, Waters, Milford, MA, USA) coupled to a Q Exactive HF high-resolution mass spectrometer (Thermo Fisher Scientific, Rockford, IL, USA).

The liquid chromatography conditions for the electrospray ionization positive ion mode were as follows. Mobile phases A and B were 0.1% formic acid in water (v/v) and 0.1% formic acid in acetonitrile (v/v), respectively, at a flow rate of 0.35 mL/min. The gradient started at 5% B and was held for 1 min, increased linearly to 100% B at 23 min and was held for 4 min, then returned linearly to the initial composition within 0.1 min and was held for 2.9 min, for a total analysis time of 30 min. Separation was performed on an ACQUITY BEH C18 column (100 mm × 2.1 mm, 1.7 μm; Waters, Milford, MA, USA). The column temperature was 50 °C, the autosampler temperature was 4 °C and the injection volume was 5 μL.

The liquid chromatography conditions for the electrospray ionization negative ion mode were as follows. Mobile phases A and B were 6.5 mM ammonium bicarbonate in water and 6.5 mM ammonium bicarbonate in 95% methanol/water (v/v), respectively, at a flow rate of 0.35 mL/min. The gradient started at 2% B and was held for 1 min, increased linearly to 100% B at 18 min and was held until 22 min, then returned to the initial composition at 22.1 min and was held until 25 min. Separation was performed on an ACQUITY HSS T3 column (100 mm × 2.1 mm, 1.8 μm; Waters, Milford, MA, USA). The column temperature was 50 °C, the autosampler temperature was 4 °C and the injection volume was 5 μL.

The Q Exactive HF mass spectrometry conditions were as follows. The scan mode was full scan with automatically triggered data-dependent secondary mass spectrometry (full MS/data-dependent MS²). For the full scan, the resolution was 120,000, and the automatic gain control target (AGC target) and maximum injection time (maximum IT) were 3×10⁶ ions and 100 ms, respectively. The full-scan mass range was m/z 85-1250. For the secondary mass spectra, the AGC target and maximum IT were 1×10⁵ ions and 50 ms, respectively, and the isolation window was m/z 1.0. A stepped normalized collision energy (NCE) of 15%, 30% and 45% was used. Secondary mass spectra were triggered for the 10 most intense ions in each full-scan cycle. An inclusion list was added and enabled. The electrospray voltages in positive and negative ion modes were 3.5 kV and 3.0 kV, respectively, the ion transfer tube temperature was 320 °C, and the auxiliary gas temperature was 350 °C. The sheath gas and auxiliary gas flow rates were 45 and 10 (arbitrary units), respectively. The S-lens was set to 50.0 (arbitrary units).

Acquisition of experimental chromatography-mass spectrometry information: from the non-targeted metabolomics data of the spiked extract, a peak table containing the experimental retention time t_{r,measured} and the primary mass spectrum information, i.e. the primary ion mass-to-charge ratio m/z_{measured}, was obtained with the software Compound Discoverer 3.1 and exported as an Excel table. The raw data were converted with the software ProteoWizard into an .mgf file containing the corresponding secondary mass spectrum information, i.e. the mass-to-charge ratios and intensities of the secondary ions. Each metabolite peak (m/z_{measured}, t_{r,measured}) in the experimental data was matched to its corresponding secondary mass spectrum using a mass window of 10 ppm and a retention time window of 10 s. From the collected non-targeted metabolomics data, the t_{r,measured}, m/z_{measured} and corresponding secondary mass spectrum information (mass-to-charge ratios and intensities of the secondary ions) of the 173 hydroxycinnamamides were extracted.
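The pairing of peak-table entries with secondary spectra within the 10 ppm and 10 s windows might look like the sketch below. The tuple layouts, the function name `link_ms2`, the inclusive comparison and the assumption that peak-table retention times are in minutes while spectrum retention times are in seconds are all illustrative choices, not details given in the source.

```python
def link_ms2(features, spectra, ppm_tol=10.0, rt_tol_s=10.0):
    """Attach secondary spectra to peak-table features.

    features: list of (feature_id, mz_measured, rt_minutes)
    spectra:  list of (precursor_mz, rt_seconds, peaks)
    Returns {feature_id: [peaks, ...]} for spectra inside both windows.
    """
    linked = {}
    for feature_id, mz, rt_min in features:
        linked[feature_id] = [
            peaks
            for precursor_mz, rt_s, peaks in spectra
            if abs(precursor_mz - mz) / mz * 1e6 <= ppm_tol
            and abs(rt_s - rt_min * 60.0) <= rt_tol_s
        ]
    return linked
```

A feature with an empty list has no secondary spectrum and therefore cannot pass the spectral part of the search threshold.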

Construction of the retention time prediction model: 127 hydroxycinnamamide standards (including N-(p-Coumaroyl)-spermidine, N-Sinapoyl-tyramine, N'-Cinnamoyl-Sinapoyl-putrescine and N'-(p-Coumaroyl)-bis-caffeoyl-spermidine) were analyzed under the same UHPLC-HRMS acquisition conditions as the plant extract to obtain their experimental liquid chromatography retention times. The 1D and 2D molecular descriptors of each standard were calculated from its SDF file on the open-source website ChemDes (http://www.scbdd.com/ChemDes). A retention time prediction model was then built by multiple linear regression with stepwise variable selection, using the liquid chromatography retention time as the dependent variable and the molecular descriptors as independent variables.
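As a rough illustration of the regression step, the sketch below fits a multiple linear regression by ordinary least squares in plain Python. It assumes the descriptors have already been chosen, so the stepwise selection mentioned above is not implemented; the function names are hypothetical.

```python
def fit_linear_model(descriptor_rows, retention_times):
    """Ordinary least squares; returns [intercept, b1, ..., bk]."""
    rows = [[1.0] + list(x) for x in descriptor_rows]
    k = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y).
    system = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, retention_times))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(system[r][col]))
        system[col], system[pivot] = system[pivot], system[col]
        scale = system[col][col]
        system[col] = [v / scale for v in system[col]]
        for r in range(k):
            if r != col:
                factor = system[r][col]
                system[r] = [a - factor * b for a, b in zip(system[r], system[col])]
    return [row[k] for row in system]

def predict_rt(coefficients, descriptors):
    """Predicted retention time for one metabolite's descriptor values."""
    return coefficients[0] + sum(c * d for c, d in zip(coefficients[1:], descriptors))
```

In practice the descriptor matrix would come from the ChemDes output for the standards, and the fitted coefficients would then be applied to every database metabolite to obtain t_{r,predicted}.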

Candidate metabolites were collected from the open-source plant hydroxycinnamamide metabolome database (https://pubs.acs.org/doi/abs/10.1021/acs.analchem.8b03654), which records 846 hydroxycinnamamides. First, the theoretical primary ion mass-to-charge ratio m/z_{theoretical} of each hydroxycinnamamide was calculated from its molecular formula in the database, and the predicted retention time t_{r,predicted} of the 846 hydroxycinnamamides was obtained with the retention time prediction model constructed above. The primary ion mass-to-charge ratios m/z_{measured} and experimental retention times t_{r,measured} of the 173 hydroxycinnamamides obtained in the non-targeted metabolomics experiment on the spiked plant extract were then searched against the open-source plant hydroxycinnamamide metabolome database. The database entries that simultaneously satisfied

|t_{r,predicted} - t_{r,measured}| < 2 min

and |m/z_{theoretical} - m/z_{measured}| / m/z_{theoretical} × 1000000 < 5 ppm, 220 hydroxycinnamamides in total, were taken as candidate metabolites, and their SMILES, names, molecular formulas, molecular structures and predicted retention times were collected to construct the candidate hydroxycinnamamide database.
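The two screening conditions can be written as a small filter. The sketch below is illustrative only: the record type `DbEntry` and the function names are hypothetical, while the tolerances (2 min, 5 ppm) and the N-Sinapoyl-serotonin values used in the usage note are the ones reported in this example.

```python
from dataclasses import dataclass

PPM_TOL = 5.0     # mass tolerance in ppm, from the text
RT_TOL_MIN = 2.0  # retention time tolerance in minutes, from the text

@dataclass(frozen=True)
class DbEntry:
    """Hypothetical record for one database metabolite."""
    name: str
    mz_theoretical: float
    rt_predicted: float  # minutes

def ppm_error(mz_theoretical, mz_measured):
    """Relative mass error in parts per million."""
    return abs(mz_theoretical - mz_measured) / mz_theoretical * 1_000_000

def is_candidate(entry, mz_measured, rt_measured):
    """True when a measured peak satisfies both screening conditions."""
    return (
        abs(entry.rt_predicted - rt_measured) < RT_TOL_MIN
        and ppm_error(entry.mz_theoretical, mz_measured) < PPM_TOL
    )

def screen_candidates(database, peaks):
    """Keep database entries matched by at least one (m/z, t_r) peak."""
    return [e for e in database if any(is_candidate(e, mz, rt) for mz, rt in peaks)]
```

For instance, `DbEntry("N-Sinapoyl-serotonin", 383.1607, 6.62)` is matched by the measured peak at m/z 383.1594 and 6.97 min, since the mass error is about 3.4 ppm and the retention time difference is 0.35 min.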

Construction of the molecular structure association network: Morgan fingerprints were computed from the molecular structures of the candidate hydroxycinnamamides, and the similarity between the Morgan fingerprints of every pair of candidates was calculated. With the molecular fingerprint similarity threshold set to 0.7, the candidate hydroxycinnamamides as nodes and the Morgan fingerprint similarity between pairs of candidates as edges, a molecular structure association network with 220 nodes and 3866 edges was constructed.
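A dependency-free sketch of the network construction follows. The source computes fingerprints and similarities with RDKit and does not name the similarity coefficient; here fingerprints are represented as sets of on-bit indices and the Tanimoto coefficient is assumed. The toy fingerprints in the usage note are made up for illustration.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def build_network(fingerprints, threshold):
    """Return {metabolite: set of adjacent metabolites}.

    An edge is drawn when the pairwise similarity is >= threshold.
    """
    adjacency = {name: set() for name in fingerprints}
    for a, b in combinations(fingerprints, 2):
        if tanimoto(fingerprints[a], fingerprints[b]) >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency
```

With the hypothetical fingerprints `{"A": {1, 2, 3, 4}, "B": {1, 2, 3, 4, 5}, "C": {7, 8, 9}}` and the threshold 0.7 used in this example, A and B are connected (similarity 0.8) and C remains isolated.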

Identification based on the molecular structure association network: the spiked metabolites detected in the non-targeted UHPLC-HRMS metabolome data were identified with the constructed molecular structure association network as the background network. The specific identification process is as follows:

1) Six hydroxycinnamamides were identified in the non-targeted UHPLC-HRMS metabolome data of the spiked plant extract and used as seed metabolites. The seed metabolites were mapped onto the established molecular structure association network and their adjacent metabolites were obtained from the network, where adjacent metabolites are those connected to a seed directly by an edge. FIG. 2 is a partial enlarged view of the network, in which seed metabolite 1 is N-Caffeoyl-5-methoxytryptamine and has 5 adjacent metabolites: adjacent metabolite 1 is N-Sinapoyl-serotonin, adjacent metabolite 2 is N,N'-Feruloyl-cinnamyl-cadaverine, adjacent metabolite 3 is N,N'-(p-Coumaroyl)-feruloyl-agmatine, adjacent metabolite 4 is N-Feruloyl-octopamine and adjacent metabolite 5 is N-Caffeoyl-serotonin. The m/z_{theoretical} and t_{r,predicted} of adjacent metabolites 1 to 5 are, respectively, m/z 383.1607, 6.62 min; m/z 409.2127, 8.90 min; m/z 453.2138, 8.14 min; m/z 330.1341, 5.91 min; and m/z 339.1345, 6.19 min.

2) The secondary mass spectrum of the seed metabolite is assigned to each adjacent metabolite as its "pseudo secondary mass spectrum". The search thresholds are set as:

|t_{r,predicted} - t_{r,measured}| < 2 min,

|m/z_{theoretical} - m/z_{measured}| / m/z_{theoretical} × 1000000 < 5 ppm,

and the similarity between the experimental secondary mass spectrum and the pseudo secondary mass spectrum of the adjacent metabolite is greater than or equal to 0.5.

The identification procedure is illustrated as follows. As shown in FIG. 3, the secondary spectrum of seed metabolite 1 (red spectrum in the figure) is taken as the "pseudo secondary spectrum" of its 5 adjacent metabolites. For each adjacent metabolite, the experimental data are searched for a metabolite peak that matches its m/z_{theoretical}, t_{r,predicted} and pseudo secondary spectrum. A metabolite peak with a retention time of 6.97 min and [M+H]⁺ at m/z 383.1594 was found in the experimental data. Relative to adjacent metabolite 1 (N-Sinapoyl-serotonin), |t_{r,predicted} - t_{r,measured}| = 0.35 min and |m/z_{theoretical} - m/z_{measured}| / m/z_{theoretical} × 1000000 = 3.4 ppm, and the similarity between the experimental secondary spectrum of this peak (blue) and the "pseudo secondary spectrum" of adjacent metabolite 1 (red) was 0.86. The metabolite peak was therefore identified as N-Sinapoyl-serotonin. In the same way, 3 metabolite peaks with (m/z_{measured}, t_{r,measured}, secondary spectrum similarity) of m/z 409.2109, 9.34 min, 0.78; m/z 453.2118, 7.92 min, 0.76; and m/z 330.1330, 5.71 min, 0.86 matched adjacent metabolites 2, 3 and 4, respectively, and these 3 metabolite peaks were also successfully identified.

3) When several matching peaks are found in the experimental data, the matches are scored according to the following rule:

Score = 0.25 × (1 - |m/z_{theoretical} - m/z_{measured}| × 1000000 / (m/z_{theoretical} × 5)) + 0.25 × (1 - |t_{r,predicted} - t_{r,measured}| / 2) + 0.5 × secondary spectrum similarity

For example, 3 metabolite peaks in the experimental data match adjacent metabolite 5 and all meet the search thresholds. Their (m/z_{measured}, t_{r,measured}, secondary mass spectrum similarity) values are m/z 339.1332, 5.89 min, 0.77; m/z 339.1330, 5.47 min, 0.61; and m/z 339.1335, 6.63 min, 0.63. Scoring these 3 results gives 0.66, 0.50 and 0.62, respectively. The identification results are output in descending order of score, a higher score indicating higher reliability. A metabolite peak identified in this way no longer takes part in the next round of identification as a seed.

4) The metabolites identified above are then used as new seeds and the identification process is repeated until no new metabolite peaks are identified. For example, the metabolite peak (m/z 383.1594, 6.97 min) was successfully identified in the experimental data as N-Sinapoyl-serotonin (adjacent metabolite 1 in FIG. 2), and its experimental secondary spectrum was assigned to next-order adjacent metabolite 1 in FIG. 2 (N,N',N''-Feruloyl-bis-cinnamoyl-spermidine) as its "pseudo secondary spectrum". The m/z_{theoretical} and t_{r,predicted} of next-order adjacent metabolite 1 are 582.2968 and 11.19 min. A metabolite peak meeting the thresholds (m/z 582.2948, 11.65 min) was found in the experimental data; the similarity between its experimental secondary spectrum and the pseudo secondary spectrum was 0.75, so the match succeeded. The metabolite peak (m/z 582.2948, 11.65 min) was identified as N,N',N''-Feruloyl-bis-cinnamoyl-spermidine and, as a new seed, entered the same identification process.
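The scoring rule of step 3) can be checked numerically with the three peaks reported for adjacent metabolite 5 (m/z_{theoretical} 339.1345, t_{r,predicted} 6.19 min). The function name and argument layout below are illustrative; the weights and tolerances are those of the formula in the text.

```python
def match_score(mz_theoretical, mz_measured, rt_predicted, rt_measured,
                ms2_similarity, ppm_tol=5.0, rt_tol=2.0):
    """Weighted score: 25% mass accuracy, 25% retention time, 50% MS/MS."""
    ppm = abs(mz_theoretical - mz_measured) / mz_theoretical * 1_000_000
    mass_term = 1 - ppm / ppm_tol
    rt_term = 1 - abs(rt_predicted - rt_measured) / rt_tol
    return 0.25 * mass_term + 0.25 * rt_term + 0.5 * ms2_similarity

# (m/z measured, t_r measured in min, secondary spectrum similarity)
peaks = [(339.1332, 5.89, 0.77), (339.1330, 5.47, 0.61), (339.1335, 6.63, 0.63)]
ranked = sorted(
    peaks,
    key=lambda p: match_score(339.1345, p[0], 6.19, p[1], p[2]),
    reverse=True,
)
```

With these inputs the peak at 5.89 min ranks first with a score of about 0.66, in agreement with the text, and the order of the three peaks matches the reported ranking. The other two scores may differ from the reported values in the second decimal place, because the inputs printed in the text are rounded.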

By this method, starting from 6 hydroxycinnamamides as initial seed metabolites, 167 hydroxycinnamamides were successfully identified, with an identification accuracy of 98.8%. Of these, 141 were ranked first, 19 second, 5 third and 2 fourth. Identifications ranked lower than first arise because 80 of the 169 hydroxycinnamamides have isomers with similar retention times and secondary mass spectra.

The identification results were compared with conventional database searching. The database ReSpect (http://spectra.psc.riken.jp/) contains only 23 hydroxycinnamamides and METLIN (https://metlin.scripps.edu) contains 44, and these databases hold almost no secondary spectra of hydroxycinnamamides. Searching them therefore relies on the primary ion mass-to-charge ratio alone, so the reliability of the identification cannot be guaranteed and the coverage is limited.

The results show that the metabolite identification method based on the molecular structure association network is independent of large-scale experimental secondary spectrum databases and achieves reliable identification, and that the coverage of metabolome annotation can be significantly expanded by using open-source structure databases.

Example 2

The invention is applied to the identification of metabolites in a real biological sample extract. The metabolome of a plant tissue (maize filament) is extracted, UHPLC-HRMS data are acquired for the maize filament extract, and the resulting non-targeted metabolome data are identified.

The procedure and conditions are the same as in example 1, except that:

extraction of plant tissue metabolome: as in example 1.

Non-targeted metabonomics data acquisition: as in example 1.

Acquisition of experimental chromatography-mass spectrometry information: from the non-targeted metabolomics data of the maize filament extract, a peak table containing the experimental retention time t_{r,measured} and the primary mass spectrum information, i.e. the primary ion mass-to-charge ratio m/z_{measured}, was obtained with the software Compound Discoverer 3.1 and exported as an Excel table. The raw data were converted with the software ProteoWizard into an .mgf file containing the corresponding secondary mass spectrum information, i.e. the mass-to-charge ratios and intensities of the secondary ions.
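Reading the secondary spectra back from the converted file could be done with a minimal parser such as the sketch below. It assumes the common Mascot Generic Format layout (`BEGIN IONS` / `END IONS` blocks, `KEY=value` header lines, then "m/z intensity" peak lines); the function name and the example content are hypothetical, and a real converted file may carry additional fields.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {"params": {...}, "peaks": [...]}."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is None or not line:
            continue  # skip text outside ion blocks and blank lines
        elif "=" in line and not line[0].isdigit():
            key, value = line.split("=", 1)
            current["params"][key.upper()] = value
        else:
            fields = line.split()
            current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra

# Made-up example content for illustration only.
example_mgf = """BEGIN IONS
TITLE=scan=1
PEPMASS=383.1594
RTINSECONDS=418.2
100.0 10.0
200.5 20.0
END IONS
"""
spectra = parse_mgf(example_mgf)
```

Each parsed spectrum keeps its header values as strings, so the precursor m/z and retention time can be converted with `float()` before they are matched to the peak table.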

Construction of the retention time prediction model: 254 standard samples (including 1,3-Dihydroxyacetone, benzoic acid, methionine sulfoxide, 7-methoxorimarin, vibraactone B and nardosinone) were analyzed in positive ion mode and 327 standard samples (including 3-Hydroxypropanoic acid, 2-hydroxyquinoline, coixol, 6-Benzylaminopurine, quercetin and daphnoretin) were analyzed in negative ion mode to obtain their experimental liquid chromatography retention times. The 1D and 2D molecular descriptors of each standard were calculated from its SDF file on the open-source website ChemDes (http://www.scbdd.com/ChemDes). Retention time prediction models for the positive and negative ion modes were then built separately by multiple linear regression with stepwise variable selection, using the liquid chromatography retention time as the dependent variable and the molecular descriptors as independent variables.

Candidate metabolites were collected from the open-source metabolome databases Universal Natural Products Database UNPD (http://pkuxxj.pku.edu.cn/UNPD/), Plant Metabolic Network (https://plantcyc.org/) and KEGG (https://www.genome.jp/kegg/). First, the theoretical primary ion mass-to-charge ratio m/z_{theoretical} of each metabolite was calculated from its molecular formula in the databases, and the predicted retention time t_{r,predicted} of each metabolite was obtained with the retention time prediction models described above. The primary ion mass-to-charge ratios m/z_{measured} and experimental retention times t_{r,measured} of the metabolite peaks obtained in the non-targeted metabolomics experiment on the plant extract were then searched against the open-source metabolome databases. The database entries that simultaneously satisfied

|t_{r,predicted} - t_{r,measured}| < 2 min,

|m/z_{theoretical} - m/z_{measured}| / m/z_{theoretical} × 1000000 < 5 ppm were taken as candidate metabolites, and their SMILES, names, molecular formulas, molecular structures and predicted retention times were collected to construct the candidate metabolite database.
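Deriving a theoretical m/z from a molecular formula can be sketched as follows. The source does not state which adducts or mass conventions were used, so the [M+H]⁺ and [M-H]⁻ calculations below, which add or subtract one proton mass, are assumptions; the element table covers only a few common elements and the function names are hypothetical.

```python
import re

# Monoisotopic masses (u) of a few common elements.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.0078250,
    "N": 14.0030740,
    "O": 15.9949146,
    "P": 30.9737616,
    "S": 31.9720707,
}
PROTON = 1.0072765  # proton mass (u)

def monoisotopic_mass(formula):
    """Neutral monoisotopic mass of a simple formula such as 'C6H12O6'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass

def mz_protonated(formula):
    """Assumed [M+H]+ m/z for positive ion mode."""
    return monoisotopic_mass(formula) + PROTON

def mz_deprotonated(formula):
    """Assumed [M-H]- m/z for negative ion mode."""
    return monoisotopic_mass(formula) - PROTON
```

As a usage example, `mz_protonated("C6H12O6")` gives approximately 181.0707 for glucose. Formulas with brackets, charges or isotope labels are outside the scope of this sketch.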

Construction of the molecular structure association network: Morgan fingerprints were computed from the molecular structures of the candidate metabolites, and the similarity between the Morgan fingerprints of every pair of candidates was calculated. With the molecular fingerprint similarity threshold set to 0.6, the candidate metabolites as nodes and the Morgan fingerprint similarity between pairs of candidates as edges, molecular structure association networks were constructed. The network for the positive ion mode comprises 1965 metabolites (nodes) and 28199 edges (FIG. 4A); the network for the negative ion mode comprises 1945 metabolites (nodes) and 34451 edges (FIG. 4B).

Metabolite characterization based on the molecular structure association network: the constructed molecular structure association network is used as a background network to identify the experimental data collected by non-targeted ultra-high performance liquid chromatography-high resolution mass spectrometry and to determine the metabolites in the biological sample to be tested. The identification process is the same as in Embodiment 1.

This process shows that abundant candidate metabolites can be obtained from the metabolome data of complex plant tissue extracts. When the similarity between the Morgan fingerprints of the candidate metabolites is calculated and the molecular fingerprint similarity threshold is set to 0.6, a single connected network can be formed, so that large-scale qualitative analysis of the metabolome can be achieved.
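Whether a given threshold yields one connected network can be checked with a breadth-first search. The sketch below assumes that "connected" means every node is reachable from every other node through edges; the function name and the input layout (a node list and a list of node pairs) are hypothetical.

```python
from collections import deque


def is_connected(nodes, edges):
    """Return True if all nodes lie in one connected component.

    nodes: iterable of node names.
    edges: iterable of (name_a, name_b) pairs; both names must appear in nodes.
    """
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {name: set() for name in nodes}
    for name_a, name_b in edges:
        adjacency[name_a].add(name_b)
        adjacency[name_b].add(name_a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Visit neighbours that have not been reached yet.
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen == nodes
```

Raising the similarity threshold removes edges, so a check of this kind can indicate when the network starts to split into isolated groups that seed propagation cannot reach.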

Claims

1. A metabolome deep annotation method, characterized by:

firstly, performing non-targeted metabonomics analysis on a plant extract to be tested by ultra-high performance liquid chromatography-high resolution mass spectrometry, and obtaining chromatography-mass spectrometry information of the extract metabolome, including the retention time $t_{r,\text{measured}}$ of each experimentally detected metabolite peak, the mass-to-charge ratio $m/z_{\text{measured}}$ of the primary mass spectrometry ion, and the mass-to-charge ratios and intensities of the corresponding secondary mass spectrometry ions;

secondly, constructing a candidate metabolite molecular structure database; according to the primary ion mass-to-charge ratio $m/z_{\text{measured}}$ and the experimental retention time $t_{r,\text{measured}}$ of all metabolites in the biological sample extract to be tested obtained in the first step, metabolites matching $m/z_{\text{measured}}$ and $t_{r,\text{measured}}$ are screened from an open-source metabonomics database as candidate metabolites, and a candidate metabolite database is constructed; the database contains the simplified molecular-input line-entry system (SMILES) string, name, molecular formula, molecular structure and predicted retention time of each metabolite;

thirdly, constructing a metabolite molecular structure association network; obtaining a molecular fingerprint from the molecular structure of each metabolite in the candidate metabolite database; calculating the similarity between the molecular fingerprints of any two candidate metabolites, the similarity calculation being based on the open-source tool RDKit; setting a similarity threshold between molecular fingerprints in the range 0.5-0.8, taking the metabolites as nodes and the molecular fingerprint similarity as edges, and connecting metabolites whose molecular fingerprint similarity is greater than or equal to the threshold to construct the molecular structure association network;

fourthly, performing metabolite characterization based on the molecular structure association network; the molecular structure association network constructed in the third step is used as a background network to identify the experimental data collected by non-targeted ultra-high performance liquid chromatography-high resolution mass spectrometry, and the metabolites in the biological sample to be tested are determined;

in the first step, the primary mass spectrometry ions are ions acquired directly by the mass spectrometer after the compound is ionized; the secondary mass spectrometry ions are ions acquired after the primary ions are fragmented by collision at a certain applied energy;

in the second step, the candidate metabolites are obtained as follows: the theoretical primary mass spectrum mass-to-charge ratio $m/z_{\text{theoretical}}$ in positive and negative ionization modes is obtained from the molecular formula of each metabolite in the public metabolome database, and the predicted retention time $t_{r,\text{predicted}}$ is obtained from the structural parameters of the metabolite; the inclusion criteria for candidate metabolites are that both of the following are satisfied:

$|t_{r,\text{predicted}} - t_{r,\text{measured}}| < 2\ \text{min}$ and

$\dfrac{|m/z_{\text{theoretical}} - m/z_{\text{measured}}|}{m/z_{\text{theoretical}}} \times 10^{6} < 5\ \text{ppm}$;

in the fourth step, the metabolite identification method based on the molecular structure association network is as follows: taking the candidate metabolite database as a reference, 5-50 metabolites are selected from the candidate metabolite database and, using standard samples of these 5-50 metabolites, are identified in the non-targeted ultra-high performance liquid chromatography-high resolution mass spectrometry metabolome experimental data as seed metabolites; the seed metabolites are mapped into the established molecular structure association network, and the adjacent metabolites of each seed metabolite are obtained from the network; the secondary mass spectrum of the seed metabolite is assigned to each adjacent metabolite as its pseudo-secondary mass spectrum; a search threshold is set, and the experimental data are searched for metabolite peaks that match the $m/z_{\text{theoretical}}$, $t_{r,\text{predicted}}$ and pseudo-secondary mass spectrum of the adjacent metabolite; if the matching succeeds, the metabolite peak is identified; the identified metabolites are used as new seeds, and the qualitative process is repeated until no new metabolites are identified;

when several matching results exist, the matching results are scored and ranked from high to low by score, a higher-scoring metabolite peak being identified with higher accuracy, and the metabolites so identified are no longer used as new seeds;

the search threshold is:

$|t_{r,\text{predicted}} - t_{r,\text{measured}}| < 2\ \text{min}$,

$\dfrac{|m/z_{\text{theoretical}} - m/z_{\text{measured}}|}{m/z_{\text{theoretical}}} \times 10^{6} < 5\ \text{ppm}$,

and the similarity between the experimental secondary mass spectrum of the metabolite peak and the pseudo-secondary mass spectrum of the adjacent metabolite is ≥ 0.5.

2. The method according to claim 1, wherein: the molecular fingerprint in the third step is any one of a Morgan fingerprint, a MACCS fingerprint, an atom-pair fingerprint and a Daylight fingerprint.

3. The method according to claim 1, wherein: the predicted retention time of the metabolite is predicted by a retention time prediction model constructed from the structure-retention relationship of known metabolites.

4. The method according to claim 1, wherein: adjacent metabolites are those directly connected by an edge in the molecular structure association network.
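The seed-based propagation recited in claim 1 can be illustrated with a minimal Python sketch. Several points are assumptions that the claims do not specify: cosine similarity on m/z-rounded peaks is used as the spectral similarity, the same similarity is used as the score when several peaks match, and a newly identified metabolite passes on the experimental secondary spectrum of its matched peak. All names and data layouts are hypothetical.

```python
import math
from collections import deque

RT_TOL_MIN, PPM_TOL, MS2_SIM_MIN = 2.0, 5.0, 0.5  # thresholds from claim 1


def bin_spectrum(spectrum, decimals=2):
    """Sum intensities of (m/z, intensity) peaks after rounding m/z (assumed binning)."""
    binned = {}
    for mz, intensity in spectrum:
        key = round(mz, decimals)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity of two spectra; an assumed choice of similarity measure."""
    a, b = bin_spectrum(spec_a), bin_spectrum(spec_b)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def propagate(adjacency, metabolites, peaks, seeds):
    """Identify metabolites by spreading seed spectra along network edges.

    adjacency:   dict name -> set of adjacent metabolite names
    metabolites: dict name -> (theoretical m/z, predicted retention time)
    peaks:       list of dicts with keys "mz", "rt", "ms2"
    seeds:       dict name -> secondary spectrum of a standard-confirmed seed
    Returns a dict name -> index of the best-matching peak.
    """
    seed_spectra = dict(seeds)
    identified = {}
    queue = deque(seeds)
    while queue:
        seed = queue.popleft()
        for neighbour in sorted(adjacency.get(seed, ())):
            if neighbour in seed_spectra or neighbour in identified:
                continue
            mz_theoretical, rt_predicted = metabolites[neighbour]
            matches = []
            for index, peak in enumerate(peaks):
                if abs(rt_predicted - peak["rt"]) >= RT_TOL_MIN:
                    continue
                if abs(mz_theoretical - peak["mz"]) / mz_theoretical * 1e6 >= PPM_TOL:
                    continue
                # The seed spectrum acts as the neighbour's pseudo-secondary spectrum.
                score = cosine_similarity(seed_spectra[seed], peak["ms2"])
                if score >= MS2_SIM_MIN:
                    matches.append((score, index))
            if not matches:
                continue
            matches.sort(reverse=True)          # highest score first
            identified[neighbour] = matches[0][1]
            if len(matches) == 1:               # ambiguous hits are not reused as seeds
                seed_spectra[neighbour] = peaks[matches[0][1]]["ms2"]
                queue.append(neighbour)
    return identified
```

The queue empties once no new seed is added, which corresponds to repeating the qualitative process until no new metabolites are identified.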