CN111896609B - Method for analyzing mass spectrum data based on artificial intelligence - Google Patents

Publication number
CN111896609B
Authority
CN
China
Prior art keywords
layer
sample
feature
mass spectrum
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010707525.4A
Other languages
Chinese (zh)
Other versions
CN111896609A (en)
Inventor
钱昆
徐伟
曹敬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010707525.4A
Publication of CN111896609A
Application granted
Publication of CN111896609B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N27/00 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means
    • G01N27/62 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode
    • G01N27/64 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode using wave or particle radiation to ionise a gas, e.g. in an ionisation chamber
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

A method for analyzing mass spectrometry data based on artificial intelligence, the method comprising: collecting a small-molecule metabolite fingerprint spectrum of each sample with a laser-assisted desorption/ionization mass spectrometer; extracting absolute intensities from the fingerprint; and inputting the processed data into a multi-layer neural network to group the samples. To calculate how much each feature contributes to sample discrimination, the fingerprint spectrum data are converted into two-dimensional images; the data in the resulting metabolite screening picture library are evaluated with a saliency analysis method, all features are ranked, and the substances contributing most to sample discrimination are screened out. The beneficial effects of the invention are as follows: mass spectrum data are grouped rapidly, and the interpretability of the classification model is greatly improved.

Description

Method for analyzing mass spectrum data based on artificial intelligence
Technical Field
The invention belongs to the field of artificial-intelligence-assisted mass spectrum data mining, and particularly relates to an artificial intelligence analysis technique that uses metabolite fingerprint spectra obtained by mass spectrometry to construct a sample grouping model and to calculate the importance of features for grouping.
Background
Mass spectrometry offers high-throughput analysis and simultaneous detection of multiple metabolites, and is the primary method for untargeted metabolite detection. However, its application is hampered by lengthy sample pretreatment, which is required because biological samples are highly complex and their metabolites are present at low abundance. With the development of nanotechnology, recently developed nano-assisted laser desorption/ionization mass spectrometry (LDI-MS) has become the most practical tool for metabolic analysis owing to its high analysis throughput (300 samples/hour) and accurate metabolite identification (mass error <50 ppm).
Deep learning is mainly applied to the auxiliary analysis of large, high-dimensional data sets, accepts many types of input data, and has become a leading-edge technology for analyzing medical data. As a frontier of machine learning, it is now a major analysis tool and is widely used in many fields, including biomedicine, because it learns the regularities in data by optimizing a loss function and mines latent features of the data. Traditional machine learning methods, however, are not well suited to analyzing and mining mass spectrum data: each sample has a very large number of features, so problems that reduce accuracy, such as overfitting and underfitting, can occur. Furthermore, deep learning is a black box, and it is difficult to select important features from its output scores to explain the mechanism behind a diagnosis.
The main method for selecting features in deep learning is the saliency map, which is mainly applied in the image field: by comparing salient regions in the examined images, the regions that differ most are screened out quickly and intuitively. The technique has been extended to complex scene understanding in fields such as neuroscience, psychology, and medical diagnosis. However, this saliency analysis method has not yet been applied to mass spectrometry data analysis.
Disclosure of Invention
To address the long analysis time, high data dimensionality, and complex feature combinations encountered in mass spectrum data analysis, the invention provides a method that groups samples rapidly with a classification model built on an improved multi-layer neural network and calculates the importance of each feature's contribution to the classification. The method is rapid, accurate, and efficient, and greatly improves the interpretability of the classification model.
A method for analyzing mass spectrum data based on artificial intelligence uses a multi-layer neural network to analyze and process the mass spectrum data so as to realize grouping of samples;
the method comprises the following steps:
step 1: pipetting the sample onto a mass spectrum target plate and drying it to form a thin layer for subsequent mass spectrum analysis;
step 2: collecting a small-molecule metabolite fingerprint spectrum of each sample in positive ion mode over the m/z range 100 to 1000 with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure;
step 3: extracting the absolute intensities of the raw metabolic fingerprint, and applying centering preprocessing to the data extracted from all samples;
step 4: inputting the data from step 3 into the neural network to group the samples.
Further, in step 2, at least 2 independent experiments are performed on each sample to eliminate intra-individual deviation and improve the reproducibility and stability of the analysis.
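The centering in step 3 and the use of repeated experiments can be illustrated with a minimal sketch. It assumes that replicate spectra of one sample are combined by averaging and that centering means subtracting the per-feature mean across samples; the text specifies neither, and the function names and synthetic array sizes are hypothetical.

```python
import numpy as np

def average_replicates(spectra):
    """Collapse replicates: (n_samples, n_replicates, n_features) -> (n_samples, n_features).

    Averaging is an assumption; the source only says replicates are acquired.
    """
    return np.asarray(spectra, dtype=float).mean(axis=1)

def center_features(X):
    """Subtract the per-feature mean computed across all samples (assumed centering)."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0, keepdims=True)

# Synthetic stand-in for real data: 6 samples, 2 replicates each, 901 intensity values.
rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 1e4, size=(6, 2, 901))
X_centered = center_features(average_replicates(raw))
```

After centering, every feature column has zero mean, so the network sees deviations from the cohort average rather than raw intensities.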
Further, a multi-layer neural network comprising: network input, network main body, network output; the network main body comprises a feature extraction part, a nonlinear feature interaction layer and a classification layer; the network input is processed by the feature extraction part, the output of the feature extraction part is processed by the nonlinear feature interaction layer, the output of the nonlinear feature interaction layer is processed by the classification layer, and the output of the classification layer is the network output;
the principle formulas from the network input up to the classification layer are:
x_input=concatenate(x_spectral,x_ext) (1)
x_fs=feature_extract(x_input) (2)
x_nl=feature_interaction(x_fs) (3)
y_pred=softmax(x_nl) (4)
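Formulas (1) to (4) describe a plain composition of stages, which the following sketch traces end to end. The two stage functions passed in are placeholders rather than the patented layers, and the linear map `W_out`, `b_out` from the 96 features to class scores is an assumption, since formula (4) writes the softmax directly on x_nl.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def predict(x_spectral, x_ext, feature_extract, feature_interaction, W_out, b_out):
    x_input = np.concatenate([x_spectral, x_ext], axis=1)  # formula (1)
    x_fs = feature_extract(x_input)                         # formula (2)
    x_nl = feature_interaction(x_fs)                        # formula (3)
    return softmax(x_nl @ W_out + b_out)                    # formula (4), assumed linear head

rng = np.random.default_rng(0)
x_spectral = rng.normal(size=(4, 992))
x_ext = rng.normal(size=(4, 32))
# Placeholder stages for illustration only: keep the first 96 inputs, then pass through.
extract = lambda x: x[:, :96]
interact = lambda x: x
W_out = rng.normal(scale=0.1, size=(96, 2))
b_out = np.zeros(2)
y_pred = predict(x_spectral, x_ext, extract, interact, W_out, b_out)
```

Each row of `y_pred` is a probability distribution over the sample groups.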
Further, the network input is a 1×1024-dimensional multi-modal feature (x_input) that contains the raw mass spectral data input (x_spectral); the remaining positions are filled with 0. Because the number of samples is limited, a simple scaling and centering is applied to all multi-modal features.
Further, the feature extraction part (feature_extract) is formed by stacking four LocallyConnected1D layers. Each LocallyConnected1D layer divides all features into 32 segments and performs fully connected feature extraction on each segment separately (32 fully connected layers, each with its own parameters). This reflects the positional correlation of features in mass spectrum data and is compatible with the final 32 external multi-modal features. Compared with a four-layer fully connected architecture, it models the feature extraction process of mass spectrum data finely while reducing the network width and parameter count, which also indirectly reduces overfitting.
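A minimal NumPy sketch of such a stack follows. It takes the 32 segments and the 96 output features (32 segments × 3 units) from the text; the intermediate widths 16, 8, and 4, the ReLU activation, and the initialization are assumptions chosen only to make the example run.

```python
import numpy as np

def locally_connected(x, W, b):
    """One layer: an independent dense map per segment, followed by ReLU (assumed).

    x: (batch, segments, d_in); W: (segments, d_in, d_out); b: (segments, d_out).
    """
    return np.maximum(np.einsum("bsi,sio->bso", x, W) + b, 0.0)

def feature_extract(x_input, layers, n_segments=32):
    """Split the input into segments and apply the stacked per-segment layers."""
    batch = x_input.shape[0]
    h = x_input.reshape(batch, n_segments, -1)
    for W, b in layers:
        h = locally_connected(h, W, b)
    return h.reshape(batch, -1)

def init_layers(rng, widths=(32, 16, 8, 4, 3), n_segments=32):
    """Four layers; the per-segment widths are hypothetical except the final 3."""
    layers = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(scale=np.sqrt(2.0 / d_in), size=(n_segments, d_in, d_out))
        layers.append((W, np.zeros((n_segments, d_out))))
    return layers

rng = np.random.default_rng(0)
layers = init_layers(rng)
x = rng.uniform(-1.0, 1.0, size=(5, 1024))
x_fs = feature_extract(x, layers)  # shape (5, 96)
```

Because every segment has its own weights and never sees the others, perturbing one m/z window changes only that window's three output features, which is the positional locality the text describes.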
Further, the principle formula of the four-layer LocallyConnected1D stack is: (formula not reproduced in this text)
Further, a nonlinear feature interaction layer (feature_interaction) learns the nonlinear relationships among the 96 hidden features produced by the feature extraction part. Each layer of the feature interaction part simultaneously extracts discretized ReLU activation features and an approximate quadratic relationship of linear feature combinations, fuses them through a residual connection or concatenation so that nonlinear features are captured better, and applies dropout regularization to further mitigate overfitting and improve generalization. The nonlinear feature interaction layer can be regarded as a novel self-attention mechanism suited to fully connected layers: it fuses discrete and quadratic features and, compared with a multi-layer fully connected architecture, strengthens nonlinearity while reducing the network width and parameter count, which helps reduce overfitting with limited samples and improves the final classification performance.
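Since the layer's formula is not available in this text, the following is only one possible reading of the description, not the patented formulation: a ReLU branch and a squared-linear branch are concatenated, projected back to 96 features, passed through dropout, and added to the input as a residual. The hidden width of 24 and all parameter names are hypothetical.

```python
import numpy as np

def feature_interaction(x, params, drop_rate=0.0, rng=None):
    """One interaction layer under the assumptions stated above."""
    Wa, ba, Wb, bb, Wf, bf = params
    relu_part = np.maximum(x @ Wa + ba, 0.0)   # piecewise-linear ("discretized") features
    quad_part = (x @ Wb + bb) ** 2             # quadratic in a linear combination of features
    fused = np.concatenate([relu_part, quad_part], axis=1) @ Wf + bf
    if drop_rate > 0.0 and rng is not None:    # inverted dropout, used during training only
        keep = rng.random(fused.shape) >= drop_rate
        fused = fused * keep / (1.0 - drop_rate)
    return x + fused                            # residual connection keeps 96 features

def init_params(rng, n_features=96, n_hidden=24):
    s = 0.1
    return (rng.normal(scale=s, size=(n_features, n_hidden)), np.zeros(n_hidden),
            rng.normal(scale=s, size=(n_features, n_hidden)), np.zeros(n_hidden),
            rng.normal(scale=s, size=(2 * n_hidden, n_features)), np.zeros(n_features))

rng = np.random.default_rng(0)
params = init_params(rng)
x_fs = rng.normal(size=(4, 96))
x_nl = feature_interaction(x_fs, params)  # shape (4, 96)
```

With the fusion weights set to zero the layer reduces to the identity, which is the usual motivation for a residual design: the interaction terms only have to learn a correction to the extracted features.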
Further, the principle formula of the nonlinear feature interaction layer is: (formula not reproduced in this text)
Further, untargeted detection is performed on the metabolic fingerprints after sample pretreatment to obtain a database of the relevant metabolites, a mapping between grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.
The method further comprises:
step 11: converting the mass spectrum data into two-dimensional images and constructing a metabolite screening picture library;
step 12: evaluating the data in the metabolite screening picture library with the saliency maps method, and ranking all features to screen out the substances that contribute most to sample discrimination.
Further, the neural network is trained with the sample data: 3/4 of the training set data are randomly taken as a training group and 1/4 as a test group. 10-fold cross-validation training is performed on the training group samples with the multi-layer neural network, and classification is achieved by computing the mean accuracy of the final model.
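The data partitioning described here can be sketched with index arithmetic alone. The sample count of 200 and the function names are hypothetical; model fitting is omitted.

```python
import numpy as np

def split_train_test(n_samples, rng, test_fraction=0.25):
    """Random split into a training group (3/4) and a test group (1/4)."""
    perm = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return perm[n_test:], perm[:n_test]

def k_fold(indices, k=10):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    folds = np.array_split(indices, k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

rng = np.random.default_rng(0)
train_idx, test_idx = split_train_test(200, rng)
splits = list(k_fold(train_idx))
# After fitting one model per fold, the reported score would be np.mean(fold_accuracies).
```

Each training-group sample is used for validation exactly once, so the mean of the ten fold accuracies estimates performance without touching the test group.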
The invention has the following technical effects: the mass spectrum data are rapidly grouped, and the interpretability of the classification model is greatly improved.
Drawings
Fig. 1 is a schematic diagram of a neural network structure in one embodiment of the invention.
Detailed Description
The following description of the preferred embodiments of the present application will make the technical contents thereof more clear and easier to understand. This application may be embodied in many different forms of embodiments and the scope of protection is not limited to the embodiments set forth herein.
The conception, specific structure and technical effects of the present invention will be further described below to fully understand the objects, features and effects of the present invention, but the protection of the present invention is not limited thereto.
In one embodiment of the present invention, data of a sample to be inspected is prepared first, and the steps are as follows:
step 1: pipetting the sample onto a mass spectrum target plate and drying it to form a thin layer for subsequent mass spectrum analysis;
step 2: collecting a small-molecule metabolite fingerprint spectrum of each sample in positive ion mode over the m/z range 100 to 1000 with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure, and performing five independent experiments on each sample to eliminate intra-individual deviation and improve the repeatability and stability of the diagnosis result;
step 3: extracting absolute intensities from the raw metabolic fingerprint (m/z 100 to 1000), and applying centering preprocessing to the data extracted from all samples for subsequent machine learning;
untargeted detection is performed on the metabolic fingerprints after sample pretreatment to obtain a database of the relevant metabolites, a mapping between grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.
In this embodiment, the neural network structure for processing data extracted from a sample is as follows:
The input to the network is a 1×1024-dimensional multi-modal feature (x_input) that contains the raw mass spectral data input (x_spectral); the remaining positions are filled with 0s. Because the number of samples is limited, a simple scaling and centering to the range (-1, 1) is applied to all features. The main body of the network has two parts: a feature extraction part (feature_extract), which receives the input directly, followed by a nonlinear feature interaction layer (feature_interaction). Finally, the 96 recombined features are input into a Softmax classification layer, which outputs the classification probabilities.
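Assembling the network input can be sketched as follows. It assumes per-feature min-max scaling to [-1, 1] and one intensity value per m/z unit (901 values); the text states only the target range and the input width of 1024, and the function names are hypothetical.

```python
import numpy as np

def scale_to_unit_range(X):
    """Min-max scale each feature (column) to [-1, 1]; constant columns become 0."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = hi - lo
    safe_span = np.where(span > 0, span, 1.0)  # avoid division by zero
    scaled = 2.0 * (X - lo) / safe_span - 1.0
    scaled[:, span == 0] = 0.0
    return scaled

def build_input(x_spectral, x_ext=None, width=1024):
    """Concatenate the feature blocks and zero-fill the remaining positions."""
    parts = [x_spectral] if x_ext is None else [x_spectral, x_ext]
    x = np.concatenate(parts, axis=1)
    if x.shape[1] > width:
        raise ValueError("more features than the input width")
    padding = np.zeros((x.shape[0], width - x.shape[1]))
    return np.concatenate([x, padding], axis=1)

rng = np.random.default_rng(0)
x_spectral = scale_to_unit_range(rng.uniform(0.0, 1e4, size=(5, 901)))
x_input = build_input(x_spectral)  # shape (5, 1024)
```

Scaling before padding keeps the zero-filled positions at exactly 0, the midpoint of the scaled range.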
Principle formulas from the 1024-dimensional input to the Softmax layer:
x_input=concatenate(x_spectral,x_ext) (1)
x_fs=feature_extract(x_input) (2)
x_nl=feature_interaction(x_fs) (3)
y_pred=softmax(x_nl) (4)
The feature extraction part (feature_extract) is formed by stacking four LocallyConnected1D layers. Each LocallyConnected1D layer divides all features into 32 segments and performs fully connected feature extraction on each segment separately (32 fully connected layers, each with its own parameters). This reflects the positional correlation of features in mass spectrum data and is compatible with the final 32 external multi-modal features. Compared with a four-layer fully connected architecture, it models the feature extraction process of mass spectrum data finely while reducing the network width and parameter count, which also indirectly reduces overfitting.
Principle formula of the four-layer LocallyConnected1D stack: (formula not reproduced in this text)
The nonlinear feature interaction layer (feature_interaction) learns the nonlinear relationships among the 96 hidden features produced by the feature extraction part. Each layer of the feature interaction part simultaneously extracts discretized ReLU activation features and an approximate quadratic relationship of linear feature combinations, fuses them through a residual connection or concatenation so that nonlinear features are captured better, and applies dropout regularization to further mitigate overfitting and improve generalization. The nonlinear feature interaction layer can be regarded as a novel self-attention mechanism suited to fully connected layers: it fuses discrete and quadratic features and, compared with a multi-layer fully connected architecture, strengthens nonlinearity while reducing the network width and parameter count, which helps reduce overfitting with limited samples and improves the final classification performance.
Principle formula of the nonlinear feature interaction layer: (formula not reproduced in this text)
The neural network is trained with the sample data: 3/4 of the training set data are randomly taken as a training group and 1/4 as a test group. 10-fold cross-validation training is performed on the training group samples with the multi-layer neural network, and classification is achieved by computing the mean accuracy of the final model.
The trained network is then used to analyze the blind test set samples; the accuracy of its predictions verifies that the grouping model based on the multi-layer neural network achieves accurate classification.
The contribution of each feature to the classification model is further analyzed in the following steps:
step 11: converting the mass spectrum data into two-dimensional images and constructing a metabolite screening picture library;
step 12: evaluating the data in the metabolite screening picture library with the saliency maps method, and ranking all features to screen out the substances that contribute most to sample discrimination.
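Steps 11 and 12 can be illustrated with a gradient-based saliency sketch. To stay self-contained it uses a single softmax layer as a stand-in for the trained network, so the gradient has a closed form; the 32 × 32 image layout of the 1024 features is an assumption, as the text does not say how spectra are mapped to images.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def saliency(x, W, b, target):
    """|d log p_target / d x| for the stand-in model p = softmax(W x + b)."""
    p = softmax(W @ x + b)
    one_hot = np.zeros_like(p)
    one_hot[target] = 1.0
    return np.abs(W.T @ (one_hot - p))

rng = np.random.default_rng(0)
n_features, n_classes = 1024, 2
W = rng.normal(scale=0.05, size=(n_classes, n_features))  # stand-in weights, not a trained model
b = np.zeros(n_classes)
x = rng.uniform(-1.0, 1.0, size=n_features)

s = saliency(x, W, b, target=1)
saliency_image = s.reshape(32, 32)  # assumed two-dimensional layout for inspection
ranking = np.argsort(-s)            # feature indices, most influential first
```

With a real multi-layer network the same quantity would be obtained by backpropagating the class score to the input; the top entries of `ranking` then point to the m/z positions, and hence candidate metabolites, that drive the grouping.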
Preferred embodiments of the present application are described in detail above. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the present application by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the conception of the present application shall be within the scope of protection defined by the claims.

Claims (5)

1. A method for analyzing mass spectrum data based on artificial intelligence is characterized in that a multi-layer neural network is used for analyzing and processing the mass spectrum data to realize grouping of samples; the method comprises the following steps:
step 1: pipetting the sample onto a mass spectrum target plate and drying it to form a thin layer for subsequent mass spectrum analysis;
step 2: collecting a small-molecule metabolite fingerprint spectrum of each sample in positive ion mode over the m/z range 100 to 1000 with a laser-assisted desorption/ionization mass spectrometer, without any smoothing procedure;
step 3: extracting absolute intensities from the fingerprint, and applying centering preprocessing to the extracted data;
step 4: inputting the data processed in the step 3 into the multi-layer neural network for sample grouping processing;
the multi-layer neural network comprises: network input, network main body, network output; the network main body comprises a feature extraction part, a nonlinear feature interaction layer and a classification layer; the network input is processed by the characteristic extraction part, the output of the characteristic extraction part is processed by the nonlinear characteristic interaction layer, the output of the nonlinear characteristic interaction layer is processed by the classification layer, and the output of the classification layer is the network output;
the principle formulas from the network input up to the classification layer are:
x_input=concatenate(x_spectral,x_ext)(1)
x_fs=feature_extract(x_input)(2)
x_nl=feature_interaction(x_fs)(3)
y_pred=softmax(x_nl)(4);
the network input is a 1×1024-dimensional multi-modal feature, wherein the multi-modal feature comprises the raw mass spectrum data input and the remaining positions are filled with 0, and all features are simply scaled and centered because the number of samples is limited;
the feature extraction part is formed by stacking four layers of local connection layers, and each local connection layer divides all features into 32 sections for full-connection feature extraction;
the principle formula of the four-layer local connection layer stack is as follows: (formula not reproduced in this text)
the nonlinear feature interaction layer learns nonlinear relations of 96 hidden features obtained by the feature extraction part, and obtains 96 recombined features after feature recombination, and finally inputs the 96 recombined features into the Softmax classification layer for classification probability output;
the principle formula of the nonlinear feature interaction layer is as follows: (formula not reproduced in this text)
2. The method for analyzing mass spectrum data based on artificial intelligence according to claim 1, wherein in step 2, at least 2 independent experiments are performed for each of the samples.
3. The method for analyzing mass spectrum data based on artificial intelligence according to claim 1, wherein untargeted detection is performed on the fingerprint to obtain a database of the relevant metabolites, a mapping between grouping information and the metabolic spectra is constructed, and the samples are divided into a training set and a blind test set.
4. The method for analyzing mass spectrum data based on artificial intelligence according to claim 3, wherein 3/4 of the training set data are used as a training group and 1/4 as a test group, 10-fold cross-validation training is performed on the training group samples with the multi-layer neural network, and classification is achieved by computing the mean accuracy of the final model.
5. The method for analyzing mass spectrum data based on artificial intelligence according to claim 3, further comprising the steps of:
step 11: converting the fingerprint spectrum data into two-dimensional images and constructing a metabolite screening picture library;
step 12: evaluating the data in the metabolite screening picture library with a saliency analysis method, ranking all features, and screening out the substances that contribute most to sample discrimination.
CN202010707525.4A 2020-07-21 2020-07-21 Method for analyzing mass spectrum data based on artificial intelligence Active CN111896609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010707525.4A CN111896609B (en) 2020-07-21 2020-07-21 Method for analyzing mass spectrum data based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN111896609A (en) 2020-11-06
CN111896609B (en) 2023-08-08

Family

ID=73190809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010707525.4A Active CN111896609B (en) 2020-07-21 2020-07-21 Method for analyzing mass spectrum data based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN111896609B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022266928A1 (en) * 2021-06-24 2022-12-29 中山大学 Metabolic characteristic spectrum inference method and system, and computer device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004038602A1 (en) * 2002-10-24 2004-05-06 Warner-Lambert Company, Llc Integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications
CN102282559A (en) * 2008-10-20 2011-12-14 诺丁汉特伦特大学 Data analysis method and system
CN111292801A (en) * 2020-01-21 2020-06-16 西湖大学 Method for evaluating thyroid nodule by combining protein mass spectrum with deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Size-selected Core-shell Nanoalloys for Laser Desorption/ionization Detection of Small Metabolites; Jing Cao et al.; IEEE; pp. 350-353 *

Also Published As

Publication number Publication date
CN111896609A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN107766933B (en) Visualization method for explaining convolutional neural network
Chatzidakis et al. Towards calibration-invariant spectroscopy using deep learning
CN110110743B (en) Automatic recognition system and method for seven-class mass spectrum
CN113011357B (en) Depth fake face video positioning method based on space-time fusion
Hu et al. Emerging computational methods in mass spectrometry imaging
CN110309867B (en) Mixed gas identification method based on convolutional neural network
Zhao et al. Interpretable deep learning-assisted laser-induced breakdown spectroscopy for brand classification of iron ores
Ly et al. A new approach for quantifying morphological features of U3O8 for nuclear forensics using a deep learning model
CN116363440B (en) Deep learning-based identification and detection method and system for colored microplastic in soil
CN112149758A (en) Hyperspectral open set classification method based on Euclidean distance and deep learning
CN111896609B (en) Method for analyzing mass spectrum data based on artificial intelligence
CN110579554A (en) 3D mass spectrometric predictive classification
Muzakir et al. Model for Identification and Prediction of Leaf Patterns: Preliminary Study for Improvement
Li et al. MSSort-DIAXMBD: A deep learning classification tool of the peptide precursors quantified by OpenSWATH
CN112560925A (en) Complex scene target detection data set construction method and system
CN109447009B (en) Hyperspectral image classification method based on subspace nuclear norm regularization regression model
CN116665039A (en) Small sample target identification method based on two-stage causal intervention
CN105844297A (en) Local spatial information-based encapsulation type hyperspectral band selection method
CN113554176B (en) Metabolic profile inference method, system, computer device, and storage medium
CN114141316A (en) Method and system for predicting biological toxicity of organic matters based on spectrogram analysis
CN113705731A (en) End-to-end image template matching method based on twin network
Tung et al. SIGMA: Spectral interpretation using gaussian mixtures and autoencoder
Martyna et al. Hybrid Likelihood Ratio Models for Forensic Applications: a Novel Solution to Determine the Evidential Value of Physicochemical Data
CN109190713A (en) The minimally invasive fast inspection technology of oophoroma based on serum mass spectrum adaptive sparse feature selecting
CN114047214B (en) Improved DBN-MORF soil heavy metal content prediction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant