WO2022121055A1 - Metabolomics-based physiological prediction method, device, computer equipment and medium - Google Patents

Metabolomics-based physiological prediction method, device, computer equipment and medium

Info

Publication number
WO2022121055A1
Authority
WO
WIPO (PCT)
Prior art keywords
mass spectrometry
detection data
prediction
spectrometry detection
physiological
Prior art date
Application number
PCT/CN2020/142331
Other languages
English (en)
French (fr)
Inventor
吴思
谭国斌
麦泽彬
张健锋
黄福桂
李文锋
吴日伟
牛红志
Original Assignee
广州禾信仪器股份有限公司
昆山禾信质谱技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州禾信仪器股份有限公司, 昆山禾信质谱技术有限公司
Publication of WO2022121055A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N27/00 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means
    • G01N27/62 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Definitions

  • the present application relates to the technical field of data analysis, in particular to a metabolomics-based physiological prediction method, device, computer equipment and storage medium.
  • VOCs Volatile Organic Compounds
  • the data mining technology used in the breath detection analysis method applied to the detection of physiological state is not perfect, and the analysis methods covered are relatively simple, which leads to the lack of reliability of the detection results of physiological state.
  • a metabolomics-based physiological prediction method comprising:
  • Physiological prediction is performed on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain the probability value of each prediction category;
  • the physiological prediction result corresponding to the breath sample to be detected is determined according to the probability value of each of the prediction categories.
  • the pre-built multivariate statistical analysis model is obtained by coupling a principal component analysis model, an orthogonal partial least squares discriminant analysis model and an artificial neural network model, and performing physiological prediction on the mass spectrometry detection data through the pre-built multivariate statistical analysis model to obtain the probability value of each prediction category includes:
  • the trained artificial neural network model is used for prediction and identification, and the probability value of each prediction category is obtained.
  • the method further includes:
  • the second mass spectrometry detection data is subjected to normalization processing by a Z normalization method to obtain preprocessed mass spectrometry detection data.
  • the method further includes:
  • the mass spectrometry detection data of the exhalation samples in the training set corresponds to a physiological real category
  • the multivariate statistical analysis model is trained by using the mass spectrometry detection data of the breath samples in the training set to obtain the pre-built multivariate statistical analysis model.
  • the method further includes:
  • the univariate analysis method and the multivariate analysis method are used to screen the mass spectrometry detection data in sequence to obtain variables that meet preset conditions, including:
  • the first variable group is analyzed by the variable importance projected value method, and a second variable group whose variable importance projected value is higher than the second reference threshold is obtained.
  • the first number of differential markers includes: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propyl furan, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran, choline;
  • the first number of differential markers include: acetone (m/z 58), isoprene (m/z 68), phenol (m/z 94), ethylbenzene (m/z 106).
  • a metabolomics-based physiological prediction device comprising:
  • a data acquisition module for acquiring mass spectrometry detection data of the breath sample to be detected
  • a probability value prediction module configured to perform physiological prediction on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain the probability value of each prediction category;
  • the result determination module is configured to determine the physiological prediction result corresponding to the exhalation sample to be detected according to the probability value of each of the prediction categories.
  • a computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the above-mentioned metabolomics-based physiological prediction method when the computer program is executed.
  • a computer-readable storage medium having a computer program stored thereon, the computer program implementing the above-mentioned metabolomics-based physiological prediction method when executed by a processor.
  • in the above-mentioned metabolomics-based physiological prediction method, device, computer equipment and storage medium, mass spectrometry detection data of the breath sample to be detected is acquired, physiological prediction is performed on that data through a pre-built multivariate statistical analysis model to obtain the probability value of each prediction category, and the physiological prediction result corresponding to the breath sample to be detected is determined according to the probability value of each prediction category.
  • the method performs data mining and analysis on the mass spectrometry detection data of the breath sample to be detected through a multivariate statistical analysis model, can predict the physiological state according to the specific components in the breath sample, and improves the reliability of the physiological state discrimination result.
  • FIG. 1 is an application environment diagram of a metabolomics-based physiological prediction method in one embodiment
  • FIG. 2 is a schematic flowchart of a metabolomics-based physiological prediction method in one embodiment
  • FIG. 3 is a schematic flowchart of a multivariate statistical analysis step in one embodiment
  • FIG. 4 is a data analysis flowchart of a metabolomics-based physiological prediction method in another embodiment
  • FIG. 5 is a PCA analysis diagram in one embodiment
  • FIG. 6 is an OPLS-DA analysis diagram in one embodiment
  • FIG. 7 is a permutation test diagram in one embodiment
  • FIG. 8 is a ROC graph in one embodiment
  • FIG. 9 is an overview of the pathway analysis in one embodiment
  • FIG. 10 is a mass spectrogram of a differential marker in another embodiment
  • FIG. 11 is a structural block diagram of a metabolomics-based physiological prediction device in one embodiment
  • FIG. 12 is a diagram of the internal structure of a computer device in one embodiment.
  • the metabolomics-based physiological prediction method provided in this application can be applied in the application environment shown in FIG. 1 .
  • the terminal 101 communicates with the server 102 through the network.
  • the terminal 101 may be a mass spectrometer or other component detection device capable of performing component detection on the breath sample, and the server 102 may be implemented by an independent server or a server cluster composed of multiple servers.
  • a metabolomics-based physiological prediction method is provided, and the method is applied to the server 102 in FIG. 1 as an example for illustration, including the following steps:
  • Step S201 acquiring mass spectrometry detection data of the breath sample to be detected.
  • the mass spectrometry detection data is the mass spectrometry data of each component, obtained by qualitative and quantitative analysis of the volatile organic compounds (VOCs) in the gas with an online mass spectrometer, and includes the characteristic peak intensity corresponding to each component.
  • the above-mentioned mass spectrometry detection data also includes mass spectrometry detection data obtained by off-line detection; samples can be collected with an air bag or detected directly online.
  • the above-mentioned breath sample is detected by a mass spectrometer to obtain mass spectral detection data including characteristic peaks of each component.
  • Step S202 performing physiological prediction on the above-mentioned mass spectrometry detection data by using a pre-built multivariate statistical analysis model to obtain the probability value of each prediction category.
  • the multivariate statistical analysis model is a mathematical operation model that can analyze the statistical laws of multiple objects and multiple indicators when they are related to each other.
  • the multivariate statistical analysis model can be used to analyze the specific markers in the exhaled breath of people in different physiological states. A multivariate statistical analysis model is established from the composition of the substances and the changes in the relationships between them; it performs predictive analysis on the breath samples of people in specific physiological states and obtains the probability values of the different predicted categories, such as the probability values of various physiological states, including an early, mid or late physiological state.
  • Step S203 determining the physiological prediction result corresponding to the breath sample to be detected according to the probability value of each prediction category.
  • multiple prediction categories can be set, the final prediction result is determined according to the size of the probability value of each prediction category, and the prediction category with the largest probability value is taken as the final physiological prediction result.
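The selection rule in step S203 can be sketched in a few lines of Python. The category names and probability values below are hypothetical placeholders, not values from the application; only the rule itself (take the category with the largest probability) comes from the text.

```python
def predict_category(probabilities):
    """Return the (category, probability) pair with the largest probability."""
    if not probabilities:
        raise ValueError("no prediction categories supplied")
    return max(probabilities.items(), key=lambda item: item[1])


# Hypothetical probability values for three illustrative categories.
example = {"early": 0.12, "mid": 0.71, "late": 0.17}
category, probability = predict_category(example)
print(category, probability)
```

If two categories tie, `max` returns the first one encountered; the source does not say how ties should be resolved.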
  • the probability value of each predicted category is obtained, and the physiological prediction result corresponding to the above-mentioned breath sample to be detected is determined according to the probability value of each predicted category.
  • the method performs data mining and analysis on the mass spectrometry detection data of the breath sample to be detected through a multivariate statistical analysis model, can predict different physiological states according to specific components in the breath sample, and improves the reliability of the judgment result of the physiological state.
  • FIG. 3 shows a schematic flowchart of the above-mentioned step S202, wherein the above-mentioned pre-built multivariate statistical analysis model is obtained by coupling a principal component analysis (PCA) model, an orthogonal partial least squares discriminant analysis (OPLS-DA) model and an artificial neural network (ANN) model, and the above step S202 specifically includes:
  • PCA Principal-component Analysis
  • OPLS-DA Orthogonal Partial Least Squares Discriminant Analysis
  • ANN Artificial Neural Network
  • Step S301 performing dimension reduction processing on the above-mentioned mass spectrometry detection data through a principal component analysis model, to obtain dimension-reduced mass spectrometry detection data.
  • PCA is a statistical method for dimensionality reduction. From a mathematical point of view, PCA uses an orthogonal transformation to represent the p-dimensional X space by an m-dimensional Y space (m < p) spanned by the principal components. While converting multiple indicators into a few comprehensive indicators, it retains most of the characteristic information of the original variables, so it can be used to show intuitively whether there is a classification trend between groups.
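As a rough illustration of the orthogonal projection PCA performs, the sketch below extracts only the first principal component by power iteration on the sample covariance matrix. It is a minimal standard-library example, not the implementation used in the application; the function name, the iteration count and the starting vector are assumptions made for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    rows: list of samples, each a list of p feature values (e.g. peak intensities).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration converges to the dominant eigenvector of cov, provided
    # the (arbitrary) start vector is not orthogonal to that eigenvector.
    v = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Scores are the projections of the centered samples onto the direction.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


direction, scores = first_principal_component([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
print(direction, scores)
```

Further components would require deflating the covariance matrix or a full eigendecomposition; a numerical library is the practical choice for data with hundreds of peaks.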
  • Step S302 performing regression analysis on the dimensionality-reduced mass spectrometry detection data by using an orthogonal partial least squares discriminant analysis model to obtain a variable importance projection value of each metabolite.
  • the variable importance projection value of each metabolite, namely the VIP value (Variable Importance in the Projection), expresses the importance of the variable to the model
  • VIP value Variable Importance in the Projection, the importance of the variable to the model
  • Substances with higher VIP values are more critical differential metabolites, i.e. marker metabolites.
  • OPLS-DA model analysis is used to further determine the differences between groups.
  • OPLS-DA is a regression modeling method of multiple dependent variables on multiple independent variables. It removes the factors in the independent variable X that are unrelated to the categorical variable Y, concentrates the classification information into one principal component, allows the characteristics of multidimensional data to be observed on a two-dimensional plane, and gives a detailed practical explanation of the regression model, which is a great advantage among classification methods.
  • the abscissa is the score value of the main component in the OSC process, reflecting the difference between groups
  • the ordinate is the score value of the orthogonal component, reflecting the difference within the group.
  • R2 and Q2 are generally used to evaluate the model.
  • R2X and R2Y represent the interpretation rate of the X and Y matrices of the built model, respectively.
  • Q2 represents the predictive ability of the model. In theory, the closer the values of R2 and Q2 are to 1, the better the model. If the values of R2 and Q2 are higher than 0.5, the model fitting accuracy is good, and values higher than 0.4 are generally acceptable.
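The two quality metrics can be illustrated with the usual sum-of-squares definition, 1 − (residual sum of squares / total sum of squares). The sketch assumes that R2Y is computed from the fitted values and Q2 from cross-validated (held-out) predictions, which is the common convention; the source does not give the formulas, and the function names are hypothetical. The 0.5 and 0.4 cut-offs in `fit_quality` are the ones stated above.

```python
def explained_fraction(y_true, y_pred):
    """1 - RSS/TSS. With fitted values this is R2Y; with held-out
    (cross-validated) predictions it is Q2."""
    mean_y = sum(y_true) / len(y_true)
    tss = sum((y - mean_y) ** 2 for y in y_true)
    rss = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))
    return 1.0 - rss / tss


def fit_quality(value):
    """Apply the thresholds quoted in the text (> 0.5 good, > 0.4 acceptable)."""
    if value > 0.5:
        return "good"
    if value > 0.4:
        return "acceptable"
    return "poor"


# Hypothetical class labels (0/1) and model outputs.
y = [0, 1, 0, 1]
r2y = explained_fraction(y, [0.1, 0.9, 0.2, 0.8])
print(r2y, fit_quality(r2y))
```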
  • Step S303 perform prediction and identification through the trained artificial neural network model according to the variable importance projection value of each metabolite, and obtain the probability value of each of the predicted categories.
  • the ANN artificial neural network model is used to model and analyze the VIP value of each metabolite in each of the above prediction categories. An ANN is a system with learning ability whose main job is to build a model and determine the weight values. The weight values are adjusted until the output of the neurons is consistent with the output of the real training samples; the trained artificial neural network model is then used for prediction and recognition to obtain the probability value of each of the predicted categories.
  • model verification can also be performed on the data analysis results of the above-mentioned multivariate statistical analysis model.
  • a permutation test and the receiver operating characteristic (ROC) curve are used to verify whether the OPLS-DA model and the ANN model, respectively, are overfitted.
  • the permutation test re-models and predicts after randomly shuffling the samples of the original model, and repeats this multiple times to check whether the model is overfitted.
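A generic label-permutation test can be sketched as follows. `score_fn` is a hypothetical callable: in a real workflow it would refit the model on the permuted labels and return a quality metric such as Q2, whereas the toy version here only measures the gap between group means of one made-up feature. The number of permutations and the seed are arbitrary assumptions.

```python
import random


def permutation_test(labels, score_fn, n_permutations=200, seed=0):
    """Compare the score on the true labels with scores on shuffled labels.

    Returns (observed_score, empirical_p_value). A model that scores about
    as well on shuffled labels as on the true ones is likely overfitted.
    """
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed:
            hits += 1
    # The +1 terms count the unpermuted labelling itself.
    return observed, (hits + 1) / (n_permutations + 1)


# Toy data: one made-up feature that separates the two groups clearly.
feature = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def mean_gap(lbls):
    """Stand-in score: absolute difference between the two group means."""
    g0 = [x for x, l in zip(feature, lbls) if l == 0]
    g1 = [x for x, l in zip(feature, lbls) if l == 1]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


observed, p_value = permutation_test(labels, mean_gap)
print(observed, p_value)
```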
  • the ROC curve is a method to study the relationship between model sensitivity and specificity, with 1 − specificity as the abscissa and sensitivity as the ordinate. The evaluation is based on the area under the curve (AUC): the closer the AUC is to 1, the better the performance of the model, and if the AUC is less than 0.5, the accuracy of the model is poor.
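The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counting one half). The sketch below uses that pairwise formulation; the labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for four samples.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```

The pairwise loop is quadratic in the number of samples, which is fine for a few hundred breath samples; a rank-based formula scales better.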
  • the original data is subjected to dimensionality reduction through a principal component analysis (PCA) model, an orthogonal partial least squares discriminant analysis (OPLS-DA) model is used to identify and classify marker metabolites on the dimensionality-reduced data, and the artificial neural network (ANN) model analyzes and predicts the differential metabolites in each predicted category, which yields an accurate and comprehensive classification and improves the accuracy and comprehensiveness of the prediction.
  • PCA principal component analysis
  • OPLS-DA orthogonal partial least squares discriminant analysis
  • the method further includes: filtering out the missing values of the mass spectrometry detection data according to a preset rule to obtain first mass spectrometry detection data;
  • the missing value of the first mass spectrometry detection data is filled to obtain the second mass spectrometry detection data;
  • the second mass spectrometry detection data is standardized by the Z normalization method to obtain the preprocessed mass spectrometry detection data.
  • after the mass spectrometry detection data of the breath sample to be detected is acquired, the data also needs to be preprocessed; the preprocessing includes screening and filling of missing values, data standardization, and reliability analysis.
  • the "rule of 80" is used to remove the missing values of the original data: when a mass spectrum peak is non-zero in more than 80% of the samples of a certain group, the mass spectrum data of that peak is retained; otherwise it is rejected.
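A minimal sketch of this filter is shown below. It assumes the common reading of the rule, namely that a peak is kept if it is non-zero in more than 80% of the samples of at least one group; the source wording is ambiguous on that point, and the data layout (one list of peak intensities per sample) is also an assumption.

```python
def rule_of_80(intensities, groups, threshold=0.8):
    """Return the indices of the peaks that pass the "rule of 80".

    intensities: one list of peak intensities per sample.
    groups: the group label of each sample, in the same order.
    """
    kept = []
    for j in range(len(intensities[0])):
        for group in sorted(set(groups)):
            values = [row[j] for row, g in zip(intensities, groups) if g == group]
            nonzero = sum(1 for v in values if v != 0)
            if nonzero / len(values) > threshold:
                kept.append(j)
                break  # one qualifying group is enough (assumption)
    return kept


# Hypothetical intensities: 5 samples in group A, 5 in group B, 3 peaks.
data = [[1, 0, 3], [2, 0, 1], [1, 5, 2], [3, 0, 4], [2, 0, 1],
        [1, 0, 0], [2, 0, 0], [1, 0, 0], [4, 0, 0], [2, 0, 0]]
labels = ["A"] * 5 + ["B"] * 5
print(rule_of_80(data, labels))
```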
  • the missing values of the original data are filled, and the multiple imputation method is used to obtain the average level and variation level of each missing value, and finally integrate all the imputed data sets to obtain the final result.
  • data standardization is carried out, and the method used is Z-score standardization:
  • $z = \frac{x - \mu}{\sigma}$, where $x$ is the original value and $\mu$ is the mean of all sample data
  • $\sigma$ is the standard deviation of all sample data.
  • the processed data have a mean of 0 and a standard deviation of 1, that is, the data are scaled so that they fall within a small specific interval.
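The Z-score standardization described above can be sketched with the standard library. The population standard deviation is used here; the source does not say whether the population or the sample form is intended, so that choice is an assumption.

```python
import statistics


def z_score(values):
    """Standardize a list of values to zero mean and unit standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD (assumed; see lead-in)
    if sigma == 0:
        raise ValueError("standard deviation is zero; cannot standardize")
    return [(v - mu) / sigma for v in values]


# Hypothetical intensities of one peak across eight samples.
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
```

In practice the function would be applied to each peak (column) separately, so that peaks with very different intensity scales contribute comparably to the later models.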
  • n is the total number of samples
  • Si is the internal variance of the ith sample
  • Sx is the total variance of all samples. The higher the value of the reliability coefficient, the better the internal consistency of the data.
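The formula for the reliability coefficient did not survive in the text. The symbols n, Si and Sx and the reference to internal consistency match the usual form of Cronbach's alpha, $\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_i S_i}{S_x}\right)$, so the sketch below assumes that coefficient; this is an assumption, not something the source states. Each inner list is one series whose variance plays the role of Si, and Sx is the variance of the element-wise totals.

```python
import statistics


def cronbach_alpha(series):
    """Cronbach's alpha for n equally long series (assumed reliability coefficient)."""
    n = len(series)
    if n < 2:
        raise ValueError("at least two series are required")
    s_i = [statistics.variance(values) for values in series]  # S_i
    totals = [sum(column) for column in zip(*series)]
    s_x = statistics.variance(totals)                          # S_x
    return n / (n - 1) * (1 - sum(s_i) / s_x)


# Two perfectly consistent hypothetical series give alpha = 1.
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))
```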
  • the original data is filtered, filled and standardized sequentially through data preprocessing to obtain valid data, which will pave the way for subsequent data analysis and processing, so as to obtain more accurate classification and prediction results.
  • the above method further includes: acquiring mass spectrometry detection data of the exhalation samples in the training set; the mass spectrometry detection data of the exhalation samples in the training set corresponds to physiological real categories and/or health categories;
  • the above-mentioned multivariate statistical analysis model is trained on the mass spectrometry detection data to obtain the above-mentioned pre-built multivariate statistical analysis model.
  • the exhalation samples of the training set are obtained, including exhalation samples of the physiological state group and of the reference group. The exhalation samples in the physiological state group include samples of different stages, such as early, middle and late, and of different types, such as ulcer type and superficial type.
  • all these samples are samples of known physiological state type (that is, the physiological state type has been determined), and the mass spectrometry detection data of each breath sample corresponds to a physiological real category and/or a reference group category.
  • the above-mentioned multivariate statistical analysis model is trained using the mass spectrometry detection data of the exhaled samples in the training set; after multiple rounds of parameter adjustment, a multivariate statistical analysis model that can accurately identify the sample type is obtained, and this model is the pre-built multivariate statistical analysis model.
  • the above-mentioned method further includes: screening the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that satisfy a preset condition; inputting the differential mass-to-charge ratios corresponding to these variables into the preset first metabolic pathway database for retrieval and qualitative identification to obtain the first number of differential markers; and inputting the first number of differential markers into the preset second metabolic pathway analysis database for pathway analysis to obtain physiological metabolic pathways.
  • marker screening was performed using a combination of the t-test and variable importance in the projection (VIP) methods. Univariate analysis was carried out first: the t-test was used to screen out the variables with a statistical difference, characterized by the p-value. If p < 0.05, the characteristic variable differs significantly between groups. Based on this result, the VIP values from the multivariate model were analyzed, and variables whose VIP value exceeds a threshold between 1 and 2, chosen according to the number of variables, were selected as potential differential markers. Finally, the potential differential markers were screened out and their biological metabolism mechanisms were explored.
  • the screened metabolites with significant differences were entered into the MetPA (www.metaboanalyst.com) website for enrichment analysis of metabolic pathways, in order to further study the metabolic pathways involved in the physiological markers and their correlations with each other and to find the metabolic pathways most relevant to the physiological state, thereby predicting the possible physiological mechanisms of action.
  • MetPA www.metaboanalyst.com
  • the mass spectrometry detection data is screened by the univariate analysis method and the multivariate analysis method, and the differential mass-to-charge ratios corresponding to the screened metabolites are input into the preset metabolic pathway database for retrieval and qualitative analysis to obtain the first number of differential markers; the first number of differential markers are then input into the preset metabolic pathway analysis website for pathway analysis to obtain physiological metabolic pathways. This provides an effective screening method for differential markers, yields the corresponding metabolic pathways, and provides an accurate basis for the prediction and discrimination of specific physiological states.
  • the above-mentioned screening of mass spectrometry detection data through a univariate analysis method and a multivariate analysis method in sequence to obtain variables that meet preset conditions includes: screening the mass spectrometry detection data by a t-test method to obtain a first variable group whose interval probability is lower than the first reference threshold; and analyzing the first variable group by the variable importance projection value method to obtain a second variable group whose VIP value is higher than the second reference threshold.
  • marker screening was performed using a combination of the t-test and variable importance in the projection (VIP) methods. Univariate analysis was performed first: the t-test was used to screen out the variables with a statistical difference, characterized by the p-value (interval probability). If p < 0.05, the characteristic variable differs significantly between groups. Based on this result, the VIP values from the multivariate model were analyzed, and variables whose VIP value exceeds a threshold between 1 and 2, chosen according to the number of variables, were selected as potential differential markers. Finally, the potential differential markers were screened out and their biological metabolism mechanisms were explored.
  • VIP variable importance in the projection
  • the differential markers are obtained by screening against different thresholds, which provides data for subsequent prediction and identification.
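The two-stage screening can be sketched as a pair of filters. The example assumes the p-values and VIP values have already been computed by the t-test and the multivariate model; the function name, the dictionary layout and the numbers are hypothetical, while the default thresholds (p < 0.05, VIP > 1) follow the text.

```python
def screen_markers(p_values, vip_values, p_max=0.05, vip_min=1.0):
    """Keep variables with p below p_max, then those with VIP above vip_min.

    p_values, vip_values: dicts keyed by variable name (e.g. an m/z label).
    """
    first_group = [name for name, p in p_values.items() if p < p_max]
    second_group = [name for name in first_group
                    if vip_values.get(name, 0.0) > vip_min]
    return second_group


# Hypothetical statistics for three mass-to-charge ratios.
p_vals = {"m/z 84": 0.001, "m/z 120": 0.20, "m/z 124": 0.03}
vips = {"m/z 84": 1.5, "m/z 120": 2.0, "m/z 124": 0.8}
print(screen_markers(p_vals, vips))
```

Applying the t-test filter first keeps the VIP comparison restricted to variables that already differ significantly between groups, which mirrors the order described in the text.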
  • the above-mentioned first quantity of differential markers includes: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran, choline.
  • the differential mass-to-charge ratios were entered into the metabolic pathway databases HMDB and KEGG for qualitative search, and 11 differential markers were identified by comparison, taking into account the literature, the ionization energies of the substances and the definition of VOCs, namely: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran, choline.
  • GC-MS will be used to qualitatively analyze the substances, to clearly characterize the components of differentially marked metabolites, and to simultaneously verify the reliability of the results of mass spectrometry screening of markers.
  • differential marker categories are obtained by qualitatively identifying the differential markers, which provides a solid basis for judging the physiological state, and also provides an effective basis for analyzing the physiological state.
  • FIG. 4 shows a data analysis flowchart of a metabolomics-based physiological prediction method in a specific application scenario, which mainly includes:
  • breath sample data: a total of 153 breath samples were collected in this experiment, 88 from people in a specific physiological state and 65 from volunteers in other physiological states as the reference group.
  • the present invention selects 56 of the original 310 mass spectrum peaks as new variables.
  • the reliability analysis is performed on the data before and after preprocessing. The reliability coefficient is 0.575 before preprocessing and 0.995 after preprocessing, a significant improvement, so the preprocessing is judged to be effective. A high reliability coefficient generally means that the data reliability is good, and the result also shows that the subsequent evaluation results are credible.
  • the present invention combines three different classification methods to distinguish the exhaled mass spectral data of a specific physiological state group from those of a reference group.
  • unsupervised PCA is selected as a pre-analysis step to visually describe whether there is a classification trend between groups.
  • the output result R2X represents the percentage of all the observed value information covered by the principal components fitted by the model.
  • R2Y represents the percentage of all the sample variables that the principal components fitted by the model can explain
  • Q2Y is calculated through cross-validation to evaluate the predictive ability of the model.
  • the closer R2Y and Q2Y are to 1, the better the model.
  • ANN is a system with learning ability. The 153 samples are randomly divided into a training set and a validation set. The training set contains 111 samples, accounting for 72.5% of the total, including 63 samples from the specific physiological state group and 48 samples from the reference group; the validation set contains 42 samples, accounting for 27.5% of the total, including 25 samples from the specific physiological state group and 17 samples from the reference group. After the difference variables are input into the model for statistical analysis, the analysis results of the ANN model are obtained.
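A random split with the sample counts quoted above can be sketched as follows. The seed and the use of integer sample identifiers are assumptions; the source only states that the division was random, and the per-group counts it reports (63/48 and 25/17) would not be reproduced by this unstratified shuffle.

```python
import random


def split_samples(sample_ids, n_train, seed=0):
    """Randomly split sample identifiers into a training and a validation set."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


train_ids, validation_ids = split_samples(range(153), n_train=111)
# Shares of the total: 111/153 and 42/153, i.e. roughly 72.5% and 27.5%.
print(len(train_ids), len(validation_ids))
```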
  • the receiver operating characteristic ROC curve is used to further verify the classification ability of the ANN model.
  • the ROC curve is a method to study the relationship between the sensitivity and specificity of the model. The evaluation is based on the area under the curve (AUC): the closer the AUC is to 1, the better the performance of the model. As shown in Figure 8, the AUCs of the specific physiological state group and the reference group are both 0.999, close to 1, indicating that the model has a good classification effect and is suitable for discriminant analysis of different groups.
  • the present invention combines two marker screening methods.
  • univariate analysis is performed, and the t-test is used to screen out 39 variables with p < 0.01;
  • the VIP value analysis in the multivariate model was carried out, and the variables with VIP>1.2 were selected, 13 in total.
  • the mass-to-charge ratios (m/z) selected by the combination of the two methods are: 84, 120, 124, 110, 98, 70, 166, 355, 357, 108, 100, 82, and 104, respectively.
  • the differential mass-to-charge ratios were entered into the metabolic pathway databases HMDB and KEGG for qualitative search, and 11 differential markers were identified by comparison, taking into account the literature, the ionization energies of the substances and the definition of VOCs, namely: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran, choline.
  • GC-MS will be used to qualitatively analyze the substances, to clearly characterize the components of the differentially marked metabolites, and to simultaneously verify the reliability of the results of mass spectrometry screening of markers.
  • the present invention inputs 11 differential metabolites into the MetPA website for pathway analysis to find the metabolic pathway most relevant to a specific physiological state.
  • Figure 9 is an overview of the pathway analysis, the abscissa represents the importance of the metabolic pathway, and the ordinate represents the significance level of the metabolic pathway enrichment analysis.
  • the specific physiological state in this example mainly involves three metabolic pathways: steroid hormone biosynthesis and metabolism; glycine, serine and threonine metabolism; and glycerophospholipid metabolism.
  • FIG. 10 shows the mass spectrum of another group of differential markers.
  • both the specific physiological state group and the reference group showed peaks with different relative intensities at m/z 58, 68, 87, 94, 106, 136, 281 and 355, some of which were preliminarily characterized as acetone (m/z 58), isoprene (m/z 68), phenol (m/z 94), and ethylbenzene (m/z 106).
  • By analyzing the raw breath mass spectrometry data, this embodiment establishes a set of models and analysis methods suitable for predicting and discriminating physiological states.
  • The method follows the metabolomics workflow and uses several analysis methods that verify one another. Compared with a single data-processing method, it gives more comprehensive results and can greatly improve the accuracy and classification efficiency of discriminating and predicting specific physiological states. In addition, this embodiment provides a method for screening differential markers.
  • Breath metabolomics analysis of the specific physiological state group and the reference group screened out 13 differential metabolites between the two, providing an experimental basis for discriminating a specific physiological state and theoretical support for research on its prediction and discrimination.
  • Although the steps in the flowcharts of FIGS. 1-4 are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time but may be executed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
  • a metabolomics-based physiological screening device 1000 including: a data acquisition module 1001, a probability value prediction module 1002, and a result determination module 1003, wherein:
  • a data acquisition module 1001 configured to acquire mass spectrometry detection data of the breath sample to be detected
  • a probability value prediction module 1002 configured to perform physiological prediction on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain the probability value of each prediction category;
  • the result determination module 1003 is configured to determine the physiological prediction result corresponding to the breath sample to be detected according to the probability value of each of the prediction categories.
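  • The above-mentioned probability value prediction module 1002 is further configured to: perform dimensionality reduction on the mass spectrometry detection data through the principal component analysis model to obtain dimension-reduced mass spectrometry detection data; perform regression analysis on the dimension-reduced data through the orthogonal partial least squares discriminant analysis model to obtain the VIP value of each metabolite; and perform prediction and identification with the trained artificial neural network model according to the VIP value of each metabolite to obtain the probability value of each prediction category.
  • The above-mentioned data acquisition module 1001 is further configured to: screen out missing values of the mass spectrometry detection data according to a preset rule to obtain first mass spectrometry detection data; fill the missing values of the first mass spectrometry detection data with a preset imputation method to obtain second mass spectrometry detection data; and standardize the second mass spectrometry detection data by the Z normalization method to obtain preprocessed mass spectrometry detection data.
  • The above-mentioned data acquisition module 1001 is further configured to: acquire mass spectrometry detection data of training-set breath samples, where the data of each training-set breath sample correspond to a true category of a specific physiological state and/or a reference group category; and train the multivariate statistical analysis model with the mass spectrometry detection data of the training-set breath samples to obtain the pre-built multivariate statistical analysis model.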
  • The above-mentioned metabolomics-based physiological screening device 1000 further includes a marker acquisition unit 1004 and a metabolic pathway search unit 1005:
  • The marker acquisition unit 1004 is configured to screen the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that meet a preset condition, and to input the differential mass-to-charge ratios corresponding to those variables into a preset first metabolic pathway database for search and qualitative identification, obtaining a first number of differential markers;
  • The metabolic pathway search unit 1005 is configured to input the first number of differential markers into a preset second metabolic pathway analysis database for pathway analysis to obtain a physiological metabolic pathway.
  • The above-mentioned marker acquisition unit 1004 is further configured to screen the mass spectrometry detection data by the t-test method to obtain a first variable group whose p-value is lower than a first reference threshold, and to analyze the first variable group by the variable importance in projection (VIP) method to obtain a second variable group whose VIP value is higher than a second reference threshold.
  • The above-mentioned first number of differential markers includes: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran and choline.
  • Each module in the above-mentioned metabolomics-based physiological prediction device can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • A computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 12.
  • The computer device includes a processor, a memory and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, a computer program and a database.
  • The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
  • The database of the computer device is used to store mass spectrometry detection data and prediction result data.
  • The network interface of the computer device is used to communicate with an external terminal through a network connection.
  • The computer program, when executed by the processor, implements a metabolomics-based physiological prediction method.
  • FIG. 12 is only a block diagram of the partial structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, where a computer program is stored in the memory, and when the processor executes the computer program, the steps in the foregoing method embodiments are implemented.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps in the foregoing method embodiments are implemented.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Pure & Applied Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Electrochemistry (AREA)
  • Immunology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Biophysics (AREA)
  • Biochemistry (AREA)
  • Analytical Chemistry (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

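The present application relates to a metabolomics-based physiological prediction method, apparatus, computer device and storage medium. The application makes it possible to predict a specific physiological state from specific components in a breath sample, which improves the reliability of physiological state discrimination, provides an experimental basis for discriminating physiological stages and types, and provides theoretical support for research on predicting and discriminating specific physiological states. The method includes: acquiring mass spectrometry detection data of a breath sample to be tested; performing physiological prediction on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain a probability value for each prediction category; and determining the physiological prediction result corresponding to the breath sample to be tested according to the probability value of each prediction category.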

Description

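Metabolomics-based physiological prediction method, apparatus, computer device and medium
Technical Field
The present application relates to the technical field of data analysis, and in particular to a metabolomics-based physiological prediction method, apparatus, computer device and storage medium.
Background
Human exhaled breath contains more than 200 volatile organic compounds (VOCs). As the physiological state changes, the composition of the VOCs in the breath changes; breath analysis reflects the physiological state of the body by examining these changes in VOC composition.
The data mining techniques used in current breath analysis methods for physiological state detection are still immature and cover only a narrow range of analysis methods, so the physiological state detection results lack reliability.
Summary
In view of this, it is necessary to provide a metabolomics-based physiological prediction method, apparatus, computer device and storage medium that address the above technical problem.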
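A metabolomics-based physiological prediction method, the method including:
acquiring mass spectrometry detection data of a breath sample to be tested;
performing physiological prediction on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain a probability value for each prediction category;
determining the physiological prediction result corresponding to the breath sample to be tested according to the probability value of each prediction category.
In one embodiment, the pre-built multivariate statistical analysis model is obtained by coupling a principal component analysis model, an orthogonal partial least squares discriminant analysis model and an artificial neural network model, and performing physiological prediction on the mass spectrometry detection data through the pre-built multivariate statistical analysis model to obtain a probability value for each prediction category includes:
performing dimensionality reduction on the mass spectrometry detection data through the principal component analysis model to obtain dimension-reduced mass spectrometry detection data;
performing regression analysis on the dimension-reduced mass spectrometry detection data through the orthogonal partial least squares discriminant analysis model to obtain the variable importance in projection value of each metabolite;
performing prediction and identification with the trained artificial neural network model according to the variable importance in projection value of each metabolite to obtain the probability value of each prediction category.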
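In one embodiment, after acquiring the mass spectrometry detection data of the breath sample to be tested, the method further includes:
screening out missing values of the mass spectrometry detection data according to a preset rule to obtain first mass spectrometry detection data;
filling the missing values of the first mass spectrometry detection data with a preset imputation method to obtain second mass spectrometry detection data;
standardizing the second mass spectrometry detection data by the Z normalization method to obtain preprocessed mass spectrometry detection data.
In one embodiment, the method further includes:
acquiring mass spectrometry detection data of training-set breath samples, where the mass spectrometry detection data of the training-set breath samples correspond to true physiological categories;
training the multivariate statistical analysis model with the mass spectrometry detection data of the training-set breath samples to obtain the pre-built multivariate statistical analysis model.
In one embodiment, the method further includes:
screening the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that meet a preset condition;
inputting the differential mass-to-charge ratios corresponding to the variables that meet the preset condition into a preset first metabolic pathway database for search and qualitative identification to obtain a first number of differential markers;
inputting the first number of differential markers into a preset second metabolic pathway analysis database for pathway analysis to obtain a physiological metabolic pathway.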
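In one embodiment, screening the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that meet a preset condition includes:
screening the mass spectrometry detection data by the t-test method to obtain a first variable group whose p-value is lower than a first reference threshold;
analyzing the first variable group by the variable importance in projection method to obtain a second variable group whose variable importance in projection value is higher than a second reference threshold.
In one embodiment, the first number of differential markers includes: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran and choline;
or,
the first number of differential markers includes: acetone (m/z 58), isoprene (m/z 68), phenol (m/z 94) and ethylbenzene (m/z 106).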
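A metabolomics-based physiological prediction apparatus, the apparatus including:
a data acquisition module, configured to acquire mass spectrometry detection data of a breath sample to be tested;
a probability value prediction module, configured to perform physiological prediction on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain a probability value for each prediction category;
a result determination module, configured to determine the physiological prediction result corresponding to the breath sample to be tested according to the probability value of each prediction category.
A computer device, including a memory and a processor, the memory storing a computer program, where the processor implements the above metabolomics-based physiological prediction method when executing the computer program.
A computer-readable storage medium on which a computer program is stored, where the computer program implements the above metabolomics-based physiological prediction method when executed by a processor.
In the above metabolomics-based physiological prediction method, apparatus, computer device and storage medium, mass spectrometry detection data of a breath sample to be tested are acquired; physiological prediction is performed on these data through a pre-built multivariate statistical analysis model to obtain a probability value for each prediction category; and the physiological prediction result corresponding to the breath sample is determined according to those probability values. By mining and analyzing the mass spectrometry detection data of the breath sample with a multivariate statistical analysis model, the method can predict the physiological state from specific components in the breath sample, which improves the reliability of the physiological state discrimination result.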
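Brief Description of the Drawings
Fig. 1 is a diagram of the application environment of the metabolomics-based physiological prediction method in one embodiment;
Fig. 2 is a schematic flowchart of the metabolomics-based physiological prediction method in one embodiment;
Fig. 3 is a schematic flowchart of the multivariate statistical analysis step in one embodiment;
Fig. 4 is a schematic flowchart of the data analysis in the metabolomics-based physiological prediction method in another embodiment;
Fig. 5 is a PCA plot in one embodiment;
Fig. 6 is an OPLS-DA plot in one embodiment;
Fig. 7 is a permutation test plot in one embodiment;
Fig. 8 is an ROC curve in one embodiment;
Fig. 9 is an overview of the pathway analysis in one embodiment;
Fig. 10 is a mass spectrum of the differential markers in another embodiment;
Fig. 11 is a structural block diagram of the metabolomics-based physiological prediction apparatus in one embodiment;
Fig. 12 is a diagram of the internal structure of the computer device in one embodiment.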
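Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the application and not to limit it.
The metabolomics-based physiological prediction method provided by the present application can be applied in the application environment shown in Fig. 1, in which a terminal 101 communicates with a server 102 through a network. The terminal 101 may be a mass spectrometer or another composition detection device capable of detecting the components of a breath sample, and the server 102 may be implemented as a standalone server or as a cluster of multiple servers.
In one embodiment, as shown in Fig. 2, a metabolomics-based physiological prediction method is provided. Taking the application of the method to the server 102 in Fig. 1 as an example, the method includes the following steps:
Step S201: acquire mass spectrometry detection data of the breath sample to be tested.
Here, the mass spectrometry detection data include the mass spectrometry data of each component obtained by qualitative and quantitative analysis of the volatile organic compounds (VOCs) in the gas with an online mass spectrometer, including the characteristic peak intensity of each component. The mass spectrometry detection data also include data obtained by offline detection; samples may be collected in gas bags or measured directly online.
Specifically, after a breath sample is collected, it is analyzed with the mass spectrometer to obtain mass spectrometry detection data containing the characteristic peaks of each component.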
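Step S202: perform physiological prediction on the mass spectrometry detection data through the pre-built multivariate statistical analysis model to obtain a probability value for each prediction category.
Here, a multivariate statistical analysis model is a mathematical model that can analyze the statistical regularities of multiple objects and multiple indicators when they are correlated with one another.
Specifically, because the components of the breath of people in different physiological states differ statistically from those of ordinary people, a multivariate statistical analysis model can be built by analyzing the composition of specific markers in the breath of people in different physiological states and the changes in the relationships among those markers. The model can then perform predictive analysis on the breath sample of a person in a specific physiological state to obtain probability values for the different prediction categories, for example the probability values of various physiological states, including early, middle or late-stage physiological states.
Step S203: determine the physiological prediction result corresponding to the breath sample to be tested according to the probability value of each prediction category.
Specifically, multiple prediction categories can be set, and the final prediction result is judged from the magnitudes of their probability values: the prediction category with the largest probability value is taken as the final physiological prediction result.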
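The decision rule in step S203 can be sketched as a simple arg-max over the category probabilities. This is a minimal illustration, not the patent's implementation; the function name `pick_prediction` and the category names and values in `example` are hypothetical.

```python
def pick_prediction(probabilities):
    """Return the (category, probability) pair with the largest probability.

    probabilities: dict mapping a prediction category to its probability value.
    """
    if not probabilities:
        raise ValueError("no prediction categories supplied")
    # max() over the keys, ranked by their probability value
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]


# Hypothetical categories and values, for illustration only.
example = {"reference": 0.08, "early": 0.61, "middle": 0.23, "late": 0.08}
result = pick_prediction(example)
```

With the hypothetical values above, the category "early" is selected because it carries the largest probability.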
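In the above embodiment, mass spectrometry detection data of a breath sample to be tested are acquired; physiological prediction is performed on these data through a pre-built multivariate statistical analysis model to obtain a probability value for each prediction category; and the physiological prediction result corresponding to the breath sample is determined according to those probability values. By mining and analyzing the mass spectrometry detection data with a multivariate statistical analysis model, the method can predict different physiological states from specific components in the breath sample, which improves the reliability of the physiological state discrimination result.
In one embodiment, Fig. 3 shows a schematic flowchart of step S202 above. The pre-built multivariate statistical analysis model is obtained by coupling a principal component analysis (PCA) model, an orthogonal partial least squares discriminant analysis (OPLS-DA) model and an artificial neural network (ANN) model, and step S202 specifically includes:
Step S301: perform dimensionality reduction on the mass spectrometry detection data through the principal component analysis model to obtain dimension-reduced mass spectrometry detection data.
Here, PCA is a statistical method for dimensionality reduction. Mathematically, PCA uses an orthogonal transformation to replace the p-dimensional X space with an m-dimensional Y space (m<p) and interprets the resulting principal components. It converts many indicators into a few composite indicators while retaining most of the characteristic information of the original variables, and can be used to show visually whether the groups tend to separate.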
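As a rough illustration of the projection PCA performs, the sketch below extracts only the first principal component by power iteration on the sample covariance matrix, using the standard library alone. The function name `first_principal_component` is hypothetical, and a real analysis would normally use an established numerical library and keep several components.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows: list of samples, each a list of p feature values.
    The direction is found by power iteration on the covariance matrix.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # starting vector, assumed not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("data have zero variance along the start vector")
        v = [x / norm for x in w]
    # Score of each sample = projection of the centered sample onto v
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores
```

The scores returned here are the one-dimensional coordinates of the samples; plotting the first two such components against each other gives the kind of score plot used to look for separation between groups.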
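Step S302: perform regression analysis on the dimension-reduced mass spectrometry detection data through the orthogonal partial least squares discriminant analysis model to obtain the variable importance in projection value of each metabolite.
Here, the variable importance in projection value of each metabolite, i.e. the VIP value (Variable Importance in the Projection), describes the overall contribution of each variable to the model. The larger the VIP value of a substance, the more it is a key differential metabolite, i.e. a marker metabolite.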
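The source does not give a formula for the VIP value. The sketch below uses the commonly cited definition for PLS-type models, in which each variable's squared normalized weight is averaged over the components, weighted by the sum of squares of Y that each component explains. The function name `vip_scores` and both input names are assumptions; the weights and explained sums of squares would come from a fitted PLS or OPLS-DA model, which is not shown.

```python
import math


def vip_scores(weights, explained_ss):
    """Compute VIP values from model weights.

    weights[j][a]: weight of variable j in component a.
    explained_ss[a]: sum of squares of Y explained by component a.
    """
    p = len(weights)
    n_comp = len(explained_ss)
    total_ss = sum(explained_ss)
    # Squared norm of each weight column, used to normalize the weights
    col_norm_sq = [sum(weights[j][a] ** 2 for j in range(p))
                   for a in range(n_comp)]
    vips = []
    for j in range(p):
        acc = sum(explained_ss[a] * weights[j][a] ** 2 / col_norm_sq[a]
                  for a in range(n_comp))
        vips.append(math.sqrt(p * acc / total_ss))
    return vips
```

Under this definition the squared VIP values average to 1 across variables, which is why a cutoff near 1 is a common starting point for calling a variable important.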
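Specifically, the OPLS-DA model is used to further determine the differences between groups. OPLS-DA is a regression modeling method that relates multiple dependent variables to multiple independent variables. It can remove the factors in the independent variables X that are unrelated to the class variable Y, so that the classification information is concentrated in one principal component; the characteristics of multidimensional data can then be observed in a two-dimensional plot and the regression model can be given a detailed practical interpretation, which is a considerable advantage among classification methods. In the OPLS-DA plot, the abscissa is the score of the predictive component from the OSC (orthogonal signal correction) process and reflects between-group differences; the ordinate is the score of the orthogonal component and reflects within-group differences.
OPLS-DA models are generally validated with R2 and Q2. R2X and R2Y denote the proportions of the X and Y matrices explained by the model, and Q2 denotes the model's predictive ability. In theory, the closer R2 and Q2 are to 1, the better the model; values above 0.5 indicate good fitting accuracy, and values above 0.4 are normally acceptable.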
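The two statistics share one formula: R2Y compares the response with the values fitted on all samples, while Q2 compares it with cross-validated predictions. The sketch below assumes that the fitted and cross-validated values are already available from a model; the names `explained_fraction` and `rate_fit` are hypothetical, and the 0.5 and 0.4 thresholds are the ones quoted in the text.

```python
def explained_fraction(observed, predicted):
    """Return 1 - (residual sum of squares / total sum of squares).

    With fitted values this is R2Y; with cross-validated predictions it is Q2.
    """
    mean = sum(observed) / len(observed)
    total_ss = sum((y - mean) ** 2 for y in observed)
    residual_ss = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1 - residual_ss / total_ss


def rate_fit(r2, q2):
    """Apply the rule of thumb from the text to a pair of R2 and Q2 values."""
    if r2 > 0.5 and q2 > 0.5:
        return "good"
    if r2 > 0.4 and q2 > 0.4:
        return "acceptable"
    return "poor"
```

For example, `rate_fit(0.955, 0.935)` returns "good", which matches how the values R2Y = 0.955 and Q2Y = 0.935 are read in the worked example later in the description.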
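Step S303: perform prediction and identification with the trained artificial neural network model according to the variable importance in projection value of each metabolite to obtain the probability value of each prediction category.
Specifically, an ANN model is used to model and analyze the VIP value of each metabolite in each of the above prediction categories. An ANN is a system with the ability to learn; its main work is to build a model and determine the weight values. The weights are adjusted until the output of the neurons agrees with the true outputs of the training samples, and the trained artificial neural network model is then used for prediction and identification to obtain the probability value of each prediction category.
Optionally, in this embodiment the data analysis results of the multivariate statistical analysis model can also be validated, for example by using a permutation test and a receiver operating characteristic (ROC) curve to check whether the OPLS-DA and ANN models, respectively, are overfitted.
The permutation test randomly shuffles the samples of the original model, then rebuilds the model and predicts again, using the accuracy over many repetitions to check whether the model is overfitted. The ROC curve is a method for studying the relationship between the sensitivity and the specificity of a model, with 1 - specificity on the abscissa and sensitivity on the ordinate. The model is evaluated by the area under the curve (AUC): the closer the AUC is to 1, the better the model performs, and an AUC below 0.5 indicates poor accuracy.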
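The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition; the function name `roc_auc` and the 0/1 label coding are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (1/0) and model scores.

    Counts, over all positive/negative pairs, how often the positive sample
    scores higher; ties count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A model that ranks every positive sample above every negative one gets an AUC of 1.0, while random scores give a value near 0.5.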
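In the above embodiment, the principal component analysis (PCA) model reduces the dimensionality of the raw data, the orthogonal partial least squares discriminant analysis (OPLS-DA) model identifies marker metabolites in the dimension-reduced data and classifies according to them, and the artificial neural network (ANN) model analyzes and predicts from the differential metabolites in each prediction category. Together these give an accurate and comprehensive classification and improve the accuracy and comprehensiveness of the prediction.
In one embodiment, after step S201 above, the method further includes: screening out missing values of the mass spectrometry detection data according to a preset rule to obtain first mass spectrometry detection data; filling the missing values of the first mass spectrometry detection data with a preset imputation method to obtain second mass spectrometry detection data; and standardizing the second mass spectrometry detection data by the Z normalization method to obtain preprocessed mass spectrometry detection data.
Specifically, after the mass spectrometry detection data of the breath sample are acquired, the data need to be preprocessed. Preprocessing includes screening out and filling missing values, data standardization and reliability analysis. First, the "80% rule" is used to remove missing values from the raw data: if more than 80% of the values of a mass spectral peak within a group are non-zero, that peak is retained; otherwise it is removed. The remaining missing values are then filled by multiple imputation, which gives the mean level and the variability of each missing value, and all imputed data sets are finally combined to give the final result. The data are then standardized with the Z-score method:
x* = (x - μ)/σ
where μ is the mean of all sample data and σ is the standard deviation of all sample data. The standardized data have mean 0 and standard deviation 1; that is, the data are scaled proportionally so that they fall within a small, specific range.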
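The filtering and standardization steps can be sketched as follows. Two choices here are assumptions rather than statements of the source: a peak is kept if any one group passes the 80% threshold, and σ is taken as the sample standard deviation. The function names `keep_peak` and `z_score` are hypothetical, and the multiple-imputation step is not shown.

```python
import statistics


def keep_peak(values_by_group, threshold=0.8):
    """Apply the 80% rule to one mass spectral peak.

    values_by_group: dict mapping a group name to that group's intensities.
    The peak is kept if, in at least one group (an assumption), more than
    `threshold` of the values are non-zero.
    """
    return any(
        len(vals) > 0
        and sum(1 for v in vals if v != 0) / len(vals) > threshold
        for vals in values_by_group.values()
    )


def z_score(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]
```

Note that a peak with exactly 80% non-zero values in a group does not pass, because the rule as stated requires more than 80%.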
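Finally, reliability analysis is performed on the data before and after the above preprocessing to evaluate the effect of the preprocessing. The result is characterized by the reliability coefficient α, also known as Cronbach's alpha (the symbol and formula appear as images PCTCN2020142331-appb-000001 to -000003 in the original publication). In its standard form, consistent with the definitions below:
α = n/(n - 1) × (1 - ΣSi/Sx)
where n is the total number of samples, Si is the within variance of the i-th sample, and Sx is the total variance of all samples. The higher the value of α, the better the internal consistency of the data.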
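A minimal sketch of the reliability coefficient, assuming the standard Cronbach's alpha definition: the sum of the individual variances is compared with the variance of the per-observation totals. The function name `cronbach_alpha` and the column-wise input layout are assumptions.

```python
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns.

    items: list of n columns, each holding one value per observation.
    """
    n = len(items)
    if n < 2:
        raise ValueError("at least two items are required")
    # Sum of the variances of the individual items (the Si terms)
    item_variance_sum = sum(statistics.variance(col) for col in items)
    # Variance of the per-observation totals (the Sx term)
    totals = [sum(vals) for vals in zip(*items)]
    total_variance = statistics.variance(totals)
    return n / (n - 1) * (1 - item_variance_sum / total_variance)
```

Perfectly consistent items, such as two identical columns, give α = 1, while items that vary independently of one another push α toward 0.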
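In the above embodiment, data preprocessing screens out missing values, fills them and standardizes the raw data in turn to obtain valid data, which prepares the ground for the subsequent data analysis so that more accurate classification and prediction results can be obtained.
In one embodiment, the above method further includes: acquiring mass spectrometry detection data of training-set breath samples, where these data correspond to true physiological categories and/or healthy categories; and training the above multivariate statistical analysis model with the mass spectrometry detection data of the training-set breath samples to obtain the pre-built multivariate statistical analysis model.
Specifically, training-set breath samples are acquired, including breath samples of physiological group 1 and physiological group 2. The breath samples in physiological group 1 cover different stages, for example early, middle and late-stage samples, or ulcerative-type and superficial-type samples. All of these samples have a known (i.e. already determined) physiological state type, and the mass spectrometry detection data of each breath sample correspond to a true physiological category and/or a reference group category. The multivariate statistical analysis model is trained with the mass spectrometry detection data of the training-set breath samples, and repeated parameter tuning yields a model that can accurately identify the sample categories; this model is the pre-built multivariate statistical analysis model.
In the above embodiment, the multivariate statistical analysis model is tuned and trained with different sample data to obtain a model that can accurately identify the disease type, which lays the data foundation for predicting and identifying physiological types.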
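In one embodiment, the above method further includes: screening the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that meet a preset condition; inputting the differential mass-to-charge ratios corresponding to those variables into a preset first metabolic pathway database for search and qualitative identification to obtain a first number of differential markers; and inputting the first number of differential markers into a preset second metabolic pathway analysis database for pathway analysis to obtain a physiological metabolic pathway.
Specifically, marker screening combines two methods, the t-test and the variable importance in projection (VIP) value. Univariate analysis is performed first: the t-test is used to screen out variables with statistical differences, characterized by the p-value. If p<0.05, the feature variable differs significantly between the groups. Based on this result, the VIP values from the multivariate model are then analyzed; variables whose VIP value exceeds a cutoff chosen in the range 1 to 2, depending on the number of variables, are usually selected as potential differential markers. The potential differential markers are thus screened out and their biological metabolic mechanisms explored.
The differential mass-to-charge ratios are searched in the metabolic pathway databases HMDB and KEGG for qualitative identification, and the results are cross-checked against the literature, the ionization energies of the substances and the definition of VOCs. GC-MS qualitative analysis can subsequently be used to verify the accuracy of the screening results again.
The metabolites with significant differences are input into the MetPA website (www.metaboanalyst.com) for metabolic pathway enrichment analysis, to study further the metabolic pathways in which the physiological markers are involved and the relationships among them, and to find the metabolic pathways most relevant to the physiological state, so as to predict its possible mechanism of action.
In the above embodiment, the mass spectrometry detection data are screened by a univariate analysis method and a multivariate analysis method, and the differential mass-to-charge ratios of the screened metabolites are input into a preset metabolic pathway database for search and qualitative identification to obtain a first number of differential markers; these markers are then input into a preset metabolic pathway analysis website for pathway analysis to obtain a physiological metabolic pathway. This provides an effective method for screening differential markers together with the corresponding metabolic pathways, and further provides an accurate basis for predicting and discriminating specific physiological states.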
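In one embodiment, screening the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that meet a preset condition includes: screening the mass spectrometry detection data by the t-test method to obtain a first variable group whose p-value is lower than a first reference threshold; and analyzing the first variable group by the variable importance in projection method to obtain a second variable group whose VIP value is higher than a second reference threshold.
Specifically, marker screening combines two methods, the t-test and the variable importance in projection (VIP) value. Univariate analysis is performed first: the t-test is used to screen out variables with statistical differences, characterized by the p-value. If p<0.05, the feature variable differs significantly between the groups. Based on this result, the VIP values from the multivariate model are then analyzed; variables whose VIP value exceeds a cutoff chosen in the range 1 to 2, depending on the number of variables, are usually selected as potential differential markers. The potential differential markers are thus screened out and their biological metabolic mechanisms explored.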
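The two-stage filter can be sketched as below, assuming the p-value and VIP value of every m/z variable have already been computed elsewhere. The function name `screen_markers`, the input layout and the default thresholds (taken from the p<0.05 and VIP>1 guidance in the text) are illustrative assumptions, and the numbers in `example` are made up.

```python
def screen_markers(stats, p_max=0.05, vip_min=1.0):
    """Two-stage screening of candidate marker variables.

    stats: dict mapping an m/z value to a (p_value, vip) tuple.
    Stage 1 keeps variables with p_value < p_max (univariate t-test);
    stage 2 keeps those whose VIP exceeds vip_min (multivariate model).
    """
    first_group = {mz: pv for mz, pv in stats.items() if pv[0] < p_max}
    second_group = [mz for mz, (_, vip) in first_group.items() if vip > vip_min]
    return sorted(second_group)


# Made-up values for illustration only; they are not data from the source.
example = {84: (0.001, 1.8), 58: (0.200, 1.5), 94: (0.010, 0.7)}
selected = screen_markers(example)
```

Applying the t-test first means the VIP cutoff is only ever applied to variables that already differ significantly between the groups.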
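In the above embodiment, differential markers are obtained by screening with different thresholds, which lays the data foundation for subsequent prediction and identification.
In one embodiment, the above first number of differential markers includes: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran and choline.
Specifically, the differential mass-to-charge ratios are searched in the metabolic pathway databases HMDB and KEGG for qualitative identification, and 11 differential markers are identified by cross-checking against the literature, the ionization energies of the substances and the definition of VOCs, namely: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran and choline.
Optionally, to verify again whether the differential markers have been screened accurately, GC-MS will subsequently be used to analyze the substances qualitatively, to characterize clearly the components of the differential marker metabolites, and at the same time to verify the reliability of the markers screened by mass spectrometry.
In the above embodiment, qualitative identification of the differential markers gives the specific marker substances, which provides a solid basis for discriminating the physiological state and an effective basis for analyzing it.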
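In one embodiment, Fig. 4 shows the data analysis flowchart of the metabolomics-based physiological prediction method in a specific application scenario, which mainly includes:
1 Sample collection
Breath sample data were collected. In this experiment a total of 153 breath samples were collected, comprising 88 people in a specific physiological state and 65 volunteers in other physiological states as the reference group.
2 Methods
2.1 Data preprocessing
After missing-value removal, imputation and standardization of the 153 raw data sets, 56 of the original 310 mass spectral peaks were selected as new variables. Reliability analysis was then performed on the data before and after preprocessing. The reliability coefficient α (shown as image PCTCN2020142331-appb-000004 in the original publication) was 0.575 before preprocessing and 0.995 after preprocessing, a marked improvement, so the preprocessing was judged to be effective. Generally, when α satisfies the condition shown in image PCTCN2020142331-appb-000005 of the original publication, the reliability of the data is good; this result also indicates that the subsequent evaluation results are credible.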
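2.2 Multivariate statistical analysis
The present invention combines three different classification methods to distinguish the breath mass spectrometry data of the specific physiological state group from those of the reference group. Unsupervised PCA was first used as a pre-analysis step to show visually whether the groups tend to separate. The output R²X is the percentage of the information in all observations covered by the fitted principal components. The result was R²X = 0.656, clearly above 0.4, indicating a good fit. As Fig. 5 shows, the specific physiological state group and the reference group separate clearly, although a few outliers in the specific physiological state group fall outside the 95% confidence interval.
To further determine the between-group differences, an OPLS-DA model was built. Its goodness of fit is usually characterized by two indicators, R²Y and Q²Y. R²Y is the percentage of all sample variables that the fitted principal components can explain, and Q²Y is computed by cross-validation to evaluate the predictive ability of the model. In general, the closer R²Y and Q²Y are to 1, the better the fit and the prediction, respectively. As Fig. 6 shows, R²Y = 0.955, indicating that the fitted principal components have a high explanatory power; Q²Y = 0.935, i.e. a cross-validated predictive ability of 93.5%, showing that the model predicts unknown samples fairly accurately. In addition, the specific physiological state group and the reference group likewise show a significant between-group difference within the 95% confidence interval, consistent with the separation trend in PCA, which indicates that the breath VOCs of the two groups differ markedly. The equally clear within-group clustering indicates that samples within the same group differ little from one another, from which it can be inferred that the samples have good parallelism.
Finally, this study also used the ANN algorithm to model the sample data. An ANN is a system with the ability to learn. The 153 samples were randomly divided into a training set and a validation set. The training set contained 111 samples (72.5% of the total), comprising 63 from the specific physiological state group and 48 from the reference group; the validation set contained 42 samples (27.5% of the total), comprising 25 from the specific physiological state group and 17 from the reference group. The differential variables were input into the model for statistical analysis to obtain the ANN results. In validation, all 25 samples of the specific physiological state group were classified correctly, and 1 of the 17 reference group samples was misclassified as belonging to the specific physiological state group. The ANN prediction accuracy was 97.6%, showing that the neural network classifies well and supporting the validity and reliability of breath-based discrimination of the physiological state.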
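The reported validation figures can be checked with a short calculation. The sketch below treats the specific physiological state group as the positive class (an assumption about labeling) and uses the counts stated above: 25 of 25 positives correct and 16 of 17 reference samples correct. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts from the validation set described in the text:
# 25 specific-state samples all correct, 1 of 17 reference samples misjudged.
m = classification_metrics(tp=25, fn=0, tn=16, fp=1)
```

This gives an accuracy of 41/42, which rounds to the 97.6% reported, together with a sensitivity of 25/25 and a specificity of 16/17 (about 94.1%) on this validation set.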
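2.3 Model validation
A permutation test was used to check whether the OPLS-DA model was overfitted, where R² is the cumulative variance and Q² is the cumulative cross-validity. In general, if R² < 0.5 and Q² < 0, the model is considered not to be overfitted. The results of 200 permutations are shown in Fig. 7: R² = 0.158 and Q² = -0.366, far below the R² and Q² of the original model, indicating that the OPLS-DA model is not overfitted and has reliable discriminative and predictive ability.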
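The idea behind the permutation test can be sketched generically: shuffle the class labels many times, recompute a score each time, and compare the permuted scores with the score on the true labels. In the sketch below the score is a toy statistic (the gap between two group means) standing in for refitting the OPLS-DA model, which is not shown. The names `permutation_test` and `mean_gap` are hypothetical, and the default of 200 permutations mirrors the number used in the text.

```python
import random


def permutation_test(features, labels, score_fn, n_permutations=200, seed=0):
    """Compare the score on the true labels with scores on shuffled labels.

    Returns (observed score, list of permuted scores, empirical p-value).
    """
    rng = random.Random(seed)
    observed = score_fn(features, labels)
    shuffled = list(labels)
    permuted = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between samples and labels
        permuted.append(score_fn(features, shuffled))
    exceed = sum(1 for s in permuted if s >= observed)
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, permuted, p_value


def mean_gap(features, labels):
    """Toy score: absolute difference between the two group means."""
    a = [x for x, l in zip(features, labels) if l == 1]
    b = [x for x, l in zip(features, labels) if l == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))
```

If the model had merely memorized noise, shuffled labels would score about as well as the true ones; scores on permuted labels that fall far below the original score are the pattern the text reports for R² and Q².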
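The receiver operating characteristic (ROC) curve was used to validate further the classification ability of the ANN model. The ROC curve is a method for studying the relationship between the sensitivity and the specificity of a model; the model is evaluated by the area under the curve (AUC), and the closer the AUC is to 1, the better the model performs. As Fig. 8 shows, the AUC for both the specific physiological state group and the reference group was 0.999, close to 1, indicating that the model classifies well and is suitable for discriminating between the groups.
2.4 Screening of differential markers
To evaluate the importance of each variable objectively and comprehensively, the present invention combined two marker screening methods. Univariate analysis was performed first, using the t-test to select variables with p<0.01, 39 in total. Based on this result, the VIP values from the multivariate model were then analyzed and variables with VIP>1.2 were selected, 13 in total. The mass-to-charge ratios (m/z) selected by combining the two methods are 84, 120, 124, 110, 98, 70, 166, 355, 357, 108, 100, 82 and 104.
2.5 Qualitative identification of markers
The differential mass-to-charge ratios were searched in the metabolic pathway databases HMDB and KEGG for qualitative identification, and 11 differential markers were identified by cross-checking against the literature, the ionization energies of the substances and the definition of VOCs, namely: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran and choline. To verify again whether the differential markers were screened accurately, GC-MS will subsequently be used to analyze the substances qualitatively, to characterize clearly the components of the differential marker metabolites, and at the same time to verify the reliability of the markers screened by mass spectrometry.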
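3 Metabolic pathway analysis
The present invention input the 11 differential metabolites into the MetPA website for pathway analysis to find the metabolic pathways most relevant to the specific physiological state. Fig. 9 is an overview of the pathway analysis; the abscissa represents the importance of the metabolic pathway, and the ordinate represents the significance level of the metabolic pathway enrichment analysis. The specific physiological state in this embodiment mainly involves three metabolic pathways: steroid hormone biosynthesis; glycine, serine and threonine metabolism; and glycerophospholipid metabolism.
In addition, Fig. 10 shows the mass spectrum of another group of differential markers. In this embodiment, the breath mass spectrometry data of the specific physiological state group and the reference group show that, in the range m/z 50 to 359, the types of mass spectral peaks detected in representative exhaled breath samples of the two groups differ little. Both groups showed peaks with different relative intensities at m/z 58, 68, 87, 94, 106, 136, 281 and 355, some of which were preliminarily identified as acetone (m/z 58), isoprene (m/z 68), phenol (m/z 94) and ethylbenzene (m/z 106).
By analyzing the raw breath mass spectrometry data, this embodiment establishes a set of models and analysis methods suitable for predicting and discriminating physiological states. The method follows the metabolomics workflow and uses several analysis methods that verify one another; compared with a single data-processing method, it gives more comprehensive results and can greatly improve the accuracy and classification efficiency of discriminating and predicting specific physiological states. In addition, this embodiment provides a method for screening differential markers: breath metabolomics analysis of the specific physiological state group and the reference group screened out 13 differential metabolites between the two, providing an experimental basis for discriminating a specific physiological state and theoretical support for research on its prediction and discrimination.
It should be understood that although the steps in the flowcharts of Figs. 1-4 are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be performed in other orders. Moreover, at least some of the steps in Figs. 1-4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time but may be executed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.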
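In one embodiment, as shown in Fig. 11, a metabolomics-based physiological screening device 1000 is provided, including a data acquisition module 1001, a probability value prediction module 1002 and a result determination module 1003, where:
the data acquisition module 1001 is configured to acquire mass spectrometry detection data of the breath sample to be tested;
the probability value prediction module 1002 is configured to perform physiological prediction on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain a probability value for each prediction category;
the result determination module 1003 is configured to determine the physiological prediction result corresponding to the breath sample to be tested according to the probability value of each prediction category.
In one embodiment, the above probability value prediction module 1002 is further configured to:
perform dimensionality reduction on the mass spectrometry detection data through the principal component analysis model to obtain dimension-reduced mass spectrometry detection data; perform regression analysis on the dimension-reduced mass spectrometry detection data through the orthogonal partial least squares discriminant analysis model to obtain the VIP value of each metabolite; and perform prediction and identification with the trained artificial neural network model according to the VIP value of each metabolite to obtain the probability value of each prediction category.
In one embodiment, the above data acquisition module 1001 is further configured to:
screen out missing values of the mass spectrometry detection data according to a preset rule to obtain first mass spectrometry detection data; fill the missing values of the first mass spectrometry detection data with a preset imputation method to obtain second mass spectrometry detection data; and standardize the second mass spectrometry detection data by the Z normalization method to obtain preprocessed mass spectrometry detection data.
In one embodiment, the above data acquisition module 1001 is further configured to:
acquire mass spectrometry detection data of training-set breath samples, where these data correspond to a true category of a specific physiological state and/or a reference group category; and train the multivariate statistical analysis model with the mass spectrometry detection data of the training-set breath samples to obtain the pre-built multivariate statistical analysis model.
In one embodiment, the above metabolomics-based physiological screening device 1000 further includes a marker acquisition unit 1004 and a metabolic pathway search unit 1005:
the marker acquisition unit 1004 is configured to screen the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that meet a preset condition, and to input the differential mass-to-charge ratios corresponding to those variables into a preset first metabolic pathway database for search and qualitative identification to obtain a first number of differential markers;
the metabolic pathway search unit 1005 is configured to input the first number of differential markers into a preset second metabolic pathway analysis database for pathway analysis to obtain a physiological metabolic pathway.
In one embodiment, the above marker acquisition unit 1004 is further configured to screen the mass spectrometry detection data by the t-test method to obtain a first variable group whose p-value is lower than a first reference threshold, and to analyze the first variable group by the variable importance in projection method to obtain a second variable group whose VIP value is higher than a second reference threshold.
In one embodiment, the above first number of differential markers includes: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran and choline.
For the specific limitations of the metabolomics-based physiological prediction device, refer to the limitations of the metabolomics-based physiological prediction method above, which are not repeated here. Each module in the above device can be implemented in whole or in part by software, hardware or a combination thereof. The modules can be embedded in, or be independent of, the processor of the computer device in the form of hardware, or be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to each module.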
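In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 12. The computer device includes a processor, a memory and a network interface connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store mass spectrometry detection data and prediction result data. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a metabolomics-based physiological prediction method.
Those skilled in the art will understand that the structure shown in Fig. 12 is only a block diagram of the partial structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, where a computer program is stored in the memory and the processor implements the steps of the above method embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided on which a computer program is stored, and the computer program implements the steps of the above method embodiments when executed by a processor.
Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be carried out by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database or another medium in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory or optical memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. The scope of protection of this patent application is therefore determined by the appended claims.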

Claims (10)

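  1. A metabolomics-based physiological prediction method, characterized in that the method includes:
    acquiring mass spectrometry detection data of a breath sample to be tested;
    performing physiological prediction on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain a probability value for each prediction category;
    determining the physiological prediction result corresponding to the breath sample to be tested according to the probability value of each prediction category.
  2. The method according to claim 1, characterized in that the pre-built multivariate statistical analysis model is obtained by coupling a principal component analysis model, an orthogonal partial least squares discriminant analysis model and an artificial neural network model, and performing physiological prediction on the mass spectrometry detection data through the pre-built multivariate statistical analysis model to obtain a probability value for each prediction category includes:
    performing dimensionality reduction on the mass spectrometry detection data through the principal component analysis model to obtain dimension-reduced mass spectrometry detection data;
    performing regression analysis on the dimension-reduced mass spectrometry detection data through the orthogonal partial least squares discriminant analysis model to obtain the variable importance in projection value of each metabolite;
    performing prediction and identification with the trained artificial neural network model according to the variable importance in projection value of each metabolite to obtain the probability value of each prediction category.
  3. The method according to claim 1, characterized in that, after acquiring the mass spectrometry detection data of the breath sample to be tested, the method further includes:
    screening out missing values of the mass spectrometry detection data according to a preset rule to obtain first mass spectrometry detection data;
    filling the missing values of the first mass spectrometry detection data with a preset imputation method to obtain second mass spectrometry detection data;
    standardizing the second mass spectrometry detection data by the Z normalization method to obtain preprocessed mass spectrometry detection data.
  4. The method according to claim 3, characterized in that the method further includes:
    acquiring mass spectrometry detection data of training-set breath samples, where the mass spectrometry detection data of the training-set breath samples correspond to a true category of a specific physiological state and/or a reference group category;
    training the multivariate statistical analysis model with the mass spectrometry detection data of the training-set breath samples to obtain the pre-built multivariate statistical analysis model.
  5. The method according to claim 1, characterized in that the method further includes:
    screening the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that meet a preset condition;
    inputting the differential mass-to-charge ratios corresponding to the variables that meet the preset condition into a preset first metabolic pathway database for search and qualitative identification to obtain a first number of differential markers;
    inputting the first number of differential markers into a preset second metabolic pathway analysis database for pathway analysis to obtain a physiological metabolic pathway.
  6. The method according to claim 5, characterized in that screening the mass spectrometry detection data sequentially through a univariate analysis method and a multivariate analysis method to obtain variables that meet a preset condition includes:
    screening the mass spectrometry detection data by the t-test method to obtain a first variable group whose p-value is lower than a first reference threshold;
    analyzing the first variable group by the variable importance in projection method to obtain a second variable group whose variable importance in projection value is higher than a second reference threshold.
  7. The method according to claim 6, characterized in that the first number of differential markers includes: cyclohexane, (S)-3,4-dihydroxybutyric acid, 5-methyl-2-acetylfuran, 2-n-propylfuran, angelica lactone, 3-aminopropionitrile, ethyl salicylate, p-cresol, hexanal, 2-methylfuran and choline;
    or,
    the first number of differential markers includes: acetone (m/z 58), isoprene (m/z 68), phenol (m/z 94) and ethylbenzene (m/z 106).
  8. A metabolomics-based physiological prediction apparatus, characterized in that the apparatus includes:
    a data acquisition module, configured to acquire mass spectrometry detection data of a breath sample to be tested;
    a probability value prediction module, configured to perform physiological prediction on the mass spectrometry detection data through a pre-built multivariate statistical analysis model to obtain a probability value for each prediction category;
    a result determination module, configured to determine the physiological prediction result corresponding to the breath sample to be tested according to the probability value of each prediction category.
  9. A computer device, including a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program implements the steps of the method according to any one of claims 1 to 7 when executed by a processor.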
PCT/CN2020/142331 2020-12-12 2020-12-31 Metabolomics-based physiological prediction method, apparatus, computer device and medium WO2022121055A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011453774.1 2020-12-12
CN202011453774.1A CN114624316A (zh) 2020-12-12 2020-12-12 Metabolomics-based physiological prediction method, apparatus, computer device and medium

Publications (1)

Publication Number Publication Date
WO2022121055A1 true WO2022121055A1 (zh) 2022-06-16

Family

ID=81896152

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/142331 WO2022121055A1 (zh) 2020-12-12 2020-12-31 Metabolomics-based physiological prediction method, apparatus, computer device and medium

Country Status (2)

Country Link
CN (1) CN114624316A (zh)
WO (1) WO2022121055A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115389690A (zh) * 2022-09-27 2022-11-25 中国科学院生态环境研究中心 环境中苯并三唑紫外线吸收剂类污染物的全面识别方法
CN116129991A (zh) * 2023-04-17 2023-05-16 南京派森诺基因科技有限公司 一种基于代谢物定性定量数据的非靶向代谢组分析方法
CN116362599A (zh) * 2022-12-12 2023-06-30 武汉同捷信息技术有限公司 一种基于mes系统的质量数据采集方法与装置
CN117238491A (zh) * 2023-07-27 2023-12-15 深圳爱湾医学检验实验室 一种基于尿液代谢组数据的尿结石风险预测方法及其系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN206146940U (zh) * 2016-11-09 2017-05-03 苏州一呼医疗科技有限公司 智能呼气分子诊断系统
US20180180590A1 (en) * 2016-07-13 2018-06-28 The United States Of America As Represented By The Secretary Of The Navy Volatile organic compounds as diagnostic breath markers for pulmonary oxygen toxicity
CN111710372A (zh) * 2020-05-21 2020-09-25 中国医学科学院生物医学工程研究所 一种呼出气检测装置及其呼出气标志物的建立方法
CN111833330A (zh) * 2020-07-14 2020-10-27 中国医学科学院生物医学工程研究所 基于影像与机器嗅觉融合的肺癌智能检测方法及系统

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115389690A (zh) * 2022-09-27 2022-11-25 中国科学院生态环境研究中心 环境中苯并三唑紫外线吸收剂类污染物的全面识别方法
CN115389690B (zh) * 2022-09-27 2023-09-05 中国科学院生态环境研究中心 环境中苯并三唑紫外线吸收剂类污染物的全面识别方法
CN116362599A (zh) * 2022-12-12 2023-06-30 武汉同捷信息技术有限公司 一种基于mes系统的质量数据采集方法与装置
CN116362599B (zh) * 2022-12-12 2023-11-10 武汉同捷信息技术有限公司 一种基于mes系统的质量数据采集方法与装置
CN116129991A (zh) * 2023-04-17 2023-05-16 南京派森诺基因科技有限公司 一种基于代谢物定性定量数据的非靶向代谢组分析方法
CN117238491A (zh) * 2023-07-27 2023-12-15 深圳爱湾医学检验实验室 一种基于尿液代谢组数据的尿结石风险预测方法及其系统

Also Published As

Publication number Publication date
CN114624316A (zh) 2022-06-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964984

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27/09/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20964984

Country of ref document: EP

Kind code of ref document: A1