CN112509702A

CN112509702A - Disease prediction method and system based on medical big data

Info

Publication number: CN112509702A
Application number: CN202011377118.8A
Authority: CN
Inventors: 田润涛
Original assignee: Zimei Beijing Biotechnology Co ltd
Current assignee: Zimei Beijing Biotechnology Co ltd
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2021-03-16

Abstract

The invention discloses a disease prediction method and system based on medical big data. The method comprises the following steps: importing mass spectrum data from each mainstream mass spectrometer manufacturer and converting the manufacturer's raw mass spectrum data into a universal data format; automatically labeling and qualitatively analyzing endogenous compounds in a mass spectrum through a metabolite compound database, according to human endogenous metabolite spectrum information and a compound search and matching algorithm; collecting and effectively managing large and complex human metabolome samples and carrying out data mining and modeling, wherein the collected data types mainly comprise the raw data and labeling results of targeted and non-targeted metabonomics mass spectrometry analysis; performing normalization, cleaning, compound alignment, and qualitative and quantitative analysis on complex multi-sample data, and applying several types of correction algorithms; and completing the whole process of training, releasing and maintaining the user's disease prediction model by using different types of machine learning model training algorithms adapted to the characteristics of mass spectrum data.

Description

Disease prediction method and system based on medical big data

Technical Field

The invention relates to the technical field of medical data artificial intelligence, in particular to a disease prediction method and system based on medical big data.

Background

Mass spectrometry is one of the leading technologies in the field of precision instrument analysis and has developed rapidly in clinical testing in recent years. Clinical mass spectrometry has a wide range of applications and can replace traditional methodologies in many fields, such as biochemistry and immunology, drug metabolism, microbiology, pathological diagnosis and molecular diagnostics. Compared with gene sequencing, for example, mass spectrometry is suitable for detecting almost all molecules, including biomacromolecules such as nucleic acids and polypeptides, small biomolecules such as metabolites, hormones and vitamins, and trace elements, and it can perform qualitative and quantitative determination of thousands of markers simultaneously. Compared with the prior art, mass spectrometry offers high detection efficiency, wide marker coverage, low detection cost and a higher degree of automation in the analysis process, and it has the best prospect of becoming the 'gold standard' of clinical testing. At present, however, the application of liquid chromatography tandem mass spectrometry at home and abroad is still at an initial stage in on-site testing. Clinical mass spectrometry applications mainly focus on items such as drug concentration monitoring, neonatal defect screening, and vitamin and hormone testing. In terms of the depth and breadth of detection indexes these items are relatively simple and mature, and they belong to the primary application range of precision medicine.

The concept of metabonomics was originally proposed in 1999 by Jeremy Nicholson of Imperial College London and, after only 20 years of development, has become one of the most closely watched frontier technologies in precision medicine and related health fields. The research object of metabonomics is generally the temporal and spatial variation in the composition of small-molecule metabolites with molecular weight less than 1500 Da in organisms, and advanced mass spectrometry detection combined with expert systems is its core analytical means. Compared with the genome, the metabolome lies downstream of the gene regulation and protein interaction networks and is closely related to the nutritional state of cells, the action of drugs and environmental pollutants, and the influence of other external factors. It directly reflects events that have actually occurred under the combined acquired influences on the organism, rather than probabilistic events at the gene level. By analyzing metabolites in blood, urine, skin and the like, more accurate and complete biological end-point information can be provided, so metabonomics naturally comes closest to meeting the requirements of clinical testing and precision medicine. However, the composition of metabonomic data is very complex. A single sample analysis usually generates tens of thousands to hundreds of thousands of pieces of compound fragment information, of which less than ten percent can currently be fully interpreted. In addition, individual differences between organisms introduce many uncertainties into current metabonomic data analysis. For example, targeted mass-spectrometry-based techniques produce few missing values, whereas the more critical non-targeted techniques usually produce 10 to 35 percent missing values. How to achieve high-coverage, high-precision metabonomic data analysis methods, and applications built around data analysis, is therefore the main challenge currently faced.

As the latest generation of molecular diagnosis and precision medicine detection technology, metabonomics analysis based on mass spectrometry can obtain information on at least 2000 human biomarkers related to the occurrence and development of disease in a single run. Metabonomic data are large in volume and complex to analyze, and individual differences between people are large, so the development of a series of intelligent models for early screening and early diagnosis of disease is realized by means of big data mining of the human metabolome and machine learning modeling. Google holds that big data technologies should be used to overcome serious diseases such as cancer and the problems associated with an aging society, and Google's experience indicates that, under existing technical conditions, enlarging the number of samples by 100 times or more brings a qualitative change in the quality of prediction results. Therefore, establishing a human metabolome big data system, enlarging the scale of sample analysis through standardized data acquisition, management and mining, and developing and applying high-precision machine learning models around metabonomic data is one of the key technical bottlenecks that the field currently needs to break through.

In summary, improving metabonomics-related data analysis technology and establishing a big data mining platform have become key problems restricting the further rapid development of metabonomics and precision medicine. The present application provides deep mining and feature extraction of the metabolite information of organisms acquired by clinical mass spectrometry and metabonomics, and carries out machine learning and pattern recognition on mass spectrum data at large sample scale, thereby forming a one-stop, automated model training and prediction mechanism based on a sample database. This mechanism is closely integrated with clinical mass spectrometry, metabonomics analysis and clinical requirements, and provides big data technical support and a system solution for important application directions in the health field, such as early screening and early diagnosis of major and chronic diseases and precision medication.

Disclosure of Invention

In view of the technical problems in the related art, the invention provides a disease prediction method and system based on medical big data, which can solve the problems of automation and standardization in existing metabonomics mass spectrum data analysis and its workflow, and fill the gap in automated metabonomics prediction models and big database systems for screening and predicting human diseases.

In order to achieve the technical purpose, the technical scheme of the invention is realized as follows: a disease prediction method based on medical big data comprises the following steps:

S1 Mass spectrometer data exchange and preprocessing: importing mass spectrum data from each mainstream mass spectrometer manufacturer, converting the manufacturer's raw mass spectrum data into a universal data format, converting continuous (profile) mass spectra into stick (centroid) spectra, and preprocessing the data by extracting the different scan channels;

S2 Metabolite compound database: according to human endogenous metabolite spectrum information and a compound search and matching algorithm, automatically labeling and qualitatively analyzing endogenous compounds in the mass spectrum through the metabolite compound database;

S3 Human metabolome sample database: collecting and effectively managing large and complex human metabolome samples and carrying out data mining and modeling, wherein the collected data types mainly comprise the raw data and labeling results of targeted and non-targeted metabonomics mass spectrometry analysis, and spatial metabonomics information obtained through mass spectrometry imaging;

S4 Metabonomics data analysis workstation: performing normalization, cleaning, compound alignment, and qualitative and quantitative analysis on complex multi-sample data, and applying several types of correction algorithms to address sample differences caused by instruments and experimental procedures;

S5 Mass spectrum data machine learning algorithm and disease prediction model development module: completing the whole process of training, releasing and maintaining the user's disease prediction model by using different types of machine learning model training algorithms adapted to the characteristics of mass spectrum data.
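The conversion from a continuous spectrum to a stick spectrum named in step S1 can be illustrated with a minimal sketch. The patent does not disclose its conversion algorithm, so the local-maximum rule, the three-point intensity-weighted m/z, the function name `centroid_spectrum` and the `min_intensity` parameter are all assumptions made for illustration; intensities are assumed to be non-negative.

```python
from typing import List, Tuple


def centroid_spectrum(
    mz: List[float],
    intensity: List[float],
    min_intensity: float = 0.0,
) -> List[Tuple[float, float]]:
    """Reduce a profile spectrum to (m/z, intensity) sticks.

    Assumed rule (not taken from the patent): every local intensity
    maximum above `min_intensity` becomes one stick. Its m/z is the
    intensity-weighted mean of the apex and its two neighbours, and
    its intensity is the apex height.
    """
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")

    sticks = []
    for i in range(1, len(mz) - 1):
        apex = intensity[i]
        # Strictly higher than the left neighbour, so a flat top yields one stick.
        is_local_max = intensity[i - 1] < apex >= intensity[i + 1]
        if not is_local_max or apex <= min_intensity:
            continue
        window = range(i - 1, i + 2)
        total = sum(intensity[j] for j in window)
        weighted_mz = sum(mz[j] * intensity[j] for j in window) / total
        sticks.append((weighted_mz, apex))
    return sticks


# Toy profile data: one strong peak near m/z 100.02 and one weak peak at 100.05.
profile_mz = [100.00, 100.01, 100.02, 100.03, 100.04, 100.05, 100.06]
profile_int = [0.0, 50.0, 100.0, 50.0, 0.0, 20.0, 0.0]
sticks = centroid_spectrum(profile_mz, profile_int, min_intensity=10.0)
```

Raising `min_intensity` above the height of the weak peak would drop it from the result, which is one simple way to suppress baseline noise before compound matching.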

Further, the mass spectrum data matching of the metabolite compound database in S2 further comprises:

S2.1 Exact matching of analysis conditions: according to the experimental conditions, instrument model, ion source type, and liquid chromatography retention time and mass spectrum data recorded when each metabolite was measured, the sample to be tested is matched to the compound in the database whose detection conditions are closest;

S2.2 Dynamic adduct ion matching: when the type and number of adduct ions are uncertain, positive and negative adduct ions are matched dynamically, which effectively reduces missed and false detections of compounds;

S2.3 Mass spectrum isotope matching: matching supports data of any mass spectral resolution and simulates the isotope peaks of compounds of any molecular weight, which improves compatibility with sample data from different instrument types;

S2.4 Retention time matching: window matching is applied to the retention time of the compound in combination with the chromatographic conditions used when the sample was tested, so as to filter out the influence on the matching result of interfering compounds whose chromatographic behavior differs substantially;

S2.5 Primary mass spectrum matching: the fragmentation information of the quasi-molecular ion peak and fragment peaks of each metabolite molecule is fully recorded, and complete primary mass spectrum information, including the chemical formula, structural formula and fragment loss information of each fragment, is recorded to achieve accurate metabolite matching;

S2.6 Secondary mass spectrum matching: according to the fragmentation information of compound parent ions under different ion sources and different collision voltages, complete secondary mass spectrum information, including chemical formulas, structural formulas and fragment loss information, is recorded to achieve accurate metabolite matching;

S2.7 Ion mobility matching: ion mobility information of compound fragments is recorded and used to distinguish isomers, including chiral isomers, to achieve accurate metabolite matching;

S2.8 Metabolic pathway matching: based on all collected information on known human metabolic pathways, metabolites related to the matched compound upstream and downstream in the metabolic pathway are also matched, and compounds that meet this condition are given a higher match score, which improves the accuracy of metabolite matching;

S2.9 Non-targeted matching mechanism: unknown metabolites that lack an authentic standard but are of significant importance and have stable mass spectrum characteristics are recorded as non-targeted characteristic compounds, which improves the coverage and reliability of the matching results and further improves the accuracy of metabolite matching;

S2.10 Matching weight adjustment: because the matching algorithms differ in principle and mechanism, they influence the search results to different degrees; through the weight adjustment function, the user strengthens or weakens particular matching mechanisms according to the specific conditions of the sample, which further improves the accuracy of metabolite matching.
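Two of the mechanisms above, adduct ion matching (S2.2) and retention time window matching (S2.4), can be sketched together. This is only an illustration under stated assumptions: the patent gives no tolerances or data layout, so the `LIBRARY` structure, the retention times in it, the 10 ppm tolerance and the 0.5 minute window are hypothetical. The adduct mass shifts and the monoisotopic masses of glucose and lactic acid are standard reference values.

```python
from typing import Dict, List, Tuple

# Mass shifts (Da) of a few common singly charged adducts.
ADDUCTS: Dict[str, Dict[str, float]] = {
    "positive": {"[M+H]+": 1.007276, "[M+Na]+": 22.989218},
    "negative": {"[M-H]-": -1.007276, "[M+Cl]-": 34.969402},
}

# Hypothetical library entries: monoisotopic mass (Da) and an assumed
# retention time (min) under some fixed chromatographic condition.
LIBRARY = [
    {"name": "glucose", "mono_mass": 180.063388, "rt": 1.10},
    {"name": "lactic acid", "mono_mass": 90.031694, "rt": 1.80},
]


def match_feature(
    observed_mz: float,
    observed_rt: float,
    polarity: str,
    library: List[dict] = LIBRARY,
    ppm_tol: float = 10.0,
    rt_window: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (compound, adduct, ppm error) hits, best mass accuracy first."""
    hits = []
    for entry in library:
        # Retention time window: skip compounds that elute too far away.
        if abs(observed_rt - entry["rt"]) > rt_window:
            continue
        # Try every adduct allowed for this polarity.
        for adduct, shift in ADDUCTS[polarity].items():
            expected_mz = entry["mono_mass"] + shift
            ppm_error = (observed_mz - expected_mz) / expected_mz * 1e6
            if abs(ppm_error) <= ppm_tol:
                hits.append((entry["name"], adduct, ppm_error))
    return sorted(hits, key=lambda hit: abs(hit[2]))


sodium_hit = match_feature(203.0526, 1.20, "positive")
late_eluting = match_feature(203.0526, 5.00, "positive")
lactate_hit = match_feature(89.0244, 1.70, "negative")
```

In this toy data the feature at m/z 203.0526 is only explained as a sodium adduct of glucose, and the same m/z at a retention time of 5 minutes is rejected by the window filter.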

Further, the S3 human metabolome sample database further comprises:

S3.1 the human metabolome sample database consists of a human metabolome biological sample library and an experimental animal biological sample library, containing blood and tissue section samples;

S3.2 each record in the sample library is associated with the health data of the tested person, including basic information such as sex, age, ethnicity, height and weight, and auxiliary diagnostic information such as lifestyle and dietary habits, past medical history and medication status;

S3.3 the metabolome sample data are seamlessly associated with the metabolite database to generate full-spectrum, automatic compound labels;

S3.4 the human metabolome sample database exchanges data seamlessly with the metabonomics data analysis workstation, for the user's one-stop analysis and data modeling development;

S3.5 in terms of omics analysis design concept and database architecture, the database addresses the frontier requirements of large-sample metabolome management and data mining, and is the first of its kind.

Further, the correction by several types of algorithms in S4 further comprises:

S4.1 Screening algorithm for internal standards in the sample: a fitted standard curve is established with each candidate internal standard compound used as the internal standard; the standard curve that fits the analyte is selected from the fitted curves; and the internal standard corresponding to that curve, as judged by the coefficient of determination of its regression equation, is taken as the final internal standard of the analyte.
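The internal standard screening in S4.1 can be sketched as follows. The text only states that the choice is tied to the coefficient of determination of the regression equation, so picking the candidate with the largest R² is an assumption here, as are the function names, the straight-line calibration model and the toy peak areas.

```python
from typing import Dict, List, Tuple


def r_squared(x: List[float], y: List[float]) -> float:
    """Coefficient of determination of a least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    # For simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)


def select_internal_standard(
    concentrations: List[float],
    analyte_areas: List[float],
    candidate_areas: Dict[str, List[float]],
) -> Tuple[str, float]:
    """Pick the candidate whose area-ratio calibration curve has the highest R^2.

    Assumption: "best" means largest R^2; the source does not say so explicitly.
    """
    best_name, best_r2 = "", -1.0
    for name, is_areas in candidate_areas.items():
        ratios = [a / s for a, s in zip(analyte_areas, is_areas)]
        r2 = r_squared(concentrations, ratios)
        if r2 > best_r2:
            best_name, best_r2 = name, r2
    return best_name, best_r2


# Toy calibration series. The analyte signal drifts from run to run, and
# IS_A drifts in the same way, so the ratio to IS_A is linear in concentration.
concentrations = [1.0, 2.0, 4.0, 8.0]
analyte_areas = [100.0, 180.0, 440.0, 760.0]
candidates = {
    "IS_A": [50.0, 45.0, 55.0, 47.5],   # tracks the drift
    "IS_B": [50.0, 50.0, 50.0, 50.0],   # does not track the drift
}
best_name, best_r2 = select_internal_standard(concentrations, analyte_areas, candidates)
```

The design point is that an internal standard is useful when it shares the analyte's run-to-run variation, which shows up as a better linear fit of the area ratio against concentration.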

Further, the training, release and maintenance of the machine learning algorithm and disease prediction model in S5 further comprises:

S5.1 Model training algorithms and optimization;

S5.1.1 fully automatic machine learning for mass spectrum data is realized on the basis of various multivariate statistical analysis, chemometrics and artificial intelligence algorithms;

S5.1.2 supervised or unsupervised pattern recognition and classification is performed on metabolome samples, disease-related marker sets are identified and screened fully automatically, models are trained and their quality is evaluated, and automated parameter optimization is performed for specific algorithms;

S5.1.3 feature extraction and dimensionality reduction algorithms, including cluster analysis, analysis of variance, principal component analysis, partial least squares analysis and orthogonal partial least squares analysis, are applied to high-dimensional data, and the algorithms can be extended through a model interface;

S5.2 Training, release and maintenance of the disease prediction model;

S5.2.1 training follows a general development pattern for disease prediction models based on metabonomics mass spectrum data: samples in the metabolome database are selected and imported, the model is trained with a specific machine learning algorithm, and the quality of the established model, namely its degree of generalization, is reported through a cross-validation algorithm and the overall recognition rate of the trained model;

S5.2.2 maintenance of the disease prediction model: the model is further adjusted and refined by adjusting the composition and size of the sample set and by optimizing algorithm parameters;

S5.2.3 release and application of the disease prediction model: the whole disease prediction model is exported as an independent file and installed on a target computer, where auxiliary prediction and screening analysis of the relevant disease types is performed on a sample to be tested by loading the model file; alternatively, a cloud service program is called through an API (application programming interface), to which detection data are sent and from which the prediction result of the disease prediction model is returned.
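The cross-validation step in S5.2.1 can be illustrated with a small, dependency-free sketch. The patent does not name a specific classifier or fold scheme, so the nearest-centroid classifier, the interleaved fold assignment and the two-feature toy data below are assumptions chosen only to keep the example short.

```python
from typing import Dict, List


def train_centroids(X: List[List[float]], y: List[str]) -> Dict[str, List[float]]:
    """Compute the mean feature vector of each class."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids: Dict[str, List[float]], row: List[float]) -> str:
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def distance(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=distance)


def cross_val_accuracy(X: List[List[float]], y: List[str], k: int = 3) -> float:
    """k-fold cross-validation; sample i goes to fold i % k (no shuffling)."""
    n = len(X)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = train_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / n


# Toy data: two metabolite intensities per sample, two well separated groups.
X = [[1.0, 0.2], [0.9, 0.1], [1.1, 0.3], [0.2, 1.0], [0.1, 0.9], [0.3, 1.1]]
y = ["control", "control", "control", "case", "case", "case"]
accuracy = cross_val_accuracy(X, y, k=3)
```

The returned accuracy plays the role of the overall recognition rate mentioned in the text: it is measured only on samples the model did not see during training, which is what makes it an estimate of generalization.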

In another aspect, a disease prediction system based on medical big data is provided, where the system includes:

mass spectrometer data preprocessing, used for importing mass spectrum data from each mainstream mass spectrometer manufacturer, converting the manufacturer's raw mass spectrum data into a universal data format, converting continuous (profile) mass spectra into stick (centroid) spectra, and preprocessing the data by extracting the different scan channels;

a metabolite compound database, used for automatically labeling and qualitatively analyzing endogenous compounds in the mass spectrum, according to human endogenous metabolite spectrum information and a compound search and matching algorithm;

a human metabolome sample database, used for collecting and effectively managing large and complex human metabolome samples and carrying out data mining and modeling, wherein the collected data types mainly comprise the raw data and labeling results of targeted and non-targeted metabonomics mass spectrometry analysis, and spatial metabonomics information obtained through mass spectrometry imaging;

a metabonomics data analysis workstation, used for performing normalization, cleaning, compound alignment, and qualitative and quantitative analysis on complex multi-sample data, and for applying several types of correction algorithms to address sample differences caused by instruments and experimental procedures;

a mass spectrum data machine learning algorithm and disease prediction model development module, used for completing the whole process of training, releasing and maintaining the user's disease prediction model by using different types of machine learning model training algorithms adapted to the characteristics of mass spectrum data.

Further, the mass spectrum data matching of the metabolite compound database is used for:

exact matching of analysis conditions: according to the experimental conditions, instrument model, ion source type, and liquid chromatography retention time and mass spectrum data recorded when each metabolite was measured, the sample to be tested is matched to the compound in the database whose detection conditions are closest;

dynamic adduct ion matching: when the type and number of adduct ions are uncertain, positive and negative adduct ions are matched dynamically, which effectively reduces missed and false detections of compounds;

mass spectrum isotope matching: matching supports data of any mass spectral resolution and simulates the isotope peaks of compounds of any molecular weight, which improves compatibility with sample data from different instrument types;

retention time matching: window matching is applied to the retention time of the compound in combination with the chromatographic conditions used when the sample was tested, so as to filter out the influence on the matching result of interfering compounds whose chromatographic behavior differs substantially;

primary mass spectrum matching: the fragmentation information of the quasi-molecular ion peak and fragment peaks of each metabolite molecule is fully recorded, and complete primary mass spectrum information, including the chemical formula, structural formula and fragment loss information of each fragment, is recorded to achieve accurate metabolite matching;

secondary mass spectrum matching: according to the fragmentation information of compound parent ions under different ion sources and different collision voltages, complete secondary mass spectrum information, including chemical formulas, structural formulas and fragment loss information, is recorded to achieve accurate metabolite matching;

ion mobility matching: ion mobility information of compound fragments is recorded and used to distinguish isomers, including chiral isomers, to achieve accurate metabolite matching;

metabolic pathway matching: based on all collected information on known human metabolic pathways, metabolites related to the matched compound upstream and downstream in the metabolic pathway are also matched, and compounds that meet this condition are given a higher match score, which improves the accuracy of metabolite matching;

non-targeted matching mechanism: unknown metabolites that lack an authentic standard but are of significant importance and have stable mass spectrum characteristics are recorded as non-targeted characteristic compounds, which improves the coverage and reliability of the matching results and further improves the accuracy of metabolite matching;

matching weight adjustment: because the matching algorithms differ in principle and mechanism, they influence the search results to different degrees; through the weight adjustment function, the user strengthens or weakens particular matching mechanisms according to the specific conditions of the sample, which further improves the accuracy of metabolite matching.
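The matching weight adjustment described above can be pictured as a weighted average of per-mechanism scores. The patent does not give a scoring formula, so the weighted mean, the default weight of 1.0, the criterion names and the candidate scores below are all hypothetical.

```python
from typing import Dict, List, Tuple


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-mechanism scores in [0, 1].

    Mechanisms without an explicit weight use 1.0 (an assumed default).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for criterion, score in scores.items():
        weight = weights.get(criterion, 1.0)
        weighted_sum += weight * score
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else 0.0


def rank_candidates(
    candidates: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Sort candidate compounds by combined score, best first."""
    scored = [(name, combined_score(s, weights)) for name, s in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical per-mechanism scores for two candidate annotations of one feature.
candidates = {
    "compound_A": {"mz": 0.95, "retention_time": 0.70, "isotope": 0.90},
    "compound_B": {"mz": 0.80, "retention_time": 0.95, "isotope": 0.70},
}
default_rank = rank_candidates(candidates, {})
rt_heavy_rank = rank_candidates(candidates, {"retention_time": 3.0})
```

With equal weights compound_A ranks first; tripling the retention time weight moves compound_B to the top, which is the kind of user-controlled emphasis the weight adjustment function is meant to provide.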

Further, with respect to the human metabolome sample database:

the human metabolome sample database consists of a human metabolome biological sample library and an experimental animal biological sample library, containing blood and tissue section samples;

each record in the sample library is associated with the health data of the tested person, including basic information such as sex, age, ethnicity, height and weight, and auxiliary diagnostic information such as lifestyle and dietary habits, past medical history and medication status;

the metabolome sample data are seamlessly associated with the metabolite database to generate full-spectrum, automatic compound labels;

the human metabolome sample database exchanges data seamlessly with the metabonomics data analysis workstation, for the user's one-stop analysis and data modeling development;

in terms of omics analysis design concept and database architecture, the database addresses the frontier requirements of large-sample metabolome management and data mining, and is the first of its kind.

Further, the correction by several types of algorithms is used for:

screening internal standards in the sample: a fitted standard curve is established with each candidate internal standard compound used as the internal standard; the standard curve that fits the analyte is selected from the fitted curves; and the internal standard corresponding to that curve, as judged by the coefficient of determination of its regression equation, is taken as the final internal standard of the analyte.

Further, the training, release and maintenance of the machine learning algorithm and disease prediction model are used for:

model training algorithms and optimization:

realizing fully automatic machine learning for mass spectrum data on the basis of various multivariate statistical analysis, chemometrics and artificial intelligence algorithms;

performing supervised or unsupervised pattern recognition and classification on metabolome samples, identifying and screening disease-related marker sets fully automatically, training models and evaluating their quality, and performing automated parameter optimization for specific algorithms;

applying feature extraction and dimensionality reduction algorithms, including cluster analysis, analysis of variance, principal component analysis, partial least squares analysis and orthogonal partial least squares analysis, to high-dimensional data, and extending the algorithms through a model interface;

training, release and maintenance of the disease prediction model:

training follows a general development pattern for disease prediction models based on metabonomics mass spectrum data: samples in the metabolome database are selected and imported, the model is trained with a specific machine learning algorithm, and the quality of the established model, namely its degree of generalization, is reported through a cross-validation algorithm and the overall recognition rate of the trained model;

maintenance of the disease prediction model: the model is further adjusted and refined by adjusting the composition and size of the sample set and by optimizing algorithm parameters;

release and application of the disease prediction model: the whole disease prediction model is exported as an independent file and installed on a target computer, where auxiliary prediction and screening analysis of the relevant disease types is performed on a sample to be tested by loading the model file; alternatively, a cloud service program is called through an API (application programming interface), to which detection data are sent and from which the prediction result of the disease prediction model is returned.
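Releasing a model as an independent file and later loading it for prediction can be sketched as below. The patent does not specify a file format or model type, so the JSON layout, the field names (`format_version`, `features`, `centroids`), the nearest-centroid model and the function names are hypothetical choices for illustration only.

```python
import json
from pathlib import Path
from typing import Dict, List


def export_model(path: str, features: List[str], centroids: Dict[str, List[float]]) -> None:
    """Write the whole model to one standalone JSON file (assumed layout)."""
    payload = {"format_version": 1, "features": features, "centroids": centroids}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


def load_model(path: str) -> dict:
    """Read a model file produced by export_model."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def predict_sample(model: dict, sample: Dict[str, float]) -> str:
    """Classify one sample given as {feature name: intensity}."""
    # Order the sample values the same way the model's features are ordered.
    row = [sample[name] for name in model["features"]]

    def distance(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(row, model["centroids"][label]))

    return min(model["centroids"], key=distance)
```

A possible usage on the target computer would be `model = load_model("model.json")` followed by `predict_sample(model, {"metabolite_1": 0.9, "metabolite_2": 0.1})`, where the file name and feature names are again placeholders. Storing the feature names inside the file keeps the prediction step independent of the training environment.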

Beneficial effects of the invention: in view of the defects in the prior art, the invention has the following beneficial effects:

1) automatic labeling and qualitative analysis of endogenous compounds in a sample mass spectrum are realized through a metabolite compound database comprising comprehensive mass spectrum information for 2000 human endogenous metabolites and 10 compound search and matching algorithms; the combined application of big data and intelligent matching algorithms greatly improves data labeling efficiency and compound identification quality in the current field of metabonomics analysis;

2) by constructing a digital specimen library of human metabolome biological samples centered on blood and tissue section samples, and by managing human metabolome samples effectively in a big data manner and carrying out data mining and modeling, the management of large and complex human metabolome samples is greatly facilitated, and core basic support is provided for omics data mining and the development of artificial intelligence disease prediction models;

3) automation, standardization and traceability of the metabonomics data analysis process are realized; through its purpose-designed sample correction algorithm, inter-sample compound alignment algorithm and content determination method, the metabonomics data analysis workstation provides an efficient and flexible data processing workflow and greatly improves sample correction quality compared with the prior art;

4) by comprehensively using various machine learning algorithms optimized for the characteristics of mass spectrum data, one-stop data modeling of complex metabonomics mass spectrum data and application scenarios of artificial intelligence prediction for various diseases are supported; the automated disease prediction and screening model development system based on metabonomics detection data enables complete development and application of disease prediction models, including sample classification, discovery of disease marker sets, disease prediction model training, model maintenance and prediction of unknown samples, which greatly raises the current level of integration between clinical mass spectrometry testing and artificial intelligence, promotes a deeper understanding of the laws governing the occurrence and development of disease, and contributes positively to early screening and precision medicine for major and chronic diseases;

5) the system is compatible with various third-party mass spectrometers through a data exchange interface, provides intuitive, user-friendly, highly intelligent and automated metabonomics data analysis tools and disease prediction model development and application support for users in different fields such as scientific research, medical care and laboratories, and realizes a one-stop artificial intelligence platform for disease screening and prediction based on medical big data.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic block diagram of the mass spectrum data matching of the metabolite compound database in the disease prediction method and system based on medical big data according to an embodiment of the invention;

FIG. 2 is a flow chart of the metabonomics data analysis workstation in the disease prediction method and system based on medical big data according to an embodiment of the invention;

FIG. 3 is a flow block diagram of the mass spectrum data machine learning algorithm and disease prediction model development system in the disease prediction method and system based on medical big data according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.

As shown in fig. 1 to 3, the disease prediction method and system based on medical big data according to the embodiment of the present invention includes:

First, mass spectrometer data exchange and preprocessing are carried out. Mass spectrum data from each mainstream mass spectrometer manufacturer are imported, either exported directly from the manufacturer's workstation or converted with the data conversion function of the instrument workstation software (such as DataBridge of the XCalibur workstation), so that the manufacturer's raw mass spectrum data are converted into a universal data format. Continuous (profile) mass spectra are converted into stick (centroid) spectra, the different scan channels are extracted as a preprocessing operation, and the import of mass spectrometry imaging data types is supported.

Then, mass spectrum data matching is performed with the metabolite compound database. According to personal endogenous metabolite spectrum information and a compound search matching algorithm, the metabolite compound database is used to automatically label and qualitatively analyze the endogenous compounds in a mass spectrogram. The mass spectrum information of over 2000 individual metabolites is measured from the detection results of physical standard substances, so that the interference of exogenous compounds is fully eliminated and a matching database of human endogenous metabolites is obtained that currently covers the most comprehensive range of species. The database also supports dynamic editing and expansion, so that the accuracy of metabolite matching is continuously improved through big data matching. Because the mass spectrum information of LC-MS is affected by many factors, such as the sample matrix, the type of mass spectrum ion source and the working state of the instrument, the following 10 mass spectrum data matching methods are designed in the metabolite compound database to achieve accurate and automatic compound analysis:

1) accurate matching of analysis conditions: according to the experimental conditions, instrument model, ion source type and retention time of the liquid chromatography and mass spectrometry data recorded when each metabolite was detected, the sample to be detected is accurately matched with the compound in the database whose detection conditions are closest;

2) dynamic adduct ion matching: when the type and number of adduct ions are uncertain, a dynamic matching mode for positive and negative adduct ions is adopted, which effectively reduces missed detection and false detection of compounds;

3) mass spectrum isotope matching: matching is supported for data of any mass spectrum resolution, and the isotope peaks of a compound of any molecular weight are simulated, which improves compatibility with sample data from different instrument types during matching;

4) retention time matching: window matching is performed on the retention time of the compound in combination with the chromatographic conditions used during sample testing, so as to filter out the influence on the matching result of interfering compounds whose chromatographic behavior differs greatly;

5) primary mass spectrum matching: the fragmentation information of the quasi-molecular ion peak and fragment peaks of the metabolite molecule is completely recorded, and the complete primary mass spectrum information, including the chemical formula, structural formula and fragment loss information of each fragment, is recorded to achieve accurate metabolite matching;

6) secondary mass spectrum matching: the complete secondary mass spectrum information, including chemical formulas, structural formulas and fragment loss information, is recorded according to the fragmentation information of the compound parent ion under different ion sources and different collision voltages, to achieve accurate metabolite matching;

7) ion mobility matching: recording ion mobility information of compound fragments, and distinguishing various isomers and chiral isomers to realize accurate metabolite matching;

8) metabolic pathway matching: according to all collected known human metabolic pathway information, the metabolites related to the matched compound upstream and downstream in the metabolic pathway are also matched, and compounds that meet this condition are given a higher matching score, which improves the accuracy of metabolite matching;

9) non-targeted matching mechanism: unknown metabolites that lack an authentic standard substance but are significant and have stable mass spectrum characteristics are recorded as non-targeted characteristic compounds, which improves the coverage and reliability of the matching results and further improves the accuracy of metabolite matching;

10) matching weight adjustment: owing to their different principles and mechanisms, the various matching algorithms influence the search results to different degrees; through the weight adjustment function, a user can strengthen or weaken particular matching mechanisms according to the specific conditions of a sample, which further improves the accuracy of metabolite matching;

Among the above 10 analysis methods, 1), 8) and 9) are proposed for the first time in this application, and 2), 6) and 10) are substantially improved. Intelligent automatic retrieval is thereby realized from the perspective of mass spectrometry big data, and the combined, cooperative use of the 10 proposed mass spectrum data matching algorithms greatly improves the accuracy of metabolite compound labeling and identification, as shown in fig. 1.
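As an illustration of how three of the matching methods above (dynamic adduct ion matching, retention time window matching and matching weight adjustment) might fit together, a minimal sketch follows. The adduct table, the tolerance values, the weighted-average scoring formula and all function names are assumptions made for this example; the embodiment does not disclose its actual scoring scheme.

```python
PROTON = 1.007276  # mass of a proton in Da

# Illustrative adduct table: name -> (charge, mass shift in Da).
ADDUCTS = {
    "[M+H]+": (1, PROTON),
    "[M+Na]+": (1, 22.989218),
    "[M-H]-": (-1, -PROTON),
}


def adduct_mz(neutral_mass, adduct):
    """Expected m/z of a compound for a given adduct."""
    charge, shift = ADDUCTS[adduct]
    return (neutral_mass + shift) / abs(charge)


def match_adducts(observed_mz, neutral_mass, ppm_tol=10.0):
    """Return every adduct, positive or negative, that explains observed_mz."""
    hits = []
    for name in ADDUCTS:
        expected = adduct_mz(neutral_mass, name)
        ppm_error = abs(observed_mz - expected) / expected * 1e6
        if ppm_error <= ppm_tol:
            hits.append(name)
    return hits


def rt_in_window(observed_rt, library_rt, window=0.5):
    """Retention time window matching (same time unit for all arguments)."""
    return abs(observed_rt - library_rt) <= window


def weighted_score(scores, weights):
    """Weighted average of per-method scores; weights are user-adjustable."""
    total_weight = sum(weights[k] for k in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[k] * weights[k] for k in scores) / total_weight


# Phenylalanine (monoisotopic mass 165.078979 Da) observed as a protonated ion.
hits = match_adducts(166.0863, 165.078979)
scores = {"adduct": 1.0 if hits else 0.0,
          "retention_time": 1.0 if rt_in_window(5.2, 5.0) else 0.0,
          "ms2": 0.5}
weights = {"adduct": 2.0, "retention_time": 1.0, "ms2": 1.0}
print(hits, weighted_score(scores, weights))
```

Raising or lowering an entry in `weights` corresponds to strengthening or weakening that matching mechanism for a particular sample.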

Next, the human metabolome sample database collects and stores large and complex human metabolome samples, manages them effectively and supports data mining and modeling. The stored data types mainly comprise the raw data and labeling results of targeted and non-targeted metabolomics mass spectrometry analysis, together with spatial metabolomics information obtained through mass spectrometry imaging. The sample database consists of a human metabolome biological sample database and an experimental animal biological sample database, wherein the human metabolome sample database covers blood and tissue section samples;

each record of the sample database is associated with the health data of the person tested. The health data comprise basic information (sex, age, nationality, height and weight) and auxiliary diagnostic information (lifestyle and dietary habits, past medical history and medication status). The metabolome sample data are seamlessly associated with the metabolite database to generate full-spectrum, automatic compound labels, and the human metabolome sample database exchanges data seamlessly with the metabonomics data analysis workstation so that a user can carry out one-stop analysis and data modeling. In terms of omics analysis design concept and database architecture, the database is the first of its kind to address the frontier requirements of large-sample management and data mining for the metabolome.

Then, the metabonomics data analysis workstation performs normalized cleaning, compound alignment, and qualitative and quantitative analysis on complex multi-sample data, and applies multiple types of algorithms to correct sample differences caused by instruments and experimental processes. For the internal standard substances in the sample, a screening algorithm is adopted: a fitted standard curve is established with each candidate internal standard compound as the internal standard, the standard curve that best fits the analyte is selected from the fitted standard curves to determine the corresponding internal standard of the analyte, and the internal standard selected according to the coefficient of determination of the regression equation is taken as the final internal standard of the analyte.
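The internal standard screening described above can be sketched as follows. The sketch assumes a linear working curve of the analyte-to-internal-standard peak area ratio against concentration, and assumes that the candidate with the highest coefficient of determination (R²) is chosen; the function names and the calibration data are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, with R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


def pick_internal_standard(concentrations, analyte_areas, candidate_areas):
    """Fit one standard curve per candidate and keep the one with the best R^2."""
    best_name, best_r2 = None, float("-inf")
    for name, is_areas in candidate_areas.items():
        ratios = [a / s for a, s in zip(analyte_areas, is_areas)]
        _, _, r2 = linear_fit(concentrations, ratios)
        if r2 > best_r2:
            best_name, best_r2 = name, r2
    return best_name, best_r2


# Hypothetical calibration series: IS_A tracks the analyte's signal drift,
# IS_B does not.
concentrations = [1.0, 2.0, 4.0, 8.0]
analyte_areas = [110.0, 190.0, 420.0, 780.0]
candidates = {
    "IS_A": [110.0, 95.0, 105.0, 97.5],
    "IS_B": [100.0, 100.0, 100.0, 100.0],
}
print(pick_internal_standard(concentrations, analyte_areas, candidates))
```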

Finally, the mass spectrometry data machine learning algorithm and disease prediction model development module, which is the core technology for automatic disease prediction and screening models based on metabonomics detection data, adopts different types of machine learning model training set algorithms designed for the characteristics of mass spectrum data processing to complete the whole process of training, releasing and maintaining the user's disease prediction model, and further comprises:

model training set algorithm and optimization: based on various multivariate statistical analysis, chemometrics and artificial intelligence algorithms, fully automatic machine learning is realized for mass spectrum data. Supervised or unsupervised pattern recognition and classification are performed on metabolome samples, disease-related marker groups are identified and screened fully automatically, models are trained and their quality is evaluated, and automatic parameter optimization is performed for the specific algorithm. Feature extraction and dimension reduction algorithms for high-dimensional data include cluster analysis, analysis of variance, principal component analysis, partial least squares analysis and orthogonal partial least squares analysis, and further algorithms can be added through a model interface;

training, releasing and maintaining a disease prediction model: training follows a general development mode for disease prediction models based on metabonomics mass spectrum data. Samples in the metabolome database are selected and imported, and the model is trained with a specific machine learning algorithm. The trained model is evaluated with a cross validation algorithm, and the quality of the established model is given by the comprehensive recognition rate, that is, the degree of generalization of the model. For maintenance, the model is further adjusted and perfected by adjusting the composition and number of the sample set and by algorithm parameter optimization. For release and application, the whole disease prediction model is exported as an independent file that is released and installed on a target computer, or a cloud service program is called through an API to send detection data and return the prediction result of the disease prediction model; by loading the model file, auxiliary prediction and screening analysis of the relevant disease types is carried out on the sample to be tested.
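Of the dimension reduction algorithms listed above, principal component analysis is the simplest to illustrate. The following minimal sketch extracts only the first principal component by power iteration on the covariance matrix; it is an assumption-laden teaching example (the function name, iteration count and data are hypothetical), not the implementation used by the system.

```python
def first_principal_component(rows, iterations=200):
    """Return (loading vector, sample scores) for the first principal component.

    rows: list of samples, each a list of feature intensities.
    Power iteration is used, so the all-ones starting vector must not be
    orthogonal to the leading component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    vec = [1.0] * p
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = sum(v * v for v in nxt) ** 0.5
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centered]
    return vec, scores


# Hypothetical two-metabolite data lying on the line y = 2x.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loading, scores = first_principal_component(samples)
print(loading, scores)
```

Projecting each sample onto the loading vector reduces the two correlated features to a single score per sample, which is the dimension reduction step that precedes classification.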

In order to facilitate understanding of the above-described technical aspects of the present invention, the above-described technical aspects of the present invention will be described in detail below in terms of specific usage.

In specific use, according to the disease prediction method and system based on medical big data, mass spectrometry analysis data are used to establish artificial intelligence screening and prediction models of human diseases through machine learning algorithms. The provided metabonomics database, data analysis and modeling are of important value for understanding in depth how diseases arise and develop, and are used for the analysis of disease metabolic pathways, the discovery of disease-related biomarker groups and the like. Through this application, corresponding artificial intelligence analysis and prediction models are developed for early warning and precision medicine of serious diseases such as chronic diseases and cancers. In the following, the development of a metabonomics-based machine learning model for acute coronary heart disease, represented by myocardial infarction, is taken as an example to explain the method steps from sample collection to the establishment of a corresponding early clinical prediction model for myocardial infarction (acute coronary syndrome).

The specific implementation example comprises the following steps:

Step one, collecting samples: plasma from 92 patients with stable coronary heart disease (control group), plasma from 102 patients with acute coronary syndrome (model training group) and plasma from 93 coronary angiography-negative subjects (blank group) were provided by the hospital Fuweisan, Chinese academy of medicine. Plasma was separated from all collected blood samples within 1 hour: the samples were centrifuged at 1350 rcf at 4 ℃ for 12 minutes, and the pale yellow supernatant was removed and transferred to a 15 ml DNase-free sterile centrifuge tube, centrifuged once again and transferred to a 2 ml DNase-free sterile centrifuge tube. After centrifugation at 13500 rcf at 4 ℃ for 5 min, the supernatant was transferred to a 2 ml DNase-free sterile centrifuge tube, labeled, and frozen at -80 ℃.

Step two, sample pretreatment: 100 μL of each sample was pipetted, 3 ml MTBE and 1.2 ml H2O were added, and the mixture was vortexed at 2500 rpm for 15 min and centrifuged at 4200 rpm at 4 ℃ for 10 min. The upper organic phase was transferred to another glass tube, and the lower aqueous phase was transferred to a 2 ml EP tube. After both tubes were dried under nitrogen, 100 μL of ACN:H2O (2:98, V/V) was added to the lower polar extract for reconstitution, and 400 μL of MeOH:CHCl3 (1:1) was added to the upper lipid extract for reconstitution. The reconstituted samples were vortexed at 2500 rpm for 5 min and centrifuged at 4 ℃ for 5 min at 12500 rpm (polar extract) or 4500 rpm (lipid extract); the supernatant was fractionated, filtered through a 96-well plate, and injected for analysis.

Step three, chromatographic and mass spectrum experimental conditions: not shown.

Step four, uploading to the database: the original sample files from the mass spectrometer workstation, such as data in cdf, mzXML and XML formats, are uploaded to the human metabolome sample database in the system, so that the data can be managed effectively and analyzed traceably. Basic information on the sample is entered at the same time, including basic patient information, sampling information, experimental conditions and instrument information.

Step five, labeling compounds and identifying metabolites: compounds in the obtained metabolome sample data are labeled through the metabolite database in the system, and all endogenous and exogenous metabolites are precisely labeled through database big data and the non-targeted matching method. Among the 10 matching retrieval options provided by the database, analysis condition matching, retention time matching, dynamic adduct ion matching, primary mass spectrum matching and the like are required options, while mass spectrum isotope matching, secondary mass spectrum matching, ion mobility matching, metabolic pathway matching and non-targeted matching are optional. The user fine-tunes the matching result according to the required matching precision and matching range.

Step six, sample quantification and correction: through the metabonomics data analysis workstation in the system, batch instrument fluctuation correction and content determination are performed on the obtained metabolome sample data in combination with the accompanying QC samples and internal standards. For QC sample correction, algorithms such as locally weighted regression and support vector regression can be selected. For content determination, the working curve type can be chosen from linear regression, quadratic regression, power regression, Wagner regression, Hill regression and the like, and functions such as instrument fluctuation correction and automatic screening of internal standard compounds can optionally be applied. With these settings, the error calibration and content determination of the whole sample set are completed.
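A minimal sketch of QC-based instrument fluctuation correction with locally weighted regression is given below. It assumes that QC samples are injected at known positions in the run order, that the drift of one compound is estimated by a tricube-weighted local linear fit through the QC intensities, and that each study sample is rescaled to the median QC level; the function names, the span value and the data are hypothetical.

```python
from statistics import median


def local_linear_predict(xs, ys, x0, span=0.75):
    """Locally weighted linear regression evaluated at x0 (tricube weights)."""
    k = min(len(xs), max(2, round(span * len(xs))))
    # Bandwidth: distance from x0 to the k-th nearest training point.
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
    weights = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(weights)
    if sw == 0:
        return sum(ys) / len(ys)  # degenerate case: fall back to the mean
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    if sxx == 0:
        return my  # only one point carries weight
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)


def drift_correct(qc_order, qc_intensity, sample_order, sample_intensity):
    """Rescale each sample by the drift predicted from the QC injections."""
    reference = median(qc_intensity)
    corrected = []
    for order, value in zip(sample_order, sample_intensity):
        predicted = local_linear_predict(qc_order, qc_intensity, order)
        corrected.append(value * reference / predicted)
    return corrected


# Hypothetical run: the signal of one compound drifts down by 1 unit per injection.
qc_order = [1, 5, 9, 13]
qc_intensity = [100.0, 96.0, 92.0, 88.0]
print(drift_correct(qc_order, qc_intensity, [3, 11], [98.0, 90.0]))
```

In this example both study samples contain the same amount of the compound, and after correction their intensities agree even though the raw values differ.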

Step seven, identifying the disease marker group by algorithm: the sample data that have undergone compound labeling and content determination are grouped by sample type and imported into the disease prediction model development system in the system. First, metabolites are aligned across samples, and low-concentration compound signals or compounds with a low frequency of occurrence are discarded by setting a relative intensity filter value and a detection frequency filter value. Characteristic compounds with a VIP value greater than 1.5 and a statistical p-value less than 0.05 are then screened through the supervised OPLS-DA algorithm as candidate inter-group differential metabolites;

taking the data of patients with acute coronary syndrome as an example, the screening results for potential biomarkers among polar compounds show significant differences in markers of phenylalanine metabolism, tryptophan metabolism, tyrosine metabolism, the TCA cycle, the phosphate cycle, the pentose phosphate pathway, beta-alanine metabolism and lysine-alanine metabolism. The differences in the metabolites of these pathways are useful for disease prediction and indirectly indicate that the related pathways are disordered.
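The filtering and screening in step seven can be sketched as follows. The sketch assumes that relative intensity is measured against the largest signal in each sample and that the VIP values and p-values have already been computed by an OPLS-DA implementation, which is not reproduced here; the function names, default thresholds for filtering and the data are hypothetical, while the VIP and p-value cut-offs are those stated in the text.

```python
def filter_features(intensities, min_rel_intensity=0.01, min_frequency=0.5):
    """Drop low-intensity or rarely detected compounds.

    intensities: dict mapping compound name -> list of intensities, one per
    sample (0 means not detected). A compound counts as detected in a sample
    when its intensity is at least min_rel_intensity times the largest signal
    in that sample; it is kept when detected in at least min_frequency of
    the samples.
    """
    names = list(intensities)
    n_samples = len(intensities[names[0]])
    sample_max = [max(intensities[n][s] for n in names) for s in range(n_samples)]
    kept = []
    for name in names:
        detected = sum(
            1 for s in range(n_samples)
            if sample_max[s] > 0
            and intensities[name][s] >= min_rel_intensity * sample_max[s]
        )
        if detected / n_samples >= min_frequency:
            kept.append(name)
    return kept


def select_candidates(stats, names, vip_min=1.5, p_max=0.05):
    """Keep compounds whose precomputed (VIP, p-value) pass both thresholds."""
    return [n for n in names if stats[n][0] > vip_min and stats[n][1] < p_max]


# Hypothetical aligned intensity matrix for three compounds in four samples.
matrix = {
    "A": [100.0, 120.0, 90.0, 110.0],
    "B": [0.5, 0.0, 0.4, 0.0],
    "C": [50.0, 0.0, 60.0, 55.0],
}
kept = filter_features(matrix)
# Hypothetical (VIP, p-value) pairs, as an OPLS-DA step might report them.
stats = {"A": (2.1, 0.003), "C": (1.2, 0.01)}
print(kept, select_candidates(stats, kept))
```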

Step eight, training a prediction model:

(1) selecting a pattern recognition algorithm, combining the selected training set samples, performing model training on the aligned sample matrix, and establishing a disease pattern discrimination model;

(2) selecting a cross validation method, such as the leave-one-out method, and evaluating the prediction quality of the model; in combination with the false recognition rate and rejection rate of the model in the output result, the algorithm parameter configuration with the highest recognition rate is selected as the optimal model;

(3) validation set testing of the model: selecting a batch of samples which do not participate in model training, and predicting by using the trained models to evaluate the overfitting degree and the generalization degree of the models;

(4) release and maintenance of the model: a model that meets the requirements in model training and validation set testing is exported as a prediction model file, which can be called independently or sent to a third-party user for remote model updating;

(5) analyzing the unknown sample to be tested: the sample is processed in the same way as in steps one to seven, the uploaded detection data of the unknown sample are associated with the corresponding disease prediction model file, the file is loaded in the disease prediction model prediction system, and the analysis is executed to obtain the analysis result, which completes the whole disease prediction process.
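The leave-one-out evaluation mentioned in sub-step (2) can be sketched as follows. A nearest-centroid classifier is used here purely as a stand-in for whichever pattern recognition algorithm is selected in sub-step (1); the function names, group labels and feature values are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Assign sample to the class whose mean feature vector is closest."""
    sums, counts = {}, {}
    for x, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [v / counts[label] for v in acc]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and count correct calls."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical two-marker intensities for control and ACS plasma samples.
features = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
            [3.0, 3.1], [3.2, 2.9], [2.8, 3.0]]
labels = ["control"] * 3 + ["ACS"] * 3
print(leave_one_out_accuracy(features, labels))
```

The resulting recognition rate would be compared across algorithm parameter configurations, and the configuration with the highest rate kept as the optimal model.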

In conclusion, by means of the technical scheme of the invention, the comprehensive metabonomics data processing and disease prediction model modeling capabilities of this application reach an internationally leading level. The provided method covers the metabolite and human metabolome databases, mass spectrum data acquisition, disease prediction model training and model prediction. The corresponding software analysis system under this complete disease metabonomics big data architecture has been developed and has entered the deployment stage; once deployment is completed, the whole system can enter the stage of full-scale promotion and use. Through deep mining and machine learning of the metabolite information of living organisms acquired by clinical mass spectrometry and metabonomics technology, the system is closely combined with clinical mass spectrometry analysis, metabonomics analysis and clinical requirements, and provides big data technical support and a system solution for important application directions in the health field, such as early screening and early warning of major and chronic diseases and precision medication.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A disease prediction method based on medical big data is characterized by comprising the following steps:

S1, mass spectrometer instrument data preprocessing: mass spectrum data of each mainstream mass spectrometer manufacturer are imported, the manufacturer's original mass spectrum data are converted into a universal data format, continuous mass spectra are converted into stick mass spectra, and preprocessing operations such as extraction of different scanning channels are performed;

and S5, a mass spectrum data machine learning algorithm and a disease prediction model development module, wherein different types of machine learning model training set algorithms designed aiming at the mass spectrum data processing characteristics are adopted to complete the whole process of training, releasing and maintaining the user disease prediction model.

2. The medical big data-based disease prediction method according to claim 1, wherein the S2 metabolite compound database mass spectrum data matching further comprises:

3. The medical big data-based disease prediction method of claim 1, wherein the S3 human metabolome sample database further comprises:

4. The medical big data-based disease prediction method according to claim 1, wherein the correcting by the multi-class algorithm in S4 further comprises:

5. The medical big data-based disease prediction method according to claim 1, wherein the training, issuing and maintaining of the machine learning algorithm and the disease prediction model in S5 further comprises:

S5.1, performing model training set algorithm and optimization;

S5.2, training, releasing and maintaining a disease prediction model;

6. A disease prediction system based on medical big data, the system comprising:

the mass spectrum data machine learning algorithm and the disease prediction model development module adopt different types of machine learning model training set algorithms designed aiming at the mass spectrum data processing characteristics to complete the whole process of training, releasing and maintaining the user disease prediction model.

7. The medical big data-based disease prediction system according to claim 6, wherein the metabolite compound database mass spectral data matching is used for:

8. The medical big data-based disease prediction system according to claim 6, wherein the human metabolome group sample database is used for:

9. The medical big data-based disease prediction system according to claim 6, wherein the multi-class algorithm corrects for:

10. The medical big data-based disease prediction system of claim 6, wherein the machine learning algorithm and disease prediction model are trained, released and maintained for:

performing model training set algorithm and optimization;

training, releasing and maintaining a disease prediction model:

maintenance of the disease prediction model: the model is further adjusted and perfected by adjusting the composition and number of the sample set and by algorithm parameter optimization;