CN115064219A - Method for identifying VOCs biomarkers in human expiration based on machine learning

Info

Publication number: CN115064219A
Application number: CN202210558092.XA
Authority: CN
Inventors: 李想; 岑郑楠; 陈健; 陆冰清
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2022-05-19
Filing date: 2022-05-19
Publication date: 2022-09-16

Abstract

The invention belongs to the technical field of biomarker detection, and in particular relates to a machine-learning-based method for identifying VOCs biomarkers in human exhaled breath. The method includes: acquiring and processing high-dimensional exhaled VOCs data; identifying whether each VOC is endogenous or exogenous by the alveolar gradient method; screening primary biomarkers by univariate and multivariate statistical methods; constructing combined markers through correlation analysis and screening secondary biomarkers with a Lasso logistic regression model; and evaluating the classification and prediction performance of the secondary biomarkers with a random forest algorithm. With the invention, different groups can be classified accurately using only a small number of key biomarkers, with good reliability, high sensitivity and strong specificity, and the cost of classification can be greatly reduced. In addition, a faster temperature-ramp program can be developed for the key compounds, which shortens the analysis time, and the reduced set of key variables simplifies the otherwise complicated chemical interpretation and helps focus the discussion on important metabolic processes and mechanisms.

Description

Method for identifying VOCs biomarkers in human expiration based on machine learning

Technical Field

The invention belongs to the technical field of biomarker detection, and in particular relates to a method for identifying VOCs biomarkers in human exhaled breath.

Background

The composition of exhaled breath directly reflects information about human health. Human cells generate volatile organic compounds (VOCs) during metabolism; carried by the circulatory system, these VOCs pass through various systems and organs, finally reach the lungs, and diffuse into the airways through blood-gas exchange. Each exhalation contains abundant VOCs extracted from the blood, so there is a certain correlation between the organic compounds enriched in the blood and the constituents of exhaled breath. External stimuli or internal physiological reactions can raise or lower the content of some VOCs and can also generate new compounds. Exhaled breath can therefore directly reflect the current physiological state of cells, tissues and microorganisms in the body and provide further information about personal health, and the analysis of breath composition is also an emerging tool for environmental health assessment.

Breath testing has so far been studied in a preliminary way from several perspectives. More than 3000 volatile compounds have been detected in human breath, including aldehydes and ketones, alkanes, alkenes, nitrogen-containing compounds, sulfur-containing compounds and others. These compounds have three main sources: first, VOCs generated by the body's own metabolism, called endogenous VOCs; second, VOCs that enter the body through inhalation, diet, medication, skin contact and other routes and are metabolized and consumed in the body, called exogenous VOCs; third, VOCs released by microorganisms living in the host, which may also reflect the health of the host. In early disease screening and diagnostic studies of exhaled VOCs, researchers have tried to find VOCs biomarkers that indicate a given disease by comparing the exhaled VOCs of cancer patients, patients with inflammation and patients with chronic diseases, but progress has been very slow because of problems with sampling methods, analytical methods and the marker screening process. Because no standardized method for sampling and analyzing exhaled VOCs has been established internationally or domestically, results differ widely between studies and are not comparable, which severely limits the development of breath testing. As a non-invasive method, breath testing has broad prospects in disease diagnosis and, combined with new technologies such as artificial intelligence, is expected to play an important role in the future.

In addition, the screening of exhaled VOCs markers faces many problems, which lead to poor consistency among the markers found in different studies. First, sampling techniques severely limit marker screening. Most current studies use Tedlar bags as the primary sampling tool, but the sampled volume cannot be determined and the bags themselves release contaminants that affect the target compounds inside the bag. Second, breath sampling is costly, and most studies have too few samples, so the screened biomarkers do not generalize. Moreover, the content of VOCs in human breath fluctuates greatly and sampling conditions are not unified or standardized, which seriously affects the stability of the biomarkers. The content of exhaled VOCs is also extremely low, which poses a major challenge to analytical techniques, so appropriate enrichment techniques are required to determine the VOCs content of exhaled breath accurately. Furthermore, the collected VOCs come from both endogenous and exogenous sources; most studies do not distinguish between them and screen biomarkers directly, ignoring the influence of environmental factors, which reduces the accuracy of the markers. Finally, screening biomarkers with high specificity, sensitivity and accuracy from massive exhaled VOCs data remains a major problem. Most current studies perform only simple statistical analysis or a single round of screening, and the resulting markers perform poorly.

The present method focuses on overcoming the difficulties in screening exhaled VOCs markers and establishes an integrated machine-learning-based algorithm to screen biomarkers efficiently from massive exhaled VOCs data.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a machine-learning-based method for identifying VOCs biomarkers in human exhaled breath that accommodates the available sample size, is simple to carry out and is highly accurate.

The invention provides a machine-learning-based method for identifying (screening) VOCs biomarkers in human exhaled breath. For the acquired high-dimensional exhaled VOCs data, the method uses statistics and machine learning to find a reduced set of breath VOCs biomarker parameters that represents the main information in the whole data set without losing important information, and that yields essentially the same classification and identification ability as the full set of exhaled VOCs parameters.

The invention provides a method for identifying VOCs biomarkers in human expiration based on machine learning, which comprises the following specific steps:

S1: acquiring and processing high-dimensional exhaled VOCs data; this comprises collecting a fixed volume (for example, 3 L) of exhaled breath from people with a specific disease (or pathological response) and from healthy people to obtain high-dimensional exhaled VOCs data, and constructing an experimental group and a control group;

S2: identifying whether each VOC is endogenous or exogenous by the alveolar gradient method;

S3: screening primary biomarkers by a univariate statistical method and a multivariate statistical method;

S4: constructing combined markers through correlation analysis, and screening secondary biomarkers with a Lasso logistic regression model;

S5: evaluating the classification and prediction performance of the secondary biomarkers with a random forest algorithm.

The details of each step are further described below.

(1) In step S1, the acquiring and processing of the high-dimensional expiratory VOCs data includes the specific steps of:

Volunteers are divided into an experimental group and a control group. After the volunteer has sat quietly, a fixed volume (e.g., 3 L) of breath is accurately collected onto adsorption tubes (Tenax TA + Carbograph-5TD) using a ReCIVA sampler (Owlstone Medical). The samples are analyzed by thermal desorption comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (TD-GC×GC-TOF MS) to obtain high-dimensional VOCs data. The mass spectral peak data are compared with the NIST spectral library (v.2.3) to confirm the chemical identity of each VOC. The data are corrected with an internal standard: the ratio of the peak area of each VOC to the peak area of the internal standard is taken as the relative content of that VOC. Finally, outlier detection and missing-value imputation are carried out on the mass spectrometry data of all samples.
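The internal-standard correction and missing-value handling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the dictionary layout are hypothetical, missing values are filled with the per-compound mean (the choice reported in Example 1), and outlier detection is left out because the text does not specify a method.

```python
from statistics import mean


def relative_content(peak_areas, internal_standard_area):
    """Divide each VOC peak area by the internal-standard peak area of the
    same sample. Missing peaks (None) stay missing."""
    return {
        voc: (area / internal_standard_area if area is not None else None)
        for voc, area in peak_areas.items()
    }


def impute_missing_with_mean(samples):
    """Fill missing values (None) with the mean of the observed values of that
    VOC across all samples. Mean imputation is an assumption taken from
    Example 1; other imputation schemes would fit the description equally."""
    filled = [dict(sample) for sample in samples]
    for voc in samples[0]:
        observed = [s[voc] for s in samples if s[voc] is not None]
        if not observed:
            continue  # nothing to impute from; leave the gaps in place
        voc_mean = mean(observed)
        for sample in filled:
            if sample[voc] is None:
                sample[voc] = voc_mean
    return filled
```

For example, two samples with internal-standard areas of 100 and 50 would first be converted with `relative_content` and then passed together to `impute_missing_with_mean`.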

(2) In step S2, identifying the internal and external source attributes of the VOCs by using the alveolar gradient includes:

During the breath sampling stage, a hand-held gas sampler (idex) is used to collect the same volume (e.g., 3 L) of ambient air onto an adsorption tube at the same time. The ambient air sample is analyzed in the same way as the breath sample. The alveolar gradient AG of a given VOC is calculated by the following formula (1):

$$AG = \frac{A_{\mathrm{Sample},\mathrm{VOC}_i}}{A_{\mathrm{Sample},\mathrm{IS}}} - \frac{A_{\mathrm{Air},\mathrm{VOC}_i}}{A_{\mathrm{Air},\mathrm{IS}}} \tag{1}$$

In the formula, $A_{\mathrm{Sample},\mathrm{VOC}_i}$ is the peak area of a given VOC in the breath sample, $A_{\mathrm{Sample},\mathrm{IS}}$ is the peak area of the internal standard in the breath sample, $A_{\mathrm{Air},\mathrm{VOC}_i}$ is the peak area of the same VOC in the ambient air, and $A_{\mathrm{Air},\mathrm{IS}}$ is the peak area of the internal standard in the ambient air. If AG > 0, the VOC is considered endogenous; if AG < 0, it is considered exogenous. The rationale is that the content of an endogenous VOC in exhaled breath is higher than its content in the ambient air. For some VOCs with very low content, however, if the enrichment effect is not pronounced, AG may be very close to 0, and the endogenous/exogenous assignment may then carry large errors and uncertainties. For this reason, the alveolar gradient of each VOC is calculated in every sample, and the probability $P_{\mathrm{Endo},i}$ that each VOC is endogenous across all subjects is computed (equation 2).

$$P_{\mathrm{Endo},i} = \frac{N_{i,G>0}}{N} \times 100\% \tag{2}$$

In the formula, $N_{i,G>0}$ is the number of subjects in whom compound i has an alveolar gradient greater than 0, and $N$ is the total number of subjects.

If $P_{\mathrm{Endo},i}$ is greater than 50%, the compound is classified as endogenous; otherwise it is considered exogenous.
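The alveolar gradient rule can be illustrated with a short sketch. It assumes, following the symbol definitions in the text, that AG is the internal-standard-normalized peak area in breath minus that in ambient air, and that the endogenous probability is the percentage of subjects with AG > 0. The function names are hypothetical.

```python
def alveolar_gradient(a_sample_voc, a_sample_is, a_air_voc, a_air_is):
    """Assumed form of formula (1): normalized breath signal minus
    normalized ambient-air signal for one VOC in one subject."""
    return a_sample_voc / a_sample_is - a_air_voc / a_air_is


def endogenous_probability(gradients):
    """Assumed form of equation (2): percentage of subjects whose
    alveolar gradient for this compound is greater than 0."""
    positives = sum(1 for g in gradients if g > 0)
    return 100.0 * positives / len(gradients)


def classify_source(gradients, threshold_percent=50.0):
    """Endogenous if more than 50% of subjects show a positive gradient,
    otherwise exogenous (the 50% threshold is the one stated in the text)."""
    if endogenous_probability(gradients) > threshold_percent:
        return "endogenous"
    return "exogenous"
```

A compound with a positive gradient in exactly half of the subjects is labeled exogenous under this rule, because the text requires the probability to be strictly greater than 50%.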

(3) In step S3, the primary biomarkers are screened by a univariate statistical method and a multivariate statistical method, and the specific process is as follows:

The primary biomarkers are first identified using a univariate statistical test. Whether a parametric test (for normally distributed data) or a nonparametric test (for non-normally distributed data) is used is decided by a normality check on the kurtosis and skewness of the distribution of each variable: if both the kurtosis and the skewness are close to 0, the distribution is treated as normal; otherwise it is treated as non-normal. Significance p-values are calculated using SPSS Statistics analysis software. The fold change FC (equation 3) is calculated from the group means of each variable:

$$FC_i = \frac{\bar{x}_{i,\mathrm{disease}}}{\bar{x}_{i,\mathrm{control}}} \tag{3}$$

In the formula, $\bar{x}_{i,\mathrm{disease}}$ is the mean of the i-th VOC in the affected group, and $\bar{x}_{i,\mathrm{control}}$ is the mean of the i-th VOC in the control group.

A volcano plot is then drawn, and primary biomarkers are screened using p < 0.05 together with FC < 0.5 or FC > 2 as the conditions. The univariate method only gives a significance p-value and the change in the mean between groups; it cannot quantify the difference, nor can it capture relationships among the exhaled VOCs. It does, however, confirm simply and intuitively whether there is a significant difference between the disease group and the control group.
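The volcano-plot screening rule can be sketched as below. This is an illustration under stated assumptions: the fold change is taken as the affected-group mean divided by the control-group mean, the condition is read as p < 0.05 AND (FC < 0.5 OR FC > 2), and the p-values are supplied from outside (the text computes them in SPSS Statistics). The function names are hypothetical.

```python
from statistics import mean


def fold_change(case_values, control_values):
    """Ratio of the affected-group mean to the control-group mean."""
    return mean(case_values) / mean(control_values)


def screen_univariate(data, p_values, alpha=0.05, fc_low=0.5, fc_high=2.0):
    """Return {voc: FC} for compounds that pass the volcano-plot thresholds.

    data maps each VOC name to a (case_values, control_values) pair;
    p_values maps each VOC name to a precomputed significance p-value.
    """
    selected = {}
    for voc, (case_values, control_values) in data.items():
        fc = fold_change(case_values, control_values)
        significant = p_values[voc] < alpha
        large_change = fc < fc_low or fc > fc_high
        if significant and large_change:
            selected[voc] = fc
    return selected
```

Keeping the thresholds as parameters reflects the statement in the description that the FC criterion may be adjusted once the multivariate results are taken into account.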

Potential primary biomarkers are further mined using multivariate statistical methods. Supervised orthogonal partial least squares discriminant analysis (OPLS-DA) is performed in RStudio. Compared with univariate analysis, OPLS-DA can remove the influence of uncontrolled variables on the data by filtering out variation that is orthogonal to the group labels, and it further quantifies the differences between groups caused by the characteristic variables. To verify the reliability of the OPLS-DA model, the sample order is randomly permuted; after 200 random permutation tests, the differences between the two groups are evaluated using an OPLS-DA scatter plot. After model fitting, the output parameters $R^2(y)$ and $Q^2$ are used to evaluate the applicability and predictive power of the model: $R^2(y)$ is the percentage of the y-direction matrix information explained by the OPLS-DA model, and $Q^2$ is the predictive ability of the model. Values above 0.5 for both indicate good classification performance, and the closer they are to 1, the better the model fits the data. The variable importance in projection (VIP) is the weight of each variable in the OPLS-DA model; it measures how strongly each variable influences, and how well it explains, the classification of each group of samples. In the present invention, exhaled VOCs are ranked by VIP value, and variables with VIP > 1 (and p < 0.05) can be added to the primary biomarker list. This supplements the list with markers whose fold change does not satisfy FC < 0.5 or FC > 2. By jointly considering the univariate and multivariate results, the FC and VIP screening criteria can be adjusted as appropriate to determine suitable primary biomarkers.

(4) In step S4, the combined markers are constructed by correlation analysis, and the secondary biomarkers are screened by using a Lasso logistic regression model, which comprises the following specific steps:

Reconstructing markers. Different exhaled VOCs may be produced by the same enzyme or by similar metabolic pathways, and the relationship between such variables can be strengthened by taking their ratio. A list of correlation coefficients between variables is obtained with SPSS Statistics analysis software, using the Pearson correlation coefficient for normally distributed data and the Spearman rank correlation coefficient for non-normally distributed data. With a correlation coefficient greater than or equal to 0.6 as the criterion, each pair of compounds that meets the condition is reconstructed as a ratio, and every reconstructed ratio becomes a new independent marker.
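The ratio-marker construction can be sketched in plain Python. The sketch assumes the Spearman rank correlation (the coefficient used in Example 1) and the literal criterion r ≥ 0.6. The choice of which compound goes in the numerator is arbitrary here (alphabetical order), since the text does not state it, and all values are assumed to be non-zero and non-constant. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ratio_markers(table, threshold=0.6):
    """Build one ratio marker per compound pair whose correlation reaches
    the threshold. table maps compound name to per-sample relative content."""
    markers = {}
    for first, second in combinations(sorted(table), 2):
        if spearman(table[first], table[second]) >= threshold:
            markers[f"{first}/{second}"] = [
                u / v for u, v in zip(table[first], table[second])
            ]
    return markers
```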

Simplifying the markers and extracting key features. Because the number of primary and ratio biomarkers screened above may be large, a Lasso (least absolute shrinkage and selection operator) logistic regression (LLR) model can be used to refine the input variables that will subsequently feed the diagnostic model. The present invention uses the LLR model to apply L1 regularization to all single and combined markers: the regression coefficients of the variables are shrunk during maximum likelihood estimation, the coefficients of relatively unimportant variables are reduced to 0, and those variables are excluded on this basis. This penalized estimation guards against the overfitting that may arise from collinearity or high dimensionality of the independent variables. The number of covariates in the model decreases as the tuning parameter λ increases, but the fitting error does not change monotonically with λ. According to the results given by the model, the model error is within an acceptable range when λ lies between $\lambda_{\min}$ and $\lambda_{1se}$. $\lambda_{\min}$ is the λ value at which the mean square error is smallest; $\lambda_{1se}$ is the largest λ whose error lies within one standard error of that minimum. Depending on the needs of the study, $\lambda_{\min}$ can be chosen to give a model with relatively more variables, or $\lambda_{1se}$ to give the most parsimonious model. The selected λ is substituted back to rebuild the LLR model, and the variables with non-zero regression coefficients are taken as the secondary biomarkers. These markers contribute most to distinguishing the two groups, and their number is far smaller than the combined number of primary biomarkers and combined markers. Based on these markers, a model predictive score (PS) formula can be obtained:

$$PS = \sum_{i=1}^{n} C_i \times \mathrm{Var}_i$$

In the formula, $C_i$ is the regression coefficient of each non-zero variable; $\mathrm{Var}_i$ is non-zero variable i; and $n$ is the number of non-zero variables.
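Two pieces of the LLR step lend themselves to a short sketch: choosing λ from cross-validation results, and evaluating the predictive score. The sketch assumes that cross-validated errors and their standard errors are already available for a grid of λ values, that the 1se rule picks the largest λ whose error is within one standard error of the minimum, and that the score is a plain weighted sum of the non-zero variables (any intercept is omitted because the text does not list one). The function names are hypothetical.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambdas, cv_mean and cv_se are parallel lists: candidate penalties, the
    mean cross-validated error at each, and the standard error of that mean.
    """
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    limit = cv_mean[best] + cv_se[best]
    # Largest penalty (sparsest model) whose error is still within one SE.
    lambda_1se = max(
        lam for lam, err in zip(lambdas, cv_mean) if err <= limit
    )
    return lambdas[best], lambda_1se


def predictive_score(coefficients, sample):
    """Weighted sum over the markers with non-zero coefficients.

    coefficients maps marker name to fitted coefficient; sample maps marker
    name to the measured value (single marker or ratio marker).
    """
    return sum(
        coef * sample[name]
        for name, coef in coefficients.items()
        if coef != 0.0
    )
```

Markers whose coefficient has been shrunk to zero are skipped in `predictive_score`, which mirrors the rule that only variables with non-zero coefficients become secondary biomarkers.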

(5) In step S5, the classification and prediction performance of the secondary biomarkers is evaluated with a random forest algorithm, and the specific process is as follows:

The sensitivity, specificity and accuracy of the secondary biomarkers are tested using a random forest (RF) model. RF is a hierarchical non-parametric modeling method that is robust to outliers and to correlations between compounds. RF outperforms some other multivariate classification algorithms for pattern recognition because it can detect non-linear relationships between compounds and outcomes. RF also gives stronger predictions and better error estimates because it is less susceptible to overfitting. In addition, RF incorporates randomness into its predictions through repeated bootstrap sampling and random variable selection when growing each individual decision tree. Marker validation is performed in RStudio: after the hyperparameters $m_{try}$ and ntrees have been optimized, the RF classification model is constructed. The sensitivity, specificity and accuracy of the secondary biomarkers are determined from receiver operating characteristic (ROC) curves.
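The performance measures named above can be computed from a classifier's outputs as sketched below. The sketch is independent of how the predictions are produced (the text uses a random forest in RStudio). Labels are assumed to be 1 for the experimental group and 0 for the control group, and the AUC is computed by pairwise comparison of scores, which is equivalent to the area under the ROC curve. The function names are hypothetical.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve: the fraction of (case, control) pairs in
    which the case receives the higher score, counting ties as one half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```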

Compared with the prior art, the invention has the beneficial effects that:

(1) the invention achieves accurate classification of different groups using only a small number of key biomarkers, with good reliability, high sensitivity and strong specificity;

(2) the method can greatly reduce the cost of classification;

(3) the method facilitates the development of a faster temperature-ramp program targeted at the key compounds, reducing the analysis time and the pretreatment time;

(4) the reduced set of key variables simplifies the originally complicated, and possibly even contradictory, chemical interpretation, and helps focus the discussion on important metabolic processes and mechanisms.

Drawings

Fig. 1 shows the screening process for breath VOCs biomarkers.

Fig. 2 shows the identification of endogenous and exogenous VOCs by the alveolar gradient: (a) control group; (b) experimental group.

Fig. 3 shows the primary biomarker screening: (a) univariate statistical method; (b) multivariate statistical method.

Fig. 4 shows the LLR model screening of secondary biomarkers: (a) binomial deviance; (b) regression coefficients.

Fig. 5 shows the ROC curve for the classification and predictive ability of the secondary biomarkers.

Detailed Description

To help those skilled in the art better understand the technical solutions of the present application, the individual steps are described more fully below and illustrated in further detail with an example. The described embodiment is only one of the embodiments of the present application, not all of them, and the present invention is not limited to the following embodiment.

Example 1

The method is suitable for screening the markers of the high-dimensional exhaled VOCs data, and the screening process is shown in figure 1.

(1) Acquiring high-dimensional expiratory VOCs data

Vaccination is one of the effective ways to prevent the spread of COVID-19. Inoculation with an inactivated vaccine can trigger immune reactions that may affect the body's metabolic processes, which in turn leads to changes in the exhaled VOCs. In this example, the vaccinated group was taken as the experimental group and the unvaccinated group as the control group. Fifty subjects who had not received the COVID-19 vaccine were recruited: 23 males and 27 females, aged 22-58 years (mean age 34 years), with a BMI of 16.87-32.49 kg/m² (mean BMI 23.28 kg/m²). Fifty-four subjects had received the second dose of the COVID-19 vaccine: 24 males and 30 females, aged 22-30 years (mean age 24 years), with a BMI of 16.42-29.58 kg/m² (mean BMI 21.09 kg/m²). A 3 L breath sample was taken from each subject and analyzed by TD-GC×GC-TOF MS. Hundreds of exhaled VOCs were detected, and the 100 compounds with an occurrence rate above 60% were selected as target compounds for further analysis, including 23 alkanes, 7 alkenes, 4 alcohols, 4 aldehydes, 14 ketones, 5 acids, 7 esters, 8 aromatic compounds, and other nitrogen-, oxygen- and sulfur-containing compounds (Table 1). Peak areas of the VOCs were calculated using ChromSpace software (SepSolve Analytical) and normalized with an internal standard, and missing values were filled with mean values.

Table 1. Qualitative analysis list of exhaled VOCs

(2) Identifying endogenous and exogenous VOCs by the alveolar gradient

The alveolar gradient allows a rough determination of whether a VOC originates in the alveoli or was inhaled from the environment. The alveolar gradient of each VOC is not fixed; it varies with the rates at which the compound is produced and consumed in the body. In this example, the alveolar gradients of the VOCs were calculated separately for the experimental group and the control group, and the endogenous probability $P_{\mathrm{Endo},i}$ was computed (Fig. 2). For 72 of the 100 target compounds, $P_{\mathrm{Endo},i}$ was greater than 50%, so these compounds are endogenous VOCs (the first 72 compounds in Table 1). The remaining 28 compounds are exogenous VOCs (the last 28 compounds in Table 1). Biomarkers induced by the vaccine were subsequently screened from these 72 VOCs. Few studies currently distinguish endogenous from exogenous compounds in breath, which may cause some exogenous compounds to be misinterpreted as potential biomarkers and produce erroneous results. In this case, using the alveolar gradient to separate endogenous from exogenous sources simply and quickly removes the effect of environmental background interference on the accuracy of the exhaled VOCs biomarkers.

(3) Screening primary biomarkers by univariate and multivariate statistical methods

The exhaled VOCs of the vaccinated subjects and the unvaccinated controls were essentially non-normally distributed, so a nonparametric test was used in the univariate analysis. To eliminate the effects of age, sex and similar factors, the differential analysis was performed on matched (paired) experimental and control groups. If no matched samples are available, the age, sex and other characteristics of the two groups must be kept as similar as possible, or a statistical method must be used to correct for their influence. The Wilcoxon signed-rank test showed that 43 endogenous VOCs had p-values below 0.05, indicating significant differences. Based on the fold change (Fig. 3a), 26 VOCs were down-regulated, 11 of them significantly (FC < 0.5), and 17 VOCs were up-regulated, 1 of them significantly (FC > 2).

The OPLS-DA results (Fig. 3b) show that the vaccine group and the control group form clearly distinct clusters; the $R^2$ intercept is 0.154 and the $Q^2$ intercept is -0.876. The model exhibits good applicability ($R^2Y = 0.832$, p < 0.01) and predictive power ($Q^2 = 0.756$, p < 0.01). The VIP values of 24 compounds exceed 1, and these largely include the differential compounds found by the univariate method, so the two methods agree well. Taking into account the univariate and multivariate statistics together with the occurrence rate of the compounds (>90%), 21 VOCs were finally selected as primary biomarkers (Table 2).

Table 2. Primary biomarker list of COVID-19 vaccine-related exhaled VOCs

(4) Screening for Secondary biomarkers

The correlations among the 21 single markers were analyzed in order to strengthen the relationships between markers. The Spearman rank correlation coefficient showed that 26 pairs of compounds had a correlation coefficient greater than 0.6, so these 26 pairs were converted into ratios to form ratio markers. The 26 ratio markers were treated as independent of the 21 single markers.

The marker combination was simplified with the LLR model. 70% of the experimental and control group data was used as the training set and the remaining 30% as the test set. The LLR model applied L1 regularization to the data with 5-fold cross-validation, with λ ranging from 0.000167 to 0.313. The number of variables in the model decreases steadily as λ increases, and the binomial deviance shows that the model error is small when λ lies between the $\lambda_{\min}$ and $\lambda_{1se}$ given by the model (Fig. 4a). Within this range, the classification and prediction abilities of the model are good. In this case, $\lambda_{1se}$ was selected to produce a more parsimonious biomarker combination comprising 12 biomarkers in total, of which 9 are single markers and 3 are ratio markers. These 12 markers are the secondary biomarkers (Fig. 4b). Further, a model prediction score calculation formula may also be obtained:

(5) Validation and evaluation of secondary biomarkers

Marker validation was performed in RStudio. For model optimization, 70% of the data from the two groups was used as the training set and the remaining 30% as the test set. First, the hyperparameter $m_{try}$ was optimized: with the default 500 decision trees and $m_{try}$ ranging from 1 to 10, the model error was smallest at $m_{try} = 3$. The number of decision trees, ntrees, was then optimized on this basis. The search range was set to 1-1000 trees (the upper limit can be raised if necessary), and the number of trees at which the out-of-bag error reached a stable low level was selected; an excessive number of decision trees can cause overfitting and increase the model error. This case finally used ntrees = 200, at which the out-of-bag error of the model was below 5%. The RF classification model was then rebuilt with the two optimized hyperparameters, again using 70% of the original data as the training set and the remaining 30% as the test set, and the random training and testing procedure was repeated in a loop 1000 times. The ROC results showed that the secondary biomarkers had a mean sensitivity of 98.33%, a mean specificity of 95.98%, a mean area under the curve (AUC) of 0.89 and an overall accuracy of 97.21% (Fig. 5). Compared with other studies of the same type, the screened markers show better sensitivity, specificity and accuracy. Thus, with the integrated machine-learning-based method for identifying exhaled VOCs markers, 12 biomarkers induced by the inactivated COVID-19 vaccine were successfully screened out.
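The repeated 70/30 evaluation loop can be sketched generically. This is only an outline of the resampling scheme, not the RStudio random forest workflow used in the example: the classifier is passed in as a pair of hypothetical `fit` and `predict` callables, the split is stratified by class (an assumption, since the text does not say how the 70% was drawn), and only accuracy is averaged.

```python
import random


def stratified_split(labels, train_fraction, rng):
    """Randomly split sample indices into train and test, class by class."""
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train += indices[:cut]
        test += indices[cut:]
    return train, test


def repeated_holdout(features, labels, fit, predict,
                     runs=1000, train_fraction=0.7, seed=0):
    """Mean test accuracy over repeated random train/test splits.

    fit(train_features, train_labels) returns a model object;
    predict(model, feature) returns a predicted label.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        train, test = stratified_split(labels, train_fraction, rng)
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(
            1 for i in test if predict(model, features[i]) == labels[i]
        )
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Any classifier can be plugged in through `fit` and `predict`; sensitivity, specificity and AUC could be averaged over the runs in the same way as accuracy.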

Claims

1. A method for identifying VOCs biomarkers in human expiration based on machine learning, characterized in that, for acquired high-dimensional exhaled VOCs data, a reduced set of breath VOCs biomarker parameters is found by means of statistics and machine learning, which represents the main information in the whole data set without losing important information and yields essentially the same classification and identification ability as the full set of exhaled VOCs parameters; the method comprises the following specific steps:

S1: acquiring and processing high-dimensional exhaled VOCs data; this comprises collecting a fixed amount of exhaled breath from people with a specific disease or pathological response and from healthy people to obtain high-dimensional exhaled VOCs data, and constructing an experimental group and a control group;

2. The method for identifying the VOCs biomarkers in the human breath based on machine learning of claim 1, wherein the step S1 of acquiring and processing the VOCs data in the high-dimensional breath comprises the following specific steps:

the volunteers are divided into an experimental group and a control group, and a fixed amount of breath sample is collected in an adsorption tube using a ReCIVA sampler; the sample is analyzed by thermal desorption two-dimensional gas chromatography time-of-flight mass spectrometry to obtain high-dimensional VOCs data; the mass spectral peak data are compared with the NIST spectral library (v.2.3) to confirm the chemical identity of each VOC; the data are corrected with an internal standard, the ratio of the peak area of each VOC to the peak area of the internal standard being taken as the relative content of that VOC; and outlier detection and missing-value imputation are carried out on the mass spectrometry data of all samples.

3. The method for identifying the biomarkers of the VOCs in the human expiration based on the machine learning of claim 2, wherein the step S2 of identifying the internal and external source attributes of the VOCs by using the alveolar gradient comprises the following specific steps:

in the breath sampling stage, an ambient air sample of the same volume as the breath sample is collected in an adsorption tube using a hand-held gas sampler; the ambient air sample is analyzed in the same way as the breath sample; the alveolar gradient AG of a given VOC is calculated by the following formula (1):

$$AG = \frac{A_{\mathrm{Sample},\mathrm{VOC}_i}}{A_{\mathrm{Sample},\mathrm{IS}}} - \frac{A_{\mathrm{Air},\mathrm{VOC}_i}}{A_{\mathrm{Air},\mathrm{IS}}} \tag{1}$$

in the formula, $A_{\mathrm{Sample},\mathrm{VOC}_i}$ is the peak area of a given VOC in the breath sample, $A_{\mathrm{Sample},\mathrm{IS}}$ is the peak area of the internal standard in the breath sample, $A_{\mathrm{Air},\mathrm{VOC}_i}$ is the peak area of the same VOC in the ambient air, and $A_{\mathrm{Air},\mathrm{IS}}$ is the peak area of the internal standard in the ambient air; if AG > 0, the VOC is considered endogenous, and if AG < 0, the VOC is considered exogenous; the rationale is that the content of an endogenous VOC in the breath is higher than its content in the ambient air; however, for some VOCs with very low content, if the enrichment effect is not pronounced, AG may be very close to 0, and the endogenous/exogenous assignment may then carry large errors and uncertainties; for this reason, the alveolar gradient of each VOC is calculated in every sample, and the probability $P_{\mathrm{Endo},i}$ that each VOC is endogenous across all subjects is computed:

$$P_{\mathrm{Endo},i} = \frac{N_{i,G>0}}{N} \times 100\%$$

in the formula, $N_{i,G>0}$ is the number of subjects in whom compound i has an alveolar gradient greater than 0, and $N$ is the total number of subjects;

4. The method for identifying VOCs biomarkers in human expiration based on machine learning according to claim 3, wherein the step S3 of screening the primary biomarkers by using a one-dimensional statistical method and a multi-dimensional statistical method comprises the following specific steps:

identifying primary biomarkers using a one-dimensional (univariate) statistical test; testing the normality of the data distribution of each variable by its kurtosis and skewness, and thereby determining whether a parametric test or a nonparametric test is used; if the kurtosis and skewness of the data distribution of a variable are both close to 0, the distribution is treated as normal, otherwise it is treated as non-normal; calculating significance p-values using the SPSS Statistics statistical analysis software, and calculating the fold change FC from the group means of each variable:

$$FC_i = \frac{\bar{x}_{i,\mathrm{case}}}{\bar{x}_{i,\mathrm{control}}}$$

in the formula, $\bar{x}_{i,\mathrm{case}}$ is the average value of the i-th VOC in the affected group, and $\bar{x}_{i,\mathrm{control}}$ is the average value of the i-th VOC in the control group;

drawing a volcano plot; screening primary biomarkers under the conditions p < 0.05 and either FC < 0.5 or FC > 2;

further mining potential primary biomarkers using a multi-dimensional (multivariate) statistical test; performing OPLS-DA in RStudio; after 200 random permutation tests, assessing the difference between the two groups using an OPLS-DA scatter plot; the values of R²(Y) and Q² are used to evaluate the applicability and predictive power of the model; ranking the variables by importance according to their VIP values, and adding variables with VIP > 1 and p < 0.05 to the primary biomarker list, thereby recovering primary biomarkers whose fold change does not satisfy FC < 0.5 or FC > 2; considering the results of the univariate and multivariate statistical methods together, adjusting the FC screening criterion and the VIP screening criterion, and determining suitable primary biomarkers; here, R²(Y) is the percentage of the information in the Y matrix that is explained by the OPLS-DA model; Q² is the predictive ability of the model; if both values are greater than 0.5, the classification performance of the model is good, and the closer they are to 1, the better the model fits the data; the VIP value is the variable weight of a variable in the OPLS-DA model, and measures the strength of the influence of each variable on, and its ability to explain, the classification of the samples in each group.
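The combined screening rule of this claim can be sketched in Python as follows. The sketch assumes that the p-values and VIP values have already been produced elsewhere (the claim names SPSS and OPLS-DA in RStudio for that); the function names, marker names and numbers are hypothetical.

```python
from statistics import mean


def fold_change(case_values, control_values):
    """Mean of the affected group divided by the mean of the control group."""
    return mean(case_values) / mean(control_values)


def is_primary_biomarker(p_value, fc, vip=None,
                         p_cut=0.05, fc_low=0.5, fc_high=2.0, vip_cut=1.0):
    """p < p_cut is always required; then either the FC rule or the VIP rule must hold."""
    if p_value >= p_cut:
        return False
    passes_fc = fc < fc_low or fc > fc_high
    passes_vip = vip is not None and vip > vip_cut
    return passes_fc or passes_vip


# Hypothetical relative contents, p-values and VIP values for three VOCs.
candidates = {
    "VOC_A": {"case": [4.0, 5.0, 6.0], "control": [1.0, 2.0, 3.0], "p": 0.01, "vip": 1.8},
    "VOC_B": {"case": [3.0, 3.0, 3.0], "control": [2.0, 2.0, 2.0], "p": 0.03, "vip": 1.2},
    "VOC_C": {"case": [1.0, 1.0, 1.0], "control": [4.0, 4.0, 4.0], "p": 0.20, "vip": 0.4},
}

selected = [
    name
    for name, d in candidates.items()
    if is_primary_biomarker(d["p"], fold_change(d["case"], d["control"]), d["vip"])
]
```

VOC_A passes the fold-change rule directly, VOC_B has FC = 1.5 but is recovered by its VIP value, and VOC_C is rejected because its p-value is not significant.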

5. The method for identifying VOCs biomarkers in human expiration based on machine learning according to claim 4, wherein step S4 of constructing the combined markers through correlation analysis and screening the secondary biomarkers by using a Lasso logistic regression model comprises the following specific steps:

reconstructing markers; different exhaled VOCs may be produced by the same enzyme or by a similar metabolic pathway, and the association between such variables is strengthened by taking their ratio; using the Pearson correlation coefficient for normally distributed data; for non-normally distributed data, using the SPSS statistical analysis software to obtain a list of correlation coefficients between variables; taking a correlation coefficient greater than or equal to 0.6 as the criterion, constructing ratios from each group of compounds that meets this condition, each reconstructed ratio becoming a new independent marker;
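The ratio reconstruction step can be illustrated with the following Python sketch. It only covers the Pearson case and assumes strictly positive relative contents so that the ratios are defined; the function names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def build_ratio_markers(data, threshold=0.6):
    """Create one ratio marker for every pair whose correlation is >= threshold."""
    markers = {}
    for a, b in combinations(data, 2):
        if pearson(data[a], data[b]) >= threshold:
            markers[f"{a}/{b}"] = [u / v for u, v in zip(data[a], data[b])]
    return markers


# Hypothetical relative contents of three VOCs in four samples.
data = {
    "VOC_A": [1.0, 2.0, 3.0, 4.0],
    "VOC_B": [2.0, 4.1, 5.9, 8.2],
    "VOC_C": [3.0, 1.0, 4.0, 2.0],
}

ratio_markers = build_ratio_markers(data)
```

Only VOC_A and VOC_B are strongly correlated in this toy data, so a single combined marker `VOC_A/VOC_B` is produced and would be passed on together with the single markers.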

simplifying the markers and extracting key features; applying L1 regularization to all single markers and combined markers using a Lasso logistic regression (LLR) model, which shrinks the regression coefficient of each variable in the model under maximum likelihood estimation; the number of variables in the model decreases as the tuning parameter λ increases, but the fitting error does not change monotonically with λ; according to the results given by the model, values of λ between λ_min and λ_1se are acceptable, and either λ_min or λ_1se may be selected according to the needs of the study; substituting the selected λ into the model to rebuild the LLR model, and determining the variables with nonzero regression coefficients as the secondary biomarkers, these markers contributing most to distinguishing the two groups; from these markers, a model prediction score PS can be obtained:
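The variable-selection effect of L1 regularization described in this claim (not the prediction score PS) can be illustrated with a small Python sketch. It is a plain proximal-gradient solver written for illustration, not the software used in the invention; the choice of λ by cross-validation (λ_min or λ_1se) is not shown, and the data, marker names and function names are hypothetical.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.1, iterations=2000):
    """Minimise mean logistic loss + lam * ||beta||_1 (intercept unpenalised)."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(iterations):
        residuals = [
            sigmoid(intercept + sum(b * v for b, v in zip(beta, row))) - target
            for row, target in zip(X, y)
        ]
        intercept -= step * sum(residuals) / n
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(residuals, X)) / n
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return intercept, beta


# Hypothetical data: the first marker is informative, the second is noise.
names = ["marker_informative", "marker_noise"]
X = [[-2.0, 0.3], [-1.5, -0.3], [-1.0, -0.3], [-0.5, 0.3],
     [0.5, 0.3], [1.0, -0.3], [1.5, -0.3], [2.0, 0.3]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

_, coefs_small = fit_lasso_logistic(X, y, lam=0.1)
_, coefs_large = fit_lasso_logistic(X, y, lam=0.6)
selected = [name for name, c in zip(names, coefs_small) if c != 0.0]
```

With the smaller λ only the informative marker keeps a nonzero coefficient, and with the larger λ every coefficient is shrunk to exactly zero, which mirrors the statement that the number of retained variables falls as λ grows.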

6. The method for identifying VOCs biomarkers in human expiration based on machine learning according to claim 5, wherein step S5 of evaluating the classification and prediction performance of the secondary biomarkers by using a random forest algorithm comprises the following specific steps:

testing the sensitivity, specificity and accuracy of the secondary biomarkers using an RF model; performing marker validation in RStudio, and constructing the RF classification model after optimizing the hyperparameters m_try and ntrees; determining the sensitivity, specificity and accuracy of the secondary biomarkers from receiver operating characteristic (ROC) curves.
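The three performance figures named in this claim can be computed from predicted and true labels as in the Python sketch below. The random forest itself is not reproduced here; the predictions are hypothetical stand-ins for the output of a fitted classifier, with 1 denoting the experimental group and 0 the control group.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical true labels and classifier predictions for ten subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
```

In this example three of four positives and five of six negatives are classified correctly, giving a sensitivity of 0.75, a specificity of about 0.83 and an accuracy of 0.8.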