CN107480441B - Modeling method and system for children septic shock prognosis prediction

Info

Publication number: CN107480441B
Application number: CN201710661510.7A
Authority: CN
Inventors: 方芳
Original assignee: Individual
Current assignee: Individual
Priority date: 2017-08-04
Filing date: 2017-08-04
Publication date: 2021-02-09
Anticipated expiration: 2037-08-04
Also published as: CN107480441A

Abstract

The invention discloses a support vector machine-based modeling method and system for predicting the prognosis of septic shock in children. Features are screened from high-throughput gene expression data on the prognosis of childhood septic shock, and the screened features are modeled with a Support Vector Machine (SVM) algorithm, so that the prognosis of childhood septic shock can be predicted accurately and clinical prognosis prediction is supplemented and supported at the molecular level.

Description

Modeling method and system for children septic shock prognosis prediction

Technical Field

The invention belongs to the field of bioinformatics, and relates to a modeling method and a system for children septic shock prognosis prediction based on a support vector machine.

Background

Sepsis is an inflammatory disorder with high mortality, and childhood sepsis is an important cause of death in children worldwide. Septic shock is the most severe form of sepsis, so it is important to develop a prognostic prediction technique for septic shock in children. At present, researchers mainly use biomarker-based decision tree models to predict the prognosis of septic shock in children. However, decision tree algorithms are prone to over-fitting and ignore correlations among the attributes in a data set, so they handle such machine learning problems poorly and the generalization error rate increases considerably.

Biomarker data mining and computer simulation are critical to the development of efficient prediction technologies. They are well suited to processing large-scale, noisy data of potential value and have become powerful tools in many research fields. Data mining and computer simulation studies of complex diseases were initially based on the interrelationships between variables, using logistic regression and network visualization techniques. In recent years, the advent of various high-throughput technologies has generated large volumes of data, and the use of complex-systems methods has increased accordingly. A biomarker-based Support Vector Machine (SVM) machine learning algorithm can integrate high-dimensional, large-scale data, generalizes well, can handle machine learning problems involving small sample sizes, high dimensionality and nonlinearity, and can reduce the generalization error rate. However, no SVM model for the prognosis of septic shock in children based on expression profile data has been established so far.

Disclosure of Invention

In view of the above problems, the invention provides a support vector machine-based modeling method and system for predicting the prognosis of septic shock in children. Features are screened from high-throughput gene expression data on the prognosis of children septic shock, and the screened features are modeled with a Support Vector Machine (SVM) algorithm, thereby realizing accurate prediction of children septic shock prognosis and providing molecular-level supplement and support for clinical prediction of children septic shock.

In a first aspect, the invention provides a modeling method for children septic shock prognosis prediction based on a support vector machine, which comprises the following steps:

(1) collecting high-throughput data of child septic shock gene expression in a GEO (Gene Expression Omnibus) data source;

(2) sequentially preprocessing and summarizing the high-throughput data to obtain preprocessed data;

(3) screening genes which are abnormally expressed in a death group relative to a survival group from the preprocessed data to obtain an abnormally expressed gene data set with poor prognosis of the child septic shock;

(4) carrying out format conversion on an abnormal expression gene data set with poor prognosis of the children septic shock to form a training biomarker data set;

(5) carrying out feature screening on the training biomarker data set, and selecting a set with the least features which enable the prediction accuracy to reach the highest, namely a feature set for model construction;

(6) constructing a children septic shock prognosis prediction model with a Support Vector Machine (SVM) algorithm, using the feature set from step (5) and the training biomarker data set, by means of the kernlab package in the R program.

The GEO (Gene Expression Omnibus) data source is a public repository that archives and freely distributes high-throughput gene expression data submitted by researchers. It stores about 10 billion individual gene expression measurements from over 100 organisms, and its website is www.ncbi.nih.gov/geo.

The basic principle of the Support Vector Machine (SVM) algorithm is as follows:

given a training sample set:

$$(x_i, y_i),\quad i = 1, 2, \dots, N,$$

where $x_i \in R^d$, $d$ is the dimension of the input space, $y_i \in \{-1, 1\}$ is the class label, and $N$ is the number of training samples. The general form of the linear discriminant function in the $d$-dimensional space is then:

$$f(x) = w \cdot x + b,$$

the equation for the classification plane is:

$$w \cdot x + b = 0,$$

wherein the coefficient $w$ represents the weight vector and $b$ is the threshold.

Finding the optimal classification plane requires that the classification plane classify all samples correctly, i.e. that both classes of samples satisfy the constraint:

$$y_i (w \cdot x_i + b) \ge 1,\quad i = 1, 2, \dots, N,$$

at the same time, in order to maximize the generalization ability, it is desirable to maximize the classification margin $2/\|w\|$, which is equivalent to minimizing:

$$\frac{1}{2}\|w\|^2.$$

In the present invention, it is preferable to use a support vector machine algorithm for linearly non-separable data, which uses a kernel function $K(x_i, x_j)$ to map the low-dimensional vectors into a higher-dimensional space so that the optimal classification plane can be found there. Some samples may still be inseparable after the mapping, so slack variables $\xi_i$ ($\xi_i \ge 0$, $i = 1, 2, \dots, N$) can be introduced, and the relaxed classification plane constraint is:

$$y_i (w \cdot x_i + b) - 1 + \xi_i \ge 0,\quad i = 1, 2, \dots, N.$$

To balance generalization ability against classification error, a penalty term is introduced into the objective $\frac{1}{2}\|w\|^2$:

$$C \sum_{i=1}^{N} \xi_i,$$

and the objective function is converted into:

$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i,$$

wherein $C$ is the error penalty factor, representing the degree of penalty for misclassified sample points. Introducing the Lagrange function then gives the corresponding optimal classification function:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i^{*} y_i K(x_i, x) + b^{*} \right),$$

where $\alpha_i^{*}$ are the optimal Lagrange multipliers and $b^{*}$ is the corresponding threshold.
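By way of non-limiting illustration, the soft-margin objective described above can be evaluated for a given linear classifier as in the following Python sketch. The function name `soft_margin_objective` and the toy samples are hypothetical and are not part of the invention; the sketch only computes the slack variables and the penalized objective, and does not solve the optimization problem.

```python
def soft_margin_objective(w, b, samples, labels, C):
    """Return (objective, slacks) for a linear soft-margin SVM.

    objective = 0.5 * ||w||^2 + C * sum(xi_i), where
    xi_i = max(0, 1 - y_i * (w . x_i + b)) is the smallest slack that
    satisfies the relaxed constraint y_i * (w . x_i + b) - 1 + xi_i >= 0.
    """
    slacks = []
    for x, y in zip(samples, labels):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    norm_sq = sum(wj * wj for wj in w)
    return 0.5 * norm_sq + C * sum(slacks), slacks


# Toy data (hypothetical): the third sample lies on the wrong side of the plane.
samples = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, -1]
objective, slacks = soft_margin_objective((1.0, 0.0), 0.0, samples, labels, C=10)
```

A larger C makes each unit of slack more expensive, so the optimizer tolerates fewer misclassified training samples at the cost of a narrower margin.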

preferably, the step (1) further comprises the steps of screening the high-throughput data, and downloading and extracting screening results;

preferably, the screening is to exclude animal sample data, adult sample data, under-sized sample data and incomplete information data to obtain sample data of septic shock in the child.

Preferably, the animal sample data is animal population sample data other than human;

preferably, the adult sample data is sample data with an age range above 18 years;

preferably, the under-sized sample data are sample data in which the total number of samples is less than 30;

preferably, the incomplete information data is sample data that does not include both the alive group and the dead group.
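The exclusion criteria above may be sketched, purely as an illustration, as a filter over data set metadata. The record fields (`organism`, `max_age_years`, `n_samples`, `n_survivors`, `n_nonsurvivors`) and the candidate records are hypothetical; GEO does not necessarily expose metadata in this form.

```python
def keep_dataset(record):
    """Apply the four exclusion criteria to one data set description."""
    return (
        record["organism"] == "Homo sapiens"   # exclude animal sample data
        and record["max_age_years"] <= 18      # exclude adult sample data
        and record["n_samples"] >= 30          # exclude under-sized sample data
        and record["n_survivors"] > 0          # both outcome groups must be present
        and record["n_nonsurvivors"] > 0
    )


# Hypothetical candidate data sets, one per exclusion reason plus one that passes.
candidates = [
    {"id": "DS-A", "organism": "Homo sapiens", "max_age_years": 10,
     "n_samples": 98, "n_survivors": 80, "n_nonsurvivors": 18},
    {"id": "DS-B", "organism": "Mus musculus", "max_age_years": 1,
     "n_samples": 60, "n_survivors": 40, "n_nonsurvivors": 20},
    {"id": "DS-C", "organism": "Homo sapiens", "max_age_years": 65,
     "n_samples": 120, "n_survivors": 90, "n_nonsurvivors": 30},
    {"id": "DS-D", "organism": "Homo sapiens", "max_age_years": 12,
     "n_samples": 12, "n_survivors": 9, "n_nonsurvivors": 3},
    {"id": "DS-E", "organism": "Homo sapiens", "max_age_years": 9,
     "n_samples": 45, "n_survivors": 45, "n_nonsurvivors": 0},
]
kept = [record["id"] for record in candidates if keep_dataset(record)]
```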

Preferably, the step (1) of collecting high throughput data on septic shock gene expression is:

in the GEO database, the keywords "sepsis" and/or "septic shock" are used for searching to obtain high-throughput data of septic shock gene expression.

Preferably, the background correction in step (2) is performed by using RMA (Robust Multi-chip Average) function in the R program;

preferably, the normalization process is performed using a quantile method;

preferably, Median polish (Median smoothing) is used for data summarization.

Preferably, the screening in the step (3) is performed by using a limma program package in the R program;

preferably, the determination criteria for the abnormally expressed gene in step (3) are:

the absolute value of the logarithm of the fold difference between the expression levels of the death group and the survival group is greater than or equal to 0.8, for example 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4 or 1.5, or any specific value between or above these values; and the corrected P value is less than 0.05, for example 0.04, 0.03, 0.02, 0.01 or 0.005, or any specific value between or below these values. For the sake of brevity, the specific values included in these ranges are not listed exhaustively.

Preferably, the format conversion in step (4) converts, by means of a Perl program, the abnormally expressed gene data set with poor prognosis of children septic shock into a data format that meets the requirements of the R program for feature selection, for example the format shown in Table 2, or any other format that the R program can recognize.

Preferably, the characteristic screening in the step (5) is as follows:

an R program is used to construct feature ranking coefficients, and the feature with the smallest ranking coefficient is removed in each iteration, finally giving a descending ranking of all features; the smallest set of features that achieves the highest prediction accuracy is selected as the feature set for model construction.

Preferably, the children septic shock prognosis prediction model in step (6) is constructed using a support vector machine algorithm with a Gaussian kernel function;

wherein, the formula of the Gaussian kernel function is as follows:

preferably, in step (6) the support vector machine algorithm is run on the data subset of the training biomarker data set that belongs to the feature set of step (5), training gives the parameter σ of the Gaussian kernel function and the error penalty factor C of the support vector machine, and the children septic shock prognosis prediction model is then constructed;

preferably, the parameter σ of the Gaussian kernel function is 0.05-0.5, for example 0.06, 0.07, 0.08, 0.09, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4 or 0.45, or any specific value in between, which for the sake of brevity are not listed exhaustively; preferably 0.08-0.3, and more preferably 0.11-0.13;

preferably, the error penalty factor C of the support vector machine is 8-15, for example 8, 9, 10, 11, 12, 13, 14 or 15, or any specific value in between, which for the sake of brevity are not listed exhaustively; preferably 9-13, and more preferably 10-11. Within the above optimal parameter ranges, the AUC value of the external test of the model can reach 0.722.

In particular, σ and the error penalty factor C are adjusted by the training data set to optimize the prediction of the trained children septic shock prognosis model, so that the values of the two important parameters are varied within a certain range.
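For illustration only, the role of the parameter σ can be shown with a small Python sketch of a Gaussian kernel. The sketch assumes the parameterization documented for the `rbfdot` kernel of the kernlab package, k(x, x') = exp(-σ‖x - x'‖²); whether the σ of the present invention follows this convention or the alternative form exp(-‖x - x'‖²/(2σ²)) is an assumption that the text above does not settle. The function name `rbf_kernel` is hypothetical.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian kernel in the assumed kernlab-style form exp(-sigma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


# With sigma = 0.11 (the value reported in Example 1), identical points have
# similarity 1, and similarity decays as the points move apart.
same = rbf_kernel((1.0, 2.0), (1.0, 2.0), 0.11)
near = rbf_kernel((0.0, 0.0), (1.0, 0.0), 0.11)
far = rbf_kernel((0.0, 0.0), (5.0, 0.0), 0.11)
```

Under this convention a larger σ makes the similarity fall off faster with distance, so the decision boundary follows the training samples more closely.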

In a second aspect, the present invention provides a modeling system for children septic shock prognosis prediction based on a support vector machine, comprising:

(1) a data collection module: used for collecting high-throughput data of childhood septic shock gene expression within the GEO data source;

(2) a data preprocessing module: connected to the data collection module and used for preprocessing and summarizing the high-throughput data to obtain preprocessed data;

(3) a screening module: connected to the data preprocessing module and used for screening, from the preprocessed data, genes that are abnormally expressed in the death group relative to the survival group, to obtain an abnormally expressed gene data set with poor children septic shock prognosis;

(4) a data conversion module: connected to the screening module and used for carrying out format conversion on the abnormally expressed gene data set with poor prognosis of children septic shock to form a training biomarker data set;

(5) a feature screening module: connected to the data conversion module and used for carrying out feature screening on the training biomarker data set to select the smallest set of features that achieves the highest prediction accuracy, i.e. the feature set used for model construction;

(6) a model building module: connected to the feature screening module and used for constructing a children septic shock prognosis prediction model with a support vector machine algorithm, using the feature set and the training biomarker data set, by means of the kernlab package in the R program.

Compared with the prior art, the invention has at least the following beneficial effects:

the modeling method for the children septic shock prognosis prediction based on the support vector machine, provided by the invention, is used for carrying out feature screening according to high-throughput data of the children septic shock prognosis gene expression, and modeling a plurality of screened features by adopting a Support Vector Machine (SVM) algorithm, so that accurate prediction of the children septic shock prognosis is realized, and the supplementation and support of the molecular level are provided for the clinical prognosis prediction of the children septic shock.

Drawings

FIG. 1 is a process diagram of the modeling method for the children septic shock prognosis prediction based on the support vector machine of the invention.

FIG. 2 is a graph showing the results of feature screening in example 1.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It is further noted that, for the sake of convenience in this description, the drawings show only some of the results relevant to the present invention and not all of them.

Example 1

The embodiment provides a technical solution of a modeling method for children septic shock prognosis prediction based on a support vector machine, the modeling method provided by the embodiment can be executed by a modeling device, the device is integrated in a computer device, and the method specifically includes the following steps, and the flow is shown in fig. 1:

(1) selecting a data source: a GEO (Gene Expression Omnibus) database is selected as a data source.

(2) Searching the data source: the GEO data source is searched with the keywords "sepsis" and "septic shock", and septic shock gene expression high-throughput data are collected.

(3) Screening the search results: the original search results are further screened to exclude adult samples older than 18 years, under-sized data sets with fewer than 30 samples, and incomplete data sets that do not provide information for the complete sample population (i.e., that do not include both the survival group and the death group).

(4) Downloading and extracting the screening results (the data are at the probe level) and inputting the data into the R program.

(5) Preprocessing the extracted data: the preprocessing is performed using the RMA (Robust Multi-chip Average) function in the R program; specifically, background correction uses the RMA method, normalization uses the quantile method, and summarization uses median polish (median smoothing).

The algorithm for normalization by the quantile method (Quantile Normalization) is mainly divided into three steps:

a) sorting the data points of each chip;

b) calculating the average value across all chips at each rank position, and replacing the expression value at that position with the average;

c) restoring each gene's value to that gene's original position on the chip.
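The three steps above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions (every chip has the same number of genes and tied values are not given special treatment); it is not the RMA implementation of the R program, and the function name `quantile_normalize` is hypothetical.

```python
def quantile_normalize(chips):
    """Quantile-normalize a list of chips, each a list of per-gene values."""
    n_genes = len(chips[0])
    # a) sort each chip, remembering which gene each sorted value came from
    orders = [sorted(range(n_genes), key=lambda g, c=chip: c[g]) for chip in chips]
    # b) average the values that share the same rank across all chips
    rank_means = [
        sum(chip[order[rank]] for chip, order in zip(chips, orders)) / len(chips)
        for rank in range(n_genes)
    ]
    # c) write each rank mean back to the gene's original position
    normalized = []
    for order in orders:
        values = [0.0] * n_genes
        for rank, gene in enumerate(order):
            values[gene] = rank_means[rank]
        normalized.append(values)
    return normalized


# Two toy chips with three genes each (hypothetical values).
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization every chip contains the same set of values (here 1.5, 3.5 and 5.5), arranged according to the original within-chip ranking of each gene.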

(6) Converting the processed probe level data into gene level data, and specifically comprising the following steps:

d) mapping the probe-level data to genes according to the probe-to-gene annotation file of the chip technology platform on which the original expression profile data were generated;

e) deleting the data rows of probes that correspond to more than one gene and of probes that correspond to no gene;

f) where a plurality of probes correspond to the same gene, taking their average value as the expression level of the gene.
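As a non-limiting sketch of steps d) to f), the following Python function collapses probe-level values to gene-level values. The data structures (a dictionary of per-sample values for each probe, and a dictionary mapping each probe to a list of gene symbols) and all identifiers are hypothetical stand-ins for the platform annotation file.

```python
from collections import defaultdict


def probes_to_genes(probe_values, probe_map):
    """Collapse probe-level rows into gene-level rows.

    probe_values: {probe_id: [value per sample]}
    probe_map:    {probe_id: [gene symbols the probe corresponds to]}
    """
    rows_per_gene = defaultdict(list)
    for probe, values in probe_values.items():
        genes = probe_map.get(probe, [])
        # e) skip probes with no gene and probes that correspond to several genes
        if len(genes) != 1:
            continue
        rows_per_gene[genes[0]].append(values)
    # f) average, sample by sample, all probes that correspond to the same gene
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in rows_per_gene.items()
    }


probe_values = {
    "p1": [1.0, 3.0], "p2": [3.0, 5.0],   # two probes for the same gene
    "p3": [7.0, 7.0],                     # corresponds to two genes: dropped
    "p4": [9.0, 9.0],                     # no annotation: dropped
    "p5": [4.0, 6.0],
}
probe_map = {"p1": ["GENE_A"], "p2": ["GENE_A"],
             "p3": ["GENE_B", "GENE_C"], "p5": ["GENE_D"]}
gene_level = probes_to_genes(probe_values, probe_map)
```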

(7) Screening of differentially expressed genes: by utilizing the limma program package of the R program, the genes of which the logarithmic absolute value of the expression quantity difference multiple of the death group and the survival group is more than or equal to 0.8 and the P value after the False Discovery Rate (FDR) correction is less than 0.05 are judged as the abnormally expressed genes, and an abnormally expressed gene data set with poor children septic shock prognosis is obtained after summarizing, wherein the data is shown in Table 1. It is specifically noted that since there are a total of twenty thousand genes, not all of them are listed, and only 5 genes that are abnormally expressed are exemplified here.

TABLE 1 Examples of abnormally expressed gene data

Gene	Logarithm of fold difference in expression level	False Discovery Rate (FDR) corrected P value
Gene 1	1.717146732	0.029147425
Gene 2	1.358191894	0.035863019
Gene 3	1.283283163	0.002649534
Gene 4	-0.84291548	0.015277801
Gene 5	-0.837903188	0.022307329
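The judgment criterion of step (7) can be illustrated with the following Python sketch, which applies the two thresholds to the values of Table 1. It does not reproduce the limma analysis itself; the function name and the last two rows (marked hypothetical) are added only to show genes that would be rejected.

```python
LOG_FOLD_CHANGE_MIN = 0.8   # threshold on |log fold difference|
ADJUSTED_P_MAX = 0.05       # threshold on the FDR-corrected P value


def is_abnormally_expressed(log_fold_change, adjusted_p):
    """Return True when both criteria of step (7) are met."""
    return abs(log_fold_change) >= LOG_FOLD_CHANGE_MIN and adjusted_p < ADJUSTED_P_MAX


rows = [
    ("Gene 1", 1.717146732, 0.029147425),
    ("Gene 2", 1.358191894, 0.035863019),
    ("Gene 3", 1.283283163, 0.002649534),
    ("Gene 4", -0.84291548, 0.015277801),
    ("Gene 5", -0.837903188, 0.022307329),
    ("Hypothetical X", 0.5, 0.01),   # fold difference too small
    ("Hypothetical Y", 1.2, 0.2),    # corrected P value too large
]
selected = [name for name, lfc, p in rows if is_abnormally_expressed(lfc, p)]
```

Because the absolute value is used, both up-regulated genes (Genes 1 to 3) and down-regulated genes (Genes 4 and 5) in the death group are retained.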

(8) Format conversion into the training biomarker data set: a Perl program is used to convert the abnormally expressed gene data set with poor prognosis of children septic shock into the data format required by the R program for feature selection (shown in Table 2); the converted data set is the training biomarker data set.

Table 2 data format example

(9) Feature screening: feature screening is performed on the training biomarker data set using an R program. Feature ranking coefficients (i.e., feature importance ranking coefficients) are constructed, and the feature with the smallest ranking coefficient is removed in each iteration, finally giving a descending ranking of all features; the smallest set of features that achieves the highest prediction accuracy is selected as the feature set for model construction. The feature selection process of this embodiment uses Recursive Feature Elimination (RFE), whose main idea is to build a model repeatedly, pick out and exclude the worst feature, and then repeat the process on the remaining features; the order in which the features are eliminated gives the feature ranking. RFE is therefore an algorithm for finding an optimal feature subset. When RFE is used for feature selection, all N features are first included in the model, and the model's performance and the feature importance ranking are calculated; the N-1 most important features are then retained, the model is rebuilt and its performance recalculated, and the iteration is repeated in this way to find a suitable feature subset. In this embodiment, a random forest algorithm is used for model construction, performance evaluation and feature importance ranking in each iteration.

The screening results are shown in FIG. 2. The highest accuracy can be reached with as few as 11 genes, so these 11 genes (their Entrez gene IDs and gene names are shown in Table 3) are selected as the feature set for the subsequent model construction.
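The recursive elimination loop described in step (9) may be sketched as follows. The sketch is deliberately generic: `importance_fn` and `accuracy_fn` are hypothetical callbacks standing in for the random forest importance ranking and performance evaluation of this embodiment, and the toy importances and accuracies are invented solely to exercise the loop.

```python
def recursive_feature_elimination(features, importance_fn, accuracy_fn):
    """Remove the least important feature per iteration and track accuracy.

    Returns (ranking, best_subset, best_accuracy), where ranking lists the
    features from most to least important and best_subset is the smallest
    subset that reaches the highest accuracy.
    """
    remaining = list(features)
    history = []      # (subset, accuracy) for every iteration
    eliminated = []   # features in the order they were removed
    while remaining:
        history.append((tuple(remaining), accuracy_fn(remaining)))
        scores = importance_fn(remaining)
        worst = min(remaining, key=lambda feature: scores[feature])
        remaining.remove(worst)
        eliminated.append(worst)
    ranking = eliminated[::-1]
    best_accuracy = max(accuracy for _, accuracy in history)
    best_subset = min(
        (subset for subset, accuracy in history if accuracy == best_accuracy),
        key=len,
    )
    return ranking, best_subset, best_accuracy


# Toy stand-ins (hypothetical): fixed importances, accuracy depends on subset size.
toy_importance = {"g1": 0.9, "g2": 0.5, "g3": 0.1, "g4": 0.05}
toy_accuracy = {4: 0.80, 3: 0.85, 2: 0.85, 1: 0.70}
ranking, best_subset, best_accuracy = recursive_feature_elimination(
    ["g1", "g2", "g3", "g4"],
    lambda subset: {f: toy_importance[f] for f in subset},
    lambda subset: toy_accuracy[len(subset)],
)
```

When several subset sizes tie for the highest accuracy, the sketch keeps the smallest one, which mirrors the selection of the smallest feature set that achieves the highest prediction accuracy.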

TABLE 3

Entrez ID number of Gene	Name of Gene
54541	DDIT4
5553	PRG2
10875	FGL2
55701	ARHGEF40
5168	ENPP2
100133941	CD24
84419	C15orf48
5657	PRTN3
2867	FFAR2
401233	HTATSF1P2
7045	TGFBI

(10) Constructing an early warning model: on the data subset of the training biomarker data set that belongs to the 11-gene feature set, the kernlab package of the R program is used to run the Support Vector Machine (SVM) machine learning algorithm; training gives a Gaussian kernel function parameter σ of 0.11 and an error penalty factor C of 10, and the children septic shock prognosis prediction model is then constructed.

In the invention, an external verification method is adopted to respectively select an error penalty factor C and a parameter sigma of a Gaussian kernel function, and the specific steps are as follows:

5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50 and 100 are respectively selected as test values of the error penalty factor C, 0.01, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.2, 0.3, 0.4, 0.5 and 1 are respectively selected as test values of the parameter sigma, and model construction is respectively performed according to the combination of the test values to obtain a series of test models.

Then, an independent sample data set with known outcomes is selected to verify the performance of each constructed test model: a Receiver Operating Characteristic (ROC) curve is drawn, the prediction effect of the test model is checked using the Area Under the Curve (AUC) as the evaluation parameter, and the error penalty factor C and the parameter σ of the test model with the largest AUC value are selected.
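The parameter selection described above amounts to an exhaustive search over the listed test values, which may be sketched as follows. The callback `external_auc` is a hypothetical placeholder for "train a test model with (C, σ) and return its AUC on the independent data set"; the `placeholder_auc` function used to exercise the search is an arbitrary formula, not a measured result.

```python
from itertools import product

# Test values listed in this embodiment.
C_GRID = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100]
SIGMA_GRID = [0.01, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13,
              0.14, 0.15, 0.2, 0.3, 0.4, 0.5, 1]


def select_parameters(external_auc):
    """Return the (C, sigma) pair whose test model has the largest AUC."""
    return max(product(C_GRID, SIGMA_GRID), key=lambda pair: external_auc(*pair))


def placeholder_auc(C, sigma):
    """Arbitrary stand-in score that peaks at C = 10, sigma = 0.11."""
    return 1.0 - 0.01 * abs(C - 10) - abs(sigma - 0.11)


best_C, best_sigma = select_parameters(placeholder_auc)
n_models = len(C_GRID) * len(SIGMA_GRID)
```

With 16 test values for C and 17 for σ, 272 test models are built and compared in total.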

Example 2 10-fold cross-validation

10-fold cross-validation randomly shuffles the training biomarker data set and divides it into 10 parts; 1 part is used as test data and the other 9 parts as training data. After 10 rounds (i.e., each part is used as test data once), the test results of the rounds are collected and averaged to give the final estimate of model performance.
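As a hedged illustration of the procedure just described, the following Python sketch shuffles the samples, splits them into 10 parts and averages the per-round error. The callback `train_and_test` is a hypothetical placeholder for fitting the model on the training parts and returning its error on the held-out part; the `placeholder_model` below only records which samples were tested.

```python
import random


def k_fold_error(samples, train_and_test, k=10, seed=0):
    """Return the mean error over k rounds of cross-validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)          # random rearrangement
    folds = [shuffled[i::k] for i in range(k)]     # k roughly equal parts
    errors = []
    for i, test_fold in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        errors.append(train_and_test(training, test_fold))
    return sum(errors) / k


tested = []


def placeholder_model(training, test_fold):
    """Stand-in for model fitting: checks the split and reports zero error."""
    assert not set(training) & set(test_fold)      # no leakage between parts
    tested.extend(test_fold)
    return 0.0


mean_error = k_fold_error(list(range(35)), placeholder_model)
```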

The resulting cross-validation error of the model is 0.128571, and over repeated runs the error is stable between 0.10 and 0.15; the error of the method is thus close to that of other models, and the method has a certain reliability.

In addition to cross-validation, which is an internal test, an external test is also performed: other independent sample data sets are selected to verify the performance of the model of the invention. The specific steps are as follows:

Firstly, the independent sample data set used for verification (the external test data set for short) is predicted with the constructed model to obtain the model-predicted prognosis for the test data set.

Secondly, after the model prediction result is obtained, the prediction result is compared with the actual prognosis condition (namely the clinical sample result) to obtain a confusion matrix between the prediction result and the actual prognosis condition, and the format example of the confusion matrix is shown in a table 4. Where a is the number of actual positive samples predicted as positive samples, b is the number of actual positive samples predicted as negative samples, c is the number of actual negative samples predicted as positive samples, and d is the number of actual negative samples predicted as negative samples.

TABLE 4

	Predicted positive	Predicted negative
Actual positive	a	b
Actual negative	c	d
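The four counts a, b, c and d defined above can be obtained from paired actual and predicted labels as in the following sketch. The label encoding (1 for a positive sample, 0 for a negative sample) and the toy labels are assumptions made for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (a, b, c, d) as defined for the confusion matrix."""
    a = b = c = d = 0
    for truth, guess in zip(actual, predicted):
        if truth == 1 and guess == 1:
            a += 1    # actual positive predicted as positive
        elif truth == 1 and guess == 0:
            b += 1    # actual positive predicted as negative
        elif truth == 0 and guess == 1:
            c += 1    # actual negative predicted as positive
        else:
            d += 1    # actual negative predicted as negative
    return a, b, c, d


# Toy labels (hypothetical): 3 actual positives, 4 actual negatives.
actual = [1, 1, 1, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0]
a, b, c, d = confusion_counts(actual, predicted)
sensitivity = a / (a + b)   # fraction of actual positives found
specificity = d / (c + d)   # fraction of actual negatives found
```

Each point on an ROC curve corresponds to one such matrix: sensitivity is plotted against 1 - specificity as the decision threshold of the model is varied.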

Thirdly, drawing a Receiver Operating Characteristic (ROC) Curve according to the confusion matrix, and testing the prediction effect of the model according to an evaluation parameter, namely Area under the Curve (AUC).

AUC (Area Under the Curve) is the area under the Receiver Operating Characteristic (ROC) curve, typically between 0.5 and 1. The AUC value is a probability: when a positive sample and a negative sample are randomly selected from all samples, the probability that the constructed classification model ranks the positive sample ahead of the negative sample is the AUC value. The larger the AUC value, the more likely the model is to rank positive samples ahead of negative samples, and the better it classifies. The AUC can therefore be used as a single number that intuitively evaluates the prediction performance of the model; the larger the value, the better.

Thus, if the samples are classified completely at random, the AUC should be close to 0.5. AUC <0.5 does not correspond to the real case and rarely occurs in practice. For AUC >0.5, the closer the AUC is to 1, the better the predictive performance of the model. The AUC value of the external test is 0.722, which lies between 0.7 and 0.8; compared with the performance indexes of prediction models for other diseases, the performance of the method is close to or better than that of those models, and the model constructed by the method has acceptable ability to predict the prognosis of septic shock in children.
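The probabilistic reading of the AUC given above translates directly into a pairwise computation, sketched below. The function name `auc_by_ranking`, the convention of counting tied scores as half a win, and the toy scores are assumptions made for illustration; the sketch requires at least one positive and one negative sample.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0   # ties count as half
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy model scores (hypothetical): higher score means "more likely positive".
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
auc = auc_by_ranking(scores, labels)
```

In this toy case five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6; a model that ordered every pair correctly would score 1.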

In conclusion, the support vector machine-based modeling method for the children septic shock prognosis prediction carries out feature screening according to high-throughput data of the children septic shock prognosis gene expression, models a plurality of screened features by adopting a Support Vector Machine (SVM) algorithm, realizes accurate prediction of the children septic shock prognosis, and provides supplement and support of molecular level for clinical prognosis prediction of the children septic shock.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A modeling system for children septic shock prognosis prediction based on a support vector machine is characterized by comprising:

(1) a data collection module: used for collecting high-throughput data of septic shock gene expression within the GEO data source;

(4) a data conversion module: connected to the screening module and used for carrying out format conversion on the abnormally expressed gene data set with poor prognosis of children septic shock to form a training biomarker data set, wherein the format conversion converts, by means of a Perl program, the abnormally expressed gene data set with poor prognosis of children septic shock into a data format that meets the requirements of an R program for feature selection;

(5) a feature screening module: feature ranking coefficients are constructed using an R program, the feature with the smallest ranking coefficient is removed from the training biomarker data set in each iteration, a descending ranking of all features is finally obtained, and the smallest set of features that achieves the highest prediction accuracy for children septic shock prognosis is selected as the feature set for model construction;

(6) a model building module: connected to the feature screening module and used for constructing a children septic shock prognosis prediction model with a support vector machine algorithm, using the feature set and the training biomarker data set, by means of the kernlab package in an R program, wherein the support vector machine algorithm is run on the data subset of the training biomarker data set that belongs to the feature set, training gives a parameter σ of a Gaussian kernel function and an error penalty factor C of the support vector machine, and the children septic shock prognosis prediction model is then constructed; the parameter σ of the Gaussian kernel function is 0.05-0.5, and the error penalty factor C of the support vector machine is 8-15.

2. The system of claim 1, wherein the parameter σ of the gaussian kernel function is 0.08-0.3, and the error penalty factor C of the support vector machine is 9-13.

3. The system of claim 2, wherein the parameter σ of the gaussian kernel function is 0.11-0.13, and the error penalty factor C of the support vector machine is 10-11.