CN113241122A

CN113241122A - Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network

Info

Publication number: CN113241122A
Application number: CN202110650665.7A
Authority: CN
Inventors: 秦喜文; 王芮; 李绍松; 谭佳伟; 徐定鑫; 崔薛腾; 张斯琪
Original assignee: Changchun University of Technology
Current assignee: Changchun University of Technology
Priority date: 2021-06-11
Filing date: 2021-06-11
Publication date: 2021-08-10

Abstract

The method for selecting and classifying the gene data variable fused with the self-adaptive elastic network and the deep neural network is characterized by comprising the steps of selecting the variable of complex gene data and classifying the complex data; the variable selection module performs weighted estimation on the punishment item of the adaptive elastic network by considering the interdependency between genes and combining coefficient compression and mutual information theory from the internal association structure of the complex gene data, and establishes a data-driven model-free assumed adaptive variable selection method; the classification module of the complex data optimizes the structural parameters of the deep neural network by using a wolf optimization method, and the generalization capability of the model is improved.

Description

Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network

Technical Field

The invention relates to the technical field of biological big data analysis and variable selection, in particular to a gene data variable selection and classification method integrating an adaptive elastic network and a deep neural network.

Background

In the field of bioinformatics, it is an important technique to predict clinical outcomes using gene datasets with a large number of variables. In such data sets, the sample size is often very small compared to the number of predictors (genes), thus leading to an n < p problem. In addition, the complex and unknown correlation structure between predictors presents great difficulties for classification results and variable selection. Therefore, a new set of statistical or data mining methods suitable for the characteristics of high-dimensional small samples is required for classifying gene data, and the dimensionality of the data is reduced while the high precision is maintained.

In the prior art, a regularization method is an important dimension reduction method for solving high-dimensional small sample data, and can reduce dimensions of gene data while training a model. Exemplary methods include L1 norm based Lasso, adaptive Lasso, L2 norm based ridge regression. The penalty functions of L1 norm and L2 norm in the typical method can not meet the requirements of unbiasedness, sparsity, continuity and the like at the same time, and the traditional SCAD method does not take the interaction of genes and genes into consideration, only considers the relation with diseases from the perspective of the genes, thereby reducing the effectiveness of using the SCAD method for gene selection and cancer classification. Second, typical regularization methods also include elastic nets and adaptive elastic nets based on the L1 and L2 norms. However, when applying adaptive elastic nets to high dimensional gene expression data, some important genes may be erroneously weighted less in the initial estimation due to lower accuracy requirements. Therefore, these important genes are easily deleted from the model by mistake, and the informative gene selection prediction accuracy of microarray DNA data is low. In addition, if the pairwise correlation between variables is not high, the performance of the adaptive elastic network may not be good.

Meanwhile, the deep learning model has been proven to be a powerful classification tool, but its application in bioinformatics is limited due to the n < p problem. This is because cell populations and clinical subject populations exhibit great heterogeneity, differing in data variables from laboratory to laboratory, resulting in a limited number of samples of gene expression datasets compared to a large number of variables. On the other hand, in the fields of image classification and the like, deep learning usually requires a large number of training samples, and the contradiction hinders the application of the deep learning technology in the field of bioinformatics. Based on these facts, there is a need to use gene expression data of n < p to improve deep learning models suitable for classification of disease outcome. Compared with a common deep neural network classifier, constructing a variable selector on the basis of the deep neural network classifier is a natural choice for the following reasons: (1) the detector detects the effective variable in a supervision mode, namely, the accurate variable representation is generated by utilizing the information of the training result; (2) the input of the deep neural network has a smaller dimension than the original set of variables; because the deep neural network is a multilayer neural network and comprises more than two hidden layers, the professionality of the training model is improved by increasing more layers and the number of neurons in each layer. However, if the network structure is too complex, the generalization capability of the model may be reduced, and a method for determining the structural parameters of the deep neural network model is needed to improve the generalization capability. Therefore, a gene data variable selection and classification method with fusion of the adaptive elastic network and the deep neural network is developed for n < p data.

Disclosure of Invention

In order to solve the problems, the invention provides a gene data variable selection and classification method integrating an adaptive elastic network and a deep neural network, which comprises the steps of firstly, carrying out variable selection based on the adaptive elastic network method, and classifying based on the deep neural network on the basis; the method specifically comprises the following steps:

step 1, selecting variables based on a maximum correlation minimum common redundancy self-adaptive elastic network method, comprising the following substeps:

step 1.1 utilizes mutual information to measure common redundancy, and embodies the internal association and driving characteristics between gene expression data:

cancer may occur anywhere in the human body, and it is noted that initial treatment of cancer is much easier than late stage, and analysis based on genetic data has become an effective method for early stage cancer identification, where the number of samples is much smaller than the number of genes due to limited number of clinical subjects and heterogeneity between test populations, and the first step is to identify a small fraction of genes that are the main causes of disease occurrence, destroy inappropriate and ineffective genes, and improve understanding of classification models in the collection of gene expression data;

in the mathematical definition, assume X_iIs a candidate variable, Y is a response variable, X_jE S is a selected variable, S is a selected variable subset, and candidate variables X are defined_iMutual information with response variable Y is related item, candidate variable X_iWith the selected variable X_jThe mutual information between them is called redundancy item; the goal of any variable selection problem is to select related items, exclude unrelated items, and regarding redundant items, it can be regarded as useful variables with dependency, for example, errors are made when measuring related variables, so that the working performance of the predictor is poor, but if the predictor selects redundant items of related variables, the errors can be corrected, so that the predictor can select some redundant variables to improve the robustness of prediction;

therefore, in the variable selection process of gene data, the genes are regarded as independent variables, the state labels (diseased/non-diseased) of the subjects are regarded as response variables, and the purpose is to select related genes which act on the label variables in the variable set, exclude the unrelated genes and select redundant genes;

when selecting the variable of gene data, candidate gene X is considered_iThe amount of information with the response variable Y, taking into account the genes X in the selected subset S of genes_jThe degree of information inclusion between, retaining related genes, selecting redundant genes, and excluding unrelated genes, therefore, for one gene X_je.S (S is the selected subset of genes), X_jAnd a candidate gene X_iThe redundant information between can be mutual information I (X) between the two_j,X_i) The calculation method is shown in formula (5), gene X_iThe corresponding redundant information rates are as follows:

RI(X_i,X_j) Is gene X_i,X_jRedundant information rate of, I (X)_j,X_i) Is X_i,X_jRedundant information between, I (X)_iY) is X_iInter-correlation of Y, I (X)_jY) is X_jThe correlation between Y;

multiplied by min { I (X)_i；Y),I(X_j(ii) a Y) }, common mutual information CI (X) is introduced_i,X_jDefinition of Y):

X_i，X_jand Y can measure the amount of common information among these genes, T ═ X for one gene data set₁,X₂,...,X_pThe variable selection process identifies a subset of T, denoted S, that extends the common mutual information CI (X)_i,X_jY) to CI (X)_iS, Y) and are defined as common redundancy as follows:

I(X_i(ii) a S) is a gene X_iAnd mutual information of the selected subset S;

step 1.2 construct the maximum correlation minimum common redundancy gene ordering method by using the common redundancy information:

for gene expression data, each gene is used as a carrier, elements in the carrier represent expression values of the genes in different conditions or samples, the maximum correlation minimum common redundancy method avoids underestimation of redundancy items among the genes, and achieves the purposes of selecting related genes, excluding unrelated genes and controlling redundant genes, and the global normalization of target (response) variables is considered, wherein the expression is as follows:

f(X_i)＝I(X_i,Y)-CI(X_i,S,Y) (4)

wherein:

p (x, y) is the joint distribution, p (x) and p (y) are the marginal distributions;

equation (6) as an extension of equation (3), using maximum common mutual information

To measure candidate gene X_iRedundancy with respect to Y for the selected set of basis factors S; wherein X_iRepresenting a gene variable, Y a response variable, S a selected subset of genes, I (X)_iY) mutual information between the genes and the response variables, CI (X)_iS, Y) represents a candidate gene X_iRedundancy with respect to Y, x, of the selected set of base factors S_j∈S；

Step 1.3 the maximum correlation minimum common redundancy method constructs gene importance:

let the gene expression data be an n × p matrix, where n is the number of observations and p is the number of genes, and the importance of the kth (k ∈ p) genes is given by:

S_k＝f(X_k)＝I(X_k,Y)-CI(X_k,S,Y) (7)

weight coefficient of kth gene:

wherein, eta is more than 0 and less than or equal to 1 is a given threshold value, when S is_kWhen the pressure is greater than eta, the pressure is higher than eta,the k gene has obvious meaning when S_kWhen ≦ η, the prediction of the interpretive variable by the kth gene is not significant, and the weight matrix is represented as:

W＝diag(w₁,...,w_p) (9)

step 1.4 construction of variable selection model:

the classification problem of the gene expression data can be abstractly expressed as learning a judgment rule from a training set, assigning a class label to a new sample, and for the gene expression data, n and p respectively represent the size of the sample and the number of genes; let Y be ═ Y₁,y₂,...y_n) ' is a response variable, X ═ X₁,X₂,...,X_p)，X_i＝(x₁,x₂,...x_n) ' is a model matrix, let x_j＝(x_1j,x_2j,...x_nj) According to a general linear regression model, we obtain:

wherein θ ═ θ₁,θ₂,...θ_p) ' is the estimated coefficient vector;

using a weight matrix containing the maximum-associated minimum common redundancy of a single gene, the following penalties for the adaptive elastic network are proposed:

the adaptive elastic network (AEN-MRMCR) model of the maximum correlation minimum common redundancy method therefore:

α∈[0,1]λ > 0 is a regularization parameter, w_jThe adaptive data driving weight is obtained, y is a response variable value, and theta is a coefficient vector of an estimated value; AEN-MRMCR estimator

Is the minimum of the above formula:

the adaptive elastic network penalizes the square error loss by adopting a method combining L2 penalty and an adaptive L1 penalty, compared with the adaptive elastic network, the model provided by the invention adopts an adaptive weight based on maximum correlation minimum common redundancy to replace ridge regression, and the adaptive elastic network method with the maximum correlation minimum redundancy can achieve the effects of selecting related genes, controlling redundant genes and excluding unrelated genes in the automatic gene variable selection process, thereby having obvious biological significance;

step 2, selecting the structural parameters of the deep neural network based on the wolf optimization algorithm, and comprising the following substeps:

step 2.1: deep neural network parameter optimization based on the gray wolf algorithm:

the deep neural network is a multilayer neural network and comprises more than two hidden layers, the speciality of a training model is improved by increasing more layers and the number of neurons in each layer, but if the network structure is too complex, the generalization capability of the model can be reduced, so that a method is needed for determining the structural parameters of the deep neural network model to improve the generalization capability of the deep neural network model, and therefore, the structural parameters of the deep neural network are optimized by using a wolf optimization algorithm;

a Grey Wolf optimization algorithm (GWOlf Optimizer, GWOO) algorithm simulates a Grey Wolf population level mechanism and hunting behaviors in nature, and simulates social levels through 4 types of wolfs (alpha, beta, delta and omega); the hunting behavior of the wolf is simulated through processes of wolf colony tracking, surrounding, chasing, attacking the prey and the like, the aim of optimizing search is achieved, the prey needs to be surrounded when the wolf is hunted, and the mathematical description of the surrounding behavior is as follows:

D＝|C·X_p(t)-X(t)| (15)

X(t+1)＝X_p(t)-A·D (16)

wherein t is the current iteration number; a and C are coefficient variables; x_pIs the position vector of the prey; x is the location vector of the gray wolf; d is the distance between the gray wolf and the prey in t iterations; x (t +1) is the position vector of the gray wolf in t +1 iterations; the vectors a, C are calculated as follows:

A＝2a·r₁-a (17)

C＝2·r₂ (18)

where a is a convergence factor whose components decrease linearly from 2 to 0 r in an iterative process₁,r₂Is [0,1 ]]A random vector of (a);

in an abstract search space, the precise location of the optimal solution (game) is unknown to the wolf, and in order to simulate the hunting behavior of the wolf, it is assumed that α (optimal candidate solution), β and δ have information about the potential location of the game, so that in each iteration, 3 optimal solutions obtained so far are saved, forcing other wolfs to adopt the following formula to update the location according to the optimal search location:

D_α＝|C₁·X_α-X| (20)

D_β＝|C₂·X_β-X| (21)

D_δ＝|C₃·X_δ-X| (22)

X₁＝X_α-A₁·D_α (23)

X₂＝X_β-A₂·D_β (24)

X₃＝X_δ-A₃·D_δ (25)

where A is₁,A₂,A₃,C₁,C₂,C₃As a co-operative coefficient variable, X is the location vector of the gray wolf, D_α,D_β,D_δThe distances of the gray wolf relative to three wolfs of alpha, beta and delta, X₁,X₂,X₃The position vectors of the gray wolf relative to three wolfs of alpha, beta and delta are respectively; x (t +1) is the position vector of the wolf when the iteration is carried out for t +1 times, so the deep neural network parameter optimization step based on the wolf algorithm comprises the following steps:

the first step is as follows: initializing a wolf population, wherein each position consists of a hidden layer number l and a hidden node number n;

the second step is that: learning a training sample, and taking the mean square error of the prediction result of the deep neural network as an individual fitness function of the wolf algorithm;

the third step: calculating a of the gray wolf algorithm according to formula (19), updating A and C according to formulas (17-18);

the fourth step: updating the position of the single wolf according to formula (26);

the fifth step: if the maximum iteration times is reached, returning the best single wolf position, otherwise, repeating the steps from three to five;

the key to find the global optimal solution in the gray wolf optimization algorithm is to determine a fitness function, the fitness function of the algorithm is calculated GWO by using the training mean square error of the deep neural network, and GWO optimization is linked with the deep neural network;

step 2.2: the deep neural network training error calculation steps are as follows:

the first step is as follows: initializing a DNN parameter set theta consisting of weights and deviations;

the second step is that: if it is firstthe fitness of t-generation gray Tailang is f (l)_t,n_t) The number of hidden layers and hidden nodes can be expressed as l_t,n_t；

The third step: v. of₀Inputting a sample vector, q is the iteration number of DNN, and e is the training mean square error of DBN;

the fourth step: randomly iterating the training set in batches according to q times;

the fifth step: fine tuning theta by using a BP algorithm;

and a sixth step: calculating a predicted value by using theta to obtain a training error e;

thus, the GWO algorithm is associated with the DNN through a fitness function, which may reflect the quality of the DNN structural parameters, thereby generating appropriate predictors.

The invention has the beneficial effects that: the method is based on the internal correlation structure of the complex gene data, considers the interdependency among genes, combines coefficient compression and mutual information theory, provides a new gene data variable selection and classification method fusing an adaptive elastic network and a deep neural network, establishes a data-driven model-free assumed adaptive variable selection method, fully considers the redundant information of the genes, carries out weighted estimation on punishment items of the adaptive elastic network, excludes irrelevant genes, controls redundant genes, reduces the complexity of model training, and provides a new thought for the variable selection of the complex nonlinear gene data. Meanwhile, structural parameters of the deep neural network are optimized by using a wolf optimization method, and the generalization capability of the model is improved. The method is used for gene data variable selection and classification, greatly saves medical examination and decision time, and provides great support for saving the life of a patient.

Description of the drawings:

FIG. 1 is a diagram of a maximum correlation minimum common redundancy framework.

Fig. 2 is a flow chart of a maximum correlation minimum common redundancy method.

Fig. 3 is a flow chart of a method for adapting an elastic network based on maximum correlation and minimum common redundancy.

FIG. 4 is a flow chart for optimizing deep neural network structure parameters using a gray wolf optimization algorithm.

The specific implementation scheme is as follows:

the present invention will be further described with reference to the following drawings and examples, including but not limited to the following examples.

The gene data can be regarded as a matrix with the ordinate of the tested individual and the abscissa of the gene expression, and the number in the matrix represents the expression quantity of the gene of a certain tested individual and is generally expressed by real number. The gene data variable selection and classification method based on the fusion of the adaptive elastic network and the deep neural network basically comprises the following implementation processes:

1. selecting variables based on a maximum correlation minimum common redundancy self-adaptive elastic network method:

when the variable selection is carried out on the gene expression data, firstly, the maximum and minimum standardization processing is carried out on the data to solve the influence of the dimension on the result, and the expression is as follows:

then, selecting variables for the normalized data, wherein in step 1, the adaptive elastic network method based on the maximum correlation and minimum common redundancy comprises the following sub-steps: 1.1, measuring common redundant information by utilizing mutual information to embody the internal association and driving characteristics between gene expression data; 1.2, constructing a maximum correlation minimum common redundancy gene ordering method by utilizing common redundancy information, and reflecting the importance of variable data. 1.3 the maximum correlation minimum common redundancy method constructs gene importance and gives a weight matrix of data variables. 1.4 constructing a variable selection model by using a maximum correlation minimum common redundancy method and an adaptive elastic network.

In step 1.1, the genes are considered as independent variables and the subject status signature (diseased/non-diseased) is considered as a response variable, with the aim of selecting the relevant genes among the independent variables that contribute to the signature variable, excluding the irrelevant genes, and selecting the redundant genes, as shown in FIG. 1. First of all utilizeMutual information method for measuring common redundant information and defining gene X_iThe corresponding redundant information rates are as follows:

since some variable selection methods at the present stage only consider the relationship between the variable and the response variable gene, it is rare to consider the next variable to be selected and the variable X in the selected subset S of variables when selecting the variables_jThe mutual inclusion degree between the two variables can not keep useful variables to the maximum extent, and simultaneously, the function of redundant variables is considered, and irrelevant variables are eliminated. Thus, for one gene X_je.S (S is a selected subset of genes), X_jAnd a candidate gene X_iThe redundant information between can be mutual information I (X) between the two_j,X_i) And (6) measuring.

Multiplying the redundant information rate by min { I (X)_i；Y),I(X_j(ii) a Y) }, common mutual information CI (X) is introduced_i,X_jThe concept of Y):

X_i，X_jand Y can measure the amount of mutual information among these genes. For a gene set T ═ X₁,X₂,...,X_pThe gene selection process identifies a subset of T, denoted as S. Extending common mutual information CI (X)_i,X_jY) to CI (X)_iS, Y) and are defined as common redundancy as follows:

the common redundancy information is used in step 1.2 to construct a maximum correlation minimum common redundancy gene ordering method, for gene expression data, each gene acts as a vector, with the elements representing their expression values in different conditions or samples. The maximum correlation minimum common redundancy method avoids underestimation of redundant items among genes, achieves the purposes of selecting related genes, excluding unrelated genes and controlling redundant genes, and considers the global normalization of target (response) genes. As shown in fig. 2, the importance of each gene in the gene expression data was calculated, and its expression was as follows:

f(X_i)＝I(X_i,Y)-CI(X_i,S,Y) (4)

wherein:

p (x, y) is the joint distribution, and p (x) and p (y) are the marginal distributions.

When the selected data set S is an empty set, the gene whose mutual information value is the largest at this time is selected as the selected gene, and the mutual information of the selected gene is its importance value and is put into the selected data set S. When the selected data set S is not an empty set, the gene importance value is calculated according to equation (4).

To measure candidate gene X_iRedundancy with respect to Y for the selected set of basis factors S; wherein X_iRepresenting candidate genes, Y representing a response variable, S representing a selected subset of genes, I (X)_iY) mutual information between the genes and the response variables, CI (X)_iS, Y) represents a candidate gene X_iRedundancy with respect to Y, x, of the selected set of base factors S_j∈S。

In step 1.3, the gene importance is constructed using the maximum correlation minimum common redundancy method, and the importance of the kth gene is given by:

S_k＝f(X_k) (7)

define the weight coefficient for the kth gene:

wherein, eta which is more than 0 and less than or equal to 1 is a given threshold value. When S is_kWhen eta, the kth gene has obvious meaning when S_kWhen the k-th gene is less than or equal to eta, the prediction of the explanation variable gene is not obvious. We represent the weight matrix as:

W＝diag(w₁,...,w_p) (9)

in the polynomial sparse group lasso model, the calculation and significance of the weight value are not given, the weight value of the adaptive lasso adopts initial consistency estimation, and the weight value of the adaptive elastic network adopts initial elastic network estimation. The weights given by the above methods, although having clear statistical significance, can be used comprehensively to evaluate the importance of genes, but cannot account for obvious biological significance. The adaptive gene selection strategy presented herein is of biological interest.

In step 1.4, as shown in fig. 3, the prediction accuracy of gene selection can be improved to some extent by performing weighted estimation on the L1 and L2 penalty terms of the elastic net in consideration of the information correlation between different genes in the data set.

The problem of classification of gene expression data can be abstractly expressed as learning a discriminant rule from a training set and assigning a class label to a new sample. For gene expression data, n and p represent the sample size and number of genes, respectively. Let Y be ═ Y₁,y₂,...y_n) ' is a response variable, X ═ X₁,X₂,...,X_p)，X_i＝(x₁,x₂,...x_n) ' is a model matrix. Let x_j＝(x_1j,x_2j,...x_nj)'. According to a general linear regression model, we obtain:

wherein θ ═ θ₁,θ₂,...θ_p) ' is the estimated coefficient vector.

By using a weight matrix containing the mutual information of the conditions of the single gene, the following punishment items of the self-adaptive elastic network are proposed:

an adaptive elastic net (AEN-MRMCR) model with maximum correlation minimum common redundancy method is proposed:

α∈[0,1]λ > 0 is a regularization parameter, w_jIs the adaptive data driven weight, y is the value of the response variable, and theta is the coefficient vector of the estimate. AEN-MRMCR estimator

Is the minimum of the above formula.

2. Selecting structural parameters of the deep neural network based on a wolf optimization algorithm:

because gene expression data has the characteristics of high dimension and small samples, after the data is subjected to variable selection by using the step 1, the data needs to be classified by using a predictor so as to assist clinical diagnosis. As shown in fig. 4, the present invention provides a classification method for sirius optimized deep neural network structure parameters, which is implemented as follows:

in step 2, the selecting of the structural parameters of the deep neural network based on the wolf optimization algorithm comprises the following substeps: 2.1, carrying out deep neural network parameter optimization based on the gray wolf algorithm, initializing parameters, and constructing a fitness function of the gray wolf optimization. 2.1, deep neural network training error, namely connecting the gray wolf optimization method with the deep neural network through an error function.

In step 2.1: the deep neural network parameter optimization method based on the wolf algorithm comprises the following steps:

the first step is as follows: a gray wolf population is initialized. Each position consists of a hidden layer number l and a hidden node number n;

the third step: calculating a of the gray wolf algorithm, and updating A and C;

the fourth step: updating the position of the single wolf according to the A and the C;

the fifth step: if the termination condition is reached, returning to the optimal personal position, otherwise, repeating the steps from three to five;

the key to find the global optimal solution in the grayish optimization algorithm is to determine a fitness function, and the fitness function of the algorithm is calculated GWO by using the training mean square error of the deep neural network;

in step 2.2: the deep neural network training error calculation steps are as follows:

the second step is that: if the fitness of the t-th generation of gray Tailang is f (l)_t,n_t) The number of hidden layers and hidden nodes can be expressed as l_t,n_t；

the fifth step: fine tuning theta by using a BP algorithm;

thus, the GWO algorithm is associated with the DNN by a fitness function. The fitness value may reflect the quality of the DNN structural parameters, thereby generating a suitable predictor.

Claims

1. In order to solve the problem that the sample size of gene data is far smaller than that of characteristics and the deep learning method is limited in bioinformatics, the invention provides a gene data variable selection and classification method fusing an adaptive elastic network and a deep neural network, wherein variable selection is carried out on the basis of the adaptive elastic network method, and on the basis, classification is carried out on the basis of the deep neural network; the method specifically comprises the following steps:

in the mathematical definition, assume X_iIs a candidate variable, Y is a response variable, X_jE S is a selected variable, S is a selected variable subset, and candidate variables X are defined_iMutual information with response variable Y is related item, candidate variable X_iWith the selected variable X_jThe mutual information between them is called redundancy item; the goal of any variable selection problem is to select dependent terms, exclude non-dependent terms, and for redundant terms, consider that there is a dependencyIf some errors are made in measuring the relevant variables, the performance of the predictor is poor, but if some redundancy items of the relevant variables are selected by the predictor, the errors can be corrected, so that the predictor can select some redundancy variables to improve the robustness of the prediction;

X_i,X_jand Y can measure the amount of common information among these genes, T ═ X for one gene data set₁,X₂,...,X_pThe variable selection process identifies a subset of T, denoted S, that extends the common mutual information CI (X)_i,X_jY) to CI (X)_iS, Y) and are defined as common redundancy as follows:

I(X_i(ii) a S) is a gene X_iAnd mutual information of the selected subset S;

f(X_i)＝I(X_i,Y)-CI(X_i,S,Y) (4)

wherein:

S_k＝f(X_k)＝I(X_k,Y)-CI(X_k,S,Y) (7)

weight coefficient of kth gene:

wherein, eta is more than 0 and less than or equal to 1 is a given threshold value, when S is_kWhen eta, the kth gene has obvious meaning when S_kWhen ≦ η, the prediction of the interpretive variable by the kth gene is not significant, and the weight matrix is represented as:

W＝diag(w₁,...,w_p) (9)

step 1.4 construction of variable selection model:

the classification problem of the gene expression data can be abstractly expressed as learning a judgment rule from a training set, assigning a class label to a new sample, and for the gene expression data, n and p respectively represent the size of the sample and the number of genes; let Y be ═ Y₁,y₂,...y_n) ' is a response variable, X ═ X₁,X₂,...,X_p)，X_i＝(x₁,x₂,...x_n) ' is a model matrix, let x_j＝(x_1j,x_2j,...x_nj) ', obtaining the target value according to a general linear regression modelTo: