CN113241122A - Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network - Google Patents

Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network Download PDF

Info

Publication number
CN113241122A
CN113241122A CN202110650665.7A CN202110650665A CN113241122A CN 113241122 A CN113241122 A CN 113241122A CN 202110650665 A CN202110650665 A CN 202110650665A CN 113241122 A CN113241122 A CN 113241122A
Authority
CN
China
Prior art keywords
gene
genes
variable
wolf
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110650665.7A
Other languages
Chinese (zh)
Inventor
秦喜文
王芮
李绍松
谭佳伟
徐定鑫
崔薛腾
张斯琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Technology
Original Assignee
Changchun University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Technology filed Critical Changchun University of Technology
Priority to CN202110650665.7A priority Critical patent/CN113241122A/en
Publication of CN113241122A publication Critical patent/CN113241122A/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The method for selecting and classifying the gene data variable fused with the self-adaptive elastic network and the deep neural network is characterized by comprising the steps of selecting the variable of complex gene data and classifying the complex data; the variable selection module performs weighted estimation on the punishment item of the adaptive elastic network by considering the interdependency between genes and combining coefficient compression and mutual information theory from the internal association structure of the complex gene data, and establishes a data-driven model-free assumed adaptive variable selection method; the classification module of the complex data optimizes the structural parameters of the deep neural network by using a wolf optimization method, and the generalization capability of the model is improved.

Description

Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network
Technical Field
The invention relates to the technical field of biological big data analysis and variable selection, in particular to a gene data variable selection and classification method integrating an adaptive elastic network and a deep neural network.
Background
In the field of bioinformatics, it is an important technique to predict clinical outcomes using gene datasets with a large number of variables. In such data sets, the sample size is often very small compared to the number of predictors (genes), thus leading to an n < p problem. In addition, the complex and unknown correlation structure between predictors presents great difficulties for classification results and variable selection. Therefore, a new set of statistical or data mining methods suitable for the characteristics of high-dimensional small samples is required for classifying gene data, and the dimensionality of the data is reduced while the high precision is maintained.
In the prior art, a regularization method is an important dimension reduction method for solving high-dimensional small sample data, and can reduce dimensions of gene data while training a model. Exemplary methods include L1 norm based Lasso, adaptive Lasso, L2 norm based ridge regression. The penalty functions of L1 norm and L2 norm in the typical method can not meet the requirements of unbiasedness, sparsity, continuity and the like at the same time, and the traditional SCAD method does not take the interaction of genes and genes into consideration, only considers the relation with diseases from the perspective of the genes, thereby reducing the effectiveness of using the SCAD method for gene selection and cancer classification. Second, typical regularization methods also include elastic nets and adaptive elastic nets based on the L1 and L2 norms. However, when applying adaptive elastic nets to high dimensional gene expression data, some important genes may be erroneously weighted less in the initial estimation due to lower accuracy requirements. Therefore, these important genes are easily deleted from the model by mistake, and the informative gene selection prediction accuracy of microarray DNA data is low. In addition, if the pairwise correlation between variables is not high, the performance of the adaptive elastic network may not be good.
Meanwhile, the deep learning model has been proven to be a powerful classification tool, but its application in bioinformatics is limited due to the n < p problem. This is because cell populations and clinical subject populations exhibit great heterogeneity, differing in data variables from laboratory to laboratory, resulting in a limited number of samples of gene expression datasets compared to a large number of variables. On the other hand, in the fields of image classification and the like, deep learning usually requires a large number of training samples, and the contradiction hinders the application of the deep learning technology in the field of bioinformatics. Based on these facts, there is a need to use gene expression data of n < p to improve deep learning models suitable for classification of disease outcome. Compared with a common deep neural network classifier, constructing a variable selector on the basis of the deep neural network classifier is a natural choice for the following reasons: (1) the detector detects the effective variable in a supervision mode, namely, the accurate variable representation is generated by utilizing the information of the training result; (2) the input of the deep neural network has a smaller dimension than the original set of variables; because the deep neural network is a multilayer neural network and comprises more than two hidden layers, the professionality of the training model is improved by increasing more layers and the number of neurons in each layer. However, if the network structure is too complex, the generalization capability of the model may be reduced, and a method for determining the structural parameters of the deep neural network model is needed to improve the generalization capability. Therefore, a gene data variable selection and classification method with fusion of the adaptive elastic network and the deep neural network is developed for n < p data.
Disclosure of Invention
In order to solve the problems, the invention provides a gene data variable selection and classification method integrating an adaptive elastic network and a deep neural network, which comprises the steps of firstly, carrying out variable selection based on the adaptive elastic network method, and classifying based on the deep neural network on the basis; the method specifically comprises the following steps:
step 1, selecting variables based on a maximum correlation minimum common redundancy self-adaptive elastic network method, comprising the following substeps:
step 1.1 utilizes mutual information to measure common redundancy, and embodies the internal association and driving characteristics between gene expression data:
cancer may occur anywhere in the human body, and it is noted that initial treatment of cancer is much easier than late stage, and analysis based on genetic data has become an effective method for early stage cancer identification, where the number of samples is much smaller than the number of genes due to limited number of clinical subjects and heterogeneity between test populations, and the first step is to identify a small fraction of genes that are the main causes of disease occurrence, destroy inappropriate and ineffective genes, and improve understanding of classification models in the collection of gene expression data;
in the mathematical definition, assume XiIs a candidate variable, Y is a response variable, XjE S is a selected variable, S is a selected variable subset, and candidate variables X are definediMutual information with response variable Y is related item, candidate variable XiWith the selected variable XjThe mutual information between them is called redundancy item; the goal of any variable selection problem is to select related items, exclude unrelated items, and regarding redundant items, it can be regarded as useful variables with dependency, for example, errors are made when measuring related variables, so that the working performance of the predictor is poor, but if the predictor selects redundant items of related variables, the errors can be corrected, so that the predictor can select some redundant variables to improve the robustness of prediction;
therefore, in the variable selection process of gene data, the genes are regarded as independent variables, the state labels (diseased/non-diseased) of the subjects are regarded as response variables, and the purpose is to select related genes which act on the label variables in the variable set, exclude the unrelated genes and select redundant genes;
when selecting the variable of gene data, candidate gene X is considerediThe amount of information with the response variable Y, taking into account the genes X in the selected subset S of genesjThe degree of information inclusion between, retaining related genes, selecting redundant genes, and excluding unrelated genes, therefore, for one gene Xje.S (S is the selected subset of genes), XjAnd a candidate gene XiThe redundant information between can be mutual information I (X) between the twoj,Xi) The calculation method is shown in formula (5), gene XiThe corresponding redundant information rates are as follows:
Figure BDA0003111498980000021
RI(Xi,Xj) Is gene Xi,XjRedundant information rate of, I (X)j,Xi) Is Xi,XjRedundant information between, I (X)iY) is XiInter-correlation of Y, I (X)jY) is XjThe correlation between Y;
multiplied by min { I (X)i;Y),I(Xj(ii) a Y) }, common mutual information CI (X) is introducedi,XjDefinition of Y):
Figure BDA0003111498980000031
Xi,Xjand Y can measure the amount of common information among these genes, T ═ X for one gene data set1,X2,...,XpThe variable selection process identifies a subset of T, denoted S, that extends the common mutual information CI (X)i,XjY) to CI (X)iS, Y) and are defined as common redundancy as follows:
Figure BDA0003111498980000032
I(Xi(ii) a S) is a gene XiAnd mutual information of the selected subset S;
step 1.2 construct the maximum correlation minimum common redundancy gene ordering method by using the common redundancy information:
for gene expression data, each gene is used as a carrier, elements in the carrier represent expression values of the genes in different conditions or samples, the maximum correlation minimum common redundancy method avoids underestimation of redundancy items among the genes, and achieves the purposes of selecting related genes, excluding unrelated genes and controlling redundant genes, and the global normalization of target (response) variables is considered, wherein the expression is as follows:
f(Xi)=I(Xi,Y)-CI(Xi,S,Y) (4)
wherein:
Figure BDA0003111498980000033
p (x, y) is the joint distribution, p (x) and p (y) are the marginal distributions;
Figure BDA0003111498980000034
equation (6) as an extension of equation (3), using maximum common mutual information
Figure BDA0003111498980000035
To measure candidate gene XiRedundancy with respect to Y for the selected set of basis factors S; wherein XiRepresenting a gene variable, Y a response variable, S a selected subset of genes, I (X)iY) mutual information between the genes and the response variables, CI (X)iS, Y) represents a candidate gene XiRedundancy with respect to Y, x, of the selected set of base factors Sj∈S;
Step 1.3 the maximum correlation minimum common redundancy method constructs gene importance:
let the gene expression data be an n × p matrix, where n is the number of observations and p is the number of genes, and the importance of the kth (k ∈ p) genes is given by:
Sk=f(Xk)=I(Xk,Y)-CI(Xk,S,Y) (7)
weight coefficient of kth gene:
Figure BDA0003111498980000041
wherein, eta is more than 0 and less than or equal to 1 is a given threshold value, when S iskWhen the pressure is greater than eta, the pressure is higher than eta,the k gene has obvious meaning when SkWhen ≦ η, the prediction of the interpretive variable by the kth gene is not significant, and the weight matrix is represented as:
W=diag(w1,...,wp) (9)
step 1.4 construction of variable selection model:
the classification problem of the gene expression data can be abstractly expressed as learning a judgment rule from a training set, assigning a class label to a new sample, and for the gene expression data, n and p respectively represent the size of the sample and the number of genes; let Y be ═ Y1,y2,...yn) ' is a response variable, X ═ X1,X2,...,Xp),Xi=(x1,x2,...xn) ' is a model matrix, let xj=(x1j,x2j,...xnj) According to a general linear regression model, we obtain:
Figure BDA0003111498980000042
wherein θ ═ θ12,...θp) ' is the estimated coefficient vector;
using a weight matrix containing the maximum-associated minimum common redundancy of a single gene, the following penalties for the adaptive elastic network are proposed:
Figure BDA0003111498980000043
Figure BDA0003111498980000044
the adaptive elastic network (AEN-MRMCR) model of the maximum correlation minimum common redundancy method therefore:
Figure BDA0003111498980000045
α∈[0,1]λ > 0 is a regularization parameter, wjThe adaptive data driving weight is obtained, y is a response variable value, and theta is a coefficient vector of an estimated value; AEN-MRMCR estimator
Figure BDA0003111498980000046
Is the minimum of the above formula:
Figure BDA0003111498980000047
the adaptive elastic network penalizes the square error loss by adopting a method combining L2 penalty and an adaptive L1 penalty, compared with the adaptive elastic network, the model provided by the invention adopts an adaptive weight based on maximum correlation minimum common redundancy to replace ridge regression, and the adaptive elastic network method with the maximum correlation minimum redundancy can achieve the effects of selecting related genes, controlling redundant genes and excluding unrelated genes in the automatic gene variable selection process, thereby having obvious biological significance;
step 2, selecting the structural parameters of the deep neural network based on the wolf optimization algorithm, and comprising the following substeps:
step 2.1: deep neural network parameter optimization based on the gray wolf algorithm:
the deep neural network is a multilayer neural network and comprises more than two hidden layers, the speciality of a training model is improved by increasing more layers and the number of neurons in each layer, but if the network structure is too complex, the generalization capability of the model can be reduced, so that a method is needed for determining the structural parameters of the deep neural network model to improve the generalization capability of the deep neural network model, and therefore, the structural parameters of the deep neural network are optimized by using a wolf optimization algorithm;
a Grey Wolf optimization algorithm (GWOlf Optimizer, GWOO) algorithm simulates a Grey Wolf population level mechanism and hunting behaviors in nature, and simulates social levels through 4 types of wolfs (alpha, beta, delta and omega); the hunting behavior of the wolf is simulated through processes of wolf colony tracking, surrounding, chasing, attacking the prey and the like, the aim of optimizing search is achieved, the prey needs to be surrounded when the wolf is hunted, and the mathematical description of the surrounding behavior is as follows:
D=|C·Xp(t)-X(t)| (15)
X(t+1)=Xp(t)-A·D (16)
wherein t is the current iteration number; a and C are coefficient variables; xpIs the position vector of the prey; x is the location vector of the gray wolf; d is the distance between the gray wolf and the prey in t iterations; x (t +1) is the position vector of the gray wolf in t +1 iterations; the vectors a, C are calculated as follows:
A=2a·r1-a (17)
C=2·r2 (18)
Figure BDA0003111498980000051
where a is a convergence factor whose components decrease linearly from 2 to 0 r in an iterative process1,r2Is [0,1 ]]A random vector of (a);
in an abstract search space, the precise location of the optimal solution (game) is unknown to the wolf, and in order to simulate the hunting behavior of the wolf, it is assumed that α (optimal candidate solution), β and δ have information about the potential location of the game, so that in each iteration, 3 optimal solutions obtained so far are saved, forcing other wolfs to adopt the following formula to update the location according to the optimal search location:
Dα=|C1·Xα-X| (20)
Dβ=|C2·Xβ-X| (21)
Dδ=|C3·Xδ-X| (22)
X1=Xα-A1·Dα (23)
X2=Xβ-A2·Dβ (24)
X3=Xδ-A3·Dδ (25)
Figure BDA0003111498980000061
where A is1,A2,A3,C1,C2,C3As a co-operative coefficient variable, X is the location vector of the gray wolf, Dα,Dβ,DδThe distances of the gray wolf relative to three wolfs of alpha, beta and delta, X1,X2,X3The position vectors of the gray wolf relative to three wolfs of alpha, beta and delta are respectively; x (t +1) is the position vector of the wolf when the iteration is carried out for t +1 times, so the deep neural network parameter optimization step based on the wolf algorithm comprises the following steps:
the first step is as follows: initializing a wolf population, wherein each position consists of a hidden layer number l and a hidden node number n;
the second step is that: learning a training sample, and taking the mean square error of the prediction result of the deep neural network as an individual fitness function of the wolf algorithm;
the third step: calculating a of the gray wolf algorithm according to formula (19), updating A and C according to formulas (17-18);
the fourth step: updating the position of the single wolf according to formula (26);
the fifth step: if the maximum iteration times is reached, returning the best single wolf position, otherwise, repeating the steps from three to five;
the key to find the global optimal solution in the gray wolf optimization algorithm is to determine a fitness function, the fitness function of the algorithm is calculated GWO by using the training mean square error of the deep neural network, and GWO optimization is linked with the deep neural network;
step 2.2: the deep neural network training error calculation steps are as follows:
the first step is as follows: initializing a DNN parameter set theta consisting of weights and deviations;
the second step is that: if it is firstthe fitness of t-generation gray Tailang is f (l)t,nt) The number of hidden layers and hidden nodes can be expressed as lt,nt
The third step: v. of0Inputting a sample vector, q is the iteration number of DNN, and e is the training mean square error of DBN;
the fourth step: randomly iterating the training set in batches according to q times;
the fifth step: fine tuning theta by using a BP algorithm;
and a sixth step: calculating a predicted value by using theta to obtain a training error e;
thus, the GWO algorithm is associated with the DNN through a fitness function, which may reflect the quality of the DNN structural parameters, thereby generating appropriate predictors.
The invention has the beneficial effects that: the method is based on the internal correlation structure of the complex gene data, considers the interdependency among genes, combines coefficient compression and mutual information theory, provides a new gene data variable selection and classification method fusing an adaptive elastic network and a deep neural network, establishes a data-driven model-free assumed adaptive variable selection method, fully considers the redundant information of the genes, carries out weighted estimation on punishment items of the adaptive elastic network, excludes irrelevant genes, controls redundant genes, reduces the complexity of model training, and provides a new thought for the variable selection of the complex nonlinear gene data. Meanwhile, structural parameters of the deep neural network are optimized by using a wolf optimization method, and the generalization capability of the model is improved. The method is used for gene data variable selection and classification, greatly saves medical examination and decision time, and provides great support for saving the life of a patient.
Description of the drawings:
FIG. 1 is a diagram of a maximum correlation minimum common redundancy framework.
Fig. 2 is a flow chart of a maximum correlation minimum common redundancy method.
Fig. 3 is a flow chart of a method for adapting an elastic network based on maximum correlation and minimum common redundancy.
FIG. 4 is a flow chart for optimizing deep neural network structure parameters using a gray wolf optimization algorithm.
The specific implementation scheme is as follows:
the present invention will be further described with reference to the following drawings and examples, including but not limited to the following examples.
The gene data can be regarded as a matrix with the ordinate of the tested individual and the abscissa of the gene expression, and the number in the matrix represents the expression quantity of the gene of a certain tested individual and is generally expressed by real number. The gene data variable selection and classification method based on the fusion of the adaptive elastic network and the deep neural network basically comprises the following implementation processes:
1. selecting variables based on a maximum correlation minimum common redundancy self-adaptive elastic network method:
when the variable selection is carried out on the gene expression data, firstly, the maximum and minimum standardization processing is carried out on the data to solve the influence of the dimension on the result, and the expression is as follows:
Figure BDA0003111498980000071
then, selecting variables for the normalized data, wherein in step 1, the adaptive elastic network method based on the maximum correlation and minimum common redundancy comprises the following sub-steps: 1.1, measuring common redundant information by utilizing mutual information to embody the internal association and driving characteristics between gene expression data; 1.2, constructing a maximum correlation minimum common redundancy gene ordering method by utilizing common redundancy information, and reflecting the importance of variable data. 1.3 the maximum correlation minimum common redundancy method constructs gene importance and gives a weight matrix of data variables. 1.4 constructing a variable selection model by using a maximum correlation minimum common redundancy method and an adaptive elastic network.
In step 1.1, the genes are considered as independent variables and the subject status signature (diseased/non-diseased) is considered as a response variable, with the aim of selecting the relevant genes among the independent variables that contribute to the signature variable, excluding the irrelevant genes, and selecting the redundant genes, as shown in FIG. 1. First of all utilizeMutual information method for measuring common redundant information and defining gene XiThe corresponding redundant information rates are as follows:
Figure BDA0003111498980000081
since some variable selection methods at the present stage only consider the relationship between the variable and the response variable gene, it is rare to consider the next variable to be selected and the variable X in the selected subset S of variables when selecting the variablesjThe mutual inclusion degree between the two variables can not keep useful variables to the maximum extent, and simultaneously, the function of redundant variables is considered, and irrelevant variables are eliminated. Thus, for one gene Xje.S (S is a selected subset of genes), XjAnd a candidate gene XiThe redundant information between can be mutual information I (X) between the twoj,Xi) And (6) measuring.
Multiplying the redundant information rate by min { I (X)i;Y),I(Xj(ii) a Y) }, common mutual information CI (X) is introducedi,XjThe concept of Y):
Figure BDA0003111498980000082
Xi,Xjand Y can measure the amount of mutual information among these genes. For a gene set T ═ X1,X2,...,XpThe gene selection process identifies a subset of T, denoted as S. Extending common mutual information CI (X)i,XjY) to CI (X)iS, Y) and are defined as common redundancy as follows:
Figure BDA0003111498980000083
the common redundancy information is used in step 1.2 to construct a maximum correlation minimum common redundancy gene ordering method, for gene expression data, each gene acts as a vector, with the elements representing their expression values in different conditions or samples. The maximum correlation minimum common redundancy method avoids underestimation of redundant items among genes, achieves the purposes of selecting related genes, excluding unrelated genes and controlling redundant genes, and considers the global normalization of target (response) genes. As shown in fig. 2, the importance of each gene in the gene expression data was calculated, and its expression was as follows:
f(Xi)=I(Xi,Y)-CI(Xi,S,Y) (4)
wherein:
Figure BDA0003111498980000084
p (x, y) is the joint distribution, and p (x) and p (y) are the marginal distributions.
Figure BDA0003111498980000085
When the selected data set S is an empty set, the gene whose mutual information value is the largest at this time is selected as the selected gene, and the mutual information of the selected gene is its importance value and is put into the selected data set S. When the selected data set S is not an empty set, the gene importance value is calculated according to equation (4).
Equation (6) as an extension of equation (3), using maximum common mutual information
Figure BDA0003111498980000091
To measure candidate gene XiRedundancy with respect to Y for the selected set of basis factors S; wherein XiRepresenting candidate genes, Y representing a response variable, S representing a selected subset of genes, I (X)iY) mutual information between the genes and the response variables, CI (X)iS, Y) represents a candidate gene XiRedundancy with respect to Y, x, of the selected set of base factors Sj∈S。
In step 1.3, the gene importance is constructed using the maximum correlation minimum common redundancy method, and the importance of the kth gene is given by:
Sk=f(Xk) (7)
define the weight coefficient for the kth gene:
Figure BDA0003111498980000092
wherein, eta which is more than 0 and less than or equal to 1 is a given threshold value. When S iskWhen eta, the kth gene has obvious meaning when SkWhen the k-th gene is less than or equal to eta, the prediction of the explanation variable gene is not obvious. We represent the weight matrix as:
W=diag(w1,...,wp) (9)
in the polynomial sparse group lasso model, the calculation and significance of the weight value are not given, the weight value of the adaptive lasso adopts initial consistency estimation, and the weight value of the adaptive elastic network adopts initial elastic network estimation. The weights given by the above methods, although having clear statistical significance, can be used comprehensively to evaluate the importance of genes, but cannot account for obvious biological significance. The adaptive gene selection strategy presented herein is of biological interest.
In step 1.4, as shown in fig. 3, the prediction accuracy of gene selection can be improved to some extent by performing weighted estimation on the L1 and L2 penalty terms of the elastic net in consideration of the information correlation between different genes in the data set.
The problem of classification of gene expression data can be abstractly expressed as learning a discriminant rule from a training set and assigning a class label to a new sample. For gene expression data, n and p represent the sample size and number of genes, respectively. Let Y be ═ Y1,y2,...yn) ' is a response variable, X ═ X1,X2,...,Xp),Xi=(x1,x2,...xn) ' is a model matrix. Let xj=(x1j,x2j,...xnj)'. According to a general linear regression model, we obtain:
Figure BDA0003111498980000093
wherein θ ═ θ12,...θp) ' is the estimated coefficient vector.
By using a weight matrix containing the mutual information of the conditions of the single gene, the following punishment items of the self-adaptive elastic network are proposed:
Figure BDA0003111498980000101
Figure BDA0003111498980000102
an adaptive elastic net (AEN-MRMCR) model with maximum correlation minimum common redundancy method is proposed:
Figure BDA0003111498980000103
α∈[0,1]λ > 0 is a regularization parameter, wjIs the adaptive data driven weight, y is the value of the response variable, and theta is the coefficient vector of the estimate. AEN-MRMCR estimator
Figure BDA0003111498980000104
Is the minimum of the above formula.
Figure BDA0003111498980000105
2. Selecting structural parameters of the deep neural network based on a wolf optimization algorithm:
because gene expression data has the characteristics of high dimension and small samples, after the data is subjected to variable selection by using the step 1, the data needs to be classified by using a predictor so as to assist clinical diagnosis. As shown in fig. 4, the present invention provides a classification method for sirius optimized deep neural network structure parameters, which is implemented as follows:
in step 2, the selecting of the structural parameters of the deep neural network based on the wolf optimization algorithm comprises the following substeps: 2.1, carrying out deep neural network parameter optimization based on the gray wolf algorithm, initializing parameters, and constructing a fitness function of the gray wolf optimization. 2.1, deep neural network training error, namely connecting the gray wolf optimization method with the deep neural network through an error function.
In step 2.1: the deep neural network parameter optimization method based on the wolf algorithm comprises the following steps:
the first step is as follows: a gray wolf population is initialized. Each position consists of a hidden layer number l and a hidden node number n;
the second step is that: learning a training sample, and taking the mean square error of the prediction result of the deep neural network as an individual fitness function of the wolf algorithm;
the third step: calculating a of the gray wolf algorithm, and updating A and C;
the fourth step: updating the position of the single wolf according to the A and the C;
the fifth step: if the termination condition is reached, returning to the optimal personal position, otherwise, repeating the steps from three to five;
the key to find the global optimal solution in the grayish optimization algorithm is to determine a fitness function, and the fitness function of the algorithm is calculated GWO by using the training mean square error of the deep neural network;
in step 2.2: the deep neural network training error calculation steps are as follows:
the first step is as follows: initializing a DNN parameter set theta consisting of weights and deviations;
the second step is that: if the fitness of the t-th generation of gray Tailang is f (l)t,nt) The number of hidden layers and hidden nodes can be expressed as lt,nt
The third step: v. of0Inputting a sample vector, q is the iteration number of DNN, and e is the training mean square error of DBN;
the fourth step: randomly iterating the training set in batches according to q times;
the fifth step: fine tuning theta by using a BP algorithm;
and a sixth step: calculating a predicted value by using theta to obtain a training error e;
thus, the GWO algorithm is associated with the DNN by a fitness function. The fitness value may reflect the quality of the DNN structural parameters, thereby generating a suitable predictor.

Claims (1)

1. In order to solve the problem that the sample size of gene data is far smaller than that of characteristics and the deep learning method is limited in bioinformatics, the invention provides a gene data variable selection and classification method fusing an adaptive elastic network and a deep neural network, wherein variable selection is carried out on the basis of the adaptive elastic network method, and on the basis, classification is carried out on the basis of the deep neural network; the method specifically comprises the following steps:
step 1, selecting variables based on a maximum correlation minimum common redundancy self-adaptive elastic network method, comprising the following substeps:
step 1.1 utilizes mutual information to measure common redundancy, and embodies the internal association and driving characteristics between gene expression data:
cancer may occur anywhere in the human body, and it is noted that initial treatment of cancer is much easier than late stage, and analysis based on genetic data has become an effective method for early stage cancer identification, where the number of samples is much smaller than the number of genes due to limited number of clinical subjects and heterogeneity between test populations, and the first step is to identify a small fraction of genes that are the main causes of disease occurrence, destroy inappropriate and ineffective genes, and improve understanding of classification models in the collection of gene expression data;
in the mathematical definition, assume XiIs a candidate variable, Y is a response variable, XjE S is a selected variable, S is a selected variable subset, and candidate variables X are definediMutual information with response variable Y is related item, candidate variable XiWith the selected variable XjThe mutual information between them is called redundancy item; the goal of any variable selection problem is to select dependent terms, exclude non-dependent terms, and for redundant terms, consider that there is a dependencyIf some errors are made in measuring the relevant variables, the performance of the predictor is poor, but if some redundancy items of the relevant variables are selected by the predictor, the errors can be corrected, so that the predictor can select some redundancy variables to improve the robustness of the prediction;
therefore, in the variable selection process of gene data, the genes are regarded as independent variables, the state labels (diseased/non-diseased) of the subjects are regarded as response variables, and the purpose is to select related genes which act on the label variables in the variable set, exclude the unrelated genes and select redundant genes;
when selecting the variable of gene data, candidate gene X is considerediThe amount of information with the response variable Y, taking into account the genes X in the selected subset S of genesjThe degree of information inclusion between, retaining related genes, selecting redundant genes, and excluding unrelated genes, therefore, for one gene Xje.S (S is the selected subset of genes), XjAnd a candidate gene XiThe redundant information between can be mutual information I (X) between the twoj,Xi) The calculation method is shown in formula (5), gene XiThe corresponding redundant information rates are as follows:
Figure FDA0003111498970000021
RI(Xi,Xj) Is gene Xi,XjRedundant information rate of, I (X)j,Xi) Is Xi,XjRedundant information between, I (X)iY) is XiInter-correlation of Y, I (X)jY) is XjThe correlation between Y;
multiplied by min { I (X)i;Y),I(Xj(ii) a Y) }, common mutual information CI (X) is introducedi,XjDefinition of Y):
Figure FDA0003111498970000022
Xi,Xjand Y can measure the amount of common information among these genes, T ═ X for one gene data set1,X2,...,XpThe variable selection process identifies a subset of T, denoted S, that extends the common mutual information CI (X)i,XjY) to CI (X)iS, Y) and are defined as common redundancy as follows:
Figure FDA0003111498970000023
I(Xi(ii) a S) is a gene XiAnd mutual information of the selected subset S;
step 1.2 construct the maximum correlation minimum common redundancy gene ordering method by using the common redundancy information:
for gene expression data, each gene is used as a carrier, elements in the carrier represent expression values of the genes in different conditions or samples, the maximum correlation minimum common redundancy method avoids underestimation of redundancy items among the genes, and achieves the purposes of selecting related genes, excluding unrelated genes and controlling redundant genes, and the global normalization of target (response) variables is considered, wherein the expression is as follows:
f(Xi)=I(Xi,Y)-CI(Xi,S,Y) (4)
wherein:
Figure FDA0003111498970000024
p (x, y) is the joint distribution, p (x) and p (y) are the marginal distributions;
Figure FDA0003111498970000025
equation (6) as an extension of equation (3), using maximum common mutual information
Figure FDA0003111498970000026
To measure candidate gene XiRedundancy with respect to Y for the selected set of basis factors S; wherein XiRepresenting a gene variable, Y a response variable, S a selected subset of genes, I (X)iY) mutual information between the genes and the response variables, CI (X)iS, Y) represents a candidate gene XiRedundancy with respect to Y, x, of the selected set of base factors Sj∈S;
Step 1.3 the maximum correlation minimum common redundancy method constructs gene importance:
let the gene expression data be an n × p matrix, where n is the number of observations and p is the number of genes, and the importance of the kth (k ∈ p) genes is given by:
Sk=f(Xk)=I(Xk,Y)-CI(Xk,S,Y) (7)
weight coefficient of kth gene:
Figure FDA0003111498970000031
wherein, eta is more than 0 and less than or equal to 1 is a given threshold value, when S iskWhen eta, the kth gene has obvious meaning when SkWhen ≦ η, the prediction of the interpretive variable by the kth gene is not significant, and the weight matrix is represented as:
W=diag(w1,...,wp) (9)
step 1.4 construction of variable selection model:
the classification problem of the gene expression data can be abstractly expressed as learning a judgment rule from a training set, assigning a class label to a new sample, and for the gene expression data, n and p respectively represent the size of the sample and the number of genes; let Y be ═ Y1,y2,...yn) ' is a response variable, X ═ X1,X2,...,Xp),Xi=(x1,x2,...xn) ' is a model matrix, let xj=(x1j,x2j,...xnj) ', obtaining the target value according to a general linear regression modelTo:
Figure FDA0003111498970000032
wherein θ ═ θ12,...θp) ' is the estimated coefficient vector;
using a weight matrix containing the maximum-associated minimum common redundancy of a single gene, the following penalties for the adaptive elastic network are proposed:
Figure FDA0003111498970000033
Figure FDA0003111498970000034
the adaptive elastic network (AEN-MRMCR) model of the maximum correlation minimum common redundancy method therefore:
Figure FDA0003111498970000035
α∈[0,1]λ > 0 is a regularization parameter, wjThe adaptive data driving weight is obtained, y is a response variable value, and theta is a coefficient vector of an estimated value; AEN-MRMCR estimator
Figure FDA0003111498970000041
Is the minimum of the above formula:
Figure FDA0003111498970000042
the adaptive elastic network penalizes the square error loss by adopting a method combining L2 penalty and an adaptive L1 penalty, compared with the adaptive elastic network, the model provided by the invention adopts an adaptive weight based on maximum correlation minimum common redundancy to replace ridge regression, and the adaptive elastic network method with the maximum correlation minimum redundancy can achieve the effects of selecting related genes, controlling redundant genes and excluding unrelated genes in the automatic gene variable selection process, thereby having obvious biological significance;
step 2, selecting the structural parameters of the deep neural network based on the wolf optimization algorithm, and comprising the following substeps:
step 2.1: deep neural network parameter optimization based on the gray wolf algorithm:
the deep neural network is a multilayer neural network and comprises more than two hidden layers, the speciality of a training model is improved by increasing more layers and the number of neurons in each layer, but if the network structure is too complex, the generalization capability of the model can be reduced, so that a method is needed for determining the structural parameters of the deep neural network model to improve the generalization capability of the deep neural network model, and therefore, the structural parameters of the deep neural network are optimized by using a wolf optimization algorithm;
a Grey Wolf optimization algorithm (GWOlf Optimizer, GWOO) algorithm simulates a Grey Wolf population level mechanism and hunting behaviors in nature, and simulates social levels through 4 types of wolfs (alpha, beta, delta and omega); the hunting behavior of the wolf is simulated through processes of wolf colony tracking, surrounding, chasing, attacking the prey and the like, the aim of optimizing search is achieved, the prey needs to be surrounded when the wolf is hunted, and the mathematical description of the surrounding behavior is as follows:
D=|C·Xp(t)-X(t)| (15)
X(t+1)=Xp(t)-A·D (16)
wherein t is the current iteration number; a and C are coefficient variables; xpIs the position vector of the prey; x is the location vector of the gray wolf; d is the distance between the gray wolf and the prey in t iterations; x (t +1) is the position vector of the gray wolf in t +1 iterations; the vectors a, C are calculated as follows:
A=2a·r1-a (17)
C=2·r2 (18)
Figure FDA0003111498970000051
where a is a convergence factor whose components decrease linearly from 2 to 0 r in an iterative process1,r2Is [0,1 ]]A random vector of (a);
in an abstract search space, the precise location of the optimal solution (game) is unknown to the wolf, and in order to simulate the hunting behavior of the wolf, it is assumed that α (optimal candidate solution), β and δ have information about the potential location of the game, so that in each iteration, 3 optimal solutions obtained so far are saved, forcing other wolfs to adopt the following formula to update the location according to the optimal search location:
Dα=|C1·Xα-X| (20)
Dβ=|C2·Xβ-X| (21)
Dδ=|C3·Xδ-X| (22)
X1=Xα-A1·Dα (23)
X2=Xβ-A2·Dβ (24)
X3=Xδ-A3·Dδ (25)
Figure FDA0003111498970000052
where A is1,A2,A3,C1,C2,C3As a co-operative coefficient variable, X is the location vector of the gray wolf, Dα,Dβ,DδThe distances of the gray wolf relative to three wolfs of alpha, beta and delta, X1,X2,X3The position vectors of the gray wolf relative to three wolfs of alpha, beta and delta are respectively; x (t +1) is the position vector of the wolf when the iteration is carried out for t +1 times, so the deep neural network parameter optimization step based on the wolf algorithm comprises the following steps:
the first step is as follows: initializing a wolf population, wherein each position consists of a hidden layer number l and a hidden node number n;
the second step is that: learning a training sample, and taking the mean square error of the prediction result of the deep neural network as an individual fitness function of the wolf algorithm;
the third step: calculating a of the gray wolf algorithm according to formula (19), updating A and C according to formulas (17-18);
the fourth step: updating the position of the single wolf according to formula (26);
the fifth step: if the maximum iteration times is reached, returning the best single wolf position, otherwise, repeating the steps from three to five;
the key to find the global optimal solution in the gray wolf optimization algorithm is to determine a fitness function, the fitness function of the algorithm is calculated GWO by using the training mean square error of the deep neural network, and GWO optimization is linked with the deep neural network;
step 2.2: the deep neural network training error calculation steps are as follows:
the first step is as follows: initializing a DNN parameter set theta consisting of weights and deviations;
the second step is that: if the fitness of the t-th generation of gray Tailang is f (l)t,nt) The number of hidden layers and hidden nodes can be expressed as lt,nt
The third step: v. of0Inputting a sample vector, q is the iteration number of DNN, and e is the training mean square error of DBN;
the fourth step: randomly iterating the training set in batches according to q times;
the fifth step: fine tuning theta by using a BP algorithm;
and a sixth step: calculating a predicted value by using theta to obtain a training error e;
thus, the GWO algorithm is associated with the DNN through a fitness function, which may reflect the quality of the DNN structural parameters, thereby generating appropriate predictors.
CN202110650665.7A 2021-06-11 2021-06-11 Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network Withdrawn CN113241122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110650665.7A CN113241122A (en) 2021-06-11 2021-06-11 Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110650665.7A CN113241122A (en) 2021-06-11 2021-06-11 Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network

Publications (1)

Publication Number Publication Date
CN113241122A true CN113241122A (en) 2021-08-10

Family

ID=77139684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110650665.7A Withdrawn CN113241122A (en) 2021-06-11 2021-06-11 Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network

Country Status (1)

Country Link
CN (1) CN113241122A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838519A (en) * 2021-08-20 2021-12-24 河南大学 Gene selection method and system based on adaptive gene interaction regularization elastic network model
CN114841472A (en) * 2022-06-28 2022-08-02 浙江机电职业技术学院 GWO optimized Elman power load prediction method based on DNA hairpin variation
CN115099885A (en) * 2022-03-31 2022-09-23 日日顺供应链科技股份有限公司 Commodity matching recommendation method and system
CN116680594A (en) * 2023-05-05 2023-09-01 齐鲁工业大学(山东省科学院) Method for improving classification accuracy of thyroid cancer of multiple groups of chemical data by using depth feature selection algorithm
CN117649876A (en) * 2024-01-29 2024-03-05 长春大学 Method for detecting SNP combination related to complex diseases on GWAS data based on GWO algorithm

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838519A (en) * 2021-08-20 2021-12-24 河南大学 Gene selection method and system based on adaptive gene interaction regularization elastic network model
CN113838519B (en) * 2021-08-20 2022-07-05 河南大学 Gene selection method and system based on adaptive gene interaction regularization elastic network model
CN115099885A (en) * 2022-03-31 2022-09-23 日日顺供应链科技股份有限公司 Commodity matching recommendation method and system
CN114841472A (en) * 2022-06-28 2022-08-02 浙江机电职业技术学院 GWO optimized Elman power load prediction method based on DNA hairpin variation
CN116680594A (en) * 2023-05-05 2023-09-01 齐鲁工业大学(山东省科学院) Method for improving classification accuracy of thyroid cancer of multiple groups of chemical data by using depth feature selection algorithm
CN117649876A (en) * 2024-01-29 2024-03-05 长春大学 Method for detecting SNP combination related to complex diseases on GWAS data based on GWO algorithm
CN117649876B (en) * 2024-01-29 2024-04-12 长春大学 Method for detecting SNP combination related to complex diseases on GWAS data based on GWO algorithm

Similar Documents

Publication Publication Date Title
CN113241122A (en) Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network
Perrone et al. Poisson random fields for dynamic feature models
CN114927162A (en) Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution
Maulik Analysis of gene microarray data in a soft computing framework
Perrone et al. Poisson random fields for dynamic feature models
WO2022167821A1 (en) Drug optimisation by active learning
CN112256971A (en) Sequence recommendation method and computer-readable storage medium
US20120185424A1 (en) FlexSCAPE: Data Driven Hypothesis Testing and Generation System
Kumar et al. Future of machine learning (ML) and deep learning (DL) in healthcare monitoring system
Ma An Efficient Optimization Method for Extreme Learning Machine Using Artificial Bee Colony.
Elzeki et al. A new hybrid genetic and information gain algorithm for imputing missing values in cancer genes datasets
Aushev et al. Likelihood-free inference with deep Gaussian processes
Shukla et al. Application of deep learning in biological big data analysis
Hoffmann et al. Minimizing the expected posterior entropy yields optimal summary statistics
CN114722273A (en) Network alignment method, device and equipment based on local structural feature enhancement
Roy et al. A hidden-state Markov model for cell population deconvolution
JP2023535285A (en) Mutant Pathogenicity Scoring and Classification and Their Use
Amutha et al. A Survey on Machine Learning Algorithms for Cardiovascular Diseases Predic-tion
CN117976047B (en) Key protein prediction method based on deep learning
Punjabi et al. Enhancing Performance of Lazy Learner by Means of Binary Particle Swarm Optimization
Chen et al. SoftStep relaxation for mining optimal convolution kernel
Lim et al. Feature Acquisition Using Monte Carlo Tree Search
Darmawahyuni et al. Health-related Data Analysis using Metaheuristic Optimization and Machine Learning
Baruque et al. All Action Updates for Reinforcement Learning with Costly Features
Homenda et al. Objective functions in fuzzy cognitive maps: the case of time series modeling

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210810

WW01 Invention patent application withdrawn after publication