CN113838519A

CN113838519A - Gene selection method and system based on adaptive gene interaction regularization elastic network model

Info

Publication number: CN113838519A
Application number: CN202110959928.2A
Authority: CN
Inventors: 王雅娣; 朱海红; 刘荣; 王芳
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2021-08-20
Filing date: 2021-08-20
Publication date: 2021-12-24
Anticipated expiration: 2041-08-20
Also published as: CN113838519B

Abstract

The invention discloses a gene selection method and system based on an adaptive gene interaction regularization elastic network model. The method comprises the following steps: assessing the importance of each measured gene based on the Wilcoxon rank sum test; quantifying the importance of each measured gene, adding an adaptive penalty weight, and deleting noise genes to obtain characteristic genes; introducing the adaptive penalty weight into a least-squares loss function to construct an adaptive elastic network model; constructing an adjacency matrix of the gene interaction network; constructing a gene interaction network penalty based on the adjacency matrix; combining the adaptive elastic network model with the gene interaction network penalty to construct an adaptive gene interaction regularization elastic network model; and solving for the optimal solution of the regularized elastic network model based on a gradient descent algorithm, and selecting genes based on the optimal solution. The invention can adaptively select important genes highly related to tumorigenesis and remove redundant genes, irrelevant genes and noise genes.

Description

Gene selection method and system based on adaptive gene interaction regularization elastic network model

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a gene selection method and system based on an adaptive gene interaction regularization elastic network model.

Background

Tumors have become one of the main diseases threatening human life and health. According to the 2018 global cancer statistics report, there were an estimated 18.1 million new cancer cases and 9.6 million cancer deaths in 2018, and the number of cancer diagnoses is increasing rapidly every year; however, research on cancer treatment and prevention remains incomplete. With the wide application and development of gene chip technology, gene expression information for various tissues can be obtained continuously. Finding the small number of genes that are differentially expressed across disease categories among the large number of genes measured on a gene chip is the key to accurate disease diagnosis and to providing a reliable diagnostic basis. It also facilitates the further development of drugs against these diseases.

Methods based on machine learning and statistical analysis help provide important references for tumor diagnosis, cancer classification, clinical outcome prediction and the like, and have therefore attracted extensive attention from scholars. In biomedicine, microarray data is widely used in the study of cancer classification and prognosis. DNA microarray data is typically represented as a matrix whose elements are gene expression levels. The specific format is shown in Table 1, where each row represents a sample, each column represents a gene, and the rightmost column gives the class label of the sample.

TABLE 1 matrix form of DNA microarray data

Gene selection is an important step in analyzing gene expression profiles: a small subset containing the key gene information is selected from the high-dimensional gene expression profile. Gene selection does not change the gene values in the original samples; it only removes redundant genes and retains the genes relevant to classification.

The process of gene selection is as follows: (1) acquiring microarray data, (2) preprocessing the acquired data, (3) extracting characteristic genes, (4) performing classification modeling, and (5) analyzing the classification result. The specific classification process is shown in fig. 1.

Currently, many studies use microarray data to classify diseases based on gene expression levels, and numerous studies have shown that gene expression levels are an important means of finding characteristic genes and classes. Logistic regression is a commonly used classification method, but for microarray data (where the number of predictor variables p is much greater than the number of samples n) it may produce unstable estimates. The maximum likelihood method also produces unstable results when multicollinearity exists among the predictor variables. Therefore, conventional logistic regression is not suitable for disease classification based on gene expression levels. Penalized logistic regression based on the L₂ norm, as well as various L₁-norm penalty and regularization methods, have been successfully applied to disease classification. However, existing L₁-type penalty methods apply the same penalty to all features without considering the importance of each feature, which may lead to low estimation efficiency and inconsistent variable selection when modeling linear regression. Some highly correlated genes in disease classification should be selected or eliminated together as a group. From a learning point of view, this can be regarded as a grouping effect, i.e., the estimated coefficients of highly correlated genes are relatively close. As a newer regularization method, the elastic network model and its various generalizations can produce a grouping effect when building classifiers, but such models are rarely biologically interpretable and fail to adequately consider gene interactions.

With the rise of large-scale cancer genome research and personalized medicine, comprehensive prediction of clinical outcomes using multi-omics data has become an emerging research topic. Since neither DNA methylation nor gene expression has a pronounced advantage over the other in predicting cancer stage and patient survival in many cancers, prediction performance can be improved by combining gene expression profiling data, methylation profiling data and other gene measurements to predict clinical outcome. However, this requires collecting and integrating genomic data from a large number of patients, which is a relatively heavy workload.

Disclosure of Invention

Aiming at the problems of low estimation efficiency, insufficient consideration of gene interactions and heavy workload in existing gene selection methods, the invention provides a gene selection method and system based on an adaptive gene interaction regularization elastic network model, which can adaptively select important genes highly related to tumorigenesis and remove redundant genes, irrelevant genes and noise genes.

To achieve this purpose, the invention provides a gene selection method based on an adaptive gene interaction regularization elastic network model. The importance of each gene is evaluated based on the Wilcoxon rank sum test (WRST), and an adaptive penalty is then applied to each gene according to the importance of the measured gene, so that noise genes are deleted from the model and characteristic genes are identified. Gene measurements and the interaction information between genes are integrated into an adaptive elastic network model, which enhances the sparsity of the structure and selects characteristic genes by using the grouping effect to reduce redundancy. The regularized elastic network model is solved by an iterative gradient descent algorithm. The method specifically comprises the following steps:

step 1: assessing the degree of importance of each measured gene based on Wilcoxon rank sum test;

step 2: quantifying the degree of importance of each measured gene;

step 3: adding a self-adaptive penalty weight to each measured gene according to the importance degree of each quantified gene, and deleting a noise gene based on the self-adaptive penalty weight to obtain a characteristic gene;

step 4: introducing the self-adaptive penalty weight into a least square loss function so as to construct a self-adaptive elastic network model;

step 5: constructing an adjacency matrix of the gene interaction network;

step 6: constructing a gene interaction network penalty based on the adjacency matrix;

step 7: combining the self-adaptive elastic network model with the gene interaction network penalty to construct a self-adaptive gene interaction regularization elastic network model;

step 8: solving the optimal solution of the self-adaptive gene interaction regularization elastic network model based on a gradient descent algorithm, and selecting genes based on the optimal solution.

Further, the step 1 comprises:

based on the Wilcoxon rank sum test, the importance of each measured gene was evaluated according to the following formula:

wherein I (.) is an indicator function;

represents the ith expression value of the jth gene; p represents the total number of genes measured; N₀ and N₁ are the index sets of the two sample classes, and n₀ and n₁ are the numbers of samples in N₀ and N₁, respectively; s(g_j) measures the extent to which the jth gene has different expression levels in the two classes, with 0 ≤ s(g_j) ≤ n₀n₁. If s(g_j) is close to 0 or to n₀n₁, the jth gene is an important characteristic gene for the classification.
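The statistic described above can be sketched in a few lines. The formula itself is not reproduced in this text, so the sketch assumes the usual Wilcoxon rank sum (Mann-Whitney) pair count, s(g_j) = Σ_{i∈N₀} Σ_{k∈N₁} I(x_ij ≤ x_kj). That form is an assumption, although it does satisfy the stated bound 0 ≤ s(g_j) ≤ n₀n₁. The function name and the toy expression values are hypothetical.

```python
def wilcoxon_score(expr_class0, expr_class1):
    """Count the pairs (i in N0, k in N1) for which x_i <= x_k, for one gene.

    Assumed form of s(g_j); the source formula is not available in this text.
    """
    return sum(1 for a in expr_class0 for b in expr_class1 if a <= b)


# Made-up expression values of a single gene in the two sample classes.
class0 = [1.0, 1.2, 0.8]        # n0 = 3 samples
class1 = [2.0, 2.5, 1.9, 2.2]   # n1 = 4 samples

s = wilcoxon_score(class0, class1)  # 12 == n0 * n1: perfectly separated classes
```

A gene whose expression values are fully interleaved between the two classes would instead score near n₀n₁/2, the least informative value.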

Further, the step 2 comprises:

the genes were ranked according to the following formula:

R(g_j) = max{s(g_j), n₀n₁ − s(g_j)}

The closer s(g_j) is to 0 or to n₀n₁, the larger the value of R(g_j) and the greater the importance of the jth gene in the classification problem.
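As a small worked illustration of the ranking formula R(g_j) = max{s(g_j), n₀n₁ − s(g_j)}, the sketch below folds the two informative extremes of s onto the same large value. The function name and the example scores are hypothetical.

```python
def rank_importance(s, n0, n1):
    """R(g_j) = max{s(g_j), n0*n1 - s(g_j)}: large when s is near 0 or near n0*n1."""
    return max(s, n0 * n1 - s)


n0, n1 = 3, 4
# Hypothetical Wilcoxon scores for three genes: both extremes and the midpoint.
scores = [0, 6, 12]
ranks = [rank_importance(s, n0, n1) for s in scores]  # [12, 6, 12]
```

The gene with s = 6 (half of n₀n₁) receives the smallest R and would be treated as the most noise-like of the three.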

Further, in step 3, the expression of the adaptive penalty weight is:

where n is the number of samples.

Further, the expression of the adaptive elastic network model is as follows:

wherein O₂ represents the adaptive elastic network model, y is the sample class, β is the vector of estimated coefficients of all genes, β_j is the estimated coefficient of the jth gene, x_i is the input vector, and λ and α are regularization parameters with λ > 0 and α ∈ [0, 1].

Further, in the step 5, an adjacency matrix of the gene interaction network is constructed according to the following formula:

A = [a_ij] ∈ R^(p×p)

wherein R represents the set of real numbers; A represents the adjacency matrix of the gene interaction network; and each a_ij takes the value 0 or 1.
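A minimal sketch of how such a 0/1 adjacency matrix could be assembled from a list of known gene-gene interactions. The edge-list input, the symmetric (undirected) treatment of interactions, the zero diagonal and the name `build_adjacency` are assumptions made for illustration; the text only specifies that A is a p×p matrix with entries 0 or 1.

```python
def build_adjacency(p, interactions):
    """Return a p x p list-of-lists matrix with a_ij = 1 for each listed gene pair.

    Assumptions: interactions are undirected, and self-interactions are ignored.
    """
    A = [[0] * p for _ in range(p)]
    for i, j in interactions:
        if i != j:
            A[i][j] = 1
            A[j][i] = 1
    return A


# Hypothetical network over 4 genes: gene 0 interacts with 1, gene 2 with 3.
A = build_adjacency(4, [(0, 1), (2, 3)])
```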

Further, in the step 6, a gene interaction network penalty is constructed according to the following formula:

wherein O₃ represents the gene interaction network penalty, β_i is the estimated coefficient of the ith gene, and tr(·) represents the trace of a matrix.
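The penalty formula is not reproduced in this text. Because it is defined through the adjacency matrix and a matrix trace, one common form consistent with that description is the graph-Laplacian penalty tr(βᵀLβ) with L = D − A, where D is the diagonal degree matrix; for a symmetric A this equals the sum of (β_i − β_j)² over interacting pairs. The sketch below assumes that form, which may differ from the formula actually claimed, and all function names are hypothetical.

```python
def laplacian(A):
    """Graph Laplacian L = D - A for a symmetric 0/1 adjacency matrix with zero diagonal."""
    p = len(A)
    return [[(sum(A[i]) if i == j else 0) - A[i][j] for j in range(p)]
            for i in range(p)]


def network_penalty(beta, A):
    """Assumed penalty beta^T L beta (the trace of a 1x1 matrix)."""
    L = laplacian(A)
    p = len(beta)
    return sum(beta[i] * L[i][j] * beta[j] for i in range(p) for j in range(p))


def pairwise_form(beta, A):
    """Equivalent form: 1/2 * sum_ij a_ij * (beta_i - beta_j)^2."""
    p = len(beta)
    return 0.5 * sum(A[i][j] * (beta[i] - beta[j]) ** 2
                     for i in range(p) for j in range(p))


# Three genes, with genes 0 and 1 interacting.
A = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
same = network_penalty([1.0, 1.0, 0.0], A)   # interacting genes agree: no penalty
diff = network_penalty([1.0, -1.0, 0.0], A)  # interacting genes disagree: penalized
```

Under this assumed form, the penalty is zero when interacting genes share the same coefficient, which is what pushes known interacting genes toward a common group.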

Further, the expression of the adaptive gene interaction regularization elastic network model is as follows:

wherein F(X, A, β) represents the adaptive gene interaction regularization elastic network model, X is the input matrix,

is the penalty term, and γ is the regularization parameter.

The invention also provides a gene selection system based on the adaptive gene interaction regularization elastic network model, which comprises:

a gene importance assessment module for assessing the importance of each measured gene based on Wilcoxon rank sum test;

a gene importance quantification module for quantifying the importance of each measured gene;

the weighting module is used for adding self-adaptive penalty weight to each measured gene according to the importance degree of each quantized gene, and deleting noise genes based on the self-adaptive penalty weight to obtain characteristic genes;

the first construction module is used for introducing the self-adaptive penalty weight into a least square loss function so as to construct an adaptive elastic network model;

the second construction module is used for constructing an adjacency matrix of the gene interaction network;

the third construction module is used for constructing a gene interaction network penalty based on the adjacency matrix;

the fourth construction module is used for combining the self-adaptive elastic network model and the gene interaction network penalty to construct a self-adaptive gene interaction regularization elastic network model;

and the gene obtaining module is used for solving the optimal solution of the self-adaptive gene interaction regularization elastic network model based on a gradient descent algorithm and selecting genes based on the optimal solution.

Compared with the prior art, the invention has the following beneficial effects:

The adaptive gene interaction regularization elastic network model extends the adaptive elastic network model by integrating gene interaction network information into it, so as to achieve better classification. The ordinary elastic network model does not consider information on the interactions between genes, whereas the proposed adaptive gene interaction regularization elastic network model does. The method integrates the adaptive elastic network model with the gene interaction network and uses a gradient descent algorithm to solve for the optimal solution of the model, so that gene importance and gene interaction information are both used to identify characteristic genes and reduce redundancy. It can also adaptively select important genes highly related to tumorigenesis and remove redundant genes, irrelevant genes and noise genes.

Drawings

FIG. 1 is a schematic diagram of a gene selection process;

FIG. 2 is a basic flowchart of a method for selecting a gene based on an adaptive genetic interaction regularization elastic network model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a gene selection system based on an adaptive genetic interaction regularization elastic network model according to an embodiment of the present invention.

Detailed Description

The invention is further illustrated by the following examples in conjunction with the accompanying drawings:

as shown in fig. 2, a method for selecting a gene based on an adaptive gene interaction regularization elastic network model includes:

step 2: quantifying the degree of importance of each measured gene;

step 5: constructing an adjacency matrix of the gene interaction network;

step 7: combining the self-adaptive elastic network model and the gene interaction network penalty to construct a self-adaptive gene interaction regularization elastic network model (AGIREN);

Specifically, in order to efficiently pick out the genes that are important for classification, an adaptive L₁-type penalty is applied, i.e., the importance of each gene in the classification is incorporated into the penalized regression model.

Further, the step 1 comprises:

wherein I (.) is an indicator function;

Further, the step 2 comprises:

Although the importance of each gene can be measured by the Wilcoxon rank sum test, this statistic cannot be used directly as an adaptive penalty weight. To quantify the importance of each gene, the genes are ranked according to the following formula:

R(g_j) = max{s(g_j), n₀n₁ − s(g_j)}

wherein s(g_j) close to 0 or to n₀n₁ indicates that the jth gene is an important characteristic gene. The value of R(g_j) is large for important genes and small for noise genes: the closer s(g_j) is to 0 or to n₀n₁, the larger the value of R(g_j) and the greater the importance of the jth gene in the classification.

Further, in order to penalize each gene differently according to its importance in the classification, in the step 3, the expression of the adaptive penalty weight is as follows:

where n is the number of samples. Noise genes get relatively large penalty weights, while key feature genes get smaller penalty weights.

Further, the importance of each gene for classification, i.e., the gene weight w_j, is introduced into an ordinary least square loss function so as to construct an adaptive elastic network model, wherein the expression of the adaptive elastic network model is as follows:

wherein O₂ represents the adaptive elastic network model, y is the sample class, β is the vector of estimated coefficients of all genes, β_j is the estimated coefficient of the jth gene, x_i is the input vector, n is the number of samples, and λ and α are regularization parameters with λ > 0 and α ∈ [0, 1].

A = [a_ij] ∈ R^(p×p)

wherein R represents the set of real numbers, A represents the adjacency matrix of the gene interaction network, and each a_ij takes the value 0 or 1: a_ij = 0 means that the interaction between the ith gene and the jth gene is weak, and a_ij = 1 means that it is strong. It is worth mentioning that the gene interaction matrix A constructed in the invention can be further refined according to the degree of interaction between genes, and can also incorporate various types of interaction information, such as interactions between transcription factor targets and proteins. For example, proteins that interact more strongly may be assigned more weight than proteins that interact less strongly.

Further, to ensure that known interacting genes have similar coefficients and thus are more likely to be grouped together, it is desirable to maximize the overall grouping effect in the gene interaction network, constructing a gene interaction network penalty according to the following formula:

Further, in step 7, the expression of the adaptive genetic interaction regularization elastic network model is as follows:

is the penalty term, and γ is the regularization parameter.

Further, in step 8, the expression of the adaptive gene interaction regularization elastic network model solved based on the gradient descent algorithm is as follows:

wherein

is the penalty term,

represents the adaptive gene interaction regularization elastic network model solved based on a gradient descent algorithm, and the optimal solution is

Genes are selected based on the optimal solution; specifically, a gene with a non-zero regression coefficient is an important gene closely related to cancer, and the larger the absolute value of the regression coefficient, the stronger the correlation between the gene and cancer.
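The solving and selection step can be sketched as follows, under clearly stated assumptions, since the objective and update formulas are not reproduced in this text. The sketch assumes the objective (1/2n)‖y − Xβ‖² + λ[α Σ w_j|β_j| + (1 − α)/2 ‖β‖²] + γ βᵀLβ, with L a graph Laplacian of the interaction network. Because the L₁ term is not differentiable, it uses a proximal gradient step (gradient descent on the smooth part followed by soft-thresholding) in place of plain gradient descent. The ±1 class coding, the hand-picked weights w, the toy data and all function names are hypothetical.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit(X, y, w, L, lam, alpha, gamma, step=0.1, iters=500):
    """Proximal gradient descent on the assumed objective (not the source's formula)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        new_beta = []
        for j in range(p):
            # Gradient of the smooth part: squared loss + ridge + network penalty.
            grad = (-sum(X[i][j] * resid[i] for i in range(n)) / n
                    + lam * (1 - alpha) * beta[j]
                    + 2 * gamma * sum(L[j][k] * beta[k] for k in range(p)))
            # Weighted L1 term handled by soft-thresholding.
            new_beta.append(soft_threshold(beta[j] - step * grad,
                                           step * lam * alpha * w[j]))
        beta = new_beta
    return beta


# Toy data: genes 0 and 1 track the class label and interact; gene 2 is noise.
X = [[1.0, 0.9, 0.1],
     [0.9, 1.0, -0.2],
     [-1.0, -0.9, 0.2],
     [-0.9, -1.0, -0.1]]
y = [1.0, 1.0, -1.0, -1.0]           # classes coded as +1 / -1 (assumption)
w = [0.5, 0.5, 5.0]                  # hypothetical weights: large for the noise gene
L = [[1, -1, 0], [-1, 1, 0], [0, 0, 0]]  # Laplacian for the single edge 0-1

beta = fit(X, y, w, L, lam=0.1, alpha=0.5, gamma=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]  # selected == [0, 1]
```

In this toy run the heavily penalized noise gene is driven exactly to zero, while the two interacting genes keep nearly equal non-zero coefficients, which mirrors the selection rule stated above.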

Further, after the step 8, the method further comprises:

classification is performed based on the selected genes, and the classification results are analyzed.

Specifically, some highly correlated genes in disease classification should be selected or eliminated together as a group. As a newer regularization method, the elastic network model and its various generalizations can produce a grouping effect when building a classifier. To efficiently pick out the genes that are important for classification, an adaptive L₁-type penalty is applied, i.e., the importance of each gene in the classification is incorporated into the penalized regression model. The importance of each gene is expressed by a gene ranking method based on the Wilcoxon rank sum test, and adaptive weights are proposed that quantify this importance and penalize each gene differently according to its importance in the classification: noise genes receive relatively large weights, while key characteristic genes receive smaller weights. In this way, the importance of each gene for classification can be incorporated directly into the logistic regression model, i.e., the adaptive elastic network model

Gene-gene interactions are fundamental to understanding complex diseases, and phenotypes are thought to result from interactions among multiple key genes. Gene interactions must therefore be considered in cancer classification. When several genes interact, not all of them should be treated as characteristic genes, because the information they carry is inevitably correlated as a result of the interaction. To avoid redundancy, a network constraint based on gene interactions may be defined so that variables connected in the network are likely to be placed in the same group. To ensure that genes with known interactions have similar coefficients and thus are more likely to be grouped together, it is desirable to maximize the overall grouping effect in the gene interaction network, i.e., the gene interaction regularization model

Constructing an adaptive gene interaction regularization elastic network model (AGIREN) according to the adaptive elastic network model and the gene interaction regularization model:

on the basis of the above embodiments, as shown in fig. 3, the present invention further provides a gene selection system based on an adaptive gene interaction regularization elastic network model, which includes:

In summary,

(1) the invention introduces gene importance into the classification method through gene ranking based on the Wilcoxon rank sum test, so as to better select characteristic genes that contribute substantially to classification;

(2) the invention applies an adaptive penalty weight to each gene, so that noise genes receive a larger penalty and are removed by the model, while characteristic genes receive a smaller penalty and are retained;

(3) because a large amount of redundant information exists between genes, the invention constructs a gene-gene interaction network penalty in order to effectively remove redundant genes;

(4) combining the above three points, the adaptive gene interaction regularization elastic network model provided by the invention has two notable characteristics. First, the model is built on the adaptive elastic network model and is therefore sparse: relatively few characteristic genes are selected according to their regression coefficients, and the selected characteristic genes play a key role in cancer classification, clinical outcome prediction and the like. Second, constructing a gene interaction network model reduces redundant information between genes and is applicable to a wide variety of data types, such as DNA methylation data, gene expression profiling data, protein interactions, and the like.

The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims

1. A gene selection method based on an adaptive gene interaction regularization elastic network model is characterized by comprising the following steps:

step 2: quantifying the degree of importance of each measured gene;

step 5: constructing an adjacency matrix of the gene interaction network;

2. The method for selecting genes based on an adaptive gene interaction regularization elastic network model according to claim 1, wherein the step 1 comprises:

wherein I (.) is an indicator function;

3. The method for selecting genes based on an adaptive gene interaction regularization elastic network model according to claim 2, wherein said step 2 comprises:

the genes were ranked according to the following formula:

R(g_j) = max{s(g_j), n₀n₁ − s(g_j)}

The closer s(g_j) is to 0 or to n₀n₁, the larger the value of R(g_j) and the greater the importance of the jth gene in the classification problem.

4. The method for selecting genes based on an adaptive genetic interaction regularization elastic network model according to claim 3, wherein in the step 3, the expression of the adaptive penalty weight is as follows:

where n is the number of samples.

5. The method of claim 4, wherein the adaptive genetic interaction regularization elastic network model is expressed as:

6. The method for selecting genes based on adaptive genetic interaction regularization elastic network model according to claim 2, wherein in said step 5, an adjacency matrix of a genetic interaction network is constructed according to the following formula:

A = [a_ij] ∈ R^(p×p)

7. The method for selecting genes based on an adaptive genetic interaction regularization elastic network model as claimed in claim 6, wherein in said step 6, a genetic interaction network penalty is constructed according to the following formula:

8. The method for selecting genes based on an adaptive genetic interaction regularization elastic network model according to claim 7, wherein in the step 7, the expression of the adaptive genetic interaction regularization elastic network model is:

is the penalty term, and γ is the regularization parameter.

9. A gene selection system based on an adaptive gene interaction regularization elastic network model, comprising: