CN111192638A

CN111192638A - High-dimensional low-sample gene data screening and protein network analysis method and system

Info

Publication number: CN111192638A
Application number: CN201911424402.3A
Authority: CN
Inventors: 章乐; 游宇杰; 陈渝杰
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-22
Anticipated expiration: 2039-12-31
Also published as: CN111192638B

Abstract

The invention provides a method and a system for high-dimensional low-sample gene data screening and protein network analysis. The method comprises the following steps: preprocessing original GBM data to obtain GBM data; performing feature extraction based on the GBM data, and adding interaction terms among the covariates to the screened covariates; performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result; and screening the gene result to obtain the final screened genes. The final screened genes are then used to acquire protein data under conditions of high expression, normal expression and no expression in GBM; the key proteins are analyzed and screened, and the mechanism by which the key genes influence protein pathways is finally obtained. The method optimizes the combination of Cox, SIS and Lasso, greatly improves the accuracy with which the model screens GBM key genes, and determines the signal pathway relation between GBM and the key proteins.

Description

High-dimensional low-sample gene data screening and protein network analysis method and system

Technical Field

The invention belongs to the field of gene data screening and gene data processing, and particularly relates to a method and a system for finding key genes closely related to glioblastoma multiforme (GBM) in high-dimensional low-sample gene data, acquiring protein expression data after the key genes are knocked out, and performing protein network analysis.

Background

Glioblastoma multiforme (abbreviated as GBM) is a tumor that arises from astrocytoma through malignant transformation. The mechanism by which GBM arises is a particularly complex biological phenomenon driven by various external factors, genes and growth stages, and this complexity makes it very difficult for common biological research methods to study the GBM growth mechanism. The main schemes adopted in the prior art include the following:

The Cox proportional hazards model is an important survival analysis method. However, the classical Cox model requires a full-rank system of equations to be solved from the survival data, so it cannot process survival data in which the dimension (P) of the model is larger than the number of samples (N), referred to here as P > N type data.

In order to process P >> N type data, models such as CoxLasso, CoxSis and CoxSisLasso have previously been studied and applied to survival data analysis, with some success.

First, a Cox regression model is adopted as the proportional hazards model, as in formula 1:

h(t, x) = h₀(t) exp(βX)    (1)

where h(t, x) is the hazard function at time t, and h₀(t) is the baseline hazard function, i.e. the hazard at time t when all covariates are zero;

β is the regression coefficient of X, which is the quantity to be estimated;
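As an illustration of formula 1, the following minimal sketch evaluates the hazard for one individual and the hazard ratio between two individuals. The function names `cox_hazard` and `hazard_ratio` are hypothetical and chosen only for this example; the baseline hazard is passed in as a plain number because formula 1 leaves h₀(t) unspecified.

```python
import math


def cox_hazard(baseline_hazard, beta, x):
    """Evaluate h(t, x) = h0(t) * exp(beta . x) for one individual.

    baseline_hazard is the value of h0(t) at the time of interest
    (assumed known here purely for illustration).
    """
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_hazard * math.exp(linear_predictor)


def hazard_ratio(beta, x_a, x_b):
    """Ratio h(t, x_a) / h(t, x_b); the baseline hazard cancels out."""
    diff = sum(b * (a - c) for b, a, c in zip(beta, x_a, x_b))
    return math.exp(diff)
```

Because h₀(t) cancels in the ratio, only β has to be estimated in order to compare the risk of two individuals, which is why the screening steps that follow focus on the coefficients.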

The set of related genes screened by the Lasso method (formula 2) is denoted C₀.

Here β is an unknown p-dimensional vector of regression coefficients, and x_i is the vector of potential predictors of the i-th individual. Based on the samples, the solution of formula 2 is an estimator of the unknown coefficients. D is the set of indices of the events, and R_k is the set of indices of the individuals at risk at time t_k. The tuning parameter λ controls the sparsity of the estimator.
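The symbols defined above (D, R_k, λ) match the usual L1-penalized Cox partial likelihood. The sketch below evaluates that standard objective; it is an assumption that formula 2 takes this form, since the formula itself is not reproduced in the text. The function name `cox_lasso_objective` is hypothetical, ties between event times are not treated specially, and no optimizer is included.

```python
import math


def cox_lasso_objective(beta, X, time, event, lam):
    """Negative log partial likelihood plus lam * sum(|beta_j|).

    X     : list of covariate rows, one per individual
    time  : observed time for each individual
    event : 1 if the event was observed, 0 if censored
    lam   : tuning parameter controlling sparsity
    """
    # Linear predictor x_i . beta for every individual.
    eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
    nll = 0.0
    for k, (t_k, d_k) in enumerate(zip(time, event)):
        if not d_k:
            continue  # censored individuals only appear inside risk sets
        # R_k: individuals still at risk at time t_k.
        risk_set = [j for j, t_j in enumerate(time) if t_j >= t_k]
        log_sum = math.log(sum(math.exp(eta[j]) for j in risk_set))
        nll -= eta[k] - log_sum
    return nll + lam * sum(abs(b) for b in beta)
```

A larger λ makes the penalty dominate, so minimizing this objective drives more coefficients to exactly zero; the covariates whose coefficients stay non-zero would form the set C₀.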

For each x_m ∈ G among the screened gene results, the key genes are further selected in conjunction with the Cox model by applying formula 3.

The indices of the selected covariates are denoted by C₀, and each remaining covariate is denoted by x_m.

The genes selected by formula 3 are further processed by the Conditional SIS step (formula 4) to obtain the CoxSisLasso genes.

Here Θ is the augmented set of selected predictors C₀ ∪ C₁, where C₀ contains the indices of the covariates selected by Lasso and C₁ contains the covariates retained by Conditional SIS.

The main defects of the above scheme are as follows:

(1) the Cox proportional hazards model can process P << N type data but cannot process P > N type data;

(2) the methods implemented by CoxLasso, CoxSis and CoxSisLasso are not accurate enough, and the resulting models perform poorly.

Disclosure of Invention

In view of this, the present invention provides a method and a system for high-dimensional low-sample gene data screening and protein network analysis. The method comprises: preprocessing original GBM data to obtain GBM data; performing feature extraction based on the GBM data, and adding interaction terms among the covariates to the screened covariates; performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result; and screening the gene result to obtain the final screened genes. The final screened genes are then used to acquire protein data under conditions of high expression, normal expression and no expression in GBM; the key proteins are analyzed and screened, and the mechanism by which the key genes influence protein pathways is finally obtained. The method optimizes the combination of Cox, SIS and Lasso, greatly improves the accuracy with which the model screens GBM key genes, and determines the signal pathway relation between GBM and the key proteins. Specifically, the method comprises the following steps:

step 1, preprocessing original GBM data to obtain GBM data;

step 2, performing feature extraction based on the GBM data, wherein the feature extraction includes: first screening covariates with the CoxLasso method, then screening the covariates with the Conditional SIS method, screening the covariates with the CoxLasso method again on that basis, and adding interaction terms among the covariates to the screened covariates;

step 3, performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result;

step 4, screening the gene result using a T test to obtain the final screened genes.

Preferably, in step 2, when the interaction terms are added to the screened covariates, each covariate is obtained according to the following formula:

interaction terms among the covariates are then added to obtain new covariate data and interaction term data;

where Θ is the augmented set of selected predictors C₀ ∪ C₁, in which C₀ contains the indices of the covariates selected by Lasso and C₁ contains the covariates retained by Conditional SIS.

Preferably, in step 3, the regression analysis is performed in the following manner:

where X is a p-dimensional covariate, β is indexed by the covariates finally selected through the CoxSisLasso strategy, and X_i is the vector of potential predictors of the i-th individual. D is the set of indices of the events, and R_k is the set of indices of the individuals at risk at time t_k. The tuning parameter λ controls the sparsity of the estimator. Θ is the augmented set of selected predictors C₀ ∪ C₁, where C₀ contains the indices of the covariates selected by Lasso and C₁ contains the covariates retained by Conditional SIS, and the pairwise product terms represent the interaction terms between the selected genes.

Preferably, the method further comprises:

step 5, based on the final screened genes, obtaining the change in content of the related proteins by enhancing the expression of the related gene and by knocking out the expression of the related gene, and calculating the fold change values of the related proteins;

step 6, determining the proteins promoted or inhibited by the related gene in the following manner:

where FC_i is the fold change value between the experimental-group protein ExP_j and the control-group protein CoP_j.

Preferably, the fold change value is obtained by:

where FC_i is the fold change value between the experimental-group protein ExP_j and the control-group protein CoP_j, i is the index of the protein in the two groups, ExP_j and CoP_j are the protein contents determined using RPPA, j is the index of the experiment, and m is the number of replications.

Preferably, the step 6 further comprises obtaining a pathway relation of the proteins based on the determined proteins promoted or inhibited by the related genes, and drawing a pathway map.

In addition, the invention also provides a high-dimensional low-sample gene data screening and protein network analysis system, which comprises:

the data preprocessing module, which preprocesses original GBM data to obtain GBM data; preferably, the original GBM samples are pretreated with RPPA lysis buffer, protein is extracted to prepare an RPPA chip, and the chip is scanned, analyzed and quantified to obtain the GBM data;

the feature selection module, which performs feature extraction based on the GBM data obtained by the data preprocessing module; preferably, the feature extraction includes: first screening covariates with the CoxLasso method, then screening the covariates with the Conditional SIS method, screening the covariates with the CoxLasso method again on that basis, and adding interaction terms to the screened covariates;

the regression analysis module, which performs regression analysis on the screened covariates with the added interaction terms to obtain a gene result;

and the gene screening module, which screens the gene result obtained by the regression analysis module to obtain the final screened genes.

Preferably, the system further comprises:

the protein pathway analysis module, which, based on the final screened genes obtained by the gene screening module, obtains the change in content of the related proteins by enhancing the expression of the related gene and by knocking out the expression of the related gene, calculates the fold change values of the related proteins, determines the proteins promoted or inhibited by the related gene based on the fold change values, and obtains the pathway relation of the proteins based on the proteins so determined.

Preferably, the regression analysis module is executed in the following manner:

where the pairwise product terms represent the interaction terms between the selected genes.

Preferably, in the protein pathway analysis module, the proteins promoted or inhibited by the related gene are determined in the following manner:

where FC_i is the fold change value between the experimental-group protein ExP_j and the control-group protein CoP_j, i is the index of the protein in the two groups, ExP_j and CoP_j are the protein contents determined using RPPA, j is the index of the experiment, and m is the number of replications.

Compared with the prior art, the technical scheme of the invention adapts well to P > N type data, screens key genes with markedly higher accuracy than the traditional CoxLasso, CoxSis and CoxSisLasso model methods, makes it convenient to obtain the unidirectional signal pathways related to the key genes, and greatly promotes research on GBM key genes.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a method of an embodiment of the present invention;

FIG. 2 is a flow chart of the ImpCoxSisLasso model according to an embodiment of the present invention;

FIG. 3 is a Venn diagram of screening results according to an embodiment of the present invention;

FIGS. 4(a), 4(b) are graphs comparing the performance of embodiments of the present invention with other models;

FIG. 5 is a flowchart of protein pathway studies according to an embodiment of the present invention.

Detailed Description

The high-dimensional low-sample gene data screening and protein network analysis method and system according to embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

It will be appreciated by those of skill in the art that the following specific examples or embodiments are a series of presently preferred arrangements of the invention to further explain the principles of the invention, and that such arrangements may be used in conjunction or association with one another, unless it is expressly stated that some or all of the specific examples or embodiments are not in association or association with other examples or embodiments. Meanwhile, the following specific examples or embodiments are only provided as an optimized arrangement mode and are not to be understood as limiting the protection scope of the present invention.

Example 1

In one embodiment, the present invention provides a high-dimensional low-sample gene data screening and protein network analysis method, which searches for key genes in high-dimensional low-sample gene data and studies the identified key genes at the protein level, so as to finally discover the mechanism by which they influence GBM signal pathways. Specifically, the technical scheme of the invention can be realized in the following manner:

First, optimization method combining Cox, SIS and Lasso (ImpCoxSisLasso)

Considering the complexity of the human body as a system, the genes highly related to GBM are unlikely to act completely independently; there are interaction effects between them.

Accordingly, the conventional SIS method can be used for a coarse screening, followed by screening with the Lasso method, to obtain the genes screened in the first step. When the relationship between the screened genes and GBM survival time is judged from the coefficients, the interaction between genes is taken into account: pairwise product interaction terms between the genes are added, and the relationship between each gene term and GBM survival time is tested in the presence of these interactions. The ImpCoxSisLasso method is thus obtained by optimizing the CoxSisLasso strategy through the added interaction product terms, and formula 5 of the strategy is obtained by combining formula 4:

where X is a p-dimensional covariate, β is indexed by the covariates finally selected through the CoxSisLasso strategy, and X_i is the vector of potential predictors of the i-th individual. D is the set of indices of the events, and R_k is the set of indices of the individuals at risk at time t_k. The tuning parameter λ controls the sparsity of the estimator. Θ is the augmented set of selected predictors C₀ ∪ C₁, where C₀ contains the indices of the covariates selected by Lasso and C₁ contains the covariates retained by Conditional SIS, and the pairwise product terms represent the interaction terms between the selected genes.

In a specific embodiment, the execution flowchart of the strategy is shown in FIG. 2, and the strategy may be implemented as follows:

Step 1021: input the data x relating patient gene data to GBM survival time, and preliminarily screen the CoxSisLasso gene covariates x_m according to the known formula 4. This subset of the data is denoted C.

Step 1022: to take the interaction between genes into account during regression, the pairwise product interaction terms between these genes are added to set C.

The regression analysis of formula 5 is then performed on the set with the added cross-product data to obtain the regression coefficient β_C of each gene in the presence of the gene-gene interaction terms.

In the formula, the product terms with i₁ ≠ i₂ and i₁, i₂ ∈ C represent the interactions between the selected genes.
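The construction of the product interaction terms in Step 1022 can be sketched as follows. This is a minimal illustration only: the function name `add_pairwise_interactions`, the list-of-rows data layout and the `"a*b"` naming of the new columns are assumptions made for the example, not part of the described method.

```python
from itertools import combinations


def add_pairwise_interactions(rows, names):
    """Append x_i1 * x_i2 for every unordered pair i1 != i2 of selected genes.

    rows  : list of per-patient expression rows for the genes in set C
    names : gene names, in the same column order as the rows
    """
    pairs = list(combinations(range(len(names)), 2))
    new_names = list(names) + [f"{names[a]}*{names[b]}" for a, b in pairs]
    new_rows = [
        list(row) + [row[a] * row[b] for a, b in pairs]
        for row in rows
    ]
    return new_rows, new_names
```

For |C| selected genes this adds |C|(|C| − 1)/2 columns, so the interaction step is only practical after the preceding screening has reduced the gene set to a small size.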

Step 1023: perform a T test on the gene coefficients β_C obtained in the previous step, and retain the genes with p < 0.05 in the screening result (p denotes the p-value, the basis for judging whether a given gene is significant) to obtain the finally screened key genes. We name this method the ImpCoxSisLasso algorithm.
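The significance filter of Step 1023 can be sketched as below. Note the substitution: the text specifies a T test, whereas this sketch uses a two-sided Wald test with a standard normal reference distribution so that it needs only the standard library. The function names, the standard errors and all numbers in the usage example are hypothetical.

```python
import math


def wald_p_value(coef, std_err):
    """Two-sided p-value for H0: coef = 0 under a normal approximation."""
    z = abs(coef / std_err)
    # 2 * (1 - Phi(z)) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


def screen_significant(coefs, std_errs, alpha=0.05):
    """Return the names whose coefficient has p < alpha."""
    return [
        name
        for name in coefs
        if wald_p_value(coefs[name], std_errs[name]) < alpha
    ]


# Hypothetical coefficients and standard errors, for illustration only.
kept = screen_significant(
    {"AEBP1": 0.9, "GENE_B": 0.1},
    {"AEBP1": 0.3, "GENE_B": 0.2},
)
```

With few samples the normal approximation is optimistic compared with a T distribution, so a real analysis would take the p-values from the fitted model rather than from this shortcut.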

As can be seen from FIG. 4, the ROC curves and AUC values obtained when predicting with the genes screened by the ImpCoxSisLasso method of the invention show markedly higher accuracy in predicting whether a patient has GBM than those obtained with genes screened by the existing CoxLasso, CoxSis and CoxSisLasso methods. This indicates that the relationship between the genes screened by the proposed method and GBM is closer, more reliable and more reasonable.

Second, protein pathway study

After obtaining the key genes related to GBM survival time by the method of the above embodiment, we studied the changes in the proteins whose expression these genes control in GBM, together with the related pathways. The protein pathway study process is described in detail below with reference to FIG. 5.

First, we increased AEBP1 expression in the experimental group, which consisted of the LN229 cell line, and left AEBP1 expression unchanged in the control group, which also consisted of the LN229 cell line. We then performed two RPPA experiments on both groups. The results show the expression of 287 related proteins.

Second, we knocked out AEBP1 in the experimental group, which consisted of the LN229 cell line, using CRISPR-Cas9, and left AEBP1 expression unchanged in the control group, which also consisted of the LN229 cell line. We then performed two RPPA experiments on both groups. The results show the expression of 302 related proteins.

Third, we calculated the fold change between the experimental and control groups of the AEBP1 promotion and inhibition datasets using equation (6).

Here FC_i is the fold change between the experimental group ExP_j and the control group CoP_j, i is the index of the protein in the two groups, ExP_j and CoP_j are the protein contents determined using RPPA, j is the index of the experiment, and m is the number of replications.

Next, we used equation (7) to determine which proteins are promoted or inhibited by AEBP1.

A Judge value greater than 1 indicates an increase, and a value less than or equal to 1 indicates a decrease.
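The fold change and the Judge rule can be sketched as follows. Equation (6) is not reproduced in the text, so the fold change is computed here as the ratio of the mean experimental content to the mean control content over the m replicates, which is an assumption; the function names and the returned labels are likewise hypothetical. The threshold of 1 follows the rule stated above.

```python
def fold_change(exp_values, ctrl_values):
    """Mean experimental content divided by mean control content.

    exp_values / ctrl_values hold the RPPA readings ExP_j and CoP_j
    for one protein across the m replicate experiments.
    """
    if not exp_values or len(exp_values) != len(ctrl_values):
        raise ValueError("need the same non-zero number of replicates")
    m = len(exp_values)
    return (sum(exp_values) / m) / (sum(ctrl_values) / m)


def judge(fc):
    """Greater than 1 means increase; less than or equal to 1 means decrease."""
    return "increase" if fc > 1 else "decrease"
```

A protein whose content rises when AEBP1 is over-expressed and falls when AEBP1 is knocked out would be a candidate for being promoted by AEBP1, and the opposite pattern would suggest inhibition.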

Finally, we compared the experimental and control groups of the AEBP1 promotion and inhibition datasets by T test. The null hypothesis was that the mean expression level of a protein under the experimental condition equals its level under the control condition, and the p-value threshold was 0.05. We calculated the p-values for the AEBP1 promotion dataset and the AEBP1 inhibition dataset separately, and found that 7 proteins were statistically significant in each of the two datasets. The pathway relations between these proteins and the pathogenic factors can then be obtained by consulting existing data, and a pathway map is drawn.

Example 2

In another embodiment of the present invention, the analysis method proposed by the invention can also be carried out by building a system. In a specific implementation, the system can be realized as follows:

the system comprises:

the data preprocessing module, which preprocesses the original GBM data to obtain the GBM data; preferably, the original GBM samples are pretreated with RPPA lysis buffer to extract protein, an Aushon Biosystems 2470 arrayer is used to produce the RPPA chip (applicable antibodies can be looked up at the M.D. Anderson Cancer Center: https:// www.mdanderson.org/research/research-resources/core-resources/functional protocols-RPPA-), and Array-Pro Analyzer software (Media Cybernetics) is used to scan, analyze and quantify the chip to obtain the GBM data;

the feature selection module, which performs feature extraction based on the GBM data obtained by the data preprocessing module, the feature extraction comprising: first screening covariates with the CoxLasso method, then screening the covariates with the Conditional SIS method, screening the covariates with the CoxLasso method again on that basis, and adding interaction terms to the screened covariates;

Preferably, the system further comprises:

Preferably, the regression analysis module is executed in the following manner:

where the pairwise product terms represent the interaction terms between the selected genes.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A high-dimensional low-sample gene data screening and protein network analysis method comprises the following steps:

step 1, preprocessing original GBM (glioblastoma multiforme) sample data to obtain GBM data;

step 2, performing feature extraction based on the preprocessed GBM data, preferably, the feature extraction includes: first screening covariates with the CoxLasso method, then screening the covariates with the Conditional SIS method, screening the covariates with the CoxLasso method again on that basis, and adding interaction terms among the covariates to the screened covariates;

2. The method according to claim 1, wherein in step 2, when the interaction terms are added to the screened covariates, each covariate is obtained according to the following formula:

where Θ is the augmented set of selected predictors C₀ ∪ C₁, C₀ contains the indices of the covariates selected by Lasso, and C₁ contains the covariates retained by Conditional SIS.

3. The method of claim 1, wherein in step 3, the regression analysis is performed in the following manner:

where X is a p-dimensional covariate, β is indexed by the covariates finally selected through the CoxSisLasso strategy, and X_i is the vector of potential predictors of the i-th individual; D is the set of indices of the events, and R_k is the set of indices of the individuals at risk at time t_k; the tuning parameter λ controls the sparsity of the estimator; Θ is the augmented set of selected predictors C₀ ∪ C₁, where C₀ contains the indices of the covariates selected by Lasso and C₁ contains the covariates retained by Conditional SIS, and the pairwise product terms represent the interaction terms between the selected genes.

4. The method of claim 1, further comprising:

5. The method of claim 4, wherein the fold change value is obtained by:

6. The method of claim 4, wherein step 6 further comprises obtaining a pathway relationship of the proteins based on the determined proteins promoted or inhibited by the genes involved, and mapping the pathway relationship.

7. A high-dimensional low-sample gene data screening and protein network analysis system, comprising:

the data preprocessing module is used for preprocessing the original GBM data to obtain the GBM data; preferably, the original GBM sample data is pretreated by RPPA lysis buffer solution, then protein is extracted to prepare an RPPA chip, and chip scanning, analysis and quantification are carried out to obtain GBM data;

8. The system of claim 7, further comprising:

9. The system of claim 7, wherein the regression analysis module performs in the following manner:

where β is indexed by the covariates ultimately selected by the CoxSisLasso strategy,

10. The system of claim 7, wherein the protein pathway analysis module determines the proteins promoted or suppressed by the genes of interest by:

the fold change value is obtained by: