CN111192638A - High-dimensional low-sample gene data screening and protein network analysis method and system - Google Patents

Publication number
CN111192638A
Authority
CN
China
Prior art keywords
gene
protein
screening
covariates
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911424402.3A
Other languages
Chinese (zh)
Other versions
CN111192638B (en
Inventor
章乐
游宇杰
陈渝杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201911424402.3A priority Critical patent/CN111192638B/en
Publication of CN111192638A publication Critical patent/CN111192638A/en
Application granted granted Critical
Publication of CN111192638B publication Critical patent/CN111192638B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20: Protein or domain folding
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B45/00: ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Analytical Chemistry (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a high-dimensional low-sample gene data screening and protein network analysis method and system. The method comprises: preprocessing original GBM data to obtain GBM data; performing feature extraction based on the GBM data and adding interaction terms between covariates to the screened covariates; performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result; and screening the gene result to obtain the finally screened genes. Protein data are then acquired for GBM with the finally screened genes at high expression, normal expression, and no expression, the key proteins are analyzed and screened, and the mechanism by which the key genes influence the protein pathway is obtained. The method optimizes the combination of Cox, SIS and Lasso, greatly improves the accuracy with which the model screens GBM key genes, and determines the signal pathway relationship between GBM and the key proteins.

Description

High-dimensional low-sample gene data screening and protein network analysis method and system
Technical Field
The invention belongs to the field of gene data screening and gene data processing, and particularly relates to a method and a system for finding key genes closely related to glioblastoma multiforme (GBM) in high-dimensional low-sample gene data, acquiring protein expression data after the key genes are knocked out, and performing protein network analysis.
Background
Glioblastoma multiforme (GBM) is a tumor that arises from astrocytoma through malignant transformation. The mechanism of GBM occurrence is a particularly complex biological phenomenon driven by various external factors, genes, and growth stages, and this complexity makes it very difficult for common biological research methods to study the GBM growth mechanism. The main schemes adopted in the prior art are as follows:
The Cox proportional hazards model is an important survival analysis method. However, the classical Cox proportional hazards model requires the survival data to yield a full-rank system of equations, so the traditional Cox model cannot process survival data in which the dimension (P) is much larger than the number of samples (N), referred to here as P >> N type data.
To process P >> N type data, models such as CoxLasso, CoxSis and CoxSisLasso have been studied previously and applied to survival data analysis with some success.
First, the Cox regression model is adopted as the proportional hazards model, as in formula 1:
h(t, x) = h_0(t) exp(βX) (1),
where h(t, x) is the hazard function at time t, and h_0(t) is the baseline hazard function, i.e. the hazard when all covariates are zero;
β is the vector of regression coefficients of X, which is the quantity to be estimated.
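As a small illustration of formula 1, the following sketch evaluates the hazard for one individual and the hazard ratio between two individuals. The function names and the numeric values are hypothetical and are not taken from the invention; they only show how β and the covariates enter the model.

```python
import math

def cox_hazard(baseline_hazard, beta, x):
    """Hazard h(t, x) = h0(t) * exp(beta . x) for one individual at one time point."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_hazard * math.exp(linear_predictor)

def hazard_ratio(beta, x_a, x_b):
    """Ratio h(t, x_a) / h(t, x_b); the baseline hazard h0(t) cancels out."""
    diff = sum(b * (a - c) for b, a, c in zip(beta, x_a, x_b))
    return math.exp(diff)

# Hypothetical coefficients for two covariates (illustrative values only).
beta = [0.5, -0.2]
print(cox_hazard(0.1, beta, [1.0, 0.0]))
print(hazard_ratio(beta, [1.0, 0.0], [0.0, 0.0]))
```

Because the baseline hazard cancels in the ratio, only β needs to be estimated to compare individuals, which is why the screening steps that follow concentrate on the coefficients.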
The related genes screened by the Lasso method (formula 2) are denoted C_0:
Figure BDA0002352380810000021
where β is the unknown p-dimensional vector of regression coefficients and x_i is the vector of potential predictors for the i-th individual. Based on the samples, β̂ is the estimator of the unknown coefficients. D is the set of indices of observed events, and R_k is the set of indices of individuals at risk at time t_k. The tuning parameter λ controls the sparsity of the estimator.
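The image of formula 2 is not reproduced in this text. As a hedged illustration only, the sketch below evaluates the standard L1-penalized negative Cox log partial likelihood using the symbols defined here (the event index set D, the risk sets R_k, and the tuning parameter λ). The function name, the data layout, the simple handling of ties, and the absence of any 1/N scaling are assumptions, so the exact form in the original formula may differ.

```python
import math

def penalized_neg_log_partial_likelihood(beta, X, times, events, lam):
    """Negative Cox log partial likelihood plus an L1 (Lasso) penalty.

    X: list of covariate vectors x_i; times: observed times t_i;
    events: 1 if the event was observed, 0 if censored; lam: tuning parameter.
    Ties are handled in the simplest (Breslow-style) way.
    """
    eta = [sum(b * v for b, v in zip(beta, x)) for x in X]
    log_lik = 0.0
    for k, (t_k, d_k) in enumerate(zip(times, events)):
        if not d_k:
            continue  # only indices in D (observed events) contribute
        # R_k: individuals still at risk at time t_k
        risk = [math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_k]
        log_lik += eta[k] - math.log(sum(risk))
    return -log_lik + lam * sum(abs(b) for b in beta)
```

Minimizing this objective over β for a given λ drives some coefficients exactly to zero; the covariates with non-zero coefficients would form the set C_0.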
For each x_m ∈ G in the screened gene results and
Figure BDA0002352380810000023
the key genes are further selected with the Cox model by applying formula 3.
Figure BDA0002352380810000024
The indices of the selected covariates are denoted by C_0, and the remaining covariates are denoted by x_m.
The genes selected by formula 3 are further processed by the conditional SIS formula 4 to obtain the CoxSisLasso genes.
Figure BDA0002352380810000025
where Θ = C_0 ∪ C_1 is the augmented set of selected predictors, C_0 is the index set of the covariates selected by Lasso, and C_1 is the index set of the covariates retained by conditional SIS.
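The images of formulas 3 and 4 are not reproduced in this text, so the following is only a rough sketch of how a conditional screening step could build C_1 and Θ = C_0 ∪ C_1. It assumes that each covariate outside C_0 is scored by the absolute value of its coefficient in a Cox fit on C_0 plus that covariate, and that the highest-scoring covariates are kept. The callback `fit_coefficient`, the parameter `keep`, and the gene names are hypothetical.

```python
def conditional_screen(candidates, c0, fit_coefficient, keep):
    """Rank covariates outside C0 by |beta_m| from a Cox fit on C0 plus x_m.

    fit_coefficient(c0, m) is a user-supplied (hypothetical) callback that fits
    a Cox model on the covariates in c0 together with covariate m and returns
    the fitted coefficient of m. The `keep` largest |beta_m| form C1.
    """
    scores = {m: abs(fit_coefficient(c0, m)) for m in candidates if m not in c0}
    ranked = sorted(scores, key=scores.get, reverse=True)
    c1 = ranked[:keep]
    theta = sorted(set(c0) | set(c1))  # augmented set C0 ∪ C1
    return c1, theta

# Usage with a stand-in fitter that looks up made-up coefficients.
fake = {"g3": 0.1, "g4": -0.9, "g5": 0.4}
c1, theta = conditional_screen(["g1", "g3", "g4", "g5"], ["g1", "g2"],
                               lambda c0, m: fake[m], keep=2)
print(c1, theta)
```

In practice the callback would wrap whichever Cox fitting routine is available; the ranking and union logic stay the same.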
The main defects of the above scheme are as follows:
(1) the Cox proportional hazards model can process P << N type data but cannot process P >> N type data;
(2) the methods implemented by CoxLasso, CoxSis and CoxSisLasso are not accurate enough, and the models perform poorly.
Disclosure of Invention
In view of this, the present invention provides a high-dimensional low-sample gene data screening and protein network analysis method and system. The method comprises: preprocessing original GBM data to obtain GBM data; performing feature extraction based on the GBM data and adding interaction terms between covariates to the screened covariates; performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result; and screening the gene result to obtain the finally screened genes. Protein data are then acquired for GBM with the finally screened genes at high expression, normal expression, and no expression, the key proteins are analyzed and screened, and the mechanism by which the key genes influence the protein pathway is obtained. The method optimizes the combination of Cox, SIS and Lasso, greatly improves the accuracy with which the model screens GBM key genes, and determines the signal pathway relationship between GBM and the key proteins. Specifically, the method comprises the following steps:
step 1, preprocessing original GBM data to obtain GBM data;
step 2, performing feature extraction based on the GBM data, wherein the feature extraction includes: first screening covariates with the CoxLasso method, then screening covariates with the Conditional SIS method, screening the covariates again with the CoxLasso method on that basis, and adding interaction terms between covariates to the screened covariates;
step 3, performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result;
step 4, screening the gene result with a t-test to obtain the finally screened genes.
Preferably, in step 2, when the interaction terms are added to the screened covariates, each covariate is obtained according to the following formula:
Figure BDA0002352380810000031
interaction terms between the covariates are then added to obtain the new covariate data and interaction term data;
where Θ = C_0 ∪ C_1 is the augmented set of selected predictors, C_0 is the index set of the covariates selected by Lasso, and C_1 is the index set of the covariates retained by conditional SIS.
Preferably, in step 3, the regression analysis is performed in the following manner:
Figure BDA0002352380810000032
where X is the p-dimensional covariate vector, β is indexed by the covariates finally selected through the CoxSisLasso strategy, and x_i is the vector of potential predictors for the i-th individual. D is the set of indices of observed events, and R_k is the set of indices of individuals at risk at time t_k. The tuning parameter λ controls the sparsity of the estimator. Θ = C_0 ∪ C_1 is the augmented set of selected predictors, where C_0 is the index set of the covariates selected by Lasso and C_1 is the index set of the covariates retained by conditional SIS, and
Figure BDA0002352380810000043
represents the interaction terms between the selected genes.
Preferably, the method further comprises:
step 5, based on the finally screened genes, obtaining the change in content of the related proteins by enhancing the expression of the related gene and by knocking out the related gene, and calculating the fold change value of each related protein;
step 6, determining the protein promoted or inhibited by the related gene by the following method:
Figure BDA0002352380810000041
where FC_i is the fold change value between the experimental-group protein content ExP_j and the control-group protein content CoP_j.
Preferably, the fold change value is obtained by:
Figure BDA0002352380810000042
where FC_i is the fold change value between the experimental-group protein content ExP_j and the control-group protein content CoP_j, i is the index of the protein in the two groups, ExP_j and CoP_j are the protein contents determined using RPPA, j is the index of the experiment, and m is the number of replications.
Preferably, the step 6 further comprises obtaining a pathway relation of the proteins based on the determined proteins promoted or inhibited by the related genes, and drawing a pathway map.
In addition, the invention also provides a high-dimensional low-sample gene data screening and protein network analysis system, which comprises:
the data preprocessing module is used for preprocessing original GBM data to obtain GBM data, preferably, extracting protein to prepare an RPPA chip after preprocessing original GBM sample data by an RPPA lysis buffer, and scanning, analyzing and quantifying the chip to obtain the GBM data;
the feature selection module, which performs feature extraction based on the GBM data obtained by the data preprocessing module, wherein preferably the feature extraction includes: first screening covariates with the CoxLasso method, then screening covariates with the Conditional SIS method, screening the covariates again with the CoxLasso method on that basis, and adding interaction terms to the screened covariates;
the regression analysis module, which is used for performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result;
and the gene screening module is used for screening the gene result obtained by the regression analysis module to obtain the final screening gene.
Preferably, the system further comprises:
and the protein pathway analysis module, which is used for obtaining, based on the finally screened genes obtained by the gene screening module, the change in content of the related proteins by enhancing the expression of the related gene and by knocking out the related gene, calculating the fold change value of each related protein, determining the proteins promoted or inhibited by the related gene based on the fold change value, and obtaining the pathway relationship of the proteins based on the determined proteins promoted or inhibited by the related gene.
Preferably, the regression analysis module is executed in the following manner:
Figure BDA0002352380810000051
where X is the p-dimensional covariate vector, β is indexed by the covariates finally selected through the CoxSisLasso strategy, and x_i is the vector of potential predictors for the i-th individual. D is the set of indices of observed events, and R_k is the set of indices of individuals at risk at time t_k. The tuning parameter λ controls the sparsity of the estimator. Θ = C_0 ∪ C_1 is the augmented set of selected predictors, where C_0 is the index set of the covariates selected by Lasso and C_1 is the index set of the covariates retained by conditional SIS, and
Figure BDA0002352380810000061
represents the interaction terms between the selected genes.
Preferably, in the protein pathway analysis module, the means for determining the protein promoted or inhibited by the gene of interest is:
Figure BDA0002352380810000062
Figure BDA0002352380810000063
where FC_i is the fold change value between the experimental-group protein content ExP_j and the control-group protein content CoP_j, i is the index of the protein in the two groups, ExP_j and CoP_j are the protein contents determined using RPPA, j is the index of the experiment, and m is the number of replications.
Compared with the prior art, the technical scheme of the invention adapts well to P >> N type data, achieves markedly higher accuracy in key gene screening than the traditional CoxLasso, CoxSis and CoxSisLasso methods, makes it convenient to obtain the one-way information pathway related to the key genes, and greatly promotes research on GBM key genes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a flow chart of the ImpCoxSisLasso model according to an embodiment of the present invention;
FIG. 3 is a Venn diagram of screening results according to an embodiment of the present invention;
FIGS. 4(a), 4(b) are graphs comparing the performance of embodiments of the present invention with other models;
FIG. 5 is a flowchart of protein pathway studies according to an embodiment of the present invention.
Detailed Description
The high-dimensional low-sample gene data screening and protein network analysis method and system according to embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It will be appreciated by those of skill in the art that the following specific examples or embodiments are a series of presently preferred arrangements of the invention to further explain the principles of the invention, and that such arrangements may be used in conjunction or association with one another, unless it is expressly stated that some or all of the specific examples or embodiments are not in association or association with other examples or embodiments. Meanwhile, the following specific examples or embodiments are only provided as an optimized arrangement mode and are not to be understood as limiting the protection scope of the present invention.
Example 1
In one embodiment, the present invention provides a high-dimensional low-sample gene data screening and protein network analysis method, which searches for key genes in high-dimensional low-sample gene data and studies the found key genes at the protein level, so as to finally discover the signal pathway mechanism that affects GBM. Specifically, the technical scheme of the invention can be realized as follows:
First, the optimized method combining Cox, SIS and Lasso (ImpCoxSisLasso)
Considering the complexity of the human body, we consider that the genes highly related to GBM do not act completely independently and that interaction effects exist between them.
Accordingly, the conventional SIS method can be used for coarse screening, followed by screening with the Lasso method, to obtain the genes screened in the first stage. When the relationship between the screened genes and GBM survival time is then judged from the coefficients, gene interactions are taken into account: pairwise product interaction terms between the genes are added, and the relationship between each gene term and GBM survival time is tested in the presence of these interactions. The ImpCoxSisLasso method is thus obtained by optimizing the CoxSisLasso strategy through the added product interaction terms, and formula 5 of the strategy is obtained by combining formula 4:
Figure BDA0002352380810000081
Figure BDA0002352380810000082
where X is the p-dimensional covariate vector, β is indexed by the covariates finally selected through the CoxSisLasso strategy, and x_i is the vector of potential predictors for the i-th individual. D is the set of indices of observed events, and R_k is the set of indices of individuals at risk at time t_k. The tuning parameter λ controls the sparsity of the estimator. Θ = C_0 ∪ C_1 is the augmented set of selected predictors, where C_0 is the index set of the covariates selected by Lasso and C_1 is the index set of the covariates retained by conditional SIS, and
Figure BDA0002352380810000083
represents the interaction terms between the selected genes.
In a specific embodiment, the execution flow of the strategy is shown in FIG. 2 and may be implemented as follows:
Step 1021, inputting the data x relating patient gene data to GBM survival time, and preliminarily screening the covariates x_m of the CoxSisLasso genes according to the known formula 4; this partial data set is denoted C.
Figure BDA0002352380810000084
Step 1022, to account for interactions between genes when performing the regression, the pairwise product interaction terms between the genes are added to the set C:
Figure BDA0002352380810000085
Then regression analysis formula 5 is executed on the set with the added cross-product data to obtain the regression coefficient β_C of each gene in the presence of the gene-gene interaction terms,
Figure BDA0002352380810000091
Figure BDA0002352380810000092
In the formula,
Figure BDA0002352380810000093
where i_1 ≠ i_2 and i_1, i_2 ∈ C, represents the interaction terms between the selected genes.
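The construction of the pairwise product interaction terms in step 1022 can be sketched as follows. The data layout (one dictionary of expression values per patient), the column naming scheme, and the gene names are assumptions made for illustration.

```python
from itertools import combinations

def add_pairwise_interactions(rows, genes):
    """Append the product x_i1 * x_i2 for every unordered pair of selected genes.

    rows: list of dicts mapping gene name -> expression value (one dict per patient).
    genes: names of the genes in the screened set C.
    """
    pairs = list(combinations(genes, 2))  # i1 != i2, each pair once
    names = list(genes) + [f"{a}*{b}" for a, b in pairs]
    augmented = []
    for row in rows:
        new_row = {g: row[g] for g in genes}
        for a, b in pairs:
            new_row[f"{a}*{b}"] = row[a] * row[b]
        augmented.append(new_row)
    return names, augmented

# Hypothetical gene names and values for a single patient.
names, data = add_pairwise_interactions([{"A": 2.0, "B": 3.0, "C": 0.5}], ["A", "B", "C"])
print(names)
```

For a screened set of size |C| this adds |C|(|C| - 1)/2 columns, which stays manageable only because the screening stage has already reduced the number of genes.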
Step 1023, on the gene coefficients obtained in the previous step, β_C,
Figure BDA0002352380810000094
a t-test is performed, and the genes with p < 0.05 in the screening result are taken as the finally screened key genes (p denotes the p-value, the basis for judging whether a gene is significant). We name this method the ImpCoxSisLasso algorithm.
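The significance filter of step 1023 can be sketched as below. The text does not state how the test statistic is formed, so this sketch assumes a Wald-type statistic (coefficient divided by its standard error) and uses a large-sample normal approximation for the two-sided p-value rather than an exact t distribution. The function name and the input dictionaries are hypothetical.

```python
import math

def significant_genes(coefficients, std_errors, alpha=0.05):
    """Keep terms whose coefficient has a two-sided p-value below alpha.

    Uses the statistic beta / se with a normal approximation to its
    distribution (an assumption; the source only says "T test").
    """
    selected = {}
    for name, beta in coefficients.items():
        z = beta / std_errors[name]
        p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
        if p_value < alpha:
            selected[name] = p_value
    return selected

# Illustrative coefficients and standard errors (made-up values).
print(significant_genes({"A": 1.0, "B": 0.1}, {"A": 0.25, "B": 0.2}))
```

With few samples an exact t distribution with the appropriate degrees of freedom would give larger p-values than this approximation, so the threshold of 0.05 is applied here only in an approximate sense.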
As can be seen from FIG. 4, the ROC curves and AUC values obtained when predicting with the genes screened by the ImpCoxSisLasso method show greatly improved accuracy in predicting whether a patient has GBM compared with those obtained with the existing CoxLasso, CoxSis and CoxSisLasso methods, which indicates that the relationship between the genes screened by the proposed method and GBM is closer, more reliable, and more reasonable.
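For readers who wish to reproduce this kind of comparison, the AUC can be computed directly from predicted scores and true labels. The sketch below uses the pairwise (Mann-Whitney) formulation; the function name and the example values are illustrative and are not the data behind FIG. 4.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for positive cases, 0 for negative cases; scores: predicted risk.
    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores for four cases.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the models in the comparison can be ordered by this single number.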
Second, protein pathway study
After obtaining the key genes related to GBM survival time by the method of the above embodiment, we studied the changes in the expressed proteins controlled by these genes in GBM and the related pathways. The protein pathway study process is described in detail below with reference to FIG. 5.
First, we increased AEBP1 expression in the experimental group, which consisted of the LN229 cell line, and retained AEBP1 expression in the control group, which also consisted of the LN229 cell line. We then performed two RPPA experiments on both groups, which yielded the expression of 287 related proteins.
Second, we knocked out AEBP1 in the experimental group consisting of the LN229 cell line using CRISPR-Cas9, and retained AEBP1 expression in the control group consisting of the LN229 cell line. We then performed two RPPA experiments on both groups, which yielded the expression of 302 related proteins.
Third, we calculated the fold change between the experimental and control groups of the AEBP1 promotion and inhibition datasets using equation (6).
Figure BDA0002352380810000101
Here FC_i is the fold change between the experimental-group content ExP_j and the control-group content CoP_j, and i is the index of the protein in the two groups. ExP_j and CoP_j are the protein contents determined using RPPA, j is the index of the experiment, and m is the number of replications.
Next, we used equation (7) to determine which proteins are promoted or inhibited by AEBP1.
Figure BDA0002352380810000102
Judge values greater than 1 indicate an increase, and values less than or equal to 1 indicate a decrease.
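The images of equations (6) and (7) are not reproduced in this text. The sketch below therefore assumes that the fold change is the ratio of the mean experimental content to the mean control content over the m replicate experiments, measured on a linear positive scale, and applies the stated rule that values greater than 1 indicate an increase. The function names and the sample values are hypothetical.

```python
def fold_change(exp_values, ctrl_values):
    """Ratio of mean experimental content to mean control content over m replicates."""
    if len(exp_values) != len(ctrl_values) or not exp_values:
        raise ValueError("both groups need the same non-zero number of replicates")
    m = len(exp_values)
    return (sum(exp_values) / m) / (sum(ctrl_values) / m)

def judge(fc):
    """Return 'increase' if the fold change exceeds 1, otherwise 'decrease'."""
    return "increase" if fc > 1 else "decrease"

# Two replicate measurements per group (illustrative numbers only).
fc = fold_change([2.0, 4.0], [1.0, 2.0])
print(fc, judge(fc))
```

If the RPPA readout were log-transformed, a difference of means would take the place of the ratio and the threshold would be 0 rather than 1.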
Finally, we compared the experimental and control groups of the AEBP1 promotion and inhibition datasets with a t-test. The null hypothesis was that the average expression level of a protein under the experimental condition equals its level under the control condition, and the p-value threshold was 0.05. We calculated the p-values for the AEBP1 promotion dataset and the AEBP1 inhibition dataset separately and found that 7 proteins were statistically significant in each of the two datasets. The pathway relationships between these proteins and the pathogenic factors can then be obtained by consulting existing data, and a pathway map is drawn.
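A two-sample comparison of this kind can be sketched with the standard library alone. The text does not say which variant of the t-test was used, so the sketch assumes Student's two-sample test with pooled variance and obtains the two-sided p-value by numerically integrating the t density with Simpson's rule. All function names and sample values are hypothetical.

```python
import math
from statistics import mean, variance

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value, integrating the density on [0, |t|] with Simpson's rule."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def pooled_t_test(exp_values, ctrl_values):
    """Student's two-sample t-test with pooled variance; returns (t, p)."""
    n1, n2 = len(exp_values), len(ctrl_values)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(exp_values) + (n2 - 1) * variance(ctrl_values)) / df
    t_stat = (mean(exp_values) - mean(ctrl_values)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t_stat, two_sided_p(t_stat, df)

# Two replicates per group, as in the two RPPA experiments (made-up values).
t_stat, p_value = pooled_t_test([5.0, 7.0], [1.0, 3.0])
print(t_stat, p_value, p_value < 0.05)
```

With only two replicates per group the test has two degrees of freedom, so a large difference in means is needed before the p-value falls below 0.05; the sketch also does not guard against a pooled variance of zero.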
Example 2
In another embodiment of the present invention, the analysis method proposed by the present invention can be carried out by a system, which in a specific implementation can be realized as follows:
the system comprises:
the data preprocessing module, which is used for preprocessing the original GBM data to obtain the GBM data; preferably, the original GBM sample data is pretreated with RPPA lysis buffer to extract protein, an Aushon Biosystems 2470 arrayer is used to make the RPPA chip (applicable antibodies can be looked up at the M.D. Anderson Cancer Center: https:// www.mdanderson.org/research/research-resources/core-resources/functional protocols-RPPA-), and Array-Pro Analyzer software (Media Cybernetics) is used to scan, analyze and quantify the chip to obtain the GBM data;
the feature selection module, which is used for performing feature extraction based on the GBM data obtained by the data preprocessing module, wherein the feature extraction includes: first screening covariates with the CoxLasso method, then screening covariates with the Conditional SIS method, screening the covariates again with the CoxLasso method on that basis, and adding interaction terms to the screened covariates;
the regression analysis module, which is used for performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result;
and the gene screening module is used for screening the gene result obtained by the regression analysis module to obtain the final screening gene.
Preferably, the system further comprises:
and the protein pathway analysis module, which is used for obtaining, based on the finally screened genes obtained by the gene screening module, the change in content of the related proteins by enhancing the expression of the related gene and by knocking out the related gene, calculating the fold change value of each related protein, determining the proteins promoted or inhibited by the related gene based on the fold change value, and obtaining the pathway relationship of the proteins based on the determined proteins promoted or inhibited by the related gene.
Preferably, the regression analysis module is executed in the following manner:
Figure BDA0002352380810000121
where X is the p-dimensional covariate vector, β is indexed by the covariates finally selected through the CoxSisLasso strategy, and x_i is the vector of potential predictors for the i-th individual. D is the set of indices of observed events, and R_k is the set of indices of individuals at risk at time t_k. The tuning parameter λ controls the sparsity of the estimator. Θ = C_0 ∪ C_1 is the augmented set of selected predictors, where C_0 is the index set of the covariates selected by Lasso and C_1 is the index set of the covariates retained by conditional SIS, and
Figure BDA0002352380810000122
represents the interaction terms between the selected genes.
Preferably, in the protein pathway analysis module, the means for determining the protein promoted or inhibited by the gene of interest is:
Figure BDA0002352380810000123
Figure BDA0002352380810000124
where FC_i is the fold change value between the experimental-group protein content ExP_j and the control-group protein content CoP_j, i is the index of the protein in the two groups, ExP_j and CoP_j are the protein contents determined using RPPA, j is the index of the experiment, and m is the number of replications.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A high-dimensional low-sample gene data screening and protein network analysis method comprises the following steps:
step 1, preprocessing original GBM (glioblastoma multiforme) sample data to obtain GBM data;
step 2, performing feature extraction based on the preprocessed GBM data, preferably, the feature extraction includes: firstly, screening covariates by using a CoxLasso method, then screening the covariates by using a Conditional SIS method, screening the covariates by using the CoxLasso method again on the basis, and adding interaction items among the covariates in the screened covariates;
step 3, performing regression analysis on the screened covariates with the added interaction terms to obtain a gene result;
step 4, screening the gene result with a t-test to obtain the finally screened genes.
2. The method according to claim 1, wherein in step 2, when the interaction terms are added to the screened covariates, each covariate is obtained according to the following formula:
Figure FDA0002352380800000011
interaction terms between the covariates are then added to obtain the new covariate data and interaction term data;
where Θ = C_0 ∪ C_1 is the augmented set of selected predictors, C_0 is the index set of the covariates selected by Lasso, and C_1 is the index set of the covariates retained by conditional SIS.
3. The method of claim 1, wherein in step 3, the regression analysis is performed in the following manner:
Figure FDA0002352380800000012
where X is the p-dimensional covariate vector, β is indexed by the covariates finally selected through the CoxSisLasso strategy, and x_i is the vector of potential predictors for the i-th individual; D is the set of indices of observed events, and R_k is the set of indices of individuals at risk at time t_k; the tuning parameter λ controls the sparsity of the estimator; Θ = C_0 ∪ C_1 is the augmented set of selected predictors, where C_0 is the index set of the covariates selected by Lasso and C_1 is the index set of the covariates retained by conditional SIS, and
Figure FDA0002352380800000021
represents the interaction terms between the selected genes.
4. The method of claim 1, further comprising:
step 5, based on the finally screened genes, obtaining the change in content of the related proteins by enhancing the expression of the related gene and by knocking out the related gene, and calculating the fold change value of each related protein;
step 6, determining the protein promoted or inhibited by the related gene by the following method:
Figure FDA0002352380800000022
where FC_i is the fold change value between the experimental-group protein content ExP_j and the control-group protein content CoP_j.
5. The method of claim 4, wherein the fold change value is obtained by:
Figure FDA0002352380800000023
where FC_i is the fold change between the experimental-group protein ExP_j and the control-group protein CoP_j, i is the index of the protein in the two groups, ExP_j and CoP_j are the protein contents measured using RPPA, j is the index of the experiment, and m is the number of replicates.
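The fold change calculation of claim 5 and the promoted/inhibited decision of claim 4 can be illustrated with a short sketch. Both formulas are available here only as figure references, so two assumptions are made and should be read as such: the fold change is taken to be the mean over the m replicates of the ratio ExP_j / CoP_j, and the decision is taken to be a comparison against two cut-offs. The function names and the `upper` and `lower` threshold parameters are hypothetical; the patent's actual cut-offs are not reproduced.

```python
def fold_change(exp_values, ctrl_values):
    """Mean ratio of experimental to control protein content.

    exp_values:  ExP_j for j = 1..m (RPPA readings, experimental group).
    ctrl_values: CoP_j for j = 1..m (RPPA readings, control group).
    Assumption: FC_i = (1/m) * sum_j ExP_j / CoP_j.
    """
    if not exp_values or len(exp_values) != len(ctrl_values):
        raise ValueError("need the same non-zero number of replicates in both groups")
    m = len(exp_values)
    return sum(e / c for e, c in zip(exp_values, ctrl_values)) / m


def classify_protein(fc, upper, lower):
    """Label a protein from its fold change using caller-supplied cut-offs.

    upper and lower are hypothetical thresholds chosen by the analyst.
    """
    if fc > upper:
        return "promoted"
    if fc < lower:
        return "inhibited"
    return "unchanged"


# Usage: two replicates in which the experimental reading is twice the control.
fc = fold_change([2.0, 4.0], [1.0, 2.0])
label = classify_protein(fc, upper=1.5, lower=0.67)
```

Passing the thresholds as arguments keeps the sketch neutral about the cut-offs, so the same code can be reused for the overexpression and the knockout experiments.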
6. The method of claim 4, wherein step 6 further comprises obtaining a pathway relationship of the proteins based on the proteins determined to be promoted or inhibited by the related genes, and drawing a map of the pathway relationship.
7. A high-dimensional low-sample gene data screening and protein network analysis system, comprising:
a data preprocessing module, configured to preprocess original GBM data to obtain GBM data; preferably, the original GBM samples are pretreated with RPPA lysis buffer, protein is then extracted to prepare an RPPA chip, and the chip is scanned, analyzed and quantified to obtain the GBM data;
a feature selection module, configured to perform feature extraction on the GBM data obtained by the data preprocessing module; preferably, the feature extraction comprises: first screening covariates using the CoxLasso method, then screening the covariates using the Conditional SIS method, then screening the covariates again using the CoxLasso method, and adding interaction terms to the screened covariates;
a regression analysis module, configured to perform regression analysis on the screened covariates with the added interaction terms to obtain a gene result;
and a gene screening module, configured to screen the gene result obtained by the regression analysis module to obtain the final screened genes.
8. The system of claim 7, further comprising:
a protein pathway analysis module, configured to obtain, based on the final screened genes obtained by the gene screening module, the change in content of the related proteins by enhancing the expression of the related genes and by knocking out the related genes, to calculate the fold change values of the related proteins, to determine the proteins promoted or inhibited by the related genes based on the fold change values, and to obtain the pathway relationship of the proteins based on the proteins so determined.
9. The system of claim 7, wherein the regression analysis module performs the regression analysis in the following manner:
Figure FDA0002352380800000031
where β denotes the index of the covariates finally selected by the CoxSisLasso strategy, and
Figure FDA0002352380800000032
where i_1 ≠ i_2 and i_1, i_2 ∈ C, represents the interaction terms between the selected genes.
10. The system of claim 7, wherein the protein pathway analysis module determines the proteins promoted or inhibited by the related genes by:
Figure FDA0002352380800000033
the fold change value is obtained by:
Figure FDA0002352380800000041
where FC_i is the fold change between the experimental-group protein ExP_j and the control-group protein CoP_j, i is the index of the protein in the two groups, ExP_j and CoP_j are the protein contents measured using RPPA, j is the index of the experiment, and m is the number of replicates.
CN201911424402.3A 2019-12-31 2019-12-31 High-dimensional low-sample gene data screening and protein network analysis method and system Active CN111192638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911424402.3A CN111192638B (en) 2019-12-31 2019-12-31 High-dimensional low-sample gene data screening and protein network analysis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911424402.3A CN111192638B (en) 2019-12-31 2019-12-31 High-dimensional low-sample gene data screening and protein network analysis method and system

Publications (2)

Publication Number Publication Date
CN111192638A (en) 2020-05-22
CN111192638B (en) 2022-04-22

Family

ID=70710599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911424402.3A Active CN111192638B (en) 2019-12-31 2019-12-31 High-dimensional low-sample gene data screening and protein network analysis method and system

Country Status (1)

Country Link
CN (1) CN111192638B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111863120A (en) * 2020-06-28 2020-10-30 深圳晶泰科技有限公司 Drug virtual screening system and method for crystal compound
WO2023097927A1 (en) * 2021-11-30 2023-06-08 周建伟 Prediction system for identifying key heterogeneous molecules that drive tumor metastasis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103500263A (en) * 2013-07-30 2014-01-08 胡膺期 High-dimensional data function selection algorithm based on reverse elimination method and application thereof in medical treatment
US20160224723A1 (en) * 2015-01-29 2016-08-04 The Trustees Of Columbia University In The City Of New York Method for predicting drug response based on genomic and transcriptomic data
CN106326688A (en) * 2016-08-26 2017-01-11 章乐 Selecting method for high-dimension fewer-sample gene, signal channel and related proteins

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
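Yun Li et al.: "Modeling gene-covariate interactions in sparse regression with group structure for genome-wide association studies", 《Stat. Appl. Genet. Mol. Biol.》 *
宋允全: "Adaptive statistical inference for several multivariate statistical models", 《China Doctoral Dissertations Full-text Database, Basic Sciences》 *
李娜: "Research on screening algorithms for pathogenic factors of glioblastoma multiforme", 《China Master's Theses Full-text Database, Medicine and Health Sciences》 *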

Also Published As

Publication number Publication date
CN111192638B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
Tan et al. Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders
Yu et al. Predicting protein-protein interactions in unbalanced data using the primary structure of proteins
CN111192638B (en) High-dimensional low-sample gene data screening and protein network analysis method and system
Galan et al. CHESS enables quantitative comparison of chromatin contact data and automatic feature extraction
Liu et al. Predicting breast cancer recurrence and metastasis risk by integrating color and texture features of histopathological images and machine learning technologies
Komura et al. Universal encoding of pan-cancer histology by deep texture representations
Ding et al. Feature-enhanced graph networks for genetic mutational prediction using histopathological images in colon cancer
Chen et al. An effective feature selection scheme for healthcare data classification using binary particle swarm optimization
Wang et al. Optimization of parallel random forest algorithm based on distance weight
Elsayed et al. Matlab vs. opencv: A comparative study of different machine learning algorithms
CN115510981A (en) Decision tree model feature importance calculation method and device and storage medium
Cadow et al. On the feasibility of deep learning applications using raw mass spectrometry data
CN110010204B (en) Fusion network and multi-scoring strategy based prognostic biomarker identification method
Mallick et al. An integrated Bayesian framework for multi‐omics prediction and classification
Bhanushali et al. WOMEN'S BREAST CANCER PREDICTED USING THE RANDOM FOREST APPROACH AND COMPARISON WITH OTHER METHODS
Long et al. A model population analysis method for variable selection based on mutual information
Sagala et al. Enhanced churn prediction model with boosted trees algorithms in the banking sector
Li et al. Quality control of imbalanced mass spectra from isotopic labeling experiments
CN116779034A (en) miRNA and disease association prediction method, equipment and storage medium
Alfyani Comparison of Naïve Bayes and KNN algorithms to understand hepatitis
Suleman et al. PseU-Pred: An ensemble model for accurate identification of pseudouridine sites
CN114973245A (en) Machine learning-based extracellular vesicle classification method, device, equipment and medium
Figueroa-Silva et al. Machine learning techniques in predicting braf mutation status in cutaneous melanoma from clinical and histopathologic features
CN112309571B (en) Screening method of prognosis quantitative characteristics of digital pathological image
CN112992347A (en) lncRNA-disease associated prediction method and system based on Laplace regularization least square and network projection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant