CN114141306B - Distant metastasis identification method based on gene interaction mode optimization graph representation - Google Patents


Info

Publication number
CN114141306B
Authority
CN
China
Prior art keywords
gene
graph
network
gene expression
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111400313.2A
Other languages
Chinese (zh)
Other versions
CN114141306A (en
Inventor
苏苒
朱莹莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a distant metastasis identification method based on gene interaction pattern optimized graph representation, which comprises: data preprocessing; constructing and processing a protein-protein interaction (PPI) network; dividing training and test sets; constructing a glmGCN model based on gene interaction pattern optimized graph representation; cross-validating the network model using ten folds; and applying the model to the test set. Compared with the prior art, the method predicts tumor metastasis under the GCN framework, and its graph learning layer gives more attention to the neighborhood of the gene-gene relations in the initial graph, thereby obtaining more accurate prediction performance.

Description

Distant metastasis identification method based on gene interaction mode optimization graph representation
Technical Field
The invention belongs to the technical field of bioinformatics, and particularly relates to a distant metastasis identification method based on gene interaction pattern optimized graph representation.
Background
Tumor metastasis refers to the process by which tumor cells spread from the primary site and continue to grow and form tumors at sites other than the primary site by invading lymphatic and blood vessels. Metastasis can be divided into regional metastasis and distant metastasis. In order to improve the cure rate of cancer and reduce the suffering of patients, it is necessary to predict whether cancer patients have metastasis and then select an appropriate treatment strategy.
In recent years, transcriptome data generated by microarray and RNA sequencing techniques have been widely used to explore the molecular properties of metastasis. For example, Robinson et al., in "Integrative clinical genomics of metastatic cancer," performed whole-exome and transcriptome sequencing of adult patients with metastatic solid tumors from different lineages and biopsy sites, providing a clinically relevant molecular landscape of metastatic tumors. Ma et al., in "Proteogenomic characterization and comprehensive integrative genomic analysis of human colorectal cancer liver metastasis," provide a new paradigm for understanding liver metastasis of human colorectal cancer through proteome analysis, whole-exome and transcriptome sequencing, and single nucleotide polymorphism array analysis. Ramaswamy et al., in "A molecular signature of metastasis in primary solid tumors," also determined gene signatures that distinguish metastatic sites from primary sites based on tumor metastasis expression profiles.
However, these studies do not address the direct identification of metastatic samples. In addition, cancer has a complex mechanism in which the interactions between genes play a very important role and should be considered. Biological networks, such as protein-protein interaction (PPI) networks, therefore provide more reliable insights and have been used in many cancer-related studies.
Direct prediction of metastasis has been achieved in a few computational studies, most of which predict cancer metastasis with traditional machine learning algorithms. For example, Zhi et al., in "Support vector machine classifier for prediction of the metastasis of colorectal cancer," integrate multiple transcriptome datasets, screen feature genes, and establish an optimal support vector machine (SVM) model to distinguish metastatic from non-metastatic samples. Zhou et al., in "Machine learning prediction of lymph node metastasis of poorly differentiated-type intramucosal gastric cancer," use seven machine learning algorithms to predict lymph node metastasis of poorly differentiated-type intramucosal gastric cancer.
Since biological networks are graph data with irregular spatial features, the advent of graph-based deep learning techniques has provided opportunities for network-based biological analysis. Rhee et al., in "Hybrid Approach of Relation Network and Localized Graph Convolutional Filtering for Breast Cancer Subtype Classification," indicate that the graph convolutional network (GCN) is a CNN that operates directly on graph data, taking into account both the feature information and the structural information of nodes; it can discover information that is more hidden and complex than general rules, with the representation of each node propagating along edges until a stable equilibrium is reached. Various GCN architectures have recently been proposed, such as the simplified graph convolutional network proposed by Kipf et al. for semi-supervised learning, which operates directly on the graph. However, the graph used by a GCN is either a fixed graph structure, such as an existing biological network with a specific meaning in a specific domain, or a human-constructed graph, such as a k-nearest neighbor graph with a Gaussian kernel, the supervised graph built with a fully connected network proposed by Henaff et al. in "Deep Convolutional Networks on Graph-Structured Data," or the graph learned through a distance metric proposed by Veličković et al. in "Graph Attention Networks." These graphs may not suit the GCN architecture. To overcome this problem, Jiang et al., in "Graph Learning-Convolutional Networks," propose a Graph Learning-Convolutional Network (GLCN) that combines graph learning and graph convolution in a unified network structure. By combining the given labels with the predicted labels and training through a single optimization, the graph structure can be refined and better results are obtained.
Disclosure of Invention
Based on the prior art, the invention provides a distant metastasis identification method based on a gene interaction pattern optimization graph representation, which constructs a gene interaction graph representation and predicts distant metastasis of tumors under a graph convolution network GCN framework.
The invention is realized by the following technical scheme:
A distant metastasis identification method based on gene interaction pattern optimized graph representation, the method comprising the following steps:
Step 1, data preprocessing: preparing a data set comprising gene expression data and clinical information; screening the gene expression data to obtain a set of differentially expressed genes (DEG); and oversampling the minority class by synthesizing minority-class samples from the screened gene expression data with the SMOTE model;
Step 2, constructing and processing a protein-protein interaction (PPI) network, and obtaining an adjacency matrix from the graph relation;
Step 3, dividing the data into a training set and a test set by ten-fold splitting with the StratifiedKFold method, wherein one fold serves as the test set and the remaining nine folds serve as the training set;
Step 4, constructing the glmGCN model based on gene interaction pattern optimized graph representation, which specifically comprises the following processes:
Step 4-1, constructing an embedded graph learning network layer, and obtaining a new graph representation with a single-layer network:
Figure BDA0003365162870000031
where $a = (a_1, a_2, \ldots, a_n)^T$ is a weight vector, $\sigma(\cdot)$ is an activation function, $A$ denotes the adjacency matrix, $g_i$ and $g_j$ denote the gene expression of the i-th and j-th genes respectively,
Figure BDA0003365162870000032
represents the power, p represents the total number of genes;
Step 4-2, constructing a graph convolution network layer, which executes a layer-wise propagation rule based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample; the output of the graph convolution network layer is:
$O_i^{(k+1)} = \sigma(S \cdot O_i^{(k)} \cdot W^{(k)})$, for $i = 1, 2, \ldots, n$
where $k = 0, 1, \ldots, K-1$; $O_i^{(k+1)}$ denotes the output of the (k+1)-th layer; $O_i^{(0)}$ denotes the gene expression of the i-th sample; $\sigma(\cdot)$ denotes the activation function; and $W^{(k)}$ denotes the trainable weight matrix of each graph convolution layer;
Step 4-3, constructing a fully connected network layer to comprehensively extract feature information, including setting a plurality of fully connected layers whose processing yields the final prediction result;
Step 5, training the glmGCN model based on gene interaction pattern optimized graph representation, and evaluating the trained model with accuracy (ACC), specificity (SPE), sensitivity (SEN), area under the curve (AUC) and the receiver operating characteristic (ROC) curve as evaluation indexes; testing on the test set with the trained network weight parameters and the trained network model, averaging the ten-fold prediction results, and repeating the experiment three times to obtain the final test set prediction result.
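The threshold-based evaluation indexes named in step 5 can be illustrated with a minimal sketch. It assumes binary labels (1 = distant metastasis, 0 = none) and averages the per-fold metrics; the function names and the toy predictions are hypothetical and are not taken from the patent, and AUC/ROC are left out because they require predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Return ACC, SEN and SPE for binary labels (1 = metastasis, 0 = none)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / len(y_true),  # accuracy
        "SEN": tp / (tp + fn),           # sensitivity (true positive rate)
        "SPE": tn / (tn + fp),           # specificity (true negative rate)
    }


def average_over_folds(fold_metrics):
    """Average each metric over a list of per-fold metric dictionaries."""
    keys = fold_metrics[0].keys()
    return {k: sum(m[k] for m in fold_metrics) / len(fold_metrics) for k in keys}


# Hypothetical predictions for two folds (a real run would have ten).
fold_a = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 0, 1, 0, 1])
fold_b = binary_metrics([1, 0, 1, 0], [1, 0, 1, 0])
summary = average_over_folds([fold_a, fold_b])
```

Averaging per-fold metrics is one reasonable reading of "averaging the ten-fold prediction results"; averaging predicted probabilities before thresholding would be another.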
Compared with the prior art, the invention embeds a graph learning layer in the network model to learn the optimal graph representation of the gene interaction relations, and predicts tumor metastasis under the GCN framework, which extracts informative high-level features from the constructed irregular graph structure. The graph learning layer gives more attention to the neighborhood of the gene-gene relations in the initial graph, thereby obtaining more accurate prediction performance.
Drawings
FIG. 1 is a general flow chart of the method for identifying distant metastasis based on an optimized representation of the gene interaction pattern according to the present invention;
FIG. 2 is a diagram of an example of a PPI network for PAAD;
FIG. 3 is a framework diagram of the overall implementation of the network;
FIG. 4 is a schematic diagram of an embedded graph learning layer structure;
FIG. 5 is a ROC curve diagram of a glmGCN model of a CESC data set in an embodiment;
FIG. 6 is ROC graph of the glmGCN model of STAD data set in the example;
FIG. 7 is a ROC graph of the glmGCN model of the PAAD data set in the example;
FIG. 8 is a ROC graph of the glmGCN model of the BLCA data set in the example.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
A distant metastasis identification method based on gene interaction pattern optimized graph representation comprises the following steps:
step 1, data preprocessing
A data set is prepared comprising cervical squamous cell carcinoma and cervical adenocarcinoma (CESC) data samples, gastric adenocarcinoma (STAD) data samples, pancreatic adenocarcinoma (PAAD) data samples and bladder urothelial carcinoma (BLCA) data samples (the invention is not limited to these data samples);
Step 1-1, obtaining a CESC sample data set, a STAD sample data set, a PAAD sample data set and a BLCA sample data set, all of which are mRNA and lncRNA data of the TCGA database downloaded from SangerBox; each data set comprises gene expression data and clinical information, and the gene expression data comprise at least gene transcriptome Counts and gene expression FPKM values, wherein the Counts data are used for differential analysis, and the normalized FPKM expression values are used to classify more accurately whether metastasis occurs;
in an embodiment of the invention, there are 309 samples in the CESC sample data set, 407 samples in the STAD sample data set, 182 samples in the PAAD sample data set and 427 samples in the BLCA sample data set, with 19814 mRNAs and 14851 lncRNAs;
step 1-2, processing the clinical information obtained in the step 1-1 to obtain classification labels of all data sets;
in an embodiment of the present invention, the CESC sample data set includes 34 patients with distant metastasis and the remaining 275 patients without distant metastasis (275 + 34 = 309). In the STAD sample data set, 347 patients did not undergo distant metastasis and 60 patients underwent distant metastasis (347 + 60 = 407). The PAAD sample data set includes 122 patients without distant metastasis and 60 patients with distant metastasis (122 + 60 = 182). The BLCA sample data set includes 96 patients with distant metastasis and 331 patients without distant metastasis (331 + 96 = 427);
step 1-3, performing differential gene screening on the gene expression data obtained in the step 1-1 to obtain a differential gene set DEG, and further screening the gene expression data according to the differential gene set DEG, wherein the method specifically comprises the following steps:
The grouping file is made from the sample name suffix: samples with suffix -01 or -06 are labeled as tumor, and samples with suffix -11 are labeled as normal. Differential gene analysis is performed with the edgeR package, using the criteria log2FC > 1 (FC, fold change, is the ratio of expression between two samples or groups) and FDR < 0.05 (FDR, false discovery rate). The Counts of the screened differential genes are then matched with the original FPKM data by sample name to obtain the FPKM data corresponding to the differential genes.
In the embodiment of the present invention, 1515 DEGs (1197 mRNA + 318 lncRNA) are obtained for the CESC sample data set, 4113 DEGs (2903 mRNA + 1219 lncRNA) for the STAD sample data set, 116 DEGs (77 mRNA + 39 lncRNA) for the PAAD sample data set, and 2767 DEGs (2118 mRNA + 649 lncRNA) for the BLCA sample data set;
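The screening criteria of step 1-3 amount to a two-condition filter followed by a lookup into the FPKM table. The sketch below assumes the edgeR results have already been exported as a mapping from gene name to (log2FC, FDR); the gene names, numbers and function name are hypothetical, and the differential analysis itself (done in R with edgeR) is not reproduced.

```python
# Hypothetical edgeR-style results: gene -> (log2 fold change, FDR).
results = {
    "GENE_A": (2.3, 0.001),
    "GENE_B": (0.4, 0.010),   # fold change too small
    "GENE_C": (1.7, 0.200),   # not significant
    "GENE_D": (1.1, 0.030),
}

# Hypothetical FPKM table: gene -> expression values per sample.
fpkm = {
    "GENE_A": [10.2, 3.1, 8.8],
    "GENE_B": [1.0, 1.2, 0.9],
    "GENE_C": [5.5, 4.0, 6.1],
    "GENE_D": [0.3, 2.7, 1.9],
}


def select_deg(results, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep genes with log2FC > lfc_cutoff and FDR < fdr_cutoff."""
    return sorted(
        gene for gene, (lfc, fdr) in results.items()
        if lfc > lfc_cutoff and fdr < fdr_cutoff
    )


deg = select_deg(results)
# Restrict the FPKM data to the differential genes.
fpkm_deg = {gene: fpkm[gene] for gene in deg}
```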
Step 1-4, oversampling the minority class of the gene expression data screened in step 1-3: the minority-class samples are analyzed with the SMOTE (Synthetic Minority Oversampling Technique) model, new samples are artificially synthesized from them and added to the data set, which balances the data set and mitigates overfitting, yielding expression data X with balanced class proportions; the CESC, STAD, PAAD and BLCA sample data sets are each oversampled in this way. The procedure is as follows:
for each minority-class sample x, the Euclidean distance to all samples in the minority-class sample set is computed to obtain the k nearest neighbors of x;
a sampling ratio is set according to the class imbalance ratio to determine the sampling multiplier N, and for each minority-class sample x several samples are randomly selected from its k nearest neighbors; denoting a randomly selected neighbor by $\tilde{x}$, a new sample $x_{new}$ is constructed as:
$x_{new} = x + \mathrm{rand}(0, 1) \times (\tilde{x} - x)$
where $\tilde{x}$ denotes each randomly selected neighbor, and rand(0, 1) denotes a random number between 0 and 1.
In the embodiment of the present invention, the ratio of metastatic to non-metastatic samples is set to 1:1. After oversampling, the CESC, STAD, PAAD and BLCA sample data sets contain 275, 347, 122 and 331 metastatic patients, respectively, and the same numbers of non-metastatic patients. Thus, the total number of patients is 550, 694, 244 and 662, respectively.
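The oversampling of step 1-4 can be sketched with the standard library alone. This is a simplified illustration, not the SMOTE implementation used in the patent: instead of a fixed multiplier N per minority sample, it draws a random minority sample for each synthetic point until the requested number is reached. The function name, the toy data and the seed are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating between minority samples
    and their k nearest minority-class neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Neighbours of x within the minority set, nearest first.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: math.dist(x, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # rand(0, 1)
        # x_new = x + rand(0, 1) * (neighbour - x)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)])
    return synthetic


# Toy example: 4 minority samples, 10 majority samples, target ratio 1:1.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority_count = 10
new_samples = smote(minority, majority_count - len(minority), k=3)
balanced_minority = minority + new_samples
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the region already covered by the minority class.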
Step 2, building and processing PPI
Step 2-1, according to the differential gene screening result obtained in step 1-3, potential interactions among the encoded proteins are retrieved from the STRING database and used to construct the protein-protein interaction (PPI) network; the PPI network represents a graph relationship between genes, from which an adjacency matrix A can be obtained, wherein gene interaction scores measure the strength of the relationship between nodes (genes);
in an embodiment of the invention, the PPI network of CESC comprises a total of 15664 gene interactions involving 1106 genes; the PPI network of STAD comprises 36389 gene interactions involving 2800 genes; the PPI network of PAAD comprises 35 interactions among 33 genes; and the PPI network of BLCA comprises 28183 interactions among 2014 genes.
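Turning a list of scored interactions into the adjacency matrix A of step 2 could look like the sketch below. The gene names, the score values and the function name are hypothetical; the scores are assumed to be already scaled to the range 0 to 1, and interactions involving genes outside the differential gene set are simply skipped.

```python
def build_adjacency(genes, interactions):
    """Build a symmetric p x p adjacency matrix from (gene1, gene2, score) triples."""
    index = {gene: i for i, gene in enumerate(genes)}
    p = len(genes)
    A = [[0.0] * p for _ in range(p)]
    for g1, g2, score in interactions:
        if g1 in index and g2 in index:  # ignore genes outside the DEG set
            i, j = index[g1], index[g2]
            A[i][j] = score
            A[j][i] = score
    return A


genes = ["G1", "G2", "G3"]
interactions = [
    ("G1", "G2", 0.9),
    ("G2", "G3", 0.4),
    ("G1", "GX", 0.7),  # GX is not a differential gene, so it is dropped
]
A = build_adjacency(genes, interactions)
```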
Step 3, dividing the training set and the test set, which specifically comprises the following steps:
Step 3-1, dividing the training set and the test set of the CESC samples:
the minority-oversampled gene expression data X of the CESC sample data set obtained in step 1-4 is divided into ten folds with the StratifiedKFold method; each fold in turn is taken as the test data CESC_Xtest with test labels CESC_Ytest, and the remaining nine folds are taken as the training data CESC_Xtrain with training labels CESC_Ytrain.
In an embodiment of the invention, the gene expression data X of the CESC sample data set covers 1515 gene features of 550 samples; X_test of the CESC sample data set has approximately 55 samples, and X_train has approximately 495 samples;
Step 3-2, dividing the training set and the test set of the STAD samples:
the minority-oversampled gene expression data X of the STAD sample data set obtained in step 1-4 is divided into ten folds with the StratifiedKFold method; each fold in turn is taken as the test data STAD_Xtest with test labels STAD_Ytest, and the remaining nine folds are taken as the training data STAD_Xtrain with training labels STAD_Ytrain;
in an embodiment of the invention, the gene expression data X of the STAD data set covers 4113 gene features of 694 samples; X_test of the STAD data set has approximately 69 samples, and X_train has approximately 625 samples;
Step 3-3, dividing the training set and the test set of the PAAD samples:
the minority-oversampled gene expression data X of the PAAD sample data set obtained in step 1-4 is divided into ten folds with the StratifiedKFold method; each fold in turn is taken as the test data PAAD_Xtest with test labels PAAD_Ytest, and the remaining nine folds are taken as the training data PAAD_Xtrain with training labels PAAD_Ytrain.
In an embodiment of the invention, the gene expression data X of the PAAD data set covers 116 gene features of 244 samples; X_test of the PAAD data set has approximately 24 samples, and X_train has approximately 220 samples;
Step 3-4, dividing the training set and the test set of the BLCA samples:
the minority-oversampled gene expression data X of the BLCA sample data set obtained in step 1-4 is divided into ten folds with the StratifiedKFold method; each fold in turn is taken as the test data BLCA_Xtest with test labels BLCA_Ytest, and the remaining nine folds are taken as the training data BLCA_Xtrain with training labels BLCA_Ytrain;
in an embodiment of the invention, the gene expression data X of the BLCA data set covers 2767 gene features of 662 samples; X_test of the BLCA data set has approximately 66 samples, and X_train has approximately 596 samples;
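The idea behind the stratified ten-fold split of step 3 is that each fold keeps the class ratio of the whole data set. The sketch below is a standard-library illustration of that idea, not the scikit-learn StratifiedKFold implementation the text refers to; the function name and seed are assumptions, and the label vector mimics the oversampled CESC data (275 metastatic, 275 non-metastatic).

```python
import random


def stratified_folds(labels, n_splits=10, seed=0):
    """Assign sample indices to n_splits folds, class by class, round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds


labels = [1] * 275 + [0] * 275  # oversampled CESC-like labels, ratio 1:1
folds = stratified_folds(labels)

# Each fold serves once as the test set; the other nine form the training set.
splits = []
for k in range(len(folds)):
    test_idx = folds[k]
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    splits.append((train_idx, test_idx))
```

With 550 samples, each test fold ends up with 54 or 56 samples, in line with the "approximately 55" test samples and "approximately 495" training samples stated for CESC.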
Step 4, constructing the glmGCN model based on gene interaction pattern optimized graph representation, which specifically comprises the following processes:
Step 4-1, constructing an embedded graph learning network:
An adjacency matrix $A \in R^{p \times p}$ is constructed based on the PPI network. The PPI network does not record the interaction of a gene with itself, so the diagonal elements of the adjacency matrix A indicate no interaction; an identity matrix I is therefore added to A so that each diagonal element indicates a direct interaction of a gene with itself.
Based on the gene expression matrix G and the adjacency matrix A, a nonlinear function $S_{ij} = h(g_i, g_j)$ is defined, where $g_i$ and $g_j$ denote the expression of the i-th and j-th genes. Meanwhile, a projection matrix $P_{pj} \in R^{n \times d}$ is used to project into a low-dimensional space to reduce computational complexity, where d < n. A new graph representation is obtained with a single-layer network as shown in the following equations:
Figure BDA0003365162870000081
Figure BDA0003365162870000091
where $a = (a_1, a_2, \ldots, a_n)^T$ is a weight vector, $\sigma(\cdot)$ is an activation function, $A$ denotes the adjacency matrix, $g_i$ and $g_j$ denote the gene expression of the i-th and j-th genes respectively,
Figure BDA0003365162870000092
represents the power, and p represents the total number of genes.
The weight vector is set as $a = (a_1, a_2, \ldots, a_n)^T \in R^{n \times 1}$, and a sigmoid or tanh function is used as the activation function $\sigma(\cdot)$. Compared with GLCN, the use of the adjacency matrix A is improved: raising A to the power
Figure BDA0003365162870000093
emphasizes the importance of the neighborhood in the initial graph, so that the difference between strong and weak gene interactions in the original PPI network graph becomes larger. The weight vector a and the projection matrix $P_{pj}$ are optimized through the following loss function $L_1$ (Nie F, "Clustering and projected clustering with adaptive neighbors," 2014):
Figure BDA0003365162870000094
where γ and β denote two constants that can be adjusted manually, and F denotes the Frobenius norm.
If, for
Figure BDA0003365162870000095
and
Figure BDA0003365162870000096
the distance
Figure BDA0003365162870000097
is larger, S should be smaller. Conversely, if
Figure BDA0003365162870000098
is smaller, S should be larger. The second term is a regularization term that controls the sparsity of the learned graph S, where γ is the regularization coefficient. The third term integrates the PPI network information into the loss function, which makes the loss function more reasonable.
If there is no initial graph, i.e. only the gene expression matrix is available, the learned graph S can be defined as:
Figure BDA0003365162870000099
in this case, the weight matrix will be optimized by the following loss function:
Figure BDA00033651628700000910
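The equation images of step 4-1 are not reproduced in this text, so the following is only a hedged sketch of how a graph learning layer of this kind could compute S, modeled on the GLCN formulation that the description cites: each entry is an exponentiated, activated score of the absolute expression difference, weighted by the powered adjacency entry and normalized per row by a softmax-style division. The exponent name `alpha` and its value, the use of tanh, the omission of the projection matrix, and the toy data are all assumptions rather than details confirmed by the patent.

```python
import math


def learn_graph(G, A, a, alpha=2.0):
    """Sketch of a graph learning layer.

    G: p rows, one per gene, each holding that gene's expression in n samples.
    A: p x p adjacency matrix (with self-loops on the diagonal).
    a: weight vector of length n.
    Returns S with S[i][j] proportional to
    A[i][j]**alpha * exp(tanh(a . |g_i - g_j|)), normalized over j.
    """
    p = len(G)
    S = []
    for i in range(p):
        row = []
        for j in range(p):
            score = sum(w * abs(u - v) for w, u, v in zip(a, G[i], G[j]))
            row.append((A[i][j] ** alpha) * math.exp(math.tanh(score)))
        total = sum(row)  # positive because the diagonal of A is 1
        S.append([v / total for v in row])
    return S


# Toy data: 3 genes measured in 2 samples; identity already added to A.
G = [[1.0, 2.0], [1.5, 1.0], [3.0, 0.5]]
A = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.4], [0.0, 0.4, 1.0]]
a = [0.5, -0.2]
S = learn_graph(G, A, a, alpha=2.0)
```

Gene pairs with no PPI edge keep a zero entry in S, while the power on A widens the gap between strong and weak interactions before normalization.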
step 4-2, constructing a graph convolution network:
When a new graph representation S is obtained with the single-layer network of the graph learning layer, the final softmax operation of equation (3) or equation (5) in step 4-1 ensures that the learned graph S satisfies the following property:
Figure BDA00033651628700000911
Thus, a graph G(X, S) is defined, and the representation of the graph is learned from the data information X and the adjacency matrix S. In the graph convolution layer, a layer-wise propagation rule is executed based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, i.e.
$O_i^{(k+1)} = \sigma(S \cdot O_i^{(k)} \cdot W^{(k)})$, for $i = 1, 2, \ldots, n$ (8)
where $k = 0, 1, \ldots, K-1$; $O_i^{(k+1)}$ denotes the output of the (k+1)-th layer; $O_i^{(0)}$ denotes the gene expression of the i-th sample; $\sigma(\cdot)$ denotes the activation function; and $W^{(k)}$ denotes the trainable weight matrix of each graph convolution layer. $W^{(0)} \in R^{1 \times h(0)}$ is the input-to-hidden weight matrix of a hidden layer with h(0) feature maps, and $W^{(K)} \in R^{h(K-1) \times C}$ is the hidden-to-output weight matrix of a hidden layer with h(K-1) feature maps, where C is the number of classes (C = 2 in the experiments).
If the output is taken directly from the graph convolution layers after feature extraction, the final perceptron layer is defined as follows:
$Z = \mathrm{softmax}(S \, O_i^{(K)} W^{(K)})$ (9)
where the weight matrix
Figure BDA0003365162870000101
C denotes the number of classes, and the output $Z \in R^{n \times C}$ denotes the label prediction probabilities of the glmGCN model, with $Z_i$ denoting the label prediction probability of the i-th node.
In an embodiment of the invention, for the CESC sample data set, the dimension of $O_i^{(0)}$ is 1515 × 1, there are two graph convolution layers, $W^{(0)}$ is a parameter of dimension 1 × 25 drawn from the uniform distribution given by Xavier initialization, $W^{(1)}$ is a parameter of dimension 25 × 2 drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 1515 × 1515;
in an embodiment of the invention, for the STAD sample data set, the dimension of $O_i^{(0)}$ is 4113 × 1, there are two graph convolution layers, $W^{(0)}$ is a parameter of dimension 1 × 25 drawn from the uniform distribution given by Xavier initialization, $W^{(1)}$ is a parameter of dimension 25 × 2 drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 4113 × 4113;
in an embodiment of the invention, for the PAAD sample data set, the dimension of $O_i^{(0)}$ is 116 × 1, there are two graph convolution layers, $W^{(0)}$ is a parameter of dimension 1 × 25 drawn from the uniform distribution given by Xavier initialization, $W^{(1)}$ is a parameter of dimension 25 × 2 drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 116 × 116;
in an embodiment of the invention, for the BLCA sample data set, the dimension of $O_i^{(0)}$ is 2767 × 1, there are two graph convolution layers, $W^{(0)}$ is a parameter of dimension 1 × 25 drawn from the uniform distribution given by Xavier initialization, $W^{(1)}$ is a parameter of dimension 25 × 2 drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 2767 × 2767;
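The propagation rule (8) and the softmax output (9) can be traced on a toy example for a single sample. The sketch uses plain lists, ReLU as the hidden activation and hand-picked weights; all of these are assumptions for illustration (the embodiment uses Xavier-initialized weights with 25 hidden feature maps). Shapes follow the embodiment's pattern: the input is p × 1, the first weight matrix is 1 × h, and the second is h × C.

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def relu(M):
    return [[max(v, 0.0) for v in row] for row in M]


def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(S, O, weights):
    """Apply O <- relu(S O W) for hidden layers, then softmax(S O W) at the end."""
    for W in weights[:-1]:
        O = relu(matmul(matmul(S, O), W))
    return softmax_rows(matmul(matmul(S, O), weights[-1]))


# Toy learned graph over p = 3 genes (rows sum to 1).
S = [[0.6, 0.4, 0.0], [0.3, 0.5, 0.2], [0.0, 0.3, 0.7]]
O0 = [[1.2], [0.4], [2.0]]          # one sample: p x 1 expression vector
W0 = [[0.5, -0.3]]                  # 1 x h with h = 2
W1 = [[0.2, -0.1], [0.4, 0.3]]      # h x C with C = 2
Z = gcn_forward(S, O0, [W0, W1])
```

In this sketch Z has one row per gene node for the given sample; in the full model the graph convolution features are passed on to the fully connected layers of step 4-3 to produce the per-sample prediction.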
step 4-3, constructing a full connection network layer:
Fully connected (FC) layers are constructed to comprehensively extract feature information. A plurality of network layers are set; weights and biases are initialized in the first layer and forward propagation is performed; after each iteration the network error is computed, and the error and the gradient are fed back to the network model to update the network weights so that the error of subsequent iterations decreases. Processing the information through the layers yields the error and the prediction accuracy of the network model, and the model with the minimum error is selected as the final model.
In the embodiment of the invention, the CESC data set full connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;
in the embodiment of the invention, the full connection layer of the STAD data set is 5 layers, and the number of the neurons is 2048, 1024, 512, 256 and 2 respectively;
in the embodiment of the invention, the PAAD data set full-connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;
in the embodiment of the invention, the BLCA data set full connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;
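The five-layer fully connected stacks listed above can be sketched as a small multilayer perceptron built from a list of layer widths. This is a hedged illustration only: the widths are scaled down (64, 32, 16, 8, 2 in place of 4096, 2048, 1024, 512, 2) so that the plain-Python code runs quickly, and the input size, the ReLU hidden activation, the softmax output and every function name are assumptions that are not specified by the embodiment.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer; Xavier-uniform weights (assumed)."""
    bound = math.sqrt(6.0 / (n_in + n_out))
    w = [[rng.uniform(-bound, bound) for _ in range(n_out)] for _ in range(n_in)]
    return w, [0.0] * n_out

def build_mlp(n_in, widths, rng):
    """Chain dense layers so that each layer's input size is the previous width."""
    layers = []
    for n_out in widths:
        layers.append(init_layer(n_in, n_out, rng))
        n_in = n_out
    return layers

def dense(x, w, b):
    """Compute x . W + b for a single input vector x."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def mlp_forward(x, layers):
    """ReLU on hidden layers, softmax on the final 2-neuron layer (both assumed)."""
    for k, (w, b) in enumerate(layers):
        x = dense(x, w, b)
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return softmax(x)

rng = random.Random(0)
widths = [64, 32, 16, 8, 2]          # scaled-down stand-in for 4096, 2048, 1024, 512, 2
layers = build_mlp(10, widths, rng)  # input size 10 is a toy value
probs = mlp_forward([0.1 * i for i in range(10)], layers)
```

The final layer has two neurons, so `probs` holds two class probabilities (non-metastasis and metastasis in this two-class setting) that sum to 1.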
step 4-4, constructing a loss function:
assume that the predicted label matrix is Z ∈ R^(n×2), each row of which is the predicted label vector of the i-th sample, and that the true label matrix is Y ∈ R^(n×2), each row of which is the true label vector of the i-th sample. Parameter optimization is performed here using a cross-entropy loss function (T denotes the training set):
Figure BDA0003365162870000111
this loss function mainly optimizes the parameters of the network layers from the graph convolution layer to the fully connected layer; all parameters of the whole architecture are optimized with the following overall loss:
L_glmGCN = L_2 + λ·L_1    (11)
in the embodiment of the invention, the value of λ is 0.1 for the CESC sample data set and 0.01 for the STAD, PAAD and BLCA sample data sets;
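How the two loss terms combine can be sketched as follows. This is an illustration under stated assumptions: the cross-entropy is written in its common summed form over the training rows T, it is assumed to be the term L_2 of equation (11), the term L_1 is treated as an already computed number from the graph learning step, and the function and variable names are hypothetical. The exact normalisation used in the method is given by the equation image and is not reproduced here.

```python
import math

def cross_entropy(z, y, train_idx, eps=1e-12):
    """Sum over training rows i in T and classes j of -Y[i][j] * ln(Z[i][j])."""
    total = 0.0
    for i in train_idx:
        for z_ij, y_ij in zip(z[i], y[i]):
            total -= y_ij * math.log(z_ij + eps)  # eps guards against log(0)
    return total

def total_loss(ce_term, graph_term, lam):
    """Overall loss of the form L_2 + lambda * L_1 (term roles are assumed)."""
    return ce_term + lam * graph_term

# Toy predictions and one-hot labels for n = 3 samples and 2 classes.
Z = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
Y = [[1, 0], [0, 1], [0, 1]]
T = [0, 1]  # only rows in the training set contribute to the loss

ce = cross_entropy(Z, Y, T)
loss = total_loss(ce, graph_term=2.0, lam=0.01)  # lam = 0.01 as for STAD, PAAD, BLCA
```

A small λ such as 0.01 keeps the classification term dominant while still letting the second term influence the learned graph.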
step 5, training and evaluating the glmGCN model based on the optimized graph representation of gene interaction patterns, specifically comprising the following steps:
step 5-1, inputting the training set CESC_Xtrain of each fold from step 3-1 into the glmGCN model of step 4, and optimizing by minimizing the loss function of step 4-4 to obtain the network weights W_CESC.
Inputting the training set STAD_Xtrain of each fold from step 3-1 into the glmGCN model of step 4, and optimizing by minimizing the loss function of step 4-4 to obtain the network weights W_STAD.
Inputting the training set PAAD_Xtrain of each fold from step 3-1 into the glmGCN model of step 4, and optimizing by minimizing the loss function of step 4-4 to obtain the network weights W_PAAD.
Inputting the training set BLCA_Xtrain of each fold from step 3-1 into the glmGCN model of step 4, and optimizing by minimizing the loss function of step 4-4 to obtain the network weights W_BLCA.
Step 5-2, evaluating the trained model with accuracy (ACC), specificity (SPE), sensitivity (SEN), F1 score (F1) and AUC as evaluation indexes, and using the ROC curve to show the relation between sensitivity and specificity. The basic definitions are as follows: FN (false negative) is the number of samples predicted as negative that are actually positive; FP (false positive) is the number of samples predicted as positive that are actually negative; TN (true negative) is the number of samples predicted as negative that are actually negative; and TP (true positive) is the number of samples predicted as positive that are actually positive. Precision, calculated as Precision = TP/(TP + FP), is the proportion of samples predicted as positive that are actually positive. Recall, calculated as Recall = TP/(TP + FN), is the proportion of actually positive samples that are predicted as positive.
The evaluation indexes are defined as follows:
Figure BDA0003365162870000121
Figure BDA0003365162870000131
Figure BDA0003365162870000132
Figure BDA0003365162870000133
Sensitivity (SEN), also called the true positive rate (TPR) or recall, is the proportion of actually positive samples that are predicted as positive. Specificity (SPE), also called the true negative rate (TNR), is the proportion of actually negative samples that are predicted as negative. Accuracy (ACC) is a common evaluation index that gives the proportion of correctly predicted samples among all samples; in general, the higher the accuracy, the better the classifier. The F1 score (F1-SCORE) is the harmonic mean of precision and recall and combines the two results; a higher F1 score indicates a more effective method. ROC and AUC are two further indexes for evaluating a classifier: ROC is short for receiver operating characteristic curve, and AUC is short for area under the ROC curve. The ROC curve is plotted from sensitivity and specificity and shows how the two vary together continuously. The AUC value is the area under the ROC curve; it generally lies between 0.5 and 1, and a larger AUC value indicates higher classification accuracy and a better prediction effect.
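The evaluation indexes described above can be computed from the four confusion counts, as in the following minimal sketch. The formulas follow the definitions in the text (SEN = TP/(TP + FN), SPE = TN/(TN + FP), ACC as the proportion of correct predictions, F1 as the harmonic mean of precision and recall). The AUC helper uses the pairwise-ranking form of the area under the ROC curve, which is a standard equivalent and an illustrative choice here; the function names, the toy data and the 0.5 threshold are assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """ACC, SPE, SEN and F1; assumes no denominator is zero."""
    sen = tp / (tp + fn)              # sensitivity = recall = TPR
    spe = tn / (tn + fp)              # specificity = TNR
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sen / (precision + sen)
    return {"ACC": acc, "SPE": spe, "SEN": sen, "F1": f1}

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]   # predicted probability of class 1
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # 0.5 threshold (assumed)

m = metrics(*confusion_counts(y_true, y_pred))
area = auc(y_true, scores)
```

In this toy example TP = 3, TN = 3, FP = 1 and FN = 1, so ACC, SPE, SEN and F1 are all 0.75, and 15 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.9375.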
Step 5-3, using the network weight parameters W_CESC trained on each fold in step 5-1 and the network model of step 4 to test on the test set CESC_Xtest of each fold from step 3-1.
Using the network weight parameters W_STAD trained on each fold in step 5-1 and the network model of step 4 to test on the test set STAD_Xtest of each fold from step 3-1.
Using the network weight parameters W_PAAD trained on each fold in step 5-1 and the network model of step 4 to test on the test set PAAD_Xtest of each fold from step 3-1.
Using the network weight parameters W_BLCA trained on each fold in step 5-1 and the network model of step 4 to test on the test set BLCA_Xtest of each fold from step 3-1.
Step 5-4, averaging the ten-fold prediction results obtained in step 5-3, and repeating the experiment three times to obtain the final test set prediction result. All data sets (CESC, STAD, BLCA, PAAD) are trained and tested in this way.
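The averaging scheme of steps 5-1 to 5-4 (ten stratified folds, repeated three times) can be sketched in plain Python as follows. This is a simplified illustration, not the splitting routine used in the embodiment: the fold assignment only mimics the idea of a stratified split, and the `majority_baseline` scorer is a hypothetical stand-in for training the glmGCN model on nine folds and scoring it on the held-out fold.

```python
import random
from statistics import mean

def stratified_folds(labels, k, rng):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)   # deal each class round-robin over the folds
    return folds

def repeated_cv(labels, evaluate, k=10, repeats=3, seed=0):
    """Average the per-fold score over k folds, then over the repeated experiments."""
    repeat_means = []
    for r in range(repeats):
        folds = stratified_folds(labels, k, random.Random(seed + r))
        scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            scores.append(evaluate(train_idx, test_idx))
        repeat_means.append(mean(scores))
    return mean(repeat_means)

labels = [0] * 60 + [1] * 40   # toy labels: 60 negative and 40 positive samples

def majority_baseline(train_idx, test_idx):
    """Hypothetical scorer: predict the majority training class, return test accuracy."""
    ones = sum(labels[i] for i in train_idx)
    guess = 1 if ones * 2 > len(train_idx) else 0
    return sum(1 for i in test_idx if labels[i] == guess) / len(test_idx)

final_score = repeated_cv(labels, majority_baseline)
```

Because every fold holds 6 negative and 4 positive samples, the baseline scores 0.6 on each fold, and the final averaged score is 0.6; a real run would replace the baseline with model training and any of the evaluation indexes.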
The embodiment is as follows:
For the CESC sample data set, following the method of step 5, the training set CESC_Xtrain of each fold from step 3-1 is input into the glmGCN model of step 4, and the network weights W_CESC are obtained by minimizing the loss function of step 4-4. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is used as the final result.
For the STAD sample data set, following the method of step 5, the training set STAD_Xtrain of each fold from step 3-2 is input into the glmGCN model of step 4, and the network weights W_STAD are obtained by minimizing the loss function of step 4-4. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is used as the final result.
For the PAAD sample data set, following the method of step 5, the training set PAAD_Xtrain of each fold from step 3-3 is input into the glmGCN model of step 4, and the network weights W_PAAD are obtained by minimizing the loss function of step 4-4. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is used as the final result.
For the BLCA sample data set, following the method of step 5, the training set BLCA_Xtrain of each fold from step 3-4 is input into the glmGCN model of step 4, and the network weights W_BLCA are obtained by minimizing the loss function of step 4-4. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is used as the final result.
Table 1 shows the results of the glmGCN model on the CESC sample data set, Table 2 the results on the STAD sample data set, Table 3 the results on the PAAD sample data set, and Table 4 the results on the BLCA sample data set. This series of experiments shows the effectiveness of the method.
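TABLE 1
| Methods | ACC (%) | SPE (%) | SEN (%) | F1-SCORE (%) | AUC |
| --- | --- | --- | --- | --- | --- |
| glmGCN | 98.92 | 99.64 | 98.19 | 98.89 | 0.9945 |
TABLE 2
| Methods | ACC (%) | SPE (%) | SEN (%) | F1-SCORE (%) | AUC |
| --- | --- | --- | --- | --- | --- |
| glmGCN | 97.39 | 98.26 | 96.52 | 97.35 | 0.9927 |
TABLE 3
| Methods | ACC (%) | SPE (%) | SEN (%) | F1-SCORE (%) | AUC |
| --- | --- | --- | --- | --- | --- |
| glmGCN | 79.56 | 84.36 | 74.76 | 78.44 | 0.8523 |
TABLE 4
| Methods | ACC (%) | SPE (%) | SEN (%) | F1-SCORE (%) | AUC |
| --- | --- | --- | --- | --- | --- |
| glmGCN | 92.04 | 94.66 | 89.41 | 91.73 | 0.9652 |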
The invention has been described above by way of example. It should be understood that any simple alterations, modifications or other equivalent substitutions that a person skilled in the art could make without inventive effort fall within the scope of protection of the invention.

Claims (2)

1. A method for identifying distant metastasis based on optimized representation of gene interaction patterns, the method comprising the steps of:
step 1, preprocessing data, including preparing a data set comprising gene expression data and clinical information, screening the gene expression data to obtain a differentially expressed gene set DEG, and oversampling the screened gene expression data by synthesizing minority-class samples with the SMOTE model;
step 2, constructing and processing a protein-protein interaction network PPI, and obtaining an adjacency matrix from the graph relation;
step 3, dividing the data into a training set and a test set in ten folds by using the StratifiedKFold method, wherein one fold is used as the test set and the other nine folds are used as the training set;
step 4, constructing a glmGCN model based on the optimized graph representation of gene interaction patterns, which specifically comprises the following processes:
step 4-1, constructing an embedded graph learning network layer, and acquiring a new graph representation on the basis of the initial graph by using a single-layer network:
Figure FDA0003365162860000011
wherein a = (a_1, a_2, ..., a_n)^T is a weight vector, σ(·) is an activation function, A represents the adjacency matrix, g_i and g_j respectively represent the gene expression of the i-th and j-th genes,
Figure FDA0003365162860000012
represents the power, p represents the total number of genes;
step 4-2, constructing a graph convolution network layer, and executing a layer-wise propagation rule based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, the output of the graph convolution network layer being expressed as:
O_i^(k+1) = σ(S · O_i^(k) · W^(k)), for i = 1, 2, ..., n
wherein k = 0, 1, ..., K−1, O_i^(k+1) represents the output of the (k+1)-th layer, O_i^(0) represents the gene expression of the i-th gene, σ(·) represents the activation function, and W^(k) represents the trainable weight matrix of each graph convolution layer;
step 4-3, constructing a fully connected network layer to comprehensively extract feature information, which comprises setting a plurality of fully connected network layers and processing through the plurality of fully connected layers to obtain the final prediction result;
step 5, training the glmGCN model based on the optimized graph representation of gene interaction patterns, and evaluating the trained model with accuracy (ACC), specificity (SPE), sensitivity (SEN), AUC and the ROC curve as evaluation indexes; testing on the test set with the trained network weight parameters and the network model, averaging the ten-fold prediction results obtained, and repeating the experiment three times to obtain the final test set prediction result.
2. The method for identifying distant metastasis based on optimized representation of gene interaction patterns according to claim 1, wherein in step 4-1, if there is no initial graph, i.e. only the gene expression matrix is available, the learned graph S is defined as:
Figure FDA0003365162860000021
wherein a = (a_1, a_2, ..., a_n)^T is a weight vector, σ(·) is an activation function, g_i and g_j respectively represent the gene expression of the i-th and j-th genes,
Figure FDA0003365162860000022
represents the power, and p represents the total number of genes.
CN202111400313.2A 2021-11-19 2021-11-19 Distant metastasis identification method based on gene interaction mode optimization graph representation Active CN114141306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111400313.2A CN114141306B (en) 2021-11-19 2021-11-19 Distant metastasis identification method based on gene interaction mode optimization graph representation

Publications (2)

Publication Number Publication Date
CN114141306A CN114141306A (en) 2022-03-04
CN114141306B true CN114141306B (en) 2023-04-07

Family

ID=80391119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111400313.2A Active CN114141306B (en) 2021-11-19 2021-11-19 Distant metastasis identification method based on gene interaction mode optimization graph representation

Country Status (1)

Country Link
CN (1) CN114141306B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019891B (en) * 2022-06-08 2023-07-07 郑州大学 Individual driving gene prediction method based on semi-supervised graph neural network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948667A (en) * 2019-03-01 2019-06-28 桂林电子科技大学 Image classification method and device for the prediction of correct neck cancer far-end transfer
CN109994151A (en) * 2019-01-23 2019-07-09 杭州师范大学 Predictive genes system is driven based on the tumour of complex network and machine learning method
CN110097974A (en) * 2019-05-15 2019-08-06 天津医科大学肿瘤医院 A kind of nasopharyngeal carcinoma far-end transfer forecasting system based on deep learning algorithm
CN110111895A (en) * 2019-05-15 2019-08-09 天津医科大学肿瘤医院 A kind of method for building up of nasopharyngeal carcinoma far-end transfer prediction model
CN110796672A (en) * 2019-11-04 2020-02-14 哈尔滨理工大学 Breast cancer MRI segmentation method based on hierarchical convolutional neural network
CN111081317A (en) * 2019-12-10 2020-04-28 山东大学 Gene spectrum-based breast cancer lymph node metastasis prediction method and prediction system
CN113128599A (en) * 2021-04-23 2021-07-16 南方医科大学南方医院 Machine learning-based head and neck tumor distal metastasis prediction method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A molecular signature of metastasis in primary solidtumors;Sridhar Ramaswamy;《Nature Genetics》;全文 *
Integrative clinical genomics of metastatic cancer;R Dan;《Cell》;第161卷;全文 *
Proteogenomiccharacterization and comprehensive integrative genomic analysis of humancolorectal cancer liver metastasis;Yu Shui Ma;《Molecular Cancer》;全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant