CN114141306A - Distant metastasis identification method based on gene interaction mode optimization graph representation - Google Patents

Info

Publication number
CN114141306A
CN114141306A CN202111400313.2A CN202111400313A CN114141306A CN 114141306 A CN114141306 A CN 114141306A CN 202111400313 A CN202111400313 A CN 202111400313A CN 114141306 A CN114141306 A CN 114141306A
Authority
CN
China
Prior art keywords
gene
graph
network
data
gene expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111400313.2A
Other languages
Chinese (zh)
Other versions
CN114141306B (en)
Inventor
苏苒
朱莹莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202111400313.2A priority Critical patent/CN114141306B/en
Publication of CN114141306A publication Critical patent/CN114141306A/en
Application granted granted Critical
Publication of CN114141306B publication Critical patent/CN114141306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Economics (AREA)
  • Neurology (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Development Economics (AREA)
  • Biotechnology (AREA)
  • Game Theory and Decision Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)

Abstract

The invention discloses a distant metastasis identification method based on a gene interaction pattern optimized graph representation, which comprises: data preprocessing; constructing and processing a protein-protein interaction (PPI) network; dividing training and test sets; constructing a glmGCN model based on the gene interaction pattern optimized graph representation; cross-validating the network model using ten folds; and applying the model to the test set. Compared with the prior art, the method predicts tumor metastasis under the GCN framework and, at the graph learning layer, gives more attention to the gene-gene relations of the domain initial graph, thereby obtaining more accurate prediction performance.

Description

Distant metastasis identification method based on gene interaction mode optimization graph representation
Technical Field
The invention belongs to the technical field of bioinformatics, and particularly relates to a distant metastasis identification method based on a gene interaction pattern optimized graph representation.
Background
Tumor metastasis refers to the process by which tumor cells spread from the primary site and continue to grow and form tumors at sites other than the primary site by invading lymphatic and blood vessels. Metastasis can be divided into regional metastasis and distant metastasis. In order to increase the cure rate of cancer and reduce the suffering of the patient, it is necessary to predict the presence of metastasis in cancer patients and then select an appropriate treatment strategy.
In recent years, transcriptome data generated by microarray and RNA sequencing techniques have been widely used to explore the molecular properties of metastasis. For example, the paper "Integrative clinical genomics of metastatic cancer" by Robinson et al. performed whole-exome and transcriptome sequencing of adult patients with metastatic solid tumors from different lineages and biopsy sites, providing a clinically relevant molecular landscape of metastatic tumors. The paper "Proteogenomic characterization and comprehensive integrative genomic analysis of human colorectal cancer liver metastasis" by Ma et al. provides a new paradigm for understanding liver metastasis of human colon and rectal cancer by analyzing proteomes and performing whole-exome sequencing, transcriptome sequencing and single nucleotide polymorphism array analysis. The paper "A molecular signature of metastasis in primary solid tumors" by Ramaswamy et al. also determined gene signatures that distinguish metastatic sites from primary sites based on tumor metastasis expression profiles.
However, these studies do not address the direct identification of metastatic samples. In addition, cancer has a complex mechanism in which the interactions between genes play a very important role and should be considered. Biological networks, such as protein-protein interaction (PPI) networks, therefore provide more reliable insights and have been used in many cancer-related studies.
Direct prediction of metastasis has been achieved in a few computational studies, and most existing studies predict cancer metastasis with traditional machine learning algorithms. For example, in the paper "Support vector machine classifier for prediction of the metastasis of colorectal cancer" by Zhi et al., multiple transcriptome data sets are integrated, feature genes are screened out, and an optimal Support Vector Machine (SVM) model is established to distinguish metastatic from non-metastatic samples. In "Machine learning prediction of lymph node metastasis of poorly differentiated-type intramucosal gastric cancer" by Zhou et al., seven machine learning algorithms are used to predict lymph node metastasis of poorly differentiated-type intramucosal gastric cancer.
Since biological networks are graph data with irregular spatial features, the advent of graph-based deep learning techniques has provided opportunities for network biology analysis. The paper "Hybrid Approach of Relation Network and Localized Graph Convolutional Filtering for Breast Cancer Subtype Classification" by Rhee et al. indicates that the Graph Convolutional Network (GCN) is a CNN that acts directly on graph data, takes into account both the feature information and the structure information of nodes, and can discover information that is more hidden and complex than general rules, where the representation of each node propagates through edges until a stable equilibrium is reached. Various GCN architectures have recently been proposed, such as the simplified graph convolutional network for semi-supervised learning proposed in the paper "Semi-Supervised Classification with Graph Convolutional Networks" by Kipf et al., which operates directly on the graph. However, the graph used by a GCN is either a fixed graph structure, such as an existing biological network with a specific meaning in a specific domain, or a human-constructed graph, such as a k-nearest-neighbor graph with a Gaussian kernel, a supervised graph built with a fully connected network as proposed in the paper "Deep Convolutional Networks on Graph-Structured Data" by Henaff et al., or a graph learned through a distance metric as proposed in the paper "Graph Attention Networks" by Veličković et al. Such graphs may not be well suited to the GCN architecture. To overcome this problem, the paper "Graph Learning-Convolutional Networks" by Jiang et al. proposes a Graph Learning-Convolutional Network (GLCN) that combines graph learning and graph convolution in a unified network structure. The given labels and the predicted labels are combined and training is carried out through a single optimization procedure, so that the graph structure can be refined and the effect is more pronounced.
Disclosure of Invention
Based on the prior art, the invention provides a distant metastasis identification method based on a gene interaction pattern optimization graph representation, which constructs a gene interaction graph representation and predicts distant metastasis of tumors under a graph convolution network GCN framework.
The invention is realized by the following technical scheme:
a method for identifying distant metastasis based on optimized representation of gene interaction patterns, the method comprising the steps of:
step 1, preprocessing data, including preparing a data set comprising gene expression data and clinical information, screening the gene expression data to obtain a differentially expressed gene set (DEG), and oversampling the minority class by synthesizing minority samples from the screened gene expression data with the SMOTE model;
step 2, constructing and processing a protein-protein interaction (PPI) network, and obtaining an adjacency matrix from the graph relation;
step 3, dividing the data into ten folds with the StratifiedKFold method, wherein one fold is used as the test set and the other nine folds are used as the training set;
step 4, constructing the glmGCN model based on the gene interaction pattern optimized graph representation, which specifically comprises the following processes:
step 4-1, constructing an embedded graph learning network layer, and obtaining a new graph representation by using a single-layer network:
Figure BDA0003365162870000031
where $a = (a_1, a_2, \ldots, a_n)^T$ is a weight vector, $\sigma(\cdot)$ is an activation function, $A$ denotes the adjacency matrix, $g_i$ and $g_j$ denote the gene expression of the $i$-th and $j$-th genes, respectively,
Figure BDA0003365162870000032
represents the power, p represents the total number of genes;
step 4-2, constructing a graph convolution network layer, executing a layering propagation rule based on the self-adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, and obtaining the output expression of the graph convolution network layer as follows:
$O_i^{(k+1)} = \sigma\left(S \cdot O_i^{(k)} \cdot W^{(k)}\right)$, for $i = 1, 2, \ldots, n$
where $k = 0, 1, \ldots, K-1$, $O_i^{(k+1)}$ denotes the output of the $(k+1)$-th layer, $O_i^{(0)}$ denotes the gene expression data of the $i$-th sample, $\sigma(\cdot)$ denotes the activation function, and $W^{(k)}$ denotes the trainable weight matrix of each graph convolution layer;
step 4-3, constructing fully connected network layers for comprehensive extraction of feature information, which comprises setting a plurality of fully connected layers and obtaining the final prediction result through them;
step 5, training the glmGCN model based on the gene interaction pattern optimized graph representation, and evaluating the trained model with accuracy (ACC), specificity (SPE), sensitivity (SEN), area under the curve (AUC) and the ROC curve as evaluation indexes; testing on the test set with the trained network model and its weight parameters, averaging the ten-fold prediction results, and repeating the experiment three times to obtain the final test set prediction result.
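As an illustration of the evaluation indexes named in step 5, the following minimal sketch computes ACC, SEN and SPE from binary labels. The function name `binary_metrics`, the toy labels, and the convention that label 1 means distant metastasis are assumptions made for this example; they are not taken from the patent.

```python
def binary_metrics(y_true, y_pred):
    """Return (ACC, SEN, SPE) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    acc = (tp + tn) / len(pairs)
    sen = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity = recall on positives
    spe = tn / (tn + fp) if (tn + fp) else 0.0  # specificity = recall on negatives
    return acc, sen, spe


# Toy labels (hypothetical, not patent data): 3 metastatic, 5 non-metastatic.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
acc, sen, spe = binary_metrics(y_true, y_pred)
```

Reporting SEN and SPE alongside ACC matters here because the original data sets are imbalanced, so accuracy alone can hide poor detection of the metastatic class.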
Compared with the prior art, the invention embeds a graph learning layer in the network model to learn the optimal graph representation of the gene interaction relationships; tumor metastasis is predicted under the GCN framework, which extracts informative high-level features from the constructed irregular graph structure. At the graph learning layer, more attention is given to the gene-gene relations of the domain initial graph, so that more accurate prediction performance is obtained.
Drawings
FIG. 1 is a general flow chart of the method for identifying distant metastasis based on an optimized representation of the gene interaction pattern according to the present invention;
FIG. 2 is a diagram of an example of a PPI network for PAAD;
FIG. 3 is a framework diagram of the overall implementation of a network;
FIG. 4 is a schematic diagram of an embedded graph learning layer structure;
FIG. 5 is a ROC graph of a glmGCN model of a CESC data set in an embodiment;
FIG. 6 is a ROC graph of the glmGCN model of the STAD data set in the embodiment;
FIG. 7 is a ROC graph of the glmGCN model of the PAAD data set in the example;
FIG. 8 is a ROC graph of the glmGCN model of the BLCA data set in the example.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
A method for distant metastasis identification based on optimized representation of gene interaction patterns, comprising the steps of:
step 1, data preprocessing
preparing the data sets, which comprise cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) data samples, stomach adenocarcinoma (STAD) data samples, pancreatic adenocarcinoma (PAAD) data samples and bladder urothelial carcinoma (BLCA) data samples (the invention is not limited to these data samples);
step 1-1, obtaining a CESC sample data set, a STAD sample data set, a PAAD sample data set and a BLCA sample data set, all of which are mRNA and lncRNA data of the TCGA database downloaded from Sanger Box; the data sets comprise gene expression data and clinical information, and the gene expression data comprise at least gene transcriptome Counts and gene expression level FPKM, wherein the Counts data are used for differential analysis, and the normalized FPKM expression values are used to classify more accurately whether metastasis has occurred;
in an embodiment of the invention, there are 309 samples in the CESC sample data set, 407 samples in the STAD sample data set, 182 samples in the PAAD sample data set and 427 samples in the BLCA sample data set, with 19814 mRNAs and 14851 lncRNAs;
step 1-2, processing the clinical information obtained in the step 1-1 to obtain classification labels of all data sets;
in an embodiment of the invention, 34 patients in the CESC sample data set developed distant metastasis and the remaining 275 did not (275 + 34 = 309). In the STAD sample data set, 347 patients did not develop distant metastasis and 60 did (347 + 60 = 407). In the PAAD sample data set, 122 patients did not develop distant metastasis and 60 did (122 + 60 = 182). In the BLCA sample data set, 96 patients developed distant metastasis and 331 did not (331 + 96 = 427);
step 1-3, performing differential gene screening on the gene expression data obtained in the step 1-1 to obtain a differential gene set DEG, and further screening the gene expression data according to the differential gene set DEG, wherein the method specifically comprises the following steps:
the grouping file is made from the sample name suffix: samples with suffix -01 or -06 are labeled tumor, and samples with suffix -11 are labeled normal. Differential gene analysis is performed with the edgeR package using the thresholds log2FC > 1 (FC, fold change, is the ratio of expression levels between two samples or groups) and FDR < 0.05 (False Discovery Rate); the Counts of the screened differential genes are then matched to the original FPKM data by sample name to obtain the FPKM data corresponding to the differential genes.
In an embodiment of the present invention, 1515 DEGs (1197 mRNAs + 318 lncRNAs) were obtained for the CESC sample data set, 4113 DEGs (2903 mRNAs + 1219 lncRNAs) for the STAD sample data set, 116 DEGs (77 mRNAs + 39 lncRNAs) for the PAAD sample data set, and 2767 DEGs (2118 mRNAs + 649 lncRNAs) for the BLCA sample data set;
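The threshold-based screening of step 1-3 can be sketched as follows. This is only an illustration of filtering a table of differential-analysis results by log2FC and FDR and then restricting the FPKM data to the selected genes; the function name `select_degs`, the gene names and all numeric values are hypothetical, and the optional `absolute` flag (filtering on |log2FC|, a common variant) is an assumption that the text does not state.

```python
def select_degs(results, lfc_cutoff=1.0, fdr_cutoff=0.05, absolute=False):
    """Return genes passing the log2FC and FDR thresholds.

    results maps gene name -> (log2FC, FDR), e.g. as exported from edgeR.
    absolute=True filters on |log2FC| instead (an assumption, not from the text).
    """
    selected = []
    for gene, (log2fc, fdr) in results.items():
        effect = abs(log2fc) if absolute else log2fc
        if effect > lfc_cutoff and fdr < fdr_cutoff:
            selected.append(gene)
    return selected


# Hypothetical differential-analysis output: gene -> (log2FC, FDR).
results = {
    "GENE_A": (2.3, 0.001),
    "GENE_B": (0.4, 0.010),
    "GENE_C": (1.8, 0.200),
    "GENE_D": (-1.5, 0.002),
}
# Hypothetical FPKM table: gene -> expression values per sample.
fpkm = {"GENE_A": [5.1, 7.9], "GENE_B": [1.0, 1.2],
        "GENE_C": [3.3, 0.4], "GENE_D": [8.0, 2.1]}

degs = select_degs(results)
degs_abs = select_degs(results, absolute=True)
# Keep only the FPKM rows that correspond to the selected genes.
fpkm_deg = {gene: fpkm[gene] for gene in degs if gene in fpkm}
```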
step 1-4, oversampling the minority class using the gene expression data screened in step 1-3: the minority samples are analyzed with the SMOTE (Synthetic Minority Oversampling Technique) model, new samples are artificially synthesized from them and added to the data set, which mitigates overfitting and balances the data set, yielding expression data X with a balanced class ratio; the CESC, STAD, PAAD and BLCA sample data sets are each oversampled in this way. The method specifically comprises the following steps:
calculating the Euclidean distance from each minority sample x to all other samples in the minority sample set to obtain the k nearest neighbors of x;
setting a sampling ratio according to the sample imbalance ratio to determine the sampling multiplier N, and, for each minority sample x, randomly selecting several samples from its k nearest neighbors; denoting a randomly selected neighbor by $\tilde{x}$, a new sample $x_{new}$ is constructed as:
$$x_{new} = x + \mathrm{rand}(0,1) \times (\tilde{x} - x)$$
where $\tilde{x}$ denotes each randomly selected neighbor, and rand(0,1) denotes a random number generated between 0 and 1.
In an embodiment of the present invention, the ratio of metastatic to non-metastatic samples is set to 1:1. After oversampling, the CESC, STAD, PAAD and BLCA sample data sets contain 275, 347, 122 and 331 metastatic patients, respectively, and the same numbers of non-metastatic patients. Thus, the total numbers of patients are 550, 694, 244 and 662, respectively.
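The oversampling of step 1-4 follows the usual SMOTE interpolation, in which a synthetic sample is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbors. The sketch below is a minimal NumPy version under stated assumptions: the function name `smote`, the default `k=5`, the random seed and the toy data are all hypothetical, and the per-sample sampling multiplier N of the text is simplified to a single total `n_new`.

```python
import numpy as np


def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating towards nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbours
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))      # pick a minority sample x
        j = neighbours[i, rng.integers(k)]   # pick one of its k neighbours
        gap = rng.random()                   # rand(0, 1)
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic


# Toy stand-in for the CESC case: 34 minority samples must grow to 275.
rng = np.random.default_rng(1)
minority = rng.normal(size=(34, 4))  # 4 hypothetical gene features
synthetic = smote(minority, n_new=275 - 34)
balanced_minority = np.vstack([minority, synthetic])
```

With 34 metastatic and 275 non-metastatic CESC samples, 241 synthetic samples bring the two classes to 275 each, which matches the total of 550 stated in the text.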
Step 2, building and processing PPI
step 2-1, according to the differential gene screening result obtained in step 1-3, retrieving from the STRING database the potential interactions among the encoded proteins and using them to construct a protein-protein interaction (PPI) network; the PPI network represents a graph relationship between genes, from which an adjacency matrix A can be obtained, wherein the gene interaction scores measure the strength of the relationships between nodes (genes);
in an embodiment of the invention, the PPI network of CESC comprises a total of 15664 gene interactions involving 1106 genes. The PPI network of STAD comprises 36389 gene interactions involving 2800 genes. The PPI network of PAAD comprises 35 interactions among 33 genes, and the PPI network of BLCA comprises 28183 interactions among 2014 genes.
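To make the step from a scored interaction list to the adjacency matrix A concrete, the following minimal sketch fills a symmetric matrix from (gene, gene, score) triples. The function name `build_adjacency`, the choice of genes and the scores are hypothetical illustrations, and the assumption that scores are already scaled to [0, 1] is ours rather than the patent's.

```python
import numpy as np


def build_adjacency(genes, edges):
    """Build a symmetric weighted adjacency matrix from scored gene pairs."""
    index = {gene: i for i, gene in enumerate(genes)}
    A = np.zeros((len(genes), len(genes)))
    for g1, g2, score in edges:
        # Ignore interactions that involve genes outside the DEG list.
        if g1 in index and g2 in index:
            i, j = index[g1], index[g2]
            A[i, j] = A[j, i] = score  # undirected interaction
    return A


# Hypothetical DEG list and interaction scores (illustrative values only).
genes = ["TP53", "MDM2", "EGFR", "KRAS"]
edges = [
    ("TP53", "MDM2", 0.99),
    ("EGFR", "KRAS", 0.90),
    ("TP53", "BRCA1", 0.95),  # dropped: BRCA1 is not in the gene list
]
A = build_adjacency(genes, edges)
```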
Step 3, dividing the training set and the test set, and specifically comprising the following steps:
step 3-1, dividing a training set and a testing set of a CESC sample:
performing ten-fold division with the StratifiedKFold method on the gene expression data X of the CESC sample data set obtained in step 1-4 after minority-class oversampling; each fold is taken in turn as the test data CESC_Xtest with test labels CESC_Ytest, and the remaining nine folds are taken as the training data CESC_Xtrain with training labels CESC_Ytrain.
In an embodiment of the invention, the gene expression data X of the CESC sample data set relates to 1515 gene features of 550 samples; X_test of the CESC sample data set has approximately 55 samples, and X_train has approximately 495 samples;
step 3-2, dividing a training set and a test set of the STAD sample:
performing ten-fold division with the StratifiedKFold method on the gene expression data X of the STAD sample data set obtained in step 1-4 after minority-class oversampling; each fold is taken in turn as the test data STAD_Xtest with test labels STAD_Ytest, and the remaining nine folds are taken as the training data STAD_Xtrain with training labels STAD_Ytrain;
in an embodiment of the invention, the gene expression data X of the STAD data set relates to 4113 gene features of 694 samples; X_test of the STAD data set has approximately 69 samples, and X_train has approximately 625 samples;
step 3-3, dividing a training set and a test set of the PAAD sample:
performing ten-fold division with the StratifiedKFold method on the gene expression data X of the PAAD sample data set obtained in step 1-4 after minority-class oversampling; each fold is taken in turn as the test data PAAD_Xtest with test labels PAAD_Ytest, and the remaining nine folds are taken as the training data PAAD_Xtrain with training labels PAAD_Ytrain;
in an embodiment of the invention, the gene expression data X of the PAAD data set relates to 116 gene features of 244 samples; X_test of the PAAD data set has approximately 24 samples, and X_train has approximately 220 samples;
step 3-4, dividing a training set and a test set of the BLCA sample:
performing ten-fold division with the StratifiedKFold method on the gene expression data X of the BLCA sample data set obtained in step 1-4 after minority-class oversampling; each fold is taken in turn as the test data BLCA_Xtest with test labels BLCA_Ytest, and the remaining nine folds are taken as the training data BLCA_Xtrain with training labels BLCA_Ytrain;
in an embodiment of the invention, the gene expression data X of the BLCA data set relates to 2767 gene features of 662 samples; X_test of the BLCA data set has approximately 66 samples, and X_train has approximately 596 samples;
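The fold sizes quoted in step 3 can be checked with a small stratified split. The sketch below is a hand-written round-robin stratification using only the standard library; it stands in for the StratifiedKFold method named in the text and is not that library's implementation. The function name `stratified_folds`, the seed and the synthetic labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=10, seed=0):
    """Assign sample indices to n_splits folds while keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    offset = 0  # continue the round-robin across classes so fold sizes stay even
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[(pos + offset) % n_splits].append(idx)
        offset += len(members)
    return folds


# Labels shaped like the oversampled CESC set: 275 metastatic, 275 non-metastatic.
labels = [1] * 275 + [0] * 275
folds = stratified_folds(labels)

splits = []
for k, test_idx in enumerate(folds):
    # One fold is the test set; the other nine folds form the training set.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    splits.append((train_idx, test_idx))

fold_sizes = [len(test_idx) for _, test_idx in splits]
```

For 550 samples this gives 55 test samples and 495 training samples per fold, in line with the approximate sizes stated for CESC.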
step 4, constructing the glmGCN model based on the gene interaction pattern optimized graph representation, which specifically comprises the following processes:
step 4-1, constructing an embedded graph learning network:
constructing an adjacency matrix $A \in \mathbb{R}^{p \times p}$ based on the PPI network. The PPI network does not record the interaction of a gene with itself, so the diagonal elements of A are zero, as for two genes that do not interact; an identity matrix I is therefore added to A, so that the diagonal elements take the same value as two genes that interact directly.
A nonlinear function $S_{ij} = h(g_i, g_j)$ is defined based on the gene expression matrix G and the adjacency matrix A, where $g_i$ and $g_j$ denote the gene expression of the $i$-th and $j$-th genes, respectively. At the same time, a projection matrix $P_{pj} \in \mathbb{R}^{n \times d}$ with $d < n$ is used to map the data into a low-dimensional space and reduce computational complexity. A new graph representation is obtained using a single-layer network, as shown in the following equations:
Figure BDA0003365162870000081
Figure BDA0003365162870000091
where $a = (a_1, a_2, \ldots, a_n)^T$ is a weight vector, $\sigma(\cdot)$ is an activation function, $A$ denotes the adjacency matrix, $g_i$ and $g_j$ denote the gene expression of the $i$-th and $j$-th genes, respectively,
Figure BDA0003365162870000092
represents the power, and p represents the total number of genes.
The weight vector is $a = (a_1, a_2, \ldots, a_n)^T \in \mathbb{R}^{n \times 1}$, and a sigmoid or tanh function is used as the activation function $\sigma(\cdot)$. Compared with GLCN, the use of the adjacency matrix A is improved: A is raised to the power
Figure BDA0003365162870000093
to emphasize the importance of the domain initial graph, so that the difference between strong and weak gene interactions in the original PPI network graph becomes larger. The weight vector a and the projection matrix $P_{pj}$ are optimized through the following loss function $L_1$ (Nie F, "Clustering and projected clustering with adaptive neighbors," 2014):
Figure BDA0003365162870000094
where γ and β are two manually adjustable constants, and F denotes the Frobenius norm.
When
Figure BDA0003365162870000095
and
Figure BDA0003365162870000096
are far apart, i.e. the distance
Figure BDA0003365162870000097
is larger, S should be smaller. Conversely, if
Figure BDA0003365162870000098
is smaller, S should be larger. The second term is a regularization term that controls the sparsity of the learned graph S, where γ is the regularization coefficient. The third term integrates the PPI network information into the loss function, which makes the loss function more reasonable.
If there is no initial graph, i.e. only the gene expression matrix is available, the learned graph S can be defined as:
Figure BDA0003365162870000099
in this case, the weight matrix will be optimized by the following loss function:
Figure BDA00033651628700000910
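The graph learning layer of step 4-1 can be illustrated with a GLCN-style sketch. The equations of this step are not reproduced in the text available here, so the code below is an assumption-laden illustration rather than the patented formula: it scores each gene pair with tanh(aᵀ|g_i − g_j|), weights the exponentiated score by the prior adjacency raised to a power, and normalizes each row. The function name `learn_graph`, the exponent value `power=2.0`, the omission of the projection matrix, and the toy data are all hypothetical.

```python
import numpy as np


def learn_graph(G, A, a, power=2.0):
    """Row-normalized graph S from expression G (p x n), prior A (p x p), weights a (n,).

    Assumed form: S_ij ∝ A_ij**power * exp(tanh(a^T |g_i - g_j|)), normalized per row.
    """
    diff = np.abs(G[:, None, :] - G[None, :, :])  # |g_i - g_j|, shape (p, p, n)
    score = np.tanh(diff @ a)                     # bounded pairwise score, shape (p, p)
    weighted = (A ** power) * np.exp(score)       # prior graph gates the learned score
    return weighted / weighted.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
p, n = 4, 6                                       # 4 genes, 6 samples (toy sizes)
G = rng.random((p, n))
# Hypothetical PPI scores plus the identity matrix on the diagonal.
A = np.array([
    [0.0, 0.9, 0.2, 0.0],
    [0.9, 0.0, 0.0, 0.5],
    [0.2, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
]) + np.eye(p)
a = rng.normal(size=n)
S = learn_graph(G, A, a, power=2.0)
```

In this sketch, an exponent larger than 1 applied to scores in (0, 1) widens the relative gap between strong and weak prior interactions, and pairs with no prior interaction keep a zero weight in S. The pairwise difference tensor has size p × p × n, so a real data set with thousands of genes would need a blocked or sparse computation.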
step 4-2, constructing a graph convolution network:
when a new graph representation S is obtained from the graph learning layer using a single-layer network, the final softmax operation of equation (3) or equation (5) in step 4-1 ensures that the learned graph S satisfies the following property:
Figure BDA00033651628700000911
Thus, a graph G(X, S) is defined, whose representation is learned from the data X and the adjacency matrix S. In the graph convolution layers, a layer-wise propagation rule is executed based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, i.e.
$O_i^{(k+1)} = \sigma\left(S \cdot O_i^{(k)} \cdot W^{(k)}\right)$, for $i = 1, 2, \ldots, n$ (8)
where $k = 0, 1, \ldots, K-1$, $O_i^{(k+1)}$ denotes the output of the $(k+1)$-th layer, $O_i^{(0)}$ denotes the gene expression data of the $i$-th sample, $\sigma(\cdot)$ denotes the activation function, and $W^{(k)}$ denotes the trainable weight matrix of each graph convolution layer; $W^{(0)} \in \mathbb{R}^{1 \times h(0)}$ is the input-to-hidden weight matrix for a hidden layer with $h(0)$ feature maps, and $W^{(K)} \in \mathbb{R}^{h(K-1) \times C}$ is the hidden-to-output weight matrix for a hidden layer with $h(K-1)$ feature maps, where C is the number of classes (C = 2 in the experiments).
If the graph convolution layers produce the output directly after feature extraction, the final perceptron layer is defined as follows:
$Z = \mathrm{softmax}\left(S\, O_i^{(K)} W^{(K)}\right)$ (9)
where the weight matrix is
Figure BDA0003365162870000101
C denotes the number of classes, and the output $Z \in \mathbb{R}^{n \times C}$ gives the label predictions of the glmGCN model, with $Z_i$ denoting the label prediction for the $i$-th node.
In an embodiment of the invention, for the CESC sample data set, $O_i^{(0)}$ has dimension 1515 × 1 and there are two graph convolution layers; $W^{(0)}$ is a parameter matrix of dimension 1 × 25 and $W^{(1)}$ a parameter matrix of dimension 25 × 2, both drawn from the uniform distribution given by Xavier initialization; the dimension of S is 1515 × 1515;
in an embodiment of the invention, for the STAD sample data set, $O_i^{(0)}$ has dimension 4113 × 1 and there are two graph convolution layers; $W^{(0)}$ is a parameter matrix of dimension 1 × 25 and $W^{(1)}$ a parameter matrix of dimension 25 × 2, both drawn from the uniform distribution given by Xavier initialization; the dimension of S is 4113 × 4113;
in an embodiment of the invention, for the PAAD sample data set, $O_i^{(0)}$ has dimension 116 × 1 and there are two graph convolution layers; $W^{(0)}$ is a parameter matrix of dimension 1 × 25 and $W^{(1)}$ a parameter matrix of dimension 25 × 2, both drawn from the uniform distribution given by Xavier initialization; the dimension of S is 116 × 116;
in an embodiment of the invention, for the BLCA sample data set, $O_i^{(0)}$ has dimension 2767 × 1 and there are two graph convolution layers; $W^{(0)}$ is a parameter matrix of dimension 1 × 25 and $W^{(1)}$ a parameter matrix of dimension 25 × 2, both drawn from the uniform distribution given by Xavier initialization; the dimension of S is 2767 × 2767;
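The dimensions listed for the graph convolution layers can be traced through a short forward pass. The sketch below applies the propagation rule of step 4-2 twice with weights of shape 1 × 25 and 25 × 2 for the PAAD-sized case (p = 116). The helper names `xavier_uniform` and `gcn_forward`, the use of ReLU as the activation, and the random stand-ins for S and the expression vector are assumptions for illustration; the text does not fix the activation function.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform initialization: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def gcn_forward(S, o0, weights):
    """Apply O^(k+1) = sigma(S @ O^(k) @ W^(k)) for each weight matrix in turn."""
    out = o0
    for W in weights:
        out = np.maximum(S @ out @ W, 0.0)  # ReLU is an assumed choice of sigma
    return out


rng = np.random.default_rng(0)
p = 116                                    # number of genes in the PAAD example
S = rng.random((p, p))
S /= S.sum(axis=1, keepdims=True)          # random row-normalized stand-in for the learned graph
o0 = rng.random((p, 1))                    # one sample: one expression value per gene
weights = [xavier_uniform(1, 25, rng), xavier_uniform(25, 2, rng)]
out = gcn_forward(S, o0, weights)          # shape (116, 2)
```

Each gene node starts with a single feature (its expression value in the sample), is lifted to 25 hidden features by the first layer, and is reduced to 2 features by the second, which is the per-sample representation later passed on to the fully connected layers.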
step 4-3, constructing a full connection network layer:
constructing fully connected (FC) layers for comprehensive extraction of feature information: a plurality of network layers are set; weights and biases are initialized in the first network layer and forward propagation is carried out; the network error is calculated after each iteration, and the error and gradient are fed back to the network model to update the network weights and reduce the error of subsequent iterations; after processing by the multiple layers, the error and prediction accuracy of the network model are obtained, and the model with the minimum error is selected as the final model.
In the embodiment of the invention, the CESC data set full connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;
in the embodiment of the invention, the STAD data set full connection layer is 5 layers, and the number of the neurons is 2048, 1024, 512, 256 and 2 respectively;
in the embodiment of the invention, the PAAD data set full-connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;
in the embodiment of the invention, the BLCA data set full connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;
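A forward pass through a five-layer stack of this kind can be sketched as follows, using the PAAD layer widths. Several details are assumptions made for illustration and are not stated in the text: the graph convolution output is flattened to a vector of length 116 × 2 before the first layer, hidden layers use ReLU, the last layer uses softmax, and the helper names `init_fc` and `fc_forward` are hypothetical.

```python
import numpy as np

def init_fc(in_dim, widths, rng):
    """Build (W, b) pairs with Xavier-uniform weights and zero biases."""
    params, fan_in = [], in_dim
    for fan_out in widths:
        bound = np.sqrt(6.0 / (fan_in + fan_out))
        W = rng.uniform(-bound, bound, size=(fan_in, fan_out)).astype(np.float32)
        params.append((W, np.zeros(fan_out, dtype=np.float32)))
        fan_in = fan_out
    return params

def fc_forward(x, params):
    """ReLU on the hidden layers, softmax on the final two-unit layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max())     # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
widths = [4096, 2048, 1024, 512, 2]       # PAAD embodiment
x = rng.random(116 * 2).astype(np.float32)  # assumed flattened graph-convolution output
params = init_fc(x.size, widths, rng)
probs = fc_forward(x, params)
print(probs)
```

The two output values can be read as class probabilities for the two labels; training (error feedback and weight updates) is deliberately left out of this sketch.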
Step 4-4, constructing the loss function:
Let the predicted labels be Z ∈ R^(n×2), where the i-th row is the predicted label vector of the i-th sample, and let the true labels be Y ∈ R^(n×2), where the i-th row is the true label vector of the i-th sample. The parameters are optimized with a cross-entropy loss function (T denotes the training set):
Figure BDA0003365162870000111
This loss function mainly optimizes the parameters of the network layers from the graph convolution layer to the fully connected layer; all parameters of the whole architecture are optimized jointly as follows:
L_glmGCN = L_2 + λ·L_1    (11)
In the embodiment of the invention, λ is 0.1 for the CESC sample data set and 0.01 for the STAD, PAAD and BLCA sample data sets;
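As a hedged illustration of step 4-4, the cross-entropy over the training set T is commonly written as −Σ_{i∈T} Σ_j Y_ij · ln Z_ij; the sketch below assumes that form. The function `total_loss` simply mirrors equation (11) and takes both terms as given numbers, since this passage does not spell out how L_1 and L_2 are each computed. All function names are hypothetical.

```python
import math
import numpy as np

def cross_entropy(Z, Y, train_idx, eps=1e-12):
    """Assumed form: -sum over i in T and j of Y[i, j] * ln Z[i, j]."""
    Zt = np.clip(Z[train_idx], eps, 1.0)   # avoid log(0)
    return float(-(Y[train_idx] * np.log(Zt)).sum())

def total_loss(l2, l1, lam):
    """Equation (11): L_glmGCN = L_2 + lambda * L_1."""
    return l2 + lam * l1

Z = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])   # predicted label vectors
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # one-hot true labels
T = [0, 1]                                            # indices of training samples
ce = cross_entropy(Z, Y, T)
combined = total_loss(1.0, 2.0, lam=0.1)              # illustrative numbers only
print(ce, combined)
```

Only the rows indexed by T contribute to the cross-entropy, so the third sample in this toy example is ignored.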
Step 5, training and evaluating the glmGCN model based on the gene interaction pattern optimized graph representation, specifically comprising the following steps:
Step 5-1, the training set CESC_Xtrain of each fold from step 3-1 is input into the glmGCN model of step 4, and the loss function of step 4-4 is minimized to obtain the network weights W_CESC.
The training set STAD_Xtrain of each fold from step 3-1 is input into the glmGCN model of step 4, and the loss function of step 4-4 is minimized to obtain the network weights W_STAD.
The training set PAAD_Xtrain of each fold from step 3-1 is input into the glmGCN model of step 4, and the loss function of step 4-4 is minimized to obtain the network weights W_PAAD.
The training set BLCA_Xtrain of each fold from step 3-1 is input into the glmGCN model of step 4, and the loss function of step 4-4 is minimized to obtain the network weights W_BLCA.
Step 5-2, the trained model is evaluated using accuracy (ACC), specificity (SPE), sensitivity (SEN), F1-score (F1) and AUC as evaluation indices, and the ROC curve is used to display the relation between sensitivity and specificity. The basic definitions are as follows: FN (false negative) is the number of samples judged negative that are actually positive; FP (false positive) is the number of samples judged positive that are actually negative; TN (true negative) is the number of samples judged negative that are actually negative; and TP (true positive) is the number of samples judged positive that are actually positive. Precision, calculated as Precision = TP/(TP + FP), is the proportion of samples predicted positive that are actually positive. Recall, calculated as Recall = TP/(TP + FN), is the proportion of actually positive samples that are predicted positive.
The evaluation indices are defined as follows:
ACC = (TP + TN)/(TP + TN + FP + FN)
SPE = TN/(TN + FP)
SEN = TP/(TP + FN)
F1 = 2 × Precision × Recall/(Precision + Recall)
Sensitivity (SEN), also called the true positive rate (TPR) or recall, is the proportion of actually positive samples that are predicted positive. Specificity (SPE), also called the true negative rate (TNR), is the proportion of actually negative samples that are predicted negative. Accuracy (ACC) is a common evaluation index giving the proportion of correctly predicted samples among all samples; the higher the accuracy, the better the classifier. The F1-score is the harmonic mean of precision and recall and combines both results; a higher F1-score indicates a more effective method. ROC and AUC are two further indices for evaluating a classifier: ROC is short for receiver operating characteristic curve, and AUC is short for area under the ROC curve. The ROC curve is plotted from sensitivity and specificity and shows the relation between them as the decision threshold varies continuously. The AUC value is the area under the ROC curve; it generally lies between 0.5 and 1, and a larger AUC indicates higher classification accuracy and a better prediction effect.
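The indices described above can be computed from the four confusion counts as in the following sketch. The function names are hypothetical, labels are assumed to be coded 1 for positive and 0 for negative, and AUC is computed through its pairwise-ranking interpretation (the fraction of positive/negative pairs ranked correctly, ties counted as one half) rather than by integrating a plotted curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(num, den):
    # Return 0.0 instead of raising when a class is absent.
    return num / den if den else 0.0

def evaluation_indices(tp, tn, fp, fn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "ACC": safe_div(tp + tn, tp + tn + fp + fn),
        "SPE": safe_div(tn, tn + fp),
        "SEN": recall,
        "F1": safe_div(2 * precision * recall, precision + recall),
    }

def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.6, 0.3, 0.7]   # illustrative scores only
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # threshold at 0.5
counts = confusion_counts(y_true, y_pred)
indices = evaluation_indices(*counts)
auc = auc_from_scores(y_true, scores)
print(counts, indices, auc)
```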
Step 5-3, the network weights W_CESC trained for each fold in step 5-1, together with the network model of step 4, are tested on the test set CESC_Xtest of each fold from step 3-1.
The network weights W_STAD trained for each fold in step 5-1, together with the network model of step 4, are tested on the test set STAD_Xtest of each fold from step 3-1.
The network weights W_PAAD trained for each fold in step 5-1, together with the network model of step 4, are tested on the test set PAAD_Xtest of each fold from step 3-1.
The network weights W_BLCA trained for each fold in step 5-1, together with the network model of step 4, are tested on the test set BLCA_Xtest of each fold from step 3-1.
Step 5-4, the ten-fold prediction results obtained in step 5-3 are averaged, and the experiment is repeated three times to obtain the final test-set prediction result. All data sets (CESC, STAD, BLCA, PAAD) are trained and tested in this way.
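The protocol of steps 5-1 to 5-4 (stratified ten-fold splitting, one score per fold, averaging over folds and over three repetitions) can be sketched in plain Python. The splitter below is a simplified stand-in for a stratified k-fold routine, and `evaluate` is a hypothetical callback that would train on the training indices and return a score on the test indices; here a majority-class baseline is plugged in purely so the sketch runs.

```python
import random
from statistics import mean

def stratified_folds(labels, k, seed):
    """Deal the shuffled indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def repeated_cv(labels, evaluate, k=10, repeats=3):
    """Average the k fold scores, then average over the repetitions."""
    repeat_means = []
    for r in range(repeats):
        folds = stratified_folds(labels, k, seed=r)
        scores = []
        for f in range(k):
            test_idx = folds[f]                       # one fold for testing
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            scores.append(evaluate(train_idx, test_idx))
        repeat_means.append(mean(scores))
    return mean(repeat_means)

labels = [0] * 50 + [1] * 50   # toy balanced labels, e.g. after oversampling

def majority_baseline(train_idx, test_idx):
    """Placeholder evaluator: predict the training majority class, return accuracy."""
    ones = sum(labels[i] for i in train_idx)
    guess = 1 if ones * 2 > len(train_idx) else 0
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)

final_score = repeated_cv(labels, majority_baseline)
folds = stratified_folds(labels, 10, seed=0)
print(final_score)
```

Replacing `majority_baseline` with a function that trains the model on `train_idx` and scores it on `test_idx` gives the averaged result described in step 5-4.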
Example:
For the CESC sample data set, following the method of step 5, the training set CESC_Xtrain of each fold from step 3-1 is input into the glmGCN model of step 4, and the loss function of step 4-4 is minimized to obtain the network weights W_CESC. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.
For the STAD sample data set, following the method of step 5, the training set STAD_Xtrain of each fold from step 3-2 is input into the glmGCN model of step 4, and the loss function of step 4-4 is minimized to obtain the network weights W_STAD. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.
For the PAAD sample data set, following the method of step 5, the training set PAAD_Xtrain of each fold from step 3-3 is input into the glmGCN model of step 4, and the loss function of step 4-4 is minimized to obtain the network weights W_PAAD. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.
For the BLCA sample data set, following the method of step 5, the training set BLCA_Xtrain of each fold from step 3-4 is input into the glmGCN model of step 4, and the loss function of step 4-4 is minimized to obtain the network weights W_BLCA. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.
Tables 1 to 4 show the results of the glmGCN model on the CESC, STAD, PAAD and BLCA sample data sets respectively. These experiments show the effectiveness of the method.
TABLE 1 (CESC)
Methods   ACC(%)   SPE(%)   SEN(%)   F1-SCORE(%)   AUC
glmGCN    98.92    99.64    98.19    98.89         0.9945
TABLE 2 (STAD)
Methods   ACC(%)   SPE(%)   SEN(%)   F1-SCORE(%)   AUC
glmGCN    97.39    98.26    96.52    97.35         0.9927
TABLE 3 (PAAD)
Methods   ACC(%)   SPE(%)   SEN(%)   F1-SCORE(%)   AUC
glmGCN    79.56    84.36    74.76    78.44         0.8523
TABLE 4 (BLCA)
Methods   ACC(%)   SPE(%)   SEN(%)   F1-SCORE(%)   AUC
glmGCN    92.04    94.66    89.41    91.73         0.9652
The invention has been described in an illustrative manner, and it is to be understood that any simple variations, modifications or other equivalent changes which can be made by one skilled in the art without departing from the spirit of the invention fall within the scope of the invention.

Claims (2)

1. A method for identifying distant metastasis based on optimized representation of gene interaction patterns, the method comprising the steps of:
step 1, preprocessing data, including preparing a data set comprising gene expression data and clinical information, screening the gene expression data to obtain a differentially expressed gene set DEG, and oversampling the screened gene expression data by synthesizing minority-class samples with the SMOTE method;
step 2, constructing and processing a protein-protein interaction (PPI) network, and obtaining an adjacency matrix from the graph relation;
step 3, dividing the data into a training set and a test set by ten folds using the StratifiedKFold method, wherein one fold is used as the test set and the other nine folds are used as the training set;
step 4, constructing a glmGCN model based on the gene interaction pattern optimized graph representation, specifically comprising the following processes:
step 4-1, constructing an embedded graph learning network layer, and acquiring a new graph representation on the basis of the initial graph by using a single-layer network:
Figure FDA0003365162860000011
wherein a = (a_1, a_2, ..., a_n)^T is a weight vector, σ(·) is an activation function, A represents the adjacency matrix, g_i and g_j respectively represent the gene expression of the i-th and j-th genes,
Figure FDA0003365162860000012
represents the power, p represents the total number of genes;
step 4-2, constructing a graph convolution network layer, and executing a layer-wise propagation rule based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, the output of the graph convolution network layer being expressed as:
O_i^(k+1) = σ(S · O_i^(k) · W^(k)), for i = 1, 2, ..., n
wherein k = 0, 1, ..., K−1, O_i^(k+1) represents the output of the (k+1)-th layer, O_i^(0) represents the gene expression of the i-th gene, σ(·) represents the activation function, and W^(k) represents the trainable weight matrix of each graph convolution layer;
step 4-3, constructing a fully connected network layer for comprehensively extracting feature information, including setting a plurality of fully connected network layers and obtaining the final prediction result through the processing of the plurality of fully connected network layers;
step 5, training the glmGCN model based on the gene interaction pattern optimized graph representation, and evaluating the trained model using ACC (accuracy), SPE (specificity), SEN (sensitivity), AUC and the ROC curve as evaluation indices; testing on the test set using the trained network weight parameters and the network model, averaging the obtained ten-fold prediction results, and repeating the experiment three times to obtain the final test-set prediction result.
2. The method for identifying distant metastasis based on optimized representation of gene interaction patterns according to claim 1, wherein in step 4-1, if there is no initial graph, i.e. only the gene expression matrix is available, the learned graph S is defined as:
Figure FDA0003365162860000021
wherein a = (a_1, a_2, ..., a_n)^T is a weight vector, σ(·) is an activation function, g_i and g_j respectively represent the gene expression of the i-th and j-th genes,
Figure FDA0003365162860000022
represents the power, and p represents the total number of genes.
CN202111400313.2A 2021-11-19 2021-11-19 Distant metastasis identification method based on gene interaction mode optimization graph representation Active CN114141306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111400313.2A CN114141306B (en) 2021-11-19 2021-11-19 Distant metastasis identification method based on gene interaction mode optimization graph representation


Publications (2)

Publication Number Publication Date
CN114141306A true CN114141306A (en) 2022-03-04
CN114141306B CN114141306B (en) 2023-04-07

Family

ID=80391119


Country Status (1)

Country Link
CN (1) CN114141306B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019891A (en) * 2022-06-08 2022-09-06 郑州大学 Individual driver gene prediction method based on semi-supervised graph neural network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948667A (en) * 2019-03-01 2019-06-28 桂林电子科技大学 Image classification method and device for the prediction of correct neck cancer far-end transfer
CN109994151A (en) * 2019-01-23 2019-07-09 杭州师范大学 Predictive genes system is driven based on the tumour of complex network and machine learning method
CN110097974A (en) * 2019-05-15 2019-08-06 天津医科大学肿瘤医院 A kind of nasopharyngeal carcinoma far-end transfer forecasting system based on deep learning algorithm
CN110111895A (en) * 2019-05-15 2019-08-09 天津医科大学肿瘤医院 A kind of method for building up of nasopharyngeal carcinoma far-end transfer prediction model
CN110796672A (en) * 2019-11-04 2020-02-14 哈尔滨理工大学 Breast cancer MRI segmentation method based on hierarchical convolutional neural network
CN111081317A (en) * 2019-12-10 2020-04-28 山东大学 Gene spectrum-based breast cancer lymph node metastasis prediction method and prediction system
CN113128599A (en) * 2021-04-23 2021-07-16 南方医科大学南方医院 Machine learning-based head and neck tumor distal metastasis prediction method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
R DAN: "Integrative clinical genomics of metastatic cancer", 《CELL》 *
SRIDHAR RAMASWAMY: "A molecular signature of metastasis in primary solid tumors", 《NATURE GENETICS》 *
YU SHUI MA: "Proteogenomic characterization and comprehensive integrative genomic analysis of human colorectal cancer liver metastasis", 《MOLECULAR CANCER》 *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant