CN114141306A

CN114141306A - Distant metastasis identification method based on gene interaction mode optimization graph representation

Info

Publication number: CN114141306A
Application number: CN202111400313.2A
Authority: CN
Inventors: 苏苒; 朱莹莹
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2022-03-04
Anticipated expiration: 2041-11-19
Also published as: CN114141306B

Abstract

The invention discloses a distant metastasis identification method based on a gene interaction pattern optimized graph representation, which comprises: preprocessing data; constructing and processing a protein-protein interaction (PPI) network; dividing training and test sets; constructing a glmGCN model based on the gene interaction pattern optimized graph representation; cross-validating the network model using ten folds; and applying the model to the test set. Compared with the prior art, the method predicts tumor metastasis under the GCN framework and, at the graph learning layer, gives more attention to the gene-gene relations of the domain initial graph, thereby obtaining more accurate prediction performance.

Description

Distant metastasis identification method based on gene interaction mode optimization graph representation

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a distant metastasis identification method based on a gene interaction pattern optimized graph representation.

Background

Tumor metastasis refers to the process by which tumor cells spread from the primary site and continue to grow and form tumors at sites other than the primary site by invading lymphatic and blood vessels. Metastasis can be divided into regional metastasis and distant metastasis. In order to increase the cure rate of cancer and reduce the suffering of the patient, it is necessary to predict the presence of metastasis in cancer patients and then select an appropriate treatment strategy.

In recent years, transcriptome data generated by microarray and RNA sequencing techniques have been widely used to explore the molecular properties of metastasis. For example, the paper "Integrative clinical genomics of metastatic cancer" by Robinson et al. performed whole-exome and transcriptome sequencing of adult patients with metastatic solid tumors from different lineages and biopsy sites, providing a clinically relevant molecular landscape of metastatic tumors. The paper "Proteogenomic characterization and comprehensive integrative genomic analysis of human colorectal cancer liver metastasis" by Ma et al. provides a new paradigm for understanding liver metastasis of human colorectal cancer by analyzing proteomes and performing whole-exome sequencing, transcriptome sequencing and single nucleotide polymorphism array analysis. The paper "A molecular signature of metastasis in primary solid tumors" by Ramaswamy et al. also determined gene signatures that distinguish metastatic sites from primary sites based on expression profiles of tumor metastases.

However, these studies do not address the direct identification of metastatic samples. In addition, cancer has a complex mechanism in which the interactions between genes play a very important role and should be taken into account. Biological networks such as protein-protein interaction (PPI) networks therefore provide more reliable insights and have been used in many cancer-related studies.

Direct prediction of metastasis has been achieved in only a few computational studies, and most of them predict cancer metastasis with traditional machine learning algorithms. For example, in the paper "Support vector machine classifier for prediction of the metastasis of colorectal cancer" by Zhi et al., multiple transcriptome data sets are integrated, feature genes are screened out, and an optimal support vector machine (SVM) model is established to distinguish metastatic from non-metastatic samples. In the paper "Machine learning prediction of lymph node metastasis of poorly differentiated-type intramucosal gastric cancer" by Zhou et al., seven machine learning algorithms are used to predict lymph node metastasis of poorly differentiated-type intramucosal gastric cancer.

Since biological networks are graph data with irregular spatial structure, the advent of graph-based deep learning techniques has provided opportunities for network-based biological analysis. The paper "Hybrid Approach of Relation Network and Localized Graph Convolutional Filtering for Breast Cancer Subtype Classification" by Rhee et al. indicates that the graph convolutional network (GCN) is a CNN that acts directly on graph data; it takes into account both the feature information and the structural information of nodes, with the representation of each node propagating along the edges until a stable equilibrium is reached, and it can therefore mine information that is more hidden and complex than ordinary patterns. Various GCN architectures have recently been proposed, such as the simplified graph convolutional network for semi-supervised learning proposed by Kipf et al. in the paper "Semi-Supervised Classification with Graph Convolutional Networks", which operates directly on the graph. However, the graph used by a GCN is either a fixed graph structure, such as an existing biological network with a specific meaning in a specific domain, or a human-constructed graph, such as a k-nearest-neighbor graph with a Gaussian kernel, a supervised graph learned with a fully connected network as proposed in the paper "Deep Convolutional Networks on Graph-Structured Data" by Henaff et al., or a graph learned by a distance metric as proposed in the paper "Graph Attention Networks" by Veličković et al. Such graphs may not be the best fit for the GCN architecture. To overcome this problem, the paper "Semi-supervised Learning with Graph Learning-Convolutional Networks" by Jiang et al. proposes a graph learning-convolutional network (GLCN) that combines graph learning and graph convolution in a unified network structure. The given labels and the predicted labels are combined and trained through a single optimization, so that the graph structure can be refined and the effect is more pronounced.

Disclosure of Invention

Based on the prior art, the invention provides a distant metastasis identification method based on a gene interaction pattern optimization graph representation, which constructs a gene interaction graph representation and predicts distant metastasis of tumors under a graph convolution network GCN framework.

The invention is realized by the following technical scheme:

a method for identifying distant metastasis based on optimized representation of gene interaction patterns, the method comprising the steps of:

step 1, preprocessing data, including preparing a data set that comprises gene expression data and clinical information, screening the gene expression data to obtain a differential gene set (DEG), and oversampling the screened gene expression data by synthesizing minority-class samples with the SMOTE model;

step 2, constructing and processing a protein-protein interaction (PPI) network, and obtaining an adjacency matrix from the graph relation;

step 3, dividing the data ten-fold into a training set and a test set by using the StratifiedKFold method, wherein one fold in turn is used as the test set and the remaining nine folds are used as the training set;

step 4, constructing a glmGCN model based on the gene interaction pattern optimized graph representation, specifically comprising the following processes:

step 4-1, constructing an embedded graph learning network layer, and obtaining a new graph representation by using a single-layer network:

where a = (a₁, a₂, ..., a_n)^T is a weight vector, σ(·) is an activation function, A denotes the adjacency matrix, g_i and g_j denote the gene expression of the i-th and j-th genes, respectively, the exponent applied to A denotes the power, and p denotes the total number of genes;

step 4-2, constructing a graph convolution network layer, and executing a layer-wise propagation rule based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, the output of the graph convolution network layer being expressed as:

O_i^(k+1) = σ(S·O_i^(k)·W^(k)), for i = 1, 2, ..., n

where k = 0, 1, ..., K-1, O_i^(k+1) denotes the output of the (k+1)-th layer, O_i^(0) denotes the gene expression of the i-th gene, σ(·) denotes the activation function, and W^(k) denotes the trainable weight matrix of each graph convolution layer;

step 4-3, constructing a fully connected network layer to comprehensively extract feature information, which comprises setting a plurality of fully connected network layers and obtaining the final prediction result through the processing of these layers;

step 5, training the glmGCN model based on the gene interaction pattern optimized graph representation, and evaluating the trained model with accuracy (ACC), specificity (SPE), sensitivity (SEN), AUC and the ROC curve as evaluation indexes; testing on the test set with the trained network weight parameters and the trained network model, averaging the ten-fold prediction results obtained, and repeating the experiment three times to obtain the final test set prediction result.

Compared with the prior art, the invention embeds a graph learning layer in the network model to learn the optimal graph representation of the gene interaction relationships; tumor metastasis is predicted under the GCN framework, which extracts informative high-level features from the constructed irregular graph structure. At the graph learning layer, more attention is given to the gene-gene relations of the domain initial graph, so that more accurate prediction performance is obtained.

Drawings

FIG. 1 is a general flow chart of the method for identifying distant metastasis based on an optimized representation of the gene interaction pattern according to the present invention;

FIG. 2 is a diagram of an example of a PPI network for PAAD;

FIG. 3 is a framework diagram of the overall implementation of a network;

FIG. 4 is a schematic diagram of an embedded graph learning layer structure;

FIG. 5 is the ROC curve of the glmGCN model on the CESC data set in an embodiment;

FIG. 6 is the ROC curve of the glmGCN model on the STAD data set in an embodiment;

FIG. 7 is the ROC curve of the glmGCN model on the PAAD data set in an embodiment;

FIG. 8 is the ROC curve of the glmGCN model on the BLCA data set in an embodiment.

Detailed Description

The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

A method for distant metastasis identification based on optimized representation of gene interaction patterns, comprising the steps of:

step 1, data preprocessing

step 1-1, preparing the data sets, which comprise CESC data samples of cervical squamous cell carcinoma and cervical adenocarcinoma, STAD data samples of gastric adenocarcinoma, PAAD data samples of pancreatic cancer and BLCA data samples of bladder urothelial carcinoma (the invention is not limited to these data samples);

the CESC, STAD, PAAD and BLCA sample data sets are mRNA and lncRNA data of the TCGA database downloaded from Sanger Box; each data set comprises gene expression data and clinical information, and the gene expression data comprise at least the gene read counts (Counts) and the gene expression level FPKM; the Counts data are used for differential analysis, and the normalized expression level FPKM is used to classify more accurately whether metastasis occurs;

in an embodiment of the invention, there are 309 samples in the CESC sample data set, 407 samples in the STAD sample data set, 182 samples in the PAAD sample data set and 427 samples in the BLCA sample data set, covering 19814 mRNAs and 14851 lncRNAs;

step 1-2, processing the clinical information obtained in the step 1-1 to obtain classification labels of all data sets;

in an embodiment of the invention, in the CESC sample data set 34 patients developed distant metastasis and the remaining 275 did not (275 + 34 = 309); in the STAD sample data set 347 patients did not develop distant metastasis and 60 did (347 + 60 = 407); in the PAAD sample data set 122 patients did not develop distant metastasis and 60 did (122 + 60 = 182); in the BLCA sample data set 96 patients developed distant metastasis and 331 did not (331 + 96 = 427);

step 1-3, performing differential gene screening on the gene expression data obtained in the step 1-1 to obtain a differential gene set DEG, and further screening the gene expression data according to the differential gene set DEG, wherein the method specifically comprises the following steps:

the grouping file is made from the sample name suffix: a sample whose name ends in -01 or -06 is labeled tumor, and a sample whose name ends in -11 is labeled normal. Differential gene analysis is performed using the edgeR package with the criteria log2FC > 1 (FC, fold change, is the ratio of expression levels between two samples or groups) and FDR < 0.05 (False Discovery Rate). The differential genes screened from the Counts data are then matched against the original FPKM data by sample name to obtain the FPKM data corresponding to the differential genes.

In an embodiment of the present invention, 1515 DEGs (1197 mRNAs + 318 lncRNAs) are obtained for the CESC sample data set, 4113 DEGs (2903 mRNAs + 1219 lncRNAs) for the STAD sample data set, 116 DEGs (77 mRNAs + 39 lncRNAs) for the PAAD sample data set and 2767 DEGs (2118 mRNAs + 649 lncRNAs) for the BLCA sample data set;
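The grouping and thresholding rules of step 1-3 can be sketched in plain Python. This is a minimal illustration, not the edgeR workflow itself: the helper names `label_sample` and `select_deg`, the sample names and the result rows are hypothetical, and only the suffix rule and the thresholds (log2FC > 1, FDR < 0.05) come from the text.

```python
def label_sample(sample_name):
    """Group a sample by its name suffix: -01 or -06 is tumor, -11 is normal."""
    if sample_name.endswith(("-01", "-06")):
        return "tumor"
    if sample_name.endswith("-11"):
        return "normal"
    return None  # other suffixes are not used in the grouping file


def select_deg(results, log2fc_min=1.0, fdr_max=0.05):
    """Keep genes whose log2FC and FDR pass the stated thresholds.

    `results` is a hypothetical list of (gene, log2FC, FDR) rows standing in
    for a differential-expression table exported from edgeR.
    """
    return [gene for gene, log2fc, fdr in results
            if log2fc > log2fc_min and fdr < fdr_max]


# Made-up rows for illustration only.
rows = [("GENE_A", 2.3, 0.001), ("GENE_B", 0.4, 0.010), ("GENE_C", 1.8, 0.200)]
deg = select_deg(rows)

# Matching step: keep only the FPKM entries of the selected genes.
fpkm = {"GENE_A": [5.1, 7.4], "GENE_B": [0.2, 0.3], "GENE_C": [1.0, 9.9]}
fpkm_deg = {gene: fpkm[gene] for gene in deg}
```

Only GENE_A passes both thresholds here, so `fpkm_deg` keeps a single gene.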

step 1-4, oversampling the gene expression data screened in step 1-3 by synthesizing minority-class samples: the minority class is analyzed with the SMOTE (Synthetic Minority Oversampling Technique) model, new samples are artificially synthesized from the minority-class samples and added to the data set to balance it and alleviate overfitting, yielding expression data X with a balanced class ratio; the CESC, STAD, PAAD and BLCA sample data sets are each oversampled in this way; the method specifically comprises the following steps:

calculating the Euclidean distance from each minority-class sample x to all other samples in the minority-class sample set to obtain the k nearest neighbors of x;

setting a sampling ratio according to the sample imbalance ratio to determine the sampling multiplier N, and, for each minority-class sample x, randomly selecting several samples from its k nearest neighbors, each randomly selected neighbor being denoted x̂;

constructing a new sample x_new from x and each selected neighbor x̂ as follows:

x_new = x + rand(0,1) × (x̂ - x)

where x̂ denotes each randomly selected neighbor, and rand(0,1) denotes a random number between 0 and 1.

In an embodiment of the present invention, the ratio of metastatic to non-metastatic samples is set to 1:1. After oversampling, the metastatic class and the non-metastatic class each contain 275, 347, 122 and 331 samples in the CESC, STAD, PAAD and BLCA sample data sets, respectively, so the total numbers of samples are 550, 694, 244 and 662, respectively.
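The oversampling of step 1-4 can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the exact procedure of the embodiment: the function name `smote`, the neighbor count k = 5 and the toy feature matrix are hypothetical, and the base sample is drawn at random instead of applying a fixed multiplier N to every minority sample.

```python
import numpy as np


def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbor
    k = min(k, len(minority) - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]   # k nearest neighbors per sample

    synthetic = np.empty((n_new, minority.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(minority))       # pick a minority sample x
        j = neighbors[i, rng.integers(k)]     # pick one of its k neighbors
        gap = rng.random()                    # rand(0, 1)
        synthetic[t] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic


# CESC-sized toy example: 34 minority samples topped up to 275.
rng = np.random.default_rng(1)
minority = rng.normal(size=(34, 8))           # 8 made-up features
new_samples = smote(minority, n_new=275 - 34)
balanced_minority = np.vstack([minority, new_samples])
```

Because every synthetic sample lies on the segment between two real minority samples, the new points stay inside the range already covered by the minority class.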

Step 2, building and processing PPI

step 2-1, according to the differential gene screening result obtained in step 1-3, retrieving from the STRING database the potential interactions among the proteins encoded by these genes and using them to construct a protein-protein interaction (PPI) network; the PPI network represents the graph relationship between genes, from which an adjacency matrix A can be obtained, wherein the gene interaction scores measure the strength of the relationships between nodes (genes);

in an embodiment of the invention, the PPI network of CESC comprises 15664 gene interactions involving 1106 genes, the PPI network of STAD comprises 36389 gene interactions involving 2800 genes, the PPI network of PAAD comprises 35 interactions involving 33 genes, and the PPI network of BLCA comprises 28183 interactions involving 2014 genes.
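Turning an interaction list into the adjacency matrix A of step 2 can be sketched as follows. The function name `build_adjacency`, the gene names and the scores are hypothetical, and treating the interaction scores as values scaled to [0, 1] is an assumption rather than something stated in the text.

```python
import numpy as np


def build_adjacency(genes, edges):
    """Build a symmetric, score-weighted adjacency matrix from an edge list."""
    index = {g: i for i, g in enumerate(genes)}
    A = np.zeros((len(genes), len(genes)))
    for g1, g2, score in edges:
        if g1 in index and g2 in index:       # keep only edges between listed genes
            i, j = index[g1], index[g2]
            A[i, j] = A[j, i] = score         # undirected interaction
    return A


genes = ["G1", "G2", "G3"]
# (gene, gene, interaction score); the last edge points outside the gene set.
edges = [("G1", "G2", 0.9), ("G2", "G3", 0.4), ("G3", "G9", 0.7)]
A = build_adjacency(genes, edges)
```

The diagonal stays zero at this stage, since the edge list carries no self-interactions.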

Step 3, dividing the training set and the test set, specifically comprising the following steps:

step 3-1, dividing the training set and the test set of the CESC samples:

performing ten-fold division, using the StratifiedKFold method, on the gene expression data X of the CESC sample data set obtained in step 1-4 after minority-class oversampling, taking each fold in turn as the test data CESC_Xtest with test labels CESC_Ytest, and taking the remaining nine folds as the training data CESC_Xtrain with training labels CESC_Ytrain;

in an embodiment of the invention, the gene expression data X of the CESC sample data set comprises 1515 gene features for 550 samples, CESC_Xtest has approximately 55 samples and CESC_Xtrain has approximately 495 samples;

step 3-2, dividing the training set and the test set of the STAD samples:

performing ten-fold division, using the StratifiedKFold method, on the gene expression data X of the STAD sample data set obtained in step 1-4 after minority-class oversampling, taking each fold in turn as the test data STAD_Xtest with test labels STAD_Ytest, and taking the remaining nine folds as the training data STAD_Xtrain with training labels STAD_Ytrain;

in an embodiment of the invention, the gene expression data X of the STAD sample data set comprises 4113 gene features for 694 samples, STAD_Xtest has approximately 69 samples and STAD_Xtrain has approximately 625 samples;

step 3-3, dividing the training set and the test set of the PAAD samples:

performing ten-fold division, using the StratifiedKFold method, on the gene expression data X of the PAAD sample data set obtained in step 1-4 after minority-class oversampling, taking each fold in turn as the test data PAAD_Xtest with test labels PAAD_Ytest, and taking the remaining nine folds as the training data PAAD_Xtrain with training labels PAAD_Ytrain;

in an embodiment of the invention, the gene expression data X of the PAAD sample data set comprises 116 gene features for 244 samples, PAAD_Xtest has approximately 24 samples and PAAD_Xtrain has approximately 220 samples;

step 3-4, dividing the training set and the test set of the BLCA samples:

performing ten-fold division, using the StratifiedKFold method, on the gene expression data X of the BLCA sample data set obtained in step 1-4 after minority-class oversampling, taking each fold in turn as the test data BLCA_Xtest with test labels BLCA_Ytest, and taking the remaining nine folds as the training data BLCA_Xtrain with training labels BLCA_Ytrain;

in an embodiment of the invention, the gene expression data X of the BLCA sample data set comprises 2767 gene features for 662 samples, BLCA_Xtest has approximately 66 samples and BLCA_Xtrain has approximately 596 samples;
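The ten-fold stratified division of step 3 can be illustrated with a dependency-free NumPy sketch. In practice the StratifiedKFold class named in the text would be used; the hand-written `stratified_kfold` below is a hypothetical stand-in whose fold assignment need not match that library's, and the labels reproduce only the CESC class sizes (275 + 275).

```python
import numpy as np


def stratified_kfold(y, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions per fold."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold_of = np.empty(len(y), dtype=int)
    offset = 0
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        # Deal this class round-robin over the folds, continuing where the
        # previous class stopped so that fold sizes stay even.
        fold_of[idx] = (np.arange(len(idx)) + offset) % n_splits
        offset += len(idx)
    for f in range(n_splits):
        yield np.flatnonzero(fold_of != f), np.flatnonzero(fold_of == f)


# CESC after oversampling: 275 metastatic and 275 non-metastatic samples.
y = np.array([0] * 275 + [1] * 275)
folds = list(stratified_kfold(y))
train_idx, test_idx = folds[0]
```

With 550 balanced samples each test fold holds 55 samples and each training split 495, which matches the approximate sizes given for CESC above.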

Step 4, constructing the glmGCN model based on the gene interaction pattern optimized graph representation, specifically comprising the following steps:

step 4-1, constructing an embedded graph learning network:

an adjacency matrix A ∈ R^(p×p) is constructed from the PPI network. The PPI network does not record the interaction of a gene with itself, so the diagonal elements of A would mark each gene as not interacting with itself; an identity matrix I is therefore added to A so that every gene is treated as directly interacting with itself.

Based on the gene expression matrix G and the adjacency matrix A, a nonlinear function S_ij = h(g_i, g_j) is defined, where g_i and g_j denote the gene expression of the i-th and j-th genes, respectively. At the same time, a projection matrix P_pj ∈ R^(n×d), where d < n, projects the data into a low-dimensional space to reduce computational complexity. A new graph representation is obtained using a single-layer network as shown in the following equation:

where the exponent applied to A denotes the power, and p denotes the total number of genes.

The weight vector is a = (a₁, a₂, ..., a_n)^T ∈ R^(n×1), and a sigmoid or tanh function is used as the activation function σ(·). Compared with GLCN, the treatment of the adjacency matrix A is improved: raising A to a power emphasizes the importance of the domain initial graph, so that the difference between strong and weak gene interactions in the original PPI network graph becomes larger. The weight vector a and the projection matrix P_pj are optimized by the following loss function L₁ (Nie F, "Clustering and projected clustering with adaptive neighbors," 2014):

where γ and β are two manually adjustable constants, and F denotes the Frobenius norm.

If the distance between g_i and g_j is larger, S_ij should be smaller; conversely, if the distance is smaller, S_ij should be larger. The second term is a regularization term that controls the sparsity of the learned graph S, where γ is the regularization coefficient. The third term integrates the PPI network information into the loss function as a constraint, which makes the loss function more reasonable.
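The behaviour described for the graph learning layer can be illustrated with a NumPy sketch. The formulas below are assumptions modelled on the GLCN formulation cited in the Background, not the equations of this document: the exponent name `alpha`, the element-wise reading of "power", the omission of the projection matrix, the use of tanh as σ, and the use of the prior A in the third loss term are all hypothetical choices.

```python
import numpy as np


def learn_graph(G, A, a, alpha=2.0):
    """Return a row-normalized graph S from expression G (p x n) and prior A (p x p)."""
    p = G.shape[0]
    A_hat = (A + np.eye(p)) ** alpha                 # add self-loops, element-wise power
    diff = np.abs(G[:, None, :] - G[None, :, :])     # |g_i - g_j|, shape (p, p, n)
    score = np.tanh(diff @ a)                        # sigma(a^T |g_i - g_j|), shape (p, p)
    weighted = A_hat * np.exp(score)                 # prior graph gates the learned score
    return weighted / weighted.sum(axis=1, keepdims=True)   # softmax-style rows


def graph_loss(G, S, A, gamma=0.1, beta=0.1):
    """Distance term + sparsity regularizer + closeness to the prior graph."""
    sq_dist = ((G[:, None, :] - G[None, :, :]) ** 2).sum(axis=-1)
    return (sq_dist * S).sum() + gamma * (S ** 2).sum() + beta * ((S - A) ** 2).sum()


rng = np.random.default_rng(0)
G = rng.random((4, 6))                               # 4 genes, 6 samples (toy sizes)
A = np.array([[0.0, 0.9, 0.0, 0.2],
              [0.9, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.2, 0.0, 0.0, 0.0]])
a = rng.normal(size=6)                               # weight vector, one entry per sample
S = learn_graph(G, A, a)
L1 = graph_loss(G, S, A)
```

In this sketch a gene pair with no edge in the prior graph keeps a zero weight in S, and with scores below 1 a power above 1 widens the relative gap between strong and weak interactions.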

If there is no initial graph, i.e. only the gene expression matrix is available, the learned graph S can be defined as:

In this case, the weight vector is optimized by the following loss function:

step 4-2, constructing a graph convolution network:

when the graph learning layer obtains the new graph representation S with a single-layer network, the final softmax operation of equation (3) or equation (5) in step 4-1 ensures that the learned graph S satisfies:

Σ_{j=1}^{p} S_ij = 1, S_ij ≥ 0 (7)

Thus, the graph G(X, S) is defined, and the representation of the graph is learned from the data information X and the adjacency matrix S. In the graph convolution layer, a layer-wise propagation rule is executed based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, i.e.

O_i ^(k+1)＝σ(S·O_i ^(k)·W^(k)),for i＝1,2,...,n (8)

where k = 0, 1, ..., K-1, O_i^(k+1) denotes the output of the (k+1)-th layer, O_i^(0) denotes the gene expression of the i-th gene, σ(·) denotes the activation function, and W^(k) denotes the trainable weight matrix of each graph convolution layer; W^(0) ∈ R^(1×h(0)) is the input-to-hidden weight matrix for a hidden layer with h(0) feature maps, and W^(K) ∈ R^(h(K-1)×C) is the hidden-to-output weight matrix for a hidden layer with h(K-1) feature maps, where C is the number of classes (C = 2 in the experiments).

If the output is taken directly after the graph convolution layers have extracted the features, the final perceptron layer is defined as follows:

Z = softmax(S·O_i^(K)·W^(K)) (9)

where the weight matrix W^(K) ∈ R^(h(K-1)×C), C denotes the number of classes, the output Z ∈ R^(n×C) denotes the label prediction of the glmGCN model, and Z_i denotes the label prediction of the i-th node.

In an embodiment of the invention, for the CESC sample data set the dimension of O_i^(0) is 1515 × 1, there are two graph convolution layers, W^(0) is a 1 × 25 parameter matrix and W^(1) is a 25 × 2 parameter matrix, both drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 1515 × 1515;

in an embodiment of the invention, for the STAD sample data set the dimension of O_i^(0) is 4113 × 1, there are two graph convolution layers, W^(0) is a 1 × 25 parameter matrix and W^(1) is a 25 × 2 parameter matrix, both drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 4113 × 4113;

in an embodiment of the invention, for the PAAD sample data set the dimension of O_i^(0) is 116 × 1, there are two graph convolution layers, W^(0) is a 1 × 25 parameter matrix and W^(1) is a 25 × 2 parameter matrix, both drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 116 × 116;

in an embodiment of the invention, for the BLCA sample data set the dimension of O_i^(0) is 2767 × 1, there are two graph convolution layers, W^(0) is a 1 × 25 parameter matrix and W^(1) is a 25 × 2 parameter matrix, both drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 2767 × 2767;
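The propagation rule of equation (8) and the direct-output form of equation (9) can be sketched with the PAAD dimensions given above (116 genes, weights of size 1 × 25 and 25 × 2). The random graph S, the random input, the choice of ReLU for the hidden activation and the helper names are assumptions for illustration; in the described model the graph convolution output is further passed to fully connected layers.

```python
import numpy as np


def xavier_uniform(shape, rng):
    """Uniform Xavier (Glorot) initialization for a 2-D weight matrix."""
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)


def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))     # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)


def gcn_forward(S, O0, W0, W1):
    """Two graph convolution layers: O1 = relu(S O0 W0), Z = softmax(S O1 W1)."""
    O1 = np.maximum(S @ O0 @ W0, 0.0)                # ReLU used as sigma (assumption)
    return softmax(S @ O1 @ W1)


rng = np.random.default_rng(0)
p = 116                                              # number of PAAD differential genes
S = rng.random((p, p))
S /= S.sum(axis=1, keepdims=True)                    # row-normalized stand-in for the learned graph
O0 = rng.random((p, 1))                              # expression of one sample over p genes
W0 = xavier_uniform((1, 25), rng)
W1 = xavier_uniform((25, 2), rng)
Z = gcn_forward(S, O0, W0, W1)                       # one 2-class distribution per node
```

The shapes chain as (116 × 116)(116 × 1)(1 × 25) and then (116 × 116)(116 × 25)(25 × 2), which is why W^(0) has a single input row.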

step 4-3, constructing a full connection network layer:

a fully connected (FC) network is constructed to comprehensively extract the feature information. Several network layers are set; the weights and biases are initialized in the first network layer and forward propagation is carried out; after each iteration the network error is calculated, and the error and the gradient are fed back to the network model to update the network weights and reduce the error of subsequent iterations. After the information has passed through the several layers, the error and the prediction accuracy of the network model are obtained, and the model with the minimum error is selected as the final model.

In the embodiment of the invention, the CESC data set full connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;

in the embodiment of the invention, the STAD data set full connection layer is 5 layers, and the number of the neurons is 2048, 1024, 512, 256 and 2 respectively;

in the embodiment of the invention, the PAAD data set full-connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;

in the embodiment of the invention, the BLCA data set full connection layer is 5 layers, and the number of the neurons is 4096, 2048, 1024, 512 and 2 respectively;

step 4-4, constructing a loss function:

assuming the predicted label is Z ∈ R^(n×2), in which each row is the predicted label vector of the i-th sample, and the true label is Y ∈ R^(n×2), in which each row is the true label vector of the i-th sample, parameter optimization is performed using the cross-entropy loss function (T denotes the training set):

L₂ = -Σ_{i∈T} Σ_{j=1}^{2} Y_ij ln Z_ij (10)

this loss function mainly optimizes the parameters of the network layers from the graph convolution layer to the fully connected layer, and all the parameters of the whole architecture are optimized with the following combined loss:

L_glmGCN＝L₂+λL₁ (11)

in an embodiment of the invention, λ is 0.1 for the CESC sample data set and 0.01 for the STAD, PAAD and BLCA sample data sets;
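The combination in equation (11) can be made concrete with a small numeric sketch. The cross-entropy below is the standard form restricted to the training rows; the toy predictions, the one-hot labels and the placeholder value used for L₁ are made up for illustration, and λ = 0.1 is the CESC setting quoted above.

```python
import numpy as np


def cross_entropy(Z, Y, train_idx, eps=1e-12):
    """L2: cross-entropy summed over the rows that belong to the training set."""
    Z_t, Y_t = Z[train_idx], Y[train_idx]
    return -np.sum(Y_t * np.log(Z_t + eps))          # eps guards against log(0)


def total_loss(l2, l1, lam):
    """Equation (11): classification loss plus weighted graph learning loss."""
    return l2 + lam * l1


Z = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])   # predicted class probabilities
Y = np.array([[1, 0], [0, 1], [1, 0]])               # one-hot true labels
train_idx = [0, 1]                                   # third row is held out
l2 = cross_entropy(Z, Y, train_idx)                  # -(ln 0.9 + ln 0.8)
loss = total_loss(l2, l1=2.0, lam=0.1)               # l1 = 2.0 is a placeholder value
```

A smaller λ, as used for STAD, PAAD and BLCA, lets the classification term dominate the joint objective.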

step 5, training and evaluating the glmGCN model based on the gene interaction pattern optimized graph representation, specifically comprising the following steps:

step 5-1, inputting the training set CESC_Xtrain of each fold in step 3-1 into the glmGCN model of step 4, and optimizing by minimizing the loss function in step 4-4 to obtain the network weights W_CESC.

Inputting the training set STAD_Xtrain of each fold in step 3-2 into the glmGCN model of step 4, and optimizing by minimizing the loss function in step 4-4 to obtain the network weights W_STAD.

Inputting the training set PAAD_Xtrain of each fold in step 3-3 into the glmGCN model of step 4, and optimizing by minimizing the loss function in step 4-4 to obtain the network weights W_PAAD.

Inputting the training set BLCA_Xtrain of each fold in step 3-4 into the glmGCN model of step 4, and optimizing by minimizing the loss function in step 4-4 to obtain the network weights W_BLCA.

step 5-2, evaluating the trained model with accuracy (ACC), specificity (SPE), sensitivity (SEN), F1-score (F1) and AUC as evaluation indexes, and displaying the relation between sensitivity and specificity with the ROC curve. The basic definitions are as follows: FN (false negative) denotes the number of samples judged negative that are actually positive, FP (false positive) denotes the number of samples judged positive that are actually negative, TN (true negative) denotes the number of samples judged negative that are actually negative, and TP (true positive) denotes the number of samples judged positive that are actually positive. Precision, calculated as Precision = TP/(TP + FP), is the proportion of samples predicted positive that are actually positive. Recall, calculated as Recall = TP/(TP + FN), is the proportion of actually positive samples that are predicted positive.

The evaluation index principle is as follows:

Sensitivity (SEN), also called the true positive rate (TPR) or recall, is the proportion of actually positive samples that are predicted positive. Specificity (SPE), also called the true negative rate (TNR), is the proportion of actually negative samples that are predicted negative. Accuracy (ACC) is a common evaluation index giving the proportion of correctly predicted samples among all samples; the higher the accuracy, the better the classifier. The F1-score is the harmonic mean of precision and recall and combines the two; the higher the F1-score, the more effective the method. ROC and AUC are two further indexes for evaluating a classifier: ROC is short for receiver operating characteristic curve, and AUC is short for area under the ROC curve. The ROC curve is plotted from sensitivity and specificity and shows how they vary together as the decision threshold changes. The AUC value is the area under the ROC curve and generally lies between 0.5 and 1; the larger the AUC value, the higher the classification prediction accuracy and the better the classification effect.
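The evaluation indexes defined above can be computed from a confusion matrix as in the following sketch. The function names and the toy labels and scores are hypothetical; the AUC is computed as the fraction of positive-negative pairs in which the positive sample receives the higher score (ties counted as one half), which equals the area under the ROC curve. The sketch assumes both classes are present and at least one sample is predicted positive.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 meaning metastasis."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def scores_from_counts(tp, tn, fp, fn):
    """ACC, SEN, SPE and F1 as defined in step 5-2."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": recall,
        "SPE": tn / (tn + fp),
        "F1": 2 * precision * recall / (precision + recall),
    }


def auc(y_true, y_score):
    """Pairwise-ranking AUC: share of (positive, negative) pairs ordered correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
counts = confusion_counts(y_true, y_pred)       # (3, 3, 1, 1)
scores = scores_from_counts(*counts)            # every index equals 0.75 here
area = auc(y_true, y_score)                     # 15 of 16 pairs ordered correctly
```

In this toy case only one of the sixteen positive-negative pairs is ranked the wrong way (score 0.4 against 0.6), so the AUC is 0.9375 even though the thresholded accuracy is 0.75.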

step 5-3, testing on the test set CESC_Xtest of each fold in step 3-1 with the network weight parameters W_CESC trained for that fold in step 5-1 and the network model of step 4.

Testing on the test set STAD_Xtest of each fold in step 3-2 with the network weight parameters W_STAD trained for that fold in step 5-1 and the network model of step 4.

Testing on the test set PAAD_Xtest of each fold in step 3-3 with the network weight parameters W_PAAD trained for that fold in step 5-1 and the network model of step 4.

Testing on the test set BLCA_Xtest of each fold in step 3-4 with the network weight parameters W_BLCA trained for that fold in step 5-1 and the network model of step 4.

step 5-4, averaging the ten-fold prediction results obtained in step 5-3, and repeating the experiment three times to obtain the final test set prediction result. All data sets (CESC, STAD, BLCA, PAAD) are trained and tested in this way.

Example:

For the CESC sample data set, following the method of step 5, the training set CESC_Xtrain of each fold in step 3-1 is input into the glmGCN model of step 4 and optimized by minimizing the loss function in step 4-4 to obtain the network weights W_CESC. The average of the ten-fold cross-validation results over three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.

For the STAD sample data set, following the method of step 5, the training set STAD_Xtrain of each fold in step 3-2 is input into the glmGCN model of step 4 and optimized by minimizing the loss function in step 4-4 to obtain the network weights W_STAD. The average of the ten-fold cross-validation results over three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.

For the PAAD sample data set, following the method of step 5, the training set PAAD_Xtrain of each fold in step 3-3 is input into the glmGCN model of step 4 and optimized by minimizing the loss function in step 4-4 to obtain the network weights W_PAAD. The average of the ten-fold cross-validation results over three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.

For the BLCA sample data set, following the method of step 5, the training set BLCA_Xtrain of each fold in step 3-4 is input into the glmGCN model of step 4 and optimized by minimizing the loss function in step 4-4 to obtain the network weights W_BLCA. The average of the ten-fold cross-validation results over three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.

Tables 1 to 4 show the results of the glmGCN model on the CESC, STAD, PAAD and BLCA sample data sets, respectively. These experiments demonstrate the effectiveness of the method.

TABLE 1

Methods	ACC(%)	SPE(%)	SEN(%)	F1-SCORE(%)	AUC
glmGCN	98.92	99.64	98.19	98.89	0.9945

TABLE 2

Methods	ACC(%)	SPE(%)	SEN(%)	F1-SCORE(%)	AUC
glmGCN	97.39	98.26	96.52	97.35	0.9927

TABLE 3

Methods	ACC(%)	SPE(%)	SEN(%)	F1-SCORE(%)	AUC
glmGCN	79.56	84.36	74.76	78.44	0.8523

TABLE 4

Methods	ACC(%)	SPE(%)	SEN(%)	F1-SCORE(%)	AUC
glmGCN	92.04	94.66	89.41	91.73	0.9652

The invention has been described in an illustrative manner, and it is to be understood that any simple variations, modifications or other equivalent changes which can be made by one skilled in the art without departing from the spirit of the invention fall within the scope of the invention.

Claims

1. A method for identifying distant metastasis based on optimized representation of gene interaction patterns, the method comprising the steps of:

step 4-1, constructing an embedded graph learning network layer, and acquiring a new graph representation on the basis of the initial graph by using a single-layer network:

wherein a = (a₁, a₂, ..., a_n)^T is a weight vector, σ(·) is an activation function, A represents the adjacency matrix, g_i and g_j respectively represent the gene expression of the i-th and j-th genes,

represents the power, p represents the total number of genes;

O_i^(k+1) = σ(S · O_i^(k) · W^(k)), for i = 1, 2, ..., n

step 4-3, constructing fully connected network layers to comprehensively extract feature information, wherein a plurality of fully connected network layers are set and the final prediction result is obtained through the processing of the plurality of fully connected network layers.
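For illustration only, the propagation rule O_i^(k+1) = σ(S · O_i^(k) · W^(k)) recited above can be sketched in Python for a single sample i. The sketch assumes ReLU as the activation σ and treats S as an already computed p × p graph matrix; the function names `gcn_layer` and `gcn_forward` and all array shapes are hypothetical choices, not taken from the claims.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU, used here as an assumed choice for the activation."""
    return np.maximum(x, 0.0)

def gcn_layer(S, O, W, activation=relu):
    """One propagation step: O_next = activation(S @ O @ W).

    S: (p, p) graph matrix over the p genes
    O: (p, d_in) node features of one sample at layer k
    W: (d_in, d_out) trainable weights of layer k
    """
    return activation(S @ O @ W)

def gcn_forward(S, O0, weights):
    """Apply gcn_layer once per weight matrix, starting from features O0."""
    O = O0
    for W in weights:
        O = gcn_layer(S, O, W)
    return O
```

The output of the last layer, flattened per sample, would then be the input of the fully connected layers of step 4-3.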

2. The method for identifying distant metastasis based on optimized representation of gene interaction patterns according to claim 1, wherein in step 4-1, if there is no initial graph, i.e., only the gene expression matrix is available, the learned graph S is defined as:

wherein a = (a₁, a₂, ..., a_n)^T is a weight vector, σ(·) is an activation function, g_i and g_j respectively represent the gene expression of the i-th and j-th genes,

represents the power, and p represents the total number of genes.