CN114141306B

CN114141306B - Distant metastasis identification method based on gene interaction mode optimization graph representation

Info

Publication number: CN114141306B
Application number: CN202111400313.2A
Authority: CN
Inventors: 苏苒; 朱莹莹
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2023-04-07
Anticipated expiration: 2041-11-19
Also published as: CN114141306A

Abstract

The invention discloses a distant metastasis identification method based on a gene interaction pattern optimized graph representation, which comprises: data preprocessing; constructing and processing a protein-protein interaction (PPI) network; dividing training and test sets; constructing a glmGCN model based on the gene interaction pattern optimized graph representation; cross-validating the network model with ten folds; and applying the model to the test set. Compared with the prior art, the method predicts tumor metastasis under the GCN framework, and its graph learning layer pays more attention to the gene-gene relations of the initial domain graph, so that more accurate prediction performance is obtained.

Description

Distant metastasis identification method based on gene interaction mode optimization graph representation

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a distant metastasis identification method based on a gene interaction pattern optimized graph representation.

Background

Tumor metastasis refers to the process by which tumor cells spread from the primary site and continue to grow and form tumors at sites other than the primary site by invading lymphatic and blood vessels. Metastasis can be divided into regional metastasis and distant metastasis. In order to improve the cure rate of cancer and reduce the suffering of patients, it is necessary to predict whether cancer patients have metastasis and then select an appropriate treatment strategy.

In recent years, transcriptome data generated by microarray and RNA sequencing techniques have been widely used to explore the molecular properties of metastasis. For example, the paper "Integrative clinical genomics of metastatic cancer" by Robinson et al. performed whole-exome and transcriptome sequencing of adult patients with metastatic solid tumors from different lineages and biopsy sites, providing a clinically relevant molecular landscape of metastatic tumors. The paper "Proteogenomic characterization and comprehensive integrative genomic analysis of human colorectal cancer liver metastasis" by Ma et al. provides a new paradigm for understanding liver metastasis of human colorectal cancer through proteome analysis, whole-exome and transcriptome sequencing, and single nucleotide polymorphism array analysis. The paper "A molecular signature of metastasis in primary solid tumors" by Ramaswamy et al. also identified gene signatures that distinguish metastatic sites from primary sites based on tumor metastasis expression profiles.

However, these studies do not address the direct identification of metastatic samples. In addition, cancer has a complex mechanism. The interaction between genes plays a very important role in this mechanism and should be considered. Biological networks, such as protein-protein interaction (PPI) networks, therefore provide more reliable insights and have been used in many cancer-related studies.

Direct prediction of metastasis has been achieved in a few computational studies, and most existing studies predict cancer metastasis with traditional machine learning algorithms. For example, the paper "Support vector machine classifier for prediction of the metastasis of colorectal cancer" by Zhi et al. integrates several transcriptome data sets, screens feature genes, and builds an optimal support vector machine (SVM) model to distinguish metastatic from non-metastatic samples. In the paper "Machine learning prediction of lymph node metastasis of poorly differentiated-type intramucosal gastric cancer" by Zhou et al., seven machine learning algorithms are used to predict lymph node metastasis of poorly differentiated-type intramucosal gastric cancer.

Since biological networks are graph data with irregular spatial features, the advent of graph-based deep learning techniques has provided opportunities for network biology analysis. The paper "Hybrid Approach of Relation Network and Localized Graph Convolutional Filtering for Breast Cancer Subtype Classification" by Rhee et al. indicates that the graph convolutional network (GCN) is a CNN that acts directly on graph data, takes into account both the feature information and the structure information of nodes, and can discover and mine information that is more hidden and complex than general rules, where the representation of each node propagates through edges until a stable equilibrium is reached. Various GCN architectures have recently been proposed, such as the simplified graph convolutional network proposed by Kipf et al. for semi-supervised learning, which operates directly on the graph. However, the graph used by a GCN is either a fixed graph structure, such as an existing biological network with a specific meaning in a specific domain, or a human-constructed graph, such as a k-nearest neighbor graph with a Gaussian kernel, a supervised graph built with a fully connected network as proposed in the paper "Deep Convolutional Networks on Graph-Structured Data" by Henaff et al., or a graph learned by a distance metric as proposed in the paper "Graph Attention Networks" by Veličković et al. These graphs may not be well suited to the GCN architecture. To overcome this problem, the paper "Semi-supervised Learning with Graph Learning-Convolutional Networks" by Jiang et al. proposes a Graph Learning-Convolutional Network (GLCN) that combines graph learning and graph convolution in a unified network structure. By combining the given labels and the predicted labels and training through a single optimization procedure, the graph structure can be refined and the performance improved.

Disclosure of Invention

Based on the prior art, the invention provides a distant metastasis identification method based on a gene interaction pattern optimization graph representation, which constructs a gene interaction graph representation and predicts distant metastasis of tumors under a graph convolution network GCN framework.

The invention is realized by the following technical scheme:

a method for identifying distant metastasis based on an optimized representation of gene interaction patterns, the method comprising the steps of:

step 1, preprocessing data, including preparing a data set comprising gene expression data and clinical information, screening the gene expression data to obtain a differential gene set DEG, and oversampling the minority class by synthesizing minority-class samples from the screened gene expression data with the SMOTE model;

step 2, constructing and processing a protein interaction network PPI, and obtaining an adjacency matrix from the graph relation;

step 3, dividing the data into ten folds by using the StratifiedKFold method, wherein one fold is used as the test set and the remaining nine folds are used as the training set;

step 4, constructing a glmGCN model based on a gene interaction pattern optimized graph representation, which specifically comprises the following processes:

step 4-1, constructing an embedded graph learning network layer, and obtaining a new graph representation by using a single-layer network:

where $a = (a_1, a_2, \ldots, a_n)^T$ is a weight vector, $\sigma(\cdot)$ is an activation function, $A$ denotes the adjacency matrix, $g_i$ and $g_j$ denote the gene expression of the i-th and j-th genes, respectively,

represents the power, p represents the total number of genes;

step 4-2, constructing a graph convolution network layer, and executing a layer-wise propagation rule based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, the output of the graph convolution network layer being expressed as follows:

$O_i^{(k+1)} = \sigma\left(S \cdot O_i^{(k)} \cdot W^{(k)}\right), \quad i = 1, 2, \ldots, n$

where $k = 0, 1, \ldots, K-1$, $O_i^{(k+1)}$ denotes the output of the (k+1)-th layer, $O_i^{(0)}$ denotes the gene expression of the i-th sample, $\sigma(\cdot)$ denotes the activation function, and $W^{(k)}$ denotes the trainable weight matrix of each graph convolution layer;

step 4-3, constructing a fully connected network layer to comprehensively extract feature information, which comprises setting a plurality of fully connected layers and processing the features through them to obtain the final prediction result;

step 5, training the glmGCN model based on the gene interaction pattern optimized graph representation, and evaluating the trained model by taking accuracy (ACC), specificity (SPE), sensitivity (SEN), AUC and the ROC curve as evaluation indexes; and testing on the test set by using the trained network weight parameters and the trained network model, averaging the obtained ten-fold prediction results, and repeating the experiment three times to obtain the final test set prediction result.

Compared with the prior art, the invention embeds a graph learning layer in the network model to learn the optimal graph representation of the gene interaction relations; tumor metastasis is predicted under the GCN framework, which extracts informative high-level features from the constructed irregular graph structure. The graph learning layer pays more attention to the gene-gene relations of the initial domain graph, so that more accurate prediction performance is obtained.

Drawings

FIG. 1 is a general flow chart of the method for identifying distant metastasis based on an optimized representation of the gene interaction pattern according to the present invention;

FIG. 2 is a diagram of an example of a PPI network for PAAD;

FIG. 3 is a framework diagram of the overall implementation of the network;

FIG. 4 is a schematic diagram of an embedded graph learning layer structure;

FIG. 5 is a ROC curve diagram of a glmGCN model of a CESC data set in an embodiment;

FIG. 6 is a ROC graph of the glmGCN model of the STAD data set in the example;

FIG. 7 is a ROC graph of the glmGCN model of the PAAD data set in the example;

FIG. 8 is a ROC graph of the glmGCN model of the BLCA data set in the example.

Detailed Description

The technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

A distant metastasis identification method based on a gene interaction pattern optimized graph representation, comprising the steps of:

step 1, data preprocessing

preparing the data sets, which comprise cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) data samples, gastric adenocarcinoma (STAD) data samples, pancreatic adenocarcinoma (PAAD) data samples, and bladder urothelial carcinoma (BLCA) data samples (the invention is not limited to these data samples);

step 1-1, obtaining a CESC sample data set, a STAD sample data set, a PAAD sample data set and a BLCA sample data set, which are mRNA and lncRNA data of the TCGA database downloaded from SangerBox; each data set comprises gene expression data and clinical information, and the gene expression data comprise at least raw gene transcriptome read counts (Counts) and gene expression values in FPKM; the Counts data are used for differential analysis, and the normalized FPKM expression values are used to classify more accurately whether metastasis occurs;

in an embodiment of the invention, there are 309 samples in the CESC sample data set, 407 samples in the STAD sample data set, 182 samples in the PAAD sample data set, and 427 samples in the BLCA sample data set, with 19814 mRNAs and 14851 lncRNAs;

step 1-2, processing the clinical information obtained in the step 1-1 to obtain classification labels of all data sets;

in an embodiment of the present invention, the CESC sample data set includes 34 patients with distant metastasis and the remaining 275 patients without distant metastasis (275 + 34 = 309). In the STAD sample data set, 347 patients did not undergo distant metastasis and 60 patients underwent distant metastasis (347 + 60 = 407). The PAAD sample data set includes 122 patients without distant metastasis and 60 patients with distant metastasis (122 + 60 = 182). The BLCA sample data set includes 96 patients with distant metastasis and 331 patients without distant metastasis (331 + 96 = 427);

step 1-3, performing differential gene screening on the gene expression data obtained in the step 1-1 to obtain a differential gene set DEG, and further screening the gene expression data according to the differential gene set DEG, wherein the method specifically comprises the following steps:

the grouping file is made from the sample name suffix: samples with a suffix of -01 or -06 are labeled as tumor, and samples with a suffix of -11 are labeled as normal. Differential gene analysis is performed with the edgeR package, using the criteria log2FC > 1 (FC, fold change, is the ratio of expression between two groups of samples) and FDR < 0.05 (FDR is the false discovery rate). The Counts data of the screened differential genes are then matched to the original FPKM data by sample name to obtain the FPKM data corresponding to the differential genes.

In the embodiment of the present invention, 1515 DEG (1197 mRNA + 318 lncRNA) are obtained for the CESC sample data set, 4113 DEG (2903 mRNA + 1219 lncRNA) for the STAD sample data set, 116 DEG (77 mRNA + 39 lncRNA) for the PAAD sample data set, and 2767 DEG (2118 mRNA + 649 lncRNA) for the BLCA sample data set;
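The grouping and screening rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the differential analysis itself is done by edgeR, so the sketch starts from an already computed result table, and the gene names, sample names, values, and the helper names `label_sample`, `select_deg` and `subset_fpkm` are all hypothetical.

```python
# Hypothetical edgeR-style result rows: (gene, log2FC, FDR). Values are made up.
results = [
    ("GENE_A", 2.3, 0.001),
    ("GENE_B", 0.4, 0.020),
    ("GENE_C", 1.7, 0.200),
    ("GENE_D", 1.2, 0.030),
]

def label_sample(sample_name):
    """Label a sample by its name suffix: -01 or -06 is tumor, -11 is normal."""
    suffix = sample_name.rsplit("-", 1)[-1]
    if suffix in ("01", "06"):
        return "tumor"
    if suffix == "11":
        return "normal"
    return None  # other suffixes are left out of the grouping file

def select_deg(rows, lfc_cut=1.0, fdr_cut=0.05):
    """Keep genes with log2FC > lfc_cut and FDR < fdr_cut, as stated in the text."""
    return [gene for gene, lfc, fdr in rows if lfc > lfc_cut and fdr < fdr_cut]

def subset_fpkm(fpkm, deg):
    """Restrict an FPKM table {gene: {sample: value}} to the selected genes."""
    return {gene: fpkm[gene] for gene in deg if gene in fpkm}

deg = select_deg(results)
fpkm = {"GENE_A": {"S1-01": 5.0}, "GENE_B": {"S1-01": 1.0}, "GENE_D": {"S1-01": 9.5}}
deg_fpkm = subset_fpkm(fpkm, deg)
```

Only `GENE_A` and `GENE_D` pass both thresholds in this toy table, and `deg_fpkm` keeps the FPKM rows of those two genes.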

step 1-4, oversampling the minority class by using the gene expression data screened in step 1-3: a SMOTE (Synthetic Minority Over-sampling Technique) model analyzes the minority-class samples, artificially synthesizes new samples from them, and adds the new samples to the data set, so as to address overfitting and balance the data set, yielding expression data X with balanced class proportions; the CESC sample data set, the STAD sample data set, the PAAD sample data set and the BLCA sample data set are each oversampled on the minority class in this way; the method specifically comprises the following steps:

calculating the Euclidean distance from each minority-class sample $x$ to all other samples in the minority-class set to obtain its k nearest neighbors;

setting a sampling ratio according to the class imbalance ratio to determine the sampling multiplier N, and, for each minority-class sample $x$, randomly selecting several samples from its k nearest neighbors, a selected neighbor being denoted $\hat{x}$;

constructing a new sample $x_{new}$ from each selected neighbor as follows:

$x_{new} = x + \mathrm{rand}(0, 1) \times (\hat{x} - x)$

where $\hat{x}$ denotes each randomly selected neighbor and rand(0, 1) denotes a random number generated between 0 and 1.
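The oversampling steps can be illustrated with a minimal pure-Python sketch. It assumes the standard SMOTE interpolation between a minority-class sample and one of its k nearest minority-class neighbors; as a simplification, it draws the base sample at random instead of applying a fixed multiplier N per sample. The function names, the toy data and the value of k are hypothetical.

```python
import math
import random

def k_nearest(x, minority, k):
    """Return the k minority samples closest to x by Euclidean distance (x itself excluded)."""
    others = [m for m in minority if m is not x]
    return sorted(others, key=lambda m: math.dist(x, m))[:k]

def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)                          # base minority sample
        neighbor = rng.choice(k_nearest(x, minority, k))  # one randomly selected neighbor
        gap = rng.random()                                # rand(0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)])
    return synthetic

# Toy minority class with two features; the majority class is assumed to hold 10 samples.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
majority_size = 10
new_samples = smote(minority, n_new=majority_size - len(minority), k=2)
balanced_minority = minority + new_samples
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region the minority class already occupies. A maintained implementation exists in the imbalanced-learn package.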

In the embodiment of the present invention, the ratio of metastatic to non-metastatic samples is set to 1:1. After oversampling, the CESC, STAD, PAAD and BLCA sample data sets contain 275, 347, 122 and 331 samples, respectively, in each of the metastatic and non-metastatic classes. Thus, the total numbers of samples are 550, 694, 244 and 662, respectively.

Step 2, building and processing PPI

step 2-1, according to the differential gene screening result obtained in step 1-3, retrieving from the STRING database the potential interactions among the proteins encoded by the differential genes, and using them to construct a protein interaction network PPI; the PPI network represents a graph relationship between genes, from which an adjacency matrix A can be obtained, wherein gene interaction scores are used to measure the strength of the relationship between nodes (genes);

in an embodiment of the invention, the PPI network of CESC comprises a total of 15664 gene interactions involving 1106 genes. The PPI network of STAD comprises 36389 gene interactions involving 2800 genes. The PPI network of PAAD comprises 35 interactions among 33 genes, and the PPI network of BLCA comprises 28183 interactions among 2014 genes.
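As a rough illustration of how an adjacency matrix A can be derived from a scored interaction list, the sketch below fills a symmetric matrix from (gene, gene, score) triples. The gene names and scores are hypothetical, and the scores are assumed to be already scaled to [0, 1]; the function name `build_adjacency` is not taken from the source.

```python
def build_adjacency(genes, interactions):
    """Build a symmetric weighted adjacency matrix from (gene_a, gene_b, score) edges."""
    index = {gene: i for i, gene in enumerate(genes)}
    p = len(genes)
    adj = [[0.0] * p for _ in range(p)]
    for gene_a, gene_b, score in interactions:
        if gene_a in index and gene_b in index:  # keep edges among the screened genes only
            i, j = index[gene_a], index[gene_b]
            adj[i][j] = adj[j][i] = score
    return adj

# Hypothetical genes and interaction scores, assumed to be scaled to [0, 1].
genes = ["G1", "G2", "G3"]
interactions = [("G1", "G2", 0.9), ("G2", "G3", 0.4), ("G3", "GX", 0.7)]
A = build_adjacency(genes, interactions)
```

The edge to `GX` is dropped because `GX` is not in the gene list, and the diagonal stays at zero because no self-interactions are listed.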

Step 3, dividing the training set and the test set, and specifically comprising the following steps:

step 3-1, dividing a training set and a testing set of a CESC sample:

performing ten-fold division, by using the StratifiedKFold method, on the gene expression data X of the CESC sample data set obtained in step 1-4 after minority-class oversampling, taking each fold in turn as the test data CESC_Xtest with test labels CESC_Ytest, and taking the remaining nine folds as the training data CESC_Xtrain with training labels CESC_Ytrain.

In an embodiment of the invention, the gene expression data X of the CESC sample data set covers 1515 gene features for 550 samples; X_test of the CESC sample data set has approximately 55 samples, and X_train has approximately 495 samples;

step 3-2, dividing a training set and a test set of the STAD sample:

performing ten-fold division, by using the StratifiedKFold method, on the gene expression data X of the STAD sample data set obtained in step 1-4 after minority-class oversampling, taking each fold in turn as the test data STAD_Xtest with test labels STAD_Ytest, and taking the remaining nine folds as the training data STAD_Xtrain with training labels STAD_Ytrain;

in an embodiment of the invention, the gene expression data X of the STAD data set covers 4113 gene features for 694 samples; X_test of the STAD data set has approximately 69 samples, and X_train has approximately 625 samples;

step 3-3, dividing a training set and a test set of the PAAD sample:

performing ten-fold division, by using the StratifiedKFold method, on the gene expression data X of the PAAD sample data set obtained in step 1-4 after minority-class oversampling, taking each fold in turn as the test data PAAD_Xtest with test labels PAAD_Ytest, and taking the remaining nine folds as the training data PAAD_Xtrain with training labels PAAD_Ytrain;

in an embodiment of the invention, the gene expression data X of the PAAD data set covers 116 gene features for 244 samples; X_test of the PAAD data set has approximately 24 samples, and X_train has approximately 220 samples;

step 3-4, dividing a training set and a test set of the BLCA sample:

performing ten-fold division, by using the StratifiedKFold method, on the gene expression data X of the BLCA sample data set obtained in step 1-4 after minority-class oversampling, taking each fold in turn as the test data BLCA_Xtest with test labels BLCA_Ytest, and taking the remaining nine folds as the training data BLCA_Xtrain with training labels BLCA_Ytrain;

in an embodiment of the invention, the gene expression data X of the BLCA data set covers 2767 gene features for 662 samples; X_test of the BLCA data set has approximately 66 samples, and X_train has approximately 596 samples;
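The ten-fold division can be sketched as follows. The text names the StratifiedKFold method (available in scikit-learn); the code below is a dependency-free stand-in that only illustrates the idea of dealing each class evenly across folds, and is not scikit-learn's implementation. The function name and the seed are hypothetical; the class counts are those of the oversampled CESC data set.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose class proportions follow the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)  # deal each class round-robin over the folds
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx

# CESC after oversampling: 275 metastatic (1) and 275 non-metastatic (0) samples.
labels = [1] * 275 + [0] * 275
splits = list(stratified_folds(labels))
```

Each of the ten test folds holds 54 or 56 samples with equal numbers of both classes, which matches the "approximately 55" test samples and "approximately 495" training samples stated for CESC.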

step 4, constructing a glmGCN model based on the gene interaction pattern optimized graph representation, which specifically comprises the following processes:

step 4-1, constructing an embedded graph learning network:

constructing an adjacency matrix $A \in \mathbb{R}^{p \times p}$ based on the PPI network. The PPI network does not record the interaction of a gene with itself, so the diagonal elements of the adjacency matrix A are zero, which would mark a gene as not interacting with itself. An identity matrix I is therefore added to A, so that the diagonal elements are treated as direct interactions.

Based on the gene expression matrix G and the adjacency matrix A, a nonlinear function $S_{ij} = h(g_i, g_j)$ is defined, where $g_i$ and $g_j$ denote the expression of the i-th and j-th genes. At the same time, a projection matrix $P_{pj} \in \mathbb{R}^{n \times d}$ is used to project the data into a low-dimensional space and reduce computational complexity, where d < n. A new graph representation is obtained using a single-layer network, as shown in the following equation:

represents the power, and p represents the total number of genes.

The weight vector is set as $a = (a_1, a_2, \ldots, a_n)^T \in \mathbb{R}^{n \times 1}$, and a sigmoid or tanh function is used as the activation function $\sigma(\cdot)$. Compared with GLCN, the use of the adjacency matrix A is improved: raising A to a power emphasizes the importance of the initial domain graph, so that the difference between strong and weak gene interactions in the original PPI network graph becomes larger. The weight vector $a$ and the projection matrix $P_{pj}$ are optimized by the following loss function $L_1$ (Nie F, "Clustering and projected clustering with adaptive neighbors," 2014):

where γ and β are two constants that can be adjusted manually, and F denotes the Frobenius norm.

If the distance between $g_i$ and $g_j$ is larger, $S_{ij}$ should be smaller; conversely, if the distance is smaller, $S_{ij}$ is larger. The second term is a regularization term that controls the sparsity of the learned graph S, where γ is the regularization coefficient. The third term integrates the PPI network information into the loss function as a constraint, which makes the loss function more reasonable.
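The behavior of the graph learning layer can be illustrated with a small sketch. It assumes a GLCN-style form in the spirit of Jiang et al., in which each entry is the powered adjacency value times an exponentiated activation of the weighted feature difference, normalized over each row; this assumed form, the exponent name `eta` and its value 2, the omission of the projection matrix, and the untrained weights are all illustrative choices and not taken from the source.

```python
import math

def learn_graph(G, A, a, eta=2.0):
    """Row-normalized graph S from gene features G (p x n), adjacency A (p x p), weights a (n)."""
    p = len(G)
    S = []
    for i in range(p):
        row = []
        for j in range(p):
            diff = sum(w * abs(gi - gj) for w, gi, gj in zip(a, G[i], G[j]))
            score = math.tanh(diff)                  # sigma(.) = tanh, one option named in the text
            row.append((A[i][j] ** eta) * math.exp(score))
        total = sum(row)
        S.append([v / total for v in row])           # softmax-style normalization over each row
    return S

# Hypothetical toy input: 3 genes measured in 2 samples.
G = [[1.0, 2.0], [1.1, 2.1], [4.0, 0.5]]
A = [[1.0, 0.9, 0.3],   # identity already added on the diagonal
     [0.9, 1.0, 0.0],
     [0.3, 0.0, 1.0]]
a = [0.5, 0.5]          # untrained placeholder weights
S = learn_graph(G, A, a)
```

With scores in [0, 1], squaring widens the ratio between a strong edge (0.9) and a weak one (0.3) from 3 to 9, which is the kind of emphasis on the initial graph that the text describes. Gene pairs with no PPI edge keep a zero entry in S, and each row of S sums to 1.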

If there is no initial graph, i.e. only the gene expression matrix is available, the learned graph S can be defined as:

in this case, the weight matrix will be optimized by the following loss function:

step 4-2, constructing a graph convolution network:

when a new graph representation S is obtained from the single-layer network of the graph learning layer, the final softmax operation of equation (3) or equation (5) in step 4-1 ensures that the learned graph S satisfies the following equation:

Thus, a graph G(X, S) is defined, whose representation is learned from the data X and the adjacency matrix S. In the graph convolution layer, a layer-wise propagation rule is executed based on the adaptive neighborhood graph S returned by the graph learning layer and the gene expression data of each sample, i.e.

$O_i^{(k+1)} = \sigma\left(S \cdot O_i^{(k)} \cdot W^{(k)}\right), \quad i = 1, 2, \ldots, n$ (8)

where $k = 0, 1, \ldots, K-1$, $O_i^{(k+1)}$ denotes the output of the (k+1)-th layer, $O_i^{(0)}$ denotes the gene expression of the i-th sample, $\sigma(\cdot)$ denotes the activation function, and $W^{(k)}$ denotes the trainable weight matrix of each graph convolution layer. $W^{(0)} \in \mathbb{R}^{1 \times h(0)}$ denotes the input-to-hidden weight matrix of a hidden layer with h(0) feature maps, and $W^{(K)} \in \mathbb{R}^{h(K-1) \times C}$ denotes the hidden-to-output weight matrix of a hidden layer with h(K-1) feature maps, where C is the number of classes (C = 2 in the experiments).

If the output is produced directly after the graph convolution layers extract the features, the final perceptron layer is defined as follows:

$Z = \mathrm{softmax}\left(S \, O_i^{(K)} \, W^{(K)}\right)$ (9)

where the weight matrix $W^{(K)} \in \mathbb{R}^{h(K-1) \times C}$, C denotes the number of classes, the output $Z \in \mathbb{R}^{n \times C}$ denotes the label predictions of the glmGCN model, and $Z_i$ denotes the label prediction of the i-th node.

In an embodiment of the invention, for the CESC sample data set, the dimension of $O_i^{(0)}$ is 1515 × 1, the graph convolution has two layers, $W^{(0)}$ is a 1 × 25 parameter matrix drawn from the uniform distribution given by Xavier initialization, $W^{(1)}$ is a 25 × 2 parameter matrix drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 1515 × 1515;

in an embodiment of the invention, for the STAD sample data set, the dimension of $O_i^{(0)}$ is 4113 × 1, the graph convolution has two layers, $W^{(0)}$ is a 1 × 25 parameter matrix drawn from the uniform distribution given by Xavier initialization, $W^{(1)}$ is a 25 × 2 parameter matrix drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 4113 × 4113;

in an embodiment of the invention, for the PAAD sample data set, the dimension of $O_i^{(0)}$ is 116 × 1, the graph convolution has two layers, $W^{(0)}$ is a 1 × 25 parameter matrix drawn from the uniform distribution given by Xavier initialization, $W^{(1)}$ is a 25 × 2 parameter matrix drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 116 × 116;

in an embodiment of the invention, for the BLCA sample data set, the dimension of $O_i^{(0)}$ is 2767 × 1, the graph convolution has two layers, $W^{(0)}$ is a 1 × 25 parameter matrix drawn from the uniform distribution given by Xavier initialization, $W^{(1)}$ is a 25 × 2 parameter matrix drawn from the uniform distribution given by Xavier initialization, and the dimension of S is 2767 × 2767;
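The shapes in these embodiments can be checked with a minimal forward-pass sketch of two graph convolution layers. It assumes ReLU as the hidden activation and a row-wise softmax on the output, uses a toy graph of p = 4 genes instead of the real dimensions, and takes the Xavier uniform limit as sqrt(6 / (fan_in + fan_out)); the helper names and the placeholder graph S are hypothetical.

```python
import math
import random

def xavier_uniform(rows, cols, rng):
    """Draw a rows x cols matrix from U(-limit, limit), limit = sqrt(6 / (rows + cols))."""
    limit = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-limit, limit) for _ in range(cols)] for _ in range(rows)]

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def relu(X):
    return [[max(0.0, v) for v in row] for row in X]

def softmax_rows(X):
    out = []
    for row in X:
        m = max(row)                          # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(S, O0, W0, W1):
    """Two graph convolution layers: O1 = relu(S O0 W0), Z = softmax(S O1 W1)."""
    O1 = relu(matmul(matmul(S, O0), W0))
    return softmax_rows(matmul(matmul(S, O1), W1))

rng = random.Random(0)
p, hidden, classes = 4, 25, 2              # p = 4 toy genes; 25 and 2 follow the embodiment
S = [[1.0 / p] * p for _ in range(p)]      # placeholder row-normalized graph
O0 = [[0.5], [1.2], [0.0], [2.3]]          # p x 1 expression vector of one sample
W0 = xavier_uniform(1, hidden, rng)        # 1 x 25
W1 = xavier_uniform(hidden, classes, rng)  # 25 x 2
Z = gcn_forward(S, O0, W0, W1)
```

The intermediate shapes are p × 1, then p × 25, then p × 2, so with the real data only p changes (1515, 4113, 116 or 2767) while the weight matrices keep their 1 × 25 and 25 × 2 sizes.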

step 4-3, constructing a full connection network layer:

constructing fully connected (FC) layers to comprehensively extract feature information: a plurality of network layers are set, the weights and biases are initialized in the first network layer, and forward propagation is performed; after each iteration the error of the network is calculated, and the error and its gradient are fed back to the network model to update the network weights, which reduces the error of subsequent iterations; the information is processed through the layers to obtain the error and the prediction accuracy of the network model, and the model with the minimum error is selected as the final model.

In the embodiment of the invention, the CESC data set uses 5 fully connected layers, with 4096, 2048, 1024, 512 and 2 neurons, respectively;

in the embodiment of the invention, the STAD data set uses 5 fully connected layers, with 2048, 1024, 512, 256 and 2 neurons, respectively;

in the embodiment of the invention, the PAAD data set uses 5 fully connected layers, with 4096, 2048, 1024, 512 and 2 neurons, respectively;

in the embodiment of the invention, the BLCA data set uses 5 fully connected layers, with 4096, 2048, 1024, 512 and 2 neurons, respectively;

step 4-4, constructing a loss function:

assume that the predicted labels are $Z \in \mathbb{R}^{n \times 2}$, in which the i-th row is the predicted label vector of the i-th sample, and that the true labels are $Y \in \mathbb{R}^{n \times 2}$, in which the i-th row is the true label vector of the i-th sample. Parameter optimization is performed here using a cross-entropy loss function (T denotes the training set):

in this formula, the loss function mainly optimizes the parameters of the network layers from the graph convolution layers to the fully connected layers; all the parameters of the whole architecture are optimized jointly in the following way:

$L_{glmGCN} = L_2 + \lambda L_1$ (11)

in the embodiment of the invention, the λ value of the CESC sample data set is 0.1, the λ value of the STAD data set is 0.01, the λ value of the PAAD sample data set is 0.01, and the λ value of the BLCA sample data set is 0.01;
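A small numeric sketch may help to see how the two loss terms combine. It assumes the usual cross-entropy form, the negative sum of $Y_{ij} \ln Z_{ij}$ over the training samples and classes, for $L_2$, and treats the graph learning loss $L_1$ as an already computed placeholder value; the labels, probabilities and function names are hypothetical.

```python
import math

def cross_entropy(Y, Z, train_idx):
    """L2 = -sum over training rows i and classes j of Y[i][j] * ln(Z[i][j])."""
    return -sum(Y[i][j] * math.log(Z[i][j])
                for i in train_idx for j in range(len(Y[i])))

def total_loss(l2, l1, lam):
    """Combined objective of equation (11): L_glmGCN = L2 + lambda * L1."""
    return l2 + lam * l1

# Hypothetical one-hot labels and predicted probabilities for three samples.
Y = [[1, 0], [0, 1], [1, 0]]
Z = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
l2 = cross_entropy(Y, Z, train_idx=[0, 1])   # sample 2 is held out of the training set T
loss = total_loss(l2, l1=0.5, lam=0.1)       # l1 = 0.5 is a placeholder value
```

Here λ = 0.1 is the value reported for CESC; a smaller λ such as 0.01 shrinks the influence of the graph learning term on the joint objective.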

step 5, training and evaluating the glmGCN model based on the gene interaction pattern optimized graph representation, which specifically comprises the following steps:

step 5-1, inputting the training set CESC_Xtrain of each fold in step 3-1 into the glmGCN model based on the gene interaction pattern optimized graph representation in step 4, and optimizing by minimizing the loss function in step 4-4 to obtain the network weight W_CESC.

Inputting the training set STAD_Xtrain of each fold in step 3-2 into the glmGCN model based on the gene interaction pattern optimized graph representation in step 4, and optimizing by minimizing the loss function in step 4-4 to obtain the network weight W_STAD.

Inputting the training set PAAD_Xtrain of each fold in step 3-3 into the glmGCN model based on the gene interaction pattern optimized graph representation in step 4, and optimizing by minimizing the loss function in step 4-4 to obtain the network weight W_PAAD.

Inputting the training set BLCA_Xtrain of each fold in step 3-4 into the glmGCN model based on the gene interaction pattern optimized graph representation in step 4, and optimizing by minimizing the loss function in step 4-4 to obtain the network weight W_BLCA.

Step 5-2, evaluating the trained model by taking accuracy (ACC), specificity (SPE), sensitivity (SEN), F1-score (F1) and AUC as evaluation indexes, and displaying the relation between sensitivity and specificity with the ROC curve. The basic definitions are as follows: FN (false negative) is the number of samples predicted as negative that are actually positive, FP (false positive) is the number of samples predicted as positive that are actually negative, TN (true negative) is the number of samples predicted as negative that are actually negative, and TP (true positive) is the number of samples predicted as positive that are actually positive. Precision, calculated as Precision = TP/(TP + FP), is the proportion of actual positives among all samples predicted as positive. Recall, calculated as Recall = TP/(TP + FN), is the proportion of samples predicted as positive among all samples that are actually positive.

The evaluation indexes are defined as follows:

Sensitivity (SEN), also called the true positive rate (TPR) or recall, is the proportion of actually positive samples that are predicted as positive. Specificity (SPE), also called the true negative rate (TNR), is the proportion of actually negative samples that are predicted as negative. Accuracy (ACC) is a common evaluation index that gives the proportion of correctly predicted samples among all samples; generally, the higher the accuracy, the better the classifier. The F1-score is the harmonic mean of precision and recall and combines both results; a higher F1-score indicates a more effective method. ROC and AUC are two further indexes for evaluating a classifier: ROC is short for receiver operating characteristic curve, and AUC is short for area under the ROC curve. The ROC curve is plotted from sensitivity and specificity and shows their relation as the decision threshold varies continuously. The AUC value is the area under the ROC curve; it generally lies between 0.5 and 1, and a larger AUC value indicates higher classification accuracy and a better prediction effect.
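The threshold-based indexes defined above can be computed directly from the four confusion counts. The sketch below does so for a hypothetical set of ten predictions, with 1 denoting a metastatic (positive) sample; the function names and the example labels are illustrative, and the code does not guard against empty classes.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = metastatic (positive) and 0 = non-metastatic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Return SEN, SPE, ACC and F1 from the confusion counts."""
    sen = tp / (tp + fn)                     # sensitivity = recall = TPR
    spe = tn / (tn + fp)                     # specificity = TNR
    acc = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sen / (precision + sen)
    return {"SEN": sen, "SPE": spe, "ACC": acc, "F1": f1}

# Hypothetical labels: 4 metastatic and 6 non-metastatic samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
scores = metrics(*confusion_counts(y_true, y_pred))
```

For this example TP = 3, FN = 1, TN = 4 and FP = 2, giving SEN = 0.75, SPE = 4/6, ACC = 0.7 and F1 = 2/3.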

Step 5-3, testing on the test set CESC_Xtest of each fold in step 3-1 by using the network weight parameter W_CESC trained for that fold in step 5-1 and the network model in step 4.

Testing on the test set STAD_Xtest of each fold in step 3-2 by using the network weight parameter W_STAD trained for that fold in step 5-1 and the network model in step 4.

Testing on the test set PAAD_Xtest of each fold in step 3-3 by using the network weight parameter W_PAAD trained for that fold in step 5-1 and the network model in step 4.

Testing on the test set BLCA_Xtest of each fold in step 3-4 by using the network weight parameter W_BLCA trained for that fold in step 5-1 and the network model in step 4.

Step 5-4, averaging the ten-fold prediction results obtained in step 5-3, and repeating the experiment three times to obtain the final test set prediction result. All data sets (CESC, STAD, BLCA, PAAD) are trained and tested in this way.
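The averaging in step 5-4 amounts to a mean over folds followed by a mean over repeats. A minimal sketch, with made-up accuracy values and a hypothetical function name:

```python
from statistics import mean

def final_score(repeats):
    """Average per-fold scores within each repeat, then average across repeats."""
    return mean(mean(folds) for folds in repeats)

# Hypothetical accuracies: three repeats of ten-fold cross-validation.
repeats = [
    [0.90, 0.88, 0.92, 0.91, 0.89, 0.90, 0.93, 0.87, 0.90, 0.90],
    [0.91] * 10,
    [0.89] * 10,
]
acc_final = final_score(repeats)
```

The same function can be applied to each index (ACC, SPE, SEN, F1, AUC) separately, since every repeat contributes the same number of folds.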

The embodiment is as follows:

For the CESC sample data set, following the method of step 5, the training set CESC_Xtrian of each fold from step 3-1 is input into the glmGCN model based on the gene interaction pattern optimization graph representation from step 4, and the network weights W_CESC are obtained by minimizing the loss function of step 4-4. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.

For the STAD sample data set, following the method of step 5, the training set STAD_Xtrian of each fold from step 3-2 is input into the glmGCN model based on the gene interaction pattern optimization graph representation from step 4, and the network weights W_STAD are obtained by minimizing the loss function of step 4-4. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.

For the PAAD sample data set, following the method of step 5, the training set PAAD_Xtrian of each fold from step 3-3 is input into the glmGCN model based on the gene interaction pattern optimization graph representation from step 4, and the network weights W_PAAD are obtained by minimizing the loss function of step 4-4. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.

For the BLCA sample data set, following the method of step 5, the training set BLCA_Xtrian of each fold from step 3-4 is input into the glmGCN model based on the gene interaction pattern optimization graph representation from step 4, and the network weights W_BLCA are obtained by minimizing the loss function of step 4-4. The average of the ten-fold cross-validation results over the three repeated experiments, obtained through steps 5-2, 5-3 and 5-4, is taken as the final result.

Table 1 shows the results of the glmGCN model on the CESC sample data set, Table 2 on the STAD sample data set, Table 3 on the PAAD sample data set, and Table 4 on the BLCA sample data set. This series of experiments shows the effectiveness of the method.

TABLE 1

Method	ACC (%)	SPE (%)	SEN (%)	F1-SCORE (%)	AUC
glmGCN	98.92	99.64	98.19	98.89	0.9945

TABLE 2

Method	ACC (%)	SPE (%)	SEN (%)	F1-SCORE (%)	AUC
glmGCN	97.39	98.26	96.52	97.35	0.9927

TABLE 3

Method	ACC (%)	SPE (%)	SEN (%)	F1-SCORE (%)	AUC
glmGCN	79.56	84.36	74.76	78.44	0.8523

TABLE 4

Method	ACC (%)	SPE (%)	SEN (%)	F1-SCORE (%)	AUC
glmGCN	92.04	94.66	89.41	91.73	0.9652

The invention has been described above by way of example. It should be understood that any simple alteration, modification or other equivalent substitution that a person skilled in the art could make without inventive effort falls within the scope of protection of the invention.

Claims

1. A method for identifying distant metastasis based on optimized representation of gene interaction patterns, the method comprising the steps of:

step 4-1, constructing an embedded graph learning network layer, and acquiring a new graph representation on the basis of the initial graph by using a single-layer network:

wherein a = (a_1, a_2, ..., a_n)^T is a weight vector, σ(·) is an activation function, A represents the adjacency matrix, g_i and g_j respectively represent the gene expression of the i-th and j-th genes,

represents the power, p represents the total number of genes;

O_i^(k+1) = σ(S · O_i^(k) · W^(k)), for i = 1, 2, ..., n

step 4-3, constructing a fully connected network layer to comprehensively extract feature information, which comprises setting a plurality of fully connected network layers and processing through them to obtain the final prediction result;

2. The method for identifying distant metastasis based on optimized representation of gene interaction patterns according to claim 1, wherein in step 4-1, if there is no initial graph, i.e. only the gene expression matrix is available, the learned graph S is defined as:

wherein a = (a_1, a_2, ..., a_n)^T is a weight vector, σ(·) is an activation function, g_i and g_j respectively represent the gene expression of the i-th and j-th genes,

represents the power, and p represents the total number of genes.