CN117438102A

CN117438102A - Anti-tumor drug efficacy prediction method based on knowledge graph embedding representation relearning

Info

Publication number: CN117438102A
Application number: CN202311560265.2A
Authority: CN
Inventors: 谢新平; 汪凤婷; 王红强; 姜晓东
Original assignee: Anhui Jianzhu University
Current assignee: Anhui Jianzhu University
Priority date: 2023-11-22
Filing date: 2023-11-22
Publication date: 2024-01-23

Abstract

The invention relates to an anti-tumor drug efficacy prediction method based on knowledge graph embedding representation relearning, which comprises the following steps: preparing original data; obtaining the cell line embedded characterization Embed₁; constructing a cell line embedded characterization relearning deep network model and training it with the cell line embedded characterization Embed₁ to obtain the tumor cell relearning characterization Embed; training a DNN classification model to obtain a trained DNN classification model; and inputting the tumor cell relearning characterization Embed into the trained DNN classification model to predict the relationship between the tumor cells to be tested and drug sensitivity. By building a convolutional neural network model from the original gene expression profile and the cell line embedded characterization, the invention represents a new sample directly with the trained model, overcoming the defect of existing methods that the model must be retrained whenever a new sample is added; it integrates the original expression profile, the drug effect label pairs and the gene regulation network information of the tumor cell lines, improving the prediction of tumor cell drug sensitivity.

Description

Anti-tumor drug efficacy prediction method based on knowledge graph embedding representation relearning

Technical Field

The invention relates to the technical field of tumor cell drug sensitivity detection and evaluation, in particular to an anti-tumor drug efficacy prediction method based on knowledge graph embedding representation relearning.

Background

Because of tumor heterogeneity and genetic diversity, patients with the same cancer respond differently even to the same drug. Blind administration causes serious toxic side effects and even overtreatment. Methods based on network representation learning have been shown to extract the gene regulation features of a sample effectively and to predict tumor cell drug sensitivity well.

However, existing network representation learning methods must fuse the samples into a prior gene regulation network when extracting gene regulation features, and then learn the embedded representation of the fused network. Because the fusion network is built over all samples, adding a new sample requires the fusion network to be rebuilt and its representation learning model to be retrained. Existing methods overlook this defect, which makes field application inconvenient and hinders improvement of the prediction capability.

Disclosure of Invention

To overcome the defect that the fusion network must be rebuilt and its representation learning model retrained whenever a new sample is added, the invention aims to provide an anti-tumor drug efficacy prediction method based on knowledge graph embedding representation relearning. By relearning the fused gene regulation features, the method both handles the high dimensionality of high-throughput gene data and improves the prediction of tumor cell drug sensitivity.

In order to achieve the above purpose, the present invention adopts the following technical scheme: the method for predicting the drug effect of the antitumor drug based on knowledge graph embedding representation relearning comprises the following steps in sequence:

(1) Preparing original data: the original data comprise the original gene expression profiles of N cell lines, drug effect label pairs and a gene regulation network;

(2) Obtaining the cell line embedded characterization Embed₁: the cells are fused with the gene regulation network to obtain a cell-gene fusion regulation network map, and the cell-gene fusion regulation network map is input into a knowledge graph embedding model for learning to obtain the cell line embedded characterization Embed₁;

(3) Constructing a cell line embedded characterization relearning deep network model, and training the model with the cell line embedded characterization Embed₁ to obtain the tumor cell relearning characterization Embed;

(4) Constructing a DNN (deep neural network) classification model, and training it with the tumor cell relearning characterization Embed to obtain a trained DNN classification model;

(5) Inputting the tumor cell relearning characterization Embed into the trained DNN classification model, and predicting the relationship between the tumor cells to be tested and drug sensitivity.

The step (2) specifically comprises the following steps:

(2a) Constructing a cell-gene fusion regulation network map: all tumor cell nodes are fused with the gene regulation network; the probability density distribution of each tumor cell sample's gene expression is fitted, the genes whose expression falls beyond the quantile $Z_{1-\alpha}$ are taken as the hotspot genes of that cell, and the hotspot genes are linked to the tumor cell node to obtain the cell-gene fusion regulation network map;

(2b) Inputting the cell-gene fusion regulation network map into a knowledge graph embedding model, and calculating the gene fusion regulation feature representations of all tumor cell samples, which specifically comprises the following steps:

(2b1) Extracting positive triplets in a cell-gene fusion regulation network map;

(2b2) Performing negative triplet sampling to obtain a negative triplet set, and calculating the importance of each negative triplet by using the following formula:

where α is a constant representing the sampling rate, $(h'_j, r, o'_j)$ represents the j-th negative triplet sample, $h'$ represents the head vector of the negative triplet sample, $o'$ represents its tail vector, $r$ represents its relation vector, $P_j = \lVert h'_j \circ r - o'_j \rVert$ is the scoring function of the sample, and $\circ$ denotes the Hadamard product;

(2b3) The resulting positive and negative triples are scored to calculate the total Loss:

where $g(h'_i, r, o'_i)$ is the weight of negative triplet sample i, M is the number of negative triplet samples, σ represents the Sigmoid activation function, γ represents a constant, and $p(h, r, o)$ is the scoring function of the positive triplet;

(2b4) Updating regulatory fusion characteristic representations of all nodes and edges of the cell-gene fusion regulatory network map by using an Adam optimization algorithm;

(2b5) Repeating steps (2b2) to (2b4) until the loss function in step (2b3) converges, and taking the regulatory fusion feature representation of the cell line nodes as the cell line embedded characterization Embed₁.
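The weighting and loss formulas referenced in steps (2b2) and (2b3) are not reproduced in this text. The sketch below shows one form that is consistent with the stated symbol definitions, borrowed from self-adversarial negative sampling as used with distance-based knowledge graph embedding models. It is an assumption, not the patent's formula: the softmax weighting over `-alpha * P_j`, the default values of `alpha` and `gamma`, and the function names are all illustrative.

```python
import math

def score(h, r, o):
    """Distance-style score ||h ∘ r - o||, with ∘ the element-wise product."""
    return math.sqrt(sum((hi * ri - oi) ** 2 for hi, ri, oi in zip(h, r, o)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_weights(neg_scores, alpha=1.0):
    """ASSUMED weighting: softmax over -alpha * P_j, so that negatives with a
    small distance (harder to tell from positives) receive a larger weight."""
    logits = [-alpha * s for s in neg_scores]
    shift = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def triplet_loss(pos_score, neg_scores, gamma=6.0, alpha=1.0):
    """ASSUMED loss: margin term for the positive triplet plus the weighted
    margin terms of the M negative triplets."""
    loss = -math.log(sigmoid(gamma - pos_score))
    for g, s in zip(negative_weights(neg_scores, alpha), neg_scores):
        loss -= g * math.log(sigmoid(s - gamma))
    return loss

# Toy vectors (illustrative values only).
h, r, o = [1.0, 2.0], [1.0, 1.0], [1.0, 2.0]
negatives = [([3.0, 0.0], r, o), ([9.0, 9.0], r, o)]
pos = score(h, r, o)
neg = [score(*t) for t in negatives]
loss = triplet_loss(pos, neg)
```

Under these assumptions the loss is small when the positive triplet has a distance well below γ and the negatives lie well above it, which is the behaviour the Adam updates in step (2b4) would push toward.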

The step (3) specifically comprises the following steps:

(3a) Constructing a cell line embedded characterization relearning training set, which consists of the original gene expression profiles of the cell lines and the cell line embedded characterization Embed₁;

(3b) Constructing a cell line embedded characterization relearning deep network model, namely a one-dimensional convolutional neural network: the network has a plurality of convolutional layers and, by setting different convolution kernel sizes, processes the cell line embedded characterization through convolution, activation, batch normalization and pooling operations; after the convolutions, a fully connected layer serves as the output of the whole convolutional network, and the dropout rate is set to 0.5;

(3c) Inputting the cell line embedded characterization relearning training set into the constructed cell line embedded characterization relearning deep network model, obtaining the tumor cell relearning characterization Embed₂ through the configured one-dimensional convolutional neural network, computing the mean square error between the tumor cell relearning characterization Embed₂ and the cell line embedded characterization Embed₁, and taking the mean square error as the loss function of the one-dimensional convolutional neural network;

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2$$

where N is the number of cell lines, $Y_i$ is the value of the tumor cell relearning characterization Embed₂, and $\hat{Y}_i$ is the value of the cell line embedded characterization Embed₁;

(3d) Updating the tumor cell relearning characterization Embed₂ using the Adam optimization algorithm;

(3e) Repeating steps (3c) to (3d) until the loss function in step (3c) converges, to obtain the tumor cell relearning characterization Embed.
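Steps (3c) to (3e) amount to an optimize-until-convergence loop on the mean square error. A minimal sketch follows, assuming a plain linear map in place of the one-dimensional convolutional network so that the gradient can be written by hand. The toy data, `fit_adam`, its hyperparameters and the convergence tolerance are illustrative and not taken from the patent.

```python
import math
import random

def predict(W, x):
    """Linear stand-in for the CNN: maps an expression profile to an embedding."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse_loss(W, X, Y):
    total, count = 0.0, 0
    for x, y in zip(X, Y):
        for p, t in zip(predict(W, x), y):
            total += (p - t) ** 2
            count += 1
    return total / count

def mse_grad(W, X, Y):
    k, d = len(W), len(W[0])
    grad = [[0.0] * d for _ in range(k)]
    count = len(X) * k
    for x, y in zip(X, Y):
        p = predict(W, x)
        for a in range(k):
            err = 2.0 * (p[a] - y[a]) / count
            for b in range(d):
                grad[a][b] += err * x[b]
    return grad

def fit_adam(X, Y, steps=1000, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, tol=1e-10):
    k, d = len(Y[0]), len(X[0])
    W = [[0.0] * d for _ in range(k)]
    m = [[0.0] * d for _ in range(k)]   # first-moment estimates
    v = [[0.0] * d for _ in range(k)]   # second-moment estimates
    history = [mse_loss(W, X, Y)]
    for t in range(1, steps + 1):
        g = mse_grad(W, X, Y)
        for a in range(k):
            for b in range(d):
                m[a][b] = b1 * m[a][b] + (1 - b1) * g[a][b]
                v[a][b] = b2 * v[a][b] + (1 - b2) * g[a][b] ** 2
                m_hat = m[a][b] / (1 - b1 ** t)   # bias correction
                v_hat = v[a][b] / (1 - b2 ** t)
                W[a][b] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(mse_loss(W, X, Y))
        if abs(history[-2] - history[-1]) < tol:   # treat as converged
            break
    return W, history

# Toy data: 20 "cell lines", 4 "genes", 2-dimensional target embedding.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(20)]
W_true = [[0.5, -1.0, 0.0, 2.0], [1.0, 0.0, -0.5, 0.3]]
Y = [predict(W_true, x) for x in X]      # plays the role of Embed₁
W_fit, history = fit_adam(X, Y)
```

Once such a model is fitted, a new expression profile is mapped with `predict` alone, which mirrors the point that new samples do not require the graph embedding to be relearned.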

The step (4) specifically comprises the following steps:

(4a) Constructing a drug effect prediction training set, wherein the drug effect prediction training set consists of tumor cell relearning characterization Embed and drug effect label pairs;

(4b) Constructing a DNN classification model:

(4b1) Inputting the drug effect prediction training set into the constructed DNN classification model to obtain the probability that a cell line is sensitive to a drug, and judging whether the output is sensitive or resistant to obtain the sensitivity relationship. The DNN classification model has a plurality of hidden layers, with the number of units per layer set according to the dimension of the cell line embedded characterization; the number of hidden layers L is at least 3, and a ReLU activation function is used between layers. The output layer has 1 neuron with a Sigmoid activation function, making this a binary classification task: the Sigmoid function outputs an event probability between 0 and 1, and when the output exceeds the threshold of 0.5 the sample is assigned to the positive class, i.e. sensitive;

(4b2) Calculating the binary cross-entropy loss from the sensitivity relationship obtained in step (4b1) and the true sensitivity relationship given by the drug effect labels, and using it as the loss function of the DNN binary classification model;

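$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i) + (1 - y_i)\log\left(1 - p(y_i)\right)\right]$$

where N is the number of cell lines, $y_i$ is the binary label value 0 or 1, and $p(y_i)$ is the predicted probability that the label $y_i$ is 1 (sensitive);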

(4b3) Optimizing the sensitivity relationship output by the DNN binary classification model using the Adam algorithm;

(4b4) Repeating steps (4b1) to (4b3) until the loss function in step (4b2) converges, to obtain the trained DNN classification model.
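The binary cross-entropy of step (4b2) can be computed as in the following minimal sketch. The clipping constant `eps` is an added numerical safeguard, and the labels and logits are toy values, none of which come from the patent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted
    probabilities of the positive (sensitive) class."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Toy example: four cell line / drug pairs.
labels = [1, 0, 1, 0]
probs = [sigmoid(z) for z in (2.0, -1.0, 0.5, 0.0)]
loss = binary_cross_entropy(labels, probs)
```

A probability of 0.5 for every sample gives a loss of ln 2, which is a convenient baseline: a trained classifier should fall below it.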

The step (5) specifically refers to: using the trained DNN classification model and the tumor cell relearning characterization Embed to predict the relationship between the tumor cells to be tested and drug sensitivity, where sensitive is 1 and resistant is 0:

where f represents the trained DNN classification model, and $z_i$ represents the probability, output through the Sigmoid function, that the i-th tumor cell to be predicted in Embed is sensitive to the drug; if this probability is greater than 0.5, 1 is output, indicating sensitivity to the drug; if it is less than 0.5, 0 is output, indicating resistance to the drug.
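The decision rule of step (5) can be written as a short sketch. The text does not say what happens when the probability equals 0.5 exactly; mapping that case to 0 here is an assumption, and the function name and example probabilities are illustrative.

```python
def predict_sensitivity(probabilities, threshold=0.5):
    """Map Sigmoid outputs z_i to labels: 1 = sensitive, 0 = resistant.

    ASSUMPTION: a probability exactly equal to the threshold maps to 0,
    since the text only covers the strictly greater and strictly less cases.
    """
    return [1 if z > threshold else 0 for z in probabilities]

labels = predict_sensitivity([0.91, 0.12, 0.5, 0.67])
# labels == [1, 0, 0, 1]
```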

According to the above technical scheme, the beneficial effects of the invention are as follows: first, a convolutional neural network model is built from the original gene expression profile and the cell line embedded characterization, so that a new sample is represented directly with the trained model, overcoming the defect of existing methods that the model must be retrained when a new sample is added; second, through cell line embedded characterization relearning, the original expression profile, the drug effect label pairs and the gene regulation network information of the tumor cell lines are integrated, improving the prediction of tumor cell drug sensitivity; third, deep learning encoding is introduced to handle the high dimensionality of high-throughput gene data.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a predictive flow chart of a trained DNN classification model.

Detailed Description

As shown in fig. 1, a method for predicting the efficacy of an antitumor drug based on knowledge-graph embedding and relearning comprises the following steps in sequence:

(5) Inputting the tumor cell relearning characterization Embed into the trained DNN classification model, and predicting the relationship between the tumor cells to be tested and drug sensitivity, as shown in FIG. 2.

The step (2) specifically comprises the following steps:

(2a) Constructing a cell-gene fusion regulation network map: all tumor cell nodes are fused with the gene regulation network; the probability density distribution of each tumor cell sample's gene expression is fitted, the genes whose expression falls beyond the quantile $Z_{1-\alpha}$ are taken as the hotspot genes of that cell, and the hotspot genes are linked to the tumor cell node to obtain the cell-gene fusion regulation network map;
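The hotspot-gene selection of step (2a) can be sketched as follows, assuming a normal distribution is fitted to each cell line's expression values; the patent does not name the distribution family. The gene symbols are real, but the expression values, the relation name `"hotspot"` and the function names are illustrative.

```python
from statistics import NormalDist

def hotspot_genes(expression, alpha=0.05):
    """Return the genes whose expression exceeds the (1 - alpha) quantile
    of a distribution fitted to this cell line's expression values."""
    dist = NormalDist.from_samples(list(expression.values()))  # ASSUMED normal fit
    cutoff = dist.inv_cdf(1.0 - alpha)
    return sorted(g for g, x in expression.items() if x > cutoff)

def cell_gene_edges(cell_id, expression, alpha=0.05, relation="hotspot"):
    """Build (cell, relation, gene) triplets linking a cell node to its hotspot genes."""
    return [(cell_id, relation, g) for g in hotspot_genes(expression, alpha)]

# Toy profile for one cell line (made-up values).
profile = {"TP53": 1.0, "EGFR": 9.5, "MYC": 1.2, "KRAS": 0.9, "BRCA1": 1.1, "PTEN": 0.8}
edges = cell_gene_edges("cell_line_A", profile, alpha=0.05)
# edges == [("cell_line_A", "hotspot", "EGFR")]
```

These cell-to-gene triplets, added to the gene-to-gene triplets of the prior regulation network, would form the fused graph that the knowledge graph embedding model is trained on.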

where α is a constant representing the sampling rate, $(h'_j, r, o'_j)$ represents the j-th negative triplet sample, $h'$ represents the head vector of the negative triplet sample, $o'$ represents its tail vector, $r$ represents its relation vector, $P_j = \lVert h'_j \circ r - o'_j \rVert$ is the scoring function of the sample, and $\circ$ denotes the Hadamard product;

The step (3) specifically comprises the following steps:

(3b) Constructing a cell line embedded characterization relearning deep network model, namely a one-dimensional convolutional neural network. The one-dimensional convolutional neural network can have a plurality of convolutional layers and, by setting different convolution kernel sizes, processes the cell line embedded characterization through operations such as convolution, activation, batch normalization and pooling. In this embodiment the one-dimensional convolutional network has three convolutional layers: the convolution width K₁ is 7, the convolution stride S₁ is 1, the max-pooling width K₂ is 3 and the pooling stride S₂ is 3; the only difference between the layers is the channel number C, which is 8, 16 and 32 respectively. After the convolutions, a fully connected layer serves as the output of the whole convolutional network, and the dropout rate is set to 0.5;
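The stated kernel and stride settings determine how quickly the sequence length shrinks, and therefore the input size of the fully connected layer. A minimal sketch of that arithmetic follows, assuming no padding and no dilation, which the text does not specify. The input length of 1000 is hypothetical.

```python
def out_length(length, kernel, stride, padding=0):
    """Output length of a 1-D convolution or pooling window (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1

def cnn_shapes(input_length, channels=(8, 16, 32), k1=7, s1=1, k2=3, s2=3):
    """Return (channels, length) after each convolution + max-pooling stage."""
    shapes = []
    length = input_length
    for c in channels:
        length = out_length(length, k1, s1)   # convolution: K1 = 7, S1 = 1
        length = out_length(length, k2, s2)   # max pooling: K2 = 3, S2 = 3
        shapes.append((c, length))
    return shapes

shapes = cnn_shapes(1000)                      # hypothetical input of 1000 values
flattened = shapes[-1][0] * shapes[-1][1]      # input size of the fully connected layer
# shapes == [(8, 331), (16, 108), (32, 34)], flattened == 1088
```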

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2$$

where N is the number of cell lines, $Y_i$ is the value of the tumor cell relearning characterization Embed₂, and $\hat{Y}_i$ is the value of the cell line embedded characterization Embed₁;

The step (4) specifically comprises the following steps:

(4b) Constructing a DNN classification model:

(4b1) Inputting the drug effect prediction training set into the constructed DNN classification model to obtain the probability that a cell line is sensitive to a drug, and judging whether the output is sensitive or resistant to obtain the sensitivity relationship. The DNN classification model can have a plurality of hidden layers, with the number of units per layer set reasonably according to the dimension of the cell line embedded characterization. Here the number of hidden layers L is set to 3 and the per-layer unit counts $a_i$ are given as 200, 50; ReLU or another activation function is used between layers. The output layer has 1 neuron with a Sigmoid activation function, making this a binary classification task: the Sigmoid function outputs an event probability between 0 and 1, and when the output exceeds the threshold of 0.5 the sample is assigned to the positive class, i.e. sensitive;
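The forward pass of such a classifier can be sketched without any framework. The layer sizes below (a 16-dimensional input and hidden layers of 8, 4 and 2 units) are hypothetical toy values chosen to keep the example small and are not the unit counts of the embodiment. The random initialization and function names are likewise illustrative.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(v, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def init_layers(sizes, seed=0):
    """Random weights and zero biases for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(layers, embedding):
    """ReLU on every hidden layer, Sigmoid on the single output unit."""
    v = embedding
    for weights, bias in layers[:-1]:
        v = relu(dense(v, weights, bias))
    weights, bias = layers[-1]
    return sigmoid(dense(v, weights, bias)[0])

# HYPOTHETICAL sizes: 16-dim embedding, three hidden layers, one output unit.
layers = init_layers([16, 8, 4, 2, 1])
prob = forward(layers, [0.1] * 16)     # probability of "sensitive" for one sample
```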

In summary, the invention builds a convolutional neural network model from the original gene expression profile and the cell line embedded characterization, so that a new sample is represented directly with the trained model, overcoming the defect of existing methods that the model must be retrained when a new sample is added; through cell line embedded characterization relearning, the original expression profile, the drug effect label pairs and the gene regulation network information of the tumor cell lines are integrated, improving the prediction of tumor cell drug sensitivity; and deep learning encoding is introduced to handle the high dimensionality of high-throughput gene data.

Claims

1. A method for predicting the efficacy of an antitumor drug based on knowledge graph embedding representation relearning, characterized in that the method comprises the following steps in sequence:

(2) Obtaining the cell line embedded characterization Embed₁: the cells are fused with the gene regulation network to obtain a cell-gene fusion regulation network map, and the cell-gene fusion regulation network map is input into a knowledge graph embedding model for learning to obtain the cell line embedded characterization Embed₁;

2. The method for predicting the efficacy of the antitumor drug based on knowledge-graph embedding representation relearning according to claim 1, which is characterized in that: the step (2) specifically comprises the following steps:

where α is a constant representing the sampling rate, $(h'_j, r, o'_j)$ represents the j-th negative triplet sample, $h'$ represents the head vector of the negative triplet sample, $o'$ represents its tail vector, $r$ represents its relation vector, $P_j = \lVert h'_j \circ r - o'_j \rVert$ is the scoring function of the sample, and $\circ$ denotes the Hadamard product;

3. The method for predicting the efficacy of the antitumor drug based on knowledge-graph embedding representation relearning according to claim 1, which is characterized in that: the step (3) specifically comprises the following steps:

4. The method for predicting the efficacy of the antitumor drug based on knowledge-graph embedding representation relearning according to claim 1, which is characterized in that: the step (4) specifically comprises the following steps:

(4b) Constructing a DNN classification model:

5. The method for predicting the efficacy of the antitumor drug based on knowledge-graph embedding representation relearning according to claim 1, which is characterized in that: the step (5) specifically refers to: using the trained DNN classification model and the tumor cell relearning characterization Embed to predict the relationship between the tumor cells to be tested and drug sensitivity, where sensitive is 1 and resistant is 0:

where f represents the trained DNN classification model, and $z_i$ represents the probability, output through the Sigmoid function, that the i-th tumor cell to be predicted in Embed is sensitive to the drug; if this probability is greater than 0.5, 1 is output, indicating sensitivity to the drug; if it is less than 0.5, 0 is output, indicating resistance to the drug.