CN114420203A

CN114420203A - Method and model for predicting transcription factor-target gene interaction

Info

Publication number: CN114420203A
Application number: CN202111493609.3A
Authority: CN
Inventors: 黄裕安; 潘贵青; 王佳; 林秋镇; 李坚强
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2022-04-29
Also published as: WO2023103387A1

Abstract

The invention discloses a method and a model for predicting the interaction between a transcription factor (TF) and a target gene. The method predicts potential TF-target gene interactions using a random-walk-based heterogeneous graph embedding algorithm: node sequences ("sentences") are generated from the heterogeneous graph by random walk, training samples are extracted from them with a sliding window, node features are generated by the heterogeneous graph embedding algorithm, and undiscovered TF-target gene interactions are then predicted. In addition, a TG node is added to the heterogeneous network; when the random walk that generates the sample paths reaches a step at which a TG or disease node can be chosen, the TG or disease node is selected according to a set probability, which addresses the cold start problem and allows the node information of the heterogeneous network to be captured better.

Description

Method and model for predicting transcription factor-target gene interaction

Technical Field

The invention relates to the technical field of bioinformatics, in particular to a method and a model for predicting transcription factor-target gene interaction.

Background

Transcriptional regulation coordinates gene expression by controlling the rate of transcription, and it determines cell developmental fate and the cell's response to genetic and environmental perturbations. In this process, a transcription factor (TF) binds to cis-regulatory elements of DNA and activates RNA polymerase to regulate transcription of target genes. Given its importance for biological processes, determining the pattern of interaction between a TF and its target genes is crucial for biological and medical research.

Yang et al. proposed the GripDL model, which consists of a convolutional neural network (CNN) that learns embedded features from in situ hybridization (ISH) images of genes, including TFs and their target genes, to infer potential interactions between them. However, gene expression image data remain very expensive to obtain for some research targets, which makes the approach difficult to apply widely in practice. Lin et al. used a collaborative filtering technique based on three-factor decomposition to predict potential target genes for a particular TF, using relevant protein-protein interactions as training data. However, this method does not fully consider the connection between a TF and its target genes.

Accordingly, the prior art is yet to be improved and developed.

Disclosure of Invention

In view of the above deficiencies of the prior art, the present invention aims to provide a method and a model for predicting transcription factor-target gene interaction, so as to solve two problems of the prior art: existing prediction approaches cannot directly yield TF-target gene interaction results, and they do not consider the connection between a TF and its target genes comprehensively.

The technical scheme of the invention is as follows:

A method for predicting transcription factor-target gene interaction, comprising the steps of:

extracting the connections between transcription factors and target genes from a gene database and the connections between genes and diseases from a disease database to construct a first heterogeneous network with three node types;

assuming that a node, denoted TG, exists between a transcription factor and a target gene, and forming a second heterogeneous network from the first heterogeneous network and the node TG;

generating sample paths from the second heterogeneous network by random walk based on a meta-path;

constructing a graph embedding model based on the sample paths and, given a sample pair of nodes, concatenating the embedding vectors of the pair obtained with the graph embedding model and using them as the input of a fully connected layer to obtain a result for the pair of nodes;

and obtaining the final embeddings of the nodes from the second heterogeneous network according to the result, forming a transcription factor embedding matrix F and a target gene embedding matrix G respectively, and calculating the dot product of the two matrices to obtain a prediction score matrix R, wherein the value in the i-th row and j-th column represents the prediction score of the i-th transcription factor and the j-th target gene.

In the method for predicting transcription factor-target gene interaction, the node types include transcription factor, target gene and disease.

In the method for predicting transcription factor-target gene interaction, the node TG is set to a vector of constant 1.

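In the method for predicting transcription factor-target gene interaction, the meta-path is represented in the form

$$V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{n-1}} V_n$$

wherein $R = R_1 \circ R_2 \circ \cdots \circ R_{n-1}$ represents the association between the two different node types $V_1$ and $V_n$.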

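In the method for predicting transcription factor-target gene interaction, the probability p of the transition at step k is calculated according to the form of the meta-path, with the following formula:

$$p(v^{k+1} \mid v_t^{k}, \mathcal{P}) = \begin{cases} \dfrac{1}{|N_{t+1}(v_t^{k})|}, & (v^{k+1}, v_t^{k}) \in E,\ \phi(v^{k+1}) = t+1 \\ 0, & (v^{k+1}, v_t^{k}) \in E,\ \phi(v^{k+1}) \neq t+1 \\ 0, & (v^{k+1}, v_t^{k}) \notin E \end{cases}$$

wherein $\phi(v)$ denotes the type of node $v$, $\mathcal{P}$ denotes the meta-path, and $N_{t+1}(v_t^{k})$ represents the neighbors of node $v_t^{k}$ that are of type $V_{t+1}$.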

In the method for predicting transcription factor-target gene interaction, the graph embedding model is composed of a first neural network model and a second neural network model; the first neural network model comprises model construction and the node embeddings obtained through the model; the second neural network model is used for calculating the similarity between a pair of node embeddings.

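In the method for predicting transcription factor-target gene interaction, for the first neural network model, a heterogeneous network G = (V, E, T) with |V| > 1 is given, and node embeddings are learned over the different types of nodes and links by maximizing the probability that a node has its heterogeneous neighbor set; the objective function is expressed as:

$$\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta) \tag{1}$$

wherein $T_V$ denotes the set of node types, $N_t(v)$ represents the neighbors of node $v$ of the $t$-th node type in the heterogeneous context, and $\theta$ denotes the model parameters;

for the second neural network model, the embedding vector of each node is extracted from the hidden-layer weight matrix of the first neural network model; given a pair of nodes, the embedding vectors of the pair are concatenated as input, the result obtained is used as the model loss, and objective function (1) becomes function (2):

$$\log \sigma\big(F(g(X_{c_t}, X_v))\big) + \sum_{m=1}^{M} \mathbb{E}_{u^m \sim P(u)}\Big[\log \sigma\big(-F(g(X_{u^m}, X_v))\big)\Big] \tag{2}$$

where $X_v$ denotes the embedding vector of node $v$, $F$ represents a fully connected layer, $g$ represents the concatenation of two node embeddings, $u^m$ is the $m$-th of $M$ negative nodes sampled from a distribution $P(u)$, and $\sigma(x)$ is calculated as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$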

A model for predicting transcription factor-target gene interaction, which is constructed using the method for predicting transcription factor-target gene interaction as described above.

The model for predicting transcription factor-target gene interaction is used to create a BioTGI tool, which predicts potential interactions between TFs and target genes.

Beneficial effects: the invention provides a method and a model for predicting the interaction between a transcription factor and a target gene. The method predicts potential TF-target gene interactions using a random-walk-based heterogeneous graph embedding algorithm: node sequences ("sentences") are generated from the heterogeneous graph by random walk, training samples are extracted from them with a sliding window, node features are generated by the heterogeneous graph embedding algorithm, and undiscovered TF-target gene interactions are then predicted. In addition, a TG node is added to the heterogeneous network; when the random walk that generates the sample paths reaches a step at which a TG or disease node can be chosen, the TG or disease node is selected according to a set probability, which addresses the cold start problem and allows the node information of the heterogeneous network to be captured better.

Drawings

FIG. 1 is a flow chart of a method for predicting transcription factor-target gene interaction of the present invention;

FIG. 2 is a diagram illustrating the construction of a heterogeneous network according to the present invention;

FIG. 3 is a schematic diagram of a set meta path according to the present invention;

FIG. 4 is a diagram illustrating the random walk method based on meta-path according to the present invention;

FIG. 5 is a diagram illustrating a heterogeneous node embedding model according to the present invention;

FIG. 6 is a schematic diagram of a scoring matrix according to the present invention.

Detailed Description

The present invention provides a method and a model for predicting transcription factor-target gene interaction. To make the objects, technical schemes and effects of the present invention clearer, the present invention is described in further detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the description of the present application, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", and the like indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience and simplicity of description, do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

To detect reliable interactions between TFs and target genes, different strategies based on in vitro and in vivo assays have been developed to obtain profiles of transcription factor binding sites (TFBS). For example, ChIP-X technologies (chromatin immunoprecipitation technologies such as ChIP-chip, ChIP-seq and ChIP-PET) isolate the binding sites of TFs and then sequence the DNA enriched by antibody immunoprecipitation in vivo, to finally determine the target genes of the TFs. As an alternative technique for identifying the association between a TF and its targets, DamID (DNA adenine methyltransferase identification) methylates the adenine bases in GATC motifs near the TF-target interaction site; these methylated sequences are subsequently amplified and detected to identify TF-target interactions. However, these techniques consume considerable labor and materials, so a more convenient and less time-consuming method for predicting the target genes of a TF is needed.

In addition to experimental assays, researchers have developed computational methods to predict TFBS and thereby the interactions between TFs and target genes. One class of methods applies deep learning directly to DNA sequences. For example, KEGRU extracts k-mer features from DNA sequences and builds a bidirectional gated recurrent unit (GRU) network to learn embeddings, in order to predict potential motifs as TF binding sites. However, such predictions use TFBS as intermediate data in the computational pipeline and still do not directly yield TF-target interaction results. To fill this gap, Yang et al. proposed the GripDL model, which consists of a convolutional neural network (CNN) that learns embedded features from in situ hybridization (ISH) images of genes, including TFs and their target genes, to infer potential interactions between them. However, gene expression image data remain very expensive to obtain for some research targets, which makes the approach difficult to apply widely in practice. Lin et al. used a collaborative filtering technique based on three-factor decomposition to predict potential target genes for a particular TF, using relevant protein-protein interactions as training data. However, this method does not consider the connection between a TF and its target genes comprehensively.

Based on this, referring to FIG. 1, the present invention provides a method for predicting transcription factor-target gene interaction, comprising the steps of:

step S10, constructing a first heterogeneous network;

step S20, forming a second heterogeneous network from the first heterogeneous network and a node TG;

step S30, generating sample paths from the second heterogeneous network;

step S40, constructing a graph embedding model based on the sample paths;

step S50, calculating prediction scores.

The method predicts TF-target gene interactions directly, considers the connection between a TF and its target genes more comprehensively, and is simple, convenient and less time-consuming. Because the invention considers the association between a TF and its target genes more comprehensively and also considers their associations with diseases, the prediction performance is significantly improved compared with existing methods.

Specifically, as shown in FIG. 2, the connections between transcription factors and target genes are extracted from the gene database and the connections between genes and diseases are extracted from the disease database, and a first heterogeneous network with three different node types is finally constructed, wherein the node types are transcription factor (TF), target gene, and disease.
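The construction of the first heterogeneous network can be illustrated with a minimal Python sketch. It assumes the two databases have already been reduced to plain edge lists and that TF and target gene identifiers are disjoint; the function name `build_hetero_network` and the toy identifiers (`TF1`, `G1`, `D1`, ...) are hypothetical and are not taken from the invention or from any database.

```python
from collections import defaultdict

def build_hetero_network(tf_gene_edges, gene_disease_edges):
    """Return (node_type, adjacency) for an undirected network with three node types."""
    node_type = {}
    adjacency = defaultdict(set)
    # TF-target gene links (assumed to come from the gene database).
    for tf, gene in tf_gene_edges:
        node_type[tf] = "TF"
        node_type[gene] = "gene"
        adjacency[tf].add(gene)
        adjacency[gene].add(tf)
    # Gene-disease links (assumed to come from the disease database).
    for gene, disease in gene_disease_edges:
        node_type.setdefault(gene, "gene")
        node_type[disease] = "disease"
        adjacency[gene].add(disease)
        adjacency[disease].add(gene)
    return node_type, dict(adjacency)

# Toy data, purely illustrative.
tf_gene = [("TF1", "G1"), ("TF1", "G2"), ("TF2", "G2")]
gene_disease = [("G1", "D1"), ("G2", "D1")]
node_type, adjacency = build_hetero_network(tf_gene, gene_disease)
```

Storing the node type separately from the adjacency sets keeps type-restricted neighbor lookups simple in later steps.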

Predicting target genes for a new TF that has no known target gene (the cold start problem) is a challenge. To solve the cold start problem in the heterogeneous graph (HG), the present invention assumes that there is always one node associated with both the TF and the target gene, so that this new node can be sampled when a path is sampled with a random walk strategy. Based on this assumption, a node called TG, i.e., a node between the transcription factor and the target gene, is added to the sample path, as shown in FIG. 3. In the present invention, the TG node is regarded as a general entity between a TF and a target gene that shares the same information with them; therefore, the embedding of the TG node is set to a constant vector whose elements are all 1.
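As a hedged illustration of this step, the sketch below adds a single TG node to a toy network and returns its constant embedding. It assumes that the TG node is linked to every TF and every target gene, which is one possible reading of the description; the helper names `add_tg_node` and `tg_embedding` and the toy node `TF_new` are hypothetical.

```python
def add_tg_node(node_type, adjacency, tg_name="TG"):
    """Return copies of the network with one TG node linked to all TFs and genes."""
    new_types = dict(node_type)
    new_adj = {n: set(nbrs) for n, nbrs in adjacency.items()}
    new_types[tg_name] = "TG"
    new_adj[tg_name] = set()
    for node, kind in node_type.items():
        if kind in ("TF", "gene"):  # assumption: disease nodes are not linked to TG
            new_adj.setdefault(node, set()).add(tg_name)
            new_adj[tg_name].add(node)
    return new_types, new_adj

def tg_embedding(dim):
    """Constant embedding of the TG node: every element equals 1."""
    return [1.0] * dim

# "TF_new" has no known target gene (cold start) before TG is added.
node_type = {"TF1": "TF", "TF_new": "TF", "G1": "gene", "D1": "disease"}
adjacency = {"TF1": {"G1"}, "G1": {"TF1", "D1"}, "D1": {"G1"}}
types2, adj2 = add_tg_node(node_type, adjacency)
```

After the TG node is added, a walk starting at `TF_new` has at least one neighbor to move to, which is why the node helps with cold-start TFs.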

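In some embodiments, a meta-path scheme is set, the meta-path being represented in the form

$$V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{n-1}} V_n$$

wherein $R = R_1 \circ R_2 \circ \cdots \circ R_{n-1}$ represents the association between the two different node types $V_1$ and $V_n$. Given a meta-path scheme, the probability p of the transition at step k can be calculated, as shown in FIG. 4, with the following formula:

$$p(v^{k+1} \mid v_t^{k}, \mathcal{P}) = \begin{cases} \dfrac{1}{|N_{t+1}(v_t^{k})|}, & (v^{k+1}, v_t^{k}) \in E,\ \phi(v^{k+1}) = t+1 \\ 0, & (v^{k+1}, v_t^{k}) \in E,\ \phi(v^{k+1}) \neq t+1 \\ 0, & (v^{k+1}, v_t^{k}) \notin E \end{cases}$$

wherein $\phi(v)$ denotes the type of node $v$, $\mathcal{P}$ denotes the meta-path, and $N_{t+1}(v_t^{k})$ represents the neighbors of node $v_t^{k}$ that are of type $V_{t+1}$. Under the guidance of the meta-path, explicit connections can be captured, and entities that have no explicit connection in the heterogeneous network can also be linked through the meta-path as the random walk proceeds. Therefore, the method for predicting transcription factor-target gene interaction of the present invention can consider the connection between a TF and its target genes more comprehensively.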
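A meta-path-guided random walk of this kind can be sketched as follows. The sketch assumes a uniform choice among the neighbors whose type matches the next type in the meta-path, and it stops early when no such neighbor exists; the function names, the toy network and the example meta-path TF, gene, disease, gene, TF are illustrative assumptions rather than the scheme actually used by the invention.

```python
import random

def transition_probs(current, next_type, node_type, adjacency):
    """Uniform probabilities over neighbors of `current` whose type is `next_type`."""
    candidates = sorted(
        n for n in adjacency.get(current, ()) if node_type[n] == next_type
    )
    if not candidates:
        return {}
    p = 1.0 / len(candidates)
    return {n: p for n in candidates}

def metapath_walk(start, meta_path, node_type, adjacency, rng):
    """Walk once along `meta_path` (a list of node types) starting at `start`."""
    if node_type[start] != meta_path[0]:
        raise ValueError("start node type must match the first meta-path type")
    walk = [start]
    for next_type in meta_path[1:]:
        probs = transition_probs(walk[-1], next_type, node_type, adjacency)
        if not probs:  # no neighbor of the required type: end the walk early
            break
        nodes = list(probs)
        weights = [probs[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

# Toy network, purely illustrative.
node_type = {"TF1": "TF", "TF2": "TF", "G1": "gene", "G2": "gene", "D1": "disease"}
adjacency = {
    "TF1": {"G1", "G2"}, "TF2": {"G2"},
    "G1": {"TF1", "D1"}, "G2": {"TF1", "TF2", "D1"},
    "D1": {"G1", "G2"},
}
meta_path = ["TF", "gene", "disease", "gene", "TF"]
walk = metapath_walk("TF1", meta_path, node_type, adjacency, random.Random(0))
```

In this toy network a walk from `TF1` can end at `TF2` by way of the shared disease node even though the two TFs are not directly connected, which illustrates how entities without an explicit link become associated.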

In the present embodiment, a graph embedding model is constructed based on the path samples produced by random walk. The graph embedding model is composed of a first neural network model and a second neural network model: the first neural network model comprises model construction and the node embeddings obtained through the model; the second neural network model calculates the similarity between a pair of node embeddings in a new way, as shown in FIG. 5.

Specifically, the first neural network model consists of two parts, namely model construction and the node embeddings obtained through the model; what is mainly needed from this model is the parameters it learns, namely the weights of the hidden-layer matrix.
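The training samples fed to the first neural network model are extracted from the sampled paths with a sliding window. A minimal sketch of that extraction is given below, assuming a symmetric window and (center, context) pairs in the style of skip-gram training; the function name `sliding_window_pairs` and the window size are hypothetical choices.

```python
def sliding_window_pairs(walk, window):
    """Return (center, context) pairs for every node and its neighbors within `window` steps."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# One toy sample path; real paths would come from the random walk step.
pairs = sliding_window_pairs(["TF1", "G2", "D1", "G1"], window=1)
```

A larger window pairs nodes that are further apart on the path, at the cost of more and noisier training pairs.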

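In this embodiment, for the first neural network model, a heterogeneous network G = (V, E, T) with |V| > 1 is given, and node embeddings are learned over the different types of nodes and connections by maximizing the probability that a node has its heterogeneous neighbor set. Mathematically, the objective function is defined as follows:

$$\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta) \tag{1}$$

where $T_V$ denotes the set of node types, $N_t(v)$ represents the neighbors of node $v$ of the $t$-th node type in the heterogeneous context, and $p(c_t \mid v; \theta)$ is defined by the softmax function:

$$p(c_t \mid v; \theta) = \frac{e^{X_{c_t} \cdot X_v}}{\sum_{u \in V} e^{X_u \cdot X_v}}$$

where $X_v$ is the embedding vector of node $v$. That is, what this embodiment really needs is the set of parameters learned by the first neural network model. Therefore, the present invention develops another neural network model (the second neural network model) to optimize the embeddings learned by the first neural network model.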
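The softmax probability of a context node given a center node can be evaluated as in the sketch below. It assumes that the score between two nodes is the dot product of their embedding vectors and that the normalization runs over all nodes; the two-dimensional toy embeddings and the function name `softmax_context_prob` are made up for illustration.

```python
import math

def softmax_context_prob(center, context, embeddings):
    """p(context | center) with dot-product scores normalized over all nodes."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {u: dot(vec, embeddings[center]) for u, vec in embeddings.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    denom = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[context] - top) / denom

# Toy embeddings, purely illustrative.
embeddings = {"TF1": [1.0, 0.0], "G1": [0.9, 0.1], "D1": [-1.0, 0.5]}
probs = {c: softmax_context_prob("TF1", c, embeddings) for c in embeddings}
```

Because the denominator sums over every node, the cost of one evaluation grows with the size of the network, which is the usual motivation for replacing the full softmax with negative sampling.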

Specifically, the embedding vector of each node can be obtained from the weights of the hidden-layer matrix of the first neural network model. When a sample pair of nodes is given, their embedding vectors can be concatenated and used as the input of the fully connected layer to obtain the result for that pair of nodes. The final result is then used as the training loss for both models.
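The concatenation of two node embeddings followed by a fully connected layer can be sketched as below. The layer is assumed here to be a single linear output unit with a weight vector and a bias, since the description does not state its size or activation; `pair_score` and all numeric values are hypothetical.

```python
def pair_score(emb_u, emb_v, weights, bias):
    """Concatenate two embeddings and apply one linear (fully connected) output unit."""
    g = list(emb_u) + list(emb_v)  # concatenation of the pair of embeddings
    if len(weights) != len(g):
        raise ValueError("weights must match the concatenated embedding length")
    return sum(w * x for w, x in zip(weights, g)) + bias

# Toy values, purely illustrative.
score = pair_score([1.0, 2.0], [3.0, 4.0], [0.5, 0.5, 0.5, 0.5], 1.0)
```

Unlike a plain dot product, a learned layer over the concatenated vector can weight each embedding dimension of each node differently.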

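In this embodiment, negative sampling is used for efficient optimization, so that function (1) becomes formula (2):

$$\log \sigma\big(F(g(X_{c_t}, X_v))\big) + \sum_{m=1}^{M} \mathbb{E}_{u^m \sim P(u)}\Big[\log \sigma\big(-F(g(X_{u^m}, X_v))\big)\Big] \tag{2}$$

where $X_v$ denotes the embedding vector of node $v$, $F$ represents the fully connected layer, $g$ represents the concatenation of two node embeddings, $u^m$ is the $m$-th of $M$ negative nodes sampled from a distribution $P(u)$, and $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.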
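A negative-sampling objective of the usual form, the log-sigmoid of the positive pair's score plus the log-sigmoid of the negated score of each sampled negative pair, can be computed as in the sketch below. The scores are assumed to be the outputs of the fully connected layer for each pair; the function names are hypothetical, and the exact form used by the invention may differ.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def negative_sampling_objective(pos_score, neg_scores):
    """log sigma(pos) + sum of log sigma(-neg) over the sampled negative pairs."""
    return log_sigmoid(pos_score) + sum(log_sigmoid(-s) for s in neg_scores)

# A well-separated example scores higher than a badly separated one.
good = negative_sampling_objective(5.0, [-5.0, -4.0])
bad = negative_sampling_objective(-5.0, [5.0, 4.0])
```

The objective is to be maximized; negating it gives a loss suitable for gradient descent.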

As shown in FIG. 6, in the last step, this embodiment obtains the final embeddings of the nodes from the second heterogeneous network, forms a TF embedding matrix F and a target gene embedding matrix G respectively, and calculates the dot product of the two matrices to obtain a prediction score matrix R, in which the value in the i-th row and j-th column is the prediction score of the i-th TF and the j-th target gene.
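The score matrix computation can be illustrated with a short sketch in which each row of F is one TF embedding and each row of G is one target gene embedding, so that each entry of R is the dot product of a TF row and a gene row. The function name `score_matrix` and the numbers are illustrative only.

```python
def score_matrix(F, G):
    """R[i][j] is the dot product of TF embedding F[i] and gene embedding G[j]."""
    return [[sum(a * b for a, b in zip(f, g)) for g in G] for f in F]

# Two TFs and three target genes with 2-dimensional toy embeddings.
F = [[1.0, 0.0], [0.5, 0.5]]
G = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]
R = score_matrix(F, G)
# R == [[1.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
```

Row i of R is the score vector of the i-th TF over all target genes, and column j is the score vector of the j-th target gene over all TFs.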

In this embodiment, the intermediate-layer embeddings are concatenated and used as the input of a fully connected layer, and the resulting value is used as the loss for training on the second heterogeneous network, which improves the prediction performance of the model to a certain extent. In the task of predicting TF-target gene interactions, the method and model of the invention reach an AUC of 85.28%.

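It should be noted that the AUC value is the area under the ROC curve, calculated as:

$$\mathrm{AUC} = \frac{\sum_{e \in e^{+}} \mathrm{rank}_e - \dfrac{|e^{+}|\,(|e^{+}| + 1)}{2}}{|e^{+}| \cdot |e^{-}|}$$

wherein $e^{+}$ represents the positive sample set, $e^{-}$ represents the negative sample set, and $\mathrm{rank}_e$ represents the rank of edge $e$ by predicted score, with all samples sorted in ascending order of score.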
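The rank-based AUC can be computed as in the following sketch, which ranks all samples in ascending order of predicted score, sums the ranks of the positive samples, and normalizes by the number of positive-negative pairs. Assigning tied scores their average rank is an assumption added here so that ties count as one half; the function name `rank_auc` is hypothetical.

```python
def rank_auc(pos_scores, neg_scores):
    """AUC from the rank sum of the positive samples (ties receive average ranks)."""
    scored = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    scored.sort(key=lambda item: item[0])
    ranks = [0.0] * len(scored)
    i = 0
    while i < len(scored):
        j = i
        while j + 1 < len(scored) and scored[j + 1][0] == scored[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, scored) if label == 1)
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Three of the four positive-negative pairs are ordered correctly.
auc = rank_auc([0.9, 0.3], [0.5, 0.1])
# auc == 0.75
```

The value equals the fraction of positive-negative pairs in which the positive sample receives the higher score.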

In addition, the present invention provides a model for predicting a transcription factor-target gene interaction, which is constructed using the method for predicting a transcription factor-target gene interaction as described above.

Based on this model, a BioTGI tool is created for predicting potential interactions between TFs and target genes. Specifically, a user can query potential target genes by providing a TF: when the user provides the name of a TF, BioTGI trains the entire model and finally obtains a score vector in which each value represents the likelihood of interaction between the TF and one of the target genes in the data set. The larger the value, the higher the probability that the TF interacts with the corresponding target gene, so the user can see which genes are most likely to be target genes of the input TF. Similarly, the user can search for the potential TFs of a target gene; the calculation is the same as that for finding the target genes of a TF.
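The query described above amounts to reading one row of the score matrix and sorting it. The sketch below assumes that a score matrix R and the lists of TF and gene names are already available; it does not reproduce the BioTGI tool itself, and `top_targets` and the toy names are hypothetical.

```python
def top_targets(tf_name, tf_names, gene_names, R, k=3):
    """Return the k target genes with the highest predicted scores for one TF."""
    row = R[tf_names.index(tf_name)]
    ranked = sorted(zip(gene_names, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy score matrix with one TF and three candidate target genes.
tf_names = ["TF1"]
gene_names = ["G1", "G2", "G3"]
R = [[0.2, 0.9, 0.5]]
best = top_targets("TF1", tf_names, gene_names, R, k=2)
# best == [("G2", 0.9), ("G3", 0.5)]
```

Querying the potential TFs of a target gene works the same way on a column of R instead of a row.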

In summary, the invention provides a method and a model for predicting the interaction between a transcription factor and a target gene. The method predicts potential TF-target gene interactions using a random-walk-based heterogeneous graph embedding algorithm: node sequences ("sentences") are generated from the heterogeneous graph by random walk, training samples are extracted from them with a sliding window, node features are generated by the heterogeneous graph embedding algorithm, and undiscovered TF-target gene interactions are then predicted. In addition, a TG node is added to the heterogeneous network; when the random walk that generates the sample paths reaches a step at which a TG or disease node can be chosen, the TG or disease node is selected according to a set probability, which addresses the cold start problem and allows the node information of the heterogeneous network to be captured better.

It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims

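1. A method for predicting transcription factor-target gene interaction, comprising the steps of:

extracting the connections between transcription factors and target genes from a gene database and the connections between genes and diseases from a disease database to construct a first heterogeneous network with three node types;

assuming that a node, denoted TG, exists between a transcription factor and a target gene, and forming a second heterogeneous network from the first heterogeneous network and the node TG;

generating sample paths from the second heterogeneous network by random walk based on a meta-path;

constructing a graph embedding model based on the sample paths and, given a sample pair of nodes, concatenating the embedding vectors of the pair obtained with the graph embedding model and using them as the input of a fully connected layer to obtain a result for the pair of nodes;

and obtaining the final embeddings of the nodes from the second heterogeneous network according to the result, forming a transcription factor embedding matrix F and a target gene embedding matrix G respectively, and calculating the dot product of the two matrices to obtain a prediction score matrix R, wherein the value in the i-th row and j-th column represents the prediction score of the i-th transcription factor and the j-th target gene.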

2. The method of claim 1, wherein the node types include transcription factor, target gene, and disease.

3. The method for predicting a transcription factor-target gene interaction as claimed in claim 1, wherein the node TG is set to a vector of constant 1.

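4. The method for predicting a transcription factor-target gene interaction as claimed in claim 1, wherein the meta-path is represented in the form

$$V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{n-1}} V_n$$

wherein $R = R_1 \circ R_2 \circ \cdots \circ R_{n-1}$ represents the association between the two different node types $V_1$ and $V_n$.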

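5. The method of claim 4, wherein the probability p of the transition at step k is calculated according to the form of the meta-path, with the following formula:

$$p(v^{k+1} \mid v_t^{k}, \mathcal{P}) = \begin{cases} \dfrac{1}{|N_{t+1}(v_t^{k})|}, & (v^{k+1}, v_t^{k}) \in E,\ \phi(v^{k+1}) = t+1 \\ 0, & (v^{k+1}, v_t^{k}) \in E,\ \phi(v^{k+1}) \neq t+1 \\ 0, & (v^{k+1}, v_t^{k}) \notin E \end{cases}$$

wherein $\phi(v)$ denotes the type of node $v$, $\mathcal{P}$ denotes the meta-path, and $N_{t+1}(v_t^{k})$ represents the neighbors of node $v_t^{k}$ that are of type $V_{t+1}$.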

6. The method for predicting a transcription factor-target gene interaction as claimed in claim 1, wherein the graph embedding model is composed of a first neural network model and a second neural network model; the first neural network model comprises model construction and the node embeddings obtained through the model; the second neural network model is used for calculating the similarity between a pair of node embeddings.

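7. The method of claim 6, wherein for the first neural network model, a heterogeneous network G = (V, E, T) with |V| > 1 is given, and node embeddings are learned over the different types of nodes and links by maximizing the probability that a node has its heterogeneous neighbor set, the objective function being expressed as:

$$\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta) \tag{1}$$

wherein $T_V$ denotes the set of node types, $N_t(v)$ represents the neighbors of node $v$ of the $t$-th node type in the heterogeneous context, and $\theta$ denotes the model parameters.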

8. A model for predicting transcription factor-target gene interaction, which is constructed using the method for predicting transcription factor-target gene interaction according to any one of claims 1 to 7.

9. The model of claim 8, wherein the model is used to create a BioTGI tool for predicting potential interactions between TFs and target genes.