CN113299338B

CN113299338B - Knowledge-graph-based synthetic lethal gene pair prediction method, system, terminal and medium

Info

Publication number: CN113299338B
Application number: CN202110638513.5A
Authority: CN
Inventors: 郑杰; 吴敏; 刘勇; 王诗珂; 徐凡; 汪洁; 李云洋; 张可
Original assignee: ShanghaiTech University
Current assignee: ShanghaiTech University
Priority date: 2021-06-08
Filing date: 2021-06-08
Publication date: 2023-08-29
Anticipated expiration: 2041-06-08
Also published as: CN113299338A

Abstract

The knowledge-graph-based synthetic lethal gene pair prediction method, system, terminal and medium extract subgraphs from a knowledge graph and complete knowledge integration and feature extraction on the extracted subgraphs, thereby achieving high prediction accuracy. In particular, because the knowledge graph contains factors related to SL gene pairs, such as biological processes, diseases and pathways, the gene feature representations take into account the biological mechanisms shared by an SL gene pair, so that the prediction results are more informative and the problems of the prior art are addressed.

Description

Knowledge-graph-based synthetic lethal gene pair prediction method, system, terminal and medium

Technical Field

The invention belongs to the field of data processing, and particularly relates to a synthetic lethal gene pair prediction method, system, terminal and medium based on a knowledge graph.

Background

Complex biological systems operate by means of genetic interactions. Synthetic lethality (SL) is one such interaction: simultaneous inactivation of both genes of a synthetic lethal pair causes cell death, while inactivation of either gene alone does not affect cell viability. Synthetic lethal genes are key to the discovery of anticancer drug targets: when a specific gene is found to be inactivated in a tumor, its synthetic lethal partner gene can be inhibited by a drug, so that cancer cells are selectively killed without endangering healthy cells. Thus, predicting synthetic lethal gene pairs not only helps to improve the efficacy of targeted drug therapies, develop new effective treatment regimens and circumvent drug resistance, but may also provide opportunities for genes or biological pathways that currently cannot be targeted directly.

Current methods for screening SL gene pairs fall into two broad categories: high-throughput wet-lab experiments and computational methods. Wet-lab methods include RNA interference, chemical small-molecule inhibition and CRISPR-based gene editing; their core idea is to alter the expression of a gene and observe cell survival in order to screen SL gene pairs. Wet-lab results are highly reliable, but the methods suffer from high cost, off-target effects, lack of consistency across cell lines and unclear underlying mechanisms, so an effective computational method is needed to remedy these shortcomings.

Computational screening of SL gene pairs can be classified into three types. The first models the effect on a metabolic network of knocking out a single gene or a pair of genes; the second constructs gene signatures from prior knowledge and predicts potential SL gene pairs in combination with network topology. These two approaches rely on metabolic network models, domain-specific knowledge and genomic data, and do not fully exploit known SL gene pair information. The third type, machine learning, has been widely applied to SL gene pair prediction in recent years in order to make better use of these known data. Some researchers extract input features from protein interaction networks and build traditional machine learning models (e.g., support vector machines); others build encoder-decoder learning frameworks based on graph representation learning, in which the encoder maps the nodes of the SL gene interaction network to a low-dimensional space and the decoder describes the similarity between nodes to uncover possible SL relationships. However, when characterizing genes, these machine learning methods do not fully consider the biological mechanisms underlying SL gene pairs, such as whether both genes participate in the same pathway or play similar roles in a biological process. Additional knowledge is therefore required to obtain more complete gene feature representations.

Disclosure of Invention

In view of the above drawbacks of the prior art, the present invention aims to provide a knowledge-graph-based synthetic lethal gene pair prediction method, system, terminal and medium, which address two problems of the prior art: screening SL gene pairs by wet-lab methods suffers from high cost, batch effects, off-target effects and the like, and existing computational methods for predicting SL gene pairs ignore the biological mechanisms shared by an SL gene pair.

To achieve the above and other related objects, the present invention provides a knowledge-graph-based synthetic lethal gene pair prediction method, the method comprising: extracting from a knowledge graph subgraphs that each take one gene of the gene pair to be predicted as a central gene node and extend one or more levels outward from that central gene node, wherein each subgraph comprises subgraph feature representation information representing the neighborhood relations of each gene node; updating, based on the subgraph feature representation information, the feature representation information of the gene nodes at each level of each subgraph to obtain feature representation update information corresponding to the gene pair to be predicted; and calculating, from the feature representation update information of the gene pair to be predicted, a probability value that the gene pair has a synthetic lethal relationship, and obtaining a prediction result indicating whether the gene pair to be predicted has a synthetic lethal relationship.

In an embodiment of the present invention, the sub-graph feature representation information includes: the characteristic representing information of each gene node at each level, the characteristic representing information of the neighboring node corresponding to each gene node, and the relationship characteristic representing information between each gene node and the neighboring node corresponding to each gene node.

In an embodiment of the present invention, based on the sub-graph feature representation information, updating feature representation information of each gene node at each level in each sub-graph to obtain feature representation update information corresponding to the gene pair to be predicted includes: and based on the subgraph feature representation information, sequentially aggregating all neighborhood relations of all the gene nodes at each level in each subgraph according to the derivation direction of the central gene node, and sequentially updating the feature representation information of all the gene nodes at each level in each subgraph step by step to obtain feature representation updating information corresponding to the gene pair to be predicted.

In an embodiment of the present invention, the method for sequentially aggregating all neighborhood relations of each gene node at each level in each subgraph according to the derivation direction to the central gene node based on the subgraph feature representation information, and sequentially updating the feature representation information of each gene node at each level in each subgraph step by step to obtain feature representation update information corresponding to the gene pair to be predicted includes: based on the relation characteristic representation information between each gene node and the corresponding neighbor node and the characteristic representation information of the central gene node, respectively obtaining the relation weight value between each gene node and the corresponding neighbor node; and based on the weight values of the relations and the characteristic representation information of the gene nodes, sequentially carrying out weight aggregation on all neighborhood relations of the gene nodes at each level in each subgraph according to the derivation direction of the central gene node, and sequentially updating the characteristic representation information of the gene nodes at each level in each subgraph step by step to obtain the characteristic representation updating information corresponding to the gene pair to be predicted.

In an embodiment of the present invention, the method for sequentially performing weight aggregation on all neighborhood relations of each gene node at each level in each subgraph according to the deriving direction of the central gene node based on the weight value of each relation and the feature representation information of each gene node, and sequentially updating the feature representation information of each gene node at each level in each subgraph step by step to obtain the feature representation update information corresponding to the gene pair to be predicted includes: obtaining neighborhood relation characteristic representation information of neighborhood relation of each gene node according to the relation weight value between each gene node and the neighbor node corresponding to each gene node and the characteristic representation information of each gene node in turn according to the derivation direction of the central gene node; aggregating the neighborhood relation feature representation information corresponding to each gene node, and replacing the feature representation information of each gene node at each level in each subgraph with the aggregated feature representation information; and respectively taking the feature representation information obtained by aggregating the neighborhood relation feature representation information respectively corresponding to the central gene nodes as feature representation updating information corresponding to the gene pairs to be predicted.

In an embodiment of the present invention, calculating a probability value corresponding to a synthetic lethal relationship between the pair of genes to be predicted according to the feature representation update information of the pair of genes to be predicted, and obtaining a prediction result corresponding to whether the pair of genes to be predicted has the synthetic lethal relationship includes: based on the inner product function, calculating to obtain a probability value corresponding to the synthetic lethal relationship of the gene pair to be predicted according to the characteristic representation updating information of the gene pair to be predicted; obtaining a prediction result characteristic value according to the probability value based on the ReLU function and a set judgment threshold value; and obtaining a predicted result corresponding to whether the gene pair to be predicted has a synthetic lethal relationship or not according to the characteristic value of the predicted result based on the predicted result judging condition.

In an embodiment of the present invention, the judging condition based on the prediction result includes: if the characteristic value of the predicted result is 0, obtaining a predicted result which corresponds to the gene pair to be predicted and does not have a synthetic lethal relationship; if the characteristic value of the predicted result is 1, obtaining the predicted result of the synthetic lethal relationship of the corresponding gene pair to be predicted.

To achieve the above and other related objects, the present invention provides a synthetic lethal gene pair prediction system based on a knowledge-graph, the system comprising: the knowledge graph extraction module is used for extracting subgraphs which respectively take the gene pair to be predicted as central gene nodes and are derived from one or more stages of the central gene nodes from the knowledge graph; wherein the subgraph comprises: sub-graph feature representation information representing neighborhood relations of each gene node; the aggregation updating module is connected with the knowledge graph extracting module and is used for respectively updating the characteristic representation information of each gene node at each level in each subgraph based on the subgraph characteristic representation information so as to obtain the characteristic representation updating information corresponding to the gene pair to be predicted; and the prediction module is connected with the aggregation updating module and is used for calculating a probability value corresponding to the synthetic lethal relationship of the gene pair to be predicted according to the characteristic representation updating information of the gene pair to be predicted and obtaining a prediction result corresponding to whether the synthetic lethal relationship exists in the gene pair to be predicted.

To achieve the above and other related objects, the present application provides a synthetic lethal gene pair prediction terminal based on a knowledge-graph, comprising: a memory for storing a computer program; and the processor is used for executing the synthetic lethal gene pair prediction method based on the knowledge graph.

To achieve the above and other related objects, the present application provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, performs the synthetic lethal gene pair prediction method based on a knowledge-graph.

As described above, the knowledge-graph-based synthetic lethal gene pair prediction method, system, terminal and medium of the present application have the following beneficial effects: subgraphs are extracted from the knowledge graph, and knowledge integration and feature extraction are completed on the extracted subgraphs, thereby achieving high prediction accuracy. In particular, because the knowledge graph contains factors related to SL gene pairs, such as biological processes, diseases and pathways, the gene feature representations take into account the biological mechanisms shared by an SL gene pair, so that the prediction results are more informative and the problems of the prior art are addressed.

Drawings

FIG. 1 is a schematic flow chart of a synthetic lethal gene pair prediction method based on a knowledge-graph according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a portion of a subgraph in an embodiment of the invention.

FIG. 3 is a schematic structural diagram of a knowledge-graph-based synthetic lethal gene pair prediction system according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of the KG4SL model according to an embodiment of the present invention.

FIG. 5 is a schematic diagram showing the structure of a synthetic lethal gene pair prediction terminal based on a knowledge-graph according to an embodiment of the present invention.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied in other, different embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where no conflict arises.

The invention provides a knowledge-graph-based synthetic lethal gene pair prediction method that extracts subgraphs from a knowledge graph and completes knowledge integration and feature extraction on the extracted subgraphs, thereby achieving high prediction accuracy. In particular, because the knowledge graph contains factors related to SL gene pairs, such as biological processes, diseases and pathways, the gene feature representations take into account the biological mechanisms shared by an SL gene pair, so that the prediction results are more informative and the problems of the prior art are addressed.

The embodiments of the present invention will be described in detail below with reference to the attached drawings so that those skilled in the art to which the present invention pertains can easily implement the present invention. This invention may be embodied in many different forms and is not limited to the embodiments described herein.

As shown in fig. 1, a schematic flow chart of a synthetic lethal gene pair prediction method based on a knowledge-graph in the embodiment of the present invention is shown.

The method comprises the following steps:

Step S11: extracting from the knowledge graph subgraphs that each take one gene of the gene pair to be predicted as a central gene node and extend one or more levels outward from that central gene node.

In detail, the subgraph includes: sub-graph feature representation information representing neighborhood relations of each gene node; preferably, the sub-graph feature representation information includes: the characteristic representing information of each gene node at each level, the characteristic representing information of the neighboring node corresponding to each gene node, and the relationship characteristic representing information between each gene node and the neighboring node corresponding to each gene node.

Optionally, the knowledge graph comprises regulatory relations between genes, interaction relations between genes and co-expression relations between genes. Preferably, the knowledge graph is constructed based on SynLethDB; specifically, a knowledge graph having 11 types of entities and 24 types of relations is extracted from SynLethDB. Of the 24 relation types, 16 are directly related to genes; of the 11 entity types, 6 are directly related to genes. The entities and relations in the knowledge graph may be presented as (entity, relation, entity) triples. For example, the constructed knowledge graph contains 54,012 nodes and 2,231,921 edges after isolated nodes are removed.
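The triple representation and the removal of isolated nodes can be illustrated with a minimal sketch. The entity names, relation names and helper functions below are hypothetical and are not taken from SynLethDB; treating every edge as undirected is likewise an assumption made only for this illustration.

```python
from collections import defaultdict

# Hypothetical toy triples; the entity and relation names are illustrative only.
TRIPLES = [
    ("GENE_A", "participates_in", "PATHWAY_1"),
    ("GENE_B", "participates_in", "PATHWAY_1"),
    ("GENE_A", "regulates", "GENE_C"),
]


def build_adjacency(triples):
    """Map each entity to a list of (relation, neighbor) pairs.

    Edges are stored in both directions (an assumption of this sketch).
    """
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
        adjacency[tail].append((relation, head))
    return dict(adjacency)


def drop_isolated(entities, adjacency):
    """Keep only entities that take part in at least one triple."""
    return [entity for entity in entities if adjacency.get(entity)]


adjacency = build_adjacency(TRIPLES)
entities = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "PATHWAY_1"]
connected = drop_isolated(entities, adjacency)  # GENE_D has no edges and is dropped
```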

Optionally, two subgraphs are extracted from the knowledge graph, each taking one gene of the gene pair to be predicted as its central gene node and extending one or more levels outward from that central gene node; neighbor nodes of each gene node are sampled from the subgraph, and if the number of neighbors of a gene node on the subgraph does not meet the preset requirement, the shortfall is made up by repeated sampling (sampling with replacement). For example, given a gene pair e_i and e_j to be predicted, the two genes are each taken as a central gene node and the subgraphs connected to them are extracted. Because computing resources are limited, each subgraph includes only the gene nodes within 2 levels of its central gene; k neighbors are randomly sampled for each gene, and if a node on the subgraph has fewer than k neighbors, repeated sampling is used to make up the difference.
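The fixed-size neighbor sampling described above can be sketched as follows. This is one possible reading of "repeated sampling": all existing neighbors are kept and the shortfall is filled by drawing again from the same neighbors. The function names, the adjacency layout and the toy graph are assumptions for illustration.

```python
import random


def sample_neighbors(adjacency, node, k, rng):
    """Return exactly k (relation, neighbor) pairs for a node that has neighbors.

    With at least k neighbors, k are drawn without replacement; with fewer,
    the existing neighbors are kept and the shortfall is filled by re-drawing.
    """
    neighbors = adjacency.get(node, [])
    if not neighbors:
        return []
    if len(neighbors) >= k:
        return rng.sample(neighbors, k)
    extra = [rng.choice(neighbors) for _ in range(k - len(neighbors))]
    return list(neighbors) + extra


def extract_subgraph(adjacency, center, k, hops, rng):
    """Return (levels, edges); levels[h] lists the nodes h levels from the center."""
    levels = [[center]]
    edges = []  # edges[h] holds (node, relation, neighbor) between levels h and h + 1
    for _ in range(hops):
        next_level, hop_edges = [], []
        for node in levels[-1]:
            for relation, neighbor in sample_neighbors(adjacency, node, k, rng):
                hop_edges.append((node, relation, neighbor))
                next_level.append(neighbor)
        levels.append(next_level)
        edges.append(hop_edges)
    return levels, edges


toy = {
    "g_i": [("r1", "a"), ("r2", "b"), ("r3", "c")],
    "a": [("r1", "g_i")],
    "b": [("r2", "g_i")],
    "c": [("r3", "g_i")],
}
levels, edges = extract_subgraph(toy, "g_i", k=2, hops=2, rng=random.Random(0))
```

With k = 2 and 2 levels, the center contributes 2 sampled neighbors, and each of those (having a single neighbor) is padded to 2 entries, so the outer level always holds 4 entries.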

Step S12: updating, based on the subgraph feature representation information, the feature representation information of the gene nodes at each level of each subgraph to obtain feature representation update information corresponding to the gene pair to be predicted.

Optionally, step S12 includes: based on the subgraph feature representation information, sequentially aggregating all neighborhood relations of all the gene nodes at each level in each subgraph according to the derivation direction of the central gene node, and sequentially updating the feature representation information of all the gene nodes at each level in each subgraph step by step to obtain feature representation updating information corresponding to the gene pair to be predicted;

Specifically, the gene nodes in each subgraph are processed based on the subgraph feature representation information. Taking the subgraph whose central gene node is e_i as an example, for any node e on the subgraph, the neighborhood relations of all neighbor nodes of each gene node e are aggregated in sequence in the direction toward the central gene node e_i, and the feature representation information of gene node e is updated with the aggregation result, yielding the feature representation update information of the central gene node e_i. The subgraph of the other gene e_j is processed in the same way, yielding the feature representation update information of the gene pair e_i and e_j to be predicted.

Aggregating in sequence toward the central gene node e_i means the following: neighborhood relation aggregation starts from the gene nodes at the level farthest from the central gene node, and the feature representation information of the gene nodes at that level is updated with the aggregation result; based on the updated feature representation information of that level, neighborhood relation aggregation is then performed for the gene nodes one level closer to the center, updating their feature representation information; the same is then done for each successive level, so that the feature representation information is updated level by level until neighborhood relation aggregation is performed for the central gene node and its feature representation information is updated. Both subgraphs are processed in this way, finally yielding the feature representation update information corresponding to the gene pair to be predicted.
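The level-by-level processing order can be illustrated with a minimal sketch. To keep the focus on the ordering, each node carries a single scalar feature and the aggregator is a plain average, which is a stand-in and not the weighted aggregation of the method. The layout (each node having exactly k sampled neighbors stored contiguously in the next level) and all names are assumptions; the outermost level has no sampled neighbors of its own in this sketch, so it is used as-is.

```python
def mean_aggregate(own, neighbor_values):
    """Stand-in aggregator: average a node's own value with its neighbors' values."""
    values = [own] + list(neighbor_values)
    return sum(values) / len(values)


def propagate_to_center(levels, k, aggregate):
    """Update features from the outer levels toward the center.

    levels[h] holds the features of the nodes h levels from the center; the
    neighbors of levels[h][i] are assumed to be levels[h + 1][i * k:(i + 1) * k].
    """
    features = [list(level) for level in levels]
    # Start one level inside the farthest level and finish at the center (level 0).
    for h in range(len(features) - 2, -1, -1):
        for i in range(len(features[h])):
            neighbors = features[h + 1][i * k:(i + 1) * k]
            features[h][i] = aggregate(features[h][i], neighbors)
    return features[0][0]


toy_levels = [[0.0], [2.0, 4.0], [1.0, 3.0, 4.0, 4.0]]
center_value = propagate_to_center(toy_levels, k=2, aggregate=mean_aggregate)
```

In the toy input, the two level-1 nodes are updated first (to 2.0 and 4.0), and only then is the center updated from those refreshed values.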

Optionally, the method for sequentially aggregating all neighborhood relations of the gene nodes at each level in each subgraph according to the derivation direction to the central gene node based on the subgraph feature representation information, and sequentially updating the feature representation information of the gene nodes at each level in each subgraph step by step to obtain feature representation updating information corresponding to the gene pair to be predicted includes:

based on the relation characteristic representation information between each gene node and the corresponding neighbor node and the characteristic representation information of the central gene node, respectively obtaining the relation weight value between each gene node and the corresponding neighbor node; based on the weight value of each relation and the characteristic representation information of each gene node, sequentially carrying out weight aggregation on all neighborhood relations of each gene node at each level in each subgraph according to the derivation direction of the central gene node, and sequentially updating the characteristic representation information of each gene node at each level in each subgraph step by step to obtain the characteristic representation updating information corresponding to the gene pair to be predicted;

preferably, based on a relation weight value algorithm, according to relation characteristic representation information between each gene node and the corresponding neighbor node and characteristic representation information of the central gene node, respectively obtaining relation weight values between each gene node and the corresponding neighbor node;

The relation weight algorithm is as follows. In the subgraph whose central gene node is $e_j$, for the relation feature representation information $\mathbf{r}_{e,e'}$ between any node $e$ and any one of its neighbor nodes $e'$, the relation weight value $\pi^{e_j}_{r_{e,e'}}$ is calculated as:

$$\pi^{e_j}_{r_{e,e'}} = g\left(\mathbf{e}_j, \mathbf{r}_{e,e'}\right) \tag{1}$$

where $\mathbf{e}_j$ is the feature representation information of the central gene node $e_j$, the function $g$ is an inner product function, and the relation weight value $\pi^{e_j}_{r_{e,e'}}$ represents the importance of the relation feature representation information $\mathbf{r}_{e,e'}$ to gene $e_j$.
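A minimal sketch of the relation weight calculation, with the inner product function g applied to the central gene's feature vector and a relation's feature vector. The vector values and function names are hypothetical.

```python
def inner_product(x, y):
    """Function g: the inner product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(x, y))


def relation_weight(center_embedding, relation_embedding):
    """Score how important a relation is to the central gene."""
    return inner_product(center_embedding, relation_embedding)


center = [1.0, 0.0, 2.0]     # hypothetical feature vector of the central gene node
relation = [0.5, 3.0, 0.25]  # hypothetical feature vector of one relation
weight = relation_weight(center, relation)
```

Because the weight depends on the central gene, the same relation can receive different weights in the two subgraphs of a gene pair.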

Optionally, the method includes: sequentially calculating neighborhood relation characteristic representation information of neighborhood relations corresponding to each gene node according to the derivation direction of the central gene node; the specific modes comprise: obtaining neighborhood relation characteristic representation information of neighborhood relation of each gene node according to relation weight values between each gene node and neighbor nodes corresponding to each gene node and characteristic representation information of each gene node; aggregating the neighborhood relation feature representation information corresponding to each gene node, and replacing the feature representation information of each gene node at each level in each subgraph with the aggregated feature representation information; and respectively taking the feature representation information obtained by aggregating the neighborhood relation feature representation information respectively corresponding to the central gene nodes as feature representation updating information corresponding to the gene pairs to be predicted.

Optionally, the relation weight values $\pi^{e_j}_{r_{e,e'}}$ between each gene node $e$ and its corresponding neighbor nodes $e'$ are normalized to obtain normalized relation weight values $\tilde{\pi}^{e_j}_{r_{e,e'}}$; the neighborhood relation feature representation information of each neighborhood relation of the gene node is obtained from the normalized weight value $\tilde{\pi}^{e_j}_{r_{e,e'}}$ and the feature representation information $\mathbf{e}'$ of the neighbor node $e'$; based on a weight aggregation formula, the neighborhood relation feature representation information corresponding to each gene node $e$ is aggregated, and the aggregated feature representation information $\mathbf{e}_p$ replaces the feature representation information $\mathbf{e}$ of each gene node $e$ at each level in each subgraph;

wherein the weight aggregation formula comprises:

$$\mathbf{e}_p = \sum_{e' \in P(e)} \tilde{\pi}^{e_j}_{r_{e,e'}} \, \mathbf{e}' \tag{3}$$

and wherein $P(e)$ represents all neighbor nodes of gene $e$.
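The normalization and weighted aggregation can be sketched as below. The text only states that the weights are normalized; the softmax used here is an assumption, chosen because it yields positive weights that sum to 1. Function names and values are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores to positive values summing to 1 (assumed normalization)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [value / total for value in exps]


def aggregate_neighborhood(raw_weights, neighbor_embeddings):
    """Weighted sum of the neighbor feature vectors over all neighbors of a node."""
    normalized = softmax(raw_weights)
    dim = len(neighbor_embeddings[0])
    aggregated = [0.0] * dim
    for weight, embedding in zip(normalized, neighbor_embeddings):
        for d in range(dim):
            aggregated[d] += weight * embedding[d]
    return aggregated


# Two neighbors with equal raw weights each contribute half of their vector.
e_p = aggregate_neighborhood([1.0, 1.0], [[2.0, 0.0], [0.0, 4.0]])
```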

For a subgraph with h levels of derivation, as illustrated in FIG. 2: once the feature representation information e[h] of a gene node and its aggregated neighborhood representation e_p have been obtained, a trained feature representation algorithm for the entities in the subgraph is used to compute the feature representation information e[h+1] of the gene node at the next level up. The algorithm comprises:

$$\mathbf{e}[h+1] = \sigma\left(W\left(\mathbf{e}[h] + \mathbf{e}_p\right) + b\right) \tag{4}$$

where W and b are trainable parameters of the network and the activation function σ is the ReLU function. After h layers of computation, the feature representation update information of each of the two genes e_i and e_j of the pair to be predicted is obtained.
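Equation (4) can be written out as a short sketch in plain Python. The matrix W, the bias b and the input vectors below are arbitrary illustrative values rather than trained parameters, and the helper names are hypothetical.

```python
def relu(vector):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def update(e_h, e_p, W, b):
    """Equation (4): e[h+1] = ReLU(W (e[h] + e_p) + b)."""
    summed = [own + agg for own, agg in zip(e_h, e_p)]
    linear = [z + bias for z, bias in zip(matvec(W, summed), b)]
    return relu(linear)


W = [[1.0, 0.0], [0.0, -1.0]]  # illustrative weights, not trained values
b = [0.5, 0.0]
new_e = update([1.0, 2.0], [1.0, 1.0], W, b)
```

Here the summed vector is [2.0, 3.0], the linear step gives [2.5, -3.0], and ReLU clips the negative component to 0.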

Step S13: calculating, from the feature representation update information of the gene pair to be predicted, a probability value that the gene pair has a synthetic lethal relationship, and obtaining a prediction result indicating whether the gene pair to be predicted has a synthetic lethal relationship.

Optionally, step S13 includes: based on the inner product function f(x, y), calculating the probability value that the gene pair to be predicted has a synthetic lethal relationship from the feature representation update information of the two genes of the pair; based on the ReLU function σ and a set judgment threshold S, obtaining a prediction result feature value from the probability value; and, based on the prediction result judgment condition, obtaining from the prediction result feature value a prediction result indicating whether the gene pair to be predicted has a synthetic lethal relationship.

Preferably, the set judgment threshold S is 0.5: if the probability value is less than 0.5, the output value is 0; if the probability value is greater than or equal to 0.5, the output value is 1.

Optionally, the prediction result judgment condition includes: if the prediction result feature value is 0, a prediction result indicating that the gene pair to be predicted does not have a synthetic lethal relationship is obtained; if the prediction result feature value is 1, a prediction result indicating that the gene pair to be predicted has a synthetic lethal relationship is obtained.
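The scoring and thresholding of step S13 can be sketched as follows. The text names an inner product and a threshold of 0.5 but does not spell out how the inner product is bounded to the range of a probability; this sketch applies a logistic sigmoid for that purpose, which is an assumption and not part of the description. Function names and vectors are hypothetical.

```python
import math


def inner_product(x, y):
    """Function f(x, y): the inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def sigmoid(z):
    """Numerically stable logistic function (assumed squashing step)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_sl(rep_i, rep_j, threshold=0.5):
    """Return (probability, label); label 1 means an SL relationship is predicted."""
    probability = sigmoid(inner_product(rep_i, rep_j))
    label = 1 if probability >= threshold else 0
    return probability, label


prob_pos, label_pos = predict_sl([1.0, 2.0], [0.5, 0.5])   # inner product 1.5
prob_neg, label_neg = predict_sl([1.0, 0.0], [-2.0, 3.0])  # inner product -2.0
```

A probability exactly equal to the threshold maps to label 1, matching the "greater than or equal to 0.5" rule.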

Similar to the principles of the above embodiments, the present invention provides a synthetic lethal gene pair prediction system based on a knowledge-graph.

Specific embodiments are provided below with reference to the accompanying drawings:

FIG. 3 shows a schematic structural diagram of a synthetic lethal gene pair prediction system based on a knowledge-graph according to an embodiment of the present invention.

The system comprises:

the knowledge graph extraction module 31 is used for extracting subgraphs which respectively take a gene pair to be predicted as a central gene node and are respectively derived from one or more stages of the central gene node from the knowledge graph; wherein the subgraph comprises: sub-graph feature representation information representing neighborhood relations of each gene node;

the aggregation updating module 32 is connected with the knowledge graph extracting module 31 and is used for respectively updating the characteristic representation information of each gene node at each level in each subgraph based on the subgraph characteristic representation information so as to obtain the characteristic representation updating information corresponding to the gene pair to be predicted;

the prediction module 33 is connected to the aggregation updating module 32, and is configured to calculate a probability value corresponding to the synthetic lethal relationship of the gene pair to be predicted according to the feature representation updating information of the gene pair to be predicted, and obtain a prediction result corresponding to whether the synthetic lethal relationship exists in the gene pair to be predicted.
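The connection between the three modules can be sketched as a simple composition, under the assumption that each module is represented by a callable. The class name, the field names and the toy stand-in functions are hypothetical; they only show how the output of one module feeds the next.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class SLPredictionSystem:
    extract: Callable[[str], dict]                          # stands in for module 31
    aggregate_update: Callable[[dict], Vector]              # stands in for module 32
    predict: Callable[[Vector, Vector], Tuple[float, int]]  # stands in for module 33

    def run(self, gene_i: str, gene_j: str) -> Tuple[float, int]:
        # One subgraph per gene of the pair, each centered on that gene.
        subgraph_i = self.extract(gene_i)
        subgraph_j = self.extract(gene_j)
        # Each subgraph is reduced to the updated representation of its center.
        rep_i = self.aggregate_update(subgraph_i)
        rep_j = self.aggregate_update(subgraph_j)
        return self.predict(rep_i, rep_j)


def toy_extract(gene):
    return {"center": gene, "levels": [[gene]]}


def toy_aggregate_update(subgraph):
    return [1.0, 0.0]  # fixed vector, for wiring demonstration only


def toy_predict(rep_i, rep_j):
    score = sum(a * b for a, b in zip(rep_i, rep_j))
    return score, (1 if score >= 0.5 else 0)


system = SLPredictionSystem(toy_extract, toy_aggregate_update, toy_predict)
result = system.run("G1", "G2")
```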

It should be understood that the division of the modules in the system embodiment of FIG. 3 is merely a division of logical functions; in an actual implementation the modules may be fully or partially integrated into one physical entity or may be physically separate. These modules may all be implemented as software invoked by a processing element, or all in hardware; alternatively, some modules may be implemented as software invoked by a processing element and the remaining modules in hardware.

For example, each module may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field-programmable gate arrays (FPGA). For another example, when one of the above modules is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented as a system-on-a-chip (SOC).

Optionally, the aggregation update module 32 is configured to, based on the subgraph feature representation information and taking the subgraph with $e_i$ as the central gene node as an example, for any node $e$ on the subgraph, sequentially aggregate the neighborhood relations of all neighbor nodes of each gene node $e$ along the derivation direction of the central gene node $e_i$, and update the feature representation information of the gene node $e$ according to the aggregation result, so as to obtain the feature representation update information corresponding to the central gene node $e_i$; the same is done for the other gene $e_j$, so as to obtain the feature representation update information corresponding to the gene pair $e_i$ and $e_j$ to be predicted. It should be noted that sequential aggregation along the derivation direction of the central gene node $e_i$ means that neighborhood relation aggregation is first performed at the gene nodes of the level farthest from the central gene node, and the feature representation information of the gene nodes of that level is updated with the aggregation result; then, based on the updated feature representation information, neighborhood relation aggregation is performed on the gene nodes of the preceding level to update their feature representation information; the feature representation information of the gene nodes is updated level by level in this way until neighborhood relation aggregation is performed on the central gene node to update its feature representation information. This procedure is applied in both subgraphs to finally obtain the feature representation update information corresponding to the gene pair to be predicted.

Optionally, the aggregation update module 32 is configured to obtain a relationship weight value between each gene node and its corresponding neighboring node based on the relationship feature representation information between each gene node and its corresponding neighboring node and the feature representation information of the central gene node; and based on the weight values of the relations and the characteristic representation information of the gene nodes, sequentially carrying out weight aggregation on all neighborhood relations of the gene nodes at each level in each subgraph according to the derivation direction of the central gene node, and sequentially updating the characteristic representation information of the gene nodes at each level in each subgraph step by step to obtain the characteristic representation updating information corresponding to the gene pair to be predicted.

Preferably, the aggregation updating module 32 is configured to obtain, based on a relation weight value algorithm, the relation weight value between each gene node and its corresponding neighbor node according to the relationship feature representation information between each gene node and its corresponding neighbor node and the feature representation information of the central gene node; wherein, in the relation weight value algorithm, for the relationship feature representation information $r_{e,e'}$ between any node $e$ and any of its neighbor nodes $e'$ in the subgraph of the central gene node, the relation weight value $\pi_{r_{e,e'}}^{e_j}$ is calculated as:

$$\pi_{r_{e,e'}}^{e_j} = g(e_j, r_{e,e'})$$

wherein $e_j$ denotes the feature representation information of the central gene node $e_j$, the function $g$ is an inner product function, and the relation weight value $\pi_{r_{e,e'}}^{e_j}$ represents the importance of the relationship feature representation information $r_{e,e'}$ to gene $e_j$.
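The relation weight described above is an inner product between a gene feature vector and a relation feature vector. The following is a minimal sketch of that computation, assuming features are plain Python lists of floats; the function names `inner_product` and `relation_weights` and the example vectors are hypothetical and are not taken from the patent.

```python
def inner_product(x, y):
    """Inner product g(x, y) of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(x, y))


def relation_weights(gene_vec, relation_vecs):
    """Return one weight g(e_j, r_{e,e'}) per relation vector.

    gene_vec      : feature vector of gene e_j (illustrative values)
    relation_vecs : feature vectors r_{e,e'} of the edges around node e
    """
    return [inner_product(gene_vec, r) for r in relation_vecs]


# Illustrative data only: a 3-dimensional gene vector and two relation vectors.
e_j = [1.0, 0.0, 2.0]
relations = [[0.5, 1.0, 0.0], [0.0, 3.0, 1.0]]
weights = relation_weights(e_j, relations)
print(weights)
```

A larger weight means the relation vector is more closely aligned with the gene vector, which is one way to read "importance of the relationship to the gene".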

Optionally, the aggregation updating module 32 is configured to sequentially calculate neighborhood relation feature representation information of neighborhood relations corresponding to each gene node according to a direction derived from the central gene node; the specific modes comprise: obtaining neighborhood relation characteristic representation information of neighborhood relation of each gene node according to relation weight values between each gene node and neighbor nodes corresponding to each gene node and characteristic representation information of each gene node; aggregating the neighborhood relation feature representation information corresponding to each gene node, and replacing the feature representation information of each gene node at each level in each subgraph with the aggregated feature representation information; and respectively taking the feature representation information obtained by aggregating the neighborhood relation feature representation information respectively corresponding to the central gene nodes as feature representation updating information corresponding to the gene pairs to be predicted.

Optionally, the aggregation update module 32 is configured to normalize the relation weight value $\pi_{r_{e,e'}}^{e_j}$ between each gene node $e$ and its corresponding neighbor node $e'$ to obtain a normalized relation weight value $\tilde{\pi}_{r_{e,e'}}^{e_j}$; to obtain, according to the normalized relation weight value $\tilde{\pi}_{r_{e,e'}}^{e_j}$ and the feature representation information $e'$ of each gene node $e'$, the neighborhood relation feature representation information $\tilde{\pi}_{r_{e,e'}}^{e_j} e'$ of the neighborhood relation of each gene node; and to aggregate, based on a weight aggregation formula, the neighborhood relation feature representation information corresponding to each gene node $e$, and replace the feature representation information $e$ of each gene node $e$ at each level in each subgraph with the aggregated feature representation information $e_p$;

wherein the weight aggregation formula comprises:

$$e_p = \sum_{e' \in P(e)} \tilde{\pi}_{r_{e,e'}}^{e_j} \, e'$$

wherein $P(e)$ represents all neighbor nodes of gene $e$; and

$$e[h+1] = \sigma(W(e[h] + e_p) + b) \qquad (4)$$

where $W$ and $b$ are training parameters of the network and the activation function $\sigma$ is a ReLU function. After $H$ layers of calculation, the feature representation update information $e_i[H]$ and $e_j[H]$ of the gene pair $e_i$ and $e_j$ to be predicted is obtained.
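The normalization, weighted aggregation and layer update described above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the text does not spell out the normalization, so a softmax is assumed here, and the function names and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Assumed normalization of the relation weights (not specified in the text)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def aggregate_neighbors(weights, neighbor_vecs):
    """Weighted sum e_p of neighbor feature vectors using normalized weights."""
    norm = softmax(weights)
    dim = len(neighbor_vecs[0])
    return [sum(w * v[d] for w, v in zip(norm, neighbor_vecs)) for d in range(dim)]


def update(e_h, e_p, W, b):
    """One layer update: e[h+1] = ReLU(W (e[h] + e_p) + b)."""
    summed = [x + y for x, y in zip(e_h, e_p)]
    out = []
    for row, bias in zip(W, b):
        z = sum(w * x for w, x in zip(row, summed)) + bias
        out.append(max(0.0, z))  # ReLU
    return out


# Illustrative data: two neighbors with equal raw weights.
e_p = aggregate_neighbors([0.0, 0.0], [[2.0, 0.0], [0.0, 4.0]])
W = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, chosen only for readability
b = [0.0, 0.0]
e_next = update([1.0, -5.0], e_p, W, b)
print(e_p, e_next)
```

With equal raw weights each neighbor contributes half of its vector, and the ReLU clips the negative second component to zero.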

Optionally, the prediction module 33 is configured to calculate, based on an inner product function $f(x, y)$ and according to the feature representation update information of the gene pair to be predicted, the probability value corresponding to the synthetic lethal relationship of the gene pair to be predicted; to obtain, based on the ReLU function $\sigma$ and a set judgment threshold $S$, a prediction result characteristic value according to the probability value; and to obtain, based on a prediction result judgment condition and according to the prediction result characteristic value, a prediction result indicating whether the gene pair to be predicted has a synthetic lethal relationship. Preferably, the set judgment threshold $S$ is 0.5: if the probability value is less than 0.5, the prediction result characteristic value is 0; if the probability value is greater than or equal to 0.5, the prediction result characteristic value is 1.

Optionally, the prediction result judgment condition includes: if the prediction result characteristic value is 0, a prediction result indicating that the gene pair to be predicted does not have a synthetic lethal relationship is obtained; if the prediction result characteristic value is 1, a prediction result indicating that the gene pair to be predicted has a synthetic lethal relationship is obtained.
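The scoring and thresholding step can be sketched as follows, assuming the final gene representations are plain lists of floats. The function name `predict_sl` and the example vectors are hypothetical. Note that a ReLU output is non-negative but not capped at 1; the sketch simply compares it with the threshold of 0.5, as the text describes.

```python
def relu(x):
    """ReLU activation applied to a scalar."""
    return max(0.0, x)


def predict_sl(e_i, e_j, threshold=0.5):
    """Return (score, label) for a gene pair.

    score : ReLU of the inner product of the two final representations
    label : 1 if score >= threshold (synthetic lethal), else 0
    """
    score = relu(sum(a * b for a, b in zip(e_i, e_j)))
    label = 1 if score >= threshold else 0
    return score, label


# Illustrative vectors only.
score_a, label_a = predict_sl([0.5, 1.0], [1.0, 0.25])   # inner product 0.75
score_b, label_b = predict_sl([1.0, 0.0], [-1.0, 2.0])   # inner product -1.0
print(score_a, label_a, score_b, label_b)
```

The first pair scores above the threshold and is labeled 1; the second has a negative inner product, which the ReLU maps to 0, so it is labeled 0.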

In order to better describe the synthetic lethal gene pair prediction system based on the knowledge graph, a specific embodiment is provided.

Example 1: a knowledge-graph-based synthetic lethal gene pair prediction model (KG4SL model). Fig. 4 is a schematic structural diagram of the KG4SL model.

Prediction with the model mainly consists of inputting the gene nodes to be predicted into the trained KG4SL model, which directly yields a prediction result of whether a synthetic lethal relationship exists in the gene pair to be predicted.

The knowledge graph construction and data preprocessing steps before model construction comprise:

In addition, SynLethDB includes a knowledge graph with 11 entities and 24 relations. The 24 relations include regulatory relations, interaction relations among genes, and co-expression relations among genes, and 16 of them are directly related to genes; among the 11 entities, 6 are directly related to genes. The entities and relations in the knowledge graph are presented in the form of (entity, relation, entity) triplets.
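As an illustration of the (entity, relation, entity) triplet form, the sketch below stores a few triplets and builds a neighbor lookup from them. All entity and relation names are placeholders and are not taken from SynLethDB; treating every edge as traversable in both directions is also an assumption.

```python
from collections import defaultdict

# Placeholder triplets in (head entity, relation, tail entity) form.
triples = [
    ("GENE_A", "interacts_with", "GENE_B"),
    ("GENE_A", "participates_in", "PATHWAY_X"),
    ("GENE_B", "co_expressed_with", "GENE_C"),
]


def build_adjacency(triples):
    """Map each entity to a list of (relation, neighbor) pairs."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
        adjacency[tail].append((relation, head))  # assumption: undirected use
    return adjacency


adjacency = build_adjacency(triples)
print(adjacency["GENE_A"])
```

A lookup of this kind is what a subgraph extraction step would walk when collecting the neighbors of a gene.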

Given the constructed gene pair matrix $S \in \{0, 1\}^{n \times n}$ and the knowledge graph $G = (V_e, V_r)$, where each edge in the knowledge graph is defined as a triplet $(h, r, t)$, our goal is to build a graph neural network model that learns a function (a probability calculation model), to use this function to calculate the probability that gene $e_i$ and gene $e_j$ have a synthetic lethal relationship, and to judge whether the synthetic lethal relationship exists according to the set threshold.

After the data are prepared, the KG4SL model is constructed. The framework of the knowledge-graph-based synthetic lethal gene pair prediction model (KG4SL model) is built from the following modules:

A gene-specific weighted subgraph module, for extracting, for the two genes $e_i$ and $e_j$, the subgraphs connected to each of the two genes. Because of limited computational resources, each subgraph includes only 2 layers of genes derived from the specific gene, and $k$ neighbors are randomly sampled for each gene;

An aggregation module, for selecting, for each SL gene pair, the gene nodes and edge relationships directly linked to the genes. Based on the assumption that biological information can flow between the nodes of the extracted subgraph, the neighbor node information of each gene node in the subgraph is aggregated to form the feature representation of the gene. Specifically, for the gene nodes in either subgraph, taking the subgraph with $e_i$ as the central node as an example, for any node $e$ on the subgraph, the feature representations of all neighbor nodes of the node are aggregated, and the feature representation of the node is updated with the aggregation result. The weight of each edge on the subgraph is calculated; the weight $\pi_{r_{e,e'}}^{e_j}$ of each edge $r_{e,e'}$ is calculated as:

$$\pi_{r_{e,e'}}^{e_j} = g(e_j, r_{e,e'})$$

where $e$ denotes any node in the subgraph centered on $e_i$, $e'$ denotes a neighbor node of node $e$, $r_{e,e'}$ denotes the feature representation of the relationship between $e$ and $e'$, $e_j$ denotes the feature representation of gene $e_j$, and the function $g$ takes the form of an inner product. The weight $\pi_{r_{e,e'}}^{e_j}$ represents the importance of the relationship $r_{e,e'}$ to gene $e_j$. The aggregation uses weighted summation, calculated as:

$$e_p = \sum_{e' \in P(e)} \tilde{\pi}_{r_{e,e'}}^{e_j} \, e'$$

where $\tilde{\pi}_{r_{e,e'}}^{e_j}$ denotes the normalized value of the edge weight $\pi_{r_{e,e'}}^{e_j}$, $P(e)$ denotes the neighbor nodes of gene $e$, and $e_p$ denotes the feature representation of node $e$ after aggregating its neighbor nodes. After the feature representations of the genes in each subgraph are obtained through the aggregation operation, a neural network model is trained to obtain the feature representations of the entities in the subgraph, $e[h+1] = \sigma(W(e[h] + e_p) + b)$, where $W$ and $b$ are training parameters of the network and the activation function $\sigma$ is a ReLU function; after $H$ layers of training, the final feature representations $e_i[H]$ and $e_j[H]$ of genes $e_i$ and $e_j$ are obtained;

A score computation module, for calculating the probability that a synthetic lethal relationship exists between the two genes by using the obtained feature representations of the two genes. Specifically, the activation function $\sigma$ is applied to the function $f$ of the two final feature representations, $\sigma(f(e_i[H], e_j[H]))$; here again a ReLU is used as the activation function $\sigma$, and an inner product function is used as the function $f$. The threshold is set to 0.5: when $f$ is less than 0.5, the predicted value is 0, indicating that there is no synthetic lethal relationship between the genes; when $f$ is greater than or equal to 0.5, the predicted value is 1, indicating that there is a synthetic lethal relationship between the genes.
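The subgraph extraction performed by the first module above (2 layers derived from the central gene, $k$ randomly sampled neighbors per gene) can be sketched as follows. This is a hedged illustration: the function name `sample_subgraph`, the toy graph, the fixed seed, and the choice to sample with replacement so that every node yields exactly $k$ neighbors are all assumptions, not details stated in the patent.

```python
import random


def sample_subgraph(adjacency, center, hops=2, k=2, seed=0):
    """Sample a fixed-size neighborhood around `center`.

    adjacency : dict mapping node -> list of (relation, neighbor) pairs
    Returns (levels, edges): levels[0] is [center], levels[h] holds the
    nodes sampled at hop h, and edges lists the sampled triplets.
    """
    rng = random.Random(seed)
    levels = [[center]]
    edges = []
    for _ in range(hops):
        next_level = []
        for node in levels[-1]:
            neighbors = adjacency.get(node, [])
            if not neighbors:
                continue
            # Assumption: sample with replacement so each node yields k picks.
            for relation, neighbor in (rng.choice(neighbors) for _ in range(k)):
                edges.append((node, relation, neighbor))
                next_level.append(neighbor)
        levels.append(next_level)
    return levels, edges


# Toy graph with placeholder names.
toy_graph = {
    "A": [("r1", "B"), ("r2", "C"), ("r1", "D")],
    "B": [("r1", "A")],
    "C": [("r2", "A")],
    "D": [("r1", "A")],
}
levels, edges = sample_subgraph(toy_graph, "A", hops=2, k=2)
print(levels)
```

Aggregation would then proceed from `levels[-1]` back toward `levels[0]`, so that the central gene is updated last.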

Comparison in specific experiments shows that the KG4SL model constructed by this method outperforms existing baseline models, for example the TransE and TransE+GCN models, on the AUPR and F1 metrics; the KG4SL model has a stronger ability to discriminate between SL and non-SL relationships; and the KG4SL model reveals the contribution of the knowledge graph extracted from the SynLethDB database to the SL gene pair prediction task. By adding a suitable knowledge graph to the GNN and taking into account the information stored in the knowledge graph about the biological mechanisms behind SL gene pairs, KG4SL overcomes the assumption in the prior art that each SL gene pair is an independent sample. After the knowledge graph information is aggregated, the performance of KG4SL on AUC, AUPR and F1 improves by 3.11%, 2.16% and 6.4%, respectively, compared with all baseline models.

FIG. 5 shows a schematic diagram of the structure of the synthetic lethal gene pair prediction terminal 50 based on a knowledge-graph in an embodiment of the present invention.

The synthetic lethal gene pair prediction terminal 50 based on the knowledge graph includes a memory 51 and a processor 52; the memory 51 is used for storing a computer program; the processor 52 runs the computer program to implement the knowledge-graph-based synthetic lethal gene pair prediction method shown in fig. 1.

Alternatively, the number of the memories 51 may be one or more, and the number of the processors 52 may be one or more, and one is taken as an example in fig. 5.

Optionally, the processor 52 in the knowledge-graph-based synthetic lethal gene pair prediction terminal 50 loads one or more instructions corresponding to the processes of an application program into the memory 51 according to the steps shown in fig. 1, and the processor 52 runs the application program stored in the memory 51, thereby implementing the various functions of the knowledge-graph-based synthetic lethal gene pair prediction method shown in fig. 1.

Optionally, the memory 51 may include, but is not limited to, high-speed random access memory and non-volatile memory, such as one or more disk storage devices, flash memory devices, or other non-volatile solid-state storage devices; the processor 52 may include, but is not limited to, a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Alternatively, the processor 52 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The invention also provides a computer readable storage medium storing a computer program which, when run, implements the synthetic lethal gene pair prediction method based on the knowledge-graph as shown in figure 1. The computer-readable storage medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disk-read only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read only memories), EEPROMs (electrically erasable programmable read only memories), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing machine-executable instructions. The computer readable storage medium may be an article of manufacture that is not accessed by a computer device or may be a component used by an accessed computer device.

In summary, the knowledge-graph-based synthetic lethal gene pair prediction method, system, terminal and medium obtain the highest prediction accuracy by extracting subgraphs from the knowledge graph and completing knowledge integration and feature extraction based on the extracted subgraphs. In particular, when representing gene features, the knowledge graph used contains factors related to SL gene pairs, such as biological processes, diseases and pathways, so that the common biological mechanisms behind SL gene pairs are fully considered, the prediction results have greater reference value, and the problems in the prior art are solved. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The above embodiments are merely illustrative of the principles of the present invention and its effectiveness, and are not intended to limit the invention. Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. It is therefore intended that all equivalent modifications and changes made by those skilled in the art without departing from the spirit and technical spirit of the present invention shall be covered by the appended claims.

Claims

1. A knowledge-graph-based synthetic lethal gene pair prediction method, characterized in that the method comprises:

extracting subgraphs which respectively take a gene pair to be predicted as a central gene node from a knowledge graph and are respectively derived from one or more stages of the central gene node; wherein the subgraph comprises: sub-graph feature representation information representing neighborhood relations of each gene node;

based on the subgraph characteristic representation information, respectively updating the characteristic representation information of each gene node at each level in each subgraph to obtain characteristic representation updating information corresponding to the gene pair to be predicted;

and calculating to obtain a probability value corresponding to the synthetic lethal relationship of the gene pair to be predicted according to the characteristic expression updating information of the gene pair to be predicted, and obtaining a prediction result corresponding to whether the synthetic lethal relationship exists in the gene pair to be predicted.

2. The knowledge-based synthetic lethal gene pair prediction method according to claim 1, wherein said sub-graph feature representation information includes: the characteristic representing information of each gene node at each level, the characteristic representing information of the neighboring node corresponding to each gene node, and the relationship characteristic representing information between each gene node and the neighboring node corresponding to each gene node.

3. The knowledge-graph-based synthetic lethal gene pair prediction method according to claim 2, wherein updating the feature representation information of each gene node at each level in each subgraph based on the subgraph feature representation information to obtain feature representation update information corresponding to the gene pair to be predicted comprises:

and based on the subgraph feature representation information, sequentially aggregating all neighborhood relations of all the gene nodes at each level in each subgraph according to the derivation direction of the central gene node, and sequentially updating the feature representation information of all the gene nodes at each level in each subgraph step by step to obtain feature representation updating information corresponding to the gene pair to be predicted.

4. The knowledge-graph-based synthetic lethal gene pair prediction method according to claim 3, wherein the means for sequentially aggregating all neighborhood relations of each gene node at each level in each subgraph in a direction derived from the central gene node based on the subgraph feature representation information, and sequentially updating feature representation information of each gene node at each level in each subgraph step by step to obtain feature representation update information corresponding to the gene pair to be predicted comprises:

Based on the relation characteristic representation information between each gene node and the corresponding neighbor node and the characteristic representation information of the central gene node, respectively obtaining the relation weight value between each gene node and the corresponding neighbor node;

and based on the weight values of the relations and the characteristic representation information of the gene nodes, sequentially carrying out weight aggregation on all neighborhood relations of the gene nodes at each level in each subgraph according to the derivation direction of the central gene node, and sequentially updating the characteristic representation information of the gene nodes at each level in each subgraph step by step to obtain the characteristic representation updating information corresponding to the gene pair to be predicted.

5. The knowledge-graph-based synthetic lethal gene pair prediction method according to claim 4, wherein the method for sequentially performing weight aggregation on all neighborhood relations of each gene node at each level in each subgraph according to the derivation direction of the central gene node based on each relation weight value and the feature representation information of each gene node, and sequentially updating the feature representation update information of each gene node at each level in each subgraph step by step to obtain the feature representation update information corresponding to the gene pair to be predicted comprises:

Obtaining neighborhood relation characteristic representation information of neighborhood relation of each gene node according to the relation weight value between each gene node and the neighbor node corresponding to each gene node and the characteristic representation information of each gene node in turn according to the derivation direction of the central gene node;

aggregating the neighborhood relation feature representation information corresponding to each gene node, and replacing the feature representation information of each gene node at each level in each subgraph with the aggregated feature representation information;

and respectively taking the feature representation information obtained by aggregating the neighborhood relation feature representation information respectively corresponding to the central gene nodes as feature representation updating information corresponding to the gene pairs to be predicted.

6. The knowledge-graph-based synthetic lethal gene pair prediction method according to claim 1, wherein the calculating a probability value corresponding to the synthetic lethal relationship of the gene pair to be predicted according to the characteristic representation update information of the gene pair to be predicted, and the obtaining a prediction result corresponding to whether the synthetic lethal relationship exists in the gene pair to be predicted comprises:

based on the inner product function, calculating to obtain a probability value corresponding to the synthetic lethal relationship of the gene pair to be predicted according to the characteristic representation updating information of the gene pair to be predicted;

Obtaining a prediction result characteristic value according to the probability value based on the ReLU function and a set judgment threshold value;

and obtaining a predicted result corresponding to whether the gene pair to be predicted has a synthetic lethal relationship or not according to the characteristic value of the predicted result based on the predicted result judging condition.

7. The knowledge-based synthetic lethal gene pair prediction method according to claim 6, wherein said prediction result-based judgment conditions include:

if the characteristic value of the predicted result is 0, obtaining a predicted result which corresponds to the gene pair to be predicted and does not have a synthetic lethal relationship;

if the characteristic value of the predicted result is 1, obtaining the predicted result of the synthetic lethal relationship of the corresponding gene pair to be predicted.

8. A knowledge-graph-based synthetic lethal gene pair prediction system, said system comprising:

the knowledge graph extraction module is used for extracting subgraphs which respectively take the gene pair to be predicted as central gene nodes and are derived from one or more stages of the central gene nodes from the knowledge graph; wherein the subgraph comprises: sub-graph feature representation information representing neighborhood relations of each gene node;

the aggregation updating module is connected with the knowledge graph extracting module and is used for respectively updating the characteristic representation information of each gene node at each level in each subgraph based on the subgraph characteristic representation information so as to obtain the characteristic representation updating information corresponding to the gene pair to be predicted;

And the prediction module is connected with the aggregation updating module and is used for calculating a probability value corresponding to the synthetic lethal relationship of the gene pair to be predicted according to the characteristic representation updating information of the gene pair to be predicted and obtaining a prediction result corresponding to whether the synthetic lethal relationship exists in the gene pair to be predicted.

9. A synthetic lethal gene pair prediction terminal based on a knowledge-graph, comprising:

a memory for storing a computer program;

a processor for performing the synthetic lethal gene pair prediction method based on a knowledge-graph according to any one of claims 1 to 7.

10. A computer storage medium, characterized in that a computer program is stored, which computer program, when run, implements the synthetic lethal gene pair prediction method based on a knowledge-graph according to any one of claims 1 to 7.