CN116564408A - Synthetic lethal gene pair prediction method, device, equipment and medium based on knowledge-graph reasoning - Google Patents
- Publication number
- CN116564408A CN202310486650.0A
- Authority
- CN
- China
- Prior art keywords
- graph
- gene
- knowledge
- synthetic lethal
- genes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application provides a synthetic lethal gene pair prediction method, device, equipment and medium based on knowledge-graph reasoning, wherein the method comprises the following steps: acquiring a synthetic lethal knowledge graph and known synthetic lethal gene pairs; combining the synthetic lethal knowledge graph with a synthetic lethal graph formed by the known synthetic lethal gene pairs to generate a corresponding heterogeneous graph, so as to construct, based on the heterogeneous graph and a preset starting gene, a prediction model for predicting a plurality of partner genes of the preset starting gene; and training and optimizing the prediction model based on a multi-class loss function. Without sampling neighbors, the invention makes full use of the structure of the knowledge graph (KG) to predict synthetic lethal (SL) relationships and to interpret the prediction process, and it defines the SL prediction problem as a partner-gene recommendation problem. Experiments show that the overall performance of the model, KR4SL, on NDCG, Precision and Recall is superior to all baseline models in three data-partitioning scenarios.
Description
Technical Field
The application relates to the technical field of bioinformatics, and in particular to a synthetic lethal gene pair prediction method, device, equipment and medium based on knowledge-graph reasoning.
Background
Many important gene interactions are involved in cancer, so identifying gene interactions is critical for discovering targets for anticancer drugs. Synthetic lethality (SL) is a gene interaction in which the inactivation of either gene alone does not affect the viability of the cell, whereas the simultaneous inactivation of both genes results in cell death. Synthetic lethal relationships between genes provide a promising strategy for cancer treatment: by targeting genes that are not essential in normal cells but are synthetically lethal with genes carrying cancer-specific alterations, cancer cells can be selectively killed without damaging normal cells. Some wet-laboratory techniques for large-scale SL screening have been developed, such as RNA interference and CRISPR. However, these techniques suffer from high cost, off-target effects, unsuccessful gene knockout, and similar problems. To address these problems and expedite SL-based drug target discovery, many bioinformatics approaches for SL prediction and analysis have been developed over the last decade.
Existing computational methods for SL prediction can be divided into three categories: statistical inference, network-based methods, and supervised machine learning methods. Statistical methods mine SL gene pairs based on predefined assumptions or rules. Network-based methods predict SL relationships by constructing a biological network and analyzing the topological features of genes in the network. Both types of methods have good interpretability, but the manual selection of assumptions or topological features is relatively subjective, and neither type can make use of known SL pairs. Most supervised machine learning approaches lack interpretability, so the mechanism behind a predicted SL relationship tends to remain unclear. Incorporating the prior knowledge contained in a knowledge graph (KG) into a supervised model may improve its interpretability. Existing KG-based methods generally sample neighbors at random and predict based on the similarity of node embeddings. As a result, they cannot identify the features in the KG that are truly important for prediction, that is, some important prior knowledge is ignored, so the structural information of the KG is not fully used and the model's predictions cannot be well explained.
Therefore, an interpretable KG-based prediction model is needed that makes full use of the semantic structure of the KG to perform SL prediction and to explain the prediction results. Methods based on knowledge-graph reasoning use the connectivity of paths between two nodes to infer the relationship between those nodes, and the important paths can serve as the explanation of the prediction. A relation path is a sequence composed of edge relations in the KG, and the directed graph composed of all possible relation paths between two nodes is called a relational directed graph.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present application is to provide a synthetic lethal gene pair prediction method, apparatus, device and medium based on knowledge-graph reasoning, which address the problem of how to build an interpretable KG-based prediction model that makes full use of the semantic structure of the KG to perform SL prediction and to explain the prediction results.
To achieve the above and other related objects, a first aspect of the present application provides a synthetic lethal gene pair prediction method based on knowledge-graph reasoning, including: acquiring a synthetic lethal knowledge graph and known synthetic lethal gene pairs; combining the synthetic lethal knowledge graph with a synthetic lethal graph formed by the known synthetic lethal gene pairs to generate a corresponding heterogeneous graph, so as to construct, based on the heterogeneous graph and a preset starting gene, a prediction model for predicting a plurality of partner genes of the preset starting gene; and training and optimizing the prediction model based on a multi-class loss function.
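The combination step can be pictured with a minimal sketch. It assumes the knowledge graph is available as (head, relation, tail) triples and that each known SL pair is added as a symmetric edge; the function name, the relation labels and the toy data are hypothetical and are not taken from the application.

```python
from collections import defaultdict

def build_heterogeneous_graph(kg_triples, sl_pairs, sl_relation="SL"):
    """Merge KG triples and known SL pairs into one adjacency map.

    kg_triples: iterable of (head, relation, tail) tuples.
    sl_pairs:   iterable of (gene_a, gene_b) tuples. Treating SL as a
                symmetric edge is an assumption of this sketch.
    Returns {head: set of (relation, tail)}.
    """
    graph = defaultdict(set)
    for head, relation, tail in kg_triples:
        graph[head].add((relation, tail))
    for gene_a, gene_b in sl_pairs:
        graph[gene_a].add((sl_relation, gene_b))
        graph[gene_b].add((sl_relation, gene_a))
    return dict(graph)

# Toy data; relation names are illustrative only.
kg = [("ATM", "participates_in", "DNA repair"),
      ("TP53", "participates_in", "DNA repair")]
sl = [("ATM", "TP53")]
hetero = build_heterogeneous_graph(kg, sl)
```

In this toy graph, `hetero["ATM"]` holds both a KG edge and an SL edge, which is what allows reasoning to move between the two edge sources.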
In some embodiments of the first aspect of the present application, the method further comprises: after the synthetic lethal knowledge graph and the known synthetic lethal gene pairs are extracted from the SynLethDB synthetic lethality database, selecting from the synthetic lethal knowledge graph a plurality of entity types and a plurality of edge relation types associated with gene regulation mechanisms; and expanding the edge relations based on a preset data set to obtain an expanded synthetic lethal knowledge graph.
In some embodiments of the first aspect of the present application, constructing, based on the heterogeneous graph and the preset starting gene, a prediction model for predicting a plurality of partner genes of the preset starting gene comprises: constructing, based on the synthetic lethal knowledge graph, a relational directed graph for all gene pairs whose starting gene is the preset starting gene; calculating, based on the relational directed graph, the semantic information representation propagated from the starting gene to each node of each layer in the knowledge graph; and, based on that representation, identifying candidate partner genes among the neighbor nodes of each layer, calculating the pairing likelihood between each candidate partner gene and the starting gene, and selecting the candidates with the highest pairing likelihood as partner genes of the starting gene.
In some embodiments of the first aspect of the present application, constructing, based on the synthetic lethal knowledge graph, a relational directed graph for all gene pairs whose starting gene is the preset starting gene comprises:
defining the starting gene $g_q$ and the heterogeneous graph $G$, together with the K-hop relational directed graph between the starting gene $g_q$ and a node in the heterogeneous graph; starting from the starting gene $g_q$, finding the 1-hop neighbors of $g_q$ in the heterogeneous graph $G$ and recording them as a subgraph; based on that subgraph, searching all neighbors of each neighbor node, and recursively searching in this way for K rounds to obtain a subgraph that is the union of the K-hop relational directed graphs between the starting gene $g_q$ and all nodes of the K-th layer; and, among all nodes reached in the K-th round, taking all gene nodes as candidate SL partner genes of the starting gene $g_q$.
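The recursive K-round search can be sketched as a layer-by-layer expansion. This is an illustration under stated assumptions, not the application's implementation: the adjacency-map layout, the `is_gene` predicate and the toy graph are hypothetical.

```python
def expand_k_hops(graph, start_gene, k, is_gene):
    """Expand from start_gene for k rounds, keeping every neighbor.

    graph:   {head: set of (relation, tail)} adjacency map (assumed layout).
    is_gene: predicate telling whether a node is a gene (assumed helper).
    Returns (layers, candidates), where layers[j] lists the edges
    (head, relation, tail) followed in round j + 1 and candidates are the
    gene nodes reached in round k.
    """
    frontier = {start_gene}
    layers = []
    for _ in range(k):
        # No sampling: every outgoing edge of every frontier node is kept.
        edges = [(head, relation, tail)
                 for head in sorted(frontier)
                 for relation, tail in sorted(graph.get(head, ()))]
        layers.append(edges)
        frontier = {tail for _, _, tail in edges}
    candidates = {node for node in frontier if is_gene(node)}
    return layers, candidates

# Toy graph; relation names are illustrative only.
toy_graph = {
    "ATM": {("participates_in", "DNA repair"), ("interacts_with", "CDK6")},
    "DNA repair": {("has_member", "TP53")},
    "CDK6": {("interacts_with", "TP53")},
}
genes = {"ATM", "TP53", "CDK6"}
layers, candidates = expand_k_hops(toy_graph, "ATM", 2, genes.__contains__)
```

Here two rounds starting from ATM reach TP53 along two different relation paths, so both paths belong to the 2-hop relational directed graph for that pair.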
In some embodiments of the first aspect of the present application, calculating, based on the relational directed graph, the semantic information representation propagated from the starting gene to each node of each layer in the knowledge graph comprises: constructing the relational directed graph of the current layer based on the relational directed graph of the previous layer in the knowledge graph, so as to propagate semantic information from the previous layer to the current layer; aggregating, through an attention mechanism, all messages propagated to the same target node; and optimizing, based on a gated recurrent unit (GRU), the sequence information of all edges from the previous layer to the current layer in the knowledge graph.
In some embodiments of the first aspect of the present application, constructing the relational directed graph of the current layer based on the relational directed graph of the previous layer in the knowledge graph, so as to propagate semantic information from the previous layer to the current layer, includes: defining the semantic information propagated from the starting gene $g_q$ to a target node $e_i$; for a triplet $(e_i, r_{io}, e_o)$ from step $(k-1)$ to step $k$, the semantic information propagated from node $e_i$ to node $e_o$ is: [formula not reproduced], wherein the terms of the formula are the embedded representation of $r_{io}$ at the $k$-th layer, the text representations $T_i$ and $T_o$ of $e_i$ and $e_o$ respectively, a learnable parameter, and the semantic information propagated from the starting gene $g_q$ to node $e_i$ at layer $(k-1)$.
In some embodiments of the first aspect of the present application, the aggregation, through the attention mechanism, of all messages propagated to the same target node is represented as: [formula not reproduced], wherein the terms of the formula are the K-hop relational directed graph between the starting gene $g_q$ and node $e_o$ in the heterogeneous graph, the attention coefficient for the triplet $(e_i, r_{io}, e_o)$, and further parameters that are all learnable.
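As a rough illustration of attention-based aggregation at one target node, the sketch below weights each incoming message by a normalised attention score and sums the results. The softmax normalisation and the idea of passing precomputed scores are assumptions of this sketch; in the application the attention coefficients are computed from learnable parameters by a formula that is not reproduced in this text.

```python
import math

def attention_aggregate(messages, scores):
    """Weighted sum of the message vectors arriving at one target node.

    messages: list of equal-length vectors, one per incoming triplet.
    scores:   one unnormalised attention score per message (assumed to be
              produced elsewhere by a learned scoring function).
    Returns (aggregated_vector, attention_weights).
    """
    top = max(scores)  # subtract the maximum for numerical stability
    exp_scores = [math.exp(s - top) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    dim = len(messages[0])
    aggregated = [sum(w * m[d] for w, m in zip(weights, messages))
                  for d in range(dim)]
    return aggregated, weights

# Two messages reaching the same node with equal scores get equal weight.
vec, att = attention_aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The returned weights are what makes the aggregation interpretable: a message with a larger weight contributed more to the target node's representation.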
In some embodiments of the first aspect of the present application, optimizing the sequence information from the previous layer to the current layer in the knowledge graph based on the gated recurrent unit includes using one GRU (Gated Recurrent Unit) to further strengthen the sequence information from step $(k-1)$ to step $k$, including:
[formulas not reproduced], wherein the weight terms of the formulas are all learnable parameters; the output term represents the semantic information representation propagated from $g_q$ through $k$ steps to $e_o$; the input term represents the semantic representation propagated from $g_q$ to $e_o$ before passing through the GRU; $r_k$ and $f_k$ represent the reset gate and the update gate respectively; and $n_k$ represents the updated value after the GRU.
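For orientation, one step of a textbook GRU can be written as below, using the same gate names (reset gate r, update gate f, updated value n). This is the standard formulation with biases omitted, not a reconstruction of the application's formulas; which quantities play the roles of input and previous state is an assumption, and the parameter names are hypothetical.

```python
import math

def _matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, params):
    """One standard GRU update (biases omitted for brevity).

    x:      representation at step k before the GRU (assumed input).
    h_prev: representation carried over from step k - 1 (assumed state).
    params: dict of weight matrices W_r, U_r, W_f, U_f, W_n, U_n.
    """
    # Reset gate: how much of the previous state feeds the candidate value.
    r = [_sigmoid(a + b) for a, b in
         zip(_matvec(params["W_r"], x), _matvec(params["U_r"], h_prev))]
    # Update gate: how much of the previous state is kept unchanged.
    f = [_sigmoid(a + b) for a, b in
         zip(_matvec(params["W_f"], x), _matvec(params["U_f"], h_prev))]
    # Candidate (updated) value.
    n = [math.tanh(a + ri * b) for a, ri, b in
         zip(_matvec(params["W_n"], x), r, _matvec(params["U_n"], h_prev))]
    return [(1.0 - fi) * ni + fi * hi for fi, ni, hi in zip(f, n, h_prev)]

# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new state is half of the previous state.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("W_r", "U_r", "W_f", "U_f", "W_n", "U_n")}
h_new = gru_step([1.0, 1.0], [2.0, -4.0], params)
```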
In some embodiments of the first aspect of the present application, the multi-class loss function is:
[formula not reproduced], wherein the terms of the formula are the set of all gene pairs involved in training, the set of all gene pairs that have $g_q$ as the starting gene, and the score of the gene pair $g_q$ and $g_p$ after exponential transformation.
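A common multi-class loss that matches this description (exponentiated scores normalised over all candidates sharing a starting gene) is softmax cross-entropy. The sketch below shows that standard loss for a single starting gene; treating it as the application's loss is an assumption, and the function name and toy scores are hypothetical.

```python
import math

def multiclass_log_loss(scores, positive_genes):
    """Softmax cross-entropy over the candidate partners of one starting gene.

    scores:         {candidate_gene: raw score of the pair (g_q, candidate)}.
    positive_genes: known SL partners of g_q among the candidates.
    Returns the summed negative log-likelihood of the known partners.
    """
    top = max(scores.values())  # log-sum-exp with the maximum factored out
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return sum(log_norm - scores[gene] for gene in positive_genes)

# Two equally scored candidates, one of which is a known partner.
loss_even = multiclass_log_loss({"TP53": 0.0, "CDK6": 0.0}, ["TP53"])
# Raising the known partner's score lowers the loss.
loss_better = multiclass_log_loss({"TP53": 2.0, "CDK6": 0.0}, ["TP53"])
```

Summing this quantity over every starting gene in the training set would give a training objective of the kind described.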
To achieve the above and other related objects, a second aspect of the present application provides a synthetic lethal gene pair prediction apparatus based on knowledge-graph reasoning, comprising: a data acquisition module for acquiring a synthetic lethal knowledge graph and known synthetic lethal gene pairs; a model construction module for combining the synthetic lethal knowledge graph with the synthetic lethal graph formed by the known synthetic lethal gene pairs to generate a corresponding heterogeneous graph, so as to construct, based on the heterogeneous graph and a preset starting gene, a prediction model for predicting a plurality of partner genes of the preset starting gene; and a model training module for training and optimizing the prediction model based on a multi-class loss function.
To achieve the above and other related objects, a third aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the synthetic lethal gene pair prediction method based on knowledge-graph inference.
To achieve the above and other related objects, a fourth aspect of the present application provides a computer apparatus, comprising: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the computer equipment executes the synthetic lethal gene pair prediction method based on knowledge graph reasoning.
As described above, the synthetic lethal gene pair prediction method, device, equipment and medium based on knowledge-graph reasoning have the following beneficial effects: without sampling neighbors, the invention makes full use of the KG structure to predict SL relationships and explains the prediction process. The SL prediction problem is defined as a partner-gene recommendation problem: given one starting gene of an SL gene pair, the model scores all candidate genes, and the top-scoring genes are selected as the predicted partner genes. Experiments show that the overall performance of KR4SL on NDCG, Precision and Recall is superior to all baseline models in three data-partitioning scenarios.
Drawings
FIG. 1 is a schematic flow chart of a synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a prediction model for obtaining a plurality of partner genes for predicting the preset starting genes based on the iso-graph and the preset starting genes in an embodiment of the present application.
FIG. 3A is a schematic diagram of the heterogeneous graph construction during an experiment in an embodiment of the present application.
Fig. 3B is a schematic diagram showing the structure of the semantic information encoder (Semantic information encoder) during experiments in an embodiment of the present application.
Fig. 3C is a schematic diagram of a decoder (scanning decoder) during an experiment in an embodiment of the present application.
FIG. 3D is a schematic representation of the construction of a synthetic lethal gene pair from node ATM and node TP53 in one embodiment of the present application.
Fig. 4A shows the performance on the three types of metrics in the transductive scenario according to an embodiment of the present application.
Fig. 4B shows the performance on the three types of metrics in the inductive scenario according to an embodiment of the present application.
FIG. 5 is a schematic diagram showing the structure of a synthetic lethal gene pair prediction device according to one embodiment of the present application.
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.
It is noted that in the following description, reference is made to the accompanying drawings, which describe several embodiments of the present application. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present application. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present application is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. Spatially relative terms, such as "upper," "lower," "left," "right," and the like, may be used herein to facilitate a description of one element or feature as illustrated in the figures as being related to another element or feature.
In this application, unless specifically stated and limited otherwise, the terms "mounted," "connected," "secured," "held," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art as the case may be.
Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, operations, elements, components, items, categories, and/or groups, but do not preclude the presence, occurrence or addition of one or more other features, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition will occur only when a combination of elements, functions or operations is in some way inherently mutually exclusive.
In order to solve the problems described in the background, the invention provides a synthetic lethal gene pair prediction method, system, terminal and medium based on knowledge-graph reasoning, and aims to make full use of the structure of the KG to predict SL relationships and explain the prediction process without sampling neighbors. KR4SL defines the SL prediction problem as a partner-gene recommendation problem: given the starting gene of an SL gene pair, the model scores all candidate genes, and the top-scoring genes are selected as the predicted partner genes. Experiments show that KR4SL outperforms all baseline models on metrics such as NDCG, Precision and Recall in three data-partitioning scenarios.
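The recommendation framing and the three metrics can be illustrated with a small sketch that ranks candidates by score and evaluates the top k. The metric definitions are the standard binary-relevance ones; the scores and the set of true partners below are made-up toy values, not experimental data or claims about real SL relationships.

```python
import math

def precision_recall_ndcg_at_k(ranked_genes, true_partners, k):
    """Standard Precision@k, Recall@k and NDCG@k with binary relevance."""
    hits = [1 if gene in true_partners else 0 for gene in ranked_genes[:k]]
    precision = sum(hits) / k
    recall = sum(hits) / len(true_partners) if true_partners else 0.0
    # Discounted cumulative gain: a hit at rank i (0-based) counts 1/log2(i+2).
    dcg = sum(hit / math.log2(i + 2) for i, hit in enumerate(hits))
    ideal_hits = min(k, len(true_partners))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

# Toy scores for one starting gene; values are illustrative only.
toy_scores = {"TP53": 0.9, "CDK6": 0.2, "BRCA1": 0.7, "ABL1": 0.1}
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
precision, recall, ndcg = precision_recall_ndcg_at_k(ranked, {"TP53", "CDK6"}, k=2)
```

With these toy values the top two recommendations contain one of the two true partners, so Precision@2 and Recall@2 are both 0.5, and NDCG@2 is below 1 because the second true partner is ranked outside the top two.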
In short, the invention can efficiently construct relational directed graphs for multiple gene pairs, reason over those graphs, predict potential SL partner genes, and explain the predictions. Specifically: first, for multiple gene pairs that share the same starting gene, the model constructs the relational directed graphs for those pairs simultaneously, without randomly sampling neighbors, and reasons over those graphs starting from the starting gene. Second, in the reasoning process of each layer, the structural information of the relational directed graph and the text semantic information of the entities in the graph are combined as the semantic information to be propagated, and this semantic information is further enhanced by learning the sequence information of the relation paths in the relational directed graph. Finally, information is aggregated with an attention mechanism, and after model training is finished the paths with high weights are selected as the explanation.
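One way to picture the final step is to score each relation path by the product of its edge attention weights and keep the best one. Scoring a path by the product of its weights is an assumption of this sketch, as are the function name, the relation labels and the toy weights.

```python
def best_explanation_path(layers, edge_weight, start, target):
    """Return (weight, path) for the highest-weight path from start to target.

    layers:      layers[j] lists the (head, relation, tail) edges of hop j + 1.
    edge_weight: {(head, relation, tail): non-negative attention weight}.
    Path weight is the product of its edge weights (assumed scoring rule).
    Returns None when target is not reached in the last layer.
    """
    best = {start: (1.0, [])}  # node -> (best path weight, edges on that path)
    for edges in layers:
        reached = {}
        for head, relation, tail in edges:
            if head not in best:
                continue
            weight = best[head][0] * edge_weight[(head, relation, tail)]
            if tail not in reached or weight > reached[tail][0]:
                reached[tail] = (weight, best[head][1] + [(head, relation, tail)])
        best = reached
    return best.get(target)

# Toy two-hop graph from ATM to TP53; weights are illustrative only.
layers = [
    [("ATM", "interacts_with", "CDK6"), ("ATM", "participates_in", "DNA repair")],
    [("CDK6", "interacts_with", "TP53"), ("DNA repair", "has_member", "TP53")],
]
edge_weight = {
    layers[0][0]: 0.2, layers[0][1]: 0.8,
    layers[1][0]: 0.5, layers[1][1]: 0.5,
}
path_weight, path = best_explanation_path(layers, edge_weight, "ATM", "TP53")
```

In the toy example the path through the DNA repair node wins, so it would be reported as the explanation for the predicted pair.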
In order to make the objects, technical solutions and advantages of the present invention more apparent, further detailed description of the technical solutions in the embodiments of the present invention will be given by the following examples with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Before the present invention is explained in further detail, the terms involved in the embodiments of the present invention are explained; the following explanations apply to these terms:
<1> Synthetic lethality (SL): the phenomenon in which the simultaneous inactivation of two individually non-lethal genes leads to cell death. In other words, the cell remains viable when either gene A or gene B alone is mutated, but dies when both genes are mutated simultaneously.
<2> Wet laboratory (Wet-Lab): traditional experiments based on laboratory reagents; the contrasting concept, dry laboratory (Dry-Lab), refers to computer-based simulation experiments.
The embodiment of the invention provides a synthetic lethal gene pair prediction method based on knowledge-graph reasoning, a system implementing this method, and a storage medium storing an executable program for realizing this method. With respect to the implementation of the method, the embodiment of the present invention describes an exemplary implementation scenario of synthetic lethal gene pair prediction based on knowledge-graph reasoning.
FIG. 1 shows a schematic flow chart of the synthetic lethal gene pair prediction method based on knowledge-graph reasoning in the embodiment of the invention. The method in this embodiment mainly comprises the following steps:
Step S1: Obtain a synthetic lethal knowledge graph and known synthetic lethal gene pairs.
In the embodiment of the invention, the synthetic lethal knowledge graph and the known synthetic lethal gene pairs are extracted from the SynLethDB synthetic lethal database. SynLethDB is a database dedicated to synthetic lethality and is widely used as gold-standard data; SynLethDB 2.0 contains a total of 50,868 SL gene pairs across 5 species, together with a knowledge graph (SynLethKG) built for the SL gene pairs. This example uses 35,374 gene pairs involving 9,746 genes as label data, with SynLethKG containing 11 relationships and 27 entities.
More preferably, after the synthetic lethal knowledge graph and the known synthetic lethal gene pairs are extracted from the SynLethDB synthetic lethal database, several kinds of entities and several kinds of edge relations associated with gene regulation mechanisms are selected from the synthetic lethal knowledge graph, and the edge relations are expanded based on a preset data set to obtain an expanded synthetic lethal knowledge graph.
For example, 3 kinds of entities and 4 kinds of edge relations associated with gene regulation mechanisms can be selected from the synthetic lethal knowledge graph (SynLethKG), and the 4 kinds of edge relations can be expanded to 32 kinds using the OntoProtein data set, finally yielding a knowledge graph with 42,547 nodes of 3 types and 381,761 edges of 32 types.
It is understood that the 3 kinds of entities are genes, gene ontology terms and pathways. Gene regulation can be divided into 4 levels: level-1 regulation is typified by negative feedback and is governed by substrate or product concentration; level-2 regulation is chain regulation, governed by signal molecules; level-3 regulation is one-to-many regulation, typified by transcription factors, in which one node regulates dozens to hundreds of targets; level-4 regulation is program regulation, a time-dependent regulation at the genome level that controls the transcriptome by changing the expression group.
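The entity selection described above can be sketched as a simple filter over triples. This is only an illustration under stated assumptions: the function name `select_entities`, the type labels, the relation names and the node `compound_X` are hypothetical and do not come from SynLethKG itself.

```python
def select_entities(triples, node_type, keep_types):
    """Keep only triples whose head and tail both belong to the retained entity types."""
    return [
        (head, rel, tail)
        for head, rel, tail in triples
        if node_type[head] in keep_types and node_type[tail] in keep_types
    ]


# Toy data: gene and term names follow the figure description in this document;
# "compound_X" and the relation names are placeholders for illustration only.
node_type = {
    "TP53": "gene",
    "ATM": "gene",
    "DNA repair": "gene_ontology",
    "compound_X": "compound",
}
triples = [
    ("ATM", "involved_in", "DNA repair"),
    ("TP53", "involved_in", "DNA repair"),
    ("compound_X", "related_to", "ATM"),
]

# Retain only the three entity kinds named in the text.
kept = select_entities(triples, node_type, {"gene", "gene_ontology", "pathway"})
```

Filtering on both endpoints keeps the retained graph closed over the three entity kinds, so no edge points at a node that was dropped.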
In some examples, the method further includes extracting a text representation of each node in the knowledge graph based on a pre-trained language model. Specifically, the text description of each node in the knowledge graph is used as input, and CORR, a BERT-based language model pre-trained on a biomedical corpus, extracts a text representation for each node so as to enrich its semantic information.
Step S2: Combine the synthetic lethal knowledge graph with the synthetic lethal graph formed by the known synthetic lethal gene pairs to generate a corresponding heterogeneous graph, and construct, based on the heterogeneous graph and a preset starting gene, a prediction model for predicting a plurality of partner genes of the preset starting gene.
It should be noted that a heterogeneous graph may contain different types of nodes and edges, each type having its own independent ID space and features. A heterogeneous graph is typically composed of a series of sub-graphs, one for each relation, where each relation is defined by a string triplet (source node type, edge type, target node type).
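As a minimal sketch of this representation, the plain-Python snippet below groups edges into one sub-graph per (source node type, edge type, target node type) triplet. The gene and term names come from the figure description in this document; the relation names and the helper `build_heterograph` are assumptions made for illustration.

```python
from collections import defaultdict

# Each edge: (source type, edge type, target type, source node, target node).
# Relation names are illustrative, not the ones used in SynLethKG.
edges = [
    ("gene", "SL", "gene", "TP53", "BRCA1"),
    ("gene", "SL", "gene", "TP53", "ABL1"),
    ("gene", "involved_in", "biological_process", "BRCA1", "DNA repair"),
    ("gene", "involved_in", "biological_process", "ATM", "DNA repair"),
]


def build_heterograph(edge_list):
    """Group edges into one sub-graph per (source type, edge type, target type)."""
    graph = defaultdict(list)
    for src_type, rel, dst_type, src, dst in edge_list:
        graph[(src_type, rel, dst_type)].append((src, dst))
    return dict(graph)


hetero = build_heterograph(edges)
```

Here the known SL pairs and the knowledge-graph edges end up in the same structure, which is the sense in which the two graphs are combined.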
In some examples, the prediction model for predicting a plurality of partner genes of the preset starting gene is constructed based on the heterogeneous graph and the preset starting gene; the process is shown in fig. 2:
Step S21: Based on the synthetic lethal knowledge graph, construct a relational directed graph for all gene pairs whose starting gene is the preset starting gene.
Specifically, define a starting gene $g_q$ and a heterogeneous graph $G$, and let $G^{K}_{g_q,e_o}$ denote the K-hop relational directed graph between the starting gene $g_q$ and a node $e_o$ in the heterogeneous graph. Starting from $g_q$, the first-hop neighbors of $g_q$ are found in $G$ and recorded as the sub-graph $G^{1}_{g_q}$. Based on this sub-graph, all neighbors of each neighbor node are searched, and after K rounds of recursive search the sub-graph $G^{K}_{g_q}$ is obtained. The sub-graph $G^{K}_{g_q}$ is the union of the K-hop relational directed graphs between the starting gene $g_q$ and all nodes $e_o$ of the K-th layer. Among all nodes reached in the K-th round, all gene nodes are taken as SL candidate partner genes of the starting gene $g_q$.
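The layer-by-layer search described above can be sketched as a breadth-first expansion that keeps every outgoing edge, so no neighbor sampling takes place. The function name, the toy triples and the relation names below are assumptions for illustration; the gene names follow the figure description.

```python
from collections import defaultdict


def build_relational_digraph(triples, start, num_hops):
    """Expand from `start` for `num_hops` rounds, keeping all edges (no sampling).

    Returns (layers, frontier): layers[k] holds the (head, relation, tail)
    triples linking layer k to layer k + 1, and frontier is the set of nodes
    reached in the final round.
    """
    out_edges = defaultdict(list)
    for head, rel, tail in triples:
        out_edges[head].append((head, rel, tail))

    layers = []
    frontier = {start}
    for _ in range(num_hops):
        hop_edges = [edge for node in sorted(frontier) for edge in out_edges[node]]
        layers.append(hop_edges)
        frontier = {tail for _, _, tail in hop_edges}
    return layers, frontier


# Toy graph; relation names are illustrative only.
node_type = {"TP53": "gene", "BRCA1": "gene", "ABL1": "gene",
             "ATM": "gene", "DNA repair": "gene_ontology"}
triples = [
    ("TP53", "SL", "BRCA1"),
    ("TP53", "SL", "ABL1"),
    ("BRCA1", "involved_in", "DNA repair"),
    ("ABL1", "involved_in", "DNA repair"),
    ("DNA repair", "has_participant", "ATM"),
]

layers, frontier = build_relational_digraph(triples, "TP53", 3)
# Gene nodes reached in the last round are the candidate partner genes.
candidates = [node for node in sorted(frontier) if node_type[node] == "gene"]
```

Because all gene pairs with the same starting gene share these layers, one expansion serves every candidate partner of that gene.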
Step S22: and calculating semantic information representation transmitted from the initial genes to each node of each layer in the knowledge graph based on the relation directed graph. The specific process is as follows:
step S22a: and constructing a relation directed graph of the current layer based on the relation directed graph of the upper layer in the knowledge graph so as to propagate semantic information from the upper layer to the current layer.
In a directed graph based on the relation of the (K-1) th stepTo construct a relation directed graph of the K th step +.>The following description is given for the sake of example: definitions->From the initial gene g q Propagated to target node e i Semantic information of (2); for a triplet (e) from step (K-1) to step (K) i ,r io ,e o ) Slave node e i Propagated to node e o The semantic information of (2) is:
wherein ,is r io Embedded representation at k-th layer, T i and To E is respectively i and eo Text representation of->Is a parameter which can be learned, +.>Represents the gene g from the start q To the (K-1) layer node e i Semantic information of (a).
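The text above names the inputs of the edge message but not how they are combined. The sketch below assumes one plausible form: a learnable weight matrix applied to the concatenation of the previous-layer representation, the relation embedding and the two text representations. The function names, this particular combination and all numeric values are assumptions, not the formula of the invention.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def edge_message(h_prev, rel_emb, text_i, text_o, weight):
    """Message sent along (e_i, r_io, e_o), under the assumed concatenation form."""
    features = h_prev + rel_emb + text_i + text_o  # list concatenation
    return matvec(weight, features)


# Toy dimensions: 2-d hidden state and relation embedding, 1-d text features.
h_prev = [1.0, 0.0]   # representation of e_i at layer k-1
rel_emb = [0.5, 0.5]  # embedding of relation r_io at layer k
text_i = [1.0]        # text representation T_i
text_o = [2.0]        # text representation T_o
weight = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
]

message = edge_message(h_prev, rel_emb, text_i, text_o, weight)
```

The point of the sketch is only that structural inputs (the hidden state and the relation embedding) and textual inputs (the two text representations) enter the same message.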
Step S22b: all messages propagated to the same target node are aggregated by the attention mechanism.
Specifically, aggregating all messages propagated to the same target node through the attention mechanism is represented as:
wherein ,is the initiation gene g q And node e in the heterograph o A K-hop relationship directed graph of (2); />Is for triplet (e i ,r io ,e o ) Attention coefficient of (a), i.e.)> and />Are all learnable parameters.
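A minimal sketch of attention-weighted aggregation follows. It assumes that each incoming message is scored by a dot product with a learnable vector and that the scores are normalized with a softmax per target node; the source does not state the exact attention function, so both choices, as well as the function name, are assumptions.

```python
import math
from collections import defaultdict


def attentive_aggregate(messages, attn_vector):
    """Aggregate messages per target node with softmax attention weights.

    messages: list of (target_node, message_vector) pairs.
    Returns (aggregated, weights), both keyed by target node.
    """
    by_target = defaultdict(list)
    for target, msg in messages:
        by_target[target].append(msg)

    aggregated, weights = {}, {}
    for target, msgs in by_target.items():
        scores = [sum(a * x for a, x in zip(attn_vector, m)) for m in msgs]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        weights[target] = alphas
        dim = len(msgs[0])
        aggregated[target] = [
            sum(alpha * m[d] for alpha, m in zip(alphas, msgs)) for d in range(dim)
        ]
    return aggregated, weights


messages = [
    ("DNA repair", [1.0, 0.0]),
    ("DNA repair", [0.0, 1.0]),
    ("cell cycle", [2.0, 2.0]),
]
aggregated, weights = attentive_aggregate(messages, [0.0, 0.0])
```

After training, the edges with the largest attention weights can be read off from `weights`, which corresponds to selecting high-weight paths as the explanation.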
Step S22c: sequence information from the upper layer to all sides of the current layer in the knowledge-graph is optimized based on the gating loop unit.
Specifically, one GRU (Gated Recurrent Unit) gating loop was used to further enhance the sequence information from (K-1) to all sides of the K-th step, including:
wherein ,are all learnable parameters; />Represents the ratio g q Propagation through k steps to e o Semantic information representation of (2); />Representing the flow from g before passing through the GRU q Propagation to e o Semantic representation of (2); />Are all learnable parameters, r k 、f k Respectively representing a reset gate and an update gate, n k Representing the updated value after the GRU.
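The gate names $r^{k}$, $f^{k}$ and $n^{k}$ above match the reset gate, update gate and candidate state of a standard GRU. As a hedged sketch, assume the GRU input $x^{k}$ is the aggregated representation before the GRU and the hidden state $h^{k-1}$ is the representation carried over from the previous step. These roles and the parameter names $W_\ast$, $U_\ast$, $b_\ast$ are assumptions, not taken from the source. Under those assumptions the standard GRU update reads:

```latex
\begin{aligned}
r^{k} &= \sigma\left(W_r x^{k} + U_r h^{k-1} + b_r\right) \\
f^{k} &= \sigma\left(W_f x^{k} + U_f h^{k-1} + b_f\right) \\
n^{k} &= \tanh\left(W_n x^{k} + r^{k} \odot \left(U_n h^{k-1}\right) + b_n\right) \\
h^{k} &= \left(1 - f^{k}\right) \odot n^{k} + f^{k} \odot h^{k-1}
\end{aligned}
```

The update gate $f^{k}$ interpolates between the new candidate $n^{k}$ and the previous state, which is how information about the order of relations along a path can be retained across layers.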
Step S23: based on semantic information representation transmitted from a starting gene to each node of each layer in a knowledge graph, candidate partner genes in each layer of neighbor nodes are calculated, pairing possibility between each candidate partner gene and the starting gene is calculated, and a plurality of candidate partner genes with high pairing possibility are selected as partner genes of the starting gene.
After the message passing through the K layers, all gene nodes in the K-layer neighbor nodes can be selected as candidate partner genes. For example, node g p G is g q A gene node in the K-th layer neighbor node of (2), then g p That is g q Can be g through a full junction layer p Calculating a final score:
wherein ,Wff and bff Are all learnable parameters; this score reflects g p Become g q The higher the probability of a partner gene, the higher the probability of a partner gene. After the scores of all candidate genes are arranged in descending order, the first N are selected as g q Partner gene.
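The scoring and ranking step can be sketched as follows, assuming the fully connected layer is a single affine map from the final representation of each candidate to a scalar. The function names and all numeric values are made up for illustration; only the parameter names `w_ff` and `b_ff` follow the text.

```python
def score_candidates(representations, w_ff, b_ff):
    """Affine scoring: score = w_ff . h + b_ff for every candidate gene."""
    return {
        gene: sum(w * x for w, x in zip(w_ff, vec)) + b_ff
        for gene, vec in representations.items()
    }


def top_n_partners(scores, n):
    """Sort candidates by descending score (ties broken by name) and keep the top n."""
    return sorted(scores, key=lambda gene: (-scores[gene], gene))[:n]


# Toy final-layer representations for three candidate genes.
representations = {
    "ATM": [1.0, 2.0],
    "CDK6": [0.5, 0.0],
    "CDK1": [0.0, 1.0],
}
scores = score_candidates(representations, w_ff=[1.0, 1.0], b_ff=0.1)
partners = top_n_partners(scores, 2)
```

Ranking all candidates for one starting gene at once is what makes the task a recommendation problem rather than a pairwise classification.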
Step S3: the predictive model is trained and optimized based on a multi-class loss function.
The multi-class loss function in the embodiment of the invention is expressed as follows:
wherein ,is all gene pairs involved in training, +.>Are all in g q A gene pair that is a starting gene;represents g after exponential transformation q and gp This scores the genes.
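One common loss built from exactly these ingredients is the multi-class log-loss (softmax cross-entropy) taken over all candidate genes of the same starting gene. The sketch below assumes that form; the exact expression used by the invention is not recoverable from the text, and the function name and toy scores are hypothetical.

```python
import math


def multiclass_log_loss(scores_by_query, positives_by_query):
    """Sum over known pairs of: log(sum_g exp(s(q, g))) - s(q, p)."""
    total = 0.0
    for query, scores in scores_by_query.items():
        top = max(scores.values())
        # Log of the summed exponentiated scores, computed stably.
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
        for partner in positives_by_query.get(query, ()):
            total += log_norm - scores[partner]
    return total


# Two candidates with equal scores: the known partner gets probability 1/2.
scores = {"TP53": {"ATM": 0.0, "CDK6": 0.0}}
positives = {"TP53": ["ATM"]}
loss = multiclass_log_loss(scores, positives)
```

Minimizing this quantity raises the score of each known partner relative to all other candidates of the same starting gene.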
To help those skilled in the art further understand the technical features and technical effects of the present invention, the invention is explained in more detail below with reference to the experimental procedure and experimental results.
FIG. 3A shows the heterogeneous graph generated during the experiment by combining the synthetic lethal graph formed by known synthetic lethal gene pairs (Known SL graph) with the synthetic lethal knowledge graph (KG). The node "DNA damage response" refers to the DNA damage response, one of the basic physiological mechanisms of an organism, which serves to protect the organism's genome. The node "DNA repair" refers to DNA repair, the response of a cell after its DNA has been damaged; the node BRCA1 is a gene directly related to hereditary breast cancer; the node "cell cycle" refers to the cell cycle, the whole process that a cell undergoes from the completion of one division to the end of the next division; the node "apoptotic process" refers to the process of apoptosis; the node ABL1 is a proto-oncogene; the node CDK6 is cell division protein kinase 6; the node ATM is the ataxia telangiectasia mutated gene; the node CDK1 is cyclin-dependent kinase 1; the node TP53 is a tumor suppressor gene.
Fig. 3B shows a schematic diagram of the structure of the semantic information encoder (Semantic information encoder) during the experiment. Starting from the starting gene, several layers of neighbor nodes are searched recursively in the heterogeneous graph, and the gene nodes among the nodes of the last layer are taken as candidate partner genes. In going from step k-1 to step k, the semantic information propagated on each edge is first calculated from the structural information of the heterogeneous graph and the text information of the entities in the KG; attentive message aggregation (Attentive Aggregation) is then performed over the triplets that share the same target node; finally, the sequence information is strengthened through a GRU to obtain the semantic information representation of the k-th layer.
Fig. 3C shows a schematic diagram of the decoder (scanning decoder) during the experiment. For each candidate gene node of the K-th layer, a fully connected (FF) layer is used to obtain the final score. After these scores are ranked in descending order, the top N are selected as partner genes.
Fig. 3D shows the final explanation, taking DNA repair as an example: node ATM is predicted as a new partner gene of node TP53 because both ATM and the known SL partner genes of TP53 (ABL1 and BRCA1) are involved in the same biological process (i.e., DNA repair).
The experimental scenarios were set as follows: to evaluate the performance of the model, two experimental scenarios were set up.
Transductive scenario: given the known SL graph and the synthetic lethal knowledge graph KG, unknown SL gene pairs (or SL relationships) are inferred. In this case, the data set is split by gene pairs, and genes in the test set may also appear in the training set.
Inductive scenario: none of the genes used for testing are seen during training. In this case, the data set is split by genes: the gene set of the training set and the gene set of the test set are disjoint, and the gene sets contained in the heterogeneous graph used for training and in the heterogeneous graph used for testing are likewise disjoint. This setting further examines the generalization ability of the model.
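The difference between the two splits can be sketched as follows. This is an illustrative implementation under stated assumptions: the function names, the `test_ratio` parameter and the decision to drop pairs that straddle the train and test gene sets are choices made here, not details given in the source.

```python
import random


def transductive_split(pairs, test_ratio, seed=0):
    """Split by gene pairs; a gene may appear on both sides."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def inductive_split(pairs, test_ratio, seed=0):
    """Split by genes; train and test gene sets are disjoint."""
    rng = random.Random(seed)
    genes = sorted({gene for pair in pairs for gene in pair})
    rng.shuffle(genes)
    cut = int(len(genes) * (1 - test_ratio))
    train_genes, test_genes = set(genes[:cut]), set(genes[cut:])
    # Pairs with one gene on each side are dropped (an assumption of this sketch).
    train = [p for p in pairs if p[0] in train_genes and p[1] in train_genes]
    test = [p for p in pairs if p[0] in test_genes and p[1] in test_genes]
    return train, test


pairs = [("TP53", "BRCA1"), ("TP53", "ABL1"), ("ATM", "CDK6"),
         ("CDK1", "CDK6"), ("ATM", "CDK1")]
trans_train, trans_test = transductive_split(pairs, 0.4)
ind_train, ind_test = inductive_split(pairs, 0.4)
```

In the inductive split every test gene is unseen during training, so the model cannot rely on representations memorized for specific genes.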
The experimental comparison results are shown in fig. 4A and 4B: the synthetic lethal gene pair prediction method based on knowledge-graph reasoning provided by the embodiment of the invention (KR4SL for short) outperforms the existing baseline models on three metrics (NDCG@N, Precision@N and Recall@N, with N = 10, 20, 50) in both scenarios, especially in the inductive scenario. Each value in the table is obtained from five training runs, the best result in each column is shown in bold, and "-" indicates that the value is 0.
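For reference, the three ranking metrics can be computed as in the sketch below, which uses the usual binary-relevance definitions. The function names and the toy ranking are illustrative; the source does not show its evaluation code.

```python
import math


def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n predictions that are true partner genes."""
    return sum(1 for gene in ranked[:n] if gene in relevant) / n


def recall_at_n(ranked, relevant, n):
    """Fraction of the true partner genes recovered in the top n."""
    if not relevant:
        return 0.0
    return sum(1 for gene in ranked[:n] if gene in relevant) / len(relevant)


def ndcg_at_n(ranked, relevant, n):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, gene in enumerate(ranked[:n]) if gene in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(n, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


ranked = ["ATM", "CDK6", "BRCA1", "CDK1"]  # predicted order for one starting gene
relevant = {"ATM", "BRCA1"}                # held-out true partner genes
p2 = precision_at_n(ranked, relevant, 2)
r2 = recall_at_n(ranked, relevant, 2)
ndcg2 = ndcg_at_n(ranked, relevant, 2)
```

NDCG additionally rewards placing true partners near the top of the list, which Precision and Recall at a fixed cutoff do not capture.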
The implementation process and principle of the synthetic lethal gene pair prediction method based on knowledge-graph reasoning have been explained in detail above. The synthetic lethal gene pair prediction apparatus, device and medium based on knowledge-graph reasoning are further described below.
Fig. 5 shows a schematic structural diagram of the synthetic lethal gene pair prediction device based on knowledge-graph reasoning in the embodiment of the invention. The synthetic lethal gene pair prediction apparatus 500 according to the embodiment of the present invention includes: a data acquisition module 501, a model construction module 502 and a model training module 503.
The data acquisition module 501 is configured to acquire a synthetic lethal knowledge graph and known synthetic lethal gene pairs. The model construction module 502 is configured to combine the synthetic lethal knowledge graph with the synthetic lethal graph formed by the known synthetic lethal gene pairs to generate a corresponding heterogeneous graph, and to construct, based on the heterogeneous graph and a preset starting gene, a prediction model for predicting a plurality of partner genes of the preset starting gene. The model training module 503 is configured to train and optimize the prediction model based on a multi-class loss function.
In some examples, after extracting the synthetic lethal knowledge graph and the known synthetic lethal gene pairs from the SynLethDB synthetic lethal database, the data acquisition module 501 selects several kinds of entities and several kinds of edge relations associated with gene regulation mechanisms from the synthetic lethal knowledge graph, and expands the edge relations based on a preset data set to obtain an expanded synthetic lethal knowledge graph.
In some examples, the process by which the model construction module 502 constructs the prediction model for predicting a plurality of partner genes of the preset starting gene, based on the heterogeneous graph and the preset starting gene, specifically includes: constructing, based on the synthetic lethal knowledge graph, a relational directed graph for all gene pairs whose starting gene is the preset starting gene; calculating, based on the relational directed graph, the semantic information representation propagated from the starting gene to each node of each layer in the knowledge graph; and, based on that semantic information representation, determining the candidate partner genes among the neighbor nodes, calculating the pairing likelihood between each candidate partner gene and the starting gene, and selecting the candidate partner genes with the highest pairing likelihood as the partner genes of the starting gene.
In some examples, constructing, based on the synthetic lethal knowledge graph, the relational directed graph for all gene pairs whose starting gene is the preset starting gene includes: defining a starting gene $g_q$ and a heterogeneous graph $G$, and letting $G^{K}_{g_q,e_o}$ denote the K-hop relational directed graph between the starting gene $g_q$ and a node $e_o$ in the heterogeneous graph; starting from $g_q$, finding the first-hop neighbors of $g_q$ in $G$ and recording them as the sub-graph $G^{1}_{g_q}$; based on this sub-graph, searching all neighbors of each neighbor node, so that after K rounds of recursive search the sub-graph $G^{K}_{g_q}$ is obtained. The sub-graph $G^{K}_{g_q}$ is the union of the K-hop relational directed graphs between the starting gene $g_q$ and all nodes $e_o$ of the K-th layer. Among all nodes reached in the K-th round, all gene nodes are taken as SL candidate partner genes of the starting gene $g_q$.
In some examples, calculating, based on the relational directed graph, the semantic information representation propagated from the starting gene to each node of each layer in the knowledge graph includes: constructing the relational directed graph of the current layer based on the relational directed graph of the previous layer in the knowledge graph, so as to propagate semantic information from the previous layer to the current layer; aggregating all messages propagated to the same target node through an attention mechanism; and refining, based on a gated recurrent unit, the sequence information of all edges from the previous layer to the current layer in the knowledge graph.
In some examples, constructing the relational directed graph of the current layer based on the relational directed graph of the previous layer in the knowledge graph, so as to propagate semantic information from the previous layer to the current layer, includes: defining $h^{k}_{g_q:e_i}$ as the semantic information propagated from the starting gene $g_q$ to a target node $e_i$; for a triplet $(e_i, r_{io}, e_o)$ from step (k-1) to step k, the semantic information propagated from node $e_i$ to node $e_o$ is computed from the following quantities: $h^{k}_{r_{io}}$, the embedded representation of $r_{io}$ at the k-th layer; $T_i$ and $T_o$, the text representations of $e_i$ and $e_o$ respectively; and a learnable parameter $W^{k}$.
In some examples, all messages propagated to the same target node $e_o$ are aggregated through the attention mechanism over the triplets of $G^{K}_{g_q,e_o}$, the K-hop relational directed graph between the starting gene $g_q$ and node $e_o$ in the heterogeneous graph. Each triplet $(e_i, r_{io}, e_o)$ is weighted by its attention coefficient $\alpha^{k}_{(e_i, r_{io}, e_o)}$, and the parameters used to compute the attention coefficient are all learnable.
In some examples, refining, based on the gated recurrent unit, the sequence information of all edges from the previous layer to the current layer in the knowledge graph includes using one gated recurrent unit (GRU) to further strengthen the sequence information of all edges from step (k-1) to step k, where the parameters of the GRU are all learnable and $h^{k}_{g_q:e_o}$ represents the semantic information propagated from $g_q$ to $e_o$ through k steps.
In some examples, the multi-class loss function used by the model training module 503 is defined over the following quantities: the set of all gene pairs involved in training; for each starting gene $g_q$, the set of all gene pairs that have $g_q$ as the starting gene; and the exponentially transformed score of each gene pair $(g_q, g_p)$.
It should be noted that: the synthetic lethal gene pair prediction device based on knowledge-graph inference provided in the above embodiment only illustrates the division of each program module when performing the synthetic lethal gene pair prediction based on knowledge-graph inference, and in practical application, the process allocation may be completed by different program modules according to needs, that is, the internal structure of the device is divided into different program modules, so as to complete all or part of the processes described above. In addition, the synthetic lethal gene pair prediction device based on knowledge-graph reasoning provided in the above embodiment and the synthetic lethal gene pair prediction method based on knowledge-graph reasoning belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not described herein.
The synthetic lethal gene pair prediction method based on knowledge-graph reasoning provided by the embodiment of the invention may be implemented on the terminal side or the server side. Referring to fig. 5, which shows an optional hardware structure of the synthetic lethal gene pair prediction terminal 500 based on knowledge-graph reasoning provided by the embodiment of the invention, the terminal 500 may be a mobile phone, a computer device, a tablet device, a personal digital processing device, a factory background processing device, or the like. The synthetic lethal gene pair prediction terminal 500 based on knowledge-graph reasoning includes: at least one processor 501, a memory 502, at least one network interface 504, and a user interface 506. The various components in the device are coupled together by a bus system 505. It is understood that the bus system 505 is used to enable connection and communication between these components. In addition to a data bus, the bus system 505 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system in fig. 5.
The user interface 506 may include, among other things, a display, keyboard, mouse, trackball, click gun, keys, buttons, touch pad, or touch screen, etc.
It is to be appreciated that the memory 502 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be, among others, a read-only memory (ROM) or a programmable read-only memory (PROM). The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM) and synchronous static random access memory (SSRAM). The memory described by embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 502 in the embodiment of the present invention is used to store various kinds of data to support the operation of the synthetic lethal gene pair prediction terminal 500 based on knowledge-graph reasoning. Examples of such data include any executable program for operating on the terminal 500, such as an operating system 5021 and an application 5022. The operating system 5021 contains various system programs, such as a framework layer, a core library layer and a driver layer, for implementing various basic services and processing hardware-based tasks. The application 5022 may include various application programs, such as a media player (MediaPlayer) and a browser (Browser), for implementing various application services. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning provided by the embodiment of the invention can be contained in the application 5022.
The method disclosed in the above embodiment of the present invention may be applied to the processor 501 or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or by instructions in software in the processor 501. The processor 501 may be a general-purpose processor, a digital signal processor (DSP), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like. The processor 501 may implement or perform the methods, steps and logic blocks disclosed in embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the synthetic lethal gene pair prediction method provided by the embodiment of the invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium; the storage medium is located in the memory, and the processor reads information from the memory and completes the steps of the method in combination with its hardware.
In an exemplary embodiment, the synthetic lethal gene pair prediction terminal 500 based on knowledge-graph reasoning may be implemented by one or more application specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs) or complex programmable logic devices (CPLDs) to perform the aforementioned methods.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the method embodiments described above may be performed by hardware related to a computer program. The aforementioned computer program may be stored in a computer readable storage medium. The program, when executed, performs the steps of the method embodiments described above; and the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.
In the embodiments provided herein, the computer-readable storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, USB flash drive, removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable and data storage media do not include connections, carrier waves, signals, or other transitory media, but are intended to be directed to non-transitory, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In summary, the present application provides a synthetic lethal gene pair prediction method, device, terminal and medium based on knowledge-graph reasoning, which make full use of the structure of the KG to predict SL relationships and explain the prediction process without sampling neighbors. The SL prediction problem is formulated as a partner-gene recommendation problem: given one starting gene of an SL gene pair, the model scores all candidate genes, and the top-scoring genes are selected as the predicted partner genes. Experiments show that KR4SL outperforms all baseline models on NDCG, Precision and Recall under three data-splitting scenarios. Therefore, the application effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles of the present application and their effectiveness, and are not intended to limit the application. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications and variations which may be accomplished by persons skilled in the art without departing from the spirit and technical spirit of the disclosure be covered by the claims of this application.
Claims (12)
1. A synthetic lethal gene pair prediction method based on knowledge-graph reasoning is characterized by comprising the following steps:
acquiring a synthetic lethal knowledge graph and a known synthetic lethal gene pair;
combining the synthetic lethal knowledge graph with a synthetic lethal graph formed by the known synthetic lethal gene pairs to generate a corresponding heterogeneous graph, so as to construct, based on the heterogeneous graph and a preset starting gene, a prediction model for predicting a plurality of partner genes of the preset starting gene;
the predictive model is trained and optimized based on a multi-class loss function.
2. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to claim 1, wherein the method further comprises: after the synthetic lethal knowledge graph and the known synthetic lethal gene pairs are extracted from the SynLethDB synthetic lethal database, selecting several kinds of entities and several kinds of edge relations associated with gene regulation mechanisms from the synthetic lethal knowledge graph; and expanding the edge relations based on a preset data set to obtain an expanded synthetic lethal knowledge graph.
3. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to claim 1, wherein constructing, based on the heterogeneous graph and the preset starting gene, the prediction model for predicting a plurality of partner genes of the preset starting gene comprises:
constructing, based on the synthetic lethal knowledge graph, a relational directed graph for all gene pairs whose starting gene is the preset starting gene;
calculating, based on the relational directed graph, the semantic information representation propagated from the starting gene to each node of each layer in the knowledge graph;
based on the semantic information representation propagated from the starting gene to each node of each layer in the knowledge graph, determining the candidate partner genes among the neighbor nodes, calculating the pairing likelihood between each candidate partner gene and the starting gene, and selecting the candidate partner genes with the highest pairing likelihood as the partner genes of the starting gene.
4. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to claim 3, wherein constructing, based on the synthetic lethal knowledge graph, the relational directed graph for all gene pairs whose starting gene is the preset starting gene comprises:
defining a starting gene $g_q$ and a heterogeneous graph $G$, and letting $G^{K}_{g_q,e_o}$ denote the K-hop relational directed graph between the starting gene $g_q$ and a node $e_o$ in the heterogeneous graph; starting from $g_q$, finding the first-hop neighbors of $g_q$ in $G$ and recording them as the sub-graph $G^{1}_{g_q}$; based on this sub-graph, searching all neighbors of each neighbor node, so that after K rounds of recursive search the sub-graph $G^{K}_{g_q}$ is obtained; the sub-graph $G^{K}_{g_q}$ being the union of the K-hop relational directed graphs between the starting gene $g_q$ and all nodes $e_o$ of the K-th layer; and, among all nodes reached in the K-th round, taking all gene nodes as SL candidate partner genes of the starting gene $g_q$.
5. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to claim 3, wherein calculating, based on the relational directed graph, the semantic information representation propagated from the starting gene to each node of each layer in the knowledge graph comprises:
constructing the relational directed graph of the current layer based on the relational directed graph of the previous layer in the knowledge graph, so as to propagate semantic information from the previous layer to the current layer;
aggregating all messages propagated to the same target node through an attention mechanism;
optimizing, by means of a gated recurrent unit, the sequence information of all edges from the previous layer to the current layer in the knowledge graph.
6. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to claim 5, wherein constructing the relational directed graph of the current layer based on the relational directed graph of the previous layer in the knowledge graph, so as to propagate semantic information from the previous layer to the current layer, comprises: defining [symbol not reproduced] as the semantic information propagated from the starting gene g_q to the target node e_i; for a triplet (e_i, r_io, e_o) from step (K-1) to step (K), the semantic information propagated from node e_i to node e_o is: [formula not reproduced in the source text], wherein [symbol not reproduced] is the embedded representation of r_io at the k-th layer; T_i and T_o are the text representations of e_i and e_o, respectively; [symbol not reproduced] is a learnable parameter; and [symbol not reproduced] represents the semantic information propagated from the starting gene g_q to the (K-1)-th-layer node e_i.
7. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to claim 5, wherein aggregating all messages propagated to the same target node through an attention mechanism is expressed as: [formula not reproduced in the source text]
wherein [symbol not reproduced] is the K-hop relational directed graph between the starting gene g_q and node e_o in the heterogeneous graph; [symbol not reproduced] is the attention coefficient for the triplet (e_i, r_io, e_o); and [symbols not reproduced] are all learnable parameters.
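Because the formula of claim 7 is not reproduced in this text, the following Python sketch only illustrates the general idea of attention-based aggregation: each message arriving at the same target node receives a weight, and the weighted messages are summed. The dot-product scoring and softmax normalization used here are assumptions chosen for brevity, not the patent's formula, and `attention_aggregate` and `attn_vector` are hypothetical names.

```python
import math


def attention_aggregate(messages, attn_vector):
    """Aggregate messages sent to one target node with softmax attention.

    messages    : list of equal-length vectors, one per incoming triplet
    attn_vector : stand-in for the learnable attention parameters (assumed form)
    Returns the attention coefficients and the aggregated vector.
    """
    # One scalar score per incoming message (assumed dot-product scoring).
    scores = [sum(a * m for a, m in zip(attn_vector, msg)) for msg in messages]
    # Numerically stable softmax over the incoming messages.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the messages, dimension by dimension.
    dim = len(messages[0])
    aggregated = [
        sum(w * msg[d] for w, msg in zip(weights, messages)) for d in range(dim)
    ]
    return weights, aggregated


weights, aggregated = attention_aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(weights, aggregated)
```

The returned coefficients indicate which incoming edges contributed most to a node's representation.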
8. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to claim 5, wherein optimizing, by means of a gated recurrent unit, the sequence information of all edges from the previous layer to the current layer in the knowledge graph comprises using a GRU (Gated Recurrent Unit) to further strengthen the sequence information of the edges from step (K-1) to step (K), expressed as: [formula not reproduced in the source text]
wherein [symbols not reproduced] are all learnable parameters; [symbol not reproduced] denotes the semantic information representation propagated from g_q to e_o through k steps; [symbol not reproduced] denotes the semantic representation propagated from g_q to e_o before passing through the GRU; [symbols not reproduced] are all learnable parameters; r_k and f_k denote the reset gate and the update gate, respectively; and n_k denotes the updated value after the GRU.
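Claim 8 names a reset gate r_k, an update gate f_k and an updated value n_k, but its equations are not reproduced in this text. As a hedged illustration, the sketch below implements one step of a standard GRU cell in pure Python using those gate names. How the patent maps its semantic representations onto the input `x` and the previous state `h_prev`, and the parameter names `W_*` and `U_*`, are assumptions.

```python
import math


def _matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def _add(a, b):
    return [x + y for x, y in zip(a, b)]


def _sigmoid(vec):
    return [1.0 / (1.0 + math.exp(-x)) for x in vec]


def gru_step(x, h_prev, p):
    """One step of a standard GRU cell (bias terms omitted for brevity).

    x      : input vector (assumed: the representation before the GRU)
    h_prev : previous state (assumed: the representation from the prior step)
    p      : dict of weight matrices W_r, U_r, W_f, U_f, W_n, U_n
    """
    # Reset gate r and update gate f.
    r = _sigmoid(_add(_matvec(p["W_r"], x), _matvec(p["U_r"], h_prev)))
    f = _sigmoid(_add(_matvec(p["W_f"], x), _matvec(p["U_f"], h_prev)))
    # Candidate value n: the reset gate scales the contribution of h_prev.
    gated = [ri * ui for ri, ui in zip(r, _matvec(p["U_n"], h_prev))]
    n = [math.tanh(v) for v in _add(_matvec(p["W_n"], x), gated)]
    # The update gate interpolates between the candidate and the old state.
    return [(1.0 - fi) * ni + fi * hi for fi, ni, hi in zip(f, n, h_prev)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("W_r", "U_r", "W_f", "U_f", "W_n", "U_n")}
print(gru_step([1.0, 1.0], [2.0, -4.0], params))
```

With all weights set to zero, both gates equal 0.5 and the candidate is zero, so the step simply halves the previous state, which makes a convenient sanity check.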
9. The synthetic lethal gene pair prediction method based on knowledge-graph reasoning according to claim 1, wherein the multi-class loss function is: [formula not reproduced in the source text]
wherein [symbol not reproduced] is the set of all gene pairs involved in training; [symbol not reproduced] is the set of all gene pairs having g_q as the starting gene; and [symbol not reproduced] denotes the score of the gene pair g_q and g_p after exponential transformation.
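The loss formula of claim 9 is not reproduced in this text. Since the claim describes a multi-class loss over exponentially transformed pair scores, one plausible reading is a softmax cross-entropy taken over all candidate partners of each starting gene. The sketch below implements that reading as an assumption; `multiclass_loss` and the layout of `scores` are hypothetical.

```python
import math


def multiclass_loss(scores, positive_pairs):
    """Softmax cross-entropy summed over the training gene pairs.

    scores         : {start_gene: {candidate_gene: raw score}}
    positive_pairs : list of (start_gene, partner_gene) pairs used for training
    """
    loss = 0.0
    for g_q, g_p in positive_pairs:
        row = scores[g_q]
        # log of the sum of exponentiated scores, computed stably.
        top = max(row.values())
        log_norm = top + math.log(sum(math.exp(s - top) for s in row.values()))
        # Negative log-probability of the known partner among all candidates.
        loss += log_norm - row[g_p]
    return loss


toy_scores = {"gA": {"gB": 0.0, "gC": 0.0}}
print(multiclass_loss(toy_scores, [("gA", "gB")]))
```

Under this reading, raising the score of a known partner relative to the other candidates of the same starting gene lowers the loss.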
10. A synthetic lethal gene pair prediction device based on knowledge-graph reasoning, comprising:
a data acquisition module, configured to acquire a synthetic lethal knowledge graph and known synthetic lethal gene pairs;
a model construction module, configured to combine the synthetic lethal knowledge graph with the synthetic lethal graph formed by the known synthetic lethal gene pairs to generate a corresponding heterogeneous graph, so as to construct, based on the heterogeneous graph and a preset starting gene, a prediction model for predicting a plurality of partner genes of the preset starting gene;
a model training module, configured to train and optimize the prediction model based on a multi-class loss function.
11. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the synthetic lethal gene pair prediction method based on knowledge-graph inference of any one of claims 1 to 9.
12. A computer device, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, so that the computer device performs the synthetic lethal gene pair prediction method based on knowledge-graph inference according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310486650.0A CN116564408B (en) | 2023-04-28 | 2023-04-28 | Synthetic lethal gene pair prediction method, device, equipment and medium based on knowledge-graph reasoning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116564408A (en) | 2023-08-08 |
CN116564408B (en) | 2024-03-01 |
Family
ID=87487277
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202310486650.0A | Active | CN116564408B (en) | 2023-04-28 | 2023-04-28 | Synthetic lethal gene pair prediction method, device, equipment and medium based on knowledge-graph reasoning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116564408B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117079712A (en) * | 2023-08-30 | 2023-11-17 | Agricultural Information Institute, Chinese Academy of Agricultural Sciences | Method, device, equipment and medium for mining biosynthesis gene clusters |
CN117116355A (en) * | 2023-08-30 | 2023-11-24 | Agricultural Information Institute, Chinese Academy of Agricultural Sciences | Method, device, equipment and medium for mining excellent pleiotropic genes |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112288091A (en) * | 2020-10-30 | 2021-01-29 | Southwest Institute of Electronic Technology (10th Research Institute of China Electronics Technology Group Corporation) | Knowledge inference method based on multi-mode knowledge graph |
US20210174906A1 (en) * | 2019-12-06 | 2021-06-10 | Accenture Global Solutions Limited | Systems And Methods For Prioritizing The Selection Of Targeted Genes Associated With Diseases For Drug Discovery Based On Human Data |
CN113010691A (en) * | 2021-03-30 | 2021-06-22 | University of Electronic Science and Technology of China | Knowledge graph inference relation prediction method based on graph neural network |
CN113626612A (en) * | 2021-08-13 | 2021-11-09 | 4Paradigm (Beijing) Technology Co., Ltd. | Prediction method and system based on knowledge graph reasoning |
EP3913543A2 (en) * | 2020-12-21 | 2021-11-24 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for training multivariate relationship generation model, electronic device and medium |
CN113987203A (en) * | 2021-10-27 | 2022-01-28 | Hunan University | Knowledge graph reasoning method and system based on affine transformation and bias modeling |
CN114595344A (en) * | 2022-05-09 | 2022-06-07 | Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences | Crop variety management-oriented knowledge graph construction method and device |
US20220207343A1 (en) * | 2020-12-22 | 2022-06-30 | International Business Machines Corporation | Entity disambiguation using graph neural networks |
CN114969369A (en) * | 2022-05-30 | 2022-08-30 | Dalian Minzu University | Knowledge graph human cancer lethal prediction method based on mixed network and knowledge graph construction method |
CN115240777A (en) * | 2022-08-10 | 2022-10-25 | ShanghaiTech University | Synthetic lethal gene prediction method, device, terminal and medium based on graph neural network |
CN115240778A (en) * | 2022-08-10 | 2022-10-25 | ShanghaiTech University | Synthetic lethal gene partner recommendation method, device, terminal and medium based on contrastive learning |
WO2022222037A1 (en) * | 2021-04-20 | 2022-10-27 | Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences | Interpretable recommendation method based on graph neural network inference |
WO2023065545A1 (en) * | 2021-10-19 | 2023-04-27 | Ping An Technology (Shenzhen) Co., Ltd. | Risk prediction method and apparatus, and device and storage medium |
Non-Patent Citations (3)
Title |
---|
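Mincai Lai et al.: "Predicting Synthetic Lethality in Human Cancers via Multi-Graph Ensemble Neural Network", 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) * |
Yang Ruida; Lin Xin; Yang Yan; He Liang; Dou Liang: "Research on knowledge graph reasoning technology based on hybrid-augmented intelligence", Computer Applications and Software, no. 06 * |
Chen Dehua; Yin Suna; Le Jiajin; Wang Mei; Pan Qiao; Zhu Lifeng: "A link prediction model for temporal knowledge graphs in the clinical domain", Journal of Computer Research and Development, no. 12 * |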
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117079712A (en) * | 2023-08-30 | 2023-11-17 | Agricultural Information Institute, Chinese Academy of Agricultural Sciences | Method, device, equipment and medium for mining biosynthesis gene clusters |
CN117116355A (en) * | 2023-08-30 | 2023-11-24 | Agricultural Information Institute, Chinese Academy of Agricultural Sciences | Method, device, equipment and medium for mining excellent pleiotropic genes |
CN117116355B (en) * | 2023-08-30 | 2024-02-20 | Agricultural Information Institute, Chinese Academy of Agricultural Sciences | Method, device, equipment and medium for mining excellent pleiotropic genes |
CN117079712B (en) * | 2023-08-30 | 2024-02-20 | Agricultural Information Institute, Chinese Academy of Agricultural Sciences | Method, device, equipment and medium for mining pathway gene clusters |
Also Published As
Publication number | Publication date |
---|---|
CN116564408B (en) | 2024-03-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||