CN117079712B - Method, device, equipment and medium for excavating pathway gene cluster

Info

Publication number: CN117079712B
Application number: CN202311109387.XA
Authority: CN
Inventors: 张丹丹; 赵瑞雪; 鲜国建; 寇远涛
Original assignee: Agricultural Information Institute of CAAS
Current assignee: Agricultural Information Institute of CAAS
Priority date: 2023-08-30
Filing date: 2023-08-30
Publication date: 2024-02-20
Anticipated expiration: 2043-08-30
Also published as: CN117079712A

Abstract

The invention relates to the technical field of pathway gene cluster mining, and discloses a method, device, equipment and medium for mining pathway gene clusters. The method comprises the following steps: selecting entity classes and the corresponding data attributes and object attributes to construct a trait regulatory gene ontology model; constructing triplets among the entities in the trait regulatory gene ontology model to generate a trait regulatory gene knowledge graph; constructing a protein interaction prediction model; acquiring an interaction protein connected subgraph of the protein to be mined based on the trait regulatory gene knowledge graph, supplementing the interaction protein connected subgraph with the interaction relations predicted by the protein interaction prediction model, and mining an interaction protein complete subgraph from it; and mining, based on the interaction protein complete subgraph, the entity nodes commonly connected to the proteins and to the corresponding genes to obtain a gene structure diagram, and obtaining a pathway gene cluster through physical position judgment. The invention can improve the accuracy of interaction protein prediction and successfully mine pathway gene clusters.

Description

Method, device, equipment and medium for excavating pathway gene cluster

Technical Field

The invention relates to the technical field of excavation of pathway gene clusters, in particular to an excavation method, device, equipment and medium of pathway gene clusters.

Background

In life science research, pathway gene clusters are an important type of gene set that exists in the genomes of many organisms and plays important roles in metabolism and regulation, so analyzing molecular regulation mechanisms through pathway gene clusters is important. In terms of physical location, the genes in a pathway gene cluster are typically adjacent to each other in the genome; in terms of gene function, they jointly regulate the same pathway to produce specific small-molecule compounds. That is, a pathway gene cluster is a group of genes that together encode a pathway and occur contiguously within a certain range (tens to hundreds of kb) of the genome of a plant, a bacterium or the like. However, at the sequence level, a pathway gene cluster contains many genes whose sequences differ greatly, so it is difficult to mine new types of pathway gene clusters by sequence homology. Existing methods only use multi-type protein interaction models to mine genes within a single pathway, or mine genes of the same family based on predictions of structural and functional similarity; their results have low accuracy, and they cannot mine complex pathway gene cluster structures.

Disclosure of Invention

In view of the above, the present invention provides a method, apparatus, device and medium for excavating a pathway gene cluster, so as to solve the problem that pathway gene cluster excavation cannot be performed.

In a first aspect, the present invention provides a method for mining a pathway gene cluster, the method comprising:

selecting entity classes according to the purpose of pathway gene cluster mining, determining the data attributes of the different entities in the entity classes and the object attributes among the entities, and constructing a trait regulatory gene ontology model based on the entity classes, the data attributes and the object attributes;

based on the trait regulatory gene ontology model, extracting the entities and the relations among them from a multi-source database to construct triplets representing the relations among different entities, and carrying out multi-source knowledge association fusion on the triplets to generate a trait regulatory gene knowledge graph;

selecting interacting protein pairs and non-interacting protein pairs in a preset proportion from the trait regulatory gene knowledge graph as a data set, and performing model training based on the data set to construct a protein interaction prediction model;

searching the trait regulatory gene knowledge graph to obtain an interaction protein connected subgraph of the protein to be mined, inputting any protein pair in the interaction protein connected subgraph into the protein interaction prediction model to predict its interaction relation, supplementing the interaction protein connected subgraph according to the prediction result, and mining an interaction protein complete subgraph from the supplemented interaction protein connected subgraph;

and mining from the trait regulatory gene knowledge graph, based on the interaction protein complete subgraph, the entity nodes commonly connected to all proteins in the interaction protein complete subgraph and to the corresponding genes to obtain a gene structure diagram, and then obtaining a pathway gene cluster by judging whether the genes in the gene structure diagram lie within a preset physical position interval.

According to the method for mining a pathway gene cluster provided by the embodiment of the invention, a trait regulatory gene ontology model is constructed and populated to generate a trait regulatory gene knowledge graph. On this basis, known interacting protein pairs are retrieved to construct an interaction protein connected subgraph, further interacting protein pairs are predicted with the protein interaction prediction model, and the connected subgraph is supplemented with the prediction results to obtain an interaction protein complete subgraph. The pathway gene cluster is then mined from the interaction protein complete subgraph, so that the accuracy of interaction protein prediction is improved and the pathway gene cluster is successfully mined.

In an alternative embodiment, the entity class includes: proteins, genes, traits, signaling pathways, gene symbols, protein families, domains, subcellular localization, cellular components, molecular functions, biological processes, metabolic pathways, and enzymes, and with proteins, genes, and traits as central entities; the data attribute is the characteristic of the corresponding entity, and the object attribute is the relation between different entities.

According to the invention, entity classes are selected and the trait regulatory gene ontology model is constructed from the data attributes and object attributes of those classes; this logical model describes the relations between entities at an abstract level, and the model framework makes it possible to sort out the value ranges of the entities and of their core attributes.

In an alternative embodiment, the multi-source database comprises a literature database and domain science databases.

According to the invention, the data layer is built by organizing and associating multidimensional scientific data from the literature database and the domain science databases, so that the latest discipline knowledge in the field is integrated into systematized discipline knowledge, which addresses the difficulty of mining pathway gene clusters.

In an alternative embodiment, the process of extracting the entities and the relations among them from the multi-source database, based on the trait regulatory gene ontology model, to construct triplets representing the relations among different entities comprises the following steps: taking a trait entity as a search term, acquiring the protein entities related to the trait entity from the literature database, and constructing protein-related-trait triplets after verifying the relation between the trait entity and each protein entity; obtaining protein sequences of different species from the various domain science databases, and extracting the homologous proteins and corresponding genes of the protein entities based on the protein sequences, so as to construct protein-homologous-protein triplets and protein-corresponding-gene triplets; and acquiring structured data related to the protein entities and gene entities from the various domain science databases, cleaning the structured data, and constructing the triplets among the entities other than proteins, genes and traits according to the unique protein identifier attribute shared by the different domain science databases.

According to the invention, the relations among the entities are extracted from existing data and the corresponding triplets are constructed; the triplets then undergo multi-source knowledge association fusion, which merges descriptions of the same entity or concept from multiple sources with low redundancy and high accuracy. The ontology model is populated with the triplets to generate a trait regulatory gene knowledge graph covering all the entities, which provides data support for mining pathway gene clusters.

In an alternative embodiment, the process of selecting the interacting protein pairs and the non-interacting protein pairs with the preset proportion in the trait regulatory gene knowledge graph as the data set comprises the following steps: selecting all known interacting protein pairs in the trait regulatory gene knowledge graph as positive samples, and selecting all known non-interacting protein pairs as negative samples; the negative samples are downsampled based on a preset ratio of the positive samples to the negative samples, and the positive samples and the downsampled negative samples are combined as a dataset.

According to the invention, a relatively accurate protein interaction prediction model can be constructed by using the known interacting and non-interacting protein pairs in the trait regulatory gene knowledge graph as the data set. Because non-interacting protein pairs make up a relatively large proportion of the trait regulatory gene knowledge graph, the negative samples are downsampled when the data set is built so that the samples remain balanced.

In an alternative embodiment, the protein interaction prediction model is used to predict whether an interaction relation exists between different proteins, and the prediction process comprises: calculating, through a preset algorithm, a first protein vector and a second protein vector for each protein pair in the trait regulatory gene knowledge graph; inputting the first protein vector and the second protein vector into their respective single-layer fully connected neural networks in the protein interaction prediction model to obtain a first output vector and a second output vector; concatenating the first and second output vectors and passing the result through a fully connected neural network to obtain a third output vector; and mapping the third output vector through a preset activation function to obtain the prediction result, i.e., whether the two proteins have an interaction relation.

According to the invention, constructing a protein interaction prediction model that combines knowledge graph node vector embeddings with fully connected neural networks solves the problem of low accuracy in interaction protein prediction and ensures the reliability of the pathway gene cluster mining results.

In an alternative embodiment, the process of supplementing the interaction protein connected subgraph according to the prediction result and mining the interaction protein complete subgraph from the supplemented subgraph includes: taking as the existing edges of the interaction protein connected subgraph the reachable paths between proteins with known interaction relations in the trait regulatory gene knowledge graph, and adding to the subgraph, as connecting edges, the interaction relations between proteins predicted by the protein interaction prediction model; and searching the supplemented interaction protein connected subgraph, based on a preset algorithm, for the proteins that all have interaction relations with one another, to generate the interaction protein complete subgraph.

According to the invention, through carrying out interaction relation prediction on any protein pair in the interaction protein connected subgraph, all proteins with interaction relation are connected through the connecting edges, and thus, the complete subgraph of the interaction protein can be obtained. All pairs of proteins in the complete subgraph of interacting proteins must be guaranteed to have an interaction relationship so that the corresponding genes can be guaranteed to be structural genes encoding enzymes catalyzing different steps of the same metabolic pathway, which provides a basis for the excavation of pathway gene clusters.

In a second aspect, the present invention provides an excavating device for a pathway gene cluster, the device comprising:

the ontology model construction module is used for selecting entity classes according to the purpose of pathway gene cluster mining, determining the data attributes of the different entities in the entity classes and the object attributes among the entities, and constructing a trait regulatory gene ontology model based on the entity classes, the data attributes and the object attributes;

the knowledge graph generation module is used for extracting, based on the trait regulatory gene ontology model, the entities and the relations among them from a multi-source database to construct triplets representing the relations among different entities, and carrying out multi-source knowledge association fusion on the triplets to generate a trait regulatory gene knowledge graph;

the prediction model construction module is used for selecting interacting protein pairs and non-interacting protein pairs in a preset proportion from the trait regulatory gene knowledge graph as a data set, and carrying out model training based on the data set to construct a protein interaction prediction model;

the subgraph construction module is used for searching the trait regulatory gene knowledge graph to obtain an interaction protein connected subgraph of the protein to be mined, inputting any protein pair in the interaction protein connected subgraph into the protein interaction prediction model to predict its interaction relation, supplementing the interaction protein connected subgraph according to the prediction result, and mining an interaction protein complete subgraph from the supplemented interaction protein connected subgraph;

the pathway gene cluster mining module is used for mining from the trait regulatory gene knowledge graph, based on the interaction protein complete subgraph, the entity nodes commonly connected to all proteins in the interaction protein complete subgraph and to the corresponding genes to obtain a gene structure diagram, and then obtaining the pathway gene cluster by judging whether the genes in the gene structure diagram lie within a preset physical position interval.

In a third aspect, the present invention provides a computer device comprising a processor that executes computer instructions, thereby performing the method for mining a pathway gene cluster according to the first aspect or any one of its corresponding embodiments.

In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the method for mining a pathway gene cluster according to the first aspect or any one of the embodiments corresponding thereto.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow diagram of a method of mining a pathway gene cluster according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a trait regulatory gene ontology model of a method of mining a pathway gene cluster according to an embodiment of the present invention;

FIG. 3 is a schematic representation of a protein interaction prediction model calculation of a method of mining a pathway gene cluster according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an interaction protein communication subgraph of a method of mining a pathway gene cluster according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a complete subgraph of an interacting protein of a method of mining a pathway gene cluster according to an embodiment of the present invention;

FIG. 6 is a schematic representation of pathway gene cluster mining of a method of mining pathway gene clusters according to an embodiment of the present invention;

FIG. 7 is a block diagram of a construction of an excavating device of pathway gene clusters according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the invention is suitable for scenarios of mining pathway gene clusters. The embodiment of the invention provides a method for mining a pathway gene cluster, which successfully mines pathway gene clusters by constructing a trait regulatory gene knowledge graph and retrieving and predicting interacting protein pairs from it. It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.

In this embodiment, a method for excavating a pathway gene cluster is provided, which can be used in the above computer, and fig. 1 is a flowchart of a method for excavating a pathway gene cluster according to an embodiment of the present invention, as shown in fig. 1, and the flowchart includes the following steps:

Step S101, selecting entity classes according to the purpose of pathway gene cluster mining, determining the data attributes of the different entities in the entity classes and the object attributes among the entities, and constructing a trait regulatory gene ontology model based on the entity classes, the data attributes and the object attributes.

Specifically, in the embodiment of the invention, 13 entity classes, 16 data attributes and 14 object attributes are used to construct the trait regulatory gene ontology model, but the model is not limited to these. The 13 entity classes are: protein (Protein), gene (Gene), trait (Trait), signaling pathway (Signal Pathway), gene symbol (Gene Symbol), protein family (Protein Family), domain (Domain), subcellular localization (Subcellular Location), cellular component (Cellular Component), molecular function (Molecular Function), biological process (Biological Process), metabolic pathway (Metabolic Pathway) and enzyme (Enzyme), with protein, gene and trait as the central entities. A data attribute is a characteristic of the corresponding entity, and an object attribute is a relation between different entities. The established trait regulatory gene ontology model is shown in FIG. 2. Taking the protein class as an example, the trait class is connected to the protein class through the related (associates with) object attribute, which establishes the association between a trait and the proteins known to be related to it. The data attributes describing a protein are also added: protein identifier (protein ID), species, first time of discovery (date of creation), functional description (function description), impact phenotype description (phenotype disruption), and PubMed document number (PMID). In addition, the association between two proteins is established through the homologous (homologous to) object attribute, which is a key object attribute in the ontology model and an important basis for fusing multidimensional scientific data across species.
On this basis, the association between a protein and a gene is constructed through the corresponding (corresponding to) object attribute, and the data attributes describing the gene are added: gene identifier (gene ID), species (location), PANTHER database number (PANTHER identity), and transcript name (transcript name). The association between a protein and a gene symbol is established through the identity (identity with) object attribute, which serves as a key to discovering gene function knowledge across species. These are only examples and are not limiting.
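As a rough illustration of how the object attributes above constrain which triplets are allowed, the following minimal Python sketch encodes four of them as a schema and checks a candidate triplet against it. The identifier spellings (`associates_with`, `homologous_to`, `corresponding_to`, `identity_with`, `GeneSymbol`) and the function name are assumptions made for this example, not names fixed by the embodiment.

```python
# Hypothetical schema: object attribute -> (head entity class, tail entity class).
# Only the four object attributes named in the description are listed here.
OBJECT_ATTRIBUTES = {
    "associates_with": ("Protein", "Trait"),
    "homologous_to": ("Protein", "Protein"),
    "corresponding_to": ("Protein", "Gene"),
    "identity_with": ("Protein", "GeneSymbol"),
}


def is_valid_triplet(head_class, relation, tail_class):
    """Return True if the relation is defined between these two entity classes."""
    return OBJECT_ATTRIBUTES.get(relation) == (head_class, tail_class)


print(is_valid_triplet("Protein", "associates_with", "Trait"))
print(is_valid_triplet("Gene", "associates_with", "Trait"))
```

A check of this kind could be applied before a triplet is written into the knowledge graph, so that only relations defined in the ontology model are populated.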

Step S102, based on the trait regulatory gene ontology model, extracting the entities and the relations among them from a multi-source database to construct triplets representing the relations among different entities, and carrying out multi-source knowledge association fusion on the triplets to generate a trait regulatory gene knowledge graph.

Specifically, in the embodiment of the invention, the trait regulatory gene ontology model only describes which types of object attributes exist between different entity classes; the model must still be populated to obtain the trait regulatory gene knowledge graph. The embodiment of the invention extracts entities and the relations between them from existing, verified data in a literature database and domain science databases, but is not limited to these. The documents in the literature database contain the latest achievements in the field, because scientists generally publish new findings in the literature first; the knowledge extracted from the literature therefore represents the newest knowledge and research progress in the field. The domain science databases contain systematic, normalized domain knowledge; reorganizing and fusing this knowledge improves the efficiency of knowledge extraction, expands the knowledge associations of the domain knowledge graph, and is of great significance for domain knowledge discovery. The embodiment of the invention selects the PubMed literature database and the UniProt domain science database; in actual operation, PubMed documents are traced through the UniProt database. This is only an example and is not limiting. Fusing the two types of databases combines the newest knowledge with normalized, systematic knowledge, which improves knowledge extraction efficiency, expands the knowledge associations of the domain knowledge graph, and enables the discovery of new discipline knowledge.

In an optional implementation manner, the embodiment of the invention uses trait description keywords as search terms, obtains protein IDs by following the links from the UniProt domain science database to the PubMed literature database, and then manually verifies the relation between each document and the trait to establish protein-related-trait triplets.

In an alternative implementation manner, the embodiment of the invention downloads protein sequences of different species from a report database, and then calculates the similarity between the protein sequences of different species by using a BLAST calculation tool, thereby obtaining a protein-protein homology relationship and a protein-gene correspondence relationship, and constructing a protein-homology-protein triplet and a protein-correspondence-gene triplet.

In an alternative embodiment, the embodiment of the invention downloads structured data related to genes and proteins from the various domain science databases and cleans it with pandas. Because the different databases share a common unique protein identifier, this identifier attribute is used to associate the records and construct the triplets among the entities other than proteins, genes and traits.

In an optional implementation manner, the embodiment of the invention performs multi-source knowledge association fusion on the extracted entity triplets, which mainly addresses merging descriptions of the same entity or concept from multiple sources with low redundancy and high accuracy. The result is a trait regulatory gene knowledge graph covering 13 entity classes, 16 data attributes and 14 object attributes.
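The fusion step is not specified in detail, so the following is only a minimal sketch of one possible low-redundancy merge: identical triplets from different sources are collapsed into one entry that records which sources support it. The function name, source labels, identifiers and relation names are hypothetical.

```python
from collections import defaultdict


def fuse_triplets(sources):
    """Merge (head, relation, tail) triplets from several sources.

    Duplicate triplets are collapsed into one key, and the set of source
    names that reported each triplet is kept as its value.
    """
    support = defaultdict(set)
    for source_name, triplets in sources.items():
        for head, relation, tail in triplets:
            key = (head.strip(), relation.strip(), tail.strip())
            support[key].add(source_name)
    return dict(support)


# Hypothetical input: one triplet is reported by both sources.
sources = {
    "literature": [("P1", "associates_with", "drought tolerance")],
    "domain_db": [
        ("P1", "associates_with", "drought tolerance"),
        ("P1", "corresponding_to", "G1"),
    ],
}
fused = fuse_triplets(sources)
for triplet, origin in sorted(fused.items()):
    print(triplet, sorted(origin))
```

Keeping the supporting sources alongside each triplet is a design choice of this sketch; it makes it possible to prefer triplets confirmed by more than one database when accuracy matters.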

Step S103, selecting interaction protein pairs and non-interaction protein pairs with preset proportions in the trait regulatory gene knowledge graph as data sets, and performing model training based on the data sets to construct a protein interaction prediction model.

Specifically, in the embodiment of the invention, the trait regulatory gene knowledge graph contains a plurality of proteins, and for some of them it is already known whether they interact. All known interacting protein pairs in the trait regulatory gene knowledge graph are therefore selected as positive samples and all known non-interacting protein pairs as negative samples; the negative samples are downsampled based on a preset ratio of positive to negative samples, and the positive samples and the downsampled negative samples are combined as the data set. Because non-interacting protein pairs make up a much larger proportion of the trait regulatory gene knowledge graph, the embodiment of the invention randomly samples the negative samples so that the ratio of negative to positive samples is kept at 1:1, i.e., there are as many negative samples as positive samples, in order to keep the samples balanced; the ratio is not limited to this.
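The downsampling described above can be sketched as follows. This is a minimal example assuming that protein pairs are given as tuples of identifiers; the function name, the `neg_per_pos` parameter and the fixed random seed are choices made for this illustration only.

```python
import random


def build_balanced_dataset(positive_pairs, negative_pairs, neg_per_pos=1.0, seed=0):
    """Randomly downsample negatives to a preset ratio and label the pairs.

    Returns a shuffled list of (protein_a, protein_b, label) with label 1 for
    interacting pairs and 0 for non-interacting pairs.
    """
    rng = random.Random(seed)
    wanted = int(round(len(positive_pairs) * neg_per_pos))
    n_neg = min(len(negative_pairs), wanted)
    sampled_negatives = rng.sample(list(negative_pairs), n_neg)
    dataset = [(a, b, 1) for a, b in positive_pairs]
    dataset += [(a, b, 0) for a, b in sampled_negatives]
    rng.shuffle(dataset)
    return dataset


# Hypothetical identifiers: 3 known interacting pairs, 10 known non-interacting pairs.
positives = [("P1", "P2"), ("P2", "P3"), ("P1", "P3")]
negatives = [("P1", "N%d" % i) for i in range(10)]
dataset = build_balanced_dataset(positives, negatives)
print(len(dataset), sum(label for _, _, label in dataset))
```

With the default ratio of 1:1 the data set contains as many negative samples as positive samples, matching the balance described in the embodiment.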

In an alternative implementation manner, the embodiment of the invention trains the protein interaction prediction model through the constructed data set, uses the cross entropy loss function to train the model, and stores the trained protein interaction prediction model. Wherein the cross entropy loss function is as follows:

L = -(y log(p) + (1 - y) log(1 - p))

where y is the true label (1 for an interacting protein pair, 0 otherwise) and p is the probability predicted by the model that the pair interacts.

In an alternative embodiment, the trained protein interaction prediction model is used to predict whether an interaction relation exists between different proteins, and the prediction process includes:
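For concreteness, the loss for a single sample can be computed as in the short sketch below. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero; it is not part of the formula in the description.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Cross entropy loss for one sample: -(y*log(p) + (1-y)*log(1-p))."""
    p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# A confident correct prediction costs less than a confident wrong one.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))
```

During training the per-sample losses would typically be averaged over a batch before the model parameters are updated.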

1. Calculating, through a preset algorithm, a first protein vector and a second protein vector corresponding to each protein pair (protein 1 and protein 2) in the trait regulatory gene knowledge graph. The invention uses the TransE algorithm for vector calculation. TransE is a knowledge graph embedding model that computes a vector for each node and each relation in the knowledge graph; its principle is tail entity vector = head entity vector + relation vector, and the computed node vector represents the node. For example, for the triplet Protein-associates_with-Trait, Trait vector = Protein vector + associates_with vector. Because TransE is computed over the whole knowledge graph, each node vector incorporates information from the node's neighborhood.

2. Inputting the first protein vector and the second protein vector respectively into two single-layer fully connected neural networks in the protein interaction prediction model to obtain a corresponding first output vector v1 and second output vector v2.

3. Concatenating the first output vector v1 and the second output vector v2 and inputting the result into a fully connected neural network to obtain a third output vector r.

4. Mapping the third output vector through a preset activation function to obtain the prediction result, i.e., whether the two proteins have an interaction relation. In the embodiment of the invention, the result is mapped by the sigmoid activation function, and the final output indicates whether the two input proteins are interacting proteins; this is only an example and is not limiting.
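Steps 1 to 4 can be sketched in plain Python as follows. This is an untrained forward pass under several assumptions that the description does not fix: the layer sizes, the random initial weights, the ReLU applied after each single-layer branch, and a third output vector of length 1. The function `transe_distance` only illustrates the TransE principle (head + relation close to tail); it is not a training procedure.

```python
import math
import random


def transe_distance(head, relation, tail):
    """L2 distance between head + relation and tail; small if the triplet fits."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def dense(vector, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_layer(rng, n_in, n_out):
    """Hypothetical initialisation: small uniform weights, zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_interaction(protein1_vec, protein2_vec, params):
    """Two single-layer branches, concatenation, one more layer, then sigmoid."""
    v1 = relu(dense(protein1_vec, *params["branch1"]))  # first output vector
    v2 = relu(dense(protein2_vec, *params["branch2"]))  # second output vector
    r = dense(v1 + v2, *params["head"])  # list concatenation, then third output vector
    return sigmoid(r[0])


rng = random.Random(0)
dim, hidden = 4, 3
params = {
    "branch1": random_layer(rng, dim, hidden),
    "branch2": random_layer(rng, dim, hidden),
    "head": random_layer(rng, 2 * hidden, 1),
}
p = predict_interaction([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], params)
print("predicted interaction probability:", p)
```

In practice the protein vectors would come from a TransE embedding trained on the whole knowledge graph, and the weights would be learned with the cross entropy loss rather than drawn at random.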

And step S104, searching based on the trait regulation gene knowledge graph to obtain an interaction protein connected subgraph of the protein to be mined, predicting the interaction relation of any protein in the interaction protein connected subgraph to an input protein interaction prediction model, supplementing the interaction protein connected subgraph according to a prediction result, and mining an interaction protein complete subgraph based on the supplemented interaction protein connected subgraph.

Specifically, in the embodiment of the invention, the generated trait regulatory gene knowledge graph is searched, the protein to be mined is input as the query protein, and the interaction protein connected subgraph containing the query protein is searched, namely, the connected subgraph only contains protein nodes and interaction relation edges. Taking the protein Q9SUQ2 as an example, the interaction protein connected subgraph of the protein Q9SUQ2 is searched in a character regulation gene knowledge graph in a correlation way, and is shown in figure 4.

In an optional implementation manner, on the basis of the queried interaction protein connected subgraph, all proteins in the connected subgraph are extracted, the proteins are combined into protein pairs, the protein pairs are input into a protein interaction prediction model for prediction, and edges are added between the protein pairs with the prediction result to represent interaction relations; as shown in FIG. 5, 5 nodes at the periphery are proteins, which are connected as edges in the original map by interactions_with (interaction relationship connection), and which are connected as edges predicted by the protein interaction prediction model by the dotted line.
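The supplementing step can be sketched as below: every pair of proteins in the connected subgraph that is not already linked is passed to a predictor, and an edge is added when the predicted probability reaches a threshold. The function name, the 0.5 threshold and the stub predictor are assumptions for illustration; in the embodiment the predictor would be the trained protein interaction prediction model.

```python
from itertools import combinations


def supplement_subgraph(proteins, known_edges, predict, threshold=0.5):
    """Return known edges plus predicted edges, each edge as a frozenset of two proteins."""
    edges = {frozenset(edge) for edge in known_edges}
    for a, b in combinations(sorted(proteins), 2):
        pair = frozenset((a, b))
        if pair not in edges and predict(a, b) >= threshold:
            edges.add(pair)  # predicted interaction, drawn dotted in a figure
    return edges


def stub_predict(a, b):
    """Stand-in for the trained model: pretends only P2 and P3 interact."""
    return 0.9 if {a, b} == {"P2", "P3"} else 0.1


proteins = {"P1", "P2", "P3", "P4"}
known = [("P1", "P2"), ("P1", "P3"), ("P3", "P4")]
for edge in sorted(sorted(e) for e in supplement_subgraph(proteins, known, stub_predict)):
    print(edge)
```

Storing each edge as a `frozenset` makes the graph undirected, so a pair is never added twice in opposite orders.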

In an alternative implementation, after the interaction relations have been predicted on the interaction protein connected subgraph, the supplemented subgraph does not necessarily contain an interaction relation between every pair of proteins. The embodiment of the invention therefore mines the interaction protein complete subgraph from the supplemented interaction protein connected subgraph using the Bron-Kerbosch algorithm, although the invention is not limited to this algorithm: the proteins that all have interaction relations with one another are found, and the interaction protein complete subgraph is constructed from them.
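As a non-limiting illustration, the basic Bron-Kerbosch algorithm (without pivoting) can be sketched as below. It enumerates the maximal complete subgraphs of an undirected graph. Selecting the largest complete subgraph that contains the query protein is an assumption of the sketch, and every identifier other than Q9SUQ2 is hypothetical.

```python
def bron_kerbosch(current, candidates, excluded, adjacency, cliques):
    """Basic Bron-Kerbosch recursion: collect every maximal clique."""
    if not candidates and not excluded:
        cliques.append(set(current))  # current cannot be extended further
        return
    for vertex in list(candidates):
        bron_kerbosch(current | {vertex},
                      candidates & adjacency[vertex],
                      excluded & adjacency[vertex],
                      adjacency, cliques)
        candidates = candidates - {vertex}
        excluded = excluded | {vertex}


def maximal_cliques(edges):
    """Build an adjacency map from (a, b) pairs and run the recursion."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    cliques = []
    bron_kerbosch(set(), set(adjacency), set(), adjacency, cliques)
    return cliques


def largest_clique_with(cliques, query):
    """Largest maximal clique containing the query protein (assumption)."""
    containing = [c for c in cliques if query in c]
    return max(containing, key=len) if containing else set()


# Hypothetical supplemented subgraph; only Q9SUQ2 appears in the text.
edges = [
    ("Q9SUQ2", "P_A"), ("Q9SUQ2", "P_B"), ("Q9SUQ2", "P_C"),
    ("P_A", "P_B"), ("P_A", "P_C"), ("P_B", "P_C"),
    ("P_C", "P_D"),
]
cliques = maximal_cliques(edges)
complete_subgraph = largest_clique_with(cliques, "Q9SUQ2")
```

Here P_D interacts only with P_C, so it is left out of the complete subgraph that contains Q9SUQ2.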

Step S105: based on the interaction protein complete subgraph, mining from the trait regulatory gene knowledge graph the entity nodes commonly connected between all proteins, and between the corresponding genes, in the interaction protein complete subgraph to obtain a gene structure diagram, and then obtaining the pathway gene cluster by judging whether the genes in the gene structure diagram are within a preset physical position interval.

Specifically, in the embodiment of the present invention, based on the structure of the specific interaction protein complete subgraph, the entity nodes commonly connected between the proteins and the entity nodes commonly connected between the corresponding genes are mined from the trait regulatory gene knowledge graph, and the gene structure diagram corresponding to the interaction protein complete subgraph is constructed according to the protein-corresponding-gene triplets, as shown in FIG. 6.
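As a non-limiting illustration, the construction of the gene structure diagram can be sketched as follows. The relation names corresponds_to, belongs_to_family, has_domain and located_on, and all entity identifiers, are hypothetical; a commonly connected entity node is interpreted here as a (relation, entity) pair shared by every member, which is an assumption.

```python
def gene_structure(triples, clique, gene_relation="corresponds_to"):
    """Map each protein of the complete subgraph to its gene and collect
    the entity nodes shared by all proteins and by all genes."""
    protein_to_gene = {head: tail for head, rel, tail in triples
                       if rel == gene_relation and head in clique}
    genes = set(protein_to_gene.values())

    def common(members):
        # (relation, entity) pairs attached to every member
        per_member = [
            {(rel, tail) for head, rel, tail in triples
             if head == member and rel != gene_relation}
            for member in members
        ]
        return set.intersection(*per_member) if per_member else set()

    return {
        "protein_to_gene": protein_to_gene,
        "shared_by_proteins": common(clique),
        "shared_by_genes": common(genes),
    }


# Hypothetical triplets for a two-protein complete subgraph.
triples = [
    ("P_A", "corresponds_to", "G_A"),
    ("P_B", "corresponds_to", "G_B"),
    ("P_A", "belongs_to_family", "FAM_1"),
    ("P_B", "belongs_to_family", "FAM_1"),
    ("P_A", "has_domain", "DOM_1"),
    ("G_A", "located_on", "CHR_1"),
    ("G_B", "located_on", "CHR_1"),
]
structure = gene_structure(triples, {"P_A", "P_B"})
```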

In an alternative implementation, on the basis of the gene structure diagram, the embodiment of the invention obtains the location attribute (location) from the data attributes of the different genes, and then judges from the location attribute whether the physical distance between the different genes is within a preset physical position interval; if it is, all the genes in the gene structure diagram constitute the mined pathway gene cluster. The preset physical position interval selected by the present invention is tens to hundreds of kb, but is not limited thereto.
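As a non-limiting illustration, the physical position judgment can be sketched as below. The representation of the location attribute as chromosome, start and end, the requirement that all genes lie on one chromosome, the use of the overall span as the distance measure, and the 200 kb default are all assumptions; the text only states an interval of tens to hundreds of kb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneLocation:
    """Hypothetical form of the location data attribute of a gene."""
    gene_id: str
    chromosome: str
    start: int  # base-pair coordinate
    end: int    # base-pair coordinate


def is_pathway_gene_cluster(genes, max_span_bp=200_000):
    """True if all genes share one chromosome and their overall span
    does not exceed the preset physical position interval."""
    if len(genes) < 2:
        return False  # a cluster needs at least two genes
    if len({g.chromosome for g in genes}) != 1:
        return False  # genes on different chromosomes are not clustered
    span = max(g.end for g in genes) - min(g.start for g in genes)
    return span <= max_span_bp


# Hypothetical genes and coordinates.
genes = [
    GeneLocation("G_A", "chr1", 100_000, 105_000),
    GeneLocation("G_B", "chr1", 150_000, 158_000),
    GeneLocation("G_C", "chr1", 240_000, 250_000),
]
clustered = is_pathway_gene_cluster(genes)
```

A pairwise distance criterion between neighbouring genes would be an equally plausible reading of the text; the overall span is used here because it is the stricter of the two.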

The present embodiment provides a device for mining a pathway gene cluster, as shown in FIG. 7, comprising:

the ontology model construction module 701 is configured to select entity classes according to the purpose of pathway gene cluster mining, determine data attributes of different entities in the entity classes and object attributes between the entities, and construct a trait regulatory gene ontology model based on the entity classes, the data attributes and the object attributes;

the knowledge graph generation module 702 is configured to extract various entities and relationships between the entities from the multi-source database based on the trait regulatory gene ontology model, construct triples representing the relationships between the different entities, and perform multi-source knowledge association fusion according to the triples to generate a trait regulatory gene knowledge graph;

the prediction model construction module 703 is configured to select interaction protein pairs and non-interaction protein pairs in a preset proportion from the trait regulatory gene knowledge graph as a data set, and perform model training based on the data set to construct a protein interaction prediction model;

the subgraph construction module 704 is configured to search the trait regulatory gene knowledge graph to obtain an interaction protein connected subgraph of the protein to be mined, input any proteins in the interaction protein connected subgraph into the protein interaction prediction model to predict their interaction relations, supplement the interaction protein connected subgraph according to the prediction result, and mine an interaction protein complete subgraph based on the supplemented interaction protein connected subgraph;

the pathway gene cluster mining module 705 is configured to mine, from the trait regulatory gene knowledge graph and based on the interaction protein complete subgraph, the entity nodes commonly connected between all proteins and between the corresponding genes in the interaction protein complete subgraph to obtain a gene structure diagram, and then obtain the pathway gene cluster by judging whether the genes in the gene structure diagram are within a preset physical position interval.

Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.

The device for mining a pathway gene cluster in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above functions.

The embodiment of the invention also provides a computer device having the device for mining a pathway gene cluster shown in FIG. 7.

FIG. 8 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in FIG. 8, the computer device includes: one or more processors 10, memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in FIG. 8.

The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, generic array logic, or any combination thereof.

The memory 20 stores instructions executable by the at least one processor 10, so that the at least one processor 10 performs the method shown in the embodiments described above.

The memory 20 may include a program storage area and a data storage area, where the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.

The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.

The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that is recorded on a storage medium, or as computer code that was originally stored on a remote storage medium or a non-transitory machine-readable storage medium, is downloaded through a network, and is then stored on a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.

Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims

1. A method for excavating a pathway gene cluster, comprising:

selecting entity classes according to the purpose of pathway gene cluster mining, determining data attributes of different entities in the entity classes and object attributes among the entities, and constructing a trait regulatory gene ontology model based on the entity classes, the data attributes and the object attributes;

selecting interaction protein pairs and non-interaction protein pairs in a preset proportion from the trait regulatory gene knowledge graph as a data set, and performing model training based on the data set to construct a protein interaction prediction model;

searching the trait regulatory gene knowledge graph to obtain an interaction protein connected subgraph of the protein to be mined, inputting any proteins in the interaction protein connected subgraph into the protein interaction prediction model to predict their interaction relations, supplementing the interaction protein connected subgraph according to the prediction result, and mining an interaction protein complete subgraph based on the supplemented interaction protein connected subgraph;

and mining, from the trait regulatory gene knowledge graph and based on the interaction protein complete subgraph, the entity nodes commonly connected among all proteins and among the corresponding genes in the interaction protein complete subgraph to obtain a gene structure diagram, and then obtaining a pathway gene cluster by judging whether the genes in the gene structure diagram are within a preset physical position interval.

2. The method of claim 1, wherein the entity classes comprise: proteins, genes, traits, signaling pathways, genetic symbols, protein families, domains, subcellular localization, cellular components, molecular functions, biological processes, metabolic pathways, and enzymes, with said proteins, genes, and traits taken as central entities;

the data attribute is a characteristic of a corresponding entity, and the object attribute is a relation between different entities.

3. The method of claim 1, wherein the multi-source database comprises: a literature database and domain scientific databases.

4. A method according to claim 2 or 3, wherein the process of extracting each type of entity and the relationship between each type of entity from the multi-source database based on the trait regulatory gene ontology model to construct a triplet representing the relationship between different entities comprises:

taking the trait entity as a search term, acquiring protein entities related to the trait entity based on the literature database, and constructing a protein-related-trait triplet after verifying the relation between the trait entity and the protein entity;

obtaining protein sequences of different species based on the various types of domain scientific databases, and extracting homologous proteins and corresponding genes of the protein entities based on the protein sequences, so as to construct protein-homologous-protein triplets and protein-corresponding-gene triplets;

and acquiring structured data related to protein entities and gene entities based on the various types of domain scientific databases, cleaning the structured data, and constructing triplets among the entities other than proteins, genes and traits according to the unique protein identifier attribute shared by the different types of domain scientific databases.

5. The method of claim 1, wherein selecting interaction protein pairs and non-interaction protein pairs in a preset proportion from the trait regulatory gene knowledge graph as the data set comprises:

selecting all known interacting protein pairs in the trait regulatory gene knowledge graph as positive samples, and selecting all known non-interacting protein pairs as negative samples;

and downsampling the negative sample based on a preset proportion of the positive sample and the negative sample, and combining the positive sample and the downsampled negative sample as a data set.

6. The method of claim 1, wherein the protein interaction prediction model is used to predict whether there is an interaction relation between different proteins, and the process of predicting with the protein interaction prediction model comprises:

calculating a first protein vector and a second protein vector corresponding to all protein pairs in the trait regulatory gene knowledge graph through a preset algorithm;

inputting the first protein vector and the second protein vector respectively into a corresponding single-layer fully connected neural network in the protein interaction prediction model to obtain a corresponding first output vector and a corresponding second output vector;

splicing the first output vector and the second output vector, and obtaining a third output vector through a fully connected neural network;

and mapping the third output vector through a preset activation function to obtain a prediction result, namely predicting whether different proteins have interaction relations or not.

7. The method according to claim 6, wherein the process of supplementing the interaction protein connected subgraph according to the prediction result and mining the interaction protein complete subgraph based on the supplemented interaction protein connected subgraph comprises:

the connecting edges of the interaction protein connected subgraph being the reachable-path edges between proteins with known interaction relations in the trait regulatory gene knowledge graph, supplementing the interaction relations between different proteins predicted by the protein interaction prediction model to the interaction protein connected subgraph in the form of connecting edges;

searching proteins with interaction relations between each other from the interaction protein connected subgraph based on a preset algorithm to generate an interaction protein complete subgraph.

8. A device for excavating a pathway gene cluster, said device comprising:

the subgraph construction module is configured to search the trait regulatory gene knowledge graph to obtain an interaction protein connected subgraph of the protein to be mined, input any proteins in the interaction protein connected subgraph into the protein interaction prediction model to predict their interaction relations, supplement the interaction protein connected subgraph according to the prediction result, and mine an interaction protein complete subgraph based on the supplemented interaction protein connected subgraph;

and the pathway gene cluster mining module is configured to mine, from the trait regulatory gene knowledge graph and based on the interaction protein complete subgraph, the entity nodes commonly connected among all proteins and among the corresponding genes in the interaction protein complete subgraph to obtain a gene structure diagram, and then obtain the pathway gene cluster by judging whether the genes in the gene structure diagram are within a preset physical position interval.

9. A computer device, comprising:

a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method for excavating a pathway gene cluster according to any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method for excavating a pathway gene cluster according to any one of claims 1 to 7.