CN114969369A

CN114969369A - Hybrid-network-based knowledge graph method for predicting human cancer lethality, and knowledge graph construction method

Info

Publication number: CN114969369A
Application number: CN202210600385.XA
Authority: CN
Inventors: 刘爽; 朱晓敏; 孟佳娜; 孙世昶; 王巍
Original assignee: Dalian Minzu University
Current assignee: Dalian Minzu University
Priority date: 2022-05-30
Filing date: 2022-05-30
Publication date: 2022-08-30

Abstract

The invention relates to the field of artificial-intelligence-based link prediction and discloses a hybrid-network-based knowledge graph method for predicting human cancer lethality, together with a knowledge graph construction method. The technical scheme is as follows: designing the graph according to the entity categories; acquiring the corresponding medical data according to the graph design; processing the raw data; performing named entity recognition and relation extraction on the processed corpus; and constructing the knowledge graph. The advantages are as follows: the method introduces a graph neural network built on the knowledge graph and incorporates knowledge graph message passing into the graph neural network prediction; an attention mechanism is then introduced to extract the important local and global neighbors, so that local and global representations of the nodes are learned better; the original features are further aggregated with the local and global representations to obtain feature-specific representations; finally, the feature-specific representations are integrated by taking into account the importance of the different feature graphs. The method helps alleviate the independence problem and avoids manual feature engineering.

Description

Hybrid-network-based knowledge graph method for predicting human cancer lethality, and knowledge graph construction method

Technical field:

The invention relates to the field of artificial-intelligence-based link prediction, and in particular to a knowledge-graph-based link prediction method and its application in the medical field to predicting synthetic lethality in human cancers.

Background art:

A knowledge graph is developed from the semantic network: entities are represented as nodes, and the semantic relationships between entities are represented as edges connecting the nodes, forming a directed graph in which knowledge takes a more intuitive networked structure. Representing data through a graph structure lets a computer organize and manage large amounts of information more efficiently, and in turn supports vector embedding, retrieval, prediction, and reasoning over knowledge. However, as research on knowledge graphs has deepened and they have been applied in various fields, a number of problems have surfaced in practice that remain to be solved.

Among the most significant difficulties limiting the widespread use of knowledge graphs is that the knowledge graph (KG) used in an application is often incomplete, i.e. relationships or entity attributes are missing from the KG. This incompleteness greatly limits the development and wide application of knowledge graphs at the current stage; even the most advanced KGs, containing hundreds of millions of triples, are still incomplete. The reason may be that the data used to construct the KG is itself incomplete, or that some special relationships or entities cannot be identified during information extraction. Knowledge graph completion has therefore become an important task, and link prediction, its main method, is a promising and widely studied approach to KG incompleteness.

Link prediction maps the entities and relations of a knowledge graph into a continuous vector space and predicts entities or relations in the knowledge graph, including (h, r, ...). It is an important research direction for mining the implicit relationships between nodes in a knowledge graph. Traditional research on link prediction has largely been developed for particular application scenarios and solves different practical problems. Link prediction has since evolved from the initial problem of accurate representation into reasoning and related problems, and research now tends toward mining implicit information in the graph, so research on link prediction and related problems has practical application value. Areas in which link prediction is currently widely used include:

(1) Recommending acquaintances and similar users in social networks; most social networks use link prediction techniques to recommend acquaintances.

(2) Predicting the type of unlabeled nodes in a network in which some node types are known, for example determining the category of an academic paper or predicting criminal behavior from a criminal network.

(3) In biology, finding proteins that can interact. Because many proteins are still unknown, studying them experimentally takes a great deal of time and money. Link prediction allows candidate interactions to be predicted in advance, reducing the cost in time and money to a certain extent.

The prediction of synthetic lethality in human cancers is a very important application of link prediction in the biomedical field. With increasing pressure in daily life and the growing prevalence of fast-food lifestyles, cancer has become one of the major threats to human health; it arises mainly from uncontrolled cell growth leading to hyperproliferation. Traditional chemotherapy aims to kill cancer cells by targeting rapidly dividing cells with drugs. These drugs, however, also damage normal dividing cells and are toxic to normal cells that do not divide rapidly, which limits the effectiveness of these anti-cancer drugs. Compared with normal cells, the epigenetic and genetic changes within cancer cells and the changes in their microenvironment increase their dependence on specific molecular targets, providing an opportunity to kill cancer cells selectively.

Synthetic lethality (SL) refers to the situation in which, for two genes in a cell, mutation or loss of function of either gene alone does not cause cell death, whereas simultaneous mutation or loss of expression of both does. It is a promising approach to discovering anti-cancer drug targets and, as a new targeting strategy for selectively killing cancer cells, brings new opportunities for cancer treatment. By inhibiting the synthetic lethal partner of a mutated cancer gene, cancer cells carrying that mutation can be killed without damaging normal cells. In addition, synthetic lethality offers the possibility of discovering new drug targets and potential cancer drug combination strategies. Wet-laboratory screening for SL pairs suffers from high cost, batch effects, and off-target effects. Current computational methods for SL prediction fall into three categories:

The first category performs gene knockout simulation based on metabolic network models. The second category is knowledge-based data mining, i.e. knowledge-oriented methods, which mainly use domain-specific knowledge for feature engineering. These two categories rely heavily on metabolic network models, domain knowledge, and genomic data, and cannot make full use of the valuable information in known SL pairs. The third category applies machine learning algorithms, with features designed from domain knowledge and heuristic functions. Among these, GNN-based methods tend to treat each SL pair as an independent sample and do not consider the underlying biological mechanisms, while existing methods such as support vector machines incorporate genomic and proteomic data to aid SL prediction. GNN-based methods can encode information such as input features, whereas the latter methods extract features manually from domain knowledge and therefore miss some features.

In summary, current methods for predicting human synthetic lethal genes face three important challenges: most existing approaches assume that SL pairs are independent of each other and do not consider potential shared biological mechanisms; other methods combine genomic and proteomic data to aid SL prediction, but involve manual feature engineering and rely heavily on domain knowledge; and existing prediction methods are costly and consume a great deal of labor and time. Based on this analysis, the invention draws on the ideas of knowledge graphs and graph neural networks, combining knowledge graph link prediction with a graph neural network, Bi-LSTM, and an attention mechanism to address these problems. Research on a human cancer lethality prediction method based on a knowledge graph and a hybrid network is significant for the medical field, especially for cancer treatment.

Disclosure of Invention

The technical problem to be solved by the invention is as follows:

The invention aims to provide a human cancer lethality prediction method based on a knowledge graph and a hybrid network (a deep learning network model), which stores and predicts information related to synthetic lethality in human cancer in the form of a knowledge graph comprising 11 entities, including genes, compounds, diseases, and biological processes, and 24 possible relations related to SL. Passing messages over the constructed knowledge graph avoids manual feature engineering and alleviates the independence problem. Patient information is entered as natural language sentences; the relevant information is retrieved from the knowledge base, a prediction is made as to whether death of human cancer cells would result, and the result is returned to the doctor in natural language. In this way the doctor can obtain detailed information on the likely development of the patient's condition in advance, prepare more thoroughly when drawing up a treatment plan for the cancer patient, and obtain the required information more accurately and quickly, which facilitates treatment and gains time and hope for the patient.

The invention combines the knowledge graph with a deep learning model and applies the combination to the problem of predicting SL gene relations, with good results. This shows that a deep learning model based on a graph neural network can better solve complex problems in the biomedical field by combining knowledge and data. Newly predicted SL gene pairs will help biologists screen new anti-cancer drug targets more quickly, so that AI technology can accelerate new drug development. In addition, the knowledge graph reveals the biological mechanisms behind SL, which gives the deep learning model better interpretability, promotes the discovery of biological knowledge, accelerates the discovery of cancer drug targets, and advances AI-driven pharmaceutical technology. This is of considerable significance for research in bioinformatics.

The specific technical scheme of the invention is as follows:

A method for constructing a knowledge graph for synthetic lethality prediction of human cancer in the medical field comprises the following steps:

Step 1: designing the graph according to the entity categories;

Step 2: acquiring the corresponding medical data according to the graph design;

Step 3: processing the raw data;

Step 4: performing named entity recognition and relation extraction on the processed corpus;

Step 5: constructing the knowledge graph.

Further, for step 1, the data set contains 72804 gene pairs among 10004 genes. The KG, denoted SynLethKG, contains 24 relationships among 11 entities. Of the 24 relationships, 16 are directly related to genes, such as (gene, regulates, gene) and (gene, interacts, gene). Of the 11 entities, 7 are directly related to genes, i.e. pathways, cellular components, biological processes, molecular functions, diseases, compounds, and anatomy. The required information is first screened from the SynLethKG database.

Further, for step 2, the required structured, semi-structured, and unstructured data are obtained from the SynLethDB database by a web crawler according to the graph design.

Further, for step 3, the jieba word segmentation tool is used to perform word segmentation and part-of-speech tagging on the data, and punctuation marks and stop words are removed.

Further, for step 4, the obtained semi-structured data is consolidated and stored, entity recognition and relation extraction are performed on the unstructured data using a deep learning method, and knowledge fusion is then performed on the resulting data.

Further, for step 5, the data organized in step 4 is stored in Neo4j.

The invention also comprises a hybrid-network-based knowledge graph method for predicting human cancer lethality, comprising the following steps:

Step 1: extracting data from the database;

Step 2: parsing the gene pairs to obtain a gene-gene matrix;

Step 3: feeding the constructed knowledge graph and the obtained gene-gene pairs into the model as input; the overall framework is shown in FIG. 5;

Step 4: entering the graph neural network model to obtain the neighborhood representation of each gene and aggregate the obtained information;

Step 5: using the aggregated information as the input of a Bi-LSTM model, thereby enriching the feature extraction process;

Step 6: using the output of the preceding layer as the input of an attention mechanism model, thereby capturing entity and relation features in the multi-hop neighborhood;

Step 7: calculating the total loss of the model and optimizing.

Further, for step 1, SynLethDB is a comprehensive database of synthetic lethal gene pairs; after isolated nodes are removed, the resulting SynLethKG graph contains 54012 nodes and 2231921 edges. The required data set is screened from the database; if the corresponding triples cannot be found, relevant question-and-answer websites and forums are crawled using crawler technology.

Further, for step 2, the screened data is parsed by URI and converted into gene-gene matrix form. Given an SL-related gene, a weighted subgraph is constructed from the KG; identifying the relevant nodes and determining the edge weights are the two key steps.

Further, for step 3, the knowledge graph and the gene-gene matrix constructed above are used as input. The model framework mainly comprises a graph neural network, a bidirectional long short-term memory network, and an attention mechanism model; the overall research framework is shown in FIG. 3.

Further, for step 4, once the input has been obtained, the neighborhood of each entity is sampled. A fixed number k of neighbors is extracted for each entity to characterize its local structure, and the process is repeated for H hops (H ≥ 1). In particular, if a node has fewer than k neighbors, sampling is performed with replacement, i.e. a neighbor may be sampled multiple times.
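
The fixed-size neighbor sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the adjacency dictionary `kg`, the function names `sample_neighbors` and `sample_h_hops`, and the toy entities are hypothetical, and the fallback for a node without neighbors is an assumption (the text states that isolated nodes are removed beforehand).

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Return exactly k neighbors of `node`.

    If the node has fewer than k neighbors, sample with replacement,
    so the same neighbor may appear several times.
    """
    neighbors = adj.get(node, [])
    if not neighbors:
        # Assumption: a node without neighbors falls back to itself.
        return [node] * k
    if len(neighbors) >= k:
        return rng.sample(neighbors, k)
    return [rng.choice(neighbors) for _ in range(k)]


def sample_h_hops(adj, seed, k, hops, rng):
    """Return a list of layers; layer h holds k**h sampled entities."""
    layers = [[seed]]
    for _ in range(hops):
        next_layer = []
        for node in layers[-1]:
            next_layer.extend(sample_neighbors(adj, node, k, rng))
        layers.append(next_layer)
    return layers


# Toy knowledge graph with hypothetical edges, for illustration only.
kg = {
    "BRCA1": ["PARP1", "DNA repair"],
    "PARP1": ["BRCA1", "olaparib", "DNA repair", "ovarian cancer"],
    "DNA repair": ["BRCA1", "PARP1"],
    "olaparib": ["PARP1"],
    "ovarian cancer": ["PARP1"],
}
rng = random.Random(0)
layers = sample_h_hops(kg, "BRCA1", k=3, hops=2, rng=rng)
print([len(layer) for layer in layers])  # [1, 3, 9]
```

Fixing k keeps the size of every layer predictable (k to the power h entities at hop h), which is what allows the neighborhoods to be processed in regular batches.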

Further, for step 5, the Bi-LSTM recurrent neural network is used to mine the sequential structure of genes. Before the Bi-LSTM is applied, interacting synthetic lethal genes are arranged into sequence form, so that long-distance dependencies and position information in the original text can be captured. The Bi-LSTM extracts a state for each lethal gene, and the states of all lethal genes are finally combined for prediction.

Further, for step 6, suppose the input is "PARP inhibitor treatment for recurrent ovarian cancer". Attention should be paid to "inhibitor" when the decoder generates a site-related prediction, and to "ovarian cancer" when the decoder generates the prediction related to that term. The invention therefore introduces an attention mechanism to solve the problem that a single semantic vector cannot attend to the important information in the sequence. After the word vectors are fed into the Bi-LSTM model one by one, a series of encoder hidden states is produced, and these take part in computing the attention coefficients. In each round of training, the decoder output state also takes part in computing the attention coefficients, and the final probability distribution is obtained from the weighted sum of the decoder state and the hidden states. The invention uses a feature method based on a layer attention mechanism that can capture both entity and relation features in the neighborhood of any given entity. In addition, relation clustering and multi-hop relations are encapsulated in the model, which gives insight into the effectiveness of the attention-based model.
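
The computation of attention coefficients from the encoder hidden states and a decoder state can be illustrated with a short sketch. The dot-product scoring function used here is an assumption (the text does not specify how the coefficients are scored), and the names `attend`, `encoder_states`, and `decoder_state` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, query):
    """Score each hidden state against the query (dot product, an assumption),
    normalize the scores, and return (weights, context vector)."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


# Three toy encoder hidden states and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(encoder_states, decoder_state)
print(weights)
print(context)
```

The hidden states that align with the decoder state (the first and third) receive equal, larger coefficients, so the context vector is dominated by them.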

Further, for step 7, two losses are designed for the model, namely the basic loss loss1 and loss2; the basic loss is computed with cross-entropy, and the model is optimized using the Adam optimization algorithm.

The beneficial effects of the invention are as follows: the invention introduces a graph neural network (GNN) built on the knowledge graph and incorporates knowledge graph (KG) message passing into the graph neural network prediction. The model is built from 11 entities, including genes, compounds, diseases, and biological processes, and 24 possible SL-related relations; messages are passed over the KG and aggregated, the aggregated information is passed to the Bi-LSTM to enrich feature extraction, and an attention mechanism is then introduced so that node representations are learned from multiple feature graphs. The attention mechanism extracts the important local and global neighbors, so that the local and global representations of each node are learned better. The original features are further aggregated with the local and global representations to obtain feature-specific representations. Finally, the feature-specific representations are integrated by taking into account the importance of the different feature graphs. The method helps alleviate the independence problem and avoids manual feature engineering.

Drawings

FIG. 1 is an overall structural view of the present invention;

FIG. 2 is a flow chart of the synthetic lethality prediction study of cancer according to the present invention;

FIG. 3 is a diagram of the data preprocessing format of the present invention;

FIG. 4 is a diagram of text preprocessing in the present invention;

FIG. 5 is a block diagram of the overall prediction study of synthetic lethality of cancer in accordance with the present invention;

FIG. 6 is a diagram of the operation of the neural network model of the present invention;

FIG. 7 is a structural diagram of the deep learning graph neural network of the present invention;

FIG. 8 is a schematic diagram of the operation of the Bi-LSTM bidirectional long short-term memory network of the present invention;

FIG. 9 is a schematic diagram of the operation of the attention mechanism of the present invention;

FIG. 10 is a schematic diagram of the layer attention mechanism of the present invention;

FIG. 11 is a database visualization effect diagram of the present invention;

FIG. 12 is a diagram illustrating the page effect of the WeChat applet in accordance with the present invention;

FIG. 13 is a diagram showing the frequently-asked-questions page of the WeChat applet in the present invention;

FIG. 14 is a visual effect diagram of the WeChat applet prediction interface in the present invention.

Detailed Description

The specific operation steps of the knowledge-graph-based method for predicting synthetic lethality of human cancer in the medical field are described in more detail below with reference to the accompanying drawings.

The invention mainly comprises the construction of two modules:

Module one: constructing a synthetic lethality knowledge graph for human cancers;

Module two: the human cancer synthetic lethality prediction method.

For module one, a method for constructing a knowledge graph of synthetic lethality in human cancers in the medical field is provided; the overall structure is shown in FIG. 1. The synthetic lethality knowledge graph for human cancers is designed according to requirements. Data is obtained by web crawling, and the corresponding data is processed, extracted, and then stored in a Neo4j database. Each step is described in detail below.

Step 1: Graph design

This is the most critical step in constructing a knowledge graph for the field. Based on an understanding and analysis of medical knowledge about synthetic lethality in human cancer, the invention designs the entity classes of the graph for this field. Of the 11 entities, 7 are directly related to genes, i.e. pathways, cellular components, biological processes, molecular functions, diseases, compounds, and anatomy. They exist in the form (gene, relationship, entity), and each entity class contains multiple entities. Each entity carries attribute information characterizing its intrinsic properties, and relationships are defined to characterize the association between an entity and another entity or an attribute.

Step 2: Acquiring the corresponding data

According to the graph design, crawler technology is used to collect structured data such as the SynLethDB database and bioinformatics websites, semi-structured data, and unstructured data such as medical articles.

Step 3: Processing the raw corpus

The obtained raw corpus is processed by removing stop words, deleting special symbols, deleting repeated words, and so on.
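
The three cleaning operations named in step 3 can be sketched as below. This is only an illustration: whitespace splitting stands in for the jieba segmenter used on the Chinese corpus, and the stop-word list `STOP_WORDS` and the function name `clean_tokens` are hypothetical.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a", "in", "is"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip special symbols, then drop stop words
    and immediately repeated words."""
    # Keep digits, lower-case letters, whitespace, and hyphens; blank out the rest.
    text = re.sub(r"[^0-9a-z\s-]", " ", text.lower())
    cleaned = []
    for token in text.split():
        if token in stop_words:
            continue
        if cleaned and cleaned[-1] == token:
            continue
        cleaned.append(token)
    return cleaned


print(clean_tokens("The loss of BRCA1 is is lethal in PARP1-deficient cells!!"))
# ['loss', 'brca1', 'lethal', 'parp1-deficient', 'cells']
```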

Step 4: Named entity recognition and relation extraction

The data processed in step 3 is handled according to its type: structured data is organized and stored, semi-structured data is extracted manually, and unstructured data is annotated and then extracted with the deep learning model BERT.

Step 5: Building the knowledge graph

The data from step 4 is imported into a Neo4j database using Cypher statements.
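
One way the import in step 5 might look is to turn each (head, relation, tail) triple into a Cypher MERGE statement. The sketch below only builds the statement text and its parameters; it does not connect to a database. The node labels `Gene` and `Entity`, the relationship type `SL_WITH`, and the function name `triple_to_cypher` are hypothetical and are not taken from the source.

```python
def triple_to_cypher(head, relation, tail, head_label="Gene", tail_label="Entity"):
    """Build a parameterized MERGE statement for one (head, relation, tail) triple.

    Labels and relationship types cannot be passed as Cypher parameters,
    so they are validated and interpolated; the values go through $parameters.
    """
    for name in (head_label, tail_label, relation):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {name!r}")
    query = (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[:{relation}]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, params = triple_to_cypher("BRCA1", "SL_WITH", "PARP1", tail_label="Gene")
print(query)
print(params)
```

MERGE is used rather than CREATE so that importing the same triple twice does not duplicate nodes or relationships.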

For module two, a hybrid-network-based knowledge graph method for predicting human cancer lethality is provided, comprising the following steps:

Step 1: extracting data from the database;

The required data set is screened from the SynLethDB database, and relevant question-and-answer websites and forums are crawled using crawler technology and organized into corresponding triples.

Step 2: parsing the gene pairs to obtain a gene-gene matrix;

The screened data is parsed by URI and converted into gene-gene matrix form.
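
The conversion of gene pairs into a gene-gene matrix can be illustrated as follows. The sketch starts from pairs that have already been parsed; the function name `build_gene_matrix` and the placeholder genes `GENE_X` and `GENE_Y` are hypothetical, and the pair list is illustrative rather than taken from SynLethDB.

```python
def build_gene_matrix(sl_pairs):
    """Return (genes, matrix) where matrix[i][j] == 1 if and only if
    genes[i] and genes[j] form a synthetic lethal pair."""
    genes = sorted({gene for pair in sl_pairs for gene in pair})
    index = {gene: i for i, gene in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in genes]
    for a, b in sl_pairs:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # synthetic lethality is a symmetric relation
    return genes, matrix


# Illustrative pairs only; GENE_X and GENE_Y are placeholders.
pairs = [("BRCA1", "PARP1"), ("BRCA2", "PARP1"), ("GENE_X", "GENE_Y")]
genes, matrix = build_gene_matrix(pairs)
print(genes)
for row in matrix:
    print(row)
```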

Step 3: feeding the constructed knowledge graph and gene pairs into the model as input;

The constructed knowledge graph and the gene-gene matrix are used as input. The model framework mainly comprises a graph neural network, a bidirectional long short-term memory network, and an attention mechanism model.

Step 4: after the input is obtained, the neighborhood of each entity is sampled. A fixed number k of neighbors is extracted for each entity to characterize its local structure, and the process is repeated for H hops. If a node has fewer than k neighbors, sampling is performed with replacement.

Step 5: the aggregated information is used as the input of a Bi-LSTM model, which learns the sequence representation of genes and can capture long-distance dependencies and position information in the original text, thereby enriching the feature extraction process.

Step 6: the attention mechanism captures the interactions between genes and thereby provides interpretability. Entity and relation features in the neighborhood of any given entity are captured simultaneously, and relation clustering and multi-hop relations are encapsulated in the model, so that the important local and global neighbors are extracted and the local and global representations of the nodes are learned better; this provides useful insight and completes the prediction.

Step 7: total loss and optimization;

Two losses are designed, the basic loss loss1 and loss2; the basic loss is computed with cross-entropy, and optimization uses the Adam algorithm.
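
The cross-entropy basic loss and an Adam update can be sketched on a toy problem. This is an illustration only: the form of loss2 is not given in the text and is therefore omitted, and the one-parameter logistic model, the data `xs` and `ys`, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def adam_step(theta, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy data: label 1 for positive inputs, 0 for negative inputs.
xs = [1.0, 2.0, -1.0, -2.0]
ys = [1, 1, 0, 0]
w = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
initial_loss = binary_cross_entropy(ys, [sigmoid(w * x) for x in xs])
for _ in range(50):
    probs = [sigmoid(w * x) for x in xs]
    # Gradient of the mean cross-entropy with respect to w.
    grad = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
    w = adam_step(w, grad, state, lr=0.1)
final_loss = binary_cross_entropy(ys, [sigmoid(w * x) for x in xs])
print(initial_loss, final_loss)
```

In the toy run the loss starts at ln 2 (all predictions equal 0.5) and decreases as Adam moves the parameter in the direction that separates the two classes.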

Example 2

As shown in FIG. 1, the knowledge-graph-based study of synthetic lethality prediction for human cancer in the medical field is built up in five steps.

Step 1: designing the knowledge graph for synthetic lethality prediction of human cancer in the medical field;

Step 2: acquiring data for synthetic lethality prediction of human cancer;

Step 3: performing knowledge extraction and fusion on the domain data;

Step 4: constructing the knowledge graph;

Step 5: implementing the knowledge-graph-based synthetic lethality prediction for human cancer.

Each step is described in detail below:

Step 1: Based on research into and analysis of synthetic lethality information for human cancer, and starting from the SynLethDB database, encyclopedia websites, and related bioinformatics websites, the entity types, entity relationships, and entity attributes of the knowledge graph are determined. Of the 24 relationships, 16 are directly related to genes, and 7 of the 11 entities are directly related to genes; these exist in the form (gene, relationship, entity), for example (gene, regulates, gene), (gene, interacts, gene), and (gene, co-varies, gene). The other 8 relationships relate to drugs and compounds. The pathways, cellular components, biological processes, molecular functions, diseases, compounds, and anatomy of each gene are used as attributes to describe the entity, and relationships are established to reflect the association of the entity with other entities.

Step 2: The data obtained by crawling major websites falls into three types: structured data, semi-structured data, and unstructured data.

Step 3: Data in the different storage forms is extracted and fused separately.

Structured data is saved to a list after it is acquired.

For semi-structured data, the page structure of the relevant medical and bioinformatics websites is analyzed with XPath, and the corresponding knowledge is captured from the pages using the Scrapy crawler framework.

Unstructured data, i.e. crawled web articles, biology textbooks, and magazines, consists of large blocks of text, so named entity recognition is needed to extract the required entities. In the present invention, a combined BERT and Bi-LSTM model is used to extract domain-specific entities. The main steps are as follows:

Step one: the collected data is segmented and stop words are removed using the jieba word segmentation tool and a custom dictionary; words that were segmented incorrectly are added to the custom dictionary after segmentation. After an input question has been segmented and its stop words removed with jieba, word vectors are pre-trained using the word2vec tool in gensim, with the vector dimension set to 300 and the window size set to 5.

Step two: pre-training is performed on the constructed corpus. The annotation format is the BIO scheme, in which each element is labeled as one of B-XX, I-XX, or O, where B marks the beginning of an element, XX is the defined element class, I marks the inside of an element, and O (other) marks irrelevant characters.

Step three: the model uses the pre-trained BERT model to generate context-aware word vectors, and the resulting word vectors are used as the input of the Bi-LSTM layer to capture the preceding and following semantic context of each word.

Step four: the extracted entities and relations are linked and fused.
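
The BIO annotation format used in step two can be illustrated with a short sketch that converts entity spans into tags. The class names `COMPOUND` and `DISEASE`, the span format, and the function name `bio_tags` are hypothetical; irrelevant tokens are tagged with a plain O here.

```python
def bio_tags(tokens, entities):
    """Convert entity spans into BIO tags.

    `entities` is a list of (start, end_exclusive, label) token spans.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["PARP", "inhibitor", "treats", "recurrent", "ovarian", "cancer"]
spans = [(0, 2, "COMPOUND"), (4, 6, "DISEASE")]
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(token, tag)
```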

Step 4: The triples are stored in a Neo4j database.

Step 5: As shown in FIG. 2, the knowledge-graph-based synthetic lethality prediction for human cancer is constructed in the following steps:

Step [1]: a graph neural network is introduced on the basis of the knowledge graph, and the independence problem is alleviated by introducing the latent factors directly into the graph as nodes;

Step [2]: a Bi-LSTM is introduced to enrich the multi-hop neighborhood features;

Step [3]: an attention mechanism is introduced to capture the entities and relations in the multi-hop neighborhood;

Step [4]: total loss and optimization;

Step [5]: visual implementation of a cancer synthetic lethality prediction applet.

Step [1]: Given a gene related to synthetic lethality in cancer, a weighted subgraph is constructed from the constructed knowledge graph, the relevant nodes are identified, and the edge weights are determined. A fixed number of neighbors is extracted for each entity to characterize its local structure; a parameter H (analogous to the receptive field of a CNN) is introduced, the process is repeated for H hops, and nodes may be sampled repeatedly. The information is then aggregated and used as the input of the next network; the working principle is shown in the figure.

Model training is described as follows:

1) Neighbor sampling of gene entities: because the neighborhood distribution differs from one gene entity to another, neighborhood sampling is performed on each entity first. In the invention, the two-hop neighborhood of each node is considered. The parameter H can be understood as the receptive field in a CNN: H = 1 means that only the neighbors directly connected to the current node are considered, and H = 2 means that second-order neighbors are considered as well, so that more neighborhood entity information can be learned; H may also take larger values.

For each entity, a fixed number k of neighbors is extracted to characterize its local structure, and the process is repeated for H hops (H ≥ 1). In particular, if a node has fewer than k neighbors, sampling is performed with replacement, i.e. a neighbor may be sampled multiple times. The weight on an edge represents the importance of the relationship.

The weight of edge $r_{a,a'}$ in the subgraph is computed as follows:

where $a_n$ denotes a gene and $r_{a,a'}$ denotes the embedding of the relation.

2) Aggregating neighborhood information: in this framework the GNN is a spatial-domain method. In the constructed knowledge graph, the nodes directly connected to a gene $a$ are denoted $N_{neigh}(a)$. Because the neighborhood distribution differs from node to node, the invention, for ease of computation, borrows the GraphSAGE approach and uses a fixed-size neighborhood $S(a)$. After sampling, the embedded representation of the entity and the embedded representation of the neighborhood information are combined by an aggregation method to obtain the final embedded representation of the current entity. The sum aggregator adds the two representations, the concat aggregator concatenates them, and the neighbor aggregator considers only the neighborhood information and ignores the entity's own embedding.
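
The three aggregation modes described above differ only in how the entity's own embedding and the neighborhood embedding are combined. The sketch below shows that combination step alone; any linear transformation and activation applied afterwards are left out, and the function name `aggregate` and the toy vectors are hypothetical.

```python
def aggregate(self_vec, neigh_vec, mode):
    """Combine an entity embedding with its neighborhood embedding."""
    if mode == "sum":
        # Element-wise addition of the two representations.
        return [s + n for s, n in zip(self_vec, neigh_vec)]
    if mode == "concat":
        # Concatenation; the output is twice as long as the input.
        return list(self_vec) + list(neigh_vec)
    if mode == "neighbor":
        # Neighborhood information only; the entity's own embedding is ignored.
        return list(neigh_vec)
    raise ValueError(f"unknown aggregation mode: {mode!r}")


entity = [1.0, 2.0]
neighborhood = [0.5, -1.0]
for mode in ("sum", "concat", "neighbor"):
    print(mode, aggregate(entity, neighborhood, mode))
```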

(1) For each node in the subgraph, information is aggregated and updated by computing a weighted average sum over its neighbors, according to the following formula:

where a' denotes an entity in the subgraph, Z(a) denotes the set of entities in the subgraph, and

denotes the importance weight of the gene relation.

Q is the gene relation score after normalization with the softmax function, as shown below:

where

denotes the normalized gene relation score.

(2) After the neighborhood representation of the central node is obtained, the information is aggregated and updated according to the following formula:

where Q denotes the weight of the linear transformation layer, g denotes the bias of the linear transformation layer,

denotes the activation function, A denotes the representation of the entity, h+1 indexes the updated representation of the entity, a [ Ag ] Ag] denotes the weight after the linear transformation, and $A_{Z(a)}$ denotes the computed weighted average sum over the nodes.
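The two aggregation steps above can be sketched in plain Python. Several details are assumptions made for illustration, because the formulas are not reproduced in the text: the raw edge score is taken as the dot product of the gene embedding and the relation embedding, the activation is tanh, and all function names are hypothetical. Only the sum aggregator is shown.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neighborhood_vector(gene_vec, neighbor_vecs, relation_vecs):
    """Softmax-weighted sum of the sampled neighbor embeddings."""
    # Assumption: raw importance of an edge = dot(gene embedding, relation embedding).
    weights = softmax([dot(gene_vec, r) for r in relation_vecs])
    return [sum(w * nv[i] for w, nv in zip(weights, neighbor_vecs))
            for i in range(len(gene_vec))]

def sum_aggregate(self_vec, neigh_vec, weight, bias):
    """Sum aggregator: activation(weight @ (self + neighborhood) + bias)."""
    summed = [s + n for s, n in zip(self_vec, neigh_vec)]
    return [math.tanh(dot(row, summed) + b) for row, b in zip(weight, bias)]

gene = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0]]
relations = [[0.0, 0.0], [0.0, 0.0]]  # equal raw scores give uniform weights
neigh = neighborhood_vector(gene, neighbors, relations)
updated = sum_aggregate(gene, neigh, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(neigh)  # [0.5, 0.5]
```

Here `weight` and `bias` play the roles of the linear-layer weight Q and bias g in the text. A concat aggregator would concatenate `self_vec` and `neigh_vec` instead of adding them, which doubles the input width of the linear layer.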

(3) After the representations of the two genes are obtained, the probability of an interaction between them is calculated by the following formula:

where f(·) denotes the gene representation function, $a_m$ denotes gene m, and $a_n$ denotes gene n.
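One common way to turn two gene representations into a probability is a sigmoid over their inner product. The sketch below assumes that form, since the formula itself is not reproduced in the text; `interaction_probability` is a hypothetical name.

```python
import math

def interaction_probability(a_m, a_n):
    """Assumed scoring function: sigmoid of the inner product of two gene vectors."""
    score = sum(x * y for x, y in zip(a_m, a_n))
    return 1.0 / (1.0 + math.exp(-score))

print(interaction_probability([0.0, 0.0], [1.0, 1.0]))  # 0.5
```

Orthogonal or zero representations give a probability of 0.5, and the probability rises as the two representations align.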

Step [2]: when entity recognition is performed on the information aggregated by the previous layer, a Bi-LSTM model is used to learn the gene sequence representation, which enriches the feature extraction process. The model is shown in Fig. 8, and model training is described as follows:

The bidirectional LSTM model, formed by combining a forward LSTM and a backward LSTM, effectively addresses the long-term dependency problem and captures bidirectional semantic information better. The hidden layer of the LSTM model consists of a forget gate $f_t$, a memory gate $i_t$ and an output gate $o_t$: $f_t$ determines the information to be forgotten, and $i_t$ determines the proportion of information to be memorized. The Bi-LSTM consists of two LSTM layers running in opposite directions. After the computation, their respective outputs are concatenated, and the concatenated result is used as the input of the attention mechanism model in the next layer.
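The gate mechanics and the bidirectional concatenation can be illustrated with a scalar LSTM in plain Python. This is a teaching sketch under simplifying assumptions (one-dimensional inputs and states, hypothetical parameter names), not the model configuration used in the invention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input and state."""
    f_t = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])  # forget gate
    i_t = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])  # memory (input) gate
    o_t = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])  # output gate
    c_hat = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat  # forget part of the old state, store part of the new
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

def run_lstm(seq, p):
    h, c, outputs = 0.0, 0.0, []
    for x in seq:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs

def bi_lstm(seq, p_fwd, p_bwd):
    """Pair forward and backward hidden states position by position."""
    fwd = run_lstm(seq, p_fwd)
    bwd = run_lstm(seq[::-1], p_bwd)[::-1]  # re-align backward states with positions
    return list(zip(fwd, bwd))

names = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wc", "uc", "bc"]
params = {name: 0.5 for name in names}  # arbitrary illustrative values
states = bi_lstm([0.2, -0.1, 0.4], params, params)
print(len(states), len(states[0]))  # 3 2
```

Each position ends up with one state that has seen the sequence from the left and one that has seen it from the right, which is the concatenated output passed on to the attention layer.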

Step [3]: the output of the previous layer is used as the input of the attention mechanism model. A node-level attention mechanism is designed for each feature graph to capture the importance of local and global neighbors and to learn the local and global representations of the nodes, respectively. The original features are then aggregated with the local and global representations by a multi-layer perceptron to obtain a feature-specific representation. To obtain the final representation, a feature-level attention mechanism integrates the feature-specific representations according to the importance of the different feature graphs. The network then integrates the node representations. The working principle is shown in Fig. 10.

(1) For a node, the nodes directly connected to it in the graph are defined as its local neighbors. Because neighbors differ in importance, a node-level attention mechanism is designed to learn node representations. First, the attention score is calculated as follows:

where

denotes the importance of neighbor $v_m$ to $v_n$, i.e. the attention score; $V_m$ denotes the n-dimensional feature vector of gene m; $V_n$ denotes the n-dimensional feature vector of gene n; $f(\cdot)$ denotes a single-layer feedforward neural network; W denotes a learnable weight matrix;

denotes the weight matrix of $v_m$; and

denotes the weight matrix of $v_n$.

(2) The attention score is then normalized as follows:

where

denotes the attention score obtained after normalization, x denotes the selected node representation,

denotes the global neighbor set of the node, and

denotes the calculated attention score.

(3) At the same time, the representation of node $v_i$ aggregated from local neighbor information is expressed as follows:

where

denotes the global representation of the node,

denotes the global neighbor set of the node,

denotes the weight shared by local and global neighbors, and _n denotes the representation of a local neighbor.

(4) The attention mechanism of a single node may introduce noise because the attention coefficients are unstable. To reduce the noise, the attention mechanism is extended to multi-head attention: node attention is repeated y times and the y learned representations are then integrated (|| denotes vector concatenation). The formula is as follows:

where

denotes the global neighbor set of the node, and

denotes the weight obtained after applying multi-head attention, i.e. repeating the node attention multiple times.
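Sub-steps (1) to (4) can be sketched together in plain Python. The scoring form used here, a LeakyReLU over a learned vector applied to the concatenated projections (in the style of graph attention networks), is an assumption, as are the tanh activation and all function and parameter names. The formulas themselves are not reproduced in the text.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_weights(v_n, neighbors, W, a):
    """Softmax-normalized attention of each neighbor v_m towards node v_n."""
    h_n = matvec(W, v_n)
    # Assumed score: single-layer feedforward net `a` over [W v_n || W v_m].
    raw = [leaky_relu(sum(ai * xi for ai, xi in zip(a, h_n + matvec(W, v_m))))
           for v_m in neighbors]
    top = max(raw)
    exps = [math.exp(r - top) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

def attend(v_n, neighbors, W, a):
    """One attention head: weighted sum of projected neighbors, then activation."""
    alphas = attention_weights(v_n, neighbors, W, a)
    projected = [matvec(W, v_m) for v_m in neighbors]
    return [math.tanh(sum(al * p[i] for al, p in zip(alphas, projected)))
            for i in range(len(W))]

def multi_head(v_n, neighbors, heads):
    """Repeat node attention once per head and concatenate the results."""
    out = []
    for W, a in heads:
        out.extend(attend(v_n, neighbors, W, a))
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, [0.0, 0.0, 0.0, 0.0]), (identity, [1.0, 0.0, 0.0, 0.0])]
v_n, neighbors = [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
rep = multi_head(v_n, neighbors, heads)
print(len(rep))  # 4
```

With y heads of output width d, the concatenated representation has width y·d. Averaging the heads instead of concatenating them is a common alternative when the width must stay fixed.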

Step [4]: two types of loss are designed, loss1 and loss2, where loss1 is the basic loss.

(1) The basic loss is calculated using cross-entropy, as follows:

$$J = \min(s_{m,n}, 0) - s_{m,n} \cdot \hat{s}_{m,n} + \log\left(b + \exp(-|s_{m,n}|)\right) \qquad (10)$$

where $s_{m,n}$ is the predicted value, $\hat{s}_{m,n}$ is the true value, and b is a constant.

(2) $\|\cdot\|$ denotes the L2 regularization applied to the entity embeddings, relation embeddings and aggregation weights:

(3) An L2 regularization loss is also added, as shown below:

where K denotes a trainable weight matrix, $b^l$ denotes the scoring weight of the gene relation, $\sum_{m,n} J$ denotes the regularization over the relation embeddings and aggregation weights, and α denotes a balancing hyperparameter. The loss is optimized with the Adam optimization algorithm.
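For orientation, the sketch below combines a cross-entropy term with an L2 penalty. It uses the widely known numerically stable form of sigmoid cross-entropy, max(s, 0) − s·y + log(1 + exp(−|s|)), which is an assumption here and differs from formula (10) as printed (max instead of min, and the constant fixed to 1). The way the penalty is combined through α is likewise assumed, and all names are hypothetical.

```python
import math

def basic_loss(logit, label):
    """Numerically stable sigmoid cross-entropy for one gene pair (assumed form)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def l2_penalty(param_groups):
    """Sum of squared values over all parameter groups (embeddings, weights)."""
    return sum(x * x for group in param_groups for x in group)

def total_loss(logits, labels, param_groups, alpha):
    """Cross-entropy summed over gene pairs plus alpha times the L2 penalty."""
    base = sum(basic_loss(s, y) for s, y in zip(logits, labels))
    return base + alpha * l2_penalty(param_groups)

loss = total_loss([0.0], [1.0], [[1.0, 2.0]], alpha=0.1)
print(round(loss, 4))  # 1.1931
```

The stable form avoids overflow for large |s| while matching the textbook expression −[y·log σ(s) + (1 − y)·log(1 − σ(s))]. In practice the minimization would be handed to an optimizer such as Adam rather than computed by hand.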

Step [5]: implementation of a mini program for cancer synthetic lethality prediction

The mini program pages are designed first, and the code is then written and developed in the editing area. Once development reaches the point where debugging is needed, the developer switches to the debugging area, which provides a number of tools. After debugging, the project can be compiled, previewed and uploaded, and the network environment can be simulated (2G/3G/4G/WiFi). The program interface is shown in Figs. 12, 13 and 14. For example, given the input "tumors with BRCA mutations or certain HRD (homologous recombination repair deficiency) mutations are sensitive to PARP inhibitors", the output is "so BRCA and PARP become a pair of synthetic lethal targets".

It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims

1. A method for constructing a prediction knowledge map for synthetic lethality of human cancers is characterized by comprising the following steps:

step 1: carrying out map design according to the category of the entity;

step 2: acquiring corresponding medical data according to atlas design;

step 3: performing word segmentation and part-of-speech tagging on the original data, and removing punctuation marks and stop words;

step 5: constructing the knowledge graph.

2. The method of constructing a predictive knowledge map of synthetic lethality for human cancer according to claim 1,

for step 1, the data set comprises 72804 gene relation pairs among 10004 genes; the KG is denoted SynLethKG and comprises 24 relations among 11 kinds of entities; of the 24 relations, 16 are directly related to genes, and 7 of the 11 kinds of entities are directly related to genes; the required information is screened from the SynLethKG database;

for step 2, according to the graph design, the required structured, semi-structured and unstructured data are acquired from the SynLethDB database by a web crawler.

3. The method of constructing a synthetic lethality prediction knowledge map of human cancer according to claim 1,

for step 3, performing word segmentation and part-of-speech tagging on the data with the jieba word segmentation tool, and removing punctuation marks and stop words;

for step 4, storing the obtained semi-structured data in an integrated manner, performing entity recognition and relation extraction on the unstructured data by a deep learning method, and then performing knowledge fusion on the obtained data;

for step 5, storing the data organized in step 4 using Neo4j.

4. A knowledge graph human cancer lethality prediction method based on a mixed network is characterized by comprising the following steps:

step 1: extracting data in a database;

step 2: analyzing the gene pairs to obtain a gene-gene matrix;

step 3: passing the constructed knowledge graph and the obtained gene-gene pairs into a model as input;

step 7: calculating the total loss of the model and optimizing.

5. The mixed network-based knowledge graph human cancer lethality prediction method of claim 4, wherein for step 1, SynLethDB is a comprehensive database of synthetic lethal gene pairs, and after isolated nodes are removed, the resulting SynLethKG graph contains 54012 nodes and 2231921 edges; the required data set is screened out from the database; and if the corresponding triples are not found, relevant question-and-answer websites and forums are crawled by a web crawler.

6. The mixed network-based knowledge graph human cancer lethality prediction method of claim 4, wherein for step 2, the screened data is converted into gene matrix form by URI resolution; given an SL-related gene, a weighted subgraph is constructed from the KG, for which identifying the relevant nodes and determining the weights of the edges are the two key steps.

7. The method of claim 4, wherein for step 3, the knowledge-graph and gene-gene matrix constructed in step 1 are used as inputs, and the model framework comprises a graph neural network, a bidirectional long-short term memory neural network, and an attention mechanism model.

8. The mixed network-based knowledge graph human cancer lethality prediction method of claim 4, wherein for step 4, after the input is obtained, the neighborhood of each entity is sampled; a fixed number k of neighbors is extracted for each entity to characterize its local structure, and the process is repeated for H hops (H ≥ 1); if a node has fewer than k neighbors, sampling is done with repetition.

9. The mixed network-based knowledge graph human cancer lethality prediction method of claim 4, wherein for step 5, in the Bi-LSTM recurrent neural network, the Bi-LSTM model is used to mine the sequential features of genes; before the Bi-LSTM model is applied, the interacting synthetic lethal genes are processed into a sequence; the long-distance dependencies and position information of the original text are captured; a state is extracted for each lethal gene by the Bi-LSTM; and finally the states of the lethal genes are combined for prediction.

10. The mixed network-based knowledge graph human cancer lethality prediction method of claim 4, wherein, for step 6, after the obtained word vectors are fed one by one into the Bi-LSTM neural network model, a series of encoder hidden states is generated and takes part in the calculation of the attention coefficients; in each round of training, the output state of the decoder also takes part in the calculation of the attention coefficients, and the decoder state and the hidden states are combined by weighted summation to obtain the final probability distribution; a feature method based on a layer attention mechanism captures both entity and relation features in the neighborhood of any given entity, encapsulating relation clusters and multi-hop relations and providing insight into the effectiveness of the attention-based model;

for step 7, the attention mechanism model is augmented with the basic loss1 and loss2, where the basic loss is calculated using cross-entropy and optimization uses the Adam optimization algorithm.