CN112417166B

CN112417166B - Knowledge graph triple confidence evaluation method

Info

Publication number: CN112417166B
Application number: CN202011309998.5A
Authority: CN
Inventors: 杨帅; 王小红; 赵志刚; 窦方坤; 曹皓伟; 潘景山; 魏志强
Original assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Current assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2022-08-26
Anticipated expiration: 2040-11-20
Also published as: CN112417166A

Abstract

The invention discloses a knowledge graph triple confidence evaluation method comprising an evaluation stage, a fusion stage and a verification stage. The evaluation stage comprises: a) entity-level evaluation from a-1) the data source perspective, a-2) the document co-occurrence perspective, a-3) the external link scale perspective, a-4) the text description perspective, a-5) the entity importance perspective and a-6) the entity degree perspective; b) relation-level evaluation from b-1) the data source perspective and b-2) the document co-occurrence perspective, together with b-3) evaluation of known relations between entities and b-4) evaluation of unknown relations between entities; and c) evaluation at the global level of the knowledge graph. The method can discover errors in knowledge graph data efficiently, quickly and at scale, thereby improving the data quality of the whole knowledge graph system; it can also be used to check the reliability of the results of machine learning tasks such as link prediction and relation inference.

Description

Knowledge graph triple confidence evaluation method

Technical Field

The invention relates to a knowledge graph triple confidence evaluation method, in particular to a knowledge graph triple confidence evaluation method comprising an evaluation stage, a fusion stage and a verification stage.

Background

With different targets and drugs as entities and the interactions between targets and drugs as relations, related knowledge is stored on the entities and relations in the form of attributes; these interweave into a very large graph that supports querying, reasoning, intelligent analysis and similar functions, and is called a Drug-Target Knowledge Graph (DT KG). Because a DT KG can effectively reveal the complex physical and biochemical interactions between drugs and targets, uncover implicit drug-target relations that have not yet been discovered, and thereby help discover new drugs or new uses for existing drugs, it is an important direction of knowledge graph research in the biomedical field.

Errors are inevitable when a knowledge graph is constructed. To find errors in the knowledge graph and improve its quality, and in turn improve the performance of knowledge-driven learning tasks, academia has introduced the concept of knowledge graph triple confidence. Knowledge graph triple confidence (KG triple trustworthiness) measures how true the knowledge expressed by a triple is. It takes values in the range [0,1]: the closer the value is to 0, the more likely the triple is wrong, and the closer it is to 1, the more likely the triple is true.

Existing knowledge graph triple confidence evaluation methods fall into three types, classified by the stage at which they apply, as shown by 1, 2 and 3 in FIG. 1. The first type is used while triples are extracted from text data; a typical case is the KnowLife knowledge base of the Max Planck Institute for Informatics, Germany. The second type is used during embedding, which aims to encode all entities and relations into a continuous vector space. Confidence evaluation and the removal of data noise during embedding have been a research focus in recent years; typical methods include SCEF (a support-confidence-aware KG embedding framework), CKRL (a confidence-aware knowledge representation learning framework) and TransT (translation-based embedding learning with triple trustiness). The third type evaluates triples directly; it can measure the reliability of triples obtained by knowledge inference and is also suitable for confidence evaluation of dynamic knowledge bases. Typical methods are KGTtm (a knowledge graph triple trustworthiness measurement model) and CTransE (knowledge graph embedding using an adaptive confidence-margin-based loss function for translation-based models).

The existing knowledge graph triplet confidence evaluation method is shown in table 1, and 7 methods are listed:

TABLE 1

Method	Application stage	Year
KnowLife	Extracting entities and relations from text	2015
SCEF	Embedding	2019
KGTtm	Triples	2019
TransT	Embedding	2019
CKRL	Embedding	2018
ConfGCN	Node attribute prediction	2019
CTransE	Embedding	2019

(1) KnowLife implements a general, extensible method for automatically constructing a biomedical knowledge base. It automatically extracts information from scientific publications, health portals and online community resources, and introduces confidence evaluation rules into the extraction process to quantify the reliability of the extracted entity and relation data, thereby improving the quality of the biomedical knowledge base.

(2) SCEF is a knowledge graph embedding framework that supports confidence awareness. Building on the traditional translation model, it incorporates confidence into the energy function and improves and corrects the knowledge graph through knowledge representation learning with triple confidence (text, knowledge graph and triples).

(3) KGTtm is a measurement model of knowledge graph triple confidence that quantifies the semantic correctness of triples and the truth of the facts they express at the entity level, the relation level and the knowledge graph global level.

(4) TransT is a model that calculates triple confidence from information such as entity types and entity descriptions, and optimizes the model with a cross-entropy-based loss function to improve the performance of knowledge embedding learning.

(5) CKRL is a confidence-based knowledge representation learning framework. It introduces a notion of confidence based on structural information and builds its energy function from the vector information of a triple's entities, its relation and the paths between the entities, improving both knowledge representation learning and noise detection in knowledge graphs.

(6) The ConfGCN model is used for node attribute prediction tasks; it estimates the label scores of the nodes in a graph together with the confidence of those scores.

(7) CTransE is a translation-based model for handling the errors introduced when a knowledge graph is updated automatically; it uses a confidence-based loss function to learn embedded representations of a dynamic knowledge graph.

However, existing knowledge graph triple confidence evaluation methods have the following defects:

1. The factors considered are not comprehensive, so the confidence score is unreliable. Existing methods consider confidence factors at the knowledge graph global level, the entity level and the relation level, but do not take scientific literature and data sources into account, so the resulting confidence score is unreliable.

2. The computational complexity is high and the interpretability is poor. Existing methods evaluate triple confidence with machine learning models (for example, KGTtm evaluates confidence at the knowledge graph global level with an RNN, and SemaTyP evaluates confidence by building a logistic regression model); such models are computationally expensive and hard to interpret.

3. Confidence evaluation is limited to the embedding process. Most existing methods apply to the embedding process and cannot directly evaluate the quality of triples produced by knowledge inference or by automated construction methods.

Disclosure of Invention

To overcome the above technical defects, the invention provides a knowledge graph triple confidence evaluation method.

The knowledge graph triple confidence evaluation method comprises an evaluation stage, a fusion stage and a verification stage, and is characterized in that: the evaluation phase is realized by the following steps:

a) entity level assessment;

a-1) evaluating the entity from the data source perspective; the entities to be evaluated comprise 11 types in total: compounds, diseases, proteins, genes, pathways, cell lines, drugs, products, targets, enzymes and protein-compounds; the data source confidence N_r of each entity refers to the LOD scoring in The Linked Open Data Cloud, and the PubChem, RCSB PDB, DrugBank and DTO ontology data sources, which have not been scored by LOD, are given scores of 5 stars, 5 stars and 4 stars respectively; the value of the data source confidence N_r of an entity equals the number of stars of its LOD score, and if the same entity appears in 2 or more data sources, N_r takes the highest score;

a-2) evaluating the entity from the document co-occurrence perspective; documents related to the entity are queried in a document library, and the document co-occurrence confidence LCA of the entity is obtained by formula (1):

wherein LCA represents the document co-occurrence confidence of the entity, N represents the number of documents related to the entity, F represents the impact factor of a document, L is the citation count of a document, T is the score corresponding to the document category, i denotes the i-th document, and α, β and θ represent weights;

a-3) evaluating the entity from the external link scale perspective; the external link scale confidence N_L of an entity is represented by the number of external links of the entity in the biomedical knowledge graph; the larger the external link scale of an entity, the more reliable the entity data, so the credibility of the entity is measured by its number of external links, and N_L equals the number of external links of the entity;

a-4) evaluating the entity from the text description perspective; the text description of an entity describes its concept, category and functional information, and an entity with a text description has higher data reliability; if a text description of the entity exists in the data sources of step a-1), the text description confidence D of the entity is 1, and otherwise D is 0;

a-5) evaluating the entity from the entity importance perspective; in the biomedical knowledge graph, the number and quality of the entity nodes linked to a node directly determine the importance of that node in the whole graph; the PageRank algorithm is used to measure the importance of an entity in the knowledge graph, and the result represents the entity importance confidence; the PageRank algorithm is shown in formula (2):

wherein P_1, P_2, …, P_i, …, P_n denote the nodes in the knowledge graph; the formula uses the in-degree of the node P_j under consideration, the number n of nodes in the knowledge graph, and the PageRank value of the node P_j; the PageRank values of all nodes form the PageRank vector of the knowledge graph; and q denotes the probability of continuing to expand from a node in the knowledge graph and is set to 0.5;

a-6) evaluating the entity from the entity degree perspective; the in-degree and out-degree of an entity node reflect how rich the entity's information is in the knowledge graph and how strongly the entity is associated with other entities; the entity degree confidence N_s is calculated by formula (3):

N_s = N_in + N_out    (3)

wherein N_s represents the entity degree confidence, N_in represents the in-degree of the entity node, and N_out represents the out-degree of the entity node;
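As an illustration only, the graph-based entity-level evaluators of steps a-1), a-5) and a-6) can be sketched as follows. The sketch assumes the knowledge graph is given as a list of directed (source, target) edges, uses the textbook form of PageRank with continuation probability q = 0.5 (formula (2) itself is not reproduced in this text, so its exact form is an assumption), and spreads the rank of nodes without outgoing edges evenly over all nodes, which is a common convention rather than something stated in the source. The function names and the star-rating dictionary passed in by the caller are hypothetical.

```python
from collections import defaultdict


def data_source_confidence(sources, star_rating):
    """N_r: highest star rating among the data sources that contain the entity."""
    return max(star_rating[source] for source in sources)


def degree_confidence(edges, node):
    """N_s = N_in + N_out for one entity node, as in formula (3)."""
    n_in = sum(1 for _, dst in edges if dst == node)
    n_out = sum(1 for src, _ in edges if src == node)
    return n_in + n_out


def pagerank(edges, q=0.5, iterations=100):
    """Textbook PageRank; q is the probability of continuing along a link."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same share of the "random restart" mass.
        new_rank = {node: (1.0 - q) / n for node in nodes}
        for src in nodes:
            # Assumption: nodes without outgoing edges spread rank evenly.
            targets = out_links[src] or nodes
            share = q * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank
```

For example, with the edges A→B, B→C, C→A and A→C, node A has in-degree 1 and out-degree 2, so its degree confidence is 3, and node C ends up with a higher PageRank value than node B because it is linked from both A and B.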

b) relation-level evaluation;

b-1) evaluating the relation level from the data source perspective; a relation between entities in the biomedical knowledge graph is generally represented by a triple (h, r, t), where h is the head entity, t is the tail entity, and r is the relation between them; if the triple data comes from a high-quality data source, the association between the two entities is strong and the confidence of the triple information is high; the relation-level data source confidence N'_r refers to the LOD scoring in The Linked Open Data Cloud, and the PubChem, RCSB PDB, DrugBank and DTO ontology data sources, which have not been scored by LOD, are given scores of 5 stars, 5 stars and 4 stars respectively; the value of N'_r equals the number of stars of the LOD score, and if the same entity appears in 2 or more data sources, N'_r takes the highest score;

b-2) evaluating the relation level from the document co-occurrence perspective; documents related to the entity pair (h, t) are queried in a document library, and the document co-occurrence confidence LCA' of the entity pair (h, t) is obtained by formula (4):

wherein LCA' represents the document co-occurrence confidence of the entity pair (h, t), N' represents the number of documents related to the entity pair (h, t), F represents the impact factor of a document, L represents the citation count of a document, T represents the score corresponding to the document category, i denotes the i-th document, and α, β and θ represent weights;

b-3) evaluating known relations between entities; an entity relation established during the construction of the biomedical knowledge graph is called a known relation, and the ResourceRank algorithm is used to measure it, yielding the known-relation confidence ResourceRank;

b-4) evaluating unknown relations between entities; an entity relation that does not exist in the existing knowledge graph and must be obtained by inference is called an unknown relation; the KSP algorithm is used to measure the confidence of an unknown relation, evaluating the relation strength through the number of the first K shortest paths between the two entities in the graph, yielding the unknown-relation confidence KSP;
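The K-shortest-paths idea in step b-4) can be illustrated with a small sketch. The text does not specify which KSP algorithm is used or how the path count is turned into a confidence, so both choices below are assumptions: paths are enumerated best-first by hop count (adequate for small graphs, not an optimized algorithm such as Yen's), and the hypothetical `ksp_confidence` simply reports the fraction of the K requested paths that actually exist.

```python
import heapq


def k_shortest_paths(adjacency, source, target, k):
    """Return up to k loop-free paths from source to target, shortest first."""
    found = []
    queue = [(0, [source])]  # (hop count, partial path)
    while queue and len(found) < k:
        length, path = heapq.heappop(queue)
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                heapq.heappush(queue, (length + 1, path + [neighbor]))
    return found


def ksp_confidence(adjacency, head, tail, k):
    """Assumed mapping: share of the k requested paths that were found."""
    return len(k_shortest_paths(adjacency, head, tail, k)) / k
```

With the adjacency `{"h": ["a", "b", "t"], "a": ["t"], "b": ["a"]}` there are three loop-free paths from h to t, so requesting K = 5 yields a value of 0.6 under this assumed mapping.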

c) knowledge graph global-level evaluation;

the global level of the knowledge graph is evaluated by N_total/M, which measures the information density of the knowledge graph at the global level and thereby evaluates the credibility of the data contained in the whole knowledge graph; wherein N_total is the total degree of all entity nodes of the knowledge graph, i.e., the sum of the in-degrees and out-degrees of all entity nodes, and M is the total number of entity nodes in the knowledge graph.
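A minimal sketch of the global-level measure N_total/M, assuming the graph is supplied as a set of entity nodes plus a list of directed edges (both representations are illustrative choices, not part of the source). Because every directed edge contributes one out-degree and one in-degree, N_total is twice the number of edges.

```python
def global_density(nodes, edges):
    """Information density N_total / M of a directed knowledge graph."""
    # Each directed edge adds 1 to an out-degree and 1 to an in-degree.
    n_total = 2 * len(edges)
    return n_total / len(nodes)
```

For a graph with four entity nodes and three edges, the density is 6 / 4 = 1.5.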

According to the knowledge graph triple confidence evaluation method, the fusion stage is realized as follows: taking into account the data quality of the biomedical knowledge graph and the requirements of the drug-target relation prediction task, the triple confidence value of the biomedical knowledge graph is obtained by formula (5):

wherein Confidence represents the triple confidence value, which is a positive number; the larger the value, the higher the confidence; Confidence is obtained by weighting the 11 confidence evaluators of the entity level, the relation level and the knowledge graph global level, and the value is finally normalized to the interval [0,1]; in a given knowledge graph, if the confidence value is less than the threshold of 0.6, the data of the triple is considered unreliable; N'_r represents the data source confidence at the relation level.
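The fusion step can be sketched as a weighted sum followed by normalization and thresholding. Formula (5) and its weights are not reproduced in this text, so the evaluator names, the weights and the min-max normalization over a batch of triples are all assumptions made for illustration; only the 0.6 threshold comes from the description.

```python
THRESHOLD = 0.6  # triples below this normalized confidence are treated as unreliable


def fuse(scores, weights):
    """Weighted sum of the evaluator scores of one triple."""
    return sum(weights[name] * value for name, value in scores.items())


def normalize(raw_values):
    """Min-max scale raw confidence values to [0, 1] (assumed scheme)."""
    low, high = min(raw_values), max(raw_values)
    if high == low:
        return [1.0 for _ in raw_values]
    return [(value - low) / (high - low) for value in raw_values]


def flag_unreliable(triples, raw_values):
    """Return the triples whose normalized confidence is below the threshold."""
    normalized = normalize(raw_values)
    return [t for t, c in zip(triples, normalized) if c < THRESHOLD]
```

For raw values 2.0, 5.0 and 8.0, the normalized confidences are 0.0, 0.5 and 1.0, so the first two triples would be flagged as unreliable.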

According to the knowledge graph triple confidence evaluation method, the verification stage evaluates whether the final confidence value of a knowledge graph triple is reasonable, so that the design of the evaluator and the fuser can be optimized; the checker comprises two methods, expert sampling verification and automatic verification;

expert sampling verification: manual verification is carried out by experts in the medical field; the scope of expert verification is triples whose confidence score lies in the range [0.9,1] and which contain existing drugs or hot targets; the expert verification method is: investigate the drugs and targets involved in the triple, and verify from professional knowledge and experience whether the triple data with a high confidence value is reliable;

automatic verification: the confidence value of a triple is verified by means of molecular docking; the scope of automatic verification is triples whose confidence value lies in the range [0.6,0.9], of which 10% are randomly sampled; the automatic verification method is: perform molecular docking calculations on the drug-target data involved in the triple using the LibDock and GOLD scoring functions in the Discovery Studio 2018 Client, and judge from the final score whether the confidence value is reliable;

the results of the verification stage are fed back to the evaluation stage and the fusion stage; the causes of data whose verification results are strongly negatively correlated with the confidence values are investigated in depth, and the weight of each method in the fusion stage is adjusted, thereby refining the whole knowledge graph triple confidence evaluation method.
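The selection of triples for the two verification routes can be sketched as follows. This is only an illustration: the two ranges share the boundary value 0.9, and the sketch assumes it belongs to the expert range; the additional expert-route condition that the triple contains an existing drug or hot target is omitted; and taking at least one sample from a non-empty candidate set, as well as the fixed random seed, are assumptions.

```python
import random


def select_for_verification(confidences, sample_rate=0.10, seed=0):
    """Split triples into the expert route and a 10% automatic-route sample."""
    # Expert route: confidence in [0.9, 1].
    expert = sorted(t for t, c in confidences.items() if 0.9 <= c <= 1.0)
    # Automatic route: confidence in [0.6, 0.9), randomly sampled.
    candidates = sorted(t for t, c in confidences.items() if 0.6 <= c < 0.9)
    sample_size = 0
    if candidates:
        sample_size = max(1, round(len(candidates) * sample_rate))
    automatic = random.Random(seed).sample(candidates, sample_size)
    return expert, automatic
```

The triples returned in `automatic` would then be passed to the molecular docking step, and the outcome compared against their confidence values.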

According to the knowledge graph triple confidence evaluation method, the document library in step a-2) and step b-2) comprises CAS, Patent, PubMed, Wikipedia and DOI, and the values of α, β and θ are 0.7, 0.2 and 0.1 respectively; the scores T for different document categories are shown in Table 1:

TABLE 1

Document category	Score
CAS	1.0
Patent	0.8
PubMed	1.0
Wikipedia	0.5
DOI	1.0
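To make the role of the weights and category scores concrete, the following sketch computes a document co-occurrence confidence from a list of documents. Formulas (1) and (4) are not reproduced in this text, so the sketch assumes, purely for illustration, that each document contributes α·F_i + β·L_i + θ·T_i and that the contributions are summed over the N documents; the actual formula may differ, for example by normalizing the terms. The dictionary keys used for each document are hypothetical.

```python
# Category scores T and weights taken from the description above.
CATEGORY_SCORE = {"CAS": 1.0, "Patent": 0.8, "PubMed": 1.0, "Wikipedia": 0.5, "DOI": 1.0}
ALPHA, BETA, THETA = 0.7, 0.2, 0.1


def document_cooccurrence(documents):
    """Assumed form: sum over documents of alpha*F + beta*L + theta*T."""
    return sum(
        ALPHA * doc["impact_factor"]
        + BETA * doc["citations"]
        + THETA * CATEGORY_SCORE[doc["category"]]
        for doc in documents
    )
```

Under this assumption, a PubMed article with impact factor 2.0 and 10 citations contributes 1.4 + 2.0 + 0.1 = 3.5, and a Wikipedia page with no impact factor or citations contributes 0.05.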

According to the knowledge graph triple confidence evaluation method, in the relation-level evaluation of known relations in step b-3), the confidence of a known relation is measured by the ResourceRank algorithm; the ResourceRank algorithm describes the association strength between two entities, and its idea is as follows: if the association between the entity pair (h, t) is strong, a large amount of resources will flow from the head entity h to the tail entity t through all the association paths; it is realized by the following steps:

b-3-1) constructing a directed graph centered on the head entity h;

b-3-2) iteratively calculating the resources in the graph by formula (6) until the resources converge, and calculating the resource retention value R(t | h) of the tail entity t;

wherein M_t is the set of all nodes that lead to the tail node t, OD(e_i) is the out-degree of node e_i, and the bandwidth from node e_i to the tail node t is the number of paths from e_i to t; for each node e_i in M_t, formula (6) gives the amount of resources transferred from e_i to the tail node t; the resource flow of every node is assumed to have the same probability η of jumping directly to a random node, so the share of resources that flows to the tail node t at random is 1/N, where N is the total number of nodes;

b-3-3) constructing a feature vector V from 6 features in total, namely R(t | h) from step b-3-2), the in-degree ID(h) of the head node h, the out-degree OD(h) of the head node h, the in-degree ID(t) of the tail node t, the out-degree OD(t) of the tail node t, and the depth Dep from the head node to the tail node; V is converted into a probability value RR(h, t) by an activation function; RR(h, t) is the confidence ResourceRank, which measures the likelihood that one or more relations exist between the head node h and the tail node t, and is calculated by formula (7):

wherein φ is a non-linear activation function, W_i and b_i are parameter matrices that can be adjusted during training, and RR(h, t) takes values in [0,1]; the closer its value is to 1, the more likely it is that a relation exists between h and t.
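The steps above can be sketched as follows. Formulas (6) and (7) are not reproduced in this text, so the sketch assumes the update R(t | h) = (1 − η) · Σ over e_i in M_t of R(e_i | h) · BW(e_i, t) / OD(e_i) + η / N, which matches the quantities named in the description, and a single-layer sigmoid for the activation in step b-3-3). The value η = 0.15, the iteration count and all function names are hypothetical. The input edge list is taken to be the directed graph already built around the head entity in step b-3-1); repeated edges stand for parallel paths (bandwidth).

```python
import math
from collections import Counter


def resource_rank(edges, head, eta=0.15, iterations=50):
    """Iterate the assumed resource-flow update on an h-centered directed graph."""
    nodes = sorted({n for edge in edges for n in edge})
    n = len(nodes)
    bandwidth = Counter(edges)  # BW(src, dst): number of parallel edges
    out_degree = Counter(src for src, _ in edges)  # OD(src)
    # All resources start at the head entity.
    resource = {node: (1.0 if node == head else 0.0) for node in nodes}
    for _ in range(iterations):
        # Each node receives the random-jump share eta / N.
        updated = {node: eta / n for node in nodes}
        for (src, dst), bw in bandwidth.items():
            updated[dst] += (1.0 - eta) * resource[src] * bw / out_degree[src]
        resource = updated
    return resource


def relation_confidence(features, weights, bias):
    """RR(h, t) as a sigmoid of a weighted sum of the 6 features (assumed form)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

In a graph with two parallel edges h→a, one edge a→t and one edge h→t, the tail t accumulates more resources than a, which in turn holds more than h. In practice the weights and bias of `relation_confidence` would be learned during training rather than set by hand.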

The beneficial effects of the invention are as follows: in the evaluation stage, the method first evaluates triple confidence at three levels (entity, relation and knowledge graph global) and from multiple perspectives (data source, document co-occurrence, external link scale, text description, entity importance and entity degree), obtaining 11 confidences; in the fusion stage, the 11 confidence evaluators are then weighted and fused to obtain the final confidence value; in the verification stage, the reasonableness of the final confidence value is verified, and the verification results are fed back to the evaluation stage and the fusion stage to optimize the design of the evaluation stage or adjust the weights of the fusion stage. The method can therefore discover errors in knowledge graph data efficiently, quickly and at scale, improving the data quality of the whole knowledge graph system; it can also be used to check the reliability of the results of machine learning tasks such as link prediction and relation inference.

Drawings

FIG. 1 is a schematic diagram of the applicable stages of three types of confidence evaluation methods;

FIG. 2 is a schematic architecture diagram of the knowledge-graph triple confidence evaluation method of the present invention;

FIG. 3 is a schematic diagram of the ResourceRank algorithm in the present invention;

FIG. 4 is a diagram of a typical case of confidence calculation in the evaluation stage.

Detailed Description

The invention is further described with reference to the following figures and examples.

FIG. 2 shows the architecture of the knowledge graph triple confidence evaluation method of the present invention, which is used to evaluate the reliability of triples in a biomedical knowledge graph. The method comprises an evaluator, a fuser and a checker: after passing through the evaluator, the triple data of the knowledge graph yields a plurality of confidence scores, and the fuser fuses these scores according to certain weights to produce the final confidence value. The checker verifies the reasonableness of the final confidence value and feeds the verification result back to the evaluator and the fuser, so as to optimize the design of the evaluator or adjust the weights of the fuser.

The evaluator evaluates triple confidence at three levels (entity, relation and knowledge graph global) and from multiple perspectives (data source, document co-occurrence, external link scale, text description, entity importance and entity degree); the specific methods are shown in Table 2:

TABLE 2

a) entity level assessment;

a-1) evaluating the entity from the data source perspective; the entities to be evaluated comprise 11 types in total: compounds, diseases, proteins, genes, pathways, cell lines, drugs, products, targets, enzymes and protein-compounds; the data source confidence N_r of each entity refers to the LOD scoring in The Linked Open Data Cloud, and the PubChem, RCSB PDB, DrugBank and DTO ontology data sources, which have not been scored by LOD, are given scores of 5 stars, 5 stars and 4 stars respectively; the value of the data source confidence N_r of an entity equals the number of stars of its LOD score, and if the same entity appears in 2 or more data sources, N_r takes the highest score;

as shown in table 3, a LOD data source quality evaluation table is given:

In this step, the document library comprises CAS, Patent, PubMed, Wikipedia and DOI, and the values of α, β and θ are 0.7, 0.2 and 0.1 respectively; the scores T for different document categories are shown in Table 1:

TABLE 1

a-3) evaluating the entity from the external link scale perspective; the external link scale confidence N_L of an entity is represented by the number of external links of the entity in the biomedical knowledge graph; the larger the external link scale of an entity, the more reliable the entity data, so the credibility of the entity is measured by its number of external links, and N_L equals the number of external links of the entity;

a-4) evaluating the entity from the text description perspective; the text description of an entity describes its concept, category and functional information, and an entity with a text description has higher data reliability; if a text description of the entity exists in the data sources of step a-1), the text description confidence D of the entity is 1, and otherwise D is 0;

a-5) evaluating the entity from the entity importance perspective; in the biomedical knowledge graph, the number and quality of the entity nodes linked to a node directly determine the importance of that node in the whole graph; the PageRank algorithm is used to measure the importance of an entity in the knowledge graph, and the result represents the entity importance confidence; the PageRank algorithm is shown in formula (2):

N_s = N_in + N_out    (3)

wherein N_s represents the entity degree confidence, N_in represents the in-degree of the entity node, and N_out represents the out-degree of the entity node;

b) relation-level evaluation;

b-1) evaluating the relation level from the data source perspective; a relation between entities in the biomedical knowledge graph is generally represented by a triple (h, r, t), where h is the head entity, t is the tail entity, and r is the relation between them; if the triple data comes from a high-quality data source, the association between the two entities is strong and the confidence of the triple information is high; the relation-level data source confidence N'_r refers to the LOD scoring in The Linked Open Data Cloud, and the PubChem, RCSB PDB, DrugBank and DTO ontology data sources, which have not been scored by LOD, are given scores of 5 stars, 5 stars and 4 stars respectively; the value of N'_r equals the number of stars of the LOD score, and if the same entity appears in 2 or more data sources, N'_r takes the highest score;

wherein LCA' represents the document co-occurrence confidence of the entity pair (h, t), N' represents the number of documents related to the entity pair (h, t), F represents the impact factor of a document, L represents the citation count of a document, T represents the score corresponding to the document category, i denotes the i-th document, and α, β and θ represent weights;

FIG. 3 is a schematic diagram of the principle of the ResourceRank algorithm in the present invention. The edges (relations) from node (entity) A to node E are very dense, which indicates a high association strength between the two entities (A, E) and that a relation exists between entities A and E. In contrast, there is no directly associated edge between node G and node F, which indicates that no relation exists between entities G and F.

In this step, the confidence of a known relation is measured by the ResourceRank algorithm; the ResourceRank algorithm describes the association strength between two entities, and its idea is as follows: if the association between the entity pair (h, t) is strong, a large amount of resources will flow from the head entity h to the tail entity t through all the association paths; it is realized by the following steps:

b-3-1) constructing a directed graph centered on the head entity h;

wherein M_t is the set of all nodes that lead to the tail node t, and OD(e_i) is the out-degree of node e_i;

b-3-3) constructing a feature vector V from 6 features in total, namely R(t | h) from step b-3-2), the in-degree ID(h) of the head node h, the out-degree OD(h) of the head node h, the in-degree ID(t) of the tail node t, the out-degree OD(t) of the tail node t, and the depth Dep from the head node to the tail node; V is converted into a probability value RR(h, t) by an activation function; RR(h, t) is the confidence ResourceRank, which measures the likelihood that one or more relations exist between the head node h and the tail node t, and is obtained by formula (7):

c) knowledge graph global-level evaluation;

the global level of the knowledge graph is evaluated by N_total/M, which measures the information density of the knowledge graph at the global level and thereby evaluates the credibility of the data contained in the whole knowledge graph; wherein N_total is the total degree of all entity nodes of the knowledge graph, i.e., the sum of the in-degrees and out-degrees of all entity nodes, and M is the total number of entity nodes in the knowledge graph.

The fusion phase is realized by the following steps: combining the data quality condition of the biomedical knowledge graph and the medicine-target point relation prediction task factors, solving the triple confidence value of the biomedical knowledge graph through a formula 5:

where Confidence represents the triple confidence value, which is a positive number; the larger the value, the higher the confidence. The Confidence value is obtained by weighting the 11 confidence evaluators of the entity level, the relationship level and the knowledge graph global level, and is finally normalized to the interval [0, 1]. If the confidence value of a triple in the given knowledge graph is less than the threshold of 0.6, the data of that triple are unreliable. N'_r represents the data source confidence of the relationship level.
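The weighting and thresholding described above can be sketched as follows. This is not formula 5 itself: the sketch assumes each evaluator score has already been scaled to [0, 1] and uses a weighted average so that the result stays in [0, 1]. The evaluator names, the equal weights and the example scores are hypothetical; only the count of 11 evaluators and the 0.6 threshold come from the text.

```python
THRESHOLD = 0.6  # reliability threshold stated in the text

# Hypothetical names for the 11 evaluators (6 entity, 4 relationship, 1 global)
EVALUATORS = [
    "entity_data_source", "entity_literature_cooccurrence",
    "entity_outlink_scale", "entity_text_description",
    "entity_importance", "entity_degree",
    "relation_data_source", "relation_literature_cooccurrence",
    "relation_resource_rank", "relation_ksp",
    "global_information_density",
]


def fuse_confidence(scores, weights):
    """Weighted average of evaluator scores (assumed pre-scaled to [0, 1])."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in scores) / total_weight


def is_reliable(confidence, threshold=THRESHOLD):
    """A triple below the threshold is treated as unreliable."""
    return confidence >= threshold


weights = {name: 1.0 for name in EVALUATORS}   # assumed equal weights
scores = {name: 0.5 for name in EVALUATORS}    # made-up example scores
scores["relation_resource_rank"] = 0.9
confidence = fuse_confidence(scores, weights)
reliable = is_reliable(confidence)
```

With these made-up scores the fused value falls below 0.6, so the triple would be flagged as unreliable. Adjusting `weights` is the step that the verification stage feeds back into.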

The verification stage evaluates whether the final confidence value of a knowledge graph triple is reasonable, so as to optimize the design of the evaluators and the fuser. The checker comprises two methods: expert sampling verification and automatic verification. Expert sampling verification: manual verification is carried out by experts in the medical field. The scope of the expert verification is: triples whose confidence score lies in the range [0.9, 1] and which contain data on existing drugs or hot targets. The expert verification method is: research the drugs and targets involved in the triple, and check, based on professional knowledge and experience, whether the triple data with high confidence values are reliable;

the result of the verification stage is fed back to the evaluation stage and the fusion stage; the causes of data for which the verification result is strongly negatively correlated with the confidence value are investigated in depth, and the weights of the fusion stage are adjusted, so that the whole knowledge graph triple confidence evaluation method is improved.

As shown in fig. 4, a typical case of confidence calculation in the evaluation stage is given. Taking the triple (norepinephrine, binding molecule entity, β2 adrenergic receptor) as an example, the process by which the evaluators calculate confidence is briefly described below.

At the entity level, a translation-based energy function algorithm (TEF) is used to calculate the likelihood that a binding relationship exists between norepinephrine and the β2 adrenergic receptor. The energy function of the triple (norepinephrine, binding molecule entity, β2 adrenergic receptor) is first calculated to obtain a low-dimensional distributed representation of entities and relationships. A sigmoid function then converts the energy function into the probability that the entity pair (norepinephrine, β2 adrenergic receptor) forms the "binding molecule entity" relationship, and the resulting probability value measures the likelihood that the two entities have a binding relationship.

At the relationship level, the ResourceRank algorithm is used to calculate the relation type and the association strength of the drug and the target. ResourceRank creates a subgraph of depth 2 centered on norepinephrine and the β2 adrenergic receptor, and then calculates, on the generated subgraph, the amount of resource flowing from the head entity (norepinephrine) to the tail entity (β2 adrenergic receptor). If the association between the entity pair (norepinephrine, β2 adrenergic receptor) is strong, a large amount of resource will flow from the head entity to the tail entity along all the association paths.

At the data source level, the DataSource algorithm comprehensively evaluates the quality of the data sources in which the triple is located: the Drug Target Ontology, the PRotein Ontology and UniProt. First, the data of the triple (norepinephrine, binding molecule entity, β2 adrenergic receptor) are contained in the Drug Target Ontology, PRotein Ontology and UniProt data sources. Second, the quality of the data in the three data sources differs; the DataSource algorithm builds an LOD data source quality evaluation table by referring to the ratings of different data sources in the Linked Open Data cloud (LOD), and realizes the confidence evaluation of the data source level according to set rules.

At the literature co-occurrence level, a literature co-occurrence algorithm (LCO) quantifies the association strength of an entity pair by the number of documents in which the pair co-occurs. First, the algorithm screens the literature for documents containing the triple (norepinephrine, binding molecule entity, β2 adrenergic receptor). Then, taking the number of documents as the main factor and referring to information such as the impact factor, citations and journal category of the documents, a weighted calculation is carried out according to given weights, which finally yields the confidence value that characterizes the association strength of the entity pair.

At the knowledge graph structure level, a reachable path reasoning algorithm (RP) evaluates the semantic correlation between head and tail entities in the directed graph and the complex reasoning patterns contained between triples. First, considering the semantic relevance between a path and the target triple, reachable paths are selected by a path selection algorithm based on semantic distance. The selected reachable paths are then mapped to low-dimensional vectors, and a Recurrent Neural Network (RNN) is used to obtain a final output vector that represents the semantic information of each path. Finally, the vector is subjected to nonlinear processing to obtain a value RP((h, r, t)), which represents the confidence of the graph structure level in the knowledge graph.
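The entity-level TEF step of the example can be illustrated with a small sketch. The text only states that a translation-based energy function is computed and then converted to a probability by a sigmoid; the concrete form used here (Euclidean norm of h + r - t, and a sigmoid with hypothetical `threshold` and `scale` parameters) is an assumption in the style of translation-based embedding models, and the vectors are toy values rather than learned embeddings.

```python
import math


def translation_energy(h, r, t):
    """Assumed energy: Euclidean norm of h + r - t (lower means a better fit)."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def energy_to_probability(energy, threshold=1.0, scale=1.0):
    """Sigmoid mapping: low energy gives a probability near 1.

    `threshold` and `scale` are hypothetical parameters of this sketch.
    """
    return 1.0 / (1.0 + math.exp(scale * (energy - threshold)))


# Toy 2-dimensional embeddings (made up for illustration)
head = [1.0, 0.0]
relation = [0.0, 1.0]
good_tail = [1.0, 1.0]   # h + r lands exactly on this tail
bad_tail = [4.0, 5.0]    # far from h + r

p_good = energy_to_probability(translation_energy(head, relation, good_tail))
p_bad = energy_to_probability(translation_energy(head, relation, bad_tail))
```

A tail entity that lies close to h + r in the embedding space receives a higher probability than one that lies far from it, which matches the role the example assigns to the TEF evaluator.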

Claims

1. A knowledge graph triple confidence evaluation method, comprising an evaluation stage, a fusion stage and a verification stage, characterized in that: the evaluation stage is realized by the following steps:

a) entity level assessment;

a-1) evaluating the entities from the data source perspective, the entities to be evaluated comprising 11 types in total: compounds, diseases, proteins, genes, pathways, cell lines, drugs, products, targets, enzymes and protein-compounds; the data source confidence N_r of each entity refers to the LOD rating in the Linked Open Data cloud, and the PubChem, RCSB PDB, DrugBank and DTO ontology data sources, which have not been rated by LOD, are given ratings of 5, 5, 5 and 4 stars respectively; the value of the data source confidence N_r of an entity is equal to the number of stars of the LOD rating, and if the same entity appears in 2 or more data sources, the data source confidence N_r takes the highest rating;

represents the in-degree of the node P_j under study,

N_s = N_in + N_out (3)

where N_s represents the confidence from the entity degree perspective, N_in represents the in-degree of the entity node, and N_out represents the out-degree of the entity node;

b) evaluating a relationship level;

b-1) evaluating the relationship level from the data source perspective: relationships between entities in the biomedical knowledge graph are generally represented by a triple (h, r, t), where h is the head entity, t is the tail entity and r is the relationship between the entities; if the triple data come from a high-quality data source, the association between the two entities is strong and the confidence of the triple information is high; the data source confidence N'_in of the relationship level refers to the LOD rating in the Linked Open Data cloud, and the PubChem, RCSB PDB, DrugBank and DTO ontology data sources, which have not been rated by LOD, are given ratings of 5, 5, 5 and 4 stars respectively; the value of the data source confidence N'_in of the relationship level is equal to the number of stars of the LOD rating, and if the same entity appears in 2 or more data sources, the data source confidence N'_in of the relationship level takes the highest rating;

b-3) evaluating the known-relationship level among the entities: an entity relationship established during the construction of the biomedical knowledge graph is called a known relationship; the confidence of the known relationship is measured with the ResourceRank algorithm to obtain the confidence ResourceRank of the known relationship;

b-4) evaluating the unknown-relationship level among the entities: an entity relationship that does not exist in the existing knowledge graph and has to be obtained through reasoning is called an unknown relationship; the confidence of the unknown relationship is measured with the KSP algorithm, which evaluates the relationship strength through the number of the first K shortest paths between two entities in the graph, to obtain the confidence KSP of the unknown relationship;

c) evaluating the global level of the knowledge graph;

the global level of the knowledge graph is evaluated by N_total/M, which measures the information density of the knowledge graph at the global level and thereby evaluates the credibility of the data contained in the whole knowledge graph; where N_total is the total degree of all entity nodes of the knowledge graph, i.e. the sum of the in-degrees and out-degrees of all entity nodes, and M is the total number of entity nodes in the knowledge graph.

2. The knowledge graph triple confidence evaluation method of claim 1, characterized in that: the fusion stage is realized by the following steps: taking into account the data quality of the biomedical knowledge graph and the factors of the drug-target relationship prediction task, the triple confidence value of the biomedical knowledge graph is obtained by formula 5:

where Confidence represents the triple confidence value, which is a positive number, and the larger the value, the higher the confidence; the Confidence value is obtained by weighting the 11 confidence evaluators of the entity level, the relationship level and the knowledge graph global level, and is finally normalized to the interval [0, 1]; if the confidence value of a triple in the given knowledge graph is less than the threshold of 0.6, the data of that triple are unreliable.

3. The knowledge graph triple confidence evaluation method of claim 2, characterized in that: the verification stage evaluates whether the final confidence value of a knowledge graph triple is reasonable, so as to optimize the design of the evaluators and the fuser; the checker comprises two methods: expert sampling verification and automatic verification; expert sampling verification: manual verification is carried out by experts in the medical field, and the scope of the expert verification is: triples whose confidence score lies in the range [0.9, 1] and which contain data on existing drugs or hot targets; the expert verification method is: researching the drugs and targets involved in the triple, and checking, based on professional knowledge and experience, whether the triple data with high confidence values are reliable;

and the result of the verification stage is fed back to the evaluation stage and the fusion stage, the causes of data for which the verification result is strongly negatively correlated with the confidence value are investigated in depth, and the weight of each method in the fusion stage is adjusted, so that the whole knowledge graph triple confidence evaluation method is improved.

4. The knowledge graph triple confidence evaluation method of claim 1 or 2, characterized in that: the literature databases in step a-2) and step b-2) comprise CAS, Patent, PubMed, Wikipedia and DOI, and the values of α, β and θ are 0.7, 0.2 and 0.1 respectively; the scoring values T corresponding to the document categories CAS, Patent, PubMed, Wikipedia and DOI are 1.0, 0.8, 1.0, 0.5 and 1.0 respectively.

5. The knowledge graph triple confidence evaluation method of claim 1 or 2, characterized in that: in the evaluation of the known-relationship level in step b-3), the confidence of the known relationship is measured with the ResourceRank algorithm; the ResourceRank algorithm describes the strength of association between two entities, and its underlying idea is as follows: if the association between the entity pair (h, t) is strong, a large amount of resource will flow from the head entity h to the tail entity t along all the association paths; the step is realized as follows:

b-3-1) constructing a directed graph centered on the head entity h;

where M_t is the set of all nodes leading to the tail node t, OD(e_i) is the out-degree of node e_i, and BW_{e_i t} is the bandwidth from node e_i to the tail node t, i.e. the number of paths; for each node e_i in M_t, the amount of resource transferred from node e_i to the tail node t is

b-3-3) a feature vector V is constructed from 6 features in total: R(t|h) obtained in step b-3-2), the in-degree ID(h) of the head node h, the out-degree OD(h) of the head node h, the in-degree ID(t) of the tail node t, the out-degree OD(t) of the tail node t, and the depth Dep from the head node to the tail node; V is converted by an activation function into a probability value RR(h, t), where RR(h, t) is the ResourceRank confidence, which measures the possibility that one or more relationships exist between the head node h and the tail node t and is calculated by formula (7):