CN110795511A - Knowledge graph representation method based on cloud model - Google Patents
- Publication number: CN110795511A
- Application number: CN201911045361.7A
- Authority
- CN
- China
- Prior art keywords
- relation
- vector
- test
- triplet
- knowledge graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/288 — Entity relationship models (G06F16/00 Information retrieval; G06F16/28 Databases characterised by their database models; G06F16/284 Relational databases)
- G06F16/367 — Ontology (G06F16/30 Information retrieval of unstructured textual data; G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
- G06F18/23 — Clustering techniques (G06F18/00 Pattern recognition; G06F18/20 Analysing)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Animal Behavior & Ethology (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a knowledge graph representation method based on a cloud model, comprising the following steps: acquiring a data set and randomly dividing it into a training set and a test set according to a given proportion; dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation; for each relation, computing the main semantic that best expresses it; and computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty. On the premise that a relation vector has multiple semantics, the method obtains the vector value that best expresses the semantics of the relation vector, introduces the notion of uncertainty, and incorporates the degree of certainty into a new scoring function, so that the representation of the knowledge graph is more accurate.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a knowledge graph representation method based on a cloud model.
Background
With the development of the internet, network data has grown explosively. Internet content is large-scale, heterogeneous, and loosely organized, which poses great challenges for people trying to acquire information and knowledge effectively. The knowledge graph, with its strong semantic processing and open organization capabilities, lays a foundation for knowledge organization and intelligent applications in the internet era. The study and application of large-scale knowledge graphs has therefore attracted considerable attention in both academia and industry. A knowledge graph describes entities that exist in the real world and the relationships between them. The term was formally proposed by Google in 2012, originally to improve search engine capability and the search quality and experience of users. With the development and application of artificial intelligence technology, the knowledge graph (KG) has become one of its key technologies, and an increasing number of researchers are devoted to its study. Knowledge graphs provide a new mechanism for the effective representation of knowledge and are now widely used in expert systems, web search, question answering, and other fields.
In knowledge representation based on translation models, each piece of knowledge in the knowledge graph is represented by a triple (head, relation, tail), where head is the head entity, tail is the tail entity, and relation is the semantic relationship between them. Although conventional translation-based models have proven effective in many cases, they assume that one relation corresponds to only one translation vector and therefore cannot handle relations with multiple semantics. For example, for the has_part relation, (Sichuan, has_part, Chengdu) expresses a regional (containment) relationship, while (house, has_part, door) expresses a composition relationship. Furthermore, different relations have different degrees of certainty.
Disclosure of Invention
In order to solve the above problems, the invention provides a knowledge graph representation method based on a cloud model, comprising the following steps:
acquiring a data set and randomly dividing it into a training set and a test set according to a given proportion;
dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation;
for each relation, computing the main semantic that best expresses it;
and computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty.
Further, dividing each relation in the training set into a plurality of semantics to obtain the Gaussian mixture model of the relation specifically comprises the following steps:
clustering the triples in the training set to obtain a plurality of semantics, representing each semantic as a Gaussian distribution following the idea of a Gaussian mixture model, and representing the final relation as a mixture of several Gaussian distributions, with the specific formula as follows:
where t is the tail entity vector of the triple, h is the head entity vector, r is the relation vector, σ² is the variance, N(u_{r,m}, σ²) denotes a Gaussian distribution with mathematical expectation u_{r,m} and variance σ², M is the number of semantics contained in a single relation r, u_{r,m} is the translation vector of the m-th semantic, and λ_{r,m} is the weight of the m-th semantic, obtained by Bayesian nonparametric screening.
Further, computing, for each relation, the main semantic that best expresses it is specifically:
applying Bayesian nonparametric statistics to the training data set to obtain the weight of each semantic within each relation, and selecting the main semantic m* that best expresses the relation, with the specific formula as follows:
where u_{r,m*} denotes the vector of the main semantic, which replaces the relation vector r of the triple; (h, r, t) denotes the vector representation of the triple; and the norm term denotes the Euclidean distance between the head entity vector h and the tail entity vector t.
Further, computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty specifically comprises the following steps:
for the vector representation of a given triple, generating cloud drops with a two-dimensional normal cloud generator, specifically:
Then:
where the first output is the coordinate of the linguistic value of the main semantic m*, and the second is the degree of certainty with which that coordinate belongs to the linguistic value of m*;
thus the coordinate value of the most expressive main semantic m* is obtained:
further, the knowledge graph representing method based on the cloud model is characterized by further comprising:
and constructing a scoring function, preprocessing the test set to obtain a scoring ranking of the test triple, and evaluating the method by using an average ranking score (MeanRank) and a proportion (Hits @10) of which the ranking is not more than 10 as evaluation indexes.
Further, constructing the scoring function, preprocessing the test set to obtain a score ranking for each test triple, and evaluating the method with MeanRank and Hits@10 as evaluation indexes specifically comprises the following steps:
randomly extracting a triple (h, r, t) from the test set and randomly replacing its head entity (or tail entity) with another entity to construct a corrupted test triple (h', r, t');
applying the 'Filter' setting, specifically: before ranking each test triple, removing any corrupted triple that already appears as a correct triple in the training set or test set (excluding the test target (h, r, t) itself);
scoring each test triple with the scoring function P{(h, r, t)}, whose formula is specifically as follows:
where (h, r, t) is the vector representation of the triple in the test data set.
Compared with the prior art, the invention has the following beneficial effects:
on the premise that a relation vector has multiple semantics, the invention obtains the vector value that best expresses the semantics of the relation vector and introduces the notion of uncertainty, so that the representation of the knowledge graph is more accurate.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In this disclosure, aspects of the present invention are described with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. Embodiments of the present disclosure are not necessarily intended to include all aspects of the invention. It should be appreciated that the various concepts and embodiments described above, as well as those described in greater detail below, may be implemented in any of numerous ways, as the disclosed concepts and embodiments are not limited to any one implementation. In addition, some aspects of the present disclosure may be used alone, or in any suitable combination with other aspects of the present disclosure.
An embodiment of the present invention is described in further detail below with reference to FIG. 1, using the benchmark data sets introduced in step S1.
The invention discloses a knowledge graph representation method based on a cloud model. The method aims to obtain the coordinate value and degree of certainty that best express the semantics of a given relation, and to construct a high-quality fault diagnosis knowledge graph; it comprises the following steps:
S1: first, acquiring a data set of fault diagnosis knowledge and randomly dividing it into a training set and a test set according to a given proportion; specifically:
the experimental data sets are obtained from four common benchmark data sets of WordNet and Freebase (WN18, FB15k, WN11, FB13), each randomly divided into a training set and a test set at a ratio of 4:1. WordNet is an English lexical database grounded in cognitive linguistics and designed by psychologists, linguists, and computer engineers at Princeton University; Freebase is a collaborative knowledge base similar to Wikipedia (all content is added by users and may be freely cited under a Creative Commons license).
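As an illustration of the 4:1 split in S1, the following minimal Python sketch randomly partitions a list of triples (the entity names and the function name are placeholders for illustration, not taken from the benchmarks or the patent):

```python
import random

def split_triples(triples, train_ratio=0.8, seed=42):
    """Randomly split (head, relation, tail) triples into a training set
    and a test set at the stated 4:1 proportion."""
    rng = random.Random(seed)
    shuffled = triples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

triples = [("Sichuan", "has_part", "Chengdu"),
           ("house", "has_part", "door"),
           ("Paris", "capital_of", "France"),
           ("dog", "is_a", "animal"),
           ("wheel", "part_of", "car")]
train, test = split_triples(triples)
```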
S2: dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation, which specifically comprises the following steps:
s11: clustering and expressing triples in a training set to obtain a plurality of semantemes, expressing each semanteme into Gaussian distribution by adopting the thought of a Gaussian mixture model, and expressing the final relation into a mixed form of a plurality of Gaussian distributions, wherein the specific formula is as follows:
wherein t represents the tail entity vector in the triplet, h represents the head entity vector in the triplet, r represents the relationship vector in the triplet, σ is the variance, and N (u)r,m,σ2) Representing a mathematical expectation of ur,mVariance is σ2M denotes the number of semantics contained by a single relation r, ur,mThe translation vector, λ, representing the mth semanticr,mWeight, λ, representing the mth semanticr,mObtained by Bayesian statistical screening.
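The mixture form described above (the formula image is not reproduced in this text) can be sketched as follows, assuming isotropic Gaussians with a fixed variance; the density is evaluated on the difference vector t − h, and all names and the variance value are illustrative assumptions:

```python
import math

def gaussian_pdf(x, mean, var):
    # Isotropic multivariate Gaussian density N(mean, var * I) at point x.
    d = len(x)
    norm = (2 * math.pi * var) ** (-d / 2)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return norm * math.exp(-sq / (2 * var))

def mixture_score(h, t, components, var=0.25):
    """Weighted sum of Gaussians, one per semantic of the relation.
    components: list of (lambda_{r,m}, u_{r,m}) pairs — weight and
    translation vector of the m-th semantic."""
    diff = [ti - hi for hi, ti in zip(h, t)]
    return sum(lam * gaussian_pdf(diff, u, var) for lam, u in components)
```

A triple whose difference vector matches one of the semantic translation vectors scores higher than one that matches none, which is the behaviour the mixture is meant to capture.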
S3: calculating the meanings of each relation which can best express the relation, and specifically comprising the following steps:
s31: carrying out statistics on the training data set by using Bayesian non-parameter statistics to obtain the weight of each semantic in each relation and obtain a subject semantic m capable of expressing the relation most*The concrete formula is as follows:
wherein,representing main semantics by vectors of the main semanticsInstead of the relation vector r of the triplet, (h, r, t) represents the vector representation of the triplet, whereRepresenting the euclidean distance between the head entity vector h and the tail entity vector t.
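The patent's selection formula is not reproduced above; under a natural reading (pick the semantic with the largest Bayesian weight, then replace r by its translation vector when measuring distance), a sketch might look like the following — both function names and the translation-style distance are assumptions for illustration:

```python
def main_semantic(weights):
    """Return the index m* of the semantic with the largest weight
    lambda_{r,m}. The weights would come from Bayesian nonparametric
    estimation on the training set; here they are given directly."""
    return max(range(len(weights)), key=lambda m: weights[m])

def translation_distance(h, u, t):
    # Euclidean distance between the translated head h + u and the tail t,
    # a common translation-model reading of the distance in the text.
    return sum((hi + ui - ti) ** 2 for hi, ui, ti in zip(h, u, t)) ** 0.5
```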
S4: calculating the coordinate of the language value of each main semantic and the determination degree thereof, specifically comprising the following steps:
s41: generating cloud droplets by a two-dimensional normal cloud generator for a vector representation of a given tripletThe method specifically comprises the following steps:
Then:
whereinMeaning m as subject*The coordinates of the language value of (a),is composed ofBelonging to a main semantic m*The degree of certainty of the language value of (a);
thus, the most expressible main semantic m is obtained*Coordinate values of (2):
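The patent's two-dimensional generator is not reproduced in this text. As an illustration of the underlying mechanism, a one-dimensional forward normal cloud generator can be sketched as follows; the parameters Ex (expectation), En (entropy), and He (hyper-entropy) are the standard cloud-model ones, not symbols taken from the patent:

```python
import math
import random

def normal_cloud(ex, en, he, n, seed=0):
    """Forward normal cloud generator: returns n cloud drops (x, mu),
    where x is a coordinate of the linguistic value and mu its degree
    of certainty of belonging to that linguistic value."""
    rng = random.Random(seed)
    drops = []
    for _ in range(n):
        en_prime = abs(rng.gauss(en, he)) or 1e-12  # guard against zero
        x = rng.gauss(ex, en_prime)                 # drop coordinate
        mu = math.exp(-(x - ex) ** 2 / (2 * en_prime ** 2))  # certainty
        drops.append((x, mu))
    return drops
```

Each drop's certainty lies in (0, 1], peaking at the expectation Ex, which matches the role the degree of certainty plays in the scoring function of S5.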
S5: constructing a scoring function, preprocessing the test set to obtain a score ranking for each test triple, and taking the mean rank (MeanRank) and the proportion of correct triples ranked in the top 10 (Hits@10) as evaluation indexes of algorithm performance, which specifically comprises the following steps:
S51: randomly extracting a test triple (h, r, t) from the test set and randomly replacing its head entity (or tail entity) with an entity from the test set to construct a new test triple (h', r, t').
S52: if a new test triple (h', r, t') already exists in the knowledge graph, i.e. the triple is actually correct, it may reasonably rank above the original test triple. To eliminate this effect, before ranking each test triple, the correct triples already present in the training set, validation set, and test set are removed (excluding the test target (h, r, t) itself); this setting is called 'Filter', and the setting without such processing is called 'Raw'. The evaluation result under 'Filter' is undoubtedly the more meaningful of the two.
S53: each test triple is scored by the scoring function, with MeanRank and Hits@10 as evaluation indexes of the method's performance. The smaller the MeanRank value and the larger the Hits@10 value, the better the performance.
The formula of the scoring function P{(h, r, t)} is specifically:
where (h, r, t) is the vector representation of the triple in the test data set.
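The two evaluation indexes can be computed directly from the rank of each correct test triple among its corrupted candidates; a minimal sketch (function name is illustrative):

```python
def mean_rank_and_hits10(ranks):
    """MeanRank: average rank of the correct triple.
    Hits@10: fraction of correct triples ranked in the top 10."""
    mean_rank = sum(ranks) / len(ranks)
    hits10 = sum(1 for r in ranks if r <= 10) / len(ranks)
    return mean_rank, hits10
```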
The evaluation results are shown in Table 1 (mean prediction results of the different methods). The compared models are:
TransE model: embeds the entities and relations of the knowledge graph into the same low-dimensional vector space and judges whether a triple is reasonable by computing Euclidean distances between the vectors.
TransH model: holds that different relations should express an entity differently, and projects the entity onto the hyperplane of the corresponding relation.
TransR model: regards an entity as a composite of multiple attributes; different relations focus on different attributes, so entities and relations are projected into different spaces.
TransG model: refines each relation into multiple semantics and selects the optimal relation semantic from the refined result.
The invention does not perform as well on the MeanRank index under the WN18 data set because WN18 contains only a small number of relations, so different relation types are conflated and a few extremely low-ranked triples skew the mean. On FB15k, which contains many complex relations, the invention achieves the best results on all indexes.
In summary, the invention provides a knowledge graph representation method based on a cloud model that, on the premise that a relation vector has multiple semantics, obtains the vector value that best expresses the semantics of the relation vector, introduces the notion of uncertainty, and incorporates the degree of certainty into a new scoring function, so that the representation of the knowledge graph is more accurate.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention are intended to fall within its scope.
Claims (6)
1. A knowledge graph representation method based on a cloud model, characterized by comprising the following steps:
acquiring a data set and randomly dividing it into a training set and a test set according to a given proportion;
dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation;
for each relation, computing the main semantic that best expresses it;
and computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty.
2. The cloud model-based knowledge graph representation method of claim 1, characterized in that dividing each relation in the training set into a plurality of semantics to obtain the Gaussian mixture model of the relation specifically comprises the following steps:
clustering the triples in the training set to obtain a plurality of semantics, representing each semantic as a Gaussian distribution following the idea of a Gaussian mixture model, and representing the final relation as a mixture of several Gaussian distributions, with the specific formula as follows:
where t is the tail entity vector of the triple, h is the head entity vector, r is the relation vector, σ² is the variance, N(u_{r,m}, σ²) denotes a Gaussian distribution with mathematical expectation u_{r,m} and variance σ², M is the number of semantics contained in a single relation r, u_{r,m} is the translation vector of the m-th semantic, and λ_{r,m} is the weight of the m-th semantic, obtained by Bayesian nonparametric screening.
3. The cloud model-based knowledge graph representation method of claim 1, characterized in that computing, for each relation, the main semantic that best expresses it is specifically:
applying Bayesian nonparametric statistics to the training data set to obtain the weight of each semantic within each relation, and selecting the main semantic m* that best expresses the relation, with the specific formula as follows:
where u_{r,m*} denotes the vector of the main semantic, which replaces the relation vector r of the triple.
4. The cloud model-based knowledge graph representation method of claim 1, characterized in that computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty specifically comprises the following steps:
for the vector representation of a given triple, generating cloud drops with a two-dimensional normal cloud generator, specifically:
Then:
where the first output is the coordinate of the linguistic value of the main semantic m*, and the second is the degree of certainty with which that coordinate belongs to the linguistic value of m*;
thus the coordinate value of the most expressive main semantic m* is obtained:
5. The cloud model-based knowledge graph representation method of claim 4, further comprising:
constructing a scoring function, preprocessing the test set to obtain a score ranking for each test triple, and evaluating the method with the mean rank (MeanRank) and the proportion of correct triples ranked in the top 10 (Hits@10) as evaluation indexes.
6. The cloud model-based knowledge graph representation method of claim 5, characterized in that constructing the scoring function, preprocessing the test set to obtain a score ranking for each test triple, and evaluating the method with MeanRank and Hits@10 as evaluation indexes specifically comprises:
randomly extracting a triple (h, r, t) from the test set and randomly replacing its head entity (or tail entity) with another entity to construct a corrupted test triple (h', r, t');
applying the 'Filter' setting, specifically: before ranking each test triple, removing any corrupted triple that already appears as a correct triple in the training set or test set (excluding the test target (h, r, t) itself);
scoring each test triple with the scoring function P{(h, r, t)}, whose formula is specifically as follows:
where (h, r, t) is the vector representation of the triple in the test data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911045361.7A CN110795511A (en) | 2019-10-30 | 2019-10-30 | Knowledge graph representation method based on cloud model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911045361.7A CN110795511A (en) | 2019-10-30 | 2019-10-30 | Knowledge graph representation method based on cloud model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110795511A (en) | 2020-02-14
Family
ID=69442196
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911045361.7A Pending CN110795511A (en) | 2019-10-30 | 2019-10-30 | Knowledge graph representation method based on cloud model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110795511A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348190A (en) * | 2020-10-26 | 2021-02-09 | 福州大学 | Uncertain knowledge graph prediction method based on improved embedded model SUKE |
CN112463979A (en) * | 2020-11-23 | 2021-03-09 | 东南大学 | Knowledge representation method based on uncertainty ontology |
- 2019-10-30: Application filed — CN201911045361.7A, patent CN110795511A (status: Pending)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107329995B (en) | A semantic-controlled answer generation method, apparatus and system | |
CN111767405A (en) | Training method, device and equipment of text classification model and storage medium | |
CN110929161B (en) | Large-scale user-oriented personalized teaching resource recommendation method | |
Roelofs | Measuring Generalization and overfitting in Machine learning | |
WO2018196718A1 (en) | Image disambiguation method and device, storage medium, and electronic device | |
US11308367B2 (en) | Learning apparatus, system for generating captured image classification apparatus, apparatus for generating captured image classification apparatus, learning method, and program | |
CN114998602B (en) | Domain adaptive learning method and system based on low confidence sample contrast loss | |
Liu et al. | Efficient combinatorial optimization for word-level adversarial textual attack | |
CN114579833B (en) | Microblog public opinion visual analysis method based on topic mining and emotion analysis | |
Li-guo et al. | A new naive Bayes text classification algorithm | |
CN110795511A (en) | Knowledge graph representation method based on cloud model | |
CN112732914A (en) | Text clustering method, system, storage medium and terminal based on keyword matching | |
CN103268346B (en) | Semisupervised classification method and system | |
Krishna et al. | Revisiting the importance of encoding logic rules in sentiment classification | |
Xu et al. | Large-margin multi-view Gaussian process for image classification | |
CN104537280A (en) | Protein interactive relationship identification method based on text relationship similarity | |
CN112463914B (en) | Entity linking method, device and storage medium for internet service | |
Scott et al. | GAN-SMOTE: A Generative Adversarial Network approach to Synthetic Minority Oversampling. | |
Liu et al. | Noise learning for text classification: A benchmark | |
CN116935057A (en) | Target evaluation method, electronic device, and computer-readable storage medium | |
CN115952908A (en) | Learning path planning method, system, device and storage medium | |
CN114595336A (en) | Multi-relation semantic solution model based on Gaussian mixture model | |
Selvan et al. | Improved cuckoo search optimization algorithm based multi-document summarization model | |
CN106202234B (en) | Interactive information retrieval method based on sample-to-classifier correction | |
Li | Predicting Emotions from Twitter Posts: A Comparative Study of Machine Learning Methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||