CN110472233B

CN110472233B - Relation similarity measurement method and system based on head-tail entity distribution in knowledge base

Info

Publication number: CN110472233B
Application number: CN201910639564.2A
Authority: CN
Inventors: 刘知远; 陈暐泽; 朱昊; 韩旭; 孙茂松
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2019-07-16
Filing date: 2019-07-16
Publication date: 2021-02-12
Anticipated expiration: 2039-07-16
Also published as: CN110472233A

Abstract

The embodiment of the invention provides a relation similarity measurement method and a relation similarity measurement system based on head and tail entity distribution in a knowledge base, wherein the method comprises the following steps: acquiring two relations to be compared; acquiring the head and tail entity distribution corresponding to the two relations respectively; calculating KL divergence between head and tail entity distributions corresponding to the two relations respectively, and determining similarity between the two relations based on the calculated KL divergence. Based on the relation similarity measurement mode of head and tail entity distribution in the knowledge base, the similarity between the two relations can be determined by using the information of the head and tail entities in the knowledge base. Meanwhile, the embodiment of the invention focuses on the distribution of head and tail entities of the two relations, so that the interpretability of the similarity of the two relations is enhanced.

Description

Relation similarity measurement method and system based on head-tail entity distribution in knowledge base

Technical Field

The invention relates to the technical field of natural language processing and knowledge representation, in particular to a method and a system for measuring relational similarity based on head and tail entity distribution in a knowledge base.

Background

To store and process real-world knowledge in a structured manner, and to help computer models perform better with the aid of that knowledge, a number of large-scale knowledge graphs have been built, such as Wikidata, DBpedia, and YAGO. A knowledge graph takes proper nouns (such as persons, place names, and organization names) and things as entities, takes the relationships among entities as relations, and stores knowledge in the form of ternary relationship groups (triples) of (head entity, relation, tail entity). For example, the knowledge "Yao Ming was born in Shanghai" is represented in the knowledge graph by the triple (Yao Ming, born in, Shanghai).

Based on existing knowledge bases, many tasks have been explored, such as automatic knowledge base completion and relation extraction. In these tasks, existing models tend to have difficulty distinguishing similar relations. If the similarity between relations can be measured, the model's ability to distinguish similar relations can be strengthened in a targeted way during training, thereby improving the model.

Disclosure of Invention

The embodiment of the invention provides a relation similarity measurement method and a relation similarity measurement system based on head and tail entity distribution in a knowledge base, which address the unsatisfactory performance of prior-art measures of similarity between relations, quantify the similarity between relations in a knowledge base more accurately, and ensure that the similarity determined by the measure agrees closely with human judgments of similarity.

The embodiment of the invention provides a relation similarity measurement method based on head and tail entity distribution in a knowledge base, which comprises the following steps:

acquiring two relations to be compared;

acquiring the head and tail entity distribution corresponding to the two relations respectively;

calculating KL divergence between head and tail entity distributions corresponding to the two relations respectively, and determining similarity between the two relations based on the calculated KL divergence.

Further, the step of calculating the KL divergence between the distributions of the head and tail entities corresponding to the two relationships further includes:

based on Monte Carlo simulation, calculating KL divergence between head and tail entity distributions corresponding to the two relations.

Further, the step of obtaining two relationships to be compared further includes:

defining the distribution of the ternary relationship group and defining the calculation mode of the distribution of the ternary relationship group.

Further, the step of defining the distribution of the set of ternary relationships and defining the calculation mode of the distribution of the set of ternary relationships further includes:

and calculating optimization model parameters, and optimizing the distribution of the ternary relationship group based on the optimization model parameters.

Further, the step of calculating the KL divergence between the head and tail entity distributions corresponding to the two relationships based on the monte carlo simulation further includes:

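calculating KL divergence between head and tail entity distributions corresponding to the two relations based on the following formula:

$$D_{KL}\big(P_{\theta^*}(h,t\mid r_1)\,\|\,P_{\theta^*}(h,t\mid r_2)\big)\approx\frac{1}{|S|}\sum_{(h,t)\in S}\log\frac{P_{\theta^*}(h,t\mid r_1)}{P_{\theta^*}(h,t\mid r_2)}$$

wherein $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence; $P_{\theta^*}(h,t\mid r_1)$ represents the head and tail entity distribution corresponding to the relation $r_1$, and $P_{\theta^*}(h,t\mid r_2)$ represents the head and tail entity distribution corresponding to the relation $r_2$; $h$ and $t$ are respectively the head entity and the tail entity in the ternary relationship group corresponding to the relation; $\theta^*$ is the model parameter; and $S$ is the set of head and tail entity pairs sampled from $P_{\theta^*}(h,t\mid r_1)$.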

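Further, $\theta^*$ satisfies the following condition:

$$\theta^*=\arg\max_{\theta}\prod_{(h,r,t)\in\mathcal{G}}P_{\theta}(h,r,t)$$

wherein $\mathcal{G}\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ is the set of relational triples, $\mathcal{E}$ is the set of entities, and $\mathcal{R}$ is the set of relations; and $\theta$ denotes the model parameters before optimization.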

The embodiment of the invention provides a relation similarity measurement system based on head and tail entity distribution in a knowledge base, which comprises:

the acquisition module is used for acquiring two relations to be compared;

the acquisition module is further used for acquiring the head and tail entity distribution corresponding to the two relations respectively;

and the calculation module is used for calculating KL divergence between head and tail entity distributions corresponding to the two relations and determining the similarity between the two relations based on the calculated KL divergence.

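Further, the calculation module is further configured to: calculate, based on Monte Carlo simulation, the KL divergence between the head and tail entity distributions corresponding to the two relations.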

An embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the relationship similarity measurement method according to any one of the above descriptions when executing the computer program.

Embodiments of the present invention provide a non-transitory computer readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the relationship similarity measure method according to any one of the above-mentioned methods.

The method and the system for measuring the similarity of the relationship based on the head and tail entity distribution in the knowledge base provided by the embodiment of the invention can determine the similarity between two relationships by using the information of the head and tail entities in the knowledge base based on the relation similarity measuring mode of the head and tail entity distribution in the knowledge base. Meanwhile, the embodiment of the invention focuses on the distribution of head and tail entities of the two relations, so that the interpretability of the similarity of the two relations is enhanced. The method and the system provided by the embodiment of the invention can be directly expanded into the real world, for example, the method and the system can help the open domain relation extraction task to carry out the combination of the redundant relation, and can be used as a component in a heuristic algorithm to optimize a relation extraction model and a relation prediction model.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flowchart of an embodiment of the relation similarity measurement method based on head-tail entity distribution in a knowledge base according to the present invention;

FIG. 2 is a schematic structural diagram of an embodiment of the relation similarity measurement system based on head-tail entity distribution in a knowledge base according to the present invention;

FIG. 3 is a schematic structural diagram of an embodiment of the electronic device according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to solve at least one technical problem in the prior art, an embodiment of the present invention provides a method for measuring relationship similarity. As shown in fig. 1, the method for measuring relationship similarity generally includes the following steps:

Step S1: obtain two relationships to be compared.

Step S2: obtain the head and tail entity distribution corresponding to each of the two relations.

Step S3: calculate the KL divergence between the head and tail entity distributions corresponding to the two relations, and determine the similarity between the two relations based on the calculated KL divergence.

It should be noted that the triple corresponding to the relationship to be compared is predefined, the head-tail entity distribution calculation expression corresponding to the relationship to be compared is also defined in advance, the above-mentioned definition method may adopt a method in the prior art, and the embodiment of the present invention is not particularly limited.
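As a rough illustration of what a head and tail entity distribution for a relation looks like, the sketch below builds an empirical distribution from raw triple counts. This is a deliberately simplified stand-in: the embodiments described later obtain the distribution from a parameterized model rather than from counts, and the toy triples, relation names, and the function `head_tail_distribution` are all hypothetical.

```python
from collections import Counter

# Toy knowledge base of (head, relation, tail) triples; all names are illustrative.
TRIPLES = [
    ("Yao Ming", "born_in", "Shanghai"),
    ("Yao Ming", "place_of_birth", "Shanghai"),
    ("Lu Xun", "born_in", "Shaoxing"),
    ("Lu Xun", "place_of_birth", "Shaoxing"),
    ("Yao Ming", "plays_for", "Houston Rockets"),
]


def head_tail_distribution(triples, relation):
    """Return the empirical distribution P(h, t | relation) as a dict."""
    counts = Counter((h, t) for h, r, t in triples if r == relation)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"relation {relation!r} has no triples")
    return {pair: n / total for pair, n in counts.items()}


# Each (head, tail) pair observed with "born_in" gets its relative frequency.
dist = head_tail_distribution(TRIPLES, "born_in")
```

In this toy data, "born_in" and "place_of_birth" yield identical distributions, which is the kind of overlap the similarity measure is meant to detect.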

It is assumed that the similarity between the distributions of the head and tail entities corresponding to the two relationships reflects the similarity between the two relationships, that is, if the distributions of the head and tail entities corresponding to the two relationships are very similar, the embodiment of the present invention reasonably considers that the two relationships are also very similar. Therefore, the embodiment of the invention measures the similarity based on the Kullback-Leibler divergence (KL divergence) between the head and tail entity distributions corresponding to the two relations.
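For small discrete distributions the KL divergence can be computed exactly, which may help make the quantity concrete. The following is a minimal sketch, assuming each distribution is given as a dict from (head, tail) pairs to probabilities; the function name `kl_divergence` and the two toy distributions are illustrative only.

```python
import math


def kl_divergence(p, q):
    """Exact D_KL(p || q) for discrete distributions given as dicts."""
    total = 0.0
    for pair, p_x in p.items():
        if p_x == 0.0:
            continue  # terms with zero probability under p contribute nothing
        q_x = q.get(pair, 0.0)
        if q_x == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_x * math.log(p_x / q_x)
    return total


# Two hypothetical head-tail distributions over the same two entity pairs.
p_r1 = {("a", "x"): 0.5, ("b", "y"): 0.5}
p_r2 = {("a", "x"): 0.9, ("b", "y"): 0.1}

kl_12 = kl_divergence(p_r1, p_r2)
kl_21 = kl_divergence(p_r2, p_r1)
```

Note that the KL divergence is not symmetric, so `kl_12` and `kl_21` generally differ.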

Compared with the prior art, the embodiment of the invention provides a new relation similarity measurement mode based on head and tail entity distribution in the knowledge base, and the similarity between two relations can be determined by utilizing the information of the head and tail entities in the knowledge base. Meanwhile, the interpretability of the similarity of the two relations is enhanced by focusing on the distribution of head and tail entities of the two relations. The method provided by the embodiment of the invention can be directly expanded into the real world, for example, the method helps the open domain relation extraction task to merge the redundant relation, and is used as a component in a heuristic algorithm to optimize a relation extraction model and a relation prediction model.

On the basis of the foregoing embodiment of the present invention, a method for measuring relationship similarity is provided, where the step of calculating KL divergence between distributions of head and tail entities corresponding to two relationships further includes: based on Monte Carlo simulation, calculating KL divergence between the two head-tail entity distributions.

Computing the KL divergence exactly over the full head and tail entity distributions of two relations requires enumerating all head and tail entity pairs, which is very resource-consuming. The embodiment of the present invention therefore estimates the KL divergence using a Monte Carlo approximation.

The Monte Carlo method, also called the statistical simulation method or statistical test method, is a numerical simulation method that takes probabilistic phenomena as its object of study. It estimates an unknown quantity from statistics obtained by random sampling. Monte Carlo is a district of Monaco famous for its casino, and the method is named after it to indicate its random-sampling nature. The method is therefore suitable for computational simulation experiments on discrete systems. In computational simulation, the stochastic behavior of a system can be simulated by constructing a probabilistic model that approximates the system's behavior and running random experiments on a digital computer.

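The KL divergence is estimated based on the following formula:

$$D_{KL}\big(P_{\theta^*}(h,t\mid r_1)\,\|\,P_{\theta^*}(h,t\mid r_2)\big)\approx\frac{1}{|S|}\sum_{(h,t)\in S}\log\frac{P_{\theta^*}(h,t\mid r_1)}{P_{\theta^*}(h,t\mid r_2)}$$

where $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence; $P_{\theta^*}(h,t\mid r_1)$ and $P_{\theta^*}(h,t\mid r_2)$ are the head and tail entity distributions corresponding to the relations $r_1$ and $r_2$; $S$ is the set of head and tail entity pairs sampled from $P_{\theta^*}(h,t\mid r_1)$; $h,t\in\mathcal{E}$, where $\mathcal{E}$ is the set of entities; and $r_1,r_2\in\mathcal{R}$, where $\mathcal{R}$ is the set of relations.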

Compared with the prior art, the embodiment of the invention provides a new relation similarity measurement mode based on head and tail entity distribution in a knowledge base, and the KL divergence between all the head and tail entity distributions of two relations is estimated based on Monte Carlo approximation, so that the calculation resources are saved, and the calculation rate is improved.
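The Monte Carlo estimate can be sketched in a few lines: sample head-tail pairs from the first distribution and average the log-probability ratio over the samples. The sketch below assumes both distributions are small dicts that can be sampled directly; in the embodiments the pairs would instead be drawn from the learned model. The name `mc_kl_estimate` and the toy distributions are hypothetical.

```python
import math
import random


def mc_kl_estimate(p, q, num_samples, rng):
    """Estimate D_KL(p || q) by averaging log p(x) - log q(x) over x ~ p.

    Assumes q assigns non-zero probability to every pair in the support of p.
    """
    pairs = list(p)
    weights = [p[pair] for pair in pairs]
    samples = rng.choices(pairs, weights=weights, k=num_samples)
    return sum(math.log(p[x]) - math.log(q[x]) for x in samples) / num_samples


# Two hypothetical head-tail distributions over the same two entity pairs.
p_r1 = {("a", "x"): 0.5, ("b", "y"): 0.5}
p_r2 = {("a", "x"): 0.9, ("b", "y"): 0.1}

# Exact value, available here only because the toy support is tiny.
exact = sum(p_r1[x] * math.log(p_r1[x] / p_r2[x]) for x in p_r1)
estimate = mc_kl_estimate(p_r1, p_r2, num_samples=20000, rng=random.Random(0))
```

The cost of the estimate grows with the number of samples rather than with the number of possible entity pairs, which is where the resource saving comes from.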

On the basis of any one of the above embodiments of the present invention, a method for measuring relationship similarity is provided, wherein the step of obtaining two relationships to be compared further includes: defining the distribution of the ternary relationship group and defining the calculation mode of the distribution of the ternary relationship group.

The distribution of the ternary relationship group is defined first.

A ternary relationship group can be represented as $(h, r, t)$, where $h$ is the head entity, $t$ is the tail entity, and $r$ is the relation between the two, with $h,t\in\mathcal{E}$ and $r\in\mathcal{R}$, where $\mathcal{E}$ is the set of entities and $\mathcal{R}$ is the set of relations. First consider a function

$$F_{\theta}:\mathcal{E}\times\mathcal{R}\times\mathcal{E}\to\mathbb{R}$$

that maps every ternary relationship group to a scalar. $F_{\theta}$ is then used to define an unnormalized probability:

$$\tilde{P}_{\theta}(h,r,t)=\exp F_{\theta}(h,r,t)$$

In the embodiment of the invention, only a locally normalized version of $F_{\theta}$ is considered, in which $F_{\theta}$ is made up of three parts; the second and third parts can be computed directly by feed-forward neural networks. With this local normalization, $\tilde{P}_{\theta}$ is naturally a valid probability distribution, because $\sum_{h,r,t}\exp F_{\theta}(h,r,t)=1$, and thus $P_{\theta}(h,r,t)=\tilde{P}_{\theta}(h,r,t)$.

Secondly, the calculation mode of the distribution of the ternary relationship group is defined.

For the first part of $F_{\theta}$, which depends only on the relation, each relation is given its own parameter, and this parameter is taken as the log probability, where $\theta_1(r)$ is the parameter corresponding to the relation $r$.

The second and third parts are computed by multilayer perceptrons (MLPs). Each layer of an MLP has the form $y=\mathrm{ReLU}(Wx+b)$, and $\mathbf{h}$, $\mathbf{r}$, and $\mathbf{t}$ are the vectors of $h$, $r$, and $t$.
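To make the per-layer expression and the idea of local normalization concrete, the sketch below applies one ReLU layer to a relation vector and then normalizes scores over all candidate head entities with a log-softmax. Only the layer form y = ReLU(Wx + b) comes from the text; the dot-product scoring against entity vectors, the softmax normalization, the dimensions, and the random initialization are assumptions made for illustration and are not stated in the source.

```python
import math
import random


def relu_layer(x, W, b):
    """One MLP layer, y = relu(W x + b), with W given as a list of rows."""
    return [
        max(0.0, sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


rng = random.Random(0)
dim, num_entities = 4, 5
# Hypothetical entity vectors, relation vector, and layer weights.
entity_vecs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_entities)]
r_vec = [rng.gauss(0, 1) for _ in range(dim)]
W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
b = [0.0] * dim

hidden = relu_layer(r_vec, W, b)
# Assumed scoring: dot product between the layer output and each head vector.
scores = [sum(e_i * h_i for e_i, h_i in zip(e, hidden)) for e in entity_vecs]
log_p_head = log_softmax(scores)  # log-probabilities over all candidate heads
```

Because the scores are normalized over the candidate set, the resulting probabilities sum to one without computing a global partition function over all triples.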

On the basis of any of the above embodiments of the present invention, there is provided a method for measuring relationship similarity, where the step of defining the distribution of the set of ternary relationships and the calculation manner of the distribution of the set of ternary relationships further includes: and calculating optimization model parameters, and optimizing the distribution of the ternary relationship group based on the optimization model parameters.

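In some embodiments, the joint probability of the training set is maximized, i.e., a set of model parameters $\theta^*$ is found such that:

$$\theta^*=\arg\max_{\theta}\prod_{(h,r,t)\in\mathcal{G}}P_{\theta}(h,r,t)$$

wherein $\mathcal{G}\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ is the set of relational triples, $\mathcal{E}$ is the set of entities, and $\mathcal{R}$ is the set of relations.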
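Maximizing the joint probability of the training set is equivalent to maximizing the sum of log-probabilities of its triples. The sketch below evaluates that objective for two candidate models on a toy training set. It uses a plain lookup table in place of the neural model described in the text, so it only illustrates the objective, not the optimization procedure; the names `log_likelihood`, `mle`, and `uniform` are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(triples, prob):
    """Sum of log P(h, r, t) over the training triples (log of the joint probability)."""
    return sum(math.log(prob[triple]) for triple in triples)


# Toy training set: one triple observed three times, another once.
train = [("a", "r1", "x")] * 3 + [("b", "r1", "y")]
counts = Counter(train)

# Candidate 1: relative frequencies, which maximize likelihood for a lookup-table model.
mle = {triple: n / len(train) for triple, n in counts.items()}
# Candidate 2: uniform over the distinct observed triples.
uniform = {triple: 1 / len(counts) for triple in counts}

ll_mle = log_likelihood(train, mle)
ll_uniform = log_likelihood(train, uniform)
```

The relative-frequency table scores higher than the uniform one, as expected of the parameters that maximize the objective for this model class.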

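On the basis of any of the above embodiments of the present invention, a method for measuring relationship similarity is provided, wherein the step of calculating, based on Monte Carlo simulation, the KL divergence between the head and tail entity distributions corresponding to the two relationships further includes calculating the KL divergence based on the following formula:

$$D_{KL}\big(P_{\theta^*}(h,t\mid r_1)\,\|\,P_{\theta^*}(h,t\mid r_2)\big)\approx\frac{1}{|S|}\sum_{(h,t)\in S}\log\frac{P_{\theta^*}(h,t\mid r_1)}{P_{\theta^*}(h,t\mid r_2)}$$

wherein $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence, and $S$ is the set of head and tail entity pairs sampled from $P_{\theta^*}(h,t\mid r_1)$.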

It is assumed that the distance between the head and tail entity distributions corresponding to two relations reflects the similarity of the two relations, i.e., if the head and tail entity distributions corresponding to the two relations are very similar, the two relations are reasonably considered to be very similar as well. Therefore, the similarity between two relations can be defined based on the Kullback-Leibler divergence (KL divergence) between the head and tail entity distributions corresponding to the two relations:

$$S(r_1,r_2)=g\Big(D_{KL}\big(P_{\theta^*}(h,t\mid r_1)\,\|\,P_{\theta^*}(h,t\mid r_2)\big),\;D_{KL}\big(P_{\theta^*}(h,t\mid r_2)\,\|\,P_{\theta^*}(h,t\mid r_1)\big)\Big)$$

wherein $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence, $P_{\theta^*}(h,t\mid r_1)$ is the head and tail entity distribution corresponding to the relation $r_1$, and likewise $P_{\theta^*}(h,t\mid r_2)$ for the relation $r_2$. The function $g(\cdot,\cdot)$ is a symmetric function, and $g$ should be monotonically decreasing in order to be consistent with the meaning of "similarity". In the present invention, $g(x,y)=e^{-\max(x,y)}$.
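The choice of g as e raised to the negative of max(x, y) is simple enough to write out directly. The following minimal sketch takes the two directed KL divergences as inputs; the function name `relation_similarity` is illustrative.

```python
import math


def relation_similarity(kl_12, kl_21):
    """Similarity g(x, y) = exp(-max(x, y)) of two directed KL divergences."""
    return math.exp(-max(kl_12, kl_21))


identical = relation_similarity(0.0, 0.0)       # both divergences are zero
asymmetric = relation_similarity(0.2, 1.5)      # the larger divergence dominates
disjoint = relation_similarity(math.inf, 0.3)   # one direction is unbounded
```

Taking the maximum of the two directions means two relations count as similar only when each distribution is well covered by the other, and the result always lies between 0 and 1.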

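Since computing over the full head and tail entity distributions of two relations involves $O(|\mathcal{E}|^2)$ computations, where $\mathcal{E}$ is the set of entities, it is very resource-consuming, and the KL divergence is therefore estimated using a Monte Carlo approximation. The KL divergence between the two head and tail entity distributions is calculated based on the following formula:

$$D_{KL}\big(P_{\theta^*}(h,t\mid r_1)\,\|\,P_{\theta^*}(h,t\mid r_2)\big)\approx\frac{1}{|S|}\sum_{(h,t)\in S}\log\frac{P_{\theta^*}(h,t\mid r_1)}{P_{\theta^*}(h,t\mid r_2)}$$

wherein $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence, and $S$ is the set of head and tail entity pairs sampled from $P_{\theta^*}(h,t\mid r_1)$.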

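On the basis of any one of the above embodiments of the invention, $\theta^*$ satisfies the following condition:

$$\theta^*=\arg\max_{\theta}\prod_{(h,r,t)\in\mathcal{G}}P_{\theta}(h,r,t)$$

wherein $\mathcal{G}\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ is the set of relational triples, $\mathcal{E}$ is the set of entities, and $\mathcal{R}$ is the set of relations.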

On the basis of any of the above embodiments of the present invention, as shown in fig. 2, there is provided a relationship similarity measurement system, including:

an obtaining module 21, configured to obtain two relationships to be compared;

the obtaining module 21 is further configured to obtain respective head-tail entity distributions corresponding to the two relationships;

and the calculating module 22 is configured to calculate KL divergence between head and tail entity distributions corresponding to the two relationships, and determine similarity between the two relationships based on the calculated KL divergence.

Compared with the prior art, the embodiment of the invention provides a new relation similarity measurement mode based on head and tail entity distribution in the knowledge base, and the similarity between two relations can be determined by utilizing the information of the head and tail entities in the knowledge base. Meanwhile, the interpretability of the similarity of the two relations is enhanced by focusing on the distribution of head and tail entities of the two relations. The system provided by the embodiment of the invention can be directly expanded into the real world, for example, the system can help the open domain relation extraction task to carry out the combination of the redundant relation, and can be used as a component in a heuristic algorithm to optimize a relation extraction model and a relation prediction model.

On the basis of the foregoing embodiment of the present invention, a relationship similarity measurement system is provided, wherein the calculation module is further configured to: based on Monte Carlo simulation, calculating KL divergence between head and tail entity distributions corresponding to the two relations.

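The KL divergence is estimated based on the following formula:

$$D_{KL}\big(P_{\theta^*}(h,t\mid r_1)\,\|\,P_{\theta^*}(h,t\mid r_2)\big)\approx\frac{1}{|S|}\sum_{(h,t)\in S}\log\frac{P_{\theta^*}(h,t\mid r_1)}{P_{\theta^*}(h,t\mid r_2)}$$

where $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence; $S$ is the set of head and tail entity pairs sampled from $P_{\theta^*}(h,t\mid r_1)$; $h,t\in\mathcal{E}$, where $\mathcal{E}$ is the set of entities; and $r_1,r_2\in\mathcal{R}$, where $\mathcal{R}$ is the set of relations.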

An example is as follows:

fig. 3 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 3: a processor (processor)310, a communication Interface (communication Interface)320, a memory (memory)330 and a communication bus 340, wherein the processor 310, the communication Interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may call logic instructions in the memory 330 to perform the following method: acquiring two relations to be compared; acquiring the head and tail entity distribution corresponding to the two relations respectively; calculating KL divergence between head and tail entity distributions corresponding to the two relations respectively, and determining similarity between the two relations based on the calculated KL divergence.

In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the relationship similarity measurement method provided in the foregoing embodiments, the method including, for example: acquiring two relations to be compared; acquiring the head and tail entity distribution corresponding to the two relations respectively; calculating KL divergence between head and tail entity distributions corresponding to the two relations respectively, and determining similarity between the two relations based on the calculated KL divergence.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for measuring relationship similarity includes:

acquiring two relations to be compared;

calculating KL divergence between head and tail entity distributions corresponding to the two relations respectively, and determining similarity between the two relations based on the calculated KL divergence;

wherein, the step of calculating the KL divergence between the head and tail entity distributions corresponding to the two relationships further comprises:

calculating KL divergence between head and tail entity distributions corresponding to the two relations based on Monte Carlo simulation;

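calculating the KL divergence between the two head-to-tail entity distributions based on the following formula:

$$D_{KL}\big(P_{\theta^*}(h,t\mid r_1)\,\|\,P_{\theta^*}(h,t\mid r_2)\big)\approx\frac{1}{|S|}\sum_{(h,t)\in S}\log\frac{P_{\theta^*}(h,t\mid r_1)}{P_{\theta^*}(h,t\mid r_2)}$$

wherein $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence, and $S$ is the set of head and tail entity pairs sampled from $P_{\theta^*}(h,t\mid r_1)$;

wherein $\theta^*$ satisfies the following condition:

$$\theta^*=\arg\max_{\theta}\prod_{(h,r,t)\in\mathcal{G}}P_{\theta}(h,r,t)$$

wherein $\mathcal{G}\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ is the set of relational triples, $\mathcal{E}$ is the set of entities, and $\mathcal{R}$ is the set of relations.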

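2. The relationship similarity measurement method according to claim 1, wherein the step of obtaining two relationships to be compared further comprises:

defining the distribution of the ternary relationship group and defining the calculation mode of the distribution of the ternary relationship group.

3. The relationship similarity measurement method according to claim 2, wherein the step of defining the distribution of the ternary relationship group and defining the calculation mode of the distribution of the ternary relationship group further comprises:

calculating optimization model parameters, and optimizing the distribution of the ternary relationship group based on the optimization model parameters.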

4. A relational similarity measurement system, comprising:

the acquisition module is used for acquiring two relations to be compared;

the calculation module is used for calculating KL divergence between head and tail entity distributions corresponding to the two relations and determining similarity between the two relations based on the calculated KL divergence;

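wherein the calculation module is further configured to:

calculate, based on Monte Carlo simulation, the KL divergence between the head and tail entity distributions corresponding to the two relations based on the following formula:

$$D_{KL}\big(P_{\theta^*}(h,t\mid r_1)\,\|\,P_{\theta^*}(h,t\mid r_2)\big)\approx\frac{1}{|S|}\sum_{(h,t)\in S}\log\frac{P_{\theta^*}(h,t\mid r_1)}{P_{\theta^*}(h,t\mid r_2)}$$

wherein $D_{KL}(\cdot\,\|\,\cdot)$ represents the KL divergence, and $S$ is the set of head and tail entity pairs sampled from $P_{\theta^*}(h,t\mid r_1)$;

wherein $\theta^*$ satisfies the following condition:

$$\theta^*=\arg\max_{\theta}\prod_{(h,r,t)\in\mathcal{G}}P_{\theta}(h,r,t)$$

wherein $\mathcal{G}\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ is the set of relational triples, $\mathcal{E}$ is the set of entities, and $\mathcal{R}$ is the set of relations.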

5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the relationship similarity measure method according to any of claims 1 to 3 when executing the program.

6. A non-transitory computer readable storage medium, having stored thereon a computer program, which, when being executed by a processor, carries out the steps of the relational similarity measure method according to any one of claims 1 to 3.