CN106951526B - Entity set extension method and device - Google Patents

Info

Publication number
CN106951526B
Authority
CN
China
Prior art keywords
entity
entities
seed
meta
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710168839.XA
Other languages
Chinese (zh)
Other versions
CN106951526A (en
Inventor
石川
郑玉艳
曹晓欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201710168839.XA priority Critical patent/CN106951526B/en
Publication of CN106951526A publication Critical patent/CN106951526A/en
Application granted granted Critical
Publication of CN106951526B publication Critical patent/CN106951526B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology

Abstract

In the entity set expansion method and device provided by the embodiment of the invention, candidate entities are extracted from a target knowledge graph according to a predetermined seed entity set to form a candidate entity set; meta-paths between the seed entities are determined from a heterogeneous information network corresponding to the target knowledge graph, where a meta-path is a connection path, composed of entity types and relationship types, between two node types in the heterogeneous information network, and the two node types are node types corresponding to different seed entities; a first importance degree of each meta-path is determined according to the number of seed entity pairs connected by that meta-path; a second importance degree of each candidate entity in the candidate entity set is determined according to the first importance degree of each meta-path; and the candidate entities whose second importance degree meets a first preset condition are determined as entities to be expanded and added to the seed entity set. The invention can thereby expand an entity set effectively.

Description

Entity set extension method and device
Technical Field
The present invention relates to the field of entity set extension technologies, and in particular, to a method and an apparatus for entity set extension.
Background
Entity set extension means that, given several seed entities of a specific semantic type (that is, entities sharing specific common characteristics), more entities of that semantic type are obtained according to certain rules. For example, given the seed entity set {Beijing, Washington, Moscow}, whose semantic type is national capital, the goal is to find more national capitals, such as {Tokyo, Kuala Lumpur, ...}. Entity set extension is already widely used, for example in dictionary expansion and query suggestion.
The most common method for extending an entity set is to select a data source, process the data source according to certain rules, and determine other entities having the same semantic type as the seed entities as extension elements of the entity set. Existing entity set expansion methods mostly take text or web pages as data sources. However, because a single text or web page contains a limited amount of data, entity set expansion based on such sources is not effective enough to meet growing entity set expansion requirements.
Disclosure of Invention
The embodiment of the invention aims to provide an entity set extension method and device so as to improve the effectiveness of entity set extension.
In order to achieve the above object, in a first aspect, an embodiment of the present invention provides an entity set extension method, where the method includes:
extracting candidate entities from the target knowledge graph according to a predetermined seed entity set, and forming the candidate entities obtained by extraction into a candidate entity set; the target knowledge-graph includes at least seed entities in the set of seed entities;
determining meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge graph; the meta path is: a connection path composed of an entity type and a relationship type between two node types in the heterogeneous information network; wherein the two node types are node types corresponding to different seed entities in the seed entity set;
determining a first importance degree of each meta-path according to the number of seed entity pairs connected with each meta-path;
determining a second degree of importance of each candidate entity in the set of candidate entities according to the first degree of importance of each meta-path;
and determining the candidate entities with the second importance degree meeting a first preset condition in the candidate entity set as entities to be expanded, and adding the entities to be expanded into the seed entity set.
Optionally, the extracting candidate entities from the target knowledge graph according to the predetermined seed entity set includes:
determining a set of entity types of each seed entity in a predetermined set of seed entities;
determining the intersection of all entity type sets as an initial entity type set;
determining a final entity type set corresponding to the seed entity set according to the hierarchical relationship of each entity type in the initial entity type set; and taking the entity which accords with the entity type in the final entity type set in the target knowledge graph as a candidate entity.
Optionally, the determining a final entity type set according to the hierarchical relationship of each entity type in the initial entity type set includes:
determining at least one hierarchical relationship corresponding to the initial entity type set, wherein any hierarchical relationship is a dependency relationship of at least two entity types;
and determining the entity type positioned at the bottommost layer in each hierarchical relationship as a final entity type, and composing the determined final entity types into a final entity type set.
Optionally, the determining meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge-graph includes:
determining a node set corresponding to the seed entity set from a heterogeneous information network corresponding to the target knowledge graph, wherein the node set comprises nodes corresponding to seed entities in the seed entity set;
taking each node in the node set as a first node;
taking each first node as a current source node, accessing a current target node connected with each current source node through a preset type edge in the heterogeneous information network, and establishing a plurality of structure data tables to be selected, one for each edge type; wherein each structure data table to be selected includes: the first entity pairs, each consisting of a first node and a current target node connected through an edge of the edge type corresponding to that structure data table to be selected, a similarity value of each first entity pair, the accessed path, and a similarity score; the similarity score is the sum of the similarity values of all first entity pairs;
judging whether a current target node connected with each current source node in the structure data table to be selected is a second node or not aiming at each structure data table to be selected; if so, recording the similarity value of the first entity pair corresponding to the current source node in the structure data table to be selected as a first numerical value, determining the visited path corresponding to the current source node as a meta path instance, and otherwise, recording the similarity value as a second numerical value; wherein the second node is: a node in the node set, which is different from a first node corresponding to a current source node;
selecting a structure data table to be selected which meets a second preset condition from the structure data tables to be selected as a current structure data table; the second preset condition includes: the variety of the seed entities stored in the structure data table to be selected is the largest; when there are a plurality of the stored structure data tables to be selected with the most variety of seed entities, the second preset condition further includes: the number of the first entity pairs stored in the structure data table to be selected is minimum;
updating each current target node in the current structure data table to be a current source node, and returning to execute the step of accessing the current target node connected with each current source node through a preset type edge in the heterogeneous information network;
and when the length of the accessed path in each current structure data table is greater than a third preset value or the number of the seed entities in each current structure data table is less than a fourth preset value, counting all the determined meta-path examples, and obtaining meta-paths corresponding to all the meta-path examples according to the entity types and the relationship types contained in all the meta-path examples.
Optionally, the selecting, from the structure data tables to be selected, a structure data table to be selected that meets a second preset condition as a current structure data table includes:
and selecting the structure data table to be selected which meets a second preset condition as the current structure data table from a plurality of structure data tables to be selected of which the similarity scores are not larger than the first preset value.
Optionally, the determining the first importance degree of each meta-path according to the number of seed entity pairs connected to each meta-path includes:
determining the total number of seed entity pairs connected with each meta-path according to all the seed entity pairs connected with each meta-path;
determining a first importance degree of each meta-path according to the total number of the seed entity pairs connected with each meta-path and a first preset model;
wherein the first preset model is:
Figure GDA0002545458950000031
where $W_k$ is the first importance degree corresponding to meta-path $P_k$, and $l$ is the number of meta-paths;
Figure GDA0002545458950000032
$SP_k$ is the total number of seed entity pairs connected by meta-path $P_k$, $m$ is the number of seed entities, and
Figure GDA0002545458950000033
is the total number of pairs of seed entities.
Optionally, the determining the second importance degree of each candidate entity in the candidate entity set according to the first importance degree of each meta-path includes:
determining a second importance degree of each candidate entity in the candidate entity set according to the first importance degree and a second preset model of each meta-path;
wherein the second preset model is:
Figure GDA0002545458950000041
where $R(c_i, S)$ represents the second importance degree of candidate entity $c_i$, and $n$ is the number of candidate entities; $s_j$ represents a seed entity, $S$ represents the seed entity set, and $m$ is the number of seed entities; $W_k$ is the first importance degree corresponding to meta-path $P_k$, and $l$ is the number of meta-paths; $r\{(c_i, s_j) \mid P_k\}$ indicates whether meta-path $P_k$ connects seed entity $s_j$ and candidate entity $c_i$: $r = 1$ if it does, and $r = 0$ otherwise.
Optionally, the determining, as entities to be expanded, the candidate entities in the candidate entity set whose second importance degree satisfies the first preset condition includes:
and determining the candidate entities with the second importance degree larger than a second preset value in the candidate entity set as entities to be expanded.
Optionally, the determining, as entities to be expanded, the candidate entities in the candidate entity set whose second importance degree satisfies the first preset condition includes:
according to the second importance degree, sorting the candidate entities in the candidate entity set in a descending order to obtain a first candidate entity set; and selecting a first preset number of candidate entities ranked at the top from the first candidate entity set as entities to be expanded.
In order to achieve the above object, in a second aspect, an embodiment of the present invention provides an entity set extension apparatus, including:
the candidate entity set determining module is used for extracting candidate entities from the target knowledge graph according to a predetermined seed entity set and forming the candidate entities obtained by extraction into a candidate entity set; the target knowledge-graph includes at least seed entities in the set of seed entities;
a meta-path determination module for determining meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge graph; the meta path is: a connection path composed of an entity type and a relationship type between two node types in the heterogeneous information network; wherein the two node types are node types corresponding to different seed entities in the seed entity set;
the first importance degree determining module is used for determining the first importance degree of each meta-path according to the number of the seed entity pairs connected with each meta-path;
a second importance level determining module, configured to determine a second importance level of each candidate entity in the candidate entity set according to the first importance level of each meta-path;
and the entity set expansion module is used for determining the candidate entities with the second importance degree meeting the first preset condition in the candidate entity set as entities to be expanded and adding the entities to be expanded into the seed entity set.
According to the entity set expansion method and device provided by the embodiment of the invention, on one hand, a target knowledge graph with huge data content is used as a data source for entity set expansion; on the other hand, meta-paths among the seed entity sets are determined from the heterogeneous information network corresponding to the target knowledge graph, each determined meta-path is a path connecting the seed entity pair, therefore, the meta-paths can accurately reflect specific common characteristics among the seed entities, the second importance degree of the candidate entities determined by the first importance degree of each meta-path is more effective, and the entities to be expanded determined according to the second importance degree are more effective. Therefore, the entity set extension method and the entity set extension device provided by the embodiment of the invention can improve the effectiveness of entity set extension.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of an entity set extension method according to an embodiment of the present invention;
FIG. 2 is a partial schematic diagram of a Yago knowledge graph;
FIG. 3 is a partial schematic diagram of the hierarchical relationship of entity types in a Yago knowledge graph;
FIG. 4 is a detailed flowchart of step S102 in the embodiment shown in FIG. 1;
FIG. 5 is a schematic diagram illustrating the determination of meta-paths using the detailed flow shown in FIG. 4;
FIG. 6A to FIG. 6D are schematic diagrams illustrating validity verification results of an entity set extension method according to an embodiment of the present invention, where the entity types corresponding in sequence to FIG. 6A to FIG. 6D are: actors in movies directed by Steven Spielberg, movies whose director has won the National Film Award, software produced by companies located in Mountain View, California, and scientists at universities located in Cambridge, Massachusetts;
FIG. 7 is a block diagram illustrating an entity set extension apparatus according to an embodiment of the present invention;
FIG. 8 is a detailed block diagram of module 702 in the embodiment shown in FIG. 7.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for extending an entity set, which are described below with reference to specific embodiments.
First, a method for extending an entity set according to an embodiment of the present invention is described.
As shown in fig. 1, an entity set extension method provided in an embodiment of the present invention includes the following steps:
S101, extracting candidate entities from a target knowledge graph according to a predetermined seed entity set, and forming the candidate entities obtained through extraction into a candidate entity set; the target knowledge-graph includes at least seed entities in the set of seed entities;
The seed entities can be preset according to a given specific semantic type, and the set formed by all the seed entities is the seed entity set. For example, if the specific semantic type is preset as movie director, then Ang Lee, Chen Kaige, and Zhang Yimou can be preset as seed entities to form the seed entity set {Ang Lee, Chen Kaige, Zhang Yimou}.
A knowledge graph is a very large data set composed mainly of triples of the form <subject, predicate, object>. For example, in the Yago knowledge graph shown in FIG. 2, one triple is <Spielberg, director, War Horse>, which states that Spielberg directed the movie War Horse. Besides Yago, other knowledge graphs exist in the prior art, such as DBpedia and Freebase.
In the embodiment of the present invention, the target knowledge-graph refers to a knowledge-graph related to a predetermined seed entity. As can be appreciated by those skilled in the art, when entity set expansion is performed, accurate expansion of the entity set can only be achieved if the adopted data source has correlation with the seed entity.
Specifically, the target knowledge-graph comprises at least the seed entities in the set of seed entities.
In embodiments of the present invention, a candidate entity is an entity that has a particular common characteristic with a seed entity. Wherein the specific common features include: the entity types are the same.
S102, determining meta-paths among seed entities from a heterogeneous information network corresponding to the target knowledge graph; the meta path is: a connection path composed of an entity type and a relationship type between two node types in the heterogeneous information network; wherein the two node types are node types corresponding to different seed entities in the seed entity set;
A heterogeneous information network is a directed graph $G = (V, E)$, where $V$ is the set of all entity nodes and $E$ is the set of all relationship edges, and in which the number of entity object types satisfies $|A| > 1$ or the number of relationship types linking different entity objects satisfies $|R| > 1$. In the network, one node represents one entity object (entity for short), and one edge represents the relationship between the two entity objects it connects. The network also has a node type mapping function $\phi: V \to A$ and an edge type mapping function $\psi: E \to R$: each entity object $v \in V$ belongs to a particular object type $\phi(v) \in A$, and each edge $e \in E$ belongs to a particular relationship type $\psi(e) \in R$.
The meta-path is a connection path composed of entity types and relationship types between two node types in the heterogeneous information network, and represents a semantic relationship between the two node types. A meta-path $P$ is defined as $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, a sequence of entity types (node types) and relationship types (edge types). It describes a path that connects nodes of type $A_1$ to nodes of type $A_{l+1}$ through a series of nodes of types $A_1, \ldots, A_{l+1}$ and edges of types $R_1, \ldots, R_l$. The node type $A_1$ is called the source node type, and the node type $A_{l+1}$ is called the target node type.
In heterogeneous information networks, meta-paths are widely used to capture rich semantic information. A path $p = (a_1 a_2 \cdots a_{l+1})$ between objects $a_1$ and $a_{l+1}$ is a path instance of the meta-path $P$ if, for all $i$ (written $\forall i$), the object $a_i$ has type $A_i$ and the edge $e_i$ between $a_i$ and $a_{i+1}$ satisfies $\psi(e_i) \in R_i$.
In general, there may be multiple path instances in a meta path, for example, a path instance is:
Figure GDA0002545458950000075
another example path is:
Figure GDA0002545458950000076
since both path instances satisfy the meta path
Figure GDA0002545458950000077
We say that both paths are path instances of this meta-path.
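The path-instance condition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a meta-path is represented as a flat list alternating entity types and relationship types, edge direction is ignored for brevity, and the names `follows_meta_path`, `ENTITY_TYPES`, `META_PATH`, and all entity identifiers are hypothetical rather than taken from the patent.

```python
from typing import Dict, List, Set

# Hypothetical toy data: entity -> set of entity types it belongs to.
ENTITY_TYPES: Dict[str, Set[str]] = {
    "actor_a": {"actor"},
    "actor_b": {"actor"},
    "movie_x": {"movie"},
    "movie_y": {"movie"},
    "director_d": {"director"},
}

# A meta-path alternates entity types and relationship types:
# [A_1, R_1, A_2, ..., R_l, A_{l+1}]. The relationship names are assumptions.
META_PATH: List[str] = ["actor", "actedIn", "movie", "directedBy", "director"]


def follows_meta_path(path: List[str], meta_path: List[str],
                      entity_types: Dict[str, Set[str]]) -> bool:
    """Return True if `path` (entities alternating with relationship types)
    is a path instance of `meta_path`."""
    if len(path) != len(meta_path) or len(path) % 2 == 0:
        return False
    for pos, (item, expected) in enumerate(zip(path, meta_path)):
        if pos % 2 == 0:
            # Even positions hold entities: the entity must carry the expected type.
            if expected not in entity_types.get(item, set()):
                return False
        elif item != expected:
            # Odd positions hold relationship types and must match exactly.
            return False
    return True


PATH_1 = ["actor_a", "actedIn", "movie_x", "directedBy", "director_d"]
PATH_2 = ["actor_b", "actedIn", "movie_y", "directedBy", "director_d"]
```

Both `PATH_1` and `PATH_2` satisfy the same meta-path, which mirrors the statement that one meta-path can have several path instances.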
Since the knowledge graph is mainly composed of triples of < subject, predicate, object >, in which the subject and the object may respectively correspond to one entity, the predicate may represent a certain relationship or attribute between the subject and the object, and the type of the subject and the object and the relationship or attribute between the subject and the object included in the knowledge graph are not limited to one. Therefore, a heterogeneous information network can be constructed in advance according to the knowledge graph.
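As a rough illustration of how triples can be turned into a typed network, the sketch below separates type-assertion triples from relationship triples. It is not the construction used in the patent, which treats that step as prior art; the predicate name `type`, the helper names `build_hin` and `is_heterogeneous`, and the entity identifiers are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

TYPE_PREDICATE = "type"  # hypothetical predicate that assigns an entity type

# Hypothetical toy triples.
TRIPLES: List[Triple] = [
    ("spielberg", "type", "director"),
    ("war_horse", "type", "movie"),
    ("toby", "type", "actor"),
    ("spielberg", "directed", "war_horse"),
    ("toby", "actedIn", "war_horse"),
]


def build_hin(triples: Iterable[Triple]):
    """Split triples into node types and typed, directed edges."""
    node_types: Dict[str, Set[str]] = defaultdict(set)
    edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subj, pred, obj in triples:
        if pred == TYPE_PREDICATE:
            node_types[subj].add(obj)          # subject has entity type `obj`
        else:
            edges[subj].append((pred, obj))    # directed edge with relationship type
            node_types.setdefault(subj, set())
            node_types.setdefault(obj, set())
    return dict(node_types), dict(edges)


def is_heterogeneous(node_types, edges) -> bool:
    """The network is heterogeneous if |A| > 1 or |R| > 1."""
    entity_types = set().union(*node_types.values())
    relation_types = {rel for out in edges.values() for rel, _ in out}
    return len(entity_types) > 1 or len(relation_types) > 1


NODE_TYPES, EDGES = build_hin(TRIPLES)
```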
For example, in FIG. 2, "director" and "show" are two different types of relationships, "actor" and "movie" are different entity types,
Figure GDA0002545458950000078
is a meta-path between Toby Kebbell and Steven Spielberg.
In addition, in FIG. 2, Toby Kebbell and Martin Kaien both belong to the actor type, while Toby Kebbell and Nigel Havers belong not only to the actor type but also to the type of actors in movies directed by Steven Spielberg. To distinguish the two, the former is called a coarse-grained entity type and the latter a fine-grained entity type. Candidate entities determined according to a fine-grained entity type are more likely to be determined as entities to be expanded.
Constructing a heterogeneous information network from a knowledge graph is prior art, so the process is not described in detail here.
In this embodiment of the present invention, the two nodes are nodes corresponding to different seed entities in the seed entity set, and a node pair composed of the two nodes may be referred to as a "seed entity pair".
Table 1 lists the source nodes and target nodes when the seed entity set is $\{s_1, s_2, \ldots, s_m\}$, where each node is denoted by its corresponding seed entity. As shown in Table 1, when the source node is $s_1$, the target node is any one of $\{s_2, \ldots, s_m\}$; when the source node is $s_2$, the target node is any one of $\{s_1, s_3, \ldots, s_m\}$; the remaining source nodes follow the same pattern and are not described one by one here.
TABLE 1
Figure GDA0002545458950000081
It should be further noted that, in the embodiment of the present invention, only the entities corresponding to the source node and the target node in each meta-path are seed entities, and the entities corresponding to other nodes are non-seed entities.
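To make the idea of meta-paths between seed entity pairs concrete, here is a deliberately simplified sketch: a bounded depth-first search that records, for each type sequence, the ordered seed pairs it connects. It is not the structure-data-table procedure of FIG. 4. It assumes each node has a single type, traverses edges in both directions (marking reversed relationships with a `^-1` suffix), and stops a walk as soon as it reaches another seed, so that only the source and target of a path are seed entities. All names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source entity, relationship type, target entity)

# Hypothetical toy network.
EDGES: List[Edge] = [
    ("a1", "actedIn", "m1"),
    ("a2", "actedIn", "m1"),
    ("a3", "actedIn", "m1"),
    ("d1", "directed", "m1"),
]
NODE_TYPE: Dict[str, str] = {
    "a1": "actor", "a2": "actor", "a3": "actor",
    "m1": "movie", "d1": "director",
}
SEEDS = ["a1", "a2"]


def find_meta_paths(edges, node_type, seeds, max_len):
    """Map each meta-path (a tuple of alternating types) to the set of
    ordered seed pairs (source, target) connected by one of its instances."""
    adjacency = defaultdict(list)
    for src, rel, dst in edges:
        adjacency[src].append((rel, dst))
        adjacency[dst].append((rel + "^-1", src))  # allow backward traversal

    seed_set = set(seeds)
    found: Dict[Tuple[str, ...], Set[Tuple[str, str]]] = defaultdict(set)

    def walk(start, node, type_seq, visited):
        if (len(type_seq) - 1) // 2 >= max_len:  # edges used so far
            return
        for rel, nxt in adjacency[node]:
            if nxt in visited:
                continue
            seq = type_seq + (rel, node_type[nxt])
            if nxt in seed_set:
                found[seq].add((start, nxt))     # reached another seed: record, stop
            else:
                walk(start, nxt, seq, visited | {nxt})

    for seed in seeds:
        walk(seed, seed, (node_type[seed],), {seed})
    return dict(found)


META_PATHS = find_meta_paths(EDGES, NODE_TYPE, SEEDS, max_len=2)
```

With the toy data, the only meta-path found is actor, actedIn, movie, actedIn^-1, actor, connecting the ordered pairs (a1, a2) and (a2, a1), which matches the source/target convention described for Table 1.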
S103, determining a first importance degree of each meta-path according to the number of seed entity pairs connected with each meta-path;
in a specific implementation manner provided in the embodiment of the present invention, step S103 includes:
step 1, determining the total number of seed entity pairs connected with each meta-path according to the number of all seed entity pairs connected with each meta-path;
specifically, each path instance connects a pair of seed entities, so the total number of seed entity pairs connected by each meta-path is the sum of the number of seed entity pairs connected by all path instances corresponding to the meta-path.
Step 2, determining a first importance degree of each meta-path according to the total number of seed entity pairs connected with each meta-path and a first preset model;
wherein the first preset model is:
Figure GDA0002545458950000082
where $W_k$ is the first importance degree corresponding to meta-path $P_k$, and $l$ is the number of meta-paths;
Figure GDA0002545458950000083
$SP_k$ is the total number of seed entity pairs connected by meta-path $P_k$, $m$ is the number of seed entities, and
Figure GDA0002545458950000084
is the total number of pairs of seed entities.
In step S102, all the important meta-paths are determined, but their importance degrees differ. Through extensive experimental verification, the applicant found that the importance degree of a meta-path is related to the total number of seed entity pairs it connects: the more seed entity pairs a meta-path connects, the better it reflects the common characteristics of the seed entities, and therefore the more important it is.
In view of this, the embodiment of the present invention determines the first importance degree of each meta-path according to the first preset model, from which it is easy to see that the larger the total number of seed entity pairs connected by meta-path $P_k$, the larger its first importance degree.
It should be noted that the method for determining the first importance degree of each meta-path is not limited to the above-mentioned one, and other methods for determining the first importance degree of each meta-path existing in the prior art are all applicable to the present invention.
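The sketch below shows one weighting that has the property described above, namely that a meta-path connecting more seed entity pairs receives a larger first importance degree. The exact first preset model is given by the formula figure and is not reproduced here; normalising each count by the sum of all counts is an assumption chosen for illustration, and the function name and data are hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple


def meta_path_weights(
    seed_pairs_by_path: Dict[Hashable, Set[Tuple[str, str]]]
) -> Dict[Hashable, float]:
    """Assumed weighting: W_k = SP_k / sum of SP over all meta-paths,
    where SP_k is the number of seed entity pairs connected by meta-path P_k."""
    counts = {path: len(pairs) for path, pairs in seed_pairs_by_path.items()}
    total = sum(counts.values())
    if total == 0:
        return {path: 0.0 for path in counts}
    return {path: count / total for path, count in counts.items()}


# Hypothetical input: P1 connects three seed pairs, P2 connects one.
SEED_PAIRS = {
    "P1": {("s1", "s2"), ("s2", "s1"), ("s1", "s3")},
    "P2": {("s2", "s3")},
}
WEIGHTS = meta_path_weights(SEED_PAIRS)
```

Under this assumed normalisation the weights sum to 1, so they can be read as the relative share of seed pairs explained by each meta-path.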
S104, determining a second importance degree of each candidate entity in the candidate entity set according to the first importance degree of each meta-path;
in a specific implementation manner provided by the embodiment of the present invention, step S104 includes:
determining a second importance degree of each candidate entity in the candidate entity set according to the first importance degree and a second preset model of each meta-path;
wherein the second preset model is:
Figure GDA0002545458950000091
where $R(c_i, S)$ represents the second importance degree of candidate entity $c_i$, and $n$ is the number of candidate entities; $s_j$ represents a seed entity, $S$ represents the seed entity set, and $m$ is the number of seed entities; $W_k$ is the first importance degree corresponding to meta-path $P_k$, and $l$ is the number of meta-paths; $r\{(c_i, s_j) \mid P_k\}$ indicates whether meta-path $P_k$ connects seed entity $s_j$ and candidate entity $c_i$: $r = 1$ if it does, and $r = 0$ otherwise.
It is easy to find that the second importance degree is positively correlated with the first importance degree, and the greater the first importance degree of a certain meta-path is, the more the meta-path can reflect the specific common features among the seed entities, so the second importance degree of the candidate entity determined according to the first importance degree is more effective.
It should also be noted that the method for determining the second importance level of each candidate entity is not limited to the above-mentioned one, and other methods for determining the second importance level of each candidate entity existing in the prior art are all applicable to the embodiments of the present invention.
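A candidate score built from the quantities defined above can be sketched as follows. The sketch sums $W_k \cdot r\{(c_i, s_j) \mid P_k\}$ over all seed entities and meta-paths and divides by $m$; the division by $m$ is an assumption, since the second preset model itself is only available as a formula figure. The `connects` callback, the `LINKS` set, and all identifiers are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def score_candidates(
    candidates: Iterable[str],
    seeds: List[str],
    weights: Dict[str, float],
    connects: Callable[[str, str, str], bool],
) -> Dict[str, float]:
    """Assumed scoring: R(c, S) = (1/m) * sum_j sum_k W_k * r{(c, s_j) | P_k},
    where r is 1 if meta-path P_k connects candidate c and seed s_j, else 0."""
    m = len(seeds)
    scores = {}
    for cand in candidates:
        total = 0.0
        for seed in seeds:
            for path, weight in weights.items():
                if connects(path, cand, seed):
                    total += weight
        scores[cand] = total / m if m else 0.0
    return scores


# Hypothetical connectivity facts: (meta-path, candidate, seed).
LINKS = {
    ("P1", "c1", "s1"), ("P1", "c1", "s2"),
    ("P2", "c1", "s1"), ("P2", "c2", "s2"),
}
SCORES = score_candidates(
    candidates=["c1", "c2"],
    seeds=["s1", "s2"],
    weights={"P1": 0.75, "P2": 0.25},
    connects=lambda path, cand, seed: (path, cand, seed) in LINKS,
)
```

Candidate `c1` is linked to both seeds through the heavier meta-path and therefore scores higher than `c2`, which illustrates the positive correlation between the first and second importance degrees.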
S105, determining the candidate entities with the second importance degree meeting the first preset condition in the candidate entity set as entities to be expanded, and adding the entities to be expanded into the seed entity set.
In a specific implementation manner provided by the embodiment of the present invention, step S105 includes:
and determining the candidate entities with the second importance degree larger than a second preset value in the candidate entity set as entities to be expanded.
In another specific implementation manner provided by the embodiment of the present invention, step S105 includes:
according to the second importance degree, sorting the candidate entities in the candidate entity set in a descending order to obtain a first candidate entity set; and selecting a first preset number of candidate entities ranked at the top from the first candidate entity set as entities to be expanded.
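The two selection variants of step S105 can be sketched as below: one keeps candidates whose score exceeds a threshold (the second preset value), the other keeps the top-ranked candidates (the first preset number). The function names, the tie-breaking by name, and the sample scores are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Set


def select_by_threshold(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep candidates whose second importance degree exceeds the threshold."""
    return {cand for cand, score in scores.items() if score > threshold}


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Sort candidates by descending score (ties broken by name) and keep k."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [cand for cand, _ in ranked[:k]]


def expand_seed_set(seeds: Iterable[str], selected: Iterable[str]) -> Set[str]:
    """Add the entities to be expanded to the seed entity set."""
    return set(seeds) | set(selected)


# Hypothetical scores.
SCORES = {"c1": 0.875, "c2": 0.125, "c3": 0.5}
BY_THRESHOLD = select_by_threshold(SCORES, threshold=0.4)
TOP_TWO = select_top_k(SCORES, k=2)
EXPANDED = expand_seed_set(["s1", "s2"], TOP_TWO)
```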
The applicant used corresponding ranking metrics to check the first preset number of selected entities to be expanded against the target knowledge graph, thereby verifying the validity of the method.
According to the entity set expansion method provided by the embodiment of the invention, on one hand, a target knowledge graph with huge data content is used as a data source for entity set expansion; on the other hand, meta-paths between the seed entities are determined from the heterogeneous information network corresponding to the target knowledge graph, and each determined meta-path is a path connecting the seed entity pair, so that the meta-paths can accurately reflect specific common characteristics between the seed entities, the second importance degree of the candidate entities determined by the first importance degree of each meta-path is more effective, and the entities to be expanded determined according to the second importance degree are more effective. Therefore, the entity set extension method provided by the embodiment of the invention can improve the effectiveness of entity set extension.
In addition, knowledge graphs such as Yago have become tools for quickly retrieving information. As knowledge graphs have grown in popularity, many researchers have begun to use them to help improve the accuracy of entity set expansion over text or web pages. However, there is currently little work on entity set expansion that uses a knowledge graph as a standalone data source. It is nevertheless necessary to use the knowledge graph as a standalone data source for entity set expansion, for the following reasons: (1) traditional entity set expansion methods based on text or web page information require complex natural language processing, which affects the accuracy of expansion to some extent, whereas a knowledge graph used as a standalone data source requires no complex preprocessing; (2) a knowledge graph contains rich entities and semantic relationships, which benefits entity set expansion.
In a specific implementation manner provided by the embodiment of the present invention, in step S101, the step of extracting candidate entities from the target knowledge graph according to a predetermined seed entity set may include:
step 1, determining an entity type set of each seed entity in a predetermined seed entity set;
For example, for the seed entity "lie" in the seed entity set {lie, chenkege, zhangshu} determined above, the corresponding entity type set is {person, director}; for the seed entities "chenkege" and "zhangshu", the corresponding entity type set is {person, director, actor}.
Step 2, determining the intersection of all entity type sets as an initial entity type set;
Because entity types shared by all seed entities better reflect the common characteristics among the entities, determining the intersection of all entity type sets as the initial entity type set allows entity set expansion to be carried out more effectively.
Specifically, the intersection of the entity type sets {person, director} and {person, director, actor} determined in step 1 is {person, director}; that is, the initial entity type set is determined to be {person, director}.
Step 3, determining a final entity type set corresponding to the seed entity set according to the hierarchical relationship of each entity type in the initial entity type set; and taking the entity which accords with the entity type in the final entity type set in the target knowledge graph as a candidate entity.
Although the entity type "person" in the initial entity type set {person, director} reflects a common feature of the seed entities, it is coarse-grained and would make the determined candidate entities semantically ambiguous. Therefore, in the embodiment of the present invention, the final entity type set corresponding to the seed entity set is further determined according to the hierarchical relationship of each entity type in the initial entity type set.
In the embodiment of the present invention, an entity type that contains subtypes is referred to as a "coarse-grained" entity type, and its subtypes are referred to as "fine-grained" entity types. For example, of the two entity types "person" and "director", "person" is coarse-grained and "director" is fine-grained. Those skilled in the art will understand that the coarse and fine granularity of entity types are relative.
Specifically, the hierarchical relationship of the entity types in the initial entity type set refers to the subordination between the entity types; for example, in the initial entity type set {person, director}, the entity type "director" is subordinate to the entity type "person".
More specifically, the step 3 may include:
substep 1, determining at least one hierarchical relationship corresponding to the initial entity type set, wherein any hierarchical relationship is a subordinate relationship of at least two entity types;
substep 2, determining the entity type located at the bottommost layer in each hierarchical relationship as a final entity type, and forming the determined final entity types into a final entity type set.
Entity types or relationship types in a knowledge graph are often organized hierarchically, where the hierarchical relationship describes the subordination (also called a parent-child relationship) between entity types or relationship types. Fig. 3 shows a partial schematic diagram of the hierarchical relationship of entity types, all of which share the root node "thing".
As shown in fig. 3, when the entity type set is {thing, person, movie director, actor, artifact, movie}, the following hierarchical relationships can be constructed: movie director is subordinate to person, and person is subordinate to thing; actor is subordinate to person; movie is subordinate to artifact, and artifact is subordinate to thing. In fig. 3, the entity types at the bottom layer are: movie director, actor and movie.
For the initial entity type set {person, director} determined in step 2, the entity type at the lowest layer is "director". Thus, the final entity type is "director", and the resulting final entity type set is {director}.
Those skilled in the art will appreciate that the final entity type set may contain one or more entity types.
It is easy to see that, in this embodiment, on one hand, since the initial entity type set is the intersection of the entity type sets of the seed entities, its entity types better reflect the common characteristics of the seed entities; on the other hand, because the entity types at the lowest layer of the initial entity type set better represent the semantics of the seed entities, and the final entity type set is determined according to the hierarchical relationship of the entity types in the initial entity type set, the candidate entities selected according to the final entity type set are more likely to share specific common characteristics with the seed entities and are more likely to be added to the seed entity set as entities to be expanded. This preliminarily ensures the effectiveness of the entity set expansion method provided by the embodiment of the invention.
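As a minimal illustrative sketch, steps 1 to 3 above can be expressed as follows. The hierarchy table PARENT, the function names, and the rule that a candidate entity must carry every type in the final entity type set are assumptions made for illustration only; the embodiment does not prescribe a particular data structure.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical type hierarchy in the spirit of fig. 3: child type -> parent type.
PARENT: Dict[str, Optional[str]] = {
    "thing": None,
    "person": "thing",
    "director": "person",
    "actor": "person",
    "artifact": "thing",
    "movie": "artifact",
}


def initial_type_set(type_sets: Iterable[Set[str]]) -> Set[str]:
    """Step 2: intersect the entity type sets of all seed entities."""
    sets = list(type_sets)
    return set.intersection(*sets) if sets else set()


def ancestors(entity_type: str) -> Set[str]:
    """All types that `entity_type` is (transitively) subordinate to."""
    seen: Set[str] = set()
    parent = PARENT.get(entity_type)
    while parent is not None and parent not in seen:
        seen.add(parent)
        parent = PARENT.get(parent)
    return seen


def final_type_set(initial: Set[str]) -> Set[str]:
    """Step 3: keep only the bottommost types of each hierarchical relationship."""
    covered: Set[str] = set()
    for entity_type in initial:
        covered |= ancestors(entity_type)
    return {t for t in initial if t not in covered}


def candidates(entity_types: Dict[str, Set[str]], final: Set[str]) -> Set[str]:
    """Entities whose types include every final type (an assumed matching rule)."""
    return {e for e, types in entity_types.items() if final <= types}


seed_type_sets = [
    {"person", "director"},
    {"person", "director", "actor"},
    {"person", "director", "actor"},
]
initial = initial_type_set(seed_type_sets)
final = final_type_set(initial)
```

With the example type sets of step 1, the intersection is {person, director} and the bottommost type that remains is "director", in line with the worked example above.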
In addition, it should be noted that the method for determining candidate entities is not limited to the above-mentioned one provided in this embodiment, and other methods for determining candidate entities existing in the prior art are all applicable to the embodiments of the present invention.
In a specific implementation manner provided by the embodiment of the present invention, in step S102 in the embodiment shown in fig. 1, the determining a meta-path between seed entities from a heterogeneous information network corresponding to the target knowledge graph includes:
step 1, determining a group of nodes corresponding to seed entities in the seed entity set from the heterogeneous information network corresponding to the target knowledge graph;
step 2, each determined node is used as a source node, the heterogeneous information network is traversed, and when a target node is a seed entity except the source node, a path connecting the source node and the target node is determined as a meta-path instance;
and step 3, counting all the determined meta-path examples, and obtaining the meta-paths corresponding to all the meta-path examples according to the entity types and the relationship types contained in all the meta-path examples.
Because only the nodes corresponding to the seed entities in the seed entity set are used as source nodes when the heterogeneous information network is traversed to determine the important meta-paths, the traversal range is reduced, the efficiency of determining the meta-paths is improved, and computing resources are saved.
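The three steps above can be sketched as a bounded breadth-first traversal. The toy network EDGES, the type table NODE_TYPE, the edge names and the hop limit are hypothetical values chosen for illustration; the sketch is not the detailed procedure of fig. 4, only a simple instance of "traverse from each seed, record paths that end at another seed, then abstract them to types".

```python
from collections import Counter, deque
from typing import Dict, List, Set, Tuple

# Toy heterogeneous information network (hypothetical data):
# node -> list of (relation type, neighbouring node).
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "actor1": [("acted_in", "movie12")],
    "actor2": [("acted_in", "movie12")],
    "movie12": [("acted_in^-1", "actor1"), ("acted_in^-1", "actor2")],
}
NODE_TYPE: Dict[str, str] = {"actor1": "actor", "actor2": "actor", "movie12": "movie"}

Instance = Tuple[Tuple[str, ...], Tuple[str, ...]]  # (nodes on the path, relations)


def meta_path_instances(seeds: Set[str], max_hops: int) -> List[Instance]:
    """Steps 1 and 2: collect loop-free paths from each seed to a different seed."""
    instances: List[Instance] = []
    for source in seeds:
        queue = deque([(source, (source,), ())])
        while queue:
            node, nodes, rels = queue.popleft()
            if len(rels) == max_hops:
                continue  # do not extend paths beyond the hop limit
            for rel, nxt in EDGES.get(node, []):
                if nxt in nodes:
                    continue  # skip nodes already on this path
                path = (nodes + (nxt,), rels + (rel,))
                if nxt in seeds:
                    instances.append(path)  # target is another seed entity
                queue.append((nxt,) + path)
    return instances


def to_meta_path(nodes: Tuple[str, ...], rels: Tuple[str, ...]) -> Tuple[str, ...]:
    """Step 3: replace every node by its entity type, keeping relation types."""
    parts = [NODE_TYPE[nodes[0]]]
    for rel, node in zip(rels, nodes[1:]):
        parts += [rel, NODE_TYPE[node]]
    return tuple(parts)


def meta_paths(seeds: Set[str], max_hops: int = 4) -> Counter:
    """Count how many meta-path instances fall under each meta-path."""
    return Counter(to_meta_path(n, r) for n, r in meta_path_instances(seeds, max_hops))


found = meta_paths({"actor1", "actor2"})
```

In this toy network both seeds reach each other through the same movie, so the two instances collapse into the single meta-path actor, acted_in, movie, acted_in^-1, actor.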
Referring now to fig. 4 and 5 together, fig. 4 is a detailed flowchart of step S102 in the embodiment shown in fig. 1, that is, a flowchart of a meta-path determining method. Fig. 5 shows a schematic diagram of the determination of meta-paths using a detailed flowchart as shown in fig. 4.
In a specific implementation manner provided by the embodiment of the present invention, as shown in fig. 4, in step S102 in the embodiment shown in fig. 1, the determining a meta-path between seed entities from a heterogeneous information network corresponding to the target knowledge graph includes:
s401, determining a node set corresponding to the seed entity set from a heterogeneous information network corresponding to the target knowledge graph, wherein the node set comprises nodes corresponding to seed entities in the seed entity set;
In one embodiment, the nodes in the node set are equal in number to, and in one-to-one correspondence with, the seed entities in the seed entity set. For example, assuming that the seed entity set is actor {1, 2, 3}, the corresponding node set is also actor {1, 2, 3}.
In the embodiment of the invention, the node set is chosen in this way in order to narrow the search range, reduce the amount of calculation needed to determine each meta-path, and save computing resources.
Of course, those skilled in the art will understand that, when computing resources are abundant, the node set may also be formed from nodes that correspond to the seed entities but outnumber them. For example, assuming that the seed entity set is actor {1, 2, 3}, the corresponding node set may be actor {1, 2, 3, 1, 2, 3}.
S402, taking each node in the node set as a first node;
For convenience of description, this embodiment takes the seed entity set actor {1, 2, 3} and the corresponding node set actor {1, 2, 3} as an example.
Specifically, each node in the node set actor {1, 2, 3} is taken as a first node.
S403, taking each first node as a current source node;
Optionally, for ease of explanation, an initial structure data table may first be established.
In the embodiment of the present invention, the basic form of the structure data table is shown in table 2. In table 2, (s, t) represents an entity pair consisting of a source node s and a target node t; σ(s, t | Π) represents the similarity value of the entity pair (s, t) under the current path Π: if the entity pair (s, t) connected by the current path Π is a seed entity pair, the similarity value is a first numerical value; otherwise, it is a second numerical value. In the embodiment of the present invention, the first value is greater than the second value; in general, the first value is 1 and the second value is 0. (s, …, t) denotes all nodes visited in reaching the target node t from the source node s along the path Π. Of course, (s, …, t) does not necessarily have to be included in the structure data table.
TABLE 2
Figure GDA0002545458950000131
Specifically, the initial structure data table is shown in table a in fig. 5. In the initial situation, the currently accessed node is the first node itself, so that the source node and the target node are both the first node, the corresponding similarity value of the entity pair consisting of the source node and the target node is 0, the accessed node is the first node itself, and the similarity score of the initial structure data table is also 0.
S404, accessing a current target node connected with each current source node through a preset type edge in the heterogeneous information network, and establishing a plurality of structure data tables to be selected corresponding to the edge types;
wherein, any one of the candidate structure data tables includes: a first entity pair consisting of each first node and a current target node connected through an edge of an edge type corresponding to the structure data table to be selected, a similarity value of each first entity pair, an accessed path and a similarity score; the similarity score is the sum of similarity values of all first entity pairs;
Specifically, as shown in fig. 5, on the basis of the initial structure data table A, the current target nodes connected to the current source nodes 1, 2 and 3 through the edge "show" and the current target nodes connected to the current source nodes 1, 2 and 3 through the edge "birth" are accessed in the heterogeneous information network. Here, as an example, only the two edge types "show" and "birth" are selected for expansion, but those skilled in the art will understand that, in practical applications, the preset type of edge connecting each current source node and current target node may be of one type, two types, or more than two types.
In fig. 5, two tables of candidate structure data corresponding to two types of edges "show" and "birth" are exemplarily created together, which are table B and table C, respectively.
S405, judging whether a current target node connected with each current source node in the structure data table to be selected is a second node or not aiming at each structure data table to be selected; if so, recording the similarity value of the first entity pair corresponding to the current source node in the structure data table to be selected as a first numerical value, determining the visited path corresponding to the current source node as a meta path instance, and otherwise, recording the similarity value as a second numerical value; wherein the second node is: a node in the seed entity set, which is different from a first node corresponding to the current source node;
specifically, in table B and table C in fig. 5, since the current target node corresponding to each first node is not the second node, the similarity value of each first entity pair is exemplarily marked as 0.
S406, selecting, from the candidate structure data tables, a candidate structure data table meeting a second preset condition as the current structure data table; the second preset condition includes: the candidate structure data table stores the largest number of kinds of seed entities;
optionally, when several candidate structure data tables store the same largest number of kinds of seed entities, the second preset condition further includes: the candidate structure data table stores the smallest number of first entity pairs.
Specifically, in fig. 5, since candidate structure data table B stores more kinds of seed entities than candidate structure data table C, table B may be selected as the current structure data table.
S407, updating each current target node in the current structure data table to be a current source node, and returning to execute the step of accessing the current target node connected with each current source node through a preset type edge in the heterogeneous information network; i.e. return to step S404;
Specifically, as shown in fig. 5, the current target nodes in the current structure data table B, namely movie 12, movie 17 and movie 18, are each updated to be current source nodes, and the process returns to step S404 for table B.
In fig. 5, after step S404 is performed on table B, two candidate structure data tables corresponding to the two edge types "director⁻¹" and "authoring⁻¹" are exemplarily established, namely table D and table E.
Note that, in fig. 5, the superscript "-1" in the edges "director⁻¹" and "authoring⁻¹" denotes the inverse relationship; that is, "director⁻¹" is the inverse relationship of "director". For example, when movie 12 is connected to person 7 by the edge "director⁻¹", movie 12 was directed by person 7; when person 7 is connected to movie 12 by the edge "director", person 7 directed movie 12. Additionally, the last row in structure data tables B and D to H stands for the first entity pairs that are not listed.
Likewise, in tables D and E of fig. 5, since the current target node corresponding to each first node is not the second node, the similarity value of each first entity pair is exemplarily marked as 0.
Further, in fig. 5, since candidate structure data table D stores more kinds of seed entities than candidate structure data table E, table D may be selected as the current structure data table, and the process returns to step S404.
After step S404 is performed on table D, two candidate structure data tables F and G corresponding to two types of edges of "authoring" and "editing" are exemplarily created. After steps S405 and S406 are performed on tables F and G, it is determined that the current structure data table is H. In table H, since the current target nodes corresponding to the first nodes 1,2, and 3 are all the second nodes, the similarity values of the first entity pairs (1,2), (2,3), and (3,1) may be exemplarily marked as 1.
S408, when the length of the visited path in each current structure data table is larger than a third preset value or when the number of the seed entities in each current structure data table is smaller than a fourth preset value, counting all the determined meta-path instances to obtain meta-paths corresponding to all the meta-path instances.
The third preset value may be a preset maximum length of the accessed path, and the fourth preset value may be a preset minimum value that the number of seed entities in the structure data table should satisfy.
Finally, as shown in table H, an important meta path with a length of 4 hops can be determined, for example:
Figure GDA0002545458950000151
In this embodiment, because the determined meta-paths are important meta-paths connecting seed entity pairs, they reflect the specific common features between the seed entities more accurately. Entity set extension that uses the important meta-paths determined by the meta-path determining method of the embodiment shown in fig. 4 is therefore more effective.
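One round of steps S404 to S406 can be sketched as follows. The Row record, the edge dictionary and the toy data are hypothetical; in particular, the sketch assumes that "the number of kinds of seed entities stored in a table" means the number of distinct first nodes appearing in its first entity pairs, which is one possible reading of the second preset condition.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Row:
    source: str            # first node s (a seed entity)
    target: str            # current target node t
    path: Tuple[str, ...]  # nodes visited from s to t
    sim: int = 0           # similarity value: 1 for a seed entity pair, else 0


Edges = Dict[Tuple[str, str], List[str]]  # (node, edge type) -> neighbouring nodes


def expand(rows: List[Row], edge_type: str, edges: Edges, seeds: Set[str]) -> List[Row]:
    """S404/S405: follow one edge type from every current target node."""
    table: List[Row] = []
    for row in rows:
        for nxt in edges.get((row.target, edge_type), []):
            # "Second node" test: a seed entity other than the row's first node.
            hit = nxt in seeds and nxt != row.source
            table.append(Row(row.source, nxt, row.path + (nxt,), int(hit)))
    return table


def similarity_score(table: List[Row]) -> int:
    """Sum of the similarity values of all first entity pairs in a table."""
    return sum(row.sim for row in table)


def select_current(tables: Dict[str, List[Row]]) -> str:
    """S406: most distinct seed entities first, fewest entity pairs as tie-break."""
    def key(name: str) -> Tuple[int, int]:
        rows = tables[name]
        return (-len({r.source for r in rows}), len(rows))
    return min(tables, key=key)


seeds = {"1", "2", "3"}
edges: Edges = {
    ("1", "show"): ["movie12"], ("2", "show"): ["movie17"], ("3", "show"): ["movie18"],
    ("1", "birth"): ["city5"], ("2", "birth"): ["city5"],
}
table_a = [Row(s, s, (s,)) for s in sorted(seeds)]  # initial table: s == t
tables = {e: expand(table_a, e, edges, seeds) for e in ("show", "birth")}
current = select_current(tables)
```

In this toy data the "show" table covers all three seeds while the "birth" table covers only two, so the "show" table is selected, mirroring the choice of table B over table C in fig. 5. Repeating expand and select_current on the selected table corresponds to the return to step S404 in step S407.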
Optionally, in the embodiment shown in fig. 4 of the present invention, the candidate structure data table also includes all nodes that have been visited, and a row of the candidate structure data table consisting of "a first entity pair, the similarity value of the first entity pair, and all visited nodes corresponding to the first entity pair" is referred to as a tuple; that is, a row of table 2 consisting of "(s, t), σ(s, t | Π), and (s, …, t)" is referred to as a tuple. On this basis, after step S404 and before step S405, the meta path determining method further includes:
judging whether each current target node is an accessed node stored in a tuple where the current target node is located;
if not, executing step S405; if yes, after deleting the tuple where the current target node is located from the corresponding structure data table to be selected, executing step S405.
It is easy to see that, in this embodiment, because every tuple of a candidate structure data table records all visited nodes, and each current target node is checked against those visited nodes, the determined meta-paths are prevented from forming loops; this avoids endless traversal of the heterogeneous information network and improves the efficiency of determining the meta-paths.
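The loop check described above can be sketched in a few lines. The tuple layout (source, current target, nodes visited before the target) and the node names are assumptions made for illustration.

```python
from typing import List, Tuple

# One tuple of a candidate table (hypothetical layout):
# (source s, current target t, nodes visited before reaching t).
TableRow = Tuple[str, str, Tuple[str, ...]]


def drop_loops(rows: List[TableRow]) -> List[TableRow]:
    """Delete every tuple whose current target node was already visited."""
    return [(s, t, visited) for s, t, visited in rows if t not in visited]


rows: List[TableRow] = [
    ("1", "person7", ("1", "movie12")),  # new node: tuple is kept
    ("1", "1", ("1", "movie12")),        # returns to the source: tuple is deleted
]
kept = drop_loops(rows)
```

After the filter, only the tuple that reaches the unvisited node person7 remains, and step S405 would then be applied to the remaining tuples.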
Optionally, in a specific implementation manner provided by the embodiment of the present invention, in step S406 in the embodiment shown in fig. 4, that is, selecting, from the structure data tables to be selected, a structure data table to be selected that meets a second preset condition as the current structure data table includes:
and selecting the structure data table to be selected which meets a second preset condition as the current structure data table from a plurality of structure data tables to be selected of which the similarity scores are not larger than the first preset value.
It is easy to see that, when the structure data table to be selected meeting the second preset condition is selected as the current structure data table from the plurality of structure data tables to be selected whose similarity scores are not greater than the first preset value, the meta path search range can be further narrowed, the calculation amount is reduced, and the calculation resources are further saved.
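As a small sketch of this pre-filter, the candidate tables can be restricted by similarity score before the second preset condition is applied. The threshold formula m(m-1)/2+1 is the setting reported for the experiments in this description, where m is the number of seed entities; the table names and scores below are hypothetical.

```python
from typing import Dict, Set


def first_preset_value(m: int) -> int:
    """Threshold used in the experiments: m(m-1)/2 + 1 for m seed entities."""
    return m * (m - 1) // 2 + 1


def eligible_tables(scores: Dict[str, int], m: int) -> Set[str]:
    """Keep candidate tables whose similarity score does not exceed the threshold."""
    limit = first_preset_value(m)
    return {name for name, score in scores.items() if score <= limit}


# Hypothetical similarity scores of two candidate tables for m = 3 seeds.
remaining = eligible_tables({"F": 0, "G": 6}, m=3)
```

For three seed entities the threshold is 4, so a table with score 6 is excluded and the selection under the second preset condition only considers the remaining table.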
To further illustrate the effectiveness of the entity set extension method provided by the embodiment of the present invention, the applicant verifies the method through experiments, and the specific verification process is as follows:
1) determining a target knowledge graph
The applicant takes the classical Yago knowledge graph as the target knowledge graph; the data in Yago mainly come from Wikipedia, WordNet and GeoNames. Yago contains about ten million entities and 120 million facts. Three parts of its data, comprising 35 relations, 1.3 million entities and over three thousand entity types, are mainly used as the data source. Table 3 describes these three parts of the data in detail.
TABLE 3
Figure GDA0002545458950000171
2) Determining a validation set
The applicant selects four representative verification sets to verify the validity of the entity set extension method provided by the embodiment of the invention: actors who appeared in movies directed by Steven Spielberg; software produced by companies located in Mountain View, California; movies whose directors won the National Film Award; and scientists of universities located in Cambridge, Massachusetts. The entity categories of the four verification sets are denoted Actor*, Software*, Movie* and Scientist*, and the numbers of entities in the four verification sets are 112, 98, 653 and 202, respectively.
3) Evaluation criteria for effectiveness
The p@k and MAP criteria are used to measure effectiveness. p@k represents the percentage of positive examples among the top k results after the candidate entities in the candidate entity set are ranked by importance.
The evaluation is mainly carried out with the three criteria p@30, p@60 and p@90. The MAP criterion is the average of the accuracies p@30, p@60 and p@90, and is specifically expressed as:
Figure GDA0002545458950000172
where rel_i = 1 if the candidate entity at position i is a positive example, and rel_i = 0 otherwise.
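The two criteria can be sketched as follows. The sketch follows only the prose description above (p@k as the share of positive examples among the top k results, and the averaged criterion as the mean of p@30, p@60 and p@90); the formula given as an image in the original publication is not reproduced here and may differ in detail. The function names and the example ranking are hypothetical.

```python
from typing import Sequence


def precision_at_k(rel: Sequence[int], k: int) -> float:
    """Share of positive examples (rel_i = 1) among the top-k ranked candidates."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(rel[:k]) / k


def mean_precision(rel: Sequence[int], ks: Sequence[int] = (30, 60, 90)) -> float:
    """Average of p@k over the given cut-offs, as described in the prose."""
    return sum(precision_at_k(rel, k) for k in ks) / len(ks)


# Hypothetical ranking: 30 positives, then 30 negatives, then 30 positives.
rel = [1] * 30 + [0] * 30 + [1] * 30
p30 = precision_at_k(rel, 30)
p60 = precision_at_k(rel, 60)
avg = mean_precision(rel)
```

For this ranking p@30 is 1.0 and p@60 is 0.5; the averaged value additionally includes p@90, which is 60/90.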
3) Determining comparison objects
The Entity set extension method (MP _ ESE for short) provided by the embodiment of the present invention is compared with the following three methods:
(1) Link-Based: an entity set extension method based on the one-hop link relationships of entities, inspired by pattern-based methods for text or web pages.
(2) Nearest-Neighbor: an entity set extension method based on nearest neighbors, which considers the nearest neighbors in terms of both one-hop links and one-hop entities.
(3) PCRW (Path-Constrained Random Walk): a method based on random walks along paths in a heterogeneous network, used here as an entity set expansion method based on 2-hop link relationships.
For each method, three seeds were randomly selected from the verification set for each experiment; each method was run 30 times and the results were averaged for comparison. In the entity set extension method provided in the embodiment of the present invention, the first preset value is set to m(m-1)/2+1, where m is the number of seed entities, and the maximum path length of a meta-path is set to 4.
4) Verification result
The verification results are shown in fig. 6A to 6D, which correspond in turn to the entity categories Actor*, Movie*, Software* and Scientist*. As can be seen from fig. 6A to 6D, when the method provided by the embodiment of the present invention is applied to entity set expansion, its accuracy is higher than that of the baseline methods, especially in the "Actor*" and "Movie*" categories. The baseline methods have low accuracy in these two categories because one-hop or two-hop links cannot distinguish fine-grained entity categories well, whereas the meta-paths of the method provided by the embodiment of the invention have more hops and can distinguish fine-grained entity categories well, so its accuracy is high. In the "Software*" category, the accuracy of the method provided by the embodiment of the invention is similar to that of the PCRW method, because "Software*" is an overlapping class that has another coarse-grained entity class in addition to the given entity class, namely software produced by the same company.
In addition, as shown in fig. 6A to 6D, the accuracy of the Link-Based algorithm is significantly lower in every category than that of the entity set expansion method provided in the embodiment of the present invention, because the Link-Based algorithm relies on one-hop links, which carry very little semantic information and cannot accurately reflect the specific common features between the seed entities.
To further illustrate the effectiveness of the entity set extension method provided by the embodiment of the present invention, table 4 lists the first three important meta-paths determined by the method in the "Actor*" category. As can be seen from table 4, these meta-paths reflect potential specific common features between the seed entities of the "Actor*" category; with these meta-paths, more entities belonging to this category can be determined as entities to be expanded.
TABLE 4
Figure GDA0002545458950000191
In summary, the entity set extension method provided by the embodiment of the present invention is more effective than the three baseline methods described above.
Corresponding to the above method embodiment, an embodiment of the present invention further provides an entity set extension apparatus, which is described in detail below.
As shown in fig. 7, an embodiment of the present invention provides an entity set extension apparatus, where the apparatus includes: a candidate entity set determining module 701, a meta path determining module 702, a first importance determining module 703, a second importance determining module 704, and an entity set expanding module 705;
a candidate entity set determining module 701, configured to extract candidate entities from the target knowledge graph according to a predetermined seed entity set, and form a candidate entity set with the extracted candidate entities; the target knowledge-graph includes at least seed entities in the set of seed entities;
a meta-path determining module 702, configured to determine meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge-graph; the meta path is: a connection path composed of an entity type and a relationship type between two node types in the heterogeneous information network; wherein the two node types are node types corresponding to different seed entities in the seed entity set;
a first importance determining module 703, configured to determine a first importance of each meta-path according to the number of seed entity pairs connected to each meta-path;
a second importance determining module 704, configured to determine a second importance of each candidate entity in the candidate entity set according to the first importance of each meta-path;
an entity set expanding module 705, configured to determine, as entities to be expanded, the candidate entities in the candidate entity set whose second importance satisfies a first preset condition, and add the entities to be expanded to the seed entity set.
On one hand, the entity set expansion device provided by the embodiment of the invention carries out entity set expansion by taking a target knowledge graph with huge data content as a data source; on the other hand, meta-paths between the seed entities are determined from the heterogeneous information network corresponding to the target knowledge graph, and each determined meta-path is a path connecting a seed entity pair, so that the meta-paths can accurately reflect potential common features between the seed entities, the second importance of the candidate entities determined from the first importance of the meta-paths is more effective, and the entities to be expanded determined according to the second importance are more effective. Therefore, the entity set extension device provided by the embodiment of the invention can improve the effectiveness of entity set extension.
In a specific implementation manner provided in the embodiment of the present invention, the candidate entity set determining module 701 in the embodiment shown in fig. 7 may specifically include: an entity type set determining submodule, an initial entity type set determining submodule and a final entity type set determining submodule;
the entity type set determining submodule is used for determining an entity type set of each seed entity in a predetermined seed entity set;
an initial entity type set determining submodule, configured to determine an intersection of all entity type sets as an initial entity type set;
a final entity type set determining submodule, configured to determine a final entity type set corresponding to the seed entity set according to a hierarchical relationship between entity types in the initial entity type set; and taking the entity which accords with the entity type in the final entity type set in the target knowledge graph as a candidate entity.
More specifically, the final entity type set determining sub-module may include: a first determination unit and a second determination unit.
A first determining unit, configured to determine at least one hierarchical relationship corresponding to the initial entity type set, where any hierarchical relationship is a dependency relationship between at least two entity types;
and the second determining unit is used for determining the entity type positioned at the bottommost layer in each hierarchical relationship as a final entity type and forming the determined final entity type into a final entity type set.
It is easy to see that, in this embodiment, on one hand, since the initial entity type set is the intersection of the entity type sets of the seed entities, its entity types better reflect the common characteristics of the seed entities; on the other hand, the entity types at the bottom layer of the initial entity type set better represent the semantics of the seed entities. Because the final entity type set is determined according to the hierarchical relationship of the entity types in the initial entity type set, the candidate entities selected according to the final entity type set are more likely to share specific common characteristics with the seed entities and are more likely to be added to the seed entity set as entities to be expanded, so that the effectiveness of entity set expansion is better ensured.
In a specific implementation manner provided in the embodiment of the present invention, the meta-path determining module 702 in the embodiment shown in fig. 7 may include: the node determination sub-module, the traversal module and the determination sub-module.
A node determination submodule for determining a set of nodes corresponding to seed entities in the set of seed entities from the heterogeneous information network corresponding to the target knowledge-graph;
a traversal module, configured to traverse the heterogeneous information network with each determined node as a source node, and determine a path connecting the source node and a target node as a meta-path instance when the target node is a seed entity other than the source node itself;
and the determining submodule is used for counting all the determined meta-path examples and obtaining the meta-paths corresponding to all the meta-path examples according to the entity types and the relationship types contained in all the meta-path examples.
It is obvious that, because only the nodes corresponding to the seed entities in the seed entity set are used as source nodes when the heterogeneous information network is traversed to determine the important meta-paths, the traversal range is reduced, so that the efficiency of determining the meta-paths is improved and computing resources are saved.
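The traversal performed by these three submodules might look like the sketch below. It makes several assumptions that are not in the text: the network is given as an adjacency dictionary `adj` of (relation type, neighbour) pairs, node types come from a `node_type` dictionary, paths are restricted to simple paths, and the search is cut off at a hypothetical `max_len` number of edges so that it terminates.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (relation type, neighbouring node)


def meta_path_instances(adj: Dict[str, List[Edge]], seeds: Set[str],
                        max_len: int) -> List[List[str]]:
    """Collect paths that start at a seed node and reach a different seed node.

    A path is stored as [node, relation, node, relation, node, ...].
    """
    instances: List[List[str]] = []
    for source in seeds:
        stack = [[source]]
        while stack:
            path = stack.pop()
            if len(path) // 2 >= max_len:      # number of edges already used
                continue
            for rel, nxt in adj.get(path[-1], []):
                if nxt in path[::2]:           # node already on this path
                    continue
                new_path = path + [rel, nxt]
                if nxt in seeds:               # target is another seed entity
                    instances.append(new_path)
                stack.append(new_path)
    return instances


def to_meta_path(instance: List[str], node_type: Dict[str, str]) -> Tuple[str, ...]:
    """Replace every node by its entity type; relation types stay as they are."""
    return tuple(node_type[x] if i % 2 == 0 else x for i, x in enumerate(instance))


# Hypothetical toy network: two seed actors linked through one movie.
adj = {
    "a1": [("acted_in", "m1")],
    "a2": [("acted_in", "m1")],
    "m1": [("acted_in_inv", "a1"), ("acted_in_inv", "a2")],
}
node_type = {"a1": "Actor", "a2": "Actor", "m1": "Movie"}

instances = meta_path_instances(adj, {"a1", "a2"}, max_len=2)
meta_paths = Counter(to_meta_path(p, node_type) for p in instances)
```

Here both instances (from `a1` to `a2` and from `a2` to `a1`) collapse into the single meta-path Actor, acted_in, Movie, acted_in_inv, Actor.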
As shown in fig. 8, in a specific implementation manner provided in the embodiment of the present invention, the meta-path determining module 702 may include: a node set determining sub-module 801, a first node determining sub-module 802, a current source node determining sub-module 803, a candidate structure data table establishing sub-module 804, a first judging sub-module 805, a selecting sub-module 806, an updating sub-module 807 and a meta-path determining sub-module 808;
a node set determining submodule 801, configured to determine a node set corresponding to the seed entity set from a heterogeneous information network corresponding to the target knowledge graph, where the node set includes nodes corresponding to seed entities in the seed entity set;
a first node determining submodule 802, configured to take each node in the node set as a first node;
a current source node determining submodule 803, configured to use each first node as a current source node;
a candidate structure data table establishing submodule 804, configured to access, in the heterogeneous information network, the current target nodes connected to each current source node through edges of a preset type, and to establish a plurality of candidate structure data tables corresponding to the edge types;
a first judging submodule 805, configured to judge, for each candidate structure data table, whether the current target node connected to each current source node in that table is a second node; if so, to record the similarity value of the first entity pair corresponding to the current source node in the candidate structure data table as a first numerical value and determine the visited path corresponding to the current source node as a meta-path instance; otherwise, to record the similarity value as a second numerical value; wherein the second node is a node in the seed entity set that is different from the first node corresponding to the current source node;
a selecting submodule 806, configured to select, from the candidate structure data tables, a candidate structure data table meeting a second preset condition as the current structure data table; the second preset condition includes: the candidate structure data table stores the greatest variety of seed entities;
an updating submodule 807, configured to update each current target node in the current structure data table to a current source node, and trigger the candidate structure data table establishing submodule 804;
the meta-path determining sub-module 808 is configured to, when the length of the visited path in each current structure data table is greater than a third preset value or when the number of seed entities in each current structure data table is less than a fourth preset value, count all determined meta-path instances, and obtain meta-paths corresponding to all the meta-path instances according to the entity types and the relationship types included in all the meta-path instances.
The third preset value may be a preset maximum length of the accessed path, and the fourth preset value may be a preset minimum value that the number of seed entities in the structure data table should satisfy.
In this embodiment, because the determined meta-paths are important meta-paths connecting pairs of seed entities, they more accurately reflect the specific common features among the seed entities. Accordingly, when the apparatus of the embodiment shown in fig. 8 performs entity set extension using the determined important meta-paths, the accuracy of the extension is higher.
Optionally, in the embodiment shown in fig. 8 of the present invention, each candidate structure data table further includes all the nodes that have been accessed, and a row in the candidate structure data table, consisting of a first entity pair, the similarity value of the first entity pair, and all the accessed nodes corresponding to the first entity pair, is referred to as a tuple. On this basis, the meta-path determining module 702 may further include the following submodules, which operate after the candidate structure data table establishing sub-module 804 is triggered and before the first judging sub-module 805 is triggered:
a second judging submodule, configured to judge whether each current target node is an accessed node stored in the tuple in which the current target node is located;
and a triggering submodule, configured to trigger the candidate structure data table establishing submodule 804 when the judgment result obtained by the second judging submodule is negative, and, when the judgment result is affirmative, to delete the tuple in which the current target node is located from the corresponding candidate structure data table and then trigger the candidate structure data table establishing submodule 804.
It is easy to see that, in this embodiment, because all accessed nodes are recorded in each tuple of the candidate structure data table, and each current target node is checked against the accessed nodes when it is determined, the determined meta-paths are prevented from forming loops, which avoids endless traversal of the heterogeneous information network and improves the efficiency of determining the meta-paths.
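The loop check on tuples can be illustrated with a small sketch. The `Row` class, the `expand_one_hop` function and the adjacency format are hypothetical names chosen for this example. One simplification is made and should be read as an assumption: instead of first building a tuple and then deleting it when its target node was already accessed, the sketch never creates such a tuple, which yields the same tables.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (edge type, neighbouring node)


@dataclass(frozen=True)
class Row:
    """One tuple of a candidate structure data table (simplified)."""
    first_node: str             # seed node the path started from
    visited: Tuple[str, ...]    # accessed nodes in order; visited[-1] is the current node


def expand_one_hop(rows: List[Row], adj: Dict[str, List[Edge]]) -> Dict[str, List[Row]]:
    """Advance every row by one edge and group the results by edge type.

    A row whose new target node is already among its accessed nodes is
    dropped, so no path can form a loop.
    """
    tables: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        for edge_type, target in adj.get(row.visited[-1], []):
            if target in row.visited:
                continue
            tables[edge_type].append(Row(row.first_node, row.visited + (target,)))
    return dict(tables)


# Hypothetical example: from movie m1 the path may continue to a2 but not back to a1.
adj = {"m1": [("acted_in_inv", "a1"), ("acted_in_inv", "a2")]}
tables = expand_one_hop([Row("a1", ("a1", "m1"))], adj)
```

Storing the accessed nodes inside each row keeps the check local to that row, so two different paths may still pass through the same node.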
Optionally, in a specific implementation manner provided by the embodiment of the present invention, the selecting sub-module 806 in the embodiment shown in fig. 8 is specifically configured to select, from a plurality of candidate structure data tables whose similarity scores are not greater than the first preset value, a candidate structure data table that meets a second preset condition as the current structure data table.
It is easy to see that, when the current structure data table is selected only from the candidate structure data tables whose similarity scores are not greater than the first preset value, the meta-path search range is further narrowed and the amount of calculation is reduced, which further improves the efficiency of determining the meta-paths and saves computing resources.
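A possible reading of this selection step is sketched below. Several points are assumptions rather than statements from the text: the first and second numerical values are taken to be 1 and 0, "variety of seed entities" is read as the number of distinct seed nodes that appear as first nodes in a table, and the tie-break on the smallest number of first entity pairs follows claim 4. The names `similarity_score`, `select_current_table` and `max_score` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Pair = Tuple[str, str]  # (first node, i.e. a seed node; current target node)


def similarity_score(pairs: List[Pair], seeds: Set[str],
                     hit: float = 1.0, miss: float = 0.0) -> float:
    """Sum of similarity values; a pair scores `hit` when its target is another seed."""
    return sum(hit if (t in seeds and t != s) else miss for s, t in pairs)


def select_current_table(tables: Dict[str, List[Pair]], seeds: Set[str],
                         max_score: float) -> Optional[str]:
    """Return the edge type of the table to expand next, or None if none qualifies."""
    eligible = {rel: pairs for rel, pairs in tables.items()
                if similarity_score(pairs, seeds) <= max_score}
    if not eligible:
        return None
    # Most distinct seed entities first; among equals, the fewest entity pairs.
    return max(eligible,
               key=lambda rel: (len({s for s, _ in eligible[rel]}), -len(eligible[rel])))


# Hypothetical candidate tables keyed by edge type.
seeds = {"a1", "a2"}
tables = {
    "acted_in": [("a1", "m1"), ("a2", "m1"), ("a2", "m2")],
    "born_in": [("a1", "c1"), ("a2", "c2")],
    "spouse": [("a1", "p1")],
}
chosen = select_current_table(tables, seeds, max_score=0.5)
```

In this example `acted_in` and `born_in` both cover two seed entities, and `born_in` is chosen because it holds fewer entity pairs.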
In a specific implementation manner provided by the embodiment of the present invention, the first importance determining module 703 in the embodiment shown in fig. 7 is specifically configured to determine, according to all seed entity pairs connected to each meta-path, a total number of seed entity pairs connected to each meta-path; determining a first importance degree of each meta-path according to the total number of the seed entity pairs connected with each meta-path and a first preset model;
wherein the first preset model is:
Figure GDA0002545458950000231
the physical meanings of the parameters are the same as those in the above method embodiments, and are not described herein again.
It is obvious that the first importance degree is positively correlated with the total number of the seed entity pairs connected with the meta-path, and the more the seed entity pairs connected with the meta-path are, the more the meta-path can reflect the specific common characteristics among the seed entities, so that the first importance degree value determined according to the total number of the seed entity pairs connected with the meta-path is more accurate.
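Because the formula image of the first preset model is not reproduced in this text, the following sketch only illustrates the stated property, namely that the first importance degree grows with the number of seed entity pairs a meta-path connects. The concrete form used here (coverage of the m(m-1)/2 possible seed pairs, optionally normalised so that the weights sum to 1) is an assumption, and `first_importance` is a hypothetical name.

```python
from typing import Dict


def first_importance(pair_counts: Dict[str, int], m: int,
                     normalize: bool = True) -> Dict[str, float]:
    """Weight each meta-path by the share of seed entity pairs it connects.

    pair_counts maps a meta-path identifier to the number of seed entity
    pairs it connects; m is the number of seed entities.
    ASSUMPTION: this is not the patent's formula, only a monotone stand-in.
    """
    total_pairs = m * (m - 1) / 2          # number of unordered seed pairs
    if total_pairs <= 0:
        raise ValueError("at least two seed entities are required")
    coverage = {p: count / total_pairs for p, count in pair_counts.items()}
    if not normalize:
        return coverage
    norm = sum(coverage.values())
    return {p: (c / norm if norm else 0.0) for p, c in coverage.items()}


# Hypothetical counts for two meta-paths over three seed entities.
weights = first_importance({"P1": 3, "P2": 1}, m=3)
```

With three seed entities there are three seed pairs; `P1` connects all of them and therefore receives the larger weight.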
In a specific implementation manner provided by the embodiment of the present invention, the second importance level determining module 704 in the embodiment shown in fig. 7 is configured to determine the second importance level of each candidate entity in the candidate entity set according to the first importance level of each meta-path and a second preset model;
wherein the second preset model is:
Figure GDA0002545458950000232
the physical meanings of the parameters are the same as those in the above method embodiments, and are not described herein again.
It is easy to find that the second importance degree is in positive correlation with the first importance degree, and the greater the first importance degree of a certain meta-path is, the more the meta-path can reflect the specific common features among the seed entities, so that the second importance degree of the candidate entities determined according to the first importance degree is more accurate.
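The second preset model is likewise not reproduced in this text. The sketch below assumes the simplest form consistent with the parameter definitions in the claims: for each candidate entity, the first importance degrees of all meta-paths that connect it to a seed entity are summed over all seed entities. Any normalising factor in the original formula is omitted, and `second_importance` and `connects` are hypothetical names.

```python
from typing import Dict, Iterable, Set, Tuple


def second_importance(candidates: Iterable[str], seeds: Iterable[str],
                      weights: Dict[str, float],
                      connects: Set[Tuple[str, str, str]]) -> Dict[str, float]:
    """Score each candidate by the weighted meta-paths linking it to the seeds.

    connects holds (candidate, seed, meta-path) triples for which the
    indicator r equals 1; every other combination counts as 0.
    ASSUMPTION: unnormalised weighted sum, not the patent's exact formula.
    """
    seed_list = list(seeds)
    return {
        c: sum(w for s in seed_list for p, w in weights.items() if (c, s, p) in connects)
        for c in candidates
    }


# Hypothetical weights and connections.
weights = {"P1": 0.75, "P2": 0.25}
connects = {
    ("c1", "s1", "P1"), ("c1", "s2", "P1"), ("c1", "s1", "P2"),
    ("c2", "s2", "P2"),
}
scores = second_importance(["c1", "c2"], ["s1", "s2"], weights, connects)
```

A candidate reached from many seed entities through highly weighted meta-paths, such as `c1` here, ends up with the larger score.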
In a specific implementation manner provided by the embodiment of the present invention, the entity set expanding module 705 in the embodiment shown in fig. 7 is specifically configured to determine, as the entity to be expanded, the candidate entity whose second importance degree is greater than a second preset value in the candidate entity set.
In another specific implementation manner provided by the embodiment of the present invention, the entity set expanding module 705 in the embodiment shown in fig. 7 is specifically configured to rank, according to the second importance degree, the candidate entities in the candidate entity set in a descending order to obtain a first candidate entity set; and selecting a first preset number of candidate entities ranked at the top from the first candidate entity set as entities to be expanded.
The applicant verified the validity of the first preset number of selected entities to be expanded against the target knowledge graph using corresponding ranking metrics, thereby verifying the validity of the method.
In the two embodiments, the entity to be expanded is determined according to the second importance degree, and the second importance degree can correctly reflect the specific common characteristics between the candidate entity and the seed entity, so that the entity to be expanded determined according to the second importance degree is more effective, and the validity of entity expansion is ensured.
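The two selection variants can be sketched side by side. The function names are hypothetical, and ties in the ranking variant are broken by insertion order, which the text does not specify.

```python
from typing import Dict, List


def expand_by_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Variant 1: keep candidates whose second importance degree exceeds the threshold."""
    return [c for c, s in scores.items() if s > threshold]


def expand_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Variant 2: sort candidates by second importance degree, descending, and keep the first k."""
    return sorted(scores, key=lambda c: scores[c], reverse=True)[:k]


# Hypothetical second importance degrees.
scores = {"c1": 1.75, "c2": 0.25, "c3": 0.9}
by_threshold = expand_by_threshold(scores, threshold=0.5)
top_two = expand_top_k(scores, k=2)

# The selected entities are then added to the seed entity set.
seed_set = {"s1", "s2"} | set(top_two)
```

The threshold variant returns a variable number of entities, while the ranking variant always returns at most the preset number.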
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for entity set extension, the method comprising:
extracting candidate entities from the target knowledge graph according to a predetermined seed entity set, and forming the candidate entities obtained by extraction into a candidate entity set; the target knowledge-graph includes at least seed entities in the set of seed entities; the seed entities are preset according to a given specific semantic type, and a set formed by all the seed entities is a seed entity set; the target knowledge graph refers to a knowledge graph related to a predetermined seed entity; candidate entities are entities that have a particular common trait with a seed entity, the particular common trait including: the entity types are the same; the knowledge graph is a data set and is composed of < subject, predicate and object > triples, wherein the subject and the object respectively correspond to an entity, the predicate represents the relationship or the attribute between the subject and the object, and the types of the subject and the object and the relationship or the attribute between the subject and the object are not limited in the knowledge graph;
determining meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge graph; the meta path is: a connection path composed of an entity type and a relationship type between two node types in the heterogeneous information network; wherein the two node types are node types corresponding to different seed entities in the seed entity set; in the heterogeneous information network, one node represents one entity object, and one edge represents the relationship between two entity objects connected by the edge;
determining a first importance degree of each meta-path according to the number of seed entity pairs connected with each meta-path;
determining a second degree of importance of each candidate entity in the set of candidate entities according to the first degree of importance of each meta-path;
and determining the candidate entities with the second importance degree meeting a first preset condition in the candidate entity set as entities to be expanded, and adding the entities to be expanded into the seed entity set.
2. The method of claim 1, wherein extracting candidate entities from a target knowledge-graph according to a predetermined set of seed entities comprises:
determining a set of entity types of each seed entity in a predetermined set of seed entities;
determining the intersection of all entity type sets as an initial entity type set;
determining a final entity type set corresponding to the seed entity set according to the hierarchical relationship of each entity type in the initial entity type set; and taking the entity which accords with the entity type in the final entity type set in the target knowledge graph as a candidate entity.
3. The method of claim 2, wherein determining a final set of entity types based on the hierarchical relationship of the entity types in the initial set of entity types comprises:
determining at least one hierarchical relationship corresponding to the initial entity type set, wherein any hierarchical relationship is a dependency relationship of at least two entity types;
and determining the entity type positioned at the bottommost layer in each hierarchical relationship as a final entity type, and composing the determined final entity types into a final entity type set.
4. The method of claim 1, wherein determining meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge-graph comprises:
determining a node set corresponding to the seed entity set from a heterogeneous information network corresponding to the target knowledge graph, wherein the node set comprises nodes corresponding to seed entities in the seed entity set;
taking each node in the node set as a first node;
taking each first node as a current source node, accessing a current target node connected with each current source node through a preset type edge in the heterogeneous information network, and establishing a plurality of structure data tables to be selected corresponding to the edge type; wherein, any one of the candidate structure data tables includes: a first entity pair consisting of each first node and a current target node connected through an edge of an edge type corresponding to the structure data table to be selected, a similarity value of each first entity pair, an accessed path and a similarity score; the similarity score is the sum of similarity values of all first entity pairs;
judging whether a current target node connected with each current source node in the structure data table to be selected is a second node or not aiming at each structure data table to be selected; if so, recording the similarity value of the first entity pair corresponding to the current source node in the structure data table to be selected as a first numerical value, determining the visited path corresponding to the current source node as a meta path instance, and otherwise, recording the similarity value as a second numerical value; wherein the second node is: a node in the node set, which is different from a first node corresponding to a current source node;
selecting a structure data table to be selected which meets a second preset condition from the structure data tables to be selected as a current structure data table; the second preset condition includes: the variety of the seed entities stored in the structure data table to be selected is the largest; when there are a plurality of the stored structure data tables to be selected with the most variety of seed entities, the second preset condition further includes: the number of the first entity pairs stored in the structure data table to be selected is minimum;
updating each current target node in the current structure data table to be a current source node, and returning to execute the step of accessing the current target node connected with each current source node through a preset type edge in the heterogeneous information network;
and when the length of the accessed path in each current structure data table is greater than a third preset value or the number of the seed entities in each current structure data table is less than a fourth preset value, counting all the determined meta-path instances, and obtaining meta-paths corresponding to all the meta-path instances according to the entity types and the relationship types contained in all the meta-path instances.
5. The method according to claim 4, wherein the selecting, from the candidate structure data tables, a candidate structure data table satisfying a second preset condition as the current structure data table comprises:
and selecting the structure data table to be selected which meets a second preset condition as the current structure data table from a plurality of structure data tables to be selected of which the similarity scores are not larger than the first preset value.
6. The method according to any one of claims 1 to 4, wherein determining the first degree of importance of each meta-path according to the number of pairs of seed entities connected by each meta-path comprises:
determining the total number of seed entity pairs connected with each meta-path according to all the seed entity pairs connected with each meta-path;
determining a first importance degree of each meta-path according to the total number of the seed entity pairs connected with each meta-path and a first preset model;
wherein the first preset model is:
Figure FDA0002545458940000031
wherein w_k is the first importance degree corresponding to meta-path P_k, and l is the number of meta-paths;
Figure FDA0002545458940000032
SP_k is the total number of seed entity pairs connected by meta-path P_k, m is the number of seed entities, and
Figure FDA0002545458940000033
is the total number of pairs of seed entities.
7. The method of any one of claims 1-4, wherein determining the second degree of importance for each candidate entity in the set of candidate entities based on the first degree of importance for each meta-path comprises:
determining a second importance degree of each candidate entity in the candidate entity set according to the first importance degree of each meta-path and a second preset model;
wherein the second preset model is:
Figure FDA0002545458940000041
wherein R(c_i, S) represents the second importance degree of candidate entity c_i, and n is the number of candidate entities; s_j represents a seed entity, S represents the seed entity set, and m is the number of seed entities; w_k is the first importance degree corresponding to meta-path P_k, and l is the number of meta-paths; r{(c_i, s_j) | P_k} indicates whether meta-path P_k connects seed entity s_j and candidate entity c_i: if so, r = 1; otherwise, r = 0.
8. The method according to any one of claims 1 to 4, wherein the determining, as the entity to be expanded, the candidate entity in the candidate entity set whose second importance degree satisfies a first preset condition includes:
and determining the candidate entities with the second importance degree larger than a second preset value in the candidate entity set as entities to be expanded.
9. The method according to any one of claims 1 to 4, wherein the determining, as the entity to be expanded, the candidate entity in the candidate entity set whose second importance degree satisfies a first preset condition includes:
according to the second importance degree, sorting the candidate entities in the candidate entity set in a descending order to obtain a first candidate entity set; and selecting a first preset number of candidate entities ranked at the top from the first candidate entity set as entities to be expanded.
10. An entity set extension apparatus, the apparatus comprising:
the candidate entity set determining module is used for extracting candidate entities from the target knowledge graph according to a predetermined seed entity set and forming the candidate entities obtained by extraction into a candidate entity set; the target knowledge-graph includes at least seed entities in the set of seed entities; the seed entities are preset according to a given specific semantic type, and a set formed by all the seed entities is a seed entity set; the target knowledge graph refers to a knowledge graph related to a predetermined seed entity; candidate entities are entities that have a particular common trait with a seed entity, the particular common trait including: the entity types are the same; the knowledge graph is a data set and is composed of < subject, predicate and object > triples, wherein the subject and the object respectively correspond to an entity, the predicate represents the relationship or the attribute between the subject and the object, and the types of the subject and the object and the relationship or the attribute between the subject and the object are not limited in the knowledge graph;
a meta-path determination module for determining meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge graph; the meta path is: a connection path composed of an entity type and a relationship type between two node types in the heterogeneous information network; wherein the two node types are node types corresponding to different seed entities in the seed entity set; in the heterogeneous information network, one node represents one entity object, and one edge represents the relationship between two entity objects connected by the edge;
the first importance degree determining module is used for determining the first importance degree of each meta-path according to the number of the seed entity pairs connected with each meta-path;
a second importance level determining module, configured to determine a second importance level of each candidate entity in the candidate entity set according to the first importance level of each meta-path;
and the entity set expansion module is used for determining the candidate entities with the second importance degree meeting the first preset condition in the candidate entity set as entities to be expanded and adding the entities to be expanded into the seed entity set.
CN201710168839.XA 2017-03-21 2017-03-21 Entity set extension method and device Active CN106951526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710168839.XA CN106951526B (en) 2017-03-21 2017-03-21 Entity set extension method and device

Publications (2)

Publication Number Publication Date
CN106951526A CN106951526A (en) 2017-07-14
CN106951526B true CN106951526B (en) 2020-08-07

Family

ID=59472639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710168839.XA Active CN106951526B (en) 2017-03-21 2017-03-21 Entity set extension method and device

Country Status (1)

Country Link
CN (1) CN106951526B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019826B (en) * 2017-07-27 2023-02-28 北大医疗信息技术有限公司 Construction method, construction device, equipment and storage medium of medical knowledge map
CN107609152B (en) * 2017-09-22 2021-03-09 百度在线网络技术(北京)有限公司 Method and apparatus for expanding query expressions
CN109145119A (en) * 2018-07-02 2019-01-04 北京妙医佳信息技术有限公司 The knowledge mapping construction device and construction method of health management arts
CN111488467B (en) * 2020-04-30 2022-04-05 北京建筑大学 Construction method and device of geographical knowledge graph, storage medium and computer equipment
CN113052968B (en) * 2021-04-30 2022-08-05 电子科技大学 Knowledge graph construction method of three-dimensional structure geological model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102844755A (en) * 2010-04-27 2012-12-26 惠普发展公司,有限责任合伙企业 Method of extracting named entity
CN103488724A (en) * 2013-09-16 2014-01-01 复旦大学 Book-oriented reading field knowledge map construction method
CN105913125A (en) * 2016-04-12 2016-08-31 北京邮电大学 Heterogeneous information network element determining method, link prediction method, heterogeneous information network element determining device and link prediction device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270458A1 (en) * 2007-04-24 2008-10-30 Gvelesiani Aleksandr L Systems and methods for displaying information about business related entities

Also Published As

Publication number Publication date
CN106951526A (en) 2017-07-14

Similar Documents

Publication Publication Date Title
CN106951526B (en) Entity set extension method and device
Bouros et al. Spatio-textual similarity joins
KR100963623B1 (en) Ranking processing method for semantic web resources
Fan et al. Answering graph pattern queries using views
WO2014109127A1 (en) Index generating device and method, and search device and search method
CN104239513A (en) Semantic retrieval method oriented to field data
CN108804576B (en) Domain name hierarchical structure detection method based on link analysis
WO2014107988A1 (en) Method and system for discovering and analyzing micro-blog user group structure
JP2011258235A (en) System and method for ranking result of search by using click distance
US20140136468A1 (en) Quantitative assessment of similarity of categorized data
CN106980651B (en) Crawling seed list updating method and device based on knowledge graph
Mahmood et al. FAST: frequency-aware indexing for spatio-textual data streams
CN104133868B (en) A kind of strategy integrated for the classification of vertical reptile data
WO2015051481A1 (en) Determining collection membership in a data graph
US11288266B2 (en) Candidate projection enumeration based query response generation
CN106649731A (en) Node similarity searching method based on large-scale attribute network
CN115438274A (en) False news identification method based on heterogeneous graph convolutional network
CN104008097B (en) Realize the method and device that inquiry understands
CN107016135B (en) A kind of positive and negative two-way dynamic equilibrium search strategy of resource environment
Yang et al. HNRWalker: recommending academic collaborators with dynamic transition probabilities in heterogeneous networks
US8914416B2 (en) Semantics graphs for enterprise communication networks
CN104794237B (en) web information processing method and device
JP4440246B2 (en) Spatial index method
Setayesh et al. Presentation of an Extended Version of the PageRank Algorithm to Rank Web Pages Inspired by Ant Colony Algorithm
Likhyani et al. Label constrained shortest path estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant