CN106951526A - Entity set expansion method and device - Google Patents

Info

Publication number
CN106951526A
Authority
CN
China
Prior art keywords
entity
path
candidate
node
seed entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710168839.XA
Other languages
Chinese (zh)
Other versions
CN106951526B (en)
Inventor
石川
郑玉艳
曹晓欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201710168839.XA
Publication of CN106951526A
Application granted
Publication of CN106951526B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide an entity set expansion method and device. According to a predetermined seed entity set, candidate entities are extracted from a target knowledge graph to form a candidate entity set. Meta-paths between seed entities are determined from a heterogeneous information network corresponding to the target knowledge graph, where a meta-path is a connection path, composed of entity types and relation types, between two node types in the heterogeneous information network, and the two node types are the node types corresponding to different seed entities. A first importance of each meta-path is determined according to the number of seed entity pairs that the meta-path connects. A second importance of each candidate entity in the candidate entity set is determined according to the first importance of each meta-path. Candidate entities whose second importance satisfies a first preset condition are determined to be expansion entities and are added to the seed entity set. The invention enables effective entity set expansion.

Description

Entity set expansion method and device
Technical field
The present invention relates to the technical field of entity set expansion, and in particular to an entity set expansion method and device.
Background
Entity set expansion means that, given several seed entities of a specific semantic type (also called a specific common feature), more entities of that semantic type are obtained according to certain rules. For example, given the seed entity set {Beijing, Washington, Moscow}, whose semantic type is national capital, the goal is to find more national capitals, such as {Seoul, Tokyo, Kuala Lumpur, ...}. Entity set expansion is already widely used, for example in dictionary expansion and query suggestion expansion.
The most common approach to entity set expansion is to choose a data source, process it according to certain rules, and take from it other entities that have the same semantic type as the seed entities as the expansion elements of the entity set. Existing entity set expansion methods mostly use text or web pages as the data source. However, because the amount of data contained in a single text or web page is limited, the effectiveness of entity set expansion is unsatisfactory and cannot meet the growing demand for entity set expansion.
Summary of the invention
The purpose of the embodiments of the present invention is to provide an entity set expansion method and device, so as to improve the effectiveness of entity set expansion.
To achieve the above purpose, in a first aspect, an embodiment of the present invention provides an entity set expansion method, the method including:
extracting candidate entities from a target knowledge graph according to a predetermined seed entity set, and forming a candidate entity set from the extracted candidate entities, where the target knowledge graph includes at least the seed entities in the seed entity set;
determining meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge graph, where a meta-path is a connection path, composed of entity types and relation types, between two node types in the heterogeneous information network, and the two node types are the node types corresponding to different seed entities in the seed entity set;
determining a first importance of each meta-path according to the number of seed entity pairs connected by that meta-path;
determining a second importance of each candidate entity in the candidate entity set according to the first importance of each meta-path;
determining the candidate entities in the candidate entity set whose second importance satisfies a first preset condition to be expansion entities, and adding the expansion entities to the seed entity set.
Optionally, extracting candidate entities from the target knowledge graph according to the predetermined seed entity set includes:
determining the entity type set of each seed entity in the predetermined seed entity set;
determining the intersection of all the entity type sets to be an initial entity type set;
determining, according to the hierarchical relationships of the entity types in the initial entity type set, a final entity type set corresponding to the seed entity set; and taking the entities in the target knowledge graph whose entity type is in the final entity type set as candidate entities.
Optionally, determining the final entity type set according to the hierarchical relationships of the entity types in the initial entity type set includes:
determining at least one hierarchical relationship corresponding to the initial entity type set, where each hierarchical relationship is a subordination relationship among at least two entity types;
determining the entity type at the bottom level of each hierarchical relationship to be a final entity type, and forming the final entity type set from the determined final entity types.
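The candidate extraction steps above can be illustrated with a minimal Python sketch. It assumes the type hierarchy is available as a child-to-parent mapping (`parent_of`) and that each entity's types are given as a set; these data structures, the function names, and the sample entities are hypothetical and are not taken from the patent.

```python
from typing import Dict, Set


def ancestors(entity_type: str, parent_of: Dict[str, str]) -> Set[str]:
    """Collect every type above entity_type in the (hypothetical) hierarchy."""
    seen: Set[str] = set()
    current = entity_type
    while current in parent_of and parent_of[current] not in seen:
        current = parent_of[current]
        seen.add(current)
    return seen


def final_entity_types(seed_types: Dict[str, Set[str]],
                       parent_of: Dict[str, str]) -> Set[str]:
    # Initial entity type set: the types shared by every seed entity.
    initial = set.intersection(*seed_types.values())
    # Keep only bottom-level types: drop a type if it is an ancestor
    # of another type that is also in the initial set.
    return {t for t in initial
            if not any(t in ancestors(u, parent_of) for u in initial if u != t)}


def extract_candidates(entity_types: Dict[str, Set[str]],
                       seeds: Set[str],
                       parent_of: Dict[str, str]) -> Set[str]:
    final = final_entity_types({s: entity_types[s] for s in seeds}, parent_of)
    # ASSUMPTION: the seeds themselves are left out of the candidate set.
    return {e for e, types in entity_types.items()
            if types & final and e not in seeds}


# Hypothetical toy data.
parent_of = {"director": "person", "actor": "person"}
entity_types = {
    "seed_a": {"person", "director"},
    "seed_b": {"person", "director"},
    "cand_x": {"person", "director"},
    "cand_y": {"person", "actor"},
}
seeds = {"seed_a", "seed_b"}
final_types = final_entity_types({s: entity_types[s] for s in seeds}, parent_of)
candidates = extract_candidates(entity_types, seeds, parent_of)
```

In this toy data the shared types are "person" and "director"; only the bottom-level type "director" is kept, so the entity that is merely a "person" and an "actor" is not extracted as a candidate.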
Optionally, determining the meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph includes:
determining, from the heterogeneous information network corresponding to the target knowledge graph, a node set corresponding to the seed entity set, where the node set includes the nodes corresponding to the seed entities in the seed entity set;
taking each node in the node set as a first node;
taking each first node as a current source node, visiting in the heterogeneous information network the current target nodes connected to each current source node by an edge of a preset type, and building multiple candidate structure data tables, one per edge type; where each candidate structure data table includes: the first entity pairs, each formed by a first node and a current target node connected to it by an edge of the edge type corresponding to that table; the similarity of each first entity pair; the path visited so far; and a similarity score, the similarity score being the sum of the similarities of all the first entity pairs;
for each candidate structure data table, judging whether the current target node connected to each current source node in the table is a second node; if so, recording the similarity of the first entity pair corresponding to that current source node in the table as a first value and determining the visited path corresponding to that current source node to be a meta-path instance; otherwise, recording the similarity as a second value; where a second node is a node in the node set that differs from the first node corresponding to the current source node;
selecting, from the candidate structure data tables, a candidate structure data table that satisfies a second preset condition as the current structure data table; the second preset condition includes: the table stores the largest number of distinct seed entities; when more than one candidate structure data table stores the largest number of distinct seed entities, the second preset condition further includes: the table stores the smallest number of first entity pairs;
updating each current target node in the current structure data table to be a current source node, and returning to the step of visiting in the heterogeneous information network the current target nodes connected to each current source node by an edge of a preset type;
when the length of the visited path in each current structure data table is greater than a third preset value, or when the number of seed entities in each current structure data table is less than a fourth preset value, collecting all the determined meta-path instances and obtaining, from the entity types and relation types they contain, the meta-paths corresponding to all the meta-path instances.
Optionally, selecting, from the candidate structure data tables, a candidate structure data table that satisfies the second preset condition as the current structure data table includes:
selecting, from the candidate structure data tables whose similarity score is not greater than a first preset value, a candidate structure data table that satisfies the second preset condition as the current structure data table.
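The following Python sketch shows the general idea of collecting meta-paths that connect pairs of seed entities. It is a deliberately simplified breadth-first enumeration bounded by a maximum path length, not the table-based procedure described above: it keeps no structure data tables, similarity scores, or selection conditions. The adjacency-list representation, the function name, and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (relation type, neighbour node)


def find_meta_paths(graph: Dict[str, List[Edge]],
                    node_type: Dict[str, str],
                    seeds: Set[str],
                    max_len: int) -> Dict[tuple, Set[Tuple[str, str]]]:
    """Map each discovered meta-path to the seed entity pairs it connects."""
    found: Dict[tuple, Set[Tuple[str, str]]] = defaultdict(set)
    for source in seeds:
        # Each frontier item: (current node, meta-path so far, visited nodes).
        frontier = [(source, (node_type[source],), {source})]
        for _ in range(max_len):
            next_frontier = []
            for node, meta_path, visited in frontier:
                for relation, neighbour in graph.get(node, []):
                    if neighbour in visited:
                        continue
                    extended = meta_path + (relation, node_type[neighbour])
                    if neighbour in seeds:
                        # Reached a different seed: record a path instance.
                        found[extended].add((source, neighbour))
                    else:
                        # Only non-seed nodes may be intermediate nodes.
                        next_frontier.append(
                            (neighbour, extended, visited | {neighbour}))
            frontier = next_frontier
    return dict(found)


# Hypothetical toy network: two actors who appear in the same film.
graph = {
    "a1": [("actedIn", "m1")],
    "a2": [("actedIn", "m1")],
    "m1": [("actedIn^-1", "a1"), ("actedIn^-1", "a2")],
}
node_type = {"a1": "actor", "a2": "actor", "m1": "movie"}
meta_paths = find_meta_paths(graph, node_type, {"a1", "a2"}, max_len=2)
```

The length bound plays the role of the path-length stopping condition; a fuller implementation would also prune the search at each step, as the table selection conditions above do.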
Optionally, determining the first importance of each meta-path according to the number of seed entity pairs connected by that meta-path includes:
determining, from all the seed entity pairs connected by each meta-path, the total number of seed entity pairs connected by that meta-path;
determining the first importance of each meta-path according to the total number of seed entity pairs connected by that meta-path and a first preset model;
where the first preset model is: [formula not reproduced in this text]
where $W_k$ is the first importance corresponding to meta-path $P_k$; $l$ is the number of meta-paths; $SP_k$ is the total number of seed entity pairs connected by meta-path $P_k$; $m$ is the number of seed entities; and a further expression, also not reproduced in this text, is the total number of seed entity pairs.
Optionally, determining the second importance of each candidate entity in the candidate entity set according to the first importance of each meta-path includes:
determining the second importance of each candidate entity in the candidate entity set according to the first importance of each meta-path and a second preset model;
where the second preset model is: [formula not reproduced in this text]
with $s_j \in S$ and $i \in \{1, 2, 3, \ldots, n\}$, where $R(c_i, S)$ denotes the second importance of candidate entity $c_i$; $n$ is the number of candidate entities; $s_j$ denotes a seed entity; $S$ denotes the seed entity set; $m$ is the number of seed entities; $W_k$ is the first importance corresponding to meta-path $P_k$; $l$ is the number of meta-paths; and $r\{(c_i, s_j) \mid P_k\}$ indicates whether meta-path $P_k$ connects seed entity $s_j$ and candidate entity $c_i$: if it does, $r = 1$; otherwise, $r = 0$.
Optionally, determining the candidate entities in the candidate entity set whose second importance satisfies the first preset condition to be expansion entities includes:
determining the candidate entities in the candidate entity set whose second importance is greater than a second preset value to be expansion entities.
Optionally, determining the candidate entities in the candidate entity set whose second importance satisfies the first preset condition to be expansion entities includes:
sorting the candidate entities in the candidate entity set in descending order of second importance to obtain a first candidate entity set; and selecting from the first candidate entity set the top-ranked candidate entities, up to a first preset number, as expansion entities.
To achieve the above purpose, in a second aspect, an embodiment of the present invention provides an entity set expansion device, the device including:
a candidate entity set determination module, configured to extract candidate entities from a target knowledge graph according to a predetermined seed entity set and to form a candidate entity set from the extracted candidate entities, where the target knowledge graph includes at least the seed entities in the seed entity set;
a meta-path determination module, configured to determine meta-paths between seed entities from a heterogeneous information network corresponding to the target knowledge graph, where a meta-path is a connection path, composed of entity types and relation types, between two node types in the heterogeneous information network, and the two node types are the node types corresponding to different seed entities in the seed entity set;
a first importance determination module, configured to determine a first importance of each meta-path according to the number of seed entity pairs connected by that meta-path;
a second importance determination module, configured to determine a second importance of each candidate entity in the candidate entity set according to the first importance of each meta-path;
an entity set expansion module, configured to determine the candidate entities in the candidate entity set whose second importance satisfies a first preset condition to be expansion entities, and to add the expansion entities to the seed entity set.
With the entity set expansion method and device provided by the embodiments of the present invention, on the one hand, a target knowledge graph containing a very large amount of data is used as the data source for entity set expansion. On the other hand, meta-paths between seed entities are determined from the heterogeneous information network corresponding to the target knowledge graph. Because each determined meta-path is a path connecting a seed entity pair, these meta-paths can accurately reflect the specific common features among the seed entities. The second importance of each candidate entity, determined from the first importance of each meta-path, is therefore more effective, and so are the expansion entities determined from the second importance. The entity set expansion method and device provided by the embodiments of the present invention can thus improve the effectiveness of entity set expansion.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of an entity set expansion method provided by an embodiment of the present invention;
Fig. 2 is a partial schematic diagram of the Yago knowledge graph;
Fig. 3 is a partial schematic diagram of the hierarchical relationships of entity types in the Yago knowledge graph;
Fig. 4 is a detailed flow chart of step S102 in the embodiment shown in Fig. 1;
Fig. 5 is a schematic diagram of the principle of determining meta-paths using the detailed flow chart shown in Fig. 4;
Fig. 6A to Fig. 6D are schematic diagrams of the effectiveness verification results of an entity set expansion method provided by an embodiment of the present invention; the entity types corresponding to Fig. 6A to Fig. 6D are, in order: actors in films directed by Steven Spielberg; films whose directors have won the National Film Award; software produced by companies located in Mountain View, California; and scientists at universities located in Cambridge, Massachusetts;
Fig. 7 is a structural block diagram of an entity set expansion device provided by an embodiment of the present invention;
Fig. 8 is a detailed structural block diagram of module 702 in the embodiment shown in Fig. 7.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
To solve the problems in the prior art, the embodiments of the present invention provide an entity set expansion method and device, which are described separately below with reference to specific embodiments.
First, an entity set expansion method provided by an embodiment of the present invention is described.
As shown in Fig. 1, an entity set expansion method provided by an embodiment of the present invention includes the following steps:
S101: extract candidate entities from a target knowledge graph according to a predetermined seed entity set, and form a candidate entity set from the extracted candidate entities; the target knowledge graph includes at least the seed entities in the seed entity set;
Seed entities can be set in advance according to a given semantic type, and the set made up of all the seed entities is the seed entity set. For example, if the given semantic type is film director, Ang Lee, Chen Kaige, and Zhang Yimou can be chosen in advance as seed entities, forming the seed entity set {Ang Lee, Chen Kaige, Zhang Yimou}.
A knowledge graph is a very large data set consisting mainly of triples of the form <subject, predicate, object>. In the Yago knowledge graph shown in Fig. 2, for example, one triple is <Steven Spielberg, directed, War Horse>, which means that Steven Spielberg directed the film War Horse. Besides Yago, other knowledge graphs exist in the prior art, such as DBpedia and Freebase.
In the embodiments of the present invention, the target knowledge graph is a knowledge graph related to the predetermined seed entities. Those skilled in the art will appreciate that accurate entity set expansion is possible only when the data source used is related to the seed entities.
Specifically, the target knowledge graph includes at least the seed entities in the seed entity set.
In the embodiments of the present invention, a candidate entity is an entity that shares a specific common feature with the seed entities, where the specific common feature includes having the same entity type.
S102: determine meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph; a meta-path is a connection path, composed of entity types and relation types, between two node types in the heterogeneous information network, and the two node types are the node types corresponding to different seed entities in the seed entity set;
A heterogeneous information network (Heterogeneous Information Network) is a directed graph $G = (V, E)$, where $V$ is the set of all entity nodes and $E$ is the set of all relation edges. In this directed graph, the number of entity object types satisfies $|A| > 1$ or the number of relation types linking different entity objects satisfies $|R| > 1$. In the network, a node represents an entity object (an entity for short), and an edge represents the relation between the two entity objects it connects. There is also a node type mapping function $\varphi: V \to A$ and an edge type mapping function $\psi: E \to R$: each entity object $v \in V$ belongs to a particular object type $\varphi(v) \in A$, and each edge $e \in E$ belongs to a particular relation type $\psi(e) \in R$.
A meta-path is a connection path, composed of entity types and relation types, between two node types in the heterogeneous information network; it represents the semantic relation between the two node types. A meta-path $\Pi$ is defined as $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$. It is a sequence of entity types (node types) and relation types (edge types) that describes a path between a node of type $A_1$ and a node of type $A_{l+1}$, connected through a series of nodes of types $A_1, \ldots, A_{l+1}$ and edges of types $R_1, \ldots, R_l$, where the node type corresponding to $A_1$ is called the source node type and the node type corresponding to $A_{l+1}$ is called the target node type.
In heterogeneous information networks, meta-paths are widely used to capture rich semantic information. A path $p = (a_1 a_2 \cdots a_{l+1})$ between objects $a_1$ and $a_{l+1}$ is a path instance of meta-path $P$ if, for all $i$, $\varphi(a_i) = A_i$ and each edge $e_i = \langle a_i, a_{i+1} \rangle$ satisfies $\psi(e_i) = R_i$.
In general, one meta-path may have multiple path instances: when two different paths both satisfy the same meta-path, both are path instances of that meta-path.
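The relation between a path instance and its meta-path can be made concrete with a short Python sketch. It assumes a path instance is written as a tuple that alternates node names and relation types, and that node types are given by a lookup table; the representation, the function name, and the sample nodes are hypothetical.

```python
from typing import Dict, Tuple


def meta_path_of(path: Tuple[str, ...], node_type: Dict[str, str]) -> Tuple[str, ...]:
    """Replace each node in an alternating (node, relation, node, ...) path
    by its type, leaving relation types unchanged."""
    return tuple(node_type[item] if index % 2 == 0 else item
                 for index, item in enumerate(path))


# Hypothetical nodes: two different path instances of the same meta-path.
node_type = {"a1": "actor", "a2": "actor", "m1": "movie", "m2": "movie"}
instance_1 = ("a1", "actedIn", "m1")
instance_2 = ("a2", "actedIn", "m2")
same_meta_path = meta_path_of(instance_1, node_type) == meta_path_of(instance_2, node_type)
```

Both instances map to the type sequence ("actor", "actedIn", "movie"), which is why they count as path instances of one meta-path.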
A knowledge graph consists mainly of triples of the form <subject, predicate, object>, where the subject and the object each correspond to an entity and the predicate represents a relation or attribute between them. A knowledge graph also contains more than one type of subject and object and more than one kind of relation or attribute between subjects and objects. A heterogeneous information network can therefore be built in advance from the knowledge graph.
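As a rough illustration of how triples map onto a typed network, the Python sketch below turns a list of triples into an adjacency list and checks that the result has more than one node type or more than one edge type. The function name, the type labels, and the second triple are hypothetical; the construction actually used with the target knowledge graph is not specified in this text.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_network(triples: List[Triple], entity_type: Dict[str, str]):
    """Build a directed, typed adjacency list from knowledge graph triples."""
    adjacency: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    node_types: Set[str] = {entity_type[n] for s, _, o in triples for n in (s, o)}
    edge_types: Set[str] = {p for _, p, _ in triples}
    # Heterogeneous: more than one node type or more than one edge type.
    is_heterogeneous = len(node_types) > 1 or len(edge_types) > 1
    return dict(adjacency), node_types, edge_types, is_heterogeneous


triples = [
    ("Steven Spielberg", "directed", "War Horse"),
    ("actor_1", "actedIn", "War Horse"),  # hypothetical triple
]
entity_type = {  # hypothetical type labels
    "Steven Spielberg": "director",
    "War Horse": "film",
    "actor_1": "actor",
}
adjacency, node_types, edge_types, is_heterogeneous = build_network(triples, entity_type)
```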
For example, in Fig. 2, "directed" and "acted in" are two different relation types, and "actor" and "film" are different entity types. The path that connects the "actor" entity type to the "film" entity type through the "acted in" relation, and the "film" entity type to Steven Spielberg's entity type through the "directed" relation, is a meta-path between Toby Kebbell and Steven Spielberg.
In addition, in Fig. 2, Toby Kebbell and Martin McCain both belong to the actor class, whereas Toby Kebbell and Nigel Havers are not only actors but also belong to the class of actors in films directed by Steven Spielberg. To distinguish these two kinds of class, the former is called a coarse-grained entity type and the latter a fine-grained entity type. Candidate entities determined according to a fine-grained entity type are more likely to be determined to be expansion entities.
Building a heterogeneous information network from a knowledge graph belongs to the prior art, so this process is not described in detail here.
In the embodiments of the present invention, the two nodes are the nodes corresponding to different seed entities in the seed entity set, and the node pair formed by the two nodes can be called a "seed entity pair".
Table 1 lists the seed entity pairs formed by the nodes corresponding to the seed entities when the seed entity set is $\{s_1, s_2, \ldots, s_m\}$. As shown in Table 1, when the source node is $s_1$, the target node is any one of $\{s_2, \ldots, s_m\}$; when the source node is $s_2$, the target node is any one of $\{s_1, s_3, \ldots, s_m\}$; the other source nodes follow by analogy and are not listed one by one here.
Table 1
Source node    Target node
s1             any one of {s2, ..., sm}
s2             any one of {s1, s3, ..., sm}
...            ...
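The seed entity pairs described for Table 1 can be enumerated as ordered (source, target) pairs of distinct seeds. The sketch below assumes ordered pairs, which matches the description of each seed serving as a source node with every other seed as a possible target; the function name is hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def seed_entity_pairs(seeds: List[str]) -> List[Tuple[str, str]]:
    """All ordered (source, target) pairs of distinct seed entities."""
    return list(permutations(seeds, 2))


pairs = seed_entity_pairs(["s1", "s2", "s3"])
```

Under this reading, a seed entity set with m seeds yields m(m-1) ordered pairs.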
It should also be noted that, in the embodiments of the present invention, only the source node and the target node of each meta-path correspond to seed entities; the entities corresponding to the other nodes are non-seed entities.
S103: determine the first importance of each meta-path according to the number of seed entity pairs connected by that meta-path;
In one implementation provided by the embodiments of the present invention, step S103 includes:
Step 1: determine, from all the seed entity pairs connected by each meta-path, the total number of seed entity pairs connected by that meta-path;
Specifically, because each path instance connects one seed entity pair, the total number of seed entity pairs connected by a meta-path is the sum of the numbers of seed entity pairs connected by all the path instances of that meta-path.
Step 2: determine the first importance of each meta-path according to the total number of seed entity pairs connected by that meta-path and a first preset model;
where the first preset model is: [formula not reproduced in this text]
where $W_k$ is the first importance corresponding to meta-path $P_k$; $l$ is the number of meta-paths; $SP_k$ is the total number of seed entity pairs connected by meta-path $P_k$; $m$ is the number of seed entities; and a further expression, also not reproduced in this text, is the total number of seed entity pairs.
All the important meta-paths are determined in step S102, but the importance of each meta-path differs. Through extensive experimental verification, the applicant has shown that the importance of a meta-path is related to the total number of seed entity pairs it connects: the larger this total, the better the meta-path reflects the common feature of the seed entities, and the more important the meta-path is.
In view of this, the embodiments of the present invention determine the first importance of each meta-path according to the first preset model. It can be seen from the first preset model that the larger the total number of seed entity pairs connected by meta-path $P_k$, the larger its first importance.
It should be noted that the method of determining the first importance of each meta-path is not limited to the one above; other methods in the prior art for determining the first importance of each meta-path are also applicable to the present invention.
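Because the formula of the first preset model is not reproduced in this text, the Python sketch below uses an assumed form: the first importance of a meta-path is the number of seed entity pairs it connects divided by the number of ordered seed pairs, m(m-1). This is only one normalization consistent with the stated property that a meta-path connecting more seed entity pairs is more important; it should not be read as the patent's formula. The function name and the sample data are hypothetical.

```python
from typing import Dict, Set, Tuple


def first_importance(pairs_by_meta_path: Dict[str, Set[Tuple[str, str]]],
                     seed_count: int) -> Dict[str, float]:
    """ASSUMED model: W_k = SP_k / (m * (m - 1))."""
    total_pairs = seed_count * (seed_count - 1)  # ordered seed pairs (assumption)
    return {path: len(pairs) / total_pairs
            for path, pairs in pairs_by_meta_path.items()}


# Hypothetical data: three seeds, two meta-paths.
pairs_by_meta_path = {
    "P1": {("s1", "s2"), ("s2", "s1"), ("s1", "s3"),
           ("s3", "s1"), ("s2", "s3"), ("s3", "s2")},
    "P2": {("s1", "s2"), ("s2", "s1"), ("s1", "s3")},
}
weights = first_importance(pairs_by_meta_path, seed_count=3)
```

Here the meta-path that connects every seed pair receives a larger weight than the one that connects only half of them.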
S104: determine the second importance of each candidate entity in the candidate entity set according to the first importance of each meta-path;
In one implementation provided by the embodiments of the present invention, step S104 includes:
determining the second importance of each candidate entity in the candidate entity set according to the first importance of each meta-path and a second preset model;
where the second preset model is: [formula not reproduced in this text]
with $s_j \in S$ and $i \in \{1, 2, 3, \ldots, n\}$, where $R(c_i, S)$ denotes the second importance of candidate entity $c_i$; $n$ is the number of candidate entities; $s_j$ denotes a seed entity; $S$ denotes the seed entity set; $m$ is the number of seed entities; $W_k$ is the first importance corresponding to meta-path $P_k$; $l$ is the number of meta-paths; and $r\{(c_i, s_j) \mid P_k\}$ indicates whether meta-path $P_k$ connects seed entity $s_j$ and candidate entity $c_i$: if it does, $r = 1$; otherwise, $r = 0$.
It can be seen that the second importance is positively correlated with the first importance. Because a larger first importance indicates that the meta-path better reflects the specific common feature among the seed entities, the second importance of a candidate entity determined from the first importance is more effective.
It should likewise be noted that the method of determining the second importance of each candidate entity is not limited to the one above; other methods in the prior art for determining the second importance of each candidate entity are also applicable to the embodiments of the present invention.
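The formula of the second preset model is likewise not reproduced in this text. The Python sketch below assumes one form that uses every symbol defined above: the second importance of a candidate is the weighted count of (seed, meta-path) combinations that connect it, averaged over the m seeds, i.e. R(c_i, S) = (1/m) Σ_j Σ_k W_k · r{(c_i, s_j) | P_k}. The 1/m factor in particular is an assumption, and the function name and data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def second_importance(candidates: List[str],
                      seeds: List[str],
                      weights: Dict[str, float],
                      connects: Dict[str, Set[Tuple[str, str]]]) -> Dict[str, float]:
    """ASSUMED model: R(c, S) = (1/m) * sum over seeds s and meta-paths P_k
    of W_k * r, where r = 1 if P_k connects (c, s) and 0 otherwise."""
    m = len(seeds)
    scores = {}
    for candidate in candidates:
        total = sum(weight
                    for path, weight in weights.items()
                    for seed in seeds
                    if (candidate, seed) in connects.get(path, set()))
        scores[candidate] = total / m
    return scores


# Hypothetical data: which (candidate, seed) pairs each meta-path connects.
weights = {"P1": 1.0, "P2": 0.5}
connects = {
    "P1": {("c1", "s1"), ("c1", "s2")},
    "P2": {("c2", "s1")},
}
scores = second_importance(["c1", "c2"], ["s1", "s2"], weights, connects)
```

A candidate linked to more seeds through higher-weight meta-paths receives a higher score, which is the positive correlation with the first importance noted above.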
S105: determine the candidate entities in the candidate entity set whose second importance satisfies a first preset condition to be expansion entities, and add the expansion entities to the seed entity set.
In one implementation provided by the embodiments of the present invention, step S105 includes:
determining the candidate entities in the candidate entity set whose second importance is greater than a second preset value to be expansion entities.
In another implementation provided by the embodiments of the present invention, step S105 includes:
sorting the candidate entities in the candidate entity set in descending order of second importance to obtain a first candidate entity set; and selecting from the first candidate entity set the top-ranked candidate entities, up to a first preset number, as expansion entities.
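The two selection rules of step S105 can be sketched in Python as follows. The function names, the threshold, the count, and the scores are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Set


def select_by_threshold(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep candidates whose second importance exceeds the preset value."""
    return {candidate for candidate, score in scores.items() if score > threshold}


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Sort candidates by second importance, descending, and keep the first k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:k]]


scores = {"c1": 0.9, "c2": 0.4, "c3": 0.7}
above_threshold = select_by_threshold(scores, threshold=0.5)
top_two = select_top_k(scores, k=2)

# Add the selected expansion entities to the seed entity set.
seed_set = {"s1", "s2"}
expanded_seed_set = seed_set | set(top_two)
```

The threshold rule returns a variable number of entities, while the top-k rule fixes the size of the expansion in advance.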
The applicant verified the selected first preset number of entities to be expanded against the target knowledge graph using the corresponding ranking metrics, which confirmed the effectiveness of the method.
The entity set expansion method provided by the embodiments of the present invention, on the one hand, uses a target knowledge graph containing a huge volume of data as the data source for entity set expansion. On the other hand, it determines the meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph. Because each determined meta-path is a path connecting a pair of seed entities, these meta-paths accurately reflect the specific common features shared by the seed entities. Consequently, the second significance level of each candidate entity, determined from the first significance level of each meta-path, is more effective, and so are the entities to be expanded that are determined from the second significance level. The entity set expansion method provided by the embodiments of the present invention can therefore improve the effectiveness of entity set expansion.
In addition, knowledge graphs such as Yago have become a tool for quickly retrieving information. As knowledge graphs have grown popular, many researchers have begun to use them to help improve the accuracy of entity set expansion over text or web pages. However, little existing work performs entity set expansion with a knowledge graph as the sole data source. Doing so is nevertheless necessary, for the following reasons: (1) traditional entity set expansion methods based on text or web page information require complex natural language processing, which can affect expansion accuracy to some extent, whereas using a knowledge graph as the sole data source requires no such complex preprocessing; (2) a knowledge graph contains rich entities and semantic relations, which greatly benefits entity set expansion.
In one implementation provided by the embodiments of the present invention, the step in S101 above of extracting candidate entities from the target knowledge graph according to the predetermined seed entity set may include:
Step 1: determine the entity type set of each seed entity in the predetermined seed entity set;
For example, in the seed entity set {Li An, Chen Kaige, Zhang Yimou} determined above, the entity type set of seed entity Li An is {person, director}; for seed entities Chen Kaige and Zhang Yimou, the entity type set is {person, director, actor}.
Step 2: determine the intersection of all the entity type sets as the initial entity type set;
Because shared entity types better reflect the common features of entities, taking the intersection of all the entity type sets as the initial entity type set allows entity set expansion to be carried out more effectively.
Specifically, the intersection of the entity type set {person, director} and the entity type set {person, director, actor} determined in step 1 is {person, director}; that is, the initial entity type set is {person, director}.
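Steps 1 and 2 amount to a set intersection, which can be sketched in a few lines of Python. The dictionary below simply restates the example from the text; the function name is hypothetical.

```python
from functools import reduce
from typing import Dict, Set

def initial_type_set(type_sets: Dict[str, Set[str]]) -> Set[str]:
    """Intersect the entity type sets of all seed entities (at least one seed required)."""
    return reduce(set.intersection, (set(t) for t in type_sets.values()))

# Entity type sets of the seed entities from the example above.
seed_types = {
    "Li An": {"person", "director"},
    "Chen Kaige": {"person", "director", "actor"},
    "Zhang Yimou": {"person", "director", "actor"},
}
initial = initial_type_set(seed_types)
```

For the example seed entities this yields the two shared types, person and director, and drops actor because Li An does not have it.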
Step 3: determine the final entity type set corresponding to the seed entity set according to the hierarchical relationship of the entity types in the initial entity type set; and take the entities in the target knowledge graph whose entity type belongs to the final entity type set as candidate entities.
Although the entity type "person" in the initial entity type set {person, director} reflects a common feature of the seed entities, it is coarse-grained, which would make the semantics of the resulting candidate entities unclear. Therefore, in the embodiments of the present invention, the final entity type set corresponding to the seed entity set is further determined according to the hierarchical relationship of the entity types in the initial entity type set.
In the embodiments of the present invention, an entity type that contains more subtypes is called a "coarse-grained" entity type, and its subtypes are called "fine-grained" entity types. For example, of the two entity types "person" and "director", "person" is coarse-grained and "director" is fine-grained. Those skilled in the art will understand that the coarse and fine granularity of entity types are relative.
Specifically, the hierarchical relationship of the entity types refers to the subordination among the entity types in the initial entity type set. For example, in the initial entity type set {person, director}, the entity type "director" is subordinate to the entity type "person".
More specifically, step 3 above may include:
Sub-step 1: determine at least one hierarchical relationship corresponding to the initial entity type set, where any hierarchical relationship is a subordination among at least two entity types;
Sub-step 2: determine the entity type located at the lowest level of each hierarchical relationship as a final entity type, and form the final entity type set from the determined final entity types.
Entity types and relation types in a knowledge graph are usually organized hierarchically. This hierarchy describes the subordination (also called the parent-child relationship) between entity types or between relation types. Fig. 3 shows part of the hierarchy of entity types; all of these types share one root node, "thing".
As shown in Fig. 3, when the entity type set is {thing, person, film director, actor, artifact, film}, the following hierarchy can be constructed: film director is subordinate to person, person is subordinate to thing, actor is subordinate to person, film is subordinate to artifact, and artifact is subordinate to thing. In Fig. 3, the entity types at the lowest level are film director, actor, and film.
For the initial entity type set {person, director} determined in step 2, the entity type at the lowest level is "director". The final entity type is therefore "director", and the resulting final entity type set is {director}.
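Sub-steps 1 and 2 can be sketched as keeping only the types that have no descendant inside the set. The sketch below assumes the hierarchy is available as a child-to-parent map, which is a hypothetical representation of Fig. 3; the type names are shortened for illustration.

```python
from typing import Dict, List, Set

def final_type_set(initial_types: Set[str], parent: Dict[str, str]) -> Set[str]:
    """Keep the lowest-level types: those that are not an ancestor of another type in the set."""
    def ancestors(t: str) -> List[str]:
        chain: List[str] = []
        while t in parent and parent[t] not in chain:  # guard against malformed cycles
            t = parent[t]
            chain.append(t)
        return chain

    has_descendant: Set[str] = set()
    for t in initial_types:
        has_descendant.update(a for a in ancestors(t) if a in initial_types)
    return set(initial_types) - has_descendant

# Child -> parent map restating the hierarchy described for Fig. 3.
parent = {"director": "person", "actor": "person", "person": "thing",
          "film": "artifact", "artifact": "thing"}
final = final_type_set({"person", "director"}, parent)
```

Applied to the full type set of Fig. 3, the same function keeps director, actor and film, matching the lowest-level types named in the text.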
Those skilled in the art will understand that the final entity type set may contain one entity type or several; either is reasonable.
It is not difficult to see that, in this embodiment, on the one hand, the initial entity type set is the intersection of the entity type sets of the seed entities, and the entity types in that intersection better reflect the common features of the seed entities. On the other hand, the lowest-level entity types in the initial entity type set better represent the semantics of the seed entities, and the final entity type set is determined according to the hierarchical relationship of the entity types in the initial entity type set. Candidate entities selected according to the final entity type set are therefore more likely to share specific common features with the seed entities, and more likely to be added to the seed entity set as entities to be expanded. This provides an initial guarantee of the effectiveness of the entity set expansion method provided by the embodiments of the present invention.
It should also be noted that the method for determining candidate entities is not limited to the one provided in this embodiment; other prior-art methods for determining candidate entities are applicable to the embodiments of the present invention.
In one implementation provided by the embodiments of the present invention, determining the meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph, in step S102 of the embodiment shown in Fig. 1, includes:
Step 1: from the heterogeneous information network corresponding to the target knowledge graph, determine a group of nodes corresponding to the seed entities in the seed entity set;
Step 2: taking each determined node as a source node, traverse the heterogeneous information network; when the target node is a seed entity other than the source node itself, determine the path connecting the source node and the target node as a meta-path instance;
Step 3: collect all the determined meta-path instances, and obtain the meta-paths corresponding to all the meta-path instances according to the entity types and relation types that the instances contain.
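The three steps above can be illustrated with a small exhaustive traversal. This is only a sketch: it enumerates every simple path up to a hop limit rather than using the table-based search of Fig. 4, it stops a path as soon as it reaches another seed entity, and the graph, relation names and function name are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

def find_meta_paths(node_type: Dict[str, str],
                    edges: List[Tuple[str, str, str]],
                    seeds: Set[str],
                    max_hops: int) -> Counter:
    """Count meta-path instances between distinct seed nodes, grouped by meta-path."""
    adj: Dict[str, List[Tuple[str, str]]] = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((rel, dst))
        adj.setdefault(dst, []).append((rel + "^-1", src))  # inverse relation

    counts: Counter = Counter()

    def walk(node: str, visited: Set[str], signature: tuple, hops: int) -> None:
        if hops == max_hops:
            return
        for rel, nxt in adj.get(node, []):
            if nxt in visited:
                continue  # never revisit a node on the same path
            sig = signature + (rel, node_type[nxt])
            if nxt in seeds:
                counts[sig] += 1  # a meta-path instance ending at another seed
            else:
                walk(nxt, visited | {nxt}, sig, hops + 1)

    for s in seeds:
        walk(s, {s}, (node_type[s],), 0)
    return counts

# Toy network: three actors, two films, one director (hypothetical data).
node_type = {"a1": "actor", "a2": "actor", "a3": "actor",
             "f1": "film", "f2": "film", "d1": "director"}
edges = [("a1", "actedIn", "f1"), ("a2", "actedIn", "f1"), ("a3", "actedIn", "f2"),
         ("d1", "directed", "f1"), ("d1", "directed", "f2")]
counts = find_meta_paths(node_type, edges, {"a1", "a2", "a3"}, max_hops=4)
```

Each key of `counts` is a meta-path written as an alternating sequence of entity types and relation types, and each value is the number of instances found for it.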
It is not difficult to see that, because only the determined group of nodes corresponding to the seed entities in the seed entity set serve as source nodes when the heterogeneous information network is traversed to determine the important meta-paths, the traversal range for determining meta-paths is reduced. This not only improves the efficiency of determining meta-paths but also helps save computing resources.
Refer now to Fig. 4 and Fig. 5 together. Fig. 4 shows a detailed flow chart of step S102 in the embodiment shown in Fig. 1, that is, a flow chart of a meta-path determination method. Fig. 5 is a schematic diagram of the principle of determining meta-paths using the detailed flow chart shown in Fig. 4.
In one implementation provided by the embodiments of the present invention, as shown in Fig. 4, determining the meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph, in step S102 of the embodiment shown in Fig. 1, includes:
S401: from the heterogeneous information network corresponding to the target knowledge graph, determine a node set corresponding to the seed entity set, where the node set includes the nodes corresponding to the seed entities in the seed entity set;
In one implementation, the node set includes nodes that are equal in number to, and in one-to-one correspondence with, the seed entities in the seed entity set. For example, if the seed entity set is actor {1, 2, 3}, the corresponding node set is actor {1, 2, 3}.
In the embodiments of the present invention, the purpose of using a node set whose nodes are equal in number to, and in one-to-one correspondence with, the seed entities is to narrow the search range, reduce the computation needed to determine each meta-path, and save computing resources.
Of course, those skilled in the art will understand that when computing resources are plentiful, nodes that correspond to the seed entities but outnumber them may also be selected to form the node set; this is also reasonable. For example, if the seed entity set is actor {1, 2, 3}, the corresponding node set may be actor {1, 2, 3, 1, 2, 3}.
S402: take each node in the node set as a first node;
For ease of description, this embodiment is illustrated with the seed entity set actor {1, 2, 3} and the corresponding node set actor {1, 2, 3}.
Specifically, each node in the node set actor {1, 2, 3} is taken as a first node.
S403: take each first node as a current source node;
Optionally, for ease of explanation, an initial structure data table may first be established.
In the embodiments of the present invention, the basic form of the structure data table is shown in Table 2. In Table 2, (s, t) denotes the entity pair formed by source node s and target node t; σ(s, t | Π) denotes the similarity of entity pair (s, t) under the current path Π: if the entity pair (s, t) connected by the current path Π is a seed entity pair, the similarity is a first value; otherwise it is a second value. In the embodiments of the present invention, the first value is greater than the second value; typically, the first value is 1 and the second value is 0. (s, ..., t) denotes all the nodes visited in order to find the target node t connected to source node s through path Π. Of course, (s, ..., t) need not be included in the structure data table.
Table 2
Specifically, the initial structure data table is shown as Table A in Fig. 5. In the initial state the currently visited node is the first node itself, so the source node and the target node are both the first node, the similarity of the entity pair formed by the source node and the target node is 0, the visited node is the first node itself, and the similarity score of the initial structure data table is also 0.
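One possible in-memory layout for the structure data table is sketched below. The class and field names are hypothetical; only the three columns (s, t), σ(s, t | Π) and (s, ..., t), and the similarity score as their sum, come from the description.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

@dataclass
class Row:
    pair: Tuple[str, str]            # (s, t): source node and current target node
    similarity: int = 0              # sigma(s, t | path): 1 for a seed entity pair, else 0
    visited: Tuple[str, ...] = ()    # (s, ..., t): nodes visited along the path

@dataclass
class StructureTable:
    path: Tuple[str, ...] = ()       # edge types followed so far (the current path)
    rows: List[Row] = field(default_factory=list)

    @property
    def score(self) -> int:
        """Similarity score: the sum of the similarities of all entity pairs."""
        return sum(r.similarity for r in self.rows)

def initial_table(seed_nodes: Iterable[str]) -> StructureTable:
    """Table A: every first node is paired with itself and has similarity 0."""
    return StructureTable(rows=[Row((s, s), 0, (s,)) for s in seed_nodes])

table_a = initial_table(["1", "2", "3"])
```

Keeping the visited nodes in each row is optional, as the text notes, but it makes later loop checks straightforward.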
S404: in the heterogeneous information network, access the current target nodes connected to each current source node through edges of preset types, and establish multiple candidate structure data tables, one per edge type;
Any candidate structure data table includes: the first entity pairs, each formed by a first node and the current target node connected to it through edges of the edge type corresponding to that candidate structure data table; the similarity of each first entity pair; the visited path; and a similarity score, which is the sum of the similarities of all the first entity pairs;
Specifically, as shown in Fig. 5, starting from the initial structure data table A, the heterogeneous information network is accessed to reach the current target nodes connected to current source nodes 1, 2 and 3 through "acted in" edges, and the current target nodes connected to current source nodes 1, 2 and 3 through "born in" edges. Only the two edge types "acted in" and "born in" are expanded here as an example; those skilled in the art will understand that in practice the preset edge types connecting current source nodes and current target nodes may number one, two, or more.
In Fig. 5, two candidate structure data tables, Table B and Table C, are established as an example, corresponding to the "acted in" and "born in" edge types respectively.
S405: for each candidate structure data table, judge whether the current target node connected to each current source node in that table is a second node; if so, record the similarity of the first entity pair corresponding to that current source node in the table as the first value, and determine the visited path corresponding to that current source node as a meta-path instance; otherwise, record the similarity as the second value. A second node is a node in the seed entity set that differs from the first node corresponding to the current source node;
Specifically, in Table B and Table C of Fig. 5, none of the current target nodes corresponding to the first nodes is a second node, so the similarity of each first entity pair is marked as 0 in the example.
S406: from the candidate structure data tables, select the candidate structure data table that meets a second preset condition as the current structure data table; the second preset condition includes: the candidate structure data table stores the largest number of distinct seed entities;
Optionally, when several candidate structure data tables store the largest number of distinct seed entities, the second preset condition further includes: the candidate structure data table stores the smallest number of first entity pairs.
Specifically, in Fig. 5, candidate structure data table B stores more distinct seed entities than candidate structure data table C, so candidate structure data table B may be selected as the current structure data table.
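Steps S404 to S406 can be sketched with plain tuples. The sketch makes two assumptions that the text does not spell out: a row is stored as (first node, current target, similarity, visited nodes), and "distinct seed entities stored in a table" is read as the number of distinct first nodes that still have a row. The adjacency map, node names and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

RowT = Tuple[str, str, int, Tuple[str, ...]]  # (first node, current target, similarity, visited)

def expand_table(rows: List[RowT], edge_type: str,
                 adj: Dict[Tuple[str, str], List[str]], seeds: Set[str]) -> List[RowT]:
    """S404/S405: follow one edge type from each current node and mark seed pairs."""
    new_rows: List[RowT] = []
    for s, t, _, visited in rows:
        for nxt in adj.get((t, edge_type), []):
            sim = 1 if (nxt in seeds and nxt != s) else 0  # 1 only for a second node
            new_rows.append((s, nxt, sim, visited + (nxt,)))
    return new_rows

def pick_current(tables: Dict[str, List[RowT]]) -> str:
    """S406: most distinct seed entities first, then fewest first entity pairs."""
    def key(item: Tuple[str, List[RowT]]) -> Tuple[int, int]:
        _, rows = item
        return (-len({s for s, _, _, _ in rows}), len(rows))
    return min(tables.items(), key=key)[0]

seeds = {"1", "2", "3"}
table_a: List[RowT] = [(s, s, 0, (s,)) for s in ["1", "2", "3"]]
adj = {("1", "actedIn"): ["film12"], ("2", "actedIn"): ["film17"],
       ("3", "actedIn"): ["film18"], ("1", "bornIn"): ["city5"]}
tables = {e: expand_table(table_a, e, adj, seeds) for e in ["actedIn", "bornIn"]}
current = pick_current(tables)
```

In this toy data only one actor has a "bornIn" edge, so the "actedIn" table keeps all three seed entities and is selected, mirroring the choice of Table B over Table C.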
S407: update each current target node in the current structure data table to be a current source node, and return to the step of accessing, in the heterogeneous information network, the current target nodes connected to each current source node through edges of preset types; that is, return to step S404;
Specifically, as shown in Fig. 5, the current target nodes film 12, film 17 and film 18 in current structure data table B are each updated to be current source nodes, and step S404 is performed again on Table B.
In Fig. 5, after step S404 is performed on Table B, two candidate structure data tables, Table D and Table E, are established as an example, corresponding to the "directed^-1" and "created^-1" edge types respectively.
It should be noted that in Fig. 5 the superscript "-1" in the edges "directed^-1" and "created^-1" denotes an inverse relation; that is, "directed^-1" denotes the inverse of "directed". For example, when film 12 is connected to person 7 through the edge "directed^-1", film 12 was directed by person 7; when person 7 is connected to film 12 through the edge "directed", person 7 directed film 12. In addition, the "..." in the last row of structure data tables B and D to H stands for first entity pairs that are not listed.
Similarly, in Table D and Table E of Fig. 5, none of the current target nodes corresponding to the first nodes is a second node, so the similarity of each first entity pair is marked as 0 in the example.
Further, in Fig. 5, candidate structure data table D stores more distinct seed entities than candidate structure data table E, so candidate structure data table D may be selected as the current structure data table, and the method returns to step S404.
After step S404 is performed on Table D, two candidate structure data tables F and G are established as an example, corresponding to the "created" and "edited" edge types. After steps S405 and S406 are performed on Tables F and G, the current structure data table is determined to be H. In Table H, the current target nodes corresponding to first nodes 1, 2 and 3 are all second nodes, so the similarities of the first entity pairs (1, 2), (2, 3) and (3, 1) can be marked as 1 in the example.
S408: when the length of the visited path in each current structure data table is greater than a third preset value, or when the number of seed entities in each current structure data table is less than a fourth preset value, collect all the determined meta-path instances and obtain the meta-paths corresponding to all the meta-path instances.
The third preset value may be a preset maximum length of the visited path, and the fourth preset value may be a preset minimum number of seed entities that a structure data table should contain.
Finally, as shown in Table H, an important meta-path with a length of 4 hops can be determined as an example:
In this embodiment, because the determined meta-paths are important meta-paths connecting pairs of seed entities, they reflect the specific common features shared by the seed entities more accurately. Entity set expansion is more effective when it uses the important meta-paths determined by the meta-path determination method of the embodiment shown in Fig. 4.
Optionally, in the embodiment shown in Fig. 4, a candidate structure data table further includes all the visited nodes, and each row of a candidate structure data table consisting of "a first entity pair, the similarity of that first entity pair, and all the visited nodes corresponding to that first entity pair" is called a tuple; that is, each row of Table 2 consisting of "(s, t), σ(s, t | Π) and (s, ..., t)" is called a tuple. On this basis, after step S404 and before step S405, the meta-path determination method further includes:
Judging whether each current target node is an already visited node stored in the tuple in which that current target node is located;
If not, performing step S405; if so, deleting the tuple in which that current target node is located from the corresponding candidate structure data table, and then performing step S405.
It can be seen that in this embodiment, because each tuple of a candidate structure data table also records all the visited nodes, and each current target node is checked against the visited nodes when it is determined, the determined meta-paths are prevented from forming loops. This avoids endless traversal of the heterogeneous information network and improves the efficiency of determining meta-paths.
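The loop check between steps S404 and S405 can be sketched as a filter over tuples. The sketch assumes a tuple is stored as (first node, current target, similarity, visited nodes) and that the visited nodes end with the current target itself; both the layout and the function name are hypothetical.

```python
from typing import List, Tuple

RowT = Tuple[str, str, int, Tuple[str, ...]]  # (first node, current target, similarity, visited)

def drop_revisits(rows: List[RowT]) -> List[RowT]:
    """Delete tuples whose current target node was already visited earlier on the path."""
    kept: List[RowT] = []
    for s, t, sim, visited in rows:
        # `visited` ends with t itself, so only the nodes before it are checked.
        if t in visited[:-1]:
            continue
        kept.append((s, t, sim, visited))
    return kept

rows: List[RowT] = [
    ("1", "film12", 0, ("1", "film12")),         # new node: kept
    ("1", "1", 0, ("1", "film12", "1")),         # returns to its own source: deleted
    ("2", "person7", 0, ("2", "film17", "person7")),
]
clean = drop_revisits(rows)
```

Removing such tuples before step S405 means a path can never return to a node it has already passed through.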
Optionally, in one implementation provided by the embodiments of the present invention, step S406 in the embodiment shown in Fig. 4, that is, selecting from the candidate structure data tables the candidate structure data table that meets the second preset condition as the current structure data table, includes:
From the multiple candidate structure data tables whose similarity score is not greater than a first preset value, selecting the candidate structure data table that meets the second preset condition as the current structure data table.
It is not difficult to see that selecting the current structure data table only from the candidate structure data tables whose similarity score is not greater than the first preset value further narrows the meta-path search range, reduces computation, and helps save more computing resources.
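The score filter can be sketched as a pre-selection step. The threshold formula m*(m-1)/2+1 is the first preset value used in the experiments described later in this document; the tuple layout (first node, current target, similarity, visited nodes) and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

RowT = Tuple[str, str, int, Tuple[str, ...]]  # (first node, current target, similarity, visited)

def first_preset_value(m: int) -> int:
    """Experimental setting from this document: m*(m-1)/2 + 1 for m seed entities."""
    return m * (m - 1) // 2 + 1

def eligible_tables(tables: Dict[str, List[RowT]], m: int) -> Dict[str, List[RowT]]:
    """Keep only candidate tables whose similarity score does not exceed the first preset value."""
    limit = first_preset_value(m)
    return {name: rows for name, rows in tables.items()
            if sum(sim for _, _, sim, _ in rows) <= limit}

# Two illustrative candidate tables for m = 3 seed entities (scores 3 and 6).
low = [("1", "2", 1, ()), ("2", "3", 1, ()), ("3", "1", 1, ())]
high = low + [("1", "3", 1, ()), ("2", "1", 1, ()), ("3", "2", 1, ())]
remaining = eligible_tables({"low": low, "high": high}, m=3)
```

The second preset condition would then be applied only to the tables that remain after this filter.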
To further illustrate the effectiveness of the entity set expansion method provided by the embodiments of the present invention, the applicant verified the method experimentally. The verification process is as follows:
1) Determining the target knowledge graph
The applicant uses the classic Yago knowledge graph as the target knowledge graph. The data in Yago mainly comes from Wikipedia, WordNet and GeoNames. The Yago dataset currently has about 10 million entities and 120 million facts. Here, three parts of Yago, "yagoFacts", "yagoSimpleTypes" and "yagoTaxonomy", are mainly used as the data source; these three parts contain 35 kinds of relations, 1.3 million entities, and more than 3,000 entity types. Table 3 describes the three parts in detail.
Table 3
2) Determining the validation sets
The applicant selected four representative classes of validation sets to verify the effectiveness of the entity set expansion method provided by the embodiments of the present invention: actors who appeared in films directed by Steven Spielberg; software produced by companies located in Mountain View, California (Mountain View of California); films directed by directors who have won the National Film Award; and scientists at universities located in Cambridge, Massachusetts (Cambridge of Massachusetts). The entities in these four validation sets are denoted Actor*, Software*, Movie* and Scientist* respectively, and the four validation sets contain 112, 98, 653 and 202 entities respectively.
3) Effectiveness evaluation criteria
Effectiveness is measured with the p@k and MAP criteria. p@k is the percentage of positive examples among the top k results after the candidate entities in the candidate entity set are ranked by significance level.
Three criteria, p@30, p@60 and p@90, are mainly used here. MAP is the mean of the average precision at p@30, p@60 and p@90, expressed as: where rel_i = 1 if the candidate entity at rank i is a positive example, and rel_i = 0 otherwise.
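The evaluation criteria can be sketched in Python as follows. p@k follows the definition in the text directly. The MAP formula is not reproduced in the text, so the `average_precision` function below uses a common textbook definition (the mean of p@i over the ranks i that hold positive examples) and should be read as an assumption rather than the exact formula of the patent; all function names are hypothetical.

```python
from typing import List, Sequence, Set

def precision_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """p@k: fraction of positive examples among the top k ranked candidates."""
    return sum(1 for c in ranked[:k] if c in positives) / k

def average_precision(ranked: List[str], positives: Set[str], k: int) -> float:
    """Assumed definition: mean of p@i over positions i <= k where rel_i = 1."""
    hits, total = 0, 0.0
    for i, c in enumerate(ranked[:k], start=1):
        if c in positives:  # rel_i = 1
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def mean_ap(ranked: List[str], positives: Set[str],
            cutoffs: Sequence[int] = (30, 60, 90)) -> float:
    """Mean of the average precision over the cutoffs used in the text."""
    return sum(average_precision(ranked, positives, k) for k in cutoffs) / len(cutoffs)

ranked = ["a", "x", "b", "y"]
positives = {"a", "b"}
p2 = precision_at_k(ranked, positives, 2)
```

Note that `precision_at_k` divides by k even when fewer than k candidates are available, which treats missing results as negatives.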
4) Determining the comparison methods
The entity set expansion method provided by the embodiments of the present invention (Meta Path based Entity Set Expansion, MP_ESE for short) is compared with the following three methods:
(1) Link-based (Link-Based) entity set expansion. Inspired by pattern-based methods for text or web pages, this is an entity set expansion method based on the one-hop link relations of entities.
(2) Nearest-neighbor (Nearest-Neighbor) entity set expansion. This is an entity set expansion method that considers both one-hop links and the nearest neighbors of one-hop entities.
(3) Path-Constrained Random Walk (PCRW) entity set expansion. This is a path-based random walk method for heterogeneous networks, giving an entity set expansion method based on 2-hop link relations.
For each method, three seeds are randomly selected from the validation set for each experiment, and each method is run 30 times with the results averaged for comparison. In the entity set expansion method provided by the embodiments of the present invention, the first preset value is set to m*(m-1)/2+1, where m is the number of seed entities, and the maximum meta-path length is set to 4.
5) Verification results
The verification results are shown in Fig. 6A to Fig. 6D, which correspond in turn to the entity classes Actor*, Movie*, Software* and Scientist*. Fig. 6A to Fig. 6D show that entity set expansion using the method provided by the embodiments of the present invention achieves higher accuracy than the baseline methods, especially in the Actor* and Movie* classes. In those two classes, the baseline methods have low accuracy because one-hop or two-hop links cannot distinguish fine-grained entity classes well, whereas the meta-paths used by the method provided by the embodiments of the present invention have more hops and can distinguish fine-grained entity classes well, so accuracy is high. In the Software* class, the accuracy of the method provided by the embodiments of the present invention is close to that of the PCRW method, because Software* is an overlapping class: besides the given entity class, its entities also belong to another, coarser-grained entity class, namely software produced by the same company.
In addition, Fig. 6A to Fig. 6D show that the accuracy of the Link-Based algorithm in every class is significantly lower than that of the entity set expansion method provided by the embodiments of the present invention. The reason is that the Link-Based algorithm relies on one-hop links, which carry very little semantic information and cannot accurately reflect the specific common features shared by the seed entities. The entity set expansion method provided by the embodiments of the present invention instead uses multi-hop links (meta-paths) that accurately reflect those features, so it captures the precise semantics of the seed entities and thereby improves the accuracy of entity set expansion.
To illustrate the effectiveness of the entity set expansion method provided by the embodiments of the present invention more intuitively, Table 4 lists the top three important meta-paths determined by the method in the Actor* class. As Table 4 shows, these meta-paths reflect the potential specific common features shared by the seed entities of the Actor* class, and they can be used to determine more entities of this class as entities to be expanded.
Table 4
In summary, the entity set expansion method provided by the embodiments of the present invention is more effective than the three baseline methods above.
Corresponding to the method embodiments above, the embodiments of the present invention further provide an entity set expansion device, which is described in detail below.
As shown in Fig. 7, the embodiments of the present invention provide an entity set expansion device, which includes: a candidate entity set determination module 701, a meta-path determination module 702, a first significance level determination module 703, a second significance level determination module 704, and an entity set expansion module 705;
The candidate entity set determination module 701 is configured to extract candidate entities from a target knowledge graph according to a predetermined seed entity set, and to form a candidate entity set from the extracted candidate entities; the target knowledge graph includes at least the seed entities in the seed entity set;
The meta-path determination module 702 is configured to determine the meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph; a meta-path is a connecting path between two node types in the heterogeneous information network, composed of entity types and relation types, where the two node types are the node types corresponding to different seed entities in the seed entity set;
The first significance level determination module 703 is configured to determine the first significance level of each meta-path according to the number of seed entity pairs connected by that meta-path;
The second significance level determination module 704 is configured to determine the second significance level of each candidate entity in the candidate entity set according to the first significance level of each meta-path;
The entity set expansion module 705 is configured to determine, as entities to be expanded, the candidate entities in the candidate entity set whose second significance level meets a first preset condition, and to add the entities to be expanded to the seed entity set.
A kind of entity set expanding unit provided in an embodiment of the present invention, on the one hand, by comprising the huge target of data volume Knowledge mapping carries out entity set extension as data source;On the other hand, from heterogeneous information network corresponding with object knowledge collection of illustrative plates The middle first path determined between kind of fructification, and since it is determined that first path of each type be a connection kind fructification pair Path, therefore, these yuan of path can accurately reflect the potential common trait of seed inter-entity, and then utilize the of first path Second significance level of candidate's entity determined by one significance level more effectively, and then according to the second significance level determine treating Extend entity also more effectively.So, entity set extended method provided in an embodiment of the present invention can improve entity set extension Validity.
In one implementation provided by an embodiment of the present invention, the candidate entity set determination module 701 of the embodiment shown in Fig. 7 may specifically include: an entity type set determination submodule, an initial entity type set determination submodule, and a final entity type set determination submodule;
The entity type set determination submodule is configured to determine the entity type set of each seed entity in the predetermined seed entity set;
The initial entity type set determination submodule is configured to take the intersection of all the entity type sets as the initial entity type set;
The final entity type set determination submodule is configured to determine the final entity type set corresponding to the seed entity set according to the hierarchical relationships among the entity types in the initial entity type set, and to take the entities in the target knowledge graph that match an entity type in the final entity type set as candidate entities.
More specifically, the final entity type set determination submodule may include a first determining unit and a second determining unit.
The first determining unit is configured to determine at least one hierarchical relationship corresponding to the initial entity type set, where any hierarchical relationship is a subordination relationship among at least two entity types;
The second determining unit is configured to determine the entity type located at the bottom level of each hierarchical relationship as a final entity type, and to form the final entity type set from the determined final entity types.
It can be seen that, in this embodiment, on the one hand, the initial entity type set is the intersection of the entity type sets of the seed entities, and the entity types in that intersection better reflect the common features of the seed entities. On the other hand, the entity types located at the bottom level of the initial entity type set better represent the semantics of the seed entities. Because the final entity type set is determined according to the hierarchical relationships among the entity types in the initial entity type set, candidate entities selected according to the final entity type set are more likely to share specific common features with the seed entities and are more likely to be added to the seed entity set as entities to be extended, which better ensures the effectiveness of entity set extension.
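The type-based candidate selection described above can be illustrated with a minimal Python sketch. The data layout is an assumption made only for illustration: `seed_types` maps each seed entity to its set of entity types, `parent_of` encodes the type hierarchy as child-to-parent links, and the function names are hypothetical. Excluding the seeds themselves from the candidates is likewise an assumption.

```python
from typing import Dict, Optional, Set


def final_entity_types(
    seed_types: Dict[str, Set[str]],
    parent_of: Dict[str, Optional[str]],
) -> Set[str]:
    """Intersect the seeds' type sets, then keep only the bottom-level types."""
    type_sets = list(seed_types.values())
    if not type_sets:
        return set()
    # Initial entity type set: the types shared by every seed entity.
    initial = set.intersection(*type_sets)

    # A type is dropped when some other type in the initial set descends from it.
    has_descendant: Set[str] = set()
    for entity_type in initial:
        seen = {entity_type}
        ancestor = parent_of.get(entity_type)
        while ancestor is not None and ancestor not in seen:
            if ancestor in initial:
                has_descendant.add(ancestor)
            seen.add(ancestor)
            ancestor = parent_of.get(ancestor)
    return initial - has_descendant


def candidate_entities(
    entity_types: Dict[str, Set[str]],
    final_types: Set[str],
    seeds: Set[str],
) -> Set[str]:
    """Entities that carry a final type and are not already seeds (assumption)."""
    return {
        entity
        for entity, types in entity_types.items()
        if entity not in seeds and types & final_types
    }


# Toy data (hypothetical): two seeds and a small type hierarchy.
seed_types = {
    "seed_a": {"Thing", "Person", "Actor", "Producer"},
    "seed_b": {"Thing", "Person", "Actor"},
}
parent_of = {"Thing": None, "Person": "Thing", "Actor": "Person", "Producer": "Person"}
final_types = final_entity_types(seed_types, parent_of)
print(final_types)  # {'Actor'}
```

In the toy data the intersection is {Thing, Person, Actor}; only Actor sits at the bottom of its hierarchy, so it alone is kept as a final entity type.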
In one implementation provided by an embodiment of the present invention, the meta-path determination module 702 of the embodiment shown in Fig. 7 may include: a node determination submodule, a traversal module, and a determination submodule.
The node determination submodule is configured to determine, from the heterogeneous information network corresponding to the target knowledge graph, a group of nodes corresponding to the seed entities in the seed entity set;
The traversal module is configured to traverse the heterogeneous information network with each determined node as a source node and, when a target node is a seed entity other than the source node itself, to determine the path connecting the source node and the target node as a meta-path instance;
The determination submodule is configured to collect all the determined meta-path instances and to obtain the meta-paths corresponding to these instances according to the entity types and relation types that the instances contain.
It can be seen that, because only the determined group of nodes corresponding to the seed entities in the seed entity set is used as source nodes when traversing the heterogeneous information network to determine the important meta-paths, the traversal range for determining meta-paths is reduced. This not only improves the efficiency of determining meta-paths but also helps save computing resources.
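As a rough illustration of this seed-rooted traversal, the following Python sketch enumerates meta-path instances by depth-first search from each seed and abstracts every instance into a meta-path, that is, an alternating sequence of entity types and relation types. The adjacency-list layout, the `max_len` bound, and the function name are assumptions made for the sketch; the text above does not prescribe a particular traversal order or length limit.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]        # (relation type, neighbour node)
MetaPath = Tuple[str, ...]    # (type, relation, type, relation, type, ...)


def find_meta_paths(
    graph: Dict[str, List[Edge]],
    node_type: Dict[str, str],
    seeds: List[str],
    max_len: int = 3,
) -> Dict[MetaPath, Set[Tuple[str, str]]]:
    """Map each meta-path to the (source seed, target seed) pairs it connects."""
    if max_len < 1:
        return {}
    seed_set = set(seeds)
    pairs_by_meta_path: Dict[MetaPath, Set[Tuple[str, str]]] = defaultdict(set)

    def walk(source: str, node: str, visited: Set[str],
             signature: MetaPath, depth: int) -> None:
        for relation, nxt in graph.get(node, []):
            if nxt in visited:          # never revisit a node on the same path
                continue
            new_signature = signature + (relation, node_type[nxt])
            if nxt in seed_set:         # reached a seed other than the source
                pairs_by_meta_path[new_signature].add((source, nxt))
            if depth + 1 < max_len:     # depth counts edges already traversed
                walk(source, nxt, visited | {nxt}, new_signature, depth + 1)

    for seed in seeds:
        walk(seed, seed, {seed}, (node_type[seed],), 0)
    return dict(pairs_by_meta_path)


# Toy network (hypothetical): two actor seeds that share one movie.
graph = {
    "a1": [("actsIn", "m1")],
    "a2": [("actsIn", "m1")],
    "m1": [("hasActor", "a1"), ("hasActor", "a2")],
}
node_type = {"a1": "Actor", "a2": "Actor", "m1": "Movie"}
result = find_meta_paths(graph, node_type, ["a1", "a2"])
```

Here `result` holds a single meta-path, Actor-actsIn-Movie-hasActor-Actor, connecting the ordered pairs (a1, a2) and (a2, a1).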
As shown in Fig. 8, in one implementation provided by an embodiment of the present invention, the meta-path determination module 702 may include: a node set determination submodule 801, a first node determination submodule 802, a current source node determination submodule 803, a candidate structure data table building submodule 804, a first judgment submodule 805, a selection submodule 806, an update submodule 807, and a meta-path determination submodule 808;
The node set determination submodule 801 is configured to determine, from the heterogeneous information network corresponding to the target knowledge graph, the node set corresponding to the seed entity set, where the node set includes the nodes corresponding to the seed entities in the seed entity set;
The first node determination submodule 802 is configured to take each node in the node set as a first node;
The current source node determination submodule 803 is configured to take each first node as a current source node;
The candidate structure data table building submodule 804 is configured to access, in the heterogeneous information network, the current target nodes connected to each current source node by edges of preset types, and to build multiple candidate structure data tables, each corresponding to an edge type;
The first judgment submodule 805 is configured to judge, for each candidate structure data table, whether the current target node connected to each current source node in that table is a second node; if so, the similarity of the first entity pair corresponding to that current source node in the table is recorded as a first value, and the visited path corresponding to that current source node is determined as a meta-path instance; otherwise, the similarity is recorded as a second value; the second node is a node in the seed entity set that differs from the first node corresponding to the current source node;
The selection submodule 806 is configured to select, from the candidate structure data tables, the table that satisfies a second preset condition as the current structure data table; the second preset condition includes: the table stores the largest number of distinct seed entities;
The update submodule 807 is configured to update each current target node in the current structure data table to be a current source node and to trigger the candidate structure data table building submodule 804;
The meta-path determination submodule 808 is configured to, when the length of the visited paths in the current structure data tables exceeds a third preset value, or when the number of seed entities in the current structure data tables is less than a fourth preset value, collect all the determined meta-path instances and obtain the meta-paths corresponding to them according to the entity types and relation types that the instances contain.
The third preset value may be a preset maximum length of a visited path, and the fourth preset value may be a preset minimum number of seed entities that a structure data table must contain.
In this embodiment, because the determined meta-paths are important meta-paths that connect seed entity pairs, they reflect the specific common features among the seed entities more accurately. Entity set extension performed with the important meta-paths determined by the device of the embodiment shown in Fig. 8 therefore achieves higher accuracy.
Optionally, in the embodiment shown in Fig. 8 of the present invention, a candidate structure data table also stores all the visited nodes, and a row of the table consisting of "a first entity pair, the similarity of the first entity pair, and all the visited nodes corresponding to the first entity pair" is called a tuple. On this basis, after the candidate structure data table building submodule 804 is triggered and before the first judgment submodule 805 is triggered, the meta-path determination module 702 may further include:
A second judgment submodule, configured to judge whether each current target node is a visited node stored in the tuple in which that current target node is located;
A triggering submodule, configured to trigger the candidate structure data table building submodule 804 when the judgment result of the second judgment submodule is no; and, when the judgment result of the second judgment submodule is yes, to delete the tuple in which the current target node is located from the corresponding candidate structure data table and then trigger the candidate structure data table building submodule 804.
It can be seen that, in this embodiment, each tuple of a candidate structure data table also records all the visited nodes, and each current target node is checked against these visited nodes when it is determined. This prevents the determined meta-paths from forming loops, avoids endless traversal of the heterogeneous information network, and improves the efficiency of determining meta-paths.
Optionally, in one implementation provided by an embodiment of the present invention, the selection submodule 806 of the embodiment shown in Fig. 8 is configured to select, from the multiple candidate structure data tables whose similarity score is not greater than a first preset value, the table that satisfies the second preset condition as the current structure data table.
It can be seen that selecting the current structure data table only from the candidate structure data tables whose similarity score is not greater than the first preset value further narrows the meta-path search range and reduces the amount of computation, which helps further improve the efficiency of determining meta-paths and saves computing resources.
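The table selection step can be sketched in Python as follows. This is a simplified illustration under several assumptions: the `Row` and `Table` classes and their field names are hypothetical, the first and second similarity values are taken to be 1 and 0, and `max_score` stands in for the first preset value. The tie-break on the fewest first entity pairs follows the method claims.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    source_seed: str           # first node: the seed the path started from
    current_node: str          # current target node reached over the edge
    similarity: int            # assumed: 1 if current_node is another seed, else 0
    visited: Tuple[str, ...]   # nodes already on the path (used for loop checks)


@dataclass
class Table:
    edge_type: str
    rows: List[Row]

    @property
    def similarity_score(self) -> int:
        # Sum of the similarities of all first entity pairs in the table.
        return sum(row.similarity for row in self.rows)

    @property
    def seed_kinds(self) -> int:
        # Number of distinct seed entities stored in the table.
        return len({row.source_seed for row in self.rows})


def choose_table(tables: List[Table], max_score: int) -> Optional[Table]:
    """Pick the table with the most distinct seeds; break ties by fewest rows."""
    eligible = [t for t in tables if t.similarity_score <= max_score]
    if not eligible:
        return None
    return min(eligible, key=lambda t: (-t.seed_kinds, len(t.rows)))


# Toy tables (hypothetical), one per edge type.
t1 = Table("actsIn", [
    Row("s1", "m1", 0, ("s1",)),
    Row("s2", "m1", 0, ("s2",)),
    Row("s1", "m2", 0, ("s1",)),
])
t2 = Table("bornIn", [
    Row("s1", "c1", 0, ("s1",)),
    Row("s2", "c2", 0, ("s2",)),
])
t3 = Table("marriedTo", [Row("s1", "s2", 1, ("s1",))])
chosen = choose_table([t1, t2, t3], max_score=0)
```

With these toy tables, `t3` is filtered out by the similarity score bound, `t1` and `t2` both store two distinct seeds, and `t2` wins the tie because it holds fewer first entity pairs.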
In one implementation provided by an embodiment of the present invention, the first importance determination module 703 of the embodiment shown in Fig. 7 is specifically configured to determine, from the seed entity pairs connected by each meta-path, the total number of seed entity pairs connected by that meta-path, and to determine the first importance degree of each meta-path according to that total number and a first preset model;
Wherein the first preset model is:
The physical meaning of each parameter is the same as in the foregoing method embodiment and is not repeated here.
It can be seen that the first importance degree is positively correlated with the total number of seed entity pairs connected by a meta-path: the more seed entity pairs a meta-path connects, the better that meta-path reflects the specific common features among the seed entities. The first importance degree determined from the total number of seed entity pairs connected by a meta-path is therefore more accurate.
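A minimal Python sketch of a weight with this property is given below. The first preset model itself is not reproduced here, so the normalisation is an assumption: each meta-path's weight is taken as the number of seed entity pairs it connects divided by the total number of ordered seed pairs, m(m-1). Only the positive dependence on the pair count is taken from the text; the function name and input layout are hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple


def first_importance(
    pairs_by_meta_path: Dict[Hashable, Set[Tuple[str, str]]],
    num_seeds: int,
) -> Dict[Hashable, float]:
    """Weight each meta-path by the share of seed pairs it connects (assumed form)."""
    total_pairs = num_seeds * (num_seeds - 1)   # ordered pairs; an assumption
    if total_pairs == 0:
        return {}
    return {
        meta_path: len(pairs) / total_pairs
        for meta_path, pairs in pairs_by_meta_path.items()
    }


# Toy input (hypothetical): two seeds, two meta-paths.
weights = first_importance(
    {
        "P1": {("s1", "s2"), ("s2", "s1")},
        "P2": {("s1", "s2")},
    },
    num_seeds=2,
)
print(weights)  # {'P1': 1.0, 'P2': 0.5}
```

P1 connects both ordered seed pairs and so receives the larger weight, matching the positive correlation stated above.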
In one implementation provided by an embodiment of the present invention, the second importance determination module 704 of the embodiment shown in Fig. 7 is configured to determine the second importance degree of each candidate entity in the candidate entity set according to the first importance degree of each meta-path and a second preset model;
Wherein the second preset model is:
The physical meaning of each parameter is the same as in the foregoing method embodiment and is not repeated here.
It can be seen that the second importance degree is positively correlated with the first importance degree: the larger the first importance degree of a meta-path, the better that meta-path reflects the specific common features among the seed entities. The second importance degree of a candidate entity determined from the first importance degrees is therefore more accurate.
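The scoring of candidate entities can be sketched in Python as follows. The second preset model is not reproduced here, so the exact form is an assumption: for each candidate, the sketch sums the weights of the meta-paths that connect it to each seed entity and averages over the seeds. The indicator of whether a meta-path connects a candidate and a seed is represented by membership in a hypothetical `connected` set of (candidate, seed, meta-path) triples.

```python
from typing import Dict, Iterable, List, Set, Tuple


def second_importance(
    candidates: Iterable[str],
    seeds: List[str],
    weights: Dict[str, float],
    connected: Set[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Score each candidate by the weighted meta-paths linking it to the seeds."""
    num_seeds = len(seeds)
    scores: Dict[str, float] = {}
    for candidate in candidates:
        total = sum(
            weight
            for seed in seeds
            for meta_path, weight in weights.items()
            if (candidate, seed, meta_path) in connected   # indicator r = 1
        )
        # Averaging over the seeds is an assumed normalisation.
        scores[candidate] = total / num_seeds if num_seeds else 0.0
    return scores


# Toy input (hypothetical).
weights = {"P1": 1.0, "P2": 0.5}
connected = {
    ("c1", "s1", "P1"),
    ("c1", "s2", "P1"),
    ("c1", "s1", "P2"),
    ("c2", "s1", "P2"),
}
scores = second_importance(["c1", "c2", "c3"], ["s1", "s2"], weights, connected)
print(scores)  # {'c1': 1.25, 'c2': 0.25, 'c3': 0.0}
```

Candidate c1 is linked to both seeds through the highest-weighted meta-path and therefore scores highest, while c3, which no meta-path connects to any seed, scores zero.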
In one implementation provided by an embodiment of the present invention, the entity set extension module 705 of the embodiment shown in Fig. 7 is specifically configured to determine the candidate entities in the candidate entity set whose second importance degree is greater than a second preset value as entities to be extended.
In another implementation provided by an embodiment of the present invention, the entity set extension module 705 of the embodiment shown in Fig. 7 is specifically configured to sort the candidate entities in the candidate entity set in descending order of second importance degree to obtain a first candidate entity set, and to select the top-ranked first preset number of candidate entities from the first candidate entity set as entities to be extended.
The applicant verified the selected first preset number of entities to be extended using ranking metrics based on the target knowledge graph, which confirmed the effectiveness of the method.
In the above two implementations, the entities to be extended are determined according to the second importance degree. Because the second importance degree correctly reflects the specific common features shared by a candidate entity and the seed entities, the entities to be extended determined according to the second importance degree are more effective, which ensures the effectiveness of entity set extension.
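The two selection strategies can be sketched in Python as follows. The function names are hypothetical, `threshold` stands in for the second preset value, and `k` stands in for the first preset number; how ties between equal scores are ordered is not specified in the text, so the sketch simply relies on Python's stable sort.

```python
from typing import Dict, List


def select_by_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep the candidates whose second importance degree exceeds the threshold."""
    return [candidate for candidate, score in scores.items() if score > threshold]


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Sort candidates by descending score and keep the first k."""
    ranked = sorted(scores, key=lambda candidate: scores[candidate], reverse=True)
    return ranked[:k]


def extend_seed_set(seeds: List[str], selected: List[str]) -> List[str]:
    """Add the selected entities to the seed entity set, skipping duplicates."""
    return seeds + [entity for entity in selected if entity not in seeds]


# Toy scores (hypothetical).
scores = {"c1": 1.25, "c2": 0.25, "c3": 0.75}
print(select_by_threshold(scores, 0.5))          # ['c1', 'c3']
print(select_top_k(scores, 2))                   # ['c1', 'c3']
print(extend_seed_set(["s1", "s2"], ["c1"]))     # ['s1', 's2', 'c1']
```

The threshold variant returns a variable number of entities, whereas the top-k variant always returns at most k, which is convenient when the result is evaluated with ranking metrics.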
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims (10)

1. An entity set extension method, characterized in that the method comprises:
extracting candidate entities from a target knowledge graph according to a predetermined seed entity set, and forming a candidate entity set from the extracted candidate entities, wherein the target knowledge graph includes at least the seed entities in the seed entity set;
determining the meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph, wherein a meta-path is a connection path between two node types in the heterogeneous information network that is composed of entity types and relation types, and the two node types are the node types corresponding to different seed entities in the seed entity set;
determining the first importance degree of each meta-path according to the number of seed entity pairs connected by that meta-path;
determining the second importance degree of each candidate entity in the candidate entity set according to the first importance degree of each meta-path;
determining the candidate entities in the candidate entity set whose second importance degree satisfies a first preset condition as entities to be extended, and adding the entities to be extended to the seed entity set.
2. The method according to claim 1, characterized in that extracting candidate entities from the target knowledge graph according to the predetermined seed entity set comprises:
determining the entity type set of each seed entity in the predetermined seed entity set;
taking the intersection of all the entity type sets as the initial entity type set;
determining the final entity type set corresponding to the seed entity set according to the hierarchical relationships among the entity types in the initial entity type set, and taking the entities in the target knowledge graph that match an entity type in the final entity type set as candidate entities.
3. The method according to claim 2, characterized in that determining the final entity type set according to the hierarchical relationships among the entity types in the initial entity type set comprises:
determining at least one hierarchical relationship corresponding to the initial entity type set, wherein any hierarchical relationship is a subordination relationship among at least two entity types;
determining the entity type located at the bottom level of each hierarchical relationship as a final entity type, and forming the final entity type set from the determined final entity types.
4. The method according to claim 1, characterized in that determining the meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph comprises:
determining, from the heterogeneous information network corresponding to the target knowledge graph, the node set corresponding to the seed entity set, wherein the node set includes the nodes corresponding to the seed entities in the seed entity set;
taking each node in the node set as a first node;
taking each first node as a current source node, accessing, in the heterogeneous information network, the current target nodes connected to each current source node by edges of preset types, and building multiple candidate structure data tables, each corresponding to an edge type; wherein any candidate structure data table includes: first entity pairs, each composed of a first node and a current target node connected by an edge of the edge type corresponding to that table, the similarity of each first entity pair, the visited paths, and a similarity score; the similarity score is the sum of the similarities of all the first entity pairs;
for each candidate structure data table, judging whether the current target node connected to each current source node in that table is a second node; if so, recording the similarity of the first entity pair corresponding to that current source node in the table as a first value and determining the visited path corresponding to that current source node as a meta-path instance; otherwise, recording the similarity as a second value; wherein the second node is a node in the node set that differs from the first node corresponding to the current source node;
selecting, from the candidate structure data tables, the table that satisfies a second preset condition as the current structure data table; the second preset condition includes: the table stores the largest number of distinct seed entities; when multiple candidate structure data tables store the largest number of distinct seed entities, the second preset condition further includes: the table stores the smallest number of first entity pairs;
updating each current target node in the current structure data table to be a current source node, and returning to the step of accessing, in the heterogeneous information network, the current target nodes connected to each current source node by edges of preset types;
when the length of the visited paths in the current structure data tables exceeds a third preset value, or when the number of seed entities in the current structure data tables is less than a fourth preset value, collecting all the determined meta-path instances and obtaining the meta-paths corresponding to them according to the entity types and relation types that the instances contain.
5. The method according to claim 4, characterized in that selecting, from the candidate structure data tables, the table that satisfies the second preset condition as the current structure data table comprises:
selecting, from the multiple candidate structure data tables whose similarity score is not greater than a first preset value, the table that satisfies the second preset condition as the current structure data table.
6. The method according to any one of claims 1-4, characterized in that determining the first importance degree of each meta-path according to the number of seed entity pairs connected by that meta-path comprises:
determining, from the seed entity pairs connected by each meta-path, the total number of seed entity pairs connected by that meta-path;
determining the first importance degree of each meta-path according to the total number of seed entity pairs connected by that meta-path and a first preset model;
wherein the first preset model is:
where $W_k$ is the first importance degree corresponding to meta-path $P_k$, $l$ is the number of meta-paths, $SP_k$ is the total number of seed entity pairs connected by meta-path $P_k$, $m$ is the number of seed entities, and the model's remaining symbol denotes the total number of seed entity pairs.
7. The method according to any one of claims 1-4, characterized in that determining the second importance degree of each candidate entity in the candidate entity set according to the first importance degree of each meta-path comprises:
determining the second importance degree of each candidate entity in the candidate entity set according to the first importance degree of each meta-path and a second preset model;
wherein the second preset model is:
$s_j \in S$, $i \in \{1, 2, 3, \ldots, n\}$, where $R(c_i, S)$ denotes the second importance degree of candidate entity $c_i$ and $n$ is the number of candidate entities; $s_j$ denotes a seed entity, $S$ denotes the seed entity set, and $m$ is the number of seed entities; $W_k$ is the first importance degree corresponding to meta-path $P_k$, and $l$ is the number of meta-paths; $r\{(c_i, s_j) \mid P_k\}$ indicates whether meta-path $P_k$ connects seed entity $s_j$ and candidate entity $c_i$: $r = 1$ if it does, and $r = 0$ otherwise.
8. The method according to any one of claims 1-4, characterized in that determining the candidate entities in the candidate entity set whose second importance degree satisfies the first preset condition as entities to be extended comprises:
determining the candidate entities in the candidate entity set whose second importance degree is greater than a second preset value as entities to be extended.
9. The method according to any one of claims 1-4, characterized in that determining the candidate entities in the candidate entity set whose second importance degree satisfies the first preset condition as entities to be extended comprises:
sorting the candidate entities in the candidate entity set in descending order of second importance degree to obtain a first candidate entity set, and selecting the top-ranked first preset number of candidate entities from the first candidate entity set as entities to be extended.
10. An entity set extension device, characterized in that the device comprises:
a candidate entity set determination module, configured to extract candidate entities from a target knowledge graph according to a predetermined seed entity set and to form a candidate entity set from the extracted candidate entities, wherein the target knowledge graph includes at least the seed entities in the seed entity set;
a meta-path determination module, configured to determine the meta-paths between seed entities from the heterogeneous information network corresponding to the target knowledge graph, wherein a meta-path is a connection path between two node types in the heterogeneous information network that is composed of entity types and relation types, and the two node types are the node types corresponding to different seed entities in the seed entity set;
a first importance determination module, configured to determine the first importance degree of each meta-path according to the number of seed entity pairs connected by that meta-path;
a second importance determination module, configured to determine the second importance degree of each candidate entity in the candidate entity set according to the first importance degree of each meta-path;
an entity set extension module, configured to determine the candidate entities in the candidate entity set whose second importance degree satisfies a first preset condition as entities to be extended, and to add the entities to be extended to the seed entity set.
CN201710168839.XA 2017-03-21 2017-03-21 Entity set extension method and device Active CN106951526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710168839.XA CN106951526B (en) 2017-03-21 2017-03-21 Entity set extension method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710168839.XA CN106951526B (en) 2017-03-21 2017-03-21 Entity set extension method and device

Publications (2)

Publication Number Publication Date
CN106951526A true CN106951526A (en) 2017-07-14
CN106951526B CN106951526B (en) 2020-08-07

Family

ID=59472639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710168839.XA Active CN106951526B (en) 2017-03-21 2017-03-21 Entity set extension method and device

Country Status (1)

Country Link
CN (1) CN106951526B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609152A (en) * 2017-09-22 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for expanding query formula
CN109145119A (en) * 2018-07-02 2019-01-04 北京妙医佳信息技术有限公司 The knowledge mapping construction device and construction method of health management arts
CN110019826A (en) * 2017-07-27 2019-07-16 北大医疗信息技术有限公司 Construction method, construction device, equipment and the storage medium of medical knowledge map
CN111488467A (en) * 2020-04-30 2020-08-04 北京建筑大学 Construction method and device of geographical knowledge graph, storage medium and computer equipment
CN112463974A (en) * 2019-09-09 2021-03-09 华为技术有限公司 Method and device for establishing knowledge graph
CN113052968A (en) * 2021-04-30 2021-06-29 电子科技大学 Knowledge graph construction method of three-dimensional structure geological model
CN113221572A (en) * 2021-05-31 2021-08-06 北京字节跳动网络技术有限公司 Information processing method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270458A1 (en) * 2007-04-24 2008-10-30 Gvelesiani Aleksandr L Systems and methods for displaying information about business related entities
CN102844755A (en) * 2010-04-27 2012-12-26 惠普发展公司,有限责任合伙企业 Method of extracting named entity
CN103488724A (en) * 2013-09-16 2014-01-01 复旦大学 Book-oriented reading field knowledge map construction method
CN105913125A (en) * 2016-04-12 2016-08-31 北京邮电大学 Heterogeneous information network element determining method, link prediction method, heterogeneous information network element determining device and link prediction device

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019826A (en) * 2017-07-27 2019-07-16 北大医疗信息技术有限公司 Construction method, construction device, equipment and the storage medium of medical knowledge map
CN110019826B (en) * 2017-07-27 2023-02-28 北大医疗信息技术有限公司 Construction method, construction device, equipment and storage medium of medical knowledge map
CN107609152A (en) * 2017-09-22 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for expanding query formula
CN107609152B (en) * 2017-09-22 2021-03-09 百度在线网络技术(北京)有限公司 Method and apparatus for expanding query expressions
CN109145119A (en) * 2018-07-02 2019-01-04 北京妙医佳信息技术有限公司 The knowledge mapping construction device and construction method of health management arts
CN112463974A (en) * 2019-09-09 2021-03-09 华为技术有限公司 Method and device for establishing knowledge graph
CN111488467A (en) * 2020-04-30 2020-08-04 北京建筑大学 Construction method and device of geographical knowledge graph, storage medium and computer equipment
CN113052968A (en) * 2021-04-30 2021-06-29 电子科技大学 Knowledge graph construction method of three-dimensional structure geological model
CN113052968B (en) * 2021-04-30 2022-08-05 电子科技大学 Knowledge graph construction method of three-dimensional structure geological model
CN113221572A (en) * 2021-05-31 2021-08-06 北京字节跳动网络技术有限公司 Information processing method, device, equipment and medium
CN113221572B (en) * 2021-05-31 2024-05-07 抖音视界有限公司 Information processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN106951526B (en) 2020-08-07

Similar Documents

Publication Publication Date Title
CN106951526A (en) A kind of entity set extended method and device
CN106250412B (en) Knowledge graph construction method based on multi-source entity fusion
Yin et al. Building taxonomy of web search intents for name entity queries
Jin et al. Distance-constraint reachability computation in uncertain graphs
CN103927302B (en) File classification method and system
CN110309289A (en) Sentence generation method, sentence generation device and intelligent equipment
CN103902545B (en) Classification path identification method and system
JP2011258235A (en) System and method for ranking result of search by using click distance
CN111027743B (en) OD optimal path searching method and device based on hierarchical road network
CN108520166A (en) Drug target prediction method based on random walks over multiple similarity networks
CN106951524A (en) Overlapping community discovery method based on node influence power
CN106407302A (en) Method for supporting function of calling specific functions of middleware database through simple SQL
CN104158748B (en) Topology detection method for cloud computing systems
CN104133868B (en) Strategy for classifying and integrating vertical crawler data
CN108345609A (en) Method and apparatus for processing POI information
CN103810260A (en) Complex network community discovery method based on topological characteristics
CN110119478A (en) Item recommendation method based on similarity combining multiple types of user feedback data
Agarwal et al. A social identity approach to identify familiar strangers in a social network
Zhang et al. Exploring time factors in measuring the scientific impact of scholars
Ahmadi et al. Unsupervised matching of data and text
CN106649731A (en) Node similarity searching method based on large-scale attribute network
Hutair et al. Social community detection based on node distance and interest
CN106126681B (en) Incremental stream data clustering method and system
CN107133274A (en) Distributed information retrieval collection selection method based on a graph knowledge base
Partyka et al. Semantic schema matching without shared instances

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant