CN107145523A - Large-scale heterogeneous knowledge base alignment method based on iterative matching - Google Patents
- Publication number: CN107145523A
- Application number: CN201710237034.6A
- Authority: CN (China)
- Prior art keywords: entity, block, matching, knowledge base, set
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a large-scale heterogeneous knowledge base alignment method based on iterative matching, implemented as follows: 1) the data in the original knowledge bases are screened and the data formats unified, after which the relations in the knowledge bases and the initial matched entity pairs are obtained; 2) the preprocessed knowledge bases are partitioned into blocks using the relations in the knowledge bases, and the blocks are pruned; 3) blocks are matched using the matched entity pairs, yielding matched block pairs; 4) candidate entity pairs are selected from the matched block pairs and confirmed using a similarity measure and a threshold; 5) the above steps are repeated until no new candidate entity pairs can be found, yielding all matched entity pairs. By applying the idea of iterative matching to the alignment of heterogeneous knowledge bases, the invention has broad application prospects in fields such as knowledge base alignment, data fusion, and automatic question answering.
Description
Technical field
The present invention relates to the field of knowledge base alignment, and in particular to a large-scale heterogeneous knowledge base alignment method based on iterative matching.
Background art
With the arrival of Web 3.0, structured knowledge bases appear on the Internet ever more frequently. These knowledge bases are widely used in semantic applications such as automatic question answering, search services, and social services. However, the information in any single knowledge base is limited, which restricts the functionality of these applications. In this context, knowledge base alignment has great room for development. Knowledge base alignment usually refers to entity alignment between knowledge bases, that is, automatically finding two entities that denote the same real-world thing and linking them.
Because knowledge bases keep growing in scale, alignment methods generally split the alignment process into two steps: finding candidate entity pairs and confirming candidate entity pairs. Finding candidate entity pairs usually uses a small number of attributes to quickly select a few candidate entities for each entity; confirming candidate entity pairs compares two entities comprehensively and uses a similarity and a threshold to decide whether they match. Because this avoids an exact comparison between every pair of entities, it greatly improves overall efficiency. At present, the bottleneck of knowledge base alignment methods is that the candidate entity pairs found are often incomplete, so entity pairs that could be matched go undiscovered.
To improve the quality of candidate entity pairs, researchers proposed iterative matching: each round finds a small number of matched entity pairs, which serve as the basis for finding candidate entity pairs in the next round. However, traditional knowledge base alignment methods usually focus on aligning homogeneous knowledge bases, that is, two knowledge bases between which many relations can be aligned. Their basic assumption is that if a pair of entities matches and the entities have aligned relations, then their "compatible neighbors" are more likely to match, so the "compatible neighbors" are taken as candidate entity pairs. But when few relations can be aligned between the knowledge bases, the traditional methods miss some candidate entity pairs. To solve this problem, researchers proposed class-based knowledge base alignment methods. Such a method groups instances with the same features into the same class and excludes candidate entities whose content is unrelated to the class, thereby confirming candidate entity pairs. However, because the method obtains candidate entity pairs only in the initial stage of the model, through a classical blocking technique, it also misses many candidate entity pairs when few attributes can be aligned between the two knowledge bases.
Summary of the invention
In view of the above, the present invention proposes a large-scale heterogeneous knowledge base alignment method based on iterative matching. The method aligns knowledge bases using the idea of iterative matching: within an iterative framework it traverses the relations and partitions the knowledge bases by them, which enlarges the search space for candidate entity pairs. At the same time, it selects and confirms candidate entity pairs in a divide-and-conquer manner, so that each entity needs to be compared comprehensively with only a few candidate entities, which improves the efficiency of the method.
A large-scale heterogeneous knowledge base alignment method based on iterative matching specifically comprises:
Data preprocessing stage: the data in any two original knowledge bases KB_1 and KB_2 are screened, their data formats unified, and meaningless characters removed; the relation set R_1 of the processed knowledge base KB'_1 and the relation set R_2 of the processed knowledge base KB'_2 are collected; and the entities are compared to obtain the initial matched entity pair set.
Knowledge base alignment stage: the knowledge bases KB'_1 and KB'_2 are partitioned using the relations in R_1 and R_2, and each block is pruned, yielding the pruned block sets B'_1 and B'_2; then the blocks in B'_1 and B'_2 are matched using the initial matched entity pair set, yielding matched block pairs; finally, candidate entity pairs are selected from the matched block pairs and confirmed using a similarity measure and the threshold δ_e.
The data preprocessing stage comprises the following steps:
(1-1) input any two original knowledge bases KB_1 and KB_2, and remove the information in KB_1 and KB_2 that is irrelevant to the alignment task;
(1-2) unify the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and personal names in a uniform form;
(1-3) remove stop words, symbol characters, and language tags from the literals L_1 in KB_1 and the literals L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2;
(1-4) collect the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2;
(1-5) compare all entities in KB'_1 and KB'_2 to obtain the initial matched entity pair set.
A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E × R × E is the set of entity-relation-entity triples, representing relation facts whose object is an entity; and F_P ⊆ E × P × L is the set of entity-attribute-literal triples, representing attribute facts whose object is a literal. Both F_R and F_P contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and this information hurts the efficiency of the algorithm. In addition, triples containing the "sameAs" relation should also be removed.
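The six-tuple can be pictured as a small data structure. The following is a minimal sketch, not the patent's implementation: the class name `KnowledgeBase`, the helper `drop_same_as`, and the choice to derive E, L, R, and P from the two fact sets are all assumptions made for illustration (an entity that appears in no fact would be lost under this choice).

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Six-tuple (E, L, R, P, F_R, F_P); E, L, R, P are derived from the facts."""
    relation_facts: set = field(default_factory=set)   # F_R: (entity, relation, entity)
    attribute_facts: set = field(default_factory=set)  # F_P: (entity, attribute, literal)

    @property
    def entities(self):    # E: every subject, plus every object of a relation fact
        subjects = {s for s, _, _ in self.relation_facts | self.attribute_facts}
        return subjects | {o for _, _, o in self.relation_facts}

    @property
    def literals(self):    # L
        return {lit for _, _, lit in self.attribute_facts}

    @property
    def relations(self):   # R
        return {r for _, r, _ in self.relation_facts}

    @property
    def attributes(self):  # P
        return {p for _, p, _ in self.attribute_facts}


def drop_same_as(kb):
    """Return a copy of kb without triples whose relation name ends in 'sameAs'
    (matching on the name suffix is an assumption of this sketch)."""
    kept = {(s, r, o) for s, r, o in kb.relation_facts
            if not r.lower().endswith("sameas")}
    return KnowledgeBase(kept, set(kb.attribute_facts))
```

Deriving the four sets on demand keeps the sketch short; a large-scale implementation would more likely store them explicitly.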
The detailed process of step (1-4) is:
For knowledge base KB'_1, traverse all (entity-relation-entity) triples in its triple set F_R1 and collect the relation set R_1; for knowledge base KB'_2, traverse all (entity-relation-entity) triples in its triple set F_R2 and collect the relation set R_2. The relation sets R_1 and R_2 are used in the subsequent knowledge base partitioning.
In step (1-5), the initial matched entity pair set is obtained as follows:
First, all entities in KB'_1 are extracted to form the entity set E_1, and all entities in KB'_2 are extracted to form the entity set E_2; the Cartesian product of E_1 and E_2 gives the entity pairs, which form the entity pair set;
Then, the entity pairs whose two entities have identical strings for their name attribute are selected from the entity pair set, giving the preliminary initial matched entity pair set;
Finally, the entity pairs in the preliminary set that stand in a one-to-one matching relation are selected as the initial matched entity pair set.
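The three sub-steps above can be sketched as follows. This is an illustrative sketch under assumptions: the function name `initial_matches` and the input shape (a dict from entity id to normalized name) are hypothetical, and instead of materializing the full Cartesian product the sketch groups entities by name, which gives the same pairs for an exact-equality test.

```python
from collections import defaultdict


def initial_matches(names1, names2):
    """names1 / names2: dict mapping entity id -> normalized name string.

    Returns the pairs with identical names that are one-to-one, i.e. the
    name occurs exactly once in each knowledge base.
    """
    by_name1, by_name2 = defaultdict(list), defaultdict(list)
    for entity, name in names1.items():
        by_name1[name].append(entity)
    for entity, name in names2.items():
        by_name2[name].append(entity)

    pairs = set()
    for name, ents1 in by_name1.items():
        ents2 = by_name2.get(name, [])
        # Keep only unambiguous matches: one entity on each side.
        if len(ents1) == 1 and len(ents2) == 1:
            pairs.add((ents1[0], ents2[0]))
    return pairs
```

Grouping by name avoids the |E_1| × |E_2| comparisons of a literal Cartesian product, which matters at the scale the method targets.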
The knowledge base alignment stage comprises the following steps:
(2-1) input the knowledge bases KB'_1 and KB'_2, the relation sets R_1 and R_2, and the initial matched entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the threshold δ_1 on the number of entities in a block, and the threshold δ_2 on the ratio of matched entities in a block; initialize the matched entity pair set M_e to the initial matched entity pair set;
(2-2) randomly select any relation from R_1 or R_2 and use it to divide the entities of KB'_1 and KB'_2 into blocks, obtaining the block set B_1 of KB'_1 and the block set B_2 of KB'_2;
(2-3) remove from B_1 and B_2 the blocks that tend to incur high computational cost or are unlikely to yield matched entity pairs, obtaining the pruned block sets B'_1 and B'_2;
(2-4) using all matched entity pairs in M_e, measure the similarity between every block in B'_1 and every block in B'_2, and match each two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matched block pair set;
(2-5) for each matched block pair in the matched block pair set, take the Cartesian product of the unmatched entities in one block with the unmatched entities in the other block as candidate entity pairs, which form the candidate entity pair set;
(2-6) check whether no new candidate entity pairs were found; if new candidate entity pairs were found, go to step (2-7); otherwise end the iteration and output the matched entity pair set M_e;
(2-7) compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate entity pairs;
(2-8) check whether the number of iterations has reached the iteration threshold; if not, return to step (2-2); if so, end the iteration and output M_e.
In step (2-2), the detailed process of dividing the entities of a knowledge base into blocks using a relation is:
First, for the triple set F_R1 of knowledge base KB'_1, count the n distinct object entities in F_R1;
Then, for each object entity, put all the subject entities that correspond to it in F_R1 together to form one block; the n object entities give n blocks, which form the block set B_1;
The block set B_2 is obtained in the same way.
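The partitioning step can be sketched as a single pass over the relation facts. The function name `partition_by_relation` is hypothetical, and restricting the pass to the triples of the chosen relation is this sketch's reading of "using the relation".

```python
from collections import defaultdict


def partition_by_relation(relation_facts, relation):
    """Group the subject entities of `relation` by their object entity.

    relation_facts: iterable of (subject, relation, object) triples (F_R).
    Returns a dict mapping each object entity to its block of subjects.
    """
    blocks = defaultdict(set)
    for subject, rel, obj in relation_facts:
        if rel == relation:
            blocks[obj].add(subject)
    return dict(blocks)
```

For example, partitioning on a `bornIn` relation puts all entities born in the same place into one block.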
In step (2-3), the blocks that tend to incur high computational cost or are unlikely to yield matched entity pairs comprise: blocks whose number of entities exceeds the threshold δ_1, blocks whose ratio of matched entities is below the threshold δ_2, and blocks whose entities have all been matched.
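The three pruning rules translate directly into a filter. In this sketch the function name `prune_blocks` and the parameter names `max_size` (for δ_1) and `min_matched_ratio` (for δ_2) are hypothetical.

```python
def prune_blocks(blocks, matched, max_size, min_matched_ratio):
    """Drop blocks that are too large, too poorly matched, or fully matched.

    blocks:  dict mapping a block key to a set of entity ids.
    matched: set of entity ids (from this knowledge base) already matched.
    """
    kept = {}
    for key, entities in blocks.items():
        if not entities:
            continue
        n_matched = len(entities & matched)
        if len(entities) > max_size:                      # exceeds delta_1
            continue
        if n_matched / len(entities) < min_matched_ratio:  # below delta_2
            continue
        if n_matched == len(entities):                    # nothing left to match
            continue
        kept[key] = entities
    return kept
```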
In step (2-4), the similarity between blocks is obtained as follows:
Each block is regarded as a set of entities, and a matched entity pair is regarded as an identical element of the two sets. Set similarity is then used to measure the similarity between blocks; the similarity sim_block(b_k, b_l) is computed as:
$$\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cup b_l|}$$
where b_k and b_l denote two blocks, |b_k ∩ b_l| denotes the number of matched entity pairs in the two blocks, and |b_k ∪ b_l| denotes the total number of entities in the two blocks.
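A minimal sketch of this block similarity follows. The function name `block_similarity` is hypothetical, and reading |b_k ∪ b_l| as a set union in which each matched pair counts once (so the union is |b_k| + |b_l| minus the number of matched pairs) is an assumption based on the statement that matched entities are treated as identical elements.

```python
def block_similarity(block1, block2, matches):
    """Jaccard-style similarity between two blocks of different knowledge bases.

    matches: set of (entity_in_kb1, entity_in_kb2) pairs already matched.
    """
    # |b_k ∩ b_l|: matched pairs with one entity in each block.
    shared = sum(1 for e1, e2 in matches if e1 in block1 and e2 in block2)
    # |b_k ∪ b_l|: all entities, counting each matched pair once (assumption).
    union = len(block1) + len(block2) - shared
    return shared / union if union > 0 else 0.0
```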
In step (2-7), the similarity between entities is obtained by the formula:
$$\mathrm{sim}(e_i, e_j) = \alpha\,\mathrm{sim}_{string}(e_i, e_j) + (1-\alpha)\,\mathrm{sim}_{block}(b_k, b_l) \quad \text{s.t. } e_i \in b_k,\ e_j \in b_l$$
where b_k and b_l denote the blocks containing entities e_i and e_j, respectively; sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the block similarity, respectively; and α is the weight of the string similarity, with values in [0, 1].
Preferably, similarity functions based on the Levenshtein distance, the Jaro-Winkler distance, q-grams, and I-SUB are used, and these similarity measures are combined by linear weighting to compute the string similarity.
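The combination of string measures and the entity similarity formula can be sketched as below. This sketch implements only two of the four named measures (Levenshtein and q-gram); Jaro-Winkler and I-SUB are omitted for brevity. The equal weights, the normalization of the Levenshtein distance by the longer string, and the function names are assumptions; the default α = 0.6 is the value used in the embodiment.

```python
def levenshtein_similarity(a, b):
    """1 - edit_distance / max_length, computed with a rolling DP row."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the sets of q-grams of the two strings."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def string_similarity(a, b, weights=(0.5, 0.5)):
    """Linear weighting of the individual string measures (weights assumed)."""
    scores = (levenshtein_similarity(a, b), qgram_similarity(a, b))
    return sum(w * s for w, s in zip(weights, scores))


def entity_similarity(name_i, name_j, block_sim, alpha=0.6):
    """sim(e_i, e_j) = alpha * sim_string + (1 - alpha) * sim_block."""
    return alpha * string_similarity(name_i, name_j) + (1 - alpha) * block_sim
```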
The present invention aligns heterogeneous knowledge bases using the idea of iterative matching. Within an iterative framework it traverses the relations and partitions the knowledge bases by them, which enlarges the search space for candidate entity pairs; at the same time, it selects and confirms candidate entity pairs in a divide-and-conquer manner, so that each entity needs to be compared comprehensively with only a few candidate entities, which improves the efficiency of the method. Compared with existing methods, its advantages are:
(1) Knowledge base alignment is treated as an iterative process. In different iterations, each relation is traversed to partition the knowledge bases, and candidate entity pairs are selected using matched block pairs, so that the alignment method does not depend on alignable relations and attributes between the knowledge bases.
(2) In each round of iteration only a small number of matched entity pairs are found, and these are used to select candidate entity pairs. Because the selection of candidate entity pairs draws on the information of more matched entity pairs, the quality of the candidate entity pairs improves.
Brief description of the drawings
Fig. 1 is a flow block diagram of the large-scale heterogeneous knowledge base alignment method based on iterative matching according to the invention;
Fig. 2 is a flow chart of the data preprocessing stage of the method;
Fig. 3 is a flow chart of the knowledge base alignment stage of the method.
Detailed description of the embodiment
To describe the present invention more specifically, its technical solution is described in detail below with reference to the accompanying drawings and an embodiment.
As shown in Fig. 1, the large-scale heterogeneous knowledge base alignment method based on iterative matching is divided into two parts: data preprocessing and knowledge base alignment. Data preprocessing: the data in the original knowledge bases are screened, the data formats are unified, and the relations in the knowledge bases and the initial matched entity pairs are obtained. Knowledge base alignment: first the preprocessed knowledge bases are partitioned using the relations in the knowledge bases and the blocks are pruned; then blocks are matched using the matched entity pairs, yielding matched block pairs; next, candidate entity pairs are selected from the matched block pairs and confirmed using a similarity measure and a threshold; finally, these steps are repeated until no new candidate entity pairs can be found, which yields all matched entity pairs.
Fig. 2 shows the flow chart of the data preprocessing stage. According to Fig. 2, the stage is divided into the following steps:
S1-1: input any two original knowledge bases KB_1 and KB_2, and remove the information in KB_1 and KB_2 that is irrelevant to the alignment task.
A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E × R × E is the set of entity-relation-entity triples, representing relation facts whose object is an entity; and F_P ⊆ E × P × L is the set of entity-attribute-literal triples, representing attribute facts whose object is a literal. Both F_R and F_P contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and this information hurts the efficiency of the algorithm. In addition, triples containing the "sameAs" relation should also be removed.
S1-2: unify the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and personal names in a uniform form.
The way literals such as names, dates, and numbers are expressed may differ between knowledge bases, for example "2016-01-01" and "01.01.2016". Unifying this information facilitates later comparison. In addition, the method converts all literals to lowercase.
S1-3: remove meaningless characters such as stop words, symbol characters, and language tags from the literals L_1 in KB_1 and the literals L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2.
The attribute descriptions of entities in a knowledge base may contain meaningless characters, for example stop words such as "the", "a", and "an"; symbols such as "#", "!", and "*"; and language tags such as "@en". These characters distort the similarity measurement of entity pairs, so they are removed.
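Steps S1-2 and S1-3 can be illustrated with a small literal normalizer. This is a sketch under assumptions: the function name `normalize_literal` is hypothetical, the stop-word list contains only the three examples given above, "01.01.2016" is read as day.month.year, and ISO-style `YYYY-MM-DD` is chosen as the unified date form.

```python
import re

STOP_WORDS = {"the", "a", "an"}           # only the examples named in the text
DATE_DMY = re.compile(r"^(\d{2})\.(\d{2})\.(\d{4})$")
DATE_ISO = re.compile(r"^\d{4}-\d{2}-\d{2}$")
LANG_TAG = re.compile(r"@[A-Za-z-]+$")    # trailing language tag such as @en


def normalize_literal(text):
    """Unify dates, lowercase, and strip language tags, symbols, stop words."""
    text = LANG_TAG.sub("", text.strip())
    match = DATE_DMY.match(text)
    if match:                              # assumed day.month.year order
        day, month, year = match.groups()
        return f"{year}-{month}-{day}"
    if DATE_ISO.match(text):               # already in the unified form
        return text
    text = re.sub(r"[^\w\s]", " ", text.lower())   # symbols become spaces
    return " ".join(t for t in text.split() if t not in STOP_WORDS)
```

Dates are handled before symbol removal so that the separators that carry their structure are not stripped.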
S1-4: collect the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2.
In this step, for knowledge base KB'_1, traverse all (entity-relation-entity) triples in its triple set F_R1 and collect the relation set R_1; for knowledge base KB'_2, traverse all (entity-relation-entity) triples in its triple set F_R2 and collect the relation set R_2. The relation sets R_1 and R_2 are used in the subsequent knowledge base partitioning.
S1-5: compare all entities in KB'_1 and KB'_2 to obtain the initial matched entity pair set.
In this step, the initial matched entity pair set is obtained as follows:
First, all entities in KB'_1 are extracted to form the entity set E_1, and all entities in KB'_2 are extracted to form the entity set E_2; the Cartesian product of E_1 and E_2 gives the entity pairs, which form the entity pair set;
Then, the entity pairs whose two entities have identical strings for their name attribute are selected from the entity pair set, giving the preliminary initial matched entity pair set;
Finally, the entity pairs in the preliminary set that stand in a one-to-one matching relation are selected as the initial matched entity pair set.
Fig. 3 shows the flow chart of the knowledge base alignment stage. According to Fig. 3, the stage is divided into the following steps:
S2-1: input the knowledge bases KB'_1 and KB'_2, the relation sets R_1 and R_2, and the initial matched entity pair set; set the block similarity threshold δ_b to 0.2, the entity similarity threshold δ_e to 0.65, the threshold δ_1 on the number of entities in a block to 50, and the threshold δ_2 on the ratio of matched entities in a block to 0.3; initialize the matched entity pair set M_e to the initial matched entity pair set.
S2-2: randomly select any relation from R_1 or R_2 and use it to divide the entities of KB'_1 and KB'_2 into blocks, obtaining the block set B_1 of KB'_1 and the block set B_2 of KB'_2.
In this step, the detailed process of dividing the entities of a knowledge base into blocks using a relation is:
First, for the triple set F_R1 of knowledge base KB'_1, count the n distinct object entities in F_R1;
Then, for each object entity, put all the subject entities that correspond to it in F_R1 together to form one block; the n object entities give n blocks, which form the block set B_1.
The block set B_2 is obtained in the same way, that is:
First, for the triple set F_R2 of knowledge base KB'_2, count the n distinct object entities in F_R2;
Then, for each object entity, put all the subject entities that correspond to it in F_R2 together to form one block; the n object entities give n blocks, which form the block set B_2.
S2-3: remove from B_1 and B_2 the blocks that tend to incur high computational cost or are unlikely to yield matched entity pairs, obtaining the pruned block sets B'_1 and B'_2.
In this step, the blocks that tend to incur high computational cost or are unlikely to yield matched entity pairs comprise: blocks whose number of entities exceeds the threshold δ_1, blocks whose ratio of matched entities is below the threshold δ_2, and blocks whose entities have all been matched.
S2-4: using all matched entity pairs in M_e, measure the similarity between every block in B'_1 and every block in B'_2, and match each two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matched block pair set.
Each block is regarded as a set of entities, and a matched entity pair is regarded as an identical element of the two sets. Set similarity is then used to measure the similarity between blocks; the similarity sim_block(b_k, b_l) is computed as:
$$\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cup b_l|}$$
where b_k and b_l denote two blocks, |b_k ∩ b_l| denotes the number of matched entity pairs in the two blocks, and |b_k ∪ b_l| denotes the total number of entities in the two blocks.
S2-5: for each matched block pair in the matched block pair set, take the Cartesian product of the unmatched entities in one block with the unmatched entities in the other block as candidate entity pairs, which form the candidate entity pair set.
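Candidate generation in S2-5 amounts to a Cartesian product restricted to unmatched entities. A minimal sketch, with the hypothetical function name `candidate_pairs`:

```python
from itertools import product


def candidate_pairs(block_pairs, matches):
    """Cross the unmatched entities of every matched block pair.

    block_pairs: iterable of (block_from_kb1, block_from_kb2) sets.
    matches:     set of (entity_in_kb1, entity_in_kb2) pairs already matched.
    """
    matched1 = {e1 for e1, _ in matches}
    matched2 = {e2 for _, e2 in matches}
    candidates = set()
    for block1, block2 in block_pairs:
        candidates.update(product(block1 - matched1, block2 - matched2))
    return candidates
```

Because step S2-3 caps the block size at δ_1, each product stays small even when the knowledge bases are large.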
S2-6: check whether no new candidate entity pairs were found; if new candidate entity pairs were found, go to S2-7; otherwise end the iteration and output the matched entity pair set M_e.
S2-7: compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate entity pairs.
In this step, the similarity between entities is measured in two ways, by string similarity and by block similarity, and the two similarities are combined with a weight according to the formula:
$$\mathrm{sim}(e_i, e_j) = \alpha\,\mathrm{sim}_{string}(e_i, e_j) + (1-\alpha)\,\mathrm{sim}_{block}(b_k, b_l) \quad \text{s.t. } e_i \in b_k,\ e_j \in b_l$$
where sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the block similarity, respectively; b_k and b_l denote the blocks containing entities e_i and e_j, respectively; and α is the weight of the string similarity, set to 0.6. For the attribute pairs shared by entities e_i and e_j (for example, name), the string similarity measures the similarity of the attribute values. The method uses several similarity functions, such as those based on the Levenshtein distance, the Jaro-Winkler distance, q-grams, and I-SUB, and combines them by linear weighting. The block similarity expresses the similarity of two entities through the similarity between the blocks that contain them. Once the similarity between two entities is obtained, the threshold δ_e is used to decide whether the pair matches, and newly found matched entity pairs are added to the set of all matched entity pairs.
S2-8: check whether the number of iterations has reached the iteration threshold; if not, return to S2-2; if so, end the iteration and output the matched entity pair set M_e.
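Steps S2-1 to S2-8 can be tied together in a compact driver loop. The following is a simplified sketch, not the patented implementation, and it rests on several assumptions: each round draws one relation per knowledge base independently; `difflib.SequenceMatcher` stands in for the weighted string measures; one-to-one matching is not enforced among newly confirmed pairs; both fact sets are non-empty; and all function and parameter names are hypothetical. The default thresholds are the values given in S2-1.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def partition(facts, relation):
    """S2-2: group the subjects of `relation` by object entity."""
    blocks = defaultdict(set)
    for s, r, o in facts:
        if r == relation:
            blocks[o].add(s)
    return list(blocks.values())


def prune(blocks, matched, d1, d2):
    """S2-3: drop oversized, poorly matched, and fully matched blocks."""
    return [b for b in blocks
            if len(b) <= d1 and d2 <= len(b & matched) / len(b) < 1.0]


def block_sim(b1, b2, matches):
    """Matched pairs across the blocks over the size of their union."""
    shared = sum(1 for e1, e2 in matches if e1 in b1 and e2 in b2)
    union = len(b1) + len(b2) - shared
    return shared / union if union > 0 else 0.0


def align(facts1, facts2, names1, names2, seed, d_b=0.2, d_e=0.65,
          d1=50, d2=0.3, alpha=0.6, max_iter=20, rng=None):
    rng = rng or random.Random(0)
    rels1 = sorted({r for _, r, _ in facts1})
    rels2 = sorted({r for _, r, _ in facts2})
    matches = set(seed)                                    # S2-1
    for _ in range(max_iter):                              # S2-8
        m1 = {a for a, _ in matches}
        m2 = {b for _, b in matches}
        blocks1 = prune(partition(facts1, rng.choice(rels1)), m1, d1, d2)
        blocks2 = prune(partition(facts2, rng.choice(rels2)), m2, d1, d2)
        found_candidates = False
        confirmed = set()
        for b1, b2 in product(blocks1, blocks2):           # S2-4
            bs = block_sim(b1, b2, matches)
            if bs <= d_b:
                continue
            for e1, e2 in product(b1 - m1, b2 - m2):       # S2-5
                found_candidates = True
                s = SequenceMatcher(None, names1[e1], names2[e2]).ratio()
                if alpha * s + (1 - alpha) * bs > d_e:     # S2-7
                    confirmed.add((e1, e2))
        if not found_candidates:                           # S2-6
            break
        matches |= confirmed
    return matches
```

With a random draw per round, a single unproductive relation ends the loop early under S2-6; an implementation that cycles through all relations before stopping would be a natural variation.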
The embodiment above describes the technical solution and beneficial effects of the present invention in detail. It should be understood that the foregoing is only the preferred embodiment of the present invention and is not intended to limit it; any modification, supplement, or equivalent substitution made within the scope of the principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A large-scale heterogeneous knowledge base alignment method based on iterative matching, specifically comprising:
Data preprocessing stage: the data in any two original knowledge bases KB_1 and KB_2 are screened, their data formats unified, and meaningless characters removed; the relation set R_1 of the processed knowledge base KB'_1 and the relation set R_2 of the processed knowledge base KB'_2 are collected; and the entities are compared to obtain the initial matched entity pair set;
Knowledge base alignment stage: the knowledge bases KB'_1 and KB'_2 are partitioned using the relations in R_1 and R_2, and each block is pruned, yielding the pruned block sets B'_1 and B'_2; then the blocks in B'_1 and B'_2 are matched using the initial matched entity pair set, yielding matched block pairs; finally, candidate entity pairs are selected from the matched block pairs and confirmed using a similarity measure and the threshold δ_e.
2. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 1, characterized in that the data preprocessing stage comprises the following steps:
(1-1) input any two original knowledge bases KB_1 and KB_2, and remove the information in KB_1 and KB_2 that is irrelevant to the alignment task;
(1-2) unify the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and personal names in a uniform form;
(1-3) remove stop words, symbol characters, and language tags from the literals L_1 in KB_1 and the literals L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2;
(1-4) collect the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2;
(1-5) compare all entities in KB'_1 and KB'_2 to obtain the initial matched entity pair set.
3. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 2, characterized in that the detailed process of step (1-4) is:
For knowledge base KB'_1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R_1; for knowledge base KB'_2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R_2.
4. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 2, characterized in that, in step (1-5), the initial matching entity pair set is obtained as follows:
First, extract all entities in knowledge base KB′1 to form the entity set E1, and extract all entities in knowledge base KB′2 to form the entity set E2; take the Cartesian product of the entities in E1 and the entities in E2 as entity pairs, forming the entity pair set;
Then, select from the entity pair set the entity pairs whose two entities have name attributes with identical string representations, obtaining the pre-initial matching entity pair set;
Finally, select from the pre-initial matching entity pair set the entity pairs that stand in a one-to-one matching relationship, as the initial matching entity pair set.
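A possible sketch of this initial matching step is given below. It indexes the entities of KB′2 by name instead of enumerating the full Cartesian product; this is a substitution made here for efficiency that yields the same pairs, not something the claim prescribes. The dictionaries mapping entity identifiers to name strings are an assumed input format.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def initial_matches(names1: Dict[str, str],
                    names2: Dict[str, str]) -> Set[Tuple[str, str]]:
    """names1 / names2 map entity id -> name string for KB'1 and KB'2."""
    # Index KB'2 by name so the Cartesian product is never materialized.
    by_name: Dict[str, List[str]] = {}
    for e2, name in names2.items():
        by_name.setdefault(name, []).append(e2)

    # Pre-initial set: every cross-KB pair whose name strings are identical.
    pre = [(e1, e2)
           for e1, name in names1.items()
           for e2 in by_name.get(name, [])]

    # Keep only pairs whose two entities each occur in exactly one pair.
    left = Counter(e1 for e1, _ in pre)
    right = Counter(e2 for _, e2 in pre)
    return {(e1, e2) for e1, e2 in pre if left[e1] == 1 and right[e2] == 1}
```

For example, if two entities in KB′2 share the name of one entity in KB′1, both resulting pairs are discarded, because neither is one-to-one.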
5. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 1, characterized in that the specific steps of the knowledge base alignment stage are:
(2-1) input knowledge base KB′1, knowledge base KB′2, relation set R1, relation set R2, and the initial matching entity pair set; set the block similarity threshold δb, the entity similarity threshold δe, the threshold δ1 on the number of entities in a block, and the threshold δ2 on the ratio of matched entities in a block; initialize the matching entity pair set Me to the initial matching entity pair set;
(2-2) randomly select any relation from relation set R1 or relation set R2, and use that relation to divide the entities in knowledge base KB′1 and knowledge base KB′2 into several blocks, obtaining the block set B1 corresponding to KB′1 and the block set B2 corresponding to KB′2;
(2-3) remove from block set B1 and block set B2 the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs, obtaining the reduced block set B′1 and the reduced block set B′2;
(2-4) using all matching entity pairs in the matching entity pair set Me, measure the similarity between any block in B′1 and any block in B′2, and match every two blocks whose similarity value exceeds the block similarity threshold δb, obtaining the matched block pair set;
(2-5) for any matched block pair in the matched block pair set, take the Cartesian product of the unmatched entities in one block of the pair with the unmatched entities in the other block of the pair as candidate entity pairs, forming the candidate entity pair set;
(2-6) judge whether no new candidate entity pair has been found; if not, jump to step (2-7); if so, end the iteration and output the matching entity pair set Me;
(2-7) compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity value exceeds the entity similarity threshold δe to the matching entity pair set Me, and discard the remaining candidate entity pairs;
(2-8) judge whether the number of iterations is less than the iteration threshold; if not, jump to step (2-2); if so, end the iteration and output the matching entity pair set Me.
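The control flow of steps (2-1) to (2-8) can be sketched roughly as follows. Several simplifications are assumptions of this sketch rather than parts of the claim: relation names are taken to be shared by both knowledge bases, the pruning of step (2-3) is omitted, `difflib.SequenceMatcher` stands in for the string similarity, the block similarity uses the simplified form 2·matched/(|bk|+|bl|), and the hypothetical `max_rounds` parameter is treated as an upper bound on the number of rounds.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def blocks_for(triples, relation):
    """Group subject entities by object entity for one relation, step (2-2)."""
    blocks = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            blocks[obj].add(subj)
    return list(blocks.values())


def block_sim(bk, bl, matches):
    """Share of the two blocks covered by already matched pairs (assumed form)."""
    inter = sum(1 for e1, e2 in matches if e1 in bk and e2 in bl)
    total = len(bk) + len(bl)
    return 2.0 * inter / total if total else 0.0


def align(triples1, triples2, names1, names2, seed_matches,
          delta_b=0.3, delta_e=0.8, alpha=0.7, max_rounds=10, rng=None):
    rng = rng or random.Random(0)
    matches = set(seed_matches)                                # (2-1)
    relations = sorted({r for _, r, _ in triples1} |
                       {r for _, r, _ in triples2})
    if not relations:
        return matches
    seen = set()  # candidate pairs already scored in an earlier round
    for _ in range(max_rounds):                                # cap, cf. (2-8)
        relation = rng.choice(relations)                       # (2-2)
        blocks1 = blocks_for(triples1, relation)
        blocks2 = blocks_for(triples2, relation)
        done1 = {e1 for e1, _ in matches}
        done2 = {e2 for _, e2 in matches}
        new_candidates, accepted = 0, set()
        for bk, bl in product(blocks1, blocks2):               # (2-4)
            sb = block_sim(bk, bl, matches)
            if sb <= delta_b:
                continue
            for pair in product(bk - done1, bl - done2):       # (2-5)
                if pair in seen:
                    continue
                seen.add(pair)
                new_candidates += 1
                e1, e2 = pair
                ss = SequenceMatcher(None, names1[e1], names2[e2]).ratio()
                if alpha * ss + (1 - alpha) * sb > delta_e:    # (2-7)
                    accepted.add(pair)
        if new_candidates == 0:                                # (2-6)
            break
        matches |= accepted
    return matches
```

Note that this sketch does not enforce a one-to-one constraint on the pairs accepted within a single round; a fuller implementation would probably need a tie-breaking rule for that case.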
6. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-2), the detailed process of using the relation to divide the entities in a knowledge base into several blocks is:
First, for the triple set FR1 in knowledge base KB′1, obtain by statistics the n kinds of object entities in FR1;
Then, for each kind of object entity, put together all subject entities in FR1 that correspond to it, obtaining one block; the n kinds of object entities yield n blocks, which form the block set B1;
The block set B2 is obtained by the same method.
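The partitioning described in this claim is a group-by on the object entity. The sketch below assumes that FR1 denotes the triples of KB′1 that use the selected relation and that triples are `(subject, relation, object)` tuples; both are readings adopted for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject entity, relation, object entity)


def build_blocks(triples: Iterable[Triple], relation: str) -> Dict[str, Set[str]]:
    """Return one block of subject entities per distinct object entity."""
    blocks: Dict[str, Set[str]] = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:          # keep only triples of the chosen relation
            blocks[obj].add(subj)    # subjects sharing an object share a block
    return dict(blocks)
```

With n distinct object entities for the chosen relation, the returned dictionary has n entries, which corresponds to the n blocks of the block set.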
7. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-3), the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs include: blocks whose number of entities exceeds the threshold δ1, blocks whose ratio of matched entities is below the threshold δ2, and blocks whose entities have all been matched.
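The three pruning rules can be written as a single filter. In this sketch the parameter names `max_size` and `min_ratio` stand for δ1 and δ2, and `matched` is assumed to be the set of entities of the same knowledge base that already appear in a matching entity pair.

```python
from typing import Dict, Set


def prune_blocks(blocks: Dict[str, Set[str]], matched: Set[str],
                 max_size: int, min_ratio: float) -> Dict[str, Set[str]]:
    """Drop blocks that are too large, too sparsely matched, or exhausted."""
    kept = {}
    for key, entities in blocks.items():
        if not entities:
            continue                      # nothing to match in an empty block
        n_matched = len(entities & matched)
        if len(entities) > max_size:
            continue                      # rule 1: more than delta_1 entities
        if n_matched / len(entities) < min_ratio:
            continue                      # rule 2: matched ratio below delta_2
        if n_matched == len(entities):
            continue                      # rule 3: every entity already matched
        kept[key] = entities
    return kept
```

The first rule bounds the size of the Cartesian products formed later, while the other two remove blocks that offer little evidence or no remaining work.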
8. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-4), the similarity between blocks is obtained as follows:
Each block is regarded as a set of entities, the matched entity pairs are regarded as the identical elements of the two sets, and set similarity is used to measure the similarity between blocks; the similarity simblock(bk, bl) is calculated as:
$$\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cap b_l| + 0.5\,\bigl(|b_k \cup b_l| - |b_k \cap b_l|\bigr)}$$
where bk and bl denote two blocks, |bk∩bl| denotes the number of matching entity pairs in the two blocks, and |bk∪bl| denotes the total number of entities in the two blocks.
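The block similarity can be computed directly from the matched pairs. The sketch below makes one assumption about the claim's wording: since each matched pair is treated as a single shared element, |bk∪bl| is taken to be |bk| + |bl| minus the number of matched pairs. Under that reading the formula reduces to the Dice coefficient 2|bk∩bl| / (|bk| + |bl|).

```python
from typing import Iterable, Set, Tuple


def block_similarity(block_k: Set[str], block_l: Set[str],
                     matches: Iterable[Tuple[str, str]]) -> float:
    """Set similarity of two blocks, with matched pairs as shared elements."""
    # |bk ∩ bl|: matched pairs with one entity in each block.
    inter = sum(1 for e1, e2 in matches if e1 in block_k and e2 in block_l)
    # |bk ∪ bl|: all entities, counting each matched pair once (assumption).
    union = len(block_k) + len(block_l) - inter
    denom = inter + 0.5 * (union - inter)
    return inter / denom if denom else 0.0
```

For blocks of sizes 3 and 2 that share 2 matched pairs, the union has 3 elements and the similarity is 2 / (2 + 0.5 · 1) = 0.8.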
9. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 8, characterized in that, in step (2-7), the similarity between entities is obtained by the formula:
$$\mathrm{sim}(e_i, e_j) = \alpha\,\mathrm{sim}_{string}(e_i, e_j) + (1-\alpha)\,\mathrm{sim}_{block}(b_k, b_l) \quad \text{s.t.}\ e_i \in b_k,\ e_j \in b_l$$
where bk and bl denote the blocks containing entities ei and ej respectively, simstring(ei, ej) and simblock(bk, bl) denote the string similarity between the entities and the block similarity respectively, and α is the weight of the string similarity, with a value range of [0, 1].
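The weighted combination in this claim is a one-line computation. The function and parameter names below are illustrative, and the default weight of 0.5 is an arbitrary choice, since the claim only constrains α to the range [0, 1].

```python
def entity_similarity(sim_string: float, sim_block: float,
                      alpha: float = 0.5) -> float:
    """Linear mix of string similarity and block similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_string + (1.0 - alpha) * sim_block


def accept(sim_string: float, sim_block: float,
           alpha: float, delta_e: float) -> bool:
    """Step (2-7): keep a candidate pair only above the entity threshold."""
    return entity_similarity(sim_string, sim_block, alpha) > delta_e
```

With α = 0.6, a string similarity of 0.9 and a block similarity of 0.5 give 0.6 · 0.9 + 0.4 · 0.5 = 0.74, so the pair would be accepted for δe = 0.7 and rejected for δe = 0.8.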
10. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 9, characterized in that similarity functions based on the Levenshtein distance, the Jaro-Winkler distance, q-grams, and I-SUB are used, and these similarity measurement functions are combined by linear weighting to compute the string similarity.
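A reduced sketch of such a linear combination is shown below. It implements only two of the four named measures, a normalized Levenshtein similarity and a q-gram similarity (here the Jaccard overlap of character bigrams, one of several possible q-gram definitions); Jaro-Winkler and I-SUB are left out for brevity, and the equal weights are an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming over two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Map the edit distance to [0, 1] by dividing by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgram_sim(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the sets of character q-grams."""
    ga = {a[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not ga and not gb:
        return 1.0 if a == b else 0.0   # strings shorter than q
    return len(ga & gb) / len(ga | gb)


def string_similarity(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Linear weighting of the individual measures (weights are assumed)."""
    w_lev, w_qgram = weights
    return w_lev * levenshtein_sim(a, b) + w_qgram * qgram_sim(a, b)
```

Adding further measures would only extend the weight tuple and the final sum, provided the weights still add up to 1 so that the result stays in [0, 1].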
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710237034.6A CN107145523B (en) | 2017-04-12 | 2017-04-12 | Large-scale heterogeneous knowledge base alignment method based on iterative matching |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710237034.6A CN107145523B (en) | 2017-04-12 | 2017-04-12 | Large-scale heterogeneous knowledge base alignment method based on iterative matching |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107145523A (en) | 2017-09-08 |
CN107145523B (en) | 2019-10-18 |
Family
ID=59774786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710237034.6A Expired - Fee Related CN107145523B (en) | Large-scale heterogeneous knowledge base alignment method based on iterative matching | 2017-04-12 | 2017-04-12 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107145523B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108460021A (en) * | 2018-03-16 | 2018-08-28 | 安徽大学 | A kind of method for extracting the problems in Article Titles method pair |
CN109492114A (en) * | 2018-11-16 | 2019-03-19 | 南京茂毓通软件科技有限公司 | A kind of entity information recognition methods |
CN109739939A (en) * | 2018-12-29 | 2019-05-10 | 颖投信息科技(上海)有限公司 | The data fusion method and device of knowledge mapping |
CN110413704A (en) * | 2019-06-27 | 2019-11-05 | 浙江大学 | Entity alignment schemes based on weighting neighbor information coding |
CN111191045A (en) * | 2019-12-30 | 2020-05-22 | 创新奇智(上海)科技有限公司 | Entity alignment method and system applied to knowledge graph |
CN112699667A (en) * | 2020-12-29 | 2021-04-23 | 京东数字科技控股股份有限公司 | Entity similarity determination method, device, equipment and storage medium |
CN112784609A (en) * | 2021-03-16 | 2021-05-11 | 云知声智能科技股份有限公司 | Method, apparatus, device and medium for determining whether medical record includes consultation opinions |
CN113609304A (en) * | 2021-07-20 | 2021-11-05 | 广州大学 | Entity matching method and device |
CN114417810A (en) * | 2021-12-29 | 2022-04-29 | 东方财富信息股份有限公司 | SimBlock algorithm for realizing high-quality text similarity calculation and realization method |
CN114417810B (en) * | 2021-12-29 | 2024-07-09 | 东方财富信息股份有限公司 | SimBlock algorithm for realizing high-quality text similarity calculation and realization method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679492A (en) * | 2013-11-29 | 2015-06-03 | 国际商业机器公司 | Computer-implemented technical support providing device and method |
US20150235130A1 (en) * | 2014-02-19 | 2015-08-20 | International Business Machines Corporation | NLP Duration and Duration Range Comparison Methodology Using Similarity Weighting |
CN104899242A (en) * | 2015-03-10 | 2015-09-09 | 四川大学 | Mechanical product design two-dimensional knowledge pushing method based on design intent |
Non-Patent Citations (2)
Title |
---|
MOHAMMAD REZA NAKHAI ET AL: "Interference Alignment with Cyclic Unidirectional", 《2012 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC)》 * |
黄峻福等: "中文异构百科知识库实体对齐", 《计算机应用》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108460021A (en) * | 2018-03-16 | 2018-08-28 | 安徽大学 | A kind of method for extracting the problems in Article Titles method pair |
CN108460021B (en) * | 2018-03-16 | 2021-10-12 | 安徽大学 | Method for extracting problem method pairs in thesis title |
CN109492114A (en) * | 2018-11-16 | 2019-03-19 | 南京茂毓通软件科技有限公司 | A kind of entity information recognition methods |
CN109739939A (en) * | 2018-12-29 | 2019-05-10 | 颖投信息科技(上海)有限公司 | The data fusion method and device of knowledge mapping |
CN110413704B (en) * | 2019-06-27 | 2022-05-03 | 浙江大学 | Entity alignment method based on weighted neighbor information coding |
CN110413704A (en) * | 2019-06-27 | 2019-11-05 | 浙江大学 | Entity alignment schemes based on weighting neighbor information coding |
CN111191045A (en) * | 2019-12-30 | 2020-05-22 | 创新奇智(上海)科技有限公司 | Entity alignment method and system applied to knowledge graph |
CN111191045B (en) * | 2019-12-30 | 2023-06-16 | 创新奇智(上海)科技有限公司 | Entity alignment method and system applied to knowledge graph |
CN112699667A (en) * | 2020-12-29 | 2021-04-23 | 京东数字科技控股股份有限公司 | Entity similarity determination method, device, equipment and storage medium |
CN112699667B (en) * | 2020-12-29 | 2024-05-21 | 京东科技控股股份有限公司 | Entity similarity determination method, device, equipment and storage medium |
CN112784609A (en) * | 2021-03-16 | 2021-05-11 | 云知声智能科技股份有限公司 | Method, apparatus, device and medium for determining whether medical record includes consultation opinions |
CN113609304B (en) * | 2021-07-20 | 2023-05-23 | 广州大学 | Entity matching method and device |
CN113609304A (en) * | 2021-07-20 | 2021-11-05 | 广州大学 | Entity matching method and device |
CN114417810A (en) * | 2021-12-29 | 2022-04-29 | 东方财富信息股份有限公司 | SimBlock algorithm for realizing high-quality text similarity calculation and realization method |
CN114417810B (en) * | 2021-12-29 | 2024-07-09 | 东方财富信息股份有限公司 | SimBlock algorithm for realizing high-quality text similarity calculation and realization method |
Also Published As
Publication number | Publication date |
---|---|
CN107145523B (en) | 2019-10-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107145523B (en) | Large-scale heterogeneous knowledge base alignment method based on iterative matching | |
CN108573411B (en) | Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments | |
Pezzoni et al. | How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation | |
US11036685B2 (en) | System and method for compressing data in a database | |
CN103823888B (en) | Node-closeness-based social network site friend recommendation method | |
CN104778256B (en) | A kind of the quick of field question answering system consulting can increment clustering method | |
CN106250513A (en) | A kind of event personalization sorting technique based on event modeling and system | |
CN105843796A (en) | Microblog emotional tendency analysis method and device | |
CN102955833A (en) | Correspondence address identifying and standardizing method | |
Prokić et al. | Recognising groups among dialects | |
CN107239512A (en) | The microblogging comment spam recognition methods of relational network figure is commented in a kind of combination | |
CN105843799A (en) | Academic paper label recommendation method based on multi-source heterogeneous information graph model | |
CN106991090A (en) | The analysis method and device of public sentiment event entity | |
Vieira et al. | Performance evaluation of modularity based community detection algorithms in large scale networks | |
Bi et al. | MM-GNN: Mix-moment graph neural network towards modeling neighborhood feature distribution | |
CN109885797B (en) | Relational network construction method based on multi-identity space mapping | |
CN106651461A (en) | Film personalized recommendation method based on gray theory | |
Wang et al. | An improved clustering method for detection system of public security events based on genetic algorithm and semisupervised learning | |
CN105320715A (en) | Body based semantic query method | |
CN102637202A (en) | Method for automatically acquiring iterative conception attribute name and system | |
Sharma et al. | Analysis of clustering algorithms in biological networks | |
CN107203609A (en) | The method and mobile terminal of a kind of fast search mobile terminal SMS | |
Pola et al. | Similarity sets: A new concept of sets to seamlessly handle similarity in database management systems | |
CN107391490A (en) | A kind of intelligent semantic analysis and text mining method | |
Liu et al. | Social Network Community‐Discovery Algorithm Based on a Balance Factor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20191018 |