CN107145523A - Large-scale heterogeneous knowledge base alignment method based on iterative matching - Google Patents

Large-scale heterogeneous knowledge base alignment method based on iterative matching

Info

Publication number
CN107145523A
Authority
CN
China
Prior art keywords
entity
block
matching
knowledge base
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710237034.6A
Other languages
Chinese (zh)
Other versions
CN107145523B (en)
Inventor
陈岭
顾伟东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201710237034.6A priority Critical patent/CN107145523B/en
Publication of CN107145523A publication Critical patent/CN107145523A/en
Application granted granted Critical
Publication of CN107145523B publication Critical patent/CN107145523B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a large-scale heterogeneous knowledge base alignment method based on iterative matching, implemented as follows: 1) screen the data in the original knowledge bases and unify the data format, then obtain the relations in each knowledge base and the initial matching entity pairs; 2) partition the preprocessed knowledge bases using the relations in the knowledge bases, and prune the resulting blocks; 3) match blocks using the matching entity pairs to obtain matching block pairs; 4) select candidate entity pairs within the matching block pairs, and confirm them using a similarity measure and a threshold; 5) repeat the above steps until no new candidate entity pairs can be found, yielding all matching entity pairs. By aligning heterogeneous knowledge bases through iterative matching, the invention has broad application prospects in fields such as knowledge base alignment, data fusion, and automatic question answering.

Description

Large-scale heterogeneous knowledge base alignment method based on iterative matching
Technical field
The present invention relates to the field of knowledge base alignment, and in particular to a large-scale heterogeneous knowledge base alignment method based on iterative matching.
Background art
With the arrival of Web 3.0, structured knowledge bases appear on the Internet more and more frequently. These knowledge bases are widely used in semantic applications such as automatic question answering, search services, and social services. However, the limited information in any single knowledge base restricts the functionality of these applications. Against this background, knowledge base alignment has great room for development. Knowledge base alignment usually refers to entity alignment between knowledge bases, that is, automatically finding two entities that represent the same real-world thing and linking them.
Because knowledge bases keep growing in scale, knowledge base alignment methods generally divide the alignment process into two steps: discovering candidate entity pairs and confirming candidate entity pairs. Discovery typically uses a small number of attributes to quickly screen out a few candidate entities for each entity; confirmation compares the two entities comprehensively and uses a similarity value and a threshold to decide whether they match. Because exact comparison between every pair of entities is avoided, this approach greatly improves overall efficiency. At present, the bottleneck of knowledge base alignment methods is that the discovered candidate entity pairs are often incomplete, so that entity pairs which could be matched go undiscovered.
To improve the quality of candidate entity pairs, researchers have proposed iterative matching: each round discovers a small number of matching entity pairs, which then serve as the basis for discovering candidate entity pairs in the next round. However, traditional knowledge base alignment methods generally focus on aligning homogeneous knowledge bases, that is, two knowledge bases that share many alignable relations. Their basic assumption is that if a pair of entities matches and the two entities have an aligned relation, then their "compatible neighbors" match with high probability, so the "compatible neighbors" are taken as candidate entity pairs. When the knowledge bases share few alignable relations, however, such methods miss some candidate entity pairs. To address this problem, researchers have proposed class-based knowledge base alignment methods. Such a method groups instances with the same features into the same class and excludes candidate entities whose content is unrelated to the class, thereby confirming candidate entity pairs. However, because the method obtains candidate entity pairs only in the initial stage of the model, through a classical partitioning technique, it also misses many candidate entity pairs when the two knowledge bases share few alignable attributes.
Summary of the invention
In view of the above, the present invention proposes a large-scale heterogeneous knowledge base alignment method based on iterative matching. The method aligns knowledge bases by iterative matching: within an iterative framework it traverses the relations to partition the knowledge bases, which enlarges the search space for candidate entity pairs; at the same time, it selects and confirms candidate entity pairs in a divide-and-conquer manner, so that each entity only needs to be compared comprehensively with a few candidate entities, which improves the efficiency of the method.
A large-scale heterogeneous knowledge base alignment method based on iterative matching specifically comprises:
Data preprocessing stage: screen the data in any two original knowledge bases KB_1 and KB_2, unify the data format, and remove meaningless characters; then collect the relation set R_1 of the processed knowledge base KB'_1 and the relation set R_2 of the processed knowledge base KB'_2, and obtain the initial matching entity pair set by comparison.
Knowledge base alignment stage: partition KB'_1 and KB'_2 using the relations in R_1 and R_2, and prune each block to obtain the simplified block sets B'_1 and B'_2; then use the initial matching entity pair set to match the blocks in B'_1 and B'_2, obtaining matching block pairs; finally, select candidate entity pairs within the matching block pairs, and confirm them using a similarity measure and the threshold δ_e.
The specific steps of the data preprocessing stage are:
(1-1) Input any two original knowledge bases KB_1 and KB_2, and remove from them the information that is irrelevant to the alignment task;
(1-2) Unify the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and names in a unified form;
(1-3) Remove stop words, symbol characters, and language tags from the literals L_1 in KB_1 and the literals L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2;
(1-4) Collect the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2;
(1-5) Compare all entities in KB'_1 and KB'_2 to obtain the initial matching entity pair set.
A knowledge base is defined as a 6-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E × R × E is the set of entity-relation-entity triples, that is, relation facts whose object is an entity; and F_P ⊆ E × P × L is the set of entity-attribute-literal triples, that is, attribute facts whose object is a literal. Both F_R and F_P contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and such information reduces the efficiency of the algorithm. In addition, triples containing the "sameAs" relation should also be removed.
The specific process of step (1-4) is:
For knowledge base KB'_1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R_1; for knowledge base KB'_2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R_2. The relation sets R_1 and R_2 are used in the subsequent knowledge base partitioning.
In step (1-5), the initial matching entity pair set is obtained as follows:
First, extract all entities in KB'_1 to form the entity set E_1, and all entities in KB'_2 to form the entity set E_2; take the Cartesian product of E_1 and E_2, so that every entity of E_1 is paired with every entity of E_2, to form an entity pair set;
Then, screen this set for entity pairs whose two name attributes have identical string representations, obtaining the pre-initial matching entity pair set;
Finally, screen the pre-initial matching entity pair set for entity pairs that stand in a one-to-one matching relationship, and take them as the initial matching entity pair set.
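The seed-matching procedure of step (1-5) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `initial_matches` and the dictionary inputs are hypothetical, and the sketch indexes entities by name instead of materializing the full Cartesian product, which should yield the same exact-name matches at lower cost.

```python
from collections import defaultdict


def initial_matches(names1, names2):
    """Return one-to-one pairs of entities whose name strings are identical.

    names1 / names2 map an entity id to its (already normalized) name.
    Both the function name and the input layout are illustrative assumptions.
    """
    # Index each knowledge base by name instead of building E1 x E2.
    by_name1 = defaultdict(list)
    for entity, name in names1.items():
        by_name1[name].append(entity)
    by_name2 = defaultdict(list)
    for entity, name in names2.items():
        by_name2[name].append(entity)

    matches = set()
    for name, ents1 in by_name1.items():
        ents2 = by_name2.get(name, [])
        # Keep a pair only if the name is unambiguous on both sides,
        # which is exactly the one-to-one condition for exact matches.
        if len(ents1) == 1 and len(ents2) == 1:
            matches.add((ents1[0], ents2[0]))
    return matches
```

A name shared by several entities on either side produces several pairs in the pre-initial set, so dropping such names is equivalent to the one-to-one filter described above.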
The specific steps of the knowledge base alignment stage are:
(2-1) Input the knowledge bases KB'_1 and KB'_2, the relation sets R_1 and R_2, and the initial matching entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the threshold δ_1 on the number of entities in a block, and the threshold δ_2 on the ratio of matched entities in a block; initialize the matching entity pair set M_e to the initial matching entity pair set;
(2-2) Randomly select a relation from R_1 or R_2, and use it to divide the entities of KB'_1 and KB'_2 into blocks, obtaining the block set B_1 of KB'_1 and the block set B_2 of KB'_2;
(2-3) Remove from B_1 and B_2 the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs, obtaining the simplified block sets B'_1 and B'_2;
(2-4) Use all matching entity pairs in M_e to measure the similarity between every block in B'_1 and every block in B'_2, and match two blocks whenever their similarity exceeds the block similarity threshold δ_b, obtaining the matching block pair set;
(2-5) For every matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set;
(2-6) Determine whether new candidate entity pairs have been found; if so, go to step (2-7); if not, end the iteration and output the matching entity pair set M_e;
(2-7) Compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set; add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate entity pairs;
(2-8) Determine whether the number of iterations is less than the iteration threshold; if so, return to step (2-2); otherwise, end the iteration and output the matching entity pair set M_e.
In step (2-2), the entities of a knowledge base are divided into blocks using the relation as follows:
First, for the triple set F_R1 of knowledge base KB'_1, collect the n distinct object entities that occur in F_R1;
Then, for each object entity, group together all subject entities in F_R1 that correspond to it, obtaining one block; the n object entities yield n blocks, which form the block set B_1;
The block set B_2 is obtained in the same way.
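The partitioning described above amounts to grouping subject entities by the object they point to under the selected relation. A minimal sketch, assuming triples are stored as (subject, relation, object) tuples and that only triples of the selected relation are considered; the function name `partition_by_relation` is hypothetical:

```python
from collections import defaultdict


def partition_by_relation(triples, relation):
    """Group subject entities by the object they reach through `relation`.

    triples: iterable of (subject, relation, object) relation facts.
    Returns a dict mapping each object entity to its block (a set of subjects).
    """
    blocks = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            blocks[obj].add(subj)
    return dict(blocks)
```

Each distinct object entity becomes the key of one block, so n distinct objects give n blocks, as in the description.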
In step (2-3), the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs comprise: blocks whose number of entities exceeds the threshold δ_1, blocks whose ratio of matched entities is below the threshold δ_2, and blocks whose entities have all been matched.
In step (2-4), the similarity between blocks is obtained as follows:
Each block is regarded as a set of entities, and two matched entities are regarded as the same element of the two sets, so that set similarity can be used to measure the similarity between blocks. The similarity sim_block(b_k, b_l) is computed as:
$$\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cup b_l|}$$
where b_k and b_l denote two blocks, |b_k ∩ b_l| denotes the number of matching entity pairs in the two blocks, and |b_k ∪ b_l| denotes the total number of entities in the two blocks.
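As an illustration of this set-based block similarity, the sketch below treats each matched pair as one shared element. The function name `block_similarity` is hypothetical, and counting a matched pair only once in the union is an assumption; the text only says that the denominator is the total number of entities in the two blocks.

```python
def block_similarity(block1, block2, matches):
    """Jaccard-style similarity between a block of KB1 and a block of KB2.

    matches: set of (entity_in_kb1, entity_in_kb2) pairs already matched.
    """
    # "Intersection": matched pairs with one entity in each block.
    shared = sum(1 for e1, e2 in matches if e1 in block1 and e2 in block2)
    # "Union": all entities of both blocks, with each matched pair counted
    # once (an assumption of this sketch).
    total = len(block1) + len(block2) - shared
    return shared / total if total else 0.0
```

For example, two blocks of sizes 3 and 2 that share two matched pairs have 3 + 2 - 2 = 3 distinct elements, giving a similarity of 2/3.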
In step (2-7), the similarity between entities is computed as:
$$\mathrm{sim}(e_i, e_j) = \alpha\,\mathrm{sim}_{string}(e_i, e_j) + (1-\alpha)\,\mathrm{sim}_{block}(b_k, b_l) \quad \text{s.t. } e_i \in b_k,\ e_j \in b_l$$
where b_k and b_l denote the blocks containing the entities e_i and e_j, respectively; sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the block similarity, respectively; and α is the weight of the string similarity, with values in the range [0, 1].
Preferably, the string similarity is computed with similarity functions based on the Levenshtein distance, the Jaro-Winkler distance, q-grams, and I-SUB, combined by linear weighting.
The present invention aligns heterogeneous knowledge bases by iterative matching: within an iterative framework it traverses the relations to partition the knowledge bases, which enlarges the search space for candidate entity pairs; at the same time, it selects and confirms candidate entity pairs in a divide-and-conquer manner, so that each entity only needs to be compared comprehensively with a few candidate entities, which improves the efficiency of the method. Compared with existing methods, its advantages are:
(1) Knowledge base alignment is treated as an iterative process. In different iterations, each relation is traversed to partition the knowledge bases, and candidate entity pairs are selected from matching block pairs, so that the alignment method does not depend on alignable relations and attributes between the knowledge bases.
(2) In each iteration, only a small number of matching entity pairs are discovered, and these matching entity pairs are then used to select candidate entity pairs. Because the selection of candidate entity pairs draws on the information of more matching entity pairs, the quality of the candidate entity pairs improves.
Brief description of the drawings
Fig. 1 is a flow chart of the large-scale heterogeneous knowledge base alignment method based on iterative matching of the present invention;
Fig. 2 is a flow chart of the data preprocessing stage of the large-scale heterogeneous knowledge base alignment method based on iterative matching of the present invention;
Fig. 3 is a flow chart of the knowledge base alignment stage of the large-scale heterogeneous knowledge base alignment method based on iterative matching of the present invention.
Detailed description of the embodiments
In order to describe the present invention more specifically, the technical solution of the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the large-scale heterogeneous knowledge base alignment method based on iterative matching of the present invention is divided into two parts: data preprocessing and knowledge base alignment. Data preprocessing: screen the data in the original knowledge bases KB, unify the data format, and obtain the relations in each knowledge base and the initial matching entity pairs. Knowledge base alignment: first partition the preprocessed knowledge bases using the relations in the knowledge bases and prune the blocks; then match blocks using the matching entity pairs to obtain matching block pairs; next select candidate entity pairs within the matching block pairs and confirm them using a similarity measure and a threshold; finally repeat the above steps until no new candidate entity pairs can be found, which yields all matching entity pairs.
Fig. 2 shows the flow chart of the data preprocessing stage. According to Fig. 2, the stage is divided into the following steps:
S1-1: Input any two original knowledge bases KB_1 and KB_2, and remove from them the information that is irrelevant to the alignment task.
A knowledge base is defined as a 6-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E × R × E is the set of entity-relation-entity triples, that is, relation facts whose object is an entity; and F_P ⊆ E × P × L is the set of entity-attribute-literal triples, that is, attribute facts whose object is a literal. Both F_R and F_P contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and such information reduces the efficiency of the algorithm. In addition, triples containing the "sameAs" relation should also be removed.
S1-2: Unify the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and names in a unified form.
Literals such as names, dates, and numbers may be expressed differently in different knowledge bases, for example "2016-01-01" and "01.01.2016". Unifying this information facilitates later comparison. In addition, the method converts all literals to lower case.
S1-3: Remove meaningless characters such as stop words, symbol characters, and language tags from the literals L_1 in KB_1 and the literals L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2.
The descriptions of entity attributes in a knowledge base may contain meaningless characters, for example stop words such as "the", "a", and "an", symbols such as "#", "!", and "*", and language tags such as "@en". These characters distort the similarity measurement of entity pairs and are therefore removed.
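The literal clean-up of steps S1-2 and S1-3 could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the function name `normalize_literal`, the three-word stop list, the regular expressions, and the reading of dotted dates as DD.MM.YYYY, which the text does not specify.

```python
import re

STOP_WORDS = {"the", "a", "an"}  # illustrative subset only
# Trailing language tag such as "@en" or "@zh-cn".
LANG_TAG = re.compile(r"@[a-z]{2,3}(?:-[a-z0-9]+)*\s*$")
# Dotted dates, assumed here to be DD.MM.YYYY.
DOTTED_DATE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def normalize_literal(text):
    """Lower-case a literal, unify dates, and strip tags, symbols, stop words."""
    text = text.lower()
    text = LANG_TAG.sub("", text)
    # Rewrite dotted dates as YYYY-MM-DD so both spellings compare equal.
    text = DOTTED_DATE.sub(r"\3-\2-\1", text)
    # Replace symbol characters (anything but word chars, spaces, hyphens).
    text = re.sub(r"[^\w\s-]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

With these assumptions, `normalize_literal("The West Lake!@en")` gives `"west lake"`, and both `"2016-01-01"` and `"01.01.2016"` normalize to `"2016-01-01"`.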
S1-4: Collect the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2.
In this step, for knowledge base KB'_1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R_1; for knowledge base KB'_2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R_2. The relation sets R_1 and R_2 are used in the subsequent knowledge base partitioning.
S1-5: Compare all entities in KB'_1 and KB'_2 to obtain the initial matching entity pair set.
In this step, the initial matching entity pair set is obtained as follows:
First, extract all entities in KB'_1 to form the entity set E_1, and all entities in KB'_2 to form the entity set E_2; take the Cartesian product of E_1 and E_2, so that every entity of E_1 is paired with every entity of E_2, to form an entity pair set;
Then, screen this set for entity pairs whose two name attributes have identical string representations, obtaining the pre-initial matching entity pair set;
Finally, screen the pre-initial matching entity pair set for entity pairs that stand in a one-to-one matching relationship, and take them as the initial matching entity pair set.
Fig. 3 shows the flow chart of the knowledge base alignment stage. According to Fig. 3, the stage is divided into the following steps:
S2-1: Input the knowledge bases KB'_1 and KB'_2, the relation sets R_1 and R_2, and the initial matching entity pair set; set the block similarity threshold δ_b to 0.2, the entity similarity threshold δ_e to 0.65, the threshold δ_1 on the number of entities in a block to 50, and the threshold δ_2 on the ratio of matched entities in a block to 0.3; initialize the matching entity pair set M_e to the initial matching entity pair set.
S2-2: Randomly select a relation from R_1 or R_2, and use it to divide the entities of KB'_1 and KB'_2 into blocks, obtaining the block set B_1 of KB'_1 and the block set B_2 of KB'_2.
In this step, the entities of a knowledge base are divided into blocks using the relation as follows:
First, for the triple set F_R1 of knowledge base KB'_1, collect the n distinct object entities that occur in F_R1;
Then, for each object entity, group together all subject entities in F_R1 that correspond to it, obtaining one block; the n object entities yield n blocks, which form the block set B_1;
The block set B_2 is obtained in the same way, namely:
First, for the triple set F_R2 of knowledge base KB'_2, collect the n distinct object entities that occur in F_R2;
Then, for each object entity, group together all subject entities in F_R2 that correspond to it, obtaining one block; the n object entities yield n blocks, which form the block set B_2.
S2-3: Remove from B_1 and B_2 the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs, obtaining the simplified block sets B'_1 and B'_2.
In this step, the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs comprise: blocks whose number of entities exceeds the threshold δ_1, blocks whose ratio of matched entities is below the threshold δ_2, and blocks whose entities have all been matched.
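The three pruning rules of step S2-3 can be expressed as a simple filter. This is a sketch under assumptions: the function name `prune_blocks` and the dict-of-sets layout are hypothetical, while the default values 50 and 0.3 are the δ_1 and δ_2 settings given in step S2-1.

```python
def prune_blocks(blocks, matched, max_size=50, min_matched_ratio=0.3):
    """Drop blocks that are too large, too sparsely matched, or fully matched.

    blocks: dict mapping a block key to a set of entities of one knowledge base.
    matched: set of entities of that knowledge base that are already matched.
    """
    kept = {}
    for key, block in blocks.items():
        if not block:
            continue  # defensive: an empty block carries no information
        n_matched = len(block & matched)
        if len(block) > max_size:
            continue  # too expensive to compare exhaustively
        if n_matched / len(block) < min_matched_ratio:
            continue  # too little evidence to match this block
        if n_matched == len(block):
            continue  # nothing left to discover in this block
        kept[key] = block
    return kept
```

The same function would be applied to B_1 with the matched entities of KB'_1 and to B_2 with the matched entities of KB'_2.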
S2-4: Use all matching entity pairs in M_e to measure the similarity between every block in B'_1 and every block in B'_2, and match two blocks whenever their similarity exceeds the block similarity threshold δ_b, obtaining the matching block pair set.
Each block is regarded as a set of entities, and two matched entities are regarded as the same element of the two sets, so that set similarity can be used to measure the similarity between blocks. The similarity sim_block(b_k, b_l) is computed as:
$$\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cup b_l|}$$
where b_k and b_l denote two blocks, |b_k ∩ b_l| denotes the number of matching entity pairs in the two blocks, and |b_k ∪ b_l| denotes the total number of entities in the two blocks.
S2-5: For every matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set.
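Candidate generation in step S2-5 is a Cartesian product restricted to the unmatched entities of each matching block pair. A minimal sketch, with the hypothetical function name `candidate_pairs` and blocks represented as Python sets:

```python
from itertools import product


def candidate_pairs(block_pairs, matches):
    """Pair up the still-unmatched entities of every matching block pair.

    block_pairs: iterable of (block_of_kb1, block_of_kb2) sets.
    matches: set of (entity_in_kb1, entity_in_kb2) pairs already matched.
    """
    matched1 = {e1 for e1, _ in matches}
    matched2 = {e2 for _, e2 in matches}
    candidates = set()
    for block1, block2 in block_pairs:
        free1 = block1 - matched1
        free2 = block2 - matched2
        candidates.update(product(free1, free2))
    return candidates
```

Because blocks have already been capped at δ_1 entities, each product stays small, which is what keeps the per-entity comparison count low.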
S2-6: Determine whether new candidate entity pairs have been found; if so, go to S2-7; if not, end the iteration and output the matching entity pair set M_e.
S2-7: Compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set; add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate entity pairs.
In this step, the similarity between entities is measured in two ways, by string similarity and by block similarity, and the two similarities are combined with a given weight according to the following formula:
$$\mathrm{sim}(e_i, e_j) = \alpha\,\mathrm{sim}_{string}(e_i, e_j) + (1-\alpha)\,\mathrm{sim}_{block}(b_k, b_l) \quad \text{s.t. } e_i \in b_k,\ e_j \in b_l$$
where sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the block similarity, respectively; b_k and b_l denote the blocks containing the entities e_i and e_j, respectively; and α is the weight of the string similarity, set to 0.6. For attribute pairs shared by the entities e_i and e_j (for example, the name), string similarity measures the similarity of the attribute values. The method uses several similarity functions, based on the Levenshtein distance, the Jaro-Winkler distance, q-grams, and I-SUB, and combines them by linear weighting. Block similarity expresses the similarity of two entities through the similarity of the blocks that contain them. Once the similarity between two entities has been obtained, the threshold δ_e is used to decide whether the pair matches, and newly found matching entity pairs are added to the set of all matching entity pairs.
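To make the weighted combination concrete, the sketch below uses a single normalized Levenshtein similarity as a stand-in for the string component. That simplification is an assumption of this example: the method described above combines several string measures by linear weighting, whose individual weights are not given. The function names are hypothetical; α = 0.6 is the value stated in the text.

```python
def levenshtein(s, t):
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(s, t):
    """Normalized Levenshtein similarity in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def entity_similarity(name1, name2, block_sim, alpha=0.6):
    """alpha * string similarity + (1 - alpha) * block similarity."""
    return alpha * string_similarity(name1, name2) + (1 - alpha) * block_sim
```

For two entities with identical names whose blocks have similarity 0.5, this gives 0.6 · 1 + 0.4 · 0.5 = 0.8, which exceeds the threshold δ_e = 0.65, so the pair would be accepted.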
S2-8: Determine whether the number of iterations is less than the iteration threshold; if so, return to S2-2; otherwise, end the iteration and output the matching entity pair set M_e.
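Putting steps S2-1 to S2-8 together, the control flow of the alignment stage can be sketched as a compact loop. This is an illustrative skeleton rather than the patented implementation: all function and parameter names are hypothetical, `difflib.SequenceMatcher` stands in for the weighted string measures, the value `max_iters=20` is invented (the text gives no iteration threshold), and the same relation name is assumed to be usable in both knowledge bases.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def blocks_for(triples, relation, matched, max_size, min_ratio):
    """S2-2 and S2-3: partition subjects by object, then prune blocks."""
    blocks = defaultdict(set)
    for s, r, o in triples:
        if r == relation:
            blocks[o].add(s)
    # Keep blocks that are small enough, sufficiently matched, not complete.
    return [b for b in blocks.values()
            if len(b) <= max_size and min_ratio <= len(b & matched) / len(b) < 1]


def block_sim(b1, b2, matches):
    """S2-4: matched pairs shared by the blocks over their distinct entities."""
    shared = sum(1 for e1, e2 in matches if e1 in b1 and e2 in b2)
    return shared / (len(b1) + len(b2) - shared)


def align(kb1, kb2, names1, names2, seed_matches, relations,
          d_b=0.2, d_e=0.65, d_1=50, d_2=0.3, alpha=0.6, max_iters=20, rng=None):
    rng = rng or random.Random(0)
    matches = set(seed_matches)                      # S2-1
    for _ in range(max_iters):                       # S2-8: iteration cap
        rel = rng.choice(relations)                  # S2-2
        m1 = {a for a, _ in matches}
        m2 = {b for _, b in matches}
        bs1 = blocks_for(kb1, rel, m1, d_1, d_2)
        bs2 = blocks_for(kb2, rel, m2, d_1, d_2)
        candidates = {}                              # pair -> block similarity
        for b1, b2 in product(bs1, bs2):
            sim_b = block_sim(b1, b2, matches)
            if sim_b > d_b:                          # S2-4: matching blocks
                for pair in product(b1 - m1, b2 - m2):   # S2-5
                    candidates[pair] = max(sim_b, candidates.get(pair, 0.0))
        if not candidates:                           # S2-6: nothing new
            break
        for (e1, e2), sim_b in candidates.items():   # S2-7: confirm
            sim_s = SequenceMatcher(None, names1[e1], names2[e2]).ratio()
            if alpha * sim_s + (1 - alpha) * sim_b > d_e:
                matches.add((e1, e2))
    return matches
```

The sketch does not enforce one-to-one matching among pairs confirmed in the same round; the text does not say how such conflicts are resolved, so that choice is left open here.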
The above embodiments describe the technical solution and beneficial effects of the present invention in detail. It should be understood that the foregoing is only the most preferred embodiment of the invention and is not intended to limit the invention; any modification, supplement, or equivalent substitution made within the scope of the principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A large-scale heterogeneous knowledge base alignment method based on iterative matching, specifically comprising:
Data preprocessing stage: screen the data in any two original knowledge bases KB_1 and KB_2, unify the data format, and remove meaningless characters; then collect the relation set R_1 of the processed knowledge base KB'_1 and the relation set R_2 of the processed knowledge base KB'_2, and obtain the initial matching entity pair set by comparison;
Knowledge base alignment stage: partition KB'_1 and KB'_2 using the relations in R_1 and R_2, and prune each block to obtain the simplified block sets B'_1 and B'_2; then use the initial matching entity pair set to match the blocks in B'_1 and B'_2, obtaining matching block pairs; finally, select candidate entity pairs within the matching block pairs, and confirm them using a similarity measure and the threshold δ_e.
2. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 1, characterized in that the specific steps of the data preprocessing stage are:
(1-1) Input any two original knowledge bases KB_1 and KB_2, and remove from them the information that is irrelevant to the alignment task;
(1-2) Unify the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and names in a unified form;
(1-3) Remove stop words, symbol characters, and language tags from the literals L_1 in KB_1 and the literals L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2;
(1-4) Collect the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2;
(1-5) Compare all entities in KB'_1 and KB'_2 to obtain the initial matching entity pair set.
3. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 2, characterized in that the specific process of step (1-4) is:
For knowledge base KB'_1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R_1; for knowledge base KB'_2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R_2.
4. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 2, characterized in that, in step (1-5), the initial matching entity pair set is obtained as follows:
First, extract all entities in KB'_1 to form the entity set E_1, and all entities in KB'_2 to form the entity set E_2; take the Cartesian product of E_1 and E_2, so that every entity of E_1 is paired with every entity of E_2, to form an entity pair set;
Then, screen this set for entity pairs whose two name attributes have identical string representations, obtaining the pre-initial matching entity pair set;
Finally, screen the pre-initial matching entity pair set for entity pairs that stand in a one-to-one matching relationship, and take them as the initial matching entity pair set.
5. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 1, characterized in that the specific steps of the knowledge base alignment stage are:
(2-1) Input knowledge base KB′_1, knowledge base KB′_2, relation set R_1, relation set R_2, and the initial matched entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the threshold δ_1 on the number of entities in a block, and the threshold δ_2 on the ratio of matched entities in a block; initialize the matched entity pair set M_e to the initial matched entity pair set;
(2-2) Randomly select a relation from relation set R_1 or relation set R_2, and use that relation to partition the entities in knowledge base KB′_1 and knowledge base KB′_2 into several blocks, obtaining block set B_1 corresponding to KB′_1 and block set B_2 corresponding to KB′_2;
(2-3) Remove from block set B_1 and block set B_2 the blocks that tend to incur a high computational cost or are unlikely to yield matched entity pairs, obtaining the reduced block sets B′_1 and B′_2;
(2-4) Using all matched entity pairs in M_e, measure the similarity between every block in B′_1 and every block in B′_2; match each pair of blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matched block pair set;
(2-5) For each matched block pair in the matched block pair set, take the Cartesian product of the unmatched entities in one block of the pair and the unmatched entities in the other block as candidate entity pairs, which form the candidate entity pair set;
(2-6) Determine whether no new candidate entity pair has been found; if not (new candidate entity pairs exist), go to step (2-7); if so, terminate the iteration and output the matched entity pair set M_e;
(2-7) Compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set; add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate entity pairs;
(2-8) Determine whether the number of iterations is less than the iteration threshold; if not, go to step (2-2); if so, terminate the iteration and output the matched entity pair set M_e.
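The overall loop of steps (2-1) to (2-8) can be sketched as follows. This is a simplified illustration, not the patented implementation, and it rests on several assumptions: entities are identified by their name strings; the pruning of step (2-3) is omitted; `difflib.SequenceMatcher` stands in for the string similarity; the union in the block similarity counts each matched pair once; and the loop simply stops after a hypothetical cap `max_iters`, which is one reading of step (2-8). All function and parameter names are illustrative.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher


def build_blocks(triples, relation):
    """Group subject entities by object entity for one relation."""
    blocks = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            blocks[obj].add(subj)
    return list(blocks.values())


def align(triples1, triples2, seed_pairs, delta_b=0.3, delta_e=0.7,
          alpha=0.5, max_iters=10, seed=0):
    rng = random.Random(seed)
    matched = set(seed_pairs)                                  # (2-1)
    relations = sorted({r for _, r, _ in triples1} |
                       {r for _, r, _ in triples2})
    for _ in range(max_iters):                                 # (2-8), assumed cap
        relation = rng.choice(relations)                       # (2-2)
        blocks1 = build_blocks(triples1, relation)
        blocks2 = build_blocks(triples2, relation)
        used1 = {a for a, _ in matched}
        used2 = {b for _, b in matched}
        candidates = {}          # candidate pair -> best block similarity
        for bk in blocks1:
            for bl in blocks2:
                inter = sum(1 for a, b in matched if a in bk and b in bl)
                union = len(bk) + len(bl) - inter
                sim_b = inter / (inter + 0.5 * (union - inter))
                if sim_b <= delta_b:                           # (2-4)
                    continue
                for a in bk - used1:                           # (2-5)
                    for b in bl - used2:
                        best = candidates.get((a, b), 0.0)
                        candidates[(a, b)] = max(best, sim_b)
        if not candidates:                                     # (2-6)
            break
        for (a, b), sim_b in candidates.items():               # (2-7)
            sim_s = SequenceMatcher(None, a, b).ratio()
            if alpha * sim_s + (1 - alpha) * sim_b > delta_e:
                matched.add((a, b))
    return matched


# Illustrative data only.
kb1 = [("Hangzhou", "locatedIn", "Zhejiang"),
       ("Ningbo", "locatedIn", "Zhejiang")]
kb2 = [("Hangzhou City", "locatedIn", "Zhejiang Province"),
       ("Ningbo", "locatedIn", "Zhejiang Province")]
result = align(kb1, kb2, {("Ningbo", "Ningbo")}, delta_e=0.6)
```

In this toy run the seed pair makes the two blocks similar enough to be matched, so the remaining unmatched entities become a candidate pair. The sketch does not enforce one-to-one matching among newly added pairs; the claim does not specify how such conflicts are resolved.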
6. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-2), the detailed process of using the relation to partition the entities in a knowledge base into several blocks is:
First, for the triple set F_R1 in knowledge base KB′_1, count the n distinct object entities in F_R1;
Then, for each distinct object entity, group all subject entities in F_R1 that correspond to it into one block; the n object entities yield n blocks, which form block set B_1;
Block set B_2 is obtained in the same way.
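The blocking of claim 6 can be sketched as below. The sketch assumes that only triples carrying the relation selected in step (2-2) are considered, which is an interpretation rather than something the claim states explicitly; the function name `build_blocks` and the data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject entity, relation, object entity)


def build_blocks(triples: Iterable[Triple], relation: str) -> Dict[str, Set[str]]:
    """One block per distinct object entity, holding its subject entities."""
    blocks: Dict[str, Set[str]] = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:  # assumption: block on the selected relation only
            blocks[obj].add(subj)
    return dict(blocks)


# Illustrative data only.
f_r1 = [
    ("Hangzhou", "locatedIn", "Zhejiang"),
    ("Ningbo", "locatedIn", "Zhejiang"),
    ("Nanjing", "locatedIn", "Jiangsu"),
    ("Zhejiang University", "basedIn", "Hangzhou"),
]
b1 = build_blocks(f_r1, "locatedIn")
```

Here the two distinct object entities of `locatedIn` give n = 2 blocks.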
7. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-3), the blocks that tend to incur a high computational cost or are unlikely to yield matched entity pairs include: blocks whose number of entities exceeds threshold δ_1, blocks whose ratio of matched entities is below threshold δ_2, and blocks whose entities have all been matched.
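The three pruning criteria of claim 7 can be sketched as a single filter. This is an illustration under the assumption that the matched ratio of a block is the share of its entities that already appear in a matched pair; the names `prune_blocks`, `delta_1`, and `delta_2` mirror the claim's thresholds but are otherwise hypothetical.

```python
from typing import Dict, Set


def prune_blocks(blocks: Dict[str, Set[str]], matched_entities: Set[str],
                 delta_1: int, delta_2: float) -> Dict[str, Set[str]]:
    """Drop blocks that are too large, too sparsely matched, or fully matched."""
    kept = {}
    for key, block in blocks.items():
        n_matched = len(block & matched_entities)
        too_large = len(block) > delta_1                 # costly Cartesian product
        too_sparse = n_matched / len(block) < delta_2    # weak evidence for matching
        exhausted = n_matched == len(block)              # nothing left to match
        if not (too_large or too_sparse or exhausted):
            kept[key] = block
    return kept


# Illustrative data only.
blocks = {
    "x": {"a", "b", "c", "d"},  # exceeds delta_1
    "y": {"a", "b"},            # all entities already matched
    "z": {"e", "f"},            # no matched entities
    "w": {"a", "g"},            # kept
}
reduced = prune_blocks(blocks, matched_entities={"a", "b"}, delta_1=3, delta_2=0.3)
```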
8. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-4), the similarity between blocks is obtained as follows:
Each block is regarded as a set of entities, and the two entities of a matched entity pair are regarded as the same element of the two sets; set similarity is then used to measure the similarity between blocks. The similarity sim_block(b_k, b_l) is computed as:
$$\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cap b_l| + 0.5\,\left(|b_k \cup b_l| - |b_k \cap b_l|\right)}$$
where b_k and b_l denote two blocks, |b_k ∩ b_l| denotes the number of matched entity pairs across the two blocks, and |b_k ∪ b_l| denotes the total number of entities in the two blocks.
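A minimal sketch of the block similarity of claim 8 follows. It assumes that |b_k ∪ b_l| counts each matched pair once, since the claim treats the two entities of a matched pair as the same element; under that reading the expression simplifies to 2m / (|b_k| + |b_l|) for m matched pairs. The function name and data are illustrative.

```python
from typing import Iterable, Set, Tuple


def block_similarity(block_k: Set[str], block_l: Set[str],
                     matched_pairs: Iterable[Tuple[str, str]]) -> float:
    """Set similarity of two blocks, treating matched pairs as shared elements."""
    # |b_k ∩ b_l|: matched pairs with one entity in each block.
    inter = sum(1 for e1, e2 in matched_pairs if e1 in block_k and e2 in block_l)
    # |b_k ∪ b_l|: all entities, counting each matched pair once (assumption).
    union = len(block_k) + len(block_l) - inter
    denom = inter + 0.5 * (union - inter)
    return inter / denom if denom else 0.0


# Illustrative data only: two of the three matched pairs fall in these blocks.
b_k = {"a1", "a2", "a3"}
b_l = {"b1", "b2"}
m_e = {("a1", "b1"), ("a2", "b2"), ("a9", "b9")}
sim = block_similarity(b_k, b_l, m_e)
```

With 2 matched pairs and 3 distinct elements in the union, the value is 2 / (2 + 0.5 × 1) = 0.8.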
9. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 8, characterized in that, in step (2-7), the similarity between entities is computed as:
$$\mathrm{sim}(e_i, e_j) = \alpha\,\mathrm{sim}_{string}(e_i, e_j) + (1 - \alpha)\,\mathrm{sim}_{block}(b_k, b_l) \quad \text{s.t. } e_i \in b_k,\ e_j \in b_l$$
where b_k and b_l denote the blocks containing entities e_i and e_j respectively, sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the block similarity respectively, and α is the weight of the string similarity, with value range [0, 1].
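The weighted combination of claim 9 can be illustrated with a small helper. The two component scores are passed in as numbers so the sketch stays independent of how they are computed; the function name and the default α = 0.5 are assumptions, not values given in the patent.

```python
def entity_similarity(sim_string: float, sim_block: float, alpha: float = 0.5) -> float:
    """Convex combination of string similarity and block similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_string + (1.0 - alpha) * sim_block


# Illustrative numbers: similar names (0.9) inside moderately similar blocks (0.5).
score = entity_similarity(0.9, 0.5, alpha=0.6)
```

Setting α = 1 relies on names alone, while α = 0 relies only on the structural evidence carried by the blocks.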
10. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 9, characterized in that similarity functions based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB are used, and these similarity measures are combined by linear weighting to compute the string similarity.
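The linear weighting of claim 10 can be sketched with two of the four named measures. For brevity the sketch combines only a Levenshtein-based similarity and a q-gram similarity (Jaccard overlap of character bigrams); Jaro-Winkler and I-SUB would be added as further weighted terms. The equal weights, the normalization choices, and all function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Edit distance normalized by the longer string, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgram_sim(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the sets of character q-grams."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def string_similarity(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Linear weighting of the component similarities (weights sum to 1)."""
    w_lev, w_qgram = weights
    return w_lev * levenshtein_sim(a, b) + w_qgram * qgram_sim(a, b)
```

For example, `string_similarity("kitten", "sitting")` mixes an edit-distance score of 1 - 3/7 with a bigram overlap of 2/9.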
CN201710237034.6A 2017-04-12 2017-04-12 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching Expired - Fee Related CN107145523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710237034.6A CN107145523B (en) 2017-04-12 2017-04-12 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710237034.6A CN107145523B (en) 2017-04-12 2017-04-12 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching

Publications (2)

Publication Number Publication Date
CN107145523A true CN107145523A (en) 2017-09-08
CN107145523B CN107145523B (en) 2019-10-18

Family

ID=59774786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710237034.6A Expired - Fee Related CN107145523B (en) 2017-04-12 2017-04-12 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching

Country Status (1)

Country Link
CN (1) CN107145523B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460021A (en) * 2018-03-16 2018-08-28 安徽大学 A kind of method for extracting the problems in Article Titles method pair
CN109492114A (en) * 2018-11-16 2019-03-19 南京茂毓通软件科技有限公司 A kind of entity information recognition methods
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping
CN110413704A (en) * 2019-06-27 2019-11-05 浙江大学 Entity alignment schemes based on weighting neighbor information coding
CN111191045A (en) * 2019-12-30 2020-05-22 创新奇智(上海)科技有限公司 Entity alignment method and system applied to knowledge graph
CN112699667A (en) * 2020-12-29 2021-04-23 京东数字科技控股股份有限公司 Entity similarity determination method, device, equipment and storage medium
CN112784609A (en) * 2021-03-16 2021-05-11 云知声智能科技股份有限公司 Method, apparatus, device and medium for determining whether medical record includes consultation opinions
CN113609304A (en) * 2021-07-20 2021-11-05 广州大学 Entity matching method and device
CN114417810A (en) * 2021-12-29 2022-04-29 东方财富信息股份有限公司 SimBlock algorithm for realizing high-quality text similarity calculation and realization method
CN114417810B (en) * 2021-12-29 2024-07-09 东方财富信息股份有限公司 SimBlock algorithm for realizing high-quality text similarity calculation and realization method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679492A (en) * 2013-11-29 2015-06-03 国际商业机器公司 Computer-implemented technical support providing device and method
US20150235130A1 (en) * 2014-02-19 2015-08-20 International Business Machines Corporation NLP Duration and Duration Range Comparison Methodology Using Similarity Weighting
CN104899242A (en) * 2015-03-10 2015-09-09 四川大学 Mechanical product design two-dimensional knowledge pushing method based on design intent

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD REZA NAKHAI ET AL: "Interference Alignment with Cyclic Unidirectional", 《2012 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC)》 *
HUANG Junfu et al.: "Entity alignment of Chinese heterogeneous encyclopedia knowledge bases", 《Journal of Computer Applications》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460021A (en) * 2018-03-16 2018-08-28 安徽大学 A kind of method for extracting the problems in Article Titles method pair
CN108460021B (en) * 2018-03-16 2021-10-12 安徽大学 Method for extracting problem method pairs in thesis title
CN109492114A (en) * 2018-11-16 2019-03-19 南京茂毓通软件科技有限公司 A kind of entity information recognition methods
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping
CN110413704B (en) * 2019-06-27 2022-05-03 浙江大学 Entity alignment method based on weighted neighbor information coding
CN110413704A (en) * 2019-06-27 2019-11-05 浙江大学 Entity alignment schemes based on weighting neighbor information coding
CN111191045A (en) * 2019-12-30 2020-05-22 创新奇智(上海)科技有限公司 Entity alignment method and system applied to knowledge graph
CN111191045B (en) * 2019-12-30 2023-06-16 创新奇智(上海)科技有限公司 Entity alignment method and system applied to knowledge graph
CN112699667A (en) * 2020-12-29 2021-04-23 京东数字科技控股股份有限公司 Entity similarity determination method, device, equipment and storage medium
CN112699667B (en) * 2020-12-29 2024-05-21 京东科技控股股份有限公司 Entity similarity determination method, device, equipment and storage medium
CN112784609A (en) * 2021-03-16 2021-05-11 云知声智能科技股份有限公司 Method, apparatus, device and medium for determining whether medical record includes consultation opinions
CN113609304B (en) * 2021-07-20 2023-05-23 广州大学 Entity matching method and device
CN113609304A (en) * 2021-07-20 2021-11-05 广州大学 Entity matching method and device
CN114417810A (en) * 2021-12-29 2022-04-29 东方财富信息股份有限公司 SimBlock algorithm for realizing high-quality text similarity calculation and realization method
CN114417810B (en) * 2021-12-29 2024-07-09 东方财富信息股份有限公司 SimBlock algorithm for realizing high-quality text similarity calculation and realization method

Also Published As

Publication number Publication date
CN107145523B (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN107145523B (en) Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
Pezzoni et al. How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation
US11036685B2 (en) System and method for compressing data in a database
CN103823888B (en) Node-closeness-based social network site friend recommendation method
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
CN105843796A (en) Microblog emotional tendency analysis method and device
CN102955833A (en) Correspondence address identifying and standardizing method
Prokić et al. Recognising groups among dialects
CN107239512A (en) The microblogging comment spam recognition methods of relational network figure is commented in a kind of combination
CN105843799A (en) Academic paper label recommendation method based on multi-source heterogeneous information graph model
CN106991090A (en) The analysis method and device of public sentiment event entity
Vieira et al. Performance evaluation of modularity based community detection algorithms in large scale networks
Bi et al. MM-GNN: Mix-moment graph neural network towards modeling neighborhood feature distribution
CN109885797B (en) Relational network construction method based on multi-identity space mapping
CN106651461A (en) Film personalized recommendation method based on gray theory
Wang et al. An improved clustering method for detection system of public security events based on genetic algorithm and semisupervised learning
CN105320715A (en) Body based semantic query method
CN102637202A (en) Method for automatically acquiring iterative conception attribute name and system
Sharma et al. Analysis of clustering algorithms in biological networks
CN107203609A (en) The method and mobile terminal of a kind of fast search mobile terminal SMS
Pola et al. Similarity sets: A new concept of sets to seamlessly handle similarity in database management systems
CN107391490A (en) A kind of intelligent semantic analysis and text mining method
Liu et al. Social Network Community‐Discovery Algorithm Based on a Balance Factor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191018