CN107145523B - Large-scale heterogeneous knowledge base alignment method based on iterative matching - Google Patents

Info

Publication number
CN107145523B
Authority
CN
China
Prior art keywords
entity
block
matching
knowledge base
collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710237034.6A
Other languages
Chinese (zh)
Other versions
CN107145523A (en
Inventor
陈岭
顾伟东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201710237034.6A priority Critical patent/CN107145523B/en
Publication of CN107145523A publication Critical patent/CN107145523A/en
Application granted granted Critical
Publication of CN107145523B publication Critical patent/CN107145523B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Abstract

The invention discloses a large-scale heterogeneous knowledge base alignment method based on iterative matching, implemented as follows: 1) screen the data in the original knowledge bases and unify the data format, then obtain the relations in the knowledge bases and the initial matched entity pairs; 2) partition the preprocessed knowledge bases into blocks using the relations in the knowledge bases, and prune the blocks; 3) match blocks using the matched entity pairs to obtain matched block pairs; 4) select candidate entity pairs within the matched block pairs, and confirm them using a similarity measure and a threshold; 5) repeat the above steps until no new candidate entity pair can be found, yielding all matched entity pairs. The invention applies the idea of iterative matching to align heterogeneous knowledge bases and has broad application prospects in fields such as knowledge base alignment, data fusion, and automatic question answering.

Description

Large-scale heterogeneous knowledge base alignment method based on iterative matching
Technical field
The present invention relates to the field of knowledge base alignment, and in particular to a large-scale heterogeneous knowledge base alignment method based on iterative matching.
Background
With the arrival of Web 3.0, structured knowledge bases appear on the Internet with increasing frequency. These knowledge bases are widely used in semantic applications such as automatic question answering, search services, and social services. However, the limited information in any single knowledge base restricts the functionality of these applications. In this context, knowledge base alignment has great room for development. Knowledge base alignment usually refers to entity alignment between knowledge bases, that is, automatically discovering two entities that represent the same real-world thing and linking them.
Because knowledge bases keep growing, knowledge base alignment methods usually split the alignment process into two steps: discovering candidate entity pairs and confirming candidate entity pairs. Discovery uses a small number of attributes to quickly screen out several candidate entities for each entity; confirmation compares two entities comprehensively and uses a similarity and a threshold to judge whether they match. Because accurate comparison of every pair of entities is avoided, this practice greatly improves the overall efficiency of the method. At present, the bottleneck of knowledge base alignment methods is that the discovered candidate entity pairs are often incomplete, so that entity pairs that could be matched go undiscovered.
To improve the quality of candidate entity pairs, researchers have proposed iterative matching: each round discovers a small number of matched entity pairs, which then serve as the basis for finding candidate entity pairs in the next round. However, traditional knowledge base alignment methods usually focus on aligning homogeneous knowledge bases, that is, two knowledge bases that share many alignable relations. Their basic assumption is: if a pair of entities matches and the two entities have an aligned relation, then their "compatible neighbors" match with high probability, so the "compatible neighbors" are taken as candidate entity pairs. When the knowledge bases share few alignable relations, these methods miss part of the candidate entity pairs. To solve this problem, researchers have proposed class-based knowledge base alignment methods, which group instances with the same features into the same class and exclude candidate entities unrelated to the content of the class, thereby confirming candidate entity pairs. However, because such a method obtains candidate entity pairs only at the initial stage of the model through a classical partitioning technique, it also misses many candidate entity pairs when few attributes are aligned between the two knowledge bases.
Summary of the invention
In view of the above, the invention proposes a large-scale heterogeneous knowledge base alignment method based on iterative matching. The method performs knowledge base alignment with the idea of iterative matching: within an iterative framework it traverses the relations to partition the knowledge bases, which enlarges the search space of candidate entity pairs. At the same time, it selects and confirms candidate entity pairs in a divide-and-conquer manner, so that each entity needs to be compared comprehensively with only a few candidate entities, which improves the efficiency of the method.
A large-scale heterogeneous knowledge base alignment method based on iterative matching specifically comprises:
Data preprocessing stage: screen the data in any two original knowledge bases KB1 and KB2, unify the data format, and remove meaningless characters; collect the relation set R1 corresponding to the processed knowledge base KB′1 and the relation set R2 corresponding to the processed knowledge base KB′2; and obtain the initial matched entity pair set by comparison.
Knowledge base alignment stage: partition knowledge bases KB′1 and KB′2 using the relations in relation sets R1 and R2, and prune each block to obtain the pruned block sets B′1 and B′2; then match the blocks in B′1 and B′2 using the initial matched entity pair set to obtain matched block pairs; finally, select candidate entity pairs within the matched block pairs and confirm them using a similarity measure and the threshold δ_e.
The specific steps of the data preprocessing stage are:
(1-1) input any two original knowledge bases KB1 and KB2, and remove the information in KB1 and KB2 that is irrelevant to the alignment task;
(1-2) unify the data format of the literals L1 in knowledge base KB1 and the literals L2 in knowledge base KB2, expressing dates, numbers, and names in a unified format;
(1-3) remove stop-word characters, symbol characters, and language-tag characters from the literals L1 in KB1 and the literals L2 in KB2 to obtain the processed knowledge bases KB′1 and KB′2;
(1-4) collect the relation set R1 corresponding to knowledge base KB′1 and the relation set R2 corresponding to knowledge base KB′2;
(1-5) compare all entities in KB′1 and KB′2 to obtain the initial matched entity pair set.
A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E × R × E is the set of entity-relation-entity triples, which express relation facts whose object is an entity; F_P ⊆ E × P × L is the set of entity-attribute-literal triples, which express attribute facts whose object is a literal. Both F_R and F_P contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and this information reduces the efficiency of the algorithm. In addition, triples containing the "sameAs" relation should also be removed.
The detailed process of step (1-4) is:
For knowledge base KB′1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R1; for knowledge base KB′2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R2. Relation sets R1 and R2 are used in the subsequent knowledge base partitioning.
In step (1-5), the initial matched entity pair set is obtained as follows:
First, extract all entities in KB′1 to form the entity set E1 and all entities in KB′2 to form the entity set E2; take the Cartesian product of E1 and E2, pairing any entity in E1 with any entity in E2, to form the entity pair set;
Then, select from the entity pair set the entity pairs whose two name attributes have identical string representations, obtaining the preliminary initial matched entity pair set;
Finally, select from the preliminary set the entity pairs that have a one-to-one matching relationship; these form the initial matched entity pair set.
The specific steps of the knowledge base alignment stage are:
(2-1) input knowledge base KB′1, knowledge base KB′2, relation set R1, relation set R2, and the initial matched entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the threshold δ1 on the number of entities in a block, and the threshold δ2 on the ratio of matched entities in a block; initialize the matched entity pair set M_e to the initial matched entity pair set;
(2-2) randomly select a relation from relation set R1 or R2, and use it to divide the entities in KB′1 and KB′2 into blocks, obtaining the block set B1 corresponding to KB′1 and the block set B2 corresponding to KB′2;
(2-3) remove from block sets B1 and B2 the blocks that tend to cause a high computational cost or are unlikely to yield matched entity pairs, obtaining the pruned block sets B′1 and B′2;
(2-4) using all matched entity pairs in M_e, measure the similarity between every block in B′1 and every block in B′2, and match each two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matched block pair set;
(2-5) for each matched block pair in the matched block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set;
(2-6) determine whether no new candidate entity pair has been found; if not, go to step (2-7); if so, end the iteration and output the matched entity pair set M_e;
(2-7) compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to the matched entity pair set M_e, and discard the remaining candidate entity pairs;
(2-8) determine whether the number of iterations is less than the iteration threshold; if not, go to step (2-2); if so, end the iteration and output the matched entity pair set M_e.
In step (2-2), the detailed process of dividing the entities of a knowledge base into blocks using a relation is:
First, for the triple set F_R1 of knowledge base KB′1, count the n distinct object entities in F_R1;
Then, for each object entity, put all the subject entities that correspond to it in F_R1 together to obtain one block; the n object entities yield n blocks, which form the block set B1;
The block set B2 is obtained in the same way.
In step (2-3), the blocks that tend to cause a high computational cost or are unlikely to yield matched entity pairs include: blocks whose number of entities exceeds the threshold δ1, blocks whose ratio of matched entities is below the threshold δ2, and blocks whose entities are all already matched.
In step (2-4), the similarity between blocks is obtained as follows:
Each block is regarded as a set of entities, and a matched entity pair is regarded as an element common to the two sets, so the similarity between blocks is measured with a set similarity. The similarity sim_block(b_k, b_l) is computed as:
sim_block(b_k, b_l) = |b_k ∩ b_l| / |b_k ∪ b_l|
where b_k and b_l denote two blocks, |b_k ∩ b_l| denotes the number of matched entity pairs across the two blocks, and |b_k ∪ b_l| denotes the total number of entities in the two blocks.
In step (2-7), the similarity between entities is computed as:
sim(e_i, e_j) = α · sim_string(e_i, e_j) + (1 - α) · sim_block(b_k, b_l)
s.t. e_i ∈ b_k, e_j ∈ b_l
where b_k and b_l denote the blocks containing entities e_i and e_j, respectively; sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the block similarity, respectively; and α is the weight of the string similarity, with value range [0, 1].
Preferably, similarity functions based on the Levenshtein distance, the Jaro-Winkler distance, q-grams, and I-SUB are used, and these similarity functions are combined by linear weighting to compute the string similarity.
The invention aligns heterogeneous knowledge bases with the idea of iterative matching: within an iterative framework it traverses the relations to partition the knowledge bases, which enlarges the search space of candidate entity pairs; at the same time, it selects and confirms candidate entity pairs in a divide-and-conquer manner, so that each entity needs to be compared comprehensively with only a few candidate entities, which improves the efficiency of the method. Compared with existing methods, its advantages are:
(1) Knowledge base alignment is treated as an iterative process. In different iterations, each relation is traversed to partition the knowledge bases, and matched blocks are used to select candidate entity pairs, so the alignment method does not depend on alignable relations and attributes between the knowledge bases.
(2) Only a small number of matched entity pairs are found in each iteration, and these matched entity pairs are then used to select candidate entity pairs. Because the selection of candidate entity pairs uses the information of more and more matched entity pairs, the quality of the candidate entity pairs improves.
Description of the drawings
Fig. 1 is a flow diagram of the large-scale heterogeneous knowledge base alignment method based on iterative matching according to the invention;
Fig. 2 is a flow chart of the data preprocessing stage of the method;
Fig. 3 is a flow chart of the knowledge base alignment stage of the method.
Detailed description of the embodiments
To describe the invention more specifically, the technical solution of the invention is described in detail below with reference to the accompanying drawings and a specific embodiment.
As shown in Fig. 1, the large-scale heterogeneous knowledge base alignment method based on iterative matching is divided into two parts: data preprocessing and knowledge base alignment. Data preprocessing: screen the data in the original knowledge bases, unify the data format, and obtain the relations in the knowledge bases and the initial matched entity pairs. Knowledge base alignment: first partition the preprocessed knowledge bases using their relations and prune the blocks; then match blocks using the matched entity pairs to obtain matched block pairs; next select candidate entity pairs within the matched block pairs and confirm them using a similarity measure and a threshold; finally repeat the above steps until no new candidate entity pair can be found, at which point all matched entity pairs have been obtained.
Fig. 2 shows the flow chart of the data preprocessing stage. According to Fig. 2, the stage is divided into the following steps:
S1-1: input any two original knowledge bases KB1 and KB2, and remove the information in KB1 and KB2 that is irrelevant to the alignment task.
A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E × R × E is the set of entity-relation-entity triples, which express relation facts whose object is an entity; F_P ⊆ E × P × L is the set of entity-attribute-literal triples, which express attribute facts whose object is a literal. Both F_R and F_P contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and this information reduces the efficiency of the algorithm. In addition, triples containing the "sameAs" relation should also be removed.
S1-2: unify the data format of the literals L1 in knowledge base KB1 and the literals L2 in knowledge base KB2, expressing dates, numbers, and names in a unified format.
Literals such as names, dates, and numbers may be expressed differently in different knowledge bases, for example "2016-01-01" and "01.01.2016". Unifying this information facilitates subsequent comparison. In addition, the method converts literals to lower case.
S1-3: remove meaningless characters such as stop-word characters, symbol characters, and language tags from the literals L1 in KB1 and the literals L2 in KB2 to obtain the processed knowledge bases KB′1 and KB′2.
The attribute descriptions of entities in a knowledge base may contain meaningless characters, for example stop words such as "the", "a", and "an", symbols such as "#", "!", and "*", and language tags such as "@en". These characters affect the similarity measurement of entity pairs and are therefore removed.
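The literal clean-up of steps S1-2 and S1-3 can be illustrated with a minimal Python sketch. The function name `normalize_literal`, the stop-word list, the regular expressions, and the reading of dotted dates as DD.MM.YYYY are all assumptions made for illustration; the source gives only the examples quoted above.

```python
import re

# Assumed stop-word list: the source names only "the", "a" and "an" as examples.
STOP_WORDS = {"the", "a", "an"}

def normalize_literal(text: str) -> str:
    """Unify a literal: strip language tag, unify dates, lower-case, drop symbols and stop words."""
    # Remove a trailing language tag such as "@en".
    text = re.sub(r"@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*\s*$", "", text)
    # Assumption: dotted dates are DD.MM.YYYY; rewrite them as YYYY-MM-DD.
    text = re.sub(r"\b(\d{2})\.(\d{2})\.(\d{4})\b", r"\3-\2-\1", text)
    text = text.lower()
    # Replace symbols such as "#", "!" and "*" with spaces (hyphens are kept for dates).
    text = re.sub(r"[^\w\s-]", " ", text)
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)
```

For example, `normalize_literal("01.01.2016")` and `normalize_literal("2016-01-01")` yield the same string, so the two date spellings compare as equal afterwards.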
S1-4: collect the relation set R1 corresponding to knowledge base KB′1 and the relation set R2 corresponding to knowledge base KB′2.
In this step, for knowledge base KB′1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R1; for knowledge base KB′2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R2. Relation sets R1 and R2 are used in the subsequent knowledge base partitioning.
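Collecting a relation set amounts to one pass over the relation triples. The sketch below assumes triples are stored as `(subject, relation, object)` tuples of strings; the `Triple` alias and the function name `collect_relations` are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumed representation of an entity-relation-entity triple.
Triple = Tuple[str, str, str]  # (subject entity, relation, object entity)

def collect_relations(relation_triples: Iterable[Triple]) -> Set[str]:
    """Return the distinct relations that occur in a triple set such as F_R1."""
    return {relation for _, relation, _ in relation_triples}
```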
S1-5: compare all entities in KB′1 and KB′2 to obtain the initial matched entity pair set.
In this step, the initial matched entity pair set is obtained as follows:
First, extract all entities in KB′1 to form the entity set E1 and all entities in KB′2 to form the entity set E2; take the Cartesian product of E1 and E2, pairing any entity in E1 with any entity in E2, to form the entity pair set;
Then, select from the entity pair set the entity pairs whose two name attributes have identical string representations, obtaining the preliminary initial matched entity pair set;
Finally, select from the preliminary set the entity pairs that have a one-to-one matching relationship; these form the initial matched entity pair set.
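The three sub-steps above can be sketched in Python as follows. Instead of materializing the full Cartesian product, the sketch groups entities by name, which yields the same exact-name pairs; it assumes each entity has a single, already normalized name. The function name `initial_matches` and the dictionary inputs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def initial_matches(names1: Dict[str, str], names2: Dict[str, str]) -> Set[Tuple[str, str]]:
    """names1 / names2 map entity id -> name; return one-to-one exact-name matches."""
    by_name1: Dict[str, List[str]] = defaultdict(list)
    by_name2: Dict[str, List[str]] = defaultdict(list)
    for entity, name in names1.items():
        by_name1[name].append(entity)
    for entity, name in names2.items():
        by_name2[name].append(entity)

    matches: Set[Tuple[str, str]] = set()
    for name, entities1 in by_name1.items():
        entities2 = by_name2.get(name, [])
        # Keep a pair only when the name is unique on both sides (one-to-one).
        if len(entities1) == 1 and len(entities2) == 1:
            matches.add((entities1[0], entities2[0]))
    return matches
```

A name shared by several entities in either knowledge base produces no initial match, because such pairs would not be one-to-one.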
Fig. 3 shows the flow chart of the knowledge base alignment stage. According to Fig. 3, the stage is divided into the following steps:
S2-1: input knowledge base KB′1, knowledge base KB′2, relation set R1, relation set R2, and the initial matched entity pair set; set the block similarity threshold δ_b to 0.2, the entity similarity threshold δ_e to 0.65, the threshold δ1 on the number of entities in a block to 50, and the threshold δ2 on the ratio of matched entities in a block to 0.3; initialize the matched entity pair set M_e to the initial matched entity pair set.
S2-2: randomly select a relation from relation set R1 or R2, and use it to divide the entities in KB′1 and KB′2 into blocks, obtaining the block set B1 corresponding to KB′1 and the block set B2 corresponding to KB′2.
In this step, the detailed process of dividing the entities of a knowledge base into blocks using a relation is:
First, for the triple set F_R1 of knowledge base KB′1, count the n distinct object entities in F_R1;
Then, for each object entity, put all the subject entities that correspond to it in F_R1 together to obtain one block; the n object entities yield n blocks, which form the block set B1;
The block set B2 is obtained in the same way, namely:
First, for the triple set F_R2 of knowledge base KB′2, count the n distinct object entities in F_R2;
Then, for each object entity, put all the subject entities that correspond to it in F_R2 together to obtain one block; the n object entities yield n blocks, which form the block set B2.
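The partitioning step can be sketched as a group-by on the object entity. The sketch assumes that only triples carrying the selected relation are considered and that triples are `(subject, relation, object)` tuples; the function name `partition_by_relation` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject entity, relation, object entity)

def partition_by_relation(triples: Iterable[Triple], relation: str) -> Dict[str, Set[str]]:
    """Group subject entities into blocks keyed by the object entity of `relation`."""
    blocks: Dict[str, Set[str]] = defaultdict(set)
    for subject, rel, obj in triples:
        if rel == relation:  # assumption: other relations are ignored in this round
            blocks[obj].add(subject)
    return dict(blocks)
```

An entity can appear in several blocks when it is the subject of several triples with the same relation.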
S2-3: remove from block sets B1 and B2 the blocks that tend to cause a high computational cost or are unlikely to yield matched entity pairs, obtaining the pruned block sets B′1 and B′2.
In this step, the blocks that tend to cause a high computational cost or are unlikely to yield matched entity pairs include: blocks whose number of entities exceeds the threshold δ1, blocks whose ratio of matched entities is below the threshold δ2, and blocks whose entities are all already matched.
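A minimal sketch of the three pruning rules follows. The default values 50 and 0.3 are the δ1 and δ2 settings of this embodiment; the function name `prune_blocks` and the representation of already matched entities as a plain set are assumptions.

```python
from typing import Dict, Set

def prune_blocks(
    blocks: Dict[str, Set[str]],
    matched_entities: Set[str],
    max_size: int = 50,              # delta_1: maximum number of entities in a block
    min_matched_ratio: float = 0.3,  # delta_2: minimum ratio of matched entities
) -> Dict[str, Set[str]]:
    """Drop blocks that are too large, too sparsely matched, or fully matched."""
    kept: Dict[str, Set[str]] = {}
    for key, entities in blocks.items():
        if not entities or len(entities) > max_size:
            continue  # empty, or too costly to compare exhaustively
        matched_count = len(entities & matched_entities)
        if matched_count / len(entities) < min_matched_ratio:
            continue  # too few matched entities to anchor block matching
        if matched_count == len(entities):
            continue  # nothing left to match in this block
        kept[key] = entities
    return kept
```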
S2-4: using all matched entity pairs in M_e, measure the similarity between every block in B′1 and every block in B′2, and match each two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matched block pair set.
Each block is regarded as a set of entities, and a matched entity pair is regarded as an element common to the two sets, so the similarity between blocks is measured with a set similarity. The similarity sim_block(b_k, b_l) is computed as:
sim_block(b_k, b_l) = |b_k ∩ b_l| / |b_k ∪ b_l|
where b_k and b_l denote two blocks, |b_k ∩ b_l| denotes the number of matched entity pairs across the two blocks, and |b_k ∪ b_l| denotes the total number of entities in the two blocks.
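The block similarity can be sketched as a Jaccard-style ratio over matched pairs. The sketch assumes that a matched pair counts once in the union, so the union size is taken as |b_k| + |b_l| minus the number of matched pairs; this reading, and the function name `block_similarity`, are assumptions.

```python
from typing import Set, Tuple

def block_similarity(block1: Set[str], block2: Set[str], matched_pairs: Set[Tuple[str, str]]) -> float:
    """Set similarity between two blocks, treating each matched pair as a shared element."""
    shared = sum(1 for e1, e2 in matched_pairs if e1 in block1 and e2 in block2)
    # Assumption: a matched pair is one element of the union, not two.
    union = len(block1) + len(block2) - shared
    return shared / union if union else 0.0
```

With the embodiment's δ_b of 0.2, two blocks of three and two entities that share one matched pair score 1/4 and would therefore be matched.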
S2-5: for each matched block pair in the matched block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set.
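Candidate generation for one matched block pair can be sketched with `itertools.product`. The function name `candidate_pairs` is hypothetical, and entities are assumed to be identified by strings.

```python
from itertools import product
from typing import Set, Tuple

def candidate_pairs(
    block1: Set[str], block2: Set[str], matched_pairs: Set[Tuple[str, str]]
) -> Set[Tuple[str, str]]:
    """Cartesian product of the still-unmatched entities of two matched blocks."""
    matched1 = {e1 for e1, _ in matched_pairs}
    matched2 = {e2 for _, e2 in matched_pairs}
    return set(product(block1 - matched1, block2 - matched2))
```

Because pruning caps block size at δ1, the product stays small: at most δ1 × δ1 pairs per matched block pair.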
S2-6: determine whether no new candidate entity pair has been found; if not, go to S2-7; if so, end the iteration and output the matched entity pair set M_e.
S2-7: compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to the matched entity pair set M_e, and discard the remaining candidate entity pairs.
In this step, the similarity between entities is measured in two ways, string similarity and block similarity, and the two similarities are combined with a weight according to the following formula:
sim(e_i, e_j) = α · sim_string(e_i, e_j) + (1 - α) · sim_block(b_k, b_l)
s.t. e_i ∈ b_k, e_j ∈ b_l
where sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the block similarity, respectively; b_k and b_l denote the blocks containing entities e_i and e_j, respectively; and α is the weight of the string similarity, set to 0.6. For the attribute pairs shared by entities e_i and e_j (for example, name), the string similarity measures the similarity of the attribute values. The method uses several similarity functions, such as those based on the Levenshtein distance, the Jaro-Winkler distance, q-grams, and I-SUB, and combines them by linear weighting. The block similarity expresses the similarity of entities through the similarity of the blocks that contain them. Once the similarity between entities is obtained, the threshold δ_e is used to judge whether the entity pair matches, and newly found matched entity pairs are added to the set of all matched entity pairs.
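The weighted combination can be sketched as below. For brevity the sketch uses only a normalized Levenshtein similarity as the string measure, whereas the method described above combines four measures by linear weighting; the function names and the comparison of a single name attribute are assumptions. The defaults 0.6 and 0.65 are the α and δ_e values of this embodiment.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]

def string_similarity(a: str, b: str) -> float:
    """Stand-in for sim_string: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def entity_similarity(name1: str, name2: str, block_sim: float, alpha: float = 0.6) -> float:
    """sim = alpha * sim_string + (1 - alpha) * sim_block."""
    return alpha * string_similarity(name1, name2) + (1 - alpha) * block_sim

def is_match(name1: str, name2: str, block_sim: float, threshold: float = 0.65) -> bool:
    """Confirm a candidate pair when its similarity exceeds delta_e."""
    return entity_similarity(name1, name2, block_sim) > threshold
```

With these settings, two entities with identical names are confirmed even when their blocks are only weakly similar, since the string term alone contributes 0.6 and any block similarity above 0.125 lifts the total past 0.65.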
S2-8: determine whether the number of iterations is less than the iteration threshold; if not, go to S2-2; if so, end the iteration and output the matched entity pair set M_e.
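The control flow of steps S2-2 to S2-8 can be sketched as a loop that takes the per-round work as callables. The sketch assumes the iteration threshold acts as an upper bound on the number of rounds; `find_candidates` (standing for S2-2 to S2-5) and `confirm` (standing for S2-7) are hypothetical hooks, as is the `seed` parameter used to make the random relation choice repeatable.

```python
import random
from typing import Callable, List, Set, Tuple

Pair = Tuple[str, str]

def iterative_alignment(
    relations: List[str],
    find_candidates: Callable[[str, Set[Pair]], Set[Pair]],  # S2-2 .. S2-5
    confirm: Callable[[Pair], bool],                          # S2-7
    initial_matches: Set[Pair],
    max_rounds: int,   # assumed meaning of the iteration threshold
    seed: int = 0,
) -> Set[Pair]:
    """Grow the matched pair set M_e round by round until no new candidates appear."""
    rng = random.Random(seed)
    matched = set(initial_matches)
    for _ in range(max_rounds):
        relation = rng.choice(relations)  # S2-2: pick a relation at random
        candidates = find_candidates(relation, matched) - matched
        if not candidates:                # S2-6: no new candidate pair, stop
            break
        matched |= {pair for pair in candidates if confirm(pair)}
    return matched
```

Passing the current matched set into `find_candidates` is what lets each round benefit from the pairs confirmed in earlier rounds.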
The specific embodiment above describes the technical solution and beneficial effects of the invention in detail. It should be understood that the foregoing is only the most preferred embodiment of the invention and is not intended to limit the invention; any modification, supplement, or equivalent replacement made within the scope of the principles of the invention shall fall within the protection scope of the invention.

Claims (10)

1. A large-scale heterogeneous knowledge base alignment method based on iterative matching, specifically comprising:
Data preprocessing stage: screen the data in any two original knowledge bases KB1 and KB2, unify the data format, and remove meaningless characters; collect the relation set R1 corresponding to the processed knowledge base KB′1 and the relation set R2 corresponding to the processed knowledge base KB′2; and obtain the initial matched entity pair set by comparison;
Knowledge base alignment stage: partition knowledge bases KB′1 and KB′2 using the relations in relation sets R1 and R2, and prune each block to obtain the pruned block sets B′1 and B′2; then match the blocks in B′1 and B′2 using the initial matched entity pair set to obtain matched block pairs; finally, select candidate entity pairs within the matched block pairs and confirm them using a similarity measure and a threshold.
2. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 1, characterized in that the specific steps of the data preprocessing stage are:
(1-1) input any two original knowledge bases KB1 and KB2, and remove the information in KB1 and KB2 that is irrelevant to the alignment task;
(1-2) unify the data format of the literals L1 in knowledge base KB1 and the literals L2 in knowledge base KB2, expressing dates, numbers, and names in a unified format;
(1-3) remove stop-word characters, symbol characters, and language-tag characters from the literals L1 in KB1 and the literals L2 in KB2 to obtain the processed knowledge bases KB′1 and KB′2;
(1-4) collect the relation set R1 corresponding to knowledge base KB′1 and the relation set R2 corresponding to knowledge base KB′2;
(1-5) compare all entities in KB′1 and KB′2 to obtain the initial matched entity pair set.
3. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 2, characterized in that the detailed process of step (1-4) is:
For knowledge base KB′1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R1; for knowledge base KB′2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R2.
4. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 2, characterized in that in step (1-5) the initial matched entity pair set is obtained as follows:
First, extract all entities in KB′1 to form the entity set E1 and all entities in KB′2 to form the entity set E2; take the Cartesian product of E1 and E2, pairing any entity in E1 with any entity in E2, to form the entity pair set;
Then, select from the entity pair set the entity pairs whose two name attributes have identical string representations, obtaining the preliminary initial matched entity pair set;
Finally, select from the preliminary set the entity pairs that have a one-to-one matching relationship; these form the initial matched entity pair set.
5. The large-scale heterogeneous knowledge base alignment method based on iterative matching as described in claim 1, characterized in that the specific steps of the knowledge base alignment stage are:
(2-1) Input knowledge base KB′1, knowledge base KB′2, relation set R1, relation set R2 and the initial matching entity pair set; set the block similarity threshold δb, the entity similarity threshold δe, the in-block entity count threshold δ1 and the in-block matched-entity ratio threshold δ2; initialize the matching entity pair set Me to the initial matching entity pair set;
(2-2) Randomly select any relation from relation set R1 or relation set R2, and use that relation to partition the entities in knowledge base KB′1 and knowledge base KB′2 into several blocks, obtaining the block set B1 corresponding to KB′1 and the block set B2 corresponding to KB′2;
(2-3) Remove from block set B1 and block set B2 the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs, obtaining the reduced block set B′1 and the reduced block set B′2;
(2-4) Using all matching entity pairs in the matching entity pair set Me, measure the similarity between every block in the reduced block set B′1 and every block in the reduced block set B′2, and match each two blocks whose similarity exceeds the block similarity threshold δb, obtaining the matching block pair set;
(2-5) For each matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block of the pair and the unmatched entities in the other block of the pair as candidate entity pairs, forming the candidate entity pair set;
(2-6) Determine whether no new candidate entity pair has been found; if not, go to step (2-7); if so, terminate the iteration and output the matching entity pair set Me;
(2-7) Compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set; add the candidate entity pairs whose similarity exceeds the entity similarity threshold δe to the matching entity pair set Me, and discard the remaining candidate entity pairs;
(2-8) Determine whether the number of iterations has reached the iteration threshold; if not, return to step (2-2); if so, terminate the iteration and output the matching entity pair set Me.
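Steps (2-5) and (2-7) of the loop above can be sketched as below. The sketch assumes blocks are Python sets of entity ids and that matched block pairs are already available; `candidate_pairs`, `accept_candidates` and the toy similarity function are hypothetical names introduced for illustration.

```python
from itertools import product


def candidate_pairs(block_pairs, matched):
    """Step (2-5): Cartesian product of the unmatched entities of each matched block pair."""
    matched1 = {e1 for e1, _ in matched}
    matched2 = {e2 for _, e2 in matched}
    candidates = set()
    for block1, block2 in block_pairs:
        candidates.update(product(block1 - matched1, block2 - matched2))
    return candidates


def accept_candidates(candidates, matched, similarity, delta_e):
    """Step (2-7): keep candidates whose similarity exceeds delta_e, discard the rest."""
    return matched | {pair for pair in candidates if similarity(*pair) > delta_e}


# Hypothetical toy data: one matched block pair and one seed match.
block_pairs = [({"a1", "a2", "a3"}, {"b1", "b2"})]
matched = {("a1", "b1")}
candidates = candidate_pairs(block_pairs, matched)


def toy_similarity(e1, e2):
    # Stand-in for the entity similarity; only ("a2", "b2") scores high.
    return 0.9 if (e1, e2) == ("a2", "b2") else 0.1


updated = accept_candidates(candidates, matched, toy_similarity, delta_e=0.5)
```

An empty `candidates` set corresponds to the stopping condition of step (2-6).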
6. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-2), the detailed process of using a relation to partition the entities in a knowledge base into several blocks is:
First, for the triple set FR1 in knowledge base KB′1, count the n distinct object entities in FR1;
Then, for each distinct object entity, gather all the subject entities in FR1 that correspond to it into one block, so that the n distinct object entities yield n blocks, which form block set B1;
Block set B2 is obtained by the same method.
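A minimal sketch of this blocking step, assuming triples are (subject, relation, object) tuples and that only triples carrying the relation chosen in step (2-2) are used. The function name `blocks_by_object` and the sample triples are hypothetical.

```python
from collections import defaultdict


def blocks_by_object(triples, relation):
    """Group subject entities into one block per distinct object entity."""
    blocks = defaultdict(set)
    for subject, rel, obj in triples:
        if rel == relation:
            blocks[obj].add(subject)
    return dict(blocks)


# Hypothetical toy triples for one knowledge base.
triples = [
    ("Alice", "bornIn", "Hangzhou"),
    ("Bob", "bornIn", "Hangzhou"),
    ("Carol", "bornIn", "Ningbo"),
    ("Alice", "worksAt", "ZJU"),
]
b1 = blocks_by_object(triples, "bornIn")
```

Here two distinct object entities give two blocks, matching the "n object entities yield n blocks" rule.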
7. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-3), the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs include: blocks whose entity count exceeds the threshold δ1, blocks whose matched-entity ratio is below the threshold δ2, and blocks whose entities are all already matched.
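The three pruning rules can be sketched as a single filter. The sketch assumes blocks are stored as a dict from block key to a set of entity ids, and that the matched-entity ratio is the share of a block's entities that already appear in a matching pair; `prune_blocks` and the sample values are hypothetical.

```python
def prune_blocks(blocks, matched_entities, delta_1, delta_2):
    """Drop blocks that are too large, too sparsely matched, or fully matched."""
    kept = {}
    for key, block in blocks.items():
        if not block:
            continue  # guard against empty blocks (avoids division by zero)
        n_matched = len(block & matched_entities)
        if len(block) > delta_1:
            continue  # too many entities: high computational cost
        if n_matched / len(block) < delta_2:
            continue  # too few matched entities: unlikely to match another block
        if n_matched == len(block):
            continue  # every entity already matched: nothing new to find
        kept[key] = block
    return kept


# Hypothetical toy blocks; "a1" and "a2" are already matched.
blocks = {
    "big": {"a1", "a2", "a3", "a4", "a5"},
    "cold": {"a6", "a7", "a8", "a9"},
    "done": {"a1", "a2"},
    "ok": {"a1", "a6"},
}
reduced = prune_blocks(blocks, {"a1", "a2"}, delta_1=4, delta_2=0.25)
```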
8. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-4), the similarity between blocks is obtained as follows:
Each block is regarded as an entity set, and a matched entity pair is regarded as an element shared by the two sets; the similarity between blocks is then measured with set similarity. The block similarity simblock(bk,bl) is calculated as:
simblock(bk,bl)=|bk∩bl|/|bk∪bl|
Wherein bk and bl denote two blocks, |bk∩bl| denotes the number of matching entity pairs in the two blocks, and |bk∪bl| denotes the total number of entities in the two blocks.
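A small sketch of a set-similarity computation consistent with these definitions. It reads the measure as a Jaccard-style ratio and assumes that the total entity count treats each matched pair as one shared element, i.e. |bk| + |bl| minus the number of matched pairs; that counting rule and the name `block_similarity` are assumptions, not statements from the patent.

```python
def block_similarity(block1, block2, matched):
    """Jaccard-style similarity where matched pairs act as shared elements."""
    # Count matched pairs with one entity in each block.
    shared = sum(1 for e1, e2 in matched if e1 in block1 and e2 in block2)
    # Assumed union size: each matched pair is counted once.
    union = len(block1) + len(block2) - shared
    return shared / union if union else 0.0


# Hypothetical toy blocks: only ("a1", "b1") links them.
block_k = {"a1", "a2", "a3"}
block_l = {"b1", "b2"}
matched = {("a1", "b1"), ("a2", "b9")}
sim = block_similarity(block_k, block_l, matched)  # 1 shared / (3 + 2 - 1)
```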
9. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 8, characterized in that, in step (2-7), the similarity between entities is calculated as:
sim(ei,ej)=α·simstring(ei,ej)+(1-α)·simblock(bk,bl)
s.t. ei∈bk, ej∈bl
Wherein bk and bl denote the blocks containing entities ei and ej respectively, simstring(ei,ej) and simblock(bk,bl) denote the string similarity between the entities and the block similarity respectively, and α is the weight of the string similarity, with value range [0,1].
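As a worked illustration of the weighted combination, the sketch below takes the two component similarities as already computed numbers; the function name `entity_similarity` and the sample values are hypothetical.

```python
def entity_similarity(sim_string, sim_block, alpha):
    """Linear combination of string similarity and block similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_string + (1.0 - alpha) * sim_block


# Hypothetical values: string similarity 0.8, block similarity 0.25, alpha 0.6.
score = entity_similarity(0.8, 0.25, alpha=0.6)  # 0.6*0.8 + 0.4*0.25
```

With α = 1 the score reduces to pure string similarity, and with α = 0 it depends only on the blocks the two entities sit in.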
10. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 9, characterized in that the string similarity is calculated by combining, through linear weighting, similarity functions based on Levenshtein distance, Jaro-Winkler distance, q-grams and I-SUB.
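The linear weighting of several string measures can be sketched as follows. For brevity only two of the four named measures are implemented (a normalized Levenshtein similarity and a q-gram similarity computed as Jaccard overlap of character q-grams); Jaro-Winkler and I-SUB would be plugged in the same way. The normalizations, the weights and all function names are assumptions for illustration.

```python
def levenshtein_similarity(a, b):
    """1 minus the edit distance divided by the longer string's length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (free when characters match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the sets of character q-grams."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def combined_string_similarity(a, b, measures):
    """measures: list of (weight, function); weights are normalized to sum to 1."""
    total = sum(weight for weight, _ in measures)
    return sum(weight * func(a, b) for weight, func in measures) / total


# Hypothetical equal weights over the two implemented measures.
measures = [(0.5, levenshtein_similarity), (0.5, qgram_similarity)]
score = combined_string_similarity("kitten", "sitting", measures)
```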
CN201710237034.6A 2017-04-12 2017-04-12 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching Expired - Fee Related CN107145523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710237034.6A CN107145523B (en) 2017-04-12 2017-04-12 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710237034.6A CN107145523B (en) 2017-04-12 2017-04-12 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching

Publications (2)

Publication Number Publication Date
CN107145523A CN107145523A (en) 2017-09-08
CN107145523B true CN107145523B (en) 2019-10-18

Family

ID=59774786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710237034.6A Expired - Fee Related CN107145523B (en) 2017-04-12 2017-04-12 Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching

Country Status (1)

Country Link
CN (1) CN107145523B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460021B (en) * 2018-03-16 2021-10-12 安徽大学 Method for extracting problem method pairs in thesis title
CN109492114A (en) * 2018-11-16 2019-03-19 南京茂毓通软件科技有限公司 A kind of entity information recognition methods
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping
CN110413704B (en) * 2019-06-27 2022-05-03 浙江大学 Entity alignment method based on weighted neighbor information coding
CN111191045B (en) * 2019-12-30 2023-06-16 创新奇智(上海)科技有限公司 Entity alignment method and system applied to knowledge graph
CN112699667A (en) * 2020-12-29 2021-04-23 京东数字科技控股股份有限公司 Entity similarity determination method, device, equipment and storage medium
CN112784609A (en) * 2021-03-16 2021-05-11 云知声智能科技股份有限公司 Method, apparatus, device and medium for determining whether medical record includes consultation opinions
CN113609304B (en) * 2021-07-20 2023-05-23 广州大学 Entity matching method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679492A (en) * 2013-11-29 2015-06-03 国际商业机器公司 Computer-implemented technical support providing device and method
CN104899242A (en) * 2015-03-10 2015-09-09 四川大学 Mechanical product design two-dimensional knowledge pushing method based on design intent

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9411878B2 (en) * 2014-02-19 2016-08-09 International Business Machines Corporation NLP duration and duration range comparison methodology using similarity weighting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
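Interference Alignment with Cyclic Unidirectional; Mohammad Reza Nakhai et al; 《2012 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC)》; 20121129; pp. 1-5 *
Entity alignment of Chinese heterogeneous encyclopedia knowledge bases; Huang Junfu (黄峻福) et al.; 《计算机应用》 (Journal of Computer Applications); 20160710; pp. 1881-1886, 1898 *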

Also Published As

Publication number Publication date
CN107145523A (en) 2017-09-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191018