CN107145523B

CN107145523B - Large-scale heterogeneous knowledge base alignment method based on iterative matching

Info

Publication number: CN107145523B
Application number: CN201710237034.6A
Authority: CN
Inventors: 陈岭; 顾伟东
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2017-04-12
Filing date: 2017-04-12
Publication date: 2019-10-18
Anticipated expiration: 2037-04-12
Also published as: CN107145523A

Abstract

The invention discloses a large-scale heterogeneous knowledge base alignment method based on iterative matching, implemented as follows: 1) screen the data in the original knowledge bases, unify the data format, and on this basis obtain the relations in the knowledge bases and the initial matching entity pairs; 2) partition the preprocessed knowledge bases using the relations in the knowledge bases, and simplify the blocks; 3) match blocks using the matching entity pairs to obtain matching block pairs; 4) select candidate entity pairs within the matching block pairs, and confirm them using a similarity measure and a threshold; 5) repeat the above steps until no new candidate entity pairs can be found, yielding all matching entity pairs. The invention aligns heterogeneous knowledge bases using the idea of iterative matching and has broad application prospects in fields such as knowledge base alignment, data fusion, and automatic question answering.

Description

Large-scale heterogeneous knowledge base alignment method based on iterative matching

Technical field

The present invention relates to the field of knowledge base alignment, and in particular to a large-scale heterogeneous knowledge base alignment method based on iterative matching.

Background technique

With the arrival of Web 3.0, structured knowledge bases appear on the Internet with increasing frequency. These knowledge bases are widely used in semantic applications such as automatic question answering, search services, and social services. However, the limited information in any single knowledge base restricts the functionality of these applications. Against this background, knowledge base alignment has great room for development. Knowledge base alignment usually refers to entity alignment between knowledge bases, that is, automatically discovering two entities that represent the same real-world thing and linking them.

As knowledge bases keep growing in scale, knowledge base alignment methods usually split the alignment process into two steps: discovering candidate entity pairs and confirming candidate entity pairs. Discovery uses a small number of attributes to quickly screen out a few candidate entities for each entity; confirmation compares the two entities comprehensively and uses a similarity and a threshold to judge whether they match. Because this avoids an exact comparison between every pair of entities, it greatly improves overall efficiency. At present, the bottleneck of knowledge base alignment methods is that the discovered candidate entity pairs are often incomplete, so that entity pairs which could be matched are never found.

To improve the quality of candidate entity pairs, researchers have proposed the idea of iterative matching: each round discovers a small number of matching entity pairs and uses them as the basis for discovering candidate entity pairs in the next round. However, traditional knowledge base alignment methods usually focus on aligning isomorphic knowledge bases, i.e. knowledge bases that share many alignable relations. Their basic assumption is that if a pair of entities matches and the entities have aligned relations, then their "compatible neighbors" match with high probability, so the "compatible neighbors" are taken as candidate entity pairs. When few relations can be aligned between the knowledge bases, these methods miss some candidate entity pairs. To address this, researchers have proposed class-based knowledge base alignment methods, which group instances with the same features into the same class and exclude candidate entities unrelated to the content of the class, thereby confirming candidate entity pairs. However, because such methods obtain candidate entity pairs only through a classical partitioning technique at the initial stage of the model, they also miss many candidate entity pairs when few attributes are aligned between the two knowledge bases.

Summary of the invention

In view of the above, the invention proposes a large-scale heterogeneous knowledge base alignment method based on iterative matching. The method performs knowledge base alignment using the idea of iterative matching: an iterative framework traverses the relations to partition the knowledge bases, which enlarges the search space of candidate entity pairs; meanwhile, candidate entity pairs are selected and confirmed in a divide-and-conquer manner, so that each entity needs to be compared comprehensively with only a few candidate entities, which improves the efficiency of the method.

A large-scale heterogeneous knowledge base alignment method based on iterative matching specifically comprises:

Data preprocessing stage: the data in any two original knowledge bases KB₁ and KB₂ are screened, brought into a uniform data format, and stripped of meaningless characters; the relation set R₁ corresponding to the processed knowledge base KB′₁ and the relation set R₂ corresponding to the processed knowledge base KB′₂ are collected; and the initial matching entity pair set is obtained by comparison.

Knowledge base alignment stage: the knowledge bases KB′₁ and KB′₂ are partitioned using the relations in the relation sets R₁ and R₂, and each block is simplified, yielding the simplified block sets B′₁ and B′₂; then the initial matching entity pair set is used to match the blocks in B′₁ and B′₂, yielding matching block pairs; finally, candidate entity pairs are selected within the matching block pairs and confirmed using a similarity measure and the threshold δ_e.

The specific steps of the data preprocessing stage are as follows:

(1-1) input any two original knowledge bases KB₁ and KB₂, and remove the information in KB₁ and KB₂ that is unrelated to the alignment task;

(1-2) unify the data format of the literals L₁ in KB₁ and the literals L₂ in KB₂, expressing dates, numbers, and names in a uniform format;

(1-3) remove stop-word characters, symbol characters, and language-tag characters from the literals L₁ in KB₁ and the literals L₂ in KB₂, obtaining the processed knowledge bases KB′₁ and KB′₂;

(1-4) collect the relation set R₁ corresponding to KB′₁ and the relation set R₂ corresponding to KB′₂;

(1-5) compare all entities in KB′₁ and KB′₂ to obtain the initial matching entity pair set.

A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E×R×E is the set of entity-relation-entity triples, each stating that a relation whose object is an entity holds; and F_P ⊆ E×P×L is the set of entity-attribute-literal triples, each stating that an attribute whose object is a literal holds. Both F_R and F_P contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and this information degrades the efficiency of the algorithm. In addition, triples containing the "sameAs" relation should also be removed.

The detailed process of step (1-4) is as follows:

For knowledge base KB′₁, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R₁; for knowledge base KB′₂, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R₂. The relation sets R₁ and R₂ are used in the subsequent knowledge base partitioning.
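Step (1-4) can be sketched as a single pass over the relation triples. This is a minimal illustration, not the patented implementation: the tuple layout `(subject, relation, object)`, the function name `relation_set`, and the sample triples are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

# Assumed layout of one element of F_R: (subject entity, relation, object entity).
Triple = Tuple[str, str, str]


def relation_set(relation_triples: Iterable[Triple]) -> Set[str]:
    """Collect the distinct relation names that occur in a triple set F_R."""
    return {relation for _subject, relation, _obj in relation_triples}


# Hypothetical sample data standing in for F_R1.
f_r1 = [
    ("Hangzhou", "locatedIn", "Zhejiang"),
    ("Ningbo", "locatedIn", "Zhejiang"),
    ("Hangzhou", "twinnedWith", "Boston"),
]
r1 = relation_set(f_r1)
```

The same call applied to F_R2 would give R₂.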

In step (1-5), the initial matching entity pair set is obtained as follows:

First, all entities in KB′₁ are extracted to form the entity set E₁, and all entities in KB′₂ are extracted to form the entity set E₂; the Cartesian product of E₁ and E₂, pairing any entity in E₁ with any entity in E₂, forms the entity pair set;

Then, the entity pairs whose two entities have identical string representations of their name attribute are selected from the entity pair set, giving the preliminary initial matching entity pair set;

Finally, the entity pairs with a one-to-one matching relationship are selected from the preliminary set and taken as the initial matching entity pair set.
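The three steps above can be sketched as follows. The sketch assumes each knowledge base is given as a mapping from entity identifier to an already normalized name string, and it indexes entities by name instead of materializing the full Cartesian product; for exact name equality this should yield the same pairs. The function name `initial_matches` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def initial_matches(names1: Dict[str, str],
                    names2: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Return one-to-one entity pairs whose name strings are identical."""
    by_name1: Dict[str, List[str]] = defaultdict(list)
    by_name2: Dict[str, List[str]] = defaultdict(list)
    for entity, name in names1.items():
        by_name1[name].append(entity)
    for entity, name in names2.items():
        by_name2[name].append(entity)

    matches: Set[Tuple[str, str]] = set()
    for name, entities1 in by_name1.items():
        entities2 = by_name2.get(name, [])
        # A name shared by exactly one entity on each side gives a
        # one-to-one pair; ambiguous names are left for later iterations.
        if len(entities1) == 1 and len(entities2) == 1:
            matches.add((entities1[0], entities2[0]))
    return matches


# Hypothetical sample data.
kb1_names = {"a1": "paris", "a2": "springfield", "a3": "berlin"}
kb2_names = {"b1": "paris", "b2": "springfield", "b3": "springfield"}
seed = initial_matches(kb1_names, kb2_names)
```

Here "springfield" is dropped because it matches two entities in the second knowledge base, and "berlin" is dropped because it has no counterpart.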

The specific steps of the knowledge base alignment stage are as follows:

(2-1) input the knowledge bases KB′₁ and KB′₂, the relation sets R₁ and R₂, and the initial matching entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the in-block entity count threshold δ₁, and the in-block matched-entity ratio threshold δ₂; initialize the matching entity pair set M_e to the initial matching entity pair set;

(2-2) randomly select any relation from relation set R₁ or R₂, and use it to divide the entities in KB′₁ and KB′₂ into several blocks, obtaining the block set B₁ corresponding to KB′₁ and the block set B₂ corresponding to KB′₂;

(2-3) remove from block sets B₁ and B₂ the blocks that tend to incur a high computational cost or are unlikely to produce matching entity pairs, obtaining the simplified block sets B′₁ and B′₂;

(2-4) use all matching entity pairs in M_e to measure the similarity between any block in B′₁ and any block in B′₂, and match every two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matching block pair set;

(2-5) for each matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set;

(2-6) judge whether no new candidate entity pair has been found; if not, go to step (2-7); if so, end the iteration and output the matching entity pair set M_e;

(2-7) compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate entity pairs;

(2-8) judge whether the number of iterations is less than the iteration threshold; if not, return to step (2-2); if so, end the iteration and output the matching entity pair set M_e.

In step (2-2), the detailed process of dividing the entities in a knowledge base into several blocks using a relation is as follows:

First, for the triple set F_R1 of knowledge base KB′₁, count the n distinct object entities in F_R1;

Then, for each object entity, group together all subject entities in F_R1 that correspond to it, giving one block; the n object entities give n blocks, which form the block set B₁;

The block set B₂ is obtained in the same way.
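The partitioning can be sketched as grouping subject entities by object entity. The sketch assumes that only triples of the relation selected in step (2-2) are considered, which is one reading of the text; the function name `partition_by_relation` and the sample triples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # assumed layout: (subject, relation, object)


def partition_by_relation(relation_triples: Iterable[Triple],
                          relation: str) -> Dict[str, Set[str]]:
    """Group subject entities into blocks keyed by their object entity."""
    blocks: Dict[str, Set[str]] = defaultdict(set)
    for subject, rel, obj in relation_triples:
        if rel == relation:
            # Every distinct object entity defines one block.
            blocks[obj].add(subject)
    return dict(blocks)


# Hypothetical sample data standing in for F_R1.
f_r1 = [
    ("Hangzhou", "locatedIn", "Zhejiang"),
    ("Ningbo", "locatedIn", "Zhejiang"),
    ("Suzhou", "locatedIn", "Jiangsu"),
    ("Hangzhou", "twinnedWith", "Boston"),
]
b1 = partition_by_relation(f_r1, "locatedIn")
```

With this data the relation "locatedIn" has two distinct object entities, so two blocks are produced.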

In step (2-3), the blocks that tend to incur a high computational cost or are unlikely to produce matching entity pairs are: blocks whose number of entities exceeds the threshold δ₁, blocks whose ratio of matched entities is below the threshold δ₂, and blocks whose entities are all already matched.
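The three pruning criteria can be sketched as one filter over a block set. The parameter names `max_entities` (for δ₁) and `min_matched_ratio` (for δ₂), the function name, and the sample blocks are assumptions; the values 50 and 0.3 used in the call are the ones the embodiment reports for δ₁ and δ₂.

```python
from typing import Dict, Set


def simplify_blocks(blocks: Dict[str, Set[str]],
                    matched: Set[str],
                    max_entities: int,
                    min_matched_ratio: float) -> Dict[str, Set[str]]:
    """Drop blocks that are too large, too sparsely matched, or fully matched."""
    kept: Dict[str, Set[str]] = {}
    for key, entities in blocks.items():
        if not entities:
            continue  # defensive: an empty block cannot yield candidates
        n_matched = len(entities & matched)
        too_large = len(entities) > max_entities
        too_few_matches = n_matched / len(entities) < min_matched_ratio
        fully_matched = n_matched == len(entities)
        if not (too_large or too_few_matches or fully_matched):
            kept[key] = entities
    return kept


# Hypothetical sample data: "a" and "d" are already matched entities.
blocks = {
    "x": {"a", "b", "c"},                 # ratio 1/3, kept
    "y": {"a"},                           # fully matched, dropped
    "z": {"d", "e", "f", "g"},            # ratio 1/4 < 0.3, dropped
    "w": {f"e{i}" for i in range(60)},    # more than 50 entities, dropped
}
kept = simplify_blocks(blocks, matched={"a", "d"},
                       max_entities=50, min_matched_ratio=0.3)
```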

In step (2-4), the similarity between blocks is obtained as follows:

Each block is regarded as a set of entities, and a matched entity pair is regarded as an element shared by the two sets; set similarity is then used to measure the similarity between blocks. The similarity sim_block(b_k,b_l) is computed as:

sim_block(b_k,b_l)=|b_k∩b_l|/|b_k∪b_l|

where b_k and b_l denote the two blocks, |b_k∩b_l| denotes the number of matching entity pairs between the two blocks, and |b_k∪b_l| denotes the total number of entities in the two blocks.
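A minimal sketch of this block similarity follows, assuming the set similarity is the ratio of the two quantities just defined and that the matching entity pairs are one-to-one, so that each matched pair counts once in the union. The function name and the sample data are hypothetical.

```python
from typing import Set, Tuple


def block_similarity(block1: Set[str],
                     block2: Set[str],
                     matches: Set[Tuple[str, str]]) -> float:
    """Set similarity of two blocks, treating matched pairs as shared elements."""
    # |b_k ∩ b_l|: matched pairs with one entity in each block.
    shared = sum(1 for e1, e2 in matches if e1 in block1 and e2 in block2)
    # |b_k ∪ b_l|: all entities, counting each matched pair only once.
    union = len(block1) + len(block2) - shared
    return shared / union if union else 0.0


# Hypothetical sample data: only ("a1", "b1") links the two blocks.
sim = block_similarity({"a1", "a2", "a3"}, {"b1", "b2"},
                       {("a1", "b1"), ("a2", "b9")})
```

One shared element over 3 + 2 - 1 = 4 distinct elements gives a similarity of 0.25.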

In step (2-7), the similarity between entities is computed as:

sim(e_i,e_j)=α sim_string(e_i,e_j)+(1-α)sim_block(b_k,b_l)

s.t. e_i∈b_k, e_j∈b_l

where b_k and b_l denote the blocks containing entities e_i and e_j, respectively; sim_string(e_i,e_j) and sim_block(b_k,b_l) denote the string similarity between the entities and the block similarity, respectively; and α is the weight of the string similarity, with value range [0,1].

Preferably, similarity functions based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB are used, and these similarity functions are combined by linear weighting to compute the string similarity.
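The weighted combination can be sketched as below. As a simplification, only a normalized Levenshtein similarity stands in for sim_string; the source combines four string measures by linear weighting but does not give their weights, so that combination is not reproduced here. All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to [0, 1]; a stand-in for sim_string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def entity_similarity(name_i: str, name_j: str,
                      sim_block: float, alpha: float) -> float:
    """alpha * sim_string + (1 - alpha) * sim_block, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * string_similarity(name_i, name_j) + (1 - alpha) * sim_block


# Hypothetical values: one substitution in four characters, block similarity 0.5.
score = entity_similarity("abcd", "abcf", sim_block=0.5, alpha=0.6)
```

With these inputs the string similarity is 0.75, so the combined score is 0.6 × 0.75 + 0.4 × 0.5 = 0.65.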

The invention aligns heterogeneous knowledge bases using the idea of iterative matching: an iterative framework traverses the relations to partition the knowledge bases, which enlarges the search space of candidate entity pairs; meanwhile, candidate entity pairs are selected and confirmed in a divide-and-conquer manner, so that each entity needs to be compared comprehensively with only a few candidate entities, which improves the efficiency of the method. Compared with existing methods, its advantages are:

(1) Knowledge base alignment is treated as an iterative process. Across iterations, each relation is traversed to partition the knowledge bases, and matched blocks are used to select candidate entity pairs, so the alignment method does not depend on alignable relations and attributes between the knowledge bases.

(2) Only a small number of matching entity pairs are found in each iteration, and these pairs are then used to select candidate entity pairs. Because candidate selection draws on the information of progressively more matching entity pairs, the quality of the candidate entity pairs improves.

Brief description of the drawings

Fig. 1 is an overall flow diagram of the large-scale heterogeneous knowledge base alignment method based on iterative matching according to the invention;

Fig. 2 is a flowchart of the data preprocessing stage of the method;

Fig. 3 is a flowchart of the knowledge base alignment stage of the method.

Specific embodiment

To describe the invention more specifically, the technical solution of the invention is described in detail below with reference to the drawings and a specific embodiment.

As shown in Fig. 1, the large-scale heterogeneous knowledge base alignment method based on iterative matching consists of two parts: data preprocessing and knowledge base alignment. Data preprocessing: screen the data in the original knowledge bases, unify the data format, and obtain the relations in the knowledge bases and the initial matching entity pairs. Knowledge base alignment: first partition the preprocessed knowledge bases using the relations in the knowledge bases and simplify the blocks; then match blocks using the matching entity pairs to obtain matching block pairs; next select candidate entity pairs within the matching block pairs and confirm them using a similarity measure and a threshold; finally repeat these steps until no new candidate entity pairs can be found, at which point all matching entity pairs have been obtained.

Fig. 2 shows the flowchart of the data preprocessing stage. According to Fig. 2, this stage comprises the following steps:

S1-1: input any two original knowledge bases KB₁ and KB₂, and remove the information in KB₁ and KB₂ that is unrelated to the alignment task.

A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E×R×E is the set of entity-relation-entity triples, each stating that a relation whose object is an entity holds; and F_P ⊆ E×P×L is the set of entity-attribute-literal triples, each stating that an attribute whose object is a literal holds. Both F_R and F_P contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and this information degrades the efficiency of the algorithm. In addition, triples containing the "sameAs" relation should also be removed.

S1-2: unify the data format of the literals L₁ in KB₁ and the literals L₂ in KB₂, expressing dates, numbers, and names in a uniform format.

Literals such as names, dates, and numbers may be expressed differently in different knowledge bases, for example "2016-01-01" and "01.01.2016". Unifying them facilitates the subsequent comparison. In addition, the method converts all literals to lower case.
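A minimal sketch of this unification, covering only the two date layouts quoted above plus lower-casing. It assumes that "01.01.2016" is day.month.year and that ISO `YYYY-MM-DD` is the target format; the source specifies neither, and the function name is hypothetical.

```python
from datetime import datetime

# Assumed input layouts; a real knowledge base will need more of them.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")


def normalize_literal(value: str) -> str:
    """Rewrite recognized dates as YYYY-MM-DD; lower-case everything else."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # not a date in this layout, try the next one
    return text.lower()


same_day = normalize_literal("01.01.2016") == normalize_literal("2016-01-01")
```

After normalization the two spellings of the same date compare equal, which is what the later string comparison relies on.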

S1-3: remove meaningless characters such as stop-word characters, symbol characters, and language tags from the literals L₁ in KB₁ and the literals L₂ in KB₂, obtaining the processed knowledge bases KB′₁ and KB′₂.

The attribute descriptions of entities in a knowledge base may contain meaningless characters, for example stop words such as "the", "a", and "an", symbols such as "#", "!", and "*", and language tags such as "@en". These characters distort the similarity measurement of entity pairs and are therefore removed.
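The cleaning step can be sketched with a small stop-word list and two regular expressions. The lists contain only the examples named in the text; the pattern used for language tags (a trailing "@" followed by a short language code) and the function name are assumptions.

```python
import re

STOP_WORDS = {"the", "a", "an"}   # only the examples named in the text
SYMBOLS = re.compile(r"[#!*]")    # only the examples named in the text
# Assumed shape of a language tag: trailing "@en", "@zh-cn", and similar.
LANG_TAG = re.compile(r"@[a-z]{2,3}(-[a-z0-9]+)*$", re.IGNORECASE)


def clean_literal(value: str) -> str:
    """Strip a trailing language tag, listed symbols, and stop words."""
    text = LANG_TAG.sub("", value.strip())
    text = SYMBOLS.sub(" ", text)
    tokens = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_literal("The Great Wall@en")
```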

S1-4: collect the relation set R₁ corresponding to KB′₁ and the relation set R₂ corresponding to KB′₂.

In this step, for knowledge base KB′₁, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R₁; for knowledge base KB′₂, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R₂. The relation sets R₁ and R₂ are used in the subsequent knowledge base partitioning.

S1-5: compare all entities in KB′₁ and KB′₂ to obtain the initial matching entity pair set.

In this step, the initial matching entity pair set is obtained as described for step (1-5): the entity sets E₁ and E₂ are extracted from KB′₁ and KB′₂ and their Cartesian product forms the entity pair set; the pairs whose name attributes have identical string representations are selected; and of these, the pairs with a one-to-one matching relationship are taken as the initial matching entity pair set.

Fig. 3 shows the flowchart of the knowledge base alignment stage. According to Fig. 3, this stage comprises the following steps:

S2-1: input the knowledge bases KB′₁ and KB′₂, the relation sets R₁ and R₂, and the initial matching entity pair set; set the block similarity threshold δ_b to 0.2, the entity similarity threshold δ_e to 0.65, the in-block entity count threshold δ₁ to 50, and the in-block matched-entity ratio threshold δ₂ to 0.3; initialize the matching entity pair set M_e to the initial matching entity pair set.

S2-2: randomly select any relation from relation set R₁ or R₂, and use it to divide the entities in KB′₁ and KB′₂ into several blocks, obtaining the block set B₁ corresponding to KB′₁ and the block set B₂ corresponding to KB′₂.

In this step, the detailed process of dividing the entities in a knowledge base into several blocks using a relation is as follows:

First, for the triple set F_R1 of knowledge base KB′₁, count the n distinct object entities in F_R1;

Then, for each object entity, group together all subject entities in F_R1 that correspond to it, giving one block; the n object entities give n blocks, which form the block set B₁.

The block set B₂ is obtained in the same way, that is:

First, for the triple set F_R2 of knowledge base KB′₂, count the n distinct object entities in F_R2;

Then, for each object entity, group together all subject entities in F_R2 that correspond to it, giving one block; the n object entities give n blocks, which form the block set B₂.

S2-3: remove from block sets B₁ and B₂ the blocks that tend to incur a high computational cost or are unlikely to produce matching entity pairs, obtaining the simplified block sets B′₁ and B′₂.

In this step, the blocks that tend to incur a high computational cost or are unlikely to produce matching entity pairs are: blocks whose number of entities exceeds the threshold δ₁, blocks whose ratio of matched entities is below the threshold δ₂, and blocks whose entities are all already matched.

S2-4: use all matching entity pairs in M_e to measure the similarity between any block in B′₁ and any block in B′₂, and match every two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matching block pair set.

S2-5: for each matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set.
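Candidate generation in S2-5 can be sketched as a Cartesian product restricted to entities that do not yet appear in any matching pair. The function name and the sample data are hypothetical.

```python
from itertools import product
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def candidate_pairs(block_pairs: Iterable[Tuple[Set[str], Set[str]]],
                    matches: Set[Pair]) -> Set[Pair]:
    """Pair up the still-unmatched entities of every matching block pair."""
    matched1 = {e1 for e1, _ in matches}
    matched2 = {e2 for _, e2 in matches}
    candidates: Set[Pair] = set()
    for block1, block2 in block_pairs:
        # Entities already in M_e are excluded before taking the product.
        candidates.update(product(block1 - matched1, block2 - matched2))
    return candidates


# Hypothetical sample data: one matching block pair, one known match.
cands = candidate_pairs([({"a1", "a2", "a3"}, {"b1", "b2"})], {("a1", "b1")})
```

Because each product is confined to one matching block pair, an entity is compared with only the few unmatched entities of the paired block rather than with the whole other knowledge base.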

S2-6: judge whether no new candidate entity pair has been found; if not, go to S2-7; if so, end the iteration and output the matching entity pair set M_e.

S2-7: compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate entity pairs.

In this step, the similarity between entities is measured in two ways, string similarity and block similarity, and the two similarities are combined with a weight, as follows:

sim(e_i,e_j)=α sim_string(e_i,e_j)+(1-α)sim_block(b_k,b_l)

s.t. e_i∈b_k, e_j∈b_l

where sim_string(e_i,e_j) and sim_block(b_k,b_l) denote the string similarity between the entities and the block similarity, respectively; b_k and b_l denote the blocks containing entities e_i and e_j, respectively; and α is the weight of the string similarity, set to 0.6. For the attribute pairs shared by entities e_i and e_j (for example, name), the string similarity measures the similarity of the attribute values. The method uses several similarity functions, namely those based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB, and combines them by linear weighting. The block similarity expresses the similarity of two entities through the similarity between the blocks that contain them. Once the similarity between entities has been obtained, the threshold δ_e is used to judge whether the entity pair matches, and newly found matching entity pairs are added to the set of all matching entity pairs.
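As a worked illustration of the formula with the embodiment's settings α = 0.6 and δ_e = 0.65, consider two hypothetical candidate pairs with the same string similarity but different block similarities. The values sim_string = 0.8, sim_block = 0.5, and sim_block = 0.25 are assumed purely for illustration and do not come from the source.

```latex
\begin{aligned}
\mathrm{sim}(e_i,e_j) &= \alpha\,\mathrm{sim}_{string}(e_i,e_j) + (1-\alpha)\,\mathrm{sim}_{block}(b_k,b_l) \\
\text{pair 1:}\quad & 0.6 \times 0.8 + 0.4 \times 0.5 = 0.48 + 0.20 = 0.68 > \delta_e = 0.65 \quad\text{(added to } M_e\text{)} \\
\text{pair 2:}\quad & 0.6 \times 0.8 + 0.4 \times 0.25 = 0.48 + 0.10 = 0.58 < \delta_e = 0.65 \quad\text{(discarded)}
\end{aligned}
```

The two pairs differ only in how well their blocks already overlap, which shows how previously found matches influence later confirmations.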

S2-8: judge whether the number of iterations is less than the iteration threshold; if not, return to S2-2; if so, end the iteration and output the matching entity pair set M_e.

The specific embodiment described above explains the technical solution and beneficial effects of the invention in detail. It should be understood that the above is only the most preferred embodiment of the invention and is not intended to limit the invention; any modification, supplement, or equivalent replacement made within the scope of the principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A large-scale heterogeneous knowledge base alignment method based on iterative matching, specifically comprising:

a data preprocessing stage: the data in any two original knowledge bases KB₁ and KB₂ are screened, brought into a uniform data format, and stripped of meaningless characters; the relation set R₁ corresponding to the processed knowledge base KB′₁ and the relation set R₂ corresponding to the processed knowledge base KB′₂ are collected; and the initial matching entity pair set is obtained by comparison;

a knowledge base alignment stage: the knowledge bases KB′₁ and KB′₂ are partitioned using the relations in the relation sets R₁ and R₂, and each block is simplified, yielding the simplified block sets B′₁ and B′₂; then the initial matching entity pair set is used to match the blocks in B′₁ and B′₂, yielding matching block pairs; finally, candidate entity pairs are selected within the matching block pairs and confirmed using a similarity measure and a threshold.

2. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 1, characterized in that the specific steps of the data preprocessing stage are as follows:

(1-1) input any two original knowledge bases KB₁ and KB₂, and remove the information in KB₁ and KB₂ that is unrelated to the alignment task;

(1-2) unify the data format of the literals L₁ in KB₁ and the literals L₂ in KB₂, expressing dates, numbers, and names in a uniform format;

(1-3) remove stop-word characters, symbol characters, and language-tag characters from the literals L₁ in KB₁ and the literals L₂ in KB₂, obtaining the processed knowledge bases KB′₁ and KB′₂;

(1-4) collect the relation set R₁ corresponding to KB′₁ and the relation set R₂ corresponding to KB′₂;

(1-5) compare all entities in KB′₁ and KB′₂ to obtain the initial matching entity pair set.

3. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 2, characterized in that the detailed process of step (1-4) is as follows:

for knowledge base KB′₁, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R₁; for knowledge base KB′₂, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R₂.

4. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 2, characterized in that in step (1-5) the initial matching entity pair set is obtained as follows:

first, all entities in KB′₁ are extracted to form the entity set E₁, and all entities in KB′₂ are extracted to form the entity set E₂; the Cartesian product of E₁ and E₂, pairing any entity in E₁ with any entity in E₂, forms the entity pair set;

then, the entity pairs whose two entities have identical string representations of their name attribute are selected from the entity pair set, giving the preliminary initial matching entity pair set;

finally, the entity pairs with a one-to-one matching relationship are selected from the preliminary set and taken as the initial matching entity pair set.

5. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 1, characterized in that the specific steps of the knowledge base alignment stage are as follows:

(2-1) input the knowledge bases KB′₁ and KB′₂, the relation sets R₁ and R₂, and the initial matching entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the in-block entity count threshold δ₁, and the in-block matched-entity ratio threshold δ₂; initialize the matching entity pair set M_e to the initial matching entity pair set;

(2-2) randomly select any relation from relation set R₁ or R₂, and use it to divide the entities in KB′₁ and KB′₂ into several blocks, obtaining the block set B₁ corresponding to KB′₁ and the block set B₂ corresponding to KB′₂;

(2-3) remove from block sets B₁ and B₂ the blocks that tend to incur a high computational cost or are unlikely to produce matching entity pairs, obtaining the simplified block sets B′₁ and B′₂;

(2-4) use all matching entity pairs in M_e to measure the similarity between any block in B′₁ and any block in B′₂, and match every two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matching block pair set;

(2-5) for each matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set;

(2-6) judge whether no new candidate entity pair has been found; if not, go to step (2-7); if so, end the iteration and output the matching entity pair set M_e;

(2-7) compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate entity pairs;

(2-8) judge whether the number of iterations is less than the iteration threshold; if not, return to step (2-2); if so, end the iteration and output the matching entity pair set M_e.

6. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 5, characterized in that in step (2-2) the detailed process of dividing the entities in a knowledge base into several blocks using a relation is as follows:

first, for the triple set F_R1 of knowledge base KB′₁, count the n distinct object entities in F_R1;

then, for each object entity, group together all subject entities in F_R1 that correspond to it, giving one block; the n object entities give n blocks, which form the block set B₁;

the block set B₂ is obtained in the same way.

7. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 5, characterized in that in step (2-3) the blocks that tend to incur a high computational cost or are unlikely to produce matching entity pairs are: blocks whose number of entities exceeds the threshold δ₁, blocks whose ratio of matched entities is below the threshold δ₂, and blocks whose entities are all already matched.

8. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 5, characterized in that in step (2-4) the similarity between blocks is obtained as follows:

each block is regarded as a set of entities, and a matched entity pair is regarded as an element shared by the two sets; set similarity is then used to measure the similarity between blocks, the similarity sim_block(b_k,b_l) being computed as:

sim_block(b_k,b_l)=|b_k∩b_l|/|b_k∪b_l|

where b_k and b_l denote the two blocks, |b_k∩b_l| denotes the number of matching entity pairs between the two blocks, and |b_k∪b_l| denotes the total number of entities in the two blocks.

9. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 8, characterized in that in step (2-7) the similarity between entities is computed as:

sim(e_i,e_j)=α sim_string(e_i,e_j)+(1-α)sim_block(b_k,b_l)

s.t. e_i∈b_k, e_j∈b_l

where b_k and b_l denote the blocks containing entities e_i and e_j, respectively; sim_string(e_i,e_j) and sim_block(b_k,b_l) denote the string similarity between the entities and the block similarity, respectively; and α is the weight of the string similarity, with value range [0,1].

10. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 9, characterized in that similarity functions based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB are used, and these similarity functions are combined by linear weighting to compute the string similarity.