CN107145523B - Large-scale heterogeneous knowledge base alignment method based on iterative matching - Google Patents
- Publication number
- CN107145523B (application CN201710237034.6A)
- Authority
- CN
- China
- Prior art keywords
- entity
- block
- matching
- knowledge base
- collection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
Abstract
The invention discloses a large-scale heterogeneous knowledge base alignment method based on iterative matching, implemented as follows: 1) screen the data in the original knowledge bases and unify the data format, then obtain the relations in each knowledge base and the initial matching entity pairs; 2) partition the preprocessed knowledge bases using their relations, and prune the resulting blocks; 3) use the matching entity pairs to match blocks, obtaining matching block pairs; 4) select candidate entity pairs within the matching block pairs, and confirm them using a similarity measure and a threshold; 5) repeat the above steps until no new candidate entity pairs can be found, yielding all matching entity pairs. The invention applies the idea of iterative matching to align heterogeneous knowledge bases and has broad application prospects in fields such as knowledge base alignment, data fusion, and automatic question answering.
Description
Technical field
The present invention relates to the field of knowledge base alignment, and in particular to a large-scale heterogeneous knowledge base alignment method based on iterative matching.
Background art
With the arrival of Web 3.0, structured knowledge bases appear on the Internet with increasing frequency. They are widely used in semantic applications such as automatic question answering, search services, and social services. However, the information in any single knowledge base is limited, which restricts what these applications can do. In this context, knowledge base alignment has considerable room for development. Knowledge base alignment usually refers to entity alignment between knowledge bases: automatically discovering pairs of entities that represent the same real-world thing and linking them.
As knowledge bases keep growing, alignment methods usually split the alignment process into two steps: discovering candidate entity pairs and confirming candidate entity pairs. Discovery uses a small number of attributes to quickly screen out a few candidate entities for each entity. Confirmation compares the two entities of a pair comprehensively and uses a similarity score and a threshold to decide whether they match. Because this avoids an exact comparison between every pair of entities, it greatly improves overall efficiency. At present, the bottleneck of knowledge base alignment methods is that the discovery step often misses candidate entity pairs, so entity pairs that could be matched are never found.
To improve the quality of candidate entity pairs, researchers have proposed iterative matching: each round discovers a small number of matching entity pairs, which then serve as the basis for finding candidate entity pairs in the next round. However, traditional knowledge base alignment methods usually focus on aligning isomorphic knowledge bases, that is, knowledge bases that share many alignable relations. Their basic assumption is that if a pair of entities matches and the two entities have aligned relations, then their "compatible neighbors" are likely to match, so the compatible neighbors are taken as candidate entity pairs. When the knowledge bases share few alignable relations, these methods miss some candidate entity pairs. To address this, researchers have proposed class-based knowledge base alignment methods, which group instances with the same features into one class and exclude candidate entities unrelated to the content of the class, confirming candidate entity pairs in this way. However, because such methods obtain candidate entity pairs only in the initial stage of the model, using a classical partitioning technique, they also miss many candidate entity pairs when the two knowledge bases have few aligned attributes.
Summary of the invention
In view of the above, the invention proposes a large-scale heterogeneous knowledge base alignment method based on iterative matching. The method applies iterative matching to knowledge base alignment: within an iterative framework it traverses the relations and uses them to partition the knowledge bases, which enlarges the search space of candidate entity pairs. It also uses a divide-and-conquer approach to select and confirm candidate entity pairs, so that each entity needs a comprehensive comparison with only a few candidate entities, which improves efficiency.
A large-scale heterogeneous knowledge base alignment method based on iterative matching specifically comprises:
Data preprocessing stage: screen the data in any two original knowledge bases KB_1 and KB_2, unify the data format, and remove meaningless characters; collect the relation set R_1 of the processed knowledge base KB'_1 and the relation set R_2 of the processed knowledge base KB'_2; and compare entities to obtain the initial matching entity pair set.
Knowledge base alignment stage: use the relations in R_1 and R_2 to partition KB'_1 and KB'_2, and prune each block to obtain the pruned block sets B'_1 and B'_2; then use the initial matching entity pair set to match the blocks in B'_1 and B'_2, obtaining matching block pairs; finally, select candidate entity pairs within the matching block pairs and confirm them using a similarity measure and the threshold δ_e.
The specific steps of the data preprocessing stage are:
(1-1) input any two original knowledge bases KB_1 and KB_2, and remove from them the information unrelated to the alignment task;
(1-2) unify the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and person names in a uniform format;
(1-3) remove stop words, symbol characters, and language tags from the literals L_1 in KB_1 and L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2;
(1-4) collect the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2;
(1-5) compare all entities in KB'_1 and KB'_2 to obtain the initial matching entity pair set.
A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E × R × E is the set of entity-relation-entity triples, representing relation facts whose object is an entity; and F_P ⊆ E × P × L is the set of entity-attribute-literal triples, representing attribute facts whose object is a literal. Both F_R and F_P may contain meaningless information. For example, some knowledge bases include the original text corpus from which the triples were extracted, and such information reduces the efficiency of the algorithm. Triples containing the "sameAs" relation should also be removed.
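As an illustration of the six-tuple, the sketch below stores only the two fact sets and derives E, L, R, and P from them. The class name `KnowledgeBase`, the function `drop_same_as`, and the default relation name `"sameAs"` are assumptions made for this example, not names taken from the method.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical container for (E, L, R, P, F_R, F_P)."""
    relation_facts: set = field(default_factory=set)   # F_R: (entity, relation, entity)
    attribute_facts: set = field(default_factory=set)  # F_P: (entity, attribute, literal)

    @property
    def entities(self):      # E: every subject, plus every object of a relation fact
        subjects = {s for s, _, _ in self.relation_facts | self.attribute_facts}
        return subjects | {o for _, _, o in self.relation_facts}

    @property
    def literals(self):      # L
        return {lit for _, _, lit in self.attribute_facts}

    @property
    def relations(self):     # R
        return {r for _, r, _ in self.relation_facts}

    @property
    def attributes(self):    # P
        return {p for _, p, _ in self.attribute_facts}


def drop_same_as(kb, same_as="sameAs"):
    """Return a copy of kb without triples that use the sameAs relation."""
    kept = {t for t in kb.relation_facts if t[1] != same_as}
    return KnowledgeBase(kept, set(kb.attribute_facts))
```

Deriving the four sets from the facts keeps them consistent after triples are removed; a real knowledge base would more likely be held in a triple store.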
The detailed process of step (1-4) is:
For KB'_1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R_1; for KB'_2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R_2. R_1 and R_2 are used in the subsequent knowledge base partitioning.
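Step (1-4) amounts to one pass over the relation triples. A minimal sketch, assuming triples are `(subject, relation, object)` tuples; the function name `collect_relations` is hypothetical, and returning counts rather than a plain set is a choice made here for convenience.

```python
from collections import Counter


def collect_relations(relation_triples):
    """Count how often each relation occurs in (subject, relation, object) triples."""
    return Counter(rel for _subj, rel, _obj in relation_triples)
```

The relation set itself is `set(collect_relations(triples))`; the counts can help decide which relations are worth partitioning on.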
In step (1-5), the initial matching entity pair set is obtained as follows:
First, extract all entities in KB'_1 to form the entity set E_1 and all entities in KB'_2 to form the entity set E_2; take the Cartesian product of E_1 and E_2 as the set of entity pairs;
Then, keep the entity pairs whose two name attributes have identical string representations, obtaining the preliminary initial matching entity pair set;
Finally, keep from the preliminary set only the entity pairs that stand in a one-to-one matching relationship, obtaining the initial matching entity pair set.
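The three steps above can be sketched as follows. Instead of enumerating the full Cartesian product, this version indexes entities by name, which should give the same result at lower cost. It reads "one-to-one" as: the name is carried by exactly one entity on each side. That reading, the function name `initial_matches`, and the input format (a dict from entity id to normalized name) are assumptions.

```python
from collections import defaultdict


def initial_matches(names1, names2):
    """Return the set of (entity1, entity2) pairs whose names are equal and unique."""
    by_name1 = defaultdict(list)
    by_name2 = defaultdict(list)
    for entity, name in names1.items():
        by_name1[name].append(entity)
    for entity, name in names2.items():
        by_name2[name].append(entity)

    pairs = set()
    for name, entities1 in by_name1.items():
        entities2 = by_name2.get(name, [])
        # Keep the pair only if the name identifies one entity on each side.
        if len(entities1) == 1 and len(entities2) == 1:
            pairs.add((entities1[0], entities2[0]))
    return pairs
```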
The specific steps of the knowledge base alignment stage are:
(2-1) input KB'_1, KB'_2, the relation sets R_1 and R_2, and the initial matching entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the threshold δ_1 on the number of entities in a block, and the threshold δ_2 on the ratio of matched entities in a block; initialize the matching entity pair set M_e to the initial matching entity pair set;
(2-2) randomly select a relation from R_1 or R_2, and use it to divide the entities of KB'_1 and KB'_2 into blocks, obtaining the block set B_1 for KB'_1 and the block set B_2 for KB'_2;
(2-3) remove from B_1 and B_2 the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs, obtaining the pruned block sets B'_1 and B'_2;
(2-4) use all matching entity pairs in M_e to measure the similarity between every block in B'_1 and every block in B'_2, and match each two blocks whose similarity exceeds δ_b, obtaining the matching block pair set;
(2-5) for each matching block pair, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set;
(2-6) check whether new candidate entity pairs were found; if so, go to step (2-7); otherwise end the iteration and output M_e;
(2-7) compute the similarity between the two entities of each candidate entity pair, add the pairs whose similarity exceeds δ_e to M_e, and discard the remaining candidate pairs;
(2-8) check the iteration count against the iteration threshold: if the threshold has not been reached, return to step (2-2); otherwise end the iteration and output M_e.
In step (2-2), the detailed process of dividing the entities in a knowledge base into blocks using a relation is:
First, for the triple set F_R1 of KB'_1, count the n distinct object entities in F_R1;
Then, for each object entity, group together all subject entities in F_R1 that correspond to it, giving one block; the n object entities give n blocks, which form the block set B_1;
The block set B_2 is obtained in the same way.
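A minimal sketch of this partitioning, assuming that only the triples of the selected relation are considered and that triples are `(subject, relation, object)` tuples; the function name `partition_by_relation` is hypothetical.

```python
from collections import defaultdict


def partition_by_relation(relation_triples, relation):
    """Group the subjects of `relation` by object: one block per distinct object."""
    blocks = defaultdict(set)
    for subj, rel, obj in relation_triples:
        if rel == relation:
            blocks[obj].add(subj)
    return dict(blocks)
```

For example, partitioning on a hypothetical `bornIn` relation puts all people born in the same city into one block. An entity can appear in several blocks if it has several objects for the same relation.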
In step (2-3), the blocks likely to incur a high computational cost or unlikely to yield matching entity pairs are: blocks whose number of entities exceeds δ_1, blocks whose ratio of matched entities is below δ_2, and blocks whose entities are all already matched.
In step (2-4), the similarity between blocks is obtained as follows:
Each block is treated as a set of entities, and a pair of matched entities is treated as an element common to the two sets. The similarity between blocks is then measured by set similarity, and sim_block(b_k, b_l) is computed as:
$$\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cup b_l|}$$
where b_k and b_l denote two blocks, |b_k ∩ b_l| is the number of matching entity pairs across the two blocks, and |b_k ∪ b_l| is the total number of entities in the two blocks.
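Reading the set similarity as a Jaccard ratio, the block similarity can be sketched as below. Two points are assumptions of this sketch: the matching pairs are one-to-one, and the size of the union is taken as |b_k| + |b_l| minus the number of shared pairs, so that each matched pair is counted once. The function name `block_similarity` is hypothetical.

```python
def block_similarity(block_k, block_l, matched_pairs):
    """Jaccard-style similarity in which a matched pair counts as one shared element."""
    shared = sum(1 for e1, e2 in matched_pairs if e1 in block_k and e2 in block_l)
    union = len(block_k) + len(block_l) - shared
    return shared / union if union else 0.0
```

With blocks of three and two entities and one matched pair across them, the similarity is 1 / (3 + 2 - 1) = 0.25.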
In step (2-7), the similarity between entities is computed as:
$$\mathrm{sim}(e_i, e_j) = \alpha\,\mathrm{sim}_{string}(e_i, e_j) + (1-\alpha)\,\mathrm{sim}_{block}(b_k, b_l), \quad \text{s.t. } e_i \in b_k,\ e_j \in b_l$$
where b_k and b_l are the blocks containing e_i and e_j, sim_string(e_i, e_j) and sim_block(b_k, b_l) are the string similarity and the block similarity between the entities, and α is the weight of the string similarity, with values in [0, 1].
Preferably, the string similarity is computed by combining, through linear weighting, similarity functions based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB.
The invention applies iterative matching to heterogeneous knowledge base alignment: within an iterative framework it traverses the relations and uses them to partition the knowledge bases, enlarging the search space of candidate entity pairs. It also uses a divide-and-conquer approach to select and confirm candidate entity pairs, so that each entity needs a comprehensive comparison with only a few candidate entities, which improves efficiency. Compared with existing methods, its advantages are:
(1) Knowledge base alignment is treated as an iterative process. Different iterations traverse different relations to partition the knowledge bases, and candidate entity pairs are selected from matched blocks, so the method does not depend on alignable relations and attributes between the knowledge bases.
(2) Each iteration discovers only a small number of matching entity pairs, which are then used to select candidate entity pairs. Because candidate selection draws on the information of more matching entity pairs, the quality of the candidate entity pairs improves.
Brief description of the drawings
Fig. 1 is an overall flow diagram of the large-scale heterogeneous knowledge base alignment method based on iterative matching;
Fig. 2 is a flow chart of the data preprocessing stage of the method;
Fig. 3 is a flow chart of the knowledge base alignment stage of the method.
Detailed description of the embodiments
To describe the invention more specifically, the technical solution is explained in detail below with reference to the drawings and a specific embodiment.
As shown in Fig. 1, the method is divided into two parts: data preprocessing and knowledge base alignment. Data preprocessing: screen the data in the original knowledge bases, unify the data format, and obtain the relations in each knowledge base and the initial matching entity pairs. Knowledge base alignment: first partition the preprocessed knowledge bases using their relations and prune the blocks; then use the matching entity pairs to match blocks, obtaining matching block pairs; next select candidate entity pairs within the matching block pairs and confirm them using a similarity measure and a threshold; finally repeat these steps until no new candidate entity pairs can be found, at which point all matching entity pairs have been obtained.
Fig. 2 shows the flow chart of the data preprocessing stage, which consists of the following steps:
S1-1: input any two original knowledge bases KB_1 and KB_2, and remove from them the information unrelated to the alignment task.
A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R ⊆ E × R × E is the set of entity-relation-entity triples, representing relation facts whose object is an entity; and F_P ⊆ E × P × L is the set of entity-attribute-literal triples, representing attribute facts whose object is a literal. Both F_R and F_P may contain meaningless information. For example, some knowledge bases include the original text corpus from which the triples were extracted, and such information reduces the efficiency of the algorithm. Triples containing the "sameAs" relation should also be removed.
S1-2: unify the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and person names in a uniform format.
Literals such as person names, dates, and numbers may be written differently in different knowledge bases, for example "2016-01-01" and "01.01.2016". Unifying them makes later comparison easier. The method also converts all literals to lowercase.
S1-3: remove meaningless characters such as stop words, symbol characters, and language tags from the literals L_1 in KB_1 and L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2.
Attribute descriptions of entities in a knowledge base may contain meaningless characters, for example stop words such as "the", "a", and "an", symbols such as "#", "!", and "*", and language tags such as "@en". These characters distort the similarity measurement of entity pairs and are therefore removed.
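Steps S1-2 and S1-3 can be illustrated with a small literal normalizer. This is a sketch under assumptions: the stop-word list and the two date formats cover only the examples given in the text, dates are unified to ISO format, and the name `normalize_literal` is hypothetical.

```python
import re
from datetime import datetime

# Assumed for illustration: only the stop words and date layouts named in the text.
STOP_WORDS = {"the", "a", "an"}
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")


def normalize_literal(literal: str) -> str:
    """Lowercase a literal, unify dates, and drop language tags, symbols, stop words."""
    text = literal.strip()
    # Remove a trailing language tag such as "@en".
    text = re.sub(r"@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*$", "", text).strip()
    # If the whole literal parses as a date, emit it in one unified format.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    text = text.lower()
    # Replace symbol characters such as "#", "!" and "*" with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

With this sketch, both "2016-01-01" and "01.01.2016" normalize to the same string, so a later exact comparison treats them as equal.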
S1-4: collect the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2.
In this step, for KB'_1, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R_1; for KB'_2, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R_2. R_1 and R_2 are used in the subsequent knowledge base partitioning.
S1-5: compare all entities in KB'_1 and KB'_2 to obtain the initial matching entity pair set.
In this step, the initial matching entity pair set is obtained as follows:
First, extract all entities in KB'_1 to form the entity set E_1 and all entities in KB'_2 to form the entity set E_2; take the Cartesian product of E_1 and E_2 as the set of entity pairs;
Then, keep the entity pairs whose two name attributes have identical string representations, obtaining the preliminary initial matching entity pair set;
Finally, keep from the preliminary set only the entity pairs that stand in a one-to-one matching relationship, obtaining the initial matching entity pair set.
Fig. 3 shows the flow chart of the knowledge base alignment stage, which consists of the following steps:
S2-1: input KB'_1, KB'_2, the relation sets R_1 and R_2, and the initial matching entity pair set; set the block similarity threshold δ_b to 0.2, the entity similarity threshold δ_e to 0.65, the threshold δ_1 on the number of entities in a block to 50, and the threshold δ_2 on the ratio of matched entities in a block to 0.3; initialize the matching entity pair set M_e to the initial matching entity pair set.
S2-2: randomly select a relation from R_1 or R_2, and use it to divide the entities of KB'_1 and KB'_2 into blocks, obtaining the block set B_1 for KB'_1 and the block set B_2 for KB'_2.
In this step, the detailed process of dividing the entities in a knowledge base into blocks using a relation is:
First, for the triple set F_R1 of KB'_1, count the n distinct object entities in F_R1;
Then, for each object entity, group together all subject entities in F_R1 that correspond to it, giving one block; the n object entities give n blocks, which form the block set B_1.
The block set B_2 is obtained in the same way, namely:
First, for the triple set F_R2 of KB'_2, count the n distinct object entities in F_R2;
Then, for each object entity, group together all subject entities in F_R2 that correspond to it, giving one block; the n object entities give n blocks, which form the block set B_2.
S2-3: remove from B_1 and B_2 the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs, obtaining the pruned block sets B'_1 and B'_2.
In this step, the removed blocks are: blocks whose number of entities exceeds δ_1, blocks whose ratio of matched entities is below δ_2, and blocks whose entities are all already matched.
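The three pruning rules of S2-3 can be sketched as a single filter. The defaults below take δ_1 = 50 and δ_2 = 0.3 from S2-1; the function name `prune_blocks`, the argument names, and the representation of blocks as a dict from block key to entity set are assumptions.

```python
def prune_blocks(blocks, matched_entities, max_size=50, min_matched_ratio=0.3):
    """Drop blocks that are too large, too sparsely matched, or fully matched."""
    kept = {}
    for key, entities in blocks.items():
        if not entities:
            continue
        n_matched = len(entities & matched_entities)
        if len(entities) > max_size:                        # costly to compare
            continue
        if n_matched / len(entities) < min_matched_ratio:   # unlikely to yield matches
            continue
        if n_matched == len(entities):                      # nothing left to match
            continue
        kept[key] = entities
    return kept
```

Here `matched_entities` holds the entities of one knowledge base that already appear in M_e, so the filter is applied to B_1 and B_2 separately.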
S2-4: use all matching entity pairs in M_e to measure the similarity between every block in B'_1 and every block in B'_2, and match each two blocks whose similarity exceeds δ_b, obtaining the matching block pair set.
Each block is treated as a set of entities, and a pair of matched entities is treated as an element common to the two sets. The similarity between blocks is then measured by set similarity, and sim_block(b_k, b_l) is computed as:
$$\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cup b_l|}$$
where b_k and b_l denote two blocks, |b_k ∩ b_l| is the number of matching entity pairs across the two blocks, and |b_k ∪ b_l| is the total number of entities in the two blocks.
S2-5: for each matching block pair, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set.
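Candidate generation in S2-5 is a Cartesian product restricted to entities that are not yet matched. A minimal sketch, with the hypothetical function name `candidate_pairs` and blocks represented as sets of entity ids:

```python
from itertools import product


def candidate_pairs(block_k, block_l, matched_pairs):
    """Pair every unmatched entity of block_k with every unmatched entity of block_l."""
    matched1 = {e1 for e1, _ in matched_pairs}
    matched2 = {e2 for _, e2 in matched_pairs}
    return set(product(block_k - matched1, block_l - matched2))
```

Because blocks are capped at δ_1 entities, each product stays small, which is where the divide-and-conquer saving comes from.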
S2-6: check whether new candidate entity pairs were found; if so, go to S2-7; otherwise end the iteration and output the matching entity pair set M_e.
S2-7: compute the similarity between the two entities of each candidate entity pair, add the pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the remaining candidate pairs.
In this step, the similarity between entities is measured in two ways, string similarity and block similarity, and the two are combined with fixed weights:
$$\mathrm{sim}(e_i, e_j) = \alpha\,\mathrm{sim}_{string}(e_i, e_j) + (1-\alpha)\,\mathrm{sim}_{block}(b_k, b_l), \quad \text{s.t. } e_i \in b_k,\ e_j \in b_l$$
where sim_string(e_i, e_j) and sim_block(b_k, b_l) are the string similarity and the block similarity between the entities, b_k and b_l are the blocks containing e_i and e_j, and α is the weight of the string similarity, set to 0.6. For the attribute pairs that e_i and e_j share (for example, name), the string similarity measures how similar the attribute values are. The method uses several similarity functions, based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB, and combines them by linear weighting. The block similarity expresses the similarity of two entities through the similarity of the blocks that contain them. Once the entity similarity is obtained, it is compared with the threshold δ_e to decide whether the pair matches, and newly found matching entity pairs are added to the set of all matching entity pairs.
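The weighted combination can be sketched as follows, with α = 0.6 and δ_e = 0.65 from this embodiment. For brevity the sketch uses only a normalized Levenshtein similarity as the string measure; the Jaro-Winkler, q-gram, and I-SUB components and their weights are omitted, and all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def entity_similarity(name_i, name_j, sim_block, alpha=0.6):
    """sim = alpha * sim_string + (1 - alpha) * sim_block."""
    return alpha * string_similarity(name_i, name_j) + (1 - alpha) * sim_block


def is_match(name_i, name_j, sim_block, alpha=0.6, delta_e=0.65):
    """Confirm a candidate pair when its similarity exceeds delta_e."""
    return entity_similarity(name_i, name_j, sim_block, alpha) > delta_e
```

With these values, two entities with identical names match as soon as their blocks have a similarity above 0.125, while entities in identical blocks still need a string similarity above about 0.42.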
S2-8: check the iteration count against the iteration threshold: if the threshold has not been reached, return to S2-2; otherwise end the iteration and output the matching entity pair set M_e.
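Steps S2-2 to S2-8 can be tied together in a compact loop. This is a sketch under several assumptions rather than the embodiment itself: one relation is drawn per knowledge base in each round, block pruning (S2-3) is omitted, `difflib.SequenceMatcher` stands in for the combined string similarity, and new matches are accepted greedily so that M_e stays one-to-one. The names `align` and `blocks_for` and the parameter `max_iter` are hypothetical.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def blocks_for(triples, relation):
    """S2-2: one block of subjects per distinct object of `relation`."""
    blocks = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            blocks[obj].add(subj)
    return list(blocks.values())


def align(triples1, triples2, names1, names2, seed_pairs,
          delta_b=0.2, delta_e=0.65, alpha=0.6, max_iter=20, seed=0):
    rng = random.Random(seed)
    matches = set(seed_pairs)            # M_e, assumed one-to-one
    rels1 = sorted({rel for _, rel, _ in triples1})
    rels2 = sorted({rel for _, rel, _ in triples2})
    for _ in range(max_iter):            # S2-8: bounded number of rounds
        done1 = {a for a, _ in matches}
        done2 = {b for _, b in matches}
        blocks1 = blocks_for(triples1, rng.choice(rels1))
        blocks2 = blocks_for(triples2, rng.choice(rels2))
        candidates = {}
        for bk, bl in product(blocks1, blocks2):                # S2-4
            shared = sum(1 for a, b in matches if a in bk and b in bl)
            sim_block = shared / (len(bk) + len(bl) - shared)
            if sim_block > delta_b:
                for pair in product(bk - done1, bl - done2):    # S2-5
                    candidates[pair] = max(sim_block, candidates.get(pair, 0.0))
        if not candidates:               # S2-6: nothing new, stop
            break
        scored = []
        for (a, b), sim_block in candidates.items():            # S2-7
            sim_string = SequenceMatcher(None, names1[a], names2[b]).ratio()
            score = alpha * sim_string + (1 - alpha) * sim_block
            if score > delta_e:
                scored.append((score, a, b))
        for _score, a, b in sorted(scored, reverse=True):
            if a not in done1 and b not in done2:   # keep matches one-to-one
                matches.add((a, b))
                done1.add(a)
                done2.add(b)
    return matches
```

Each round feeds the enlarged M_e back into the block similarity, so blocks that were too dissimilar to be matched in an early round can qualify later.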
The specific embodiment above describes the technical solution and beneficial effects of the invention in detail. It should be understood that it is only the preferred embodiment of the invention and is not intended to limit the invention; any modification, supplement, or equivalent replacement made within the principles of the invention shall fall within its scope of protection.
Claims (10)
1. A large-scale heterogeneous knowledge base alignment method based on iterative matching, specifically comprising:
Data preprocessing stage: screening the data in any two original knowledge bases KB_1 and KB_2, unifying the data format, and removing meaningless characters; collecting the relation set R_1 of the processed knowledge base KB'_1 and the relation set R_2 of the processed knowledge base KB'_2; and comparing entities to obtain the initial matching entity pair set;
Knowledge base alignment stage: using the relations in R_1 and R_2 to partition KB'_1 and KB'_2, and pruning each block to obtain the pruned block sets B'_1 and B'_2; then using the initial matching entity pair set to match the blocks in B'_1 and B'_2, obtaining matching block pairs; finally, selecting candidate entity pairs within the matching block pairs and confirming them using a similarity measure and a threshold.
2. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 1, characterized in that the specific steps of the data preprocessing stage are:
(1-1) inputting any two original knowledge bases KB_1 and KB_2, and removing from them the information unrelated to the alignment task;
(1-2) unifying the data format of the literals L_1 in KB_1 and the literals L_2 in KB_2, expressing dates, numbers, and person names in a uniform format;
(1-3) removing stop words, symbol characters, and language tags from the literals L_1 in KB_1 and L_2 in KB_2, obtaining the processed knowledge bases KB'_1 and KB'_2;
(1-4) collecting the relation set R_1 of KB'_1 and the relation set R_2 of KB'_2;
(1-5) comparing all entities in KB'_1 and KB'_2 to obtain the initial matching entity pair set.
3. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 2, characterized in that the detailed process of step (1-4) is:
for KB'_1, traversing all entity-relation-entity triples in its triple set F_R1 and collecting the relation set R_1; for KB'_2, traversing all entity-relation-entity triples in its triple set F_R2 and collecting the relation set R_2.
4. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 2, characterized in that, in step (1-5), the initial matching entity pair set is obtained as follows:
first, extracting all entities in KB'_1 to form the entity set E_1 and all entities in KB'_2 to form the entity set E_2, and taking the Cartesian product of E_1 and E_2 as the set of entity pairs;
then, keeping the entity pairs whose two name attributes have identical string representations, obtaining the preliminary initial matching entity pair set;
finally, keeping from the preliminary set only the entity pairs that stand in a one-to-one matching relationship, obtaining the initial matching entity pair set.
5. the large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching as described in claim 1, which is characterized in that described
The specific steps of knowledge base align stage are as follows:
(2-1) Input knowledge library KB '1, knowledge base KB '2, set of relations R1, set of relations R2, initial matching entity is to collectionSetting
Block similarity threshold δb, entity similarity threshold δe, physical quantities threshold value δ in block1And matching entities ratio in block
Threshold value δ2, matching entities are to collection MeInitial matching entity is initialized as to collection
(2-2) Randomly select any relation from relation set R1 or relation set R2, and use that relation to divide the entities in knowledge base KB′1 and knowledge base KB′2 into several blocks, obtaining block set B1 corresponding to KB′1 and block set B2 corresponding to KB′2;
(2-3) Remove from block set B1 and block set B2 the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs, obtaining simplified block set B′1 and simplified block set B′2;
(2-4) Use all matching entity pairs in the matching entity pair set Me to measure the similarity between any block in B′1 and any block in B′2, and match every two blocks whose similarity value is greater than the block similarity threshold δb, obtaining the matching block pair set;
(2-5) For each matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block of the pair and the unmatched entities in the other block of the pair as candidate entity pairs, forming the candidate entity pair set;
(2-6) Determine whether any new candidate entity pair has been found; if so, go to step (2-7); if not, terminate the iteration and output the matching entity pair set Me;
(2-7) Compute the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity value is greater than the entity similarity threshold δe to the matching entity pair set Me, and discard the remaining candidate entity pairs;
(2-8) Determine whether the number of iterations has reached the iteration threshold; if not, return to step (2-2); if so, terminate the iteration and output the matching entity pair set Me.
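The control flow of steps (2-1) to (2-8) can be sketched roughly as below. This is a simplified illustration under several assumptions, not the claimed implementation: entities are plain name strings, the pruning of step (2-3) is omitted, `difflib.SequenceMatcher` stands in for the string similarity, and the function names, parameter names and default thresholds are hypothetical.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher


def blocks_for(triples, relation):
    """Group the subject entities of one relation by their object entity."""
    blocks = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            blocks[obj].add(subj)
    return list(blocks.values())


def align(kb1, kb2, relations, seed_matches, delta_b=0.1, delta_e=0.4,
          alpha=0.5, max_iters=10, rng=None):
    rng = rng or random.Random(0)
    matches = set(seed_matches)                      # (2-1) initialise Me
    for _ in range(max_iters):                       # (2-8) iteration threshold
        relation = rng.choice(sorted(relations))     # (2-2) pick a relation
        blocks1 = blocks_for(kb1, relation)
        blocks2 = blocks_for(kb2, relation)
        done1 = {a for a, _ in matches}
        done2 = {b for _, b in matches}
        candidates = {}
        for bk in blocks1:                           # (2-4) block similarity
            for bl in blocks2:
                # Counted per entity of bk so the denominator stays positive
                # even if the matches are not strictly one-to-one.
                shared = len({a for a, b in matches if a in bk and b in bl})
                sim_b = shared / (len(bk) + len(bl) - shared)
                if sim_b <= delta_b:
                    continue
                for a in bk - done1:                 # (2-5) unmatched x unmatched
                    for b in bl - done2:
                        candidates[(a, b)] = max(sim_b, candidates.get((a, b), 0.0))
        if not candidates:                           # (2-6) nothing new: stop
            break
        for (a, b), sim_b in candidates.items():     # (2-7) score and filter
            sim_s = SequenceMatcher(None, a, b).ratio()
            if alpha * sim_s + (1 - alpha) * sim_b > delta_e:
                matches.add((a, b))
    return matches


# Hypothetical toy knowledge bases with a single relation.
kb1 = [("Hangzhou", "locatedIn", "Zhejiang"),
       ("Ningbo", "locatedIn", "Zhejiang"),
       ("Wenzhou", "locatedIn", "Zhejiang")]
kb2 = [("Hangzhou", "locatedIn", "Zhejiang Province"),
       ("Ningbo City", "locatedIn", "Zhejiang Province"),
       ("Wenzhou City", "locatedIn", "Zhejiang Province")]
result = align(kb1, kb2, {"locatedIn"}, {("Hangzhou", "Hangzhou")})
```

In this toy run the single seed pair makes the two blocks similar enough to be matched, which in turn exposes the remaining cities as candidate pairs. That is the propagation effect the iteration relies on.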
6. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 5, characterized in that, in step (2-2), the detailed process of using the relation to divide the entities in a knowledge base into several blocks is as follows:
First, for the triple set FR1 of knowledge base KB′1, count the n distinct object entities in FR1;
Then, for each object entity, gather all subject entities in FR1 that correspond to it into one block; the n object entities yield n blocks, which form block set B1;
Block set B2 is obtained by the same method.
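A minimal sketch of this partitioning, assuming triples are (subject, relation, object) tuples and that only triples carrying the relation chosen in step (2-2) are considered. The function name `partition_by_relation` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def partition_by_relation(
    triples: Iterable[Tuple[str, str, str]], relation: str
) -> Dict[str, Set[str]]:
    """Return one block of subject entities per distinct object entity."""
    blocks: Dict[str, Set[str]] = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            blocks[obj].add(subj)  # subjects sharing an object share a block
    return dict(blocks)


# Hypothetical toy triples: two distinct objects give two blocks.
triples_kb1 = [
    ("Hangzhou", "locatedIn", "Zhejiang"),
    ("Ningbo", "locatedIn", "Zhejiang"),
    ("Nanjing", "locatedIn", "Jiangsu"),
    ("Hangzhou", "hasMayor", "SomePerson"),
]
b1 = partition_by_relation(triples_kb1, "locatedIn")
```

Keying the blocks by object entity keeps the origin of each block visible, which can help when inspecting why two blocks were matched.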
7. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 5, characterized in that, in step (2-3), the blocks that are likely to incur a high computational cost or unlikely to yield matching entity pairs include: blocks whose entity count exceeds threshold δ1, blocks whose matched entity ratio is below threshold δ2, and blocks whose entities have all been matched.
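The three removal criteria can be illustrated with a small filter. The sketch assumes that the matched entity ratio of a block is the number of already matched entities divided by the block size, which the claim does not spell out. The names `prune_blocks`, `delta1` and `delta2` are hypothetical.

```python
from typing import Dict, Set


def prune_blocks(
    blocks: Dict[str, Set[str]], matched: Set[str], delta1: int, delta2: float
) -> Dict[str, Set[str]]:
    """Drop blocks that are too large, too sparsely matched, or fully matched."""
    kept = {}
    for key, entities in blocks.items():
        if not entities:
            continue  # an empty block cannot produce candidate pairs
        n_matched = len(entities & matched)
        if len(entities) > delta1:
            continue  # too many entities: high computational cost
        if n_matched / len(entities) < delta2:
            continue  # too few matched entities: weak evidence
        if n_matched == len(entities):
            continue  # every entity already matched: nothing left to find
        kept[key] = entities
    return kept


# Hypothetical toy blocks, one per removal criterion plus one survivor.
toy_blocks = {
    "big": {"a", "b", "c", "d", "e"},
    "cold": {"f", "g", "h", "i"},
    "done": {"j", "k"},
    "ok": {"l", "m"},
}
already_matched = {"a", "j", "k", "l"}
simplified = prune_blocks(toy_blocks, already_matched, delta1=4, delta2=0.25)
```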
8. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 5, characterized in that, in step (2-4), the similarity between blocks is obtained as follows:
Each block is regarded as a set of entities, and a matched entity pair is regarded as an element common to the two sets; set similarity is then used to measure the similarity between blocks, and the similarity simblock(bk,bl) is calculated as:
simblock(bk,bl) = |bk∩bl| / |bk∪bl|
where bk and bl denote the two blocks, |bk∩bl| denotes the number of matching entity pairs in the two blocks, and |bk∪bl| denotes the total number of entities in the two blocks.
9. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 8, characterized in that, in step (2-7), the similarity between entities is calculated as:
sim(ei,ej) = α·simstring(ei,ej) + (1−α)·simblock(bk,bl)
s.t. ei∈bk, ej∈bl
where bk and bl denote the blocks containing entities ei and ej respectively, simstring(ei,ej) and simblock(bk,bl) denote the string similarity between the entities and the block similarity respectively, and α is the weight of the string similarity, with value range [0,1].
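The weighted combination can be written as a small helper. The default weight of 0.5 is an arbitrary assumption for illustration, and the function name `entity_similarity` is hypothetical.

```python
def entity_similarity(sim_string: float, sim_block: float, alpha: float = 0.5) -> float:
    """Combine string and block similarity with weight alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_string + (1.0 - alpha) * sim_block


# With alpha = 0.7 the string evidence dominates: 0.7 * 0.9 + 0.3 * 0.5.
score = entity_similarity(0.9, 0.5, alpha=0.7)
```

Setting α to 1 reduces the measure to pure string matching, while α equal to 0 relies only on the structural evidence carried by the blocks.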
10. The large-scale heterogeneous knowledge base alignment method based on iterative matching according to claim 9, characterized in that similarity functions based on Levenshtein distance, Jaro-Winkler distance, q-grams and I-SUB are used, and these similarity measures are combined by linear weighting to compute the string similarity.
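A reduced sketch of such a linear combination is given below. For brevity it implements only two of the four named measures (a normalised Levenshtein similarity and a q-gram Jaccard similarity); Jaro-Winkler and I-SUB would be added as further weighted terms. The normalisations, the equal weights and all function names are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Map edit distance to [0, 1] by dividing by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgram_sim(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the sets of q-character substrings."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def string_similarity(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Linear weighted combination of the individual measures."""
    w_lev, w_qgram = weights
    return w_lev * levenshtein_sim(a, b) + w_qgram * qgram_sim(a, b)


# "abcd" vs "abce": one substitution, two of four distinct bigrams shared.
s = string_similarity("abcd", "abce")
```

The weights should sum to 1 so that the combined value stays in [0, 1].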
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710237034.6A CN107145523B (en) | 2017-04-12 | 2017-04-12 | Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107145523A (en) | 2017-09-08 |
CN107145523B (en) | 2019-10-18 |
Family
ID=59774786
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710237034.6A Expired - Fee Related CN107145523B (en) | 2017-04-12 | 2017-04-12 | Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107145523B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108460021B (en) * | 2018-03-16 | 2021-10-12 | 安徽大学 | Method for extracting problem method pairs in thesis title |
CN109492114A (en) * | 2018-11-16 | 2019-03-19 | 南京茂毓通软件科技有限公司 | A kind of entity information recognition methods |
CN109739939A (en) * | 2018-12-29 | 2019-05-10 | 颖投信息科技(上海)有限公司 | The data fusion method and device of knowledge mapping |
CN110413704B (en) * | 2019-06-27 | 2022-05-03 | 浙江大学 | Entity alignment method based on weighted neighbor information coding |
CN111191045B (en) * | 2019-12-30 | 2023-06-16 | 创新奇智(上海)科技有限公司 | Entity alignment method and system applied to knowledge graph |
CN112699667A (en) * | 2020-12-29 | 2021-04-23 | 京东数字科技控股股份有限公司 | Entity similarity determination method, device, equipment and storage medium |
CN112784609A (en) * | 2021-03-16 | 2021-05-11 | 云知声智能科技股份有限公司 | Method, apparatus, device and medium for determining whether medical record includes consultation opinions |
CN113609304B (en) * | 2021-07-20 | 2023-05-23 | 广州大学 | Entity matching method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679492A (en) * | 2013-11-29 | 2015-06-03 | 国际商业机器公司 | Computer-implemented technical support providing device and method |
CN104899242A (en) * | 2015-03-10 | 2015-09-09 | 四川大学 | Mechanical product design two-dimensional knowledge pushing method based on design intent |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9411878B2 (en) * | 2014-02-19 | 2016-08-09 | International Business Machines Corporation | NLP duration and duration range comparison methodology using similarity weighting |
Non-Patent Citations (2)
Title |
---|
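Interference Alignment with Cyclic Unidirectional; Mohammad Reza Nakhai et al; 《2012 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC)》; 20121129; pp. 1-5 * |
Entity alignment of Chinese heterogeneous encyclopedia knowledge bases (中文异构百科知识库实体对齐); Huang Junfu et al; 《Journal of Computer Applications》 (计算机应用); 20160710; pp. 1881-1886, 1898 * |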
Also Published As
Publication number | Publication date |
---|---|
CN107145523A (en) | 2017-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107145523B (en) | Large-scale Heterogeneous Knowledge library alignment schemes based on Iterative matching | |
Ahmed et al. | Graph sample and hold: A framework for big-graph analytics | |
Pezzoni et al. | How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation | |
Deng et al. | A user identification algorithm based on user behavior analysis in social networks | |
CN104346481B (en) | A kind of community detection method based on dynamic synchronization model | |
CN113761221B (en) | Knowledge graph entity alignment method based on graph neural network | |
CN105956158B (en) | The method that network neologisms based on massive micro-blog text and user information automatically extract | |
CN109144964A (en) | log analysis method and device based on machine learning | |
CN103034726A (en) | Text filtering system and method | |
CN106991090A (en) | The analysis method and device of public sentiment event entity | |
Yu et al. | Motifs in big networks: Methods and applications | |
CN105930466A (en) | Massive data processing method | |
Cheng et al. | Mining research trends with anomaly detection models: the case of social computing research | |
Zhiyuli et al. | Modeling large-scale dynamic social networks via node embeddings | |
CN106055652A (en) | Method and system for database matching based on patterns and examples | |
Bi et al. | MM-GNN: Mix-moment graph neural network towards modeling neighborhood feature distribution | |
CN106651461A (en) | Film personalized recommendation method based on gray theory | |
Wang et al. | An improved clustering method for detection system of public security events based on genetic algorithm and semisupervised learning | |
CN106339293B (en) | A kind of log event extracting method based on signature | |
CN108595515A (en) | A kind of microblog emotional analysis method of the weak relationship of combination microblogging | |
CN116432125A (en) | Code classification method based on hash algorithm | |
Pola et al. | Similarity sets: A new concept of sets to seamlessly handle similarity in database management systems | |
CN113011152B (en) | Text processing method, device and equipment and computer readable storage medium | |
Zhao et al. | Realization of intrusion detection system based on the improved data mining technology | |
Priya et al. | Entity resolution for high velocity streams using semantic measures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20191018 |