CN108960292A - Data fusion method, device, system based on pattern match and Entities Matching - Google Patents
Data fusion method, device, and system based on pattern matching and entity matching
- Publication number
- CN108960292A CN201810594208.9A
- Authority
- CN
- China
- Prior art keywords
- record
- pattern match
- entities matching
- matching
- pair
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/251—Fusion techniques of input or preprocessed data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses a data fusion method based on pattern matching and entity matching. Entity matching is first performed on the given, initially connected record pairs; the current entity matching result is then used for pattern matching, the pattern matching result is in turn used for further entity matching, and so on. Each round takes the previous round's matching result as its input and matches again on the basis of the record pairs matched in the previous round, so that erroneous matches or undiscovered record groups from the previous round can be corrected and the pattern matching result is continually revised, until the results of both pattern matching and entity matching stabilize. The method can improve the accuracy of data fusion and increase the value of the data. The invention also discloses a data fusion device and system based on pattern matching and entity matching, and a readable storage medium, which have the same beneficial effects.
Description
Technical field
The present invention relates to the field of electronic technology, and in particular to a data fusion method, device, and system based on pattern matching and entity matching, and a readable storage medium.
Background
Data fusion is an information processing technique in which several pieces of observation information obtained in time sequence are automatically analyzed and integrated under certain criteria in order to complete the required decision and assessment tasks. With the rapid development of computer and communication technology and their increasingly close combination, together with the particularly urgent needs of military applications, data fusion has become an especially important emerging data processing technology.
In the current era of data expansion, resolving the inconsistency between data sources is an important problem. Fusion is difficult because the attributes of records are labeled in inconsistent ways and because the data themselves contain errors. Solving the problem involves two aspects, inconsistency at the schema level and inconsistency between tuples, so fusion between data sources requires two steps: pattern matching and entity matching. Pattern matching finds the attributes that are the same across different data sets, while entity matching finds the pairs of instances, within one data set or across different data sets, that represent the same entity. Data fusion then integrates the inconsistent data into a unified data format while retaining the information unique to each source.
At present, pattern matching and entity matching are treated as two independent steps when data are fused: pattern matching is done once, and then entity matching is done once. This approach gives only a single opportunity to perform each step and forgoes further opportunities to adjust and revise the pattern matching and entity matching results, so the accuracy of data fusion is low and a large amount of useful data is ignored.
Therefore, how to improve the accuracy of data fusion and increase the value of the data is a technical problem that those skilled in the art need to solve.
Summary of the invention
The object of the present invention is to provide a data fusion method based on pattern matching and entity matching that can improve the accuracy of data fusion and increase the value of the data. A further object of the present invention is to provide a data fusion device and system based on pattern matching and entity matching, and a readable storage medium, which have the same beneficial effects.
To solve the above technical problem, the present invention provides a data fusion method based on pattern matching and entity matching, comprising:
Step 1: receive the initially connected record pairs for which pattern matching has succeeded;
Step 2: take the record pairs for which pattern matching succeeded as pattern matching record pairs, and perform entity matching on the data corresponding to the pattern matching record pairs;
Step 3: determine whether any record pair has been successfully entity matched; if so, go to step 4; if not, go to step 6;
Step 4: take the record pairs for which entity matching succeeded as entity matching record pairs, and perform pattern matching on the data corresponding to the entity matching record pairs;
Step 5: determine whether any record pair has been successfully pattern matched; if so, go to step 2; if not, go to step 6;
Step 6: obtain all successfully matched record pairs.
Preferably, the record indexing method used for the pattern matching and the entity matching is specifically a multi-attribute interaction index based on q-grams.
Preferably, the method for building the q-gram-based multi-attribute interaction index comprises:
building a dynamic index from the initially connected record pairs for which pattern matching has succeeded;
removing from the index the record pairs whose discrimination is higher than a discrimination threshold;
wherein the discrimination is calculated from PosT(A) and NegT(A), where PosT(A) is the number of successfully matched record pairs whose values under A match, and NegT(A) is the number of successfully matched record pairs whose values under A do not match.
Preferably, the method for the Entities Matching includes:
To pattern match record to corresponding data successively computational entity matching degree;
The record that the Entities Matching degree is filtered out higher than Entities Matching degree threshold value is successfully recorded to as Entities Matching
It is right;
Wherein, the calculation method of the Entities Matching degree includes:
When λ be damped coefficient,For all attributes pair connected total contribution score when, it is real
Body matching degree
Wherein,
LSM(A, B) is a possibility that record A and B is matched;ctr
(t [A], s [B]) is the similarity contribution margin of t [A] and s [B] two values;
Sim (t [A], s [B]) is the similarity of t [A] and s [B] two values, and θ is critical point;
The fused discrimination score of attribute A and B
Preferably, the method for the pattern match includes:
Pattern match degree is successively calculated to corresponding data to Entities Matching record;
The record that the pattern match degree is filtered out higher than pattern match degree threshold value is successfully recorded to as pattern match
It is right;
Wherein, the calculation method of the pattern match degree includes:
When α is to control the contribution score of relative recording pair, pattern match degree
Wherein,
Ctr (t [A], s [B]) is the similarity contribution margin of t [A] and s [B] two values;
Sim (t [A], s [B]) is the similarity of t [A] and s [B] two values, and θ is critical point;
The fused discrimination score of attribute A and B
Preferably, the data fusion method based on pattern matching and entity matching further comprises:
calculating an instability for each successfully matched record pair obtained;
removing the matched record pairs whose instability is greater than an instability threshold;
wherein the instability is calculated from the average similarity of all record values under the record pair (A, B) and from m, the number of connected record pairs, sim(a, b) being the similarity of the two values a and b.
Preferably, before entity matching is performed on the data corresponding to the pattern matching record pairs, the method further comprises:
calculating an error value for each pattern matching record pair;
removing the record pairs whose error value is higher than an error threshold;
wherein the error value of a record pair (A, B) is calculated using a similarity function S(·) and a quality threshold η.
The present invention discloses a data fusion device based on pattern matching and entity matching, comprising:
a receiving unit, configured to receive the initially connected record pairs for which pattern matching has succeeded;
an entity matching unit, configured to take the record pairs for which pattern matching succeeded as pattern matching record pairs and to perform entity matching on the data corresponding to the pattern matching record pairs;
an entity judging unit, configured to determine whether any record pair has been successfully entity matched;
a pattern matching unit, configured to take the record pairs for which entity matching succeeded as entity matching record pairs and to perform pattern matching on the data corresponding to the entity matching record pairs;
a pattern judging unit, configured to determine whether any record pair has been successfully pattern matched;
an acquiring unit, configured to obtain all successfully matched record pairs when no further record pair is successfully matched.
The present invention discloses a data fusion system based on pattern matching and entity matching, comprising:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the data fusion method based on pattern matching and entity matching when executing the computer program.
The present invention discloses a readable storage medium on which a program is stored, the program implementing the steps of the data fusion method based on pattern matching and entity matching when executed by a processor.
In the data fusion method based on pattern matching and entity matching provided by the present invention, entity matching is first performed on the given, initially connected record pairs; the current entity matching result is then used for pattern matching, the pattern matching result is in turn used for further entity matching, and so on. Each round takes the previous round's matching result as its input and matches again on the basis of the record pairs matched in the previous round, so that erroneous matches or undiscovered record groups from the previous round can be corrected and the pattern matching result is continually revised, until the results of both pattern matching and entity matching stabilize. This greatly improves the precision and recall of entity matching, and thereby improves the accuracy of data fusion and increases the value of the data.
The invention also discloses a data fusion device and system based on pattern matching and entity matching, and a readable storage medium, which have the same beneficial effects; these are not repeated here.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.
Fig. 1 is a flowchart of the data fusion method based on pattern matching and entity matching provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the index building process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the matching process provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the matching results of the interaction between entity matching and pattern matching provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of the data fusion device based on pattern matching and entity matching provided by an embodiment of the present invention;
Fig. 6 is a structural block diagram of the data fusion system based on pattern matching and entity matching provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of the data fusion system based on pattern matching and entity matching provided by an embodiment of the present invention.
Detailed description of the embodiments
The core of the present invention is to provide a data fusion method based on pattern matching and entity matching that can improve the accuracy of data fusion and increase the value of the data. Another core of the present invention is to provide a data fusion device and system based on pattern matching and entity matching, and a readable storage medium, which have the same beneficial effects.
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
For two data tables T1 = {t1, t2, ..., tn} and T2 = {s1, s2, ..., sm}, let their schemas be S1 = {A1, A2, ..., Ap} and S2 = {B1, B2, ..., Bq} respectively, let ti (1 ≤ i ≤ n) be any entity in T1, and let sj (1 ≤ j ≤ m) be any entity in T2. Pattern matching finds all attribute pairs (Ai, Bj) from S1 and S2 such that the two attributes of each pair record the same property. Entity matching finds the record pairs (ti, sj) from T1 and T2 that refer to the same entity.
For data integration, pattern matching and entity matching are at present treated as two independent steps: pattern matching is performed once, entity matching is then performed once, and the process ends. This gives only a single opportunity to perform pattern matching and entity matching. Yet as more information is gathered from one of the steps, the other step can be carried out better, so the existing approach forgoes further opportunities to adjust the pattern matching and entity matching results and generally fails to achieve a satisfactory fusion effect.
Referring to Fig. 1, which is a flowchart of the data fusion method based on pattern matching and entity matching provided by an embodiment of the present invention, the method may include:
Step 1: receive the initially connected record pairs for which pattern matching has succeeded;
For two data tables T1 = {t1, t2, ..., tn} and T2 = {s1, s2, ..., sm}, the initially connected attribute pairs are received. Referring to Table 1 and Table 2 below, take the initial data table T1 to be Table 1 and the initial data table T2 to be Table 2. The attributes of T1 are Product, WT, SIZE, CAMERA, ROM, and RAM; the attributes of T2 are Product, Weight, Screen, Front Cam, Back Cam, Memory, and EX-Memory.
In the initial attribute tables, taking the received seed attribute pair (that is, the record pair for which pattern matching has succeeded) as an example, the data corresponding to that pair are Iphone 6, Iphone 6plus, Iphone 5C, Samsung Note4, Samsung S6, HuaWei 6+, HuaWei P7, and HuaWei P8 in T1, and Iphone 6, Iphone 6+, Note4, Galaxy S6, MI Note, MI 4, Coolpad S6, and MX Note4 in T2.
The initially connected record pairs may be specified by the user or selected automatically by a program; the way in which they are generated is not limited here.
Here, a record in a data table refers both to an attribute of the table and to the designation of a data row. Taking Table 1 as an example, the records of T1 include the attributes Product, WT, SIZE, CAMERA, ROM, and RAM, and also the row designations t1, t2, t3, t4, t5, t6, t7, and t8. A record pair refers to two records of the same type from the two data tables. For example, record pairs of attributes may include Product in T1 with Product in T2, or Product in T1 with WT in T2, and record pairs of rows may include t3 in T1 with s1 in T2, or t8 in T1 with s4 in T2.
Record | Product | WT | SIZE | CAMERA | ROM | RAM
t1 | Iphone 6 | 129g | 4.7inch | — | — | 1GB
t2 | Iphone 6plus | 172g | 5.5inch | 8mp | 128GB | 1GB
t3 | Iphone 5C | 112g | 4.0inch | 12mp | 32GB | 1GB
t4 | Samsung Note4 | 176g | 5.7inch | 16mp | 16GB | 3GB
t5 | Samsung S6 | — | 5.1inch | 16mp | 32GB | 2GB
t6 | HuaWei 6+ | 165g | 5.5inch | 8mp | — | 3GB
t7 | HuaWei P7 | 124g | 5.0inch | — | 64GB | 2GB
t8 | HuaWei P8 | 144g | 5.5inch | 13mp | — | 3GB
Table 1
Record | Product | Weight | Screen | Front Cam | Back Cam | Memory | EX-Memory
s1 | Iphone 6 | 129g | 4.7in | — | 8mp | 64GB | —
s2 | Iphone 6+ | 172g | 5.5in | 12mp | — | — | —
s3 | Note4 | 176g | 5.7in | — | 13mp | 16GB | 128GB
s4 | Galaxy S6 | — | 5.1in | 8mp | 16mp | 32GB | —
s5 | MI Note | — | 5.7in | 8mp | — | 16GB | 32GB
s6 | MI 4 | 149g | 5.0in | 13mp | 13mp | — | 64GB
s7 | Coolpad S6 | — | 5.95in | 16mp | — | 32GB | 64GB
s8 | MX Note4 | 145g | — | 16mp | 16mp | 32GB | 16GB
Table 2
Step 2: take the record pairs for which pattern matching succeeded as pattern matching record pairs, and perform entity matching on the data corresponding to the pattern matching record pairs;
Pattern matching finds the attributes that are the same across different data sets, whereas entity matching finds the record pairs, within one data set or across different data sets, that represent the same entity.
After the record pairs for which pattern matching succeeded have been obtained as pattern matching record pairs, entity matching is performed on them.
For example, given the received seed attribute pair, entity matching is performed on its corresponding data, that is, between Iphone 6, Iphone 6plus, Iphone 5C, Samsung Note4, Samsung S6, HuaWei 6+, HuaWei P7, and HuaWei P8 in T1 and Iphone 6, Iphone 6+, Note4, Galaxy S6, MI Note, MI 4, Coolpad S6, and MX Note4 in T2.
Step 3: determine whether any record pair has been successfully entity matched; if so, go to step 4; if not, go to step 6;
After entity matching, it is determined whether any successfully matched record pair has been output. If so, pattern matching is performed on the basis of those matched pairs, which makes it possible to further correct the entity matching result and improve the precision of data fusion.
Step 4: take the record pairs for which entity matching succeeded as entity matching record pairs, and perform pattern matching on the data corresponding to the entity matching record pairs;
The record pairs for which entity matching succeeded are taken as entity matching record pairs and used as the input for pattern matching; the detailed pattern matching procedure is not limited here.
For example, entity matching on the data corresponding to the initially received seed attribute pair yields two entity matching record pairs, which are then used as the input for pattern matching.
Step 5: determine whether any record pair has been successfully pattern matched; if so, go to step 2; if not, go to step 6;
If pattern matching yields successfully matched record pairs, these pairs are used again as the input data for entity matching.
Pattern matching and entity matching are thus executed alternately. Each entity matching step finds more connected record pairs, which help the next pattern matching step find attribute pairs beyond those currently connected. Likewise, each pattern matching step finds more attribute pairs, which help the next entity matching step find more record pairs, so that matching records are found iteratively.
Step 6: obtain all successfully matched record pairs.
In the matching process of the present invention, comparisons are made only between values under attributes that have already been matched. The data integration method in which pattern matching and entity matching interact takes some given seed attribute connections and executes entity matching and pattern matching alternately, that is, the result of the previous round of pattern matching (or entity matching) is used to carry out the next entity matching (or pattern matching).
In pattern matching, if two records match, their values under the same attribute should be very similar. Comparing the similarity of the attribute values of the currently matched records therefore makes it possible to distinguish more finely, during pattern matching, between attributes that have similar value ranges. The iterative algorithm first performs entity matching using the given seed attribute connections, then performs pattern matching using the current entity matching result, and then performs entity matching again. In this way, erroneous matches or undiscovered record groups from the previous round can be corrected, and for the same reason the pattern matching result can be continually revised. The method stops when the results of both pattern matching and entity matching have stabilized.
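The alternating procedure of steps 1 to 6 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: `fuse`, `entity_match`, and `pattern_match` are hypothetical names, the two matchers are supplied by the caller, and "the results have stabilized" is read here as "neither result set changed in the last round".

```python
from typing import Callable, Set, Tuple

AttrPair = Tuple[str, str]     # (attribute in T1, attribute in T2)
RecordPair = Tuple[str, str]   # (record id in T1, record id in T2)


def fuse(
    seed_attrs: Set[AttrPair],
    entity_match: Callable[[Set[AttrPair]], Set[RecordPair]],
    pattern_match: Callable[[Set[RecordPair]], Set[AttrPair]],
    max_rounds: int = 100,
) -> Tuple[Set[AttrPair], Set[RecordPair]]:
    """Alternate entity matching and pattern matching until both results are stable."""
    attrs: Set[AttrPair] = set(seed_attrs)       # step 1: seed attribute pairs
    records: Set[RecordPair] = set()
    for _ in range(max_rounds):                  # safety bound (an assumption)
        new_records = entity_match(attrs)        # step 2
        if not new_records:                      # step 3: no record pair matched
            break
        new_attrs = pattern_match(new_records)   # step 4
        stable = new_records == records and new_attrs == attrs
        records = new_records
        if not new_attrs or stable:              # step 5, or nothing changed
            break
        attrs = new_attrs
    return attrs, records                        # step 6
```

Each matcher returns the full current result set rather than an increment, so a later round can drop a pair that an earlier round accepted; this mirrors the correction of earlier erroneous matches described above.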
Because the present invention matches multiple records and uses multiple records for entity matching, it may preferably use a q-gram-based segmentation method to reduce matching time and improve efficiency. Indexing records with a q-gram-based multi-attribute interaction index can substantially reduce indexing time and improve the efficiency of data fusion.
Specifically, the method for building the q-gram-based multi-attribute interaction index may include:
building a dynamic index from the initially connected record pairs for which pattern matching has succeeded;
removing from the index the record pairs whose discrimination is higher than a discrimination threshold;
wherein the discrimination is calculated from PosT(A) and NegT(A), where PosT(A) is the number of successfully matched record pairs whose values under A match, and NegT(A) is the number of successfully matched record pairs whose values under A do not match.
The present invention introduces a quantity, the discrimination score IdC, that measures an attribute's ability to distinguish one entity from other entities; it reflects how important an attribute is for entity matching. For an attribute A in a data table, the discrimination score is expressed in terms of PosT(A), the number of matched record pairs in the training set whose values under attribute A match, and NegT(A), the number of matched record pairs in the training set whose values under attribute A do not match. For two attributes A and B, a fused IdC is defined correspondingly.
Given a string s, the set of q-grams generated from s is Gms(s, q) = {gm_1, gm_2, ..., gm_{|s|-q+1}}, where gm_i consists of the characters of s from position i to position i+q-1 in their natural order. A continuous q-gram sequence of s of length l is then defined as a sequence of l consecutive q-grams in Gms(s, q).
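The q-gram definition above can be illustrated with a short sketch. The function names `qgrams` and `qgram_similarity` are illustrative, and the Jaccard overlap used for the similarity is an assumption: the text does not say which similarity function sim is used.

```python
from typing import List


def qgrams(s: str, q: int) -> List[str]:
    """Return the |s| - q + 1 overlapping substrings of length q, in order.

    The text numbers positions from 1; Python slices are 0-based, so gm_i
    corresponds to s[i-1 : i-1+q].
    """
    if q <= 0:
        raise ValueError("q must be positive")
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets (an assumed choice of sim)."""
    ga, gb = set(qgrams(a, q)), set(qgrams(b, q))
    if not ga and not gb:
        # Both strings are shorter than q: fall back to exact equality.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)
```

For example, `qgrams("Note4", 2)` yields the four 2-grams `No`, `ot`, `te`, and `e4`, which matches the count |s| - q + 1 = 5 - 2 + 1.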
The index is not built all at once but dynamically. First, based on the above definition, an index is built on the given seed attributes. Then, as more attributes are matched, new indexes continue to be built on those attributes whose IdC score is greater than a certain threshold. Finally, an index over multiple attributes is obtained; this index is the interaction index.
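One way to picture an index that grows attribute by attribute is an inverted index from q-grams to record ids. The sketch below is an assumption about the data structure, since the actual layout is given only in Fig. 2; `QGramIndex`, `add_attribute`, and `candidates` are hypothetical names, and the decision of when to add an attribute (for example, an IdC threshold test) is left to the caller.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def qgrams(s: str, q: int) -> List[str]:
    """Overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


class QGramIndex:
    """Maps (attribute, q-gram) to the ids of records whose value contains it."""

    def __init__(self, q: int = 2) -> None:
        self.q = q
        self.postings: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self.indexed_attrs: Set[str] = set()

    def add_attribute(self, attr: str, table: Dict[str, Dict[str, str]]) -> None:
        """Extend the index with one more attribute as attributes get matched."""
        for rec_id, row in table.items():
            value = row.get(attr)
            if not value:                      # skip missing values
                continue
            for gram in qgrams(value.lower(), self.q):
                self.postings[(attr, gram)].add(rec_id)
        self.indexed_attrs.add(attr)

    def candidates(self, attr: str, value: str) -> Set[str]:
        """Records sharing at least one q-gram with `value` under `attr`."""
        found: Set[str] = set()
        for gram in qgrams(value.lower(), self.q):
            found |= self.postings.get((attr, gram), set())
        return found
```

Looking up candidates through shared q-grams avoids comparing every record of one table with every record of the other, which is the efficiency gain the index is meant to provide.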
Specifically, the greedy record matching strategy based on the interaction index is as follows.
Suppose there are already some matched attributes, that potentially matching records have been placed in blocks, and that an index over the records has been built on these attributes. A record block selected for comparison should satisfy two conditions: (1) it requires fewer value comparisons; (2) it has a higher probability of producing matched record pairs. Specifically, a block is given as Block = ({LR}, {RR}), where LR is a set of records from one table and RR is a set of records from the other table. The priority of a block is estimated from AttrPairBlock, the attribute pairs on which the index is built, and from the average number of comparisons needed to obtain one matched record pair. When a block Block = ({LR}, {RR}) is selected for record comparison, the similarity of the values under the matched attributes is calculated, and from it the likelihood that the records match. A maximum can be set on the number of record pairs used for pattern matching, so that an overly long pattern matching phase does not delay record matching.
In the interaction index, a block with a high priority value usually produces correct record pairs, because such blocks usually contain few records and those records are highly similar.
This greedy algorithm finds matched records quickly, which removes many unnecessary comparisons when matching the remaining records. Deriving theoretical upper and lower bounds on the likelihood of a record match further reduces the number of value comparisons under the attributes and thereby improves the efficiency of the algorithm.
The interaction index built on Table 1 and Table 2 is taken as an example below; the index building process is shown in Fig. 2.
With this index, only a limited set of comparisons is needed to obtain the final record matching result. Together with the upper and lower bounds on the likelihood of a record match, the number of comparisons is reduced greatly, so the index improves the efficiency of the algorithm well.
Based on the discrimination introduced above, the entity matching process may specifically be: calculating, in turn, an entity matching degree for the data corresponding to each pattern matching record pair; and selecting the record pairs whose entity matching degree is higher than the entity matching degree threshold as the record pairs for which entity matching has succeeded.
The entity matching degree is calculated from a damping coefficient λ and the total contribution score of all connected attribute pairs, using the following quantities:
LSM(A, B) is the likelihood that A and B match;
ctr(t[A], s[B]) is the similarity contribution value of the two values t[A] and s[B];
sim(t[A], s[B]) is the similarity of the two values t[A] and s[B], and θ is a critical point;
the fused discrimination score of attributes A and B.
Entity matching takes into account how the values of multiple attributes match, as well as the dependencies among those attributes. In the likelihood that two entities match, λ is a damping coefficient that compensates for dependencies between attributes, and the other term is the total contribution score of all connected attribute pairs. In that score, φ(A, B) uses a logarithmic function to map the value into the range between 0 and 1, LSM(A, B) is the likelihood that attributes A and B match, and ctr(t[A], s[B]) is the similarity contribution value of the two values t[A] and s[B]. In the formula for ctr(t[A], s[B]), sim(t[A], s[B]) is the similarity of the two values and θ is a user-defined critical point.
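The selection step of entity matching, keeping the pairs whose matching degree exceeds the threshold, can be sketched as below. The matching degree formula itself is not reproduced in this text, so `degree` is a caller-supplied function; `select_matches` and the tuple layout of `candidates` are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]


def select_matches(
    candidates: Iterable[Tuple[str, Record, str, Record]],
    degree: Callable[[Record, Record], float],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Keep candidate pairs whose matching degree is higher than the threshold.

    Each candidate is (id in T1, record in T1, id in T2, record in T2).
    """
    matched = []
    for t_id, t, s_id, s in candidates:
        score = degree(t, s)
        if score > threshold:
            matched.append((t_id, s_id, score))
    # Highest-scoring pairs first, so the most confident matches lead the list.
    matched.sort(key=lambda item: item[2], reverse=True)
    return matched
```

The same shape applies to the selection step of pattern matching, with attribute pairs in place of record pairs and the pattern matching degree in place of `degree`.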
Wherein, the method for pattern match may include: successively to calculate mould to corresponding data to Entities Matching record
Formula matching degree;The record that the pattern match degree is filtered out higher than pattern match degree threshold value is successfully recorded to as pattern match
It is right;Wherein, the calculation method of pattern match degree may include:
When α is to control the contribution score of relative recording pair, pattern match degree
Wherein,
Ctr (t [A], s [B]) is the similarity contribution margin of t [A] and s [B] two values;
Sim (t [A], s [B]) is the similarity of t [A] and s [B] two values, and θ is critical point;
The fused discrimination score of attribute A and B
Specifically, when the initial data tables are Table 1 and Table 2, the data fusion method provided by the invention may proceed as follows.
A seed connected attribute pair is given. The preset threshold for deciding whether an attribute connection or an entity connection matches is 0.7, and the value of θ in Formula 1 is 0.1. In the first round of entity matching, we compute the likelihood that the candidate records match and obtain two matched record pairs. When no further matching record pairs can be found, pattern matching is performed.
The next pattern matching step is carried out on the basis of the two matched record pairs. Likewise, we compute the likelihood that the attributes match, and obtain the matched attribute pairs for this round of pattern matching.
Repeating this process yields all successfully matched record pairs.
As shown in Fig. 3, SM denotes pattern matching and RM denotes entity matching. Throughout the matching process, pattern matching and entity matching alternate, and the result obtained in each round is shown in the figure. Fig. 4 shows the matching results produced by the interaction between entity matching and pattern matching during the process.
Based on the above technical solution, the data fusion method based on pattern matching and entity matching provided by the invention performs entity matching using the record pairs of the given initial connection, then uses the current entity matching results to perform pattern matching, then performs entity matching again according to the pattern matching results, and so on. Each round takes the previous round's matching results as its input and matches again on the basis of the record pairs successfully matched in the previous round. This makes it possible to correct erroneous matches from the previous round, to find record groups that were missed, and to keep revising the pattern matching results until the results of both pattern matching and entity matching stabilize. The precision and recall of entity matching are thereby greatly improved, which improves the accuracy of data fusion and increases the value of the data.
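The alternation described above can be sketched as a simple fixed-point loop. This is a minimal illustration rather than the patented implementation: the names `alternate_matching`, `entity_match`, and `pattern_match` are hypothetical, the two matchers are passed in as plain functions, and the loop only accumulates matches (it does not model the correction of earlier mismatches).

```python
from typing import Callable, Set, Tuple

Pair = Tuple[str, str]
Matcher = Callable[[Set[Pair]], Set[Pair]]


def alternate_matching(seed_attr_pairs: Set[Pair],
                       entity_match: Matcher,
                       pattern_match: Matcher,
                       max_rounds: int = 100) -> Tuple[Set[Pair], Set[Pair]]:
    """Alternate entity and pattern matching until neither finds a new pair."""
    attr_pairs = set(seed_attr_pairs)   # matched attribute pairs so far
    record_pairs: Set[Pair] = set()     # matched record pairs so far
    for _ in range(max_rounds):
        # Entity matching works from the attribute pairs matched so far.
        new_records = entity_match(attr_pairs) - record_pairs
        if not new_records:
            break
        record_pairs |= new_records
        # Pattern matching works from the record pairs matched so far.
        new_attrs = pattern_match(record_pairs) - attr_pairs
        if not new_attrs:
            break
        attr_pairs |= new_attrs
    return attr_pairs, record_pairs


# Toy matchers (hypothetical data) that unlock one another's results.
def toy_entity_match(attrs: Set[Pair]) -> Set[Pair]:
    found = set()
    if ("name", "title") in attrs:
        found.add(("t1", "s1"))
    if ("city", "town") in attrs:
        found.add(("t2", "s2"))
    return found


def toy_pattern_match(records: Set[Pair]) -> Set[Pair]:
    return {("city", "town")} if ("t1", "s1") in records else set()


attrs, records = alternate_matching({("name", "title")},
                                    toy_entity_match, toy_pattern_match)
```

In the toy run, the seed attribute pair yields one record pair, which yields a second attribute pair, which in turn yields a second record pair before the loop stops. The `max_rounds` cap is an added safeguard, not something the description specifies.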
On the basis of the above embodiment, the data fusion method provided by the invention may suffer from "semantic drift": if an error is produced during the iteration, subsequent steps may amplify it, ultimately degrading the algorithm's results. To address this problem, the data fusion method provided by the invention may further include:
computing the instability of the successfully matched record pairs obtained;
removing the matched record pairs whose instability is greater than the instability threshold;
where the instability is computed from sim(a, b), the similarity of the two values a and b; the average similarity of all record values under the record pair (A, B); and m, the number of connected record pairs.
Unstable or uncertain connected record groups are detected with an unbiased estimator: the instability of a record pair is measured by the variance of the similarities of the values under the connected attributes. For a record pair, the instability is computed as the unbiased variance $\frac{1}{m-1}\sum_{(a,b)}\left(sim(a,b)-\overline{sim}\right)^2$, where $\overline{sim}$ is the average similarity of all attribute values under the attribute pair (A, B) and m is the number of connected attribute pairs. We remove the record pairs with higher instability and then recompute the instability of the rest, repeating until the record pairs stabilize. This not only guarantees the quality of the record pairs needed for the next pattern matching step, but also reveals characteristics of some attribute pairs that make record matching more effective.
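The instability check can be illustrated with the standard unbiased sample variance. The sketch below is an assumption-laden example, not the patent's code: `instability` and `prune_unstable` are hypothetical names, the similarity lists are made-up data, and a single pruning pass is shown rather than the repeated recomputation described above.

```python
from statistics import variance
from typing import Dict, List, Tuple

AttrPair = Tuple[str, str]


def instability(similarities: List[float]) -> float:
    """Unbiased sample variance (divides by m - 1) of the similarity scores."""
    if len(similarities) < 2:
        return 0.0  # assumption: fewer than two observations counts as stable
    return variance(similarities)


def prune_unstable(sims_by_pair: Dict[AttrPair, List[float]],
                   threshold: float) -> Dict[AttrPair, List[float]]:
    """Keep only the pairs whose instability does not exceed the threshold."""
    return {pair: sims for pair, sims in sims_by_pair.items()
            if instability(sims) <= threshold}


# Hypothetical similarity observations for two connected pairs.
observed = {
    ("name", "title"): [0.9, 0.8, 0.85],   # consistent similarities
    ("city", "phone"): [0.1, 0.9, 0.5],    # widely scattered similarities
}
stable = prune_unstable(observed, threshold=0.05)
```

`statistics.variance` already divides by m - 1, which is why it is used here instead of `pvariance`. The threshold value 0.05 is arbitrary and chosen only for the example.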
To ensure that only the more accurate attribute pairs are used for record matching, and to prevent errors from being amplified later as far as possible, the method may further include, before entity matching is performed on the data corresponding to the pattern-matching record pairs:
computing the error value of each pattern-matching record pair;
removing the record pairs whose error value is higher than the error threshold;
where the error value of the record pair (A, B) is computed with S(·) as the similarity function and η as the quality threshold.
Abnormal attribute pairs are detected by cross-validation. Denote the set of all attribute pairs by P = {P1, P2, ..., Pk}, and let P - Pi denote P with Pi removed. In each validation round, P - Pi is treated as the training set and Pi as the validation set. We use P - Pi to infer record pairs, and then use the inferred record pairs to compute the likelihood that the attribute pair in Pi matches. Concretely, each Pi represents one attribute pair, and we apply a linear loss function F, where R is the set of record pairs inferred from P - Pi, S(·) is the similarity function, and η is a quality threshold. An error is defined as a single {0, 1} judgment: when the error condition holds, F(S, ((A, B), η)) = 1. The error value of the attribute pair (A, B) is then computed from these judgments. We repeat this validation k times, once for each of the k different subsets of P, and compute the error value each time. We then discard the attribute pairs whose error value is higher than a given threshold. This lets us use only the more accurate attribute pairs for record matching and prevents errors from being amplified later.
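A leave-one-out version of this validation might look like the following sketch. Several parts are assumptions and are not stated in the text: the loss is taken to be 1 when the similarity falls below η, the error value is taken to be the fraction of inferred record pairs with loss 1, and the names `error_values`, `infer_records`, and `similarity` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

AttrPair = Tuple[str, str]
RecordPair = Tuple[str, str]


def error_values(attr_pairs: Sequence[AttrPair],
                 infer_records: Callable[[List[AttrPair]], List[RecordPair]],
                 similarity: Callable[[RecordPair, AttrPair], float],
                 eta: float) -> Dict[AttrPair, float]:
    """Leave-one-out error value for every attribute pair."""
    errors: Dict[AttrPair, float] = {}
    for i, held_out in enumerate(attr_pairs):
        # Training set P - P_i: every attribute pair except the held-out one.
        training = [p for j, p in enumerate(attr_pairs) if j != i]
        records = infer_records(training)
        if not records:
            errors[held_out] = 1.0  # assumption: no evidence counts as an error
            continue
        # Assumed 0/1 loss: 1 when the held-out pair's similarity is below eta.
        losses = [1 if similarity(r, held_out) < eta else 0 for r in records]
        errors[held_out] = sum(losses) / len(records)
    return errors


# Toy setup (hypothetical data): one attribute pair is clearly wrong.
pairs = [("name", "title"), ("city", "town"), ("zip", "phone")]


def toy_infer(training: List[AttrPair]) -> List[RecordPair]:
    return [("t1", "s1"), ("t2", "s2")]


def toy_similarity(record: RecordPair, attr: AttrPair) -> float:
    return 0.1 if attr == ("zip", "phone") else 0.9


errors = error_values(pairs, toy_infer, toy_similarity, eta=0.5)
kept = [p for p in pairs if errors[p] <= 0.3]
```

Passing the inference and similarity steps in as functions keeps the sketch independent of how record pairs are actually inferred, which the text does not spell out here.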
As the interactive process proceeds, more and more record pairs become connected. When an iteration completes, the matching likelihood scores computed at the beginning need to be updated. The interaction between the entity pairs obtained by entity matching and the attribute pairs obtained by attribute matching can be represented by a bipartite graph: one side holds the attribute pairs, the other side holds the entity pairs, and the weights on the edges between them are assigned the corresponding values by the matching likelihood formulas. It can be proved that the algorithm converges.
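One simple way to hold such a bipartite graph in memory is a pair of nested dictionaries, one per side. The class below is only a sketch under that assumption; `MatchGraph` and `set_weight` are hypothetical names, and the weights would come from whatever matching likelihood formula is in use.

```python
from collections import defaultdict
from typing import Dict, Tuple

AttrPair = Tuple[str, str]
EntityPair = Tuple[str, str]


class MatchGraph:
    """Weighted bipartite graph: attribute pairs on one side, entity pairs on the other."""

    def __init__(self) -> None:
        self.by_attr: Dict[AttrPair, Dict[EntityPair, float]] = defaultdict(dict)
        self.by_entity: Dict[EntityPair, Dict[AttrPair, float]] = defaultdict(dict)

    def set_weight(self, attr: AttrPair, entity: EntityPair, weight: float) -> None:
        """Insert or update an edge; call again when a score is refreshed."""
        self.by_attr[attr][entity] = weight
        self.by_entity[entity][attr] = weight


graph = MatchGraph()
graph.set_weight(("name", "title"), ("t1", "s1"), 0.8)
graph.set_weight(("name", "title"), ("t1", "s1"), 0.9)  # score updated after an iteration
```

Keeping both directions indexed makes it cheap to look up every entity pair supported by an attribute pair and vice versa, at the cost of storing each weight twice.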
The data fusion device based on pattern matching and entity matching provided by the invention is introduced below. Referring to Fig. 5, Fig. 5 is a structural block diagram of the data fusion device based on pattern matching and entity matching provided by an embodiment of the invention. The device may include: a receiving unit 100, an entity matching unit 200, an entity judging unit 300, a pattern matching unit 400, a pattern judging unit 500, and an acquiring unit 600.
The receiving unit 100 may be used to receive the successfully pattern-matched record pairs of the initial connection.
The entity matching unit 200 may be used to obtain the successfully pattern-matched record pairs as pattern-matching record pairs, and to perform entity matching on the data corresponding to the pattern-matching record pairs.
The entity judging unit 300 may be used to determine whether there are successfully entity-matched record pairs.
The pattern matching unit 400 may be used to obtain the successfully entity-matched record pairs as entity-matching record pairs, and to perform pattern matching on the data corresponding to the entity-matching record pairs.
The pattern judging unit 500 may be used to determine whether there are successfully pattern-matched record pairs.
The acquiring unit 600 may be used to obtain all successfully matched record pairs when no further successfully matched record pairs are found.
In the data fusion device based on pattern matching and entity matching provided by the invention, the entity matching unit 200 performs entity matching using the record pairs of the given initial connection, the pattern matching unit 400 then performs pattern matching using the current entity matching results, entity matching is then performed again according to the pattern matching results, and so on. Each round takes the previous round's matching results as its input and matches again on the basis of the record pairs successfully matched in the previous round. This makes it possible to correct erroneous matches from the previous round, to find record groups that were missed, and to keep revising the pattern matching results until the results of both pattern matching and entity matching stabilize. The precision and recall of entity matching are thereby greatly improved, which improves the accuracy of data fusion and increases the value of the data. For further details of the data fusion device based on pattern matching and entity matching, refer to the data fusion method based on pattern matching and entity matching described above; they are not repeated here.
The data fusion system based on pattern matching and entity matching provided by the invention is introduced below; for further details of the system, refer to the data fusion device based on pattern matching and entity matching described above. Fig. 6 is a structural block diagram of the data fusion system based on pattern matching and entity matching provided by an embodiment of the invention. The system may include a memory 700 and a processor 800.
The memory 700 may be used to store a computer program.
The processor 800 may be used to implement the steps of the data fusion method based on pattern matching and entity matching when executing the computer program.
The data fusion system based on pattern matching and entity matching provided by the invention can improve the accuracy of data fusion and increase the value of the data.
Referring to Fig. 7, a schematic structural diagram of the data fusion system based on pattern matching and entity matching provided by an embodiment of the invention, the system may vary considerably with configuration or performance. It may include one or more processors (central processing units, CPU) 322, a memory 332, and one or more storage media 330 (for example, one or more mass storage devices) that store application programs 342 or data 344. The memory 332 and the storage medium 330 may provide transient or persistent storage. The program stored in the storage medium 330 may include one or more modules (not shown), and each module may include a series of instruction operations for the data processing device. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and to execute, on the fusion system 301, the series of instruction operations in the storage medium 330.
The fusion system 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps of the data fusion method based on pattern matching and entity matching described above with reference to Fig. 1 may be implemented by the structure of this data fusion system.
The readable storage medium provided by an embodiment of the invention is introduced below. The readable storage medium described below and the data fusion method based on pattern matching and entity matching described above may be referred to in correspondence with each other.
A readable storage medium disclosed by the invention stores a program that, when executed by a processor, implements the steps of the data fusion method based on pattern matching and entity matching.
It should be noted that, for the working process of each unit in the data fusion device based on pattern matching and entity matching in the specific embodiments of the invention, reference may be made to the specific embodiment corresponding to Fig. 1; details are not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, equipment, storage media, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided by the invention, it should be understood that the disclosed devices, equipment, storage media, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a mobile terminal. Based on this understanding, the technical solution of the invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions that cause a mobile terminal (which may be a mobile phone, a tablet computer, or the like) to execute all or part of the steps of the methods of the embodiments of the invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for the relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, in a terminal, or in a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the invention.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The data fusion method, device, system, and readable storage medium based on pattern matching and entity matching provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications to the invention without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of the invention.
Claims (10)
1. A data fusion method based on pattern matching and entity matching, characterized by comprising:
Step 1: receiving the successfully pattern-matched record pairs of an initial connection;
Step 2: obtaining the successfully pattern-matched record pairs as pattern-matching record pairs, and performing entity matching on the data corresponding to the pattern-matching record pairs;
Step 3: determining whether there are successfully entity-matched record pairs; if so, proceeding to Step 4; if not, proceeding to Step 6;
Step 4: obtaining the successfully entity-matched record pairs as entity-matching record pairs, and performing pattern matching on the data corresponding to the entity-matching record pairs;
Step 5: determining whether there are successfully pattern-matched record pairs; if so, returning to Step 2; if not, proceeding to Step 6;
Step 6: obtaining all successfully matched record pairs.
2. The data fusion method based on pattern matching and entity matching according to claim 1, characterized in that the record indexing method for the pattern matching and the entity matching is specifically a multi-attribute interactive index based on q-grams.
3. The data fusion method based on pattern matching and entity matching according to claim 2, characterized in that the method for building the q-gram-based multi-attribute interactive index comprises:
building a dynamic index from the successfully pattern-matched record pairs of the initial connection;
removing the record pairs in the index whose discrimination is higher than the discrimination threshold;
wherein the discrimination is computed from PosT(A), the number of successfully matched record pairs whose values under record A match, and NegT(A), the number of successfully matched record pairs whose values under record A do not match.
4. The data fusion method based on pattern matching and entity matching according to claim 3, characterized in that the entity matching method comprises:
computing, in turn, the entity matching degree of the data corresponding to each pattern-matching record pair;
selecting the record pairs whose entity matching degree is higher than the entity matching degree threshold as the successfully entity-matched record pairs;
wherein the entity matching degree is computed from the following quantities:
λ, a damping coefficient, and the total contribution score of all connected attribute pairs;
LSM(A, B), the likelihood that records A and B match;
ctr(t[A], s[B]), the similarity contribution value of the two values t[A] and s[B];
sim(t[A], s[B]), the similarity of the two values t[A] and s[B], with θ as the threshold;
and the fused discrimination score of attributes A and B.
5. The data fusion method based on pattern matching and entity matching according to claim 3, characterized in that the pattern matching method comprises:
computing, in turn, the pattern matching degree of the data corresponding to each entity-matching record pair;
selecting the record pairs whose pattern matching degree is higher than the pattern matching degree threshold as the successfully pattern-matched record pairs;
wherein the pattern matching degree is computed from the following quantities:
α, which controls the contribution score of the related record pairs;
ctr(t[A], s[B]), the similarity contribution value of the two values t[A] and s[B];
sim(t[A], s[B]), the similarity of the two values t[A] and s[B], with θ as the threshold;
and the fused discrimination score of attributes A and B.
6. The data fusion method based on pattern matching and entity matching according to claim 5, characterized by further comprising:
computing the instability of the successfully matched record pairs obtained;
removing the matched record pairs whose instability is greater than the instability threshold;
wherein the instability is computed from sim(a, b), the similarity of the two values a and b; the average similarity of all record values under the record pair (A, B); and m, the number of connected record pairs.
7. The data fusion method based on pattern matching and entity matching according to any one of claims 1 to 6, characterized in that, before the entity matching is performed on the data corresponding to the pattern-matching record pairs, the method further comprises:
computing the error value of each pattern-matching record pair;
removing the record pairs whose error value is higher than the error threshold;
wherein the error value of the record pair (A, B) is computed with S(·) as the similarity function and η as the quality threshold.
8. A data fusion device based on pattern matching and entity matching, characterized by comprising:
a receiving unit, for receiving the successfully pattern-matched record pairs of an initial connection;
an entity matching unit, for obtaining the successfully pattern-matched record pairs as pattern-matching record pairs and performing entity matching on the data corresponding to the pattern-matching record pairs;
an entity judging unit, for determining whether there are successfully entity-matched record pairs;
a pattern matching unit, for obtaining the successfully entity-matched record pairs as entity-matching record pairs and performing pattern matching on the data corresponding to the entity-matching record pairs;
a pattern judging unit, for determining whether there are successfully pattern-matched record pairs;
an acquiring unit, for obtaining all successfully matched record pairs when no further successfully matched record pairs are found.
9. A data fusion system based on pattern matching and entity matching, characterized by comprising:
a memory, for storing a computer program;
a processor, for implementing the steps of the data fusion method based on pattern matching and entity matching according to any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, characterized in that a program is stored on the readable storage medium, and the program, when executed by a processor, implements the steps of the data fusion method based on pattern matching and entity matching according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810594208.9A CN108960292A (en) | 2018-06-11 | 2018-06-11 | Data fusion method, device, system based on pattern match and Entities Matching |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108960292A true CN108960292A (en) | 2018-12-07 |
Family
ID=64488266
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810594208.9A Pending CN108960292A (en) | 2018-06-11 | 2018-06-11 | Data fusion method, device, system based on pattern match and Entities Matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108960292A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111209384A (en) * | 2020-01-08 | 2020-05-29 | 腾讯科技(深圳)有限公司 | Question and answer data processing method and device based on artificial intelligence and electronic equipment |
CN113760995A (en) * | 2021-09-09 | 2021-12-07 | 上海明略人工智能(集团)有限公司 | Entity linking method, system, equipment and storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111209384A (en) * | 2020-01-08 | 2020-05-29 | 腾讯科技(深圳)有限公司 | Question and answer data processing method and device based on artificial intelligence and electronic equipment |
CN111209384B (en) * | 2020-01-08 | 2023-08-15 | 腾讯科技(深圳)有限公司 | Question-answer data processing method and device based on artificial intelligence and electronic equipment |
CN113760995A (en) * | 2021-09-09 | 2021-12-07 | 上海明略人工智能(集团)有限公司 | Entity linking method, system, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20181207 |