CN108960292A - Data fusion method, device, and system based on pattern matching and entity matching - Google Patents


Info

Publication number
CN108960292A
Authority
CN
China
Prior art keywords
record
pattern match
entities matching
matching
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810594208.9A
Other languages
Chinese (zh)
Inventor
李直旭
顾斌斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201810594208.9A
Publication of CN108960292A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data fusion method based on pattern matching and entity matching. Entity matching is first performed using the given initially connected record pairs; the current entity-matching result is then used to perform pattern matching, whose result in turn drives the next round of entity matching, and so on. Each round takes the previous round's matching result as its input and matches again on the basis of the records matched so far, which makes it possible to correct erroneous matches from the previous round, to find record pairs that were missed, and to keep revising the pattern-matching result, until the results of both pattern matching and entity matching stabilize. The method can improve the accuracy of data fusion and increase the value of the data. The invention also discloses a data fusion device and system based on pattern matching and entity matching and a readable storage medium, which have the same beneficial effects.

Description

Data fusion method, device, and system based on pattern matching and entity matching
Technical field
The present invention relates to the field of electronic technology, and in particular to a data fusion method, device, and system based on pattern matching and entity matching, and a readable storage medium.
Background technique
Data fusion is an information-processing technique in which observations obtained in time sequence are automatically analyzed and integrated under certain criteria in order to complete the required decision and assessment tasks. With the rapid development of computer and communication technology and their increasingly close combination, together with the particularly urgent needs of military applications, data fusion has become an especially important emerging data-processing technology.
In the current era of data explosion, resolving the inconsistency between data sources is a critical problem. Data are hard to fuse because attributes are denoted inconsistently across records and because the data themselves contain errors and other defects. Solving the problem involves two aspects, inconsistency at the pattern (schema) layer and inconsistency between tuples, so fusing data sources requires two steps: pattern matching and entity matching. Pattern matching finds the same attributes in different data sets, while entity matching finds the pairs of instances that denote the same entity within one data set or across data sets. Data fusion then integrates the inconsistent data into a unified format while retaining the information unique to each source.
At present, pattern matching and entity matching are treated as two independent steps when fusing data: pattern matching is done once, and then entity matching is done once. This gives only a single chance to perform each step and forgoes further opportunities to adjust and revise the pattern-matching and entity-matching results, so fusion accuracy is low and a large amount of useful data is ignored.
Therefore, how to improve the accuracy of data fusion and increase the value of the data is a technical problem that those skilled in the art need to solve.
Summary of the invention
The object of the present invention is to provide a data fusion method based on pattern matching and entity matching that can improve the accuracy of data fusion and increase the value of the data. A further object is to provide a data fusion device and system based on pattern matching and entity matching and a readable storage medium that have the same beneficial effects.
To solve the above technical problem, the present invention provides a data fusion method based on pattern matching and entity matching, comprising:
Step 1: receive the initially connected record pairs whose pattern matching succeeded;
Step 2: take the record pairs whose pattern matching succeeded as pattern-matching record pairs, and perform entity matching on the data corresponding to them;
Step 3: determine whether any record pair was successfully entity-matched; if so, go to step 4; otherwise, go to step 6;
Step 4: take the successfully entity-matched record pairs as entity-matching record pairs, and perform pattern matching on the data corresponding to them;
Step 5: determine whether any record pair was successfully pattern-matched; if so, go to step 2; otherwise, go to step 6;
Step 6: obtain all successfully matched record pairs.
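The control flow of steps 1 to 6 can be sketched as a loop that alternates the two matchers until neither finds anything new. This is a minimal sketch, not the claimed implementation: the names `fuse`, `match_entities`, and `match_patterns` are hypothetical, the two matchers are supplied by the caller, and "a successfully matched pair exists" is interpreted here as "a pair not seen before was found", an assumption made so that the loop terminates. The sketch only accumulates pairs; it does not model the correction of earlier matches.

```python
from typing import Callable, Iterable, Set, Tuple

Pair = Tuple[str, str]
# A matcher receives (attribute pairs so far, record pairs so far) and returns pairs.
Matcher = Callable[[Set[Pair], Set[Pair]], Set[Pair]]


def fuse(seed_attr_pairs: Iterable[Pair],
         match_entities: Matcher,
         match_patterns: Matcher) -> Tuple[Set[Pair], Set[Pair]]:
    """Alternate entity matching and pattern matching until no new pair appears."""
    attr_pairs: Set[Pair] = set(seed_attr_pairs)       # step 1: seed pairs
    record_pairs: Set[Pair] = set()
    while True:
        # Step 2: entity matching driven by the attribute pairs matched so far.
        new_records = match_entities(attr_pairs, record_pairs) - record_pairs
        if not new_records:                             # step 3: nothing new
            break
        record_pairs |= new_records
        # Step 4: pattern matching driven by the record pairs matched so far.
        new_attrs = match_patterns(attr_pairs, record_pairs) - attr_pairs
        if not new_attrs:                               # step 5: nothing new
            break
        attr_pairs |= new_attrs
    return attr_pairs, record_pairs                     # step 6: all matched pairs


# Toy matchers (purely illustrative, not results from the patent).
def toy_entities(attrs: Set[Pair], records: Set[Pair]) -> Set[Pair]:
    found: Set[Pair] = set()
    if ("Product", "Product") in attrs:
        found.add(("t1", "s1"))
    if ("WT", "Weight") in attrs:
        found.add(("t2", "s2"))
    return found


def toy_patterns(attrs: Set[Pair], records: Set[Pair]) -> Set[Pair]:
    return {("WT", "Weight")} if ("t1", "s1") in records else set()


attrs, records = fuse({("Product", "Product")}, toy_entities, toy_patterns)
```

With the toy matchers, the first round links one record pair, that pair reveals a second attribute pair, and the second attribute pair then links a further record pair before the loop stops.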
Preferably, the record indexing method used by the pattern matching and the entity matching is specifically a q-gram-based multi-attribute interaction index.
Preferably, the method of building the q-gram-based multi-attribute interaction index comprises:
building a dynamic index from the initially connected record pairs whose pattern matching succeeded;
removing from the index the record pairs whose discrimination degree is higher than a discrimination threshold;
wherein the discrimination degree is calculated as follows: PosT(A) is the number of successfully matched record pairs whose values under record A match, NegT(A) is the number of successfully matched record pairs whose values under record A do not match, and the discrimination degree is [formula missing].
Preferably, the method for the Entities Matching includes:
To pattern match record to corresponding data successively computational entity matching degree;
The record that the Entities Matching degree is filtered out higher than Entities Matching degree threshold value is successfully recorded to as Entities Matching It is right;
Wherein, the calculation method of the Entities Matching degree includes:
When λ be damped coefficient,For all attributes pair connected total contribution score when, it is real Body matching degree
Wherein,
LSM(A, B) is a possibility that record A and B is matched;ctr (t [A], s [B]) is the similarity contribution margin of t [A] and s [B] two values;
Sim (t [A], s [B]) is the similarity of t [A] and s [B] two values, and θ is critical point;
The fused discrimination score of attribute A and B
Preferably, the method for the pattern match includes:
Pattern match degree is successively calculated to corresponding data to Entities Matching record;
The record that the pattern match degree is filtered out higher than pattern match degree threshold value is successfully recorded to as pattern match It is right;
Wherein, the calculation method of the pattern match degree includes:
When α is to control the contribution score of relative recording pair, pattern match degree
Wherein,
Ctr (t [A], s [B]) is the similarity contribution margin of t [A] and s [B] two values;
Sim (t [A], s [B]) is the similarity of t [A] and s [B] two values, and θ is critical point;
The fused discrimination score of attribute A and B
Preferably, the data fusion method based on pattern matching and entity matching further comprises:
computing the instability of the obtained successfully matched record pairs;
removing the matched record pairs whose instability is greater than an instability threshold;
wherein the instability is calculated as follows: [symbol missing] is the average similarity of all record values under the record pair (A, B), m is the number of connected record pairs, sim(a, b) is the similarity of the two values a and b, and the instability is [formula missing].
Preferably, before the entity matching is performed on the data corresponding to the pattern-matching record pairs, the method further comprises:
computing the error value of each pattern-matching record pair;
removing the record pairs whose error value is higher than an error threshold;
wherein the error value is calculated as follows: S(·) is a similarity function, η is a quality threshold, and the error value of the record pair (A, B) is [formula missing].
The present invention discloses a data fusion device based on pattern matching and entity matching, comprising:
a receiving unit, configured to receive the initially connected record pairs whose pattern matching succeeded;
an entity matching unit, configured to take the record pairs whose pattern matching succeeded as pattern-matching record pairs and to perform entity matching on the data corresponding to them;
an entity judging unit, configured to determine whether any record pair was successfully entity-matched;
a pattern matching unit, configured to take the successfully entity-matched record pairs as entity-matching record pairs and to perform pattern matching on the data corresponding to them;
a pattern judging unit, configured to determine whether any record pair was successfully pattern-matched;
an acquiring unit, configured to obtain all successfully matched record pairs when no further record pair is successfully matched.
The present invention discloses a data fusion system based on pattern matching and entity matching, comprising:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the data fusion method based on pattern matching and entity matching when executing the computer program.
The present invention discloses a readable storage medium on which a program is stored; when the program is executed by a processor, it implements the steps of the data fusion method based on pattern matching and entity matching.
The data fusion method based on pattern matching and entity matching provided by the present invention first performs entity matching using the given initially connected record pairs, then uses the current entity-matching result to perform pattern matching, then performs entity matching again according to the pattern-matching result, and so on. Each round takes the previous round's matching result as its input and matches again on the basis of the records matched so far, which makes it possible to correct erroneous matches from the previous round, to find record pairs that were missed, and to keep revising the pattern-matching result, until the results of both pattern matching and entity matching stabilize. This greatly improves the precision and recall of entity matching, and thereby improves the accuracy of data fusion and increases the value of the data.
The invention also discloses a data fusion device and system based on pattern matching and entity matching and a readable storage medium, which have the same beneficial effects and are not described again here.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the data fusion method based on pattern matching and entity matching provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the index building process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the matching process provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the matching results of the interaction between entity matching and pattern matching provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of the data fusion device based on pattern matching and entity matching provided by an embodiment of the present invention;
Fig. 6 is a structural block diagram of the data fusion system based on pattern matching and entity matching provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of the data fusion system based on pattern matching and entity matching provided by an embodiment of the present invention.
Specific embodiment
The core of the present invention is to provide a data fusion method based on pattern matching and entity matching that can improve the accuracy of data fusion and increase the value of the data. Another core of the invention is to provide a data fusion device and system based on pattern matching and entity matching and a readable storage medium that have the same beneficial effects.
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Consider two data tables T1 = {t1, t2, ..., tn} and T2 = {s1, s2, ..., sm} whose pattern layers (schemas) are S1 = {A1, A2, ..., Ap} and S2 = {B1, B2, ..., Bq}, respectively, with ti (1 ≤ i ≤ n) any entity in T1 and sj (1 ≤ j ≤ m) any entity in T2. Pattern matching finds all attribute pairs (Ai, Bj) across S1 and S2 such that the two attributes of each pair describe the same property of the records. Entity matching finds the record pairs (ti, sj) from T1 and T2 that refer to the same entity.
For data integration, pattern matching and entity matching are currently regarded as two independent steps: pattern matching is performed once, then entity matching is performed once, and the process ends. This gives only one chance to perform each step. Yet as more information is collected from one step, the other step can be carried out better, and the one-pass approach forgoes these opportunities to adjust the pattern-matching and entity-matching results. Such a method generally cannot achieve a satisfactory fusion effect.
Referring to Fig. 1, which is a flow chart of the data fusion method based on pattern matching and entity matching provided by an embodiment of the present invention, the method may include:
Step 1: receive the initially connected record pairs whose pattern matching succeeded;
For two data tables T1 = {t1, t2, ..., tn} and T2 = {s1, s2, ..., sm}, the initially connected attribute pairs are received. Referring to Table 1 and Table 2 below, take the initial data table T1 to be Table 1 and T2 to be Table 2. The attributes of T1 are Product, WT, SIZE, CAMERA, ROM, and RAM; the attributes of T2 are Product, Weight, Screen, Front Cam, Back Cam, Memory, and EX-Memory.
In the initial attribute tables, take as an example a received seed connected attribute pair, that is, a record pair whose pattern matching succeeded, of [pair missing]. The data corresponding to this pair are then Iphone 6, Iphone 6plus, Iphone 5C, Samsung Note4, Samsung S6, HuaWei 6+, HuaWei P7, and HuaWei P8 in data table T1, and Iphone 6, Iphone 6+, Note4, Galaxy S6, MI Note, MI 4, Coolpad S6, and MX Note4 in data table T2.
The initially connected record pairs may be specified by the user or selected automatically by a program; the way they are generated is not limited here.
Here, a "record" in a data table refers both to an attribute of the table and to the designation of a data row. Taking Table 1 as an example, the records of data table T1 include the attributes Product, WT, SIZE, CAMERA, ROM, and RAM, and also the row designations t1, t2, t3, t4, t5, t6, t7, and t8. A record pair consists of two records of the same type from the two data tables. For example, record pairs at the attribute level may include Product in T1 with Product in T2, or Product in T1 with WT in T2; record pairs at the row level may include t3 in T1 with s1 in T2, or t8 in T1 with s4 in T2.
Row   Product          WT     SIZE      CAMERA   ROM      RAM
t1    Iphone 6         129g   4.7inch   -        -        1GB
t2    Iphone 6plus     172g   5.5inch   8mp      128GB    1GB
t3    Iphone 5C        112g   4.0inch   12mp     32GB     1GB
t4    Samsung Note4    176g   5.7inch   16mp     16GB     3GB
t5    Samsung S6       -      5.1inch   16mp     32GB     2GB
t6    HuaWei 6+        165g   5.5inch   8mp      -        3GB
t7    HuaWei P7        124g   5.0inch   -        64GB     2GB
t8    HuaWei P8        144g   5.5inch   13mp     -        3GB
Table 1
Product Weight Screen Front Cam Back Cam Memory EX-Memory
s1 Iphone 6 129g 4.7in 8mp 64GB
s2 Iphone 6+ 172g 5.5in 12mp
s3 Note4 176g 5.7in 13mp 16GB 128GB
s4 Galaxy S6 5.1in 8mp 16mp 32GB
s5 MI Note 5.7in 8mp 16GB 32GB
s6 MI 4 149g 5.0in 13mp 13mp 64GB
s7 Coolpad S6 5.95in 16mp 32GB 64GB
s8 MX Note4 145g 16mp 16mp 32GB 16GB
Table 2
Step 2: take the record pairs whose pattern matching succeeded as pattern-matching record pairs, and perform entity matching on the data corresponding to them;
Pattern matching finds the same attributes in different data sets, while entity matching finds the record pairs that denote the same entity within one data set or across data sets.
Once the record pairs whose pattern matching succeeded have been obtained as pattern-matching record pairs, entity matching is performed on them.
For example, if the record pair whose pattern matching succeeded is [pair missing], entity matching is performed on its corresponding data, that is, between Iphone 6, Iphone 6plus, Iphone 5C, Samsung Note4, Samsung S6, HuaWei 6+, HuaWei P7, and HuaWei P8 in data table T1 and Iphone 6, Iphone 6+, Note4, Galaxy S6, MI Note, MI 4, Coolpad S6, and MX Note4 in data table T2.
Step 3: determine whether any record pair was successfully entity-matched; if so, go to step 4; otherwise, go to step 6;
After entity matching, it is determined whether any successfully matched record pair was output. If so, performing pattern matching on the result makes it possible to further correct the entity-matching result and improve the precision of data fusion.
Step 4: take the successfully entity-matched record pairs as entity-matching record pairs, and perform pattern matching on the data corresponding to them;
The successfully entity-matched record pairs are taken as entity-matching record pairs and used as input for pattern matching; the detailed pattern-matching procedure is not limited here.
For example, if the initially received record pair whose pattern matching succeeded is [pair missing], entity matching on its corresponding data yields the entity-matching record pairs [pair missing] and [pair missing], which are then used as input for pattern matching.
Step 5: determine whether any record pair was successfully pattern-matched; if so, go to step 2; otherwise, go to step 6;
If pattern matching yields successfully matched record pairs, those pairs are used again as the input data for entity matching.
Pattern matching and entity matching are executed alternately. Each entity-matching step finds more connected records, which help the next pattern-matching step find attribute pairs beyond those currently connected. Likewise, each pattern-matching step finds more attribute pairs, which help the next entity-matching step find record pairs, so that matching records are found iteratively.
Step 6: obtain all successfully matched record pairs.
The matching process in the present invention compares only the values under the attributes of records that have already been matched. The data integration method in which pattern matching and entity matching interact is given some seed attribute connections and then alternately executes entity matching and pattern matching; that is, the result of the previous round of pattern matching (entity matching) is used to carry out the next round of entity matching (pattern matching).
In pattern matching, if two records match, their values under the same attribute should be very similar, so comparing the attribute values of the currently matched records makes it possible to distinguish, at a finer grain, attributes that merely have similar value ranges. The iterative algorithm first performs entity matching using the given seed attribute connections, then performs pattern matching using the current entity-matching result, and then performs entity matching again. This makes it possible to correct erroneous matches from the previous round or to find record pairs that were missed; for the same reason, the pattern-matching result is also continually revised. The method stops when the results of both pattern matching and entity matching stabilize.
Because the invention matches many records and uses many records for entity matching, a q-gram-based segmentation method is preferably used to reduce matching time. Indexing records with a q-gram-based multi-attribute interaction index can greatly reduce indexing time and improve the efficiency of data fusion.
Specifically, the method of building the q-gram-based multi-attribute interaction index may include:
building a dynamic index from the initially connected record pairs whose pattern matching succeeded;
removing from the index the record pairs whose discrimination degree is higher than a discrimination threshold;
wherein the discrimination degree is calculated as follows: PosT(A) is the number of successfully matched record pairs whose values under record A match, NegT(A) is the number of successfully matched record pairs whose values under record A do not match, and the discrimination degree is [formula missing].
The invention introduces the discrimination score IdC, a measure of an attribute's ability to distinguish one entity from other entities; it reflects how important an attribute is for entity matching. For an attribute A in a data table, the discrimination score can be expressed as [formula missing], where PosT(A) is the number of matched record pairs in the training set whose values under attribute A match, and NegT(A) is the number of matched record pairs in the training set whose values under attribute A do not match. For two attributes A and B, the fused IdC is expressed by the following formula: [formula missing].
Given a character string s, the set of q-grams generated from s is Gms(s, q) = {gm_1, gm_2, ..., gm_{|s|-q+1}}, where gm_i consists of the i-th through (i+q-1)-th characters of s in their natural order. A continuous q-gram sequence of s of length l is then defined as a sequence of l consecutive q-grams in Gms(s, q).
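The q-gram definition above translates directly into a few lines of code. The following is a minimal sketch; the function names `qgrams` and `qgram_runs` are illustrative and do not come from the patent.

```python
from typing import List, Tuple


def qgrams(s: str, q: int) -> List[str]:
    """Return the |s|-q+1 q-grams of s in natural order (empty if s is shorter than q)."""
    if q <= 0:
        raise ValueError("q must be positive")
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_runs(s: str, q: int, l: int) -> List[Tuple[str, ...]]:
    """Return every run of l consecutive q-grams of s."""
    if l <= 0:
        raise ValueError("l must be positive")
    grams = qgrams(s, q)
    return [tuple(grams[i:i + l]) for i in range(len(grams) - l + 1)]


print(qgrams("Note4", 2))         # ['No', 'ot', 'te', 'e4']
print(qgram_runs("Note4", 2, 2))  # [('No', 'ot'), ('ot', 'te'), ('te', 'e4')]
```

A string shorter than q yields no q-grams, which matches the count |s|-q+1 dropping to zero or below.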
The index is not built in one pass but dynamically. First, based on the above definition, an index is built on the given seed attributes. Then, as more attributes are matched, new indexes are built on those attributes whose IdC score is greater than a threshold. The result is an index over multiple attributes, which is the interaction index.
Specifically, the greedy record-matching strategy based on the interaction index is described below.
Suppose some attributes have already been matched, potentially matching records have been placed into blocks, and an index on the records has been built over these attributes. A record block selected for comparison should satisfy two conditions: (1) it requires fewer value comparisons; (2) it has a higher probability of producing matched record pairs. Specifically, a block is Block = ({LR}, {RR}), where LR is a set of records from one table and RR is a set of records from the other table. The priority of a block is estimated by the following formula: [formula missing], where AttrPair_Block is the set of attribute pairs on which the index was built and [symbol missing] is the average number of comparisons needed to obtain one matched record pair. When a block Block = ({LR}, {RR}) is selected for record comparison, the similarity of the values under the matched attributes is computed, and the likelihood that the records match is then obtained. A maximum can be set on the number of record pairs used for pattern matching, so that an overly long pattern-matching step does not delay record matching.
In the interaction index, a block with a high priority value usually produces correct record pairs, because such blocks usually contain few records and those records are highly similar to one another.
This greedy algorithm finds matched records quickly, which avoids many unnecessary comparisons when matching the remaining records. Deriving theoretical upper and lower bounds on the record-matching likelihood further reduces the number of value comparisons under attribute pairs, which improves the efficiency of the algorithm.
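The blocking idea can be illustrated with a q-gram inverted index over one matched attribute pair. This is a minimal sketch under stated assumptions, not the patented index: the names `build_blocks` and `order_blocks` are hypothetical, values are lowercased before indexing, and because the priority formula is not available here, blocks are simply ordered by the number of value comparisons they require (|LR| × |RR|), which reflects only condition (1) above.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Block = Tuple[Set[str], Set[str]]  # ({LR}, {RR}): record ids from each table


def qgram_set(s: str, q: int) -> Set[str]:
    """Distinct q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_blocks(left: Dict[str, str], right: Dict[str, str], q: int = 3) -> Dict[str, Block]:
    """Group record ids of both tables by the q-grams their values share."""
    idx_left: Dict[str, Set[str]] = defaultdict(set)
    idx_right: Dict[str, Set[str]] = defaultdict(set)
    for rid, value in left.items():
        for gram in qgram_set(value.lower(), q):
            idx_left[gram].add(rid)
    for rid, value in right.items():
        for gram in qgram_set(value.lower(), q):
            idx_right[gram].add(rid)
    # Only q-grams present on both sides can produce candidate pairs.
    shared = idx_left.keys() & idx_right.keys()
    return {gram: (idx_left[gram], idx_right[gram]) for gram in shared}


def order_blocks(blocks: Dict[str, Block]) -> List[Tuple[str, Block]]:
    """Stand-in priority: blocks needing fewer comparisons come first."""
    return sorted(blocks.items(), key=lambda kv: (len(kv[1][0]) * len(kv[1][1]), kv[0]))


# Product values taken from Table 1 (t4, t5) and Table 2 (s3, s4).
left = {"t4": "Samsung Note4", "t5": "Samsung S6"}
right = {"s3": "Note4", "s4": "Galaxy S6"}
blocks = build_blocks(left, right)
for gram, (lr, rr) in order_blocks(blocks):
    print(repr(gram), sorted(lr), sorted(rr))
```

On this input, only t4 with s3 and t5 with s4 ever share a block, so two comparisons replace the four that a full cross product would need.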
The interaction index built from Table 1 and Table 2 is used as an example below; the index building process is shown in Fig. 2.
With this index, obtaining the final record-matching result requires comparing at most:
[record pairs missing] and [record pairs missing]. The upper and lower bounds on the record-matching likelihood further reduce the number of comparisons considerably, so the index improves the efficiency of the algorithm well.
Based on the discrimination degree introduced above, the entity matching process may specifically be: computing, in turn, the entity matching degree of the data corresponding to each pattern-matching record pair; and selecting the record pairs whose entity matching degree is higher than an entity matching degree threshold as the successfully entity-matched record pairs;
wherein the entity matching degree is calculated as follows:
λ is a damping coefficient and [symbol missing] is the total contribution score of all connected attribute pairs; the entity matching degree is [formula missing],
wherein [formula missing];
LSM(A, B) is the likelihood that records A and B match; ctr(t[A], s[B]) is the similarity contribution of the two values t[A] and s[B] [formula missing];
sim(t[A], s[B]) is the similarity of the two values t[A] and s[B], and θ is a critical point;
the fused discrimination score of attributes A and B is [formula missing].
Entity matching takes into account the matching of multiple attribute values and also the dependencies among these attributes. The likelihood that two entities match is calculated with the following formula: [formula missing], where λ is the damping coefficient used to compensate for the dependencies between attributes and [symbol missing] is the total contribution score of all connected attribute pairs. Specifically, [formula missing], where φ(A, B) maps the above value into the range between 0 and 1 using a logarithmic function; the specific formula is [formula missing], where LSM(A, B) is the likelihood that attributes A and B match. ctr(t[A], s[B]) is the similarity contribution of the two values t[A] and s[B], and its calculation formula is specifically: [formula missing],
where sim(t[A], s[B]) is the similarity of the two values and θ is a user-defined critical point.
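The "score every candidate pair, then keep those above a threshold" step can be sketched as follows. This is only an illustration of the filtering structure, not the patent's scoring: the names `value_sim`, `match_score`, and `filter_matches` are hypothetical, the value similarity is `difflib.SequenceMatcher.ratio()` from the Python standard library used as a stand-in for sim, the score is a plain average over the connected attribute pairs, and the damping coefficient λ, the contribution function ctr, and the critical point θ are not modeled. The threshold of 0.7 is the value used in the worked example of this description.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]  # attribute name -> value (None if missing)


def value_sim(a: str, b: str) -> float:
    """Stand-in value similarity in [0, 1] (assumption: not the patent's sim)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(t: Record, s: Record, attr_pairs: List[Tuple[str, str]]) -> float:
    """Average similarity over connected attribute pairs where both values exist."""
    sims = [value_sim(t[a], s[b]) for a, b in attr_pairs if t.get(a) and s.get(b)]
    return sum(sims) / len(sims) if sims else 0.0


def filter_matches(left: Dict[str, Record], right: Dict[str, Record],
                   attr_pairs: List[Tuple[str, str]],
                   threshold: float = 0.7) -> List[Tuple[str, str]]:
    """Keep the record pairs whose score is strictly above the threshold."""
    return [(ti, sj)
            for ti, t in left.items()
            for sj, s in right.items()
            if match_score(t, s, attr_pairs) > threshold]


# Rows taken from Table 1 (t1, t4) and Table 2 (s1, s3).
left = {"t1": {"Product": "Iphone 6", "WT": "129g"},
        "t4": {"Product": "Samsung Note4", "WT": "176g"}}
right = {"s1": {"Product": "Iphone 6", "Weight": "129g"},
         "s3": {"Product": "Note4", "Weight": "176g"}}
pairs = filter_matches(left, right, [("Product", "Product"), ("WT", "Weight")])
print(pairs)
```

The second attribute pair matters here: "Samsung Note4" and "Note4" alone score below 0.7, and it is the identical weight that lifts the pair over the threshold, which is the kind of gain the interaction between pattern matching and entity matching is meant to provide.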
Wherein, the method for pattern match may include: successively to calculate mould to corresponding data to Entities Matching record Formula matching degree;The record that the pattern match degree is filtered out higher than pattern match degree threshold value is successfully recorded to as pattern match It is right;Wherein, the calculation method of pattern match degree may include:
When α is to control the contribution score of relative recording pair, pattern match degree
Wherein,
Ctr (t [A], s [B]) is the similarity contribution margin of t [A] and s [B] two values;
Sim (t [A], s [B]) is the similarity of t [A] and s [B] two values, and θ is critical point;
The fused discrimination score of attribute A and B
Specifically, when the initial data tables are Table 1 and Table 2, the data fusion method provided by the present invention may proceed as follows. A seed connected attribute pair is given: [pair missing]. Assume [values missing], that the preset threshold for deciding whether an attribute connection or an entity connection matches is 0.7, and that the value of θ in formula 1 is 0.1. In the first round of entity matching, the likelihood that the records match is calculated: [calculation missing].
The matched pairs obtained are [pair missing] and [pair missing]. When no more matched record pairs can be found, pattern matching is performed.
The next round of pattern matching is carried out on the basis of the two matched record pairs [pair missing] and [pair missing]. Likewise, the likelihood that the attributes match is calculated: [calculation missing]. This round of pattern matching yields the matched attribute pair [pair missing].
Repeating this process yields all successfully matched record pairs: [pairs missing].
As shown in Fig. 3, where SM denotes pattern matching and RM denotes entity matching, pattern matching and entity matching alternate throughout the matching process, and the result of each round of matching is shown in the figure. Fig. 4 shows the matching results produced by the interaction between entity matching and pattern matching during the matching process.
With the above technical solution, the data fusion method based on pattern matching and entity matching provided by the present invention first performs entity matching using the given initially connected record pairs, then performs pattern matching using the current entity matching result, then performs entity matching again using the pattern matching result, and so on. Each round takes the previous round's matching result as input and matches again on the basis of the previous round's successfully matched record pairs. This makes it possible to correct erroneous matches from the previous round, to find record groups that were missed, and to keep revising the pattern matching result until the results of both pattern matching and entity matching stabilize. This greatly improves the precision and recall of entity matching, which in turn improves the accuracy of data fusion and increases the value of the data.
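The alternation between entity matching and pattern matching can be sketched as a simple loop. This is a hedged illustration, not the patented procedure: `interactive_matching`, `match_entities`, `match_patterns`, and `max_rounds` are hypothetical names, the two matchers are supplied by the caller, the loop is assumed to stop when a round produces no new pairs, and the sketch only accumulates matches rather than revising earlier ones.

```python
from typing import Callable

PairSet = set[tuple[str, str]]


def interactive_matching(
    seed_attr_pairs: PairSet,
    match_entities: Callable[[PairSet], PairSet],
    match_patterns: Callable[[PairSet], PairSet],
    max_rounds: int = 100,
) -> tuple[PairSet, PairSet]:
    """Alternate entity matching and pattern matching until nothing new appears."""
    attr_pairs = set(seed_attr_pairs)
    record_pairs: PairSet = set()
    for _ in range(max_rounds):
        # Entity matching uses the attribute pairs matched so far.
        new_records = match_entities(attr_pairs) - record_pairs
        if not new_records:
            break
        record_pairs |= new_records
        # Pattern matching uses the record pairs matched so far.
        new_attrs = match_patterns(record_pairs) - attr_pairs
        if not new_attrs:
            break
        attr_pairs |= new_attrs
    return attr_pairs, record_pairs


# Toy stand-ins for the two matchers (hypothetical data).
def toy_entities(attrs: PairSet) -> PairSet:
    found = {("t1", "s1")}
    if ("addr", "location") in attrs:
        found.add(("t2", "s2"))
    return found


def toy_patterns(records: PairSet) -> PairSet:
    found = {("name", "title")}
    if ("t1", "s1") in records:
        found.add(("addr", "location"))
    return found


attrs, records = interactive_matching({("name", "title")}, toy_entities, toy_patterns)
```

In the toy run, the seed attribute pair connects one record pair, that record pair reveals a second attribute pair, and the second attribute pair connects one more record pair before the loop stops.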
Building on the above embodiment, the data fusion method provided by the present invention may suffer from "semantic drift": if some errors arise during iteration, later steps may amplify them and ultimately degrade the results of the algorithm. To address this problem, the data fusion method provided by the present invention may further include:
computing the instability of the successfully matched record pairs obtained;
removing the matched record pairs whose instability is greater than the instability threshold;
where the instability is calculated from the average similarity of all record values under the record pair (A, B), the number m of connected record pairs, and sim(a, b), the similarity of the two values a and b.
Unstable or uncertain connected record groups are detected by means of unbiased estimation: the instability of a record pair is measured by the variance of the similarities of its values under the connected attributes. For a given record pair, the instability is computed as the unbiased variance,
where the mean is the average similarity of all attribute values under the attribute pairs (A, B), and m is the number of connected attribute pairs. Record pairs with high instability are removed and the instability is then recomputed, until the records stabilize. This method not only ensures the quality of the record pairs needed for the next step of pattern matching, but also reveals characteristics of some attribute pairs that make record matching more effective.
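The instability check can be illustrated with the standard unbiased sample variance, that is, the sum of squared deviations from the mean similarity divided by m - 1. The sketch below assumes that reading; the function names, the record-pair labels, the sample similarities, and the threshold value are all hypothetical.

```python
from statistics import variance


def instability(similarities: list[float]) -> float:
    """Unbiased sample variance of the per-attribute similarities of one record pair."""
    if len(similarities) < 2:
        # Assumption: a single connected attribute has no measurable spread.
        return 0.0
    return variance(similarities)  # divides by m - 1


def drop_unstable(
    pairs: dict[str, list[float]], threshold: float
) -> dict[str, list[float]]:
    """Keep only the record pairs whose instability does not exceed the threshold."""
    return {key: sims for key, sims in pairs.items() if instability(sims) <= threshold}


pairs = {
    "t1-s1": [0.9, 0.95, 0.85],  # consistent similarities across attributes
    "t2-s2": [0.95, 0.1, 0.9],   # one attribute disagrees strongly
}
stable = drop_unstable(pairs, threshold=0.05)
```

A record pair whose attributes agree to a similar degree has low variance and is kept, while a pair with one strongly disagreeing attribute is removed before the next round of pattern matching.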
To use only attribute pairs with higher accuracy for record matching, and to prevent errors from being amplified later as far as possible, the following may also be performed before entity matching is carried out on the data corresponding to the pattern-matching record pairs:
computing the error value of each pattern-matching record pair;
removing the record pairs whose error value is higher than the error threshold;
where the error value of the record pair (A, B) is calculated using the similarity function S(·) and the quality threshold η.
Abnormal attribute pairs are detected by cross-validation. Denote the set of all attribute pairs by P = {P1, P2, ..., Pk}, and let P-Pi denote P with Pi removed. In each validation round, P-Pi is treated as the training set and Pi as the validation set. P-Pi is used to infer record pairs, and the likelihood that the attribute pair in Pi matches is then computed from the inferred record pairs. Concretely, each Pi represents one attribute pair, and a linear loss function F is applied, where R is the set of record pairs inferred from P-Pi, S(·) is the similarity function, and η is a quality threshold. The error is a single judgment in {0, 1}: F(S, ((A, B), η)) = 1 when the condition on S and η is met. The error value of the attribute pair (A, B) is then computed from these judgments. The validation is repeated k times, once for each of the k different subsets of P, and the error values are calculated. Attribute pairs whose error value exceeds a given threshold are then discarded. In this way only attribute pairs with higher accuracy are used for record matching, which prevents errors from being amplified later.
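The leave-one-out validation can be sketched as follows. Two points are assumptions made for illustration only: the 0/1 loss is taken to be 1 when the similarity of the held-out attribute values falls below η, and the error value is taken to be the mean loss over the inferred record pairs. The names `loo_error_values` and `infer_records`, and the toy records, are hypothetical.

```python
from typing import Callable

AttrPair = tuple[str, str]
RecordPair = tuple[dict, dict]


def loo_error_values(
    attr_pairs: list[AttrPair],
    infer_records: Callable[[list[AttrPair]], list[RecordPair]],
    sim: Callable[[object, object], float],
    eta: float,
) -> dict[AttrPair, float]:
    """Leave-one-out check: score each attribute pair on records inferred without it."""
    errors: dict[AttrPair, float] = {}
    for held_out in attr_pairs:
        training = [p for p in attr_pairs if p != held_out]
        records = infer_records(training)
        a, b = held_out
        # Assumed 0/1 loss: 1 when the held-out attribute values look dissimilar.
        losses = [1 if sim(t[a], s[b]) < eta else 0 for t, s in records]
        # Assumed aggregation: mean loss over the inferred record pairs.
        errors[held_out] = sum(losses) / len(losses) if losses else 0.0
    return errors


# Toy records from two tables (hypothetical data).
t1 = {"name": "Ann Lee", "city": "Suzhou", "code": "A1"}
s1 = {"title": "Ann Lee", "town": "Suzhou", "ref": "zz"}
t2 = {"name": "Bo Chen", "city": "Nanjing", "code": "B2"}
s2 = {"title": "Bo Chen", "town": "Nanjing", "ref": "yy"}


def exact(x: object, y: object) -> float:
    return 1.0 if x == y else 0.0


def fixed_records(training: list[AttrPair]) -> list[RecordPair]:
    # Stand-in for inference from the training attribute pairs.
    return [(t1, s1), (t2, s2)]


errors = loo_error_values(
    [("name", "title"), ("city", "town"), ("code", "ref")],
    fixed_records,
    exact,
    eta=0.5,
)
```

Here the pair ("code", "ref") disagrees on every inferred record pair, so it receives the highest error value and would be the one discarded under any threshold below 1.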
As the interaction proceeds, more and more record pairs become connected. When an iteration completes, the matching likelihood scores computed at the start need to be updated. The interaction between the entity pairs obtained by entity matching and the attribute pairs obtained by attribute matching can be represented by a bipartite graph: attribute pairs form one side, entity pairs form the other, and the weights on the edges between them are assigned by the matching likelihood formulas. It can be proved that the algorithm converges.
The data fusion device based on pattern matching and entity matching provided by the present invention is introduced below. Referring to Fig. 5, Fig. 5 is a structural block diagram of the data fusion device based on pattern matching and entity matching provided by an embodiment of the present invention. The device may include: a receiving unit 100, an entity matching unit 200, an entity judging unit 300, a pattern matching unit 400, a pattern judging unit 500, and an acquiring unit 600.
The receiving unit 100 may be used to receive the initially connected, successfully pattern-matched record pairs.
The entity matching unit 200 may be used to obtain the successfully pattern-matched record pairs as pattern-matching record pairs and to perform entity matching on the data corresponding to the pattern-matching record pairs.
The entity judging unit 300 may be used to judge whether there are successfully entity-matched record pairs.
The pattern matching unit 400 may be used to obtain the successfully entity-matched record pairs as entity-matching record pairs and to perform pattern matching on the data corresponding to the entity-matching record pairs.
The pattern judging unit 500 may be used to judge whether there are successfully pattern-matched record pairs.
The acquiring unit 600 may be used to obtain all successfully matched record pairs when there are no further successfully matched record pairs.
In the data fusion device based on pattern matching and entity matching provided by the present invention, the entity matching unit 200 performs entity matching using the given initially connected record pairs, the pattern matching unit 400 then performs pattern matching using the current entity matching result, entity matching is then performed again using the pattern matching result, and so on. Each round takes the previous round's matching result as input and matches again on the basis of the previous round's successfully matched record pairs. This makes it possible to correct erroneous matches from the previous round, to find record groups that were missed, and to keep revising the pattern matching result until the results of both pattern matching and entity matching stabilize. This greatly improves the precision and recall of entity matching, which in turn improves the accuracy of data fusion and increases the value of the data. For further details of the data fusion device, refer to the data fusion method based on pattern matching and entity matching described above; they are not repeated here.
The data fusion system based on pattern matching and entity matching provided by the present invention is introduced below. For further details of the data fusion system, refer to the data fusion device based on pattern matching and entity matching described above. Fig. 6 is a structural block diagram of the data fusion system based on pattern matching and entity matching provided by an embodiment of the present invention. The system may include a memory 700 and a processor 800.
The memory 700 may be used to store a computer program.
The processor 800 may be used to implement the steps of the data fusion method based on pattern matching and entity matching when executing the computer program.
The data fusion system based on pattern matching and entity matching provided by the present invention can improve the accuracy of data fusion and increase the value of the data.
Referring to Fig. 7, which is a schematic structural diagram of the data fusion system based on pattern matching and entity matching provided by an embodiment of the present invention, the system may vary considerably with configuration or performance. It may include one or more central processing units (CPU) 322 (for example, one or more processors), a memory 332, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 342 or data 344. The memory 332 and the storage medium 330 may provide transient or persistent storage. The program stored in the storage medium 330 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations for the data processing equipment. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and to execute, on the fusion system 301, the series of instruction operations in the storage medium 330.
The fusion system 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps of the data fusion method based on pattern matching and entity matching described above with reference to Fig. 1 can be implemented by the structure of this data fusion system.
The readable storage medium provided by an embodiment of the present invention is introduced below. The readable storage medium described below and the data fusion method based on pattern matching and entity matching described above may be read with reference to each other.
The present invention discloses a readable storage medium on which a program is stored; when the program is executed by a processor, the steps of the data fusion method based on pattern matching and entity matching are implemented.
It should be noted that, for the working process of each unit in the data fusion device based on pattern matching and entity matching in the specific embodiments of the present invention, refer to the specific embodiment corresponding to Fig. 1; details are not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the device, equipment, storage medium, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed device, equipment, storage medium, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a mobile terminal. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions that cause a mobile terminal (which may be a mobile phone, a tablet computer, etc.) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be read with reference to each other. Because the device disclosed in an embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for the relevant points, refer to the description of the method.
Professionals will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The data fusion method, device, system, and readable storage medium based on pattern matching and entity matching provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications to the present invention without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims (10)

1. A data fusion method based on pattern matching and entity matching, characterized by comprising:
Step 1: receiving the initially connected, successfully pattern-matched record pairs;
Step 2: obtaining the successfully pattern-matched record pairs as pattern-matching record pairs, and performing entity matching on the data corresponding to the pattern-matching record pairs;
Step 3: judging whether there are successfully entity-matched record pairs; if so, proceeding to Step 4; if there are no successfully entity-matched record pairs, proceeding to Step 6;
Step 4: obtaining the successfully entity-matched record pairs as entity-matching record pairs, and performing pattern matching on the data corresponding to the entity-matching record pairs;
Step 5: judging whether there are successfully pattern-matched record pairs; if so, returning to Step 2; if there are no successfully pattern-matched record pairs, proceeding to Step 6;
Step 6: obtaining all successfully matched record pairs.
2. The data fusion method based on pattern matching and entity matching according to claim 1, characterized in that the record indexing method for the pattern matching and the entity matching is specifically a q-gram-based multi-attribute interactive index.
3. The data fusion method based on pattern matching and entity matching according to claim 2, characterized in that the method of building the q-gram-based multi-attribute interactive index comprises:
building a dynamic index from the initially connected, successfully pattern-matched record pairs;
removing from the index the record pairs whose discrimination is higher than the discrimination threshold;
wherein the discrimination is calculated from PosT(A), the number of values under record A that match among the successfully matched record pairs, and NegT(A), the number of values under record A that do not match among the successfully matched record pairs.
4. The data fusion method based on pattern matching and entity matching according to claim 3, characterized in that the entity matching method comprises:
computing, in turn, the entity matching degree of the data corresponding to the pattern-matching record pairs;
selecting the record pairs whose entity matching degree is higher than the entity matching degree threshold as the successfully entity-matched record pairs;
wherein the entity matching degree is calculated as follows:
the entity matching degree is computed with λ as the damping coefficient and with the total contribution score of all connected attribute pairs,
wherein:
LSM(A, B) is the likelihood that records A and B match; ctr(t[A], s[B]) is the similarity contribution value of the two values t[A] and s[B];
sim(t[A], s[B]) is the similarity of the two values t[A] and s[B], and θ is the critical point;
the remaining term is the fused discrimination score of attributes A and B.
5. The data fusion method based on pattern matching and entity matching according to claim 3, characterized in that the pattern matching method comprises:
computing, in turn, the pattern matching degree of the data corresponding to the entity-matching record pairs;
selecting the record pairs whose pattern matching degree is higher than the pattern matching degree threshold as the successfully pattern-matched record pairs;
wherein the pattern matching degree is calculated as follows:
the pattern matching degree is computed with α as the coefficient that controls the contribution score of the related record pairs,
wherein:
ctr(t[A], s[B]) is the similarity contribution value of the two values t[A] and s[B];
sim(t[A], s[B]) is the similarity of the two values t[A] and s[B], and θ is the critical point;
the remaining term is the fused discrimination score of attributes A and B.
6. The data fusion method based on pattern matching and entity matching according to claim 5, characterized by further comprising:
computing the instability of the successfully matched record pairs obtained;
removing the matched record pairs whose instability is greater than the instability threshold;
wherein the instability is calculated from the average similarity of all record values under the record pair (A, B), the number m of connected record pairs, and sim(a, b), the similarity of the two values a and b.
7. The data fusion method based on pattern matching and entity matching according to any one of claims 1 to 6, characterized in that, before entity matching is performed on the data corresponding to the pattern-matching record pairs, the method further comprises:
computing the error value of each pattern-matching record pair;
removing the record pairs whose error value is higher than the error threshold;
wherein the error value of the record pair (A, B) is calculated using the similarity function S(·) and the quality threshold η.
8. A data fusion device based on pattern matching and entity matching, characterized by comprising:
a receiving unit, for receiving the initially connected, successfully pattern-matched record pairs;
an entity matching unit, for obtaining the successfully pattern-matched record pairs as pattern-matching record pairs and performing entity matching on the data corresponding to the pattern-matching record pairs;
an entity judging unit, for judging whether there are successfully entity-matched record pairs;
a pattern matching unit, for obtaining the successfully entity-matched record pairs as entity-matching record pairs and performing pattern matching on the data corresponding to the entity-matching record pairs;
a pattern judging unit, for judging whether there are successfully pattern-matched record pairs;
an acquiring unit, for obtaining all successfully matched record pairs when there are no further successfully matched record pairs.
9. A data fusion system based on pattern matching and entity matching, characterized by comprising:
a memory, for storing a computer program;
a processor, for implementing the steps of the data fusion method based on pattern matching and entity matching according to any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, characterized in that a program is stored on the readable storage medium, and when the program is executed by a processor, the steps of the data fusion method based on pattern matching and entity matching according to any one of claims 1 to 7 are implemented.
CN201810594208.9A 2018-06-11 2018-06-11 Data fusion method, device, system based on pattern match and Entities Matching Pending CN108960292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810594208.9A CN108960292A (en) 2018-06-11 2018-06-11 Data fusion method, device, system based on pattern match and Entities Matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810594208.9A CN108960292A (en) 2018-06-11 2018-06-11 Data fusion method, device, system based on pattern match and Entities Matching

Publications (1)

Publication Number Publication Date
CN108960292A true CN108960292A (en) 2018-12-07

Family

ID=64488266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810594208.9A Pending CN108960292A (en) 2018-06-11 2018-06-11 Data fusion method, device, system based on pattern match and Entities Matching

Country Status (1)

Country Link
CN (1) CN108960292A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209384A (en) * 2020-01-08 2020-05-29 腾讯科技(深圳)有限公司 Question and answer data processing method and device based on artificial intelligence and electronic equipment
CN111209384B (en) * 2020-01-08 2023-08-15 腾讯科技(深圳)有限公司 Question-answer data processing method and device based on artificial intelligence and electronic equipment
CN113760995A (en) * 2021-09-09 2021-12-07 上海明略人工智能(集团)有限公司 Entity linking method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181207