CN104166719A - Matching method based on generalization bi-direction similarity connection technique - Google Patents

Info

Publication number
CN104166719A
Authority
CN
China
Prior art keywords
matching
data
mapping
extensive
way similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410407666.9A
Other languages
Chinese (zh)
Other versions
CN104166719B (en)
Inventor
王朝坤
王萌
汪浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-08-19
Filing date
2014-08-19
Publication date
2014-11-26
Application filed by Tsinghua University
Priority to CN201410407666.9A
Publication of CN104166719A
Application granted
Publication of CN104166719B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24553: Query execution of query operations
    • G06F16/24558: Binary matching operations
    • G06F16/2456: Join operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a matching method based on a generalization bi-direction similarity connection technique, and relates to the technical field of computer networks. The method includes the steps of obtaining data of two parties to be matched and a given threshold value for limiting the matching degree, carrying out bi-direction matching on factual data of each party and expected data of the other party, filtering a large quantity of records which do not meet matching conditions according to bi-direction matching results, quickly judging filtered candidate sets and obtaining record pairs capable of being successfully matched. According to the matching method based on the generalization bi-direction similarity connection technique, generalization bi-direction similarity connection is carried out based on mapping, filtering and verification (MFV), and the method is suitable for wide application scenes in the real world.

Description

Matching method based on a generalized bi-directional similarity join technique
Technical field
The present invention relates to the technical field of computer networks, and in particular to a matching method based on a generalized bi-directional similarity join (similarity connection) technique.
Background art
A bi-directional similarity join aims to find, from two given data sets, all pairs of data records that satisfy a predetermined join condition; it is an important operation in database applications. However, traditional matching methods are quite limited in handling diverse data types and cannot adequately meet the ever-growing practical demands of the real world. Moreover, in bi-directional matching the two parties may play different roles: what is matched is each party's own fact data against the other party's expectation data, and vice versa. How to cross-compare expectation data with fact data is a key technical problem that currently needs to be solved. In addition, individuals in the real world have different requirements on the degree of matching: some need to find a match that satisfies them 100%, while others do not set their expectations so high. How to design the comparison method of a bi-directional join around such per-individual limits on the degree of matching is likewise an important technical problem awaiting solution.
Therefore, a technical problem that those skilled in the art urgently need to solve is how to devise a more effective matching method that meets more of the demands arising in practical applications.
Summary of the invention
The technical problem to be solved by the present invention is to provide a matching method based on a generalized bi-directional similarity join technique that performs the generalized bi-directional similarity join on the basis of mapping-filtering-verification (MFV) and is therefore better suited to a wide range of real-world application scenarios.
To solve the above technical problem, an embodiment of the invention discloses a matching method based on a generalized bi-directional similarity join technique, comprising:
obtaining the data to be matched of the two matching parties and their respective thresholds limiting the degree of matching;
performing bi-directional matching between the fact data of each party and the expectation data of the other party;
filtering out, according to the result of the bi-directional matching, the large number of records that do not satisfy the matching condition;
quickly evaluating the filtered candidate set to obtain the record pairs that can be successfully matched.
Preferably, the bi-directional matching between each party's fact data and the other party's expectation data is performed by cross-matching.
Preferably, the cross-matching is performed by means of mapping.
Preferably, the mapping comprises injective mapping, equal-step mapping and/or heuristic mapping.
Preferably, the mapping is heuristic mapping.
The matching method of the present invention performs the generalized bi-directional similarity join on the basis of mapping-filtering-verification (MFV) and is well suited to a wide range of real-world application scenarios, for example job recruitment and making friends. In these scenarios the data to be matched are of many types, and the individuals on the two sides differ in how strictly they limit the degree of matching. To address these problems, the invention uses the mapping-filtering-verification method to cross-match one party's fact data against the other party's expectation data while performing the bi-directional join. In this way a friend-making process can satisfy the expectations of both sides at once, and in recruitment employers and job seekers can quickly find satisfactory employees and companies, raising the success rate of the hiring process. The method also has broad application prospects in other social networks.
At the same time, the generalized bi-directional similarity join method proposed here produces no incorrect results and misses no correct results during matching; that is, it is both correct and complete. Moreover, compared with previous methods, the proposed method is more efficient, more widely applicable, and better suited to practical needs.
Brief description of the drawing
Fig. 1 is a schematic flow chart of an embodiment of the matching method based on a generalized bi-directional similarity join technique of the present invention.
Detailed description of the embodiments
To make the above objects, features and advantages of the invention clearer, the invention is described in further detail below with reference to the drawing and specific embodiments.
Referring to Fig. 1, the matching method based on a generalized bi-directional similarity join technique described in this scheme specifically comprises:
Step S101: obtain the data of the two parties to be matched and the given thresholds limiting the degree of matching;
Step S102: perform bi-directional matching between the fact data of each party and the expectation data of the other party;
Step S103: filter out, according to the result of the bi-directional matching, the large number of records that do not satisfy the matching condition;
Step S104: quickly evaluate the filtered candidate set to obtain the record pairs that can be successfully matched.
Specific implementation in practical applications:
1. Obtain the data sets R and S to be matched. Each record r in R and each record s in S contains its own fact data, its expectation data about the other party, a matching satisfaction threshold t, and any other data.
2. Map the fact data and expectation data in every record of data sets R and S (see Algorithm 1, step 2).
a) Perform numerical mapping according to the selected mapping mode: for injective mapping go to step 2 b), for equal-step mapping go to step 2 c), and for heuristic mapping go to step 2 d).
b) Injective mapping: map each value of each attribute to a unique global symbol. Go to step 2 e).
c) Equal-step mapping: split numerical range data evenly using a fixed step, so that multiple values are mapped to the same symbol. Go to step 2 e).
d) Heuristic mapping: find an approximately optimal mapping scheme.
i. Set $k_0$, the maximum number of blocks into which a numerical range may be divided.
ii. Determine the optimization objective for computing the optimal partition of the numerical range:
$$\min \sum_{r \in R} f(e(r), \Pi), \quad \text{subject to } |\Pi| \le k_0,\ \Pi \in 2^{\Pi}$$
where $e(r) = a \sim b$ is an expected numerical range; $Ext(e(r))$ is the extension of $e(r)$ under partition $\Pi$, i.e. $Ext(e(r)) = a_j \sim b_k$ with $a_j = \max(a_i \mid a_i \le a)$ and $b_k = \min(b_i \mid b_i \ge b)$; $2^{\Pi}$ denotes the set of all partitions of the numerical range; $|\Pi|$ is the number of blocks in $\Pi$; and $f$ is a function that computes the difference between the cardinality $|Ext(e(r))|$ under partition $\Pi$ and the cardinality $|e(r)|$.
Using the following optimal substructure, obtain the optimal partition $\Pi$ by dynamic programming.
$$c[i,j,k] = \begin{cases} 0 & i = j \\ c[i,j,1] & k = 1,\ i < j \\ \min_{i \le m < j} \{\, c[i,m,1] + c[m+1,j,k-1] \,\} & k > 1,\ i < j \end{cases}$$
$P_{i,j,k}$ denotes a partition of the numerical range $i \sim j$ into $k$ blocks, where $i \le j$ and $k > 0$. The cost $c[i,j,k]$ of $P_{i,j,k}$ is defined as the sum of the extension sizes of all expectation data caused by this partition. For a given block $P_i = a_i \sim b_i$, if an original expected range $c_i \sim d_i$ is part of $P_i$, i.e. $c_i \ge a_i$ and $d_i < b_i$, then the sum of $c_i - a_i$ and $b_i - d_i$ is the extension size that this block imposes on that expectation data.
e) By counting and sorting, obtain a global order $O_t$ of the symbols, sorted in ascending order of occurrence count.
f) Each record is mapped to a generated record composed of symbols from the global symbol set; these records form the mapped data sets, denoted $R_m$ and $S_m$.
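The injective mapping, equal-step mapping and global ordering of step 2 can be illustrated with a minimal Python sketch. This is not the patent's implementation: the function names (`injective_symbol`, `equal_step_symbol`, `global_order`, `sort_record`), the representation of a symbol as an `(attribute, value)` tuple, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def injective_symbol(attr, value):
    """Injective mapping: every (attribute, value) pair is its own symbol."""
    return (attr, value)


def equal_step_symbol(attr, value, low, step):
    """Equal-step mapping: values in the same fixed-width block share a symbol."""
    return (attr, int((value - low) // step))


def global_order(mapped_records):
    """Symbols in ascending order of occurrence count (ties broken by repr)."""
    counts = Counter(sym for rec in mapped_records for sym in rec)
    return sorted(counts, key=lambda sym: (counts[sym], repr(sym)))


def sort_record(record, order):
    """Sort one mapped record so that its rarest symbols come first."""
    rank = {sym: pos for pos, sym in enumerate(order)}
    return sorted(record, key=rank.__getitem__)


# Hypothetical data: ages 27 and 29 fall into the same block of width 5.
mapped = [
    [injective_symbol("city", "Beijing"), equal_step_symbol("age", 27, low=20, step=5)],
    [injective_symbol("city", "Beijing"), equal_step_symbol("age", 24, low=20, step=5)],
    [injective_symbol("city", "Beijing"), equal_step_symbol("age", 29, low=20, step=5)],
]
order = global_order(mapped)
print(order)
print(sort_record(mapped[0], order))
```

Placing rare symbols first is what makes a prefix-based index selective later on: the symbols that get indexed are the ones shared by the fewest records.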
3. Pre-process the generated records obtained from the mapping.
a) Sort the generated records in $R_m$ and $S_m$ according to $O_t$ (see Algorithm 1, step 3).
b) Using global symbols as keys, build inverted indexes $I_r$ and $I_s$ over the expectation parts of the records in $R_m$ and $S_m$ respectively. For each sorted global-symbol record in $R_m$ and $S_m$, index its first $L - t \cdot L + 1$ expectation symbols according to its threshold $t$ (see Algorithm 1, step 4).
4. Perform bi-directional filtering on the symbol records produced by the mapping to obtain the candidate set (see Algorithm 1, steps 5-12).
a) Enumerate each generated record r in $R_m$, find the records s in $S_m$ that index $I_s$ associates with the fact global symbols of r, and put each initial candidate pair (r, s) into the initial candidate set $CR_1$ (see Algorithm 1, steps 5-8).
b) Traverse all initial candidate pairs (r, s) in $CR_1$ and determine whether record r appears in the inverted lists of index $I_r$ for the fact global symbols of s. If it does, go to step 4 c); otherwise go to step 4 d).
c) Put the pair into the final candidate set $CR_2$. Go to step 5.
d) Prune the pair (see Algorithm 1, steps 9-12).
5. Check each candidate pair in the final candidate set $CR_2$ and output the qualifying pairs as the final result (see Algorithm 1, steps 13-15).
Algorithm 1. Mapping-filtering-verification (MFV) algorithm
Input: R, S: the data sets
Output: RS: the query result data set
To help those skilled in the art better understand the present invention, the scheme is described in more detail below in the context of a practical application.
[Steps]
1. Define the generalized bi-directional similarity join
Definition 1. The "satisfies" operator ($\propto$) is defined between a fact and its corresponding expectation. The criterion for $\propto$ differs between data types. For example, if fact $f$ is a numerical value and expectation $e = a \sim b$ is a numerical range, then $f \propto e$ if and only if $f \ge a \wedge f \le b$; if fact $f$ is an element of a set and expectation $e = \{e_1, e_2, \ldots, e_n\}$ is a set, then $f \propto e$ if and only if $f \in e$.
Definition 2. Assume that every record in data sets R and S contains fact data, expectation data, threshold data and other unrelated data, formally: $R = \{(r_1^f, \ldots, r_u^f, r_{u+1}^e, \ldots, r_{u+v}^e, r_{u+v+1}, \ldots, r_q, T(r))\}$ and $S = \{(s_1^f, \ldots, s_v^f, s_{v+1}^e, \ldots, s_{u+v}^e, s_{u+v+1}, \ldots, s_w, T(s))\}$, where $u+v \le q$ and $u+v \le w$. Here $r_i^f$ ($i = 1, \ldots, u$) are the $u$ fact data of r; $r_i^e$ ($i = u+1, \ldots, u+v$) are the $v$ expectation data of r; $r_i$ ($i = u+v+1, \ldots, q$) are the other data of r; and $T(r)$ is the threshold of r. Likewise, $s_i^f$ ($i = 1, \ldots, v$) are the $v$ fact data of s; $s_i^e$ ($i = v+1, \ldots, u+v$) are the $u$ expectation data of s; $s_i$ ($i = u+v+1, u+v+2, \ldots, w$) are the other data of s; and $T(s)$ is the threshold of s. The generalized bi-directional similarity join of R and S is defined as $R \bowtie S = \{(r, s) \mid r \in R,\ s \in S,\ ExSim(r, s) \ge T(s) \wedge ExSim(s, r) \ge T(r)\}$, where: (1) $ExSim(r, s) = |\{ r_i^f \propto s_{i+v}^e \mid 1 \le i \le u \}| / u$; (2) $ExSim(s, r) = |\{ s_j^f \propto r_{j+u}^e \mid 1 \le j \le v \}| / v$.
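Definitions 1 and 2 can be read directly as a brute-force nested-loop join, which is useful as a correctness baseline for any faster method. The sketch below is illustrative only: the names `satisfies`, `ex_sim` and `naive_join`, the dictionary layout of a record, and the encoding of a numerical range as a 2-tuple and of a set expectation as a Python `set` are assumptions, not part of the patent.

```python
def satisfies(fact, expectation):
    """Definition 1: a range is a (low, high) tuple, anything else is a set."""
    if isinstance(expectation, tuple):
        low, high = expectation
        return low <= fact <= high
    return fact in expectation


def ex_sim(facts, expectations):
    """Fraction of facts that satisfy the other party's aligned expectations."""
    hits = sum(1 for f, e in zip(facts, expectations) if satisfies(f, e))
    return hits / len(facts)


def naive_join(R, S):
    """Definition 2, evaluated on every pair (quadratic, no filtering)."""
    result = []
    for r in R:
        for s in S:
            if (ex_sim(r["facts"], s["expects"]) >= s["t"]
                    and ex_sim(s["facts"], r["expects"]) >= r["t"]):
                result.append((r["id"], s["id"]))
    return result


# Hypothetical recruitment data: job seekers in R, positions in S.
# r facts: [years of experience, city]; r expects: [salary range].
# s facts: [salary]; s expects: [experience range, acceptable cities].
R = [
    {"id": "r1", "facts": [5, "Beijing"], "expects": [(15000, 30000)], "t": 1.0},
    {"id": "r2", "facts": [1, "Beijing"], "expects": [(8000, 20000)], "t": 1.0},
]
S = [
    {"id": "s1", "facts": [20000], "expects": [(3, 10), {"Beijing", "Shanghai"}], "t": 1.0},
    {"id": "s2", "facts": [10000], "expects": [(0, 2), {"Shenzhen"}], "t": 0.5},
]
print(naive_join(R, S))  # [('r1', 's1'), ('r2', 's2')]
```

In this example r2 meets only half of s2's expectations, but s2's threshold of 0.5 accepts that, so the pair still joins; r2 and s1 do not join because s1 requires all of its expectations to be met.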
2. Mapping-filtering-verification method
Based on the above definitions, a method for solving the generalized bi-directional similarity join problem is proposed. It comprises three steps, namely mapping, filtering and verification, and is referred to as the mapping-filtering-verification method.
Algorithm 1. Mapping-filtering-verification algorithm
Input: R, S: the data sets
Output: RS: the query result data set
The three steps of Algorithm 1 are described in detail below:
First step: mapping.
1) Perform numerical mapping according to the selected mapping mode: for injective mapping go to 2), for equal-step mapping go to 3), and for heuristic mapping go to 4).
2) Injective mapping: map each value of each attribute to a unique global symbol. Go to 5).
3) Equal-step mapping: split numerical range data evenly using a fixed step, so that multiple values are mapped to the same symbol. Go to 5).
4) Heuristic mapping: find an approximately optimal mapping scheme.
a) Set $k_0$, the maximum number of blocks into which a numerical range may be divided.
b) Determine the optimization objective for computing the optimal partition of the numerical range:
$$\min \sum_{r \in R} f(e(r), \Pi), \quad \text{subject to } |\Pi| \le k_0,\ \Pi \in 2^{\Pi}$$
where $e(r) = a \sim b$ is an expected numerical range; $Ext(e(r))$ is the extension of $e(r)$ under partition $\Pi$, i.e. $Ext(e(r)) = a_j \sim b_k$ with $a_j = \max(a_i \mid a_i \le a)$ and $b_k = \min(b_i \mid b_i \ge b)$; $2^{\Pi}$ denotes the set of all partitions of the numerical range; $|\Pi|$ is the number of blocks in $\Pi$; and $f$ is a function that computes the difference between the cardinality $|Ext(e(r))|$ under partition $\Pi$ and the cardinality $|e(r)|$.
c) Using the following optimal substructure, obtain the optimal partition $\Pi$ by dynamic programming.
$$c[i,j,k] = \begin{cases} 0 & i = j \\ c[i,j,1] & k = 1,\ i < j \\ \min_{i \le m < j} \{\, c[i,m,1] + c[m+1,j,k-1] \,\} & k > 1,\ i < j \end{cases}$$
$P_{i,j,k}$ denotes a partition of the numerical range $i \sim j$ into $k$ blocks, where $i \le j$ and $k > 0$. The cost $c[i,j,k]$ of $P_{i,j,k}$ is defined as the sum of the extension sizes of all expectation data caused by this partition. For a given block $P_i = a_i \sim b_i$, if an original expected range $c_i \sim d_i$ is part of $P_i$, i.e. $c_i \ge a_i$ and $d_i < b_i$, then the sum of $c_i - a_i$ and $b_i - d_i$ is the extension size that this block imposes on that expectation data.
5) By counting and sorting, obtain a global order $O_t$ of the symbols, sorted in ascending order of occurrence count.
6) Each record is mapped to a generated record composed of symbols from the global symbol set; these records form the mapped data sets, denoted $R_m$ and $S_m$ (step 2).
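The dynamic program behind the heuristic mapping can be sketched in Python as follows. The sketch follows the recurrence for $c[i,j,k]$ above but rests on several assumptions that the text does not spell out: the numerical range is the integer domain 0..n-1, expected ranges are closed integer intervals `(c, d)`, and an expected range that spans several blocks is charged `c - a` in the block containing its lower end and `b - d` in the block containing its upper end. The function name `optimal_partition` and its return format are hypothetical.

```python
from functools import lru_cache


def optimal_partition(n, expectations, k0):
    """Split 0..n-1 into at most k0 contiguous blocks with minimal total extension.

    Returns (cost, blocks), where blocks is a tuple of inclusive (start, end) pairs.
    """
    expectations = tuple(expectations)

    def block_cost(a, b):
        # Cost c[a, b, 1] of using a~b as a single block.
        cost = 0
        for c, d in expectations:
            if a <= c <= b:
                cost += c - a  # lower end is stretched down to the block start
            if a <= d <= b:
                cost += b - d  # upper end is stretched up to the block end
        return cost

    @lru_cache(maxsize=None)
    def best(i, j, k):
        # Base cases of the recurrence: a single point or a single block.
        if i == j or k == 1:
            return block_cost(i, j), ((i, j),)
        options = []
        for m in range(i, j):  # first block is i~m, the rest uses k - 1 blocks
            rest_cost, rest_blocks = best(m + 1, j, k - 1)
            options.append((block_cost(i, m) + rest_cost, ((i, m),) + rest_blocks))
        return min(options)

    return best(0, n - 1, k0)


# Hypothetical expected ranges over the domain 0..9.
wanted = [(1, 3), (2, 4), (6, 8)]
print(optimal_partition(10, wanted, 1))  # (21, ((0, 9),))
print(optimal_partition(10, wanted, 2))  # (6, ((0, 4), (5, 9)))
```

With two blocks the cut falls between 4 and 5, which separates the two clusters of expected ranges; each range is then stretched by only 2 units instead of 7 in the single-block case.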
Second step: filtering.
1) Sort the generated records in $R_m$ and $S_m$ according to $O_t$ (step 3).
2) Using global symbols as keys, build inverted indexes $I_r$ and $I_s$ over the expectation parts of the records in $R_m$ and $S_m$ respectively (step 4). For each sorted global-symbol record in $R_m$ and $S_m$, index its first $L - t \cdot L + 1$ expectation symbols, where $t$ is the threshold of that record.
3) Generate candidate result pairs according to the filtering principle (steps 5-12).
a) Enumerate each generated record r in $R_m$, find the records s in $S_m$ that index $I_s$ associates with the fact global symbols of r, and put each initial candidate pair (r, s) into the initial candidate set $CR_1$ (steps 5-8).
b) Traverse all initial candidate pairs (r, s) in $CR_1$ and determine whether record r appears in the inverted lists of index $I_r$ for the fact global symbols of s. If it does, go to c); otherwise go to d).
c) Put the pair into the final candidate set $CR_2$. Go to the third step.
d) Prune the pair (steps 9-12).
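The filtering step can be sketched in Python under simplifying assumptions. Here every fact and every expectation has already been mapped to exactly one global symbol (a plain string), $L$ is taken to be the number of expectation symbols of a record, $t \cdot L$ is rounded up, thresholds are assumed to lie in (0, 1], and each record's `expects` list is assumed to be pre-sorted by the global order. The names `prefix_length`, `build_index` and `filter_candidates` and the record layout are hypothetical.

```python
import math
from collections import defaultdict


def prefix_length(num_expect, t):
    """Number of leading expectation symbols to index: L - ceil(t * L) + 1."""
    required = math.ceil(t * num_expect - 1e-9)  # guard against float noise
    return num_expect - required + 1


def build_index(records):
    """Inverted index: expectation symbol -> ids of records indexing it."""
    index = defaultdict(set)
    for rec in records:
        p = prefix_length(len(rec["expects"]), rec["t"])
        for sym in rec["expects"][:p]:
            index[sym].add(rec["id"])
    return index


def filter_candidates(r_m, s_m):
    """Return CR_2: pairs that survive the lookup in both directions."""
    index_r, index_s = build_index(r_m), build_index(s_m)
    s_by_id = {s["id"]: s for s in s_m}
    cr2 = set()
    for r in r_m:
        cr1 = set()  # initial candidates for this r, found via I_s
        for sym in r["facts"]:
            cr1 |= index_s.get(sym, set())
        for sid in cr1:
            s_facts = s_by_id[sid]["facts"]
            # Keep the pair only if some fact of s hits r in I_r; otherwise prune.
            if any(r["id"] in index_r.get(sym, ()) for sym in s_facts):
                cr2.add((r["id"], sid))
    return cr2


# Hypothetical mapped records.
R_M = [
    {"id": "r1", "facts": {"exp:3-10", "city:BJ"}, "expects": ["pay:15-30k"], "t": 1.0},
    {"id": "r2", "facts": {"exp:0-2", "city:BJ"}, "expects": ["pay:8-15k"], "t": 1.0},
    {"id": "r3", "facts": {"exp:3-10", "city:SH"}, "expects": ["pay:30-50k"], "t": 1.0},
]
S_M = [
    {"id": "s1", "facts": {"pay:15-30k"}, "expects": ["exp:3-10", "city:BJ"], "t": 1.0},
    {"id": "s2", "facts": {"pay:8-15k"}, "expects": ["city:SZ", "exp:0-2"], "t": 0.5},
]
print(sorted(filter_candidates(R_M, S_M)))  # [('r1', 's1'), ('r2', 's2')]
```

The prefix length is what keeps the filter from discarding true results in this simplified setting: a record that needs at least ceil(t·L) of its L expectations met cannot be satisfied by a partner that matches none of its first L − ceil(t·L) + 1 expectation symbols. In the example, r3 reaches s1 through $I_s$ but is pruned because s1's salary does not appear in r3's indexed expectations. The surviving pairs are still only candidates and must be checked in the verification step.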
Third step: verification.
Check each candidate pair in the final candidate set $CR_2$ and output the qualifying pairs as the final result (steps 13-15).
The matching method based on a generalized bi-directional similarity join technique provided by the present invention has been described in detail above. Specific examples have been used herein to set out the principle and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the invention.

Claims (5)

1. A matching method based on a generalized bi-directional similarity join technique, characterized by comprising:
obtaining the data of the two parties to be matched and their respective thresholds limiting the degree of matching;
performing bi-directional matching between the fact data of each party and the expectation data of the other party;
filtering out, according to the result of the bi-directional matching, the large number of records that do not satisfy the matching condition;
quickly evaluating the filtered candidate set to obtain the record pairs that can be successfully matched.
2. The matching method based on a generalized bi-directional similarity join technique according to claim 1, characterized in that the bi-directional matching between each party's fact data and the other party's expectation data is performed by cross-matching.
3. The matching method based on a generalized bi-directional similarity join technique according to claim 2, characterized in that the cross-matching is performed by means of mapping.
4. The matching method based on a generalized bi-directional similarity join technique according to claim 3, characterized in that the mapping comprises injective mapping, equal-step mapping and/or heuristic mapping.
5. The matching method based on a generalized bi-directional similarity join technique according to claim 3, characterized in that the mapping is heuristic mapping.
CN201410407666.9A 2014-08-19 2014-08-19 Matching process based on extensive two-way similar interconnection technique Active CN104166719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410407666.9A CN104166719B (en) 2014-08-19 2014-08-19 Matching process based on extensive two-way similar interconnection technique

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410407666.9A CN104166719B (en) 2014-08-19 2014-08-19 Matching process based on extensive two-way similar interconnection technique

Publications (2)

Publication Number Publication Date
CN104166719A (en) 2014-11-26
CN104166719B (en) 2018-02-16

Family

ID=51910532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410407666.9A Active CN104166719B (en) 2014-08-19 2014-08-19 Matching process based on extensive two-way similar interconnection technique

Country Status (1)

Country Link
CN (1) CN104166719B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021493A (en) * 2016-05-19 2016-10-12 天津工业大学 Method and device for similarity connection of inconsistent constraints
CN108573052A (en) * 2018-04-23 2018-09-25 南京大学 A kind of similar connection method of the set of threshold adaptive
CN108846067A (en) * 2018-06-05 2018-11-20 洛阳师范学院 The high dimensional data similarity join querying method and device divided based on mapping space

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101180645A (en) * 2004-12-07 2008-05-14 毕库德股份有限公司 Electronic commerce system, method and apparatus
CN103218732A (en) * 2004-12-07 2013-07-24 毕库德股份有限公司 Electronic commerce system, method and apparatus
CN101453398A (en) * 2007-12-06 2009-06-10 怀特威盛软件公司 Novel distributed grid super computer system and method
US20120185422A1 (en) * 2011-01-14 2012-07-19 Shah Amip J Node similarity for component substitution
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
US20130036119A1 (en) * 2011-08-01 2013-02-07 Qatar Foundation Behavior Based Record Linkage

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朱建新: ""生物认证系统错误率分析"", 《计算机应用研究》 *
王金宝: ""云计算系统中索引与查询处理技术研究"", 《中国博士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN104166719B (en) 2018-02-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant