CN104166719A - Matching method based on a generalized bidirectional similarity join technique

Info

Publication number: CN104166719A
Application number: CN201410407666.9A
Authority: CN
Inventors: 王朝坤; 王萌; 汪浩
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2014-08-19
Filing date: 2014-08-19
Publication date: 2014-11-26
Anticipated expiration: 2034-08-19
Also published as: CN104166719B

Abstract

The invention discloses a matching method based on a generalized bidirectional similarity join technique and relates to the technical field of computer networks. The method comprises: obtaining the data of the two parties to be matched, together with the threshold each party sets to limit the matching degree; bidirectionally matching the factual data of each party against the expectation data of the other party; filtering out, according to the bidirectional matching results, the large number of records that cannot satisfy the matching conditions; and quickly verifying the filtered candidate set to obtain the record pairs that match successfully. The method performs the generalized bidirectional similarity join by mapping, filtering and verification (MFV) and is suited to a wide range of real-world application scenarios.

Description

Matching method based on a generalized bidirectional similarity join technique

Technical field

The present invention relates to the technical field of computer networks, and in particular to a matching method based on a generalized bidirectional similarity join technique.

Background technology

A bidirectional similarity join aims to find, from two given data sets, all pairs of records that satisfy a predetermined join condition, and is an important operation in database applications. Traditional matching methods, however, are severely limited when data types are diverse and cannot adequately meet the ever-growing demands of the real world. Moreover, in bidirectional matching the two parties may play different roles: what is matched is one party's factual data against the other party's expectation data, and vice versa. How to cross-compare expectation data with factual data is a key technical problem that needs to be solved. In addition, individuals in the real world differ in the matching degree they require: some need a match that satisfies 100% of their expectations, while others do not set the bar so high. How to design the comparison method of the bidirectional join around such per-individual limits on matching degree is therefore another important technical problem to be solved.

Therefore, a technical problem that those skilled in the art urgently need to solve is how to devise a more efficient matching method that meets more of the demands arising in practical applications.

Summary of the invention

The technical problem to be solved by the present invention is to provide a matching method based on a generalized bidirectional similarity join technique that performs the generalized bidirectional similarity join by mapping-filtering-verification (MFV) and is better suited to a wide range of real-world application scenarios.

To solve the above technical problem, an embodiment of the invention discloses a matching method based on a generalized bidirectional similarity join technique, comprising:

obtaining the data of the two parties to be matched and the threshold each party sets to limit the matching degree;

bidirectionally matching the factual data of each party against the expectation data of the other party;

filtering out, according to the result of the bidirectional matching, the large number of records that do not satisfy the matching conditions;

quickly verifying the filtered candidate set to obtain the record pairs that match successfully.

Preferably, the bidirectional matching of each party's factual data against the other party's expectation data is performed by cross-matching.

Preferably, the cross-matching is performed by mapping.

Preferably, the mapping comprises injective mapping, equal-step mapping and/or heuristic mapping.

Preferably, the mapping is heuristic mapping.

The matching method of the present invention performs the generalized bidirectional similarity join by mapping-filtering-verification (MFV) and is better suited to a wide range of real-world application scenarios such as recruitment and making friends. In these scenarios the data to be matched are of many types, and each individual on either side sets a different limit on the matching degree. To address these problems, the mapping-filtering-verification method effectively cross-matches one party's factual data against the other party's expectation data during the bidirectional join. In making friends, the result can therefore meet the expectations of both parties at once; in recruitment, it helps employers and job seekers quickly find satisfactory employees and enterprises and raises the success rate of the hiring process. The method also has broad application in other social networks.

Meanwhile, the generalized bidirectional similarity join method proposed here produces no incorrect results and misses no correct results during matching; that is, it is both correct and complete. Moreover, compared with previous methods, the proposed method is more efficient, more widely applicable and closer to real-world needs.

Brief description of the drawings

Fig. 1 is a schematic flowchart of an embodiment of the matching method based on a generalized bidirectional similarity join technique of the present invention.

Detailed description of the embodiments

To make the above objects, features and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawing and specific embodiments.

Referring to Fig. 1, the matching method based on a generalized bidirectional similarity join technique described in this scheme specifically comprises:

Step S101: obtain the data of the two parties to be matched and the thresholds given to limit the matching degree;

Step S102: bidirectionally match the factual data of each party against the expectation data of the other party;

Step S103: filter out, according to the result of the bidirectional matching, the large number of records that do not satisfy the matching conditions;

Step S104: quickly verify the filtered candidate set to obtain the record pairs that match successfully.

In practical applications, the method is implemented as follows:

1. Obtain the data sets R and S to be matched. Each record r in R and s in S contains its own factual data, its expectation data about the other party, its matching satisfaction threshold t, and any other data.

2. Map the fact and expectation data in every record of the data sets R and S (see Algorithm 1, step 2).

a) Perform numeric mapping according to the selected mapping mode: for injective mapping go to b); for equal-step mapping go to c); for heuristic mapping go to d).

b) Injective mapping: map each value of each attribute to a unique global symbol. Go to e).

c) Equal-step mapping: split the numeric range evenly using a fixed step, so that several values are mapped to the same symbol. Go to e).

d) Heuristic mapping: find an approximately optimal mapping scheme.

i. Set k₀, the maximum number of blocks allowed for a given numeric range.

ii. State the optimization objective for the optimal partition of that numeric range:

\min \sum_{r \in R} f(e(r), \Pi) \quad \text{subject to } |\Pi| \le k_0,\ \Pi \in 2^{\Pi}

where e(r) = a~b is an expected numeric range and Ext(e(r)) is the extension of e(r) under the partition Π, i.e. Ext(e(r)) = a_j~b_k with a_j = max(a_i | a_i ≤ a) and b_k = min(b_i | b_i ≥ b); 2^Π denotes the set of all partitions of the numeric range; |Π| is the number of blocks in Π; and f is a function that computes the difference between the cardinality |Ext(e(r))| of the extension under Π and the cardinality |e(r)| of e(r).

iii. Using the following optimal substructure, obtain the optimal partition Π by dynamic programming:

c[i, j, k] = \begin{cases} 0 & i = j \\ c[i, j, 1] & k = 1,\ i < j \\ \min_{i \le m < j} \{ c[i, m, 1] + c[m + 1, j, k - 1] \} & k > 1,\ i < j \end{cases}

P_{i,j,k} denotes a partition of the numeric range i~j into k blocks, where i ≤ j and k > 0. The cost c[i, j, k] of P_{i,j,k} is defined as the sum of the extension sizes that this partition causes over all expectation data. For a given block P_i = a_i~b_i, if an original expected range c_i~d_i is part of P_i, i.e. c_i ≥ a_i and d_i < b_i, then the sum of c_i - a_i and b_i - d_i is the extension size that this block causes for that expectation.

e) By counting and sorting, obtain the global order O_t of the symbols, sorted by ascending number of occurrences.

f) Map every record to a generated record composed of symbols from the global symbol set. These generated records form the mapped data sets, denoted R_m and S_m.

3. Preprocess the generated records obtained by mapping.

a) Sort the generated records in R_m and S_m according to O_t (see Algorithm 1, step 3).

b) Using global symbols as keys, build inverted indexes I_r and I_s on the expectation parts of the records in R_m and S_m, respectively. For each sorted global-symbol record in R_m and S_m with threshold t, index its first L - t*L + 1 expectation symbols (see Algorithm 1, step 4).

4. On the symbol records produced by mapping, perform bidirectional filtering to obtain the candidate set (see Algorithm 1, steps 5-12).

a) Enumerate the generated records r in R_m. For each r, look up its fact global symbols in index I_s to find the corresponding records s in S_m, and put each initial candidate pair (r, s) into the initial candidate set CR₁ (see Algorithm 1, steps 5-8).

b) Traverse all initial candidate pairs (r, s) in CR₁ and check whether record r appears in the inverted lists of index I_r for the fact global symbols of s. If it does, go to c); otherwise go to d).

c) Put the pair into the final candidate set CR₂. Go to step 5.

d) Prune the pair (see Algorithm 1, steps 9-12).

5. Check each candidate pair in the final candidate set CR₂ and output the pairs that satisfy the join condition as the final result (see Algorithm 1, steps 13-15).
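
The equal-step mapping of step 2 c) above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names `bucket`, `map_fact` and `map_range`, the `(attribute, block)` encoding of a global symbol, and the origin parameter `lo` are all introduced here for the example. The sketch shows why the mapping is safe for filtering: a fact that lies in an expected range always receives a symbol contained in the symbol set of that range, so no true match is lost, while occasional false positives are left to the verification step.

```python
import math

def bucket(value: float, lo: float, step: float) -> int:
    """Index of the fixed-width block [lo + k*step, lo + (k+1)*step) holding value."""
    return math.floor((value - lo) / step)

def map_fact(attr: int, value: float, lo: float, step: float) -> tuple:
    """Map one numeric fact to a single (attribute, block) symbol."""
    return (attr, bucket(value, lo, step))

def map_range(attr: int, a: float, b: float, lo: float, step: float) -> set:
    """Map an expected range a~b to the symbols of every block it touches."""
    first, last = bucket(a, lo, step), bucket(b, lo, step)
    return {(attr, k) for k in range(first, last + 1)}

# Hypothetical attribute 0 = age, step 5: the expectation 25~34 touches blocks 5 and 6.
age_expectation = map_range(0, 25, 34, lo=0, step=5)
in_range = map_fact(0, 27, lo=0, step=5) in age_expectation        # true match is kept
false_positive = map_fact(0, 34.5, lo=0, step=5) in age_expectation  # kept, rejected later
out_of_range = map_fact(0, 36, lo=0, step=5) in age_expectation    # filtered out
```

A smaller step yields fewer false positives at the price of more symbols per expected range, which is the trade-off the heuristic mapping of step 2 d) tries to balance.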

Algorithm 1. Mapping-filtering-verification (MFV) algorithm

Input: R, S, the data sets

Output: R ⋈ S, the query result data set

To help those skilled in the art better understand the present invention, the scheme is described in more detail below in the context of a practical application.

[Steps]

1. Definition of the generalized bidirectional similarity join

Definition 1. The "satisfies" operator (∝) is defined between a fact and the corresponding expectation. The criterion for "∝" differs by data type. For example, if the fact f is a numeric value and the expectation e = a~b is a numeric range, then f ∝ e if and only if f ≥ a ∧ f ≤ b; if the fact f is an element of a set and the expectation e = {e₁, e₂, ..., e_n} is a set, then f ∝ e if and only if f ∈ e.

Definition 2. Assume that every record in the data sets R and S contains factual data, expectation data, threshold data and other unrelated data, formally:

R = \{ (r_1^f, \ldots, r_u^f, r_{u+1}^e, \ldots, r_{u+v}^e, r_{u+v+1}, \ldots, r_q, T(r)) \},

S = \{ (s_1^f, \ldots, s_v^f, s_{v+1}^e, \ldots, s_{u+v}^e, s_{u+v+1}, \ldots, s_w, T(s)) \},

where u + v ≤ q and u + v ≤ w; r_i^f (i = 1, ..., u) are the u factual data of r; r_i^e (i = u+1, ..., u+v) are the v expectation data of r; r_i (i = u+v+1, ..., q) are the other data of r; and T(r) is the threshold of r. Likewise, s_i^f (i = 1, ..., v) are the v factual data of s; s_i^e (i = v+1, ..., u+v) are the u expectation data of s; s_i (i = u+v+1, u+v+2, ..., w) are the other data of s; and T(s) is the threshold of s. The generalized bidirectional similarity join of R and S is defined as R ⋈ S = {(r, s) | r ∈ R, s ∈ S, ExSim(r, s) ≥ T(s) ∧ ExSim(s, r) ≥ T(r)}, where:

(1) ExSim(r, s) = | \{ r_i^f \propto s_{i+v}^e \mid 1 \le i \le u \} | / u;

(2) ExSim(s, r) = | \{ s_j^f \propto r_{j+u}^e \mid 1 \le j \le v \} | / v.
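
Definitions 1 and 2 translate almost directly into code. The Python sketch below is a minimal illustration, assuming that a range expectation is stored as a tuple `(a, b)` and a set expectation as a Python set, and that the facts of one record are positionally aligned with the expectations of the other; the names `satisfies`, `ex_sim`, `Record` and `is_match` are introduced here and do not come from the patent.

```python
from dataclasses import dataclass

def satisfies(fact, expectation) -> bool:
    """f ∝ e: a <= f <= b for a range (a, b), membership for a set."""
    if isinstance(expectation, tuple):
        a, b = expectation
        return a <= fact <= b
    return fact in expectation

def ex_sim(facts, expectations) -> float:
    """Fraction of facts that satisfy the positionally aligned expectations."""
    if not facts or len(facts) != len(expectations):
        raise ValueError("facts and expectations must be non-empty and aligned")
    hits = sum(satisfies(f, e) for f, e in zip(facts, expectations))
    return hits / len(facts)

@dataclass
class Record:
    facts: list          # this record's own factual data
    expectations: list   # what it expects of the other side's facts
    threshold: float     # T(.), the required matching degree

def is_match(r: Record, s: Record) -> bool:
    """Join condition: ExSim(r, s) >= T(s) and ExSim(s, r) >= T(r)."""
    return (ex_sim(r.facts, s.expectations) >= s.threshold
            and ex_sim(s.facts, r.expectations) >= r.threshold)

# Hypothetical job seeker r (age, city) who expects a salary in 5000~9000,
# and hypothetical employer s (salary) who expects age 25~35 and one of two cities.
r = Record(facts=[30, "Beijing"], expectations=[(5000, 9000)], threshold=1.0)
s = Record(facts=[8000], expectations=[(25, 35), {"Shanghai", "Shenzhen"}], threshold=0.5)
matched = is_match(r, s)  # r meets 1 of 2 expectations of s (0.5), s meets 1 of 1 of r
```

Note that each side's similarity is compared against the threshold of the party whose expectations are being tested, which is what lets a demanding party and a lenient party be joined under one condition.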

2. Mapping-filtering-verification method

Based on the above definitions, a method for solving the generalized bidirectional similarity join problem is proposed. It comprises three steps, mapping, filtering and verification, and is referred to as the mapping-filtering-verification method.

Algorithm 1. Mapping-filtering-verification algorithm

Input: R, S, the data sets

Output: R ⋈ S, the query result data set

The three steps of Algorithm 1 are described in detail below:

Step 1: mapping.

1) Perform numeric mapping according to the selected mapping mode: for injective mapping go to 2); for equal-step mapping go to 3); for heuristic mapping go to 4).

2) Injective mapping: map each value of each attribute to a unique global symbol. Go to 5).

3) Equal-step mapping: split the numeric range evenly using a fixed step, so that several values are mapped to the same symbol. Go to 5).

4) Heuristic mapping: find an approximately optimal mapping scheme.

a) Set k₀, the maximum number of blocks allowed for a given numeric range.

b) State the optimization objective for the optimal partition of that numeric range:

\min \sum_{r \in R} f(e(r), \Pi) \quad \text{subject to } |\Pi| \le k_0,\ \Pi \in 2^{\Pi}

where e(r) = a~b is an expected numeric range and Ext(e(r)) is the extension of e(r) under the partition Π, i.e. Ext(e(r)) = a_j~b_k with a_j = max(a_i | a_i ≤ a) and b_k = min(b_i | b_i ≥ b); 2^Π denotes the set of all partitions of the numeric range; |Π| is the number of blocks in Π; and f is a function that computes the difference between the cardinality |Ext(e(r))| of the extension under Π and the cardinality |e(r)| of e(r).

c) Using the following optimal substructure, obtain the optimal partition Π by dynamic programming:

c[i, j, k] = \begin{cases} 0 & i = j \\ c[i, j, 1] & k = 1,\ i < j \\ \min_{i \le m < j} \{ c[i, m, 1] + c[m + 1, j, k - 1] \} & k > 1,\ i < j \end{cases}

5) By counting and sorting, obtain the global order O_t of the symbols, sorted by ascending number of occurrences.

6) Map every record to a generated record composed of symbols from the global symbol set. These generated records form the mapped data sets, denoted R_m and S_m (Algorithm 1, step 2).
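
The dynamic program of the heuristic mapping can be sketched as below. Several points are assumptions made for this illustration and are not stated in the text: the numeric range is taken to be the integers lo..hi, blocks are inclusive intervals, and the single-block cost c[i, j, 1] is computed by a hypothetical helper `block_cost` that charges every expected range starting in the block for being stretched to the block's left edge and every range ending in the block for being stretched to its right edge. With that cost the total is additive over blocks, so the recurrence c[i, j, k] = min over m of c[i, m, 1] + c[m+1, j, k-1] applies directly.

```python
from functools import lru_cache

def optimal_partition(lo, hi, expectations, k0):
    """Split the integers lo..hi into at most k0 blocks with minimum total extension.

    expectations: list of expected ranges (a, b) with lo <= a <= b <= hi.
    Returns (cost, blocks), blocks being a tuple of inclusive (start, end) pairs.
    """
    def block_cost(p, q):
        # Assumed c[p, q, 1]: left stretch for ranges starting in p..q,
        # right stretch for ranges ending in p..q.
        left = sum(a - p for a, b in expectations if p <= a <= q)
        right = sum(q - b for a, b in expectations if p <= b <= q)
        return left + right

    @lru_cache(maxsize=None)
    def c(i, j, k):
        if i == j or k == 1:
            # A single point costs 0; with one block left, take i..j as a whole.
            return block_cost(i, j), ((i, j),)
        best = None
        for m in range(i, j):  # first block is i..m, the rest gets k - 1 blocks
            rest_cost, rest_blocks = c(m + 1, j, k - 1)
            total = block_cost(i, m) + rest_cost
            if best is None or total < best[0]:
                best = (total, ((i, m),) + rest_blocks)
        return best

    return c(lo, hi, k0)

# Hypothetical expectations on the range 1..10; two blocks fit them exactly.
cost, blocks = optimal_partition(1, 10, [(1, 3), (1, 3), (4, 10)], k0=2)
```

Because splitting a block never increases this cost, always splitting when k > 1 still respects the constraint |Π| ≤ k₀. The sketch recomputes `block_cost` by scanning all expectations, which is fine for illustration; prefix sums would be the natural optimization for larger inputs.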

Step 2: filtering.

1) Sort the generated records in R_m and S_m according to O_t (Algorithm 1, step 3).

2) Using global symbols as keys, build inverted indexes I_r and I_s on the expectation parts of the records in R_m and S_m, respectively (Algorithm 1, step 4). For each sorted global-symbol record in R_m and S_m, index its first L - t*L + 1 expectation symbols, where t is the threshold of that record.

3) Generate candidate result pairs according to the filtering principle (Algorithm 1, steps 5-12).

a) Enumerate the generated records r in R_m. For each r, look up its fact global symbols in index I_s to find the corresponding records s in S_m, and put each initial candidate pair (r, s) into the initial candidate set CR₁ (Algorithm 1, steps 5-8).

b) Traverse all initial candidate pairs (r, s) in CR₁ and check whether record r appears in the inverted lists of index I_r for the fact global symbols of s. If it does, go to c); otherwise go to d).

c) Put the pair into the final candidate set CR₂. Go to Step 3.

d) Prune the pair (Algorithm 1, steps 9-12).

Step 3: verification.

Check each candidate pair in the final candidate set CR₂ and output the pairs that satisfy the join condition as the final result (Algorithm 1, steps 13-15).
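
The filtering step can be illustrated with the Python sketch below. It is a simplified reading under explicit assumptions rather than the patent's Algorithm 1: a mapped record is a dict with `facts` (one symbol per fact), `exps` (one set of symbols per expectation) and a threshold `t` in (0, 1]; the prefix is taken over expectations and has length L - ceil(t*L) + 1, with the ceiling added here so that the prefix length is an integer; and the helper names are hypothetical. If a record needs at least ceil(t*L) of its L expectations satisfied, then at least one of any L - ceil(t*L) + 1 of them must be satisfied, which is why indexing only the prefix cannot drop a true match.

```python
import math
from collections import Counter, defaultdict

def global_order(records):
    """O_t: rank symbols by ascending number of occurrences (ties broken by repr)."""
    counts = Counter()
    for rec in records:
        counts.update(rec["facts"])
        for exp in rec["exps"]:
            counts.update(exp)
    ordered = sorted(counts, key=lambda s: (counts[s], repr(s)))
    return {sym: rank for rank, sym in enumerate(ordered)}

def build_index(records, rank):
    """Inverted index symbol -> record ids over each record's expectation prefix."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        n = len(rec["exps"])
        # The small epsilon guards against float noise such as 0.7 * 10 > 7.
        keep = n - math.ceil(rec["t"] * n - 1e-9) + 1
        rarest_first = sorted(rec["exps"], key=lambda exp: min(rank[s] for s in exp))
        for exp in rarest_first[:keep]:
            for sym in exp:
                index[sym].add(rid)
    return index

def candidate_pairs(R_m, S_m):
    """Bidirectional filter: CR1 from r's facts in I_s, CR2 after checking s's facts in I_r."""
    rank = global_order(R_m + S_m)
    I_r, I_s = build_index(R_m, rank), build_index(S_m, rank)
    cr1 = {(i, j) for i, r in enumerate(R_m) for f in r["facts"] for j in I_s.get(f, ())}
    cr2 = {(i, j) for i, j in cr1 if any(i in I_r.get(f, ()) for f in S_m[j]["facts"])}
    return cr2

# Hypothetical mapped records; symbols are (attribute, token) pairs.
R_m = [
    {"facts": [(0, "Beijing"), (1, "age30")], "exps": [{(0, "pay8k")}], "t": 1.0},
    {"facts": [(0, "Xian"), (1, "age50")], "exps": [{(0, "pay8k")}], "t": 1.0},
]
S_m = [
    {"facts": [(0, "pay8k")],
     "exps": [{(0, "Beijing"), (0, "Shanghai")}, {(1, "age20"), (1, "age30")}],
     "t": 0.5},
]
pairs = candidate_pairs(R_m, S_m)
```

Ordering expectations by their rarest symbol keeps the indexed prefix selective, so the inverted lists stay short. The surviving pairs are only candidates and still have to pass the exact check of the verification step.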

The matching method based on a generalized bidirectional similarity join technique provided by the present invention has been described in detail above. Specific examples have been used to explain the principle and embodiments of the invention, and the description of the embodiments is intended only to aid understanding of the method and its core idea. Those of ordinary skill in the art may, following the idea of the invention, vary both the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims

1. A matching method based on a generalized bidirectional similarity join technique, characterized by comprising:

obtaining the data of the two parties to be matched and the threshold each party sets to limit the matching degree;

2. The matching method based on a generalized bidirectional similarity join technique as claimed in claim 1, characterized in that the bidirectional matching of each party's factual data against the other party's expectation data is performed by cross-matching.

3. The matching method based on a generalized bidirectional similarity join technique as claimed in claim 2, characterized in that the cross-matching is performed by mapping.

4. The matching method based on a generalized bidirectional similarity join technique as claimed in claim 3, characterized in that the mapping comprises injective mapping, equal-step mapping and/or heuristic mapping.

5. The matching method based on a generalized bidirectional similarity join technique as claimed in claim 3, characterized in that the mapping is heuristic mapping.