CN104166719A - Matching method based on generalization bi-direction similarity connection technique - Google Patents
Matching method based on generalization bi-direction similarity connection technique
- Publication number
- CN104166719A (application CN201410407666.9A)
- Authority
- CN
- China
- Prior art keywords
- matching
- data
- mapping
- extensive
- way similar
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
- G06F16/2456—Join operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a matching method based on a generalization bi-direction similarity connection technique, and relates to the technical field of computer networks. The method includes the steps of obtaining data of two parties to be matched and a given threshold value for limiting the matching degree, carrying out bi-direction matching on factual data of each party and expected data of the other party, filtering a large quantity of records which do not meet matching conditions according to bi-direction matching results, quickly judging filtered candidate sets and obtaining record pairs capable of being successfully matched. According to the matching method based on the generalization bi-direction similarity connection technique, generalization bi-direction similarity connection is carried out based on mapping, filtering and verification (MFV), and the method is suitable for wide application scenes in the real world.
Description
Technical field
The present invention relates to the technical field of computer networks, and in particular to a matching method based on a generalized bidirectional similarity join technique (the "generalization bi-direction similarity connection" technique of the title).
Background
A bidirectional similarity join finds, from two given data sets, all pairs of records that satisfy a predetermined join condition; it is an important operation in database applications. Traditional matching methods, however, handle diverse data types poorly and cannot keep up with the growing demands of real-world applications. Moreover, in bidirectional matching the two parties may play different roles: each party's factual data must be matched against the other party's expected data, and vice versa. How to cross-compare expected data with factual data is a key technical problem that needs to be solved. In addition, individuals in the real world differ in the degree of match they require: some need a match that satisfies 100% of their expectations, while others do not set their expectations so high. How to design the comparison used in the join around these per-individual limits on the matching degree is another important technical problem.
Therefore, a technical problem that those skilled in the art urgently need to solve is how to devise a more efficient matching method that meets more of the demands of practical applications.
Summary of the invention
The technical problem to be solved by the present invention is to provide a matching method based on a generalized bidirectional similarity join technique that performs the join by mapping-filter-verification (MFV) and is suitable for a wide range of real-world application scenarios.
To solve the above technical problem, an embodiment of the present invention discloses a matching method based on a generalized bidirectional similarity join technique, comprising:
obtaining the data to be matched of both matching parties and the threshold with which each party limits the matching degree;
performing bidirectional matching between each party's factual data and the other party's expected data;
filtering out, according to the result of the bidirectional matching, the large number of records that do not satisfy the matching condition;
quickly verifying the candidate set that remains after filtering to obtain the record pairs that match successfully.
Preferably, the matching of each party's factual data against the other party's expected data in the bidirectional matching is performed by cross-matching.
Preferably, the cross-matching is performed by mapping.
Preferably, the mapping comprises injective mapping, equal-step mapping and/or heuristic mapping.
Preferably, the mapping is heuristic mapping.
The matching method of the present invention performs the generalized bidirectional similarity join by mapping-filter-verification (MFV) and is suitable for a wide range of real-world application scenarios, for example employment and making friends. In these scenarios the data to be matched are of many types, and the individuals on the two sides differ in the matching degree they require. To address these problems, the mapping-filter-verification method cross-matches one party's factual data against the other party's expected data when performing the bidirectional join. In making friends, a match can thus meet the expectations of both sides at the same time; in employment, it helps recruiters and job seekers quickly find satisfactory employees and employers, increasing the success rate of the hiring process. The method also has broad application prospects in other social networks.
Meanwhile, the generalized bidirectional similarity join method proposed here produces no erroneous results and misses no correct results during matching; that is, it is both correct and complete. Compared with previous methods, the proposed method is more efficient, more widely applicable and closer to real-world needs.
Brief description of the drawings
Fig. 1 is a schematic flow chart of an embodiment of the matching method based on a generalized bidirectional similarity join technique of the present invention.
Embodiments
To make the above objects, features and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawing and specific embodiments.
Referring to Fig. 1, the matching method based on a generalized bidirectional similarity join technique described here comprises:
Step S101: obtain the data to be matched of both parties and the given thresholds that limit the matching degree;
Step S102: perform bidirectional matching between each party's factual data and the other party's expected data;
Step S103: filter out, according to the result of the bidirectional matching, the large number of records that do not satisfy the matching condition;
Step S104: quickly verify the candidate set that remains after filtering to obtain the record pairs that match successfully.
In a practical application, the method is implemented as follows:
1. Obtain the data sets R and S to be matched. Every record r in R and s in S contains its own factual data, its expected data about the other party, its satisfaction threshold t for a match, and any other data.
2. Map the factual data and expected data in every record of R and S (see step 2 of Algorithm 1).
a) Carry out numerical mapping according to the selected mapping mode: for injective mapping go to 2 b), for equal-step mapping go to 2 c), for heuristic mapping go to 2 d).
b) Injective mapping: map each value of each attribute to a unique global symbol. Go to 2 e).
c) Equal-step mapping: split the numerical range evenly with a fixed step, so that several values are mapped to the same symbol. Go to 2 e).
d) Heuristic mapping: find an approximately optimal mapping scheme.
i. Set the maximum number $k_0$ of blocks into which a numerical range may be divided.
ii. Determine the optimization objective used to compute the optimal partition of a numerical range.
Here $e(r) = a \sim b$ is an expected numerical range, and $\mathrm{Ext}(e(r))$ is the extension of $e(r)$ under a partition $\Pi$, i.e. $\mathrm{Ext}(e(r)) = a_j \sim b_k$ with $a_j = \max(a_i \mid a_i \le a)$ and $b_k = \min(b_i \mid b_i \ge b)$, where $a_i$ and $b_i$ are the lower and upper bounds of the blocks of $\Pi$. $2^{\Pi}$ denotes the set of all partitions of the numerical range, $|\Pi|$ is the number of blocks in $\Pi$, and $f$ is a function that computes the difference between the cardinality $|\mathrm{Ext}(e(r))|$ of the extension under $\Pi$ and the cardinality $|e(r)|$ of $e(r)$.
iii. Using the following optimal substructure, obtain the optimal partition $\Pi$ by dynamic programming.
$P_{i,j,k}$ denotes a partition of the numerical range $i \sim j$ into $k$ blocks, where $i \le j$ and $k > 0$. The cost $c[i, j, k]$ of $P_{i,j,k}$ is defined as the sum of the extension sizes that this partition causes over all expected data. For a given block $P_i = a_i \sim b_i$, if an original expected range $c_i \sim d_i$ is part of $P_i$, i.e. $c_i \ge a_i$ and $d_i < b_i$, then the sum of $c_i - a_i$ and $b_i - d_i$ is the extension size that this block adds to that expected datum.
e) Count the occurrences of the symbols and sort them to obtain a global order $O_t$ of the symbols, ascending by occurrence count.
f) Map every record to a generated record made up of symbols from the global symbol set. These records form the mapped data sets, denoted $R_m$ and $S_m$.
3. Preprocess the generated records obtained by the mapping.
a) Sort the generated records in $R_m$ and $S_m$ according to $O_t$ (see step 3 of Algorithm 1).
b) Using global symbols as keys, build the inverted indexes $I_r$ and $I_s$ over the expectation parts of the records in $R_m$ and $S_m$, respectively. For each sorted global-symbol record in $R_m$ and $S_m$, index its first $L - t \cdot L + 1$ expected symbols according to the record's threshold t (see step 4 of Algorithm 1).
4. Perform bidirectional filtering on the symbol records produced by the mapping to obtain the candidate set (see steps 5-12 of Algorithm 1).
a) Enumerate each generated record r in $R_m$, look up the factual global symbols of r in index $I_s$ to find the corresponding records s in $S_m$, and put each initial candidate pair (r, s) into the initial candidate set $CR_1$ (see steps 5-8 of Algorithm 1).
b) Traverse all initial candidate pairs (r, s) in $CR_1$ and check whether record r appears in the inverted lists of index $I_r$ for the factual global symbols of s. If it does, go to 4 c); if not, go to 4 d).
c) Put the pair into the final candidate set $CR_2$. Go to step 5.
d) Prune the pair (see steps 9-12 of Algorithm 1).
5. Check every candidate pair in the final candidate set $CR_2$ and output the qualifying pairs as the final result (see steps 13-15 of Algorithm 1).
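As an illustration of the equal-step mapping, the global symbol order $O_t$ and the generated records of step 2, the following is a minimal sketch rather than the patent's own listing. It assumes that a symbol is written as `attribute#bucket`, that an expected range is mapped to every bucket it overlaps, and that ties in $O_t$ are broken by symbol name; the function names are hypothetical.

```python
from collections import Counter

def equal_step_symbol(attr, value, lo, step):
    """Map a numeric value to the symbol of the fixed-width bucket containing it."""
    return f"{attr}#{int((value - lo) // step)}"

def range_symbols(attr, a, b, lo, step):
    """Symbols of every bucket overlapped by the expected range a~b (assumption)."""
    first, last = int((a - lo) // step), int((b - lo) // step)
    return [f"{attr}#{i}" for i in range(first, last + 1)]

def global_order(symbol_lists):
    """O_t: symbols sorted by ascending occurrence count, ties broken by name."""
    counts = Counter(sym for symbols in symbol_lists for sym in symbols)
    return sorted(counts, key=lambda s: (counts[s], s))
```

Rare symbols come first in this order, so sorting each generated record by it puts the most selective symbols at the front, which is what the prefix indexing in step 3 relies on.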
Algorithm 1. Mapping-filter-verification algorithm (MFV)
Input: R, S (data sets)
Output: RS (query result data set)
To help those skilled in the art better understand the present invention, the scheme is described in more detail below in the context of a practical application.
[Steps]
1. Defining the generalized bidirectional similarity join
Definition 1. The "satisfies" operator ($\propto$) is defined between a fact and its corresponding expectation. The criterion for $\propto$ differs with the data type. For example, if the fact $f$ is a numerical value and the expectation $e = a \sim b$ is a numerical range, then $f \propto e$ if and only if $f \ge a \land f \le b$; if the fact $f$ is an element of a set and the expectation $e = \{e_1, e_2, \ldots, e_n\}$ is a set, then $f \propto e$ if and only if $f \in e$.
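Definition 1 can be sketched in a few lines. The representation is an assumption made for illustration: a numerical range a~b is stored as a tuple `(a, b)` and a set-valued expectation as a Python `set`; the function name `satisfies` is hypothetical.

```python
def satisfies(fact, expectation):
    """Return True when fact ∝ expectation in the sense of Definition 1."""
    if isinstance(expectation, tuple):
        # Numerical range a~b: f ∝ e iff f >= a and f <= b.
        a, b = expectation
        return a <= fact <= b
    # Set-valued expectation: f ∝ e iff f is an element of e.
    return fact in expectation
```

For example, `satisfies(27, (25, 30))` holds, while `satisfies(31, (25, 30))` does not.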
Definition 2. Assume that every record in data sets R and S contains factual data, expected data, threshold data and other, unrelated data. In the formal description of a record, $u + v \le q$ and $u + v \le w$; $r_i^f$ ($i = 1, \ldots, u$) denote the $u$ factual data of r; r further has $v$ expected data; $r_i$ ($i = u+v+1, \ldots, q$) denote the other data of r; and $T(r)$ is the threshold of r. Likewise, s has $v$ factual data and $u$ expected data; $s_i$ ($i = u+v+1, u+v+2, \ldots, w$) denote the other data of s; and $T(s)$ is the threshold of s. The generalized bidirectional similarity join of R and S is defined as
$R \bowtie S = \{(r, s) \mid r \in R,\ s \in S,\ \mathrm{ExSim}(r, s) \ge T(s) \land \mathrm{ExSim}(s, r) \ge T(r)\}$,
where $\mathrm{ExSim}(r, s)$ is compared against the threshold of s and $\mathrm{ExSim}(s, r)$ against the threshold of r.
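The join condition of Definition 2 can be checked by brute force, which is useful as a reference when testing a faster method. The sketch below is not the patent's formula for ExSim, which is not reproduced in this text: it assumes that ExSim(x, y) is the fraction of y's expectations that x's facts satisfy, that every record has at least one expectation, and that records are dictionaries with the hypothetical keys `id`, `facts`, `expect` and `t`.

```python
def meets(fact, expectation):
    # Definition 1: ranges are (a, b) tuples, set expectations are sets.
    if isinstance(expectation, tuple):
        a, b = expectation
        return a <= fact <= b
    return fact in expectation

def ex_sim(x, y):
    """Assumed ExSim(x, y): share of y's expectations that x's facts satisfy."""
    expect = y["expect"]
    met = sum(1 for attr, e in expect.items()
              if attr in x["facts"] and meets(x["facts"][attr], e))
    return met / len(expect)

def naive_join(R, S):
    """All (r, s) with ExSim(r, s) >= T(s) and ExSim(s, r) >= T(r)."""
    return [(r["id"], s["id"]) for r in R for s in S
            if ex_sim(r, s) >= s["t"] and ex_sim(s, r) >= r["t"]]
```

This nested loop compares every pair of records, which is the cost that the mapping and filtering steps are meant to avoid.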
2. The mapping-filter-verification method
Based on the above definitions, a method for solving the generalized bidirectional similarity join problem is proposed. It consists of three steps (mapping, filtering and verification) and is called the mapping-filter-verification method for short.
Algorithm 1. Mapping-filter-verification algorithm
Input: R, S (data sets)
Output: RS (query result data set)
The three steps of Algorithm 1 are described in detail below:
Step 1: mapping.
1) Carry out numerical mapping according to the selected mapping mode: for injective mapping go to 2), for equal-step mapping go to 3), for heuristic mapping go to 4).
2) Injective mapping: map each value of each attribute to a unique global symbol. Go to 5).
3) Equal-step mapping: split the numerical range evenly with a fixed step, so that several values are mapped to the same symbol. Go to 5).
4) Heuristic mapping: find an approximately optimal mapping scheme.
a) Set the maximum number $k_0$ of blocks into which a numerical range may be divided.
b) Determine the optimization objective used to compute the optimal partition of a numerical range.
Here $e(r) = a \sim b$ is an expected numerical range, and $\mathrm{Ext}(e(r))$ is the extension of $e(r)$ under a partition $\Pi$, i.e. $\mathrm{Ext}(e(r)) = a_j \sim b_k$ with $a_j = \max(a_i \mid a_i \le a)$ and $b_k = \min(b_i \mid b_i \ge b)$, where $a_i$ and $b_i$ are the lower and upper bounds of the blocks of $\Pi$. $2^{\Pi}$ denotes the set of all partitions of the numerical range, $|\Pi|$ is the number of blocks in $\Pi$, and $f$ is a function that computes the difference between the cardinality $|\mathrm{Ext}(e(r))|$ of the extension under $\Pi$ and the cardinality $|e(r)|$ of $e(r)$.
c) Using the following optimal substructure, obtain the optimal partition $\Pi$ by dynamic programming.
$P_{i,j,k}$ denotes a partition of the numerical range $i \sim j$ into $k$ blocks, where $i \le j$ and $k > 0$. The cost $c[i, j, k]$ of $P_{i,j,k}$ is defined as the sum of the extension sizes that this partition causes over all expected data. For a given block $P_i = a_i \sim b_i$, if an original expected range $c_i \sim d_i$ is part of $P_i$, i.e. $c_i \ge a_i$ and $d_i < b_i$, then the sum of $c_i - a_i$ and $b_i - d_i$ is the extension size that this block adds to that expected datum.
5) Count the occurrences of the symbols and sort them to obtain a global order $O_t$ of the symbols, ascending by occurrence count.
6) Map every record to a generated record made up of symbols from the global symbol set. These records form the mapped data sets, denoted $R_m$ and $S_m$ (step 2).
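The dynamic program behind the heuristic mapping of step 4) can be illustrated with a small sketch. It is not the patent's own listing: it assumes integer-valued ranges, measures the extension of an expected range c~d as the distance from c down to the start of the block containing c plus the distance from d up to the end of the block containing d, and minimises the summed extension over partitions with at most k0 blocks. The names `block_cost` and `best_partition` are hypothetical.

```python
from functools import lru_cache

def block_cost(lo, hi, ranges):
    """Extension that one block lo..hi (inclusive) adds to the expected ranges."""
    left = sum(c - lo for c, _ in ranges if lo <= c <= hi)
    right = sum(hi - d for _, d in ranges if lo <= d <= hi)
    return left + right

def best_partition(lo, hi, ranges, k0):
    """Split lo..hi into at most k0 contiguous blocks with minimal total extension."""
    @lru_cache(maxsize=None)
    def solve(start, k):
        # Cheapest way to cover start..hi with at most k blocks.
        if start > hi:
            return 0, ()
        if k == 0:
            return float("inf"), ()
        best = (float("inf"), ())
        for end in range(start, hi + 1):
            rest_cost, rest_blocks = solve(end + 1, k - 1)
            cost = block_cost(start, end, ranges) + rest_cost
            if cost < best[0]:
                best = (cost, ((start, end),) + rest_blocks)
        return best

    return solve(lo, k0)
```

The total cost splits block by block because each endpoint of an expected range falls into exactly one block, which is the optimal substructure the recursion exploits. With k0 equal to the number of distinct values the cost is always zero, so the bound k0 is what forces several values onto the same symbol.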
Step 2: filtering.
1) Sort the generated records in $R_m$ and $S_m$ according to $O_t$ (step 3).
2) Using global symbols as keys, build the inverted indexes $I_r$ and $I_s$ over the expectation parts of the records in $R_m$ and $S_m$, respectively (step 4). For each sorted global-symbol record in $R_m$ and $S_m$, index its first $L - t \cdot L + 1$ expected symbols, where t is the threshold of the record.
3) Generate candidate pairs according to the filtering principle (steps 5-12).
a) Enumerate each generated record r in $R_m$, look up the factual global symbols of r in index $I_s$ to find the corresponding records s in $S_m$, and put each initial candidate pair (r, s) into the initial candidate set $CR_1$ (steps 5-8).
b) Traverse all initial candidate pairs (r, s) in $CR_1$ and check whether record r appears in the inverted lists of index $I_r$ for the factual global symbols of s. If it does, go to c); if not, go to d).
c) Put the pair into the final candidate set $CR_2$. Go to Step 3.
d) Prune the pair (steps 9-12).
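The filtering step can be sketched as follows. Several points are assumptions rather than statements of the patent: L is taken to be the number of expected symbols of a record, the prefix length is rounded to L - ceil(t*L) + 1 so that it is an integer, thresholds satisfy 0 < t <= 1, each expectation has been mapped to a single symbol, and records are dictionaries with the hypothetical keys `facts`, `expect` (already sorted by the global order) and `t`.

```python
from collections import defaultdict
from math import ceil

def prefix(expect, t):
    """First L - ceil(t*L) + 1 expected symbols of a sorted record."""
    L = len(expect)
    return expect[: L - ceil(t * L) + 1]

def build_index(records):
    """Inverted index: expected symbol -> ids of records whose prefix holds it."""
    index = defaultdict(set)
    for rid, rec in records.items():
        for sym in prefix(rec["expect"], rec["t"]):
            index[sym].add(rid)
    return index

def candidates(R_m, S_m):
    """Bidirectional filter: returns the final candidate set CR2."""
    I_r, I_s = build_index(R_m), build_index(S_m)
    # CR1: some fact of r hits an indexed expectation of s.
    cr1 = {(rid, sid) for rid, r in R_m.items()
           for f in r["facts"] for sid in I_s.get(f, ())}
    # CR2: keep a pair only if some fact of s also hits an indexed
    # expectation of r; every other pair is pruned.
    return {(rid, sid) for rid, sid in cr1
            if any(rid in I_r.get(f, ()) for f in S_m[sid]["facts"])}
```

Under these assumptions the prefix is safe to use: a record that needs at least ceil(t*L) of its L expectations met cannot be matched by a partner that meets none of the first L - ceil(t*L) + 1 of them, so no qualifying pair is pruned.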
Step 3: verification.
Check every candidate pair in the final candidate set $CR_2$ and output the qualifying pairs as the final result (steps 13-15).
The matching method based on a generalized bidirectional similarity join technique provided by the present invention has been described in detail above. Specific examples have been used to explain the principle and embodiments of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. For those of ordinary skill in the art, the specific embodiments and the scope of application may vary in accordance with the idea of the invention. In summary, the content of this description should not be construed as limiting the present invention.
Claims (5)
1. A matching method based on a generalized bidirectional similarity join technique, characterized by comprising:
obtaining the data to be matched of both parties and the threshold with which each party limits the matching degree;
performing bidirectional matching between each party's factual data and the other party's expected data;
filtering out, according to the result of the bidirectional matching, the large number of records that do not satisfy the matching condition;
quickly verifying the candidate set that remains after filtering to obtain the record pairs that match successfully.
2. The matching method based on a generalized bidirectional similarity join technique according to claim 1, characterized in that the matching of each party's factual data against the other party's expected data in the bidirectional matching is performed by cross-matching.
3. The matching method based on a generalized bidirectional similarity join technique according to claim 2, characterized in that the cross-matching is performed by mapping.
4. The matching method based on a generalized bidirectional similarity join technique according to claim 3, characterized in that the mapping comprises injective mapping, equal-step mapping and/or heuristic mapping.
5. The matching method based on a generalized bidirectional similarity join technique according to claim 3, characterized in that the mapping is heuristic mapping.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410407666.9A CN104166719B (en) | 2014-08-19 | 2014-08-19 | Matching process based on extensive two-way similar interconnection technique |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410407666.9A CN104166719B (en) | 2014-08-19 | 2014-08-19 | Matching process based on extensive two-way similar interconnection technique |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104166719A true CN104166719A (en) | 2014-11-26 |
CN104166719B CN104166719B (en) | 2018-02-16 |
Family
ID=51910532
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410407666.9A Active CN104166719B (en) | 2014-08-19 | 2014-08-19 | Matching process based on extensive two-way similar interconnection technique |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104166719B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021493A (en) * | 2016-05-19 | 2016-10-12 | 天津工业大学 | Method and device for similarity connection of inconsistent constraints |
CN108573052A (en) * | 2018-04-23 | 2018-09-25 | 南京大学 | A kind of similar connection method of the set of threshold adaptive |
CN108846067A (en) * | 2018-06-05 | 2018-11-20 | 洛阳师范学院 | The high dimensional data similarity join querying method and device divided based on mapping space |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101180645A (en) * | 2004-12-07 | 2008-05-14 | 毕库德股份有限公司 | Electronic commerce system, method and apparatus |
CN101453398A (en) * | 2007-12-06 | 2009-06-10 | 怀特威盛软件公司 | Novel distributed grid super computer system and method |
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
US20120185422A1 (en) * | 2011-01-14 | 2012-07-19 | Shah Amip J | Node similarity for component substitution |
US20130036119A1 (en) * | 2011-08-01 | 2013-02-07 | Qatar Foundation | Behavior Based Record Linkage |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101180645A (en) * | 2004-12-07 | 2008-05-14 | 毕库德股份有限公司 | Electronic commerce system, method and apparatus |
CN103218732A (en) * | 2004-12-07 | 2013-07-24 | 毕库德股份有限公司 | Electronic commerce system, method and apparatus |
CN101453398A (en) * | 2007-12-06 | 2009-06-10 | 怀特威盛软件公司 | Novel distributed grid super computer system and method |
US20120185422A1 (en) * | 2011-01-14 | 2012-07-19 | Shah Amip J | Node similarity for component substitution |
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
US20130036119A1 (en) * | 2011-08-01 | 2013-02-07 | Qatar Foundation | Behavior Based Record Linkage |
Non-Patent Citations (2)
Title |
---|
朱建新: ""生物认证系统错误率分析"", 《计算机应用研究》 * |
王金宝: ""云计算系统中索引与查询处理技术研究"", 《中国博士学位论文全文数据库 信息科技辑》 * |
Also Published As
Publication number | Publication date |
---|---|
CN104166719B (en) | 2018-02-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |