CN109472013A

CN109472013A - Method for detecting foreign key relationships between web tables based on distribution fitting

Info

Publication number: CN109472013A
Application number: CN201811250624.3A
Authority: CN
Inventors: 王宁; 王佳敏
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2018-10-25
Filing date: 2018-10-25
Publication date: 2019-03-15
Anticipated expiration: 2038-10-25
Also published as: CN109472013B

Abstract

The present invention provides a method for detecting foreign key relationships between web tables based on distribution fitting. The method comprises: detecting inclusion dependencies between different attribute columns of the web tables, and selecting candidate foreign key relationship pairs between the web tables according to the inclusion dependency detection results; constructing multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key in each candidate foreign key relationship pair, and calculating the goodness of fit between the two multi-dimensional distribution graphs; and judging, according to that goodness of fit, whether the candidate foreign key relationship pair is a true foreign key relationship pair. The invention is applicable to foreign key relationships of both character type and numeric type, can detect both single-column and multi-column foreign key relationships, and combines high detection accuracy with high detection efficiency.

Description

Method for detecting foreign key relationships between web tables based on distribution fitting

Technical field

The present invention relates to the technical field of network information processing, and in particular to a method for detecting foreign key relationships between web tables based on distribution fitting.

Background art

The Internet contains a large number of structured tables, which provide convenient and abundant datasets for data integration and retrieval. To strengthen the connections between web tables and make effective use of the tabular data published on the web, Anish et al. attempted to detect potential relationships between web tables and to find related tables. The foreign key relationship, one of the most important constraints in a database, is very valuable to schema designers because it can be used to specify two semantically related tables. However, for the large number of web tables that come from heterogeneous data sources, foreign keys are in most cases not specified. Discovering foreign key relationships is therefore an important step toward understanding and using web tables.

At present, most foreign key detection methods in the prior art concentrate on identifying inclusion dependencies between tables. However, detecting foreign key relationships through inclusion alone is not sufficient; the most direct approach is to find the important features that a true foreign key relationship should satisfy. Alexandra Rostin et al. proposed a set of rules based on features such as column name similarity, average length of column values, uniqueness of column values, and coverage, and used them to find single-column foreign key relationships in traditional relational tables. However, these methods are not suitable for web tables, which suffer from missing schema information and noisy data.

Meihui Zhang et al. proposed using randomness in place of the above set of rules that a foreign key relationship should satisfy, and applied it to the detection of both single-column and multi-column foreign key relationships. This method evaluates the randomness of the distribution of two columns of data solely from the distribution of their column values, and screens for true foreign key relationships by the magnitude of the randomness. In this method, the Earth Mover's Distance (EMD) measures the amount of work required to transfer one set of attribute values in the foreign key to another set of attribute values in the primary key, and this value is used to represent the degree of randomness. However, when the foreign key values are uniformly distributed over only a certain region of the primary key, the EMD can still come out as a very small value.

The above prior-art foreign key detection methods have the following problems:

(1) Because web tables are not standardized, their data may contain noise and their headers may be missing. Most current foreign key detection methods rely on structural features of the table, so they are applicable only to traditional relational tables and not to web tables.

(2) Most current foreign key detection algorithms are applicable only to character-type foreign key relationships and not to numeric-type ones.

(3) Current foreign key detection algorithms either detect only single-column foreign key relationships or detect multi-column foreign key relationships by means of randomness. These methods cannot guarantee the randomness of the foreign key's distribution within the primary key; because they cannot solve the local randomness problem, their results are unsatisfactory.

Summary of the invention

Embodiments of the present invention provide a method for detecting foreign key relationships between web tables based on distribution fitting, so as to overcome the problems of the prior art.

To achieve the above object, the present invention adopts the following technical solutions.

A method for detecting foreign key relationships between web tables based on distribution fitting, comprising:

Detecting inclusion dependencies between different attribute columns of the web tables, and selecting candidate foreign key relationship pairs between the web tables according to the inclusion dependency detection results;

Constructing multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key in each candidate foreign key relationship pair, and calculating the goodness of fit between the multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key;

Judging, according to the goodness of fit between the multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key, whether the candidate foreign key relationship pair is a true foreign key relationship pair.

Further, detecting inclusion dependencies between different attribute columns of the web tables, and selecting the candidate foreign key relationship pairs between the web tables according to the inclusion dependency detection results, comprises:

Storing the tables in the web table set to be examined by column into a column set, performing fuzzy matching on the character-type attribute columns in the column set and value matching on the numeric-type attribute columns in the column set, and finding all single-column attribute pairs in the column set according to the results of the fuzzy matching and the value matching;

Detecting, among all the single-column attribute pairs, multi-column attribute pairs from the same tables: for all detected single-column inclusion dependencies (INDs), checking whether there is a set A of n attribute columns from one table that is contained in a set B of n attribute columns from another table, and if so, taking the attribute pair formed by A and B as a multi-column IND;

Judging whether each single-column attribute pair and each multi-column attribute pair satisfies a set primary key uniqueness condition, the condition being that the repeated values in the primary key are fewer than a set threshold λ, and taking the single-column and multi-column attribute pairs that satisfy the condition as candidate foreign key relationship pairs, each candidate foreign key relationship pair comprising a candidate foreign key F and a candidate primary key P.

Further, constructing the multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key in the candidate foreign key relationship pair comprises:

For each candidate foreign key relationship pair, sorting the column values of each column of the candidate foreign key F to obtain the position of each value within its column, mapping each column to one dimension of a multi-dimensional space, and then hash-mapping the positions of the column values distributed along each dimension to obtain the multi-dimensional distribution graph of the candidate foreign key F; likewise, sorting the column values of each column of the candidate primary key P to obtain the position of each value within its column, mapping each column to one dimension of the multi-dimensional space, and then hash-mapping the positions of the column values distributed along each dimension to obtain the multi-dimensional distribution graph of the candidate primary key P.

Further, calculating the goodness of fit between the multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key comprises:

Partitioning the multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P;

Determining, from the partitioned multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P, the number of values in F that should fall into each partition of the multi-dimensional distribution graph of P, this number being called the theoretical frequency; counting the number of values in F that actually fall into each partition of the multi-dimensional distribution graph of P, this number being called the observed frequency; and calculating, from the theoretical frequencies and the observed frequencies, the overall deviation between the multi-dimensional distribution graphs of the candidate primary key P and the candidate foreign key F;

Determining, from the overall deviation, the goodness of fit between the two multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P.

Further, partitioning the multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P comprises:

Setting a point-count threshold s for subspaces. For each k-dimensional multi-dimensional distribution graph, the interval on each dimension is divided into two equal parts, yielding $2^k$ subspaces. Each of these $2^k$ subspaces whose point count exceeds the threshold s is further divided into $2^k$ subspaces, and any resulting subspace whose point count still exceeds s is divided again. This iterates until the number of points in every subspace is less than or equal to the threshold s.
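The iterative halving described above can be sketched as a short recursive routine. This is only an illustrative sketch: the function name `partition`, the half-open box convention, and the `max_depth` guard (which stops the recursion when more than s identical points share one location) are assumptions not taken from the patent text.

```python
from itertools import product


def partition(points, lower, upper, s, max_depth=32):
    """Split the box [lower, upper) into 2**k equal sub-boxes, recursively,
    until every leaf holds at most s points.

    points: list of k-tuples lying inside [lower, upper).
    Returns a list of leaves (lower, upper, points_in_leaf).
    max_depth is an assumed safety guard, not part of the described method.
    """
    if len(points) <= s or max_depth == 0:
        return [(lower, upper, points)]
    k = len(lower)
    mid = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
    leaves = []
    # Each choice picks the lower (0) or upper (1) half on every dimension.
    for choice in product((0, 1), repeat=k):
        sub_lo = tuple(mid[d] if choice[d] else lower[d] for d in range(k))
        sub_hi = tuple(upper[d] if choice[d] else mid[d] for d in range(k))
        inside = [p for p in points
                  if all(sub_lo[d] <= p[d] < sub_hi[d] for d in range(k))]
        leaves.extend(partition(inside, sub_lo, sub_hi, s, max_depth - 1))
    return leaves
```

For example, six points in the box from (0, 0) to (4, 4) with s = 2 need only one round of halving when no quadrant holds more than two points. Empty sub-boxes are kept as leaves in this sketch, since a subspace with no primary key values is still a subspace of the partition graph.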

Further, determining the theoretical frequencies and the observed frequencies from the partitioned multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P, and calculating from them the overall deviation between the multi-dimensional distribution graphs of the candidate primary key P and the candidate foreign key F, comprises:

Given the multi-dimensional distribution graph G of a candidate primary key P containing k attribute columns, let $F_i$ denote the set of partitions of P on the i-th column, containing $n_i$ partitions. $G_P = F_1 \times \dots \times F_k$ is defined as the k-dimensional partition graph of P, which consists of $n_1 \times \dots \times n_k$ k-dimensional subspaces. $N_{sub}(G_P)$ denotes the total number of subspaces in the partition graph $G_P$:

$$N_{sub}(G_P) = \prod_{i=1}^{k} n_i$$

The observed frequency of the candidate foreign key F for the t-th subspace of $G_P$ is defined as the number of values of F that actually fall in the t-th subspace of $G_P$. It is obtained by matching the attribute values of F in the t-th subspace against the attribute values of P in the t-th subspace; the number of attribute values that match is recorded as the observed frequency.

The theoretical frequency of the candidate foreign key F for the t-th subspace of $G_P$ is defined as the number of values of F that should theoretically fall in the t-th subspace of $G_P$, and is given by

$$\mathrm{FNum}_{all}(F) \times \frac{\mathrm{PNum}_{t}(P)}{\mathrm{PNum}_{all}(P)}$$

where $\mathrm{FNum}_{all}(F)$ denotes the number of all values in the candidate foreign key F, $\mathrm{PNum}_{t}(P)$ denotes the number of values in the t-th subspace of $G_P$, and $\mathrm{PNum}_{all}(P)$ denotes the number of all distinct values in the candidate primary key P;

The overall deviation Dev(F, P) between the multi-dimensional distribution graphs of the candidate primary key P and the candidate foreign key F is calculated by the following formula:

Further, determining, from the overall deviation, the goodness of fit between the two multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P comprises:

The goodness of fit GOF(F, P) between the two multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P is calculated by the following formula:

where a is a parameter that adjusts the monotonicity, with a > 1.

Further, judging, according to the goodness of fit between the multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key, whether the candidate foreign key relationship pair is a true foreign key relationship pair comprises:

If the goodness of fit between the two multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P is greater than a set threshold, judging that the candidate foreign key F and the candidate primary key P form a true foreign key relationship.
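The threshold decision can be illustrated with a small sketch. The GOF formula itself is not reproduced in this text, so the functional form below, $a^{-\mathrm{Dev}}$, is purely an assumption chosen because it decreases monotonically as the deviation grows and uses a parameter a > 1; the function names and default values are likewise hypothetical.

```python
def goodness_of_fit(dev, a=2.0):
    """ASSUMED form a ** (-dev): equals 1 when the deviation is 0 and
    decreases toward 0 as the deviation grows. The text only states that
    a > 1 adjusts the monotonicity; the exact formula is not given here."""
    if a <= 1:
        raise ValueError("a must be greater than 1")
    return a ** (-dev)


def is_true_foreign_key(dev, threshold=0.8, a=2.0):
    """Accept the candidate pair when the goodness of fit exceeds the
    set threshold (the threshold value 0.8 is an illustrative choice)."""
    return goodness_of_fit(dev, a) > threshold
```

Under these assumptions, a candidate pair with a deviation of 0.25 is accepted at threshold 0.8, while a pair with a deviation of 1.0 is rejected.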

As can be seen from the technical solutions provided by the above embodiments of the present invention, the algorithm of the embodiments is applicable to foreign key relationships of both character type and numeric type, and can detect both single-column and multi-column foreign key relationships. In addition, the foreign key detection algorithm is optimized by a multi-pass partitioning algorithm with lower time complexity, so that it scales effectively to large web tables. The algorithm of the embodiments combines high detection accuracy with high detection efficiency.

Additional aspects and advantages of the present invention will be set forth in part in the description that follows, will in part become apparent from that description, or will be learned through practice of the invention.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the implementation principle of a method for detecting foreign key relationships between web tables based on distribution fitting, provided by an embodiment of the present invention;

Fig. 2 is a processing flowchart of a method for detecting foreign key relationships between web tables based on distribution fitting, provided by an embodiment of the present invention;

Fig. 3 is a flowchart of the search for candidate foreign key relationships provided by an embodiment of the present invention;

Fig. 4 is a flowchart of the distribution fitting test provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the multi-pass partitioning of a multi-dimensional distribution graph provided by an embodiment of the present invention;

Fig. 6 is a flowchart of the optimized partitioning algorithm provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in the specification of the present invention indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general-purpose dictionaries should be interpreted as having meanings consistent with their meanings in the context of the prior art and, unless defined as herein, are not to be interpreted in an idealized or overly formal sense.

To facilitate understanding of the embodiments of the present invention, several specific embodiments are further explained below with reference to the drawings; none of these embodiments limits the embodiments of the present invention.

Embodiments of the present invention propose a foreign key detection algorithm that is applicable to web tables, handles both numeric-type and character-type foreign key relationships as well as both single-column and multi-column foreign key relationships, and has high accuracy and execution efficiency.

The present invention takes into account foreign key relationships over multiple data types and the noisy data in web tables, and proposes a foreign key detection algorithm based on distribution fitting. First, candidate foreign key relationships are found by detecting inclusion dependencies between web tables; to handle inconsistent data in web tables, the conditions that a foreign key relationship must satisfy are relaxed when screening the candidates. Second, inspired by the idea of distribution fitting in mathematical statistics, the values in a foreign key column are assumed to have the same distribution as the values in the primary key column, and a distribution fitting test is used to find the true foreign key relationships among all candidates.

The implementation principle of the method for detecting foreign key relationships between web tables based on distribution fitting provided by an embodiment of the present invention is shown in Fig. 1, and the specific processing flow is shown in Fig. 2, which comprises the following processing steps:

Step 1: detect inclusion dependencies between different attribute columns of the web tables, and select candidate foreign key relationship pairs between the web tables according to the inclusion dependency detection results.

Step 2: construct the multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key in each candidate foreign key relationship pair, and calculate the goodness of fit between the multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key.

Step 3: judge, according to the goodness of fit between the multi-dimensional distribution graphs of the candidate foreign key and the candidate primary key, whether the candidate foreign key relationship pair is a true foreign key relationship pair.

Fig. 3 is a flowchart of the search for candidate foreign key relationships provided by an embodiment of the present invention. The specific processing is as follows. Unlike tables in a relational database, web tables do not necessarily have complete schema information, so foreign key relationships across different data sources can only be searched for using attribute values. In addition, the embodiment of the present invention relaxes, in three respects, the properties that a foreign key relationship should satisfy, in order to accommodate the data noise and data inconsistency found in web tables.

(1) Value matching. The embodiment of the present invention uses different matching methods for the different types of column values in web tables. For string data, it uses a fuzzy similarity matching method based on the Jaro-Winkler distance and sets a threshold δ to handle the noisy data in web tables. For numeric data, it uses exact value matching.

When searching for single-column INDs, the inclusion coverage rate between columns must be detected, and attribute pairs whose inclusion coverage rate is higher than a threshold μ are output as single-column INDs. During detection, the attribute values of the two columns are matched, and the inclusion coverage rate between the columns is calculated from the number of attribute values that have a match. For string data, fuzzy matching is performed with the Jaro-Winkler distance: attribute values whose similarity is above the threshold δ are treated as matching, and those below δ as non-matching. For numeric data, the result of exact matching determines whether attribute values match.

(2) Primary key uniqueness. Because web tables are not standardized, strict primary key uniqueness is hard to satisfy. The embodiment of the present invention allows a small number of repeated values in the primary key and sets a threshold λ to handle this non-uniqueness, that is, the repeated values in the primary key must be fewer than the set threshold λ.

(3) Inclusion coverage. To handle data inconsistency, the embodiment of the present invention relaxes the inclusion relationship that candidate attribute pairs must satisfy, and sets a threshold μ for this inclusion relationship.

The web table set is obtained first, and the tables in the set are stored by column into a column set. As the first step of foreign key detection, the embodiment of the present invention needs to find all attribute pairs that satisfy an inclusion dependency. These include single-column attribute pairs, called single-column INDs (inclusion dependencies), and multi-column attribute pairs, called multi-column INDs. Fuzzy matching is performed on the character-type attribute columns in the column set and value matching on the numeric-type attribute columns, where value matching uses exact matching and fuzzy matching uses the Jaro-Winkler distance. All single-column INDs in the column set are then found according to the results of the fuzzy matching and the value matching.
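The single-column IND search can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the data layout (a dictionary keyed by table and column name), the choice to compute coverage over distinct values, and the use of "greater than or equal to" for both thresholds are all assumptions. The Jaro-Winkler similarity is written out by hand so that the block has no external dependencies.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler similarity: Jaro plus a bonus for a shared prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def coverage(dep_col, ref_col, delta=0.9):
    """Fraction of distinct values of dep_col that have a match in ref_col.
    Strings are matched fuzzily (similarity >= delta), other values exactly.
    Using distinct values is an assumption of this sketch."""
    dep_vals, ref_vals = set(dep_col), set(ref_col)
    if not dep_vals:
        return 0.0

    def has_match(v):
        if isinstance(v, str):
            return any(isinstance(r, str) and jaro_winkler(v, r) >= delta
                       for r in ref_vals)
        return v in ref_vals

    return sum(has_match(v) for v in dep_vals) / len(dep_vals)


def single_column_inds(columns, mu=0.9, delta=0.9):
    """columns: dict mapping (table, column) -> list of values.
    Returns (dependent, referenced) column pairs from different tables
    whose inclusion coverage rate reaches mu."""
    return [(a, b)
            for a, a_vals in columns.items()
            for b, b_vals in columns.items()
            if a[0] != b[0] and coverage(a_vals, b_vals, delta) >= mu]
```

With δ = 0.9, a misspelled value such as "Bejing" still matches "Beijing", so a noisy city column can be reported as included in a clean reference column, which is the kind of tolerance the threshold δ is meant to provide.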

Next, it is checked whether the single-column INDs contain multi-column INDs from the same tables. For all detected single-column INDs, it is checked whether there is a set A of n attribute columns from one table that is contained in a set B of n attribute columns from another table; if so, the attribute pair formed by A and B is taken as a multi-column IND.
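One way to combine single-column INDs into multi-column INDs is sketched below. The function name, the data layout, and the verification step (exact row-tuple matching against a coverage threshold μ) are assumptions made for illustration; the patent text does not specify how the combined columns are verified.

```python
from itertools import combinations


def multi_column_inds(single_inds, tables, n=2, mu=0.9):
    """Combine single-column INDs that share the same pair of tables into
    n-column IND candidates and keep those whose row tuples are covered.

    single_inds: list of ((dep_table, dep_col), (ref_table, ref_col)).
    tables: dict table -> dict column -> list of values (row-aligned).
    Row tuples are compared exactly here, which is a simplification.
    """
    by_tables = {}
    for (dt, dc), (rt, rc) in single_inds:
        by_tables.setdefault((dt, rt), []).append((dc, rc))

    result = []
    for (dt, rt), pairs in by_tables.items():
        for combo in combinations(sorted(pairs), n):
            dep_cols = [dc for dc, _ in combo]
            ref_cols = [rc for _, rc in combo]
            # Each column may appear only once on either side.
            if len(set(dep_cols)) < n or len(set(ref_cols)) < n:
                continue
            dep_rows = set(zip(*(tables[dt][c] for c in dep_cols)))
            ref_rows = set(zip(*(tables[rt][c] for c in ref_cols)))
            if dep_rows and len(dep_rows & ref_rows) / len(dep_rows) >= mu:
                result.append(((dt, tuple(dep_cols)), (rt, tuple(ref_cols))))
    return result
```

Checking the row tuples matters because two columns can each be included in a reference column while their combinations are not: a table could contain the pair ("US", "Beijing") even though both values individually appear in the reference table.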

All single-column INDs and multi-column INDs are then checked against the primary key uniqueness condition set by the embodiment of the present invention. The single-column INDs and multi-column INDs that satisfy this condition are taken as candidate foreign key relationship pairs, each of which comprises a candidate foreign key F and a candidate primary key P. In a candidate foreign key relationship pair formed from a single-column IND, the candidate foreign key F and the candidate primary key P each contain only a single column.
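The relaxed uniqueness check can be expressed in a few lines. The text does not say whether λ bounds a count or a proportion of repeated values; the sketch below assumes a proportion, and both function names are hypothetical.

```python
def duplicate_ratio(rows):
    """Share of rows that repeat an earlier value (0.0 for a strictly
    unique key). rows may hold scalars or tuples for multi-column keys."""
    rows = list(rows)
    if not rows:
        return 0.0
    return 1 - len(set(rows)) / len(rows)


def is_candidate_primary_key(rows, lam=0.05):
    """Relaxed uniqueness: accept the key when its duplicate ratio is
    below the threshold lam (interpreting lambda as a ratio is an assumption)."""
    return duplicate_ratio(rows) < lam
```

For a multi-column IND, `rows` would be the list of row tuples of the referenced columns, so uniqueness is judged on the combination rather than on each column separately.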

Fig. 4 is a flowchart of the distribution fitting test provided by an embodiment of the present invention, by which true foreign key relationships are found among all candidate foreign key relationships through distribution fitting. As can be seen from Fig. 4, the distribution fitting process of the embodiment of the present invention consists mainly of the following four parts:

(1) Construction of the multi-dimensional distribution graphs. For each candidate foreign key relationship pair, the embodiment of the present invention sorts the column values of each column of the candidate foreign key F and obtains the position of each value within its column. Each column is mapped to one dimension of a multi-dimensional space, and the positions of the column values distributed along each dimension are then hash-mapped to obtain the multi-dimensional distribution graph of the candidate foreign key F. The multi-dimensional distribution graph of the candidate primary key P is obtained in the same way.

(2) Partitioning of the multi-dimensional distribution graphs. To obtain the goodness of fit between the two multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P, the embodiment of the present invention partitions both graphs. Intuitively, the two multi-dimensional distribution graphs can be regarded as closely fitted only when their distributions fit closely in every partition.

(3) Frequency statistics. After the two multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P have been partitioned, the embodiment of the present invention needs to determine the number of values in F that should fall into each partition of the multi-dimensional distribution graph of P; this number is called the theoretical frequency. It also needs to count the number of values in F that actually fall into each partition of the multi-dimensional distribution graph of P; this number is called the observed frequency.

(4) Goodness-of-fit calculation. Once the theoretical frequency and the observed frequency of each partition in the multi-dimensional distribution graph of the candidate primary key P have been obtained, the embodiment of the present invention evaluates the overall deviation between the theoretical and observed frequencies over all partitions of P, and from it calculates the goodness of fit between the two multi-dimensional distribution graphs of the candidate foreign key F and the candidate primary key P. The smaller the overall deviation, the higher the goodness of fit. If the goodness of fit between the two multi-dimensional distribution graphs is greater than the set threshold, the candidate foreign key F and the candidate primary key P are judged to form a true foreign key relationship.

To judge how well two distributions fit, the embodiment of the present invention constructs their multi-dimensional distribution graphs.

Definition 1. Multi-dimensional distribution graph: given all the values of P or F with k columns, P or F can be hashed into points in a k-dimensional space. The scatter plot formed by the coordinate axes and these points is called the multi-dimensional distribution graph of P or F.

To construct the multi-dimensional distribution graph of P, the embodiment of the present invention first reorders the values of each column of P. Numeric values are sorted by value, and strings are sorted alphabetically. Once the order of the values in every column has been fixed, a hash mapping is performed according to the position of each value within its column, so that all values in P are hashed into points in the k-dimensional space.
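The construction can be illustrated with a rank-based mapping. In this sketch a plain dictionary lookup from value to sorted position stands in for the hash mapping; that substitution, the function names, and the assumption that each column holds values of a single comparable type are not taken from the patent text.

```python
def rank_positions(values):
    """Map each distinct value of one column to its 0-based position in
    sorted order (numeric order for numbers, alphabetical for strings)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def distribution_points(rows):
    """Turn a list of k-column rows into points in k-dimensional space,
    using the sorted position of each value as its coordinate.
    Assumes every column holds mutually comparable values."""
    columns = list(zip(*rows))
    ranks = [rank_positions(col) for col in columns]
    return [tuple(ranks[d][v] for d, v in enumerate(row)) for row in rows]
```

For the rows ("b", 10), ("a", 30), ("c", 20), the first column sorts as a, b, c and the second as 10, 20, 30, so the rows become the points (1, 0), (0, 2) and (2, 1).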

To assess the goodness of fit between two distributions accurately, the embodiment of the present invention divides their multi-dimensional distribution graphs into several subspaces and assesses the goodness of fit between each pair of subspaces.

Definition 2. Partition graph: given the multi-dimensional distribution graph G of P containing k attribute columns, let $F_i$ denote the set of partitions of P on the i-th column, containing $n_i$ partitions (different columns may have different numbers of partitions). $G_P = F_1 \times \dots \times F_k$ is defined as the k-dimensional partition graph of P, which consists of $n_1 \times \dots \times n_k$ k-dimensional subspaces.

The embodiment of the present invention uses $N_{sub}(G_P)$ to denote the total number of subspaces in the partition graph $G_P$:

$$N_{sub}(G_P) = \prod_{i=1}^{k} n_i$$

Definition 3. Observed frequency: given the partition graph G_P of P and the multidimensional distribution graph of F corresponding to P, the observed frequency of F for the t-th subspace of G_P is defined as the number of values of F that actually fall in the t-th subspace of G_P.

The observed frequency is obtained by matching the attribute values in the t-th subspace of F against the attribute values in the t-th subspace of P; the number of identical matched attribute values is recorded as the observed frequency.

Definition 4. Theoretical frequency: given the partition graph G_P of P and the multidimensional distribution graph of F corresponding to P, the theoretical frequency of F for the t-th subspace of G_P is defined as the number of values of F that should theoretically fall in the t-th subspace of G_P.

In the formula for the theoretical frequency, FNum_all(F) denotes the number of all values in F; PNum_t(P) denotes the number of values in the t-th subspace of G_P; and PNum_all(P) denotes the number of all distinct values in P. Because web tables are loosely formatted, the uniqueness of a primary key may fall short of 100%, so P may contain duplicate values. PNum_all(P) therefore counts the attribute values that remain after duplicates have been removed.
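The two frequencies can be illustrated with a short sketch. The source text does not preserve the theoretical-frequency formula, so the code below assumes the proportional form FNum_all(F) × PNum_t(P) / PNum_all(P), which is the natural expected count built from the three quantities named above; this is an assumption, not the patent's confirmed formula. It also assumes that a value of F is counted in subspace t when it equals a value of P located there. The function name `frequencies` and the input layout are hypothetical.

```python
from collections import Counter


def frequencies(f_values, p_subspace_of):
    """Return (observed, theoretical) frequency per subspace index t.

    p_subspace_of maps each distinct value of P to the index t of the
    subspace of G_P that holds it (hypothetical input format).
    """
    p_num_all = len(p_subspace_of)             # PNum_all(P): distinct values in P
    p_num_t = Counter(p_subspace_of.values())  # PNum_t(P) for every t
    f_num_all = len(f_values)                  # FNum_all(F): all values in F
    # Observed: values of F that match a value of P lying in subspace t.
    hits = Counter(p_subspace_of[v] for v in f_values if v in p_subspace_of)
    observed = {t: hits.get(t, 0) for t in p_num_t}
    # Theoretical: ASSUMED proportional form, not confirmed by the source.
    theoretical = {t: f_num_all * n / p_num_all for t, n in p_num_t.items()}
    return observed, theoretical


# Hypothetical data: four distinct P values spread over two subspaces.
p_subspace_of = {"a": 0, "b": 0, "c": 1, "d": 1}
observed, theoretical = frequencies(["a", "a", "b", "c", "x"], p_subspace_of)
print(observed)     # {0: 3, 1: 1}
print(theoretical)  # {0: 2.5, 1: 2.5}
```

In this toy input the value "x" has no match in P, so it contributes to FNum_all(F) but to no observed count, which is how noise in F shows up as a gap between the two frequencies.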

In each subspace of G_P, the difference between the theoretical frequency and the observed frequency of F reflects how well the local distributions fit in that subspace. The smaller the difference, the better the fit. Therefore, to assess the overall fit of the two distributions, the embodiment of the present invention defines the overall deviation as the sum of the local deviations between the local distributions.

Definition 5. Overall deviation: given the partition graph G_P of P and the multidimensional distribution graph of F corresponding to P, the overall deviation Dev(F, P) of F with respect to P is calculated by the following formula:

After the overall deviation of a candidate foreign key relationship pair (F, P) has been obtained, the embodiment of the present invention can assess how well F and P fit in that pair. To make the result more intuitive, the embodiment normalizes the deviation and reverses its monotonicity; the resulting goodness of fit is given in Definition 6.

Definition 6. Goodness of fit: given the partition graph G_P of P and the multidimensional distribution graph of F corresponding to P, the goodness of fit GOF(F, P) of F with respect to P is:

where a is a parameter that adjusts the monotonicity; the embodiment of the present invention sets a > 1.

After the goodness of fit of all candidate foreign key relationship pairs has been calculated, the embodiment of the present invention decides whether each candidate pair is a true foreign key relationship by comparing the goodness-of-fit values. The input of the evaluation process is the set of candidate foreign key relationship pairs obtained in the previous stage, and the output is the Top-k candidate pairs with the largest goodness of fit. The process first builds the multidimensional distribution graphs, then partitions them, then calculates the theoretical frequency and the observed frequency for each partition, and then calculates the overall deviation, from which the goodness of fit of the distributions is obtained. Finally, the candidate pairs are sorted by the calculated goodness of fit, and the Top-k candidate foreign key relationship pairs are stored in a file as the output.
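The final ranking step can be sketched as below, assuming the goodness-of-fit values have already been computed. The function name `top_k_pairs`, the pair labels and the scores are all hypothetical, and k here is the number of results to keep, not the number of key columns.

```python
import heapq


def top_k_pairs(gof_by_pair, k):
    """Return the k candidate pairs with the largest goodness of fit, best first."""
    # gof_by_pair maps a (foreign key, primary key) label pair to its GOF value.
    return heapq.nlargest(k, gof_by_pair.items(), key=lambda item: item[1])


# Made-up scores for three candidate pairs, for illustration only.
gof_by_pair = {
    ("orders.cust_id", "customers.id"): 0.91,
    ("orders.qty", "customers.id"): 0.12,
    ("orders.item_id", "items.id"): 0.78,
}
best = top_k_pairs(gof_by_pair, 2)
print(best)
```

A heap-based selection avoids sorting the full candidate set when k is small, although a plain sort would give the same result.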

To find the true foreign key relationships among the candidates, the embodiment of the present invention constructs multidimensional distribution graphs for all single-column and multi-column candidate foreign key relationship pairs, and then divides the multidimensional distribution graph of each P into small subspaces. The simplest way to divide a multidimensional distribution graph is to divide every dimension equally; the finer the subspaces, the more accurate the resulting GOF. However, as the number of subspaces grows, the cost of calculating the GOF increases exponentially, especially for large web tables.

Fig. 5 is a schematic diagram of multi-pass division of a multidimensional distribution graph provided by an embodiment of the present invention. In practice, when a multidimensional distribution graph is divided equally into small partitions, the points may be distributed unevenly among the subspaces, and some subspaces may even be empty. For this reason, the embodiment of the present invention proposes an optimized partitioning method, which ensures that the foreign key detection algorithm can still assess the GOF quickly when it encounters large web tables.

The embodiment of the present invention uses a multi-pass division strategy. For each k-dimensional multidimensional distribution graph, the corresponding interval in each dimension is first divided into two equal parts, which yields 2^k subspaces. In the next step, the embodiment divides only those subspaces whose number of points exceeds s, and iterates in this way until the number of points in every subspace is less than or equal to s. After m divisions, the total number of subspaces N_sub(G_P) in the partition graph G_P becomes:

N_sub(G_P)=2^k+(m-1)(2^k-1) (5)

As shown in Fig. 5, given k=2 and s=5, the embodiment of the present invention only needs to divide the multidimensional distribution graph of P twice to ensure that the number of points in every subspace is less than or equal to 5. The partition graph after division contains 7 subspaces, which means that only 7 theoretical frequencies need to be calculated. Compared with the original equal division method (16 subspaces and 16 theoretical frequencies), the multi-pass division method of the embodiment substantially reduces the cost.
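The counts in this example can be checked numerically. The sketch below evaluates formula (5) and, for comparison, the size of an equal grid. It assumes that the 16 subspaces of the original method correspond to 4 equal parts per dimension in two dimensions; both function names are hypothetical.

```python
def multipass_subspaces(k: int, m: int) -> int:
    """Formula (5): 2**k subspaces after the first division, then a net gain
    of 2**k - 1 subspaces for each of the remaining m - 1 divisions."""
    return 2**k + (m - 1) * (2**k - 1)


def uniform_subspaces(k: int, parts_per_dim: int) -> int:
    """Equal division: every dimension is cut into parts_per_dim parts."""
    return parts_per_dim**k


print(multipass_subspaces(2, 2))  # 7
print(uniform_subspaces(2, 4))    # 16
```

The net gain of 2^k - 1 per later division reflects one crowded subspace being replaced by its 2^k children, which is the case formula (5) describes.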

Fig. 6 is a flowchart of an optimized partitioning algorithm provided by an embodiment of the present invention. The input is the partition size and a k-dimensional multidimensional distribution graph, and the output is the set of partitions obtained after multi-pass division. The embodiment first divides each dimension of the original multidimensional distribution graph into two parts to obtain the initial partition graph. It then finds the subspaces in the initial partition graph whose number of points exceeds the partition size and divides each of them again. After several iterations, the embodiment obtains the final partition graph.
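A minimal sketch of such a multi-pass partitioning loop is given below. It is not the flowchart of Fig. 6 itself: the box representation, the half-open intervals, the `max_passes` guard against identical points that can never be separated, and the names `split` and `multipass_partition` are all assumptions made for illustration.

```python
from itertools import product


def split(points, lows, highs):
    """Bisect every dimension of a box, giving 2**k child boxes."""
    mids = [(lo + hi) / 2 for lo, hi in zip(lows, highs)]
    children = []
    for halves in product((0, 1), repeat=len(lows)):
        c_lo = [mid if h else lo for h, lo, mid in zip(halves, lows, mids)]
        c_hi = [hi if h else mid for h, hi, mid in zip(halves, highs, mids)]
        # Half-open intervals [lo, hi): highs must exceed every coordinate.
        inside = [p for p in points
                  if all(lo <= x < hi for x, lo, hi in zip(p, c_lo, c_hi))]
        children.append((c_lo, c_hi, inside))
    return children


def multipass_partition(points, lows, highs, s, max_passes=32):
    """Return leaf boxes (lows, highs, points), each holding at most s points."""
    leaves = split(points, lows, highs)  # first pass always gives 2**k boxes
    for _ in range(max_passes - 1):      # guard: duplicates may never separate
        crowded = [box for box in leaves if len(box[2]) > s]
        if not crowded:
            break
        leaves = [box for box in leaves if len(box[2]) <= s]
        for c_lo, c_hi, inside in crowded:
            leaves.extend(split(inside, c_lo, c_hi))
    return leaves


# Hypothetical 2-D points: six crowd the lower-left quadrant of [0, 8) x [0, 8).
pts = [(0, 0), (1, 1), (2, 2), (3, 3), (0, 3), (3, 0), (5, 5), (6, 1), (1, 6)]
leaves = multipass_partition(pts, [0, 0], [8, 8], s=5)
print(len(leaves))  # 7
```

With these made-up points only the lower-left quadrant exceeds s=5 after the first pass, so one further division produces 3 + 4 = 7 subspaces, the same count as in the k=2, s=5 example above.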

In summary, the algorithm of the embodiment of the present invention is suitable for detecting foreign key relationships of both character type and numeric type, and can detect both single-column and multi-column foreign key relationships. In addition, the foreign key detection algorithm is optimized with a multi-pass partitioning algorithm of lower time complexity, so that it extends effectively to large web tables. The algorithm of the embodiment achieves high detection accuracy together with high detection efficiency.

To cope with the data noise and data inconsistency that are characteristic of web tables, the algorithm of the embodiment of the present invention relaxes the three conditions that a candidate foreign key must satisfy (value matching, primary key uniqueness, and inclusion coverage) in order to improve the accuracy of candidate foreign key relationship detection in web tables. The algorithm detects true foreign key relationships among the candidates by judging, from how well the attribute value distributions fit, how likely a candidate foreign key relationship is to be a true one. To find foreign key relationships more accurately from the goodness of fit, the distribution graph is partitioned. Because the growing number of partitions in large web tables increases the time complexity, a partitioning algorithm based on multi-pass division is proposed to reduce the amount of calculation and improve the running efficiency, so that the algorithm can be extended to large web tables.

Those of ordinary skill in the art will appreciate that each drawing is a schematic diagram of one embodiment, and that the modules or processes in the drawings are not necessarily required to implement the present invention.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, the device and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, see the description of the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

Those of ordinary skill in the art will appreciate that the components of a device in an embodiment may be distributed in the device as described in the embodiment, or may be changed accordingly and located in one or more devices different from that of the embodiment. The components of the above embodiments may be combined into one component or further split into multiple sub-components.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims

1. A method for detecting foreign key relationships between web tables based on distribution fitting, characterized by comprising:

detecting inclusion-coverage relationships between different attribute columns across web tables, and screening out candidate foreign key relationship pairs between the web tables according to the detection results of the inclusion-coverage relationships;

constructing multidimensional distribution graphs of the candidate foreign key and the candidate primary key in each candidate foreign key relationship pair, and calculating the goodness of fit between the multidimensional distribution graphs of the candidate foreign key and the candidate primary key;

judging, according to the goodness of fit between the multidimensional distribution graphs of the candidate foreign key and the candidate primary key, whether the candidate foreign key relationship pair is a true foreign key relationship pair.

2. The method according to claim 1, characterized in that detecting inclusion-coverage relationships between different attribute columns across web tables, and screening out candidate foreign key relationship pairs between the web tables according to the detection results of the inclusion-coverage relationships, comprises:

storing the tables in the set of web tables to be detected by column into a column set, performing fuzzy matching on the character-type attribute columns in the column set, performing value matching on the numeric-type attribute columns in the column set, and finding all single-column attribute pairs in the column set according to the results of the fuzzy matching and the value matching;

detecting, from all the single-column attribute pairs, multi-column attribute pairs drawn from the same tables: for all detected single-column inclusion dependencies (INDs), searching whether a set A of n attribute columns from one table is contained in a set B of n attribute columns from another table, and if so, taking the attribute pair formed by A and B as a multi-column IND;

judging whether all single-column attribute pairs and multi-column attribute pairs satisfy a set primary key uniqueness condition, the set primary key uniqueness condition comprising that the number of duplicate values in the primary key is less than a set threshold λ, and taking the single-column attribute pairs and multi-column attribute pairs that satisfy the set primary key uniqueness condition as candidate foreign key relationship pairs, each candidate foreign key relationship pair comprising a candidate foreign key F and a candidate primary key P.

3. The method according to claim 2, characterized in that constructing multidimensional distribution graphs of the candidate foreign key and the candidate primary key in each candidate foreign key relationship pair comprises:

for each candidate foreign key relationship pair, sorting the column values of each column of the candidate foreign key F to obtain the position of each value in the column, mapping each column to one dimension of a multidimensional space, and then performing hash mapping on the positions of the column values of each column distributed in each dimension to obtain the multidimensional distribution graph of the candidate foreign key F; and sorting the column values of each column of the candidate primary key P to obtain the position of each value in the column, mapping each column to one dimension of the multidimensional space, and then performing hash mapping on the positions of the column values of each column distributed in each dimension to obtain the multidimensional distribution graph of the candidate primary key P.

4. The method according to claim 3, characterized in that calculating the goodness of fit between the multidimensional distribution graphs of the candidate foreign key and the candidate primary key comprises:

partitioning the multidimensional distribution graphs of the candidate foreign key F and the candidate primary key P; determining, according to the partitioned multidimensional distribution graphs, the number of values in the candidate foreign key F that should fall into each partition of the multidimensional distribution graph of the candidate primary key P, this number being called the theoretical frequency; counting the number of values in the candidate foreign key F that actually fall into each partition of the multidimensional distribution graph of the candidate primary key P, this number being called the observed frequency; and calculating the overall deviation between the multidimensional distribution graphs of the candidate primary key P and the candidate foreign key F according to the theoretical frequency and the observed frequency;

determining the goodness of fit between the two multidimensional distribution graphs of the candidate foreign key F and the candidate primary key P according to the overall deviation.

5. The method according to claim 4, characterized in that partitioning the multidimensional distribution graphs of the candidate foreign key F and the candidate primary key P comprises:

setting a point-count threshold s for subspaces; for each k-dimensional multidimensional distribution graph, dividing the corresponding interval in each dimension into two equal parts to obtain 2^k subspaces; further dividing each of the 2^k subspaces whose number of points exceeds the threshold s into 2^k subspaces; further dividing any resulting subspace whose number of points exceeds the threshold s; and iterating in this way until the number of points in every subspace is less than or equal to the threshold s.

6. The method according to claim 4, characterized in that determining the theoretical frequency and the observed frequency according to the partitioned multidimensional distribution graphs of the candidate foreign key F and the candidate primary key P, and calculating the overall deviation between the multidimensional distribution graphs of the candidate primary key P and the candidate foreign key F according to the theoretical frequency and the observed frequency, comprises:

given the multidimensional distribution graph G of the candidate primary key P, where P contains k columns of attribute values, letting F_i be the set of partitions of the i-th column of P; G_P=F₁×...×F_k is defined as the k-dimensional partition graph of P, consisting of n₁×...×n_k k-dimensional subspaces, and N_sub(G_P) denotes the total number of subspaces in the partition graph G_P;

the observed frequency of the candidate foreign key F for the t-th subspace of G_P is defined as the number of values of F that actually fall in the t-th subspace of G_P; it is obtained by matching the attribute values in the t-th subspace of the candidate foreign key F against the attribute values in the t-th subspace of P, the number of identical matched attribute values being recorded as the observed frequency;

the theoretical frequency of the candidate foreign key F for the t-th subspace of G_P is defined as the number of values of F that should theoretically fall in the t-th subspace of G_P;

in the formula for the theoretical frequency, FNum_all(F) denotes the number of all values in the candidate foreign key F, PNum_t(P) denotes the number of values in the t-th subspace of G_P, and PNum_all(P) denotes the number of all distinct values in the candidate primary key P;

the overall deviation Dev(F, P) between the multidimensional distribution graphs of the candidate primary key P and the candidate foreign key F is calculated by the following formula:

7. The method according to claim 6, characterized in that determining the goodness of fit between the two multidimensional distribution graphs of the candidate foreign key F and the candidate primary key P according to the overall deviation comprises:

the goodness of fit GOF(F, P) between the two multidimensional distribution graphs of the candidate foreign key F and the candidate primary key P is calculated by the following formula:

where a is a parameter that adjusts the monotonicity, and a > 1.

8. The method according to any one of claims 1 to 7, characterized in that judging, according to the goodness of fit between the multidimensional distribution graphs of the candidate foreign key and the candidate primary key, whether the candidate foreign key relationship pair is a true foreign key relationship pair comprises:

if the goodness of fit between the two multidimensional distribution graphs of the candidate foreign key F and the candidate primary key P is greater than a set threshold, judging that the candidate foreign key F and the candidate primary key P are a true foreign key relationship pair.