CN106202486A

CN106202486A - MIC-based field value preferential attachment method for heterogeneous datasets

Info

Publication number: CN106202486A
Application number: CN201610569447.XA
Authority: CN
Inventors: 肖如良; 丘志鹏; 张锐; 蔡声镇; 倪友聪; 杜欣
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2016-07-19
Filing date: 2016-07-19
Publication date: 2016-12-07
Anticipated expiration: 2036-07-19
Also published as: CN106202486B

Abstract

The invention relates to a MIC-based field value preferential attachment method for heterogeneous datasets, comprising the following steps: fit the parameters of the SE (stretched exponential) distribution of the heterogeneous dataset; compute the MIC coefficient between fields A and B; generate the sets S_tA and S_tB of occurrence counts of all values in fields A and B; build the cumulative distribution functions P^A(x) and P^B(y) corresponding to S_tA and S_tB; check whether the number of records still to be generated, l, is 0, and if so go to the final step, otherwise go to the next step; compute the corresponding field value A_x in field A from P^A(x); compute the corresponding field value B_y in field B from the field value preferential attachment model; save {A_x, B_y} as one record; update l = l - 1 and return to step 5; finish all connections of the heterogeneous data. The method helps simulate heterogeneous datasets realistically, so that the connected dataset keeps reasonable coordination between fields and similarity between nodes.

Description

MIC-based field value preferential attachment method for heterogeneous datasets

Technical field

The invention relates to the technical field of connecting field values in heterogeneous data, and in particular to a MIC-based field value preferential attachment method for heterogeneous datasets.

Background technology

Sound analysis of the field contents of heterogeneous datasets helps in building and testing systems in the related domains. However, heterogeneous datasets typically reach TB or even PB scale and consume considerable network resources, and fields such as user behaviour and item attributes involve private information, so enterprises, governments and similar organisations are seldom willing to share their data with researchers. As the Internet keeps growing, heavy-tailed phenomena in heterogeneous data are increasingly common and the connection relationships between fields grow more complex, which makes it very difficult to generate heterogeneous datasets that have the characteristics of real data. A field value connection algorithm that can reproduce the real connection relationships between field values has therefore become the basis for obtaining heterogeneous data in many research efforts.

Existing research on data connection algorithms falls into two areas: correlation between temporal fields and correlation between non-temporal fields. The former is mainly applied to network traffic prediction, time series analysis and the like; it is relatively mature, and corresponding commercial and research software is available to researchers. The latter consists mainly of mathematical modelling of field value distributions and research on connections between fields; it is mostly applied in specific research projects, has to generate realistic data for different business scenarios, and is highly complex. A representative work is the ProWGen heterogeneous data simulator proposed by Busari of the University of Saskatchewan, Canada. By analysing the distribution of field values, it characterises the heavy-tailed nature of a field with a Zipf-like distribution and fits the data; its multi-parameter mechanism makes the simulator extensible, and it can be applied to stress testing of Web servers and to cache performance studies. Its shortcoming is that ProWGen connects fields only through simple positive or negative correlation, which makes it difficult to simulate realistically the complex and varied heterogeneous data found in practice. With the explosive growth of Internet data, the Zipf-like distribution is less and less suitable for describing heavy-tailed heterogeneous data distributions. When data are generated according to a Zipf-like distribution, the measured performance of the system that consumes the generated data is overestimated and deviates considerably from results on real data, which means the generated data are unreliable.

Summary of the invention

The object of the invention is to provide a MIC-based field value preferential attachment method for heterogeneous datasets. The method helps simulate heterogeneous datasets realistically, so that the connected dataset keeps reasonable coordination between fields and similarity between nodes.

To achieve the above object, the technical solution of the invention is a MIC-based field value preferential attachment method for heterogeneous datasets, characterised in that it addresses the following problem: given two heterogeneous datasets U and V, where U contains field A and V contains field B, connect field A with field B to generate a dataset of l records. The set of all values in field A is S_A = {A₁, A₂, A₃, …, A_m}, the set of all values in field B is S_B = {B₁, B₂, B₃, …, B_n}, and every record has the form {A_x, B_y}, 1 ≤ x ≤ m, 1 ≤ y ≤ n, where m and n are the numbers of distinct values in fields A and B respectively. The method comprises the following steps:

Step 1: Fit the stretched exponential (SE) distribution of the heterogeneous dataset, i.e. the SE parameters a, b, c and x₀, where c is the stretch parameter, x₀ is the scale parameter, a is the slope of the approximately straight SE fit line and b is its intercept;

Step 2: Compute the MIC coefficient e_MIC between field A and field B;

Step 3: Generate the set S_tA of occurrence counts of all values in field A and the set S_tB of occurrence counts of all values in field B, each obeying the SE distribution;

Step 4: Build the cumulative distribution functions P^A(x) and P^B(y) corresponding to the sets S_tA and S_tB;

Step 5: Check whether the number of records still to be generated, l, is 0; if so, go to step 10, otherwise go to step 6;

Step 6: Generate a random number ξ_A and compute the corresponding field value A_x in field A from the cumulative distribution function P^A(x);

Step 7: Compute the corresponding field value B_y in field B from the field value preferential attachment model;

Step 8: Save {A_x, B_y} as one record and append it to file D;

Step 9: Update l = l - 1 and return to step 5;

Step 10: Output file D; all connections of the heterogeneous data are complete.

Further, in step 1, the parameters a, b, c and x₀ of the SE distribution of the heterogeneous dataset are fitted as follows:

The occurrence counts of the field values obey the SE distribution, and the following formula describes the distribution curve of the occurrence counts:

y_{i}^{c} = - a \log i + b

Here, the occurrence counts of all values in a field obeying the SE distribution are sorted in descending order; i denotes the rank of an occurrence count, 1 ≤ i ≤ N, where N is the number of distinct values in the field; y_i denotes the occurrence count at rank i, and y_i^c denotes y_i raised to the power c;

An empirical constant in the range (0, 1) is taken for the stretch parameter c; the values of a and b are then fitted by least squares, the scale parameter x₀ is obtained from a = x₀^c, and the result is substituted into the following formula:

p (x) = e^{- {(\frac{x}{x_{0}})}^{c}}

This yields the cumulative probability distribution function of the SE distribution and completes the SE modelling.
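The fitting procedure of step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `fit_se` is hypothetical, the stretch parameter `c` is supplied by the caller as the empirical constant, and ordinary least squares on (log i, y_i^c) is assumed to be the intended least squares fit.

```python
import math

def fit_se(counts, c):
    """Fit y_i^c = -a*log(i) + b by ordinary least squares.

    counts: occurrence counts of the field values, in any order.
    c: stretch parameter, an empirical constant in (0, 1).
    Returns (a, b, x0) with x0 obtained from a = x0**c.
    """
    if not 0 < c < 1:
        raise ValueError("c must lie in (0, 1)")
    if len(counts) < 2:
        raise ValueError("at least two field values are needed")
    ranked = sorted(counts, reverse=True)            # rank 1 = most frequent
    xs = [math.log(i) for i in range(1, len(ranked) + 1)]
    ys = [y ** c for y in ranked]                    # SE coordinates on the Y axis
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                                # slope of the fitted line
    a = -slope                                       # the formula writes the slope as -a
    b = mean_y - slope * mean_x
    x0 = a ** (1.0 / c)                              # from a = x0**c
    return a, b, x0
```

Because the counts are sorted in descending order, the fitted slope is never positive, so `a` is non-negative and `x0` is well defined.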

Further, in step 2, the MIC coefficient e_MIC between field A and field B is computed with the following formula:

e_{MIC} = \max_{mn < B(l)} \frac{I^{*}(D', m, n)}{\log \min\{m, n\}}

Here m is the number of partitions of the values of field A, n is the number of partitions of the values of field B, B(l) is the grid partition fineness that bounds mn, D' is the given dataset, I^*(D', m, n) is the maximum mutual information of D' under an m×n partition, min{m, n} takes the smaller of m and n, and max{} takes the largest element of { }.

Further, in step 4, the cumulative distribution functions P^A(x) and P^B(y) corresponding to the sets S_tA and S_tB are built as follows:

For the set S_tA = {t_A1, t_A2, t_A3, …, t_Am}, where t_Am is the number of times the m-th field value A_m occurs in field A, sort the occurrence counts in S_tA in descending order to obtain the sorted set S'_tA, and then build the following cumulative distribution function P^A(x):

P^{A}(x) = \frac{\sum_{i=1}^{x} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}}

where x is the rank of an occurrence count in S'_tA, 1 ≤ x ≤ m;

The cumulative distribution function P^B(y) corresponding to S_tB is built in the same way.
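As a small illustration of step 4, the sketch below builds the rank-ordered cumulative distribution from a table of occurrence counts. The function name `rank_cdf` and the dictionary input are assumptions made for the example; they do not come from the patent.

```python
from itertools import accumulate

def rank_cdf(counts):
    """Return (order, cdf) for a dict mapping field value -> occurrence count.

    order[k] is the field value at rank k+1 (most frequent first);
    cdf[k] is P(x) for x = k+1, the share of records covered by ranks 1..x.
    """
    order = sorted(counts, key=counts.get, reverse=True)   # descending counts
    total = sum(counts.values())
    cdf = [s / total for s in accumulate(counts[v] for v in order)]
    return order, cdf
```

For counts of 5, 3 and 2 the function returns cumulative shares 0.5, 0.8 and 1.0, so the last entry of `cdf` is always 1.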

Further, in step 6, the corresponding field value A_x in field A is computed from the cumulative distribution function P^A(x) as follows: draw a random number ξ_A from the uniform distribution on (0, 1), set ξ_A = P^A(x), compute the unique rank x from the inverse of the cumulative distribution function, x = (P^A)^{-1}(ξ_A), and obtain the corresponding field value A_x from the mapping between ranks and field values.
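Because P^A(x) is a step function over integer ranks, its inverse has to be read as "the smallest rank whose cumulative probability reaches ξ_A". That reading is an assumption of this sketch, as are the helper names `rank_from_probability` and `sample_field_value`.

```python
import bisect
import random

def rank_from_probability(cdf, p):
    """Smallest 1-based rank x with cdf[x-1] >= p (discrete inverse of the CDF)."""
    idx = bisect.bisect_left(cdf, p)
    return min(idx, len(cdf) - 1) + 1     # guard against rounding at the top end

def sample_field_value(order, cdf, rng=random):
    """Draw xi uniformly and map it to a field value through its rank."""
    xi = rng.random()                     # uniform on [0, 1)
    x = rank_from_probability(cdf, xi)
    return order[x - 1], xi
```

Returning `xi` together with the value lets a caller reuse the same random number in the later connection step.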

Further, in step 7, the corresponding field value B_y in field B is computed from the field value preferential attachment model as follows: according to whether the fields to be associated are positively or negatively correlated, set up the field value preferential attachment model and compute r'_AB from it. The positive-correlation field value preferential attachment model is:

r'_{AB} = e_{MIC} \frac{\sum_{i=1}^{g} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} + (1 - e_{MIC}) \frac{h}{n}

The negative-correlation field value preferential attachment model is:

r'_{AB} = 1 - \left[ e_{MIC} \frac{\sum_{i=1}^{g} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} + (1 - e_{MIC}) \frac{h}{n} \right]

where g ∈ [1, m], h ∈ [1, n], and the parameter e_MIC ∈ [0, 1] is the MIC coefficient between field A and field B; it measures the connection relationship between the fields, and its physical meaning in the model is the proportion of the preferential attachment part. r'_AB is the connection value, i.e. the cumulative probability of the field value used in the sampling process. Set r'_AB = P^B(y), compute the unique rank y from the inverse of the cumulative distribution function, y = (P^B)^{-1}(r'_AB), and obtain the corresponding field value B_y from the mapping between ranks and field values. This gives a complete record {A_x, B_y} and completes the connection of one data record.
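The two model formulas can be evaluated directly, as in the sketch below. The function name `pcf_connection_value` and its argument layout are hypothetical; `counts_a_desc` is assumed to hold the occurrence counts of field A already sorted in descending order.

```python
def pcf_connection_value(e_mic, counts_a_desc, g, h, n, positive=True):
    """Connection value r'_AB of the field value preferential attachment model.

    e_mic: MIC coefficient in [0, 1].
    counts_a_desc: occurrence counts of field A, sorted in descending order.
    g: rank in field A (1..m); h, n: the h/n term of the independent part.
    """
    if not 0.0 <= e_mic <= 1.0:
        raise ValueError("e_mic must lie in [0, 1]")
    share_a = sum(counts_a_desc[:g]) / sum(counts_a_desc)   # preferential part
    r = e_mic * share_a + (1.0 - e_mic) * h / n              # plus independent part
    return r if positive else 1.0 - r
```

With counts 5, 3, 2, g = 2, h = 1, n = 4 and e_MIC = 0.5, the positive model gives 0.5 · 0.8 + 0.5 · 0.25 = 0.525 and the negative model gives 0.475.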

The beneficial effect of the invention is that, unlike existing methods, it extracts parameters from the characteristics of the heterogeneous data, uses the SE distribution instead of the Zipf-like distribution to characterise the heavy-tailed nature of a field, and then replaces the traditional positive/negative correlation model with a new MIC-based field value preferential attachment model to connect data field values. Data connected by this method fit a realistic distribution trend overall and also capture the heavy-tailed nature of a field accurately at the local level, so that the generated dataset keeps reasonable coordination between fields and similarity between nodes. The method can be applied to software processes driven by heterogeneous data.

Brief description of the drawings

Fig. 1 is a flowchart of an embodiment of the present invention.

Detailed description of the invention

The invention provides a MIC-based field value preferential attachment method for heterogeneous datasets, characterised in that it addresses the following problem: given two heterogeneous datasets U and V, where U contains field A and V contains field B, connect field A with field B to generate a dataset of l records. The set of all values in field A is S_A = {A₁, A₂, A₃, …, A_m}, the set of all values in field B is S_B = {B₁, B₂, B₃, …, B_n}, and every record has the form {A_x, B_y}, 1 ≤ x ≤ m, 1 ≤ y ≤ n, where m and n are the numbers of distinct values in fields A and B respectively. As shown in Fig. 1, the method comprises the following steps:

Step 1: Fit the stretched exponential distribution (SE distribution) of the heterogeneous dataset, i.e. the SE parameters a, b, c and x₀, where c is the stretch parameter, x₀ is the scale parameter, a is the slope of the approximately straight SE fit line and b is its intercept. The specific method is as follows:

y_{i}^{c} = - a \log i + b

An empirical constant in the range (0, 1) is taken for the stretch parameter c; the values of a and b are then fitted to the heterogeneous dataset by least squares, the scale parameter x₀ is obtained from a = x₀^c, and the result is substituted into the following formula:

p (x) = e^{- {(\frac{x}{x_{0}})}^{c}}

Step 2: Compute the MIC (maximal information coefficient) e_MIC between field A and field B with the following formula:

e_{MIC} = \max_{mn < B(l)} \frac{I^{*}(D', m, n)}{\log \min\{m, n\}}

Step 3: Generate the set S_tA of occurrence counts of all values in field A and the set S_tB of occurrence counts of all values in field B, each obeying the SE distribution.

Step 4: Build the cumulative distribution functions P^A(x) and P^B(y) corresponding to the sets S_tA and S_tB. The specific method is as follows:

P^{A}(x) = \frac{\sum_{i=1}^{x} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}}

Step 5: Check whether the number of records still to be generated, l, is 0; if so, go to step 10, otherwise go to step 6.

Step 6: Compute the corresponding field value A_x in field A from the cumulative distribution function P^A(x). The specific method is as follows: draw a random number ξ_A from the uniform distribution on (0, 1), set ξ_A = P^A(x), compute the unique rank x from the inverse of the cumulative distribution function, x = (P^A)^{-1}(ξ_A), and obtain the corresponding field value A_x from the mapping between ranks and field values.

Step 7: Compute the corresponding field value B_y in field B from the field value preferential attachment model. The specific method is as follows: according to whether the fields to be associated are positively or negatively correlated, set up the field value preferential attachment model and compute r'_AB from it. The positive-correlation field value preferential attachment model is:

r'_{AB} = e_{MIC} \frac{\sum_{i=1}^{g} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} + (1 - e_{MIC}) \frac{h}{n}

The negative-correlation field value preferential attachment model is:

r'_{AB} = 1 - \left[ e_{MIC} \frac{\sum_{i=1}^{g} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} + (1 - e_{MIC}) \frac{h}{n} \right]

where g ∈ [1, m], h ∈ [1, n], and the parameter e_MIC ∈ [0, 1] is the MIC coefficient between field A and field B; it measures the connection relationship between the fields, and its physical meaning in the model is the proportion of the preferential attachment part. r'_AB is the connection value, i.e. the cumulative probability of the field value used in the sampling process. Set r'_AB = P^B(y), compute the unique rank y from the inverse of the cumulative distribution function, y = (P^B)^{-1}(r'_AB), and obtain the corresponding field value B_y from the mapping between ranks and field values. This gives a complete record {A_x, B_y} and completes the connection of one data record.

Step 8: Save {A_x, B_y} as one record and append it to file D.

Step 9: Update l = l - 1 and return to step 5.

Step 10: Output file D; all connections of the heterogeneous data are complete.

Steps 1 and 2 extract the features of the heterogeneous data fields, steps 3 and 4 model the fields, and steps 6 to 8 connect one complete record, with step 7 performing the field connection itself.

The related content of the invention is described further below.
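The loop formed by steps 4 to 10 can be sketched end to end as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the results of steps 1 to 3 (the occurrence counts of both fields and the MIC coefficient) are taken as inputs, the connection value uses the simplified form e_MIC·ξ_A + (1 − e_MIC)·ξ_B derived later in the description, ξ_B is assumed to be a fresh uniform random number, and a Python list stands in for file D. All function and parameter names are hypothetical.

```python
import bisect
import random
from itertools import accumulate

def build_cdf(counts):
    """Step 4: rank values by descending count and accumulate their shares."""
    order = sorted(counts, key=counts.get, reverse=True)
    total = sum(counts.values())
    cdf = [s / total for s in accumulate(counts[v] for v in order)]
    return order, cdf

def invert(order, cdf, p):
    """Map a cumulative probability to the field value at the matching rank."""
    idx = min(bisect.bisect_left(cdf, p), len(cdf) - 1)
    return order[idx]

def connect_fields(counts_a, counts_b, e_mic, l, positive=True, seed=None):
    """Generate l records {A_x, B_y}; counts_* map field value -> count."""
    rng = random.Random(seed)
    order_a, cdf_a = build_cdf(counts_a)
    order_b, cdf_b = build_cdf(counts_b)
    records = []                                   # stands in for file D
    while l > 0:                                   # step 5
        xi_a = rng.random()                        # step 6
        a_x = invert(order_a, cdf_a, xi_a)
        xi_b = rng.random()                        # step 7: independent part
        r = e_mic * xi_a + (1.0 - e_mic) * xi_b    # preferential + independent
        if not positive:
            r = 1.0 - r                            # negative-correlation model
        b_y = invert(order_b, cdf_b, r)
        records.append((a_x, b_y))                 # step 8
        l -= 1                                     # step 9
    return records                                 # step 10
```

The sketch does not check that the counts of both fields sum to l; a caller that needs that property has to ensure it when preparing the inputs.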

1. SE distribution

The SE distribution (stretched exponential distribution) was first found by Kohlrausch in research in 1847. It is suited to describing dynamic decay phenomena in various complex systems, in fields including nature, the economy and the Internet. Zhang Xiaodong of the Ohio State University, USA, analysed user behaviour data from different heterogeneous systems and found that the Zipf-like distribution is not suitable for describing the heavy-tailed nature of heterogeneous data behaviour, whereas the SE distribution characterises it well. This shows that the SE distribution applies in cases that a power-law model cannot describe accurately.

The probability density function of the SE distribution is:

p (x) = c \frac{x^{c - 1}}{x_{0}^{c}} e^{- {(\frac{x}{x_{0}})}^{c}}

The cumulative distribution function, in its complementary form P(X ≥ x), is:

p (x) = e^{- {(\frac{x}{x_{0}})}^{c}}

where c is the stretch parameter, with range (0, 1), and x₀ is the scale parameter.

For convenience of description, we take the natural logarithm of all data on the X axis and raise all data on the Y axis to the power c; the resulting coordinate system is called the SE coordinate system. Suppose the occurrence counts of all values of some field in a heterogeneous dataset obey the SE distribution. Sort the values by occurrence count in descending order, put the rank i on the X axis and the occurrence count t_i on the Y axis, and convert X and Y into SE coordinates. The occurrence counts then fall approximately on a straight line; that is, occurrence counts that form an approximately straight line in the SE coordinate system obey the SE distribution.

The following formula describes the straight line formed by the occurrence counts:

y_{i}^{c} = - a \log i + b

where a = x₀^c and b = y₁^c. Because c is an empirical constant, a and b can be fitted by least squares, which gives x₀. Substituting x₀ into the cumulative distribution formula above yields the complete cumulative probability distribution function and completes the SE modelling.
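The relations a = x₀^c and b = y₁^c can be checked with a short derivation. It assumes, as is usual for rank-ordered data but not stated in the text, that the rank i of a value among N values approximates N times the probability of observing a count of at least y_i; the symbol N follows the definition used for the fitting formula.

```latex
\begin{aligned}
\frac{i}{N} &\approx P(X \ge y_i) = e^{-\left(y_i / x_0\right)^{c}} \\
\log i - \log N &\approx -\frac{y_i^{c}}{x_0^{c}} \\
y_i^{c} &\approx -x_0^{c}\,\log i + x_0^{c}\,\log N
\end{aligned}
```

Comparing with y_i^c = −a log i + b gives a = x₀^c and b = x₀^c log N. Setting i = 1 makes the logarithm vanish, so b = y₁^c, which agrees with the relations stated above.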

2. Measuring the connection relationship between fields

A record is formed by connecting several fields, and some relationship necessarily exists between the fields. To describe the connection relationship between two fields quantitatively and accurately, researchers have proposed measures such as the Pearson coefficient, the Spearman coefficient, kernel density estimation (KDE) and mutual information. These measures suffer from problems such as being complicated, not applying to nonlinear data, lacking generality and having low robustness, and are hard to apply to data connection algorithms. The invention therefore uses the MIC (maximal information coefficient) as the measure of the connection relationship between fields.

In 2011, Reshef first proposed the MIC, the maximal information coefficient, in Science. The coefficient is derived from mutual information, can evaluate different types of connection relationship, takes values in [0, 1], and has symmetry, good generality and equitability. If fields A and B are independent, MIC(A, B) = 0; if there is a deterministic relationship between A and B without any noise, MIC(A, B) = 1.

The computation partitions the scatter plot formed by all sample points (A, B) into grids and uses dynamic programming to compute and search for the maximum mutual information attainable under the different partitions. The maximum mutual information is then normalised, and the result is the MIC, denoted e_MIC. Let D' be the given dataset, m and n the numbers of partitions of the values of fields A and B, l the sample size of (A, B), and G a partition. The maximum mutual information over the partitions G with m × n cells is:

I^{*}(D', m, n) = \max_{G} I(D'|_{G})

The characteristic matrix obtained by normalisation is:

M(D')_{m, n} = \frac{I^{*}(D', m, n)}{\log \min\{m, n\}}

The resulting MIC value is:

e_{MIC} = \max_{mn < B(l)} \{ M(D')_{m, n} \}

where B(l) is the grid partition fineness, usually taken as l^0.6. The above steps are referred to as the MINE method.
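To make the roles of I*, the normalisation and the bound mn < B(l) concrete, the sketch below computes a naive approximation of the MIC for numeric data. It is not the MINE method: instead of searching over all grids by dynamic programming, it only tries equal-width grids, so it generally underestimates the true MIC. The names `approx_mic` and `exponent` are assumptions of the example.

```python
import math
from collections import Counter

def _bin_index(values, k):
    """Assign each numeric value to one of k equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def _mutual_information(ix, iy):
    """Mutual information (natural log) of two equally long label sequences."""
    total = len(ix)
    joint = Counter(zip(ix, iy))
    px, py = Counter(ix), Counter(iy)
    return sum((c / total) * math.log(c * total / (px[i] * py[j]))
               for (i, j), c in joint.items())

def approx_mic(a, b, exponent=0.6):
    """Largest normalised mutual information over equal-width m x n grids
    with m, n >= 2 and m * n < B(l) = l ** exponent."""
    budget = len(a) ** exponent
    best = 0.0
    m = 2
    while m * 2 < budget:
        n = 2
        while m * n < budget:
            mi = _mutual_information(_bin_index(a, m), _bin_index(b, n))
            best = max(best, mi / math.log(min(m, n)))
            n += 1
        m += 1
    return best
```

For very small samples the bound leaves no admissible grid and the function returns 0.0; a production implementation would use an established MINE library instead of this sketch.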

From the formulas above, the MIC changes with the grid partition fineness, and the estimate becomes more accurate as the sample size grows, which suits the current big-data setting. The MIC has a wide range of application, low computational complexity, high robustness and a normalised structure. The invention therefore uses the MIC as the measure of the connection relationship between fields.

3. MIC-based field value preferential attachment model

Suppose there are two heterogeneous datasets U and V, where U contains field A and V contains field B, and fields A and B are to be connected to generate a dataset of l records in total. Let the letter S denote a set. The set of values of field A is S_A = {A₁, A₂, A₃, …, A_m}, with m distinct values; the set of values of field B is S_B = {B₁, B₂, B₃, …, B_n}, with n distinct values. Every record has the form {A_x, B_y} (1 ≤ x ≤ m, 1 ≤ y ≤ n). Let the letter t denote a count. The value A_m of field A occurs t_Am times, and the occurrence counts of all values in field A form the set S_tA; the value B_n of field B occurs t_Bn times, and the occurrence counts of all values in field B form the set S_tB. The counts satisfy:

\sum_{i=1}^{m} t_{Ai} = \sum_{i=1}^{n} t_{Bi} = l

That is, the sum of the occurrence counts of all values of field A equals the sum of the occurrence counts of all values of field B, and both equal the total number of records l of the heterogeneous dataset.

Before the data are connected, the set of occurrence counts of each field is modelled: the occurrence counts in S_tA and S_tB are sorted in descending order to obtain the sorted sets S'_tA and S'_tB, and the cumulative distribution function p(x) is then built, where x is the rank of an occurrence count. Taking field A as an example, the cumulative distribution function is:

P^{A}(x) = \frac{\sum_{i=1}^{x} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} \quad (1 \leq x \leq m)

This completes the field modelling step.

A record is formed by connecting fields; once field modelling is done, the two fields have to be connected to form a complete record. The connection operation amounts to picking one element of the Cartesian product of S_A and S_B. Let ξ denote a random number uniformly distributed on (0, 1) and let r denote the connection value. To connect one record, first generate a random number ξ_A and set ξ_A = P^A(x); the inverse of the formula above, x = (P^A)^{-1}(ξ_A), gives a unique rank x, and the mapping between ranks and field values gives the field value A_x. Then, according to the connection relationship between A and B, compute r_AB from the connection model and set r_AB = P^B(y); the field value B_y follows in the same way, which gives the record {A_x, B_y}.

The connection relationship takes one of three forms: positive correlation, negative correlation and zero correlation. Positive correlation means that the dependent variable increases as the independent variable increases; negative correlation means that the dependent variable decreases as the independent variable increases; zero correlation means that changes in the dependent variable are unrelated to changes in the independent variable, i.e. the two are independent. Current data connection algorithms mainly use a positive correlation model, r_AB = ξ_A, and a negative correlation model, r_AB = 1 - ξ_A. The shortcoming of these models is that the measure of the connection relationship is simplistic and has no physical meaning, and the case of zero correlation between fields is not considered. The invention therefore proposes a field value preferential attachment model based on the MIC (maximal information coefficient), called PCF (Priority Connection of Field). Let r'_AB denote the connection value obtained from the PCF model; r'_AB is composed of a preferential attachment part and an independent part. The positive-correlation PCF model is:

r'_{AB} = e_{MIC} \frac{\sum_{i=1}^{g} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} + (1 - e_{MIC}) \frac{h}{n}

The negative-correlation PCF model is:

r'_{AB} = 1 - \left[ e_{MIC} \frac{\sum_{i=1}^{g} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} + (1 - e_{MIC}) \frac{h}{n} \right]

where g ∈ [1, m], h ∈ [1, n], and the parameter e_MIC ∈ [0, 1] is the MIC coefficient between field A and field B; it measures the connection relationship between the fields, and its physical meaning in the model is the proportion of the preferential attachment part. The term \sum_{i=1}^{g} t_{Ai} / \sum_{i=1}^{m} t_{Ai} is the cumulative distribution probability p(x) of the occurrence count of a randomly chosen field value A_g. The term h/n is the probability of randomly choosing h of the n values of field B as the field value. Setting ξ_A = \sum_{i=1}^{g} t_{Ai} / \sum_{i=1}^{m} t_{Ai} and ξ_B = h/n and substituting into the positive and negative PCF models gives, after simplification:

r'_{AB} = e_{MIC} \xi_A + (1 - e_{MIC}) \xi_B

r'_{AB} = 1 - [e_{MIC} \xi_A + (1 - e_{MIC}) \xi_B]

If a connection relationship exists between the fields, the model preferentially uses ξ_A to select the value of field B; if the fields are independent, a new random number ξ_B is generated to select the value. When e_MIC → 1, fields A and B have a linearly correlated connection relationship: each value of field A connects to the value of field B that has the same cumulative probability under its own cumulative distribution function p(x); for the positive correlation model, the PCF model reduces to r'_AB = ξ_A. When e_MIC → 0, fields A and B are independent: the values of field A have no connection relationship with the values of field B and the pairing is random; for the positive correlation model, the PCF model reduces to r'_AB = ξ_B. When e_MIC ∈ (0, 1), the preferential attachment part has proportion e_MIC and the independent part has proportion (1 - e_MIC); r'_AB is computed as the sum of the two parts according to the positive-correlation PCF formula and used as the cumulative probability of a value in field B, which gives the value of field B and completes the connection between the values of fields A and B.
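A worked example with hypothetical numbers may help for the intermediate case. Assume e_MIC = 0.8, ξ_A = 0.3 and ξ_B = 0.9; none of these values come from the source.

```latex
\begin{aligned}
\text{positive: } r'_{AB} &= 0.8 \times 0.3 + (1 - 0.8) \times 0.9 = 0.24 + 0.18 = 0.42 \\
\text{negative: } r'_{AB} &= 1 - 0.42 = 0.58
\end{aligned}
```

In the positive case the connection value 0.42 stays close to ξ_A = 0.3 because the preferential part carries 80% of the weight, and the value of field B is then the one whose cumulative probability under P^B first reaches 0.42.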

The PCF model is general and has a clear physical meaning. With the MIC as its main reference it can describe the connection relationships in data reasonably, and it suits the field value connection step for most heterogeneous data.

The above are preferred embodiments of the invention. Any change made according to the technical solution of the invention whose resulting function does not go beyond the scope of the technical solution falls within the scope of protection of the invention.

Claims

1. A MIC-based field value preferential attachment method for heterogeneous datasets, characterised in that it addresses the following problem: given two heterogeneous datasets U and V, where U contains field A and V contains field B, connect field A with field B to generate a dataset of l records, wherein the set of all values in field A is S_A = {A₁, A₂, A₃, …, A_m}, the set of all values in field B is S_B = {B₁, B₂, B₃, …, B_n}, every record has the form {A_x, B_y}, 1 ≤ x ≤ m, 1 ≤ y ≤ n, and m and n are the numbers of distinct values in fields A and B respectively; the method comprises the following steps:

Step 1: Fit the stretched exponential (SE) distribution of the heterogeneous dataset, i.e. the SE parameters a, b, c and x₀, where c is the stretch parameter, x₀ is the scale parameter, a is the slope of the approximately straight SE fit line and b is its intercept;

Step 2: Compute the MIC coefficient e_MIC between field A and field B;

Step 3: Generate the set S_tA of occurrence counts of all values in field A and the set S_tB of occurrence counts of all values in field B, each obeying the SE distribution;

Step 4: Build the cumulative distribution functions P^A(x) and P^B(y) corresponding to the sets S_tA and S_tB;

Step 5: Check whether the number of records still to be generated, l, is 0; if so, go to step 10, otherwise go to step 6;

Step 6: Generate a random number ξ_A and compute the corresponding field value A_x in field A from the cumulative distribution function P^A(x);

Step 7: Compute the corresponding field value B_y in field B from the field value preferential attachment model;

Step 8: Save {A_x, B_y} as one record and append it to file D;

Step 9: Update l = l - 1 and return to step 5;

Step 10: Output file D; all connections of the heterogeneous data are complete.

2. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 1, characterised in that, in step 1, the parameters a, b, c and x₀ of the SE distribution of the heterogeneous dataset are fitted as follows:

The occurrence counts of the field values obey the SE distribution, and the following formula describes the distribution curve of the occurrence counts:

y_{i}^{c} = - a \log i + b

Here, the occurrence counts of all values in a field obeying the SE distribution are sorted in descending order; i denotes the rank of an occurrence count, 1 ≤ i ≤ N, where N is the number of distinct values in the field; y_i denotes the occurrence count at rank i, and y_i^c denotes y_i raised to the power c;

An empirical constant in the range (0, 1) is taken for the stretch parameter c; the values of a and b are then fitted by least squares, the scale parameter x₀ is obtained from a = x₀^c, and the result is substituted into the following formula:

p (x) = e^{- {(\frac{x}{x_{0}})}^{c}}

3. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 2, characterised in that, in step 2, the MIC coefficient e_MIC between field A and field B is computed with the following formula:

e_{MIC} = \max_{mn < B(l)} \frac{I^{*}(D', m, n)}{\log \min\{m, n\}}

Here m is the number of partitions of the values of field A, n is the number of partitions of the values of field B, B(l) is the grid partition fineness that bounds mn, D' is the given dataset, I^*(D', m, n) is the maximum mutual information of D' under an m×n partition, min{m, n} takes the smaller of m and n, and max{} takes the largest element of { }.

4. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 3, characterised in that, in step 4, the cumulative distribution functions P^A(x) and P^B(y) corresponding to the sets S_tA and S_tB are built as follows:

For the set S_tA = {t_A1, t_A2, t_A3, …, t_Am}, where t_Am is the number of times the m-th field value A_m occurs in field A, sort the occurrence counts in S_tA in descending order to obtain the sorted set S'_tA, and then build the following cumulative distribution function P^A(x):

P^{A}(x) = \frac{\sum_{i=1}^{x} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}}

5. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 4, characterised in that, in step 6, the corresponding field value A_x in field A is computed from the cumulative distribution function P^A(x) as follows: draw a random number ξ_A from the uniform distribution on (0, 1), set ξ_A = P^A(x), compute the unique rank x from the inverse of the cumulative distribution function, x = (P^A)^{-1}(ξ_A), and obtain the corresponding field value A_x from the mapping between ranks and field values.

6. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 5, characterised in that, in step 7, the corresponding field value B_y in field B is computed from the field value preferential attachment model as follows: according to whether the fields to be associated are positively or negatively correlated, set up the field value preferential attachment model and compute r'_AB from it, wherein the positive-correlation field value preferential attachment model is:

r'_{AB} = e_{MIC} \frac{\sum_{i=1}^{g} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} + (1 - e_{MIC}) \frac{h}{n}

The negative-correlation field value preferential attachment model is:

r'_{AB} = 1 - \left[ e_{MIC} \frac{\sum_{i=1}^{g} t_{Ai}}{\sum_{i=1}^{m} t_{Ai}} + (1 - e_{MIC}) \frac{h}{n} \right]

where g ∈ [1, m], h ∈ [1, n], and the parameter e_MIC ∈ [0, 1] is the MIC coefficient between field A and field B; it measures the connection relationship between the fields, and its physical meaning in the model is the proportion of the preferential attachment part. r'_AB is the connection value, i.e. the cumulative probability of the field value used in the sampling process. Set r'_AB = P^B(y), compute the unique rank y from the inverse of the cumulative distribution function, y = (P^B)^{-1}(r'_AB), and obtain the corresponding field value B_y from the mapping between ranks and field values. This gives a complete record {A_x, B_y} and completes the connection of one data record.