CN106202486A - MIC-based field value preferential attachment method for heterogeneous datasets - Google Patents

MIC-based field value preferential attachment method for heterogeneous datasets

Info

Publication number
CN106202486A
Authority
CN
China
Prior art keywords
field
value
field value
mic
preferential attachment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610569447.XA
Other languages
Chinese (zh)
Other versions
CN106202486B (en
Inventor
肖如良
丘志鹏
张锐
蔡声镇
倪友聪
杜欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN201610569447.XA
Publication of CN106202486A
Application granted
Publication of CN106202486B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types

Abstract

The present invention relates to a MIC-based field value preferential attachment method for heterogeneous datasets, comprising the following steps: fitting the parameters of the SE distribution of the heterogeneous dataset; computing the MIC coefficient between fields A and B; generating the sets $S_{tA}$ and $S_{tB}$ of the occurrence counts of all values in fields A and B, respectively; building the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to $S_{tA}$ and $S_{tB}$; checking whether the total number of records $l$ is 0, and if so going to the final step, otherwise going to the next step; computing the corresponding field value $A_x$ in field A from $P_A(x)$; computing the corresponding field value $B_y$ in field B with the field value preferential attachment model; saving $\{A_x, B_y\}$ as one record; updating the total $l = l - 1$ and returning to step 5; completing all connections of the heterogeneous data. The method helps simulate heterogeneous datasets realistically, so that the connected dataset keeps reasonable harmony between fields and similarity between nodes.

Description

MIC-based field value preferential attachment method for heterogeneous datasets
Technical field
The present invention relates to the technical field of connecting field values of heterogeneous data, and in particular to a MIC-based field value preferential attachment method for heterogeneous datasets.
Background art
Sound analysis of the field contents of heterogeneous datasets helps in building and testing the systems that work on them. However, heterogeneous datasets usually reach the TB or even PB scale and are extremely costly in network resources, and field contents such as user behavior and related item attributes involve private information; enterprises, governments and similar institutions are therefore seldom willing to share their data with researchers. As the Internet keeps growing, heavy-tailed phenomena in heterogeneous data become more and more common and the connection relationships between fields become more complex, which makes it very difficult to generate heterogeneous datasets that have the characteristics of real data. A field value connection algorithm for heterogeneous datasets that can reproduce the real connection relationships between field values has therefore become the basis of the heterogeneous data sources used in much research work.
Existing research on data connection algorithms falls into two categories: research on the correlation properties of temporal fields and research on the correlation properties of non-temporal fields. The former is mainly used for network traffic prediction, time-series analysis and similar tasks; it is relatively mature, and commercial and scientific software is available to researchers. The latter focuses on the mathematical modeling of field value distributions and on connections between fields; it is mainly used in specific research projects, has to generate realistic data for different business scenarios, and is highly complex. A representative work is the proWGen heterogeneous data simulator proposed by Busari of the University of Saskatchewan, Canada. By analyzing the distribution of heterogeneous data field values, it describes the heavy-tailed property of a field with a Zipf-like distribution and uses it for data simulation; its multi-parameter mechanism gives the simulator good extensibility, so it can be applied to stress testing of Web servers and to cache performance studies. Its shortcoming is that proWGen connects fields only in a simple positive/negative correlation manner, which makes it hard to realistically simulate the complex and diverse heterogeneous data found in practice. With the explosive growth of Internet data, the Zipf-like distribution is no longer suitable for describing heterogeneous data distributions with heavy tails: when data are generated according to a Zipf-like distribution, the test performance of the system that uses the generated data is overestimated and deviates considerably from the situation with real data, which means the generated data are unreliable.
Summary of the invention
The object of the present invention is to provide a MIC-based field value preferential attachment method for heterogeneous datasets. The method helps simulate heterogeneous datasets realistically, so that the connected dataset keeps reasonable harmony between fields and similarity between nodes.
To achieve the above object, the technical solution of the present invention is a MIC-based field value preferential attachment method for heterogeneous datasets, characterized in that it addresses the problem of connecting field A and field B to generate a dataset of $l$ records, given two heterogeneous datasets U and V where U contains field A and V contains field B. The set of all values in field A is $S_A = \{A_1, A_2, A_3, \dots, A_m\}$ and the set of all values in field B is $S_B = \{B_1, B_2, B_3, \dots, B_n\}$; each record has the form $\{A_x, B_y\}$ with $1 \le x \le m$, $1 \le y \le n$, where $m$ and $n$ are the numbers of distinct values in fields A and B, respectively. The method comprises the following steps:
Step 1: Fit the stretched exponential distribution of the heterogeneous dataset, i.e. the parameters $a$, $b$, $c$, $x_0$ of the SE distribution, where $c$ is the stretching parameter, $x_0$ is the scale parameter, $a$ is the slope of the approximately straight SE fitting line, and $b$ is the intercept of that line;
Step 2: Compute the MIC coefficient $e_{MIC}$ between field A and field B;
Step 3: Generate, following the SE distribution, the set $S_{tA}$ of the occurrence counts of all values in field A and the set $S_{tB}$ of the occurrence counts of all values in field B;
Step 4: Build the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to the sets $S_{tA}$ and $S_{tB}$, respectively;
Step 5: Check whether the remaining number of records $l$ is 0; if so, go to step 10, otherwise go to step 6;
Step 6: Generate a random number $\xi_A$ and compute the corresponding field value $A_x$ in field A from the cumulative distribution function $P_A(x)$;
Step 7: Compute the corresponding field value $B_y$ in field B with the field value preferential attachment model;
Step 8: Save $\{A_x, B_y\}$ as one record and append it to file D;
Step 9: Update the remaining number of records, $l = l - 1$, and return to step 5;
Step 10: Output file D, completing all connections of the heterogeneous data.
Further, in step 1, the parameters $a$, $b$, $c$, $x_0$ of the SE distribution of the heterogeneous dataset are fitted as follows:
The occurrence counts of the field values follow the SE distribution, and the distribution curve of the occurrence counts is described by the following formula:
$$y_i^c = -a \log i + b$$
where the occurrence counts of all values in a field that follows the SE distribution are sorted in descending order; $i$ denotes the rank of an occurrence count, $1 \le i \le N$, where $N$ is the number of distinct values in the field; $y_i$ denotes the occurrence count at rank $i$, and $y_i^c$ denotes $y_i$ raised to the power $c$;
An empirical constant in the range (0, 1) is chosen for the stretching parameter $c$; the values of $a$ and $b$ are then fitted by the least squares method, the scale parameter $x_0$ is obtained from $a = x_0^c$, and the result is substituted into the following formula:
$$p(x) = e^{-(x/x_0)^c}$$
This gives the cumulative probability distribution function of the SE distribution and completes the modeling of the SE distribution.
Further, in step 2, the MIC coefficient $e_{MIC}$ between field A and field B is computed with the following formula:
$$e_{MIC} = \max_{mn < B(l)} \left\{ \frac{I^*(D', m, n)}{\log \min\{m, n\}} \right\}$$
where $m$ is the number of partitions of the values of field A, $n$ is the number of partitions of the values of field B, $B(l)$ is the grid partition fineness, $D'$ is the given dataset, $I^*(D', m, n)$ is the maximum mutual information of the given dataset $D'$ under an $m \times n$ partition, $\min\{m, n\}$ is the smaller of $m$ and $n$, and $\max\{\cdot\}$ takes the largest element in the braces.
Further, in step 4, the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to the sets $S_{tA}$ and $S_{tB}$ are built as follows:
For the set $S_{tA} = \{t_{A1}, t_{A2}, t_{A3}, \dots, t_{Am}\}$, where $t_{Am}$ is the number of occurrences of the $m$-th field value $A_m$ in field A, the occurrence counts in $S_{tA}$ are sorted in descending order to obtain the descending set $S'_{tA}$, and the following cumulative distribution function $P_A(x)$ is built:
$$P_A(x) = \sum_{i=1}^{x} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai}$$
where $x$ is the rank of a field value occurrence count in $S'_{tA}$, $1 \le x \le m$;
The cumulative distribution function $P_B(y)$ corresponding to $S_{tB}$ is built in the same way.
Further, in step 6, the corresponding field value $A_x$ in field A is computed from the cumulative distribution function $P_A(x)$ as follows: a random number $\xi_A$ is drawn from the uniform distribution on (0, 1); setting $\xi_A = P_A(x)$, the inverse function $x = P_A^{-1}(\xi_A)$ of the cumulative distribution function gives a unique rank $x$, and the mapping between ranks and field values then gives the corresponding field value $A_x$.
Further, in step 7, the corresponding field value $B_y$ in field B is computed with the field value preferential attachment model as follows: according to whether the fields to be associated are positively or negatively correlated, a field value preferential attachment model is set up and used to compute $r'_{AB}$. The positive-correlation field value preferential attachment model is:
$$r'_{AB} = e_{MIC} \sum_{i=1}^{g} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} + (1 - e_{MIC})\, h/n$$
The negative-correlation field value preferential attachment model is:
$$r'_{AB} = 1 - \left[ e_{MIC} \sum_{i=1}^{g} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} + (1 - e_{MIC})\, h/n \right]$$
where $g \in [1, m]$, $h \in [1, n]$, and the parameter $e_{MIC} \in [0, 1]$ is the MIC coefficient between field A and field B, which measures the connection relationship between the fields; its physical meaning in the model is the proportion taken by the preferential attachment part. $r'_{AB}$ denotes the connection value, which serves as the cumulative probability of a field value during sampling: setting $r'_{AB} = P_B(y)$, the inverse function $y = P_B^{-1}(r'_{AB})$ of the cumulative distribution function gives a unique rank $y$, and the mapping between ranks and field values gives the corresponding field value $B_y$. This yields a complete record $\{A_x, B_y\}$ and completes the connection of one data record.
The beneficial effect of the present invention is that, unlike existing methods, it extracts parameters from the characteristics of the heterogeneous data, uses the SE distribution instead of the Zipf-like distribution to describe the heavy-tailed property of fields, and then uses a new MIC-based field value preferential attachment model instead of the traditional positive/negative correlation model to connect data field values. Data connected by this method not only fit a realistic distribution trend as a whole but also describe the heavy-tailed property of the fields accurately in the details, so that the generated dataset keeps reasonable harmony between fields and similarity between nodes. The method can be applied to software processes driven by heterogeneous data.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the present invention.
Detailed description
The present invention provides a MIC-based field value preferential attachment method for heterogeneous datasets, characterized in that it addresses the problem of connecting field A and field B to generate a dataset of $l$ records, given two heterogeneous datasets U and V where U contains field A and V contains field B. The set of all values in field A is $S_A = \{A_1, A_2, A_3, \dots, A_m\}$ and the set of all values in field B is $S_B = \{B_1, B_2, B_3, \dots, B_n\}$; each record has the form $\{A_x, B_y\}$ with $1 \le x \le m$, $1 \le y \le n$, where $m$ and $n$ are the numbers of distinct values in fields A and B, respectively. As shown in Fig. 1, the method comprises the following steps:
Step 1: Fit the stretched exponential distribution of the heterogeneous dataset, i.e. the parameters $a$, $b$, $c$, $x_0$ of the SE distribution (Stretched Exponential Distribution), where $c$ is the stretching parameter, $x_0$ is the scale parameter, $a$ is the slope of the approximately straight SE fitting line, and $b$ is the intercept of that line. The specific method is as follows:
The occurrence counts of the field values follow the SE distribution, and the distribution curve of the occurrence counts is described by the following formula:
$$y_i^c = -a \log i + b$$
where the occurrence counts of all values in a field that follows the SE distribution are sorted in descending order; $i$ denotes the rank of an occurrence count, $1 \le i \le N$, where $N$ is the number of distinct values in the field; $y_i$ denotes the occurrence count at rank $i$, and $y_i^c$ denotes $y_i$ raised to the power $c$;
An empirical constant in the range (0, 1) is chosen for the stretching parameter $c$; the values of $a$ and $b$ are then fitted to the heterogeneous dataset by the least squares method, the scale parameter $x_0$ is obtained from $a = x_0^c$, and the result is substituted into the following formula:
$$p(x) = e^{-(x/x_0)^c}$$
This gives the cumulative probability distribution function of the SE distribution and completes the modeling of the SE distribution.
Step 2: Compute the MIC coefficient (maximal information coefficient) $e_{MIC}$ between field A and field B with the following formula:
$$e_{MIC} = \max_{mn < B(l)} \left\{ \frac{I^*(D', m, n)}{\log \min\{m, n\}} \right\}$$
where $m$ is the number of partitions of the values of field A, $n$ is the number of partitions of the values of field B, $B(l)$ is the grid partition fineness, $D'$ is the given dataset, $I^*(D', m, n)$ is the maximum mutual information of the given dataset $D'$ under an $m \times n$ partition, $\min\{m, n\}$ is the smaller of $m$ and $n$, and $\max\{\cdot\}$ takes the largest element in the braces.
Step 3: Generate, following the SE distribution, the set $S_{tA}$ of the occurrence counts of all values in field A and the set $S_{tB}$ of the occurrence counts of all values in field B.
Step 4: Build the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to the sets $S_{tA}$ and $S_{tB}$, respectively. The specific method is as follows:
For the set $S_{tA} = \{t_{A1}, t_{A2}, t_{A3}, \dots, t_{Am}\}$, where $t_{Am}$ is the number of occurrences of the $m$-th field value $A_m$ in field A, the occurrence counts in $S_{tA}$ are sorted in descending order to obtain the descending set $S'_{tA}$, and the following cumulative distribution function $P_A(x)$ is built:
$$P_A(x) = \sum_{i=1}^{x} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai}$$
where $x$ is the rank of a field value occurrence count in $S'_{tA}$, $1 \le x \le m$;
The cumulative distribution function $P_B(y)$ corresponding to $S_{tB}$ is built in the same way.
Step 5: Check whether the remaining number of records $l$ is 0; if so, go to step 10, otherwise go to step 6.
Step 6: Compute the corresponding field value $A_x$ in field A from the cumulative distribution function $P_A(x)$. The specific method is as follows: a random number $\xi_A$ is drawn from the uniform distribution on (0, 1); setting $\xi_A = P_A(x)$, the inverse function $x = P_A^{-1}(\xi_A)$ of the cumulative distribution function gives a unique rank $x$, and the mapping between ranks and field values then gives the corresponding field value $A_x$.
Step 7: Compute the corresponding field value $B_y$ in field B with the field value preferential attachment model. The specific method is as follows: according to whether the fields to be associated are positively or negatively correlated, a field value preferential attachment model is set up and used to compute $r'_{AB}$. The positive-correlation field value preferential attachment model is:
$$r'_{AB} = e_{MIC} \sum_{i=1}^{g} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} + (1 - e_{MIC})\, h/n$$
The negative-correlation field value preferential attachment model is:
$$r'_{AB} = 1 - \left[ e_{MIC} \sum_{i=1}^{g} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} + (1 - e_{MIC})\, h/n \right]$$
where $g \in [1, m]$, $h \in [1, n]$, and the parameter $e_{MIC} \in [0, 1]$ is the MIC coefficient between field A and field B, which measures the connection relationship between the fields; its physical meaning in the model is the proportion taken by the preferential attachment part. $r'_{AB}$ denotes the connection value, which serves as the cumulative probability of a field value during sampling: setting $r'_{AB} = P_B(y)$, the inverse function $y = P_B^{-1}(r'_{AB})$ of the cumulative distribution function gives a unique rank $y$, and the mapping between ranks and field values gives the corresponding field value $B_y$. This yields a complete record $\{A_x, B_y\}$ and completes the connection of one data record.
Step 8: Save $\{A_x, B_y\}$ as one record and append it to file D.
Step 9: Update the remaining number of records, $l = l - 1$, and return to step 5.
Step 10: Output file D, completing all connections of the heterogeneous data.
Here, steps 1 and 2 extract the field features of the heterogeneous data, steps 3 and 4 model the fields, and steps 6 to 8 connect one complete record, with step 7 performing the field connection.
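The loop formed by steps 4 to 10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `connect_fields`, its parameters, and the use of an in-memory list in place of file D are hypothetical; the occurrence counts and $e_{MIC}$ are taken as given inputs; the discrete inverse of each cumulative distribution function is taken as the smallest rank whose cumulative probability reaches the sampled value; and the independent part is drawn as a fresh uniform random number $\xi_B$.

```python
import numpy as np

def connect_fields(values_a, counts_a, values_b, counts_b,
                   e_mic, num_records, positive=True, seed=None):
    """Generate `num_records` records (A_x, B_y); all names are illustrative."""
    rng = np.random.default_rng(seed)

    def ranked_cdf(values, counts):
        # Step 4: sort counts in descending order and accumulate them.
        counts = np.asarray(counts, dtype=float)
        order = np.argsort(-counts, kind="stable")
        return [values[i] for i in order], np.cumsum(counts[order]) / counts.sum()

    def pick(ranked, cdf, prob):
        # Discrete inverse: first rank whose cumulative probability >= prob.
        idx = min(int(np.searchsorted(cdf, prob, side="left")), len(cdf) - 1)
        return ranked[idx]

    ranked_a, cdf_a = ranked_cdf(values_a, counts_a)
    ranked_b, cdf_b = ranked_cdf(values_b, counts_b)

    records = []                           # stands in for file D
    remaining = num_records
    while remaining > 0:                   # step 5
        xi_a = rng.random()                # step 6 (uniform on [0, 1))
        a_x = pick(ranked_a, cdf_a, xi_a)
        xi_b = rng.random()                # step 7: independent part
        r = e_mic * xi_a + (1.0 - e_mic) * xi_b
        if not positive:
            r = 1.0 - r
        b_y = pick(ranked_b, cdf_b, r)
        records.append((a_x, b_y))         # step 8
        remaining -= 1                     # step 9
    return records                         # step 10
```

With `e_mic=1.0` the value of field B is fully determined by the random number that selected the value of field A, while `e_mic=0.0` pairs the two fields at random.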
The concepts involved in the present invention are described further below.
1. SE distribution
The SE distribution (Stretched Exponential Distribution) was first found by Kohlrausch in research in 1847. It is suited to describing dynamic decay phenomena in various complex systems, in fields including nature, the economy and the Internet. Zhang Xiaodong of The Ohio State University (USA) analyzed user behavior data from different heterogeneous systems and found that the Zipf-like distribution is not suitable for describing the heavy-tailed property of heterogeneous data behavior, whereas the SE distribution describes it well, which shows that this distribution is suited to cases that a power-law model cannot describe accurately.
The probability density function of the SE distribution is:
$$p(x) = \frac{c\,x^{c-1}}{x_0^c}\, e^{-(x/x_0)^c}$$
The cumulative distribution function, written in its complementary form $P(X \ge x)$, is:
$$p(x) = e^{-(x/x_0)^c}$$
where $c$ is the stretching parameter, with range (0, 1), and $x_0$ is the scale parameter.
For convenience of description, the X-axis data are transformed by taking the natural logarithm and the Y-axis data by raising the original value to the power $c$; the resulting coordinate system is called the SE coordinate system. If the occurrence counts of all values of some field in a heterogeneous dataset follow the SE distribution, then, with the counts sorted in descending order, the rank $i$ on the X-axis and the occurrence count $t_i$ on the Y-axis, transforming the X and Y values into the SE coordinate system makes the occurrence counts lie approximately on a straight line; conversely, such a near-straight line indicates that the field value occurrence counts follow the SE distribution.
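Why the counts fall on a straight line in SE coordinates can be sketched with a short derivation. It rests on one assumption that the text does not state explicitly: for the value at rank $i$ among $N$ distinct values, the fraction of values with at least $y_i$ occurrences is approximately $i/N$.

```latex
\begin{aligned}
e^{-(y_i/x_0)^c} &\approx \frac{i}{N} \\
\left(\frac{y_i}{x_0}\right)^{c} &\approx \log N - \log i \\
y_i^{c} &\approx -x_0^{c}\,\log i + x_0^{c}\,\log N
\end{aligned}
```

Under this assumption the slope magnitude is $x_0^c$ and the intercept is $x_0^c \log N$, which equals $y_1^c$ at rank $i = 1$ because $\log 1 = 0$.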
The straight line of the field value occurrence counts is described by the following formula:
$$y_i^c = -a \log i + b$$
where $a = x_0^c$ and $b = y_1^c$. Because $c$ is an empirical constant, the values of $a$ and $b$ can be fitted by the least squares method, which gives $x_0$; substituting $x_0$ into the cumulative distribution function of the SE distribution yields the complete cumulative probability distribution function and completes the modeling of the SE distribution.
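The fitting procedure can be sketched in Python as follows. This is a minimal sketch under assumptions: the names `fit_se` and `se_ccdf` are hypothetical, NumPy's `polyfit` stands in for "the least squares method", the logarithm is taken to be the natural logarithm, and the stretching parameter $c$ is supplied by the caller as the empirical constant.

```python
import numpy as np

def fit_se(counts, c):
    """Fit y_i^c = -a*log(i) + b by least squares for a fixed c in (0, 1).

    counts: occurrence counts of the field values, in any order.
    Returns (a, b, x0) with x0 obtained from a = x0**c.
    """
    if not 0.0 < c < 1.0:
        raise ValueError("c must lie in (0, 1)")
    y = np.sort(np.asarray(counts, dtype=float))[::-1]   # descending counts
    ranks = np.arange(1, len(y) + 1)
    # Straight-line fit in SE coordinates: x-axis log(i), y-axis y_i**c.
    slope, intercept = np.polyfit(np.log(ranks), y ** c, deg=1)
    a, b = -slope, intercept
    if a <= 0:
        raise ValueError("fitted slope is not decreasing; SE model does not apply")
    x0 = a ** (1.0 / c)
    return a, b, x0

def se_ccdf(x, x0, c):
    """Evaluate e^{-(x/x0)^c}, the function the text substitutes x0 into."""
    return np.exp(-(np.asarray(x, dtype=float) / x0) ** c)
```

In practice one would try several values of `c` and keep the one whose points look most linear in SE coordinates; that selection step is left to the caller here.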
2. Measuring the connection relationship between fields
A record is formed by connecting several fields, and some relationship necessarily exists between the fields. To describe the connection relationship between two fields quantitatively and accurately, researchers have proposed measures such as the Pearson coefficient, the Spearman coefficient, kernel density estimation (KDE) and mutual information. These measures suffer from problems such as high computational complexity, inapplicability to nonlinear data, lack of generality and low robustness, which make them hard to use in data connection algorithms. For this reason the present invention uses the MIC (Maximal Information Coefficient) as the measure of the connection relationship between fields.
In 2011, Reshef first proposed the MIC coefficient, also called the maximal information coefficient, in Science. The coefficient is derived from mutual information and can evaluate different types of connection relationships; its range is [0, 1], and it has symmetry, good generality and equitability. If fields A and B are independent, then MIC(A, B) = 0; if A and B have a deterministic relationship without any noise, then MIC(A, B) = 1.
The computation partitions the scatter plot formed by all sample points of the field pair (A, B) with grids, and uses dynamic programming to compute and search for the maximum mutual information attainable under the different partitions. The maximum mutual information is then normalized, and the result is the MIC, denoted $e_{MIC}$. Let $D'$ be the given dataset, let $m$ and $n$ be the numbers of partitions of the values of fields A and B, respectively, let $l$ be the sample size of the field pair (A, B), and let $G$ denote a partition. The maximum mutual information over the partitions $G$ into $m \times n$ cells is:
$$I^*(D', m, n) = \max_{G} I(D'|_G)$$
The normalized characteristic matrix is:
$$M(D')_{m,n} = \frac{I^*(D', m, n)}{\log \min\{m, n\}}$$
The resulting MIC value is:
$$e_{MIC} = \max_{mn < B(l)} \left\{ M(D')_{m,n} \right\}$$
where $B(l)$ is the grid partition fineness, usually taken as $l^{0.6}$. The steps above are referred to as the MINE method.
The formulas above show that the MIC changes with the grid partition fineness and that the estimate becomes more accurate as the sample size grows, which suits the current big-data setting. The MIC coefficient has a wide range of application, low computational complexity, high robustness and a normalized structure. The present invention therefore uses the MIC as the reference measure of the connection relationship between fields.
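The structure of the computation (normalize the mutual information of each grid, then take the maximum over grids with $mn < B(l)$) can be illustrated with a simplified Python sketch. It is an approximation under stated assumptions, not the MINE method itself: instead of searching for the optimal partition by dynamic programming, it only tries equal-frequency partitions, so it can underestimate the true MIC. The function names are hypothetical, and ties between equal values are split arbitrarily by rank.

```python
import numpy as np

def _equal_freq_bins(values, k):
    """Assign each sample to one of k equal-frequency bins (labels 0..k-1)."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * k) // len(values)

def _mutual_information(a_bins, b_bins, m, n):
    """Mutual information (in nats) of the m x n grid defined by the bin labels."""
    joint = np.zeros((m, n))
    np.add.at(joint, (a_bins, b_bins), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

def mic_approx(a, b, exponent=0.6):
    """Approximate e_MIC using equal-frequency grids with m*n < l**exponent."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    budget = len(a) ** exponent          # B(l) = l^0.6 by default
    best = 0.0
    m = 2
    while m * 2 < budget:
        a_bins = _equal_freq_bins(a, m)
        n = 2
        while m * n < budget:
            mi = _mutual_information(a_bins, _equal_freq_bins(b, n), m, n)
            best = max(best, mi / np.log(min(m, n)))   # normalized entry M(D')_{m,n}
            n += 1
        m += 1
    return min(best, 1.0)
```

For a noise-free relationship such as a field paired with itself the sketch returns a value close to 1, and for two independent uniform samples it returns a value close to 0, matching the two limiting cases described above.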
3. MIC-based field value preferential attachment model
Suppose that, for two heterogeneous datasets U and V, where U contains field A and V contains field B, field A and field B are to be connected to generate a dataset of $l$ records in total. Let the letter S denote a set: the values of field A form the set $S_A = \{A_1, A_2, A_3, \dots, A_m\}$, with $m$ distinct values, and the values of field B form the set $S_B = \{B_1, B_2, B_3, \dots, B_n\}$, with $n$ distinct values. Each record has the form $\{A_x, B_y\}$ ($1 \le x \le m$, $1 \le y \le n$). Let the letter t denote a count: the value $A_m$ of field A occurs $t_{Am}$ times, and the occurrence counts of all values in field A form the set $S_{tA}$; the value $B_n$ of field B occurs $t_{Bn}$ times, and the occurrence counts of all values in field B form the set $S_{tB}$. The counts satisfy:
$$\sum_{i=1}^{m} t_{Ai} = \sum_{i=1}^{n} t_{Bi} = l$$
That is, the sum of the occurrence counts of all values of field A equals the sum of the occurrence counts of all values of field B, and both equal the total number of records $l$ of the heterogeneous dataset.
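One way to obtain count sets that follow the fitted SE line and also satisfy this sum constraint is sketched below. The source does not say how the counts are made to sum to $l$, so the rescaling and the largest-remainder rounding used here are assumptions for illustration, and `se_counts` is a hypothetical name. Rescaling multiplies every count by the same factor, which keeps the stretching parameter $c$ but changes the scale of the line.

```python
import numpy as np

def se_counts(num_values, total, a, b, c):
    """Integer counts t_1 >= t_2 >= ... shaped by y_i^c = -a*log(i) + b, summing to `total`."""
    ranks = np.arange(1, num_values + 1)
    line = b - a * np.log(ranks)              # y_i^c on the fitted line
    if np.any(line <= 0):
        raise ValueError("fitted line must stay positive for every rank")
    raw = line ** (1.0 / c)                   # y_i
    scaled = raw * (total / raw.sum())        # assumption: rescale to sum to `total`
    counts = np.floor(scaled).astype(int)
    # Largest-remainder rounding: give the leftover records to the
    # values with the largest fractional parts.
    leftover = int(total - counts.sum())
    order = np.argsort(-(scaled - counts), kind="stable")
    counts[order[:leftover]] += 1
    return counts
```

Calling the function once per field with the same `total` gives two count sets whose sums both equal $l$.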
For data connection, the sets of occurrence counts of all field values are modeled first: the occurrence counts in $S_{tA}$ and $S_{tB}$ are each sorted in descending order, giving the descending sets $S'_{tA}$ and $S'_{tB}$. A cumulative distribution function $p(x)$ is then built, where $x$ is the rank of a field value occurrence count. Taking field A as an example, the cumulative distribution function is:
$$P_A(x) = \sum_{i=1}^{x} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} \quad (1 \le x \le m)$$
This completes the field modeling step.
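A minimal Python sketch of this modeling step is given below. The function names are hypothetical, and because the cumulative distribution function is discrete, its inverse is taken here as the smallest rank whose cumulative probability reaches the requested value, which is one reasonable reading of the "inverse function" used later in the connection step.

```python
import numpy as np

def build_cdf(counts):
    """Return (order, cdf): value indices sorted by descending count,
    and the cumulative probability for ranks x = 1..m."""
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(-counts, kind="stable")
    cdf = np.cumsum(counts[order]) / counts.sum()
    return order, cdf

def rank_from_probability(cdf, prob):
    """Smallest 1-based rank x whose cumulative probability is >= prob."""
    x = int(np.searchsorted(cdf, prob, side="left")) + 1
    return min(x, len(cdf))
```

For counts `[5, 1, 4]` the ranked order is value 0, value 2, value 1, with cumulative probabilities 0.5, 0.9 and 1.0, so a sampled probability of 0.7 maps to rank 2.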
A record is formed by connecting fields, so once field modeling is complete the two fields have to be connected to form a complete record. The connection operation amounts to picking one element of the Cartesian product of the sets $S_A$ and $S_B$. Let the symbol $\xi$ denote a random number uniformly distributed on (0, 1) and let the letter $r$ denote the connection value. To connect one record, a random number $\xi_A$ is generated first; setting $\xi_A = P_A(x)$, the inverse function $x = P_A^{-1}(\xi_A)$ of the cumulative distribution function gives a unique rank $x$, and the mapping between ranks and field values gives the field value $A_x$. Then, according to the connection relationship between fields A and B, the connection model is used to compute $r_{AB}$; setting $r_{AB} = P_B(y)$ gives the field value $B_y$ in the same way, which yields the record $\{A_x, B_y\}$.
The connection relationship has three cases: positive correlation, negative correlation and zero correlation. Positive correlation means that when the independent variable increases, the dependent variable increases as well; negative correlation means that when the independent variable increases, the dependent variable decreases; zero correlation means that changes in the dependent variable are unrelated to changes in the independent variable, i.e. the two are independent. The connection models mainly used in current data connection algorithms are the positive correlation model $r_{AB} = \xi_A$ and the negative correlation model $r_{AB} = 1 - \xi_A$. Their shortcomings are that the measure of the connection relationship is simplistic and has no physical meaning, and that the zero-correlation case between fields is not considered. The present invention therefore proposes a MIC-based field value preferential attachment model, PCF (Priority Connection of Field based on MIC, the maximal information coefficient). Let $r'_{AB}$ denote the connection value obtained from the PCF model; $r'_{AB}$ is composed of a preferential attachment part and an independent part. The positive-correlation PCF model is:
$$r'_{AB} = e_{MIC} \sum_{i=1}^{g} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} + (1 - e_{MIC})\, h/n$$
The negative-correlation PCF model is:
$$r'_{AB} = 1 - \left[ e_{MIC} \sum_{i=1}^{g} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} + (1 - e_{MIC})\, h/n \right]$$
where $g \in [1, m]$, $h \in [1, n]$, and the parameter $e_{MIC} \in [0, 1]$ is the MIC coefficient between field A and field B, which measures the connection relationship between the fields; its physical meaning in the model is the proportion taken by the preferential attachment part. $\sum_{i=1}^{g} t_{Ai} / \sum_{i=1}^{m} t_{Ai}$ is the cumulative distribution probability $p(x)$ of the occurrence count of a random field value $A_g$, and $h/n$ is the probability of randomly choosing $h$ of the $n$ values of field B as the field value. Setting $\xi_A = \sum_{i=1}^{g} t_{Ai} / \sum_{i=1}^{m} t_{Ai}$ and $\xi_B = h/n$ and substituting into the positive-correlation and negative-correlation PCF models gives, after simplification:
$$r'_{AB} = e_{MIC}\,\xi_A + (1 - e_{MIC})\,\xi_B$$
$$r'_{AB} = 1 - \left[ e_{MIC}\,\xi_A + (1 - e_{MIC})\,\xi_B \right]$$
If a connection relationship exists between the fields, the model preferentially uses $\xi_A$ to select the connected value of field B; if the fields are independent of each other, a new random number $\xi_B$ is generated to select the value. When $e_{MIC} \to 1$, field A and field B have a linear connection relationship: each value of field A is connected to the value of field B that has the same cumulative probability under their respective cumulative distribution functions $p(x)$; taking the positive correlation model as an example, the PCF model reduces to $r'_{AB} = \xi_A$. When $e_{MIC} \to 0$, field A and field B are independent of each other: the values of field A have no connection relationship with the values of field B and the pairing is random; taking the positive correlation model as an example, the PCF model reduces to $r'_{AB} = \xi_B$. When $e_{MIC} \in (0, 1)$, the preferential attachment part has proportion $e_{MIC}$ and the independent part has proportion $(1 - e_{MIC})$; $r'_{AB}$ is computed as the sum of the two parts according to the positive-correlation PCF formula and is used as the cumulative probability of a value in field B, from which the value of field B is obtained, completing the connection of the values of field A and field B.
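The simplified PCF formulas and their two limiting cases can be checked with a few lines of Python. The function name `pcf_connection_value` and its argument names are hypothetical; the function only evaluates the simplified positive-correlation and negative-correlation formulas for given $\xi_A$, $\xi_B$ and $e_{MIC}$.

```python
def pcf_connection_value(xi_a, xi_b, e_mic, positive=True):
    """Connection value r'_AB from the simplified PCF formulas.

    xi_a: cumulative probability that selected the value of field A.
    xi_b: independent uniform random number for field B.
    e_mic: MIC coefficient between the two fields, in [0, 1].
    """
    if not 0.0 <= e_mic <= 1.0:
        raise ValueError("e_mic must lie in [0, 1]")
    mixed = e_mic * xi_a + (1.0 - e_mic) * xi_b   # preferential part + independent part
    return mixed if positive else 1.0 - mixed
```

For example, with $\xi_A = 0.2$ and $\xi_B = 0.8$ the function returns 0.2 when $e_{MIC} = 1$ and 0.8 when $e_{MIC} = 0$, which are the two limiting cases described above.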
The PCF model has a general and clear physical meaning. Using the MIC coefficient as its main reference, it can describe the connection relationships between data reasonably and is suitable for the field value connection step of most heterogeneous data.
The above is a preferred embodiment of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of the technical solution falls within the protection scope of the present invention.

Claims (6)

1. A MIC-based field value preferential attachment method for heterogeneous datasets, characterized in that, for two heterogeneous datasets U and V, where U contains field A and V contains field B, the task is to connect field A with field B to generate a dataset of l records, wherein the set of all values in field A is S_A = {A1, A2, A3, …, Am}, the set of all values in field B is S_B = {B1, B2, B3, …, Bn}, each record has the form {Ax, By}, 1 ≤ x ≤ m, 1 ≤ y ≤ n, and m and n denote the total number of distinct values in fields A and B respectively, the method comprising the following steps:
Step 1: Fit the stretched exponential (SE) distribution of the heterogeneous datasets, i.e. the parameters a, b, c, x0 of the SE distribution, where c is the stretching parameter, x0 is the scale parameter, a is the slope of the approximately straight SE fitting line, and b is its intercept;
Step 2: Calculate the MIC coefficient e_MIC between field A and field B;
Step 3: Generate, following the SE distribution, the set S_tA of the occurrence counts of all values in field A and the set S_tB of the occurrence counts of all values in field B;
Step 4: Establish the cumulative distribution functions P_A(x) and P_B(y) corresponding to the sets S_tA and S_tB respectively;
Step 5: Determine whether the remaining number of records l is 0; if so, go to Step 10; otherwise go to Step 6;
Step 6: Generate a random number ξ_A and calculate the corresponding field value Ax in field A according to the cumulative distribution function P_A(x);
Step 7: Calculate the corresponding field value By in field B based on the field value preferential attachment model;
Step 8: Save {Ax, By} as one record and append it to file D;
Step 9: Update the remaining number l = l − 1 and return to Step 5;
Step 10: Output file D, completing all connections of the heterogeneous data.
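The loop in Steps 4 to 10 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`cdf_of`, `rank_for`, `connect`) are hypothetical, the occurrence counts and `e_mic` are taken as given inputs (Steps 1 to 3 are skipped), values are assumed to be listed in descending order of count, only the positive correlation model is shown, and the independent term draws h uniformly from 1..n.

```python
import bisect
import random
from itertools import accumulate


def cdf_of(counts):
    """Step 4: cumulative distribution over ranks (counts sorted descending)."""
    total = sum(counts)
    return [s / total for s in accumulate(counts)]


def rank_for(cdf, p):
    """0-based rank whose cumulative probability first reaches p."""
    return min(bisect.bisect_left(cdf, p), len(cdf) - 1)


def connect(values_a, counts_a, values_b, counts_b, e_mic, l, seed=0):
    """Generate l records (A_x, B_y); the returned list stands in for file D."""
    rng = random.Random(seed)
    cdf_a, cdf_b = cdf_of(counts_a), cdf_of(counts_b)
    n = len(values_b)
    records = []
    while l != 0:                                      # Step 5
        x = rank_for(cdf_a, rng.random())              # Step 6
        h = rng.randint(1, n)                          # assumption: uniform h
        r = e_mic * cdf_a[x] + (1.0 - e_mic) * h / n   # Step 7 (positive model)
        y = rank_for(cdf_b, r)
        records.append((values_a[x], values_b[y]))     # Step 8
        l -= 1                                         # Step 9
    return records                                     # Step 10
```

With `e_mic = 1.0` and identical count distributions on both sides, every A value is paired with the B value of the same rank, which matches the fully correlated limiting case.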
2. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 1, characterized in that, in Step 1, the parameters a, b, c, x0 of the SE distribution of the heterogeneous datasets are fitted as follows:
The occurrence counts of field values follow the SE distribution, and the distribution curve of the occurrence counts is described by the following formula:
$$ y_i^{c} = -a \log i + b $$
Wherein, for the occurrence counts of all values in a field that follows the SE distribution, the values are sorted in descending order of occurrence count; i denotes the rank of a field value's occurrence count, 1 ≤ i ≤ N, where N is the total number of distinct values in the field; y_i denotes the occurrence count at rank i, and y_i^c denotes y_i raised to the power c;
An empirical constant in the range (0,1) is taken for the stretching parameter c, and the values of a and b are then fitted by least squares; the scale parameter x0 is obtained from a = x0^c and substituted into the following formula:
$$ p(x) = e^{-\left(\frac{x}{x_0}\right)^{c}} $$
This yields the cumulative probability distribution function of the SE distribution and completes the modeling of the SE distribution.
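The fitting procedure in this claim can be sketched with an ordinary least-squares line through the points (log i, y_i^c). The sketch below is illustrative only: `fit_se` and `se_p` are hypothetical names, c is supplied by the caller as the empirical constant, and the code assumes the fitted a is positive so that x0 = a^(1/c) is defined.

```python
import math


def fit_se(counts, c):
    """Fit y_i^c = -a*log(i) + b by least squares for a fixed exponent c.

    Returns (a, b, x0), where x0 follows from a = x0**c.
    """
    ys = sorted(counts, reverse=True)            # rank 1 = most frequent value
    xs = [math.log(i) for i in range(1, len(ys) + 1)]
    zs = [y ** c for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx                            # slope of the fitted line is -a
    a = -slope
    b = mean_z - slope * mean_x
    x0 = a ** (1.0 / c)                          # assumes a > 0
    return a, b, x0


def se_p(x, x0, c):
    """p(x) = exp(-(x / x0)**c), as written in the claim."""
    return math.exp(-((x / x0) ** c))
```

In practice c could be chosen by trying several constants in (0,1) and keeping the one with the smallest residual; that selection step is not specified in the claim and is left out here.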
3. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 2, characterized in that, in Step 2, the MIC coefficient e_MIC between field A and field B is calculated using the following formula:
$$ e_{MIC} = \max_{mn < B(l)} \left\{ \frac{I^{*}(D', m, n)}{\log \min\{m, n\}} \right\} $$
Wherein, m denotes the number of partitions of the values of field A, n denotes the number of partitions of the values of field B, B(l) denotes the number of grid partitions, D' denotes the given dataset, I*(D', m, n) denotes the maximum mutual information of the given dataset D' under an m×n partition, min{m, n} denotes the smaller of m and n, and max{} denotes the maximum of the elements in {}.
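To make the formula concrete, the following sketch computes a rough approximation of the MIC. It is not the exact MIC algorithm: instead of searching all m×n grids for the maximum mutual information I*, it uses a single equal-frequency grid per (m, n), so it can only underestimate the true value. The bound B(l) = l^0.6 is an assumption (a commonly used default), and all function names are hypothetical.

```python
import math
from collections import Counter


def equal_freq_bins(values, k):
    """Assign each value to one of k rank-based bins of roughly equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // len(values)
    return bins


def mutual_information(bx, by):
    """Mutual information (in nats) of two discrete label sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum((cnt / n) * math.log(cnt * n / (px[i] * py[j]))
               for (i, j), cnt in joint.items())


def approx_mic(xs, ys, alpha=0.6):
    """Approximate max over grids with m*n < B(l) of I / log(min(m, n))."""
    bound = len(xs) ** alpha                     # assumed B(l) = l**alpha
    best = 0.0
    for m in range(2, int(bound) + 1):
        for n in range(2, int(bound) + 1):
            if m * n >= bound:                   # enforce m*n < B(l)
                break
            mi = mutual_information(equal_freq_bins(xs, m),
                                    equal_freq_bins(ys, n))
            best = max(best, mi / math.log(min(m, n)))
    return min(best, 1.0)                        # guard against float overshoot
```

For a perfectly monotone relationship such as y = 2x + 1 the 2×2 grid already gives a normalized score of 1, so the approximation returns 1.0 in that case.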
4. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 3, characterized in that, in Step 4, the cumulative distribution functions P_A(x) and P_B(y) corresponding to the sets S_tA and S_tB are established as follows:
For the set S_tA = {t_A1, t_A2, t_A3, …, t_Am}, where t_Am denotes the number of occurrences of the m-th field value Am in field A, the occurrence counts in S_tA are sorted in descending order to obtain the descending set S'_tA, and the following cumulative distribution function P_A(x) is then established:
$$ P_A(x) = \sum_{i=1}^{x} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} $$
Wherein, x denotes the rank of a field value's occurrence count in the set S'_tA, 1 ≤ x ≤ m;
The cumulative distribution function P_B(y) corresponding to S_tB is established in the same way.
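The construction of P_A(x) is a normalized running sum over the descending counts. A minimal sketch, assuming the counts arrive in arbitrary order and using the hypothetical name `build_cdf`:

```python
from itertools import accumulate


def build_cdf(counts):
    """Return (order, cdf) for a list of occurrence counts.

    order[k] is the original index of the value at rank k+1 (descending count);
    cdf[k] is P(k+1), the cumulative share of the k+1 most frequent values.
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    sorted_counts = [counts[i] for i in order]
    total = sum(sorted_counts)
    cdf = [s / total for s in accumulate(sorted_counts)]
    return order, cdf
```

For counts `[2, 5, 3]` the ranks are taken in the order 5, 3, 2, so the cumulative shares are 0.5, 0.8 and 1.0. Keeping `order` alongside the function preserves the mapping from rank back to the original field value, which the later steps rely on.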
5. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 4, characterized in that, in Step 6, the corresponding field value Ax in field A is calculated according to the cumulative distribution function P_A(x) as follows: a random number ξ_A is generated from the uniform distribution on (0,1); setting ξ_A = P_A(x), the unique rank x is calculated from the inverse function of the cumulative distribution function, x = P_A^{-1}(ξ_A); the corresponding field value Ax is then obtained from the mapping between ranks and field values.
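Because P_A(x) is a step function over integer ranks, its inverse can be evaluated with a binary search for the first rank whose cumulative probability reaches ξ_A. The sketch below assumes the cumulative probabilities are already available as a list ordered by rank; `inverse_cdf_rank` and `sample_field_value` are hypothetical names.

```python
import bisect
import random


def inverse_cdf_rank(cdf, xi):
    """Smallest 1-based rank x with P(x) >= xi."""
    return min(bisect.bisect_left(cdf, xi) + 1, len(cdf))


def sample_field_value(values_by_rank, cdf, rng):
    """Draw xi from the uniform distribution and map it to a field value."""
    xi = rng.random()
    return values_by_rank[inverse_cdf_rank(cdf, xi) - 1]
```

With cumulative probabilities `[0.5, 0.8, 1.0]`, a draw of 0.3 maps to rank 1, a draw of 0.79 maps to rank 2, and a draw of 0.95 maps to rank 3, so frequent values are selected in proportion to their occurrence counts.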
6. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 5, characterized in that, in Step 7, the corresponding field value By in field B is calculated based on the field value preferential attachment model as follows: according to whether the fields to be associated are positively or negatively correlated, a field value preferential attachment model is established and r'_AB is calculated from it, wherein the positive correlation field value preferential attachment model is:
$$ r'_{AB} = e_{MIC} \sum_{i=1}^{g} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} + (1 - e_{MIC})\, h / n $$
The negative correlation field value preferential attachment model is:
$$ r'_{AB} = 1 - \left[ e_{MIC} \sum_{i=1}^{g} t_{Ai} \Big/ \sum_{i=1}^{m} t_{Ai} + (1 - e_{MIC})\, h / n \right] $$
Wherein g ∈ [1, m] and h ∈ [1, n]; the parameter e_MIC ∈ [0,1] is the MIC coefficient between field A and field B, which measures the connection relationship between the fields, and its physical meaning in the model is the proportion of the preferential attachment part; r'_AB denotes the connection value, i.e. the cumulative probability of the field value in the sampling process. Setting r'_AB = P_B(y), the unique rank y is calculated from the inverse function of the cumulative distribution function, y = P_B^{-1}(r'_AB); the corresponding field value By is obtained from the mapping between ranks and field values, which yields a complete record {Ax, By} and completes the connection of one data record.
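The two models in this claim differ only by the final reflection 1 − [·]. A minimal sketch under stated assumptions: `attach_rank` is a hypothetical name, g is taken to be the rank already sampled for field A, the cumulative probabilities of both fields are given as rank-ordered lists, and h is drawn uniformly from 1..n, which the claim does not specify.

```python
import bisect
import random


def attach_rank(g, cdf_a, cdf_b, e_mic, rng, positive=True):
    """Return (y, r): the 1-based rank in field B and the connection value r'_AB.

    cdf_a[g - 1] equals sum_{i<=g} t_Ai / sum_{i<=m} t_Ai for 1-based rank g.
    """
    n = len(cdf_b)
    h = rng.randint(1, n)                        # assumption: uniform h in 1..n
    r = e_mic * cdf_a[g - 1] + (1.0 - e_mic) * h / n
    if not positive:
        r = 1.0 - r                              # negative correlation model
    y = min(bisect.bisect_left(cdf_b, r) + 1, n) # inverse of P_B, clamped to n
    return y, r
```

With `e_mic = 1.0` the random term vanishes, so a frequent A value is attached to a B value at the same cumulative probability in the positive model and to the opposite end of the ranking in the negative model.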
CN201610569447.XA 2016-07-19 2016-07-19 Field value preferential attachment method of the heterogeneous datasets based on MIC Active CN106202486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610569447.XA CN106202486B (en) 2016-07-19 2016-07-19 Field value preferential attachment method of the heterogeneous datasets based on MIC

Publications (2)

Publication Number Publication Date
CN106202486A (en) 2016-12-07
CN106202486B CN106202486B (en) 2019-07-09

Family

ID=57494394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610569447.XA Active CN106202486B (en) 2016-07-19 2016-07-19 Field value preferential attachment method of the heterogeneous datasets based on MIC

Country Status (1)

Country Link
CN (1) CN106202486B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106940731A (en) * 2017-03-30 2017-07-11 福建师范大学 A kind of data based on non-temporal Attribute Association generation method true to nature

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661510A (en) * 2009-09-29 2010-03-03 金蝶软件(中国)有限公司 Data matching method and device thereof
CN101702180A (en) * 2009-12-04 2010-05-05 金蝶软件(中国)有限公司 Method and system for searching associated field value
CN103546312A (en) * 2013-08-27 2014-01-29 中国航天科工集团第二研究院七〇六所 Massive multi-source isomerism log correlation analyzing method
CN103678665A (en) * 2013-12-24 2014-03-26 焦点科技股份有限公司 Heterogeneous large data integration method and system based on data warehouses
CN105719006A (en) * 2016-01-18 2016-06-29 合肥工业大学 Cause-and-effect structure learning method based on flow characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XI ZHAO: "《Feature Selection with Attributes Clustering by Maximal Information Coefficient》", 《PROCEDIA COMPUTER SCIENCE》 *

Also Published As

Publication number Publication date
CN106202486B (en) 2019-07-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant