CN106202486A - MIC-based field value preferential attachment method for heterogeneous datasets
- Publication number: CN106202486A (CN201610569447.XA)
- Authority: CN (China)
- Prior art keywords: field, value, field value, MIC, preferential attachment
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
Abstract
The present invention relates to a MIC-based field value preferential attachment method for heterogeneous datasets, comprising the following steps: fitting the parameters of the SE distribution of the heterogeneous dataset; calculating the MIC coefficient between fields A and B; generating the sets $S_{tA}$ and $S_{tB}$ formed by the occurrence counts of all values in fields A and B respectively; establishing the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to the sets $S_{tA}$ and $S_{tB}$; judging whether the total number of records $l$ is 0, and if so going to the final step, otherwise going to the next step; calculating the corresponding field value $A_x$ in field A according to $P_A(x)$; calculating the corresponding field value $B_y$ in field B based on the field value preferential attachment model; saving $\{A_x, B_y\}$ as one record; updating the total number $l = l - 1$ and returning to step 5; and completing all connections of the heterogeneous data. The method facilitates realistic simulation of heterogeneous datasets, so that the connected dataset maintains reasonable coordination between fields and similarity between nodes.
Description
Technical field
The present invention relates to the technical field of connecting field values of heterogeneous data, and in particular to a MIC-based field value preferential attachment method for heterogeneous datasets.
Background art
Reasonable analysis of the field contents of heterogeneous datasets contributes to the construction and testing of systems in the relevant domain. However, heterogeneous datasets usually reach the TB or even PB scale and consume enormous network resources, and field contents in the data such as user behavior and related item attributes involve private information. Enterprises, governments and similar institutions are therefore rarely willing to share their data with researchers. With the continuous expansion of the Internet, heavy-tailed phenomena in heterogeneous data have become increasingly common and the connection relationships among fields increasingly complex, which makes it very difficult to generate heterogeneous datasets that have the characteristics of real data. A field value connection algorithm for heterogeneous datasets that can simulate the real connection relationships between field values has therefore become the basis of the heterogeneous data source in much research work.
Existing research on data connection algorithms falls into two areas: the correlation of time fields and the correlation of non-time fields. The former is mainly applied to network traffic prediction, time series analysis and the like; it is relatively mature, and corresponding commercial and research software is available to researchers. The latter consists mainly of mathematical modeling of the distribution characteristics of field values and research on connections between fields; it is mainly used in specific research projects, requires realistic generation according to different business scenarios, and is highly complex. The main representative work is the ProWGen heterogeneous data simulator proposed by Busari of the University of Saskatchewan, Canada. By analyzing the distribution of heterogeneous data field values, it characterizes the heavy tail of fields with a Zipf-like distribution for data simulation, and its multi-parameter mechanism gives the simulator good extensibility, so that it can be applied to stress testing of Web servers and to cache performance studies. Its shortcomings are as follows. ProWGen realizes field connection only in a simple positive/negative correlation manner, which makes it difficult to realistically simulate the complex and diverse heterogeneous data found in practice. Moreover, with the explosive growth of Internet data, the Zipf-like distribution is no longer suitable for describing heterogeneous data distributions with heavy tails: when data are generated according to a Zipf-like distribution, the test performance of the system to which the generated data are applied is overestimated and deviates considerably from the real data, which means that the generated data are unreliable.
Summary of the invention
The object of the present invention is to provide a MIC-based field value preferential attachment method for heterogeneous datasets. The method facilitates realistic simulation of heterogeneous datasets, so that the connected dataset maintains reasonable coordination between fields and similarity between nodes.
To achieve the above object, the technical solution of the present invention is a MIC-based field value preferential attachment method for heterogeneous datasets, characterized in that it addresses the problem of connecting field A and field B of two heterogeneous datasets U and V, where U contains field A and V contains field B, to generate a dataset of $l$ records. The set of all values in field A is $S_A=\{A_1,A_2,A_3,\dots,A_m\}$, the set of all values in field B is $S_B=\{B_1,B_2,B_3,\dots,B_n\}$, and each record has the form $\{A_x,B_y\}$, $1\le x\le m$, $1\le y\le n$, where $m$ and $n$ denote the numbers of distinct values in fields A and B respectively. The method comprises the following steps:
Step 1: fit the parameters $a$, $b$, $c$, $x_0$ of the stretched exponential (SE) distribution of the heterogeneous dataset, where $c$ is the stretch parameter, $x_0$ is the scale parameter, $a$ represents the slope of the fitted SE approximate line, and $b$ represents the intercept of that line;
Step 2: calculate the MIC coefficient $e_{MIC}$ between field A and field B;
Step 3: generate the set $S_{tA}$ of the occurrence counts of all values in field A and the set $S_{tB}$ of the occurrence counts of all values in field B, both obeying the SE distribution;
Step 4: establish the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to the sets $S_{tA}$ and $S_{tB}$ respectively;
Step 5: judge whether the total number of records $l$ is 0; if so, go to step 10, otherwise go to step 6;
Step 6: generate a random number $\xi_A$ and calculate the corresponding field value $A_x$ in field A according to the cumulative distribution function $P_A(x)$;
Step 7: calculate the corresponding field value $B_y$ in field B based on the field value preferential attachment model;
Step 8: save $\{A_x,B_y\}$ as one record and append it to file D;
Step 9: update the total number $l = l - 1$ and return to step 5;
Step 10: output file D, completing all connections of the heterogeneous data.
Further, in step 1, the parameters $a$, $b$, $c$, $x_0$ of the SE distribution of the heterogeneous dataset are fitted as follows:
The occurrence counts of the field values obey the SE distribution, and the distribution curve of the occurrence counts is described by the following formula:
$$y_i^c = -a\ln i + b$$
where the occurrence counts of all values in a field obeying the SE distribution are sorted in descending order, $i$ denotes the rank of a field value's occurrence count, $1\le i\le N$, $N$ is the number of distinct values in the field, $y_i$ is the occurrence count at rank $i$, and $y_i^c$ is the $c$-th power of $y_i$;
An empirical constant in the range (0, 1) is taken for the stretch parameter $c$; the values of $a$ and $b$ are then fitted by the least squares method, the scale parameter $x_0$ is obtained from $a = x_0^c$, and it is substituted into the following formula:
$$P(X \le x) = 1 - e^{-(x/x_0)^c}$$
which gives the cumulative probability distribution function of the SE distribution and completes the modeling of the SE distribution.
Further, in step 2, the MIC coefficient $e_{MIC}$ between field A and field B is calculated by the following formula:
$$e_{MIC} = \max_{m \cdot n < B(l)} \left\{ \frac{I^*(D', m, n)}{\log \min\{m, n\}} \right\}$$
where $m$ is the number of partitions of the values of field A, $n$ is the number of partitions of the values of field B, $B(l)$ is the number of grid partitions, $D'$ is the given dataset, $I^*(D', m, n)$ is the maximum mutual information of the given dataset $D'$ under an $m \times n$ partition, $\min\{m, n\}$ takes the smaller of $m$ and $n$, and $\max\{\cdot\}$ takes the maximum of the elements in $\{\cdot\}$.
Further, in step 4, the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to the sets $S_{tA}$ and $S_{tB}$ are established as follows:
For the set $S_{tA}=\{t_{A1},t_{A2},t_{A3},\dots,t_{Am}\}$, where $t_{Am}$ is the number of occurrences of the $m$-th field value $A_m$ in field A, the occurrence counts in $S_{tA}$ are sorted in descending order to obtain the descending set $S'_{tA}$, and the following cumulative distribution function $P_A(x)$, the normalized cumulative occurrence count, is established:
$$P_A(x) = \frac{\sum_{i=1}^{x} t'_{Ai}}{\sum_{i=1}^{m} t'_{Ai}}$$
where $t'_{Ai}$ is the $i$-th element of $S'_{tA}$ and $x$ is the rank of a field value's occurrence count in $S'_{tA}$, $1\le x\le m$;
The cumulative distribution function $P_B(y)$ corresponding to $S_{tB}$ is established in the same way.
Further, in step 6, the corresponding field value $A_x$ in field A is calculated from the cumulative distribution function $P_A(x)$ as follows: a random number $\xi_A$ is generated from the uniform distribution on (0, 1); letting $\xi_A = P_A(x)$, the unique rank $x$ is calculated from the inverse function $x = P_A^{-1}(\xi_A)$ of the cumulative distribution function $P_A(x)$; the corresponding field value $A_x$ is then obtained from the mapping between ranks and field values.
Further, in step 7, the corresponding field value $B_y$ in field B is calculated based on the field value preferential attachment model as follows: according to whether the fields to be associated are positively or negatively correlated, the field value preferential attachment model is established and $r'_{AB}$ is calculated from it. The positive correlation field value preferential attachment model is:
$$r'_{AB} = e_{MIC}\,P_A(g) + (1 - e_{MIC})\,\frac{h}{n}$$
The negative correlation field value preferential attachment model is:
$$r'_{AB} = 1 - \left[e_{MIC}\,P_A(g) + (1 - e_{MIC})\,\frac{h}{n}\right]$$
where $g\in[1,m]$, $h\in[1,n]$, $P_A(g)$ is the cumulative distribution probability of the occurrence count of the field value $A_g$, and the parameter $e_{MIC}\in[0,1]$ is the MIC coefficient between field A and field B, which measures the connection relationship between the fields and whose physical meaning in the model is the proportion of the preferential attachment part. $r'_{AB}$ denotes the connection value, that is, the cumulative probability of the field value in the sampling process. Letting $r'_{AB} = P_B(y)$, the unique rank $y$ is calculated from the inverse function $y = P_B^{-1}(r'_{AB})$ of the cumulative distribution function $P_B(y)$, and the corresponding field value $B_y$ is obtained from the mapping between ranks and field values. This yields a complete record $\{A_x,B_y\}$ and completes the connection of one data record.
The beneficial effects of the invention are as follows. Unlike existing methods, the invention extracts parameters from the characteristics of the heterogeneous data, uses the SE distribution instead of the Zipf-like distribution to characterize the heavy tail of fields, and then uses a new MIC-based field value preferential attachment model instead of the traditional positive/negative correlation model to connect data field values. Data connected by this method not only fit a realistic distribution trend overall but also accurately characterize the heavy tail of fields locally, so that the generated dataset maintains reasonable coordination between fields and similarity between nodes. The method can be applied to software processes driven by heterogeneous data.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the present invention.
Detailed description of the embodiments
The present invention provides a MIC-based field value preferential attachment method for heterogeneous datasets, characterized in that it addresses the problem of connecting field A and field B of two heterogeneous datasets U and V, where U contains field A and V contains field B, to generate a dataset of $l$ records. The set of all values in field A is $S_A=\{A_1,A_2,A_3,\dots,A_m\}$, the set of all values in field B is $S_B=\{B_1,B_2,B_3,\dots,B_n\}$, and each record has the form $\{A_x,B_y\}$, $1\le x\le m$, $1\le y\le n$, where $m$ and $n$ denote the numbers of distinct values in fields A and B respectively. As shown in Fig. 1, the method comprises the following steps:
Step 1: fit the parameters $a$, $b$, $c$, $x_0$ of the stretched exponential (SE) distribution (Stretched Exponential Distribution) of the heterogeneous dataset, where $c$ is the stretch parameter, $x_0$ is the scale parameter, $a$ represents the slope of the fitted SE approximate line, and $b$ represents the intercept of that line. The specific method is as follows:
The occurrence counts of the field values obey the SE distribution, and the distribution curve of the occurrence counts is described by the following formula:
$$y_i^c = -a\ln i + b$$
where the occurrence counts of all values in a field obeying the SE distribution are sorted in descending order, $i$ denotes the rank of a field value's occurrence count, $1\le i\le N$, $N$ is the number of distinct values in the field, $y_i$ is the occurrence count at rank $i$, and $y_i^c$ is the $c$-th power of $y_i$;
An empirical constant in the range (0, 1) is taken for the stretch parameter $c$; the values of $a$ and $b$ are then fitted to the heterogeneous dataset by the least squares method, the scale parameter $x_0$ is obtained from $a = x_0^c$, and it is substituted into the following formula:
$$P(X \le x) = 1 - e^{-(x/x_0)^c}$$
which gives the cumulative probability distribution function of the SE distribution and completes the modeling of the SE distribution.
Step 2: calculate the MIC coefficient (maximal information coefficient) $e_{MIC}$ between field A and field B by the following formula:
$$e_{MIC} = \max_{m \cdot n < B(l)} \left\{ \frac{I^*(D', m, n)}{\log \min\{m, n\}} \right\}$$
where $m$ is the number of partitions of the values of field A, $n$ is the number of partitions of the values of field B, $B(l)$ is the number of grid partitions, $D'$ is the given dataset, $I^*(D', m, n)$ is the maximum mutual information of the given dataset $D'$ under an $m \times n$ partition, $\min\{m, n\}$ takes the smaller of $m$ and $n$, and $\max\{\cdot\}$ takes the maximum of the elements in $\{\cdot\}$.
Step 3: generate the set $S_{tA}$ of the occurrence counts of all values in field A and the set $S_{tB}$ of the occurrence counts of all values in field B, both obeying the SE distribution.
Step 4: establish the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to the sets $S_{tA}$ and $S_{tB}$ respectively. The specific method is as follows:
For the set $S_{tA}=\{t_{A1},t_{A2},t_{A3},\dots,t_{Am}\}$, where $t_{Am}$ is the number of occurrences of the $m$-th field value $A_m$ in field A, the occurrence counts in $S_{tA}$ are sorted in descending order to obtain the descending set $S'_{tA}$, and the following cumulative distribution function $P_A(x)$, the normalized cumulative occurrence count, is established:
$$P_A(x) = \frac{\sum_{i=1}^{x} t'_{Ai}}{\sum_{i=1}^{m} t'_{Ai}}$$
where $t'_{Ai}$ is the $i$-th element of $S'_{tA}$ and $x$ is the rank of a field value's occurrence count in $S'_{tA}$, $1\le x\le m$;
The cumulative distribution function $P_B(y)$ corresponding to $S_{tB}$ is established in the same way.
Step 5: judge whether the total number of records $l$ is 0; if so, go to step 10, otherwise go to step 6.
Step 6: calculate the corresponding field value $A_x$ in field A according to the cumulative distribution function $P_A(x)$. The specific method is as follows: a random number $\xi_A$ is generated from the uniform distribution on (0, 1); letting $\xi_A = P_A(x)$, the unique rank $x$ is calculated from the inverse function $x = P_A^{-1}(\xi_A)$ of the cumulative distribution function $P_A(x)$; the corresponding field value $A_x$ is then obtained from the mapping between ranks and field values.
Step 7: calculate the corresponding field value $B_y$ in field B based on the field value preferential attachment model. The specific method is as follows: according to whether the fields to be associated are positively or negatively correlated, the field value preferential attachment model is established and $r'_{AB}$ is calculated from it. The positive correlation field value preferential attachment model is:
$$r'_{AB} = e_{MIC}\,P_A(g) + (1 - e_{MIC})\,\frac{h}{n}$$
The negative correlation field value preferential attachment model is:
$$r'_{AB} = 1 - \left[e_{MIC}\,P_A(g) + (1 - e_{MIC})\,\frac{h}{n}\right]$$
where $g\in[1,m]$, $h\in[1,n]$, $P_A(g)$ is the cumulative distribution probability of the occurrence count of the field value $A_g$, and the parameter $e_{MIC}\in[0,1]$ is the MIC coefficient between field A and field B, which measures the connection relationship between the fields and whose physical meaning in the model is the proportion of the preferential attachment part. $r'_{AB}$ denotes the connection value, that is, the cumulative probability of the field value in the sampling process. Letting $r'_{AB} = P_B(y)$, the unique rank $y$ is calculated from the inverse function $y = P_B^{-1}(r'_{AB})$ of the cumulative distribution function $P_B(y)$, and the corresponding field value $B_y$ is obtained from the mapping between ranks and field values. This yields a complete record $\{A_x,B_y\}$ and completes the connection of one data record.
Step 8: save $\{A_x,B_y\}$ as one record and append it to file D.
Step 9: update the total number $l = l - 1$ and return to step 5.
Step 10: output file D, completing all connections of the heterogeneous data.
Here, steps 1 and 2 extract the field features of the heterogeneous data, steps 3 and 4 model the fields, and steps 6 to 8 connect one complete record, with step 7 performing the field connection.
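Steps 5 to 10 amount to a sampling loop. The following Python sketch is one possible reading of that loop, not the patented implementation. The function names, the list that stands in for file D, and the lookup-based inverse of the empirical cumulative distribution are all assumptions made for illustration. The blend $r'_{AB} = e_{MIC}\xi_A + (1-e_{MIC})\xi_B$ with $\xi_B = h/n$ follows the simplified PCF form given in the description of the model.

```python
import bisect
import random
from itertools import accumulate


def build_rank_cdf(value_counts):
    """Sort values by descending count; return (values, cumulative probabilities)."""
    ranked = sorted(value_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(count for _, count in ranked)
    cdf = [running / total for running in accumulate(count for _, count in ranked)]
    return [value for value, _ in ranked], cdf


def value_at(values, cdf, p):
    """Inverse lookup: first ranked value whose cumulative probability is >= p."""
    return values[min(bisect.bisect_left(cdf, p), len(values) - 1)]


def connect_fields(counts_a, counts_b, e_mic, num_records, positive=True, seed=None):
    """Generate num_records pairs (A_x, B_y); hypothetical helper, not from the patent."""
    rng = random.Random(seed)
    values_a, cdf_a = build_rank_cdf(counts_a)    # step 4 for field A
    values_b, cdf_b = build_rank_cdf(counts_b)    # step 4 for field B
    n = len(values_b)
    records = []                                  # stands in for file D
    remaining = num_records
    while remaining > 0:                          # step 5: stop when no records remain
        xi_a = rng.random()                       # step 6: random number for field A
        a_x = value_at(values_a, cdf_a, xi_a)
        xi_b = rng.randint(1, n) / n              # independent part, h/n
        r = e_mic * xi_a + (1.0 - e_mic) * xi_b   # step 7: PCF blend
        if not positive:
            r = 1.0 - r
        b_y = value_at(values_b, cdf_b, r)
        records.append((a_x, b_y))                # step 8
        remaining -= 1                            # step 9
    return records                                # step 10
```

With `e_mic=1.0` every value of B is drawn at the same cumulative probability as the value of A, so only rank-aligned pairs appear; with `e_mic=0.0` the two fields are sampled independently.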
The related content involved in the present invention is further described below.
1. SE distribution
The SE distribution (Stretched Exponential Distribution) was first discovered by Kohlrausch in research in 1847. It is suitable for describing dynamic decay phenomena in various complex systems, in fields including nature, economics and the Internet. Zhang Xiaodong of The Ohio State University, USA, analyzed user behavior data from different heterogeneous systems and found that the Zipf-like distribution is not suitable for describing the heavy tail of heterogeneous data behavior, whereas the SE distribution characterizes it well. This shows that the SE distribution is applicable where a power-law model cannot give an accurate description.
The probability density function of the SE distribution is:
$$p(x) = \frac{c}{x_0}\left(\frac{x}{x_0}\right)^{c-1} e^{-(x/x_0)^c}$$
The cumulative distribution function is:
$$P(X \le x) = 1 - e^{-(x/x_0)^c}$$
where $c$ is the stretch parameter, with range (0, 1), and $x_0$ is the scale parameter.
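As a quick numerical check of the two functions, the sketch below writes the SE density and cumulative distribution in the standard form $p(x) = \frac{c}{x_0}(x/x_0)^{c-1}e^{-(x/x_0)^c}$ and $P(X \le x) = 1 - e^{-(x/x_0)^c}$, and adds the closed-form quantile obtained by solving the cumulative distribution for $x$. The function names are illustrative assumptions, and the quantile function is an addition for convenience rather than something stated in the text.

```python
import math


def se_pdf(x, c, x0):
    """Stretched exponential density for x > 0, stretch c in (0, 1), scale x0 > 0."""
    return (c / x0) * (x / x0) ** (c - 1.0) * math.exp(-((x / x0) ** c))


def se_cdf(x, c, x0):
    """Stretched exponential cumulative distribution P(X <= x)."""
    return 1.0 - math.exp(-((x / x0) ** c))


def se_quantile(u, c, x0):
    """Inverse of se_cdf for u in [0, 1): the x with se_cdf(x) == u."""
    return x0 * (-math.log(1.0 - u)) ** (1.0 / c)
```

At $x = x_0$ the cumulative distribution equals $1 - e^{-1}$ whatever the value of $c$, which is why $x_0$ acts as a scale parameter.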
For convenience of description, we agree that all data on the X axis are transformed by taking the natural logarithm and all data on the Y axis by taking the $c$-th power of the original value; the coordinate system so obtained is called the SE coordinate system. Suppose the values of a field in the heterogeneous dataset are sorted in descending order of occurrence count, the rank $i$ is taken as the X axis and the occurrence count $t_i$ (written $y_i$ in the formula below) as the Y axis, and the X and Y values are then converted into the SE coordinate system. If the occurrence counts form an approximately straight line in the SE coordinate system, the occurrence counts of the field values obey the SE distribution.
The line of the occurrence counts is described by the following formula:
$$y_i^c = -a\ln i + b$$
where $a = x_0^c$ and $b = y_1^c$. Because $c$ is an empirical constant, the values of $a$ and $b$ can be fitted by the least squares method, from which $x_0$ is obtained. Substituting it into the cumulative distribution function above gives the complete cumulative probability distribution function and completes the modeling of the SE distribution.
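The fitting procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's code: it assumes the rank law $y_i^c = -a\ln i + b$ with $a = x_0^c$, takes $c$ as a given empirical constant, and uses ordinary least squares on the points $(\ln i,\ y_i^c)$. The helper names `fit_se_rank` and `se_occurrence_counts` are hypothetical. The second helper shows how occurrence counts obeying the fitted law could be regenerated, as step 3 requires.

```python
import math


def fit_se_rank(counts, c):
    """Least-squares fit of y_i^c = -a*ln(i) + b; returns (a, b, x0)."""
    ys = sorted(counts, reverse=True)                   # descending occurrence counts
    xs = [math.log(i) for i in range(1, len(ys) + 1)]   # X axis: ln(rank)
    zs = [y ** c for y in ys]                           # Y axis: count to the power c
    mean_x = sum(xs) / len(xs)
    mean_z = sum(zs) / len(zs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx                                   # negative for a descending line
    a = -slope
    b = mean_z - slope * mean_x
    x0 = a ** (1.0 / c)                                 # from a = x0^c
    return a, b, x0


def se_occurrence_counts(num_values, a, b, c):
    """Occurrence counts for ranks 1..num_values that follow the fitted line."""
    counts = []
    for rank in range(1, num_values + 1):
        level = b - a * math.log(rank)                  # value of y_i^c on the line
        if level <= 0:
            break                                       # line reached zero: no further values
        counts.append(max(1, round(level ** (1.0 / c))))
    return counts
```

The fit needs at least two distinct ranks and strictly positive counts; rounding in `se_occurrence_counts` means the regenerated counts sum only approximately to a target record total.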
2. Measure of the field connection relationship
A record is formed by connecting several fields, and some relationship necessarily exists between the fields. To describe the connection relationship between two fields accurately and quantitatively, researchers have proposed measures such as the Pearson coefficient, the Spearman coefficient, kernel density estimation (KDE) and mutual information. These measures suffer from problems such as being inapplicable to complex nonlinear data, lacking generality, and low robustness, which makes them difficult to apply to data connection algorithms. For this reason, the present invention uses the MIC (Maximal Information Coefficient) as the measure of the field connection relationship.
In 2011, Reshef first proposed the MIC, also called the maximal information coefficient, in Science. This coefficient is derived on the basis of mutual information and can assess different types of connection relationship. Its range is [0, 1], and it has symmetry, good generality and equitability. If fields A and B are independent, then MIC(A, B) = 0; if A and B have a deterministic relationship without any noise, then MIC(A, B) = 1.
The computation partitions the scatter plot formed by all sample points of the field pair (A, B) with grids, and uses dynamic programming to compute and search for the maximum mutual information attainable under the different partitions. Finally, the maximum mutual information is normalized; the result is the MIC, denoted $e_{MIC}$. Let $D'$ be the given dataset, let $m$ and $n$ be the numbers of partitions of the values of fields A and B respectively, let $l$ be the sample size of the field pair (A, B), and let $G$ denote a particular grid partition. The maximum mutual information over the grids $G$ with $m \times n$ axis partitions is:
$$I^*(D', m, n) = \max_{G} I(D'|_G)$$
The characteristic matrix obtained by normalization is:
$$M(D')_{m,n} = \frac{I^*(D', m, n)}{\log \min\{m, n\}}$$
The final MIC value is:
$$e_{MIC} = \max_{m \cdot n < B(l)} \left\{ M(D')_{m,n} \right\}$$
where $B(l)$ is the grid partition fineness, usually taken as $l^{0.6}$. The above steps are referred to as the MINE method.
From the above formula, the MIC varies with the grid partition fineness, and the larger the sample size, the more accurate the estimate, which suits the current context of big data. The MIC has wide applicability, low computational complexity, high robustness and a normalized form. The present invention therefore uses the MIC as the reference measure of the field connection relationship.
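To make the normalization and the $m \cdot n < B(l)$ search concrete, here is a deliberately simplified Python sketch. It is not the MINE method: instead of the dynamic-programming search over all grids, it evaluates only equal-frequency grids, so for data without ties it can at best match and will generally underestimate the true MIC. The function names and the choice of base-2 logarithms are assumptions for illustration.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, index in enumerate(order):
        bins[index] = position * k // len(values)
    return bins


def mutual_information(bins_x, bins_y):
    """Mutual information (in bits) of two discrete label sequences."""
    total = len(bins_x)
    joint = Counter(zip(bins_x, bins_y))
    marg_x = Counter(bins_x)
    marg_y = Counter(bins_y)
    return sum(
        (count / total) * math.log2(count * total / (marg_x[i] * marg_y[j]))
        for (i, j), count in joint.items()
    )


def approx_mic(xs, ys, exponent=0.6):
    """Max of I / log2(min(m, n)) over equal-frequency m x n grids with m*n < l**exponent."""
    budget = len(xs) ** exponent          # B(l) = l^0.6 by default
    best = 0.0
    m = 2
    while m * 2 < budget:
        n = 2
        while m * n < budget:
            info = mutual_information(equal_frequency_bins(xs, m),
                                      equal_frequency_bins(ys, n))
            best = max(best, info / math.log2(min(m, n)))
            n += 1
        m += 1
    return min(best, 1.0)                 # returns 0.0 when the sample is too small for any grid
```

For a noiseless monotone relationship the 2 x 2 grid already carries one full bit, so the sketch returns 1; for independent samples the value stays close to 0 and shrinks as the sample grows.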
3. MIC-based field value preferential attachment model
Assume that, for two heterogeneous datasets U and V, with U containing field A and V containing field B, field A and field B are to be connected to generate a dataset of $l$ records in total. Let the letter S denote a set. The set of values of field A is $S_A=\{A_1,A_2,A_3,\dots,A_m\}$, with $m$ distinct values; the set of values of field B is $S_B=\{B_1,B_2,B_3,\dots,B_n\}$, with $n$ distinct values. Each record has the form $\{A_x,B_y\}$ $(1\le x\le m,\ 1\le y\le n)$. Let the letter $t$ denote a count. The value $A_m$ of field A occurs $t_{Am}$ times, and the occurrence counts of all values in field A form the set $S_{tA}$; the value $B_n$ of field B occurs $t_{Bn}$ times, and the occurrence counts of all values in field B form the set $S_{tB}$. These satisfy:
$$\sum_{i=1}^{m} t_{Ai} = \sum_{j=1}^{n} t_{Bj} = l$$
that is, the sum of the occurrence counts of all values of field A equals the sum of the occurrence counts of all values of field B, and both equal the total number of records $l$ of the heterogeneous dataset.
For data connection, the sets of occurrence counts of all values of the fields are first modeled. The occurrence counts in $S_{tA}$ and $S_{tB}$ are sorted in descending order to obtain the descending sets $S'_{tA}$ and $S'_{tB}$. The cumulative distribution function $P(x)$ is then established, where $x$ is the rank of a field value's occurrence count. Taking field A as an example, with $t'_{Ai}$ the $i$-th element of $S'_{tA}$, the cumulative distribution function is:
$$P_A(x) = \frac{\sum_{i=1}^{x} t'_{Ai}}{\sum_{i=1}^{m} t'_{Ai}}$$
This completes the field modeling step.
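The field modeling step can be illustrated with a short Python sketch. It assumes the cumulative distribution is the normalized running sum of the occurrence counts in descending order, and it keeps the values in rank order so that a rank can later be mapped back to a field value. The name `build_rank_cdf` and the dictionary input format are assumptions for illustration.

```python
from itertools import accumulate


def build_rank_cdf(value_counts):
    """Return (values in descending-count order, cumulative probabilities by rank)."""
    ranked = sorted(value_counts.items(), key=lambda kv: kv[1], reverse=True)
    values = [value for value, _ in ranked]
    total = sum(count for _, count in ranked)
    cdf = [running / total for running in accumulate(count for _, count in ranked)]
    return values, cdf


# Example: three values with occurrence counts 1, 5 and 4.
values, cdf = build_rank_cdf({"a": 1, "b": 5, "c": 4})
# values == ["b", "c", "a"] and cdf == [0.5, 0.9, 1.0]
```

Entry `cdf[x - 1]` is $P(x)$ for rank $x$, and the last entry is always 1.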
A record is formed by connecting fields. After field modeling is complete, the two fields must be connected to form a complete record. The connection operation is the process of taking one element of the Cartesian product of the sets $S_A$ and $S_B$. Let the symbol $\xi$ denote a random number uniformly distributed on (0, 1) and let the letter $r$ denote the connection value. When connecting a record, a random number $\xi_A$ is first generated; letting $\xi_A = P_A(x)$, the unique rank $x$ is calculated from the inverse function $x = P_A^{-1}(\xi_A)$, and the field value $A_x$ is obtained from the mapping between ranks and field values. Then, according to the connection relationship between fields A and B, $r_{AB}$ is calculated by the connection model; letting $r_{AB} = P_B(y)$, the field value $B_y$ is obtained in the same way, which yields the record $\{A_x,B_y\}$.
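Because the cumulative distribution built from occurrence counts is a step function over ranks, its inverse can be realized as a search rather than a closed-form expression. The sketch below takes that approach, which is an assumption about how the inverse is evaluated: it returns the smallest rank whose cumulative probability is at least the random number. The helper names are hypothetical.

```python
import bisect
import random


def rank_for(cdf, xi):
    """Return the 1-based rank x of the first entry with cdf[x - 1] >= xi."""
    return min(bisect.bisect_left(cdf, xi), len(cdf) - 1) + 1


def draw_value(values, cdf, rng=random):
    """Draw one field value by inverse transform sampling over ranks."""
    xi = rng.random()                      # uniform random number
    return values[rank_for(cdf, xi) - 1]   # map the rank back to its field value
```

For example, with cumulative probabilities `[0.5, 0.9, 1.0]`, a random number of 0.7 maps to rank 2, so the second most frequent value is selected.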
There are three kinds of connection relationship: positive correlation, negative correlation and zero correlation. Positive correlation means that when the independent variable increases, the dependent variable increases with it; negative correlation means that when the independent variable increases, the dependent variable decreases; zero correlation means that the change of the dependent variable is unrelated to the change of the independent variable and the two are mutually independent. The connection models mainly used in current data connection algorithms are the positive correlation model $r_{AB} = \xi_A$ and the negative correlation model $r_{AB} = 1 - \xi_A$. The shortcomings of these models are that the measure of the connection relationship is simplistic and has no physical meaning, and that zero correlation between fields is not considered. The present invention therefore proposes a MIC-based field value preferential attachment model, PCF (Priority Connection of Field based on maximal information coefficient). Let $r'_{AB}$ denote the connection value obtained through the PCF model; $r'_{AB}$ is composed of a preferential attachment part and an independent part. The positive correlation PCF model is:
$$r'_{AB} = e_{MIC}\,P_A(g) + (1 - e_{MIC})\,\frac{h}{n}$$
The negative correlation PCF model is:
$$r'_{AB} = 1 - \left[e_{MIC}\,P_A(g) + (1 - e_{MIC})\,\frac{h}{n}\right]$$
where $g\in[1,m]$, $h\in[1,n]$, and the parameter $e_{MIC}\in[0,1]$ is the MIC coefficient between field A and field B, which measures the connection relationship between the fields and whose physical meaning in the model is the proportion of the preferential attachment part. $P_A(g)$ is the cumulative distribution probability of the occurrence count of a random field value $A_g$. $h/n$ is the probability of randomly selecting $h$ of the $n$ values of field B as the field value. Letting $\xi_A = P_A(g)$ and $\xi_B = h/n$ and substituting into the positive and negative correlation PCF models respectively, simplification gives:
$$r'_{AB} = e_{MIC}\,\xi_A + (1 - e_{MIC})\,\xi_B$$
$$r'_{AB} = 1 - \left[e_{MIC}\,\xi_A + (1 - e_{MIC})\,\xi_B\right]$$
If a connection relationship exists between the fields, the model preferentially uses $\xi_A$ to select the value of field B; if the fields are mutually independent, a new random number $\xi_B$ is generated to select the value. When $e_{MIC} \to 1$, field A and field B have a linearly correlated connection relationship: each value of field A is connected to the value of field B that has the same cumulative probability under the respective cumulative distribution functions. Taking the positive correlation model as an example, the PCF model becomes $r'_{AB} = \xi_A$. When $e_{MIC} \to 0$, field A and field B are mutually independent: the values of field A have no connection relationship with the values of field B, and the relationship is random. Taking the positive correlation model as an example, the PCF model becomes $r'_{AB} = \xi_B$. When $e_{MIC} \in (0, 1)$, the preferential attachment part has proportion $e_{MIC}$ and the independent part has proportion $(1 - e_{MIC})$; $r'_{AB}$ is calculated as the sum of the two parts according to the positive correlation PCF model formula and is taken as the cumulative probability of a value in field B, from which the value of field B is obtained. This completes the connection between the values of field A and field B.
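The simplified PCF formulas and their limiting cases can be checked with a small Python function. This is only an illustration of $r'_{AB} = e_{MIC}\xi_A + (1-e_{MIC})\xi_B$ and its negative correlation counterpart; the function name and the argument validation are assumptions added here.

```python
def pcf_connection_value(xi_a, xi_b, e_mic, positive=True):
    """Blend the preferential part (xi_a) and the independent part (xi_b) by e_mic."""
    if not 0.0 <= e_mic <= 1.0:
        raise ValueError("e_mic must lie in [0, 1]")
    blended = e_mic * xi_a + (1.0 - e_mic) * xi_b
    return blended if positive else 1.0 - blended


# Limiting cases described in the text:
# e_mic = 1 reproduces the classical positive model r = xi_a,
# e_mic = 0 ignores xi_a entirely and returns xi_b.
full = pcf_connection_value(0.3, 0.8, 1.0)
none = pcf_connection_value(0.3, 0.8, 0.0)
```

Because both inputs lie in [0, 1] and the weights sum to 1, the result is always a valid cumulative probability that can be passed to the inverse of $P_B(y)$.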
The PCF model is general and has a clear physical meaning. Taking the MIC coefficient as its primary reference, it can reasonably describe the connection relationships among data and is applicable to most field value connection steps for heterogeneous data.
The above are preferred embodiments of the present invention. All changes made according to the technical solution of the present invention fall within the protection scope of the present invention, provided that the resulting functional effects do not go beyond the scope of the technical solution of the present invention.
Claims (6)
1. A MIC-based field value preferential attachment method for heterogeneous datasets, characterized in that it addresses the problem of connecting field A and field B of two heterogeneous datasets U and V, where U contains field A and V contains field B, to generate a dataset of $l$ records, wherein the set of all values in field A is $S_A=\{A_1,A_2,A_3,\dots,A_m\}$, the set of all values in field B is $S_B=\{B_1,B_2,B_3,\dots,B_n\}$, each record has the form $\{A_x,B_y\}$, $1\le x\le m$, $1\le y\le n$, and $m$ and $n$ denote the numbers of distinct values in fields A and B respectively, the method comprising the following steps:
Step 1: fit the parameters $a$, $b$, $c$, $x_0$ of the stretched exponential (SE) distribution of the heterogeneous dataset, where $c$ is the stretch parameter, $x_0$ is the scale parameter, $a$ represents the slope of the fitted SE approximate line, and $b$ represents the intercept of that line;
Step 2: calculate the MIC coefficient $e_{MIC}$ between field A and field B;
Step 3: generate the set $S_{tA}$ of the occurrence counts of all values in field A and the set $S_{tB}$ of the occurrence counts of all values in field B, both obeying the SE distribution;
Step 4: establish the cumulative distribution functions $P_A(x)$ and $P_B(y)$ corresponding to the sets $S_{tA}$ and $S_{tB}$ respectively;
Step 5: judge whether the total number of records $l$ is 0; if so, go to step 10, otherwise go to step 6;
Step 6: generate a random number $\xi_A$ and calculate the corresponding field value $A_x$ in field A according to the cumulative distribution function $P_A(x)$;
Step 7: calculate the corresponding field value $B_y$ in field B based on the field value preferential attachment model;
Step 8: save $\{A_x,B_y\}$ as one record and append it to file D;
Step 9: update the total number $l = l - 1$ and return to step 5;
Step 10: output file D, completing all connections of the heterogeneous data.
2. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 1, characterized in that, in step 1, the parameters $a$, $b$, $c$, $x_0$ of the SE distribution of the heterogeneous dataset are fitted as follows:
The occurrence counts of the field values obey the SE distribution, and the distribution curve of the occurrence counts is described by the following formula:
$$y_i^c = -a\ln i + b$$
where the occurrence counts of all values in a field obeying the SE distribution are sorted in descending order, $i$ denotes the rank of a field value's occurrence count, $1\le i\le N$, $N$ is the number of distinct values in the field, $y_i$ is the occurrence count at rank $i$, and $y_i^c$ is the $c$-th power of $y_i$;
An empirical constant in the range (0, 1) is taken for the stretch parameter $c$; the values of $a$ and $b$ are then fitted by the least squares method, the scale parameter $x_0$ is obtained from $a = x_0^c$, and it is substituted into the following formula:
$$P(X \le x) = 1 - e^{-(x/x_0)^c}$$
which gives the cumulative probability distribution function of the SE distribution and completes the modeling of the SE distribution.
3. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 2, characterized in that, in step 2, the MIC coefficient e_MIC between field A and field B is calculated with the following formula:
where m denotes the number of partitions of the values of field A, n denotes the number of partitions of the values of field B, B(l) denotes the number of grid partitions, D' denotes the given dataset, I*(D', m, n) denotes the maximum mutual information of the given dataset D' under an m*n partition, min{m, n} denotes the smaller of m and n, and max{} denotes the maximum of the elements in {}.
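The formula is not reproduced in this text. The symbol definitions above match the standard definition of the maximal information coefficient, so under that assumption the expression would read:

```latex
e_{MIC} = \max_{m \cdot n < B(l)} \frac{I^{*}(D', m, n)}{\log \min\{m, n\}}
```

In the standard definition, B is a function of the sample size, commonly taken as the sample size raised to the power 0.6. Which B(l) the method uses is not stated here.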
4. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 3, characterized in that, in step 4, the cumulative distribution functions P_A(x) and P_B(y) corresponding to the sets S_tA and S_tB are established as follows:
For the set S_tA = {t_A1, t_A2, t_A3, …, t_Am}, where t_Am denotes the number of occurrences of the m-th field value A_m in field A, the occurrence counts in S_tA are sorted in descending order to obtain the sorted set S'_tA, and the following cumulative distribution function P_A(x) is then established:
where x denotes the rank of a field value occurrence count in the set S'_tA, 1 ≤ x ≤ m;
Similarly, the cumulative distribution function P_B(y) corresponding to S_tB is established.
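The expression for P_A(x) is not reproduced in this text. The sketch below assumes the plain empirical form, in which P_A(x) is the share of all occurrences accounted for by ranks 1 through x. The function name `rank_cdf` is hypothetical.

```python
from itertools import accumulate


def rank_cdf(counts):
    """Return [P(1), ..., P(m)] for occurrence counts sorted in descending order.

    ASSUMPTION: P(x) = (sum of the x largest counts) / (sum of all counts).
    """
    if not counts or sum(counts) <= 0:
        raise ValueError("counts must be non-empty with a positive total")
    ordered = sorted(counts, reverse=True)      # S'_tA: descending occurrence counts
    total = sum(ordered)
    return [partial / total for partial in accumulate(ordered)]
```

The same function applies unchanged to S_tB to obtain P_B(y).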
5. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 4, characterized in that, in step 6, the corresponding field value A_x in field A is computed from the cumulative distribution function P_A(x) as follows: a random number ξ_A is drawn from the uniform distribution on (0, 1); setting ξ_A = P_A(x), the unique rank x is computed from the analytic expression of the inverse function of P_A(x); and the corresponding field value A_x is obtained from the mapping between ranks and field values.
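The claim inverts P_A(x) analytically, but the inverse expression is not reproduced in this text. As a stand-in, the sketch below inverts a tabulated CDF by binary search, returning the smallest rank x with P_A(x) ≥ ξ_A. The names `sample_rank` and `sample_value` are hypothetical, and `cdf` is assumed to be a non-decreasing list ending at 1.0.

```python
import bisect
import random


def sample_rank(cdf, xi):
    """Return the 1-based rank x that is the smallest with cdf[x-1] >= xi."""
    idx = bisect.bisect_left(cdf, xi)       # first index whose CDF value >= xi
    return min(idx, len(cdf) - 1) + 1       # clamp against rounding at the top end


def sample_value(values_by_rank, cdf, rng=random):
    """Draw xi uniformly, then map the resulting rank to its field value."""
    xi = rng.random()
    return values_by_rank[sample_rank(cdf, xi) - 1]
```

Here `values_by_rank[0]` is the most frequent field value, matching the descending order used to build the CDF.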
6. The MIC-based field value preferential attachment method for heterogeneous datasets according to claim 5, characterized in that, in step 7, the corresponding field value B_y in field B is computed based on the field value preferential attachment model as follows: according to whether the fields to be associated are positively or negatively correlated, a field value preferential attachment model is established, and r'_AB is computed from it, where the positive-correlation field value preferential attachment model is as follows:
The negative-correlation field value preferential attachment model is as follows:
where g ∈ [1, m] and h ∈ [1, n]; the parameter e_MIC ∈ [0, 1] is the MIC coefficient between field A and field B, which measures the connection relationship between the fields and, in the model, physically represents the proportion of the preferential attachment part; r'_AB denotes the connection value, that is, the cumulative probability of the field value in the sampling process. Setting r'_AB = P_B(y), the unique rank y is computed from the analytic expression of the inverse function of P_B(y), and the corresponding field value B_y is obtained from the mapping between ranks and field values. This yields one complete record {A_x, B_y} and thus completes the connection of one data record.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610569447.XA CN106202486B (en) | 2016-07-19 | 2016-07-19 | Field value preferential attachment method of the heterogeneous datasets based on MIC |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106202486A (en) | 2016-12-07 |
CN106202486B (en) | 2019-07-09 |
Family
ID=57494394
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610569447.XA Active CN106202486B (en) | 2016-07-19 | 2016-07-19 | Field value preferential attachment method of the heterogeneous datasets based on MIC |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106202486B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106940731A (en) * | 2017-03-30 | 2017-07-11 | 福建师范大学 | A kind of data based on non-temporal Attribute Association generation method true to nature |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101661510A (en) * | 2009-09-29 | 2010-03-03 | 金蝶软件(中国)有限公司 | Data matching method and device thereof |
CN101702180A (en) * | 2009-12-04 | 2010-05-05 | 金蝶软件(中国)有限公司 | Method and system for searching associated field value |
CN103546312A (en) * | 2013-08-27 | 2014-01-29 | 中国航天科工集团第二研究院七〇六所 | Massive multi-source isomerism log correlation analyzing method |
CN103678665A (en) * | 2013-12-24 | 2014-03-26 | 焦点科技股份有限公司 | Heterogeneous large data integration method and system based on data warehouses |
CN105719006A (en) * | 2016-01-18 | 2016-06-29 | 合肥工业大学 | Cause-and-effect structure learning method based on flow characteristics |
Non-Patent Citations (1)
Title |
---|
XI ZHAO: "《Feature Selection with Attributes Clustering by Maximal Information Coefficient》", 《PROCEDIA COMPUTER SCIENCE》 * |
Also Published As
Publication number | Publication date |
---|---|
CN106202486B (en) | 2019-07-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||