CN105138588A

CN105138588A - Database overlap mode abstract generating method based on multi-label propagation

Info

Publication number: CN105138588A
Application number: CN201510464314.1A
Authority: CN
Inventors: 袁晓洁; 于漫; 王超; 靳宇东; 温延龙
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2015-07-31
Filing date: 2015-07-31
Publication date: 2015-12-09
Anticipated expiration: 2035-07-31
Also published as: CN105138588B

Abstract

The invention provides a method for generating an overlapping database schema summary based on multi-label propagation. The method comprises: mapping the database schema information to a multi-label graph model; clustering the schema information with a multi-label propagation algorithm to generate overlapping groups; clustering the overlapping groups with a hierarchical clustering algorithm to produce result categories of appropriate size; and finally, on the basis of information entropy and a random walk model, selecting a topic table for each result category, thereby generating the final overlapping schema summary of the database. The scheme gives the user a more accurate and more meaningful overlapping schema summary, helping the user understand the database quickly.

Description

A method for generating overlapping database schema summaries based on multi-label propagation

Technical field

The invention belongs to the field of database technology and specifically relates to a novel technique for generating overlapping schema summaries of relational databases.

Background

With the spread of computers and the rapid development of information technology, large volumes of data have brought database technology into wide use, and database applications have begun to reach ordinary users. Modern databases, however, are often very large and complex, and a user who wants to write suitable Structured Query Language (SQL) statements when querying must have some understanding of the database schema. The schema of a large database is usually also very complex, and the accompanying documentation is commonly missing, which makes the schema even harder for users to understand.

Schema summarization addresses this problem: it gives the user a concise summary of the database schema and improves the usability of the database. Existing schema summarization solutions all focus on non-overlapping summaries, in which each relation table may belong to only one topic category of the summary. In practice, however, a relation table often has several meanings and belongs to several topic categories. Considering only the non-overlapping case leaves the summary incomplete and can even mislead the user.

A non-overlapping schema summary therefore often cannot fully meet users' needs. By comparison, overlapping schema summarization generates more reasonable schema summary information, effectively reduces the time and effort users spend understanding the database schema, and has broad prospects for engineering application.

Summary of the invention

The object of the invention is to overcome the above shortcomings of the prior art by proposing a method for automatically generating overlapping database schema summaries based on multi-label propagation.

The method provided by the invention introduces the concept of an overlapping schema summary; designs a new multi-label schema graph model for databases; clusters the database schema with a multi-label propagation algorithm and then with a hierarchical clustering algorithm; and finally selects a topic table for each result category obtained by clustering, returning to the user a schema summary whose categories may overlap. The steps of the method are as follows:

Step 1: map the database schema to a weighted multi-label graph;

Step 1.1: map the database schema to a multi-label graph,

Definition 1: a relational database schema can be mapped to a multi-label graph, represented by the triple G = (V, E, L_m), where:

1. V is the set of relation table nodes in the database, and v ∈ V is a relation table node;

2. E is the set of foreign key relationships in the database, and e ∈ E is a foreign key relationship;

3. L_m is a label mapping function that maps a node to one or more labels; a label is written (c, b), where c is the identifier of a result category and b is the membership degree of the label, i.e. how strongly relation table v belongs to the result category identified by c;
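The triple of Definition 1 can be pictured as a small data structure. The sketch below is illustrative only: the class name `MultiLabelGraph`, its method names, and the edge weight 0.5 are assumptions introduced here, not part of the method as filed.

```python
from dataclasses import dataclass, field


@dataclass
class MultiLabelGraph:
    """G = (V, E, L_m): relation tables, foreign key edges, per-table labels."""
    tables: set = field(default_factory=set)    # V
    edges: dict = field(default_factory=dict)   # E: frozenset({R, S}) -> weight
    labels: dict = field(default_factory=dict)  # L_m: table -> {identifier c: membership b}

    def add_table(self, name):
        self.tables.add(name)
        # Initially one unique label: identifier = table name, membership = 1.
        self.labels[name] = {name: 1.0}

    def add_foreign_key(self, r, s, weight):
        self.edges[frozenset((r, s))] = weight

    def neighbors(self, name):
        return {t for edge in self.edges if name in edge for t in edge if t != name}


g = MultiLabelGraph()
g.add_table("Product")
g.add_table("ProductCategory")
g.add_foreign_key("Product", "ProductCategory", 0.5)  # 0.5 is a placeholder weight
```

Storing the labels as a dictionary from identifier to membership degree keeps the later propagation step a matter of merging and rescaling dictionaries.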

Step 1.2: compute the similarity between the two relation tables joined by each edge of the multi-label graph, to serve as the edge weight;

Step 1.2.1: use the vector space model to compute the text similarity of the table names and attribute names of the relation tables, giving the name similarity of the relation tables;

Step 1.2.2: use the Jaccard coefficient to measure the similarity of the values in the attribute columns of the relation tables, find the best-matching attribute pairs with a greedy algorithm, and take the mean value similarity of the best-matching attribute pairs as the value similarity of the relation tables;

Step 1.2.3: compute the mapping-relation similarity of the relation tables by analyzing the count ratio between them,

Definition 2: the mapping-relation similarity between relation table R and relation table S, denoted Sim_m(R, S), is defined as:

\mathrm{Sim}_{m}(R, S) = \frac{q_{i}}{\sum \mathrm{fan}(\tau_{i})} \times \frac{q_{j}}{\sum \mathrm{fan}(\tau_{j})};

where:

1. τ denotes the tuples of a relation table;

2. fan(τ_i) is the fan-out of tuple τ_i along the connecting edge e; fan-out is defined on the connections between tuples and is the number of distinct tuples that a given tuple joins with;

3. q_i is the number of tuples in relation table R that satisfy fan(τ_i) > 0;
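As a rough illustration of Definition 2, the sketch below computes Sim_m from the tuple pairs joined along one foreign key edge. It assumes that q_j and fan(τ_j) are defined for S in the same way as q_i and fan(τ_i) are for R, and that each sum runs over all tuples of the respective table; the function name and input format are hypothetical.

```python
from collections import Counter


def mapping_similarity(joined_pairs):
    """Sim_m(R, S) from (r_tuple_id, s_tuple_id) pairs joined along one edge."""
    pairs = set(joined_pairs)              # distinct connections only
    if not pairs:
        return 0.0                         # assumption: no joined tuples means similarity 0
    fan_r = Counter(r for r, _ in pairs)   # fan-out of each R tuple
    fan_s = Counter(s for _, s in pairs)   # fan-out of each S tuple
    q_r, q_s = len(fan_r), len(fan_s)      # tuples with fan-out > 0
    return (q_r / sum(fan_r.values())) * (q_s / sum(fan_s.values()))


# Six products, each referencing one of two categories (a one-to-many mapping).
one_to_many = [("p1", "c1"), ("p2", "c1"), ("p3", "c1"),
               ("p4", "c2"), ("p5", "c2"), ("p6", "c2")]
one_to_one = [("a1", "b1"), ("a2", "b2")]
sim_many = mapping_similarity(one_to_many)
sim_one = mapping_similarity(one_to_one)
```

Under this reading a one-to-one mapping scores 1, and the score falls as the fan-out on either side grows.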

Step 1.2.4: based on the three similarity features from steps 1.2.1 to 1.2.3, compute the relation table similarity with a multiple linear regression model, and use this similarity as the edge weight of the multi-label graph.

Step 2: cluster the multi-label graph with the multi-label propagation algorithm to generate overlapping groups;

Step 2.1: determine the parameter θ of the multi-label propagation algorithm, where θ is the maximum number of labels each node may carry; if the user specifies k as the final number of result categories of the schema summary, θ is tried over the values k-1 to k+3, and the θ finally selected is the one that maximizes the intra-cluster similarity of the overlapping groups produced by multi-label propagation; the intra-cluster similarity is defined as follows:

Definition 3: suppose multi-label propagation clusters the multi-label graph into the overlapping groups C = {C_1, C_2, ..., C_m}; the intra-cluster similarity of the propagation result C is then:

\mathrm{Sim}(C) = \frac{1}{m} \sum_{C_{i} \in C} \frac{\sum_{v_{i}, v_{j} \in C_{i}} \mathrm{Sim}(v_{i}, v_{j})}{\binom{|C_{i}|}{2}};

where:

1. Sim(v_i, v_j) is the similarity between relation tables v_i and v_j;

2. |C_i| is the number of relation tables in C_i, so the denominator is the number of table pairs in C_i;
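Definition 3 amounts to averaging, over the groups, the mean pairwise similarity inside each group. A minimal sketch follows; the function name, the `sim` callback, and the choice to let a single-table group contribute 0 are assumptions made here for illustration.

```python
from itertools import combinations


def intra_cluster_similarity(groups, sim):
    """Mean over groups of the average pairwise similarity within each group."""
    if not groups:
        return 0.0
    total = 0.0
    for group in groups:
        pairs = list(combinations(sorted(group), 2))  # the C(|C_i|, 2) table pairs
        if pairs:  # assumption: a one-table group has no pairs and adds 0
            total += sum(sim(a, b) for a, b in pairs) / len(pairs)
    return total / len(groups)


# Hypothetical edge weights; tables without an edge have similarity 0.
weights = {frozenset(("A", "B")): 0.6,
           frozenset(("B", "C")): 0.3,
           frozenset(("C", "D")): 0.8}


def sim(a, b):
    return weights.get(frozenset((a, b)), 0.0)


score = intra_cluster_similarity([{"A", "B", "C"}, {"C", "D"}], sim)
```

In the example the first group averages (0.6 + 0.3 + 0) / 3 = 0.3 and the second 0.8, so the score is 0.55.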

Step 2.2: assign each node in the graph one unique label, whose category identifier is the name of the node's relation table and whose membership degree is 1;

Step 2.3: in each iteration, add the labels of all neighbors of a node to the node's own labels according to their membership degrees and the edge weights, then normalize so that the membership degrees of the node sum to 1,

Definition 4: the normalization function b_t(c, v_i) gives, at the t-th iteration, the membership degree b associated with the group identifier c in the labels of node v_i:

b_{t}(c, v_{i}) = \frac{\sum_{v_{j} \in N(v_{i})} b_{t-1}(c, v_{j}) \, w_{v_{i} v_{j}}}{|N(v_{i})|};

where:

1. N(v_i) is the set of all neighbors of node v_i;

2. w_{v_i v_j} is the weight of the edge (v_i, v_j);

Step 2.4: delete every label whose membership degree is below 1/θ;

Step 2.5: stop iterating when the number of nodes marked by the least-used category identifier no longer changes; suppose m category identifiers remain after the iteration ends; the nodes carrying identifier c_m are assigned to group C_m, so that the multi-label graph is divided into m groups that may overlap, C = {C_1, C_2, ..., C_m};

Step 2.6: let θ take different values and repeat steps 2.2 to 2.5; select the set of overlapping groups with the greatest intra-cluster similarity as the result of multi-label propagation.
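Steps 2.2 to 2.5 can be sketched for a single value of θ as below. This is a simplified reading rather than a definitive implementation: the synchronous update, the fallback that keeps the strongest label when every label falls below 1/θ, the stopping test (the smallest label population is unchanged between two consecutive iterations), and the names `multi_label_propagation`, `neighbors` and `weight` are all assumptions.

```python
def multi_label_propagation(neighbors, weight, theta, max_iter=100):
    """neighbors: node -> set of neighbor nodes; weight(u, v) -> edge weight.

    Returns (groups, labels), where labels maps node -> {identifier: membership}.
    """
    if not neighbors:
        return [], {}
    # Step 2.2: one unique label per node, membership degree 1.
    labels = {v: {v: 1.0} for v in neighbors}
    previous_min = None
    for _ in range(max_iter):
        updated = {}
        for v, nbrs in neighbors.items():
            # Step 2.3: gather neighbor labels by membership and edge weight.
            gathered = {}
            for u in nbrs:
                for c, b in labels[u].items():
                    gathered[c] = gathered.get(c, 0.0) + b * weight(v, u)
            total = sum(gathered.values())
            if total == 0.0:  # isolated node: keep its own labels
                updated[v] = dict(labels[v])
                continue
            # The division by |N(v)| in Definition 4 cancels once memberships
            # are rescaled to sum to 1, so only the rescaling is performed.
            gathered = {c: b / total for c, b in gathered.items()}
            # Step 2.4: drop labels whose membership is below 1/theta.
            kept = {c: b for c, b in gathered.items() if b >= 1.0 / theta}
            if not kept:  # assumption: keep the single strongest label
                best = max(sorted(gathered), key=gathered.get)
                kept = {best: gathered[best]}
            total = sum(kept.values())
            updated[v] = {c: b / total for c, b in kept.items()}
        labels = updated
        # Step 2.5 (simplified): compare the smallest label population
        # with the one from the previous iteration.
        population = {}
        for node_labels in labels.values():
            for c in node_labels:
                population[c] = population.get(c, 0) + 1
        current_min = min(population.values())
        if current_min == previous_min:
            break
        previous_min = current_min
    groups = {}
    for v, node_labels in labels.items():
        for c in node_labels:
            groups.setdefault(c, set()).add(v)
    return list(groups.values()), labels
```

Step 2.6 would then call this function once for each θ from k-1 to k+3 and keep the grouping with the greatest intra-cluster similarity.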

Step 3: apply hierarchical clustering to the overlapping groups to generate the result categories;

Step 3.1: compute the similarity between overlapping groups,

Definition 5: let C_i and C_j be two overlapping groups obtained by multi-label propagation clustering; the similarity between C_i and C_j is defined as:

\mathrm{Sim}(C_{i}, C_{j}) = \frac{\sum_{v_{i} \in C_{i}} \sum_{v_{j} \in C_{j}} \mathrm{Sim}(v_{i}, v_{j})}{|C_{i}| \, |C_{j}|};

where Sim(v_i, v_j) is the similarity between relation tables v_i and v_j; if there is no edge between two tables, their similarity is 0;

Step 3.2: treat each overlapping group as a separate category and, in each iteration, merge the two categories with the greatest similarity, stopping once they have been merged into the k result categories specified by the user.
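A compact sketch of Definition 5 and the merging loop of step 3.2 is given below. The function names and example weights are hypothetical, and one point is an assumption: a table that belongs to both groups is compared with itself, and the example `sim` returns 0 for that case because the source does not say how it should be scored.

```python
def group_similarity(ci, cj, sim):
    """Average similarity over all table pairs (v_i in C_i, v_j in C_j)."""
    return sum(sim(a, b) for a in ci for b in cj) / (len(ci) * len(cj))


def merge_groups(groups, k, sim):
    """Merge the two most similar categories until only k remain."""
    classes = [set(g) for g in groups]
    while len(classes) > k:
        candidates = ((p, q) for p in range(len(classes))
                      for q in range(p + 1, len(classes)))
        p, q = max(candidates,
                   key=lambda pq: group_similarity(classes[pq[0]], classes[pq[1]], sim))
        classes[p] |= classes[q]  # merge C_q into C_p
        del classes[q]            # q > p, so index p stays valid
    return classes


# Hypothetical edge weights; unconnected tables (and a table with itself) score 0.
weights = {frozenset(("A", "B")): 0.9, frozenset(("B", "C")): 0.8,
           frozenset(("C", "D")): 0.1, frozenset(("D", "E")): 0.7}


def sim(a, b):
    return weights.get(frozenset((a, b)), 0.0)


result = merge_groups([{"A", "B"}, {"B", "C"}, {"D", "E"}], 2, sim)
```

Here the first two groups share table B and have the greatest similarity (1.7 / 4 = 0.425), so they are merged first and the loop stops at two categories.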

Step 4: select a topic table for each result category and return the final schema summary to the user;

Step 4.1: compute the importance of the relation tables;

Step 4.1.1: compute the information content of a relation table,

Definition 6: attribute A of relation table R is written R.A, and the information entropy of this attribute is defined as:

H(R.A) = \sum_{i=1}^{h} p_{i} \log(1 / p_{i})

where h is the number of distinct values of attribute A; if the values of attribute A form the set of h distinct values R.A = {a_1, ..., a_h}, then p_i is the probability that a_i occurs;

Definition 7: the information content of relation table R is defined as:

IC(R) = \log|R| + \sum_{R.A} H(R.A)

where |R| is the number of tuples in R;
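Definitions 6 and 7 translate directly into a few lines of code. The sketch below assumes natural logarithms (the source does not state the base) and a hypothetical row format in which each tuple is a dictionary keyed by attribute name.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """H(R.A) = sum of p_i * log(1 / p_i) over the distinct values of one column."""
    n = len(values)
    return sum((c / n) * math.log(n / c) for c in Counter(values).values())


def information_content(rows):
    """IC(R) = log|R| + sum of H(R.A) over all attributes A of R."""
    if not rows:
        return 0.0  # assumption: an empty table carries no information
    return math.log(len(rows)) + sum(
        attribute_entropy([row[a] for row in rows]) for a in rows[0])


# Toy table: a unique key column and a two-valued column.
rows = [{"id": 1, "type": "x"}, {"id": 2, "type": "x"},
        {"id": 3, "type": "y"}, {"id": 4, "type": "y"}]
ic = information_content(rows)
```

For the toy table, the key column contributes log 4, the two-valued column log 2, and the row count log 4, so IC(R) = log 32.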

Step 4.1.2: compute the transition probability between relation tables,

Definition 8: for relation tables R and S, the probability of transferring from R to S is defined as:

\Pi(R, S) = \sum_{R.A - S.B} \frac{H(R.A)}{\log|R| + \sum_{R.A'} q_{A'} \cdot H(R.A')};

where:

1. R.A-S.B denotes a foreign key reference between attribute A of relation table R and attribute B of relation table S;

2. for any attribute A' of R, q_{A'} is the total number of foreign key connections on R.A';
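The following sketch evaluates Definition 8 for one source table R. It follows the formula as written, so attributes without any foreign key connection have q = 0; the function name, the input format, and the example entropies are hypothetical. Note that the resulting probabilities need not sum to 1, and the text above does not state how the remaining mass is handled.

```python
import math
from collections import Counter


def transition_probabilities(size, entropy, foreign_keys):
    """Pi(R, S) for every table S referenced from R.

    size:         |R|, the number of tuples in R (at least 1)
    entropy:      attribute name -> H(R.A)
    foreign_keys: list of (attribute of R, other table), one per connection
    """
    q = Counter(attr for attr, _ in foreign_keys)  # q_{A'} per attribute
    denominator = math.log(size) + sum(q[a] * h for a, h in entropy.items())
    if denominator == 0.0:
        return {}  # assumption: a table with no information transfers nothing
    probabilities = {}
    for attr, target in foreign_keys:
        probabilities[target] = probabilities.get(target, 0.0) + entropy[attr] / denominator
    return probabilities


# A four-row Product table whose CategoryID column references ProductCategory.
pi = transition_probabilities(
    4,
    {"ProductID": math.log(4), "CategoryID": math.log(2)},
    [("CategoryID", "ProductCategory")],
)
```

In the example the denominator is log 4 + 1 · log 2 = log 8, so the transition probability to ProductCategory is log 2 / log 8 = 1/3.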

Step 4.3: apply a random walk model, taking the information content of each relation table as the initial value of the random walk and the transition probabilities between relation tables as the transition probabilities of the walk; the distribution of information content when the model reaches its steady state gives the importance of the relation tables;

Step 4.4: select the relation table with the highest importance in each result category as the topic table of that category, and return the final schema summary to the user.

Advantages and beneficial effects of the invention:

The invention proposes a method for mapping a database schema to a multi-label graph, in which the category information of each relation table is stored as labels and the final clustering result of the schema summary is determined by membership degree. It analyzes graph-based multi-label propagation in depth and, on that basis, proposes a model for automatically generating schema summaries based on multi-label propagation. Compared with conventional models, this model inherits the advantages of the multi-label propagation algorithm, automatically generates schema summaries with overlapping parts, and achieves higher clustering precision, which helps users search the database quickly.

Brief description of the drawings

Fig. 1 is the overall flow chart of the method;

Fig. 2 is the schema diagram of the original relational database;

Fig. 3 is the multi-label graph corresponding to the example relational database;

Fig. 4 shows the division into overlapping groups after multi-label propagation clustering;

Fig. 5 shows the division into result categories after hierarchical clustering;

Fig. 6 shows the schema summary result, where (a) and (b) are the database clustering diagrams corresponding to the schema summary and (c) is the schema summary diagram;

Table 1 gives the computed importance of the relation tables in the example database.

Detailed description

The processing flow of the method of the invention is shown in Fig. 1.

An embodiment of the method is described below. Fig. 2 shows the relational database schema of the embodiment. The schema summary produced by the overlapping schema summarization method is shown in Fig. 6, where Fig. 6(c) is the overlapping schema summary diagram, which helps the user sort out complex schema relationships; the user can also inspect any part of the schema summary diagram in detail, and the expanded views are shown in Fig. 6(a) and (b). The concrete steps of the method are described below with reference to the embodiment of Fig. 2:

Step 1: map the database schema to a weighted multi-label graph.

Step 1.1: map the database schema to a multi-label graph,

First traverse the schema information of the relational database and formally define it as a multi-label graph, represented by the triple G = (V, E, L_m), where V is the set of relation table nodes in the database and v ∈ V is a relation table node; E is the set of foreign key relationships in the database and e ∈ E is a foreign key relationship; L_m is a label mapping function that maps a node to one or more labels, where a label is written (c, b), c is the identifier of a result category, and b is the membership degree of the label, i.e. how strongly relation table v belongs to result category c. Fig. 3 shows the multi-label graph corresponding to the example relational database of Fig. 2; initially, each relation table in the multi-label graph is given a single unique label whose identifier is the table name and whose membership degree is 1.

Step 1.2: compute the weights of the multi-label graph, as follows:

First, treat each relation table as a piece of text consisting of its table name and attribute names after word segmentation. Taking the ProductCategory relation table in Fig. 2 as an example, its table name and attribute names are split into the morphemes Product, Category, ID and Type; Category occurs three times in the text representing ProductCategory, while ID and Type each occur once. Treat the whole relational database as a text collection made up of the segmented morphemes; by computing the weight of each relation table's morphemes within the collection, map the name information of the relation table to a space vector. Then use the vector space model to compute the angle between two space vectors, which gives the name similarity of the two relation tables;
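The segmentation and vector comparison described above can be sketched as follows. Several details are assumptions made for illustration: names are split on CamelCase boundaries with a regular expression, raw morpheme counts stand in for the morpheme weights (the text does not specify the weighting scheme), and the cosine of the angle between the vectors is used as the similarity. The attribute lists are those mentioned in the worked examples of this description.

```python
import math
import re
from collections import Counter

# Uppercase runs such as "ID", capitalized words, lowercase words, or digits.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def morphemes(table_name, attribute_names):
    """Count the morphemes of a table name and its attribute names."""
    counts = Counter()
    for name in [table_name, *attribute_names]:
        counts.update(TOKEN.findall(name))
    return counts


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0


category = morphemes("ProductCategory", ["CategoryID", "CategoryType"])
product = morphemes("Product", ["ProductID", "ProducName", "CategoryID"])
name_similarity = cosine(category, product)
```

For ProductCategory the counts come out as Category 3 and Product, ID and Type 1 each, matching the example in the text.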

The pseudocode of the algorithm that finds the best-matching attribute pairs is as follows:

Algorithm 1: GreedyMatching, the search for best-matching attribute pairs

Input: relation table R, relation table S, and the set P of precomputed similarities between the attribute pairs of R and S

Output: the set Z of best-matching attribute pairs

1. Z := ∅

2. U := attribute set of R

3. V := attribute set of S

4. WHILE (1) DO

5.   IF U = ∅ OR V = ∅ THEN BREAK;

6.   Traverse U and V

7.   Find the attribute pair (u, v) with the greatest similarity according to P

8.   Insert (u, v) into Z

9.   Delete u from U and v from V

10. END WHILE

11. RETURN Z

12. End of algorithm

For the Product and ProductCategory relation tables in Fig. 2, first compute the attribute similarities between the two tables: J(Product.ProductID, ProductCategory.CategoryID) = 0.1, J(Product.ProducName, ProductCategory.CategoryType) = 0.05, J(Product.CategoryID, ProductCategory.CategoryID) = 0.8, and the similarity between all other attribute pairs is 0. Algorithm 1 then finds the best-matching attribute pairs, namely (Product.CategoryID, ProductCategory.CategoryID) and (Product.ProducName, ProductCategory.CategoryType). The value similarity between the Product and ProductCategory relation tables is therefore Sim_v(Product, ProductCategory) = (0.8 + 0.05) / 2 = 0.425.
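The worked example can be reproduced with a short sketch of the greedy matching. The loop stops when either attribute set is exhausted, which is consistent with the two pairs found in the example; the function names and the dictionary format for P are hypothetical, and `jaccard` shows the standard Jaccard coefficient on two columns of values.

```python
def jaccard(column_a, column_b):
    """Jaccard coefficient of the value sets of two attribute columns."""
    a, b = set(column_a), set(column_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def greedy_matching(attrs_r, attrs_s, pair_similarity):
    """Repeatedly pick the most similar remaining attribute pair."""
    u, v = set(attrs_r), set(attrs_s)
    z = []
    while u and v:
        best = max(((a, b) for a in sorted(u) for b in sorted(v)),
                   key=lambda pair: pair_similarity.get(pair, 0.0))
        z.append(best)
        u.discard(best[0])
        v.discard(best[1])
    return z


def value_similarity(attrs_r, attrs_s, pair_similarity):
    """Mean similarity of the best-matching attribute pairs."""
    z = greedy_matching(attrs_r, attrs_s, pair_similarity)
    return sum(pair_similarity.get(p, 0.0) for p in z) / len(z) if z else 0.0


# Attribute similarities from the example; unlisted pairs count as 0.
P = {("ProductID", "CategoryID"): 0.1,
     ("ProducName", "CategoryType"): 0.05,
     ("CategoryID", "CategoryID"): 0.8}
product_attrs = ["ProductID", "ProducName", "CategoryID"]
category_attrs = ["CategoryID", "CategoryType"]
matches = greedy_matching(product_attrs, category_attrs, P)
sim_v = value_similarity(product_attrs, category_attrs, P)
```

The first pick is the CategoryID pair (0.8); of the remaining candidates only ProducName and CategoryType have a nonzero similarity (0.05), giving the mean 0.425.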

Step 1.2.3: for the connections between tuples, tuple fan-out is used to express the number of distinct tuples that a given tuple joins with, and the mapping-relation similarity of the relation tables is expressed by a linear function proportional to tuple fan-out.

Step 1.2.4: finally, based on the three relation table similarity features above, a multiple linear regression model combines the features to compute the relation table similarity, which serves as the weight of the multi-label graph. First, the name similarity, value similarity and mapping-relation similarity between relation tables are normalized so that the data map into the range 0 to 1. The multiple linear regression model is then applied; the invention considers that the influence of the name factor, the value factor and the mapping-relation factor on relation table similarity decreases in that order, so the parameters α, β, γ and δ of the algorithm are set to 6.4, 4.8, 2.0 and 0.2 respectively, so that Sim(R, S) ∈ [0, 1].

Step 2: cluster the multi-label graph with the multi-label propagation algorithm to generate overlapping groups.

Step 2.1: determine the parameter θ of the multi-label propagation algorithm, where θ is the maximum number of labels each node may carry; if the user specifies k as the final number of result categories of the schema summary, θ is tried over the values k-1 to k+3;

For the example database of Fig. 2, when the specified number of result categories k is 2, θ is tried at 1, 2, 3, 4 and 5 in turn for multi-label propagation.

Step 2.2: assign each node in the graph one unique label, whose category identifier is the name of the node's relation table and whose membership degree is 1.

Step 2.3: in each iteration, add the labels of a node's neighbors to the node's own labels according to their membership degrees, then normalize so that the membership degrees of the node sum to 1.

Step 2.4: so that a node can retain several labels without every node ending up with all labels, the algorithm computes the membership degree of each label and deletes the labels below a given threshold. The threshold here is 1/θ.

Step 2.5: after several rounds of iteration, stop when the number of nodes marked by the least-used category identifier no longer changes; the relation tables carrying a label with the same identifier are then placed in one overlapping group.

Step 2.6: let θ take different values and repeat steps 2.2 to 2.5; select the set of overlapping groups with the greatest intra-cluster similarity as the result of multi-label propagation.

For the example database of Fig. 2, with the number of result categories k specified as 2 and θ set to 3, multi-label propagation divides the graph into overlapping groups in which the overlap between group 1 and group 2 consists of the relation tables ZipCode and Order, and the overlap between group 1 and group 3 is the relation table Supply.

Step 3: apply hierarchical clustering to the overlapping groups to generate the result categories.

Step 3.1: compute the similarity between overlapping groups.

The pseudocode of the hierarchical clustering algorithm is as follows:

Algorithm 2: HierarchicalClustering

Input: the division into overlapping groups C = {C_1, C_2, ..., C_m}, and the number of result categories k

Output: the division into result categories C = {C_1, C_2, ..., C_k}

1. FOR i = |C| TO k

2.   Find the two categories with the greatest similarity, C_p, C_q ∈ C

3.   Merge category C_q into C_p

4.   Delete C_q from C

5.   FOR EACH C_j

6.     Compute the similarity between category C_j and the merged C_p

7.   END FOR

8. END FOR

9. RETURN C

10. End of algorithm

Algorithm 2 describes the execution flow of the hierarchical clustering algorithm. The algorithm first treats each overlapping group as a separate result category; in each iteration it finds the two categories with the greatest similarity and merges them, as shown in lines 2 to 4 of the algorithm; the iteration continues until k result categories remain.

For the overlapping groups of the Fig. 4 example, when the specified number of result categories k is 2, hierarchical clustering stops once two result categories are obtained. As shown in Fig. 5, the schema graph is then divided into 2 result categories, and the relation table Supply is the overlap between the two categories.

Step 4: select a topic table for each result category and return the final schema summary to the user.

Step 4.1: the importance of each relation table is measured from the primary and foreign key information, attribute information and tuple information of the table. Table 1 lists the computed importance of some of the relation tables in the example database.

Table 1: computed importance of relation tables in the example database

Rank	Relation table	Importance
1	Company	189.35
2	Order	183.28
3	Customer	116.54
4	Product	101.07

The pseudocode for computing relation table importance is as follows:

Algorithm 3: TableImportance, computation of relation table importance

Input: multi-label graph G

Output: relation table importance vector I

1. FOR EACH node R IN G

2.   IC[R] := information content of relation table R

3.   I_0[R] := IC[R]

4. END FOR

5. FOR EACH edge e IN G

6.   Π := transition matrix between relation tables

7. END FOR

8. done = FALSE; /* convergence flag of the stochastic process */

9. WHILE (!done) DO

10.   I := I_0 * Π

11.   IF (dist(I, I_0) ≤ ε) /* vector distance computed with the infinity norm; ε is a very small value */

12.     done = TRUE;

13.   I_0 := I

14. END WHILE

15. RETURN I

16. End of algorithm

This algorithm describes how relation table importance is computed. First, the information content of each relation table is computed from its primary and foreign key information, attribute information and tuple information, and the result is used as the initial value of the random walk. Next, the transition probabilities between relation tables are computed from the foreign key references between them, and information is repeatedly sent and received along the edges of the graph according to the transition probabilities until the stochastic process converges to a stationary distribution. Finally, the information value of each relation table in the stationary distribution is defined as the importance of that table.
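The iteration of Algorithm 3 can be sketched in plain Python as below. The sketch assumes that each row of the transition matrix sums to 1 (for instance with the remaining probability kept on the table itself), which the text does not state; the function name, the iteration cap, and the two-table example values are hypothetical.

```python
def table_importance(ic, transition, eps=1e-9, max_iter=10_000):
    """Random walk: start from information content, iterate I := I_0 * Pi.

    ic:         table -> information content (initial value I_0)
    transition: table -> {target table: probability}; rows assumed to sum to 1
    """
    current = dict(ic)
    for _ in range(max_iter):
        following = {t: 0.0 for t in ic}
        for r, row in transition.items():
            for s, p in row.items():
                following[s] += current[r] * p
        # Infinity-norm distance between successive vectors.
        distance = max(abs(following[t] - current[t]) for t in ic)
        current = following
        if distance <= eps:
            break
    return current


# Hypothetical two-table example with row-stochastic transitions.
ic = {"A": 3.0, "B": 1.0}
transition = {"A": {"A": 0.5, "B": 0.5},
              "B": {"A": 0.25, "B": 0.75}}
importance = table_importance(ic, transition)
```

With row-stochastic transitions the total information (4.0 here) is preserved at every step, and the example converges to roughly 4/3 for A and 8/3 for B.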

Step 4.2: select the relation table with the highest importance in each result category as the topic table of that category, and return the final schema summary to the user.

For the result categories of the Fig. 5 example, the most important relation table in each category is chosen as its topic table: the most important table in category 1 is Company, and the most important table in category 2 is Product. Fig. 6 is the automatically generated overlapping schema summary diagram, in which Fig. 6(a) and (b) show the clustering result mapped back onto the relational database.

Claims

1. A method for generating overlapping database schema summaries based on multi-label propagation, characterized in that the method comprises:

Step 1: map the database schema to a weighted multi-label graph;

Step 1.1: map the database schema to a multi-label graph,

\mathrm{Sim}_{m}(R, S) = \frac{q_{i}}{\sum \mathrm{fan}(\tau_{i})} \times \frac{q_{j}}{\sum \mathrm{fan}(\tau_{j})};

where:

1. τ denotes the tuples of a relation table;

3. q_i is the number of tuples in relation table R that satisfy fan(τ_i) > 0;

Step 1.2.4: based on the three similarity features from steps 1.2.1 to 1.2.3, compute the relation table similarity with a multiple linear regression model, and use this similarity as the edge weight of the multi-label graph;

\mathrm{Sim}(C) = \frac{1}{m} \sum_{C_{i} \in C} \frac{\sum_{v_{i}, v_{j} \in C_{i}} \mathrm{Sim}(v_{i}, v_{j})}{\binom{|C_{i}|}{2}};

where:

1. Sim(v_i, v_j) is the similarity between relation tables v_i and v_j;

2. |C_i| is the number of relation tables in C_i;

b_{t}(c, v_{i}) = \frac{\sum_{v_{j} \in N(v_{i})} b_{t-1}(c, v_{j}) \, w_{v_{i} v_{j}}}{|N(v_{i})|};

where:

1. N(v_i) is the set of all neighbors of node v_i;

2. w_{v_i v_j} is the weight of the edge (v_i, v_j);

Step 2.4: delete every label whose membership degree is below 1/θ;

Step 2.6: let θ take different values and repeat steps 2.2 to 2.5; select the set of overlapping groups with the greatest intra-cluster similarity as the result of multi-label propagation;

Step 3.1: compute the similarity between overlapping groups,

\mathrm{Sim}(C_{i}, C_{j}) = \frac{\sum_{v_{i} \in C_{i}} \sum_{v_{j} \in C_{j}} \mathrm{Sim}(v_{i}, v_{j})}{|C_{i}| \, |C_{j}|};

Step 3.2: treat each overlapping group as a separate category and, in each iteration, merge the two categories with the greatest similarity, stopping once they have been merged into the k result categories specified by the user;

Step 4.1: compute the importance of the relation tables;

Step 4.1.1: compute the information content of a relation table,

H(R.A) = \sum_{i=1}^{h} p_{i} \log(1 / p_{i})

Definition 7: the information content of relation table R is defined as:

IC(R) = \log|R| + \sum_{R.A} H(R.A)

where |R| is the number of tuples in R;

Step 4.1.2: compute the transition probability between relation tables,

\Pi(R, S) = \sum_{R.A - S.B} \frac{H(R.A)}{\log|R| + \sum_{R.A'} q_{A'} \cdot H(R.A')};

Wherein: