CN105138588B

CN105138588B - Method for generating overlapping database schema summaries based on multi-label propagation

Info

Publication number: CN105138588B
Application number: CN201510464314.1A
Authority: CN
Inventors: 袁晓洁; 于漫; 王超; 靳宇东; 温延龙
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2015-07-31
Filing date: 2015-07-31
Publication date: 2018-09-28
Anticipated expiration: 2035-07-31
Also published as: CN105138588A

Abstract

A method for generating overlapping database schema summaries based on multi-label propagation. The method comprises: mapping the database schema information to a multi-label graph model; clustering the schema information with a multi-label propagation algorithm to generate overlappable groups; clustering the overlappable groups with a hierarchical clustering algorithm to produce result classes of a suitable scale; and finally selecting a topic table for each result class, based on information entropy and a random walk model, to generate the final overlapping schema summary. The proposed scheme can provide users with a more accurate and meaningful overlapping schema summary and help them understand the database quickly.

Description

Method for generating overlapping database schema summaries based on multi-label propagation

Technical field

The invention belongs to the field of database technology, and in particular relates to a novel technique for generating overlapping schema summaries of relational databases.

Background technology

With the spread of computers and the rapid development of information technology, the sheer volume of data has brought database technology into wide use, and database applications have begun to reach ordinary users. However, modern databases are often very large and complex, and to write appropriate structured query language statements a user must have some understanding of the database's schema information. The schema of a large database is usually complex as well, and the relevant documentation is commonly missing, which makes it still harder for users to understand the schema.

Schema summarization can effectively address this problem: it gives the user a concise summary of the database schema and improves the usability of the database. Existing schema summarization approaches focus solely on non-overlapping summaries, that is, they allow a relational table to belong to only one topic class in the summary. In practice, however, a relational table often carries several meanings and belongs to multiple topic classes. Considering only the non-overlapping case can make the summary incomplete and may even mislead the user.

Non-overlapping schema summaries therefore often fail to meet user needs fully. By contrast, overlapping schema summarization can generate more reasonable schema summary information, effectively reducing the time and effort users spend understanding a database schema, and has broad prospects for engineering application.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by proposing a method for automatically generating overlapping database schema summaries based on multi-label propagation.

The method provided by the invention introduces the concept of an overlapping schema summary; designs a new multi-label schema graph model for databases; clusters the database schema using a multi-label propagation algorithm and then a hierarchical clustering algorithm; and finally selects a topic table for each result class obtained by clustering, returning an overlappable schema summary to the user. The steps of the method are as follows:

Step 1. Map the database schema to a weighted multi-label graph;

Step 1.1. Map the database schema to a multi-label graph.

Definition 1: A relational database schema can be mapped to a multi-label graph, represented by a triple G = (V, E, L_M), where:

①. V denotes the set of relational table nodes in the database, and v ∈ V denotes one relational table node;

②. E denotes the set of foreign key relationships in the database, and e ∈ E denotes one foreign key relationship;

③. L_M is a label mapping function that maps a node to one or more labels. A label is written (c, b), where c is a result class identifier and b is the label's membership degree, i.e., the strength with which relational table v belongs to result class c;

Step 1.2. Compute the similarity between every two relational tables joined by an edge in the multi-label graph, and use it as the edge weight of the graph;

Step 1.2.1. Use the vector space model to compute the text similarity of the table names and attribute names of the relational tables, as the name similarity of the tables;

Step 1.2.2. Use the Jaccard coefficient to analyze the similarity of the values in the attribute columns of the relational tables, find the best-matching attribute pairs with a greedy algorithm, and take the average value similarity of the best-matching attribute pairs as the value similarity of the tables;

Step 1.2.3. Compute the mapping relationship similarity of the relational tables by analyzing the cardinality ratio between the tables.

Definition 2: The mapping relationship similarity between relational tables R and S, denoted Sim_m(R, S), is defined as follows:

Where:

①. τ denotes all tuples of a relational table;

②. fan(τ_i) is the fan-out of tuple τ_i along the join edge e. Fan-out is defined in terms of the number of join edges between tuples and denotes the number of distinct tuples that a given tuple can join with;

③. q_i is the number of tuples in relational table R that satisfy fan(τ_i) > 0;

Step 1.2.4. Based on the three similarity features from Steps 1.2.1 to 1.2.3, compute the relational table similarity with a multiple linear regression model, and use this similarity as the weight of the multi-label graph.

Step 2. Cluster the multi-label graph with the multi-label propagation algorithm to generate overlappable groups;

Step 2.1. Determine the parameter θ of the multi-label propagation algorithm, where θ is the maximum number of labels each node may carry. If the user specifies k as the final number of result classes in the schema summary, θ is tried at each value from k-1 to k+3, and the θ that maximizes the intra-cluster similarity of the overlappable groups produced by multi-label propagation is finally chosen. The intra-cluster similarity is defined as follows:

Definition 3: Suppose multi-label propagation clusters the multi-label graph into overlappable groups C = {C₁, C₂, ..., C_m}. The intra-cluster similarity of the propagation result C is then:

Where:

①. Sim(v_i, v_j) is the similarity between relational tables v_i and v_j;

②. |C_i| denotes the number of relational tables in C_i;

Step 2.2. Assign one unique label to each node in the label graph; the class identifier of the label is set to the name of the node's relational table, and its membership degree is set to 1;

Step 2.3. In each iteration, add the labels of all neighbors of a node to that node's labels according to their membership degrees and the edge weights, then normalize so that the node's membership degrees sum to 1.

Definition 4: The normalization function b_x(c, v_i) gives, for the x-th iteration, the mapping between the group identifier c and its membership degree b in the labels of relational table v_i:

Where:

①. N(v_i) is the set of all neighbor tables of relational table v_i;

②. the weight of edge (v_i, v_j);

Step 2.4. Delete every label whose membership degree is less than 1/θ;

Step 2.5. Stop iterating when the number of nodes marked by the least-used class identifier no longer changes. Suppose m class identifiers remain when the iteration ends; the nodes carrying identifier c_m are assigned to group C_m, so that the multi-label graph is divided into m groups C = {C₁, C₂, ..., C_m} that may overlap;

Step 2.6. Let θ take different values, repeat Steps 2.2 to 2.5, and select the set of overlappable groups with the largest intra-cluster similarity as the result of multi-label propagation.

Step 3. Apply hierarchical clustering to the overlappable groups to generate the result classes;

Step 3.1. Compute the similarity between overlappable groups.

Definition 5: Let C_i and C_j be two overlappable groups obtained by multi-label propagation clustering. The similarity between C_i and C_j can be defined as:

Where Sim(v_i, v_j) denotes the similarity between relational tables v_i and v_j; if there is no edge between two tables, their similarity is 0;

Step 3.2. Treat each overlappable group as a separate class. In each iteration, merge the two classes with the greatest similarity, and stop iterating once the groups have been merged into the k result classes specified by the user.

Step 4. Select a topic table for each result class and return the final schema summary to the user;

Step 4.1. Compute the importance of the relational tables;

Step 4.1.1. Compute the information content of each relational table.

Definition 6: Attribute A of relational table R is denoted R.A. The information entropy on this attribute is defined as:

Where h denotes the number of distinct values on attribute A; if the values on attribute A can be expressed as a set of h distinct values R.A = {a₁, ..., a_h}, then p_i denotes the probability that a_i occurs;

Definition 7: The information content of relational table R is defined as:

Where |R| denotes the number of tuples in R;

Step 4.1.2. Compute the transition probabilities between relational tables.

Definition 8: Taking relational tables R and S as an example, the probability of transferring from R to S is defined as follows:

Where:

①. R.A-S.B denotes a foreign key reference between attribute A of relational table R and attribute B of relational table S;

②. For any attribute A′ in R, q_A′ denotes the number of foreign key links on R.A′;

Step 4.3. Apply a random walk model, using the information content of each relational table as the initial value of the random walk and the transition probabilities between relational tables as the transition probabilities of the walk. The distribution of information content when the model reaches its steady state gives the importance of the relational tables;

Step 4.4. In each result class, select the relational table with the highest importance as the topic table of that class, and return the final schema summary to the user.

Advantages and beneficial effects of the invention:

The invention proposes a novel method for mapping a database schema to a multi-label graph, in which the class information of relational tables is stored as labels and the final clustering result of the schema summary is determined by membership degree. It analyzes graph-based multi-label propagation in depth and, on that basis, proposes a model for automatically generating schema summaries based on multi-label propagation. Compared with traditional models, this model inherits the advantages of the multi-label propagation algorithm, can automatically generate schema summaries with overlapping parts, and achieves higher clustering precision, helping users search the database quickly.

Description of the drawings

Fig. 1 is the overall flow chart of the method;

Fig. 2 is the original relational database schema graph;

Fig. 3 is the multi-label graph corresponding to the example relational database;

Fig. 4 shows the division into overlappable groups after multi-label propagation clustering;

Fig. 5 shows the division into result classes after hierarchical clustering;

Fig. 6 shows the schema summary result, where (a) and (b) are the database cluster graphs corresponding to the schema summary and (c) is the schema summary graph.

Table 1 lists the importance results computed for the relational tables of the example database.

Detailed description of the embodiments

The processing flow of the method of the invention is shown in Fig. 1.

The specific implementation of the method is described below with reference to an embodiment; Fig. 2 shows the relational database schema graph of the embodiment. The schema summary generated by the overlapping schema summary generation method is shown in Fig. 6, where Fig. 6(c) is the overlapping schema summary graph, which helps the user sort out complex schema relationships. The user can also inspect any part of the summary graph in detail; the expanded views are shown in Fig. 6(a) and (b). The specific steps of the method are described below with reference to the embodiment shown in Fig. 2:

Step 1: Map the database schema to a weighted multi-label graph.

Step 1.1. Map the database schema to a multi-label graph.

First, the schema information of the relational database is traversed and formally defined as a multi-label graph, represented by the triple G = (V, E, L_M), where V denotes the set of relational table nodes in the database and v ∈ V denotes one relational table node; E denotes the set of foreign key relationships in the database and e ∈ E denotes one foreign key relationship; and L_M is a label mapping function that maps a node to one or more labels. A label is written (c, b), where c is a result class identifier and b is the label's membership degree, i.e., the strength with which relational table v belongs to result class c. Fig. 3 shows the multi-label graph corresponding to the example relational database of Fig. 2. Initially, each relational table in the multi-label graph is given a single unique label whose identifier is the table's name and whose membership degree is 1.
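The initial state described above can be sketched as a small data structure. This is a minimal illustration, not the patent's implementation: the class name `MultiLabelGraph`, the function `build_initial_graph`, and the choice of storing foreign keys as unordered table pairs are all assumptions made for the example, and the foreign key between the two sample tables is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class MultiLabelGraph:
    """G = (V, E, L_M): tables, foreign-key edges, and per-table labels."""
    tables: set                                  # V: relational table names
    foreign_keys: set                            # E: frozenset({table_a, table_b})
    labels: dict = field(default_factory=dict)   # L_M: table -> {class_id: membership}


def build_initial_graph(tables, foreign_keys):
    """Give every table one unique label (its own name) with membership 1."""
    graph = MultiLabelGraph(set(tables), {frozenset(fk) for fk in foreign_keys})
    graph.labels = {t: {t: 1.0} for t in graph.tables}
    return graph


# Two tables from the example schema; the edge between them is assumed here.
graph = build_initial_graph(
    ["Product", "ProductCategory"],
    [("Product", "ProductCategory")],
)
print(graph.labels["Product"])  # {'Product': 1.0}
```

Keeping the labels as a dictionary per table makes it straightforward for a later propagation step to hold several (c, b) pairs on the same node.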

Step 1.2. Compute the weights of the multi-label graph, as follows:

First, each relational table is treated as a text made up of the table name and attribute names of that table after word segmentation. Taking the ProductCategory table in Fig. 2 as an example, its table name and attribute names can be split into the following tokens: Product, Category, ID and Type, where Category occurs three times in the text representing ProductCategory, and ID and Type each occur once. The entire relational database is treated as a text collection made up of the segmented tokens, and the name information of each relational table is mapped to a vector by computing the weight of the table's tokens within the collection. The vector space model is then used to compute the angle between two such vectors, which gives the name similarity of the two relational tables;
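The name-similarity step can be illustrated with a short sketch. It makes two simplifying assumptions that the text does not confirm: identifiers are segmented on CamelCase boundaries, and raw token counts stand in for the token weights computed over the text collection. Similarity is taken as the cosine of the angle between the two vectors, the usual vector space model measure. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(identifier):
    """Split a CamelCase identifier, e.g. 'CategoryID' -> ['Category', 'ID']."""
    return re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", identifier)


def table_text(table_name, attribute_names):
    """Token counts for the text formed by a table name and its attribute names."""
    counts = Counter()
    for name in [table_name, *attribute_names]:
        counts.update(tokenize(name))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Attribute names as they appear in the worked example of this document.
category = table_text("ProductCategory", ["CategoryID", "CategoryType"])
product = table_text("Product", ["ProductID", "ProducName", "CategoryID"])
print(category["Category"])                 # 3, as stated for ProductCategory
print(cosine_similarity(product, category))
```

Replacing the raw counts with collection-level weights (for example TF-IDF) would only change `table_text`; the cosine computation stays the same.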

The pseudocode of the algorithm for finding the best-matching attribute pairs is as follows:

Algorithm 1: GreedyMatching, the algorithm for finding the best-matching attribute pairs

Input: relational table R, relational table S, and the precomputed set P of attribute-pair similarities between R and S

Output: the set Z of best-matching attribute pairs

Taking the Product and ProductCategory tables in Fig. 2 as an example, the attribute similarities between the two tables are computed first: J(Product.ProductID, ProductCategory.CategoryID) = 0.1, J(Product.ProducName, ProductCategory.CategoryType) = 0.05, J(Product.CategoryID, ProductCategory.CategoryID) = 0.8, and the similarity between all other attribute pairs is 0. Algorithm 1 finds the best-matching attribute pairs, which are J(Product.CategoryID, ProductCategory.CategoryID) and J(Product.ProducName, ProductCategory.CategoryType). The value similarity between the Product and ProductCategory tables is therefore Sim_v(Product, ProductCategory) = (0.8 + 0.05)/2 = 0.425.
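The worked example above can be reproduced with a short sketch of greedy matching. Since the body of Algorithm 1 is not reproduced in this text, the selection rule used here is an assumption: pairs are visited in decreasing order of similarity, and a pair is accepted only if neither of its attributes has already been matched. The function names and the input format are hypothetical.

```python
def jaccard(values_a, values_b):
    """Jaccard coefficient of two attribute columns, treated as value sets."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def greedy_matching(pair_sims):
    """pair_sims: {(attr_of_R, attr_of_S): similarity} -> list of (pair, sim)."""
    matched, used_r, used_s = [], set(), set()
    ranked = sorted(pair_sims.items(), key=lambda kv: kv[1], reverse=True)
    for (attr_r, attr_s), sim in ranked:
        # Skip zero-similarity pairs and attributes that are already matched.
        if sim <= 0 or attr_r in used_r or attr_s in used_s:
            continue
        matched.append(((attr_r, attr_s), sim))
        used_r.add(attr_r)
        used_s.add(attr_s)
    return matched


def value_similarity(pair_sims):
    """Average similarity over the best-matching attribute pairs."""
    matched = greedy_matching(pair_sims)
    return sum(sim for _, sim in matched) / len(matched) if matched else 0.0


# Attribute-pair similarities quoted in the example above.
pair_sims = {
    ("Product.ProductID", "ProductCategory.CategoryID"): 0.1,
    ("Product.ProducName", "ProductCategory.CategoryType"): 0.05,
    ("Product.CategoryID", "ProductCategory.CategoryID"): 0.8,
}
print(round(value_similarity(pair_sims), 3))  # 0.425
```

The pair with similarity 0.1 is rejected because ProductCategory.CategoryID is already taken by the 0.8 pair, which is why the average is over 0.8 and 0.05.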

Step 1.2.3. With respect to the number of join edges between tuples, the fan-out of a tuple is used to denote the number of distinct tuples that a given tuple can join with, and the mapping relationship similarity of the relational tables is expressed by defining a linear function proportional to the tuple fan-out.

Step 1.2.4. Finally, based on the three relational table similarity features above, a multiple linear regression model is used to weigh all the features together and compute the relational table similarity, which serves as the weight of the multi-label graph. The name similarity, value similarity and mapping relationship similarity between relational tables are first normalized so that the data fall within the range 0 to 1. The multiple linear regression model is then applied. The invention considers that the influence of the name factor, the value factor and the mapping relationship factor on relational table similarity decreases in that order, and therefore sets the parameters α, β, γ and δ of the algorithm to 6.4, 4.8, 2.0 and 0.2 respectively, such that Sim(R, S) ∈ [0, 1].

Step 2: Cluster the multi-label graph with the multi-label propagation algorithm to generate overlappable groups.

Step 2.1. Determine the parameter θ of the multi-label propagation algorithm, where θ is the maximum number of labels each node may carry. If the user specifies k as the final number of result classes in the schema summary, θ is tried at each value from k-1 to k+3.

Taking the example database of Fig. 2, when the specified number of result classes k is 2, θ is tried at 1, 2, 3, 4 and 5 in turn for multi-label propagation.

Step 2.2. Assign one unique label to each node in the label graph; the class identifier of the label is set to the name of the node's relational table, and its membership degree is set to 1.

Step 2.3. In each iteration, add the labels of a node's neighbors to that node's labels according to their membership degrees, then normalize so that the node's membership degrees sum to 1.

Step 2.4. To let nodes retain multiple labels without every node eventually holding all labels, the algorithm computes the membership degree of each label and deletes the labels that fall below a given threshold. The threshold here is 1/θ.

Step 2.5. After several rounds of iteration, stop when the number of nodes marked by the least-used class identifier no longer changes. The relational tables carrying labels with the same identifier are then placed in one overlappable group.

Step 2.6. Let θ take different values, repeat Steps 2.2 to 2.5, and select the set of overlappable groups with the largest intra-cluster similarity as the result of multi-label propagation.

Taking the example database of Fig. 2, when the specified number of result classes k is 2, multi-label propagation is run with θ set to 1, 2, 3, 4 and 5 in turn, giving 5 sets of results. Computation shows that the intra-cluster similarity of the result is largest when θ is 3. Fig. 4 shows the overlappable groups produced by multi-label propagation with θ = 3, where the overlap between group 1 and group 2 consists of the relational tables ZipCode and Order, and the overlap between group 1 and group 3 is the relational table Supply.
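Steps 2.1, 2.3 and 2.4 can be sketched as follows. This is an illustrative reading, not the patent's exact update rule, because the normalization formula of Definition 4 is not reproduced in this text. The sketch assumes that neighbor memberships are weighted by edge weight before normalization, that a node whose labels all fall below 1/θ keeps its strongest label, and that θ values below 1 are skipped. The names `theta_candidates`, `propagate_once` and the input format are hypothetical.

```python
from collections import defaultdict


def theta_candidates(k):
    """Values of theta to try for k result classes: k-1 .. k+3 (at least 1)."""
    return [t for t in range(k - 1, k + 4) if t >= 1]


def propagate_once(labels, weights, theta):
    """One synchronous propagation step.

    labels:  {node: {class_id: membership}}
    weights: {(u, v): weight} for undirected edges
    theta:   maximum number of labels a node may carry
    """
    neighbors = defaultdict(dict)
    for (u, v), w in weights.items():
        neighbors[u][v] = w
        neighbors[v][u] = w

    updated = {}
    for node in labels:
        # Collect neighbor labels, scaled by edge weight and membership.
        acc = defaultdict(float)
        for nb, w in neighbors[node].items():
            for class_id, b in labels[nb].items():
                acc[class_id] += w * b
        if not acc:  # isolated node: keep its current labels
            updated[node] = dict(labels[node])
            continue
        total = sum(acc.values())
        acc = {c: b / total for c, b in acc.items()}
        # Delete labels whose membership is below 1/theta.
        kept = {c: b for c, b in acc.items() if b >= 1.0 / theta}
        if not kept:  # assumption: fall back to the strongest label
            best = max(acc, key=acc.get)
            kept = {best: acc[best]}
        total = sum(kept.values())
        updated[node] = {c: b / total for c, b in kept.items()}
    return updated


print(theta_candidates(2))  # [1, 2, 3, 4, 5]
```

A full run would call `propagate_once` repeatedly until the stopping condition of Step 2.5 holds, once per candidate θ, and keep the grouping with the largest intra-cluster similarity.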

Step 3: Apply hierarchical clustering to the overlappable groups to generate the result classes.

Step 3.1. Compute the similarity between overlappable groups.

The pseudocode of the hierarchical clustering algorithm is as follows:

Algorithm 2: HierarchicalClustering, the hierarchical clustering algorithm

Input: the division into overlappable groups C = {C₁, C₂, ..., C_m}, and the number of result classes k

Output: the division into result classes C = {C₁, C₂, ..., C_k}

Algorithm 2 describes the execution flow of the hierarchical clustering algorithm. The algorithm first treats each overlappable group as a separate result class. In each iteration it finds the two classes with the greatest similarity and merges them, as shown in steps ② to ④ of the algorithm. The iteration continues until k result classes are reached.
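The merge loop of Algorithm 2 can be sketched as below. The group-to-group similarity of Definition 5 is not reproduced in this text, so `group_similarity` uses an assumed stand-in: the average pairwise table similarity between the two groups, with missing edges counted as 0. The table names and similarity values in the usage lines are invented for illustration.

```python
from itertools import combinations


def group_similarity(group_a, group_b, sim):
    """Assumed stand-in for Definition 5: mean pairwise table similarity."""
    pairs = [(a, b) for a in group_a for b in group_b if a != b]
    if not pairs:
        return 0.0
    # Tables without a connecting edge contribute similarity 0.
    return sum(sim.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def hierarchical_clustering(groups, k, sim):
    """Merge the two most similar classes until only k classes remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [set(g) for g in groups]
    while len(clusters) > k:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: group_similarity(clusters[ij[0]], clusters[ij[1]], sim),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical groups; table "B" belongs to two of them before merging.
sim = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("C", "D")): 0.1,
}
result = hierarchical_clustering([{"A", "B"}, {"B", "C"}, {"D"}], 2, sim)
print(sorted(sorted(c) for c in result))  # [['A', 'B', 'C'], ['D']]
```

Because merging takes the set union, a table shared by two groups that end up in different result classes stays in both, which is how overlap survives into the final summary.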

Taking the overlappable groups of Fig. 4 as an example, when the specified number of result classes k is 2, hierarchical clustering stops once two result classes are obtained. As shown in Fig. 5, the schema graph is then divided into 2 result classes, and the relational table Supply is the overlapping part of the two classes.

Step 4: Select a topic table for each result class and return the final schema summary to the user.

Step 4.1. Measure the importance of each relational table using the primary and foreign key information, attribute information and tuple information of the table. Table 1 lists the importance results computed for some of the relational tables in the example database.

Table 1. Importance results for relational tables in the example database

Rank	Relational table	Importance
1	Company	189.35
2	Order	183.28
3	Customer	116.54
4	Product	101.07

The pseudocode for computing relational table importance is as follows:

Algorithm 3: TableImportance, the method for computing relational table importance

Input: label graph G

Output: relational table importance vector I

The algorithm describes how the importance of relational tables is computed. First, the information content of each relational table is computed from the primary and foreign key information, attribute information and tuple information of the table, and the resulting information content is used as the initial value of the random walk. Next, the transition probabilities between relational tables are computed from the foreign key references between them, and information is repeatedly sent and received along the edges of the graph according to these transition probabilities until the random process converges to a stable distribution. Finally, the information value of each relational table in the stationary distribution is defined as the importance of that table.
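The two ingredients of this procedure can be sketched in a few lines. The formulas of Definitions 6 to 8 are not reproduced in this text, so the sketch makes its assumptions explicit: attribute entropy is taken to be Shannon entropy with a base-2 logarithm, and the initial information content and the row-stochastic transition probabilities are supplied by the caller rather than derived here. The function names and the two-table example are hypothetical.

```python
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy (base 2, an assumed choice) of one attribute column."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def stationary_importance(initial, transition, tol=1e-12, max_iter=10_000):
    """Repeat I <- I * P until the information distribution stops changing.

    initial:    {table: information content}
    transition: {source: {target: probability}}, each row summing to 1
    """
    current = dict(initial)
    for _ in range(max_iter):
        nxt = {table: 0.0 for table in current}
        for src, row in transition.items():
            for dst, p in row.items():
                nxt[dst] += current[src] * p
        if max(abs(nxt[t] - current[t]) for t in current) < tol:
            return nxt
        current = nxt
    return current


print(column_entropy(["a", "a", "b", "b"]))  # 1.0

# Hypothetical two-table schema with made-up transition probabilities.
importance = stationary_importance(
    {"R": 3.0, "S": 0.0},
    {"R": {"R": 0.5, "S": 0.5}, "S": {"R": 0.25, "S": 0.75}},
)
print({t: round(v, 6) for t, v in importance.items()})  # {'R': 1.0, 'S': 2.0}
```

Each step preserves the total amount of information because every row of the transition table sums to 1; the walk only redistributes it among the tables.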

Step 4.2. In each result class, select the relational table with the highest importance as the topic table of that class, and return the final schema summary to the user.

Taking the result classes of Fig. 5 as an example, the relational table with the highest importance in each result class is chosen as its topic table: the most important table in class 1 is Company, and the most important table in class 2 is Product. Fig. 6 shows the automatically generated overlappable schema summary graph, where Fig. 6(a) and (b) show the clustering result mapped back onto the relational database.
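Selecting the topic tables amounts to taking the most important table in each result class, as in the sketch below. The importance values are those listed in Table 1; the membership of the two classes is invented for illustration, since the contents of Fig. 5 are not reproduced in this text, and the function name is hypothetical.

```python
def choose_topic_tables(result_classes, importance):
    """Return {class name: most important table in that class}."""
    return {
        name: max(tables, key=lambda t: importance[t])
        for name, tables in result_classes.items()
    }


# Importance values from Table 1.
importance = {
    "Company": 189.35,
    "Order": 183.28,
    "Customer": 116.54,
    "Product": 101.07,
}
# Hypothetical class membership, not taken from Fig. 5.
result_classes = {
    "class 1": ["Company", "Order", "Customer"],
    "class 2": ["Product"],
}
print(choose_topic_tables(result_classes, importance))
# {'class 1': 'Company', 'class 2': 'Product'}
```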

Claims

1. A method for generating overlapping database schema summaries based on multi-label propagation, characterized in that the method comprises:

Step 1. Mapping the database schema to a weighted multi-label graph;

Step 1.1. Mapping the database schema to a multi-label graph.

Definition 1: A relational database schema can be mapped to a multi-label graph, represented by a triple G = (V, E, L_M), where:

③. L_M is a label mapping function that maps a node to one or more labels. A label is written (c, b), where c is a result class identifier and b is the label's membership degree, i.e., the strength with which relational table v belongs to result class c;

Step 1.2.1. Using the vector space model to compute the text similarity of the table names and attribute names of the relational tables, as the name similarity of the tables;

Step 1.2.2. Using the Jaccard coefficient to analyze the similarity of the values in the attribute columns of the relational tables, finding the best-matching attribute pairs with a greedy algorithm, and taking the average value similarity of the best-matching attribute pairs as the value similarity of the tables;

Definition 2: The mapping relationship similarity between relational tables R and S, denoted Sim_m(R, S), is defined as follows:

Where:

①. τ denotes all tuples of a relational table;

②. fan(τ_i) is the fan-out of tuple τ_i along the join edge e. Fan-out is defined in terms of the number of join edges between tuples and denotes the number of distinct tuples that a given tuple can join with;

③. q_i is the number of tuples in relational table R that satisfy fan(τ_i) > 0;

Step 1.2.4. Based on the three similarity features from Steps 1.2.1 to 1.2.3, computing the relational table similarity with a multiple linear regression model, and using this similarity as the weight of the multi-label graph;

Step 2.1. Determining the parameter θ of the multi-label propagation algorithm, where θ is the maximum number of labels each node may carry. If the user specifies k as the final number of result classes in the schema summary, θ is tried at each value from k-1 to k+3, and the θ that maximizes the intra-cluster similarity of the overlappable groups produced by multi-label propagation is finally chosen. The intra-cluster similarity is defined as follows:

Definition 3: Suppose multi-label propagation clusters the multi-label graph into overlappable groups C = {C₁, C₂, ..., C_m}. The intra-cluster similarity of the propagation result C is then:

Where:

①. Sim(v_i, v_j) is the similarity between relational tables v_i and v_j;

②. |C_i| denotes the number of relational tables in C_i;

Step 2.2. Assigning one unique label to each node in the label graph; the class identifier of the label is set to the name of the node's relational table, and its membership degree is set to 1;

Step 2.3. In each iteration, adding the labels of all neighbors of a node to that node's labels according to their membership degrees and the edge weights, then normalizing so that the node's membership degrees sum to 1.

Definition 4: The normalization function b_x(c, v_i) gives, for the x-th iteration, the mapping between the group identifier c and its membership degree b in the labels of relational table v_i:

Where:

①. N(v_i) is the set of all neighbor tables of relational table v_i;

②. the weight of edge (v_i, v_j);

Step 2.4. Deleting every label whose membership degree is less than 1/θ;

Step 2.5. Stopping the iteration when the number of nodes marked by the least-used class identifier no longer changes. Suppose m class identifiers remain when the iteration ends; the nodes carrying identifier c_m are assigned to group C_m, so that the multi-label graph is divided into m groups C = {C₁, C₂, ..., C_m} that may overlap;

Step 2.6. Letting θ take different values, repeating Steps 2.2 to 2.5, and selecting the set of overlappable groups with the largest intra-cluster similarity as the result of multi-label propagation;

Step 3.1. Computing the similarity between overlappable groups.

Definition 5: Let C_i and C_j be two overlappable groups obtained by multi-label propagation clustering. The similarity between C_i and C_j can be defined as:

Where Sim(v_i, v_j) denotes the similarity between relational tables v_i and v_j; if there is no edge between two tables, their similarity is 0;

Step 3.2. Treating each overlappable group as a separate class; in each iteration, merging the two classes with the greatest similarity, and stopping the iteration once the groups have been merged into the k result classes specified by the user;

Step 4.1. Computing the importance of the relational tables;

Step 4.1.1. Computing the information content of each relational table.

Where h denotes the number of distinct values on attribute A; if the values on attribute A can be expressed as a set of h distinct values R.A = {a₁, ..., a_h}, then p_i denotes the probability that a_i occurs;

Definition 7: The information content of relational table R is defined as:

Where |R| denotes the number of tuples in R;

Step 4.1.2. Computing the transition probabilities between relational tables.

Where:

Step 4.3. Applying a random walk model, using the information content of each relational table as the initial value of the random walk and the transition probabilities between relational tables as the transition probabilities of the walk; the distribution of information content when the model reaches its steady state gives the importance of the relational tables;

Step 4.4. In each result class, selecting the relational table with the highest importance as the topic table of that class, and returning the final schema summary to the user.