CN105975488B

CN105975488B - Keyword query method based on theme class cluster units in a relational database

Info

Publication number: CN105975488B
Application number: CN201610264735.4A
Authority: CN
Inventors: 王念滨; 周连科; 王红滨; 王瑛琦; 何鸣; 宋奎勇
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2016-04-25
Filing date: 2016-04-25
Publication date: 2019-06-18
Anticipated expiration: 2036-04-25
Also published as: CN105975488A

Abstract

A keyword query method based on theme class cluster units in a relational database, relating to the field of information retrieval and in particular to keyword query over relational databases based on theme class cluster units. The invention addresses two problems: existing online keyword query methods incur a huge time overhead from frequent table joins during the query, and existing offline keyword query methods have low query efficiency on large-scale databases with complex internal structure and huge data volume. The method proceeds according to the following steps: 1. constructing the theme class cluster units, comprising (1) vertical grouping based on data table characteristics and the query log, (2) optimizing the table join order within each theme class cluster, and (3) horizontal grouping based on the theme class cluster tuple association graph; 2. establishing an optimized index mechanism based on association rules; 3. returning the query results to the user. The invention is applied in the field of information retrieval.

Description

Keyword query method based on theme class cluster units in a relational database

Technical field

The present invention relates to the field of information retrieval, and in particular to a keyword query method based on theme class cluster units in a relational database.

Background technique

In recent years, keyword query has been applied successfully as an important query technique in the field of information retrieval, and its ease of use has won acceptance from more and more users. Relational databases likewise need a simple and effective query method for obtaining the information a user is interested in from large and complex databases. Traditional structured query methods such as SQL require the user not only to understand the complex underlying schema of the relational database but also to master the query language, which makes querying difficult and inconvenient. Keyword query techniques for relational databases have therefore received wide attention. Some known studies attempt to introduce traditional keyword query methods directly into relational databases. However, because a relational database must satisfy normalization requirements, its information is scattered across different data tables, and such a direct transfer does not give users a good query experience. A keyword query technique suited to relational databases therefore has to take the structural characteristics of the relational database itself into account.

Existing keyword query methods fall into two broad classes: online query and offline query. The main idea of online query is to model the relational database as a schema graph or a data graph; after the user submits a set of query keywords, the graph is traversed online and one or more subgraphs, candidate networks or Steiner trees are returned as the query result. Because table joins are performed continually during the query, methods of this class incur a high time cost. Offline query methods, by contrast, address this problem with data structures such as virtual documents or tuple units: before the user submits a query, the tables are joined by breadth-first traversal, which avoids the time overhead of frequent table joins during the query. However, these offline query methods do not consider query efficiency on large-scale relational databases. An enterprise database typically contains hundreds or thousands of data tables, so preprocessing with the above methods takes a considerable amount of time. In addition, because the resulting joined table is very large, even an index does not help the user find the desired query response within a short time.

To solve this problem, the present application proposes TCU-Based query, an offline query method based on theme class cluster units (TCUs). First, a data structure called the theme class cluster unit is constructed by grouping the data in the database twice, once vertically and once horizontally. Second, to further improve the efficiency of data preprocessing, the application designs a table join order optimization scheme based on a genetic algorithm. Finally, a subject index is built for each theme class cluster so that these indexes can work in parallel on machine nodes, which significantly increases query speed. After the user submits a query, one or more theme class cluster units are returned as the query response; they contain more complete information and match the user's query intent.

Summary of the invention

Theory related to this application

Keyword query techniques can be broadly divided into online query and offline query. The main idea of online query is as follows: before querying, a schema graph or data graph corresponding to the database is constructed; after the user submits a query, the graph is traversed online to discover the top-k candidate networks or Steiner trees, which are returned to the user as the query response. Offline query instead preprocesses the database offline, constructs virtual documents or tuple units, and returns the top-k query responses using information retrieval techniques. This avoids online table joins and graph traversal and thereby markedly improves query processing efficiency.

1. Online query

According to the number of databases a query involves, online query can be further divided into keyword query over a single relational database and keyword query over distributed databases.

Keyword query over a single relational database

In the DBXplorer and DISCOVER systems, the database is modeled as a schema graph G in which nodes represent relations and edges represent the primary-foreign key constraints between relations; the result of a keyword query is a set of candidate networks. Systems such as BANKS, BANKS-II and BLINKS instead use keyword query methods based on the data graph and search the underlying data graph directly for Steiner trees that contain the keywords. The BANKS system uses a backward expanding search algorithm, whose performance degrades severely when it encounters a node with a large in-degree. Building on it, BLINKS proposes a new search method, bidirectional search, which significantly improves search performance.

Keyword query over distributed databases

Known research on keyword query over distributed databases mainly follows the strategies below. The Kite system combines schema matching and structure discovery techniques to find the primary-foreign key joins between heterogeneous databases, and thereby solves the keyword query problem on heterogeneous relational databases. Hristidis, V. et al. build for each database DB_i a keyword relationship matrix KRM_i as a summary of that database. For each keyword pair (k_i, k_j), the entry KRM_i(k_i, k_j) records the frequency with which keywords k_i and k_j occur at different distances. However, the keyword relationship matrix prunes databases that cannot serve as query responses using only the binary relationships between terms. To overcome this deficiency, the G-KS method was proposed, which uses a keyword relationship graph to characterize the complex relationships among keywords. In the graph, nodes represent terms and edges represent the relationships between terms. The keyword relationship graph can therefore be used to compute the similarity between a database and a keyword query, so as to retrieve the most promising databases.

2. Offline query

The query methods above do not address the high time overhead caused by online table joins. In general, this problem can be alleviated by preprocessing the tables and tuples in the database. In recent years, offline query over relational databases has been studied in depth and preliminary solutions have been proposed. Feldman, P. et al. first proposed the concepts of the text object and the virtual document: the table joins in the database are completed before the user submits a query, which markedly improves query efficiency. Teorey, T.J. et al. extended the text object further by merging tuples that share the same attribute value into a data structure with more complete content, the tuple unit, and by returning multiple interconnected tuple units to the user as the query response, which effectively improves result precision. These methods consider only simple table joins and the group by operation on tuples, and are not suitable for databases whose internal table structure is complex. The method described in this application optimizes the table join order and defines a more reasonable data structure, the theme class cluster unit, and thereby markedly improves query efficiency and precision.

3. Vertical grouping

Given a database D(T₁, T₂, ..., T_l) consisting of l tables, vertical grouping means dividing the tables in database D, according to a reasonable partition strategy, into a set of theme class clusters C_D = {C₁, C₂, ..., C_k} such that: (1) each theme class cluster C_i ∈ C_D contains a group of data tables that are closely associated in structure and related in content, C_i = {T₁, T₂, ..., T_j}; (2) C_i ∩ C_j = ∅ for any two different theme class clusters C_i and C_j; (3) for all i,

4. Horizontal grouping

Given a data table T'(t₁, t₂, ..., t_n), where t_i denotes a tuple, horizontal grouping of the data table means assigning tuples with high mutual similarity to the same set Γ_i according to some similarity measure function. As shown in Fig. 1, after the horizontal grouping operation the n tuples in the table American football are divided into m units Γ₁, Γ₂, ..., Γ_m, where m ≤ n.

5. Theme class cluster tuple association graph

Given a table T'(t'₁, t'₂, ..., t'_n) in a theme class cluster, where t'_i denotes a theme class cluster tuple, the theme class cluster tuple association graph is a weighted undirected graph G = (V, E) in which vertex v_i ∈ V represents the theme class cluster tuple t'_i. If the similarity s_ij(t'_i, t'_j) between two tuples t'_i and t'_j is greater than 0, there is an edge e_ij ∈ E between nodes v_i and v_j, and the weight of edge e_ij is s_ij(t'_i, t'_j). Fig. 2 shows a theme class cluster tuple association graph for table T'.
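The definition above can be illustrated with a minimal sketch. The similarity function `shared_value_ratio` and the sample rows are hypothetical and serve only to show how edges with similarity greater than 0 become weighted edges of the graph; they are not the similarity measure of the application.

```python
from itertools import combinations


def build_association_graph(tuples, similarity):
    """Return {(i, j): weight} for every tuple pair whose similarity is > 0."""
    edges = {}
    for i, j in combinations(range(len(tuples)), 2):
        s = similarity(tuples[i], tuples[j])
        if s > 0:  # an edge e_ij exists only when s_ij > 0
            edges[(i, j)] = s
    return edges


def shared_value_ratio(a, b):
    """Hypothetical similarity: fraction of attribute positions with equal values."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Made-up rows of a consolidated table T'
rows = [
    ("Dolphins", "Miami", "AFC"),
    ("Jets", "New York", "AFC"),
    ("Lakers", "Los Angeles", "NBA"),
]
graph = build_association_graph(rows, shared_value_ratio)
```

Only the first two rows share a value, so the graph has a single edge between vertices 0 and 1; the third row stays an isolated vertex.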

6. Theme class cluster unit

Given a database D(T₁, T₂, ..., T_l) containing l interconnected data tables, the tables are first grouped vertically to obtain k theme class clusters C = {C₁, C₂, ..., C_k}. Next, the tables in each theme class cluster C_i are joined according to their primary-foreign key relationships to obtain a consolidated table T'_i(t₁, t₂, ..., t_n). Finally, the n tuples in the consolidated table T'_i(t₁, t₂, ..., t_n) are grouped horizontally to obtain Γ₁, Γ₂, ..., Γ_m, where each Γ_i is called a theme class cluster unit. The theme class cluster units Γ₁, Γ₂, ..., Γ_m of the theme class cluster American football are shown in Fig. 1.

7. Top-k keyword query

Given a keyword query Q = {k₁, k₂, ..., k_m} and a database D(T₁, T₂, ..., T_l) containing l data tables, a top-k keyword query returns the k theme class cluster units with the highest relevance scores as the query result.
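As a rough illustration of top-k selection, the sketch below ranks units by a score and keeps the k best. The scoring function `relevance` (a plain keyword occurrence count), the unit identifiers and the unit texts are all assumptions made for this example; the application does not specify the relevance score here.

```python
import heapq


def relevance(unit_text, keywords):
    """Hypothetical score: number of keyword occurrences in the unit's text."""
    words = unit_text.lower().split()
    return sum(words.count(k.lower()) for k in keywords)


def top_k(units, keywords, k):
    """Return the ids of the k units with the highest positive relevance score."""
    scored = ((relevance(text, keywords), uid) for uid, text in units.items())
    best = heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))
    return [uid for _, uid in best]


# Made-up theme class cluster units, flattened to text
units = {
    "TCU1": "miami dolphins football team",
    "TCU2": "football league football season",
    "TCU3": "basketball league",
}
result = top_k(units, ["football", "league"], 2)
```

`heapq.nlargest` avoids sorting the whole candidate list, which matters when the number of units is much larger than k.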

8. Overall framework of the application

Most existing keyword query methods must perform table joins continually after the user submits a specific query, and therefore incur a large time overhead. The concept of offline query was proposed to solve this problem, but the existing offline methods do not consider queries on relational databases with complex internal structure. The main idea of this application is as follows: the data tables are grouped twice, first vertically and then horizontally, to construct a set of theme class cluster units; an optimal table join order selection scheme based on a genetic algorithm is designed to reduce the cost of data preprocessing; finally, an association rule algorithm is used to build a subject index for each theme class cluster, which markedly improves query efficiency and accuracy on relational databases.

Fig. 3 shows the architecture of the query method of this application, which is divided into an online part and an offline part.

Online query part

The user submits a set of query keywords to the query processor online. Using the subject index, the query processor returns the theme class cluster units that contain one or more of the keywords to the user as the query response.

Offline data preprocessing part

This part consists of four modules: the vertical grouping module, the table join order optimization module, the horizontal grouping module and the subject index construction module.

(1) Vertical grouping module

This module takes the relational database and the user query log as input. It applies an optimal graph partitioning strategy, an improved spectral clustering algorithm, and combines it with the characteristics of the relational database itself, considering both the content information of the tables and the user feedback information from the query log, to group the data tables vertically into a set of theme class clusters. Each theme class cluster contains a group of data tables that are not only closely associated in structure and similar in content but also have a high co-occurrence frequency in the users' query log.

(2) Table join order optimization module

For each theme class cluster produced by the vertical grouping module, the data tables in it are joined according to their primary-foreign key relationships. Because a large database contains a very large number of tables, the table joins take a relatively long time and therefore need to be optimized. This module uses a genetic algorithm to select the optimal table join order, which substantially reduces the cost of the table joins.

(3) Horizontal grouping module

This module takes the set of theme class clusters produced by the modules above as input. It computes the mixed similarity between the tuples in each theme class cluster and builds a tuple association graph for each theme class cluster. A hierarchical clustering algorithm is then used to group each theme class cluster horizontally, which yields a set of theme class cluster units.

(4) Subject index construction module

This module takes the set of theme class cluster units produced by the horizontal grouping module as input, selects index terms using association rules, and then builds a subject index for each theme class cluster.

This application aims to solve two problems: existing online keyword query methods incur a huge time overhead from frequent table joins during the query, and existing offline keyword query methods have low query efficiency on large-scale databases with complex internal structure and huge data volume. To this end, it proposes a keyword query method based on theme class cluster units in a relational database.

A keyword query method based on theme class cluster units in a relational database proceeds according to the following steps:

Step 1: construct the theme class cluster units;

Step 1.1: perform vertical grouping based on data table characteristics and the query log;

Step 1.2: optimize the table join order within each theme class cluster;

Step 1.3: perform horizontal grouping based on the theme class cluster tuple association graph;

Step 2: establish an optimized index mechanism based on association rules;

Step 3: return the query results to the user.

The present invention has the following beneficial effects:

1. An offline query method based on theme class cluster units is proposed, which is suitable for keyword query on large-scale relational databases;

2. A new data structure, the theme class cluster unit (TCU), is constructed. An improved spectral clustering algorithm and the theme class cluster tuple association graph are used to group the data tables vertically and the tuples horizontally, respectively. The TCU set is built offline and used as the query response, which not only greatly reduces query response time but also returns richer and more complete thematic semantic information;

3. A table join order optimization scheme based on a genetic algorithm is designed, which reduces the time overhead of preprocessing;

4. Index terms are selected using an association rule algorithm and an index is then built for each theme class cluster, which significantly increases query speed.

Brief description of the drawings

Fig. 1 is a schematic diagram of the horizontal grouping of the theme class cluster American football;

Fig. 2 is a theme class cluster tuple association graph;

Fig. 3 is the architecture diagram of the query based on theme class cluster units;

Fig. 4 is the architecture diagram of the vertical grouping method;

Fig. 5 is the flow chart of the table join order optimization scheme based on a genetic algorithm;

Fig. 6 is a diagram of the correspondence between a join tree and an integer sequence;

Fig. 7 is a comparison of preprocessing times;

Fig. 8 is a comparison of average response times for different numbers of keywords;

Fig. 9 is a comparison of average response times for different values of K;

Fig. 10 shows the relationship between the number of keywords and the average precision;

Fig. 11 shows the relationship between the number of keywords and the average recall;

Fig. 12 shows the influence of data set size on query performance.

Specific embodiment

To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to Fig. 4 to Fig. 6 and specific embodiments.

Embodiment 1: The keyword query method based on theme class cluster units in a relational database according to this embodiment proceeds according to the following steps:

Step 1: construct the theme class cluster units;

Step 2: establish an optimized index mechanism based on association rules;

Step 3: return the query results to the user.

This embodiment has the following beneficial effects:

Embodiment 2: This embodiment further describes the keyword query method based on theme class cluster units in a relational database according to Embodiment 1. The vertical grouping based on data table characteristics and the query log in Step 1.1 proceeds as follows.

The application uses an inter-table similarity matrix construction method that builds the initial input matrix from two sources: the table characteristics, comprising the topological closeness between tables and the content similarity between tables, and the query log. The vertical grouping method takes the relational database D and the user query log as input and produces a set of theme class clusters as output; the process is shown in Fig. 4. The method is divided into three main modules: an input module, a similarity matrix construction module and an output module.

Input module: the relational database and its schema graph serve as system input and describe the content information and the structural information of the database, respectively. The query log also serves as system input and indirectly reflects the distribution of information in the database.

Similarity matrix construction module: the schema graph and the database from the input module are analyzed to obtain the topological closeness and the content similarity between data tables; a topological closeness matrix and a content similarity matrix are constructed from them, and the inter-table similarity matrix is built on that basis. In addition, statistical analysis of the query log is used to further reinforce and correct this matrix.

Output module: a set of theme class clusters is output as the result.

(1) Topological closeness between tables

Given the schema graph G = (V, E) of a database, the topological closeness between node v_i and node v_j is defined by formula (1):

Here |v_i| is the size of table T_i in the database and |v_j| is the size of table T_j; σ is the impact factor: the larger σ is, the stronger the interaction between nodes, and the smaller σ is, the weaker the interaction. The logical distance between node v_i and node v_j is the length of the path between them in the database schema graph. By the mathematical properties of the Gaussian function, for a given value of σ the influence of each node is approximately confined to a local region whose extent is determined by σ; when the logical distance between two nodes exceeds this extent, the topological closeness between them decays rapidly to 0.
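The logical distance used above is a path length in the schema graph. A minimal sketch of how it might be computed is given here, assuming the schema graph is stored as an adjacency list and that the shortest path is meant; the table names are invented for illustration, and the closeness formula itself is not reproduced.

```python
from collections import deque


def logical_distance(schema, source, target):
    """Shortest path length between two tables in an undirected schema graph.

    Returns None when the tables are not connected.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in schema.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical schema graph: edges are primary-foreign key links
schema = {
    "team": ["player", "stadium"],
    "player": ["team", "contract"],
    "stadium": ["team"],
    "contract": ["player"],
}
d = logical_distance(schema, "stadium", "contract")
```

Breadth-first search is sufficient because every edge of the schema graph counts as one step.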

The topological closeness between any two nodes is computed by formula (1), and the topological closeness matrix of the relational database is then constructed as follows:

(2) Content similarity between tables

A data table consists of a table name, attributes and tuples, so the content similarity between tables can be analyzed from two aspects: naming similarity and value similarity.

Naming similarity comprises table name similarity and attribute name similarity. The application computes the similarity between two entities in a vector space: the keywords in the table name and attribute names of table T_i are extracted to construct a vector V_i for T_i, the keywords in the table name and attribute names of table T_j are extracted to construct a vector V_j for T_j, and the naming similarity is computed with the cosine function:

Sim₁(T_i, T_j) = Sim(V_i, V_j) = V_i·V_j / (|V_i|·|V_j|) (2)

where Sim₁(T_i, T_j) is the naming similarity between table T_i and table T_j;

Sim(V_i, V_j) is the similarity between vector V_i and vector V_j;

|V_i| and |V_j| are the magnitudes of vector V_i and vector V_j, respectively.
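Formula (2) can be sketched as follows. How keywords are extracted from names is not specified in the text, so the tokenizer in `name_vector` (splitting on non-alphanumeric characters and counting terms) is an assumption, as are the example table and attribute names.

```python
import math
import re
from collections import Counter


def name_vector(table_name, attribute_names):
    """Assumed keyword extraction: lower-cased tokens of the table and attribute names."""
    tokens = []
    for name in [table_name, *attribute_names]:
        tokens.extend(t.lower() for t in re.split(r"[^A-Za-z0-9]+", name) if t)
    return Counter(tokens)


def naming_similarity(v_i, v_j):
    """Formula (2): cosine of the angle between the two keyword vectors."""
    dot = sum(v_i[t] * v_j[t] for t in v_i.keys() & v_j.keys())
    norm_i = math.sqrt(sum(c * c for c in v_i.values()))
    norm_j = math.sqrt(sum(c * c for c in v_j.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0


# Hypothetical tables
v_team = name_vector("team", ["team_id", "team_name"])
v_player = name_vector("player", ["player_id", "team_id"])
sim1 = naming_similarity(v_team, v_player)
```

Here the two vectors share the terms `team` and `id`, so the similarity is strictly between 0 and 1; identical vectors give 1.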

The value similarity is computed as follows:

1. Compute the content similarity between two attributes using the Jaccard coefficient:

J(u, v) = |u ∩ v| / |u ∪ v| (3)

where u is an attribute column in data table T_i and v is an attribute column in data table T_j;

2. Use a greedy matching strategy to select the set Z of matching attribute pairs in the database;

3. Take a weighted average to obtain the value similarity between the two tables:

Here |T_i| is the number of attribute columns in data table T_i, |T_j| is the number of attribute columns in data table T_j, and max(|T_i|, |T_j|) is the larger of the two. u.V is the coefficient of variation of attribute column u and v.V is the coefficient of variation of attribute column v; the coefficient of variation is a statistic that measures the degree of variation of the observations in a table. The smaller the coefficient of variation, the less rich the content of the attribute column; the larger the coefficient of variation, the richer the content. The coefficient of variation of an attribute column is obtained from the standard deviation and the mean of that column, and max(u.V, v.V) is the larger of u.V and v.V.
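Steps 1 and 2 above can be sketched as follows. The exact greedy rule is not spelled out in the text, so this sketch assumes that attribute pairs are taken in descending order of their Jaccard score and that each column is matched at most once; the weighted averaging of step 3 is not reproduced. Table contents are invented.

```python
def jaccard(u, v):
    """Formula (3): |u ∩ v| / |u ∪ v| on the sets of column values."""
    u, v = set(u), set(v)
    union = u | v
    return len(u & v) / len(union) if union else 0.0


def greedy_match(table_i, table_j):
    """Assumed greedy strategy: best-scoring pairs first, each column used once."""
    scored = sorted(
        ((jaccard(cu, cv), nu, nv)
         for nu, cu in table_i.items()
         for nv, cv in table_j.items()),
        reverse=True,
    )
    used_i, used_j, pairs = set(), set(), []
    for score, nu, nv in scored:
        if score > 0 and nu not in used_i and nv not in used_j:
            pairs.append((nu, nv, score))
            used_i.add(nu)
            used_j.add(nv)
    return pairs


# Hypothetical tables given as {column name: list of values}
team = {"team_id": [1, 2, 3], "city": ["Miami", "Boston", "Dallas"]}
player = {"team_id": [1, 2, 4], "home_city": ["Miami", "Austin"]}
Z = greedy_match(team, player)
```

With these values the two `team_id` columns share two of four distinct values and the city columns share one of four, so both pairs are matched.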

In conclusion tables of data T_iWith tables of data T_jBetween content similarities are as follows: Sim (T_i,T_j)=(Sim₁(T_i,T_j)+ Sim₂(T_i,T_j))/2；

Content similarities matrix S between tables of data are as follows:

Wherein l is the number of tables of data in database；

Comprehensively consider the similitude in structure and content, obtains similarity matrix between table: A_DB=T+S；

Wherein T is the topological compactness matrix of relational database；

(3) Similarity matrix correction method

The query log records the history of user searches on the database and contains three fields: the user ID, the query Q, and the query results together with the data tables T in which the results are located. The basic idea of vertical grouping with user feedback is to analyze the query records in the query log statistically and to correct the similarity matrix with the following boost function:

boost_log(T_i, T_j) = exp(log(count(T_i, T_j)) / log(max(count))) (5)

count(T_i, T_j) is the number of times table T_i and table T_j co-occur in the query log, and max(count) is the maximum co-occurrence count of any two tables in the query log. Formula (5) shows that the more often two tables co-occur in the query log, the more their closeness score is reinforced.

The vertical grouping result above is reinforced with the information in the user query log using the following score reinforcement function:

A_Final(T_i, T_j) = A_DB(T_i, T_j) × boost_log(T_i, T_j) (6)

This yields the final inter-table similarity matrix A_Final of size l × l, where l is the number of data tables in the database;
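Formulas (5) and (6) translate almost directly into code. Formula (5) is undefined when two tables never co-occur or when the maximum count is 1; the sketch below assumes a neutral boost of 1.0 in those cases, which is a choice made for this example and not stated in the text. The similarity values and counts are invented.

```python
import math


def boost_log(count_ij, max_count):
    """Formula (5): exp(log(count) / log(max count))."""
    if count_ij < 1 or max_count <= 1:
        return 1.0  # assumption: no reinforcement when the formula is undefined
    return math.exp(math.log(count_ij) / math.log(max_count))


def reinforce(a_db, counts):
    """Formula (6): A_Final = A_DB × boost_log, applied entry by entry."""
    max_count = max(counts.values(), default=0)
    return {
        pair: a * boost_log(counts.get(pair, 0), max_count)
        for pair, a in a_db.items()
    }


# Hypothetical similarity entries and query-log co-occurrence counts
a_db = {("team", "player"): 0.8, ("team", "stadium"): 0.5, ("player", "stadium"): 0.1}
counts = {("team", "player"): 100, ("team", "stadium"): 10}
a_final = reinforce(a_db, counts)
```

The boost ranges from 1 (a single co-occurrence) to e (the most frequently co-occurring pair), so the log never lowers a similarity score.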

(4) Vertical grouping

The vertical grouping in this application is based on the improved spectral clustering algorithm:

Input: G = (V, E), k and the impact factor σ, where V = {v₁, ..., v_l} and |E| = m;

Output: the set of theme class clusters C = {C₁, C₂, ..., C_k};

Steps:

1. Construct the inter-table similarity matrix A_Final;

2. Compute the eigenvalues and eigenvectors, and use the first k eigenvectors u₁, ..., u_k to construct the eigenvector space R^k;

3. Map all nodes in V into the space R^k;

4. Use the k-means algorithm to cluster the nodes in R^k into the theme class clusters C₁, C₂, ..., C_k.

Embodiment 3: This embodiment further describes the keyword query method based on theme class cluster units in a relational database according to Embodiment 1 or 2. The table join order optimization within a theme class cluster in Step 1.2 proceeds as follows:

To avoid complex table joins during the query, the n tables T₁, T₂, ..., T_n in theme class cluster C_i = (T₁, T₂, ..., T_n) must be joined during data preprocessing to obtain the consolidated table T'_i. Existing methods join the tables only by breadth-first traversal along the primary-foreign key relationships. A large database typically contains hundreds or thousands of data tables, so these methods incur a large time overhead and preprocessing efficiency suffers greatly. To address this problem, the present invention designs a table join order optimization scheme based on a genetic algorithm, as shown in Fig. 5.

When a genetic algorithm is used to optimize the table join order, the most critical step is encoding the data tables. The different table join orders are expressed as join trees; to retain the characteristic information of a join tree completely, the tree is encoded by its pre-order traversal, as shown in Fig. 6:

After encoding, poplength join trees are selected at random as the initial population, and genetic operations are applied to the first-generation population to generate genlength new individuals. The genetic operations transform the population with two operators, crossover and mutation. Crossover operator: randomly chosen subtrees of the same size are exchanged, and any data table that is duplicated on a join tree after crossover is replaced by a data table that does not yet appear on it. Mutation operator: the data tables at arbitrary non-zero positions of the join tree encoding are exchanged. The cost of each join tree is then computed with a well-known join tree cost formula, and the poplength join trees with the lowest cost are selected as the next-generation population. These genetic operations and selection are repeated until a predetermined number of iterations is reached; a reasonable value for this number is obtained by analyzing and summarizing a large number of experiments. The lowest-cost join tree in the last generation of the evolution is the optimal table join order obtained by the genetic algorithm.
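The loop described above can be sketched in simplified form. This is not the encoding of the application: instead of pre-order encoded join trees, the sketch represents a join order as a permutation of table names (a left-deep join), uses order crossover in place of subtree exchange, and uses a made-up cost function `join_cost` with a uniform selectivity. The parameter names poplength and genlength follow the text; the mutation rate, the table sizes and the choice to seed the population with the declaration order are assumptions.

```python
import random


def join_cost(order, sizes, selectivity=0.01):
    """Hypothetical cost: total size of the intermediate results of a left-deep join."""
    cost, current = 0.0, float(sizes[order[0]])
    for table in order[1:]:
        current = current * sizes[table] * selectivity
        cost += current
    return cost


def crossover(a, b, rng):
    """Order crossover: keep a slice of a, fill the rest in b's order, no duplicates."""
    lo, hi = sorted(rng.sample(range(len(a) + 1), 2))
    kept = a[lo:hi]
    rest = [t for t in b if t not in kept]
    return rest[:lo] + kept + rest[lo:]


def mutate(order, rng):
    """Swap the tables at two randomly chosen positions."""
    i, j = rng.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return child


def optimise_join_order(sizes, poplength=20, genlength=40, iterations=50, seed=0):
    rng = random.Random(seed)
    tables = list(sizes)
    # Seed with the declaration order so the result is never worse than it.
    population = [tables] + [rng.sample(tables, len(tables)) for _ in range(poplength - 1)]
    for _ in range(iterations):
        children = []
        for _ in range(genlength):
            a, b = rng.sample(population, 2)
            child = crossover(a, b, rng)
            if rng.random() < 0.3:  # assumed mutation rate
                child = mutate(child, rng)
            children.append(child)
        # Selection: keep the poplength cheapest individuals.
        ranked = sorted(population + children, key=lambda o: join_cost(o, sizes))
        population = ranked[:poplength]
    return population[0]


# Hypothetical table sizes (row counts)
sizes = {"team": 30, "player": 1500, "contract": 4000, "stadium": 25, "game": 8000}
best = optimise_join_order(sizes)
```

Because selection keeps the cheapest individuals from parents and children together, the best cost never increases from one generation to the next.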

Specific embodiment four, present embodiment are to a kind of relation data described in one of specific embodiment one to three The further explanation of keyword query method in library based on theme class cluster unit, based on theme class cluster member described in step 1 tri- The detailed process of the horizontal grouping of group associated diagram are as follows:

After vertical grouping and theme class cluster table attended operation, obtains theme class gathering and close C=(C₁,C₂,..., C_k), wherein each theme class cluster C_iIt include a consolidated statement T '_i, in general, user wishes to integrate multiple related tuples Theme class cluster is made further to be grouped and can effectively mentioned as response result, therefore using a kind of reasonable horizontal group technology High inquiry velocity.Existing level group technology simply uses the operation of the group by database to carry out, and may result in category Property the different but tuple with higher similitude of value be assigned in different groupings, not enough closed so as to cause horizontal group result Reason.The present invention carries out horizontal grouping to theme class cluster using theme class cluster tuple associated diagram, makes while improving search efficiency The grouping of data more meets user query demand.In addition, making the similitude between tuple using a kind of mixing similarity calculation method Calculated result is more accurate and has good scalability.

Theme class cluster tuple mixes Similarity measures

An important data model is exactly theme class cluster tuple associated diagram G=(V, E) in horizontal grouping strategy, it is one A weighted undirected graph, weight of the similarity as side between theme class cluster tuple, in order to improve similitude between theme class cluster tuple The accuracy rate and scalability of calculating, this trifle propose a kind of mixing similarity calculation method.

Let t′_i and t′_j be any two tuples in theme class cluster C_i. The hybrid similarity of the two tuples is calculated as follows:

1. Define different distance functions d_k, including Euclidean distance, edit distance, and Hamming distance;

2. Map the theme class cluster tuples into an n-dimensional space, and compute the distance d_k(t′_i, t′_j) between the two tuples under each distance function;

3. Compute the hybrid similarity between the two tuples according to the following formula.

Where Sim_k(t′_i, t′_j) is the similarity between tuples t′_i and t′_j when the distance function is d_k;

the maximum term denotes the maximum value of the similarity between tuples t′_i and t′_j when the distance function is d_k, k ∈ {1, 2, 3};

Sim(t′_i, t′_j) is the hybrid similarity between the two tuples t′_i and t′_j.
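The three-step calculation can be illustrated with a short Python sketch. Formula (7) is not reproduced in this text, so the combination rule used here is an assumption made only for illustration: each distance d_k is converted to a similarity Sim_k = 1/(1 + d_k), each Sim_k is divided by the largest Sim_k observed over all tuple pairs, and the three normalized values are averaged. All function names, and the choice of which fields each distance is applied to, are likewise hypothetical.

```python
from itertools import combinations
from math import sqrt


def euclidean(a, b):
    """Euclidean distance over the numeric fields of two tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)
                    if isinstance(x, (int, float)) and isinstance(y, (int, float))))


def edit_distance(s, t):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def text_edit(a, b):
    """Edit distance over the concatenated string fields of two tuples."""
    join = lambda t: " ".join(v for v in t if isinstance(v, str))
    return edit_distance(join(a), join(b))


def hamming(a, b):
    """Number of field positions at which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))


DISTANCES = (euclidean, text_edit, hamming)


def hybrid_similarity(tuples):
    """Return {(i, j): Sim} for every pair of tuples (assumed combination rule)."""
    pairs = list(combinations(range(len(tuples)), 2))
    # Sim_k = 1 / (1 + d_k): an assumed distance-to-similarity conversion.
    sims = [{p: 1.0 / (1.0 + d(tuples[p[0]], tuples[p[1]])) for p in pairs}
            for d in DISTANCES]
    maxima = [max(s.values()) for s in sims]  # largest Sim_k over all pairs
    return {p: sum(s[p] / m for s, m in zip(sims, maxima)) / len(DISTANCES)
            for p in pairs}
```

The returned dictionary can serve directly as the edge weights of the tuple association graph G = (V, E), with tuple positions as nodes.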

Steps of horizontal grouping:

1. Calculate the hybrid similarity between every pair of theme class cluster tuples in theme class cluster C_i, and build the theme class cluster tuple association graph;

2. According to the similarity threshold φ, divide the association graph into several connected components;

3. Find the theme class cluster units whose number of theme class cluster tuples is greater than Minsize and calculate their cohesion; select the theme class cluster unit with the smallest cohesion and disconnect the edge with the smallest similarity in that unit;

4. Repeat step 3 until the number of tuples contained in every theme class cluster unit is less than Minsize;

5. Calculate the separation between theme class cluster units, and merge the theme class cluster units with the smallest separation;

6. Repeat step 5 until the number of theme class cluster units |C_i| in theme class cluster C_i reaches the required value.

The cohesion and separation of theme class cluster units used in step 3 and step 5 are calculated by the following formulas:

Where Sim(t′_i, t′_j) is the hybrid similarity between theme class cluster tuples t′_i and t′_j, and TCU_k and TCU_l denote two different theme class cluster units;

Formulas (8) and (9) involve three parameters: the similarity threshold φ, the final number of theme class cluster units k, and Minsize. The first two parameters are set according to user requirements, while Minsize is set to 1% to 3% of the total data volume.
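Step 2 of the horizontal grouping can be sketched in Python as follows. This is a minimal illustration only: it assumes that an edge is kept when its similarity is at least φ (the text does not say whether the comparison is strict), it uses a union-find structure to collect connected components, and the function name and the input format are hypothetical. Steps 3 to 6 are not sketched because formulas (8) and (9) are not reproduced in this text.

```python
def connected_components(n, sim, phi):
    """Split tuples 0..n-1 into components, keeping only edges with Sim >= phi.

    sim maps a pair (i, j) to the hybrid similarity of tuples i and j.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), s in sim.items():
        if s >= phi:  # assumed threshold rule: keep sufficiently similar pairs
            parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())
```

Each returned list is an initial theme class cluster unit candidate; a larger φ yields more, smaller components.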

Specific embodiment five: this embodiment further describes the keyword query method based on theme class cluster units in a relational database according to any one of specific embodiments one to four. The detailed process of establishing the optimized index mechanism based on association rules in step 2 is as follows:

To speed up queries, an index must be built for the database. Traditional methods build inverted index entry lists only for single keywords, which considerably limits index efficiency in multi-keyword queries. To solve this problem, the present application proposes a multi-keyword index construction mechanism based on theme class cluster units.

Within each theme class cluster, an association rule algorithm is used to find frequent itemsets. Each frequent itemset corresponds to one inverted index entry list, which contains all theme class cluster units in which the keyword or keyword combination directly occurs. The index structure is as follows:

Keyword(s) → (TCU₁, TCU₂, ...) (10)

The above subject index based on theme class cluster units is generated using the Lucene toolkit.
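The index structure (10) can be illustrated with a small in-memory Python sketch. It is not the Lucene-based implementation: a plain dictionary stands in for the Lucene index, and brute-force enumeration of keyword combinations stands in for an association rule miner such as Apriori. The function name, the input format, and the parameters min_support and max_len are assumptions introduced only for this illustration.

```python
from itertools import combinations


def build_theme_index(tcus, min_support=2, max_len=3):
    """Map each frequent keyword set to the ids of the TCUs containing it.

    tcus: {tcu_id: set of keywords occurring in that theme class cluster unit}.
    """
    index = {}
    for tcu_id, words in tcus.items():
        for size in range(1, max_len + 1):
            for combo in combinations(sorted(words), size):
                index.setdefault(frozenset(combo), []).append(tcu_id)
    # Keep only keyword sets that occur in at least min_support units.
    return {k: v for k, v in index.items() if len(v) >= min_support}
```

For three units containing {film, director, award}, {film, director}, and {film, actor}, the keyword set {film, director} maps to the first two units, while {actor} is dropped because it occurs in only one unit.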

Specific embodiment six: this embodiment further describes the keyword query method based on theme class cluster units in a relational database according to any one of specific embodiments one to five. The detailed process of returning the query results to the user in step 3 is as follows:

Using the subject index based on theme class cluster units, the theme class cluster units containing the keyword (or keyword group) can be returned to the user efficiently as query results. Given a keyword query K = {k₁, k₂, ..., k_n}, the system searches multiple subject indexes in parallel to find the corresponding theme class cluster units, computes the score of each relevant theme class cluster unit with an existing result ranking function, sorts the units in descending order of score, and returns the top-k theme class cluster units to the user.
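The query step can be sketched in Python as follows. The text does not specify the ranking function, so the score used here (the size of the largest indexed keyword set that is fully covered by the query) is a stand-in chosen only for illustration. The function names and the dictionary representation of each subject index are hypothetical, and a thread pool is used as one possible way to search the indexes in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def search_index(index, query):
    """Score TCUs in one subject index by the largest matched keyword set."""
    scores = {}
    for keywords, tcu_ids in index.items():
        if keywords <= query:  # index entry fully covered by the query
            for tcu in tcu_ids:
                scores[tcu] = max(scores.get(tcu, 0), len(keywords))
    return scores


def top_k(indexes, query, k):
    """Search all subject indexes in parallel and return the k best TCUs."""
    query = frozenset(query)
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda idx: search_index(idx, query), indexes))
    merged = {}
    for scores in partial:
        for tcu, s in scores.items():
            merged[tcu] = max(merged.get(tcu, 0), s)
    # Descending by score; ties broken by TCU id for a stable result.
    return sorted(merged, key=lambda t: (-merged[t], t))[:k]
```

Because a multi-keyword entry is matched directly, a query such as {film, director} finds the units indexed under that combination without intersecting two single-keyword lists.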

To verify the beneficial effects of the present invention, the following simulation experiments were carried out:

This experiment uses the public dataset Freebase. Freebase is an open structured database that is large in scale and has a certain structural complexity, containing about 2000 tables and 39,000,000 entities. Because of limits on experimental conditions, the dataset was simplified without affecting the experimental results: keeping the underlying database schema and the join relationships between data tables unchanged, part of the data (400 MB) was extracted from the Freebase database as the dataset for this experiment. To verify the effectiveness and performance of the TCU-Based query method proposed in the present application, the following three groups of experiments were carried out. First, the effectiveness of the table join order optimization scheme is demonstrated by comparing data preprocessing times. Second, the proposed TCU-Based query method is compared with the baseline methods DBXplorer, BLANKS, and SAINT to verify its efficiency and accuracy. Finally, the scalability of the method is verified by comparing query response times on datasets of different scales.

The algorithms were run in a Java environment under the Microsoft Windows 7 operating system, on a machine with a 2.5 GHz Core(TM) CPU, 4 GB of memory, and a 500 GB hard disk.

Evaluation of the table join order optimization scheme

The data preprocessing time before and after table join order optimization was measured, and a comparative analysis on databases of different scales was used to verify the effectiveness of the genetic-algorithm-based table join order optimization scheme proposed in the present application. The experimental results are shown in Figure 7. The abscissa indicates the number of data tables involved in data preprocessing, and the ordinate indicates the data preprocessing time. The results show that using the table join order obtained by the genetic algorithm significantly improves preprocessing efficiency, and that the effect becomes more pronounced as the number of tables increases.

Method comparison

The performance of the proposed query method TCU-Based was evaluated by comparison with the baseline query methods, in terms of both query efficiency and result accuracy. One hundred keyword queries were selected at random from the database query log and executed on the Freebase dataset using each of the four methods DBXplorer, BLANKS, SAINT, and TCU-Based, and the query response time and result quality of each method were analyzed.

(1) Query efficiency

First, the average top-2 query response times of the four methods were compared for different numbers of query keywords. The query response time runs from the start of query execution to the generation of the top-2 query responses and does not include offline data preprocessing time. In Figure 8, the abscissa indicates the number of keywords and the ordinate indicates the average query response time. As the figure shows, the offline query methods SAINT and TCU-Based are clearly more efficient than the online query methods DBXplorer and BLANKS. The reason is that the online methods must perform complex table join operations during the query process, and in databases with more complex structure such joins incur a large time overhead. Offline query methods, by contrast, preprocess the tables and tuples in the database before the user issues a query, which significantly improves query speed. In addition, the proposed query method TCU-Based builds a subject index for each theme class cluster and searches the multiple indexes in parallel after the user issues a query, which further improves query efficiency relative to the offline query method SAINT. Figure 8 also shows that the efficiency gain is more pronounced when the number of keywords is 3 or more. The reason is that index terms are selected using association rules, so when the user provides multiple query keywords, an index entry containing all of the keywords can be retrieved directly without joining index entries.

To further evaluate the performance of the proposed query method, the query methods were compared on top-k queries, with k ranging from 2 to 20. In Figure 9, the abscissa indicates the value of k and the ordinate indicates the average query response time for each value of k. The query performance of the proposed method is clearly better than that of the other three baseline query methods. For example, when k is 12, the average query response times of DBXplorer, BLANKS, and SAINT are 13500 ms, 7452 ms, and 3420 ms respectively, whereas the proposed query method TCU-Based takes only 2014 ms. The reasons are similar to those given for Figure 8 and are not repeated here.

(2) Query effectiveness

Query effectiveness is measured using two evaluation metrics, accuracy rate (precision) and recall. To calculate them accurately, 100 SQL queries were first selected at random, and their corresponding query responses were taken as the standard query results. The keywords in the SQL queries were then extracted and used as input to the four query methods above. The average precision and average recall are compared in Figure 10 and Figure 11. As Figure 10 shows, the precision of the query algorithms declines as the number of keywords increases. In general, 1-keyword and 2-keyword queries achieve higher precision than 3-keyword, 4-keyword, and 5-keyword queries, because the relationships between keywords become increasingly complex as the number of keywords grows. The proposed method TCU-Based is 3% to 7% higher in precision than the comparable method SAINT, and the improvement over the other two methods is even more significant. As Figure 11 shows, the proposed method is significantly better than the existing methods in recall, about 15% higher than SAINT.
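For reference, the two metrics can be computed per query as in the following Python sketch. It assumes that the accuracy rate (precision) and recall are used in their usual information retrieval sense, with the response of the SQL query taken as the set of relevant results; the function name and the use of result identifiers are hypothetical.

```python
def precision_recall(returned, relevant):
    """Precision and recall of one query's results against the standard results."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # results that are both returned and relevant
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these values over the 100 queries gives the average precision and average recall of a method.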

Scalability assessment

This experiment measures the average top-5 query response time. In Figure 12, the abscissa indicates the dataset size and the ordinate indicates the average top-5 query response time. As Figure 12 shows, as the database size increases from 100 MB to 500 MB, the average query response time of the proposed method changes slowly. This is because the increase in data volume mainly affects tuple joins during preprocessing and has little effect on online index lookup. The experiment shows that the proposed method scales well across dataset sizes.

Claims

1. A keyword query method based on theme class cluster units in a relational database, carried out in the following steps:

Step 1: build the theme class cluster units;

Step 2: establish the optimized index mechanism based on association rules;

Step 3: return the query results to the user;

characterized in that the detailed process of the vertical grouping based on data table characteristics and the query log in sub-step ① of step 1 is as follows:

A method for constructing the inter-table similarity matrix is used, in which the initial input matrix is built from two aspects: table characteristics, including inter-table topological compactness and inter-table content similarity, and the query log; the vertical grouping method takes the relational database and the user query log as input and outputs a set of theme class clusters, and is divided into the following three modules: an input module, a similarity matrix construction module, and an output module; the input module takes the relational database and its schema graph as input, together with the query log; the similarity matrix construction module analyzes the schema graph and the database from the input module, computes the topological compactness and content similarity between data tables, builds the topological compactness matrix and the content similarity matrix respectively, and on this basis builds the inter-table similarity matrix; statistical analysis of the query log is used to further reinforce and correct the above matrix; finally, a set of theme class clusters is output as the result;

(1) Topological compactness between tables

Where |v_i| is the size of table T_i in the database and |v_j| is the size of table T_j in the database; σ is the influence factor; the distance term is the logical distance between node v_i and node v_j, that is, the length of the path between node v_i and node v_j in the database schema graph; by the mathematical properties of the Gaussian function, for a given value of σ the influence range of each node is approximately a bounded local region, and when the logical distance between two nodes exceeds that range, the topological compactness between the two nodes decays rapidly to 0;

The topological compactness between any two nodes is calculated by formula (1), and the topological compactness matrix of the relational database is then constructed as follows:

(2) Content similarity between tables

A data table consists of a table name, attributes, and tuples, so the content similarity between tables can be analyzed in depth from two aspects: name similarity and value similarity;

Name similarity comprises two parts, table name similarity and attribute name similarity; using the method for computing the similarity between two entities in a vector space, the keywords in the table name and attribute names of table T_i are first extracted to construct vector V_i for table T_i, the keywords in the table name and attribute names of table T_j are extracted to construct vector V_j for table T_j, and the name similarity is calculated using the cosine function:

Sim₁(T_i, T_j) = Sim(V_i, V_j) = V_i·V_j / (|V_i|·|V_j|) (2)

where Sim₁(T_i, T_j) is the name similarity between table T_i and table T_j;

Sim(V_i, V_j) is the similarity between vector V_i and vector V_j;

|V_i| and |V_j| are the magnitudes of vector V_i and vector V_j respectively;

The value similarity is computed as follows:

1. Calculate the content similarity between two attributes using the Jaccard similarity coefficient;

J(u, v) = |u ∩ v| / |u ∪ v| (3)

2. Use a greedy matching strategy to select the set Z of matched attribute pairs in the database:

3. Take the weighted average to obtain the value similarity between the two tables

Where |T_i| is the number of attribute columns in data table T_i; |T_j| is the number of attribute columns in data table T_j; max(|T_i|, |T_j|) is the larger of |T_i| and |T_j|; u.V is the coefficient of variation of attribute column u and v.V is the coefficient of variation of attribute column v, the coefficient of variation being a statistic that measures the degree of variation of the observations in a table; the smaller the coefficient of variation, the lower the richness of the attribute column's content; conversely, the larger the coefficient of variation, the greater the richness of the attribute column's content; the coefficient of variation of a column is its standard deviation divided by its mean, computed from the standard deviation and the mean of attribute column u and of attribute column v respectively; max(u.V, v.V) is the larger of u.V and v.V;

The content similarity between data table T_i and data table T_j is: Sim(T_i, T_j) = (Sim₁(T_i, T_j) + Sim₂(T_i, T_j)) / 2;

The content similarity matrix S between data tables is:

where l is the number of data tables in the database;

where T is the topological compactness matrix of the relational database;

(3) Similarity matrix correction method

The query log records the historical access information of users searching the database and contains three fields: the user ID, the query Q, and the query result together with the data table T in which the result is located; the basic idea of the vertical grouping method with user feedback is to statistically analyze the query records in the query log and to correct the similarity matrix using the following boost function;

boost_log(T_i, T_j) = exp(log(count(T_i, T_j)) / log(max(count))) (5)

where count(T_i, T_j) is the number of times table T_i and table T_j co-occur in the query log, and max(count) is the maximum co-occurrence count of any two tables in the query log;

The above vertical grouping result is reinforced using the information in the user query log, with the following score reinforcement function:

A_Final(t_i, t_j) = A_DB(t_i, t_j) × boost_log(t_i, t_j) (6)

The inter-table similarity matrix is thereby obtained, where l is the number of data tables in the database;

(4) Vertical grouping

Vertical grouping based on improved spectral clustering

Input: G = (V, E), k, and influence factor σ, where V = {v₁, ..., v_l} and |E| = m;

Output: the set of theme class clusters C = {C₁, C₂, ..., C_k};

Steps:

1. Construct the inter-table similarity matrix A_Final;

3. Map all nodes in V into the space R^k;

4. Use the k-means algorithm to cluster the nodes in R^k into the theme class clusters C₁, C₂, ..., C_k;

The table join order optimization scheme within a theme class cluster in sub-step ② of step 1 is as follows:

In data preprocessing, the n tables T₁, T₂, ..., T_n in theme class cluster C_i = (T₁, T₂, ..., T_n) are joined to obtain the merged table T_i', and the table join order optimization scheme is designed on the basis of a genetic algorithm; first, when the genetic algorithm is used to optimize the table join order, each candidate join order is represented as a join tree and encoded by a preorder traversal of the join tree; after encoding, poplength join trees are selected at random as the initial population, and genetic operations on the first-generation population generate genlength new individuals; the genetic operations apply two operators, crossover and mutation, to the population; crossover operator: randomly generated subtrees of the same size are exchanged, and any data table duplicated on a join tree after crossover is replaced with a data table that does not yet appear on it; mutation operator: the data tables at any non-zero positions on the join tree are exchanged; the cost of each join tree is then calculated with the well-known join tree cost formula, and the poplength join trees with the smallest cost are selected as the next-generation population; the above genetic operation and selection process is repeated until a predetermined number of iterations is reached, this number being a reasonable value obtained by analyzing and summarizing a large number of experiments; the join tree with the smallest cost in the last generation of the evolutionary process is the optimal table join order obtained by the genetic algorithm;

The detailed process of the horizontal grouping based on the theme class cluster tuple association graph in sub-step ③ of step 1 is as follows:

After vertical grouping and the table join operation within each theme class cluster, the set of theme class clusters C = (C₁, C₂, ..., C_k) is obtained, where each theme class cluster C_i contains one merged table T_i'; horizontal grouping is performed on the theme class clusters using the theme class cluster tuple association graph, with a hybrid similarity calculation method;

Hybrid similarity calculation for theme class cluster tuples

An important data model in the horizontal grouping strategy is the theme class cluster tuple association graph G = (V, E), a weighted undirected graph in which the similarity between theme class cluster tuples serves as the edge weight;

Let t_i' and t_j' be any two tuples in theme class cluster C_i; the hybrid similarity of the two tuples is calculated as follows:

1. Define different distance functions d_k, including Euclidean distance, edit distance, and Hamming distance;

2. Map the theme class cluster tuples into an n-dimensional space, and compute the distance d_k(t_i', t_j') between the two tuples under each distance function;

3. Compute the hybrid similarity between the two tuples according to the following formula;

where Sim_k(t_i', t_j') is the similarity between tuples t_i' and t_j' when the distance function is d_k;

the maximum term denotes the maximum value of the similarity between tuples t_i' and t_j' when the distance function is d_k, k ∈ {1, 2, 3};

Sim(t_i', t_j') is the hybrid similarity between the two tuples t_i' and t_j';

Steps of horizontal grouping:

1. Calculate the hybrid similarity between every pair of theme class cluster tuples in theme class cluster C_i, and build the theme class cluster tuple association graph;

2. According to the similarity threshold φ, divide the association graph into several connected components;

3. Find the theme class cluster units whose number of theme class cluster tuples is greater than Minsize and calculate their cohesion; select the theme class cluster unit with the smallest cohesion and disconnect the edge with the smallest similarity in that unit;

4. Repeat step 3 until the number of tuples contained in each theme class cluster unit in theme class cluster C_i is less than Minsize;

5. Calculate the separation between theme class cluster units, and merge the theme class cluster units with the smallest separation;

6. Repeat step 5 until the number of theme class cluster units |C_i| in theme class cluster C_i reaches the required value;

The cohesion and separation of theme class cluster units used in step 3 and step 5 are calculated by the following formulas:

where Sim(t_i', t_j') is the hybrid similarity between theme class cluster tuples t_i' and t_j', and TCU_k and TCU_l denote two different theme class cluster units;

The detailed process of establishing the optimized index mechanism based on association rules in step 2 is as follows: within each theme class cluster, an association rule algorithm is used to find frequent itemsets, each frequent itemset corresponds to one inverted index entry list, and the list contains all theme class cluster units in which the keyword or keyword combination directly occurs; the index structure is as follows:

Keyword(s) → (TCU₁, TCU₂, ...) (10)

The above subject index based on theme class cluster units is generated using the Lucene toolkit;

The detailed process of returning the query results to the user in step 3 is as follows:

Given a keyword query K = {k₁, k₂, ..., k_n}, multiple subject indexes are searched in parallel to find the corresponding theme class cluster units, the score of each relevant theme class cluster unit is computed with a well-known result ranking function, the units are sorted in descending order of score, and the top-k theme class cluster units are returned to the user.