CN105975488B - A keyword query method based on theme class cluster units in a relational database - Google Patents

A keyword query method based on theme class cluster units in a relational database

Info

Publication number
CN105975488B
Authority
CN
China
Prior art keywords
class cluster
theme class
data
theme
tables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610264735.4A
Other languages
Chinese (zh)
Other versions
CN105975488A (en
Inventor
王念滨
周连科
王红滨
王瑛琦
何鸣
宋奎勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201610264735.4A
Publication of CN105975488A
Application granted
Publication of CN105975488B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A keyword query method based on theme class cluster units in a relational database, relating to the field of information retrieval and in particular to a keyword query method for relational databases that is based on theme class cluster units. The invention addresses two problems: existing online keyword query methods incur a large time overhead from frequent table joins during query processing, and existing offline keyword query methods have low query efficiency on large-scale databases with complex internal structure and huge data volume. The method proceeds according to the following steps: 1. build the theme class cluster units: (1) vertical grouping based on data table characteristics and the query log; (2) optimization of the table join order within each theme class cluster; (3) horizontal grouping based on the theme class cluster tuple association graph; 2. establish an optimized index mechanism based on association rules; 3. return the query result to the user. The invention is applied in the field of information retrieval.

Description

A keyword query method based on theme class cluster units in a relational database
Technical field
The present invention relates to the field of information retrieval, and in particular to a keyword query method based on theme class cluster units in a relational database.
Background art
In recent years, keyword query has been applied successfully as an important query technique in the field of information retrieval. Because it is easy to use, it has been accepted by more and more users. Relational databases likewise need a simple and effective query method for obtaining the information a user is interested in from a large and complex database. Traditional structured query methods, such as SQL queries, require the user not only to understand the complex underlying schema of the relational database but also to master the query language, which makes querying difficult and inconvenient. Keyword query techniques for relational databases have therefore received wide attention. Some known studies have tried to apply traditional keyword query methods directly to relational databases. However, because a relational database must satisfy normalization requirements, its information is scattered across different data tables, and such a direct transfer does not give users a good query experience. A keyword query technique suited to relational databases therefore needs to be developed around the structural characteristics of the relational database itself.
Existing keyword query methods can be divided into two broad classes, online query and offline query. The main idea of online query is to model the relational database as a schema graph or a data graph. After the user submits a set of query keywords, the graph is traversed online and one or more subgraphs, candidate networks, or Steiner trees are returned as the query result. Because table joins are performed continually during query processing, this class of methods incurs a high time cost. Offline query methods, by contrast, address this problem with data structures such as virtual documents or tuple units. Before the user submits a query, the tables are joined by breadth-first traversal, which avoids the time overhead of frequent table joins during query processing. However, these offline query methods do not consider query efficiency on large-scale relational databases. An enterprise database typically contains hundreds or thousands of data tables, and with the above methods preprocessing takes a considerable amount of time. In addition, because the resulting joined table is very large, even with an index it is hard for the user to obtain the desired query response within a short time.
To solve this problem, this application proposes TCU-Based query, an offline query method based on theme class cluster units. First, the data in the database is grouped twice, vertically and then horizontally, to build a new data structure, the theme class cluster unit. Second, to further improve the efficiency of data preprocessing, this application designs a table join order optimization scheme based on a genetic algorithm. Finally, a theme index is built for each theme class cluster so that these indexes can work concurrently on machine nodes, which significantly improves query speed. After the user submits a query, one or more theme class cluster units are returned as the query response; they contain more complete information and match the user's query intent.
Summary of the invention
Related theory involved in this application
Keyword query techniques can broadly be divided into two classes, online query and offline query. The main idea of online query is as follows: before any query, a schema graph or data graph corresponding to the database is built; after the user submits a query, the graph is traversed online to find the top-k candidate networks or Steiner trees, which are returned to the user as the query response. Offline query instead preprocesses the database offline, builds virtual documents or tuple units, and returns the top-k query responses using information retrieval techniques. This avoids online table joins and graph traversal and thereby clearly improves query processing efficiency.
1. Online query
According to the number of databases a query involves, online query can be further divided into keyword query over a single relational database and keyword query over distributed databases.
Keyword query over a single relational database
In the DBXplorer and DISCOVER systems, the database is modeled as a schema graph G in which nodes represent relations and edges represent primary key-foreign key constraints between relations; the result of a keyword query is a set of candidate networks. Systems such as BANKS, BANKS-II and BLINKS instead use keyword query methods based on a data graph and retrieve Steiner trees containing the keywords directly on the underlying data graph. The BANKS system uses a backward expanding search algorithm, whose performance degrades severely when it encounters a node with a large in-degree. Building on this, BLINKS proposes a new search method, bidirectional search, which significantly improves search performance.
Keyword query over distributed databases
To solve the keyword query problem in distributed databases, currently known research mainly follows several strategies. The Kite system combines schema matching and structure discovery techniques to find primary key-foreign key joins between heterogeneous databases, and thereby solves keyword query over heterogeneous relational databases. Hristidis, V. et al. build a keyword relationship matrix $KRM_i$ for each database $DB_i$ as a summary of the database. For each keyword pair $(k_i, k_j)$, the entry $KRM_i(k_i, k_j)$ records the frequency with which keywords $k_i$ and $k_j$ occur at different distances. However, the keyword relationship matrix can only prune databases that cannot serve as query responses by using binary relationships between feature words. To overcome this shortcoming, the G-KS method was proposed, which uses a keyword relationship graph to characterize complex relationships among keywords. Nodes in the graph represent feature words, and edges represent relationships between words. The keyword relationship graph can therefore be used to compute the similarity between a database and a keyword query and so retrieve the most promising databases.
2. Offline query
The query methods above do not address the high time overhead caused by online table joins. In general, this problem can be alleviated by preprocessing the tables and tuples in the database. In recent years, offline query over relational databases has been studied in depth and preliminary solutions have been proposed. Feldman, P. et al. first proposed the concepts of text object and virtual document, which complete the table join operations in the database before the user submits a query and thereby improve query efficiency significantly. Teorey, T.J. et al. extended the text object further: tuples with the same attribute value are merged to build a data structure with more complete content, the tuple unit, and multiple interconnected tuple units are returned to the user as the query response, which effectively improves result precision. These methods consider only simple table joins and group by operations on tuples and are not suited to databases with complex internal table structure. The method described in this application optimizes the table join order and defines a more reasonable data structure, the theme class cluster unit, and thereby significantly improves query efficiency and precision.
3. Vertical grouping
Given a database $D(T_1, T_2, \ldots, T_l)$ consisting of $l$ tables, vertical grouping means dividing the tables of $D$ into a set of theme class clusters $C_D = \{C_1, C_2, \ldots, C_k\}$ according to a reasonable partition strategy, such that (1) each theme class cluster $C_i \in C_D$ contains a group of data tables $C_i = \{T_1, T_2, \ldots, T_j\}$ that are closely associated in structure and related in content; (2) $C_i \cap C_j = \emptyset$ for $i \neq j$; (3) for all $i$, a further condition holds whose formula is missing from the source text.
4. Horizontal grouping
Given a data table $T'(t_1, t_2, \ldots, t_n)$, where $t_i$ denotes a tuple, horizontal grouping of the table means assigning tuples with high similarity to the same set $\Gamma_i$ according to some similarity measure function. As shown in Fig. 1, after horizontal grouping the $n$ tuples in the table American football are divided into $m$ units $\Gamma_1, \Gamma_2, \ldots, \Gamma_m$, where $m \le n$.
5. Theme class cluster tuple association graph
Given a table $T'(t'_1, t'_2, \ldots, t'_n)$ in a theme class cluster, where $t'_i$ denotes a theme class cluster tuple, the theme class cluster tuple association graph is a weighted undirected graph $G = (V, E)$ in which vertex $v_i \in V$ represents the tuple $t'_i$. If the similarity between two tuples $t'_i$ and $t'_j$ satisfies $s_{ij}(t'_i, t'_j) > 0$, there is an edge $e_{ij} \in E$ between nodes $v_i$ and $v_j$, and the weight of $e_{ij}$ is $s_{ij}(t'_i, t'_j)$. Fig. 2 shows the distribution theme class cluster tuple association graph of table $T'$.
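The definition above can be illustrated with a minimal sketch that builds the weighted undirected graph as a dictionary of edges. The similarity function `token_jaccard`, the function name `build_association_graph`, and the sample rows are hypothetical; the text only requires some similarity $s_{ij}$ and an edge wherever $s_{ij} > 0$.

```python
from itertools import combinations


def token_jaccard(a, b):
    """Hypothetical similarity: overlap of the attribute values of two tuples."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_association_graph(tuples, similarity=token_jaccard):
    """Return {(i, j): weight} for every pair i < j whose similarity is > 0."""
    edges = {}
    for i, j in combinations(range(len(tuples)), 2):
        s = similarity(tuples[i], tuples[j])
        if s > 0:  # an edge e_ij exists only for positive similarity
            edges[(i, j)] = s
    return edges


# Made-up rows of a merged table T'
rows = [
    ("Hawks", "Springfield", "East"),
    ("Owls", "Riverton", "East"),
    ("Bears", "Riverton", "West"),
]
graph = build_association_graph(rows)
```

Here rows 0 and 1 share one value and so do rows 1 and 2, while rows 0 and 2 share none, so the graph has exactly two edges.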
6. Theme class cluster unit
Consider a database $D(T_1, T_2, \ldots, T_l)$ containing $l$ interconnected data tables. First, vertical grouping of the database yields $k$ theme class clusters $C = \{C_1, C_2, \ldots, C_k\}$. Second, the tables in each theme class cluster $C_i$ are joined according to their primary key-foreign key relationships to obtain a merged table $T'_i(t_1, t_2, \ldots, t_n)$. Finally, the $n$ tuples of the merged table $T'_i$ are grouped horizontally to obtain $\Gamma_1, \Gamma_2, \ldots, \Gamma_m$, where each $\Gamma_i$ is called a theme class cluster unit. The theme class cluster units $\Gamma_1, \Gamma_2, \ldots, \Gamma_m$ of the theme class cluster American football are shown in Fig. 1.
7. Top-k keyword query
Given a keyword query $Q = \{k_1, k_2, \ldots, k_m\}$ and a database $D(T_1, T_2, \ldots, T_l)$ containing $l$ data tables, a top-k keyword query returns the $k$ theme class cluster units with the highest relevance scores as the query result.
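As a rough illustration of the top-k definition, the sketch below ranks units by a relevance score and keeps the best k. The scoring function `relevance` (fraction of query keywords contained in a unit) and the representation of a unit as a set of terms are assumptions made for the example; the text does not specify how relevance is computed.

```python
import heapq


def relevance(unit_terms, query):
    """Hypothetical score: fraction of query keywords contained in the unit."""
    q = {k.lower() for k in query}
    return len(q & unit_terms) / len(q) if q else 0.0


def top_k_units(units, query, k):
    """units: {unit_id: set of lowercase terms}.

    Returns up to k (score, unit_id) pairs, best first; units that match
    no keyword are dropped.
    """
    scored = ((relevance(terms, query), uid) for uid, terms in units.items())
    hits = [(s, uid) for s, uid in scored if s > 0]
    return heapq.nlargest(k, hits)


units = {
    "u1": {"football", "coach"},
    "u2": {"football"},
    "u3": {"tennis"},
}
result = top_k_units(units, ["football", "coach"], 2)
```

Using `heapq.nlargest` avoids sorting all candidate units when k is small compared with the number of units.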
8. Overall framework of this application
Most existing keyword query methods must perform table joins continually after the user submits a specific query and therefore incur a large time overhead. To solve this kind of problem, the concept of offline query was proposed, but existing offline methods do not consider queries on relational databases with complex internal structure. The main idea of this application is as follows: the data tables are grouped twice, first in the vertical direction and then in the horizontal direction, to build a set of theme class cluster units; an optimal table join order selection scheme based on a genetic algorithm is designed to reduce the cost of data preprocessing; and finally an association rule algorithm is used to build a theme index for each theme class cluster, which clearly improves query efficiency and accuracy on relational databases.
Fig. 3 shows the architecture of the query method of this application, which is divided into an online part and an offline part.
Online query part
The user submits a set of query keywords to the query processor online. Using the theme index, the query processor returns the theme class cluster units that contain one or more of the keywords to the user as the query response.
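The online lookup described above can be pictured with a small inverted index. This is only a sketch under simplifying assumptions: every term of a unit is indexed, whereas the application selects index terms with association rules, and the names `build_theme_index` and `lookup` are hypothetical.

```python
from collections import defaultdict


def build_theme_index(units):
    """units: {unit_id: text}. Returns {term: set of unit_ids}."""
    index = defaultdict(set)
    for uid, text in units.items():
        for term in text.lower().split():
            index[term].add(uid)
    return index


def lookup(index, keywords):
    """Return the ids of all units that contain at least one keyword."""
    found = set()
    for kw in keywords:
        found |= index.get(kw.lower(), set())
    return found


units = {
    "g1": "football coach Smith",
    "g2": "football stadium",
    "g3": "tennis court",
}
index = build_theme_index(units)
hits = lookup(index, ["coach", "stadium"])
```

Because the index is built offline, answering a query reduces to a few dictionary lookups and set unions, with no table joins at query time.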
Offline data preprocessing part
It comprises the following four modules: the vertical grouping module, the table join order optimization module, the horizontal grouping module, and the theme index construction module.
(1) Vertical grouping module
This module takes the relational database and the user query log as input. It uses an optimal graph partitioning strategy, improved spectral clustering, together with the characteristics of the relational database itself, and considers both the content information of the tables and the user feedback information from the query log to group the data tables vertically into a set of theme class clusters. Each theme class cluster contains a group of data tables that are not only closely associated in structure and similar in content but also have a high co-occurrence frequency in the users' query log.
(2) Table join order optimization module
For each theme class cluster produced by the vertical grouping module, the data tables in it are joined according to their primary key-foreign key relationships. Because a large database contains a great many tables, the join operations take a relatively long time and therefore need to be optimized. This module uses a genetic algorithm to select the optimal table join order, which substantially reduces the join cost.
(3) Horizontal grouping module
This module takes the set of theme class clusters produced by the preceding modules as input. It computes the hybrid similarity between tuples in each theme class cluster and builds a tuple association graph for each theme class cluster. A hierarchical clustering algorithm is then applied to group each theme class cluster horizontally, which produces the set of theme class cluster units.
(4) Theme index construction module
This module takes the set of theme class cluster units produced by the horizontal grouping module as input, selects index terms using association rules, and then builds a theme index for each theme class cluster.
To solve the problem that existing online keyword query methods incur a huge time overhead from frequent table joins during query processing, and the problem that existing offline keyword query methods have low query efficiency on large-scale databases with complex internal structure and huge data volume, this application proposes a keyword query method based on theme class cluster units in a relational database.
A keyword query method based on theme class cluster units in a relational database proceeds according to the following steps:
Step 1: build the theme class cluster units;
Step 1.1: vertical grouping based on data table characteristics and the query log;
Step 1.2: optimization of the table join order within each theme class cluster;
Step 1.3: horizontal grouping based on the theme class cluster tuple association graph;
Step 2: establish an optimized index mechanism based on association rules;
Step 3: return the query result to the user.
The present invention has the following beneficial effects:
1. An offline query method based on theme class cluster units is proposed, which is suitable for keyword query on large-scale relational databases;
2. A new data structure, the theme class cluster unit (TCU), is constructed. Improved spectral clustering and the theme class cluster tuple association graph are used to group data tables vertically and tuples horizontally, respectively. The set of TCUs is built offline and used as the query response, which not only greatly shortens query response time but also returns richer and more complete theme semantic information;
3. A table join order optimization scheme based on a genetic algorithm is designed, which reduces the preprocessing time overhead;
4. Index terms are selected with an association rule algorithm and an index is then built for each theme class cluster, which significantly speeds up queries.
Description of the drawings
Fig. 1 is a schematic diagram of the horizontal grouping of the theme class cluster American football;
Fig. 2 is the distribution theme class cluster tuple association graph;
Fig. 3 is the architecture diagram of the query based on theme class cluster units;
Fig. 4 is the architecture diagram of the vertical grouping method;
Fig. 5 is the flow chart of the table join order optimization scheme based on a genetic algorithm;
Fig. 6 is a diagram of the correspondence between join trees and integer sequences;
Fig. 7 is a comparison of preprocessing time;
Fig. 8 is a comparison of average response time for different numbers of keywords;
Fig. 9 is a comparison of average response time for different values of k;
Fig. 10 shows the relationship between the number of keywords and average precision;
Fig. 11 shows the relationship between the number of keywords and average recall;
Fig. 12 shows the effect of data set size on query performance.
Specific embodiments
To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to Fig. 4 to Fig. 6 and specific embodiments.
Specific embodiment one: the keyword query method based on theme class cluster units in a relational database described in this embodiment proceeds according to the following steps:
Step 1: build the theme class cluster units;
Step 1.1: vertical grouping based on data table characteristics and the query log;
Step 1.2: optimization of the table join order within each theme class cluster;
Step 1.3: horizontal grouping based on the theme class cluster tuple association graph;
Step 2: establish an optimized index mechanism based on association rules;
Step 3: return the query result to the user.
This embodiment has the following beneficial effects:
1. An offline query method based on theme class cluster units is proposed, which is suitable for keyword query on large-scale relational databases;
2. A new data structure, the theme class cluster unit (TCU), is constructed. Improved spectral clustering and the theme class cluster tuple association graph are used to group data tables vertically and tuples horizontally, respectively. The set of TCUs is built offline and used as the query response, which not only greatly shortens query response time but also returns richer and more complete theme semantic information;
3. A table join order optimization scheme based on a genetic algorithm is designed, which reduces the preprocessing time overhead;
4. Index terms are selected with an association rule algorithm and an index is then built for each theme class cluster, which significantly speeds up queries.
Specific embodiment two: this embodiment further explains the keyword query method based on theme class cluster units in a relational database described in specific embodiment one. The vertical grouping based on data table characteristics and the query log in step 1.1 proceeds as follows. This application uses a method for constructing an inter-table similarity matrix in which the initial input matrix is built from two sources: table characteristics, which comprise the topological compactness between tables and the content similarity between tables, and the query log. The vertical grouping method takes the relational database D and the user query log as input and produces a set of theme class clusters as output; the process is shown in Fig. 4. The method consists of three main modules: the input module, the similarity matrix construction module, and the output module. The input module takes the relational database and its schema graph as system input, which describe the content information and the structural information of the database respectively; the query log is also taken as system input and indirectly reflects how information is distributed in the database. The similarity matrix construction module analyzes the schema graph and the database from the input module to compute the topological compactness and the content similarity between data tables, builds the topological compactness matrix and the content similarity matrix, and from them builds the inter-table similarity matrix; it also analyzes the query log statistically to reinforce and correct this matrix. Finally, the output module outputs a set of theme class clusters as the result.
(1) Topological compactness between tables
Given the schema graph $G = (V, E)$ of the database, the topological compactness between node $v_i$ and node $v_j$ is defined by formula (1) [formula (1) is missing from the source text],
where $|v_i|$ is the size of table $T_i$ in the database and $|v_j|$ is the size of table $T_j$; $\sigma$ is the influence factor: the larger $\sigma$ is, the stronger the interaction between nodes, and the smaller $\sigma$ is, the weaker the interaction. The logical distance between node $v_i$ and node $v_j$ is the path length between the two nodes in the database schema graph. By the mathematical properties of the Gaussian function, for a given value of $\sigma$ the influence range of each node is approximately a local region whose radius depends on $\sigma$ [the expression is missing from the source text]; when the logical distance between two nodes exceeds this value, the topological compactness between them decays rapidly to 0.
The topological compactness between any two nodes is computed by formula (1), which gives the topological compactness matrix $T$ of the relational database: an $l \times l$ matrix whose entry $(i, j)$ is the topological compactness between $v_i$ and $v_j$.
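Because formula (1) is not reproduced in the text, the following sketch should be read with caution. The logical distance is computed as the shortest path length in the schema graph, which follows the definition above. The function `compactness`, however, uses an ASSUMED Gaussian form, $|v_i| \cdot |v_j| \cdot e^{-(d_{ij}/\sigma)^2}$, chosen only because the text mentions table sizes, $\sigma$, and Gaussian decay; it is not the patent's formula. All table names and sizes are made up.

```python
import math
from collections import deque


def logical_distance(schema, src, dst):
    """Shortest path length (number of edges) between two tables, via BFS."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in schema.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return math.inf  # the two tables are not connected


def compactness(schema, sizes, a, b, sigma=1.0):
    """ASSUMED Gaussian form, not the source's formula (1)."""
    d = logical_distance(schema, a, b)
    return sizes[a] * sizes[b] * math.exp(-((d / sigma) ** 2))


# Hypothetical schema graph: adjacency lists of primary key-foreign key links
schema = {
    "player": ["team"],
    "team": ["player", "league"],
    "league": ["team"],
    "log": [],
}
sizes = {"player": 100, "team": 10, "league": 2, "log": 50}
```

With this form, directly linked tables score much higher than tables two hops apart, and unconnected tables score 0, which matches the qualitative behaviour described in the text.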
(2) Content similarity between tables
A data table consists of a table name, attributes, and tuples, so the content similarity between tables can be analyzed from two aspects: naming similarity and value similarity.
Naming similarity comprises table name similarity and attribute name similarity. This application uses the vector-space method for computing the similarity between two entities: first, keywords are extracted from the table name and attribute names of table $T_i$ to build a vector $V_i$, and keywords are extracted from the table name and attribute names of table $T_j$ to build a vector $V_j$; the naming similarity is then computed with the cosine function:
$Sim_1(T_i, T_j) = Sim(V_i, V_j) = \dfrac{V_i \cdot V_j}{|V_i| \cdot |V_j|}$ (2)
where $Sim_1(T_i, T_j)$ is the naming similarity between table $T_i$ and table $T_j$;
$Sim(V_i, V_j)$ is the similarity between vector $V_i$ and vector $V_j$;
$|V_i|$ and $|V_j|$ are the norms of vector $V_i$ and vector $V_j$, respectively.
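Formula (2) can be sketched as follows. The keyword extraction rule (splitting names on non-letter characters and counting word frequencies) is an assumption made for the example, as are the function names and the sample table names; the text does not say how keywords are extracted.

```python
import math
import re
from collections import Counter


def name_vector(table_name, attribute_names):
    """Bag of lowercase keywords taken from the table and attribute names."""
    words = []
    for name in [table_name, *attribute_names]:
        words += re.findall(r"[a-z]+", name.lower())
    return Counter(words)


def cosine(v1, v2):
    """Formula (2): dot product divided by the product of the vector norms."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


v_player = name_vector("player", ["player_id", "team_id"])
v_team = name_vector("team", ["team_id", "city"])
v_log = name_vector("log", ["ts"])
sim1 = cosine(v_player, v_team)
```

In this example the two tables share the keywords `team` and `id`, so their naming similarity is strictly between 0 and 1, while a table with no shared keywords scores 0.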
The value similarity is computed as follows:
1. Compute the content similarity between two attributes using the Jaccard coefficient:
$J(u, v) = |u \cap v| / |u \cup v|$ (3)
where $u$ is an attribute column of data table $T_i$ and $v$ is an attribute column of data table $T_j$;
2. Use a greedy matching strategy to detect the set $Z$ of attribute pairs in the database;
3. Take the weighted average to obtain the value similarity $Sim_2(T_i, T_j)$ between the two tables by formula (4) [formula (4) is missing from the source text],
where $|T_i|$ is the number of attribute columns in data table $T_i$; $|T_j|$ is the number of attribute columns in data table $T_j$; $\max(|T_i|, |T_j|)$ is the larger of $|T_i|$ and $|T_j|$; $u.V$ is the coefficient of variation of attribute column $u$ and $v.V$ is the coefficient of variation of attribute column $v$. The coefficient of variation is a statistic that measures the degree of variation of the observations in a table and is the ratio of the standard deviation of an attribute column to its mean: the smaller the coefficient of variation, the less rich the content of the attribute column, and the larger the coefficient, the richer the content. $\max(u.V, v.V)$ is the larger of $u.V$ and $v.V$.
In summary, the content similarity between data table $T_i$ and data table $T_j$ is $Sim(T_i, T_j) = (Sim_1(T_i, T_j) + Sim_2(T_i, T_j)) / 2$.
The content similarity matrix $S$ between data tables is the $l \times l$ matrix whose entry $(i, j)$ is $Sim(T_i, T_j)$,
where $l$ is the number of data tables in the database.
Considering similarity in both structure and content gives the inter-table similarity matrix $A_{DB} = T + S$,
where $T$ is the topological compactness matrix of the relational database.
(3) Similarity matrix correction method
The query log records the history of users' accesses to the database and contains three fields: the user ID, the query Q, and the query result together with the data table T in which the result is located. The basic idea of vertical grouping with user feedback is to analyze the query records in the query log statistically and to correct the similarity matrix with the following boost function:
$boost_{log}(T_i, T_j) = \exp\left(\dfrac{\log(count(T_i, T_j))}{\log(\max(count))}\right)$ (5)
where $count(T_i, T_j)$ is the number of times table $T_i$ and table $T_j$ co-occur in the query log, and $\max(count)$ is the maximum co-occurrence count over all pairs of tables in the query log. By formula (5), the more often two tables co-occur in the query log, the more their closeness score is reinforced.
The vertical grouping result above is reinforced with the information in the user query log through the following score reinforcement function:
$A_{Final}(T_i, T_j) = A_{DB}(T_i, T_j) \times boost_{log}(T_i, T_j)$ (6)
This gives the final inter-table similarity matrix $A_{Final}$, the $l \times l$ matrix whose entry $(i, j)$ is $A_{Final}(T_i, T_j)$, where $l$ is the number of data tables in the database.
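Formulas (5) and (6) translate directly into code, as in the sketch below. One detail is an assumption: formula (5) is undefined when two tables never co-occur or when the maximum count is 1, and the sketch returns the neutral factor 1.0 in those cases. The function names and the sample matrices are hypothetical.

```python
import math


def boost(count_ij, max_count):
    """Formula (5): exp(log(count) / log(max count)).

    ASSUMPTION: pairs that never co-occur, or a log whose maximum count is
    at most 1, get the neutral factor 1.0 because the formula is undefined.
    """
    if count_ij < 1 or max_count <= 1:
        return 1.0
    return math.exp(math.log(count_ij) / math.log(max_count))


def final_similarity(a_db, counts):
    """Formula (6), applied element-wise to l x l nested lists."""
    max_count = max(max(row) for row in counts)
    size = len(a_db)
    return [
        [a_db[i][j] * boost(counts[i][j], max_count) for j in range(size)]
        for i in range(size)
    ]


a_db = [[0.0, 0.4, 0.2], [0.4, 0.0, 0.1], [0.2, 0.1, 0.0]]
counts = [[0, 8, 1], [8, 0, 0], [1, 0, 0]]  # co-occurrences in the query log
a_final = final_similarity(a_db, counts)
```

The boost ranges from 1 (a single co-occurrence) to e (the most frequent pair), so the correction can at most multiply a similarity score by about 2.72.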
(4) Vertical grouping
Vertical grouping in this application is based on improved spectral clustering.
Input: $G = (V, E)$, $k$, and the influence factor $\sigma$, where $V = \{v_1, \ldots, v_l\}$ and $|E| = m$;
Output: the set of theme class clusters $C = \{C_1, C_2, \ldots, C_k\}$;
Steps:
1. Construct the inter-table similarity matrix $A_{Final}$;
2. Compute the eigenvectors and eigenvalues, and use the first $k$ eigenvectors $u_1, \ldots, u_k$ to construct the feature vector space $R^k$;
3. Map all nodes in $V$ into the space $R^k$;
4. Use the k-means algorithm to cluster the nodes in $R^k$ into the theme class clusters $C_1, C_2, \ldots, C_k$.
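The four steps can be sketched with NumPy as follows. Several choices here are assumptions, since the text does not describe what the "improved" spectral clustering changes: the sketch uses the common normalized-affinity variant ($D^{-1/2} A D^{-1/2}$, leading $k$ eigenvectors, row normalization) and a plain k-means with a deterministic farthest-point initialization. The function names and the sample matrix are hypothetical.

```python
import numpy as np


def spectral_embedding(a, k):
    """Map each node to a row of R^k using the k leading eigenvectors."""
    a = np.asarray(a, dtype=float)
    d = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1.0))
    norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 A D^-1/2
    _, vecs = np.linalg.eigh(norm)       # eigenvalues in ascending order
    u = vecs[:, -k:]                     # eigenvectors of the k largest
    lengths = np.linalg.norm(u, axis=1, keepdims=True)
    return u / np.where(lengths > 0, lengths, 1.0)


def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iters):
        gaps = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(gaps, axis=1)
        new = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return labels


def vertical_grouping(a_final, k):
    """Steps 2 to 4: embed the tables, then cluster them with k-means."""
    return kmeans(spectral_embedding(a_final, k), k)


# Hypothetical A_Final for five tables: two tight groups, one weak link
a_final = [
    [0.0, 1.0, 0.9, 0.0, 0.0],
    [1.0, 0.0, 0.8, 0.0, 0.0],
    [0.9, 0.8, 0.0, 0.05, 0.0],
    [0.0, 0.0, 0.05, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
labels = vertical_grouping(a_final, 2)
```

Tables that receive the same label belong to the same theme class cluster; in this example the first three tables form one cluster and the last two form the other.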
Specific embodiment three: this embodiment further explains the keyword query method based on theme class cluster units in a relational database described in specific embodiment one or two. The detailed process of the table join order optimization scheme within a theme class cluster, proposed in step 1.2, is as follows:
To avoid complex table join operations during query processing, the n tables T1, T2, ..., Tn in a theme class cluster Ci = (T1, T2, ..., Tn) are joined during data preprocessing to obtain a merged table T'i. Existing methods join the tables by a breadth-first traversal that follows only the primary key/foreign key relationships. A large database typically contains hundreds or thousands of data tables, so this approach incurs a large time overhead and severely degrades preprocessing efficiency. To address this problem, the present invention designs a table join order optimization scheme based on a genetic algorithm, as shown in Figure 5.
First, the most critical step in optimizing the table join order with a genetic algorithm is encoding the data tables. Different table join orders are represented as join trees; to fully retain the characteristic information of a join tree, it is encoded by its preorder (root-first) traversal, as shown in Figure 6.
After encoding, poplength join trees are selected at random as the initial population, and genetic operations are applied to the first-generation population to produce genlength new individuals. The genetic operations use two operators, crossover and mutation. Crossover operator: randomly chosen subtrees of the same size are exchanged, and any data table that is duplicated in a join tree after crossover is replaced with a data table that does not yet appear in it. Mutation operator: the data tables at arbitrary non-zero positions of the join tree encoding are exchanged. The cost of each join tree is then computed with a well-known join tree cost formula, and the poplength join trees with the lowest cost are selected as the next-generation population. The genetic operation and selection process is repeated until a predetermined number of iterations is reached; a reasonable value for this number is obtained by analyzing and summarizing a large number of experiments. The lowest-cost join tree in the last generation of the evolutionary process is the optimal table join order obtained by the genetic algorithm.
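As a non-limiting illustration, the loop of crossover, mutation, and selection described above can be sketched in Python. The sketch makes several simplifying assumptions that are not part of the application: a join order is represented as a plain permutation of table names (a left-deep join) rather than a preorder-encoded join tree, the cost model is a toy sum of intermediate result sizes with a uniform selectivity, parents compete with their offspring during selection, and the table sizes, the mutation probability, and all function names are hypothetical.

```python
import random

# Hypothetical table sizes in rows and a hypothetical uniform join selectivity.
SIZES = {"T1": 1000, "T2": 50, "T3": 2000, "T4": 10, "T5": 400}
SELECTIVITY = 0.01


def cost(order):
    """Toy cost: sum of intermediate result sizes of a left-deep join in this order."""
    rows = SIZES[order[0]]
    total = 0.0
    for name in order[1:]:
        rows = rows * SIZES[name] * SELECTIVITY
        total += rows
    return total


def crossover(a, b, rng):
    """Copy an equally sized segment of b into a, then repair duplicated tables."""
    i = rng.randrange(len(a))
    j = rng.randrange(i, len(a)) + 1
    child = a[:i] + b[i:j] + a[j:]
    missing = [t for t in a if t not in child]
    seen = set()
    for pos, table in enumerate(child):
        if table in seen:
            child[pos] = missing.pop()  # replace a duplicate with a table not yet present
        else:
            seen.add(table)
    return child


def mutate(order, rng):
    """Exchange the tables at two randomly chosen positions."""
    i, j = rng.sample(range(len(order)), 2)
    order = order[:]
    order[i], order[j] = order[j], order[i]
    return order


def optimize(tables, poplength=6, genlength=12, iterations=30, seed=0):
    """Return the lowest-cost join order found after a fixed number of generations."""
    rng = random.Random(seed)
    population = [rng.sample(tables, len(tables)) for _ in range(poplength)]
    for _ in range(iterations):
        offspring = []
        while len(offspring) < genlength:
            a, b = rng.sample(population, 2)
            child = crossover(a, b, rng)
            if rng.random() < 0.3:  # assumed mutation probability
                child = mutate(child, rng)
            offspring.append(child)
        # Keep the poplength cheapest individuals (parents included, an assumption).
        population = sorted(population + offspring, key=cost)[:poplength]
    return population[0]


best = optimize(list(SIZES))
print(best, cost(best))
```

Because parents are retained during selection in this sketch, the best cost never gets worse from one generation to the next.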
Specific embodiment four: this embodiment further explains the keyword query method based on theme class cluster units in a relational database described in any one of specific embodiments one to three. The detailed process of the horizontal grouping based on the theme class cluster tuple association graph, described in step 1.3, is as follows:
After vertical grouping and the table join operation within each theme class cluster, the set of theme class clusters C = (C1, C2, ..., Ck) is obtained, where each theme class cluster Ci contains one merged table T'i. In general, users want several related tuples to be combined into a single response result, so further grouping each theme class cluster with a reasonable horizontal grouping method can effectively improve query speed. Existing horizontal grouping methods simply use the GROUP BY operation of the database, which may assign tuples that have different attribute values but high similarity to different groups, so the horizontal grouping result is not reasonable enough. The present invention groups theme class clusters horizontally using a theme class cluster tuple association graph, which improves query efficiency while making the grouping of data better match user query needs. In addition, a mixed similarity calculation method makes the similarity computed between tuples more accurate and gives good scalability.
Mixed similarity calculation for theme class cluster tuples
An important data model in the horizontal grouping strategy is the theme class cluster tuple association graph G = (V, E), a weighted undirected graph in which the similarity between theme class cluster tuples serves as the edge weight. To improve the accuracy and scalability of the similarity calculation between theme class cluster tuples, this subsection proposes a mixed similarity calculation method.
Let t'i and t'j be any two tuples in theme class cluster Ci. The mixed similarity of the two tuples is calculated as follows:
1. Define different distance functions d_k, including the Euclidean distance, the edit distance, and the Hamming distance;
2. Map the theme class cluster tuples into an n-dimensional space, and compute the distance d_k(t'i, t'j) between two tuples according to each distance function;
3. Compute the mixed similarity between the two tuples according to the following formula.
Here, Sim_k(t'i, t'j) is the similarity between tuples t'i and t'j when the distance function is d_k;
the maximum term denotes the maximum value of the similarity between tuples t'i and t'j when the distance function is d_k, k ∈ {1, 2, 3};
Sim(t'i, t'j) is the mixed similarity between the two tuples t'i and t'j.
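As a non-limiting illustration, the three distance functions named in step 1 can be written in Python as follows. The sketch covers only the distance functions themselves; it does not attempt to show how the per-distance similarities are combined into the mixed similarity. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Hamming distance: number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(s, t):
    """Edit (Levenshtein) distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


print(euclidean((0, 0), (3, 4)))
print(hamming("karolin", "kathrin"))
print(edit_distance("kitten", "sitting"))
```

A natural use, under the assumption that numeric attributes are compared with the Euclidean distance and string attributes with the edit or Hamming distance, is to pick the distance function per attribute type.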
Steps of horizontal grouping:
1. Calculate the mixed similarity between the theme class cluster tuples in theme class cluster Ci and construct the theme class cluster tuple association graph;
2. Divide the association graph into several connected components according to the similarity threshold φ;
3. Find the theme class cluster units whose number of theme class cluster tuples is greater than Minsize and calculate their cohesion; select the theme class cluster unit with the smallest cohesion and disconnect the edge with the smallest similarity in that unit;
4. Repeat step 3 until the number of tuples contained in every theme class cluster unit is less than Minsize;
5. Calculate the separation between theme class cluster units and merge the theme class cluster units with the smallest separation;
6. Repeat step 5 until the number of theme class cluster units in theme class cluster Ci, |Ci|, reaches the required value.
The cohesion and separation of theme class cluster units used in step 3 and step 5 are calculated by the following formulas:
Here, Sim(t'i, t'j) is the mixed similarity between theme class cluster tuples t'i and t'j, and TCUk and TCUl denote two different theme class cluster units;
Formulas (8) and (9) involve three parameters: the similarity threshold φ, the final number of theme class cluster units k, and Minsize. The first two parameters are set according to user requirements, and Minsize is set to 1% to 3% of the total data volume.
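As a non-limiting illustration, steps 1 and 2 of the horizontal grouping (thresholding the association graph and splitting it into connected components) can be sketched in Python. The sketch assumes that the pairwise mixed similarities are already available, and that an edge is kept when its similarity is greater than or equal to φ; both the comparison direction and the function name `connected_components` are assumptions. The later splitting and merging steps are not covered.

```python
def connected_components(n, sim, phi):
    """Split tuples 0..n-1 into connected components of the thresholded association graph.

    sim maps a pair (i, j) to the mixed similarity of tuples i and j.
    """
    adjacency = {i: set() for i in range(n)}
    for (i, j), s in sim.items():
        if s >= phi:  # assumed rule: keep edges whose similarity reaches the threshold
            adjacency[i].add(j)
            adjacency[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # iterative depth-first search
            u = stack.pop()
            component.append(u)
            for w in adjacency[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components


# Hypothetical similarities between five tuples.
similarities = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.2, (3, 4): 0.7}
print(connected_components(5, similarities, phi=0.5))
```

Each component returned by this sketch corresponds to one candidate theme class cluster unit before the Minsize-based splitting of step 3.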
Specific embodiment five: this embodiment further explains the keyword query method based on theme class cluster units in a relational database described in any one of specific embodiments one to four. The detailed process of establishing the index optimization mechanism based on association rules, described in step 2, is as follows:
To speed up queries, an index must be built for the database. Traditional methods build an inverted index entry list only for single keywords, which considerably limits index efficiency in multi-keyword queries. To solve this problem, the present application proposes a multi-keyword index construction mechanism based on theme class cluster units.
Frequent itemsets are mined in each theme class cluster using an association rule algorithm. Each frequent itemset corresponds to one inverted index entry list, which contains all theme class cluster units in which that keyword or keyword combination directly occurs. The index structure is as follows:
Keyword(s) → (TCU1, TCU2, ...) (10)
The above theme index based on theme class cluster units is generated using the Lucene toolkit.
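As a non-limiting illustration, the index structure of formula (10) can be sketched in plain Python, independent of Lucene. The sketch enumerates keyword combinations by brute force instead of running an association rule algorithm such as Apriori, and it assumes that the support of a combination is the number of theme class cluster units that contain all of its keywords; the parameters `min_support` and `max_size`, the function name, and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_index(tcus, min_support=2, max_size=2):
    """Map each frequent keyword combination to the list of TCUs that contain it.

    tcus maps a TCU identifier to the set of keywords occurring in that unit.
    """
    index = defaultdict(list)
    for tcu_id in sorted(tcus):
        words = sorted(tcus[tcu_id])
        for size in range(1, max_size + 1):
            for combo in combinations(words, size):
                index[combo].append(tcu_id)
    # Keep only combinations that occur in at least min_support units.
    return {combo: ids for combo, ids in index.items() if len(ids) >= min_support}


# Hypothetical theme class cluster units and their keywords.
units = {
    "TCU1": {"film", "actor"},
    "TCU2": {"film", "actor", "award"},
    "TCU3": {"film"},
}
index = build_index(units)
print(index)
```

With such an index, a query that supplies both "actor" and "film" can be answered from the single entry for that combination, without intersecting two single-keyword lists.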
Specific embodiment six: this embodiment further explains the keyword query method based on theme class cluster units in a relational database described in any one of specific embodiments one to five. The detailed process of returning the query result to the user, described in step 3, is as follows:
Using the theme index based on theme class cluster units, the theme class cluster units that contain the keyword (or keyword group) can be returned efficiently to the user as the query result. Given a keyword query K = {k1, k2, ..., kn}, the system searches multiple theme indexes in parallel to find the corresponding theme class cluster units, computes the score of each relevant theme class cluster unit with an existing result ranking function, sorts the units in descending order of score, and returns the top-k theme class cluster units to the user.
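As a non-limiting illustration, the parallel lookup and top-k selection can be sketched in Python with a thread pool. The sketch assumes that each theme index is a dictionary keyed by a sorted tuple of keywords and that the ranking function is supplied by the caller; the application does not specify the ranking function, so the scores below, like the function names, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_index(index, keywords):
    """Look up the entry for the whole keyword group in one theme index."""
    return index.get(tuple(sorted(keywords)), [])


def top_k(indexes, keywords, score, k):
    """Search all theme indexes in parallel and return the k best-scoring TCUs."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda idx: search_index(idx, keywords), indexes)
        candidates = [tcu for hits in results for tcu in hits]
    return sorted(candidates, key=score, reverse=True)[:k]


# Two hypothetical theme indexes and hypothetical relevance scores.
theme_indexes = [
    {("actor", "film"): ["TCU1", "TCU2"]},
    {("actor", "film"): ["TCU7"]},
]
scores = {"TCU1": 0.4, "TCU2": 0.9, "TCU7": 0.6}
print(top_k(theme_indexes, ["film", "actor"], scores.get, k=2))
```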
To verify the beneficial effects of the present invention, the following simulation experiments were carried out:
The experiments use the public dataset Freebase. Freebase is an open structured database that is large in scale and has a certain structural complexity, containing about 2,000 tables and 39 million entities. Owing to limitations of the experimental conditions, the dataset was simplified in a way that does not affect the experimental results: the underlying database schema and the connection relationships between data tables were kept unchanged, and a portion of the data (400M) was extracted from the Freebase database as the dataset for the experiments. To verify the effect and performance of the TCU-Based query method proposed in this application, three groups of experiments were carried out. First, the effectiveness of the table join order optimization scheme is verified by comparing data preprocessing times. Second, the proposed TCU-Based query method is compared with the baseline methods DBXplorer, BLANKS, and SAINT to verify the efficiency and accuracy of the method of this application. Finally, the scalability of the method is verified by comparing query response times on datasets of different sizes.
The algorithms were run in a Java environment under the Microsoft Windows 7 operating system, on a machine with a Core(TM) 2.5 GHz CPU, 4 GB of memory, and a 500G hard disk.
Evaluation of the table join order optimization scheme
The data preprocessing time before and after table join order optimization was measured and compared across databases of different sizes to verify the effectiveness of the genetic-algorithm-based table join order optimization scheme proposed in this application. The experimental results are shown in Figure 7, where the abscissa is the number of data tables involved in data preprocessing and the ordinate is the data preprocessing time. The results show that using the table join order obtained by the genetic algorithm significantly improves preprocessing efficiency, and that the effect becomes more pronounced as the number of tables increases.
Method comparison
The performance of the TCU-Based query method of this application was evaluated by comparison with baseline query methods, in terms of both query efficiency and result accuracy. 100 keyword queries were selected at random from the query log of the database and executed on the Freebase dataset with each of the four methods DBXplorer, BLANKS, SAINT, and TCU-Based, and the query response time and the quality of the query results of each method were analyzed.
(1) Query efficiency
First, the average top-2 query response times of the four methods are compared for different numbers of query keywords. The query response time runs from the start of query execution until the top-2 query responses are produced and does not include the offline data preprocessing time. In Figure 8, the abscissa is the number of keywords and the ordinate is the average query response time. As the figure shows, the offline query methods SAINT and TCU-Based are clearly more efficient than the online query methods DBXplorer and BLANKS. The reason is that DBXplorer and BLANKS must perform complex table join operations during query processing, and such joins carry a large time overhead, especially in databases with a more complex structure. Offline query methods, by contrast, preprocess the tables and tuples in the database before the user issues a query and therefore significantly improve query speed. In addition, the TCU-Based query method proposed in this application builds a theme index for each theme class cluster, and the multiple indexes are searched in parallel after the user issues a query, which further improves query efficiency relative to the offline query method SAINT. Figure 8 also shows that the improvement in query efficiency is more pronounced when the number of keywords is 3 or more. The reason is that index terms are selected using association rules, so when the user supplies multiple query keywords, an index entry containing all of the keywords can be retrieved directly without joining index entries.
To further evaluate the performance of the query method of this application, the query methods were compared on top-k queries with k ranging from 2 to 20. In Figure 9, the abscissa is the value of k and the ordinate is the average query response time for each value of k. The query performance of the method of this application is clearly better than that of the other three baseline query methods. For example, when k is 12, the average query response times of DBXplorer, BLANKS, and SAINT are 13500 ms, 7452 ms, and 3420 ms, respectively, whereas the TCU-Based query method of this application takes only 2014 ms, so the three baselines take about 6.7, 3.7, and 1.7 times as long. The reasons are similar to those given for Figure 8 and are not repeated here.
(2) Query effectiveness
Query effectiveness is measured with two evaluation metrics, precision (accuracy rate) and recall. To compute them accurately, 100 SQL queries were first selected at random and their query responses were taken as the standard query results. The keywords in the SQL queries were then extracted and used as the input to the four query methods. The average precision and average recall are compared in Figure 10 and Figure 11. As Figure 10 shows, the precision of the query algorithms tends to decline as the number of keywords increases. In general, the precision of 1-keyword and 2-keyword queries is better than that of 3-keyword, 4-keyword, and 5-keyword queries, because the relationships between keywords become more complex as their number grows. In terms of precision, the TCU-Based method of this application is 3% to 7% higher than the similar method SAINT, and its advantage over the other two methods is more significant. As Figure 11 shows, the method of this application is significantly better than the existing methods in terms of recall, about 15% higher than SAINT.
Scalability evaluation
This experiment measures the average top-5 query response time. In Figure 12, the abscissa is the dataset size and the ordinate is the average top-5 query response time. As Figure 12 shows, as the database size increases step by step from 100 MB to 500 MB, the average query response time of the method of this application changes only slowly. This is because the increase in data volume mainly affects the tuple joins performed during preprocessing and has little effect on online index lookup. The experiments therefore show that the method of this application scales well across different dataset sizes.

Claims (1)

1. A keyword query method based on theme class cluster units in a relational database, comprising the following steps in sequence:
Step 1: construct the theme class cluster units;
Step 1.1: vertical grouping based on data table characteristics and the query log;
Step 1.2: propose a table join order optimization scheme within each theme class cluster;
Step 1.3: horizontal grouping based on the theme class cluster tuple association graph;
Step 2: establish an index optimization mechanism based on association rules;
Step 3: return the query result to the user;
characterized in that the detailed process of the vertical grouping based on data table characteristics and the query log described in step 1.1 is as follows:
Using the inter-table similarity matrix construction method, an initial input matrix is constructed from two aspects: table characteristics, which include inter-table topological compactness and inter-table content similarity, and the query log. The vertical grouping method takes the relational database and the user query log as input and outputs a set of theme class clusters. It is divided into three major modules: the input module, the similarity matrix construction module, and the output module. Input module: the relational database and its schema graph are taken as input, and the query log is also taken as input. Similarity matrix construction module: by analyzing and computing over the schema graph and the database from the input module, the topological compactness and content similarity between data tables are obtained, the topological compactness matrix and the content similarity matrix are constructed, and the inter-table similarity matrix is built on this basis; the query log is then analyzed statistically to further reinforce and correct this matrix. Finally, a set of theme class clusters is output as the result;
(1) Topological compactness between tables
Given the schema graph G = (V, E) of a database, the topological compactness between node vi and node vj is defined as follows:
where |vi| is the size of table Ti in the database and |vj| is the size of table Tj in the database; σ is the impact factor; the distance term is the logical distance between node vi and node vj, that is, the length of the path between node vi and node vj in the database schema graph. By the mathematical properties of the Gaussian function, for a given value of σ the range of influence of each node is approximately limited to a local region; when the logical distance between two nodes exceeds the extent of that region, the topological compactness between the two nodes rapidly decays to 0;
The topological compactness between any two nodes is calculated by formula (1), and the topological compactness matrix of the relational database is then constructed as follows:
(2) Content similarity between tables
A data table consists of a table name, attributes, and tuples, so the content similarity between tables can be analyzed in depth from two aspects: name similarity and value similarity;
Name similarity comprises two parts, table name similarity and attribute name similarity. Following the vector space method for computing the similarity between two entities, the keywords in the table name and attribute names of table Ti are first extracted to construct a vector Vi for Ti, and the keywords in the table name and attribute names of table Tj are extracted to construct a vector Vj for Tj; the name similarity is then calculated with the cosine function:
Sim1(Ti, Tj) = Sim(Vi, Vj) = Vi·Vj / (|Vi|·|Vj|) (2)
where Sim1(Ti, Tj) is the name similarity between table Ti and table Tj;
Sim(Vi, Vj) is the similarity between vector Vi and vector Vj;
|Vi| and |Vj| are the magnitudes of vector Vi and vector Vj, respectively;
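As a non-limiting illustration, formula (2) can be sketched in Python. The sketch assumes that keywords are obtained by lowercasing each name and splitting it on non-alphanumeric characters, and that each vector component is the number of times a keyword occurs; the tokenization rule, the sample names, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def name_vector(names):
    """Keyword counts built from a table name and its attribute names."""
    words = []
    for name in names:
        words.extend(w for w in re.split(r"[^a-z0-9]+", name.lower()) if w)
    return Counter(words)


def cosine(u, v):
    """Cosine of the angle between two sparse keyword vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical table and attribute names for two tables.
vi = name_vector(["film", "film_id", "title"])
vj = name_vector(["film_actor", "film_id", "actor_id"])
print(cosine(vi, vj))
```

Here the shared keywords are "film" and "id", which gives a dot product of 6 against a norm product of sqrt(6)·sqrt(12), so the name similarity is 1/sqrt(2).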
The value similarity is computed as follows:
1. Compute the content similarity between two attributes using the Jaccard coefficient;
J(u, v) = |u ∩ v| / |u ∪ v| (3)
where u is an attribute column of data table Ti and v is an attribute column of data table Tj;
2. Use a greedy matching strategy to detect the set Z of matching attribute pairs in the database;
3. Take the weighted average to obtain the value similarity between the two tables:
where |Ti| is the number of attribute columns in data table Ti; |Tj| is the number of attribute columns in data table Tj; max(|Ti|, |Tj|) is the larger of |Ti| and |Tj|; u.V is the coefficient of variation of attribute column u and v.V is the coefficient of variation of attribute column v. The coefficient of variation is a statistic that measures the degree of variation of the observations in a table: the smaller the coefficient of variation, the less rich the content of the attribute column; conversely, the larger the coefficient of variation, the richer the content of the attribute column. The formula also uses the standard deviation and the mean of attribute column u and the standard deviation and the mean of attribute column v; max(u.V, v.V) is the larger of u.V and v.V;
The content similarity between data table Ti and data table Tj is: Sim(Ti, Tj) = (Sim1(Ti, Tj) + Sim2(Ti, Tj)) / 2;
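As a non-limiting illustration, formula (3) and one possible greedy matching strategy can be sketched in Python. The matching rule shown here (repeatedly take the unmatched attribute pair with the highest Jaccard coefficient, ignoring pairs with no overlap) is an assumption, since the application does not spell the strategy out; the weighted averaging step is not covered, and the column data and function names are hypothetical.

```python
def jaccard(u, v):
    """Jaccard coefficient |u ∩ v| / |u ∪ v| of the value sets of two attribute columns."""
    u, v = set(u), set(v)
    union = u | v
    return len(u & v) / len(union) if union else 0.0


def greedy_match(cols_i, cols_j):
    """Greedily pair the attribute columns of two tables by descending Jaccard coefficient."""
    scored = sorted(
        ((jaccard(u, v), a, b) for a, u in cols_i.items() for b, v in cols_j.items()),
        reverse=True,
    )
    used_i, used_j, matched = set(), set(), []
    for score, a, b in scored:
        if score > 0 and a not in used_i and b not in used_j:
            matched.append((a, b, score))
            used_i.add(a)
            used_j.add(b)
    return matched


# Hypothetical attribute columns of two tables.
table_i = {"id": [1, 2, 3, 4], "city": ["a", "b"]}
table_j = {"fid": [3, 4, 5, 6], "town": ["a", "b", "c"]}
print(greedy_match(table_i, table_j))
```

In this example "city" and "town" share two of three distinct values and are matched first, after which "id" and "fid" are matched with a coefficient of 2/6.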
The content similarity matrix S between data tables is:
where l is the number of data tables in the database;
Considering both the structural similarity and the content similarity, the inter-table similarity matrix is obtained: A_DB = T + S;
where T is the topological compactness matrix of the relational database;
(3) Similarity matrix correction method
The query log records the history of user searches against the database and contains three fields: the user ID, the query Q, and the query result together with the data table T in which the result is located. The basic idea of the vertical grouping method with user feedback is to analyze the query records in the query log statistically and to correct the similarity matrix with the following boost function;
boost_log(Ti, Tj) = exp(log(count(Ti, Tj)) / log(max(count))) (5)
count(Ti, Tj) is the number of times table Ti and table Tj co-occur in the query log, and max(count) is the maximum co-occurrence count of any two tables in the query log;
The vertical grouping result above is reinforced with the information in the user query log, using the following score reinforcement function:
A_Final(Ti, Tj) = A_DB(Ti, Tj) × boost_log(Ti, Tj) (6)
The inter-table similarity matrix A_Final is thereby obtained, where l is the number of data tables in the database;
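As a non-limiting illustration, formulas (5) and (6) can be sketched in Python. Formula (5) is undefined when two tables never co-occur (the logarithm of 0) and when the maximum co-occurrence count is 1 (division by the logarithm of 1); the sketch assumes a neutral boost of 1 in both cases, which is an assumption of this illustration, and the function names and sample values are hypothetical.

```python
import math


def boost_log(count, max_count):
    """Boost factor of formula (5); returns 1.0 where the formula is undefined (assumption)."""
    if count < 1 or max_count < 2:
        return 1.0
    return math.exp(math.log(count) / math.log(max_count))


def reinforce(a_db, counts):
    """Apply formula (6) element-wise: A_Final = A_DB * boost_log."""
    n = len(a_db)
    max_count = max((c for row in counts for c in row), default=0)
    return [
        [a_db[i][j] * boost_log(counts[i][j], max_count) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical similarity matrix A_DB and co-occurrence counts for three tables.
a_db = [[0.0, 0.4, 0.2], [0.4, 0.0, 0.5], [0.2, 0.5, 0.0]]
counts = [[0, 100, 1], [100, 0, 10], [1, 10, 0]]
a_final = reinforce(a_db, counts)
print(a_final)
```

Under formula (5) the boost ranges from 1, for a pair that co-occurs once, to e, for the pair with the maximum co-occurrence count, so frequently co-queried tables are pulled closer together without ever lowering a similarity.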
(4) Vertical grouping
Vertical grouping based on improved spectral clustering
Input: G = (V, E), k, and the impact factor σ, where V = {v1, ..., vl} and |E| = m;
Output: the set of theme class clusters C = {C1, C2, ..., Ck};
Steps:
1. Construct the inter-table similarity matrix A_Final;
2. Compute the eigenvalues and eigenvectors, and use the first k eigenvectors u1, ..., uk to construct the eigenvector space R^k;
3. Map all nodes in V into the space R^k;
4. Use the k-means algorithm to cluster the nodes in R^k into the theme class clusters C1, C2, ..., Ck;
The specific content of the table join order optimization scheme within a theme class cluster described in step 1.2 is as follows:
During data preprocessing, the n tables T1, T2, ..., Tn in theme class cluster Ci = (T1, T2, ..., Tn) are joined to obtain a merged table Ti', and a table join order optimization scheme is designed based on a genetic algorithm. First, when the table join order is optimized with the genetic algorithm, different table join orders are represented as join trees, and each join tree is encoded by its preorder traversal; after encoding, poplength join trees are selected at random as the initial population, and genetic operations are applied to the first-generation population to produce genlength new individuals; the genetic operations use two operators, crossover and mutation; crossover operator: randomly chosen subtrees of the same size are exchanged, and any data table duplicated in a join tree after crossover is replaced with a data table that does not yet appear in it; mutation operator: the data tables at arbitrary non-zero positions of the join tree are exchanged; the cost of each join tree is then computed with a well-known join tree cost formula, and the poplength join trees with the lowest cost are selected as the next-generation population; the genetic operation and selection process is repeated until a predetermined number of iterations is reached, a reasonable value for which is obtained by analyzing and summarizing a large number of experiments; the lowest-cost join tree in the last generation of the evolutionary process is the optimal table join order obtained by the genetic algorithm;
The detailed process of the horizontal grouping based on the theme class cluster tuple association graph described in step 1.3 is as follows:
After vertical grouping and the table join operation within each theme class cluster, the set of theme class clusters C = (C1, C2, ..., Ck) is obtained, where each theme class cluster Ci contains one merged table Ti'; the theme class clusters are grouped horizontally using the theme class cluster tuple association graph, with a mixed similarity calculation method;
Mixed similarity calculation for theme class cluster tuples
An important data model in the horizontal grouping strategy is the theme class cluster tuple association graph G = (V, E), a weighted undirected graph in which the similarity between theme class cluster tuples serves as the edge weight;
Let ti' and tj' be any two tuples in theme class cluster Ci; the mixed similarity of the two tuples is calculated as follows:
1. Define different distance functions d_k, including the Euclidean distance, the edit distance, and the Hamming distance;
2. Map the theme class cluster tuples into an n-dimensional space, and compute the distance d_k(ti', tj') between two tuples according to each distance function;
3. Compute the mixed similarity between the two tuples according to the following formula;
where Sim_k(ti', tj') is the similarity between tuples ti' and tj' when the distance function is d_k;
the maximum term denotes the maximum value of the similarity between tuples ti' and tj' when the distance function is d_k, k ∈ {1, 2, 3};
Sim(ti', tj') is the mixed similarity between the two tuples ti' and tj';
Steps of horizontal grouping:
1. Calculate the mixed similarity between the theme class cluster tuples in theme class cluster Ci and construct the theme class cluster tuple association graph;
2. Divide the association graph into several connected components according to the similarity threshold φ;
3. Find the theme class cluster units whose number of theme class cluster tuples is greater than Minsize and calculate their cohesion; select the theme class cluster unit with the smallest cohesion and disconnect the edge with the smallest similarity in that unit;
4. Repeat step 3 until the number of tuples contained in each theme class cluster unit in theme class cluster Ci is less than Minsize;
5. Calculate the separation between theme class cluster units and merge the theme class cluster units with the smallest separation;
6. Repeat step 5 until the number of theme class cluster units in theme class cluster Ci, |Ci|, reaches the required value;
The cohesion and separation of theme class cluster units used in step 3 and step 5 are calculated by the following formulas:
where Sim(ti', tj') is the mixed similarity between theme class cluster tuples ti' and tj', and TCUk and TCUl denote two different theme class cluster units;
The detailed process of establishing the index optimization mechanism based on association rules described in step 2 is as follows: frequent itemsets are mined in each theme class cluster using an association rule algorithm; each frequent itemset corresponds to one inverted index entry list, which contains all theme class cluster units in which that keyword or keyword combination directly occurs; the index structure is as follows:
Keyword(s) → (TCU1, TCU2, ...) (10)
The above theme index based on theme class cluster units is generated using the Lucene toolkit;
The detailed process of returning the query result to the user described in step 3 is as follows:
Given a keyword query K = {k1, k2, ..., kn}, multiple theme indexes are searched in parallel to find the corresponding theme class cluster units, the score of each relevant theme class cluster unit is computed with a well-known result ranking function, the units are sorted in descending order of score, and the top-k theme class cluster units are returned to the user.
CN201610264735.4A 2016-04-25 2016-04-25 A kind of keyword query method based on theme class cluster unit in relational database Expired - Fee Related CN105975488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610264735.4A CN105975488B (en) 2016-04-25 2016-04-25 A kind of keyword query method based on theme class cluster unit in relational database

Publications (2)

Publication Number Publication Date
CN105975488A CN105975488A (en) 2016-09-28
CN105975488B true CN105975488B (en) 2019-06-18

Family

ID=56994549

Country Status (1)

Country Link
CN (1) CN105975488B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451210B (en) * 2017-07-13 2020-11-20 北京航空航天大学 Graph matching query method based on query relaxation result enhancement
CN107480199B (en) * 2017-07-17 2020-06-12 深圳先进技术研究院 Query reconstruction method, device, equipment and storage medium of database
CN109582698B (en) * 2017-09-29 2021-08-13 上海宽带技术及应用工程研究中心 Method, system, storage medium and terminal for updating query results of multiple continuous top-k keywords
CN110019299A (en) * 2017-11-16 2019-07-16 阿里巴巴集团控股有限公司 A kind of method and apparatus for creating or refreshing the off-line data set of analytic type data warehouse
CN108132927B (en) * 2017-12-07 2022-02-11 西北师范大学 Keyword extraction method for combining graph structure and node association
CN108197175B (en) * 2017-12-20 2021-12-10 国网北京市电力公司 Processing method and device of technical supervision data, storage medium and processor
CN108182520A (en) * 2017-12-22 2018-06-19 深圳市华云中盛科技有限公司 The method and its system of a kind of rapid modeling
CN109325019B (en) * 2018-08-17 2022-02-08 国家电网有限公司客户服务中心 Data association relationship network construction method
CN109241243B (en) * 2018-08-30 2020-11-24 清华大学 Candidate document sorting method and device
CN109670012A (en) * 2019-02-20 2019-04-23 湖北理工学院 What a kind of electric power foundation of civil work based on Internet of Things was checked and accepted instructs system and method
CN110263225A (en) * 2019-05-07 2019-09-20 南京智慧图谱信息技术有限公司 Data load, the management, searching system of a kind of hundred billion grades of knowledge picture libraries
CN110362798B (en) * 2019-06-17 2023-12-19 平安科技(深圳)有限公司 Method, apparatus, computer device and storage medium for judging information retrieval analysis
CN112559554B (en) * 2020-12-24 2024-01-26 北京百家科技集团有限公司 Query statement optimization method and device
CN112783952A (en) * 2021-03-16 2021-05-11 浪潮云信息技术股份公司 Method for constructing result set based on electronic official document keyword query
CN113722560A (en) * 2021-09-03 2021-11-30 南京协胜智能科技有限公司 Method for screening data center data search results
CN114116806A (en) * 2021-12-03 2022-03-01 北京天融信网络安全技术有限公司 Top-k ranking query and library falling method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036051A (en) * 2014-07-04 2014-09-10 南开大学 Database mode abstract generation method based on label propagation
CN104050162A (en) * 2013-03-11 2014-09-17 富士通株式会社 Data processing method and data processing device
CN104391908A (en) * 2014-11-17 2015-03-04 南京邮电大学 Locality sensitive hashing based indexing method for multiple keywords on graphs

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050162A (en) * 2013-03-11 2014-09-17 富士通株式会社 Data processing method and data processing device
CN104036051A (en) * 2014-07-04 2014-09-10 南开大学 Database mode abstract generation method based on label propagation
CN104391908A (en) * 2014-11-17 2015-03-04 南京邮电大学 Locality sensitive hashing based indexing method for multiple keywords on graphs

Also Published As

Publication number Publication date
CN105975488A (en) 2016-09-28

Similar Documents

Publication Publication Date Title
CN105975488B (en) A kind of keyword query method based on theme class cluster unit in relational database
US7392250B1 (en) Discovering interestingness in faceted search
CN105045875B (en) Personalized search and device
JP3860046B2 (en) Program, system and recording medium for information processing using random sample hierarchical structure
CN107590128B (en) Paper homonymy author disambiguation method based on high-confidence characteristic attribute hierarchical clustering method
Liu et al. Stratified sampling for data mining on the deep web
CN104699786A (en) Communication network complaint system for semantic intelligent search
CN106156271A (en) Related information directory system based on distributed storage and foundation thereof and using method
CN113157943A (en) Distributed storage and visual query processing method for large-scale financial knowledge map
Wang et al. Research and implementation of the customer-oriented modern hotel management system using fuzzy analytic hierarchical process (FAHP)
Wang et al. Aggregate queries on knowledge graphs: Fast approximation with semantic-aware sampling
Ibrahim et al. Compact weighted class association rule mining using information gain
Zou et al. Survey on learnable databases: A machine learning perspective
CN104317853B (en) A kind of service cluster construction method based on Semantic Web
CN110032676A (en) SPARQL query optimization method and system based on predicate association
CN105956012B (en) Database schema summarization method based on graph partitioning strategy
CN110162580A (en) Data mining and depth analysis method and application based on distributed early warning platform
Zhang et al. Leveraging data-analysis session logs for efficient, personalized, interactive view recommendation
Fang et al. A query-level distributed database tuning system with machine learning
JP7428250B2 (en) Method, system, and apparatus for evaluating document retrieval performance
Liu et al. EntityManager: Managing dirty data based on entity resolution
Tian et al. Retrieving deep web data through multi-attributes interfaces with structured queries
Ye et al. Generalized learning of neural network based semantic similarity models and its application in movie search
Zhao et al. Organizing structured deep web by clustering query interfaces link graph
Jain et al. Phrase based clustering scheme of suffix tree document clustering model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190618

Termination date: 20200425