CN105975488B - Keyword query method based on theme class cluster units in a relational database
- Publication number
- CN105975488B (application CN201610264735.4A)
- Authority
- CN
- China
- Prior art keywords
- class cluster
- theme class
- data
- theme
- tables
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A keyword query method based on theme class cluster units in a relational database, relating to the field of information retrieval and in particular to keyword query in relational databases based on theme class cluster units. The invention addresses two problems: existing online keyword query methods incur a large time overhead because tables are joined frequently during query processing, and existing offline keyword query methods are inefficient on large-scale databases with complex internal structure and huge data volumes. The method proceeds in the following steps: 1, theme class cluster unit construction, comprising (1) vertical grouping based on data table characteristics and the query log, (2) a table join order optimization scheme within each theme class cluster, and (3) horizontal grouping based on the theme class cluster tuple association graph; 2, establishing an optimized index mechanism based on association rules; 3, returning the query results to the user. The invention is applied in the field of information retrieval.
Description
Technical field
The present invention relates to the field of information retrieval, and in particular to a keyword query method based on theme class cluster units in a relational database.
Background art
In recent years, keyword query has been applied successfully as an important query technique in the field of information retrieval. Because it is easy to use, it has been adopted by more and more users. Relational databases likewise need a simple and effective query method for extracting the information a user is interested in from large and complex data. Traditional structured query methods, such as SQL queries, require the user not only to understand the complex underlying schema of the relational database but also to master the query language itself, which makes querying difficult and inconvenient. Keyword query techniques for relational databases have therefore received wide attention. Some known studies have tried to apply traditional keyword query methods directly to relational databases, but because relational databases must satisfy normalization requirements, information is scattered across different data tables, and a direct transfer does not give users a good query experience. A keyword query technique suited to relational databases therefore has to take the structural characteristics of the relational database itself into account.
Existing keyword query methods fall into two broad classes: online query and offline query. The main idea of online query is to model the relational database as a schema graph or a data graph; after the user submits a set of query keywords, the graph is traversed online and one or more subgraphs, candidate networks, or Steiner trees are returned as the query result. Because tables are joined continually during query processing, this class of methods has a high time cost. Offline query methods, in contrast, address this problem with data structures such as virtual documents or tuple units: before the user submits a query, the tables are joined by breadth-first traversal, which avoids the time overhead of frequent table joins during query processing. However, these offline methods do not consider query efficiency on large-scale relational databases. An enterprise database typically contains hundreds or thousands of data tables, so preprocessing with these methods takes considerable time. In addition, because the resulting joined tables are very large, even with an index it is difficult for a user to find the desired query response within a short time.
To solve this problem, the present application proposes TCU-Based query, an offline query method based on theme class cluster units (TCUs). First, the data in the database is grouped twice, vertically and then horizontally, to construct a data structure called the theme class cluster unit. Second, to further improve the efficiency of data preprocessing, the application designs a table join order optimization scheme based on a genetic algorithm. Finally, a theme index is built for each theme class cluster so that the indexes can work in parallel on multiple machine nodes, which significantly improves query speed. After the user submits a query, one or more theme class cluster units are returned as the query response; they contain more complete information and match the user's query intent.
Summary of the invention
Theory related to this application
Keyword query techniques can be broadly divided into online query and offline query. The main idea of online query is as follows: before querying, a schema graph or data graph corresponding to the database is constructed; after the user submits a query, the graph is traversed online to find the top-k candidate networks or Steiner trees, which are returned to the user as the query response. Offline query instead preprocesses the database offline, constructs virtual documents or tuple units, and returns the top-k query responses using information retrieval techniques. This avoids online table joins and graph traversal and so markedly improves query processing efficiency.
1, online query
According to the number of databases a query involves, online query can be further divided into keyword query over a single relational database and keyword query over distributed databases.
Keyword query over a single relational database
In the DBXplorer and DISCOVER systems, the database is modeled as a schema graph $G$ in which nodes represent relations and edges represent primary-foreign key constraints between relations; the result of a keyword query is a set of candidate networks. Systems such as BANKS, BANKS-II, and BLINKS instead use keyword query methods based on the data graph and search the underlying data graph directly for Steiner trees that contain the keywords. BANKS uses a backward expanding search algorithm, whose performance degrades severely when it encounters a node with a large in-degree. Building on it, BLINKS proposes a new search method, bidirectional search, which significantly improves search performance.
Keyword query over distributed databases
To solve the keyword query problem in distributed databases, known research mainly follows several strategies. The Kite system combines schema matching and structure discovery techniques to find primary-foreign key joins across heterogeneous databases, and thereby solves the keyword query problem over heterogeneous relational databases. Hristidis, V. et al. build a keyword relationship matrix $KRM_i$ for each database $DB_i$ as a summary of the database. For each keyword pair $(k_i, k_j)$, the entry $KRM_i(k_i, k_j)$ records the frequency with which keywords $k_i$ and $k_j$ occur at different distances. However, the keyword relationship matrix only uses binary relationships between keywords to prune databases that cannot serve as query responses. To overcome this shortcoming, the G-KS method was proposed, which uses a keyword relationship graph to characterize the complex relationships among keywords: nodes in the graph represent keywords and edges represent relationships between them. The keyword relationship graph can therefore be used to compute the similarity between a database and a keyword query and so retrieve the most promising databases.
2, offline query
The query methods above do not address the high time overhead caused by online table joins. In general, this problem can be alleviated by preprocessing the tables and tuples in the database. In recent years the offline query problem for relational databases has been studied in depth and preliminary solutions have been proposed. Feldman, P. et al. first proposed the concepts of text object and virtual document: the table joins in the database are completed before the user submits a query, which markedly improves query efficiency. Teorey, T.J. et al. extended the text object further by merging tuples that share the same attribute value to build a data structure with more complete content, the tuple unit, and by returning multiple interconnected tuple units to the user as the query response, which effectively improves result precision. These methods only consider simple table joins and group by operations on tuples and are not suitable for databases with complex internal table structure. The method described in this application optimizes the table join order and defines a more reasonable data structure, the theme class cluster unit, and thereby significantly improves query efficiency and precision.
3, vertical grouping
Given a database $D(T_1, T_2, \dots, T_l)$ consisting of $l$ tables, vertical grouping means dividing the tables of $D$, according to a reasonable partitioning strategy, into a set of theme class clusters $C_D = \{C_1, C_2, \dots, C_k\}$ such that: (1) each theme class cluster $C_i \in C_D$ contains a group of data tables that are closely associated in structure and related in content, $C_i = \{T_1, T_2, \dots, T_j\}$; (2) $C_i \cap C_j = \varnothing$ for all $i \neq j$; (3) for all $i$, [condition missing from the source text].
4, horizontal grouping
Given a data table $T'(t_1, t_2, \dots, t_n)$, where $t_i$ denotes a tuple, horizontal grouping of the data table means assigning tuples with high similarity to the same set $\Gamma_i$ according to some similarity measure function. As shown in Fig. 1, after the horizontal grouping operation the $n$ tuples in the table American football are divided into $m$ units $\Gamma_1, \Gamma_2, \dots, \Gamma_m$, where $m \le n$.
5, theme class cluster tuple association graph
Given a table $T'(t'_1, t'_2, \dots, t'_n)$ in a theme class cluster, where $t'_i$ denotes a theme class cluster tuple, the theme class cluster tuple association graph is a weighted undirected graph $G = (V, E)$ in which vertex $v_i \in V$ represents the theme class cluster tuple $t'_i$. If the similarity between two tuples $t'_i$ and $t'_j$ satisfies $s_{ij}(t'_i, t'_j) > 0$, there is an edge $e_{ij} \in E$ between nodes $v_i$ and $v_j$, and the weight of edge $e_{ij}$ is $s_{ij}(t'_i, t'_j)$. Fig. 2 shows the theme class cluster tuple association graph of table $T'$.
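The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_association_graph` and the placeholder similarity `shared_value_ratio` are hypothetical names, and the mixed similarity actually used by the method is defined later in the description, so a simple stand-in is used here.

```python
from itertools import combinations


def shared_value_ratio(a, b):
    """Placeholder similarity (an assumption): fraction of positions with equal values."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_association_graph(tuples, similarity):
    """Weighted undirected graph as an adjacency dict: graph[i][j] = s_ij when s_ij > 0."""
    graph = {i: {} for i in range(len(tuples))}
    for i, j in combinations(range(len(tuples)), 2):
        s = similarity(tuples[i], tuples[j])
        if s > 0:  # an edge exists only for strictly positive similarity
            graph[i][j] = s
            graph[j][i] = s
    return graph


# Toy merged table with three tuples (illustrative data only).
rows = [("NFL", "Dallas", "TX"), ("NFL", "Houston", "TX"), ("NBA", "Boston", "MA")]
graph = build_association_graph(rows, shared_value_ratio)
```

In this toy table the first two tuples share two of three values and are connected, while the third tuple shares nothing with the others and stays isolated.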
6, theme class cluster unit
Given a database $D(T_1, T_2, \dots, T_l)$ containing $l$ interconnected data tables: first, vertical grouping yields $k$ theme class clusters $C = \{C_1, C_2, \dots, C_k\}$; second, the tables in each theme class cluster $C_i$ are joined according to their primary-foreign key relationships to obtain a merged table $T'_i(t_1, t_2, \dots, t_n)$; finally, the $n$ tuples of the merged table $T'_i$ are grouped horizontally to obtain $\Gamma_1, \Gamma_2, \dots, \Gamma_m$, where each $\Gamma_i$ is called a theme class cluster unit. The theme class cluster units $\Gamma_1, \Gamma_2, \dots, \Gamma_m$ of the theme class cluster American football are shown in Fig. 1.
7, top-k keyword query
Given a keyword query $Q = \{k_1, k_2, \dots, k_m\}$ and a database $D(T_1, T_2, \dots, T_l)$ containing $l$ data tables, a top-k keyword query returns the $k$ theme class cluster units with the highest relevance scores as the query result.
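A minimal sketch of top-k selection is given below. The relevance score is not specified at this point in the text, so `keyword_hits` (the number of query keywords found in a unit) is a hypothetical stand-in, and representing a theme class cluster unit as a list of tuples is likewise an assumption.

```python
import heapq


def keyword_hits(query, unit):
    """Hypothetical relevance score: how many query keywords occur in the unit."""
    tokens = {str(value).lower() for row in unit for value in row}
    return sum(1 for kw in query if kw.lower() in tokens)


def top_k_units(query, units, k, score=keyword_hits):
    """Return the k units with the highest positive score, best first."""
    scored = ((score(query, unit), idx) for idx, unit in enumerate(units))
    best = heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))
    return [units[idx] for _, idx in best]


# Illustrative units only; each unit is a list of tuples.
units = [
    [("Dallas", "Cowboys")],
    [("Boston", "Celtics"), ("Boston", "Bruins")],
    [("Dallas", "Mavericks"), ("Cowboys", "fan")],
]
result = top_k_units(["dallas", "cowboys"], units, k=2)
```

`heapq.nlargest` avoids sorting every candidate when k is small relative to the number of units.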
8, overall framework of the application
Most existing keyword query methods must join tables continually after the user submits a specific query and therefore incur a large time overhead. The concept of offline query was proposed to solve this kind of problem, but existing offline methods do not consider queries over relational databases with complex internal structure. The main idea of this application is: group the data tables twice, first vertically and then horizontally, to construct a set of theme class cluster units; design an optimal table join order selection scheme based on a genetic algorithm to reduce the cost of data preprocessing; and finally use an association rule algorithm to build a theme index for each theme class cluster, thereby markedly improving query efficiency and accuracy on relational databases.
Fig. 3 is the architecture of the query method of this application, which is divided into an online part and an offline part.
Online query part
The user submits a set of query keywords to the query processor online; using the theme index, the query processor returns the theme class cluster units that contain one or more of the keywords to the user as the query response.
Offline data preprocessing part
This part consists of four modules: the vertical grouping module, the table join order optimization module, the horizontal grouping module, and the theme index construction module.
(1) vertical grouping module
This module takes the relational database and the user query log as input. It applies an optimal graph partitioning strategy, an improved spectral clustering algorithm, combined with the characteristics of the relational database itself, and considers both the content of the tables and the user feedback from the query log, to group the data tables vertically into a set of theme class clusters. Each theme class cluster contains a group of data tables that are not only closely associated in structure and similar in content but also co-occur frequently in the user query log.
(2) table join order optimization module
For each theme class cluster produced by the vertical grouping module, the data tables in it are joined according to their primary-foreign key relationships. Because a large database contains a large number of tables, the join operations take considerable time and therefore need to be optimized. This module designs an optimal table join order selection scheme based on a genetic algorithm, which substantially reduces the join cost.
(3) horizontal grouping module
This module takes the set of theme class clusters obtained by the preceding modules as input, computes the mixed similarity between tuples within each theme class cluster, and builds a tuple association graph for each theme class cluster. A hierarchical clustering algorithm is then used to group each theme class cluster horizontally, forming the set of theme class cluster units.
(4) theme index construction module
This module takes the set of theme class cluster units obtained by the horizontal grouping module as input, selects index terms using association rules, and then builds a theme index for each theme class cluster.
This application addresses two problems: existing online keyword query methods incur a large time overhead because tables are joined frequently during query processing, and existing offline keyword query methods are inefficient on large-scale databases with complex internal structure and huge data volumes. To this end it proposes a keyword query method based on theme class cluster units in a relational database.
A keyword query method based on theme class cluster units in a relational database proceeds in the following steps:
Step 1, theme class cluster unit construction:
Step 1.1, vertical grouping based on data table characteristics and the query log;
Step 1.2, table join order optimization within each theme class cluster;
Step 1.3, horizontal grouping based on the theme class cluster tuple association graph;
Step 2, establish an optimized index mechanism based on association rules;
Step 3, return the query results to the user.
The present invention has the following beneficial effects:
1, an offline query method based on theme class cluster units is proposed that is suitable for keyword query on large-scale relational databases;
2, a new data structure, the theme class cluster unit, is constructed. An improved spectral clustering algorithm and the theme class cluster tuple association graph are used to group data tables vertically and tuples horizontally, respectively; building the TCU set offline and using it as the query response not only substantially reduces query response time but also returns richer and more complete thematic semantic information;
3, a table join order optimization scheme based on a genetic algorithm is designed, which reduces the preprocessing time overhead;
4, index terms are selected with an association rule algorithm and an index is then built for each theme class cluster, which significantly speeds up queries.
Brief description of the drawings
Fig. 1 is a schematic diagram of the horizontal grouping of the theme class cluster American football;
Fig. 2 is the theme class cluster tuple association graph;
Fig. 3 is the architecture diagram of the query based on theme class cluster units;
Fig. 4 is the architecture diagram of the vertical grouping method;
Fig. 5 is the flowchart of the table join order optimization scheme based on a genetic algorithm;
Fig. 6 is a diagram of the correspondence between a join tree and an integer sequence;
Fig. 7 is a comparison of preprocessing time;
Fig. 8 is a comparison of average response time for different numbers of keywords;
Fig. 9 is a comparison of average response time for different values of K;
Fig. 10 shows the relationship between the number of keywords and average precision;
Fig. 11 shows the relationship between the number of keywords and average recall;
Fig. 12 shows the influence of data set size on query performance.
Specific embodiments
To make the above objects, features, and advantages of the invention clearer and easier to understand, the invention is described in further detail below with reference to Fig. 4 to Fig. 6 and specific embodiments.
Specific embodiment 1: the keyword query method based on theme class cluster units in a relational database described in this embodiment proceeds in the following steps:
Step 1, theme class cluster unit construction:
Step 1.1, vertical grouping based on data table characteristics and the query log;
Step 1.2, table join order optimization within each theme class cluster;
Step 1.3, horizontal grouping based on the theme class cluster tuple association graph;
Step 2, establish an optimized index mechanism based on association rules;
Step 3, return the query results to the user.
This embodiment has the following beneficial effects:
1, an offline query method based on theme class cluster units is proposed that is suitable for keyword query on large-scale relational databases;
2, a new data structure, the theme class cluster unit, is constructed. An improved spectral clustering algorithm and the theme class cluster tuple association graph are used to group data tables vertically and tuples horizontally, respectively; building the TCU set offline and using it as the query response not only substantially reduces query response time but also returns richer and more complete thematic semantic information;
3, a table join order optimization scheme based on a genetic algorithm is designed, which reduces the preprocessing time overhead;
4, index terms are selected with an association rule algorithm and an index is then built for each theme class cluster, which significantly speeds up queries.
Specific embodiment 2: this embodiment further describes the keyword query method based on theme class cluster units in a relational database of specific embodiment 1. The detailed process of the vertical grouping based on data table characteristics and the query log in step 1.1 is as follows. The application uses an inter-table similarity matrix construction method that builds the initial input matrix from two sources: table characteristics (the topological compactness between tables and the content similarity between tables) and the query log. The vertical grouping method takes the relational database $D$ and the user query log as input and outputs a set of theme class clusters; the detailed process is shown in Fig. 4. The method consists of three main modules: the input module, the similarity matrix construction module, and the output module. Input module: the relational database and its schema graph are system inputs, describing the content information and the structural information of the database respectively; the query log is also a system input and indirectly reflects the information distribution characteristics of the database. Similarity matrix construction module: by analyzing the schema graph and the database from the input module, the topological compactness and content similarity between data tables are computed, the topological compactness matrix and the content similarity matrix are constructed, and the inter-table similarity matrix is built from them; in addition, statistical analysis of the query log is used to further reinforce and correct this matrix. Output module: a set of theme class clusters is output as the result.
(1) topological compactness between tables
Given the schema graph $G = (V, E)$ of the database, the topological compactness between node $v_i$ and node $v_j$ is defined by formula (1) [formula (1) is missing from the source text], where $|v_i|$ is the size of table $T_i$ in the database and $|v_j|$ is the size of table $T_j$; $\sigma$ is an influence factor: the larger $\sigma$ is, the stronger the interaction between nodes, and the smaller it is, the weaker the interaction. The logical distance between node $v_i$ and node $v_j$ is the path length between $v_i$ and $v_j$ in the database schema graph. By the mathematical properties of the Gaussian function, for a given value of $\sigma$ the range of influence of each node is approximately a local region [its radius is missing from the source text]; when the logical distance between two nodes exceeds that value, the topological compactness between them decays rapidly to 0.
The topological compactness between every pair of nodes is computed by formula (1), and the topological compactness matrix $T$ of the relational database is then constructed [the matrix is missing from the source text].
(2) content similarity between tables
A data table consists of a table name, attributes, and tuples, so the content similarity between tables can be analyzed from two aspects: naming similarity and value similarity.
Naming similarity consists of table name similarity and attribute name similarity. The application uses the vector space method for computing the similarity between two entities: first the keywords in the table name and attribute names of table $T_i$ are extracted to construct a vector $V_i$ for $T_i$, and the keywords in the table name and attribute names of table $T_j$ are extracted to construct a vector $V_j$ for $T_j$; the naming similarity is then computed with the cosine function:
$$Sim_1(T_i, T_j) = Sim(V_i, V_j) = \frac{V_i \cdot V_j}{|V_i| \, |V_j|} \quad (2)$$
where $Sim_1(T_i, T_j)$ is the naming similarity between table $T_i$ and table $T_j$; $Sim(V_i, V_j)$ is the similarity between vector $V_i$ and vector $V_j$; and $|V_i|$ and $|V_j|$ are the norms of vectors $V_i$ and $V_j$ respectively.
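Formula (2) can be illustrated with a short Python sketch. The text does not say how keywords are extracted from names, so splitting on underscores and whitespace in `name_vector` is an assumption, and both function names are hypothetical.

```python
import math
from collections import Counter


def name_vector(table_name, attribute_names):
    """Bag-of-keywords vector built from a table name and its attribute names.

    Assumption: keywords are obtained by lowercasing and splitting on '_' and spaces.
    """
    words = []
    for name in [table_name, *attribute_names]:
        words.extend(name.lower().replace("_", " ").split())
    return Counter(words)


def cosine(vi, vj):
    """Cosine of the angle between two sparse keyword vectors, as in formula (2)."""
    dot = sum(vi[w] * vj[w] for w in vi.keys() & vj.keys())
    norm_i = math.sqrt(sum(c * c for c in vi.values()))
    norm_j = math.sqrt(sum(c * c for c in vj.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0


# Illustrative schemas only.
v_player = name_vector("player", ["player_id", "team_id"])
v_team = name_vector("team", ["team_id", "city"])
sim1 = cosine(v_player, v_team)
```

Tables whose names and attribute names share keywords such as `team` and `id` receive a naming similarity between 0 and 1; tables with no keyword in common receive 0.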
The value similarity is computed as follows:
1, compute the content similarity between two attributes using the Jaccard coefficient:
$$J(u, v) = \frac{|u \cap v|}{|u \cup v|} \quad (3)$$
where $u$ is an attribute column of data table $T_i$ and $v$ is an attribute column of data table $T_j$;
2, use a greedy matching strategy to find the set $Z$ of matched attribute pairs in the database;
3, take a weighted average to obtain the value similarity $Sim_2(T_i, T_j)$ between the two tables [formula (4) is missing from the source text],
where $|T_i|$ is the number of attribute columns in data table $T_i$; $|T_j|$ is the number of attribute columns in data table $T_j$; $\max(|T_i|, |T_j|)$ is the larger of $|T_i|$ and $|T_j|$; $u.V$ is the coefficient of variation of attribute column $u$ and $v.V$ is the coefficient of variation of attribute column $v$. The coefficient of variation is a statistic that measures the degree of variation of the observations in a table, computed as the standard deviation of the attribute column divided by its mean. The smaller the coefficient of variation, the less rich the content of the attribute column; the larger the coefficient of variation, the richer the content. $\max(u.V, v.V)$ is the larger of $u.V$ and $v.V$.
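Steps 1 and 2 can be sketched as follows. Formula (3) is implemented directly; the greedy matching in `greedy_match` is one plausible reading of "greedy matching strategy" (repeatedly take the highest-scoring pair whose columns are both still unmatched) and should be treated as an assumption, as should the representation of a table as a dict from column name to values. The weighted average of step 3 is not shown.

```python
def jaccard(u, v):
    """Formula (3): size of the intersection over size of the union of two value sets."""
    su, sv = set(u), set(v)
    union = su | sv
    return len(su & sv) / len(union) if union else 0.0


def greedy_match(table_i, table_j):
    """Greedily pair columns of two tables by descending Jaccard score.

    table_i, table_j: dict mapping column name -> list of values (assumed layout).
    Returns a list of (column_i, column_j, score) with each column used at most once.
    """
    scored = sorted(
        ((jaccard(u, v), a, b) for a, u in table_i.items() for b, v in table_j.items()),
        reverse=True,
    )
    used_i, used_j, pairs = set(), set(), []
    for score, a, b in scored:
        if score > 0 and a not in used_i and b not in used_j:
            pairs.append((a, b, score))
            used_i.add(a)
            used_j.add(b)
    return pairs


# Illustrative columns only.
table_i = {"city": ["NY", "LA"], "name": ["x", "y"]}
table_j = {"town": ["NY", "SF"], "label": ["y"]}
Z = greedy_match(table_i, table_j)
```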
In summary, the content similarity between data table $T_i$ and data table $T_j$ is $Sim(T_i, T_j) = (Sim_1(T_i, T_j) + Sim_2(T_i, T_j))/2$.
The content similarity matrix $S$ between data tables is the $l \times l$ matrix whose entry $(i, j)$ is $Sim(T_i, T_j)$, where $l$ is the number of data tables in the database.
Considering similarity in both structure and content, the inter-table similarity matrix is $A_{DB} = T + S$, where $T$ is the topological compactness matrix of the relational database.
(3) similarity matrix correction method
The query log records the history of user retrievals from the database and contains three fields: the user ID, the query $Q$, and the query results together with the data tables $T$ in which the results are located. The basic idea of the vertical grouping method with user feedback is to analyze the query records in the query log statistically and to correct the similarity matrix with the following boost function:
$$boost_{log}(T_i, T_j) = \exp\left(\frac{\log(count(T_i, T_j))}{\log(\max(count))}\right) \quad (5)$$
$count(T_i, T_j)$ is the number of times table $T_i$ and table $T_j$ co-occur in the query log, and $\max(count)$ is the maximum co-occurrence count of any two tables in the query log. By formula (5), the more often two tables co-occur in the query log, the more their compactness score is reinforced.
The information in the user query log is used to reinforce the vertical grouping result through the following score reinforcement function:
$$A_{Final}(T_i, T_j) = A_{DB}(T_i, T_j) \times boost_{log}(T_i, T_j) \quad (6)$$
This yields the final inter-table similarity matrix $A_{Final}$, an $l \times l$ matrix, where $l$ is the number of data tables in the database.
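The boost function of formula (5) and the element-wise product of formula (6) can be sketched as below. As written, formula (5) is undefined when a pair never co-occurs (logarithm of 0) or when the maximum count is 1 (division by log 1 = 0); returning a neutral factor of 1.0 in those cases is an assumption of this sketch, not something the text specifies. The function names are hypothetical.

```python
import math


def boost_log(count, max_count):
    """Formula (5): exp(log(count) / log(max_count)).

    Assumption: where the formula is undefined (count < 1 or max_count <= 1),
    return 1.0 so the similarity is left unchanged.
    """
    if count < 1 or max_count <= 1:
        return 1.0
    return math.exp(math.log(count) / math.log(max_count))


def reinforce(a_db, counts):
    """Formula (6): multiply each entry of A_DB by the boost of the matching count."""
    max_count = max((c for row in counts for c in row), default=0)
    return [
        [a * boost_log(c, max_count) for a, c in zip(a_row, c_row)]
        for a_row, c_row in zip(a_db, counts)
    ]


# Illustrative 2x2 matrices: similarity 2.0 between the two tables, 10 co-occurrences.
a_final = reinforce([[0.0, 2.0], [2.0, 0.0]], [[0, 10], [10, 0]])
```

For counts of at least 1 the factor ranges from 1 (a single co-occurrence) to e (the most frequently co-occurring pair), so the log never lowers a similarity score.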
(4) vertical grouping
The application performs vertical grouping with an improved spectral clustering algorithm.
Input: $G = (V, E)$, $k$, and the influence factor $\sigma$, where $V = \{v_1, \dots, v_l\}$ and $|E| = m$;
Output: the set of theme class clusters $C = \{C_1, C_2, \dots, C_k\}$;
Steps:
1. construct the inter-table similarity matrix $A_{Final}$;
2. compute the eigenvectors and eigenvalues, and use the first $k$ eigenvectors $u_1, \dots, u_k$ to construct the eigenvector space $R^k$;
3. map all nodes in $V$ into the space $R^k$;
4. use the k-means algorithm to cluster the nodes in $R^k$ into the theme class clusters $C_1, C_2, \dots, C_k$.
Specific embodiment three, present embodiment are in a kind of relational database described in specific embodiment one or two
The further explanation of keyword query method based on theme class cluster unit, table connects in proposition theme class cluster described in step 1 bis-
Connect the detailed process of sequential optimization scheme are as follows:
In order to avoid table attended operation complicated in query process, needed in data prediction by theme class cluster Ci=
(T1,T2,...,Tn) in n table T1,T2,...,TnIt is attached to obtain consolidated statement T 'i.Existing method is only in accordance with main outer
Key relationship carries out breadth first traversal and is attached to table.In large database, hundreds and thousands of tables of data are generally comprised,
It needs to pay biggish time overhead using above method, pre-processes efficiency by extreme influence.For this problem, the present invention
Based on genetic Algorithm Design table order of connection prioritization scheme, as shown in Figure 5.
Firstly, one step of most critical is compiled to tables of data when carrying out the optimization of the table order of connection with genetic algorithm
Code.The different table order of connection is expressed using the form of threaded tree, in order to retain the characteristic information of threaded tree comprehensively, using first root
The form of traversal threaded tree is encoded.It is as shown in Figure 6:
After encoding, poplength join trees are randomly selected as the initial population, and genetic operations are applied to the first-generation population to generate genlength new individuals. The genetic operations use two operators, crossover and mutation. Crossover operator: randomly chosen subtrees of the same size are exchanged; any data table that appears more than once on a join tree after crossover is replaced with a data table that does not yet appear. Mutation operator: the data tables at any non-zero positions of the join tree are swapped. The cost of each join tree is then calculated according to the well-known join tree cost formula, and the poplength join trees with the lowest cost are selected as the next-generation population. These genetic and selection steps are repeated until a predetermined number of iterations is reached; a reasonable value for this number is obtained by analyzing and summarizing a large number of experiments. The join tree with the lowest cost in the last generation is the optimal table join order found by the genetic algorithm.
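The loop described above (initial population, crossover and mutation, cost-based selection, fixed number of iterations) can be illustrated with a minimal Python sketch. Several parts are assumptions made only for illustration and are not taken from the patent: a join order is encoded as a flat permutation of table names (a left-deep join) rather than a pre-order-encoded join tree, the crossover copies a slice instead of exchanging subtrees, `join_cost` is a hypothetical stand-in for the "well-known join tree cost formula", and selection keeps the best of parents plus offspring. Only the parameter names `poplength` and `genlength` come from the text.

```python
import random

def join_cost(order, sizes, selectivity=0.01):
    """Hypothetical cost: sum of estimated intermediate result sizes of a left-deep join."""
    rows = sizes[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * sizes[table] * selectivity  # assumed uniform join selectivity
        cost += rows
    return cost

def crossover(a, b, rng):
    """Keep a random slice of parent a in place and fill the rest in parent b's order,
    so that no table appears twice in the child."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    middle = a[i:j]
    rest = [t for t in b if t not in middle]
    return rest[:i] + middle + rest[i:]

def mutate(order, rng):
    """Swap the tables at two random positions."""
    i, j = rng.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return child

def optimize_join_order(sizes, poplength=20, genlength=40, iterations=50, seed=0):
    """sizes: dict mapping table name -> row count (at least two tables)."""
    rng = random.Random(seed)
    tables = list(sizes)
    population = [rng.sample(tables, len(tables)) for _ in range(poplength)]
    for _ in range(iterations):
        offspring = []
        for _ in range(genlength):
            a, b = rng.sample(population, 2)
            child = crossover(a, b, rng)
            if rng.random() < 0.3:  # assumed mutation rate
                child = mutate(child, rng)
            offspring.append(child)
        # Keep the poplength cheapest orders as the next generation.
        ranked = sorted(population + offspring, key=lambda o: join_cost(o, sizes))
        population = ranked[:poplength]
    return population[0]
```

Calling `optimize_join_order({"A": 100, "B": 10, "C": 1000})` returns one permutation of the table names; with a real cost model the same loop would operate on encoded join trees instead.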
Specific embodiment four: this embodiment further describes the keyword query method based on theme class cluster units in a relational database according to any one of specific embodiments one to three. The detailed process of the horizontal grouping based on the theme class cluster tuple associated graph in step 1.3 is as follows:
After vertical grouping and the table join operation within each theme class cluster, the theme class cluster set C=(C1,C2,...,Ck) is obtained, where each theme class cluster Ci contains one merged table T′i. Users generally want several related tuples combined into a single response result, so further grouping each theme class cluster with a reasonable horizontal grouping method can effectively improve query speed. Existing horizontal grouping methods simply use the database's GROUP BY operation, which may assign tuples that have different attribute values but high similarity to different groups, making the horizontal grouping result unreasonable. The present invention groups each theme class cluster horizontally using a theme class cluster tuple associated graph, which improves query efficiency and makes the grouping of data better match users' query needs. In addition, a mixed similarity calculation method makes the computed similarity between tuples more accurate and gives the method good scalability.
Mixed similarity calculation for theme class cluster tuples
An important data model in the horizontal grouping strategy is the theme class cluster tuple associated graph G=(V, E), a weighted undirected graph in which the similarity between theme class cluster tuples serves as the edge weight. To improve the accuracy and scalability of the similarity calculation between theme class cluster tuples, this subsection proposes a mixed similarity calculation method.
Let t′i and t′j be any two tuples in theme class cluster Ci. The mixed similarity of the two tuples is calculated as follows:
1. Define the distance functions dk, k ∈ {1,2,3}: Euclidean distance, edit distance and Hamming distance;
2. Map the theme class cluster tuples into an n-dimensional space, and compute the distance dk(t′i,t′j) between the two tuples with each distance function;
3. Compute the mixed similarity between the two tuples according to the following formula (omitted in the source text).
Where Simk(t′i,t′j) is the similarity between tuples t′i and t′j under distance function dk;
the maximum term is the largest value of Simk(t′i,t′j) over the distance functions dk, k ∈ {1,2,3};
Sim(t′i,t′j) is the mixed similarity between the two tuples t′i and t′j.
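The three distance functions named in step 1 are standard and can be written out directly; the Python sketch below does so. Everything beyond the distances is an assumption for illustration only: the conversion `1 / (1 + d)` from a distance to a similarity, the tuple layout (a numeric part, a text part and a fixed-length code part), and the plain average used in `mixed_similarity` are hypothetical placeholders, since the patent's own mixing formula is not available in this text.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Hamming distance: number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(s, t):
    """Levenshtein edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def to_similarity(distance):
    """Assumed mapping from a distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

def mixed_similarity(t1, t2):
    """Hypothetical mix: average of the three per-distance similarities.
    Each tuple is a dict with 'numeric', 'text' and 'codes' fields (assumed layout)."""
    sims = [
        to_similarity(euclidean(t1["numeric"], t2["numeric"])),
        to_similarity(edit_distance(t1["text"], t2["text"])),
        to_similarity(hamming(t1["codes"], t2["codes"])),
    ]
    return sum(sims) / len(sims)
```

Choosing which distance applies to which attribute type (numeric, string, categorical) is a design decision this sketch makes for readability; the source only states that the three distances are combined.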
Steps of horizontal grouping:
1. Compute the mixed similarity between every pair of theme class cluster tuples in theme class cluster Ci and build the theme class cluster tuple associated graph;
2. Divide the associated graph into several connected components according to the similarity threshold φ;
3. Find the theme class cluster units whose number of tuples is greater than Minsize and compute their fusion degree; select the theme class cluster unit with the smallest fusion degree and disconnect the edge with the smallest similarity in that unit;
4. Repeat step 3 until the number of tuples in every theme class cluster unit is less than Minsize;
5. Compute the separation degree between theme class cluster units, and merge the theme class cluster units with the smallest separation degree;
6. Repeat step 5 until the number |Ci| of theme class cluster units in theme class cluster Ci reaches the required value.
The fusion degree and separation degree of theme class cluster units used in steps 3 and 5 are calculated by formulas (8) and (9) (omitted in the source text),
where Sim(t′i,t′j) is the mixed similarity between theme class cluster tuples t′i and t′j, and TCUk and TCUl denote two different theme class cluster units.
Formulas (8) and (9) involve three parameters: the similarity threshold φ, the final number k of theme class cluster units, and Minsize. The first two are set according to user requirements, and Minsize is set to 1%~3% of the total data volume.
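Steps 1 and 2 of the horizontal grouping (build the tuple associated graph, then cut it into connected components at the threshold φ) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the similarity function is passed in by the caller, edges with similarity at least `phi` are kept, and components are found with a union-find structure. The function name `split_by_threshold` is hypothetical. Steps 3 to 6 are not covered, because they depend on the fusion and separation formulas, which are not available in this text.

```python
from itertools import combinations

def split_by_threshold(tuples, sim, phi):
    """Return the connected components (as sorted lists of tuple indexes) of the graph
    whose edges link every pair of tuples with similarity >= phi."""
    n = len(tuples)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if sim(tuples[i], tuples[j]) >= phi:
            parent[find(i)] = find(j)  # union the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, with one-dimensional tuples `[1.0, 1.5, 2.0, 10.0, 10.5]`, the similarity `1 / (1 + |a - b|)` and `phi = 0.5`, the call returns two candidate theme class cluster units, `[[0, 1, 2], [3, 4]]`. The pairwise loop is quadratic in the number of tuples, which is acceptable for a sketch but would need blocking or indexing on large merged tables.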
Specific embodiment five: this embodiment further describes the keyword query method based on theme class cluster units in a relational database according to any one of specific embodiments one to four. The detailed process of establishing the index optimization mechanism based on association rules in step 2 is as follows:
To speed up queries, an index must be built for the database. Traditional methods build an inverted index entry list only for single keywords, which severely limits index efficiency for multi-keyword queries. To solve this problem, the present application proposes a multi-keyword index construction mechanism based on theme class cluster units.
Within each theme class cluster, frequent itemsets are found with an association rule algorithm. Each frequent itemset corresponds to one inverted index entry list, which contains all theme class cluster units in which the keyword or keyword combination directly occurs. The index structure is as follows:
Keyword(s)→(TCU1,TCU2,...) (10)
The above subject indexes based on theme class cluster units are generated with the Lucene toolkit.
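The index structure (10) can be illustrated with a small in-memory Python sketch. It is not the Lucene-based implementation the text refers to: frequent keyword sets are found here by brute-force enumeration of combinations up to `max_size`, which stands in for an association rule algorithm such as Apriori, and the names `build_index`, `min_support` and `max_size` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_index(tcus, min_support=2, max_size=2):
    """tcus: dict mapping a TCU id to the set of keywords that occur in it.
    Returns a dict mapping each frequent keyword set (a frozenset) to the sorted
    list of TCU ids that contain every keyword of the set."""
    postings = defaultdict(set)
    for tcu_id, words in tcus.items():
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                postings[frozenset(combo)].add(tcu_id)
    # Keep only keyword sets supported by at least min_support TCUs.
    return {key: sorted(ids) for key, ids in postings.items() if len(ids) >= min_support}
```

With `tcus = {"TCU1": {"database", "query"}, "TCU2": {"database", "query", "index"}, "TCU3": {"index"}}`, the entry for `frozenset({"database", "query"})` is `["TCU1", "TCU2"]`, so a two-keyword query is answered by one lookup rather than by intersecting two single-keyword lists. The pair `{"database", "index"}` occurs in only one unit and is therefore not indexed.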
Specific embodiment six: this embodiment further describes the keyword query method based on theme class cluster units in a relational database according to any one of specific embodiments one to five. The detailed process of returning query results to the user in step 3 is as follows:
Using the subject indexes based on theme class cluster units, the theme class cluster units that contain the keyword (or keyword group) can be efficiently returned to the user as query results. Given a keyword query K={k1,k2,...,kn}, the system searches the multiple subject indexes in parallel to find the corresponding theme class cluster units, computes the score of each relevant theme class cluster unit with an existing result ranking function, sorts the units in descending order of score, and returns the top-k theme class cluster units to the user.
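The query flow just described (parallel lookup in several subject indexes, scoring, descending sort, top-k) can be sketched in Python as follows. The sketch makes simplifying assumptions that are not from the patent: each subject index is a plain dict from a single keyword to a list of TCU ids, the score is simply the number of query keywords a unit matches (a hypothetical stand-in for the "existing result ranking function"), and threads from `concurrent.futures` provide the parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def search_index(index, keywords):
    """Look up one subject index; score = number of query keywords a TCU matches."""
    scores = {}
    for keyword in keywords:
        for tcu in index.get(keyword, ()):
            scores[tcu] = scores.get(tcu, 0) + 1
    return scores

def top_k_query(indexes, keywords, k):
    """Search all subject indexes in parallel and return the k best (TCU id, score) pairs."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda idx: search_index(idx, keywords), indexes))
    merged = {}
    for part in partials:
        for tcu, score in part.items():
            merged[tcu] = merged.get(tcu, 0) + score
    # Descending by score; ties are broken by TCU id to keep the output deterministic.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))[:k]
```

For instance, `top_k_query([{"db": ["A1", "A2"], "query": ["A1"]}, {"db": ["B1"]}], ["db", "query"], 2)` returns `[("A1", 2), ("A2", 1)]`.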
To verify the beneficial effects of the present invention, the following simulation experiments were carried out:
The experiments use the public data set Freebase. Freebase is an open structured database that is large and structurally complex, containing about 2000 tables and 39,000,000 entities. Because of limited experimental conditions, the data set was simplified in a way that does not affect the experimental results: keeping the underlying database schema and the join relationships between data tables unchanged, part of the data (400M) was extracted from the Freebase database as the experimental data set. To verify the effectiveness and performance of the TCU-Based query method proposed in this application, three groups of experiments were carried out. First, the effectiveness of the table join order optimization scheme is demonstrated by comparing data preprocessing times. Second, the proposed TCU-Based query method is compared with the baseline methods DBXplorer, BLANKS and SAINT to verify its efficiency and accuracy. Finally, query response times are compared on data sets of different sizes to verify the scalability of the method.
The algorithms were run under the Microsoft Windows 7 operating system in a Java environment, on a Core(TM) 2.5GHz CPU with 4GB of memory and a 500G hard disk.
Evaluation of the table join order optimization scheme
The data preprocessing time before and after table join order optimization was measured and compared on databases of different sizes to verify the effectiveness of the genetic-algorithm-based table join order optimization scheme proposed in this application. The experimental results are shown in Figure 7, where the abscissa is the number of data tables involved in data preprocessing and the ordinate is the data preprocessing time. The results show that using the table join order obtained by the genetic algorithm significantly improves preprocessing efficiency, and the effect becomes more pronounced as the number of tables increases.
Method comparison
The performance of the proposed TCU-Based query method was evaluated by comparison with the baseline query methods, in terms of both query efficiency and result accuracy. 100 keyword queries were randomly selected from the database's query log and executed on the Freebase data set with each of the four methods DBXplorer, BLANKS, SAINT and TCU-Based, and the query response time and the quality of the query results of each method were analyzed.
(1) Query efficiency
First, the average top-2 query response times of the four methods were compared for different numbers of query keywords. Query response time is measured from the start of query execution until the top-2 query responses are generated, and does not include offline data preprocessing time. In Figure 8 the abscissa is the number of keywords and the ordinate is the average query response time. As the figure shows, the offline query methods SAINT and TCU-Based are clearly more efficient than the online query methods DBXplorer and BLANKS. The reason is that the two online methods must perform complex table join operations during query processing, and such joins incur a large time overhead, especially in structurally complex databases; the offline query methods, in contrast, preprocess the tables and tuples in the database before the user issues a query, which significantly improves query speed. In addition, the proposed TCU-Based query method builds a subject index for each theme class cluster, and the multiple indexes are searched in parallel once the user issues a query, which further improves query efficiency relative to the offline query method SAINT. Figure 8 also shows that the efficiency gain is more pronounced when the number of keywords is 3 or more. The reason is that index terms are selected using association rules, so when the user supplies several query keywords, an index entry containing all of the keywords can be retrieved directly, with no index join required.
To further evaluate the performance of the proposed query method, the query methods were compared on top-k queries with k ranging from 2 to 20. In Figure 9 the abscissa is the value of k and the ordinate is the average query response time for each k. The query performance of the proposed method is clearly better than that of the other three baseline query methods. For example, when k is 12, the average query response times of DBXplorer, BLANKS and SAINT are 13500ms, 7452ms and 3420ms respectively, while the proposed TCU-Based query method takes only 2014ms. The reasons are similar to those given for Figure 8 and are not repeated here.
(2) Query effectiveness
Query effectiveness is measured with two evaluation metrics, precision (accuracy rate) and recall. To compute them accurately, 100 SQL queries were first randomly selected and their corresponding query responses were taken as the standard query results. The keywords in the SQL queries were then extracted and used as input to the four query methods above. The comparison of average precision and average recall is shown in Figure 10 and Figure 11. As Figure 10 shows, the precision of the query algorithms declines as the number of keywords increases. In general, the precision of 1-keyword and 2-keyword queries is better than that of 3-keyword, 4-keyword and 5-keyword queries, because the relationships between keywords become increasingly complex as the number of keywords grows. In terms of precision, the proposed TCU-Based method is 3%~7% higher than the comparable method SAINT, and its advantage over the other two methods is more significant. Figure 11 shows that the proposed method is significantly better than the existing methods in terms of recall, about 15% higher than SAINT.
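The two metrics can be made concrete with a short Python sketch that compares the results returned by a keyword query against the standard results of the corresponding SQL query and then averages over all queries. The function names and the representation of results as collections of identifiers are assumptions for illustration.

```python
def precision_recall(returned, relevant):
    """Precision = hits / number returned; recall = hits / number of standard results."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision_recall(runs):
    """runs: non-empty list of (returned, relevant) pairs, one pair per query."""
    pairs = [precision_recall(returned, relevant) for returned, relevant in runs]
    avg_precision = sum(p for p, _ in pairs) / len(pairs)
    avg_recall = sum(r for _, r in pairs) / len(pairs)
    return avg_precision, avg_recall
```

For example, a query that returns results `[1, 2, 3, 4]` when the standard results are `[2, 4, 5]` has two hits, giving a precision of 2/4 and a recall of 2/3.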
Scalability evaluation
This experiment measures the average top-5 query response time. In Figure 12 the abscissa is the data set size and the ordinate is the average top-5 query response time. Figure 12 shows that as the database size increases step by step from 100MB to 500MB, the average query response time of the proposed method changes slowly. This is because the growth in data volume strongly affects only the tuple joins performed during preprocessing and has little effect on online index lookup. The experiment therefore shows that the proposed method scales well across data set sizes.
Claims (1)
1. A keyword query method based on theme class cluster units in a relational database, carried out according to the following steps:
Step 1: theme class cluster unit construction;
Step 1.1: vertical grouping based on data table characteristics and the query log;
Step 1.2: a table join order optimization scheme within each theme class cluster;
Step 1.3: horizontal grouping based on the theme class cluster tuple associated graph;
Step 2: establishing an index optimization mechanism based on association rules;
Step 3: returning query results to the user;
Characterized in that the detailed process of the vertical grouping based on data table characteristics and the query log in step 1.1 is as follows:
An inter-table similarity matrix construction method is used, in which the initial input matrices are built from two aspects: table characteristics (comprising inter-table topological compactness and inter-table content similarity) and the query log; the vertical grouping method takes the relational database and the user query log as input and produces a set of theme class clusters as output; the vertical grouping method is divided into the following three modules: an input module, a similarity matrix construction module and an output module; input module: takes the relational database and its schema graph as input, together with the query log; similarity matrix construction module: by analyzing and computing over the schema graph and the database from the input module, obtains the topological compactness and content similarity between data tables, builds the topological compactness matrix and the content similarity matrix respectively, and on this basis builds the inter-table similarity matrix; the query log is then statistically analyzed so that the above matrix is further strengthened and corrected; finally, a set of theme class clusters is output as the result;
(1) Inter-table topological compactness
Given the schema graph G=(V, E) of the database, the topological compactness between node vi and node vj is defined by formula (1) (omitted in the source text),
Where |vi| is the size of table Ti in the database and |vj| is the size of table Tj in the database; σ is the impact factor; the logical distance between node vi and node vj is the length of the path between them in the database schema graph; by the mathematical properties of the Gaussian function, for a given value of σ the influence range of each node is approximately a local region, and when the logical distance between two nodes exceeds that range, the topological compactness between the two nodes decays rapidly to 0;
The topological compactness between any two nodes is calculated by formula (1), and the topological compactness matrix of the relational database is then constructed from these values;
(2) Inter-table content similarity
A data table consists of a table name, attributes and tuples, so the content similarity between tables can be analyzed in depth from two aspects: name similarity and value similarity;
Name similarity comprises two parts, table name similarity and attribute name similarity; following the method for computing the similarity between two entities in a vector space, the keywords in the table name and attribute names of table Ti are first extracted to construct a vector Vi for table Ti, the keywords in the table name and attribute names of table Tj are extracted to construct a vector Vj for table Tj, and the name similarity is computed with the cosine function:
Sim1(Ti,Tj)=Sim(Vi,Vj)=Vi·Vj/(|Vi|·|Vj|) (2)
Where Sim1(Ti,Tj) is the name similarity between table Ti and table Tj;
Sim(Vi,Vj) is the similarity between vector Vi and vector Vj;
|Vi| and |Vj| are the norms of vector Vi and vector Vj respectively;
The value similarity is computed as follows:
1. Compute the content similarity between two attributes using the Jaccard coefficient:
J(u,v)=|u∩v|/|u∪v| (3)
Where u is an attribute column in data table Ti and v is an attribute column in data table Tj;
2. Use a greedy matching strategy to select the set Z of matched attribute pairs in the database;
3. Take the weighted average to obtain the value similarity Sim2(Ti,Tj) between the two tables (formula omitted in the source text),
Where |Ti| is the number of attribute columns in data table Ti; |Tj| is the number of attribute columns in data table Tj; max(|Ti|,|Tj|) is the larger of |Ti| and |Tj|; u.V is the coefficient of variation of attribute column u and v.V is the coefficient of variation of attribute column v; the coefficient of variation is a statistic that measures the degree of variation of the observations in a table, computed from the standard deviation and the mean of the attribute column; the smaller the coefficient of variation, the less rich the content of the attribute column, and conversely, the larger the coefficient of variation, the richer the content of the attribute column; max(u.V,v.V) is the larger of u.V and v.V;
The content similarity between data table Ti and data table Tj is: Sim(Ti,Tj)=(Sim1(Ti,Tj)+Sim2(Ti,Tj))/2;
The content similarity matrix S between data tables is formed from these values for all pairs of tables,
Where l is the number of data tables in the database;
Considering similarity in both structure and content, the inter-table similarity matrix is obtained: ADB=T+S;
Where T is the topological compactness matrix of the relational database;
(3) Similarity matrix correction method
The query log records the history of user accesses to the database and contains three fields: the user ID, the query Q, and the data table T in which the query result resides; the basic idea of the vertical grouping method with user feedback is to statistically analyze the query records in the query log and to correct the similarity matrix with the following boost function:
boostlog(Ti,Tj)=exp(log(count(Ti,Tj))/log(max(count))) (5)
Where count(Ti,Tj) is the number of times table Ti and table Tj co-occur in the query log, and max(count) is the maximum co-occurrence count of any two tables in the query log;
The above vertical grouping result is strengthened using the information in the user query log through the following score strengthening function:
AFinal(Ti,Tj)=ADB(Ti,Tj)×boostlog(Ti,Tj) (6)
The inter-table similarity matrix AFinal is thereby obtained, where l is the number of data tables in the database;
(4) Vertical grouping
Vertical grouping based on improved spectral clustering
Input: G=(V, E), k and the impact factor σ, where V={v1,...,vl} and |E|=m;
Output: the theme class cluster set C={C1,C2,...,Ck};
Steps:
1. Construct the inter-table similarity matrix AFinal;
2. Compute the eigenvectors and eigenvalues, and construct the eigenvector space Rk from the first k eigenvectors u1,...,uk;
3. Map all nodes in V into the space Rk;
4. Use the k-means algorithm to cluster the nodes in Rk into the theme class clusters C1,C2,...,Ck;
The specific content of the table join order optimization scheme within each theme class cluster in step 1.2 is as follows:
In data preprocessing, the n tables T1,T2,...,Tn in theme class cluster Ci=(T1,T2,...,Tn) are joined to obtain the merged table Ti'; a table join order optimization scheme is designed based on a genetic algorithm; first, when optimizing the table join order with the genetic algorithm, the different table join orders are expressed as join trees, and each join tree is encoded by its pre-order traversal; after encoding, poplength join trees are randomly selected as the initial population, and genetic operations are applied to the first-generation population to generate genlength new individuals; the genetic operations use two operators, crossover and mutation; crossover operator: randomly chosen subtrees of the same size are exchanged, and any data table that appears more than once on a join tree after crossover is replaced with a data table that does not yet appear; mutation operator: the data tables at any non-zero positions of the join tree are swapped; the cost of each join tree is then calculated according to the well-known join tree cost formula, and the poplength join trees with the lowest cost are selected as the next-generation population; the above genetic and selection steps are repeated until a predetermined number of iterations is reached, a reasonable value for this number being obtained by analyzing and summarizing a large number of experiments; the join tree with the lowest cost in the last generation is the optimal table join order found by the genetic algorithm;
The detailed process of the horizontal grouping based on the theme class cluster tuple associated graph in step 1.3 is as follows:
After vertical grouping and the table join operation within each theme class cluster, the theme class cluster set C=(C1,C2,...,Ck) is obtained, where each theme class cluster Ci contains one merged table Ti'; each theme class cluster is grouped horizontally using the theme class cluster tuple associated graph, with a mixed similarity calculation method;
Mixed similarity calculation for theme class cluster tuples
An important data model in the horizontal grouping strategy is the theme class cluster tuple associated graph G=(V, E), a weighted undirected graph in which the similarity between theme class cluster tuples serves as the edge weight;
Let ti' and tj' be any two tuples in theme class cluster Ci; the mixed similarity of the two tuples is calculated as follows:
1. Define the distance functions dk, k ∈ {1,2,3}: Euclidean distance, edit distance and Hamming distance;
2. Map the theme class cluster tuples into an n-dimensional space, and compute the distance dk(ti',tj') between the two tuples with each distance function;
3. Compute the mixed similarity between the two tuples according to the following formula (omitted in the source text);
Where Simk(ti',tj') is the similarity between tuples ti' and tj' under distance function dk;
The maximum term is the largest value of Simk(ti',tj') over the distance functions dk, k ∈ {1,2,3};
Sim(ti',tj') is the mixed similarity between the two tuples ti' and tj';
Steps of horizontal grouping:
1. Compute the mixed similarity between every pair of theme class cluster tuples in theme class cluster Ci and build the theme class cluster tuple associated graph;
2. Divide the associated graph into several connected components according to the similarity threshold φ;
3. Find the theme class cluster units whose number of tuples is greater than Minsize and compute their fusion degree; select the theme class cluster unit with the smallest fusion degree and disconnect the edge with the smallest similarity in that unit;
4. Repeat step 3 until the number of tuples in each theme class cluster unit of theme class cluster Ci is less than Minsize;
5. Compute the separation degree between theme class cluster units, and merge the theme class cluster units with the smallest separation degree;
6. Repeat step 5 until the number |Ci| of theme class cluster units in theme class cluster Ci reaches the required value;
The fusion degree and separation degree of theme class cluster units used in steps 3 and 5 are calculated by formulas (8) and (9) (omitted in the source text),
Where Sim(ti',tj') is the mixed similarity between theme class cluster tuples ti' and tj', and TCUk and TCUl denote two different theme class cluster units;
The detailed process of establishing the index optimization mechanism based on association rules in step 2 is as follows: within each theme class cluster, frequent itemsets are found with an association rule algorithm; each frequent itemset corresponds to one inverted index entry list, which contains all theme class cluster units in which the keyword or keyword combination directly occurs; the index structure is as follows:
Keyword(s)→(TCU1,TCU2,...) (10)
The above subject indexes based on theme class cluster units are generated with the Lucene toolkit;
The detailed process of returning query results to the user in step 3 is as follows:
Given a keyword query K={k1,k2,...,kn}, the multiple subject indexes are searched in parallel to find the corresponding theme class cluster units, the score of each relevant theme class cluster unit is computed with a well-known result ranking function, the units are sorted in descending order of score, and the top-k theme class cluster units are returned to the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610264735.4A CN105975488B (en) | 2016-04-25 | 2016-04-25 | A kind of keyword query method based on theme class cluster unit in relational database |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105975488A CN105975488A (en) | 2016-09-28 |
CN105975488B true CN105975488B (en) | 2019-06-18 |
Family
ID=56994549
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610264735.4A Expired - Fee Related CN105975488B (en) | 2016-04-25 | 2016-04-25 | A kind of keyword query method based on theme class cluster unit in relational database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190618 Termination date: 20200425 |