CN102799681A

CN102799681A - Top-k query method oriented to any data segment

Info

Publication number: CN102799681A
Application number: CN2012102576401A
Authority: CN
Inventors: 冯钧; 唐志贤; 邱男; 印玉兰; 徐黎明; 盛震宇; 任锋; 朱祖会; 付言章; 王祥忠; 史涯晴
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2012-07-24
Filing date: 2012-07-24
Publication date: 2012-11-28
Anticipated expiration: 2032-07-24
Also published as: CN102799681B

Abstract

The invention discloses a Top-k query method for arbitrary data segments, comprising the following steps: first, acquire the data; then analyze the characteristics of the data and build an index structure accordingly. If the data size is small and a DG (dominant graph) index has been created, perform a DG-index-based Top-k query over the arbitrary data segment; if the data size is large and the nodes on the DG index are sparse, perform a Top-k query based on a double-layer dominant graph (DDG) index structure; and if the arbitrary segment is difficult to determine, perform a hybrid index query based on DG and GS (general subject). The method provides an index that supports both global Top-k queries and local Top-k queries over arbitrary data segments, improving the freedom and flexibility of Top-k query applications.

Description

A Top-k query method for arbitrary data segments

Technical field

The present invention relates to a Top-k query method for arbitrary data segments and belongs to the technical field of information retrieval.

Background art

With the continuous development of information technology, people's requirements for information retrieval keep rising. Top-k queries are widely used in information retrieval, multimedia similarity search, text and data integration, business analysis, product catalog and preference queries, distributed network aggregation and sensor data recording, Internet-based recommendation sources, and other applications.

At present, algorithms for preference Top-k queries fall into four main categories: 1) sorted-list methods; 2) layer-based methods; 3) view-based methods; 4) summary-based methods.

The most classic sorted-list method is the TA algorithm (H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum. IO-Top-k: Index-access optimized top-k query processing. In VLDB, pages 475-486, 2006; Fagin R, Lotem A, Naor M. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences 66, 2003, pp. 614-656.). The algorithm sorts each dimension independently and builds one sorted list per dimension. It finds the tuples whose scores exceed a given threshold instead of examining every tuple. During computation, each list is scanned sequentially; whenever a tuple identifier is encountered during sequential access, the other lists are immediately accessed at random to compute that tuple's Top-k score. The visited tuples are then sorted to obtain the Top-k result. The main difficulty of this method is choosing the threshold: if the threshold is too loose, too many results are returned; if it is too tight, too few are returned.
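The sorted-list idea can be illustrated with a short Python sketch. It follows the commonly described Fagin-style threshold scan and assumes a simple sum as the aggregation function; the function name `threshold_topk`, the dictionary input format, and the stopping test are illustrative assumptions, not taken from the cited papers or from the invention.

```python
import heapq

def threshold_topk(tuples, k):
    """Top-k by sum of attributes using a threshold-style scan.

    tuples: dict mapping a tuple id to a list of attribute values
    (hypothetical input format chosen for this sketch).
    Returns a list of (score, id) pairs, best first.
    """
    if not tuples or k <= 0:
        return []
    dims = len(next(iter(tuples.values())))
    # One list per dimension, sorted by that attribute in descending order.
    lists = [sorted(tuples, key=lambda t, d=d: tuples[t][d], reverse=True)
             for d in range(dims)]
    seen = set()
    best = []  # min-heap of (score, id) holding at most k entries
    for depth in range(len(tuples)):
        for d in range(dims):
            tid = lists[d][depth]          # sequential access
            if tid not in seen:
                seen.add(tid)
                score = sum(tuples[tid])   # "random access" to the other attributes
                heapq.heappush(best, (score, tid))
                if len(best) > k:
                    heapq.heappop(best)
        # Best score any still-unseen tuple could reach at this depth.
        threshold = sum(tuples[lists[d][depth]][d] for d in range(dims))
        if len(best) == k and best[0][0] >= threshold:
            break
    return sorted(best, reverse=True)
```

In this variant the threshold is derived from the last values read in each list, so the scan can stop as soon as the k-th best score seen so far is at least that threshold.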

In layer-based methods, the tuples of the data set are organized into layers according to a given layering rule, and a Top-k query for any function F obtains its result from the first k layers. Several layering methods exist: DG (Zou L, Chen L. Dominant graph: An efficient indexing structure to answer top-k queries [C] // Proc of the IEEE 24th Int Conf on Data Engineering. Washington, DC: IEEE Computer Society, 2008: 536-545.), AppRI (Xin D, Chen C, Han J. Towards robust indexing for ranked queries [C] // Proc of the 32nd Int Conf on Very Large Data Bases. Trondheim, Norway: VLDB Endowment, 2006: 235-246.) and Onion (Chang Y C, Bergman L D, Castelli V, et al. The onion technique: Indexing for linear optimization queries [J]. ACM SIGMOD Record, 2000, 29(4): 391-402.). The Onion method uses convex hulls as its layering rule: given a linear query function, the tuples of interest lie only on the convex hull. Onion builds its index by computing convex hulls of the tuples: it first computes the 1st convex hull, then the 2nd convex hull of the remaining tuples, and so on until all tuples have been processed. The layering rule defined by AppRI is that a tuple t is placed in layer l if and only if two conditions hold: 1) for any linear query, t is not in the Top-(l-1) result; 2) at least one query places t in the Top-l result. The layering rule defined by DG is that each layer is a skyline: the 1st skyline is computed first, then the 2nd skyline of the remaining tuples, and so on until all tuples have been processed. Unlike the other two methods, DG also records the dominance relations between data points, so it does not need to access and compute the query function value of every tuple in the first k layers.
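The skyline-based layering used by DG can be sketched as follows. The sketch assumes that larger attribute values are better and uses a quadratic scan for clarity; the names `dominates` and `skyline_layers` are hypothetical and are not taken from the cited paper.

```python
def dominates(p, q):
    """p dominates q if p is >= q in every dimension and > in at least one."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def skyline_layers(points):
    """Peel successive skylines: layer 1 is the skyline of all points,
    layer 2 is the skyline of what remains, and so on."""
    remaining = list(points)
    layers = []
    while remaining:
        # A point belongs to the current layer if nothing left dominates it.
        layer = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers
```

For example, `skyline_layers([(5, 5), (4, 4), (1, 1)])` places each point in its own layer, because each point dominates the next.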

View-based methods match results in views that have been pre-sorted by given functions. Typical methods are PREFER (Hristidis V, Koudas N, Papakonstantinou Y. Prefer: A system for the efficient execution of multi-parametric ranked queries [J]. ACM SIGMOD Record, 2001, 30(2): 259-270.) and LPTA (Das G, Gunopulos D, Koudas N, et al. Answering top-k queries using views [C] // Proc of the 32nd Int Conf on Very Large Data Bases. Trondheim, Norway: VLDB Endowment, 2006: 451-462.). In algorithms of this type, the closer the query function is to the function of the pre-sorted view, the faster the query. The PREFER algorithm uses a view sequence Rv in which the tuples are sorted by a preference function. When a preference function is queried, the watermark in Rv is computed to guarantee that the first result of the query is obtained; this process is repeated to obtain the Top-k values. The LPTA algorithm maintains lists of tuple IDs sorted by several preference functions and searches these lists until the Top-k values are found.

Summary-based methods generally partition the data set with a grid (equi-depth or equi-width) and record the data points in each grid cell. At query time, approximate function scores of the data points are computed from the grid summary information so that data points that cannot be in the query result are pruned. Within the grid cells that satisfy the condition, the data points are then accessed to obtain exact function scores, which are sorted to produce the query result. RankCube applies a summary-based method to multidimensional selection queries over historical data sets. This approach builds the grid quickly but its computation is relatively coarse, so it suits continuous queries over data for which an index must be built quickly. Domestic researchers have also done much work on Top-k computation, such as Top-k frequent itemset mining over data streams (Yang Bei, Huang Houkuan. Mining top-K frequent itemsets in the landmark window over data streams [J]. Journal of Computer Research and Development, 2010, 47(3): 463-473.) and Top-k outlier detection over data streams.

The Top-k query methods above focus on obtaining the globally optimal Top-k result set and rarely address Top-k queries over data in arbitrary segments, which reduces the freedom and flexibility of Top-k query applications. It is therefore necessary to study and build an index that supports both global Top-k queries and Top-k queries over arbitrary data segments.

Summary of the invention

Object of the invention: In view of the problems in the prior art, the present invention provides a Top-k query method for arbitrary data segments. The method has an index that supports both global Top-k queries and Top-k queries over arbitrary data segments, improving the freedom and flexibility of Top-k query applications.

Technical solution: A Top-k query method for arbitrary data segments, comprising the following steps:

Step A: read the data;

Step B: analyze the data characteristics and build the index structure accordingly: if the data volume is small and the DG index has been built, go to step B-1; if the data volume is large and the nodes of the data set on the DG index are comparatively "sparse" (that is, more than 50% "pseudo-nodes" must be added to restore a continuous subgraph of layers in the DG index), go to step B-2; if the arbitrary segment is difficult to determine, go to step B-3;

Step B-1: the Top-k query method for arbitrary data segments based on the DG index comprises the following steps:

Step B-1-1: add some pseudo-nodes to restore the DG index;

Step B-1-2: perform DG-based Traveler processing, specifically comprising the following steps:

Step B-1-2-1: scan the layer numbers of the data segment to be queried, add the nodes of the smallest layer minlayer to the candidate set RS in non-decreasing order, and add the maximum value R in RS to the result set;

Step B-1-2-2: compare the size of the result set with K; if the result set size is less than K, go to step B-1-2-3; otherwise go to step B-1-3;

Step B-1-2-3: scan each child node C of R; if all parent nodes of C are in the candidate set and C has not been visited, add C to the candidate set, then add the maximum node of the candidate set to the result set; if the node entering the result set lies within the query range, increment the size of the result set by 1;

Step B-1-3: delete the pseudo-records from the result set to obtain the final Top-k query result;

Step B-2: the Top-k query method based on the double-layer dominant graph (DDG) index structure comprises the following steps:

Step B-2-1: segment the data;

Step B-2-2: build the DDG index structure over the segmented data;

Step B-2-3: perform the Top-k query, specifically comprising the following steps:

Step B-2-3-1: compute the DG indexes in which the query segment lies;

Step B-2-3-2: perform basic Traveler processing on each DG index covered by the query to form the result set result;

Step B-2-3-3: perform DG-based Traveler processing on the bottommost DG index covered by the query, and write the results to result;

Step B-2-3-4: perform DG-based Traveler processing on the topmost DG index covered by the query, and write the results to result, forming the final Top-K query result.
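The flow of steps B-2-3-1 to B-2-3-4 can be illustrated with a simplified Python sketch. It assumes that the query range is a half-open interval of record positions, that segments have a fixed size, and that the two boundary segments are the ones only partly covered by the query; `heapq.nlargest` stands in for a Traveler run on each segment's DG index. The function name `ddg_style_query` and all parameters are hypothetical.

```python
import heapq

def ddg_style_query(records, seg_size, lo, hi, k, score=sum):
    """Top-k over records[lo:hi], processed one segment at a time.

    Assumes 0 <= lo < hi <= len(records). Each segment would carry its
    own DG index; here a plain per-segment top-k replaces the Traveler.
    """
    # Step B-2-3-1: find which segments the query range touches.
    first_seg = lo // seg_size
    last_seg = (hi - 1) // seg_size
    result = []
    for s in range(first_seg, last_seg + 1):
        start = s * seg_size
        end = min(start + seg_size, len(records))
        # Boundary segments are clipped to the query range; interior
        # segments are used whole.
        a, b = max(start, lo), min(end, hi)
        # Stand-in for a Traveler run on this segment's DG index.
        result.extend(heapq.nlargest(k, records[a:b], key=score))
    # Merge the per-segment candidates into the final Top-k result.
    return heapq.nlargest(k, result, key=score)
```

The merge is safe because every record in the overall Top-k must also be in the Top-k of its own segment.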

Step B-3: the hybrid index query method based on DG and GS comprises the following steps:

Step B-3-1: build the DGS dominance network, which is divided into an upper and a lower layer. The upper layer is the DG index structure, which suits global Top-k queries; for Top-k queries over arbitrary data segments, the lower-layer GS data structure preserves the throwback dominance relation well.

Step B-3-2: adjust according to the concept of an adaptive grid: adaptively adjust the data of every dimension of the GS layer so that the data is uniformly distributed along each dimension;

Step B-3-3: adjust the GS structure according to the DG index so that the nodes within the same level of the GS network keep a certain order, reducing the number of comparisons between same-level data in the DG index;

Step B-3-4: query based on the DGS dominance network, specifically comprising the following steps:

Step B-3-4-1: compute the column number (column) and row number (rower) of the grid cell in which the queried data segment lies;

Step B-3-4-2: process in turn the grid nodes whose column number is column and whose row numbers range from 0 to rower, add the data nodes in these grid cells that fall within the query interval to the candidate set in non-decreasing order, and compute the largest column number (col) that satisfies the condition;

Step B-3-4-3: process in turn the grid nodes whose row number is rower and whose column numbers range from 0 to column, add the data nodes in these grid cells that fall within the query interval to the candidate set in non-decreasing order, and compute the largest row number (row) that satisfies the condition;

Step B-3-4-4: add the data nodes of the grid nodes with row number i (row < i < rower) and column number j (col < j < column) to the candidate set in non-decreasing order;

Step B-3-4-5: add the first node in the candidate set to the result set;

Step B-3-4-6: compare the size len of the result set with K; if len < K, go to step B-3-4-7; otherwise go to step B-3-5 and end the query;

Step B-3-4-7: determine whether the number of nodes in the result set equals the number of nodes within the query range; if they are not equal, no successor nodes are added; if they are equal, the successor nodes are added to the candidate set;

Step B-3-4-8: add the len-th data node in the candidate set to the result set, increment len by 1, and return to step B-3-4-6;

Step B-3-5: return the result set as the Top-k query result and end the query.

Beneficial effects: Compared with the prior art, the Top-k query method for arbitrary data segments provided by the invention has an index that supports both global Top-k queries and Top-k queries over arbitrary data segments, improving the freedom and flexibility of Top-k query applications.

Description of drawings

Fig. 1 is a flowchart of the embodiment of the invention;

Fig. 2 is a plot of the uniformly distributed data of the embodiment of the invention;

Fig. 3 is a plot of the normally distributed data of the embodiment of the invention;

Fig. 4 is a schematic diagram of the DGS-based Top-k query of the embodiment of the invention.

Embodiment

The present invention is further illustrated below with reference to a specific embodiment. It should be understood that this embodiment only illustrates the invention and does not limit its scope; after reading the present disclosure, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.

As shown in Fig. 1, the detailed technical solution of this embodiment is:

Step A: read the data

This experiment uses two randomly generated data sets, each with 20K records and two attributes per record. Data set S is uniformly distributed on the interval from 0 to 1000, and data set N is normally distributed (μ=500, σ=1), as shown in Fig. 2 and Fig. 3.
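Data sets of this shape could be generated as in the following sketch. It assumes that both attributes of a record are drawn independently from the same distribution and uses a fixed seed for repeatability; neither detail is stated in the source, and the function name `make_datasets` is hypothetical.

```python
import random

def make_datasets(n=20_000, seed=0):
    """Return (S, N): n two-attribute records each.

    S is uniform on [0, 1000]; N is normal with mu=500, sigma=1.
    Drawing both attributes independently is an assumption of this sketch.
    """
    rng = random.Random(seed)
    uniform = [(rng.uniform(0, 1000), rng.uniform(0, 1000)) for _ in range(n)]
    normal = [(rng.gauss(500, 1), rng.gauss(500, 1)) for _ in range(n)]
    return uniform, normal
```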

Step B: analyze the data characteristics of the two data sets above and build the index structure accordingly: if the data volume is small and the DG index has been built, go to step B-1; if the data volume is large and the nodes of the data set on the DG index are comparatively "sparse" (that is, more than 50% "pseudo-nodes" must be added to restore a continuous subgraph of layers in the DG index), go to step B-2; if the arbitrary segment is difficult to determine, go to step B-3.
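The branching in step B can be written as a small decision function. In this sketch the cut-off for a "small" data volume (`small_limit`) is a hypothetical value, since the text gives no number, and the fallback for large data that is neither sparse nor hard to segment is also an assumption; only the 50% pseudo-node criterion comes from the text.

```python
def choose_query_step(n_records, pseudo_node_ratio, segment_hard_to_determine,
                      small_limit=10_000):
    """Return the step ("B-1", "B-2" or "B-3") that step B branches to.

    small_limit is a hypothetical cut-off for "small" data volume.
    pseudo_node_ratio is the fraction of pseudo-nodes needed to restore
    a continuous subgraph of layers in the DG index.
    """
    if n_records <= small_limit:
        return "B-1"   # small data: query directly on the DG index
    if pseudo_node_ratio > 0.5:
        return "B-2"   # large and sparse: double-layer DG (DDG) index
    if segment_hard_to_determine:
        return "B-3"   # hybrid DG + GS index
    return "B-1"       # assumption: otherwise fall back to the DG index
```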

Step B-1: perform the Top-k query over arbitrary data segments based on the DG index;

The Top-k query method for arbitrary data segments based on the DG index (the DG-based Traveler processing method) is shown in Algorithm 1. First, the layer numbers of the data segment to be queried are scanned, and the nodes of the smallest layer minlayer are added to the candidate set RS in non-decreasing order. The maximum value R in RS is added to the result set. All child nodes C of R are then scanned; if all parent nodes of C are in the candidate set and C has not been visited, C is added to the candidate set, and the maximum node in the candidate set is then added to the result set, and so on. If a node entering the result set is not within the query range, the number of query results K is incremented by 1. The query stops if and only if the number of nodes in the result set equals K. The pseudo-nodes are then removed from the result set: every node in the result set that is not within the query range is removed, which finally yields the Top-k result set result of the queried data segment.
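The traversal described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the listing of Algorithm 1: larger attribute values are taken as better, the score function is a monotone sum, edges connect only adjacent skyline layers, a child becomes a candidate once all of its parents have been output, and out-of-range nodes are traversed but not counted, which has the same effect as incrementing K and deleting them afterwards. All function names are hypothetical.

```python
import heapq

def dominates(p, q):
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def build_dg(points):
    """Return (layers, parents, children) over point indices."""
    remaining = list(range(len(points)))
    layers = []
    while remaining:
        layer = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining)]
        layers.append(layer)
        remaining = [i for i in remaining if i not in layer]
    parents = {i: [] for i in range(len(points))}
    children = {i: [] for i in range(len(points))}
    # Dominance edges between adjacent layers only.
    for upper, lower in zip(layers, layers[1:]):
        for c in lower:
            for p in upper:
                if dominates(points[p], points[c]):
                    parents[c].append(p)
                    children[p].append(c)
    return layers, parents, children

def segment_topk(points, in_range, k, score=sum):
    """Indices of the k best points among those with in_range(i) true."""
    layers, parents, children = build_dg(points)
    if not layers:
        return []
    # Candidate set: a max-heap (negated scores) seeded with layer 1.
    heap = [(-score(points[i]), i) for i in layers[0]]
    heapq.heapify(heap)
    done, result = set(), []
    while heap and len(result) < k:
        _, r = heapq.heappop(heap)
        done.add(r)
        if in_range(r):
            result.append(r)   # real node inside the query range
        # Out-of-range nodes act as pseudo-nodes: traversed, never reported.
        for c in children[r]:
            if all(p in done for p in parents[c]):
                heapq.heappush(heap, (-score(points[c]), c))
    return result
```

A call such as `segment_topk(points, lambda i: 10 <= i < 40, 5)` returns the indices of the five best-scoring records among positions 10 to 39.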

Algorithm 1: Top-k query over arbitrary data segments based on the DG index

Step B-2: perform the Top-k query with the Top-k query method based on the double-layer dominant graph (DDG) index structure, comprising the following steps:

Step B-2-1: segment the data and build the DDG index structure over the segmented data; the construction algorithm is shown in Algorithm 2.

Algorithm 2: Construction of the DDG index structure

Here, the CreateDGIndex() method is the basic DG index construction method (shown in Algorithm 3), in which the SkylineNode(i) method finds the i-th layer. The data is segmented, a DG index is created for each segment, and a further DG index is then created over the first-layer data of each of these indexes, thereby building the DDG index structure.
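The two-level construction described above can be sketched as follows. For brevity each DG index is represented only by its skyline layers (dominance edges are omitted), fixed-size segments are assumed, and `skyline_layers` is a hypothetical stand-in for CreateDGIndex(); none of these names or choices come from the original listings.

```python
def dominates(p, q):
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def skyline_layers(points):
    """Stand-in for CreateDGIndex(): successive skyline layers only."""
    remaining = list(points)
    layers = []
    while remaining:
        layer = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers

def build_ddg(points, seg_size):
    """Return (lower, upper): per-segment layers and the layers built
    over the first layer of every segment."""
    segments = [points[i:i + seg_size]
                for i in range(0, len(points), seg_size)]
    lower = [skyline_layers(seg) for seg in segments]
    first_layers = [p for layers in lower for p in layers[0]]
    upper = skyline_layers(first_layers)
    return lower, upper
```

Because the global skyline is contained in the union of the segment skylines, the first layer of the upper index equals the skyline of the whole data set.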

Algorithm 3: DG index construction algorithm

Step B-2-3: perform the Top-k query, as shown in Algorithm 4, specifically comprising the following steps:

Step B-2-3-1: compute the DG indexes in which the query segment lies;

Algorithm 4: Top-k query algorithm over arbitrary data segments based on the DDG index

Algorithm 5: Basic Traveler traversal

To solve the problem that the DG index cannot preserve the throwback dominance of the data, we combine the GS network, which preserves the throwback dominance relation well, with the DG index. The GS network has problems of its own, however. For example, if the data is not uniformly distributed, some grid cells may hold too much data while others are too sparse, and for grid cells at the same level the dominance relations among the data inside them are not well preserved. We therefore adjust the GS grid, as described in steps B-3-2 and B-3-3.

Step B-3-2: adjust according to the concept of an adaptive grid;

To make the data uniformly distributed along each dimension and to avoid overly dense grid cells, we adaptively adjust the data of every dimension in the GS grid. The adaptive adjustment algorithm is shown in Algorithm 6.
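One way to obtain roughly equal cell populations along a dimension is equi-depth partitioning, sketched below. This is offered only as an illustration of the idea of an adaptive grid; it is an assumption that it resembles the adjustment in Algorithm 6, and the names `equi_depth_bounds` and `cell_index` are hypothetical.

```python
import bisect

def equi_depth_bounds(values, cells):
    """Cell boundaries along one dimension such that each of the
    `cells` cells holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // cells] for i in range(1, cells)]

def cell_index(value, bounds):
    """Index of the cell that `value` falls into (0 .. len(bounds))."""
    return bisect.bisect_right(bounds, value)
```

Applying `equi_depth_bounds` to each dimension in turn gives a grid whose rows and columns are each evenly populated, even when the raw values are skewed.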

Algorithm 6: Adaptive adjustment algorithm

Step B-3-3: adjust the GS structure according to the DG index to add dominance relations within the same level;

The purpose of adjusting the GS network according to the DG index is to keep the nodes within the same level of the GS network in a certain order. Because a query on the GS index structure is likewise a level-by-level query, this reduces the number of comparisons between data at the same level, and pruning can be used to exclude nodes that need not appear in the candidate node set, improving query efficiency.

Step B-3-4: query based on the DGS dominance network, as shown in Algorithm 6, specifically comprising the following steps:

Step B-3-4-5: add the first node in the candidate set to the result set;

Take a Top-k query over the nodes in Fig. 4 as an example. From the existing dominance relations, grid nodes c[3][2] and c[2][3] enter the candidate set, that is, nodes 3, 11, and 4 enter the candidate set. Node 4 is then added to the result set according to the aggregate function F, and the successor nodes of node 4, namely 6, 2, and 1, are to be added to the candidate set; because node 1 is also dominated by grid node c[3][2], only nodes 6 and 2 are added. Next, node 6 is added to the result set; at this point the nodes in grid cell c[1][3] are not yet all in the result set, so no successor nodes need to be sought and added to the candidate set. Node 3 is added to the result set, and likewise node 2; now all nodes in grid node c[1][3] are in the result set, so its successor node, node 7, would be added to the candidate set, but node 7 is dominated by c[3][2], so it need not be added. Node 11 is then added to the result set; all nodes in grid cell c[3][2] have now been added to the result set, so its successor nodes 5, 1, and 7 are added to the candidate set. Similarly, nodes 5 and 1 are added to the result set, and the successor nodes 8, 9, and 10 of node 1 are added to the candidate set; after node 10 enters the result set, node 0 is added to the candidate set, which completes the Top-k query.

Claims

1. A Top-k query method for arbitrary data segments, characterized by comprising the following steps:

Step A: read the data;

Step B: analyze the data characteristics and build the index structure accordingly: if the data volume is small and the DG index has been built, perform the Top-k query over arbitrary data segments based on the DG index; if the data volume is large and the nodes of the data set on the DG index are comparatively sparse, perform the Top-k query based on the double-layer dominant graph (DDG) index structure; if the arbitrary segment is difficult to determine, perform the hybrid index query based on DG and GS; here, "comparatively sparse" means that more than 50% pseudo-nodes must be added to restore a continuous subgraph of layers in the DG index.

2. The Top-k query method for arbitrary data segments according to claim 1, characterized in that the Top-k query method for arbitrary data segments based on the DG index comprises the following steps:

Step B-1-1: add pseudo-nodes to restore the DG index;

Step B-1-2: perform DG-based Traveler processing;

Step B-1-2-2: compare the size of the result set with K; if the result set size is less than K, go to step B-1-2-3; otherwise go to step B-1-3;

Step B-1-3: delete the pseudo-records from the result set to obtain the final Top-k query result result.

3. The Top-k query method for arbitrary data segments according to claim 1, characterized in that the Top-k query method based on the double-layer dominant graph (DDG) index structure comprises the following steps:

Step B-2-1: segment the data;

Step B-2-2: build the DDG index structure over the segmented data;

Step B-2-3: perform the Top-k query, specifically comprising the following steps:

Step B-2-3-1: compute the DG indexes in which the query segment lies;

4. The Top-k query method for arbitrary data segments according to claim 1, characterized in that the hybrid index query method based on DG and GS comprises the following steps:

5. Step B-3-2: adjust according to the concept of an adaptive grid: adaptively adjust the data of every dimension of the GS layer so that the data is uniformly distributed along each dimension;

Step B-3-4-1: compute the column number column and the row number rower of the grid cell in which the queried data segment lies;

Step B-3-4-2: process in turn the grid nodes whose column number is column and whose row numbers range from 0 to rower, add the data nodes in these grid cells that fall within the query interval to the candidate set in non-decreasing order, and compute the largest column number col that satisfies the condition;

Step B-3-4-3: process in turn the grid nodes whose row number is rower and whose column numbers range from 0 to column, add the data nodes in these grid cells that fall within the query interval to the candidate set in non-decreasing order, and compute the largest row number row that satisfies the condition;

Step B-3-4-4: add the data nodes of the grid nodes with row number i and column number j to the candidate set in non-decreasing order, where row < i < rower and col < j < column;

Step B-3-4-5: add the first node in the candidate set to the result set;