CN105045806A - Dynamic splitting and maintenance method of quantile query oriented summary data

Info

Publication number: CN105045806A
Application number: CN201510304691.9A
Authority: CN
Inventors: 王树鹏; 张燕琴; 吴广君
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2015-06-04
Filing date: 2015-06-04
Publication date: 2015-11-11
Anticipated expiration: 2035-06-04
Also published as: CN105045806B

Abstract

The present invention relates to a dynamic splitting and maintenance method for quantile-query-oriented summary data. The method comprises: first, sampling the written data items to construct q-digit summary data; second, querying the midpoint of the data items in the q-digit summary data according to the quantile query rule of q-digit post-order traversal; and then traversing the q-digit summary data in reverse from the midpoint to establish a split path and, according to the split path, splitting the q-digit summary data into two summary data structures with approximately equal data volumes. After splitting, each structure is still an independent q-digit structure and can receive and process newly arriving data as usual. The method can be used to manage q-digit summary data dynamically in a distributed environment, supporting the maintenance and management of summary data, as well as quantile queries and computation, in a big-data environment.

Description

A dynamic splitting and maintenance method for quantile-query-oriented summary data

Technical field

The invention belongs to the field of information technology and proposes a q-digit-based method for dynamically splitting and maintaining summary data. The method covers the selection of the split point of the summary data structure, the dynamic splitting algorithm, and the error estimation method after splitting. It can be used for the dynamic management of q-digit summary data in a distributed environment, effectively supporting the maintenance and management of summary data in a big-data environment as well as quantile queries and computation.

Background art

In streaming big-data environments, an important class of queries is the quantile query over stream data, usually expressed as a φ-quantile query: after the N data items are sorted, the query returns the item of rank ⌈φN⌉. It is referred to below simply as a quantile query. The quantile φ is a real number in the range (0, 1]. The 1-quantile (φ = 1) is the maximum of the queried data set, and the 0.5-quantile (φ = 0.5) is the middle value of the data set, also called the median. For example, given the stream data set D = {6, 1, 8, 7, 9, 0, 4, 2, 5, 3}, sorting yields D' = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; the 0.1-quantile query returns 0, the 0.5-quantile query returns 4, and the 1-quantile query returns the maximum, 9.
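The example above can be reproduced with a few lines of Python. This is a minimal sketch of the exact (non-streaming) query, assuming the rank convention ⌈φN⌉, which matches the three results quoted in the example; the function name `exact_quantile` is illustrative and not part of the invention.

```python
import math

def exact_quantile(data, phi):
    """Return the phi-quantile: the item of rank ceil(phi * N) in sorted order."""
    if not 0 < phi <= 1:
        raise ValueError("phi must lie in (0, 1]")
    ordered = sorted(data)
    # 1-based rank; floating-point rounding may shift the rank for some phi,
    # so this is a sketch rather than a production implementation.
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

D = [6, 1, 8, 7, 9, 0, 4, 2, 5, 3]
print(exact_quantile(D, 0.1), exact_quantile(D, 0.5), exact_quantile(D, 1.0))
```

Sorting the full data set is exactly what a streaming system cannot afford, which is why the summary structures discussed below are needed.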

In a stream-data environment the complete data set is never available, so the data cannot be fully sorted, and quantile queries become particularly important. For example, when monitoring temperature trends at various locations, one may query in real time the maximum temperature (1-quantile) or the median temperature (0.5-quantile) reported by certain sensor nodes over a recent period, or even the temperature distribution at any given proportion. Quantile queries are also used in stock-market trend analysis, web aggregation queries, web log mining, distributed storage data management, and other fields.

Because stream data arrives at high speed and cannot be captured and stored in full, approximate quantile query methods are now widely used in industry: an approximate quantile answer is computed from a sample of the data, with the goal of real-time quantile computation in a stream-data environment.

Current research on approximate quantile computation focuses mainly on optimizing the computational efficiency and storage efficiency of the algorithms. Typical results are summarized as follows. The MRL99 algorithm proposed by Manku et al. (G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In ACM SIGMOD, 1999.) is a single-pass query algorithm that returns an approximation with a deterministic, uniform guarantee (|r' - r| ≤ εN). Its weakness is that the exact number N of data items in the stream must be known in advance. Greenwald and Khanna proposed another quantile query algorithm, the GK algorithm (M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In ACM SIGMOD, 2001.). This algorithm not only reduces the space complexity of the preceding algorithm but also removes the need to know the number of data items N in advance. For the case where the value domain of the data stream is known, Cormode and Muthukrishnan further proposed applying the count-min technique (G. Cormode, S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 2005, vol. 55, no. 1, pp. 58-75.) to manage intervals. The space complexity of this approach depends only on the partitioned value domain and not on the number of data items that actually arrive in the stream, which reduces space consumption, but the method cannot effectively support dynamic partitioning of arbitrary value-domain intervals.

The q-digit approximate query method proposed by Shrivastava et al. (N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. Medians and beyond: New aggregation techniques for sensor networks. In ACM SenSys, 2004; the structure is named q-digest in that paper) dynamically adjusts the value interval each summary entry is responsible for as data items arrive, and supports quantile queries over stream data through a fixed traversal rule. The summary data built by q-digit captures the distribution of the data approximately, without storing and sorting every arriving item. The core idea is to group the sampled values automatically according to the data distribution and place them into variable-size buckets of similar weight. q-digit further supports more complex operations, such as median queries, quantile queries, inverse quantile queries, range queries, frequent-item queries, and consensus queries.

In addition, the q-digit algorithm has a controllable error. If the keys of the data items are integers in the range [1, σ] and the q-digit summary holds m samples, the error of a quantile query is less than O(log(σ)/m). q-digit is currently a widely used quantile query method for stream data.

Summary of the invention

Existing quantile query algorithms and their applications mainly target centralized storage environments and focus on improving the approximation accuracy and efficiency of the algorithms. In a distributed environment, however, data is spread over different storage devices and processing devices, so mutually independent data partitions must be built. As data is continuously written, the summary data of each partition also has to be split and merged.

For summary data that supports quantile queries in a distributed environment, the present invention proposes a high-precision splitting method: the summary data structure of a partition is split at the midpoint (φ = 0.5) that halves the data volume, producing two summary data structures with approximately equal data volumes. After the split, each summary data structure independently supports subsequent data queries and processing.

Specifically, the technical solution adopted by the present invention is as follows:

A summary data splitting method for quantile queries, comprising the steps of:

1) sampling the written data items and building q-digit summary data;

2) querying the midpoint of the data items in the q-digit summary data according to the quantile query rule of q-digit post-order traversal;

3) traversing the q-digit summary data in reverse from the midpoint to establish a split path, and splitting the q-digit summary data along the split path into two summary data structures with approximately equal data volumes.

Further, the q-digit summary data of step 1) may be organized as a tree, an array, a linked list, or a similar structure.

Preferably, the q-digit summary data is organized as a tree, and the splitting comprises the following steps:

a) according to the required split point, finding the midpoint by the quantile query rule of q-digit post-order traversal and using it as the split point;

b) starting from the split point, moving up the tree through successive parent nodes until the root node is reached, thereby obtaining the split path; based on this split path, dividing the nodes of the q-digit summary data into a left subtree and a right subtree, with the nodes on the split path kept in both subtrees;

c) in each of the two subtrees, modifying the range of the value domain that each internal node is responsible for, and merging intermediate nodes whose ranges become identical.

A dynamic maintenance method for quantile-query-oriented summary data: when the load becomes unbalanced, or when a new processing device needs to be added, the above method is used to split the summary data and offload part of the data to other processing nodes; each summary data structure produced by the split independently supports data queries over its own data interval.

The key points of the present invention are the following three:

1. Combining the quantile query rule with the error analysis method, a method of reverse traversal of q-digit is proposed. A q-digit query uses a bottom-up traversal to obtain the quantile query result at any point. Following this query rule, the present invention proposes a reverse traversal over the tree structure that starts from any quantile. The method effectively establishes a split path for any quantile, and the split path divides the summary data into two summary data sets in a given proportion;

2. Using the split path proposed in point 1, a splitting method for q-digit is proposed. The method first establishes the split path from the midpoint at φ = 0.5, obtains the left and right binary subtrees by post-order traversal, modifies the interval range of the data corresponding to the internal nodes of each binary tree, and thereby rebuilds the q-digit summary data over the new data intervals;

3. Error estimation and analysis are carried out on the summary data separated by points 1 and 2. Theoretical analysis shows that each summary data structure produced by the split can fully and independently support data queries over its own data interval, and the maximum error remains unchanged.

Compared with the prior art, the beneficial effects of the present invention are as follows:

1. The splitting method proposed by the present invention follows the q-digit query rule, so the split result does not alter the original query method or error estimation method of q-digit, nor the applications built on them. The method therefore has a sound theoretical basis and good application prospects;

2. The present invention uses only the original q-digit summary data structure to implement the splitting of summary data, which keeps the splitting process fast. After splitting, each structure remains an independent q-digit structure that can receive and process newly arriving data as usual, so the method effectively supports dynamic splitting and merging of arbitrary data partitions in a distributed environment;

3. The present invention can be used for the dynamic maintenance and management of q-digit summary data in a distributed environment, and the corresponding structures can be obtained at any time. For example, when the load becomes unbalanced or a new processing device is added, the method can be used to offload part of the data to other processing nodes. Upper-layer services can trigger the splitting operation according to the situation at the time.

Brief description of the drawings

Fig. 1 is a schematic diagram of the q-digit summary data structure in the embodiment.

Fig. 2 is a schematic diagram of the left subtree q1 and the right subtree q2 produced by splitting along the split path in the embodiment, where (a) shows an example of subtree q1 and (b) shows an example of subtree q2.

Fig. 3 is a schematic diagram of the maintenance of the left subtree q1 after splitting in the embodiment, where (a) shows the modification of the node ranges of q1 after splitting and (b) shows the merging of the nodes of q1 after splitting.

Fig. 4 is a schematic diagram of the node maintenance of the right subtree q2 after splitting in the embodiment, where (a) shows the modification of the node ranges of q2 after splitting, (b) shows the merging of the nodes of q2 after splitting, and (c) shows the final result of merging the nodes of q2.

Fig. 5 is a schematic diagram of the application of q-digit in a distributed environment.

Detailed description of the embodiments

To make the above objects, features, and advantages of the present invention easier to understand, the invention is further described below through specific embodiments and the accompanying drawings.

Based on q-digit, the present invention queries the midpoint of the data items, i.e. φ = 0.5, according to the quantile query rule. Then, starting from the midpoint, it traverses the q-digit summary data in reverse to establish a split path and, along the split path, divides the summary data into a left and a right subset over approximately equal intervals, so that each of the two resulting sub-intervals covers about 50% of the original data volume. The splitting process uses only the summary data stored in q-digit and preserves the original functions and properties of q-digit; the quantile query error does not change.

q-digit can be implemented with a tree, an array, a linked list, and so on; the description below uses a binary tree as the example. Other data organizations, such as arrays or linked lists, can be implemented with reference to this structure: the relationships between their levels must still satisfy the tree described in the present invention, and the relationships between elements and the sampling principle are the same as for the tree structure.

1. Node rules of the q-digit summary data binary tree

The nodes of q-digit are divided into the root node, leaf nodes, and internal nodes. Internal nodes must satisfy the following two conditions:

(1) $count(v) \le \lfloor n/k \rfloor$

(2) $count(v) + count(v_p) + count(v_s) > \lfloor n/k \rfloor$

where count(v) is the value of node v, v_p is the parent node of v, v_s is the sibling node of v, n is the L1 norm (data scale) of all data items, and k is the compression parameter set for the algorithm.
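The two node conditions can be checked mechanically. The following is a minimal sketch, assuming the form the conditions take in the q-digest literature, namely count(v) ≤ ⌊n/k⌋ and count(v) + count(v_p) + count(v_s) > ⌊n/k⌋; the function name and the sample numbers are illustrative only.

```python
def satisfies_node_conditions(count_v, count_parent, count_sibling, n, k):
    """Check the two assumed conditions for a node v.

    Condition (1): the node is not too heavy on its own.
    Condition (2): the node, its parent and its sibling together are heavy
    enough that they should not be compressed into the parent.
    """
    threshold = n // k  # floor(n / k)
    not_too_heavy = count_v <= threshold
    not_too_light = count_v + count_parent + count_sibling > threshold
    return not_too_heavy and not_too_light

# Hypothetical numbers: n = 100 items, k = 5, so the threshold is 20.
print(satisfies_node_conditions(10, 8, 5, n=100, k=5))
print(satisfies_node_conditions(3, 2, 1, n=100, k=5))
```

A node that fails the second check is a candidate for compression into its parent, which is what the bottom-up maintenance described later relies on.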

2. Core method and steps of summary data splitting

1) Building the q-digit summary data. According to the node construction conditions above, the written data items are sampled and a binary tree is built. For the construction and query procedures of the binary tree, refer to the conventional q-digit algorithm;

2) Querying the split point. At split time, the node corresponding to the requested split point is found using the quantile query method of q-digit post-order traversal;

3) Computing the reverse split path. Starting from the split point, move up the binary tree through successive parent nodes until the root node is reached. Based on this path, the nodes of the q-digit summary data are divided into a left subtree and a right subtree (the nodes on the path are kept in both subtrees);

4) Adjusting the value-domain ranges recorded in the nodes. In each of the two subtrees, modify the range of the value domain that each internal node is responsible for, and merge intermediate nodes whose ranges become identical.

3. Concrete implementation example

The implementation of the above steps is described below using concrete data.

1) Building the q-digit summary data

Assume the data items are of key-value type. When a data item arrives, it is written to the corresponding leaf node according to its key, and the whole structure is maintained according to the conditions that q-digit nodes must satisfy, as stated above. The usual approach is bottom-up: nodes are compressed and merged level by level, and the degree of compression of the summary data depends on the parameter k above.
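The bottom-up maintenance can be sketched as follows. This is an illustrative sketch, not the patented procedure itself: it assumes that σ is a power of two, that nodes use the heap numbering of the example (root 1, children of v are 2v and 2v + 1, so the leaf for key x is node σ + x − 1), and that a sibling pair is folded into its parent whenever the pair and the parent together hold at most ⌊n/k⌋ items. The function name and the sample keys are hypothetical.

```python
def build_summary(keys, sigma, k):
    """Build a compressed summary {node_id: count} over integer keys in [1, sigma]."""
    n = len(keys)
    threshold = n // k
    counts = {}
    for key in keys:
        leaf = sigma + key - 1  # heap id of the leaf for this key
        counts[leaf] = counts.get(leaf, 0) + 1
    # Visit sibling pairs from the deepest level upward (larger ids are deeper),
    # so children are always handled before the pair that contains their parent.
    for left in range(2 * sigma - 2, 1, -2):
        right, parent = left + 1, left // 2
        total = counts.get(left, 0) + counts.get(right, 0) + counts.get(parent, 0)
        if 0 < total <= threshold:
            counts[parent] = total  # fold the light pair into the parent
            counts.pop(left, None)
            counts.pop(right, None)
    return counts

# Hypothetical input: 16 keys in [1, 16], compression parameter k = 5.
sample_keys = [1, 1, 1, 1, 1, 1, 2, 3, 9, 10, 11, 12, 12, 12, 13, 16]
print(sorted(build_summary(sample_keys, sigma=16, k=5).items()))
```

A larger k lowers the threshold, so fewer nodes are folded together and the summary keeps more detail at the cost of more space.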

Fig. 1 shows the q-digit summary data structure built according to the above rules; the dashed part marks the right subtree (q2) selected by the split. White nodes in the figure are empty (nodes that disappeared after compression and merging) and are still drawn to keep the binary tree complete. Solid nodes are nodes that still exist after compression. The number inside each node is its label: the root node is 1, its left child is 2, its right child is 3, and so on. The range [min, max] marked next to each node is the range of keys the node is responsible for. The key range of the whole q-digit is [1-16] and k = 5, as shown in Fig. 1.

2) Querying the split point

The q-digit is split as follows. A quantile query with φ = 0.5 is run on the q-digit using post-order traversal of the binary tree (the black nodes): left subtree, right subtree, then root, applied recursively. The resulting node sequence is <8>, <18>, <9>, <20>, <26>, <13>, <6> ... <15>, <3>, <1>. Suppose the node labeled <13> corresponds to φ = 0.5. The key range it represents is [11-12], so the φ = 0.5 quantile is key = 12, and key = 12 is used as the midpoint of the split.
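The post-order query can be sketched in Python. The node set and the counts below are hypothetical: the text elides part of the node sequence and gives no counts, so the sketch assumes the nodes named elsewhere in this example and picks counts such that node <13> is the one at which half of the data has been accumulated. The helper names are illustrative.

```python
def node_range(node, sigma):
    """Key range [low, high] covered by a heap-numbered node (root is 1)."""
    depth = node.bit_length() - 1
    width = sigma >> depth
    low = (node - (1 << depth)) * width + 1
    return low, low + width - 1

def postorder(present, sigma):
    """Present node ids in post-order: left subtree, right subtree, node."""
    order = []
    def visit(node):
        if node >= 2 * sigma:  # below the leaf level
            return
        visit(2 * node)
        visit(2 * node + 1)
        if node in present:
            order.append(node)
    visit(1)
    return order

def quantile_node(counts, sigma, phi):
    """First node in post-order at which the running count reaches phi * n."""
    target = phi * sum(counts.values())
    running = 0
    for node in postorder(counts, sigma):
        running += counts[node]
        if running >= target:
            return node
    return None

# Hypothetical counts (total 20) for an assumed node set.
COUNTS = {8: 2, 18: 2, 9: 1, 20: 2, 26: 1, 13: 2, 6: 1, 28: 2, 14: 2, 15: 2, 3: 2, 1: 1}
split_node = quantile_node(COUNTS, 16, 0.5)
print(postorder(COUNTS, 16), split_node, node_range(split_node, 16))
```

Under these assumptions the upper end of the range of the returned node, key 12, is the split key used in the following steps.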

3) Computing the reverse split path

Starting from the node of the quantile key = 12, the reverse split path is computed by moving up through the right subtree to the root node, which divides the tree into a left and a right part. The split path is <13>, <6>, <3>, <1>. Splitting along this reverse path yields two subtrees. Left subtree: <8>, <18>, <26>, <13>, <6>, <3>, <1>. Right subtree: <6>, <28>, <14>, <15>, <3>, <1>. The split path is kept in both subtrees, while the split point itself is kept only in the left subtree. In Fig. 2 the left and right subtrees are labeled q1 and q2 respectively: (a) shows an example of subtree q1 and (b) shows an example of subtree q2. q1 is responsible for the key range [1-12], and q2 is responsible for the data partition [13-16].
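With the node numbering of this example (root 1, children of v numbered 2v and 2v + 1), the parent of a node is obtained by integer division by 2, so the reverse split path reduces to a short loop. A minimal sketch, with an illustrative function name:

```python
def split_path(node):
    """Walk from a node up to the root, assuming heap numbering (parent = node // 2)."""
    path = [node]
    while node > 1:
        node //= 2
        path.append(node)
    return path

print(split_path(13))  # prints [13, 6, 3, 1]
```

The path length is bounded by the height of the tree, so establishing the split path costs O(log σ) steps regardless of how many data items the summary represents.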

4) Adjusting the value-domain ranges recorded in the nodes

The node ranges in q1 and q2 are modified as follows. Let the range each node is responsible for be written [min, max]. In q1, every node range with max > 12 is changed to [min, 12]. In q2, every node range with min < 13 is changed to [13, max]. Keys are assumed here to be integers. Fig. 3(a) shows an example of the modified node ranges of q1, and Fig. 4(a) shows an example of the modified node ranges of q2.
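The range adjustment amounts to clamping each node range to the partition it now belongs to. A minimal sketch, assuming integer keys and split key 12 as in this example; a range lying entirely at or below the split key collapses to [13, 13] on the right-hand side, which is the case the description gives for node <6> in q2. The function names are illustrative.

```python
def clamp_left(rng, split_key):
    """Range of a node kept in the left structure q1: cap max at the split key."""
    low, high = rng
    return (low, min(high, split_key))

def clamp_right(rng, split_key):
    """Range of a node kept in the right structure q2: raise min (and, if needed, max)."""
    low, high = rng
    first = split_key + 1
    return (max(low, first), max(high, first))

print(clamp_left((1, 16), 12), clamp_left((9, 16), 12))
print(clamp_right((9, 16), 12), clamp_right((9, 12), 12))
```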

Because the ranges of some nodes change during the split, some nodes end up representing the same key range and must be merged into a single node; the value of the merged node is the sum of the values of the nodes being merged. For example, in q2 a node with max < 13, such as the node labeled <6>, has its range changed to [13, 13], which is identical to the range represented by leaf node <28>, so it must be merged with that leaf node. The merging process is shown in Fig. 4(b).
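The merge rule can be sketched as a grouping by range. The values below are hypothetical, and keeping the deepest node (the largest label) of each group is an assumption inferred from the merges described for this embodiment; the function name is illustrative.

```python
def merge_same_range(nodes):
    """Merge nodes that share a range.

    `nodes` maps node id -> ((low, high), value). Nodes with equal ranges are
    merged into the one with the largest id (assumed to be the deepest node),
    and their values are summed.
    """
    by_range = {}
    for node_id, (rng, value) in nodes.items():
        keep_id, total = by_range.get(rng, (node_id, 0))
        by_range[rng] = (max(keep_id, node_id), total + value)
    return {node_id: (rng, total) for rng, (node_id, total) in by_range.items()}

# q2 after range adjustment, with hypothetical values (node 7 is empty).
q2 = {
    1: ((13, 16), 1), 3: ((13, 16), 2), 7: ((13, 16), 0),
    6: ((13, 13), 1), 28: ((13, 13), 2),
    14: ((13, 14), 2), 15: ((15, 16), 2),
}
print(merge_same_range(q2))
```

Because values are only added together, the total held by the structure is unchanged by merging.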

In q1, the range [1-16] of node <1> is changed to [1-12], the range [9-16] of node <3> is changed to [9-12], and the values of <3> and <6> are merged into <6>, as shown in Fig. 3(b).

In q2, nodes <1>, <3>, and <7> have identical ranges and are merged into <7>, and nodes <6> and <28> are merged into leaf node <28>, as shown in Fig. 4(b) and Fig. 4(c).

4. Error analysis

1) q-digit query error

Let v be a node in q-digit and x an ancestor node of v. From the q-digit construction process, the following inequality holds:

$$error(v) \le \sum_{x \in ancestor(v)} count(x),$$

Because the nodes in q-digit must satisfy condition (1), and hence count(x) ≤ n/k, we further have

$$error(v) \le \sum_{x \in ancestor(v)} count(x) \le \sum_{x \in ancestor(v)} \frac{n}{k} \le \log \sigma \cdot \frac{n}{k},$$

where log σ is the height of the binary tree, n is the L1 norm (data scale) of all data items, and k is the compression parameter.
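As a worked instance of this bound, the snippet below evaluates log σ · n/k for the parameters of the example (σ = 16, k = 5). The data scale n = 100 is a hypothetical value chosen for illustration, and the logarithm is taken to base 2 because log σ denotes the height of a binary tree.

```python
import math

def max_quantile_error(sigma, n, k):
    """Upper bound log2(sigma) * n / k on the rank error of a quantile query."""
    return math.log2(sigma) * n / k

# sigma = 16 and k = 5 come from the example; n = 100 is hypothetical.
print(max_quantile_error(sigma=16, n=100, k=5))  # prints 80.0
```

Increasing k tightens the bound proportionally, at the cost of a larger summary.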

An interval query (range query) is defined as the sum of the values of the data items whose keys lie between key_1 and key_2, namely $\sum_{key_i \in [key_1, key_2]} value_i$, where value_i is the value of a data item in the interval [key_1, key_2]. From the q-digit construction process, the maximum error of an interval query is bounded as well.
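For reference, the exact answer that an interval query approximates can be written directly from the definition. A minimal sketch over hypothetical key-value items; it operates on raw data, not on the summary, and the function name is illustrative.

```python
def range_query(items, key_low, key_high):
    """Exact interval query: sum of values whose key lies in [key_low, key_high]."""
    return sum(value for key, value in items if key_low <= key <= key_high)

# Hypothetical (key, value) items.
items = [(1, 5), (9, 2), (12, 3), (13, 4)]
print(range_query(items, 9, 12))  # prints 5
```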

2) q-digit query error after splitting

After the split, queries still effectively use the original binary tree. The split is equivalent to copying the tree into two identical trees: removing the unneeded nodes from one leaves the left subtree q1, and removing the unneeded nodes from the other leaves q2. Whether a query targets a node in q1 or in q2, the node order produced by post-order traversal is the same as in the original tree, so the error is identical to that of the original q-digit.

5. Application of q-digit in a distributed environment

Fig. 5 is a schematic diagram of the application of q-digit in a distributed environment. Arriving data is fed into the "data distribution statistics" module for analysis. This module consists of multiple mutually independent q-digit structures, each responsible for the statistics of one key interval. In a distributed environment, the quantile structures are split and maintained as needed.
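How incoming keys might be directed to the independent structures after a split can be sketched as a small routing table. This is a hypothetical illustration of the bookkeeping only: the class `PartitionRouter`, its methods, and the partition names are assumptions, and the summary structures themselves are not modeled.

```python
import bisect

class PartitionRouter:
    """Maps integer keys to named partitions covering consecutive key ranges."""

    def __init__(self, low, high, name):
        self.low = low
        self.bounds = [high]  # inclusive upper bound of each partition, ascending
        self.names = [name]

    def split(self, name, split_key, new_name):
        """Split partition `name` at split_key; keys above it go to `new_name`."""
        i = self.names.index(name)
        self.bounds.insert(i, split_key)
        self.names.insert(i + 1, new_name)

    def route(self, key):
        """Return the name of the partition responsible for `key`."""
        if not self.low <= key <= self.bounds[-1]:
            raise KeyError(key)
        return self.names[bisect.bisect_left(self.bounds, key)]

router = PartitionRouter(1, 16, "q1")
router.split("q1", 12, "q2")  # q1 keeps [1, 12], q2 takes [13, 16]
print(router.route(12), router.route(13))
```

An upper-layer service could call `split` when it detects load imbalance or adds a processing device, then hand the new partition to another node.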

The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as defined in the claims.

Claims

1. A summary data splitting method for quantile queries, comprising the steps of:

1) sampling the written data items and building q-digit summary data;

2. The method of claim 1, characterized in that the q-digit summary data of step 1) is organized as one of the following: a tree structure, an array, or a linked list.

3. The method of claim 2, characterized in that the q-digit summary data is organized as a tree structure, and the splitting comprises the following steps:

4. The method of claim 3, characterized in that the nodes of the tree structure of the q-digit summary data are divided into the root node, leaf nodes, and internal nodes, wherein internal nodes satisfy the following two conditions:

where count(v) is the value of node v, v_p is the parent node of v, v_s is the sibling node of v, n is the L1 norm of all data items, and k is the configured compression parameter.

5. The method of claim 3, characterized in that the tree structure is a binary tree structure.

6. The method of claim 1, characterized in that the data items written in step 1) are of key-value type.

7. A dynamic maintenance method for quantile-query-oriented summary data, characterized in that, when the load becomes unbalanced or a new processing device needs to be added, the method of any one of claims 1 to 6 is used to split the summary data and offload part of the data to other processing nodes, and each summary data structure produced by the split independently supports data queries over its own data interval.