CN106202384A

CN106202384A - An indexing method supporting aggregate functions over time-series data

Info

Publication number: CN106202384A
Application number: CN201610536956.2A
Authority: CN
Inventors: 王建民; 黄向东; 郑亮帆; 康荣; 龙明盛; 刘英博
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2016-07-08
Filing date: 2016-07-08
Publication date: 2016-12-07

Abstract

An indexing method supporting aggregate functions over time-series data, which provides good support for ad hoc queries involving simple aggregation operations. The basic idea is to combine a summary table with segment trees (Segment Tree): a segment forest model consisting of multiple segment trees is built over the summary table, which avoids full scans of the summary table. The segment forest is constructed dynamically in a bottom-up manner, which avoids the drawback that a traditional segment tree does not support growth. In addition, the query algorithm locates index data directly by computation, which avoids recursive traversal of the segment forest and reduces the number of disk I/O operations. Experimental results show that this computed-lookup query approach, based on the summary table plus the segment forest, effectively reduces the number of disk I/O operations and significantly improves query performance.

Description

An indexing method supporting aggregate functions over time-series data

Technical field

The present invention relates to a method for automatic model selection and parameter configuration of big data systems in the development of big data applications, and belongs to the technical field of computer database management.

Background technology

With the development of sensor technology and the spread of the Internet, data collection and information dissemination have reached unprecedented speeds. Aggregate information such as the extreme values and averages of the data has become particularly important, and how to obtain this aggregate information quickly and accurately is the focus of this work.

To serve such queries, the database must support fast aggregation operations over massive data within an arbitrary time range.

Traditional relational databases mainly use summary tables or materialized views to accelerate aggregate queries. A materialized view preprocesses query commands that involve table joins and stores the results in a view table; when a user issues a query, the database reads directly from the view table and returns the result. A summary table computes and stores the corresponding summary information while the data is being written, so that a later query can read directly from the summary table and return the result.

Both approaches essentially precompute and store commonly used aggregate information, which narrows the query scope and improves actual query speed. Their drawback is that they increase the growth rate of the database, and performance degrades as the data grows.

In NoSQL databases, some systems use MapReduce to process these aggregation operations: each aggregate query retrieves the relevant table data from the database in real time and submits it to the Map program. In the Map phase, the program filters out the data that satisfies the conditions and submits it to the Reduce program, which collects the data and computes the query result. Other databases, such as MongoDB, propose the concept of an aggregation pipeline (Aggregation Pipeline), which combines the idea of MapReduce with the idea of Linux pipes. Its principle is that aggregation operations act directly on the data files, filtering and aggregating the data in the files through native, system-like operations.

Both MapReduce and the aggregation pipeline are representative of real-time computation. Although they do not increase the growth rate of the database, the query process incurs substantial disk and computation overhead; it is inefficient and time-consuming and cannot meet the needs of ad hoc queries.

Plamen Nikolov et al. applied the idea of materialized views to NoSQL: common statistics such as counts and sums are precomputed and stored in a view table, which is then updated incrementally, in order to accelerate query response.

This approach is clearly faster than running MapReduce computations on a NoSQL database, but it has drawbacks of its own. The way a materialized view is formed means that it does not support queries over arbitrary ranges. In addition, the disk overhead of queries grows as the data volume rises.

Summary of the invention

To address the above problems, an indexing mechanism supporting aggregation operations in NoSQL databases is proposed. The basic idea is to combine a summary table with segment trees (Segment Tree): a segment forest model consisting of multiple segment trees is built over the summary table, which avoids full scans of the summary table. The segment forest is constructed dynamically in a bottom-up manner, which avoids the drawback that a traditional segment tree does not support growth. In addition, the query algorithm locates index data directly by computation, which avoids recursive traversal of the segment forest and reduces the number of disk I/O operations. The index engine described above was implemented on the Cassandra database, and 2 groups of comparison experiments were designed: direct queries on the raw data and direct queries on the summary table. The experimental results show that the computed-lookup query approach based on the summary table plus the segment forest effectively reduces the number of disk I/O operations and significantly improves query performance.

An indexing method supporting aggregate functions over time-series data, characterized in that it comprises two steps:

Step 1: define the data model of the time-series data and the query requirement

Definition 1: data item: a data item D (data point) is a triple (s, t, v), where s is the sensor ID, t is the timestamp, and v is the sensor value; s and t together form a globally unique identifier. The data items of one sensor over a continuous period of time constitute time-series data. On this basis, the query problem to be solved is defined as: on the time-series data, query statistical information, such as values and variance, of the data within a time window t₁～t₂ (t₁ and t₂ being arbitrary times)；

Definition 2: summary info: in time-series data, the statistical information of k time-consecutive data items, together with their time window, constitutes 1 piece of summary info (data digest)；

Definition 3: leaf node: the summary info produced directly from data items, plus specific label information, constitutes a leaf node (leaf node)；

Definition 4: intermediate node: the aggregation of 2 leaf nodes or 2 intermediate nodes, plus specific label information, constitutes an intermediate node (parent node)；to avoid recursive operations on the tree and to enable fast retrieval in the summary forest, the necessary label information is added to leaf nodes and intermediate nodes: the sequence number and the code；

Definition 5: sequence number: when the index is first built, each leaf node is assigned 1 sequence number (serial) in order of generation; sequence numbers start at 1 and increase, and intermediate nodes have no sequence number；

Definition 6: code: each node is assigned 1 code (code) in the order of a post-order traversal of the segment forest; codes start at 1 and increase；

Definition 7: summary forest: a summary forest (Synopsis Forest) is the forest formed by the summary trees built from the nodes.
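The data item and summary info of Definitions 1 to 4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class names `DataPoint` and `Summary`, and the choice of stored statistics (count, sum, sum of squares, minimum, maximum), are assumptions made here so that the value and variance statistics of two adjacent windows can be merged without rereading the data items.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DataPoint:
    s: str    # sensor ID
    t: int    # timestamp; (s, t) is globally unique
    v: float  # sensor value


@dataclass(frozen=True)
class Summary:
    """Hypothetical summary info for one time window (fields are assumptions)."""
    t_start: int
    t_end: int
    count: int
    total: float
    total_sq: float  # sum of squares, kept so that variance can be merged
    minimum: float
    maximum: float

    @staticmethod
    def of(points: Iterable[DataPoint]) -> "Summary":
        """Summarize k time-ordered, consecutive data items (k >= 1)."""
        pts = list(points)
        vals = [p.v for p in pts]
        return Summary(pts[0].t, pts[-1].t, len(vals), sum(vals),
                       sum(x * x for x in vals), min(vals), max(vals))

    def merge(self, other: "Summary") -> "Summary":
        """Combine two summaries, as an intermediate node does for its children."""
        return Summary(min(self.t_start, other.t_start),
                       max(self.t_end, other.t_end),
                       self.count + other.count,
                       self.total + other.total,
                       self.total_sq + other.total_sq,
                       min(self.minimum, other.minimum),
                       max(self.maximum, other.maximum))

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def variance(self) -> float:
        """Population variance, derived from the merged sums."""
        return self.total_sq / self.count - self.mean ** 2
```

Because every stored field is either additive or a min/max, merging two child summaries gives the same result as summarizing the underlying data items directly, which is what allows intermediate nodes to be built bottom-up.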

Step 2: construction and querying of the summary forest

(1) Summary forest construction

The summary forest maintains a stack structure, rootStack, to make merging efficient, and also maintains a queue to temporarily hold the node information waiting to be flushed to disk.

A. When the i-th leaf node arrives:

A) If i is odd:

A) Add the leaf node directly; it forms a tree by itself. Its sequence number is i and its code is 2i-ones(i), where ones(i) is the number of 1s in the binary representation of i；

B) Add the leaf node to rootStack and to the queue；

B) If i is even:

A) Adding the leaf node also triggers the generation of a new tree. The sequence number of the leaf node is i, and its code is the code of leaf node (i-1) plus 1, that is, 2(i-1)-ones(i-1)+1；

B) Add the leaf node to the queue；

C) The root node of the new tree produced by this leaf node has code 2i-ones(i); the remaining newly generated intermediate nodes have codes 2(i-1)-ones(i-1)+2 through 2i-ones(i)-1, in order；

D) Push the leaf node onto rootStack；

E) Pop the top two nodes of rootStack; both are root nodes and have the same height. Merge them into a new tree; over successive merges the root code increases from 2(i-1)-ones(i-1)+2 up to 2i-ones(i)；

F) Add the root node generated in step E) to the queue；

G) Push the root node generated in step E) onto rootStack and repeat step E) until the code of the newly generated root node reaches 2i-ones(i)；

B. Flush the nodes held in the queue to disk.
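The construction steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's code: `Node`, `SynopsisForest`, `add_leaf` and the in-memory `disk` dictionary are hypothetical names, the aggregate is reduced to a single sum, and the odd and even branches are folded into one loop (for odd i the loop body never runs, because the leaf is lower than the root beneath it on the stack).

```python
from dataclasses import dataclass
from typing import Dict, List


def ones(i: int) -> int:
    """Number of 1s in the binary representation of i."""
    return bin(i).count("1")


@dataclass
class Node:
    code: int     # post-order code, starting at 1
    height: int   # 0 for a leaf node
    first: int    # sequence number of the leftmost leaf covered
    last: int     # sequence number of the rightmost leaf covered
    value: float  # stand-in aggregate (a plain sum in this sketch)


class SynopsisForest:
    def __init__(self) -> None:
        self.root_stack: List[Node] = []  # roots of the current trees
        self.queue: List[Node] = []       # nodes waiting to be flushed
        self.disk: Dict[int, Node] = {}   # stand-in for the database, keyed by code
        self.leaves = 0
        self.next_code = 1

    def _new_node(self, height: int, first: int, last: int, value: float) -> Node:
        node = Node(self.next_code, height, first, last, value)
        self.next_code += 1
        self.queue.append(node)
        return node

    def add_leaf(self, value: float) -> None:
        self.leaves += 1
        i = self.leaves
        self.root_stack.append(self._new_node(0, i, i, value))
        # Merge while the two topmost roots have the same height (only when i is even).
        while (len(self.root_stack) >= 2
               and self.root_stack[-1].height == self.root_stack[-2].height):
            right = self.root_stack.pop()
            left = self.root_stack.pop()
            self.root_stack.append(self._new_node(
                left.height + 1, left.first, right.last, left.value + right.value))
        # Flush the queued nodes to "disk".
        for node in self.queue:
            self.disk[node.code] = node
        self.queue.clear()
```

After the i-th call to `add_leaf`, the node on top of `root_stack` carries code 2i-ones(i) and the stack holds ones(i) roots, which matches the numbering stated in the steps.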

(2) Summary forest query

1) First define the query requirement: query the summary info of the data items in the time window t_a～t_b.

2) The query proceeds as follows:

A. Normalize the time window: assume t_is<t_a<t_ie and t_js<t_b<t_je; the query time window can then be divided into 3 time windows: t_a～t_ie, t_(i+1)s～t_(j-1)e, and t_js～t_b；

B. For the time windows t_a～t_ie and t_js～t_b, read the data items from t_a to t_ie and from t_js to t_b directly from the database, and compute the summary info of these two windows directly from the data items；

C. For the time window t_(i+1)s～t_(j-1)e, find the minimum number of segments in the segment forest such that these segments form a partition of the time window t_(i+1)s～t_(j-1)e. Assuming that s segments are needed in total, read the summary nodes corresponding to these s segments from the database in turn to obtain s pieces of summary info；this step is implemented as follows:

A) Read the 2 corresponding summary packets from the database according to the start and end times t_(i+1)s and t_(j-1)e, and obtain the corresponding sequence numbers i and j from the summary packets；

B) Obtain the lower-bound sequence number: if i is even, add the summary packet corresponding to t_(i+1)s to the pending queue, and the lower-bound sequence number is (i+1); otherwise, the lower-bound sequence number is i；

C) Obtain the upper-bound sequence number: if j is odd, add the summary packet corresponding to t_(j-1)e to the pending queue, and the upper-bound sequence number is (j-1); otherwise, the upper-bound sequence number is j；

D) From the upper-bound sequence number, compute the corresponding code and the code of the topmost node that covers the node with this sequence number；

E) From that code and the code of the topmost node, compute the sequence number of the leftmost leaf node covered by the topmost node；

F) If the leftmost sequence number is greater than the lower-bound sequence number, add the code of the topmost node to the lookup queue, set the upper-bound sequence number to the sequence number of the leftmost leaf node minus 1, and go to step D)；

G) If the leftmost sequence number is less than the lower-bound sequence number, decrease the code of the topmost node by 1 and go to step E)；

H) If the leftmost sequence number is equal to the lower-bound sequence number, add the code of the topmost node to the lookup queue and exit the loop；

I) Finally, fetch the corresponding summary packets according to the lookup queue and add them to the pending queue.

The summary info of the time window t_a～t_b can then be computed from the (s+2) pieces of summary info obtained in steps B and C.
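The computation in step C can be sketched as follows; it works purely on sequence numbers and codes and never walks a tree. The function names `leaf_code` and `cover` are hypothetical, and two points are assumptions of this sketch rather than statements from the patent: the "topmost node" of an even upper-bound leaf u is read as the root created when leaf u arrived (code 2u-ones(u)), and guards are added for ranges that become empty after the parity adjustments.

```python
from typing import List, Tuple


def ones(i: int) -> int:
    """Number of 1s in the binary representation of i."""
    return bin(i).count("1")


def leaf_code(k: int) -> int:
    """Post-order code of the leaf node with sequence number k."""
    if k % 2 == 1:
        return 2 * k - ones(k)
    return 2 * (k - 1) - ones(k - 1) + 1


def cover(lo: int, hi: int) -> List[Tuple[int, int, int]]:
    """Return (code, first_leaf, last_leaf) for nodes partitioning leaves lo..hi."""
    result: List[Tuple[int, int, int]] = []
    if lo > hi:
        return result
    if lo % 2 == 0:
        # An even leaf is a right child: take it alone (step B).
        result.append((leaf_code(lo), lo, lo))
        lo += 1
    if lo > hi:
        return result
    if hi % 2 == 1:
        # An odd leaf is a left child or a lone tree: take it alone (step C).
        result.append((leaf_code(hi), hi, hi))
        hi -= 1
    while lo <= hi:
        base = leaf_code(hi)        # code of the upper-bound leaf (step D)
        top = 2 * hi - ones(hi)     # topmost node whose rightmost leaf is hi
        while True:
            # A node's height equals its code distance from the leaf (step E).
            leftmost = hi - (1 << (top - base)) + 1
            if leftmost >= lo:
                break
            top -= 1                # descend to the right child (step G)
        result.append((top, leftmost, hi))  # steps F and H
        hi = leftmost - 1
    return result
```

For example, `cover(3, 10)` returns `[(18, 9, 10), (14, 5, 8), (6, 3, 4)]`: three summary nodes are read instead of eight leaf summaries. Decrementing the code in step G works because, in a post-order numbering, the node with the next lower code is the right child, which still ends at the same upper-bound leaf.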

The present invention proposes an efficient indexing method supporting aggregation operations over time-series data. Its advantages are:

1. It provides good support for ad hoc queries involving simple aggregation operations. During a query, the indexing mechanism avoids a large amount of disk overhead, and it solves the problem that materialized views and summary tables suffer performance degradation as the data volume grows；

2. It combines the summary table with segment trees (Segment Tree): a segment forest model consisting of multiple segment trees is built over the summary table, which avoids full scans of the summary table；

3. The segment forest is constructed dynamically in a bottom-up manner, which avoids the drawback that a traditional segment tree does not support growth. In addition, the query algorithm locates index data directly by computation, which avoids recursive traversal of the segment forest and reduces the number of disk I/O operations；

4. The indexing mechanism is independent of the underlying database; with the self-implemented Java-based query engine, it can easily be ported to any database platform.
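The direct positioning mentioned in advantage 3 relies on the closed form 2i-ones(i) used in the construction step. A short counting argument, added here as a sketch and not taken from the patent text, shows where it comes from. Let B be the set of positions of the 1 bits of i, so that i is the sum of 2^k over k in B. Because equal-height roots are always merged, after i leaves the forest holds exactly one complete binary tree with 2^k leaves for each k in B, and such a tree has 2^(k+1)-1 nodes. The total number of nodes, which equals the largest post-order code, is therefore:

```latex
\sum_{k \in B} \left( 2^{k+1} - 1 \right)
  = 2 \sum_{k \in B} 2^{k} - |B|
  = 2i - \mathrm{ones}(i)
```

For i = 3 this gives 6 - 2 = 4, and for i = 4 it gives 8 - 1 = 7, matching a forest of a two-leaf tree plus a single leaf (4 nodes) and a single four-leaf tree (7 nodes).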

Brief description of the drawings

The preferred embodiments of the present invention are further described below by way of non-limiting example, with reference to the accompanying drawings, in which:

Fig. 1 is a schematic diagram of the summary info corresponding to one group of data items.

Fig. 2 shows the summary forest and the time windows defined by the method of the invention.

Fig. 3 shows the addition of a node with an odd sequence number (top) and with an even sequence number (bottom).

Fig. 4 is the pseudocode of the leaf-node insertion algorithm of the invention.

Fig. 5 is the pseudocode of the query algorithm of the invention.

Detailed description of the invention

The present invention is described in further detail below with reference to the accompanying drawings.

1. An indexing method supporting aggregate functions over time-series data, characterized in that it comprises two steps:

Step 1: define the data model of the time-series data and the query requirement

Step 2: construction and querying of the summary forest

(1) Summary forest construction

A. When the i-th leaf node arrives:

A) If i is odd:

B) Add the leaf node to rootStack and to the queue；

B) If i is even:

B) Add the leaf node to the queue；

D) Push the leaf node onto rootStack；

F) Add the root node generated in step E) to the queue；

B. Flush the nodes held in the queue to disk.

(2) Summary forest query

1) First define the query requirement: query the summary info of the data items in the time window t_a～t_b,

2) The query proceeds as follows:

Claims

Step 1: define the data model of the time-series data and the query requirement

Definition 1: data item: a data item D (data point) is a triple (s, t, v), where s is the sensor ID, t is the timestamp, and v is the sensor value; s and t together form a globally unique identifier. The data items of one sensor over a continuous period of time constitute time-series data. On this basis, the query problem to be solved is defined as: on the time-series data, query statistical information, such as values and variance, of the data within a time window t₁～t₂ (t₁ and t₂ being arbitrary times)；

Definition 2: summary info: in time-series data, the statistical information of k time-consecutive data items, together with their time window, constitutes 1 piece of summary info (data digest)；

Definition 3: leaf node: the summary info produced directly from data items, plus specific label information, constitutes a leaf node (leaf node)；

Definition 4: intermediate node: the aggregation of 2 leaf nodes or 2 intermediate nodes, plus specific label information, constitutes an intermediate node (parent node)；to avoid recursive operations on the tree and to enable fast retrieval in the summary forest, the necessary label information is added to leaf nodes and intermediate nodes: the sequence number and the code；

Definition 5: sequence number: when the index is first built, each leaf node is assigned 1 sequence number (serial) in order of generation; sequence numbers start at 1 and increase, and intermediate nodes have no sequence number；

Definition 6: code: each node is assigned 1 code (code) in the order of a post-order traversal of the segment forest; codes start at 1 and increase；

Definition 7: summary forest: a summary forest (Synopsis Forest) is the forest formed by the summary trees built from the nodes.

Step 2: construction and querying of the summary forest

(1) Summary forest construction

The summary forest maintains a stack structure (rootStack) to make merging efficient, and also maintains a queue (queue) to temporarily hold the node information waiting to be flushed to disk.

A. When the i-th leaf node arrives:

A) If i is odd:

A) Add the leaf node directly; it forms a tree by itself. Its sequence number is i and its code is 2i-ones(i), where ones(i) is the number of 1s in the binary representation of i；

B) Add the leaf node to rootStack and to the queue；

B) If i is even:

A) Adding the leaf node also triggers the generation of a new tree. The sequence number of the leaf node is i, and its code is the code of leaf node (i-1) plus 1, that is, 2(i-1)-ones(i-1)+1；

B) Add the leaf node to the queue；

C) The root node of the new tree produced by this leaf node has code 2i-ones(i); the remaining newly generated intermediate nodes have codes 2(i-1)-ones(i-1)+2 through 2i-ones(i)-1, in order；

D) Push the leaf node onto rootStack；

E) Pop the top two nodes of rootStack; both are root nodes and have the same height. Merge them into a new tree; over successive merges the root code increases from 2(i-1)-ones(i-1)+2 up to 2i-ones(i)；

F) Add the root node generated in step E) to the queue；

G) Push the root node generated in step E) onto rootStack and repeat step E) until the code of the newly generated root node reaches 2i-ones(i)；

B. Flush the nodes held in the queue to disk；

(2) Summary forest query

2) The query proceeds as follows:

A. Normalize the time window: assume t_is<t_a<t_ie and t_js<t_b<t_je; the query time window can then be divided into 3 time windows: t_a～t_ie, t_(i+1)s～t_(j-1)e, and t_js～t_b；

B. For the time windows t_a～t_ie and t_js～t_b, read the data items from t_a to t_ie and from t_js to t_b directly from the database, and compute the summary info of these two windows directly from the data items；

C. For the time window t_(i+1)s～t_(j-1)e, find the minimum number of segments in the segment forest such that these segments form a partition of the time window t_(i+1)s～t_(j-1)e. Assuming that s segments are needed in total, read the summary nodes corresponding to these s segments from the database in turn to obtain s pieces of summary info；this step is implemented as follows:

A) Read the 2 corresponding summary packets from the database according to the start and end times t_(i+1)s and t_(j-1)e, and obtain the corresponding sequence numbers i and j from the summary packets；

C) Obtain the upper-bound sequence number: if j is odd, add the summary packet corresponding to t_(j-1)e to the pending queue, and the upper-bound sequence number is (j-1); otherwise, the upper-bound sequence number is j；

D) From the upper-bound sequence number, compute the corresponding code and the code of the topmost node that covers the node with this sequence number；

F) If the leftmost sequence number is greater than the lower-bound sequence number, add the code of the topmost node to the lookup queue, set the upper-bound sequence number to the sequence number of the leftmost leaf node minus 1, and go to step D)；

H) If the leftmost sequence number is equal to the lower-bound sequence number, add the code of the topmost node to the lookup queue and exit the loop；