CN106991137B - A method for indexing time series data based on Hbase hash summary forest - Google Patents

A method for indexing time series data based on Hbase hash summary forest

Info

Publication number
CN106991137B
Authority
CN
China
Prior art keywords
time
tree
node
hash
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710154614.9A
Other languages
Chinese (zh)
Other versions
CN106991137A (en
Inventor
尹建伟
冯诗淳
邓水光
李莹
吴健
吴朝晖
易峥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Hithink Royalflush Information Network Co Ltd
Original Assignee
Zhejiang University ZJU
Hithink Royalflush Information Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Hithink Royalflush Information Network Co Ltd
Priority to CN201710154614.9A
Publication of CN106991137A
Application granted
Publication of CN106991137B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for indexing time series data based on an Hbase hash summary forest, comprising the following steps: (1) building each time unit tree according to the time granularity; (2) computing the hash code of each time unit tree, and assembling the time unit trees with their hash codes into an Hbase-based hash summary forest; (3) inserting the collected time series data into the hash summary forest according to the hash code; (4) querying and reading the stored time series data by time range. By adopting the summary forest tree index scheme, the invention speeds up aggregation queries over time series data; by generating hash codes, it also provides a hash index for the unit trees, which addresses the hotspot problem that arises when time series data is stored in distributed Hbase.

Description

A method for indexing time series data based on Hbase hash summary forest

Technical field

The invention relates to the field of storage technology, and in particular to a method for indexing time series data based on an Hbase hash summary forest.

Background

Time series data is continuous data indexed in time order. With the spread of computer applications, time series data has come into wide use in many fields. For example, as finance becomes more closely tied to the Internet, the many quantitative drawdown operations in finance place growing performance demands on aggregation over time series data. A typical case is computing statistics on the market price, order book price, or trading volume of a commodity futures contract over a one-quarter time range, using aggregation operations such as summation or maximum. Such scenarios occur frequently in quantitative finance, and because the data volume is huge, computing the aggregation result of financial time series data over the interval t1 to t2 quickly and accurately becomes very important.

Take summing the market price over a given time range in Au metal futures trading data as an example:

SELECT SUM(Last Price) FROM 'Au' WHERE time > t1 AND time < t2

In such a scenario, the system must be able to obtain aggregation results quickly from massive time series data.

Traditional relational databases mainly use materialized views or summary tables to speed up aggregation queries. A materialized view preprocesses query commands that involve table joins and saves the result in a view table, so that a query fetches the precomputed result directly. A summary table computes and stores the corresponding summary information as data is written, so that a query reads its result directly from the summary table. These methods improve query efficiency but increase the storage expansion rate of the database. Among NoSQL databases, some handle these aggregations with MapReduce or aggregation pipelines, both of which compute at query time. They do not increase the expansion rate, but they incur heavy disk and computation overhead during the query, are slow and inefficient, and cannot meet the needs of ad hoc queries. Some NoSQL databases integrate a tree index structure, which improves query efficiency and reduces the number of disk accesses.

Summary of the invention

In view of the above, the invention proposes a method for indexing time series data based on an Hbase hash summary forest. Building a tree index shortens the query time for time series data, and the hash codes avoid the uneven space allocation that results from storing time series data sequentially in a distributed database.

A method for indexing time series data based on an Hbase hash summary forest comprises the following steps:

(1) Build each time unit tree according to the time granularity;

(2) Compute the hash code of each time unit tree, and assemble the time unit trees with their hash codes into an Hbase-based hash summary forest;

(3) Insert the collected time series data into the hash summary forest according to the hash code;

(4) Query and read the stored time series data by time range.

In step (1), a time unit tree is built as follows: first, the time granularity of the time unit tree is determined in advance; then recursion starts from the root node, creating one new node at each call and recursively creating that node's left and right child nodes; recursion stops when a node to be created falls outside the precomputed range, which completes the construction of the whole tree.

In step (1), each time unit tree is a segment tree with a fixed time granularity. The time granularity is controlled through the height of each tree. A segment tree node stores the summary information for its range, mainly LBound, RBound, LNode, RNode, and Data. LBound and RBound are the start and end time points of the node's time range; LNode and RNode are the midpoints of the time ranges covered by the node's left and right child nodes, respectively; Data is the summary data value stored in the node. At this stage the Data of every node in the newly built time unit tree is empty.
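The construction described above can be sketched in Python. This is a minimal illustration, not the patented implementation: the class name Node, the function names build, height and leaves, the use of integer minutes, and half-open ranges [lbound, rbound) are all assumptions made for the example, and child nodes are held as object references rather than as the LNode/RNode midpoint values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lbound: int                      # LBound: start of the node's time range
    rbound: int                      # RBound: end of the range (exclusive, an assumption)
    left: Optional["Node"] = None    # left child (stands in for LNode)
    right: Optional["Node"] = None   # right child (stands in for RNode)
    data: Optional[float] = None     # Data: summary value, empty after construction


def build(lbound: int, rbound: int, leaf_bound: int) -> Node:
    """Recursively create a node and its children until leaf granularity is reached."""
    node = Node(lbound, rbound)
    if rbound - lbound > leaf_bound:
        mid = (lbound + rbound) // 2
        node.left = build(lbound, mid, leaf_bound)
        node.right = build(mid, rbound, leaf_bound)
    return node


def height(node: Optional[Node]) -> int:
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def leaves(node: Node) -> int:
    if node.left is None:
        return 1
    return leaves(node.left) + leaves(node.right)


# One tree covering 1536 minutes with 6-minute leaves (hypothetical units).
root = build(0, 1536, 6)
print(height(root), leaves(root))
```

With these inputs the tree has 256 leaves, so its height matches the 9-level example given for a one-day tree.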

The root node of each tree holds the aggregation result of the time series data over the tree's time length t, the second-level nodes hold aggregation results over a time length of t/2, and so on: each level's nodes hold index data for half the time length of the level above. Fixing the time granularity of a time unit tree by controlling its height makes it easy to group all nodes of one tree under the same hash code, which balances hotspot load.

A time unit tree represents the aggregation result for one unit of time granularity, and the leaf nodes of each tree hold the aggregate summary information at the finest granularity. The granularity can be adjusted to actual needs.

The time covered by a time unit tree (TreeBound) is calculated as:

TreeBound = 2^(TreeMaxLevel - 1) * LeafBound

where TreeMaxLevel is the maximum tree height and LeafBound is the time range represented by a leaf node. For example, if a time unit tree is to cover one day, the tree height is set to 9 levels and each leaf node holds 6 minutes of summary information. One tree then covers 2^8 * 6 = 1536 minutes of aggregated data, which is enough for one time unit tree to cover a full day (1440 minutes).
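The TreeBound formula can be checked with a few lines of Python. The function name tree_bound and the choice of minutes as the unit are assumptions for illustration only.

```python
def tree_bound(tree_max_level: int, leaf_bound: int) -> int:
    """TreeBound = 2^(TreeMaxLevel - 1) * LeafBound."""
    return 2 ** (tree_max_level - 1) * leaf_bound


minutes = tree_bound(9, 6)           # 9 levels, 6-minute leaves
covers_one_day = minutes >= 24 * 60  # a day is 1440 minutes
print(minutes, covers_one_day)       # prints: 1536 True
```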

In step (2), the hash code of each time unit tree can be computed in many ways. Preferably, the hash code (Hash) of a time unit tree is generated by applying md5 to the information of the tree, and is written to the tree hash table. The hash code is computed as:

Hash = md5(tree info + tree low bound)

tree info is a brief identifier of the data; for example, data representing Au metal futures is labeled Au. tree low bound is the start time point of the time unit tree.
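A minimal sketch of the hash code computation, using Python's standard hashlib module. The text does not say how tree info and tree low bound are combined or encoded, so plain string concatenation, UTF-8 encoding, an integer start time, and the function name tree_hash are all assumptions here.

```python
import hashlib


def tree_hash(tree_info: str, tree_low_bound: int) -> str:
    """Hash = md5(tree info + tree low bound), as a 32-character hex string."""
    key = f"{tree_info}{tree_low_bound}"  # concatenation scheme is an assumption
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# Two consecutive Au trees (start times in hypothetical minute units).
first = tree_hash("Au", 0)
second = tree_hash("Au", 1536)
print(first)
print(second)
```

The same inputs always give the same code, so the hash can be recomputed from a data point's identifier and the start time of its tree.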

Multiple time unit trees form the hash summary forest. The time range of the data indexed by the whole hash summary forest is the sum of the time ranges of its constituent time unit trees.

The Hbase-based hash summary forest consists of two Hbase tables, tree-hash and tree-node. The tree-hash table is used to look up the hash code of a time unit tree, and the tree-node table stores the tree nodes of all time unit trees. The two tables are stored separately; that is, the hash code of each time unit tree is stored apart from the time series data the tree carries, and the time series data carried by time unit trees with the same hash code is stored together.

In step (3), the time series data may be any time series data, such as financial time series data or futures trading data.

In step (3), time series data is inserted into the hash summary forest as follows:

(3-1) Use the time to which the time series data belongs to look up, in the tree-hash table, the tree hash value of the time unit tree that contains the data;

(3-2) Find the tree-node table corresponding to that tree hash value and recursively insert the time series data into it, as follows:

First, use the tree hash value to find the root node of the containing time unit tree and start the recursion there. Then compare the time point of the time series data with the time of the current node: when the time point of the data is less than the node's time, recursively insert the data into the node's left child; when it is greater than the node's time, recursively insert the data into the node's right child. Continue until the data reaches a leaf node of the time unit tree.
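The recursive insertion can be sketched as follows. This is an in-memory illustration only, with no Hbase access. It assumes the aggregate is a running sum that is updated at every node on the path from the root to the leaf, that the node's comparison time is the midpoint of its range, and that a time point equal to the midpoint goes to the right child; none of these details is fixed by the text. Node, build and insert are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lbound: int
    rbound: int                      # exclusive end (assumption)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    data: float = 0.0                # running sum used as the summary value


def build(lbound: int, rbound: int, leaf_bound: int) -> Node:
    node = Node(lbound, rbound)
    if rbound - lbound > leaf_bound:
        mid = (lbound + rbound) // 2
        node.left = build(lbound, mid, leaf_bound)
        node.right = build(mid, rbound, leaf_bound)
    return node


def insert(node: Node, t: int, value: float) -> None:
    """Update the summary on the way down, stopping at the leaf that contains t."""
    node.data += value
    if node.left is None:            # leaf reached
        return
    mid = (node.lbound + node.rbound) // 2
    insert(node.left if t < mid else node.right, t, value)


root = build(0, 8, 1)
insert(root, 1, 10.0)
insert(root, 6, 5.0)
print(root.data, root.left.data, root.right.data)
```

Updating every node on the path is what lets a later range query read a precomputed aggregate from an internal node instead of visiting all leaves below it.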

In step (4), because the rowkeys of all data in the same time unit tree share the same hash code, an Hbase range query can quickly fetch the whole time unit tree and load it into memory, which greatly reduces the disk I/O incurred when the tree is subsequently traversed recursively.
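The effect of a shared rowkey prefix can be imitated without Hbase by keeping keys in a sorted list, since Hbase also stores rows in rowkey order. The sketch below is not the Hbase client API, and the rowkey layout (hash code, a colon, then a zero-padded node id) is purely an assumption; the text only states that rowkeys of one tree share the hash code.

```python
import bisect
import hashlib


def tree_hash(tree_info: str, tree_low_bound: int) -> str:
    return hashlib.md5(f"{tree_info}{tree_low_bound}".encode("utf-8")).hexdigest()


def row_key(tree_hash_value: str, node_id: int) -> str:
    # Hypothetical layout: shared hash prefix followed by a node identifier.
    return f"{tree_hash_value}:{node_id:06d}"


def prefix_scan(sorted_keys: list, prefix: str) -> list:
    """Return the contiguous run of keys that start with prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]


day0 = tree_hash("Au", 0)
day1 = tree_hash("Au", 1536)
table = sorted(row_key(h, n) for h in (day0, day1) for n in range(3))
one_tree = prefix_scan(table, day0)
print(len(one_tree))
```

Because all keys of one tree sort next to each other, a single range read returns the whole tree, while different trees land at unrelated positions in the key space.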

A data query proceeds as follows:

(a) Determine whether the query time range (t1, t2) falls within a single time unit tree; if so, go to step (b); if not, go to step (c);

(b) The query operation is Query(t1, t2);

(c) The query operations are Query(t1, EndUnitTime(t1)), Query(midUnitTime), and Query(StartUnitTime(t2), t2);

where StartUnitTime(t) is the start time point of the time unit that contains time point t;

EndUnitTime(t) is the end time point of the time unit that contains time point t;

midUnitTime is the time range of the unit trees lying between the time unit trees that contain t1 and t2;

Query(t1, t2) performs the query for the time range (t1, t2) within a single time unit tree;

Query(t1, EndUnitTime(t1)) performs the query in the first unit tree of the query range;

Query(midUnitTime) performs the query in the second through second-to-last unit trees of the query range;

Query(StartUnitTime(t2), t2) performs the query in the last unit tree of the query range.
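The decomposition of a query range into per-tree sub-queries can be sketched as below. It assumes that unit trees are aligned at multiples of the tree span starting from time 0 and that ranges are half-open; split_query, start_unit_time and end_unit_time are hypothetical names that mirror StartUnitTime and EndUnitTime.

```python
def start_unit_time(t: int, tree_bound: int) -> int:
    """Start of the time unit that contains t (alignment at 0 is an assumption)."""
    return (t // tree_bound) * tree_bound


def end_unit_time(t: int, tree_bound: int) -> int:
    """End of the time unit that contains t."""
    return start_unit_time(t, tree_bound) + tree_bound


def split_query(t1: int, t2: int, tree_bound: int) -> list:
    """Return one (start, end) sub-query per time unit tree touched by (t1, t2)."""
    if start_unit_time(t1, tree_bound) == start_unit_time(t2, tree_bound):
        return [(t1, t2)]                               # single tree: Query(t1, t2)
    parts = [(t1, end_unit_time(t1, tree_bound))]       # first tree
    s = end_unit_time(t1, tree_bound)
    while s < start_unit_time(t2, tree_bound):          # whole trees in between
        parts.append((s, s + tree_bound))
        s += tree_bound
    parts.append((start_unit_time(t2, tree_bound), t2)) # last tree
    return [(a, b) for a, b in parts if a < b]          # drop empty ranges


print(split_query(100, 200, 1536))
print(split_query(1000, 5000, 1536))
```

Each middle sub-query spans a whole tree, so it can be answered from that tree's root summary alone.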

Query(t1, t2) proceeds as follows:

(a) From the time of the item being searched, derive the hash code of the time unit tree it belongs to, locate that tree, and take its root node as the current root node;

(b) Start the recursive query from the current root node, with the query task initially (t1, t2);

(c) On reaching the current node, parse out the start time point LBound, the middle time point midTime, and the end time point RBound of the node's time range;

(d) Check whether t1 = LBound and t2 = RBound; if so, record the node's result and exit the recursion; if not, go to step (e);

(e) Check whether t1 ≤ midTime and t2 ≤ midTime; if so, take the node's left child as the current root node and repeat steps (b) to (d); if not, go to step (f);

(f) Check whether t1 ≥ midTime and t2 ≥ midTime; if so, take the node's right child as the current root node and repeat steps (b) to (d); if not, go to step (g);

(g) Check whether time t2 satisfies LBound < t2 < midTime; if so, take the left child as the current node with midTime as t2 and perform steps (b) to (d), and take the right child as the current node with midTime as t1 and perform steps (b) to (d).
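Steps (b) to (g) correspond to a standard segment tree range query, sketched below on an in-memory tree. Several points are assumptions rather than statements from the text: the aggregate is a sum, ranges are half-open, query bounds align with leaf boundaries, and the final case is read as the range straddling midTime, so that it is split into a left part ending at midTime and a right part starting at midTime. Node, build, insert and query are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    lbound: int
    rbound: int                      # exclusive end (assumption)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    data: float = 0.0                # sum over [lbound, rbound)


def build(lo: int, hi: int, leaf: int) -> Node:
    node = Node(lo, hi)
    if hi - lo > leaf:
        mid = (lo + hi) // 2
        node.left, node.right = build(lo, mid, leaf), build(mid, hi, leaf)
    return node


def insert(node: Node, t: int, value: float) -> None:
    node.data += value
    if node.left is not None:
        mid = (node.lbound + node.rbound) // 2
        insert(node.left if t < mid else node.right, t, value)


def query(node: Node, t1: int, t2: int) -> float:
    """Sum over [t1, t2) within one time unit tree."""
    if t1 == node.lbound and t2 == node.rbound:   # step (d): exact match
        return node.data
    if node.left is None:                         # leaf: finest granularity available
        return node.data
    mid = (node.lbound + node.rbound) // 2        # midTime
    if t2 <= mid:                                 # step (e): entirely in the left half
        return query(node.left, t1, t2)
    if t1 >= mid:                                 # step (f): entirely in the right half
        return query(node.right, t1, t2)
    # step (g): range straddles midTime, so split it at midTime
    return query(node.left, t1, mid) + query(node.right, mid, t2)


root = build(0, 8, 1)
for t in range(8):
    insert(root, t, float(t + 1))                 # values 1..8 at times 0..7
print(query(root, 0, 8), query(root, 2, 6), query(root, 3, 4))
```

A query that matches a node's range exactly is answered from that node's stored summary, which is why only a small number of nodes per level is visited.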

By adopting the summary forest tree index scheme, the invention speeds up aggregation queries over time series data; by generating hash codes, it also provides a hash index for the unit trees, which addresses the hotspot problem that arises when time series data is stored in distributed Hbase.

Brief description of the drawings

Fig. 1 is a flowchart of the method of the invention for indexing time series data based on an Hbase hash summary forest;

Fig. 2 is a schematic diagram of the structure of a constructed time unit tree;

Fig. 3 compares write throughput in Embodiment 1 of the invention;

Fig. 4 compares the time taken by aggregation queries over large time ranges in Embodiment 2 of the invention;

Fig. 5 compares the query time of the hash summary forest method and the non-hash summary forest method in Embodiment 3 of the invention.

Detailed description

To describe the invention more specifically, its technical solutions are explained in detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the method for indexing time series data based on an Hbase hash summary forest comprises the following steps:

Step 1, build each time unit tree: first, the time range of the time unit tree is determined in advance; then recursion starts from the root node, creating one new node at each call and recursively creating that node's left and right child nodes; recursion stops when a node to be created falls outside the precomputed range, which completes the construction of the whole tree, and each node is written to the tree node table. The constructed time unit tree is shown in Fig. 2;

In this step, each time unit tree is a segment tree with a fixed time granularity. The time granularity is controlled through the height of each tree. A segment tree node stores the summary information for its range, mainly LBound, RBound, LNode, RNode, and Data. LBound and RBound are the start and end time points of the node's time range; LNode and RNode are the midpoints of the time ranges covered by the node's left and right child nodes, respectively; Data is the summary data value stored in the node. At this stage the Data of every node in the newly built time unit tree is empty.

The root node of each tree holds the aggregation result of the time series data over the tree's time length t, the second-level nodes hold aggregation results over a time length of t/2, and so on: each level's nodes hold index data for half the time length of the level above. Fixing the time granularity of a time unit tree by controlling its height makes it easy to group all nodes of one tree under the same hash code, which balances hotspot load.

A time unit tree represents the aggregation result for one unit of time granularity, and the leaf nodes of each tree hold the aggregate summary information at the finest granularity. The granularity can be adjusted to actual needs.

The time covered by a time unit tree (TreeBound) is calculated as:

TreeBound = 2^(TreeMaxLevel - 1) * LeafBound

where TreeMaxLevel is the maximum tree height and LeafBound is the time range represented by a leaf node.

Step 2, compute the hash code of each time unit tree, and assemble the time unit trees with their hash codes into an Hbase-based hash summary forest;

In this step, the hash code is computed as:

Hash = md5(tree info + tree low bound)

tree info is a brief identifier of the data; for example, data representing Au metal futures is labeled Au. tree low bound is the start time point of the time unit tree. The resulting Hash value is written to the tree hash table;

Step 3, insert the collected time series data into the hash summary forest, as follows:

Step 3-1, use the time to which the time series data belongs to look up, in the tree-hash table, the tree hash value of the time unit tree that contains the data;

Step 3-2, find the tree-node table corresponding to that tree hash value and recursively insert the time series data into it, as follows:

First, use the tree hash value to find the root node of the containing time unit tree and start the recursion there. Then compare the time point of the time series data with the time of the current node: when the time point of the data is less than the node's time, recursively insert the data into the node's left child; when it is greater than the node's time, recursively insert the data into the node's right child. Continue until the data reaches a leaf node of the time unit tree.

Step 4, query and read the stored time series data by time range, as follows:

Step 4-1, determine whether the query time range (t1, t2) falls within a single time unit tree; if so, go to step 4-2; if not, go to step 4-3;

Step 4-2, the query operation is Query(t1, t2);

Step 4-3, the query operations are Query(t1, EndUnitTime(t1)), Query(midUnitTime), and Query(StartUnitTime(t2), t2).

Embodiment 1

The method of the invention, the Opentsdb open-source time series database method, and the raw Hbase method were each used to write the same time series data, and the write throughput of each method was recorded, as shown in Fig. 3. Fig. 3 shows that, because the method of the invention maintains an index (the hash codes) and builds trees, it writes data more slowly than the raw Hbase method, which stores the raw data directly, but faster than the Opentsdb method.

Table 1 shows how Hbase regions split when data is written with the method of the invention.

Table 1 shows that, once an Hbase region split is triggered, the hash summary forest is split into multiple regions by hash value. A time unit tree, whose nodes all share one hash value, stays within a single region. The hash values are randomly distributed, so newly created trees are spread evenly across different regions, which avoids the hotspot problem.

Embodiment 2

The method of the invention, the Opentsdb open-source time series database method, and the raw Hbase method were each used to run aggregation queries over large time ranges. Fig. 4 compares the time the three methods take for long-range aggregation queries; it shows that the method of the invention performs better for aggregation queries over large time ranges. It also shows that the raw Hbase method takes tens of seconds to compute the aggregation result over 200,000 records, which cannot meet the need for fast ad hoc queries. The Opentsdb method is much faster thanks to its caching and indexing mechanisms. The method of the invention is faster still than Opentsdb, and as the query range grows its query time increases more slowly than that of the other schemes. This indicates that the Hbase rowkey design, which lets a range query fetch the whole tree into memory, improves query performance and saves disk overhead.

Embodiment 3

The method of the invention and a non-hash method were used to run aggregation queries on the data. Fig. 5 compares the query time of the hash summary forest method and the non-hash method. In the non-hash scheme the rowkey lacks a hash field, so each recursive step of the segment tree search issues a random query against Hbase, and the nodes of one segment tree may be scattered across different regions. As Fig. 5 shows, the query time of the method of the invention is lower than that of the non-hash method, and the advantage grows as the query range increases.

The specific embodiments above describe the technical solutions and beneficial effects of the invention in detail. It should be understood that they are only the most preferred embodiments of the invention and are not intended to limit it; any modification, supplement, or equivalent replacement made within the scope of the principles of the invention shall fall within its scope of protection.

Claims (4)

1. A method for indexing time series data based on an Hbase hash summary forest, comprising the following steps:
(1) building each time unit tree, with a fixed time granularity, according to the tree height and the time range covered by a leaf node;
(2) computing the hash code of each time unit tree, and assembling the time unit trees with their hash codes into an Hbase-based hash summary forest;
(3) inserting the collected time series data into the hash summary forest according to the hash code;
(4) querying and reading the stored time series data by time range;
wherein each time unit tree is a segment tree, and a segment tree node stores the summary information for the node's range, comprising LBound, RBound, LNode, RNode, and Data; LBound and RBound are the start and end time points of the node's time range; LNode and RNode are the midpoints of the time ranges covered by the node's left and right child nodes, respectively; Data is the summary data value stored in the node, and the Data of every node of the time unit tree is empty when the tree is built;
the hash code Hash of each time unit tree is computed as:
Hash = md5(tree info + tree low bound)
where tree info is a brief identifier of the data, tree low bound is the start time point of the time unit tree, and md5 is a transcoding method;
the Hbase-based hash summary forest consists of two Hbase tables, tree-hash and tree-node, wherein the tree-hash table stores the hash codes of all time unit trees, each tree-node table holds all leaf nodes of the corresponding time unit tree, the tree-hash table and the tree-node table are stored separately, and the time series data carried by time unit trees with the same hash code is stored together.
2. The method for indexing time series data based on an Hbase hash summary forest according to claim 1, characterized in that the detailed process of inserting time series data into the hash summary forest is:
(3-1) using the time to which the time series data belongs, looking up in the tree-hash table the hash code of the time unit tree that contains the data;
(3-2) finding the tree-node table corresponding to the hash code and recursively inserting the time series data into that tree-node table, specifically:
starting the recursion at the root node of the time unit tree located by the hash code, and comparing the time point of the time series data with the time range of the current node; when the time point of the time series data is less than the middle time point of the node's time range, recursively inserting the time series data into the Data of the node's left child; when the time point of the time series data is greater than the middle time point of the node's time range, recursively inserting the time series data into the Data of the node's right child; and continuing until a leaf node of the time unit tree is reached.
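The recursive insertion of step (3-2) can be sketched as below. The sketch rests on several assumptions that the claim leaves open: the summary kept in Data is a running sum, every node on the path from the root to the leaf is updated, a time point equal to the middle time point goes to the right child, and ranges are half-open. `Node`, `build` and `insert` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    l_bound: int
    r_bound: int
    data: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(l_bound: int, r_bound: int, leaf_span: int) -> Node:
    """Build an empty time unit tree over [l_bound, r_bound)."""
    node = Node(l_bound, r_bound)
    if r_bound - l_bound > leaf_span:
        mid = (l_bound + r_bound) // 2
        node.left = build(l_bound, mid, leaf_span)
        node.right = build(mid, r_bound, leaf_span)
    return node


def insert(node: Node, t: int, value: float) -> None:
    """Fold value into the summary of every node on the path to t's leaf."""
    # Assumption: the summary is a sum; the claim only says "summary data value".
    node.data = value if node.data is None else node.data + value
    if node.left is None:            # leaf reached, recursion ends
        return
    mid = (node.l_bound + node.r_bound) // 2
    # Points before the middle time point go left; the rest go right.
    insert(node.left if t < mid else node.right, t, value)
```

Because each insert touches one node per level, its cost grows with the tree height and not with the amount of data already stored.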
3. The method for indexing time series data based on an Hbase hash summary forest according to claim 1, characterized in that the data query process is:
(a) determining whether the query time range (t1, t2) falls within a single time unit tree; if so, executing step (b); if not, executing step (c);
(b) executing the query operation Query(t1, t2);
(c) executing the query operations Query(t1, EndUnitTime(t1)), Query(midUnitTime) and Query(StartUnitTime(t2), t2);
wherein StartUnitTime(t2) is the start time point of the time range covered by the time unit tree that contains time point t2;
EndUnitTime(t1) is the end time point of the time range covered by the time unit tree that contains time point t1;
midUnitTime is the time range lying between the time ranges of the time unit trees that contain t1 and t2;
Query(t1, t2) denotes executing the query operation in the single time unit tree that contains the query range t1~t2;
Query(t1, EndUnitTime(t1)) denotes executing the query operation over the range t1~EndUnitTime(t1) in the first unit tree;
Query(midUnitTime) denotes executing the query operation over the range midUnitTime in the second through the second-to-last unit trees;
Query(StartUnitTime(t2), t2) denotes executing the query operation over the range StartUnitTime(t2)~t2 in the last unit tree.
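The range splitting of steps (a) to (c) can be illustrated with a short Python sketch. It assumes that every time unit tree covers the same span `unit_span`, that unit trees are aligned to multiples of that span starting at 0, that ranges are half-open, and that t1 < t2. The function names mirror the claim's StartUnitTime and EndUnitTime, but their signatures and `split_query` itself are hypothetical.

```python
from typing import List, Tuple


def start_unit_time(t: int, unit_span: int) -> int:
    """Start time point of the time unit tree that contains t."""
    return (t // unit_span) * unit_span


def end_unit_time(t: int, unit_span: int) -> int:
    """End time point of the time unit tree that contains t."""
    return start_unit_time(t, unit_span) + unit_span


def split_query(t1: int, t2: int, unit_span: int) -> List[Tuple[int, int]]:
    """Split (t1, t2) into the sub-ranges queried in steps (b) and (c)."""
    first_end = end_unit_time(t1, unit_span)
    if t2 <= first_end:                       # step (b): a single unit tree
        return [(t1, t2)]
    parts = [(t1, first_end)]                 # Query(t1, EndUnitTime(t1))
    last_start = start_unit_time(t2, unit_span)
    if first_end < last_start:                # Query(midUnitTime)
        parts.append((first_end, last_start))
    if last_start < t2:                       # Query(StartUnitTime(t2), t2)
        parts.append((last_start, t2))
    return parts
```

With `unit_span = 480`, the range (100, 1500) splits into (100, 480), (480, 1440) and (1440, 1500). The middle part spans whole unit trees, so a caller could read each root summary directly without descending.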
4. The method for indexing time series data based on an Hbase hash summary forest according to claim 3, characterized in that the detailed process of Query(t1, t2) is:
(a) computing, from the time of the search item, the hash code of the time unit tree it belongs to, locating that time unit tree, and taking the root node of that time unit tree as the current root node;
(b) starting a recursive query from the current root node, the query task starting as (t1, t2);
(c) when the recursion reaches the current node, parsing the start time point LBound, the middle time point midTime and the end time point RBound of the time range covered by the node;
(d) determining whether t1 and t2 satisfy t1 = LBound and t2 = RBound; if so, recording the result of the node and exiting the recursion; if not, executing step (e);
(e) determining whether t1 and t2 satisfy t1 ≤ midTime and t2 ≤ midTime; if so, taking the left child of the node as the current root node and executing steps (b) to (d); if not, executing step (f);
(f) determining whether t1 and t2 satisfy t1 ≥ midTime and t2 ≥ midTime; if so, taking the right child of the node as the current root node and executing steps (b) to (d); if not, executing step (g);
(g) determining whether time t2 satisfies LBound < t2 < midTime; if so, taking the left child as the current node, using midTime as t2, and executing steps (b) to (d); and taking the right child as the current node, using midTime as t1, and executing steps (b) to (d).
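The recursion in steps (b) to (g) follows the usual range query on a segment tree and can be sketched as follows. This is an illustration under assumptions: the summary is a sum, ranges are half-open, t1 < t2, a query that reaches a leaf returns the whole leaf summary (so resolution is limited to the leaf granularity), and the final step splits the range at midTime whenever it straddles the midpoint. `Node`, `build` and `query` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    l_bound: int
    r_bound: int
    data: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(l_bound: int, r_bound: int, leaf_span: int,
          leaf_values: Dict[int, float]) -> Node:
    """Build a populated tree; leaf_values maps a leaf's start time to its sum."""
    node = Node(l_bound, r_bound)
    if r_bound - l_bound <= leaf_span:
        node.data = leaf_values.get(l_bound, 0.0)
    else:
        mid = (l_bound + r_bound) // 2
        node.left = build(l_bound, mid, leaf_span, leaf_values)
        node.right = build(mid, r_bound, leaf_span, leaf_values)
        node.data = node.left.data + node.right.data
    return node


def query(node: Node, t1: int, t2: int) -> float:
    """Return the summed summary over [t1, t2) inside one time unit tree."""
    # Step (d): exact match, or a leaf that cannot be split further (assumption).
    if (t1 == node.l_bound and t2 == node.r_bound) or node.left is None:
        return node.data
    mid = (node.l_bound + node.r_bound) // 2
    if t2 <= mid:                    # step (e): range lies in the left half
        return query(node.left, t1, t2)
    if t1 >= mid:                    # step (f): range lies in the right half
        return query(node.right, t1, t2)
    # Step (g): the range straddles midTime, so split it at midTime.
    return query(node.left, t1, mid) + query(node.right, mid, t2)
```

Each call either returns a stored summary or narrows the range, so a query touches a number of nodes proportional to the tree height instead of scanning every stored point.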

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710154614.9A CN106991137B (en) 2017-03-15 2017-03-15 A method for indexing time series data based on Hbase hash summary forest

Publications (2)

Publication Number Publication Date
CN106991137A CN106991137A (en) 2017-07-28
CN106991137B true CN106991137B (en) 2019-10-18

Family

ID=59413068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710154614.9A Active CN106991137B (en) 2017-03-15 2017-03-15 A method for indexing time series data based on Hbase hash summary forest

Country Status (1)

Country Link
CN (1) CN106991137B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535712B (en) * 2021-06-04 2023-09-29 山东大学 Method and system for supporting large-scale time sequence data interaction based on line segment KD tree

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202384A (en) * 2016-07-08 2016-12-07 清华大学 An indexing method supporting time series data aggregate functions

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101207510B1 (en) * 2008-12-18 2012-12-03 한국전자통신연구원 Cluster Data Management System And Method for Data Restoring Using Shared Read-Only Log in Cluster Data Management System

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202384A (en) * 2016-07-08 2016-12-07 清华大学 An indexing method supporting time series data aggregate functions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Index supporting aggregate functions for time series data; Huang Xiangdong et al.; Tsinghua University (Natural Science); 2016-03-31; pp. 229-236, 245 *

Also Published As

Publication number Publication date
CN106991137A (en) 2017-07-28

Similar Documents

Publication Publication Date Title
US8700605B1 (en) Estimating rows returned by recursive queries using fanout
US9244974B2 (en) Optimization of database queries including grouped aggregation functions
CN103366015B (en) A kind of OLAP data based on Hadoop stores and querying method
CN102521334B (en) Data storage and query method based on classification characteristics and balanced binary tree
US7941426B2 (en) Optimizing database queries
US9189047B2 (en) Organizing databases for energy efficiency
US8682875B2 (en) Database statistics for optimization of database queries containing user-defined functions
CN105518674B (en) Optimize the mechanism that the parallel query on asymmetric resource executes
CN103577440A (en) Data processing method and device in non-relational database
US20080222129A1 (en) Inheritance of attribute values in relational database queries
US20090077054A1 (en) Cardinality Statistic for Optimizing Database Queries with Aggregation Functions
US20100036805A1 (en) System Maintainable and Reusable I/O Value Caches
US8312007B2 (en) Generating database query plans
CN104361113A (en) OLAP (On-Line Analytical Processing) query optimization method in memory and flash memory hybrid storage mode
US20100306212A1 (en) Fetching Optimization in Multi-way Pipelined Database Joins
CN107491487A (en) A full-text database framework, bitmap index creation and data query method, server and medium
CN105740264A (en) Distributed XML database sorting method and apparatus
CN106611044A (en) SQL optimization method and device
CN116089414A (en) Time sequence database writing performance optimization method and device based on mass data scene
Shanoda et al. JOMR: Multi-join optimizer technique to enhance map-reduce job
US20100036804A1 (en) Maintained and Reusable I/O Value Caches
CN109299143B (en) Knowledge fast indexing method of data interoperation test knowledge base based on Redis cache
US8548980B2 (en) Accelerating queries based on exact knowledge of specific rows satisfying local conditions
CN108829343B (en) A Cache Optimization Method Based on Artificial Intelligence
CN110597805B (en) A method for processing memory index structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant