CN109902088A - A data indexing method for streaming time series data

Info

Publication number: CN109902088A
Application number: CN201910113039.7A
Authority: CN
Inventors: 李建欣; 邰振赢; 李晨; 司靖辉; 韦冠宇
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2019-02-13
Filing date: 2019-02-13
Publication date: 2019-06-18

Abstract

The present invention proposes a data indexing method for streaming time series data, comprising the following steps: step 1, establishing an overall data index structure, the structure being a B* tree obtained by adding, to the non-root, non-leaf nodes of a B+ tree, pointers to their siblings; step 2, batch resource allocation and index optimization; step 3, trigger-based splitting and merging of the tree structure.

Description

A data indexing method for streaming time series data

Technical field

The present invention relates to data indexing methods, and in particular to a data indexing method for streaming time series data.

Background art

With the rapid development and wide adoption of frontier Internet technologies such as virtualization and cloud computing, large numbers of portable mobile terminals and sensor devices have been deployed in every corner of the world, where they serve to search for and collect information. In recent years in particular, the "Internet of Things" concept, which advocates the interconnection of all things, has been proposed, and mobile sensor networks interconnected through real-time networks have been widely applied in fields such as the military, the economy, and medicine, with considerable practical value. The acquisition and processing of massive streaming time series data, on the one hand, enlarges the data scale, which helps in monitoring and detecting hidden features of the data and in mining the intrinsic laws of the data stream; on the other hand, low-level data functions such as the retrieval, storage, and management of time series data face very severe challenges. For example, once the more than 400,000 taxis in New York, USA, are fitted with GPS and related sensor devices, they can generate more than 100 million real-time vehicle driving records per minute. Traditional relational databases have difficulty reading and writing data of this magnitude in real time, and current distributed databases also need large amounts of computing and storage resources to meet such data processing demands.

To solve such problems, many database systems for time series data have been proposed. Among them, time series databases represented by InfluxDB and OpenTSDB have been widely adopted and have alleviated the read/write problem of time series data to some extent. To improve their efficiency, such systems have generally applied targeted optimizations to their data index modules, mainly because the main pressure of reading and writing data falls on the data index. Unlike the traditional B+ tree index, the existing index structure best suited to managing time series data is the LSM (log-structured merge) tree. It places data indexes in memory and on hard disk respectively according to the order of write time, so as to serve read/write requests of different frequencies, and then merges the two tree structures according to a set threshold, thereby dividing and conquering the data index and making full use of the widely differing read/write performance that different storage media provide.

However, this data storage structure does not make full use of the characteristics of time series data itself. Specifically, the sampled timestamps in time series data are similar and continuous, which means that the data contain many redundant parts and follow a strict ordering; as a direct result, write/insert operations on the data index occur only at the bottom-right nodes of the tree structure. Moreover, by observing database operation logs, it has been found that 97% of time series data operations are writes, while the remainder consists of a small number of point queries and a relatively large number of range queries, which requires the index structure to satisfy range queries. To meet real-world demands for reading and writing real-time streaming time series data, a purposely designed and optimized data index structure is needed that is fully adapted to the inherent characteristics of time series data.

Summary of the invention

In view of the above problems, and in order to better meet the read/write demands of time series data, the present invention proposes a data indexing method for streaming time series data that is fully adapted to the inherent characteristics of time series data. The scheme first replaces the tree structure used in the index structure, then designs the way the index uses storage resources according to the characteristics of time series data, and additionally proposes a trigger-based method for splitting and merging the index. By using a B* tree and a linked list of leaf nodes, the invention supports range retrieval and continuous data reads well; the space spent in advance by the resource allocation mode, the index acceleration components, and the like reduces the movement of the hard disk arm and substantially improves read/write efficiency; the trigger-based index splitting and merging method dynamically coordinates the proportion of the index held in memory and on hard disk according to how hot or cold the data are, allocates resources of different performance reasonably, and ensures data consistency and availability; the overall data index for streaming time series data closely matches the read/write characteristics of time series data and improves data write and retrieval speed.

Brief description of the drawings

Fig. 1 is the overall flowchart of the present invention;

Fig. 2 shows the data index structure for streaming time series data of the present invention;

Fig. 3 is a diagram of tree structure splitting and merging in the present invention;

Fig. 4 is a structural diagram of the Bloom filter of the present invention;

Fig. 5 is a diagram of trigger-based index splitting in the present invention.

Detailed description of the embodiments

In order to make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

To achieve the above object, the present invention provides a data indexing method for streaming time series data. As shown in Fig. 1, the method comprises the following: step 1, establishing the overall data index structure, which is a B* tree obtained by adding, to the non-root, non-leaf nodes of a B+ tree, pointers to their siblings; step 2, batch resource allocation and index optimization, the index optimization being that the construction of the B* tree for time series data is changed to a batch-append form, and that the data of the non-leaf node layers can also be updated directly in a similar way; step 3, trigger-based splitting and merging of the tree structure, in which strategies for splitting and merging tree structures between memory and hard disk are proposed.

Fig. 2 shows the data index structure for streaming time series data of the present invention. Conventional data storage indexes are generally implemented with a hash structure or a B+ tree, corresponding respectively to application scenarios that are sensitive to point retrieval and to general-purpose data reads and writes. To fit more closely the practical demands of time series data, in which write operations far outnumber read operations and range queries, the present invention uses a B* tree as the basic structure of the data index. The B* tree is similar in structure to the B+ tree and is an application-oriented variant of it. As shown in Figure 1, in the tree that stores the data, all leaf nodes form one complete linked list in ascending timestamp order, and the non-leaf nodes of the tree hold the primary key values of the data they point to. When subsequent data are inserted into the existing index, because time series timestamps increase, the new data can be linked directly onto the existing leaf node list, which is then quickly cut according to the order of the B* tree to generate non-leaf nodes, achieving fast insertion. Because the key of each stored time series record is usually its timestamp, insertion and query operations on such data become regular and predictable. The B* tree as a whole is based on a linked-list structure; when handling this type of data, this structure on the one hand effectively improves the overall utilization of the tree structure and on the other hand accommodates range queries.
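The leaf-level behaviour described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `Leaf`, `LeafChain`, `append`, and `range_query` are hypothetical, a flat list of first keys stands in for the non-leaf levels, and timestamps are assumed to be strictly increasing.

```python
from bisect import bisect_right


class Leaf:
    """One leaf node: timestamp keys, values, and a sibling pointer."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.next = None  # next leaf in ascending timestamp order


class LeafChain:
    """Leaf linked list plus a flat routing level (assumed simplification)."""

    def __init__(self, order=4):
        self.order = order          # maximum number of keys per leaf
        self.tail = Leaf()          # rightmost leaf, the only one that grows
        self.leaves = [self.tail]   # stands in for the non-leaf levels
        self.first_keys = []        # first timestamp stored in each leaf

    def append(self, ts, value):
        """Link new data at the end of the chain; no position search is needed."""
        if self.tail.keys and ts <= self.tail.keys[-1]:
            raise ValueError("timestamps must be strictly increasing")
        if len(self.tail.keys) == self.order:
            # The rightmost leaf is full: cut here and start a new leaf.
            new_leaf = Leaf()
            self.tail.next = new_leaf
            self.tail = new_leaf
            self.leaves.append(new_leaf)
        if not self.tail.keys:
            self.first_keys.append(ts)
        self.tail.keys.append(ts)
        self.tail.values.append(value)

    def range_query(self, lo, hi):
        """Return all (timestamp, value) pairs with lo <= timestamp <= hi."""
        if not self.first_keys:
            return []
        start = max(bisect_right(self.first_keys, lo) - 1, 0)
        leaf = self.leaves[start]
        result = []
        while leaf is not None:
            for key, value in zip(leaf.keys, leaf.values):
                if key > hi:
                    return result
                if key >= lo:
                    result.append((key, value))
            leaf = leaf.next  # follow the sibling pointer
        return result
```

Because every leaf except the rightmost one is filled before a new leaf is started, a range query locates its first leaf once and then only follows sibling pointers.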

Because time series data are ordered and monotonically increasing, all data newly inserted into the B* tree appear only at the rightmost leaf node. If the original insertion strategy were used (keeping the two sides balanced on a split), the tree would not be full, the utilization of the whole tree would drop considerably, and a large amount of storage would be wasted. Meanwhile, as data are inserted, the upper-level nodes of the B* tree would be split again and again, making overall index efficiency very low. The present invention changes the construction of the B* tree for time series data to a batch-append form: an insertion no longer searches for the position to insert at; instead, the new batch of data is linked directly to the end of the leaf node list, and the upper-level links and nodes are then completed. The data of the non-leaf node layers can be updated directly in a similar way. A B* tree built by the method of the present invention is a full tree, and the computational complexity is lower than that of the conventional method. Fig. 3 is a schematic diagram of tree structure splitting and merging. When the order of the B* tree is known, inserting in-memory data into the index is greatly simplified. When merging the in-memory tree with the disk tree, the present invention, in accordance with the characteristics of time series data, analyzes the structure of the newest old tree on disk, cuts a suitable number of leaf nodes (together with the corresponding complete subtree) from the new tree in memory, and splices them directly onto the corresponding part of the old tree on disk; the operations involved are the linking of leaf nodes on disk and the updating of the corresponding non-leaf nodes.
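The batch-append construction can be illustrated with a bottom-up build over already sorted timestamps. The sketch below is an assumption-laden simplification: `build_levels` is a hypothetical helper, nodes are plain Python lists, and each parent entry is taken to be the first key of its child.

```python
def build_levels(sorted_keys, order):
    """Build a tree bottom-up from sorted keys.

    Returns a list of levels, leaves first and root last. Each level is a
    list of nodes, and each node is a list of at most `order` keys. Every
    node except possibly the rightmost one on each level is full.
    """
    if order < 2:
        raise ValueError("order must be at least 2")
    if not sorted_keys:
        return []
    # Leaf level: cut the sorted run into full leaves, left to right.
    level = [sorted_keys[i:i + order] for i in range(0, len(sorted_keys), order)]
    levels = [level]
    # Upper levels: group the first key of each child node into parents.
    while len(level) > 1:
        first_keys = [node[0] for node in level]
        level = [first_keys[i:i + order] for i in range(0, len(first_keys), order)]
        levels.append(level)
    return levels
```

For ten keys and order 3 this yields leaves of sizes 3, 3, 3, and 1, so only the rightmost node on each level can be partially filled, which matches the full-tree property claimed for the batch-append form.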

Querying is one of the main services a database provides, and query latency is one of the main indicators of database performance. In time series applications, query operations make up a smaller proportion than write operations, so part of the query capability is sacrificed when designing a data index structure that substantially improves write capability. In fact, however, queries on time series data exhibit strong timeliness, continuity, and periodicity. Timeliness means that the probability of a time series record being queried is inversely proportional to the distance between the time it was generated and the current time, which is also why new data and their index are kept in memory. Continuity is similar to the sequential-access assumption in computer systems: the data surrounding an accessed record are also more likely to be accessed in succession. Periodicity means that some cold data are queried periodically, mainly because many upstream applications have periodic query needs. Queries on time series data therefore differ from queries on general data, and using these query characteristics to optimize data scheduling and caching strategies can effectively raise the cache hit rate, reduce disk operations, and improve overall query performance.

When a query misses in memory, the data must be looked up on disk. Because disk I/O operations are very expensive, a large query cost remains even if the index tree on disk has been split. If it can be determined quickly whether the data exist and which subtree they are in, the overall efficiency of on-disk queries can be greatly improved. Fig. 4 shows the Bloom filters built separately for all subtrees in the data index. Exploiting the monotonically increasing key values of time series data, the present invention adds Bloom filters to the index structure; the filters hash the time series data stored persistently in blocks and use collision detection to decide quickly, at the leaf node layer, whether the requested data are present, avoiding useless and inefficient decompression. The Bloom filter can determine whether data exist after very few I/O operations and computations, reducing the disk I/O and computation that a deep search of the original B* tree would require.
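A per-subtree Bloom filter of the kind described above might look like the following Python sketch. The class `BloomFilter`, the function `candidate_subtrees`, the bit-array size, and the use of SHA-256 to derive hash positions are all illustrative assumptions; the source does not specify them. Note that a Bloom filter never misses a key that was added, but it can report a key that was never added, so a positive answer still has to be confirmed in the subtree.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over timestamp keys (illustrative parameters)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with a seed prefix.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def candidate_subtrees(filters, key):
    """Return the names of subtrees whose filter does not rule out `key`."""
    return [name for name, bloom in filters.items() if bloom.might_contain(key)]
```

Only the subtrees returned by `candidate_subtrees` need to be read from disk; every other subtree is skipped without any I/O.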

Because the hot/cold assumption holds for time series data in real environments, time series data can be stored in memory or on hard disk according to how hot or cold they are. To further reduce the large number of I/O operations that a too-deep index tree causes when querying data on hard disk, the index structure proposed by the present invention likewise uses the hot/cold degree of the data to split the originally large B* tree on hard disk into a forest of several small B* trees.

Fig. 5 is a diagram of trigger-based index splitting. The amount of data stored in each subtree of the forest can be adjusted dynamically according to how hot or cold the data are, so as to optimize overall system performance. In a forest generated by this strategy, colder data tend to be stored in deeper structures, which also matches how people use data. More concretely, the present invention defines data temperature at a finer granularity and, for relatively warm data, sets a strategy under which they can be obtained with fewer hard disk I/O accesses. The depth of each tree is determined by specific requirements, and the space it occupies is requested in full when it is first used. Because accessing data on disk takes more time than accessing memory, the present invention, in order to minimize disk accesses and based on the property that the value of time series data is inversely proportional to the interval between the sampling time and the current time, defines data temperature for time series data as a virtual quantitative indicator of data value. According to differences in data temperature, a trigger mechanism for splitting the in-memory tree is provided to balance the access cost of hot and cold data, the depth limit for data stored on disk is adjusted so that warm data have a matching access cost, and random storage requests are changed into unified batch requests, reducing the extra disk arm movement caused by random access.
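One way to picture the data temperature indicator is as a score that falls off with the age of a sample and is then mapped to a depth limit for the subtree that stores it. The Python sketch below is purely illustrative: the formula `scale / (1 + age)`, the function names, and the tier thresholds are assumptions chosen for the example, since the source states only that data value is inversely proportional to the interval between sampling time and current time.

```python
def temperature(sample_time, now, scale=1.0):
    """Assumed temperature: inversely related to the age of the sample.

    The +1 in the denominator avoids division by zero for brand-new data.
    """
    age = max(now - sample_time, 0.0)
    return scale / (1.0 + age)


def max_depth_for(temp, tiers=((0.1, 2), (0.01, 3)), cold_depth=4):
    """Map a temperature to a tree depth limit: colder data, deeper tree.

    `tiers` holds (minimum temperature, depth limit) pairs, hottest first.
    The thresholds are hypothetical values for illustration only.
    """
    for threshold, depth in tiers:
        if temp >= threshold:
            return depth
    return cold_depth
```

With these example tiers, recently sampled data land in shallow trees that need few disk reads, while old data are allowed into deeper trees.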

The merge operation of a traditional LSM tree mainly merges the tree structure in memory with the tree structure on hard disk, whereas the multi-part design of the present invention has more tree structures that need to be merged. Meanwhile, because time series data are inherently ordered and monotonically increasing, the changes that new data cause in a B* tree always appear at the rightmost leaf node and the corresponding non-leaf nodes. More specifically, the two tree structures to be merged retain this ordered, monotonically increasing property, so the linked lists formed by the leaf nodes of the two trees can be joined head to tail to form the leaf node list of the merged tree. Merging these tree structures therefore does not require re-inserting data or rotating the tree as merging original B+ trees does; the merge can be done more efficiently while retaining part of the substructure of the existing trees. In the original LSM tree, however, two trees are merged periodically or when the data held by a tree exceed its performance load limit. Such a simple merge strategy cannot exploit the characteristics of time series data to improve merge efficiency. To merge two trees more efficiently, the present invention formulates a dedicated trigger strategy and merge strategy. The invention binds the timing of merging two trees to the data storage capacity of the new tree and the value of the data it stores; that is, splitting and merging are triggered only when the depth of a tree exceeds what its acceptable search latency allows and the node of the old tree at the same depth as the new tree is not full. In addition, merging two trees all at once would lock a large amount of data and empty all indexed data from memory, which in turn would cause a large number of disk accesses in a short time.
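The trigger condition and the head-to-tail joining of leaf lists can be sketched as follows. This is a simplified Python illustration with hypothetical names (`Leaf`, `Chain`, `should_trigger`, `link_chains`); it models only the leaf level and assumes the depth limit has already been derived from the acceptable search latency.

```python
class Leaf:
    """Leaf node holding sorted timestamp keys and a sibling pointer."""

    def __init__(self, keys):
        self.keys = list(keys)
        self.next = None


class Chain:
    """Leaf linked list of one tree, tracked by its head and tail."""

    def __init__(self, leaves):
        self.head = leaves[0]
        self.tail = leaves[-1]
        for left, right in zip(leaves, leaves[1:]):
            left.next = right


def should_trigger(new_tree_depth, depth_limit, old_node_fill, order):
    """Trigger when the new tree is too deep AND the old node has free slots."""
    return new_tree_depth > depth_limit and old_node_fill < order


def link_chains(old, new):
    """Join two leaf lists head to tail; no re-insertion or rotation needed."""
    if old.tail.keys[-1] >= new.head.keys[0]:
        raise ValueError("new tree must hold strictly later timestamps")
    old.tail.next = new.head
    old.tail = new.tail
    return old


def keys_of(chain):
    """Walk the sibling pointers and collect every key in order."""
    out, leaf = [], chain.head
    while leaf is not None:
        out.extend(leaf.keys)
        leaf = leaf.next
    return out
```

Joining the two lists touches only the old tail and the new head, so the cost of this step does not depend on how much data either tree holds.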

Therefore, the merge strategy is to leave the old tree and part of the new tree unchanged, and to cut and reconstruct from the new tree the largest substructure that can be embedded directly into the old tree. This ensures that only some of the colder data are locked during a merge, improving the availability of system data. Because the merge takes place on disk, the present invention can append updated leaf node and non-leaf node data at the originally reserved data positions.

Because the value of time series data is determined by the distance between the time they were collected and the current time, and because the access costs of memory and disk storage differ greatly, the storage in memory and on disk must be adjusted dynamically to achieve the highest index efficiency. The split operation persists part of the data in memory to disk storage, and the merge operation is the old tree's processing of the tree structure for the newly added batch of information. The split operation is performed when the trigger condition is met (the depth of the tree satisfies structural matching): the right subtree of the in-memory index is retained, and the tree-shaped part on the left is merged with the newest tree on disk (leaf nodes are linked; non-leaf nodes are linked and reorganized). In addition, to further bound the cost of data access, there is not a single tree on disk; instead, tree depth is determined by data temperature, and trees are merged when a threshold is exceeded, using the same method.
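The split step, in which the left part of the in-memory index is persisted and the right part is retained, can be sketched at the leaf level as below. The function `split_for_merge` and its `free_slots` parameter are hypothetical: `free_slots` is assumed to be the number of leaves the newest disk tree can still accept without restructuring, and the rightmost in-memory leaf is assumed to always stay in memory because it is still receiving writes.

```python
def split_for_merge(memory_leaves, free_slots):
    """Split in-memory leaves into (to_disk, kept_in_memory).

    The oldest (leftmost) leaves are moved, limited by the free slots in the
    newest disk tree; the rightmost leaf is never moved.
    """
    movable = len(memory_leaves) - 1
    count = max(min(free_slots, movable), 0)
    return memory_leaves[:count], memory_leaves[count:]


def merge_into_disk(disk_leaves, memory_leaves, free_slots):
    """Append the persisted left part to the disk leaf list (leaf linking)."""
    to_disk, kept = split_for_merge(memory_leaves, free_slots)
    return disk_leaves + to_disk, kept
```

For example, with four in-memory leaves and two free slots, the two oldest leaves are appended to the disk tree and the two newest remain in memory, so only the colder data are involved in the merge.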

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data indexing method for streaming time series data, characterized by comprising the following steps: step 1, establishing an overall data index structure, the structure being a B* tree obtained by adding, to the non-root, non-leaf nodes of a B+ tree, pointers to their siblings; step 2, batch resource allocation and index optimization; step 3, trigger-based splitting and merging of the tree structure.

2. The method according to claim 1, characterized in that, in step 2, the batch resource allocation is performed as follows: the construction of the B* tree for time series data is changed to a batch-append form, all data newly inserted into the B* tree lie at the rightmost leaf node, and when new data are inserted, the new batch of data is linked directly to the end of the leaf node linked list.

3. The method according to claim 2, characterized in that the index optimization consists in building a Bloom filter separately for each of the subtrees in the data index.

4. The method according to claim 3, characterized in that, in step 3, the splitting is performed as follows: based on the differing hot/cold degree of the data, the index structure splits the originally large B* tree on hard disk into a forest composed of several small B* trees, and the amount of data stored in each subtree of the forest can be adjusted dynamically according to the hot/cold degree of the data.

5. The method according to claim 4, characterized in that, in step 3, the trigger condition for the splitting and merging is that the timing of merging two trees is bound to the data storage capacity of the new tree and the value of the data it stores, the binding being that the split and merge operations are triggered only when the depth of a tree exceeds its search latency threshold and the node of the old tree at the same depth as the new tree is not full; wherein the split operation is the operation of persisting part of the data in memory to disk storage, and the merge operation is the old tree's processing of the tree structure for the newly added batch of information; and wherein the split operation is performed when the trigger condition, namely that the depth of the tree satisfies structural matching, is met, the right subtree of the index storing data in memory is retained, and the tree-shaped part on the left is merged with the newest tree on disk, the merging being the linking of leaf nodes and the linking and reorganization of non-leaf nodes.