CN115878684A

CN115878684A - Time sequence data distributed query method and device, electronic equipment and storage medium

Info

Publication number: CN115878684A
Application number: CN202111155015.1A
Authority: CN
Inventors: 王玉华
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2023-03-31

Abstract

The application provides a distributed query method and apparatus for time series data, an electronic device, and a storage medium. The method comprises the following steps: receiving a time series data query request, wherein the request comprises the start time, the end time and the data type of the data to be queried; when the target time range covered by the start time and the end time exceeds a preset unit duration, decomposing the time series data query request according to the start time, the end time, the preset unit duration and the number of available cluster nodes to obtain N1 subtasks; distributing the N1 subtasks to N1 cluster nodes, so that the N1 cluster nodes query the corresponding time series data from a time series database according to the distributed subtasks; and receiving the query results returned by the N1 cluster nodes, and returning the received query results to the initiator device of the time series data query request. The method can implement distributed query of time series data and improve the efficiency of time series data query.

Description

Time sequence data distributed query method and device, electronic equipment and storage medium

Technical Field

The present application relates to data processing technologies, and in particular, to a time series data distributed query method and apparatus, an electronic device, and a storage medium.

Background

Traditional open source time series databases rarely support distributed queries. For example, although the underlying layer of OpenTSDB (a scalable distributed time series database) uses the distributed storage of Hbase, only one node performs the query operation during a query, and distributed query is not implemented, so query efficiency is low.

Disclosure of Invention

In view of this, the present application provides a time series data distributed query method, apparatus, electronic device and storage medium.

Specifically, the method is realized through the following technical scheme:

according to a first aspect of an embodiment of the present application, there is provided a time series data distributed query method, including:

receiving a time sequence data query request, wherein the time sequence data query request comprises the starting time, the ending time and the data type of data to be queried;

when the target time range covered by the starting time and the ending time exceeds a preset unit time length, decomposing the time sequence data query request according to the starting time, the ending time, the preset unit time length and the number of available cluster nodes to obtain N1 subtasks, where 2 ≤ N1 ≤ N2 and N2 is the number of available cluster nodes; different subtasks correspond to different sub time ranges in the target time range, and the union of the sub time ranges corresponding to the subtasks is the target time range;

distributing the N1 subtasks to N1 cluster nodes, so that the N1 cluster nodes query corresponding time sequence data from a time sequence database according to the distributed subtasks;

and receiving the query results returned by the N1 cluster nodes, and returning the received query results to the initiator device of the time sequence data query request.

According to a second aspect of the embodiments of the present application, there is provided a time series data distributed query apparatus, including:

the query unit is used for receiving a time sequence data query request, wherein the time sequence data query request comprises the starting time, the ending time and the data type of data to be queried;

the decomposition unit is used for decomposing the time sequence data query request according to the starting time, the ending time, the preset unit duration and the number of available cluster nodes to obtain N1 subtasks when the target time range covered by the starting time and the ending time exceeds the preset unit duration, where 2 ≤ N1 ≤ N2 and N2 is the number of available cluster nodes; different subtasks correspond to different sub time ranges in the target time range, and the union of the sub time ranges corresponding to the subtasks is the target time range;

the distribution unit is used for distributing the N1 subtasks to N1 cluster nodes, so that the N1 cluster nodes inquire corresponding time sequence data from a time sequence database according to the distributed subtasks;

and the query unit is further configured to receive query results returned by the N1 cluster nodes, and return the received query results to the initiator device of the time series data query request.

According to the distributed query method for time series data, when it is determined that the target time range covered by the start time and the end time included in the time series data query request exceeds the preset unit duration, the time series data query request is decomposed into a plurality of subtasks according to the start time, the end time, the preset unit duration and the number of available cluster nodes. The subtasks are distributed to different cluster nodes for execution, and each cluster node queries time series data according to its assigned subtask and returns the query result. Distributed query of time series data is thereby achieved, and the efficiency of time series data query is improved.

Drawings

FIG. 1 is a flow chart diagram illustrating a distributed query method for time series data according to an exemplary embodiment of the present application;

FIG. 2 is a flow diagram illustrating the decomposition of a time series data query request according to an exemplary embodiment of the present application;

FIG. 3 is a schematic diagram illustrating a time series data distributed query architecture according to an exemplary embodiment of the present application;

FIG. 4 is a process flow diagram of a proxy node according to an exemplary embodiment of the present application;

FIG. 5 is a schematic structural diagram of a time series data distributed query apparatus according to an exemplary embodiment of the present application;

FIG. 6 is a schematic diagram illustrating a hardware structure of an electronic device according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

In order to make those skilled in the art better understand the technical solutions provided by the embodiments of the present application, a brief description will be given below of some technical terms related to the embodiments of the present application.

1. TSDB: Time Series Database.

2. Time series data: a series of metric monitoring data generated continuously at a stable frequency, for example, the temperature and power values collected by a sensor every minute.

3. Tag: attribute information describing the monitored object, such as the date of manufacture, manufacturer, and model of a sensor, which usually does not change over time. A tag includes a tag key (TagKey) and a tag value (TagValue).

4. Metric: a monitored indicator of the data, such as temperature or power.

5. Timestamp: the point in time at which a metric value is generated.

6. Data point: a timestamp together with a metric value; that is, each value collected at a specific time for a given metric of a monitored object is a data point.

7. Single-value model: one monitoring record corresponds to the data of only one metric. Taking the data generated by a wind power source as an example, single-value model data is shown in Table 1:

TABLE 1 Single value model example

metric	timestamp	Manufacturer	Wind farm	Model	value
Power	2019-01-01T00:00:10Z	Vestas	Tuoli	7AD45EC	1800
Wind speed	2019-01-01T00:00:10Z	Vestas	Tuoli	7AD45EC	11.24

8. Multi-value model: each row of data is one monitoring record, and each record can carry the values of several monitoring metrics. Multi-value model data is shown in Table 2:

TABLE 2 multivalued model example

timestamp	Manufacturer	Wind farm	Model	Power	Wind speed
2019-01-01T00:00:10Z	Vestas	Tuoli	7AD45EC	1800	11.24
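The relationship between Tables 1 and 2 can be illustrated with a small sketch that pivots single-value records into multi-value records. The function name `to_multi_value` and the dictionary field names are hypothetical and chosen only for illustration; the source does not prescribe any conversion routine.

```python
from collections import defaultdict

def to_multi_value(single_value_rows):
    """Pivot single-value records into multi-value records.

    Each input row is a dict with 'metric', 'timestamp', 'value' and any
    number of tag keys. Rows sharing the same timestamp and tags are merged
    into one record whose extra keys are the metric names.
    """
    merged = defaultdict(dict)
    for row in single_value_rows:
        tags = tuple(sorted((k, v) for k, v in row.items()
                            if k not in ("metric", "timestamp", "value")))
        merged[(row["timestamp"], tags)][row["metric"]] = row["value"]

    records = []
    for (timestamp, tags), fields in merged.items():
        record = {"timestamp": timestamp}
        record.update(dict(tags))   # tag columns
        record.update(fields)       # one column per metric
        records.append(record)
    return records

rows = [
    {"metric": "power", "timestamp": "2019-01-01T00:00:10Z",
     "manufacturer": "Vestas", "wind_farm": "Tuoli", "model": "7AD45EC",
     "value": 1800},
    {"metric": "wind_speed", "timestamp": "2019-01-01T00:00:10Z",
     "manufacturer": "Vestas", "wind_farm": "Tuoli", "model": "7AD45EC",
     "value": 11.24},
]
multi = to_multi_value(rows)
```

Here the two single-value rows of Table 1 collapse into the single multi-value row of Table 2.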

9. OpenTSDB: a distributed, scalable time-series database based on Hbase.

10. Rowkey (row key): the primary key in Hbase, which uniquely identifies a row of records.

11. Column family: every column in an Hbase table belongs to a column family. Column families, rather than columns, are part of the table schema and must be defined before the table is used. Column names are prefixed with the column family name; for example, info:power and info:speed both belong to the column family info.

12. UID: Unique Identifier, a 3-byte integer that uniquely identifies a field.

13. Unix timestamp: the number of seconds elapsed since midnight (UTC/GMT) on January 1, 1970, not counting leap seconds.

14. Region: a partition in Hbase.

15. TSD: Time Series Daemon.

In order to make the aforementioned objects, features and advantages of the embodiments of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings.

Referring to FIG. 1, which is a schematic flow chart of a distributed query method for time series data according to an embodiment of the present application, the distributed query method for time series data may include the following steps.

It should be noted that the execution subject of steps S100 to S130 may be a proxy node. The proxy node may receive the time series data query request, split and allocate the subtasks, and return the query result in the manner described in steps S100 to S130.

Step S100, a time sequence data query request is received, wherein the time sequence data query request comprises the starting time, the ending time and the data type of data to be queried.

Step S110, when the target time range covered by the start time and the end time exceeds the preset unit duration, decomposing the time sequence data query request according to the start time, the end time, the preset unit duration and the number of available cluster nodes to obtain N1 subtasks, where 2 ≤ N1 ≤ N2 and N2 is the number of available cluster nodes; the different subtasks correspond to different sub time ranges in the target time range, and the union of the sub time ranges corresponding to the subtasks is the target time range.

In the embodiment of the application, in order to implement distributed query of time series data and improve its query efficiency, when a time series data query request is received and the time range covered by the start time and the end time carried in the request (referred to herein as the target time range) exceeds a preset unit duration (which may be set according to actual requirements, such as 1 hour), the received time series data query request may be decomposed according to the start time, the end time, the preset unit duration, and the number of available cluster nodes, so as to obtain at least two subtasks (their number is denoted herein as N1).

Illustratively, 2 ≦ N1 ≦ N2, i.e., the number of subtasks does not exceed the number of available cluster nodes.

Illustratively, different subtasks correspond to different sub-time ranges in the target time range, and the union of the sub-time ranges corresponding to the respective subtasks is the target time range.

For example, assuming that the starting time in the time series data query request is 3.

It should be noted that, when the target time range covered by the start time and the end time in the time series data query request does not exceed the preset unit duration, the time series data query request may be treated as a small-data-volume query task. In this case, the request need not be split into subtasks; it may instead be allocated to a single available cluster node, which performs the time series data query according to the request.

Step S120, allocating the N1 subtasks to the N1 cluster nodes, so that the N1 cluster nodes query the corresponding time sequence data from the time sequence database according to the allocated subtasks.

In this embodiment of the application, when the time series data query request is split into N1 subtasks according to the above manner, the N1 subtasks may be allocated to N1 cluster nodes of N2 available cluster nodes, and the N1 cluster nodes query corresponding time series data from the time series database according to the split subtasks.

Step S130, receiving query results returned by the N1 cluster nodes, and returning the received query results to the initiator device of the time series data query request.

In the embodiment of the application, when the cluster node receives the subtasks distributed by the agent node, the cluster node can query the corresponding time sequence data from the time sequence database according to the distributed subtasks, and return the query result to the agent node.

The proxy node may return the query result returned by the N1 cluster nodes to an initiator device, such as a client device, of the time-series data query request.

It can be seen that, in the method flow shown in fig. 1, when it is determined that the target time range covered by the start time and the end time included in the time series data query request exceeds the preset unit duration, the time series data query request is decomposed into a plurality of subtasks according to the start time, the end time, the preset unit duration, and the number of available cluster nodes, the plurality of subtasks are respectively allocated to different cluster nodes for execution, the corresponding cluster nodes perform time series data query according to the allocated subtasks, and a query result is returned, thereby implementing distributed time series data query and improving the efficiency of time series data query.
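The proxy-side flow of steps S100 to S130 can be sketched as follows. This is only an illustration under stated assumptions: the names `handle_query`, `query_node` and `split_in_half` are hypothetical, sub time ranges are treated as half-open intervals, cluster nodes are simulated by threads reading an in-memory list, and the splitting rule used in the demo is a placeholder rather than the decomposition of step S110.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_query(start, end, unit, nodes, split, query_node):
    """Proxy-side flow: decide, decompose, dispatch, collect.

    split(start, end, unit, node_count) -> list of (sub_start, sub_end)
    query_node(node, sub_start, sub_end) -> list of (timestamp, value)
    """
    if end - start <= unit:
        # Small query: hand the whole request to one available node.
        return query_node(nodes[0], start, end)

    sub_ranges = split(start, end, unit, len(nodes))
    with ThreadPoolExecutor(max_workers=len(sub_ranges)) as pool:
        # One subtask per node; zip() pairs each sub-range with its own node.
        futures = [pool.submit(query_node, node, s, e)
                   for node, (s, e) in zip(nodes, sub_ranges)]
        merged = []
        for future in futures:  # collect in sub-range order
            merged.extend(future.result())
    return merged

# Demo with an in-memory stand-in for the time series database.
POINTS = [(t, t // 600) for t in range(0, 5 * 3600, 600)]

def query_node(node, sub_start, sub_end):
    return [p for p in POINTS if sub_start <= p[0] < sub_end]

def split_in_half(start, end, unit, node_count):
    # Placeholder rule for the demo only.
    middle = (start + end) // 2
    return [(start, middle), (middle, end)]

NODES = ["node-1", "node-2", "node-3"]
result = handle_query(0, 5 * 3600, 3600, NODES, split_in_half, query_node)
```

In the demo, N1 = 2 subtasks are served by two of the N2 = 3 available nodes, and the merged result equals what a single-node query over the whole range would return.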

In some embodiments, the time series data in the time series database is stored in a multi-value model distributed storage mode based on Hbase;

a plurality of pre-partitions are created in the time sequence database, and different time sequence data uniformly cover the pre-partitions according to data types and data sources; time sequence data of the same data type of the same data source are stored in the same pre-partition;

a multi-value model time sequence data table is stored in the time sequence database, and the rowkey of the multi-value model time sequence data table comprises the data type of the time sequence data, the unique identification of a data source and the identification of a pre-partition where the time sequence data is located.

Illustratively, the time-series database stores the time-series data in a multi-value model distributed storage mode based on Hbase.

In order to fully exploit the characteristics of the Hbase distributed database and reduce the short-term system unavailability caused by the splitting of Hbase partitions, a plurality of pre-partitions can be created in the time series database.

For example, when the time series data is stored in the time series database, different time series data can uniformly cover each pre-partition according to the data type and the data source.

For example, a hash value may be computed from the data type of the time series data and the unique identifier of the data source, and this hash value may be taken modulo the number of pre-partitions in the time series database. The pre-partition corresponding to the time series data is determined from the result, and the time series data is stored in that pre-partition.

For example, the same type of time series data of the same data source can be stored in the same pre-partition.

It should be noted that, in the embodiment of the present application, storing the same type of time series data of the same data source in the same pre-partition presupposes that the amount of data stored in that pre-partition has not reached its capacity; when the amount of data stored in a pre-partition reaches its capacity, a new pre-partition may be used to store the corresponding time series data.

For example, to reduce the probability that the amount of data stored in a single pre-partition reaches its capacity, pre-partitions may be created for a specific time dimension according to the storage requirements of the actual scenario. Accordingly, when time series data is stored, time may be divided into a plurality of time ranges matching that time dimension, and time series data in different time ranges is stored in different pre-partitions.

For example, taking a natural month as the specific time dimension, pre-partitions may be created per month. A preset number of pre-partitions may be created for July, and the time series data of July is stored in the pre-partitions created for July according to the above policy; correspondingly, the time series data of August is stored in the pre-partitions created for August according to the same policy.
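A minimal sketch of this pre-partition selection, assuming a natural month as the time dimension, is given below. The function name `pre_partition_for` is hypothetical, and `zlib.crc32` is used only as a stand-in for the unspecified hash function because it is deterministic and non-negative.

```python
import zlib
from datetime import datetime, timezone

def pre_partition_for(data_type, source_id, timestamp, partitions_per_month):
    """Return (month, index) identifying the pre-partition for one record.

    month: the natural month (UTC) that the Unix timestamp falls in.
    index: hash of data type + data source id, modulo the number of
           pre-partitions created for each month.
    """
    month = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")
    digest = zlib.crc32(f"{data_type}{source_id}".encode("utf-8"))
    return month, digest % partitions_per_month

july = pre_partition_for("temperature", "sensor-001", 1625097600, 8)
july_later = pre_partition_for("temperature", "sensor-001", 1625097600 + 86400, 8)
august = pre_partition_for("temperature", "sensor-001", 1627776000, 8)
```

Data of the same type from the same source always maps to the same index, while the month component keeps July and August data in separate groups of pre-partitions.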

For example, the time series data in the time series database may be recorded in the form of a multi-value model; that is, the time series data may be stored in a multi-value model time series data table.

Illustratively, the rowkey of the multi-valued model time series data table comprises a data type of the time series data, a unique identifier of a data source and an identifier of a pre-partition in which the time series data is located. The time sequence data can be recorded to the multi-valued model time sequence data table according to the data type, the data source and the pre-partition where the time sequence data is located.

In an example, the rowkey of the multi-value model time series data table further includes a unit timestamp corresponding to the preset unit duration, and different unit timestamps correspond to different time periods whose duration is the preset unit duration;

the column family of the multi-value model time series data table comprises a plurality of columns for storing time series data, the columns respectively correspond to different sub-periods within the preset unit duration, and the time series data is stored in the corresponding column according to the sub-period to which its timestamp belongs.

For example, in order to further improve the query efficiency of the time series data, the rowkey of the multi-value model time series data table may further include a unit timestamp corresponding to the preset unit duration, with different unit timestamps corresponding to different time periods whose duration is the preset unit duration.

For example, assuming that the preset unit time duration is 1 hour, the time period of the preset unit time duration may be a time period with a time duration of 1 hour, such as 3.

Illustratively, in order to further improve the query efficiency of the time series data, the column family of the multi-value model time series data table includes a plurality of columns for storing the time series data. The columns respectively correspond to different sub-periods within the preset unit duration, and the time series data is stored in the corresponding column according to the sub-period to which its timestamp belongs.

For example, assuming that the preset unit time length is 1 hour, the time length corresponding to the sub-period may be 15 minutes or 20 minutes, etc.

Taking a sub-period of 15 minutes as an example, the column family of the multi-value model time series data table may include 4 columns for storing the time series data.
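The mapping from a timestamp to its unit timestamp and sub-period column can be sketched as follows, assuming Unix timestamps in seconds, a preset unit duration of 1 hour and a sub-period of 15 minutes. The function name `locate_cell` is hypothetical.

```python
def locate_cell(timestamp, unit_seconds=3600, sub_period_seconds=900):
    """Return (unit_timestamp, column_index) for a Unix timestamp.

    unit_timestamp: start of the unit-duration period containing the
                    timestamp (the part that goes into the rowkey).
    column_index:   index of the sub-period within that unit period.
    """
    unit_timestamp = timestamp - timestamp % unit_seconds
    column_index = (timestamp - unit_timestamp) // sub_period_seconds
    return unit_timestamp, column_index

# 03:16:00 on the first day of the epoch falls in the 03:00 unit period,
# second 15-minute column (index 1).
cell = locate_cell(3 * 3600 + 16 * 60)
```

With these parameters the column index is always in the range 0 to 3, matching the 4 columns mentioned above.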

As an example, the column family of the multi-valued model timing data table may further include a tag column for storing a modifiable attribute of the data source;

and a tag index table is also stored in the time sequence database, and the tag index table is used for storing the corresponding relation between tag and rowkey.

Illustratively, to further improve the query efficiency of the time series data, the column family of the multi-valued model time series data table may further include a tag column for storing a modifiable attribute of the data source.

For example, the tag column may store an IP address of the data source, an area identification, a device type or model number, and the like.

In addition, a tag index table may also be stored in the time sequence database, and the tag index table is used to store the corresponding relationship between tag and rowkey.

As an example, the time series data query request may further include a filter condition. The filter condition includes at least one modifiable attribute stored in the multi-value model time series data table and is used by the cluster node to which a subtask is assigned to filter the time series data query.

Illustratively, the time series data query request can further include a filter condition, and the filter condition can include at least one modifiable attribute stored in the multi-value model time series data table.

Correspondingly, when time series data is queried, the tag index table can first be queried according to the filter condition included in the time series data query request to determine the rowkeys that satisfy the condition. The rowkeys corresponding to the time series data to be queried are then selected from these rowkeys, and the corresponding time series data is read from the multi-value model time series data table according to the selected rowkeys.
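The filtered lookup described above can be sketched with plain dictionaries standing in for the tag index table and the data table. All names here (`query_with_filter`, the table layouts, the sample rowkeys) are hypothetical; the source does not specify how the tag index table is organized internally.

```python
def query_with_filter(tag_index, data_table, filters, candidate_rowkeys):
    """Resolve a filter condition to rowkeys, then read the data table.

    tag_index:         {(tag_key, tag_value): set of rowkeys}
    data_table:        {rowkey: stored values}
    filters:           {tag_key: tag_value}, all must match
    candidate_rowkeys: rowkeys covering the requested type and time range
    """
    matched = None
    for tag_key, tag_value in filters.items():
        rowkeys = tag_index.get((tag_key, tag_value), set())
        matched = set(rowkeys) if matched is None else matched & rowkeys
    if matched is None:  # no filter condition supplied
        matched = set(candidate_rowkeys)

    wanted = matched & set(candidate_rowkeys)
    return {rk: data_table[rk] for rk in sorted(wanted) if rk in data_table}

tag_index = {
    ("area", "A"): {"rk1", "rk2"},
    ("model", "X"): {"rk2", "rk3"},
}
data_table = {"rk1": [1], "rk2": [2], "rk3": [3]}
hits = query_with_filter(tag_index, data_table,
                         {"area": "A", "model": "X"}, {"rk1", "rk2", "rk3"})
```

Only `rk2` satisfies both tag conditions, so only that row is read from the data table.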

In one example, the column family of the above-described multivalued model timing data table is used to store value information;

the time series database also stores a schema for storing the key information of each column family.

Illustratively, in order to simplify the multi-valued model time sequence data table, the time sequence data can be stored in a way of separately storing key and value.

Accordingly, the column family of the multi-valued model timing data table can be used to store value information without storing key information.

The key information of the time series data may be stored in the schema, and a specific implementation thereof may be described below with reference to a specific example.

In some embodiments, as shown in FIG. 2, the decomposition of the time series data query request in step S110 according to the start time, the end time, the preset unit duration, and the number of available cluster nodes may be implemented by the following steps:

step S111, determining whether the starting time is the starting time of a target time period and determining whether the ending time is the ending time of the target time period, wherein the target time period is a time period corresponding to a preset unit time length, and the time length of the time period is the preset unit time length;

step S112, when the starting time is not the starting time of the target time period, adjusting the starting time to the starting time of the target time period where the starting time is; and/or when the ending time is not the ending time of the target time period, adjusting the ending time to the ending time of the target time period where the ending time is located;

step S113, decomposing the time series data query request according to the ratio of the time length covered by the current starting time and the current ending time to the preset unit time length and the number of available cluster nodes; and the difference value between the time lengths of the sub-time ranges corresponding to different subtasks does not exceed the preset unit time length.

For example, in order to improve the time-series data query request decomposition efficiency, when the time-series data query request is decomposed, it may be determined whether the start time of the time-series data query request is the start time of a time period (referred to as a target time period herein) corresponding to a preset unit time length, and whether the end time is the end time of the target time period.

For example, assuming that the preset unit duration is 1 hour, the starting time of the time period is an integer, such as 3; the end time of a time segment is the start time of the next time segment.

For example, for a period of 3. For the time period 4.

For another example, assuming that the preset unit duration is a natural day, the starting time of the time period is 0.

When it is determined that the start time of the time series data query request is not the start time of the target time period, the start time may be adjusted to the start time of the target time period at which the start time is located.

For example, assuming that the start time is 3.

Similarly, when it is determined that the end time of the time series data query request is not the end time of the target time period, the end time may be adjusted to the end time of the target time period where the end time is.

For example, assuming that the end time is 7.

When the start time of the time series data query request is determined to be the start time of a target time period and its end time is determined to be the end time of a target time period, or after the start time and/or the end time have been adjusted in the above manner, the time series data query request can be decomposed according to the ratio of the duration covered by the current start time and the current end time to the preset unit duration, and according to the number of available cluster nodes.

For example, still taking the preset unit duration as 1 hour as an example, assuming that the current start time is 3. Assuming that the number of available cluster nodes is 4, the time series data query request can be decomposed in a manner of 2.

If the start time and/or the end time of the time series data query request have been adjusted as described above, they need to be restored to their values before the adjustment when the subtasks are generated.

For example, assuming that the start time is 3.

Similarly, assuming that the end time is 7, and is adjusted to 8 in the above manner, the sub-period corresponding to the 4 th sub-task should actually be 7.
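Steps S111 to S113, together with the restoration of the original boundaries, can be sketched as follows. The function name `decompose`, the use of Unix-style second counts, and in particular the rule of spreading whole unit periods as evenly as possible across subtasks are assumptions made for illustration; the source only requires that the durations of the sub time ranges differ by no more than the preset unit duration. The sample times (03:10 to 07:40) are likewise hypothetical.

```python
def decompose(start, end, unit, node_count):
    """Split [start, end) into at most node_count sub-ranges.

    1. Align start down and end up to unit-period boundaries.
    2. Distribute the whole unit periods as evenly as possible.
    3. Restore the original start and end on the outer sub-ranges.
    """
    aligned_start = start - start % unit
    aligned_end = end if end % unit == 0 else end - end % unit + unit

    units = (aligned_end - aligned_start) // unit
    task_count = min(units, node_count)      # N1 never exceeds N2
    base, extra = divmod(units, task_count)

    sub_ranges = []
    cursor = aligned_start
    for i in range(task_count):
        length = (base + (1 if i < extra else 0)) * unit
        sub_ranges.append([cursor, cursor + length])
        cursor += length

    sub_ranges[0][0] = start   # undo the start adjustment
    sub_ranges[-1][1] = end    # undo the end adjustment
    return [tuple(r) for r in sub_ranges]

HOUR = 3600
# Hypothetical request: 03:10 to 07:40, 1-hour unit, 4 available nodes.
tasks = decompose(3 * HOUR + 600, 7 * HOUR + 2400, HOUR, 4)
```

In this example the aligned range 03:00 to 08:00 covers 5 unit periods, which are distributed over 4 subtasks as 2, 1, 1 and 1 periods before the first and last boundaries are restored.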

In some embodiments, the time series data query request further includes an aggregation function, which is used to instruct the cluster nodes to which the subtasks are allocated to process the queried time series data by using the aggregation function, so as to obtain a query result.

In some embodiments, the time series data query request may further include a down-sampling interval and a down-sampling function, which are used to instruct that the queried time series data be divided according to the down-sampling interval and processed with the down-sampling function.

For example, assuming that the down-sampling interval is 2 hours, the queried time series data may be divided by taking 2 hours as the down-sampling interval. If the queried time series data is from 2 00 to 5, the queried time series data can be divided into a part of from 2.
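A minimal sketch of down-sampling is given below. It assumes that intervals are aligned to multiples of the down-sampling interval counted from the Unix epoch and that the down-sampling function is any callable over a list of values; both are illustrative assumptions, and the name `downsample` is hypothetical.

```python
from collections import defaultdict

def downsample(points, interval, func):
    """Group (timestamp, value) points into fixed intervals and reduce them.

    Returns a list of (interval_start, func(values)) sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp - timestamp % interval].append(value)
    return [(start, func(values)) for start, values in sorted(buckets.items())]

def average(values):
    return sum(values) / len(values)

HOUR = 3600
points = [(2 * HOUR, 1), (3 * HOUR, 3), (4 * HOUR, 5), (5 * HOUR, 7)]
sampled = downsample(points, 2 * HOUR, average)
```

With a 2-hour interval, the four hourly points are reduced to two averaged points.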

In some embodiments, the time series data query request further includes a grouping instruction, which indicates that the time series data query results are to be grouped and returned by group.

Illustratively, grouping may include, but is not limited to, grouping the queried time series data by department (e.g., materials department, production department) or by city.

In some embodiments, the time series data query request further includes a paging instruction, which indicates that the time series data query results are to be returned page by page, so as to improve efficiency when a large volume of data is returned.

For example, the paging instruction may indicate the amount of data in one page when the time series data is returned.

Illustratively, the instruction may also include a page number offset, that is, the page number from which the queried time series data starts to be returned.
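Paged return can be sketched as a simple slice over the query results. The function name `paginate` and the zero-based interpretation of the page number offset are assumptions made for illustration.

```python
def paginate(results, page_size, page_offset=0):
    """Return one page of results.

    page_size:   amount of data in one page.
    page_offset: zero-based page number to return.
    """
    begin = page_offset * page_size
    return results[begin:begin + page_size]

results = list(range(10))
third_page = paginate(results, 3, 2)
```

A page past the end of the results is simply empty, so a client can stop requesting pages once it receives fewer items than the page size.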

In order to enable those skilled in the art to better understand the technical solutions provided in the embodiments of the present application, the technical solutions provided in the embodiments of the present application are described below with reference to specific examples.

In the embodiment, a multi-value storage model of time series data is redesigned based on Hbase, a distributed query scheme is provided, the hot spot problem of data storage can be solved, and the large-range query efficiency of massive time series data is improved.

For example, the Hbase-based multi-value model distributed storage scheme may be as shown in Table 3:

TABLE 3 multivalued model storage scheme

The meaning of each parameter in the multi-valued model can be as shown in table 4:

TABLE 4 Multi-value model concept description

In the traditional Hbase scheme, a table has only one partition by default when it is created. In order to fully exploit the characteristics of the Hbase distributed database and reduce the short-term system unavailability caused by Hbase region splitting, in the embodiment of the application, a number of pre-partitions RN can be set when a table is created, so that the same batch of data can be uniformly distributed to each node of the cluster. Wherein:

1. Rowkey design: <saltValue><measurement><identifier><timestampHour>

Illustratively, saltValue is the pre-partition value, obtained by taking the absolute value of the hash value of measurement + identifier modulo the number of pre-partitions RN.

In mass data storage, the measurement + identifier can uniquely determine certain metadata, and data generated by different devices at the same time can uniformly cover each pre-partition of Hbase according to saltValue, so that the data hotspot problem is solved.

2. The design of the columns depends on the actual timestamp of each record.

Illustratively, the remainder Δtn obtained by dividing the timestamp by 3600 (taking one column per second as an example) can be used as the column name under the fields column family.

It can be seen that by utilizing the dynamic expansion feature of Hbase column, a new column is added under the fields column family every time a new remainder is obtained for data in a single hour.

It should be noted that 3600 seconds can be extended to 1 day or another unit time according to practical situations, and correspondingly, timestampHour in the rowKey can be changed to timestampDay or another unit.
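A minimal sketch of how a record's timestamp may be mapped to a row and a column under this design is given below. The function names and the `unit_seconds` parameter are hypothetical; timestamps are assumed to be Unix seconds.

```python
def row_bucket(timestamp: int, unit_seconds: int = 3600) -> int:
    """Unit timestamp stored in the rowKey (start of the hour by default)."""
    return timestamp - timestamp % unit_seconds


def column_offset(timestamp: int, unit_seconds: int = 3600) -> int:
    """Remainder used as the column name under the fields column family."""
    return timestamp % unit_seconds


# Data generated every 15 minutes within one hour shares one row
# and occupies four dynamically added columns.
samples = [1572660000 + i * 900 for i in range(4)]
rows = {row_bucket(ts) for ts in samples}
columns = [column_offset(ts) for ts in samples]
```

Passing `unit_seconds=86400` corresponds to the one-day variant mentioned above, in which the rowKey carries a per-day unit timestamp instead of a per-hour one.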

3. The data of each hour is stored in a separate row.

In this embodiment, only the values are stored in the time series data table, while the key information corresponding to the values of tags and fields in each measurement is stored in the schema. The schema design may be as shown in Table 5:

TABLE 5 schema storage design

Wherein, measurement represents the data type and the logical table name. The tags column family stores the order of the tags under the measurement; for each tag, tag key is the tag name and tag type is the data type of the tag, which can be String, byte, double, long, float, int, etc.

The fields column family stores the order of the fields under the measurement; field key is the name of the monitoring index, and field type is the data type of the monitoring index.

Illustratively, with this data storage mode, when time series data is retrieved, the position of each tag to be filtered within the tags column family and the position of each field value in the original data table can be determined from the schema according to the time series data query request. Single-table or multi-table retrieval of different combinations can then be completed within one time series data table, which is flexible and efficient.
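The role of the schema can be illustrated with the following sketch, which restores field names for a stored value by position. It assumes that the values of one record are stored as a semicolon-separated string, as in the storage example given later in this embodiment; the function name and the field names `online_status` and `cpu_usage` are hypothetical.

```python
from typing import Dict, List


def decode_fields(schema_fields: List[str], stored_value: str, sep: str = ";") -> Dict[str, str]:
    """Pair each stored value with the field key at the same position in the schema."""
    values = stored_value.split(sep)
    if len(values) != len(schema_fields):
        raise ValueError("stored value does not match the schema field order")
    return dict(zip(schema_fields, values))


# Hypothetical schema entry for one measurement: ordered field keys only.
device_fields = ["online_status", "cpu_usage"]
record = decode_fields(device_fields, "1;35")
```

Since the field keys live only in the schema, the time series data table itself stores nothing but the values.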

Therefore, the multi-valued model supports both single-value and multi-value time series data storage, and supports obtaining the values of a plurality of field indexes in a single retrieval, which greatly improves retrieval efficiency. The saltValue setting solves the hot spot problem in data storage and avoids data skew.

In this embodiment, in order to further improve the efficiency of querying the time series data, a tag index table may be further provided, and the format of the tag index table may be as shown in table 6:

TABLE 6 tag index table

tags	rowkey
tagvalue1	rowkey1, rowkey3, rowkey6
tagvalue2	rowkey4, rowkey7, rowkey88
…	…
tagvalueN	rowkey5, rowkey8, rowkeyM
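A lookup against such a tag index table might look like the sketch below. The in-memory dictionary and the function name are hypothetical, and combining several tag values by intersection is an assumption; the source only states that the table stores the correspondence between tags and rowkeys.

```python
from typing import Dict, Iterable, Optional, Set


def lookup_rowkeys(tag_index: Dict[str, Set[str]], tag_values: Iterable[str]) -> Set[str]:
    """Return the rowkeys that match every requested tag value."""
    result: Optional[Set[str]] = None
    for value in tag_values:
        keys = tag_index.get(value, set())
        # Assumed semantics: multiple filter conditions are ANDed together.
        result = set(keys) if result is None else result & keys
    return result if result is not None else set()


tag_index = {
    "tagvalue1": {"rowkey1", "rowkey3", "rowkey6"},
    "tagvalue2": {"rowkey4", "rowkey7", "rowkey88"},
}
```

Resolving rowkeys from the index first allows a cluster node to read only the matching rows instead of scanning and filtering the whole time range.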

Based on the data storage manner, the implementation of the time series data distributed query method provided by the embodiment of the present application may include:

1. Storage of time series data

1.1, setting the number RN of pre-partitions of the time series data storage table according to months, and creating the time series data table.

Illustratively, each month corresponds to a set of pre-partitions, created according to the pre-partition number, for storing the time series data of that month.

1.2, inserting time series data according to the design of the multi-valued model.

2. Distributed querying of time series data

Fig. 3 is a schematic diagram of a time series data distributed query architecture provided in this embodiment of the present application. As shown in fig. 3, a client (e.g., client1, client2, …, client n in fig. 3) may send a time series data query request to a proxy node; the proxy node allocates tasks to available cluster nodes (e.g., TSD1, …, TSD (n-1), TSD n in fig. 3), and the cluster nodes to which tasks are allocated perform the time series data query. The implementation flow may be as follows:

2.1, the client sends a retrieval request (i.e., a time series data query request) to the proxy node. The mandatory items of the request include a start time (denoted startTime), an end time (denoted endTime) and a measurement name (i.e., the data type); the optional items include an aggregation function type, fields names, a filter condition (filter), whether to group, a down-sampling function, a down-sampling interval and whether to page.

Illustratively, if paging is performed, the amount of data returned per page and the page number offset need to be filled in.
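The request content of step 2.1 could be modeled as in the sketch below. All attribute names are hypothetical and the source does not define a concrete request format; only the split between mandatory and optional items and the paging rule follow the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QueryRequest:
    # Mandatory items.
    start_time: int          # startTime, assumed to be Unix seconds
    end_time: int            # endTime, assumed to be Unix seconds
    measurement: str         # data type
    # Optional items.
    aggregator: Optional[str] = None
    fields: List[str] = field(default_factory=list)
    filters: Dict[str, str] = field(default_factory=dict)
    group_by: bool = False
    downsample_function: Optional[str] = None
    downsample_interval: Optional[str] = None
    paging: bool = False
    page_size: Optional[int] = None
    page_offset: Optional[int] = None

    def validate(self) -> None:
        """Check the rules stated for the request content."""
        if self.end_time <= self.start_time:
            raise ValueError("end_time must be later than start_time")
        if self.paging and (self.page_size is None or self.page_offset is None):
            raise ValueError("paging requires page_size and page_offset")
```

A proxy node could call `validate()` before decomposing the request, so that malformed requests are rejected without occupying any cluster node.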

2.2, the processing flow of the proxy node may be as shown in fig. 4, and may include:

2.2.1, acquiring the number of available TSD nodes (namely the number of available cluster nodes) of the cluster as JN;

2.2.2, determining the hour number TN of the time range of the query request;

For example, the time interval ΔT = endTime - startTime (in hours) may be calculated, and it is determined whether ΔT is less than or equal to 1 hour. If so, TN = 1; otherwise (ΔT > 1), it is determined whether startTime falls exactly on the hour.

If startTime is on the hour, T1 = startTime; otherwise, the start of the target time period containing startTime is taken as T1, that is, T1 = startTime rounded down to the hour.

Similarly, if endTime is on the hour, T2 = endTime; otherwise, the end of the target time period containing endTime is taken as T2, that is, T2 = endTime rounded down to the hour + 1 hour.

TN = T2 - T1 (in hours).
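The calculation of step 2.2.2 can be sketched as follows, assuming that startTime and endTime are Unix timestamps in seconds and that the preset unit duration is one hour. The function name and the returned tuple are hypothetical.

```python
from typing import Tuple


def hours_in_range(start_time: int, end_time: int, unit: int = 3600) -> Tuple[int, int, int]:
    """Return (T1, T2, TN) for a query request."""
    # A range of at most one unit is not decomposed: TN = 1.
    if end_time - start_time <= unit:
        return start_time, end_time, 1
    # Round startTime down to the start of its hour.
    t1 = start_time - start_time % unit
    # Round endTime up to the end of its hour unless it is already on the hour.
    t2 = end_time if end_time % unit == 0 else end_time - end_time % unit + unit
    return t1, t2, (t2 - t1) // unit
```

For instance, a request spanning 1572643800 to 1572666000 is widened to 1572642000 through 1572667200, which gives TN = 7.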

2.2.3, determining the number TaskNum of the query subtasks and the time range of execution of each subtask.

For example, the proxy node may determine whether TN is greater than JN.

If TN is less than or equal to JN, TaskNum = TN;

if TN > JN, TaskNum = JN. The size of the time range handled by each available cluster node (assuming node 1 through node JN) may be calculated according to a round robin algorithm.

E.g., one hour is allocated to each node in turn, starting at node 1 and continuing up to node JN, and the loop then starts again until all TN hours are allocated. Finally, an array timeRange = [timeRange1, timeRange2, …, timeRangeJN] of size JN is obtained.
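The round robin calculation of step 2.2.3 can be sketched as below. The function name is hypothetical; the sketch returns the number of hours handled by each subtask, with TaskNum = min(TN, JN).

```python
from typing import List


def round_robin_hours(tn: int, jn: int) -> List[int]:
    """Distribute TN hours over min(TN, JN) subtasks, one hour at a time."""
    if tn < 1 or jn < 1:
        raise ValueError("TN and JN must both be at least 1")
    task_num = min(tn, jn)
    time_range = [0] * task_num
    for hour in range(tn):
        # Hand out hours to node 1, node 2, ..., then wrap around.
        time_range[hour % task_num] += 1
    return time_range
```

With this allocation the sizes of any two sub-time ranges differ by at most one unit duration, so no node is assigned a disproportionately long range.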

2.2.4, task allocation.

Illustratively, if TaskNum < JN, TaskNum nodes are selected and the subtasks are assigned to them. If TaskNum ≥ JN, the subtask for each sub-time range in the array obtained in step 2.2.3 is sent to the corresponding node.
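The following sketch turns the hour counts of step 2.2.3 into concrete sub-time ranges and pairs them with nodes. It assumes that the sub-time ranges are contiguous and assigned in node order, which the source does not state explicitly; the function name and the node labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def assign_subtasks(t1: int, time_range: Sequence[int], nodes: Sequence[str],
                    unit: int = 3600) -> List[Tuple[str, int, int]]:
    """Return (node, sub_start, sub_end) for each subtask, covering [T1, T2)."""
    tasks = []
    start = t1
    # zip() stops at the shorter sequence, so only TaskNum nodes are used.
    for node, hours in zip(nodes, time_range):
        end = start + hours * unit
        tasks.append((node, start, end))
        start = end
    return tasks
```

Because each range begins where the previous one ends, the union of the sub-time ranges equals the target time range and no hour is queried twice.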

2.2.5, each TSD executes the query request.

Illustratively, each node queries the time series data of its sub-time range by using the Hbase distributed storage principle. The steps include filtering, down-sampling, aggregation, grouping and paging, and the results are returned to a cache.

2.2.6, summarizing and returning results.

Illustratively, the proxy node collects and sorts the query results in the cache and returns the query results to the client.
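A minimal sketch of the summarizing step is given below. It assumes that each node returns a mapping from a series name to a list of (timestamp, value) pairs; this result shape and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[int, float]


def merge_results(partials: Iterable[Dict[str, List[Point]]]) -> Dict[str, List[Point]]:
    """Combine per-node results and sort each series by timestamp."""
    merged: Dict[str, List[Point]] = defaultdict(list)
    for partial in partials:
        for series, points in partial.items():
            merged[series].extend(points)
    # Tuples sort by their first element, i.e. by timestamp.
    return {series: sorted(points) for series, points in merged.items()}
```

Since the sub-time ranges do not overlap, merging only requires concatenation and sorting; no aggregated value has to be recomputed at the proxy node.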

For example, time series data generated by a server operation and maintenance monitoring platform is taken as the data source; data is generated once every 15 minutes, the cluster has 3 TSD nodes, and the number of Hbase pre-partitions created is 100. The time series data generated by the web01 server is taken as an example:


An example of the storage of the traffic time series data and the device operation time series data generated by the 2 servers at 9:00, 10:00 and 11:00 on 2019-11-02 is shown in Table 6. The measurement of the time series data corresponding to the traffic is defined as flow, the measurement of the time series data corresponding to the device operation is defined as device, and the timestamp is a Unix timestamp; for example, 2019-11-01 09:00:00 (UTC+8) corresponds to the Unix timestamp 1572570000. The values of online status and CPU usage (CPU occupancy) in fields are separated by a semicolon:

TABLE 6 multi-valued model time series data storage example

It can be seen from the above storage that each device corresponds to different saltValue, which ensures that the load of the original data is evenly distributed to different pre-partitions, and avoids the hot spot problem. Meanwhile, the original time sequence data table only stores the value of the monitoring index, and does not store the name of the monitoring index, so that the storage space is greatly reduced. Each measurement is stored in the schema in one-to-one correspondence with its corresponding metric index name. Taking the above as an example, the schema is shown in table 7:

TABLE 7 example schema

The tag index table embodiment may be as shown in table 8:

TABLE 8 tag index Table embodiments

Take the following time series data query request as an example: statistics from 2019-11-02 05:30 to 2019-11-02 11:40.

The request is decomposed into startTime = 2019-11-02 05:30, endTime = 2019-11-02 11:40, measurement = V770_50, aggregation function avg, and down-sampling interval 1h.

When the proxy node receives the time series data query request, it can perform subtask decomposition.

Exemplarily, the start time is not on the hour, so the hour to which it belongs is taken as T1 = 2019-11-02 05:00; the end time is not on the hour, so the next hour after it is taken as T2 = 2019-11-02 12:00. The time range is therefore ΔT = 12 - 5 = 7 hours, which is greater than the number of cluster nodes (3). According to the polling algorithm, the time range array timeRange = [3, 2, 2] is obtained. The sub-query time range executed by TSD node 1 is then 2019-11-02 05:00 to 2019-11-02 08:00, that executed by TSD node 2 is 2019-11-02 08:00 to 2019-11-02 10:00, and that executed by TSD node 3 is 2019-11-02 10:00 to 2019-11-02 12:00.

Taking the execution process of the TSD node 3 as an example, according to the tag index table, a large amount of useless data is filtered, and the following rowkeys are accurately located as shown in table 9:

TABLE 9

Half of the data is then filtered out according to the measurement, and the result is shown in Table 10 below:

TABLE 10 data obtained by filtering

By using the feature of distributed storage, corresponding records are pulled from the pre-partition 79 and the pre-partition 80 in a distributed manner on 3 nodes.

Finally, grouping and hourly down-sampling calculation are performed. Each calculation result is a triple {a, b, c}, where a denotes a server ID, b denotes time, and c denotes an aggregation calculation result.

["web02": [{1572660000, 2550}, {1572663600, 5800}], "web03": [{1572660000, 3975}, {1572663600, 3575}]]

Finally, the proxy node aggregates all data of the TSD node 1, the TSD node 2 and the TSD node 3 to obtain the following results:

["web02": [{1572642000, 4570}, {1572645600, 2950}, {1572649200, 3255}, {1572652800, 3590}, {1572656400, 3800}, {1572660000, 2550}, {1572663600, 5800}, {1572667200, 3460}],

"web03": [{1572642000, 4915}, {1572645600, 3575}, {1572649200, 3975}, {1572652800, 3375}, {1572656400, 4215}, {1572660000, 3975}, {1572663600, 3575}, {1572667200, 5100}]]

The proxy node may return the aggregated results to the client.

Under the same query condition, the traditional OpenTSDB retrieval mode initiates, from any one TSD node in the cluster, a single aggregation query covering the whole range of nearly 7 hours on 2019-11-02. With the scheme provided by the embodiment of the application, queries over sub-time ranges of 3 hours, 2 hours and 2 hours can be executed by the 3 TSD nodes respectively, and the results are then summarized and returned.

Therefore, the time range is reduced from 7 hours to 3 hours, and the traversal of useless data is greatly reduced.

Through tests on the same data set, executing a 3-hour aggregation query on one TSD node in the traditional OpenTSDB retrieval mode takes 21.609 seconds, whereas executing 1-hour retrieval queries on 3 TSD nodes respectively and then summarizing and returning the results takes 4.04 seconds, a performance improvement of about 5 times.

The methods provided herein are described above. The following describes the apparatus provided in the present application:

Referring to fig. 5, which is a schematic structural diagram of a time series data distributed query apparatus according to an embodiment of the present application, the time series data distributed query apparatus may include:

a receiving unit 510, configured to receive a time series data query request, where the time series data query request includes a start time, an end time, and a data type of data to be queried;

a decomposition unit 520, configured to decompose the time series data query request according to the start time, the end time, the preset unit duration, and the number of available cluster nodes when a target time range covered by the start time and the end time exceeds a preset unit duration, so as to obtain N1 subtasks; n1 is more than or equal to 2 and less than or equal to N2, and N2 is the number of available cluster nodes; different subtasks correspond to different sub-time ranges in the target time range, and the union of the sub-time ranges corresponding to the subtasks is the target time range;

an allocating unit 530, configured to allocate the N1 subtasks to N1 cluster nodes, so that the N1 cluster nodes query corresponding time sequence data from a time sequence database according to the allocated subtasks;

a result response unit 540, configured to receive the query results returned by the N1 cluster nodes, and return the received query results to the initiator device of the time series data query request.

In some embodiments, a multi-valued model time series data table is stored in the time series database, and the rowkey of the multi-valued model time series data table comprises the data type of the time series data, the unique identifier of the data source and the identifier of the pre-partition where the time series data is located.

In some embodiments, the rowkey of the multi-valued model timing data table further includes unit timestamps corresponding to a preset unit duration, and different unit timestamps correspond to different time periods of the preset unit duration;

the column family of the multi-valued model time series data table comprises a plurality of columns for storing time series data, the columns are respectively used for storing data of different sub-time periods within the preset unit duration, and the time series data is stored in the corresponding column according to the sub-time period to which its timestamp belongs.

In some embodiments, the column family of the multi-valued model time series data table further includes a tag column for storing modifiable attributes of a data source.

In some embodiments, the time series data query request further includes a filter condition, where the filter condition includes at least one alterable attribute stored in the multi-valued model time series data table, and is used for a cluster node to which a subtask is allocated to perform time series data query filtering according to the filter condition.

In some embodiments, a column family of the multi-valued model timing data table is used to store value information;

the time sequence database is also stored with a mode schema, and the schema is used for storing key information of each column family.

In some embodiments, the decomposing unit decomposes the time-series data query request according to the start time, the end time, the preset unit duration, and the number of available cluster nodes, and includes:

determining whether the starting time is the starting time of a target time period and determining whether the ending time is the ending time of the target time period, wherein the target time period is a time period corresponding to the preset unit time length, and the time length of the time period is the preset unit time length;

when the starting time is not the starting time of the target time period, adjusting the starting time to the starting time of the target time period where the starting time is located; and/or when the end time is not the end time of the target time period, adjusting the end time to the end time of the target time period where the end time is located;

decomposing the time sequence data query request according to the ratio of the time length covered by the current starting time and the current ending time to the preset unit time length and the number of the available cluster nodes; and the difference value between the time lengths of the corresponding sub-time ranges of different subtasks does not exceed the preset unit time length.

Fig. 6 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure. The electronic device may include a processor 601, a memory 602 storing machine executable instructions. The processor 601 and the memory 602 may communicate via a system bus 603. Also, by reading and executing machine-executable instructions in memory 602 corresponding to the time series data distributed query control logic, processor 601 may perform the time series data distributed query method described above.

The memory 602 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, and the like. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.

In some embodiments, there is also provided a machine-readable storage medium, such as the memory 602 in fig. 6, having stored therein machine-executable instructions that, when executed by a processor, implement the time-series data distributed query method described above. For example, the machine-readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims

1. A distributed query method for time series data is characterized by comprising the following steps:

when the target time range covered by the starting time and the ending time exceeds a preset unit time length, decomposing the time sequence data query request according to the starting time, the ending time, the preset unit time length and the number of available cluster nodes to obtain N1 subtasks; n1 is more than or equal to 2 and less than or equal to N2, and N2 is the number of available cluster nodes; different subtasks correspond to different sub-time ranges in the target time range, and the union of the sub-time ranges corresponding to the subtasks is the target time range;

2. The method according to claim 1, wherein the time series data in the time series database is stored in a multi-valued model distributed storage mode based on Hbase;

3. The method of claim 2, wherein the rowkey of the multi-valued model timing data table further comprises unit time stamps corresponding to preset unit time lengths, and different unit time stamps correspond to different time lengths of the preset unit time lengths;

4. The method of claim 3, wherein the column family of the multi-valued model timing data table further comprises a tag column for storing modifiable attributes of a data source;

and a tag index table is also stored in the time sequence database and used for storing the corresponding relation between tag and rowkey.

5. The method of claim 4, wherein the time series data query request further comprises a filter condition, and the filter condition comprises at least one modifiable attribute stored in the multi-valued model time series data table and used for a cluster node to which a subtask is assigned, and the time series data query is filtered according to the filter condition.

6. The method according to claim 3 or 4, wherein a column family of the multivalued model timing data table is used for storing value information;

7. The method of claim 1, wherein decomposing the time series data query request according to the start time, the end time, the preset unit duration, and the number of available cluster nodes comprises:

when the starting time is not the starting time of the target time period, adjusting the starting time to the starting time of the target time period in which the starting time is located; and/or when the end time is not the end time of the target time period, adjusting the end time to the end time of the target time period where the end time is located;

decomposing the time sequence data query request according to the ratio of the time length covered by the current starting time and the current ending time to the preset unit time length and the number of the available cluster nodes; and the difference value between the time lengths of the sub-time ranges corresponding to different subtasks does not exceed the preset unit time length.

8. A time series data distributed query apparatus, comprising:

the decomposition unit is used for decomposing the time sequence data query request according to the starting time, the ending time, the preset unit duration and the number of available cluster nodes to obtain N1 subtasks when the target time range covered by the starting time and the ending time exceeds the preset unit duration; n1 is more than or equal to 2 and less than or equal to N2, and N2 is the number of available cluster nodes; different subtasks correspond to different sub-time ranges in the target time range, and the union of the sub-time ranges corresponding to the subtasks is the target time range;

the distribution unit is used for distributing the N1 subtasks to N1 cluster nodes, so that the N1 cluster nodes query corresponding time sequence data from a time sequence database according to the distributed subtasks;

and the result response unit is further configured to receive the query result returned by the N1 cluster nodes, and return the received query result to the initiator device of the time series data query request.

9. The device according to claim 8, wherein the time series data in the time series database is stored by a multi-valued model distributed storage mode based on Hbase;

a multi-valued model time sequence data table is stored in the time sequence database, and the rowkey of the multi-valued model time sequence data table comprises the data type of the time sequence data, the unique identifier of a data source and the identifier of a pre-partition where the time sequence data is located;

the rowkey of the multi-valued model time sequence data table further comprises unit time stamps corresponding to preset unit time length, and the time lengths corresponding to different unit time stamps are different time lengths of the preset unit time length;

the column group of the multi-valued model time sequence data table comprises a plurality of columns for storing time sequence data, the columns are respectively used for storing different sub-time periods in preset unit time length, and the time sequence data are stored to the corresponding columns according to the sub-time periods to which the timestamps belong;

wherein the column family of the multi-valued model timing data table further comprises a tag column for storing modifiable attributes of a data source;

a tag index table is also stored in the time sequence database, and the tag index table is used for storing the corresponding relation between tag and rowkey;

the time sequence data query request also comprises a filtering condition, wherein the filtering condition comprises at least one changeable attribute stored in the multi-valued model time sequence data table and is used for cluster nodes distributed with subtasks, and the time sequence data query and filtering are carried out according to the filtering condition;

wherein the column family of the multi-valued model timing data table is used for storing value information;

the time sequence database is also stored with a mode schema, and the schema is used for storing key information of each column family;

the decomposing unit decomposes the time series data query request according to the starting time, the ending time, the preset unit duration and the number of available cluster nodes, and comprises:

10. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor being configured to execute the machine executable instructions to implement the method of any one of claims 1 to 7.

11. A machine-readable storage medium having stored therein machine-executable instructions which, when executed by a processor, perform the method of any one of claims 1-7.