WO2019114754A1 - Join query method and system for multiple time series under columnar storage - Google Patents

Join query method and system for multiple time series under columnar storage

Info

Publication number
WO2019114754A1
WO2019114754A1 PCT/CN2018/120603 CN2018120603W WO2019114754A1 WO 2019114754 A1 WO2019114754 A1 WO 2019114754A1 CN 2018120603 W CN2018120603 W CN 2018120603W WO 2019114754 A1 WO2019114754 A1 WO 2019114754A1
Authority
WO
WIPO (PCT)
Prior art keywords
timestamp
sequence
filtering
filter
query
Prior art date
Application number
PCT/CN2018/120603
Other languages
English (en)
French (fr)
Inventor
王建民
黄向东
曹高飞
乔嘉林
江天
芮蕾
王晨
龙明盛
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 (Tsinghua University)
Priority to EP18887800.3A (EP3726397A4)
Publication of WO2019114754A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2477 Temporal data queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24558 Binary matching operations
    • G06F16/2456 Join operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Definitions

  • The present disclosure belongs to the field of computer data management and, more specifically, relates to a join query method and system for multiple time series under columnar storage.
  • Time series data is data collected by a sensor at different points in time.
  • Each record must include a timestamp field and, in addition, the sensor's unique ID and the data value at that timestamp, for example the ambient temperature over a period of time, a stock price, or a machine's memory usage. Such data reflects how a thing or phenomenon changes over time.
  • Time series data is mostly kept in columnar storage, in which the timestamp sequence and the value column are stored separately. Because all data in one column has the same type, column-stored data can use efficient compression encodings, which greatly reduces the space the stored data occupies.
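The columnar layout described above can be pictured with a small sketch. This is only an illustration under assumptions: the class name `ColumnarSeries`, its fields, and the sample readings are hypothetical and do not come from the patent; the strictly increasing timestamp check reflects the ordering the description states for the time column.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ColumnarSeries:
    """One time series kept as two parallel columns (hypothetical layout)."""
    sensor_id: str
    timestamps: List[int] = field(default_factory=list)  # timestamp column
    values: List[Any] = field(default_factory=list)      # value column, same length

    def append(self, timestamp: int, value: Any) -> None:
        # The time column is assumed to hold strictly increasing timestamps.
        if self.timestamps and timestamp <= self.timestamps[-1]:
            raise ValueError("timestamps must be strictly increasing")
        self.timestamps.append(timestamp)
        self.values.append(value)


# Made-up temperature readings for illustration only.
series = ColumnarSeries("sensor-1")
for ts, temp in [(1, 20.5), (2, 20.7), (5, 21.0)]:
    series.append(ts, temp)
```

Keeping each column homogeneous is what makes per-column compression practical; the sketch does not attempt any encoding.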
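  • The present disclosure provides a join query method for multiple time series under columnar storage that overcomes, or at least partially solves, the above problems. The method includes:
  • Step S1: divide the column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, where each time series includes a timestamp sequence and a data value sequence;
  • Step S2: from the timestamp sequence of each filtering sequence, filter out the timestamps at which that filtering sequence meets its preset filter condition;
  • Step S3: traverse the query sequences based on the timestamps at which each filtering sequence meets the preset filter condition, and obtain a join query result consisting of the timestamps of the query sequences and the data values corresponding to those timestamps.
  • Step S2 comprises: S21, obtaining the filter condition of each filtering sequence, which includes a timestamp limit and a data value limit; and S22, for each filtering sequence, storing the timestamps that satisfy the timestamp limit and whose data values satisfy the data value limit.
  • Step S22 specifically includes:
  • reading the timestamps and data values of each filtering sequence in batches, where each batch read takes a preset number of timestamps and data values, until all timestamps in the timestamp sequence and all data values in the data value sequence have been read, and storing, based on what was read, the timestamps at which each filtering sequence meets its filter condition.
  • Step S3 comprises: S31, selecting, from the timestamps that satisfy each filtering sequence's filter condition, the first timestamps common to all filtering sequences; S32, traversing the query sequence to find the second timestamps in its timestamp sequence that equal the first timestamps; and
  • S33, taking the second timestamps and their corresponding data values as the join query result.
  • Step S31 includes: selecting one target timestamp from the timestamps that satisfy each filtering sequence's filter condition, placing it in a preset storage queue, and searching for it among the qualifying timestamps of the remaining filtering sequences;
  • if all of the remaining filtering sequences contain the target timestamp, the target timestamp is taken as a first timestamp.
  • Step S31 further includes:
  • if any of the remaining filtering sequences does not contain the target timestamp, the target timestamp is deleted from the storage queue.
  • The storage queue is a priority queue.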
  • According to a second aspect, a join query system for multiple time series under columnar storage is provided, which includes:
  • a sequence division module configured to divide the column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, where each time series includes a timestamp sequence and a data value sequence;
  • a timestamp filtering module configured to filter out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence meets its preset filter condition;
  • a join query module configured to traverse the query sequences based on the timestamps at which each filtering sequence meets the preset filter condition, and to obtain a join query result consisting of the timestamps of the query sequences and the data values corresponding to those timestamps.
  • According to a third aspect, a computer program product is provided, comprising program code for performing the join query method described above.
  • According to a fourth aspect, a non-transitory computer-readable storage medium storing the computer program described above is provided.
  • The method and system provided by the present disclosure compute the timestamps that satisfy the filter conditions of all columns and then query the query columns with those timestamps, so that filtering and querying can be performed effectively according to the characteristics of time series data.
  • FIG. 1 is a flowchart of a join query method for multiple time series under columnar storage according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of an example join query over multiple time series under columnar storage according to an embodiment of the present disclosure;
  • FIG. 3 is a structural diagram of a join query system for multiple time series under columnar storage according to an embodiment of the present disclosure.
  • The embodiments of the present disclosure provide a join query method and system for multiple time series under columnar storage, which compute the timestamps that satisfy the filter conditions of all columns and then query the query columns with those timestamps, so that filtering and querying can be performed effectively according to the characteristics of time series data.
  • FIG. 1 is a flowchart of the join query method according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:
  • Step S1: divide the column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, where each time series includes a timestamp sequence and a data value sequence;
  • Step S2: from the timestamp sequence of each filtering sequence, filter out the timestamps at which that filtering sequence meets its preset filter condition;
  • Step S3: traverse the query sequences based on the timestamps at which each filtering sequence meets the preset filter condition, and obtain a join query result consisting of the timestamps of the query sequences and the data values corresponding to those timestamps.
  • In step S1, the timestamp sequence and the data column of each column-stored time series are stored separately, and the time column stores strictly increasing timestamps.
  • Suppose all stored time series comprise N_0 columns in total, each storing its own timestamp sequence and value column. The sequences to be queried comprise N_1 columns, Q_1, Q_2, ..., Q_i, ..., Q_{N_1-1}, Q_{N_1}, where Q_i denotes the query on the i-th column, and N_2 columns are designated as filtering sequences.
  • The embodiments do not limit the number of time series, the number of sequences to be queried, or the number of filtering sequences.
  • In step S2, the preset filter condition may be the same or different for each filtering sequence. The filter conditions are denoted F_1, F_2, ..., F_i, ..., F_{N_2-1}, F_{N_2}, where F_i is the filter condition on the data stored in the i-th column.
  • According to these filter conditions, the timestamps that satisfy the preset filter condition can be filtered out of the timestamp sequence of each filtering sequence.
  • In step S3, the query columns are queried with the timestamps that satisfy the preset filter conditions, so that filtering and querying can be performed effectively according to the characteristics of time series data.
  • Step S2 includes: S21, obtaining the filter condition of each filtering sequence, comprising a timestamp limit and a data value limit; and S22, storing, for each filtering sequence, the timestamps that satisfy the timestamp limit and whose data values satisfy the data value limit.
  • In S22, the operation is in essence performed on each filtering sequence: the timestamp sequence and the data value sequence of each filtering sequence are read together.
  • When a timestamp satisfies the timestamp limit and its data value satisfies the data value limit, the filter is deemed to have succeeded and the timestamp of the matching data is retained.
  • Step S22 specifically includes:
  • reading the timestamps and data values of each filtering sequence in batches, where each batch read takes a preset number of timestamps and data values, until all timestamps in the timestamp sequence and all data values in the data value sequence have been read;
  • Because the amount of data that fits in memory is limited, the embodiments read data in batches. Let T be the upper limit on the number of records read from one column per batch; each filtering sequence is then read T records at a time until all of its data has been read.
  • During reading, the timestamps at which each filtering sequence satisfies the timestamp limit are stored separately.
  • Step S3 includes: S31, selecting the first timestamps common to all filtering sequences; S32, traversing the query sequence to find the second timestamps that equal the first timestamps; and
  • S33, taking the second timestamps and their corresponding data values as the join query result.
  • In step S31, each filtering sequence has its own set of timestamps that satisfy its filter condition, and the embodiments need to select the timestamps common to all filtering sequences, that is, the first timestamps.
  • In step S32, the first timestamps are used to traverse the query sequence and find the identical timestamps in it, so that the query result can be output in step S33.
  • Step S31 includes: selecting one target timestamp from the qualifying timestamps of each filtering sequence and storing it in a preset storage queue;
  • if the qualifying timestamps of all remaining filtering sequences contain the target timestamp, the target timestamp is taken as a first timestamp.
  • Step S31 further includes:
  • if the qualifying timestamps of any remaining filtering sequence do not contain the target timestamp, the target timestamp is deleted from the storage queue.
  • In other words, the embodiments find the common first timestamps by querying through a storage queue: one timestamp is selected from each filtering sequence's qualifying timestamps and placed in the storage queue, and it is then searched for among the qualifying timestamps of the remaining filtering sequences to determine whether all filtering sequences contain it. If they do, the target timestamp is taken as a first timestamp and stored in a preset list QLIST, which holds the timestamps to be queried as computed from the filter conditions.
  • Otherwise the target timestamp is deleted from the storage queue. This continues until the qualifying timestamps of all filtering sequences have been processed.
  • The storage queue is a priority queue.
  • A priority queue orders the timestamps in the storage queue by priority, so that higher-priority timestamps are processed first and traversal is more efficient.
  • Typically, priority is defined by ascending timestamp.
  • FIG. 2 is a schematic diagram of an example join query over multiple time series under columnar storage according to an embodiment of the present disclosure.
  • The example provides query column 1 and query column 2 together with filter column 3 and filter column 4.
  • Based on FIG. 2, the join query over the column-stored time series proceeds as follows.
  • The filter condition for column 3 is "time>5, value=c" and the filter condition for column 4 is "time≤10, value!=x": column 3 data passes the filter when its timestamp is greater than 5 and its value equals c; column 4 data passes when its timestamp is less than or equal to 10 and its value does not equal x.
  • The steps include:
  • The embodiments use a priority queue PQ to store timestamp variables.
  • Array_I_1, ..., Array_I_{N_2} denote the number of entries already read from Array_1, Array_2, ..., Array_{N_2} respectively, and all are initialized to 0.
  • The embodiments compute the common timestamps that satisfy the filter conditions of all columns and then query the query columns with those timestamps, so that filtering and querying can be performed effectively according to the characteristics of time series data.
  • The query uses a multi-way batch merge algorithm: batch reading keeps the memory used by each read small, and by iterating the batch read multiple times, the partial result of each read is aggregated into the final result.
  • FIG. 3 is a structural diagram of a join query system for multiple time series under columnar storage according to an embodiment of the present disclosure.
  • As shown in FIG. 3, the system includes a sequence division module 1, a timestamp filtering module 2 and a join query module 3, where:
  • the sequence division module 1 is configured to divide the column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, where each time series includes a timestamp sequence and a data value sequence;
  • the timestamp filtering module 2 is configured to filter out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence meets its preset filter condition;
  • the join query module 3 is configured to traverse the query sequences based on the timestamps at which each filtering sequence meets the preset filter condition, and to obtain a join query result consisting of the timestamps of the query sequences and the data values corresponding to those timestamps.
  • For how the sequence division module 1, the timestamp filtering module 2 and the join query module 3 perform a join query over multiple time series under columnar storage, see the foregoing embodiments; details are not repeated here.
  • An embodiment of the present disclosure provides a join query system for multiple time series under columnar storage, comprising at least one processor and at least one memory communicatively coupled to the processor, wherein:
  • the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the methods provided by the foregoing method embodiments, for example steps S1 to S3 described above.
  • An embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium; the computer program comprises program instructions which, when executed by a computer, enable the computer to perform the methods provided by the foregoing method embodiments, for example steps S1 to S3 described above.
  • An embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the foregoing method embodiments, for example steps S1 to S3 described above.
  • All or part of the steps of the above method embodiments may be carried out by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments.
  • The storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

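A join query method for multiple time series under columnar storage, comprising: dividing a plurality of column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, each time series including a timestamp sequence and a data value sequence (step S1); filtering out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence satisfies a preset filter condition (step S2); and traversing the query sequences based on the timestamps at which each filtering sequence satisfies the preset filter condition to obtain a join query result, the join query result being the timestamps of the query sequences and the data values corresponding to those timestamps (step S3). By computing the timestamps that satisfy the filter conditions of all columns and then querying the query columns with those timestamps, filtering and querying can be performed effectively according to the characteristics of time series data.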

Description

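Join query method and system for multiple time series under columnar storage
Cross-reference
This application cites Chinese patent application No. 2017113226315, filed on December 12, 2017 and entitled "Join query method and system for multiple time series under columnar storage", which is incorporated herein by reference in its entirety.
Technical field
The present disclosure belongs to the field of computer data management and, more specifically, relates to a join query method and system for multiple time series under columnar storage.
Background
With the continuous development and maturation of modern technologies such as cloud computing, the Internet and the Internet of Things, data is receiving ever more attention. Data comes from every aspect of our lives, including the production and transaction data of enterprises, the interaction data between people on the Internet, and the monitoring data returned by sensors in the Internet of Things, and time series data accounts for a large share of it. Time series data is data collected by sensors at different points in time. Each record must contain a timestamp field and, in addition, the sensor's unique ID and the data value at that timestamp, for example the ambient temperature over a period of time, a stock price, or a machine's memory usage. Such data reflects how a thing or phenomenon changes over time.
Given these characteristics, time series data is mostly kept in columnar storage, in which the timestamp sequence and the value column are stored separately. Because all data in one column has the same type, column-stored data can use efficient compression encodings, which greatly reduces the space the stored data occupies.
However, while columnar storage reduces the volume of stored data, it also lowers query efficiency and query accuracy, so a join query method for multiple time series under columnar storage is urgently needed.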
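Summary
The present disclosure provides a join query method for multiple time series under columnar storage that overcomes, or at least partially solves, the above problems, characterized by comprising:
Step S1: dividing a plurality of column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, each time series including a timestamp sequence and a data value sequence;
Step S2: filtering out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence satisfies a preset filter condition;
Step S3: traversing the query sequences based on the timestamps at which each filtering sequence satisfies the preset filter condition, and obtaining a join query result, the join query result being the timestamps of the query sequences and the data values corresponding to those timestamps.
Step S2 includes:
S21: obtaining the filter condition corresponding to each filtering sequence, the filter condition including a timestamp limit and a data value limit;
S22: for each filtering sequence, storing those timestamps in its timestamp sequence that satisfy the timestamp limit and whose corresponding data values also satisfy the data value limit.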
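Steps S21 and S22 can be sketched as follows. This is a minimal sketch under assumptions: `FilterCondition`, `filter_timestamps`, and the sample column are hypothetical names and data; the condition itself mirrors the "time>5, value=c" example given in the embodiments.

```python
from typing import Any, Callable, List, NamedTuple


class FilterCondition(NamedTuple):
    time_limit: Callable[[int], bool]    # timestamp limit (S21)
    value_limit: Callable[[Any], bool]   # data value limit (S21)


def filter_timestamps(timestamps: List[int], values: List[Any],
                      cond: FilterCondition) -> List[int]:
    """S22: keep timestamps whose timestamp AND value both satisfy the condition."""
    return [t for t, v in zip(timestamps, values)
            if cond.time_limit(t) and cond.value_limit(v)]


# "time>5, value=c" applied to a made-up filtering column.
cond = FilterCondition(lambda t: t > 5, lambda v: v == "c")
kept = filter_timestamps([3, 6, 7, 10], ["c", "a", "c", "c"], cond)
print(kept)  # [7, 10]
```

Only timestamps are retained, since the values of a filtering sequence are not part of the join result.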
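Step S22 specifically includes:
reading the timestamps in the timestamp sequence and the data values in the data value sequence of each filtering sequence in batches at the same time, where each batch read takes a preset number of timestamps and data values, until all timestamps in the timestamp sequence and all data values in the data value sequence have been read;
based on the read results, separately storing the timestamps at which each filtering sequence satisfies its filter condition.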
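The batch read can be illustrated with a generator that hands out a fixed number of entries at a time. This is a sketch under assumptions: the in-memory lists stand in for column files on disk, and `read_batches`, `filter_in_batches`, and the sample data are hypothetical.

```python
from typing import Any, Callable, Iterator, List, Tuple


def read_batches(timestamps: List[int], values: List[Any],
                 batch_size: int) -> Iterator[Tuple[List[int], List[Any]]]:
    """Yield (timestamps, values) slices of at most batch_size entries until exhausted."""
    for start in range(0, len(timestamps), batch_size):
        end = start + batch_size
        yield timestamps[start:end], values[start:end]


def filter_in_batches(timestamps: List[int], values: List[Any],
                      keep: Callable[[int, Any], bool], batch_size: int) -> List[int]:
    """Apply the filter batch by batch and collect the matching timestamps."""
    matched: List[int] = []
    for ts_batch, val_batch in read_batches(timestamps, values, batch_size):
        matched.extend(t for t, v in zip(ts_batch, val_batch) if keep(t, v))
    return matched


batches = list(read_batches([1, 2, 3, 4, 5], list("abcde"), batch_size=2))
matched = filter_in_batches([1, 2, 3, 4, 5], list("abcde"),
                            lambda t, v: t >= 2 and v != "d", batch_size=2)
print(matched)  # [2, 3, 5]
```

The last batch may be shorter than the preset size, which is why the loop runs until the column is exhausted rather than for a fixed count.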
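Step S3 includes:
S31: from the timestamps at which each filtering sequence satisfies its filter condition, selecting the first timestamps common to all filtering sequences;
S32: traversing the query sequence and obtaining the second timestamps in its timestamp sequence that are identical to the first timestamps;
S33: taking the second timestamps and the data values corresponding to the second timestamps as the join query result.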
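Steps S31 to S33 can be sketched as an intersection followed by a single pass over the query sequence. Note the assumptions: plain set intersection is used here as a stand-in for S31 (the disclosure itself describes a storage-queue procedure), and the function names and the query column contents are hypothetical.

```python
from typing import Any, List, Tuple


def common_timestamps(filtered: List[List[int]]) -> List[int]:
    """S31: timestamps present in every filtering sequence's result, ascending."""
    if not filtered:
        return []
    common = set(filtered[0])
    for ts_list in filtered[1:]:
        common &= set(ts_list)
    return sorted(common)


def lookup(first_ts: List[int], query_ts: List[int],
           query_vals: List[Any]) -> List[Tuple[int, Any]]:
    """S32/S33: walk both ascending lists once and emit matching (timestamp, value) pairs."""
    result, i, j = [], 0, 0
    while i < len(first_ts) and j < len(query_ts):
        if first_ts[i] == query_ts[j]:
            result.append((query_ts[j], query_vals[j]))
            i += 1
            j += 1
        elif first_ts[i] < query_ts[j]:
            i += 1          # first timestamp missing from the query sequence
        else:
            j += 1          # query timestamp not selected by the filters
    return result


first = common_timestamps([[7, 10], [3, 7, 10]])
joined = lookup(first, [1, 7, 9, 10], ["a", "b", "c", "d"])
print(first, joined)  # [7, 10] [(7, 'b'), (10, 'd')]
```

The single-pass lookup relies on both lists being in ascending order, which holds when the time column stores strictly increasing timestamps.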
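Step S31 includes:
selecting one target timestamp from the timestamps at which each filtering sequence satisfies its filter condition and storing it in a preset storage queue;
based on the target timestamp of each filtering sequence, traversing the timestamps at which the remaining filtering sequences satisfy their filter conditions, and, if those timestamps of all remaining filtering sequences contain the target timestamp, taking the target timestamp as the first timestamp.
Step S31 further includes:
based on the target timestamp of each filtering sequence, traversing the timestamps at which the remaining filtering sequences satisfy their filter conditions, and, if those timestamps of any remaining filtering sequence do not contain the target timestamp, deleting the target timestamp from the storage queue.
The storage queue is a priority queue.
According to a second aspect of the present disclosure, a join query system for multiple time series under columnar storage is provided, characterized by comprising:
a sequence division module, configured to divide a plurality of column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, each time series including a timestamp sequence and a data value sequence;
a timestamp filtering module, configured to filter out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence satisfies a preset filter condition;
a join query module, configured to traverse the query sequences based on the timestamps at which each filtering sequence satisfies the preset filter condition and to obtain a join query result, the join query result being the timestamps of the query sequences and the data values corresponding to those timestamps.
According to a third aspect of the present disclosure, a computer program product is provided, comprising program code for performing the join query method described above.
According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium is provided for storing the computer program described above.
The join query method and system provided by the present disclosure compute the timestamps that satisfy the filter conditions of all columns and then query the query columns with those timestamps, so that filtering and querying can be performed effectively according to the characteristics of time series data.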
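Brief description of the drawings
FIG. 1 is a flowchart of a join query method for multiple time series under columnar storage according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an example join query over multiple time series under columnar storage according to an embodiment of the present disclosure;
FIG. 3 is a structural diagram of a join query system for multiple time series under columnar storage according to an embodiment of the present disclosure.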
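Detailed description
Specific implementations of the present disclosure are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present disclosure but do not limit its scope.
In the prior art, time series data is increasingly kept in columnar storage, in which the timestamp sequence and the value column are stored separately. Because all data in one column has the same type, column-stored data can use efficient compression encodings, which greatly reduces the space the stored data occupies.
However, while columnar storage greatly reduces the volume of stored data, it also introduces a new problem: how to efficiently perform timestamp-based join queries over multiple time series. No existing technique provides a join query method that completes the join efficiently and quickly.
To address these problems in the prior art, the embodiments of the present disclosure provide a join query method and system for multiple time series under columnar storage, which compute the timestamps that satisfy the filter conditions of all columns and then query the query columns with those timestamps, so that filtering and querying can be performed effectively according to the characteristics of time series data.
FIG. 1 is a flowchart of a join query method for multiple time series under columnar storage according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:
Step S1: dividing a plurality of column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, each time series including a timestamp sequence and a data value sequence;
Step S2: filtering out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence satisfies a preset filter condition;
Step S3: traversing the query sequences based on the timestamps at which each filtering sequence satisfies the preset filter condition, and obtaining a join query result, the join query result being the timestamps of the query sequences and the data values corresponding to those timestamps.
Specifically, in step S1 the timestamp sequence and the data column of each column-stored time series are stored separately, and the time column stores strictly increasing timestamps. Suppose all currently stored time series comprise N_0 columns in total, each storing its own timestamp sequence and value column. The sequences to be queried comprise N_1 columns, Q_1, Q_2, ..., Q_i, ..., Q_{N_1-1}, Q_{N_1}, where Q_i denotes the query on the i-th column, and N_2 columns are designated as filtering sequences. The embodiments do not limit the number of time series, the number of sequences to be queried, or the number of filtering sequences.
In step S2, the preset filter condition may be the same or different for each filtering sequence. The filter conditions are denoted F_1, F_2, ..., F_i, ..., F_{N_2-1}, F_{N_2}, where F_i is the filter condition on the data stored in the i-th column; according to these conditions, the timestamps that satisfy the preset filter condition can be filtered out of the timestamp sequence of each filtering sequence.
In step S3, the query columns are queried with the timestamps that satisfy the preset filter conditions, so that filtering and querying can be performed effectively according to the characteristics of time series data.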
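On the basis of the above embodiment, step S2 includes:
S21: obtaining the filter condition corresponding to each filtering sequence, the filter condition including a timestamp limit and a data value limit;
S22: for each filtering sequence, storing those timestamps in its timestamp sequence that satisfy the timestamp limit and whose corresponding data values also satisfy the data value limit.
In S21, each filtering sequence has a corresponding filter condition, which consists of a filter limit on the timestamp and a filter limit on the data value. For example, if the filter condition is "time>5, value=c", the timestamp limit is that the timestamp must be greater than 5 and the data value limit is that the value must equal c.
In S22, the operation is in essence performed on each filtering sequence: the timestamp sequence and the data value sequence of each filtering sequence are read together, and when a timestamp in the timestamp sequence satisfies the timestamp limit and the corresponding data value in the data value sequence satisfies the data value limit, the filter is deemed to have succeeded and the timestamp of the matching data is retained.
On the basis of the above embodiment, step S22 specifically includes:
reading the timestamps in the timestamp sequence and the data values in the data value sequence of each filtering sequence in batches at the same time, where each batch read takes a preset number of timestamps and data values, until all timestamps in the timestamp sequence and all data values in the data value sequence have been read;
based on the read results, separately storing the timestamps at which each filtering sequence satisfies its filter condition.
It will be understood that, because the amount of data that can be held in memory is limited, the embodiments read data in batches. Let T be the upper limit on the number of records read from one column per batch; reading then starts on every filtering sequence at the same time, T records per read, until all data in a filtering sequence has been read.
During reading, the timestamps at which each filtering sequence satisfies the timestamp limit are stored separately.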
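On the basis of the above embodiment, step S3 includes:
S31: from the timestamps at which each filtering sequence satisfies its filter condition, selecting the first timestamps common to all filtering sequences;
S32: traversing the query sequence and obtaining the second timestamps in its timestamp sequence that are identical to the first timestamps;
S33: taking the second timestamps and the data values corresponding to the second timestamps as the join query result.
In step S31, it will be understood that each filtering sequence has its own set of timestamps that satisfy its filter condition, and the embodiments need to select the timestamps common to all filtering sequences, that is, the first timestamps referred to in the embodiments.
In step S32, the first timestamps are used to traverse the query sequence and obtain the identical timestamps in it, so that the query result can be output in step S33.
On the basis of the above embodiment, step S31 includes:
selecting one target timestamp from the timestamps at which each filtering sequence satisfies its filter condition and storing it in a preset storage queue;
based on the target timestamp of each filtering sequence, traversing the timestamps at which the remaining filtering sequences satisfy their filter conditions, and, if those timestamps of all remaining filtering sequences contain the target timestamp, taking the target timestamp as the first timestamp.
Step S31 further includes:
based on the target timestamp of each filtering sequence, traversing the timestamps at which the remaining filtering sequences satisfy their filter conditions, and, if those timestamps of any remaining filtering sequence do not contain the target timestamp, deleting the target timestamp from the storage queue.
It will be understood that the embodiments find the common first timestamps by querying through a storage queue in turn: one timestamp is selected from each filtering sequence's qualifying timestamps and stored in the storage queue, and it is then searched for among the qualifying timestamps of the remaining filtering sequences to determine whether all filtering sequences contain it. If they do, the target timestamp is taken as the first timestamp and stored in a preset list QLIST, which holds the timestamps to be queried as computed from the filter conditions.
If they do not, the target timestamp is deleted from the storage queue. This continues until the qualifying timestamps of all filtering sequences have gone through the above process.
On the basis of the above embodiment, the storage queue is a priority queue.
It will be understood that a priority queue orders the timestamps in the storage queue by priority, so that higher-priority timestamps are processed first and traversal is more efficient.
Typically, priority is defined by ascending timestamp, so smaller timestamps have higher priority.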
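The priority-queue idea can be sketched with Python's `heapq` as a min-heap, so that the smallest timestamp is always processed first. This is a counting variant written under assumptions, not a line-by-line rendering of the procedure in the disclosure: it declares a timestamp common once it has been popped from every list, and the function name `common_timestamps_pq` is hypothetical.

```python
import heapq
from typing import List


def common_timestamps_pq(arrays: List[List[int]]) -> List[int]:
    """Return timestamps present in every strictly ascending list, using a min-heap."""
    # Seed the queue with the first timestamp of each list: (timestamp, list index, position).
    pq = [(arr[0], idx, 0) for idx, arr in enumerate(arrays) if arr]
    heapq.heapify(pq)
    qlist: List[int] = []
    current, seen = None, 0
    while pq:
        ts, idx, pos = heapq.heappop(pq)       # smallest timestamp = highest priority
        if ts == current:
            seen += 1
        else:
            current, seen = ts, 1
        if seen == len(arrays):                # every filtering sequence contains ts
            qlist.append(ts)
        if pos + 1 < len(arrays[idx]):         # advance the list this entry came from
            heapq.heappush(pq, (arrays[idx][pos + 1], idx, pos + 1))
    return qlist


print(common_timestamps_pq([[7, 10], [3, 7, 10]]))  # [7, 10]
```

Because each list is strictly increasing, a timestamp can be popped at most once per list, so a count equal to the number of lists means it is common to all of them.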
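FIG. 2 is a schematic diagram of an example join query over multiple time series under columnar storage according to an embodiment of the present disclosure. As shown in FIG. 2, the example provides query column 1 and query column 2 together with filter column 3 and filter column 4, and the join query over the column-stored time series proceeds as follows.
As shown in FIG. 2, the filter condition for column 3 is "time>5, value=c" and the filter condition for column 4 is "time≤10, value!=x". That is, column 3 data passes the filter when its timestamp is greater than 5 and its value equals c; column 4 data passes when its timestamp is less than or equal to 10 and its value does not equal x.
Specifically, the steps include:
(1) Set F_INDEX = 0; initialize the lists Array_1, Array_2, ..., Array_{N_2} and their size counters Size_1, Size_2, ..., Size_{N_2}, where Array_i is the list of timestamps in the i-th column that satisfy filter condition F_i and Size_i is the amount of data stored in Array_i; let L_Array_1, L_Array_2, ..., L_Array_{N_2} denote the data sizes of the lists Array_1, Array_2, ..., Array_{N_2}. In FIG. 2, Array_1 and Array_2 can store the timestamps of column 3 and column 4 that satisfy the filter conditions, namely {7, 10} and {3, 7, 10}.
(2) If F_INDEX > N_2, use the priority queue to store timestamp variables directly and jump to (4);
if Size_{F_INDEX} ≥ T, where T is the number of records read per batch, or all data in column F_INDEX has been read, set F_INDEX = F_INDEX + 1.
Otherwise, read the next unread record of column F_INDEX; if its timestamp and value satisfy filter condition F_{F_INDEX}, put its timestamp into Array_{F_INDEX} and set Size_{F_INDEX} = Size_{F_INDEX} + 1.
(3) Set F_INDEX = F_INDEX + 1, jump to (2) and repeat.
(4) The embodiments use a priority queue PQ to store timestamp variables. In this queue, a smaller timestamp has higher priority. Let Array_I_1, ..., Array_I_{N_2} denote the number of entries already read from Array_1, Array_2, ..., Array_{N_2} respectively; all are initialized to 0.
(5) Traverse Array_1, Array_2, ..., Array_{N_2} in turn; on reaching Array_i, put its first entry into PQ and set Array_I_i = Array_I_i + 1. In FIG. 2, the first timestamps of Array_1 and Array_2 are {7} and {3} respectively.
(6) Initialize t_0 = -1.
(7) If PQ is not empty, take out its first entry t_1, which is a timestamp; otherwise, query the query columns directly according to QLIST, the list provided by the embodiments. If Array_1, Array_2, ..., Array_{N_2} do not all contain t_1, repeat the above process. If t_0 = -1, add t_1 to QLIST and set t_0 = t_1; otherwise, if t_1 is not equal to t_0, add t_1 to QLIST and set t_0 = t_1. In FIG. 2, for timestamp 3, only column 4 contains a data point with that timestamp and column 3 does not, so timestamp 3 is discarded; for timestamp 7, both column 3 and column 4 contain a point with that timestamp, so it is a timestamp common to the filter columns.
(8) Initialize j = 1.
(9) Traverse each entry of Array_j in turn. If j > N_2, return to (7). If Array_I_j > L_Array_j, set j = j + 1 and execute (9). If the Array_I_j-th entry of Array_j equals t_1, set Array_I_j = Array_I_j + 1 and execute (9). If the Array_I_j-th entry of Array_j does not equal t_1, add that entry to PQ; Array_I_j > L_Array_j, j = j + 1; execute (9).
(10) Query the query columns Q_1, Q_2, ..., Q_i, ..., Q_{N_1-1}, Q_{N_1} according to the timestamps in QLIST and output the result; a record of a query column is output only if its timestamp is in QLIST. If the filter columns still contain unread data, jump to (2); otherwise the query ends. In FIG. 2, the common timestamps computed from the filter columns are {7, 10} and the query columns are column 1 and column 2; querying column 1 and column 2 gives the final output shown in Table 1: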
Table 1: Join query result
time    value1    value2
7       b         g
10      d         h
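The final lookup that produces Table 1 can be reproduced with a short sketch. The contents of FIG. 2 are not available in the text, so the query columns below are hypothetical except for the rows at timestamps 7 and 10, which are taken from Table 1; `values_at` is likewise an assumed helper name.

```python
# Common timestamps computed from the filter columns, as stated in the text.
qlist = [7, 10]

# Hypothetical contents of query columns 1 and 2; only the rows at 7 and 10 come from Table 1.
column1 = {"time": [1, 7, 9, 10], "value": ["a", "b", "c", "d"]}
column2 = {"time": [2, 7, 10, 12], "value": ["f", "g", "h", "i"]}


def values_at(column, wanted):
    """Return the column's value at each wanted timestamp (None if absent)."""
    index = dict(zip(column["time"], column["value"]))
    return [index.get(t) for t in wanted]


rows = list(zip(qlist, values_at(column1, qlist), values_at(column2, qlist)))
print("time value1 value2")
for time, v1, v2 in rows:
    print(time, v1, v2)
```

Rows of the query columns whose timestamps are not in QLIST (1, 2, 9 and 12 in this made-up data) never reach the output.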
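As the query output shows, the embodiments compute the common timestamps that satisfy the filter conditions of all columns and then query the query columns with those timestamps, so that filtering and querying can be performed effectively according to the characteristics of time series data.
In addition, the query uses a multi-way batch merge algorithm: batch reading keeps the memory used by each read small, and by iterating the batch read multiple times, the partial result of each read is aggregated into the final result.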
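One way to picture a multi-way merge over batched reads is to expose each filter column as a lazy stream and merge the streams with the standard library's `heapq.merge`. This is a sketch under assumptions rather than the disclosed procedure: the in-memory lists stand in for column files, the names `filtered_stream` and `merged_common` are hypothetical, and the sample columns are made up so that the filters yield {7, 10} and {3, 7, 10} as in the example.

```python
import heapq
from itertools import groupby
from typing import Any, Callable, Iterator, List


def filtered_stream(timestamps: List[int], values: List[Any],
                    keep: Callable[[int, Any], bool], batch_size: int) -> Iterator[int]:
    """Lazily yield matching timestamps, reading batch_size entries per step."""
    for start in range(0, len(timestamps), batch_size):
        ts_batch = timestamps[start:start + batch_size]
        val_batch = values[start:start + batch_size]
        for t, v in zip(ts_batch, val_batch):
            if keep(t, v):
                yield t


def merged_common(streams: List[Iterator[int]]) -> Iterator[int]:
    """Merge ascending streams and yield timestamps that occur in all of them."""
    for ts, group in groupby(heapq.merge(*streams)):
        if sum(1 for _ in group) == len(streams):
            yield ts


col3 = filtered_stream([1, 3, 7, 10], ["c", "a", "c", "c"],
                       lambda t, v: t > 5 and v == "c", batch_size=2)
col4 = filtered_stream([3, 5, 7, 10, 12], ["y", "x", "y", "z", "y"],
                       lambda t, v: t <= 10 and v != "x", batch_size=2)
common = list(merged_common([col3, col4]))
print(common)  # [7, 10]
```

Because the streams are generators, only one batch per column is materialized at any time, and the merged output grows incrementally as batches are consumed.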
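FIG. 3 is a structural diagram of a join query system for multiple time series under columnar storage according to an embodiment of the present disclosure. As shown in FIG. 3, the system includes a sequence division module 1, a timestamp filtering module 2 and a join query module 3, where:
the sequence division module 1 is configured to divide a plurality of column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, each time series including a timestamp sequence and a data value sequence;
the timestamp filtering module 2 is configured to filter out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence satisfies a preset filter condition;
the join query module 3 is configured to traverse the query sequences based on the timestamps at which each filtering sequence satisfies the preset filter condition and to obtain a join query result, the join query result being the timestamps of the query sequences and the data values corresponding to those timestamps.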
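The three modules can be wired together as in the sketch below. Everything here is an assumption made for illustration: the class and method names, the rule that a column counts as a filtering sequence exactly when a condition is supplied for it, the use of `None` for a query column that lacks a common timestamp, and all sample data apart from the two result rows of Table 1.

```python
from typing import Any, Callable, Dict, List, Tuple

Column = Tuple[List[int], List[Any]]      # (timestamp column, value column)
Condition = Callable[[int, Any], bool]    # hypothetical filter signature


class SequenceDivisionModule:
    def divide(self, columns: Dict[str, Column], conditions: Dict[str, Condition]):
        # Assumption: a column is a filtering sequence iff it has a condition.
        filtering = {n: c for n, c in columns.items() if n in conditions}
        to_query = {n: c for n, c in columns.items() if n not in conditions}
        return to_query, filtering


class TimestampFilterModule:
    def filter(self, filtering: Dict[str, Column],
               conditions: Dict[str, Condition]) -> Dict[str, List[int]]:
        return {n: [t for t, v in zip(ts, vals) if conditions[n](t, v)]
                for n, (ts, vals) in filtering.items()}


class JoinQueryModule:
    def query(self, to_query: Dict[str, Column],
              filtered: Dict[str, List[int]]) -> List[tuple]:
        common = None
        for ts_list in filtered.values():
            common = set(ts_list) if common is None else common & set(ts_list)
        indexes = [dict(zip(ts, vals)) for ts, vals in to_query.values()]
        return [(t, *(idx.get(t) for idx in indexes)) for t in sorted(common or ())]


columns = {
    "col1": ([1, 7, 9, 10], ["a", "b", "c", "d"]),
    "col2": ([2, 7, 10, 12], ["f", "g", "h", "i"]),
    "col3": ([3, 6, 7, 10], ["c", "a", "c", "c"]),
    "col4": ([3, 5, 7, 10, 12], ["y", "x", "y", "z", "y"]),
}
conditions = {"col3": lambda t, v: t > 5 and v == "c",
              "col4": lambda t, v: t <= 10 and v != "x"}

to_query, filtering = SequenceDivisionModule().divide(columns, conditions)
filtered = TimestampFilterModule().filter(filtering, conditions)
rows = JoinQueryModule().query(to_query, filtered)
print(rows)  # [(7, 'b', 'g'), (10, 'd', 'h')]
```

Each class maps to one of steps S1 to S3, which keeps the division, filtering and lookup stages independently replaceable.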
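For how the sequence division module 1, the timestamp filtering module 2 and the join query module 3 perform a join query over multiple time series under columnar storage, see the above embodiments; the details are not repeated here.
An embodiment of the present disclosure provides a join query system for multiple time series under columnar storage, comprising at least one processor and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the methods provided by the above method embodiments, for example: step S1, dividing a plurality of column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, each time series including a timestamp sequence and a data value sequence; step S2, filtering out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence satisfies a preset filter condition; step S3, traversing the query sequences based on those timestamps and obtaining a join query result, the join query result being the timestamps of the query sequences and the data values corresponding to those timestamps.
An embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium; the computer program comprises program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example steps S1, S2 and S3 as set out in the preceding paragraph.
An embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example steps S1, S2 and S3 as set out above.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
From the description of the above implementations, a person skilled in the art will clearly understand that each implementation can be realized by software together with the necessary general-purpose hardware platform, or alternatively by hardware. On this understanding, the above technical solution, or the part of it that contributes over the prior art, can be embodied as a software product. The software product can be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in parts of the embodiments.
Finally, the methods of this application are only preferred implementations and are not intended to limit the scope of protection of the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.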

Claims (10)

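  1. A join query method for multiple time series under columnar storage, characterized by comprising:
    Step S1: dividing a plurality of column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, each time series including a timestamp sequence and a data value sequence;
    Step S2: filtering out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence satisfies a preset filter condition;
    Step S3: traversing the query sequences based on the timestamps at which each filtering sequence satisfies the preset filter condition, and obtaining a join query result, the join query result being the timestamps of the query sequences and the data values corresponding to those timestamps.
  2. The method according to claim 1, characterized in that step S2 includes:
    S21: obtaining the filter condition corresponding to each filtering sequence, the filter condition including a timestamp limit and a data value limit;
    S22: for each filtering sequence, storing those timestamps in its timestamp sequence that satisfy the timestamp limit and whose corresponding data values also satisfy the data value limit.
  3. The method according to claim 2, characterized in that step S22 specifically includes:
    reading the timestamps in the timestamp sequence and the data values in the data value sequence of each filtering sequence in batches at the same time, where each batch read takes a preset number of timestamps and data values, until all timestamps in the timestamp sequence and all data values in the data value sequence have been read;
    based on the read results, separately storing the timestamps at which each filtering sequence satisfies its filter condition.
  4. The method according to claim 1, characterized in that step S3 includes:
    S31: from the timestamps at which each filtering sequence satisfies its filter condition, selecting the first timestamps common to all filtering sequences;
    S32: traversing the query sequence and obtaining the second timestamps in its timestamp sequence that are identical to the first timestamps;
    S33: taking the second timestamps and the data values corresponding to the second timestamps as the join query result.
  5. The method according to claim 4, characterized in that step S31 includes:
    selecting one target timestamp from the timestamps at which each filtering sequence satisfies its filter condition and storing it in a preset storage queue;
    based on the target timestamp of each filtering sequence, traversing the timestamps at which the remaining filtering sequences satisfy their filter conditions, and, if those timestamps of all remaining filtering sequences contain the target timestamp, taking the target timestamp as the first timestamp.
  6. The method according to claim 5, characterized in that step S31 further includes:
    based on the target timestamp of each filtering sequence, traversing the timestamps at which the remaining filtering sequences satisfy their filter conditions, and, if those timestamps of any remaining filtering sequence do not contain the target timestamp, deleting the target timestamp from the storage queue.
  7. The method according to claim 5 or 6, characterized in that the storage queue is a priority queue.
  8. A join query system for multiple time series under columnar storage, characterized by comprising:
    a sequence division module, configured to divide a plurality of column-stored time series into a plurality of sequences to be queried and a plurality of filtering sequences, each time series including a timestamp sequence and a data value sequence;
    a timestamp filtering module, configured to filter out, from the timestamp sequence of each filtering sequence, the timestamps at which that filtering sequence satisfies a preset filter condition;
    a join query module, configured to traverse the query sequences based on the timestamps at which each filtering sequence satisfies the preset filter condition and to obtain a join query result, the join query result being the timestamps of the query sequences and the data values corresponding to those timestamps.
  9. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to perform the method according to any one of claims 1 to 7.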
PCT/CN2018/120603 2017-12-12 2018-12-12 Join query method and system for multiple time series under columnar storage WO2019114754A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP18887800.3A EP3726397A4 (en) 2017-12-12 2018-12-12 LINK INQUIRY PROCEDURE AND SYSTEM FOR MULTIPLE TIME SEQUENCES WITH COLUMN STORAGE

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711322631.5 2017-12-12
CN201711322631.5A CN108062378B (zh) 2017-12-12 2017-12-12 Join query method and system for multiple time series under columnar storage

Publications (1)

Publication Number Publication Date
WO2019114754A1 (zh) 2019-06-20

Family

ID=62138243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/120603 WO2019114754A1 (zh) 2017-12-12 2018-12-12 Join query method and system for multiple time series under columnar storage

Country Status (3)

Country Link
EP (1) EP3726397A4 (zh)
CN (1) CN108062378B (zh)
WO (1) WO2019114754A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
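CN108062378B (zh) * 2017-12-12 2018-12-11 清华大学 Join query method and system for multiple time series under columnar storage
CN110502541A (zh) * 2019-07-26 2019-11-26 联想(北京)有限公司 Data processing method and electronic device
CN113868267A (zh) * 2020-06-30 2021-12-31 华为技术有限公司 Method for injecting time series data, method for querying time series data, and database system
CN113312313B (zh) * 2021-01-29 2023-09-29 淘宝(中国)软件有限公司 Data query method, non-volatile storage medium and electronic device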

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090018996A1 (en) * 2007-01-26 2009-01-15 Herbert Dennis Hunt Cross-category view of a dataset using an analytic platform
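CN103279530A (zh) * 2013-05-31 2013-09-04 携程计算机技术(上海)有限公司 Method for establishing a combined query cache for time series data, and method and system
CN104035956A (zh) * 2014-04-11 2014-09-10 江苏瑞中数据股份有限公司 Time series data storage method based on distributed column storage
CN106407395A (zh) * 2016-09-19 2017-02-15 北京百度网讯科技有限公司 Data query processing method and apparatus
CN108062378A (zh) * 2017-12-12 2018-05-22 清华大学 Join query method and system for multiple time series under columnar storage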

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147603A1 (en) * 2006-12-14 2008-06-19 Olli Pekka Kostamaa Converting temporal data into time series data
US20110218978A1 (en) * 2010-02-22 2011-09-08 Vertica Systems, Inc. Operating on time sequences of data
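CN104331432A (zh) * 2014-10-22 2015-02-04 江苏瑞中数据股份有限公司 Method for accessing massive power-grid time series data suited to a cross-section access pattern
CN106648446B (zh) * 2015-10-30 2020-07-07 阿里巴巴集团控股有限公司 Storage method and apparatus for time series data, and electronic device
US10824629B2 (en) * 2016-04-01 2020-11-03 Wavefront, Inc. Query implementation using synthetic time series
CN107092624B (zh) * 2016-12-28 2022-08-30 北京星选科技有限公司 Data storage method, apparatus and system
CN106503276A (zh) * 2017-01-06 2017-03-15 山东浪潮云服务信息科技有限公司 Method and apparatus for a time series database for a real-time monitoring system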

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3726397A4 *

Also Published As

Publication number Publication date
CN108062378A (zh) 2018-05-22
EP3726397A1 (en) 2020-10-21
CN108062378B (zh) 2018-12-11
EP3726397A4 (en) 2021-10-20

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 18887800; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
ENP Entry into the national phase
    Ref document number: 2018887800; Country of ref document: EP; Effective date: 20200713