WO2022126551A1 - Method for storing time series data - Google Patents

Method for storing time series data

Info

Publication number
WO2022126551A1
Authority
WO
WIPO (PCT)
Prior art keywords
time
series data
timeline
data records
records
Prior art date
Application number
PCT/CN2020/137378
Other languages
English (en)
French (fr)
Inventor
程洪泽
廖浩均
陶建辉
Original Assignee
北京涛思数据科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京涛思数据科技有限公司
Priority to US18/027,716 (US20230385254A1)
Priority to EP20965567.9A (EP4266187A1)
Priority to PCT/CN2020/137378 (WO2022126551A1)
Publication of WO2022126551A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Definitions

  • The invention relates to the field of data processing, and in particular to a method for storing time series data.
  • To improve compression ratio and analysis speed, time series data is generally stored in columnar format.
  • Columnar storage stores time series data segment by segment, one time period per segment.
  • When a new segment is created, a block of storage space usually has to be reserved.
  • When there are very many timelines, for example 10 million, a large amount of space has to be reserved, which leaves the system short of storage resources, especially memory.
  • Embodiments of the present invention provide a method for storing time series data, which solves the problem of insufficient storage resources caused by reserving a large amount of space for time series data records when storing time series data records in a columnar manner.
  • An embodiment of the present invention provides a method for storing time series data, the method comprising:
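  • caching the time series data records of each timeline received from the network in memory in a row storage format;
  • when the time series data records of a timeline cached in the memory need to be flushed to disk, determining the sum of the number of records of the timeline to be flushed and the number of time series data records in the Last file;
  • if the sum is less than a predetermined number of data records N, writing the records of the timeline to be flushed into the Last file;
  • if the sum is greater than or equal to N, merging the records of the timeline to be flushed with the records in the Last file and writing them to the data file in a columnar storage format.
  • Preferably, determining that the records of a timeline cached in the memory need to be flushed includes: checking the memory, or the offset list of a timeline cached in the memory; and, if the memory is insufficient or that offset list is full, determining that the records need to be flushed to disk.
  • Preferably, the Last file reserves, for each timeline, storage space for N time series data records of that timeline.
  • Preferably, writing the records of the timeline to be flushed into the Last file includes:
  • appending the records to be flushed after all the time series data records already in the storage space of the timeline.
  • Preferably, the Last file has an offset list containing N offset records, and the offset records indicate the offsets of the time series data records of the timeline.
  • Preferably, after the records to be flushed are written into the Last file, the offsets of those records are appended in order to the offset list of the timeline.
  • Preferably, the Last file contains previously written time series data records of the timeline.
  • Preferably, writing the records of the timeline to be flushed into the Last file includes: merging them with the records of the timeline read from the Last file to obtain the merged records of the timeline; creating a new Last file for the merged records; and writing the merged records into the new Last file and then deleting the original Last file of the timeline.
  • Preferably, merging the records to be flushed with the records in the Last file and writing them to the data file in a columnar storage format includes: reading the records of the timeline from the Last file; merging them with the records to be flushed; and
  • writing the merged time series data records of the timeline to the data file in a columnar storage format.
  • Preferably, both the Last file and the data file are files in the persistent storage medium for storing time series data records.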
  • When the time series data records in memory are flushed to disk, if the total number of records in memory and in the Last file is greater than the preset number N of data records, the records are merged and stored in columnar format. There is therefore no need to reserve a large amount of storage for each timeline, which solves the problem of insufficient storage resources caused by reserving a large amount of space for time series data records stored in columnar format.
  • FIG. 1 is a schematic flowchart of a method for storing time series data according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a storage structure in a memory provided by an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a Last file using a reserved storage space mode provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a Last file in a mode of not reserving storage space provided by an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a method for storing time series data according to an embodiment of the present invention. As shown in FIG. 1 , the method may include:
  • Step S101: Cache the time series data records of each timeline received from the network in memory in a row storage format;
  • Step S102: When the time series data records of a timeline cached in the memory need to be flushed to disk, determine the sum of the number of records of the timeline to be flushed and the number of time series data records in the Last file;
  • Step S103: If the sum for the timeline is less than the predetermined number of data records N, write the records of the timeline to be flushed into the Last file; if the sum is greater than or equal to N, merge the records to be flushed with the records in the Last file and write them to the data file in a columnar storage format.
  • A time series data record is time series data recorded in chronological order; the records of each timeline are the data collected by one data collection point, that is, the data of its various parameters.
  • The persistent storage medium in the embodiment of the present invention holds two files for flushed data.
  • One is a file that stores time series data records in a columnar storage format, referred to as the data file for ease of description. The number of records in each data block of the data file is greater than or equal to N. A file name other than "data" may also be used; the file name does not limit the present invention.
  • The other is a file that stores the latest time series data records, referred to as the Last file in the embodiment of the present invention; the number of records in the Last file is less than N. A file name other than "Last" may also be used; the file name does not limit the present invention.
  • In other words, the Last file and the data file are files in the persistent storage medium for storing time series data records.
  • The row storage format records each time series data record in chronological order (one record holds the data of several parameters), that is, it stores the records one after another.
  • The columnar storage format records the data of each parameter in chronological order, that is, it stores the values of each parameter of the records one after another.
  • In one implementation, when in-memory records are flushed to disk, the sum of the number of records in memory and the number of records in the Last file is calculated, and the corresponding action is then chosen. The details are as follows:
  • Step S102 includes: for any timeline, checking the storage space of the memory or the offset list of that timeline cached in the memory. If the memory is insufficient (for example, below a preset value) or the offset list of the timeline is full, it is determined that the records of the timeline cached in the memory need to be flushed to disk, that is, the in-memory data is written to the persistent storage medium.
  • At that point the Last file is checked to determine how many records of the timeline it already stores, and that number is added to the number of records of the timeline waiting in memory to be flushed, giving the total number of time series data records of the timeline.
  • The offset list of a timeline cached in the memory includes a plurality of offset records, each indicating the offset in memory of a time series data record of the timeline.
  • In step S103, the predetermined number N of data records is the minimum number required for columnar storage; that is, the records of a timeline may be written to the data file only once there are at least N of them.
  • In step S103, the Last file can have various storage formats, and the present invention proposes three. Correspondingly, there are three ways of writing the records of the timeline to be flushed into the Last file, as follows:
  • Method 1: the Last file reserves space
  • In this method, the Last file reserves, for each timeline, storage space for N time series data records of that timeline.
  • When the total number of records of the timeline is less than N, the minimum for columnar storage has not been reached, so the records to be flushed are appended directly after all the records already in the storage space of the timeline.
  • For the records of a single timeline, the append operation is simple, the number of disk I/Os is small, and reads and writes are fast, but a small amount of storage space (room for N time series data records) is consumed.
  • Although storage space has to be reserved for a small number of records (N time series data records), a large amount of storage space is still saved compared with the space reserved in the prior art.
  • Method 2: the Last file uses the first no-reserved-space approach
  • In this method, the Last file has an offset list containing N offset records, and the offset records indicate the offsets of the time series data records of the timeline.
  • When the total number of records of the timeline is less than N, the records to be flushed are written into the Last file, and the offsets of the records written this time are then appended in order to the offset list of the timeline.
  • For the records of a single timeline, only storage for the offset list has to be set aside in the Last file; no storage space has to be reserved for the records themselves, which saves a large amount of storage space.
  • Method 3: the Last file uses the second no-reserved-space approach
  • In this method, the Last file contains the previously written time series data records of the timeline.
  • When the total number of records of the timeline is less than N, the records to be flushed are merged with the records of the timeline read from the Last file, a new Last file is created for the merged records, the merged records are written into the new Last file, and the original Last file of the timeline is deleted.
  • In this method, the records in the Last file can be stored in row format to improve write speed, or in columnar format to improve analysis speed.
  • This method needs neither storage for an offset list in the Last file nor reserved storage space for time series data records, which saves a large amount of storage space.
  • In step S103, merging the records to be flushed with the records in the Last file and writing them to the data file in a columnar storage format includes: reading the records of the timeline from the Last file, merging them with the records of the timeline to be flushed to obtain the merged records of the timeline, and then writing the merged records to the data file in a columnar storage format.
  • When the records of a timeline cached in memory are flushed for the first time, the Last file holds 0 records of that timeline. If the number of records to be flushed is greater than or equal to N, they can be written directly to the data file in columnar format; if it is less than N, they are written into the storage space of the timeline in the Last file.
  • When the records of the timeline cached in memory are flushed again, the number of records to be flushed is added to the number of records of the same timeline in the Last file. If the sum is greater than or equal to N, the records are merged and written to the data file in columnar format; otherwise the records to be flushed are appended to the Last file.
  • The invention combines row storage and columnar storage: the latest data of each timeline is stored in row format, and only when the number of records of a timeline in row storage reaches a set value (N) are the records of that timeline stored in columnar format.
  • Normal row storage stores records one after another. By recording the offset of each record on the storage medium in an index table, no space needs to be reserved, which greatly reduces the demand for storage resources.
  • When in-memory data is written to persistent storage, the number of records may be less than N, and the records then need to be saved in a special file, Last.
  • In short, flushed data is stored in two data files.
  • One file is named data and stores data in columnar format; the number of records in each of its data blocks is greater than or equal to the predetermined value N.
  • The other file is named last and stores the latest time series data, with fewer records than the predetermined value N.
  • This design preserves compression ratio and analysis speed without reserving storage resources.
  • The system can pre-allocate a block of storage, shared by all timelines, to hold inserted records. This memory is managed as a first-in, first-out circular buffer. The offsets mentioned below are relative to this block.
  • Each timeline has a fixed structure, identified by the ID of the timeline, such as TS0 ID, TS1 ID, etc.
  • numOfRecords: the number of records in memory.
  • offset0, offset1, ..., offsetN: the offset list.
  • Each timeline has a fixed-size offset list that records the offset of each record in memory.
  • The list is a circular buffer, because records remain in memory after being written to persistent storage until they are overwritten by new records.
  • The latest data is generally kept in memory, and memory uses row storage.
  • For time series data, the memory is managed first-in, first-out.
  • When memory runs short or the offset list of a timeline is full, the flush process is started to write the old data to persistent storage.
  • For a given timeline, the number of records held in memory may be too small to reach the minimum required by columnar storage. The persistent storage medium therefore holds, besides the columnar file, a special Last file that keeps these records. Writing them directly to the columnar file would leave many data blocks with too few records, reducing compression and query efficiency.
  • The system checks the Last file every time records in memory are written to the persistent storage medium. For a given timeline, it looks up the number of records of the timeline in the Last file, adds it to the number of records in memory, and then decides as follows:
  • If the sum exceeds the minimum required by columnar storage, all records in the Last file are read, merged with the records in memory, and written to columnar storage.
  • If the sum is below that minimum, the records in memory are written to the Last file.
  • The Last file can have various storage formats, and the present invention proposes three: one reserves space and the other two do not.
  • In the reserved-space mode, space is reserved for each timeline; its size is the minimum number of records N required by columnar storage multiplied by the record size.
  • Each timeline has a fixed structure, identified by the ID of the timeline, such as TS0 ID, TS1 ID, etc.
  • numOfRecords: the number of records in memory.
  • start time, end time: the start and end time of the timeline in memory.
  • The advantage of this mode is that merging records is simple. For the data of a single timeline it is a plain append, with few disk I/Os and fast reads and writes, but it consumes more storage space.
  • Method 1: For a timeline, first read the saved records from the Last file and merge them with the records in memory. If the number of records is greater than N, write them to the columnar storage file; if it is less than N, write them to a new Last file. After all timelines are processed, delete the old Last file and keep only the new one. This Last file can be stored in columnar or row format: columnar storage speeds up analysis but slows writes, and row storage does the opposite.
  • Method 1 has to rewrite the Last file every time in-memory data is persisted, which is inefficient. To improve efficiency, Method 2 can be used.
  • Method 2: Each timeline has a fixed structure, identified by the ID of the timeline, such as TS0 ID, TS1 ID, etc.
  • numOfRecords: the number of records in memory.
  • start time, end time: the start and end time of the timeline in memory.
  • offset0, offset1, ..., offsetN: the offset of each record in storage.
  • This method does not need to rewrite the Last file.
  • Adding records is mainly an append operation, so efficiency is high.
  • However, when the number of records of a timeline exceeds N and they are written to the columnar storage file, holes are left in the Last file. An implementation needs to clean these up periodically to avoid wasting storage space.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

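The present invention discloses a method for storing time series data, comprising: caching the time series data records of each timeline received from the network in memory in a row storage format; when the time series data records of a timeline cached in the memory need to be flushed to disk, determining the sum of the number of records of the timeline to be flushed and the number of time series data records in a Last file; if the sum for the timeline is less than a predetermined number of data records N, writing the records of the timeline to be flushed into the Last file; and if the sum is greater than or equal to N, merging the records of the timeline to be flushed with the records in the Last file and writing them to a data file in a columnar storage format.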

Description

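Method for storing time series data
Technical Field
The present invention relates to the field of data processing, and in particular to a method for storing time series data.
Background
To improve compression ratio and analysis speed, time series data is generally stored in columnar format. Columnar storage stores time series data segment by segment, one time period per segment. When a new segment is created, a block of storage space usually has to be reserved. When there are very many timelines, for example ten million, the space to be reserved is very large, which leaves the system short of storage resources, especially memory.
Summary of the Invention
Embodiments of the present invention provide a method for storing time series data that solves the problem of insufficient storage resources caused by reserving a large amount of space for time series data records when they are stored in columnar format.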
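An embodiment of the present invention provides a method for storing time series data, the method comprising:
caching the time series data records of each timeline received from the network in memory in a row storage format;
when the time series data records of a timeline cached in the memory need to be flushed to disk, determining the sum of the number of records of the timeline to be flushed and the number of time series data records in a Last file;
if the sum for the timeline is less than a predetermined number of data records N, writing the records of the timeline to be flushed into the Last file;
if the sum for the timeline is greater than or equal to the predetermined number of data records N, merging the records of the timeline to be flushed with the records in the Last file and writing them to a data file in a columnar storage format.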
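Preferably, determining that the time series data records of a timeline cached in the memory need to be flushed to disk includes:
checking the memory, or the offset list of a timeline cached in the memory;
if the memory is insufficient or the offset list of that timeline is full, determining that the time series data records of the timeline cached in the memory need to be flushed to disk.
Preferably, the Last file reserves, for each timeline, storage space for N time series data records of that timeline.
Preferably, writing the records of the timeline to be flushed into the Last file includes:
appending the records of the timeline to be flushed after all the time series data records already in the storage space of the timeline.
Preferably, the Last file has an offset list containing N offset records, and the offset records indicate the offsets of the time series data records of the timeline.
Preferably, after the records of the timeline to be flushed are written into the Last file, the offsets of those records are appended in order to the offset list of the timeline.
Preferably, the Last file contains previously written time series data records of the timeline.
Preferably, writing the records of the timeline to be flushed into the Last file includes:
merging the records of the timeline to be flushed with the records of the timeline read from the Last file to obtain the merged time series data records of the timeline;
creating a new Last file for storing the merged records of the timeline;
writing the merged records of the timeline into the new Last file, and then deleting the original Last file of the timeline.
Preferably, merging the records of the timeline to be flushed with the records in the Last file and writing them to the data file in a columnar storage format includes:
reading the time series data records of the timeline from the Last file;
merging the records of the timeline to be flushed with the records of the timeline read from the Last file to obtain the merged time series data records of the timeline;
writing the merged records of the timeline to the data file in a columnar storage format.
Preferably, both the Last file and the data file are files in the persistent storage medium for storing time series data records.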
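The technical solution provided by the embodiments of the present invention has the following beneficial effects:
When the time series data records in memory are flushed to disk, if the total number of records in memory and in the Last file is greater than the preset number of data records N, the records are merged and stored in columnar format. There is therefore no need to reserve a large amount of storage for each timeline, which solves the problem of insufficient storage resources caused by reserving a large amount of space for time series data records stored in columnar format.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for storing time series data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the storage structure in memory according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Last file in reserved-storage-space mode according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a Last file in no-reserved-storage-space mode according to an embodiment of the present invention.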
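Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the drawings. It should be understood that the preferred embodiments described below only illustrate and explain the present invention and do not limit it.
FIG. 1 is a schematic flowchart of a method for storing time series data according to an embodiment of the present invention. As shown in FIG. 1, the method may include:
Step S101: cache the time series data records of each timeline received from the network in memory in a row storage format;
Step S102: when the time series data records of a timeline cached in the memory need to be flushed to disk, determine the sum of the number of records of the timeline to be flushed and the number of time series data records in the Last file;
Step S103: if the sum for the timeline is less than the predetermined number of data records N, write the records of the timeline to be flushed into the Last file; if the sum is greater than or equal to N, merge the records of the timeline to be flushed with the records in the Last file and write them to the data file in a columnar storage format.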
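Steps S102 and S103 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Store` class, the `flush` function, the dictionary-based records and the value `N = 4` are all assumptions made for illustration, and both files are modelled as in-memory dictionaries rather than files on disk.

```python
from dataclasses import dataclass, field

N = 4  # hypothetical minimum number of records per columnar block


@dataclass
class Store:
    """Toy stand-in for the two persistent files (assumed structure)."""
    last: dict = field(default_factory=dict)  # timeline id -> row records
    data: dict = field(default_factory=dict)  # timeline id -> columnar blocks


def flush(store, ts_id, pending, n=N):
    """Apply S102/S103 to the records of one timeline leaving memory."""
    in_last = store.last.get(ts_id, [])
    total = len(pending) + len(in_last)       # S102: sum of both counts
    if total < n:                             # S103: too few, keep rows in Last
        store.last[ts_id] = in_last + list(pending)
        return "last"
    merged = in_last + list(pending)          # S103: merge, then go columnar
    columns = {key: [rec[key] for rec in merged] for key in merged[0]}
    store.data.setdefault(ts_id, []).append(columns)
    store.last[ts_id] = []                    # Last no longer holds them
    return "data"
```

With `N = 4`, flushing two records twice for the same timeline first lands in the Last buffer and then produces one columnar block of four records.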
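A time series data record is time series data recorded in chronological order; the records of each timeline are the data collected by one data collection point, that is, the data of its various parameters.
The persistent storage medium in the embodiment of the present invention holds two files for flushed data. One is a file that stores time series data records in a columnar storage format, referred to as the data file for ease of description; the number of records in each data block of the data file is greater than or equal to N. A file name other than "data" may also be used, and the file name does not limit the present invention. The other is a file that stores the latest time series data records, referred to as the Last file in the embodiment of the present invention; the number of records in the Last file is less than N. A file name other than "Last" may also be used, and the file name does not limit the present invention. In other words, the Last file and the data file are files in the persistent storage medium for storing time series data records.
The row storage format records each time series data record in chronological order (one record holds the data of several parameters), that is, it stores the records one after another.
The columnar storage format records the data of each parameter in chronological order, that is, it stores the values of each parameter of the records one after another.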
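The difference between the two layouts can be shown with a small Python sketch. The function names and the sample parameters (`temperature`, `humidity`) are hypothetical; the source only states that a record holds the data of several parameters.

```python
def rows_to_columns(rows):
    """Turn row-format records into one list of values per parameter."""
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def columns_to_rows(columns):
    """Rebuild row-format records from the per-parameter lists."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"ts": 1, "temperature": 20.5, "humidity": 40},
    {"ts": 2, "temperature": 20.7, "humidity": 41},
]
columns = rows_to_columns(rows)
```

In row format the values of one record sit next to each other; in columnar format the values of one parameter sit next to each other, which is what tends to help compression and column scans.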
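In one implementation, when in-memory records are flushed to disk, the sum of the number of time series data records in memory and the number of time series data records in the Last file is calculated, and the corresponding action is then chosen. The details are as follows:
Step S102 includes: for any timeline, checking the storage space of the memory or the offset list of that timeline cached in the memory. If the memory is insufficient (for example, below a preset value) or the offset list of the timeline is full, it is determined that the time series data records of the timeline cached in the memory need to be flushed to disk, that is, the in-memory data is written to the persistent storage medium. At that point the Last file is checked to determine how many records of the timeline it already stores, and that number is added to the number of records of the timeline waiting in memory to be flushed, giving the total number of time series data records of the timeline.
The offset list of a timeline cached in the memory includes a plurality of offset records, each indicating the offset in memory of a time series data record of the timeline.
In step S103, the predetermined number of data records N is the minimum number required for columnar storage; that is, the records of a timeline may be written to the data file only once there are at least N of them.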
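In step S103, the Last file can have various storage formats, and the present invention proposes three. Correspondingly, there are three ways of writing the records of the timeline to be flushed into the Last file, as follows:
Method 1: the Last file reserves space
In this method, the Last file reserves, for each timeline, storage space for N time series data records of that timeline.
When the total number of records of the timeline is less than the predetermined number of data records N, the minimum for columnar storage has not been reached, so the records of the timeline to be flushed are appended directly after all the time series data records already in the storage space of the timeline.
For the records of a single timeline, the append operation is simple, the number of disk I/Os is small, and reads and writes are fast, but a small amount of storage space (room for N time series data records) is consumed.
Although storage space has to be reserved for a small number of records (N time series data records), a large amount of storage space is still saved compared with the space reserved in the prior art.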
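The reserved-space append of Method 1 could look like the following Python sketch. The byte layout is an assumption: a header of `numOfRecords`, start time and end time followed by `N` fixed-size slots, with each record reduced to a timestamp and one float value. The file is modelled with `io.BytesIO`, and `N = 4` is a hypothetical value.

```python
import io
import struct

N = 4                           # hypothetical records reserved per timeline
HEADER = struct.Struct("<Iqq")  # numOfRecords, start time, end time
RECORD = struct.Struct("<qd")   # assumed record: timestamp, float value
REGION = HEADER.size + N * RECORD.size


def create_last_file(num_timelines):
    """Reserve one zero-filled region per timeline."""
    return io.BytesIO(bytes(num_timelines * REGION))


def append_records(f, ts_index, records):
    """Append records after those already stored in the timeline's region."""
    if not records:
        return
    base = ts_index * REGION
    f.seek(base)
    count, start, end = HEADER.unpack(f.read(HEADER.size))
    if count + len(records) >= N:
        raise ValueError("total reaches N: merge into the data file instead")
    f.seek(base + HEADER.size + count * RECORD.size)
    for ts, value in records:
        f.write(RECORD.pack(ts, value))
    if count == 0:
        start = records[0][0]
    end = records[-1][0]
    f.seek(base)
    f.write(HEADER.pack(count + len(records), start, end))


def read_records(f, ts_index):
    """Read back the records currently stored for one timeline."""
    f.seek(ts_index * REGION)
    count, _, _ = HEADER.unpack(f.read(HEADER.size))
    return [RECORD.unpack(f.read(RECORD.size)) for _ in range(count)]
```

Each append touches only the timeline's own region, which is why this mode needs few disk I/Os; the cost is that every timeline occupies a full region even when it holds no records.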
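Method 2: the Last file uses the first no-reserved-space approach
In this method, the Last file has an offset list containing N offset records, and the offset records indicate the offsets of the time series data records of the timeline.
When the total number of records of the timeline is less than the predetermined number of data records N, the minimum for columnar storage has not been reached, so the records of the timeline to be flushed are written into the Last file, and the offsets of the records written this time are then appended in order to the offset list of the timeline.
For the records of a single timeline, only storage for the offset list has to be set aside in the Last file; no storage space has to be reserved for the records themselves, which saves a large amount of storage space.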
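A rough Python sketch of the offset-list approach follows. The `OffsetLast` class, the 16-byte record format and the choice to keep the offset lists in a dictionary are assumptions for illustration; in the source the offset list is part of the Last file itself.

```python
import io
import struct

RECORD = struct.Struct("<qd")  # assumed record: timestamp, float value


class OffsetLast:
    """Append-only record log plus one offset list per timeline."""

    def __init__(self, n):
        self.n = n              # predetermined number of records N
        self.log = io.BytesIO()
        self.offsets = {}       # timeline id -> byte offsets of its records

    def append(self, ts_id, records):
        offs = self.offsets.setdefault(ts_id, [])
        if len(offs) + len(records) >= self.n:
            raise ValueError("total reaches N: merge into the data file")
        self.log.seek(0, io.SEEK_END)
        for rec in records:
            offs.append(self.log.tell())   # remember where the record lands
            self.log.write(RECORD.pack(*rec))

    def read(self, ts_id):
        out = []
        for off in self.offsets.get(ts_id, []):
            self.log.seek(off)
            out.append(RECORD.unpack(self.log.read(RECORD.size)))
        return out
```

Records of different timelines interleave in the log in arrival order, and the per-timeline offset list is what makes them retrievable without any reserved record space.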
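Method 3: the Last file uses the second no-reserved-space approach
In this method, the Last file contains the previously written time series data records of the timeline.
When the total number of records of the timeline is less than the predetermined number of data records N, the minimum for columnar storage has not been reached, so the records of the timeline to be flushed are merged with the records of the timeline read from the Last file to obtain the merged records of the timeline, a new Last file is created for the merged records, the merged records are written into the new Last file, and the original Last file of the timeline is deleted.
In this method, the time series data records in the Last file can be stored in row format to improve write speed, or in columnar format to improve analysis speed.
This method needs neither storage for an offset list in the Last file nor reserved storage space for time series data records, which saves a large amount of storage space.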
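The rewrite approach of Method 3 can be sketched as follows. This is an illustration under assumptions: the Last file is stored as JSON holding all timelines, the function name `rewrite_last` is hypothetical, and the old file is dropped by replacing it with the new one via `os.replace` rather than by a separate delete call.

```python
import json
import os
import tempfile


def rewrite_last(path, ts_id, pending):
    """Merge pending records with the saved ones and write a new Last file."""
    saved = {}
    if os.path.exists(path):
        with open(path) as f:
            saved = json.load(f)
    merged = saved.get(ts_id, []) + pending   # old records first, then new
    saved[ts_id] = merged
    new_path = path + ".new"
    with open(new_path, "w") as f:            # create the new Last file
        json.dump(saved, f)
    os.replace(new_path, path)                # new file takes the old one's place
    return merged


with tempfile.TemporaryDirectory() as folder:
    last_path = os.path.join(folder, "last")
    first = rewrite_last(last_path, "TS0", [[1, 1.0]])
    second = rewrite_last(last_path, "TS0", [[2, 2.0]])
    leftover = os.path.exists(last_path + ".new")
```

Every flush rewrites the whole file, which is the efficiency cost the description attributes to this approach.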
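In step S103, merging the records of the timeline to be flushed with the records in the Last file and writing them to the data file in a columnar storage format includes: reading the time series data records of the timeline from the Last file, merging them with the records of the timeline to be flushed to obtain the merged records of the timeline, and then writing the merged records to the data file in a columnar storage format.
According to the embodiment of the present invention, when the records of a timeline cached in memory are flushed for the first time, the Last file holds 0 records of that timeline. If the number of records to be flushed is greater than or equal to the predetermined number of data records N, the data is written directly to the data file in columnar format; if it is less than N, the records are written into the storage space of the timeline in the Last file. When the records of the timeline cached in memory are flushed again, the number of records to be flushed is added to the number of records of the same timeline in the Last file. If the sum is greater than or equal to N, the records are merged and written to the data file in columnar format; otherwise the records to be flushed are appended to the Last file.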
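The present invention combines row storage and columnar storage: the latest data of each timeline is stored in row format, and only when the number of records of a timeline in row storage reaches a set value (N) are the records of that timeline stored in columnar format. Normal row storage stores records one after another; by recording the offset of each record on the storage medium in an index table, no space needs to be reserved, which greatly reduces the demand for storage resources. When in-memory data is written to persistent storage, the number of records may be less than N, and the records then need to be saved in a special file, Last. The next time in-memory data is persisted, the records in memory are merged with the records in the Last file, and it is then decided whether the merged records should be written to columnar storage or stay in the Last file. In short, flushed data is stored in two data files. One is named data and stores data in columnar format, with the number of records in each data block greater than or equal to the predetermined value N. The other is named last and stores the latest time series data, with fewer records than the predetermined value N. This design preserves compression ratio and analysis speed without reserving storage resources. The embodiment of the present invention is described in detail below in three parts: memory handling, persistent storage, and handling of the Last file.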
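1. Memory handling
The system can pre-allocate a block of storage, shared by all timelines, to hold inserted records. This memory is managed as a first-in, first-out circular buffer. The offsets mentioned below are all relative to this block.
The storage structure in memory is shown in FIG. 2. Each timeline has a fixed structure, identified by the ID of the timeline, such as TS0 ID, TS1 ID, etc.
numOfRecords: the number of records in memory.
Current Slot: the position in the offset list of the most recent record. From numOfRecords and current Slot, the position of the timeline's first record in memory can be derived.
offset0, offset1, ..., offsetN: the offset list. Each timeline has a fixed-size offset list that records the offset of each record in memory. The list is a circular buffer, because records remain in memory after being written to the persistent storage medium until they are overwritten by new records.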
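When a new record is inserted, the following operations are performed:
1. Allocate space from the in-memory data buffer, write the record, and note its offset;
2. current slot = (current slot + 1) % number of Slots;
3. Increment numOfRecords by one.
If allocating space requires overwriting an old record, the following is done for the overwritten record:
1. Decrement numOfRecords by one.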
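The bookkeeping above can be sketched in Python as a small per-timeline structure. The class and method names are hypothetical. Two details are assumptions not spelled out in the source: the new offset is stored in the slot selected in step 2, and `num_records` is capped at the list size when the ring wraps onto its own oldest slot. The `first_slot` formula is one way to derive the first record's position from numOfRecords and current slot.

```python
class TimelineSlots:
    """Per-timeline circular offset list (illustrative sketch)."""

    def __init__(self, num_slots):
        self.offsets = [None] * num_slots
        self.current_slot = -1   # slot of the newest record, -1 when empty
        self.num_records = 0

    def insert(self, offset):
        """Register a record just written at `offset` in the shared buffer."""
        self.current_slot = (self.current_slot + 1) % len(self.offsets)
        self.offsets[self.current_slot] = offset
        # Assumption: if the ring wrapped, the reused slot held the oldest
        # record, so the count cannot exceed the number of slots.
        self.num_records = min(self.num_records + 1, len(self.offsets))

    def evict_oldest(self):
        """Called when the shared buffer overwrites the oldest record."""
        self.num_records -= 1

    def first_slot(self):
        """Slot of the oldest record still counted for this timeline."""
        return (self.current_slot - self.num_records + 1) % len(self.offsets)
```

For example, with four slots, three inserts leave the newest record in slot 2 and the oldest in slot 0; after one eviction the oldest counted record is in slot 1.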
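2. Persistent storage
The latest data is generally kept in memory, and memory uses row storage. For time series data, memory is managed first-in, first-out; when memory runs short or the offset list of a timeline is full, the flush process is started to write the old data to persistent storage.
For a given timeline, the number of records held in memory may be too small to reach the minimum required by columnar storage. The persistent storage medium therefore holds, besides the columnar file, a special Last file that keeps these records. Writing them directly to the columnar file would leave many data blocks with too few records, reducing compression and query efficiency.
The system checks the Last file every time records in memory are written to the persistent storage medium. For a given timeline, it looks up the number of records of the timeline in the Last file, adds it to the number of records in memory, and then decides as follows:
1. If the sum exceeds the minimum number of records required by columnar storage, all records in the Last file are read, merged with the records in memory, and written to columnar storage.
2. If the sum is below the minimum number of records required by columnar storage, the records in memory are written to the Last file.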
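3. Handling of the Last file
The Last file can have various storage formats, and the present invention proposes three: one reserves space and the other two do not.
3.1 Reserved-storage-space mode
Space is reserved for each timeline; its size is the minimum number of records N required by columnar storage multiplied by the record size.
FIG. 3 is a schematic diagram of a Last file in reserved-storage-space mode according to an embodiment of the present invention. As shown in FIG. 3, each timeline has a fixed structure, identified by the ID of the timeline, such as TS0 ID, TS1 ID, etc.
numOfRecords: the number of records in memory;
start Time, end time: the start and end time of the timeline in memory;
Record0, Record1, ..., RecordN: the space needed for N records in total, each of fixed size, which makes lookup fast.
The advantage of this mode is that merging records is simple. For the data of a single timeline it is a plain append, with few disk I/Os and fast reads and writes, but it consumes more storage space.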
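The size of the reserved space and the position of a fixed-size record can be worked out with simple arithmetic, sketched below in Python. The function names, the per-timeline header size, and the example values (`N = 100`, 16-byte records) are assumptions; only the figure of ten million timelines comes from the background section.

```python
def reserved_bytes(num_timelines, n, record_size, header_size=0):
    """Total bytes reserved in the Last file: one region per timeline."""
    return num_timelines * (header_size + n * record_size)


def record_offset(ts_index, i, n, record_size, header_size=0):
    """Byte offset of record i of a timeline, computed without any index."""
    region = header_size + n * record_size
    return ts_index * region + header_size + i * record_size


# Ten million timelines, hypothetical N = 100 and 16-byte records.
total = reserved_bytes(10_000_000, 100, 16)
```

Under these assumed values the reserved record space is 16,000,000,000 bytes. Because every record has the same size, the location of any record follows directly from the timeline index and the record index, which is the fast lookup the fixed-size layout is meant to provide.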
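3.2 No-reserved-storage-space mode
When in-memory data is written to persistent storage without reserving storage space, there are two ways to proceed:
Method 1: For a timeline, first read the saved records from the Last file and merge them with the records in memory. If the number of records is greater than N, write them to the columnar storage file; if it is less than N, write them to a new Last file. After all timelines are processed, delete the old Last file and keep only the new one. This Last file can be stored in columnar or row format: columnar storage speeds up analysis but slows writes, and row storage does the opposite.
This method has to rewrite the Last file every time in-memory data is persisted, which is inefficient. To improve efficiency, Method 2 can be used.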
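Method 2: For each timeline, maintain a data structure as shown in FIG. 4. Each timeline has a fixed structure, identified by the ID of the timeline, such as TS0 ID, TS1 ID, etc.
numOfRecords: the number of records in memory.
start Time, end time: the start and end time of the timeline in memory.
offset0, offset1, ..., offsetN: the offsets, that is, the offset of each record in storage.
This method does not need to rewrite the Last file. Adding records is mainly an append operation, so efficiency is high. However, when the number of records of a timeline exceeds N and they are written to the columnar storage file, holes are left in the Last file. An implementation needs to clean these up periodically to avoid wasting storage space.
Although the present invention has been described in detail above, it is not limited to this description, and those skilled in the art can make various modifications based on its principles. All modifications made according to the principles of the present invention should therefore be understood as falling within its scope of protection.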
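The periodic cleanup of holes is not specified in the source. One possible approach, sketched here in Python purely as an assumption, is to copy the records that are still referenced by an offset list into a fresh buffer and rebuild the offsets; the function name `compact` and the 4-byte records in the example are hypothetical.

```python
def compact(log, offsets, record_size):
    """Copy live records into a new buffer and return it with new offsets.

    log: bytes of the Last file body; offsets: timeline id -> list of
    byte offsets of the records that are still live.
    """
    new_log = bytearray()
    new_offsets = {}
    for ts_id, offs in offsets.items():
        moved = []
        for off in offs:
            moved.append(len(new_log))            # new position of the record
            new_log += log[off:off + record_size]
        new_offsets[ts_id] = moved
    return bytes(new_log), new_offsets


# Records at offsets 0 and 8 belonged to a timeline that moved to the
# columnar file, leaving holes; only TS1's records are still live.
old_log = b"AAAA" + b"BBBB" + b"CCCC" + b"DDDD"
new_log, new_offsets = compact(old_log, {"TS1": [4, 12]}, 4)
```

Running such a pass only occasionally keeps the common path an append, at the price of temporarily wasted space between passes.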

Claims (10)

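  1. A method for storing time series data, characterized in that the method comprises:
    caching the time series data records of each timeline received from the network in memory in a row storage format;
    when the time series data records of a timeline cached in the memory need to be flushed to disk, determining the sum of the number of records of the timeline to be flushed and the number of time series data records in a Last file;
    if the sum for the timeline is less than a predetermined number of data records N, writing the records of the timeline to be flushed into the Last file;
    if the sum for the timeline is greater than or equal to the predetermined number of data records N, merging the records of the timeline to be flushed with the records in the Last file and writing them to a data file in a columnar storage format.
  2. The method according to claim 1, characterized in that determining that the time series data records of a timeline cached in the memory need to be flushed to disk comprises:
    checking the memory, or the offset list of a timeline cached in the memory;
    if the memory is insufficient or the offset list of that timeline is full, determining that the time series data records of the timeline cached in the memory need to be flushed to disk.
  3. The method according to claim 1, characterized in that the Last file reserves, for each timeline, storage space for N time series data records of that timeline.
  4. The method according to claim 3, characterized in that writing the records of the timeline to be flushed into the Last file comprises:
    appending the records of the timeline to be flushed after all the time series data records already in the storage space of the timeline.
  5. The method according to claim 1, characterized in that the Last file has an offset list containing N offset records, the offset records indicating the offsets of the time series data records of the timeline.
  6. The method according to claim 5, characterized in that, after the records of the timeline to be flushed are written into the Last file, the offsets of those records are appended in order to the offset list of the timeline.
  7. The method according to claim 1, characterized in that the Last file contains previously written time series data records of the timeline.
  8. The method according to claim 7, characterized in that writing the records of the timeline to be flushed into the Last file comprises:
    merging the records of the timeline to be flushed with the records of the timeline read from the Last file to obtain the merged time series data records of the timeline;
    creating a new Last file for storing the merged records of the timeline;
    writing the merged records of the timeline into the new Last file, and then deleting the original Last file of the timeline.
  9. The method according to claim 1, characterized in that merging the records of the timeline to be flushed with the records in the Last file and writing them to the data file in a columnar storage format comprises:
    reading the time series data records of the timeline from the Last file;
    merging the records of the timeline to be flushed with the records of the timeline read from the Last file to obtain the merged time series data records of the timeline;
    writing the merged records of the timeline to the data file in a columnar storage format.
  10. The method according to any one of claims 1 to 9, characterized in that both the Last file and the data file are files in the persistent storage medium for storing time series data records.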
PCT/CN2020/137378 2020-12-17 2020-12-17 Method for storing time series data WO2022126551A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US18/027,716 US20230385254A1 (en) 2020-12-17 2020-12-17 Method for storing time series data
EP20965567.9A EP4266187A1 (en) 2020-12-17 2020-12-17 Method for storing time series data
PCT/CN2020/137378 WO2022126551A1 (zh) 2020-12-17 2020-12-17 Method for storing time series data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/137378 WO2022126551A1 (zh) 2020-12-17 2020-12-17 Method for storing time series data

Publications (1)

Publication Number Publication Date
WO2022126551A1 (zh) 2022-06-23

Family

ID=82059936

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137378 WO2022126551A1 (zh) 2020-12-17 2020-12-17 Method for storing time series data

Country Status (3)

Country Link
US (1) US20230385254A1 (zh)
EP (1) EP4266187A1 (zh)
WO (1) WO2022126551A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077405A (zh) * 2014-07-08 2014-10-01 国家电网公司 Time series type data access method
CN108182244A (zh) * 2017-12-28 2018-06-19 清华大学 Time series data storage method based on a multi-level columnar storage structure
CN108563711A (zh) * 2018-03-28 2018-09-21 山东昭元信息科技有限公司 Time series data storage method based on time nodes
US20200372071A1 (en) * 2019-05-24 2020-11-26 Hydrolix Inc. Efficient and scalable time-series data storage and retrieval over a network
CN112181973A (zh) * 2019-07-01 2021-01-05 北京涛思数据科技有限公司 Method for storing time series data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
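CN116501267A (zh) * 2023-06-27 2023-07-28 苏州浪潮智能科技有限公司 Method and device for controlling a redundant array of independent disks card
CN116501267B (zh) * 2023-06-27 2023-09-29 苏州浪潮智能科技有限公司 Method and device for controlling a redundant array of independent disks card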

Also Published As

Publication number Publication date
EP4266187A1 (en) 2023-10-25
US20230385254A1 (en) 2023-11-30

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20965567; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2020965567; Country of ref document: EP; Effective date: 20230717)