CN108021650A

CN108021650A - An efficient storage and reading system for time series data

Info

Publication number: CN108021650A
Application number: CN201711240991.0A
Authority: CN
Inventors: 徐化岩; 李勇
Original assignee: Automation Research and Design Institute of Metallurgical Industry
Current assignee: Automation Research and Design Institute of Metallurgical Industry
Priority date: 2017-11-30
Filing date: 2017-11-30
Publication date: 2018-05-11

Abstract

An efficient storage and reading system for time series data, belonging to the technical field of real-time database systems. The system comprises one or more networked computers, which form the hardware platform. The system software runs on these computers and comprises a data writing module, a data compression module, and a data reading module. The data writing module receives new data and writes it to both a memory cache and a log file. The data compression module compresses the data in the log files into data files using the compression algorithms and index structure designed in this invention. The reading module responds to read requests and returns results after combining the query results from the memory cache and the data files. The advantages are that, compared with a relational database, the system occupies less disk space and reads and writes faster: compared with a MySQL database, the compressed data occupies only 35% of the disk space, writing speed is improved 3-fold, and reading speed is improved 20-fold.

Description

An efficient storage and reading system for time series data

Technical field

The invention belongs to the technical field of real-time database systems, and in particular relates to an efficient storage and reading system for time series data.

Background art

Time series data is data that carries a time tag and is ordered by time. It is mainly produced by real-time monitoring, inspection, and analysis equipment in industries such as electric power, chemicals, and metallurgy. Typical characteristics of this industrial data are: a high generation frequency (each monitoring point can produce multiple records per second); a strong dependence on acquisition time (each record must correspond to a unique time); and many measuring points with a large data volume (a conventional real-time monitoring system has thousands of monitoring points, each producing data every second, for tens of GB of data per day).

At present, time series data is often stored and processed with a relational database, but the inherent disadvantages of relational databases prevent efficient storage and querying of such data. There is therefore an urgent need for an efficient storage and reading system optimized specifically for time series data.

Summary of the invention

The object of the invention is to provide an efficient storage and reading system for time series data that solves the problems of efficiently compressing, writing, and reading time series data of all types.

The system of the invention comprises one or more networked computers, which form the hardware platform of the system. The system software runs on the computers and comprises a data writing module, a data compression module, and a data reading module. The data writing module is responsible for receiving new data and writing it to both a memory cache and a log file. The data compression module is responsible for compressing the data in the log files into data files according to the compression algorithms and index structure designed in this invention. The reading module responds to read requests and returns results after combining the query results from the memory cache and the data files.

The invention designs dedicated compression methods for each type of time series data. Time series data comprises five data types: integer, floating-point number, Boolean, character string, and timestamp. The compression method designed for each of the five types is as follows:

Integer compression: the first integer is not compressed. Starting from the second integer, the difference from the previous number is computed and ZigZag-encoded (ZigZag encoding was first proposed by Google in the protocol-buffers protocol), which turns negative differences into positive numbers. The differences are then compressed with the simple8b algorithm (from the paper: Anh and Moffat, "Index compression using 64-bit words", Softw. Pract. Exper. 2010; 40:131-147).
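The delta and ZigZag steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the deltas are assumed to fit in a signed 64-bit integer, and the final simple8b packing step is left out.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def encode_integers(values):
    """Return (first_value, zigzag_deltas); the first value is kept uncompressed."""
    if not values:
        return None, []
    deltas = [zigzag_encode(cur - prev) for prev, cur in zip(values, values[1:])]
    return values[0], deltas


def decode_integers(first, deltas):
    """Rebuild the original integers from the first value and the encoded deltas."""
    if first is None:
        return []
    out = [first]
    for z in deltas:
        out.append(out[-1] + zigzag_decode(z))
    return out
```

Because slowly changing measurements yield small deltas, the ZigZag outputs are small non-negative numbers, which is the input shape that a word-packing scheme such as simple8b handles well.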

Floating-point compression: the first floating-point number is not compressed. Starting from the second, each number is XORed with the previous number to obtain a difference. When two floating-point values are close, the resulting difference is very small. When the difference is 0, only a single 0 bit is stored. When it is nonzero, a single 1 bit is stored, followed by 5 bits holding the number of zeros at the left end of the 64-bit difference, 6 bits holding the number of zeros at the right end, and then the extracted nonzero bits.
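A minimal sketch of this XOR scheme is shown below. It makes several assumptions that the text does not state: the bit stream is modelled as a string of '0' and '1' characters for readability, the flag is read as one bit (0 for an unchanged value, 1 otherwise), and the leading-zero count is capped at 31 so that it fits in 5 bits. All function names are hypothetical.

```python
import struct


def float_to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def bits_to_float(b: int) -> float:
    return struct.unpack(">d", struct.pack(">Q", b))[0]


def encode_floats(values):
    """Encode floats as a string of bits; the first value is stored in full."""
    if not values:
        return ""
    prev = float_to_bits(values[0])
    out = [format(prev, "064b")]
    for v in values[1:]:
        cur = float_to_bits(v)
        xor = prev ^ cur
        if xor == 0:
            out.append("0")  # value unchanged: a single 0 bit
        else:
            leading = min(64 - xor.bit_length(), 31)  # capped to fit 5 bits
            trailing = (xor & -xor).bit_length() - 1  # zeros at the right end
            width = 64 - leading - trailing
            out.append("1")
            out.append(format(leading, "05b"))
            out.append(format(trailing, "06b"))
            out.append(format(xor >> trailing, f"0{width}b"))
        prev = cur
    return "".join(out)


def decode_floats(bits, count):
    """Decode `count` floats from a bit string produced by encode_floats."""
    if count == 0:
        return []
    prev = int(bits[:64], 2)
    pos = 64
    out = [bits_to_float(prev)]
    for _ in range(count - 1):
        flag = bits[pos]
        pos += 1
        if flag == "1":
            leading = int(bits[pos:pos + 5], 2)
            trailing = int(bits[pos + 5:pos + 11], 2)
            pos += 11
            width = 64 - leading - trailing
            prev ^= int(bits[pos:pos + width], 2) << trailing
            pos += width
        out.append(bits_to_float(prev))
    return out
```

A run of identical readings costs one bit per value after the first, which is where most of the saving comes from for slowly changing sensors.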

Boolean compression: each Boolean value is stored directly in 1 bit, so each 64-bit unsigned integer can store 64 Boolean values.
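The bit packing can be sketched as follows. The bit order within each word (least significant bit first) and the function names are assumptions made for illustration; the value count has to be stored separately so that padding bits in the last word are ignored.

```python
def pack_booleans(flags):
    """Pack Booleans into 64-bit unsigned integers, 64 values per word."""
    words = []
    for start in range(0, len(flags), 64):
        word = 0
        for offset, flag in enumerate(flags[start:start + 64]):
            if flag:
                word |= 1 << offset
        words.append(word)
    return words


def unpack_booleans(words, count):
    """Recover the first `count` Booleans from the packed words."""
    return [bool((words[i // 64] >> (i % 64)) & 1) for i in range(count)]
```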

String compression: the strings are appended in order to a byte stream, which is then compressed with the snappy algorithm (an open-source algorithm provided by Google at http://google.github.io/snappy/).

Timestamp compression: the first timestamp is not compressed. Starting from the second timestamp, the difference from the previous number is computed; the first difference is not compressed. Starting from the third number, the difference of the differences is computed. If the difference of the differences is 0 (that is, the data is stored at a constant interval), only a 0 and the number of consecutive 0s are stored; otherwise, the difference of the differences is stored using simple8b.
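The delta-of-delta scheme for timestamps can be sketched as follows. This is an illustration under assumptions: the token representation (`"zero_run"` and `"dod"` tuples) and the function names are hypothetical, and nonzero differences are kept as plain integers instead of being packed with simple8b.

```python
def encode_timestamps(timestamps):
    """Return (first, first_delta, tokens).

    tokens holds ("zero_run", n) for n consecutive zero delta-of-deltas
    and ("dod", v) for a nonzero delta-of-delta v.
    """
    if not timestamps:
        return None, None, []
    first = timestamps[0]
    if len(timestamps) == 1:
        return first, None, []
    first_delta = timestamps[1] - timestamps[0]
    tokens = []
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        dod = delta - prev_delta
        if dod == 0:
            if tokens and tokens[-1][0] == "zero_run":
                tokens[-1] = ("zero_run", tokens[-1][1] + 1)  # extend the run
            else:
                tokens.append(("zero_run", 1))
        else:
            tokens.append(("dod", dod))
        prev_delta = delta
    return first, first_delta, tokens


def decode_timestamps(first, first_delta, tokens):
    """Rebuild the timestamps from the output of encode_timestamps."""
    if first is None:
        return []
    out = [first]
    if first_delta is None:
        return out
    out.append(first + first_delta)
    delta = first_delta
    for kind, value in tokens:
        if kind == "zero_run":
            for _ in range(value):
                out.append(out[-1] + delta)
        else:
            delta += value
            out.append(out[-1] + delta)
    return out
```

For a point sampled at a fixed interval, the whole series after the first two timestamps collapses into a single zero-run token.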

A memory cache and log files ensure efficient writing. Sequential disk writes are fast and random writes are slow (because of seek time and rotational latency), and time series data is characterized by large volumes of data being continuously collected and written. To improve write efficiency, a batch of data (generally 5,000 to 10,000 points) is encoded as a byte stream in the order point name, number of points, timestamp, value, timestamp, value, ..., then compressed with the snappy algorithm and written to the log file. At the same time, the data written to the log file is stored in the memory cache area as structures of point name, timestamp, and value. The memory cache is kept synchronized with the log file; its purpose is to serve, in place of the log file, efficient reads of recent data. The log file size is fixed, and a new log file is generated automatically when the prescribed size is reached.
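The write path can be sketched as below. Everything beyond the field order is an assumption made for illustration: the class and function names, the fixed-width binary layout, float-only values, the per-record length prefix, and the log file naming are hypothetical. The standard-library `zlib` is used as a stand-in for snappy so that the sketch runs without third-party packages.

```python
import os
import struct
import zlib  # stand-in for snappy, which is not in the standard library
from collections import defaultdict


def encode_batch(point_name, samples):
    """Encode point name, sample count, then (timestamp, value) pairs in order."""
    name = point_name.encode("utf-8")
    parts = [struct.pack("<H", len(name)), name, struct.pack("<I", len(samples))]
    for timestamp, value in samples:
        parts.append(struct.pack("<qd", timestamp, value))
    return b"".join(parts)


def decode_batch(raw):
    """Inverse of encode_batch; returns (point_name, samples)."""
    (name_len,) = struct.unpack_from("<H", raw, 0)
    name = raw[2:2 + name_len].decode("utf-8")
    (count,) = struct.unpack_from("<I", raw, 2 + name_len)
    pos = 6 + name_len
    samples = [struct.unpack_from("<qd", raw, pos + 16 * i) for i in range(count)]
    return name, samples


class LogWriter:
    """Append compressed batches to size-limited log files and mirror them in memory."""

    def __init__(self, directory, max_log_bytes=1 << 20):
        self.directory = directory
        self.max_log_bytes = max_log_bytes
        self.cache = defaultdict(list)  # point name -> [(timestamp, value), ...]
        self.log_index = 0

    def _log_path(self):
        return os.path.join(self.directory, f"log-{self.log_index:06d}.bin")

    def write(self, point_name, samples):
        payload = zlib.compress(encode_batch(point_name, samples))
        record = struct.pack("<I", len(payload)) + payload
        path = self._log_path()
        # Roll over to a new log file once the current one would exceed the limit.
        if os.path.exists(path) and os.path.getsize(path) + len(record) > self.max_log_bytes:
            self.log_index += 1
            path = self._log_path()
        with open(path, "ab") as log:  # append only, so disk writes stay sequential
            log.write(record)
        self.cache[point_name].extend(samples)
```

Appending whole compressed batches keeps disk access sequential, while the in-memory copy answers reads of recent data without touching the log.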

A dedicated data file structure is designed for time series data, together with a multi-level compression mechanism that compresses log files into data files for efficient reading and reduced disk usage. A data file contains multiple data blocks and one index block. A data block is the compressed byte stream obtained by running a group of time-ordered data of one point through the compression algorithm for that point's data type, as described above. The index block consists of the point name, number of data blocks, start time, end time, relative position, and byte count. Multi-level compression runs on a schedule: first, multiple log files are compressed into a level-1 data file; next, multiple level-1 data files are compressed into a level-2 data file; and so on, level by level, until the data file reaches the prescribed size. The system builds an in-memory index structure for each point, consisting of the point name, time range, and data file name, which is used to quickly locate the data file corresponding to a point's data for a given time period. When reading data, the system first determines whether to read from the memory cache holding recent data or from a data file, and which one; if reading from a data file, it uses the file's index structure to quickly locate the position of the data, thereby achieving efficient reads.
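The read-side decision can be sketched with the in-memory index described above. The names `IndexEntry`, `plan_read`, and `cache_start` (the earliest timestamp still held in the memory cache) are hypothetical, and a linear scan is used for clarity where a real system would likely keep entries sorted per point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One in-memory index record: point name, time range, data file name."""
    point_name: str
    start_time: int
    end_time: int
    file_name: str


def plan_read(index, cache_start, point_name, start, end):
    """Decide where to read the range [start, end] for one point.

    Returns (read_cache, file_names): whether the memory cache must be
    consulted, and which data files overlap the requested range.
    """
    read_cache = end >= cache_start
    file_names = []
    if start < cache_start:
        hits = [
            e for e in index
            if e.point_name == point_name
            and e.start_time <= end
            and e.end_time >= start
        ]
        file_names = [e.file_name for e in sorted(hits, key=lambda e: e.start_time)]
    return read_cache, file_names
```

A query that falls entirely inside the cached range never opens a data file, and an older query opens only the files whose time range overlaps it.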

The advantage of the invention is that, compared with a relational database, it occupies less disk space and reads and writes faster. First, compressed data takes up less disk space: compared with a MySQL database, disk usage is only 35%. Second, data is written faster: compared with a MySQL database, writing speed is improved 3-fold. Finally, data is read faster: compared with a MySQL database, reading speed is improved 20-fold.

Brief description of the drawings

Fig. 1 is the logical structure diagram of the system.

Fig. 2 is the data compression flow chart.

Fig. 3 is the structure diagram of the data file.

Embodiment

As shown in Fig. 1, the system comprises three data storage forms (memory cache, log file, and data file) and three data processing operations (write, compress, and read). On write, the memory cache and the log file are written synchronously. Log files are periodically compressed into data files, and data files are in turn progressively compressed into larger data files. On read, the system determines whether to read from the memory cache or from a data file; if reading from a data file, it uses the index in the file to quickly locate the data.

As shown in Fig. 2, time series data comprises a point name, a timestamp, and a value. When a group of time series data is written, each point is traversed and the data type of its value is determined to be integer, floating-point number, Boolean, or character string. The corresponding compression algorithm is called to compress the values, the timestamp compression algorithm is called to compress the timestamps, and the two results are merged into a byte stream and saved.

As shown in Fig. 3, a data file consists of a file header, a data block area, an index block, and a file tail. The file header has a fixed size and stores the version number of the system; the file tail has a fixed size and stores the position of the index block within the file. The data block area can store multiple data blocks, each produced by compression according to the flow in Fig. 2. The index block stores, in sequence, the point name, the data type of the point, the number of data blocks, and for each data block its start time, end time, starting position in the file, and byte count.
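The file layout can be sketched as follows. Only the ordering of the sections and the index fields comes from the description; the field widths, the `TSDF` magic bytes in the header, the numeric type codes, and all function names are assumptions made for illustration. The data blocks are treated as opaque, already compressed byte strings.

```python
import struct

HEADER = struct.Struct("<4sI")       # magic bytes (assumed) and version number
TAIL = struct.Struct("<Q")           # offset of the index block within the file
BLOCK_META = struct.Struct("<qqQI")  # start time, end time, offset, byte count


def build_data_file(version, points):
    """points maps name -> (type_code, [(start, end, block_bytes), ...])."""
    body = bytearray(HEADER.pack(b"TSDF", version))
    index = bytearray()
    for name, (type_code, blocks) in points.items():
        raw = name.encode("utf-8")
        index += struct.pack("<H", len(raw)) + raw
        index += struct.pack("<BI", type_code, len(blocks))
        for start, end, data in blocks:
            index += BLOCK_META.pack(start, end, len(body), len(data))
            body += data  # data block area grows as blocks are appended
    index_offset = len(body)
    return bytes(body + index + TAIL.pack(index_offset))


def read_index(blob):
    """Parse the index block, located through the fixed-size file tail."""
    tail_pos = len(blob) - TAIL.size
    (pos,) = TAIL.unpack_from(blob, tail_pos)
    index = {}
    while pos < tail_pos:
        (name_len,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        type_code, count = struct.unpack_from("<BI", blob, pos)
        pos += 5
        blocks = []
        for _ in range(count):
            blocks.append(BLOCK_META.unpack_from(blob, pos))
            pos += BLOCK_META.size
        index[name] = (type_code, blocks)
    return index


def read_block(blob, index, name, timestamp):
    """Return the compressed block of `name` that covers `timestamp`, if any."""
    for start, end, offset, length in index[name][1]:
        if start <= timestamp <= end:
            return blob[offset:offset + length]
    return None
```

Placing the index after the data blocks lets the writer stream blocks out in one pass, and the fixed-size tail lets a reader find the index with a single seek from the end of the file.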

With the above system as the core and the necessary calling interfaces provided, it can be used as a time series database and applied widely in factory process monitoring and the Internet of Things.

Claims

1. An efficient storage and reading system for time series data, characterized in that it comprises one or more networked computers forming the hardware platform of the system; a data writing module, a data compression module, and a data reading module run on the computers; the data writing module is responsible for receiving new data and writing it to both a memory cache and a log file; the data compression module is responsible for compressing the data in the log file into data files according to the designed compression algorithms and index structure; the reading module responds to read requests and returns results after combining the query results from the memory cache and the data files;

the time series data comprises five data types: integer, floating-point number, Boolean, character string, and timestamp, and the compression method designed for each of the five data types is as follows:

integer compression: the first integer is not compressed; starting from the second integer, the difference from the previous number is computed and ZigZag-encoded, which turns negative differences into positive numbers; the differences are then compressed with the simple8b algorithm;

floating-point compression: the first floating-point number is not compressed; starting from the second, each number is XORed with the previous number to obtain a difference; when two floating-point values are close, the resulting difference is very small; when the difference is 0, only a single 0 bit is stored; when it is nonzero, a single 1 bit is stored, followed by 5 bits holding the number of zeros at the left end of the 64-bit difference, 6 bits holding the number of zeros at the right end, and then the extracted nonzero bits;

Boolean compression: each Boolean value is stored directly in 1 bit, so each 64-bit unsigned integer can store 64 Boolean values;

string compression: the strings are appended in order to a byte stream, which is then compressed with the snappy algorithm;

timestamp compression: the first timestamp is not compressed; starting from the second timestamp, the difference from the previous number is computed, and the first difference is not compressed; starting from the third number, the difference of the differences is computed; when the difference of the differences is 0, only a 0 and the number of consecutive 0s are stored; otherwise, the difference of the differences is stored using simple8b;

a memory cache and a log file are used to ensure efficient writing.