CN113078908A

CN113078908A - Simple encoding and decoding method suitable for time sequence database

Info

Publication number: CN113078908A
Application number: CN202110259307.3A
Authority: CN
Inventors: 黄励博
Original assignee: Hangzhou Upyun Technology Co ltd
Current assignee: Hangzhou Upyun Technology Co ltd
Priority date: 2021-03-10
Filing date: 2021-03-10
Publication date: 2021-07-06
Anticipated expiration: 2041-03-10
Also published as: CN113078908B

Abstract

The invention discloses a simple encoding and decoding method suitable for a time series database, comprising the following steps: 1) identifying the numerical values of the time series and determining the data type, proceeding to step 2) if the values are floating-point numbers and to step 4) if they are integers; 2) integer conversion; 3) floating-point number compression; 4) if the integer value of the time series is a timestamp, performing timestamp compression, and if it is not a timestamp, performing integer compression. The method is used for storing and accessing time series monitoring data in a specific scenario; where floating-point and integer types are mixed, floating-point compression is optimized by a strategy of converting floating-point values to integers. In that specific usage scenario the invention compresses better than general lossless compression algorithms. The algorithm is simple and easy to process, and handles large volumes of input data more conveniently.

Description

Simple encoding and decoding method suitable for time sequence database

Technical Field

The invention relates to the technical field of encoding and decoding for time series databases, and in particular to a simple encoding and decoding method suitable for a time series database.

Background

A time series database is a kind of database dedicated to storing data that changes over time, such as sensor data and machine monitoring data, and is suited to scenarios such as the Internet of Things, the industrial internet, and operation and maintenance monitoring.

Here the time series database is used to store data center monitoring metrics. Each record combines a timestamp, a metric, and a numerical value; records are written in time order, and flexible, diverse aggregation queries are provided. Metric data is written vertically and read horizontally: many metrics are collected and written to the database in time order, while reads fetch one or several metrics over a specified time range.

Because of its huge data volume, a time series database generally compresses the data it stores. Compression reduces redundancy and saves storage space, but the data must remain convenient to retrieve. Different data formats therefore require different compression schemes, and both compression complexity and query speed must be considered.

Disclosure of Invention

The invention aims to provide a simple coding and decoding method suitable for a time sequence database so as to store and access time sequence monitoring data of a specific scene.

A simple coding and decoding method suitable for a time sequence database comprises the following steps:

1) identifying the numerical values of the time sequence, determining the type of the data, and if the numerical values are floating point numbers, entering a step 2), and if the numerical values are integer, entering a step 4);

2) integer conversion:

the integer conversion specifically comprises:

2a) taking the floating-point number as the multiplier and, in turn, each of [1, 10, 100, 1000, 10000, 100000, 1000000] as the multiplicand to obtain a first result (result O); if the fractional part of the first result is 0, the conversion succeeds and the multiplicand and the integer part I are returned, completing the integer conversion;

2b) if the fractional part of the first result is not 0, converting the first result to an integer according to the standard;

2c) if the previous floating-point number corresponding to the integer obtained in step 2b) is the same as the integer part I of the first result, the conversion succeeds and the multiplicand and the integer part I are returned, completing the integer conversion;

2d) if the next floating-point number corresponding to the integer obtained in step 2b) is the same as the integer part I + 1 of the first result, the conversion succeeds and the multiplicand and the integer part I + 1 are returned;

2e) if the integer conversion fails, entering step 3);

2f) if the integer conversion is successful, entering the step 4);

3) compression of floating point number:

4) if the numerical value of the time sequence is the time stamp, compressing the time stamp, and if the integer numerical value of the time sequence is the non-time stamp, performing integer compression;

In step 2b), the first result is converted into a 64-bit integer according to the IEEE 754 standard for binary floating-point arithmetic.
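The integer conversion of step 2) can be illustrated with a minimal Python sketch. It assumes that the "previous" and "next" floating-point numbers are the adjacent representable doubles, obtained by subtracting or adding 1 to the 64-bit pattern, and it handles negative inputs by working on the absolute value; neither detail is stated in the text. The names `MULTIPLICANDS` and `to_scaled_int` are illustrative only.

```python
import math
import struct

# The seven multiplicands tried in turn (step 2a).
MULTIPLICANDS = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]


def _bits(x: float) -> int:
    """IEEE 754 bit pattern of a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def _from_bits(b: int) -> float:
    """Inverse of _bits."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def to_scaled_int(value: float):
    """Return (multiplicand, integer) on success, or None on failure."""
    if not math.isfinite(value):
        return None
    sign = -1 if value < 0 else 1          # assumption: sign handled separately
    mag = abs(value)
    for b in MULTIPLICANDS:
        o = mag * b                        # result O
        if o >= 2 ** 63:                   # would not fit a signed 64-bit integer
            return None
        i = int(o)                         # integer part I
        if o == i:                         # 2a) fractional part is 0
            return b, sign * i
        raw = _bits(o)                     # 2b) 64-bit integer form of O
        if _from_bits(raw - 1) == i:       # 2c) previous float equals I
            return b, sign * i
        if _from_bits(raw + 1) == i + 1:   # 2d) next float equals I + 1
            return b, sign * (i + 1)
    return None                            # 2e) fall back to float compression


print(to_scaled_int(1.1))      # 1.1 * 10 lands one step above 11.0
print(to_scaled_int(math.pi))  # no multiplicand works
```

The neighbour check is what lets a value such as 1.1 succeed: the product 1.1 * 10 is not exactly 11, but the double immediately below it is.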

In step 3), floating point number compression specifically includes:

3a) converting the floating-point number into a 64-bit integer according to the standard;

3b) if the floating-point number is the first value in the time series, writing an identifier indicating a floating-point number and storing the data as 64 bits;

3c) if the previous value in the time series is an integer and the current value is a floating-point number, storing an identifier to represent that a change occurs, storing an identifier to represent no repetition, and storing an identifier to represent a floating-point number;

3d) if the previous value in the time series is a floating-point number with the same value as the current one, storing an identifier to represent that no change occurs, and storing an identifier to represent repetition;

3e) if the previous value in the time series is a floating-point number whose value differs from the current one, storing an identifier to represent that no change occurs, computing the exclusive-or (XOR) of the 64-bit forms of the current and previous floating-point numbers, and comparing the leading and trailing zeros of this XOR value with those of the previous XOR value (0 if there is none); if the number of leading zeros PLZ and the number of trailing zeros PTZ of the previous XOR value are less than or equal to the number of leading zeros CLZ and the number of trailing zeros CTZ of the current XOR value, respectively, storing the significant value of (64 - PLZ - PTZ) bits;

3f) otherwise, that is, if PLZ is greater than CLZ or PTZ is greater than CTZ, storing the significant value of (64 - CLZ - CTZ) bits.

In step 3a), the floating-point number is converted into a 64-bit integer according to the IEEE 754 standard for binary floating-point arithmetic.

In step 3f), storing the significant value of (64 - CLZ - CTZ) bits, where CLZ is the number of leading zeros and CTZ the number of trailing zeros of the current XOR value, specifically includes:

first storing 6 bits (2^6 = 64) to represent CLZ, then storing 6 bits to represent 64 - CLZ - CTZ, and finally storing the significant data of 64 - CLZ - CTZ bits.
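The choice between steps 3e) and 3f) can be sketched as follows. This is a hedged illustration, not the patented implementation: the function name `xor_block`, its return format, and the convention that "no previous XOR value" is passed as PLZ = PTZ = 64 (the zero counts of an all-zero value) are assumptions made for the example.

```python
import struct


def float_bits(x: float) -> int:
    """IEEE 754 bit pattern of a double as an unsigned 64-bit integer (step 3a)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def leading_zeros(v: int) -> int:
    return 64 - v.bit_length()


def trailing_zeros(v: int) -> int:
    return (v & -v).bit_length() - 1 if v else 64


def xor_block(prev: float, cur: float, plz: int, ptz: int):
    """Decide how to store cur relative to prev (values are assumed to differ).

    Returns (mode, leading_zeros, trailing_zeros, significant_bits).
    """
    x = float_bits(prev) ^ float_bits(cur)
    clz, ctz = leading_zeros(x), trailing_zeros(x)
    if x != 0 and plz <= clz and ptz <= ctz:
        # Step 3e: the previous window still covers every set bit,
        # so only 64 - PLZ - PTZ bits are written.
        width = 64 - plz - ptz
        return "reuse", plz, ptz, (x >> ptz) & ((1 << width) - 1)
    # Step 3f: a new window of 64 - CLZ - CTZ bits is needed.
    return "new", clz, ctz, x >> ctz


# 12.0 -> 24.0 -> 15.0 -> ... : track the window across successive values.
print(xor_block(12.0, 24.0, 64, 64))   # ('new', 11, 52, 1)
print(xor_block(24.0, 15.0, 11, 52))   # ('new', 11, 49, 11)
print(xor_block(24.0, 12.0, 11, 49))   # ('reuse', 11, 49, 8)
```

Reusing the previous window avoids writing the two 6-bit length fields again, at the cost of a few extra zero bits inside the window.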

In step 4), compressing the timestamp specifically includes:

4a) defining the time unit, one of nanosecond, microsecond, millisecond, and second, which represents the smallest time the time series can distinguish; the finer the unit, the more data must be stored;

4b) storing the first timestamp in the time series as a value converted to the fixed time unit;

4c) starting from the second timestamp in the time series, calculating the difference from the previous timestamp, then the difference between that difference and the previous one (0 if there is none), and converting the result to the fixed time unit; if the unit is second or millisecond the result is at most 32 bits, and if the unit is nanosecond or microsecond it is at most 64 bits; finally, according to the size of the result, selecting a suitable entry of the byte queue [0,7,9,12,32] or [0,7,9,12,64] for storage, and distinguishing the different byte queues by prefix bits.
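A minimal sketch of step 4c) follows. It makes several assumptions that the text does not spell out: the queue entries are treated as bit widths of a signed (two's complement) value, the prefix codes "0", "10", "110", "1110", "1111" are borrowed from common delta-of-delta encoders, and the first timestamp is written as a plain 64-bit field. The names `encode_dod` and `encode_timestamps` are illustrative.

```python
def encode_dod(dod: int, wide: bool = False) -> str:
    """Encode one delta-of-delta as a bit string.

    wide=False uses the queue [0, 7, 9, 12, 32] (second or millisecond units),
    wide=True uses [0, 7, 9, 12, 64] (microsecond or nanosecond units).
    """
    if dod == 0:
        return "0"                              # width 0: a single prefix bit
    widths = [7, 9, 12, 64 if wide else 32]
    prefixes = ["10", "110", "1110", "1111"]    # assumed prefix codes
    for prefix, w in zip(prefixes, widths):
        if -(1 << (w - 1)) <= dod < (1 << (w - 1)):
            # Two's complement representation in w bits.
            return prefix + format(dod & ((1 << w) - 1), f"0{w}b")
    raise ValueError("delta-of-delta does not fit the widest entry")


def encode_timestamps(timestamps, wide: bool = False):
    """Return one bit string per timestamp (already in the fixed time unit)."""
    out = [format(timestamps[0], "064b")]       # 4b) first timestamp, assumed 64 bits
    prev, prev_delta = timestamps[0], 0         # previous difference is 0 if none
    for t in timestamps[1:]:
        delta = t - prev
        out.append(encode_dod(delta - prev_delta, wide))
        prev, prev_delta = t, delta
    return out


# Samples 60 s apart, with the last one arriving one second late.
print(encode_timestamps([1000, 1060, 1120, 1181])[1:])
# ['100111100', '0', '100000001']
```

With a regular sampling interval almost every delta-of-delta is 0, so most timestamps cost a single bit.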

Integer compression, specifically including:

4l) if the current integer is the first value of the time series, writing an identifier indicating an integer, then storing an identifier marking the multiplicand, and distinguishing positive from negative numbers with an identifier;

then determining the number of leading zeros CLZ of the absolute value of the current integer; if 64-CLZ is consistent with 64-PLZ (PLZ being the number of leading zeros of the previous integer), storing an identifier to indicate consistency, and if it is inconsistent, storing the value 64-CLZ; then storing the significant value of 64-CLZ bits;

4m) if the previous value in the time series is also an integer, the multiplier is the same, and the difference D is 0, storing an identifier to represent that a change occurs, and recording an identifier to represent a repeated integer;

4n) if the previous value in the time series is an integer but the multiplier is not the same or the difference D is not 0, distinguishing positive from negative numbers with an identifier; if the multiplicand has not increased and the number of bits 64-CLZ of the significant value is the same as for the previously calculated difference, storing an identifier to represent no change, then storing an identifier to distinguish positive from negative numbers, and finally storing the significant value of 64-CLZ bits;

4o) if the previous value in the time series is not an integer, distinguishing positive from negative numbers with an identifier, storing an identifier to represent that a change occurs, storing an identifier to represent no repetition, and determining the number of leading zeros CLZ of the absolute value of the current integer; if 64-CLZ is consistent with 64-PLZ (PLZ being the number of leading zeros of the previous integer), storing an identifier to indicate consistency, and if it is inconsistent, storing the value 64-CLZ; then storing the significant value of 64-CLZ bits.
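The layout of the first integer in a series (step 4l) can be sketched as a bit string. The detailed description gives a 0 bit for "integer" and three bits for the multiplicand; everything else here is an assumption made for illustration: the sign bit uses 1 for negative, the width 64-CLZ is written in a fixed 7-bit field (it ranges from 0 to 64), and the names `encode_first_integer` and `decode_first_integer` are hypothetical.

```python
def encode_first_integer(mult_index: int, value: int) -> str:
    """Encode the first integer of a series.

    mult_index selects one of the 7 multiplicands (0 for 1, 1 for 10, ...).
    """
    magnitude = abs(value)
    width = magnitude.bit_length()              # equals 64 - CLZ for a 64-bit value
    sign = "1" if value < 0 else "0"            # assumed sign convention
    body = format(magnitude, f"0{width}b") if width else ""
    return (
        "0"                                     # type identifier: integer
        + format(mult_index, "03b")             # 3 bits mark the multiplicand
        + sign
        + format(width, "07b")                  # assumed 7-bit field for 64 - CLZ
        + body                                  # significant value, 64 - CLZ bits
    )


def decode_first_integer(bits: str):
    """Inverse of encode_first_integer; returns (mult_index, value)."""
    assert bits[0] == "0", "not an integer block"
    mult_index = int(bits[1:4], 2)
    negative = bits[4] == "1"
    width = int(bits[5:12], 2)
    magnitude = int(bits[12:12 + width], 2) if width else 0
    return mult_index, -magnitude if negative else magnitude


encoded = encode_first_integer(1, 11)           # 1.1 stored as 11 with multiplicand 10
print(encoded, len(encoded))                    # 0001000001001011 16
print(decode_first_integer(encoded))            # (1, 11)
```

Only the significant bits of the magnitude are written, so small monitoring values occupy far fewer than 64 bits.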

Compared with the prior art, the invention has the following advantages:

1) Where floating-point and integer types are mixed, floating-point compression is optimized by the strategy of converting floating-point values to integers.

2) In a specific usage scenario the invention compresses better than general lossless compression algorithms. For example, when collecting monitoring metrics it can reach 1.66 bytes per data point, better than the 2 to 5 bytes of traditional related compression algorithms.

3) The algorithm is simple and easy to process, uses less CPU than traditional compression algorithms, and handles large volumes of input data more conveniently. The invention can also quickly read one or several metrics over a specified time range.

Drawings

FIG. 1 is a flow chart illustrating a simplified encoding and decoding method for a time series database according to the present invention.

Detailed Description

As shown in fig. 1, a simple encoding and decoding method suitable for a time series database includes the following steps:

2) integer conversion:

2a) taking the floating-point number as the multiplier and, in turn, each of [1, 10, 100, 1000, 10000, 100000, 1000000] as the multiplicand to obtain a first result (result O); if the fractional part of the first result is 0, the conversion succeeds and the multiplicand B and the integer part I are returned, completing the integer conversion;

2c) if the previous floating-point number corresponding to the integer obtained in step 2b) is the same as the integer part I of the first result, the conversion succeeds and the multiplicand B and the integer part I are returned, completing the integer conversion;

2d) if the next floating-point number corresponding to the integer obtained in step 2b) is the same as the integer part I + 1 of the first result, the conversion succeeds and the multiplicand B and the integer part I + 1 are returned.

2e) If the integer conversion fails, entering step 3);

2f) if the integer conversion is successful, entering the step 4);

In step 2b), the first result is converted into a 64-bit integer according to the IEEE 754 standard for binary floating-point arithmetic.

3) Compression of floating point number:

3a) the floating point number is converted into a 64-bit integer according to the IEEE 754 binary floating point number arithmetic standard;

3e) if the previous value in the time series is a floating-point number whose value differs from the current one, storing an identifier to represent that no change occurs, computing the exclusive-or (XOR) of the 64-bit forms of the current and previous floating-point numbers, and comparing the leading and trailing zeros of this XOR value with those of the previous XOR value (0 if there is none); if the number of leading zeros PLZ and the number of trailing zeros PTZ of the previous XOR value are less than or equal to the number of leading zeros CLZ and the number of trailing zeros CTZ of the current XOR value, respectively, storing the significant value of (64 - PLZ - PTZ) bits.

3f) Otherwise, that is, if PLZ is greater than CLZ or PTZ is greater than CTZ, the significant value of (64 - CLZ - CTZ) bits needs to be stored:

storing 6 bits (2^6 = 64) to represent CLZ, then 6 bits to represent 64 - CLZ - CTZ, and finally the significant data of 64 - CLZ - CTZ bits.

the timestamp compression specifically comprises the following steps:

Integer compression, specifically including:

Specifically, the method comprises the following steps:

1) Before compressing the data, a compression period is selected in units of hours, with a minimum of 1 hour and a maximum of 24 hours.

2) Timestamp compression algorithm.

a) Define the time unit, one of nanoseconds, microseconds, milliseconds, and seconds, which represents the smallest time the time series can distinguish; the finer the unit, the more data must be stored.

b) Store the first timestamp as a value converted to the fixed time unit.

c) For each later timestamp, calculate the difference from the previous timestamp, then the difference between that difference and the previous one (0 if there is none), and convert the result to the fixed time unit. Note that, because of the compression period, this value is at most 32 bits if the unit is seconds or milliseconds, and at most 64 bits if the unit is nanoseconds or microseconds. Finally, the value is stored using a suitable entry of the byte queue [0,7,9,12,32] or [0,7,9,12,64], chosen according to its size. The different byte queues must of course be distinguished by prefix bits.
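The 32-bit and 64-bit bounds follow from the maximum compression period of 24 hours, which a short calculation can confirm. The helper name `bits_for_period` and the unit table are illustrative; the figures themselves are plain arithmetic.

```python
UNITS_PER_SECOND = {
    "second": 1,
    "millisecond": 10 ** 3,
    "microsecond": 10 ** 6,
    "nanosecond": 10 ** 9,
}


def bits_for_period(hours: int, unit: str) -> int:
    """Bits needed to hold the largest time span inside one compression period."""
    span = hours * 3600 * UNITS_PER_SECOND[unit]
    return span.bit_length()


for unit in UNITS_PER_SECOND:
    print(unit, bits_for_period(24, unit))
# second 17
# millisecond 27
# microsecond 37
# nanosecond 47
```

Even with one extra bit for the sign, a 24-hour span fits in 32 bits for seconds and milliseconds, while microseconds and nanoseconds exceed 32 bits and need the 64-bit entry.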

3) Numerical compression algorithm.

a) Integer conversion: take the floating-point number as the multiplier and, in turn, each of [1, 10, 100, 1000, 10000, 100000, 1000000] as the multiplicand. If the fractional part of the result O is 0, the conversion succeeds and the multiplicand B and the integer part I are returned. Otherwise, the result O is converted into a 64-bit integer according to the IEEE 754 standard for binary floating-point arithmetic, and if the previous floating-point number corresponding to this integer is the same as the integer part I of the result O, the conversion succeeds and the multiplicand B and the integer part I are returned. Similarly, if the next floating-point number corresponding to this integer is the same as I + 1, the conversion succeeds and the multiplicand B and I + 1 are returned. Otherwise, the floating-point number is processed by the floating-point number compression algorithm in 4).

b) If it is the first integer, a bit 0x0 is written to indicate an integer, and three bits are stored to mark the multiplicand (since the multiplicand has only 7 options). One bit is stored to distinguish positive from negative numbers. The number of leading zeros CLZ of the absolute value is determined, and 64-CLZ is stored in compressed form depending on whether it is the same as 64-PLZ (PLZ being the previous number of leading zeros), whether it is 0, and so on. Finally, the significant value of 64-CLZ bits is stored.

c) If a later value fails the check in 3.a, or if the conversion succeeds but the difference D from the previous integer does not stay within the range of int64, the process goes to 4.c.

d) If the previous value is also an integer, the multiplier is the same, and the difference D is 0, a bit is stored to indicate that a change has occurred and a bit 0x1 is recorded to indicate a repeated integer.

e) Otherwise, as in 3.c, one bit distinguishes whether the difference D is positive or negative. If the previous value is also an integer, the multiplicand has not increased, and the number of bits 64-CLZ of the significant value is the same as for the previous difference, a bit is stored to represent no change, a bit is stored to distinguish positive from negative, and finally the significant value of 64-CLZ bits is stored.

f) Otherwise, a bit is stored to indicate that a change has occurred and a bit 0x0 is stored to indicate no repetition. The difference is then stored as in 3.b.

4) Floating-point number compression algorithm.

a) The floating-point number is converted to a 64-bit integer according to the IEEE 754 standard for binary floating-point arithmetic.

b) If it is the first floating-point number, a bit 0x1 is written to indicate a floating-point number, and the data is stored as 64 bits.

c) If the previous value is an integer, a bit is stored to indicate that a change has occurred, a bit 0x0 is stored to indicate no repetition, and a bit is stored to indicate a floating-point number.

d) If instead the floating-point number is the same as the previous floating-point number, a bit is stored to indicate that a change has occurred, and a bit 0x1 is stored to indicate repetition.

e) Finally, if it differs from the previous floating-point number, a bit is stored to represent that no change occurs, and the XOR of the 64-bit forms of the current and previous floating-point numbers is calculated. The leading and trailing zeros of this XOR value are compared with those of the previous XOR value (0 if there is none); if the number of leading zeros PLZ and the number of trailing zeros PTZ of the previous XOR value are less than or equal to the current number of leading zeros CLZ and trailing zeros CTZ, respectively, the significant value of 64-PLZ-PTZ bits is stored.

f) Otherwise, the significant value of 64-CLZ-CTZ bits must be stored. The two cases are first distinguished by a prefix bit; then, before the 64-CLZ-CTZ bits of data, 6 bits (2^6 = 64) are stored to represent CLZ, then 6 bits to represent 64-CLZ-CTZ, and finally the significant data of 64-CLZ-CTZ bits.
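The block written in case f) can be packed and unpacked as below. One detail is an assumption: a width of 64 does not fit in 6 bits, and the text does not say how that case is handled, so this sketch stores the width minus 1 (the XOR value is never zero in this branch, so the width is at least 1). The function names are illustrative.

```python
def pack_new_window(xor_value: int) -> str:
    """Pack a non-zero 64-bit XOR value as CLZ (6 bits), width (6 bits), data."""
    assert 0 < xor_value < (1 << 64)
    clz = 64 - xor_value.bit_length()                   # leading zeros, 0..63
    ctz = (xor_value & -xor_value).bit_length() - 1     # trailing zeros
    width = 64 - clz - ctz                              # significant bits, 1..64
    significant = xor_value >> ctz
    # Assumption: store width - 1 so that a width of 64 fits in 6 bits.
    return f"{clz:06b}{width - 1:06b}{significant:0{width}b}"


def unpack_new_window(bits: str) -> int:
    """Recover the 64-bit XOR value from a packed block."""
    clz = int(bits[:6], 2)
    width = int(bits[6:12], 2) + 1
    significant = int(bits[12:12 + width], 2)
    ctz = 64 - clz - width
    return significant << ctz


x = 0x0016000000000000            # XOR of the bit patterns of 24.0 and 15.0
packed = pack_new_window(x)
print(packed)                     # 0010110000111011
print(unpack_new_window(packed) == x)
```

The decoder XORs the recovered value with the previous 64-bit pattern to obtain the current floating-point number.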

5) Simple string compression algorithm.

a) Usage scenario: the string field rarely changes, or the same strings are used often.

b) The most recently used strings are recorded by caching a list of the N most recently encoded strings (N must be the same for compression and decompression). Each new string is first checked against the cache and added if absent; if the number of cached strings exceeds N, the least recently used string is evicted.

c) If the string is identical to the previous one, a bit 0x0 is stored to indicate no change, and processing ends. Otherwise, a bit 0x1 is stored.

d) If the string is present in the cache, a bit 0x0 is stored to indicate that what follows is the sequence number I of the string in the cache, whose bit width depends on N. Otherwise, a bit 0x1 is stored to indicate that the string itself follows: its length is stored as a variable-length integer, followed by the string.
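The string scheme can be sketched with a small least-recently-used list shared by encoder and decoder. For readability the sketch emits symbolic tokens instead of bits; the class name `StringCodec`, the token names, and the use of a list position as the sequence number I are assumptions for illustration.

```python
class StringCodec:
    """Encoder and decoder state for the recent-string cache (capacity N)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.recent = []      # least recently used first
        self.last = None      # previous string in the series

    def _touch(self, s: str) -> None:
        """Mark s as most recently used, evicting the oldest entry if needed."""
        if s in self.recent:
            self.recent.remove(s)
        self.recent.append(s)
        if len(self.recent) > self.capacity:
            self.recent.pop(0)

    def encode(self, s: str):
        if s == self.last:
            return ("same",)                    # bit 0x0: no change
        self.last = s
        if s in self.recent:
            token = ("cached", self.recent.index(s))   # bits 0x1, 0x0, index I
        else:
            token = ("literal", s)              # bits 0x1, 0x1, length, string
        self._touch(s)
        return token

    def decode(self, token):
        if token[0] == "same":
            return self.last
        s = self.recent[token[1]] if token[0] == "cached" else token[1]
        self.last = s
        self._touch(s)
        return s


values = ["a", "a", "b", "a", "c", "b"]
encoder, decoder = StringCodec(2), StringCodec(2)
tokens = [encoder.encode(v) for v in values]
print(tokens)
print([decoder.decode(t) for t in tokens] == values)
```

Because both sides apply the same update rule after every string, the decoder's cache stays in step with the encoder's, which is why N must match on both sides.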

Claims

1. A simple coding and decoding method suitable for a time sequence database is characterized by comprising the following steps:

2) integer conversion:

2a) taking the floating-point number as the multiplier and multiplying it by a multiplicand to obtain a first result; if the fractional part of the first result is 0, the conversion succeeds and the multiplicand and the integer part I are returned, completing the integer conversion;

2c) if the previous floating-point number corresponding to the integer obtained in step 2b) is the same as the integer part I of the first result, the conversion succeeds and the multiplicand and the integer part I are returned, completing the integer conversion;

2d) if the next floating-point number corresponding to the integer obtained in step 2b) is the same as the integer part I + 1 of the first result, the conversion succeeds and the multiplicand and the integer part I + 1 are returned, completing the integer conversion;

2e) if the integer conversion fails, entering step 3);

2f) if the integer conversion is successful, entering the step 4);

3) compression of floating point number:

4) and if the numerical value of the time sequence is the time stamp, compressing the time stamp, and if the integer numerical value of the time sequence is not the time stamp, performing integer compression.

2. The simplified coding/decoding method for time series database as claimed in claim 1, wherein in step 2a), the multiplicands are, in turn, [1, 10, 100, 1000, 10000, 100000, 1000000].

3. The simplified encoding and decoding method for time series databases as claimed in claim 1, wherein in step 2b), the first result is converted into 64-bit integer according to the IEEE 754 binary floating point arithmetic standard.

4. The simplified coding and decoding method applied to the time series database according to claim 1, wherein in step 3), the floating point number compression specifically includes:

3e) if the previous value in the time series is a floating-point number whose value differs from the current one, storing an identifier to represent that no change occurs, computing the exclusive-or (XOR) of the 64-bit forms of the current and previous floating-point numbers, comparing the leading and trailing zeros of this XOR value with those of the previous XOR value, and, if the number of leading zeros PLZ and the number of trailing zeros PTZ of the previous XOR value are less than or equal to the number of leading zeros CLZ and the number of trailing zeros CTZ of the current XOR value, respectively, storing the significant value of (64-PLZ-PTZ) bits;

3f) otherwise, that is, if PLZ is greater than CLZ or PTZ is greater than CTZ, storing the significant value of (64-CLZ-CTZ) bits.

5. The simplified encoding and decoding method for time series databases as claimed in claim 4, wherein in step 3a), the floating point number is converted into a 64-bit integer according to the IEEE 754 binary floating point arithmetic standard.

6. The simplified coding and decoding method applied to the time series database as claimed in claim 4, wherein the step 3f) of storing (64-CLZ-CTZ) significant values includes:

the 6 bits are stored to represent the CLZ, then the 6 bits are stored to represent the 64-CLZ-CTZ, and finally the meaningful data of the 64-CLZ-CTZ bits are stored.

7. The simplified coding and decoding method applied to the time series database according to claim 1, wherein the step 4) of compressing the time stamp specifically comprises:

4a) defining time units including nanosecond, microsecond, millisecond and second;

4c) starting from the second timestamp in the time series, calculating the difference from the previous timestamp, then the difference between that difference and the previous one, and converting the result to the fixed time unit; if the unit is second or millisecond the result is at most 32 bits, and if the unit is nanosecond or microsecond it is at most 64 bits; finally, according to the size of the result, selecting a suitable entry of the byte queue [0,7,9,12,32] or [0,7,9,12,64] for storage, and distinguishing the different byte queues by prefix bits.

8. The simplified coding and decoding method applied to the time series database according to claim 1, wherein in the step 4), the integer compression specifically comprises:

determining the number of leading zeros CLZ of the absolute value of the current integer; if 64-CLZ is consistent with 64-PLZ, storing an identifier to indicate consistency, and if it is inconsistent, storing the value 64-CLZ; then storing the significant value of 64-CLZ bits;

4o) if the previous value in the time series is not an integer, distinguishing positive from negative numbers with an identifier, storing an identifier to represent that a change occurs, storing an identifier to represent no repetition, and determining the number of leading zeros CLZ of the absolute value of the current integer; if 64-CLZ is consistent with 64-PLZ, storing an identifier to indicate consistency, and if it is inconsistent, storing the value 64-CLZ; then storing the significant value of 64-CLZ bits.