CN115934792A - Array type time sequence data compression and cross-dimension query method - Google Patents


Info

Publication number
CN115934792A
Authority
CN
China
Prior art keywords
data
time sequence
time
compression
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211506113.XA
Other languages
Chinese (zh)
Inventor
尚剑红
胡许冰
高钒
杨帆
高越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Shangyuan Intelligent Technology Co ltd
Original Assignee
Shenyang Shangyuan Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Shangyuan Intelligent Technology Co ltd filed Critical Shenyang Shangyuan Intelligent Technology Co ltd
Priority to CN202211506113.XA priority Critical patent/CN115934792A/en
Publication of CN115934792A publication Critical patent/CN115934792A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An array type time series data compression and cross-dimension query method, belonging to the technical field of big data. The compression method finds the regularities in the data and compresses according to them: the keys are compressed first, and then the values. The query method comprises: constructing array type time series data in a time series database and constructing a physical hypertable from the time series data; compressing the acquired time series data and then inserting it into the time series database; constructing the underlying implementation of the time series data table based on time and a specified dimension, and splitting the whole data set so that the split data can be used for subsequent cross-dimension queries; and constructing, from the query conditions specified at the underlying layer, a query method that works by time and the specified dimension, thereby implementing cross-dimension queries of the time series data. The method is suitable for equipment data, environment data, energy consumption data, security data, fire safety data, production data, and the like, and has the advantages of low storage cost, convenient cross-dimension query, large storage capacity, safety, and accuracy.

Description

Array type time sequence data compression and cross-dimension query method
Technical Field
The invention belongs to the technical field of big data, and particularly relates to a method for compressing massive data, reducing storage cost, and quickly querying time series data.
Background
With the development of big data technology, time series data must be recorded in hydrological monitoring, factory equipment monitoring, national security monitoring, communication monitoring, financial index data, sensor data, and so on, and in the Internet of Things scenarios that a time series database faces, hundreds of millions of records are generated every day. The importance of data in the big data era is self-evident, but if time series data cannot be managed and compressed well, its storage cost quietly grows. Data compression is therefore needed. Existing data compression algorithms include general-purpose ones and ones tied to particular services, such as the compression of video, audio, and image data streams.
In addition, the historical and real-time operating data of devices on an Internet of Things platform are queried, and the trends and regularities behind the data are analyzed, to provide reference and decision suggestions for an enterprise's equipment operation and thereby improve its operating efficiency. Accurately querying the data of a particular device is a function that must be provided. Examples include: querying the time series data managed by a particular tag within a given time window; aggregating the time series data within a given time window, or filling in missing data by linear interpolation; grouping time series data by attribute, data quality, number of sampling points, and time to facilitate subsequent analysis; and filtering for specific time series data or for the data at the latest time.
The existing data compression method uses an LZ algorithm. Taking a string of pure English characters as an example, the characters are traversed one by one. If a character is not in the current dictionary, it is added to the dictionary, with the character as the key and a number returned by a generator as the value. If the current character is already in the dictionary, the next character is appended so that the two characters form a string, and the dictionary is checked for that string; if it is present, the next character is appended to form a new string, and if not, the string is added to the dictionary. The final output takes the form of dictionary values in place of strings, each followed by the last character. The current query method builds database tables by time; the table structure must be analyzed to some extent, and data of different dimensions are queried by analyzing and applying SQL statements, so the table relationships and the data of each dimension cannot be seen intuitively.
With the development of Internet of Things technology, time series databases are used more and more widely and the demands on time series data keep growing. Time series databases also face the following problems:
(1) High storage cost: the underlying devices, instruments, and meters of the Internet of Things are numerous. The data a device collects differ across times and scenarios, so the data must be stored, and without any compression the data volume is huge. Conventional compression algorithms designed for time series data reach a compression ratio of at most 3. As long as data are not cleaned up regularly, they persist and occupy most of the storage space. How to process data faster, more safely, and with less space is therefore a problem to be solved.
(2) Inconvenient cross-dimension query: when a service queries time series data, it generally filters by time interval, so partitioning is usually done by time. However, filtering device information by a particular device and a particular device attribute is then inconvenient and time-consuming; it also increases the user's management workload, because the user has to expand partitions manually. If a user wants to query more detailed device information by time, the attribute fields of the device information have to be expanded horizontally. How to quickly and conveniently query the data of a certain attribute of a certain device at a certain time is therefore a problem to be solved.
(3) Mass data storage: the amount of data that devices and sensors monitor in real time and store in the database is very large, reaching terabytes or more per day, which raises the question of how to transmit, store, and analyze it. Where sensors capture images, for example, the database must at least be able to access the feature values of the images. The same holds for other vertical industries, where more precise data types are needed to support the business as fully as possible.
(4) Quality indicators: when compressing data, it must be considered whether the compression algorithm causes data loss and how accurate the compressed data is. While achieving a high compression ratio, the method must also maximize compression speed and accuracy and avoid damaging the data during compression.
(5) Secure data storage: guaranteeing safe storage of data is difficult, because while a storage engine is running the program may crash at any time, the server may suffer a power or hardware failure, and disks may be damaged. Under these conditions the data must still be compressed, stored, and queried safely, with SQL support and with complete backup and point-in-time recovery.
Disclosure of Invention
In view of the storage and query problems of existing time series databases, the invention provides an array type time series data compression and cross-dimension query method, which can compress the data in an array type time series database at a high compression ratio while guaranteeing the quality indicators, and can query the data in the database across dimensions so as to retrieve the time series data of a given attribute. The specific technical solution is as follows:
An array type time series data compression method, as follows:
finding the regularities of the data and compressing according to them: finding the regularities among the keys from the attribute keys of the data, and encoding the keys with a compressor to form new keys according to those regularities; then finding the relationship between the compressed keys and the values from the values corresponding to the original keys and the relationships among the values, and compressing the values;
when compressing floating-point time series data, setting a threshold on the loss of precision that the compression must not exceed, which prevents the data from being severely degraded, since inaccurate query results would affect the observation and analysis of the data and the judgment of its trend; after compression, fitting the points with a straight line or a curve: taking the currently compressed point and the point before compression, calculating the difference, connecting the points into a rectangle, and calculating the variance, keeping values with a small variance and discarding values with a large variance; when the rectangle formed by the current point and the previously recorded point cannot contain the points in between, recording the previous point, so that most data points can be dropped; at query time, recovering the dropped points from the recorded points, so that lossy compression greatly reduces storage cost, data writes, and network bandwidth; the data compression also supports assigning a binary code to each symbol of the source, with more frequent symbols receiving shorter codes and less frequent symbols longer codes, which improves the compression ratio and the transmission efficiency;
for lossless compression that is independent of the nature of the data, replacing continuously repeated original data with variable-length codes: several compression algorithms are available, and the one best suited to the data is matched according to the configured compression conditions; storage is also particularly safe, because when data in the database is lost another high-performance algorithm can be matched in time to prevent data loss.
In the above technical solution, the available compression algorithms include: an algorithm designed around the characteristics of the IEEE 754 floating-point storage format, the differential encoding algorithm (also known as incremental encoding), the XOR algorithm, the RLE algorithm, the Simple8b algorithm, the Zig-zag algorithm, the Delta-of-Delta algorithm (also known as second-order differential encoding), the Snappy compression algorithm, the LZO block compression algorithm, the DEFLATE lossless data compression algorithm, and the Bit-packing compression algorithm.
An array type time series data cross-dimension query method, comprising the following steps:
S1: constructing the time series data in a time series database, so that the currently acquired time series data is stored in real time, and storing the data in the time series database;
S2: constructing a physical hypertable from the time series data, so that the time series data can be observed intuitively when queried;
S3: compressing the acquired time series data, then processing it in layers, and then inserting it into the time series database;
S4: constructing the underlying implementation of the time series data table based on time and a specified dimension, and splitting the whole data set so that the split data can be used for subsequent cross-dimension queries;
S5: constructing, from the query conditions specified at the underlying layer, a query method that works by time and the specified dimension, thereby implementing cross-dimension queries of the time series data.
In step S1 of the above technical solution, a data pool is selected first. The data pool is selected on the principle that it supports application scenarios with highly concurrent database transaction connections, reduces the overhead of acquiring and releasing resources in such scenarios, responds more quickly to database client requests from applications, and solves the problem of frequent sessions between multiple applications and the time series data.
In step S1 of the above technical solution, a storage model is then constructed. When the time series database needs to store many time series data fields, an 'array table' time series table structure is constructed, in which dimension and metric information are recorded separately; device models with the same dimensions and the same metric types or counts can be multiplexed in, or stored independently in, metric tables of the same kind, and are finally converted into a view that maps metric values to fields for external use.
In step S2 of the above technical solution, the time series data is processed in layers, to meet the performance and capacity requirements of the time series database in actual cross-dimension service scenarios while saving infrastructure procurement cost. The overall strategy is divided into a hot data layer (tier 1), a cold data layer (tier 2), and a historical data layer (tier 3).
The hot data layer (tier 1) stores the raw metric data that has recently been queried frequently and in real time in the time series database, and uses a high-performance storage type such as SSD media. The hot data layer uses a designated tablespace; each hypertable shares several tablespaces, and an individual chunk inherits a single tablespace for storage.
The cold data layer (tier 2): as time series data ages, data in the hot data layer (tier 1) is gradually migrated to the cold data layer (tier 2), where it is compressed by TimescaleDB and then stored on the storage medium. The cold data layer (tier 2) uses a designated tablespace separate from that of the hot data layer (tier 1).
The historical data layer (tier 3) uses object storage to keep cold data that has exceeded its retention in the cold data layer (tier 2); the data can be migrated out of and deleted from the time series database and stored in Parquet format.
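The three-layer strategy can be illustrated by a small routing function that decides which tier a record belongs to from its age. This is only a sketch under stated assumptions: the function name and constants are hypothetical, the 10-day and roughly 6-month boundaries are taken from the deployment figures quoted for the embodiment, and both boundaries are assumed to be measured from the time the record was collected.

```python
from datetime import timedelta

# Assumed boundaries, measured from collection time (not specified exactly
# by the text): 10 days on SSD, then about 6 months compressed.
HOT_RETENTION = timedelta(days=10)
COMPRESSED_RETENTION = timedelta(days=183)


def tier_for_age(age: timedelta) -> str:
    """Return the storage tier for a record of the given age."""
    if age < HOT_RETENTION:
        return "tier1"  # recent raw data on high-performance media
    if age < COMPRESSED_RETENTION:
        return "tier2"  # aged data, compressed in a separate tablespace
    return "tier3"      # historical data in object storage (Parquet)
```

For example, `tier_for_age(timedelta(days=30))` selects tier 2 under these assumed boundaries.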
In step S3 of the above technical solution, after the raw data is collected, it is compressed according to the array type time series data compression method and written directly to storage. After service processing, the current write mode lets multiple processes fetch data from the MQ in batches, write many records in a single transaction, and use a connection pool to improve connection performance, which reduces database overhead and improves the efficiency of writing time series data.
In step S5 of the above technical solution, the whole time series database is written to at 11 records/second with an acquisition frequency of once every 5 seconds, giving TPS = 250 × 150 / 5 = 7500. Based on the performance report of the time series database and with reference to actual stress-test reports, the deployment architecture uses a primary-standby streaming replication cluster that supports read-write separation between a master service (R/W) and a replica service (read-only). The deployment uses the TimescaleDB single-node hypertable scheme, and can be extended to distributed hypertables once their features mature. The single-hypertable streaming replication scheme supports the features the service requires: continuous aggregates, data retention, data tiering, and data compression. A single PG database server uses a virtual machine with a 32c/256G or higher configuration to meet the high TPS requirement of time series writes. Hot data storage covers 10 days of full collection (once every 5 seconds) and uses SSD media with a capacity of 2 TB per node, including the database's real-time archive storage space. Warm data storage covers at least 6 months of full collection with compression enabled (by site and device ID), with a capacity of 5.4 TB per node. Cold data uses an object storage service and covers at least 3 years of historical data and archives. In addition, for data recovery, an extra mounted storage space is provided so that archive logs held in object storage can be synchronized back to the data node for recovery.
Compared with the prior art, the array type time series data compression and cross-dimension query method has the following beneficial effects:
1. The compression method finds regularities from the attribute keys of the data and compresses the keys, then finds the relationship between keys and values from the values corresponding to the keys and the relationships among the values, and compresses the values. After the original data is compressed and encoded, the data objects are represented with the fewest bits, which reduces storage space and transmission bandwidth; converting the original data into a form more compact than its original format improves the efficiency of data transmission, storage, and processing.
2. When compressing a time series database, the compression method matches the data compression mode best suited to the data according to the configured compression conditions. The fastest and most convenient compression method for reducing storage cost is thus used, giving an optimal solution for time series compression that lowers storage cost, saves resources, and keeps the post-compression quality indicators high. Because several algorithms are available, storage is also particularly safe: when data in the database is lost, another high-performance algorithm can be matched to prevent data loss.
3. The query method designs the underlying implementation of the time series data table around time and a specified dimension and splits the whole data set for subsequent cross-dimension queries; from the query conditions specified at the underlying layer it constructs a query method that works by time and the specified dimension, thereby implementing cross-dimension queries of the time series data. It uses an 'array table' time series table structure in which dimension and metric information are recorded separately; device models with the same dimensions and the same metric types or counts can be multiplexed in, or stored independently in, metric tables of the same kind, and are finally converted into a view that maps metric values to fields for external use. This avoids the inefficient JOIN operations across multiple metrics of a traditional narrow table and the high cost of extending a traditional wide table. Because the number of metric columns in the table is limited, multi-metric queries are faster than with a traditional wide table. Different compression algorithms can be applied to metrics of different data types (integer, float, and so on) to optimize storage. In the actual table structure, data of different value types are classified, values of the same type are placed in the same array, and the write order is kept consistent.
4. The invention provides a new method for compressing array type time series data that makes the storage of time series data safer and more reliable, and a new method for cross-dimension query that improves query efficiency and saves storage cost. In practical applications it can save users 15-25% of storage cost while guaranteeing the post-compression quality indicators. With time as a filter, device information can be queried across dimensions by device and by device attribute, so users see query results more clearly and intuitively. The user's management workload is also reduced, because the user no longer needs to expand partitions manually.
In conclusion, the method is suitable for equipment data, environment data, energy consumption data, security data, fire safety data, production data, and the like; the more comprehensive the data sensing, the more complete the data analysis and the more reasonable the response and handling strategies. The method has the advantages of low storage cost, convenient cross-dimension query, large storage capacity, safety, and accuracy. Because the compression algorithm is selected to suit the data, the method can keep compressing continuously generated time series data in more general scenarios. After the original data is compressed and encoded, the data objects are represented with the fewest bits, which reduces storage space and transmission bandwidth; converting the original data into a form more compact than its original format improves the efficiency of data transmission, storage, and processing.
Drawings
FIG. 1 is a schematic diagram of the steps of the array type time series data cross-dimension query method of the invention;
FIG. 2 is a system management architecture diagram of the unattended station of embodiment 1 of the invention.
Detailed Description
The invention will be further described with reference to the following examples and figures 1-2, but the invention is not limited to these examples.
Example 1
An array type time series data compression method, as follows:
After the original data is compressed and encoded, the data objects are represented with the fewest bits, which reduces storage space and transmission bandwidth; converting the original data into a form more compact than its original format improves the efficiency of data transmission, storage, and processing.
The method first finds the specific regularities of the data and compresses according to them. Specifically: the regularities among the keys are found from the attribute keys of the data, and the keys are encoded by a compressor to form new keys according to those regularities; the relationship between the compressed keys and the values is then found from the values corresponding to the original keys and the relationships among the values, and the values are compressed.
When floating-point data is compressed, the precision lost must be kept as small as possible, so a threshold on the loss of precision is set and the compression must not exceed it; this prevents the data from being severely degraded, since inaccurate query results would affect the observation and analysis of the data and the judgment of its trend. The main idea of floating-point time series compression is to fit the points as closely as possible with a straight line or a curve. For example, the currently recorded point, the currently compressed point, and the point before compression are taken, their differences are calculated, the points are connected into a rectangle, and the variance is calculated; values with a small variance are kept and values with a large variance are discarded. When the rectangle formed by the current point and the last recorded point cannot contain the points in between, the previous point is recorded. As can be seen, most data points are dropped, and at query time the dropped points must be recovered from the recorded points. Lossy compression can therefore greatly reduce storage cost and, combined with the capabilities of the device side, can reduce data writes and network bandwidth. In addition, the data compression of the method supports assigning a binary code to each symbol of the source: symbols that occur more often receive shorter codes and symbols that occur less often receive longer codes, which improves the compression ratio and the transmission efficiency.
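The threshold-bounded lossy compression described above can be sketched as follows. This is a simplified illustration, not the patented procedure: it replaces the rectangle and variance test with a plain check that every skipped point lies within a tolerance `eps` of the straight line between two recorded points, and the names `compress`, `reconstruct`, and `eps` are hypothetical. Timestamps are assumed to be strictly increasing.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def interpolate(a: Point, b: Point, t: float) -> float:
    """Value at time t on the straight line through points a and b."""
    (t0, v0), (t1, v1) = a, b
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def compress(points: List[Point], eps: float) -> List[Point]:
    """Keep only the points needed to rebuild every value within eps."""
    if len(points) <= 2:
        return list(points)
    kept = [points[0]]
    anchor = 0  # index of the last recorded point
    for i in range(2, len(points)):
        middle = points[anchor + 1:i]
        fits = all(
            abs(interpolate(points[anchor], points[i], t) - v) <= eps
            for t, v in middle
        )
        if not fits:
            # The line from the last recorded point to the current point no
            # longer covers the points in between: record the previous point.
            kept.append(points[i - 1])
            anchor = i - 1
    kept.append(points[-1])
    return kept


def reconstruct(kept: List[Point], t: float) -> float:
    """Recover a (possibly dropped) value at time t from the recorded points."""
    for a, b in zip(kept, kept[1:]):
        if a[0] <= t <= b[0]:
            return interpolate(a, b, t)
    raise ValueError("t is outside the recorded time range")
```

With `eps = 0.1`, the six points `(0, 0), (1, 1), (2, 2), (3, 3), (4, 10), (5, 10)` are reduced to four recorded points, and each original value can be recovered within the tolerance by `reconstruct`.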
For lossless compression that is independent of the nature of the data, compression is based on replacing continuously repeated original data with variable-length codes. For example, the string "AAAABBBCCDEEEE", which consists of 4 A's, 3 B's, 2 C's, 1 D, and 4 E's, can be compressed by RLE into 4A3B2C1D4E. RLE is simple and fast and compresses long runs of highly repetitive data into small units, but its drawback is equally obvious: data with little repetition compresses poorly. An algorithm designed around the IEEE 754 floating-point storage format is supported: the first value is stored uncompressed, and each later value is XORed (exclusive OR) with the value before it; if the two values are the same only a single 0 is stored, and otherwise the XOR result is stored. This algorithm is strongly affected by fluctuation in the data: the more violent the fluctuation, the worse the compression. Differential (Delta) encoding is supported; it is widely used, for example for viewing the change history of a file (version control, Git, and so on), but it is rarely used alone in a time series database and is generally combined with RLE, Simple8b, or Zig-zag for a better compression effect. Delta-of-Delta, also known as second-order differential encoding, is supported; it applies Delta encoding again to the result of Delta encoding and is best suited to monotonically increasing or decreasing sequences. For example, 2, 4, 4, 6, 8 becomes 2, 2, 0, 2, 2 after Delta encoding and 2, 0, -2, 2, 0 after Delta-of-Delta encoding.
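The run-length, XOR, and Delta steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, the input 2, 4, 4, 6, 8 is assumed because it reproduces the Delta outputs quoted in the text, and XORing each value against the immediately preceding one is an assumption about the floating-point scheme.

```python
import struct
from itertools import groupby
from typing import List


def rle_encode(text: str) -> str:
    """Replace each run of a repeated symbol with '<count><symbol>'."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference from the previous value."""
    return [v - p for p, v in zip([0] + values[:-1], values)]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def float_to_bits(x: float) -> int:
    """IEEE 754 double-precision bit pattern of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_encode(values: List[float]) -> List[int]:
    """First value as raw bits, then each value XORed with the one before it."""
    bits = [float_to_bits(v) for v in values]
    return bits[:1] + [b ^ a for a, b in zip(bits, bits[1:])]
```

Applying `delta_encode` twice gives the second-order (Delta-of-Delta) form, and a repeated floating-point value yields an XOR result of 0, which is the case the text stores as a single 0.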
Zig-zag is supported: the sign bit is moved to the end of the value and redundant leading 0s in the code are removed, which achieves the compression. It is efficient for small values but gives no gain, or even a loss, for large values, so Zig-zag is usually combined with Delta encoding, which turns large values into small ones. The Snappy compression algorithm is supported. Its principle is as follows: given a sequence S = [9,1,2,3,4,5,1,2,3,4], matching finds that the subsequence S[2..5] = [1,2,3,4] is identical to S[7..10] = [1,2,3,4], so the sequence is encoded as S = [9,1,2,3,4,5,(2,4)], where 2 is the start position and 4 is the number of elements. Snappy is fast and has a reasonable compression ratio. Simple8b is supported: it is a 64-bit algorithm that packs several integers (each between 0 and (1<<60)-1) into one 64-bit word, in which the first 4 bits are a selector and the remaining 60 bits store the data. The LZO block compression algorithm is supported. The DEFLATE lossless data compression algorithm is supported; DEFLATE is simply an algorithm for compressing data streams and is used when important data is compressed. The Bit-packing compression algorithm is supported; on the premise that not every integer needs 32 or 64 bits, unnecessary bits are removed from the data to be compressed. For example, 32-bit integers whose values range from 0 to 100 can be represented with 7 bits. When the time series database compresses time series data, the most suitable compression mode is matched according to the configured compression conditions.
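Zig-zag encoding and Bit-packing can be illustrated with the short sketch below. It is a minimal example under stated assumptions: values are treated as signed 64-bit integers for Zig-zag, the helper names are hypothetical, and the packing routine simply concatenates fixed-width fields into one Python integer rather than following the exact Simple8b selector layout.

```python
from typing import List

MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def bits_needed(values: List[int]) -> int:
    """Smallest field width that can hold every non-negative value."""
    return max(1, max(v.bit_length() for v in values))


def pack(values: List[int], width: int) -> int:
    """Concatenate the values into one integer, `width` bits per value."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * width)
    return word


def unpack(word: int, width: int, count: int) -> List[int]:
    """Split a packed integer back into `count` values of `width` bits."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]
```

For values in the range 0 to 100, `bits_needed` returns 7, so eight such values occupy 56 bits and would fit in the 60 data bits of a Simple8b-style word.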
In this way the fastest and most convenient compression method for reducing storage cost is used when data is compressed, giving an optimal solution for time series compression that lowers storage cost, saves resources, and keeps the post-compression quality indicators high. Because several algorithms are available, storage is also particularly safe: when data in the database is lost, another high-performance algorithm can be matched to prevent data loss.
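One way to picture "matching the most suitable compression mode" is to try several candidate codecs on a block of data and keep the smallest result. The sketch below is only an illustration of that idea: the text does not say how the match is made, the candidate set and the names `CODECS`, `choose_codec`, and `decode` are hypothetical, and Python's standard `zlib` module stands in for DEFLATE.

```python
import struct
import zlib
from typing import Callable, Dict, List, Tuple


def to_bytes(values: List[int]) -> bytes:
    """Serialize integers as big-endian signed 64-bit words."""
    return struct.pack(f">{len(values)}q", *values)


def from_bytes(data: bytes) -> List[int]:
    return list(struct.unpack(f">{len(data) // 8}q", data))


def delta(values: List[int]) -> List[int]:
    return [v - p for p, v in zip([0] + values[:-1], values)]


def undelta(deltas: List[int]) -> List[int]:
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


Codec = Tuple[Callable[[List[int]], bytes], Callable[[bytes], List[int]]]

# Hypothetical candidate set: (encoder, decoder) per codec name.
CODECS: Dict[str, Codec] = {
    "raw": (to_bytes, from_bytes),
    "deflate": (lambda v: zlib.compress(to_bytes(v)),
                lambda b: from_bytes(zlib.decompress(b))),
    "delta+deflate": (lambda v: zlib.compress(to_bytes(delta(v))),
                      lambda b: undelta(from_bytes(zlib.decompress(b)))),
}


def choose_codec(values: List[int]) -> Tuple[str, bytes]:
    """Encode with every candidate and return the name and smallest payload."""
    encoded = {name: enc(values) for name, (enc, _) in CODECS.items()}
    best = min(encoded, key=lambda name: len(encoded[name]))
    return best, encoded[best]


def decode(name: str, payload: bytes) -> List[int]:
    return CODECS[name][1](payload)
```

Storing the codec name next to the payload lets the reader decode each block with whichever algorithm was chosen for it. For regularly spaced timestamps the Delta-based candidate wins, because the differences are constant.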
As shown in FIG. 1, a cross-dimension query method for array-type time series data includes:
S1: designing the array-type time series data in the system's time series database, so that the time series data acquired by the current system is stored in the time series database in real time;
S2: constructing a physical hypertable from the time series data and then processing the time series data in layers, so that the time series data can be observed intuitively when it is queried;
S3: compressing the acquired time series data and then inserting the compressed time series data into the time series database;
S4: constructing the underlying implementation of the time series data table, which splits the whole data set by time and by a specified dimension for use in the subsequent cross-dimension query;
S5: constructing, according to the specific underlying query conditions, a query method based on time and a specified dimension, thereby realizing cross-dimension query of the time series data.
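Steps S4 and S5 can be pictured with the following Python sketch, which splits rows into chunks by a time bucket and a hashed dimension value and then answers a query restricted by time range, device and attribute. The class and function names, the one-day chunk interval and the partition count are assumptions made for illustration only; the embodiment does not specify them.

```python
import zlib
from collections import defaultdict

CHUNK_SECONDS = 24 * 3600  # assumed chunk interval: one day
N_PARTITIONS = 4           # assumed number of partitions on the device dimension

def partition_of(device_id: str) -> int:
    """Stable hash of the dimension value (crc32 is deterministic across runs)."""
    return zlib.crc32(device_id.encode("utf-8")) % N_PARTITIONS

def chunk_key(ts: int, device_id: str):
    """S4: a chunk is identified by (time bucket, dimension partition)."""
    return ts // CHUNK_SECONDS, partition_of(device_id)

class ChunkStore:
    """Hypothetical in-memory stand-in for a table split by time and one dimension."""

    def __init__(self):
        self.chunks = defaultdict(list)

    def insert(self, ts: int, device_id: str, attrs: dict) -> None:
        self.chunks[chunk_key(ts, device_id)].append((ts, device_id, attrs))

    def query(self, start: int, end: int, device_id: str, attribute: str):
        """S5: visit only the chunks that can hold matching rows, then filter."""
        part = partition_of(device_id)
        hits = []
        for bucket in range(start // CHUNK_SECONDS, end // CHUNK_SECONDS + 1):
            for ts, dev, attrs in self.chunks.get((bucket, part), []):
                if start <= ts <= end and dev == device_id and attribute in attrs:
                    hits.append((ts, attrs[attribute]))
        return sorted(hits)

store = ChunkStore()
store.insert(10, "pump-1", {"temperature": 21.5, "pressure": 3})
store.insert(CHUNK_SECONDS + 5, "pump-1", {"temperature": 22.0})
store.insert(20, "pump-2", {"temperature": 30.0})
result = store.query(0, 2 * CHUNK_SECONDS, "pump-1", "temperature")
```

Because the chunk key combines the time bucket with the dimension partition, a query for one device over a short period touches only a small part of the stored data.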
In step S1 of the above method, a data pool is selected first. The data pool is chosen so as to support application scenarios with highly concurrent database transaction connections, reduce the overhead of acquiring and releasing resources under high concurrency, respond faster to database client requests from applications, and handle frequent sessions between multiple applications and the time series data.
In step S1 of the above method, the second step is to construct a storage model. One option applies when the time series database has to store many time series fields: a wide table in which the related indexes (metrics), dimensions and attributes are associated together. Because different kinds of time series content are stored in one table, a table designed this way does not follow classical third-normal-form modelling, and it has the drawback that the large volume of stored data carries a lot of redundancy. Its advantage is better query performance, which makes data queries more accurate and complete and more convenient for the user. This design is mainly used to store time series data from Internet of Things terminal devices. Placing the related fields of the time series data in the same table greatly improves query efficiency, and queries across multiple metrics are easier because no joins are needed. Ingestion is also faster, because only one timestamp is written for multiple metrics. A typical wide-table model matches a typical data stream. A second storage mode, very different from the first, can also be constructed. This embodiment uses an "array table" time series table structure: dimension and metric information is recorded separately; device models with the same dimensions and the same metric types or counts can be multiplexed in, or stored independently in, the same metric table; and the table is finally converted into a view that maps metric values to fields for external use.
This solves the low efficiency of join operations across multiple metrics in a traditional narrow table and the high complexity of extending a traditional wide table. Because the number of metric columns in the table is limited, multi-metric queries are faster than with a traditional wide table. Different compression algorithms can be applied to metrics of different data types (integer, float, and so on) to optimize storage. In the table structure actually used for storage, values are grouped by type: values of the same type are placed in the same typed array, and the order in which they were written is preserved.
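One possible reading of the "array table" layout is sketched below in Python: metric names are recorded once per device model, each stored row carries a timestamp plus one array per value type, and a view function maps array positions back to named fields. All class, field and metric names are hypothetical; the embodiment does not give a concrete schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSchema:
    """Metric definitions recorded once per device model (hypothetical layout)."""
    int_metrics: tuple
    float_metrics: tuple

@dataclass
class ArrayRow:
    """One stored row: a single timestamp plus one array per value type."""
    ts: int
    device_id: str
    ints: list
    floats: list

def to_row(ts: int, device_id: str, reading: dict, schema: ModelSchema) -> ArrayRow:
    """Place values into typed arrays in the order fixed by the schema."""
    return ArrayRow(
        ts=ts,
        device_id=device_id,
        ints=[reading[name] for name in schema.int_metrics],
        floats=[reading[name] for name in schema.float_metrics],
    )

def as_view(row: ArrayRow, schema: ModelSchema) -> dict:
    """Map array positions back to named fields, as a view would for external use."""
    fields = {"ts": row.ts, "device_id": row.device_id}
    fields.update(zip(schema.int_metrics, row.ints))
    fields.update(zip(schema.float_metrics, row.floats))
    return fields

schema = ModelSchema(int_metrics=("status", "rpm"), float_metrics=("temperature", "voltage"))
reading = {"status": 1, "rpm": 1450, "temperature": 36.5, "voltage": 229.8}
row = to_row(1700000000, "pump-1", reading, schema)
view = as_view(row, schema)
```

Because each array holds values of a single type, a type-specific compression algorithm can be applied per array, and adding a metric to a device model changes the schema record rather than the table columns.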
For the cross-dimension technical scheme, the time series data is processed in layers. To meet the performance and capacity requirements that actual cross-dimension service scenarios place on the time series database, and to save as much infrastructure procurement cost as possible, the overall strategy is divided into a hot data layer (tier1), a cold data layer (tier2) and a historical data layer (tier3). The layers are as follows:
tier1: stores the recent raw metric data in the time series database that is frequently queried in real time, on a high-performance storage type such as SSD media. This layer uses designated tablespaces; each hypertable can share multiple tablespaces, and each individual chunk is stored in exactly one of them.
tier2: as the time series data ages, the data of tier1 is gradually migrated to this layer, compressed by TimescaleDB and stored on large-capacity storage media. This layer uses a designated tablespace that is separate from the tier1 tablespaces.
tier3: the historical data layer, which uses object storage for cold data that has aged out of tier2. The data is migrated out of, and deleted from, the time series database and stored in Parquet format.
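The age-based movement between the three layers can be sketched in Python as follows. The 10-day and 180-day boundaries are assumptions borrowed from the capacity planning figures given later in this description (10 days of hot data, not less than 6 months of compressed data); the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed boundaries, taken from the capacity planning figures in this description.
HOT_WINDOW = timedelta(days=10)     # tier1: uncompressed rows on SSD
TIER2_WINDOW = timedelta(days=180)  # tier2: compressed chunks on mass storage

def tier_for(chunk_end: datetime, now: datetime) -> str:
    """Return the storage layer a chunk belongs to, based on the age of its newest row."""
    age = now - chunk_end
    if age <= HOT_WINDOW:
        return "tier1"
    if age <= TIER2_WINDOW:
        return "tier2"
    return "tier3"  # exported to object storage as Parquet and removed from the database

now = datetime(2022, 11, 29)
recent = tier_for(datetime(2022, 11, 25), now)
aged = tier_for(datetime(2022, 9, 1), now)
historical = tier_for(datetime(2021, 1, 1), now)
```

A scheduled job could evaluate such a rule for every chunk and trigger compression or export when the returned layer differs from the chunk's current one.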
When step S3 is executed, the raw data is converted into time series data according to the compression logic and written directly into the database, and the time series database carries out the other business processing. In the current write path, multiple processes fetch data from the MQ in batches, each transaction writes multiple records in a batch, and a connection pool is used to improve connection performance, which reduces database overhead and improves the efficiency of writing time series data.
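The batched write path described above (many records per transaction) can be sketched in Python. This sketch uses the standard-library sqlite3 module purely as a stand-in so that it runs without a server; the embodiment writes to a PostgreSQL/TimescaleDB database through a connection pool, and the table name, column names and batch size here are hypothetical.

```python
import sqlite3
from itertools import islice

def batches(iterable, size: int):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def write_batches(conn, rows, batch_size: int = 500) -> int:
    """Write rows in batches, one transaction per batch."""
    written = 0
    for batch in batches(rows, batch_size):
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany(
                "INSERT INTO readings (ts, device_id, value) VALUES (?, ?, ?)",
                batch,
            )
        written += len(batch)
    return written

# sqlite3 in-memory database as a stand-in for the time series database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, device_id TEXT, value REAL)")
rows = [(1700000000 + 5 * i, "pump-1", float(i)) for i in range(1200)]
total = write_batches(conn, rows)
stored = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Grouping records this way replaces one commit per record with one commit per batch, which is the source of the reduced database overhead mentioned above.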
In the above method, the following sizing is used to raise the cross-dimension query rate of the time series data and keep cross-dimension query performance stable. For write operations on the time series database, at an acquisition frequency of once every 5 seconds, the whole system is calculated at 11 records/second (the actual average over the 355 stations currently connected to the system). The estimate is TPS = 250 × 150 / 5 = 7500. Based on the performance report of the time series database and the actual stress test report, the deployment architecture adopts a one-primary, two-replica streaming replication cluster that supports read-write separation between a master-service (R/W) and a replica-service (read-only). At the platform design stage the deployment architecture adopts a TimescaleDB single-node hypertable scheme; it can additionally be extended to distributed hypertables once that feature has matured. The single-hypertable streaming replication scheme supports features required by the services, such as continuous aggregates, data retention, data tiering and data compression. For a single PG database server, a virtual machine configured with at least 32 cores and 256 GB of memory is recommended to meet the high TPS requirement of time series data writes. For data storage, hot data must hold 10 days of full collection (once every 5 seconds), preferably on SSD (solid state disk) storage media, with a recommended capacity of 2 TB per node, including the real-time archive storage space of the database; warm data must hold not less than 6 months of full collection with the compression mechanism enabled (by site and equipment ID), with an estimated capacity of 5.4 TB per node; cold data uses an object storage service and must hold historical data and archives for not less than 3 years.
In addition, for data restoration, extra mounted storage space must be provided so that the archive logs kept in object storage can be synchronized back to the data node for recovery.
The method of this embodiment is applicable to equipment data, environment data, energy consumption data, security data, fire safety data, production data and the like. The more comprehensive the data sensing, the more complete the data analysis and the more reasonable the response and handling strategy.
As shown in FIG. 2, based on the above method, the management framework of the unattended station system in this embodiment includes a physical device layer, an acquisition layer, a data management platform layer and a display layer; the acquisition layer comprises a plurality of data acquisition blocks. Data storage operations on the time series database include compression of the array-type time series data and cross-dimension query of the time series data. First, a data pool is selected to provide highly concurrent time series data connections, reduce resource overhead in high-concurrency scenarios, and handle frequent sessions between multiple applications and the time series data. Next, a database storage mode is constructed for the time series data: when the time series database has to store many time series fields, a time series data table is designed in which the related indexes, dimensions and attributes of the system are associated together. Once these preconditions are in place, the data needs to be compressed. The time series database enables a data compression policy, which reduces the storage cost of the time series data and speeds up queries. When time series data is written, it is stored in tier1 in uncompressed row form to maximize read and write performance. Once the data has aged, a scheduled job converts it into warm data in compressed columnar form. According to the requirements of the actual business scenario, chunks older than a specified time are compressed automatically with the best-matching algorithm.
Before compression, the chunk data is backed up, and the backup is deleted after compression succeeds, which avoids data loss if the database fails during compression. A subsequent scheduled task then migrates the compressed chunk and the tablespace of its corresponding indexes from tier1 to tier2. The data can also be compressed again with other matching compression strategies to reduce the storage space of the time series data and save cost, after which the time series data can be queried across dimensions.
This embodiment provides a new method for compressing array-type time series data that also makes storage of the time series data safer and more reliable, and a new method for cross-dimension query that improves query efficiency and saves storage cost. In practical application, with the quality index after compression maintained, the user's storage cost can be reduced by 15-25%. With a time filter applied, a cross-dimension query can be run on the device information of a given device and a given device attribute, so that the user sees the query result more clearly and intuitively. The user's management workload is also reduced, because the user does not need to extend partitions manually.

Claims (10)

1. An array-type time series data compression method, characterized in that the method finds regularities in the data and compresses according to those regularities, as follows: according to the attribute keys of the data, the regularities among the keys are found, and the keys are encoded by a compressor into new keys according to those regularities; then, according to the values corresponding to the original keys and the relationships among the values, the relationship between the compressed keys and the values is found, and the values are compressed.
2. The array-type time series data compression method according to claim 1, wherein, when floating-point time series data is compressed, an error threshold is set and the compression must not exceed it, which prevents the data from being seriously distorted so that queries return inaccurate data and the observation, analysis and trend judgment of the data are affected; after the data is compressed, the points are fitted with a straight line or a curve, the currently compressed point and the point before compression are taken and their difference is calculated, the points are connected into a rectangle and the variance is calculated, values with a small variance are kept and values with a large variance are discarded; when the rectangle formed by the current point and the previously recorded point cannot contain the intermediate points, the previous point is recorded, so that most data points can be discarded; at query time, the discarded points are recovered from the recorded points, so that lossy compression greatly reduces storage cost, data writes and network bandwidth; data compression also supports assigning a binary code to each symbol of the signal source, in which symbols that occur more frequently receive shorter codes and symbols that occur less frequently receive longer codes;
when lossless compression independent of the nature of the data is performed, variable-length codes replace runs of repeated original data to achieve compression: several algorithmic compression modes are available, and the mode best suited to the data is matched according to the configured compression conditions; storage is also safer, because when data in the database is lost another high-performance algorithm can be matched in time to prevent data loss.
3. The array-type time series data compression method according to claim 1, wherein the plurality of algorithmic compression modes include: an algorithm designed around the data characteristics of the IEEE 754 standard floating-point storage format, a differential encoding algorithm, an XOR algorithm, an RLE algorithm, a Simple8b algorithm, a ZigZag algorithm, a Delta-of-Delta algorithm, a Snappy compression algorithm, an LZO block compression algorithm, a DEFLATE lossless data compression algorithm, and a Bit-packing compression algorithm.
4. An array-type time series data cross-dimension query method, in which compression is performed with the array-type time series data compression method, comprising the following steps:
S1: constructing the array-type time series data in the time series database, so that the currently acquired time series data is stored in the time series database in real time;
S2: constructing a physical hypertable from the time series data, so that the time series data can be observed intuitively when it is queried;
S3: compressing the acquired time series data, then processing the time series data in layers, and then inserting it into the time series database;
S4: constructing the underlying implementation of the time series data table based on time and a specified dimension, and splitting the whole data set for use in the subsequent cross-dimension query;
S5: constructing, according to the specified underlying query conditions, a query method based on time and the specified dimension, thereby realizing cross-dimension query of the time series data.
5. The method according to claim 4, wherein in step S1 a data pool is selected, the data pool being chosen so as to support application scenarios with highly concurrent database transaction connections, reduce the overhead of acquiring and releasing resources under high concurrency, respond faster to database client requests from applications, and handle frequent sessions between multiple applications and the time series data.
6. The method according to claim 4, wherein in step S1, when a storage model is constructed and the time series database has to store many time series fields, an array-table time series table structure is constructed in which dimension and index information is recorded separately, device models with the same dimensions and the same index types or counts can be multiplexed in, or stored independently in, the same index table, and the table is finally converted into a view that maps index values to fields for external use.
7. The method according to claim 4, wherein in step S2 the time series data is processed in layers to meet the performance and capacity requirements that actual cross-dimension service scenarios place on the time series database and to save infrastructure procurement cost, the overall strategy comprising: a hot data layer, a cold data layer and a historical data layer.
8. The array-type time series data cross-dimension query method according to claim 7, wherein:
the hot data layer stores the recent raw index data in the time series database that is frequently queried in real time, on a high-performance storage type such as SSD media; the hot data layer uses designated tablespaces, each hypertable can share multiple tablespaces, and each individual chunk is stored in exactly one of them;
as the time series data ages, the data of the hot data layer is gradually migrated to the cold data layer, compressed by TimescaleDB and stored on storage media; the cold data layer uses a designated tablespace that is separate from the tablespaces of the hot data layer;
the historical data layer uses object storage to keep cold data that has aged out of the cold data layer; the data is migrated out of, and deleted from, the time series database and stored in Parquet format.
9. The array-type time series data cross-dimension query method according to claim 4, 7 or 8, characterized in that in step S3, after the raw data is acquired, it is converted into time series data according to the compression logic of the array-type time series data compression method and written directly into the database, and the time series database carries out the business processing; in the current write path, multiple processes fetch data from the MQ in batches, each transaction writes multiple records in a batch, and a connection pool is used to improve connection performance, thereby reducing database overhead and improving the efficiency of writing time series data.
10. The method according to claim 4, wherein in step S5, for write operations on the time series database at an acquisition frequency of once every 5 seconds, the whole time series database is calculated at 11 records/second, and the result is TPS = 250 × 150 / 5 = 7500; based on the performance report of the time series database and the actual stress test report, the deployment architecture adopts a one-primary, two-replica streaming replication cluster that supports read-write separation between a master-service and a replica-service; the deployment architecture adopts a TimescaleDB single-node hypertable scheme and is additionally extended to distributed hypertables once that feature has matured; the single-hypertable streaming replication scheme supports the features required by the continuous aggregate, data retention, data tiering and data compression services; a single PG database server uses a virtual machine configured with at least 32 cores and 256 GB of memory to meet the high TPS requirement of time series data writes; for data storage, hot data meets the requirement of storing 10 days of full collection on SSD (solid state disk) storage media with a capacity of 2 TB, including the real-time archive storage space of the database; warm data meets the requirement of storing not less than 6 months of full collection with the compression mechanism enabled, with a capacity of 5.4 TB; cold data uses an object storage service and meets the requirement of storing historical data and archives for not less than 3 years; in addition, for data restoration, extra mounted storage space is provided so that the archive logs kept in object storage can be synchronized back to the data node for recovery.
CN202211506113.XA 2022-11-29 2022-11-29 Array type time sequence data compression and cross-dimension query method Pending CN115934792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211506113.XA CN115934792A (en) 2022-11-29 2022-11-29 Array type time sequence data compression and cross-dimension query method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211506113.XA CN115934792A (en) 2022-11-29 2022-11-29 Array type time sequence data compression and cross-dimension query method

Publications (1)

Publication Number Publication Date
CN115934792A true CN115934792A (en) 2023-04-07

Family

ID=86648407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211506113.XA Pending CN115934792A (en) 2022-11-29 2022-11-29 Array type time sequence data compression and cross-dimension query method

Country Status (1)

Country Link
CN (1) CN115934792A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination