CN116841973A - Data intelligent compression method and system for embedded database - Google Patents

Data intelligent compression method and system for embedded database

Info

Publication number
CN116841973A
CN116841973A (application CN202310830705.5A)
Authority
CN
China
Prior art keywords
data
embedded
embedded database
compression
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310830705.5A
Other languages
Chinese (zh)
Inventor
张昊然
王宏志
丁小欧
杨东华
左德承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310830705.5A priority Critical patent/CN116841973A/en
Publication of CN116841973A publication Critical patent/CN116841973A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/17 - Details of further file system functions
    • G06F16/174 - Redundancy elimination performed by the file system
    • G06F16/1744 - Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/30 - Monitoring
    • G06F11/3055 - Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/30 - Monitoring
    • G06F11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 - Performance evaluation by statistical analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/17 - Details of further file system functions
    • G06F16/1734 - Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/18 - File system types
    • G06F16/1865 - Transactional file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an intelligent data compression method and system for an embedded database, relating to the technical field of data compression and addressing the prior-art problem that compression is slow because an embedded database cannot adapt to different operating environments.

Description

Data intelligent compression method and system for embedded database
Technical Field
The application relates to the technical field of data compression, and in particular to an intelligent data compression method and system for an embedded database.
Background
Data compression is currently one of the key research topics in databases. Research on data compression algorithms matters for two reasons. First, compressed data improves storage-space utilization and therefore lowers the hardware cost of storing data. Second, when data of the same size is transmitted over the network in compressed form, fewer bytes need to be sent, which greatly improves transmission throughput and, in effect, increases the available network bandwidth. Both aspects can save enterprises and individuals substantial hardware and software costs. Because enterprises urgently need to reduce hardware costs, current exploration of data compression technology mainly focuses on data compression for server-side databases, where general-purpose algorithms such as gzip and improved algorithms based on LZ4 and ZSTD are widely used; these general but relatively heavy compression algorithms are called heavyweight algorithms. On the other hand, lightweight algorithms, including delta-of-delta and RLE (run-length encoding), which are compression algorithms built on bit-level operations, are also frequently used. In academia, a recent hot spot is integrating neural networks or reinforcement learning into compression algorithms to make them more intelligent, such as VAE-improved compression algorithms and GAN-improved lossy compression algorithms, which have achieved good results.
Existing algorithms, whether heavyweight or lightweight, and whether from industry or academia, can be classified by application mode: judged by the database background against which they were mainly studied, they concentrate on distributed databases; by storage mode, they mainly focus on row-store and column-store databases; and by the kind of data stored, they mainly focus on relational databases and time-series databases. All of these application scenarios share a common point: the algorithms are usually deployed on a server and are specifically designed to provide data compression solutions for a server-side database. A server-side database is characterized by a single database running on a single device, whose resources are typically dedicated to that database. The algorithms proposed so far are therefore usually implemented under the assumption of sufficient resources (especially computing resources) and a reliable network and device environment, with little consideration of device factors such as power consumption. For embedded databases, however, the operating environment is often harsh: they may be deployed not only in server rooms but also in high-interference, unstable environments such as factories and farmland, where the network and computing resources available to the embedded database are reduced, and power consumption must often be considered to prevent battery drain or increased cost caused by excessive power consumption. The complex environments and structures in which embedded databases are deployed dictate that a compression algorithm must be tailored specifically for them.
On the other hand, embedded databases are extremely widely used, with more than 50 billion downloads to date, covering Internet of Things devices (such as temperature sensors and monitoring devices), mobile phones (both Android and iOS make heavy use of embedded databases), and personal computers (in both application programs and system services). In these settings it is almost impossible for a DBA to dig into the code in real time and continuously tune the database's performance; once deployed, an embedded database can only keep running mechanically. As a result, the various properties of the embedded database cannot adapt to different environments, which leads to slow compression.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the prior-art problem that compression is slow because an embedded database cannot adapt to different environments, an intelligent data compression method and system for an embedded database are provided.
The technical scheme adopted by the invention to solve the above technical problem is as follows:
The intelligent data compression method for the embedded database comprises the following steps:
Step one: detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, do not compress; if neither the CPU nor the memory usage exceeds 70%, compress and execute step two;
Step two: determine whether connections to the embedded databases already exist; if not, connect to the embedded databases; if they do exist, send a message to each embedded database to query its current transaction state and restrictions; an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed and step three is executed;
Step three: for each embedded database to be compressed, first count the most recent transactions involving each table in the embedded database; if a table's data has not been accessed within its last 100 data-related transactions, mark the table as a rarely used table; then cluster the data in each rarely used table by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
Step four: evaluate the data-level characteristics of each data set obtained in step three, and select the data set to be compressed according to the evaluation result;
step four is specifically as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular; the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of the data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
Step five: use Q-learning to select the compression algorithm, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, and the compression algorithm is obtained through successive iterations;
Step six: compress the data set obtained in step four with the compression algorithm selected in step five.
Further, the specific steps of step one are as follows:
A daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither the CPU nor the memory usage exceeds 70%, compression is performed and step two is executed.
Further, the specific steps of creating the daemon are as follows:
A Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage.
Further, the specific steps of connecting to the embedded databases are as follows:
A connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection to that database is established, and in subsequent operation it is maintained through a heartbeat mechanism.
Further, the maximum number of data rows in each set in step three is 1000.
Further, the manner of using Q-learning to select the compression algorithm in step five is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible.
Further, the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm.
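As an illustration of the two lightweight encodings named above, the following Python sketch shows a minimal run-length encoder and a minimal delta-of-delta encoder; the exact variants used by the method are not specified in this application, so the code is illustrative only.

from typing import List, Tuple

def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    # Run-length encoding: collapse runs of equal values into (value, count) pairs.
    encoded: List[Tuple[int, int]] = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    # Delta-of-delta: keep the first timestamp and the first delta, then store only
    # the change between consecutive deltas, which is small for regular time series.
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

print(rle_encode([7, 7, 7, 3, 3, 9]))               # [(7, 3), (3, 2), (9, 1)]
print(delta_of_delta_encode([100, 110, 120, 131]))  # [100, 10, 0, 1]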
Further, the specific compression steps in step six are as follows:
During compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected in step five to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept.
Further, the method further comprises a step seven, whose specific steps are as follows:
All compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
The intelligent data compression system for the embedded database comprises a system detection module, a connection judgment module, a data set classification module, a data set evaluation module, and a Q-learning module;
the system detection module is used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed; if neither exceeds 70%, compression is performed;
the connection judgment module is used to determine whether connections to the embedded databases already exist; if not, the embedded databases are connected; if they do exist, a message is sent to each embedded database to query its current transaction state and restrictions, and an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed;
the data set classification module is used, for each embedded database to be compressed, to count the most recent transactions involving each table in the embedded database; a table whose data has not been accessed within its last 100 data-related transactions is marked as a rarely used table; the data in each rarely used table is then clustered by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
the data set evaluation module is used to evaluate the data-level characteristics of each data set obtained by the data set classification module and to select the data set to be compressed according to the evaluation result;
the data set evaluation module specifically operates as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular;
the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of a data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
the Q-learning module is used to select the compression algorithm by Q-learning, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, the compression algorithm is obtained through successive iterations, and the data set obtained by the data set evaluation module is compressed;
the specific steps of the system detection module are as follows:
a daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither exceeds 70%, compression is performed;
the specific steps of creating the daemon are as follows:
a Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage;
the specific steps of connecting to the embedded databases are as follows:
a connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism;
the maximum number of data rows in each set in the data set classification module is 1000;
the manner in which the Q-learning module uses Q-learning to select the compression algorithm is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible;
the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm;
the specific compression steps in the Q-learning module are as follows:
during compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected by the Q-learning module to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept;
the Q-learning module further performs the following steps:
all compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
The beneficial effects of the application are as follows:
The application can classify and identify different scenarios and system conditions, automatically select the required compression algorithm accordingly, and automatically decompress when needed. It is therefore applicable to a variety of operating environments, such as Internet of Things devices, mobile phones, and personal computers; it improves the compression speed of the embedded database for different environments, intelligently judges when to compress, and adapts and adjusts itself according to the resources available in the environment, so as to achieve intelligent compression and decompression of the embedded database and ultimately save storage space, improve resource utilization, and accelerate network transmission.
Drawings
FIG. 1 is a first flowchart of the present application;
FIG. 2 is a second flowchart of the present application;
FIG. 3 is a third flowchart of the present application.
Detailed Description
It should be noted that the specific embodiments of the present disclosure may be combined with one another where there is no conflict.
Embodiment one: referring to FIG. 1, the intelligent data compression method for an embedded database according to this embodiment specifically comprises the following steps:
Step one: detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, do not compress; if neither the CPU nor the memory usage exceeds 70%, compress and execute step two;
Step two: determine whether connections to the embedded databases already exist; if not, connect to the embedded databases; if they do exist, send a message to each embedded database to query its current transaction state and restrictions; an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed and step three is executed;
Step three: for each embedded database to be compressed, first count the most recent transactions involving each table in the embedded database; if a table's data has not been accessed within its last 100 data-related transactions, mark the table as a rarely used table; then cluster the data in each rarely used table by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
Step four: evaluate the data-level characteristics of each data set obtained in step three, and select the data set to be compressed according to the evaluation result;
step four is specifically as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular; the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of the data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression, as shown in FIG. 1.
Step five: use Q-learning to select the compression algorithm, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, and the compression algorithm is obtained through successive iterations, as shown in FIG. 2.
Step six: compress the data set obtained in step four with the compression algorithm selected in step five. The compression step is shown in FIG. 3.
Embodiment two: this embodiment further describes embodiment one; the difference from embodiment one is that the specific steps of step one are as follows:
A daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither the CPU nor the memory usage exceeds 70%, compression is performed and step two is executed.
Embodiment three: this embodiment further describes embodiment two; the difference from embodiment two is that the specific steps of creating the daemon are as follows:
A Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage.
Embodiment four: this embodiment further describes embodiment three; the difference from embodiment three is that the specific steps of connecting to the embedded databases are as follows:
A connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism.
Embodiment five: this embodiment further describes embodiment four; the difference from embodiment four is that the maximum number of data rows in each set in step three is 1000.
Embodiment six: this embodiment further describes embodiment five; the difference from embodiment five is that the manner of using Q-learning to select the compression algorithm in step five is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible.
Embodiment seven: this embodiment further describes embodiment six; the difference from embodiment six is that the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm.
Embodiment eight: this embodiment further describes embodiment seven; the difference from embodiment seven is that the specific compression steps in step six are as follows:
During compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected in step five to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept.
Embodiment nine: this embodiment further describes embodiment eight; the difference from embodiment eight is that the method further comprises a step seven, whose specific steps are as follows:
All compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
Embodiment ten: the intelligent data compression system for the embedded database according to this embodiment comprises a system detection module, a connection judgment module, a data set classification module, a data set evaluation module, and a Q-learning module;
the system detection module is used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed; if neither exceeds 70%, compression is performed;
the connection judgment module is used to determine whether connections to the embedded databases already exist; if not, the embedded databases are connected; if they do exist, a message is sent to each embedded database to query its current transaction state and restrictions, and an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed;
the data set classification module is used, for each embedded database to be compressed, to count the most recent transactions involving each table in the embedded database; a table whose data has not been accessed within its last 100 data-related transactions is marked as a rarely used table; the data in each rarely used table is then clustered by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
the data set evaluation module is used to evaluate the data-level characteristics of each data set obtained by the data set classification module and to select the data set to be compressed according to the evaluation result;
the data set evaluation module specifically operates as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular;
the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of a data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
the Q-learning module is used to select the compression algorithm by Q-learning, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, the compression algorithm is obtained through successive iterations, and the data set obtained by the data set evaluation module is compressed;
the specific steps of the system detection module are as follows:
a daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither exceeds 70%, compression is performed;
the specific steps of creating the daemon are as follows:
a Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage;
the specific steps of connecting to the embedded databases are as follows:
a connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism;
the maximum number of data rows in each set in the data set classification module is 1000;
the manner in which the Q-learning module uses Q-learning to select the compression algorithm is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible;
the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm;
the specific compression steps in the Q-learning module are as follows:
during compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected by the Q-learning module to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept;
the Q-learning module further performs the following steps:
all compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
Example: an intelligent data compression method for an embedded database, comprising the following steps:
S1. A Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the system's current CPU, memory, and hard disk usage. If the current CPU or memory usage exceeds 70%, no compression is performed: other programs in the system are occupying its resources, and no additional data compression plan can be carried out. If neither the CPU nor the memory usage exceeds 70% and the current hard disk usage exceeds 70%, the pressure from other programs on the system is low while the hard disk usage is high, so the necessity and benefit of data compression are high, and data compression is started. If the current decision is not to compress, the daemon goes directly into a dormant state and wakes up again after 5 minutes to judge again.
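A minimal sketch of the daemon described in S1 is shown below, assuming the psutil package is available for reading resource usage; the 70% threshold and the 5-minute interval follow the description, while the function names and the compression entry point are illustrative placeholders.

import time
import psutil

CPU_MEM_LIMIT = 70.0    # do not compress if CPU or memory usage exceeds this
CHECK_INTERVAL = 300    # the daemon wakes up every 5 minutes

def system_allows_compression() -> bool:
    # Sample the current CPU, memory, and hard disk usage of the system.
    cpu = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory().percent
    disk = psutil.disk_usage("/").percent
    print(f"cpu={cpu:.1f}% mem={mem:.1f}% disk={disk:.1f}%")
    return cpu <= CPU_MEM_LIMIT and mem <= CPU_MEM_LIMIT

def start_compression_plan() -> None:
    # Placeholder for the S2-S7 pipeline (connection check, clustering,
    # evaluation, Q-learning selection, compression).
    pass

def daemon_loop() -> None:
    while True:
        if system_allows_compression():
            start_compression_plan()
        time.sleep(CHECK_INTERVAL)    # sleep until the next 5-minute wake-up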
S2. If S1 decides to perform data compression, that is, neither the CPU nor the memory usage exceeds 70% and the current system level allows a data compression plan, the second step checks whether the databases present in the current system can undergo data compression. An embedded database runs embedded in a program, so several database instances may be running in the same system. If no embedded database is currently connected, a connection request is sent to each embedded database through a specific port number; after an embedded database agrees to connect, the connection is established and maintained through a heartbeat mechanism in subsequent operation. If database connections already exist, a message is sent to each database to query its current transaction state and restrictions: a database with a transaction in progress cannot be compressed, and a database with a data-modification restriction cannot be compressed. Data in a database judged incompressible is skipped during compression; otherwise the data in that database is selected for compression.
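The port-based connection request and heartbeat protocol of S2 are not specified in detail, so the sketch below only illustrates the per-instance eligibility check, using SQLite as a stand-in embedded database: an instance with a write transaction in progress is skipped.

import sqlite3

def database_is_compressible(path: str) -> bool:
    # Return True if this SQLite instance currently has no write transaction in
    # progress; whether data modification is otherwise restricted would be queried
    # through the method's own messaging channel, which is not modelled here.
    conn = sqlite3.connect(path, timeout=0)
    try:
        conn.execute("BEGIN IMMEDIATE")   # fails immediately if another writer holds the lock
        conn.rollback()
        return True
    except sqlite3.OperationalError:      # "database is locked": a transaction is in progress
        return False
    finally:
        conn.close()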
S3. If S2 judges that the current database is suitable for data compression, the data that has never been accessed in the last 100 data-related transactions of the database is marked first. The specific marking method processes the tables one by one: the most recent transaction that used each table is counted, and a table is marked as a rarely used table if more than 100 transactions have passed since it was last used. The data in each such table is then clustered by primary key or timestamp: rows whose primary key values are close or whose timestamps are similar are grouped into one set by the K-Means algorithm, so that each table is divided into different sets; the maximum number of data rows in a set is 1000, and each such set is called a data set.
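A minimal sketch of the grouping in S3 follows, assuming scikit-learn is available and that each row of a rarely used table is identified by a numeric primary key or Unix timestamp; the cluster count and variable names are illustrative, while the 1000-row cap comes from the description.

import numpy as np
from sklearn.cluster import KMeans

MAX_ROWS_PER_SET = 1000

def split_into_data_sets(keys: np.ndarray) -> list:
    # Cluster the rows of one rarely used table by primary key or timestamp,
    # then cap every resulting set at 1000 rows.
    n_clusters = max(1, int(np.ceil(len(keys) / MAX_ROWS_PER_SET)))
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(keys.reshape(-1, 1))
    data_sets = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]           # row indices assigned to cluster c
        for start in range(0, len(members), MAX_ROWS_PER_SET):
            data_sets.append(members[start:start + MAX_ROWS_PER_SET])
    return data_sets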
S4. The data-level characteristics of each data set obtained in S3 are evaluated, specifically as follows. The more transactions that have passed since the database last read the data set, the lower the query frequency for the data in the set, the less likely the set is to be accessed again and need decompression after compression, and the more suitable it is for compression; the read index value is increased by 1 for every 10 transactions during which no read occurs. Likewise, the more transactions that have passed since the last write, the lower the current write frequency for the data; because writing to already-compressed data changes the whole compressed file and forces recompression, the write operation is given a higher weight than the read operation, and the write index value is increased by 1 for every 10 transactions during which no write occurs. The structure of the data is regular if the data in the set is stored regularly, as with time-series data in a time-series database, where the timestamp, fields, and label length of each column of each row are similar and change little; in that case the structure index value is marked 1, because compression can then reach the maximum compression ratio and save more hardware space, making the data well suited to compression; if the structure is irregular, the index value is marked 0. The weight of the read operation is 1, the weight of the write operation is 2, and the weight of the data structure is 1; the index values are multiplied by their weights and summed to obtain the total weight of a data set. The sets are sorted from high to low by total weight, and data compression is performed only on the currently most suitable data set, so that the compression operation does not expand to data that will be accessed in the future, which would burden the system and require decompression; therefore only one set is selected for compression.
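The scoring rule of S4 can be sketched as follows; the per-10-transaction increments and the weights 1, 2, and 1 are taken from the description, while the record fields are illustrative.

from dataclasses import dataclass

READ_WEIGHT, WRITE_WEIGHT, STRUCTURE_WEIGHT = 1, 2, 1

@dataclass
class DataSetStats:
    txns_since_last_read: int    # transactions since the data set was last read
    txns_since_last_write: int   # transactions since the data set was last written
    regular_structure: bool      # e.g. time-series rows with uniform fields

def total_weight(s: DataSetStats) -> int:
    read_index = s.txns_since_last_read // 10       # +1 per 10 read-free transactions
    write_index = s.txns_since_last_write // 10     # +1 per 10 write-free transactions
    structure_index = 1 if s.regular_structure else 0
    return (read_index * READ_WEIGHT
            + write_index * WRITE_WEIGHT
            + structure_index * STRUCTURE_WEIGHT)

# The single data set with the highest total weight is selected for compression:
candidates = [DataSetStats(35, 80, True), DataSetStats(12, 15, False)]
best = max(candidates, key=total_weight)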
S5. The algorithm obtains the current CPU, memory, and hard disk usage of the system and selects the required data compression method according to these states. Q-learning is used to select the compression algorithm: the state of the system is the Q-learning state input, the choice of compression algorithm is the action, and the reward is calculated from the throughput improvement of the compressed database and the reduction in occupied resources; the most suitable compression algorithm is obtained through successive iterations. When the CPU and memory usage is relatively high, reaching 50% or more but below 70%, other processes are running in the system and the algorithm should select a lightweight compression algorithm such as RLE or delta-of-delta, the specific choice being output by Q-learning. If the system's CPU and memory usage is low, below 50%, the Q-learning algorithm tends to select a heavyweight compression algorithm such as the Gzip algorithm. When hard disk usage is low, for example below 50%, the Q-learning algorithm is more inclined to use a lightweight compression algorithm that saves time but compresses less; when hard disk usage is high, for example above 50%, it is more inclined to use a heavyweight compression algorithm with a high compression ratio but lower efficiency. In addition, when the hard disk usage is very high, for example above 95%, the high usage is penalized in the reward function; in that case the system's CPU and memory usage is ignored and a heavyweight algorithm is used directly to compress the data on the hard disk, preventing the hard disk usage from reaching 100% and data from being lost. The candidate compression algorithms include lightweight algorithms such as RLE and delta-of-delta, and heavyweight general-purpose algorithms such as the Gzip algorithm.
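A minimal tabular Q-learning sketch of the selection step in S5 is given below. The description fixes the state inputs (CPU, memory, and hard disk usage), the actions (lightweight versus heavyweight algorithms), and the reward signals (throughput gain, resource reduction, and a penalty for very high disk usage); the discretisation, reward shaping, and hyper-parameters here are illustrative.

import random
from collections import defaultdict

ACTIONS = ["RLE", "delta-of-delta", "Gzip"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

q_table = defaultdict(lambda: [0.0] * len(ACTIONS))

def discretise(cpu: float, mem: float, disk: float) -> tuple:
    bucket = lambda x: min(int(x // 10), 9)   # 10% usage buckets
    return (bucket(cpu), bucket(mem), bucket(disk))

def choose_algorithm(state: tuple) -> int:
    if random.random() < EPSILON:             # occasional exploration
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_table[state][a])

def reward(throughput_gain: float, resource_drop: float, disk: float) -> float:
    r = throughput_gain + resource_drop
    if disk > 95:                             # penalise very high hard disk usage
        r -= 10.0
    return r

def update(state: tuple, action: int, r: float, next_state: tuple) -> None:
    best_next = max(q_table[next_state])
    q_table[state][action] += ALPHA * (r + GAMMA * best_next - q_table[state][action])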
S6. The data set selected for compression in S4 is compressed with the compression algorithm output in S5. During compression, the data set determined to be compressed is first extracted from the database table and stored in a single file; the file is compressed with the compression algorithm output in S5 to obtain a compressed file; finally the compressed file is stored in a separate folder in which all compressed data files are kept.
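The extraction-and-compression path of S6 is sketched below for the heavyweight case, using SQLite and the standard-library gzip module; the one-file-per-data-set layout and the archive folder follow the description, while the names and paths are illustrative.

import csv
import gzip
import sqlite3
from pathlib import Path

ARCHIVE_DIR = Path("compressed_sets")

def compress_data_set(db_path: str, table: str, rowids: list, set_id: str) -> Path:
    # Extract the selected rows into a single file and store it compressed
    # in the archive folder that holds all compressed data files.
    ARCHIVE_DIR.mkdir(exist_ok=True)
    conn = sqlite3.connect(db_path)
    placeholders = ",".join("?" * len(rowids))
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE rowid IN ({placeholders})", rowids).fetchall()
    conn.close()

    out_path = ARCHIVE_DIR / f"{set_id}.csv.gz"
    with gzip.open(out_path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)
    return out_path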
S7. The compressed data files obtained in S6 are recorded, and a B+ tree index is built for them. At the same time, the transactions of the database are monitored in real time: as soon as a transaction involves compressed data, the compressed data file is located through the index, the data is decompressed promptly and reassembled into the database table to minimize the impact on the transaction, and finally the read or write transaction record of the data is updated.
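The restore path of S7 can be sketched as follows; a plain dictionary stands in for the B+ tree index (the Python standard library has no B+ tree), and the trigger that detects a transaction touching compressed data is outside the scope of this sketch.

import csv
import gzip
import sqlite3
from pathlib import Path

compressed_index = {}   # (table, set_id) -> path of the compressed data file

def register(table: str, set_id: str, path: Path) -> None:
    compressed_index[(table, set_id)] = path

def restore_data_set(db_path: str, table: str, set_id: str) -> None:
    # Decompress one archived data set and return its rows to the table.
    # (CSV re-import keeps values as text; type restoration is omitted in this sketch.)
    path = compressed_index.pop((table, set_id))
    with gzip.open(path, "rt", newline="") as f:
        rows = [tuple(r) for r in csv.reader(f)]
    conn = sqlite3.connect(db_path)
    placeholders = ",".join("?" * len(rows[0]))
    with conn:                                # one transaction for the re-insert
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.close()
    path.unlink()                             # drop the archive after a successful restore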
It should be noted that the above detailed description merely explains and illustrates the technical solution of the present invention and does not limit the scope of protection of the claims. All changes that come within the meaning and range of equivalency of the claims and the specification are intended to be embraced within their scope.

Claims (10)

1. An intelligent data compression method for an embedded database, characterized by comprising the following steps:
step one: detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, do not compress; if neither the CPU nor the memory usage exceeds 70%, compress and execute step two;
step two: determine whether connections to the embedded databases already exist; if not, connect to the embedded databases; if they do exist, send a message to each embedded database to query its current transaction state and restrictions; an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed and step three is executed;
step three: for each embedded database to be compressed, first count the most recent transactions involving each table in the embedded database; if a table's data has not been accessed within its last 100 data-related transactions, mark the table as a rarely used table; then cluster the data in each rarely used table by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
step four: evaluate the data-level characteristics of each data set obtained in step three, and select the data set to be compressed according to the evaluation result;
step four is specifically as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular; the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of the data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
step five: use Q-learning to select the compression algorithm, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, and the compression algorithm is obtained through successive iterations;
step six: compress the data set obtained in step four with the compression algorithm selected in step five.
2. The intelligent data compression method for an embedded database according to claim 1, characterized in that the specific steps of step one are as follows:
a daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither the CPU nor the memory usage exceeds 70%, compression is performed and step two is executed.
3. The intelligent data compression method for an embedded database according to claim 2, characterized in that the specific steps of creating the daemon are as follows:
a Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage.
4. The intelligent data compression method for an embedded database according to claim 3, characterized in that the specific steps of connecting to the embedded databases are as follows:
a connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism.
5. The intelligent data compression method for an embedded database according to claim 4, characterized in that the maximum number of data rows in each set in step three is 1000.
6. The intelligent data compression method for an embedded database according to claim 5, characterized in that the manner of using Q-learning to select the compression algorithm in step five is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible.
7. The intelligent data compression method for an embedded database according to claim 6, characterized in that the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm.
8. The intelligent data compression method for an embedded database according to claim 7, characterized in that the specific compression steps in step six are as follows:
during compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected in step five to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept.
9. The intelligent data compression method for an embedded database according to claim 8, characterized in that the method further comprises a step seven, whose specific steps are as follows:
all compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
10. An intelligent data compression system for an embedded database, characterized by comprising a system detection module, a connection judgment module, a data set classification module, a data set evaluation module, and a Q-learning module;
the system detection module is used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed; if neither exceeds 70%, compression is performed;
the connection judgment module is used to determine whether connections to the embedded databases already exist; if not, the embedded databases are connected; if they do exist, a message is sent to each embedded database to query its current transaction state and restrictions, and an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed;
the data set classification module is used, for each embedded database to be compressed, to count the most recent transactions involving each table in the embedded database; a table whose data has not been accessed within its last 100 data-related transactions is marked as a rarely used table; the data in each rarely used table is then clustered by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
the data set evaluation module is used to evaluate the data-level characteristics of each data set obtained by the data set classification module and to select the data set to be compressed according to the evaluation result;
the data set evaluation module specifically operates as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular;
the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of a data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
the Q-learning module is used to select the compression algorithm by Q-learning, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, the compression algorithm is obtained through successive iterations, and the data set obtained by the data set evaluation module is compressed;
the specific steps of the system detection module are as follows:
a daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither exceeds 70%, compression is performed;
the specific steps of creating the daemon are as follows:
a Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage;
the specific steps of connecting to the embedded databases are as follows:
a connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism;
the maximum number of data rows in each set in the data set classification module is 1000;
the manner in which the Q-learning module uses Q-learning to select the compression algorithm is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible;
the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm;
the specific compression steps in the Q-learning module are as follows:
during compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected by the Q-learning module to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept;
the Q-learning module further performs the following steps:
all compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
CN202310830705.5A 2023-07-07 2023-07-07 Data intelligent compression method and system for embedded database Pending CN116841973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310830705.5A CN116841973A (en) 2023-07-07 2023-07-07 Data intelligent compression method and system for embedded database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310830705.5A CN116841973A (en) 2023-07-07 2023-07-07 Data intelligent compression method and system for embedded database

Publications (1)

Publication Number Publication Date
CN116841973A true CN116841973A (en) 2023-10-03

Family

ID=88170322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310830705.5A Pending CN116841973A (en) 2023-07-07 2023-07-07 Data intelligent compression method and system for embedded database

Country Status (1)

Country Link
CN (1) CN116841973A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421288A (en) * 2023-12-18 2024-01-19 云和恩墨(北京)信息技术有限公司 Database data compression method and device
CN117421288B (en) * 2023-12-18 2024-06-11 云和恩墨(北京)信息技术有限公司 Database data compression method and device
CN118069894A (en) * 2024-04-12 2024-05-24 乾健科技有限公司 Big data storage management method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination