CN116841973A - Data intelligent compression method and system for embedded database - Google Patents

Data intelligent compression method and system for embedded database

Info

Publication number
CN116841973A
CN116841973A (application CN202310830705.5A)
Authority
CN
China
Prior art keywords
data
embedded
embedded database
compression
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310830705.5A
Other languages
Chinese (zh)
Inventor
张昊然
王宏志
丁小欧
杨东华
左德承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310830705.5A priority Critical patent/CN116841973A/en
Publication of CN116841973A publication Critical patent/CN116841973A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/17 - Details of further file system functions
    • G06F16/174 - Redundancy elimination performed by the file system
    • G06F16/1744 - Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/30 - Monitoring
    • G06F11/3055 - Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/30 - Monitoring
    • G06F11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 - Performance evaluation by statistical analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/17 - Details of further file system functions
    • G06F16/1734 - Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/18 - File system types
    • G06F16/1865 - Transactional file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an intelligent data compression method and system for an embedded database, relating to the technical field of data compression and addressing the prior-art problem that compression is slow because an embedded database cannot adapt to different operating environments.

Description

Data intelligent compression method and system for embedded database
Technical Field
The application relates to the technical field of data compression, and in particular to an intelligent data compression method and system for an embedded database.
Background
Data compression is currently one of the key research topics in databases. Research on data compression algorithms matters for two reasons. First, compressed data improves storage-space utilization and therefore lowers the hardware cost of storing data. Second, when data of the same size is transmitted over the network in compressed form, fewer bytes need to be sent, which greatly improves transmission throughput and, in effect, increases the available network bandwidth. Both aspects can save enterprises and individuals substantial hardware and software costs. Because enterprises urgently need to reduce hardware costs, current exploration of data compression technology mainly focuses on data compression for server-side databases, where general-purpose algorithms such as gzip and improved algorithms based on LZ4 and ZSTD are widely used; these general but relatively heavy compression algorithms are called heavyweight algorithms. On the other hand, lightweight algorithms, including delta-of-delta and RLE (run-length encoding), which are compression algorithms built on bit-level operations, are also frequently used. In academia, a recent hot spot is integrating neural networks or reinforcement learning into compression algorithms to make them more intelligent, such as VAE-improved compression algorithms and GAN-improved lossy compression algorithms, which have achieved good results.
Existing algorithms, whether heavyweight or lightweight, and whether from industry or academia, can be classified by application mode: judged by the database background against which they were mainly studied, they concentrate on distributed databases; by storage mode, they mainly focus on row-store and column-store databases; and by the kind of data stored, they mainly focus on relational databases and time-series databases. All of these application scenarios share a common point: the algorithms are usually deployed on a server and are specifically designed to provide data compression solutions for a server-side database. A server-side database is characterized by a single database running on a single device, whose resources are typically dedicated to that database. The algorithms proposed so far are therefore usually implemented under the assumption of sufficient resources (especially computing resources) and a reliable network and device environment, with little consideration of device factors such as power consumption. For embedded databases, however, the operating environment is often harsh: they may be deployed not only in server rooms but also in high-interference, unstable environments such as factories and farmland, where the network and computing resources available to the embedded database are reduced, and power consumption must often be considered to prevent battery drain or increased cost caused by excessive power consumption. The complex environments and structures in which embedded databases are deployed dictate that a compression algorithm must be tailored specifically for them.
On the other hand, embedded databases are extremely widely used, with more than 50 billion downloads to date, covering Internet of Things devices (such as temperature sensors and monitoring devices), mobile phones (both Android and iOS make heavy use of embedded databases), and personal computers (in both application programs and system services). In these settings it is almost impossible for a DBA to dig into the code in real time and continuously tune the database's performance; once deployed, an embedded database can only keep running mechanically. As a result, the various properties of the embedded database cannot adapt to different environments, which leads to slow compression.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the prior-art problem that compression is slow because an embedded database cannot adapt to different environments, an intelligent data compression method and system for an embedded database are provided.
The technical scheme adopted by the invention to solve the above technical problem is as follows:
The intelligent data compression method for the embedded database comprises the following steps:
Step one: detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, do not compress; if neither the CPU nor the memory usage exceeds 70%, compress and execute step two;
Step two: determine whether connections to the embedded databases already exist; if not, connect to the embedded databases; if they do exist, send a message to each embedded database to query its current transaction state and restrictions; an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed and step three is executed;
Step three: for each embedded database to be compressed, first count the most recent transactions involving each table in the embedded database; if a table's data has not been accessed within its last 100 data-related transactions, mark the table as a rarely used table; then cluster the data in each rarely used table by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
Step four: evaluate the data-level characteristics of each data set obtained in step three, and select the data set to be compressed according to the evaluation result;
step four is specifically as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular; the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of the data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
Step five: use Q-learning to select the compression algorithm, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, and the compression algorithm is obtained through successive iterations;
Step six: compress the data set obtained in step four with the compression algorithm selected in step five.
Further, the specific steps of step one are as follows:
A daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither the CPU nor the memory usage exceeds 70%, compression is performed and step two is executed.
Further, the specific steps of creating the daemon are as follows:
A Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage.
Further, the specific steps of connecting to the embedded databases are as follows:
A connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection to that database is established, and in subsequent operation it is maintained through a heartbeat mechanism.
Further, the maximum number of data rows in each set in step three is 1000.
Further, the manner of using Q-learning to select the compression algorithm in step five is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible.
Further, the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm.
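As an illustration of the two lightweight encodings named above, the following Python sketch shows a minimal run-length encoder and a minimal delta-of-delta encoder; the exact variants used by the method are not specified in this application, so the code is illustrative only.

from typing import List, Tuple

def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    # Run-length encoding: collapse runs of equal values into (value, count) pairs.
    encoded: List[Tuple[int, int]] = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    # Delta-of-delta: keep the first timestamp and the first delta, then store only
    # the change between consecutive deltas, which is small for regular time series.
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

print(rle_encode([7, 7, 7, 3, 3, 9]))               # [(7, 3), (3, 2), (9, 1)]
print(delta_of_delta_encode([100, 110, 120, 131]))  # [100, 10, 0, 1]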
Further, the specific compression steps in step six are as follows:
During compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected in step five to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept.
Further, the method further comprises a step seven, whose specific steps are as follows:
All compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
The intelligent data compression system for the embedded database comprises a system detection module, a connection judgment module, a data set classification module, a data set evaluation module, and a Q-learning module;
the system detection module is used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed; if neither exceeds 70%, compression is performed;
the connection judgment module is used to determine whether connections to the embedded databases already exist; if not, the embedded databases are connected; if they do exist, a message is sent to each embedded database to query its current transaction state and restrictions, and an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed;
the data set classification module is used, for each embedded database to be compressed, to count the most recent transactions involving each table in the embedded database; a table whose data has not been accessed within its last 100 data-related transactions is marked as a rarely used table; the data in each rarely used table is then clustered by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
the data set evaluation module is used to evaluate the data-level characteristics of each data set obtained by the data set classification module and to select the data set to be compressed according to the evaluation result;
the data set evaluation module specifically operates as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular;
the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of a data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
the Q-learning module is used to select the compression algorithm by Q-learning, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, the compression algorithm is obtained through successive iterations, and the data set obtained by the data set evaluation module is compressed;
the specific steps of the system detection module are as follows:
a daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither exceeds 70%, compression is performed;
the specific steps of creating the daemon are as follows:
a Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage;
the specific steps of connecting to the embedded databases are as follows:
a connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism;
the maximum number of data rows in each set in the data set classification module is 1000;
the manner in which the Q-learning module uses Q-learning to select the compression algorithm is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible;
the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm;
the specific compression steps in the Q-learning module are as follows:
during compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected by the Q-learning module to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept;
the Q-learning module further performs the following steps:
all compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
The beneficial effects of the application are as follows:
The application can classify and identify different scenarios and system conditions, automatically select the required compression algorithm accordingly, and automatically decompress when needed. It is therefore applicable to a variety of operating environments, such as Internet of Things devices, mobile phones, and personal computers; it improves the compression speed of the embedded database for different environments, intelligently judges when to compress, and adapts and adjusts itself according to the resources available in the environment, so as to achieve intelligent compression and decompression of the embedded database and ultimately save storage space, improve resource utilization, and accelerate network transmission.
Drawings
FIG. 1 is a first flowchart of the present application;
FIG. 2 is a second flowchart of the present application;
FIG. 3 is a third flowchart of the present application.
Detailed Description
It should be noted that the specific embodiments of the present disclosure may be combined with one another where there is no conflict.
Embodiment one: referring to FIG. 1, the intelligent data compression method for an embedded database according to this embodiment specifically comprises the following steps:
Step one: detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, do not compress; if neither the CPU nor the memory usage exceeds 70%, compress and execute step two;
Step two: determine whether connections to the embedded databases already exist; if not, connect to the embedded databases; if they do exist, send a message to each embedded database to query its current transaction state and restrictions; an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed and step three is executed;
Step three: for each embedded database to be compressed, first count the most recent transactions involving each table in the embedded database; if a table's data has not been accessed within its last 100 data-related transactions, mark the table as a rarely used table; then cluster the data in each rarely used table by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
Step four: evaluate the data-level characteristics of each data set obtained in step three, and select the data set to be compressed according to the evaluation result;
step four is specifically as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular; the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of the data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression, as shown in FIG. 1.
Step five: use Q-learning to select the compression algorithm, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, and the compression algorithm is obtained through successive iterations, as shown in FIG. 2.
Step six: compress the data set obtained in step four with the compression algorithm selected in step five. The compression step is shown in FIG. 3.
Embodiment two: this embodiment further describes embodiment one; the difference from embodiment one is that the specific steps of step one are as follows:
A daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither the CPU nor the memory usage exceeds 70%, compression is performed and step two is executed.
Embodiment three: this embodiment further describes embodiment two; the difference from embodiment two is that the specific steps of creating the daemon are as follows:
A Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage.
Embodiment four: this embodiment further describes embodiment three; the difference from embodiment three is that the specific steps of connecting to the embedded databases are as follows:
A connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism.
Embodiment five: this embodiment further describes embodiment four; the difference from embodiment four is that the maximum number of data rows in each set in step three is 1000.
Embodiment six: this embodiment further describes embodiment five; the difference from embodiment five is that the manner of using Q-learning to select the compression algorithm in step five is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible.
Embodiment seven: this embodiment further describes embodiment six; the difference from embodiment six is that the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm.
Embodiment eight: this embodiment further describes embodiment seven; the difference from embodiment seven is that the specific compression steps in step six are as follows:
During compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected in step five to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept.
Embodiment nine: this embodiment further describes embodiment eight; the difference from embodiment eight is that the method further comprises a step seven, whose specific steps are as follows:
All compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
Embodiment ten: the intelligent data compression system for the embedded database according to this embodiment comprises a system detection module, a connection judgment module, a data set classification module, a data set evaluation module, and a Q-learning module;
the system detection module is used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed; if neither exceeds 70%, compression is performed;
the connection judgment module is used to determine whether connections to the embedded databases already exist; if not, the embedded databases are connected; if they do exist, a message is sent to each embedded database to query its current transaction state and restrictions, and an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed;
the data set classification module is used, for each embedded database to be compressed, to count the most recent transactions involving each table in the embedded database; a table whose data has not been accessed within its last 100 data-related transactions is marked as a rarely used table; the data in each rarely used table is then clustered by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
the data set evaluation module is used to evaluate the data-level characteristics of each data set obtained by the data set classification module and to select the data set to be compressed according to the evaluation result;
the data set evaluation module specifically operates as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular;
the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of a data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
the Q-learning module is used to select the compression algorithm by Q-learning, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, the compression algorithm is obtained through successive iterations, and the data set obtained by the data set evaluation module is compressed;
the specific steps of the system detection module are as follows:
a daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither exceeds 70%, compression is performed;
the specific steps of creating the daemon are as follows:
a Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage;
the specific steps of connecting to the embedded databases are as follows:
a connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism;
the maximum number of data rows in each set in the data set classification module is 1000;
the manner in which the Q-learning module uses Q-learning to select the compression algorithm is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible;
the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm;
the specific compression steps in the Q-learning module are as follows:
during compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected by the Q-learning module to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept;
the Q-learning module further performs the following steps:
all compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
Example: an intelligent data compression method for an embedded database, comprising the following steps:
S1. A Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the system's current CPU, memory, and hard disk usage. If the current CPU or memory usage exceeds 70%, no compression is performed: other programs in the system are occupying its resources, and no additional data compression plan can be carried out. If neither the CPU nor the memory usage exceeds 70% and the current hard disk usage exceeds 70%, the pressure from other programs on the system is low while the hard disk usage is high, so the necessity and benefit of data compression are high, and data compression is started. If the current decision is not to compress, the daemon goes directly into a dormant state and wakes up again after 5 minutes to judge again.
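A minimal sketch of the daemon described in S1 is shown below, assuming the psutil package is available for reading resource usage; the 70% threshold and the 5-minute interval follow the description, while the function names and the compression entry point are illustrative placeholders.

import time
import psutil

CPU_MEM_LIMIT = 70.0    # do not compress if CPU or memory usage exceeds this
CHECK_INTERVAL = 300    # the daemon wakes up every 5 minutes

def system_allows_compression() -> bool:
    # Sample the current CPU, memory, and hard disk usage of the system.
    cpu = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory().percent
    disk = psutil.disk_usage("/").percent
    print(f"cpu={cpu:.1f}% mem={mem:.1f}% disk={disk:.1f}%")
    return cpu <= CPU_MEM_LIMIT and mem <= CPU_MEM_LIMIT

def start_compression_plan() -> None:
    # Placeholder for the S2-S7 pipeline (connection check, clustering,
    # evaluation, Q-learning selection, compression).
    pass

def daemon_loop() -> None:
    while True:
        if system_allows_compression():
            start_compression_plan()
        time.sleep(CHECK_INTERVAL)    # sleep until the next 5-minute wake-up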
S2. If S1 decides to perform data compression, that is, neither the CPU nor the memory usage exceeds 70% and the current system level allows a data compression plan, the second step checks whether the databases present in the current system can undergo data compression. An embedded database runs embedded in a program, so several database instances may be running in the same system. If no embedded database is currently connected, a connection request is sent to each embedded database through a specific port number; after an embedded database agrees to connect, the connection is established and maintained through a heartbeat mechanism in subsequent operation. If database connections already exist, a message is sent to each database to query its current transaction state and restrictions: a database with a transaction in progress cannot be compressed, and a database with a data-modification restriction cannot be compressed. Data in a database judged incompressible is skipped during compression; otherwise the data in that database is selected for compression.
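The port-based connection request and heartbeat protocol of S2 are not specified in detail, so the sketch below only illustrates the per-instance eligibility check, using SQLite as a stand-in embedded database: an instance with a write transaction in progress is skipped.

import sqlite3

def database_is_compressible(path: str) -> bool:
    # Return True if this SQLite instance currently has no write transaction in
    # progress; whether data modification is otherwise restricted would be queried
    # through the method's own messaging channel, which is not modelled here.
    conn = sqlite3.connect(path, timeout=0)
    try:
        conn.execute("BEGIN IMMEDIATE")   # fails immediately if another writer holds the lock
        conn.rollback()
        return True
    except sqlite3.OperationalError:      # "database is locked": a transaction is in progress
        return False
    finally:
        conn.close()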
S3. If S2 judges that the current database is suitable for data compression, the data that has never been accessed in the last 100 data-related transactions of the database is marked first. The specific marking method processes the tables one by one: the most recent transaction that used each table is counted, and a table is marked as a rarely used table if more than 100 transactions have passed since it was last used. The data in each such table is then clustered by primary key or timestamp: rows whose primary key values are close or whose timestamps are similar are grouped into one set by the K-Means algorithm, so that each table is divided into different sets; the maximum number of data rows in a set is 1000, and each such set is called a data set.
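A minimal sketch of the grouping in S3 follows, assuming scikit-learn is available and that each row of a rarely used table is identified by a numeric primary key or Unix timestamp; the cluster count and variable names are illustrative, while the 1000-row cap comes from the description.

import numpy as np
from sklearn.cluster import KMeans

MAX_ROWS_PER_SET = 1000

def split_into_data_sets(keys: np.ndarray) -> list:
    # Cluster the rows of one rarely used table by primary key or timestamp,
    # then cap every resulting set at 1000 rows.
    n_clusters = max(1, int(np.ceil(len(keys) / MAX_ROWS_PER_SET)))
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(keys.reshape(-1, 1))
    data_sets = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]           # row indices assigned to cluster c
        for start in range(0, len(members), MAX_ROWS_PER_SET):
            data_sets.append(members[start:start + MAX_ROWS_PER_SET])
    return data_sets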
S4. The data-level characteristics of each data set obtained in S3 are evaluated, specifically as follows. The more transactions that have passed since the database last read the data set, the lower the query frequency for the data in the set, the less likely the set is to be accessed again and need decompression after compression, and the more suitable it is for compression; the read index value is increased by 1 for every 10 transactions during which no read occurs. Likewise, the more transactions that have passed since the last write, the lower the current write frequency for the data; because writing to already-compressed data changes the whole compressed file and forces recompression, the write operation is given a higher weight than the read operation, and the write index value is increased by 1 for every 10 transactions during which no write occurs. The structure of the data is regular if the data in the set is stored regularly, as with time-series data in a time-series database, where the timestamp, fields, and label length of each column of each row are similar and change little; in that case the structure index value is marked 1, because compression can then reach the maximum compression ratio and save more hardware space, making the data well suited to compression; if the structure is irregular, the index value is marked 0. The weight of the read operation is 1, the weight of the write operation is 2, and the weight of the data structure is 1; the index values are multiplied by their weights and summed to obtain the total weight of a data set. The sets are sorted from high to low by total weight, and data compression is performed only on the currently most suitable data set, so that the compression operation does not expand to data that will be accessed in the future, which would burden the system and require decompression; therefore only one set is selected for compression.
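The scoring rule of S4 can be sketched as follows; the per-10-transaction increments and the weights 1, 2, and 1 are taken from the description, while the record fields are illustrative.

from dataclasses import dataclass

READ_WEIGHT, WRITE_WEIGHT, STRUCTURE_WEIGHT = 1, 2, 1

@dataclass
class DataSetStats:
    txns_since_last_read: int    # transactions since the data set was last read
    txns_since_last_write: int   # transactions since the data set was last written
    regular_structure: bool      # e.g. time-series rows with uniform fields

def total_weight(s: DataSetStats) -> int:
    read_index = s.txns_since_last_read // 10       # +1 per 10 read-free transactions
    write_index = s.txns_since_last_write // 10     # +1 per 10 write-free transactions
    structure_index = 1 if s.regular_structure else 0
    return (read_index * READ_WEIGHT
            + write_index * WRITE_WEIGHT
            + structure_index * STRUCTURE_WEIGHT)

# The single data set with the highest total weight is selected for compression:
candidates = [DataSetStats(35, 80, True), DataSetStats(12, 15, False)]
best = max(candidates, key=total_weight)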
S5. The algorithm obtains the current CPU, memory, and hard disk usage of the system and selects the required data compression method according to these states. Q-learning is used to select the compression algorithm: the state of the system is the Q-learning state input, the choice of compression algorithm is the action, and the reward is calculated from the throughput improvement of the compressed database and the reduction in occupied resources; the most suitable compression algorithm is obtained through successive iterations. When the CPU and memory usage is relatively high, reaching 50% or more but below 70%, other processes are running in the system and the algorithm should select a lightweight compression algorithm such as RLE or delta-of-delta, the specific choice being output by Q-learning. If the system's CPU and memory usage is low, below 50%, the Q-learning algorithm tends to select a heavyweight compression algorithm such as the Gzip algorithm. When hard disk usage is low, for example below 50%, the Q-learning algorithm is more inclined to use a lightweight compression algorithm that saves time but compresses less; when hard disk usage is high, for example above 50%, it is more inclined to use a heavyweight compression algorithm with a high compression ratio but lower efficiency. In addition, when the hard disk usage is very high, for example above 95%, the high usage is penalized in the reward function; in that case the system's CPU and memory usage is ignored and a heavyweight algorithm is used directly to compress the data on the hard disk, preventing the hard disk usage from reaching 100% and data from being lost. The candidate compression algorithms include lightweight algorithms such as RLE and delta-of-delta, and heavyweight general-purpose algorithms such as the Gzip algorithm.
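A minimal tabular Q-learning sketch of the selection step in S5 is given below. The description fixes the state inputs (CPU, memory, and hard disk usage), the actions (lightweight versus heavyweight algorithms), and the reward signals (throughput gain, resource reduction, and a penalty for very high disk usage); the discretisation, reward shaping, and hyper-parameters here are illustrative.

import random
from collections import defaultdict

ACTIONS = ["RLE", "delta-of-delta", "Gzip"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

q_table = defaultdict(lambda: [0.0] * len(ACTIONS))

def discretise(cpu: float, mem: float, disk: float) -> tuple:
    bucket = lambda x: min(int(x // 10), 9)   # 10% usage buckets
    return (bucket(cpu), bucket(mem), bucket(disk))

def choose_algorithm(state: tuple) -> int:
    if random.random() < EPSILON:             # occasional exploration
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_table[state][a])

def reward(throughput_gain: float, resource_drop: float, disk: float) -> float:
    r = throughput_gain + resource_drop
    if disk > 95:                             # penalise very high hard disk usage
        r -= 10.0
    return r

def update(state: tuple, action: int, r: float, next_state: tuple) -> None:
    best_next = max(q_table[next_state])
    q_table[state][action] += ALPHA * (r + GAMMA * best_next - q_table[state][action])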
S6. The data set selected for compression in S4 is compressed with the compression algorithm output in S5. During compression, the data set determined to be compressed is first extracted from the database table and stored in a single file; the file is compressed with the compression algorithm output in S5 to obtain a compressed file; finally the compressed file is stored in a separate folder in which all compressed data files are kept.
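The extraction-and-compression path of S6 is sketched below for the heavyweight case, using SQLite and the standard-library gzip module; the one-file-per-data-set layout and the archive folder follow the description, while the names and paths are illustrative.

import csv
import gzip
import sqlite3
from pathlib import Path

ARCHIVE_DIR = Path("compressed_sets")

def compress_data_set(db_path: str, table: str, rowids: list, set_id: str) -> Path:
    # Extract the selected rows into a single file and store it compressed
    # in the archive folder that holds all compressed data files.
    ARCHIVE_DIR.mkdir(exist_ok=True)
    conn = sqlite3.connect(db_path)
    placeholders = ",".join("?" * len(rowids))
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE rowid IN ({placeholders})", rowids).fetchall()
    conn.close()

    out_path = ARCHIVE_DIR / f"{set_id}.csv.gz"
    with gzip.open(out_path, "wt", newline="") as f:
        csv.writer(f).writerows(rows)
    return out_path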
S7. The compressed data files obtained in S6 are recorded, and a B+ tree index is built for them. At the same time, the transactions of the database are monitored in real time: as soon as a transaction involves compressed data, the compressed data file is located through the index, the data is decompressed promptly and reassembled into the database table to minimize the impact on the transaction, and finally the read or write transaction record of the data is updated.
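The restore path of S7 can be sketched as follows; a plain dictionary stands in for the B+ tree index (the Python standard library has no B+ tree), and the trigger that detects a transaction touching compressed data is outside the scope of this sketch.

import csv
import gzip
import sqlite3
from pathlib import Path

compressed_index = {}   # (table, set_id) -> path of the compressed data file

def register(table: str, set_id: str, path: Path) -> None:
    compressed_index[(table, set_id)] = path

def restore_data_set(db_path: str, table: str, set_id: str) -> None:
    # Decompress one archived data set and return its rows to the table.
    # (CSV re-import keeps values as text; type restoration is omitted in this sketch.)
    path = compressed_index.pop((table, set_id))
    with gzip.open(path, "rt", newline="") as f:
        rows = [tuple(r) for r in csv.reader(f)]
    conn = sqlite3.connect(db_path)
    placeholders = ",".join("?" * len(rows[0]))
    with conn:                                # one transaction for the re-insert
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    conn.close()
    path.unlink()                             # drop the archive after a successful restore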
It should be noted that the above detailed description merely explains and illustrates the technical solution of the present invention and does not limit the scope of protection of the claims. All changes that come within the meaning and range of equivalency of the claims and the specification are intended to be embraced within their scope.

Claims (10)

1. An intelligent data compression method for an embedded database, characterized by comprising the following steps:
step one: detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, do not compress; if neither the CPU nor the memory usage exceeds 70%, compress and execute step two;
step two: determine whether connections to the embedded databases already exist; if not, connect to the embedded databases; if they do exist, send a message to each embedded database to query its current transaction state and restrictions; an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed and step three is executed;
step three: for each embedded database to be compressed, first count the most recent transactions involving each table in the embedded database; if a table's data has not been accessed within its last 100 data-related transactions, mark the table as a rarely used table; then cluster the data in each rarely used table by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
step four: evaluate the data-level characteristics of each data set obtained in step three, and select the data set to be compressed according to the evaluation result;
step four is specifically as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular; the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of the data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
step five: use Q-learning to select the compression algorithm, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, and the compression algorithm is obtained through successive iterations;
step six: compress the data set obtained in step four with the compression algorithm selected in step five.
2. The intelligent data compression method for an embedded database according to claim 1, characterized in that the specific steps of step one are as follows:
a daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither the CPU nor the memory usage exceeds 70%, compression is performed and step two is executed.
3. The intelligent data compression method for an embedded database according to claim 2, characterized in that the specific steps of creating the daemon are as follows:
a Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage.
4. The intelligent data compression method for an embedded database according to claim 3, characterized in that the specific steps of connecting to the embedded databases are as follows:
a connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism.
5. The intelligent data compression method for an embedded database according to claim 4, characterized in that the maximum number of data rows in each set in step three is 1000.
6. The intelligent data compression method for an embedded database according to claim 5, characterized in that the manner of using Q-learning to select the compression algorithm in step five is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible.
7. The intelligent data compression method for an embedded database according to claim 6, characterized in that the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm.
8. The intelligent data compression method for an embedded database according to claim 7, characterized in that the specific compression steps in step six are as follows:
during compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected in step five to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept.
9. The intelligent data compression method for an embedded database according to claim 8, characterized in that the method further comprises a step seven, whose specific steps are as follows:
all compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
10. An intelligent data compression system for an embedded database, characterized by comprising a system detection module, a connection judgment module, a data set classification module, a data set evaluation module, and a Q-learning module;
the system detection module is used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed; if neither exceeds 70%, compression is performed;
the connection judgment module is used to determine whether connections to the embedded databases already exist; if not, the embedded databases are connected; if they do exist, a message is sent to each embedded database to query its current transaction state and restrictions, and an embedded database with a transaction in progress or with a data-modification restriction is not compressed, otherwise it is compressed;
the data set classification module is used, for each embedded database to be compressed, to count the most recent transactions involving each table in the embedded database; a table whose data has not been accessed within its last 100 data-related transactions is marked as a rarely used table; the data in each rarely used table is then clustered by primary key or timestamp using the K-Means algorithm, dividing each table into different data sets;
the data set evaluation module is used to evaluate the data-level characteristics of each data set obtained by the data set classification module and to select the data set to be compressed according to the evaluation result;
the data set evaluation module specifically operates as follows:
a read-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last read the data set, and a write-operation index value is increased by 1 for every 10 transactions that have elapsed since the embedded database last wrote the data set; for the structure of the data, the index value is marked 1 if the structure is regular and 0 if it is irregular;
the weight of the read operation is set to 1, the weight of the write operation to 2, and the weight of the data structure to 1; each index value is multiplied by its weight and the products are summed to obtain the total weight of a data set; the total weights of all data sets are sorted from high to low, and the data set with the highest total weight is selected for data compression;
the Q-learning module is used to select the compression algorithm by Q-learning, with the current CPU, memory, and hard disk usage as the Q-learning state input and the selected compression algorithm as the output; the reward is calculated from the throughput improvement of the compressed embedded database and the reduction in occupied resources, the compression algorithm is obtained through successive iterations, and the data set obtained by the data set evaluation module is compressed;
the specific steps of the system detection module are as follows:
a daemon is created and used to detect the current usage of the CPU, the memory, and the hard disk; if the usage of either the CPU or the memory exceeds 70%, no compression is performed and the daemon enters a dormant state, waiting to be woken up to judge again; if neither exceeds 70%, compression is performed;
the specific steps of creating the daemon are as follows:
a Python script is used to create a daemon that wakes up every 5 minutes; each time it wakes up, the daemon detects the current CPU, memory, and hard disk usage;
the specific steps of connecting to the embedded databases are as follows:
a connection request is sent to each embedded database through its port number; after an embedded database agrees to connect, the connection is established, and in subsequent operation it is maintained through a heartbeat mechanism;
the maximum number of data rows in each set in the data set classification module is 1000;
the manner in which the Q-learning module uses Q-learning to select the compression algorithm is specifically as follows:
when the CPU or memory usage reaches 50% or more but is below 70%, a lightweight compression algorithm is selected;
when both the CPU and memory usage are below 50%: if the hard disk usage is also below 50%, a lightweight compression algorithm is selected; if the hard disk usage is not below 50%, a heavyweight compression algorithm is selected;
when the hard disk usage reaches 95% or more, the CPU and memory usage are ignored and a heavyweight algorithm is used directly to compress the data on the hard disk as quickly as possible;
the lightweight compression algorithm comprises RLE or delta-of-delta, and the heavyweight compression algorithm is the Gzip algorithm;
the specific compression steps in the Q-learning module are as follows:
during compression, the data set to be compressed is first extracted from the corresponding table in the embedded database and stored in a single file; the file is compressed with the compression algorithm selected by the Q-learning module to obtain a compressed file; finally, the compressed file is stored in a separate folder in which all compressed data files are kept;
the Q-learning module further performs the following steps:
all compressed data files stored in the folder are recorded and indexed with a B+ tree; the transactions of the embedded database are monitored in real time, and as soon as a transaction involves a compressed data file, the file is located through the index and decompressed, the decompressed data is returned to the table of the embedded database, and finally the transaction record of the corresponding data is updated.
CN202310830705.5A 2023-07-07 2023-07-07 Data intelligent compression method and system for embedded database Pending CN116841973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310830705.5A CN116841973A (en) 2023-07-07 2023-07-07 Data intelligent compression method and system for embedded database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310830705.5A CN116841973A (en) 2023-07-07 2023-07-07 Data intelligent compression method and system for embedded database

Publications (1)

Publication Number Publication Date
CN116841973A true CN116841973A (en) 2023-10-03

Family

ID=88170322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310830705.5A Pending CN116841973A (en) 2023-07-07 2023-07-07 Data intelligent compression method and system for embedded database

Country Status (1)

Country Link
CN (1) CN116841973A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421288A (en) * 2023-12-18 2024-01-19 云和恩墨(北京)信息技术有限公司 Database data compression method and device
CN117421288B (en) * 2023-12-18 2024-06-11 云和恩墨(北京)信息技术有限公司 Database data compression method and device
CN118069894A (en) * 2024-04-12 2024-05-24 乾健科技有限公司 Big data storage management method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination