CN111522710B

CN111522710B - Data compression method, device and medium based on big data

Info

Publication number: CN111522710B
Application number: CN202010300892.2A
Authority: CN
Inventors: 黄南溪; 郭建新; 罗辉
Original assignee: Transwarp Technology Shanghai Co Ltd
Current assignee: Transwarp Technology Shanghai Co Ltd
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2021-02-26
Anticipated expiration: 2040-04-16
Also published as: CN111522710A

Abstract

The embodiments of the invention disclose a data compression method, device and medium based on big data. The method comprises the following steps: when a big data compression request is detected, acquiring historical index data to be compressed corresponding to the big data compression request, wherein the historical index data to be compressed comprises a plurality of index data sets in units of days, and each index data set comprises all index data collected on the corresponding date; obtaining a mode corresponding to each data slice subscript according to the plurality of index data sets and a preset data slice range size parameter; and determining a flag field for the index data in each index data set according to the mode corresponding to each data slice subscript and the short-time memory base corresponding to each index data set, and compressing and storing the index data in each index data set. The embodiments of the invention can compress and store index data, reducing wasted storage space while avoiding the loss of index data.

Description

Data compression method, device and medium based on big data

Technical Field

The embodiments of the present invention relate to data processing technologies, and in particular, to a data compression method, device, and medium based on big data.

Background

An important content in Internet Technology (IT) operation and maintenance work is to monitor and record the running state of each host device in the system and information such as network load in real time, and obtain index data of each host device, so as to realize functions of timely alarming, fault diagnosis, data mining and the like of abnormal conditions.

There are many data acquisition points and the acquisition interval is short, so the volume of index data monitored in real time is huge. When a system has many nodes and many defined indexes, the index data reaches a very large volume and occupies a great deal of storage space.

Disclosure of Invention

Embodiments of the present invention provide a data compression method, device, and medium based on big data, so as to implement compressed storage of index data, and reduce waste of storage space without losing data.

In a first aspect, an embodiment of the present invention provides a data compression method based on big data, including:

when a big data compression request is detected, acquiring historical index data to be compressed corresponding to the big data compression request, wherein the historical index data to be compressed comprises a plurality of index data sets in units of days, each index data set comprises all index data collected on the corresponding date, and the index data in each index data set is arranged in chronological order;

obtaining a mode corresponding to each data slice subscript according to the plurality of index data sets in units of days and a preset data slice range size parameter;

and determining a flag field for the index data in each index data set according to the mode corresponding to each data slice subscript and the short-time memory base corresponding to each index data set, and compressing and storing the index data in each index data set.

In a second aspect, embodiments of the present invention also provide a computer device, including a processor and a memory, the memory storing instructions that, when executed, cause the processor to:

In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements:

According to the technical solution of the embodiments of the invention, when a big data compression request is detected, the historical index data to be compressed corresponding to that request is obtained; this data comprises a plurality of index data sets in units of days. A mode corresponding to each data slice subscript is then obtained according to the index data sets and a preset data slice range size parameter. Next, a flag field for the index data in each index data set is determined according to the mode corresponding to each data slice subscript and the short-time memory base corresponding to each index data set, and the index data in each index data set is compressed and stored. In this way, the modes that recur in large numbers across the index data sets are stored through mode compression storage, and index data that deviates from the mode is compressed within an acceptable error range through short-time memory compression storage. Waste of storage space is thereby reduced while loss of index data is avoided.

Drawings

FIG. 1a is a diagram illustrating a daily CPU usage trend of a host device.

FIG. 1b is a diagram illustrating CPU usage trends during a host device promotional campaign.

FIG. 1c shows a sample of raw CPU usage index data.

FIG. 1d is a graph of data value.

Fig. 1e is a flowchart of a data compression method based on big data according to an embodiment of the present invention.

Fig. 1f is a flowchart of compressing and storing index data in a target index data set according to an embodiment of the present invention.

Fig. 1g is a diagram illustrating CPU utilization data collection according to an embodiment of the present invention.

Fig. 1h is a daily CPU utilization trend chart of a host device according to a first embodiment of the present invention.

Fig. 1i is a CPU usage trend chart of a host device on the same day as a promotional event date according to an embodiment of the present invention.

Fig. 2 is a flowchart of a data compression method based on big data according to a second embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a data compression apparatus based on big data according to a third embodiment of the present invention.

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

The term "index" as used herein is a characteristic value or a calculated value of a target object under a certain rule at a specific time point or a specific time range. For example, the target object may be each host device in the system, and the index may be a Central Processing Unit (CPU) usage rate, a memory usage rate, and the like of each host device in the system at a certain time point.

The term "index data" as used herein is index data based on a time series. Illustratively, the index data may be CPU usage, memory usage, etc. of the host device every 5 seconds.

The term "mode" as used herein is the value that occurs most frequently in a data set. For example, in the data set {1, 2, 1, 3, 4, 5, 1, 8}, the value 1 occurs most often, so the mode of the data set is 1.

The term "big data compression request" used herein is an operation request for requesting to compress and store the to-be-compressed historical index data.

The term "historical index data to be compressed" used herein is historical index data that is stored in an uncompressed state at the current time. The historical index data is all index data except the index data collected on the current day. The historical index data to be compressed includes a plurality of index data sets in units of days. The index data set includes all index data collected on the corresponding date. The index data in the index data set are arranged in chronological order. Illustratively, the CPU usage rate of the host device every 1 minute for 1 year (365 days) is collected as the history index data to be compressed. The historical index data to be compressed comprises 365 index data sets in a unit of day. The set of metric data includes CPU usage per 1 minute collected on the corresponding date. The CPU usage rates for every 1 minute in the index data set are arranged in chronological order.

The term "preset data slice range size parameter" as used herein is a compression parameter for mode compression storage, used to slice time-series index data. During mode compression storage, the index data in each index data set of the mode election sample is sliced according to the preset data slice range size parameter into a plurality of index data slices, and the data slice subscript of each slice is determined in chronological order. The index data slices with the same data slice subscript across all index data sets in the mode election sample are merged to obtain an index data merged set for each data slice subscript. The index data that occurs most often in each merged set is taken as the mode corresponding to that data slice subscript.

The value of the preset data slice range size parameter depends on the number of index data records in the index data set. If an index data set contains relatively few records, the parameter can be made larger. Illustratively, the preset data slice range size parameter is 5 minutes and the time range of a daily index data set is 0:00:00 to 23:59:59. Slicing the index data set according to the parameter divides this time range into 288 index data slices at 5-minute intervals. Index data collected within the time range of a slice belongs to that slice.

The term "data slice subscript" as used herein is a subscript that identifies a data slice. Illustratively, with a preset data slice range size parameter of 5 minutes, the time range of a daily index data set from 0:00:00 to 23:59:59 is divided into 288 index data slices at 5-minute intervals, and the slices are assigned subscripts 1 to 288 in chronological order.
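The mapping from a collection time to its data slice subscript can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function names `slice_subscript` and `slices_per_day` are hypothetical, and the sketch assumes the slice size divides a day evenly.

```python
from datetime import time


def slice_subscript(t: time, slice_minutes: int = 5) -> int:
    """Return the 1-based data slice subscript for a time of day.

    Assumption: slices are consecutive, equal-width windows starting at 0:00:00.
    """
    seconds_since_midnight = t.hour * 3600 + t.minute * 60 + t.second
    return seconds_since_midnight // (slice_minutes * 60) + 1


def slices_per_day(slice_minutes: int = 5) -> int:
    """Number of slices in one day for the given slice size in minutes."""
    return 24 * 60 // slice_minutes
```

With the 5-minute parameter from the example, a record collected at 0:04:59 falls in slice 1, one collected at 0:05:00 falls in slice 2, and the last second of the day falls in slice 288.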

The term "short-time memory base" as used herein is the base value used for short-time memory data compression. The index value of the first piece of index data in each index data set is taken as the initial short-time memory base of that set. During compression and storage, if the current index data is not within the acceptable percentage range of fluctuation of the short-time memory base of the target index data set, the index value of the current index data becomes the new short-time memory base of that set.

The term "flag field" as used herein is a field that marks how a piece of index data is compressed and stored. The flag field may be set to a mode flag, a short-time memory flag, or a raw data flag. The mode flag indicates that the index data is stored using mode compression storage. The short-time memory flag indicates that the index data is stored using short-time memory compression storage. The raw data flag indicates that the index value is not compressed and is stored in the original data format.

The term "mode election sample" as used herein is a plurality of sets of index data used to determine the mode corresponding to each data slice index. A set number of index data sets are randomly acquired as a mode election sample from among a plurality of index data sets in units of days.

The term "acceptable percentage range of fluctuation of the short-time memory base" as used herein is a compression parameter for short-time memory compression storage; it is the permitted relative deviation between the queried value and the true value of an index value. In general, the larger this range, the higher the compression rate of short-time memory compression storage, but also the greater the distortion when data is queried. A balance between compression rate and distortion must be chosen according to business requirements. Illustratively, the acceptable percentage range of fluctuation of the short-time memory base is 5%.
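The range check described above can be written as a small predicate. This is a hedged sketch: the name `within_tolerance` is hypothetical, and treating the boundary as inclusive and measuring the deviation relative to the base are assumptions that the text does not settle.

```python
def within_tolerance(value: float, base: float, tolerance: float = 0.05) -> bool:
    """Return True if value lies within +/- tolerance (a fraction) of base.

    Assumptions: the bound is inclusive and relative to the base, so a base
    of 0 only accepts a value of exactly 0.
    """
    return abs(value - base) <= tolerance * abs(base)
```

With a 5% range and a base of 10.0, a value of 10.4 would be answered from the base at query time, whereas 10.6 falls outside the range.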

For ease of understanding, the main inventive concepts of the embodiments of the present invention are briefly described.

Time-series index data is generally regular and stable, and its value range is bounded. Illustratively, the CPU usage of the host devices in a system follows a regular pattern most of the time. FIG. 1a is a diagram illustrating a daily CPU usage trend of a host device. FIG. 1b is a diagram illustrating CPU usage trends during a host device promotional campaign. In general, the trends of time-series index data such as the CPU usage and memory usage of a host device are similar from day to day. Occasionally the trend differs because of a special event. Illustratively, during an e-commerce promotion the traffic of the e-commerce system is much higher than usual, so the CPU usage trend of the host devices in that system differs from the everyday trend.

In the prior art, time-series index data is stored directly in the original data format in a big data storage engine, without regard to storage space and without compression, and historical cold data is cleaned up periodically (for example, deleted). FIG. 1c shows a sample of raw CPU usage index data for 100 machines. The 100 machines each report CPU usage once every 5 seconds, which produces approximately 1.73 million records per day, approximately 51.84 million records per month, and approximately 622 million records per year. CPU usage, however, is regular most of the time. Storing the data directly in the original data format therefore wastes a great deal of space for little benefit.
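The record counts for this scenario can be checked with a few lines of arithmetic. The sketch assumes a 30-day month and a year of twelve such months; the variable names are illustrative only.

```python
machines = 100
interval_seconds = 5

# One sample per machine per interval, over a full day of 86,400 seconds.
records_per_day = machines * (24 * 60 * 60 // interval_seconds)

# Assumption: a month is 30 days and a year is 12 such months.
records_per_month = records_per_day * 30
records_per_year = records_per_month * 12

print(records_per_day, records_per_month, records_per_year)
```

Under these assumptions the totals are 1,728,000 records per day, 51,840,000 per month, and 622,080,000 per year.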

In most cases, the trend of time-series index data (such as CPU usage) is similar from one day to the next, so storing a near-identical copy of the index data for every day wastes storage space.

Time-series index data (such as CPU usage) is also relatively stable over short periods in most cases (for example, when the system is idle after midnight, CPU usage approaches 0), so storing a near-identical value at every acquisition point in such a continuous interval wastes storage space.

In addition, the prior-art practice of periodically cleaning up historical cold data causes historical data to be lost, whereas customers typically want historical data retained. Data that has no apparent use at present may still prove valuable later.

To address the problem that the prior-art approach of storing index data directly in the original data format and periodically cleaning up historical cold data wastes a great deal of space and loses historical data, the inventors considered whether time-series index data could be compressed and stored so that storage waste is reduced and no data is lost.

FIG. 1d is a graph of data value. Hot data is data that was generated recently and is used frequently; its value is the highest. Warm data is data that was generated some time ago and is used occasionally. Cold data is data that was generated long ago and is rarely used. Because hot data is the most valuable and is in frequent use, query efficiency is the main consideration when storing it. Because warm data and cold data are used only occasionally, data compression is the main consideration when storing them, so as to reduce the storage space they occupy.

The index data collected on the current day is hot data. It is stored in the original data format so that it can be retrieved readily when queried. The historical index data is all index data except the index data collected on the current day, and is warm data or cold data. Compressing and storing the historical index data reduces wasted storage space without losing any data, while still providing a friendly and efficient data query service.

Based on the above reasoning, the inventors propose the following. When a big data compression request is detected, historical index data to be compressed corresponding to the big data compression request is obtained, wherein the historical index data to be compressed comprises a plurality of index data sets in units of days, each index data set comprises all the index data collected on the corresponding date, and the index data in each index data set is arranged in chronological order. A mode corresponding to each data slice subscript is obtained according to the plurality of index data sets and a preset data slice range size parameter. A flag field for the index data in each index data set is determined according to the mode corresponding to each data slice subscript and the short-time memory base corresponding to each index data set, and the index data in each index data set is compressed and stored. The benefits are as follows. Index data collected on the current day is not compressed, which keeps it convenient to retrieve and use. For historical index data to be compressed, the modes that recur in large numbers in the index data sets are stored through mode compression storage, and index data that deviates from the mode is compressed within an acceptable error range through short-time memory data compression. The compression rate and accuracy can be tuned through parameter configuration. This addresses the compression and storage of time-series index data whose daily trends are similar and whose values are stable over short periods.

Example one

Fig. 1e is a flowchart of a data compression method based on big data according to an embodiment of the present invention. The embodiment of the present invention is applicable to the case of compressing and storing index data, and the method may be executed by the data compression apparatus based on big data provided in the embodiment of the present invention, and the apparatus may be implemented in a software and/or hardware manner, and may be generally integrated in a computer device. As shown in fig. 1e, the method of the embodiment of the present invention specifically includes:

Step 101, when a big data compression request is detected, obtaining historical index data to be compressed corresponding to the big data compression request.

The historical index data to be compressed comprises a plurality of index data sets taking days as units, the index data sets comprise all the index data collected on corresponding dates, and the index data in the index data sets are arranged according to a time sequence.

The historical index data to be compressed is all historical index data which is not compressed and stored at the current time. The historical index data is all index data except the index data collected on the current day. The historical index data to be compressed includes a plurality of index data sets in units of days. The index data set includes all index data collected on the corresponding date. The index data in the index data set are arranged in chronological order.

The big data compression request is an operation request for compressing and storing the historical index data to be compressed. Optionally, when a big data compression request is detected, obtaining the historical index data to be compressed corresponding to the big data compression request includes: when a big data compression request is detected, obtaining all index data that has not been compressed and stored, other than the index data collected on the current day, as the historical index data to be compressed corresponding to the big data compression request.

Index data is index data in time series. Illustratively, the index data may be CPU usage, memory usage, etc. of the host device every 5 seconds.

In one embodiment, the metric data is CPU usage of the host device every 1 minute. The historical index data to be compressed comprises 365 index data sets in a unit of day. The set of metric data includes CPU usage per 1 minute collected on the corresponding date. The CPU usage rates for every 1 minute in the index data set are arranged in chronological order.

Step 102, obtaining a mode corresponding to each data slice subscript according to the plurality of index data sets in units of days and a preset data slice range size parameter.

Optionally, obtaining a mode corresponding to each data slice subscript according to the plurality of index data sets in units of days and the preset data slice range size parameter may include: randomly selecting a set number of index data sets from the plurality of index data sets as a mode election sample; for each index data set in the mode election sample, slicing the index data in the index data set according to the preset data slice range size parameter into a plurality of index data slices, and determining the data slice subscript of each index data slice in chronological order; merging the index data slices with the same data slice subscript across the index data sets in the mode election sample to obtain an index data merged set corresponding to each data slice subscript; and taking the index data that occurs most often in each index data merged set as the mode corresponding to that data slice subscript.
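The mode election procedure described above can be sketched in Python. This is an illustrative reading of the steps, not the patented implementation: `elect_modes`, `slice_subscript`, and the `(datetime, value)` record shape are hypothetical, and because the text does not say how ties are broken, the sketch keeps the tied value that was counted first.

```python
import random
from collections import Counter, defaultdict
from datetime import datetime


def slice_subscript(ts: datetime, slice_minutes: int = 5) -> int:
    """1-based slice subscript for the time-of-day part of ts."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return seconds // (slice_minutes * 60) + 1


def elect_modes(daily_sets, sample_size, slice_minutes=5, seed=None):
    """Return {slice subscript: mode} from a random sample of daily sets.

    daily_sets: list of days, each a time-ordered list of (datetime, value).
    """
    rng = random.Random(seed)
    # Randomly pick the mode election sample (capped at the available days).
    sample = rng.sample(daily_sets, min(sample_size, len(daily_sets)))

    # Merge values that share a slice subscript across all sampled days.
    merged = defaultdict(Counter)
    for day in sample:
        for ts, value in day:
            merged[slice_subscript(ts, slice_minutes)][value] += 1

    # The most frequent value in each merged set is that slice's mode.
    return {sub: counts.most_common(1)[0][0] for sub, counts in merged.items()}


days = [
    [(datetime(2020, 1, 1, 0, 0), 3), (datetime(2020, 1, 1, 0, 1), 3), (datetime(2020, 1, 1, 0, 5), 7)],
    [(datetime(2020, 1, 2, 0, 0), 3), (datetime(2020, 1, 2, 0, 1), 4), (datetime(2020, 1, 2, 0, 5), 7)],
    [(datetime(2020, 1, 3, 0, 0), 5), (datetime(2020, 1, 3, 0, 1), 3), (datetime(2020, 1, 3, 0, 5), 9)],
]
modes = elect_modes(days, sample_size=3)
```

In this toy input the first slice collects the values 3, 3, 3, 4, 5, 3 and the second collects 7, 7, 9, so the elected modes are 3 and 7 respectively.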

The mode election sample is a plurality of index data sets for determining a mode corresponding to each data slice index. In a specific example, the historical index data to be compressed includes 3650 index data sets in units of days. Of 3650 index data sets in units of days, 300 index data sets are randomly acquired as a mode election sample.

The preset data slice range size parameter is a compression parameter for mode compression storage, used to slice time-series index data. Its value depends on the number of index data records in the index data set: if an index data set contains relatively few records, the parameter can be made larger.

In one embodiment, the preset data slice range size parameter is 5 minutes and the time range of a daily index data set is 0:00:00 to 23:59:59. Slicing the index data set according to the parameter divides this time range into 288 index data slices at 5-minute intervals. Index data collected within the time range of a slice belongs to that slice.

The data slice subscript is a subscript that identifies a data slice. In one embodiment, with a preset data slice range size parameter of 5 minutes, the time range of a daily index data set from 0:00:00 to 23:59:59 is divided into 288 index data slices at 5-minute intervals, and the slices are assigned subscripts 1 to 288 in chronological order.

The mode is the value that occurs most frequently in a data set. For example, in the data set {1, 2, 1, 3, 4, 5, 1, 8}, the value 1 occurs most often, so the mode of the data set is 1.

Optionally, the index data with the largest occurrence number in each index data merge set is obtained as a mode corresponding to the corresponding data slice subscript, and the mode corresponding to each data slice subscript is stored.

Optionally, the mode corresponding to each data fragment subscript is stored in a preset data table.

Step 103, determining a flag field for the index data in each index data set according to the mode corresponding to each data slice subscript and the short-time memory base corresponding to each index data set, and compressing and storing the index data in each index data set.

The flag field is a field that marks how a piece of index data is compressed and stored. The flag field may be set to a mode flag, a short-time memory flag, or a raw data flag. The mode flag indicates that the index data is stored using mode compression storage. The short-time memory flag indicates that the index data is stored using short-time memory compression storage. The raw data flag indicates that the index value is not compressed and is stored in the original data format.

Optionally, determining a flag field for the index data in each index data set according to the mode corresponding to each data slice subscript and the short-time memory base corresponding to each index data set, and compressing and storing the index data in each index data set, may include: obtaining one piece of index data from a target index data set in chronological order as the current index data; judging whether the current index data is the mode corresponding to the data slice subscript of the index data slice to which it belongs; if so, setting the flag field of the current index data to the mode flag; and returning to the operation of obtaining one piece of index data from the target index data set in chronological order as the current index data, until all index data in the target index data set has been processed.

Optionally, according to the data time of the current index data, a mode corresponding to the data slice subscript of the index data slice to which the current index data belongs is obtained, and whether the current index data is the mode corresponding to the data slice subscript of the index data slice to which the current index data belongs is judged.

If the current index data is the mode corresponding to the data slice subscript of the index data slice to which it belongs, the flag field of the current index data is set to the mode flag. In that case only the mode flag needs to be stored for the current index data. When the index value of that index data is later queried and the stored result is a mode flag, the value of the corresponding mode is returned.

An integer value occupies 32 bits and a double-precision floating-point value occupies 64 bits, whereas a mode flag needs only 1 bit of storage. Therefore, the larger the proportion of mode values in the target index data set, the higher the data compression rate.

Optionally, after determining whether the current index data is a mode corresponding to the data slice subscript of the index data slice to which the index data slice belongs, the method may further include: if the current index data is not the mode corresponding to the data fragment subscript of the index data fragment to which the current index data belongs, judging whether the current index data is within the upper and lower floating acceptable percentage range of the short-time memory base number corresponding to the target index data set; if the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set, setting the mark field of the current index data as a short-time memory mark; and returning to execute the operation of acquiring one piece of index data in the target index data set as the current index data according to the time sequence until the processing of all the index data in the target index data set is completed.

The short-term memory base is a base for performing short-term memory data compression. Optionally, the index value of the first piece of index data in the target index data set is obtained as a short-time memory base number corresponding to the target index data set, and the short-time memory base number corresponding to the target index data set is stored.

Optionally, a short-time memory base corresponding to the target index data set is stored in the base value field.

The up-down floating acceptable percentage range of the short-time memory base number is a compression parameter of the short-time memory compression storage mode. It is the allowable deviation ratio between the queried value and the true value of the index value of the index data. Generally, the larger the up-down floating acceptable percentage range of the short-time memory base number, the higher the compression rate of the short-time memory compression storage mode, but also the higher the degree of distortion in data query. A balance between compression rate and degree of distortion needs to be found according to service requirements. Illustratively, the up-down floating acceptable percentage range of the short-time memory base number is 5%.
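The range check can be illustrated with a minimal Python sketch. The function name `within_tolerance` and the interpretation of the percentage as a fraction of the base number's absolute value are assumptions made for illustration; the description does not fix how the range is computed.

```python
def within_tolerance(value: float, base: float, tolerance: float = 0.05) -> bool:
    """Return True if value lies within +/- tolerance of base.

    tolerance is a fraction (0.05 means 5%). Measuring the deviation
    relative to abs(base) is an assumption of this sketch.
    """
    return abs(value - base) <= tolerance * abs(base)


# With a base number of 50.0 and a 5% range, values from 47.5 to 52.5 qualify.
print(within_tolerance(52.0, 50.0))  # deviation 2.0 is inside the range
print(within_tolerance(53.0, 50.0))  # deviation 3.0 is outside the range
```

Under this reading, a query that returns the base number in place of the true value deviates from the true value by at most the configured fraction of the base number.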

It is judged whether the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set. If so, the mark field of the current index data is set to a short-time memory mark. That is, only the short-time memory mark is stored for the current index data. When the index value of the current index data is queried and the query result is a short-time memory mark, the value of the corresponding short-time memory base number is returned.

An integer value occupies 32 bits and a double-precision floating-point value occupies 64 bits, whereas a short-time memory mark likewise needs only 1 or 2 bits of storage. Therefore, the more stable the index values over continuous intervals of the target index data set, the higher the data compression rate.

Optionally, after determining whether the current index data is within the range of the upper and lower floating acceptable percentages of the short-time memory base number corresponding to the target index data set, the method may further include: if the current index data is not in the range of the up-down floating acceptable percentage of the short-time memory base number corresponding to the target index data set, setting a mark field of the current index data as an original data mark, storing the index value of the current index data, and setting the index value of the current index data as a new short-time memory base number corresponding to the target index data set; and returning to execute the operation of acquiring one piece of index data in the target index data set as the current index data according to the time sequence until the processing of all the index data in the target index data set is completed.

If the current index data is not within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set, the mark field of the current index data is set to an original data mark, the index value of the current index data is stored, and that index value is set as the new short-time memory base number corresponding to the target index data set. That is, the original index value of the current index data is stored, and subsequent index data is compared against this new short-time memory base number.

Fig. 1f is a flowchart of compressing and storing index data in a target index data set according to an embodiment of the present invention. As shown in fig. 1f, compressing and storing the index data in the target index data set specifically includes:

Step 1, acquiring one piece of index data in a target index data set according to a time sequence to serve as current index data.

Step 2, judging whether the current index data is the mode corresponding to the data fragment subscript of the index data fragment to which the current index data belongs: if the current index data is the mode corresponding to the data fragment subscript of the index data fragment, executing the step 3; and if the current index data is not the mode corresponding to the data fragment subscript of the index data fragment, executing the step 4.

Step 3, setting the mark field of the current index data as a mode mark.

Step 4, judging whether the current index data is in the range of the up-down floating acceptable percentage of the short-time memory base number corresponding to the target index data set: if the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set, executing the step 5; and if the current index data is not within the range of the up-down floating acceptable percentage of the short-time memory base number corresponding to the target index data set, executing the step 6.

Step 5, setting the mark field of the current index data as a short-time memory mark.

Step 6, setting the mark field of the current index data as an original data mark, storing the index value of the current index data, and setting the index value of the current index data as a new short-time memory base number corresponding to the target index data set.
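Steps 1 to 6 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `compress_day`, the record layout (a mark string plus an optional stored value), the one-reading-per-minute input, and the relative range check are all hypothetical. The mark strings "10", "01" and "00" are those used in the specific example of this embodiment.

```python
from typing import Dict, List, Optional, Tuple

MODE_MARK = "10"        # mode mark
SHORT_TIME_MARK = "01"  # short-time memory mark
ORIGINAL_MARK = "00"    # original data mark

Record = Tuple[str, Optional[float]]


def compress_day(values: List[float], modes: Dict[int, float],
                 slice_minutes: int = 5, tolerance: float = 0.05
                 ) -> Tuple[Optional[float], List[Record]]:
    """Compress one day's index data set (assumed: one value per minute).

    modes maps a 1-based data slice subscript to its mode.
    Returns the initial short-time memory base number and one record
    per value; only original-data records carry a stored value.
    """
    records: List[Record] = []
    if not values:
        return None, records
    initial_base = values[0]  # first index value is the initial base number
    base = initial_base
    for minute, value in enumerate(values):               # step 1
        subscript = minute // slice_minutes + 1
        if value == modes.get(subscript):                 # step 2
            records.append((MODE_MARK, None))             # step 3
        elif abs(value - base) <= tolerance * abs(base):  # step 4
            records.append((SHORT_TIME_MARK, None))       # step 5
        else:                                             # step 6
            records.append((ORIGINAL_MARK, value))
            base = value  # new short-time memory base number
    return initial_base, records


base0, recs = compress_day([50.0, 51.0, 60.0, 60.0, 61.0, 40.0],
                           modes={1: 60.0, 2: 40.0})
print(base0, recs)
```

The mode comparison here is an exact equality test, which presumes that index values are recorded at a fixed precision; a deployment with noisy floating-point readings would need to round the values first.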

In one specific example, the following operations are performed for each index data set in units of days. One piece of index data in the target index data set is acquired in chronological order as the current index data. The data slice subscript of the index data slice to which the current index data belongs is located. It is judged whether the current index data is the mode corresponding to the data slice subscript of the index data slice to which it belongs. If so, the mark field of the current index data is set to the mode mark "10". If not, it is judged whether the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set. If the current index data is within that range, the mark field of the current index data is set to the short-time memory mark "01". If the current index data is not within that range, the mark field of the current index data is set to the original data mark "00", the index value of the current index data is stored, and the index value of the current index data is set as the new short-time memory base number corresponding to the target index data set and stored in the base value field.
When the index value of index data is queried: if the mark field of the index data is the mode mark "10", the data slice subscript of the index data slice to which the index data belongs is located, the value of the mode corresponding to that data slice subscript is read, and the value of the mode is returned. If the mark field of the index data is the short-time memory mark "01", the value of the short-time memory base number corresponding to the index data is read and returned. If the mark field of the index data is the original data mark "00", the stored index value of the index data is read and returned.
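The query logic can likewise be sketched in Python. The function name `query_day` and the record layout (a mark string plus an optional stored value) are hypothetical. The sketch also assumes that the short-time memory base number corresponding to a piece of index data is the most recently stored original value, or the initial base number if none precedes it; the description does not spell out this lookup.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[float]]


def query_day(initial_base: float, records: List[Record],
              modes: Dict[int, float], slice_minutes: int = 5) -> List[float]:
    """Rebuild one day's index values from marks (assumed: one record per minute)."""
    base = initial_base
    values: List[float] = []
    for minute, (mark, stored) in enumerate(records):
        if mark == "10":    # mode mark: return the mode of the data slice
            values.append(modes[minute // slice_minutes + 1])
        elif mark == "01":  # short-time memory mark: return the current base number
            values.append(base)
        else:               # original data mark "00": return the stored value
            assert stored is not None
            base = stored   # the stored value is also the new base number
            values.append(stored)
    return values


records = [("01", None), ("01", None), ("10", None),
           ("10", None), ("00", 61.0), ("10", None)]
print(query_day(50.0, records, modes={1: 60.0, 2: 40.0}))
# -> [50.0, 50.0, 60.0, 60.0, 61.0, 40.0]
```

In this sketch a value stored under a short-time memory mark is returned as the base number itself, so the queried value may differ from the true value by up to the configured acceptable percentage.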

To verify the data compression effect of the scheme, CPU utilization data collected over 1 year (365 days) from 10 host devices (running the same application) in a production line are compressed. The CPU utilization of each host device is collected once per minute. The original format of the data is shown in fig. 1g. Each host device produces one copy of data as shown in fig. 1g per day, so the 10 host devices produce 5256000 pieces of data per year (60 × 24 × 365 × 10), occupying 80 Mb of storage space.

For each of the 10 host devices, data for 10 randomly sampled days were examined, and the daily CPU utilization trend graphs were found to be very close to that shown in fig. 1h. On the day of a company promotional campaign, the CPU utilization trend graph is as shown in fig. 1i.

The data compression targets are as follows. Data whose index values are similar in most cases across different host devices at the same time each day are compressed and stored in the mode compression storage mode. Continuous intervals in which the daily flow changes little within a short time range are compressed and stored in the short-time memory compression storage mode.

For each of the 10 host devices, index data for 30 days are taken at random, 300 days in total. That is, 300 index data sets are randomly acquired as the mode election sample. The preset data slice range size parameter is 5 minutes. The time range 0:00:00 to 23:59:59 of each index data set is divided at 5-minute intervals into 288 index data slices, whose data slice subscripts are 1 to 288 in chronological order. The index data collected within the time range of an index data slice is the index data belonging to that index data slice. The index data slices having the same data slice subscript in the index data sets of the mode election sample are merged to obtain an index data merged set corresponding to each data slice subscript. The index data occurring most frequently in each index data merged set is taken as the mode corresponding to the respective data slice subscript.
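The mapping from a collection time to a data slice subscript in this example can be sketched as below. The function name `slice_subscript` is hypothetical; the 5-minute slice size and the 1-based subscripts 1 to 288 follow the example above.

```python
from datetime import time


def slice_subscript(collected_at: time, slice_minutes: int = 5) -> int:
    """Return the 1-based data slice subscript for a time of day."""
    minutes_since_midnight = collected_at.hour * 60 + collected_at.minute
    return minutes_since_midnight // slice_minutes + 1


print(slice_subscript(time(0, 0, 0)))     # 1
print(slice_subscript(time(0, 5, 0)))     # 2
print(slice_subscript(time(23, 59, 59)))  # 288
```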

The following operations are performed for each index data set in units of days (data for 300 days in total). One piece of index data in the target index data set is acquired in chronological order as the current index data. The data slice subscript of the index data slice to which the current index data belongs is located. It is judged whether the current index data is the mode corresponding to the data slice subscript of the index data slice to which it belongs. If so, the mark field of the current index data is set to the mode mark "10". If not, it is judged whether the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set. If the current index data is within that range, the mark field of the current index data is set to the short-time memory mark "01". If the current index data is not within that range, the mark field of the current index data is set to the original data mark "00", the index value of the current index data is stored, and the index value of the current index data is set as the new short-time memory base number corresponding to the target index data set and stored in the base value field.

The data compression result is as follows: after data compression, the occupied storage space is 6.36 Mb.

First, consider the query key values of the index data stored for the 10 host devices. Each host device has 60 (minutes) × 24 (hours) × 365 (days) = 525600 query key values, for a total of 5256000 query key values. Because the trend graphs of the host devices are similar, after compression in the mode compression storage mode the query key value can theoretically consist of the time alone, without the system number, so that the key value data of the 10 host devices can be stored as one entry. That is, the query key values can theoretically be compressed to 1/10.

The CPU utilization is stored as double-precision floating-point values, each index value occupying 64 bits. After compression with this scheme, ideally most data need only store a 2-bit mark field, and a small portion of the data stores a 2-bit mark field plus a 64-bit index value. When the data amount is sufficiently large, the index value data can therefore be compressed to at most 1/32 of the original data (2 bits out of 64 bits).
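The 1/32 bound can be checked with a short Python calculation. The function name `compression_ratio` and the 10% share of original-data records in the second call are hypothetical illustration values, not measurements from the experiment. The estimate counts only mark fields and stored index values, and ignores query keys, the stored modes and the base value field.

```python
def compression_ratio(n_values: int, n_original: int,
                      mark_bits: int = 2, value_bits: int = 64) -> float:
    """Compressed size divided by uncompressed size for the index values only."""
    original_size = n_values * value_bits
    compressed_size = n_values * mark_bits + n_original * value_bits
    return compressed_size / original_size


total = 60 * 24 * 365 * 10  # 5256000 values for 10 host devices over one year

# Ideal case: every value is replaced by a 2-bit mark.
print(compression_ratio(total, 0))            # 0.03125, i.e. 1/32

# Hypothetical case: 10% of the values are stored as original data.
print(compression_ratio(total, total // 10))
```

The second call shows how quickly the ratio moves away from the ideal bound as the share of original-data records grows, which is why a large proportion of modes and stable intervals matters for this scheme.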

The embodiment of the invention provides a data compression method based on big data. When a big data compression request is detected, historical index data to be compressed corresponding to the big data decision request is obtained, the historical index data to be compressed comprising a plurality of index data sets in units of days. A mode corresponding to each data slice subscript is then obtained according to the plurality of index data sets in units of days and a preset data slice range size parameter. The mark field of the index data in each index data set is determined according to the mode corresponding to each data slice subscript and the short-time memory base number corresponding to each index data set, and the index data in each index data set is compressed and stored. In this way, the large number of repeatedly occurring modes in the index data sets are compressed and stored in the mode compression storage mode, and index data that differs from the mode but lies within the acceptable error range of the short-time memory base number is compressed and stored in the short-time memory compression storage mode, so that the waste of storage space for the index data is reduced without losing index data.

Example two

Fig. 2 is a flowchart of a data compression method based on big data according to a second embodiment of the present invention. In this embodiment of the present invention, obtaining a mode corresponding to each data slicing subscript according to a plurality of index data sets taking a day as a unit and a preset data slicing range size parameter may include: randomly acquiring a set number of index data sets as mode election samples from a plurality of index data sets taking days as units; for each index data set in a mode election sample, fragmenting the index data in the index data set according to a preset data fragmentation range size parameter, dividing the index data in the index data set into a plurality of index data fragments, and determining data fragmentation subscripts corresponding to the index data fragments according to a time sequence; merging the index data fragments of the same data fragment subscript in each index data set in the mode election sample to obtain an index data merged set corresponding to each data fragment subscript; and acquiring the index data with the most occurrence times in each index data merging set as the mode corresponding to the subscript of the corresponding data fragment.

As shown in fig. 2, the method of the embodiment of the present invention specifically includes:

Step 201, when a big data compression request is detected, obtaining historical index data to be compressed corresponding to the big data decision request.

Step 202, randomly acquiring a set number of index data sets as a mode election sample from the plurality of index data sets in units of days.

Step 203, for each index data set in the mode election sample, according to a preset data fragmentation range size parameter, fragmenting the index data in the index data set, dividing the index data in the index data set into a plurality of index data fragments, and according to a time sequence, determining data fragmentation subscripts corresponding to the index data fragments.

Step 204, merging the index data fragments of the same data fragment subscript in each index data set in the mode election sample to obtain an index data merged set corresponding to each data fragment subscript.

Each index data merging set comprises index data of the same data fragment subscript in each index data set.

Step 205, acquiring the index data with the most occurrence times in each index data merging set as the mode corresponding to the corresponding data fragment subscript.

Step 206, determining the mark field of the index data in each index data set according to the mode corresponding to each data fragment subscript and the short-time memory base number corresponding to each index data set, and compressing and storing the index data in each index data set.
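The mode election of steps 202 to 205 can be sketched in Python as follows. The function name `elect_modes`, its parameters, and the representation of an index data set as a chronological list with one value per minute are assumptions for illustration only. Ties between equally frequent values are broken arbitrarily by `Counter.most_common`, since the description does not specify a tie-breaking rule.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def elect_modes(day_sets: List[List[float]], sample_size: int,
                slice_minutes: int = 5,
                seed: Optional[int] = None) -> Dict[int, float]:
    """Return the mode for each 1-based data slice subscript."""
    rng = random.Random(seed)
    # Step 202: randomly pick the mode election sample.
    sample = rng.sample(day_sets, min(sample_size, len(day_sets)))
    # Steps 203 and 204: slice each day and merge slices with equal subscripts.
    merged: Dict[int, List[float]] = defaultdict(list)
    for day in sample:
        for minute, value in enumerate(day):
            merged[minute // slice_minutes + 1].append(value)
    # Step 205: the most frequent value in each merged set is the mode.
    return {subscript: Counter(values).most_common(1)[0][0]
            for subscript, values in merged.items()}


days = [
    [1, 1, 2, 2, 2, 7, 7],
    [1, 2, 2, 3, 3, 7, 8],
    [2, 2, 2, 2, 2, 9, 9],
]
print(elect_modes(days, sample_size=3))  # {1: 2, 2: 7}
```

The resulting dictionary is what step 206 consults when deciding whether a piece of index data receives a mode mark.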

The embodiment of the invention provides a data compression method based on big data. A set number of index data sets are randomly acquired, as a mode election sample, from the plurality of index data sets in units of days. For each index data set in the mode election sample, the index data in the index data set is sliced according to the preset data slice range size parameter into a plurality of index data slices, and the data slice subscript of each index data slice is determined in chronological order. The index data slices having the same data slice subscript in the index data sets of the mode election sample are merged to obtain an index data merged set corresponding to each data slice subscript. The index data occurring most frequently in each index data merged set is taken as the mode corresponding to the respective data slice subscript. The mode corresponding to each data slice subscript can thus be obtained from a randomly acquired set number of index data sets and the preset data slice range size parameter.

Example three

Fig. 3 is a schematic structural diagram of a data compression apparatus based on big data according to a third embodiment of the present invention. The apparatus may be implemented in software and/or hardware and may generally be integrated in a computer device. As shown in fig. 3, the apparatus includes: a data acquisition module 301, a data slicing module 302, and a data compression module 303.

The data acquisition module 301 is configured to acquire historical index data to be compressed corresponding to a big data decision request when a big data compression request is detected, where the historical index data to be compressed includes a plurality of index data sets in units of days, each index data set includes all index data collected on the corresponding date, and the index data in each index data set is arranged in chronological order. The data slicing module 302 is configured to obtain a mode corresponding to each data slice subscript according to the plurality of index data sets in units of days and a preset data slice range size parameter. The data compression module 303 is configured to determine the mark field of the index data in each index data set according to the mode corresponding to each data slice subscript and the short-time memory base number corresponding to each index data set, and to compress and store the index data in each index data set.

The embodiment of the invention provides a data compression device based on big data. When a big data compression request is detected, historical index data to be compressed corresponding to the big data decision request is obtained, the historical index data to be compressed comprising a plurality of index data sets in units of days. A mode corresponding to each data slice subscript is then obtained according to the plurality of index data sets in units of days and a preset data slice range size parameter. The mark field of the index data in each index data set is determined according to the mode corresponding to each data slice subscript and the short-time memory base number corresponding to each index data set, and the index data in each index data set is compressed and stored. In this way, the large number of repeatedly occurring modes in the index data sets are compressed and stored in the mode compression storage mode, and index data that differs from the mode but lies within the acceptable error range of the short-time memory base number is compressed and stored in the short-time memory compression storage mode, so that the waste of storage space for the index data is reduced without losing index data.

On the basis of the foregoing embodiments, the data slicing module 302 may include: a sample acquisition unit configured to randomly acquire a set number of index data sets as a mode election sample from among a plurality of index data sets in units of days; the data fragmentation unit is used for fragmenting the index data in the index data set according to a preset data fragmentation range size parameter aiming at each index data set in the mode election sample, dividing the index data in the index data set into a plurality of index data fragments, and determining data fragmentation subscripts corresponding to the index data fragments according to a time sequence; the segment merging unit is used for merging the index data segments of the same data segment subscripts in all the index data sets in the mode election sample to obtain an index data merging set corresponding to all the data segment subscripts; and the mode determining unit is used for acquiring the index data with the most occurrence times in each index data merging set as the mode corresponding to the corresponding data fragment subscript.

On the basis of the foregoing embodiments, the big data based data compression apparatus may further include: and the base number acquisition module is used for acquiring the index value of the first piece of index data in each index data set as a short-time memory base number corresponding to each index data set.

On the basis of the foregoing embodiments, the data compression module 303 may include: a data acquisition unit, configured to acquire one piece of index data in a target index data set in a chronological order as current index data; the mode judging unit is used for judging whether the current index data is a mode corresponding to the data fragment subscript of the index data fragment; a first flag setting unit configured to set a flag field of current index data to a mode flag if the current index data is a mode corresponding to a data slice subscript of an index data slice to which the current index data belongs; and the operation returning unit is used for returning and executing the operation of acquiring one piece of index data in the target index data set as the current index data according to the time sequence until the processing of all the index data in the target index data set is completed.

On the basis of the foregoing embodiments, the data compression module 303 may further include: the base number judging unit is used for judging whether the current index data is in the range of up-down floating acceptable percentage of the short-time memory base number corresponding to the target index data set or not if the current index data is not the mode corresponding to the data fragment subscript of the index data fragment to which the current index data belongs; the second mark setting unit is used for setting the mark field of the current index data as a short-time memory mark if the current index data is within the range of the up-down floating acceptable percentage of the short-time memory base number corresponding to the target index data set; and the operation returning unit is used for returning and executing the operation of acquiring one piece of index data in the target index data set as the current index data according to the time sequence until the processing of all the index data in the target index data set is completed.

On the basis of the foregoing embodiments, the data compression module 303 may further include: the third mark setting unit is used for setting a mark field of the current index data as an original data mark, storing an index value of the current index data and setting the index value of the current index data as a new short-time memory base number corresponding to the target index data set if the current index data is not in the range of up-down floating acceptable percentage of the short-time memory base number corresponding to the target index data set; and the operation returning unit is used for returning and executing the operation of acquiring one piece of index data in the target index data set as the current index data according to the time sequence until the processing of all the index data in the target index data set is completed.

The data compression device based on big data can execute the data compression method based on big data provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the data compression method based on big data.

Example four

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. As shown in fig. 4, the computer device includes a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the computer device may be one or more, and one processor 410 is taken as an example in fig. 4. The processor 410, the memory 420, the input device 430 and the output device 440 in the computer device may be connected by a bus or by other means, and connection by a bus is taken as an example in fig. 4.

The memory 420 serves as a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to a big data-based data compression method in the embodiment of the present invention (for example, the data acquisition module 301, the data fragmentation module 302, and the data compression module 303 in a big data-based data compression apparatus). The processor 410 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 420, that is, implements one of the big data based data compression methods described above. That is, the program when executed by the processor implements: when a big data compression request is detected, acquiring historical index data to be compressed corresponding to the big data decision request, wherein the historical index data to be compressed comprises a plurality of index data sets taking days as units, the index data sets comprise all index data collected on corresponding dates, and the index data in the index data sets are arranged according to a time sequence; obtaining a mode corresponding to each data fragment subscript according to a plurality of index data sets taking days as units and a preset data fragment range size parameter; and determining a mark field of the index data in each index data set according to the mode corresponding to each data fragment subscript and the short-time memory base number corresponding to each index data set, and compressing and storing the index data in each index data set.

The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the computer device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus, and may include a keyboard and a mouse, etc. The output device 440 may include a display device such as a display screen.

On the basis of the foregoing embodiments, the processor 410 is configured to obtain a mode corresponding to each data slice index according to a plurality of index data sets in units of days and a preset data slice range size parameter in the following manner: randomly acquiring a set number of index data sets as mode election samples from a plurality of index data sets taking days as units; for each index data set in a mode election sample, fragmenting the index data in the index data set according to a preset data fragmentation range size parameter, dividing the index data in the index data set into a plurality of index data fragments, and determining data fragmentation subscripts corresponding to the index data fragments according to a time sequence; merging the index data fragments of the same data fragment subscript in each index data set in the mode election sample to obtain an index data merged set corresponding to each data fragment subscript; and acquiring the index data with the most occurrence times in each index data merging set as the mode corresponding to the subscript of the corresponding data fragment.

On the basis of the above embodiments, the processor 410 further performs the following operations: and acquiring the index value of the first piece of index data in each index data set as a short-time memory base number corresponding to each index data set.

On the basis of the foregoing embodiments, the processor 410 is configured to determine the mark field of the index data in each index data set according to the mode corresponding to each data slice subscript and the short-time memory base number corresponding to each index data set, and to compress and store the index data in each index data set, in the following manner: acquiring one piece of index data in a target index data set according to a time sequence to serve as current index data; judging whether the current index data is the mode corresponding to the data fragment subscript of the index data fragment to which the current index data belongs; if the current index data is a mode corresponding to the data fragment subscript of the index data fragment, setting a mark field of the current index data as a mode mark; and returning to execute the operation of acquiring one piece of index data in the target index data set as the current index data according to the time sequence until the processing of all the index data in the target index data set is completed.

On the basis of the foregoing embodiments, after judging whether the current index data is the mode corresponding to the data fragment subscript of the index data fragment to which the current index data belongs, the processor 410 further performs the following operations: if the current index data is not the mode corresponding to the data fragment subscript of the index data fragment to which it belongs, judging whether the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set; if the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set, setting the mark field of the current index data as a short-time memory mark; and returning to the operation of acquiring one piece of index data in the target index data set according to the time sequence as the current index data, until all the index data in the target index data set has been processed.

On the basis of the foregoing embodiments, after judging whether the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set, the processor 410 further performs the following operations: if the current index data is not within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the target index data set, setting the mark field of the current index data as an original data mark, storing the index value of the current index data, and setting the index value of the current index data as the new short-time memory base number corresponding to the target index data set; and returning to the operation of acquiring one piece of index data in the target index data set according to the time sequence as the current index data, until all the index data in the target index data set has been processed.
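The three-way marking decision described in the preceding paragraphs (mode mark, short-time memory mark, original data mark) can be sketched as a single loop over one day's index data. This is a hedged illustration under stated assumptions, not the patented implementation: the tag values `"M"`, `"S"`, `"R"`, the function name `compress_day`, the `(mark, stored_value)` tuple layout, and the choice to return the initial base number alongside the marked records are all hypothetical. The fragment size is again assumed to count records, and the acceptable range is taken as plus or minus `tolerance_pct` percent of the current base number.

```python
MODE_MARK, SHORT_MARK, RAW_MARK = "M", "S", "R"  # hypothetical mark values


def compress_day(values, modes, shard_size, tolerance_pct):
    """Mark each index value of one day; return (initial_base, records).

    values        -- one day's index values in time order
    modes         -- {data fragment subscript: mode}
    shard_size    -- assumed meaning: number of records per fragment
    tolerance_pct -- acceptable up-down floating percentage of the base
    """
    if not values:
        return None, []
    base = values[0]  # short-time memory base number: first value of the day
    initial_base = base
    records = []
    for position, value in enumerate(values):
        subscript = position // shard_size
        if value == modes.get(subscript):
            # Value equals the mode of its fragment: only the mark is kept.
            records.append((MODE_MARK, None))
        elif abs(value - base) <= abs(base) * tolerance_pct / 100.0:
            # Close enough to the current base number: only the mark is kept.
            records.append((SHORT_MARK, None))
        else:
            # Otherwise store the value itself and make it the new base number.
            records.append((RAW_MARK, value))
            base = value
    return initial_base, records


initial_base, records = compress_day(
    [10, 10, 11, 20, 21, 10], modes={0: 10, 1: 21}, shard_size=3, tolerance_pct=10
)
print(initial_base, records)
```

In this run the third value (11) is within 10 percent of the base number 10 and receives only a short-time memory mark, while 20 falls outside the range, is stored as original data, and becomes the new base number against which the final value 10 is then compared.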

EXAMPLE five

Embodiment five of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the big data based data compression method provided in any embodiment of the present invention. The computer-readable storage medium provided by this embodiment can likewise perform the related operations of the big data based data compression method according to any embodiment of the present invention. That is, the computer program, when executed by the processor, implements: when a big data compression request is detected, acquiring historical index data to be compressed corresponding to the big data compression request, wherein the historical index data to be compressed comprises a plurality of index data sets taking days as units, the index data sets comprise all index data collected on corresponding dates, and the index data in the index data sets are arranged according to a time sequence; obtaining a mode corresponding to each data fragment subscript according to the plurality of index data sets taking days as units and a preset data fragment range size parameter; and determining a mark field of the index data in each index data set according to the mode corresponding to each data fragment subscript and the short-time memory base number corresponding to each index data set, and compressing and storing the index data in each index data set.

From the above description of the embodiments, it will be clear to those skilled in the art that the present invention can be implemented by software together with the necessary general-purpose hardware, and can also be implemented by hardware alone, although the former is the better implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the data compression apparatus based on big data, the units and modules included in the embodiment are only divided according to the functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It should be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; the scope of the present invention is determined by the appended claims.

Claims

1. A method for compressing data based on big data is characterized by comprising the following steps:

obtaining a mode corresponding to each data fragment subscript according to the index data sets taking the day as a unit and a preset data fragment range size parameter;

determining a mark field of the index data in each index data set according to the mode corresponding to each data fragment subscript and the short-time memory base number corresponding to each index data set, and performing compression storage on the index data in each index data set, wherein the short-time memory base number is a base number used for performing short-time memory data compression, and the short-time memory base number corresponding to each index data set is the index value of the first piece of index data in that index data set;

wherein the following is performed for each index data set:

acquiring one piece of index data in the index data set according to the time sequence to serve as the current index data;

judging whether the current index data is the mode corresponding to the data fragment subscript of the index data fragment to which the current index data belongs;

if the current index data is the mode corresponding to the data fragment subscript of the index data fragment to which it belongs, setting the mark field of the current index data as a mode mark;

if the current index data is not the mode corresponding to the data fragment subscript of the index data fragment to which it belongs, judging whether the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the index data set;

if the current index data is within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the index data set, setting the mark field of the current index data as a short-time memory mark;

if the current index data is not within the up-down floating acceptable percentage range of the short-time memory base number corresponding to the index data set, setting the mark field of the current index data as an original data mark, storing the index value of the current index data, and setting the index value of the current index data as the new short-time memory base number corresponding to the index data set;

and continuing until the processing of all the index data in the index data set is completed.

2. The method according to claim 1, wherein obtaining a mode corresponding to each data fragment subscript according to the index data sets in units of days and a preset data fragment range size parameter includes:

randomly acquiring a set number of index data sets as mode election samples from the plurality of index data sets taking the day as a unit;

for each index data set in the mode election sample, fragmenting the index data in the index data set according to a preset data fragmentation range size parameter, dividing the index data in the index data set into a plurality of index data fragments, and determining data fragmentation subscripts corresponding to the index data fragments according to a time sequence;

merging the index data fragments of the same data fragment subscript in each index data set in the mode election sample to obtain an index data merged set corresponding to each data fragment subscript;

and acquiring the index data with the most occurrence times in each index data merging set as the mode corresponding to the subscript of the corresponding data fragment.

3. A computer device comprising a processor and a memory, the memory to store instructions that, when executed, cause the processor to:

4. The computer device of claim 3, wherein the processor is configured to derive a mode corresponding to each data slice index from the plurality of index data sets in units of days and a preset data slice range size parameter by:

5. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a big data based data compression method according to any of claims 1-2.