CN112953550A - Data compression method, electronic device and storage medium - Google Patents

Publication number
CN112953550A
Authority
CN
China
Prior art keywords
data
length
scanning window
compression
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110309332.8A
Other languages
Chinese (zh)
Other versions
CN112953550B (en)
Inventor
杨卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Fujia Information Technology Co ltd
Original Assignee
Shanghai Fujia Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fujia Information Technology Co ltd filed Critical Shanghai Fujia Information Technology Co ltd
Priority to CN202110309332.8A priority Critical patent/CN112953550B/en
Publication of CN112953550A publication Critical patent/CN112953550A/en
Application granted granted Critical
Publication of CN112953550B publication Critical patent/CN112953550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Abstract

The embodiment of the invention relates to the field of data processing, and discloses a data compression method, electronic equipment and a storage medium. The data compression method comprises the following steps: acquiring at least two types of initial data, wherein the length of each data to be compressed in each type of initial data is the same; according to any piece of data to be compressed in each type of initial data, adjusting the length of a scanning window in dictionary coding until the length of the scanning window is matched with the current type; and compressing the initial data of the current category in a dictionary coding mode according to the length of the matched scanning window to obtain a first compression result of the initial data. By adopting the embodiment of the invention, the compression speed and the compression ratio of the data can be improved.

Description

Data compression method, electronic device and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a data compression method, an electronic device, and a storage medium.
Background
With the development of information technology worldwide, every industry generates more and more data, far exceeding the processing capability of conventional data processing means, and the storage of mass data faces a significant challenge. As an effective means of storing mass data, data compression technology has matured steadily and is widely applied in many fields.
The aviation system has mass data: a single flight of one airplane can generate several TB of flight data and associated data, which poses huge data management and analysis problems for the aviation field. But massive data is both a challenge and an opportunity. Data compression technology can compress massive multi-source heterogeneous aviation data; combined with emerging means such as data transmission and distributed storage, it enables large-scale aviation data storage and makes data mining on the decompressed data convenient, which better supports the development of the aviation system, improves flight safety, and assists flight quality assessment and airplane maintenance.
Lossless compression is a technology that encodes a large amount of data according to a certain method in order to compress and store information. No loss of precision is allowed during compression, and the compressed data can be restored by decoding to its original state before compression. Lossless compression is mainly used for text files, databases, program data and data for special applications. However, the compression ratio of current lossless compression methods is low, generally 1/2 to 1/5, and cannot meet the compression requirements of mass data.
Disclosure of Invention
An object of embodiments of the present invention is to provide a data compression method, an electronic device, and a storage medium, which can improve the compression speed and the compression ratio of data.
To solve the above technical problem, an embodiment of the present invention provides a method for data compression, including: acquiring at least two types of initial data, wherein the length of each data to be compressed in each type of initial data is the same; according to any piece of data to be compressed in each type of initial data, adjusting the length of a scanning window in dictionary coding until the length of the scanning window is matched with the current type; and compressing the initial data of the current category in a dictionary coding mode according to the length of the matched scanning window to obtain a first compression result of the initial data.
An embodiment of the present invention also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of data compression described above.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the above-mentioned data compression method.
In the embodiment of the invention, the initial data is compressed by dictionary coding. A scanning window length is matched to each data length, and the data of a category is compressed with the adjusted scanning window length. Because the scanning window length for the current category is fixed, the window does not have to be changed repeatedly during scanning, which reduces the number of times a small window is used, speeds up each acquisition of scanned data, and thus increases the compression speed. As the scanning window grows, the length of the scanned data that can be queried grows as well, more data is compressed, and the compression ratio increases. The length of the scanning window can be adjusted according to the length of the data to be compressed, so that the scanning window suits different categories and the adaptability of the compression is improved.
In addition, after the initial data of the current category is compressed in the dictionary coding manner according to the length of the matched scanning window and a first compression result of the initial data is obtained, the method further includes: taking the first compression result as compression input data; and coding the compressed input data by finite state entropy coding to generate a second compression result. The first compression result is thus compressed a second time. Because finite state entropy coding compresses quickly, the compression speed is further improved, and because the data is compressed again, the compression ratio is improved.
Additionally, prior to the obtaining at least two categories of initial data, the method further comprises: acquiring original data as compressed input data; coding the compressed input data by finite state entropy coding to generate a second compression result; and taking the second compression result as the initial data. The original data is coded by finite state entropy coding before dictionary coding is carried out, so that the compression ratio of the data is further improved.
In addition, adjusting the length of a scanning window in dictionary coding, according to any piece of data to be compressed in each type of initial data, until the length of the scanning window is matched with the current type comprises the following steps: acquiring data located in a forward cache region from the data to be compressed as query data; acquiring data located in a scanning window from a coding library as coded data; judging whether data matched with the query data exists in the coded data; if no matched data exists, adjusting the length of the scanning window and taking the adjusted scanning window as the scanning window for obtaining the coded data next time, until the data to be compressed has been matched completely; if matched data exists, recording the position of the matched coded data and taking the current scanning window as the scanning window for acquiring the coded data next time; and determining the length of the scanning window matched with the current category according to the length of the scanning window at the time the data to be compressed finishes matching. While any one piece of data to be compressed is encoded, the length of the scanning window is adjusted whenever no data matches the query data and kept whenever data matches the query data, so that the scanning window length corresponding to the current category is continuously optimized.
In addition, if there is no matching data, the method further comprises: storing the query data into the coding library; and acquiring the storage address of the query data, and taking the storage address as a reference address for acquiring the encoded data next time.
In addition, determining the length of the scanning window matched with the current category according to the length of the scanning window at the time the data to be compressed finishes matching includes: judging whether that length exceeds a preset length, and if so, taking the preset length as the length of the scanning window matched with the current category. The maximum length of the scanning window is the number of bits of the processor, which avoids the reduction in scanning rate caused by an overlong scanning window.
In addition, the dictionary coding is LZ4 coding; the storing the query data into the coding library includes: carrying out hash processing on the query data to obtain a hash value; and storing the hash value and the address corresponding to the hash value in the hash table. LZ4 codes quickly, which improves the data coding speed.
In addition, the initial data includes: aviation data or traffic data.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, which correspond to the figures in which like reference numerals refer to similar elements and which are not to scale unless otherwise specified.
FIG. 1 is a flow chart of a method of data compression provided in accordance with a first embodiment of the present invention;
FIG. 2 is a flow chart of a method of data compression provided in accordance with a second embodiment of the present invention;
FIG. 3 is a flow chart of a method of data compression provided in accordance with a third embodiment of the present invention;
FIG. 4 is a flow chart of a method of data compression provided in accordance with a fourth embodiment of the present invention;
FIG. 5 is a block diagram of an electronic device provided in a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the various embodiments in order to provide a better understanding of the present application. However, the technical solution claimed in the present application can be implemented even without these technical details and with various changes and modifications based on the following embodiments.
The following embodiments are divided for convenience of description, and should not constitute any limitation to the specific implementation manner of the present invention, and the embodiments may be mutually incorporated and referred to without contradiction.
A first embodiment of the invention relates to a method of data compression, applied to electronic equipment. The flow is shown in FIG. 1:
Step 101: acquiring at least two types of initial data, wherein the length of each data to be compressed in each type of initial data is the same.
Step 102: adjusting, according to any piece of data to be compressed in each type of initial data, the length of a scanning window in dictionary coding until the length of the scanning window is matched with the current category.
Step 103: compressing the initial data of the current category in a dictionary coding mode according to the length of the matched scanning window to obtain a first compression result of the initial data.
In the embodiment of the invention, the initial data is compressed by dictionary coding. A scanning window length is matched to each data length, and the data of a category is compressed with the adjusted scanning window length. Because the scanning window length for the current category is fixed, the window does not have to be changed repeatedly during scanning, which reduces the number of times a small window is used, speeds up each acquisition of scanned data, and thus increases the compression speed. As the scanning window grows, the length of the scanned data that can be queried grows as well, more data is compressed, and the compression ratio increases. The length of the scanning window can be adjusted according to the length of the data to be compressed, so that the scanning window suits different categories and the adaptability of the compression is improved.
A second embodiment of the invention relates to a method of data compression. The second embodiment is a further improvement of the first embodiment, and the main improvement is as follows: in this embodiment, the length of the scanning window matched with the current category is determined according to the length of the scanning window at the time any one piece of data to be compressed finishes matching. The flow is shown in FIG. 2:
Step 201: acquiring at least two types of initial data, wherein the length of each data to be compressed in each type of initial data is the same.
Specifically, at least two categories of initial data are obtained, and the length of each data to be compressed in each category of initial data is the same. For example, airborne data items may be 1 bit, 2 bits, 3 bits, 12 bits, 16 bits, 18 bits, 20 bits, 24 bits long, and so on. Each bit length can be treated as one category; for example, all data with a length of 1 bit falls into the same category. It is understood that the initial data of each category includes at least 2 pieces of data to be compressed.
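The grouping into categories described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `group_by_length` and the representation of records as strings of "0"/"1" characters are assumptions made for the example.

```python
from collections import defaultdict

def group_by_length(records):
    """Group records into categories keyed by their length in bits."""
    categories = defaultdict(list)
    for record in records:
        categories[len(record)].append(record)
    return dict(categories)

# Records are modelled as bit strings; this representation is an assumption.
samples = ["1", "0", "10", "11", "101", "110", "1"]
categories = group_by_length(samples)
```

Each key of `categories` is one bit length, and every record under a key has that length, which is the property step 201 relies on.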
In practical applications, the initial data may be aviation data or traffic data. Aviation data is generally produced in massive volumes measured in TB, and it generally includes at least two types of initial data with different lengths.
Steps 202 to 207 in the present embodiment describe step 102 in detail: the length of the scanning window matched with the current category is determined adaptively from any one piece of data to be compressed. Here, the compression ratio equals the size of the source file divided by the size of the compressed file.
Step 202: acquiring data located in a forward buffer area from the data to be compressed as query data.
Specifically, in this embodiment, the original data is compressed by dictionary coding, which encodes data by looking it up in a dictionary. The principle of dictionary coding is to collect long character strings or frequently occurring letter combinations as entries in a dictionary and to replace each occurrence of an entry with a relatively short number or symbol, thereby compressing the data. The compression effect of dictionary coding depends on how much repeated data occurs and on the size of the dictionary. Dictionary coding algorithms mainly include the LZ77 algorithm, the LZSS algorithm, the LZ78 algorithm, the LZW algorithm, etc.
The dictionary coding mode comprises a forward cache region and a scanning window, query data are obtained by moving the forward cache region, and coding data in a coding library are obtained by the scanning window, wherein the coding data comprise: each dictionary data and the code corresponding to the dictionary data, which may be a character, an address, etc.
And coding the data to be compressed, wherein the obtained coding result is used as a first compression result of the data to be compressed. The length of the forward buffer may be a preset fixed length, such as 3 bits. In order to improve the coding efficiency, the query data to be queried each time is acquired by continuously moving the forward cache region, the data to be compressed is aligned with the forward cache region during the first query, and the data in the forward cache region is acquired as the query data. For example, a piece of data to be compressed is "abccccdefgfe", the length of the forward buffer is 3 bits, the data to be compressed and the forward buffer are aligned, the query data acquired for the first time is "abc", the forward buffer is moved, the query data acquired for the second time is "bcc", and the forward buffer is continuously moved until all the data to be compressed are read.
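The way the forward buffer slides over the data in the example above can be sketched like this. It is an illustrative sketch only: the function name `query_windows`, the one-unit step, and the treatment of each character as one unit of length are assumptions.

```python
def query_windows(data, buffer_len=3):
    """Yield the contents of the forward buffer as it slides one unit at a time."""
    for start in range(len(data) - buffer_len + 1):
        yield data[start:start + buffer_len]

# The sample string comes from the text; the first two queries are "abc" and "bcc".
queries = list(query_windows("abccccdefgfe"))
```

If the data is shorter than the buffer, the generator yields nothing; how such short inputs should be handled is not specified in the text.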
Step 203: acquiring data located in the scanning window from the coding library as coded data.
Specifically, the encoded data is continuously acquired by moving the scanning window. The encoded data includes: each dictionary data and the code corresponding to the dictionary data, which may be a character, an address, etc. In one example, the dictionary coding is LZ4 coding; storing the query data into a code library comprises: carrying out hash processing on the query data to obtain a hash value; and storing the hash value and the address corresponding to the hash value in a hash table.
LZ4 coding stores the dictionary as a hash table, that is, it hashes the characters and stores each hash value together with its corresponding address. The hash table maps keys to values, expressed as key: value. The key is a 4-byte binary sequence; the value indicates the position, recorded in the hash table, of the 4 bytes in the key. An appropriate hash table size may be selected according to the size of the memory. The data stored in the hash table is a "location", and there are three cases: if the input size is less than 64 bits, the command byU16 is used, which indicates a 16-bit offset value; if the input size is greater than 64 bits and the pointer size is 8 bytes, the command byU32 is used, which indicates a 32-bit offset; if the input size is greater than 64 bits and the pointer size is 4 bytes, the command byPtr is used, which indicates a 32-bit pointer. The hash value may be calculated using an integer hash algorithm based on multiplication: the multiplier 2654435761U is the golden-ratio prime for 2^32, since 2654435761/4294967296 ≈ 0.618033987. The hash value and its address are then stored in the hash table.
In a specific application, the hash value starting at the first byte can be calculated, and the hash value and its address stored, by calling "LZ4_putPosition"; the hash value of the data can be queried by calling "LZ4_hashPosition"; and the correspondence between the hash value and the address can be stored by calling "LZ4_putPositionOnHash".
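The multiplicative hashing and position storage described above can be sketched as follows. This is a simplified Python illustration, not the LZ4 source: the helper names `hash4` and `put_position`, the table size `HASH_LOG`, the little-endian byte order, and the use of a Python dict as the hash table are all assumptions.

```python
GOLDEN_PRIME = 2654435761  # multiplier quoted in the text
HASH_LOG = 12              # assumed table size of 2**12 slots (hypothetical)

def hash4(four_bytes):
    """Multiplicative hash of a 4-byte sequence, reduced to HASH_LOG bits."""
    value = int.from_bytes(four_bytes, "little")
    return ((value * GOLDEN_PRIME) & 0xFFFFFFFF) >> (32 - HASH_LOG)

def put_position(table, data, pos):
    """Store pos under the hash of data[pos:pos+4]; return the previous position or None."""
    key = hash4(data[pos:pos + 4])
    previous = table.get(key)
    table[key] = pos
    return previous

table = {}
data = b"abcdXabcd"
first = put_position(table, data, 0)   # nothing stored yet for "abcd"
second = put_position(table, data, 5)  # same 4 bytes again, earlier position is found
```

A hit only says that two positions share a hash value; a real coder would still compare the bytes at both positions before treating them as a match.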
Step 204: judging whether the data matched with the query data exists in the coded data, if not, executing step 205; otherwise, step 206 is performed.
Specifically, if there is data matching the query data in the encoded data, the position of the data matching the query data in the hash table may be obtained. If not, go to step 205.
Step 205: adjusting the length of the scanning window, and taking the adjusted scanning window as the scanning window for acquiring the encoded data next time, until the data to be compressed has been matched completely.
Specifically, if there is no matching data, no match was found in the encoded data this time. The length of the scanning window can be increased, and the adjusted scanning window is used as the scanning window for acquiring the encoded data next time. For example, the length of the initial scanning window is preset to 10 bits. At the first match attempt no data is stored in the hash table, so no data is located in the scanning window, that is, there is no matched encoded data, and the length of the scanning window for acquiring the encoded data next time is adjusted to 11 bits.
It is understood that the length of the scanning window may be increased according to a fixed length, for example, the length of each increase is fixed to 2 bits, or may be increased according to any value, for example, the length of the first scanning window is 10 bits, the length of the second scanning window is 12 bits, the length of the third scanning window is 16 bits, and the length of the fourth scanning window is 20 bits.
Further, if no matched data exists, the query data can be stored in a code library; and acquiring the storage address of the query data, and taking the storage address as a reference address for acquiring the encoded data next time.
Specifically, if the query data is stored in the coding library in a dictionary form, the query data may be directly stored in the coding library, and the storage address of the query data may be stored at the same time. If the encoded data is stored in the hash table, the query data may be subjected to hash processing, the hash-processed query data is stored in the hash table, and the storage address corresponding to the hash value is stored at the same time.
The current storage address may be used as a reference address for acquiring the encoded data next time, and the next encoded data is acquired according to the scanning window with the adjusted length by using the reference address as a starting point, that is, the step 202 is executed again.
Step 206: recording the position of the matched coded data, and taking the current scanning window as the scanning window for acquiring the coded data next time.
Specifically, if there is matching encoded data, the position of the encoded data is acquired. If the next query data is empty, i.e. the data matching is completed, step 207 is executed.
Step 207: after the data to be compressed has been matched completely, determining the length of the scanning window matched with the current category according to the length of the scanning window at that time.
Specifically, it is judged whether the length of the scanning window at the time the data to be compressed finishes matching exceeds a preset length; if it does, the preset length is used as the length of the scanning window matched with the current category. The preset length may be the number of bits of the processor in the current electronic device; for example, if the processor is 32 bits, the preset length is 32 bits. If the length of the scanning window is 33 bits when the data to be compressed finishes matching, 32 bits is selected as the matched scanning window length of the current category.
It should be noted that, before the data to be compressed is scanned for the first time, a default length of the scanning window may be set in advance, and the default length may be 10 bits.
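Steps 202 to 207 can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `fit_window_length`, the growth step of 1, the use of a dict in place of the coding library, and remembering every query position (not only on a miss) are simplifications introduced here. The default length of 10 and the cap of 32 follow the examples in the text.

```python
def fit_window_length(sample, buffer_len=3, initial_len=10, step=1, max_len=32):
    """Grow the scanning window on every miss and return the capped final length."""
    window_len = initial_len
    seen = {}  # stands in for the coding library: query data -> last position
    for pos in range(len(sample) - buffer_len + 1):
        query = sample[pos:pos + buffer_len]          # step 202: forward buffer
        previous = seen.get(query)                    # step 203: look in the library
        matched = previous is not None and pos - previous <= window_len
        if not matched:
            window_len += step                        # step 205: enlarge on a miss
        seen[query] = pos                             # remember the query position
    return min(window_len, max_len)                   # step 207: cap at preset length

repeating = fit_window_length("abcabcabc")
no_repeats = fit_window_length("abcdefghijklmnopqrstuvwxyz0123456789")
```

For the repeating sample only the first three queries miss, so the window settles at 13; for the sample without repeats the window would grow past the cap and is limited to 32.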
Step 208: compressing the initial data of the current category in a dictionary coding mode according to the length of the matched scanning window to obtain a first compression result of the initial data.
Specifically, the remaining data to be compressed in the current category is encoded according to the determined length of the scanning window, and a first compression result of the initial data is generated.
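Compressing the remaining records of a category with one fixed window length can be illustrated with a small LZ77-style coder. This is a generic sketch, not LZ4 and not the patented coder: the token format, the greedy longest-match search, and the names `lz_compress`, `lz_decompress` and `compress_category` are assumptions.

```python
def lz_compress(data, window_len, min_match=3):
    """Greedy LZ77-style coder emitting ('lit', ch) or ('ref', offset, length) tokens."""
    tokens, pos = [], 0
    while pos < len(data):
        best_len, best_off = 0, 0
        for cand in range(max(0, pos - window_len), pos):
            length = 0
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        if best_len >= min_match:
            tokens.append(("ref", best_off, best_len))
            pos += best_len
        else:
            tokens.append(("lit", data[pos]))
            pos += 1
    return tokens

def lz_decompress(tokens):
    """Rebuild the original string; copies one unit at a time so overlaps work."""
    out = []
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])
    return "".join(out)

def compress_category(records, window_len):
    """Compress every record of one category with the same fixed window length."""
    return [lz_compress(record, window_len) for record in records]

tokens = lz_compress("abcabcabcabc", window_len=8)
```

Because the window length is fixed per category, `compress_category` never has to re-tune it between records, which is the speed argument made in the text.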
In the embodiment, any one piece of data to be compressed is selected for compression, and the length of a scanning window is continuously expanded in the compression process until the data to be compressed is matched; the size of the scanning window is adjusted in a searching mode, so that the length of the determined scanning window is matched with the initial data of the current category, and the compression speed is improved.
The steps of the above methods are divided for clarity, and the implementation may be combined into one step or split some steps, and the steps are divided into multiple steps, so long as the same logical relationship is included, which are all within the protection scope of the present patent; it is within the scope of the patent to add insignificant modifications to the algorithms or processes or to introduce insignificant design changes to the core design without changing the algorithms or processes.
A third embodiment of the present invention relates to a method of data compression. The embodiment is a further improvement of the above embodiment, and the main improvements are as follows: after the first compression result of the initial data is obtained, the compressed input data is encoded by adopting a finite state entropy coding mode to generate a second compression result. The flow is shown in fig. 3.
Step 301: acquiring at least two types of initial data, wherein the length of each data to be compressed in each type of initial data is the same.
Step 302: adjusting, according to any piece of data to be compressed in each category of initial data, the length of a scanning window in dictionary coding until the length of the scanning window is matched with the current category.
The specific implementation process of step 302 is substantially the same as that of step 202 to step 207 in the second embodiment, and is not described herein again.
Step 303: compressing the initial data of the current category in a dictionary coding mode according to the length of the matched scanning window to obtain a first compression result of the initial data.
This step is substantially the same as step 208 in the second embodiment, and is not described herein again.
Step 304: the first compression result is taken as the compression input data.
In particular, the first compression result comprises a sequence of dictionary positions. The first result is used as the compressed input data for the second compression.
Step 305: coding the compressed input data by finite state entropy coding to generate a second compression result.
Specifically, the principle of entropy coding is to count how often each character occurs and to re-encode the characters accordingly. Entropy coding is independent of the arrangement order of the original data and depends only on the occurrence frequency of the characters. The main compression algorithms include Shannon-Fano coding, run-length coding, Huffman coding, arithmetic coding and the like. In this embodiment, Finite State Entropy (FSE) coding is used, which also belongs to entropy coding.
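The dependence on frequency rather than order can be made concrete by computing the Shannon entropy of a message, which is the lower bound in bits per character that an entropy coder aims for. The function name `shannon_entropy` is chosen for this example and does not come from the text.

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Average number of bits per character implied by the character frequencies."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

same_counts_a = shannon_entropy("aabb")
same_counts_b = shannon_entropy("abab")
```

Both strings contain two "a" and two "b", so they yield the same value even though the characters are arranged differently.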
The following describes the FSE encoding scheme specifically:
The compressed input data is counted, a compression lookup table is constructed from the counts, and the compressed input data is compressed through the compression lookup table. The compression lookup table is shown in Table 1:
state/line number    A    B    C
1                    2    3    5
2                    4    6   10
3                    7    8   15
4                    9   11   20
5                   12   14   25
6                   13   17   30
7                   16   21
8                   18   22
9                   19   26
10                  23   28
11                  24
12                  27
13                  29
14                  31
TABLE 1
The first row of Table 1 lists each distinct character and the first column gives the line number; the value at a given line number and character is a state value. The number of lines occupied by each character is determined by the initial state value and by the character's probability, which reflects the number of times the character appears. As shown in Table 1, A occupies 14 lines and B occupies 10 lines.
The number of distinct characters appearing in the compressed input data, the number of occurrences of each character, and the occurrence count of the most frequent character are counted by functions. The maximum occurrence count is used to determine whether FSE compression can be performed on the compressed input data: it is compared with a preset determination threshold, and if it does not exceed the threshold, FSE compression is not performed. For example, with the determination threshold set to 1 in this example, a maximum occurrence count that does not exceed 1 indicates that no character appears repeatedly in the current compressed input data, so compression is not performed.
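The counting and the compressibility check can be sketched as follows. This is an illustration, not the FSE library: the names `histogram` and `worth_compressing` are hypothetical, and the sketch assumes that a block is skipped when no character repeats, that is, when the largest count does not exceed a threshold of 1.

```python
from collections import Counter

def histogram(data):
    """Return (number of distinct characters, per-character counts, largest count)."""
    counts = Counter(data)
    return len(counts), dict(counts), max(counts.values(), default=0)

def worth_compressing(data, threshold=1):
    """Skip entropy coding when no character repeats (largest count <= threshold)."""
    _, _, max_count = histogram(data)
    return max_count > threshold

distinct, counts, max_count = histogram("aab")
```

An input such as "abc", where every character occurs once, gives the coder no frequency skew to exploit, which is why it is passed through uncompressed.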
In this example, the number of distinct characters appearing in the compressed input data, which may also be referred to as the maximum character value, is counted first. The maximum character value is compared with a preset maximum value, and a HIST function matched with the compressed input data is selected according to the comparison result to count the number of occurrences of each distinct character. For example, with a preset maximum value of 255: if the maximum character value is not equal to 255, the occurrences are counted by executing the "HIST_count_parallel_wksp" function; if the maximum character value is equal to 255 and the data volume of the compressed input data does not exceed 1500 bits, they are counted by executing the "HIST_countFast_wksp" function; if the maximum character value is equal to 255 and the data volume of the compressed input data is larger than 1500 bits, they are counted by executing the "HIST_count_parallel_wksp" function. The "HIST_count_wksp" function counts the occurrences of each distinct symbol through four intermediate data.
The occurrence counts of the symbols and the compressed input data together determine the size of the compression lookup table. The maximum value of the compression lookup table may be determined according to the size of the compressed input data. Comparing the compressed input data with the maximum character value can determine the maximum value of the compression lookup table, and the size of the compression lookup table lies between the minimum value and the maximum value of the compression lookup table. Further, the default maximum value of the compression lookup table may be set to FSE_DEFAULT_TABLELOG in advance, and the maximum value of the compression lookup table determined from the size of the compressed input data may be denoted maxBitsSrc; if maxBitsSrc is smaller than FSE_DEFAULT_TABLELOG, the default maximum value of the compression lookup table is changed to maxBitsSrc.
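The bounding of the table size can be sketched as below. This is an illustration under stated assumptions: the function name `choose_table_log`, the default of 11, the limits 5 and 12, and the exact formulas for the source-size bound and the lower bound are all illustrative choices, not values given in the description.

```python
def choose_table_log(src_size: int, max_symbol_value: int,
                     default_log: int = 11,
                     min_log: int = 5, max_log: int = 12) -> int:
    """Pick a table size exponent between a lower and an upper bound.

    Assumptions (hypothetical): the upper bound derived from the input size
    is floor(log2(src_size - 1)) - 2, and the lower bound is the smaller of
    the bits needed for the input size and for the largest symbol value.
    """
    if src_size < 2:
        raise ValueError("need at least two input symbols")
    max_bits_src = (src_size - 1).bit_length() - 1 - 2
    min_bits = min(src_size.bit_length(), max_symbol_value.bit_length() + 1)
    table_log = default_log
    if max_bits_src < table_log:   # small input: shrink the default maximum
        table_log = max_bits_src
    if min_bits > table_log:       # never go below the required minimum
        table_log = min_bits
    return max(min_log, min(max_log, table_log))
```

For large inputs the default applies unchanged; for small inputs the size-derived bound takes over.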
Memory is allocated for the compression lookup table according to its determined size. The FSE_optimalTableLog function is executed to find the optimal number of rows (denoted tableLog) from the maximum character value and the largest occurrence count among the symbols, that is, the maximum number of rows for each symbol. The FSE_normalizeCount function is executed to normalize the count array into normalizedCounter, and the FSE_writeCount function is executed to write the normalized array into the normalizedCounter buffer.
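Normalization scales the raw counts so that they sum to the table size (2 to the power tableLog) while every symbol that occurs keeps at least one row. The sketch below uses simple proportional scaling with a rounding correction on the most frequent symbol; it is a stand-in for illustration and is not claimed to match the normalization performed by FSE_normalizeCount. The name `normalize_counts` is hypothetical.

```python
def normalize_counts(counts: dict[int, int], table_log: int) -> dict[int, int]:
    """Scale counts so they sum to 2**table_log, each present symbol >= 1."""
    table_size = 1 << table_log
    present = {s: c for s, c in counts.items() if c > 0}
    total = sum(present.values())
    if total == 0:
        raise ValueError("no symbols to normalize")
    if len(present) > table_size:
        raise ValueError("table too small for the number of distinct symbols")
    norm = {s: max(1, c * table_size // total) for s, c in present.items()}
    diff = table_size - sum(norm.values())
    if diff > 0:
        # Rounding down lost some rows: give them to the largest symbol.
        norm[max(norm, key=norm.get)] += diff
    while diff < 0:
        # Forcing rare symbols up to 1 overshot: take rows from the largest.
        norm[max(norm, key=norm.get)] -= 1
        diff += 1
    return norm
```

The returned mapping can be used directly as the row allocation of the lookup table.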
The FSE_buildCTable_wksp function is executed to build the compression lookup table, and FSE_compress_usingCTable is then executed to compress using that table. The FSE compression process is as follows: read the first character of the compressed input data and the initial state value, and find the row in which the character is located; shift the state value until it is smaller than the row number, and output the shifted-out bits in binary form; take the state value obtained after the shift as the next state value, and perform the next lookup with this state value and the next symbol; repeat until all of the compressed input data has been processed, and then output the final state value to the binary stream. The generated binary stream is the second compression result, which comprises: the bits shifted out at each lookup, the binary representation of the final state value, and the number of binary bits occupied by the compression lookup table.
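The state-shift-lookup loop can be illustrated with a small table-based entropy coder of the same family (tANS). This is a simplified sketch, not the FSE library code: symbols are laid out contiguously in the table rather than spread, the bit stream is a plain Python list, the input is encoded back to front so the decoder emits symbols front to back, and all function names are hypothetical.

```python
def build_tables(norm: dict[int, int], table_log: int):
    """Build encode/decode tables from normalized counts (all counts > 0)."""
    size = 1 << table_log
    assert sum(norm.values()) == size
    slots = [s for s in sorted(norm) for _ in range(norm[s])]
    next_x = dict(norm)                      # sub-states run from f to 2f-1
    enc = {s: [0] * f for s, f in norm.items()}
    dec = []
    for i, s in enumerate(slots):
        x = next_x[s]
        next_x[s] += 1
        enc[s][x - norm[s]] = size + i       # next state for (symbol, sub-state)
        dec.append((s, x))
    return enc, dec


def encode(data: bytes, norm: dict[int, int], table_log: int):
    enc, _ = build_tables(norm, table_log)
    state, bits = 1 << table_log, []
    for s in reversed(data):
        f, nb = norm[s], 0
        while (state >> nb) >= 2 * f:        # shift until state fits the symbol
            nb += 1
        bits.extend((state >> i) & 1 for i in range(nb))  # emit shifted-out bits
        state = enc[s][(state >> nb) - f]    # table lookup gives the next state
    return state, bits                       # final state is part of the output


def decode(state: int, bits: list[int], norm: dict[int, int],
           table_log: int, n: int) -> bytes:
    _, dec = build_tables(norm, table_log)
    bits, out = list(bits), bytearray()
    for _ in range(n):
        s, x = dec[state - (1 << table_log)]
        out.append(s)
        nb = table_log + 1 - x.bit_length()  # bits the encoder shifted out
        value = 0
        for i in reversed(range(nb)):
            value |= bits.pop() << i
        state = (x << nb) | value
    return bytes(out)
```

A round trip with hand-chosen normalized counts for `b"abracadabra"` (a: 7, b: 3, r: 3, c: 2, d: 1, tableLog 4) recovers the input, and no symbol costs more than tableLog bits.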
Compressing the first compression result a second time with FSE compression further improves the compression ratio. The effectiveness of the data compression method in this example is compared with other compression algorithms below.
In practical application, the operating environment may be "Core i5 CPU 2.40 GHz, 8 GB memory, macOS 10.15.7 operating system". Test data sets may be constructed in advance, including Electricity Consumption, Traffic Usage, Solar Energy, and Exchange Rate. The Electricity Consumption data records electricity usage in kWh every 15 minutes from 2011 to 2014; the data used in this example covers the power consumption of 321 clients from 2012 to 2014, converted to hourly consumption. The Traffic Usage data consists of 48 months (2015-2016) of hourly data from the California Department of Transportation, describing road occupancy (between 0 and 1) measured by different sensors on highways in the San Francisco Bay Area. Solar Energy contains records of solar energy production in 2006, collected every 10 minutes from 137 photovoltaic plants in Alabama. Exchange Rate contains the daily exchange rates of 8 countries (Australia, Britain, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016. The compression performance is shown in Table 2, and the time taken to compress the data is shown in Table 3, where SA-LZFSE denotes the compression mode with superimposed FSE coding used in this example:
Figure BDA0002989005650000101
TABLE 2
Figure BDA0002989005650000102
TABLE 3
As can be seen from Table 2, SA-LZFSE achieves a better compression effect than a single coding mode. Dictionary coding compresses the redundant information, and entropy coding then further compresses the output of the dictionary coding to obtain a better compression ratio, so the algorithm is particularly effective on gigabyte-scale and terabyte-scale data.
A fourth embodiment of the present invention relates to a method of data compression. This embodiment is substantially the same as the third embodiment; the main difference is that, in this embodiment, the original data is acquired first, FSE compression is performed on the original data to obtain a second compression result, and the second compression result is used as the initial data for the dictionary coding. The flow is shown in fig. 4:
Step 401: acquiring raw data as compressed input data.
Specifically, the raw data may include: aviation data or traffic data.
Step 402: encoding the compressed input data using finite state entropy coding to generate a second compression result.
This step is substantially the same as step 305 in the third embodiment, and will not be described herein.
Step 403: taking the second compression result as initial data.
Specifically, the second compression result is a binary stream. If the original data includes a plurality of pieces of data of different lengths, step 403 yields a plurality of pieces of initial data.
Step 404: acquiring at least two categories of initial data, wherein every piece of data to be compressed within a category has the same length.
Step 405: adjusting the length of the scanning window in the dictionary coding, according to any piece of data to be compressed in each category of initial data, until the length of the scanning window matches the current category.
Step 406: compressing the initial data of the current category by dictionary coding according to the length of the matched scanning window to obtain a first compression result of the initial data.
In this embodiment, the original data is first compressed by FSE coding to obtain a second compression result, which serves as the initial data; the initial data is then compressed again by dictionary coding, which further improves the compression ratio and the compression speed.
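The grouping in step 404 can be sketched as follows, under the assumption that a category is defined solely by record length, which is the only property the step states. The function name `group_by_length` is hypothetical; how the per-category scanning window is then tuned is not shown.

```python
from collections import defaultdict


def group_by_length(records: list[bytes]) -> dict[int, list[bytes]]:
    """Group records so that every record in a category has the same length.

    Assumption: the category key is simply the record length in bytes.
    """
    groups: dict[int, list[bytes]] = defaultdict(list)
    for record in records:
        groups[len(record)].append(record)
    return dict(groups)
```

Each resulting group can then be handed to the dictionary coding stage with its own scanning window length.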
A fifth embodiment of the present invention relates to an electronic apparatus, a block diagram of which is shown in fig. 5, and which includes: at least one processor 501; and a memory 502 communicatively coupled to the at least one processor 501; the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 to enable the at least one processor 501 to perform the above-mentioned data compression method.
The memory 502 and the processor 501 are connected by a bus, which may include any number of interconnected buses and bridges that electrically link one or more of the processors 501 and the memory 502. The bus may also link various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 501 is transmitted over a wireless medium through an antenna, which also receives data and passes it to the processor 501.
The processor 501 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 502 may be used to store data used by the processor 501 in performing operations.
A sixth embodiment of the present invention relates to a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of data compression described above.
Those skilled in the art can understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method of data compression, comprising:
acquiring initial data of at least two categories, wherein the length of each data to be compressed in each category of the initial data is the same;
according to any piece of data to be compressed in each type of initial data, adjusting the length of a scanning window in dictionary coding until the length of the scanning window is matched with the current type;
and compressing the initial data of the current category in the dictionary coding mode according to the length of the matched scanning window to obtain a first compression result of the initial data.
2. The method of claim 1, wherein after the initial data of the current category is compressed by the dictionary coding according to the length of the matched scanning window to obtain a first compression result of the initial data, the method further comprises:
taking the first compression result as compression input data;
and coding the compressed input data by adopting a finite state entropy coding mode to generate a second compression result.
3. The method of data compression as claimed in claim 1, wherein prior to said obtaining at least two categories of initial data, the method further comprises:
acquiring original data as compressed input data;
coding the compressed input data by adopting a finite state entropy coding mode to generate a second compression result;
and taking the second compression result as the initial data.
4. The method according to any one of claims 1 to 3, wherein the adjusting the length of the scanning window in dictionary coding until the length of the scanning window matches the current class according to any one of the initial data to be compressed comprises:
acquiring data positioned in a forward cache region from the data to be compressed as query data;
acquiring data positioned in a scanning window from a coding library as coded data;
judging whether data matched with the query data exists in the coded data or not, if the matched data does not exist, adjusting the length of the scanning window, and taking the adjusted scanning window as a scanning window for obtaining the coded data next time until the data to be compressed are matched; if the matched data exists, recording the position of the matched coded data, and taking the current scanning window as the scanning window for acquiring the coded data next time;
and after the data to be compressed are matched, determining the length of the scanning window matched with the current category according to the length of the corresponding scanning window when the data to be compressed are matched.
5. The method of claim 4, wherein if there is no matching data, the method further comprises:
storing the query data into the coding library;
and acquiring the storage address of the query data, and taking the storage address as a reference address for acquiring the encoded data next time.
6. The method according to claim 5, wherein the determining the length of the scanning window matched with the current category according to the length of the corresponding scanning window when the data to be compressed is matched comprises:
and judging whether the length of the corresponding scanning window when the data to be compressed are matched exceeds a preset length, and if so, taking the preset length as the length of the scanning window matched with the current category, wherein the preset length is the number of bits of the processor.
7. The method of data compression as claimed in claim 5, wherein the dictionary coding is LZ4 coding; the storing the query data into the coding library includes:
carrying out hash processing on the query data to obtain a hash value;
and storing the hash value and the address corresponding to the hash value in the hash table.
8. The method of data compression according to claim 1, wherein the initial data comprises: aviation data or traffic data.
9. An electronic device, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of data compression as claimed in any one of claims 1 to 8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of data compression according to any one of claims 1 to 8.
CN202110309332.8A 2021-03-23 2021-03-23 Data compression method, electronic device and storage medium Active CN112953550B (en)

Publications (2)

Publication Number Publication Date
CN112953550A true CN112953550A (en) 2021-06-11
CN112953550B CN112953550B (en) 2023-01-31

Family

ID=76227984

Country Status (1)

Country Link
CN (1) CN112953550B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant