CN113659992B - Data compression method and device and storage medium - Google Patents


Info

Publication number
CN113659992B
Authority
CN
China
Prior art keywords
data block
compressed
compression
data
computing resources
Prior art date
Legal status
Active
Application number
CN202110815981.5A
Other languages
Chinese (zh)
Other versions
CN113659992A (en)
Inventor
白志得
哈米德
白智德
黄坤
殷燕
Current Assignee
Shenzhen Zhihuilin Network Technology Co ltd
Original Assignee
Shenzhen Zhihuilin Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhihuilin Network Technology Co ltd
Priority to CN202110815981.5A
Publication of CN113659992A
Application granted
Publication of CN113659992B

Classifications

    • H — ELECTRICITY
    • H03 — ELECTRONIC CIRCUITRY
    • H03M — CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 — Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 — Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses a data compression method and apparatus and a storage medium. The method comprises the following steps: determining the usage of computing resources during compression of a given data block; and determining, from that usage, the computing resources required for a data block to be compressed, where the data block to be compressed is similar to the given data block. With the scheme of the application, compression performance is set by determining the computing resources required during compression, which improves compression performance in the data compression process.

Description

Data compression method and device and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data compression method and apparatus, and a storage medium.
Background
Currently existing compression techniques use algorithms derived from traditional information theory. As a result, compression, and lossless compression in particular, amounts to finding and removing redundant data in a file. Conventional compression algorithms, even those that employ AI and ML, focus on redundancy: the more redundancy is found, the better the compression ratio.
For example, the Huffman and run-length algorithms look for pure redundancy: they notice a block of data (e.g., a feature of some text) and then search a larger block for duplicates of it that are as large and as exact as possible. These algorithms perform well up to a point, but their main problem is that they have reached a compression bottleneck: none of the redundancy-based algorithms can find new ways to generate redundancy.
Existing approaches are based on removing or reducing the redundancy present in selected blocks of data. Besides focusing on the redundancy that already exists rather than creating more, the problem with conventional compression algorithms is that each considers data blocks of fixed or variable size, or only the large number of data blocks contained in a single file. Moreover, most conventional compression algorithms check for redundancy only in small data blocks whose sizes are powers of 2 (i.e., 4, 8, 16, 32, 64, 128, 256 bytes).
Relying solely on the redundancy already present in small blocks of data limits the performance of these conventional compression algorithms.
Disclosure of Invention
The application provides a data compression method, a data compression device and a storage medium, so as to improve compression performance in a data compression process.
In a first aspect, a data compression method is provided, the method comprising:
determining the use condition of computing resources in the compression process of a given data block;
and determining the computing resources required by the data block to be compressed according to the use condition of the computing resources in the compression process of the given data block, wherein the data block to be compressed has similarity with the given data block.
In one possible implementation, the usage of the computing resources in the given data block compression process includes at least one of:
computing resources used for the last compressed file;
computing resources applied to the entire file that has been compressed.
In another possible implementation, the determining the usage of the computing resource in the compression process of the given data block includes:
reading the performance of elements while data is being compressed, so as to measure the usage of computing resources during compression of said given data block.
In yet another possible implementation, the method further includes:
the performance of the compression algorithm is evaluated using the corresponding performance labels based on the artificial intelligence driven algorithm.
In yet another possible implementation, the method further includes:
reading a data block to be compressed with a set size;
analyzing a likelihood of adding redundancy in the data block to be compressed;
determining an index number of a function generating redundant data in the data block to be compressed;
and generating redundant data in the data block to be compressed by adopting a function corresponding to the index number.
In yet another possible implementation, the analyzing of the likelihood of adding redundancy in the data block to be compressed includes:
and analyzing the possibility of adding redundancy in the data block to be compressed according to the data type of the data block to be compressed.
In yet another possible implementation, the method further includes:
generating a heat map, wherein the heat map comprises high-value numbers with m-bit length redundant in the data block to be compressed, and m is a positive integer.
In yet another possible implementation, the method further includes:
storing the redundant data in the data block to be compressed.
In a second aspect, there is provided a data compression apparatus, the apparatus comprising:
a first determining unit, configured to determine a usage of a computing resource in a given data block compression process;
and the second determining unit is used for determining the computing resources required by the data block to be compressed according to the use condition of the computing resources in the compression process of the given data block, wherein the data block to be compressed has similarity with the given data block.
In one possible implementation, the usage of the computing resources in the given data block compression process includes at least one of:
computing resources used for the last compressed file;
computing resources applied to the entire file that has been compressed.
In another possible implementation, the first determining unit is configured to read performance of the element when compressing data, so as to measure usage of computing resources in the compression process of the given data block.
In yet another possible implementation, the apparatus further includes:
and the evaluation unit is used for evaluating the performance of the compression algorithm by using the corresponding performance label based on the algorithm driven by the artificial intelligence.
In yet another possible implementation, the apparatus further includes:
the reading unit is used for reading the data block to be compressed with the set size;
an analysis unit for analyzing a possibility of redundancy increase in the data block to be compressed;
a third determining unit configured to determine an index number of a function generating redundant data in the data block to be compressed;
and the first generation unit is used for generating redundant data in the data block to be compressed by adopting a function corresponding to the index number.
In a further possible implementation, the analysis unit is configured to analyze a possibility of adding redundancy in the data block to be compressed according to a data type of the data block to be compressed.
In yet another possible implementation, the apparatus further includes:
and a second generation unit, configured to generate a heat map, where the heat map includes high-value numbers with m bit lengths that are redundant in the data block to be compressed, and m is a positive integer.
In yet another possible implementation, the apparatus further includes:
and the storage unit is used for storing the redundant data in the data block to be compressed.
In a third aspect, there is provided a data compression apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as in the first aspect or any one of the first aspects when executing the computer program.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as in the first aspect or any one of the first aspects.
The data compression scheme of the application has the following beneficial effects:
the compression performance is determined by determining the computing resources required in the compression process, so that the compression performance in the data compression process is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a data compression method according to an embodiment of the present application;
FIG. 2 is a flow chart of another data compression method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a data compression device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of another data compression device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The present application uses the data records in a set of training model inventories (TMI) to compress input data.
The present application uses both inventive and conventional methods, receiving manipulated data for compression.
The algorithm of the present application uses the inventive and conventional methods to compress data received from other algorithms, data that has already been made more compressible. For this concept, we call the algorithm the "compression pit" (CP) algorithm.
The CP includes the following parts:
1-traditional compression method
2-inventive compression method
Wherein, 1-traditional compression method
The conventional compression method used in the present algorithm includes:
(1) Huffman coding:
Huffman coding is a particular type of optimal prefix code commonly used for lossless data compression. Such a code is found or used by Huffman's algorithm, developed by David A. Huffman.
The output of Huffman's algorithm can be viewed as a variable-length code table for encoding source symbols (e.g., characters in a file).
The algorithm derives this table from the estimated probability or frequency of occurrence (the weight) of each possible value of the source symbol. As with other entropy coding methods, more common symbols are generally represented with fewer bits than less common symbols.
The Huffman method can be implemented efficiently: if the weights are sorted, a code can be found in time linear in the number of input weights.
Given a set of symbols and their weights (usually proportional to probabilities), the algorithm finds a prefix-free binary code (a set of codewords) with the smallest expected codeword length, equivalently, a tree with the smallest weighted path length from the root.
The input is an alphabet of symbols together with a tuple of (positive) symbol weights, typically proportional to probabilities.
The output is a code: a tuple of (binary) codewords.
The goal is to minimize the weighted path length of the code, under the condition that it be no greater than that of any other valid code.
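The procedure above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the patent: it builds the tree by repeatedly merging the two lowest-weight subtrees, then reads the codewords off the tree.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix-free code table from symbol weights (frequencies)."""
    # Heap entries are (weight, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(w, i, s) for i, (s, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    if tiebreak == 1:               # degenerate one-symbol alphabet
        return {heap[0][2]: "0"}
    while len(heap) > 1:            # merge the two lightest subtrees
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):         # read codewords off the tree
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

table = huffman_code("this is an example of a huffman tree")
```

As expected for an entropy code, the most frequent symbol (here the space) receives one of the shortest codewords, and no codeword is a prefix of another.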
(2) Run-length encoding (RLE):
run-length encoding is a form of lossless data compression in which the running data (a sequence of identical data values occurring in multiple consecutive data elements) is stored as a single data value and count, rather than the original running. This is most useful for data containing many such runs.
For example, consider a screen that contains solid black text on a solid white background. There will be many white pixels in long-term white space and many short-term black pixels in text. Assuming that one scan line, B represents a black pixel and W represents a white pixel, it can be read as follows:
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
Applying the run-length encoding (RLE) data compression algorithm to this hypothetical scan line yields:
12W1B12W3B24W1B14W
This can be interpreted as a sequence of twelve W's, one B, twelve W's, three B's, and so on.
Run-length encoding can be expressed in many ways to suit data properties and other compression algorithms. For example, one popular method encodes run lengths only for runs of two or more characters, either using an "escape" symbol to identify a run or using the character itself as the escape, so that any character appearing twice indicates a run.
In the above example, this should be expressed as:
WW12BWW12BB3WW24BWW14
This can be interpreted as twelve W's, one B, twelve W's, three B's, and so on. In data where runs are less frequent, this variant can noticeably improve the compression ratio.
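As a minimal sketch (not the patent's code), the count-plus-symbol form of RLE used in the example above can be implemented as:

```python
def rle_encode(data):
    """Encode each run as <count><symbol>, e.g. 'WWW' -> '3W'."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                       # extend the current run
        out.append(f"{j - i}{data[i]}")  # emit count, then symbol
        i = j
    return "".join(out)

# The scan line from the example: 12 W, 1 B, 12 W, 3 B, 24 W, 1 B, 14 W.
scanline = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14
encoded = rle_encode(scanline)           # "12W1B12W3B24W1B14W"
```

The 67-character line compresses to 18 characters; on data with few runs the same scheme can expand the input, which is why the escape-based variant described above exists.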
(3) Prediction by partial matching (PPM):
Prediction by partial matching is an adaptive statistical data compression technique based on context modeling and prediction. A PPM model uses a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream.
In cluster analysis, the PPM algorithm may also be used to cluster data into predicted packets.
The predictions are usually reduced to symbol rankings. Each symbol (a letter, a bit, or any other unit of data) is ranked before being compressed, and the ranking system determines the corresponding codeword (and therefore the compression rate).
In many compression algorithms, this ranking is equivalent to estimating a probability mass function: each symbol is assigned a probability given the preceding symbols (the context).
For example, in arithmetic coding, symbols are ranked by their probability of appearing after the preceding symbols, and the entire sequence is compressed into a single fraction computed from these probabilities.
The number n of preceding symbols determines the order of the PPM model, denoted PPM(n). Unbounded variants with no length limit on the context also exist, denoted PPM*.
If no prediction can be made from all n context symbols, a prediction is attempted using n−1 symbols. This process repeats until a match is found or no symbols remain in the context, at which point a fixed prediction is made.
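The fallback from longer to shorter contexts can be sketched as follows. This toy model is an illustration, not the patent's algorithm: it returns only the most frequent symbol per context rather than full probability estimates for a coder.

```python
class PPMModel:
    """Toy PPM-style predictor: try the longest context first, then
    fall back to shorter contexts (order n, n-1, ..., 0)."""

    def __init__(self, order):
        self.order = order
        self.counts = {}            # context string -> {symbol: count}

    def _contexts(self, history):
        for k in range(min(self.order, len(history)), -1, -1):
            yield history[len(history) - k:] if k else ""

    def update(self, history, symbol):
        for ctx in self._contexts(history):
            self.counts.setdefault(ctx, {})
            self.counts[ctx][symbol] = self.counts[ctx].get(symbol, 0) + 1

    def predict(self, history):
        # Fall back from the longest recorded context to the empty one.
        for ctx in self._contexts(history):
            freqs = self.counts.get(ctx)
            if freqs:
                return max(freqs, key=freqs.get)
        return None                 # no statistics yet: fixed prediction

model = PPMModel(order=2)
text = "abracadabra"
for i, ch in enumerate(text):
    model.update(text[:i], ch)
```

After training on "abracadabra", the context "ab" predicts "r" (an order-2 match), while an unseen context such as "zz" falls back to the order-0 statistics and predicts the most common symbol overall.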
(4) Cryptographic system:
in cryptography, a cryptographic system is a set of cryptographic algorithms required to implement a particular security service, most often to achieve confidentiality (encryption).
Typically, a cryptographic system consists of three algorithms: one for key generation, one for encryption, and one for decryption. The term cipher (sometimes "cypher") is often used to refer to the pair of algorithms, one for encryption and one for decryption.
The term cryptosystem is therefore most often used when the key-generation algorithm is important. For this reason, "cryptosystem" commonly refers to public-key techniques; however, both "cipher" and "cryptosystem" are used for symmetric-key techniques.
(5) Caesar cipher:
In cryptography, the Caesar cipher, also known as Caesar's code, the shift cipher, or the Caesar shift, is one of the simplest and most widely known encryption techniques.
It is a substitution cipher in which each letter of the plaintext is replaced by the letter a fixed number of positions away. For example, with a left shift of 3, D would be replaced by A, E would become B, and so on. The method is named after Julius Caesar, who used it in his private correspondence.
The encryption step performed by the Caesar cipher is often part of a more complex scheme, such as the Vigenère cipher, and it still has a modern application in the ROT13 system.
The transformation can be represented by aligning two alphabets; the cipher alphabet is the plain alphabet rotated left or right by a certain number of positions.
For example, here is a Caesar cipher using a left rotation of three places, equivalent to a right shift of 23 (the shift parameter is used as the key):
Plain alphabet: ABCDEFGHIJKLMNOPQRSTUVWXYZ;
Cipher alphabet: XYZABCDEFGHIJKLMNOPQRSTUVW.
When encrypting, each letter of the message is looked up in the plain alphabet and the corresponding cipher letter is written down.
Plain text: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG;
cipher text: QEB NRFZH YOLTK CLU GRJMP LSBO QEB IXWV ALD.
Decryption is the reverse, a right shift of 3.
Encryption can also be expressed in modular arithmetic by first converting the letters to numbers, with A→0, B→1, …, Z→25.
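The modular-arithmetic formulation can be sketched directly; this illustrative function reproduces the example above (a left shift of 3 is passed as −3):

```python
def caesar(text, shift):
    """Caesar cipher in modular arithmetic: A->0 ... Z->25 and
    E_k(x) = (x + k) mod 26; decryption uses the negated shift."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)          # spaces and punctuation pass through
    return "".join(out)

# A left shift of 3 (the example's key) equals a right shift of 23.
cipher = caesar("THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG", -3)
```

Applying `caesar(cipher, 3)` recovers the original plaintext, matching the worked example in the text.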
The present application includes a simpler but more efficient compression method than conventional compression methods.
The application provides a data compression scheme, which determines the compression performance by determining the calculation resources required in the compression process, thereby improving the compression performance in the data compression process.
As shown in fig. 1, a flow chart of a data compression method according to an embodiment of the present application is shown, and the method includes the following steps:
s101, determining the use condition of computing resources in the compression process of a given data block.
S102, determining the computing resources required by the data block to be compressed according to the use condition of the computing resources in the compression process of the given data block, wherein the data block to be compressed has similarity with the given data block.
In data compression, the compression ratio is not the only thing that matters: overall compression performance is typically determined by the combination of the achieved ratio and the amount of computing resources used to compress and decompress the data.
A computing resources enhancer (CRE) determines, based on the results previously collected by the TMI, the computing resources suited to each data block to be compressed.
To provide a high-performance CRE and better management of computing resources during compression and decompression, the data compression in the algorithm not only produces compressed data but also leaves traces in the TMI. These traces are used to evaluate how much computing resource was used for a particular set of previously compressed data blocks that are similar to a given input data block, and thus to determine how to allocate computing resources for any input data.
Traces left in the TMI (the usage of computing resources in a given data block compression process) include:
A) Measuring how much computing power and how many computing resources were used for: 1 — the last compressed file; 2 — the whole file previously compressed. The measurement is done by reading the performance of elements while the data is compressed. Because the decompression process in our compression technique typically uses nearly the same amount of computing resources, these measurements can also be widely used to estimate analytically the resources needed to decompress the data blocks.
B) Performance tags for a variety of compression algorithms.
These performance labels show how well each of the multiple compression algorithms used computing resources when processing an incoming data block. As more and more data is compressed, the performance of the compression algorithms is evaluated, and AI-driven ML algorithms use these labels to adjust the utilization of each algorithm and/or combination of algorithms, extending compression (and decompression) performance.
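A measurement of the kind described in A) can be sketched as follows. The metrics, the zlib library, and the sample block are illustrative choices, not the patent's implementation:

```python
import time
import tracemalloc
import zlib

def measure_compression(block, level=6):
    """Compress one block and record the computing resources used --
    the kind of trace the text says is left in the TMI."""
    tracemalloc.start()
    t0 = time.perf_counter()
    compressed = zlib.compress(block, level)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()   # peak allocated bytes
    tracemalloc.stop()
    return {"ratio": len(compressed) / len(block),
            "seconds": elapsed,
            "peak_bytes": peak}

trace = measure_compression(b"ABABABAB" * 4096)
```

Because (per the text) decompression uses nearly the same resources, a record like `trace` for one block can stand in for both directions when budgeting resources for similar future blocks.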
According to the data compression method provided by the embodiment of the application, the compression performance is determined by determining the calculation resources required in the compression process, so that the compression performance in the data compression process is improved.
As shown in fig. 2, a flow chart of another data compression method according to an embodiment of the present application is shown, and the method includes the following steps:
s201, reading element performance when compressed data to measure the use condition of computing resources in the given data block compression process.
S202, determining the computing resources required by the data block to be compressed according to the use condition of the computing resources in the compression process of the given data block, wherein the data block to be compressed has similarity with the given data block.
The CRE determines the computational resources to accommodate the utilization of each data block to be compressed based on the results previously collected by the TMI.
To provide a high-performance CRE and better management of computing resources during compression and decompression, the data compression in the algorithm not only produces compressed data but also leaves traces in the TMI. These traces are used to evaluate how much computing resource was used for a particular set of previously compressed data blocks that are similar to a given input data block, and thus to determine how to allocate computing resources for any input data.
Traces left in the TMI (the usage of computing resources in a given data block compression process) include:
A) Measuring how much computing power and how many computing resources were used for: 1 — the last compressed file; 2 — the whole file previously compressed. The measurement is done by reading the performance of elements while the data is compressed. Because the decompression process in our compression technique typically uses nearly the same amount of computing resources, these measurements can also be widely used to estimate analytically the resources needed to decompress the data blocks.
B) Performance tags for a variety of compression algorithms.
S203, reading the data block to be compressed with the set size.
S204, analyzing the likelihood of adding redundancy in the data block to be compressed.
S205, determining an index number of a function for generating redundant data in the data block to be compressed.
S206, generating redundant data in the data block to be compressed by adopting a function corresponding to the index number.
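Steps S205-S206 can be illustrated with a hypothetical registry of redundancy-generating functions keyed by index number. The patent does not specify the functions themselves; byte-wise delta coding is used here purely as a plausible example of a transform that creates redundancy for a later compression stage:

```python
import zlib

def delta_encode(block):
    """Byte-wise deltas turn smoothly varying data into long runs of
    small values, i.e. they create redundancy for a later stage."""
    return bytes([block[0]]) + bytes(
        (block[i] - block[i - 1]) % 256 for i in range(1, len(block)))

# Hypothetical index-number -> redundancy-generating function (step S205).
TRANSFORMS = {0: lambda b: b,        # identity: no redundancy added
              1: delta_encode}

def compress_with_index(block, index):
    """Apply the indexed transform (step S206), then compress."""
    return zlib.compress(TRANSFORMS[index](block))

ramp = bytes(i % 256 for i in range(4096))       # smoothly varying block
plain_size = len(compress_with_index(ramp, 0))
delta_size = len(compress_with_index(ramp, 1))   # delta first: smaller
```

On this ramp-shaped block, the delta transform rewrites the data as almost all 1's, so the subsequent compressor does markedly better than on the raw block; the analyzer of step S204 would pick index 1 for data of this type.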
S207, generating a heat map, wherein the heat map comprises high-value numbers with m bit lengths redundant in the data block to be compressed, and m is a positive integer.
Specifically, a set number of data blocks may be compressed, and a heat map generated, by the bulk same-type data analyzer (BSTDA). The BSTDA is an algorithm that applies the RGA and PPA not to a particular piece of data but to a large number of data blocks, where each of those data blocks belongs to an independent or non-independent file.
Unlike traditional compression techniques, in which each compression operation is performed on one file, the BSTDA studies, analyzes, and trains on a large number of files of the same specific form.
The BSTDA is most useful when dealing with big data and large-scale data compression, where it can greatly improve compression efficiency.
The data from the BSTDA indicates that files of the same type share the same compression parameters.
This portion of the stored data has the following characteristics:
a) The data type, i.e. data from Bitmap (BMP) files.
B) Index, i.e., index data/value in the form of each file.
C) A heat map, which is in fact a map similar to a heat map, showing the concentration of given values (binary or hexadecimal) normally distributed in files of the same file format. For example, the BSTDA can detect high-value digits of n-bit length at the beginning of most .mp4 files (excluding their headers). This generates a heat map in which higher-value numbers are denser at the beginning of the data block.
D) Data storage: if necessary, the input data is stored in actual data blocks. The viability of storing actual data blocks is determined by an artificial intelligence algorithm that scans the data previously recorded in the BSTDA section to see whether an incoming actual data block would add to the BSTDA inventory.
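The value-concentration heat map described in item C) can be sketched as a density count over window positions across many same-format blocks. The window size, the 0x80 threshold for "high-value" bytes, and the sample blocks are all assumptions for illustration:

```python
def value_heat_map(blocks, window=64):
    """Density of 'high-value' bytes (>= 0x80) per window position,
    accumulated over many blocks of the same format."""
    length = min(len(b) for b in blocks)
    bins = length // window
    heat = [0] * bins
    for b in blocks:
        for k in range(bins):
            heat[k] += sum(x >= 0x80 for x in b[k * window:(k + 1) * window])
    total = len(blocks) * window          # max possible count per bin
    return [round(h / total, 3) for h in heat]

# Blocks whose high values concentrate at the start (like the .mp4
# example) produce a map that is hottest in the first bin.
blocks = [bytes([0xF0] * 64 + [0x10] * 192) for _ in range(8)]
heat = value_heat_map(blocks)             # [1.0, 0.0, 0.0, 0.0]
```

A downstream stage could use such a map to decide, per window position, which transform or compressor to apply to blocks of that file format.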
The data received by the BSTDA data compressor is in fact data collected in a large number of data blocks contained in a specific file group, i.e. file format.
The collected data represents similarities in types between files, such as MP4s.
What the BSTDA data compressor does is apply the traditional methods and artificial-intelligence ML algorithms to the received data, helping the traditional algorithms achieve better compression ratios.
This is achieved by using the TMI as the core of an AI-driven record generator that helps compress the input data in an increasingly refined manner.
S208, the algorithm based on artificial intelligence driving evaluates the performance of the compression algorithm by using the corresponding performance label.
The performance labels described above show how well each of the multiple compression algorithms used computing resources when processing an incoming data block. As more and more data is compressed, the performance of the compression algorithms is evaluated, and AI-driven ML algorithms use these labels to adjust the utilization of each algorithm and/or combination of algorithms, extending compression (and decompression) performance.
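Such per-algorithm performance labels can be sketched by timing several real codecs on a block and then selecting among them. The candidate set, the time budget, and the "best ratio within budget" policy are illustrative stand-ins for the AI-driven selector described above:

```python
import bz2
import lzma
import time
import zlib

CANDIDATES = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def label_algorithms(block):
    """Attach a performance label (ratio, time) to each candidate
    algorithm for this block."""
    labels = {}
    for name, compress in CANDIDATES.items():
        t0 = time.perf_counter()
        out = compress(block)
        labels[name] = {"ratio": len(out) / len(block),
                        "seconds": time.perf_counter() - t0}
    return labels

def pick_best(labels, time_budget=1.0):
    """Stand-in for the AI-driven selector: best ratio within budget."""
    within = {n: l for n, l in labels.items() if l["seconds"] <= time_budget}
    if not within:                  # nothing met the budget: ignore it
        within = labels
    return min(within, key=lambda n: within[n]["ratio"])

labels = label_algorithms(b"the quick brown fox jumps over the lazy dog " * 500)
best = pick_best(labels)
```

Accumulating such labels over many similar blocks is what would let a learned policy route each new block to the algorithm (or combination) that historically performed best on its type.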
According to the data compression method provided by the embodiment of the application, the compression performance is determined by determining the calculation resources required in the compression process, so that the compression performance in the data compression process is improved;
and compression efficiency is improved by compressing a large amount of data or big data.
It will be appreciated that, to implement the functions of the above embodiments, the data compression apparatus includes corresponding hardware structures and/or software modules that perform the respective functions. Those skilled in the art will readily appreciate that the various illustrative units and method steps described in connection with the embodiments disclosed herein may be implemented as hardware or as combinations of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends on the particular application scenario and the design constraints imposed on the solution.
As shown in fig. 3, a schematic structural diagram of a data compression device according to the present application is provided, and the device 300 may include:
a first determining unit 31 for determining the use of computing resources in a given data block compression process;
a second determining unit 32, configured to determine, according to the usage of computing resources in the compression process of the given data block, computing resources required for the data block to be compressed, where the data block to be compressed has a similarity with the given data block.
In one possible implementation, the usage of the computing resources in the given data block compression process includes at least one of:
computing resources used for the last compressed file;
computing resources applied to the entire file that has been compressed.
In another possible implementation, the first determining unit 31 is configured to read performance of elements when compressing data, so as to measure usage of computing resources in the compression process of the given data block.
In yet another possible implementation, the apparatus further includes:
an evaluation unit 38 for evaluating the performance of the compression algorithm using the corresponding performance tags based on the artificial intelligence driven algorithm.
In yet another possible implementation, the apparatus further includes:
a reading unit 33 for reading a data block to be compressed of a set size;
an analysis unit 34 for analyzing a possibility of adding redundancy in the data block to be compressed;
a third determining unit 35 for determining an index number of a function generating redundant data in the data block to be compressed;
a first generating unit 36, configured to generate redundant data in the data block to be compressed by using a function corresponding to the index number.
In a further possible implementation, the analysis unit 34 is configured to analyze a possibility of adding redundancy in the data block to be compressed according to a data type of the data block to be compressed.
In yet another possible implementation, the apparatus further includes:
a second generating unit 37, configured to generate a heat map, where the heat map contains the high-value numbers, m bits long, that are redundant in the data block to be compressed, m being a positive integer.
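One plausible reading of this heat map is a frequency table of the m-bit values in the block whose numeric value is high; a sketch under that assumption is below. The high-value threshold, the big-endian grouping, and the restriction to byte-aligned m are all illustrative choices not stated in the patent.

```python
from collections import Counter

def heat_map(block: bytes, m: int = 8, threshold: int = 0x80) -> dict:
    """Build a simple 'heat map': how often each high (>= threshold)
    m-bit value occurs in the block. With m = 8 the units are bytes."""
    assert m % 8 == 0, "this sketch only supports byte-aligned m"
    step = m // 8
    counts = Counter(
        int.from_bytes(block[i:i + step], "big")
        for i in range(0, len(block) - step + 1, step)
    )
    # Keep only the high-value entries; these mark where redundancy
    # generation is most profitable.
    return {value: n for value, n in counts.items() if value >= threshold}

hm = heat_map(b"\x00\x01\xff\xff\x90\x00", m=8)
```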
In yet another possible implementation, the apparatus further includes (not shown in the figure):
and the storage unit is used for storing the redundant data in the data block to be compressed.
For specific implementation of the above units, reference may be made to the related description of the method flow shown in fig. 1 or fig. 2, which is not repeated here.
The reading unit 33, the analyzing unit 34, the third determining unit 35, the first generating unit 36, the second generating unit 37 and the evaluating unit 38 are optional units, shown in the figure with dashed outlines and connections.
It should be noted that one or more of the above units may be implemented in software, hardware, or a combination of both. When any of the above units is implemented in software, the software exists in the form of computer program instructions stored in a memory, and a processor may execute those program instructions to implement the above method flows. The processor may be built into a system on chip (SoC) or an ASIC, or may be a separate semiconductor chip. In addition to cores for executing software instructions, the processor may further include necessary hardware accelerators, such as field programmable gate arrays (field programmable gate array, FPGA), programmable logic devices (programmable logic device, PLD), or logic circuits implementing dedicated logic operations.
When the above units are implemented in hardware, the hardware may be any one or any combination of a CPU, a microprocessor, a digital signal processing (digital signal processing, DSP) chip, a micro control unit (microcontroller unit, MCU), an artificial intelligence processor, an ASIC, an SoC, an FPGA, a PLD, dedicated digital circuitry, a hardware accelerator, or non-integrated discrete devices, which may run the necessary software, or perform the above method flows without software.
According to the data compression device provided by the embodiments of the application, the computing resources required in the compression process are determined in advance, so that the compression performance of the data compression process is improved.
As shown in fig. 4, another data compression device according to the present application is shown, and the device 400 may include:
input device 41, output device 42, memory 43, and processor 44 (the number of processors 44 in the device may be one or more, one processor being an example in fig. 4). In some embodiments of the present application, the input device 41, the output device 42, the memory 43, and the processor 44 may be connected by a bus or other means, where a bus connection is exemplified in fig. 4.
Wherein the processor 44 is configured to perform the steps of:
determining the use condition of computing resources in the compression process of a given data block;
and determining the computing resources required by the data block to be compressed according to the use condition of the computing resources in the compression process of the given data block, wherein the data block to be compressed has similarity with the given data block.
In one possible implementation, the usage of the computing resources in the given data block compression process includes at least one of:
computing resources used for the last compressed file;
computing resources applied to the entire file that has been compressed.
In another possible implementation, the step, performed by the processor 44, of determining the usage of computing resources in a given data block compression process includes:
reading element performance while data is being compressed, so as to measure the usage of computing resources in the compression process of said given data block.
In yet another possible implementation, the processor 44 is further configured to perform the steps of:
the performance of the compression algorithm is evaluated using the corresponding performance labels based on the artificial intelligence driven algorithm.
In yet another possible implementation, the processor 44 is further configured to perform the steps of:
reading a data block to be compressed with a set size;
analyzing a likelihood of adding redundancy in the data block to be compressed;
determining an index number of a function generating redundant data in the data block to be compressed;
and generating redundant data in the data block to be compressed by adopting a function corresponding to the index number.
In yet another possible implementation, the step, performed by the processor 44, of analyzing the possibility of adding redundancy in the data block to be compressed includes:
and analyzing the possibility of adding redundancy in the data block to be compressed according to the data type of the data block to be compressed.
In yet another possible implementation, the processor 44 is further configured to perform the steps of:
generating a heat map, where the heat map contains the high-value numbers, m bits long, that are redundant in the data block to be compressed, m being a positive integer.
In yet another possible implementation, the processor 44 is further configured to perform the steps of:
storing the redundant data in the data block to be compressed.
It is to be appreciated that the processor in embodiments of the application may be a central processing unit (central processing unit, CPU), other general purpose processor, digital signal processor (digital signal processor, DSP), application specific integrated circuit (application specific integrated circuit, ASIC), field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, transistor logic device, hardware components, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
According to the data compression device provided by the embodiments of the application, the computing resources required in the compression process are determined in advance, so that the compression performance of the data compression process is improved.
The method steps in the embodiments of the present application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, electrically erasable programmable read-only memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may be located in a data compression device. The processor and the storage medium may also reside as discrete components in a data compression device.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a base station, user equipment, or other programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired or wireless means. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium, e.g., a floppy disk, hard disk, or tape; an optical medium, such as a digital video disc; or a semiconductor medium, such as a solid state disk.
In various embodiments of the application, where no special description or logic conflict exists, terms and/or descriptions between the various embodiments are consistent and may reference each other, and features of the various embodiments may be combined to form new embodiments based on their inherent logic.
It should be understood that in the description of the present application, unless otherwise specified, "/" indicates an "or" relationship between the associated objects; for example, A/B may represent A or B, where A and B may each be singular or plural. Also, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "At least one of" and similar expressions mean any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, to describe the technical solutions of the embodiments clearly, the words "first", "second", etc. are used to distinguish between identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that such words do not limit quantity or execution order, and do not necessarily indicate a difference. Meanwhile, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, illustration, or explanation; any embodiment or design described as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs. Rather, such words are intended to present related concepts in a concrete fashion that is readily understood.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application. The sequence number of each process does not mean the sequence of the execution sequence, and the execution sequence of each process should be determined according to the function and the internal logic.

Claims (8)

1. A method of data compression, the method comprising:
determining the use condition of computing resources in the compression process of a given data block;
determining the computing resources required by the data block to be compressed according to the use condition of the computing resources in the compression process of the given data block, wherein the data block to be compressed has similarity with the given data block;
the determining the usage of computing resources in the compression process of the given data block comprises:
reading element performance when compressing data to measure the usage of computing resources in the given data block compression process, the usage of computing resources in the given data block compression process including at least one of: computing resources used for the last compressed file; and computing resources applied to the entire file that has been compressed; wherein performance labels of a plurality of compression algorithms are used for evaluating the performance of the compression algorithms and for adjusting the utilization of each algorithm and/or combination of algorithms.
2. The method according to claim 1, wherein the method further comprises:
the performance of the compression algorithm is evaluated using the corresponding performance labels based on the artificial intelligence driven algorithm.
3. The method according to claim 1, wherein the method further comprises:
reading the data block to be compressed with the set size;
analyzing a likelihood of adding redundancy in the data block to be compressed;
determining an index number of a function generating redundant data in the data block to be compressed;
and generating redundant data in the data block to be compressed by adopting a function corresponding to the index number.
4. A method according to claim 3, wherein said analyzing the possibility of adding redundancy in said data block to be compressed comprises:
and analyzing the possibility of adding redundancy in the data block to be compressed according to the data type of the data block to be compressed.
5. A method according to claim 3, characterized in that the method further comprises:
generating a heat map, wherein the heat map comprises the high-value numbers, m bits long, that are redundant in the data block to be compressed, and m is a positive integer.
6. A data compression apparatus, the apparatus comprising:
a first determining unit, configured to determine a usage of a computing resource in a given data block compression process;
a second determining unit, configured to determine, according to a usage situation of a computing resource in the given data block compression process, a computing resource required by a data block to be compressed, where the data block to be compressed has a similarity with the given data block;
the first determining unit is specifically configured to read element performance when compressing data, so as to measure the usage of computing resources in the given data block compression process, the usage of computing resources in the given data block compression process including at least one of: computing resources used for the last compressed file; and computing resources applied to the entire file that has been compressed; wherein performance labels of a plurality of compression algorithms are used for evaluating the performance of the compression algorithms and for adjusting the utilization of each algorithm and/or combination of algorithms.
7. A data compression device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the computer program.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-5.
CN202110815981.5A 2021-07-16 2021-07-16 Data compression method and device and storage medium Active CN113659992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110815981.5A CN113659992B (en) 2021-07-16 2021-07-16 Data compression method and device and storage medium


Publications (2)

Publication Number Publication Date
CN113659992A CN113659992A (en) 2021-11-16
CN113659992B true CN113659992B (en) 2023-08-11

Family

ID=78477654


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101517539A (en) * 2006-09-29 2009-08-26 高通股份有限公司 Method and apparatus for managing resources at a wireless device
CN105553937A (en) * 2015-12-03 2016-05-04 华为技术有限公司 System and method for data compression
US10223008B1 (en) * 2016-09-30 2019-03-05 EMC IP Holding Company LLC Storage array sizing for compressed applications
CN110662061A (en) * 2018-06-29 2020-01-07 想象技术有限公司 Guaranteed data compression
CN111277274A (en) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 Data compression method, device, equipment and storage medium
CN112181919A (en) * 2019-07-05 2021-01-05 深信服科技股份有限公司 Compression method, compression system, electronic equipment and storage medium
CN112488306A (en) * 2020-12-22 2021-03-12 中国电子科技集团公司信息科学研究院 Neural network compression method and device, electronic equipment and storage medium
CN112506879A (en) * 2020-12-18 2021-03-16 深圳智慧林网络科技有限公司 Data processing method and related equipment
CN112994701A (en) * 2019-12-02 2021-06-18 阿里巴巴集团控股有限公司 Data compression method and device, electronic equipment and computer readable medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8639672B2 (en) * 2012-03-27 2014-01-28 International Business Machines Corporation Multiplex classification for tabular data compression
US11200004B2 (en) * 2019-02-01 2021-12-14 EMC IP Holding Company LLC Compression of data for a file system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
He Haiyang, "Research on distributed compressed storage optimization based on the RCFile storage model", China Master's Theses Full-text Database, Information Science and Technology, No. 2 (2018), I137-120 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant