WO2011080031A1 - Prefix-offset encoding method for data compression - Google Patents

Prefix-offset encoding method for data compression

Info

Publication number
WO2011080031A1
Authority
WO
WIPO (PCT)
Prior art keywords
prefix
bits
offset
order
preserving
Prior art date
Application number
PCT/EP2010/069089
Other languages
French (fr)
Inventor
Oliver Draese
Peter Bendel
Tianchao Li
Vijayshankar Raman
Original Assignee
International Business Machines Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation
Publication of WO2011080031A1

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • the present invention relates in general to data compression and data encoding.
  • the present invention relates to a prefix-offset encoding method for compressing data.
  • Data compression is an important aspect of various computing and storage systems.
  • data warehouses are discussed in some detail as an example of systems where data compression is relevant, but it is appreciated that data compression and efficient handling of compressed data is relevant in many other systems where large amounts of data are stored.
  • A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
  • employed techniques include dictionary based compression and, for strings, offset-based compression and prefix-offset based compression.
  • Dictionary based compression encodes a value from a large value space but relatively much smaller set of actual values (low cardinality) with a dictionary code.
  • Figure 1a shows an example of dictionary encoding for fruit name strings.
  • the well-known Huffman code is applied.
  • Dictionary based compression is feasible only if the number of distinct values is limited so that a complete table of values and dictionary codes can be kept in the memory of the computer system. This assumption typically breaks when the cardinality of values is very large: for example, a 64 bit floating point type has 1.8E19 possible values, and dictionary encoding is not feasible.
  • Offset based compression compresses data by subtracting a common base value (the minimum of the value range) from each of the original values and uses the remaining offset to represent the original value.
  • Figure 1b illustrates some examples of how offset based compression works for integer and decimal values. As the last row in the table of Figure 1b shows, normalization is usually applied to decimals. With a common base value applied to all values, the effectiveness of offset based compression highly depends on the value distribution of the original values.
  • prefix-offset compression which encodes a value literally with a prefix code and an offset.
  • This method is naturally applied to strings, where common "prefixes" are often observed. For example, the strings "United States of America", "United Kingdom" and "United Arab Emirates" all share the common prefix "United " and have the different offsets "States of America", "Kingdom" and "Arab Emirates".
  • Applying dictionary encoding to the prefix allows the value to be stored more efficiently. Furthermore, limiting the length of the prefix solves the memory exhaustion problem of (pure) dictionary based compression. For non-string data, applying prefix-offset compression literally is normally not efficient, since the literal representation of a value usually takes much more memory.
  • Order preserving codes where the relative order of the codes is the same as the relative order of the original values, are important for the fast processing of queries that involve range predicates.
  • Range predicates refer to testing if a data field is within certain range of values. If the applied encoding method is not order-preserving, range predicates must be applied on the decoded values. This means that data needs to be decoded to be able to process queries involving range predicates.
  • Offset encoding is generally order-preserving by nature, and some variants of dictionary based compression are order-preserving. The order-preserving characteristic is a major challenge for prefix-offset encoding compression for some data types.
  • the present invention aims at providing an order-preserving prefix-offset encoding method for numerical data types, for allowing efficient data scans on encoded values.
  • a first aspect of the invention provides a computerized method for compressing data, said method comprising the following steps:
  • a second aspect of the invention provides a data processing system comprising
  • an input component for receiving data to be encoded; and an encoding component for encoding received data, said encoding component adapted to
  • encode said prefix bits using an order-preserving dictionary coding, resulting in prefix codes
  • concatenate said prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes.
  • a third aspect of the invention provides a computer program product comprising a computer-usable medium and a computer readable program.
  • Figure 1a shows, as a table, an example of dictionary based compression using Huffman coding
  • Figure 1b shows some examples of offset-based encoding
  • Figure 2 shows schematically, as an example, prefix-offset coding of a binary representation of a data value
  • FIG. 3 shows, as an example, a flowchart of a method in accordance with an embodiment of the present invention
  • Figure 4a shows, as an example, the binary presentation of floating point numbers using sign bit, exponent bits and fraction bits
  • Figure 4b shows, as an example, a binary representation of the floating point number 0.15625 according to IEEE 754-1985;
  • Figure 4c shows, as an example, prefix-offset encoding of a floating point number whose value is close to 0.15625 with sign bit flipping;
  • Figure 5 shows, as an example, some data types that are typical for very large cardinality data and where binary order-preserving prefix-offset compression according to an embodiment of the present invention can be used, but encoding is not feasible with dictionary encoding or offset encoding;
  • Figure 6 shows, as an example, an order-preserving binary representation for 16-bit integers with flipped sign bit;
  • Figure 7 shows, as an example, a table illustrating the transformation of IEEE 754-1985 binary representation of floating point numbers into order-preserving binary
  • Figure 8a shows, as an example, a table illustrating
  • Figure 8b shows continuation of the table in Figure 8a
  • Figure 9 shows, as an example, pseudocode for an algorithm suitable for determining the optimal split point between prefix and offset bits
  • Figure 10 shows a block diagram of a data processing system according to an embodiment of the invention.
  • the basic idea in embodiments of the invention is to combine a binary variant of prefix-offset compression with an order-preserving dictionary encoding method and an order-preserving binary representation of values to derive an order-preserving prefix-offset code.
  • the proposed prefix-offset compression encodes a binary representation of a value with a code that consists of two concatenated parts: a prefix code (i.e. the dictionary code for the prefix bits) and the binary offset bits.
  • Figure 2 illustrates the formation of prefix-offset code out of the binary
  • FIG. 3 shows, as an example, a method 300 for compressing data according to an embodiment of the invention.
  • the method 300 is typically implemented in a computing system.
  • the computing system provides an order-preserving binary
  • the computing system determines the number of offset bits.
  • the number of offset bits may be a predefined number or the optimal split point (between prefix and offset bits) may be determined using, for example, the algorithms described below.
  • step 303 the computing system divides the data values in said order-preserving binary representation into at least prefix bits and offset bits, as Figure 2 shows.
  • step 304 the computing system encodes the prefix bits using an order-preserving dictionary coding; this dictionary encoding results in prefix codes.
  • the simplest approach is to sort the prefix bits and assign dictionary codes in ascending order.
  • the resulting code is fixed length and can be applied very conveniently.
  • any order-preserving dictionary encoding may be used in connection with the present invention.
  • the variable length dictionary code derived from the frequency-partitioning method described in the US patent application US20090254521A1 may be applied, and the resulting prefix-offset code will be variable length, however still order-preserving.
  • step 305 the computing system concatenates the prefix codes and respective offset bits. This concatenation results in the order-preserving binary prefix-offset codes.
  • Figures 4a to 4c show an example of order-preserving binary prefix-offset coding of floating point numbers.
  • the fraction part is stored in a binary format with the first bit indicating $2^{-1}$, the second bit indicating $2^{-2}$, and so on.
  • Figure 4b shows an example with a 32-bit floating point number 0.15625:
  • $V = (-1)^{0} \times 2^{124-127} \times 1.01_2$ (the significand 1.01 is in binary)
  • bit in position 31 is the sign bit
  • bits in positions 30 to 23 represent the exponent (124 for this example)
  • the fraction part "01" is stored in bit positions 22 and 21.
  • 0.1562500186264514923095703125 is encoded into a 17 bit prefix-offset code 10111000000010100.
  • the sign bit in Figure 4c has been flipped so that the resulting prefix-offset coding is indeed order-preserving.
  • the encoding efficiency of binary prefix-offset encoding partially depends on the distribution of the values in its whole value space. More exactly, it depends on the clustering of values, which is a phenomenon commonly seen in many applications; for example, sensor measurement results are often around an average value due to measurement errors.
  • the IEEE 754-1985 standard represents a floating point value with the most significant bit depicting the sign, followed by several bits of exponent, and the fraction bits of descending fractions (the first fraction bit depicts 1/2, the second 1/4, and so on).
  • the binary prefix-offset compression cannot be better than (pure) dictionary based compression. This is due to waste in offset bits, because usually not all possible combinations of the offset bits are filled. Generally, it can be concluded that the encoding efficiency improves with a smaller length of offset bits (a formal proof will be provided later).
  • (pure) dictionary based compression can be considered as one extreme case of (binary) prefix-offset compression, with offset bits of length 0. In the other extreme case, when the number of offset bits is set to the maximum (the same as the length of the original value), the code is exactly the same as un-encoded data and is thus the most inefficient case.
  • the binary prefix-offset compression method as described above works in many of the cases where the otherwise commonly applied dictionary based compression and offset based compression methods do not work.
  • the table in Figure 5 summarizes the data types that are typical for very large cardinality data and where binary order-preserving prefix-offset compression can be used.
  • dictionary based compression is not applicable when the data has very large cardinality.
  • offset encoding cannot be applied to many data types because either the plus and minus operations are not defined, or adding and then subtracting the same value is not guaranteed to return the original value.
  • Binary order-preserving prefix-offset compression thus can be used with various data types where dictionary encoding and/or offset encoding cannot be efficiently used.
  • the memory consumption of (binary) prefix-offset compression is dominated by the dictionary for encoding the prefix bits. If the number of prefix bits is n, the upper bound of the dictionary size is $2^n$.
  • the memory exhaustion problem of (pure) dictionary based compression can be solved by limiting n. Due to the data clustering explained above, the actual size of the dictionary will in many situations be considerably less than this maximum.
  • prefix-offset code can be order preserving only if the binary representation is order-preserving.
  • an order-preserving dictionary encoding must be applied for the dictionary encoding of the prefix bits to guarantee the resulting prefix-offset code to be order-preserving.
  • the binary representation of data should be order-preserving.
  • IEEE 754-1985 is order preserving only for positive floating point numbers
  • IEEE 754-2008 for decimal numbers is not order-preserving.
  • Figure 6 shows an example with integers of 16 bits.
  • the order of 2's complement with sign bit flipped matches exactly that of the integer values. Therefore, the derived prefix-offset code that concatenates the order-preserving dictionary code for prefix bits and the offset bits is also order-preserving.
  • floating point numbers commonly follow the IEEE 754-1985 standard and are represented with a sequence of a sign bit, exponent bits and fraction bits.
  • This binary representation requires one bit more to encode the same range of values than the compact format.
  • the decimal32 format has one sign bit, a coefficient of 7 digits that can be stored in 24 bits, and an exponent from -95 to 96 that can be stored in 8 bits. This amounts to 33 bits instead of the original 32 bits. However, this additional bit does not have much impact on the encoding efficiency. Similar to the case of floating point described above, bit flipping must be applied to guarantee the order-preserving characteristics for negative numbers. The two bit flipping steps listed above for floating point numbers also apply for decimal numbers with the simple binary presentation above. The tables in Figures 8a and 8b present examples using decimal32 numbers.
  • Order-preserving binary representations for other typical data types in databases are also available. As some examples, consider the following. Timestamp, defined as the number of seconds since 1970, is naturally order-preserving when stored using a binary representation of unsigned integer. An order-preserving binary representation for Date and Time can be defined, for example, by storing Year, Month, Day, Hour,
  • the length of offset bits is an important factor for the effectiveness of the above described binary prefix-offset compression.
  • the number of offset bits may be
  • If automatic detection of the optimal split point between the prefix and offset bits is employed, it needs to consider two properties of prefix-offset encoding that have already been discussed: 1) with increasing number of prefix bits n, the upper bound of dictionary size increases
  • Figure 9 shows pseudocode for an algorithm suitable for this purpose.
  • the target is to minimize the size of prefix-offset code which includes the dictionary code for the prefix bits and the offset bits under the constraint of specified maximal size of dictionary.
  • the search starts from a maximum of offset bits that equals the size of un-encoded data (suppose data is fixed length) in units of bits N (we do not actually need to calculate the size of encoded value, since in this case the value is kept
  • the length of the prefix-offset code with S offset bits (namely $L_S$) is the sum of the length of the dictionary code for the prefix bits and the length of the offset bits.
  • Figure 10 shows a block diagram of a data processing system 100 according to an embodiment of the invention. It is appreciated that schematic blocks are provided as an example to facilitate the understanding of the present invention.
  • the data processing system 100 has an input component 110 for receiving data to be encoded and an encoding component 120 for encoding the received data.
  • the encoding component 120 contains a binary representation component 122 that provides an order-preserving binary representation applicable for the received data.
  • the component 122 may contain various order-preserving binary representations applicable to different data types.
  • the division component 124 divides data values in the order-preserving binary representation into at least prefix bits and offset bits using a given number of the offset bits.
  • the dictionary encoding component 126 encodes the prefix bits using an order-preserving dictionary coding, and this results in the prefix codes.
  • the concatenation component 128 concatenates the prefix codes and the respective offset bits, resulting in the order-preserving binary prefix-offset codes.
  • the storage component 150 may use memory or some persistent storage means, such as disk space, for storing the encoded data.
  • the encoding component 120 may contain a component 125 to determine the number of offset bits by minimizing the size of the order-preserving binary prefix-offset code under the constraint of a given maximal size for the prefix code dictionary.
  • the size of the order-preserving binary prefix-offset code is the sum of the number of offset bits and the size of said prefix code.
  • the encoding component 120 typically contains a transformation component 121 for transforming received data in a first binary representation into a suitable order-preserving binary
  • the data processing system 100 typically also contains a query processing component 140 for performing a value scan on the order-preserving binary prefix-offset codes stored in the storage component 150.
  • the query processing component 140 needs to have access to
  • the encoding schema 130 for the order-preserving binary prefix-offset encoding is determined by the binary order-preserving representation, the number of offset bits and the dictionary used for encoding the prefix bits.
  • the data processing system 100 may be a database, preferably an in-memory database, for storing said order-preserving binary prefix-offset codes.
  • the data processing system 100 may contain a database for storing data on a persistent medium and a further in-memory database connected to the database, for uploading data from said database to the further in-memory database for enhanced processing.
  • aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware
  • aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical,
  • a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • a computerized method refers to a method whose steps are performed by a computing system containing a suitable combination of one or more processors, memory means and storage means.

Abstract

The following prefix-offset encoding method is order-preserving. An order-preserving binary representation is provided for a data type. The number of offset bits for the prefix-offset encoding is determined. Data values in said order-preserving binary representation are divided into prefix bits and offset bits. The prefix bits are encoded using an order-preserving dictionary coding, resulting in prefix codes. The prefix codes and respective offset bits are concatenated, resulting in order-preserving binary prefix-offset codes.

Description

PREFIX-OFFSET ENCODING METHOD FOR DATA COMPRESSION
BACKGROUND OF THE INVENTION
Field of the invention
The present invention relates in general to data compression and data encoding. In particular, the present invention relates to a prefix-offset encoding method for compressing data.
Related art
Data compression is an important aspect of various computing and storage systems. Here data warehouses are discussed in some detail as an example of systems where data compression is relevant, but it is appreciated that data compression and efficient handling of compressed data are relevant in many other systems where large amounts of data are stored. A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
The effectiveness of data warehouses that employ table scans for fast processing of queries relies on efficient compression of the data. With an adequate data compression method, table scans can be applied directly to the compressed data, instead of having to decode each value first. Also, well designed algorithms can scan over multiple compressed values that are packed into one machine word in each loop iteration. Therefore, a shorter code typically means a faster table scan. The most commonly employed techniques include dictionary based compression and, for strings, offset-based compression and prefix-offset based compression.
Dictionary based compression encodes a value from a large value space but a relatively much smaller set of actual values (low cardinality) with a dictionary code. Figure 1a shows an example of dictionary encoding for fruit name strings. In the example in Figure 1a, the well-known Huffman code is applied. Dictionary based compression is feasible only if the number of distinct values is limited so that a complete table of values and dictionary codes can be kept in the memory of the computer system. This assumption typically breaks when the cardinality of values is very large: for example, a 64 bit floating point type has 1.8E19 possible values, and dictionary encoding is not feasible.
Offset based compression compresses data by subtracting a common base value (the minimum of the value range) from each of the original values and uses the remaining offset to represent the original value. Figure 1b illustrates some examples of how offset based compression works for integer and decimal values. As the last row in the table of Figure 1b shows, normalization is usually applied to decimals. With a common base value applied to all values, the effectiveness of offset based compression highly depends on the value distribution of the original values. It is only efficient if the resulting offsets are on average much shorter than the original values, which is usually not the case for high volume data with very large cardinality. Otherwise, offset based compression will not be much more efficient than un-encoded data.
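The dependence on the value distribution described above can be illustrated with a minimal Python sketch. The function names `offset_encode` and `offset_decode` and the sample values are illustrative assumptions, not part of the patent text.

```python
def offset_encode(values):
    """Encode integers as offsets from the minimum value (the common base)."""
    base = min(values)
    offsets = [v - base for v in values]
    # Bits needed for the largest offset; at least 1 so a code always exists.
    width = max(1, max(offsets).bit_length())
    return base, width, offsets


def offset_decode(base, offsets):
    """Recover the original integers by adding the base back."""
    return [base + o for o in offsets]


# Clustered values: the offsets are much shorter than the original values.
clustered = [100000, 100003, 100007, 100012]
base, width, offsets = offset_encode(clustered)

# Widely spread values: the offsets are nearly as long as the values themselves.
spread = [3, 100012]
_, wide, _ = offset_encode(spread)

print(width, wide)
```

In this sketch the clustered column needs far fewer bits per offset than the spread column, which matches the observation that a single base value only helps when the values lie close together.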
An extension of the basic offset-based compression approach would be to use multiple base values, each for a cluster of data. The automatic determination of the optimal set of base values is, however, quite difficult. Even more importantly, offset based compression implicitly requires that a minus (-) and a plus (+) operation are defined for the corresponding data type, which is not the case for many non-numerical data types like fixed/variable length strings, etc. Furthermore, the equation "Base=Base+Offset-Offset" must be true for all values involved. This is, however, not true for floating-point (e.g. float, double) values due to the inaccuracy of floating-point arithmetic operations. Such data types are typical for very large cardinality values.
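The round-trip requirement can be checked directly. The following small Python sketch uses arbitrarily chosen values (an assumption for illustration) to show that adding and then subtracting the same offset need not return the base for binary floating-point numbers, while it always does for integers.

```python
# Floating-point: rounding in the intermediate sum breaks the round trip.
float_base = 0.1
float_offset = 0.2
float_round_trip = float_base + float_offset - float_offset
float_ok = float_round_trip == float_base

# Integers: addition and subtraction are exact, so the round trip holds.
int_base = 100000
int_offset = 12
int_ok = int_base + int_offset - int_offset == int_base

print(float_ok, int_ok)
```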
There is another type of compression, prefix-offset compression, which encodes a value literally with a prefix code and an offset. This method is naturally applied to strings, where common "prefixes" are often observed. For example, the strings "United States of America", "United Kingdom" and "United Arab Emirates" all share the common prefix "United " and have the different offsets "States of America", "Kingdom" and "Arab Emirates". Applying dictionary encoding to the prefix allows the value to be stored more efficiently. Furthermore, limiting the length of the prefix solves the memory exhaustion problem of (pure) dictionary based compression. For non-string data, applying prefix-offset compression literally is normally not efficient, since the literal representation of a value usually takes much more memory. For example, a 32 bit floating-point number can have a literal that takes nearly 20 bytes to store, e.g. 1.3000337465E188.
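The string example above can be sketched in a few lines of Python. This is a simplified illustration assuming a fixed prefix length in characters; the names `prefix_offset_encode` and `prefix_offset_decode` are hypothetical.

```python
def prefix_offset_encode(strings, prefix_len):
    """Split each string at prefix_len and dictionary-encode the prefix part."""
    prefixes = sorted({s[:prefix_len] for s in strings})
    prefix_code = {p: i for i, p in enumerate(prefixes)}
    encoded = [(prefix_code[s[:prefix_len]], s[prefix_len:]) for s in strings]
    return prefixes, encoded


def prefix_offset_decode(prefixes, encoded):
    """Look the prefix up in the dictionary and append the literal offset."""
    return [prefixes[code] + offset for code, offset in encoded]


countries = ["United States of America", "United Kingdom", "United Arab Emirates"]
# "United " (including the trailing space) is 7 characters long.
prefixes, encoded = prefix_offset_encode(countries, prefix_len=7)
print(prefixes)
print(encoded)
```

Because the prefix length is bounded, the dictionary holds at most one entry per distinct prefix rather than one entry per distinct string.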
Order-preserving codes, where the relative order of the codes is the same as the relative order of the original values, are important for the fast processing of queries that involve range predicates. A range predicate tests whether a data field is within a certain range of values. If the applied encoding method is not order-preserving, range predicates must be applied to the decoded values, which means that the data must be decoded before such queries can be processed. Offset encoding is generally order-preserving by nature, and some variants of dictionary based compression are order-preserving. The order-preserving characteristic is a major challenge for prefix-offset encoding compression for some data types.
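As a rough illustration of why order preservation matters, the Python sketch below assigns dictionary codes by rank in sorted order (one simple order-preserving scheme) and evaluates a range predicate on the codes alone. The helper names and sample column are assumptions made for this example.

```python
import bisect


def build_order_preserving_dictionary(values):
    """Sort the distinct values; each value's rank serves as its code."""
    return sorted(set(values))


def encode(dictionary, value):
    """Return the rank of value in the sorted dictionary."""
    index = bisect.bisect_left(dictionary, value)
    if index == len(dictionary) or dictionary[index] != value:
        raise KeyError(value)
    return index


def range_to_code_range(dictionary, low, high):
    """Translate low <= value <= high into a half-open range of codes."""
    return bisect.bisect_left(dictionary, low), bisect.bisect_right(dictionary, high)


column = ["cherry", "apple", "banana", "apple", "fig", "banana"]
dictionary = build_order_preserving_dictionary(column)
codes = [encode(dictionary, v) for v in column]

# The predicate "b" <= value <= "d" is translated once, then the scan
# compares integer codes only; no row is decoded.
lo, hi = range_to_code_range(dictionary, "b", "d")
matches = [row for row, code in enumerate(codes) if lo <= code < hi]
print(matches)
```

With a code that is not order-preserving, the comparison `lo <= code < hi` would be meaningless and each row would have to be decoded before the predicate could be tested.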
The present invention aims at providing an order-preserving prefix-offset encoding method for numerical data types, for allowing efficient data scans on encoded values.
SUMMARY OF INVENTION
A first aspect of the invention provides a computerized method for compressing data, said method comprising the following steps:
providing an order-preserving binary representation for a data type;
determining a number of offset bits;
dividing data values in said order-preserving binary representation into prefix bits and offset bits;
encoding said prefix bits using an order-preserving dictionary coding, resulting in prefix codes;
concatenating said prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes.
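The five steps listed above can be sketched end to end for 32-bit floating point values. This is a minimal illustration under stated assumptions, not the patent's reference implementation: the order-preserving representation used here is the commonly known transformation (flip only the sign bit for non-negative values, flip all bits for negative values), the dictionary code is a fixed-length rank code, NaN values are not handled, and the names `float32_to_ordered_bits` and `encode_column` are hypothetical.

```python
import struct


def float32_to_ordered_bits(x):
    """Step 1: map a float32 to a 32-bit unsigned integer with the same order."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    if bits & 0x80000000:
        return bits ^ 0xFFFFFFFF  # negative value: flip all bits
    return bits | 0x80000000      # non-negative value: flip the sign bit only


def encode_column(values, offset_bits):
    """Steps 2-5, with the number of offset bits supplied by the caller."""
    ordered = [float32_to_ordered_bits(v) for v in values]
    mask = (1 << offset_bits) - 1
    # Step 3: divide each value into prefix bits and offset bits.
    # Step 4: order-preserving dictionary code = rank of the sorted prefix.
    prefixes = sorted({o >> offset_bits for o in ordered})
    prefix_code = {p: rank for rank, p in enumerate(prefixes)}
    # Step 5: concatenate the prefix code and the offset bits.
    codes = [(prefix_code[o >> offset_bits] << offset_bits) | (o & mask)
             for o in ordered]
    prefix_code_bits = max(1, (len(prefixes) - 1).bit_length())
    return prefixes, prefix_code_bits + offset_bits, codes


readings = [0.15625, 0.1563, 0.1561, -0.15625, 0.157]
prefixes, code_bits, codes = encode_column(readings, offset_bits=12)

# Sorting rows by code gives the same row order as sorting by original value.
by_value = sorted(range(len(readings)), key=readings.__getitem__)
by_code = sorted(range(len(codes)), key=codes.__getitem__)
same_order = by_value == by_code
print(code_bits, same_order)
```

A rank-based fixed-length code is chosen here only because it is the simplest order-preserving dictionary code; any other order-preserving dictionary code could be substituted in step 4.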
A second aspect of the invention provides a data processing system comprising
an input component for receiving data to be encoded; and an encoding component for encoding received data, said encoding component adapted to
provide an order-preserving binary representation applicable for the received data;
divide data values in said order-preserving binary representation into at least prefix bits and offset bits using a given number of said offset bits;
encode said prefix bits using an order-preserving dictionary coding, resulting in prefix codes;
concatenate said prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes.
A third aspect of the invention provides a computer program product comprising a computer-usable medium and a computer readable program.
BRIEF DESCRIPTION OF FIGURES
For a better understanding of the present invention and as to how the same may be carried into effect, reference will now be made by way of example only to the accompanying drawings in which:
Figure 1a shows, as a table, an example of dictionary based compression using Huffman coding;
Figure 1b shows some examples of offset-based encoding;
Figure 2 shows schematically, as an example, prefix-offset coding of a binary representation of a data value;
Figure 3 shows, as an example, a flowchart of a method in accordance with an embodiment of the present invention;
Figure 4a shows, as an example, the binary presentation of floating point numbers using sign bit, exponent bits and fraction bits;
Figure 4b shows, as an example, a binary representation of the floating point number 0.15625 according to IEEE754-1985 ;
Figure 4c shows, as an example, prefix-offset encoding of a floating point number whose value is close to 0.15625 with sign bit flipping;
Figure 5 shows, as an example, some data types that are typical for very large cardinality data and where binary order-preserving prefix-offset compression according to an embodiment of the present invention can be used, but encoding is not feasible with dictionary encoding or offset encoding;
Figure 6 shows, as an example, an order-preserving binary representation for 16-bit integers with flipped sign bit;
Figure 7 shows, as an example, a table illustrating the transformation of the IEEE 754-1985 binary representation of floating point numbers into an order-preserving binary representation with examples of 32-bit floating point numbers;
Figure 8a shows, as an example, a table illustrating transformations into a binary order-preserving representation of 32-bit decimal numbers;
Figure 8b shows continuation of the table in Figure 8a;
Figure 9 shows, as an example, pseudocode for an algorithm suitable for determining the optimal split point between prefix and offset bits;
Figure 10 shows a block diagram of a data processing system according to an embodiment of the invention.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
The basic idea in embodiments of the invention is to combine a binary variant of prefix-offset compression with an order-preserving dictionary encoding method and an order-preserving binary representation of values, in order to derive an order-preserving prefix-offset code.
The proposed prefix-offset compression encodes a binary representation of a value with a code that consists of two concatenated parts: a prefix code (i.e. the dictionary code for the prefix bits) and binary offset bits. Figure 2 illustrates the formation of a prefix-offset code out of the binary representation of data. Each value is first separated into prefix bits and offset bits. The prefix bits of all values are then encoded with an order-preserving dictionary encoding method, and the resulting dictionary code for the prefix bits (prefix code) is concatenated with the offset bits.
Figure 3 shows, as an example, a method 300 for compressing data according to an embodiment of the invention. The method 300 is typically implemented in a computing system. The computing system provides an order-preserving binary representation for a data type in step 301. For examples of such order-preserving binary representations for various data types, see details and examples below. In step 302, the computing system determines the number of offset bits. The number of offset bits may be a predefined number, or the optimal split point (between prefix and offset bits) may be determined using, for example, the algorithms described below.
In step 303, the computing system divides the data values in said order-preserving binary representation into at least prefix bits and offset bits, as Figure 2 shows. In step 304, the computing system encodes the prefix bits using an order-preserving dictionary coding; this dictionary encoding results in prefix codes. The simplest approach is to sort the distinct prefix bit patterns and assign dictionary codes in ascending order. The resulting code has a fixed length and can be applied very conveniently. Other possibilities also exist, and in fact any order-preserving dictionary encoding may be used in connection with the present invention. For example, the variable length dictionary code derived from the frequency-partitioning method described in the US patent application US20090254521A1 may be applied, and the resulting prefix-offset code will then be of variable length, but still order-preserving.
In step 305, the computing system concatenates the prefix codes and respective offset bits. This concatenation results in the order-preserving binary prefix-offset codes.
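Steps 303 to 305 can be illustrated with a minimal Python sketch. It assumes the simplest dictionary variant mentioned above (sorted distinct prefixes numbered in ascending order) and values that are already in an order-preserving binary representation; the function names and sample values are hypothetical and are not taken from the figures.

```python
def build_prefix_dictionary(values, offset_bits):
    """Step 304: give each distinct prefix its rank in ascending order."""
    prefixes = sorted({v >> offset_bits for v in values})
    return {prefix: code for code, prefix in enumerate(prefixes)}


def encode(value, dictionary, offset_bits):
    """Steps 303 and 305: split the value, then concatenate code and offset."""
    prefix = value >> offset_bits                 # leading (prefix) bits
    offset = value & ((1 << offset_bits) - 1)     # trailing (offset) bits
    return (dictionary[prefix] << offset_bits) | offset


# Hypothetical 16-bit values, split into 8 prefix bits and 8 offset bits.
values = [0x1203, 0x1210, 0x7F01, 0x7F02, 0xA000]
dictionary = build_prefix_dictionary(values, 8)
codes = [encode(v, dictionary, 8) for v in values]
```

In this sketch only three distinct prefixes occur, so a 2-bit prefix code suffices and each 16-bit value shrinks to 10 bits, while the numeric order of the codes follows the order of the values.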
Figures 4a to 4c show an example of order-preserving binary prefix-offset coding of floating point numbers. The binary representation of a floating point value according to IEEE 754-1985 is separated into a sign bit, exponent bits, and fraction bits (Figure 4a), which form the value according to the formula $v = (-1)^{\text{sign}} \times 2^{\text{exponent} - \text{bias}} \times 1.\text{fraction}$ (binary). Note that the non-fraction part "1." is always omitted in the binary representation, and the fraction part is stored in a binary format with the first bit indicating $2^{-1}$, the second bit indicating $2^{-2}$, and so on.
Figure 4b shows an example with the 32-bit floating point number 0.15625:
$$v = (-1)^{0} \times 2^{124-127} \times 1.01_{2} = 1 \times 2^{-3} \times 1.25 = 0.15625$$
The bit in position 31 is the sign bit, bits in positions 30 to 23 represent the exponent (124 for this example), and the fraction part "01" is stored in bit positions 22 and 21.
In Figure 4b, the fraction part is translated from .01 (binary) to .25 (decimal), which is calculated as $0 \times 2^{-1} + 1 \times 2^{-2}$. Suppose we have a number as shown in Figure 4c, which is slightly different from 0.15625 (a common phenomenon seen, for example, in instrument-measured data). With 10 offset bits and the prefix bits represented by the dictionary code 92 (1011100 in binary), the number 0.1562500186264514923095703125 is encoded into the 17-bit prefix-offset code 10111000000010100, that is, the 7-bit prefix code 1011100 followed by the 10 offset bits 0000010100. For reasons discussed in more detail later, the sign bit in Figure 4c has been flipped so that the resulting prefix-offset coding is indeed order-preserving.
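The bit layout of 0.15625 and the effect of a 10-bit offset can be reproduced with Python's standard `struct` module. This is only a sketch: the helper name is hypothetical, and the neighbouring bit pattern (20 units in the last place above 0.15625) is an assumption chosen to produce the offset bits 0000010100, not a value read from Figure 4c.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


bits = float32_bits(0.15625)
sign = bits >> 31                 # bit position 31
exponent = (bits >> 23) & 0xFF    # bit positions 30 to 23
fraction = bits & 0x7FFFFF        # bit positions 22 to 0

# Flip the sign bit, then split with 10 offset bits.
OFFSET_BITS = 10
flipped = bits ^ 0x80000000
prefix = flipped >> OFFSET_BITS
offset = flipped & ((1 << OFFSET_BITS) - 1)

# A nearby bit pattern (assumed for illustration) keeps the same prefix.
neighbour = (bits + 20) ^ 0x80000000
neighbour_prefix = neighbour >> OFFSET_BITS
neighbour_offset = neighbour & ((1 << OFFSET_BITS) - 1)
```

Both values share the same 22 prefix bits, so they would map to the same dictionary entry and differ only in their offset bits.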
The encoding efficiency of binary prefix-offset encoding partially depends on the distribution of the values in the whole value space. More exactly, it depends on the "clustering" of values, which is a phenomenon commonly seen in many applications, for example:
sensor measurement results are often around an average value due to measurement errors;
prices are often around a certain value (1, 2, 5, 10, etc.), for example, 0.95, 0.98, 0.99, 1;
names often share common prefixes.
One thing to be noted is that the "clustering" of values must be seen in the binary representation of the value with its prefix bits and offset bits. Most common binary representations preserve, to a certain degree, the major part (if not all) of the clustering effect by having the semantically most important part of the value represented in the most significant bits. As seen in the previous examples, the IEEE 754-1985 standard represents a floating point value with the most significant bit depicting the sign, followed by several bits of exponent, and then the fraction bits of descending weight (the first fraction bit stands for ½, the second for ¼, and so on).
Another part of the encoding efficiency comes inherently from the dictionary encoding of the prefix bits. It takes effect even when the data is distributed completely at random in its value space as seen from the binary representation.
Solely from the perspective of compression efficiency, binary prefix-offset compression cannot be better than (pure) dictionary based compression. This is due to waste in the offset bits, because usually not all possible combinations of the offset bits occur. Generally, it can be concluded that the encoding efficiency improves with a smaller number of offset bits (a formal proof will be provided later). In fact, (pure) dictionary based compression can be considered as one extreme case of (binary) prefix-offset compression, with 0 offset bits. In the other extreme case, when the number of offset bits is set to the maximum (the full length of the original value), the code is exactly the same as the unencoded data, which is the least efficient case.
The binary prefix-offset compression method as described above works in many of the cases in which the otherwise commonly applied dictionary based compression and offset based compression methods do not work. The table in Figure 5 summarizes the data types that are typical for very large cardinality data and where binary order-preserving prefix-offset compression can be used. For all these data types, dictionary based compression is not applicable when the data has very large cardinality. Furthermore, offset encoding cannot be applied to many data types, because either the plus and minus operations are not defined, or adding and then subtracting the same value is not guaranteed to return the original value. For those data types where offset encoding is applicable, it might not be efficient because it assumes the existence of a common base value. Binary order-preserving prefix-offset compression can thus be used with various data types where dictionary encoding and/or offset encoding cannot be used efficiently.
The memory consumption of (binary) prefix-offset compression is dominated by the dictionary for encoding the prefix bits. If the number of prefix bits is n, the upper bound of the dictionary size is $2^n$. The memory exhaustion problem of (pure) dictionary based compression can be solved by limiting n. Due to the data clustering explained above, the actual size of the dictionary will in many situations be considerably less than this maximum. Therefore, a larger number of prefix bits and a smaller number of offset bits can be allowed, which also means higher encoding efficiency (see the comment on encoding efficiency above).
For the prefix-offset compression described above, whether or not the resulting code is order-preserving is dependent on the concrete binary representation of data (prefix bits and offset bits) . The prefix-offset code can be order preserving only if the binary representation is order-preserving. In addition, an order-preserving dictionary encoding must be applied for the dictionary encoding of the prefix bits to guarantee the resulting prefix-offset code to be order-preserving.
In order to guarantee the order-preserving characteristics of the derived prefix-offset code, the binary representation of data should be order-preserving. The unfortunate fact is that the commonly used binary representations of data values are seldom order-preserving. For example, IEEE 754-1985 is order-preserving only for positive floating point numbers, and IEEE 754-2008 for decimal numbers is not order-preserving. Therefore, a transformation of the original binary representation into an order-preserving one is needed before deriving the prefix bits and offset bits. In the following, we briefly review methods of transforming the binary representation of typical data types into an order-preserving one.
Integers are commonly represented in computers using two's complement. This binary representation by nature fulfils our requirement of encoding efficiency, since the semantically more significant bits (possibly also the less frequently changing bits) are placed more to the left. A minor adjustment, flipping the sign bit, is sufficient to make it order-preserving for the whole value range. The table in Figure 6 shows an example with integers of 16 bits. The order of the two's complement representation with the sign bit flipped matches exactly that of the integer values. Therefore, the derived prefix-offset code that concatenates the order-preserving dictionary code for the prefix bits and the offset bits is also order-preserving.
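The sign-bit flip for 16-bit integers can be checked with a few lines of Python. This is a sketch with a hypothetical function name and sample values; it does not reproduce the table in Figure 6.

```python
def int16_to_ordered(value):
    """Two's complement 16-bit pattern of value with the sign bit flipped."""
    return (value & 0xFFFF) ^ 0x8000


samples = [-32768, -2, -1, 0, 1, 2, 32767]
keys = [int16_to_ordered(v) for v in samples]

# Compared as unsigned integers, the keys follow the order of the values.
for v, k in zip(samples, keys):
    print(f"{v:>6}  {k:016b}")
```

Without the flip, the unsigned patterns of negative values (leading 1) would sort after those of positive values (leading 0).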
As mentioned above, floating point numbers commonly follow the IEEE standard 754-1985 and are represented with a sequence of a sign bit, exponent bits and fraction bits. Similar to integers in two's complement, some adjustments are needed to make the binary representation according to IEEE 754-1985 order-preserving for the whole value range. This includes the following steps: 1. flip all exponent bits and fraction bits for negative values and 2. flip the sign bit for all values. The table in Figure 7 illustrates the transformation of IEEE 754-1985 into an order-preserving binary representation with examples of 32-bit floating point numbers. After bit-flipping, the order of the binary representation matches exactly the numerical order of the floating point values. Therefore, the prefix-offset code derived by concatenating the order-preserving dictionary code for the prefix bits and the offset bits is also order-preserving.
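The two bit-flipping steps for floating point numbers can be sketched in Python for 32-bit values. The function name and the sample values are hypothetical; special values such as NaN are outside the scope of this sketch, and negative zero sorts just below positive zero.

```python
import struct


def float32_to_ordered(x):
    """Map a float to a 32-bit key whose unsigned order matches numeric order."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    if bits & 0x80000000:
        # Negative: flipping exponent and fraction bits (step 1) and then
        # the sign bit (step 2) amounts to flipping all 32 bits.
        return bits ^ 0xFFFFFFFF
    # Non-negative: only the sign bit is flipped (step 2).
    return bits ^ 0x80000000


samples = [-1e10, -1.5, -0.15625, 0.0, 0.15625, 1.5, 1e10]
keys = [float32_to_ordered(x) for x in samples]
```

Comparing the keys as unsigned integers then gives the same result as comparing the original floating point values.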
Decimal values are typically encoded as $v = (-1)^{\text{sign}} \times \text{coefficient} \times 10^{\text{exponent} - \text{bias}}$ with a decimal coefficient, in accordance with the IEEE standard 754-2008. This is not order-preserving, because it allows redundant encodings of the same value. For an order-preserving binary representation of decimals, such redundancy must be avoided by normalizing the coefficient (similar to the normalization applied in the last row of the table in Figure 1b). For example, by normalizing the coefficient to 7 digits, the coefficient of 999999 is 9999990, and the coefficient of 12.0 is 1200000. The fraction part of the coefficient is often stored as a compressed sequence of decimal digits as described, for example, in the US patent application US20070050436A1. This representation is not order-preserving. Also, the layout of the bits has the first digit (which only takes values 0 through 9) compacted together with the two most significant bits of the exponent (which only take values 0 through 2) in a five-bit combination field as described in "Decimal Arithmetic Encodings", Version 1.01, 7 April 2009 (downloadable from http://speleotrove.com/decimal/decbits.html). To guarantee order-preserving characteristics, both of these compressions should be avoided and replaced with a simple binary representation that includes, in sequence, one sign bit, several bits of exponent, and the coefficient bits. For example,
999999 = $(-1)^{0} \times 9999990 \times 10^{100-101}$ = 0 01100100 100110001001011001110110
and
12.0 = $(-1)^{0} \times 1200000 \times 10^{96-101}$ = 0 01100000 000100100100111110000000
This binary representation requires one bit more than the compact format to encode the same range of values. For example, decimal32 has one sign bit, a coefficient of 7 digits that can be stored in 24 bits, and an exponent ranging from -95 to 96 that can be stored in 8 bits. This amounts to 33 bits instead of the original 32 bits. However, this additional bit does not have much impact on the encoding efficiency. Similar to the case of floating point numbers described above, bit flipping must be applied to guarantee the order-preserving characteristics for negative numbers. The two bit-flipping steps listed above for floating point numbers also apply to decimal numbers in the simple binary representation above. The tables in Figures 8a and 8b present examples using decimal32 numbers.
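The simple 33-bit layout (1 sign bit, 8 exponent bits, 24 coefficient bits) can be sketched with Python's standard `decimal` module. The bias of 101 and the 7-digit normalization follow the examples above; the function names, the treatment of zero and the missing range checks on the exponent are assumptions of this sketch.

```python
from decimal import Decimal

BIAS = 101     # bias used in the examples above
DIGITS = 7     # normalized coefficient length


def decimal_to_fields(text):
    """Return (sign, biased exponent, 7-digit coefficient) for a decimal string."""
    sign, digits, exponent = Decimal(text).as_tuple()
    coefficient = int("".join(str(d) for d in digits))
    if coefficient == 0:
        return 0, 0, 0   # assumption: zero gets the smallest non-negative pattern
    pad = DIGITS - len(str(coefficient))
    if pad < 0:
        raise ValueError("coefficient has more than 7 digits")
    # Shift the coefficient left by `pad` decimal digits, adjust the exponent.
    return sign, exponent - pad + BIAS, coefficient * 10 ** pad


def decimal_to_ordered(text):
    """33-bit key: sign | 8 exponent bits | 24 coefficient bits, bits flipped."""
    sign, biased_exponent, coefficient = decimal_to_fields(text)
    bits = (sign << 32) | (biased_exponent << 24) | coefficient
    if sign:
        bits ^= (1 << 32) - 1    # negative: flip exponent and coefficient bits
    return bits ^ (1 << 32)      # all values: flip the sign bit


samples = ["-999999", "-12.0", "-0.5", "0", "0.5", "12", "12.5", "999999"]
keys = [decimal_to_ordered(s) for s in samples]
```

Because the coefficient is normalized, "12" and "12.0" produce the same key, which removes the redundancy that otherwise breaks order preservation.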
Order-preserving binary representations for other typical data types in databases are also available. As some examples, consider the following. A timestamp, defined as the number of seconds since 1970, is naturally order-preserving when stored using the binary representation of an unsigned integer. An order-preserving binary representation for date and time can be defined, for example, by storing Year, Month, Day, Hour, Minute, Seconds, and Microseconds as unsigned integers using the minimal number of bits and concatenating them in sequence.
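A possible packing of date and time fields is sketched below in Python. The field widths (14, 4, 5, 5, 6, 6 and 20 bits, 60 bits in total) are an assumption based on the value ranges of Python's `datetime` type; the text above does not prescribe them.

```python
from datetime import datetime

# (attribute name, number of bits), most significant field first (assumed widths).
FIELDS = [("year", 14), ("month", 4), ("day", 5), ("hour", 5),
          ("minute", 6), ("second", 6), ("microsecond", 20)]


def datetime_to_ordered(dt):
    """Concatenate the fields as unsigned integers, most significant first."""
    key = 0
    for name, width in FIELDS:
        key = (key << width) | getattr(dt, name)
    return key


samples = [
    datetime(1999, 12, 31, 23, 59, 59, 999999),
    datetime(2000, 1, 1, 0, 0, 0, 0),
    datetime(2010, 12, 7, 9, 30, 0, 1),
    datetime(2010, 12, 7, 9, 30, 1, 0),
]
keys = [datetime_to_ordered(dt) for dt in samples]
```

Since every field occupies a fixed number of bits and the more significant fields come first, comparing the keys as unsigned integers corresponds to comparing the timestamps.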
It is appreciated that in addition to the order-preserving binary representations discussed above, any other order-preserving binary representation may be used in connection with the present invention.
The number of offset bits is an important factor for the effectiveness of the above described binary prefix-offset compression. In some applications of embodiments of the present invention, the number of offset bits may be predefined. If automatic detection of the optimal split point between the prefix and offset bits is employed, it needs to consider two properties of prefix-offset encoding that have already been discussed: 1) with an increasing number of prefix bits n, the upper bound of the dictionary size, $2^n$, increases exponentially, and 2) the encoding efficiency also increases with an increasing number of prefix bits.
The optimal split point can be automatically detected by comparing different possibilities. Figure 9 shows pseudocode for an algorithm suitable for this purpose. The target is to minimize the size of the prefix-offset code, which includes the dictionary code for the prefix bits and the offset bits, under the constraint of a specified maximal size of the dictionary. The search starts from the maximum number of offset bits, which equals the size N in bits of the unencoded data (assuming fixed-length data), and proceeds down to a minimum of 0 offset bits. For the starting point, the size of the encoded value does not actually need to be calculated, since in this case the value is kept unencoded and its size is N. The search also stops when the size of the dictionary for the prefix bits exceeds a pre-configured maximum size, which avoids exhausting the memory in case of data with very large cardinality. In the algorithm shown in Figure 9, # is an abbreviated notation for "number of". When the algorithm terminates, the optimal number of offset bits is indicated by bestOffsetBits. In step 3, #offsetBits is decremented from N-1 to the minimum of 0. This pseudocode is shown to illustrate the most basic idea, and various variants and optimizations exist for practical application. Simple examples that can improve the speed of execution include starting the loop from a smaller number of offset bits instead of N-1, and testing only selected numbers of offset bits.
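The search described above can be sketched in Python as follows. This is not the pseudocode of Figure 9, which is not reproduced here; it is a minimal reading of the description, assuming a fixed-length dictionary code whose length is the base-2 logarithm of the dictionary size rounded up, a non-empty list of values, and hypothetical names throughout.

```python
def best_offset_bits(values, total_bits, max_dictionary_size):
    """Return (number of offset bits, code length in bits) of the best split."""
    # Baseline: all N bits are offset bits, i.e. the value stays unencoded.
    best_bits, best_size = total_bits, total_bits
    for offset_bits in range(total_bits - 1, -1, -1):
        dictionary_size = len({v >> offset_bits for v in values})
        if dictionary_size > max_dictionary_size:
            break    # dictionary would exceed the configured maximum
        # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
        code_size = (dictionary_size - 1).bit_length() + offset_bits
        if code_size < best_size:
            best_bits, best_size = offset_bits, code_size
    return best_bits, best_size


# Hypothetical 16-bit values.
values = [0x1203, 0x1210, 0x7F01, 0x7F02, 0xA000]
small = best_offset_bits(values, 16, max_dictionary_size=4)
large = best_offset_bits(values, 16, max_dictionary_size=8)
```

With at most 4 dictionary entries the search stops at 2 offset bits; with 8 entries allowed, all five values fit in the dictionary and the split degenerates to pure dictionary coding with 0 offset bits.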
Next we provide a formal proof of the split point property of prefix-offset encoding, which was previously described roughly as "the encoding efficiency improves with a smaller number of offset bits". For the sake of precision, we rephrase this property as follows: "the length of a prefix-offset code with S+1 offset bits is at least as large as that with S offset bits", or "the length of a prefix-offset code with S offset bits cannot be larger than that with S+1 offset bits".
Define the total number of bits of a value as N, and the number of offset bits as S with $1 < S < N$. The length of the prefix-offset code with S offset bits (namely $L_S$) is the sum of the length of the dictionary code for the prefix bits and the number of offset bits. The number of prefix bits is simply $P = N - S$, and the length of the dictionary code is determined as the logarithm of the size $D_P$ of the dictionary for the P prefix bits. That is, $L_S = \log_2(D_P) + S$. The length of the prefix-offset code with S+1 offset bits, and thus P-1 prefix bits, is $L_{S+1} = \log_2(D_{P-1}) + (S+1)$.
To prove $L_{S+1} \ge L_S$, i.e. $\log_2(D_{P-1}) + (S+1) \ge \log_2(D_P) + S$, we only need to prove $\log_2(D_P) - \log_2(D_{P-1}) \le 1$. This can be transformed to $D_P \le 2 \times D_{P-1}$, which is always true because each entry (P-1 prefix bits) of the dictionary of size $D_{P-1}$ can have at most 2 corresponding entries in the dictionary of size $D_P$, obtained by appending a trailing 0 or 1 to the (P-1)-bit entry.
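The inequality between the dictionary sizes for P and P-1 prefix bits can also be checked empirically. The following Python sketch uses randomly generated 16-bit values with a fixed seed; the data and the names are hypothetical, and the check is an illustration rather than a substitute for the proof.

```python
import random

random.seed(0)
N = 16
values = [random.getrandbits(N) for _ in range(500)]


def dictionary_size(values, prefix_bits, total_bits):
    """Number of distinct prefixes when keeping the leading prefix_bits bits."""
    return len({v >> (total_bits - prefix_bits) for v in values})


# sizes[p] is the dictionary size for p prefix bits (N - p offset bits).
sizes = [dictionary_size(values, p, N) for p in range(N + 1)]

# Adding one prefix bit can at most double the number of distinct prefixes.
assert all(sizes[p] <= 2 * sizes[p - 1] for p in range(1, N + 1))
```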
Figure 10 shows a block diagram of a data processing system 100 according to an embodiment of the invention. It is appreciated that schematic blocks are provided as an example to facilitate the understanding of the present invention; actual implementations of the invention may provide the described functionality with a different number and configuration of hardware and/or software blocks.
The data processing system 100 has an input component 110 for receiving data to be encoded and an encoding component 120 for encoding the received data. The encoding component 120 contains a binary representation component 122 that provides an order-preserving binary representation applicable for the received data. The component 122 may contain various order-preserving binary representations applicable to different data types. The division component 124 divides data values in the order-preserving binary representation into at least prefix bits and offset bits using a given number of the offset bits. The dictionary encoding component 126 encodes the prefix bits using an order-preserving dictionary coding, and this results in the prefix codes. The concatenation component 128 concatenates the prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes. These codes are then stored to the storage component 150. The storage component 150 may use memory or some persistent storage means, such as disk space, for storing the encoded data.
As discussed above, the encoding component 120 may contain a component 125 to determine the number of offset bits by minimizing the size of the order-preserving binary prefix-offset code under the constraint of a given maximal size for the prefix code dictionary. The size of the order-preserving binary prefix-offset code is the sum of the number of offset bits and the size of said prefix code.
The encoding component 120 typically contains a transformation component 121 for transforming received data in a first binary representation into a suitable order-preserving binary representation. If the received data is already in an order-preserving binary format, there is no need to activate the functionality of this transformation component 121.
The data processing system 100 typically also contains a query processing component 140 for performing a value scan on the order-preserving binary prefix-offset codes stored in the storage component 150. For processing compressed data, the query processing component 140 needs to have access to encoding schemas 130. The encoding schema 130 for the order-preserving binary prefix-offset encoding is determined by the binary order-preserving representation, the number of offset bits and the dictionary used for encoding the prefix bits.
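One way a value scan could exploit the order-preserving property is to translate the bounds of a range predicate into code space once, and then compare the stored codes directly without decoding them. The Python sketch below assumes a fixed-length dictionary given as a sorted list of prefixes; the function name and the handling of bounds whose prefix is missing from the dictionary are assumptions of this sketch, not details taken from the description.

```python
from bisect import bisect_left, bisect_right


def encode_bounds(lo, hi, sorted_prefixes, offset_bits):
    """Translate the value range [lo, hi] into an inclusive code range."""
    mask = (1 << offset_bits) - 1
    i = bisect_left(sorted_prefixes, lo >> offset_bits)
    if i < len(sorted_prefixes) and sorted_prefixes[i] == lo >> offset_bits:
        code_lo = (i << offset_bits) | (lo & mask)
    else:
        code_lo = i << offset_bits              # first code of the next prefix
    j = bisect_right(sorted_prefixes, hi >> offset_bits) - 1
    if j >= 0 and sorted_prefixes[j] == hi >> offset_bits:
        code_hi = (j << offset_bits) | (hi & mask)
    else:
        code_hi = ((j + 1) << offset_bits) - 1  # last code of the previous prefix
    return code_lo, code_hi


# Hypothetical 16-bit values encoded with 8 offset bits.
OFFSET_BITS = 8
values = [0x1203, 0x1210, 0x7F01, 0x7F02, 0xA000]
prefixes = sorted({v >> OFFSET_BITS for v in values})
codes = [(prefixes.index(v >> OFFSET_BITS) << OFFSET_BITS) | (v & 0xFF)
         for v in values]

code_lo, code_hi = encode_bounds(0x1205, 0x7F01, prefixes, OFFSET_BITS)
matches = [c for c in codes if code_lo <= c <= code_hi]
```

In this sketch the scan touches only the encoded codes; the dictionary is consulted twice per query, once for each bound.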
The data processing system 100 may be a database, preferably an in-memory database, for storing said order-preserving binary prefix-offset codes. As a further example, the data processing system 100 may contain a database for storing data on a persistent medium and a further in-memory database connected to the database, for uploading data from said database to the further in-memory database for enhanced processing .
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) .
Aspects of the present invention are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the appended claims a computerized method refers to a method whose steps are performed by a computing system containing a suitable combination of one or more processors, memory means and storage means.
While the foregoing has been with reference to particular embodiments of the invention, it will be appreciated by those skilled in the art that changes in these embodiments may be made without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims .

Claims

1. A computerized method for compressing data, comprising
providing an order-preserving binary representation for a data type;
determining a number of offset bits;
dividing data values in said order-preserving binary representation into prefix bits and offset bits;
encoding said prefix bits using an order-preserving dictionary coding, resulting in prefix codes;
concatenating said prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes.
2. Method of claim 1, wherein the size of the order-preserving binary prefix-offset code is the sum of the number of offset bits and the size of said prefix code, said method comprising determining the number of offset bits by minimizing the size of the order-preserving binary prefix-offset code under the constraint of a given maximal size for prefix code dictionary.
3. Method of any preceding claim, comprising transforming values in a first binary representation into values of said order-preserving binary representation.
4. Method of claim 3, said transformation comprising at least one of the following steps:
flipping the sign bit when said first binary representation is representing integer numbers as complements of 2 with a sign bit as the leftmost bit, said sign bit being set to 1 for negative values in said first binary representation;
flipping the sign bit for all values and flipping all mantissa bits and all fraction bits for negative values when said first binary representation is representing a floating point value $v = (-1)^{\text{sign}} \times 2^{\text{exponent} - \text{bias}} \times 1.\text{fraction}$ (binary) with a sequence of sign bit, mantissa bits and fraction bits; and
normalizing a coefficient to a fixed length, flipping the sign bit for all values and flipping all coefficient bits and all exponent bits for negative values, when said first binary representation is representing a decimal value $v = (-1)^{\text{sign}} \times \text{coefficient} \times 10^{\text{exponent} - \text{bias}}$ (coefficient in decimal) with a sequence of a sign bit, coefficient bits and exponent bits.
5. Method of any preceding claim, comprising performing a value scan on said order-preserving binary prefix-offset codes.
6. A data processing system, comprising
an input component for receiving data to be encoded; and
an encoding component for encoding received data, said encoding component adapted to
provide an order-preserving binary representation applicable for the received data;
divide data values in said order-preserving binary representation into at least prefix bits and offset bits using a given number of said offset bits;
encode said prefix bits using an order-preserving dictionary coding, resulting in prefix codes;
concatenate said prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes.
7. Data processing system of claim 6, said encoding component being adapted to determine the number of offset bits by minimizing the size of the order-preserving binary prefix-offset code under the constraint of a given maximal size for prefix code dictionary, wherein the size of the order-preserving binary prefix-offset code is the sum of the number of offset bits and the size of said prefix code.
8. Data processing system of claim 6 or 7, said encoding component being adapted to transform received data in a first binary representation into said order-preserving binary representation.
9. Data processing system of any one of claims 6 to 8, further comprising a query processing component for performing a value scan on said order-preserving binary prefix-offset codes.
10. Data processing system of any one of claims 6 to 9, comprising a database, preferably an in-memory database, for storing said order-preserving binary prefix-offset codes.
11. Data processing system of any one of claims 6 to 9, comprising a database for storing data on a persistent medium and a further database, preferably in-memory database, connected to said database, for uploading data from said database to said further database for enhanced processing.
12. A computer program product comprising a computer-usable medium and a computer readable program, wherein the computer readable program when executed on a data processing system causes the data processing system to carry out method steps of any one of claims 1 to 5.
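
By way of illustration only, and without limiting the claims, the steps of claims 1, 2 and 5 together with the first transformation of claim 4 may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the 32-bit width, the function names (to_order_preserving, build_dictionary, choose_offset_bits, encode), the sorted-list dictionary and the sample values are all hypothetical and do not appear in the application.

```python
from bisect import bisect_left

WIDTH = 32  # assumed bit width of the first binary representation


def to_order_preserving(value, width=WIDTH):
    """Flip the sign bit of a two's-complement integer (claim 4, first step)."""
    raw = value & ((1 << width) - 1)   # two's-complement bit pattern
    return raw ^ (1 << (width - 1))    # unsigned order now matches numeric order


def build_dictionary(values, offset_bits, width=WIDTH):
    """Sorted distinct prefixes; the index of a prefix serves as its code."""
    return sorted({to_order_preserving(v, width) >> offset_bits for v in values})


def code_size(prefixes, offset_bits):
    """Prefix-offset code size: prefix-code bits plus offset bits (claim 2)."""
    return (len(prefixes) - 1).bit_length() + offset_bits


def choose_offset_bits(values, max_dict_size, width=WIDTH):
    """Minimize the code size subject to the dictionary-size limit (claim 2)."""
    best = None
    for offset_bits in range(width + 1):
        prefixes = build_dictionary(values, offset_bits, width)
        if len(prefixes) > max_dict_size:
            continue  # dictionary too large for this split
        size = code_size(prefixes, offset_bits)
        if best is None or size < best[0]:
            best = (size, offset_bits, prefixes)
    return best[1], best[2]


def encode(value, prefixes, offset_bits, width=WIDTH):
    """Concatenate the prefix code and the offset bits (claim 1).

    Assumes the prefix of `value` is present in `prefixes`.
    """
    op = to_order_preserving(value, width)
    prefix = op >> offset_bits
    offset = op & ((1 << offset_bits) - 1)
    prefix_code = bisect_left(prefixes, prefix)  # order-preserving lookup
    return (prefix_code << offset_bits) | offset


# Hypothetical sample column and dictionary limit.
values = [-70000, -3, 0, 5, 17, 65536, 65540]
offset_bits, prefixes = choose_offset_bits(values, max_dict_size=4)
codes = [encode(v, prefixes, offset_bits) for v in values]

# Value scan (claim 5): evaluate "value < 5" on the codes without decoding.
threshold = encode(5, prefixes, offset_bits)
selected = [v for v, c in zip(values, codes) if c < threshold]
```

Because the dictionary index grows with the prefix and the offset bits are copied unchanged, comparing two codes gives the same result as comparing the original values. The scan above relies on the threshold value having a prefix in the dictionary; a threshold whose prefix is absent would need separate handling, which this sketch does not cover.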
PCT/EP2010/069089 2009-12-29 2010-12-07 Prefix-offset encoding method for data compression WO2011080031A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP09180918 2009-12-29
EP09180918.6 2009-12-29

Publications (1)

Publication Number Publication Date
WO2011080031A1 (en) 2011-07-07

Family

ID=43731802

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2010/069089 WO2011080031A1 (en) 2009-12-29 2010-12-07 Prefix-offset encoding method for data compression

Country Status (2)

Country Link
TW (1) TW201141081A (en)
WO (1) WO2011080031A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11742875B1 (en) * 2022-04-20 2023-08-29 Mediatek Inc. Compression of floating-point numbers for neural networks

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070050436A1 (en) 2005-08-26 2007-03-01 International Business Machines Corporation Order-preserving encoding formats of floating-point decimal numbers for efficient value comparison
US20090254521A1 (en) 2008-04-04 2009-10-08 International Business Machines Corporation Frequency partitioning: entropy compression with fixed size fields

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DECIMAL ARITHMETIC ENCODINGS, 7 April 2009 (2009-04-07), Retrieved from the Internet <URL:http://speleotrove.com/decimal/decbits.html>
KAREN MILLER: "Chapter 5 -- representations", 17 June 2007 (2007-06-17), University of Wisconsin, XP002633147, Retrieved from the Internet <URL:http://replay.waybackmachine.org/20070617222401/http://pages.cs.wisc.edu/~smoler/x86text/lect.notes/represent.html> [retrieved on 20110415] *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8653993B2 (en) 2009-12-29 2014-02-18 International Business Machines Corporation Data value occurrence information for data compression
CN102314697A (en) * 2011-07-20 2012-01-11 张行清 Data type-based numeric data compression and decompression method
CN102314697B (en) * 2011-07-20 2013-04-10 张行清 Data type-based numeric data compression and decompression method
EP4014128A4 (en) * 2019-08-16 2023-08-09 Advanced Micro Devices, Inc. Semi-sorting compression with encoding and decoding tables

Also Published As

Publication number Publication date
TW201141081A (en) 2011-11-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 10785451; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 10785451; Country of ref document: EP; Kind code of ref document: A1)