WO2011080031A1

WO2011080031A1 - Prefix-offset encoding method for data compression

Info

Publication number: WO2011080031A1
Application number: PCT/EP2010/069089
Authority: WO
Inventors: Oliver Draese; Peter Bendel; Tianchao Li; Vijayshankar Raman
Original assignee: International Business Machines Corporation
Priority date: 2009-12-29
Filing date: 2010-12-07
Publication date: 2011-07-07
Also published as: TW201141081A

Abstract

The following prefix-offset encoding method is order-preserving. An order-preserving binary representation is provided for a data type. The number of offset bits for the prefix-offset encoding is determined. Data values in said order-preserving binary representation are divided into prefix bits and offset bits. The prefix bits are encoded using an order-preserving dictionary coding, resulting in prefix codes. The prefix codes and respective offset bits are concatenated, and the result is order-preserving binary prefix-offset codes.

Description

PREFIX-OFFSET ENCODING METHOD FOR DATA COMPRESSION

BACKGROUND OF THE INVENTION

Field of the invention

The present invention relates in general to data compression and data encoding. In particular, the present invention relates to a prefix-offset encoding method for compressing data.

Related art

Data compression is an important aspect of various computing and storage systems. Here data warehouses are discussed in some detail as an example of systems where data compression is relevant, but it is appreciated that data compression and efficient handling of compressed data are relevant in many other systems where large amounts of data are stored. A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.

The effectiveness of data warehouses that employ table scans for fast processing of queries relies on efficient compression of the data. With an adequate data compression method, table scans can be applied directly to the compressed data, instead of having to decode each value first. Also, well designed algorithms can scan over multiple compressed values that are packed into one machine word in each loop iteration. Therefore, a shorter code typically means a faster table scan. The most commonly employed techniques include dictionary based compression and, for strings, offset-based compression and prefix-offset based compression.

Dictionary based compression encodes a value from a large value space, of which only a relatively small set of values actually occurs (low cardinality), with a dictionary code. Figure 1a shows an example of dictionary encoding for fruit name strings. In the example in Figure 1a, the well-known Huffman code is applied. Dictionary based compression is feasible only if the number of distinct values is limited so that a complete table of values and dictionary codes can be kept in the memory of the computer system. This assumption typically breaks down when the cardinality of values is very large: for example, a 64-bit floating point type has about 1.8E19 possible values, and dictionary encoding is not feasible.

Offset based compression compresses data by subtracting a common base value (the minimum of the value range) from each of the original values and uses the remaining offset to represent the original value. Figure 1b illustrates some examples of how offset based compression works for integer and decimal values. As the last row in the table of Figure 1b shows, normalization is usually applied to decimals. With a common base value applied to all values, the effectiveness of offset based compression depends strongly on the distribution of the original values. It is only efficient if the resulting offsets are on average much shorter than the original values, which is usually not the case for high volume data with very large cardinality. Otherwise, offset based compression will not be much more efficient than un-encoded data.

An extension of the basic offset-based compression approach would be to use multiple base values, each for a cluster of data. The automatic determination of the optimal set of base values is, however, quite difficult. Even more importantly, offset based compression implicitly requires that a minus (-) and a plus (+) operation are defined for the corresponding data type, which is not the case for many non-numerical data types like fixed/variable length strings, etc. Furthermore, the equation "Base = Base + Offset - Offset" must hold for all values involved. This is, however, not true for floating-point (e.g. float, double) values due to the inaccuracy of floating-point arithmetic operations. Such data types are typical for very large cardinality values.
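The round-trip requirement can be illustrated with a minimal Python sketch. The function name is hypothetical and the chosen operands are only an illustration; Python's `float` is an IEEE 754 double.

```python
def offset_round_trip_is_exact(base: float, offset: float) -> bool:
    """Return True if (base + offset) - offset reproduces base exactly."""
    return (base + offset) - offset == base


# Operands that are exactly representable in binary round-trip cleanly.
print(offset_round_trip_is_exact(0.5, 0.25))
# 0.1 and 0.2 are not exactly representable; the sum is rounded,
# so subtracting the offset again does not return the original base.
print(offset_round_trip_is_exact(0.1, 0.2))
```

An offset code built on such arithmetic could therefore decode to a value that differs from the stored one in the last bits.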

There is another type of compression, the prefix-offset compression, which encodes the literal of a value with a prefix code and an offset. This method is naturally applied to strings, where the existence of common "prefixes" is often observed. For example, the strings "United States of America", "United Kingdom" and "United Arab Emirates" all share the common prefix "United " and have the different offsets "States of America", "Kingdom" and "Arab Emirates". Applying dictionary encoding to the prefix allows the value to be stored more efficiently. Furthermore, by limiting the length of the prefix, the memory exhaustion problem of (pure) dictionary based compression is solved. For non-string data, applying prefix-offset compression to the literal is normally not efficient, since the literal representation of a value usually takes much more memory. For example, a 32 bit floating-point number can have a literal that takes nearly 20 bytes to store, e.g. 1.3000337465E188.

Order preserving codes, where the relative order of the codes is the same as the relative order of the original values, are important for the fast processing of queries that involve range predicates. A range predicate tests whether a data field is within a certain range of values. If the applied encoding method is not order-preserving, range predicates must be applied to the decoded values. This means that data needs to be decoded before queries involving range predicates can be processed. Offset encoding is generally order-preserving by nature, and some variants of dictionary based compression are order-preserving. For prefix-offset compression, preserving order is a major challenge for some data types.

The present invention aims at providing an order-preserving prefix-offset encoding method for numerical data types, for allowing efficient data scans on encoded values.

SUMMARY OF INVENTION

A first aspect of the invention provides a computerized method for compressing data, said method comprising the following steps:

providing an order-preserving binary representation for a data type;

determining a number of offset bits;

dividing data values in said order-preserving binary representation into prefix bits and offset bits;

encoding said prefix bits using an order-preserving dictionary coding, resulting in prefix codes;

concatenating said prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes.

A second aspect of the invention provides a data processing system comprising

an input component for receiving data to be encoded; and

an encoding component for encoding received data, said encoding component adapted to

provide an order-preserving binary representation applicable for the received data;

divide data values in said order-preserving binary representation into at least prefix bits and offset bits using a given number of said offset bits;

encode said prefix bits using an order-preserving dictionary coding, resulting in prefix codes;

concatenate said prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes.

A third aspect of the invention provides a computer program product comprising a computer-usable medium and a computer readable program.

BRIEF DESCRIPTION OF FIGURES

For a better understanding of the present invention and as to how the same may be carried into effect, reference will now be made by way of example only to the accompanying drawings in which:

Figure 1a shows, as a table, an example of dictionary based compression using Huffman coding;

Figure 1b shows some examples of offset-based encoding;

Figure 2 shows schematically, as an example, prefix-offset coding of a binary representation of a data value;

Figure 3 shows, as an example, a flowchart of a method in accordance with an embodiment of the present invention;

Figure 4a shows, as an example, the binary representation of floating point numbers using a sign bit, exponent bits and fraction bits;

Figure 4b shows, as an example, a binary representation of the floating point number 0.15625 according to IEEE 754-1985;

Figure 4c shows, as an example, prefix-offset encoding of a floating point number whose value is close to 0.15625 with sign bit flipping;

Figure 5 shows, as an example, some data types that are typical for very large cardinality data and where binary order-preserving prefix-offset compression according to an embodiment of the present invention can be used, but encoding is not feasible with dictionary encoding or offset encoding;

Figure 6 shows, as an example, an order-preserving binary representation for 16-bit integers with flipped sign bit;

Figure 7 shows, as an example, a table illustrating the transformation of the IEEE 754-1985 binary representation of floating point numbers into an order-preserving binary representation with examples of 32-bit floating point numbers;

Figure 8a shows, as an example, a table illustrating transformations of 32-bit decimal numbers into a binary order-preserving representation;

Figure 8b shows continuation of the table in Figure 8a;

Figure 9 shows, as an example, pseudocode for an algorithm suitable for determining the optimal split point between prefix and offset bits;

Figure 10 shows a block diagram of a data processing system according to an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

The basic idea in embodiments of the invention is to combine a binary variant of prefix-offset compression with an order-preserving dictionary encoding method and an order-preserving binary representation of values in order to derive an order-preserving prefix-offset code.

The proposed prefix-offset compression encodes a binary representation of a value with a code that consists of two concatenated parts: a prefix code (i.e. the dictionary code for the prefix bits) and the binary offset bits. Figure 2 illustrates the formation of a prefix-offset code out of the binary representation of data. Each value is first separated into prefix bits and offset bits. The prefix bits of all values are then encoded with an order-preserving dictionary encoding method, and the resulting dictionary code for the prefix bits (prefix code) is concatenated with the offset bits.

Figure 3 shows, as an example, a method 300 for compressing data according to an embodiment of the invention. The method 300 is typically implemented in a computing system. The computing system provides an order-preserving binary representation for a data type in step 301. For examples of such order-preserving binary representations for various data types, see details and examples below. In step 302, the computing system determines the number of offset bits. The number of offset bits may be a predefined number, or the optimal split point (between prefix and offset bits) may be determined using, for example, the algorithms described below.

In step 303, the computing system divides the data values in said order-preserving binary representation into at least prefix bits and offset bits, as Figure 2 shows. In step 304, the computing system encodes the prefix bits using an order-preserving dictionary coding; this dictionary encoding results in prefix codes. The simplest approach is to sort the distinct prefix bit patterns and assign dictionary codes in ascending order. The resulting code has a fixed length and can be applied very conveniently. Other possibilities also exist, and in fact any order-preserving dictionary encoding may be used in connection with the present invention. For example, the variable length dictionary code derived from the frequency-partitioning method described in the US patent application US20090254521A1 may be applied, in which case the resulting prefix-offset code will be of variable length but still order-preserving.

In step 305, the computing system concatenates the prefix codes and respective offset bits. This concatenation results in the order-preserving binary prefix-offset codes.
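Steps 303 to 305 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: it assumes that the values are already unsigned integers in an order-preserving binary representation, uses the simplest fixed-length dictionary (the rank of each sorted prefix), and all function names and sample values are hypothetical.

```python
from bisect import bisect_left


def build_prefix_dictionary(values, offset_bits):
    """Step 304 (setup): sort the distinct prefixes; a prefix's rank is its code."""
    return sorted({v >> offset_bits for v in values})


def encode(value, prefixes, offset_bits):
    """Steps 303 to 305 for a single value."""
    prefix = value >> offset_bits                  # step 303: prefix bits
    offset = value & ((1 << offset_bits) - 1)      # step 303: offset bits
    rank = bisect_left(prefixes, prefix)           # step 304: prefix code
    if rank == len(prefixes) or prefixes[rank] != prefix:
        raise KeyError("prefix is not in the dictionary")
    return (rank << offset_bits) | offset          # step 305: concatenation


def decode(code, prefixes, offset_bits):
    """Inverse mapping: look up the prefix by rank and re-attach the offset."""
    prefix = prefixes[code >> offset_bits]
    return (prefix << offset_bits) | (code & ((1 << offset_bits) - 1))


# Hypothetical 32-bit sample values; three of them share the same prefix.
OFFSET_BITS = 10
values = [0x3E200014, 0x3E200000, 0x3E2003FF, 0x41200000, 0x3F800001]
prefixes = build_prefix_dictionary(values, OFFSET_BITS)
codes = [encode(v, prefixes, OFFSET_BITS) for v in values]
print(len(prefixes), [format(c, "012b") for c in codes])
```

Because the prefix rank grows with the prefix and the offset bits are copied unchanged, comparing two codes as unsigned integers gives the same result as comparing the two original values.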

Figures 4a to 4c show an example of order-preserving binary prefix-offset coding of floating point numbers. The binary representation of a floating point value according to IEEE 754-1985 is separated into a sign bit, exponent bits, and fraction bits (Figure 4a), which form the value represented by the formula $v = (-1)^{sign} \times 2^{exponent - bias} \times 1.fraction_{(binary)}$. Note that the non-fraction part "1." is always omitted in the binary representation, and the fraction part is stored in a binary format with the first bit indicating $2^{-1}$, the second bit indicating $2^{-2}$, and so on.

Figure 4b shows an example with the 32-bit floating point number 0.15625:

$v = (-1)^{0} \times 2^{124-127} \times 1.01_{(binary)} = 1 \times 2^{-3} \times 1.25_{(decimal)} = 0.15625$

The bit in position 31 is the sign bit, bits in positions 30 to 23 represent the exponent (124 for this example), and the fraction part "01" is stored in bit positions 22 and 21.
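The field layout described above can be checked with a short Python sketch using the standard `struct` module. The function name is hypothetical; only the bit positions stated in the text are assumed.

```python
import struct


def float32_fields(x: float):
    """Split the IEEE 754 single-precision bit pattern of x into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # bit 31
    exponent = (bits >> 23) & 0xFF     # bits 30 to 23, biased by 127
    fraction = bits & 0x7FFFFF         # bits 22 to 0
    return sign, exponent, fraction


sign, exponent, fraction = float32_fields(0.15625)
# Print the three fields; the fraction is shown as 23 bits, most significant first.
print(sign, exponent, format(fraction, "023b"))
```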

In Figure 4b, the fraction part is translated from .01 (binary) to .25 (decimal), which is calculated as $0 \times 2^{-1} + 1 \times 2^{-2}$. Suppose we have a number as shown in Figure 4c, which is slightly different from 0.15625 (a common phenomenon seen, for example, in instrument measured data). With 10 offset bits and the prefix bits represented by the dictionary code 92 (1011100 in binary), the number 0.1562500186264514923095703125 is encoded into the 17-bit prefix-offset code 10111000000010100. For reasons discussed in more detail later, the sign bit in Figure 4c has been flipped so that the resulting prefix-offset coding is indeed order-preserving.

The encoding efficiency of binary prefix-offset encoding partially depends on the distribution of the values in the whole value space. More exactly, it depends on the "clustering" of values, which is a phenomenon commonly seen in many applications, for example:

sensor measurement results are often around an average value due to measurement errors;

prices are often around a certain value (1, 2, 5, 10, etc.), for example, 0.95, 0.98, 0.99, 1;

names often share common prefixes.

One thing to be noted is that the "clustering" of values must be visible in the binary representation of the values, with its prefix bits and offset bits. Most common binary representations preserve, to a certain degree, the major part (if not all) of the clustering effect by representing the semantically most important part of the value in the most significant bits. As seen in the previous examples, the IEEE 754-1985 standard represents a floating point value with the most significant bit depicting the sign, followed by several exponent bits, and then the fraction bits in order of descending weight (the first fraction bit stands for ½, the second for ¼, and so on).

Another part of the encoding efficiency comes inherently from the dictionary encoding of the prefix bits. It takes effect even when the data is distributed completely at random in its value space as seen from the binary representation.

Solely from the perspective of compression efficiency, binary prefix-offset compression cannot be better than (pure) dictionary based compression. This is due to waste in the offset bits, because usually not all possible combinations of the offset bits occur. Generally, it can be concluded that the encoding efficiency improves with a smaller number of offset bits (a formal proof will be provided later). In fact, (pure) dictionary based compression can be considered as one extreme case of (binary) prefix-offset compression, with 0 offset bits. In the other extreme case, when the number of offset bits is set to the maximum (the same as the length of the original value), the code is exactly the same as un-encoded data and thus is the most inefficient case.

The binary prefix-offset compression method described above works in many cases where the commonly applied dictionary based compression and offset based compression methods do not work. The table in Figure 5 summarizes the data types that are typical for very large cardinality data and where binary order-preserving prefix-offset compression can be used. For all these data types, dictionary based compression is not applicable when the data has very large cardinality. Furthermore, offset encoding cannot be applied to many data types because either the plus and minus operations are not defined, or adding and then subtracting the same value is not guaranteed to return the original value. For those data types where offset encoding is applicable, it might not be efficient because it assumes the existence of a common base value. Binary order-preserving prefix-offset compression thus can be used with various data types where dictionary encoding and/or offset encoding cannot be efficiently used.

The memory consumption of (binary) prefix-offset compression is dominated by the dictionary for encoding the prefix bits. If the number of prefix bits is n, the upper bound of the dictionary size is $2^n$. The memory exhaustion problem of (pure) dictionary based compression can be solved by limiting n. Due to the data clustering explained above, the actual size of the dictionary will in many situations be considerably less than this maximum. Therefore, a larger number of prefix bits and a smaller number of offset bits can be allowed, which also means higher encoding efficiency (see the comment on encoding efficiency above).

For the prefix-offset compression described above, whether or not the resulting code is order-preserving is dependent on the concrete binary representation of data (prefix bits and offset bits) . The prefix-offset code can be order preserving only if the binary representation is order-preserving. In addition, an order-preserving dictionary encoding must be applied for the dictionary encoding of the prefix bits to guarantee the resulting prefix-offset code to be order-preserving.

In order to guarantee the order-preserving characteristics of the derived prefix-offset code, the binary representation of data should be order-preserving. The unfortunate fact is that the commonly used binary representations of data values are seldom order-preserving. For example, IEEE 754-1985 is order-preserving only for positive floating point numbers, and IEEE 754-2008 for decimal numbers is not order-preserving.

Therefore, a transformation of the original binary representation into an order-preserving one is needed before deriving the prefix bits and offset bits. In the following, we briefly review methods of transforming the binary representation of typical data types into an order-preserving one.

Integers are commonly represented in computers using two's complement. This binary representation by nature fulfills our requirement for encoding efficiency, since the semantically more significant bits (possibly also the less frequently changing bits) are placed further to the left. A minor adjustment, flipping the sign bit, is sufficient to make it order-preserving for the whole value range. The table in Figure 6 shows an example with integers of 16 bits. The order of the two's complement patterns with the sign bit flipped matches exactly that of the integer values. Therefore, the derived prefix-offset code that concatenates the order-preserving dictionary code for the prefix bits and the offset bits is also order-preserving.
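The sign bit flip for 16-bit integers can be sketched in Python as below. The function names are hypothetical, and the sample values are illustrative rather than taken from Figure 6.

```python
def int16_to_ordered(value: int) -> int:
    """Two's complement with the sign bit flipped: the unsigned order of the
    resulting 16-bit pattern equals the signed order of the input."""
    if not -0x8000 <= value <= 0x7FFF:
        raise ValueError("value does not fit in 16 bits")
    return (value & 0xFFFF) ^ 0x8000


def ordered_to_int16(pattern: int) -> int:
    """Inverse transformation: flip the sign bit back and reinterpret."""
    raw = pattern ^ 0x8000
    return raw - 0x10000 if raw & 0x8000 else raw


for v in (-32768, -1, 0, 1, 32767):
    print(v, format(int16_to_ordered(v), "016b"))
```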

As mentioned above, floating point numbers commonly follow the IEEE standard 754-1985 and are represented with a sequence of a sign bit, exponent bits and fraction bits. Similar to integers in two's complement, some adjustments are needed to make the binary representation according to IEEE 754-1985 order-preserving for the whole value range. This includes the following steps: 1. flip all exponent bits and fraction bits for negative values, and 2. flip the sign bit for all values. The table in Figure 7 illustrates the transformation of IEEE 754-1985 into an order-preserving binary representation with examples of 32-bit floating point numbers. After bit flipping, the order of the binary representation matches exactly the order of the floating-point values. Therefore, the prefix-offset code derived by concatenating the order-preserving dictionary code for the prefix bits and the offset bits is also order-preserving.
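The two bit-flipping steps can be sketched in Python for 32-bit floating point numbers. This is a minimal illustration with a hypothetical function name; NaN values are assumed not to occur, and the sample values are not those of Figure 7.

```python
import struct


def float32_to_ordered(x: float) -> int:
    """Map a float32 to a 32-bit unsigned pattern whose unsigned order
    matches the numerical order of the floating point values."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    if bits & 0x80000000:
        bits ^= 0x7FFFFFFF       # step 1: flip exponent and fraction bits of negatives
    return bits ^ 0x80000000     # step 2: flip the sign bit of every value


for x in (-2.5, -0.15625, 0.0, 0.15625, 2.5):
    print(x, format(float32_to_ordered(x), "032b"))
```

Note that under this mapping negative zero sorts immediately below positive zero, which may or may not be desired in a given application.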

Decimal values are typically encoded as $v = (-1)^{sign} \times coefficient_{(decimal)} \times 10^{exponent - bias}$, in accordance with the IEEE standard 754-2008. This is not order-preserving, because it allows redundant encodings of the same value. For an order-preserving binary representation of decimals, such redundancy must be avoided by normalizing the coefficient (similar to the normalization applied in the last row of the table in Figure 1b). For example, by normalizing the coefficient to 7 digits, the coefficient of 999999 is 9999990, and the coefficient of 12.0 is 1200000. The fraction part of the coefficient is often stored as a compressed sequence of decimal digits as described, for example, in the US patent application US20070050436A1. This representation is not order-preserving. Also, the bit layout has the first digit (which only takes the values 0 through 9) compacted together with the two most significant bits of the exponent (which only take the values 0 through 2) in a five-bit combination field, as described in "Decimal Arithmetic Encodings", Version 1.01, 7 April 2009 (downloadable from http://speleotrove.com/decimal/decbits.html). To guarantee order-preserving characteristics, both of these compressions should be avoided and replaced with a simple binary representation that includes, in sequence, one sign bit, several exponent bits, and then the coefficient bits. For example,

$999999 = (-1)^{0} \times 9999990 \times 10^{100-101}$ = 0 01100100 100110001001011001110110

and

$12.0 = (-1)^{0} \times 1200000 \times 10^{96-101}$ = 0 01100000 000100100100111110000000
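The simple sign, exponent, coefficient layout can be reproduced with the following Python sketch. It assumes a 7-digit coefficient stored in 24 bits, an 8-bit exponent field, and the bias of 101 used in the two examples above; the function name is hypothetical, only finite values are handled, and the bit flipping for negative numbers is not included.

```python
from decimal import Decimal

COEFF_DIGITS = 7   # coefficient length of decimal32
BIAS = 101         # bias as used in the two examples above


def decimal32_simple_bits(text: str) -> str:
    """Return 'sign exponent coefficient' as bit strings for a finite decimal."""
    sign, digits, exponent = Decimal(text).as_tuple()
    if len(digits) > COEFF_DIGITS:
        raise ValueError("more than 7 significant digits")
    pad = COEFF_DIGITS - len(digits)              # normalize to 7 digits
    coefficient = int("".join(map(str, digits))) * 10 ** pad
    biased = exponent - pad + BIAS                # compensate for the padding
    if not 0 <= biased <= 0xFF:
        raise ValueError("exponent does not fit in 8 bits")
    return f"{sign:01b} {biased:08b} {coefficient:024b}"


print(decimal32_simple_bits("999999"))
print(decimal32_simple_bits("12.0"))
```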

This binary representation requires one bit more than the compact format to encode the same range of values. For example, decimal32 has one sign bit, a coefficient of 7 digits that can be stored in 24 bits, and an exponent range of -95 to 96 that can be stored in 8 bits. This amounts to 33 bits instead of the original 32 bits. However, this additional bit does not have much impact on the encoding efficiency. Similar to the case of floating point numbers described above, bit flipping must be applied to guarantee the order-preserving characteristics for negative numbers. The two bit flipping steps listed above for floating point numbers also apply to decimal numbers in the simple binary representation above. The tables in Figures 8a and 8b present examples using decimal32 numbers.

Order-preserving binary representations for other typical data types in databases are also available. As some examples, consider the following. A timestamp, defined as the number of seconds since 1970, is naturally order-preserving when stored using the binary representation of an unsigned integer. An order-preserving binary representation for date and time can be defined, for example, by storing Year, Month, Day, Hour, Minute, Seconds, and Microseconds as unsigned integers using the minimal number of bits and concatenating them in sequence.
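A possible bit packing for date and time is sketched below in Python. The field widths are assumptions chosen as the minimal widths for the usual value ranges (the 14-bit year, covering 0 to 9999, is in particular an assumption), and the function name is hypothetical.

```python
# (field name, bit width); the widths are illustrative assumptions.
FIELDS = (
    ("year", 14), ("month", 4), ("day", 5), ("hour", 5),
    ("minute", 6), ("second", 6), ("microsecond", 20),
)


def pack_datetime(year, month, day, hour=0, minute=0, second=0, microsecond=0):
    """Concatenate the fields, most significant first, into one unsigned integer."""
    parts = (year, month, day, hour, minute, second, microsecond)
    packed = 0
    for (name, width), value in zip(FIELDS, parts):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} out of range")
        packed = (packed << width) | value
    return packed


earlier = pack_datetime(2009, 12, 29)
later = pack_datetime(2010, 12, 7, 8, 30)
print(earlier < later)
```

Since the most significant field comes first and every field has a fixed width, comparing the packed integers is equivalent to comparing the dates field by field.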

It is appreciated that in addition to the order-preserving binary representations discussed above, any other order-preserving binary representation may be used in connection with the present invention.

The number of offset bits is an important factor for the effectiveness of the binary prefix-offset compression described above. In some applications of embodiments of the present invention, the number of offset bits may be predefined. If automatic detection of the optimal split point between the prefix and offset bits is employed, it needs to consider two properties of prefix-offset encoding that have already been discussed: 1) with an increasing number of prefix bits n, the upper bound of the dictionary size, i.e. $2^n$, increases exponentially, and 2) the encoding efficiency also increases with an increasing number of prefix bits.

The optimal split point can be automatically detected by comparing different possibilities. Figure 9 shows pseudocode for an algorithm suitable for this purpose. The target is to minimize the size of the prefix-offset code, which includes the dictionary code for the prefix bits and the offset bits, under the constraint of a specified maximal size of the dictionary. The search starts from a maximum number of offset bits that equals the size N, in bits, of the un-encoded data (assuming the data has a fixed length) and proceeds to a minimum of 0 offset bits. For N offset bits, the size of the encoded value does not actually need to be calculated, since in this case the value is kept unencoded and its size is N. The search also stops when the size of the dictionary for the prefix bits exceeds a pre-configured maximum size, which avoids exhausting the memory in the case of data with very large cardinality. In the algorithm shown in Figure 9, # is an abbreviated notation for "number of". When the algorithm terminates, the optimal number of offset bits is indicated by bestOffsetBits. In step 3, #offsetBits is decremented from N-1 to the minimum of 0. This pseudocode is shown to illustrate the most basic idea, and various variants and optimizations exist for practical use. Simple examples that can improve the speed of execution include starting the loop from a smaller number of offset bits instead of N-1, and testing only selected numbers of offset bits.
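Since Figure 9 is not reproduced here, the following Python sketch only follows the textual description of the search. It assumes a fixed-length dictionary code whose length is the logarithm of the dictionary size rounded up to whole bits; the function and parameter names are hypothetical.

```python
def find_best_offset_bits(values, total_bits, max_dict_size):
    """Search for the number of offset bits that minimizes the code length.

    values        -- non-empty iterable of unsigned integers in an
                     order-preserving binary representation
    total_bits    -- N, the length of an un-encoded value in bits
    max_dict_size -- largest allowed number of dictionary entries
    """
    values = list(values)
    best_offset_bits = total_bits      # N offset bits: value kept unencoded
    best_code_bits = total_bits
    for offset_bits in range(total_bits - 1, -1, -1):
        dict_size = len({v >> offset_bits for v in values})
        if dict_size > max_dict_size:
            break                      # dictionary would exceed the memory limit
        # Bits of a fixed-length dictionary code, i.e. ceil(log2(dict_size)).
        prefix_code_bits = (dict_size - 1).bit_length()
        code_bits = prefix_code_bits + offset_bits
        if code_bits < best_code_bits:
            best_offset_bits, best_code_bits = offset_bits, code_bits
    return best_offset_bits, best_code_bits


# 1024 consecutive 32-bit patterns that all share the same upper 22 bits.
clustered = [0x3E200000 + i for i in range(1024)]
print(find_best_offset_bits(clustered, total_bits=32, max_dict_size=16))
```

In this sketch a tie in code length keeps the larger number of offset bits, which gives the smaller dictionary; replacing `<` with `<=` would instead prefer the longer prefix.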

Next we provide a formal proof of the split point property of prefix-offset encoding, which was described informally above as "the encoding efficiency improves with a smaller number of offset bits". For the sake of precision, we rephrase this property as follows: "the length of a prefix-offset code with S+1 offset bits is at least as large as that of a code with S offset bits", or "the length of a prefix-offset code with S offset bits cannot be larger than that of a code with S+1 offset bits".

Define the total number of bits of a value as N, and the number of offset bits as S with $1 < S < N$. The length of the prefix-offset code with S offset bits (namely $L_S$) is the sum of the length of the dictionary code for the prefix bits and the length of the offset bits. The length of the prefix bits is simply $P = N - S$, and the length of the dictionary code is determined as the logarithm of the size $D_P$ of the dictionary for the prefix bits. That is, $L_S = \log_2(D_P) + S$. Likewise, the length of the prefix-offset code with S+1 offset bits, and hence P-1 prefix bits, is $L_{S+1} = \log_2(D_{P-1}) + (S+1)$.

To prove $L_{S+1} \ge L_S$, i.e. $\log_2(D_{P-1}) + (S+1) \ge \log_2(D_P) + S$, we only need to prove $\log_2(D_P) - \log_2(D_{P-1}) \le 1$. This is now quite obvious because it can be transformed to $D_P \le 2 \times D_{P-1}$, which is always true because each entry (P-1 prefix bits) in the dictionary of size $D_{P-1}$ can have at most 2 corresponding entries in the dictionary of size $D_P$, obtained by appending a trailing 0 or 1 to the (P-1)-bit entry.

Figure 10 shows a block diagram of a data processing system 100 according to an embodiment of the invention. It is appreciated that schematic blocks are provided as an example to facilitate the understanding of the present invention; actual implementations of the invention may provide the described functionality with a different number and configuration of hardware and/or software blocks.

The data processing system 100 has an input component 110 for receiving data to be encoded and an encoding component 120 for encoding the received data. The encoding component 120 contains a binary representation component 122 that provides an order-preserving binary representation applicable for the received data. The component 122 may contain various order-preserving binary representations applicable to different data types. The division component 124 divides data values in the order-preserving binary representation into at least prefix bits and offset bits using a given number of the offset bits. The dictionary encoding component 126 encodes the prefix bits using an order-preserving dictionary coding, and this results in the prefix codes. The concatenation component 128 concatenates the prefix codes and respective offset bits, resulting in order-preserving binary prefix-offset codes. These codes are then stored to the storage component 150. The storage component 150 may use memory or some persistent storage means, such as disk space, for storing the encoded data.

As discussed above, the encoding component 120 may contain a component 125 to determine the number of offset bits by minimizing the size of the order-preserving binary prefix-offset code under the constraint of a given maximal size for the prefix code dictionary. The size of the order-preserving binary prefix-offset code is the sum of the number of offset bits and the size of said prefix code.

The encoding component 120 typically contains a transformation component 121 for transforming received data in a first binary representation into a suitable order-preserving binary representation. If the received data is already in an order-preserving binary format, there is no need to activate the functionality of this transformation component 121.

The data processing system 100 typically also contains a query processing component 140 for performing a value scan on the order-preserving binary prefix-offset codes stored in the storage component 150. For processing compressed data, the query processing component 140 needs to have access to encoding schemas 130. The encoding schema 130 for the order-preserving binary prefix-offset encoding is determined by the binary order-preserving representation, the number of offset bits and the dictionary used for encoding the prefix bits.
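How a range predicate might be evaluated directly on the stored codes, using only the encoding schema, is sketched below in Python. This is an illustration under the assumption of a fixed-length dictionary code equal to the rank of the sorted prefix; all names and sample values are hypothetical and do not describe the query processing component 140 itself.

```python
from bisect import bisect_left


def lower_bound_code(value, prefixes, offset_bits):
    """Smallest point in code space such that every stored code at or above it
    decodes to a value >= the given value."""
    prefix = value >> offset_bits
    offset = value & ((1 << offset_bits) - 1)
    rank = bisect_left(prefixes, prefix)
    if rank < len(prefixes) and prefixes[rank] == prefix:
        return (rank << offset_bits) | offset
    return rank << offset_bits           # first code of the next larger prefix


def upper_bound_code(value, prefixes, offset_bits):
    """Largest point in code space such that every stored code at or below it
    decodes to a value <= the given value."""
    prefix = value >> offset_bits
    offset = value & ((1 << offset_bits) - 1)
    rank = bisect_left(prefixes, prefix)
    if rank < len(prefixes) and prefixes[rank] == prefix:
        return (rank << offset_bits) | offset
    return (rank << offset_bits) - 1     # last code of the next smaller prefix


def scan_range(codes, low, high, prefixes, offset_bits):
    """Return the positions whose original value lies in [low, high],
    comparing codes only; no stored code is decoded."""
    lo = lower_bound_code(low, prefixes, offset_bits)
    hi = upper_bound_code(high, prefixes, offset_bits)
    return [i for i, c in enumerate(codes) if lo <= c <= hi]


OFFSET_BITS = 4
values = [3, 18, 21, 70, 75, 200]
prefixes = sorted({v >> OFFSET_BITS for v in values})
codes = [(prefixes.index(v >> OFFSET_BITS) << OFFSET_BITS) | (v & 0xF) for v in values]
hits = scan_range(codes, 20, 80, prefixes, OFFSET_BITS)
print(hits)
```

Only the two predicate bounds are translated into code space; the scan itself then consists of plain integer comparisons on the compressed column.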

The data processing system 100 may be a database, preferably an in-memory database, for storing said order-preserving binary prefix-offset codes. As a further example, the data processing system 100 may contain a database for storing data on a persistent medium and a further in-memory database connected to the database, for uploading data from said database to the further in-memory database for enhanced processing.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In the appended claims a computerized method refers to a method whose steps are performed by a computing system containing a suitable combination of one or more processors, memory means and storage means.

While the foregoing has been described with reference to particular embodiments of the invention, it will be appreciated by those skilled in the art that changes in these embodiments may be made without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims.

Claims

1. A computerized method for compressing data, comprising

providing an order-preserving binary representation for a data type;

determining a number of offset bits;

encoding said prefix bits using an order-preserving dictionary coding, resulting in prefix codes;

2. Method of claim 1, wherein the size of the order-preserving binary prefix-offset code is the sum of the number of offset bits and the size of said prefix code, said method comprising determining the number of offset bits by minimizing the size of the order-preserving binary prefix-offset code under the constraint of a given maximal size for prefix code dictionary.
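The selection of the number of offset bits described in claim 2 can be illustrated with a minimal Python sketch. It is an illustration only, not the claimed implementation: the function names `choose_offset_bits`, `encode` and `decode`, the exhaustive search over all candidate offset widths, and the use of a fixed-length prefix code equal to the rank of the prefix in a sorted dictionary are assumptions introduced here. Values are assumed to be non-negative integers already in an order-preserving binary representation of `width` bits, and the value list is assumed to be non-empty.

```python
from bisect import bisect_left


def choose_offset_bits(values, width, max_dict_size):
    """Pick the offset width that minimizes offset bits + prefix-code bits,
    subject to the prefix dictionary holding at most max_dict_size entries."""
    best = None
    for offset_bits in range(width + 1):
        # Distinct prefixes, sorted so that rank order equals value order.
        prefixes = sorted({v >> offset_bits for v in values})
        if len(prefixes) > max_dict_size:
            continue
        # Bits needed for a fixed-length code over len(prefixes) entries.
        prefix_code_bits = (len(prefixes) - 1).bit_length()
        size = offset_bits + prefix_code_bits
        if best is None or size < best[0]:
            best = (size, offset_bits, prefixes)
    return best[1], best[2]


def encode(value, offset_bits, dictionary):
    """Concatenate the prefix code (rank in the dictionary) with the offset bits."""
    rank = bisect_left(dictionary, value >> offset_bits)
    return (rank << offset_bits) | (value & ((1 << offset_bits) - 1))


def decode(code, offset_bits, dictionary):
    """Recover the original value from a prefix-offset code."""
    prefix = dictionary[code >> offset_bits]
    return (prefix << offset_bits) | (code & ((1 << offset_bits) - 1))


values = [3, 1000, 1001, 1002, 65000, 65001]
offset_bits, dictionary = choose_offset_bits(values, width=16, max_dict_size=4)
codes = [encode(v, offset_bits, dictionary) for v in values]
```

Because the dictionary is sorted and the offset bits are copied unchanged, comparing two codes gives the same result as comparing the two original values, which is the order-preserving property the claims rely on.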

3. Method of any preceding claim, comprising transforming values in a first binary representation into values of said order-preserving binary representation.

4. Method of claim 3, said transformation comprising at least one of the following steps:

flipping the sign bit when said first binary representation is representing integer numbers as complements of 2 with a sign bit as the leftmost bit, said sign bit being set to 1 for negative values in said first binary representation;

flipping the sign bit for all values and flipping all mantissa bits and all fraction bits for negative values when said first binary representation is representing a floating point value $v = (-1)^{\mathrm{sign}} \times 2^{\mathrm{exponent} - \mathrm{bias}} \times 1.\mathrm{fraction}_{(\mathrm{binary})}$ with a sequence of sign bit, mantissa bits and fraction bits; and

normalizing a coefficient to a fixed length, flipping the sign bit for all values and flipping all coefficient bits and all exponent bits for negative values, when said first binary representation is representing a decimal value $v = (-1)^{\mathrm{sign}} \times \mathrm{coefficient}_{(\mathrm{decimal})} \times 10^{\mathrm{exponent} - \mathrm{bias}}$ with a sequence of a sign bit, coefficient bits and exponent bits.
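The first two transformations of claim 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the claimed implementation: the names `int_key` and `float_key` are hypothetical, integers are assumed to be 32-bit two's complement by default, and floating point values are assumed to be IEEE 754 binary64, where "flipping the sign bit and all remaining bits for negative values" amounts to inverting the whole 64-bit pattern. NaN values are not handled.

```python
import struct


def int_key(value, width=32):
    """Map a signed integer to an unsigned key whose order matches numeric order."""
    raw = value & ((1 << width) - 1)      # two's-complement bit pattern
    return raw ^ (1 << (width - 1))       # flip the sign bit


def float_key(x):
    """Map a binary64 float to an unsigned 64-bit key whose order matches numeric order."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    if raw >> 63:                         # negative: flip the sign bit and all other bits
        return raw ^ 0xFFFFFFFFFFFFFFFF
    return raw ^ (1 << 63)                # non-negative: flip only the sign bit


ints = [-5, -1, 0, 1, 7]
floats = [-2.5, -1.0, 0.0, 1.5, 3e10]
int_keys = [int_key(v) for v in ints]
float_keys = [float_key(x) for x in floats]
```

Both key lists come out in ascending order for ascending inputs, so the keys can be compared as plain unsigned integers before the prefix-offset split is applied.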

5. Method of any preceding claim, comprising performing a value scan on said order-preserving binary prefix-offset codes.
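A value scan as in claim 5 can evaluate a range predicate directly on the codes, without decoding each value, because the encoding preserves order. The Python sketch below illustrates one possible way to do this. The names `code_bounds` and `scan`, the inclusive range semantics, and the sample dictionary and codes are assumptions made for the example; the prefix code is assumed to be the rank of the prefix in a sorted dictionary.

```python
from bisect import bisect_left, bisect_right


def code_bounds(lo, hi, dictionary, offset_bits):
    """Translate the value range [lo, hi] into an inclusive code range.

    Returns None when no code can fall inside the range."""
    mask = (1 << offset_bits) - 1

    # Lower bound: smallest possible code whose value is >= lo.
    p_lo = lo >> offset_bits
    i = bisect_left(dictionary, p_lo)
    if i < len(dictionary) and dictionary[i] == p_lo:
        code_lo = (i << offset_bits) | (lo & mask)
    else:
        code_lo = i << offset_bits        # first code of the next larger prefix

    # Upper bound: largest possible code whose value is <= hi.
    p_hi = hi >> offset_bits
    j = bisect_right(dictionary, p_hi) - 1
    if j < 0:
        return None                       # every stored prefix is above hi
    if dictionary[j] == p_hi:
        code_hi = (j << offset_bits) | (hi & mask)
    else:
        code_hi = (j << offset_bits) | mask   # last code of the next smaller prefix

    return (code_lo, code_hi) if code_lo <= code_hi else None


def scan(codes, bounds):
    """Return row positions whose code lies inside the inclusive code range."""
    if bounds is None:
        return []
    code_lo, code_hi = bounds
    return [row for row, c in enumerate(codes) if code_lo <= c <= code_hi]


# Sample column: values 3, 1000, 1001, 1002, 65000, 65001 encoded with 1 offset bit.
dictionary = [1, 500, 501, 32500]
offset_bits = 1
codes = [1, 2, 3, 4, 6, 7]
rows = scan(codes, code_bounds(900, 1001, dictionary, offset_bits))
```

The predicate bounds are translated once per query, after which the scan loop only performs integer comparisons on the compressed codes.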

6. A data processing system, comprising

an input component for receiving data to be encoded; and

an encoding component for encoding received data, said encoding component adapted to

divide data values in said order-preserving binary

7. Data processing system of claim 6, said encoding component being adapted to determine the number of offset bits by minimizing the size of the order-preserving binary prefix-offset code under the constraint of a given maximal size for prefix code dictionary, wherein the size of the order-preserving binary prefix-offset code is the sum of the number of offset bits and the size of said prefix code.

8. Data processing system of claim 6 or 7, said encoding component being adapted to transform received data in a first binary representation into said order-preserving binary representation.

9. Data processing system of any one of claims 6 to 8, further comprising a query processing component for performing a value scan on said order-preserving binary prefix-offset codes.

10. Data processing system of any one of claims 6 to 9, comprising a database, preferably an in-memory database, for storing said order-preserving binary prefix-offset codes.

11. Data processing system of any one of claims 6 to 9, comprising a database for storing data on a persistent medium and a further database, preferably in-memory database, connected to said database, for uploading data from said database to said further database for enhanced processing.

12. A computer program product comprising a computer-usable medium and a computer readable program, wherein the computer readable program when executed on a data processing system causes the data processing system to carry out method steps of any one of claims 1 to 5.