WO2011014182A1 - Non-greedy differential compression - Google Patents

Non-greedy differential compensation Download PDF

Info

Publication number
WO2011014182A1
WO2011014182A1 (PCT/US2009/052377)
Authority
WO
WIPO (PCT)
Prior art keywords
symbols
entry
input data
string
reference file
Application number
PCT/US2009/052377
Other languages
French (fr)
Inventor
Krishnamurthy Viswanathan
Ram Swaminathan
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Priority to PCT/US2009/052377 priority Critical patent/WO2011014182A1/en
Publication of WO2011014182A1 publication Critical patent/WO2011014182A1/en


Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3088Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78

Definitions

  • Data structures and digital signals often contain sets of symbols that either precisely or approximately reappear many times in the data structures or digital signals. Compression methods can use this redundancy to reduce the size of a data structure or signal by replacing the most repeated sets of symbols with shorter codes.
  • Such data and signal compression principles are well known and used in a variety of applications. For example, information transmitted as a digital signal by modems, networks, and similar systems, can be compressed to improve the effective bandwidth of a transmission and decrease the time required for the transmission.
  • Data storage in computer systems, computer peripherals, removable media such as CD and DVD disks, consumer devices such as telephones and music and video players, and many other devices that store digital data can also use compression to increase the amount of data that can be stored with the available resources.
  • Compression processes for strings of symbols generally involve parsing or partitioning of the string into a concatenation of substrings and encoding each substring. For example, many compression processes compare each substring in a partition of an original string to entries in a dictionary and when a match is found, encode the substring using an index or pointer that identifies the matching entry in the dictionary. The parsing of the original string to create the partition can be done in a greedy fashion. With greedy parsing, a compression process looks for the longest substring beginning with the starting symbol of the string and having a match in the dictionary. The longest substring is encoded using the dictionary, and the compression process looks for a longest substring that immediately follows the substring just encoded and is in the dictionary. Compression using greedy parsing operates on the assumption that encoding a series of "longest" substrings generally results in the most compression.
  • Non-greedy parsing for a compression process does not assume that partitioning a string into a series of substrings with each being the longest encodable substring at its location provides the most compression. In some cases, encoding a shorter substring at one point in the string can allow improved compression further along in the string, for example, when a non-greedy partition of a string includes shorter strings in some locations but even longer substrings elsewhere. Non-greedy parsing can thus improve compression at least for some strings, but non-greedy parsing generally requires more processing power.
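As a toy illustration of this difference (not an example from the patent), suppose every dictionary reference costs one unit, so a parsing with fewer parts compresses better. Greedy parsing can then lose to a non-greedy dynamic program; the dictionary and helper names below are hypothetical:

```python
# Toy setup: each dictionary reference is assumed to cost one unit,
# so fewer parts == better compression.
DICT = {"a", "b", "c", "d", "ab", "bcd"}

def greedy_parse(s, dictionary):
    """Repeatedly take the longest dictionary match at the current position."""
    parts, i = [], 0
    while i < len(s):
        best = max(j for j in range(i + 1, len(s) + 1) if s[i:j] in dictionary)
        parts.append(s[i:best])
        i = best
    return parts

def nongreedy_parse(s, dictionary):
    """Dynamic program minimizing the number of dictionary parts."""
    n = len(s)
    cost = [0] + [float("inf")] * n   # cost[i] = min parts to encode s[:i]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            if s[i:j] in dictionary and cost[i] + 1 < cost[j]:
                cost[j], back[j] = cost[i] + 1, i
    parts, j = [], n
    while j > 0:
        parts.append(s[back[j]:j])
        j = back[j]
    return parts[::-1]

print(greedy_parse("abcd", DICT))     # ['ab', 'c', 'd']  (3 parts)
print(nongreedy_parse("abcd", DICT))  # ['a', 'bcd']      (2 parts)
```

Greedy parsing grabs "ab" first and is then forced into two more single-symbol parts, while the non-greedy parse takes the shorter "a" so that "bcd" can be encoded as one part.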
  • Differential compression generally refers to compression of input data relative to a reference file where differences may be included in the codes for the separately encoded parts of the input data.
  • One way to perform differential compression on a string is to partition the string into a concatenation of substrings, and then for each substring, find candidate approximate matches in a reference string or file and encode the substring by pointing to a chosen approximate match and describing the difference between the substring and the chosen approximate match.
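A minimal sketch of that idea, assuming byte strings, a brute-force search over all reference positions, and a simple (offset, byte) list as the difference description (all implementation choices of this sketch, not details from the patent):

```python
def encode_part(part: bytes, ref: bytes) -> tuple:
    """Encode `part` as (pointer, [(offset, byte), ...]) against `ref`,
    choosing the reference position with the fewest byte mismatches."""
    best_p, best_diff = 0, None
    for p in range(len(ref) - len(part) + 1):
        diff = [(o, b) for o, b in enumerate(part) if ref[p + o] != b]
        if best_diff is None or len(diff) < len(best_diff):
            best_p, best_diff = p, diff
    return best_p, best_diff

def decode_part(code, length: int, ref: bytes) -> bytes:
    """Rebuild the part from the pointer and the difference list."""
    p, diff = code
    out = bytearray(ref[p:p + length])
    for o, b in diff:
        out[o] = b
    return bytes(out)

ref = b"the quick brown fox"
part = b"quack"
code = encode_part(part, ref)
print(code)  # (4, [(2, 97)]) -- points at "quick", one differing byte
assert decode_part(code, len(part), ref) == part
```

The roundtrip shows both halves of the scheme: the pointer identifies the chosen approximate match in the reference, and the difference list carries only the symbols that disagree.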
  • Fig. 1 illustrates the relationships of and some notations used for a string containing symbols and substrings as described herein.
  • Fig. 2 shows a graph structure illustrating a method for non-greedy parsing in a compression process.
  • Fig. 3 illustrates the relationships of and notations used for an input string and a reference file in a differential encoding process in accordance with an embodiment of the invention.
  • Fig. 4 is a flow diagram of a compression process in accordance with an embodiment of the invention.
  • Fig. 5 illustrates the logical structure of tables that may be generated during a compression process in accordance with an embodiment of the invention.
  • Fig. 6 shows a sparse graph illustrating a method for non-greedy parsing of a string for a compression process in accordance with an embodiment of the invention.
  • Fig. 7 is a block diagram representing a compression system in accordance with an embodiment of the invention.
  • the speed of differential compression processes and systems using non-greedy parsing can be increased.
  • the increase in compression process speed can result from limiting candidate approximate matches to those that have the same set of initial symbols as in a part (e.g., substring) that may be encoded during compression.
  • a table can be generated for each starting location of substrings in the string being compressed, and each table can contain entries associated with entry pointers or indices identifying the candidate approximate matches to all substrings that begin at the starting location.
  • Each entry can further include mismatch identifiers, which may be offsets from the symbol indicated by the entry pointer to symbols in the reference file that do not match symbols at the same location relative to the starting location in the string being compressed.
  • Each entry can further include a maximum length in the reference file of a suitable candidate approximate match that begins with the symbol identified by the entry pointer.
  • the tables for different starting locations in the string being compressed can be generated in a process that compares symbols in the reference file to symbols in the string.
  • entries for two different tables will often be related, so that an entry for one table can often be efficiently generated from an entry in an already generated table.
  • Each table thus generated provides information used for determining the differential encoding cost for every substring that begins at the starting location associated with the table.
  • a substring at the starting location associated with a table can be considered to have infinite encoding costs if the substring is longer than the maximum lengths of candidate approximate matches in the table.
  • Making some coding costs infinite reduces the number of edges in the graph structure evaluated for identification of a non-greedy parsing of the string being compressed. As a result, the graph is sparser than is normally required for non-greedy parsing, and identifying the non-greedy parsing may require less processing time.
  • the improvement in compression speed may sacrifice the optimality of the compression process, i.e., produce larger compressed files, but in most cases, particularly when the input data being compressed is an executable file, the loss in compression may be negligible.
  • the compression process can be applied to one-dimensional or multi-dimensional data structures. However, the example of one-dimensional strings is primarily described below for illustration of specific applications.
  • Fig. 1 illustrates a string X that may be compressed using systems and processes in accordance with the invention.
  • String X can be a one-dimensional ordered data structure, a digital file, or the input data to be represented by a transmitted signal.
  • string X includes n symbols x_1, x_2, ..., x_n that have an order that can be defined according to the properties or uses of string X.
  • the order may be a temporal order for the transmission of the symbols in string X, an order for the addresses of storage locations in a buffer or memory storing string X, or a nominal program order for executable instructions that make up the string X.
  • the subscripts of symbols x_1 to x_n indicate the positions of symbols x_1 to x_n within the order assigned in string X.
  • Each symbol x_i for an index i between 1 and n can generally be a symbol from some known alphabet.
  • the number of symbols in the alphabet is generally independent of the number n of symbols in string X.
  • each symbol x_1 to x_n is a byte value, and the alphabet contains the values from 0 to 255.
  • a compression process performed on string X generally requires partitioning string X into substrings.
  • the notation x[i:j] is used herein to denote the substring of symbols x_i, x_{i+1}, ..., x_j from string X, where indices i and j satisfy the relations 1 ≤ i ≤ j ≤ n.
  • the compression process can encode or describe each substring or part from the partition, preferably using a code or description that is shorter (e.g., contains fewer bits) than the substring or part.
  • the parsing process that selects the partition of string X can be critical to the efficiency of the encoding process.
  • a non-greedy parsing of string X involves selecting a partition of string X without requiring that each substring be the longest substring that can be compressed or encoded at the substring's location.
  • the process of finding a partition for non-greedy parsing may involve construction of a graph structure having n+1 nodes and n(n+1)/2 edges or links connecting the nodes. Each edge corresponds to a different substring x[i,j] and to a coding or description cost c(i,j) for encoding the corresponding substring x[i,j].
  • the graph structure represents the problem of finding the path along the edges from the node corresponding to first symbol x_1 in the string to an end node that minimizes the sum of the description costs along the path. (Conventional techniques for solving the shortest path problem using computers can be employed.) This process can effectively evaluate the total encoding cost of all possible partitions of string X, so that the partition with optimal compression, i.e., that produces the smallest compressed string, can be found.
  • An actual string for compression may contain thousands of symbols or more.
  • Graph structure 200 includes six nodes corresponding to the five symbols x_1 to x_5 and an end node 290. Fifteen (i.e., n(n+1)/2) edges or links 210 connect each node to end node 290 or nodes corresponding to symbols that follow in the string X.
  • Each edge 210 corresponds to a substring x[i,j] and has a corresponding description cost c(i,j) for the encoding of the corresponding substring.
  • selecting the optimal partition or parsing of string X for a compression process corresponds to finding the path along edges 210 that minimizes the sum of the description costs from the node corresponding to symbol x_1 to end node 290.
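The shortest-path search over such a graph can be sketched as a dynamic program, assuming a cost function c(i, j) is supplied; the cost function used in the demonstration below is hypothetical:

```python
import math

def optimal_partition(n, cost):
    """Shortest path over nodes 0..n, where edge (i, j) has weight
    cost(i, j) for encoding the substring covering symbols i+1..j
    (1-based). Returns (total cost, list of (start, end) symbol pairs)."""
    best = [0] + [math.inf] * n   # best[j] = min cost to encode symbols 1..j
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            w = cost(i, j)
            if best[i] + w < best[j]:
                best[j], back[j] = best[i] + w, i
    cut, j = [], n
    while j > 0:
        cut.append((back[j] + 1, j))
        j = back[j]
    return best[n], cut[::-1]

# Hypothetical cost: each substring costs 2 units plus length // 3.
total, parts = optimal_partition(6, lambda i, j: 2 + (j - i) // 3)
print(total, parts)  # 4 [(1, 6)]
```

With this cost function every extra part adds at least 2 units, so the dynamic program keeps the whole string as a single part; a cost function that penalizes long substrings would instead produce several cuts.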
  • the description cost c(i,j) associated with encoding a substring x[i,j] generally depends on the details of the compression process and reflects the number of bits required to encode or describe substring x[i,j] using that compression process.
  • each description cost c(i,j) can be simply computed based on the dictionary used to compress string X.
  • description cost c(i,j) reflects the number of bits required for a pointer to an approximate match in a reference file and a description of the difference between substring x[i,j] and the approximate match.
  • the "dictionary" for differential compression can be the set of all substrings in a reference file.
  • the string x[i,j] needs to be compared to every length j-i+1 substring in the reference file. If computed in a naive fashion, this operation involves on the order of n(j-i+1) symbol comparisons. This is unlike the case of compression using exact matches, where data structures such as digital search trees may be used to reduce the complexity of identifying a matching dictionary entry.
  • a compression process employs a sparse (or sparser) graph structure for non-greedy parsing of a string during compression and thereby reduces the number of computations or comparisons required for the compression process.
  • approximate description costs c(i,j) can be computed efficiently and in a manner that leads to the sparse graph for selection of the partition of a string X.
  • the resulting values c(i,j) may not be accurate for all values of indices i and j, but the final parsing is likely, particularly in the case of a string X representing executables, to be close to optimal.
  • Differential compression of a string X as noted above involves comparisons to a reference file or string Y to generate a compressed string X' as illustrated in Fig. 3.
  • Reference file Y contains symbols y_1, y_2, ..., y_m, where the number m of symbols in reference file Y may differ from the number n of symbols in string X.
  • Reference file Y must be known at the time of compression and at the time of decompression. For each substring x[i,i+A] in a partition of string X, the compression process illustrated in Fig. 3 selects a candidate approximate match in reference file Y and encodes the substring using a pointer code p identifying the reference substring and a difference code d describing how the substring differs from the reference substring.
  • a compressed string X' can thus be generated containing codes p_1, d_1 to p_q, d_q for a partition of string X into q substrings.
  • This compression can be lossless if the difference codes d_1 to d_q exactly indicate the differences between each substring and the corresponding reference substring.
  • the differences d_1 to d_q can be encoded using lossless data compression encoding techniques such as Huffman or arithmetic coding.
  • a lossy compression process can approximate the differences to reduce the size of compressed string X'.
  • Decompression of string X' requires knowledge of reference string Y.
  • the decompression process generally identifies reference substrings using codes p_1 to p_q and adds to each identified reference substring the differences determined from the respective codes d_1 to d_q.
  • the original string X is thus recovered when the compression is lossless or an approximation of the original string X is recovered if the compression process is lossy.
  • Fig. 4 is a flow diagram of a compression process 400 in accordance with an exemplary embodiment of the invention for compressing a string X using a reference file Y.
  • String X as noted above may be input data containing redundancy, and reference file Y may be a string that was previously provided for encoding and decoding.
  • differential compression process 400 is applied where the file or string X to be compressed is an updated version of the reference file Y.
  • with files X and Y being executables or source code, there are typically two kinds of changes introduced by an update process: primary changes, such as the introduction of new lines of code; and secondary changes, which come about, for example, due to a change in the logical address of an unmodified instruction or data caused by the introduction of new lines.
  • Process 400 can be implemented using a computer executing software or firmware (e.g., an installation program) or using application specific hardware.
  • One particular embodiment of the invention is a computer readable medium such as a disk or electronic memory containing instructions that when executed by a computer perform process 400.
  • Process 400 as shown in Fig. 4 begins by initializing a table index i in a step 410.
  • Table index i corresponds to the location of a symbol x_i in the string X and also identifies a corresponding table that will be constructed for that symbol.
  • the table for a symbol x_i as described below can be used to determine the description costs for differential encoding of any substring x[i,i+A] beginning with symbol x_i, and these description costs can be used to identify a non-greedy parsing of string X.
  • a table having index i can provide the information needed to determine the encoding costs of all edges originating at the node associated with symbol x_i.
  • Associated with the index i and length values A is a set of substrings x[i,i+A] of length A+1 in string X.
  • Value B is a parameter of compression process 400, and may differ for different embodiments of process 400.
  • B is preferably small enough that for each substring of length B+1 in string X, reference file Y includes at least one copy of that substring. If a substring of length B+1 in string X does not have a copy in Y, then there are no candidate matches for describing the substrings beginning at the location, say i, corresponding to this length B+1 substring. In that event, x[i,i+B] can be described as is without reference to Y. The approximate cost of this non-differential encoding can be computed where necessary.
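One way to find the candidate entry pointers k — positions where reference file Y begins with the same B+1 symbols as the substring at location i — is to pre-index Y by its (B+1)-symbol substrings. The hash-table index below is an implementation assumption of this sketch, not a structure mandated by the text:

```python
from collections import defaultdict

def index_reference(y: bytes, B: int) -> dict:
    """Map each (B+1)-symbol substring of Y to the list of its start positions."""
    idx = defaultdict(list)
    for k in range(len(y) - B):
        idx[y[k:k + B + 1]].append(k)
    return idx

y = b"abcabdabc"
idx = index_reference(y, B=2)
print(idx[b"abc"])  # [0, 6] -- entry pointers for a substring starting "abc"
print(idx[b"abd"])  # [3]
```

A table for location i would then get one entry per position listed under the (B+1)-gram x[i,i+B], and an empty list signals the no-candidate case described above.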
  • Step 420 creates in a table corresponding to table index i an entry corresponding to the selected index k. If step 425 determines that an available prior table contains an entry with information related to the current value of index k, the entry for index k in the current table can be filled in a step 430 based on information from the prior table as described further below. However, if step 425 determines that no available prior table contains a related entry, a subprocess 435 determines a longest substring y[k,k+A] in reference file Y that is suitable to be a candidate approximate match for substring x[i,i+A] in string X.
  • a reference substring y[k,k+A] generally will not be a good candidate for an approximate match for substring x[i,i+A] if encoding the difference between substrings x[i,i+A] and y[k,k+A] has too great a description cost, that is, requires too many bits to encode.
  • a substring y[k,k+A] that starts with the same B+1 symbols as in substring x[i,i+A] will be considered a suitable candidate approximate match for substring x[i,i+A] if there are fewer than C consecutive symbols in substring y[k,k+A] that differ from corresponding symbols in substring x[i,i+A].
  • the value of C is another parameter of compression process 400 that may differ from one embodiment to another, and the choice of value C may particularly depend on the specific technique employed to encode differences or the type of data being compressed. However, value C may be around 3; for example, for an update of executable code, three bytes that do not match may indicate a primary change in string X that would not be efficiently compressed using differential compression.
  • Alternative embodiments can use other criteria for judging whether a substring y[k,k+A] is a suitable candidate for an approximate match for substring x[i,i+A]. For example, a poor candidate can be identified according to the total number of mismatched symbols or the percentage of mismatched symbols in substring y[k,k+A].
  • Subprocess 435 determines for indices i and k a maximum value A such that reference substring y[k,k+A] is a good candidate for an approximate match to substring x[i,i+A]. In the illustrated embodiment, this determination is based on the number of consecutive mismatched symbols.
  • Decision step 445 determines if symbol x_{i+B+j} in string X is equal to symbol y_{k+B+j} in reference file Y.
  • if step 445 determines that symbols x_{i+B+j} and y_{k+B+j} do not match, step 460 increments mismatch count MM and stores a mismatch identifier (e.g., offset k+B+j) in the entry for index k in table i.
  • Decision step 465 determines whether mismatch count MM has reached the maximum consecutive count C. If not, process 400 increments index j and branches from step 455 to determine whether the process has run out of symbols to compare. Subprocess 435 is complete either when step 465 determines that mismatch count MM is equal to the maximum consecutive mismatches C or no further comparisons can be made.
  • the storing of maximum length LM can employ some technique to distinguish maximum length LM from the previously stored mismatch offsets, e.g., by writing an end marker or an entry length in the entry.
  • That entry in the table corresponding to index i is then complete.
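The scan that fills one entry can be sketched roughly as follows, assuming byte strings, that the caller has already verified that the B+1 symbols at i and k match, and that the helper name and return shape are choices of this sketch:

```python
def build_entry(x: bytes, y: bytes, i: int, k: int, B: int, C: int):
    """Record mismatch offsets (relative to k) after the B+1 matching
    symbols, stopping at C consecutive mismatches or the end of the data.
    Returns (mismatch_offsets, LM), where LM is the maximum length of a
    suitable candidate approximate match starting at k."""
    offsets, run = [], 0
    off = B + 1                          # first offset past the matched prefix
    while i + off < len(x) and k + off < len(y):
        if x[i + off] != y[k + off]:
            offsets.append(off)
            run += 1
            if run == C:                 # C consecutive mismatches: stop
                break
        else:
            run = 0
        off += 1
    # Exclude the terminating run of C mismatches from the usable length.
    lm = off - C + 1 if run == C else off
    return offsets, lm

x = b"ABCDEFGH"
y = b"ABCXEFZZQ"
print(build_entry(x, y, i=0, k=0, B=2, C=2))  # ([3, 6, 7], 6)
```

In the example the match survives one isolated mismatch at offset 3, then two consecutive mismatches at offsets 6 and 7 terminate it, so the candidate covers only the first six symbols.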
  • Fig. 5 illustrates the logical relations of values in a table 500 that process 400 can generate for an index i or equivalently for a symbol x_i in string X, i.e., for the symbol at the location indicated by a value of index i as used in Fig. 4.
  • Table 500 contains four entries corresponding to entry index values k_1, k_2, k_3, and k_4.
  • the number of entries (and associated entry index values) will generally vary from table to table.
  • the entry index values k_1, k_2, k_3, and k_4 point to locations in reference file Y that have symbols matching symbols x_i to x_{i+B} in string X.
  • Each entry contains one or more mismatch identifiers, which in an exemplary embodiment are offsets indicating the locations relative to the entry index k of symbols in reference file Y that do not match symbols in string X having the same offsets relative to the table index i.
  • table 500 thus indicates, for a mismatch offset mm stored in the entry for an entry index k, that symbol y_{k+mm} does not match symbol x_{i+mm}.
  • the number of mismatch identifiers will generally be different in different entries.
  • table 500 illustrates an example where the entry corresponding to entry index k_2 has mismatch offsets mm_21, mm_21+1, and mm_21+2 that are consecutive values, in a case where the maximum number C of consecutive mismatches allowed in a candidate approximate match is 3.
  • the other entries in table 500 have more than three mismatch offsets, but the last three mismatch offsets should be consecutive integers.
  • Fig. 5 also shows a table 510 that process 400 may have generated for an index i-1 or equivalently for a symbol x_{i-1} in string X.
  • one entry of table 510 has an entry index value k_3' that is equal to k_3-1, where k_3 is an entry index value in table 500. This will result when the B+1 symbols starting at index i in string X match the B+1 symbols starting at index k_3 in reference file Y and symbol x_{i-1} matches symbol y_{k_3-1}.
  • the entry corresponding to index k_3' in table 510 has mismatch identifiers identifying the same mismatches as identified in the entry corresponding to index k_3 in table 500.
  • because the mismatch identifiers are relative offsets, offsets mm_31, mm_32, ... for the entry corresponding to index k_3 in table 500 will be one less than the corresponding offsets in the entry corresponding to index k_3' in table 510.
  • the maximum length LM_3 in the entry corresponding to index k_3 in table 500 will be one less than the maximum length in the entry corresponding to index k_3' in table 510.
  • Tables 500 and 510 need not necessarily contain the locations of the mismatches.
  • each table could store, with the location of the match, only the length of the match and the number of mismatches. This way, the property of being able to derive tables based on tables for past symbols is still preserved. This would suffice for computing the costs as long as the cost depends only on the number, and not the nature, of the mismatches. However, while computing the costs, the cost-minimizing location in Y needs to be recorded, which can be done by scanning Y just once more while encoding the mismatches.
  • Table 510 may be generated in process 400 before generation of table 500, so that the entry corresponding to k_3 in table 500 can be generated from the entry corresponding to k_3' in table 510 without the need to scan through reference file Y to find mismatches.
  • Other tables may include entries with similar relations to table 500. For example, if a table generated for index i-b, where b ≤ B, includes an entry having an index k' such that k-k' is equal to b for an index k corresponding to an entry in the table generated for index i, the entry corresponding to index k in table i can be simply generated from the entry corresponding to index k' in table i-b.
  • Step 425 in process 400 can recognize some or all of the relations of this type, so that step 430 can generate an entry based on a prior entry without requiring the processing burden of scanning the reference file Y as done, for example, in subprocess 435.
  • the number of comparisons or operations to fill an entry is then proportional to the number of mismatches, while otherwise the number of comparisons or operation depends on the number of symbol matches and mismatches.
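The derivation of a table-i entry from a table-(i-1) entry reduces to shifting the stored values, as the relations above indicate. A rough sketch, where the validity check is an assumption about how step 425 might be implemented:

```python
def derive_entry(prev_offsets, prev_lm, B):
    """Given the entry built for (i-1, k-1), derive the entry for (i, k):
    every mismatch offset and the maximum length shrink by one. This is
    valid only when x[i..i+B] still matches y[k..k+B], i.e. when offset
    B+1 was not a mismatch in the previous entry."""
    if B + 1 in prev_offsets:
        return None                      # new B+1 prefix would not match
    return [o - 1 for o in prev_offsets], prev_lm - 1

# e.g. the previous entry had mismatches at offsets 4 and 7 and LM = 9:
print(derive_entry([4, 7], 9, B=2))   # ([3, 6], 8)
print(derive_entry([3, 7], 9, B=2))   # None -- offset B+1 was a mismatch
```

This is what makes the table construction cheap: deriving an entry touches only the stored mismatch list, while a fresh scan (subprocess 435) touches every compared symbol of the reference file.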
  • when step 415 of process 400 in Fig. 4 determines that the last index k to suitable candidate approximate matches for substrings starting with symbol x_i has been evaluated, e.g., table 500 is complete, process 400 branches from step 415 to a step 480, which determines description costs c(i,i+j) for all values of index j from 1 to n-i. For any value j, the number of candidate approximate matches will be equal to the number of entries in the table, and the candidate that provides the lowest cost can be chosen as the approximate match used for encoding of substring x[i,i+j].
  • the table for the current location index i identifies locations of the mismatches so that candidates for costs c(i,i+j) can be determined or approximated for all values of j using the information in the table and known encoding techniques.
  • the cost generally should capture the number of bits required for a complete unambiguous description of the substring. The exact number of bits may be difficult to evaluate and generally requires accessing the input data X and the reference file Y, but an estimate that uses just the information from the generated tables can be used.
  • the cost of differential encoding a substring x[i,i+j] using a substring y[k,k+j] can be estimated to be 2*log(m)+c*log(m)+c*8 bits where m is the length of reference file Y in bytes and c is the number of mismatches between x[i,i+j] and y[k,k+j].
  • if the estimated description cost of encoding is greater than the length of substring x[i,i+j], substring x[i,i+j] can be left in original form, and the description cost is the length of substring x[i,i+j].
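The cost estimate and the raw-storage fallback can be combined in a short sketch; the base-2 logarithm and the 8-bits-per-symbol raw cost are assumptions of this sketch rather than choices fixed by the text:

```python
import math

def estimated_cost_bits(m: int, c: int, length: int) -> float:
    """Estimated bits to differentially encode a substring of `length` bytes
    with c mismatches against a reference file of m bytes: 2*log(m) for the
    pointer and length, plus log(m) + 8 bits per mismatch (the position and
    the replacement byte). Falls back to storing the raw bytes when the
    estimate exceeds the substring's own size."""
    diff_cost = 2 * math.log2(m) + c * (math.log2(m) + 8)
    return min(diff_cost, 8 * length)

# A substring with 2 mismatches against a 1 MiB reference file:
print(estimated_cost_bits(m=2**20, c=2, length=100))  # 96.0 (diff encoding wins)
print(estimated_cost_bits(m=2**20, c=2, length=10))   # 80.0 (raw bytes are smaller)
```

Note that the estimate needs only m and the mismatch count from the table entry, which is exactly why the tables suffice for computing candidate costs without rereading X or Y.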
  • the minimum of the determined candidate costs for each value of j can be associated with edges in a graph structure that is used to find a non-greedy partition of string X.
  • if index j for a candidate cost c(i,i+j) is greater than the stored maximum length LM for an entry, the candidate cost c(i,i+j) corresponding to that entry is treated as being infinite.
  • an infinite cost corresponds to an edge being removed from the graph structure.
  • Fig. 6 illustrates how a graph 600 is made sparser by elimination of edges 610 associated with infinite costs.
  • graph 600 contains some of the edges 210 associated with finite costs and edges 610 associated with infinite costs.
  • edges 610 are ignored. Accordingly, fewer paths need to be evaluated and processing time is reduced.
  • step 480 uses the table i just completed and determines the description costs c(i,i+j) over the required range of j.
  • the costs c(i,i+j) can be stored in a data structure that will be used to determine the non-greedy partition of string X.
  • any unwanted prior tables can be deleted. For example, if step 425 can only recognize entries from table i-1 as being related to entries in table i, any table associated with an index preceding i-1 (e.g., table i-2) can be deleted. However, if step 425 recognizes relations from more than one preceding table, more than one preceding table can be kept and used in determining entries of the next table. The number of tables kept can be one or more in different embodiments of process 400.
  • Step 490 determines whether a table needs to be generated for a next starting location in string X. Generally, the last table will be generated when i is about equal to n-B, where n is the number of symbols in string X and B is the number of initial matching symbols required in a suitable candidate for an approximate match. After the last execution of step 490, repetitions of step 480 will have filled the data structure needed for determining the non-greedy partition to be used in compression of string X. Step 495 determines the partition logically based on the sparser graph (e.g., graph 600 of Fig. 6) and then encodes each substring in that partition to generate the compressed string.
  • the exemplary embodiments of the invention described above involve finding matching strings in which symbols are related in one dimension.
  • the embodiments of the invention can be extended to data related in two or more dimensions.
  • one task is to identify approximate matches of blocks of symbols in one image or frame to blocks of symbols in another image or frame.
  • Suitable candidate approximate matches can be required to have a B1xB2 block of symbols that matches exactly.
  • Tables can then be constructed for each point in the input image or frame and have entries respectively corresponding to the locations of exact matches to a B1xB2 block at the location corresponding to the table.
  • the maximum size of the suitable candidate matches can be limited according to a condition analogous to the condition concerning C consecutive mismatches, so that blocks do not need to be considered as candidate approximate matches if the block contains more than C consecutive mismatches.
  • a straightforward extension of this condition to two dimensions would be to require a candidate approximate match to have fewer than C consecutive mismatches in every row and column. Finding such candidate approximate matches and evaluating the description costs can be useful for image and video compression as well, although the cost function and the process for finding the non-greedy partition differ from those used for the one-dimensional process described above.
  • the compression processes described above can be employed in any systems and devices in which compression is desired. Such applications include but are not limited to data transmitted as a digital signal by modems, networks, and similar devices or systems and data storage in computer systems, computer peripherals, removable media such as video disks, consumer devices such as telephones and music and video players.
  • the processes and systems can be implemented using custom hardware or software or firmware executed by a computer or processor.
  • the software or firmware products can be embodied as physical media containing machine readable instructions that are executed to carry out a process in accordance with an embodiment of the invention.
  • Fig. 7 shows a block diagram of a system 700 in accordance with an embodiment of the invention.
  • System 700 includes data storage such as computer memory containing data such as a reference file 720, location tables 740, and a graph structure 760 that are manipulated by a table construction unit 730, a cost calculator 750, a partition unit 770, and an encoder 780.
  • Table construction unit 730, cost calculator 750, partition unit 770, and encoder 780 are processing units that can be implemented by custom hardware or program routines being executed by a processor.
  • Table construction unit 730 constructs node tables from input data 710 and reference file 720, for example, using the techniques illustrated by steps 415 to 470 described above with reference to Fig. 4.
  • Cost calculator 750 calculates description costs using tables 740 (and in some embodiments also using input data 710 and reference file 720). Cost calculator 750 stores the calculated description costs in graph structure 760, for example, using the techniques described for step 480 in Fig. 4.
  • Partition unit 770 uses graph structure 760 to select a non-greedy partition for input data 710, and encoder 780 encodes the parts of the non-greedy partition to generate compressed output data X'.
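The row-and-column condition mentioned above (fewer than C consecutive mismatches in every row and every column of a candidate block) can be sketched as follows. This is an illustrative Python rendering only; the function name and block representation are assumptions, not part of the disclosed embodiments:

```python
def block_is_candidate(x_block, y_block, c):
    """2-D analogue of the C-consecutive-mismatch rule: a block of the
    reference is a suitable candidate approximate match only if every row
    and every column has fewer than c consecutive mismatched symbols."""
    rows, cols = len(x_block), len(x_block[0])
    # Boolean mismatch map between the input block and the reference block.
    mism = [[x_block[r][s] != y_block[r][s] for s in range(cols)]
            for r in range(rows)]

    def ok(line):
        run = 0
        for bad in line:
            run = run + 1 if bad else 0
            if run >= c:
                return False
        return True

    return (all(ok(row) for row in mism)
            and all(ok([mism[r][s] for r in range(rows)]) for s in range(cols)))
```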

Abstract

A compression process (400) or system (700) constructs tables (740) associated with first locations in input data (710). Each table (500) includes entries corresponding to second locations in a reference file, and each second location identifies parts of the reference file that are candidate approximate matches for parts of the input data (710) at the associated first location. For each table (500), costs associated with differential encoding of the parts of the input data at the first location can be determined using the candidate approximate matches identified by the entries in the table. A partition of the input data using the parts for which the costs were determined can be selected to minimize a sum of the costs determined for the parts in the partition, and the parts in the partition can be encoded to compress the input data.

Description

NON-GREEDY DIFFERENTIAL COMPRESSION
BACKGROUND
Data structures and digital signals often contain sets of symbols that either precisely or approximately reappear many times in the data structures or digital signals. Compression methods can use this redundancy to reduce the size of a data structure or signal by replacing the most repeated sets of symbols with shorter codes. Such data and signal compression principles are well known and used in a variety of applications. For example, information transmitted as a digital signal by modems, networks, and similar systems, can be compressed to improve the effective bandwidth of a transmission and decrease the time required for the transmission. Data storage in computer systems, computer peripherals, removable media such as CD and DVD disks, consumer devices such as telephones and music and video players, and many other devices that store digital data can also use compression to increase the amount of data that can be stored with the available resources.
Compression processes for strings of symbols generally involve parsing or partitioning of the string into a concatenation of substrings and encoding each substring. For example, many compression processes compare each substring in a partition of an original string to entries in a dictionary and when a match is found, encode the substring using an index or pointer that identifies the matching entry in the dictionary. The parsing of the original string to create the partition can be done in a greedy fashion. With greedy parsing, a compression process looks for the longest substring beginning with the starting symbol of the string and having a match in the dictionary. The longest substring is encoded using the dictionary, and the compression process looks for a longest substring that immediately follows the substring just encoded and is in the dictionary. Compression using greedy parsing operates on the assumption that encoding a series of "longest" substrings generally results in the most compression.
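Greedy parsing as just described can be sketched in a few lines. The Python rendering, dictionary representation (a set of substrings), and length bound are illustrative assumptions only:

```python
def greedy_parse(s, dictionary, max_len):
    """Partition string s greedily: at each position take the longest
    substring that appears in the dictionary (a lone symbol always matches)."""
    parts = []
    i = 0
    while i < len(s):
        best = 1  # a single symbol is always encodable
        for length in range(2, min(max_len, len(s) - i) + 1):
            if s[i:i + length] in dictionary:
                best = length
        parts.append(s[i:i + best])
        i += best
    return parts

parts = greedy_parse("abab", {"ab", "aba"}, 4)
# greedy takes "aba" first, leaving "b"; a non-greedy parse could
# instead choose "ab" + "ab", two dictionary hits
```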
Non-greedy parsing for a compression process does not assume that partitioning a string into a series of substrings with each being the longest encodable substring at its location provides the most compression. In some cases, encoding a shorter substring at one point in the string can allow improved compression further along in the string, for example, when a non-greedy partition of a string includes shorter strings in some locations but even longer substrings elsewhere. Non-greedy parsing can thus improve compression at least for some strings, but non-greedy parsing generally requires more processing power.
Differential compression generally refers to compression of input data relative to a reference file where differences may be included in the codes for the separately encoded parts of the input data. One way to perform differential compression on a string is to partition the string into a concatenation of substrings, and then for each substring, find candidate approximate matches in a reference string or file and encode the substring by pointing to a chosen approximate match and describing the difference between the substring and the chosen approximate match.
Compression processes are known for optimal non-greedy parsing of a string when the compression finds exact matches to entries in a dictionary. However, the computational cost involved in directly applying those non-greedy parsing techniques to the differential compression process may be prohibitive, particularly because differential compression processes often need to evaluate multiple candidates for the approximate match to each substring.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 illustrates the relationships of and some notations used for a string containing symbols and substrings as described herein.
Fig. 2 shows a graph structure illustrating a method for non-greedy parsing in a compression process.
Fig. 3 illustrates the relationships of and notations used for an input string and a reference file in a differential encoding process in accordance with an embodiment of the invention.
Fig. 4 is a flow diagram of a compression process in accordance with an embodiment of the invention.
Fig. 5 illustrates the logical structure of tables that may be generated during a compression process in accordance with an embodiment of the invention.
Fig. 6 shows a sparse graph illustrating a method for non-greedy parsing of a string for a compression process in accordance with an embodiment of the invention.
Fig. 7 is a block diagram representing a compression system in accordance with an embodiment of the invention.
Use of the same reference symbols in different figures indicates similar or identical items.
DETAILED DESCRIPTION
In accordance with an aspect of the present invention, the speed of differential compression processes and systems using non-greedy parsing can be increased. The increase in compression process speed can result from limiting candidate approximate matches to those that have the same set of initial symbols as in a part (e.g., substring) that may be encoded during compression. For compression of a string, a table can be generated for each starting location of substrings in the string being compressed, and each table can contain entries associated with entry pointers or indices identifying the candidate approximate matches to all substrings that begin at the starting location. Each entry can further include mismatch identifiers, which may be offsets from the symbol indicated by the entry pointer to symbols in the reference file that do not match symbols at the same location relative to the starting location in the string being compressed. Each entry can further include a maximum length in the reference file of a suitable candidate approximate match that begins with the symbol identified by the entry pointer.
The tables for different starting locations in the string being compressed can be generated in a process that compares symbols in the reference file to symbols in the string. However, entries for two different tables will often be related, so that an entry for one table can often be efficiently generated from an entry in an already generated table.
Each table thus generated provides information used for determining the differential encoding cost for every substring that begins at the starting location associated with the table. During non-greedy parsing, a substring at the starting location associated with a table can be considered to have infinite encoding cost if the substring is longer than the maximum lengths of candidate approximate matches in the table. Making some coding costs infinite reduces the number of edges in a graph structure evaluated for identification of a non-greedy parsing of the string being compressed. As a result, the graph is sparser than normally required for non-greedy parsing, and identifying the non-greedy parsing may require less processing time.
The improvement in compression speed may sacrifice the optimality of the compression process, i.e., produce larger compressed files, but in most cases, particularly when the input data being compressed is an executable file, the loss in compression may be negligible. The compression process can be applied to one-dimensional or multi-dimensional data structures. However, the example of one-dimensional strings is primarily described below to illustrate specific applications.
Fig. 1 illustrates a string X that may be compressed using systems and processes in accordance with the invention. String X can be a one-dimensional ordered data structure, a digital file, or the input data to be represented by a transmitted signal. In the example of Fig. 1, string X includes n symbols x1, x2, ... xn that have an order that can be defined according to the properties or uses of string X. For example, the order may be a temporal order for the transmission of the symbols in string X, an order for the addresses of storage locations in a buffer or memory storing string X, or a nominal program order for executable instructions that make up the string X. The subscripts of symbols x1 to xn indicate the positions of symbols x1 to xn within the order assigned in string X.
Each symbol xi for an index i between 1 and n can generally be a symbol from some known alphabet. The number of symbols in the alphabet is generally independent of the number n of symbols in string X. In a common example, each symbol x1 to xn is a byte value, and the alphabet contains the values from 0 to 255.
A compression process performed on string X generally requires partitioning string X into substrings. (The notation x[i:j] is used herein to denote a substring of symbols xi, xi+1, ... xj from string X, where indices i and j satisfy the relations 1 ≤ i ≤ j ≤ n.) With a partition of string X selected, the compression process can encode or describe each substring or part from the partition, preferably using a code or description that is shorter (e.g., contains fewer bits) than the substring or part. The parsing process that selects the partition of string X can be critical to the efficiency of the encoding process.
In accordance with an aspect of the invention, a compression process using only a modest amount of processing power can implement both non-greedy parsing and differential compression. A non-greedy parsing of string X involves selecting a partition of string X without requiring that each substring be the longest substring that can be compressed or encoded at the substring's location. Conceptually, the process of finding a partition for non-greedy parsing may involve construction of a graph structure having n+1 nodes and n(n+1)/2 edges or links connecting the nodes. Each edge corresponds to a different substring x[i,j] and to a coding or description cost c(i,j) for encoding the corresponding substring x[i,j]. The graph structure represents the problem of finding the path along the edges from the node corresponding to first symbol x1 in the string to an end node that minimizes the sum of the description costs along the path. (Conventional techniques for solving the shortest path problem using computers can be employed.) This process can effectively evaluate the total encoding cost of all possible partitions of string X, so that the partition with optimal compression, i.e., that produces the smallest compressed string, can be found.
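The shortest-path formulation above can be sketched with a standard dynamic program over the n+1 nodes. The Python rendering is illustrative; the cost function is assumed to be supplied as a mapping from index pairs (i, j) to description costs, with missing pairs treated as infinite:

```python
def optimal_partition(n, cost):
    """Minimum-cost path from node 0 to node n, where cost[(i, j)] is the
    description cost of encoding symbols i..j-1 as one part.
    Returns (total_cost, list of (i, j) parts in order)."""
    INF = float("inf")
    best = [INF] * (n + 1)   # best[j]: cheapest description of symbols 0..j-1
    back = [None] * (n + 1)  # back[j]: start of the last part ending at j
    best[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            c = cost.get((i, j), INF)
            if best[i] + c < best[j]:
                best[j] = best[i] + c
                back[j] = i
    parts, j = [], n
    while j > 0:
        parts.append((back[j], j))
        j = back[j]
    return best[n], parts[::-1]
```

For a string of 4 symbols where two-symbol parts cost 3, single symbols cost 2, and the whole string costs 10, the optimal partition is two two-symbol parts at total cost 6.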
Fig. 2 illustrates a specific example of a graph structure 200 for the overly simple case of compressing a string containing five (i.e., n=5) symbols x1, x2, x3, x4, and x5. An actual string for compression may contain thousands of symbols or more. Graph structure 200 includes six nodes corresponding to the five symbols x1 to x5 and an end node 290. Fifteen (i.e., n(n+1)/2) edges or links 210 connect each node to end node 290 or to nodes corresponding to symbols that follow in the string X. Each edge 210 corresponds to a substring x[i,j] and has a corresponding description cost c(i,j) for the encoding of the corresponding substring. With graph structure 200, selecting the optimal partition or parsing of string X for a compression process corresponds to finding the path along edges 210 that minimizes the sum of the description costs from the node corresponding to symbol x1 to end node 290.
The description cost c(i,j) associated with encoding a substring x[i,j] generally depends on the details of the compression process and reflects the number of bits required to encode or describe substring x[i,j] using that compression process. When the compression process encodes substrings based on exact matches to entries in a dictionary, each description cost c(i,j) can be simply computed based on the dictionary used to compress structure X. For differential encoding, description cost c(i,j) reflects the number of bits required for a pointer to an approximate match in a reference file and a description of the difference between substring x[i,j] and the approximate match.
Direct application of the approach illustrated in Fig. 2 to a differential compression process can require considerable processing power. In particular, the "dictionary" for differential compression can be the set of all substrings in a reference file. To compute the cost c(i,j) of describing a substring x[i:j], the substring needs to be compared to every length j-i+1 substring in the reference file. If computed in a naive fashion, this operation involves on the order of n(j-i+1) symbol comparisons. This is unlike the case of compression using exact matches, where data structures such as digital search trees may be used to reduce the complexity of identifying a matching dictionary entry. Using similar data structures for approximate matching can reduce the complexity to on the order of n*k comparisons, where k is an upper bound on the acceptable number of mismatched symbols in an approximate match. Since this computation is required for every pair of indices (i,j), the total complexity of determining the description costs for the graph structure is on the order of k*n^3. After the costs c(i,j) are computed, finding the shortest path in the graph still requires computation that is linear in the size of the graph including vertices and edges, which may be quadratic in the number n+1 of vertices if the graph is not sparse.
In accordance with an aspect of the present invention, a compression process employs a sparse (or sparser) graph structure for non-greedy parsing of a string during compression and thereby reduces the number of computations or comparisons required for the compression process. Further, approximate description costs c(i,j) can be computed efficiently and in a manner that leads to the sparse graph for selection of the partition of a string X. The resulting values c(i,j) may not be accurate for all values of indices i and j, but the final parsing is likely, particularly in the case of a string X representing executables, to be close to optimal.
Differential compression of a string X as noted above involves comparisons to a reference file or string Y to generate a compressed string X' as illustrated in Fig. 3. Reference file Y contains symbols y1, y2, ... ym, where the number m of symbols in reference file Y may differ from the number n of symbols in string X. Reference file Y must be known at the time of compression and decompression. For each substring x[i,i+A] in a partition of string X, the illustrated compression process of Fig. 3 identifies a reference substring y[k,k+A] of the same length A+1 in reference file Y and generates an index, pointer, or code p identifying the reference substring y[k,k+A] and a difference code d indicating the difference between substrings x[i,i+A] and y[k,k+A]. A compressed string X' can thus be generated containing codes p1, d1 to pq, dq for a partition of string X into q substrings. This compression can be lossless if the difference codes d1 to dq exactly indicate the differences between each substring and the corresponding reference substring. In particular, the differences d1 to dq can be encoded using lossless data compression encoding techniques such as Huffman or arithmetic coding. Alternatively, a lossy compression process can approximate the differences to reduce the size of compressed string X'.
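The (pointer, difference) coding above can be sketched for a single part. This is a minimal illustration assuming the difference code is simply a list of (offset, new symbol) mismatches; the patent leaves the actual difference encoding (e.g., Huffman or arithmetic coding) open:

```python
def diff_encode(part, ref, k):
    """Encode one part against ref[k:k+len(part)]: a pointer (k, length)
    plus a list of (offset, new_symbol) mismatches as the difference code."""
    diffs = [(t, part[t]) for t in range(len(part)) if part[t] != ref[k + t]]
    return (k, len(part)), diffs

def diff_decode(pointer, diffs, ref):
    """Recover the part: copy the reference substring, then apply mismatches."""
    k, length = pointer
    out = list(ref[k:k + length])
    for offset, sym in diffs:
        out[offset] = sym
    return "".join(out)

p, d = diff_encode("abXd", "abcdefgh", 0)
# p == (0, 4); d == [(2, 'X')]; decoding recovers "abXd"
```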
Decompression of string X' requires knowledge of reference string Y. The decompression process generally identifies reference substrings using codes p1 to pq and adds to the identified reference substrings the respective differences determined from codes d1 to dq. The original string X is thus recovered when the compression is lossless, or an approximation of the original string X is recovered if the compression process is lossy.
Fig. 4 is a flow diagram of a compression process 400 in accordance with an exemplary embodiment of the invention for compressing a string X using a reference file Y. String X as noted above may be input data containing redundancy, and reference file Y may be a string that was previously provided for encoding and decoding. In one exemplary application, differential compression 400 is applied where the file or string X to be compressed is an updated version of the reference file Y. In the case where files X and Y are executables or source code, there are typically two kinds of changes introduced by an update process: primary changes, such as introduction of new lines of code; and secondary changes, which come about, for example, due to a change in the logical address of an unmodified instruction or data due to introduction of new lines. Secondary changes are likely to cause only a few bytes of mismatch between files X and Y. Some of the techniques employed in process 400 as described further below are aimed at identifying matches such that the secondary changes do not limit the length of a match but a primary change does. Process 400 can be implemented using a computer executing software or firmware (e.g., an installation program) or using application specific hardware. One particular embodiment of the invention is a computer readable medium such as a disk or electronic memory containing instructions that when executed by a computer perform process 400.
Process 400 as shown in Fig. 4 begins by initializing a table index i in a step 410. Table index i corresponds to the location of a symbol xi in the string X and also identifies a corresponding table that will be constructed for that symbol. The table for a symbol xi as described below can be used to determine the description costs for differential encoding of any substring x[i,i+A] beginning with symbol xi, and these description costs can be used to identify a non-greedy parsing of string X. In particular, for a graph structure of the type shown in Fig. 2, a table having index i can provide the information needed to determine the encoding costs of all edges originating at the node associated with symbol xi.
Associated with the index i and length values A are a set of substrings x[i,i+A] of length A+1 in string X. Process 400 considers reference substrings y[k,k+A] as candidate approximate matches for substring x[i,i+A] only if the first B+1 symbols match exactly, that is y[k,k+B]=x[i,i+B]. Value B is a parameter of compression process 400, and may differ for different embodiments of process 400. In general, larger values for B reduce the number of candidate substrings y[k,k+A] that must be evaluated as approximate matches, but B is preferably small enough that for each substring of length B+1 in string X, reference file Y includes at least one copy of that substring. If a substring of length B+1 in string X does not have a copy in Y, then there are no candidate matches for describing the substrings beginning at the location, say i, corresponding to this length B+1 substring. In that event, x[i,i+B] can be described as is without reference to Y. The approximate cost of this non-differential encoding can be computed where necessary. Step 415 determines whether there is an index value k such that y[k,k+B]=x[i,i+B]. This can be done by searching reference file Y for matches to x[i,i+B], but a hash function H(x[i,i+B]) generated from reference file Y can return index values k pointing to substrings y[k,k+B] of reference file Y that are equal to substring x[i,i+B]. Use of a hash function can more efficiently use processing power since repeated searches of reference file Y for the same substring are avoided.
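One possible realization of the hash function H is a table built once over reference file Y, keyed on every length-(B+1) substring. The Python dict below is an illustrative stand-in; the patent does not specify a particular hash construction:

```python
def build_prefix_index(ref, b):
    """Map every length-(b+1) substring of ref to the list of start
    positions k where it occurs, so candidate entry indices k for a
    matching prefix are found without rescanning ref."""
    index = {}
    for k in range(len(ref) - b):
        index.setdefault(ref[k:k + b + 1], []).append(k)
    return index

idx = build_prefix_index("abcabc", 2)
# "abc" occurs at positions 0 and 3
```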
Step 420 creates in a table corresponding to table index i an entry corresponding to the selected index k. If step 425 determines that an available prior table contains an entry with information related to the current value of index k, the entry for index k in the current table can be filled in a step 430 based on information from the prior table as described further below. However, if step 425 determines that no available prior table contains a related entry, a subprocess 435 determines a longest substring y[k,k+A] in reference file Y that is suitable to be a candidate approximate match for substring x[i,i+A] in string X.
A reference substring y[k,k+A] generally will not be a good candidate for an approximate match for substring x[i,i+A] if encoding the difference between substrings x[i,i+A] and y[k,k+A] has too great a description cost, that is, requires too many bits to encode. In accordance with one embodiment of the invention, a substring y[k,k+A] that starts with the same B+1 symbols as in substring x[i,i+A] will be considered a suitable candidate approximate match for substring x[i,i+A] if there are fewer than C consecutive symbols in substring y[k,k+A] that differ from corresponding symbols in substring x[i,i+A]. The value of C is another parameter of compression process 400 that may differ from one embodiment to another, and the choice of value C may particularly depend on the specific technique employed to encode differences or the type of data being compressed. However, value C may be around 3; for example, for an update of executable code, three consecutive bytes that do not match may indicate a primary change in string X that would not be efficiently compressed using differential compression. Alternative embodiments can use other criteria for judging whether a substring y[k,k+A] is a suitable candidate for an approximate match for substring x[i,i+A]. For example, a poor candidate can be identified according to the total number of mismatched symbols or the percentage of mismatched symbols in substring y[k,k+A].
Subprocess 435 determines for indices i and k a maximum value A such that reference substring y[k,k+A] is a good candidate for an approximate match to substring x[i,i+A]. In the illustrated embodiment, this determination is based on the number of consecutive mismatched symbols. Subprocess 435 begins in step 440 by initializing an index j (e.g., j=1 initially) and a mismatch count MM (e.g., MM=0 initially.) Decision step 445 then determines if symbol xi+B+j in string X is equal to symbol yk+B+j in reference file Y. (The first B+1 symbols are already known to match as a result of the selection of index k in step 415.) If the symbols match, mismatch count MM is reset in step 450, index j is incremented, and the process branches from a decision step 455 back to step 445 unless i+B+j is greater than the number n of symbols in string X or k+B+j is greater than a number m of symbols in reference file Y.
If step 445 determines symbols xi+B+j and yk+B+j do not match, step 460 increments mismatch count MM and stores a mismatch identifier (e.g., offset k+B+j) in the entry for index k in table i. Decision step 465 then determines whether mismatch count MM has reached the maximum consecutive count C. If not, process 400 increments index j and branches from step 455 to determine whether the process has run out of symbols to compare. Subprocess 435 is complete either when step 465 determines that mismatch count MM is equal to the maximum consecutive mismatches C or no further comparisons can be made.
Step 470 follows process 435 and stores a maximum length LM=B+j of a reference substring y[k,k+LM] that is a good candidate approximate match to substring x[i,i+LM] in the entry corresponding to index k of the table corresponding to index i. (The storing of maximum length LM can employ some technique to distinguish maximum length LM from the previously stored mismatch offsets, e.g., by writing an end marker or an entry length in the entry.) That entry in the table corresponding to index i is then complete. From step 470, process 400 returns to step 415 and determines whether there is another value of index k for which y[k,k+B]=x[i,i+B]. If there is another value of k, another new entry in the table is created and filled. Once the last value of index k for which y[k,k+B]=x[i,i+B] is processed, the table for index i is complete.
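Subprocess 435 and step 470 can be sketched as a single scan. This Python sketch is illustrative only: it assumes strings of symbols, records mismatch offsets relative to the entry index, and stops after C consecutive mismatches or when either string runs out, returning the offsets and LM=B+j:

```python
def scan_entry(x, i, y, k, b, c):
    """Scan past the b+1 matching prefix symbols of x at i and y at k,
    recording relative mismatch offsets, until c consecutive mismatches
    occur or either string is exhausted. Returns (offsets, LM)."""
    offsets, run = [], 0  # run mirrors mismatch count MM
    j = 1
    while i + b + j < len(x) and k + b + j < len(y):
        if x[i + b + j] == y[k + b + j]:
            run = 0  # step 450: reset on a match
        else:
            run += 1
            offsets.append(b + j)  # step 460: record mismatch offset
            if run == c:
                break  # step 465: too many consecutive mismatches
        j += 1
    return offsets, b + j  # step 470: LM = B + j

offs, lm = scan_entry("aabXcdYZQr", 0, "aabecdefgr", 0, 2, 3)
# offs == [3, 6, 7, 8]; lm == 8 (the last three offsets are consecutive)
```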
Fig. 5 illustrates the logical relations of values in a table 500 that process 400 can generate for an index i or equivalently for a symbol xi in string X, i.e., for the symbol at the location indicated by a value of index i as used in Fig. 4. Table 500 contains four entries corresponding to entry index values k1, k2, k3, and k4. The number of entries (and associated entry index values) will generally vary from table to table. The entry index values k1, k2, k3, and k4 point to locations in reference file Y that have symbols matching symbols xi to xi+B in string X. Each entry contains one or more mismatch identifiers, which in an exemplary embodiment are offsets indicating the locations relative to the entry index k of symbols in reference file Y that do not match symbols in string X having the same offsets relative to the table index i. For example, table 500 indicates that symbol yk1+mm11 does not match symbol xi+mm11. The number of mismatch identifiers will generally be different in different entries. In particular, table 500 illustrates an example where the entry corresponding to entry index k2 has mismatch offsets mm21, mm21+1, and mm21+2 that are consecutive values, in a case where the maximum number C of consecutive mismatches allowed in a candidate approximate match is 3. The other entries in table 500 have more than three mismatch offsets, but the last three mismatch offsets should be consecutive integers.
Entries in tables associated with different symbols in string X will often be related. Fig. 5, for example, also shows a table 510 that process 400 may have generated for an index i-1 or equivalently for a symbol xi-1 in string X. In the illustrated example, one entry of table 510 has an entry index value k3' that is equal to k3-1, where k3 is an entry index value in table 500. This will result when the B+1 symbols starting at index i in string X match the B+1 symbols starting at index k3 in reference file Y and symbol xi-1 matches symbol yk3'. In this case, the entry corresponding to index k3' in table 510 has mismatch identifiers identifying the same mismatches as identified in the entry corresponding to index k3 in table 500. However, when the mismatch identifiers are relative offsets, offsets mm31, mm32, ... for the entry corresponding to index k3 in table 500 will be one less than the corresponding offsets in the entry corresponding to index k3' in table 510. (If instead of offsets, the mismatch identifiers were absolute index values, the mismatch identifiers would be the same for both entries.) The maximum length LM3 in the entry corresponding to index k3 in table 500 will be one less than the maximum length in the entry corresponding to index k3' in table 510.
Tables 500 and 510 need not necessarily contain the locations of the mismatches. In an alternative implementation, each table could store, with the location of the match, only the length of the match and the number of mismatches. This way, the property of being able to derive tables from tables for past symbols is still preserved. This would suffice for computing the costs as long as the cost depends only on the number and not the nature of the mismatches. However, while computing the costs, the cost minimizing location in Y needs to be recorded, which can be done by scanning Y just once more while encoding the mismatches.
Table 510 may be generated in process 400 before generation of table 500, so that the entry corresponding to k3 in table 500 can be generated from the entry corresponding to k3' in table 510 without the need to scan through reference file Y to find mismatches. Other tables may include entries with similar relations to table 500. For example, if a table generated for index i-b where b<B includes an entry having an index k' such that k-k' is less than b for an index k corresponding to an entry in the table generated for index i, the entry corresponding to index k in table i can be simply generated from the entry corresponding to index k' in table i-b. Step 425 in process 400 can recognize some or all of the relations of this type, so that step 430 can generate an entry based on a prior entry without requiring the processing burden of scanning the reference file Y as done, for example, in subprocess 435. The number of comparisons or operations to fill an entry is then proportional to the number of mismatches, whereas otherwise the number of comparisons or operations depends on the number of symbol matches and mismatches.
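The derivation performed by step 430 can be sketched for the simplest relation (prior table i-1, entry k-1). The Python rendering assumes each table is a dict mapping an entry index k to a pair (relative mismatch offsets, maximum length LM); both the representation and the function name are illustrative:

```python
def derive_entry(prev_table, k):
    """Derive the table-i entry for index k from the table-(i-1) entry for
    index k-1, if present: with relative mismatch offsets, each offset and
    the maximum length LM simply decrease by one, with no rescan of Y."""
    prev = prev_table.get(k - 1)
    if prev is None:
        return None  # no related prior entry; fall back to scanning Y
    offsets, lm = prev
    return [t - 1 for t in offsets], lm - 1

# Table for i-1 has an entry at k'=4 with offsets [3, 6, 7] and LM=8;
# the table-i entry at k=5 then has offsets [2, 5, 6] and LM=7.
```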
Once step 415 of process 400 in Fig. 4 determines that the last index k to suitable candidate approximate matches for substrings starting with symbol xi has been evaluated, e.g., table 500 is complete, process 400 branches from step 415 to a step 480, which determines description costs c(i,i+j) for all values of index j from 1 to n-i. For any value j, the number of candidate approximate matches will be equal to the number of entries in the table, and the candidate that provides the lowest cost can be chosen as the approximate match used for encoding of substring x[i,i+j].
The table for the current location index i identifies locations of the mismatches so that candidates for costs c(i,i+j) can be determined or approximated for all values of j using the information in the table and known encoding techniques. The cost generally should capture the number of bits required for a complete unambiguous description of the substring. The exact number of bits may be difficult to evaluate and generally requires accessing the input data X and the reference file Y, but an estimate that uses just the information from the generated tables can be used. With one estimate, the cost of differentially encoding a substring x[i,i+j] using a substring y[k,k+j] can be estimated to be 2*log(m)+c*log(m)+c*8 bits, where m is the length of reference file Y in bytes and c is the number of mismatches between x[i,i+j] and y[k,k+j]. This estimated encoding cost first describes k, the location of the match (requires log m bits), followed by j, the length of the match (requires log m bits), followed by the location of each of the mismatches (log m bits per mismatch) and finally the new symbol in each of the mismatch locations (1 byte = 8 bits per mismatch). However, if the estimated description cost of encoding is greater than the length of substring x[i,i+j], substring x[i,i+j] can be left in original form, and the description cost is the length of substring x[i,i+j]. The minimum of the determined candidate costs for each value of j can be associated with edges in a graph structure that is used to find a non-greedy partition of string X. However, in accordance with an aspect of the current invention, if index j for a candidate cost c(i,i+j) is greater than the stored LM for an entry, the candidate cost c(i,i+j) corresponding to that entry is considered infinite. Accordingly, for any index value j greater than all of the values of LM in the table, the candidate cost c(i,i+j) is treated as being infinite.
In a graph used for determining the non-greedy partition of string X, an infinite cost corresponds to the associated edge being removed from the graph structure.
Fig. 6 illustrates how a graph 600 is made sparser by elimination of edges 610 associated with infinite costs. In comparison to graph 200 of Fig. 2, graph 600 contains some of the edges 210 associated with finite costs, while edges 610 associated with infinite costs are eliminated. When finding the shortest or lowest cost path from the node corresponding to symbol x_i to node 290, edges 610 are ignored. Accordingly, fewer paths need to be evaluated and processing time is reduced.
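Because every edge in the partition graph points forward in the string, the lowest-cost path through the sparser graph can be found in a single left-to-right pass. The sketch below assumes the finite costs are supplied as a map from (start, end) node pairs to bit counts, with infinite-cost edges simply omitted; all names are illustrative:

```python
import math

def non_greedy_partition(n, cost):
    """Single-pass shortest path over the sparse partition graph.  Nodes are
    positions 0..n in string X; an edge (a, b) with a bit cost means encoding
    x[a:b] as one part.  Infinite-cost edges are absent from `cost`, so they
    are never examined (the graph is sparser, as in Fig. 6)."""
    adj = {}
    for (a, b), bits in cost.items():
        adj.setdefault(a, []).append((b, bits))
    best = [math.inf] * (n + 1)  # best[i] = cheapest description of x[0:i]
    best[0] = 0.0
    back = [None] * (n + 1)      # predecessor node on the best path
    for i in range(n):           # edges only go forward, so one pass suffices
        if math.isinf(best[i]):
            continue
        for b, bits in adj.get(i, []):
            if best[i] + bits < best[b]:
                best[b] = best[i] + bits
                back[b] = i
    # Walk back from node n to recover the selected partition.
    parts, node = [], n
    while node:
        parts.append((back[node], node))
        node = back[node]
    return parts[::-1], best[n]
```

With five finite edges over a four-symbol string, the pass below picks the two-part split over the single-part or three-part alternatives because its total cost is lowest.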
Returning to process 400 of Fig. 4, step 480 uses the table i just completed and determines the description costs c(i,i+j) over the required range of j. The costs c(i,i+j) can be stored in a data structure that will be used to determine the non-greedy partition of string X. After determining the costs, any unwanted prior tables can be deleted. For example, if step 425 can only recognize entries from table i-1 as being related to entries in table i, any table associated with an index preceding i-1 (e.g., table i-2) can be deleted. However, if step 425 recognizes relations from more than one preceding table, more than one preceding table can be kept and used in determining entries of the next table. The number of tables kept can thus be one or more in different embodiments of process 400.
Step 490 determines whether a table needs to be generated for a next starting location in string X. Generally, the last table will be generated when i is about equal to n-B, where n is the number of symbols in string X and B is the number of initial matching symbols required in a suitable candidate for an approximate match. After the last execution of step 490, repetitions of step 480 will have filled the data structure needed for determining the non-greedy partition to be used in compression of string X. Step 495 then determines the partition based on the sparser graph (e.g., graph 600 of Fig. 6) and encodes each substring in that partition to generate the compressed string.
The exemplary embodiments of the invention described above involve finding matching strings in which symbols are related in one dimension. In principle, the embodiments of the invention can be extended to data related in two or more dimensions. For example, for non-greedy partitioning and differential compression of image or video data, which generally has data (e.g., pixel values) related using two (e.g., row and column) indices, one task is to identify approximate matches of blocks of symbols in one image or frame to blocks of symbols in another image or frame. Suitable candidate approximate matches can be required to have a B1xB2 block of symbols that match exactly. Tables can then be constructed for each point in the input image or frame, with entries respectively corresponding to the locations of exact matches to a B1xB2 block at the location corresponding to the table. The maximum size of the suitable candidate matches can be limited according to a condition analogous to the condition concerning C consecutive mismatches, so that a block does not need to be considered as a candidate approximate match if the block contains more than C consecutive mismatches. A straightforward extension of this condition to two dimensions would be to require a candidate approximate match to have fewer than C consecutive mismatches in every row and column. Finding such candidate approximate matches and evaluating the description costs can be useful for image and video compression as well, although the cost function and the process for finding the non-greedy partition differ from those used for the one-dimensional process described above.
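The row-and-column condition described above can be checked with a simple run-length scan. The sketch below assumes the candidate block is given as a 2-D boolean mismatch grid (True wherever the input block and the reference block differ); the function name is illustrative:

```python
def is_suitable_block(mismatch, C):
    """Two-dimensional analogue of the C-consecutive-mismatches condition:
    a candidate block is rejected if any row or any column of its mismatch
    grid contains a run of C or more consecutive mismatches."""
    def longest_run(line):
        run = best = 0
        for bad in line:
            run = run + 1 if bad else 0  # extend or reset the current run
            best = max(best, run)
        return best
    rows_ok = all(longest_run(row) < C for row in mismatch)
    cols_ok = all(longest_run(col) < C for col in zip(*mismatch))  # transpose
    return rows_ok and cols_ok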
The compression processes described above can be employed in any systems and devices in which compression is desired. Such applications include but are not limited to data transmitted as a digital signal by modems, networks, and similar devices or systems, and data storage in computer systems, computer peripherals, removable media such as video disks, and consumer devices such as telephones and music and video players. The processes and systems can be implemented using custom hardware or using software or firmware executed by a computer or processor. Further, the software or firmware products can be embodied in physical media containing machine readable instructions that are executed to carry out a process in accordance with an embodiment of the invention.
Fig. 7 shows a block diagram of a system 700 in accordance with an embodiment of the invention. System 700 includes data storage such as computer memory containing data such as a reference file 720, location tables 740, and a graph structure 760 that are manipulated by a table construction unit 730, a cost calculator 750, a partition unit 770, and an encoder 780. Table construction unit 730, cost calculator 750, partition unit 770, and encoder 780 are processing units that can be implemented by custom hardware or program routines being executed by a processor. Table construction unit 730 constructs node tables from input data 710 and reference file 720, for example, using the techniques illustrated by steps 415 to 470 described above with reference to Fig. 4. Cost calculator 750 calculates description costs using tables 740 (and in some embodiments also using input data 710 and reference file 720). Cost calculator 750 stores the calculated description costs in graph structure 760, for example, using the techniques described for step 480 in Fig. 4. Partition unit 770 uses graph structure 760 to select a non-greedy partition for input data 710, and encoder 780 encodes the parts of the non-greedy partition to generate compressed output data X'.
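Viewed abstractly, system 700 is a pipeline in which each unit consumes the previous unit's output. The sketch below models the four processing units as injected callables; this is a purely illustrative composition, not the patent's implementation:

```python
def compress(input_data, reference, build_tables, calc_costs, partition, encode):
    """Pipeline mirroring Fig. 7: table construction unit (730), cost
    calculator (750), partition unit (770), and encoder (780), each
    supplied as a callable so any implementation can be plugged in."""
    tables = build_tables(input_data, reference)   # cf. steps 415-470
    graph = calc_costs(tables)                     # cf. step 480
    parts = partition(graph)                       # non-greedy partition
    return b"".join(encode(part) for part in parts)
```

Any concrete stage implementations with matching shapes can be dropped in, which also makes each unit testable in isolation.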
Although the invention has been described with reference to particular embodiments, the description is only an example of the invention's application and should not be taken as a limitation. Various adaptations and combinations of features of the embodiments disclosed are within the scope of the invention as defined by the following claims.

Claims

What is claimed is:
1. A compression process comprising:
constructing, by a computer, a plurality of tables respectively associated with a plurality of first locations in input data, each of the tables including one or more entries respectively corresponding to one or more second locations in a reference file, wherein each of the second locations in each table identifies parts of the reference file that are candidate approximate matches for parts of the input data identified by the first location associated with the table;
for each of the tables, determining, by the computer, costs associated with differential encoding of the parts of the input data identified by the first location associated with the table, wherein the costs are determined using the candidate approximate matches identified by the entries in that table;
selecting, by the computer, a non-greedy partition of the input data using the parts for which the costs were determined, wherein the non-greedy partition is selected to minimize a sum of the costs determined for the parts in the partition; and
encoding, by the computer, the parts in the partition to compress the input data.
2. The process of claim 1, wherein the input data comprises a string of symbols.
3. The process of claim 2, wherein each of the entries in each of the tables identifies parts in the reference file containing initial symbols that match initial symbols in the string at the first location associated with the table containing the entry.
4. The process of any preceding claim, wherein each of the entries in each of the tables comprises a size value indicating a maximum size for the candidate approximate matches identified by the corresponding second locations.
5. The process of claim 4, wherein selecting the partition comprises treating the costs as being infinite for any of the parts identified by each of the first locations if the part is longer than all of the size values in the table corresponding to the first location.
6. The process of claim 4, wherein the input comprises a string of symbols, and each of the size values is equal to a number of symbols in the reference file between the second location associated with the entry containing the size value and a third location at which C consecutive mismatches between symbols in the reference file and corresponding symbols in the string occur, C being a parameter of the process.
7. The process of claim 2, wherein each of the entries comprises a set of mismatch identifiers, each mismatch identifier indicating an offset such that a symbol in the reference file at the offset relative to the second location associated with the entry containing the mismatch identifier does not match a symbol in the input data at the offset relative to the first location associated with the table containing the entry.
8. The process of claim 2, wherein constructing the tables comprises constructing a first entry of a first table by:
comparing symbols in the input data at offsets relative to the first location associated with the table to symbols in the reference file at the offsets relative to the second location associated with the first entry;
recording in the first entry the offsets that are such that the respective symbols in the reference file do not match the respective symbols in the input data; and
recording in the first entry a size that indicates an occurrence of C consecutive offsets such that the respective symbols in the reference file do not match the respective symbols in the input data.
9. The process of claim 2, wherein constructing the tables comprises:
constructing a first entry of a first table by comparing symbols in the input data at offsets relative to the first location associated with the table to symbols in the reference file at the offsets relative to the second location associated with the first entry and recording in the first entry information regarding the comparisons; and
constructing a second entry of a second table by recognizing a relationship between the first location associated with the first entry and the second location associated with the second entry and using the information from the first entry to construct information recorded in the second entry.
10. A computer readable medium containing instructions that when executed by a computer perform any of the processes of claims 1 to 9.
11. A compression system, comprising:
storage containing a reference file (720);
a table construction unit (730) that receives input data (710) for compression and constructs tables (740) respectively for first locations in the input data (710), wherein for each of the first locations, the table construction unit (730) operates to identify second locations of sets of symbols in the reference file (720) that match a set of symbols at the first location in the input data (710) and to construct a table containing entries respectively corresponding to the second locations, wherein each entry indicates a maximum size for an approximate match to a part of the input data at the first location;
a cost calculator (750) coupled to use the tables (740) to calculate costs for differential encoding of parts of the input data at the first locations and, from the calculated costs, construct a graph structure (760);
a partition unit (770) that uses the graph structure (760) to select a non-greedy partition of the input data (710); and
an encoder (780) that encodes each part in the non-greedy partition to produce compressed output data.
12. The system of claim 11, wherein each of the entries further comprises mismatch identifiers indicating symbols in the reference file that differ from corresponding symbols from the input data.
13. The system of claim 11, wherein at least one of the cost calculator (750) and the partition unit (770) treats a cost associated with a part at each of the first locations as infinite if the part has a size greater than all of the maximum sizes in the entries of the table for the first location.
14. The system of claim 11, wherein the input data (710) comprises a string of symbols, and the table construction unit (730) sets the maximum size equal to a number of symbols in the reference file (720) between the second location associated with the entry containing the maximum size and a third location at which C consecutive mismatches between symbols in the reference file and corresponding symbols in the string occur, C being a parameter of the compression.
15. The system of claim 11, wherein the input data (710) comprises a string of symbols, and each of the entries in each of the tables identifies parts in the reference file (720) containing initial symbols that match initial symbols in the string (710) at the first location associated with the table containing the entry.
PCT/US2009/052377 2009-07-31 2009-07-31 Non-greedy differential compensation WO2011014182A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2009/052377 WO2011014182A1 (en) 2009-07-31 2009-07-31 Non-greedy differential compensation

Publications (1)

Publication Number Publication Date
WO2011014182A1 true WO2011014182A1 (en) 2011-02-03

Family

ID=43529604

Country Status (1)

Country Link
WO (1) WO2011014182A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009059060A2 (en) * 2007-10-31 2009-05-07 Hewlett-Packard Development Company, L.P. Collaborative compression


Non-Patent Citations (3)

Title
CHEDID F.B. ET AL: "On Compactly Encoding With Differential Compression", IEEE INTERNATIONAL CONF. ON COMPUTER SYSTEMS AND APPLICATIONS, 2000, pages 123 - 129 *
HORSPOOL R.N.: "The effect of non-greedy parsing in Ziv-Lempel compression methods", DATA COMPRESSION CONFERENCE, 1995, pages 302 - 311 *
MIKLOS AJTAI ET AL: "Compactly Encoding Unstructured Inputs with Differential Compression", JOURNAL OF THE ACM, vol. 49, no. 3, May 2002 (2002-05-01), XP002467726, DOI: doi:10.1145/567112.567116 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 09847925; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 09847925; Country of ref document: EP; Kind code of ref document: A1)