CN100349160C - Data compression method by finite exhaustive optimization

Info

Publication number: CN100349160C
Application number: CNB2005100960026A
Authority: CN
Inventors: 陈淮琰; 张汪洋; 闫海红
Original assignee: Inventec Besta Xian Co Ltd
Current assignee: Inventec Besta Xian Co Ltd
Priority date: 2005-09-08
Filing date: 2005-09-08
Publication date: 2007-11-14
Anticipated expiration: 2025-09-08
Also published as: CN1737791A

Abstract

The present invention relates to a compression method that applies finite exhaustive optimization to the statistical characteristics of data. The technical solution comprises the following steps: 1) The encoding type of the data to be compressed is determined, and long-code replacement or pre-compression of the coded data is then performed. 2) It is determined whether the data is dictionary-type data. 3) The repeating linguistic units within the maximum length range are counted. 4) The finite exhaustive method is used to search for the optimal length range of repeating linguistic units to replace. 5) The repeating linguistic units within that optimal length range are replaced in sequence from long to short and encoded, a replacement information file is output, and the repeating linguistic units are recorded at the same time. 6) The frequencies of the non-replaced characters are counted and a Huffman tree is generated from those frequencies. 7) The compressed data is generated. The invention provides a compression method with finite exhaustive optimization for data that is large in volume, has a high linguistic repetition rate and has statistical characteristics. The method adapts well to different data, optimizes the compressed data, and improves the compression ratio.

Description

A compression method that performs finite exhaustive optimization on data

I. Technical Field

The present invention relates to a compression method that applies finite exhaustive optimization to the statistical characteristics of data.

II. Background Art

Today's world is one of rapid development in the electronic information industry, and high-tech products such as computers are advancing quickly. With the widespread use of handheld consumer electronics, users demand more and more of them, and whether future handheld consumer electronics can provide larger-capacity knowledge and other information services has become a measure of whether a high-tech product is technically leading.

However, current handheld consumer electronics, and embedded devices in particular, are resource-constrained (limited memory and limited CPU speed) and sometimes cannot store large volumes of data or read them quickly, so the data must be compressed. At present, data compression generally uses the Huffman compression algorithm combined with fixed-length repeating-linguistic-unit substitution, which cannot compress the data fully. Consequently, for large volumes of data, especially data with a high language repetition frequency such as dictionary data, memory storage space may be wasted if no optimal compression scheme tailored to the characteristics of the data is used.

Moreover, when compressing dictionary-type data, the language of a dictionary is both repetitive and independent, so fixed-length repeating-linguistic-unit substitution cannot reach the optimum and also limits how widely the dictionary compression can be applied.

III. Summary of the Invention

To solve the above technical problem of the data compression methods described in the background art, the present invention provides a compression method that performs finite exhaustive optimization on data that is large in volume, has a high language repetition rate and has statistical characteristics. The method adapts well to different data, optimizes the compressed data, and improves the compression ratio.

The technical solution of the present invention is a compression method that performs finite exhaustive optimization on data, characterized in that the method comprises the following steps:

1) Determine whether the encoding type of the data to be compressed is Unicode or a local encoding. If the data is Unicode-coded, perform long-code replacement to pre-compress it: codes with a value less than 0x80 have their high byte removed; the other codes are sorted by frequency of use, the 127 most frequently used codes are replaced by the code values 0x80-0xFE, and the remaining codes are replaced by the marker 0xFF followed by the two-byte code; then proceed to step 2). If the data is in a local encoding, proceed directly to step 2);

2) Determine whether the data is dictionary-type data; if so, record the block markers of the dictionary-type data and then proceed to step 3); if not, proceed directly to step 3);

3) Count the repeating linguistic units within the maximum length range: first record the positions of all identical characters in the data, then sort all identical characters according to the characters that follow them, and finally find all repeating linguistic units within the maximum length range and record them by length;

4) Use the finite exhaustive method to search for the optimal length range of repeating linguistic units to substitute;

5) Generate the repeating-linguistic-unit replacement information file according to the optimal length range: replace the repeating linguistic units within the optimal length range in turn from long to short, encode them, output the replacement information file, and record the repeating linguistic units at the same time;

6) From the original data and the replacement information file, count the frequencies of the non-replaced characters and generate a Huffman tree from those frequencies;

7) Generate the compressed data: using the replacement information file and the Huffman tree, apply hybrid LZSS and Huffman coding to the original data while dividing it into blocks according to the block information, generate the final file, and finally pack the repeating linguistic units, the block information and the Huffman tree into the final data.
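The pre-compression described in step 1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `precompress_unicode` and `restore_unicode`, the restriction to 16-bit code values, and the big-endian order of the escaped two-byte code are all assumptions that the text does not specify.

```python
from collections import Counter


def precompress_unicode(text: str) -> tuple[bytes, list[int]]:
    """Pack 16-bit code values into a mostly single-byte stream (step 1 sketch)."""
    units = [ord(ch) for ch in text]
    if any(u > 0xFFFF for u in units):
        raise ValueError("this sketch assumes 16-bit code values only")
    # Rank the codes >= 0x80 by frequency of use; the 127 most frequent
    # ones are assigned the single-byte values 0x80..0xFE.
    freq = Counter(u for u in units if u >= 0x80)
    table = [u for u, _ in freq.most_common(127)]
    short = {u: 0x80 + i for i, u in enumerate(table)}
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)                # high byte dropped
        elif u in short:
            out.append(short[u])         # frequent code: one byte
        else:
            out.append(0xFF)             # escape marker
            out += u.to_bytes(2, "big")  # assumed byte order
    return bytes(out), table


def restore_unicode(data: bytes, table: list[int]) -> str:
    """Invert precompress_unicode using the same frequency table."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            chars.append(chr(b))
            i += 1
        elif b < 0xFF:
            chars.append(chr(table[b - 0x80]))
            i += 1
        else:
            chars.append(chr(int.from_bytes(data[i + 1:i + 3], "big")))
            i += 3
    return "".join(chars)
```

The frequency table has to be stored alongside the packed bytes, since the decoder cannot rebuild the 0x80-0xFE assignments without it.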

The concrete steps of step 4) above are as follows:

4.1) Take the range from a reference value up to the maximum repeating-linguistic-unit length as the finite exhaustive search range;

4.2) Taking each repeating-linguistic-unit length in the search range in turn as the candidate optimal length, perform the substitutions one by one from long to short, and record the length reduction obtained after each substitution pass;

4.3) From the recorded reductions, find the maximum reduction; the repeating-linguistic-unit length range corresponding to the maximum reduction is the optimal length range of repeating linguistic units to substitute.
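Steps 4.1) to 4.3) can be sketched as a small search loop. The code below is a simplified illustration, not the patented implementation: the names `saved_bytes` and `best_min_length`, the greedy most-frequent-first substitution order within one length, and the fixed reference cost `REF_BYTES` (standing in for nBit / 8) are all assumptions.

```python
from collections import Counter

REF_BYTES = 2  # assumed size of one substitution code, i.e. nBit / 8


def saved_bytes(text: str, min_len: int, max_len: int) -> int:
    """Bytes saved by substituting repeated units of length max_len..min_len,
    longest first. Each unit is stored once and each hit costs REF_BYTES."""
    segments = [text]  # pieces of text not yet covered by a substituted unit
    saved = 0
    for length in range(max_len, min_len - 1, -1):
        counts = Counter(
            seg[i:i + length]
            for seg in segments
            for i in range(len(seg) - length + 1)
        )
        for unit, n in counts.most_common():
            if n < 2:
                break
            # split() cuts out the non-overlapping occurrences still present.
            pieces = [p for seg in segments for p in seg.split(unit)]
            hits = len(pieces) - len(segments)
            gain = hits * (length - REF_BYTES) - length
            if hits >= 2 and gain > 0:
                segments = pieces
                saved += gain
    return saved


def best_min_length(text: str, base_len: int, max_len: int) -> tuple[int, dict[int, int]]:
    """Try every lower bound in [base_len, max_len] and keep the best one."""
    results = {m: saved_bytes(text, m, max_len) for m in range(base_len, max_len + 1)}
    best = max(results, key=results.get)
    return best, results
```

The `results` dictionary plays the role of the table of recorded length reductions: one entry per candidate length, with the largest entry selecting the range to use.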

The present invention is mainly aimed at current embedded-device development, where data compression generally uses the Huffman compression method together with substitution of the repeating linguistic units within a specified length range. However, because data sets are independent of one another and vary in length, the specified length range of the repeating linguistic units must change accordingly in order to achieve maximum compression while not increasing the time complexity and space complexity of decompression.

Therefore, the method of the invention obtains the characteristics of the data by statistical techniques, counts the repeating linguistic units up to the maximum length according to those characteristics, and then uses the finite exhaustive method to find the optimal substitution scheme. The repeating linguistic units are replaced and numbered in order according to that scheme, while the non-repeating units are encoded with Huffman coding, which improves data compression efficiency without increasing the space complexity of decompression. In addition, the method exploits the block independence of dictionary-type data: the large data set is divided into small blocks that are compressed separately, which speeds up data queries and improves data compression efficiency without increasing the time complexity of decompression.
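The statistical pass that counts repeating linguistic units can be illustrated by sorting all start positions by the characters that follow them, so that positions sharing a prefix become adjacent. This is a hedged sketch: the function name `repeated_units`, the `min_len` parameter, and the choice to count overlapping occurrences are assumptions made for illustration.

```python
from collections import defaultdict


def repeated_units(data: str, max_len: int, min_len: int = 2) -> dict[int, dict[str, int]]:
    """Return {length: {unit: occurrence count}} for units seen at least twice.
    Overlapping occurrences are counted."""
    # Sort every start position by the (at most max_len) characters after it.
    order = sorted(range(len(data)), key=lambda p: data[p:p + max_len])
    found: dict[int, dict[str, int]] = defaultdict(dict)
    for length in range(min_len, max_len + 1):
        run_unit, run_count = None, 0
        for p in order:
            unit = data[p:p + length]
            if len(unit) < length:
                continue  # too close to the end of the data
            if unit == run_unit:
                run_count += 1
            else:
                if run_count >= 2:
                    found[length][run_unit] = run_count
                run_unit, run_count = unit, 1
        if run_count >= 2:
            found[length][run_unit] = run_count
    return dict(found)
```

Because equal prefixes are contiguous in sorted order, one linear scan per length is enough to group and count them.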

Therefore the present invention has the following advantages:

1. By using finite exhaustive optimization, the invention compresses the repeating linguistic units of the data fully, which improves the data compression ratio and ensures the adaptability of the compression algorithm.

2. By compressing the data in blocks, the invention eliminates the dependence between blocks, improves data query speed, and allows any single block to be decompressed at random.

3. The invention decompresses large data correctly with small time and space requirements, and is suitable for resource-constrained embedded systems.
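The random decompression of single blocks mentioned in advantage 2 can be illustrated with an offset index over independently compressed blocks. In this sketch `zlib` is only a stand-in for the per-block coder (the method itself uses hybrid LZSS and Huffman coding), and the names `compress_blocks` and `read_block` are hypothetical.

```python
import zlib


def compress_blocks(blocks: list[bytes]) -> tuple[bytes, list[int]]:
    """Compress each block on its own and record where each one starts."""
    payload = bytearray()
    index = []
    for block in blocks:
        index.append(len(payload))
        payload += zlib.compress(block)  # stand-in coder, see lead-in
    index.append(len(payload))           # sentinel: end of the last block
    return bytes(payload), index


def read_block(payload: bytes, index: list[int], n: int) -> bytes:
    """Decompress block n without touching any other block."""
    return zlib.decompress(payload[index[n]:index[n + 1]])
```

Since no block refers to data in another block, a lookup only needs the index entry and the bytes of the one block it points to.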

IV. Brief Description of the Drawings

Fig. 1 is a flow chart of the method of the present invention;

Fig. 2 is a detailed flow chart of step 4 of the present invention;

Fig. 3 is an example of the table of recorded length reductions produced during finite exhaustive optimization;

Fig. 4 is a table comparing the results of compressing the Da Ying English-Chinese-Japanese-Korean dictionary data with the present invention and with a known compression method;

Fig. 5 is a table comparing the results of compressing the Oxford dictionary data with the present invention and with a known compression method.

V. Detailed Description of the Embodiments

Referring to Fig. 1, the flow of the method of the present invention is as follows:

2) Determine whether the data is dictionary-type data; if so, record the block markers of the dictionary-type data and then proceed to step 3); if not, proceed directly to step 3);

Referring to Fig. 2 and the process description above, the finite exhaustive search for the optimal substitution range is the key to realizing the method and improving the data compression ratio. The calculation used to evaluate the repetition statistics is therefore briefly introduced below, where Len is the length of a repeating linguistic unit, Rep is the number of times the repeating linguistic unit occurs, nBit is the number of bits in the code of a repeating linguistic unit, and CompressRate is the compression ratio.

The compression ratio of repeating-linguistic-unit substitution is:

k = [0, kMax - 1]

Len_k = BEG_NUM + k

Rep_k >= kMax + BEG_NUM - 1 - k

CompressRate % = \frac{\sum_{k=0}^{kMax-1} Len_k + \sum_{k=0}^{kMax-1} [(\sum_{i=0} Rep_{ki}) * nBit / 8]}{\sum_{k=0}^{kMax} [Len_k * (\sum_{i=0} Rep_{ki})]}

The compression ratio of substituting a single repeating linguistic unit is:

CompressRate % = \frac{1}{Rep} + \frac{nBit}{Len * 8}

From the compression ratio of a single repeating linguistic unit it follows that:

the larger Rep is, the smaller the compression ratio;

the larger Len is, the smaller the compression ratio.
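The single-unit formula can be checked numerically. The function name `single_unit_ratio` and the sample values are illustrative assumptions; the formula itself is the one given above, which is the stored size (the unit once, plus Rep codes of nBit / 8 bytes each) divided by the original size Len * Rep.

```python
def single_unit_ratio(length: int, rep: int, n_bit: int) -> float:
    """CompressRate = 1/Rep + nBit/(Len*8), as a fraction of the original size."""
    return 1 / rep + n_bit / (length * 8)


# A unit of 8 bytes that occurs 4 times and is coded in 16 bits:
# original 8 * 4 = 32 bytes, stored 8 + 4 * 2 = 16 bytes.
print(single_unit_ratio(8, 4, 16))   # 0.5

# Raising Rep or Len lowers the ratio.
print(single_unit_ratio(8, 8, 16) < single_unit_ratio(8, 4, 16))    # True
print(single_unit_ratio(16, 4, 16) < single_unit_ratio(8, 4, 16))   # True
```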

It follows that a longer repeating linguistic unit is favorable to compression. In addition, repeating linguistic units may overlap (for example, cat and at), which can reduce substitution efficiency, and such overlap is a complex phenomenon that is difficult to quantify. The present invention therefore first counts the long repeating linguistic units and then searches for the optimal compression by finite exhaustion to optimize the compression. The concrete steps are as follows:

4.3) From the recorded reductions, find the maximum reduction; the repeating-linguistic-unit length range corresponding to the maximum reduction is the optimal length range of repeating linguistic units to substitute.

Referring to Fig. 3, the table in this embodiment lists the length reduction after each substitution pass; the length 35, which corresponds to the maximum reduction, gives the optimal length range of repeating linguistic units to substitute.

The present invention is described in further detail below with reference to specific embodiments.

Referring to Fig. 4, the Da Ying English-Chinese-Japanese-Korean dictionary data has a source file length of 45,776,158 bytes. After preprocessing, the file size is 27,668,745 bytes. The repeated strings up to the maximum length (0x7f) and their frequencies are counted and stored in a .rep file. After finite exhaustive optimization the optimal length is 35; the repeated strings within the optimal length range are numbered from long to short, giving a repeated-string length of 491,862 bytes. To handle the large capacity of dictionary-type data, the data is divided directly into blocks, an address index is built for each block header, and the address index is stored in an .idx file whose length is 55,379 bytes. Once this work is done the data is compressed, giving a compressed result of 12,115,479 bytes; when an existing compression method is used on the same dictionary data, the resulting data length is 12,839,525 bytes.

In addition, the data compressed by the method of the invention is divided into 24,862 blocks with a compression ratio of 28.05%, whereas the compression ratio of the existing compression method is 66.7%.

Referring to Fig. 5, for the Oxford dictionary data installed in a handheld electronic product, the comparison table of compression results shows that the original data length is 22,580,376 bytes and the data length after compression is 4,505,792 bytes, whereas compressing with the existing compression method gives a data length of 5,089,223 bytes. The data compressed by the method of the invention is divided into 146,292 blocks with a compression ratio of 19.95%, whereas the compression ratio of the existing compression method is 22.54%.

The comparison on real data shows that the method of the invention not only improves data compression efficiency but, especially for very large data with a high frequency of repeated strings, also makes counting string repetition frequencies faster and more convenient.
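The Fig. 5 percentages can be reproduced from the byte counts, taking the compression ratio as compressed size divided by original size. The helper name `ratio_percent` is an assumption introduced for this check.

```python
def ratio_percent(compressed: int, original: int) -> float:
    """Compressed size as a percentage of the original size, to 2 decimals."""
    return round(100 * compressed / original, 2)


oxford_original = 22_580_376
print(ratio_percent(4_505_792, oxford_original))   # 19.95 (inventive method)
print(ratio_percent(5_089_223, oxford_original))   # 22.54 (existing method)
```

Applied to the Fig. 4 byte counts (45,776,158 bytes original), the same calculation gives about 26.47% for the 12,115,479-byte result and about 28.05% for the 12,839,525-byte result.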

Claims

1. A compression method that performs finite exhaustive optimization on data, characterized in that the method comprises the following steps:

4.3) From the recorded reductions, find the maximum reduction; the repeating-linguistic-unit length range corresponding to the maximum reduction is the optimal length range of repeating linguistic units to substitute;