CN100349160C - Data compression method by finite exhaustive optimization - Google Patents


Publication number
CN100349160C
Authority
CN
China
Prior art keywords
data
cataphasia
length
unit
repeating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005100960026A
Other languages
Chinese (zh)
Other versions
CN1737791A (en)
Inventor
陈淮琰
张汪洋
闫海红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inventec Besta Xian Co Ltd
Original Assignee
Inventec Besta Xian Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Besta Xian Co Ltd filed Critical Inventec Besta Xian Co Ltd
Priority to CNB2005100960026A priority Critical patent/CN100349160C/en
Publication of CN1737791A publication Critical patent/CN1737791A/en
Application granted granted Critical
Publication of CN100349160C publication Critical patent/CN100349160C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a compression method that applies finite exhaustive optimization to the statistical characteristics of data. The technical solution comprises the following steps: 1) the encoding type of the data to be compressed is determined, and long-byte codes are replaced to pre-compress the coded data; 2) it is determined whether the data is dictionary-type data; 3) the repeated linguistic units within the maximum length range are counted; 4) the finite exhaustive method is used to find the optimal length range of repeated linguistic units to substitute; 5) the repeated linguistic units within that optimal length range are substituted in order from long to short and encoded, a substitution information file is output, and the units are recorded; 6) the frequency of the non-substituted characters is counted and a Huffman tree is generated from it; 7) the compressed data is generated. The invention provides finite exhaustive optimization for data with a large volume, a high language repetition rate, and statistical characteristics. The compression adapts well to different data, optimizes the compressed data, and improves the compression ratio.

Description

A compression method applying finite exhaustive optimization to data
I. Technical field
The invention relates to a compression method that applies finite exhaustive optimization to the statistical characteristics of data.
II. Background
Today's world is one of rapid growth in the electronic information industry, and high-tech products such as computers are developing quickly. As handheld consumer electronics become widespread, users demand more of them, and whether a future handheld product can offer larger-capacity knowledge content and other information services has become a measure of whether its technology is leading.
However, current handheld consumer electronics, especially embedded devices, are resource-constrained (limited memory and limited CPU speed) and sometimes cannot store large volumes of data or read them quickly, so the data must be compressed. At present, data compression in such processing generally uses the Huffman compression algorithm plus substitution of fixed-length repeated linguistic units, which cannot fully compress the data. Consequently, for large-volume data, especially data with a high language repetition frequency such as dictionary data, storage space may be wasted unless an optimal compression scheme is devised for the characteristics of the data itself.
Moreover, when dictionary-type data is compressed, the dictionary language is both repetitive and independent, so fixed-length repeated-unit substitution cannot reach the optimum and also limits how widely the dictionary compression can be applied.
III. Summary of the invention
To solve the above technical problem of the data compression methods described in the background, the invention provides a compression method that applies finite exhaustive optimization to data with statistical characteristics, a large volume, and a high language repetition rate. The method adapts well to different data, optimizes the compressed data, and improves the compression ratio.
The technical solution of the invention is a compression method applying finite exhaustive optimization to data, characterized in that the method comprises the following steps:
1) Determine whether the data to be compressed is Unicode-encoded or locally encoded. If it is Unicode-encoded, perform long-byte code replacement to pre-compress it: codes with a value below 0x80 have their high byte removed; the other codes are sorted by frequency of use; the 127 most frequently used codes are replaced with the code values 0x80-0xFE; each remaining code is replaced with a 0xFF marker followed by its two-byte code. Then proceed to step 2). If the data is locally encoded, proceed directly to step 2);
2) Determine whether the data is dictionary-type data. If so, record the block markers of the dictionary-type data and then proceed to step 3); if not, proceed directly to step 3);
3) Count the repeated linguistic units within the maximum length range: first record the positions of all identical characters in the data, then sort the identical characters by the characters that follow them, and finally find all repeated linguistic units within the maximum length range and record them by length;
4) Use the finite exhaustive method to find the optimal length range of repeated linguistic units to substitute;
5) Generate the repeated-unit substitution information file according to that optimal length range: substitute the repeated linguistic units within the range in order from long to short, encode them, output the substitution information file, and record the repeated linguistic units;
6) From the original data and the substitution information file, count the frequency of the non-substituted characters and build a Huffman tree from these frequencies;
7) Generate the compressed data: using the substitution information file and the Huffman tree, replace the original data with a hybrid encoding based on the LZSS and Huffman algorithms while splitting it into blocks according to the block information, generate the final file, and finally pack the repeated linguistic units, the block information, and the Huffman tree into the final data.
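The pre-compression in step 1) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the input is a list of 16-bit code values, that the two-byte escape form is big-endian, and that the frequency table is stored alongside the data so it can be reversed. The names `build_table`, `precompress`, and `expand` are hypothetical.

```python
from collections import Counter

def build_table(codes):
    # Rank codes >= 0x80 by frequency; the 127 most frequent get 0x80..0xFE.
    freq = Counter(c for c in codes if c >= 0x80)
    top = [c for c, _ in freq.most_common(127)]
    return {c: 0x80 + i for i, c in enumerate(top)}

def precompress(codes, table):
    out = bytearray()
    for c in codes:
        if c < 0x80:
            out.append(c)                 # high byte dropped, one byte kept
        elif c in table:
            out.append(table[c])          # frequent code -> single byte
        else:
            out.append(0xFF)              # escape marker + two-byte code
            out += c.to_bytes(2, "big")   # byte order is an assumption
    return bytes(out)

def expand(data, table):
    inverse = {v: k for k, v in table.items()}
    codes, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            codes.append(b)
            i += 1
        elif b == 0xFF:
            codes.append(int.from_bytes(data[i + 1:i + 3], "big"))
            i += 3
        else:
            codes.append(inverse[b])
            i += 1
    return codes
```

With this scheme a frequent two-byte code costs one byte, while a rare one costs three, so the gain depends on how skewed the code frequencies are.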
The specific sub-steps of step 4) are as follows:
4.1) Take the range from a base value to the maximum repeated-unit length as the finite exhaustive search range;
4.2) Taking each repeated-unit length in the search range in turn as the candidate length, perform the substitutions one by one from long to short, and record the length reduction obtained after each substitution pass;
4.3) From the recorded reductions, find the maximum; the repeated-unit length range corresponding to that maximum reduction is the optimal length range of repeated linguistic units to substitute.
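Sub-steps 4.1) to 4.3) can be sketched as a small search loop. The sketch below rests on several assumptions that the text does not state: each candidate length is treated as the lower bound of the range (the upper bound is the maximum length), every substituted occurrence costs a fixed `code_bytes`, each substituted unit is stored once, and matching is greedy and non-overlapping. All function names are hypothetical.

```python
from collections import Counter

def find_repeats(data, length):
    # Substrings of the given length that occur at least twice.
    counts = Counter(data[i:i + length] for i in range(len(data) - length + 1))
    return [unit for unit, n in counts.items() if n >= 2]

def free_hits(data, unit, covered):
    # Non-overlapping occurrences that touch no already-substituted byte.
    hits, start = [], data.find(unit)
    while start != -1:
        if not any(covered[start:start + len(unit)]):
            hits.append(start)
            start = data.find(unit, start + len(unit))
        else:
            start = data.find(unit, start + 1)
    return hits

def saving_for_range(data, lo, hi, code_bytes=2):
    covered = bytearray(len(data))
    saved = 0
    for length in range(hi, lo - 1, -1):          # long to short (step 4.2)
        for unit in find_repeats(data, length):
            hits = free_hits(data, unit, covered)
            gain = len(hits) * (length - code_bytes) - length
            if len(hits) >= 2 and gain > 0:
                for h in hits:
                    covered[h:h + length] = b"\x01" * length
                saved += gain
    return saved

def best_lower_bound(data, base, max_len, code_bytes=2):
    # Step 4.1: search range; step 4.3: keep the largest reduction.
    reductions = {lo: saving_for_range(data, lo, max_len, code_bytes)
                  for lo in range(base, max_len + 1)}
    best = max(reductions, key=reductions.get)
    return best, reductions
```

The search is exhaustive only over the candidate lengths, which keeps the number of trial passes bounded by the size of the range.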
The invention addresses current embedded-device development, where data compression generally uses Huffman compression plus substitution of repeated linguistic units within a specified length range. However, because data sets are independent of one another and vary in length, the specified range of repeated-unit lengths must change accordingly to achieve maximum compression without increasing the time or space complexity of decompression.
Therefore, the method obtains the data characteristics by statistical techniques, counts the repeated linguistic units up to the maximum length, and then uses the finite exhaustive method to find the optimal substitution scheme. Repeated linguistic units are numbered and substituted in order according to that scheme, while non-repeated characters are Huffman-coded, which improves compression efficiency without increasing the space complexity of decompression. In addition, the method exploits the block structure peculiar to dictionary-type databases: the large data set is cut into small blocks that are compressed individually, which speeds up data queries and improves compression efficiency without increasing the time complexity of decompression.
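The Huffman coding of the characters left over after substitution can be illustrated with a short sketch. It assumes the input is simply the sequence of non-substituted symbols and returns only the code length per symbol; the function name `huffman_code_lengths` is hypothetical and the tie-breaking order is an arbitrary choice.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (frequency, tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]
```

Frequent symbols end up with shorter codes, which is why the frequencies are counted only after the repeated units have been substituted.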
The invention therefore has the following advantages:
1. Finite exhaustive optimization allows the repeated linguistic units of the data to be fully compressed, improving the compression ratio and ensuring the adaptability of the compression algorithm.
2. By compressing data in blocks, the invention removes dependencies between blocks, speeds up data queries, and allows any single block to be decompressed at random.
3. The invention decompresses large data correctly with modest time and space requirements, making it suitable for resource-constrained embedded systems.
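The random block access described in advantage 2 can be sketched as below. This is only an illustration of the idea: `zlib` stands in for the patent's own LZSS and Huffman coder, and the index layout (offset and size per block) and the function names are assumptions.

```python
import zlib

def compress_blocks(entries):
    # Compress each block independently and remember where it starts.
    blob = bytearray()
    index = []
    for entry in entries:
        packed = zlib.compress(entry)   # stand-in coder, not the patent's
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index

def read_block(blob, index, n):
    # Decompress block n alone; no other block is touched.
    offset, size = index[n]
    return zlib.decompress(blob[offset:offset + size])
```

Because no block depends on another, a dictionary lookup only needs to decompress the one block that holds the entry.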
IV. Description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a detailed flow chart of step 4) of the invention;
Fig. 3 is an example table of the length reductions recorded during finite exhaustive optimization;
Fig. 4 is a table comparing the results of compressing the Da Ying English-Chinese-Japanese-Korean dictionary data with the invention and with a known compression method;
Fig. 5 is a table comparing the results of compressing the Oxford dictionary data with the invention and with a known compression method.
V. Embodiment
Referring to Fig. 1, the method flow of the invention is as follows:
1) Determine whether the data to be compressed is Unicode-encoded or locally encoded. If it is Unicode-encoded, perform long-byte code replacement to pre-compress it: codes with a value below 0x80 have their high byte removed; the other codes are sorted by frequency of use; the 127 most frequently used codes are replaced with the code values 0x80-0xFE; each remaining code is replaced with a 0xFF marker followed by its two-byte code. Then proceed to step 2). If the data is locally encoded, proceed directly to step 2);
2) Determine whether the data is dictionary-type data. If so, record the block markers of the dictionary-type data and then proceed to step 3); if not, proceed directly to step 3);
3) Count the repeated linguistic units within the maximum length range: first record the positions of all identical characters in the data, then sort the identical characters by the characters that follow them, and finally find all repeated linguistic units within the maximum length range and record them by length;
4) Use the finite exhaustive method to find the optimal length range of repeated linguistic units to substitute;
5) Generate the repeated-unit substitution information file according to that optimal length range: substitute the repeated linguistic units within the range in order from long to short, encode them, output the substitution information file, and record the repeated linguistic units;
6) From the original data and the substitution information file, count the frequency of the non-substituted characters and build a Huffman tree from these frequencies;
7) Generate the compressed data: using the substitution information file and the Huffman tree, replace the original data with a hybrid encoding based on the LZSS and Huffman algorithms while splitting it into blocks according to the block information, generate the final file, and finally pack the repeated linguistic units, the block information, and the Huffman tree into the final data.
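The counting in step 3) (record positions, sort by the following characters, then collect repeats by length) can be sketched as follows. This is a simplified illustration that sorts start positions by the bytes that follow them so that identical substrings become adjacent; it counts overlapping occurrences, and the name `repeated_units` and the `min_len` parameter are assumptions.

```python
from itertools import groupby

def repeated_units(data, min_len, max_len):
    # Sort every start position by the (at most max_len) bytes that follow it,
    # so positions that begin with the same substring end up next to each other.
    positions = sorted(range(len(data)), key=lambda p: data[p:p + max_len])
    result = {}
    for length in range(min_len, max_len + 1):
        # Keep only positions with a full substring of this length.
        full = [p for p in positions if p + length <= len(data)]
        units = {}
        for prefix, group in groupby(full, key=lambda p: data[p:p + length]):
            count = sum(1 for _ in group)
            if count >= 2:
                units[prefix] = count
        if units:
            result[length] = units   # recorded by length, as in step 3)
    return result
```

For example, `repeated_units(b"abcabcab", 2, 3)` reports `b"ab"` three times and `b"abc"` twice. A production version would likely use a suffix array instead of re-slicing the data for every comparison.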
Referring to Fig. 2 and the process description above, the finite exhaustive search for the optimal substitution range is the key to the method and to improving the compression ratio. The algorithm for computing the repetition statistics is therefore introduced briefly below. Len is the length of a repeated linguistic unit; Rep is the number of times a repeated linguistic unit occurs; nBit is the number of bits in the code of a repeated linguistic unit; CompressRate is the compression ratio.
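The compression ratio of repeated-unit substitution is

\[ k \in [0,\ \mathrm{kMax}-1] \]

\[ \mathrm{Len}_k = \mathrm{BEG\_NUM} + k \]

\[ \mathrm{Rep}_k \ge \mathrm{kMax} + \mathrm{BEG\_NUM} - 1 - k \]

\[
\mathrm{CompressRate}\,\% =
\frac{\displaystyle \sum_{k=0}^{\mathrm{kMax}-1} \mathrm{Len}_k
      + \sum_{k=0}^{\mathrm{kMax}-1} \left[ \left( \sum_{i=0} \mathrm{Rep}_{ki} \right) \cdot \mathrm{nBit}/8 \right]}
     {\displaystyle \sum_{k=0}^{\mathrm{kMax}} \left[ \mathrm{Len}_k \cdot \left( \sum_{i=0} \mathrm{Rep}_{ki} \right) \right]}
\]

The compression ratio for substituting a single repeated linguistic unit is

\[ \mathrm{CompressRate}\,\% = \frac{1}{\mathrm{Rep}} + \frac{\mathrm{nBit}}{\mathrm{Len} \cdot 8} \]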
From the compression ratio for a single repeated linguistic unit (where a smaller ratio means better compression) it follows that:
the larger Rep is, the smaller the compression ratio;
the larger Len is, the smaller the compression ratio.
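The single-unit formula can be checked numerically. The sketch below simply evaluates 1/Rep + nBit/(Len*8); the function name and the sample values are illustrative assumptions, not figures from the patent.

```python
def single_unit_ratio(rep, n_bit, length):
    # One stored copy amortized over rep occurrences, plus an n_bit code
    # replacing each occurrence of length bytes.
    return 1 / rep + n_bit / (length * 8)

base = single_unit_ratio(rep=10, n_bit=16, length=8)          # 0.1 + 0.25
more_repeats = single_unit_ratio(rep=20, n_bit=16, length=8)
longer_unit = single_unit_ratio(rep=10, n_bit=16, length=16)
```

With these sample values the ratio falls from 0.35 to 0.30 when the repetition count doubles, and to 0.225 when the unit length doubles.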
It follows that longer repeated units favor compression. However, repeated units can overlap (for example, "cat" and "at"), which may reduce substitution efficiency, and such overlap is a complex phenomenon that is hard to quantify. The invention therefore first counts the long repeated linguistic units and then uses a finite exhaustive search to find the best compression. The specific steps are as follows:
4.1) Take the range from a base value to the maximum repeated-unit length as the finite exhaustive search range;
4.2) Taking each repeated-unit length in the search range in turn as the candidate length, perform the substitutions one by one from long to short, and record the length reduction obtained after each substitution pass;
4.3) From the recorded reductions, find the maximum; the repeated-unit length range corresponding to that maximum reduction is the optimal length range of repeated linguistic units to substitute.
Referring to Fig. 3, the table in this embodiment lists the length reduction after each substitution pass; the length 35, which corresponds to the maximum reduction, gives the optimal length range of repeated linguistic units to substitute.
The invention is described in further detail below with reference to specific embodiments.
Referring to Fig. 4, consider the Da Ying English-Chinese-Japanese-Korean dictionary data, whose source file is 45,776,158 bytes long. After pre-processing, the file is 27,668,745 bytes. The repeated strings up to the large length (0x7f) and their frequencies are counted and stored in a .rep file. Finite exhaustive optimization gives an optimal length of 35; the repeated strings within the optimal length range are numbered from long to short, giving a repeated-string length of 491,862 bytes. To cope with the large capacity of the dictionary-type database, the data is split directly into blocks, an address index is built for the start of each block, and the index is stored in an .idx file of 55,379 bytes. With this done, the data is compressed, giving a compression result of 12,115,479 bytes. When an existing compression method is used on the same dictionary data, the result is 12,839,525 bytes.
In addition, the data compressed by the method is divided into 24,862 blocks and the compression ratio is 28.05%, while the compression ratio of the existing compression method is 66.7%.
Referring to Fig. 5, consider the Oxford dictionary data installed in a handheld electronic product. According to the comparison table of compression results, the original data is 22,580,376 bytes and the compressed data is 4,505,792 bytes, whereas compression with the existing method yields 5,089,223 bytes. The data compressed by the method is divided into 146,292 blocks; the compression ratio is 19.95%, while that of the existing method is 22.54%.
The comparison on real data shows that the method not only improves compression efficiency but also, especially for very large data with a high frequency of repeated strings, counts string repetition frequencies faster and more conveniently.
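The Oxford dictionary percentages can be recomputed from the byte counts quoted above, taking the compression ratio as compressed size divided by original size. The helper name `ratio_percent` is hypothetical.

```python
def ratio_percent(compressed, original):
    # Compression ratio as a percentage of the original size, two decimals.
    return round(100 * compressed / original, 2)

oxford_original = 22_580_376
invention = ratio_percent(4_505_792, oxford_original)   # 19.95
existing = ratio_percent(5_089_223, oxford_original)    # 22.54
```

Both values agree with the percentages stated for the Oxford data.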

Claims (1)

1. A compression method applying finite exhaustive optimization to data, characterized in that the method comprises the following steps:
1) Determine whether the data to be compressed is Unicode-encoded or locally encoded. If it is Unicode-encoded, perform long-byte code replacement to pre-compress it: codes with a value below 0x80 have their high byte removed; the other codes are sorted by frequency of use; the 127 most frequently used codes are replaced with the code values 0x80-0xFE; each remaining code is replaced with a 0xFF marker followed by its two-byte code. Then proceed to step 2). If the data is locally encoded, proceed directly to step 2);
2) Determine whether the data is dictionary-type data. If so, record the block markers of the dictionary-type data and then proceed to step 3); if not, proceed directly to step 3);
3) Count the repeated linguistic units within the maximum length range: first record the positions of all identical characters in the data, then sort the identical characters by the characters that follow them, and finally find all repeated linguistic units within the maximum length range and record them by length;
4) Use the finite exhaustive method to find the optimal length range of repeated linguistic units to substitute:
4.1) Take the range from a base value to the maximum repeated-unit length as the finite exhaustive search range;
4.2) Taking each repeated-unit length in the search range in turn as the candidate length, perform the substitutions one by one from long to short, and record the length reduction obtained after each substitution pass;
4.3) From the recorded reductions, find the maximum; the repeated-unit length range corresponding to that maximum reduction is the optimal length range of repeated linguistic units to substitute;
5) Generate the repeated-unit substitution information file according to that optimal length range: substitute the repeated linguistic units within the range in order from long to short, encode them, output the substitution information file, and record the repeated linguistic units;
6) From the original data and the substitution information file, count the frequency of the non-substituted characters and build a Huffman tree from these frequencies;
7) Generate the compressed data: using the substitution information file and the Huffman tree, replace the original data with a hybrid encoding based on the LZSS and Huffman algorithms while splitting it into blocks according to the block information, generate the final file, and finally pack the repeated linguistic units, the block information, and the Huffman tree into the final data.
CNB2005100960026A 2005-09-08 2005-09-08 Data compression method by finite exhaustive optimization Expired - Fee Related CN100349160C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100960026A CN100349160C (en) 2005-09-08 2005-09-08 Data compression method by finite exhaustive optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100960026A CN100349160C (en) 2005-09-08 2005-09-08 Data compression method by finite exhaustive optimization

Publications (2)

Publication Number Publication Date
CN1737791A CN1737791A (en) 2006-02-22
CN100349160C true CN100349160C (en) 2007-11-14

Family

ID=36080588

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100960026A Expired - Fee Related CN100349160C (en) 2005-09-08 2005-09-08 Data compression method by finite exhaustive optimization

Country Status (1)

Country Link
CN (1) CN100349160C (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100437627C (en) * 2006-07-12 2008-11-26 中国物品编码中心 Data information encoding method
US8819288B2 (en) * 2007-09-14 2014-08-26 Microsoft Corporation Optimized data stream compression using data-dependent chunking
CN102222075A (en) * 2010-04-15 2011-10-19 李朝中 Tree-structure-based language bank compression method and system
CN103731154B (en) * 2013-11-01 2017-01-11 陕西理工学院 Data compression algorithm based on semantic analysis
CN104679776B (en) * 2013-11-29 2019-08-27 腾讯科技(深圳)有限公司 The compression method and device of inverted index
CN104682965A (en) * 2015-03-20 2015-06-03 深圳市微科通讯设备有限公司 GPS data compression method
US9515678B1 (en) * 2015-05-11 2016-12-06 Via Alliance Semiconductor Co., Ltd. Hardware data compressor that directly huffman encodes output tokens from LZ77 engine
CN109412604A (en) * 2018-12-05 2019-03-01 云孚科技(北京)有限公司 A kind of data compression method based on language model
CN115834504A (en) * 2022-11-04 2023-03-21 电子科技大学 AXI bus-based data compression/decompression method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4799242A (en) * 1987-08-24 1989-01-17 International Business Machines Corporation Multi-mode dynamic code assignment for data compression
CN1286959A (en) * 2000-04-11 2001-03-14 西安交通大学 Non-destructive data compressing method and device for Holter system

Also Published As

Publication number Publication date
CN1737791A (en) 2006-02-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Assignee: Village Technology Limited

Assignor: Wudi Science and Technology Co., Ltd. (Xian)

Contract fulfillment period: 2006.12.1 to 2011.11.30 contract change

Contract record no.: 2008310000047

Denomination of invention: Data compression method by finite exhaustive optimization

Granted publication date: 20071114

License type: Exclusive license

Record date: 2008.9.18

LIC Patent licence contract for exploitation submitted for record

Free format text: EXCLUSIVE LICENCE; TIME LIMIT OF IMPLEMENTING CONTRACT: 2006.12.1 TO 2011.11.30

Name of requester: BESTA SCIENCE CO., LTD.

Effective date: 20080918

C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071114