CN103618554A - Internal storage page compression method based on dictionary

Info

Publication number: CN103618554A
Application number: CN201310643898.XA
Authority: CN
Inventors: 宋彬; 裴远; 宋秉玺; 李慧玲; 甄立
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2013-12-01
Filing date: 2013-12-01
Publication date: 2014-03-05
Anticipated expiration: 2033-12-01
Also published as: CN103618554B

Abstract

The invention discloses a dictionary-based memory page compression method in the technical field of data processing. Its main purpose is to solve the problem that existing compression methods compress memory pages slowly. The method compresses and decompresses whole memory pages using four bytes as the basic unit, and designs a new hash function and a compression format suited to memory pages. The dictionary is a hash table accessed by key value. Four bytes are read from the input data stream; the first two bytes are XORed to obtain a new byte A, the last two bytes are XORed to obtain a new byte B, and the low-order 2 bits of A are XORed with the high-order 2 bits of B to obtain a 14-bit key value. In the new compression format, the first 4 bits of the first byte record the length of the repeated four-byte units and the last 4 bits record the length of the new four-byte units. Starting from the second byte, the remaining length of the new four-byte units is recorded, followed by the new four-byte units themselves. The remaining length of the repeated four-byte units in the memory page and the back-reference distance are then recorded. Encoding is simple and decoding is fast.

Description

Dictionary-based memory page compression method

Technical field

The invention belongs to the technical field of data processing and relates to a data compression method for device memory. It adopts a new compression format, designed around the characteristics of in-memory data, to improve compression speed, and can be used in memory-constrained embedded mobile devices.

Background art

In recent years, with the development of the mobile Internet, mobile devices have increasingly become an indispensable means of communication. Because the memory of a mobile device is limited, compressing its in-memory data to free up memory space can improve the overall performance of the device. As the amount of information in modern society keeps growing, people also demand more of computer systems: higher speed, lower power consumption, smaller size, and access to more information. Various improvements have been proposed to meet these requirements, and data compression is one of the more economical ones.

In 1977 Lempel and Ziv proposed an efficient lossless compression technique, LZ77. Its main principle is to represent a previously seen repeated string with a shorter token of the form (repeat length, back-reference distance). For example, abcdekabcdeha can be encoded as abcdek(5,6)ha. Overall, shorter information replaces longer information, which achieves compression. In 1982 James Storer and Thomas Szymanski improved on LZ77 to raise the compression ratio and proposed the LZSS algorithm. Later, the Lempel-Ziv-Oberhumer (LZO) algorithm was proposed, improving on LZSS to raise compression speed. LZO is a dictionary-based lossless data compression algorithm characterized by fast, real-time compression. Based on the number of repeated characters and the back-reference distance, it defines five compressed formats, distinguished by the value of the first byte of the compressed format. Its main steps are: (1) read the in-memory data of the mobile device and the length of that data; (2) determine whether the data read is new: if it is not recorded in the dictionary, it is judged to be new data and is entered into the dictionary, and reading continues until no new data appears; (3) if the data read is already recorded in the dictionary, encode it according to the length of the repeated data and the back-reference distance; (4) determine whether the encoding position has reached the end of the in-memory data: if so, output the compressed data and its length and record an end flag; otherwise return to step (2) and continue reading new data.

This method has two shortcomings. First, 32-bit systems are currently the mainstream computer systems and, because of memory alignment, the vast majority of data in memory is written in units of 4 bytes, whereas LZO works in units of one byte; it is therefore not fully suited to compressing in-memory data and spends more time doing so. Second, LZO was originally designed to compress data of variable length, and its compressed format is not well suited to memory pages of 4K size.
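The LZ77 token example above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `lz77_decode` and the representation of tokens as a list of strings and `(length, distance)` tuples are assumptions made here, not part of LZ77 or of the invention.

```python
def lz77_decode(tokens):
    """Expand literal strings and (length, distance) back-references."""
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            # Literal run: copy the characters as they are.
            out.extend(tok)
        else:
            # Back-reference: copy `length` characters starting
            # `distance` positions before the current end of the output.
            length, distance = tok
            start = len(out) - distance
            for i in range(length):
                out.append(out[start + i])
    return "".join(out)


# The example from the text: abcdek(5,6)ha
print(lz77_decode(["abcdek", (5, 6), "ha"]))
```

The token (5,6) copies five characters starting six positions back, which reproduces the second occurrence of "abcde".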

Summary of the invention

The object of the invention is to overcome the above deficiencies of the prior art by proposing a dictionary-based memory page compression method that compresses and decompresses memory pages faster, thereby reducing memory access latency.

The technical scheme of the invention is as follows: according to the data characteristics of memory pages, a new hash function and a compressed format for memory pages (the decompressed format is identical) are designed, and memory pages are compressed and decompressed using four bytes as the basic unit. The specific steps are:

(1) Read the in-memory data of the mobile device and the length of that data;

(2) Determine whether the data read is new: if it is not recorded in the dictionary, it is judged to be new data and is entered into the dictionary, and reading continues until no new data appears;

(3) If the data read is already recorded in the dictionary, compress and decompress it using the new compressed format;

(4) Determine whether encoding has reached the end of the in-memory data: if so, output the compressed data and its length and record an end flag; otherwise return to step (2) and continue reading new data.

The dictionary in step (2) is a hash table accessed directly by key value, and the key value is computed by a hash function designed as follows: four bytes are read from the input data stream; the first two bytes are XORed to obtain a new byte A; the last two bytes are XORed to obtain a new byte B; and the low-order 2 bits of A are XORed with the high-order 2 bits of B to obtain a 14-bit key value.
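The text does not spell out how XORing 2 bits of A with 2 bits of B yields a 14-bit key. One reading that fits the description is to shift A left by 6 bits so that its low-order 2 bits overlap the high-order 2 bits of B, then XOR the two. The sketch below assumes that reading; the name `hash_key` and the argument order are illustrative only.

```python
def hash_key(b0, b1, b2, b3):
    """Sketch of the 14-bit key, assuming key = (A << 6) XOR B."""
    a = (b0 ^ b1) & 0xFF  # new byte A from the first two bytes
    b = (b2 ^ b3) & 0xFF  # new byte B from the last two bytes
    # A occupies bits 6..13 and B occupies bits 0..7, so only
    # A's low 2 bits and B's high 2 bits are actually XORed together.
    return ((a << 6) ^ b) & 0x3FFF


# A 14-bit key addresses a table of 2**14 = 16384 slots.
print(hex(hash_key(0x12, 0x34, 0x56, 0x78)))
```

Under this assumption the key depends on all four input bytes while costing only three XORs and one shift, which is in keeping with the stated goal of fast compression.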

The new compressed format in step (3) encodes and decodes memory pages using four bytes as the basic unit. Its layout is:

1) The first 4 bits of the first byte record the length of the repeated four-byte units, and the last 4 bits record the length of the new four-byte units;

2) Starting from the second byte, the remaining length of the new four-byte units is recorded, followed by the new four-byte units themselves;

3) After the new four-byte units in step 2) have been recorded, the remaining length of the repeated four-byte units in the memory page and the back-reference distance are recorded. The back-reference distance is the distance between the position of the current repeated four-byte unit and the position of its previous occurrence recorded in the hash table.
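The layout of the first byte can be sketched as follows. The sketch assumes that "first 4 bits" means the high-order nibble and "last 4 bits" the low-order nibble; the function names `pack_token` and `unpack_token` are hypothetical.

```python
def pack_token(repeat_field, new_field):
    """Pack two 4-bit length fields into the first byte (assumed layout)."""
    if not (0 <= repeat_field <= 15 and 0 <= new_field <= 15):
        raise ValueError("each field must fit in 4 bits")
    # High nibble: repeated-unit length; low nibble: new-unit length.
    return (repeat_field << 4) | new_field


def unpack_token(token):
    """Split the first byte back into (repeat_field, new_field)."""
    return token >> 4, token & 0x0F


token = pack_token(3, 7)
print(hex(token), unpack_token(token))
```

A field value of 15 does not fit a direct length in this format; it serves as the marker that further length bytes follow.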

The compression encoding process for a memory page is as follows:

1.1) First, the last 4 bits of the first byte record the length of the new four-byte units. If this length is greater than 14, the last 4 bits of the first byte are set to the marker value 15, and the remaining length is recorded starting from the second byte. If the remaining length is greater than 255, a byte 0 is recorded and 255 is subtracted from the length, until the remaining length is less than 255; that remaining length is then recorded;

1.2) After the length of the new four-byte units has been recorded in step 1.1), the new four-byte units are recorded;

1.3) The first 4 bits of the first byte record the length of the repeated four-byte units. If this length is greater than 14, the first 4 bits of the first byte are set to the marker value 15, and the remaining length is then recorded. If the remaining length is greater than 255, a byte 0 is recorded and 255 is subtracted from the length, until the remaining length is less than 255; that remaining length is then recorded;

1.4) After step 1.3) is complete, the back-reference distance of the repeated four-byte units is recorded.
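The length encoding in steps 1.1) and 1.3) can be sketched as below. Two points are assumptions rather than statements from the text: the "remaining length" is taken to be the length minus 14, which is inferred from the decoder adding 14 when it sees the marker, and a remaining length of exactly 255 is written as a single byte 255 because the loop condition is "greater than 255". The name `encode_length` is hypothetical.

```python
def encode_length(length):
    """Return (4-bit field, extra length bytes) for a unit count."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if length <= 14:
        # Short lengths fit directly in the 4-bit field.
        return length, b""
    remaining = length - 14  # assumed: the marker 15 stands for "14 plus more"
    extra = bytearray()
    while remaining > 255:
        extra.append(0)      # a zero byte means "add 255 and keep reading"
        remaining -= 255
    extra.append(remaining)  # final byte is non-zero (1..255)
    return 15, bytes(extra)


print(encode_length(5))    # fits in the nibble, no extra bytes
print(encode_length(614))  # marker 15 plus three extra bytes
```

Because the final byte is always non-zero under this reading, a decoder can use the first non-zero byte as the terminator without a separate end marker.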

The decompression process for a memory page is as follows:

2.1) Read the first byte of the compressed format and examine the value of its last 4 bits. If the value is less than 15, it is the length of the new four-byte units, and the new four-byte units are output. If it equals 15, 14 is added to the length of the new four-byte units; then, starting from the second byte, 255 is added to the length for each byte equal to 0 until a non-zero byte is read, that non-zero byte is added to the length, and the new four-byte units are output;

2.2) Examine the value of the first 4 bits of the first byte from step 2.1). If it is less than 15, it is the length of the repeated four-byte units. Otherwise, if it equals 15, 14 is added to the length of the repeated four-byte units and reading continues: 255 is added to the length for each byte equal to 0 until a non-zero byte is read, and that non-zero byte is added to the length;

2.3) Read the last byte of the compressed format, which is the back-reference distance of the repeated four-byte units, and output the repeated four-byte units according to their length.
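The length decoding in steps 2.1) and 2.2) can be sketched as a small helper. The function name `decode_length`, its `(nibble, data, pos)` signature, and the choice to return the updated read position are assumptions made for illustration.

```python
def decode_length(nibble, data, pos=0):
    """Return (length, next_pos) given the 4-bit field and following bytes."""
    if nibble < 15:
        # The field holds the length directly; no extra bytes are consumed.
        return nibble, pos
    length = 14              # the marker 15 starts the length at 14
    while data[pos] == 0:
        length += 255        # each zero byte contributes 255
        pos += 1
    length += data[pos]      # the first non-zero byte ends the length
    return length, pos + 1


print(decode_length(7, b""))
print(decode_length(15, bytes([0, 0, 90])))
```

With the marker set and the extra bytes 0, 0, 90, the length works out to 14 + 255 + 255 + 90 = 614.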

Compared with the prior art, the invention has the following advantages:

Compared with the current LZO lossless compression algorithm, the new compressed format of the invention is simple and compresses and decompresses memory page data faster, while the compression ratio is roughly comparable. It can significantly improve the operating efficiency of mobile devices, and test results show that compression time and decompression time both improved by 60%.

Brief description of the drawings

Fig. 1 shows the compression and decompression format of the invention;

Fig. 2 is the compression flowchart of the invention;

Fig. 3 is the decompression flowchart of the invention.

Detailed description of the embodiments

The compression and decompression format of the invention is described in further detail below with reference to Fig. 1:

1) The first 4 bits of the first byte record the length of the repeated four-byte units, and the last 4 bits record the length of the new four-byte units;

2) If the length of the new four-byte units is greater than 14, the last 4 bits of the first byte are set to the marker value 15, and the remaining length is recorded starting from the second byte. If the remaining length is greater than 255, a byte 0 is recorded and 255 is subtracted from the length, until the remaining length is less than 255; that remaining length is then recorded. After the length has been recorded, the new four-byte units are recorded;

3) If the length of the repeated four-byte units is less than or equal to 14, it is represented by the first 4 bits of the first byte from step 1). If the length is greater than 14, the first 4 bits of the first byte are set to the marker value 15, and the remaining length is then recorded. If the remaining length is greater than 255, a byte 0 is recorded and 255 is subtracted from the length, until the remaining length is less than 255; that remaining length is then recorded;

4) After step 3) is complete, the back-reference distance of the repeated four-byte units is recorded.

The compression encoding process of the invention is described in further detail below with reference to Fig. 2:

Step 1: Read four bytes from the input data stream, perform the first hash operation, and go to step 2;

Step 2: Determine whether the position of the four bytes is valid. If it is valid, go to step 3; if it is not, update the hash table and return to step 1;

Step 3: Determine whether the data at the position stored in the hash table is identical to the four bytes read in. If identical, go to step 6; if different, go to step 4;

Step 4: Perform the second hash operation and determine whether the position of the four bytes is valid. If it is valid, go to step 5; if it is not, update the hash table and return to step 1;

Step 5: Determine whether the data at the position stored in the hash table is identical to the four bytes read in. If identical, go to step 6; if different, update the hash table and return to step 1;

Step 6: Compute the length of the new four-byte units and determine whether it is greater than 14. If so, go to step 7; otherwise record it directly in the first byte and go to step 8;

Step 7: Determine whether the length of the new four-byte units is greater than 255. If so, record a byte 0 and subtract 255 from the length, until the length is less than 255; finally record the remaining length and go to step 8;

Step 8: Record the new four-byte unit data and go to step 9;

Step 9: Compute the number of repeated four-byte units and determine whether it is greater than 14. If so, go to step 10; otherwise record it directly in the first byte and go to step 11;

Step 10: Determine whether the length of the repeated four-byte units is greater than 255. If so, record a byte 0 and subtract 255 from the length, until the length is less than 255; finally record the remaining length and go to step 11;

Step 11: Compute and record the back-reference distance. Determine whether encoding has reached the end: if so, record the remaining new four-byte units and output the encoded length; otherwise go to step 1.

The decompression process of the invention is described in further detail below with reference to Fig. 3:

Step 1: Read one byte from the input data stream and determine whether its last 4 bits equal 15. If so, go to step 2; otherwise the value of the last 4 bits is the length of the new four-byte units, and go to step 5;

Step 2: Add 14 to the length of the new four-byte units;

Step 3: Determine whether the next byte is 0. If so, add 255 to the length of the new four-byte units, and repeat until a non-zero byte is read; then go to step 4;

Step 4: Add the remaining length to the length of the new four-byte units and go to step 5;

Step 5: Write the new four-byte units according to their length and go to step 6;

Step 6: Determine whether the first 4 bits of the first byte read in equal 15. If so, go to step 7; otherwise the value of the first 4 bits is the length of the repeated four-byte units, and go to step 10;

Step 7: Add 14 to the length of the repeated four-byte units;

Step 8: Determine whether the next byte is 0. If so, add 255 to the length of the repeated four-byte units, and repeat until a non-zero byte is read; then go to step 9;

Step 9: Add the remaining length to the length of the repeated four-byte units and go to step 10;

Step 10: Compute the back-reference distance, write the repeated four-byte units according to their length, and go to step 11;

Step 11: Determine whether decoding has reached the end. If so, output the decoded length; otherwise go to step 1;

Step 12: If the output length equals the page size, decoding completed normally; if not, output an error.

The effect of the invention is further described below with reference to the following table:

In this experiment the proposed compression method was implemented in C. Comparing the invention with the traditional LZO dictionary method on memory data pages demonstrates the speed advantage of the proposed method. LZO is taken as the best current lossless compression method. The data used are 4K memory data pages from a typical mobile device, and the tests were run in the VS2010 development environment, with the following results:

Table 1

The test data is a package of memory pages with a size of 256M. The times in the table are the compression and decompression times for all memory pages in the package, averaged over 100 runs. As the table shows, compression time and decompression time both improved by 60%, meeting the project target, while the compression ratio loss is 5.12%: LZO compresses the package to about 96M, whereas the invention compresses it to about 109M. For fast access to in-memory data, trading about 10M of compressed space for a one-fold gain in compression and decompression time is therefore worthwhile.
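The compression ratio figures can be checked roughly from the sizes quoted above. Since 96M and 109M are given only as approximate values, the result below is an approximation of the reported 5.12% rather than a reproduction of it; the variable names are illustrative.

```python
ORIGINAL_MB = 256   # size of the memory page package
LZO_MB = 96         # approximate size after LZO compression
PROPOSED_MB = 109   # approximate size after the proposed method

lzo_ratio = LZO_MB / ORIGINAL_MB            # compressed / original
proposed_ratio = PROPOSED_MB / ORIGINAL_MB
loss_points = (proposed_ratio - lzo_ratio) * 100  # percentage points

print(f"LZO ratio:      {lzo_ratio:.2%}")
print(f"Proposed ratio: {proposed_ratio:.2%}")
print(f"Ratio loss:     {loss_points:.2f} percentage points")
```

With these rounded sizes the gap comes to about 5.1 percentage points, which is in line with the reported compression ratio loss.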

Claims

1. A dictionary-based memory page compression method, in which a new hash function and a compressed format for memory pages are designed and memory pages are compressed and decompressed using four bytes as the basic unit, the specific steps being as follows:

(2) Determine whether the data read is new: if it is not recorded in the dictionary, it is judged to be new data and is entered into the dictionary, and reading continues until no new data appears;

(4) Determine whether the encoding position has reached the end of the in-memory data: if so, output the compressed data and its length and record an end flag; otherwise return to step (2) and continue reading new data;

3) After the new four-byte units in step 2) have been recorded, the remaining length of the repeated four-byte units in the memory page and the back-reference distance are recorded.

2. The dictionary-based memory page compression method according to claim 1, characterized in that the compression encoding process for a memory page is as follows:

2.1) First, the last 4 bits of the first byte record the length of the new four-byte units. If this length is greater than 14, the last 4 bits of the first byte are set to the marker value 15, and the remaining length is recorded starting from the second byte. If the remaining length is greater than 255, a byte 0 is recorded and 255 is subtracted from the length, until the remaining length is less than 255; that remaining length is then recorded;

2.2) After the length of the new four-byte units has been recorded in step 2.1), the new four-byte units are recorded;

2.3) The first 4 bits of the first byte record the length of the repeated four-byte units. If this length is greater than 14, the first 4 bits of the first byte are set to the marker value 15, and the remaining length is then recorded. If the remaining length is greater than 255, a byte 0 is recorded and 255 is subtracted from the length, until the remaining length is less than 255; that remaining length is then recorded;

2.4) After step 2.3) is complete, the back-reference distance of the repeated four-byte units is recorded.

3. The dictionary-based memory page compression method according to claim 1, characterized in that the decompression process for a memory page is as follows:

3.1) Read the first byte of the compressed format and examine the value of its last 4 bits. If the value is less than 15, it is the length of the new four-byte units, and the new four-byte units are output. If it equals 15, 14 is added to the length of the new four-byte units; then, starting from the second byte, 255 is added to the length for each byte equal to 0 until a non-zero byte is read, that non-zero byte is added to the length, and the new four-byte units are output;

3.2) Examine the value of the first 4 bits of the first byte from step 3.1). If it is less than 15, it is the length of the repeated four-byte units. If it equals 15, 14 is added to the length of the repeated four-byte units and reading continues: 255 is added to the length for each byte equal to 0 until a non-zero byte is read, and that non-zero byte is added to the length;

3.3) Read the last byte of the compressed format, which is the back-reference distance of the repeated four-byte units, and output the repeated four-byte units according to their length.