CN101478311B

CN101478311B - Hardware accelerated implementation process for bzip2 compression algorithm

Info

Publication number: CN101478311B
Application number: CN2009100955967A
Authority: CN
Inventors: 陈天洲; 严力科; 胡威; 王罡; 冯德贵; 吴斌斌; 陈度; 王勇刚; 刘敬伟
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2009-01-22
Filing date: 2009-01-22
Publication date: 2010-10-20
Anticipated expiration: 2029-01-22
Also published as: CN101478311A

Abstract

The invention discloses a hardware-accelerated implementation method for the bzip2 compression algorithm, in which a hardware accelerator performs the move-to-front transform and run-length encoding, the steps that consume most of the running time, thereby increasing compression speed. The method has two characteristics. First, the input and output buffers of the hardware accelerator serve as the communication interface with the general-purpose computing system; software prepares the input data for the accelerator and collects and reads back its output data, which simplifies the design of the accelerator. Second, the move-to-front transform and run-length encoding are implemented in hardware using a fully unrolled 2048-bit parallel comparator and a shifter, which speeds up program execution, increases the data compression speed of the bzip2 algorithm, and effectively improves program performance.

Description

Hardware-accelerated implementation method for the bzip2 compression algorithm

Technical field

The present invention relates to the fields of hardware-software co-design and data compression, and in particular to a hardware-accelerated implementation method for the bzip2 compression algorithm.

Background technology

With the application of new materials and the development of new techniques, VLSI technology has made great progress, laying a solid foundation for the development of chip multi-processors (CMP). A CMP integrates multiple computing cores into a single processor chip to increase computing capability. Depending on whether the computing cores are identical, CMPs are classified as homogeneous multi-core or heterogeneous multi-core.

In the coming years the number of processing cores will keep growing. However, as more and more cores are integrated on a single chip, adding further cores brings little additional performance, and general-purpose processors increasingly struggle to meet the demands of converged applications. More and more multi-core processors are therefore moving to the SoC architecture, that is, the heterogeneous multi-core architecture. A growing number of research institutions are studying heterogeneous multi-core processors, and this research covers every aspect of such systems: optimization of the processing core structure; thread allocation and migration on heterogeneous multi-core processors; and CPU+DSP multi-core structures for video and audio processing. Some commercial processors have already adopted heterogeneous architectures, or customized special-purpose accelerators for specific applications.

Bzip2 achieves a higher compression ratio than traditional gzip or ZIP, but compresses more slowly. In this respect it resembles several other recently introduced compression algorithms. Unlike RAR or ZIP, bzip2 is a data compressor rather than an archiver, and in this it is similar to gzip. The program itself includes no facilities for handling multiple files, encryption, or archive splitting; following UNIX tradition, external tools such as tar or GnuPG are used instead.

Bzip2 uses the Burrows-Wheeler transform to convert frequently recurring character sequences into strings of identical letters, then applies the move-to-front transform, and finally compresses the result with Huffman coding. In bzip2 all data blocks are plaintext blocks of equal size, which can be selected with a command-line option; the blocks are then marked as compressed text with an arbitrary bit sequence taken from the decimal representation of π.

Although bzip2 achieves a higher compression ratio than gzip or ZIP, its slower compression speed limits its range of application. As VLSI technology develops and the number of transistors on a chip grows, a special-purpose accelerator customized for bzip2 can speed up its compression process.

Summary of the invention

To meet the ever-increasing demand for computing performance, the hotspot functions of the bzip2 program are executed by a customized special-purpose accelerator, which raises the compression speed of the bzip2 algorithm. The object of the present invention is to provide a hardware-accelerated implementation method for the bzip2 compression algorithm.

The technical solution adopted by the present invention to solve its technical problem is:

A hardware-accelerated implementation method for the bzip2 compression algorithm:

1) Software manages the input and output of the hardware accelerator:

The hardware accelerator uses its input and output buffers as the communication interface with the general-purpose computing system;

Software directly accesses the input and output buffers of the hardware accelerator, prepares the input data for the accelerator, and collects and reads the output data:

1. Before the hardware accelerator starts computing, software organizes the accelerator's input data and writes it to the accelerator's input buffer;

2. After the hardware accelerator finishes computing, software retrieves the accelerator's output data from the buffer and writes it back to system memory;

2) The hardware accelerator implements the move-to-front transform and run-length encoding:

The hardware accelerator mainly comprises a register group, a 2048-bit parallel comparator, a 2048-bit shifter, a 256-8 encoder, and a length encoder;

The register group comprises the local storage, the local cache, a current byte register, a current address register, an output address register, a consecutive identical byte counter, and a 2048-bit character list register;

The specific implementation steps are as follows:

1. Read a byte from the input buffer at the current address into the current byte register, and increment the current address by 1;

2. Feed the contents of the current byte register and of the character list register to the 2048-bit parallel comparator, which compares them in parallel;

3. Feed the output of the 2048-bit parallel comparator to the 256-8 encoder, which encodes it;

I. If the encoding result is 00000000, increment the consecutive identical byte counter by 1 and continue with step 1;

II. If the encoding result is not 00000000 and the consecutive identical byte counter is 0, continue with step 4;

III. If the encoding result is not 00000000 and the consecutive identical byte counter is not 0, continue with step 5;

4. Feed the output of the 2048-bit parallel comparator and the character list register to the 2048-bit shifter. Each bit of the comparator output corresponds to one byte of the character list register. The shifter moves the byte indicated by the '1' in the comparator output to the first byte position of the character list register, and shifts the bytes of the character list register that correspond to the '0's to the left of that '1' back by 8 bits; continue with step 6;

5. Feed the count of the consecutive identical byte counter to the length encoder and perform run-length encoding, then continue with step 4;

6. Write the encoding result of the 256-8 encoder to the location in local storage addressed by the output address register; if input data remains to be processed, continue with step 1;

If all input data has been processed, the hardware accelerator suspends and notifies software to retrieve the result data.
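
The steps above can be modeled in software as follows. This is a minimal Python sketch, not the hardware itself: the function name `mtf_with_zero_runs` and the output representation (a list of `("run", n)` and `("sym", k)` tuples) are illustrative assumptions. A run is emitted as a plain count rather than passed through the length encoder, and clearing the counter after a run and flushing a run left over at the end of the input are assumptions that the text does not spell out.

```python
def mtf_with_zero_runs(data: bytes):
    """Software model of steps 1-6: move-to-front with run counting."""
    table = list(range(256))         # character list register: 256 bytes
    run = 0                          # consecutive identical byte counter
    out = []                         # stands in for the local storage
    for byte in data:                # step 1: fetch current byte, advance address
        index = table.index(byte)    # steps 2-3: compare, then encode the position
        if index == 0:               # case I: byte is already at the front
            run += 1
            continue
        if run:                      # case III: encode the pending run (step 5)
            out.append(("run", run))
            run = 0                  # assumption: counter is cleared afterwards
        table.pop(index)             # step 4: move the matched byte to the front
        table.insert(0, byte)
        out.append(("sym", index))   # step 6: write the encoder result
    if run:                          # assumption: flush a run left at end of input
        out.append(("run", run))
    return out
```

For the input `b"aaab"`, the first `a` is found at position 97 and moved to the front, the next two `a` bytes only increment the counter, and `b` then triggers the run output before its own position is written.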

The beneficial effects of the present invention are:

First, the input and output buffers of the hardware accelerator serve as the communication interface with the general-purpose computing system, and software prepares the input data for the accelerator and collects and reads its output data, which simplifies the design of the accelerator. Second, the move-to-front transform and run-length encoding, which take the most time in the whole program, are implemented in hardware, which speeds up program execution, increases the data compression speed of the bzip2 algorithm, and effectively improves program performance.

Description of drawings

Fig. 1 is an overview flow chart of the present invention.

Fig. 2 is the module diagram of hardware accelerator of the present invention.

Embodiment

The specific implementation flow of the hardware thread execution method based on a hybrid processor and FPGA architecture is as follows:

A hardware-accelerated implementation method for the bzip2 compression algorithm, whose concrete steps are shown in Fig. 1:

1) Software manages the input and output of the accelerator

The hardware accelerator uses its input and output buffers as the communication interface with the general-purpose computing system, where a general-purpose computing system means a general-purpose computer typified by the conventional desktop computer. The general-purpose computing system accesses the input and output buffers of the hardware accelerator over the PCI-E bus. In the present invention the input buffer and the output buffer are separate: the input buffer, called the local cache, buffers the input data of the hardware accelerator; the output buffer, called the local storage, stores the computation results of the hardware accelerator.

Software directly accesses the input and output buffers of the hardware accelerator over the PCI-E bus, prepares the input data for the accelerator, and collects and reads the output data:

1. Before the hardware accelerator starts computing, software organizes the accelerator's input data in system memory, transfers the organized data from system memory to the accelerator's local cache over PCI-E, and then notifies the accelerator to start computing;

2. After the hardware accelerator finishes computing, it raises an interrupt to notify software, which retrieves the accelerator's output data from the local storage and writes it back to system memory.
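
The software side of this exchange can be sketched as follows. This is only an illustration in Python under stated assumptions: `MockAccelerator`, `start`, `done`, and `offload` are hypothetical names, the PCI-E transfers are replaced by plain attribute assignments, and the completion interrupt is replaced by a flag. No real device or driver API is implied.

```python
class MockAccelerator:
    """Hypothetical stand-in for the accelerator behind the PCI-E bus."""

    def __init__(self, compute):
        self.local_cache = b""       # input buffer
        self.local_storage = []      # output buffer
        self.done = False            # stands in for the completion interrupt
        self._compute = compute      # function modeling the accelerator logic

    def start(self):
        """Run the modeled computation on the local cache contents."""
        self.local_storage = self._compute(self.local_cache)
        self.done = True


def offload(accel, system_memory_block: bytes):
    """Software flow: stage input, start the accelerator, fetch the results."""
    accel.local_cache = bytes(system_memory_block)  # step 1: copy input to local cache
    accel.start()                                   # step 1: notify accelerator to begin
    if not accel.done:                              # step 2: wait for completion signal
        raise RuntimeError("accelerator did not signal completion")
    return list(accel.local_storage)                # step 2: copy results to system memory
```

Because the accelerator only sees its two buffers, any computation can be plugged into the mock for testing the software flow independently of the hardware design.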

The module diagram of the hardware accelerator is shown in Fig. 2. The accelerator comprises the local storage, the local cache, a register group, a 2048-bit parallel comparator, a 2048-bit shifter, a 256-8 encoder, and a length encoder;

The 2048-bit parallel comparator has two inputs: one 8-bit input and one 2048-bit input. Its output is 256 bits wide; each bit gives the result of comparing the 8-bit input with one 8-bit group of the 2048-bit input: '1' if they are identical, '0' otherwise.
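
The comparator's behavior can be modeled with a short Python sketch. The function name `parallel_compare` and the use of a Python list for the 256 output bits are illustrative assumptions; the hardware evaluates all 256 byte lanes at once, whereas this model simply iterates over them.

```python
def parallel_compare(current_byte: int, char_list: bytes) -> list:
    """Model of the 2048-bit parallel comparator.

    char_list holds the 256 bytes (2048 bits) of the character list register.
    Returns 256 bits, one per byte lane: 1 where the lane equals current_byte.
    """
    if not 0 <= current_byte <= 0xFF:
        raise ValueError("current_byte must fit in 8 bits")
    if len(char_list) != 256:
        raise ValueError("char_list must hold exactly 256 bytes")
    return [1 if lane == current_byte else 0 for lane in char_list]
```

When the character list holds each byte value exactly once, as it does throughout the move-to-front transform, the output is one-hot.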

The 2048-bit shifter also has two inputs: one 256-bit input and one 2048-bit input. Its output is 2048 bits wide. Each bit of the 256-bit input corresponds to 8 bits of the 2048-bit input. The shifter moves the byte of the 2048-bit input that corresponds to the '1' in the 256-bit input to the first byte position, and shifts the bytes of the 2048-bit input that correspond to the '0's to the left of that '1' back by 8 bits, producing the 2048-bit output.
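
A byte-level Python sketch of the shifter follows. The name `move_to_front_shift` is hypothetical, and the 2048-bit value is modeled as a 256-byte `bytes` object with the leftmost byte at index 0, which is an assumption about how the register is laid out.

```python
def move_to_front_shift(match_bits, char_list: bytes) -> bytes:
    """Model of the 2048-bit shifter.

    match_bits: 256 bits from the comparator, with a single 1 at the match.
    char_list:  the 256 bytes of the character list register.
    """
    hit = match_bits.index(1)            # byte lane selected by the '1'
    front = bytes([char_list[hit]])      # matched byte goes to the first position
    shifted = char_list[:hit]            # lanes left of the '1' move back one byte
    untouched = char_list[hit + 1:]      # lanes right of the '1' stay in place
    return front + shifted + untouched
```

Bytes to the right of the matched lane are not moved, so the total width stays at 256 bytes.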

The 256-8 encoder produces an 8-bit output from the position of the '1' in its 256-bit input; the output value is the position of that '1' within the 256 bits.
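
As a small illustration, the encoder can be modeled in Python as below. The name `encode_256_to_8` is hypothetical, and rejecting inputs that are not one-hot is an assumption; the text does not say how the hardware treats such inputs.

```python
def encode_256_to_8(match_bits) -> int:
    """Model of the 256-8 encoder: return the position of the single '1'."""
    if len(match_bits) != 256 or list(match_bits).count(1) != 1:
        raise ValueError("expected a one-hot 256-bit input")
    return list(match_bits).index(1)     # value in 0..255, fits in 8 bits
```

Formatting the result with `format(value, "08b")` gives the 8-bit pattern; a '1' in the first position yields `00000000`, the value tested in step 3.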

The register group comprises an 8-bit current byte register, a 16-bit current address register, a 16-bit output address register, a 16-bit consecutive identical byte counter, and a 2048-bit character list register. The current address register, the output address register, and the consecutive identical byte counter are initialized to 0; the character list register initially holds, in order from left to right, the 256 bytes with values 0 to 255.

The specific implementation steps are as follows:

1. Read a byte from the local cache at the current address into the current byte register, and increment the current address by 1;

5. Feed the count of the consecutive identical byte counter to the length encoder and perform run-length encoding, as follows:

I. If the lowest bit of the consecutive identical byte counter is '1', write 1 to the location in local storage addressed by the output address register, and increment the output address register by 1;

II. If the lowest bit of the consecutive identical byte counter is '0', write 0 to the location in local storage addressed by the output address register, and increment the output address register by 1;

III. If the value of the consecutive identical byte counter is less than 2, continue with step 4; otherwise subtract 2 from the counter, shift it right by 1 bit, and continue with step 5;

6. Write the encoding result of the 256-8 encoder to the location in local storage addressed by the output address register, and increment the output address register by 1; if input data remains to be processed, continue with step 1;
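
The length encoding of step 5 can be sketched in Python as follows. The function name `encode_run_length` is hypothetical, the written values are collected in a list instead of being stored through the output address register, and the shift in case III is taken to be a right shift, since only that reading lets the loop terminate.

```python
def encode_run_length(count: int) -> list:
    """Model of step 5: emit one value per iteration, lowest bit first."""
    if count <= 0:
        raise ValueError("step 5 is only reached with a non-zero counter")
    written = []
    while True:
        written.append(count & 1)     # cases I and II: write the lowest bit
        if count < 2:                 # case III: finished, proceed to step 4
            return written
        count = (count - 2) >> 1      # otherwise subtract 2, shift right by 1
```

Under this reading the counts 1 through 6 produce `[1]`, `[0, 0]`, `[1, 0]`, `[0, 1]`, `[1, 1]`, and `[0, 0, 0]` respectively, so each count maps to a distinct sequence.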

Claims

1. A hardware-accelerated implementation method for the bzip2 compression algorithm, characterized in that:

1) Software manages the input and output of the hardware accelerator:

Software directly accesses the input and output buffers of the hardware accelerator, prepares the input data for the hardware accelerator, and collects and reads the output data:

2. After the hardware accelerator finishes computing, software retrieves the output data of the hardware accelerator from the output buffer and writes it back to system memory;

2) The hardware accelerator implements the move-to-front transform and run-length encoding:

The input buffer and the output buffer are separate: the input buffer, called the local cache, buffers the input data of the hardware accelerator; the output buffer, called the local storage, stores the computation results of the hardware accelerator. The hardware accelerator comprises the local storage, the local cache, a register group, a 2048-bit parallel comparator, a 2048-bit shifter, a 256-8 encoder, and a length encoder;

The register group comprises a current byte register, a current address register, an output address register, a consecutive identical byte counter, and a 2048-bit character list register, wherein the character list register initially holds, in order from left to right, the 256 bytes with values 0 to 255;

The specific implementation steps are as follows:

2. The contents of the current byte register and of the character list register are fed to the 2048-bit parallel comparator and compared in parallel as follows: the 2048-bit parallel comparator has two inputs, the 8-bit current byte register input and the 2048-bit character list register input; its output is 256 bits wide, each bit giving the result of comparing the 8-bit input with one 8-bit group of the 2048-bit input: '1' if they are identical, '0' otherwise;