CN103414901A

CN103414901A - Quick JPEG 2000 image compression system

Info

Publication number: CN103414901A
Application number: CN2013103750293A
Authority: CN
Inventors: 刘迎春; 魏华峰
Original assignee: JIANGSU XINRUIFENG INFORMATION TECHNOLOGY Co Ltd
Current assignee: JIANGSU XINRUIFENG INFORMATION TECHNOLOGY Co Ltd
Priority date: 2013-08-26
Filing date: 2013-08-26
Publication date: 2013-11-27

Abstract

The invention discloses a quick JPEG 2000 image compression system. It uses NVIDIA's CUDA (Compute Unified Device Architecture) platform on a GPGPU (General Purpose Graphic Process Unit) to parallelize and accelerate the discrete wavelet transform (DWT), the core algorithm of JPEG 2000 image compression. The computation speed of the discrete wavelet transform is thereby effectively improved.

Description

A quick JPEG2000 image compression system

Technical field

The invention belongs to the field of computer applications, and in particular to the field of image processing.

Background technology

With the spread of multimedia technology and the wide application of computer science, image compression has become a key technology in modern digital image transmission, processing, and storage. It is of great significance and use both in network transmission and in fields such as mobile communication. JPEG2000 is a new still image compression standard proposed on the basis of JPEG, created and maintained by the Joint Photographic Experts Group. Compared with the JPEG compression standard, it is not only optimized in compression performance, so that image data can be compressed at a higher compression ratio, but also has the advantage of supporting both lossy and lossless compression. Because image pixels can be regarded as a two-dimensional array, computing over that array amounts to computing over a large amount of mutually independent data. For high-quality original bitmaps in particular, processing with the serial architecture of a traditional CPU consumes a great deal of time and cannot meet the real-time requirements of image compression in current multimedia applications.

In hardware, traditional image compression is generally implemented on platforms such as DSPs and FPGAs, but these implementations require researchers to have a deep understanding of the internal hardware structure and also present certain difficulties in portability. The general-purpose graphics processing unit (General Purpose Graphic Process Unit, GPGPU for short) adds a parallel computing architecture to the graphics processing architecture of the traditional GPU, and this architecture makes compute-intensive processing and highly parallel accelerated computing possible. NVIDIA (Nasdaq: NVDA) provides a brand-new hardware and software development platform for its GPGPUs, CUDA (Compute Unified Device Architecture). Compared with implementing the algorithm on the CPU, using CUDA to optimize the core algorithm of JPEG2000 image compression on the GPGPU yields a significant increase in computation speed and greatly improves the efficiency of image compression.

Compared with the JPEG compression standard, the biggest difference of the JPEG2000 image compression standard is its further algorithmic improvement. First, it abandons the block coding scheme based on the discrete cosine transform (DCT) used by JPEG and instead adopts a full-frame multiresolution coding scheme based on the discrete wavelet transform (DWT). This reduces the redundant information contained in the image and avoids the blocking artifacts that the JPEG standard produces at low bit rates. Second, for entropy coding, the JPEG2000 standard adopts the embedded block coding with optimized truncation (EBCOT) algorithm in place of JPEG's Huffman coding.

The core encoder of the JPEG2000 image compression standard mainly comprises 7 modules, as shown in Fig. 1. First, the preprocessed original image undergoes a forward wavelet transform to obtain wavelet coefficients, which are then quantized as required. The quantized wavelet coefficients are divided into code blocks, and each code block is embedded-coded independently. All resulting code streams are organized into layers according to the rate-distortion optimization principle. Finally, these layers of different quality are packed according to a given code stream format and the compressed code stream is output, which completes the compression of the whole image.

With the arrival of the 3D era, the enormous amount of 3D graphics computation far exceeded the computing capability of the traditional CPU. The graphics processing unit (GPU) emerged as a result. The GPU is designed specifically for graphics computation and, compared with the CPU, offers high floating-point performance, high bandwidth, and efficient parallel computation. Using such computing power only for graphics rendering is undoubtedly a waste of computing resources. To make full use of the GPU's computing power and to meet the demands of scientific computing fields other than graphics, the general-purpose graphics processing unit GPGPU (General Purpose GPU) came into being and has achieved great success. GPGPU hardware adopts a single instruction, multiple data (SIMD) structure and combines a graphics processing architecture with a parallel computing architecture. As a result, the GPGPU can not only do the graphics card's usual work of graphics rendering but can also be applied much more widely in non-graphics scientific computing, where it brings its parallel computing capability fully to bear. As shown in Fig. 2, a GPGPU chip is made up of stream multiprocessors (SM, Stream Multiprocessor). Each stream multiprocessor comprises 8 stream processors (SP, Stream Processor), two special function units (SFU, Special Function Unit), and some on-chip memory resources such as shared memory and registers.

Fig. 3 shows the SM structure of the GPGPU hardware. Registers are high-speed memory and correspond to the hardware structure: each stream processor SP has a private 32-bit register. Shared memory is almost as fast to access as registers but is small. On currently supported hardware its default size is 16 KB, and it can be read and written by all threads in the same thread block. When using the GPGPU for parallel accelerated computing, memory must be allocated sensibly according to the access speed and functional characteristics of each memory type; this is the key to improving GPGPU computing performance.
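The shared memory limit translates directly into a cap on how much image data one thread block can stage at a time. The following Python arithmetic is only a rough sketch of that budget: the 16 KB figure comes from the text, while the 4-byte sample size and the function names are assumptions introduced for illustration.

```python
import math

# Rough shared-memory budget for one thread block (illustrative arithmetic only).
SHARED_MEM_BYTES = 16 * 1024  # 16 KB per thread block, as stated in the text
BYTES_PER_SAMPLE = 4          # assumption: 32-bit integer or float coefficients


def max_samples_per_block(shared_bytes=SHARED_MEM_BYTES,
                          sample_bytes=BYTES_PER_SAMPLE):
    """Number of coefficients that fit in one block's shared memory."""
    return shared_bytes // sample_bytes


def max_square_tile(shared_bytes=SHARED_MEM_BYTES,
                    sample_bytes=BYTES_PER_SAMPLE):
    """Largest N such that an N x N tile of samples fits in shared memory."""
    return math.isqrt(max_samples_per_block(shared_bytes, sample_bytes))
```

Under these assumptions a block can hold 4096 samples, for example one 64 x 64 tile, which is why the image has to be split into data blocks before it is staged in shared memory.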

Summary of the invention

CUDA is a heterogeneous development platform that NVIDIA released in June 2007 specifically for its GPGPU products, and it fundamentally changed the prospects of GPGPU parallel computing. CUDA provides an interface for direct hardware access, so development no longer has to go through graphics APIs such as OpenGL and DirectX as traditional GPGPU development did. CUDA also extends the widely used C language, which further lowers the programming difficulty of GPGPU parallel accelerated computing and lets developers move easily from C application development to GPGPU application development. CUDA provides a single instruction, multiple thread (SIMT) heterogeneous programming and execution model that corresponds to the SIMD structure of the GPGPU, as shown in Fig. 4. The CUDA thread hierarchy comprises threads (Thread), thread blocks (TB, Thread Block), and grids (Grid). Each grid consists of a number of thread blocks, and each thread block contains at most 512 threads. A CUDA program consists of a part that runs on the Host (CPU) side and a part that runs on the Device (GPU) side. The Host executes serial instructions for task scheduling and allocation, and the Device acts as a coprocessor of the Host and carries out the parallel computation. The program executed on the Device is called a kernel function, and a kernel executes in the form of a Grid. A simple Device program must complete the following two steps: (1) before the kernel function is called, the data to be processed is copied from host memory (Host Memory) to global memory (Global Memory) on the Device; (2) after the computation finishes, the result is returned from global memory to host memory. The essence of CUDA parallel acceleration is to divide a task into multiple independent task blocks and to process them simultaneously with thousands of threads, thereby increasing the computation speed of the whole task. Dividing the image data into data blocks and applying CUDA can therefore further increase the speed of JPEG2000 image compression.
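The grid, thread block, and thread hierarchy described above can be modeled in a few lines of Python. This is a sequential emulation for illustration only, not CUDA code: the names `blocks_needed`, `launch`, and `double_pixel` are hypothetical, and a real GPU would run the threads in parallel rather than in nested loops. The 512-thread limit is the one stated in the text.

```python
MAX_THREADS_PER_BLOCK = 512  # per-block limit stated in the text


def blocks_needed(n_items, threads_per_block=MAX_THREADS_PER_BLOCK):
    """Ceiling division: number of thread blocks needed to cover n_items."""
    return (n_items + threads_per_block - 1) // threads_per_block


def launch(kernel, n_items, data, threads_per_block=MAX_THREADS_PER_BLOCK):
    """Sequentially emulate a 1-D grid launch of `kernel` over `data`."""
    out = list(data)  # stands in for the host-to-device copy
    for block_idx in range(blocks_needed(n_items, threads_per_block)):
        for thread_idx in range(threads_per_block):
            # Global index of this thread within the whole grid.
            i = block_idx * threads_per_block + thread_idx
            if i < n_items:  # threads past the end of the data do nothing
                out[i] = kernel(data[i])
    return out  # stands in for the device-to-host copy


def double_pixel(value):
    """Hypothetical per-element kernel body."""
    return 2 * value
```

For a 640 x 480 image with one thread per pixel, `blocks_needed(640 * 480)` gives 600 blocks of 512 threads. The bounds check inside the loop mirrors the guard a kernel needs when the data size is not a multiple of the block size.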

The JPEG2000 standard uses two discrete wavelet transform algorithms: the 5/3 integer wavelet and the 9/7 floating-point wavelet. The 5/3 wavelet is suitable for both lossy and lossless compression and is computed as follows:

y(2n+1) = x(2n+1) - ⌊(x(2n) + x(2n+2)) / 2⌋

y(2n) = x(2n) + ⌊(y(2n-1) + y(2n+1) + 2) / 4⌋
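The two 5/3 lifting steps can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the interleaved output layout (low-pass coefficients at even positions, high-pass at odd positions), and the whole-sample symmetric extension at the signal borders are assumptions not stated in the text. The input must have at least two samples.

```python
def _mirror(i, n):
    """Whole-sample symmetric extension of index i into the range [0, n)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def dwt53_forward(x):
    """One level of the reversible 5/3 lifting transform on a list of ints."""
    n = len(x)
    y = list(x)
    # Predict step: odd samples become high-pass coefficients.
    for i in range(1, n, 2):
        y[i] = x[i] - (x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)]) // 2
    # Update step: even samples become low-pass coefficients.
    for i in range(0, n, 2):
        y[i] = x[i] + (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) // 4
    return y


def dwt53_inverse(y):
    """Undo dwt53_forward exactly by reversing the two lifting steps."""
    n = len(y)
    x = list(y)
    # Undo the update step first; it only reads the odd (high-pass) entries.
    for i in range(0, n, 2):
        x[i] = y[i] - (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) // 4
    # Undo the predict step using the even samples just recovered.
    for i in range(1, n, 2):
        x[i] = y[i] + (x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)]) // 2
    return x
```

Because every step uses only integer additions and floor divisions that the inverse subtracts back exactly, `dwt53_inverse(dwt53_forward(x))` returns the original samples, which is what makes the 5/3 wavelet usable for lossless compression.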

At lower bit rates the 9/7 floating-point wavelet delivers superior performance and is recommended for lossy compression. Compared with the 5/3 wavelet, the 9/7 wavelet is more complex, as follows:

y(2n+1) = x(2n+1) + α[x(2n) + x(2n+2)]

y(2n) = x(2n) + β[y(2n-1) + y(2n+1)]

y(2n+1) = y(2n+1) + γ[y(2n) + y(2n+2)]

y(2n) = y(2n) + δ[y(2n-1) + y(2n+1)]

y(2n+1) = -κ × y(2n+1)

y(2n) = (1/κ) × y(2n)

where:

α = -1.586134342, β = -0.052980118,

γ = 0.882911075, δ = 0.443506852,

κ = 1.230174105
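The six 9/7 steps can be sketched in Python as four lifting passes followed by a scaling pass. This is an illustration under stated assumptions, not the patented implementation: the constants in the code are the commonly published JPEG2000 9/7 lifting values (α = -1.586134342, β = -0.052980118), the high-pass scaling is taken as -κ (an assumption about the intended sign; some formulations use +κ), and the function names, interleaved layout, and symmetric border extension are likewise introduced here.

```python
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.882911075
DELTA = 0.443506852
KAPPA = 1.230174105


def _mirror(i, n):
    """Whole-sample symmetric extension of index i into the range [0, n)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def _lift(y, start, coeff):
    """In place: y[i] += coeff * (y[i-1] + y[i+1]) for i = start, start+2, ..."""
    n = len(y)
    for i in range(start, n, 2):
        y[i] += coeff * (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)])


def dwt97_forward(x):
    """One level of the irreversible 9/7 lifting transform (interleaved output)."""
    y = [float(v) for v in x]
    _lift(y, 1, ALPHA)   # step 1: odd samples
    _lift(y, 0, BETA)    # step 2: even samples
    _lift(y, 1, GAMMA)   # step 3: odd samples
    _lift(y, 0, DELTA)   # step 4: even samples
    # Steps 5 and 6: scale high-pass (odd) by -kappa and low-pass (even) by 1/kappa.
    return [-KAPPA * v if i % 2 else v / KAPPA for i, v in enumerate(y)]


def dwt97_inverse(y):
    """Undo dwt97_forward by reversing the scaling and the lifting passes."""
    x = [-v / KAPPA if i % 2 else v * KAPPA for i, v in enumerate(y)]
    _lift(x, 0, -DELTA)
    _lift(x, 1, -GAMMA)
    _lift(x, 0, -BETA)
    _lift(x, 1, -ALPHA)
    return x
```

Each pass updates only odd or only even positions and reads only the other parity, so the updates within a pass are independent of one another. That independence is what allows the samples of one pass to be distributed across GPU threads. Unlike the 5/3 case, reconstruction here is exact only up to floating-point rounding.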

From the JPEG2000 wavelet transform algorithm it can be seen that the bulk of the data computation lies in the one-dimensional row and column transforms. The basic row-transform and column-transform lifting operations are therefore designed as kernel functions and computed by the GPU, while all other tasks are handled by the CPU. This gives the task allocation scheme for the heterogeneous CPU-GPU parallel implementation of the wavelet transform shown in Fig. 5.

The specific implementation steps of this process are as follows:

(1) On the Host (CPU) side, the CPU allocates host memory spaces x and y, used respectively to store the input image data and the compressed output image data, and reads the image data into CPU memory x. By calling the library function cudaMalloc, it allocates two identical global memory spaces (video memory) X1 and X2 on the Device (GPU) side;

(2) By calling the library function cudaMemcpy, the image data in CPU memory x is copied to the global memory space (video memory) X1 of the same size on the Device (GPU) side, where the wavelet column transform is computed on it;

(3) The image data in global memory X1 is divided into data blocks, and the same number of thread blocks is created. The thread blocks are mapped one to one onto the data blocks, and a shared memory space is declared for each data block to hold the data that the threads of the mapped thread block need to process;

(4) The wavelet lifting transform is applied to the data in shared memory, and the results are stored in order in video memory space X2;

(5) The wavelet row transform is computed on the results in video memory space X2, and the results are stored in the now free global memory space (video memory) X1;

(6) Under control of the wavelet decomposition level, the wavelet lifting transform process is repeated until it has been completed for all levels. Finally, the resulting image data in global memory X1 is returned to CPU host memory, and the GPU video memory and CPU memory are released.
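The data flow of steps (1) to (6) can be sketched as a sequential Python model that uses two buffers in the roles of X1 and X2. This is not CUDA code and not the patented implementation. It assumes the 5/3 wavelet, symmetric border extension, and a layout in which low-pass coefficients are moved to the front after each pass so that the next level operates on the top-left quadrant; these choices and all function names are introduced here for illustration. The loops over columns and rows are the parts that the text assigns to GPU thread blocks.

```python
def lift53(x):
    """One level of 1-D 5/3 lifting; output is interleaved (even=low, odd=high)."""
    n = len(x)

    def m(i):  # whole-sample symmetric extension at the borders
        return -i if i < 0 else (2 * (n - 1) - i if i >= n else i)

    y = list(x)
    for i in range(1, n, 2):
        y[i] = x[i] - (x[m(i - 1)] + x[m(i + 1)]) // 2
    for i in range(0, n, 2):
        y[i] = x[i] + (y[m(i - 1)] + y[m(i + 1)] + 2) // 4
    return y


def deinterleave(v):
    """Move low-pass (even) coefficients to the front, high-pass to the back."""
    return v[0::2] + v[1::2]


def dwt2d_53(image, levels):
    """Multi-level 2-D DWT: column pass X1 -> X2, then row pass X2 -> X1."""
    h, w = len(image), len(image[0])
    x1 = [row[:] for row in image]    # stands in for the copy into buffer X1
    x2 = [[0] * w for _ in range(h)]  # second buffer X2 of the same size
    for _ in range(levels):
        if h < 2 or w < 2:
            break
        # Column transform: columns are independent of one another.
        for c in range(w):
            col = deinterleave(lift53([x1[r][c] for r in range(h)]))
            for r in range(h):
                x2[r][c] = col[r]
        # Row transform: rows are independent of one another.
        for r in range(h):
            x1[r][:w] = deinterleave(lift53(x2[r][:w]))
        # The next level works on the low-pass quadrant only.
        h, w = (h + 1) // 2, (w + 1) // 2
    return x1  # stands in for the copy back to host memory
```

Alternating between the two buffers means each pass reads from one and writes to the other, so no thread overwrites data that another thread still needs to read.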

Brief description of the drawings

Fig. 1 is a schematic diagram of the JPEG2000 core encoder.

Fig. 2 is the GPGPU hardware architecture model.

Fig. 3 is the SM structure of the GPGPU hardware.

Fig. 4 is the CUDA heterogeneous programming and execution model.

Fig. 5 is a schematic diagram of the heterogeneous parallel implementation of the DWT.

Embodiment

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Host (CPU): an Intel Core 2 Duo E7400 dual-core CPU with a clock frequency of 2.80 GHz (2800 MHz) and 2 GB of host memory;

Device (GPU): an NVIDIA GeForce GTX 560 Ti with 8 stream multiprocessors (each containing shared memory), 384 CUDA stream processors, core frequency 822 MHz, stream processor frequency 1645 MHz, video memory frequency 4008 MHz, global memory (video memory) capacity 1024 MB, video memory bandwidth 128 GB/s, video memory bus width 256 bits, compute capability 2.1, and a PCI-E 2.0 x16 bus interface;

Programming environment: GPU driver version 301.42, CUDA 4.1, the Windows 7 operating system, and VS2010.

The specific implementation steps of the quick JPEG2000 image compression method are as follows:

(1) On the Host (CPU) side, the CPU allocates host memory spaces x and y, used respectively to store the input image data and the compressed output image data, and reads the raw image data into CPU memory x. Two identical global memory spaces (video memory) X1 and X2 are allocated on the Device (GPU) side;

(2) The image data in CPU memory x is copied to the global memory space (video memory) X1 of the same size on the Device (GPU) side, where the JPEG2000 still image wavelet column transform is computed on it;

(4) The multithreaded discrete wavelet transform (DWT) is applied to the data in shared memory, and the results are stored in order in video memory space X2;

(5) The wavelet row transform is computed on the results in video memory space X2, and the results are stored in global memory space X1;

(6) The resulting image data in global memory X1 is returned to CPU host memory y, and the GPU video memory and CPU memory are released.

Because JPEG2000 image compression on both the CPU and the GPGPU is implemented according to the conventional definition, the compression quality of the images is essentially identical. The test times on the CPU and on the GPGPU are compared and analyzed; the results are shown in Table 1.

[Table 1: comparison of test times on the CPU and the GPGPU; the table image is not reproduced in this text]

The experimental data show that, compared with the unoptimized CPU test results, computation speed improved by more than 9 times for the 640 × 480 image, which involves relatively little data. An earlier study on the design of the CUDA-based wavelet Mallat algorithm and lifting scheme reported a computation time of 6.785 ms for its lifting wavelet algorithm on a 512 × 512 image, so the computation speed of the optimized DWT lifting algorithm is improved by comparison. For images with more data to process, the speedup exceeded 50 times. Clearly, the computation time on the GPGPU of the CUDA-optimized DWT algorithm of the JPEG2000 still image compression standard is far smaller than the computation time on the CPU. As the amount of data grows, the CPU computation time rises sharply while the GPGPU computation time grows only slightly. The speedup ratio likewise shows that the GPGPU delivers better acceleration as the amount of data increases.

It should be understood that the above embodiments are intended only to illustrate the present invention and not to limit its scope. After reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims of this application.

Claims

1. A quick JPEG2000 image compression system, comprising a host side and a device side connected to it, the host side being provided with a CPU and host memory, and the device side being provided with a graphics processing unit (GPU) comprising global memory (video memory) and stream multiprocessors made up of registers, shared memory, and the like, characterized in that image compression is carried out in the following steps:

(1) on the host side, the CPU allocates host memory spaces x and y, used respectively to store the input image data and the compressed output image data, and reads the raw image data into CPU memory x; on the device side, the CPU allocates, through the GPU, two global memory spaces (video memory) X1 and X2 of the same size as CPU memory x;

(2) the CPU copies the image data in memory x to the global memory space (video memory) X1 of the same size on the device side, and the GPU computes the JPEG2000 still image wavelet column transform on it;

(3) the GPU divides the image data in global memory X1 into data blocks and creates the same number of thread blocks, maps the thread blocks one to one onto the data blocks, and declares a shared memory space for each data block to hold the data that the threads of the mapped thread block need to process;

(4) the GPU applies the multithreaded discrete wavelet transform (DWT) to the data in shared memory, and the results are stored in order in video memory space X2;

(5) the GPU computes the wavelet row transform on the results in video memory space X2, and the results are stored in global memory space X1;

(6) the GPU returns the resulting image data in global memory X1 to CPU host memory y, the GPU releases the video memory, and the CPU likewise releases the memory.

2. The JPEG2000 image compression system as claimed in claim 1, characterized in that the discrete wavelet transform algorithm in step (4) is the 5/3 integer wavelet algorithm.

3. The JPEG2000 image compression system as claimed in claim 1, characterized in that the discrete wavelet transform algorithm in step (4) is the 9/7 floating-point wavelet algorithm.