CN114765684B - JPEG parallel entropy coding method based on GPU

Info

Publication number: CN114765684B
Application number: CN202110039080.1A
Authority: CN
Inventors: 严华; 祝福顺
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2023-05-09
Anticipated expiration: 2041-01-12
Also published as: CN114765684A

Abstract

Aiming at the speed problem of traditional JPEG entropy coding, the invention provides a GPU-based parallel processing method for JPEG entropy coding. The method mainly comprises the following three parallel processing steps: first, an efficient parallel encoding strategy is designed to generate a binary code stream for the DCT-transformed data of every 8×8 block of the image to be processed; then, the code stream of each 8×8 block is shifted appropriately so that the code streams of adjacent 8×8 blocks can be filled together without gaps; finally, the code streams of all 8×8 blocks are filled together to obtain the compressed image data. Experimental results show that, compared with traditional JPEG entropy coding, the method greatly improves image compression speed without affecting the quality of the compressed image.

Description

JPEG parallel entropy coding method based on GPU

Technical Field

The invention relates to the technical problem of image compression in the field of digital image processing, in particular to an acceleration method for JPEG image entropy coding.

Background

With the recent development of information technology, real-time image compression has attracted attention because of its wide application in fields such as high-speed video measurement systems and digital cinema. In many cases, the images acquired by cameras need to be stored quickly for later analysis. However, because the acquired images are highly redundant, the volume of image data is huge, and the requirements of real-time transmission and storage are difficult to meet without compression. The JPEG compression algorithm plays an irreplaceable role in these real-time systems because of its excellent performance. However, serial execution of the algorithm is slow and cannot guarantee the real-time requirements of the system, so research on parallelizing the JPEG compression algorithm is ongoing.

DCT transformation and entropy coding are the two most time-consuming parts of the JPEG compression algorithm; they remove, respectively, the spatial redundancy and the data structure redundancy of the image data. GPU-based parallel DCT algorithms have already made good progress in speedup and efficiency and have been applied successfully to JPEG compression. In the entropy encoding stage, however, an efficient parallel strategy must still be designed to encode the DCT-transformed data of the 8×8 image blocks. Moreover, the binary code stream produced by entropy encoding each 8×8 block has a variable length, so the position at which each block's code stream should be written into the compressed image data cannot be determined at encoding time. In summary, parallelizing entropy coding remains a major technical challenge and is the main bottleneck of JPEG parallelization.

Disclosure of Invention

The invention aims to provide a GPU-based parallel processing method for JPEG entropy coding that improves JPEG compression speed.

The following technical measures are adopted to solve the technical problems: the JPEG parallel entropy coding method based on the GPU mainly comprises the following steps:

(1) Encoding: according to the difference in coding workload between the image luminance Y component and the chrominance CbCr components, a parallel entropy coding strategy for 8×8 image blocks is provided. Its core is to design separate kernel functions for 8×8 block coding: for the luminance Y component, a CUDA warp (thread bundle) encodes each 8×8 block in parallel, while for the chrominance CbCr components a single CUDA thread encodes each 8×8 block serially, generating the corresponding binary code streams;

(2) Shifting: based on the bit count of each 8×8 block's code stream, the number of bits by which each block's code stream must be shifted right is computed quickly and accurately by a parallel prefix sum across CUDA threads; a single CUDA thread then shifts each 8×8 block's code stream right by that amount, so that the code streams of adjacent blocks can be filled together without gaps when the final image data is formed;

(3) Filling: based on the shifted code stream of each 8×8 block from step (2), a parallel prefix sum is performed across CUDA threads again to compute the bit position of each shifted 8×8 block code stream in the final image data; the code streams are then written to their positions by CUDA threads filling in parallel, forming the final image data.

Drawings

FIG. 1 is a JPEG parallel entropy encoding flow chart of the present invention.

Fig. 2 is a schematic diagram of code stream generation at each 8×8 block encoding stage.

Fig. 3 is a schematic diagram of the code stream shift of each 8×8 block in the shifting stage.

Fig. 4 is a schematic diagram of the code stream filling of each 8×8 block in the filling stage.

Fig. 5 compares the compression results of the method of the present invention with those of libjpeg-turbo.

Fig. 6 compares the time consumption of the present invention with that of the serial compression algorithm (libjpeg-turbo) and the parallel compression algorithms (NPP v10.2, Shan method).

Fig. 7 compares the speedup of the present invention and of the parallel compression algorithms (NPP v10.2, Shan method) relative to the serial compression algorithm (libjpeg-turbo).

Detailed Description

The invention is further described below with reference to the accompanying drawings. It is noted that the following examples are given for the purpose of illustration only and are not to be construed as limiting the scope of the invention, since numerous insubstantial modifications and adaptations of the invention will be within the scope of the invention as viewed by one skilled in the art from the foregoing disclosure.

As shown in Fig. 1, the GPU-based JPEG parallel entropy encoding method specifically includes the following steps:

In step (1), the invention designs a parallel encoding method for the 8×8 blocks of an image. JPEG image data consists of a luminance component Y and chrominance components CbCr. The luminance component carries the main information of the whole image, and changes in luminance noticeably affect how people perceive the image, so the 8×8 block coding task for this component is heavy. In contrast, the chrominance components carry less information and the human eye is less sensitive to them, so their 8×8 block coding task is light. Because the coding workload differs between components, the invention designs two different parallel coding strategies to further improve the coding speed.

When encoding the luminance Y component, the invention uses a CUDA warp (thread bundle) to encode each 8×8 block in parallel, with each thread in the warp responsible for two adjacent coefficients. First, thread No. 1 of the warp performs DPCM coding on the DC coefficient. Then all threads of the warp simultaneously perform RLE coding on the DC and AC coefficients. Computing the run length is the most critical part of RLE coding, and the `__ballot()` function is introduced so that all threads in the warp can communicate with each other to complete this task. Finally, each thread re-encodes its RLE result with Huffman coding to generate a binary code stream. Because the length of the code stream produced by each thread is variable, the position at which the code stream is written into the compressed image data cannot be determined during encoding. Therefore, 64 bytes are allocated for each 8×8 block to temporarily store its code stream, and the size of this temporary storage can be adjusted if needed. When encoding the chrominance CbCr components, each 8×8 block requires only a single CUDA thread, which serially performs DPCM coding on the DC coefficient, followed by RLE coding and Huffman coding. Likewise, 64 bytes are allocated for each 8×8 block of these components to temporarily store the generated code stream. As shown in Fig. 2, the result of the encoding stage is that the code streams of all 8×8 blocks are written into the temporary storage space and the corresponding code stream lengths $l$ are calculated. For example, the code stream length $l_k$ of the $k$-th 8×8 block is 17 bits, the code stream length $l_{k+1}$ of the $(k+1)$-th block is 4 bits, and the code stream length $l_{k+2}$ of the $(k+2)$-th block is 11 bits.
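The run-length step can be modeled on the CPU. The following is a minimal Python sketch, not the patent's kernel: it emulates a warp vote by building a 64-bit mask of nonzero coefficients and then derives each zero run from the mask alone, which is what allows every thread to work independently. The helper names `ballot`, `zero_run_before` and `rle_pairs` are hypothetical, the coefficients are assumed to be in zigzag order, and the ZRL and EOB symbols of a complete JPEG encoder are omitted.

```python
def ballot(coeffs):
    """Emulate a warp vote: bit i is set when coefficient i is nonzero."""
    mask = 0
    for i, c in enumerate(coeffs):
        if c != 0:
            mask |= 1 << i
    return mask


def zero_run_before(mask, i):
    """Zeros between AC coefficient i and the previous nonzero AC coefficient.

    Bit 0 (the DC coefficient) is ignored, so a run can start at index 1.
    """
    lower = mask & ((1 << i) - 1) & ~1      # nonzero AC positions below i
    prev = lower.bit_length() - 1 if lower else 0
    return i - prev - 1


def rle_pairs(coeffs):
    """(run, value) pairs for the nonzero AC coefficients of one block."""
    mask = ballot(coeffs)
    # Each list element is computed independently, mirroring one thread's work.
    return [(zero_run_before(mask, i), coeffs[i])
            for i in range(1, len(coeffs)) if coeffs[i] != 0]


block = [5, 0, 0, 3, 0, 1] + [0] * 58       # illustrative zigzag-ordered block
print(rle_pairs(block))                     # [(2, 3), (1, 1)]
```

Because the mask is shared, no thread has to scan its neighbours' coefficients; the run length reduces to locating the highest set bit below the thread's own position.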

In step (2), the code stream of every 8×8 block starts at position 0 of its temporary storage space, but most block code streams are not a whole number of bytes long, so the code streams of adjacent 8×8 blocks cannot be filled together contiguously; a shift operation is therefore required to produce correct compressed image data. Specifically, from the bit count of each 8×8 block's code stream, the invention uses a parallel prefix sum algorithm to compute, quickly and accurately, the shift amount $s$ of each 8×8 block's code stream, as follows:

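$$s_k = \mathrm{pre}_k \bmod 8 \qquad (2)$$

where $k$ denotes the $k$-th 8×8 block of the image, $\mathrm{pre}_k$ is the total length in bits of the code streams of the 0th through $(k-1)$-th 8×8 blocks (that is, $\mathrm{pre}_k = \sum_{i=0}^{k-1} l_i$), and $s_k$ is the number of bits by which the code stream of the $k$-th 8×8 block is shifted right.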
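As a worked illustration of formula (2), the following minimal Python sketch computes the exclusive prefix sum and the shift amounts sequentially; a GPU implementation would use a parallel scan instead. The function name `shift_amounts` is hypothetical, and treating the example lengths 17, 4 and 11 as the first three blocks of an image is an assumption made only for this example.

```python
from itertools import accumulate


def shift_amounts(lengths):
    """Exclusive prefix sum of bit lengths and the per-block right shift."""
    # pre_k = l_0 + ... + l_{k-1}, with pre_0 = 0
    pre = list(accumulate(lengths, initial=0))[:-1]
    shifts = [p % 8 for p in pre]            # s_k = pre_k mod 8
    return pre, shifts


# Assumption: blocks with lengths 17, 4 and 11 bits are blocks 0, 1 and 2.
pre, shifts = shift_amounts([17, 4, 11])
print(pre)      # [0, 17, 21]
print(shifts)   # [0, 1, 5]
```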

After the shift amounts are computed, all 8×8 block code streams are shifted right in parallel by their respective amounts $s$. To reduce video memory consumption without affecting the result, the shift is performed starting from the last byte of the code stream, and the shifted code stream is written starting from the last byte of the temporary storage space. In a JPEG file the byte 0xff introduces a marker, but it may also occur in the image data; to distinguish the two cases, the shift stage must append a 0x00 byte after every 0xff byte of the image data.

For example, in Fig. 3, if the shift amount $s_k$ of the $k$-th 8×8 block is 2, its 17-bit code stream is shifted right by 2 bits. Because a 0xff byte appears after the shift, a 0x00 byte is appended and the bit length changes from 17 to 25. At the same time, the code streams of the $(k+1)$-th and $(k+2)$-th 8×8 blocks are shifted by their respective shift amounts $s$.
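The shift and the 0x00 insertion can be sketched as follows. This is a minimal Python model under stated assumptions: bits are stored most significant bit first, the 17-bit all-ones stream is a hypothetical input chosen so that a 0xff byte appears after the shift, and the function names are illustrative. Boundary bytes that a block shares with its neighbours are not handled here.

```python
def shift_right(buf, s):
    """Shift a byte buffer right by s bits (0 <= s < 8), in place.

    Bytes are processed from last to first so each byte still sees the
    unmodified byte before it, and no second buffer is needed.
    """
    if s == 0:
        return
    for i in range(len(buf) - 1, -1, -1):
        carry = (buf[i - 1] << (8 - s)) if i > 0 else 0
        buf[i] = ((buf[i] >> s) | carry) & 0xFF


def stuff_ff(data):
    """Append 0x00 after every 0xFF byte (JPEG byte stuffing)."""
    out = bytearray()
    for b in data:
        out.append(b)
        if b == 0xFF:
            out.append(0x00)
    return out


# Hypothetical 17-bit stream of all ones, MSB-first, zero-padded to 3 bytes.
buf = bytearray([0xFF, 0xFF, 0x80])
length_bits = 17
shift_right(buf, 2)
stuffed = stuff_ff(buf)
length_bits += 8 * (len(stuffed) - len(buf))
print(buf.hex(), stuffed.hex(), length_bits)   # 3fffe0 3fff00e0 25
```

Each stuffed byte adds exactly 8 bits, so the shift amounts of later blocks, which depend only on positions modulo 8, remain valid after stuffing.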

In step (3), because the shift stage adds 0x00 bytes, the code stream length of each 8×8 block has changed again and is not known in advance. The parallel prefix sum algorithm must therefore be executed again to compute the start position of each 8×8 block's code stream in the compressed image data. Based on this result, the threads can write the 8×8 block code streams into their positions in the compressed image data accurately and in parallel. However, if adjacent threads access the same memory location at the same time, access conflicts and write errors may result. To avoid this problem, as shown in Fig. 4, the code streams of all 8×8 blocks are filled in 3 parallel passes, and the set $Y_i$ of 8×8 block indices written in parallel in the $i$-th pass can be expressed as:

where $M$ is the total number of 8×8 blocks in the image.
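The three-pass filling can be modeled with the minimal Python sketch below. The exact definition of the pass sets is not available in this text, so the sketch assumes a round-robin partition in which pass $i$ writes the blocks whose index $k$ satisfies $k \bmod 3 = i$; that partition, the function name `fill` and the bit-string representation of the code streams are all assumptions. The sketch also assumes every block stream is at least 4 bits long, and it checks that no two blocks written in the same pass touch the same output byte.

```python
from itertools import accumulate


def fill(streams):
    """Pack per-block bit strings into bytes using three write passes."""
    lengths = [len(s) for s in streams]
    starts = list(accumulate(lengths, initial=0))[:-1]   # start bit per block
    out = bytearray((sum(lengths) + 7) // 8)
    for i in range(3):                       # pass i: blocks with k % 3 == i
        touched = set()
        for k in range(i, len(streams), 3):
            first = starts[k] // 8
            last = (starts[k] + lengths[k] - 1) // 8
            bytes_k = set(range(first, last + 1))
            # Within one pass, two blocks must never share an output byte.
            assert not (bytes_k & touched)
            touched |= bytes_k
            for j, bit in enumerate(streams[k]):
                if bit == "1":
                    pos = starts[k] + j
                    out[pos // 8] |= 0x80 >> (pos % 8)
    return bytes(out)


# Illustrative streams of 17, 4 and 11 bits (contents are made up).
packed = fill(["1" * 17, "0000", "1" * 11])
print(packed.hex())                          # ffff87ff
```

Under the 4-bit minimum, blocks $k$ and $k+3$ are separated by at least 8 bits, so they cannot share a byte, whereas two passes would not guarantee this.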

To better illustrate the effectiveness of the invention, JPEG compression in the two formats 4:4:4 and 4:2:0 was implemented with the invention and tested on classical images from the field of digital image processing. The subjective results, compared with methods such as libjpeg-turbo and NPP, are shown in Fig. 5. Comparing the SSIM and PSNR of the invention with those of the standard serial JPEG compression method libjpeg-turbo shows that the algorithm achieves the same compression quality as the serial algorithm.

Fig. 6 compares the time consumption of the invention with that of the serial compression algorithm (libjpeg-turbo) and the parallel compression algorithms (NPP v10.2, Shan method) for test images ranging in size from 768×512 to 2560×1600. As the image size grows, the compression time of libjpeg-turbo increases significantly, and the incomplete parallelization of the Shan method causes its entropy encoding time to follow the same trend as serial execution. The experiments show that the entropy coding time of the other parallel methods still accounts for a large proportion of the total compression time, whereas the execution time of the proposed method is markedly shorter, giving the best performance.

To further evaluate the effectiveness of the method, the speedup (acceleration ratio) of the invention is compared with that of the other parallel algorithms (NPP v10.2, Shan method). The calculation formula is as follows:

where $t_s$ is the time consumed by libjpeg-turbo and $t_p$ is the time consumed by the parallel algorithm. In addition, the performance improvement of the invention relative to NPP, the most advanced parallel algorithm, is calculated as follows:

where $t_{pN}$ is the entropy encoding time of NPP and $t_{pO}$ is the entropy encoding time of the method of the invention. Fig. 7 shows the specific experimental results, further illustrating the effectiveness of the proposed method.
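The formulas themselves do not appear in this text. As a minimal sketch, assuming the conventional definition of speedup as serial time divided by parallel time ($t_s / t_p$), the ratio can be computed as below. The function name and the timings are hypothetical placeholders, not values from the experiments, and the improvement relative to NPP is not sketched because its exact form cannot be recovered from the text.

```python
def speedup(t_serial, t_parallel):
    """Conventional speedup: serial time divided by parallel time."""
    if t_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_serial / t_parallel


# Placeholder timings in milliseconds, not measurements from the experiments.
print(speedup(120.0, 8.0))   # 15.0
```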

Table: PSNR and SSIM of libjpeg-turbo versus the present invention

Claims

1. The JPEG parallel entropy coding method based on the GPU is characterized by comprising the following steps of:

2. The GPU-based JPEG parallel entropy encoding method according to claim 1, wherein the 8×8 image block parallel entropy encoding strategy in step (1) takes into account the difference in entropy encoding workload between the JPEG image luminance component Y and the chrominance components CbCr, and encodes the two kinds of components with CUDA thread bundles (warps) and single CUDA threads, respectively, to achieve an optimal encoding speed; and wherein, because the length of the code stream generated by entropy encoding each 8×8 block is variable and the position at which it will be written into the final image data cannot be determined at this time, a temporary storage space is allocated for each 8×8 block to store its code stream.

3. The method of claim 1, wherein in the step (2), the start positions of the code streams corresponding to the 8×8 blocks in the temporary storage space are all 0, but most of the code stream lengths are not integer bytes, so that the code streams of adjacent 8×8 blocks cannot be continuously filled together.

4. The method according to claim 1, wherein the step (3) is a filling step of JPEG parallel entropy encoding, in which final image data is generated by parallel filling, and in order to avoid access conflicts and write errors caused by different threads accessing the same location during filling, the code streams corresponding to all 8×8 blocks are divided into 3 parallel fills, so as to generate final compressed image data.