CN114765684B - JPEG parallel entropy coding method based on GPU - Google Patents
- Publication number
- CN114765684B (application number CN202110039080.1A)
- Authority
- CN
- China
- Prior art keywords
- block
- parallel
- code stream
- coding
- jpeg
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/186—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/70—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
- H04N19/91—Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Compression Of Band Width Or Redundancy In Fax (AREA)
Abstract
To address the low speed of conventional JPEG entropy coding, the invention provides a GPU-based parallel processing method for JPEG entropy coding. The method comprises three parallel processing steps. First, an efficient parallel coding strategy generates a binary code stream for the DCT-transformed data of every 8×8 block of the input image. Next, the code stream of each 8×8 block is shifted by an appropriate amount so that the code streams of adjacent 8×8 blocks can be packed together without gaps. Finally, the code streams of all 8×8 blocks are filled into one buffer to obtain the compressed image data. Experimental results show that, compared with conventional JPEG entropy coding, the method greatly increases image compression speed without affecting the quality of the compressed image.
Description
Technical Field
The invention relates to image compression in the field of digital image processing, and in particular to a method for accelerating the entropy coding of JPEG images.
Background
With the recent development of information technology, real-time image compression has attracted attention because of its wide application in fields such as high-speed video measurement systems and digital cinema. In many cases, images acquired by cameras must be stored quickly for later study. However, the acquired images are highly redundant and the volume of image data is very large, so real-time transmission and storage are difficult to achieve without compression. The JPEG compression algorithm plays an irreplaceable role in these real-time systems because of its excellent performance. Its serial processing speed is slow, however, and cannot guarantee the real-time requirement of the system, so research on parallelizing the JPEG compression algorithm is still in progress.
DCT transformation and entropy coding are the two most time-consuming parts of the JPEG compression algorithm; they remove the spatial redundancy and the data-structure redundancy of the image data, respectively. GPU-based parallel DCT algorithms have already made good progress in speedup and efficiency and have been applied successfully to JPEG compression. In the entropy coding stage, however, an efficient parallel strategy is still needed to encode the DCT-transformed data of each 8×8 image block. Moreover, the binary code stream produced by entropy coding each 8×8 block has a variable length, so the position at which a block's code stream must be written into the compressed image data cannot be determined while the block is being encoded. In summary, parallelizing entropy coding remains a major technical challenge and is the main bottleneck of JPEG parallelization.
Disclosure of Invention
The invention aims to provide a GPU-based parallel processing method for JPEG entropy coding that increases JPEG compression speed.
The following technical solution is adopted to solve the technical problem. The GPU-based JPEG parallel entropy coding method mainly comprises the following steps:
(1) Encoding: because the luminance (Y) and chrominance (CbCr) components of an image carry different coding workloads, a parallel entropy coding strategy for 8×8 image blocks is proposed in which a separate kernel function is designed for each kind of component. For the luminance Y component, a CUDA warp (thread bundle) encodes each 8×8 block in parallel; for the chrominance CbCr components, a single CUDA thread encodes each 8×8 block serially. Both kernels generate the corresponding binary code streams;
(2) Shifting: from the bit count of each 8×8 block code stream, a parallel prefix sum over CUDA threads quickly and accurately computes the number of bits by which the code stream of each 8×8 block must be shifted right. A single CUDA thread then performs the right shift of each 8×8 block code stream by that amount, so that the code streams of adjacent blocks can be packed together without gaps when the final image data is formed;
(3) Filling: from the shifted 8×8 block code streams produced in step (2), a second parallel prefix sum over CUDA threads computes the bit position of each shifted 8×8 block code stream in the final image data, and CUDA threads then write the code streams to those positions in parallel to form the final image data.
Drawings
Fig. 1 is a flow chart of the JPEG parallel entropy coding of the present invention.
Fig. 2 is a schematic diagram of code stream generation for each 8×8 block in the encoding stage.
Fig. 3 is a schematic diagram of the code stream shift for each 8×8 block in the shifting stage.
Fig. 4 is a schematic diagram of the code stream filling for each 8×8 block in the filling stage.
Fig. 5 compares the compression results of the method of the present invention with those of libjpeg-turbo.
Fig. 6 compares the time consumption of the present invention with that of the serial compression algorithm (libjpeg-turbo) and the parallel compression algorithms (NPP v10.2, Shan method).
Fig. 7 compares the speedup of the present invention and of the parallel compression algorithms (NPP v10.2, Shan method) relative to the serial compression algorithm (libjpeg-turbo).
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are given for illustration only and are not to be construed as limiting the scope of the invention; insubstantial modifications and adaptations made by a person skilled in the art on the basis of the foregoing disclosure remain within the scope of the invention.
As shown in Fig. 1, the GPU-based JPEG parallel entropy coding method consists of the three steps set out in the Disclosure of Invention: (1) encoding, (2) shifting and (3) filling. Each step is described in detail below.
Specifically, in step (1), the invention designs a parallel encoding method for the 8×8 blocks of an image. JPEG image data consists of a luminance component Y and chrominance components CbCr. The luminance component carries the main information of the whole image, and changes in luminance strongly affect how people perceive the image, so the coding workload of its 8×8 blocks is heavy. In contrast, the chrominance components carry less information and the human eye is less sensitive to them, so the coding workload of their 8×8 blocks is light. Because the coding workload differs between the components, the invention designs two different parallel coding strategies to further increase coding speed.
When the luminance Y component is coded, the invention uses a CUDA warp (thread bundle) to encode each 8×8 block in parallel, with each thread in the warp responsible for two adjacent coefficients. First, thread No. 1 of the warp performs DPCM coding on the DC coefficient. Then all threads of the warp simultaneously perform RLE coding on the DC and AC coefficients. Computing the run length is the most critical part of RLE coding, and the warp vote function `__ballot()` is used so that all threads in the warp can communicate with each other to complete this task. Finally, each thread re-encodes its RLE result with Huffman coding to generate a binary code stream.
Because the code stream produced by each thread has a variable length, the position at which it must be written into the compressed image data cannot be determined during encoding. Therefore, 64 bytes are allocated for each 8×8 block to store its code stream temporarily; the size of this temporary storage can be adjusted if needed.
When the chrominance CbCr components are coded, each 8×8 block requires only a single CUDA thread, which serially performs DPCM coding on the DC coefficient followed by RLE coding and Huffman coding. As for the luminance component, 64 bytes are allocated for each 8×8 block of these components to store the generated code stream temporarily.
As shown in Fig. 2, the result of the encoding stage is that the code streams of all 8×8 blocks are written into the temporary storage space and the corresponding code stream lengths $l$ are computed. For example, the code stream length of the $k$-th 8×8 block is $l_k = 17$ bits, that of the $(k+1)$-th 8×8 block is $l_{k+1} = 4$ bits, and that of the $(k+2)$-th 8×8 block is $l_{k+2} = 11$ bits.
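The run-length computation described above can be illustrated with a small CPU-side sketch in Python. It is not the patent's kernel: the helper names (`dpcm`, `ballot`, `run_lengths`) are hypothetical, the warp vote is emulated with one 64-bit integer rather than 32-lane votes, and JPEG details such as the 15-zero run limit and the end-of-block symbol are left out.

```python
def dpcm(dc_values):
    """Differences between consecutive DC coefficients (initial predictor is 0)."""
    prev = 0
    out = []
    for dc in dc_values:
        out.append(dc - prev)
        prev = dc
    return out


def ballot(flags):
    """Emulate a warp vote: pack one boolean per lane into an integer bitmask."""
    mask = 0
    for lane, flag in enumerate(flags):
        if flag:
            mask |= 1 << lane
    return mask


def run_lengths(zigzag):
    """Return (zero_run, value) pairs for the nonzero AC coefficients of one block.

    Every position can find its zero run independently from the shared mask,
    which is what makes the step suitable for one-thread-per-coefficient work.
    """
    # Assumption: the DC position (index 0) always counts as "occupied".
    mask = ballot(c != 0 for c in zigzag) | 1
    pairs = []
    for i in range(1, len(zigzag)):
        if zigzag[i] != 0:
            below = mask & ((1 << i) - 1)   # occupied positions before i
            prev = below.bit_length() - 1   # nearest occupied position before i
            pairs.append((i - prev - 1, zigzag[i]))
    return pairs
```

For a block whose zigzag sequence begins 5, 0, 0, 3, 0, -1 and is zero afterwards, `run_lengths` yields the pairs (2, 3) and (1, -1), and `dpcm([10, 12, 9])` gives the differences 10, 2 and -3.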
In step (2), the code stream of every 8×8 block starts at position 0 of its temporary storage space, but most 8×8 block code streams are not a whole number of bytes long, so the code streams of adjacent 8×8 blocks cannot be packed together contiguously as they stand. A shift operation is therefore required to ensure that correct compressed image data is generated. Specifically, from the bit count of each 8×8 block code stream, the invention uses a parallel prefix sum algorithm to compute quickly and accurately the shift amount $s$ of each 8×8 block code stream:
$$s_k = \mathrm{pre}_k \bmod 8 \qquad (2)$$
where $k$ denotes the $k$-th 8×8 block of the image, $\mathrm{pre}_k = \sum_{i=0}^{k-1} l_i$ is the total length in bits of the code streams of the 0-th through $(k-1)$-th 8×8 blocks, and $s_k$ is the number of bits by which the code stream of the $k$-th 8×8 block is shifted right.
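Equation (2) can be checked on the code stream lengths quoted for Fig. 2 (17, 4 and 11 bits). The sketch below is a CPU-side Python model that assumes a Hillis-Steele style scan; the source does not say which parallel prefix sum variant is used, and the function names are hypothetical. It also treats the first of the three blocks as block 0, whereas in Fig. 3 earlier blocks precede block $k$, which is why $s_k = 2$ there.

```python
def exclusive_scan(values):
    """Exclusive prefix sum; each pass of the while loop models one parallel step."""
    n = len(values)
    incl = list(values)
    step = 1
    while step < n:
        # Every element is updated from the previous pass only, as threads would do.
        incl = [incl[i] + (incl[i - step] if i >= step else 0) for i in range(n)]
        step *= 2
    return [0] + incl[:-1] if n else []


def shift_amounts(lengths):
    """Return (pre, s) with s[k] = pre[k] mod 8, following Equation (2)."""
    pre = exclusive_scan(lengths)
    return pre, [p % 8 for p in pre]


pre, s = shift_amounts([17, 4, 11])
# pre == [0, 17, 21] and s == [0, 1, 5]
```

The scan needs about log2(n) passes, which is the property that makes it attractive on a GPU compared with a serial running sum.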
After the shift amounts $s$ have been calculated, all 8×8 block code streams are shifted right in parallel by their respective amounts. To reduce video memory consumption without affecting the result, the shift proceeds from the last byte of the code stream, and the shifted code stream is written starting from the last byte of the temporary storage space. The byte 0xFF is used to introduce markers in a JPEG file but may also occur in image data; to distinguish the two cases, the shift stage inserts a 0x00 byte after every 0xFF byte in the image data.
For example, in Fig. 3, if the shift amount $s_k$ of the $k$-th 8×8 block is 2, its 17-bit code stream is shifted right by 2 bits. Because a 0xFF byte appears after the shift, a 0x00 byte is inserted and the bit length changes from 17 to 25. At the same time, the code streams of the $(k+1)$-th and $(k+2)$-th 8×8 blocks are shifted by their own shift amounts $s$.
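The 17-to-25-bit example can be reproduced with a minimal Python sketch. The bit pattern below is hypothetical, chosen only so that a 0xFF byte appears after a 2-bit shift; the function names are also assumptions. The sketch works on one block in isolation and does not model bytes shared with neighbouring blocks.

```python
def shift_right(data: bytes, nbits: int, s: int) -> bytes:
    """Shift an MSB-first code stream of nbits bits right by s bits (0 <= s < 8)."""
    # Drop the zero padding after the last valid bit of the stored stream.
    stream = int.from_bytes(data, "big") >> (len(data) * 8 - nbits)
    out_len = (nbits + s + 7) // 8
    # Re-align so that the stream starts s bits into the first output byte.
    return (stream << (out_len * 8 - s - nbits)).to_bytes(out_len, "big")


def stuff_ff(data: bytes) -> bytes:
    """Insert a 0x00 byte after every 0xFF byte (JPEG byte stuffing)."""
    out = bytearray()
    for b in data:
        out.append(b)
        if b == 0xFF:
            out.append(0x00)
    return bytes(out)


# Hypothetical 17-bit stream 101010 11111111 010, stored left-aligned in 3 bytes.
block = bytes([0xAB, 0xFD, 0x00])
shifted = shift_right(block, nbits=17, s=2)   # bytes 0x2A 0xFF 0x40
stuffed = stuff_ff(shifted)                   # bytes 0x2A 0xFF 0x00 0x40
new_length = 17 + 8 * (len(stuffed) - len(shifted))   # 25 bits
```

Each stuffed byte adds exactly 8 bits to the block's length, which matches the change from 17 to 25 bits described for Fig. 3.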
In step (3), because 0x00 bytes may have been inserted during the shift, the code stream length of each 8×8 block has again changed by an unpredictable amount. The parallel prefix sum algorithm must therefore be executed again to calculate the starting position of each 8×8 block code stream in the compressed image data. Based on this result, the threads can write the 8×8 block code streams to the correct positions of the compressed image data in parallel. However, if adjacent threads access the same memory location at the same time, access conflicts and write errors may result. As shown in Fig. 4, to avoid this problem the code streams of all 8×8 blocks are filled in 3 separate parallel passes, and the set $Y_i$ of 8×8 block indices written in parallel in the $i$-th pass can be expressed as:
where $M$ is the total number of 8×8 blocks in the image.
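The filling step can be modelled on the CPU with the short Python sketch below. It is a serial reference under stated assumptions, not the patent's kernel: code streams are given as bit strings, byte stuffing is omitted, the names `shift_right` and `fill` are hypothetical, and the partition of blocks into the 3 passes is not reproduced. Because each stuffed byte adds a multiple of 8 bits, the start positions recomputed after stuffing leave the shift amounts modulo 8 unchanged, so the sketch derives both from a single prefix sum.

```python
from itertools import accumulate


def shift_right(bits: str, s: int) -> bytes:
    """Place a non-empty bit string s bits into its first byte, zero-padded to bytes."""
    padded = "0" * s + bits
    padded += "0" * (-len(padded) % 8)
    return int(padded, 2).to_bytes(len(padded) // 8, "big")


def fill(streams):
    """Pack variable-length block code streams into one contiguous byte buffer."""
    lengths = [len(b) for b in streams]
    starts = list(accumulate(lengths, initial=0))[:-1]   # exclusive prefix sum
    out = bytearray((sum(lengths) + 7) // 8)
    for bits, start in zip(streams, starts):
        shifted = shift_right(bits, start % 8)
        first = start // 8
        for j, byte in enumerate(shifted):
            # OR keeps bits already written by neighbours into a shared byte.
            out[first + j] |= byte
    return bytes(out)


# Hypothetical streams with the lengths quoted for Fig. 2: 17, 4 and 11 bits.
streams = ["10110011100011110", "1010", "11001100110"]
packed = fill(streams)   # bytes 0xB3 0x8F 0x56 0x66
```

In this example the third output byte receives bits from all three blocks. A parallel version cannot let several threads update such a shared byte at once without synchronization, which is the kind of conflict the 3-pass filling described above is meant to avoid.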
To better illustrate the effectiveness of the invention, JPEG compression was implemented with the invention in two chroma subsampling formats, 4:4:4 and 4:2:0, and experiments were carried out on classical images from the field of digital image processing. The results were compared with methods such as libjpeg-turbo and NPP, and the subjective effect is shown in Fig. 5. The comparison of SSIM and PSNR between the present invention and libjpeg-turbo, the standard serial JPEG compression method, shows that the present algorithm achieves the same compression quality as the serial compression algorithm.
Fig. 6 compares the time consumption of the present invention with that of the serial compression algorithm (libjpeg-turbo) and the parallel compression algorithms (NPP v10.2, Shan method) for test images ranging in size from 768×512 to 2560×1600. As the image size increases, the compression time of libjpeg-turbo grows significantly, and the incomplete parallelization of the Shan method causes its entropy coding time to vary in the same way as serial execution. The experiments show that, for the other parallel methods, entropy coding still accounts for a large proportion of the total compression time, whereas the execution time of the proposed method is markedly shorter, giving the best performance.
To further evaluate the effectiveness of the method, the speedup of the present invention is compared with that of the other parallel algorithms (NPP v10.2, Shan method). The calculation formula is as follows:
where $t_s$ is the time consumed by libjpeg-turbo and $t_p$ is the time consumed by the parallel algorithm. In addition, the performance improvement of the present invention relative to NPP, the most advanced parallel algorithm, is calculated as follows:
where $t_{pN}$ is the entropy coding time of NPP and $t_{pO}$ is the entropy coding time of the method of the invention. Fig. 7 shows the specific experimental results, which further illustrate the effectiveness of the proposed method.
Table: PSNR and SSIM of libjpeg-turbo versus the present invention
Claims (4)
1. A GPU-based JPEG parallel entropy coding method, characterized by comprising the following steps:
(1) Encoding: because the luminance (Y) and chrominance (CbCr) components of an image carry different coding workloads, a parallel entropy coding strategy for 8×8 image blocks is proposed in which a separate kernel function is designed for each kind of component, wherein for the luminance Y component a CUDA warp (thread bundle) encodes each 8×8 block in parallel, and for the chrominance CbCr components a single CUDA thread encodes each 8×8 block serially, so as to generate the corresponding binary code streams;
(2) Shifting: from the bit count of each 8×8 block code stream, a parallel prefix sum over CUDA threads quickly and accurately computes the number of bits by which the code stream of each 8×8 block must be shifted right, and a single CUDA thread then performs the right shift of each 8×8 block code stream by that amount, so that the code streams of adjacent blocks can be packed together without gaps when the final image data is formed;
(3) Filling: from the shifted 8×8 block code streams produced in step (2), a second parallel prefix sum over CUDA threads computes the bit position of each shifted 8×8 block code stream in the final image data, and CUDA threads then write the code streams to those positions in parallel to form the final image data.
2. The GPU-based JPEG parallel entropy coding method according to claim 1, wherein the parallel entropy coding strategy for 8×8 image blocks in step (1) takes into account the difference in entropy coding workload between the luminance component Y and the chrominance components CbCr of the JPEG image, and encodes the two kinds of component with CUDA warps and with single CUDA threads, respectively, to achieve an optimal coding speed; and wherein, because the code stream generated by entropy coding each 8×8 block has a variable length and its write position in the final image data cannot yet be determined, a temporary storage space is allocated for each 8×8 block to store its code stream.
3. The method according to claim 1, wherein in step (2) the code streams of all 8×8 blocks start at position 0 of their temporary storage spaces, but most code stream lengths are not a whole number of bytes, so that the code streams of adjacent 8×8 blocks cannot be packed together contiguously without the shift.
4. The method according to claim 1, wherein step (3) is the filling step of the JPEG parallel entropy coding, in which the final image data is generated by parallel filling, and wherein, to avoid access conflicts and write errors caused by different threads accessing the same location during filling, the code streams of all 8×8 blocks are filled in 3 separate parallel passes to generate the final compressed image data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110039080.1A CN114765684B (en) | 2021-01-12 | 2021-01-12 | JPEG parallel entropy coding method based on GPU |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110039080.1A CN114765684B (en) | 2021-01-12 | 2021-01-12 | JPEG parallel entropy coding method based on GPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114765684A CN114765684A (en) | 2022-07-19 |
CN114765684B true CN114765684B (en) | 2023-05-09 |
Family
ID=82363484
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110039080.1A Active CN114765684B (en) | 2021-01-12 | 2021-01-12 | JPEG parallel entropy coding method based on GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114765684B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013109115A1 (en) * | 2012-01-20 | 2013-07-25 | 삼성전자 주식회사 | Method and apparatus for entropy-encoding capable of parallel processing, and method and apparatus for entropy-decoding capable of parallel processing |
CN103763561A (en) * | 2014-01-19 | 2014-04-30 | 林雁 | H264 video code parallel operation method |
CN107231558A (en) * | 2017-05-23 | 2017-10-03 | 江苏火米互动科技有限公司 | A kind of implementation method of the H.264 parallel encoder based on CUDA |
CN107770558A (en) * | 2017-09-29 | 2018-03-06 | 郑州云海信息技术有限公司 | Method, system, device and the readable storage medium storing program for executing of jpeg image decoding |
EP3624450A1 (en) * | 2018-09-17 | 2020-03-18 | InterDigital VC Holdings, Inc. | Wavefront parallel processing of luma and chroma components |
Non-Patent Citations (3)
Title |
---|
H. Rahmani et al. A parallel Huffman coder on the CUDA architecture. 2014 IEEE Visual Communications and Image Processing Conference. 2015, 311-314. * |
Yamamoto N et al. Huffman Coding with Gap Arrays for GPU Acceleration. 49th International Conference on Parallel Processing - ICPP. 2020, 1-11. * |
高克顺 (Gao Keshun). GPU-based parallel design of H.264-to-AVS video transcoding. China Master's Theses Full-text Database. 2012, (07), full text. * |
Also Published As
Publication number | Publication date |
---|---|
CN114765684A (en) | 2022-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105120293B (en) | Image collaboration coding/decoding method and device based on CPU and GPU | |
CN103581678B (en) | To improve the method and system of decoder capabilities by using multiple decoder channels | |
CN105933708B (en) | A kind of method and apparatus of data compression and decompression | |
US20110002396A1 (en) | Reference Frames Compression Method for A Video Coding System | |
EP0586074B1 (en) | Image processing apparatus and method suitable for multistage compression | |
CN105472389B (en) | Compression method is cached outside a kind of piece for ultra high-definition processing system for video | |
CN103841424B (en) | The system and method for compressed data in random access memory | |
US20140043347A1 (en) | Methods for jpeg2000 encoding and decoding based on gpu | |
US6643402B1 (en) | Image compression device allowing rapid and highly precise encoding while suppressing code amount of image data after compression | |
US10110896B2 (en) | Adaptive motion JPEG encoding method and system | |
CN111402380A (en) | GPU (graphics processing Unit) compressed texture processing method | |
WO2023082834A1 (en) | Video compression method and apparatus, and computer device and storage medium | |
US20110091123A1 (en) | Coding apparatus and coding method | |
US20040105497A1 (en) | Encoding device and method | |
CN112422985B (en) | Multi-core parallel hardware coding method and device suitable for JPEG | |
CN114765684B (en) | JPEG parallel entropy coding method based on GPU | |
CN109982091A (en) | A kind of processing method and processing device of image | |
CN104113759B (en) | Video system, video frame buffer recompression/decompression method and device | |
CN100370835C (en) | System and method for video data compression | |
CN105791819B (en) | The decompression method and device of a kind of frame compression method of image, image | |
CN109361926A (en) | H.264/AVC video visual quality lossless reciprocal information concealing method | |
CN104243983A (en) | Image compression circuit, image compression method, and transmission system | |
US11189006B2 (en) | Managing data for transportation | |
Lee et al. | CUDA-based JPEG2000 encoding scheme | |
CN111815502A (en) | FPGA (field programmable Gate array) acceleration method for multi-image processing based on WebP (Web Page) compression algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||