CN114827614B - Method for realizing LCEVC video coding optimization - Google Patents


Publication number
CN114827614B
CN114827614B (granted from application CN202210447137.6A)
Authority
CN
China
Prior art keywords
lcevc
data
gpu
transformation
module
Prior art date
Legal status
Active
Application number
CN202210447137.6A
Other languages
Chinese (zh)
Other versions
CN114827614A (en)
Inventor
Ding Yang (丁杨)
Luo Lei (罗雷)
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202210447137.6A
Publication of CN114827614A
Application granted
Publication of CN114827614B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/132: Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H04N19/119: Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N19/124: Quantisation
    • H04N19/42: Coding characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/60: Coding using transform coding
    • H04N19/91: Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Abstract

The invention relates to a method for realizing LCEVC video coding optimization, belonging to the field of multimedia video processing and transmission, and comprising the following steps. S1: for a given input video, obtain the optimal downsampling output using a block downsampling method based on Modified-cubic interpolation information. S2: embed the block downsampling method of step S1 into an LCEVC encoder, analyze the time consumption of each module in LCEVC, and design an LCEVC encoder framework based on a CPU-GPU heterogeneous platform. S3: following this framework, carry out a parallel optimization design of the upsampling module, the improved downsampling module, the transform and quantization module, the inverse transform and inverse quantization module, and the entropy coding module of the LCEVC standard, realizing real-time low-complexity enhanced video coding on a PC. The invention improves the quality of LCEVC-encoded video, shortens the encoding time, and improves the effective utilization of hardware resources.

Description

Method for realizing LCEVC video coding optimization
Technical Field
The invention belongs to the field of multimedia video processing and transmission, and relates to a method for realizing LCEVC video coding optimization.
Background
With the rapid development of video coding and decoding technology, high-definition and ultra-high-definition video (including 4K and 8K) have become popular, because they provide users with clearer image quality and a more realistic perceptual experience. However, the data volume of high-definition and ultra-high-definition video grows with resolution and bit depth. To improve compression efficiency, reduce the data volume, and meet the market demand for software-based extension of existing and future video codecs, Low Complexity Enhancement Video Coding (LCEVC) has been proposed.
Tests with LTM, the LCEVC reference test software, show that LCEVC improves the compression rate by about 40%. With AVC as the base encoder, the peak signal-to-noise ratio (PSNR), Video Multi-method Assessment Fusion (VMAF) score and mean opinion score (MOS) of the same video are all higher than for AVC at the same bit rate. The LCEVC encoding time with AVC as the base encoder is about 2.4 times shorter than the AVC encoding time, and with HEVC as the base encoder about 2.7 times shorter than the HEVC encoding time. LCEVC thus significantly enhances video images when used in conjunction with a base encoder. The downsampling module of LCEVC adopts the Lanczos interpolation algorithm; although Lanczos has a good interpolation effect, it does not achieve the downsampling output with optimal performance. There is therefore a need to improve and optimize the downsampling module of LCEVC to enhance video image quality.
Although LCEVC features low complexity and short encoding time, tests on the same high-definition video show that at a QP of 22, the LCEVC encoding time with AVC as the base encoder is 28.9 s, and with HEVC as the base encoder it is as high as 33.8 s. For applications with strong real-time requirements, such as streaming media and live sports broadcasts, it is necessary to accelerate the video coding process so that real-time encoding under the LCEVC standard is realized without affecting coding performance.
Disclosure of Invention
In view of the above, the present invention aims to provide a method for implementing LCEVC video coding optimization that combines an interpolation-based image downsampling optimization method with GPU-based parallel optimization of LCEVC, thereby realizing real-time low-complexity enhanced video coding on a PC.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a method of implementing LCEVC video coding optimization, comprising the steps of:
s1: for a given input video, obtain the optimal downsampling output using a block downsampling method based on Modified-cubic interpolation information, thereby improving the output video quality of the downsampling module in LCEVC and, in turn, the image quality of the LCEVC-encoded video;
s2: embed the block downsampling method based on Modified-cubic interpolation information from step S1 into an LCEVC encoder, measure the time consumption of each module in LCEVC and analyze it, and design an LCEVC encoder framework based on a CPU-GPU heterogeneous platform according to the results of the time-consumption analysis;
s3: according to the CPU-GPU heterogeneous LCEVC encoder framework of step S2, carry out a parallel optimization design of the upsampling module, the improved downsampling module, the transform and quantization module, the inverse transform and inverse quantization module, and the entropy coding module of the LCEVC standard, realizing real-time low-complexity enhanced video coding on a PC.
Further, in step S1, let the input video sequence be Y and the downsampled image be X; the optimal downsampled output X* is obtained by the block downsampling method based on Modified-cubic interpolation information, which specifically comprises the following steps:
s11: let Y' be the image obtained by interpolating X, H the interpolation coefficient matrix, and Φ the vector of products of interpolation coefficients with the corresponding boundary pixel values, where Y' is expressed as:
Y' = HX + Φ
The optimal downsampling objective function is expressed as:
X* = argmin_X ||Y - (HX + Φ)||^2
Differentiating with respect to X and simplifying yields the derivative -2H^T(Y - HX - Φ); setting this derivative equal to zero gives the closed-form optimal solution of X:
X* = (H^T H)^(-1) [H^T (Y - Φ)]
where (H^T H)^(-1) is the inverse of (H^T H), and (H^T H) is a full-rank matrix;
Since the size of the matrix H grows with the image size, H occupies a large memory space in the program, and the cost of computing (H^T H)^(-1) grows with H. To address the high complexity and the large memory footprint, the fact that Modified-cubic interpolation derives each new pixel from the 16 surrounding pixels is exploited, and Y is processed in 8×8 blocks. When Y is an 8×8 block, the interpolation matrix H has size 16×64, and the constant vector Φ is a column vector of length 64 whose values depend only on the elements corresponding to boundary pixels.
S12: all 4X 4 optimal downsampling blocks X * And merging together according to the block sequence to obtain the output result of the final downsampled image.
Further, the step S2 specifically includes the following steps:
s21: Embed the block downsampling method based on Modified-cubic interpolation information from step S1 into the LCEVC encoder;
s22: Select a group of video sequences, run the LCEVC encoder serially several times on a multi-core CPU, and average the results to obtain the mean time consumed per frame, thereby obtaining the time-consumption share of each module of the LCEVC encoder;
s23: Based on the per-module time-consumption shares from step S22, and on the fact that the GPU can perform large-scale parallel processing while the CPU is better suited to logic control, design the LCEVC encoder framework based on the CPU-GPU heterogeneous platform; the encoding framework takes the computing resources of both the GPU and the CPU into account. Communication between the CPU and the GPU is realized by copying data, and the parts the CPU is responsible for include:
1) Reading image data: the input video sequence and the Y, U, V component data generated by the base encoder are read into CPU memory, copied, and transferred into the video memory of the GPU;
2) Reading from the encoding configuration file information such as the image name, resolution, bit depth, up/down-sampling mode and quantization parameters;
3) Outputting the coded video of the base encoder, the reconstructed video sequence and the output bitstream, computing the PSNR of the reconstructed video, and scheduling GPU threads;
For the modules with low data correlation, such as upsampling, downsampling, transform, quantization, inverse transform, inverse quantization and entropy coding, a corresponding parallel optimization algorithm is designed according to the processing procedure of each module.
Further, the step S3 specifically includes:
s31: Parallelize the upsampling module. From the Modified-cubic interpolation process in LCEVC, the interpolation involves only addition, multiplication and shift calculations, and all three operations have corresponding hardware arithmetic units on the GPU. Therefore, the upsampling interpolation can be implemented in parallel on the GPU. The shared memory is on-chip GPU memory that provides fast data access and enables data communication among the threads within a thread block.
S32: dividing the downsampled input image into 8×8 blocks for parallel processing; in the GPU, the allocated block number is pic_height/8, and the thread number is pic_width/8, wherein pic_height and pic_width respectively represent the Height and Width of the downsampled input image; the parallelization processing of the downsampling is basically the same as the parallelization processing of the upsampling, except that the downsampling algorithm and the interpolation algorithm are executed differently.
S33: parallelizing the transformation and quantization module;
s34: The principle of inverse transform parallelization is identical to that of the transform, and the principle of inverse quantization is identical to that of quantization. Therefore, the parallel optimization of inverse transform and inverse quantization is the same as in S33;
s35: parallel processing is carried out on run-length coding and Huffman coding in the entropy coding module;
s36: the parallel optimization algorithm in S31 to S35 is realized on the GPU platform of the PC end, so that the acceleration of the LCEVC encoder is realized, and the real-time low-complexity enhanced video coding is realized.
Further, in step S31, the up-sampling parallelization implementation method is as follows:
s311: copying the image to be up-sampled from the CPU memory to the global memory of the GPU, and copying the interpolation coefficient matrix to the constant memory;
s312: Pad the upper and lower boundaries of the input image. Considering that each thread should process essentially the same amount of data after padding, divide the padded image into 64 blocks of height U_PicHeight/8 and width U_PicWidth/8, where U_PicHeight and U_PicWidth denote the height and width of the video image to be interpolated, respectively; then read the image into shared memory block by block;
s313: thread blocks and threads are distributed, the number of the distributed blocks is 8, and the number of threads is 8;
s314: Each thread reads the pixel values it needs from shared memory and the interpolation coefficients from constant memory to perform the interpolation operation; after synchronization with the cudaDeviceSynchronize() function, the interpolation result of each block is stored into global memory;
s315: and finally, copying the up-sampling result in the GPU global memory to the CPU memory for subsequent data processing.
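The flow S311-S315 can be emulated on the CPU to show how the tiling decomposes the work. This is a stand-in, not the patent's kernel: plain 2x pixel replication replaces the Modified-cubic kernel (whose coefficients are not listed here), the boundary padding of S312 is omitted for brevity, and the two loops play the roles of thread blocks and threads.

```python
import numpy as np

def upsample_tiled(img, tiles=8):
    """Split the image into tiles x tiles blocks (the 64 blocks of S312),
    'stage' each tile as a shared-memory load would, and upsample each tile
    independently; results are written back in block order."""
    h, w = img.shape
    th, tw = h // tiles, w // tiles
    out = np.zeros((2 * h, 2 * w), dtype=img.dtype)
    for bi in range(tiles):            # stand-in for a thread block index
        for tj in range(tiles):        # stand-in for a thread index
            tile = img[bi*th:(bi+1)*th, tj*tw:(tj+1)*tw]   # shared-memory load
            up = np.repeat(np.repeat(tile, 2, axis=0), 2, axis=1)  # 2x kernel
            out[2*bi*th:2*(bi+1)*th, 2*tj*tw:2*(tj+1)*tw] = up
    return out
```

Because each tile writes to a disjoint output region, the tiles can be processed in any order, which is what makes the GPU mapping in S313-S315 valid.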
Further, in step S33, LCEVC provides two transform modes, a 2×2 transform and a 4×4 transform; before parallel optimization, the transform modes in LCEVC are rewritten as butterfly transforms to reduce the number of calculations.
the 2 x 2 transform, if computed according to conventional algorithms, requires a total of 12 additions, 6 multiplications. Through matrix transformation of butterfly transformation, only 8 additions and 4 multiplications are needed, so that hardware resources are saved, and computational complexity is reduced. The input data of the transformation process in LCEVC is a prediction residual, and for 2×2 transformation or 4×4 transformation, the transformation coefficients are parsed into layers after transformation;
for 4×4 transform, firstly, the residual data is divided into 4×4 blocks, then each block performs butterfly transform according to a transform matrix, and finally, the data of the transform coefficients are resolved into layers; the data between layers are independent and have no correlation, so that parallel calculation is adopted to optimize the quantization module after transformation; since 4×4 transformation is followed by parsing the data into 16 layers, the number of blocks allocated in the GPU is 4 and the number of threads is 4; the specific steps implemented on the GPU for 4 x 4 transforms include:
1) Copying the residual data from the CPU to the global memory of the GPU, and dividing the original image into 4×4 blocks;
2) Allocating blocks and thread numbers, wherein the blocks are 4, and the thread numbers are 4;
3) Each thread reads the block data from global memory according to its assigned thread number, then performs the butterfly transform on each block and parses the resulting transform coefficients into layers; during data processing, the __syncthreads() function is used for synchronization. The input of the quantization process is the transform coefficients, denoted T_in[layer][y][x]; for the 4×4 transform there are 16 layers, and the data within each layer are uncorrelated, so there are no dependencies, the quantization of each transform coefficient can be completed independently, and quantization can therefore be optimized in a fully parallel manner.
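The butterfly saving described in S33 can be made concrete. The exact 2×2 transform matrix of LCEVC is defined in the MPEG-5 Part 2 specification and is not reproduced in the text, so the sketch below uses a Hadamard-style average/horizontal/vertical/diagonal decomposition as a stand-in; the point is only how the butterfly reuses the pairwise sums and differences, cutting 12 additions down to 8.

```python
# Stand-in 2x2 directional transform (average, horizontal, vertical,
# diagonal); the real LCEVC matrix is defined in MPEG-5 Part 2.
def transform_2x2_naive(a, b, c, d):
    # Direct matrix product: every output re-adds all four inputs,
    # 3 additions per output, 12 additions in total.
    A = a + b + c + d
    H = a - b + c - d
    V = a + b - c - d
    D = a - b - c + d
    return A, H, V, D

def transform_2x2_butterfly(a, b, c, d):
    # Stage 1: four shared partial sums/differences (4 additions).
    s0, s1 = a + b, a - b
    s2, s3 = c + d, c - d
    # Stage 2: combine the partials (4 more additions), 8 in total.
    return s0 + s2, s1 + s3, s0 - s2, s1 - s3
```

The same pairwise-sharing idea extends to the 4×4 case, where the savings are proportionally larger.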
Further, in step S35, all the data output by quantization are distinguished by layer, but the amount of data per layer differs between the luminance and chrominance components. To reduce the program latency caused by threads having inconsistent data lengths, the data should be balanced as evenly as possible to improve the efficiency of parallel encoding. Taking the U1 and V1 data of the L-1 layer as the baseline, the larger data blocks within a layer are divided into chunks of equal size W/16 × H/16; the number of allocated blocks is 16 and the number of threads is 30, with one thread processing one W/16 × H/16 data chunk. The GPU-based parallel optimization of entropy coding is implemented as follows:
1) Copy the quantized output data from CPU memory to the global memory of the GPU;
2) Read the entropy-coding input data into shared memory in data blocks of size W/16 × H/16;
3) Each thread reads the data it needs from shared memory and performs run-length coding and Huffman coding; the host waits for all threads to finish via the cudaDeviceSynchronize() function, and the coded data is written into global memory in order;
The entropy coding result obtained by GPU parallel computation is then copied back to CPU memory, and the entropy-coded output data is used directly when writing the binary bitstream file.
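The independence argument behind the entropy stage can be illustrated with the run-length pass alone. This is a hedged sketch: the Huffman step and the real W/16 × H/16 chunk geometry are omitted, and the point is only that equally sized chunks can be coded independently and their outputs concatenated in block order.

```python
def run_length(chunk):
    """Run-length code one chunk into (value, run) pairs."""
    out = []
    for v in chunk:
        if out and out[-1][0] == v:
            out[-1][1] += 1          # extend the current run
        else:
            out.append([v, 1])       # start a new run
    return [tuple(p) for p in out]

def parallel_rle(data, n_chunks):
    """Split data into equal chunks (one per 'thread') and code each
    independently; results keep block order, as in the text."""
    size = len(data) // n_chunks
    chunks = [data[i * size:(i + 1) * size] for i in range(n_chunks)]
    return [run_length(c) for c in chunks]
```

Because no run crosses a chunk boundary in this scheme, each thread's output depends only on its own chunk, which is what permits the fully parallel mapping.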
The invention has the following beneficial effects. The invention computes the optimal downsampled video image through an improved downsampling algorithm, further improving the overall video coding quality of LCEVC, and solves the improved downsampling algorithm in a block-wise manner to reduce computational complexity and memory usage. The method realizes a real-time LCEVC video coding system covering the parallel optimization of the upsampling, improved downsampling, transform and quantization, inverse transform and inverse quantization, and entropy coding modules, which reduces the encoding time, raises the effective utilization of hardware resources such as the CPU and GPU, and improves encoding efficiency. The invention thus realizes real-time low-complexity enhanced video coding on the PC.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a system flow of the present invention;
FIG. 2 is a flow chart of parallel optimization of upsampling in step S3 of the method according to the present invention;
FIG. 3 is a flow chart of parallel optimization of transformation and quantization in step S3 of the method of the present invention;
FIG. 4 is a schematic diagram illustrating a butterfly algorithm in step S3 of the method of the present invention;
fig. 5 is a flow chart of parallel optimization of entropy coding in step S3 of the method according to the invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of this specification may be modified or varied on the basis of different viewpoints and applications without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention in a schematic way, and the following embodiments and the features in the embodiments may be combined with one another as long as no conflict arises.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of the embodiments of the invention correspond to the same or similar components. In the description of the invention, it should be understood that terms such as "upper", "lower", "left", "right", "front" and "rear", which indicate an orientation or positional relationship based on that shown in the drawings, are used only for convenience of describing the invention and simplifying the description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore merely illustrative and should not be construed as limiting the invention; their specific meaning can be understood by those of ordinary skill in the art according to the specific circumstances.
As shown in fig. 1, the present invention provides a method for realizing optimization of the LCEVC video coding standard, comprising the following steps:
s1: For a given input video, a block downsampling method based on Modified-cubic interpolation information is adopted to improve the output video quality of the downsampling module in LCEVC, and in turn the image quality of the LCEVC-encoded video. The specific steps are as follows:
s11: Let the input video sequence be Y and the downsampled image be X; the optimal downsampled output X* is obtained by the block downsampling method based on Modified-cubic interpolation information. The calculation process is as follows:
Let Y' be the image obtained by interpolating X, H the interpolation coefficient matrix, and Φ the vector of products of interpolation coefficients with the corresponding boundary pixel values. Y' can be represented as:
Y' = HX + Φ
The optimal downsampling objective function can be expressed as:
X* = argmin_X ||Y - (HX + Φ)||^2
Differentiating with respect to X and simplifying yields the derivative -2H^T(Y - HX - Φ); setting it equal to zero gives the closed-form optimal solution:
X* = (H^T H)^(-1) [H^T (Y - Φ)]
where (H^T H)^(-1) is the inverse of (H^T H), and (H^T H) is a full-rank matrix.
Since the size of the matrix H grows with the image size, H occupies a large memory space in the program, and the cost of computing (H^T H)^(-1) grows with H. To address the high complexity and the large memory footprint, the fact that Modified-cubic interpolation derives each new pixel from the 16 surrounding pixels is exploited, and Y is processed in 8×8 blocks. When Y is an 8×8 block, the interpolation matrix H has size 16×64, the constant vector Φ is a column vector of length 64, and the values of Φ depend only on the elements corresponding to boundary pixels.
S12: Finally, all 4×4 optimal downsampled blocks X* are merged together in block order to obtain the final downsampled output result.
S2: adding the block downsampling method based on the Modifiedcubic interpolation information in the S1 into the LCEVC encoder, respectively counting time consumption of each module in the LCEVC and carrying out time consumption analysis, and designing a LCEVC encoder frame based on the CPU-GPU heterogeneous platform through the time consumption analysis result. The method comprises the following specific steps:
s21: the block downsampling method based on Modifiedcubic interpolation information is embedded into the LCEVC encoder.
S22: and selecting a group of video sequences, testing the LCEVC encoder on a multi-core CPU, and averaging through 10 times of serial operation tests to obtain the average consumed time of each frame, thereby obtaining the time consumption ratio of each module of the LCEVC encoder.
S23: through the time consumption ratio of each module of the LCEVC encoder in S22, large-scale parallel processing can be performed based on the GPU, and the CPU is more suitable for the logic control characteristic, the LCEVC encoder framework based on the CPU-GPU heterogeneous platform is designed, and the encoding framework considers the GPU and the computing resource of the CPU. The communication between the CPU and the GPU is mainly realized by copying data, and the part of the CPU mainly responsible for processing comprises:
1) And the image data are read, Y, U, V component data generated by the input video sequence and the basic encoder are read into a CPU memory, and the CPU copies the data and transmits the data to a video memory of the GPU.
2) Is responsible for reading information such as image names, resolution sizes, bit depths, up-down sampling modes, quantization parameters and the like of the coding configuration files.
3) And the method is responsible for outputting the coded video of the basic encoder, outputting the reconstructed video sequence and outputting the code stream, and is responsible for calculating the PSNR of the reconstructed video and scheduling GPU threads.
For the modules with low data correlation, such as up-sampling, down-sampling, transformation, quantization, inverse transformation, inverse quantization and entropy coding, a corresponding parallel optimization algorithm is designed according to different algorithm processing procedures of each module.
S3: and (2) carrying out parallel optimization design on an up-sampling module, an improved down-sampling module, a transformation and quantization module and an entropy coding module in the LCEVC standard according to an LCEVC encoder framework based on a CPU-GPU heterogeneous platform in S2, so as to realize real-time low-complexity enhanced video coding. The method can improve the coding video quality of LCEVC, shorten the coding time and improve the effective utilization rate of hardware resources such as CPU, GPU and the like. The method comprises the following specific steps:
s31: as shown in fig. 2, the upsampling module is parallelized. From the Modifiedcubic interpolation process, it is known that the interpolation process only includes addition, multiplication and shift calculations, and that these three operators already have corresponding hardware adders on the GPU. The GPU may be used to implement the upsampling interpolation in parallel. The shared memory is an on-chip memory of the GPU, can quickly access data, and can also realize data communication among thread blocks. Based on the method, the up-sampling parallelization concrete implementation method is as follows:
1) Copying the image to be up-sampled from the CPU memory to the global memory of the GPU, and copying the interpolation coefficient matrix to the constant memory.
2) Pad the upper and lower boundaries of the input image. Considering that each thread should process essentially the same amount of data after padding, the padded image is divided into 64 blocks of height U_PicHeight/8 and width U_PicWidth/8, where U_PicHeight and U_PicWidth denote the height and width of the video image to be interpolated, respectively; the image is then read into shared memory block by block.
3) Thread blocks and threads are allocated. The number of assigned thread blocks is 8, and the number of threads is 8.
4) Each thread reads the pixel values it needs from shared memory and the interpolation coefficients from constant memory, performs the interpolation, and, after synchronization with the cudaDeviceSynchronize() function, stores each block's interpolation result in global memory. Finally, the up-sampling result is copied from GPU global memory back to CPU memory for subsequent processing.
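As an illustration of why this interpolation maps well onto the GPU, here is a CPU sketch of the per-sample computation using only integer multiplies, adds, and shifts. The 4-tap coefficients and 5-bit shift are illustrative placeholders, not the normative Modified cubic coefficients, and `upsample_rows_fixed_point` is a hypothetical helper name:

```python
import numpy as np

def upsample_rows_fixed_point(img, coeffs=(-3, 19, 19, -3), shift=5):
    """CPU sketch of the per-thread kernel: each new sample between two input
    rows is a 4-tap weighted sum built from integer multiplies, adds, and one
    right shift (the three operations the text notes have direct hardware
    support on the GPU). Coefficients are illustrative, not normative."""
    h, w = img.shape
    # fill the upper and lower boundaries, as in step 2 above
    padded = np.pad(img, ((2, 2), (0, 0)), mode="edge").astype(np.int64)
    out = np.empty((2 * h, w), dtype=img.dtype)
    out[0::2] = img  # even output rows copy the source rows
    for y in range(h):  # each GPU thread would own one block of rows
        taps = padded[y + 1:y + 5]  # 4 neighbouring input rows
        acc = sum(int(c) * taps[i] for i, c in enumerate(coeffs))
        # rounding add, then arithmetic right shift; clip to pixel range
        out[2 * y + 1] = np.clip(
            (acc + (1 << (shift - 1))) >> shift, 0, 255).astype(img.dtype)
    return out
```

The coefficients sum to 1 << shift, so flat regions are preserved exactly; on the GPU each iteration of the loop would be an independent thread reading its taps from shared memory.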
S32: the downsampled input image is divided into 8 x 8 blocks for parallel processing. In GPU, the allocated block number is pic_height/8, and the thread number is pic_width/8. Where pic_height and pic_width represent the Height and Width, respectively, of the downsampled input image. The parallelization processing of the downsampling is basically the same as the parallelization processing of the upsampling, except that the downsampling algorithm and the interpolation algorithm are executed differently.
S33: as shown in fig. 3, the transformation and quantization module is parallelized. During the transformation, LCEVC provides two transformation modes, 2×2 transformation or 4×4 transformation. Before parallel optimization, butterfly transformation is adopted for transformation modes in LCEVC, so that the calculation times are reduced. The 2 x 2 transformation formula can be rewritten as follows:
the 2 x 2 transform, if computed according to conventional algorithms, requires a total of 12 additions, 6 multiplications. As shown in fig. 4, only 8 additions and 4 multiplications are needed through matrix transformation of butterfly transformation, so that not only hardware resources are saved, but also computational complexity is reduced. The input data of the transform process in LCEVC is a prediction residual, and for 2×2 transform or 4×4 transform, transform coefficients are parsed into layers after transform. For 4×4 transform, the residual data is first divided into 4×4 blocks, each block is then butterfly transformed according to a transform matrix, and finally the transform coefficient data is parsed into layers. The data between layers are independent and have no correlation, so that parallel calculation can be adopted to optimize the quantization module after transformation. Since the 4×4 transform is followed by parsing the data into 16 layers, the number of blocks allocated in the GPU is 4 and the number of threads is 4. The specific steps implemented on the GPU for 4 x 4 transforms include:
1) Copy the residual data from the CPU into the global memory of the GPU, then divide the residual data into 4×4 blocks.
2) The block and thread numbers are allocated, the block number is 4, and the thread number is 4.
3) Each thread reads its block's data from global memory according to its thread number, butterfly-transforms the block, and groups the resulting transform coefficients into layers. While processing the data, synchronization with the __syncthreads() function is required. The input to the quantization stage is the transform coefficients T_in[layer][y][x]; for the 4×4 transform, layer runs over 16 layers. The data in each layer is uncorrelated with the other layers, so there are no dependencies: the quantization of each transform coefficient can be completed independently, and quantization can therefore be optimized in a fully parallel manner.
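The two stages above — butterfly transform, then per-layer quantization — can be sketched on the CPU. This is an illustrative stand-in, not the normative LCEVC transform matrices or quantizer (which uses dead zones and per-layer step widths); the normalization multiplications counted in the text are omitted from the butterfly here:

```python
import numpy as np

def dd2x2_direct(a, b, c, d):
    # Direct 2x2 directional transform: 3 additions per coefficient, 12 total
    return (a + b + c + d,   # average
            a - b + c - d,   # horizontal
            a + b - c - d,   # vertical
            a - b - c + d)   # diagonal

def dd2x2_butterfly(a, b, c, d):
    # Butterfly form: each stage-1 partial sum/difference is shared by two
    # outputs, cutting the additions from 12 to 8
    s0, s1 = a + b, c + d    # stage 1: sums
    d0, d1 = a - b, c - d    # stage 1: differences
    return (s0 + s1, d0 + d1, s0 - s1, d0 - d1)  # stage 2

def quantize_layers(t_in, step_width):
    # T_in[layer][y][x]: every entry is quantized independently, so each GPU
    # thread can own one layer (16 layers for the 4x4 transform) with no
    # synchronization; plain uniform quantizer as an illustrative stand-in
    t = np.asarray(t_in, dtype=np.int64)
    return np.sign(t) * (np.abs(t) // step_width)
```

The equality of the direct and butterfly forms is what justifies the substitution; the element-wise independence of `quantize_layers` is what makes the fully parallel thread mapping legal.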
S34: the principle of inverse transformation and parallel transformation is identical, and the principle of inverse quantization and quantization is also identical. Therefore, the parallel optimization method for inverse transformation and inverse quantization is the same as S33.
S35: as shown in fig. 5, run-length encoding and huffman encoding in the entropy encoding module are processed in parallel. The quantized output data are all distinguished by layers, but the size of the data in layers varies depending on the luminance component and the chrominance component. To reduce program latency due to some thread data length inconsistencies, the data should be averaged as much as possible to improve the efficiency of parallel encoding. Based on U1 and V1 of the L-1 layer, the larger data block in the layer is divided into data with the same size as W/16 XH/16. One can consider a 16 number of blocks allocated and a 30 number of threads allocated, and use one thread to process a W/16 XH/16 sized data block. The specific implementation method of the entropy coding parallel optimization based on the GPU comprises the following steps:
1) Copy the quantized output data from CPU memory to the global memory of the GPU.
2) Read the entropy coding input data into shared memory in data blocks of size W/16 × H/16.
3) Each thread reads the data it needs from shared memory and performs run-length coding and Huffman coding; the cudaDeviceSynchronize() function waits for all threads to finish, and the coded data is transferred into global memory in order.
The entropy coding result obtained by the parallel GPU computation is then copied to CPU memory, and the entropy-coded output data is used directly when writing the binary bitstream file.
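As a CPU stand-in for the per-thread kernel, the run-length stage of the scheme above can be sketched as follows; the equal-size block split plays the role of the W/16 × H/16 partition, and Huffman coding of the resulting (value, run) pairs would follow and is omitted here:

```python
def run_length_encode(block):
    """Run-length encode one thread's data block into (value, run) pairs."""
    runs = []
    prev, count = block[0], 1
    for v in block[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

def parallel_rle(data, n_blocks):
    """CPU stand-in for the kernel launch: split the input into equal-size
    blocks (the load balancing described above), encode each block
    independently, then gather the results in block order, which is the
    role the cudaDeviceSynchronize() barrier plays on the GPU."""
    size = len(data) // n_blocks
    blocks = [data[i * size:(i + 1) * size] for i in range(n_blocks)]
    return [run_length_encode(b) for b in blocks]
```

Since each block is encoded with no knowledge of its neighbours, the per-block outputs can be produced concurrently and concatenated afterwards.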
S36: the parallel optimization algorithm in S31 to S34 is realized on the GPU platform of the PC end, so that the acceleration of the LCEVC encoder is realized, and the real-time low-complexity enhanced video coding is realized.
Finally, it is noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications and equivalents may be made without departing from the spirit and scope of the present invention, which is intended to be covered by the claims.

Claims (2)

1. A method for implementing LCEVC video coding optimization, characterized by: the method comprises the following steps:
s1: for a given input video, acquiring optimal downsampling output by adopting a block downsampling method based on Modifiedcubic interpolation information;
s2: embedding the block downsampling method based on the Modifiedcubic interpolation information in the step S1 into an LCEVC encoder, respectively counting time consumption of each module in the LCEVC and carrying out time consumption analysis, and designing a LCEVC encoder frame based on a CPU-GPU heterogeneous platform according to the time consumption analysis result; the step S2 specifically includes the following steps:
s21: embedding the block downsampling method based on the Modifiedcubic interpolation information in the step S1 into an LCEVC encoder;
s22: selecting a group of video sequences, performing multiple serial operation tests on the LCEVC encoder on a multi-core CPU, and averaging to obtain average consumed time of each frame, thereby obtaining time-consuming duty ratio of each module of the LCEVC encoder;
s23: through the time consumption ratio of each module of the LCEVC encoder in the step S22, a LCEVC encoder framework based on a CPU-GPU heterogeneous platform is designed, communication between the CPU and the GPU is realized in a data copying mode, and a part of the CPU responsible for processing comprises:
1) The system is responsible for reading image data, reading an input video sequence and Y, U, V component data generated by a basic encoder into a CPU (central processing unit) memory, copying the data and transmitting the copied data into a video memory of a GPU (graphics processing unit);
2) Is responsible for reading the code configuration file;
3) The method is responsible for outputting coded video of a basic encoder, outputting a reconstructed video sequence and an output code stream, calculating reconstructed video PSNR and scheduling GPU threads;
for the module with lower data correlation, designing corresponding parallel optimization algorithm according to different algorithm processing processes of each module;
s3: performing parallel optimization design on an up-sampling module, an improved down-sampling module, a conversion module, an inverse quantization module and an entropy coding module in the LCEVC standard according to an LCEVC encoder frame based on a CPU-GPU heterogeneous platform in S2, and realizing real-time low-complexity enhanced video coding at a PC end; the step S3 specifically includes:
s31: parallelizing the up-sampling module; the GPU is adopted to carry out parallel implementation on the up-sampling interpolation; the up-sampling parallelization concrete implementation method comprises the following steps:
s311: copying the image to be up-sampled from the CPU memory to the global memory of the GPU, and copying the interpolation coefficient matrix to the constant memory;
s312: filling the upper and lower boundaries of an input image, equally dividing the filled image into 64 blocks, wherein the height of each block is U_PicHeight/8, and the width of each block is U_PicWidth/8, wherein U_PicHeight and U_PicWidth respectively represent the width and the length of a video image to be interpolated, and then respectively reading the image into a shared memory according to the blocks;
s313: thread blocks and threads are distributed, the number of the distributed blocks is 8, and the number of threads is 8;
s314: each thread reads the pixel value needed in the shared memory, reads the interpolation coefficient from the constant memory to perform interpolation operation, and stores the interpolation result of each block into the global memory after the cudaDeviceSynchronize () function is synchronized;
s315: finally, copying the up-sampling result in the GPU global memory to a CPU memory for subsequent data processing;
s32: dividing the downsampled input image into 8×8 blocks for parallel processing; in the GPU, the allocated block number is pic_height/8, and the thread number is pic_width/8, wherein pic_height and pic_width respectively represent the Height and Width of the downsampled input image;
s33: parallelizing the transformation and quantization module; LCEVC provides two transformation modes, 2 x 2 transformation or 4 x 4 transformation; before parallel optimization, butterfly transformation is adopted for transformation modes in LCEVC; the 2×2 transformation formula is rewritten as follows:
the input data of the transformation process in LCEVC is a prediction residual, and for 2×2 transformation or 4×4 transformation, the transformation coefficients are parsed into layers after transformation;
for 4×4 transform, firstly, the residual data is divided into 4×4 blocks, then each block performs butterfly transform according to a transform matrix, and finally, the data of the transform coefficients are resolved into layers; optimizing the quantization module by adopting parallel calculation after transformation; the number of blocks allocated in the GPU is 4, and the number of threads is 4; the specific steps implemented on the GPU for 4 x 4 transforms include:
1) Copying residual data from the CPU to the global memory of the GPU, and dividing the original image into 4 multiplied by 4 blocks;
2) Allocating blocks and thread numbers, wherein the blocks are 4, and the thread numbers are 4;
3) Each thread reads the data information of the blocks in the global memory according to the assigned thread numbers, then performs butterfly transformation on each block, and groups the obtained transformation coefficients into layers; in the process of processing data, synchronization is performed using the __syncthreads() function; the input of the quantization process is the transform coefficients, the input data being T_in[layer][y][x]; for the 4×4 transform, layer is 16, and quantization optimization is performed in a completely parallel manner;
s34: the parallel optimization mode for inverse transformation and inverse quantization is the same as that of S33;
S35: parallel processing is carried out on run-length coding and Huffman coding in the entropy coding module; in step S35, based on U1 and V1 of the L-1 layer, the larger data blocks in the layer are divided into blocks of equal size W/16 × H/16, the number of allocated blocks is 16 and the number of allocated threads is 30, one thread being used to process one data block of size W/16 × H/16; the specific implementation of GPU-based parallel optimization of entropy coding is as follows:
1) Copying output data to be quantized from a CPU memory to a global memory of the GPU;
2) The entropy coding input data is read into the shared memory in data blocks of size W/16 × H/16;
3) Each thread reads the needed data from the shared memory to carry out run-length coding and Huffman coding, the cudaDeviceSynchronize() function is used to wait for all threads to finish processing, and the coded data is sequentially transferred into the global memory;
obtaining an entropy coding result through GPU parallel calculation, copying entropy coding data to a memory of a CPU, and directly using entropy coding output data when writing a binary code stream file;
s36: the parallel optimization algorithm in S31 to S35 is realized on the GPU platform of the PC end, so that the acceleration of the LCEVC encoder is realized, and the real-time low-complexity enhanced video coding is realized.
2. The method for implementing LCEVC video coding optimization of claim 1, wherein: in the step S1, the input video sequence is set as Y, the downsampled image is set as X, and the optimal downsampled output X* is obtained by the block downsampling method based on Modified cubic interpolation information, which specifically comprises the following steps:
S11: let Y′ be the image interpolated from X, H be the interpolation coefficient matrix, and Φ be the matrix formed by the products of the interpolation coefficients and the corresponding pixel values; the closed-form solution of the optimal X is:

X* = (HᵀH)⁻¹[Hᵀ(Y − Φ)]

wherein (HᵀH)⁻¹ is the inverse matrix of (HᵀH), and (HᵀH) is a full-rank matrix;
Y is processed in 8×8 blocks;
S12: all the 4×4 optimal downsampling blocks X* are merged together in block order to obtain the output result of the final downsampled image.
CN202210447137.6A 2022-04-18 2022-04-18 Method for realizing LCEVC video coding optimization Active CN114827614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210447137.6A CN114827614B (en) 2022-04-18 2022-04-18 Method for realizing LCEVC video coding optimization


Publications (2)

Publication Number Publication Date
CN114827614A CN114827614A (en) 2022-07-29
CN114827614B true CN114827614B (en) 2024-03-22

Family

ID=82507961


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102572436A (en) * 2012-01-17 2012-07-11 西安电子科技大学 Intra-frame compression method based on CUDA (Compute Unified Device Architecture)
CN104574277A (en) * 2015-01-30 2015-04-29 京东方科技集团股份有限公司 Image interpolation method and image interpolation device
CN104869398A (en) * 2015-05-21 2015-08-26 大连理工大学 Parallel method of realizing CABAC in HEVC based on CPU+GPU heterogeneous platform
CN107135392A (en) * 2017-04-21 2017-09-05 西安电子科技大学 HEVC motion search parallel methods based on asynchronous mode
CN109391816A (en) * 2018-10-26 2019-02-26 大连理工大学 The method for parallel processing of HEVC medium entropy coding link is realized based on CPU+GPU heterogeneous platform
CN109495743A (en) * 2018-11-15 2019-03-19 上海电力学院 A kind of parallelization method for video coding based on isomery many places platform
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Jin-tai Shangguan; Yan-ling Li; Yong-gang Wang; Hui-ling Li. Fast algorithm of modified cubic convolution interpolation. 2011 4th International Congress on Image and Signal Processing, 2011, 1-4. *
Andreas Heindel; Eugen Wige; André Kaup. Low-Complexity Enhancement Layer Compression for Scalable Lossless Video Coding Based on HEVC. IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 8, 20 Apr. 2016, 1749-1760. *
Newsha Ardalani; Clint Lestourgeon; Karthikeyan Sankaralingam; Xiaojin Zhu. Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance. 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1-13. *
Zhao Qing. Research on key technologies of the H.265/HEVC encoder based on a CPU+GPU heterogeneous platform. China Dissertations Full-text Database, Oct. 2019. *
Yuan Sannan; Wang Mengbin; Zhang Yanqiu; Tao Qianyun. Research on parallelization complexity and rate-distortion of video coding based on heterogeneous multi-processing platforms. Journal of Shanghai University of Electric Power, vol. 37, no. 3, 7 May 2021, 271-276. *
Ding Yang. Research on real-time low-complexity enhanced video coding. China Master's Theses Full-text Database, Information Science and Technology, 15 Jun. 2023. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant