CN114827614B - Method for realizing LCEVC video coding optimization - Google Patents


Publication number
CN114827614B
CN114827614B (granted from application CN202210447137.6A)
Authority
CN
China
Prior art keywords
lcevc
data
gpu
transformation
module
Prior art date
Legal status
Active
Application number
CN202210447137.6A
Other languages
Chinese (zh)
Other versions
CN114827614A (en)
Inventor
Ding Yang (丁杨)
Luo Lei (罗雷)
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202210447137.6A
Publication of CN114827614A
Application granted
Publication of CN114827614B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/132: Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H04N19/119: Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N19/124: Quantisation
    • H04N19/42: Coding characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/60: Coding using transform coding
    • H04N19/91: Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Abstract

The invention relates to a method for realizing LCEVC video coding optimization, belonging to the field of multimedia video processing and transmission, and comprising the following steps. S1: for a given input video, obtain the optimal downsampling output using a block downsampling method based on Modified-cubic interpolation information. S2: embed the block downsampling method of step S1 into an LCEVC encoder, analyze the time consumption of each module in LCEVC, and design an LCEVC encoder framework based on a CPU-GPU heterogeneous platform. S3: following this framework, carry out a parallel optimization design of the upsampling module, the improved downsampling module, the transform and quantization module, the inverse transform and inverse quantization module, and the entropy coding module of the LCEVC standard, realizing real-time low-complexity enhanced video coding on a PC. The invention improves the quality of LCEVC-encoded video, shortens the encoding time, and improves the effective utilization of hardware resources.

Description

Method for realizing LCEVC video coding optimization
Technical Field
The invention belongs to the field of multimedia video processing and transmission, and relates to a method for realizing LCEVC video coding optimization.
Background
With the rapid development of video coding and decoding technology, high-definition and ultra-high-definition video (including 4K and 8K) have become popular, because they provide users with clearer image quality and a more realistic perceptual experience. However, the data volume of high-definition and ultra-high-definition video grows with resolution and bit depth. To improve compression efficiency, reduce the data volume, and meet the market demand for software-based extension of existing and future video codecs, Low Complexity Enhancement Video Coding (LCEVC) has been proposed.
Tests with LTM, the LCEVC reference test software, show that LCEVC improves the compression rate by about 40%. With AVC as the base encoder, the peak signal-to-noise ratio (PSNR), Video Multi-method Assessment Fusion (VMAF) score and mean opinion score (MOS) of the same video are all higher than for AVC at the same bit rate. The LCEVC encoding time with AVC as the base encoder is about 2.4 times shorter than the AVC encoding time, and with HEVC as the base encoder about 2.7 times shorter than the HEVC encoding time. LCEVC thus significantly enhances video images when used in conjunction with a base encoder. The downsampling module of LCEVC adopts the Lanczos interpolation algorithm; although Lanczos has a good interpolation effect, it does not achieve the downsampling output with optimal performance. There is therefore a need to improve and optimize the downsampling module of LCEVC to enhance video image quality.
Although LCEVC features low complexity and short encoding time, tests on the same high-definition video show that at a QP of 22, the LCEVC encoding time with AVC as the base encoder is 28.9 s, and with HEVC as the base encoder it is as high as 33.8 s. For applications with strong real-time requirements, such as streaming media and live sports broadcasts, it is necessary to accelerate the video coding process so that real-time encoding under the LCEVC standard is realized without affecting coding performance.
Disclosure of Invention
In view of the above, the present invention aims to provide a method for implementing LCEVC video coding optimization that combines an interpolation-based image downsampling optimization method with GPU-based parallel optimization of LCEVC, thereby realizing real-time low-complexity enhanced video coding on a PC.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a method of implementing LCEVC video coding optimization, comprising the steps of:
s1: for a given input video, obtain the optimal downsampling output using a block downsampling method based on Modified-cubic interpolation information, thereby improving the output video quality of the downsampling module in LCEVC and, in turn, the image quality of the LCEVC-encoded video;
s2: embed the block downsampling method based on Modified-cubic interpolation information from step S1 into an LCEVC encoder, measure the time consumption of each module in LCEVC and analyze it, and design an LCEVC encoder framework based on a CPU-GPU heterogeneous platform according to the results of the time-consumption analysis;
s3: according to the CPU-GPU heterogeneous LCEVC encoder framework of step S2, carry out a parallel optimization design of the upsampling module, the improved downsampling module, the transform and quantization module, the inverse transform and inverse quantization module, and the entropy coding module of the LCEVC standard, realizing real-time low-complexity enhanced video coding on a PC.
Further, in step S1, let the input video sequence be Y and the downsampled image be X; the optimal downsampled output X* is obtained by the block downsampling method based on Modified-cubic interpolation information, which specifically comprises the following steps:
s11: let Y' be the image obtained by interpolating X, H the interpolation coefficient matrix, and Φ the vector of products of interpolation coefficients with the corresponding boundary pixel values, where Y' is expressed as:
Y' = HX + Φ
The optimal downsampling objective function is expressed as:
X* = argmin_X ||Y - (HX + Φ)||^2
Differentiating with respect to X and simplifying yields the derivative -2H^T(Y - HX - Φ); setting this derivative equal to zero gives the closed-form optimal solution of X:
X* = (H^T H)^(-1) [H^T (Y - Φ)]
where (H^T H)^(-1) is the inverse of (H^T H), and (H^T H) is a full-rank matrix;
Since the size of the matrix H grows with the image size, H occupies a large memory space in the program, and the cost of computing (H^T H)^(-1) grows with H. To address the high complexity and the large memory footprint, the fact that Modified-cubic interpolation derives each new pixel from the 16 surrounding pixels is exploited, and Y is processed in 8×8 blocks. When Y is an 8×8 block, the interpolation matrix H has size 16×64, and the constant vector Φ is a column vector of length 64 whose values depend only on the elements corresponding to boundary pixels.
S12: all 4X 4 optimal downsampling blocks X * And merging together according to the block sequence to obtain the output result of the final downsampled image.
Further, the step S2 specifically includes the following steps:
s21: Embed the block downsampling method based on Modified-cubic interpolation information from step S1 into the LCEVC encoder;
s22: Select a group of video sequences, run the LCEVC encoder serially several times on a multi-core CPU, and average the results to obtain the mean time consumed per frame, thereby obtaining the time-consumption share of each module of the LCEVC encoder;
s23: Based on the per-module time-consumption shares from step S22, and on the fact that the GPU can perform large-scale parallel processing while the CPU is better suited to logic control, design the LCEVC encoder framework based on the CPU-GPU heterogeneous platform; the encoding framework takes the computing resources of both the GPU and the CPU into account. Communication between the CPU and the GPU is realized by copying data, and the parts the CPU is responsible for include:
1) Reading image data: the input video sequence and the Y, U, V component data generated by the base encoder are read into CPU memory, copied, and transferred into the video memory of the GPU;
2) Reading from the encoding configuration file information such as the image name, resolution, bit depth, up/down-sampling mode and quantization parameters;
3) Outputting the coded video of the base encoder, the reconstructed video sequence and the output bitstream, computing the PSNR of the reconstructed video, and scheduling GPU threads;
For the modules with low data correlation, such as upsampling, downsampling, transform, quantization, inverse transform, inverse quantization and entropy coding, a corresponding parallel optimization algorithm is designed according to the processing procedure of each module.
Further, the step S3 specifically includes:
s31: Parallelize the upsampling module. From the Modified-cubic interpolation process in LCEVC, the interpolation involves only addition, multiplication and shift calculations, and all three operations have corresponding hardware arithmetic units on the GPU. Therefore, the upsampling interpolation can be implemented in parallel on the GPU. The shared memory is on-chip GPU memory that provides fast data access and enables data communication among the threads within a thread block.
S32: dividing the downsampled input image into 8×8 blocks for parallel processing; in the GPU, the allocated block number is pic_height/8, and the thread number is pic_width/8, wherein pic_height and pic_width respectively represent the Height and Width of the downsampled input image; the parallelization processing of the downsampling is basically the same as the parallelization processing of the upsampling, except that the downsampling algorithm and the interpolation algorithm are executed differently.
S33: parallelizing the transformation and quantization module;
s34: The principle of inverse transform parallelization is identical to that of the transform, and the principle of inverse quantization is identical to that of quantization. Therefore, the parallel optimization of inverse transform and inverse quantization is the same as in S33;
s35: parallel processing is carried out on run-length coding and Huffman coding in the entropy coding module;
s36: the parallel optimization algorithm in S31 to S35 is realized on the GPU platform of the PC end, so that the acceleration of the LCEVC encoder is realized, and the real-time low-complexity enhanced video coding is realized.
Further, in step S31, the up-sampling parallelization implementation method is as follows:
s311: copying the image to be up-sampled from the CPU memory to the global memory of the GPU, and copying the interpolation coefficient matrix to the constant memory;
s312: Pad the upper and lower boundaries of the input image. Considering that each thread should process essentially the same amount of data after padding, divide the padded image into 64 blocks of height U_PicHeight/8 and width U_PicWidth/8, where U_PicHeight and U_PicWidth denote the height and width of the video image to be interpolated, respectively; then read the image into shared memory block by block;
s313: thread blocks and threads are distributed, the number of the distributed blocks is 8, and the number of threads is 8;
s314: Each thread reads the pixel values it needs from shared memory and the interpolation coefficients from constant memory to perform the interpolation operation; after synchronization with the cudaDeviceSynchronize() function, the interpolation result of each block is stored into global memory;
s315: and finally, copying the up-sampling result in the GPU global memory to the CPU memory for subsequent data processing.
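The flow S311-S315 can be emulated on the CPU to show how the tiling decomposes the work. This is a stand-in, not the patent's kernel: plain 2x pixel replication replaces the Modified-cubic kernel (whose coefficients are not listed here), the boundary padding of S312 is omitted for brevity, and the two loops play the roles of thread blocks and threads.

```python
import numpy as np

def upsample_tiled(img, tiles=8):
    """Split the image into tiles x tiles blocks (the 64 blocks of S312),
    'stage' each tile as a shared-memory load would, and upsample each tile
    independently; results are written back in block order."""
    h, w = img.shape
    th, tw = h // tiles, w // tiles
    out = np.zeros((2 * h, 2 * w), dtype=img.dtype)
    for bi in range(tiles):            # stand-in for a thread block index
        for tj in range(tiles):        # stand-in for a thread index
            tile = img[bi*th:(bi+1)*th, tj*tw:(tj+1)*tw]   # shared-memory load
            up = np.repeat(np.repeat(tile, 2, axis=0), 2, axis=1)  # 2x kernel
            out[2*bi*th:2*(bi+1)*th, 2*tj*tw:2*(tj+1)*tw] = up
    return out
```

Because each tile writes to a disjoint output region, the tiles can be processed in any order, which is what makes the GPU mapping in S313-S315 valid.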
Further, in step S33, LCEVC provides two transform modes, a 2×2 transform and a 4×4 transform; before parallel optimization, the transform modes in LCEVC are rewritten as butterfly transforms to reduce the number of calculations.
the 2 x 2 transform, if computed according to conventional algorithms, requires a total of 12 additions, 6 multiplications. Through matrix transformation of butterfly transformation, only 8 additions and 4 multiplications are needed, so that hardware resources are saved, and computational complexity is reduced. The input data of the transformation process in LCEVC is a prediction residual, and for 2×2 transformation or 4×4 transformation, the transformation coefficients are parsed into layers after transformation;
for 4×4 transform, firstly, the residual data is divided into 4×4 blocks, then each block performs butterfly transform according to a transform matrix, and finally, the data of the transform coefficients are resolved into layers; the data between layers are independent and have no correlation, so that parallel calculation is adopted to optimize the quantization module after transformation; since 4×4 transformation is followed by parsing the data into 16 layers, the number of blocks allocated in the GPU is 4 and the number of threads is 4; the specific steps implemented on the GPU for 4 x 4 transforms include:
1) Copying the residual data from the CPU to the global memory of the GPU, and dividing the original image into 4×4 blocks;
2) Allocating blocks and thread numbers, wherein the blocks are 4, and the thread numbers are 4;
3) Each thread reads the block data from global memory according to its assigned thread number, then performs the butterfly transform on each block and parses the resulting transform coefficients into layers; during data processing, the __syncthreads() function is used for synchronization. The input of the quantization process is the transform coefficients, denoted T_in[layer][y][x]; for the 4×4 transform there are 16 layers, and the data within each layer are uncorrelated, so there are no dependencies, the quantization of each transform coefficient can be completed independently, and quantization can therefore be optimized in a fully parallel manner.
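The butterfly saving described in S33 can be made concrete. The exact 2×2 transform matrix of LCEVC is defined in the MPEG-5 Part 2 specification and is not reproduced in the text, so the sketch below uses a Hadamard-style average/horizontal/vertical/diagonal decomposition as a stand-in; the point is only how the butterfly reuses the pairwise sums and differences, cutting 12 additions down to 8.

```python
# Stand-in 2x2 directional transform (average, horizontal, vertical,
# diagonal); the real LCEVC matrix is defined in MPEG-5 Part 2.
def transform_2x2_naive(a, b, c, d):
    # Direct matrix product: every output re-adds all four inputs,
    # 3 additions per output, 12 additions in total.
    A = a + b + c + d
    H = a - b + c - d
    V = a + b - c - d
    D = a - b - c + d
    return A, H, V, D

def transform_2x2_butterfly(a, b, c, d):
    # Stage 1: four shared partial sums/differences (4 additions).
    s0, s1 = a + b, a - b
    s2, s3 = c + d, c - d
    # Stage 2: combine the partials (4 more additions), 8 in total.
    return s0 + s2, s1 + s3, s0 - s2, s1 - s3
```

The same pairwise-sharing idea extends to the 4×4 case, where the savings are proportionally larger.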
Further, in step S35, all the data output by quantization are distinguished by layer, but the amount of data per layer differs between the luminance and chrominance components. To reduce the program latency caused by threads having inconsistent data lengths, the data should be balanced as evenly as possible to improve the efficiency of parallel encoding. Taking the U1 and V1 data of the L-1 layer as the baseline, the larger data blocks within a layer are divided into chunks of equal size W/16 × H/16; the number of allocated blocks is 16 and the number of threads is 30, with one thread processing one W/16 × H/16 data chunk. The GPU-based parallel optimization of entropy coding is implemented as follows:
1) Copy the quantized output data from CPU memory to the global memory of the GPU;
2) Read the entropy-coding input data into shared memory in data blocks of size W/16 × H/16;
3) Each thread reads the data it needs from shared memory and performs run-length coding and Huffman coding; the host waits for all threads to finish via the cudaDeviceSynchronize() function, and the coded data is written into global memory in order;
The entropy coding result obtained by GPU parallel computation is then copied back to CPU memory, and the entropy-coded output data is used directly when writing the binary bitstream file.
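The independence argument behind the entropy stage can be illustrated with the run-length pass alone. This is a hedged sketch: the Huffman step and the real W/16 × H/16 chunk geometry are omitted, and the point is only that equally sized chunks can be coded independently and their outputs concatenated in block order.

```python
def run_length(chunk):
    """Run-length code one chunk into (value, run) pairs."""
    out = []
    for v in chunk:
        if out and out[-1][0] == v:
            out[-1][1] += 1          # extend the current run
        else:
            out.append([v, 1])       # start a new run
    return [tuple(p) for p in out]

def parallel_rle(data, n_chunks):
    """Split data into equal chunks (one per 'thread') and code each
    independently; results keep block order, as in the text."""
    size = len(data) // n_chunks
    chunks = [data[i * size:(i + 1) * size] for i in range(n_chunks)]
    return [run_length(c) for c in chunks]
```

Because no run crosses a chunk boundary in this scheme, each thread's output depends only on its own chunk, which is what permits the fully parallel mapping.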
The invention has the following beneficial effects. The invention computes the optimal downsampled video image through an improved downsampling algorithm, further improving the overall video coding quality of LCEVC, and solves the improved downsampling algorithm in a block-wise manner to reduce computational complexity and memory usage. The method realizes a real-time LCEVC video coding system covering the parallel optimization of the upsampling, improved downsampling, transform and quantization, inverse transform and inverse quantization, and entropy coding modules, which reduces the encoding time, raises the effective utilization of hardware resources such as the CPU and GPU, and improves encoding efficiency. The invention thus realizes real-time low-complexity enhanced video coding on the PC.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a system flow of the present invention;
FIG. 2 is a flow chart of parallel optimization of upsampling in step S3 of the method according to the present invention;
FIG. 3 is a flow chart of parallel optimization of transformation and quantization in step S3 of the method of the present invention;
FIG. 4 is a schematic diagram illustrating a butterfly algorithm in step S3 of the method of the present invention;
fig. 5 is a flow chart of parallel optimization of entropy coding in step S3 of the method according to the invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of this specification may be modified or varied on the basis of different viewpoints and applications without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention in a schematic way, and the following embodiments and the features in the embodiments may be combined with one another as long as no conflict arises.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of the embodiments of the invention correspond to the same or similar components. In the description of the invention, it should be understood that terms such as "upper", "lower", "left", "right", "front" and "rear", which indicate an orientation or positional relationship based on that shown in the drawings, are used only for convenience of describing the invention and simplifying the description; they do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore merely illustrative and should not be construed as limiting the invention; their specific meaning can be understood by those of ordinary skill in the art according to the specific circumstances.
As shown in fig. 1, the present invention provides a method for realizing optimization of the LCEVC video coding standard, comprising the following steps:
s1: For a given input video, a block downsampling method based on Modified-cubic interpolation information is adopted to improve the output video quality of the downsampling module in LCEVC, and in turn the image quality of the LCEVC-encoded video. The specific steps are as follows:
s11: Let the input video sequence be Y and the downsampled image be X; the optimal downsampled output X* is obtained by the block downsampling method based on Modified-cubic interpolation information. The calculation process is as follows:
Let Y' be the image obtained by interpolating X, H the interpolation coefficient matrix, and Φ the vector of products of interpolation coefficients with the corresponding boundary pixel values. Y' can be represented as:
Y' = HX + Φ
The optimal downsampling objective function can be expressed as:
X* = argmin_X ||Y - (HX + Φ)||^2
Differentiating with respect to X and simplifying yields the derivative -2H^T(Y - HX - Φ); setting it equal to zero gives the closed-form optimal solution:
X* = (H^T H)^(-1) [H^T (Y - Φ)]
where (H^T H)^(-1) is the inverse of (H^T H), and (H^T H) is a full-rank matrix.
Since the size of the matrix H grows with the image size, H occupies a large memory space in the program, and the cost of computing (H^T H)^(-1) grows with H. To address the high complexity and the large memory footprint, the fact that Modified-cubic interpolation derives each new pixel from the 16 surrounding pixels is exploited, and Y is processed in 8×8 blocks. When Y is an 8×8 block, the interpolation matrix H has size 16×64, the constant vector Φ is a column vector of length 64, and the values of Φ depend only on the elements corresponding to boundary pixels.
S12: Finally, all 4×4 optimal downsampled blocks X* are merged together in block order to obtain the final downsampled output result.
S2: adding the block downsampling method based on the Modifiedcubic interpolation information in the S1 into the LCEVC encoder, respectively counting time consumption of each module in the LCEVC and carrying out time consumption analysis, and designing a LCEVC encoder frame based on the CPU-GPU heterogeneous platform through the time consumption analysis result. The method comprises the following specific steps:
s21: the block downsampling method based on Modifiedcubic interpolation information is embedded into the LCEVC encoder.
S22: and selecting a group of video sequences, testing the LCEVC encoder on a multi-core CPU, and averaging through 10 times of serial operation tests to obtain the average consumed time of each frame, thereby obtaining the time consumption ratio of each module of the LCEVC encoder.
S23: through the time consumption ratio of each module of the LCEVC encoder in S22, large-scale parallel processing can be performed based on the GPU, and the CPU is more suitable for the logic control characteristic, the LCEVC encoder framework based on the CPU-GPU heterogeneous platform is designed, and the encoding framework considers the GPU and the computing resource of the CPU. The communication between the CPU and the GPU is mainly realized by copying data, and the part of the CPU mainly responsible for processing comprises:
1) And the image data are read, Y, U, V component data generated by the input video sequence and the basic encoder are read into a CPU memory, and the CPU copies the data and transmits the data to a video memory of the GPU.
2) Is responsible for reading information such as image names, resolution sizes, bit depths, up-down sampling modes, quantization parameters and the like of the coding configuration files.
3) And the method is responsible for outputting the coded video of the basic encoder, outputting the reconstructed video sequence and outputting the code stream, and is responsible for calculating the PSNR of the reconstructed video and scheduling GPU threads.
For the modules with low data correlation, such as up-sampling, down-sampling, transformation, quantization, inverse transformation, inverse quantization and entropy coding, a corresponding parallel optimization algorithm is designed according to different algorithm processing procedures of each module.
S3: and (2) carrying out parallel optimization design on an up-sampling module, an improved down-sampling module, a transformation and quantization module and an entropy coding module in the LCEVC standard according to an LCEVC encoder framework based on a CPU-GPU heterogeneous platform in S2, so as to realize real-time low-complexity enhanced video coding. The method can improve the coding video quality of LCEVC, shorten the coding time and improve the effective utilization rate of hardware resources such as CPU, GPU and the like. The method comprises the following specific steps:
s31: as shown in fig. 2, the upsampling module is parallelized. From the Modifiedcubic interpolation process, it is known that the interpolation process only includes addition, multiplication and shift calculations, and that these three operators already have corresponding hardware adders on the GPU. The GPU may be used to implement the upsampling interpolation in parallel. The shared memory is an on-chip memory of the GPU, can quickly access data, and can also realize data communication among thread blocks. Based on the method, the up-sampling parallelization concrete implementation method is as follows:
1) Copying the image to be up-sampled from the CPU memory to the global memory of the GPU, and copying the interpolation coefficient matrix to the constant memory.
2) Pad the upper and lower boundaries of the input image. Considering that each thread should process essentially the same amount of data after padding, the padded image is divided into 64 blocks of height U_PicHeight/8 and width U_PicWidth/8, where U_PicHeight and U_PicWidth denote the height and width of the video image to be interpolated, respectively; the image is then read into shared memory block by block.
3) Thread blocks and threads are allocated. The number of assigned thread blocks is 8, and the number of threads is 8.
4) Each thread reads the pixel values it needs from shared memory and the interpolation coefficients from constant memory, performs the interpolation, and, after synchronization with the cudaDeviceSynchronize() function, stores each block's interpolation result in global memory. Finally, the up-sampling result is copied from GPU global memory back to CPU memory for subsequent processing.
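As an illustration of why this interpolation maps well onto the GPU, here is a CPU sketch of the per-sample computation using only integer multiplies, adds, and shifts. The 4-tap coefficients and 5-bit shift are illustrative placeholders, not the normative Modified cubic coefficients, and `upsample_rows_fixed_point` is a hypothetical helper name:

```python
import numpy as np

def upsample_rows_fixed_point(img, coeffs=(-3, 19, 19, -3), shift=5):
    """CPU sketch of the per-thread kernel: each new sample between two input
    rows is a 4-tap weighted sum built from integer multiplies, adds, and one
    right shift (the three operations the text notes have direct hardware
    support on the GPU). Coefficients are illustrative, not normative."""
    h, w = img.shape
    # fill the upper and lower boundaries, as in step 2 above
    padded = np.pad(img, ((2, 2), (0, 0)), mode="edge").astype(np.int64)
    out = np.empty((2 * h, w), dtype=img.dtype)
    out[0::2] = img  # even output rows copy the source rows
    for y in range(h):  # each GPU thread would own one block of rows
        taps = padded[y + 1:y + 5]  # 4 neighbouring input rows
        acc = sum(int(c) * taps[i] for i, c in enumerate(coeffs))
        # rounding add, then arithmetic right shift; clip to pixel range
        out[2 * y + 1] = np.clip(
            (acc + (1 << (shift - 1))) >> shift, 0, 255).astype(img.dtype)
    return out
```

The coefficients sum to 1 << shift, so flat regions are preserved exactly; on the GPU each iteration of the loop would be an independent thread reading its taps from shared memory.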
S32: the downsampled input image is divided into 8 x 8 blocks for parallel processing. In GPU, the allocated block number is pic_height/8, and the thread number is pic_width/8. Where pic_height and pic_width represent the Height and Width, respectively, of the downsampled input image. The parallelization processing of the downsampling is basically the same as the parallelization processing of the upsampling, except that the downsampling algorithm and the interpolation algorithm are executed differently.
S33: as shown in fig. 3, the transformation and quantization module is parallelized. During the transformation, LCEVC provides two transformation modes, 2×2 transformation or 4×4 transformation. Before parallel optimization, butterfly transformation is adopted for transformation modes in LCEVC, so that the calculation times are reduced. The 2 x 2 transformation formula can be rewritten as follows:
the 2 x 2 transform, if computed according to conventional algorithms, requires a total of 12 additions, 6 multiplications. As shown in fig. 4, only 8 additions and 4 multiplications are needed through matrix transformation of butterfly transformation, so that not only hardware resources are saved, but also computational complexity is reduced. The input data of the transform process in LCEVC is a prediction residual, and for 2×2 transform or 4×4 transform, transform coefficients are parsed into layers after transform. For 4×4 transform, the residual data is first divided into 4×4 blocks, each block is then butterfly transformed according to a transform matrix, and finally the transform coefficient data is parsed into layers. The data between layers are independent and have no correlation, so that parallel calculation can be adopted to optimize the quantization module after transformation. Since the 4×4 transform is followed by parsing the data into 16 layers, the number of blocks allocated in the GPU is 4 and the number of threads is 4. The specific steps implemented on the GPU for 4 x 4 transforms include:
1) Copy the residual data from the CPU into the global memory of the GPU, then divide the residual data into 4×4 blocks.
2) The block and thread numbers are allocated, the block number is 4, and the thread number is 4.
3) Each thread reads its block's data from global memory according to its thread number, butterfly-transforms the block, and groups the resulting transform coefficients into layers. While processing the data, synchronization with the __syncthreads() function is required. The input to the quantization stage is the transform coefficients T_in[layer][y][x]; for the 4×4 transform, layer runs over 16 layers. The data in each layer is uncorrelated with the other layers, so there are no dependencies: the quantization of each transform coefficient can be completed independently, and quantization can therefore be optimized in a fully parallel manner.
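The two stages above — butterfly transform, then per-layer quantization — can be sketched on the CPU. This is an illustrative stand-in, not the normative LCEVC transform matrices or quantizer (which uses dead zones and per-layer step widths); the normalization multiplications counted in the text are omitted from the butterfly here:

```python
import numpy as np

def dd2x2_direct(a, b, c, d):
    # Direct 2x2 directional transform: 3 additions per coefficient, 12 total
    return (a + b + c + d,   # average
            a - b + c - d,   # horizontal
            a + b - c - d,   # vertical
            a - b - c + d)   # diagonal

def dd2x2_butterfly(a, b, c, d):
    # Butterfly form: each stage-1 partial sum/difference is shared by two
    # outputs, cutting the additions from 12 to 8
    s0, s1 = a + b, c + d    # stage 1: sums
    d0, d1 = a - b, c - d    # stage 1: differences
    return (s0 + s1, d0 + d1, s0 - s1, d0 - d1)  # stage 2

def quantize_layers(t_in, step_width):
    # T_in[layer][y][x]: every entry is quantized independently, so each GPU
    # thread can own one layer (16 layers for the 4x4 transform) with no
    # synchronization; plain uniform quantizer as an illustrative stand-in
    t = np.asarray(t_in, dtype=np.int64)
    return np.sign(t) * (np.abs(t) // step_width)
```

The equality of the direct and butterfly forms is what justifies the substitution; the element-wise independence of `quantize_layers` is what makes the fully parallel thread mapping legal.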
S34: the principle of inverse transformation and parallel transformation is identical, and the principle of inverse quantization and quantization is also identical. Therefore, the parallel optimization method for inverse transformation and inverse quantization is the same as S33.
S35: as shown in fig. 5, run-length encoding and huffman encoding in the entropy encoding module are processed in parallel. The quantized output data are all distinguished by layers, but the size of the data in layers varies depending on the luminance component and the chrominance component. To reduce program latency due to some thread data length inconsistencies, the data should be averaged as much as possible to improve the efficiency of parallel encoding. Based on U1 and V1 of the L-1 layer, the larger data block in the layer is divided into data with the same size as W/16 XH/16. One can consider a 16 number of blocks allocated and a 30 number of threads allocated, and use one thread to process a W/16 XH/16 sized data block. The specific implementation method of the entropy coding parallel optimization based on the GPU comprises the following steps:
1) Copy the quantized output data from CPU memory to the global memory of the GPU.
2) Read the entropy coding input data into shared memory in data blocks of size W/16 × H/16.
3) Each thread reads the data it needs from shared memory and performs run-length coding and Huffman coding; the cudaDeviceSynchronize() function waits for all threads to finish, and the coded data is transferred into global memory in order.
The entropy coding result obtained by the parallel GPU computation is then copied to CPU memory, and the entropy-coded output data is used directly when writing the binary bitstream file.
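As a CPU stand-in for the per-thread kernel, the run-length stage of the scheme above can be sketched as follows; the equal-size block split plays the role of the W/16 × H/16 partition, and Huffman coding of the resulting (value, run) pairs would follow and is omitted here:

```python
def run_length_encode(block):
    """Run-length encode one thread's data block into (value, run) pairs."""
    runs = []
    prev, count = block[0], 1
    for v in block[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

def parallel_rle(data, n_blocks):
    """CPU stand-in for the kernel launch: split the input into equal-size
    blocks (the load balancing described above), encode each block
    independently, then gather the results in block order, which is the
    role the cudaDeviceSynchronize() barrier plays on the GPU."""
    size = len(data) // n_blocks
    blocks = [data[i * size:(i + 1) * size] for i in range(n_blocks)]
    return [run_length_encode(b) for b in blocks]
```

Since each block is encoded with no knowledge of its neighbours, the per-block outputs can be produced concurrently and concatenated afterwards.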
S36: the parallel optimization algorithm in S31 to S34 is realized on the GPU platform of the PC end, so that the acceleration of the LCEVC encoder is realized, and the real-time low-complexity enhanced video coding is realized.
Finally, it is noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications and equivalents may be made without departing from the spirit and scope of the present invention, which is intended to be covered by the claims.

Claims (2)

1. A method for implementing LCEVC video coding optimization, characterized by: the method comprises the following steps:
s1: for a given input video, acquiring optimal downsampling output by adopting a block downsampling method based on Modifiedcubic interpolation information;
s2: embedding the block downsampling method based on the Modifiedcubic interpolation information in the step S1 into an LCEVC encoder, respectively counting time consumption of each module in the LCEVC and carrying out time consumption analysis, and designing a LCEVC encoder frame based on a CPU-GPU heterogeneous platform according to the time consumption analysis result; the step S2 specifically includes the following steps:
s21: embedding the block downsampling method based on the Modifiedcubic interpolation information in the step S1 into an LCEVC encoder;
s22: selecting a group of video sequences, performing multiple serial operation tests on the LCEVC encoder on a multi-core CPU, and averaging to obtain average consumed time of each frame, thereby obtaining time-consuming duty ratio of each module of the LCEVC encoder;
s23: through the time consumption ratio of each module of the LCEVC encoder in the step S22, a LCEVC encoder framework based on a CPU-GPU heterogeneous platform is designed, communication between the CPU and the GPU is realized in a data copying mode, and a part of the CPU responsible for processing comprises:
1) The system is responsible for reading image data, reading an input video sequence and Y, U, V component data generated by a basic encoder into a CPU (central processing unit) memory, copying the data and transmitting the copied data into a video memory of a GPU (graphics processing unit);
2) Is responsible for reading the code configuration file;
3) The method is responsible for outputting coded video of a basic encoder, outputting a reconstructed video sequence and an output code stream, calculating reconstructed video PSNR and scheduling GPU threads;
for the module with lower data correlation, designing corresponding parallel optimization algorithm according to different algorithm processing processes of each module;
s3: performing parallel optimization design on an up-sampling module, an improved down-sampling module, a conversion module, an inverse quantization module and an entropy coding module in the LCEVC standard according to an LCEVC encoder frame based on a CPU-GPU heterogeneous platform in S2, and realizing real-time low-complexity enhanced video coding at a PC end; the step S3 specifically includes:
s31: parallelizing the up-sampling module; the GPU is adopted to carry out parallel implementation on the up-sampling interpolation; the up-sampling parallelization concrete implementation method comprises the following steps:
s311: copying the image to be up-sampled from the CPU memory to the global memory of the GPU, and copying the interpolation coefficient matrix to the constant memory;
s312: filling the upper and lower boundaries of an input image, equally dividing the filled image into 64 blocks, wherein the height of each block is U_PicHeight/8, and the width of each block is U_PicWidth/8, wherein U_PicHeight and U_PicWidth respectively represent the width and the length of a video image to be interpolated, and then respectively reading the image into a shared memory according to the blocks;
s313: thread blocks and threads are distributed, the number of the distributed blocks is 8, and the number of threads is 8;
s314: each thread reads the pixel value needed in the shared memory, reads the interpolation coefficient from the constant memory to perform interpolation operation, and stores the interpolation result of each block into the global memory after the cudaDeviceSynchronize () function is synchronized;
s315: finally, copying the up-sampling result in the GPU global memory to a CPU memory for subsequent data processing;
s32: dividing the downsampled input image into 8×8 blocks for parallel processing; in the GPU, the allocated block number is pic_height/8, and the thread number is pic_width/8, wherein pic_height and pic_width respectively represent the Height and Width of the downsampled input image;
s33: parallelizing the transformation and quantization module; LCEVC provides two transformation modes, 2 x 2 transformation or 4 x 4 transformation; before parallel optimization, butterfly transformation is adopted for transformation modes in LCEVC; the 2×2 transformation formula is rewritten as follows:
the input data of the transformation process in LCEVC is a prediction residual, and for 2×2 transformation or 4×4 transformation, the transformation coefficients are parsed into layers after transformation;
for 4×4 transform, firstly, the residual data is divided into 4×4 blocks, then each block performs butterfly transform according to a transform matrix, and finally, the data of the transform coefficients are resolved into layers; optimizing the quantization module by adopting parallel calculation after transformation; the number of blocks allocated in the GPU is 4, and the number of threads is 4; the specific steps implemented on the GPU for 4 x 4 transforms include:
1) Copying residual data from the CPU to the global memory of the GPU, and dividing the original image into 4 multiplied by 4 blocks;
2) Allocating blocks and thread numbers, wherein the blocks are 4, and the thread numbers are 4;
3) Each thread reads the data information of the blocks in the global memory according to the assigned thread numbers, then performs butterfly transformation on each block, and groups the obtained transformation coefficients into layers; in the process of processing data, synchronization is performed using the __syncthreads() function; the input of the quantization process is the transform coefficients, the input data being T_in[layer][y][x]; for the 4×4 transform, layer is 16, and quantization optimization is performed in a completely parallel manner;
s34: the parallel optimization mode for inverse transformation and inverse quantization is the same as that of S33;
S35: parallel processing is carried out on run-length coding and Huffman coding in the entropy coding module; in step S35, based on U1 and V1 of the L-1 layer, the larger data blocks in the layer are divided into blocks of equal size W/16 × H/16, the number of allocated blocks is 16 and the number of allocated threads is 30, one thread being used to process one data block of size W/16 × H/16; the specific implementation of GPU-based parallel optimization of entropy coding is as follows:
1) Copying output data to be quantized from a CPU memory to a global memory of the GPU;
2) The entropy coding input data is read into the shared memory in data blocks of size W/16 × H/16;
3) Each thread reads the needed data from the shared memory to carry out run-length coding and Huffman coding, the cudaDeviceSynchronize() function is used to wait for all threads to finish processing, and the coded data is sequentially transferred into the global memory;
obtaining an entropy coding result through GPU parallel calculation, copying entropy coding data to a memory of a CPU, and directly using entropy coding output data when writing a binary code stream file;
s36: the parallel optimization algorithm in S31 to S35 is realized on the GPU platform of the PC end, so that the acceleration of the LCEVC encoder is realized, and the real-time low-complexity enhanced video coding is realized.
2. The method for implementing LCEVC video coding optimization of claim 1, wherein: in the step S1, the input video sequence is set as Y, the downsampled image is set as X, and the optimal downsampled output X* is obtained by the block downsampling method based on Modified cubic interpolation information, which specifically comprises the following steps:
S11: let Y′ be the image interpolated from X, H be the interpolation coefficient matrix, and Φ be the matrix formed by the products of the interpolation coefficients and the corresponding pixel values; the closed-form solution of the optimal X is:

X* = (HᵀH)⁻¹[Hᵀ(Y − Φ)]

wherein (HᵀH)⁻¹ is the inverse matrix of (HᵀH), and (HᵀH) is a full-rank matrix;
Y is processed in 8×8 blocks;
S12: all the 4×4 optimal downsampling blocks X* are merged together in block order to obtain the output result of the final downsampled image.
CN202210447137.6A 2022-04-18 2022-04-18 Method for realizing LCEVC video coding optimization Active CN114827614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210447137.6A CN114827614B (en) 2022-04-18 2022-04-18 Method for realizing LCEVC video coding optimization


Publications (2)

Publication Number Publication Date
CN114827614A CN114827614A (en) 2022-07-29
CN114827614B true CN114827614B (en) 2024-03-22

Family

ID=82507961


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102572436A (en) * 2012-01-17 2012-07-11 西安电子科技大学 Intra-frame compression method based on CUDA (Compute Unified Device Architecture)
CN104574277A (en) * 2015-01-30 2015-04-29 京东方科技集团股份有限公司 Image interpolation method and image interpolation device
CN104869398A (en) * 2015-05-21 2015-08-26 大连理工大学 Parallel method of realizing CABAC in HEVC based on CPU+GPU heterogeneous platform
CN107135392A (en) * 2017-04-21 2017-09-05 西安电子科技大学 HEVC motion search parallel methods based on asynchronous mode
CN109391816A (en) * 2018-10-26 2019-02-26 大连理工大学 The method for parallel processing of HEVC medium entropy coding link is realized based on CPU+GPU heterogeneous platform
CN109495743A (en) * 2018-11-15 2019-03-19 上海电力学院 A kind of parallelization method for video coding based on isomery many places platform
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Jin-tai Shangguan; Yan-ling Li; Yong-gang Wang; Hui-ling Li. Fast algorithm of modified cubic convolution interpolation. 2011 4th International Congress on Image and Signal Processing, 2011, 1-4. *
Andreas Heindel; Eugen Wige; André Kaup. Low-Complexity Enhancement Layer Compression for Scalable Lossless Video Coding Based on HEVC. IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 8, 20 Apr. 2016, 1749-1760. *
Newsha Ardalani; Clint Lestourgeon; Karthikeyan Sankaralingam; Xiaojin Zhu. Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance. 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1-13. *
Zhao Qing. Research on key technologies of the H.265/HEVC encoder based on a CPU+GPU heterogeneous platform. China Dissertations Full-text Database, Oct. 2019. *
Yuan Sannan; Wang Mengbin; Zhang Yanqiu; Tao Qianyun. Research on parallelization complexity and rate-distortion of video coding based on heterogeneous multi-processing platforms. Journal of Shanghai University of Electric Power, vol. 37, no. 3, 7 May 2021, 271-276. *
Ding Yang. Research on real-time low-complexity enhanced video coding. China Master's Theses Full-text Database, Information Science and Technology, 15 Jun. 2023. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant