CN114827614A - Method for realizing LCEVC video coding optimization - Google Patents

Method for realizing LCEVC video coding optimization

Info

Publication number
CN114827614A
CN114827614A (application number CN202210447137.6A; granted publication CN114827614B)
Authority
CN
China
Prior art keywords
lcevc
data
transformation
gpu
video
Prior art date
Legal status
Granted
Application number
CN202210447137.6A
Other languages
Chinese (zh)
Other versions
CN114827614B (en)
Inventor
丁杨
罗雷
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210447137.6A priority Critical patent/CN114827614B/en
Publication of CN114827614A publication Critical patent/CN114827614A/en
Application granted granted Critical
Publication of CN114827614B publication Critical patent/CN114827614B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/119Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Abstract

The invention relates to a method for realizing LCEVC video coding optimization, belonging to the field of multimedia video processing and transmission, and comprising the following steps. S1: for a given input video, obtain the optimal downsampling output using a block downsampling method based on Modified cubic interpolation information. S2: embed the block downsampling method of step S1 into an LCEVC encoder, analyze the time consumed by each module of the LCEVC, and design an LCEVC encoder framework based on a CPU-GPU heterogeneous platform. S3: according to this framework, apply parallel optimization to the upsampling, improved downsampling, transform and quantization, inverse transform and inverse quantization, and entropy coding modules of the LCEVC standard, realizing real-time low-complexity enhanced video coding on a PC. The invention improves the quality of LCEVC-coded video, shortens the coding time, and raises the effective utilization of hardware resources.

Description

Method for realizing LCEVC video coding optimization
Technical Field
The invention belongs to the field of multimedia video processing and transmission, and relates to a method for realizing LCEVC video coding optimization.
Background
With the rapid development of video encoding and decoding technology, high-definition and ultra-high-definition video (including 4K and 8K) have become popular because they provide users with sharper picture quality and more realistic perceptual quality. However, the data volume of such video grows with increasing resolution and bit depth. To improve compression efficiency and reduce data volume, Low Complexity Enhancement Video Coding (LCEVC) was proposed to meet market demand as a software enhancement layer on top of existing and future video codecs.
Tests with the LCEVC reference software LTM show that LCEVC improves the compression rate by about 40%. With AVC as the base encoder, the peak signal-to-noise ratio (PSNR), Video Multimethod Assessment Fusion (VMAF) and mean opinion score (MOS) values of the same video at the same bit rate are all higher than with AVC alone. The LCEVC encoding time with an AVC base codec is 2.4 times shorter than that of AVC, and with an HEVC base codec 2.7 times shorter than that of HEVC. Combined with a base encoder, LCEVC noticeably enhances the video image. The downsampling module of LCEVC adopts the Lanczos interpolation algorithm; although Lanczos interpolation performs well, it does not achieve the downsampling output with optimal performance. The downsampling module of LCEVC therefore needs to be optimized and improved so as to enhance video image quality.
Although LCEVC features low complexity and short coding time, tests on the same high-definition video show that, at QP 22, the LCEVC encoding time with an AVC base codec is 28.9 s, and with an HEVC base codec reaches 33.8 s. For applications with strong real-time requirements, such as streaming media and live sports broadcasting, the video encoding process must be accelerated to achieve real-time encoding under the LCEVC standard without affecting coding performance.
Disclosure of Invention
In view of this, the present invention aims to provide a method for optimizing LCEVC video coding that combines an interpolation-dependent image downsampling optimization method with GPU-based parallel optimization, realizing real-time low-complexity enhanced video coding on a PC.
In order to achieve the purpose, the invention provides the following technical scheme:
a method for realizing LCEVC video coding optimization comprises the following steps:
s1: for a given input video, obtain the optimal downsampling output using a block downsampling method based on Modified cubic interpolation information, so as to improve the output quality of the downsampling module in LCEVC and thereby the image quality of the LCEVC-coded video;
s2: embed the block downsampling method of step S1 into an LCEVC encoder, measure and analyze the time consumed by each module of the LCEVC, and design an LCEVC encoder framework based on a CPU-GPU heterogeneous platform according to the analysis;
s3: according to the CPU-GPU heterogeneous encoder framework of S2, apply parallel optimization to the upsampling, improved downsampling, transform and quantization, inverse transform and inverse quantization, and entropy coding modules of the LCEVC standard, realizing real-time low-complexity enhanced video coding on a PC.
Further, in step S1, let the input video sequence be Y and the downsampled image be X; the optimal downsampled output X^* is obtained by the block downsampling method based on Modified cubic interpolation information as follows:
S11: let Y' be the image interpolated from X, H the interpolation coefficient matrix, and \Phi the constant vector formed by the products of interpolation coefficients and the corresponding pixel values, so that:
Y' = HX + \Phi
The optimal downsampling objective function is:
X^* = \arg\min_X \|Y - HX - \Phi\|_2^2
Differentiating with respect to X and simplifying gives:
\frac{\partial}{\partial X}\|Y - HX - \Phi\|_2^2 = 2H^T(HX + \Phi - Y)
Setting this derivative to zero yields the closed-form optimal solution:
X^* = (H^T H)^{-1}[H^T(Y - \Phi)]
where (H^T H)^{-1} is the inverse of H^T H, which is a full-rank matrix;
Since the size of H grows with the image size, H occupies a large amount of memory in the program, and the cost of computing (H^T H)^{-1} likewise grows with H. To reduce the complexity and the memory footprint, Y is processed in 8 × 8 blocks, exploiting the fact that the Modified cubic interpolation algorithm interpolates each new pixel from its 16 surrounding pixels. When Y is an 8 × 8 block, the interpolation matrix H has size 16 × 64, the constant vector \Phi is a column vector of length 64, and the value of \Phi depends only on the elements corresponding to boundary pixels;
S12: all 4 × 4 optimal downsampled blocks X^* are combined in block order to obtain the final downsampled image output.
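The closed-form per-block solution can be sketched in NumPy. The 8 × 8 → 4 × 4 block shapes follow the text; the orientation of H (here 64 × 16, mapping the vectorized downsampled block to the vectorized original block) and the toy interpolation matrix used below are assumptions, not the Modified cubic coefficients of the patent.

```python
import numpy as np

def optimal_downsample_block(y, H, phi):
    """Closed-form optimal downsampling for one 8x8 block, following
    X* = (H^T H)^(-1) [H^T (Y - phi)].

    y   : (64,)   vectorized 8x8 input block Y
    H   : (64,16) interpolation matrix (orientation assumed here)
    phi : (64,)   constant offset vector
    """
    HtH = H.T @ H                              # 16x16, full rank per the text
    x = np.linalg.solve(HtH, H.T @ (y - phi))  # solve instead of forming the inverse
    return x.reshape(4, 4)
```

With an exactly invertible interpolation model, for example H built from a 1-D nearest-neighbour upsampler via a Kronecker product, the downsampled block is recovered exactly; with real Modified cubic coefficients the result is the least-squares-optimal downsample.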
Further, step S2 specifically comprises the following steps:
S21: embed the block downsampling method based on Modified cubic interpolation information of step S1 into an LCEVC encoder;
S22: select a group of video sequences, run the LCEVC encoder serially on a multi-core CPU several times, and average the results to obtain the mean time per frame and hence the proportion of time consumed by each module of the LCEVC encoder;
S23: based on the per-module timings of step S22, and on the fact that the GPU suits large-scale parallel processing while the CPU suits logic control, design an LCEVC encoder framework for a CPU-GPU heterogeneous platform that exploits the computing resources of both. Communication between the CPU and the GPU is realized by data copying, and the CPU is responsible for:
1) reading image data: the input video sequence and the Y, U, V component data generated by the base encoder are read into CPU memory, copied, and transferred to the GPU's video memory;
2) reading the coding configuration file, i.e. information such as image name, resolution, bit depth, up/downsampling mode and quantization parameters;
3) outputting the coded video of the base encoder, the reconstructed video sequence and the code stream, computing the PSNR of the reconstructed video, and scheduling GPU threads;
For modules with low data correlation, such as upsampling, downsampling, transform, quantization, inverse transform, inverse quantization and entropy coding, corresponding parallel optimization algorithms are designed according to each module's processing flow.
Further, step S3 specifically comprises:
S31: parallelize the upsampling module. From the Modified cubic interpolation process in LCEVC, the interpolation involves only addition, multiplication and shift operations, all of which have corresponding hardware units on the GPU, so the upsampling interpolation can be implemented in parallel on the GPU. The shared memory is the GPU's on-chip memory; it provides fast data access and enables data communication among the threads of a block;
S32: divide the downsampled input image into 8 × 8 blocks for parallel processing. In the GPU, the number of allocated thread blocks is Pic_Height/8 and the number of threads per block is Pic_Width/8, where Pic_Height and Pic_Width are the height and width of the downsampled input image. The parallel processing of downsampling follows essentially the same flow as that of upsampling, differing only in the downsampling and interpolation algorithms executed;
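The block/thread allocation described above is plain integer arithmetic and can be sketched as follows (hypothetical helper name, assuming frame dimensions that are multiples of 8):

```python
def downsample_launch_config(pic_width, pic_height):
    """Launch configuration for the 8x8-block downsampling kernel:
    one thread block per row of 8x8 blocks (Pic_Height/8 blocks),
    one thread per 8x8 block in that row (Pic_Width/8 threads)."""
    assert pic_width % 8 == 0 and pic_height % 8 == 0
    return pic_height // 8, pic_width // 8  # (blocks, threads per block)
```

For a 1920 × 1080 frame this gives 135 thread blocks of 240 threads each.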
S33: carrying out parallelization processing on the transformation and quantization module;
s34: the parallel principle of inverse transformation and transformation is completely the same, and the principle of inverse quantization and quantization is also completely the same. Therefore, the parallel optimization mode for inverse transformation and inverse quantization is the same as that of S33;
s35: run-length coding and Huffman coding in the entropy coding module are processed in parallel;
s36: and the parallel optimization algorithms from S31 to S35 are realized on a GPU platform at the PC end, so that the acceleration of an LCEVC encoder is realized, and the real-time low-complexity enhanced video encoding is realized.
Further, in step S31, the parallelization of upsampling is implemented as follows:
S311: copy the image to be upsampled from CPU memory to the GPU's global memory, and copy the interpolation coefficient matrix to constant memory;
S312: pad the upper and lower boundaries of the input image; since after padding each thread processes roughly the same amount of data, divide the padded image evenly into 64 blocks of height U_PicHeight/8 and width U_PicWidth/8, where U_PicHeight and U_PicWidth are the height and width of the video image to be interpolated, and read the image into shared memory block by block;
S313: allocate thread blocks and threads, with 8 thread blocks of 8 threads each;
S314: each thread reads the required pixel values from shared memory and the interpolation coefficients from constant memory, performs the interpolation, and, after synchronization with the cudaDeviceSynchronize() function, stores each block's interpolation result into global memory;
S315: finally, copy the upsampling result from the GPU's global memory back to CPU memory for subsequent data processing.
Further, in step S33, LCEVC provides two transform modes, a 2 × 2 transform and a 4 × 4 transform. Before parallel optimization, the transform in LCEVC is rewritten as a butterfly transform to reduce the number of operations; the 2 × 2 transform formula is rewritten accordingly:
(2 × 2 butterfly transform matrix equation, shown in Fig. 4; the original equation image is not recoverable from this text extraction)
Computed by the conventional algorithm, a 2 × 2 transform requires 12 additions and 6 multiplications; after the butterfly matrix factorization, only 8 additions and 4 multiplications are needed, saving hardware resources and reducing computational complexity. The input of the transform stage in LCEVC is the prediction residual, and for both the 2 × 2 and the 4 × 4 transform the transform coefficients are parsed into layers after transformation;
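As an illustration of the butterfly idea, a Hadamard-style 2 × 2 transform can be factored into row sums/differences followed by column sums/differences. This is a sketch only: the exact LCEVC transform basis and its scaling factors (the source of the multiplications counted above) are not reproduced here.

```python
import numpy as np

def transform_2x2_butterfly(r):
    """Butterfly evaluation of a Hadamard-style 2x2 transform.
    r: 2x2 residual block -> (A, H, V, D) coefficients,
    using 8 additions instead of a full 4x4 matrix product."""
    s0, s1 = r[0, 0] + r[0, 1], r[1, 0] + r[1, 1]   # row sums
    d0, d1 = r[0, 0] - r[0, 1], r[1, 0] - r[1, 1]   # row differences
    return s0 + s1, d0 + d1, s0 - s1, d0 - d1       # A, H, V, D
```

The butterfly output agrees term by term with the direct product of the ±1 Hadamard matrix and the vectorized block, while halving the addition count.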
For the 4 × 4 transform, the residual data is divided into 4 × 4 blocks, each block undergoes a butterfly transform according to the transform matrix, and the resulting transform coefficients are parsed into layers. Data in different layers are mutually independent, so the quantization module is optimized with parallel computation after the transform. Since a 4 × 4 transform yields 16 layers, 4 thread blocks of 4 threads each are allocated in the GPU. The specific steps of the 4 × 4 transform on the GPU are:
1) copy the residual data from the CPU into the GPU's global memory and divide the original image into 4 × 4 blocks;
2) allocate 4 thread blocks of 4 threads each;
3) each thread reads its block's data from global memory according to its thread number, performs the butterfly transform on the block, and parses the resulting transform coefficients into layers, using the __syncthreads() function for synchronization. The input of the quantization stage is the transform coefficients T_in[layer][y][x]; for the 4 × 4 transform, layer = 16. Data within each layer are unrelated, so there are no dependencies, the quantization of each transform coefficient can proceed independently, and quantization can be optimized in a fully parallel manner.
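Because every coefficient in T_in[layer][y][x] is independent, the quantization loop is embarrassingly parallel. A sketch with a plain uniform quantizer follows; the quantizer itself is an assumption, since the LCEVC specification defines its own quantization rule.

```python
import numpy as np

def quantize_layers(t_in, step):
    """Quantize T_in[layer][y][x] element-wise with a uniform step.
    Every coefficient is independent, so on the GPU each element can
    be handled by its own thread; NumPy vectorization plays the same
    role in this sketch."""
    t_in = np.asarray(t_in)
    return np.sign(t_in) * (np.abs(t_in) // step)
```

One call quantizes all layers at once, mirroring the fully parallel GPU mapping described above.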
Further, in step S35, all quantized output data are classified by layer, but the data sizes within layers differ because of the difference between luminance and chrominance components. To reduce program latency caused by threads with inconsistent data lengths, the data should be divided as evenly as possible to improve parallel coding efficiency. Taking U1 and V1 of the L-1 layer as the reference, the larger data blocks in a layer are divided into blocks of equal size W/16 × H/16; 16 thread blocks of 30 threads each are allocated, with one thread processing one W/16 × H/16 data block. The GPU-based parallel optimization of entropy coding is implemented as follows:
1) copy the quantized output data from CPU memory to the GPU's global memory;
2) read the entropy coding input into shared memory block by block as W/16 × H/16 data blocks;
3) each thread reads its data from shared memory, performs run-length coding and Huffman coding, waits for all threads to finish using the cudaDeviceSynchronize() function, and writes the coded data into global memory in order;
The entropy coding result obtained by GPU parallel computation is copied back to CPU memory, and the entropy-coded data is output directly when the binary code stream file is written.
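The run-length stage of step 3) can be sketched on a 1-D coefficient sequence. The subsequent Huffman stage is omitted, and LCEVC's actual run-length symbol format (which targets zero runs) differs, so this generic encoder is an assumption for illustration only.

```python
def run_length_encode(seq):
    """Generic run-length pass of the entropy coder: emit
    (value, run_length) pairs for consecutive equal values."""
    out = []
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1                  # extend the current run
        out.append((seq[i], j - i))
        i = j
    return out
```

Each W/16 × H/16 data block can be encoded by one thread exactly as above, since runs never cross block boundaries in this scheme.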
The beneficial effects of the invention are as follows: the invention computes the optimal downsampled video image through an improved downsampling algorithm to raise the overall LCEVC video coding quality, and solves the improved downsampling problem block-wise to reduce computational complexity and memory usage. It realizes a real-time LCEVC video coding system with parallel optimization of the upsampling, improved downsampling, transform and quantization, inverse transform and inverse quantization, and entropy coding modules, reducing coding time, raising the effective utilization of hardware resources such as the CPU and GPU, and improving coding efficiency. The invention thus realizes real-time low-complexity enhanced video coding on a PC.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a block flow diagram of the system of the present invention;
FIG. 2 is a flow chart of the parallel optimization of upsampling in step S3 of the method of the present invention;
FIG. 3 is a flow chart of the parallel optimization of transform and quantization in step S3 of the method according to the present invention;
FIG. 4 is a schematic diagram illustrating the butterfly algorithm in step S3 of the method of the present invention;
fig. 5 is a flow chart of the parallel optimization of entropy coding in step S3 of the method of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and embodiments may be combined with each other without conflict.
The drawings are provided for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
As shown in fig. 1, the present invention provides a method for realizing LCEVC video coding optimization, comprising the following steps:
s1: for a given input video, a block downsampling method based on Modified cubic interpolation information is adopted to improve the output video quality of the downsampling module in LCEVC, and thereby the image quality of the LCEVC-coded video. The specific steps are as follows:
s11: let the input video sequence be Y and the downsampled image be X; the optimal downsampled output X^* is obtained by the block downsampling method based on Modified cubic interpolation information. The calculation proceeds as follows:
Let Y' be the image interpolated from X, H the interpolation coefficient matrix, and \Phi the constant vector formed by the products of interpolation coefficients and the corresponding pixel values, so that Y' can be expressed as:
Y' = HX + \Phi
The optimal downsampling objective function can be expressed as:
X^* = \arg\min_X \|Y - HX - \Phi\|_2^2
Differentiating with respect to X gives:
\frac{\partial}{\partial X}\|Y - HX - \Phi\|_2^2 = 2H^T(HX + \Phi - Y)
Setting this expression to zero yields the closed-form optimal solution:
X^* = (H^T H)^{-1}[H^T(Y - \Phi)]
where (H^T H)^{-1} is the inverse of H^T H, which is a full-rank matrix.
Since the size of H grows with the image size, H occupies a large amount of memory in the program, and the cost of computing (H^T H)^{-1} grows with H. To reduce the complexity and the memory footprint, Y is processed in 8 × 8 blocks, exploiting the fact that the Modified cubic interpolation algorithm interpolates each new pixel from its 16 surrounding pixels. When Y is an 8 × 8 block, the interpolation matrix H has size 16 × 64, the constant vector \Phi is a column vector of length 64, and the value of \Phi depends only on the elements corresponding to boundary pixels.
S12: finally, all 4 × 4 optimal downsampled blocks X^* are combined in block order to obtain the final downsampled output.
S2: adding the block downsampling method based on Modifiedcubic interpolation information in S1 into an LCEVC encoder, respectively counting time consumption of each module in the LCEVC and analyzing the time consumption, and designing an LCEVC encoder frame based on a CPU-GPU heterogeneous platform according to the time consumption analysis result. The method comprises the following specific steps:
s21: and embedding a block downsampling method based on Modifiedcubic interpolation information into an LCEVC encoder.
S22: and selecting a group of video sequences, testing the LCEVC encoder on the multi-core CPU, and carrying out 10 times of serial running test for averaging to obtain the average consumed time of each frame so as to obtain the consumed time ratio of each module of the LCEVC encoder.
S23: through the time consumption ratio of each module of the LCEVC encoder in S22, large-scale parallel processing can be performed based on the GPU, the CPU is more suitable for the characteristic of logic control, and an LCEVC encoder frame based on the CPU-GPU heterogeneous platform is designed, wherein the encoding frame takes both the GPU and the computing resources of the CPU into consideration. The communication between the CPU and the GPU is mainly realized by adopting a data copying mode, and the part of the CPU mainly responsible for processing comprises:
1) the video coding and decoding device is responsible for reading image data, reading an input video sequence and Y, U, V component data generated by a basic encoder into a CPU memory, copying the data by the CPU and then transmitting the copied data to a video memory of the GPU.
2) And the system is responsible for reading the coding configuration file, such as information of image name, resolution size, bit depth, up-down sampling mode, quantization parameter and the like.
3) And the system is responsible for outputting the coded video of the basic encoder, outputting a reconstructed video sequence and outputting a code stream, and is responsible for calculating the PSNR of the reconstructed video and scheduling the GPU thread.
For the modules with lower data correlation such as up-sampling, down-sampling, transformation, quantization, inverse transformation, inverse quantization and entropy coding, corresponding parallel optimization algorithms are designed according to different algorithm processing processes of the modules.
S3: and performing parallel optimization design on the up-sampling, improved down-sampling, transformation and quantization and entropy coding modules in the LCEVC standard according to the LCEVC encoder frame based on the CPU-GPU heterogeneous platform in S2, and realizing real-time low-complexity enhanced video coding. The method can improve the quality of the coded video of the LCEVC, shortens the coding time, and improves the effective utilization rate of hardware resources such as a CPU (Central processing Unit), a GPU (graphics processing Unit) and the like. The method comprises the following specific steps:
s31: as shown in fig. 2, the upsampling module is parallelized. As can be known from the Modifiedcubic interpolation process, the interpolation process only includes addition, multiplication and shift calculation, and the three operators already have corresponding hardware adders on the GPU. Therefore, the GPU can be adopted to implement the upsampling interpolation in parallel. The shared memory is an on-chip memory of the GPU, and can be used for quickly accessing data and realizing data communication among thread blocks. Based on the above, the specific implementation method of the up-sampling parallelization is as follows:
1) and copying the image needing to be up-sampled from a CPU memory to a global memory of the GPU, and copying the interpolation coefficient matrix to a constant memory.
2) Filling the upper and lower boundaries of an input image, considering that the data amount processed by each thread is basically the same after the image is filled, averagely dividing the filled image into 64 blocks, wherein the height of each block is U _ PicHeight/8, and the width of each block is U _ PicWidth/8, wherein U _ PicHeight and U _ PicWidth respectively represent the width and the length of a video image to be interpolated, and then respectively reading the image into a shared memory according to the blocks.
3) Thread blocks and threads are allocated. The number of thread blocks allocated is 8 and the number of threads is 8.
4) And each thread reads a pixel value required in the shared memory, reads an interpolation coefficient from the constant memory for interpolation operation, and stores an interpolation result of each block into the global memory after synchronization of the cudaDeviceSynchronze () function. And finally, copying an up-sampling result in the global memory of the GPU to a CPU memory for subsequent data processing.
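As a rough CPU-side illustration of why the interpolation maps well to GPU hardware, the sketch below performs 2× up-sampling of one row using only addition, multiplication and shift operations. The 4-tap kernel (−1, 9, 9, −1)/16 is an illustrative stand-in, not the actual Modified cubic coefficient matrix used by the encoder:

```python
# 2x up-sampling of one image row with an illustrative 4-tap kernel.
# Only addition, multiplication and a right shift are used, matching
# the operator set described for Modified cubic interpolation.
# The kernel (-1, 9, 9, -1)/16 is a stand-in, NOT the patent's matrix.
def upsample_1d(row):
    out = []
    n = len(row)
    for i in range(n):
        out.append(row[i])                        # even phase: copy sample
        p0 = row[max(i - 1, 0)]                   # boundary padding by clamping
        p1 = row[i]
        p2 = row[min(i + 1, n - 1)]
        p3 = row[min(i + 2, n - 1)]
        # odd phase: (-p0 + 9*p1 + 9*p2 - p3 + 8) >> 4, rounded
        v = (-p0 + 9 * p1 + 9 * p2 - p3 + 8) >> 4
        out.append(max(0, min(255, v)))           # clip to the 8-bit range
    return out
```

On a flat row the interpolated samples reproduce the input exactly, and the output is twice as long as the input.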
S32: the downsampled input image is divided into 8 x 8 blocks for parallel processing. In the GPU, the number of allocated blocks is Pic _ Height/8, and the number of threads is Pic _ Width/8. Where Pic _ Height and Pic _ Width represent the Height and Width, respectively, of the downsampled input picture. The parallel processing of the down sampling is basically the same as the parallel processing flow of the up sampling, and the difference is that the executed down sampling algorithm and the interpolation algorithm are different.
S33: as shown in fig. 3, the transformation and quantization modules are parallelized. During the transformation, the LCEVC provides two transformation modes, 2 × 2 transformation or 4 × 4 transformation. Before parallel optimization, butterfly transformation is adopted for a transformation mode in the LCEVC to reduce the calculation times. The 2 × 2 transform formula can be rewritten as follows:
Figure BDA0003601841770000081
a 2 x 2 transform, if calculated according to conventional algorithms, requires a total of 12 additions and 6 multiplications. As shown in fig. 4, only 8 additions and 4 multiplications are needed through matrix transformation of butterfly transform, which not only saves hardware resources, but also reduces computational complexity. The input data of the transformation process in LCEVC is the prediction residual, and for 2 × 2 transformation or 4 × 4 transformation, the transformed coefficients are resolved into layers. For 4 × 4 transform, the residual data is first divided into 4 × 4 blocks, then each block is subjected to butterfly transform according to a transform matrix, and finally the data of the transform coefficients is parsed into layers. Data between layers are independent and have no correlation, so that parallel calculation can be adopted to optimize a quantization module after transformation. Since the data is parsed into 16 layers after 4 × 4 transformation, the number of blocks allocated in the GPU is 4, and the number of threads is 4. The specific steps of implementing the 4 × 4 transform on the GPU include:
1) Copy the residual data from the CPU into the GPU global memory, then divide the original image into 4 × 4 blocks.
2) Allocate the blocks and threads: 4 blocks with 4 threads each.
3) Each thread reads its block's data from global memory according to its assigned thread number, performs the butterfly transform on each block, and parses the resulting transform coefficients into layers. During this processing, synchronization with the __syncthreads() function is required. The input to the quantization stage is the transform coefficients, i.e. T_in[layer][y][x]; for the 4 × 4 transform, layer ranges over 16 values. The data within each layer are unrelated, so there is no dependency: the quantization of each transform coefficient can be completed independently, and quantization can be optimized in a fully parallel manner.
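The butterfly factorization referred to above can be sketched as follows. The Hadamard-style 2 × 2 directional-decomposition matrix is assumed here (scaling factors omitted); the butterfly version reaches the same four coefficients with two stages of additions:

```python
# 2x2 directional-decomposition transform: direct form vs. butterfly.
# The Hadamard-style matrix is an assumption matching the rewritten
# formula in the text; scaling factors are omitted for clarity.
def dd2x2_direct(a, b, c, d):
    return (a + b + c + d,       # average
            a - b + c - d,       # horizontal
            a + b - c - d,       # vertical
            a - b - c + d)       # diagonal

def dd2x2_butterfly(a, b, c, d):
    s0, s1 = a + b, a - b        # first butterfly stage
    s2, s3 = c + d, c - d
    return (s0 + s2, s1 + s3,    # second stage: 8 additions in total
            s0 - s2, s1 - s3)
```

The butterfly form shares the partial sums s0..s3 between output coefficients, which is exactly where the operation-count saving comes from.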
S34: the parallel principle of inverse transformation and transformation is completely the same, and the principle of inverse quantization and quantization is also completely the same. Therefore, the parallel optimization method for inverse transformation and inverse quantization is the same as S33.
S35: as shown in fig. 5, run-length coding and huffman coding in the entropy coding module are processed in parallel. The quantized output data is distinguished by layer, but the size of the data amount in layer is different because of the difference between the luminance component and the chrominance component. In order to reduce program latency due to some threads' inconsistent data lengths, the data should be averaged as much as possible to improve the efficiency of parallel encoding. The larger data block in the layer is divided into data of the same size as W/16 XH/16 based on U1 and V1 in the L-1 layer. It can be considered that the number of allocated blocks is 16, the number of allocated threads is 30, and one thread is used to process one data block with the size of W/16 × H/16. The specific implementation method of entropy coding parallel optimization based on the GPU comprises the following steps:
1) and copying and transferring the output data needing quantization from the CPU memory to the global memory of the GPU.
2) The data input by entropy coding is read into the shared memory in blocks according to W/16 × H/16 data blocks.
3) And each thread reads required data from the shared memory to perform run length coding and Huffman coding, and transmits the coded data into the global memory in sequence after all threads finish processing by using a cudaDeviceSynchronze () function.
Entropy coding results are obtained through GPU parallel calculation, entropy coding data are copied and transmitted to a memory of a CPU, and entropy coding output data can be directly used when a binary code stream file is written.
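The per-thread entropy coding scheme can be sketched serially as below, with one `rle` call standing in for each GPU thread and the ordered gather standing in for the copy after synchronization. Huffman coding of the run-length pairs is omitted for brevity, and the list-based representation is an illustrative assumption, not the encoder's actual bitstream format:

```python
# Sketch of the parallel entropy-coding scheme: each equally sized
# data block is run-length coded independently (here serially, one
# call per "GPU thread"), and the per-block results are gathered in
# block order, mirroring the ordered write-back after synchronization.
def rle(block):
    out = []
    for v in block:
        if out and out[-1][0] == v:
            out[-1][1] += 1              # extend the current run
        else:
            out.append([v, 1])           # start a new [value, run] pair
    return out

def parallel_entropy_code(blocks):
    coded = [rle(b) for b in blocks]     # independent per-block coding
    result = []
    for c in coded:                      # gather results in block order
        result.extend(c)
    return result
```

Because each block is coded independently, the per-block calls have no shared state, which is what makes the one-thread-per-block mapping valid.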
S36: and the parallel optimization algorithms from S31 to S34 are realized on a GPU platform at the PC end, so that the acceleration of an LCEVC encoder is realized, and the real-time low-complexity enhanced video encoding is realized.
Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.

Claims (7)

1. A method for realizing LCEVC video coding optimization, characterized in that the method comprises the following steps:
s1: for a given input video, obtaining the optimal down-sampling output by a block down-sampling method based on Modified cubic interpolation information;
s2: embedding the block down-sampling method based on Modified cubic interpolation information of step S1 into an LCEVC encoder, measuring the time consumed by each module in LCEVC, analyzing the time consumption, and designing an LCEVC encoder framework based on a CPU-GPU heterogeneous platform according to the analysis result;
s3: parallelizing an up-sampling module, an improved down-sampling module, a transformation module, an inverse quantization module, and an entropy coding module of the LCEVC standard according to the CPU-GPU heterogeneous LCEVC encoder framework of S2, and achieving real-time low-complexity enhancement video coding on the PC.
2. The method for realizing LCEVC video coding optimization according to claim 1, characterized in that: in step S1, assuming that the input video sequence is Y and the down-sampled image is X, the optimal down-sampled output X* is obtained by the block down-sampling method based on Modified cubic interpolation information, specifically comprising the following steps:
s11: let Y′ be the image after interpolating X, H the interpolation coefficient, and Φ the matrix formed by the products of the interpolation coefficients and the corresponding pixel values; the closed-form solution for the optimal X is:
X* = (H^T H)^(-1) [H^T (Y − Φ)]
wherein (H^T H)^(-1) is the inverse matrix of (H^T H) and (H^T H)^(-1) is a full-rank matrix;
Y is processed in 8 × 8 blocks;
s12: all 4 × 4 optimal down-sampling blocks X* are combined in block order to obtain the final down-sampled image output.
3. The method for realizing LCEVC video coding optimization according to claim 1, characterized in that step S2 specifically comprises the following steps:
s21: embedding the block down-sampling method based on Modified cubic interpolation information of step S1 into the LCEVC encoder;
s22: selecting a group of video sequences, running the LCEVC encoder serially multiple times on the multi-core CPU, and averaging to obtain the mean time per frame, thereby obtaining the time-consumption ratio of each LCEVC encoder module;
s23: designing an LCEVC encoder framework based on a CPU-GPU heterogeneous platform using the per-module time-consumption ratios of step S22, wherein communication between the CPU and the GPU is implemented by data copying, and the part handled by the CPU comprises:
1) reading the image data: reading the input video sequence and the Y, U, V component data generated by the base encoder into CPU memory, then copying the data to the GPU video memory;
2) reading the coding configuration file;
3) outputting the base-encoder coded video, the reconstructed video sequence, and the bitstream, computing the reconstructed video PSNR (peak signal-to-noise ratio), and scheduling the GPU threads;
and for the modules with low data dependency, designing corresponding parallel optimization algorithms according to each module's processing flow.
4. The method for realizing LCEVC video coding optimization according to claim 1, characterized in that step S3 specifically comprises:
s31: parallelizing the up-sampling module, implementing the up-sampling interpolation in parallel on the GPU;
s32: dividing the down-sampled input image into 8 × 8 blocks for parallel processing; in the GPU, the number of allocated blocks is Pic_Height/8 and the number of threads is Pic_Width/8, where Pic_Height and Pic_Width denote the height and width of the down-sampled input image;
s33: parallelizing the transformation and quantization modules;
s34: applying to the inverse transform and inverse quantization the same parallel optimization as in S33;
s35: processing the run-length coding and Huffman coding in the entropy coding module in parallel;
s36: implementing the parallel optimization algorithms of S31 to S35 on the GPU platform of a PC, accelerating the LCEVC encoder and achieving real-time low-complexity enhancement video coding.
5. The method for realizing LCEVC video coding optimization according to claim 4, characterized in that in step S31, the up-sampling is parallelized as follows:
s311: copying the image to be up-sampled from CPU memory to the GPU global memory, and copying the interpolation coefficient matrix to constant memory;
s312: padding the top and bottom boundaries of the input image, dividing the padded image evenly into 64 blocks, each of height U_PicHeight/8 and width U_PicWidth/8, where U_PicHeight and U_PicWidth denote the height and width of the video image to be interpolated, then reading the image into shared memory block by block;
s313: allocating thread blocks and threads, the number of allocated blocks being 8 and the number of threads being 8;
s314: each thread reading the required pixel values from shared memory and the interpolation coefficients from constant memory for the interpolation operation, and storing each block's interpolation result in global memory after synchronization with the cudaDeviceSynchronize() function;
s315: finally, copying the up-sampling result from the GPU global memory to CPU memory for subsequent data processing.
6. The method for realizing LCEVC video coding optimization according to claim 4, characterized in that in step S33, LCEVC provides two transform modes, a 2 × 2 transform and a 4 × 4 transform; before the parallel optimization, a butterfly structure is applied to the LCEVC transform; the 2 × 2 transform is rewritten as follows:
C_00 = r_00 + r_01 + r_10 + r_11
C_01 = r_00 − r_01 + r_10 − r_11
C_10 = r_00 + r_01 − r_10 − r_11
C_11 = r_00 − r_01 − r_10 + r_11
where r_00, r_01, r_10, r_11 are the 2 × 2 residual samples and C_00, C_01, C_10, C_11 are the average, horizontal, vertical, and diagonal coefficients, respectively;
the input data of the transform stage in LCEVC is the prediction residual, and for both the 2 × 2 and the 4 × 4 transform, the transform coefficients are parsed into layers after the transform;
for the 4 × 4 transform, the residual data is divided into 4 × 4 blocks, each block is butterfly-transformed according to the transform matrix, and finally the transform coefficient data is parsed into layers; after the transform, the quantization module is optimized with parallel computation; the number of blocks allocated in the GPU is 4 and the number of threads is 4; the specific steps of the 4 × 4 transform on the GPU comprise:
1) copying the residual data from the CPU into the GPU global memory, and dividing the original image into 4 × 4 blocks;
2) allocating the blocks and threads, the number of blocks being 4 and the number of threads being 4;
3) each thread reading its block's data from global memory according to its assigned thread number, performing the butterfly transform on each block, and parsing the resulting transform coefficients into layers; during data processing, the __syncthreads() function is used for synchronization; the input of the quantization stage is the transform coefficients T_in[layer][y][x]; for the 4 × 4 transform, layer ranges over 16 values, and quantization is optimized in a fully parallel manner.
7. The method for realizing LCEVC video coding optimization according to claim 4, characterized in that in step S35, taking U1 and V1 of the L-1 layer as the reference, the larger data blocks in each layer are divided into blocks of the same W/16 × H/16 size, the number of allocated blocks is 16, the number of allocated threads is 30, and one thread processes one W/16 × H/16 data block; the GPU-based parallel optimization of entropy coding is implemented as follows:
1) copying the quantized output data from CPU memory to the GPU global memory;
2) reading the entropy coding input data into shared memory in W/16 × H/16 blocks;
3) each thread reading the required data from shared memory to perform run-length coding and Huffman coding, waiting for all threads to finish using the cudaDeviceSynchronize() function, and writing the coded data into global memory in order;
and obtaining the entropy coding result through parallel GPU computation, copying the entropy-coded data to CPU memory, and using the entropy coding output directly when writing the binary bitstream file.
CN202210447137.6A 2022-04-18 2022-04-18 Method for realizing LCEVC video coding optimization Active CN114827614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210447137.6A CN114827614B (en) 2022-04-18 2022-04-18 Method for realizing LCEVC video coding optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210447137.6A CN114827614B (en) 2022-04-18 2022-04-18 Method for realizing LCEVC video coding optimization

Publications (2)

Publication Number Publication Date
CN114827614A true CN114827614A (en) 2022-07-29
CN114827614B CN114827614B (en) 2024-03-22

Family

ID=82507961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210447137.6A Active CN114827614B (en) 2022-04-18 2022-04-18 Method for realizing LCEVC video coding optimization

Country Status (1)

Country Link
CN (1) CN114827614B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102572436A (en) * 2012-01-17 2012-07-11 西安电子科技大学 Intra-frame compression method based on CUDA (Compute Unified Device Architecture)
CN104574277A (en) * 2015-01-30 2015-04-29 京东方科技集团股份有限公司 Image interpolation method and image interpolation device
CN104869398A (en) * 2015-05-21 2015-08-26 大连理工大学 Parallel method of realizing CABAC in HEVC based on CPU+GPU heterogeneous platform
CN107135392A (en) * 2017-04-21 2017-09-05 西安电子科技大学 HEVC motion search parallel methods based on asynchronous mode
CN109391816A (en) * 2018-10-26 2019-02-26 大连理工大学 The method for parallel processing of HEVC medium entropy coding link is realized based on CPU+GPU heterogeneous platform
CN109495743A (en) * 2018-11-15 2019-03-19 上海电力学院 A kind of parallelization method for video coding based on isomery many places platform
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ANDREAS HEINDEL; EUGEN WIGE; ANDRÉ KAUP: "Low-Complexity Enhancement Layer Compression for Scalable Lossless Video Coding Based on HEVC", IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 8, 20 April 2016 (2016-04-20), pages 1749, XP011658058, DOI: 10.1109/TCSVT.2016.2556338 *
JIN-TAI SHANGGUAN; YAN-LING LI; YONG-GANG WANG; HUI-LING LI: "Fast algorithm of modified cubic convolution interpolation", 2011 4th International Congress on Image and Signal Processing, 12 December 2011 (2011-12-12), pages 1-4 *
NEWSHA ARDALANI; CLINT LESTOURGEON; KARTHIKEYAN SANKARALINGAM; XIAOJIN ZHU: "Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance", 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 16 February 2017 (2017-02-16), pages 1-13 *
DING YANG: "Research on Real-Time Low-Complexity Enhancement Video Coding", China Master's Theses Full-text Database, Information Science and Technology, 15 June 2023 (2023-06-15) *
YUAN SANNAN; WANG MENGBIN; ZHANG YANQIU; TAO QIANYUN: "Research on Parallelization Complexity and Rate-Distortion of Video Coding Based on Heterogeneous Multiprocessing Platforms", Journal of Shanghai University of Electric Power, vol. 37, no. 03, 7 May 2021 (2021-05-07), pages 271-276 *
ZHAO QING: "Research on Key Technologies of H.265/HEVC Encoder Based on CPU+GPU Heterogeneous Platform", China Dissertations Full-text Database, 31 October 2019 (2019-10-31) *

Also Published As

Publication number Publication date
CN114827614B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
Djelouah et al. Neural inter-frame compression for video coding
CN100512431C (en) Method and apparatus for encoding and decoding stereoscopic video
US11863769B2 (en) Bandwidth saving architecture for scalable video coding
WO2019019690A1 (en) Image processing method, device, and system
US8396122B1 (en) Video codec facilitating writing an output stream in parallel
KR20080066720A (en) Method and apparatus for using random field models to improve picture and video compression and frame rate up conversion
CN1549459A (en) Digital portable terminal and digital signal processing equipment
CN1695381A (en) Sharpness enhancement in post-processing of digital video signals using coding information and local spatial features
EP2232884A1 (en) Intra frame encoding using programmable graphics hardware
CN104205845A (en) LM mode with uniform bit-width multipliers
CN111741302B (en) Data processing method and device, computer readable medium and electronic equipment
JP2002531973A (en) Image compression and decompression
US10368071B2 (en) Encoding data arrays
JP2023513564A (en) Using Tiered Hierarchical Coding for Point Cloud Compression
KR20030081403A (en) Image coding and decoding method, corresponding devices and applications
CN113994685A (en) Exchanging information in scalable video coding
TWI672941B (en) Method, apparatus and system for processing picture
CN105472442A (en) Out-chip buffer compression system for superhigh-definition frame rate up-conversion
Mei et al. VLSI design of a high-speed and area-efficient JPEG2000 encoder
Abhayaratne et al. Scalable watermark extraction for real-time authentication of JPEG 2000 images
CN114827614B (en) Method for realizing LCEVC video coding optimization
JP2007505545A (en) Scalable signal processing method and apparatus
CN103379349B (en) A kind of View Synthesis predictive coding method, coding/decoding method, corresponding device and code stream
KR100381204B1 (en) The encoding and decoding method for a colored freeze frame
CN1214597A (en) Mode coding method in binary shape encoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant