Disclosure of Invention
It is an object of the present application to overcome, or at least partially solve or mitigate, the above problems.
According to an aspect of the present application, there is provided a method for creating a video luminance and chrominance fractional interpolation model, including:
acquiring various different types of original video data, acquiring an image frame sequence in the original video data, and generating an original image set based on the image frame sequence;
preprocessing the original image set to create a training image set;
constructing a deep convolutional neural network model;
and taking the training image set as input data of the deep convolutional neural network model, and taking the original image set as corresponding true values to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for fractional interpolation of video luminance and chrominance.
Optionally, the pre-processing the original image set to create a training image set includes:
coding the integer pixel position images in the original image set, and learning and reconstructing the mapping between the integer-position video and the fractional-position video, so as to respectively generate a half-pixel image set and a quarter-pixel image set based on the integer pixel position images;
and creating a training image set based on the half-pixel image set and the quarter-pixel image set.
Optionally, the constructing the deep convolutional neural network model includes:
constructing a first convolutional neural network for performing luminance component interpolation on the image frames in the training image set and a second convolutional neural network for performing chrominance component interpolation on the image frames in the training image set; wherein the first convolutional neural network and the second convolutional neural network both have a customized context model therein;
and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
Optionally, the training of the deep convolutional neural network with the training image set as input data of the deep convolutional neural network model and the original image set as corresponding true values includes:
inputting the training image set into the deep convolutional neural network model;
and performing motion estimation and motion compensation respectively on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with a discrete cosine transform-based interpolation filter, and selecting a fractional pixel interpolation method based on a rate-distortion cost to perform interpolation of the luminance component and the chrominance component.
Optionally, in the fractional pixel interpolation, integer pixel positions remain unchanged, and only fractional pixel positions are generated.
According to another aspect of the present application, there is provided a system for creating a video luma and chroma fractional interpolation model, comprising:
a raw image set generation module configured to acquire raw video data of a plurality of different types, acquire an image frame sequence in the raw video data, and generate a raw image set based on the image frame sequence;
a training image set creating module configured to perform a preprocessing operation on the original image set to create a training image set;
a model construction module configured to construct a deep convolutional neural network model;
and a model training module configured to take the training image set as input data of the deep convolutional neural network model, and take the original image set as corresponding true values to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for fractional interpolation of video luminance and chrominance.
Optionally, the training image set creating module is further configured to:
coding the integer pixel position images in the original image set, and learning and reconstructing the mapping between the integer-position video and the fractional-position video, so as to respectively generate a half-pixel image set and a quarter-pixel image set based on the integer pixel position images;
and creating a training image set based on the half-pixel image set and the quarter-pixel image set.
Optionally, the model building module is further configured to:
constructing a first convolutional neural network for performing luminance component interpolation on the image frames in the training image set and a second convolutional neural network for performing chrominance component interpolation on the image frames in the training image set; wherein the first convolutional neural network and the second convolutional neural network both have a customized context model therein;
and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
Optionally, the model training module is further configured to:
inputting the training image set into the deep convolutional neural network model;
and carrying out motion estimation and motion compensation respectively on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with a discrete cosine transform-based interpolation filter, and selecting a fractional pixel interpolation method based on a rate-distortion cost to carry out interpolation of the luminance component and the chrominance component.
Optionally, in the fractional pixel interpolation, integer pixel positions remain unchanged, and only fractional pixel positions are generated.
According to yet another aspect of the present application, there is also provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor, when executing the computer program, implements the method of creating a video luma and chroma fractional interpolation model as defined in any of the above.
According to yet another aspect of the present application, there is also provided a computer readable storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of creating a video luma and chroma fractional interpolation model as defined in any one of the above.
The method first collects original video data of various types, obtains an image frame sequence from the original video data, and generates an original image set based on the image frame sequence; a preprocessing operation is then performed on the original image set to create a training image set; a deep convolutional neural network model is then constructed, the training image set is taken as input data, and the original image set is taken as corresponding true values to train the deep convolutional neural network, so as to obtain a deep convolutional neural network model suitable for fractional interpolation of video luminance and chrominance.
Based on the method and the system for creating a video luminance and chrominance fractional interpolation model, which build on the conventional super-resolution convolutional neural network, CNN-based luminance and chrominance fractional interpolation can effectively reduce the training time of the deep convolutional neural network model and improve the stability against motion deviation, thereby improving video coding efficiency.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Detailed Description
The Super-Resolution Convolutional Neural Network (SRCNN) is the first learning-based CNN model for super-resolution reconstruction; it learns the mapping between a low-resolution input image and a high-resolution output image and outperforms the conventional bicubic method. The VDSR (Very Deep network for Super-Resolution) model is an optimization of SRCNN that aims to learn image details with a 20-layer CNN model so as to improve the quality of the low-resolution input.
Although CNN-based techniques have achieved superior performance compared to traditional super-resolution methods, they cannot be directly applied to fractional interpolation in video coding.
Fig. 1 is a flowchart illustrating a method for creating a video luminance and chrominance fractional interpolation model according to an embodiment of the present application. Referring to fig. 1, a method for creating a video luminance and chrominance fractional interpolation model provided in an embodiment of the present application may include:
step S101: acquiring various different types of original video data, acquiring an image frame sequence in the original video data, and generating an original image set based on the image frame sequence;
step S102: preprocessing an original image set to create a training image set;
step S103: constructing a deep convolutional neural network model;
step S104: taking the training image set as input data of the deep convolutional neural network model, and taking the original image set as corresponding true values to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for video luminance and chrominance fractional interpolation.
The method first collects original video data of various different types, obtains an image frame sequence from the original video data, and generates an original image set based on the image frame sequence; a preprocessing operation is then performed on the original image set to create a training image set; a deep convolutional neural network model is then constructed, the training image set is taken as input data, and the original image set is taken as corresponding true values to train the deep convolutional neural network, so as to obtain a deep convolutional neural network model suitable for fractional interpolation of video luminance and chrominance.
Based on the method for creating a video luminance and chrominance fractional interpolation model, which builds on the conventional super-resolution convolutional neural network, CNN-based luminance and chrominance fractional interpolation can effectively reduce the training time of the deep convolutional neural network model and improve the stability against motion deviation, thereby improving video coding efficiency. Steps S101-S104 are described in detail below.
Firstly, step S101 is executed to acquire different types of raw video data, and acquire an image frame sequence therein to generate a raw image set. The original video data may include video data of multiple videos of different video contents, different types and/or different resolutions, which is not limited in this application.
After the raw video data are acquired, an image frame sequence of any piece of the raw video data can be obtained, and a raw image set is then generated based on the image frame sequences. For a single piece of video data, the image frame sequence may be formed by concatenating the image frames that constitute the video; the raw image set is generated from the image frame sequences of a plurality of pieces of video data.
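A minimal sketch of this step is given below, assuming OpenCV is used to decode the videos; the file paths, sampling step, and function name are illustrative only and are not prescribed by the embodiment.

```python
# Sketch (assumption): extract image frame sequences from raw video files
# with OpenCV and collect the frames of several videos into one raw image set.
import cv2

def build_raw_image_set(video_paths, frame_step=1):
    raw_image_set = []
    for path in video_paths:
        capture = cv2.VideoCapture(path)
        frame_index = 0
        while True:
            ok, frame = capture.read()
            if not ok:                       # end of this video
                break
            if frame_index % frame_step == 0:
                raw_image_set.append(frame)  # one image frame (H x W x 3, BGR)
            frame_index += 1
        capture.release()
    return raw_image_set

# Usage: frames from several videos of different contents/resolutions
# raw_images = build_raw_image_set(["video_a.mp4", "video_b.mp4"])
```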
Referring to step S102, after the original image set is obtained, the original image set is preprocessed to create a training image set.
The training set plays a crucial role in training any network and determines how the network performs at test time. For CNN-based super-resolution, a popular training set uses low-resolution images as the input and the corresponding original high-resolution images as the true labels. The usual way of creating the CNN input is to downsample the original image and then upsample the resulting low-resolution image. For the purpose of reducing the bit rate, the MCP (motion compensated prediction) operation requires fractional-pixel images to be predicted from reconstructed image frames, which makes the super-resolution training set unsuitable for the fractional interpolation task. Moreover, integer pixels may change after convolution and there are no fractional pixels in the actual image, which is also a problem in creating the training set. The training set for the video luminance and chrominance fractional interpolation model provided in the embodiment of the application is used for performing fractional interpolation by applying the CNN-based super-resolution idea in video coding. When the video luminance and chrominance fractional interpolation model created in the embodiment of the application performs fractional interpolation, the integer pixel positions are unchanged and only the fractional pixel positions are generated.
In an alternative embodiment of the present application, the preprocessing operation performed on the original image set to create the training image set may include: coding the integer pixel position images in the original image set, and learning and reconstructing the mapping between the integer-position video and the fractional-position video, so as to respectively generate a half-pixel image set and a quarter-pixel image set based on the integer pixel position images; and then creating a training image set based on the half-pixel image set and the quarter-pixel image set.
That is, in the preprocessing operation performed on the original image set in the embodiment of the present application, the integer-position images may be encoded, and the mapping between the integer-position video and the fractional-position video is learned and reconstructed. To address the problem that integer pixels change after convolution, a dual-resolution image containing integer and half pixels is generated, together with a mask that constrains the integer pixels after convolution. In other words, the integer pixel position images are preprocessed to generate half-pixel and quarter-pixel versions of the integer pixel images, respectively.
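One possible arrangement of such a dual-resolution sample and its integer-pixel mask is sketched below; the grid layout, shapes, and helper name are assumptions for illustration, not the exact preprocessing of the embodiment.

```python
# Sketch (assumption): place the integer-pixel image and its half-pel targets
# on one double-resolution grid, plus a mask that marks the integer positions
# so they can be kept unchanged after convolution.
import numpy as np

def make_half_pel_sample(integer_image, half_pel_target):
    """integer_image: (H, W) reconstructed integer-position image.
    half_pel_target: (2H, 2W) ground-truth image at half-pel resolution."""
    h, w = integer_image.shape
    dual = np.array(half_pel_target, dtype=np.float32, copy=True)
    mask = np.zeros((2 * h, 2 * w), dtype=np.float32)
    # Integer pixels sit at the even coordinates of the double-resolution grid.
    dual[0::2, 0::2] = integer_image
    mask[0::2, 0::2] = 1.0   # 1 = integer position (fixed), 0 = fractional position
    return dual, mask
```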
Step S103 is executed to construct a deep convolutional neural network model.
Specifically, when the deep convolutional neural network model is constructed, a first convolutional neural network for performing luminance component interpolation on image frames in a training image set and a second convolutional neural network for performing chrominance component interpolation on the image frames in the training image set can be constructed; the first convolutional neural network and the second convolutional neural network both have self-defined context models; and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
In the deep convolutional neural network in the embodiment of the present application, the entire network architecture includes 20 convolutional layers. Each layer applies 64 convolution filters of size 3 x 3 with a stride of 1. The padding is set to 1 to preserve the size of the input image after convolution. For each convolutional layer except the last, a ReLU activation layer is placed after the convolution. The learning rate starts at 0.1 and is decreased every ten epochs; training stops at 50 epochs, and the batch size is 128.
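A minimal sketch of such a network is shown below, assuming a PyTorch implementation; the single-channel input/output and the learning-rate decay factor are assumptions not stated in the description.

```python
# Sketch: a 20-layer CNN matching the description above -- 64 filters of 3x3
# per layer, stride 1, padding 1, ReLU after every layer except the last.
import torch
import torch.nn as nn

class FractionalInterpolationCNN(nn.Module):
    def __init__(self, num_layers=20, num_filters=64, in_channels=1, out_channels=1):
        super().__init__()
        layers = [nn.Conv2d(in_channels, num_filters, 3, stride=1, padding=1),
                  nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(num_filters, num_filters, 3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(num_filters, out_channels, 3, stride=1, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):          # x: (N, 1, H, W)
        return self.body(x)        # same spatial size as the input (padding = 1)

model = FractionalInterpolationCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Learning rate starts at 0.1 and is reduced every 10 epochs (50 epochs total);
# the decay factor 0.1 is an assumption.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```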
The context model improves the receptive field by utilizing the contextual information of an image region and increasing the depth of the network, so that more details can be extracted from the image features. Taking the context model in the first convolutional neural network as an example, the context model can extract feature data of an image frame input to the first convolutional neural network, and then perform interpolation luminance classification on the next image frame based on the extracted feature data, thereby further improving the resolution of the image.
In the process of coding image frames in the same video data, the MCP (motion compensated prediction) technique can be adopted to interpolate a reconstructed image frame to fractional precision and find the fractional pixels closest to the current frame to be coded. In addition, the selection between different coding modes is an effective tool for improving image quality, reducing the bit rate, or reducing the computational complexity of video coding standards (e.g., HEVC). The embodiment of the present application defines two context models for selecting the luminance and chrominance interpolation methods, in order to make use of the CNN and the discrete cosine transform interpolation filter mentioned below; because the CNN and the discrete cosine transform interpolation filter are able to handle different signals, the CNN-based fractional interpolation and the interpolation method for the luminance and chrominance components are integrated into HEVC.
Finally, step S104 is executed: the training image set is taken as input data of the deep convolutional neural network model, and the original image set is taken as corresponding true values to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for video luminance and chrominance fractional interpolation.
Specifically, the training image set is input into the deep convolutional neural network model; motion estimation and motion compensation are then performed respectively on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with a discrete cosine transform-based interpolation filter, and a fractional pixel interpolation method is selected based on the rate-distortion cost (RDO cost) to perform interpolation of the luminance component and the chrominance component.
The Discrete Cosine Transform (DCT) is mainly used for compressing data or images. The DCT converts a spatial-domain signal to the frequency domain and therefore has good decorrelation performance. The DCT itself is lossless and has symmetry. When the discrete cosine transform is applied to an original image, the energy of the transformed DCT coefficients is concentrated mainly in the upper-left corner, and most of the remaining coefficients are close to zero. A threshold operation is applied to the transformed DCT coefficients, and coefficients smaller than a certain value are set to zero, which corresponds to the quantization process in image compression; an inverse DCT operation is then performed to obtain the compressed image.
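A short sketch of this DCT-threshold-inverse-DCT procedure is given below; the threshold value is illustrative only.

```python
# Sketch: DCT-based image compression as described above -- forward DCT,
# zero out small coefficients (a simple threshold/quantization step),
# then inverse DCT to obtain the compressed image.
import numpy as np
from scipy.fft import dctn, idctn

def dct_compress(image, threshold=10.0):
    coeffs = dctn(image.astype(np.float64), norm="ortho")   # energy concentrates top-left
    coeffs[np.abs(coeffs) < threshold] = 0.0                 # discard small coefficients
    return idctn(coeffs, norm="ortho")                       # reconstructed (compressed) image
```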
The discrete cosine transform interpolation filter is an interpolation filter based on the DCT, namely the DCT-Based Interpolation Filter (DCTIF). Compared with conventional filters, DCTIF reduces complexity and greatly improves compression performance.
In the method provided in this embodiment, the training process of the deep convolutional neural network model is a process of performing interpolation training on image frames. For example, when an image frame Z is interpolated between an adjacent image frame M and an image frame N, the true values (such as luminance true values or chrominance true values) of the image frame N extracted from the original image set may be used as the true labels for the image frame Z, and the corresponding parameters for the image frame Z are then updated through a back-propagation algorithm to achieve the training purpose.
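A minimal sketch of one such training step is shown below, assuming the PyTorch model sketched earlier; tensor names and shapes are illustrative.

```python
# Sketch (assumption): one training step in which a DCTIF-interpolated
# fractional sample is the network input and the true fractional values
# extracted from the original image set serve as the label; parameters are
# updated by back-propagation.
import torch
import torch.nn as nn

criterion = nn.MSELoss()

def train_step(model, optimizer, dctif_input, true_label):
    optimizer.zero_grad()
    prediction = model(dctif_input)           # predicted fractional positions
    loss = criterion(prediction, true_label)  # compare against the original-frame truth
    loss.backward()                           # back-propagation
    optimizer.step()
    return loss.item()
```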
That is, in the embodiment of the present application, in order to make full use of DCTIF, the CNN-based fractional interpolation together with the luminance classification and chrominance component interpolation method may be integrated into HEVC. The best fractional pixel point is searched during encoding, and the Y component of the reference frame is interpolated by both DCTIF and CNN. In motion compensation, the Y component is interpolated by the method used in the motion search, and the U and V chrominance components are interpolated for the current frame to be encoded by CNN and DCTIF, respectively. As shown in fig. 2, the network architecture provided by this embodiment may include 20 convolutional layers; each convolutional layer uses 64 filters of size 3 × 3 with a stride of 1. The padding is set to 1, so the size of the input image is maintained after convolution. For each convolutional layer except the last, a ReLU activation layer is placed after the convolution.
In the training process, the interpolation score is mainly calculated for the original image, and the next frame with the minimum loss is selected according to the score for interpolation processing. The variables considered are the three YUV components, where Y denotes luminance and U and V denote chrominance. The input of the deep convolutional neural network model provided in the embodiment of the application may be the interpolated fractional samples and the true labels of the real fractional samples extracted from the original frame. Assuming that the reconstructed image x contains integer pixels and fractional-position frames y, DCTIF is applied to the reconstructed image x to obtain its fractional samples x_j', where j takes values from 1 to 15 for the luminance component and from 1 to 63 for the chrominance component. Each DCTIF-interpolated image x_j' is input into the CNN to obtain an output y_j', and the deep convolutional neural network model aims to learn the mapping between x_j' and y_j by minimizing the loss function. The formula for minimizing the loss function is as follows:
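A mean-squared-error formulation consistent with the definitions given below may be assumed, where F denotes the deep convolutional neural network with parameters θ (this notation is introduced only for this sketch):

$$\min_{\theta}\; L(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \left\lVert y_i' - y_i \right\rVert^{2}, \qquad y_i' = F\!\left(x_i';\,\theta\right)$$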
wherein y_i' represents the output of the deep convolutional neural network model, i.e., the predicted position, and y_i represents the actual position.
In the field of deep learning, the loss function is a very important concept. The loss function is used to measure the degree of disagreement between the predicted value and the true value of the model. The objective of the model is to minimize the loss function, so that the predicted value is as close as possible to the true value. In general, optimization algorithms can be used to find the minimum of the loss function.
The basic idea of motion estimation is to divide each frame of an image sequence into a number of non-overlapping macroblocks, regard the displacement of all pixels in a macroblock as identical, and then, for each macroblock, find the block most similar to the current block, i.e., the matching block, in a reference frame within a given search range according to a certain matching criterion; the relative displacement between the matching block and the current block is the motion vector. When the video is compressed, the current block can be completely restored by storing only the motion vector and the residual data. In motion estimation, the Y component (luminance component) is interpolated by DCTIF and CNN. For each interpolation method, the best fractional sample is selected and the best fractional MV is sent.
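A minimal full-search block-matching sketch is given below; the block size, search range, and SAD matching criterion are illustrative choices, not the specific search used by the embodiment.

```python
# Sketch: full-search block matching -- for a current block, search a window
# in the reference frame for the block with the smallest sum of absolute
# differences (SAD); the offset of that block is the motion vector.
import numpy as np

def block_match(current_block, reference, top, left, search_range=8):
    h, w = current_block.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > reference.shape[0] or x + w > reference.shape[1]:
                continue
            candidate = reference[y:y + h, x:x + w]
            sad = np.abs(current_block.astype(np.int64) - candidate.astype(np.int64)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)   # best matching block so far
    return best_mv, best_sad
```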
In the motion search, the Y component of the reference frame is interpolated by DCTIF and CNN. In motion compensation, the U and V components of the reconstructed frame (reconstructed image) are interpolated by CNN and DCTIF, and the Y component is interpolated by the method used in the motion search. The residual between the current CU and the predicted CU is calculated and coded with 2 bits indicating the interpolation method for the luma and chroma components. Finally, RDO-based fractional interpolation selection is performed to decide which interpolation method should be used for luma and chroma fractional interpolation. The Y, U and V components of the reference frame are interpolated by DCTIF before being fed into the CNN to avoid motion offset problems.
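A sketch of such an RDO-based selection between the two interpolation results is shown below; the cost form J = D + λ·R with SSE distortion, the λ value, and the rate term are assumptions for illustration.

```python
# Sketch (assumption): rate-distortion-optimized choice between the DCTIF and
# CNN interpolation results; the candidate with the smaller cost is selected
# and signalled (e.g. with the 2-bit flag mentioned above).
import numpy as np

def rd_cost(original, prediction, rate_bits, lam):
    distortion = float(np.sum((original.astype(np.float64) - prediction) ** 2))  # SSE
    return distortion + lam * rate_bits

def select_interpolation(original, pred_dctif, pred_cnn, lam=10.0, flag_bits=2):
    cost_dctif = rd_cost(original, pred_dctif, flag_bits, lam)
    cost_cnn = rd_cost(original, pred_cnn, flag_bits, lam)
    return ("CNN", pred_cnn) if cost_cnn < cost_dctif else ("DCTIF", pred_dctif)
```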
Based on the same inventive concept, as shown in fig. 3, an embodiment of the present application further provides a system for creating a video luminance and chrominance fractional interpolation model, including:
a raw image set generation module 310 configured to collect raw video data of a plurality of different types, obtain an image frame sequence in the raw video data, and generate a raw image set based on the image frame sequence;
a training image set creation module 320 configured to perform a preprocessing operation on the original image set to create a training image set;
a model construction module 330 configured to construct a deep convolutional neural network model;
and the model training module 340 is configured to use the training image set as input data of the deep convolutional neural network model, and use the original image set as a corresponding true value to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for video luminance and chrominance fractional interpolation.
In another optional embodiment of the present application, the training image set creation module 320 may be further configured to:
coding an integer pixel position image in an original image set, and learning and reconstructing mapping between an integer position video and a fraction position video so as to respectively generate a half-pixel image set and a quarter-pixel image set based on the integer pixel position image;
based on the half-pel image set and the quarter-pel image set, a training image set is created.
In another optional embodiment of the present application, the model building module 330 may be further configured to:
constructing a first convolution neural network for performing luminance component interpolation on image frames in a training image set and a second convolution neural network for performing chrominance component interpolation on the image frames in the training image set; the first convolutional neural network and the second convolutional neural network both have self-defined context models;
and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
In another optional embodiment of the present application, the model training module 340 may be further configured to:
inputting the training image set into a deep convolution neural network model;
and carrying out motion estimation and motion compensation respectively on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with a discrete cosine transform-based interpolation filter, and selecting a fractional pixel interpolation method based on a rate-distortion cost to carry out interpolation of the luminance component and the chrominance component.
In another alternative embodiment of the present application, in the fractional pixel interpolation, integer pixel positions remain unchanged and only fractional pixel positions are generated.
The system first collects original video data of various types, obtains an image frame sequence from the original video data, and generates an original image set based on the image frame sequence; a preprocessing operation is then performed on the original image set to create a training image set; a deep convolutional neural network model is then constructed, the training image set is taken as input data, and the original image set is taken as corresponding true values to train the deep convolutional neural network, so as to obtain a deep convolutional neural network model suitable for fractional interpolation of video luminance and chrominance.
Based on the method and the system for creating a video luminance and chrominance fractional interpolation model, which build on the conventional super-resolution convolutional neural network, CNN-based luminance and chrominance fractional interpolation can effectively reduce the training time of the deep convolutional neural network model and improve the stability against motion deviation, thereby improving video coding efficiency.
Embodiments of the present application also provide a computing device. Referring to fig. 4, the computing device comprises a memory 420, a processor 410, and a computer program stored in the memory 420 and executable by the processor 410; the computer program is stored in a space 430 for program code in the memory 420 and, when executed by the processor 410, implements the steps 431 for performing any of the methods according to the present application.
The embodiment of the application also provides a computer-readable storage medium. Referring to fig. 5, the computer-readable storage medium comprises a storage unit for program code, the storage unit being provided with a program 431' for performing the steps of the method according to the application, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions. When the computer program product is run on a computer, the computer is caused to perform the method steps according to the application.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the application are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.