CN112601095A - Method and system for creating fractional interpolation model of video brightness and chrominance - Google Patents

Method and system for creating fractional interpolation model of video brightness and chrominance

Info

Publication number
CN112601095A
CN112601095A (application CN202011307251.6A)
Authority
CN
China
Prior art keywords
image set
neural network
convolutional neural
interpolation
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011307251.6A
Other languages
Chinese (zh)
Other versions
CN112601095B (en)
Inventor
樊硕 (Fan Shuo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp ltd
Original Assignee
Beijing Moviebook Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp ltd filed Critical Beijing Moviebook Technology Corp ltd
Priority to CN202011307251.6A priority Critical patent/CN112601095B/en
Publication of CN112601095A publication Critical patent/CN112601095A/en
Application granted granted Critical
Publication of CN112601095B publication Critical patent/CN112601095B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H04N19/80: Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • G06N3/045: Combinations of networks
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06T3/10: Selection of transformation methods according to the characteristics of the input images
    • G06T3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T5/90: Dynamic range modification of images or parts thereof
    • G06T7/262: Analysis of motion using transform domain methods, e.g. Fourier domain methods
    • H04N19/172: Adaptive coding characterised by the coding unit, the unit being a picture, frame or field
    • H04N19/182: Adaptive coding characterised by the coding unit, the unit being a pixel
    • H04N19/186: Adaptive coding characterised by the coding unit, the unit being a colour or a chrominance component
    • H04N19/51: Motion estimation or motion compensation
    • H04N19/567: Motion estimation based on rate distortion criteria
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20052: Discrete cosine transform [DCT]
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]


Abstract

The application provides a method and a system for creating a video luminance and chrominance fractional interpolation model. In the method, a plurality of original video data of different types are first collected, the image frame sequences in the original video data are obtained, and an original image set is generated. The original image set is then preprocessed to create a training image set. A deep convolutional neural network model is then constructed and trained with the training image set as input data and the original image set as the corresponding ground truth, yielding a deep convolutional neural network model suitable for fractional interpolation of video luminance and chrominance. Building on the conventional super-resolution convolutional neural network, the CNN-based luminance and chrominance fractional interpolation effectively reduces the training time of the deep convolutional neural network model, thereby improving video coding efficiency.

Description

Method and system for creating fractional interpolation model of video brightness and chrominance
Technical Field
The application relates to the field of deep learning, in particular to a method and a system for creating a video brightness and chrominance fractional interpolation model.
Background
In recent years, deep-learning-based methods have been widely adopted and have achieved remarkable results in image and video processing. The Convolutional Neural Network (CNN) is the most representative model in deep learning; it has substantially improved on traditional methods in computer vision, most notably in super-resolution.
Although CNNs are powerful for super-resolution, they cannot be used directly for fractional interpolation in video coding, for two main reasons. First, CNN-based super-resolution may alter integer pixels after convolution. Second, the training sets for super-resolution and for fractional interpolation in video coding are different: the former aims to restore a high-resolution image by "enhancing" the quality of a low-resolution image, while the latter focuses on generating, from a reference frame, fractional samples close to the current block to be encoded. As a result, current techniques have no true ground truth to refer to, so training cannot be carried out properly.
Disclosure of Invention
It is an object of the present application to overcome, or at least partially solve or mitigate, the above problems.
According to an aspect of the present application, there is provided a method for creating a video luminance and chrominance fractional interpolation model, including:
acquiring various different types of original video data, acquiring an image frame sequence in the original video data, and generating an original image set based on the image frame sequence;
preprocessing the original image set to create a training image set;
constructing a deep convolutional neural network model;
and taking the training image set as input data of the deep convolution neural network model, and taking the original image set as a corresponding true value to train the deep convolution neural network, so as to obtain the deep convolution neural network model suitable for fractional interpolation of video brightness and chroma.
Optionally, the pre-processing the original image set to create a training image set includes:
encoding the integer pixel position images in the original image set, and learning the mapping between the integer-position video and the fractional-position video for reconstruction, so as to generate a half-pixel image set and a quarter-pixel image set, respectively, based on the integer pixel position images;
based on the half-pel image set and the quarter-pel image set, a training image set is created.
Optionally, the constructing the deep convolutional neural network model includes:
constructing a first convolutional neural network for performing luminance component interpolation on the image frames in the training image set and a second convolutional neural network for performing chrominance component interpolation on the image frames in the training image set; wherein the first convolutional neural network and the second convolutional neural network both have a customized context model therein;
and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
Optionally, the training of the deep convolutional neural network with the training image set as input data of the deep convolutional neural network model and the original image set as corresponding true values includes:
inputting the training image set into the deep convolutional neural network model;
and respectively carrying out motion estimation and motion compensation on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with an interpolation filter of discrete cosine transform, and selecting a fractional pixel interpolation method to carry out interpolation of a luminance component and a chrominance component based on the cost of a rate distortion value.
Optionally, in the fractional pixel interpolation, the integer pixel positions remain unchanged and only the fractional pixel positions are generated.
According to another aspect of the present application, there is provided a system for creating a video luma and chroma fractional interpolation model, comprising:
a raw image set generation module configured to acquire raw video data of a plurality of different types, acquire an image frame sequence in the raw video data, and generate a raw image set based on the image frame sequence;
a training image set creating module configured to perform a preprocessing operation on the original image set to create a training image set;
a model construction module configured to construct a deep convolutional neural network model;
and the model training module is configured to take the training image set as input data of the deep convolutional neural network model, and take the original image set as a corresponding true value to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for fractional interpolation of video brightness and chrominance.
Optionally, the training image set creating module is further configured to:
encoding the integer pixel position images in the original image set, and learning the mapping between the integer-position video and the fractional-position video for reconstruction, so as to generate a half-pixel image set and a quarter-pixel image set, respectively, based on the integer pixel position images;
based on the half-pel image set and the quarter-pel image set, a training image set is created.
Optionally, the model building module is further configured to:
constructing a first convolutional neural network for performing luminance component interpolation on the image frames in the training image set and a second convolutional neural network for performing chrominance component interpolation on the image frames in the training image set; wherein the first convolutional neural network and the second convolutional neural network both have a customized context model therein;
and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
Optionally, the model training module is further configured to:
inputting the training image set into the deep convolutional neural network model;
and respectively carrying out motion estimation and motion compensation on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with an interpolation filter of discrete cosine transform, and selecting a fractional pixel interpolation method to carry out interpolation of a luminance component and a chrominance component based on the cost of a rate distortion value.
Optionally, in the fractional pixel interpolation, the integer pixel positions remain unchanged and only the fractional pixel positions are generated.
According to yet another aspect of the present application, there is also provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor, when executing the computer program, implements the method of creating a video luma and chroma fractional interpolation model as defined in any of the above.
According to yet another aspect of the present application, there is also provided a computer readable storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of creating a video luma and chroma fractional interpolation model as defined in any one of the above.
The method comprises the steps of firstly collecting original video data of various types, obtaining an image frame sequence in the original video data, and generating an original image set based on the image frame sequence; then carrying out preprocessing operation on the original image set to create a training image set; and then constructing a deep convolution neural network model, taking the training image set as input data, and taking the original image set as a corresponding true value to train the deep convolution neural network, so as to obtain the deep convolution neural network model suitable for fractional interpolation of video brightness and chroma.
With the method and system for creating the video luminance and chrominance fractional interpolation model, which build on the conventional super-resolution convolutional neural network, the CNN-based luminance and chrominance fractional interpolation effectively reduces the training time of the deep convolutional neural network model and improves stability against motion offset, thereby improving video coding efficiency.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart of a method for creating a video luminance and chrominance fractional interpolation model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a deep convolutional neural network model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a system for creating a video luminance and chrominance fractional interpolation model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a computer-readable storage medium according to an embodiment of the application.
Detailed Description
The Super-Resolution Convolutional Neural Network (SRCNN) was the first learning-based CNN model for super-resolution reconstruction: it learns the mapping from a low-resolution input image to a high-resolution output image and outperforms the conventional bicubic method. The VDSR (Very Deep network for Super-Resolution) model is an optimization of SRCNN that learns image details with a 20-layer CNN in order to improve the quality of the low-resolution input.
Although CNN-based techniques have achieved superior performance compared to traditional super-resolution methods, they cannot be directly applied to fractional interpolation in video coding.
Fig. 1 is a flowchart illustrating a method for creating a video luminance and chrominance fractional interpolation model according to an embodiment of the present application. Referring to fig. 1, a method for creating a video luminance and chrominance fractional interpolation model provided in an embodiment of the present application may include:
step S101: acquiring various different types of original video data, acquiring an image frame sequence in the original video data, and generating an original image set based on the image frame sequence;
step S102: preprocessing an original image set to create a training image set;
step S103: constructing a deep convolutional neural network model;
step S104: and taking the training image set as input data of the deep convolution neural network model, and taking the original image set as a corresponding true value to train the deep convolution neural network, so as to obtain the deep convolution neural network model suitable for video brightness and chroma fractional interpolation.
The method comprises the steps of firstly collecting original video data of various different types, obtaining an image frame sequence in the original video data, and generating an original image set based on the image frame sequence; then carrying out preprocessing operation on the original image set to create a training image set; and then constructing a deep convolution neural network model, taking the training image set as input data, and taking the original image set as a corresponding true value to train the deep convolution neural network, so as to obtain the deep convolution neural network model suitable for fractional interpolation of video brightness and chroma.
With the method for creating the video luminance and chrominance fractional interpolation model, which builds on the conventional super-resolution convolutional neural network, the CNN-based luminance and chrominance fractional interpolation effectively reduces the training time of the deep convolutional neural network model and improves stability against motion offset, thereby improving video coding efficiency. Steps S101-S104 are described in detail below.
Firstly, step S101 is executed to acquire different types of raw video data, and acquire an image frame sequence therein to generate a raw image set. The original video data may include video data of multiple videos of different video contents, different types and/or different resolutions, which is not limited in this application.
After the raw video data is acquired, the image frame sequence of each piece of raw video data can be obtained, and a raw image set is then generated based on these sequences. For any one piece of video data, the image frame sequence may be formed by concatenating the image frames that constitute the video, and the original image set is generated from the image frame sequences of the plurality of video data.
Referring to step S102, after the original image set is obtained, the original image set is preprocessed to create a training image set.
The training set plays a crucial role in training any network and determines how the network behaves at test time. For CNN-based super-resolution, a popular training set uses low-resolution images as input and the corresponding original high-resolution images as ground-truth labels. The usual way to create the CNN input is to downsample the original image and then upsample the low-resolution result. To reduce the bit rate, the MCP (motion compensated prediction) operation requires predicting fractional-pixel images from reconstructed image frames, which makes the super-resolution training set unsuitable for the fractional interpolation task. In addition, integer pixels may change after convolution and no fractional pixels exist in the actual image, which is a further problem in creating the training set. The training set for the video luminance and chrominance fractional interpolation model provided in this embodiment applies the CNN-based super-resolution idea to fractional interpolation in video coding. When the model created in this embodiment performs fractional interpolation, the integer pixel positions are left unchanged and only the fractional pixel positions are generated.
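As a concrete illustration of the conventional super-resolution training pair mentioned at the start of this paragraph (downsample the original, then upsample it back), the following sketch builds such a pair. The scale factor, normalization and OpenCV usage are my own illustrative choices, not part of this application; as explained above, pairs of this kind are not suitable for the fractional interpolation task.

```python
import cv2
import numpy as np

def make_sr_training_pair(original: np.ndarray, scale: int = 2):
    """Conventional super-resolution pair: degraded image as input, original as ground truth."""
    h, w = original.shape[:2]
    low_res = cv2.resize(original, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    degraded = cv2.resize(low_res, (w, h), interpolation=cv2.INTER_CUBIC)  # upsample back to full size
    return degraded.astype(np.float32) / 255.0, original.astype(np.float32) / 255.0
```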
In an alternative embodiment of the present application, the preprocessing operation performed on the original image set when creating the training image set may include: encoding the integer pixel position images in the original image set, and learning the mapping between the integer-position video and the fractional-position video for reconstruction, so as to generate a half-pixel image set and a quarter-pixel image set, respectively, from the integer pixel position images; a training image set is then created based on the half-pel image set and the quarter-pel image set.
In other words, the embodiment of the present application preprocesses the original image set by encoding the integer-position images and learning the mapping between the integer-position video and the fractional-position video for reconstruction. To address the problem of integer pixels changing after convolution, a dual-resolution image containing both integer and half pixels is generated, together with a mask that constrains the integer pixels after convolution. That is, the integer pixel position images are preprocessed to generate half-pixel and quarter-pixel versions of each integer-pixel image.
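A minimal sketch of the dual-resolution-plus-mask idea just described, under my own assumptions about the layout: the integer pixels of a reconstructed frame occupy the even grid of a 2x canvas, an initial interpolation fills the remaining (half-pel) positions, and a binary mask records the integer positions so they can be held fixed after convolution.

```python
import cv2
import numpy as np

def dual_resolution_with_mask(integer_frame: np.ndarray):
    """Return a 2x canvas holding integer + half-pel samples and a mask of integer positions.
    Assumes a single-channel (luma) frame."""
    h, w = integer_frame.shape[:2]
    canvas = cv2.resize(integer_frame, (2 * w, 2 * h),
                        interpolation=cv2.INTER_CUBIC).astype(np.float32)
    mask = np.zeros((2 * h, 2 * w), dtype=np.float32)
    mask[0::2, 0::2] = 1.0                       # mark integer-pixel positions
    canvas[0::2, 0::2] = integer_frame           # keep the integer pixels exactly as they are
    return canvas, mask

# After the network runs on the canvas, integer pixels can be restored with the mask:
#   fused = mask * canvas + (1.0 - mask) * network_output
```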
Step S103 is executed to construct a deep convolutional neural network model.
Specifically, when the deep convolutional neural network model is constructed, a first convolutional neural network for performing luminance component interpolation on image frames in a training image set and a second convolutional neural network for performing chrominance component interpolation on the image frames in the training image set can be constructed; the first convolutional neural network and the second convolutional neural network both have self-defined context models; and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
In the deep convolutional neural network of this embodiment, the overall network architecture comprises 20 convolutional layers. Each layer applies 64 filters of size 3 x 3. Padding is set to 1 to preserve the size of the input image after convolution. A ReLU activation layer follows each convolutional layer except the last. The learning rate starts at 0.1 and is decreased every ten epochs; training stops at 50 epochs, and the batch size is 128.
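The configuration just described can be written down directly in PyTorch. In the sketch below, the single-channel (luminance) input, the choice of SGD, and the decay factor of 0.1 are my assumptions; the 20 layers, 64 filters of 3 x 3, padding 1, ReLU after every layer but the last, the 0.1 starting learning rate, the 10-epoch decay interval, the 50 epochs and the batch size of 128 come from the text above.

```python
import torch
import torch.nn as nn

class FractionalInterpCNN(nn.Module):
    """20 convolutional layers, 64 filters of 3x3, padding 1, ReLU after each layer except the last."""
    def __init__(self, channels: int = 1, features: int = 64, depth: int = 20):
        super().__init__()
        layers = [nn.Conv2d(channels, features, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(features, channels, kernel_size=3, padding=1))  # no ReLU on the last layer
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

model = FractionalInterpCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)                          # start at 0.1
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # decay every 10 epochs
# Training runs for 50 epochs with a batch size of 128.
```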
The context model enlarges the receptive field by exploiting contextual information over the image region and by increasing the depth of the network, so that more detail can be extracted from the image features. Taking the context model in the first convolutional neural network as an example, it can extract feature data from an image frame input to the first convolutional neural network and then perform interpolated-luminance classification on the next image frame based on the extracted features, further improving the resolution of the image.
When coding image frames within the same video data, MCP motion-compensated prediction can be used to interpolate a reconstructed image frame to fractional precision and find the fractional pixels closest to the current frame to be encoded. In addition, selecting between different coding modes is an effective tool for improving image quality, reducing the bit rate, or reducing the computational complexity of a video coding standard (e.g., HEVC). This embodiment defines two context models for selecting between the luminance and chrominance interpolation methods, in order to exploit both the CNN and the discrete cosine transform interpolation filter mentioned below: the two have complementary abilities to handle different signals, and the CNN-based fractional interpolation and the luminance and chrominance component interpolation methods are integrated into HEVC.
And finally, executing step S104, taking the training image set as input data of the deep convolutional neural network model, and taking the original image set as a corresponding true value to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for video brightness and chroma fractional interpolation.
Specifically, a training image set is input into a deep convolutional neural network model; and respectively carrying out motion estimation and motion compensation on each image frame in the training image set by combining a first convolutional neural network and a second convolutional neural network with an interpolation filter of discrete cosine transform, and selecting a fractional pixel interpolation method based on rate distortion value cost (RDO cost) to carry out interpolation of a luminance component and a chrominance component.
The Discrete Cosine Transform (DCT) is mainly used for compressing data or images. The DCT converts a spatial-domain signal into the frequency domain and therefore has good decorrelation performance. The DCT itself is lossless and symmetric. When the discrete cosine transform is applied to an original image, the energy of the resulting DCT coefficients is concentrated mainly in the upper-left corner, while most of the remaining coefficients are close to zero. A threshold operation is then applied to the transformed DCT coefficients, zeroing those below a certain value (this is the quantization step in image compression), and an inverse DCT finally yields the compressed image.
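The behaviour just described (energy concentrating in the upper-left coefficients, small coefficients being zeroed as a crude stand-in for quantization, then inverse transformation) can be reproduced with a few lines of NumPy/SciPy. The 8 x 8 block, the smooth test pattern and the threshold of 10 are arbitrary values chosen only for this illustration.

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 16.0  # smooth 8x8 ramp as a stand-in image block
coeffs = dctn(block, norm="ortho")                            # forward 2-D DCT (lossless, invertible)
coeffs[np.abs(coeffs) < 10.0] = 0.0                           # zero small coefficients ("quantization")
reconstructed = idctn(coeffs, norm="ortho")                   # inverse 2-D DCT gives the compressed block
print("kept coefficients:", int(np.count_nonzero(coeffs)), "of", coeffs.size)
print("max reconstruction error:", float(np.abs(block - reconstructed).max()))
```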
The Interpolation Filter of the discrete cosine transform is an Interpolation Filter Based on the DCT transform, namely DCTIF (DCT-Based Interpolation Filter). Compared with the traditional filter, the DCTIF reduces the complexity and greatly improves the compression performance.
In the method provided in this embodiment, the training process of the deep convolutional neural network model is a process of interpolation training on image frames. For example, when an image frame Z is interpolated between adjacent image frames M and N, the true values of image frame N (such as luminance or chrominance ground truth) extracted from the original image set can be used as the ground-truth labels for image frame Z, and the corresponding parameters are then updated through a back-propagation algorithm to accomplish the training.
That is, in this embodiment, in order to make full use of DCTIF, the CNN-based fractional interpolation and the luminance-classification and chrominance-component interpolation methods can be integrated into an HEVC scheme. During encoding, the best fractional pixel is searched for, and the Y component of the reference frame is interpolated by both DCTIF and CNN. In motion compensation, the Y component is interpolated using the method chosen in the motion search, while the U and V chrominance components are interpolated to the current frame to be encoded using CNN and DCTIF, respectively. As shown in fig. 2, the network architecture provided by this embodiment may include 20 convolutional layers; each convolutional layer uses 64 filters of size 3 x 3 with a stride of 1. Padding is set to 1 so that the size of the input image is maintained after convolution. A ReLU activation layer follows each convolutional layer except the last.
In the training process, interpolation scores are computed mainly for the original image, and the next frame with the minimum loss is selected, according to the score, for interpolation processing. The variables considered are the three YUV components: Y for luminance and U and V for chrominance. The inputs to the deep convolutional neural network model of this embodiment are the interpolated fractional samples together with the true labels, i.e., the real fractional samples extracted from the original frame. Assume that the reconstructed image x contains the integer pixels and that y denotes the fractional-position frames. DCTIF is applied to the reconstructed image x to obtain its fractional samples x_j', where j takes values from 1 to 15 for the luminance component and ranges over a combination of values from 1 to 63 for the chrominance components. Each DCTIF-interpolated image x_j' is then input into the CNN, which produces an output y_j'. The deep convolutional neural network model aims to learn the mapping between x_j' and the true fractional samples y_j by minimizing the loss function:

L(Θ) = (1/N) · Σ_{i=1}^{N} ‖ y_i' − y_i ‖²

wherein y_i' denotes the output of the deep convolutional neural network model, i.e., the predicted position, and y_i denotes the actual position.
In the field of deep learning, the loss function is a very important concept. The loss function measures the disagreement between the model's predicted value and the true value. The objective of training is to minimize the loss function so that the predicted values are as close as possible to the true values; in general, optimization algorithms such as gradient descent are used to find its minimum.
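The loss just described leads to the standard training loop sketched below. It reuses the model, optimizer and scheduler from the earlier architecture sketch, and the dummy tensors stand in for a hypothetical loader of (DCTIF-interpolated input x', true fractional sample y) pairs; only the mean-squared-error objective and the back-propagation step are taken from the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for (DCTIF-interpolated input, true fractional sample) pairs.
x_dummy = torch.randn(256, 1, 32, 32)
y_dummy = torch.randn(256, 1, 32, 32)
train_loader = DataLoader(TensorDataset(x_dummy, y_dummy), batch_size=128, shuffle=True)

criterion = nn.MSELoss()                      # implements (1/N) * sum ||y'_i - y_i||^2
for epoch in range(50):
    for x_prime, y_true in train_loader:
        y_pred = model(x_prime)               # model/optimizer/scheduler from the earlier sketch
        loss = criterion(y_pred, y_true)
        optimizer.zero_grad()
        loss.backward()                       # back-propagation updates the network parameters
        optimizer.step()
    scheduler.step()                          # StepLR applies the every-10-epoch decay
```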
The basic idea of motion estimation is to divide each frame of an image sequence into a number of non-overlapping macroblocks, assume that all pixels within a macroblock share the same displacement, and then, for each macroblock, search a reference frame within a given search range, according to a matching criterion, for the block most similar to the current block, i.e., the matching block; the relative displacement between the matching block and the current block is the motion vector. When the video is compressed, the current block can be fully restored by storing only the motion vector and the residual data. In motion estimation, the Y (luminance) component is interpolated by both DCTIF and CNN. For each interpolation method, the best fractional sample is selected and the best fractional-pel motion vector is sent.
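A toy full-search version of the block matching just described, using the sum of absolute differences (SAD) as the matching criterion; the 16 x 16 block size and the +/-8 search range are arbitrary illustrative choices, not values given in this application.

```python
import numpy as np

def block_match(cur: np.ndarray, ref: np.ndarray, top: int, left: int,
                block: int = 16, search: int = 8):
    """Return the motion vector (dy, dx) and SAD of the best match for one macroblock."""
    target = cur[top:top + block, left:left + block].astype(np.int64)
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue                                     # candidate falls outside the reference frame
            candidate = ref[y:y + block, x:x + block].astype(np.int64)
            cost = int(np.abs(target - candidate).sum())     # sum of absolute differences
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv, best_cost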
In the motion search, the Y component of the reference frame is interpolated by DCTIF and by CNN. In motion compensation, the U and V components of the reconstructed frame are interpolated by CNN and DCTIF, and the Y component is interpolated by the method chosen in the motion search. The residual between the current CU and the predicted CU is calculated, and 2 bits are coded to indicate the interpolation methods used for the luminance and chrominance components. Finally, RDO-based fractional interpolation selection decides which interpolation method should be used for the luminance and chrominance fractional interpolation. The Y, U and V components of the reference frame are interpolated by DCTIF before being fed into the CNN, to avoid motion offset problems.
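The selection step can be sketched as computing a rate-distortion cost J = D + lambda * R for each candidate interpolation (DCTIF or CNN) and keeping the cheaper one. The sum-of-squared-differences distortion, the lambda value and the 2-bit signalling cost used below are illustrative assumptions; only the idea of choosing the interpolation method by rate-distortion cost comes from the text.

```python
import numpy as np

def rd_cost(pred: np.ndarray, target: np.ndarray, rate_bits: float, lam: float) -> float:
    """J = D + lambda * R, with sum of squared differences as the distortion D."""
    distortion = float(np.sum((pred.astype(np.float64) - target.astype(np.float64)) ** 2))
    return distortion + lam * rate_bits

def choose_interpolation(pred_dctif: np.ndarray, pred_cnn: np.ndarray,
                         target: np.ndarray, lam: float = 10.0, signal_bits: float = 2.0):
    """Pick DCTIF or CNN fractional interpolation for a block by comparing RD costs."""
    j_dctif = rd_cost(pred_dctif, target, signal_bits, lam)
    j_cnn = rd_cost(pred_cnn, target, signal_bits, lam)
    return ("DCTIF", j_dctif) if j_dctif <= j_cnn else ("CNN", j_cnn)
```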
Based on the same inventive concept, as shown in fig. 3, an embodiment of the present application further provides a system for creating a video luminance and chrominance fractional interpolation model, including:
a raw image set generation module 310 configured to collect raw video data of a plurality of different types, obtain an image frame sequence in the raw video data, and generate a raw image set based on the image frame sequence;
a training image set creation module 320 configured to perform a preprocessing operation on the original image set to create a training image set;
a model construction module 330 configured to construct a deep convolutional neural network model;
and the model training module 340 is configured to use the training image set as input data of the deep convolutional neural network model, and use the original image set as a corresponding true value to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for video luminance and chrominance fractional interpolation.
In another optional embodiment of the present application, the training image set creating module 320 may be further configured to:
coding an integer pixel position image in an original image set, and learning and reconstructing mapping between an integer position video and a fraction position video so as to respectively generate a half-pixel image set and a quarter-pixel image set based on the integer pixel position image;
based on the half-pel image set and the quarter-pel image set, a training image set is created.
In another optional embodiment of the present application, the model building module 330 may be further configured to:
constructing a first convolution neural network for performing luminance component interpolation on image frames in a training image set and a second convolution neural network for performing chrominance component interpolation on the image frames in the training image set; the first convolutional neural network and the second convolutional neural network both have self-defined context models;
and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
In another optional embodiment of the present application, the model training module 340 may be further configured to:
inputting the training image set into a deep convolution neural network model;
and respectively carrying out motion estimation and motion compensation on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with an interpolation filter of discrete cosine transform, and selecting a fractional pixel interpolation method based on the cost of a rate distortion value to carry out interpolation of a brightness component and a chrominance component.
In another alternative embodiment of the present application, during the fractional pixel interpolation the integer pixel positions remain unchanged and only the fractional pixel positions are generated.
The method comprises the steps of firstly collecting original video data of various types, obtaining an image frame sequence in the original video data, and generating an original image set based on the image frame sequence; then carrying out preprocessing operation on the original image set to create a training image set; and then constructing a deep convolution neural network model, taking the training image set as input data, and taking the original image set as a corresponding true value to train the deep convolution neural network, so as to obtain the deep convolution neural network model suitable for fractional interpolation of video brightness and chroma.
With the method and system for creating the video luminance and chrominance fractional interpolation model, which build on the conventional super-resolution convolutional neural network, the CNN-based luminance and chrominance fractional interpolation effectively reduces the training time of the deep convolutional neural network model and improves stability against motion offset, thereby improving video coding efficiency.
Embodiments of the present application also provide a computing device, referring to fig. 4, comprising a memory 420, a processor 410, and a computer program stored in the memory 420 and executable by the processor 410, the computer program being stored in a space 430 for program code in the memory 420, the computer program, when executed by the processor 410, implementing steps 431 for performing any of the methods according to the present application.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 5, the computer readable storage medium comprises a storage unit for program code provided with a program 431' for performing the steps of the method according to the application, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions. When the computer program product is run on a computer, the computer is caused to perform the method steps according to the application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for creating a video luminance and chrominance fractional interpolation model comprises the following steps:
acquiring various different types of original video data, acquiring an image frame sequence in the original video data, and generating an original image set based on the image frame sequence;
preprocessing the original image set to create a training image set;
constructing a deep convolutional neural network model;
and taking the training image set as input data of the deep convolution neural network model, and taking the original image set as a corresponding true value to train the deep convolution neural network, so as to obtain the deep convolution neural network model suitable for fractional interpolation of video brightness and chroma.
2. The method of claim 1, wherein the pre-processing the raw image set to create a training image set comprises:
coding the integer pixel position images in the original image set, and learning and reconstructing mapping between the integer position video and the fraction position video so as to respectively generate a half-pixel image set and a quarter-pixel image set based on the integer pixel position images;
based on the half-pel image set and the quarter-pel image set, a training image set is created.
3. The method of claim 2, wherein constructing the deep convolutional neural network model comprises:
constructing a first convolutional neural network for performing luminance component interpolation on the image frames in the training image set and a second convolutional neural network for performing chrominance component interpolation on the image frames in the training image set; wherein the first convolutional neural network and the second convolutional neural network both have a customized context model therein;
and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
4. The method of claim 3, wherein training the deep convolutional neural network with the training image set as input data to the deep convolutional neural network model and the original image set as corresponding truth values comprises:
inputting the training image set into the deep convolutional neural network model;
and respectively carrying out motion estimation and motion compensation on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with an interpolation filter of discrete cosine transform, and selecting a fractional pixel interpolation method to carry out interpolation of a luminance component and a chrominance component based on the cost of a rate distortion value.
5. A system for creating a video luma and chroma fractional interpolation model, comprising:
a raw image set generation module configured to acquire raw video data of a plurality of different types, acquire an image frame sequence in the raw video data, and generate a raw image set based on the image frame sequence;
a training image set creating module configured to perform a preprocessing operation on the original image set to create a training image set;
a model construction module configured to construct a deep convolutional neural network model;
and the model training module is configured to take the training image set as input data of the deep convolutional neural network model, and take the original image set as a corresponding true value to train the deep convolutional neural network, so as to obtain the deep convolutional neural network model suitable for fractional interpolation of video brightness and chrominance.
6. The system of claim 5, wherein the training image set creation module is further configured to:
coding the integer pixel position images in the original image set, and learning and reconstructing mapping between the integer position video and the fraction position video so as to respectively generate a half-pixel image set and a quarter-pixel image set based on the integer pixel position images;
based on the half-pel image set and the quarter-pel image set, a training image set is created.
7. The system of claim 6, wherein the model building module is further configured to:
constructing a first convolutional neural network for performing luminance component interpolation on the image frames in the training image set and a second convolutional neural network for performing chrominance component interpolation on the image frames in the training image set; wherein the first convolutional neural network and the second convolutional neural network both have a customized context model therein;
and forming a deep convolutional neural network model based on the combination of the first convolutional neural network and the second convolutional neural network.
8. The system of claim 7, wherein the model training module is further configured to:
inputting the training image set into the deep convolutional neural network model;
and respectively carrying out motion estimation and motion compensation on each image frame in the training image set by combining the first convolutional neural network and the second convolutional neural network with an interpolation filter of discrete cosine transform, and selecting a fractional pixel interpolation method to carry out interpolation of a luminance component and a chrominance component based on the cost of a rate distortion value.
9. A computing device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor, when executing the computer program, implements the method of creating a video luma and chroma fractional interpolation model according to any one of claims 1-4.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of creating a video luma and chroma fractional interpolation model according to any one of claims 1-4.
CN202011307251.6A 2020-11-19 2020-11-19 Method and system for creating fractional interpolation model of video brightness and chrominance Active CN112601095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011307251.6A 2020-11-19 2020-11-19 Method and system for creating fractional interpolation model of video brightness and chrominance (granted as CN112601095B)

Publications (2)

Publication Number Publication Date
CN112601095A (en) 2021-04-02
CN112601095B CN112601095B (en) 2023-01-10

Family

ID=75183333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011307251.6A Active CN112601095B (en) 2020-11-19 2020-11-19 Method and system for creating fractional interpolation model of video brightness and chrominance

Country Status (1)

Country Link
CN (1) CN112601095B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090257493A1 (en) * 2008-04-10 2009-10-15 Qualcomm Incorporated Interpolation filter support for sub-pixel resolution in video coding
CN108012157A (en) * 2017-11-27 2018-05-08 上海交通大学 Construction method for the convolutional neural networks of Video coding fractional pixel interpolation
CN110177282A (en) * 2019-05-10 2019-08-27 杭州电子科技大学 A kind of inter-frame prediction method based on SRCNN
CN110020989A (en) * 2019-05-23 2019-07-16 西华大学 A kind of depth image super resolution ratio reconstruction method based on deep learning
CN111787187A (en) * 2020-07-29 2020-10-16 上海大学 Method, system and terminal for repairing video by utilizing deep convolutional neural network

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709483A (en) * 2021-07-08 2021-11-26 杭州微帧信息科技有限公司 Adaptive generation method and device for interpolation filter coefficient
CN113709483B (en) * 2021-07-08 2024-04-19 杭州微帧信息科技有限公司 Interpolation filter coefficient self-adaptive generation method and device
CN113452944A (en) * 2021-08-31 2021-09-28 江苏北弓智能科技有限公司 Picture display method of cloud mobile phone
CN113452944B (en) * 2021-08-31 2021-11-02 江苏北弓智能科技有限公司 Picture display method of cloud mobile phone

Also Published As

Publication number Publication date
CN112601095B (en) 2023-01-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Method and System for Creating a Video Brightness and Chromaticity Fraction Interpolation Model

Effective date of registration: 20230713

Granted publication date: 20230110

Pledgee: Bank of Jiangsu Co., Ltd., Beijing Branch

Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023110000278