GB2600787A - Method and apparatus for video super resolution


Info

Publication number: GB2600787A
Application number: GB2104311.2A
Authority: GB (United Kingdom)
Prior art keywords: low resolution, frames, resolution frames, kernel, video
Legal status: Granted; Active
Other versions: GB2600787B (en), GB202104311D0 (en)
Inventors: Wen Hongkai, Saied Abdelkader Abdelfattah Mohamed, Lee Royson
Current Assignee: Samsung Electronics Co Ltd
Original Assignee: Samsung Electronics Co Ltd
Application filed by Samsung Electronics Co Ltd
Publication of GB202104311D0
Publication of GB2600787A
Application granted; publication of GB2600787B

Classifications

    • G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T 3/4046: Scaling of whole images or parts thereof using neural networks
    • G06T 5/73: Image enhancement or restoration; Deblurring; Sharpening
    • H04N 19/59: Coding/decoding of digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
    • H04N 7/0117: Conversion of standards involving conversion of the spatial resolution of the incoming video signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)

Abstract

A method and apparatus for performing video super resolution, the method comprising the steps of: receiving a video comprising a plurality of low resolution frames; and performing, using a machine learning (ML) model, an iterative process to upscale the low resolution frames by alternately: estimating, using the ML model, degradation (downsampling) kernels for a group of sequential, temporally consistent low resolution frames; and upscaling, using the ML model, the group of frames using the estimated degradation kernels. Preferably, the ML model includes a kernel estimator and a frame restorer, each of which may comprise a convolutional neural network (CNN). During the final iteration, the restorer may determine feature maps for the LR frames prior to performing upscaling. By exploiting kernel estimation, the present method addresses the over-sharpened or over-smoothed results that arise from kernel mismatch.

Description

Method and Apparatus for Video Super Resolution
Field
[001] The present application generally relates to a method for improving the resolution of videos, and in particular to a computer-implemented method and apparatus for upscaling low resolution frames of a video to achieve video super resolution.
Background
[002] Super resolution (SR) is a problem in image processing that is concerned with how to reconstruct a high resolution (HR) image from its downscaled low resolution (LR) version. That is, super-resolution assumes the LR image is derived from an HR image, and therefore high-frequency details can be restored. For images, high frequency details need to be reconstructed from the low resolution image. However, for videos, temporal relationships in the input video may be exploited to improve reconstruction for video super-resolution. Specifically, each supporting frame is aligned with its reference frame through motion compensation before the information in the frames is merged for upscaling.
[003] Typically, video super resolution techniques combine a batch of LR frames to estimate a single HR frame. This effectively divides the task of video super resolution into a large number of separate multi-frame super resolution tasks. However, this approach is computationally expensive because each input frame needs to be processed several times.
[004] Furthermore, current techniques for video super resolution assume that the degradation process is fixed. That is, it is assumed that the blur kernel and the downsampling operation are pre-defined. This leads to unsatisfactory upscaling results on real-world videos as the downsampling kernel, which is used for upscaling, differs from the ground truth kernel, a phenomenon known as kernel mismatch.
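For illustration, the degradation process referred to above (blurring an HR frame with a kernel and then downsampling it) can be sketched as follows. This is a minimal Python/PyTorch sketch only: the function name `degrade`, the depthwise convolution, the replicate padding and the direct strided downsampling are illustrative assumptions rather than the degradation operator used by any particular prior technique.

```python
import torch
import torch.nn.functional as F

def degrade(hr, kernel, scale=4):
    """Classical degradation model: LR = (HR blurred with k) downsampled by `scale`.

    hr:     (B, C, H, W) high resolution frames
    kernel: (k, k) blur kernel, assumed normalised to sum to one
    """
    c = hr.shape[1]
    k = kernel.shape[-1]
    weight = kernel.view(1, 1, k, k).repeat(c, 1, 1, 1)   # depthwise: one kernel per channel
    padded = F.pad(hr, [k // 2] * 4, mode='replicate')
    blurred = F.conv2d(padded, weight, groups=c)          # blur with the degradation kernel
    return blurred[:, :, ::scale, ::scale]                # direct downsampling by striding

# Example: a 13x13 box kernel as a stand-in for an anisotropic Gaussian
hr = torch.rand(1, 3, 128, 128)
lr = degrade(hr, torch.full((13, 13), 1.0 / 169), scale=4)   # -> (1, 3, 32, 32)
```

If the kernel assumed at upscaling time differs from the kernel actually applied here, the restored frames exhibit the kernel mismatch described above.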
[005] The present applicant has recognised the need for an improved video super resolution technique that overcomes these problems.
Summary
[6] In a first approach of the present techniques, there is provided a computer-implemented method for using a machine learning, ML, model to perform video super resolution, the method comprising: receiving a video comprising a plurality of low resolution frames; and performing, using the ML model, an iterative process to upscale the low resolution frames by alternately: estimating a degradation kernel for a group of low resolution frames, the group comprising two or more sequential low resolution frames of the received video, and upscaling the group of sequential low resolution frames using the estimated degradation kernel, wherein the sequential low resolution frames are temporally consistent.
[7] The method may further comprise outputting a video comprising super resolution frames.
[008] The term "upscaling" is used interchangeably herein with the term "upsampling", and the term "downsampled" is used interchangeably herein with the term "degradation".
[9] Videos may contain low resolution frames for a number of reasons. For example, the videos may be captured using a low resolution camera. In another example, a high resolution video may be compressed and transmitted or shared as a low resolution video, due to limited bandwidth or in order to reduce the amount of mobile data used. In each case, it may be desirable to upscale the low resolution frames to generate a better quality, higher resolution video. As explained below in more detail, the present techniques provide a video super resolution method which is more computationally-efficient and more accurate than existing techniques. The present techniques are based on the following key observations.
[10] Firstly, in real-world videos, the kernels of different frames may change over time due to a variety of factors, such as scene changes, changes in lens focus, and camera motion. Therefore, a super resolution approach that is based on a fixed kernel will not generate upscaled videos of a high quality.
[11] Secondly, in real-world videos, the kernels of different frames may also exhibit certain temporal consistency in the feature space. For example, a scene spanning multiple frames may show a person talking, but large parts of the scene do not change, such as the background behind the person. This means that it is not necessary to estimate every kernel for every frame from scratch. Instead, it is possible to exploit the correlations between kernels of frames having temporal consistency, to estimate the kernels of those frames. This advantageously reduces the number of computations which need to be performed, and thereby leads to a more time-and computationally-efficient video super resolution method. This may enable the present video super resolution methods to be performed on a wider range of devices (such as smartphones), as the methods are less computationally demanding than existing methods.
[12] Thirdly, in real-world videos, the amount or level of kernel temporal consistency may vary across different videos or across frames within videos. For example, high kernel temporal consistency may exist for a set of frames of a video when a camera which captured the set of frames is stable and/or when the scene depicted in the set of frames is fixed or steady (e.g. person sitting down or landscape/countryside). Similarly, low kernel temporal consistency may exist for a set of frames of a video when a camera which captured the set of frames is unstable (e.g. hand-held camera), when the scene depicted in the set of frames contains fast motion (e.g. people or vehicles moving), and/or when the set of frames span two scenes or a scene change.
[13] The present techniques use these observations to provide a ML model which comprises alternating between two stages for a number of iterations. In the first stage, a degradation kernel is estimated for a group of sequential low resolution frames which are temporally consistent. In the second stage, the group of low resolution frames are upscaled using the estimated degradation kernel.
[14] Performing the iterative process comprises performing a predetermined number of iterations. The predetermined number of iterations may be, for example, between 1 and 10.
[15] During each iteration, estimating a degradation kernel may comprise: inputting the group of low resolution frames into an estimator module of the ML model, and any previously computed upscaled low resolution frames; and estimating the degradation kernel for the group of low resolution frames and the upscaled low resolution frames.
[16] During each iteration, upscaling the sequential low resolution frames may comprise: inputting the group of low resolution frames into a restorer module of the ML model, and the estimated degradation kernel; and upscaling, using the estimated degradation kernel, a resolution of the group of low resolution frames to upscaled low resolution frames.
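Taken together, the estimating and upscaling stages above form a simple alternating loop. The sketch below restates that loop in Python/PyTorch; `estimator` and `restorer` are placeholder callables standing in for the modules described above, and the bicubic initial estimate, the zero-initialised kernel codes and the tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def alternate_upscale(lr_group, estimator, restorer, num_iters=4, scale=4):
    """Alternately estimate degradation kernels for a group of temporally
    consistent LR frames and upscale the group using those kernels.

    lr_group: (B, T, C, H, W) sequential low resolution frames
    """
    b, t, c, h, w = lr_group.shape
    # initial upscaled estimate (bicubic) and all-zero stand-in kernel codes
    sr_group = F.interpolate(lr_group.view(b * t, c, h, w), scale_factor=scale,
                             mode='bicubic', align_corners=False)
    sr_group = sr_group.view(b, t, c, h * scale, w * scale)
    kernels = torch.zeros(b, t, 10)          # assumed PCA-reduced kernel codes

    for _ in range(num_iters):
        # stage 1: estimate kernels from the LR frames and previously upscaled frames
        kernels = estimator(lr_group, sr_group)
        # stage 2: upscale the LR frames conditioned on the estimated kernels
        sr_group = restorer(lr_group, kernels)
    return sr_group, kernels
```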
[17] The method may comprise determining, during a final iteration of the restorer module, feature maps for the group of low resolution frames prior to performing upscaling. That is, during the final iteration of the restorer module, low resolution feature maps are determined before the final upscaling step.
[18] The method may further comprise: inputting the feature maps determined during the final iteration of the restorer module into a temporal alignment module of the ML model; using the feature maps to align the group of low resolution frames with a reference frame of the received video at a feature level; and generating aligned features for the group of low resolution frames.
[19] The method may further comprise: inputting the generated aligned features into a fusion module of the ML model; and using the fusion module to: compute a contribution of each generated aligned feature of neighbouring (low resolution) frames; and fuse the aligned features into a single feature map.
[20] The method may further comprise: inputting the feature map into a restoration module of the ML model; and generating, using the feature map, super resolution frames for the group of low resolution frames.
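The final-iteration path described in the preceding paragraphs (feature maps, then temporal alignment, fusion and restoration) can be summarised in a short sketch. Here `align`, `fuse` and `restore` are placeholders for the temporal alignment, fusion and restoration modules, and the choice of the middle frame as the reference frame is an assumption.

```python
def finalise(lr_feats, align, fuse, restore, ref_index=None):
    """lr_feats: (B, T, C, H, W) feature maps from the final restorer iteration."""
    t = lr_feats.shape[1]
    ref = t // 2 if ref_index is None else ref_index
    # align each frame's features with the reference frame at feature level
    aligned = [align(lr_feats[:, i], lr_feats[:, ref]) for i in range(t)]
    # weigh the contribution of each aligned neighbouring frame and fuse into one map
    fused = fuse(aligned)
    # generate super resolution frames from the fused feature map
    return restore(fused)
```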
[21] The group of low resolution frames are processed together in order to exploit temporal consistency between neighbouring or sequential frames. However, there may be practical upper and lower limits on how many frames can be included in the group. For example, if there are too few frames in the group (e.g. one), the super resolution method is no longer computationally efficient. Similarly, if there are too many frames in the group (e.g. ten), the super resolution method may not be as accurate, as there may not be temporal consistency, or the same level of temporal consistency, across all the frames in the group. Thus, the group may comprise up to three or up to five sequential low resolution frames of the received video. However, it will be understood that these are non-limiting examples and that in some cases, there may be no upper limit on the number of frames that can be included in the group.
[22] The estimator module and the restorer module may each comprise a convolutional neural network (CNN).
[23] In a second approach of the present techniques, there is provided an apparatus for performing video super resolution using a machine learning, ML, model, the apparatus comprising: at least one processor coupled to memory and arranged to: receive a video comprising a plurality of low resolution frames; and perform, using the ML model, an iterative process to upscale the low resolution frames by alternately: estimating a degradation kernel for a group of low resolution frames, the group comprising two or more sequential low resolution frames of the received video, wherein the sequential low resolution frames are temporally consistent; and upscaling the group of sequential low resolution frames using the estimated degradation kernel.
[24] The features described above with respect to the first approach apply equally to the second approach.
[25] The apparatus may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, a drone, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, or a smart consumer device (such as a smart fridge). It will be understood that this is a non-exhaustive and non-limiting list of example apparatus.
[26] In a related approach of the present techniques, there is provided a non-transitory data carrier carrying processor control code to implement the methods described herein.
[27] As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
[28] Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
[29] Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
[30] Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
[31] The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
[032] It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
[33] In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
[34] The methods described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
[35] As mentioned above, the present techniques may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
[36] The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
[37] The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Brief description of drawings
[38] Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
[039] Figure 1 is a schematic diagram of an existing process to perform video super resolution;
[40] Figure 2A is a schematic diagram of another existing process to perform video super resolution, and Figure 2B illustrates some of the drawbacks of the process of Figure 2A;
[41] Figure 3 shows a series of kernels for frames that change over time;
[42] Figure 4 shows an image illustrating that kernels of different frames exhibit temporal consistency in the feature space;
[43] Figure 5 shows a graph illustrating how temporal consistency may vary in different videos;
[44] Figure 6A is a schematic diagram illustrating the video super resolution approach of the present techniques;
[045] Figure 6B is a diagram showing the architecture of the ML model of the present techniques;
[046] Figure 7 is a flowchart of example steps to perform video super resolution;
[47] Figure 8 is an apparatus used to perform video super resolution; and
[48] Figure 9 shows the distribution of kernel estimation errors of different estimators for each video sequence in the test set.
Detailed description of drawings
[49] Broadly speaking, the present techniques generally relate to a method for improving the resolution of videos, and in particular to a computer-implemented method and apparatus for upscaling or upsampling low resolution frames of a video to achieve video super resolution (SR). More specifically, the present techniques exploit the temporal consistency between frames of a video to accurately and efficiently estimate degradation kernels, which can then be used to upscale the frames.
[50] Deep learning based blind super-resolution (SR) methods have recently achieved unprecedented performance in upscaling frames with unknown degradation. These models are able to accurately estimate the blur kernel from a given low-resolution (LR) image in order to leverage the kernel during restoration. However, these approaches are predominately image-based and are too compute heavy for video super-resolution. Moreover, recent blind video SR works assume all frames within the same video are downscaled using a fixed blur kernel, resulting in artefacts due to scene changes and motion blur.
[51] Although there has been significant progress to enable the usage of SR models in real-world applications, these solutions are predominantly image-based. The primary paradigm of these blind image-based solutions consists of either a two-step or an end-to-end process, starting with a kernel estimation module and followed by an SR model that aims to maximise image quality given the estimated kernel and/or noise. However, these techniques do not utilise temporal information and have to estimate kernels individually per frame, rendering them computationally expensive in video-based scenarios. On the other hand, utilising a fixed kernel to upscale every frame in the same video can lead to over-sharpened or over-smoothed results due to kernel mismatch.
[52] The present techniques solve these problems by first extracting kernels using an image-based kernel estimation approach on real-world videos, and then highlighting their kernel temporal consistency. Specifically, the present techniques detail how the estimated kernel changes per frame. Lower kernel temporal consistency is observed for videos with high dynamicity of the scene and its objects. The present techniques exploit this consistency by taking techniques from both image-based blind SR and video SR, and tailoring them for blind video SR. As a result, the present techniques achieve not only more accurate kernel estimation but also better video restoration quantitatively and qualitatively as compared to known video SR and blind image-based SR approaches.
[53] Figure 1 is a schematic diagram of an existing process to perform video super resolution. Specifically, this diagram shows a frame-based super resolution (SR) process which involves kernel estimation. In this process, given an input LR frame, a kernel is first estimated which describes the degradation of the LR image. Then, the SR frame is obtained by an image SR network which takes both the original LR frame and the estimated kernel into account. However, this process is not applicable to videos because estimating the kernel per video frame is computationally expensive (i.e. requires a few minutes per frame on desktop graphics processing units, GPUs).
[54] Figure 2A is a schematic diagram of another existing process to perform video super resolution, and Figure 2B illustrates some of the drawbacks of the process of Figure 2A.
Specifically, Figure 2A shows a process in which a fixed kernel is used to upscale low resolution frames of a video. This approach only estimates a fixed kernel for the video, e.g. the kernel of the first frame, and uses this kernel to upscale the whole video. However, as shown in Figure 2B, a drawback of this process is that it often suffers significant kernel mismatch, due to the fact that the kernels of different frames in a video can be varying, which leads to unsatisfactory SR results.
[055] The present techniques for performing video super resolution are based on the following key observations.
[56] Firstly, in real-world videos, the kernels of different frames may change over time, due to various factors, such as scene changes, changes in lens focus, camera motion, etc. For instance, Figure 3 shows the kernels estimated by a state-of-the-art image-based kernel estimation approach for the frames of two different videos. This shows that the fixed kernel SR approach (shown in Figure 2A) will not work for video SR.
[57] Secondly, in real-world videos, the kernels of different frames also exhibit certain temporal consistency in feature space. Therefore, it is not necessary to estimate every kernel from scratch but instead, their correlations can be exploited to save computation, leading to a more efficient video SR approach. Figure 4 shows an image illustrating that kernels of different frames exhibit temporal consistency in the feature space. Figure 4 shows a principal component analysis (PCA) feature visualization of the estimated kernels of the frames in a video. It can be seen that neighbouring kernels have quite similar features, indicating that they are temporally consistent within the video.
[58] Thirdly, temporal consistency broadly exists in real-world videos, and may vary in different cases. The degradation kernels of frames of real-world videos can often be different, but on the other hand, they may also exhibit certain levels of temporal consistency, depending on the video contents such as camera/object motion blur or scene dynamicity. Figure 5 illustrates this phenomenon, showing how the kernel PCA changes for consecutive frames in different video sequences. Specifically, the extracted kernels for each frame were reshaped and reduced through principal component analysis (PCA). The sum of absolute differences between the kernel PCA components of adjacent frames was computed, and the difference is plotted using videos of varying dynamicity (shown as the left and middle plot groups in Figure 5). As a baseline representing an unrealistic video without any temporal consistency, random frames were sampled from different random videos at each timestamp; the resulting kernel PCA changes are represented by the right plot group in Figure 5.
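For concreteness, the kernel PCA change plotted in Figure 5 (the sum of absolute differences of kernel PCA components between adjacent frames) can be computed as in the sketch below. It assumes the per-frame kernels have already been extracted and flattened, fits the PCA on the given sequence rather than on a larger kernel pool, and uses an arbitrary number of components; all of these are simplifications for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def kernel_pca_change(kernels, n_components=10):
    """kernels: (T, k*k) array of flattened per-frame degradation kernels.
    Returns the per-step sum of absolute PCA component differences, shape (T-1,)."""
    codes = PCA(n_components=n_components).fit_transform(kernels)   # (T, n_components)
    return np.abs(np.diff(codes, axis=0)).sum(axis=1)
```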
[59] The graph of Figure 5 shows how temporal consistency may vary in different videos, by showing the difference between the neighbouring kernels (of their PCA features).
Specifically, Figure 5 shows how kernel temporal consistency may be quantified by measuring kernel PCA change (i.e. the sum of absolute differences of kernel PCA components) for consecutive/adjacent frames in videos with high/low kernel temporal consistency. Random frames are sampled from random videos at each timestamp shown in Figure 5 as a baseline, to highlight the temporal kernel consistency of frames within the same video. Kernel changes are represented by solid dots while boxplots show distributions. It can be seen that the kernel differences of some videos, namely the left group of plots showing video sequences 13, 16 and 22 taken from the Something-Something dataset, are of high temporal consistency, as the kernels remain largely unchanged throughout. In contrast, the middle group of plots represents the kernel differences of videos with low temporal kernel consistency, namely video sequences 2, 23 and 25 from the Something-Something dataset, as kernel changes are more significant.
More generally, videos with high kernel temporal consistency depict slow and steady movements with no motion blurs or scene changes (due to e.g. stable camera, fixed scene, etc). However, videos with low kernel temporal consistency have motion blur caused by rapid movements of the camera or object. The experiments highlight that SR kernels in real-world videos are often non-uniform and can exhibit different levels of temporal consistency.
[60] Temporal consistency is affected by the dynamicity of the videos. For instance, videos with high temporal consistency depict slow and steady movements with no motion blurs. For example, a video of a hand slowly reaching towards a cup has almost identical frames at each time step. In contrast, videos with low kernel temporal consistency consist of motion blur caused by rapid movements. For example, a video of a person weaving a hat or a person placing a container upright on a table may have frames which vary at each time step. As a result, previous works that assume fixed degradation kernels at every time step of a video suffer from severe kernel mismatch effects (such as visual artifacts and unnatural textures).
[61] A fixed kernel assumption further aggravates multi-frame super resolution (MFSR). The premise of these approaches is to utilise temporal frame information in order to boost the restoration performance. To this end, previous MFSR works used motion compensation to warp each supporting frame to its reference frame before fusing these frames together for upscaling. The optical flow used for warping is either estimated explicitly using traditional or deep motion-estimation techniques, or implicitly using adaptive filters or deformable convolutions. Although this may work well for synthetic videos in which the degradation process is fixed at every timestamp, real-world videos do not obey this assumption. As temporal alignment through implicit motion compensation cannot be visualised, a commonly-used explicit motion compensation model, PWCNet (Deqing Sun, X. Yang, Ming-Yu Liu, and J. Kautz; Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume; IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018), is used for motion estimation and for warping each supporting frame. Due to the kernel dynamicity, the warped supporting frames often suffer from kernel mismatch, introducing additional noise due to blur and propagating the errors through the restoration process.
[62] In order to visualise motion compensation for real-world videos, two sets of videos are considered: one from LR sequences of the original REDS dataset (Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee; Ntire 2019 challenge on video deblurring and superresolution: Dataset and study; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019), which are degraded using a fixed kernel, and the other from the REDS10 testing sequences (details discussed below), which are generated using different per-frame kernels and thus better resemble the degradation characteristics of real-world videos than the former. An explicit deep motion estimation model, PWCNet, is used, as PWCNet is commonly used in previous MFSR approaches to compute the optical flow. The optical flow is then used to warp each supporting frame. It is observed that motion compensation performs better on the fixed degradation video set, benefiting the previous approaches that were specifically designed under the fixed kernel assumption. On the other hand, due to the kernel dynamicity in real-world videos, the warped supporting frames of those approaches often suffer from kernel mismatch when dealing with videos of varying kernels. It was also found that this phenomenon occurs with the use of implicit motion compensation, and the errors incurred from inaccurate motion compensation can be propagated throughout the restoration process.
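The warping referred to in the preceding paragraphs, in which each supporting frame is aligned to the reference frame using an estimated optical flow, is conventionally realised by sampling the supporting frame along the flow field. A minimal sketch is given below; it is independent of whichever flow estimator (e.g. PWCNet) produced the flow, and the backward-flow convention and border padding are assumptions.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp a supporting frame towards the reference frame.

    frame: (B, C, H, W) supporting frame
    flow:  (B, 2, H, W) backward flow in pixels, channels ordered (x, y)
    """
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(frame)   # (1, 2, H, W)
    coords = grid + flow
    # normalise sampling coordinates to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, mode='bilinear',
                         padding_mode='border', align_corners=True)
```

When the flow is inaccurate, the warped frame is misaligned with the reference frame, which is the error source discussed above.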
[63] Figure 6A is a schematic diagram illustrating the video super resolution approach of the present techniques. Based on the above observations, the approach of the present techniques is designed to work with the varying downsampling kernels in real-world videos.
[64] Broadly speaking, the present techniques provide a machine learning, ML, model in which kernels are first extracted using a state-of-the-art image-based kernel estimation approach on real-world videos. The model is based on a recent image-based iterative blind SR approach which uses a deep alternating network (DAN). DAN is used to implement super resolution on images. More information on DAN can be found in: Zhengxiong Luo, Y. Huang, Shang Li, Liang Wang, and Tieniu Tan, "Unfolding the alternating optimization for blind super resolution", Advances in Neural Information Processing Systems, 2020. DAN provides a single network that comprises two convolutional neural modules - one module (called the "Estimator" module) to estimate the blur kernel for a low resolution image, and the other module (called the "Restorer" module) to generate a super resolution image from the low resolution image. The Restorer module restores the SR image based on the blur kernel predicted by the Estimator module, and the restored SR image is further used to help the Estimator module to better estimate the blur kernel. The two modules are alternated repeatedly.
[65] The DAN is used as the basis for the video super resolution method of the present techniques. However, DAN is used to upscale single low resolution images, and not videos. Thus, DAN cannot be directly applied to videos without being computationally expensive. For example, if each frame of a video were fed into DAN as an image, each frame could be upscaled, but as the process is being performed on a single frame at a time, it is slow and computationally expensive. The present techniques adapt DAN to provide a machine learning model which is applicable to low resolution videos. In particular, the present techniques exploit the kernel temporal consistency of frames in a video. As a result, fewer iterations are required during inference, outperforming existing state-of-the-art approaches for blind video super-resolution.
[66] Based on the above observations, the present techniques provide a video deep alternating network. As shown in Figure 6A, the network of the present techniques comprises an estimator module and a restorer module, which work iteratively to estimate the kernels and upscale the frames. In each iteration, the estimator uses the low resolution frames and previously computed upscaled frames to estimate the kernel sequences (in the form of feature vectors), using temporal self-attention to exploit the temporal consistency in kernels. On the other hand, the restorer uses the initial or previously estimated kernel sequence to upscale the low resolution frames. After a few iterations, the upscaled frames are processed by a frame alignment module, and the final super resolution frames are outputted. Self-attention is used in the estimator to better exploit the temporal correlation between frames in the video.
[67] More specifically, the present techniques adopt the image-based SR algorithm known as deep alternating network, DAN, and tailor it for multi-frame SR.
[68] For blind image-based SR, DAN proposes an end-to-end learning approach that estimates the kernel, k, and restores the image, x, alternately. DAN comprises two convolutional modules: 1) a restorer that constructs x given the LR image and the PCA of a kernel, and 2) an estimator that learns the PCA of kernel k, based on the LR image and the resulting super-resolved image features. The basic block for both convolutional modules is a conditional residual block (CRB), which concatenates the basic and conditional inputs channel-wise and then exploits the inter-dependencies among feature maps through a channel attention layer. The alternating algorithm executes both components iteratively, starting from an initial Dirac kernel k̂_0, and can be expressed as:

x̂_j = Restorer(LR image, k̂_{j-1}),   k̂_j = Estimator(LR image, x̂_j),

where j denotes the iteration round, j ∈ [1, I]. Both components are trained using the sum of the absolute difference (L1 loss) between k and the estimate k̂, and between x and the estimate x̂, produced by the last iteration.
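As a rough illustration of this alternating scheme, the sketch below shows one training step. The module interfaces, the zero kernel code standing in for the Dirac initialisation, and the exact ordering of the two components within the loop are simplifying assumptions and do not reproduce DAN's actual implementation.

```python
import torch
import torch.nn.functional as F

def dan_step(lr, hr, kernel_pca_gt, estimator, restorer, num_iters=4):
    """One training step of the alternating estimator/restorer scheme (sketch)."""
    k_hat = torch.zeros_like(kernel_pca_gt)     # stand-in for the initial Dirac kernel code
    sr = None
    for _ in range(num_iters):
        sr = restorer(lr, k_hat)                # restore the image given the current kernel
        k_hat = estimator(lr, sr)               # re-estimate the kernel from the restored image
    # L1 losses between the ground truths and the last-iteration estimates
    return F.l1_loss(k_hat, kernel_pca_gt) + F.l1_loss(sr, hr)
```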
[69] The present techniques extend DAN for videos, by altering DAN's estimator and restorer to utilise the kernel temporal consistency among frames and mitigate the effects of inaccurate motion estimation due to kernel mismatch. Specifically, the present techniques provide a computer-implemented method for using a machine learning, ML, model to perform video super resolution, the method comprising: receiving a video comprising a plurality of low resolution frames; and performing, using the ML model, an iterative process to upscale the low resolution frames by alternately: estimating a degradation kernel for a group of low resolution frames, the group comprising two or more sequential low resolution frames of the received video, and upscaling the group of sequential low resolution frames using the estimated degradation kernel, wherein the sequential low resolution frames are temporally consistent.
[70] Instead of estimating kernels individually for each frame, the present techniques leverage the key insight that the downsampling kernels of frames within a video are temporally consistent, thereby achieving a faster and more accurate kernel estimation for videos. The DAN estimator is therefore modified to take in multiple LR frames and generate their corresponding estimated kernels.
[71] Performing the iterative process comprises performing a predetermined number of iterations. The predetermined number of iterations may be, for example, between 1 and 10.
[72] During each iteration, estimating a degradation kernel may comprise: inputting the group of low resolution frames into an estimator module of the ML model, and any previously computed upscaled low resolution frames; and estimating the degradation kernel for the group of low resolution frames and the upscaled low resolution frames.
[73] During each iteration, upscaling the sequential low resolution frames may comprise: inputting the group of low resolution frames into a restorer module of the ML model, and the estimated degradation kernel; and upscaling, using the estimated degradation kernel, a resolution of the group of low resolution frames to upscaled low resolution frames.
[74] Then, the existing channel attention block in DAN is used by adopting an early fusion approach to exploit the inter-channel relationships between basic and conditional inputs and between temporal inputs. Specifically, LR frames are concatenated channel-wise and the existing structure of DAN's estimator is leveraged without adding additional channels or layers.
The estimator of the present techniques is able to more accurately estimate kernels temporally than DAN's estimator.
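A condensed sketch of this early fusion is given below. It concatenates the group of frames channel-wise at the input to an abstracted estimator body, whereas the described model injects the upscaled-frame features inside each conditional residual block; the resizing of the upscaled frames back to LR size for shape compatibility is likewise an assumption made only for this sketch.

```python
import torch
import torch.nn.functional as F

def temporal_kernel_estimate(lr_group, sr_group, estimator_body):
    """Early fusion for temporal kernel estimation (sketch).

    lr_group: (B, T, C, H, W) low resolution frames
    sr_group: (B, T, C, sH, sW) previously upscaled frames
    """
    b, t, c, h, w = lr_group.shape
    lr_stack = lr_group.reshape(b, t * c, h, w)              # concatenate frames channel-wise
    sr_stack = F.interpolate(sr_group.reshape(b, t * c, *sr_group.shape[-2:]),
                             size=(h, w), mode='bilinear', align_corners=False)
    fused = torch.cat([lr_stack, sr_stack], dim=1)           # (B, 2*T*C, H, W)
    return estimator_body(fused)                             # per-frame kernel codes
```

Fusing the frames early lets the existing channel attention layers weigh temporal as well as channel relationships without extra layers, which is the intent described above.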
[75] For multi-frame SR, the present techniques take in the reference frame and its supporting frames, and extend DAN's restorer. The performance gain from utilising the temporal information of multiple frames, however, is dependent on the accuracy of the motion estimation. An inaccurate flow can result in misaligned frames after motion compensation and thus in artifacts in the restored video. Notably, accurate motion estimation is a considerable limitation, as an inaccurate flow will lead to temporal misalignment, resulting in artifacts in the warped supporting frames, and thus propagating the error through the SR model. Although there have been advances to mitigate this limitation, estimating the high resolution optical flow is still a challenge because the HR image x is not available and the HR optical flow is approximated using the LR frames, and because the downsampling kernels of the frames within a video vary depending on its dynamicity. As a result, on top of the kernel mismatch problem in temporal restoration, assuming a fixed downsampling kernel at every timestamp further degrades the accuracy of the flow.
[76] Therefore, instead of following the convention of employing motion compensation on the LR frames or features directly, the present techniques use motion compensation on the frames after considering their corresponding kernels, in order to mitigate the aforementioned challenges. Specifically, the present techniques use the LR feature maps at the last restorer iteration before the final upsampling block and use temporal alignment (PCD module), fusion (TSA module) and video restoration (Restoration module). In other words, the present techniques merge kernel estimation and blind image restoration techniques with MFSR motion compensation methods, and make alterations in order for these modules to utilise temporal kernel consistency.
[77] Figure 6B is a diagram showing the architecture of the ML model of the present techniques. The features of the HR frames are concatenated with the LR features in each CRB block of DAN, and the existing channel attention layer (CALayer) is used for temporal kernel estimation. During the last iteration, the LR features, which were conditioned on the input frames and their estimated kernel, are fed into the temporal blocks of EDVR (Xintao Wang, Kelvin C. K. Chan, K. Yu, C. Dong, and Chen Change Loy; "Edvr: Video restoration with enhanced deformable convolutional networks"; IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019) for temporal alignment, fusion, and restoration. In particular, the PCD module follows a pyramid cascading structure, which concatenates features of differing spatial sizes and uses deformable convolution at each respective pyramid level to produce the aligned features. The TSA module then fuses these aligned features together through both temporal and spatial attention. Specifically, temporal attention maps are computed based on the aligned features and applied to these features through the dot product, before the features are concatenated and fused using a convolution layer. The fused features are then used to compute the spatial attention maps, which are in turn applied to these features.
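The temporal-attention part of the fusion described above can be condensed into the following sketch. The channel sizes, the sigmoid weighting of per-pixel similarities and the omission of the spatial attention stage are simplifications of EDVR's TSA module rather than its exact design.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Simplified temporal attention fusion over T aligned feature maps."""
    def __init__(self, channels=64, num_frames=5):
        super().__init__()
        self.embed = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(num_frames * channels, channels, 1)

    def forward(self, aligned, ref_index=None):
        # aligned: (B, T, C, H, W) features already aligned to the reference frame
        b, t, c, h, w = aligned.shape
        ref = t // 2 if ref_index is None else ref_index
        ref_emb = self.embed(aligned[:, ref])
        weighted = []
        for i in range(t):
            emb = self.embed(aligned[:, i])
            # temporal attention map: per-pixel similarity with the reference embedding
            attn = torch.sigmoid((emb * ref_emb).sum(dim=1, keepdim=True))
            weighted.append(aligned[:, i] * attn)            # weigh each frame's contribution
        return self.fuse(torch.cat(weighted, dim=1))         # fuse into a single feature map
```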
[78] In other words, the method may comprise determining, during a final iteration of the restorer module, feature maps for the group of low resolution frames prior to performing upscaling. That is, during the final iteration of the restorer module, low resolution feature maps are determined before the final upscaling step.
[79] The method may further comprise: inputting the feature maps determined during the final iteration of the restorer module into a temporal alignment module (i.e. the PCD module shown in Figure 6A) of the ML model; using the feature maps to align the group of low resolution frames with a reference frame of the received video at a feature level; and generating aligned features for the group of low resolution frames.
[80] The method may further comprise: inputting the generated aligned features into a fusion module (i.e. the TSA module shown in Figure 6A) of the ML model; and using the fusion module to: compute a contribution of each generated aligned feature of neighbouring frames; and fuse the aligned features into a single feature map. The contributions may be determined through attention mechanisms.
[81] The method may further comprise: inputting the feature map into a restoration module of the ML model; and generating, using the feature map, super resolution frames for the group of low resolution frames.
[82] The group of low resolution frames are processed together in order to exploit temporal consistency between neighbouring or sequential frames. However, there may be practical upper and lower limits on how many frames can be included in the group. For example, if there are too few frames in the group (e.g. one), the super resolution method is no longer computationally efficient. Similarly, if there are too many frames in the group (e.g. ten), the super resolution method may not be as accurate, as there may not be temporal consistency, or the same level of temporal consistency, across all the frames in the group. Thus, the group may comprise up to three or up to five sequential low resolution frames of the received video.
[083] The estimator module and the restorer module of the network/model of the present techniques may each comprise a convolutional neural network (CNN).
[84] Training. The ML model of the present techniques is trained using a training dataset comprising a plurality of videos. The training dataset comprises 250 videos from the REDS dataset (Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee; Ntire 2019 challenge on video deblurring and superresolution: Dataset and study; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019). Anisotropic Gaussian kernels with a size of 13x13 are generated. The lengths of both axes are uniformly sampled in (0.6, 5), and the kernel is then rotated by a random angle uniformly distributed in [-π, π]. To model real-world videos, uniform multiplicative noise, up to 25% of each pixel value of the kernel, is added to the generated noise-free kernel, which is then normalised to sum to one. Each frame of each HR video is degraded with a randomly generated kernel and then downsampled through bicubic interpolation to form the synthetic LR videos. The kernels are reshaped and reduced through principal component analysis (PCA) before feeding into the network.
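A sketch of this kernel synthesis is given below. The covariance construction from the sampled axis lengths and rotation, the reading of "up to 25% multiplicative noise" as a uniform factor in [-0.25, 0.25], and the omission of the subsequent PCA reduction are assumptions made for illustration.

```python
import numpy as np

def random_anisotropic_kernel(size=13, rng=None):
    """Generate one random anisotropic Gaussian blur kernel with multiplicative noise."""
    rng = np.random.default_rng() if rng is None else rng
    sig_x, sig_y = rng.uniform(0.6, 5.0, size=2)        # axis lengths sampled in (0.6, 5)
    theta = rng.uniform(-np.pi, np.pi)                  # random rotation angle
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    cov = rot @ np.diag([sig_x ** 2, sig_y ** 2]) @ rot.T
    coords = np.arange(size) - size // 2
    grid = np.stack(np.meshgrid(coords, coords), axis=-1)       # (size, size, 2)
    inv_cov = np.linalg.inv(cov)
    kernel = np.exp(-0.5 * np.einsum('ijk,kl,ijl->ij', grid, inv_cov, grid))
    # uniform multiplicative noise, here read as up to +/-25% of each value
    kernel *= 1.0 + rng.uniform(-0.25, 0.25, kernel.shape)
    return kernel / kernel.sum()                         # normalise to sum to one
```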
[85] This frame-wise synthesis approach is used for two reasons. Firstly, no video dataset with real-world kernels is available, and extracting large amounts of kernel sequences from video benchmarks for training is costly. Secondly, the synthetic training kernels generated as mentioned above can create various degradation in the individual frames, and thus are able to model real-world videos with varying levels of kernel temporal consistency.
[86] In order to test the trained model, a test dataset is created using 10 sequences (000 and 010-018) from the REDS dataset, denoted as REDS10, in order to mimic the actual degradation of real-world videos that are of varying video dynamicity. In order to generate testing videos which share similar degradation properties with real-world videos, a pool of kernel sequences is created from the Something-Something dataset (Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al.; The Something Something Video Database for Learning and Evaluating Visual Common Sense; ICCV, 2017), which is a real-world video prediction dataset. In particular, videos are sampled from the Something-Something dataset and an image-based kernel extraction method is applied to the videos to extract the sequences of kernels. This results in 10 real-world kernel sequences with varying levels of temporal consistency. Each video sequence in the test dataset is downsampled with a randomly drawn real-world kernel sequence from the pool in order to obtain the LR videos. In this way, a test dataset is created with similar degradation characteristics to those of real-world videos, which allows the corresponding ground truth to be obtained for quantitative evaluation of the performance of the trained model. For real-world video evaluations, videos from the Something-Something dataset are used directly.
[87] Effectiveness of Temporal Kernel Estimation. The effectiveness of taking multiple frames into account for kernel estimation was studied. In other words, instead of estimating kernels individually for each frame, the key insight that the downsampling kernels of frames within a video are temporally consistent is leveraged to achieve a faster and more accurate kernel estimation for videos. The estimator was modified to take in multiple LR frames, and their corresponding estimated kernels were generated. The existing channel attention block in DAN was used by adopting an early fusion approach, which merges information at the beginning of the block, to exploit the inter-channel relationships not only between basic and conditional inputs, but also among temporal inputs. Specifically, the features of the HR frames are concatenated with the LR features in every CRB in order to leverage the existing structure of DAN's estimator without adding additional channels or layers.
[88] The number of input frames to the estimator was experimented with, labelled as Est-α where α is the number of frames used for kernel estimation. Similarly, the number of frames β used for restoration is labelled as Res-β. For a fair comparison, DAN's restorer is used (where β = 1), which is single-frame and therefore does not include the adopted EDVR components. Figure 9 shows the distribution of kernel estimation errors of the aforementioned models in terms of the absolute sum of PCA difference between the estimated kernels and their respective ground truth kernels for all frames in each sequence found in REDS10. It is observed that independent kernel estimation per-frame can lead to a larger variance and numerous outliers as compared to temporal kernel estimation. Notably, temporal kernel estimation results in, on average, more accurate kernels for videos with high dynamicity, i.e. low kernel temporal consistency, while performing similarly for videos with high kernel temporal consistency. The performance increase in kernel estimation, however, did not improve performance significantly in video restoration, as shown in Table 1.
Table 1: Video restoration quality (PSNR/SSIM) on REDS10 for different combinations of kernel estimator (Est-α) and restorer (Res-β) configurations.
[089] This phenomenon is also observed in recent blind iterative image SR works, which reported that this is due to the restorer's robustness to the kernel estimation errors of the estimator, since the two modules are jointly trained. Although having a more accurate kernel estimation did not drastically impact single-frame video restoration performance, it is shown below that it is essential for improving the performance of a multi-frame restoration approach.
[090] The performance gain of utilizing the temporal information of multiple frames is dependent on the accuracy of its motion estimation; an inaccurate flow can result in misaligned frames after motion compensation and thus artifacts in the restored video. As explained above, performing motion compensation under the assumption of a fixed SR kernel directly on real-world videos can result in regular artifacts in the warped frames. To mitigate this, instead of following the convention of employing motion compensation on the LR frames or features directly, motion compensation on the LR frames is performed after considering their corresponding kernels. Specifically, the feature maps at the last restorer iteration are utilized (as shown in Figure 6A), which embed both the LR frame and the corresponding kernel features from the estimator. EDVR is adopted for temporal alignment, fusion, and restoration. This approach mitigates the problem of inaccurate motion compensation caused by kernel variation in real-world videos, but the restoration performance may still depend on the accuracy of estimated kernels; errors in kernel estimation would propagate and result in inaccurate motion compensation.
[091] To verify this, the multi-frame restorer, β = {3, 5}, was run with a single-frame estimator, α = 1, and compared with running the multi-frame restorer together with the multi-frame estimator. The results are shown in Table 1. As expected, having a multi-frame restorer resulted in an improvement in video restoration similar to that of previous works. However, these per-frame estimator MFSR models did not perform as well as their temporal estimator counterparts. In particular, although the per-frame estimator MFSR model of the present techniques utilized information from 5 frames (Est-1 + Res-5) to restore each frame, it did not outperform the temporal estimator MFSR model of the present techniques that only exploited information from 3 frames (Est-3 + Res-3). Hence, it can be concluded that the kernel mismatch errors incurred during kernel estimation propagated through the implicit motion compensation module of EDVR, affecting temporal alignment, fusion, and thus restoration. In other words, more accurate estimated kernels through the temporal kernel estimator enable the multi-frame restorer to leverage temporal frame information better. Therefore, the interplay between accurate kernel estimation and motion compensation is the key to utilizing temporal kernel consistency for video restoration.
[092] Figure 7 is a flowchart of example steps of a method to perform video super resolution.
The method comprises receiving a video comprising a plurality of low resolution frames (step S100). The method comprises performing an iterative process to upscale the low resolution frames. Specifically, the method comprises performing, using the ML model, an iterative process to upscale the low resolution frames by alternately: estimating degradation kernels for a group of low resolution frames, the group comprising two or more sequential low resolution frames of the received video, wherein the sequential low resolution frames are temporally consistent (step S102); and upscaling the group of sequential low resolution frames using the estimated degradation kernels (step S104). After one or more iterations, the final SR frames may be generated and these may be used to output a super resolution video (step S106).
[093] Figure 8 shows an apparatus 100 that may be used to perform video super resolution. The apparatus 100 may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, a drone, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, or a smart consumer device (such as a smart fridge). It will be understood that this is a non-exhaustive and non-limiting list of example devices.
[094] The apparatus 100 comprises a trained machine learning, ML, model 106 for performing video super resolution.
[095] The apparatus comprises at least one processor 102 coupled to memory 104. The at least one processor 102 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The memory 104 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
[096] The at least one processor 102 may be arranged to: receive a video comprising a plurality of low resolution frames and an initial kernel sequence; and perform an iterative process to upscale the low resolution frames, the iterative process comprising: estimating a degradation kernel for a group of low resolution frames, the group comprising two or more sequential low resolution frames of the received video, wherein the sequential low resolution frames are temporally consistent, and upscaling the group of sequential low resolution frames using the estimated degradation kernel.
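Purely as an illustrative sketch of how the apparatus-side loop in the preceding paragraph might be organised, the class below refines a supplied initial kernel sequence over the iterations. How the initial kernel sequence is consumed, and the module interfaces themselves (including the estimator accepting None before any upscaled frames exist), are assumptions made for the example rather than the claimed implementation.

```python
# Illustrative sketch only: an apparatus-side wrapper that receives LR frames
# and an initial kernel sequence, then alternately estimates kernels and
# upscales the group. All module interfaces are assumptions.
from typing import Optional

import torch
import torch.nn as nn


class VideoSuperResolver(nn.Module):
    def __init__(self, estimator: nn.Module, restorer: nn.Module,
                 num_iterations: int = 4):
        super().__init__()
        self.estimator = estimator
        self.restorer = restorer
        self.num_iterations = num_iterations

    def forward(self, frames: torch.Tensor,
                initial_kernels: torch.Tensor) -> torch.Tensor:
        """frames: (B, T, C, H, W); initial_kernels: e.g. (B, T, k * k)."""
        kernels = initial_kernels
        sr: Optional[torch.Tensor] = None
        for _ in range(self.num_iterations):
            # Estimate a refined kernel sequence from the LR frames, any
            # previously computed upscaled frames, and the previous kernels.
            kernels = self.estimator(frames, sr, kernels)
            # Upscale the group of sequential LR frames with those kernels.
            sr = self.restorer(frames, kernels)
        return sr
```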
[097] The apparatus may further comprise at least one image capture device 108 for capturing images or videos to be processed by the ML model.
[098] The apparatus may further comprise a display or display screen 108 for providing a result of the processing by the ML model to a user of the apparatus, i.e. for displaying the SR video.
[099] The apparatus may further comprise at least one interface 110 for receiving a video comprising low resolution frames. For example, the apparatus may comprise an image capture device which captures low resolution videos. In another example, the apparatus may comprise a communication module via which the apparatus receives a high resolution video that may be compressed and transmitted as a low resolution video. In each case, it may be desirable to upscale the low resolution frames to generate a better quality, higher resolution video.
[100] The present techniques may be performed on videos which have already been recorded, or may be performed in near-real time on videos which are being captured by the apparatus.
[101] Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode, and where appropriate other modes, of performing the present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that the present techniques have a broad range of applications, and that the embodiments may be modified in a wide range of ways without departing from any inventive concept as defined in the appended claims.

Claims (13)

  1. A computer-implemented method for using a machine learning, ML, model to perform video super resolution, the method comprising: receiving a video comprising a plurality of low resolution frames; and performing, using the ML model, an iterative process to upscale the low resolution frames by alternately: estimating degradation kernels for a group of low resolution frames, the group comprising two or more sequential low resolution frames of the received video, wherein the sequential low resolution frames are temporally consistent, and upscaling the group of sequential low resolution frames using the estimated degradation kernels.
  2. The method as claimed in claim 1 further comprising: outputting a video comprising super resolution frames.
  3. The method as claimed in claim 1 or 2 wherein performing the iterative process comprises performing a predetermined number of iterations.
  4. The method as claimed in claim 1, 2 or 3 wherein, during each iteration, estimating a degradation kernel comprises: inputting the group of low resolution frames into an estimator module of the ML model, and any previously computed upscaled low resolution frames; and estimating the degradation kernel for the group of low resolution frames and the upscaled low resolution frames.
  5. The method as claimed in any preceding claim wherein, during each iteration, upscaling the sequential low resolution frames comprises: inputting the group of low resolution frames into a restorer module of the ML model, and the estimated degradation kernel; and upscaling, using the estimated degradation kernel, a resolution of the group of low resolution frames to upscaled low resolution frames.
  6. The method as claimed in any of claims 3, 4 or 5 further comprising: determining, during a final iteration of the restorer module, feature maps for the group of low resolution frames prior to performing upscaling.
  7. The method as claimed in claim 6 further comprising: inputting the feature maps determined during the final iteration of the restorer module into a temporal alignment module of the ML model; using the feature maps to align the group of low resolution frames with a reference frame of the received video at a feature level; and generating aligned features for the group of low resolution frames.
  8. The method as claimed in claim 7 further comprising: inputting the generated aligned features into a fusion module of the ML model; and using the fusion module to: compute a contribution of each generated aligned feature of neighbouring frames; and fuse the aligned features into a feature map.
  9. The method as claimed in claim 8 further comprising: inputting the feature map into a restoration module of the ML model; and generating, using the feature map, super resolution frames for the group of low resolution frames.
  10. The method as claimed in any of claims 4 to 9 wherein the estimator module comprises a convolutional neural network.
  11. The method as claimed in any of claims 5 to 10 wherein the restorer module comprises a convolutional neural network.
  12. A non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out the method of any of claims 1 to 11.
  13. An apparatus for performing video super resolution using a machine learning, ML, model, the apparatus comprising: at least one processor coupled to memory and arranged to: receive a video comprising a plurality of low resolution frames; and perform, using the ML model, an iterative process to upscale the low resolution frames by alternately: estimating a degradation kernel for a group of low resolution frames, the group comprising two or more sequential low resolution frames of the received video, wherein the sequential low resolution frames are temporally consistent, and upscaling the group of sequential low resolution frames using the estimated degradation kernel.
GB2104311.2A 2020-11-09 2021-03-26 Method and apparatus for video super resolution Active GB2600787B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GBGB2017662.4A GB202017662D0 (en) 2020-11-09 2020-11-09 Method and apparatus for video super resolution

Publications (3)

Publication Number Publication Date
GB202104311D0 GB202104311D0 (en) 2021-05-12
GB2600787A true GB2600787A (en) 2022-05-11
GB2600787B GB2600787B (en) 2022-12-28

Family

ID=74046339

Family Applications (2)

Application Number Title Priority Date Filing Date
GBGB2017662.4A Ceased GB202017662D0 (en) 2020-11-09 2020-11-09 Method and apparatus for video super resolution
GB2104311.2A Active GB2600787B (en) 2020-11-09 2021-03-26 Method and apparatus for video super resolution

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GBGB2017662.4A Ceased GB202017662D0 (en) 2020-11-09 2020-11-09 Method and apparatus for video super resolution

Country Status (1)

Country Link
GB (2) GB202017662D0 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11647153B1 (en) * 2021-12-31 2023-05-09 Dell Products L.P. Computer-implemented method, device, and computer program product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610031A (en) * 2021-08-14 2021-11-05 北京达佳互联信息技术有限公司 Video processing method and video processing device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10701394B1 (en) * 2016-11-10 2020-06-30 Twitter, Inc. Real-time video super-resolution with spatio-temporal networks and motion compensation
CN111369442A (en) * 2020-03-10 2020-07-03 西安电子科技大学 Remote sensing image super-resolution reconstruction method based on fuzzy kernel classification and attention mechanism
CN112767250A (en) * 2021-01-19 2021-05-07 南京理工大学 Video blind super-resolution reconstruction method and system based on self-supervision learning

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Bare Bahetiyaer et al. Real-time video super-resolution via motion convolution kernel estimation *
DEQING SUN, X. YANG, MING-YU LIU, J. KAUTZ: "Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume", IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018
RAGHAV GOYAL, SAMIRA EBRAHIMI KAHOU, VINCENT MICHALSKI, JOANNA MATERZYNSKA, SUSANNE WESTPHAL, HEUNA KIM, VALENTIN HAENEL, INGO FRUEND, PETER YIANI: "The Something Something Video Database for Learning and Evaluating Visual Common Sense", ICCV, 2017
SEUNGJUN NAH, SUNGYONG BAIK, SEOKIL HONG, GYEONGSIK MOON, SANGHYUN SON, RADU TIMOFTE, KYOUNG MU LEE: "Ntire 2019 challenge on video deblurring and superresolution: Dataset and study", THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) WORKSHOPS, 2019
SEUNGJUN NAH, SUNGYONG BAIK, SEOKIL HONG, GYEONGSIK MOON, SANGHYUN SON, RADU TIMOFTE, KYOUNG MU LEE: "Ntire 2019 challenge on video deblurring and superresolution: Dataset and study", THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) WORKSHOPS, 2019
US 10701394 B2 (CABALLERO JOSE et al.) *
XINTAO WANG, KELVIN C. K. CHAN, K. YU, C. DONG, CHEN CHANGE LOY: "Edvr: Video restoration with enhanced deformable convolutional networks", IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), 2019

Also Published As

Publication number Publication date
GB2600787B (en) 2022-12-28
GB202017662D0 (en) 2020-12-23
GB202104311D0 (en) 2021-05-12

Similar Documents

Publication Publication Date Title
Zhang et al. Gated fusion network for joint image deblurring and super-resolution
Fuoli et al. Efficient video super-resolution through recurrent latent space propagation
Zhang et al. Deep unfolding network for image super-resolution
Dong et al. Model-guided deep hyperspectral image super-resolution
Gu et al. Blind super-resolution with iterative kernel correction
Dosovitskiy et al. Generating images with perceptual similarity metrics based on deep networks
Zhang et al. Residual dense network for image super-resolution
Jang et al. C2n: Practical generative noise modeling for real-world denoising
Purohit et al. Bringing alive blurred moments
Chen et al. Fast image processing with fully-convolutional networks
Pearl et al. Nan: Noise-aware nerfs for burst-denoising
Shamsolmoali et al. Deep convolution network for surveillance records super-resolution
WO2019136077A1 (en) Frame-recurrent video super-resolution
US20100166332A1 (en) Methods of deblurring image and recording mediums having the same recorded thereon
US11741579B2 (en) Methods and systems for deblurring blurry images
GB2600787A (en) Method and apparatus for video super resolution
Pérez-Pellitero et al. Photorealistic video super resolution
Fan et al. An empirical investigation of efficient spatio-temporal modeling in video restoration
Sharif et al. DarkDeblur: Learning single-shot image deblurring in low-light condition
Huo et al. Blind image deconvolution using variational deep image prior
Zhao et al. Towards authentic face restoration with iterative diffusion models and beyond
Liu et al. Multiple connected residual network for image enhancement on smartphones
Xie et al. Bidirectionally aligned sparse representation for single image super-resolution
Shrivastava et al. Video dynamics prior: An internal learning approach for robust video enhancements
Liu et al. Restoring images with unknown degradation factors by recurrent use of a multi-branch network