WO2023138764A1

WO2023138764A1 - Device and method for super resolution kernel estimation

Info

Publication number: WO2023138764A1
Application number: PCT/EP2022/051138
Authority: WO
Inventors: Mehmet YAMAC; Aakif NAWAZ; Baran ATAMAN
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2022-01-19
Filing date: 2022-01-19
Publication date: 2023-07-27

Abstract

This disclosure relates to super resolution (SR) imaging, for example, using a convolutional neural network (CNN). The disclosure is concerned with estimating a SR kernel for an input image. To this end, the disclosure proposes a device comprising a processor. The processor is configured to estimate an effective blurring kernel based on an input image of a scene and a reference image of at least a part of the same scene. The input image has a lower image quality and/or resolution than the reference image. The processor is further configured to estimate a coarse SR kernel by convolving the effective blurring kernel with itself a specified number of times. The specified number is based on a target resolution, and the target resolution is higher than the resolution of the input image. The processor is further configured to estimate a final SR kernel by refining the coarse SR kernel.

Description

DEVICE AND METHOD FOR SUPER RESOLUTION KERNEL ESTIMATION

TECHNICAL FIELD

The present disclosure relates to super resolution (SR) imaging, for example, to SR imaging using convolutional neural network (CNN) technology. The disclosure is specifically concerned with estimating a SR kernel for an input image. To this end, the disclosure proposes a device for SR imaging, a method for SR imaging, and a corresponding computer program.

BACKGROUND

Conventional method for SR imaging, which use one or more CNNs for SR kernel estimation, have achieved remarkable performance when being applied to a lower resolution (LR) image, which has been obtained from a higher resolution (HR) image with an ideal and predefined downsampling process. In particular, most CNN-based methods have been trained with a set of such LR images, which were degraded from HR images by convolution with a fixed blurring kernel (e.g., bicubic or Gaussian) followed by subsampling. The conventional methods have accordingly been tested with similarly produced synthetic image data. However, the performance of these CNN-based methods comes short when being applied to a real- world image, which has been obtained based on a more complex and/or unknown degradation operation.

In particular, when a conventional CNN-based method is applied to such a real- world image, which has an unknown downsampling pattern (unlike the synthetically generated LR-HR image pair mentioned above), the performance of the method drops drastically. Applying the method in this way is referred to as the blind SR problem. Many conventional methods do not show any significant improvement in regards to the blind SR problem compared to a simple binary cubic interpolation for a real-world image. The blind SR problem can be defined as accurately estimating both unknown downsampling blurring kernel (SR kernel) and corresponding HR images. Estimating SR kernels accurately to solve the blind SR problem is thus still an open research question, and there are only a few exemplary approaches that focus on this problem.

A first exemplary approach is model-based and applies a bicubic interpolation to a LR image, in order to obtain a coarse estimation of a HR image. Then, an approximate SR kernel is estimated using an existing blind deblurring algorithm.

A second exemplary approach uses an internal generative adversarial network (Internal- GAN), and is thus referred to as “KernelGAN”. In particular, KernelGAN is a type of network that produces downscaled versions of test images in a training phase, in order to learn image specific SR kernels. KernelGAN is composed of one generator network and one discriminator network. Given a test image to be upscaled, the generator network generates a lower scale image by degrading and downscaling the test image. The discriminator network then tries to distinguish, whether the generated LR image has the same patch distribution as the original one. The discriminator and generator networks are trained by using crops from the test image in an alternating manner. The downscaled generator network is trained to fool the discriminator network in each iteration. After convergence, the generator network can be used as the SR degradation model. The generator network may have five fully convolutional layers and one downsampled layer with a certain scale factor. Therefore, the impulse response of the convolutional layers produces the SR kernel estimation.

Although the approach using KernelGAN is able to provide remarkable results in the SR kernel estimation, its SR kernel recovery performance is still limited. Further, there is still room for significant improvement in terms of SR kernel reconstruction accuracy. Moreover, since this approach uses an image-specific training for performing the SR kernel estimation, it is not feasible to use it in real-time applications, for instance, in mobile devices.

Consequently, while the approach with KernelGAN shows promising results, its limited recovery performance and high computational complexity make it unsuitable for real-time application usage. A third exemplary approach is the flow-based kernel prediction (FKP), which is based on KemelGAN. The FKP approach uses the normalizing flow technique to first learn a latent space for SR kernels. The SR kernel estimation is then performed in this latent space. As a result, the search space is narrowed, which gives a more robust estimation. However, the SR kernel reconstruction accuracy is low. Further, since the SR kernel estimation is done by image specific training, the estimation is slow. Thus, the approach cannot be used for real time applications.

SUMMARY

This disclosure and its solutions are further based on the following considerations made by the inventors.

A fourth exemplary approach of a CNN-based method was developed, which is referred to as KernelNet. This approach is able to significantly improve the SR kernel estimation accuracy, while simultaneously reducing the computational time. The KernelNet approach is based on a modular and interpretable kernel estimation network, and offers real-time kernel estimation, in contrast to the other conventional methods. In addition, the KernelNet approach can significantly improve the SR kernel recovery over the conventional methods described above.

However, the KernelNet approach is developed for a single image input, and therefore it is not a reference-based solution. From the single image, the method estimates a sharp gradient first, and then goes to a kernel space. Although the method is fast, it could still be much faster without this sharp gradient estimation step.

In this respect, notably all conventional CNN-based SR kernel estimation methods provide an estimation from only a single image input. There is no existing solution for a referencebased SR kernel estimation. Thus, the conventional methods lack accuracy and/or speed.

In view of the above, this disclosure has the objective to provide a device and method that are able to estimate a SR kernel more accurately and/or faster than the conventional methods, in particular, when being applied to a real-life image with complex and/or unknown degradation. The device and method should provide better results regarding the blind SR problem. The device and method should also operate faster - specifically, faster than KemelGAN and even KemelNet - in order to be optimally usable in real-time applications.

These and other objectives are achieved by the solutions of this disclosure described in the enclosed independent claims. Advantageous implementations are further defined in the dependent claims.

A first aspect of this disclosure provides a device for SR imaging, the device comprising a processor configured to: estimate an effective blurring kernel based on an input image of a scene and a reference image of at least a part of the same scene, wherein the input image has a lower image quality and/or resolution than the reference image; estimate a coarse SR kernel by convolving the effective blurring kernel with itself a specified number of times, wherein the specified number is based on a target resolution, and wherein the target resolution is higher than the resolution of the input image; and estimate a final SR kernel by refining the coarse SR kernel.

The device of the first aspect, similar to KernelNet, is configured to implement a three-step processing procedure based on the input image, in order to determine the (final) SR kernel as an output. The device is particularly able to find an accurate solution to the blind SR problem. The device of the first aspect outperforms conventional SR kernel estimation approaches, and is able to estimate the SR kernel more accurately, especially, when being applied to real-life images. For example, the device of the first aspect outperforms KemelGAN by a significant margin in SR kernel reconstruction accuracy. In addition, the SR kernel determination carried out by the device of the first aspect is fast enough to be usable in real-time applications.

Since the processor of the device of the first aspect has the input image and the reference image available, a sharp gradient of the input image does not have to be estimated, but can be taken from the reference image. Thus, the solution is faster even than KernelNet. Due to the reference image, the solution is also more accurate than all the conventional methods. In an implementation form of the first aspect, the input image has a wider field-of-view, (FOV) than the reference image.

For instance, the input image may be a wide angle image, while the reference image may be a telephoto image. These images may be obtained by different cameras/sensors of the same, for instance, mobile phone or tablet or the like. That is, the solution of this disclosure is able to leverage different cameras on the same handheld device.

In an implementation form of the first aspect, the input image is a first image acquired with a first imaging device and the reference image is a second image acquired with a second imaging device.

The first imaging device may be a camera/sensor, for example, of a mobile phone. The second imaging device may be another sensor, for example, of a SLR or DSLR camera. The second image may be taken with a much higher quality and/or resolution than the first image. This significantly improves the estimation of the final SR kernel.

In an implementation form of the first aspect, the processor is further configured to: estimate a plurality of SR kernels by using a plurality of first images of the scene as input images and a plurality of second images of the scene as reference images; use the plurality of estimated SR kernels and a plurality of high resolution images to obtain a plurality of degraded images; and train the processor by using the plurality of degraded images as input images and the plurality of high resolution images as ground truth images.

In this way, the processor can be trained to yield accurate SR kernel estimations when being applied to real-life images with complex and/or unknown degradation.

In an implementation form of the first aspect, the processor is further configured to perform a template matching and alignment procedure on the input image and the reference image before estimating the effective blurring kernel.

This improves the estimation of the effective blurring kernel, and in consequence that of the final SR kernel. In an implementation form of the first aspect, the template matching and alignment procedure comprises a step of warping the input image and/or warping the reference image, and the processor is further configured to: mask one or more warped regions of the input image and/or of the reference image before estimating the effective blurring kernel.

This further improves the estimation of the effective blurring kernel, and in consequence that of the final SR kernel.

In an implementation form of the first aspect, the estimating of the effective blurring kernel comprises: determining a first gradient of the input image and determining a second gradient of the reference image; applying a fast Fourier transformation FFT to the first gradient and the second gradient; and processing the result of applying the FFT in an FFT domain, to obtain the effective blurring kernel.

Thus, a sharp gradient estimation step based on the input image, as in KernelNet, is not required. The sharp gradient can be determined form the reference image.

In an implementation form of the first aspect, the processor is configured to estimate the effective blurring kernel, the coarse SR kernel, and the final SR kernel, by using one or more CNNs.

The device of the first aspect is accordingly configured to perform a CNN -based method for the SR kernel estimation.

In an implementation form of the first aspect, the processor is further configured to train the one or more CNNs by using a first set of one or more first images acquired with the first imaging device and a second set of one or more second images acquired with the second imaging device.

In an implementation form of the first aspect, the processor is further configured to estimate, for each pair of a first image from the first set and a second image from the second set, a difference in image quality between the first image and second image; and train the one or more CNNs based on the estimated differences in quality. Due to the above training, more accurate estimation results can be achieved when the one or more CNNs are applied to real-world images.

In an implementation form of the first aspect, the processor is configured to estimate the coarse SR kernel by convolving the effective blurring kernel with itself 2^sf times, wherein sf is an upscaling factor between the resolution of the input image and the target resolution.

In this way, the device of the first aspect is configured to upscale the effective blurring kernel, depending on the target resolution of the HR image. The upscaling, by performing the convolution, is particularly fast and efficient.

In an implementation form of the first aspect, the processor is configured to estimate the effective blurring kernel using a blind deblurring algorithm.

This technique yields fast and reliable results for obtaining the effective blurring kernel.

In an implementation form of the first aspect, the CNN comprises multiple modules, each module being usable either for estimating the effective blurring kernel, for estimating the coarse SR kernel, or for estimating the final SR kernel.

Thus, the CNN used by the device of the first aspect can be a modular NN, which may provide improved speed and accuracy.

In an implementation form of the first aspect, each module comprises either matrix multiplication layers in a FFT domain or one or more hidden layers of the CNN.

In an implementation form of the first aspect, the processor is further configured to use the final SR kernel and the input image to estimate an output image having the target resolution.

Thus, the SR kernel determination of the device of the first aspect can be directly used for SR imaging, e.g., in a real-time application. The device of the first aspect may accordingly be configured as a SR solution with SR kernel estimation. A second aspect of this disclosure provides a camera for SR imaging, the camera comprising: a first imaging device configured to obtain a first image with a wider FOV; a second imaging device configured to obtain a second image with a narrower FOV; a device according to the first aspect or any implementation form thereof, wherein the device is configured to estimate the final SR kernel by using the first image as the input image and the second image as the reference image; and use the final SR kernel and the input image to estimate an output image having the target resolution.

Notably, the device of the first aspect or any implementation form thereof may, for example, be a processing device or an imaging pipeline, but not necessarily a camera that also has optical imaging means like a sensor.

A third aspect of this disclosure provides a method for SR imaging, the method comprising: estimating an effective blurring kernel based on an input image of a scene and a reference image of at least a part of the same scene, the input image having a lower image quality and/or resolution than the reference image; estimating a coarse SR kernel by convolving the effective blurring kernel with itself a specified number of times, wherein the specified number is based on a target resolution that is higher than a resolution of the input image; and estimating a final SR kernel by refining the coarse SR kernel.

In an implementation form of the third aspect, the input image has a wider FOV than the reference image.

In an implementation form of the third aspect, the input image is a first image acquired with a first imaging device and the reference image is a second image acquired with a second imaging device.

In an implementation form of the third aspect, the method further comprises: estimating a plurality of SR kernels by using a plurality of first images of the scene as input images and a plurality of second images of the scene as reference images; using the plurality of estimated SR kernels and a plurality of high resolution images to obtain a plurality of degraded images; and training the processor by using the plurality of degraded images as input images and the plurality of high resolution images as ground truth images. In an implementation form of the third aspect, the method further comprises performing a template matching and alignment procedure on the input image and the reference image before estimating the effective blurring kernel.

In an implementation form of the third aspect, the template matching and alignment procedure comprises a step of warping the input image and/or warping the reference image, and the method further comprises masking one or more warped regions of the input image and/or of the reference image before estimating the effective blurring kernel.

In an implementation form of the third aspect, the estimating of the effective blurring kernel comprises: determining a first gradient of the input image and determining a second gradient of the reference image; applying a fast Fourier transformation FFT to the first gradient and the second gradient; and processing the result of applying the FFT in an FFT domain, to obtain the effective blurring kernel.

In an implementation form of the third aspect, the method further comprises estimating the effective blurring kernel, the coarse SR kernel, and the final SR kernel, by using one or more CNNs.

In an implementation form of the third aspect, the method further comprises training the one or more CNNs by using a first set of one or more first images acquired with the first imaging device and a second set of one or more second images acquired with the second imaging device.

In an implementation form of the third aspect, the method further comprises estimating, for each pair of a first image from the first set and a second image from the second set, a difference in image quality between the first image and second image; and training the one or more CNNs based on the estimated differences in quality.

In an implementation form of the third aspect, the method further comprises estimating the coarse SR kernel by convolving the effective blurring kernel with itself 2^sf times, wherein sf is an upscaling factor between the resolution of the input image and the target resolution. In an implementation form of the third aspect, the method comprises estimating the effective blurring kernel using a blind deblurring algorithm.

In an implementation form of the third aspect, the CNN comprises multiple modules, each module being usable either for estimating the effective blurring kernel, for estimating the coarse SR kernel, or for estimating the final SR kernel.

In an implementation form of the third aspect, each module comprises either matrix multiplication layers in a FFT domain or one or more hidden layers of the CNN.

In an implementation form of the third aspect, the method further comprises using the final SR kernel and the input image to estimate an output image having the target resolution.

The method of the third aspect and its implementation forms achieve all advantages and effects described above for the device of the first aspect and its respective implementation forms.

A fourth aspect of this disclosure provides a computer program comprising a program code for, when running on a processor, causing the method of the third aspect or any of its implementation forms to be performed.

A fifth aspect of this disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the third aspect or any of its implementation forms to be performed.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows an implementation of the solution of this disclosure. In particular, the solution is based on a device according to an embodiment of this disclosure.

FIG. 2 shows a device according to an embodiment of this disclosure.

FIG. 3 shows an exemplary device according to an embodiment of this disclosure, which may be used for realizing a reference-based blind SR kernel estimation.

FIG. 4 shows an exemplary realization of the reference-based SR using a multicamera approach.

FIG. 5a shows a comparison of the outputs from the same blind SR resolution when using different kernel estimators. In particular, KemelNet and the solution of this disclosure (referred to as KernelNet-R) are compared.

FIG. 5b shows a comparison in another image sample of KemelNet and KernelNet- R.

FIG. 6 shows a super-resolved wide angle image, which is produced with the solution of this disclosure, compared to the wide angle input image and the reference telephoto image.

FIG. 7 shows an exemplary realization of a realistic image degradation proposed by this disclosure. FIG. 8 shows a method for SR imaging according to an embodiment of this disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows an implementation or pipeline of the proposed solution of this disclosure. In particular, the solution of this disclosure is based on the device 100 for SR imaging. The device 100 can be used for estimating a SR kernel of a real-world image. The device 100 is configured to output this final SR kernel 103. The device 100 may be configured as a full SR solution with SR kernel estimation, that is, the device may also provide a superresolved image as an output. The device 100 may be based on one or more NNs, for instance, CNNs.

The device 100 is configured to receive an input image lOli and a reference image lOlr. The input image 10 li is of a scene, and the reference image 10 Ir is of at least a part of the same scene. The reference image lOlr has at least one of a higher image quality and a higher resolution than the input image lOli.

The device 100 is further configured to output the final SR kernel 103. The final SR kernel 103 may be estimated by the device 100 based on a target resolution, wherein the target resolution is higher than the resolution of the input image lOli. The target resolution may be the desired resolution of the super-resolved output image, which may be estimated using the final SR kernel 103 and the input image lOli. The target resolution may be predetermined in the device 100. The target resolution may also be provided to the device 100 based on one or more inputs. For instance, the target resolution could be directly input into the device 100. Alternatively, a value indicating the target resolution could be input into the device 100. For example, a desired scale factor 102 (between the resolution of the input image 10 li and the desired output image) could be input into the device 100, as shown exemplarily in FIG. 1.

Generally, the device 100 is thus configured to use a reference image lOlr for performing a blind SR kernel estimation, i.e., a reference-based blind SR kernel estimation. FIG. 2 shows the device 100 building on FIG. 1. As shown in FIG. 2, the device 100 comprises a processor (here illustrated by exemplary functional blocks, as described below). The device 100 may also be the processor in its entirety, and may be part of another device like a camera. The processor of the device 100 is configured to perform the reference-based blind SR kernel estimation as follows.

The processor (may also be referred to as processing circuitry) may comprise multiple functional blocks 114, 145, and 156 (may be referred to as processing blocks or units) as shown in FIG. 2. The processor is configured to perform, conduct or initiate the various operations of the device 100 described herein. The processor may comprise hardware and/or may be controlled by software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application- specific integrated circuits (ASICs), field-programmable arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. The device 100 may further comprise memory circuitry (not shown), which is configured to store one or more instruction(s) that can be executed by the processor, in particular under control of the software. For instance, the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor, causes the various operations of the device 100 to be performed. In one embodiment, the processor comprises one or more processing units and a non-transitory memory connected to the one or more processing units. The non-transitory memory may carry executable program code which, when executed by the one or more processing units causes the device 100 to perform, conduct or initiate the operations or methods described herein.

The processor is especially configured to process the input image 10 li and the reference image lOlr, which are received by the device 100. First, the processor is configured to estimate an effective blurring kernel 104 based on the input image 10 li and the reference image lOlr. This may be done by the functional block 114 of the processor. Then, the processor is configured to estimate a coarse SR kernel 105, namely by convolving the effective blurring kernel 104 with itself a specified number of times. Thereby, the specified number is based on a target resolution, e.g., on the resolution of an output image 106 that is to be obtained. This output image 106 could optionally be produced and output by the device 100. For instance, the effective blurring kernel 104 may be convolved 2^sf times with itself, wherein 2^sf is the specified number of times, and wherein sf is an upscaling factor between the resolution of the input image lOli and the target resolution lOlr. This upscaling factor corresponds to the scale factor 102 shown in FIG. 1, and may optionally be provided to the device 100 and/or the processor. This may be done by the functional block 145 of the processor. Further, the processor is configured to estimate the final SR kernel 103 by refining the coarse SR kernel 105. This may be done by the functional block 153 of the processor.

The optional functional block 113 could, optionally, apply the final SR kernel 103 to the input image lOli, in order to output the super-resolved output image 106.

FIG. 3 shows a device 100 according to an embodiment of this disclosure, which builds on the device 100 shown in FIG. 1 and 2. Same reference signs in FIG. 1, FIG. 2 and FIG. 3 are used for the same elements, wherein implementations of these elements may be identical. The device 100 is configured to realize the reference-based blind SR kernel estimation proposed in this disclosure. The device 100 may be based on a specific network 300, which is a reference-based SR kernel estimation network, and is referred to as KemelNet-R in this disclosure

The network 300 may be based on, or may be, a CNN. That is, the device 100 may be able to estimate the effective blurring kernel 104, the coarse SR kernel 105, and the final SR kernel 102, respectively, using a CNN 300. The CNN 300 may have a plurality of modules, wherein each module may be configured to either estimate the effective blurring kernel 104, or to estimate the coarse SR kernel 105, or to estimate the final SR kernel 103. Each module may comprise one or more hidden layers of the CNN 300.

In FIG. 3, the CNN 300 includes, as an example, three modules, namely a FFT-1 module 311, a FFT-2 module 312, and a refinement module 313. The respective modules 311, 312, and 313 of the CNN 300 may be associated to, or may be implemented by, the functional blocks 114, 145, and 156 of the processor as shown in FIG. 2. That is, the processor of the device 100 may be configured to implement the CNN 300. A data structure of the CNN 300, and instructions regarding an operation of the CNN 300, may be stored in a memory coupled to and working together with the processor of the device 100. For instance, the memory may have stored thereon the CNN 300 and the instructions to cause the CNN 300 to perform a method according to steps performed by the device 100 as described above (or according to the method 800 shown in FIG. 8 and described below), when the instructions are executed by the processor.

The FFT-1 module 311 may be responsible for estimating the effective blurring kernel 104 in low scale (LR), wherein the effective blurring kernel 104 is denoted k_LR. It is thereby assumed that k is the ground truth (GT) SR kernel, which is to be estimated from the input image lOli. The module 311 may be implemented, in this example, by the functional block 114 of the processor (see FIG. 1 and FIG. 2). In particular, the estimating of the effective blurring kernel 104 may comprise determining a first gradient of the input image 10 li and determining a second gradient of the reference image lOlr, and then applying a FFT to the first and the second gradient (by the FFT-1 module), and then, optionally, by further processing the result of applying the FFT to the first and second gradients in the FFT domain. Notably, in the KemelNet-R solution, there is no sharp gradient estimation module when compared to KernelNet. A sharp gradient (the second gradient) is obtained from the reference image lOlr instead, which makes the estimation much faster and more stable than that of KernelNet. From the second gradient of the reference image lOlr and the first gradient of the input image lOli, the LR blurring kernel estimation may particularly be done in the Fourier domain by using the formula given in Eq. 11 of ‘Yamac, M., Ataman, B. and Nawaz, A., 2021. “ KernelNet: A Blind Super-Resolution Kernel Estimation Network”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 453-462). 4, 7, 10, 11’.

The FFT-2 module 345 (which may also be referred to as a self-convolution module) is then configured to upsample the previously obtained effective blurring kernel 104 in the FFT domain to obtain the coarse SR kernel 105, which is denoted k_HR. In particular, the FFT-2 module may be configured to estimate the coarse SR kernel 105 by convolving the effective blurring kernel 104 with itself for the specified number of times (e.g., 2^sf times by the upscaling factor 102 mentioned above in FIG. 1). The FFT-2 module may be implemented by the functional block 145 of the processor (see FIG. 2). In other words, the estimated LR effective blurring kernel 104 may be upscaled to obtain the coarse HR SR kernel 105. The self-convolution for the coarse SR kernel estimation may be performed as explained in Section 3.2.3 of ‘Yamac, M., Ataman, B. and Nawaz, A., 2021. “KernelNet: A Blind Super-Resolution Kernel Estimation Network”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 453-462). 4, 7, 10, 11’. The refinement module 313 is then configured to obtain a finer estimation of k as an output of the CNN 300. The refinement module 156 is configured to estimate the final SR kernel 103 by refining the coarse SR kernel 105, e.g., by obtaining a finer estimation thereof. This may be implemented by the functional block 156 of the processor (see FIG. 2). For example, a few CNN layers can be used for the SR kernel estimation refinement. A realization of such consecutive CNN layers used for the SR kernel estimation refinement can be found in Section 3.2.4 of ‘Yamac, M., Ataman, B. and Nawaz, A., 2021. “ KernelNet: A Blind Super-Resolution Kernel Estimation Network”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 453-462). 4, 7, 10, I E.

In the following, a first application scenario for the device 100 is described. This application scenario refers to reference-based SR using multiple cameras. In this scenario, it is assumed that two different imaging devise (e.g., sensors or cameras) are available. A first imaging device, with which a first mage being the input image 10 li can be acquired, and a second imaging device, with which a second image being the reference image lOlr can be acquired. The second imaging device can be used to improve the input resolution, by providing the reference image lOlr having a much better optical quality and/or resolution than the input image lOli. For example, the second imaging device may be a (sensor of a) tele camera, which can be used as a reference camera. The first imaging device may be a (sensor of a) wide angle camera. The tele camera and the wide angle camera may be of the same device, e.g., of the same mobile phone or similar handheld device. As a case study, the tele camera and the wide angle camera are further considered. The wide angle camera has larger FOV compared to tele camera that is more zoomed to a local a region of interest of the scene. Thus, the input image lOli has a wider FOV than the reference image lOlr.The tele (reference) image lOlr may not be used alone either, because it may have a too narrow FOV. Therefore, the target is to obtain a super-resolved wide angle image, which may be obtained by using the extra information extracted from the patch or full image acquired with tele camera and a patch of the wide angle image.

KemelNet-R can be used for real-time SR kernel estimation of the final SR kernel 103 for the input image lOli by leveraging the reference image lOlr. Then, this final SR kernel 103 can be used inside in any non-blind SR module that takes the input image lOli and final SR kernel 103 as an input, and that produces the super-resolved output image 106. Before the process, a template matching and alignment algorithm can be applied to patches of the input image lOli and reference image lOlr. The SR kernel estimation may be more stable if further warped regions from both reference image lOlr and input image lOli are masked out. This procedure is visualized in FIG. 4.

In particular, FIG. 4 shows that the input image lOli (exemplarily, a wide angle image) and the reference image (exemplarily, a telephoto image) are obtained. Both are of the same scene, but acquired with a different FOV. Template matching is performed on the wide angle image lOli to obtain a cropped wide angle image patch 40 li. The telephoto reference image lOlr is downscaled by the scaling factor sf to obtain a downscaled telephoto image 401r. Based on these two images 40 li and 401r, a dense optical flow and uncertainty mask are estimated. Then, the cropped wide angle image patch 401i is warped, wherein the uncertainty mask is applied to mask one or more warped regions of the image 40 li, which results in a confident wide angle image patch 402i. The uncertainty mask is also applied to the downscaled telephoto image 40 Ir, which results in a confident telephoto image 402r. Then the effective blurring kernel 104 is estimated by the device 100 (KemelNet-R) based on the confident wide angle image patch 402i and the confident telephoto image 402r, and the final SR kernel 103 is produced and output. This final SR kernel 103 can be used by a non-blind SR module 403 to obtain the SR image 106 of the scene. The non-blind SR module 403 is an example of the optional functional block 113.

Exemplary results for this first application scenario are shown in FIG. 5 and FIG. 6. In a real image SR (the input image lOli was taken from a mobile phone), the proposed reference-based SR kernel estimation method (KernelNet-R), produces improved results compared to the KemelNet. FIG. 5a shows a comparison of the outputs from the same blind SR resolution network, when using KernelNet and KernelNet-R, respectively. The second image is clearly sharper and edges are better preserved without having any artifacts. FIG. 5b shows likewise that the second image produced with KemelNet-R is sharper.

All conventional SR methods that are based on a SR kernel estimation make the SR process in the sRGB space. However, a convolutional operation and its inverse operation - performed by CNNs - are linear operations, and may be handled in a linear space for a more accurate approximation. With KemelNet-R it is possible to conduct both SR kernel estimation and SR application in linear RGB. FIG. 6 shows an output image of a super-resolved wide angle image 106 produced with KemelNet-R. Both the application of the final SR kernel 103 to the wide angel input image lOli and the SR kernel estimation were conducted in linear RGB space. The reference image lOlr was taken from a telephoto camera of the same mobile phone. As it can be seen from the output image 106, it is sharper on details while also having reduced noise and artifacts. The output even can be preferable to corresponding tele patch.

The below table shows a comparison of different methods and combinations for different scale factors. The KernelNet-R method is faster than KernelNet, KernelGAN and FKP + DIP (deep image prior), while also having a lower estimation error.

In the following, a second application scenario for the device 100 is described. This second application scenario refers to realistic degradation for training a blind single image SR network when a reference camera training dataset is available. The task of improving the low-quality images captured by compact camera sensors on mobile devices, such as mobile phones, is one of the focuses of computational imaging. The practice of collecting paired images of same scenes from high-quality sensor device, like a DSLR camera, together with low quality sensor devices, such as a mobile phone camera, can be beneficial for training image enhancement algorithms. Having the paired data after further alignment, in the case of SR tasks, it is common to first apply bicubic downsampling to the mobile phone image, then mapping this low-resolution image to a DSLR ground truth image during the algorithm's training. Yet, bicubic downsampling cannot reflect the difference between low resolution, low quality sensor images and high resolution, high quality images.

Instead, the proposed solution, KernelNet-R, can be used to estimate this degradation between an input image lOli of a mobile phone and a reference image lOlr of a DSLR sensor, pair-wise. KernelNet-R (device 100) yields the final SR kernel 103, wherein as shown in FIG. 7 warping can be performed before kernel estimation similar to FIG. 4. That is, KernelNet-R may operate on a warped and masked input image 701i and a masked reference image 701r. Both images 701r and 70 li can be downscaled with noised reduction to obtain the downscaled images 702i and 702r. For instance, the images 701r and 70 li can be downscaled with an ideal kernel for noise reduction and sharpening to obtain the downscaled images 702i and 702r. The estimated final SR kernel 103 can then be applied to the downscaled input image 702i, which may be followed by another downscaling (decimation) by a desired scale factor (e.g., 4x) to create the LR realistic input image 704. Alternatively, it may be possible to clean both DSLR and main images at once by using a few bicubic downsampling steps, and then apply the estimated SR kernel combined with the decimation to the cleaned main image to produce an LR input image 704. The cleaned DSLR image can be used as ground truth image 703. The overall pipeline of the proposed realistic reference based degradation is shown in FIG. 7.

FIG. 8 shows a method 800 according to an embodiment of this disclosure. The method 800 can be performed by the device 100. The method 800 is for SR imaging.

The method 800 comprises a step 801 of estimating an effective blurring kernel 104 based on an input image lOli of a scene and a reference image lOlr of at least a part of the same scene. The input image lOli has a lower image quality and/or resolution than the reference image lOlr. The method 800 further comprises a step 802 of estimating a coarse SR kernel 105 by convolving the effective blurring kernel 104 with itself a specified number of times. This specified number is based on a target resolution that is higher than a resolution of the input image lOli. The method 800 also comprises a step 803 of estimating a final SR kernel 103 by refining the coarse SR kernel 105. The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed matter, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in the mutual different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A device (100) for super-resolution, SR, imaging, the device comprising a processor configured to: estimate an effective blurring kernel (104) based on an input image (lOli) of a scene and a reference image (lOlr) of at least a part of the same scene, wherein the input image (lOli) has a lower image quality and/or resolution than the reference image (lOlr); estimate a coarse SR kernel (105) by convolving the effective blurring kernel (104) with itself a specified number of times, wherein the specified number is based on a target resolution, and wherein the target resolution is higher than the resolution of the input image (lOli); and estimate a final SR kernel (103) by refining the coarse SR kernel (105).

2. The device (100) according to claim 1, wherein: the input image (lOli) has a wider field-of-view, FOV, than the reference image (lOlr).

3. The device (100) according to claim 1 or 2, wherein: the input image (lOli) is a first image acquired with a first imaging device and the reference image (lOlr) is a second image acquired with a second imaging device.

4. The device (100) according to claim 3, wherein the processor is further configured to: estimate a plurality of SR kernels (104) by using a plurality of first images of the scene as input images (lOli) and a plurality of second images of the scene as reference images (lOlr); and use the plurality of estimated SR kernels (104) and a plurality of high resolution images to obtain a plurality of degraded images; and train the processor by using the plurality of degraded images as input images (lOli) and the plurality of high resolution images as ground truth images.

5. The device (100) according to one of the claims 1 to 4, wherein the processor is further configured to: perform a template matching and alignment procedure on the input image (lOli) and the reference image (lOlr) before estimating the effective blurring kernel (104).

6. The device (100) according to claim 5, wherein the template matching and alignment procedure comprises a step of warping the input image (lOli) and/or warping the reference image (lOlr), and the processor is further configured to: mask one or more warped regions of the input image (lOli) and/or of the reference image (lOlr) before estimating the effective blurring kernel (104).

7. The device (100) according to one of the claims 1 to 6, wherein the estimating of the effective blurring kernel (104) comprises: determining a first gradient of the input image (lOli) and determining a second gradient of the reference image (lOlr); applying a fast Fourier transformation FFT to the first gradient and the second gradient; and processing the result of applying the FFT in an FFT domain, to obtain the effective blurring kernel (104).

8. The device (100) according to one of the claims 1 to 7, wherein the processor is configured to: estimate the effective blurring kernel (104), the coarse SR kernel (105), and the final SR kernel (103), by using one or more convolutional neural networks, CNNs (300).

9. The device (100) according to claim 8 and 4, wherein the processor is further configured to: train the one or more CNNs (300) by using a first set of one or more first images acquired with the first imaging device and a second set of one or more second images acquired with the second imaging device.

10. The device (100) according to claim 9, wherein the processor is further configured to: estimate, for each pair of a first image from the first set and a second image from the second set, a difference in image quality between the first image and second image; and train the one or more CNNs (300) based on the estimated differences in quality.

11. The device (100) according to one of the claims 1 to 10, wherein the processor is configured to: estimate the coarse SR kernel (105) by convolving the effective blurring kernel (104) with itself 2^sf times, wherein sf is an upscaling factor (102) between the resolution of the input image and the target resolution.

12. The device (100) according to one of the claims 1 to 11, wherein the processor is configured to: estimate the effective blurring kernel (104) using a blind deblurring algorithm.

13. The device (100) according to one of the claims 8 to 12, wherein: the CNN (300) comprises multiple modules (311, 312, 313), each module (311, 312, 313) being usable either for estimating the effective blurring kernel (104), for estimating the coarse SR kernel (105), or for estimating the final SR kernel (106).

14. The device (100) according to claim 13, wherein: each module (311, 312, 313) comprises either matrix multiplication layers in a FFT domain or one or more hidden layers of the CNN (300).

15. The device (100) according to one of the claims 1 to 14, wherein the processor is further configured to: use the final SR kernel (103) and the input image (lOli) to estimate an output image (106) having the target resolution.

16. A camera for SR imaging, the camera comprising: a first imaging device configured to obtain a first image with a wider FOV; a second imaging device configured to obtain a second image with a narrower FOV ; a device (100) according to one of the claims 1 to 15, wherein the device (100) is configured to estimate the final SR kernel (103) by using the first image as the input image (lOli) and the second image as the reference image (lOlr); and use the final SR kernel (103) and the input image (lOli) to estimate an output image (106) having the target resolution.

17. A method (800) for SR imaging, the method (800) comprising: estimating (801) an effective blurring kernel (104) based on an input image (lOli) of a scene and a reference image (lOlr) of at least a part of the same scene, the input image (lOli) having a lower image quality and/or resolution than the reference image (lOlr) ; estimating (802) a coarse SR kernel (105) by convolving the effective blurring kernel (104) with itself a specified number of times, wherein the specified number is based on a target resolution that is higher than a resolution of the input image (lOli); and estimating (803) a final SR kernel (103) by refining the coarse SR kernel (105). 18. A computer program comprising a program code for, when running on a processor

(100), causing the method (900) of claim 17 to be performed.