WO2023083467A1

WO2023083467A1 - Image processing apparatus and method for generating interpolated frame

Info

Publication number: WO2023083467A1
Application number: PCT/EP2021/081616
Authority: WO
Inventors: Stepan Tulyakov; Alfredo BOCCHICCHIO; Stamatios GEORGOULIS; Youanyou LI; Daniel GEHRIG; Davide SCARAMUZZA; Mathias GEHRIG
Original assignee: Huawei Technologies Co., Ltd.; University Of Zurich
Priority date: 2021-11-15
Filing date: 2021-11-15
Publication date: 2023-05-19

Abstract

An image processing apparatus for generating an interpolated frame at a specified time between adjacent image frames of a video. The image processing apparatus includes an input module configured to receive one, two or more key image frames, and a plurality of surrounding events. The image processing apparatus includes a parametric motion model estimator configured to estimate an inter-frame motion based on the key image frames and the plurality of surrounding events, and a warping encoder configured to compute a plurality of multiscale warping interpolation features. The image processing apparatus includes a synthesis encoder configured to compute a plurality of multiscale synthesis interpolation features, and a multiscale feature fusion decoder module configured to generate the interpolated frame. The image processing apparatus enables highly efficient and accurate video frame interpolation in slow-speed motion as well as in high-speed motion.

Description

IMAGE PROCESSING APPARATUS AND METHOD FOR GENERATING

INTERPOLATED FRAME

TECHNICAL FIELD

The present disclosure relates generally to the field of image and video processing; and, more specifically, to an image processing apparatus and a method for generating an interpolated frame at a specified time between adjacent image frames of a video.

BACKGROUND

Generally, video frame interpolation (VFI) algorithms include conversion from a low-frame rate to a high-frame rate in order to obtain a high-quality video. The video frame interpolation algorithms increase the frame rate of a video by inserting intermediate frames between consecutive frames of the video. To increase the frame rate, the VFI algorithms are required to accurately estimate the image changes in a blind time between consecutive frames, which is technically challenging, especially in case of high dynamics and high-speed motion.

The existing VFI algorithms (or methods) are, such as frame-based methods, non-linear motion estimation methods, methods using additional sensors, methods using event cameras, and a time lens. Each of the existing VFI algorithms has one or more limitations associated with itself. In an example, the frame-based methods use a warping-based approach, which relies on an optical flow between key image frames to warp them on a common intermediate timestamp where they are fused, taking into account occlusions as well. The warping-based approach has two limitations. First is, the warping-based approach depends on the optical flow, which is only well defined when brightness constancy is satisfied and, therefore, it may lead to severe artefacts when brightness constancy is not maintained. Second is, the warping-based approach usually models the optical flow as a linear motion, which fails to capture complex scene dynamics, especially at low frame rates. In a case of non-linear motion estimation methods, motion estimation is usually performed on a linear scale, where correspondences between pixels are assumed to follow linear trajectories. However, in case of rotational camera ego-motion and non-rigid or deformable object motion, this assumption is usually violated. However, a few methods of non-linear motion estimation extend the linear motion assumption to quadratic or cubic motions. These non-linear models are fit over multiple frames and thus, span long time windows, which fail to capture highly non-linear motion between neighboring frames. In another example, a time lens is used, which leverages events from an event camera in the blindtime between frames of a video. The time lens depends on combining the benefits of warping, and synthesis-based interpolation approaches through image-level attention-based alphablending algorithms. The warping-based interpolation approach operates by warping boundary frames to the latent position using non-linear motion estimated from events. The synthesisbased interpolation approach operates by adding intensity changes from events to boundary frames and excels at interpolating motion of a non-rigid object with intensity changes, such as water and fire. The warping-based and synthesis-based interpolation approaches are complementary to each other. Despite of providing partially improved interpolation results, the time lens has a few limitations, such as brittle image-level fusion of warping and synthesis results, low speed for multi-frame interpolation, and various artefacts, such as edge distortion, texture wobbling, boundary frames blending, in the low contrast areas where no event is triggered. Thus, there exists a technical problem of inefficient and inaccurate video frame interpolation with various artefacts in high-speed motion.

Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with the conventional methods of video frame interpolation.

SUMMARY

The present disclosure provides an image processing apparatus and a method for generating an interpolated frame at a specified time between adjacent image frames of a video. The present disclosure provides a solution to the existing problem of inefficient and inaccurate video frame interpolation with various artefacts in high-speed motion. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provide an improved image processing apparatus and a method for generating an interpolated frame at a specified time between adjacent image frames of a video. The generation of the interpolated frame at the specified time between adjacent image frames of the video leads to a high quality of the video with improved interpolation in low-contrast areas in high-speed motion. The object of the present disclosure is achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.

In an aspect, the present disclosure provides an image processing apparatus for generating an interpolated frame at a specified time between adjacent image frames of a video. The image processing apparatus comprises an input module configured to receive one, two or more key image frames, and a plurality of surrounding events, where each event indicates a pixel location and time associated with a change in pixel intensity above a predetermined threshold. The image processing apparatus further comprises a parametric motion model estimator configured to estimate an inter-frame motion based on the key image frames and the plurality of surrounding events. The image processing apparatus further comprises a warping encoder that is configured to compute a plurality of multi scale warping interpolation features based on a first image frame of the key frames and the inter- frame motion and a synthesis encoder that is configured to compute a plurality of multiscale synthesis interpolation features based on the first image frame and a subset of the events between the first image frame and the specified time. The image processing apparatus further comprises a multiscale feature fusion decoder module that is configured to generate the interpolated frame, where the multiscale feature fusion decoder module comprises a plurality of image decoder blocks, where each decoder block is configured to receive an output from a preceding decoder block and, each of the warping interpolation features and synthesis interpolation features as input.

The disclosed image processing apparatus enables an efficient and accurate video frame interpolation in slow-speed motion as well as in high-speed motion. The disclosed image processing apparatus uses images in addition to the events for motion estimation, which further leads to improved interpolation results in high-speed motion and in slow-speed motion as well. Moreover, by virtue of the parametric motion model estimator, the image processing apparatus enables non-linear motion estimation from inter-frame events with reduced computational complexity. Additionally, because of the multiscale feature fusion decoder module, the image processing apparatus manifests the selection of the most informative features from each encoder at each scale and improves fusion of the plurality of multiscale synthesis and warping interpolation results.

In an implementation form, the input module includes an image sensor configured to capture the image frames and an aligned event sensor with pixels configured to capture the plurality of surrounding events, where each pixel is configured to trigger an event when a change of intensity of the pixel crosses a threshold.

By virtue of the image sensor and the aligned event sensor, the input module is configured to capture image data as well as events data.

In a further implementation form, the image sensor and aligned event sensor are aligned using a beam splitter, or where the image sensor and aligned event sensor are integrated in a hybrid sensor.

It is advantageous to align the image sensor and the aligned event sensor in order to capture temporally synchronized and spatially aligned image and events data, respectively.

In a further implementation form, the parametric motion model estimator comprises an imagebased motion encoder configured to estimate motion features from the first and second adjacent image frames at a plurality of scales. The parametric motion model estimator further comprises an event-based motion encoder configured to estimate motion features from the plurality of surrounding events at a plurality of scales and a second multiscale feature fusion decoder module configured to combine each of the estimated motion features.

The use of adjacent image frames in addition to the events for motion estimation by combining image and event-based motion features results in an improved interpolation in low contrast areas.

In a further implementation form, the parametric motion model estimator is configured to compute the inter-frame spline motion including three cubic splines for each pixel location, respectively modelling a horizontal displacement, a vertical displacement and a warping priority of each pixel as a function of time.

The parametric motion model estimator enables computation of high-order spline motion from inter-frame events with reduced computational complexity.

In a further implementation form, the warping encoder includes an image encoder comprising a plurality of residual blocks configured to encode the first image frame at a plurality of scales. The warping encoder further includes a flow sampling module configured to sample the horizontal displacement, vertical displacement and warping priority of each pixel from the inter-frame spline motion, and a forward warping module configured to compute the warping interpolation features at the plurality of scales based on the encoded first image frame and the sampled inter-frame spline motion.

The warping encoder enables computation of the plurality of multiscale warping interpolation features with reduced computational complexity.

In a further implementation form, each decoder block of the multiscale feature fusion decoder module includes a gated compression module configured to attenuate each of the warping interpolation features and synthesis interpolation features and select a subset of the most informative features.

By virtue of the selection of the subset of the most informative features, the multiscale feature fusion decoder module enables improved interpolation results.

In a further implementation form, each decoder block of the multiscale feature fusion decoder module includes an upscaling layer followed by a convolution layer with a non-linear activation function.

By virtue of the upscaling layer followed by the convolution layer with the non-linear activation function, each decoder block of the multiscale feature fusion decoder module may be configured to perform for any upscaling ratio and any non-linearity.

In a further implementation form, the synthesis encoder is further configured to compute a plurality of second synthesis interpolation features based on a second image frame and a subset of the events between the second image frame and the specified time, where the specified time is between the first image frame and the second image frame. The warping encoder is further configured to compute a plurality of second warping interpolation features based on the second image frame and the inter-frame motion. Each decoder block of the multiscale feature fusion decoder module is further configured to receive each of the second warping interpolation features and second synthesis interpolation features as input.

By virtue of computing and combining each of the plurality of second synthesis as well as warping interpolation features, interpolation results are obtained with more precision and accuracy.

In another aspect, the present disclosure provides a method of generating an interpolated frame at a specified time between adjacent image frames of a video. The method comprises receiving, by an input module, one or more image frames, and a plurality of surrounding events, where each event indicates a pixel location and time associated with a change in pixel intensity above a predetermined threshold. The method further comprises estimating, by a parametric motion model estimator, an inter-frame motion based on the key image frames and the plurality of surrounding events. The method further comprises computing, by a warping encoder, a plurality of warping interpolation features based on a first image frame of the key frames and the interframe motion and computing, by a synthesis encoder, a plurality of synthesis interpolation features based on the first image frame and a subset of the events between the first image frame and the specified time. The method further comprises generating, by a multiscale feature fusion decoder module, the interpolated frame, where the multiscale feature fusion decoder module comprises a plurality of image decoder blocks, where each decoder block is configured to receive each of the warping interpolation features and synthesis interpolation features as input.

The method achieves all the advantages and technical effects of the image processing apparatus of the present disclosure.

In a yet another aspect, the present disclosure provides a computer-readable medium comprising instructions which, when executed by a processor, cause the processor to perform the method.

The processor achieves all the advantages and effects of the method after execution of the method.

It is to be appreciated that all the aforementioned implementation forms can be combined.

It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.

Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.

BRIEF DESCRIPTION OF THE DRAWINGS

The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.

Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:

FIG. l is a block diagram that illustrates various exemplary components of an image processing apparatus, in accordance with an embodiment of the present disclosure;

FIG. 2A illustrates processing of one or more key image frames and a plurality of surrounding events by a parametric motion model estimator and a warping encoder;

FIG. 2B illustrates non-linear inter-frame motion between boundary image frames, in accordance with an embodiment of the present disclosure;

FIG. 3 illustrates a multiscale feature fusion decoder module with gated compression, in accordance with an embodiment of the present disclosure; and

FIG. 4 is a flowchart of a method of generating an interpolated frame at a specified time between adjacent image frames of a video, in accordance with an embodiment of the present disclosure.

In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the nonunderlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.

DETAILED DESCRIPTION OF EMBODIMENTS

The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.

FIG. l is a block diagram that illustrates various exemplary components of an image processing apparatus, in accordance with an embodiment of the present disclosure. With reference to FIG. 1, there is shown a block diagram 100 of an image processing apparatus 102 that includes an input module 104, a parametric motion model estimator 106, a warping encoder 108, a synthesis encoder 110, and a multiscale feature fusion decoder module 112. The input module 104 includes an image sensor 104A and an aligned event sensor 104B.

The image processing apparatus 102 may include suitable logic, circuitry, interfaces, or codes that is configured to generate an interpolated frame at a specified time between adjacent image frames of a video. The image processing apparatus 102 may also be referred to as a video frame interpolation system. Examples of the image processing apparatus 102 may include, but are not limited to, a hand-held device or an electronic device or a mobile device, a portable device, and the like.

The input module 104 may include suitable logic, circuitry, interfaces, or codes that is configured to receive one, two or more key image frames, and a plurality of surrounding events. Examples of the input module 104 may include, but are not limited to, an image sensor, an auxiliary event sensor, a hybrid sensor, a charge-coupled device (CCD), and the like.

The parametric motion model estimator 106 may include suitable logic, circuitry, interfaces, or codes that is configured to estimate an inter-frame motion based on the key image frames and the plurality of surrounding events. The parametric motion model estimator 106 may also be referred to as a spline motion estimator. Examples of the parametric motion model estimator 106 may include, but are not limited to, a polynomial curve estimator, a quadratic motion estimator, a cubic motion estimator, and the like. The warping encoder 108 may include suitable logic, circuitry, interfaces, or codes that is configured to compute a plurality of multiscale warping interpolation features based on a first image frame of the key frames and the inter-frame motion. Examples of the warping encoder 108 may include, but are not limited to, a convolutional neural network (CNN), a recurrent neural network (RNN), a recursive neural network, a feed-forward neural network, a deepbelief network, and a convolutional deep-belief network, and a stacked de-noising autoencoder, and the like.

The synthesis encoder 110 may include suitable logic, circuitry, interfaces, or codes that is configured to compute a plurality of multiscale synthesis interpolation features based on the first image frame and a subset of the events between the first image frame and the specified time. Examples of the synthesis encoder 110 may include, but are not limited to, a convolutional neural network (CNN), a recurrent neural network (RNN), a recursive neural network, a feedforward neural network, a deep-belief network, and a convolutional deep-belief network, and a stacked de-noising auto-encoder, and the like.

The multiscale feature fusion decoder module 112 may include suitable logic, circuitry, interfaces, or codes that is configured to generate the interpolated frame.

In operation, the input module 104 is configured to receive one, two or more key image frames, and a plurality of surrounding events, where each event indicates a pixel location and time associated with a change in pixel intensity above a predetermined threshold. In an implementation, the received key image frames may correspond to a preceding (may also be represented as /₀) and a following (may also be represented as /- boundary image frames, and the plurality of surrounding events (may also be represented as voxel grid Vo^t for events) may correspond to inter-frame events between the preceding and following boundary image frames and other nearby surrounding events. In another implementation, the key image frames may correspond to adjacent image frames, and the plurality of surrounding events may correspond to the events captured between the adjacent image frames. Each event of the plurality of surrounding events indicates time associated with illumination changes in the key image frames that corresponds to the changes in pixel intensity above the predetermined threshold and the pixel location as well. Moreover, each event represents a stream of compressed visual information and allows estimation of motion and light changes in a blind time between the key image frames. In accordance with an embodiment, the input module 104 includes the image sensor 104A configured to capture the image frames and the aligned event sensor 104B with pixels configured to capture the plurality of surrounding events, where each pixel is configured to trigger an event when a change of intensity of the pixel crosses a threshold. In an implementation, the image sensor 104A (e.g., a camera) may be configured to capture the preceding (i.e., /₀) and the following (i.e., /- boundary image frames. In said implementation, the aligned event sensor 104B (e.g., an auxiliary event camera) may be configured to capture the plurality of surrounding events (i.e., voxel grid V_0-t) between the preceding and the following boundary image frames. Each of the image sensor 104A and the aligned event sensor 104B may be either a color camera (e.g., a RGB camera) or a grey frame-based camera in a stereo configuration. The image processing apparatus 102 depends on temporally synchronized and spatially aligned events and image data. Therefore, both the sensors, that is the image sensor 104A and the aligned event sensor 104B should be temporally synchronized and have similar field-of-view (FOV), depth-of-field (DOF), synchronized focusing, and triggering times. Alternatively stated, the image processing apparatus 102 may be implemented as a video frame interpolation system with dual cameras, and each camera may be detected by analyzing hardware and software of a camera.

In accordance with an embodiment, the image sensor 104A and aligned event sensor 104B are aligned using a beam splitter or where the image sensor 104 A and aligned event sensor 104B are integrated in a hybrid sensor. In an implementation, the image sensor 104A and the aligned event sensor 104B are aligned using the beam splitter. In another implementation, the image sensor 104A and the aligned event sensor 104B may be arranged side-by-side, or in the form of the hybrid sensor. Due to the presence of dual sensors (or dual cameras, that is the image sensor 104A and the aligned event sensor 104B), the image processing apparatus 102 (i.e., the video frame interpolation system) can be examined for a camera occlusion test, that means how the image processing apparatus 102 behaves when one of the two sensors (i.e., camera) is occluded. In an example, when the aligned event sensor 104B (i.e., the auxiliary event camera) is occluded, the image processing apparatus 102 (i.e., the video frame interpolation system) should switch to default image-based mode, which is not able to handle large non-linear motion and non-rigid objects, such as water and fire. In another example, when the image sensor 104A (i.e., the frame-based camera) is occluded, the image processing apparatus 102 (i.e., the video frame interpolation system) does not perform any frame interpolation and displays error message. The parametric motion model estimator 106 is configured to estimate an inter-frame motion based on the key image frames and the plurality of surrounding events. The parametric motion model estimator 106 is configured to estimate the inter-frame motion (i.e., spline motion S₀^_t) once per inter-frame interval using the key image frames (i.e., the preceding, /₀, and the following, /₁₅ boundary image frames) and the plurality of surrounding events (i.e., the voxel grid Vo^_t) for interframe events. In contrast to conventional video frame interpolation algorithms, the parametric motion model estimator 106 is configured to use the boundary image frames in addition to the events for the inter-frame motion estimation by combining image and event-based motion features in order to ensure interpolation robustness in low contrast areas without events. The parametric motion model estimator 106 is described in more detail, for example, in FIG. 2A.

The warping encoder 108 is configured to compute a plurality of multiscale warping interpolation features based on a first image frame of the key frames and the inter- frame motion. In an implementation, the warping encoder 108 is configured to compute the plurality of multiscale warping interpolation features (may also be represented as C^_t), which warp the features extracted from the first image frame (i.e., the preceding boundary image frame, /₀), of the key image frames to time t according to motion spline approximation of the inter-frame motion from the preceding boundary image frame (i.e., /₀) to the following boundary image frame (i.e., /- , represented as, S_o^₁. The warping encoder 108 is described in more detail, for example, in FIG. 2A.

The synthesis encoder 110 is configured to compute a plurality of multiscale synthesis interpolation features based on the first image frame and a subset of the events between the first image frame and the specified time. In an implementation, the synthesis encoder 110 is configured to compute the plurality of multiscale synthesis interpolation features (may also be represented as Cg^_t) based on the first image frame (i.e., the preceding boundary image frame, /₀), and voxel grid (i.e., V_0-t) for events between the preceding and latent image. Intuitively, the synthesis encoder 110 adds changes from the subset of the events between the first image frame and the specified time to the boundary image frames and thus, can interpolate non-rigid objects with illumination changes, such as fire and water.

In accordance with an embodiment, the synthesis encoder 110 is further configured to compute a plurality of second synthesis interpolation features based on a second image frame and a subset of the events between the second image frame and the specified time, where the specified time is between the first image frame and the second image frame. In said implementation, the synthesis encoder 110 is further configured to compute the plurality of second multiscale synthesis interpolation features (may also be represented as C^_t) based on the second image frame (i.e., the following boundary image frame, /- , and voxel grid (i.e., _t^i) for events between the second image frame (i.e., the following boundary image frame, /- and the specified time t. The specified time t is defined as time between the first image frame (i.e., the preceding boundary image frame, /₀), and the second image frame (i.e., the following boundary image frame, However, the first image frame (i.e., the preceding boundary image frame, /₀), and the second image frame (i.e., the following boundary image frame, /- are encoded separately by use of the synthesis encoder 110 with shared weights. Hence, the synthesis encoder 110 may also be referred to as a shared encoder.

The warping encoder 108 is further configured to compute a plurality of second warping interpolation features based on the second image frame and the inter-frame motion. In the implementation where, the warping encoder 108 is configured to compute the plurality of multiscale warping interpolation features (i.e., C^_t) based on the first image frame (i.e., /₀) of the key frames and the inter-frame motion. In said implementation, the warping encoder 108 is further configured to compute the plurality of second multiscale warping interpolation features (may also be represented as

based on the second image frame (i.e., the following boundary image frame, /- and the motion spline approximation of the inter-frame motion. Thus, the warping encoder 108 is configured to encode the first image frame (i.e., the preceding boundary image frame, /₀), as well as the second image frame (i.e., the following boundary image frame,

The multiscale feature fusion decoder module 112 is configured to generate the interpolated frame, where the multiscale feature fusion decoder module 112 comprises a plurality of image decoder blocks, where each decoder block is configured to receive an output from a preceding decoder block and each of the warping interpolation features and synthesis interpolation features as input. In an implementation, the multiscale feature fusion decoder module 112 may be configured to combine the plurality of multi scale warping interpolation features and the plurality of second multiscale warping interpolation features (i.e.,

computed from the first image frame (i.e., the preceding boundary image frame, /₀) and the second image frame (i.e., the following boundary image frame, /- respectively, by the warping encoder 108 and the plurality of multiscale synthesis interpolation features and the plurality of second multiscale synthesis interpolation features (i.e., Cg^_t, C^_t) computed from the first image frame (i.e., the preceding boundary image frame, /₀) and the second image frame (i.e., the following boundary image frame, /- , respectively, by the synthesis encoder 110. After combination, the multiscale feature fusion decoder module 112 is configured to generate a latent frame I_t at the specified time t. Thus, the multiscale feature fusion decoder module 112 may be configured to select the most informative features from each encoder at each scale and improve fusion of synthesis and warping interpolation results. By virtue of the multiscale feature fusion decoder module 112, the image processing apparatus 102 (i.e., the video frame interpolation system) performs symmetric processing for the preceding and the following boundary image frames. The multiscale feature fusion decoder module 112 is described in more detail, for example, in FIG. 3.

In accordance with an embodiment, each decoder block of the multi scale feature fusion decoder module 112 is further configured to receive each of the second warping interpolation features and second synthesis interpolation features as input. In case of processing the second image frame (i.e., the following boundary image frame, /- , each decoder block of the multiscale feature fusion decoder module 112 is further configured to combine each of the plurality of second warping interpolation features (i.e.,

received from the warping encoder 108 and the plurality of second synthesis interpolation features (i.e., C( _t) received from the synthesis encoder 110.

In an implementation, the image processing apparatus 102 may have a memory to store the temporally synchronized and spatially aligned events and image data. Examples of implementation of the memory may include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), or CPU cache memory. The memory may store an operating system or other program products (including one or more operation algorithms) to operate the image processing apparatus 102.

In addition to the memory, the image processing apparatus 102 may have a processor to execute the instructions stored in the memory. In an example, the processor may be a general-purpose processor. Other examples of the processor may include, but are not limited to a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry. Moreover, the processor may refer to one or more individual processors, processing devices, a processing unit that is part of a machine, such as the image processing apparatus 102.

Thus, the image processing apparatus 102 enables highly efficient and accurate video frame interpolation in slow motion as well as in high-speed motion. The image processing apparatus 102 uses images in addition to events for motion estimation, which further results in no interpolation artefacts in high-speed motion. However, in a conventional method, such as a time lens, which only uses events for motion estimation because the time lens contains information about non-linear inter-frame motion. In low contrast areas, where temporal brightness changes are below contrast threshold of a camera, then, events are not triggered. This further leads to various interpolation artefacts, such as boundary distortion, texture wobbling and boundary frames blending. These artefacts could have been avoided if the time lens used images in addition to events. The time lens relies on bi-linear interpolation for image warping therefore, the time lens requires motion from the “non-existing” latent frame to the boundary frame, which can only be approximated from the boundary frames, and this approximation works poorly for large (i.e., high-speed) motion.

Moreover, by virtue of the parametric motion model estimator 106, the image processing apparatus 102 enables non-linear motion estimation from inter-frame events with reduced computational complexity. Additionally, because of the multiscale feature fusion decoder module 112, the image processing apparatus 102 enables selection of the most informative features from each encoder at each scale and improves fusion of the plurality of multiscale synthesis and warping interpolation results. Therefore, the image processing apparatus 102 may be used in various applications of event-based video interpolation, such as an energy-efficient high resolution video interpolation, synthetic exposure, rewind trigger time and rolling shutter compensation, and the like. The aforementioned application scenarios are described in more detail, for example, in FIG. 4.

FIG. 2A illustrates the processing of one or more key image frames and a plurality of surrounding events by a parametric motion model estimator and a warping encoder, in accordance with an embodiment of the present disclosure. FIG. 2A is described in conjunction with elements from FIG. 1. With reference to FIG. 2A, there is shown a processing diagram 200A that illustrates the processing of one or more key image frames and a plurality of surrounding events. The processing diagram 200A includes the parametric motion model estimator 106 and the warping encoder 108 (of FIG. 1). The parametric motion model estimator 106 includes an image-based motion encoder 202, an event-based motion encoder 204 and a second multiscale feature fusion decoder module 206. The warping encoder 108 includes an image encoder 208, a flow sampling module 210 and a forward warping module 212.

In accordance with an embodiment, the parametric motion model estimator 106 comprises the image-based motion encoder 202 configured to estimate motion features from the first and second adjacent image frames at a plurality of scales. The image-based motion encoder 202 is configured to estimate the multi scale motion features from the first and the second adjacent images frames (i.e., the preceding and the following boundary image frames, I₀ i, respectively).

The parametric motion model estimator 106 further comprises the event-based motion encoder 204 configured to estimate motion features from the plurality of surrounding events at a plurality of scales. The event-based motion encoder 204 is configured to estimate the multiscale motion features from the plurality of surrounding events (i.e., the voxel grid Vo^t of interframe- events).

The parametric motion model estimator 106 further comprises the second multi scale feature fusion decoder module 206 configured to combine each of the estimated motion features. Each of the estimated multi scale motion features from the first and the second adjacent images frames and the multi scale motion features from the plurality of surrounding events is combined using the second multiscale feature fusion decoder module 206.

Thus, the parametric motion model estimator 106 includes two encoders, that is the imagebased motion encoder 202 and the event-based motion encoder 204, to learn the multiscale motion features from events as well as from images, leading to improved interpolation results. However, in existing VFI systems, a single encoder network is used, which learns motion features from events than images and, therefore, simply converges to local minimum while ignoring images, hence, leads to poor interpolation results.

In accordance with an embodiment, the parametric motion model estimator 106 is configured to compute the inter-frame spline motion including three cubic splines for each pixel location, respectively modelling a horizontal displacement, a vertical displacement and a warping priority of each pixel as a function of time. The parametric motion model estimator 106 infers non-linear inter-frame motion (shown in FIG. 2B) from the plurality of surrounding events and the first and second image frames (i.e., the preceding and following boundary image frames, /₀, /i, respectively) and approximates the non-linear motion with splines. The parametric motion model estimator 106 enables efficient sampling of motion between boundary and arbitrary latent frame and ensures temporal consistency of the non-linear motion. The parametric motion model estimator 106 computes three cubic splines (may also be represented as SQ , and SQ^) for each pixel location. The three cubic splines (i.e., S^, SQ^ , and SQ^) model the horizontal displacement, the vertical displacement and the warping priority, respectively, of each pixel of the preceding boundary image as the function of time correspondingly. Each spline may be represented by K control points, for example, the horizontal displacement spline (i.e.,

may be represented as horizontal displacements (Ax₀, Ax₁/_k-1, Ax₂/fc-i, — , A%i) for uniformly sampled timestamps (0, 1/K - 1 , 2/K - 1, ... , 1), as shown in FIG. 2A. Each of the three cubic splines may be used by the warping encoder 108 to compute the plurality of multi scale warping interpolation features and the plurality of second warping interpolation features (i.e., C^_t, C^_t) for any time t € [0,1] with minimum additional computational cost.

In a conventional VFI method, such as a time lens that computes motion independently from each to be inserted frame to the boundary frame by re-partitioning the events. The time lens leverages information about non-linear inter-frame motion contained in events, thus, computational complexity scales linearly o(/V) with the number of interpolated frames N and its motion estimates are independent and, thus, potentially inconsistent. In another conventional VFI method, such as frame-based interpolation method(s), which relies on linear or quadratic motion models estimated from input frames. While, the frame-based interpolation method(s) cannot accurately compute inter-frame motion due to simplicity of motion model and absence of inter-frame information, and its computational complexity o(l) does not depend on number of interpolated frames and its motion estimates are temporally consistent. In contrast to the conventional VFI methods, the parametric motion model estimator 106 infers non-linear interframe motion from the plurality of surrounding events and the first and second image frames and approximates the non-linear motion with the three cubic splines and thus, enables improved interpolation results in the low-contrast areas without events. Moreover, the parametric motion model estimator 106 computes high-order spline motion from interframe events and thus, allows interpolating N intermediate frames with o(l) instead of o(/V) computational complexity without resorting to assumption about linearity of the inter-frame motion. In accordance with an embodiment, the warping encoder 108 includes the image encoder 208 comprising a plurality of residual blocks configured to encode the first image frame at a plurality of scales. Alternatively stated, the image encoder 208 of the warping encoder 108 is configured to encode the first image frame (i.e., the preceding boundary image frame, /₀) at the multi scale. In another implementation, the image encoder 208 of the warping encoder 108 may be configured to encode the second image frame (i.e., the following boundary image frame, /- at the multiscale.

The warping encoder 108 further includes the flow sampling module 210 configured to sample the horizontal displacement, vertical displacement and warping priority of each pixel from the inter- frame spline motion. Alternatively stated, for a given time t, the flow sampling module 210 is configured to sample the flow (i.e., F₀^_t) and the warping priority (i.e., F₀^_t) from the three cubic splines using an existing cubic convolution method.

The warping encoder 108 further includes the forward warping module 212 configured to compute the warping interpolation features at the plurality of scales based on the encoded first image frame and the sampled inter-frame spline motion. The forward warping module 212 is configured to compute the plurality of multiscale warping interpolation features (i.e., C ) extracted from the encoded first image frame and the sampled flow (i.e., F₀^_t) and the warping priority (i.e., F₀^_t). The forward warping module 212 may also be referred to as a softmax splatting module. The forward warping module 212 is used for softmax splatting interpolation for warping that requires motion from the boundary frame to latent frame and thus, allows combining events and image-based motion estimation in the parametric motion model estimator 106

FIG. 2B illustrates non-linear inter-frame motion between boundary image frames, in accordance with an embodiment of the present disclosure. FIG. 2B is described in conjunction with elements from FIGs. 1 and 2A. With reference to FIG. 2B, there is shown a non-linear inter- frame motion 200B between boundary image frames (i.e., the first image frame and the second image frame).

The parametric motion model estimator 106 is configured to compute the inter-frame spline motion including three cubic splines for each pixel location, respectively modelling the horizontal displacement, the vertical displacement and the warping priority of each pixel as a function of time. The parametric motion model estimator 106 is configured to infer the non- linear inter-frame motion 200B from the plurality of surrounding events and the first image frame and the second image frame (i.e., the preceding and following boundary image frames, I_o, /₁₅ respectively) and approximates the non-linear motion with cubic splines.

FIG. 3 illustrates a multiscale feature fusion decoder module with gated compression, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements from FIG. 1 and 2A. With reference to FIG. 3, there is shown the multiscale feature fusion decoder module 112 (of FIG. 1), that includes a plurality of image decoder blocks 302. Each decoder block of the plurality of image decoder blocks 302 includes a gated compression module 304, an upscaling layer 306 followed by a convolution layer 308. The gated compression module 304 includes a sigmoid activation layer 310.

In accordance with an embodiment, each decoder block of the multiscale feature fusion decoder module 112 includes the gated compression module 304 configured to attenuate each of the warping interpolation features and synthesis interpolation features and select a subset of the most informative features. The multiscale feature fusion decoder module 112 is configured for multiscale fusion of the warping and synthesis interpolation features, because multiscale fusion often, produce improved results. Moreover, the multiscale fusion is sensitive to small misalignment in the input images. The multiscale feature fusion decoder module 112 progressively combines the multiscale warping and synthesis interpolation features as well as features from the previous image decoder block performed on a coarser scale. Instead of using simple convolution to combine the features, the multiscale feature fusion decoder module 112 depends on the gated compression module 304, which attenuates features before combining them and thus, intuitively selects the most informative features from each source. The gated compression module 304 may be used for combining multiple exposures.

In accordance with an embodiment, each decoder block of the multiscale feature fusion decoder module 112 includes the upscaling layer 306 followed by the convolution layer 308 with a non-linear activation function. In an implementation, the upscaling layer 306 may be a 2* bilinear upscaling layer followed by the convolution layer 308 with the non-linear activation function. The non-linear activation function may be a leaky rectified linear unit activation function. However, in another implementation, the upscaling layer 306 with another upscaling ratio may be used. In an implementation, the gated compression module 304 may include an attenuation path with the convolution layer 308 and the sigmoid activation layer 310, and a skip-connection path directly from an input of the gated compression module 304. The gated compression module 304 may be configured to combine the attenuation path and the skip-connection path and insert them into the convolution layer 308 with the leaky rectified linear unit activation function.

FIG. 4 is a flowchart of a method of generating an interpolated frame at a specified time between adjacent image frames of a video, in accordance with an embodiment of the present disclosure. FIG. 4 is described in conjunction with elements from FIGs. 1, 2A, and 3. With reference to FIG. 4, there is shown a method 400 of generating an interpolated frame at a specified time between adjacent image frames of a video. The method 400 includes steps 402-to-410. The method 400 is executed by the image processing apparatus 102 (of FIG. 1).

At step 402, the method 400 comprises receiving, by the input module 104, one or more image frames, and a plurality of surrounding events, where each event indicates a pixel location and time associated with a change in pixel intensity above a predetermined threshold. The input module 104 of the image processing apparatus 102 is configured to receive the key image frames (e.g., the preceding and following boundary image frames) and the plurality of surrounding events. Each event is triggered whenever there is a change in pixel location or pixel intensity (i.e., illumination change).

In accordance with an embodiment, the method 400 further comprises receiving the key image frames and surrounding events includes capturing, by the image sensor 104 A, the key image frames and capturing, by the aligned event sensor 104B, the plurality of surrounding events, where each pixel of the aligned event sensor 104B is configured to trigger an event when a change of intensity of the pixel crosses a threshold. The image sensor 104A of the input module 104 is configured to capture the key image frames and the aligned event sensor 104B of the input module 104 is configured to capture the plurality of surrounding events.

In accordance with an embodiment, the image sensor 104A and aligned event sensor 104B are aligned using a beam splitter, or where the image sensor 104A and aligned event sensor 104B are integrated in a hybrid sensor. In an implementation, the image sensor 104A and the aligned event sensor 104B may be aligned by use of the beam splitter. In another implementation, the image sensor 104A and aligned event sensor 104B may be arranged side-by-side or integrated in the form of the hybrid sensor. At step 404, the method 400 further comprises estimating, by the parametric motion model estimator 106, an inter-frame motion based on the key image frames and the plurality of surrounding events. The parametric motion model estimator 106 is configured to estimate the inter-frame motion (i.e., linear, non-linear, etc.) using the key image frames and the plurality of surrounding events.

In accordance with an embodiment, the method 400 further comprises estimating the interframe motion comprises estimating, by the image-based motion encoder 202, motion features from the first and second adjacent image frames at a plurality of scales, estimating, by the eventbased motion encoder 204, motion features from the plurality of surrounding events at a plurality of scales, and combining, by the second multiscale feature fusion decoder module 206, each of the estimated motion features. The image-based motion encoder 202 of the parametric motion model estimator 106 is configured to estimate the multiscale motion features from the first and second adjacent image frames (i.e., the preceding and following boundary image frames, I₀ i, respectively). The event-based motion encoder 204 of the parametric motion model estimator 106 is configured to estimate the multi scale motion features from the plurality of surrounding events (i.e., the frames in between the first and second adjacent image frames). The second multiscale feature fusion decoder module 206 of the parametric motion model estimator 106 is configured to combine each of the estimated multiscale motion features.

In accordance with an embodiment, the method 400 further comprises computing the interframe spline motion including three cubic splines for each pixel location, respectively modelling a horizontal displacement, a vertical displacement and a warping priority of each pixel as a function of time. The parametric motion model estimator 106 is configured to compute three cubic splines for each pixel location. The three cubic splines model the horizontal displacement, the vertical displacement and the warping priority, respectively, of each pixel of the preceding boundary image as the function of time correspondingly, already described in detail, for example, in FIG. 2A.

At step 406, the method 400 further comprises computing, by the warping encoder 108, a plurality of warping interpolation features based on a first image frame of the key frames and the inter-frame motion. The warping encoder 108 is configured to compute the plurality of multiscale warping interpolation features based on the first image frame (i.e., the preceding boundary image frame, /₀) and the inter-frame motion. In accordance with an embodiment, the method 400 further comprises computing, the plurality of warping interpolation features includes encoding, by the image encoder 208 comprising a plurality of residual blocks, the first image frame at a plurality of scales, sampling, by the flow sampling module 210, the horizontal displacement, vertical displacement and warping priority of each pixel from the inter-frame spline motion, and computing, by the forward warping module 212, the warping interpolation features at the plurality of scales based on the encoded first image frame and the sampled inter-frame spline motion. The image encoder 208 of the warping encoder 108 is configured to encode the first image frame (i.e., the preceding boundary image frame, /₀) ^at the plurality of scales. The flow sampling module 210 of the warping encoder 108 is configured to sample the horizontal displacement, vertical displacement and warping priority of each pixel from the inter-frame spline motion. The forward warping module 212 of the warping encoder 108 is configured to compute the warping interpolation features at the plurality of scales based on the encoded first image frame and the sampled inter-frame spline motion.

At step 408, the method 400 further comprises computing, by the synthesis encoder 110, a plurality of synthesis interpolation features based on the first image frame and a subset of the events between the first image frame and the specified time. The synthesis encoder 110 is configured to compute the plurality of synthesis interpolation features based on the first image frame and the subset of events between the first image frame and the specified time.

At step 410, the method 400 further comprises generating, by the multiscale feature fusion decoder module 112, the interpolated frame, where the multiscale feature fusion decoder module 112 comprises the plurality of image decoder blocks 302, where each decoder block is configured to receive each of the warping interpolation features and synthesis interpolation features as input. The multiscale feature fusion decoder module 112 is configured to combine each of the plurality of multiscale warping interpolation features and synthesis interpolation features and generate the interpolated frame.

In accordance with an embodiment, the method 400 further comprises attenuating, by the gated compression module 304 of each decoder block of the multiscale feature fusion decoder module 112, each of the warping interpolation features and synthesis interpolation features and selecting a subset of the most informative features. The gated compression module 304 is configured to attenuate each of the warping interpolation features and synthesis interpolation features and select the subset of the most informative features. In accordance with an embodiment, each decoder block of the multi scale feature fusion decoder module 112 includes the upscaling layer 306 followed by the convolution layer 308 with a nonlinear activation function. The upscaling layer 306 may have different upscaling ratio depending on a use case.

In accordance with an embodiment, the method 400 further comprises computing a plurality of second synthesis interpolation features based on a second image frame and a subset of the events between the second image frame and the specified time, where the specified time is between the first image frame and the second image frame. The method 400 further comprises computing a plurality of second warping interpolation features based on the second image frame and the inter-frame spline motion, and generating the interpolated frame by further receiving, at each decoder block of the multiscale feature fusion decoder module 112, each of the second warping interpolation features and second synthesis interpolation features as input. The synthesis encoder 110 is configured to compute the plurality of second synthesis interpolation features based on the second image frame and the subset of the events between the second image frame and the specified time. The warping encoder 108 is configured to compute the plurality of second warping interpolation features based on the second image frame and the inter- frame spline motion. The multiscale feature fusion decoder module 112 is configured to generate the interpolated frame by receiving each of the second warping interpolation features and second synthesis interpolation features as input.

The method 400 is based on using images in addition to events for motion estimation, which further leads to improved interpolation results in slow speed motion as well as in high-speed motion. Various application scenarios of the method 400 are described as follows:

In an application scenario of “slow motion”, the method 400 may be used to capture high-speed moments at relatively low frame per second, fps, (e.g., 100-200 fps) and upscale them to high fps using interframe events. Thus, the method 400 enables motion-adaptive fps, which are adjustable after video acquisition. Also, the method 400 is applicable for unlimited length of slow motion. Moreover, in case of the slow motion, the method 400 is energy and memory efficient and handles non-rigid objects (e.g., water and fire) and light changes efficiently and accurately.

The method 400 may be used for capturing energy efficient high-resolution video. The method

400 enables capturing high resolution video with low fps (e.g., 5-10 fps) and interpolating the captured video to normal frame rate. This leads to longer battery life and memory efficiency. The video interpolation can only be performed during video acquisition.

The method 400 may be used in such application scenarios, which require capturing of only single image and interpolate the captured single image by using the inter-frame events, for example, “synthetic exposure”, “rewind trigger time” and “rolling shutter compensation”. In case of the synthetic exposure, the method 400 is used to capture a single image and interpolate the captured single image using inter-frame events. Thereafter, the interpolated images are summed up in order to generate an arbitrary exposure. Thus, the method 400 manifests an ability to freely adjust exposure time after video acquisition.

In case of the rewind trigger time, the method 400 is used to capture a single image and interpolate the captured single image using inter-frame events towards a certain moment of capture. Thus, the method 400 enables precise triggering of camera in order to capture certain moments in time.

In case of the rolling shutter compensation, the method 400 is used to capture a single rolling shutter image and interpolate each row of the image independently using events to simulate global shutter. In this way, the method 400 can compensate rolling shutter from an object motion in a single image.

Besides the aforementioned application scenarios, the method 400 can be used for video interpolation after the video acquisition. In order to perform the video interpolation after the video acquisition, the temporally synchronized and aligned events and the video must be stored in a special output format. This further improves detectability. Additionally, the method 400 may be used for non-uniform video interpolation. For example, a user can select which part of the video, the user would like to temporally up-sample and up to how much extent. The temporal upscaling ratio can be automatically determined based on an amount of motion. By virtue of using the inter-frame events in addition to the images, the method 400 is applicable on non-rigid objects, such as fire, water, splashes, and the like, in contrast to conventional image-based methods.

The steps 402-to-410 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein. In one aspect, the present disclosure provides a computer-readable medium comprising instructions which, when executed by a processor, cause the processor to perform the method 400 (of FIG. 4).

Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.

Claims

1. An image processing apparatus (102) for generating an interpolated frame at a specified time between adjacent image frames of a video, comprising: an input module (104) configured to receive one, two or more key image frames, and a plurality of surrounding events, wherein each event indicates a pixel location and time associated with a change in pixel intensity above a predetermined threshold; a parametric motion model estimator (106) configured to estimate an inter-frame motion based on the key image frames and the plurality of surrounding events; a warping encoder (108) configured to compute a plurality of multiscale warping interpolation features based on a first image frame of the key frames and the inter- frame motion; a synthesis encoder (110) configured to compute a plurality of multiscale synthesis interpolation features based on the first image frame and a subset of the events between the first image frame and the specified time; a multiscale feature fusion decoder module (112) configured to generate the interpolated frame, wherein the multiscale feature fusion decoder module (112) comprises a plurality of image decoder blocks (302), wherein each decoder block is configured to receive an output from a preceding decoder block and each of the warping interpolation features and synthesis interpolation features as input.

2. The image processing apparatus (102) of claim 1, wherein the input module (104) includes an image sensor (104 A) configured to capture the image frames and an aligned event sensor (104B) with pixels configured to capture the plurality of surrounding events, wherein each pixel is configured to trigger an event when a change of intensity of the pixel crosses a threshold.

3. The image processing apparatus (102) of claim 1 or claim 2, wherein the image sensor (104 A) and aligned event sensor (104B) are aligned using a beam splitter, or wherein the image sensor (104 A) and aligned event sensor (104B) are integrated in a hybrid sensor.

25

4. The image processing apparatus (102) of any preceding claim, wherein the parametric motion model estimator (106) comprises: an image-based motion encoder (202) configured to estimate motion features from the first and second adjacent image frames at a plurality of scales, an event-based motion encoder (204) configured to estimate motion features from the plurality of surrounding events at a plurality of scales, and a second multiscale feature fusion decoder module (206) configured to combine each of the estimated motion features.

5. The image processing apparatus (102) of any preceding claim, wherein the parametric motion model estimator (106) is configured to compute the inter-frame spline motion including three cubic splines for each pixel location, respectively modelling a horizontal displacement, a vertical displacement and a warping priority of each pixel as a function of time.

6. The image processing apparatus (102) of claim 5, where the warping encoder (108) includes: an image encoder (208) comprising a plurality of residual blocks configured to encode the first image frame at a plurality of scales; a flow sampling module (210) configured to sample the horizontal displacement, vertical displacement and warping priority of each pixel from the inter-frame spline motion; and a forward warping module (212) configured to compute the warping interpolation features at the plurality of scales based on the encoded first image frame and the sampled interframe spline motion.

7. The image processing apparatus (102) of any preceding claim, wherein each decoder block of the multiscale feature fusion decoder module (112) includes a gated compression module (304) configured to attenuate each of the warping interpolation features and synthesis interpolation features and select a subset of the most informative features.

8. The image processing apparatus (102) of any preceding claim, wherein each decoder block of the multiscale feature fusion decoder module (112) includes an upscaling layer (306) followed by a convolution layer (308) with a non-linear activation function.

9. The image processing apparatus (102) of any preceding claim, wherein: the synthesis encoder (110) is further configured to compute a plurality of second synthesis interpolation features based on a second image frame and a subset of the events between the second image frame and the specified time, wherein the specified time is between the first image frame and the second image frame, the warping encoder (108) is further configured to compute a plurality of second warping interpolation features based on the second image frame and the inter-frame motion, and each decoder block of the multiscale feature fusion decoder module (112) is further configured to receive each of the second warping interpolation features and second synthesis interpolation features as input.

10. A method (400) of generating an interpolated frame at a specified time between adj acent image frames of a video, comprising: receiving, by an input module (104), one or more image frames, and a plurality of surrounding events, wherein each event indicates a pixel location and time associated with a change in pixel intensity above a predetermined threshold; estimating, by a parametric motion model estimator (106), an inter- frame motion based on the key image frames and the plurality of surrounding events; computing, by a warping encoder (108), a plurality of warping interpolation features based on a first image frame of the key frames and the inter-frame motion; computing, by a synthesis encoder (110), a plurality of synthesis interpolation features based on the first image frame and a subset of the events between the first image frame and the specified time; generating, by a multiscale feature fusion decoder module (112), the interpolated frame, wherein the multiscale feature fusion decoder module (112) comprises a plurality of image decoder blocks (302), wherein each decoder block is configured to receive each of the warping interpolation features and synthesis interpolation features as input.

11. The method (400) of claim 10, wherein receiving the key image frames and surrounding events includes capturing, by an image sensor (104A), the key image frames and capturing, by an aligned event sensor, the plurality of surrounding events, wherein each pixel of the aligned event sensor is configured to trigger an event when a change of intensity of the pixel crosses a threshold.

12. The method (400) of claim 10 or claim 11, wherein the image sensor (104 A) and aligned event sensor (104B) are aligned using a beam splitter, or wherein the image sensor (104 A) and aligned event sensor (104B) are integrated in a hybrid sensor.

13. The method (400) of any one of claims 10 to 12, wherein estimating the inter-frame motion comprises: estimating, by an image-based motion encoder (202), motion features from the first and second adjacent image frames at a plurality of scales, estimating, by an event-based motion encoder (204), motion features from the plurality of surrounding events at a plurality of scales, and combining, by a second multiscale feature fusion decoder module (206), each of the estimated motion features.

14. The method (400) of any one of claims 10 to 13, further comprising computing the interframe spline motion including three cubic splines for each pixel location, respectively modelling a horizontal displacement, a vertical displacement and a warping priority of each pixel as a function of time.

15. The method (400) of claim 13 or claim 14, wherein computing, the plurality of warping interpolation features includes: encoding, by an image encoder (208) comprising a plurality of residual blocks, the first image frame at a plurality of scales; sampling, by a flow sampling module (210), the horizontal displacement, vertical displacement and warping priority of each pixel from the inter-frame spline motion; and computing, by a forward warping module (212), the warping interpolation features at the plurality of scales based on the encoded first image frame and the sampled inter-frame spline motion.

16. The method (400) of any one of claims 10 to 15, further comprising attenuating, by a gated compression module (304) of each decoder block of the multiscale feature fusion decoder module (112), each of the warping interpolation features and synthesis interpolation features and selecting a subset of the most informative features.

28

17. The method (400) of any one of claims 10 to 16, wherein each decoder block of the multiscale feature fusion decoder module (112) includes an up-scaling layer (306) followed by a convolution layer (308) with a non-linear activation function.

18. The method (400) of any one of claims 10 to 17, further comprising: computing a plurality of second synthesis interpolation features based on a second image frame and a subset of the events between the second image frame and the specified time, wherein the specified time is between the first image frame and the second image frame, computing a plurality of second warping interpolation features based on the second image frame and the inter-frame spline motion, and generating the interpolated frame by further receiving, at each decoder block of the multiscale feature fusion decoder module (112), each of the second warping interpolation features and second synthesis interpolation features as input.

19. A computer-readable medium comprising instructions which, when executed by a processor, cause the processor to perform the method (400) of any one of claims 10 to 18.

29