CN110809126A - Video frame interpolation method and system based on adaptive deformable convolution

Info

Publication number
CN110809126A
CN110809126A (application CN201911032243.2A)
Authority
CN
China
Prior art keywords
image frame
image
frame
deformable convolution
adaptive deformable
Prior art date
Legal status
Pending
Application number
CN201911032243.2A
Other languages
Chinese (zh)
Inventor
樊硕
Current Assignee
Beijing Yingpu Technology Co Ltd
Original Assignee
Beijing Yingpu Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yingpu Technology Co Ltd filed Critical Beijing Yingpu Technology Co Ltd
Priority to CN201911032243.2A
Publication of CN110809126A
Pending legal status (current)

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/01: Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N 7/0135: Conversion of standards processed at pixel level involving interpolation processes

Abstract

The method first reads a video stream to be processed and acquires, from that stream, an adjacent first image frame and second image frame that need to be interpolated. The two image frames are then input into a preset convolutional neural network model, which extracts their features and performs an adaptive deformable convolution operation to obtain and output an intermediate frame of the first image frame and the second image frame. The video frame interpolation method and system based on adaptive deformable convolution combine the advantages of the kernel estimation method and the flow graph estimation method while overcoming the limitations of both: they can handle complex motion, low-quality video frames and occlusion, and further improve the efficiency and accuracy of video frame interpolation.

Description

Video frame interpolation method and system based on adaptive deformable convolution
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video frame interpolation method and system based on adaptive deformable convolution.
Background
Video frame interpolation is one of the main problems in the video processing field of computer vision and refers to methods for synthesizing intermediate frames between consecutive frames. With such methods, slow-motion video can be obtained from ordinary video without a professional high-speed camera, and downsampled video can be recovered, which is useful in the field of video compression.
However, among conventional schemes, those that estimate a kernel per pixel require large amounts of memory, are computationally expensive, and cannot handle motion larger than the kernel size; those that estimate, for each output pixel, a flow vector pointing to a reference position struggle with complex motion when the input frames are of low quality. A more efficient video frame interpolation method is therefore desirable.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least to partially solve or mitigate them.
According to an aspect of the present application, there is provided a video frame interpolation method based on adaptive deformable convolution, including:
reading a video stream to be processed, and acquiring a first image frame and a second image frame which are adjacent and need to be interpolated in the video stream to be processed;
inputting the first image frame and the second image frame into a preset convolutional neural network model;
and respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through the convolutional neural network model, and performing adaptive deformable convolution operation on the first image feature and the second image feature to obtain and output an intermediate frame of the first image frame and the second image frame.
Optionally, the convolutional neural network model comprises a feature extraction sub-network constructed by an encoder and a decoder;
the extracting, by the convolutional neural network model, first image features of the first image frame and second image features of the second image frame, respectively, includes:
respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through a feature extraction sub-network in the convolutional neural network model.
Optionally, the convolutional neural network model further includes: a parameter estimation sub-network and an adaptive deformable convolution sub-network;
the performing an adaptive deformable convolution operation on the first image feature and the second image feature to obtain and output an intermediate frame of the first image frame and the second image frame includes:
inputting the first image feature and the second image feature into the parameter estimation sub-network to obtain preset parameters required for performing an adaptive deformable convolution operation; wherein the preset parameters include: kernel weights and offset vectors for each pixel point in the first image frame and the second image frame;
and simultaneously inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network to execute adaptive deformable convolution operation, and obtaining and outputting an intermediate frame of the first image frame and the second image frame.
Optionally, the simultaneously inputting the first image frame, the second image frame, and the preset parameter into the adaptive deformable convolution sub-network to perform an adaptive deformable convolution operation, and obtaining and outputting an intermediate frame between the first image frame and the second image frame includes:
simultaneously inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, executing adaptive deformable convolution operation based on a first convolution formula, and obtaining and outputting an intermediate frame of the first image frame and the second image frame;
the first convolution formula is:
Figure BDA0002250491040000021
wherein, Wk.l(i, j) represents the weight of the pixel point (i, j) and the kernel position (k, l), (α)k,lk,l) Represents an offset; i represents an image before being subjected to the adaptive deformable convolution operation;
Figure BDA0002250491040000022
representing the image after the adaptive deformable convolution operation.
Optionally, before the adjacent image frames and the kernel weight and offset vector of each pixel are simultaneously input into the adaptive deformable convolutional neural network and the interpolated video stream is synthesized and output, the method further includes:
and if invisible pixel points exist in the first image frame or the second image frame, defining an occlusion coefficient, simultaneously inputting the occlusion coefficient, the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, and performing a convolution operation of pixel-by-pixel multiplication through spatial variation to obtain and output the intermediate frame.
Optionally, the defining of an occlusion coefficient, inputting the occlusion coefficient, the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, and performing a convolution operation of pixel-by-pixel multiplication through spatial variation to obtain and output the intermediate frame includes:
defining the occlusion coefficient $V \in [0,1]^{M \times N}$, simultaneously inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, and performing a convolution operation of pixel-by-pixel multiplication through spatial variation based on a second convolution formula to obtain and output the intermediate frame;
the second convolution formula is:

$$I_{out} = V \odot T_f(I_n) + (J_{M,N} - V) \odot T_b(I_{n+1})$$

wherein $\odot$ represents pixel-by-pixel multiplication; $J_{M,N}$ represents an $M \times N$ all-ones matrix; $T_f$ represents the forward spatial transform; $T_b$ represents the backward spatial transform; and $M \times N$ represents the size of the input and output images.
According to another aspect of the present application, there is provided a video frame interpolation system based on adaptive deformable convolution, comprising:
the device comprises an adjacent image frame acquisition module, a first interpolation module and a second interpolation module, wherein the adjacent image frame acquisition module is configured to read a video stream to be processed and acquire a first image frame and a second image frame which are adjacent to each other and need to be interpolated in the video stream to be processed;
an image frame input module configured to input the first image frame and the second image frame into a preset convolutional neural network model;
an intermediate frame obtaining module configured to extract a first image feature of the first image frame and a second image feature of the second image frame through the convolutional neural network model, respectively, and perform an adaptive deformable convolution operation on the first image feature and the second image feature, to obtain and output an intermediate frame of the first image frame and the second image frame.
Optionally, the convolutional neural network model comprises a feature extraction sub-network constructed by an encoder and a decoder;
the intermediate frame acquisition module is further configured to:
respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through a feature extraction sub-network in the convolutional neural network model.
Optionally, the convolutional neural network model further includes: a parameter estimation sub-network and an adaptive deformable convolution sub-network;
the intermediate frame acquisition module is further configured to:
inputting the first image feature and the second image feature into the parameter estimation sub-network to obtain preset parameters required for performing an adaptive deformable convolution operation; wherein the preset parameters include: kernel weights and offset vectors for each pixel point in the first image frame and the second image frame;
and simultaneously inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network to execute adaptive deformable convolution operation, and obtaining and outputting an intermediate frame of the first image frame and the second image frame.
Optionally, the intermediate frame obtaining module is further configured to, when it is determined that invisible pixel points exist in the first image frame or the second image frame, define an occlusion coefficient, simultaneously input the occlusion coefficient, the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, and perform a convolution operation of pixel-by-pixel multiplication through spatial variation to obtain and output the intermediate frame.
The method first reads a video stream to be processed and acquires, from that stream, an adjacent first image frame and second image frame that need to be interpolated. The two image frames are then input into a preset convolutional neural network model, which extracts their features and performs an adaptive deformable convolution operation to obtain and output an intermediate frame of the first image frame and the second image frame. The video frame interpolation method and system based on adaptive deformable convolution combine the advantages of the kernel estimation method and the flow graph estimation method while overcoming the limitations of both: they can handle complex motion, low-quality video frames and occlusion, and further improve the efficiency and accuracy of video frame interpolation.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow chart of a video frame interpolation method based on adaptive deformable convolution according to an embodiment of the application;
FIG. 2 is a schematic diagram of a convolutional neural network workflow according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a video frame interpolation system based on adaptive deformable convolution according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
In the field of video frame interpolation, most methods formulate the task as finding, for each output pixel, a reference location on the input frames from which its value can be estimated, and then computing the output intermediate-frame pixel values from the reference pixel values. Classical video frame interpolation algorithms are mostly built on optical flow, so their performance depends heavily on the optical flow algorithm and suffers significant limitations: it degrades markedly under occlusion, multiple motions and obvious brightness changes. Improvements have been proposed for these limitations. For example, the output image may be synthesized by solving a Poisson equation, which partially addresses the limitations of traditional methods but incurs excessive computational cost due to its heavy optimization process; or the video may be treated as a linear combination of wavelets with different directions and frequencies, with the intermediate frame then solved through an interpolation algorithm, which improves performance and running time notably but still struggles with high-frequency, multi-motion situations.
Video frame interpolation methods based on deep learning can be broadly divided into two types. The first is the kernel estimation method, which adaptively estimates a kernel for each pixel and synthesizes the intermediate frame by convolving the kernels with the input. It finds a suitable reference location by assigning large weights to the pixels of interest: for example, a kernel is evaluated at each location and the output pixels are obtained by convolving it over an input patch, each kernel sampling appropriate input pixels by selectively combining them. However, estimating a kernel per pixel requires large amounts of memory, is computationally expensive, and cannot handle motion larger than the kernel size, i.e. the case of video frames with large motion. The second type is the flow graph estimation method, which directly estimates, for each output pixel, a flow vector pointing to the reference position; but since only one position per input frame is involved, complex motion is difficult to handle when the input frames are of low quality.
Although the kernel-based estimation method and the flow-graph-based estimation method each have limitations, they are complementary and can make up for each other's deficiencies. The kernel-based approach can handle complex motion and low-quality frames, while the flow-graph-based approach points directly to the reference location and can therefore handle motion of any magnitude. Combining the advantages of the two thus yields a better-performing video frame interpolation model. Accordingly, the embodiments of the present application provide a video frame interpolation model based on adaptive deformable convolution that combines both methods, further improving the efficiency of video frame interpolation.
Fig. 1 is a schematic flow chart of a video frame interpolation method based on adaptive deformable convolution according to an embodiment of the present application. Referring to fig. 1, a video frame interpolation method based on adaptive deformable convolution according to an embodiment of the present application may include:
step S101: reading a video stream to be processed, and acquiring a first image frame and a second image frame which are adjacent and need to be interpolated in the video stream to be processed;
step S102: inputting the first image frame and the second image frame into a preset convolution neural network model;
step S103: and respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through a convolutional neural network model, and performing adaptive deformable convolution operation on the first image feature and the second image feature to obtain and output an intermediate frame of the first image frame and the second image frame.
In the method, a video stream to be processed is read, an adjacent first image frame and second image frame that need to be interpolated are acquired from it, the two image frames are input into a preset convolutional neural network model to extract features, and an adaptive deformable convolution operation is performed to obtain and output an intermediate frame of the first image frame and the second image frame. The method provided by the embodiments of the present application combines the advantages of the kernel estimation method and the flow graph estimation method, can handle complex motion, low-quality video frames and occlusion, and further improves the efficiency and accuracy of video frame interpolation.
A Convolutional Neural Network (CNN) is a feed-forward neural network with a deep structure whose computations include convolutions; it is one of the representative algorithms of deep learning. A CNN has representation-learning capability and can classify input information in a translation-invariant manner according to its hierarchical structure. Adaptive Deformable Convolution (ADC) can conveniently replace standard convolution units in the convolutional neural network of any existing visual recognition task and can be trained end-to-end through standard back-propagation; it adapts to the image content and thereby accommodates geometric variations such as the shape and size of different objects.
Generally speaking, interpolating video frames means inserting an intermediate frame between two adjacent image frames, for example to obtain a slow-motion video stream corresponding to a normal video stream. Referring to step S101, a first image frame and a second image frame are acquired in succession from the video stream to be processed. Further, referring to step S102, once acquired, the first image frame and the second image frame can be input into a preset convolutional neural network model, and an intermediate frame between the two consecutive frames is obtained through this model.
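As a concrete illustration of steps S101 and S102, the following sketch reads adjacent frame pairs from a video stream and converts them to tensors; the helper names and the use of OpenCV are assumptions for illustration, not part of the patented method.

```python
# A minimal sketch of steps S101-S102, assuming OpenCV and PyTorch are
# available; the helper names are illustrative, not from the patent.
import cv2
import torch

def adjacent_frame_pairs(video_path):
    """Read a video stream to be processed and yield adjacent
    (first image frame, second image frame) pairs."""
    cap = cv2.VideoCapture(video_path)
    ok, first = cap.read()
    while ok:
        ok, second = cap.read()
        if not ok:
            break
        yield first, second
        first = second
    cap.release()

def to_tensor(frame):
    """Convert a BGR uint8 HxWx3 array to a float 1x3xHxW tensor in [0, 1],
    ready to be input into the convolutional neural network model."""
    return torch.from_numpy(frame).permute(2, 0, 1).float().div(255.0).unsqueeze(0)
```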
Optionally, the preset convolutional neural network model in this embodiment is built based on a fully-convolutional structure, and includes three sub-network structures in total: a feature extraction sub-network, a parameter estimation sub-network, and an adaptive deformable convolution sub-network.
The feature extraction sub-network is composed mainly of an encoder and a decoder. The main building block of the encoder can be defined as module 1, which contains three convolutional layers and ReLU activation functions; the main building block of the decoder can be defined as module 2, which contains an upsampling layer, a convolutional layer and a ReLU activation function. The encoder consists of six module-1 blocks and five average pooling layers, with an average pooling layer following each module 1 except the last. The decoder consists of four module-2 blocks and three module-1 blocks, with a module 1 following each module 2 except the last. By inputting consecutive video frames into the feature extraction sub-network, the image features of each frame can be obtained.
The ReLU activation function uses the rectified linear unit (ReLU) as the neuron's activation. Compared with linear functions, ReLU has stronger expressive power, which is especially evident in deep networks; as a nonlinear function, its gradient is constant over the non-negative interval, so there is no vanishing-gradient problem and the convergence rate of the model remains stable.
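The description above maps naturally onto code. Below is a minimal PyTorch sketch of module 1, module 2 and the encoder-decoder feature extraction sub-network; the channel widths, the 3×3 kernels and the choice to stack the two input frames are assumptions made for illustration only.

```python
import torch.nn as nn

def module1(in_ch, out_ch):
    """Module 1: three convolutional layers, each followed by a ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

def module2(in_ch, out_ch):
    """Module 2: an upsampling layer, a convolutional layer and a ReLU."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class FeatureExtractionSubNetwork(nn.Module):
    """Encoder of six module-1 blocks (average pooling after each but the
    last) and decoder of four module-2 blocks (a module 1 after each but
    the last). Channel widths and the stacked two-frame input are
    illustrative assumptions."""
    def __init__(self):
        super().__init__()
        enc_w = [6, 32, 64, 128, 256, 512, 512]   # 6 = two stacked RGB frames
        layers = []
        for i in range(6):
            layers.append(module1(enc_w[i], enc_w[i + 1]))
            if i < 5:                              # no pooling after the last module 1
                layers.append(nn.AvgPool2d(kernel_size=2))
        self.encoder = nn.Sequential(*layers)

        dec_w = [512, 256, 128, 64, 64]
        layers = []
        for i in range(4):
            layers.append(module2(dec_w[i], dec_w[i + 1]))
            if i < 3:                              # no module 1 after the last module 2
                layers.append(module1(dec_w[i + 1], dec_w[i + 1]))
        self.decoder = nn.Sequential(*layers)

    def forward(self, x):
        return self.decoder(self.encoder(x))
```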
The parameter estimation sub-network is composed mainly of module 1, module 2 and a Softmax function, and is used to obtain the preset parameters required for the adaptive deformable convolution, namely the kernel weights and offset vectors of the pixel points in the adjacent image frames. The kernel weight of a pixel is the relative importance of that pixel within the kernel; the offset vector of a pixel is its displacement between two adjacent frames. The Softmax function, also called the normalized exponential function, is a gradient-log normalization of a finite discrete probability distribution.
After these parameters are obtained, the intermediate frame of the adjacent image frames can be obtained by convolution through the adaptive deformable convolution sub-network, whose network architecture consists of five convolutional layers.
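A corresponding sketch of the parameter estimation sub-network follows, reusing module1 and module2 from the sketch above: one branch per estimated quantity, with a Softmax on the kernel-weight branch so the F × F weights at every pixel form a normalized distribution. The three-branch layout and the channel width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ParameterEstimationSubNetwork(nn.Module):
    """Estimates the preset parameters: kernel weights W and offset
    vectors (alpha, beta) for each pixel point. Each branch is built
    from a module 1 and a module 2 as described above."""
    def __init__(self, feat_ch=64, kernel_size=5):
        super().__init__()
        F2 = kernel_size * kernel_size
        self.weight_branch = nn.Sequential(module1(feat_ch, feat_ch),
                                           module2(feat_ch, F2))
        self.alpha_branch = nn.Sequential(module1(feat_ch, feat_ch),
                                          module2(feat_ch, F2))
        self.beta_branch = nn.Sequential(module1(feat_ch, feat_ch),
                                         module2(feat_ch, F2))

    def forward(self, features):
        W = torch.softmax(self.weight_branch(features), dim=1)  # kernel weights sum to 1
        alpha = self.alpha_branch(features)                     # vertical offsets
        beta = self.beta_branch(features)                       # horizontal offsets
        return W, alpha, beta
```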
Optionally, after the convolutional neural network model is constructed, it may be trained so that it acquires intermediate frames more quickly and accurately. In this embodiment, the training data set is a large number of consecutive three-frame video clips, preferably of size 300 × 300, assembled from videos collected on the network; the data set is normalized into a set balanced around zero by subtracting the average pixel value of each color channel. The test data sets are the Middlebury data set and sequences randomly sampled from the UCF101 and DAVIS data sets. The goal of this embodiment is mainly to find the spatial transformation between adjacent frames and synthesize the intermediate frame through a deformable convolution operation.
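The per-channel normalization step described above amounts to the following, assuming the clips are held as (T, H, W, 3) arrays; the helper name is illustrative.

```python
import numpy as np

def normalize_clip(frames):
    """Subtract the average pixel value of each colour channel so the
    training clips are balanced around zero, as described above."""
    frames = np.asarray(frames, dtype=np.float32)   # (T, H, W, 3)
    channel_mean = frames.mean(axis=(0, 1, 2))      # one mean per channel
    return frames - channel_mean                    # broadcasts over T, H, W
```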
After the first image frame and the second image frame are input into the convolutional neural network model, the convolutional neural network model may sequentially operate according to three sub-networks in the preset convolutional neural network model, as shown in fig. 2, specifically including:
step S201: and respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through a feature extraction sub-network in the convolutional neural network model.
Step S202: inputting the first image feature and the second image feature into a parameter estimation sub-network to obtain preset parameters required for performing the adaptive deformable convolution operation; wherein, the preset parameters include: and kernel weights and offset vectors of all pixel points in the first image frame and the second image frame.
The "kernel" in computer vision is a box (circle, or arbitrary shape) that is used to define the surrounding pixels that are used to compute a new value for a pixel. The kernel weight of a pixel is the relative importance of the pixel in the kernel. And the offset vector of the pixel point is the offset of the pixel point between two adjacent frames.
Step S203: and simultaneously inputting the first image frame, the second image frame and preset parameters into an adaptive deformable convolution sub-network to execute adaptive deformable convolution operation, and obtaining and outputting an intermediate frame of the first image frame and the second image frame.
In this embodiment, the first image feature of the first image frame and the second image feature of the second image frame are accurately extracted through the convolutional neural network model, the adaptive deformable convolution operation is performed on them, and the intermediate frame of the first image frame and the second image frame is obtained and output quickly and efficiently.
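Combining the sketches above, the overall workflow of steps S201-S203 might be assembled as follows. Whether the two frames share one parameter set, and how the two warped frames are combined, are implementation choices here (a plain average stands in for the occlusion-aware blend described later), and adaptive_deformable_conv is an assumed batched primitive for the first convolution formula; a single-channel reference is sketched after the formula below.

```python
import torch
import torch.nn as nn

class AdaptiveDeformableInterpolator(nn.Module):
    """End-to-end sketch of steps S201-S203, reusing the sub-network
    sketches above. Separate parameter branches for the forward and
    backward warps are an illustrative choice."""
    def __init__(self, kernel_size=5):
        super().__init__()
        self.features = FeatureExtractionSubNetwork()
        self.params_fwd = ParameterEstimationSubNetwork(64, kernel_size)
        self.params_bwd = ParameterEstimationSubNetwork(64, kernel_size)

    def forward(self, frame1, frame2):
        # S201: extract features from the stacked adjacent frames
        # (input sides assumed divisible by 32 for exact resizing).
        feats = self.features(torch.cat([frame1, frame2], dim=1))
        # S202: estimate kernel weights and offset vectors per frame.
        Wf, af, bf = self.params_fwd(feats)
        Wb, ab, bb = self.params_bwd(feats)
        # S203: warp both frames with the adaptive deformable convolution;
        # adaptive_deformable_conv is an assumed primitive implementing
        # the first convolution formula on batched tensors.
        warped_f = adaptive_deformable_conv(frame1, Wf, af, bf)
        warped_b = adaptive_deformable_conv(frame2, Wb, ab, bb)
        return 0.5 * (warped_f + warped_b)
```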
The present embodiment employs an adaptive deformable convolution operation; at root, the task is to find the spatial transformation between adjacent frames and obtain the intermediate frame from it. In practice, a spatial transformation $T$ may first be defined, with the consecutive frames denoted $I_n$ and $I_{n+1}$ and the intermediate frame denoted $I_{out}$; the forward spatial transform $T_f$ and the backward spatial transform $T_b$ can then be considered as

$$I_{out} = T_f(I_n) = T_b(I_{n+1})$$

wherein the forward spatial transform $T_f$ and the backward spatial transform $T_b$ are two expressions of the spatial transformation $T$: the input of $T_f$ is $I_n$ and the input of $T_b$ is $I_{n+1}$. In this embodiment, the task of video frame interpolation is to find the spatial transformation $T$.
Optionally, after the spatial transformation, the image $I$ is subjected to a deformable convolution operation to output an image $\hat{I}$. If the spatial transformation is defined as a classical convolution operation, the calculation formula is as follows:

$$\hat{I}(i,j) = \sum_{k=1}^{F} \sum_{l=1}^{F} W_{k,l}\, I(i + k,\; j + l)$$

wherein:
$(i, j)$ represents the coordinates of a pixel point;
$(k, l)$ represents the coordinates of a kernel position;
$F$ represents the size of the kernel;
$I$ represents the image before the adaptive deformable convolution operation;
$\hat{I}$ represents the image after the adaptive deformable convolution operation;
$W_{k,l}$ represents the kernel weight.
The kernel weight and offset vector of each output pixel are the core of the spatial transformation in this embodiment: kernel weights and offsets that vary with position are added on the basis of the classical convolution operation, and the adaptive deformable convolution operation is performed based on the first convolution formula:

$$\hat{I}(i,j) = \sum_{k=1}^{F} \sum_{l=1}^{F} W_{k,l}(i,j)\, I\big(i + k + \alpha_{k,l},\; j + l + \beta_{k,l}\big)$$

wherein:
$(i, j)$ represents the coordinates of a pixel point;
$(k, l)$ represents the coordinates of a kernel position;
$W_{k,l}(i, j)$ represents the weight of the pixel point $(i, j)$ at the kernel position $(k, l)$;
$(\alpha_{k,l}, \beta_{k,l})$ represents an offset;
$F$ represents the size of the kernel;
$I$ represents the image before the adaptive deformable convolution operation;
$\hat{I}$ represents the image after the adaptive deformable convolution operation.
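A direct, unoptimised reference of the first convolution formula for a single-channel image may make the operation concrete; rounding the offsets to whole pixels is a simplification here, since a trainable implementation would sample with bilinear interpolation to stay differentiable.

```python
import numpy as np

def adaptive_deformable_conv_ref(I, W, alpha, beta):
    """Reference of the first convolution formula for one channel.
    I: (H, W) image; W, alpha, beta: (F, F, H, W) per-pixel kernel
    weights and offsets."""
    F = W.shape[0]
    H, width = I.shape
    out = np.zeros((H, width), dtype=np.float64)
    for i in range(H):
        for j in range(width):
            for k in range(F):
                for l in range(F):
                    # sample I at (i + k + alpha_{k,l}, j + l + beta_{k,l})
                    y = int(round(i + k + alpha[k, l, i, j]))
                    x = int(round(j + l + beta[k, l, i, j]))
                    y = min(max(y, 0), H - 1)       # clamp to the image
                    x = min(max(x, 0), width - 1)
                    out[i, j] += W[k, l, i, j] * I[y, x]
    return out
```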
As can be seen from the above description, the kernel weight and offset vector of each output pixel are what the convolutional neural network model must produce to determine the spatial transformation. As noted above, the convolutional neural network model may include three sub-network structures: a feature extraction sub-network, a parameter estimation sub-network and an adaptive deformable convolution sub-network. The parameter estimation operation is mainly carried out by the parameter estimation sub-network consisting of a module 1, a module 2 and a Softmax function.
In combination with the above embodiments, the preset parameters may include $W_{k,l}$ and $(\alpha_{k,l}, \beta_{k,l})$. The parameter $W_{k,l}$ is the kernel weight, i.e. the weight magnitude at the kernel position $(k, l)$; the network structure used to obtain $W_{k,l}$ can be a module 1, a module 2 and a Softmax function. The parameter $(\alpha_{k,l}, \beta_{k,l})$ is the offset vector, i.e. the relative offset from the kernel position $(k, l)$; the network structure used to obtain $(\alpha_{k,l}, \beta_{k,l})$ can be one module 1 and one module 2.
In practical applications, occlusion may be present in the images. Therefore, in an optional embodiment of the present invention, before the adjacent image frames and the kernel weight and offset vector of each pixel are simultaneously input into the adaptive deformable convolutional neural network and the interpolated video stream is synthesized and output, the occlusion condition of the first image frame and the second image frame can be further determined, i.e. whether invisible pixel points exist in the first image frame or the second image frame. If invisible pixel points exist, an occlusion coefficient is defined and input into the adaptive deformable convolution sub-network together with the first image frame, the second image frame and the preset parameters, and a convolution operation of pixel-by-pixel multiplication through spatial variation is performed to obtain and output the intermediate frame.
Specifically, the occlusion coefficient $V \in [0,1]^{M \times N}$ is defined, the first image frame, the second image frame and the preset parameters are simultaneously input into the adaptive deformable convolution sub-network, and a convolution operation of pixel-by-pixel multiplication through spatial variation is performed based on the second convolution formula to obtain and output the intermediate frame; the calculation formula is as follows:

$$I_{out} = V \odot T_f(I_n) + (J_{M,N} - V) \odot T_b(I_{n+1})$$

wherein:
$\odot$ represents pixel-by-pixel multiplication;
$J_{M,N}$ represents an $M \times N$ all-ones matrix;
$T_f$ represents the forward spatial transform;
$T_b$ represents the backward spatial transform;
$M \times N$ represents the size of the input and output images.
In the present embodiment, the input images are the first image frame and the second image frame, and the output image is the intermediate frame; the first image frame and the second image frame have the same size as the intermediate frame, and $M \times N$ represents the pixel size of each image.
That is, after the required parameters are obtained through the parameter estimation sub-network, the first image frame, the second image frame, the preset parameters and the occlusion coefficient are input together into the adaptive deformable convolution sub-network to obtain the final intermediate frame. In the scheme of this embodiment, the occluded parts of the adjacent image frames are identified to define the occlusion coefficient, so the intermediate frame of the first image frame and the second image frame is obtained efficiently by combining the occlusion coefficient with the other parameters, further improving the data-processing efficiency of the convolutional neural network model.
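Once the two warped frames are available, the second convolution formula reduces to a simple per-pixel blend, since $J_{M,N} - V$ is element-wise $1 - V$; a minimal sketch:

```python
import numpy as np

def occlusion_blend(warped_forward, warped_backward, V):
    """Second convolution formula: since J_{M,N} is the all-ones matrix,
    the blend is a per-pixel convex combination of the two warped frames,
    weighted by the occlusion coefficient V."""
    V = np.clip(V, 0.0, 1.0)   # occlusion coefficient in [0, 1]^{M x N}
    return V * warped_forward + (1.0 - V) * warped_backward
```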
Based on the same inventive concept, as shown in fig. 3, an embodiment of the present application further provides a video frame interpolation system 300 based on adaptive deformable convolution, including:
an adjacent image frame acquiring module 310, configured to read a video stream to be processed, and acquire adjacent first and second image frames that need to be interpolated in the video stream to be processed;
an image frame input module 320 configured to input the first image frame and the second image frame into a preset convolutional neural network model;
an intermediate frame obtaining module 330 configured to extract a first image feature of the first image frame and a second image feature of the second image frame through the convolutional neural network model, respectively, and perform an adaptive deformable convolution operation on the first image feature and the second image feature, to obtain and output an intermediate frame of the first image frame and the second image frame.
The convolutional neural network model comprises a feature extraction sub-network, a parameter estimation sub-network and an adaptive deformable convolution sub-network which are constructed by an encoder and a decoder;
wherein the intermediate frame acquiring module 330 is further configured to:
and respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through a feature extraction sub-network in the convolutional neural network model.
Inputting the first image feature and the second image feature into a parameter estimation sub-network to obtain preset parameters required for performing the adaptive deformable convolution operation; wherein, the preset parameters include: kernel weights and offset vectors of each pixel point in the first image frame and the second image frame;
and then simultaneously inputting the first image frame, the second image frame and preset parameters into an adaptive deformable convolution sub-network to execute adaptive deformable convolution operation, and obtaining and outputting an intermediate frame of the first image frame and the second image frame.
The first convolution formula for performing the adaptive deformable convolution operation is calculated as follows:

$$\hat{I}(i,j) = \sum_{k=1}^{F} \sum_{l=1}^{F} W_{k,l}(i,j)\, I\big(i + k + \alpha_{k,l},\; j + l + \beta_{k,l}\big)$$

wherein:
$(i, j)$ represents the coordinates of a pixel point;
$(k, l)$ represents the coordinates of a kernel position;
$W_{k,l}(i, j)$ represents the weight of the pixel point $(i, j)$ at the kernel position $(k, l)$;
$(\alpha_{k,l}, \beta_{k,l})$ represents an offset;
$F$ represents the size of the kernel;
$I$ represents the image before the adaptive deformable convolution operation;
$\hat{I}$ represents the image after the adaptive deformable convolution operation.
In an optional embodiment of the present invention, the intermediate frame obtaining module 330 is further configured to, when it is determined that invisible pixel points exist in the first image frame or the second image frame, define an occlusion coefficient, simultaneously input the occlusion coefficient, the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, and perform a convolution operation of pixel-by-pixel multiplication through spatial variation to obtain and output the intermediate frame.
In the above alternative embodiment, the occlusion coefficient $V \in [0,1]^{M \times N}$ is defined, and the second convolution formula for the convolution operation of pixel-by-pixel multiplication is calculated as follows:

$$I_{out} = V \odot T_f(I_n) + (J_{M,N} - V) \odot T_b(I_{n+1})$$

wherein:
$\odot$ represents pixel-by-pixel multiplication;
$J_{M,N}$ represents an $M \times N$ all-ones matrix;
$T_f$ represents the forward spatial transform;
$T_b$ represents the backward spatial transform;
$M \times N$ represents the size of the input and output images.
In the system, a video stream to be processed is read, an adjacent first image frame and second image frame that need to be interpolated are acquired from it, the two image frames are input into a preset convolutional neural network model to extract features, and an adaptive deformable convolution operation is performed to obtain and output an intermediate frame of the first image frame and the second image frame.
The scheme provided by the embodiments of the present application combines the advantages of the kernel estimation method and the flow graph estimation method, can handle complex motion, low-quality video frames and occlusion, and further improves the efficiency and accuracy of video frame interpolation.
Embodiments of the present application further provide a computing device. Referring to fig. 4, the computing device comprises a memory 420, a processor 410 and a computer program stored in the memory 420 and executable by the processor 410, the computer program being stored in a space 430 for program code in the memory 420; when executed by the processor 410, the computer program implements steps 431 for performing any of the methods according to the present application.
The embodiment of the application also provides a computer-readable storage medium. Referring to fig. 5, the computer-readable storage medium comprises a storage unit for program code, the storage unit being provided with a program 431' for performing the steps of the method according to the present application, the program being executed by a processor.
The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the present application.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions which, when loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in the embodiments of the present application. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A video frame interpolation method based on adaptive deformable convolution comprises the following steps:
reading a video stream to be processed, and acquiring a first image frame and a second image frame which are adjacent and need to be interpolated in the video stream to be processed;
inputting the first image frame and the second image frame into a preset convolutional neural network model;
and respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through the convolutional neural network model, and performing adaptive deformable convolution operation on the first image feature and the second image feature to obtain and output an intermediate frame of the first image frame and the second image frame.
2. The method of claim 1, wherein the convolutional neural network model comprises a feature extraction sub-network constructed by an encoder and a decoder;
the extracting, by the convolutional neural network model, first image features of the first image frame and second image features of the second image frame, respectively, includes:
respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through a feature extraction sub-network in the convolutional neural network model.
3. The method of claim 1, wherein the convolutional neural network model further comprises: a parameter estimation sub-network and an adaptive deformable convolution sub-network;
the performing an adaptive deformable convolution operation on the first image feature and the second image feature to obtain and output an intermediate frame of the first image frame and the second image frame includes:
inputting the first image feature and the second image feature into the parameter estimation sub-network to obtain preset parameters required for performing an adaptive deformable convolution operation; wherein the preset parameters include: kernel weights and offset vectors for each pixel point in the first image frame and the second image frame;
and simultaneously inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network to execute adaptive deformable convolution operation, and obtaining and outputting an intermediate frame of the first image frame and the second image frame.
4. The method of claim 3, wherein the inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network simultaneously to perform an adaptive deformable convolution operation, and obtaining and outputting an intermediate frame of the first image frame and the second image frame comprises:
simultaneously inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, executing adaptive deformable convolution operation based on a first convolution formula, and obtaining and outputting an intermediate frame of the first image frame and the second image frame;
the first convolution formula is:
$$\hat{I}(i,j) = \sum_{k=1}^{F} \sum_{l=1}^{F} W_{k,l}(i,j)\, I\big(i + k + \alpha_{k,l},\; j + l + \beta_{k,l}\big)$$

wherein $W_{k,l}(i, j)$ represents the weight of the pixel point $(i, j)$ at the kernel position $(k, l)$; $(\alpha_{k,l}, \beta_{k,l})$ represents an offset; $I$ represents the image before the adaptive deformable convolution operation; and $\hat{I}$ represents the image after the adaptive deformable convolution operation.
5. The method of claim 4, wherein before the adjacent image frames and the kernel weight and offset vector of each pixel are simultaneously input into the adaptive deformable convolutional neural network and the interpolated video stream is synthesized and output, the method further comprises:
and if invisible pixel points exist in the first image frame or the second image frame, defining an occlusion coefficient, simultaneously inputting the occlusion coefficient, the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, and performing a convolution operation of pixel-by-pixel multiplication through spatial variation to obtain and output the intermediate frame.
6. The method of claim 5, wherein the defining of the occlusion coefficient and inputting the occlusion coefficient, the first image frame, the second image frame, and the preset parameter into the adaptive deformable convolution sub-network simultaneously, and performing a convolution operation of pixel-by-pixel multiplication through spatial variation to obtain and output the intermediate frame comprises:
defining the occlusion coefficient $V \in [0,1]^{M \times N}$, simultaneously inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, and performing a convolution operation of pixel-by-pixel multiplication through spatial variation based on a second convolution formula to obtain and output the intermediate frame;
the second convolution formula is:

$$I_{out} = V \odot T_f(I_n) + (J_{M,N} - V) \odot T_b(I_{n+1})$$

wherein $\odot$ represents pixel-by-pixel multiplication; $J_{M,N}$ represents an $M \times N$ all-ones matrix; $T_f$ represents the forward spatial transform; $T_b$ represents the backward spatial transform; and $M \times N$ represents the size of the input and output images.
7. A video frame interpolation system based on adaptive deformable convolution, comprising:
an adjacent image frame acquisition module configured to read a video stream to be processed and acquire an adjacent first image frame and second image frame that need to be interpolated in the video stream to be processed;
an image frame input module configured to input the first image frame and the second image frame into a preset convolutional neural network model;
an intermediate frame obtaining module configured to extract a first image feature of the first image frame and a second image feature of the second image frame through the convolutional neural network model, respectively, and perform an adaptive deformable convolution operation on the first image feature and the second image feature, to obtain and output an intermediate frame of the first image frame and the second image frame.
8. The system of claim 7, wherein the convolutional neural network model comprises a feature extraction sub-network constructed by an encoder and a decoder;
the intermediate frame acquisition module is further configured to:
respectively extracting a first image feature of the first image frame and a second image feature of the second image frame through a feature extraction sub-network in the convolutional neural network model.
9. The system of claim 7, wherein the convolutional neural network model further comprises: a parameter estimation sub-network and an adaptive deformable convolution sub-network;
the intermediate frame acquisition module is further configured to:
inputting the first image feature and the second image feature into the parameter estimation sub-network to obtain preset parameters required for performing an adaptive deformable convolution operation; wherein the preset parameters include: kernel weights and offset vectors for each pixel point in the first image frame and the second image frame;
and simultaneously inputting the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network to execute adaptive deformable convolution operation, and obtaining and outputting an intermediate frame of the first image frame and the second image frame.
10. The system of claim 9, wherein:
and the intermediate frame acquisition module is further configured to, when it is determined that invisible pixel points exist in the first image frame or the second image frame, define an occlusion coefficient, simultaneously input the occlusion coefficient, the first image frame, the second image frame and the preset parameters into the adaptive deformable convolution sub-network, and perform a convolution operation of pixel-by-pixel multiplication through spatial variation to obtain and output the intermediate frame.
CN201911032243.2A 2019-10-28 2019-10-28 Video frame interpolation method and system based on adaptive deformable convolution Pending CN110809126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911032243.2A CN110809126A (en) 2019-10-28 2019-10-28 Video frame interpolation method and system based on adaptive deformable convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911032243.2A CN110809126A (en) 2019-10-28 2019-10-28 Video frame interpolation method and system based on adaptive deformable convolution

Publications (1)

Publication Number Publication Date
CN110809126A (en) 2020-02-18

Family

ID=69489367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911032243.2A Pending CN110809126A (en) 2019-10-28 2019-10-28 Video frame interpolation method and system based on adaptive deformable convolution

Country Status (1)

Country Link
CN (1) CN110809126A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111654723A (en) * 2020-05-14 2020-09-11 北京百度网讯科技有限公司 Video quality improving method and device, electronic equipment and storage medium
CN111654723B (en) * 2020-05-14 2022-04-12 北京百度网讯科技有限公司 Video quality improving method and device, electronic equipment and storage medium
CN113727141A (en) * 2020-05-20 2021-11-30 富士通株式会社 Interpolation device and method for video frame
US11375152B2 (en) 2020-05-20 2022-06-28 Fujitsu Limited Video frame interpolation apparatus and method
CN113727141B (en) * 2020-05-20 2023-05-12 富士通株式会社 Interpolation device and method for video frames
US20220321830A1 (en) * 2021-04-06 2022-10-06 Adobe Inc. Optimization of adaptive convolutions for video frame interpolation
US11871145B2 (en) * 2021-04-06 2024-01-09 Adobe Inc. Optimization of adaptive convolutions for video frame interpolation
CN114339030A (en) * 2021-11-29 2022-04-12 北京工业大学 Network live broadcast video image stabilization method based on self-adaptive separable convolution
CN114339030B (en) * 2021-11-29 2024-04-02 北京工业大学 Network live video image stabilizing method based on self-adaptive separable convolution
CN115002379A (en) * 2022-04-25 2022-09-02 武汉大学 Video frame insertion method, training method, device, electronic equipment and storage medium
CN115002379B (en) * 2022-04-25 2023-09-26 武汉大学 Video frame inserting method, training device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200218