CN117896526B - Video frame interpolation method and system based on bidirectional coding structure - Google Patents
Video frame interpolation method and system based on bidirectional coding structure
- Publication number
- CN117896526B CN117896526B CN202410059485.5A CN202410059485A CN117896526B CN 117896526 B CN117896526 B CN 117896526B CN 202410059485 A CN202410059485 A CN 202410059485A CN 117896526 B CN117896526 B CN 117896526B
- Authority
- CN
- China
- Prior art keywords
- image
- scale
- adder
- convolution unit
- input end
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/132—Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/182—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/44—Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a video frame interpolation method and system based on a bidirectional coding structure. The first-scale input frame images are preprocessed to obtain second-scale and third-scale input frame images; the first-scale input frame images are fed into a trained interpolation-frame generation model to obtain an interpolation frame. The model processes the first-scale input frame images to obtain first-scale original features, which are then preprocessed to obtain second-scale and third-scale original features. The original features of the first, second and third scales are fed into corresponding sub-networks to obtain pixel-level parameters for each target pixel of each scale's original features. Based on these pixel-level parameters and AdaCoF, the input images of the corresponding scale are warped to obtain warped frame images at each scale, and the warped frame images of the multiple scales are synthesized to obtain the interpolation frame.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a video frame interpolation method and system based on a bidirectional coding structure.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
In today's era, where digital media pervades people's lives, video has become one of the main media for communication, entertainment, and information transfer. Despite high-definition cameras and powerful video processing software, in some situations users may still find the viewing experience lacking, mainly in the smoothness and naturalness of video playback. With the popularization of the internet, video production and sharing have become easier, and anyone can capture high-quality video with a mobile phone or camera. At the same time, however, this brings new challenges. High-resolution, high-frame-rate video requires more storage space and higher bandwidth, which can be a limiting factor in some cases. Meanwhile, users' expectations for video quality keep rising: they expect clearer, smoother and more realistic video content. Moreover, because of the complexity of video data, an important task in computer vision is to design and extract more robust image features. Illumination changes, viewing-angle changes, and complex background changes caused by camera motion in video data make this work considerably more difficult.
In this context, the video frame rate becomes a key factor that directly affects the fluency and realism of video playback. However, even at higher frame rates, content produced by high-speed motion or low-frame-rate sources can still look unsmooth, manifesting as stuttering, frame skips, or uneven transitions that degrade the viewing experience. Video frame interpolation techniques were developed to address these shortcomings. The basic idea is to insert new frames between adjacent original frames to increase the frame rate of the video. By increasing the number of frames, video frame interpolation aims to create a smoother and more realistic visual effect, filling gaps in the video stream and making motion transitions more fluid.
However, conventional frame interpolation methods mainly rely on mathematical techniques such as linear interpolation. These methods can improve video fluency to a certain extent, but they struggle with complex scenes and motion. Particularly in fast-moving or rapidly changing scenes, they tend to produce blurred and distorted images and cannot meet users' demands for high-quality video. With the rise of deep learning, video interpolation has been transformed. The success of deep learning models, particularly convolutional neural networks (CNNs), enables computers to better understand and generate video frames, injecting new vitality into video frame interpolation so that it adapts better to different scenes and motions and improves video quality. These methods generally fall into three categories: flow-based methods, kernel-based methods, and methods that combine the two.
Flow-based methods use either off-the-shelf optical-flow models or purpose-designed networks to estimate optical flow that guides the pixel-level task. However, such approaches often suffer from large errors or high network complexity. Kernel-based methods treat pixel interpolation as a convolution over corresponding local patches in the two input frames, but they cannot handle motion larger than the kernel size and are computationally expensive. Methods that combine the two inherit the advantages of both, but the resulting models are too heavy and too costly to compute, which makes them difficult to deploy.
Disclosure of Invention
In order to solve the defects in the prior art, the invention provides a video frame interpolation method and a system based on a bidirectional coding structure; the invention can effectively improve the video frame rate and quality.
In one aspect, a video frame interpolation method based on a bi-directional coding structure is provided;
a video frame interpolation method based on a bidirectional coding structure comprises the following steps:
acquiring two consecutive frame images of a video to be interpolated at a first scale, namely a preceding frame and a following frame;
For both first-scale frame images, simultaneously performing an up-sampling operation to obtain a pair of second-scale images, and simultaneously performing a down-sampling operation to obtain a pair of third-scale images;
Inputting the two consecutive frame images, in their original order, into the trained interpolation-frame generation model, and at the same time also inputting the same two frame images in reversed order into the trained interpolation-frame generation model; the model outputs an interpolation frame;
Wherein the trained interpolation-frame generation model performs feature extraction, feature enhancement and feature fusion on the forward-ordered and reverse-ordered image pairs to obtain original features of a first scale;
up-sampling the original features of the first scale to obtain original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
Based on the pixel-level parameters of each target pixel of each scale's original features and the adaptive collaboration of flows (AdaCoF), performing a warping operation on the input images of the corresponding scale to obtain warped frame images of that scale;
And synthesizing the warped frame images of the plurality of scales to obtain the interpolation frame.
In another aspect, a video frame interpolation system based on a bi-directional coding structure is provided, comprising:
An acquisition module configured to: acquire two consecutive frame images of a video to be interpolated at a first scale, namely a preceding frame and a following frame;
A preprocessing module configured to: simultaneously up-sample both first-scale frame images to obtain a pair of second-scale images, and simultaneously down-sample both first-scale frame images to obtain a pair of third-scale images;
An interpolated frame generation module configured to: input the two consecutive frame images, in their original order, into the trained interpolation-frame generation model, and at the same time also input the same two frame images in reversed order into the trained interpolation-frame generation model; the model outputs an interpolation frame;
Wherein the trained interpolation-frame generation model performs feature extraction, feature enhancement and feature fusion on the forward-ordered and reverse-ordered image pairs to obtain original features of a first scale;
up-sampling the original features of the first scale to obtain original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
Based on the pixel-level parameters of each target pixel of each scale's original features and the adaptive collaboration of flows (AdaCoF), performing a warping operation on the input images of the corresponding scale to obtain warped frame images of that scale;
And synthesizing the warped frame images of the plurality of scales to obtain the interpolation frame.
In still another aspect, there is provided an electronic device including:
A memory for non-transitory storage of computer readable instructions; and
A processor for executing the computer-readable instructions,
Wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect described above.
In yet another aspect, there is also provided a storage medium that non-transitorily stores computer-readable instructions, wherein the computer-readable instructions, when executed by a computer, perform the method of the first aspect described above.
In a further aspect, there is also provided a computer program product comprising a computer program for implementing the method of the first aspect described above when run on one or more processors.
The technical scheme has the following advantages or beneficial effects:
Firstly, the invention adopts two identical U-Net structures to carry out bidirectional coding on the original scale input frame so as to improve the perceptibility of the model on the input frame. Then, in the U-Net architecture, the present invention replaces the conventional convolution unit of the low-dimensional portion with a deep over-parameterized cyclic residual convolution unit. With this structure, the model is able to better capture global information and handle complex relationships between video frames. In addition, the invention introduces a depth separable residual convolution unit in the high-dimensional part to carry out lightweight processing, so that the model can maintain the performance while reducing the number of parameters. Finally, the invention introduces a multi-scale frame synthesis module. The module borrows existing texture information from adjacent input frames to synthesize more realistic and high frequency details, thereby improving the quality of interpolated frames.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a general system framework diagram according to a first embodiment of the present invention;
FIG. 2 is a block diagram of a deep over-parameterized cyclic residual convolution unit according to one embodiment of the present invention;
FIG. 3 is a block diagram of a depth separable residual convolution unit according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of the internal cascade of the attention block CBAM of the first embodiment of the present invention;
FIG. 5 is a diagram of a sub-network structure for generating pixel level parameters of a target pixel according to a first embodiment of the present invention;
FIG. 6 is a diagram of a multi-scale frame synthesis network according to a first embodiment of the present invention;
Fig. 7 is a diagram of a GridNet intra-network lateral residual block structure according to a first embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Example 1
The embodiment provides a video frame interpolation method based on a bidirectional coding structure;
a video frame interpolation method based on a bidirectional coding structure comprises the following steps:
S101: acquiring two consecutive frame images of a video to be interpolated at a first scale, namely a preceding frame and a following frame;
S102: simultaneously up-sampling both first-scale frame images to obtain a pair of second-scale images, and simultaneously down-sampling both first-scale frame images to obtain a pair of third-scale images;
S103: inputting the two consecutive frame images, in their original order, into the trained interpolation-frame generation model, and at the same time also inputting the same two frame images in reversed order into the trained interpolation-frame generation model; the model outputs an interpolation frame;
Wherein the trained interpolation-frame generation model performs feature extraction, feature enhancement and feature fusion on the forward-ordered and reverse-ordered image pairs to obtain original features of a first scale;
up-sampling the original features of the first scale to obtain original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
based on the pixel-level parameters of each target pixel of the first-scale original features and the adaptive collaboration of flows (AdaCoF), performing a warping operation on the pair of first-scale images to obtain a pair of first-scale warped frame images;
based on the pixel-level parameters of each target pixel of the second-scale original features and AdaCoF, performing a warping operation on the pair of second-scale images to obtain a pair of second-scale warped frame images;
based on the pixel-level parameters of each target pixel of the third-scale original features and AdaCoF, performing a warping operation on the pair of third-scale images to obtain a pair of third-scale warped frame images;
performing a synthesis operation on the six warped frame images (the warped pairs at the first, second and third scales) to obtain the interpolation frame.
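For implementers, the following is a minimal PyTorch-style sketch of the S101 to S103 pipeline above. The module names (feature_extractor, warp_net, synthesis_net) and the use of bilinear resampling for the scale pyramid are illustrative assumptions, not the patent's actual implementation.

```python
import torch
import torch.nn.functional as F

def interpolate_frame(model, frame0, frame1):
    """Sketch of S101-S103: build the three scales, then run the bidirectional model.

    `model` is assumed to expose the three sub-modules described in the text;
    the attribute names are placeholders, not the patent's code.
    """
    # S102: second scale (2x up-sampled) and third scale (0.5x down-sampled) image pairs
    up0 = F.interpolate(frame0, scale_factor=2.0, mode="bilinear", align_corners=False)
    up1 = F.interpolate(frame1, scale_factor=2.0, mode="bilinear", align_corners=False)
    down0 = F.interpolate(frame0, scale_factor=0.5, mode="bilinear", align_corners=False)
    down1 = F.interpolate(frame1, scale_factor=0.5, mode="bilinear", align_corners=False)

    # S103: bidirectional feature extraction (the reversed-order pair is assumed
    # to be formed inside the extractor, mirroring the text)
    feat = model.feature_extractor(frame0, frame1)   # first-scale original features

    # pixel-level parameters per scale, then AdaCoF warping of the matching image pair
    warped = model.warp_net(feat, (frame0, frame1), (up0, up1), (down0, down1))

    # multi-scale synthesis of the six warped frames into the interpolated frame
    return model.synthesis_net(*warped)
```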
Further, the training process of the trained interpolation-frame generation model comprises the following steps:
constructing a training set, wherein the training set is a continuous two-frame image of a known interpolation frame;
And inputting the training set into the interpolation frame generation model, training the model, and stopping training when the total loss function value of the model is not reduced or the iteration number exceeds the set number, so as to obtain the trained interpolation frame generation model.
It should be appreciated that the present invention uses the Vimeo-90k dataset, which consists of 51312 triplets, as the training set for the proposed model. Each triplet contains three consecutive frames with a resolution of 256 x 448. To generate high-quality interpolated frames, the first and third frames are used as training inputs and the second frame as the training target. During training, the number of samples is increased by horizontal and vertical flipping. To verify the validity of the proposed model, it was tested on multiple datasets, including UCF101, DAVIS and Middlebury. UCF101 contains a variety of complex actions and backgrounds. DAVIS contains 29 high-resolution triplets with blurred and occluded scenes. Notably, Middlebury contains two sub-datasets; the sub-dataset containing 12 sequences of high-resolution video frames is used as the test set.
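A minimal sketch of this training setup, assuming Vimeo-90k style triplets already loaded as tensors; the L1 reconstruction loss, the class names, and the optimizer usage are assumptions made for illustration only.

```python
import random
import torch
from torch.utils.data import Dataset

class TripletDataset(Dataset):
    """Each sample is three consecutive frames; frames 1 and 3 are inputs, frame 2 is the target."""
    def __init__(self, triplets):
        self.triplets = triplets  # list of (frame1, frame2, frame3) tensors, e.g. from Vimeo-90k

    def __len__(self):
        return len(self.triplets)

    def __getitem__(self, idx):
        f1, f2, f3 = self.triplets[idx]
        # augmentation by horizontal and vertical flips, as stated in the text
        if random.random() < 0.5:
            f1, f2, f3 = (torch.flip(x, dims=[-1]) for x in (f1, f2, f3))
        if random.random() < 0.5:
            f1, f2, f3 = (torch.flip(x, dims=[-2]) for x in (f1, f2, f3))
        return f1, f3, f2  # (input0, input1, target)

def train_epoch(model, loader, optimizer):
    for input0, input1, target in loader:
        pred = model(input0, input1)
        loss = torch.nn.functional.l1_loss(pred, target)  # assumed reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```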
Further, as shown in fig. 1, the trained interpolation frame generation model includes:
the device comprises a feature extraction module, a multi-scale frame distortion network and a multi-scale frame synthesis network which are connected in sequence;
The feature extraction module is used for performing feature extraction, feature enhancement and feature fusion on the forward-ordered and reverse-ordered input frame pairs to obtain original features of a first scale;
the multi-scale frame distortion network is used for carrying out up-sampling processing on the original features of the first scale to obtain the original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
based on the pixel-level parameters of each target pixel of the first-scale original features and the adaptive collaboration of flows (AdaCoF), a warping operation is performed on the pair of first-scale images to obtain a pair of first-scale warped frame images;
based on the pixel-level parameters of each target pixel of the second-scale original features and AdaCoF, a warping operation is performed on the pair of second-scale images to obtain a pair of second-scale warped frame images;
based on the pixel-level parameters of each target pixel of the third-scale original features and AdaCoF, a warping operation is performed on the pair of third-scale images to obtain a pair of third-scale warped frame images;
The multi-scale frame synthesis network is used for performing a synthesis operation on the warped frame images of the three scales to obtain the interpolation frame.
It should be appreciated that, given two consecutive frame images as input, the system generates the missing intermediate frame I_t, where 0 < t < 1, to improve the smoothness and sharpness of the video. To improve performance, the invention introduces a bidirectional coding structure consisting of two identical U-Nets that encode the input frames forward and backward so as to extract bidirectional motion features. In the decoding stage, the decoders of the two branches are connected to better capture and integrate the bidirectional motion features; the exchange of information between the forward and backward branches improves the model's ability to represent scene changes. The invention then replaces the conventional convolution units of the low-dimensional part of the U-Net architecture with depth over-parameterized cyclic residual convolution units and introduces depth separable residual convolution units in the high-dimensional part to build a lightweight U-Net architecture, enhancing the model's learning capacity without increasing computational complexity. Finally, the invention provides a multi-scale frame synthesis module, built on GridNet, for synthesizing the interpolation frame.
Further, as shown in fig. 1, the feature extraction module includes:
Reverse coding U-Net branches and forward coding U-Net branches;
The reverse coding U-Net branch comprises: a reverse encoder and a reverse decoder connected in sequence;
the reverse encoder includes: the device comprises a first depth over-parameterized cyclic residual convolution unit, a first average pooling layer, a second depth over-parameterized cyclic residual convolution unit, a second average pooling layer, a third depth over-parameterized cyclic residual convolution unit, a third average pooling layer, a first depth separable residual convolution unit, a fourth average pooling layer, a second depth separable residual convolution unit and a fifth average pooling layer which are sequentially connected.
The inverse decoder includes: the device comprises a third depth separable residual convolution unit, a first upsampling layer, a first adder, a fourth depth separable residual convolution unit, a second upsampling layer, a second adder, a fourth depth over-parameterized cyclic residual convolution unit, a third upsampling layer, a third adder, a fifth depth over-parameterized cyclic residual convolution unit, a fourth upsampling layer and a fourth adder which are connected in sequence.
Further, the output end of the second depth over-parameterized cyclic residual convolution unit is connected with the input end of the first attention mechanism module, and the output end of the first attention mechanism module is connected with the input end of the fourth adder;
The output end of the third depth over-parameterized cyclic residual convolution unit is connected with the input end of a second attention mechanism module, and the output end of the second attention mechanism module is connected with the input end of a third adder;
The output end of the first depth separable residual convolution unit is connected with the input end of a third attention mechanism module, and the output end of the third attention mechanism module is connected with the input end of the second adder;
The output end of the second depth separable residual convolution unit is connected with the input end of a fourth attention mechanism module, and the output end of the fourth attention mechanism module is connected with the input end of the first adder.
Further, the forward encoding U-Net branch comprises: a forward encoder and a forward decoder connected in sequence;
The forward encoder includes: the device comprises a sixth depth over-parameterized cyclic residual convolution unit, a sixth averaging pooling layer, a seventh depth over-parameterized cyclic residual convolution unit, a seventh averaging pooling layer, an eighth depth over-parameterized cyclic residual convolution unit, an eighth averaging pooling layer, a fifth depth separable residual convolution unit, a ninth averaging pooling layer, a sixth depth separable residual convolution unit and a tenth averaging pooling layer which are connected in sequence.
The forward decoder includes: the device comprises a seventh depth separable residual convolution unit, a fifth upsampling layer, a fifth adder, an eighth depth separable residual convolution unit, a sixth upsampling layer, a sixth adder, a ninth depth over-parameterized cyclic residual convolution unit, a seventh upsampling layer, a seventh adder, a tenth depth over-parameterized cyclic residual convolution unit, an eighth upsampling layer and an eighth adder which are connected in sequence.
Further, the output end of the seventh depth over-parameterized cyclic residual convolution unit is connected with the input end of a fifth attention mechanism module, and the output end of the fifth attention mechanism module is connected with the input end of an eighth adder;
The output end of the eighth depth over-parameterized cyclic residual convolution unit is connected with the input end of a sixth attention mechanism module, and the output end of the sixth attention mechanism module is connected with the input end of a seventh adder;
The output end of the fifth depth separable residual convolution unit is connected with the input end of a seventh attention mechanism module, and the output end of the seventh attention mechanism module is connected with the input end of a sixth adder;
the output end of the sixth depth separable residual convolution unit is connected with the input end of the eighth attention mechanism module, and the output end of the eighth attention mechanism module is connected with the input end of the fifth adder.
Further, the output end of the third depth separable residual convolution unit is connected with the input end of the fifth adder; the output end of the seventh depth separable residual convolution unit is connected with the input end of the first adder;
The output end of the fourth depth separable residual convolution unit is connected with the input end of the sixth adder; the output end of the eighth depth separable residual convolution unit is connected with the input end of the second adder;
the output end of the fourth depth over-parameterized cyclic residual convolution unit is connected with the input end of the seventh adder; the output end of the ninth depth over-parameterized cyclic residual convolution unit is connected with the input end of the third adder;
The output end of the fifth deep over-parameterized cyclic residual convolution unit is connected with the input end of the eighth adder; the output end of the tenth depth over-parameterized cyclic residual convolution unit is connected with the input end of the fourth adder.
Further, the reverse encoder is used to encode the two consecutive frames (taken in reversed order) to obtain a plurality of reverse coding features of different scales; the reverse decoder is used to decode the reverse coding features to obtain a plurality of reverse decoding features of different scales;
the first, second, third and fourth attention mechanism modules are used for realizing cascade connection of reverse coding features of different scales and reverse decoding features of corresponding scales;
The forward encoder is used to encode the two consecutive frames (taken in their original order) to obtain a plurality of forward coding features of different scales; the forward decoder is used to decode the forward coding features to obtain a plurality of forward decoding features of different scales;
fifth, sixth, seventh and eighth attention mechanism modules are used for realizing cascade connection of forward coding features of different scales and forward decoding features of corresponding scales;
fusing reverse decoding features of different scales with forward decoding features of corresponding scales;
the forward decoding features of different scales are fused with the backward decoding features of corresponding scales;
And finally, a ninth adder sums the output of the fourth adder and the output of the eighth adder to obtain the original characteristic.
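To make the bidirectional wiring concrete, the sketch below shows two identically built U-Net branches fed with the frame pair in opposite orders and summed at the end (the ninth adder). The U-Net factory argument and the omission of the cross-branch decoder skip connections are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class BidirectionalExtractor(nn.Module):
    """Sketch: the forward branch sees (frame0, frame1), the reverse branch sees (frame1, frame0);
    the two decoder outputs are summed to give the fused original features."""
    def __init__(self, make_unet):
        super().__init__()
        self.forward_unet = make_unet()   # forward-encoding U-Net branch
        self.backward_unet = make_unet()  # reverse-encoding U-Net branch (identical architecture)

    def forward(self, frame0, frame1):
        x_fwd = torch.cat([frame0, frame1], dim=1)
        x_bwd = torch.cat([frame1, frame0], dim=1)
        # each branch returns its decoder output; the cross-branch exchange of intermediate
        # decoder features (the skip connections between the two decoders) is omitted here
        f_forward = self.forward_unet(x_fwd)
        f_backward = self.backward_unet(x_bwd)
        return f_forward + f_backward     # fused original features ("ninth adder")
```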
Further, as shown in fig. 1, the original-scale input frames are up-sampled and down-sampled to obtain double-scale and half-scale input images, respectively, and the obtained images are used for the subsequent frame warping operations.
Further, the original-scale input frames are fed to the feature extraction module, and bidirectional feature extraction is performed by two optimized U-Nets. The input of the forward-encoding U-Net branch is the two consecutive frame images in their original order, and the input of the reverse-encoding U-Net branch is the same two frame images in reversed order. The conventional convolution units of the low-dimensional part of the U-Net architecture are replaced by depth over-parameterized cyclic residual convolution units, and depth separable residual convolution units are introduced in the high-dimensional part to build a lightweight U-Net architecture. Between the encoder and the decoder, a cascaded channel attention mechanism and spatial attention mechanism replace the conventional skip connections; they cascade coding features and decoding features of the same dimension to perform feature enhancement.
Two identical U-Net branches are used to extract bidirectional motion features from the input frames. When extracting features, the two decoders are connected using skip connections: the decoding features of a given layer of the forward-branch decoder are added to the feature map of the corresponding layer of the reverse-branch decoder and spliced in the channel dimension, and vice versa. After the connection, the two decoders continue the up-sampling and convolution operations on their respective branches, gradually generating bidirectional motion features. Finally, the bidirectional motion features are fused to obtain the original features used as the input of the subsequent sub-networks.
The preceding and following original images are up-sampled and down-sampled respectively to obtain double-scale and half-scale images for the subsequent frame warping operations. The original-scale input images are fed into the two U-Net architectures to extract bidirectional motion features, where the input of the forward-encoding U-Net branch is the two consecutive frame images in their original order and the input of the reverse-encoding U-Net branch is the same two frame images in reversed order. The conventional convolution units of the low-dimensional part of the U-Net architecture are replaced by depth over-parameterized cyclic residual convolution units, and depth separable residual convolution units are introduced in the high-dimensional part to build a lightweight U-Net architecture. Between coding features and decoding features of the same dimension in the U-Net architecture, cascaded channel attention and spatial attention mechanisms extract a weight map from the coding features and perform feature enhancement by pixel-level multiplication.
Extracting bidirectional motion characteristics by using two identical U-Net architectures, and connecting two decoders through jump connection; the decoding features of a layer of the forward branch decoder are added to the feature map of the corresponding layer of the backward branch decoder, and splicing is performed in the channel dimension, and vice versa. After the splicing is completed, the two decoders continue to perform up-sampling and convolution operations on the respective branches, and bidirectional motion characteristics are gradually generated; and fusing the bidirectional motion characteristics to obtain original characteristics.
In the traditional U-Net architecture, the depth over-parameterized cyclic residual convolution units introduced to replace the conventional convolution units of the low-dimensional part enhance the model's learning capacity without increasing computational complexity. In addition, depth separable residual convolution units are introduced in the high-dimensional part to build a lightweight U-Net architecture, which helps the model maintain its original performance while reducing the number of parameters.
To improve the performance of the model and minimize information loss, the present invention utilizes an attention mechanism module (CBAM) instead of the traditional jump connection in the U-Net architecture to establish a relationship between the encoder and decoder. Fig. 4 shows a detailed architecture of CBAM, which contains two sub-modules of a Channel Attention Module (CAM) and a Spatial Attention Module (SAM).
The Channel Attention Module (CAM) includes: parallel average pooling module and maximum pooling module, and sigmoid activation function; the channel information in the coding feature F is parallelly aggregated through an average pooling module and a maximum pooling module, and then the aggregated channel information is subjected to a sigmoid activation function to obtain a channel attention feature map; an intermediate feature map F' obtained by pixel-wise multiplying the channel attention feature map F c with the encoding feature F is used as an input to a Spatial Attention Module (SAM);
A Spatial Attention Module (SAM) comprising: parallel average pooling layer, maximum pooling layer, channel splicing layer, convolution layer and sigmoid function; the average pooling layer and the maximum pooling layer respectively aggregate the statistical information of one channel, then two feature maps are obtained based on the channel splicing layer, finally, the channel is reduced to one channel through the convolution layer, a spatial attention feature map F s is generated by using a sigmoid function, and the spatial attention feature map F s and the output F' of the Channel Attention Module (CAM) are subjected to pixel-by-pixel multiplication operation to obtain enhanced coding features.
Wherein the two stages of the Convolutional Block Attention Module (CBAM) can be expressed as:

M_c(F) = σ( W_1(W_0(AvgPool(F))) + W_1(W_0(MaxPool(F))) ),  F′ = M_c(F) ⊗ F

M_s(F′) = σ( W_{7×7}([AvgPool(F′); MaxPool(F′)]) ),  F″ = M_s(F′) ⊗ F′

Where M_c and M_s denote the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), respectively, and CBAM is the entire convolutional block attention module. W_0 and W_1 denote the weights of the two 1×1 convolutions of the first stage, which are shared between the average-pooling and max-pooling branches. W_{7×7} denotes the weight of the 7×7 convolution layer in the spatial attention module, F and F′ denote the inputs to CAM and SAM, σ(·) denotes the sigmoid function, [·;·] denotes channel-wise concatenation, + and ⊗ denote element-wise addition and multiplication, and AvgPool(·) and MaxPool(·) denote average pooling and max pooling, respectively.
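A compact PyTorch-style sketch of one CBAM stage matching the formulas above; the reduction ratio of 16 is an assumed hyper-parameter, and plain nn.Conv2d layers stand in for W_0, W_1 and W_{7×7}.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # channel attention: shared 1x1 convs (W0, W1) applied to avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # spatial attention: 7x7 conv over the concatenated channel-wise avg and max maps
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):
        # M_c(F) = sigmoid(W1 W0 AvgPool(F) + W1 W0 MaxPool(F)), then F' = M_c(F) * F
        avg = torch.mean(f, dim=(2, 3), keepdim=True)
        mx = torch.amax(f, dim=(2, 3), keepdim=True)
        f_prime = torch.sigmoid(self.mlp(avg) + self.mlp(mx)) * f
        # M_s(F') = sigmoid(conv7x7([AvgPool(F'); MaxPool(F')])) along the channel axis
        avg_s = torch.mean(f_prime, dim=1, keepdim=True)
        max_s = torch.amax(f_prime, dim=1, keepdim=True)
        return torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1))) * f_prime
```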
Given two consecutive input frames, up-sampling and down-sampling operations are first performed to obtain input frames at double and half scale, which will be used in the subsequent frame warping process. Meanwhile, the input frames at the original scale are kept as the input of the model. In unidirectional encoding, information propagates in only one direction, which may cause loss of information or an incomplete feature representation. Therefore, a bidirectional encoding structure is designed to perform feature extraction and capture the relevant information in the input frames more completely. Specifically, two identical U-Net architectures are used to bidirectionally encode the input frames and extract the bidirectional motion features. The input of the forward-encoding branch is the two consecutive frame images in their original order, denoted here as I_0 and I_1, while the input of the reverse-encoding branch is the same two frames in reversed order. The process can be expressed mathematically as (2):

F_forward = U(I_0, I_1),  F_backward = U(I_1, I_0)    (2)

In (2), F_forward denotes the forward feature, F_backward denotes the reverse feature, and U denotes the U-Net architecture.
In extracting features, the present invention uses a jump connection to connect two decoders. In particular, the invention adds the decoding characteristics of a layer of the forward branch decoder to the characteristic diagram of the corresponding layer of the backward branch decoder, and performs splicing in the channel dimension, and vice versa. After the connection, the two decoders continue up-sampling and convolution operations on the respective branches, gradually generating bi-directional motion features. Finally, the invention merges the bi-directional motion features to obtain the original features for the input of the subsequent subnetwork.
Further, as shown in fig. 2, the internal structures of the first, second, third, fourth, fifth, sixth, seventh, eighth, ninth and tenth depth over-parameterized cyclic residual convolution units are the same, and the first depth over-parameterized cyclic residual convolution unit includes:
The first depth super-parameterized convolution layer DO-Conv, the second depth super-parameterized convolution layer DO-Conv, the first linear rectification function layer, the third depth super-parameterized convolution layer DO-Conv, the second linear rectification function layer and the tenth adder are sequentially connected;
The input end of the first depth super-parameterized convolution layer DO-Conv is also connected with a tenth adder through a fourth depth super-parameterized convolution layer DO-Conv;
the input end of the first depth super-parameterized convolution layer DO-Conv is used as the input end of the first depth super-parameterized cyclic residual convolution unit;
the output of the tenth adder is used as the output of the first depth over-parameterized cyclic residual convolution unit.
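A sketch of the unit enumerated above. Because a depthwise over-parameterized convolution (DO-Conv) folds into an ordinary convolution at inference time, plain nn.Conv2d layers stand in for the DO-Conv layers here; the 3x3 kernel size and channel widths are assumptions.

```python
import torch.nn as nn

class DOConvCyclicResidualUnit(nn.Module):
    """Main path: DO-Conv -> DO-Conv -> ReLU -> DO-Conv -> ReLU; shortcut: DO-Conv; outputs summed."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        conv = lambda ci, co: nn.Conv2d(ci, co, 3, padding=1)  # placeholder for a DO-Conv layer
        self.main = nn.Sequential(
            conv(in_ch, out_ch),
            conv(out_ch, out_ch),
            nn.ReLU(inplace=True),
            conv(out_ch, out_ch),
            nn.ReLU(inplace=True),
        )
        self.shortcut = conv(in_ch, out_ch)  # the fourth DO-Conv feeding the adder

    def forward(self, x):
        return self.main(x) + self.shortcut(x)  # the tenth adder
```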
Further, as shown in fig. 3, the internal structures of the first, second, third, fourth, fifth, sixth, seventh and eighth depth separable residual convolution units are identical, the first depth separable residual convolution unit comprising:
The first depth separable convolution layer, the third linear rectification function layer, the second depth separable convolution layer, the fourth linear rectification function layer, the fifth depth super-parameterized convolution layer DO-Conv, the fifth linear rectification function layer and the eleventh adder are sequentially connected;
The input end of the first depth separable convolution layer is also connected with an eleventh adder through a sixth depth super parameterized convolution layer DO-Conv;
the input end of the first depth separable convolution layer is the input end of the first depth separable residual convolution unit; the output of the eleventh adder is the output of the first depth separable residual convolution unit.
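A corresponding sketch of the depth separable residual convolution unit; again a plain convolution stands in for DO-Conv, and the 3x3 kernel size is an assumption.

```python
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    """Depthwise convolution followed by a pointwise convolution (standard depthwise-separable conv)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
    )

class SeparableResidualUnit(nn.Module):
    """Main path: sep-conv -> ReLU -> sep-conv -> ReLU -> DO-Conv -> ReLU; shortcut: DO-Conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            separable_conv(in_ch, out_ch), nn.ReLU(inplace=True),
            separable_conv(out_ch, out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),  # stand-in for DO-Conv
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 3, padding=1)               # stand-in for DO-Conv

    def forward(self, x):
        return self.main(x) + self.shortcut(x)  # the eleventh adder
```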
Further, as shown in fig. 4, the internal structures of the first, second, third, fourth, fifth, sixth, seventh and eighth attention mechanism modules are identical, and the first attention mechanism module includes:
The first attention mechanism module input end, the first multiplier, the second multiplier and the first attention mechanism module output end are sequentially connected;
the input end of the first attention mechanism module is respectively connected with the input end of the first branch and the input end of the second branch;
The first branch comprises: the first average pooling layer, the first convolution layer, the sixth linear rectification function layer and the second convolution layer are sequentially connected; the input end of the first averaging pooling layer is the input end of the first branch;
The second branch comprises: the first maximum pooling layer, the third convolution layer, the seventh linear rectification function layer and the fourth convolution layer are sequentially connected; the input end of the first maximum pooling layer is the input end of the second branch;
The output end of the second convolution layer and the output end of the fourth convolution layer are both connected with the input end of a twelfth adder, the output end of the twelfth adder is connected with the input end of the first sigmoid activation function layer, and the output end of the first sigmoid activation function layer is connected with the input end of the first multiplier;
The output end of the first multiplier is respectively connected with the input end of the second average pooling layer and the input end of the second maximum pooling layer;
The output end of the second average pooling layer and the output end of the second maximum pooling layer are both connected with the input end of the channel splicing unit, the output end of the channel splicing unit is connected with the input end of the fifth convolution layer, the output end of the fifth convolution layer is connected with the input end of the second sigmoid activation function layer, and the output end of the second sigmoid activation function layer is connected with the input end of the second multiplier. The channel splicing unit is used for splicing the two input characteristic graphs on the corresponding channels.
Further, as shown in fig. 1, the multi-scale frame warp network includes three parallel sub-networks: a first subnetwork, a second subnetwork, and a third subnetwork.
Further, as shown in fig. 5, the internal structures of the first sub-network, the second sub-network, and the third sub-network are identical, and the first sub-network includes:
a third branch, a fourth branch, a fifth branch, a sixth branch, a seventh branch, and an eighth branch;
The third branch is identical to the sixth branch in internal structure, and comprises: the seventh-depth super-parameterized convolution layer DO-Conv, the eighth linear rectification function layer, the eighth-depth super-parameterized convolution layer DO-Conv, the ninth linear rectification function layer, the ninth-depth super-parameterized convolution layer DO-Conv, the tenth linear rectification function layer, the ninth upsampling layer, the tenth-depth super-parameterized convolution layer DO-Conv and the first softmax activation function layer are sequentially connected.
The internal structures of the fourth, fifth, seventh and eighth branches are the same, and the fourth branch comprises: the eleventh depth super-parameterized convolution layer DO-Conv, the eleventh linear rectification function layer, the twelfth depth super-parameterized convolution layer DO-Conv, the twelfth linear rectification function layer, the thirteenth depth super-parameterized convolution layer DO-Conv, the thirteenth linear rectification function layer, the tenth upsampling layer and the fourteenth depth super-parameterized convolution layer DO-Conv are sequentially connected.
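The sketch below illustrates one such sub-network: two softmax-normalised branches produce the per-pixel kernel-weight maps (one per input frame) and four plain branches produce the per-pixel offset maps. The channel width, the kernel size K, and the factor-2 up-sampling are assumed hyper-parameters, and plain convolutions again stand in for DO-Conv.

```python
import torch.nn as nn

def head(in_ch, out_ch, softmax):
    layers = [
        nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),  # factor assumed
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
    ]
    if softmax:
        layers.append(nn.Softmax(dim=1))  # normalise the K*K kernel weights per pixel
    return nn.Sequential(*layers)

class PixelParamSubNet(nn.Module):
    """Predicts, for each of the two input frames, AdaCoF-style pixel-level parameters."""
    def __init__(self, feat_ch=64, kernel_size=5):
        super().__init__()
        kk = kernel_size * kernel_size
        self.weight0, self.weight1 = head(feat_ch, kk, True), head(feat_ch, kk, True)
        self.alpha0, self.alpha1 = head(feat_ch, kk, False), head(feat_ch, kk, False)  # offset maps
        self.beta0, self.beta1 = head(feat_ch, kk, False), head(feat_ch, kk, False)    # offset maps

    def forward(self, feat):
        return {
            "W": (self.weight0(feat), self.weight1(feat)),
            "alpha": (self.alpha0(feat), self.alpha1(feat)),
            "beta": (self.beta0(feat), self.beta1(feat)),
        }
```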
Further, the multi-scale frame distortion network performs up-sampling and down-sampling processing on the original features, and performs pixel level parameter extraction on the three-scale original features through three groups of sub-networks; then, by means of the adaptive collaborative flow AdaCoF, frame warping operation is performed on the three-scale input frames according to the extracted pixel-level parameters, and finally three-scale warped frames are obtained.
Further, the multi-scale frame distortion network up-samples and down-samples the original features to obtain original features at double scale and half scale, respectively. From the three scales of original features, three groups of sub-networks estimate, for each target pixel of each scale's output map, the offset vectors in the horizontal and vertical directions and the weights of the convolution kernel at that pixel. The adaptive collaboration of flows is then adopted: the offset vectors expand the motion sampling range, and the weights are no longer shared across pixels, so that the input images of the three scales are warped into output images through pixel-level operations, yielding warped frames at the three scales.
Pixel-level parameters of each target pixel on the input frames of the different scales are estimated by the three groups of sub-networks, and the warping operation is performed on the input frames following the adaptive collaboration of flows, finally obtaining the warped frames of the three scales. The estimated offset vector of each target pixel is used to expand the sampling positions, and the weights are no longer shared between pixels. The operation can be expressed mathematically as:

Î(i, j) = Σ_{m=0}^{K-1} Σ_{n=0}^{K-1} W_{m,n}(i, j) · I(i + dm + α_{m,n}(i, j), j + dn + β_{m,n}(i, j))

Wherein (i, j) is the target pixel, K is the size of the convolution kernel, W_{m,n}(i, j) is the weight of the (m, n)-th kernel element at the target pixel (i, j), (α_{m,n}, β_{m,n}) is an offset vector that can point to any position beyond the regular grid points, d is the dilation value, I is the input frame of the corresponding scale, and Î is the warped output frame.
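A reference (non-optimised) sketch of this warping operation; production implementations typically use a dedicated CUDA kernel, whereas this version simply loops over the K x K kernel positions and reuses grid_sample. The axis assignment of alpha (rows) and beta (columns) follows the equation above.

```python
import torch
import torch.nn.functional as F

def adacof_warp(img, weights, alpha, beta, kernel_size=5, dilation=1):
    """AdaCoF-style warp.

    img:     (B, C, H, W) input frame
    weights: (B, K*K, H, W) softmax-normalised per-pixel kernel weights
    alpha:   (B, K*K, H, W) row offsets; beta: (B, K*K, H, W) column offsets
    """
    b, c, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    out = torch.zeros_like(img)
    k = kernel_size
    for idx in range(k * k):
        m, n = idx // k, idx % k
        # sample position: regular dilated grid point plus the learned offset
        sample_y = ys + dilation * m + alpha[:, idx]      # (B, H, W)
        sample_x = xs + dilation * n + beta[:, idx]
        # normalise to [-1, 1] for grid_sample (grid expects (x, y) order)
        grid = torch.stack([2 * sample_x / (w - 1) - 1,
                            2 * sample_y / (h - 1) - 1], dim=-1)
        sampled = F.grid_sample(img, grid, mode="bilinear",
                                padding_mode="border", align_corners=True)
        out = out + weights[:, idx:idx + 1] * sampled     # weighted accumulation per kernel element
    return out
```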
Further, as shown in fig. 6, the multi-scale frame synthesis network includes GridNet; gridNet includes: three parallel branches: ninth, tenth and eleventh branches;
the ninth branch includes: the first residual block, the second residual block, the third residual block, the fourth residual block, the fifth residual block and the sixth residual block are sequentially connected;
the tenth branch includes: a seventh residual block, an eighth residual block, a ninth residual block, a tenth residual block, an eleventh residual block, and a twelfth residual block connected in sequence;
the eleventh branch comprises: a thirteenth residual block, a fourteenth residual block, a fifteenth residual block, a sixteenth residual block, a seventeenth residual block, and an eighteenth residual block connected in sequence;
The output end of the first residual block is connected with the input end of the first downsampling layer, and the output end of the first downsampling layer is connected with the input end of the seventh residual block; the output end of the seventh residual block is connected with the input end of the second downsampling layer, and the output end of the second downsampling layer is connected with the input end of the thirteenth residual block;
The output end of the second residual block is connected with the input end of a third downsampling layer, and the output end of the third downsampling layer is connected with the input end of an eighth residual block; the output end of the eighth residual block is connected with the input end of a fourth downsampling layer, and the output end of the fourth downsampling layer is connected with the input end of the fourteenth residual block;
The output end of the third residual block is connected with the input end of a fifth downsampling layer, and the output end of the fifth downsampling layer is connected with the input end of a ninth residual block; the output end of the ninth residual block is connected with the input end of a sixth downsampling layer, and the output end of the sixth downsampling layer is connected with the input end of the fifteenth residual block;
The input end of the fourth residual error block is also connected with the output end of an eleventh upsampling layer, the input end of the eleventh upsampling layer is also connected with the output end of a tenth residual error block, the input end of the tenth residual error block is also connected with the output end of a twelfth upsampling layer, and the input end of the twelfth upsampling layer is also connected with the input end of a sixteenth residual error block;
The input end of the fifth residual block is also connected with the output end of the thirteenth upsampling layer, the input end of the thirteenth upsampling layer is also connected with the output end of the eleventh residual block, the input end of the eleventh residual block is also connected with the output end of the fourteenth upsampling layer, and the input end of the fourteenth upsampling layer is also connected with the input end of the seventeenth residual block;
The input end of the sixth residual block is also connected with the output end of a fifteenth upsampling layer, the input end of the fifteenth upsampling layer is also connected with the output end of the twelfth residual block, the input end of the twelfth residual block is also connected with the output end of the sixteenth upsampling layer, and the input end of the sixteenth upsampling layer is also connected with the input end of the eighteenth residual block;
the output end of the sixth residual block is connected with the input end of a seventh downsampling layer, and the output end of the seventh downsampling layer is connected with the input end of the thirteenth adder;
The output end of the eighteenth residual error block is connected with the input end of the seventeenth upsampling layer, and the output end of the seventeenth upsampling layer is connected with the input end of the thirteenth adder;
The output end of the twelfth residual block is connected with the input end of a thirteenth adder, and the output end of the thirteenth adder is the output end of the multi-scale frame synthesis network;
The input end of the first residual block, the input end of the seventh residual block and the input end of the thirteenth residual block are used as input ends of a multi-scale frame synthesis network;
The input end of the first residual block is used for inputting the two warped frame images of the second scale; the input end of the seventh residual block is used for inputting the two warped frame images of the first scale; the input end of the thirteenth residual block is used for inputting the two warped frame images of the third scale.
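For illustration only, the following is a minimal PyTorch-style sketch of the fusion performed at the thirteenth adder described above; the function name, tensor shapes, and the use of bilinear resizing as a stand-in for the seventh downsampling layer and the seventeenth upsampling layer are assumptions, not the patented implementation.

```python
import torch
import torch.nn.functional as F

def fuse_row_outputs(top: torch.Tensor, mid: torch.Tensor, bottom: torch.Tensor) -> torch.Tensor:
    """Sum the three GridNet row outputs at the first scale.

    top    -- output of the ninth branch (second scale, higher resolution)
    mid    -- output of the tenth branch (first scale)
    bottom -- output of the eleventh branch (third scale, lower resolution)
    """
    h, w = mid.shape[-2:]
    # seventh downsampling layer (bilinear resize assumed here)
    top_down = F.interpolate(top, size=(h, w), mode='bilinear', align_corners=False)
    # seventeenth upsampling layer (bilinear resize assumed here)
    bottom_up = F.interpolate(bottom, size=(h, w), mode='bilinear', align_corners=False)
    # thirteenth adder
    return top_down + mid + bottom_up
```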
Further, as shown in fig. 6, the internal structures of the first through eighteenth residual blocks are the same; the first residual block includes:
the first PReLU activation function layer, the sixth convolution layer, the seventh convolution layer, the third average pooling layer, the eighth convolution layer, the second PReLU activation function layer, the ninth convolution layer, the third multiplier and the fourteenth adder, which are sequentially connected;
The input end of the first PReLU activation function layer is also connected with the input end of the fourteenth adder, and the input end of the third average pooling layer is also connected with the input end of the third multiplier.
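For illustration only, a minimal PyTorch-style sketch of this lateral residual block follows; the class name, the channel count, the kernel sizes, and the sigmoid used to turn the attention branch into channel weights are assumptions added for readability, not taken from the description.

```python
import torch
import torch.nn as nn

class LateralResidualBlock(nn.Module):
    """Sketch of the lateral residual block: PReLU -> conv -> conv, followed by a
    channel-attention branch (global average pool -> conv -> PReLU -> conv) whose
    weights multiply the features, plus an identity skip into the final adder."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.act1 = nn.PReLU()                                    # first PReLU activation function layer
        self.conv6 = nn.Conv2d(channels, channels, 3, padding=1)  # sixth convolution layer
        self.conv7 = nn.Conv2d(channels, channels, 3, padding=1)  # seventh convolution layer
        self.pool3 = nn.AdaptiveAvgPool2d(1)                      # third average pooling layer (global)
        self.conv8 = nn.Conv2d(channels, channels // 4, 1)        # eighth convolution layer
        self.act2 = nn.PReLU()                                    # second PReLU activation function layer
        self.conv9 = nn.Conv2d(channels // 4, channels, 1)        # ninth convolution layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv7(self.conv6(self.act1(x)))
        # channel attention weights (sigmoid is an assumption)
        w = torch.sigmoid(self.conv9(self.act2(self.conv8(self.pool3(y)))))
        return x + y * w  # third multiplier, then fourteenth adder (identity skip)
```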
Further, the warped frames of the three scales are used as the inputs to different rows of GridNet, so as to generate the intermediate frame. From top to bottom, the inputs and feature maps of the rows have progressively smaller resolutions.
The three-scale warped frames are fed into different rows of the multi-scale frame synthesis network to synthesize the intermediate frame; the output of each row is obtained, the output of the ninth branch is downsampled, the output of the eleventh branch is upsampled, and the three outputs are then combined to generate the final interpolation result.
The multi-scale frame synthesis module comprises the multi-scale frame synthesis network. The invention uses GridNet, comprising three rows and six columns, as the synthesis network, whose input consists of warped frames of different scales. From top to bottom, the inputs and feature maps of the rows have progressively smaller resolutions. In the synthesis network, PReLU activations and residual structures are added to improve the results. In addition, the invention introduces a channel attention mechanism, so that the synthesis network can attend to important channel information. During upsampling, the invention employs bilinear upsampling to avoid the creation of checkerboard artifacts. Fig. 6 shows the overall framework of the multi-scale frame synthesis network, and fig. 7 shows the structure of the lateral residual block inside GridNet.
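For illustration only, a minimal PyTorch-style sketch of the upsampling choice mentioned above; the helper name and channel arguments are hypothetical. Bilinear interpolation followed by a convolution is used instead of a transposed convolution, which is a common way to avoid checkerboard artifacts.

```python
import torch.nn as nn

def bilinear_upsample_block(in_channels: int, out_channels: int) -> nn.Sequential:
    """Upsample by 2x with bilinear interpolation, then refine with a convolution,
    rather than using a transposed convolution (which can produce checkerboard artifacts)."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.PReLU(),
    )
```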
Example two
The embodiment provides a video frame interpolation system based on a bidirectional coding structure;
A video frame interpolation system based on a bi-directional coding structure, comprising:
An acquisition module configured to: acquire two continuous frame images of a first scale of the video to be interpolated;
A preprocessing module configured to: perform an up-sampling operation on the two first-scale images simultaneously to obtain two images of a second scale, and perform a down-sampling operation on the two first-scale images simultaneously to obtain two images of a third scale;
An interpolated frame generation module configured to: input the two continuous frame images into the trained interpolation frame generation model; at the same time, also input the two continuous frame images in reversed order into the trained interpolation frame generation model; the model outputs an interpolation frame;
Wherein the trained interpolation frame generation model performs feature extraction, feature enhancement and feature fusion processing on the two input frame images and on the two frame images in reversed order, to obtain original features of the first scale;
up-sampling the original features of the first scale to obtain original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
Based on the pixel-level parameters of each target pixel on the original features of each scale and the adaptive collaborative flow AdaCoF, a warping operation is performed on the input image of the corresponding scale to obtain a warped frame image of the corresponding scale;
And the warped frame images of the plurality of scales are synthesized to obtain the interpolation frame.
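For illustration only, the following is a simplified PyTorch-style sketch of AdaCoF-style warping from per-pixel parameters (kernel weights plus offsets); the function name, argument layout, kernel size, and the use of grid_sample for sub-pixel gathering are assumptions added for readability, not the patented implementation.

```python
import torch
import torch.nn.functional as F

def adacof_warp(img, weights, alpha, beta, kernel_size=5, dilation=1):
    """Blend kernel_size**2 offset samples per target pixel.

    img:     (B, C, H, W) input frame of one scale
    weights: (B, K*K, H, W) kernel weights (assumed softmax-normalised)
    alpha:   (B, K*K, H, W) vertical per-tap offsets
    beta:    (B, K*K, H, W) horizontal per-tap offsets
    """
    b, c, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img.dtype, device=img.device),
        torch.arange(w, dtype=img.dtype, device=img.device),
        indexing='ij')
    out = torch.zeros_like(img)
    r = kernel_size // 2
    k = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            sy = ys + dy * dilation + alpha[:, k]           # sample rows, (B, H, W)
            sx = xs + dx * dilation + beta[:, k]            # sample cols, (B, H, W)
            grid = torch.stack((2 * sx / (w - 1) - 1,       # normalise to [-1, 1]
                                2 * sy / (h - 1) - 1), dim=-1)
            tap = F.grid_sample(img, grid, mode='bilinear',
                                padding_mode='border', align_corners=True)
            out = out + weights[:, k:k + 1] * tap           # weighted blend of the tap
            k += 1
    return out
```

In this reading, a separate set of weights and offsets would be predicted by the corresponding sub-network for each scale and for each of the two input frames, giving the warped frame images of the corresponding scales described above.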
Here, the acquisition module, the preprocessing module, and the interpolation frame generation module correspond to steps S101 to S103 in the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should be noted that the modules described above may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.
Each of the foregoing embodiments is described with its own emphasis; for details not elaborated in one embodiment, reference may be made to the related description of another embodiment.
The proposed system may be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division into the modules described above is only a division by logical function, and other divisions are possible in practice: multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
Example III
The embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.
The method in the first embodiment may be embodied directly as being executed by a hardware processor, or as being executed by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example IV
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of embodiment one.
The above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (6)
1. The video frame interpolation method based on the bidirectional coding structure is characterized by comprising the following steps:
acquiring two continuous frame images of a first scale of a video to be interpolated;
performing an up-sampling operation on the two first-scale images simultaneously to obtain two images of a second scale; performing a down-sampling operation on the two first-scale images simultaneously to obtain two images of a third scale;
inputting the two continuous frame images into the trained interpolation frame generation model; at the same time, also inputting the two continuous frame images in reversed order into the trained interpolation frame generation model; and outputting an interpolation frame by the model;
wherein the trained interpolation frame generation model performs feature extraction, feature enhancement and feature fusion processing on the two input frame images and on the two frame images in reversed order to obtain original features of the first scale;
up-sampling the original features of the first scale to obtain original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
Based on the pixel-level parameters of each target pixel on the original features of each scale and the adaptive collaborative flow AdaCoF, performing a warping operation on the input image of the corresponding scale to obtain a warped frame image of the corresponding scale; and synthesizing the warped frame images of the plurality of scales to obtain the interpolation frame;
The trained interpolation frame generation model comprises:
a feature extraction module, a multi-scale frame distortion network and a multi-scale frame synthesis network which are connected in sequence;
The feature extraction module is used for performing feature extraction, feature enhancement and feature fusion processing on the two input frame images and on the two frame images in reversed order to obtain the original features of the first scale; the feature extraction module specifically comprises:
Reverse coding U-Net branches and forward coding U-Net branches;
The reverse coding U-Net branch comprises: a reverse encoder and a reverse decoder connected in sequence;
The reverse encoder includes: a first depth over-parameterized cyclic residual convolution unit, a first average pooling layer, a second depth over-parameterized cyclic residual convolution unit, a second average pooling layer, a third depth over-parameterized cyclic residual convolution unit, a third average pooling layer, a first depth separable residual convolution unit, a fourth average pooling layer, a second depth separable residual convolution unit and a fifth average pooling layer which are connected in sequence;
The reverse decoder includes: a third depth separable residual convolution unit, a first upsampling layer, a first adder, a fourth depth separable residual convolution unit, a second upsampling layer, a second adder, a fourth depth over-parameterized cyclic residual convolution unit, a third upsampling layer, a third adder, a fifth depth over-parameterized cyclic residual convolution unit, a fourth upsampling layer and a fourth adder which are connected in sequence;
The output end of the second depth over-parameterized cyclic residual convolution unit is connected with the input end of the first attention mechanism module, and the output end of the first attention mechanism module is connected with the input end of the fourth adder;
The output end of the third depth over-parameterized cyclic residual convolution unit is connected with the input end of the second attention mechanism module, and the output end of the second attention mechanism module is connected with the input end of the third adder;
The output end of the first depth separable residual convolution unit is connected with the input end of a third attention mechanism module, and the output end of the third attention mechanism module is connected with the input end of the second adder;
The output end of the second depth separable residual convolution unit is connected with the input end of a fourth attention mechanism module, and the output end of the fourth attention mechanism module is connected with the input end of the first adder;
the forward encoded U-Net branch comprises: a forward encoder and a forward decoder connected in sequence;
the forward encoder includes: a sixth depth over-parameterized cyclic residual convolution unit, a sixth average pooling layer, a seventh depth over-parameterized cyclic residual convolution unit, a seventh average pooling layer, an eighth depth over-parameterized cyclic residual convolution unit, an eighth average pooling layer, a fifth depth separable residual convolution unit, a ninth average pooling layer, a sixth depth separable residual convolution unit and a tenth average pooling layer which are connected in sequence;
The forward decoder includes: a seventh depth separable residual convolution unit, a fifth upsampling layer, a fifth adder, an eighth depth separable residual convolution unit, a sixth upsampling layer, a sixth adder, a ninth depth over-parameterized cyclic residual convolution unit, a seventh upsampling layer, a seventh adder, a tenth depth over-parameterized cyclic residual convolution unit, an eighth upsampling layer and an eighth adder which are connected in sequence;
The output end of the seventh depth over-parameterized cyclic residual convolution unit is connected with the input end of the fifth attention mechanism module, and the output end of the fifth attention mechanism module is connected with the input end of the eighth adder;
The output end of the eighth depth over-parameterized cyclic residual convolution unit is connected with the input end of the sixth attention mechanism module, and the output end of the sixth attention mechanism module is connected with the input end of the seventh adder;
The output end of the fifth depth separable residual convolution unit is connected with the input end of a seventh attention mechanism module, and the output end of the seventh attention mechanism module is connected with the input end of a sixth adder;
The output end of the sixth depth separable residual convolution unit is connected with the input end of the eighth attention mechanism module, and the output end of the eighth attention mechanism module is connected with the input end of the fifth adder;
the reverse decoding features of different scales are fused with the forward decoding features of the corresponding scales;
the forward decoding features of different scales are fused with the reverse decoding features of the corresponding scales;
The output end of the third depth separable residual convolution unit is connected with the input end of the fifth adder; the output end of the seventh depth separable residual convolution unit is connected with the input end of the first adder;
The output end of the fourth depth separable residual convolution unit is connected with the input end of the sixth adder; the output end of the eighth depth separable residual convolution unit is connected with the input end of the second adder;
the output end of the fourth depth over-parameterized cyclic residual convolution unit is connected with the input end of the seventh adder; the output end of the ninth depth over-parameterized cyclic residual convolution unit is connected with the input end of the third adder;
the output end of the fifth depth over-parameterized cyclic residual convolution unit is connected with the input end of the eighth adder; the output end of the tenth depth over-parameterized cyclic residual convolution unit is connected with the input end of the fourth adder;
the multi-scale frame distortion network is used for carrying out up-sampling processing on the original features of the first scale to obtain the original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
based on the pixel-level parameters of each target pixel on the first-scale original features and the adaptive collaborative flow AdaCoF, a warping operation is performed on the two first-scale images to obtain the two warped frame images of the first scale;
based on the pixel-level parameters of each target pixel on the second-scale original features and the adaptive collaborative flow AdaCoF, a warping operation is performed on the two second-scale images to obtain the two warped frame images of the second scale;
based on the pixel-level parameters of each target pixel on the third-scale original features and the adaptive collaborative flow AdaCoF, a warping operation is performed on the two third-scale images to obtain the two warped frame images of the third scale;
The multi-scale frame synthesis network is used for performing a synthesis operation on the warped frame images of the three scales to obtain the interpolation frame;
The multi-scale frame synthesis network comprises GridNet; GridNet includes three parallel branches: a ninth branch, a tenth branch and an eleventh branch;
the ninth branch includes: the first residual block, the second residual block, the third residual block, the fourth residual block, the fifth residual block and the sixth residual block are sequentially connected;
The first residual block includes: the first PReLU activation function layer, the sixth convolution layer, the seventh convolution layer, the third average pooling layer, the eighth convolution layer, the second PReLU activation function layer, the ninth convolution layer, the third multiplier and the fourteenth adder which are sequentially connected; the input end of the first PReLU activation function layer is also connected with the input end of the fourteenth adder, and the input end of the third average pooling layer is also connected with the input end of the third multiplier; further, the warped frames of the three scales are respectively used as the inputs to different rows of GridNet, so as to generate an intermediate frame; from top to bottom, the inputs and feature maps of the rows have progressively smaller resolutions; a channel attention mechanism is introduced; in the upsampling process, bilinear upsampling is used.
2. The method for video frame interpolation based on a bi-directional coding structure of claim 1,
The first depth over-parameterized cyclic residual convolution unit includes:
The first depth super-parameterized convolution layer DO-Conv, the second depth super-parameterized convolution layer DO-Conv, the first linear rectification function layer, the third depth super-parameterized convolution layer DO-Conv, the second linear rectification function layer and the tenth adder are sequentially connected;
The input end of the first depth super-parameterized convolution layer DO-Conv is also connected with a tenth adder through a fourth depth super-parameterized convolution layer DO-Conv;
the input end of the first depth super-parameterized convolution layer DO-Conv is used as the input end of the first depth super-parameterized cyclic residual convolution unit;
The output end of the tenth adder is used as the output end of the first depth over-parameterized cyclic residual convolution unit;
the first depth separable residual convolution unit includes:
The first depth separable convolution layer, the third linear rectification function layer, the second depth separable convolution layer, the fourth linear rectification function layer, the fifth depth super-parameterized convolution layer DO-Conv, the fifth linear rectification function layer and the eleventh adder are sequentially connected;
The input end of the first depth separable convolution layer is also connected with an eleventh adder through a sixth depth super parameterized convolution layer DO-Conv;
the input end of the first depth separable convolution layer is the input end of the first depth separable residual convolution unit; the output of the eleventh adder is the output of the first depth separable residual convolution unit.
3. The method for video frame interpolation based on bi-directional coding structure according to claim 2, wherein the reverse encoder is used for encoding the two consecutive frame images to obtain a plurality of reverse coding features of different scales; the reverse decoder is used for decoding the reverse coding features to obtain a plurality of reverse decoding features of different scales;
The forward encoder is used for encoding the two consecutive frame images to obtain a plurality of forward coding features of different scales; the forward decoder is used for decoding the forward coding features to obtain a plurality of forward decoding features of different scales;
and finally, a ninth adder sums the output of the fourth adder and the output of the eighth adder to obtain the original characteristics.
4. A video frame interpolation system based on a bi-directional coding structure, comprising:
An acquisition module configured to: acquire two continuous frame images of a first scale of a video to be interpolated;
A preprocessing module configured to: perform an up-sampling operation on the two first-scale images simultaneously to obtain two images of a second scale; and perform a down-sampling operation on the two first-scale images simultaneously to obtain two images of a third scale;
An interpolated frame generation module configured to: input the two continuous frame images into the trained interpolation frame generation model; at the same time, also input the two continuous frame images in reversed order into the trained interpolation frame generation model; the model outputs an interpolation frame;
Wherein the trained interpolation frame generation model performs feature extraction, feature enhancement and feature fusion processing on the two input frame images and on the two frame images in reversed order to obtain original features of the first scale;
up-sampling the original features of the first scale to obtain original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
Based on the pixel-level parameters of each target pixel on the original features of each scale and the adaptive collaborative flow AdaCoF, performing a warping operation on the input image of the corresponding scale to obtain a warped frame image of the corresponding scale; and synthesizing the warped frame images of the plurality of scales to obtain the interpolation frame;
The trained interpolation frame generation model comprises:
a feature extraction module, a multi-scale frame distortion network and a multi-scale frame synthesis network which are connected in sequence;
The feature extraction module is used for performing feature extraction, feature enhancement and feature fusion processing on the two input frame images and on the two frame images in reversed order to obtain the original features of the first scale; the feature extraction module specifically comprises:
Reverse coding U-Net branches and forward coding U-Net branches;
The reverse coding U-Net branch comprises: a reverse encoder and a reverse decoder connected in sequence;
The reverse encoder includes: a first depth over-parameterized cyclic residual convolution unit, a first average pooling layer, a second depth over-parameterized cyclic residual convolution unit, a second average pooling layer, a third depth over-parameterized cyclic residual convolution unit, a third average pooling layer, a first depth separable residual convolution unit, a fourth average pooling layer, a second depth separable residual convolution unit and a fifth average pooling layer which are connected in sequence;
The reverse decoder includes: a third depth separable residual convolution unit, a first upsampling layer, a first adder, a fourth depth separable residual convolution unit, a second upsampling layer, a second adder, a fourth depth over-parameterized cyclic residual convolution unit, a third upsampling layer, a third adder, a fifth depth over-parameterized cyclic residual convolution unit, a fourth upsampling layer and a fourth adder which are connected in sequence;
The output end of the second depth over-parameterized cyclic residual convolution unit is connected with the input end of the first attention mechanism module, and the output end of the first attention mechanism module is connected with the input end of the fourth adder;
The output end of the third depth over-parameterized cyclic residual convolution unit is connected with the input end of the second attention mechanism module, and the output end of the second attention mechanism module is connected with the input end of the third adder;
The output end of the first depth separable residual convolution unit is connected with the input end of a third attention mechanism module, and the output end of the third attention mechanism module is connected with the input end of the second adder;
The output end of the second depth separable residual convolution unit is connected with the input end of a fourth attention mechanism module, and the output end of the fourth attention mechanism module is connected with the input end of the first adder;
the forward encoded U-Net branch comprises: a forward encoder and a forward decoder connected in sequence;
the forward encoder includes: a sixth depth over-parameterized cyclic residual convolution unit, a sixth average pooling layer, a seventh depth over-parameterized cyclic residual convolution unit, a seventh average pooling layer, an eighth depth over-parameterized cyclic residual convolution unit, an eighth average pooling layer, a fifth depth separable residual convolution unit, a ninth average pooling layer, a sixth depth separable residual convolution unit and a tenth average pooling layer which are connected in sequence;
The forward decoder includes: a seventh depth separable residual convolution unit, a fifth upsampling layer, a fifth adder, an eighth depth separable residual convolution unit, a sixth upsampling layer, a sixth adder, a ninth depth over-parameterized cyclic residual convolution unit, a seventh upsampling layer, a seventh adder, a tenth depth over-parameterized cyclic residual convolution unit, an eighth upsampling layer and an eighth adder which are connected in sequence;
The output end of the seventh depth over-parameterized cyclic residual convolution unit is connected with the input end of the fifth attention mechanism module, and the output end of the fifth attention mechanism module is connected with the input end of the eighth adder;
The output end of the eighth depth over-parameterized cyclic residual convolution unit is connected with the input end of the sixth attention mechanism module, and the output end of the sixth attention mechanism module is connected with the input end of the seventh adder;
The output end of the fifth depth separable residual convolution unit is connected with the input end of a seventh attention mechanism module, and the output end of the seventh attention mechanism module is connected with the input end of a sixth adder;
The output end of the sixth depth separable residual convolution unit is connected with the input end of the eighth attention mechanism module, and the output end of the eighth attention mechanism module is connected with the input end of the fifth adder;
the reverse decoding features of different scales are fused with the forward decoding features of the corresponding scales;
the forward decoding features of different scales are fused with the reverse decoding features of the corresponding scales;
The output end of the third depth separable residual convolution unit is connected with the input end of the fifth adder; the output end of the seventh depth separable residual convolution unit is connected with the input end of the first adder;
The output end of the fourth depth separable residual convolution unit is connected with the input end of the sixth adder; the output end of the eighth depth separable residual convolution unit is connected with the input end of the second adder;
the output end of the fourth depth over-parameterized cyclic residual convolution unit is connected with the input end of the seventh adder; the output end of the ninth depth over-parameterized cyclic residual convolution unit is connected with the input end of the third adder;
the output end of the fifth depth over-parameterized cyclic residual convolution unit is connected with the input end of the eighth adder; the output end of the tenth depth over-parameterized cyclic residual convolution unit is connected with the input end of the fourth adder;
the multi-scale frame distortion network is used for carrying out up-sampling processing on the original features of the first scale to obtain the original features of the second scale; downsampling the original features of the first scale to obtain original features of a third scale;
the original features of the first, second and third scales are respectively input into corresponding sub-networks to respectively obtain pixel level parameters of each target pixel on the original features of the first, second and third scales;
based on the pixel-level parameters of each target pixel on the first-scale original features and the adaptive collaborative flow AdaCoF, a warping operation is performed on the two first-scale images to obtain the two warped frame images of the first scale;
based on the pixel-level parameters of each target pixel on the second-scale original features and the adaptive collaborative flow AdaCoF, a warping operation is performed on the two second-scale images to obtain the two warped frame images of the second scale;
based on the pixel-level parameters of each target pixel on the third-scale original features and the adaptive collaborative flow AdaCoF, a warping operation is performed on the two third-scale images to obtain the two warped frame images of the third scale;
The multi-scale frame synthesis network is used for performing a synthesis operation on the warped frame images of the three scales to obtain the interpolation frame;
The multi-scale frame synthesis network comprises GridNet; GridNet includes three parallel branches: a ninth branch, a tenth branch and an eleventh branch;
the ninth branch includes: the first residual block, the second residual block, the third residual block, the fourth residual block, the fifth residual block and the sixth residual block are sequentially connected;
The first residual block includes: the first PReLU activation function layer, the sixth convolution layer, the seventh convolution layer, the third average pooling layer, the eighth convolution layer, the second PReLU activation function layer, the ninth convolution layer, the third multiplier and the fourteenth adder which are sequentially connected; the input end of the first PReLU activation function layer is also connected with the input end of the fourteenth adder, and the input end of the third average pooling layer is also connected with the input end of the third multiplier; further, the warped frames of the three scales are respectively used as the inputs to different rows of GridNet, so as to generate an intermediate frame; from top to bottom, the inputs and feature maps of the rows have progressively smaller resolutions; a channel attention mechanism is introduced; in the upsampling process, bilinear upsampling is used.
5. An electronic device, comprising:
A memory for non-transitory storage of computer readable instructions; and
A processor for executing the computer-readable instructions,
Wherein the computer readable instructions, when executed by the processor, perform the method of any of claims 1-3.
6. A storage medium, characterized by non-transitorily storing computer readable instructions, wherein, when the non-transitory computer readable instructions are executed by a computer, the steps of the method of any of claims 1-3 are performed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410059485.5A CN117896526B (en) | 2024-01-15 | 2024-01-15 | Video frame interpolation method and system based on bidirectional coding structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117896526A CN117896526A (en) | 2024-04-16 |
CN117896526B (en) | 2024-09-03
Family
ID=90645349
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410059485.5A Active CN117896526B (en) | 2024-01-15 | 2024-01-15 | Video frame interpolation method and system based on bidirectional coding structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117896526B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114125455A (en) * | 2021-11-23 | 2022-03-01 | 长沙理工大学 | Bidirectional coding video frame insertion method, system and equipment based on deep learning |
CN115880149A (en) * | 2022-11-25 | 2023-03-31 | 济南大学 | Video frame interpolation method and system based on lightweight drive and three-scale coding |
CN116895037A (en) * | 2023-07-24 | 2023-10-17 | 济南大学 | Frame insertion method and system based on edge information and multi-scale cross fusion network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115170402A (en) * | 2022-05-07 | 2022-10-11 | 山东海量信息技术研究院 | Frame insertion method and system based on cyclic residual convolution and over-parameterized convolution |
CN114842400A (en) * | 2022-05-23 | 2022-08-02 | 山东海量信息技术研究院 | Video frame generation method and system based on residual block and feature pyramid |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
2024-08-09 | TA01 | Transfer of patent application right | Effective date of registration: 20240809; Address after: 250000, Huiyuan Building, No. 32 Huaneng Road, Lixia District, Jinan City, Shandong Province, 1801; Applicant after: Shandong Xianyun Information Technology Co.,Ltd.; Country or region after: China; Address before: 250022 No. 336, South Xin Zhuang West Road, Shizhong District, Ji'nan, Shandong; Applicant before: University of Jinan; Country or region before: China
| GR01 | Patent grant | |