AU2018101526A4 - Video interpolation based on deep learning - Google Patents

Video interpolation based on deep learning Download PDF

Info

Publication number
AU2018101526A4
Authority
AU
Australia
Prior art keywords
frame
training
frames
neural network
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2018101526A
Inventor
Xipeng Chai
Xiaoyu FAN
Xiaoyan Feng
Zixuan WANG
Sen Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fan Xiaoyu Miss
Feng Xiaoyan Miss
Wang Zixuan Miss
Original Assignee
Fan Xiaoyu Miss
Feng Xiaoyan Miss
Wang Zixuan Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fan Xiaoyu Miss, Feng Xiaoyan Miss, Wang Zixuan Miss filed Critical Fan Xiaoyu Miss
Priority to AU2018101526A priority Critical patent/AU2018101526A4/en
Application granted granted Critical
Publication of AU2018101526A4 publication Critical patent/AU2018101526A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

An algorithm designed for frame interpolation is presented. Said algorithm applies a neural network and deep learning, and is divided into two separate zones: the training zone and the testing zone (figure 3). The training data consists of a large number of randomly selected image groups, each group containing three successive frames. In the training zone, the program extracts the voxel flow from the first frame to the third frame via the neural network, creates a frame to be inserted between them, and compares it with the second frame, which is considered the ideal output. Loss is defined as the deviation from the second frame (the control) and is calculated with a modified formula that combines L1 and SSIM terms. The model is automatically optimized during the training procedure in an unsupervised fashion and can be saved and restored.

Description

TITLE
Video interpolation based on deep learning
BACKGROUND OF THE INVENTION
The growing demand for high-resolution videos stimulates the exploration of methods for increasing frame rates. The loss of frames can be caused by compression (during the process of transmission) for the sake of time and storage. Moreover, inferior recording equipment may be a contributing factor to low frame rates. Therefore, frame synthesis products applied in terminals are critical in improving the quality of videos.
Primary frame interpolation approaches include repeating and averaging. The former simply inserts a frame that is identical to the preceding (or succeeding) frame, while the latter calculates the mean of the previous and the subsequent frames. As it turns out, the former makes little difference to the video, and the latter makes the video blurry by merely inserting a mixture of two adjacent frames.
Newer solutions offer smarter methods for predicting object motion. The most widely used approaches among them fall into two categories: block-matching with motion estimation, and optical-flow analysis. The former divides a frame into several blocks following a certain pattern. Each block at a certain location is expected to have a counterpart within the vicinity of the same location in the next frame. The deviation between them is recorded as a motion vector, which is used in the process of motion estimation. More advanced algorithms divide the frame into blocks consisting of pixels that share a similar motion pattern, which leads to a more reasonable and subtle output. In recent years, more advanced techniques have been adopted to refine block-matching and motion-estimation algorithms, improving the performance of frame synthesis.
Because they are based on a fixed model, said approaches work well when the changes are not too acute and when movements are highly predictable. However, being anchored in divided blocks, they have inherent disadvantages. Having too few blocks makes it difficult to approach the expected output and makes the video blurry, while having too many greatly increases the computational workload. Videos are abundant in changes and surprises, and movements are sometimes far from simple. Therefore, it is nearly impossible for said algorithms to properly handle most circumstances with the few patterns given by programmers. Due to the unpredictability and complexity of videos, it is crucial for the program to be able to learn from a diversity of objects so as to find a model that fits the video well enough.
Furthermore, when it comes to evaluating the output, coherence and fluency are more accurately evaluated and quantified by computers than by humans, because minute similarities between frames can be converted into precise numbers that illustrate the deviation from the expected output. Through continuous learning, each tiny deviation from the expected output is addressed and minimized. Therefore, applying machine learning techniques to frame synthesis is efficient. An algorithm using deep learning can learn from given data in an unsupervised fashion; it consumes a reasonable amount of time during training but requires little time or computation once the training is complete. This invention offers a solution for frame synthesis that adopts deep learning and a neural network to better learn from given sample videos and offer more satisfying outcomes.
SUMMARY OF THE INVENTION
Synthesizing new video frames in an existing video has long been a challenging problem due to the complexity of the video motion and appearance. Existing approaches to address this problem basically estimate optical flow between the preceding and succeeding frames or use generative convolutional neural networks (CNNs) to hallucinate RGB
pixel values of the synthesized frames. We use a new algorithm that integrates estimation of the optical flow between frames with generative convolutional neural networks (CNNs). The approach uses existing videos to train an unsupervised CNN to generate voxel flow.
We employ MATLAB to sample a training set from UCF101, currently the largest dataset used for action recognition. We extract 1400 video files from UCF101 and construct 10 groups of frames per video. Each group comprises three consecutive frames, and each frame is resized to 256 × 256 × 3. The program takes two frames from the training set as input and the frame between them as the target. After applying the convolutional encoding and decoding layers to predict the 3D voxel flow, the deep network reconstructs the target and yields an image of the projected motion field.
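As an illustration of this sampling step, the following is a minimal Python/OpenCV sketch (the patent states that MATLAB was used; the function name, the stride choice, and the output layout here are assumptions for demonstration only):

    import os
    import cv2

    def sample_triplets(video_path, out_dir, groups_per_video=10):
        # Read all frames and resize each to 256 x 256 x 3.
        cap = cv2.VideoCapture(video_path)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(cv2.resize(frame, (256, 256)))
            ok, frame = cap.read()
        cap.release()

        if len(frames) < 3:
            return
        # Spread the groups of three consecutive frames over the video.
        stride = max(1, (len(frames) - 3) // groups_per_video)
        for g in range(groups_per_video):
            start = g * stride
            if start + 2 >= len(frames):
                break
            group_dir = os.path.join(out_dir, "group_%d" % g)
            os.makedirs(group_dir, exist_ok=True)
            for i in range(3):
                cv2.imwrite(os.path.join(group_dir, "frame_%d.png" % (i + 1)),
                            frames[start + i])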
Fundamentally, the program has two parts: training and testing. The initialization of the program involves reading data from the dataset, producing text files with all the needed file paths, and setting all the required parameters. The file paths are the paths of the input and target images selected from the created data set. The parameters mainly comprise a batch size of 1 (which can be modified by the user and makes little difference to the program), a maximum number of steps of 10000000, and a learning rate of 0.0003. For the training part, we aim to minimize the loss so that the generated output is similar to the target; for the testing part, we use the well-trained model to produce the desired synthesized frames.
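A minimal sketch of this initialization is given below, assuming illustrative names (PARAMS, write_data_list) and a flat folder layout; none of these identifiers are taken from the patent's code:

    import os

    PARAMS = {
        "batch_size": 1,           # can be changed by the user
        "max_steps": 10000000,
        "learning_rate": 0.0003,
    }

    def write_data_list(dataset_dir, list_path, frame_name):
        # One line per group folder, e.g. "A(1)-1/frame_1.png".
        with open(list_path, "w") as f:
            for group in sorted(os.listdir(dataset_dir)):
                if os.path.isdir(os.path.join(dataset_dir, group)):
                    f.write("%s/%s\n" % (group, frame_name))

    # Example usage (placeholder paths):
    # write_data_list("/path/to/dataset/", "frame1.txt", "frame_1.png")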
The model begins with the normalization layer. We adopt instance normalization instead of batch normalization. In principle, batch normalization computes the mean and variance within a mini-batch. It tries to make the distribution of the entire layer fit the normal distribution, thus providing a layer with inputs that have zero mean and unit variance. Batch normalization diminishes the accuracy of the estimation as the batch size becomes smaller, because it normalizes the features of images along the batch dimension, so the result for a single instance is highly dependent on the other instances. The focus on relative differences between instances makes batch normalization perform well on classification tasks, but it ignores absolute differences and adds noise to the gradients of a single instance. Instance normalization increases the independence among instances and hence performs better at generating images closely related to the inputs.
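The contrast can be summarized in a short NumPy sketch (illustrative only; the patent's implementation uses TensorFlow layers rather than the functions named here):

    import numpy as np

    def batch_norm(x, eps=1e-5):
        # x has shape (N, H, W, C); statistics are shared across the whole batch,
        # so each sample's output depends on the other samples.
        mean = x.mean(axis=(0, 1, 2), keepdims=True)
        var = x.var(axis=(0, 1, 2), keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    def instance_norm(x, eps=1e-5):
        # Statistics are computed per sample and per channel over the spatial
        # dimensions only, so each instance is normalized independently.
        mean = x.mean(axis=(1, 2), keepdims=True)
        var = x.var(axis=(1, 2), keepdims=True)
        return (x - mean) / np.sqrt(var + eps)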
Then we build the encoding section, which carries three basic processing units. Each unit combines a convolution layer with a max-pooling layer. The convolution kernel sizes of the encoder are 5×5, 5×5 and 3×3, respectively. A bottleneck layer follows to reduce dimensionality. The decoding section contains three basic processing units for bilinear upsampling and convolution; we use 3×3, 5×5 and 5×5 convolution kernels for the decoder. In addition, we add a deep residual learning network (ResNet) to our program to alleviate the vanishing gradient problem and thereby make it easier to converge. Optimization becomes difficult when we increase the depth of the network; a residual neural network can apply a shortcut connection to skip layers that would degrade the performance. Consequently, it can avoid negative outcomes of the deeper network and accelerate training.
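A hedged tf.keras sketch of this layout is given below. The kernel sizes (5, 5, 3 down and 3, 5, 5 up), the max-pooling, the bilinear upsampling, and a residual shortcut follow the description above; the filter counts, the placement of the residual block in the bottleneck, and the three-channel voxel-flow head are assumptions, not taken from the patent's code:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_network(input_shape=(256, 256, 6)):   # two RGB frames stacked on channels
        inp = tf.keras.Input(shape=input_shape)

        # Encoder: three convolution + max-pooling units (kernels 5, 5, 3).
        x = layers.Conv2D(64, 5, padding="same", activation="relu")(inp)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(128, 5, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)

        # Bottleneck with a ResNet-style shortcut connection.
        shortcut = x
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(256, 3, padding="same")(x)
        x = layers.Add()([x, shortcut])
        x = layers.Activation("relu")(x)

        # Decoder: three bilinear-upsampling + convolution units (kernels 3, 5, 5).
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(128, 5, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(64, 5, padding="same", activation="relu")(x)

        # Head: three channels for the predicted voxel flow (dx, dy, blend weight).
        flow = layers.Conv2D(3, 5, padding="same", activation="tanh")(x)
        return tf.keras.Model(inp, flow)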
Apart from the output image produced by the program, we also generate an image representing the projected motion field based on the calculation of optical flow. The optical flow between frames consists of spatial and temporal components. We make use of displacement vectors to compute pixel motion from an earlier frame to a later frame. We define the spatial components as the difference between the absolute coordinate of a location in the preceding frame and that in the succeeding frame. The resulting image of the motion field is formed by merging the vertical optical-flow field image and the horizontal optical-flow field image. On this account, we are able to visualize the speed and the direction of object motion.
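One possible way to render such a motion-field image from the horizontal and vertical components is sketched below (assumptions for illustration: flow is an H×W×2 array of per-pixel displacements, and the common HSV mapping of direction to hue and magnitude to brightness is used):

    import cv2
    import numpy as np

    def flow_to_image(flow):
        # flow[..., 0] is the horizontal component, flow[..., 1] the vertical one.
        dx = flow[..., 0].astype(np.float32)
        dy = flow[..., 1].astype(np.float32)
        magnitude, angle = cv2.cartToPolar(dx, dy, angleInDegrees=True)

        hsv = np.zeros((flow.shape[0], flow.shape[1], 3), dtype=np.uint8)
        hsv[..., 0] = (angle / 2).astype(np.uint8)    # hue encodes direction
        hsv[..., 1] = 255                             # full saturation
        hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255,
                                    cv2.NORM_MINMAX).astype(np.uint8)  # brightness encodes speed
        return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)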
As for the loss function, we apply the following formula to combine the L1 value and the SSIM value to improve the quality of the output image.
$\mathcal{L} = \alpha \cdot \mathcal{L}_{\mathrm{SSIM}} + (1-\alpha) \cdot G_{\sigma_G} \cdot \mathcal{L}_{\ell_1}$
The L1 value is concerned with absolute deviations, and the SSIM value is a perceptual metric that quantifies image quality degradation. Said formula is more advanced than the existing loss-function formulas, which generally apply two parameters: L1 (the absolute deviation) and L2 (the square root of the sum of the squares of the vectors; this parameter helps to reduce overfitting). Our formula makes the loss value descend faster and improves the quality of the outcomes. Humans are sensitive to structural information but insensitive to high-luminance areas and complex textures, and the SSIM term helps to improve the outcome in the areas to which humans are more sensitive.
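A hedged TensorFlow sketch of such a combined loss follows. The weight alpha = 0.84 is a common choice in the literature rather than a value stated in the patent, and the Gaussian weighting of the L1 term is omitted here for brevity:

    import tensorflow as tf

    def mixed_loss(y_true, y_pred, alpha=0.84):
        # SSIM term: 1 - mean SSIM over the batch (images scaled to [0, 1]).
        l_ssim = 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))
        # L1 term: mean absolute deviation from the target frame.
        l_l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
        return alpha * l_ssim + (1.0 - alpha) * l_l1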
DESCRIPTION OF THE DRAWINGS
Figure 1 is an overview that describes the function of our invention;
Figure 2 is the detailed process of initialization, during which all parameters are initialized and file paths are written in text files;
Figure 3 is the main structure of our algorithm that consists of the “train” mode and “test” mode, determined by the parameter flag;
Figure 4 is the detailed structure of training process, including restoration of previous models, object selection, voxel flow extraction, frame generation, loss calculation, and saving. It describes the overall framework of the training algorithm.
Figure 5 is a more elaborate description of the neural network, which includes the convolutional encoder, the residual network (ResNet), and the convolutional decoder. Both the encoder and the decoder include three processing units. In the encoder section, each processing unit contains both convolution and max-pooling; the decoder section is similar to the encoder section.
Figure 6a and Figure 6b illustrate the input frames of one example in the training part.
Figure 7a, 7b and 7c respectively illustrate the projected frame, the projected motion field, and the target frame of the example.
This is the link to our final video compositing result, with our outcome on the left and the original video on the right. We recommend playing the video at 0.4× speed to observe the differences more clearly.
https://youtu.be/Kgs3zrc0HGQ
DESCRIPTION OF PREFERRED EMBODIMENT
This invention requires an operating environment with Python 3.6. Our invention is based on two pieces of software: PyCharm (for editing the program) and Anaconda (for creating a virtual environment). The packages used include tensorflow, opencv, dataset, numpy, os and so on.
Paths of data are referenced in the program. To ensure that the training is carried out smoothly, certain lines of the code need to be adjusted.
Data used for training is stored in a large number of lower-level folders under one parent directory. Each folder contains a group of three adjacent frames. Three text files are needed as data lists for the training program, each of them containing the names of the lower-level folders, followed by the character "/" and the name of the frame (e.g. A(1)-1/frame_1.png, A(1)-1/frame_2.png or A(1)-1/frame_3.png). Each path occupies one line.
The path of a group of frames is composed of two parts. The former part is the address of the parent folder mentioned above (e.g. /Users/(username)/Desktop/program summer/dataset/). The latter part comes from said data lists (e.g. /Users/(username)/Desktop/program/frame inserting/frame1.txt): certain functions are used to extract the relative paths stored in these data lists. With the said two parts concatenated together, the full address of a frame is formed.
Before using this program, said data set and data lists should be prepared, and the code directing the address of a frame should be adjusted according to the specific operating environment of the user.
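An illustrative helper for this path handling follows; the function name and the placeholder paths are assumptions and should be adapted to the user's own environment:

    import os

    def resolve_frame_paths(dataset_root, list_file):
        # Join the dataset root with each relative entry in a data-list text file.
        with open(list_file) as f:
            return [os.path.join(dataset_root, line.strip())
                    for line in f if line.strip()]

    # Example usage (placeholder paths):
    # paths = resolve_frame_paths("/path/to/dataset/", "/path/to/frame1.txt")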
Furthermore, to make the training process faster, the use of a GPU is recommended, since large-scale calculations are completed much faster on a GPU than on a CPU.
Finally, after obtaining a well-trained model whose loss remains low, the user is able to conduct the test process. When the process is over, all the results can be found in the destination file. In order to highlight the differences, synthesizing the frames into a video is recommended.
EXAMPLE
In order to make the previous statements more concrete, here is an example from the training part (the path file has already been made).
Step 1: The program reads two frames from the file-path file and takes them as inputs. The input frames are illustrated in Figure 6a and Figure 6b.
Step 2: From the inputs, the convolutional encoder-decoder predicts the 3D voxel flow, after which the desired frame (named "predict") and a projected motion field are synthesized by a volume sampling layer.
Step 3: The program compares the predicted frame with the target frame (the original frame) and calculates the loss. Based on this feedback, the neural network model is optimized.
Remark:
Step 3 is the end of one training pass. The program will keep reading frames, generating the desired frames, and optimizing the model until the loss is reduced to a very low value; a schematic sketch of this loop is given below. The illustration may not seem clear enough because the pictures are black-and-white; the video link given above offers a more distinct illustration.
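The sketch below restates Steps 1-3 as a tf.keras training loop. It is schematic only: the model is assumed to map the two stacked input frames directly to the synthesized middle frame, the dataset is assumed to yield (frame1, frame3, target) triplets, loss_fn would typically be the mixed L1 + SSIM loss sketched earlier, and the checkpoint path is a placeholder:

    import tensorflow as tf

    def train(model, dataset, params, loss_fn):
        optimizer = tf.keras.optimizers.Adam(params["learning_rate"])
        for step, (frame1, frame3, target) in enumerate(dataset):
            if step >= params["max_steps"]:
                break
            inputs = tf.concat([frame1, frame3], axis=-1)    # stack the two input frames
            with tf.GradientTape() as tape:
                predicted = model(inputs, training=True)     # synthesized middle frame
                loss = loss_fn(target, predicted)            # e.g. the L1 + SSIM mixed loss
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            if step % 1000 == 0:
                model.save_weights("checkpoints/model")      # periodic saving for later restoration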

Claims (1)

  1. There is one page in the claims only.
    Claims:
    1. A method of video frame synthesis, which aims at increasing the video frame rate, in which the inserted frame is generated by analyzing voxel flow; the voxel flow is generated by a convolutional neural network after inputting two frames; the neural network consists of three parts: a convolutional encoder, a ResNet, and a convolutional decoder; the application of the ResNet is conducive to the acceleration of convergence; because instance normalization is used, a large batch size is not a necessary requirement, and a smaller or bigger batch size will not affect the accuracy of the results.
AU2018101526A 2018-10-14 2018-10-14 Video interpolation based on deep learning Ceased AU2018101526A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2018101526A AU2018101526A4 (en) 2018-10-14 2018-10-14 Video interpolation based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2018101526A AU2018101526A4 (en) 2018-10-14 2018-10-14 Video interpolation based on deep learning

Publications (1)

Publication Number Publication Date
AU2018101526A4 true AU2018101526A4 (en) 2018-11-29

Family

ID=64350526

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2018101526A Ceased AU2018101526A4 (en) 2018-10-14 2018-10-14 Video interpolation based on deep learning

Country Status (1)

Country Link
AU (1) AU2018101526A4 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112188236A (en) * 2019-07-01 2021-01-05 北京新唐思创教育科技有限公司 Video interpolation frame model training method, video interpolation frame generation method and related device
CN113891027A (en) * 2021-12-06 2022-01-04 深圳思谋信息科技有限公司 Video frame insertion model training method and device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
US11328523B2 (en) Image composites using a generative neural network
US11775829B2 (en) Generative adversarial neural network assisted video reconstruction
JP7373554B2 (en) Cross-domain image transformation
CN111417988A (en) System and method for real-time complex character animation and interactivity
US20210397945A1 (en) Deep hierarchical variational autoencoder
US20230401672A1 (en) Video processing method and apparatus, computer device, and storage medium
US20230123820A1 (en) Generating animated digital videos utilizing a character animation neural network informed by pose and motion embeddings
US11954828B2 (en) Portrait stylization framework using a two-path image stylization and blending
US20220156987A1 (en) Adaptive convolutions in neural networks
AU2018101526A4 (en) Video interpolation based on deep learning
CA3180427A1 (en) Synthesizing sequences of 3d geometries for movement-based performance
CN114972574A (en) WEB-based digital image real-time editing using latent vector stream renderer and image modification neural network
US20230086807A1 (en) Segmented differentiable optimization with multiple generators
CN115512014A (en) Method for training expression driving generation model, expression driving method and device
Huang et al. IA-FaceS: A bidirectional method for semantic face editing
US20240013464A1 (en) Multimodal disentanglement for generating virtual human avatars
US11948240B2 (en) Systems and methods for computer animation using an order of operations deformation engine
Zanni et al. N-ary implicit blends with topology control
US20230319223A1 (en) Method and system for deep learning based face swapping with multiple encoders
CN115496843A (en) Local realistic-writing cartoon style migration system and method based on GAN
US20230154090A1 (en) Synthesizing sequences of images for movement-based performance
Tan et al. FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
Sun et al. Generation of virtual digital human for customer service industry
US20240146868A1 (en) Video frame interpolation method and apparatus, and device
US20230316587A1 (en) Method and system for latent-space facial feature editing in deep learning based face swapping

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry