AU2018101526A4 - Video interpolation based on deep learning - Google Patents

Video interpolation based on deep learning Download PDF

Info

Publication number
AU2018101526A4
Authority
AU
Australia
Prior art keywords
frame
training
frames
neural network
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2018101526A
Inventor
Xipeng Chai
Xiaoyu FAN
Xiaoyan Feng
Zixuan WANG
Sen Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fan Xiaoyu Miss
Feng Xiaoyan Miss
Wang Zixuan Miss
Original Assignee
Fan Xiaoyu Miss
Feng Xiaoyan Miss
Wang Zixuan Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fan Xiaoyu Miss, Feng Xiaoyan Miss, Wang Zixuan Miss filed Critical Fan Xiaoyu Miss
Priority to AU2018101526A priority Critical patent/AU2018101526A4/en
Application granted granted Critical
Publication of AU2018101526A4 publication Critical patent/AU2018101526A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

An algorithm designed for frame interpolation is presented. Said algorithm applies a neural network and deep learning, and is divided into two separate zones: the training zone and the testing zone (figure 3). The training data consists of a large number of randomly selected image groups, each group containing three successive frames. In the training zone, the program extracts the voxel flow from the first frame to the third frame via the neural network, creates a frame to be inserted between them, and compares it with the second frame, which is considered the ideal output. Loss is defined as the deviation from the second frame (the control) and is calculated with a modified formula that combines L1 and SSIM terms. The model is automatically optimized during the training procedure in an unsupervised fashion and can be saved and restored.

Description

TITLE
Video interpolation based on deep learning
BACKGROUND OF THE INVENTION
The growing demand for high-resolution videos stimulates the exploration of methods for increasing frame rates. The loss of frames can be caused by compression (during the process of transmission) for the sake of time and storage. Moreover, inferior recording equipment may be a contributing factor to low frame rates. Therefore, frame synthesis products applied in terminals are critical in improving the quality of videos.
Primary frame interpolation approaches include repeating and averaging. The former simply inserts a frame that is identical to the preceding (or succeeding) frame, while the latter calculates the mean of the previous and the subsequent frames. As it turns out, the former makes little difference to the video, and the latter makes the video blurry by merely inserting a mixture of two adjacent frames.
Newer solutions offer smarter methods for predicting object motion. The most widely used approaches among them fall into two categories: block-matching with motion estimation, and optical-flow analysis. The former divides a frame into several blocks following a certain pattern. Each block at a certain location is expected to have a counterpart within the vicinity of the same location in the next frame. The deviation between them is recorded as a motion vector, which is used in the process of motion estimation. More advanced algorithms divide the frame into blocks consisting of pixels that share a similar motion pattern, which leads to a more reasonable and subtle output. In recent years, more advanced techniques have been adopted to refine block-matching and motion-estimation algorithms, improving the performance of frame synthesis.
Because they are based on a fixed model, said approaches work well when the changes are not too acute and when movements are highly predictable. However, being anchored in divided blocks, they have inherent disadvantages. Having too few blocks makes it difficult to approach the expected output and makes the video blurry, while having too many greatly increases the computational workload. Videos are abundant in changes and surprises, and movements are sometimes far from simple. Therefore, it is nearly impossible for said algorithms to properly handle most circumstances with the few patterns given by programmers. Due to the unpredictability and complexity of videos, it is crucial for the program to be able to learn from a diversity of objects so as to find a model that fits the video well enough.
Furthermore, when it comes to evaluating the output, coherence and fluency are more accurately evaluated and quantified by computers than by humans, because minute similarities between frames can be converted into precise numbers that illustrate the deviation from the expected output. Through continuous learning, each tiny deviation from the expected output is addressed and minimized. Therefore, applying machine learning techniques to frame synthesis is efficient. An algorithm using deep learning can learn from given data in an unsupervised fashion; it consumes a reasonable amount of time during training but requires little time or computation once the training is complete. This invention offers a solution for frame synthesis that adopts deep learning and a neural network to better learn from given sample videos and offer more satisfying outcomes.
SUMMARY OF THE INVENTION
Synthesizing new video frames in an existing video has long been a challenging problem due to the complexity of the video motion and appearance. Existing approaches to address this problem basically estimate optical flow between the preceding and succeeding frames or use generative convolutional neural networks (CNNs) to hallucinate RGB
pixel values of the synthesized frames. We use a new algorithm that integrates estimation of the optical flow between frames with generative convolutional neural networks (CNNs). The approach uses existing videos to train an unsupervised CNN to generate voxel flow.
We employ MATLAB to sample a training set from UCF101, currently the largest dataset used for action recognition. We extract 1400 video files from UCF101 and construct 10 groups of frames per video. Each group comprises three consecutive frames, and each frame is resized to 256 × 256 × 3. The program takes two frames from the training set as input and the frame between them as the target. After applying the convolutional encoding and decoding layers to predict the 3D voxel flow, the deep network reconstructs the target and yields an image of the projected motion field.
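As an illustration of this sampling step, the following is a minimal Python/OpenCV sketch (the patent states that MATLAB was used; the function name, the stride choice, and the output layout here are assumptions for demonstration only):

    import os
    import cv2

    def sample_triplets(video_path, out_dir, groups_per_video=10):
        # Read all frames and resize each to 256 x 256 x 3.
        cap = cv2.VideoCapture(video_path)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(cv2.resize(frame, (256, 256)))
            ok, frame = cap.read()
        cap.release()

        if len(frames) < 3:
            return
        # Spread the groups of three consecutive frames over the video.
        stride = max(1, (len(frames) - 3) // groups_per_video)
        for g in range(groups_per_video):
            start = g * stride
            if start + 2 >= len(frames):
                break
            group_dir = os.path.join(out_dir, "group_%d" % g)
            os.makedirs(group_dir, exist_ok=True)
            for i in range(3):
                cv2.imwrite(os.path.join(group_dir, "frame_%d.png" % (i + 1)),
                            frames[start + i])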
Fundamentally, the program has two parts: training and testing. The initialization of the program involves reading data from the dataset, producing text files with all the needed file paths, and setting all the required parameters. The file paths are the paths of the input and target images selected from the created data set. The parameters mainly comprise a batch size of 1 (which can be modified by the user and makes little difference to the program), a maximum number of steps of 10000000, and a learning rate of 0.0003. For the training part, we aim to minimize the loss so that the generated output is similar to the target; for the testing part, we use the well-trained model to produce the desired synthesized frames.
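A minimal sketch of this initialization is given below, assuming illustrative names (PARAMS, write_data_list) and a flat folder layout; none of these identifiers are taken from the patent's code:

    import os

    PARAMS = {
        "batch_size": 1,           # can be changed by the user
        "max_steps": 10000000,
        "learning_rate": 0.0003,
    }

    def write_data_list(dataset_dir, list_path, frame_name):
        # One line per group folder, e.g. "A(1)-1/frame_1.png".
        with open(list_path, "w") as f:
            for group in sorted(os.listdir(dataset_dir)):
                if os.path.isdir(os.path.join(dataset_dir, group)):
                    f.write("%s/%s\n" % (group, frame_name))

    # Example usage (placeholder paths):
    # write_data_list("/path/to/dataset/", "frame1.txt", "frame_1.png")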
The model begins with the normalization layer. We adopt instance normalization instead of batch normalization. In principle, batch normalization computes the mean and variance within a mini-batch. It tries to make the distribution of the entire layer fit the normal distribution, thus providing a layer with inputs that have zero mean and unit variance. Batch normalization diminishes the accuracy of the estimation as the batch size becomes smaller, because it normalizes the features of images along the batch dimension, so the result for a single instance is highly dependent on the other instances. The focus on relative differences between instances makes batch normalization perform well on classification tasks, but it ignores absolute differences and adds noise to the gradients of a single instance. Instance normalization increases the independence among instances and hence performs better at generating images closely related to the inputs.
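The contrast can be summarized in a short NumPy sketch (illustrative only; the patent's implementation uses TensorFlow layers rather than the functions named here):

    import numpy as np

    def batch_norm(x, eps=1e-5):
        # x has shape (N, H, W, C); statistics are shared across the whole batch,
        # so each sample's output depends on the other samples.
        mean = x.mean(axis=(0, 1, 2), keepdims=True)
        var = x.var(axis=(0, 1, 2), keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    def instance_norm(x, eps=1e-5):
        # Statistics are computed per sample and per channel over the spatial
        # dimensions only, so each instance is normalized independently.
        mean = x.mean(axis=(1, 2), keepdims=True)
        var = x.var(axis=(1, 2), keepdims=True)
        return (x - mean) / np.sqrt(var + eps)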
Then we build the encoding section, which carries three basic processing units. Each unit combines a convolution layer with a max-pooling layer. The convolution kernel sizes of the encoder are 5×5, 5×5 and 3×3, respectively. A bottleneck layer follows to reduce dimensionality. The decoding section contains three basic processing units for bilinear upsampling and convolution; we use 3×3, 5×5 and 5×5 convolution kernels for the decoder. In addition, we add a deep residual learning network (ResNet) to our program to alleviate the vanishing gradient problem and thereby make it easier to converge. Optimization becomes difficult when we increase the depth of the network; a residual neural network can apply a shortcut connection to skip layers that would degrade the performance. Consequently, it can avoid negative outcomes of the deeper network and accelerate training.
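A hedged tf.keras sketch of this layout is given below. The kernel sizes (5, 5, 3 down and 3, 5, 5 up), the max-pooling, the bilinear upsampling, and a residual shortcut follow the description above; the filter counts, the placement of the residual block in the bottleneck, and the three-channel voxel-flow head are assumptions, not taken from the patent's code:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_network(input_shape=(256, 256, 6)):   # two RGB frames stacked on channels
        inp = tf.keras.Input(shape=input_shape)

        # Encoder: three convolution + max-pooling units (kernels 5, 5, 3).
        x = layers.Conv2D(64, 5, padding="same", activation="relu")(inp)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(128, 5, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)

        # Bottleneck with a ResNet-style shortcut connection.
        shortcut = x
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(256, 3, padding="same")(x)
        x = layers.Add()([x, shortcut])
        x = layers.Activation("relu")(x)

        # Decoder: three bilinear-upsampling + convolution units (kernels 3, 5, 5).
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(128, 5, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(64, 5, padding="same", activation="relu")(x)

        # Head: three channels for the predicted voxel flow (dx, dy, blend weight).
        flow = layers.Conv2D(3, 5, padding="same", activation="tanh")(x)
        return tf.keras.Model(inp, flow)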
Apart from the output image produced by the program, we also generate an image representing the projected motion field based on the calculation of optical flow. The optical flow between frames consists of spatial and temporal components. We make use of displacement vectors to compute pixel motion from an earlier frame to a later frame. We define the spatial components as the difference between the absolute coordinate of a location in the preceding frame and that in the succeeding frame. The resulting image of the motion field is formed by merging the vertical optical-flow field image and the horizontal optical-flow field image. On this account, we are able to visualize the speed and the direction of object motion.
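One possible way to render such a motion-field image from the horizontal and vertical components is sketched below (assumptions for illustration: flow is an H×W×2 array of per-pixel displacements, and the common HSV mapping of direction to hue and magnitude to brightness is used):

    import cv2
    import numpy as np

    def flow_to_image(flow):
        # flow[..., 0] is the horizontal component, flow[..., 1] the vertical one.
        dx = flow[..., 0].astype(np.float32)
        dy = flow[..., 1].astype(np.float32)
        magnitude, angle = cv2.cartToPolar(dx, dy, angleInDegrees=True)

        hsv = np.zeros((flow.shape[0], flow.shape[1], 3), dtype=np.uint8)
        hsv[..., 0] = (angle / 2).astype(np.uint8)    # hue encodes direction
        hsv[..., 1] = 255                             # full saturation
        hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255,
                                    cv2.NORM_MINMAX).astype(np.uint8)  # brightness encodes speed
        return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)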
As for the loss function, we apply the following formula to combine the L1 value and the SSIM value to improve the quality of the output image.
$\mathcal{L} = \alpha \cdot \mathcal{L}_{\mathrm{SSIM}} + (1-\alpha) \cdot G_{\sigma_G} \cdot \mathcal{L}_{\ell_1}$
The L1 value is concerned with absolute deviations, and the SSIM value is a perceptual metric that quantifies image quality degradation. Said formula is more advanced than the existing loss-function formulas, which generally apply two parameters: L1 (the absolute deviation) and L2 (the square root of the sum of the squares of the vectors; this parameter helps to reduce overfitting). Our formula makes the loss value descend faster and improves the quality of the outcomes. Humans are sensitive to structural information but insensitive to high-luminance areas and complex textures, and the SSIM term helps to improve the outcome in the areas to which humans are more sensitive.
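A hedged TensorFlow sketch of such a combined loss follows. The weight alpha = 0.84 is a common choice in the literature rather than a value stated in the patent, and the Gaussian weighting of the L1 term is omitted here for brevity:

    import tensorflow as tf

    def mixed_loss(y_true, y_pred, alpha=0.84):
        # SSIM term: 1 - mean SSIM over the batch (images scaled to [0, 1]).
        l_ssim = 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))
        # L1 term: mean absolute deviation from the target frame.
        l_l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
        return alpha * l_ssim + (1.0 - alpha) * l_l1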
DESCRIPTION OF THE DRAWINGS
Figure 1 is an overview that describes the function of our invention;
Figure 2 is the detailed process of initialization, during which all parameters are initialized and file paths are written in text files;
Figure 3 is the main structure of our algorithm that consists of the “train” mode and “test” mode, determined by the parameter flag;
Figure 4 is the detailed structure of training process, including restoration of previous models, object selection, voxel flow extraction, frame generation, loss calculation, and saving. It describes the overall framework of the training algorithm.
Figure 5 is a more elaborate description of the neural network, which includes the convolutional encoder, the residual network (ResNet), and the convolutional decoder. Both the encoder and the decoder include three processing units. In the encoder section, each processing unit contains both convolution and max-pooling; the decoder section is similar to the encoder section.
Figure 6a and Figure 6b illustrate the input frames of one example in the training part.
Figure 7a, 7b and 7c respectively illustrate the projected frame, the projected motion field, and the target frame of the example.
This is the link to our final video compositing result, with our outcome on the left and the original video on the right. We recommend playing the video at 0.4× speed to observe the differences more clearly.
https://youtu.be/Kgs3zrc0HGQ
DESCRIPTION OF PREFERRED EMBODIMENT
This invention requires an operating environment with Python 3.6. Our invention is based on two pieces of software: PyCharm (for editing the program) and Anaconda (for creating a virtual environment). The packages used include tensorflow, opencv, dataset, numpy, os and so on.
Paths of data are referenced in the program. To ensure that the training is carried out smoothly, certain lines of the code need to be adjusted.
Data used for training is stored in a large number of lower-level folders under one parent directory. Each folder contains a group of three adjacent frames. Three text files are needed as data lists for the training program, each of them containing the names of the lower-level folders, followed by the character "/" and the name of the frame (e.g. A(1)-1/frame_1.png, A(1)-1/frame_2.png or A(1)-1/frame_3.png). Each path occupies one line.
The path of a group of frames is composed of two parts. The former part is the address of the parent folder mentioned above (e.g. /Users/(username)/Desktop/program summer/dataset/). The latter part comes from said data lists (e.g. /Users/(username)/Desktop/program/frame inserting/frame1.txt): certain functions are used to extract the relative paths stored in these data lists. With the said two parts concatenated together, the full address of a frame is formed.
Before using this program, said data set and data lists should be prepared, and the code directing the address of a frame should be adjusted according to the specific operating environment of the user.
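An illustrative helper for this path handling follows; the function name and the placeholder paths are assumptions and should be adapted to the user's own environment:

    import os

    def resolve_frame_paths(dataset_root, list_file):
        # Join the dataset root with each relative entry in a data-list text file.
        with open(list_file) as f:
            return [os.path.join(dataset_root, line.strip())
                    for line in f if line.strip()]

    # Example usage (placeholder paths):
    # paths = resolve_frame_paths("/path/to/dataset/", "/path/to/frame1.txt")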
Furthermore, to make the training process faster, the use of a GPU is recommended, since large-scale calculations are completed much faster on a GPU than on a CPU.
Finally, after obtaining a well-trained model whose loss remains low, the user is able to conduct the test process. When the process is over, all the results can be found in the destination file. In order to highlight the differences, synthesizing the frames into a video is recommended.
EXAMPLE
In order to make the previous statements more concrete, here is an example from the training part (the path file has already been made).
Step 1: The program reads two frames from the file-path file and takes them as inputs. The input frames are illustrated in Figure 6a and Figure 6b.
Step 2: From the inputs, the convolutional encoder-decoder predicts the 3D voxel flow, after which the desired frame (named "predict") and a projected motion field are synthesized by a volume sampling layer.
Step 3: The program compares the predicted frame with the target frame (the original frame) and calculates the loss. Based on this feedback, the neural network model is optimized.
Remark:
Step 3 is the end of one training pass. The program will keep reading frames, generating the desired frames, and optimizing the model until the loss is reduced to a very low value; a schematic sketch of this loop is given below. The illustration may not seem clear enough because the pictures are black-and-white; the video link given above offers a more distinct illustration.
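The sketch below restates Steps 1-3 as a tf.keras training loop. It is schematic only: the model is assumed to map the two stacked input frames directly to the synthesized middle frame, the dataset is assumed to yield (frame1, frame3, target) triplets, loss_fn would typically be the mixed L1 + SSIM loss sketched earlier, and the checkpoint path is a placeholder:

    import tensorflow as tf

    def train(model, dataset, params, loss_fn):
        optimizer = tf.keras.optimizers.Adam(params["learning_rate"])
        for step, (frame1, frame3, target) in enumerate(dataset):
            if step >= params["max_steps"]:
                break
            inputs = tf.concat([frame1, frame3], axis=-1)    # stack the two input frames
            with tf.GradientTape() as tape:
                predicted = model(inputs, training=True)     # synthesized middle frame
                loss = loss_fn(target, predicted)            # e.g. the L1 + SSIM mixed loss
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            if step % 1000 == 0:
                model.save_weights("checkpoints/model")      # periodic saving for later restoration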

Claims (1)

  1. There is one page in the claims only.
    Claims:
    1. A method of video frame synthesis, which aims at increasing the video frame rate, in which the inserted frame is generated by analyzing voxel flow; the voxel flow is generated by a convolutional neural network after inputting two frames; the neural network consists of three parts: a convolutional encoder, a ResNet, and a convolutional decoder; the application of the ResNet is conducive to the acceleration of convergence; because instance normalization is used, a large batch size is not a necessary requirement, and a smaller or bigger batch size will not affect the accuracy of the results.
AU2018101526A 2018-10-14 2018-10-14 Video interpolation based on deep learning Ceased AU2018101526A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2018101526A AU2018101526A4 (en) 2018-10-14 2018-10-14 Video interpolation based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2018101526A AU2018101526A4 (en) 2018-10-14 2018-10-14 Video interpolation based on deep learning

Publications (1)

Publication Number Publication Date
AU2018101526A4 true AU2018101526A4 (en) 2018-11-29

Family

ID=64350526

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2018101526A Ceased AU2018101526A4 (en) 2018-10-14 2018-10-14 Video interpolation based on deep learning

Country Status (1)

Country Link
AU (1) AU2018101526A4 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112188236A (en) * 2019-07-01 2021-01-05 北京新唐思创教育科技有限公司 Video interpolation frame model training method, video interpolation frame generation method and related device
CN113891027A (en) * 2021-12-06 2022-01-04 深圳思谋信息科技有限公司 Video frame insertion model training method and device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
US11328523B2 (en) Image composites using a generative neural network
US11775829B2 (en) Generative adversarial neural network assisted video reconstruction
JP7373554B2 (en) Cross-domain image transformation
CN111417988A (en) System and method for real-time complex character animation and interactivity
US20210397945A1 (en) Deep hierarchical variational autoencoder
US20230401672A1 (en) Video processing method and apparatus, computer device, and storage medium
US20230123820A1 (en) Generating animated digital videos utilizing a character animation neural network informed by pose and motion embeddings
US11954828B2 (en) Portrait stylization framework using a two-path image stylization and blending
US20220156987A1 (en) Adaptive convolutions in neural networks
AU2018101526A4 (en) Video interpolation based on deep learning
CA3180427A1 (en) Synthesizing sequences of 3d geometries for movement-based performance
CN114972574A (en) WEB-based digital image real-time editing using latent vector stream renderer and image modification neural network
US20230086807A1 (en) Segmented differentiable optimization with multiple generators
CN115512014A (en) Method for training expression driving generation model, expression driving method and device
Huang et al. IA-FaceS: A bidirectional method for semantic face editing
US20240013464A1 (en) Multimodal disentanglement for generating virtual human avatars
US11948240B2 (en) Systems and methods for computer animation using an order of operations deformation engine
Zanni et al. N-ary implicit blends with topology control
US20230319223A1 (en) Method and system for deep learning based face swapping with multiple encoders
CN115496843A (en) Local realistic-writing cartoon style migration system and method based on GAN
US20230154090A1 (en) Synthesizing sequences of images for movement-based performance
Tan et al. FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
Sun et al. Generation of virtual digital human for customer service industry
US20240146868A1 (en) Video frame interpolation method and apparatus, and device
US20230316587A1 (en) Method and system for latent-space facial feature editing in deep learning based face swapping

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry