WO2022033048A1 - Video frame interpolation method, model training method, and corresponding device - Google Patents


Info

Publication number: WO2022033048A1
Application number: PCT/CN2021/085220
Authority: WO (WIPO PCT)
Prior art keywords: video frame, optical flow, mapped, neural network, image
Other languages: French (fr), Chinese (zh)
Inventors: HUANG Zhewei (黄哲威), HENG Wen (衡稳), ZHOU Shuchang (周舒畅)
Original assignee / applicant: Beijing Megvii Technology Co., Ltd. (北京迈格威科技有限公司)
Publication of WO2022033048A1 (en)


Classifications

    • H04N 7/0135: Conversion of standards (analogue or digital television standards processed at pixel level) involving interpolation processes
    • H04N 7/014: Conversion of standards involving interpolation processes and the use of motion vectors
    • G06T 7/248: Analysis of motion using feature-based methods (e.g., tracking of corners or segments) involving reference images or patches
    • G06T 7/269: Analysis of motion using gradient-based methods
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06T 2207/20081: Special algorithmic details: training; learning
    • G06T 2207/20084: Special algorithmic details: artificial neural networks [ANN]

Definitions

  • the present application relates to the technical field of video processing, and in particular, to a video frame insertion method, a model training method, and a corresponding device.
  • Video frame interpolation is a classic task in video processing, which aims to synthesize smoothly transitioning intermediate frames from two given frames of a video.
  • The application scenarios of video frame interpolation include: first, increasing the frame rate displayed by a device, so that users perceive the video as clearer and smoother; second, in video production and editing, assisting in achieving slow-motion effects, or adding intermediate frames between animation key frames to reduce the labor cost of animation production; third, intermediate-frame compression of video, or providing auxiliary data for other computer vision tasks.
  • Optical-flow-based video frame interpolation algorithms have been studied extensively in recent years.
  • The typical way of interpolating frames with this kind of algorithm is to first train an optical flow calculation network and use it to calculate the optical flow between the two given frames; the optical flow between the two frames is then linearly interpolated to obtain the intermediate-frame optical flow, and finally the intermediate frame, i.e., the frame to be inserted between the two frames, is obtained based on the intermediate-frame optical flow.
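
For illustration, here is a minimal sketch of this conventional two-stage pipeline in PyTorch. `flow_net` is a hypothetical pre-trained network returning the optical flow between its two input frames, and `backward_warp` implements the backward mapping used throughout this document via bilinear sampling:

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Backward mapping: sample img at positions displaced by flow.

    img:  (N, C, H, W) source frame
    flow: (N, 2, H, W) per-pixel displacement in pixels (x first, then y)
    """
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys)).float()      # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow         # where each output pixel samples from
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0   # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)      # (N, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def naive_interpolation(flow_net, i1, i2, t=0.5):
    """Prior-art pipeline: flow between the two frames + linear interpolation."""
    flow_1to2 = flow_net(i1, i2)              # optical flow from i1 to i2
    flow_mid_to_1 = -t * flow_1to2            # linear-motion approximation
    flow_mid_to_2 = (1.0 - t) * flow_1to2
    mid1 = backward_warp(i1, flow_mid_to_1)
    mid2 = backward_warp(i2, flow_mid_to_2)
    return 0.5 * (mid1 + mid2)                # simple blend as the inserted frame
```
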
  • The purpose of the embodiments of the present application includes providing a video frame interpolation method, a model training method, and corresponding devices, so as to address the above technical problems.
  • An embodiment of the present application provides a video frame interpolation method, including: acquiring a first video frame and a second video frame; based on the first video frame and the second video frame, using a first neural network to calculate the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, where the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame; using the optical flow from the first intermediate video frame to the first video frame to perform backward mapping on the first video frame to obtain a first mapped video frame, and/or using the optical flow from the first intermediate video frame to the second video frame to perform backward mapping on the second video frame to obtain a second mapped video frame; and determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  • The first video frame and the second video frame are two frames, one earlier and one later in the video (they may or may not be consecutive). When interpolating, the above method uses the first neural network to calculate the intermediate-frame optical flow (the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame) directly from the first video frame and the second video frame, rather than deriving it from the optical flow between the first video frame and the second video frame. The intermediate-frame optical flow obtained in this way is more accurate, the image quality of the first intermediate video frame obtained on this basis is better, and ghosting is less likely to appear at the edges of moving objects.
  • the steps of the above method are simple, and the frame insertion efficiency is significantly improved, so that good results can also be achieved when applied to scenarios such as real-time frame insertion and high-definition video frame insertion.
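
Under the same assumptions (and reusing `backward_warp` from the sketch above), the claimed pipeline can be sketched as follows; `first_nn` is a stand-in for the first neural network, assumed here to output the optical flow from the first intermediate video frame to the first video frame:

```python
def interpolate_frame(first_nn, i1, i2):
    # the first neural network computes the intermediate-frame optical flow
    # directly from the two input frames (no i1 -> i2 flow is computed)
    flow_mid_to_1 = first_nn(i1, i2)
    flow_mid_to_2 = -flow_mid_to_1                 # opposite-flow assumption (see below)
    warped1 = backward_warp(i1, flow_mid_to_1)     # first mapped video frame
    warped2 = backward_warp(i2, flow_mid_to_2)     # second mapped video frame
    # determine the first intermediate video frame from the mapped frames;
    # fusion and correction variants are described later in this document
    return 0.5 * (warped1 + warped2)
```
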
  • In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and calculating the optical flow from the first intermediate video frame to the first video frame based on the first and second video frames includes: determining the first image input to each optical flow calculation module according to the first video frame, and the second image input to each optical flow calculation module according to the second video frame; and using each optical flow calculation module to perform backward mapping on the first image and the second image input to it, based on the optical flow input to it, to correct that optical flow based on the first mapped image and the second mapped image obtained by the mapping, and to output the corrected optical flow. The optical flow input to the first optical flow calculation module is a preset optical flow between the first video frame and the first intermediate video frame; the optical flow input to each other optical flow calculation module is the optical flow output by the preceding module; and the optical flow output by the last optical flow calculation module is the optical flow from the first intermediate video frame to the first video frame finally calculated by the first neural network.
  • In one implementation, determining the first image input to each optical flow calculation module according to the first video frame, and the second image input to each optical flow calculation module according to the second video frame, includes one of the following: using the first video frame as the first image and the second video frame as the second image input to each optical flow calculation module; or using images obtained by downsampling the first video frame as the first images and images obtained by downsampling the second video frame as the second images, where the two downsampled images input to the same optical flow calculation module have the same shape; or using feature maps output after the first video frame is processed by convolution layers as the first images and feature maps output after the second video frame is processed by convolution layers as the second images, where the two feature maps input to the same optical flow calculation module have the same shape.
  • In one implementation, using the downsampled images as the first and second images input to each optical flow calculation module includes: downsampling the first video frame and the second video frame respectively to form an image pyramid of the first video frame and an image pyramid of the second video frame, where each layer of the image pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network, starting from the first module; and traversing the two image pyramids layer by layer from the top down, taking the two downsampled images in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
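
As an illustrative sketch, such an image pyramid can be built by extracting pixels at intervals (non-convolutional downsampling); the number of levels and the factor-2 step are assumptions:

```python
def build_image_pyramid(img, num_levels=3):
    """Return [top, ..., bottom]; the top layer is the smallest image."""
    levels = [img]
    for _ in range(num_levels - 1):
        img = img[:, :, ::2, ::2]   # extract pixels at intervals (factor-2 downsampling)
        levels.append(img)
    return levels[::-1]             # traverse from the top (coarsest) layer downwards
```
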
  • In one implementation, using the feature maps as the first and second images input to each optical flow calculation module includes: using a first feature extraction network to perform feature extraction on the first video frame and the second video frame respectively, forming a feature pyramid of the first video frame and a feature pyramid of the second video frame, where each layer of the feature pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network, starting from the first module, and the first feature extraction network is a convolutional neural network; and traversing the two feature pyramids layer by layer from the top down, taking the two feature maps in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
  • The input of the optical flow calculation module can thus be the original image (the first video frame or the second video frame), a downsampled version of the original image, or a feature map, which is very flexible.
  • When a feature map is used as the input of the optical flow calculation module, convolution calculations are required, which increases the amount of computation; however, because deeper image features are taken into account in the optical flow calculation, the optical flow calculation results are also more accurate.
  • When the original image or a downsampled original image is used as the input of the optical flow calculation module, no convolution calculation is needed, the amount of computation is small, and the optical flow calculation efficiency is high.
  • When using downsampled images as the input of the optical flow calculation modules, an image pyramid can be constructed from the original images; starting from the top layer of the pyramid (the downsampled images with smaller size and lower precision), the downsampled images are fed layer by layer into the corresponding optical flow calculation modules, so that the optical flow calculation is refined step by step.
  • Likewise, when feature maps are used as the input of the optical flow calculation modules, a feature pyramid can be constructed from the original images; starting from the top layer of the pyramid (the feature maps with smaller size and lower precision), the feature maps are fed layer by layer into the corresponding optical flow calculation modules, again refining the optical flow calculation step by step.
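
A feature pyramid can be sketched analogously with strided convolutions producing progressively smaller feature maps; the channel widths and depth below are illustrative, not taken from the patent:

```python
import torch.nn as nn

class FeaturePyramid(nn.Module):
    """A possible first feature extraction network: strided convolutions
    produce progressively smaller feature maps (built bottom layer first)."""
    def __init__(self, in_ch=3, chs=(16, 32, 64)):
        super().__init__()
        stages, prev = [], in_ch
        for c in chs:
            stages.append(nn.Sequential(nn.Conv2d(prev, c, 3, stride=2, padding=1),
                                        nn.PReLU(c)))
            prev = c
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:   # large maps are extracted first (bottom up)
            x = stage(x)
            feats.append(x)
        return feats[::-1]          # returned with the top (coarsest) layer first
```
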
  • In one implementation, correcting the optical flow input to the optical flow calculation module based on the first and second mapped images obtained by mapping, and outputting the corrected optical flow, includes: using a second neural network to predict an optical flow correction term based on the first mapped image, the second mapped image, and the optical flow input to the module; and then using the optical flow correction term to correct the optical flow input to the module and outputting the corrected optical flow.
  • In another implementation, the correction includes: using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlowNet network to correct the optical flow input to the module, based on the first mapped image and the second mapped image, and outputting the corrected optical flow.
  • The above two implementations provide two schemes for correcting the intermediate-frame optical flow: one directly transplants the optical flow correction structure of the LiteFlowNet network, and the other designs a dedicated second neural network for optical flow correction.
  • The second neural network can adopt a simple codec (encoder-decoder) architecture with low computational complexity, which helps complete the optical flow correction quickly.
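
A sketch of one optical flow calculation module under the second-neural-network scheme, reusing `backward_warp` from above; the small convolutional `refine` network is a stand-in for the second neural network (an encoder-decoder as in FIG. 5 could be substituted):

```python
import torch
import torch.nn as nn

class FlowCalcModule(nn.Module):
    """One optical flow calculation module: a small second neural network
    predicts a correction term for the Flow_mid->1 input to the module."""
    def __init__(self, img_ch=3):
        super().__init__()
        # stand-in for the second neural network
        self.refine = nn.Sequential(
            nn.Conv2d(img_ch * 2 + 2, 32, 3, padding=1), nn.PReLU(32),
            nn.Conv2d(32, 2, 3, padding=1))

    def forward(self, j1, j2, flow_mid_to_1):
        w1 = backward_warp(j1, flow_mid_to_1)    # first mapped image
        w2 = backward_warp(j2, -flow_mid_to_1)   # second mapped image (opposite flow)
        flow_res = self.refine(torch.cat([w1, w2, flow_mid_to_1], dim=1))
        return flow_mid_to_1 + flow_res          # corrected optical flow
```
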
  • In one implementation, using the first neural network to calculate both optical flows includes: using the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame, and calculating the optical flow from the first intermediate video frame to the second video frame from it; or using the first neural network to calculate the optical flow from the first intermediate video frame to the second video frame, and calculating the optical flow from the first intermediate video frame to the first video frame from it.
  • Calculating the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame includes: taking the opposite of the optical flow from the first intermediate video frame to the first video frame as the optical flow from the first intermediate video frame to the second video frame; the conversion in the other direction is analogous.
  • This implementation assumes that the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame are mutually opposite optical flows (two optical flows of the same magnitude and opposite directions), which is simple and efficient to compute. If the first video frame and the second video frame are consecutive video frames, or the frame rate of the video is high, this assumption is easy to satisfy, since any motion of objects between the frames can be approximated as an accumulation of many small linear motions.
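
The opposite-flow relation follows from the linear-motion assumption; assuming, for concreteness, insertion at the temporal midpoint, a pixel at position x in the intermediate frame that moves with constant velocity v between the two frames satisfies:

```latex
x_1 = x - \tfrac{1}{2}v, \qquad x_2 = x + \tfrac{1}{2}v
\;\Rightarrow\;
\mathrm{Flow}_{mid\to1}(x) = x_1 - x = -\tfrac{1}{2}v, \qquad
\mathrm{Flow}_{mid\to2}(x) = x_2 - x = +\tfrac{1}{2}v,
```

so Flow_mid→2 = −Flow_mid→1 at every pixel.
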
  • In one implementation, determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame includes: modifying the first mapped video frame based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame; or modifying the second mapped video frame based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame; or, based on the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, modifying a first fused video frame formed by fusing the first mapped video frame and the second mapped video frame, to obtain the first intermediate video frame.
  • In the above implementation, the initially calculated first intermediate video frame (i.e., the first mapped video frame, the second mapped video frame, or the first fused video frame) is modified to improve image quality and the interpolation effect.
  • In one implementation, modifying the first fused video frame formed by fusing the first mapped video frame and the second mapped video frame to obtain the first intermediate video frame includes: based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame, using a third neural network to predict a first image correction term and a first fusion mask; fusing the first mapped video frame and the second mapped video frame into the first fused video frame according to the indication of the pixel values in the first fusion mask; and modifying the first fused video frame with the first image correction term to obtain the first intermediate video frame.
  • a third neural network is designed to learn the method of fusion and correction of video frames, which is beneficial to improve the quality of the finally obtained first intermediate video frame.
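
A sketch of this fuse-then-correct step, with `third_nn` standing in for the third neural network; a soft mask in [0, 1] is used here as a common relaxation of the binary mask described later:

```python
def fuse_and_correct(third_nn, warped1, warped2, flow_mid_to_1):
    """Fuse the two mapped frames with a mask, then apply the correction term."""
    res1, mask1 = third_nn(warped1, warped2, flow_mid_to_1)
    # per-pixel selection/blending of the two mapped video frames
    fusion1 = mask1 * warped1 + (1.0 - mask1) * warped2   # first fused video frame
    syn1 = (fusion1 + res1).clamp(0.0, 1.0)               # corrected intermediate frame
    return syn1
```
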
  • In one implementation, the third neural network includes a second feature extraction network and a codec network, and the codec network includes an encoder and a decoder.
  • Predicting the first image correction term and the first fusion mask based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame then includes: using the second feature extraction network to perform feature extraction on the first video frame and the second video frame respectively; performing backward mapping on the feature maps extracted by the second feature extraction network using the optical flow from the first intermediate video frame to the first video frame; inputting the mapped feature maps, the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame into the encoder for feature extraction; and using the decoder to predict the first image correction term and the first fusion mask from the features extracted by the encoder.
  • the deep-level features (such as edges, textures, etc.) in the original image are extracted by designing the second feature extraction network, and these features are input into the codec network, which is beneficial to improve the effect of image correction.
  • In one implementation, determining the first intermediate video frame according to the first mapped video frame and the second mapped video frame includes: based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame, using a fourth neural network to predict a second fusion mask; and, according to the indication of the pixel values in the second fusion mask, fusing the first mapped video frame and the second mapped video frame into the first intermediate video frame.
  • designing a fourth neural network for learning a method for fusion of video frames is beneficial to improve the quality of the finally obtained first intermediate video frames.
  • An embodiment of the present application provides a model training method, including: acquiring a training sample, where the training sample includes a third video frame, a fourth video frame, and a reference video frame between the third video frame and the fourth video frame; based on the third video frame and the fourth video frame, using the first neural network to calculate the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame; using the optical flow from the second intermediate video frame to the third video frame to perform backward mapping on the third video frame to obtain a third mapped video frame, and/or using the optical flow from the second intermediate video frame to the fourth video frame to perform backward mapping on the fourth video frame to obtain a fourth mapped video frame; determining the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and calculating a prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.
  • The above method trains the first neural network used in the video frame interpolation method, so that the network can accurately calculate the intermediate-frame optical flow and improve the interpolation effect.
  • the calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to a difference between the second intermediate video frame and the reference video frame; respectively calculating the image gradient of the second intermediate video frame and the image gradient of the reference video frame, and calculating a second loss according to the difference between the image gradient of the second intermediate video frame and the image gradient of the reference video frame; The predicted loss is calculated based on the first loss and the second loss.
  • Adding a second loss that represents the difference of image gradients to the prediction loss helps alleviate the problem of blurred object edges in the generated second intermediate video frame.
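
A sketch of this combined loss, using finite differences as the image gradient and L1 distances; the weighting is an illustrative hyperparameter:

```python
def image_gradients(img):
    gx = img[..., :, 1:] - img[..., :, :-1]   # horizontal finite differences
    gy = img[..., 1:, :] - img[..., :-1, :]   # vertical finite differences
    return gx, gy

def prediction_loss(pred, ref, w2=1.0):
    first_loss = (pred - ref).abs().mean()    # pixel-wise difference
    pgx, pgy = image_gradients(pred)
    rgx, rgy = image_gradients(ref)
    second_loss = (pgx - rgx).abs().mean() + (pgy - rgy).abs().mean()
    return first_loss + w2 * second_loss      # combined prediction loss
```
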
  • In one implementation, calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating the optical flow from the reference video frame to the third video frame and/or the optical flow from the reference video frame to the fourth video frame using a pre-trained fifth neural network; calculating a third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the third loss.
  • In the above implementation, the optical flow calculated by the pre-trained fifth neural network is used as a label for supervised training of the first neural network, realizing optical flow knowledge transfer (specifically, a third loss is added to the prediction loss). This helps improve the accuracy with which the first neural network predicts the intermediate-frame optical flow, and thereby the quality of the finally obtained first intermediate video frame.
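
A sketch of this knowledge-transfer (third) loss, with a frozen `teacher_net` standing in for the pre-trained fifth neural network and an L1 distance as the illustrative difference measure:

```python
import torch

def optical_flow_distillation_loss(student_flow, teacher_net, ref, i3):
    """Supervise the student's intermediate-frame flow with the flow a
    pre-trained teacher computes from the reference frame to frame 3."""
    with torch.no_grad():                    # fifth neural network stays frozen
        teacher_flow = teacher_net(ref, i3)  # optical flow: reference -> third frame
    return (student_flow - teacher_flow).abs().mean()
```
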
  • In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each optical flow calculation module outputs the optical flow, corrected by that module, from the second intermediate video frame to the third video frame. Calculating the prediction loss according to the second intermediate video frame and the reference video frame then includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating the optical flow from the reference video frame to the third video frame using the pre-trained fifth neural network; calculating a fourth loss according to the differences between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the fourth loss.
  • In the above implementation, the optical flow calculated by the pre-trained fifth neural network is likewise used as a label for supervised training of the first neural network, realizing optical flow knowledge transfer (specifically, a fourth loss is added to the prediction loss), which helps improve the accuracy with which the first neural network predicts the intermediate-frame optical flow and thereby the quality of the finally obtained first intermediate video frame.
  • Moreover, since the optical flow calculation result is generated gradually from coarse to fine, a loss can be computed on the output of each optical flow calculation module and accumulated to obtain the fourth loss, which helps adjust the parameters of each optical flow calculation module more precisely and improves the prediction ability of every module.
  • In one implementation, calculating the third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network includes: performing backward mapping on the third video frame using the optical flow calculated by the fifth neural network to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate; and calculating the third loss according to the differences between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the second effective optical flow vectors in the corresponding optical flow calculated by the first neural network, where the first effective optical flow vectors are the optical flow vectors calculated accurately by the fifth neural network, and the second effective optical flow vectors are the optical flow vectors, in the corresponding optical flow calculated by the first neural network, located at the pixel positions of the first effective optical flow vectors.
  • In one implementation, calculating the fourth loss according to the differences between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network includes: performing backward mapping on the third video frame using the optical flow calculated by the fifth neural network to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate; and calculating the fourth loss according to the differences between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the third effective optical flow vectors in the optical flow output by each optical flow calculation module, where the first effective optical flow vectors are the optical flow vectors calculated accurately by the fifth neural network, and the third effective optical flow vectors are the optical flow vectors, in the optical flow output by each optical flow calculation module, located at the pixel positions corresponding to the first effective optical flow vectors.
  • Since the optical flow vectors calculated at some pixel positions may be inaccurate due to boundary ambiguity and occluded areas, instead of using the entire optical flow calculated by the fifth neural network as a label for supervised learning of the first neural network, only the accurately calculated optical flow vectors are used as optical flow labels, as in the above two implementations.
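
A sketch of this per-pixel label filtering, reusing `backward_warp`; the photometric-error threshold is an illustrative hyperparameter:

```python
def masked_flow_loss(student_flow, teacher_flow, i3, ref, thresh=0.05):
    """Keep the teacher's flow as a label only where it is accurate."""
    recon = backward_warp(i3, teacher_flow)               # fifth mapped video frame
    err = (recon - ref).abs().mean(dim=1, keepdim=True)   # per-pixel photometric error
    valid = (err < thresh).float()        # 1 where the teacher's flow vector is accurate
    diff = (student_flow - teacher_flow).abs().sum(dim=1, keepdim=True)
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)
```
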
  • In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each optical flow calculation module corrects the optical flow input to it using the descriptor matching unit, sub-pixel refinement layer, and regularization layer of the LiteFlowNet network. In this case, before calculating the optical flow from the second intermediate video frame to the third video frame based on the third video frame and the fourth video frame using the first neural network, the method further includes: initializing the parameters of the first neural network with parameters obtained by pre-training the LiteFlowNet network.
  • Since the optical flow calculation modules in the first neural network are obtained by structure migration from the LiteFlowNet network, the parameters of the LiteFlowNet network can be loaded directly as their initial values when training the first neural network, and the parameters can then be fine-tuned on this basis. This not only speeds up the convergence of the first neural network but also helps improve its performance.
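
Parameter initialization of the migrated structure can be sketched generically as name-and-shape matching against a checkpoint; the checkpoint path and any naming correspondence are hypothetical:

```python
import torch

def init_from_liteflownet(first_nn, ckpt_path="liteflownet_pretrained.pth"):
    """Load pre-trained parameters whose names and shapes match, then fine-tune."""
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own = first_nn.state_dict()
    # copy every tensor of the migrated structure that matches by name and shape
    matched = {k: v for k, v in pretrained.items()
               if k in own and own[k].shape == v.shape}
    own.update(matched)
    first_nn.load_state_dict(own)
    return first_nn   # fine-tune from these initial values
```
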
  • In one implementation, determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame includes: based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame, using the third neural network to predict a second image correction term and a third fusion mask; fusing the third mapped video frame and the fourth mapped video frame into a second fused video frame according to the indication of the pixel values in the third fusion mask; and modifying the second fused video frame with the second image correction term to obtain the second intermediate video frame. Calculating the prediction loss according to the second intermediate video frame and the reference video frame and updating the parameters of the first neural network according to the prediction loss then includes: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of both the first neural network and the third neural network according to the prediction loss.
  • Since the third neural network is used for image correction when the first neural network performs frame interpolation, the third neural network can be trained together with the first neural network in the model training stage, which helps simplify the training process.
  • In one implementation, determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame includes: based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame, using the fourth neural network to predict a fourth fusion mask; and fusing the third mapped video frame and the fourth mapped video frame into the second intermediate video frame according to the indication of the pixel values in the fourth fusion mask. Calculating the prediction loss according to the second intermediate video frame and the reference video frame and updating the parameters of the first neural network according to the prediction loss then includes: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of both the first neural network and the fourth neural network according to the prediction loss.
  • Since the fourth neural network is used for image fusion when the first neural network performs frame interpolation, the fourth neural network can be trained together with the first neural network in the model training stage, which helps simplify the training process.
  • An embodiment of the present application provides a video frame interpolation device, including: a first video frame obtaining unit, configured to obtain a first video frame and a second video frame; a first optical flow calculation unit, configured to calculate, based on the first video frame and the second video frame and using the first neural network, the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, where the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame; a first backward mapping unit, configured to perform backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or to perform backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame; and a first intermediate frame determination unit, configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  • An embodiment of the present application provides a model training apparatus, including: a second video frame obtaining unit, configured to obtain a training sample, where the training sample includes a third video frame, a fourth video frame, and a reference video frame between the third video frame and the fourth video frame; a second optical flow calculation unit, configured to calculate, based on the third video frame and the fourth video frame and using the first neural network, the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame; a second backward mapping unit, configured to perform backward mapping on the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or to perform backward mapping on the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame; and a second intermediate frame determination unit, configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame.
  • The embodiments of the present application provide a computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are read and run by a processor, the video frame interpolation method provided in the embodiments of the present application is executed.
  • An embodiment of the present application provides an electronic device, including a memory and a processor, where computer program instructions are stored in the memory; when the computer program instructions are read and run by the processor, the video frame interpolation method provided in the embodiments of the present application is executed.
  • Fig. 1 shows a possible flow of the video frame insertion method provided by the embodiment of the present application
  • Fig. 2 shows a possible network architecture of the video frame insertion method provided by the embodiment of the present application
  • FIG. 3 shows a possible structure of the first neural network provided by the embodiment of the present application
  • FIG. 4 shows a method for constructing a first image and a second image by a feature pyramid
  • FIG. 5 shows a possible structure of the second neural network provided by the embodiment of the present application.
  • FIG. 6 shows a possible structure of the third neural network provided by the embodiment of the present application.
  • FIG. 7 shows a possible process of the model training method provided by the embodiment of the present application.
  • FIG. 8 shows a possible network architecture of the model training method provided by the embodiment of the present application.
  • FIG. 9 shows a possible structure of a video frame insertion device provided by an embodiment of the present application.
  • FIG. 10 shows another possible structure of the video frame insertion device provided by the embodiment of the present application.
  • FIG. 11 shows a possible structure of the electronic device provided by the embodiment of the present application.
  • FIG. 1 shows a possible flow of the video frame insertion method provided by the embodiment of the present application
  • FIG. 2 shows a network architecture that can be used in the method, for reference when describing the video frame insertion method.
  • the method in FIG. 1 may be performed by, but is not limited to, the electronic device shown in FIG. 11 .
  • the method includes:
  • Step S110 Acquire the first video frame and the second video frame.
  • The first video frame and the second video frame are two frames, one before and one after the position where a frame is to be inserted; they may or may not be consecutive. Apart from this temporal relationship, the present application does not limit the selection of the first video frame and the second video frame.
  • For convenience of description, the first video frame is denoted as I1 and the second video frame as I2.
  • Step S120 Based on the first video frame and the second video frame, use the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame.
  • The first intermediate video frame is a video frame to be inserted between I1 and I2.
  • The application does not limit the insertion position of the first intermediate video frame; for example, it may or may not be inserted at the temporal midpoint between I1 and I2.
  • The first intermediate video frame is denoted as I_syn1.
  • The core of frame interpolation is obtaining I_syn1; once I_syn1 is obtained, inserting it into the video is straightforward.
  • The solution of the present application obtains I_syn1 based on the optical flow of the first intermediate video frame.
  • The optical flow of the first intermediate video frame includes the optical flow from the first intermediate video frame to the first video frame, denoted Flow_mid→1, and the optical flow from the first intermediate video frame to the second video frame, denoted Flow_mid→2.
  • In one implementation, I1 and I2 may be input to the first neural network, and Flow_mid→1 and Flow_mid→2 may each be predicted by the first neural network.
  • In another implementation, Flow_mid→1 may be calculated using the first neural network, and Flow_mid→2 converted from Flow_mid→1, as shown in FIG. 2 (Flow_mid→2 is not shown in the figure).
  • Likewise, it is also possible to use the first neural network to calculate Flow_mid→2 and to convert Flow_mid→1 from Flow_mid→2.
  • In the latter implementations, the two required optical flows can be obtained with only one optical flow calculation by the first neural network, which significantly improves the efficiency of the optical flow calculation.
  • For example, it may be considered that Flow_mid→1 and Flow_mid→2 are mutually opposite optical flows; once one of them is obtained, the other can be calculated by taking its opposite.
  • FIG. 3 shows the structure of a first neural network that can calculate Flow_mid→1.
  • The first neural network includes at least one optical flow calculation module connected in sequence (three optical flow calculation modules are shown in the figure). Each optical flow calculation module corrects the optical flow input to it and outputs the corrected optical flow.
  • The optical flow input to the first optical flow calculation module is a preset Flow_mid→1. Since no optical flow calculation has been performed at this point, the preset optical flow can take a default value, such as zero (meaning that all optical flow vectors contained in the optical flow are zero).
  • After the first optical flow calculation module corrects the preset Flow_mid→1, it outputs the correction result, which can be regarded as the Flow_mid→1 calculated by the first optical flow calculation module.
  • Each subsequent optical flow calculation module corrects the Flow_mid→1 output by the preceding module and outputs its own correction result, which can be regarded as the Flow_mid→1 calculated by that module.
  • For the last optical flow calculation module, the output Flow_mid→1 is the optical flow finally calculated by the first neural network. It can be seen that within the first neural network, the calculation result of Flow_mid→1 is continuously revised from coarse to fine, finally yielding a relatively accurate optical flow.
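
Putting the pieces together, the first neural network's forward pass can be sketched as a coarse-to-fine loop over pyramid levels, reusing `FlowCalcModule` and `build_image_pyramid` from the sketches above; resizing and doubling the flow between levels is a common pyramid convention, not specified by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstNeuralNetwork(nn.Module):
    """Chains optical flow calculation modules over an image pyramid."""
    def __init__(self, num_levels=3):
        super().__init__()
        self.flow_modules = nn.ModuleList(FlowCalcModule()
                                          for _ in range(num_levels))

    def forward(self, i1, i2):
        pyr1 = build_image_pyramid(i1, len(self.flow_modules))  # top (coarsest) first
        pyr2 = build_image_pyramid(i2, len(self.flow_modules))
        # preset optical flow input to the first module: all zeros
        flow = torch.zeros_like(pyr1[0][:, :2])
        for k, module in enumerate(self.flow_modules):
            if k > 0:
                # carry the previous estimate to the finer level; doubling the
                # displacement magnitudes matches the factor-2 resolution step
                flow = 2.0 * F.interpolate(flow, size=pyr1[k].shape[-2:],
                                           mode="bilinear", align_corners=False)
            flow = module(pyr1[k], pyr2[k], flow)   # corrected Flow_mid->1
        return flow
```
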
  • Each optical flow calculation module has a similar structure, as shown on the left side of Figure 3.
  • Besides the optical flow, the input of an optical flow calculation module also includes a first image and a second image, denoted J1 and J2 respectively for convenience of explanation; the J1 and J2 input to different optical flow calculation modules are not necessarily the same.
  • J1 is determined according to I1 and J2 according to I2, which may specifically include, but is not limited to, one of the following ways:
  • Mode (1): take I1 directly as J1 and I2 directly as J2, and input I1 and I2 to each optical flow calculation module. Mode (1) requires no extra computation to prepare the module inputs, which helps improve the efficiency of the optical flow calculation.
  • Mode (2): take a feature map output after I1 is processed by convolution layers as J1, and a feature map output after I2 is processed by convolution layers as J2. Since I1 and I2 can yield multiple feature maps of different scales after being processed by multiple convolution layers, different optical flow calculation modules can receive feature maps of different scales, but the J1 and J2 input to the same optical flow calculation module have the same shape.
  • Mode (2) requires convolution calculations to prepare the module inputs, which increases the amount of computation; however, because deeper image features are taken into account in the optical flow calculation, the optical flow calculation result is also more accurate.
  • In one implementation, a first feature extraction network may be used to extract features from I1 and I2 respectively, forming a feature pyramid of I1 and a feature pyramid of I2, where the first feature extraction network is a convolutional neural network, each layer of the feature pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network, starting from the first module, and the feature maps in the same layer of the two pyramids have the same shape.
  • For example, in FIG. 4, the first (top) layer corresponds to optical flow calculation module 1, the second layer to optical flow calculation module 2, and the third layer to optical flow calculation module 3.
  • Each layer of a feature pyramid is a feature map; the feature maps of the i-th layers of the pyramids of I1 and I2 are denoted F1-i and F2-i respectively, and F1-i and F2-i have the same shape.
  • In the feature pyramid, the top layer corresponds to feature maps with smaller size and lower precision, and the bottom layer to feature maps with larger size and higher precision. Feeding the feature maps layer by layer, starting from the top, into the corresponding optical flow calculation modules allows the optical flow calculation to be refined gradually. Note that, owing to the nature of convolutional neural networks, large feature maps are extracted first and small ones later; that is, the feature pyramid is constructed from the bottom layer up.
  • Since I1 and I2 can themselves be regarded as special feature maps, it is not excluded in mode (2) to use I1 and I2 as the J1 and J2 of the first optical flow calculation module.
  • Mode (3): take an image obtained by downsampling I1 as J1 and an image obtained by downsampling I2 as J2, where the two downsampled images input to the same optical flow calculation module have the same shape.
  • Convolution can also be regarded as a form of downsampling to a certain extent, but the downsampling in mode (3) should be understood as excluding downsampling through convolution; for example, pixels may be extracted directly from the original image at intervals determined by the downsampling factor.
  • In one implementation, I1 and I2 may be downsampled to form an image pyramid of I1 and an image pyramid of I2, where each layer of the image pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network, starting from the first module.
  • The structure of an image pyramid is similar to that of a feature pyramid, except that the pyramid is built from downsampled original images (of I1 or I2) rather than from feature maps.
  • In the image pyramid, the top layer corresponds to a downsampled image with smaller size and lower precision, and the bottom layer to a downsampled image with larger size and higher precision.
  • The downsampled images are input layer by layer, starting from the top, into the corresponding optical flow calculation modules, which allows the optical flow calculation to be refined gradually.
  • Large downsampled images are generated first and small ones later; that is, the image pyramid is also constructed from the bottom layer up.
  • Since I1 and I2 can themselves be regarded as special downsampled images (with a downsampling factor of 1), it is not excluded in mode (3) to use I1 and I2 as the J1 and J2 of the first optical flow calculation module.
  • Within each optical flow calculation module, backward mapping (backward warp) is first performed on the J1 input to the module, based on the Flow_mid→1 input to the module, to obtain a first mapped image, denoted Ĵ1, i.e., Ĵ1 = warp(J1, Flow_mid→1); backward mapping is likewise performed on the J2 input to the module to obtain a second mapped image, denoted Ĵ2 (per the opposite-flow assumption above, Ĵ2 = warp(J2, −Flow_mid→1)).
  • The optical flow calculation module includes an optical flow correction module, which takes the Flow_mid→1 input to the optical flow calculation module together with Ĵ1 and Ĵ2 as input, corrects Flow_mid→1, and outputs the corrected Flow_mid→1; this is also the output of the optical flow calculation module.
  • Two implementations of the optical flow correction module are listed below; it can be understood that the optical flow correction module can also adopt other implementations:
  • Implementation 1: based on Ĵ1, Ĵ2 and the Flow_mid→1 input to the module, a second neural network is used to predict an optical flow correction term Flow_res; the Flow_mid→1 input to the optical flow calculation module is then added to Flow_res (either by direct addition or by weighted summation) to obtain the corrected Flow_mid→1.
  • the second neural network can adopt a relatively simple network structure, so as to reduce the amount of calculation and improve the efficiency of optical flow correction, thus speeding up the calculation speed of the optical flow by the optical flow calculation module.
  • the second neural network may employ an encoder-decoder network, and FIG. 5 shows a possible structure of the second neural network.
  • the left part of the network (R1 to R4) is the encoder and the right part (D1 to D4) is the decoder.
  • The three items of data (Ĵ1, Ĵ2 and Flow_mid→1) are concatenated and input to R1.
  • The features extracted by each encoding module except R4 are also fed into the decoder and added to the output of the corresponding decoding module.
  • The features extracted by R4 are output directly to D4, and D1 outputs the optical flow correction term Flow_res predicted by the second neural network.
  • In addition, the intermediate outputs of the second neural network (the outputs of the convolution and deconvolution layers) can be batch-normalized, with PReLU as the nonlinear activation function. It can be understood that FIG. 5 is only an example, and the second neural network can also adopt other structures.
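
In that spirit, here is a sketch of an encoder-decoder second neural network with four encoding and four decoding modules, additive skip connections, batch normalization, and PReLU; the channel width is illustrative, and the input height and width are assumed divisible by 16:

```python
import torch
import torch.nn as nn

class SecondNeuralNetwork(nn.Module):
    """Encoder-decoder predicting the optical flow correction term Flow_res."""
    def __init__(self, in_ch=3 * 2 + 2, ch=32):
        super().__init__()
        def enc(ci, co):   # encoding module (R1..R4): downsample by 2
            return nn.Sequential(nn.Conv2d(ci, co, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(co), nn.PReLU(co))
        def dec(ci, co):   # decoding module (D4..D2): upsample by 2
            return nn.Sequential(nn.ConvTranspose2d(ci, co, 4, stride=2, padding=1),
                                 nn.BatchNorm2d(co), nn.PReLU(co))
        self.r1, self.r2 = enc(in_ch, ch), enc(ch, ch)
        self.r3, self.r4 = enc(ch, ch), enc(ch, ch)
        self.d4, self.d3, self.d2 = dec(ch, ch), dec(ch, ch), dec(ch, ch)
        self.d1 = nn.ConvTranspose2d(ch, 2, 4, stride=2, padding=1)  # outputs Flow_res

    def forward(self, w1, w2, flow):
        x = torch.cat([w1, w2, flow], dim=1)   # splice the three inputs for R1
        e1 = self.r1(x)
        e2 = self.r2(e1)
        e3 = self.r3(e2)
        e4 = self.r4(e3)
        # skip connections: encoder outputs added to matching decoder outputs
        y = self.d4(e4) + e3
        y = self.d3(y) + e2
        y = self.d2(y) + e1
        return self.d1(y)                      # optical flow correction term
```
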
  • The LiteFlowNet network is an existing network that can be used for optical flow calculation, but it can only calculate the optical flow between two given frames, such as the optical flow Flow_1→2 from the first video frame to the second video frame; it cannot be used directly to calculate the intermediate-frame optical flow Flow_mid→1.
  • In the NetE part of the LiteFlowNet network there is a structure similar to the optical flow correction module, called the optical flow inference module. This structure can be roughly divided into three parts: a descriptor matching unit, a sub-pixel refinement unit, and a regularization unit.
  • Implementation 2: the optical flow inference module can be migrated directly to serve as the optical flow correction module of the present application, but the input of each part needs to be modified to a certain extent:
  • The input of the descriptor matching unit becomes Ĵ1, Ĵ2 and the uncorrected Flow_mid→1, and a matching cost volume between Ĵ1 and Ĵ2 is calculated inside the descriptor matching unit.
  • The four items of information (Ĵ1, Ĵ2, the uncorrected Flow_mid→1, and the calculated matching cost volume) are input to the convolutional neural network inside the descriptor matching unit, which finally outputs the Flow_mid→1 calculated by the descriptor matching unit.
  • The matching cost volume is used to measure the degree of coincidence between the mapped images Ĵ1 and Ĵ2.
  • The input of the sub-pixel refinement layer becomes Ĵ1, Ĵ2 and the Flow_mid→1 output by the descriptor matching unit; the sub-pixel refinement layer corrects the input Flow_mid→1 at sub-pixel accuracy and outputs the corrected Flow_mid→1.
  • The input of the regularization layer becomes Ĵ1, Ĵ2 and the Flow_mid→1 output by the sub-pixel refinement layer; the regularization layer smooths the input Flow_mid→1 and outputs the corrected Flow_mid→1, which is the output of the optical flow correction module.
  • In addition, the NetC part of the LiteFlowNet network constructs a feature pyramid, so its convolution layers can also be migrated to the present solution as the first feature extraction network, used to extract the J1 and J2 that are input to the optical flow calculation modules.
  • This approach effectively reuses existing optical flow research results; however, since the LiteFlowNet network contains more operators, its operation is more complicated.
  • Step S130: performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain the first mapped video frame, and performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain the second mapped video frame.
  • Specifically, Flow_mid→1 may be used to perform backward mapping on I1 to obtain the first mapped video frame, denoted Î1, i.e., Î1 = warp(I1, Flow_mid→1); and Flow_mid→2 may be used to perform backward mapping on I2 to obtain the second mapped video frame, denoted Î2, i.e., Î2 = warp(I2, Flow_mid→2), as shown in FIG. 2.
  • Step S140 Determine the first intermediate video frame according to the first mapped video frame and the second mapped video frame.
  • In one implementation, Î1 and Î2 may be fused to obtain the first fused video frame, denoted I_fusion1; I_fusion1 is then corrected according to Flow_mid→1 and/or Flow_mid→2, and the corrected image is used as I_syn1, which helps improve the image quality of I_syn1 and the interpolation effect.
  • Of course, I_fusion1 may also be corrected according to only Flow_mid→1 or only Flow_mid→2.
  • The frame fusion and image correction processes above can be performed in sequence; for example, Î1 and Î2 may first be averaged to obtain I_fusion1, and a neural network may then be designed to correct I_fusion1.
  • the process of frame fusion and image correction can also be implemented based on a neural network, that is, using the neural network to learn the methods of video frame fusion and image correction at the same time, as shown in Figure 2.
  • Specifically, Î1, Î2 and Flow_mid→1 are first input to the third neural network, which predicts the first image correction term and the first fusion mask, denoted I_res1 and mask1 respectively.
  • For example, each pixel value in mask1 may take only 0 or 1: a pixel value of 0 at a position indicates that the pixel value of I_fusion1 at that position is taken from one of the mapped video frames (say Î1), and a pixel value of 1 indicates that it is taken from the other (Î2).
  • Finally, I_fusion1 is corrected with I_res1 to obtain I_syn1.
  • In one implementation, the third neural network includes a second feature extraction network and a codec network. Its working principle is as follows: first, the second feature extraction network performs feature extraction on I1 and I2 respectively; then Flow_mid→1 is used to perform backward mapping on the feature maps extracted by the second feature extraction network; the mapped feature maps, Î1, Î2, and Flow_mid→1 are then input to the encoder of the codec network for feature extraction; finally, the decoder of the codec network predicts I_res1 and mask1 from the features extracted by the encoder.
  • Figure 6 shows an implementation of the third neural network in accordance with the above description.
  • The left part of the network (C1 to C3) is the second feature extraction network, and the right part is the codec network; the main structure of the codec network is similar to that in FIG. 5 and is not repeated here.
  • Each Ci represents one or more convolution layers, so the second feature extraction network constructs two 3-layer feature pyramids.
  • Using Flow_mid→1, the feature maps F1-i and F2-i are respectively backward mapped, and the resulting mapped feature maps are denoted warp(F1-i) and warp(F2-i).
  • warp(F1-i) and warp(F2-i) are concatenated with the output of encoding module Ri to form the input of encoding module Ri+1.
  • FIG. 6 is only an example, and the third neural network can also adopt other structures.
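
A sketch of a third neural network in the spirit of FIG. 6, reusing `backward_warp`: C1 to C3 build two small feature pyramids, the features are backward mapped with a correspondingly resized Flow_mid→1, and each encoder stage consumes the warped features of one level; all widths, depths, and the soft (sigmoid) mask are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThirdNeuralNetwork(nn.Module):
    """Predicts the image correction term I_res1 and the fusion mask mask1."""
    def __init__(self, ch=32):
        super().__init__()
        conv = lambda ci, co: nn.Sequential(
            nn.Conv2d(ci, co, 3, stride=2, padding=1), nn.PReLU(co))
        # C1..C3: second feature extraction network (two 3-layer feature pyramids)
        self.c1, self.c2, self.c3 = conv(3, ch), conv(ch, ch), conv(ch, ch)
        # encoder: later stages also receive the two warped feature maps
        self.r1 = conv(3 * 2 + 2, ch)
        self.r2, self.r3 = conv(ch * 3, ch), conv(ch * 3, ch)
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(ch * 3, 4, 3, padding=1))  # 3 ch for I_res1 + 1 ch for mask1

    def pyramid(self, img):
        f1 = self.c1(img)
        f2 = self.c2(f1)
        f3 = self.c3(f2)
        return [f1, f2, f3]

    @staticmethod
    def resize_flow(flow, like):
        scale = like.shape[-1] / flow.shape[-1]   # rescale displacement magnitudes
        return F.interpolate(flow, size=like.shape[-2:], mode="bilinear",
                             align_corners=False) * scale

    def forward(self, i1, i2, w1, w2, flow):
        fs1, fs2 = self.pyramid(i1), self.pyramid(i2)
        x = self.r1(torch.cat([w1, w2, flow], dim=1))   # 1/2 resolution
        for i, stage in ((0, self.r2), (1, self.r3)):
            f = self.resize_flow(flow, fs1[i])
            x = stage(torch.cat([x, backward_warp(fs1[i], f),
                                 backward_warp(fs2[i], f)], dim=1))
        f = self.resize_flow(flow, fs1[2])
        x = torch.cat([x, backward_warp(fs1[2], f), backward_warp(fs2[2], f)], dim=1)
        out = self.head(x)                              # back to full resolution
        res1 = out[:, :3]
        mask1 = torch.sigmoid(out[:, 3:])   # soft mask in [0, 1]; text's example is 0/1
        return res1, mask1
```
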
  • the deep-level features (such as edges, textures, etc.) in the original image are extracted by designing the second feature extraction network, and these features are input into the codec network, which is beneficial to improve the effect of image correction.
  • In the above scheme, I res1 and mask1 are predicted by the third neural network, but in other implementations the scheme can be simplified: first, warp(I 1 ), warp(I 2 ) and Flow mid→1 are input to the fourth neural network; the fourth neural network is then used to predict the second fusion mask, denoted mask2; finally, according to the indication of the pixel values in mask2, warp(I 1 ) and warp(I 2 ) are directly fused into I syn1 .
  • These implementations do not need to calculate I res1 , so the calculation process is simpler, and the fourth neural network can also focus on learning the fusion mask.
  • the design of the fourth neural network may refer to the third neural network, which will not be described in detail here.
  • Step A1: Obtain the first video frame and the second video frame;
  • Step A2: Based on the first video frame and the second video frame, use the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame;
  • Step A3: Use the optical flow from the first intermediate video frame to the first video frame to perform backward mapping on the first video frame to obtain the first mapped video frame;
  • Step A4: Determine the first intermediate video frame according to the first mapped video frame.
  • the first mapped video frame may be directly used as the first intermediate video frame; the first mapped video frame may also be modified based on the optical flow from the first intermediate video frame to the first video frame , to obtain the first intermediate video frame.
  • a neural network can be designed to modify the first mapped video frame.
  • The structure of this neural network can refer to the third neural network, but since it does not involve video frame fusion, the neural network only needs to predict the image correction term.
  • For the details of steps A1 to A4, reference may be made to steps S110 to S140; they will not be described again.
  • Step B1: Obtain the first video frame and the second video frame;
  • Step B2: Based on the first video frame and the second video frame, use the first neural network to calculate the optical flow from the first intermediate video frame to the second video frame;
  • Step B3: Use the optical flow from the first intermediate video frame to the second video frame to perform backward mapping on the second video frame to obtain the second mapped video frame;
  • Step B4: Determine the first intermediate video frame according to the second mapped video frame. The second mapped video frame may be directly used as the first intermediate video frame; alternatively, the second mapped video frame may be modified based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame.
  • For the details of steps B1 to B4, reference may be made to steps S110 to S140; they will not be described again.
  • When performing video frame insertion, the above frame insertion method directly calculates the intermediate-frame optical flow (referring to the optical flow from the first intermediate video frame to the first video frame and/or from the first intermediate video frame to the second video frame) based on the first video frame and the second video frame using the first neural network, without using the optical flow between the first video frame and the second video frame to synthesize the intermediate-frame optical flow.
  • The intermediate-frame optical flow thus obtained is highly accurate, the first intermediate video frame obtained on this basis has good image quality, and ghosting is not easily produced at the edges of moving objects.
  • the steps of the above method are simple, and the frame insertion efficiency is significantly improved, so that good results can also be achieved when applied to scenarios such as real-time frame insertion and high-definition video frame insertion.
  • Compared with backward mapping, forward mapping needs to solve the fusion problem when multiple points are mapped to the same position, and current hardware support for forward mapping is insufficient, so this application mainly takes backward mapping as an example; however, this is not intended to preclude forward-mapping schemes.
  • FIG. 7 shows a possible flow of the model training method provided by the embodiment of the present application; the method can be used to train the first neural network used in the video frame insertion method of FIG. 1.
  • Figure 8 shows a network architecture that can be used in the method for reference when describing the model training method.
  • the method in FIG. 7 may be performed by, but is not limited to, the electronic device shown in FIG. 11 , and for the structure of the electronic device, reference may be made to the following description about FIG. 11 .
  • the method includes:
  • Step S210: Obtain training samples. The training set consists of multiple training samples, and each training sample is used in a similar manner during training; therefore, any one training sample can be taken as an example to illustrate the training process.
  • Each training sample may include 3 video frames, namely the third video frame, the fourth video frame, and the reference video frame located between the third video frame and the fourth video frame, and these 3 video frames are denoted as I 3 respectively , I 4 and I mid , as shown in FIG. 8 .
  • the video frame to be inserted between I 3 and I 4 is the second intermediate video frame, denoted as I syn2
  • I mid corresponds to I syn2 , representing the real video frame at the position of I syn2 (that is, the ground truth of the intermediate frame) .
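Each training sample is therefore a frame triplet. As a concrete illustration (hypothetical, not from the patent), a minimal PyTorch-style dataset over a decoded frame sequence might look like this; the class name and the `frames` variable are assumptions:

```python
from torch.utils.data import Dataset

class TripletDataset(Dataset):
    """`frames` is assumed to be a list of decoded, equally spaced
    video frames stored as (C, H, W) tensors."""

    def __init__(self, frames):
        self.frames = frames

    def __len__(self):
        return max(0, len(self.frames) - 2)

    def __getitem__(self, idx):
        i3 = self.frames[idx]         # third video frame
        i_mid = self.frames[idx + 1]  # reference (ground-truth) frame
        i4 = self.frames[idx + 2]     # fourth video frame
        return i3, i4, i_mid
```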
  • Step S220 Based on the third video frame and the fourth video frame, use the first neural network to calculate the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame.
  • Similar to step S120, the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame are denoted as Flow mid→3 and Flow mid→4 , respectively.
  • Step S230 performing backward mapping on the third video frame by using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and using the optical flow from the second intermediate video frame to the fourth video frame Perform backward mapping on the fourth video frame to obtain a fourth mapped video frame.
  • Flow mid→3 can be used to perform backward mapping on I 3 to obtain the third mapped video frame, denoted warp(I 3 ), and Flow mid→4 can be used to perform backward mapping on I 4 to obtain the fourth mapped video frame, denoted warp(I 4 ), as shown in Figure 8.
  • Step S240 Determine the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame.
  • The above scheme can also be simplified: first, warp(I 3 ), warp(I 4 ) and Flow mid→3 are input to the fourth neural network; the fourth neural network is then used to predict the fourth fusion mask, denoted mask4; finally, according to the indication of the pixel values in mask4, warp(I 3 ) and warp(I 4 ) are directly fused into I syn2 .
  • Alternatively, image correction may be omitted; for example, warp(I 3 ) and warp(I 4 ) may be directly averaged to obtain I syn2 .
  • Step S250 Calculate the prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
  • In computing I syn2 , the first neural network is necessarily used; therefore, after the prediction loss is calculated, the parameters of the first neural network can be updated using the back-propagation algorithm.
  • If the third neural network is used in step S240, then in step S250 the parameters of the third neural network are also updated together; that is, the third neural network and the first neural network are trained together, which can simplify the training process.
  • Likewise, if the fourth neural network is used in step S240, then in step S250 the parameters of the fourth neural network are also updated together; that is, the fourth neural network and the first neural network are trained together.
  • steps S210 to S250 are iteratively executed, and the training is terminated when the training termination condition (eg, model convergence, etc.) is satisfied.
  • the prediction loss can be uniformly expressed by the following formula:
  • Loss sum = Loss l1 + α·Loss sobel + β·Loss epe + γ·Loss multiscale-epe
  • Loss sum is the total prediction loss; there are four losses on the right-hand side, namely the first loss Loss l1 , the second loss Loss sobel , the third loss Loss epe and the fourth loss Loss multiscale-epe . The first loss is the basic loss and must be included when calculating the prediction loss.
  • The other three losses are optional: depending on the implementation, one or more of them may be added, or none at all, noting that the third loss and the fourth loss cannot be added at the same time.
  • α, β and γ are weighting coefficients, used as hyperparameters of the network. It should be understood that other loss terms may also be added to the right-hand side of the equation. Each loss is described in detail below:
  • the first loss is calculated according to the difference between I syn2 and I mid , and the purpose of setting the first loss is to make I syn2 closer to I mid through learning, that is, to make the image quality of the intermediate frame better.
  • the difference between I syn2 and I mid can be defined as the pixel-by-pixel distance between the two, for example, when using the L1 distance:
  • Loss l1 = Σ i Σ j |I syn2 (i, j) − I mid (i, j)|
  • where i and j together represent a pixel position.
  • the second loss is calculated according to the difference between the image gradient of I syn2 and the image gradient of I mid .
  • the purpose of setting the second loss is to improve the problem of blurred object edges of the generated I syn2 through learning (the image gradient corresponds to the edge in the image information).
  • the image gradient can be calculated by applying a gradient operator to the image, such as Sobel operator, Roberts operator, Prewitt operator, etc.
  • The difference between the image gradient of I syn2 and the image gradient of I mid can be defined as the pixel-by-pixel distance between the two. For example, when using the Sobel operator and the L1 distance:
  • Loss sobel = Σ i Σ j |Sobel(I syn2 )(i, j) − Sobel(I mid )(i, j)|
  • where Sobel(·) represents using the Sobel operator to calculate the image gradient of an image.
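A sketch of the first and second losses under these definitions is given below, assuming single-channel images for brevity and treating the weight alpha as a hyperparameter; this is illustrative, not the patent's code:

```python
import torch
import torch.nn.functional as F

# 3x3 Sobel kernels, shaped (out_ch, in_ch, kH, kW).
SOBEL_X = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_grad(img):
    # Image gradient via the Sobel operator (one map per direction).
    gx = F.conv2d(img, SOBEL_X.to(img), padding=1)
    gy = F.conv2d(img, SOBEL_Y.to(img), padding=1)
    return torch.cat([gx, gy], dim=1)

def interpolation_loss(i_syn2, i_mid, alpha=1.0):
    loss_l1 = (i_syn2 - i_mid).abs().sum()                             # first loss
    loss_sobel = (sobel_grad(i_syn2) - sobel_grad(i_mid)).abs().sum()  # second loss
    return loss_l1 + alpha * loss_sobel
```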
  • optical flow labels can be set to perform supervised training of the first neural network.
  • The third loss is calculated from the difference between the Flow mid→3 calculated by the first neural network and the corresponding optical flow calculated by a pre-trained fifth neural network; the purpose of setting the third loss is to improve, through learning, the accuracy of the Flow mid→3 calculated by the first neural network.
  • This loss realizes optical flow knowledge transfer from the fifth neural network to the first neural network.
  • The difference between Flow mid→3 and the optical flow calculated by the fifth neural network (written Flow* mid→3 below for illustration) can be defined as the distance between the optical flow vectors contained in the two, for example the L2 distance, expressed as follows:
  • Loss epe = Σ i Σ j ‖Flow mid→3 (i, j) − Flow* mid→3 (i, j)‖ 2
  • The first neural network includes at least one optical flow calculation module (see FIG. 3 for its structure); each optical flow calculation module outputs a version of Flow mid→3 corrected by that module, so that Flow mid→3 is calculated from coarse to fine.
  • each optical flow calculation module can be supervised by using the optical flow label to improve the optical flow calculation capability of each optical flow calculation module.
  • Specifically, for each optical flow calculation module, a loss is calculated from the difference between the Flow mid→3 it outputs and the optical flow calculated by the fifth neural network (following the calculation of the third loss), and these losses are then accumulated to obtain the fourth loss.
  • The formula for calculating the fourth loss is as follows (with Flow k mid→3 denoting the optical flow output by the k-th optical flow calculation module, and the fifth network's flow resized to each module's resolution where necessary):
  • Loss multiscale-epe = Σ k=1..n Σ i Σ j ‖Flow k mid→3 (i, j) − Flow* mid→3 (i, j)‖ 2
  • where n represents the total number of optical flow calculation modules.
  • Like the third loss, the fourth loss realizes optical flow knowledge transfer from the fifth neural network to the first neural network, and calculating the fourth loss helps adjust the parameters of each optical flow calculation module more precisely, though the fourth loss is computationally more complex.
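Illustratively, the fourth loss can be sketched as an end-point-error term accumulated over all optical flow calculation modules; resizing the fifth network's flow to each module's resolution (and rescaling its magnitude accordingly) is an assumption about the multiscale handling:

```python
import torch.nn.functional as F

def multiscale_epe_loss(flows_per_module, flow_teacher):
    """`flows_per_module`: list of (N, 2, H_k, W_k) student flows,
    `flow_teacher`: (N, 2, H, W) flow from the fifth neural network."""
    loss = 0.0
    for flow_k in flows_per_module:
        target = flow_teacher
        if target.shape[-2:] != flow_k.shape[-2:]:
            # Match the teacher flow to this module's resolution, rescaling
            # displacement magnitudes along with the spatial size.
            scale = flow_k.shape[-1] / target.shape[-1]
            target = F.interpolate(target, size=flow_k.shape[-2:],
                                   mode="bilinear", align_corners=False) * scale
        # Per-pixel L2 norm of the flow difference, summed over all pixels.
        loss = loss + (flow_k - target).pow(2).sum(dim=1).sqrt().sum()
    return loss
```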
  • The inventor's long-term research has found that when the fifth neural network performs optical flow calculation, the optical flow vectors calculated at some pixel positions may be inaccurate due to boundary ambiguity and occluded areas.
  • In this case, not every optical flow vector is used as a label for supervised learning of the first neural network; only the more accurately calculated optical flow vectors are used as optical flow labels.
  • the specific method is as follows:
  • The optical flow calculated by the fifth neural network is first used to backward map the third video frame, obtaining a fifth mapped video frame; the difference between this mapped frame and I mid then determines whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate. For example, the mean L1 distance between the fifth mapped video frame and I mid can be computed at each pixel (the mean is taken because a video frame may be a multi-channel image); if the mean L1 distance at a pixel is greater than a certain threshold, the optical flow vector calculated by the fifth neural network at that pixel position is considered inaccurate; otherwise it is considered accurate.
  • An accurately calculated optical flow vector may be called a first effective optical flow vector. Experience shows that first effective optical flow vectors account for the vast majority of the optical flow vectors calculated by the fifth neural network, because the fifth neural network is equivalent to calculating the intermediate-frame optical flow with the intermediate frame already known, so its accuracy is largely guaranteed.
  • the third loss or the fourth loss is calculated according to the first effective optical flow vector in the optical flow calculated by the fifth neural network:
  • When calculating the third loss, it is calculated from the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the second effective optical flow vectors in the Flow mid→3 calculated by the first neural network; here, a second effective optical flow vector refers to the optical flow vector located, in the Flow mid→3 calculated by the first neural network, at the pixel position corresponding to a first effective optical flow vector.
  • For example, if the optical flow vector at position (1, 1) in the optical flow calculated by the fifth neural network is a first effective optical flow vector, then the optical flow vector at position (1, 1) in the Flow mid→3 calculated by the first neural network is a second effective optical flow vector.
  • Similarly, when calculating the fourth loss, it is calculated from the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the third effective optical flow vectors in the Flow mid→3 output by each optical flow calculation module of the first neural network (the differences are calculated separately for each module and then accumulated).
  • Here, a third effective optical flow vector refers to the optical flow vector located, in the Flow mid→3 output by each optical flow calculation module, at the pixel position corresponding to a first effective optical flow vector.
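A sketch of this selection follows, reusing the backward_warp helper from the earlier sketch; the threshold value is an assumption:

```python
def effective_flow_mask(i3, i_mid, flow_teacher, threshold=0.04):
    # Fifth mapped video frame: I3 backward-mapped with the teacher flow.
    warped = backward_warp(i3, flow_teacher)
    # Per-pixel mean L1 distance across channels, shape (N, 1, H, W).
    err = (warped - i_mid).abs().mean(dim=1, keepdim=True)
    return (err <= threshold).float()  # 1 where the teacher flow is accurate

def masked_epe_loss(flow_student, flow_teacher, mask):
    # Supervise only the pixels whose teacher flow vectors are "effective".
    epe = (flow_student - flow_teacher).pow(2).sum(dim=1, keepdim=True).sqrt()
    return (epe * mask).sum()
```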
  • In some implementations, the optical flow calculation modules in the first neural network are obtained by structure migration from the LiteFlownet network (that is, in step S220, each optical flow calculation module uses the descriptor matching unit, the sub-pixel correction layer and the regularization layer migrated from the LiteFlownet network to correct the optical flow input to that module).
  • the parameters obtained by the pre-training of the LiteFlownet network can be directly loaded as the initial values of its parameters, and on this basis, the parameters can be fine-tuned.
  • This transfer learning approach can not only accelerate the convergence of the first neural network but also improve its performance.
  • For example, the LiteFlownet network may be pre-trained on the FlyingChairs dataset, although pre-training is not limited to that dataset.
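As a hedged illustration of such initialization in PyTorch, one might load only the parameter entries whose names and shapes match the migrated submodules before fine-tuning end to end; the checkpoint path and key layout below are placeholders, not the real LiteFlowNet release format:

```python
import torch

def load_pretrained_liteflownet(first_net, ckpt_path="liteflownet.pth"):
    # Assumes the checkpoint stores a plain state dict.
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own = first_net.state_dict()
    # Keep only entries whose names and shapes match the migrated submodules.
    compatible = {k: v for k, v in pretrained.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    first_net.load_state_dict(own)
```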
  • In the scheme above, the second intermediate video frame is generated by fusing the third mapped video frame and the fourth mapped video frame (possibly with correction), but there are also schemes in which the second intermediate video frame is obtained directly from the third mapped video frame or the fourth mapped video frame alone (again possibly with correction).
  • The specific steps of these schemes are as follows:
  • Step C1 obtaining training samples, the training samples include the third video frame, the fourth video frame and the reference video frame;
  • Step C2 based on the third video frame and the fourth video frame, use the first neural network to calculate the optical flow from the second intermediate video frame to the third video frame;
  • Step C3 using the optical flow from the second intermediate video frame to the third video frame to perform backward mapping on the third video frame to obtain the third mapped video frame;
  • Step C4 determining the second intermediate video frame according to the third mapped video frame
  • Step C5 Calculate the prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
  • Step D1 obtaining training samples, the training samples include a third video frame, a fourth video frame and a reference video frame;
  • Step D2 based on the third video frame and the fourth video frame, use the first neural network to calculate the optical flow from the second intermediate video frame to the fourth video frame;
  • Step D3 using the optical flow from the second intermediate video frame to the fourth video frame to perform backward mapping on the fourth video frame to obtain the fourth mapped video frame;
  • Step D4 determining the second intermediate video frame according to the fourth mapped video frame
  • Step D5 Calculate the prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
  • If a neural network is used to modify the fourth mapped video frame in step D4, then in step D5 that neural network can have its parameters updated together with the first neural network.
  • For the details of steps D1 to D5, reference may be made to steps S210 to S250; they will not be described again.
  • When the third loss or the fourth loss is used, the calculation result of the fifth neural network should correspond to that of the first neural network. For example, if the first neural network calculates the optical flow from the second intermediate video frame to the third video frame (scheme C), the fifth neural network should calculate the optical flow between the third video frame and the reference video frame based on those two frames; if the first neural network calculates the optical flow from the second intermediate video frame to the fourth video frame (scheme D), the fifth neural network should calculate the optical flow between the fourth video frame and the reference video frame based on those two frames.
  • The video frame insertion apparatus 300 provided by the embodiment of the present application includes: a first video frame obtaining unit 310, configured to obtain the first video frame and the second video frame;
  • a first optical flow calculation unit 320 configured to use a first neural network to calculate an optical flow and/or an optical flow from a first intermediate video frame to the first video frame based on the first video frame and the second video frame The optical flow from the first intermediate video frame to the second video frame; wherein, the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame;
  • a first backward mapping unit 330 configured to perform backward mapping on the first video frame by using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or , using the optical flow from the first intermediate video frame to the second video frame to perform backward mapping on the second video frame to obtain a second mapped video frame;
  • a first intermediate frame determining unit 340 configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  • the first neural network includes at least one optical flow calculation module connected in sequence
  • In an implementation, the first optical flow calculation unit 320, based on the first video frame and the second video frame, uses the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame by: determining the first image input to each optical flow calculation module according to the first video frame, and determining the second image input to each optical flow calculation module according to the second video frame; and using each optical flow calculation module to perform backward mapping on the first image and the second image input to that module based on the optical flow input to that module, to correct the optical flow input to that module based on the first mapped image and the second mapped image obtained by the mapping, and to output the corrected optical flow; wherein the optical flow input to the first optical flow calculation module is the preset optical flow between the first video frame and the first intermediate video frame, the optical flow input to each other optical flow calculation module is the optical flow output by the previous optical flow calculation module, and the optical flow output by the last optical flow calculation module is the optical flow from the first intermediate video frame to the first video frame calculated by the first neural network.
  • In an implementation, the first optical flow calculation unit 320 determines the first image input to each optical flow calculation module according to the first video frame, and determines the second image input to each optical flow calculation module according to the second video frame, in one of the following ways: using the first video frame as the first image input to each optical flow calculation module and the second video frame as the second image input to each optical flow calculation module; or using the image obtained by down-sampling the first video frame as the first image input to each optical flow calculation module and the image obtained by down-sampling the second video frame as the second image input to each optical flow calculation module, wherein the two down-sampled images input to the same optical flow calculation module have the same shape; or using the feature map output after the first video frame is processed by a convolutional layer as the first image input to each optical flow calculation module and the feature map output after the second video frame is processed by a convolutional layer as the second image input to each optical flow calculation module, wherein the two feature maps input to the same optical flow calculation module have the same shape.
  • In an implementation, the first optical flow calculation unit 320 uses the images obtained by down-sampling the first video frame and the second video frame as the first images and second images input to the optical flow calculation modules by: down-sampling the first video frame and the second video frame respectively to form an image pyramid of the first video frame and an image pyramid of the second video frame, where each layer of the image pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network starting from the first optical flow calculation module; and traversing the two image pyramids layer by layer from the top, using the two down-sampled images in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
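A sketch of building such matched image pyramids follows; the number of levels and the 0.5 scale factor are assumptions:

```python
import torch.nn.functional as F

def image_pyramid(frame, levels=3):
    """Return [coarsest, ..., finest], so index k feeds module k+1."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(F.interpolate(pyramid[-1], scale_factor=0.5,
                                     mode="bilinear", align_corners=False))
    return pyramid[::-1]  # top (smallest) layer first

# p1, p2 = image_pyramid(i1), image_pyramid(i2)
# Module k receives (p1[k], p2[k]) together with the flow from module k-1,
# upsampled to the current resolution (an assumed detail of the coarse-to-fine scheme).
```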
  • In an implementation, the first optical flow calculation unit 320 uses the feature maps output after the first video frame and the second video frame are processed by convolutional layers as the first images and second images input to the optical flow calculation modules by: using the first feature extraction network to perform feature extraction on the first video frame and the second video frame respectively, forming a feature pyramid of the first video frame and a feature pyramid of the second video frame, where each layer starting from the top layer of the feature pyramids corresponds to an optical flow calculation module of the first neural network starting from the first optical flow calculation module, and the first feature extraction network is a convolutional neural network; and traversing the two feature pyramids layer by layer from the top, using the two feature maps in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
  • In an implementation, the first optical flow calculation unit 320 corrects the optical flow input to an optical flow calculation module based on the first mapped image and the second mapped image obtained by mapping, and outputs the corrected optical flow, by: using the second neural network to predict an optical flow correction term based on the first mapped image, the second mapped image and the optical flow input to that module; and using the optical flow correction term to correct the optical flow input to that module, outputting the corrected optical flow.
  • In an implementation, the first optical flow calculation unit 320 corrects the optical flow input to an optical flow calculation module and outputs the corrected optical flow by: using the descriptor matching unit, the sub-pixel correction layer and the regularization layer in the LiteFlownet network to correct the optical flow input to that module based on the first mapped image and the second mapped image obtained by mapping, and outputting the corrected optical flow.
  • In an implementation, the first optical flow calculation unit 320 uses the first neural network to calculate, based on the first video frame and the second video frame, the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame by: using the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame, and calculating the optical flow from the first intermediate video frame to the second video frame from it; or using the first neural network to calculate the optical flow from the first intermediate video frame to the second video frame, and calculating the optical flow from the first intermediate video frame to the first video frame from it.
  • In an implementation, the first optical flow calculation unit 320 calculates the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame by inverting the optical flow from the first intermediate video frame to the first video frame; likewise, it calculates the optical flow from the first intermediate video frame to the first video frame according to the optical flow from the first intermediate video frame to the second video frame by inverting the optical flow from the first intermediate video frame to the second video frame.
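In tensor terms this inversion is simply a negation, under the implicit assumption of approximately symmetric motion about the intermediate frame; the variable names are assumptions:

```python
flow_mid_to_2 = -flow_mid_to_1
# or, equivalently, in the opposite direction:
# flow_mid_to_1 = -flow_mid_to_2
```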
  • the first intermediate frame determining unit 340 determines the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame, including: based on The optical flow from the first intermediate video frame to the first video frame modifies the first mapped video frame to obtain the first intermediate video frame; or, based on the first intermediate video frame to the first video frame The optical flow of the second video frame modifies the second mapped video frame to obtain the first intermediate video frame; or, based on the optical flow from the first intermediate video frame to the first video frame and/or The optical flow from the first intermediate video frame to the second video frame is modified by modifying the first fused video frame formed after the fusion of the first mapped video frame and the second mapped video frame, to obtain the The first intermediate video frame.
  • In an implementation, the first intermediate frame determining unit 340 modifies, based on the optical flow from the first intermediate video frame to the first video frame, the first fused video frame formed after the fusion of the first mapped video frame and the second mapped video frame, to obtain the first intermediate video frame, by: using the third neural network to predict the first image correction term and the first fusion mask based on the first mapped video frame, the second mapped video frame and the optical flow from the first intermediate video frame to the first video frame; fusing, according to the indication of the pixel values in the first fusion mask, the first mapped video frame and the second mapped video frame into the first fused video frame; and modifying the first fused video frame with the first image correction term to obtain the first intermediate video frame.
  • the third neural network includes a second feature extraction network and a codec network
  • the codec network includes an encoder and a decoder
  • In an implementation, the first intermediate frame determining unit 340 uses the third neural network to predict the first image correction term and the first fusion mask by: using the second feature extraction network to perform feature extraction on the first video frame and the second video frame respectively; using the optical flow from the first intermediate video frame to the first video frame to perform backward mapping on the feature maps extracted by the second feature extraction network; inputting the mapped feature maps obtained by the mapping, the first mapped video frame, the second mapped video frame and the optical flow from the first intermediate video frame to the first video frame to the encoder for feature extraction; and using the decoder to predict the first image correction term and the first fusion mask according to the features extracted by the encoder.
  • For the video frame insertion apparatus 300 provided by the embodiments of the present application, the implementation principle and resulting technical effects have been introduced in the foregoing method embodiments; for parts not mentioned in the device embodiment, reference may be made to the corresponding content in the method embodiments.
  • FIG. 10 shows a functional block diagram of a model training apparatus 400 provided by an embodiment of the present application.
  • the model training device 400 includes:
  • a second video frame obtaining unit 410 configured to obtain training samples, where the training samples include a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
  • a second optical flow calculation unit 420, configured to use the first neural network to calculate, based on the third video frame and the fourth video frame, the optical flow from the second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame; wherein the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame;
  • a second backward mapping unit 430 configured to perform backward mapping on the third video frame by using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or , using the optical flow from the second intermediate video frame to the fourth video frame to perform backward mapping on the fourth video frame to obtain a fourth mapped video frame;
  • a second intermediate frame determining unit 440 configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
  • a parameter updating unit 450 configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and update parameters of the first neural network according to the prediction loss.
  • In an implementation, the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame by: calculating the first loss according to the difference between the second intermediate video frame and the reference video frame; calculating the image gradient of the second intermediate video frame and the image gradient of the reference video frame respectively, and calculating the second loss according to the difference between the two image gradients; and calculating the prediction loss according to the first loss and the second loss.
  • In an implementation, the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame by: calculating the first loss according to the difference between the second intermediate video frame and the reference video frame; using the pre-trained fifth neural network to calculate the optical flow from the reference video frame to the third video frame and/or the optical flow from the reference video frame to the fourth video frame; calculating the third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the third loss.
  • In an implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each optical flow calculation module outputs the optical flow from the second intermediate video frame to the third video frame as corrected by that module; the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame by: calculating the first loss according to the difference between the second intermediate video frame and the reference video frame; using the pre-trained fifth neural network to calculate the optical flow from the reference video frame to the third video frame; calculating the fourth loss according to the difference between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the fourth loss.
  • In an implementation, the parameter updating unit 450 calculates the third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network by: using the optical flow calculated by the fifth neural network to perform backward mapping on the third video frame to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector at each pixel position calculated by the fifth neural network is accurate; and calculating the third loss according to the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the corresponding second effective optical flow vectors in the optical flow calculated by the first neural network; wherein a first effective optical flow vector refers to an accurate optical flow vector calculated by the fifth neural network, and a second effective optical flow vector refers to the optical flow vector located, in the corresponding optical flow calculated by the first neural network, at the pixel position corresponding to a first effective optical flow vector.
  • In an implementation, the parameter updating unit 450 calculates the fourth loss according to the difference between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network by: using the optical flow calculated by the fifth neural network to perform backward mapping on the third video frame to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector at each pixel position calculated by the fifth neural network is accurate; and calculating the fourth loss according to the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the third effective optical flow vectors in the optical flow output by each optical flow calculation module; wherein a first effective optical flow vector refers to an accurate optical flow vector calculated by the fifth neural network, and a third effective optical flow vector refers to the optical flow vector located, in the optical flow output by each optical flow calculation module, at the pixel position corresponding to a first effective optical flow vector.
  • In an implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each optical flow calculation module uses the descriptor matching unit, the sub-pixel correction layer and the regularization layer in the LiteFlownet network to correct the optical flow input to that module;
  • the device further includes a parameter initialization unit, configured to initialize the parameters of the first neural network with the parameters obtained by pre-training the LiteFlownet network, before the second optical flow calculation unit 420 uses the first neural network to calculate, based on the third video frame and the fourth video frame, the optical flow from the second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame.
  • In an implementation, the second intermediate frame determining unit 440 determines the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame by: using the third neural network to predict the second image correction term and the third fusion mask based on the third mapped video frame, the fourth mapped video frame and the optical flow from the second intermediate video frame to the third video frame; fusing, according to the indication of the pixel values in the third fusion mask, the third mapped video frame and the fourth mapped video frame into the second fused video frame; and modifying the second fused video frame with the second image correction term to obtain the second intermediate video frame.
  • Correspondingly, the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame and updates, according to the prediction loss, the parameters of both the first neural network and the third neural network.
  • In an implementation, the second intermediate frame determining unit 440 determines the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame by: using the fourth neural network to predict the fourth fusion mask based on the third mapped video frame, the fourth mapped video frame and the optical flow from the second intermediate video frame to the third video frame; and fusing, according to the indication of the pixel values in the fourth fusion mask, the third mapped video frame and the fourth mapped video frame into the second intermediate video frame.
  • Correspondingly, the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame and updates, according to the prediction loss, the parameters of both the first neural network and the fourth neural network.
  • For the model training apparatus 400 provided by the embodiments of the present application, the implementation principle and resulting technical effects have been introduced in the foregoing method embodiments; for parts not mentioned in the device embodiment, reference may be made to the corresponding content in the method embodiments.
  • the embodiment of the present application also provides a video frame insertion device, including:
  • a third video frame obtaining unit used for obtaining the first video frame and the second video frame
  • a third optical flow calculation unit, configured to use the first neural network to calculate, based on the first video frame and the second video frame, the optical flow from the first video frame to the first intermediate video frame and/or the optical flow from the second video frame to the first intermediate video frame;
  • a first forward mapping unit configured to perform forward mapping on the first video frame by using the optical flow from the first video frame to the first intermediate video frame to obtain a first mapped video frame, and/or, Using the optical flow from the second video frame to the first intermediate video frame to perform forward mapping on the second video frame to obtain a second mapped video frame;
  • a third intermediate frame determining unit configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  • the above-mentioned video frame insertion device is similar to the video frame insertion device 300, and the difference mainly includes the use of forward mapping to replace the backward mapping in the video frame insertion device 300.
  • For the various possible implementations of this video frame insertion device, reference may also be made to the video frame insertion apparatus 300; they will not be repeated here.
  • the embodiment of the present application also provides a model training device, including:
  • a fourth video frame obtaining unit configured to obtain a training sample, the training sample includes a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
  • a fourth optical flow calculation unit configured to use the first neural network to calculate the optical flow from the third video frame to the second intermediate video frame and/or the optical flow based on the third video frame and the fourth video frame The optical flow from the fourth video frame to the second intermediate video frame; wherein, the second intermediate video frame is the video frame to be inserted between the third video frame and the fourth video frame;
  • a second forward mapping unit, configured to perform forward mapping on the third video frame using the optical flow from the third video frame to the second intermediate video frame to obtain a third mapped video frame, and/or perform forward mapping on the fourth video frame using the optical flow from the fourth video frame to the second intermediate video frame to obtain a fourth mapped video frame;
  • a third intermediate frame determining unit configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame
  • a second parameter updating unit configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
  • This model training device is similar to the model training device 400; the difference mainly lies in using forward mapping in place of the backward mapping in the model training device 400.
  • For its various possible implementations, reference may also be made to the model training device 400; they will not be repeated here.
  • FIG. 11 shows a possible structure of the electronic device 500 provided by the embodiment of the present application.
  • an electronic device 500 includes a processor 510, a memory 520, and a communication interface 530, and these components are interconnected and communicate with each other through a communication bus 540 and/or other forms of connection mechanisms (not shown).
  • The memory 520 includes one or more memories (only one is shown in the figure), which may be, but are not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and so on.
  • the processor 510 and possibly other components may access the memory 520, read and/or write data therein.
  • The processor 510 includes one or more processors (only one is shown in the figure), which may be an integrated circuit chip with signal processing capability.
  • The processor 510 may be a general-purpose processor, including a central processing unit (CPU), a micro controller unit (MCU), a network processor (NP) or another conventional processor; it may also be a special-purpose processor, including a graphics processing unit (GPU), a neural-network processing unit (NPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. Moreover, when there are multiple processors 510, some of them may be general-purpose processors and the others special-purpose processors.
  • the communication interface 530 includes one or more (only one is shown in the figure), which can be used to communicate directly or indirectly with other devices for data exchange.
  • Communication interface 530 may include an interface for wired and/or wireless communication.
  • One or more computer program instructions may be stored in the memory 520, and the processor 510 may read and execute these computer program instructions to implement the video frame insertion method and/or the model training method provided by the embodiments of the present application.
  • the structure shown in FIG. 11 is only for illustration, and the electronic device 500 may further include more or less components than those shown in FIG. 11 , or have different configurations from those shown in FIG. 11 .
  • Each component shown in FIG. 11 may be implemented in hardware, software, or a combination thereof.
  • the electronic device 500 may be a physical device, such as a PC, a notebook computer, a tablet computer, a mobile phone, a server, an embedded device, etc., or a virtual device, such as a virtual machine, a virtualized container, and the like.
  • the electronic device 500 is not limited to a single device, and may be a combination of a plurality of devices or a cluster composed of a large number of devices.
  • Embodiments of the present application further provide a computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are read and executed by a processor of a computer, the video frame insertion method and/or the model training method provided by the embodiments of the present application are performed.
  • For example, the computer-readable storage medium may be implemented as the memory 520 in the electronic device 500 in FIG. 11.
  • In summary, the present application provides a video frame insertion method, a model training method and corresponding devices, which are applied in video processing to achieve a better frame interpolation effect with a less time-consuming interpolation process.

Abstract

A video frame interpolation method, a model training method, and a corresponding device. The video frame interpolation method comprises: acquiring a first video frame and a second video frame; utilizing, on the basis of the first video frame and the second video frame, a first neural network to calculate an optical flow between the first video frame and a first intermediate video frame and/or an optical flow between the second video frame and the first intermediate video frame; utilizing the optical flow between the first video frame and the first intermediate video frame to reverse map the first video frame to acquire a first mapped video frame, and/or, utilizing the optical flow between the second video frame and the first intermediate video frame to reverse map the second video frame to acquire a second mapped video frame; and determining the first intermediate video frame on the basis of the first mapped video frame and/or of the second mapped video frame. In the method, the accuracy of calculating the optical flow of an intermediate frame is increased; therefore, the image quality of the first intermediate video frame ultimately acquired is improved, and the efficiency of frame interpolation using the method is increased.

Description

Video frame insertion method, model training method and corresponding device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 202010815538.3, entitled "Video Frame Insertion Method, Model Training Method and Corresponding Device", filed with the China Patent Office on August 13, 2020, the entire contents of which are incorporated by reference in this application.
TECHNICAL FIELD
The present application relates to the technical field of video processing, and in particular, to a video frame insertion method, a model training method, and a corresponding device.
BACKGROUND
Video frame interpolation is a classic task in video processing, which aims to synthesize a smoothly transitioning intermediate frame from two frames, one before and one after, in a video. Application scenarios of video frame interpolation include: first, increasing the video frame rate displayed by a device, making the video appear clearer and smoother; second, in video production and editing, helping to realize slow-motion effects, or adding intermediate frames between animation key frames to reduce the labor cost of animation production; third, intermediate-frame compression of video, or providing auxiliary data for other computer vision tasks.
Optical-flow-based video frame interpolation algorithms have been studied extensively in recent years. A typical way of interpolating with such algorithms is: first train an optical flow network and use it to calculate the optical flow between the preceding and following frames, then linearly interpolate this optical flow to obtain the intermediate-frame optical flow, and finally obtain the intermediate frame, i.e., the frame to be inserted between the two, from the intermediate-frame optical flow. However, because the intermediate-frame optical flow is synthesized from the optical flow between the preceding and following frames, the edges of moving objects in the resulting intermediate frame are prone to ghosting, the interpolation effect is poor, and the steps of existing algorithms are relatively complicated, making the interpolation process time-consuming.
SUMMARY OF THE INVENTION
The purposes of the embodiments of the present application include providing a video frame insertion method, a model training method and corresponding devices, so as to improve the above technical problems.
To achieve the above purpose, the present application provides the following technical solutions:
An embodiment of the present application provides a video frame insertion method, including: obtaining a first video frame and a second video frame; based on the first video frame and the second video frame, using a first neural network to calculate the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, wherein the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame; performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame; and determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
The first video frame and the second video frame are two frames, one before the other, in a video (they may or may not be consecutive). When inserting frames, the above method directly calculates the intermediate-frame optical flow (the optical flow from the first intermediate video frame to the first video frame and/or from the first intermediate video frame to the second video frame) based on the first video frame and the second video frame using the first neural network, without using the optical flow between the first video frame and the second video frame. The intermediate-frame optical flow thus obtained is more accurate, the first intermediate video frame obtained on this basis has better image quality, and ghosting is less likely at the edges of moving objects. In addition, the method's steps are simple and the frame insertion efficiency is significantly improved, so good results can also be achieved in scenarios such as real-time frame insertion and high-definition video frame insertion.
In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and computing the optical flow from the first intermediate video frame to the first video frame based on the first and second video frames includes: determining, according to the first video frame, a first image input to each optical flow calculation module, and determining, according to the second video frame, a second image input to each optical flow calculation module; using each optical flow calculation module to backward-map the first image and the second image input to it based on the optical flow input to it, to correct the input optical flow based on the first mapped image and the second mapped image thus obtained, and to output the corrected optical flow. The optical flow input to the first optical flow calculation module is a preset optical flow between the first video frame and the first intermediate video frame, the optical flow input to every other module is the optical flow output by the preceding module, and the optical flow output by the last module is the optical flow from the first intermediate video frame to the first video frame computed by the first neural network.

In the above implementation, by arranging at least one optical flow calculation module in the first neural network, the intermediate-frame optical flow estimate is corrected step by step, so that an accurate intermediate-frame optical flow is finally obtained.
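A minimal sketch of this refinement loop follows, reusing the `backward_warp` helper above; it assumes all inputs share one resolution (resizing of the flow between scales is omitted) and warps the second image with the negated flow, per the opposite-flow relation described below.

```python
import torch

def estimate_intermediate_flow(modules, images1, images2, flow_shape):
    """Run the cascade of optical flow calculation modules.

    modules: the flow calculation modules, first to last
    images1: the first image J_1 fed to each module
    images2: the second image J_2 fed to each module
    """
    flow = torch.zeros(flow_shape)                     # preset (zero) flow
    for module, j1, j2 in zip(modules, images1, images2):
        w1 = backward_warp(j1, flow)                   # first mapped image
        w2 = backward_warp(j2, -flow)                  # second mapped image
        flow = module(w1, w2, flow)                    # corrected flow out
    return flow                                        # last module's output
```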
In one implementation, determining the first image and the second image input to each optical flow calculation module according to the first and second video frames includes one of the following: using the first video frame as the first image and the second video frame as the second image of every module; or using images obtained by downsampling the first video frame as the first images and images obtained by downsampling the second video frame as the second images, where the two downsampled images input to the same module have the same shape; or using feature maps output after the first video frame is processed by convolutional layers as the first images and feature maps output after the second video frame is processed by convolutional layers as the second images, where the two feature maps input to the same module have the same shape.

In one implementation, using the downsampled images as inputs includes: downsampling the first video frame and the second video frame separately to form an image pyramid for each, where each level of a pyramid, counted from the top, corresponds to one optical flow calculation module of the first neural network, counted from the first; and traversing the two pyramids level by level from the top down, using the two downsampled images at the same level as the first image and the second image input to the module corresponding to that level.

In one implementation, using the feature maps as inputs includes: performing feature extraction on the first video frame and the second video frame separately with a first feature extraction network, which is a convolutional neural network, to form a feature pyramid for each, where each level of a pyramid, counted from the top, corresponds to one optical flow calculation module of the first neural network, counted from the first; and traversing the two pyramids level by level from the top down, using the two feature maps at the same level as the first image and the second image input to the module corresponding to that level.

In the above three implementations, the input of an optical flow calculation module can be the original images (the first and second video frames), downsampled versions of them, or feature maps, which is very flexible. Using feature maps as input requires convolution and is computationally more expensive, but because deeper features of the images are taken into account, the optical flow result is also more accurate. Using the original images or their downsampled versions requires no convolution, so the computation cost is lower and the optical flow is computed more efficiently.

When downsampled images are used as input, an image pyramid can first be built from the original frames, and the downsampled images are fed level by level into the corresponding modules starting from the top of the pyramid (the smaller, lower-precision images), so that the optical flow estimate is refined progressively. Similarly, when feature maps are used as input, a feature pyramid can first be built from the original frames, and the feature maps are fed level by level into the corresponding modules starting from the top of the pyramid (the smaller, lower-precision feature maps), again refining the estimate progressively.
In one implementation, correcting the input optical flow based on the first and second mapped images and outputting the corrected flow includes: predicting an optical flow correction term with a second neural network, based on the first mapped image, the second mapped image, and the optical flow input to the module; correcting the input optical flow with the correction term; and outputting the corrected optical flow.

In one implementation, correcting the input optical flow based on the first and second mapped images and outputting the corrected flow includes: using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network to correct the optical flow input to the module and output the corrected flow.

The above two implementations give two schemes for correcting the intermediate-frame optical flow: the first directly transplants the flow-correction structure of the LiteFlownet network, and the second designs a dedicated second neural network. For example, the second neural network can adopt a simple encoder-decoder architecture, whose small computation cost helps complete the flow correction quickly.
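For illustration, a second neural network of this kind might be a small encoder-decoder predicting a flow correction term; the sketch below is an assumption, not the patented architecture: the layer widths are arbitrary and the input height and width are assumed divisible by 4.

```python
import torch
import torch.nn as nn

class FlowCorrection(nn.Module):
    """Illustrative second neural network: a small encoder-decoder that
    predicts a flow correction term from the two mapped images and the
    current flow estimate (channel counts are arbitrary assumptions)."""

    def __init__(self, ch):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * ch + 2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # flow residual
        )

    def forward(self, w1, w2, flow):
        x = torch.cat([w1, w2, flow], dim=1)    # mapped images + current flow
        return flow + self.decoder(self.encoder(x))       # corrected flow
```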
In one implementation, computing both the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame with the first neural network includes: computing the flow to the first video frame with the first neural network and deriving the flow to the second video frame from it; or computing the flow to the second video frame with the first neural network and deriving the flow to the first video frame from it.

In the above implementation, a conversion relation holds between the two flows, so obtaining either one suffices to compute the other; there is no need to run the optical flow computation twice through the first neural network, which significantly improves the efficiency of the flow computation.

In one implementation, the derivation is by negation: the optical flow from the first intermediate video frame to the first video frame, negated, serves as the optical flow from the first intermediate video frame to the second video frame, and vice versa.

In the above implementation, assuming that objects move linearly between the first and second video frames (straight-line motion at uniform velocity), the two intermediate-frame flows are opposite optical flows, i.e. opposite in direction and equal in magnitude, and each is computed from the other simply and efficiently. If the first and second video frames are consecutive, or the frame rate of the video is high, this assumption is easy to satisfy, since any in-frame motion can then be approximated as the accumulation of many short linear motions.
In one implementation, determining the first intermediate video frame from the first mapped video frame and/or the second mapped video frame includes: correcting the first mapped video frame based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame; or correcting the second mapped video frame based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame; or correcting a first fused video frame, formed by fusing the first mapped video frame and the second mapped video frame, based on the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, to obtain the first intermediate video frame.

In the above implementations, the preliminarily computed first intermediate video frame (the first mapped video frame, the second mapped video frame, or the first fused video frame) is corrected to improve image quality and the interpolation result.

In one implementation, correcting the first fused video frame based on the optical flow from the first intermediate video frame to the first video frame includes: predicting a first image correction term and a first fusion mask with a third neural network, based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame; fusing the first mapped video frame and the second mapped video frame into the first fused video frame as indicated by the pixel values of the first fusion mask; and correcting the first fused video frame with the first image correction term to obtain the first intermediate video frame.

In the above implementation, a third neural network is designed to learn how to fuse and correct the video frames, which helps improve the quality of the first intermediate video frame finally obtained.
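The fusion-and-correction step itself can be written compactly; the sketch below assumes the predicted mask takes values in [0, 1] (e.g. a sigmoid output) and that frames are normalized to [0, 1].

```python
def fuse_and_correct(warp1, warp2, mask, correction):
    """Blend the two mapped frames per the fusion mask, then apply the
    image correction term; output is clamped to the valid pixel range."""
    fused = mask * warp1 + (1.0 - mask) * warp2        # first fused video frame
    return (fused + correction).clamp(0.0, 1.0)        # first intermediate frame
```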
In one implementation, the third neural network includes a second feature extraction network and a codec network, the codec network including an encoder and a decoder, and predicting the first image correction term and the first fusion mask based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame includes: performing feature extraction on the first video frame and the second video frame separately with the second feature extraction network; backward-mapping the extracted feature maps with the optical flow from the first intermediate video frame to the first video frame; inputting the mapped feature maps, the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame into the encoder for feature extraction; and predicting the first image correction term and the first fusion mask with the decoder from the features extracted by the encoder.

In the above implementation, the second feature extraction network is designed to extract deep features of the original frames (such as edges and textures), and feeding these features into the codec network helps improve the image correction.

In one implementation, determining the first intermediate video frame from the first mapped video frame and the second mapped video frame includes: predicting a second fusion mask with a fourth neural network, based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame; and fusing the first mapped video frame and the second mapped video frame into the first intermediate video frame as indicated by the pixel values of the second fusion mask.

In the above implementation, a fourth neural network is designed to learn how to fuse the video frames, which helps improve the quality of the first intermediate video frame finally obtained.
An embodiment of the present application provides a model training method, including: acquiring a training sample that includes a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame; based on the third and fourth video frames, computing, with a first neural network, the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is the video frame to be inserted between the third and fourth video frames; backward-mapping the third video frame with the flow to it to obtain a third mapped video frame, and/or backward-mapping the fourth video frame with the flow to it to obtain a fourth mapped video frame; determining the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and computing a prediction loss from the second intermediate video frame and the reference video frame and updating the parameters of the first neural network according to the prediction loss.

The above method trains the first neural network used in the video frame interpolation method; that network can compute the intermediate-frame optical flow accurately and improve the interpolation result.
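Schematically, one training step might look as follows, reusing the `interpolate` sketch above and assuming a plain L1 prediction loss (the loss choice is an assumption; variants appear below).

```python
def train_step(flow_net, fuse, optimizer, frame3, frame4, reference):
    pred_mid = interpolate(frame3, frame4, flow_net, fuse)  # second intermediate frame
    loss = (pred_mid - reference).abs().mean()              # prediction loss vs. label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```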
In one implementation, computing the prediction loss from the second intermediate video frame and the reference video frame includes: computing a first loss from the difference between the second intermediate video frame and the reference video frame; computing the image gradient of the second intermediate video frame and the image gradient of the reference video frame separately, and computing a second loss from the difference between the two image gradients; and computing the prediction loss from the first loss and the second loss.

In the above implementation, adding a second loss that characterizes the difference between gradient images to the prediction loss helps alleviate the blurring of object edges in the generated second intermediate video frame.
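A sketch of such a second loss, using finite-difference image gradients and an assumed L1 distance:

```python
def gradient_loss(pred, ref):
    """Second loss: difference between the image gradients of the predicted
    intermediate frame and of the reference frame (finite differences)."""
    def grads(img):
        gx = img[..., :, 1:] - img[..., :, :-1]        # horizontal gradient
        gy = img[..., 1:, :] - img[..., :-1, :]        # vertical gradient
        return gx, gy
    pgx, pgy = grads(pred)
    rgx, rgy = grads(ref)
    return (pgx - rgx).abs().mean() + (pgy - rgy).abs().mean()

# prediction_loss = first_loss + lambda_grad * gradient_loss(pred, ref),
# where lambda_grad is an assumed weighting hyperparameter.
```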
In one implementation, computing the prediction loss from the second intermediate video frame and the reference video frame includes: computing a first loss from the difference between the second intermediate video frame and the reference video frame; computing, with a pretrained fifth neural network, the optical flow from the reference video frame to the third video frame and/or the optical flow from the reference video frame to the fourth video frame; computing a third loss from the difference between the optical flow computed by the first neural network and the corresponding optical flow computed by the fifth neural network; and computing the prediction loss from the first loss and the third loss.

In the above implementation, optical flow computed by the pretrained fifth neural network is used as a label for supervised training of the first neural network, realizing optical-flow knowledge transfer (embodied by adding a third loss to the prediction loss). This helps improve the accuracy of the first network's intermediate-frame flow prediction and, in turn, the quality of the first intermediate video frame finally obtained.
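A sketch of this third loss, treating the fifth network as a frozen teacher (the L1 distance is an assumption):

```python
import torch

def distillation_loss(student_flow, teacher_net, reference, frame3):
    """Third loss: supervise the first network's intermediate-frame flow
    with flow computed by the pretrained fifth network (frozen teacher)."""
    with torch.no_grad():
        teacher_flow = teacher_net(reference, frame3)  # flow: reference -> frame3
    return (student_flow - teacher_flow).abs().mean()
```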
In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each module outputs the optical flow from the second intermediate video frame to the third video frame as corrected by that module. Computing the prediction loss from the second intermediate video frame and the reference video frame then includes: computing a first loss from the difference between the second intermediate video frame and the reference video frame; computing, with a pretrained fifth neural network, the optical flow from the reference video frame to the third video frame; computing a fourth loss from the differences between the optical flow output by each module and the optical flow computed by the fifth neural network; and computing the prediction loss from the first loss and the fourth loss.

In the above implementation, optical flow computed by the pretrained fifth neural network is likewise used as a label for supervised training of the first neural network, realizing optical-flow knowledge transfer (embodied by adding a fourth loss to the prediction loss), which helps improve the accuracy of the first network's intermediate-frame flow prediction and, in turn, the quality of the first intermediate video frame finally obtained.

When the first neural network includes at least one optical flow calculation module, the flow estimate is produced progressively from coarse to fine, so a loss can be computed on the output of every module and accumulated into the fourth loss. Computing the fourth loss helps adjust the parameters of each module more precisely, so that the predictive ability of every module is improved.
In one implementation, computing the third loss from the difference between the flows computed by the first and fifth neural networks includes: backward-mapping the third video frame with the optical flow computed by the fifth neural network to obtain a fifth mapped video frame; determining, from the difference between the fifth mapped video frame and the reference video frame, whether the flow vector computed by the fifth neural network at each pixel position is accurate; and computing the third loss from the differences between the first effective flow vectors in the flow computed by the fifth neural network and the second effective flow vectors in the corresponding flow computed by the first neural network, where a first effective flow vector is a flow vector computed accurately by the fifth neural network, and a second effective flow vector is the flow vector, in the corresponding flow computed by the first neural network, located at the pixel position of a first effective flow vector.

In one implementation, computing the fourth loss from the differences between each module's output flow and the flow computed by the fifth neural network includes: backward-mapping the third video frame with the optical flow computed by the fifth neural network to obtain a fifth mapped video frame; determining, from the difference between the fifth mapped video frame and the reference video frame, whether the flow vector computed by the fifth neural network at each pixel position is accurate; and computing the fourth loss from the differences between the first effective flow vectors in the flow computed by the fifth neural network and the third effective flow vectors in the flow output by each module, where a first effective flow vector is a flow vector computed accurately by the fifth neural network, and a third effective flow vector is the flow vector, in the flow output by each module, located at the pixel position of a first effective flow vector.

Through long-term research, the inventors found that when the fifth neural network computes optical flow, the flow vectors at some pixel positions may be inaccurate, for example because of ambiguity at boundaries and in occluded regions. Such vectors can be excluded from the labels used for the supervised learning of the first neural network, and only the more accurately computed flow vectors are used as optical flow labels, which is the content of the above two implementations.
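The validity check can be sketched as follows, reusing `backward_warp` from above: the teacher flow is used to backward-map the third video frame, pixels where the result matches the reference frame within an assumed threshold are treated as accurate, and only those positions contribute to the loss.

```python
def masked_distillation_loss(student_flow, teacher_flow, frame3, reference,
                             threshold=0.05):
    """Keep only teacher flow vectors that reproduce the reference frame
    well when used to backward-map frame3 (threshold is an assumption)."""
    rewarped = backward_warp(frame3, teacher_flow)     # fifth mapped video frame
    err = (rewarped - reference).abs().mean(dim=1, keepdim=True)
    valid = (err < threshold).float()                  # first effective flow vectors
    diff = (student_flow - teacher_flow).abs().sum(dim=1, keepdim=True)
    return (valid * diff).sum() / valid.sum().clamp(min=1.0)
```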
In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each module corrects the optical flow input to it using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network. Before computing the optical flow from the second intermediate video frame to the third video frame and/or the fourth video frame with the first neural network based on the third and fourth video frames, the method further includes: initializing the parameters of the first neural network with parameters obtained by pretraining the LiteFlownet network.

If the optical flow calculation modules of the first neural network are obtained by structure transfer from the LiteFlownet network, then when training the first neural network, the LiteFlownet parameters can be loaded directly as initial values and fine-tuned, which not only speeds up the convergence of the first neural network but also helps improve its performance.
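Schematically (the checkpoint path and the name `first_network` are assumptions; non-strict loading is used so that layers without a pretrained counterpart keep their random initialization):

```python
import torch

# Assumes the correction submodules keep LiteFlownet's layer names, so that
# matching weights are loaded and unmatched layers stay randomly initialized.
state = torch.load("liteflownet_pretrained.pth", map_location="cpu")  # assumed path
first_network.load_state_dict(state, strict=False)
# ... then fine-tune first_network end to end as described above.
```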
In one implementation, determining the second intermediate video frame from the third and fourth mapped video frames includes: predicting a second image correction term and a third fusion mask with the third neural network, based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame; fusing the third and fourth mapped video frames into a second fused video frame as indicated by the pixel values of the third fusion mask; and correcting the second fused video frame with the second image correction term to obtain the second intermediate video frame. Computing the prediction loss and updating parameters then includes: computing the prediction loss from the second intermediate video frame and the reference video frame, and updating the parameters of both the first neural network and the third neural network according to the prediction loss.

If the third neural network is used for image correction when interpolating with the first neural network, the third neural network can be trained together with the first neural network in the model training stage, which helps simplify the training process.

In one implementation, determining the second intermediate video frame from the third and fourth mapped video frames includes: predicting a second image correction term and a fourth fusion mask with the fourth neural network, based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame; and fusing the third and fourth mapped video frames into the second intermediate video frame as indicated by the pixel values of the fourth fusion mask. Computing the prediction loss and updating parameters then includes: computing the prediction loss from the second intermediate video frame and the reference video frame, and updating the parameters of both the first neural network and the fourth neural network according to the prediction loss.

If the fourth neural network is used for image correction when interpolating with the first neural network, the fourth neural network can be trained together with the first neural network in the model training stage, which helps simplify the training process.
An embodiment of the present application provides a video frame interpolation apparatus, including: a first video frame acquisition unit configured to acquire a first video frame and a second video frame; a first optical flow calculation unit configured to compute, with a first neural network and based on the first and second video frames, the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, where the first intermediate video frame is the video frame to be inserted between the first and second video frames; a first backward mapping unit configured to backward-map the first video frame with the flow to it to obtain a first mapped video frame, and/or to backward-map the second video frame with the flow to it to obtain a second mapped video frame; and a first intermediate frame determination unit configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.

An embodiment of the present application provides a model training apparatus, including: a second video frame acquisition unit configured to acquire a training sample that includes a third video frame, a fourth video frame, and a reference video frame located between them; a second optical flow calculation unit configured to compute, with a first neural network and based on the third and fourth video frames, the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is the video frame to be inserted between the third and fourth video frames; a second backward mapping unit configured to backward-map the third video frame with the flow to it to obtain a third mapped video frame, and/or to backward-map the fourth video frame with the flow to it to obtain a fourth mapped video frame; a second intermediate frame determination unit configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and a parameter update unit configured to compute a prediction loss from the second intermediate video frame and the reference video frame and to update the parameters of the first neural network according to the prediction loss.

An embodiment of the present application provides a computer-readable storage medium storing computer program instructions that, when read and executed by a processor, perform the video frame interpolation method provided in the embodiments of the present application.

An embodiment of the present application provides an electronic device, including a memory and a processor, the memory storing computer program instructions that, when read and executed by the processor, perform the video frame interpolation method provided in the embodiments of the present application.
DESCRIPTION OF DRAWINGS
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and therefore should not be regarded as limiting the scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 shows a possible flow of the video frame interpolation method provided by an embodiment of the present application;
Fig. 2 shows a possible network architecture of the video frame interpolation method provided by an embodiment of the present application;
Fig. 3 shows a possible structure of the first neural network provided by an embodiment of the present application;
Fig. 4 shows a method of constructing the first image and the second image by means of feature pyramids;
Fig. 5 shows a possible structure of the second neural network provided by an embodiment of the present application;
Fig. 6 shows a possible structure of the third neural network provided by an embodiment of the present application;
Fig. 7 shows a possible flow of the model training method provided by an embodiment of the present application;
Fig. 8 shows a possible network architecture of the model training method provided by an embodiment of the present application;
Fig. 9 shows a possible structure of the video frame interpolation apparatus provided by an embodiment of the present application;
Fig. 10 shows another possible structure of the video frame interpolation apparatus provided by an embodiment of the present application;
Fig. 11 shows a possible structure of the electronic device provided by an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.

The terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.

The terms "first", "second", and the like are used only to distinguish one entity or operation from another, and should not be understood as indicating or implying relative importance, or as requiring or implying any such actual relationship or order between these entities or operations.
Fig. 1 shows a possible flow of the video frame interpolation method provided by an embodiment of the present application, and Fig. 2 shows a network architecture that can be used in the method, for reference in the following description. The method in Fig. 1 can be executed by, but is not limited to, the electronic device shown in Fig. 11; for the structure of that device, refer to the description of Fig. 11 below. Referring to Fig. 1, the method includes:
Step S110: acquire a first video frame and a second video frame.

The first video frame and the second video frame are two frames, one preceding the other, of the video to be interpolated; they may or may not be consecutive. Apart from this temporal relationship, the present application does not limit the choice of the two frames. For convenience, the first video frame is denoted I_1 and the second video frame is denoted I_2.
Step S120: based on the first video frame and the second video frame, compute, with a first neural network, the optical flow from a first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame.

The first intermediate video frame is the video frame to be inserted between I_1 and I_2. The present application does not limit the insertion position: it may or may not be exactly midway between I_1 and I_2. For convenience, the first intermediate video frame is denoted I_syn1.

So-called frame interpolation mainly amounts to obtaining I_syn1; inserting I_syn1 into the video is then straightforward. The solution of the present application obtains I_syn1 based on the optical flow of the first intermediate video frame, which includes the flow from the first intermediate video frame to the first video frame, denoted Flow_mid→1, and the flow from the first intermediate video frame to the second video frame, denoted Flow_mid→2.
In some implementations, I_1 and I_2 can be input to the first neural network, which predicts Flow_mid→1 and Flow_mid→2 separately.

If the motion of objects within I_1 and I_2 follows certain laws of motion, a corresponding conversion relation also holds between Flow_mid→1 and Flow_mid→2. Therefore, in other implementations, the first neural network can compute Flow_mid→1 and Flow_mid→2 can be derived from it, as shown in Fig. 2 (Flow_mid→2 not shown); equally, the network can compute Flow_mid→2 and Flow_mid→1 can be derived from it. In these implementations, a single optical flow computation with the first neural network yields both required flows, which significantly improves the efficiency of the flow computation.

Optionally, assuming that objects move linearly between I_1 and I_2 (straight-line motion at uniform velocity), Flow_mid→1 and Flow_mid→2 are opposite optical flows: once one is obtained, the other is computed by negation. "Opposite optical flows" can be written as Flow_mid→1 = -Flow_mid→2, meaning the two flows are opposite in direction and equal in magnitude. Since an optical flow can be regarded as the set of flow vectors at every pixel position of an image, taking the opposite of a flow only requires reversing all the flow vectors it contains, which is simple and efficient. Because any long-duration motion of in-frame objects can be approximated as the accumulation of many short linear motions, the linear-motion assumption is easy to satisfy when I_1 and I_2 are consecutive video frames or the frame rate of the video is high, so this flow conversion is highly practical.
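The opposite-flow relation follows from a one-line derivation; the sketch below assumes the intermediate frame sits at the temporal midpoint of I_1 and I_2 (for other insertion positions the two flows scale with the respective time offsets instead of being exactly opposite).

```latex
% Under linear motion, a point located at x_mid in the intermediate frame
% lies midway between its positions x_1 in I_1 and x_2 in I_2:
x_{mid} = \tfrac{1}{2}\,(x_1 + x_2)
\;\Longrightarrow\;
\mathrm{Flow}_{mid\to 1} = x_1 - x_{mid} = \tfrac{1}{2}\,(x_1 - x_2)
  = -(x_2 - x_{mid}) = -\,\mathrm{Flow}_{mid\to 2}.
```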
Taking the case Flow_mid→1 = -Flow_mid→2 as an example, Fig. 3 shows the structure of a first neural network that can compute Flow_mid→1. Referring to Fig. 3, the first neural network includes at least one optical flow calculation module connected in sequence (three modules are shown in the figure). Each module corrects the optical flow input to it and outputs the corrected flow.

The optical flow input to the first module (optical flow calculation module 1 in Fig. 3) is a preset Flow_mid→1; since no flow computation has been performed yet, this preset flow can take a default value, for example zero (meaning all flow vectors it contains are zero). After the first module corrects the preset Flow_mid→1, it outputs the correction result, which can be regarded as the Flow_mid→1 computed by the first module. Every module after the first corrects the Flow_mid→1 output by the preceding module and outputs its correction result, which can be regarded as the Flow_mid→1 computed by that module. The Flow_mid→1 output by the last module (optical flow calculation module 3 in Fig. 3) is the flow finally computed by the first neural network. It can be seen that, within the first neural network, the estimate of Flow_mid→1 is corrected continually from coarse to fine, finally yielding a fairly accurate optical flow result.

Each optical flow calculation module has a similar structure, as shown on the left of Fig. 3. Besides Flow_mid→1, the input of a module also includes a first image and a second image, denoted J_1 and J_2 for convenience, although different modules do not necessarily receive the same J_1 and J_2. J_1 is determined from I_1 and J_2 from I_2, specifically including, but not limited to, one of the following ways:
(1) Directly use I_1 as J_1 and I_2 as J_2, feeding I_1 and I_2 to every module. Way (1) requires no computation to produce the module inputs and therefore helps improve the efficiency of the optical flow computation.

(2) Use feature maps output after I_1 is processed by convolutional layers as J_1, and feature maps output after I_2 is processed by convolutional layers as J_2. Since processing I_1 and I_2 with several convolutional layers yields multiple feature maps of different scales, different modules can receive feature maps of different scales, but the J_1 and J_2 input to the same module have the same shape. Way (2) requires convolution to produce the module inputs, which is computationally more expensive, but because deeper features of the images are taken into account, the optical flow result is also more accurate.

In some implementations, a first feature extraction network, which is a convolutional neural network, can extract features from I_1 and I_2 separately, forming a feature pyramid for each, where each level of a pyramid, counted from the top, corresponds to one optical flow calculation module of the first neural network, counted from the first, and the feature maps at the same level of the two pyramids have the same shape.
For example, referring to Fig. 4, the first feature extraction network (not shown) extracts features from I_1 and I_2 separately, yielding two 3-level feature pyramids that correspond to the 3 optical flow calculation modules in Fig. 3: level 1 (the top, i.e. the level closest to I_1 and I_2 in the figure) corresponds to optical flow calculation module 1, level 2 to module 2, and level 3 (the bottom, i.e. the level farthest from I_1 and I_2) to module 3. Each level of a feature pyramid is a feature map: in the pyramid of I_1, the feature map at level i (i = 1, 2, 3) is denoted F_i^1; in the pyramid of I_2, the feature map at level i is denoted F_i^2; and F_i^1 and F_i^2 have the same shape.
After the two feature pyramids are constructed, they are traversed layer by layer from the top layer downward, and the two feature maps on the same layer are taken as the J 1 and J 2 of the optical flow calculation module corresponding to that layer. For example, in FIG. 4, P 1-i and P 2-i are taken as the J 1 and J 2 of the i-th optical flow calculation module in FIG. 3, respectively.
Since the size of the feature maps in a feature pyramid increases gradually from the top layer to the bottom layer, the top layer corresponds to smaller, coarser feature maps and the bottom layer to larger, finer ones. Feeding the feature maps layer by layer, starting from the top of the pyramid, into the corresponding optical flow calculation modules is therefore conducive to a progressive refinement of the optical flow calculation. Generally speaking, however, according to the characteristics of convolutional neural networks, large feature maps are extracted first and small feature maps later, i.e., the feature pyramid is constructed from the bottom layer to the top layer.
It should be pointed out that, since I 1 and I 2 can themselves be regarded as a special kind of feature map, way (2) does not exclude taking I 1 and I 2 as the J 1 and J 2 of the first optical flow calculation module.
(3) Take the image obtained by downsampling I 1 as J 1, and the image obtained by downsampling I 2 as J 2. Since I 1 and I 2 can yield multiple downsampled images of different scales after being downsampled multiple times, each optical flow calculation module can receive downsampled images of a different scale, but the J 1 and J 2 input to the same module have the same shape. Way (3) only needs simple downsampling to prepare the module inputs, which costs little computation and is therefore beneficial to the computational efficiency of the optical flow calculation modules. Note that a convolution operation can, to some extent, also be regarded as downsampling, but the downsampling in way (3) should be understood as excluding downsampling by convolution; for example, it may be performed by directly extracting pixels of the original image at intervals given by the downsampling factor.
In some implementations, I 1 and I 2 may be downsampled separately to form an image pyramid of I 1 and an image pyramid of I 2. Each layer of the image pyramids, starting from the top layer, corresponds to one optical flow calculation module of the first neural network, starting from the first module, and images on the same layer of the two pyramids have the same shape. The structure of an image pyramid is similar to that of a feature pyramid, except that the pyramid is made of downsampled original images (i.e., of I 1 or I 2) rather than feature maps.
After the two image pyramids are constructed, they are traversed layer by layer from the top layer downward, and the two downsampled images on the same layer are taken as the J 1 and J 2 of the optical flow calculation module corresponding to that layer.
Since the size of the downsampled images in an image pyramid increases gradually from the top layer to the bottom layer, the top layer corresponds to smaller, coarser downsampled images and the bottom layer to larger, finer ones. Feeding the downsampled images layer by layer, starting from the top of the pyramid, into the corresponding optical flow calculation modules is therefore conducive to a progressive refinement of the optical flow calculation. Generally speaking, however, according to the characteristics of the downsampling operation, large downsampled images are produced first and small downsampled images later, i.e., the image pyramid is constructed from the bottom layer to the top layer.
It should be pointed out that, since I 1 and I 2 can themselves be regarded as a special kind of downsampled image (with a downsampling factor of 1), way (3) does not exclude taking I 1 and I 2 as the J 1 and J 2 of the first optical flow calculation module.
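For illustration only, the following Python (PyTorch) sketch shows one way to realize the interval-based downsampling of way (3) and build the two image pyramids; the (N, C, H, W) tensor layout, the 3-level setup and all names are assumptions made for this example, not part of the disclosure:

import torch

def build_image_pyramid(frame, num_levels=3):
    # Top layer (index 0) is the smallest image; bottom layer is the largest.
    levels = []
    for i in range(num_levels):
        step = 2 ** (num_levels - 1 - i)             # sampling interval for this layer
        levels.append(frame[:, :, ::step, ::step])   # direct pixel extraction, no convolution
    return levels

# The i-th levels of the two pyramids serve as the J 1 and J 2 of the i-th
# optical flow calculation module; same-layer images have the same shape.
I1 = torch.rand(1, 3, 256, 256)
I2 = torch.rand(1, 3, 256, 256)
pyramid1, pyramid2 = build_image_pyramid(I1), build_image_pyramid(I2)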
Continuing to refer to FIG. 3, in an optical flow calculation module, backward warping is performed on the J 1 input to the module based on the Flow mid→1 input to the module, obtaining a first mapped image, denoted warp(J 1); backward warping is likewise performed on the J 2 input to the module, obtaining a second mapped image, denoted warp(J 2).
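As a minimal sketch (an assumption of this example, not a prescribed implementation), backward warping such as warp(J 1) can be realized in Python (PyTorch) with grid_sample; the flow layout (N, 2, H, W), with channel 0 the horizontal and channel 1 the vertical displacement, is assumed:

import torch
import torch.nn.functional as F

def backward_warp(image, flow):
    n, _, h, w = image.shape
    # Base sampling grid of absolute pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    # For each target pixel, sample the source image at (x + u, y + v).
    x = xs.unsqueeze(0) + flow[:, 0]
    y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack((2.0 * x / (w - 1) - 1.0, 2.0 * y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(image, grid, align_corners=True)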
The optical flow calculation module includes an optical flow correction module, which takes the Flow mid→1 input to the optical flow calculation module together with the warp(J 1) and warp(J 2) above as its input, corrects Flow mid→1, and outputs the corrected Flow mid→1, which is also the output of the optical flow calculation module.
Two implementations of the optical flow correction module are listed below; it can be understood that the optical flow correction module may also adopt other implementations:
(1) Design a second neural network: input the Flow mid→1, warp(J 1) and warp(J 2) of the optical flow calculation module to the second neural network, use the second neural network to predict an optical flow correction term Flow res, and then correct the module's input Flow mid→1 with Flow res to obtain the corrected Flow mid→1. For example, in an optional solution, the input Flow mid→1 and Flow res are added (either directly or as a weighted sum) to obtain the corrected Flow mid→1. The second neural network can adopt a relatively simple network structure so as to reduce the amount of computation and improve the efficiency of optical flow correction, thereby speeding up the optical flow calculation of the module.
The second neural network may adopt an encoder-decoder network; FIG. 5 shows one possible structure. In FIG. 5, the left part of the network (R1 to R4) is the encoder and the right part (D1 to D4) is the decoder, where Ri (i=1, 2, 3, 4) denotes an encoding module, for example a residual block (ResBlock), and Di (i=1, 2, 3, 4) denotes a decoding module, for example a deconvolution layer. Flow mid→1, warp(J 1) and warp(J 2) are concatenated and input to R1. Every encoding module except R4, besides passing its extracted features to the next encoding module, also feeds them into the decoder, where they are added to the output of the corresponding decoding module so as to fuse features at different scales; the features extracted by R4 are output directly to D4, and D1 outputs the optical flow correction term Flow res predicted by the second neural network. The intermediate outputs of the second neural network (i.e., the outputs of the convolutional and deconvolutional layers) may be batch-normalized, with PReLU used as the nonlinear activation function. It can be understood that FIG. 5 is only an example, and the second neural network may also adopt other structures.
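A minimal sketch of implementation (1) follows. The plain convolutional stack below stands in for the encoder-decoder of FIG. 5 (whose layer sizes are not fully specified above, so they are assumptions here), while the interface, predicting Flow res from the concatenated [Flow mid→1, warp(J 1), warp(J 2)] and adding it to the input flow, matches the description:

import torch
import torch.nn as nn

class FlowCorrection(nn.Module):
    def __init__(self, image_channels=3):
        super().__init__()
        in_ch = 2 + 2 * image_channels            # flow (2) + two warped images
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.PReLU(32),
            nn.Conv2d(32, 32, 3, padding=1), nn.PReLU(32),
            nn.Conv2d(32, 2, 3, padding=1),       # predicts Flow_res (2 channels)
        )

    def forward(self, flow, warped_j1, warped_j2):
        flow_res = self.net(torch.cat([flow, warped_j1, warped_j2], dim=1))
        return flow + flow_res                    # direct (unweighted) addition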
(2) Directly migrate the optical flow correction structure from the LiteFlownet network. LiteFlownet is an existing network usable for optical flow calculation, but it can only compute the optical flow between two given frames, e.g., the optical flow Flow 1→2 from the first video frame to the second video frame; it cannot compute the intermediate-frame optical flow Flow mid→1.
The NetE part of the LiteFlownet network contains a structure that is functionally similar to the optical flow correction module, called the flow inference module, which can be roughly divided into three parts: a descriptor matching unit, a sub-pixel refinement unit, and a regularization module.
The above flow inference module can be migrated directly into the optical flow correction module of the present application, but the inputs of each part need to be adapted to a certain extent:
The input of the descriptor matching unit is adapted to warp(J 1), warp(J 2) and the pre-correction Flow mid→1. The matching cost volume between warp(J 1) and warp(J 2) is computed in the descriptor matching unit, and the four items of information, namely warp(J 1), warp(J 2), the pre-correction Flow mid→1 and the computed matching cost volume, are input to the convolutional neural network in the descriptor matching unit for calculation, which finally outputs the Flow mid→1 computed by the descriptor matching unit. The matching cost volume measures the degree of coincidence between the mapped images warp(J 1) and warp(J 2).
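For reference, a naive local-correlation sketch of such a matching cost volume between warp(J 1) and warp(J 2) is given below; the search radius and the mean-over-channels normalization are assumptions, since the application reuses LiteFlownet's descriptor matching unit rather than defining its own:

import torch
import torch.nn.functional as F

def cost_volume(f1, f2, max_disp=3):
    """f1, f2: (N, C, H, W). Returns (N, (2*max_disp+1)**2, H, W)."""
    n, c, h, w = f1.shape
    f2_pad = F.pad(f2, [max_disp] * 4)            # pad left/right/top/bottom
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2_pad[:, :, dy:dy + h, dx:dx + w]
            # Correlation between f1 and each shifted window of f2.
            costs.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)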
The input of the sub-pixel refinement unit is adapted to warp(J 1), warp(J 2) and the Flow mid→1 output by the descriptor matching unit; the sub-pixel refinement unit corrects the input Flow mid→1 at sub-pixel accuracy and outputs the corrected Flow mid→1.
The input of the regularization module is adapted to warp(J 1), warp(J 2) and the Flow mid→1 output by the sub-pixel refinement unit; the regularization module smooths the input Flow mid→1 and outputs the corrected Flow mid→1, which is the output of the optical flow correction module.
In addition, the NetC part of the LiteFlownet network constructs feature pyramids, so this part of the convolutional layers can also be migrated into the solution of the present application as the first feature extraction network, used to extract P 1-i and P 2-i as the J 1 and J 2 input to the optical flow calculation modules.
Compared with implementation (1), implementation (2) effectively migrates existing optical flow calculation results, but because the LiteFlownet network contains more operators, its computation is somewhat more complicated.
Step S130: performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame.
After Flow mid→1 is calculated in step S120, it can be used in step S130 to perform backward mapping on I 1, obtaining the first mapped video frame, denoted warp(I 1); similarly, backward mapping is performed on I 2 (using Flow mid→2), obtaining the second mapped video frame, denoted warp(I 2), as shown in FIG. 2.
Step S140: determining the first intermediate video frame according to the first mapped video frame and the second mapped video frame.
In some implementations, warp(I 1) and warp(I 2) may first be fused to obtain a first fused video frame, denoted I fusion1, after which I fusion1 is corrected according to Flow mid→1 and/or Flow mid→2, and the corrected image is taken as I syn1; this is beneficial to the image quality of I syn1 and improves the frame interpolation effect. If a conversion relationship exists between Flow mid→1 and Flow mid→2, I fusion1 may be corrected according to only one of them. The frame fusion and the image correction may be performed successively; for example, warp(I 1) and warp(I 2) are first averaged to obtain I fusion1, and a neural network is then designed to correct I fusion1. Alternatively, frame fusion and image correction may both be realized by a single neural network, i.e., the neural network simultaneously learns how to fuse the video frames and how to correct the image, as shown in FIG. 2.
In FIG. 2, first, warp(I 1), warp(I 2) and Flow mid→1 are input to a third neural network, which predicts a first image correction term and a first fusion mask, denoted I res1 and mask1 respectively.
Then, warp(I 1) and warp(I 2) are fused into I fusion1 as indicated by the pixel values in mask1. For example, each pixel value in mask1 can only take 0 or 1: if the pixel value at a position is 0, the pixel value of I fusion1 at that position is taken from warp(I 1) at that position; if the pixel value at a position is 1, the pixel value of I fusion1 at that position is taken from warp(I 2) at that position.
Finally, I fusion1 is corrected with I res1 to obtain I syn1. For example, in an optional solution, I fusion1 and I res1 are added (either directly or as a weighted sum) to obtain I syn1; with direct addition, I syn1 = I fusion1 + I res1.
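A minimal sketch of this fusion-and-correction step follows, using the 0 → warp(I 1), 1 → warp(I 2) mask convention described above; allowing soft mask values in [0, 1] is an assumption made here for differentiability, whereas the text above describes the binary case:

def fuse_and_correct(warp_i1, warp_i2, mask1, i_res1):
    # mask1 value 0 selects warp(I 1), value 1 selects warp(I 2) at each pixel;
    # all arguments are tensors of the same spatial size.
    i_fusion1 = (1.0 - mask1) * warp_i1 + mask1 * warp_i2
    return i_fusion1 + i_res1   # I syn1 = I fusion1 + I res1 (direct addition)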
The structure of the third neural network is exemplified below. In some implementations, the third neural network includes a second feature extraction network and an encoder-decoder network, and works as follows: first, the second feature extraction network performs feature extraction on I 1 and I 2 respectively; then the feature maps extracted by the second feature extraction network are backward-warped with Flow mid→1; the resulting mapped feature maps, together with warp(I 1), warp(I 2) and Flow mid→1, are input to the encoder of the encoder-decoder network for feature extraction; finally, the decoder of the encoder-decoder network predicts I res1 and mask1 from the features extracted by the encoder.
FIG. 6 shows an implementation of the third neural network consistent with the above description. Referring to FIG. 6, the left part of the network (C1 to C3) is the second feature extraction network, and the right part is the encoder-decoder network, whose main structure is similar to that of FIG. 5 and is not elaborated again. In the second feature extraction network, Ci (i=1, 2, 3) denotes one or more convolutional layers, so that the second feature extraction network builds two 3-layer feature pyramids. In the feature pyramid built from I 1, the feature map of the i-th (i=1, 2, 3) layer is denoted F 1-i (F 1-1 being the bottom layer and F 1-3 the top layer); in the feature pyramid built from I 2, the feature map of the i-th layer is denoted F 2-i (F 2-1 being the bottom layer and F 2-3 the top layer); F 1-i and F 2-i have the same shape. In FIG. 6, the feature maps F 1-i and F 2-i are each backward-warped based on Flow mid→1, and the resulting mapped feature maps are denoted warp(F 1-i) and warp(F 2-i). Then warp(F 1-i) and warp(F 2-i) are concatenated with the output of encoding module Ri to serve as the input of encoding module Ri+1. It can be understood that FIG. 6 is only an example, and the third neural network may also adopt other structures.
In the above implementation, the second feature extraction network is designed to extract deep features of the original images (such as edges and textures), and feeding these features into the encoder-decoder network is beneficial to the image correction effect.
In the solution shown in FIG. 2, I res1 and mask1 are predicted by the third neural network, but in some other implementations this solution can optionally be simplified: first, warp(I 1), warp(I 2) and Flow mid→1 are input to a fourth neural network; the fourth neural network then predicts a second fusion mask, denoted mask2; finally, warp(I 1) and warp(I 2) are fused directly into I syn1 as indicated by the pixel values in mask2. Since these implementations do not need to compute I res1, the calculation process is simpler, and the fourth neural network can focus on learning the fusion mask. The design of the fourth neural network may refer to the third neural network and is not detailed here.
In still other implementations, warp(I 1) and warp(I 2) may be fused directly, for example by simply averaging the two to obtain I syn1. The calculation process of these implementations is extremely simple, but the quality of the obtained intermediate frame is somewhat poorer.
In the solution shown in FIG. 2, the first intermediate video frame is produced by fusing the first mapped video frame and the second mapped video frame (possibly with further correction), but there are also solutions in which the first intermediate video frame is produced directly from the first mapped video frame or the second mapped video frame alone (possibly with further correction). The specific steps of these solutions are as follows:
Scheme A

Step A1: obtaining the first video frame and the second video frame;

Step A2: based on the first video frame and the second video frame, using the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame;

Step A3: performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain the first mapped video frame;

Step A4: determining the first intermediate video frame according to the first mapped video frame.

For step A4, in different implementations, the first mapped video frame may be taken directly as the first intermediate video frame, or the first mapped video frame may be corrected based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame. For example, a neural network may be designed to correct the first mapped video frame; its structure may refer to the third neural network, but since no video frame fusion is involved, this network only needs to predict the image correction term. For other contents of steps A1 to A4, reference may be made to steps S110 to S140, which are not detailed again.
Scheme B

Step B1: obtaining the first video frame and the second video frame;

Step B2: based on the first video frame and the second video frame, using the first neural network to calculate the optical flow from the first intermediate video frame to the second video frame;

Step B3: performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain the second mapped video frame;

Step B4: determining the first intermediate video frame according to the second mapped video frame.

For step B4, in different implementations, the second mapped video frame may be taken directly as the first intermediate video frame, or the second mapped video frame may be corrected based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame. For other contents of steps B1 to B4, reference may be made to steps S110 to S140, which are not detailed again.
To sum up, when performing video frame interpolation, the frame interpolation method provided by the embodiments of the present application uses the first neural network to calculate the intermediate-frame optical flow (i.e., the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame) directly from the first video frame and the second video frame, without deriving the intermediate-frame optical flow from the optical flow between the first video frame and the second video frame. The intermediate-frame optical flow thus obtained is more accurate, the first intermediate video frame obtained on this basis has better image quality, and ghosting is less likely to occur at the edges of moving objects. In addition, the steps of the above method are simple and significantly improve the frame interpolation efficiency, so that good results can also be achieved in scenarios such as real-time frame interpolation and high-definition video frame interpolation.
It should be pointed out that, in the various possible implementations of the video frame interpolation method, every use of backward mapping may also be replaced with forward mapping (forward warp), in which case the optical flow used for the mapping needs to be adjusted accordingly. For example, if Flow mid→1 is used to backward-map the first video frame, then after replacement Flow 1→mid (the optical flow from the first video frame to the first intermediate video frame) should be used to forward-map the first video frame, and the first neural network should be changed to output Flow 1→mid; likewise, if Flow mid→2 is used to backward-map the second video frame, then after replacement Flow 2→mid (the optical flow from the second video frame to the first intermediate video frame) should be used to forward-map the second video frame, and the first neural network should be changed to output Flow 2→mid.
It should further be pointed out that, in some implementations of the video frame interpolation method, more than one step maps video frames (for example, backward mapping is performed in step S130, and also in step S120 if the implementation of FIG. 3 is adopted). These steps should either all use backward mapping or all use forward mapping, i.e., the mapping type should be kept consistent throughout the frame interpolation process.
By comparison, forward mapping needs to resolve the fusion problem that arises when multiple points are mapped to the same position, and current hardware support for forward mapping is insufficient; for these reasons, this application mainly takes backward mapping as an example, without thereby excluding solutions that adopt forward mapping.
FIG. 7 shows one possible flow of the model training method provided by the embodiments of the present application, which can be used to train the first neural network used by the frame interpolation method of FIG. 1. FIG. 8 shows a network architecture that can be adopted in this method, for reference when the model training method is described. The method in FIG. 7 may be performed by, but is not limited to, the electronic device shown in FIG. 11; for the structure of this electronic device, reference may be made to the later description of FIG. 11. Referring to FIG. 7, the method includes:
Step S210: obtaining a training sample.
The training set consists of multiple training samples, and each training sample is used in a similar manner during training, so the training process can be illustrated with any one of them. Each training sample may include 3 video frames: a third video frame, a fourth video frame, and a reference video frame located between the third and fourth video frames, denoted I 3, I 4 and I mid respectively, as shown in FIG. 8. The video frame to be inserted between I 3 and I 4 is the second intermediate video frame, denoted I syn2; I mid corresponds to I syn2 and represents the real video frame at the position of I syn2 (i.e., the ground truth of the intermediate frame). When selecting training samples, 3 consecutive frames may be taken from a video as one sample, with the first of the 3 frames taken as I 3, the second as I mid, and the third as I 4.
Step S220: based on the third video frame and the fourth video frame, using the first neural network to calculate the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame.
For this step, reference may be made to step S120; it is not elaborated again. For convenience, the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame are denoted Flow mid→3 and Flow mid→4 respectively. In FIG. 8, assuming that objects move linearly between I 3 and I 4, Flow mid→3 = -Flow mid→4, so in FIG. 8 the first neural network only needs to calculate Flow mid→3.
Step S230: performing backward mapping on the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and performing backward mapping on the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame.
After Flow mid→3 is calculated in step S220, it can be used in step S230 to perform backward mapping on I 3, obtaining the third mapped video frame, denoted warp(I 3); similarly, backward mapping is performed on I 4 (using Flow mid→4), obtaining the fourth mapped video frame, denoted warp(I 4), as shown in FIG. 8.
Step S240: determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame.
Step S240 may refer to step S140. In some implementations, a third neural network is used in step S240 for image correction; referring to FIG. 8, the process is as follows:
First, warp(I 3), warp(I 4) and Flow mid→3 are input to the third neural network, which predicts a second image correction term and a third fusion mask, denoted I res2 and mask3 respectively. Then, warp(I 3) and warp(I 4) are fused into I fusion2 as indicated by the pixel values in mask3; for the specific method, refer to the earlier description of mask1. Finally, I fusion2 is corrected with I res2 to obtain I syn2.
In other implementations, the above solution can be simplified: first, warp(I 3), warp(I 4) and Flow mid→3 are input to the fourth neural network; the fourth neural network then predicts a fourth fusion mask, denoted mask4; finally, warp(I 3) and warp(I 4) are fused directly into I syn2 as indicated by the pixel values in mask4.
Of course, in some implementations image correction may be omitted; for example, warp(I 3) and warp(I 4) are simply averaged to obtain I syn2.
Step S250: calculating a prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.
The loss calculation is explained later. First, since the first neural network is necessarily used in the solution of the present application, its parameters can be updated with the back-propagation algorithm once the prediction loss has been calculated. Second, if the third neural network is used in step S240, its parameters are also updated in step S250, i.e., the third neural network is trained together with the first neural network, which simplifies the training process. Similarly, if the fourth neural network is used in step S240, its parameters are also updated in step S250, i.e., the fourth neural network is trained together with the first neural network. During training, steps S210 to S250 are executed iteratively, and training ends when a training termination condition (for example, model convergence) is satisfied.
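For illustration, a single joint training iteration might look like the following Python (PyTorch) sketch, reusing backward_warp from the earlier sketch; the stand-in single-convolution networks, the use of -Flow mid→3 for warping I 4 (the linear-motion assumption), and restricting the loss to the first (L1) term are simplifying assumptions of this example:

import torch
import torch.nn as nn

flow_net = nn.Conv2d(6, 2, 3, padding=1)    # stand-in for the first neural network
fusion_net = nn.Conv2d(8, 3, 3, padding=1)  # stand-in for the third neural network
l1 = nn.L1Loss()
optimizer = torch.optim.Adam(
    list(flow_net.parameters()) + list(fusion_net.parameters()), lr=1e-4)

def train_step(i3, i4, i_mid):
    flow_mid_3 = flow_net(torch.cat([i3, i4], dim=1))            # step S220
    w3 = backward_warp(i3, flow_mid_3)                           # step S230
    w4 = backward_warp(i4, -flow_mid_3)                          # linear-motion assumption
    i_syn2 = fusion_net(torch.cat([w3, w4, flow_mid_3], dim=1))  # step S240 (simplified)
    loss = l1(i_syn2, i_mid)                                     # step S250, first loss only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()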
The prediction loss can be expressed uniformly by the following formula:

Loss sum = Loss l1 + α·Loss sobel + β·Loss epe + γ·Loss multiscale-epe

where Loss sum is the total prediction loss, and the right-hand side contains four losses: the first loss Loss l1, the second loss Loss sobel, the third loss Loss epe, and the fourth loss Loss multiscale-epe. The first loss is the basic loss and is always included when calculating the prediction loss; the other three losses are optional, and depending on the implementation one or more of them may be added, or none at all, with the restriction that the third loss and the fourth loss cannot be added at the same time. α, β and γ are weighting coefficients serving as hyperparameters of the network. It should be understood that other loss terms may also be added to the right-hand side of the equation. Each loss is described in detail below:
The first loss is calculated from the difference between I syn2 and I mid; its purpose is to make I syn2 closer to I mid through learning, i.e., to make the image quality of the intermediate frame better. In some implementations, the difference between I syn2 and I mid can be defined as their pixel-wise distance, for example, with the L1 distance:

Loss l1 = ∑i∑j |I syn2(i,j) - I mid(i,j)|

where i and j together denote a pixel position.
The second loss is calculated from the difference between the image gradient of I syn2 and the image gradient of I mid; its purpose is to alleviate, through learning, the blurring of object edges in the generated I syn2 (the image gradient corresponds to the edge information in an image). The image gradient can be computed by applying a gradient operator to the image, such as the Sobel, Roberts or Prewitt operator, and the difference between the image gradients of I syn2 and I mid can be defined as their pixel-wise distance. For example, with the Sobel operator and the L1 distance:

Loss sobel = ∑i∑j |Sobel(I syn2)(i,j) - Sobel(I mid)(i,j)|

where Sobel(·) denotes computing the image gradient of an image with the Sobel operator.
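A sketch of the first and second losses in Python (PyTorch) follows; the 3×3 Sobel kernels are standard, while the grouped per-channel convolution is an implementation assumption of this example:

import torch
import torch.nn.functional as F

def sobel(img):
    # Standard 3x3 Sobel kernels for horizontal and vertical gradients.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    c = img.shape[1]
    # One (kx, ky) pair per input channel, applied as a grouped convolution.
    k = torch.stack([kx, ky]).unsqueeze(1).repeat(c, 1, 1, 1).to(img)  # (2c, 1, 3, 3)
    return F.conv2d(img, k, padding=1, groups=c)

def loss_l1_and_sobel(i_syn2, i_mid, alpha=1.0):
    loss_l1 = (i_syn2 - i_mid).abs().sum()                  # Loss l1
    loss_sobel = (sobel(i_syn2) - sobel(i_mid)).abs().sum() # Loss sobel
    return loss_l1 + alpha * loss_sobel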
The calculations of the first and second losses above are both directly related to I syn2, but I syn2 is computed from Flow mid→3, so the accuracy of the optical flow computed by the first neural network is also very important. Therefore, in some implementations, optical flow labels can be set up so that the first neural network is trained with supervision.
For example, referring to FIG. 8, a fifth neural network with an optical flow calculation capability (for example, a LiteFlownet network) is pre-trained (i.e., trained before the steps of FIG. 7 are executed); I 3 and I mid are input to the fifth neural network, and the optical flow from the reference video frame to the third video frame computed by the fifth neural network (denoted Flow* mid→3) is taken as the optical flow label (i.e., the ground truth of the intermediate-frame optical flow). Computing the optical flow between two video frames (rather than the optical flow of an intermediate frame between them) is something an existing optical flow calculation network can do.
The third loss is calculated from the difference between the Flow mid→3 computed by the first neural network and Flow* mid→3; its purpose is to improve, through learning, the accuracy of the Flow mid→3 computed by the first neural network, and this loss embodies the transfer of optical flow knowledge from the fifth neural network to the first neural network. In some implementations, the difference between Flow mid→3 and Flow* mid→3 can be defined as the distance (L2 distance) between the optical flow vectors they contain, expressed as:

Loss epe = ∑i∑j ||Flow mid→3(i,j) - Flow* mid→3(i,j)||2

where Flow mid→3(i,j) and Flow* mid→3(i,j) both denote the optical flow vector at pixel position (i,j).
Optionally, if the first neural network includes at least one optical flow calculation module (see FIG. 3 for its structure), and each optical flow calculation module outputs the Flow mid→3 as corrected by that module, computing Flow mid→3 from coarse to fine, then the optical flow label can be used to supervise every optical flow calculation module and improve each module's optical flow calculation capability. Specifically, for each optical flow calculation module, a loss is computed from the difference between the Flow mid→3 it outputs and the Flow* mid→3 computed by the fifth neural network (the calculation may refer to that of the third loss), and these losses are accumulated to obtain the fourth loss. The calculation of the fourth loss is expressed as:

Loss multiscale-epe = ∑ k=1..n ∑i∑j ||Flow k mid→3(i,j) - Flow* mid→3(i,j)||2

where n denotes the total number of optical flow calculation modules and Flow k mid→3 denotes the Flow mid→3 output by the k-th optical flow calculation module.
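A sketch of the third and fourth losses follows; it assumes the label flow Flow* mid→3 has already been resized to each module's output resolution, which the text above does not specify:

import torch

def loss_epe(flow_pred, flow_label):
    # Per-pixel L2 distance between predicted and label flow vectors, summed
    # over all pixel positions; flow tensors have shape (N, 2, H, W).
    return torch.sqrt(((flow_pred - flow_label) ** 2).sum(dim=1)).sum()

def loss_multiscale_epe(flows_per_module, flow_labels):
    # flows_per_module[k]: the Flow mid→3 output by the (k+1)-th flow module;
    # flow_labels[k]: the label flow resized to the matching resolution.
    return sum(loss_epe(f, g) for f, g in zip(flows_per_module, flow_labels))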
Compared with the third loss, the fourth loss likewise realizes the transfer of optical flow knowledge from the fifth neural network to the first neural network, and computing the fourth loss is conducive to adjusting the parameters of every optical flow calculation module more precisely, though it is somewhat more complicated to compute.
Optionally, the inventors found through long-term research that, when the fifth neural network performs optical flow calculation, the optical flow vectors computed at some pixel positions may be inaccurate, for example because of ambiguity at boundaries and in occluded regions. Such optical flow vectors need not be used as labels for the supervised learning of the first neural network; only the more accurately computed optical flow vectors are used as optical flow labels. The specific procedure is as follows:
First, backward mapping is performed on I 3 with the Flow* mid→3 computed by the fifth neural network (forward mapping may of course be used in the forward-mapping variant), obtaining a fifth mapped video frame, denoted warp*(I 3).
Then, whether the optical flow vector computed by the fifth neural network at each pixel position is accurate is determined from the difference between warp*(I 3) and I mid. For example, the mean of the L1 distance between warp*(I 3) and I mid can be computed at each pixel (since the video frames may be multi-channel images, a mean over channels can be taken at each pixel). If the mean L1 distance is greater than a certain threshold, the optical flow vector computed by the fifth neural network at that pixel position is deemed inaccurate; otherwise it is deemed accurate. The accurately computed optical flow vectors may be called first valid optical flow vectors. Experiments show that the first valid optical flow vectors account for the vast majority of the optical flow vectors computed by the fifth neural network, because the fifth neural network essentially computes the intermediate-frame optical flow with the intermediate frame known, so its accuracy can still be guaranteed.
Finally, the third loss or the fourth loss is calculated from the first valid optical flow vectors in the optical flow computed by the fifth neural network:
When calculating the third loss, the calculation is based on the difference between the first valid optical flow vectors in the Flow* mid→3 computed by the fifth neural network and the second valid optical flow vectors in the Flow mid→3 computed by the first neural network, where a second valid optical flow vector is the optical flow vector in the first neural network's Flow mid→3 located at the pixel position corresponding to a first valid optical flow vector. For example, if the optical flow vector at (1,1) in the Flow* mid→3 computed by the fifth neural network is a first valid optical flow vector, then the optical flow vector at (1,1) in the Flow mid→3 computed by the first neural network is a second valid optical flow vector.
When calculating the fourth loss, the calculation is based on the difference between the first valid optical flow vectors in the Flow* mid→3 computed by the fifth neural network and the third valid optical flow vectors in the Flow mid→3 output by each optical flow calculation module of the first neural network (the differences are computed separately and then accumulated), where a third valid optical flow vector is the optical flow vector in each module's output Flow mid→3 located at the pixel position corresponding to a first valid optical flow vector.
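Finally, a sketch of selecting the first valid optical flow vectors and masking the EPE loss accordingly; it reuses backward_warp from the earlier sketch, and the threshold value is an assumption of this example:

import torch

def valid_flow_mask(i3, i_mid, flow_label, threshold=0.05):
    w3_star = backward_warp(i3, flow_label)      # the fifth mapped video frame
    err = (w3_star - i_mid).abs().mean(dim=1)    # per-pixel mean L1 over channels
    return err < threshold                       # True where the label is accurate

def masked_loss_epe(flow_pred, flow_label, mask):
    epe = torch.sqrt(((flow_pred - flow_label) ** 2).sum(dim=1))
    return epe[mask].sum()                       # supervise only the valid positions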
As mentioned above, in some implementations the optical flow calculation modules in the first neural network are obtained by structural migration from the LiteFlownet network (i.e., in step S220, each optical flow calculation module corrects its input optical flow with the descriptor matching unit, sub-pixel refinement unit and regularization module migrated from LiteFlownet). For these implementations, when training the first neural network, the parameters obtained by pre-training the LiteFlownet network can be loaded directly as the initial values of its parameters and then fine-tuned. This kind of transfer learning not only speeds up the convergence of the first neural network but also helps improve its performance. The LiteFlownet network may be pre-trained on, but is not limited to, the FlyingChairs dataset.
In the solution shown in FIG. 8, the second intermediate video frame is produced by fusing the third mapped video frame and the fourth mapped video frame (possibly with further correction), but there are also solutions in which the second intermediate video frame is produced directly from the third mapped video frame or the fourth mapped video frame alone (possibly with further correction). The specific steps of these solutions are as follows:
Scheme C

Step C1: obtaining a training sample including the third video frame, the fourth video frame and the reference video frame;

Step C2: based on the third video frame and the fourth video frame, using the first neural network to calculate the optical flow from the second intermediate video frame to the third video frame;

Step C3: performing backward mapping on the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain the third mapped video frame;

Step C4: determining the second intermediate video frame according to the third mapped video frame;

Step C5: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.

If a neural network (whose structure may refer to the third neural network) is used in step C4 to correct the third mapped video frame, its parameters can be updated together with those of the first neural network in step C5. For other contents of steps C1 to C5, reference may be made to steps S210 to S250, which are not detailed again.
Scheme D

Step D1: obtaining a training sample including the third video frame, the fourth video frame and the reference video frame;

Step D2: based on the third video frame and the fourth video frame, using the first neural network to calculate the optical flow from the second intermediate video frame to the fourth video frame;

Step D3: performing backward mapping on the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain the fourth mapped video frame;

Step D4: determining the second intermediate video frame according to the fourth mapped video frame;

Step D5: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.

If a neural network (whose structure may refer to the third neural network) is used in step D4 to correct the fourth mapped video frame, its parameters can be updated together with those of the first neural network in step D5. For other contents of steps D1 to D5, reference may be made to steps S210 to S250, which are not detailed again.
It should be pointed out that, if a fifth neural network is provided to supply the optical flow labels, its calculation result should correspond to that of the first neural network. For example, if the first neural network calculates the optical flow from the second intermediate video frame to the third video frame (Scheme C), the fifth neural network should calculate the optical flow between the third video frame and the reference video frame; if the first neural network calculates the optical flow from the second intermediate video frame to the fourth video frame (Scheme D), the fifth neural network should calculate the optical flow between the fourth video frame and the reference video frame; and if the first neural network calculates both the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame (the scheme of FIG. 7), the fifth neural network should calculate the optical flow between the third video frame and the reference video frame as well as the optical flow between the fourth video frame and the reference video frame.
It should also be pointed out that, in the various possible implementations of the model training method, every use of backward mapping may be replaced with forward mapping, in which case the optical flow used for the mapping must be adjusted accordingly. For example, if Flow_mid→3 is used to backward-map the third video frame, then after the replacement Flow_3→mid (the optical flow from the third video frame to the second intermediate video frame) should be used to forward-map the third video frame, and the first neural network should be changed to output Flow_3→mid. Likewise, if Flow_mid→4 is used to backward-map the fourth video frame, then after the replacement Flow_4→mid (the optical flow from the fourth video frame to the second intermediate video frame) should be used to forward-map the fourth video frame, and the first neural network should be changed to output Flow_4→mid.
In addition, in some implementations of the model training method more than one step maps video frames; these steps should either all use backward mapping or all use forward mapping, i.e. the mapping type should be kept consistent throughout the training pipeline.
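To make the distinction between the two mapping types concrete, the following is a minimal, non-limiting sketch of backward mapping as a bilinear warp. The (N, C, H, W) tensor layout, the function name, and the use of `torch.nn.functional.grid_sample` are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Sample `frame` at positions displaced by `flow`.

    frame: (N, C, H, W) source video frame.
    flow:  (N, 2, H, W) optical flow from the target (intermediate) frame to `frame`.
    """
    n, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype, device=frame.device),
        torch.arange(w, dtype=frame.dtype, device=frame.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # displaced y coordinates
    # Normalize to [-1, 1], the range grid_sample expects, x before y.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(frame, grid, align_corners=True)
```

Forward mapping would instead scatter (splat) each source pixel to its flow-displaced target position. Because the two interpret the flow in opposite directions, mixing them within one pipeline would be inconsistent, which is why the mapping type must be kept uniform.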
FIG. 9 shows a functional block diagram of a video frame interpolation apparatus 300 provided by an embodiment of the present application. Referring to FIG. 9, the video frame interpolation apparatus 300 includes:
a first video frame acquisition unit 310, configured to acquire a first video frame and a second video frame;
a first optical flow calculation unit 320, configured to calculate, based on the first video frame and the second video frame and using a first neural network, the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, where the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame;
a first backward mapping unit 330, configured to perform backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or perform backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame; and
a first intermediate frame determination unit 340, configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
In one implementation of the video frame interpolation apparatus 300, the first neural network includes at least one optical flow calculation module connected in sequence, and the first optical flow calculation unit 320 calculating, based on the first video frame and the second video frame and using the first neural network, the optical flow from the first intermediate video frame to the first video frame includes: determining, according to the first video frame, a first image input to each optical flow calculation module, and determining, according to the second video frame, a second image input to each optical flow calculation module; and using each optical flow calculation module to backward-map the first image and the second image input to that module based on the optical flow input to that module, correct the optical flow input to that module based on the first mapped image and the second mapped image obtained by the mapping, and output the corrected optical flow; where the optical flow input to the first optical flow calculation module is a preset optical flow between the first video frame and the first intermediate video frame, the optical flow input to each other optical flow calculation module is the optical flow output by the preceding module, and the optical flow output by the last optical flow calculation module is the optical flow from the first intermediate video frame to the first video frame calculated by the first neural network.
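By way of a non-limiting illustration, the following sketch shows one possible cascade of such modules, assuming all inputs share a single resolution (the per-level resizing of a real multi-scale design is omitted). Here `backward_warp` is the routine sketched above, each `module` is a hypothetical residual flow predictor rather than a prescribed architecture, and warping the second input with the negated flow follows the midpoint assumption discussed further below.

```python
def cascaded_flow(first_images, second_images, modules, init_flow):
    """Iteratively refine the flow from the intermediate frame to the first frame.

    first_images/second_images: per-module inputs (see the variants below).
    init_flow: preset flow for the first module, e.g. all zeros.
    """
    flow = init_flow
    for img1, img2, module in zip(first_images, second_images, modules):
        warp1 = backward_warp(img1, flow)        # map the first input with the current flow
        warp2 = backward_warp(img2, -flow)       # negated flow for the second input (assumption)
        flow = flow + module(warp1, warp2, flow) # residual correction of the incoming flow
    return flow  # output of the last module: flow mid -> first frame
```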
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 determining, according to the first video frame, the first image input to each optical flow calculation module, and determining, according to the second video frame, the second image input to each optical flow calculation module includes: taking the first video frame as the first image input to each optical flow calculation module and the second video frame as the second image input to each optical flow calculation module; or taking images obtained by downsampling the first video frame as the first images input to the optical flow calculation modules and images obtained by downsampling the second video frame as the second images input to the optical flow calculation modules, where the two downsampled images input to the same optical flow calculation module have the same shape; or taking feature maps output after the first video frame is processed by convolutional layers as the first images input to the optical flow calculation modules and feature maps output after the second video frame is processed by convolutional layers as the second images input to the optical flow calculation modules, where the two feature maps input to the same optical flow calculation module have the same shape.
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 taking the images obtained by downsampling the first video frame as the first images input to the optical flow calculation modules, and the images obtained by downsampling the second video frame as the second images, includes: downsampling the first video frame and the second video frame respectively to form an image pyramid of the first video frame and an image pyramid of the second video frame, where each layer of the image pyramids, starting from the top layer, corresponds to one optical flow calculation module of the first neural network, starting from the first optical flow calculation module; and traversing the two image pyramids layer by layer from the top layer downward, taking the two downsampled images located in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
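A non-limiting sketch of the pyramid construction, assuming bilinear downsampling by a factor of 2 per level; the number of levels is an illustrative choice:

```python
import torch.nn.functional as F

def build_pyramid(frame, num_levels=3):
    """Return pyramid levels coarsest-first, one level per flow calculation module."""
    levels = [frame]
    for _ in range(num_levels - 1):
        levels.append(F.interpolate(levels[-1], scale_factor=0.5,
                                    mode="bilinear", align_corners=False))
    return levels[::-1]  # top (coarsest) layer first, matching module order
```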
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 taking the feature maps output after the first video frame is processed by convolutional layers as the first images, and the feature maps output after the second video frame is processed by convolutional layers as the second images, includes: performing feature extraction on the first video frame and the second video frame respectively using a first feature extraction network, to form a feature pyramid of the first video frame and a feature pyramid of the second video frame, where each layer of the feature pyramids, starting from the top layer, corresponds to one optical flow calculation module of the first neural network, starting from the first optical flow calculation module, and the first feature extraction network is a convolutional neural network; and traversing the two feature pyramids layer by layer from the top layer downward, taking the two feature maps located in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 correcting the optical flow input to the optical flow calculation module based on the first mapped image and the second mapped image obtained by the mapping, and outputting the corrected optical flow, includes: predicting an optical flow correction term using a second neural network based on the first mapped image, the second mapped image, and the optical flow input to the module; and correcting the optical flow input to the module using the optical flow correction term and outputting the corrected optical flow.
In another implementation of the video frame interpolation apparatus 300, the correction includes: based on the first mapped image and the second mapped image obtained by the mapping, correcting the optical flow input to the module using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network, and outputting the corrected optical flow.
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 calculating, based on the first video frame and the second video frame and using the first neural network, both the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame includes: calculating the optical flow from the first intermediate video frame to the first video frame using the first neural network, and calculating the optical flow from the first intermediate video frame to the second video frame from it; or calculating the optical flow from the first intermediate video frame to the second video frame using the first neural network, and calculating the optical flow from the first intermediate video frame to the first video frame from it.
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 calculating the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame includes: negating the optical flow from the first intermediate video frame to the first video frame and taking the result as the optical flow from the first intermediate video frame to the second video frame; and calculating the optical flow from the first intermediate video frame to the first video frame according to the optical flow from the first intermediate video frame to the second video frame includes: negating the optical flow from the first intermediate video frame to the second video frame and taking the result as the optical flow from the first intermediate video frame to the first video frame.
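The negation rule amounts to a single tensor operation; the one-line sketch below assumes the intermediate frame lies at the temporal midpoint, so motion toward the two inputs is approximately equal and opposite:

```python
# Midpoint assumption: locally linear motion makes the two flows opposite.
flow_mid_to_2 = -flow_mid_to_1   # flow_mid_to_1: (N, 2, H, W) tensor from the first network
```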
In one implementation of the video frame interpolation apparatus 300, the first intermediate frame determination unit 340 determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame includes: correcting the first mapped video frame based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame; or correcting the second mapped video frame based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame; or correcting a first fused video frame, formed by fusing the first mapped video frame and the second mapped video frame, based on the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, to obtain the first intermediate video frame.
In one implementation of the video frame interpolation apparatus 300, the first intermediate frame determination unit 340 correcting the first fused video frame, formed by fusing the first mapped video frame and the second mapped video frame, based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame includes: predicting a first image correction term and a first fusion mask using a third neural network based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame; fusing the first mapped video frame and the second mapped video frame into the first fused video frame as indicated by the pixel values in the first fusion mask; and correcting the first fused video frame using the first image correction term to obtain the first intermediate video frame.
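By way of a non-limiting illustration, the fusion and correction may be sketched as follows; `third_net` is a placeholder for the third neural network, and the sigmoid on the mask and the final clamp are assumptions rather than prescribed details:

```python
import torch

def fuse_and_correct(mapped1, mapped2, flow_mid_to_1, third_net):
    # third_net is assumed to return an image correction term and a
    # single-channel mask logit from its three inputs.
    correction, mask_logit = third_net(mapped1, mapped2, flow_mid_to_1)
    mask = torch.sigmoid(mask_logit)                   # per-pixel fusion weights in [0, 1]
    fused = mask * mapped1 + (1.0 - mask) * mapped2    # first fused video frame
    return (fused + correction).clamp(0.0, 1.0)        # corrected intermediate frame
```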
In one implementation of the video frame interpolation apparatus 300, the third neural network includes a second feature extraction network and an encoder-decoder network, the encoder-decoder network including an encoder and a decoder, and the first intermediate frame determination unit 340 predicting the first image correction term and the first fusion mask using the third neural network based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame includes: performing feature extraction on the first video frame and the second video frame respectively using the second feature extraction network; backward-mapping the feature maps extracted by the second feature extraction network using the optical flow from the first intermediate video frame to the first video frame; inputting the mapped feature maps obtained by the mapping, the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame into the encoder for feature extraction; and predicting the first image correction term and the first fusion mask using the decoder according to the features extracted by the encoder.
In one implementation of the video frame interpolation apparatus 300, the first intermediate frame determination unit 340 determining the first intermediate video frame according to the first mapped video frame and the second mapped video frame includes: predicting a second fusion mask using a fourth neural network based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame; and fusing the first mapped video frame and the second mapped video frame into the first intermediate video frame as indicated by the pixel values in the second fusion mask.
The implementation principle and technical effects of the video frame interpolation apparatus 300 provided by the embodiments of the present application have been described in the foregoing method embodiments; for brevity, where the apparatus embodiments are silent, reference may be made to the corresponding content of the method embodiments.
FIG. 10 shows a functional block diagram of a model training apparatus 400 provided by an embodiment of the present application. Referring to FIG. 10, the model training apparatus 400 includes:
a second video frame acquisition unit 410, configured to acquire training samples, the training samples including a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
a second optical flow calculation unit 420, configured to calculate, based on the third video frame and the fourth video frame and using the first neural network, the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame;
a second backward mapping unit 430, configured to perform backward mapping on the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or perform backward mapping on the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame;
a second intermediate frame determination unit 440, configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and
a parameter update unit 450, configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
In one implementation of the model training apparatus 400, the parameter update unit 450 calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating the image gradient of the second intermediate video frame and the image gradient of the reference video frame respectively, and calculating a second loss according to the difference between the two image gradients; and calculating the prediction loss according to the first loss and the second loss.
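A non-limiting sketch of this combined loss, assuming an L1 frame-difference term, finite-difference image gradients, and an illustrative weight `lam`:

```python
import torch

def image_gradients(x):
    gx = x[..., :, 1:] - x[..., :, :-1]   # horizontal finite differences
    gy = x[..., 1:, :] - x[..., :-1, :]   # vertical finite differences
    return gx, gy

def prediction_loss(pred, ref, lam=1.0):
    first_loss = torch.abs(pred - ref).mean()           # frame-difference term
    pgx, pgy = image_gradients(pred)
    rgx, rgy = image_gradients(ref)
    second_loss = (torch.abs(pgx - rgx).mean()
                   + torch.abs(pgy - rgy).mean())       # gradient-difference term
    return first_loss + lam * second_loss               # combined prediction loss
```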
In one implementation of the model training apparatus 400, the parameter update unit 450 calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating, using a pre-trained fifth neural network, the optical flow from the reference video frame to the third video frame and/or the optical flow from the reference video frame to the fourth video frame; calculating a third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the third loss.
In one implementation of the model training apparatus 400, the first neural network includes at least one optical flow calculation module connected in sequence, each optical flow calculation module outputting the optical flow from the second intermediate video frame to the third video frame as corrected by that module; and the parameter update unit 450 calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating, using a pre-trained fifth neural network, the optical flow from the reference video frame to the third video frame; calculating a fourth loss according to the difference between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the fourth loss.
In one implementation of the model training apparatus 400, the parameter update unit 450 calculating the third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network includes: performing backward mapping on the third video frame using the optical flow calculated by the fifth neural network to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate; and calculating the third loss according to the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the second effective optical flow vectors in the corresponding optical flow calculated by the first neural network, where a first effective optical flow vector is an optical flow vector accurately calculated by the fifth neural network, and a second effective optical flow vector is the optical flow vector, in the corresponding optical flow calculated by the first neural network, located at the pixel position corresponding to a first effective optical flow vector.
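A non-limiting sketch of this masked flow supervision, reusing the `backward_warp` routine sketched earlier; the error threshold is an illustrative assumption:

```python
import torch

def flow_supervision_loss(teacher_flow, student_flow, frame3, ref, thresh=0.02):
    # Warp the third frame with the teacher (fifth network) flow and keep
    # only pixels where the warp reproduces the reference frame well,
    # i.e. the "effective" optical flow vectors.
    warped = backward_warp(frame3, teacher_flow)
    err = torch.abs(warped - ref).mean(dim=1, keepdim=True)   # per-pixel warp error
    valid = (err < thresh).float()                             # validity mask
    diff = torch.abs(teacher_flow - student_flow).sum(dim=1, keepdim=True)
    return (valid * diff).sum() / valid.sum().clamp(min=1.0)   # masked mean difference
```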
In one implementation of the model training apparatus 400, the parameter update unit 450 calculating the fourth loss according to the difference between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network includes: performing backward mapping on the third video frame using the optical flow calculated by the fifth neural network to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate; and calculating the fourth loss according to the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the third effective optical flow vectors in the optical flow output by each optical flow calculation module, where a first effective optical flow vector is an optical flow vector accurately calculated by the fifth neural network, and a third effective optical flow vector is the optical flow vector, in the output of each optical flow calculation module, located at the pixel position corresponding to a first effective optical flow vector.
In one implementation of the model training apparatus 400, the first neural network includes at least one optical flow calculation module connected in sequence, each of which corrects the optical flow input to it using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network; the apparatus further includes a parameter initialization unit, configured to initialize the parameters of the first neural network with parameters obtained by pre-training the LiteFlownet network, before the second optical flow calculation unit 420 calculates, based on the third video frame and the fourth video frame and using the first neural network, the optical flow from the second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame.
In one implementation of the model training apparatus 400, the second intermediate frame determination unit 440 determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame includes: predicting a second image correction term and a third fusion mask using the third neural network based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame; fusing the third mapped video frame and the fourth mapped video frame into a second fused video frame as indicated by the pixel values in the third fusion mask; and correcting the second fused video frame using the second image correction term to obtain the second intermediate video frame. Accordingly, the parameter update unit 450 calculating the prediction loss according to the second intermediate video frame and the reference video frame and updating the parameters of the first neural network according to the prediction loss includes: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network and the third neural network according to the prediction loss.
In one implementation of the model training apparatus 400, the second intermediate frame determination unit 440 determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame includes: predicting a fourth fusion mask using the fourth neural network based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame; and fusing the third mapped video frame and the fourth mapped video frame into the second intermediate video frame as indicated by the pixel values in the fourth fusion mask. Accordingly, the parameter update unit 450 calculating the prediction loss and updating the parameters includes: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network and the fourth neural network according to the prediction loss.
The implementation principle and technical effects of the model training apparatus 400 provided by the embodiments of the present application have been described in the foregoing method embodiments; for brevity, where the apparatus embodiments are silent, reference may be made to the corresponding content of the method embodiments.
An embodiment of the present application further provides a video frame interpolation apparatus, including:
a third video frame acquisition unit, configured to acquire a first video frame and a second video frame;
a third optical flow calculation unit, configured to calculate, based on the first video frame and the second video frame and using the first neural network, the optical flow from the first video frame to a first intermediate video frame and/or the optical flow from the second video frame to the first intermediate video frame, where the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame;
a first forward mapping unit, configured to perform forward mapping on the first video frame using the optical flow from the first video frame to the first intermediate video frame to obtain a first mapped video frame, and/or perform forward mapping on the second video frame using the optical flow from the second video frame to the first intermediate video frame to obtain a second mapped video frame; and
a third intermediate frame determination unit, configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
The above video frame interpolation apparatus is similar to the video frame interpolation apparatus 300; the main difference is that forward mapping replaces the backward mapping of the video frame interpolation apparatus 300. For the various possible implementations of this apparatus, reference may also be made to the video frame interpolation apparatus 300, which are not repeated here.
An embodiment of the present application further provides a model training apparatus, including:
a fourth video frame acquisition unit, configured to acquire training samples, the training samples including a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
a fourth optical flow calculation unit, configured to calculate, based on the third video frame and the fourth video frame and using the first neural network, the optical flow from the third video frame to a second intermediate video frame and/or the optical flow from the fourth video frame to the second intermediate video frame, where the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame;
a second forward mapping unit, configured to perform forward mapping on the third video frame using the optical flow from the third video frame to the second intermediate video frame to obtain a third mapped video frame, and/or perform forward mapping on the fourth video frame using the optical flow from the fourth video frame to the second intermediate video frame to obtain a fourth mapped video frame;
a fourth intermediate frame determination unit, configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and
a second parameter update unit, configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
The above model training apparatus is similar to the model training apparatus 400; the main difference is that forward mapping replaces the backward mapping of the model training apparatus 400. For the various possible implementations of this apparatus, reference may also be made to the model training apparatus 400, which are not repeated here.
FIG. 11 shows a possible structure of an electronic device 500 provided by an embodiment of the present application. Referring to FIG. 11, the electronic device 500 includes a processor 510, a memory 520, and a communication interface 530, which are interconnected and communicate with one another through a communication bus 540 and/or other forms of connection mechanism (not shown).
The memory 520 includes one or more memories (only one is shown in the figure), which may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or the like. The processor 510 and possibly other components may access the memory 520 to read and/or write data therein.
The processor 510 includes one or more processors (only one is shown in the figure), which may be an integrated circuit chip with signal processing capability. The processor 510 may be a general-purpose processor, including a central processing unit (CPU), a micro controller unit (MCU), a network processor (NP), or another conventional processor; it may also be a special-purpose processor, including a graphics processing unit (GPU), a neural-network processing unit (NPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Moreover, when there are multiple processors 510, some of them may be general-purpose processors and the others special-purpose processors.
The communication interface 530 includes one or more interfaces (only one is shown in the figure), which may be used to communicate directly or indirectly with other devices for data exchange. The communication interface 530 may include interfaces for wired and/or wireless communication.
One or more computer program instructions may be stored in the memory 520, and the processor 510 may read and run these computer program instructions to implement the video frame interpolation method and/or the model training method provided by the embodiments of the present application.
It can be understood that the structure shown in FIG. 11 is merely illustrative; the electronic device 500 may include more or fewer components than those shown in FIG. 11, or have a configuration different from that shown in FIG. 11. The components shown in FIG. 11 may be implemented in hardware, software, or a combination thereof. The electronic device 500 may be a physical device, such as a PC, a laptop, a tablet, a mobile phone, a server, or an embedded device, or a virtual device, such as a virtual machine or a virtualized container. Moreover, the electronic device 500 is not limited to a single device and may also be a combination of multiple devices or a cluster composed of a large number of devices.
An embodiment of the present application further provides a computer-readable storage medium storing computer program instructions that, when read and run by a processor of a computer, execute the video frame interpolation method provided by the embodiments of the present application. For example, the computer-readable storage medium may be implemented as the memory 520 in the electronic device 500 of FIG. 11.
The foregoing descriptions are merely embodiments of the present application and are not intended to limit its scope of protection. Various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within its scope of protection.
Industrial Applicability
The present application provides a video frame interpolation method, a model training method, and corresponding apparatuses, which are applied to video processing and achieve a good frame interpolation effect while keeping the interpolation process fast.

Claims (20)

  1. A video frame interpolation method, characterized by comprising:
    acquiring a first video frame and a second video frame;
    calculating, based on the first video frame and the second video frame and using a first neural network, an optical flow from a first intermediate video frame to the first video frame and/or an optical flow from the first intermediate video frame to the second video frame, wherein the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame;
    performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame; and
    determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  2. The video frame interpolation method according to claim 1, characterized in that the first neural network comprises at least one optical flow calculation module connected in sequence, and calculating, based on the first video frame and the second video frame and using the first neural network, the optical flow from the first intermediate video frame to the first video frame comprises:
    determining, according to the first video frame, a first image input to each optical flow calculation module, and determining, according to the second video frame, a second image input to each optical flow calculation module; and
    using each optical flow calculation module to backward-map the first image and the second image input to that module based on the optical flow input to that module, correct the optical flow input to that module based on the first mapped image and the second mapped image obtained by the mapping, and output the corrected optical flow,
    wherein the optical flow input to the first optical flow calculation module is a preset optical flow between the first video frame and the first intermediate video frame, the optical flow input to each other optical flow calculation module is the optical flow output by the preceding optical flow calculation module, and the optical flow output by the last optical flow calculation module is the optical flow from the first intermediate video frame to the first video frame calculated by the first neural network.
  3. The video frame interpolation method according to claim 2, characterized in that determining, according to the first video frame, the first image input to each optical flow calculation module, and determining, according to the second video frame, the second image input to each optical flow calculation module comprises:
    taking the first video frame as the first image input to each optical flow calculation module, and taking the second video frame as the second image input to each optical flow calculation module; or
    taking images obtained by downsampling the first video frame as the first images input to the optical flow calculation modules, and taking images obtained by downsampling the second video frame as the second images input to the optical flow calculation modules, wherein the two downsampled images input to the same optical flow calculation module have the same shape; or
    taking feature maps output after the first video frame is processed by convolutional layers as the first images input to the optical flow calculation modules, and taking feature maps output after the second video frame is processed by convolutional layers as the second images input to the optical flow calculation modules, wherein the two feature maps input to the same optical flow calculation module have the same shape.
  4. The video frame interpolation method according to claim 3, characterized in that taking the images obtained by downsampling the first video frame as the first images input to the optical flow calculation modules, and taking the images obtained by downsampling the second video frame as the second images input to the optical flow calculation modules comprises:
    downsampling the first video frame and the second video frame respectively to form an image pyramid of the first video frame and an image pyramid of the second video frame, each layer of the image pyramids, starting from the top layer, corresponding to one optical flow calculation module of the first neural network, starting from the first optical flow calculation module; and
    traversing the two image pyramids layer by layer from the top layer downward, and taking the two downsampled images located in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
  5. The video frame interpolation method according to claim 3, characterized in that taking the feature maps output after the first video frame is processed by convolutional layers as the first images input to the optical flow calculation modules, and taking the feature maps output after the second video frame is processed by convolutional layers as the second images input to the optical flow calculation modules comprises:
    performing feature extraction on the first video frame and the second video frame respectively using a first feature extraction network to form a feature pyramid of the first video frame and a feature pyramid of the second video frame, each layer of the feature pyramids, starting from the top layer, corresponding to one optical flow calculation module of the first neural network, starting from the first optical flow calculation module, wherein the first feature extraction network is a convolutional neural network; and
    traversing the two feature pyramids layer by layer from the top layer downward, and taking the two feature maps located in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
  6. The video frame interpolation method according to any one of claims 2-5, characterized in that correcting the optical flow input to the optical flow calculation module based on the first mapped image and the second mapped image obtained by the mapping, and outputting the corrected optical flow comprises:
    based on the first mapped image, the second mapped image, and the optical flow input to the optical flow calculation module, correcting the optical flow input to the module using an optical flow correction term predicted by a second neural network, or using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network, and outputting the corrected optical flow.
  7. The video frame interpolation method according to any one of claims 1-6, characterized in that calculating, based on the first video frame and the second video frame and using the first neural network, the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame comprises:
    calculating the optical flow from the first intermediate video frame to the first video frame using the first neural network, and calculating the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame; or
    calculating the optical flow from the first intermediate video frame to the second video frame using the first neural network, and calculating the optical flow from the first intermediate video frame to the first video frame according to the optical flow from the first intermediate video frame to the second video frame.
  8. The video frame interpolation method according to claim 7, characterized in that calculating the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame comprises:
    negating the optical flow from the first intermediate video frame to the first video frame and taking the result as the optical flow from the first intermediate video frame to the second video frame; and
    calculating the optical flow from the first intermediate video frame to the first video frame according to the optical flow from the first intermediate video frame to the second video frame comprises:
    negating the optical flow from the first intermediate video frame to the second video frame and taking the result as the optical flow from the first intermediate video frame to the first video frame.
  9. The video frame interpolation method according to any one of claims 1-8, characterized in that determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame comprises:
    correcting the first mapped video frame based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame; or
    correcting the second mapped video frame based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame; or
    correcting a first fused video frame, formed by fusing the first mapped video frame and the second mapped video frame, based on the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, to obtain the first intermediate video frame.
10. The video frame interpolation method according to claim 9, wherein the correcting, based on the optical flow from the first intermediate video frame to the first video frame, the first fused video frame formed by fusing the first mapped video frame and the second mapped video frame to obtain the first intermediate video frame comprises:
    predicting, by a third neural network, a first image correction term and a first fusion mask based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame;
    fusing the first mapped video frame and the second mapped video frame into the first fused video frame as indicated by the pixel values of the first fusion mask;
    correcting the first fused video frame with the first image correction term to obtain the first intermediate video frame.
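A minimal sketch of the fusion-and-correction step of claim 10, assuming the first fusion mask takes values in [0, 1] (e.g. via a sigmoid) and the first image correction term is an additive residual; the function and tensor names are hypothetical.

    import torch

    def fuse_and_correct(mapped_first, mapped_second, mask, correction):
        # Pixels where the mask is near 1 are drawn from the first mapped video
        # frame, pixels where it is near 0 from the second mapped video frame.
        fused = mask * mapped_first + (1.0 - mask) * mapped_second
        # The image correction term then repairs occlusions and blending artifacts.
        return (fused + correction).clamp(0.0, 1.0)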
11. The video frame interpolation method according to claim 10, wherein the third neural network comprises a second feature extraction network and an encoder-decoder network, the encoder-decoder network comprising an encoder and a decoder, and wherein the predicting, by the third neural network, the first image correction term and the first fusion mask based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame comprises:
    extracting features from the first video frame and the second video frame respectively with the second feature extraction network;
    backward-mapping the feature maps extracted by the second feature extraction network using the optical flow from the first intermediate video frame to the first video frame;
    inputting the mapped feature maps, the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame into the encoder for feature extraction;
    predicting, by the decoder, the first image correction term and the first fusion mask from the features extracted by the encoder.
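Backward mapping, as used throughout these claims, is commonly realized with bilinear sampling. The sketch below uses torch.nn.functional.grid_sample and assumes the flow is expressed in pixel units and points from the intermediate frame towards the source frame; it is an illustrative implementation, not necessarily the patented one.

    import torch
    import torch.nn.functional as F

    def backward_warp(source, flow):
        # source: (N, C, H, W) image or feature map; flow: (N, 2, H, W) in pixels,
        # channel 0 holding the horizontal and channel 1 the vertical displacement.
        n, _, h, w = source.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).unsqueeze(0).to(source)  # (1, 2, H, W)
        coords = base + flow
        # grid_sample expects sampling positions normalized to [-1, 1].
        gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
        gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)  # (N, H, W, 2)
        return F.grid_sample(source, grid, mode="bilinear", align_corners=True)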
12. A model training method, comprising:
    acquiring a training sample, the training sample comprising a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
    calculating, by a first neural network based on the third video frame and the fourth video frame, an optical flow from a second intermediate video frame to the third video frame and/or an optical flow from the second intermediate video frame to the fourth video frame, the second intermediate video frame being a video frame to be inserted between the third video frame and the fourth video frame;
    backward-mapping the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or backward-mapping the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame;
    determining the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
    calculating a prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.
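A minimal training step with the structure of claim 12, assuming an L1 prediction loss and an Adam optimizer; flow_net and synthesize are hypothetical stand-ins for the first neural network and the frame-synthesis stage of claims 9 and 10, and backward_warp is the sketch given after claim 11.

    import torch

    def train_step(flow_net, synthesize, optimizer, third_frame, fourth_frame, reference):
        # First neural network: flows from the second intermediate video frame
        # to the third and fourth video frames.
        flow_t_to_third, flow_t_to_fourth = flow_net(third_frame, fourth_frame)
        # Backward-map both input frames to the intermediate time instant.
        mapped_third = backward_warp(third_frame, flow_t_to_third)
        mapped_fourth = backward_warp(fourth_frame, flow_t_to_fourth)
        # Determine the second intermediate video frame from the mapped frames.
        prediction = synthesize(mapped_third, mapped_fourth)
        # Prediction loss against the reference frame (here a plain L1 distance).
        loss = (prediction - reference).abs().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()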
13. The model training method according to claim 12, wherein the calculating the prediction loss according to the second intermediate video frame and the reference video frame comprises:
    calculating a first loss according to the difference between the second intermediate video frame and the reference video frame;
    calculating the image gradient of the second intermediate video frame and the image gradient of the reference video frame respectively, and calculating a second loss according to the difference between the two image gradients;
    calculating the prediction loss according to the first loss and the second loss.
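One plausible reading of claim 13's second loss is an L1 distance between finite-difference image gradients; the weight combining it with the first loss below is illustrative only, not taken from the patent.

    import torch

    def gradient_loss(prediction, reference):
        # Horizontal and vertical finite differences approximate the image gradient.
        def grads(img):
            return img[..., :, 1:] - img[..., :, :-1], img[..., 1:, :] - img[..., :-1, :]
        pgx, pgy = grads(prediction)
        rgx, rgy = grads(reference)
        return (pgx - rgx).abs().mean() + (pgy - rgy).abs().mean()

    # Hypothetical combination of the two losses:
    # prediction_loss = first_loss + 0.1 * gradient_loss(prediction, reference)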
14. The model training method according to claim 12, wherein the calculating the prediction loss according to the second intermediate video frame and the reference video frame comprises:
    calculating a first loss according to the difference between the second intermediate video frame and the reference video frame;
    calculating, by a pre-trained fifth neural network, an optical flow from the reference video frame to the third video frame and/or an optical flow from the reference video frame to the fourth video frame;
    calculating a third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network;
    calculating the prediction loss according to the first loss and the third loss.
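Claim 14's third loss can be read as knowledge distillation: the trainable first neural network is pulled towards the flow produced by the pre-trained fifth neural network. A sketch, assuming both networks emit flows of the same resolution and sign convention; the names are hypothetical.

    import torch

    def distillation_loss(student_flow, teacher_flow):
        # Detach the pre-trained (fifth) network's output so that gradients
        # update only the first neural network.
        return (student_flow - teacher_flow.detach()).abs().mean()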
15. A video frame interpolation method, comprising:
    acquiring a first video frame and a second video frame;
    calculating, by a first neural network based on the first video frame and the second video frame, an optical flow from the first video frame to a first intermediate video frame and/or an optical flow from the second video frame to the first intermediate video frame, the first intermediate video frame being a video frame to be inserted between the first video frame and the second video frame;
    forward-mapping the first video frame using the optical flow from the first video frame to the first intermediate video frame to obtain a first mapped video frame, and/or forward-mapping the second video frame using the optical flow from the second video frame to the first intermediate video frame to obtain a second mapped video frame;
    determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
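Forward mapping (claims 15 and 16) scatters source pixels to their flow-displaced positions instead of gathering from them. Below is a deliberately naive nearest-neighbor splatting sketch; practical systems usually use bilinear or softmax splatting to handle collisions and holes, and none of these names come from the patent.

    import torch

    def forward_warp_nearest(source, flow):
        # source: (N, C, H, W); flow: (N, 2, H, W) in pixels, from the source frame
        # towards the intermediate frame. Colliding pixels resolve last-write-wins;
        # unreached target pixels stay zero (holes).
        n, c, h, w = source.shape
        out = torch.zeros_like(source)
        ys, xs = torch.meshgrid(
            torch.arange(h, device=source.device),
            torch.arange(w, device=source.device),
            indexing="ij",
        )
        tx = (xs.unsqueeze(0) + flow[:, 0]).round().long().clamp(0, w - 1)
        ty = (ys.unsqueeze(0) + flow[:, 1]).round().long().clamp(0, h - 1)
        for b in range(n):
            out[b, :, ty[b].reshape(-1), tx[b].reshape(-1)] = source[b].reshape(c, -1)
        return out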
16. A model training method, comprising:
    acquiring a training sample, the training sample comprising a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
    calculating, by a first neural network based on the third video frame and the fourth video frame, an optical flow from the third video frame to a second intermediate video frame and/or an optical flow from the fourth video frame to the second intermediate video frame, the second intermediate video frame being a video frame to be inserted between the third video frame and the fourth video frame;
    forward-mapping the third video frame using the optical flow from the third video frame to the second intermediate video frame to obtain a third mapped video frame, and/or forward-mapping the fourth video frame using the optical flow from the fourth video frame to the second intermediate video frame to obtain a fourth mapped video frame;
    determining the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
    calculating a prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.
17. A video frame interpolation apparatus, comprising:
    a first video frame acquisition unit configured to acquire a first video frame and a second video frame;
    a first optical flow calculation unit configured to calculate, by a first neural network based on the first video frame and the second video frame, an optical flow from a first intermediate video frame to the first video frame and/or an optical flow from the first intermediate video frame to the second video frame, the first intermediate video frame being a video frame to be inserted between the first video frame and the second video frame;
    a first backward mapping unit configured to backward-map the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or backward-map the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame;
    a first intermediate frame determination unit configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
18. A model training apparatus, comprising:
    a second video frame acquisition unit configured to acquire a training sample, the training sample comprising a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
    a second optical flow calculation unit configured to calculate, by a first neural network based on the third video frame and the fourth video frame, an optical flow from a second intermediate video frame to the third video frame and/or an optical flow from the second intermediate video frame to the fourth video frame, the second intermediate video frame being a video frame to be inserted between the third video frame and the fourth video frame;
    a second backward mapping unit configured to backward-map the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or backward-map the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame;
    a second intermediate frame determination unit configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
    a first parameter update unit configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and to update the parameters of the first neural network according to the prediction loss.
19. A video frame interpolation apparatus, comprising:
    a third video frame acquisition unit configured to acquire a first video frame and a second video frame;
    a third optical flow calculation unit configured to calculate, by a first neural network based on the first video frame and the second video frame, an optical flow from the first video frame to a first intermediate video frame and/or an optical flow from the second video frame to the first intermediate video frame, the first intermediate video frame being a video frame to be inserted between the first video frame and the second video frame;
    a first forward mapping unit configured to forward-map the first video frame using the optical flow from the first video frame to the first intermediate video frame to obtain a first mapped video frame, and/or forward-map the second video frame using the optical flow from the second video frame to the first intermediate video frame to obtain a second mapped video frame;
    a third intermediate frame determination unit configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
20. A model training apparatus, comprising:
    a fourth video frame acquisition unit configured to acquire a training sample, the training sample comprising a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
    a fourth optical flow calculation unit configured to calculate, by a first neural network based on the third video frame and the fourth video frame, an optical flow from the third video frame to a second intermediate video frame and/or an optical flow from the fourth video frame to the second intermediate video frame, the second intermediate video frame being a video frame to be inserted between the third video frame and the fourth video frame;
    a second forward mapping unit configured to forward-map the third video frame using the optical flow from the third video frame to the second intermediate video frame to obtain a third mapped video frame, and/or forward-map the fourth video frame using the optical flow from the fourth video frame to the second intermediate video frame to obtain a fourth mapped video frame;
    a fourth intermediate frame determination unit configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
    a second parameter update unit configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and to update the parameters of the first neural network according to the prediction loss.
PCT/CN2021/085220 2020-08-13 2021-04-02 Video frame interpolation method, model training method, and corresponding device WO2022033048A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010815538.3A CN112104830B (en) 2020-08-13 2020-08-13 Video frame insertion method, model training method and corresponding device
CN202010815538.3 2020-08-13

Publications (1)

Publication Number Publication Date
WO2022033048A1 2022-02-17

Family

ID=73753716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/085220 WO2022033048A1 (en) 2020-08-13 2021-04-02 Video frame interpolation method, model training method, and corresponding device

Country Status (2)

Country Link
CN (1) CN112104830B (en)
WO (1) WO2022033048A1 (en)


Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11689693B2 (en) * 2020-04-30 2023-06-27 Boe Technology Group Co., Ltd. Video frame interpolation method and device, computer readable storage medium
CN112104830B (en) * 2020-08-13 2022-09-27 北京迈格威科技有限公司 Video frame insertion method, model training method and corresponding device
CN112954395B (en) * 2021-02-03 2022-05-17 南开大学 Video frame interpolation method and system capable of inserting any frame rate
CN113132664B (en) * 2021-04-19 2022-10-04 科大讯飞股份有限公司 Frame interpolation generation model construction method and video frame interpolation method
CN112995715B (en) * 2021-04-20 2021-09-03 腾讯科技(深圳)有限公司 Video frame insertion processing method and device, electronic equipment and storage medium
CN113298728B (en) * 2021-05-21 2023-01-24 中国科学院深圳先进技术研究院 Video optimization method and device, terminal equipment and storage medium
CN113542651B (en) * 2021-05-28 2023-10-27 爱芯元智半导体(宁波)有限公司 Model training method, video frame inserting method and corresponding devices
CN113469880A (en) * 2021-05-28 2021-10-01 北京迈格威科技有限公司 Image splicing method and device, storage medium and electronic equipment
CN113382247B (en) * 2021-06-09 2022-10-18 西安电子科技大学 Video compression sensing system and method based on interval observation, equipment and storage medium
CN113556582A (en) * 2021-07-30 2021-10-26 海宁奕斯伟集成电路设计有限公司 Video data processing method, device, equipment and storage medium
CN115706810A (en) * 2021-08-16 2023-02-17 北京字跳网络技术有限公司 Video frame adjusting method and device, electronic equipment and storage medium
CN113469930B (en) * 2021-09-06 2021-12-07 腾讯科技(深圳)有限公司 Image processing method and device and computer equipment
CN113837136B (en) * 2021-09-29 2022-12-23 深圳市慧鲤科技有限公司 Video frame insertion method and device, electronic equipment and storage medium
CN113935537A (en) * 2021-10-22 2022-01-14 北京华云星地通科技有限公司 Cloud image interpolation prediction method and system based on deep learning
CN114007135B (en) * 2021-10-29 2023-04-18 广州华多网络科技有限公司 Video frame insertion method and device, equipment, medium and product thereof
CN113891027B (en) * 2021-12-06 2022-03-15 深圳思谋信息科技有限公司 Video frame insertion model training method and device, computer equipment and storage medium
CN114339409B (en) * 2021-12-09 2023-06-20 腾讯科技(上海)有限公司 Video processing method, device, computer equipment and storage medium
CN114422852A (en) * 2021-12-16 2022-04-29 阿里巴巴(中国)有限公司 Video playing method, storage medium, processor and system
CN116684662A (en) * 2022-02-22 2023-09-01 北京字跳网络技术有限公司 Video processing method, device, equipment and medium
CN114640885B (en) * 2022-02-24 2023-12-22 影石创新科技股份有限公司 Video frame inserting method, training device and electronic equipment
CN114862688A (en) * 2022-03-14 2022-08-05 杭州群核信息技术有限公司 Video frame insertion method, device and system based on deep learning
CN115103147A (en) * 2022-06-24 2022-09-23 马上消费金融股份有限公司 Intermediate frame image generation method, model training method and device


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101123008B1 (en) * 2010-11-16 2012-03-16 알피니언메디칼시스템 주식회사 Method for imaging color flow images, ultrasound apparatus therefor
CN111405316A (en) * 2020-03-12 2020-07-10 北京奇艺世纪科技有限公司 Frame insertion method, electronic device and readable storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304755A (en) * 2017-03-08 2018-07-20 腾讯科技(深圳)有限公司 The training method and device of neural network model for image procossing
US20190138889A1 (en) * 2017-11-06 2019-05-09 Nvidia Corporation Multi-frame video interpolation using optical flow
CN109068174A (en) * 2018-09-12 2018-12-21 上海交通大学 Video frame rate upconversion method and system based on cyclic convolution neural network
CN109379550A (en) * 2018-09-12 2019-02-22 上海交通大学 Video frame rate upconversion method and system based on convolutional neural networks
CN109922231A (en) * 2019-02-01 2019-06-21 重庆爱奇艺智能科技有限公司 A kind of method and apparatus for generating the interleave image of video
CN109905624A (en) * 2019-03-01 2019-06-18 北京大学深圳研究生院 A kind of video frame interpolation method, device and equipment
CN110798630A (en) * 2019-10-30 2020-02-14 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN112104830A (en) * 2020-08-13 2020-12-18 北京迈格威科技有限公司 Video frame insertion method, model training method and corresponding device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115883869A (en) * 2022-11-28 2023-03-31 江汉大学 Swin Transformer-based video frame interpolation model processing method, device and equipment
CN115883869B (en) * 2022-11-28 2024-04-19 江汉大学 Processing method, device and processing equipment for a video frame interpolation model based on the Swin Transformer
CN117241065B (en) * 2023-11-14 2024-03-08 腾讯科技(深圳)有限公司 Video frame interpolation image generation method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112104830A (en) 2020-12-18
CN112104830B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
WO2022033048A1 (en) Video frame interpolation method, model training method, and corresponding device
CN110324664B (en) Video frame supplementing method based on neural network and training method of model thereof
JP7177062B2 (en) Depth Prediction from Image Data Using Statistical Model
WO2021073493A1 (en) Image processing method and device, neural network training method, image processing method of combined neural network model, construction method of combined neural network model, neural network processor and storage medium
CN113542651B (en) Model training method, video frame inserting method and corresponding devices
EP3613018A1 (en) Visual style transfer of images
CN110782490A (en) Video depth map estimation method and device with space-time consistency
CN113034380A (en) Video space-time super-resolution method and device based on improved deformable convolution correction
CN113066017B (en) Image enhancement method, model training method and equipment
CN112561978B (en) Training method of depth estimation network, depth estimation method of image and equipment
WO2023103576A1 (en) Video processing method and apparatus, and computer device and storage medium
WO2023193474A1 (en) Information processing method and apparatus, computer device, and storage medium
CN115578515B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device
WO2023160426A1 (en) Video frame interpolation method and apparatus, training method and apparatus, and electronic device
CN115002379B (en) Video frame inserting method, training device, electronic equipment and storage medium
CN114073071A (en) Video frame insertion method and device and computer readable storage medium
CN113538525B (en) Optical flow estimation method, model training method and corresponding devices
CN116071412A (en) Unsupervised monocular depth estimation method integrating full-scale and adjacent frame characteristic information
CN111968208A (en) Human body animation synthesis method based on human body soft tissue grid model
WO2023193491A1 (en) Information processing method and apparatus, and computer device and storage medium
CN114119678A (en) Optical flow estimation method, computer program product, storage medium, and electronic device
CN113469880A (en) Image splicing method and device, storage medium and electronic equipment
Zhu et al. Fused network for view synthesis
CN116012230B (en) Space-time video super-resolution method, device, equipment and storage medium
CN116246026B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21855104

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13-07-2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21855104

Country of ref document: EP

Kind code of ref document: A1